Databricks Cluster Management via PowerShell

We have released a big update to the CI/CD Tools on GitHub today. These updates are for cluster management within Databricks: they allow you to create or update clusters, and to start, stop, delete and resize them. There are also some new helper functions to get a list of the available Spark versions and the VM node types available to you. The full set of new commands is:

- Get-DatabricksClusters - returns a list of all clusters in your workspace
- New-DatabricksCluster - creates or updates a cluster
- Start-DatabricksCluster
- Stop-DatabricksCluster
- Update-DatabricksClusterResize - modifies the number of workers
- Remove-DatabricksCluster - deletes your cluster
- Get-DatabricksNodeTypes - returns a list of valid node types (such as DS3v2)
- Get-DatabricksSparkVersions - returns a list of valid Spark versions

These will hopefully be added to the VSTS/Azure DevOps tasks in the near future. [Read More]
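As a rough sketch of how these commands might fit together - note the parameter names below are illustrative assumptions, not taken from the module's documentation, so check the GitHub repo for the real signatures:

```powershell
# Placeholder connection details - adjust for your workspace
$BearerToken = "dapi..."        # personal access token (assumed parameter name)
$Region      = "westeurope"

# See what's available before creating a cluster
Get-DatabricksSparkVersions -BearerToken $BearerToken -Region $Region
Get-DatabricksNodeTypes     -BearerToken $BearerToken -Region $Region

# Create (or update) a small autoscaling cluster
New-DatabricksCluster -BearerToken $BearerToken -Region $Region `
    -ClusterName "demo-cluster" -SparkVersion "4.0.x-scala2.11" `
    -NodeType "Standard_DS3_v2" -MinNumberOfWorkers 1 -MaxNumberOfWorkers 3

# Later: resize, stop, or remove it
Update-DatabricksClusterResize -BearerToken $BearerToken -Region $Region `
    -ClusterName "demo-cluster" -NumberOfWorkers 2
Stop-DatabricksCluster   -BearerToken $BearerToken -Region $Region -ClusterName "demo-cluster"
Remove-DatabricksCluster -BearerToken $BearerToken -Region $Region -ClusterName "demo-cluster"
```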

Executing SQL Server Stored Procedures from Databricks (PySpark)

Databricks provides some nice connectors for reading and writing data to SQL Server. These are generally what you need, as they operate in a distributed fashion and support predicate pushdown and so on. But sometimes you want to execute a stored procedure or a simple statement. I must stress this is not recommended - more on that at the end of this blog. I’m going to assume that as you made it here you really want to do this. [Read More]
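One common pattern for this is to drop down to the JVM behind the Spark session and use java.sql.DriverManager directly. A minimal sketch - the helper names and the stored procedure are my own invented examples, and note this runs on the driver only, not distributed:

```python
def jdbc_url(server, database):
    """Build a SQL Server JDBC connection URL (illustrative helper)."""
    return f"jdbc:sqlserver://{server}:1433;database={database}"

def exec_stored_proc(spark, server, database, user, password, sql):
    """Execute a single statement (e.g. an EXEC call) over JDBC via the
    JVM backing the Spark session. Runs on the driver only."""
    driver_manager = spark._sc._gateway.jvm.java.sql.DriverManager
    conn = driver_manager.getConnection(jdbc_url(server, database), user, password)
    try:
        stmt = conn.createStatement()
        stmt.execute(sql)  # e.g. "EXEC dbo.UpdateStats" - hypothetical proc
        stmt.close()
    finally:
        conn.close()
```

In a notebook you would then call something like exec_stored_proc(spark, "myserver.database.windows.net", "mydb", user, pwd, "EXEC dbo.UpdateStats").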

Controlling the Databricks Resource Group Name

When you create a Databricks workspace using the Azure portal you obviously specify the Resource Group to create it in. But in the background a second resource group is created: this is known as the managed resource group, and it is given an almost random name. This is a pain if you have naming conventions or standards to adhere to. The managed resource group is used for the networking of your clusters and for providing the DBFS storage account. [Read More]
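One way to take control of that name is to deploy the workspace from an ARM template and set the managedResourceGroupId property yourself. A minimal sketch, with placeholder names and an apiVersion you should verify against the current template reference:

```json
{
  "type": "Microsoft.Databricks/workspaces",
  "apiVersion": "2018-04-01",
  "name": "my-databricks-workspace",
  "location": "[resourceGroup().location]",
  "sku": { "name": "standard" },
  "properties": {
    "managedResourceGroupId": "[subscriptionResourceId('Microsoft.Resources/resourceGroups', 'my-databricks-managed-rg')]"
  }
}
```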

Unpivot Data in PySpark

Problem: I recently encountered a file similar to this: The data required “unpivoting” so that the measures became just three columns for Volume, Retail & Actual - and then we add 3 rows for each row as Years 16, 17 & 18. There are various ways of doing this in Spark; using the stack function is an interesting one. But I find this complex and hard to read. First, let’s set up our environment and create a function to extract our sample data: [Read More]
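The reshape itself is easy to see outside Spark. A plain-Python sketch of the wide-to-long transformation - the column names here are invented for illustration, not taken from the post's sample file:

```python
def unpivot(rows, id_cols, year_measures):
    """Turn one wide row into one long row per year, with the three
    measures (Volume, Retail, Actual) as columns. `year_measures` maps
    a year label to the wide column names holding that year's measures."""
    out = []
    for row in rows:
        for year, (vol_col, ret_col, act_col) in year_measures.items():
            rec = {c: row[c] for c in id_cols}   # carry the id columns over
            rec.update(Year=year, Volume=row[vol_col],
                       Retail=row[ret_col], Actual=row[act_col])
            out.append(rec)
    return out

wide = [{"Product": "A",
         "Vol16": 10, "Ret16": 1.0, "Act16": 0.9,
         "Vol17": 12, "Ret17": 1.1, "Act17": 1.0,
         "Vol18": 14, "Ret18": 1.2, "Act18": 1.1}]
long_rows = unpivot(wide, ["Product"],
                    {"16": ("Vol16", "Ret16", "Act16"),
                     "17": ("Vol17", "Ret17", "Act17"),
                     "18": ("Vol18", "Ret18", "Act18")})
# one wide row becomes three long rows, one per year
```

The stack expression in Spark SQL does the same thing in one line, at the cost of readability - which is the trade-off discussed in the post.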

Databricks CI/CD Tools

A while back now I started to create some PowerShell modules for assisting with DevOps CI and CD scenarios. These can be found on GitHub here. Why? Firstly, the one thing I don’t like about Databricks is the CI/CD support. I think it is very lacking in support for data engineers and too focused on data science. Don’t get me wrong, Databricks is great - but it is also relatively young as a product and still ironing out the user experience part. [Read More]