Databricks

Databricks Cluster Management via PowerShell

Posted on November 2, 2018 | 1 minute | 162 words | Simon D'Morias

We have released a big update to the CI/CD Tools on GitHub today: https://github.com/DataThirstLtd/azure.databricks.cicd.tools These updates are for cluster management within Databricks. They allow you to create or update clusters, and to stop, start, delete and resize them. There are also some new helper functions to get a list of the Spark versions and VM types available to you. The full set of new commands is:
Get-DatabricksClusters - Returns a list of all clusters in your workspace
New-DatabricksCluster - Creates/updates a cluster
Start-DatabricksCluster
Stop-DatabricksCluster
Update-DatabricksClusterResize - Modifies the number of scale workers
Remove-DatabricksCluster - Deletes your cluster
Get-DatabricksNodeTypes - Returns a list of valid node types (such as DS3v2 etc)
Get-DatabricksSparkVersions - Returns a list of valid versions
These will hopefully be added to the VSTS/Azure DevOps tasks in the near future. [Read More]
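For comparison, a cluster resize can also be expressed directly against the Databricks REST API 2.0 `clusters/resize` endpoint. The sketch below is in Python rather than PowerShell and only builds the request without sending it. It is not a description of how the module is implemented, and the workspace URL, token and cluster ID are placeholder assumptions.

```python
import json
import urllib.request


def build_resize_request(workspace_url, token, cluster_id, num_workers):
    """Build (but do not send) a POST to the clusters/resize endpoint."""
    body = json.dumps({"cluster_id": cluster_id, "num_workers": num_workers})
    return urllib.request.Request(
        url=f"{workspace_url.rstrip('/')}/api/2.0/clusters/resize",
        data=body.encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder values (assumptions); nothing is sent until urlopen() is called.
request = build_resize_request(
    "https://westeurope.azuredatabricks.net",
    "dapi-PLACEHOLDER",
    "0000-000000-example1",
    4,
)
print(request.get_method(), request.full_url)
# To actually send it: urllib.request.urlopen(request)
```

Keeping request construction separate from sending makes the payload easy to inspect before it touches a real workspace.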

Databricks PowerShell

Executing SQL Server Stored Procedures from Databricks (PySpark)

Posted on October 12, 2018 | 4 minutes | 739 words | Simon D'Morias

Databricks provides some nice connectors for reading and writing data to SQL Server. These are generally what you need, as they act in a distributed fashion and support predicate pushdown and so on. But sometimes you want to execute a stored procedure or a simple statement. I must stress this is not recommended - more on that at the end of this blog. I’m going to assume that as you made it here you really want to do this. [Read More]

Databricks SQL Server PySpark

Controlling the Databricks Resource Group Name

Posted on September 26, 2018 | 2 minutes | 242 words | Simon D'Morias

When you create a Databricks workspace using the Azure portal you obviously specify the Resource Group to create it in. But in the background a second resource group is created. This is known as the managed resource group, and it is created with an almost random name. This is a pain if you have naming conventions or standards to adhere to. The managed resource group is used for networking of your clusters and for providing the DBFS storage account. [Read More]

PowerShell Databricks REST API Azure

Unpivot Data in PySpark

Posted on September 13, 2018 | 2 minutes | 289 words | Simon D'Morias

Problem: I recently encountered a file similar to this: The data required “unpivoting” so that the measures became just three columns for Volume, Retail & Actual, with each original row becoming three rows, one for each of Years 16, 17 & 18. There are various ways of doing this in Spark; using Stack is an interesting one, but I find it complex and hard to read. First let's set up our environment and create a function to extract our sample data: [Read More]
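The reshaping described above can be sketched in plain Python, without Spark, to show what the target shape looks like. This is a minimal illustration and not the approach from the post: the `Product` column, the `Volume16`-style column naming and the sample values are all assumptions, since the original file is not shown here.

```python
def unpivot(rows, id_cols, measures, years):
    """Turn one wide row into one long row per year.

    Assumes wide columns are named <measure><year>, e.g. "Volume16".
    """
    long_rows = []
    for row in rows:
        for year in years:
            # Carry the identifying columns over unchanged.
            record = {col: row[col] for col in id_cols}
            record["Year"] = year
            # Pull each measure for this year into its own column.
            for measure in measures:
                record[measure] = row[f"{measure}{year}"]
            long_rows.append(record)
    return long_rows


# Hypothetical sample data: one product with three measures over three years.
wide = [{
    "Product": "Widget",
    "Volume16": 10, "Retail16": 100, "Actual16": 90,
    "Volume17": 11, "Retail17": 110, "Actual17": 95,
    "Volume18": 12, "Retail18": 120, "Actual18": 99,
}]

long_rows = unpivot(wide, ["Product"], ["Volume", "Retail", "Actual"], [16, 17, 18])
for record in long_rows:
    print(record)
```

Each input row yields three output rows, one per year, each with Volume, Retail and Actual columns.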

PySpark Python Databricks

Databricks CI/CD Tools

Posted on September 13, 2018 | 2 minutes | 378 words | Simon D'Morias

A while back I started creating some PowerShell modules to assist with DevOps CI and CD scenarios. These can be found on GitHub here. Why? Firstly, the one thing I don’t like about Databricks is the CI/CD support. I think it is very lacking in support for data engineers and too focused on data science. Don’t get me wrong, Databricks is great - but it is also relatively young as a product and still ironing out the user experience part. [Read More]

Databricks DevOps PowerShell