Setup Databricks-Connect on Windows 10

UPDATE June 2020 - How about using a container instead? It’s much easier than installing all this stuff: Prebuilt container. Having recently tried to get DBConnect working on a Windows 10 machine, I’ve realised things are not as easy as you might think. These are the steps I have found to set up a new machine and get Databricks-Connect working. Install Java: download and install Java SE Runtime Version 8. It’s very important that you get version 8 and nothing later. [Read More]
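As a quick sanity check once everything is installed, a minimal sketch like the following should run the count on the remote cluster (this assumes databricks-connect has already been pip-installed and configured against your workspace):

    # Minimal smoke test for a configured databricks-connect install.
    # Assumes `pip install databricks-connect` and `databricks-connect configure`
    # have already been run against your workspace and cluster.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()  # resolves to the remote Databricks cluster
    print(spark.range(100).count())  # prints 100 if the connection is working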

Databricks-Connect - FINALLY!

I’ve been waiting for this feature for what feels like forever. Databricks-Connect is here! You can download it here. It allows you to develop using an IDE like VSCode, PyCharm, IntelliJ etc. and connect to a remote Databricks cluster to execute your code. This means we can bring much better development experiences and best practices to data engineering workloads. Notebooks are great for exploring data, but they are not enterprise-grade code for ETL jobs. [Read More]

Databricks-Connect Limitations

Databricks-Connect is the feature I’ve been waiting for. It is a complete game changer for developing data pipelines - previously you could develop locally using Spark, but that meant you couldn’t get all the nice Databricks runtime features, like Delta, DBUtils etc. Databricks-Connect allows teams to start developing in a more enterprise fashion than notebooks allow. I’m a big fan. There are, however, a few limitations on what you can do - I’m sure more features will be added in the future, so this list will probably change. [Read More]

Databricks Key Vault backed Secret Scopes

A few weeks ago Databricks added the ability to have Azure Key Vault backed Secret Scopes. These are still in preview. Today the PowerShell tools for Databricks we maintain have been updated to support these! Technically the API is not documented, and as the service is in preview it may stop working in the future. But we will update the module if Databricks change the API. Example:

    Import-Module azure.databricks.cicd.Tools
    $BearerToken = "dapi1234567890"
    $Region = "westeurope"
    $ResID = "/subscriptions/{mysubscriptionid}/resourceGroups/{myResourceGroup}/providers/Microsoft.

[Read More]

PowerShell for Azure Databricks

Last year we released a PowerShell module called azure.databricks.cicd.tools on GitHub and PowerShell Gallery. What we never did is publish anything about what it can do. The original purpose was to help with CI/CD scenarios, so that you could create idempotent releases in Azure DevOps, Jenkins etc. But now it has almost full parity with the options available in the REST API. Databricks do offer a supported CLI (which requires Python installed) and a REST API - which is quite complex to use directly, but is what this PowerShell module uses. [Read More]
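For a sense of what the module wraps, here is a minimal sketch of one of those REST calls made directly in Python with requests - the token and region below are placeholders:

    # Listing clusters via the Databricks REST API - the kind of call the
    # PowerShell module makes for you. Token and region are placeholders.
    import requests

    token = "dapi1234567890"   # placeholder personal access token
    region = "westeurope"      # placeholder Azure region

    resp = requests.get(
        "https://{}.azuredatabricks.net/api/2.0/clusters/list".format(region),
        headers={"Authorization": "Bearer " + token},
    )
    resp.raise_for_status()
    print(resp.json())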

Databricks & Snowflake Python Errors

I’ve recently been playing with writing data to Snowflake from Databricks. Reading and writing are pretty simple as per the instructions from Databricks. But if you want to execute SnowSQL commands using the snowflake-connector-python package and Python 3, you will be greeted with this error when you try to import the module (despite attaching the library without error): cffi library '_openssl' has no function, constant or global variable named 'Cryptography_HAS_SET_ECDH_AUTO' [Read More]
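For context, a minimal sketch of the kind of code that triggers this (connection details are placeholders) - on an affected cluster the failure happens on the import line itself:

    # Sketch of executing a SnowSQL command with snowflake-connector-python.
    # On an affected cluster the import below raises the cffi/_openssl error.
    import snowflake.connector

    con = snowflake.connector.connect(
        user="myuser",          # placeholder credentials
        password="mypassword",
        account="myaccount",
    )
    try:
        con.cursor().execute("CREATE TABLE IF NOT EXISTS demo (id INT)")
    finally:
        con.close()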

PySpark Applications for Databricks

UPDATE April 2019: If you are interested in creating a PySpark application for Databricks you should consider using Databricks-Connect. More details here. Whilst notebooks are great, there comes a time and place when you just want to use Python and PySpark in its pure form. Databricks has the ability to execute Python jobs for when notebooks don’t feel very enterprise data pipeline ready - %run and widgets just look like schoolboy hacks. [Read More]
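For illustration, a minimal sketch of what such a standalone PySpark application might look like - paths and names are hypothetical:

    # Sketch of a standalone PySpark application suitable for running as a
    # Databricks Python job. Paths and names are illustrative only.
    from pyspark.sql import SparkSession

    def main():
        # On Databricks this attaches to the cluster's existing Spark session.
        spark = SparkSession.builder.appName("my-etl-job").getOrCreate()
        df = spark.read.parquet("/mnt/raw/sales")  # hypothetical input path
        (df.groupBy("region").count()
           .write.mode("overwrite")
           .parquet("/mnt/curated/sales_by_region"))  # hypothetical output path

    if __name__ == "__main__":
        main()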

Databricks Cluster Management via PowerShell

We have released a big update to the CI/CD Tools on GitHub today: https://github.com/DataThirstLtd/azure.databricks.cicd.tools These updates are for cluster management within Databricks. They allow you to create or update clusters, stop/start/delete them, and resize them. There are also some new helper functions to get a list of the available Spark versions and the types of VMs available to you. The full set of new commands is:

Get-DatabricksClusters - returns a list of all clusters in your workspace
New-DatabricksCluster - creates/updates a cluster
Start-DatabricksCluster
Stop-DatabricksCluster
Update-DatabricksClusterResize - modifies the number of workers
Remove-DatabricksCluster - deletes your cluster
Get-DatabricksNodeTypes - returns a list of valid node types (such as DS3v2 etc)
Get-DatabricksSparkVersions - returns a list of valid versions

These will hopefully be added to the VSTS/Azure DevOps tasks in the near future. [Read More]

Executing SQL Server Stored Procedures from Databricks (PySpark)

Databricks provides some nice connectors for reading and writing data to SQL Server. These are generally what you need, as they act in a distributed fashion and support push-down predicates etc. But sometimes you want to execute a stored procedure or a simple statement. I must stress this is not recommended - more on that at the end of this blog. I’m going to assume that as you made it here you really want to do this. [Read More]
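One approach, sketched below with placeholder connection details, is to drop down to the JVM’s java.sql.DriverManager via py4j, since the SQL Server JDBC driver is already present on a Databricks cluster - note this runs on the driver node only and is not distributed:

    # Sketch: execute a stored procedure over JDBC from PySpark via py4j.
    # Assumes Databricks's predefined `spark` session; the connection string
    # and procedure name are placeholders. Driver-node only - not distributed.
    jdbc_url = ("jdbc:sqlserver://myserver.database.windows.net:1433;"
                "database=mydb;user=myuser;password=mypassword")

    driver_manager = spark.sparkContext._gateway.jvm.java.sql.DriverManager
    con = driver_manager.getConnection(jdbc_url)
    try:
        stmt = con.prepareCall("EXEC dbo.MyStoredProc")  # placeholder procedure
        stmt.execute()
        stmt.close()
    finally:
        con.close()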

Controlling the Databricks Resource Group Name

When you create a Databricks workspace using the Azure portal you obviously specify the Resource Group to create it in. But in the background a second resource group is created - this is known as the managed resource group - and it is given an almost random name. This is a pain if you have naming conventions or standards to adhere to. The managed resource group is used for the networking of your clusters and for providing the DBFS storage account. [Read More]
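If you deploy the workspace with an ARM template rather than the portal, the managed resource group can be named explicitly via the managedResourceGroupId property - a minimal sketch, with placeholder names:

    {
      "type": "Microsoft.Databricks/workspaces",
      "apiVersion": "2018-04-01",
      "name": "my-databricks-workspace",
      "location": "westeurope",
      "sku": { "name": "standard" },
      "properties": {
        "managedResourceGroupId": "[concat(subscription().id, '/resourceGroups/my-managed-rg')]"
      }
    }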