Series - Developing a PySpark Application

This is a series of blog posts demonstrating how PySpark applications can be developed, specifically with Databricks in mind, though the general principles apply to any Apache Spark setup (not just Databricks). Python has many complexities with regard to paths, packaging and deploying versions. These blog posts capture what we have learnt working with clients to build robust, reliable development processes. The goal is a local development environment that executes against Databricks via Databricks-Connect, supported by a solid CI and testing framework built on Azure DevOps pipelines. [Read More]
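As a taste of where the series heads, here is a minimal sketch of a testable PySpark entry point; the layout and function names are illustrative, not the series' actual code:

```python
# Illustrative sketch only - names are hypothetical.
from pyspark.sql import DataFrame, SparkSession


def transform(df: DataFrame) -> DataFrame:
    # Keep business logic in plain functions so it can be unit tested
    # without a live cluster.
    return df.filter(df.value > 0)


def main() -> None:
    # With Databricks-Connect configured, getOrCreate() returns a
    # session backed by the remote Databricks cluster.
    spark = SparkSession.builder.getOrCreate()
    df = spark.range(10).withColumnRenamed("id", "value")
    transform(df).show()


if __name__ == "__main__":
    main()
```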

Databricks PowerShell Tools Update 1.1.21

A new release of azure.databricks.cicd.tools has gone out today. Changes include:

- Support for Cluster Log Path on New-DatabricksCluster, Add-DatabricksJarJob, Add-DatabricksNotebookJob and Add-DatabricksPythonJob
- Support for Instance Pool ID on the above as well (support for creating new instance pools will come soon)
- New command: Restart-DatabricksCluster
- New command: Remove-DatabricksLibrary
- Added -RunImmediate to Add-DatabricksNotebookJob
- Fixed a case-sensitivity issue preventing all commands from importing on Linux

All the docs have been updated in the wiki as well. [Read More]

Thoughts on Spark Summit Announcements

Yesterday saw the Spark Summit keynote take place in San Francisco. There were quite a few announcements, some expected, some not so much. This is my summary and view of them, in order of importance as I see things today. Kubernetes: no real fanfare was made about this - and judging by the comments on Twitter, people do not seem too excited by it. But for me this is massive - Spark 3. [Read More]

Setup Databricks-Connect on Windows 10

UPDATE June 2020 - How about using a container instead? It’s much easier than installing all this stuff: Prebuilt container. Having recently tried to get DBConnect working on a Windows 10 machine, I’ve realised things are not as easy as you might think. These are the steps I have found to set up a new machine and get Databricks-Connect working. Install Java: download and install Java SE Runtime Version 8. It’s very important you get only version 8, nothing later. [Read More]
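Once the install and databricks-connect configure steps are done, a quick smoke test from Python confirms the wiring; this assumes your cluster is running:

```python
# Smoke test for a configured Databricks-Connect install.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# If setup is correct, this executes on the remote Databricks cluster.
print(spark.range(100).count())  # expect: 100
```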

Databricks-Connect - FINALLY!

I’ve been waiting for this feature for what feels like forever. Databricks-Connect is here! You can download it here. It allows you to develop using an IDE such as VSCode, PyCharm or IntelliJ and connect to a remote Databricks cluster to execute the work. This means we can bring much better development experiences and best practices to data engineering workloads. Notebooks are great for exploring data, but they are not enterprise code for ETL jobs. [Read More]

Databricks-Connect Limitations

Databricks-Connect is the feature I’ve been waiting for. It is a complete game changer for developing data pipelines - previously you could develop locally using Spark, but that meant you couldn’t get all the nice Databricks runtime features, like Delta, DBUtils etc. Databricks-Connect allows teams to start developing in a more enterprise fashion than notebooks allow. I’m a big fan. There are, however, a few limitations on what you can do - I’m sure more features will be added in the future, so this list will probably change. [Read More]
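As one example of a runtime feature that does carry over, DBUtils is reachable through the shim that ships with Databricks-Connect (the documented support covers the fs and secrets utilities); a minimal sketch:

```python
# Sketch of the DBUtils shim pattern under Databricks-Connect;
# only the fs and secrets utilities are documented as supported.
from pyspark.sql import SparkSession
from pyspark.dbutils import DBUtils

spark = SparkSession.builder.getOrCreate()
dbutils = DBUtils(spark)

print(dbutils.fs.ls("/"))  # lists the root of DBFS on the remote workspace
```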

Databricks Key Vault backed Secret Scopes

A few weeks ago Databricks added the ability to have Azure Key Vault backed Secret Scopes. These are still in preview. Today the PowerShell tools for Databricks we maintain have been updated to support them! Technically the API is not documented, and as the service is in preview it may stop working in the future. But we will update the module if Databricks change the API. Example:

Import-Module azure.databricks.cicd.Tools
$BearerToken = "dapi1234567890"
$Region = "westeurope"
$ResID = "/subscriptions/{mysubscriptionid}/resourceGroups/{myResourceGroup}/providers/Microsoft. [Read More]

PowerShell for Azure Databricks

Last year we released a PowerShell module called azure.databricks.cicd.tools on GitHub and the PowerShell Gallery. What we never did was publish anything about what it can do. The original purpose was to help with CI/CD scenarios, so that you could create idempotent releases in Azure DevOps, Jenkins etc. But it now has almost full parity with the options available in the REST API. Databricks do offer a supported CLI (which requires Python installed) and a REST API; the REST API is quite complex to use, but it is what this PowerShell module is built on. [Read More]

Databricks & Snowflake Python Errors

I’ve recently been playing with writing data to Snowflake from Databricks. Reading and writing are pretty simple as per the instructions from Databricks. But if you want to execute SnowSQL commands using the snowflake-python-connector and Python 3, you will be greeted with this error when you try to import the module (despite the library attaching without error): cffi library '_openssl' has no function, constant or global variable named 'Cryptography_HAS_SET_ECDH_AUTO' [Read More]
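For context, this is the usual connector pattern, and the import line is where the error surfaces on the cluster; the credentials below are placeholders:

```python
# Typical snowflake-python-connector usage; on an affected Databricks
# cluster the import below is what raises the cffi/_openssl error.
import snowflake.connector

conn = snowflake.connector.connect(
    user="...",       # placeholder credentials
    password="...",
    account="...",
)
conn.cursor().execute("SELECT CURRENT_VERSION()")
conn.close()
```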

PySpark Applications for Databricks

UPDATE April 2019: If you are interested in creating a PySpark application for Databricks, you should consider using Databricks-Connect. More details here. Whilst notebooks are great, there comes a time and place when you just want to use Python and PySpark in their pure form. Databricks has the ability to execute Python jobs for when notebooks don’t feel very enterprise data pipeline ready - %run and widgets just look like schoolboy hacks. [Read More]
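In that spirit, here is a sketch of the shape a pure-Python Databricks job can take, using argparse in place of widgets; the path and argument names are illustrative:

```python
# Hypothetical Python job script: parameters arrive via argparse rather
# than notebook widgets, so the same file runs locally or as a
# Databricks Python job.
import argparse

from pyspark.sql import SparkSession


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--input-path", required=True)
    args = parser.parse_args()

    spark = SparkSession.builder.getOrCreate()
    df = spark.read.parquet(args.input_path)
    print(df.count())


if __name__ == "__main__":
    main()
```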