Databricks-Connect in a Container

Today I have published a set of containers to Docker Hub to enable developers to create Databricks-Connect dev environments really quickly and consistently. Docker Hub: https://hub.docker.com/r/datathirstltd/dbconnect. Source code: https://github.com/DataThirstLtd/databricksConnectDocker. These are targeted at development using PySpark in VSCode, though I suspect they will work for Scala and Java development as well. Why? Because setting up Databricks-Connect (particularly on Windows) is a PIA. This allows: a common setup between team members, multiple side-by-side versions, the ability to reset your environment, and even running the whole thing from a browser! [Read More]
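Once a container is up, a quick smoke test confirms the wiring; this sketch assumes the cluster connection details have already been configured inside the container (for example via databricks-connect configure).

```python
from pyspark.sql import SparkSession

# With Databricks-Connect installed in the container, this session is backed
# by the remote Databricks cluster rather than a local Spark instance
spark = SparkSession.builder.getOrCreate()

# A trivial job that executes on the cluster; expect 10 back
print(spark.range(10).count())
```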

PySpark in Visual Studio Online (Container)

A new product was recently released called Visual Studio Online (VSOnline). Not to be confused with the old Visual Studio Online (VSO), which was renamed VSTS and then renamed again to what is now Azure DevOps (that’s a rant for another day). Update May 2020: Microsoft have renamed it already - it’s now Codespaces! I quite like that name. VSOnline allows you to create a development environment in a container hosted in Azure for pennies. [Read More]

Part 5 - Developing a PySpark Application

This is the 5th and final part of a series of posts to show how you can develop PySpark applications for Databricks with Databricks-Connect and Azure DevOps. All source code can be found here. Configuration & Releasing: We are now ready to deploy. I’m working on the assumption that we have two further environments to deploy into - UAT and Production. Deploy.ps1: This script in the root folder will do all the work we need to release our Wheel and set up some Databricks Jobs for us. [Read More]
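For a flavour of what a release step like this involves, here is a hedged Python sketch of registering a job for the released Wheel via the Databricks Jobs REST API (2.0). The workspace URL, token, wheel path and cluster spec are all placeholder assumptions; in the series the real work is done by Deploy.ps1, not this snippet.

```python
import requests

# Placeholders - swap in your own workspace URL and access token
HOST = "https://<your-workspace>.azuredatabricks.net"
TOKEN = "<personal-access-token>"

job = {
    "name": "my-pyspark-app",  # illustrative job name
    "new_cluster": {
        "spark_version": "6.4.x-scala2.11",
        "node_type_id": "Standard_DS3_v2",
        "num_workers": 2,
    },
    # The released Wheel, assumed to have been uploaded to DBFS
    "libraries": [{"whl": "dbfs:/releases/my_pyspark_app-0.0.1-py3-none-any.whl"}],
    # Entry-point script that imports the Wheel and runs the application
    "spark_python_task": {"python_file": "dbfs:/releases/main.py"},
}

resp = requests.post(
    f"{HOST}/api/2.0/jobs/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=job,
)
resp.raise_for_status()
print(resp.json())  # e.g. {"job_id": 123}
```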

Part 4 - Developing a PySpark Application

This is the 4th part of a series of posts to show how you can develop PySpark applications for Databricks with Databricks-Connect and Azure DevOps. All source code can be found here. Create a CI Build: Now that we have everything running locally we want to create a CI process to build our Wheel, publish it as an artefact and, of course, test our code. In the root of the project is a file called azure-pipelines. [Read More]
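To sketch the idea of sharing one build script between a developer's machine and the pipeline, a small wrapper like the one below could run the tests and build the Wheel so both environments execute identical steps. The file name and step order are my own illustration, not taken from the repo.

```python
import subprocess
import sys


def run(cmd):
    """Run a command, echo it, and fail the build on a non-zero exit."""
    print(">>", " ".join(cmd))
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run([sys.executable, "-m", "pytest", "tests"])    # run the test suite
    run([sys.executable, "setup.py", "bdist_wheel"])  # build the Wheel into dist/
```

The azure-pipelines steps can then call this one script, keeping CI and local builds in lockstep.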

Part 3 - Developing a PySpark Application

This is the 3rd part of a series of posts to show how you can develop PySpark applications for Databricks with Databricks-Connect and Azure DevOps. All source code can be found here. Packaging into a Wheel: Before we can create a CI process we should ensure that we can build and package the application locally. Reusing the same scripts locally and on the CI server minimises the chances of something breaking. [Read More]
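For context, a minimal setup.py along these lines is enough to produce a Wheel; the package name and version are illustrative only, not the ones used in the series.

```python
from setuptools import setup, find_packages

setup(
    name="my-pyspark-app",  # illustrative name
    version="0.0.1",        # illustrative version
    packages=find_packages(exclude=["tests"]),
    # pyspark is deliberately omitted: Databricks-Connect supplies it locally
    # and the Databricks runtime supplies it on the cluster
    install_requires=[],
)
```

Running `python setup.py bdist_wheel` then drops the .whl into dist/.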

Part 2 - Developing a PySpark Application

This is the 2nd part of a series of posts to show how you can develop PySpark applications for Databricks with Databricks-Connect and Azure DevOps. All source code can be found here. Adding Tests: Now that we have a working local application we should add some tests. As with any data-based project, testing can be very difficult. Strictly speaking, all tests should set up the data they need and be fully independent. [Read More]
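As a sketch of the shape these tests can take (pytest assumed), a session-scoped fixture shares one SparkSession (with Databricks-Connect it points at the remote cluster, so creating it is expensive) and each test builds the data it needs.

```python
import pytest
from pyspark.sql import SparkSession


@pytest.fixture(scope="session")
def spark():
    # One shared session for the whole run; with Databricks-Connect this
    # connects to the remote cluster, which is too slow to do per-test
    return SparkSession.builder.getOrCreate()


def test_filter_keeps_only_positive_values(spark):
    # The test creates its own data, so it is fully independent
    df = spark.createDataFrame([(-1,), (0,), (5,)], ["value"])
    result = df.filter(df["value"] > 0).collect()
    assert [row.value for row in result] == [5]
```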

Part 1 - Developing a PySpark Application

This is the 1st part of a series of posts to show how you can develop PySpark applications for Databricks with Databricks-Connect and Azure DevOps. All source code can be found here. Overview: The goal of this post is to be able to create a PySpark application in Visual Studio Code using Databricks-Connect. This post focuses on creating an application in your local development environment. Other posts in the series will look at CI & Testing. [Read More]
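As a minimal sketch of the starting point (the file layout and names here are mine, not the series'), the application can be a plain Python module whose SparkSession resolves to the remote cluster once Databricks-Connect is installed.

```python
from pyspark.sql import SparkSession, DataFrame


def get_spark():
    # With Databricks-Connect configured, this returns a session backed by
    # the remote Databricks cluster rather than a local Spark instance
    return SparkSession.builder.getOrCreate()


def positive_only(df: DataFrame) -> DataFrame:
    # Keeping transformations in plain functions makes them easy to test later
    return df.filter(df["value"] > 0)


if __name__ == "__main__":
    spark = get_spark()
    data = spark.range(-5, 5).withColumnRenamed("id", "value")
    print(positive_only(data).count())  # expect 4 (values 1 to 4)
```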

Series - Developing a PySpark Application

This is a series of blog posts to demonstrate how PySpark applications can be developed specifically with Databricks in mind, though the general principles applied here can be used with any Apache Spark setup (not just Databricks). Python has many complexities with regards to paths, packaging and deploying versions. These blog posts capture what we have learnt working with clients to build robust, reliable development processes. The goal is a local development environment that uses Databricks-Connect to execute against a cluster, backed by a solid CI and testing framework using Azure DevOps pipelines to support development going forward. [Read More]

Databricks PowerShell Tools Update 1.1.21

A new release of azure.databricks.cicd.tools has gone out today. Changes include:

- Support for Cluster Log Path on New-DatabricksCluster, Add-DatabricksJarJob, Add-DatabricksNotebookJob and Add-DatabricksPythonJob
- Support for Instance Pool ID on the above as well (support for creating new instance pools will come soon)
- New command: Restart-DatabricksCluster
- New command: Remove-DatabricksLibrary
- Added -RunImmediate to Add-DatabricksNotebookJob
- Fixed a case-sensitivity issue preventing all commands importing on Linux

All the docs have been updated in the wiki as well. [Read More]

Thoughts on Spark Summit Announcements

Yesterday saw the Spark Summit keynote take place in San Francisco. There were quite a few announcements, some expected, some not so much. In order of importance as I see things today, this is my summary and view of them. Kubernetes: No real fanfare was made about this, and judging by the comments on Twitter people do not seem too excited by it. But for me this is massive - Spark 3. [Read More]