PySpark in Visual Studio Online (Container)

A new product was recently released called Visual Studio Online (VSOnline). Not to be confused with the old Visual Studio Online (VSO), which was renamed to VSTS and then to what is now Azure DevOps (that's a rant for another day). Update May 2020: Microsoft have renamed it already - it's now CodeSpaces! I quite like that name. VSOnline allows you to create a development environment in a container hosted in Azure for almost pennies. [Read More]

Part 5 - Developing a PySpark Application

This is the 5th and final part of a series of posts to show how you can develop PySpark applications for Databricks with Databricks-Connect and Azure DevOps. All source code can be found here. Configuration & Releasing: we are now ready to deploy. I'm working on the assumption that we have two further environments to deploy into - UAT and Production. Deploy.ps1: this script in the root folder will do all the work we need to release our Wheel and set up some Databricks Jobs for us. [Read More]
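
The teaser cuts off before the script itself, but a deployment script like this almost certainly drives the Databricks Jobs REST API. As a rough Python sketch of the same idea (the post's actual Deploy.ps1 is PowerShell, and the workspace URL, token, wheel path and cluster spec below are all placeholders):

```python
import requests

# Placeholder values - a real script would read these from DevOps variables.
HOST = "https://<your-workspace>.azuredatabricks.net"
TOKEN = "dapi..."  # a Databricks personal access token

# Register a job that runs our entry point with the released Wheel
# attached as a library - the core of what a Deploy script needs to do.
job_spec = {
    "name": "pyspark-app-uat",
    "new_cluster": {
        "spark_version": "6.4.x-scala2.11",
        "node_type_id": "Standard_DS3_v2",
        "num_workers": 2,
    },
    "libraries": [{"whl": "dbfs:/wheels/pyspark_app-1.0.0-py3-none-any.whl"}],
    "spark_python_task": {"python_file": "dbfs:/scripts/main.py"},
}

resp = requests.post(
    f"{HOST}/api/2.0/jobs/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=job_spec,
)
resp.raise_for_status()
print("Created job:", resp.json()["job_id"])
```

Running the same call with different variable values per environment is what makes a single script work for both UAT and Production.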

Part 4 - Developing a PySpark Application

This is the 4th part of a series of posts to show how you can develop PySpark applications for Databricks with Databricks-Connect and Azure DevOps. All source code can be found here. Create a CI Build: now that we have everything running locally, we want to create a CI process to build our Wheel, publish it as an artefact and, of course, test our code. In the root of the project is a file called azure-pipelines. [Read More]
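
The pipeline definition itself is YAML, but the key point is that CI should run the same steps a developer runs locally. A minimal sketch of what such a shared build script could look like, assuming a hypothetical build.py at the repo root (not a file from the actual repo):

```python
import subprocess
import sys

def run(cmd):
    # Echo each command so the CI log shows exactly what ran.
    print("+", " ".join(cmd))
    subprocess.check_call(cmd)

def main():
    # Run the test suite first; a non-zero exit code fails the pipeline.
    # The JUnit XML output lets Azure DevOps publish the test results.
    run([sys.executable, "-m", "pytest", "tests", "--junitxml=test-results.xml"])
    # Then build the Wheel into dist/ so the pipeline can publish it
    # as a build artefact.
    run([sys.executable, "setup.py", "bdist_wheel"])

if __name__ == "__main__":
    main()
```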

Part 3 - Developing a PySpark Application

This is the 3rd part of a series of posts to show how you can develop PySpark applications for Databricks with Databricks-Connect and Azure DevOps. All source code can be found here. Packaging into a Wheel: before we can create a CI process we should ensure that we can build and package the application locally. Reusing the same scripts locally and on the CI server minimises the chances of something breaking. [Read More]
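
A minimal setup.py sketch illustrating the packaging step; the package name and version are placeholders, not the repo's actual values:

```python
from setuptools import setup, find_packages

setup(
    name="pyspark_app",
    version="1.0.0",
    # Pick up the application packages but leave the tests out of the Wheel.
    packages=find_packages(exclude=["tests", "tests.*"]),
    # pyspark is supplied by Databricks-Connect locally and by the
    # cluster runtime in production, so it is deliberately not pinned here.
    install_requires=[],
)
```

Running python setup.py bdist_wheel then drops the Wheel into dist/, which is what the CI build in Part 4 publishes as an artefact.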

Part 2 - Developing a PySpark Application

This is the 2nd part of a series of posts to show how you can develop PySpark applications for Databricks with Databricks-Connect and Azure DevOps. All source code can be found here. Adding Tests: now that we have a working local application we should add some tests. As with any data-based project, testing can be very difficult. Strictly speaking, all tests should set up the data they need and be fully independent. [Read More]
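
As an illustration of that principle, here is a small pytest sketch in which the test creates its own input data and a session-scoped SparkSession is shared across the run; the fixture and test names are invented for this example:

```python
import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark():
    # With Databricks-Connect installed and configured, getOrCreate()
    # attaches to the remote cluster; otherwise it starts a local session.
    return SparkSession.builder.getOrCreate()

def test_row_count(spark):
    # Build the input inside the test so it depends on no external state.
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
    assert df.count() == 2
```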

Part 1 - Developing a PySpark Application

This is the 1st part of a series of posts to show how you can develop PySpark applications for Databricks with Databricks-Connect and Azure DevOps. All source code can be found here. Overview: the goal of this post is to be able to create a PySpark application in Visual Studio Code using Databricks-Connect. This post focuses on creating an application in your local development environment; other posts in the series will look at CI and testing. [Read More]
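
A minimal sketch of the kind of application the series builds, assuming Databricks-Connect is already configured; the function and column names here are illustrative only:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

def add_greeting(df):
    # Keep the transformation in its own function, separate from session
    # setup, so it can be unit-tested later in the series.
    return df.withColumn("greeting", F.concat(F.lit("Hello, "), F.col("name")))

if __name__ == "__main__":
    # With Databricks-Connect configured, this executes on the remote cluster.
    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("Anna",), ("Bo",)], ["name"])
    add_greeting(df).show()
```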

Series - Developing a PySpark Application

This is a series of blog posts demonstrating how PySpark applications can be developed specifically with Databricks in mind, though the general principles applied here can be used with any Apache Spark setup (not just Databricks). Python has many complexities with regard to paths, packaging and deploying versions. These blog posts capture what we have learnt working with clients to build robust, reliable development processes. The goal is a local development environment that uses Databricks-Connect to execute against a cluster, with a solid CI and testing framework built on Azure DevOps pipelines to support development going forward. [Read More]

Setup Databricks-Connect on Windows 10

UPDATE June 2020 - How about using a container instead? It's much easier than installing all this stuff: Prebuilt container. Having recently tried to get DBConnect working on a Windows 10 machine, I've realised things are not as easy as you might think. These are the steps I have found to set up a new machine and get Databricks-Connect working. Install Java: download and install Java SE Runtime Version 8. It's very important you only get version 8, nothing later. [Read More]
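
Once Java 8 and the databricks-connect package are installed and databricks-connect configure has been run, a quick smoke test (in addition to the built-in databricks-connect test command) is simply:

```python
from pyspark.sql import SparkSession

# If the setup is correct, this runs on the remote cluster and prints 100.
spark = SparkSession.builder.getOrCreate()
print(spark.range(100).count())
```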

Databricks & Snowflake Python Errors

I’ve recently been playing with writing data to Snowflake from Databricks. Reading and writing are pretty simple as per the instructions from Databricks. But if you want to execute SnowSQL commands using the snowflake-python-connector and Python 3, you will be greeted with this error when you try to import the module (despite attaching the library without error): cffi library '_openssl' has no function, constant or global variable named 'Cryptography_HAS_SET_ECDH_AUTO' [Read More]
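
For context, a minimal sketch of the kind of code that hits this; the error fires on the import itself, before any of the connection values below (all placeholders) are ever used:

```python
import snowflake.connector  # the cffi '_openssl' error is raised here on Databricks

# Placeholder connection values - the failure above happens before
# these matter.
conn = snowflake.connector.connect(
    account="<account_identifier>",
    user="<user>",
    password="<password>",
    warehouse="<warehouse>",
)
try:
    # Execute a SnowSQL command through the connector.
    conn.cursor().execute("CREATE TABLE IF NOT EXISTS demo (id INTEGER)")
finally:
    conn.close()
```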

PySpark Applications for Databricks

UPDATE April 2019: If you are interested in creating a PySpark application for Databricks you should consider using Databricks-Connect. More details here. Whilst notebooks are great, there comes a time and place when you just want to use Python and PySpark in its pure form. Databricks has the ability to execute Python jobs for when notebooks don't feel very enterprise data pipeline ready - %run and widgets just look like schoolboy hacks. [Read More]