Series - Developing a PySpark Application

This is a series of blog posts demonstrating how PySpark applications can be developed with Databricks specifically in mind, though the general principles apply to any Apache Spark setup, not just Databricks.

Python has many complexities with regard to paths, packaging and deploying versions. These blog posts capture what we have learnt working with clients to build robust, reliable development processes.

The goal is to have a local development environment that uses Databricks-Connect to execute code against a Databricks cluster, backed by a solid CI and testing framework built on Azure DevOps pipelines to support ongoing development. The series also provides a sample project demonstrating this.

This series can also be used to build more general libraries that can be shared amongst your users, so that your teams work with common tools.

We will be using:

  • Conda

  • Databricks-Connect

  • Databricks

  • Visual Studio Code

  • Azure DevOps