UPDATE April 2019: If you are interested in creating a PySpark application for Databricks you should consider using Databricks-Connect. More details here.
Whilst notebooks are great, there comes a time and place when you just want to use Python and PySpark in its pure form. Databricks has the ability to execute Python jobs for when notebooks don’t feel very enterprise data pipeline ready - %run and widgets just look like schoolboy hacks. Also, the lack of debugging in Databricks is painful at times. By having a PySpark application we can debug locally in our IDE of choice (I’m using VSCode).
For some reason Python Jobs are not available in the Workspace UI today (but they are available in the REST API and when executing via Azure Data Factory). The Workspace UI does have the ability to use Spark-submit jobs and Python, which oddly the Azure Data Factory tasks do not support. These inconsistencies make it hard to get started.
Developing in PySpark Locally
We are going to create a project that structurally looks like the image on the right. The full project is available on GitHub.
This article will leave spark-submit for another day and focus on Python jobs. I will also assume you have PySpark working locally. We will focus on developing a PySpark application that you can execute and debug locally, and also deploy to a Databricks cluster with no changes.
When developing PySpark you need to be aware of the differences between notebooks and Python jobs, and between developing locally and running on a cluster. The key differences are summarised here:
| Feature | Notebooks | Python Job | Local PySpark |
|---|---|---|---|
| spark.SparkSession | Already available | Must find manually | Must create manually |
| dbutils | Already available | Already available | Never available |
| Import Notebooks | via %run | Not possible | Not possible |
| Import other modules | Using libraries only | If in PYTHONPATH | via PYTHONPATH or |
Using PySpark we must getOrCreate the spark.SparkSession in each script. That is pretty simple:
The only thing to remember here is that the session must be called spark (lowercase).
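As a minimal sketch of that session setup, assuming pyspark is installed: the helper name get_spark and the application name are illustrative choices, not taken from the sample project. The import sits inside the function only so the file still loads on a machine without PySpark.

```python
def get_spark(app_name="MyApplication"):
    """Return the active SparkSession, creating one if none exists."""
    # Imported here so this file still loads where pyspark is not installed.
    from pyspark.sql import SparkSession

    # getOrCreate() reuses a session that is already running (as on a
    # cluster) and builds a fresh one on a development machine.
    return SparkSession.builder.appName(app_name).getOrCreate()
```

Near the top of a script you would then write `spark = get_spark()`, keeping the lowercase name.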
Dbutils is handy in notebooks. It is also available in Python jobs, but it is not available to download and use as a local module. This means we want to avoid using it in our jobs. I have managed to do this with the exception of getting secrets, which is a must-have on the Databricks cluster unless you want to pass around plaintext passwords (which you don’t, right?).
An important point to remember is to never run import dbutils in your Python script. This command succeeds but clobbers all the commands so nothing works. It is imported by default.
One other key difference is that you are probably not running Hadoop locally, although you can if you really want to. This means that you cannot mount/connect to Data Lake or Blob storage accounts. As this is local development - I recommend having a sample of files locally and essentially mocking your cloud storage. As a bonus put these in your repo - it makes writing tests so much simpler - in the sample project I have created a subfolder called DataLake.
Importing other Modules
Like my sample project the chances are that you will have several .py files containing the code you want to execute. It wouldn’t be good practice to create a PySpark application in a single file. You will split out helpers etc into other files. These need to be imported into the executing script.
Before we get too far into that, it is important to understand that PYTHONPATH is an environment variable which contains a list of locations to search for modules when you run the import command. It is safer to always set this variable rather than rely on relative paths that may not work on clusters (and won’t on a Databricks cluster).
There is a simple fix for this, which allows you to add full paths from a relative path:
This adds a subfolder called Utils to my PYTHONPATH so that when I execute import helperfunctions (which is helperfunctions.py in the Utils folder) it succeeds.
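The idea of turning a relative folder into a full path on the module search path can be sketched as follows. The function name add_module_folder is hypothetical, and the commented call assumes the executing script sits next to the Utils folder.

```python
import os
import sys


def add_module_folder(base_dir, relative_folder):
    """Put base_dir/relative_folder on the module search path and return it."""
    full_path = os.path.abspath(os.path.join(base_dir, relative_folder))
    if full_path not in sys.path:
        # sys.path is seeded from PYTHONPATH at start-up; appending here has
        # the same effect for the running process.
        sys.path.append(full_path)
    return full_path


# Possible call from the executing script (Utils is the folder named above):
# add_module_folder(os.path.dirname(os.path.abspath(__file__)), "Utils")
# import helperfunctions
```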
When you execute your application you will probably want to pass in some parameters such as file paths, dates to process etc. To do this we will use a library called argparse. This sample code reads in two arguments called job and slot.
The code lives in the main.py file, which is the script our ADF pipeline or Python job will execute. The job parameter tells it which module and method to execute; the slot is just a sample parameter.
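A rough sketch of reading those two arguments with argparse is below. Treating them as optional positional arguments, the module.method format for job, and both default values are assumptions made for illustration rather than details confirmed by the sample project.

```python
import argparse


def parse_arguments(argv=None):
    """Read the job and slot arguments from the command line."""
    parser = argparse.ArgumentParser(description="PySpark job runner")
    # Both are optional positionals here; the defaults are placeholders.
    parser.add_argument("job", nargs="?", default="samplejob.run",
                        help="module and method to execute, as module.method")
    parser.add_argument("slot", nargs="?", default="2019/01/01",
                        help="sample parameter, e.g. a date to process")
    # argv=None makes argparse fall back to sys.argv[1:].
    return parser.parse_args(argv)
```

Passing an explicit list, such as `parse_arguments(["samplejob.run", "2019/01/01"])`, makes the parsing easy to exercise from a test or a debugger without touching the real command line.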
By using the job parameter this script can be reused to execute any module and method (handy if you want to loop in ADF without creating hundreds of ADF pipeline tasks).
It will be executed using this code:
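One way such a dispatch could work is sketched here with importlib. The run_job name and the convention that job looks like module.method are assumptions; the sample project may resolve the module and method differently.

```python
import importlib


def run_job(job, *args, **kwargs):
    """Import the module named in job and call the method named in job."""
    # Everything before the last dot is the module, the rest is the method.
    module_name, _, method_name = job.rpartition(".")
    if not module_name:
        raise ValueError("job must look like 'module.method', got %r" % job)
    module = importlib.import_module(module_name)
    method = getattr(module, method_name)
    return method(*args, **kwargs)
```

Because the target is looked up by name at run time, adding a new job only means adding a new module on the search path; main.py itself does not change.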
Lastly we need some configuration to handle the differences between the local and the Databricks cluster environments. I’m setting spark.conf in the script, but you can also set this from ADF or in the Job:
The ADLS is the path to my Data Lake which is a local path when the Spark context is local and the Azure Data Lake when I’m on the cluster. I’m also using dbutils when in Databricks to get the secret connection details for the Lake.
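The local-versus-cluster switch can be sketched as a pair of small functions. This assumes a local session reports a master URL beginning with "local" and a cluster does not; the function names, the ADLS configuration key in the comment, and the placeholder URI are illustrative only.

```python
def is_local(master):
    """True when the Spark master URL points at a local session."""
    return master.startswith("local")


def data_lake_root(master, local_root, cluster_root):
    """Pick the Data Lake root folder for the current environment."""
    return local_root if is_local(master) else cluster_root


# Possible wiring inside the script (spark is the existing session):
# root = data_lake_root(spark.sparkContext.master,
#                       "./DataLake/",                                # local sample files
#                       "adl://<your-lake>.azuredatalakestore.net/")  # placeholder URI
# spark.conf.set("ADLS", root)
```

Keeping the decision in plain functions means it can be tested without starting Spark at all.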
You should now understand what is happening in the sample repo, so much so that you can execute main.py and follow what it does. At this point I would suggest doing that if you have not already. Now you can also add breakpoints and debug your code (yay!).
Build & Deploy
When Databricks executes jobs it copies the file you specify to execute to a temporary folder which is a dynamic folder name. Unlike Spark-submit you cannot specify multiple files to copy. The easiest way to handle this is to zip up all of your dependent module files into a flat archive (no folders) and add the zip to the cluster from DBFS.
Eagle-eyed readers may have noticed the line spark.sparkContext.addPyFile("dbfs:/MyApplication/Code/scripts.zip") in the last code snippet. This zip file contains our dependent modules. The addPyFile command allows all servers in the cluster to see the file.
All code must be deployed to DBFS in advance of your job running. I have created two scripts to handle this, build.ps1 and deploy.ps1 (yes PowerShell - Mac users don’t freak out I use a Mac too - get PowerShell Core - all is cool).
The build.ps1 script creates a bin directory (add it to .gitignore) that contains the main script we will execute and a zip file of the dependent scripts.
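The build itself is a PowerShell script, but the flat-archive step can be illustrated in Python. This is a sketch only: build_flat_zip is a hypothetical helper, and refusing duplicate file names is a design choice made here because a flat archive cannot hold two modules with the same name.

```python
import os
import zipfile


def build_flat_zip(source_dirs, zip_path):
    """Zip every .py file under source_dirs into zip_path with no folders."""
    added = []
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for source_dir in source_dirs:
            for folder, _, files in os.walk(source_dir):
                for name in sorted(files):
                    if not name.endswith(".py"):
                        continue
                    if name in added:
                        raise ValueError("duplicate module name: " + name)
                    # arcname=name drops the folder, keeping the archive flat.
                    archive.write(os.path.join(folder, name), arcname=name)
                    added.append(name)
    return added
```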
The next script is deploy.ps1, which uploads the two files to DBFS, from where you can execute them:
Note you will need to create the file MyBearerToken.txt in the same folder, containing your bearer token. Also set the region to your region.
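For readers who prefer Python to PowerShell, the upload can be approximated with the DBFS put endpoint of the Databricks REST API. The sketch below only builds the request so it can be inspected before anything is sent; the function name, the bin/scripts.zip location and the host in the comments are assumptions. Inline contents on this endpoint are limited to about 1 MB, so larger files would need the streaming upload calls instead.

```python
import base64
import json
import urllib.request


def build_dbfs_put_request(host, token, dbfs_path, data):
    """Build (but do not send) a DBFS put request for a small file."""
    payload = {
        "path": dbfs_path,
        # The put endpoint takes inline file contents base64 encoded.
        "contents": base64.b64encode(data).decode("ascii"),
        "overwrite": True,
    }
    return urllib.request.Request(
        url=host.rstrip("/") + "/api/2.0/dbfs/put",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Authorization": "Bearer " + token,
                 "Content-Type": "application/json"},
        method="POST",
    )


# Sending it is left to the caller, for example:
# with open("MyBearerToken.txt") as f:
#     token = f.read().strip()
# with open("bin/scripts.zip", "rb") as f:
#     request = build_dbfs_put_request("https://<region>.azuredatabricks.net",
#                                      token, "/MyApplication/Code/scripts.zip",
#                                      f.read())
# urllib.request.urlopen(request)
```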
Execute as a Python Job
Using PowerShell you can create a Python job (the Databricks Workspace UI does not let you create Python jobs, but you can view them):
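Whatever tool sends it, the request body for the Jobs API create call has roughly the shape sketched below, with the script described in a spark_python_task block. The helper name is hypothetical, and every value in the example (job name, main.py location on DBFS, parameters, cluster settings) is a placeholder to replace with your own.

```python
import json


def build_python_job(name, python_file, parameters, cluster_spec):
    """Return the JSON body for creating a job that runs a Python file."""
    return {
        "name": name,
        "new_cluster": cluster_spec,
        "spark_python_task": {
            "python_file": python_file,
            # Passed to the script as command line arguments, in order.
            "parameters": list(parameters),
        },
    }


# Every value below is a placeholder to adapt to your own workspace.
example = build_python_job(
    name="MyApplication",
    python_file="dbfs:/MyApplication/Code/main.py",
    parameters=["samplejob.run", "2019/01/01"],
    cluster_spec={"spark_version": "<runtime-version>",
                  "node_type_id": "<node-type>",
                  "num_workers": 2},
)
print(json.dumps(example, indent=2))
```

The resulting JSON would be posted to /api/2.0/jobs/create with the same bearer token used for the upload.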
Before you execute this on Databricks you will need to create an Azure Data Lake and set the credentials and Secrets.
Once executed you should see the job in Databricks and be able to execute it with Success!
You can also execute from Azure Data Factory using the Databricks Python task. Just point to your script and pass parameters as normal:
Congratulations on making it this far in a very long post. I hope you find it useful.
The full GitRepo is here.