Before we can create a CI process we should ensure that we can build and package the application locally. Reusing the same scripts locally and on the CI server minimises the chances of something breaking.
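As an illustration, a minimal `setup.py` that both the developer and the CI server could invoke might look like the sketch below; the project name, version, and package layout are assumptions rather than details from the actual project.

```python
# setup.py - minimal packaging sketch (hypothetical project layout).
# Running `python setup.py bdist_wheel` locally and on the CI server
# builds the same wheel from the same script.
from setuptools import setup, find_packages

setup(
    name="my_pyspark_app",  # assumed project name
    version="0.1.0",
    packages=find_packages(exclude=["tests"]),
    # pyspark is deliberately left out of install_requires here:
    # the cluster (or databricks-connect) supplies it at runtime.
    install_requires=[],
)
```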
Now that we have a working local application, we should add some tests. As with any data-based project, testing can be very difficult. Strictly speaking, every test should set up the data it needs and be fully independent. This is actually a good case for using normal PySpark locally rather than Databricks-Connect, as that forces you to use small local files which you can add to the repo.
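As a sketch of what such a self-contained test could look like with pytest and a plain local SparkSession (the fixture, test name, and data are all hypothetical):

```python
# test_transforms.py - local PySpark test sketch; no cluster needed.
import pytest
from pyspark.sql import SparkSession


@pytest.fixture(scope="session")
def spark():
    # A plain local SparkSession, completely independent of Databricks.
    session = (
        SparkSession.builder
        .master("local[2]")
        .appName("unit-tests")
        .getOrCreate()
    )
    yield session
    session.stop()


def test_filter_keeps_only_positive_amounts(spark):
    # Each test sets up exactly the data it needs, keeping it independent.
    df = spark.createDataFrame(
        [(1, 10.0), (2, -5.0), (3, 2.5)],
        ["id", "amount"],
    )
    result = df.filter(df.amount > 0)
    assert result.count() == 2
```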
The goal of this post is to create a PySpark application in Visual Studio Code using Databricks-Connect. This post focuses on creating an application in your local development environment; other posts in the series will look at CI and testing.
This is a series of blog posts demonstrating how PySpark applications can be developed with Databricks specifically in mind, though the general principles applied here can be used with any Apache Spark setup (not just Databricks).
Yesterday saw the Spark Summit keynote take place in San Francisco. There were quite a few announcements, some expected, some not so much. Here is my summary and view of them, in order of how important they seem to me today.
I’ve been waiting for this feature for what feels like forever.
Databricks-Connect is here! Well, almost: it's still in preview, but the release looks imminent. It allows you to develop using an IDE like VSCode, PyCharm or IntelliJ and connect to a remote Databricks cluster to execute your code.
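To give a feel for it, here is a minimal sketch of what that looks like: once databricks-connect is installed and configured in place of local pyspark, ordinary PySpark code such as the following runs against the remote cluster (nothing here is Databricks-specific, which is the point):

```python
# app.py - ordinary PySpark code; with databricks-connect configured,
# this SparkSession transparently targets the remote Databricks cluster.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.range(100)  # the work is executed on the remote cluster
print(df.selectExpr("sum(id) AS total").collect())
```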