Yesterday saw the Spark Summit keynote take place in San Francisco. There were quite a few announcements, some expected, some not so much. In order of importance to how I see things today this is my summary and view of them.
I’ve been waiting for this feature for what feels like forever.
Databricks-Connect is here! Well almost - it’s still preview, but the release looks imminent. It allows you to develop using an IDE like VSCode, PyCharm, IntelliJ etc and connect to a remote Databricks cluster to execute the task.
Databricks-Connect is the feature I’ve been waiting for. It is a complete game changer for developing data pipelines - previously you could develop locally using Spark but that meant you couldn’t get all the nice Databricks runtime features - like Delta, DBUtils etc.
Whilst notebooks are great, there comes a time and place when you just want to use Python and PySpark in it’s pure form. Databricks has the ability to execute Python jobs for when notebooks don’t feel very enterprise data pipeline ready - %run and widgets just look like schoolboy hacks. Also the lack of debugging in Databricks is painful at times. By having a PySpark application we can debug locally in our IDE of choice (I’m using VSCode).
Databricks provides some nice connectors for reading and writing data to SQL Server. These are generally want you need as these act in a distributed fashion and support push down predicates etc etc. But sometimes you want to execute a stored procedure or a simple statement.