Databricks-Connect in a Container

Today I published a set of containers to Docker Hub that let developers create Databricks-Connect dev environments quickly and consistently. Docker Hub: https://hub.docker.com/r/datathirstltd/dbconnect Source Code: https://github.com/DataThirstLtd/databricksConnectDocker These are aimed at PySpark development in VSCode, though I suspect they will work for Scala and Java development as well. Why? Because setting up Databricks-Connect (particularly on Windows) is a pain. This allows:

- A common setup between team members
- Multiple side-by-side versions
- The ability to reset your environment
- Even running the whole thing from a browser!

[Read More]
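As a rough sketch, getting started with one of these images might look like the following (the tag and mount path here are assumptions, not from the post; check the Docker Hub page for the actual tags):

```shell
# Pull the Databricks-Connect dev image (tag is an assumption; check Docker Hub)
docker pull datathirstltd/dbconnect:latest

# Start an interactive container, mounting the current project directory so
# edits persist on the host. Connection details (workspace URL, token,
# cluster ID) would still need configuring inside the container.
docker run -it \
    -v "$(pwd)":/workspace \
    datathirstltd/dbconnect:latest \
    bash
```

Mounting the project directory rather than copying it in is what makes the "reset your environment" workflow cheap: the container is disposable, the code is not.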

Series - Developing a PySpark Application

This is a series of blog posts demonstrating how PySpark applications can be developed, specifically with Databricks in mind, though the general principles applied here can be used with any Apache Spark setup (not just Databricks). Python has many complexities with regard to paths, packaging and deploying versions. These blog posts capture what we have learnt working with clients to build robust, reliable development processes. The goal is a local development environment that uses Databricks-Connect to execute against a remote cluster, backed by a solid CI and testing framework using Azure DevOps pipelines to support development going forward. [Read More]
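A minimal sketch of the kind of structure those goals push towards (the function and path names here are illustrative, not from the series): keep transformation logic in plain Python functions and pass the Spark session in, so the logic is unit-testable locally while Databricks-Connect supplies the real session at run time.

```python
def transform(rows):
    """Pure-Python transformation: needs no Spark session, so it can be
    unit tested in CI without a cluster."""
    return [dict(r, flagged=r["value"] > 10) for r in rows]


def main(spark, source_path):
    """Thin entry point: reads on the (remote) cluster, then delegates
    to the testable transform step."""
    df = spark.read.parquet(source_path)
    rows = [row.asDict() for row in df.collect()]
    return transform(rows)


if __name__ == "__main__":
    # pyspark is only imported when actually run as a job, so importing
    # this module in tests does not require Spark to be installed.
    from pyspark.sql import SparkSession
    main(SparkSession.builder.getOrCreate(), "/mnt/data/input")
```

Keeping `main` thin and the logic pure is one way to get fast local tests while the expensive I/O stays on the cluster.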

Setup Databricks-Connect on Windows 10

UPDATE June 2020 - How about using a container instead? It’s much easier than installing all this stuff: Prebuilt container Having recently tried to get DBConnect working on a Windows 10 machine, I’ve realised things are not as easy as you might think. These are the steps I have found to set up a new machine and get Databricks-Connect working. Install Java Download and install Java SE Runtime Version 8. It’s very important you get version 8 and nothing later. [Read More]
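Once Java 8 is in place, the remaining install typically boils down to the commands below (the version pin is a placeholder; the full post covers the details, and the databricks-connect version should match your cluster's runtime):

```shell
# Install the client, pinned to match the cluster's Databricks runtime
# (X.Y is a placeholder - substitute your runtime's major.minor version)
pip install -U "databricks-connect==X.Y.*"

# Interactive prompts for workspace URL, token, cluster ID, org ID and port
databricks-connect configure

# Verifies the local setup can reach the cluster and run a simple job
databricks-connect test
```

`databricks-connect test` is worth running after any change: it catches the Java-version and configuration mistakes that otherwise surface as cryptic errors later.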

Databricks-Connect - FINALLY!

I’ve been waiting for this feature for what feels like forever. Databricks-Connect is here! You can download it here. It allows you to develop using an IDE like VSCode, PyCharm or IntelliJ and connect to a remote Databricks cluster to execute your code. This means we can bring much better development experiences and best practices to data engineering workloads. Notebooks are great for exploring data, but they are not enterprise-grade code for ETL jobs. [Read More]

Databricks-Connect Limitations

Databricks-Connect is the feature I’ve been waiting for. It is a complete game changer for developing data pipelines - previously you could develop locally using Spark, but that meant you couldn’t get all the nice Databricks runtime features, like Delta, DBUtils etc. Databricks-Connect allows teams to start developing in a more enterprise fashion than notebooks allow. I’m a big fan. There are, however, a few limitations on what you can do - I’m sure more features will be added in the future, so this list will probably change. [Read More]