dab-draft: a better templating of Databricks Asset Bundles


TL;DR: dab-draft is still at a 0.x stage, but you can find it on GitHub. It is (IMO) a better and less error-prone way to define and maintain large Databricks Asset Bundles configurations.

Two things are true about me.

  1. I’ve always been a big fan of templatizing configuration. I adopted dbx as a beta configuration language for Databricks Jobs almost the day it was announced.
  2. I profoundly dislike YAML. I find the alias/anchor syntax unappealing, the syntax brittle, and the number of me-too language extensions infuriating. Its only redeeming quality (to me) is that well-formed JSON parses as a YAML document.

In the dbx days, I created a Jsonnet wrapper over their configuration language to have a friendlier dev experience. This worked beautifully: our team eliminated inconsistencies in cluster configuration and established conventions as “code”. It made onboarding and re-onboarding easier, as we took less time to read the config and make edits. It felt a little funny to use Jsonnet to augment a tool developed by Databricks, one of the largest users of Jsonnet in the first place.

Databricks has since transitioned to Databricks Asset Bundles (which I’ll refer to from now on as DAB or plainly asset bundles). There are many things to like about asset bundles. They work with every asset Databricks offers, such as workflows, schemas, dashboards, models, etc. They integrate seamlessly with the databricks CLI tool.

Did they learn their lesson and offer a saner configuration language? NoooOOoOoOoo, they doubled down on YAML, even adding fun variable substitutions to keep us on our toes. They even offer the option to create asset bundle templates using a weird JSON-but-with-more-ways-to-substitute-variables format. I can’t even create one without reading the documentation twice, carefully picking an example, and modifying it to my liking. This puts tremendous pressure on the documentation to be clear, easy, and complete.

Unfortunately, the documentation on the website isn’t quite there yet; it doesn’t have complete definitions of all the possible entities and attributes for a typical bundle, which means that you rely on the schema to stay compliant. JSON Schema (or YAMLSchema) is an absolute monster to read and parse, and the one for DABs is full of references and variable expansions, meaning you have to jump around the file to follow a single definition. Navigating between the documentation and the schema has been brutal.

Finally, bundles allow certain configuration elements in multiple places. This means that screwing up your YAML indentation might yield a legal bundle that won’t do what you expect.

Enough.

Why not “draft” it?

In parallel, I stumbled upon a configuration language whose syntax didn’t make me break out in hives. KCL is a CNCF-hosted project which aims to reduce the cognitive load of working with configuration. Color me intrigued.

At its core, KCL is a pre-processor that can generate configuration in JSON, YAML, TOML, and others. The best way I explain KCL to myself is to treat it as prototype-oriented configuration. Prototype-oriented programming implements inheritance through the re-use of objects (which we can call prototypes). This, to me, is well suited to describing asset bundles.

First, you treat any resource, permission, variable, or workspace as an object. In a way, they are objects in that they are tangible “things” with properties (JSON/YAML even uses the term object).

Second, when configuring an asset bundle, you have similar objects that differ in a few properties. You might have a few recommended cluster sizes/types for running jobs, have a set of libraries to install, or have standardized permissions. Using prototype objects, you can define a few base objects, inherit from them, and then change what you need. For example, let’s consider a super-simple task definition where you standardize the libraries to install. By “inheriting” from _task, you can make the changes on top of the prototype. Any object starting with an underscore won’t print in the generated YAML.

_task = tasks.SparkPythonTask {
    task_key = "CHANGE ME"
    libraries = [
        {
            pypi = {package = "scikit_learn"}
        }
        {
            pypi = {package = "pandas"}
        }
    ]
    spark_python_task = {
        python_file = "CHANGE_ME"
    }
}

task = _task | {
    task_key = "This is my task"
    spark_python_task = {
        python_file = "my_entry_point.py"
    }
}

# Returns
task:
  libraries:
  - pypi:
      package: scikit_learn
  - pypi:
      package: pandas
  task_key: This is my task
  spark_python_task:
    python_file: my_entry_point.py

Pretty neat eh? I’m not the only one who thinks this way; for instance, Nix also uses prototypes to configure OS/Unix packages. KCL offers many packages to ease the configuration of various tools, from k8s to argo.

I’m in, gimme gimme gimme!

Enter dab-draft, a way to declare DABs using the KCL configuration language.

A few elements that I personally prefer to the YAML configuration:

  1. I like the bracket-full configuration. It’s harder to mess up the indentation compared to YAML. JSON would also have the same effect.
  2. The object definitions are cleaner and can be more easily re-used without relying on special syntax. The tasks above can all inherit from the _task prototype, with the fields being defined (or overridden).
  3. Although not shown here, imports are explicit, which again reduces the cognitive load of guessing how they are being done in this specific flavour of DAB/YAML.

The trade-off is that the syntax is more verbose, but I think it’s reasonable.
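
To make the second point concrete, here is a minimal sketch of two tasks sharing one prototype. It assumes plain KCL dicts rather than dab-draft’s actual schemas, and the task names and file names (ingest, train, ingest.py, train.py) are made up for illustration.

```kcl
# Hypothetical prototype written as a plain KCL dict, so this sketch
# does not depend on dab-draft's real schema names.
_task = {
    task_key = "CHANGE_ME"
    libraries = [
        {pypi = {package = "pandas"}}
    ]
    spark_python_task = {
        python_file = "CHANGE_ME"
    }
}

# Each task only states what differs from the prototype.
ingest = _task | {
    task_key = "ingest"
    spark_python_task = {python_file = "ingest.py"}
}

train = _task | {
    task_key = "train"
    spark_python_task = {python_file = "train.py"}
}
```

Both tasks keep the library list from _task, so changing it in one place changes it everywhere. The extra lines are the verbosity I mentioned, and they buy a single source of truth.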

As you scale, I think that the advantages become more important. The configuration of resources is something you create once and edit many times. With dab-draft, the package’s source can almost play the role of documentation, with the objects, attributes, and types laid out. Starting from this, you can compose your own smaller configuration language for what you care about and make more efficient changes.
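
One way to start building such a smaller language is to wrap a prototype in a function. The sketch below assumes plain KCL dicts and a hypothetical helper named _python_task; none of these names come from dab-draft itself.

```kcl
# Hypothetical prototype holding the team conventions (illustrative only).
_task = {
    task_key = "CHANGE_ME"
    libraries = [
        {pypi = {package = "pandas"}}
    ]
    spark_python_task = {
        python_file = "CHANGE_ME"
    }
}

# Hypothetical helper: the only knobs exposed are the task key and the
# entry point. Everything else comes from the prototype.
_python_task = lambda key: str, entry_point: str {
    _task | {
        task_key = key
        spark_python_task = {python_file = entry_point}
    }
}

my_tasks = [
    _python_task("ingest", "ingest.py")
    _python_task("train", "train.py")
]
```

The design choice here is to hide the shape of the bundle behind two typed parameters, so a day-to-day edit touches one short line instead of a nested block.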

Why not PyDABS? I briefly tried PyDABS and found the coupling with the Python code hard to figure out. I prefer an explicit split between code and configuration. This might change, but until then, I think dab-draft is the best compromise between expressiveness and maintainability.

I created dab-draft because it scratched a personal itch: configuring Databricks Asset Bundles. If you find yourself rewriting the same thing again and again, give it a try. If you don’t like it, you’re not locked in: the tool will write a correct databricks.yml that you can use without any problems.

Neat!

Jamie Larson