Debugging your tools

A few weeks ago, I got the weirdest error using dbx to deploy a number of jobs to Databricks. dbx is an open-source tool for developing and deploying Databricks workflows. At work, we mainly use it to deploy the workflows that build our data lakehouse, so I use the tool pretty much daily. One morning, it stopped working.

$ dbx deploy --environment=production --debug

[dbx][2023-01-09 10:21:50.591] πŸ› Debugging mode is on
[dbx][2023-01-09 10:21:50.593] πŸ”Ž Deployment file is not provided, searching in the conf directory
[dbx][2023-01-09 10:21:50.594] πŸ’‘ Auto-discovery found deployment file conf/deployment.json
[dbx][2023-01-09 10:21:50.594] πŸ†— Deployment file conf/deployment.json exists and will be used for deployment
[dbx][2023-01-09 10:21:50.594] Starting new deployment for environment production
[dbx][2023-01-09 10:21:50.600] Using profile provided from the project file
[dbx][2023-01-09 10:21:50.601] Found auth config from provider ProfileEnvConfigProvider, verifying it
[dbx][2023-01-09 10:21:50.602] Found auth config from provider ProfileEnvConfigProvider, verification successful
[dbx][2023-01-09 10:21:50.602] Profile DEFAULT will be used for deployment

send: b'POST /api/2.0/workspace/mkdirs HTTP/1.1\r\nHost: {production-server}.cloud.databricks.com[...]
reply: 'HTTP/1.1 200 OK\r\n'
[...]

send: b'GET /api/2.0/mlflow/experiments/get-by-name?experiment_name=[...] HTTP/1.1\r\nHost: {production-server}.cloud.databricks.com[...]
reply: 'HTTP/1.1 403 Forbidden\r\n'
header: x-databricks-reason-phrase: Invalid access token.

(You have to appreciate the emojis in the program logs.)

Based on the responses from dbx, we can see that my token worked for the Databricks API server, but not for the Databricks MLflow server (dbx uses MLflow as a source registry: each workflow has a run within the experiment). This didn't really make sense to me, as I use the same token for both, listed in ~/.databricks.cfg.
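For reference, both tools pick up the DEFAULT profile from that file. Assuming a standard INI-style profile with host and token keys, a couple of lines of Python are enough to double-check which values it resolves to:

# A minimal sketch to confirm which host/token the DEFAULT profile points to.
# (Only a prefix of the token is printed, to avoid leaking it in a terminal.)
import configparser
import os

config = configparser.ConfigParser()
config.read(os.path.expanduser("~/.databricks.cfg"))
print(config["DEFAULT"]["host"])
print(config["DEFAULT"]["token"][:10] + "...")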

I then started iterating on different parameters to see if I could narrow down the problem without having to dig too deep.

  1. Does dbx (both the Databricks API and the MLflow API calls) work in dev? Yes
  2. Does creating a new token for my production environment work? No
  3. Can my other team members deploy without any problem? Yes

Great.

Digging deeper: opening the source code

dbx and MLflow are both open-source projects; I can therefore open, read, and modify the source code. In this case, I wanted to follow the chain of authentication to see where it would break.

First, I narrowed down the exact call that would get me a 403. It turns out any MLflow call would error out, so I just picked an easy one. The stack trace is a little tough to read, but I assumed that the actual problem lay in the calls related to the REST utilities, near the bottom of the trace.

$ mlflow experiments search

Traceback (most recent call last):
  File "/Users/jonathan.rioux/miniforge3/envs/test/bin/mlflow", line 8, in <module>
    sys.exit(cli())
  File "/Users/jonathan.rioux/miniforge3/envs/test/lib/python3.9/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/Users/jonathan.rioux/miniforge3/envs/test/lib/python3.9/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/Users/jonathan.rioux/miniforge3/envs/test/lib/python3.9/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/jonathan.rioux/miniforge3/envs/test/lib/python3.9/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/jonathan.rioux/miniforge3/envs/test/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/jonathan.rioux/miniforge3/envs/test/lib/python3.9/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/Users/jonathan.rioux/miniforge3/envs/test/lib/python3.9/site-packages/mlflow/experiments.py", line 64, in search_experiments
    experiments = mlflow.search_experiments(view_type=view_type)
  File "/Users/jonathan.rioux/miniforge3/envs/test/lib/python3.9/site-packages/mlflow/tracking/fluent.py", line 1162, in search_experiments
    return get_results_from_paginated_fn(
  File "/Users/jonathan.rioux/miniforge3/envs/test/lib/python3.9/site-packages/mlflow/utils/__init__.py", line 248, in get_results_from_paginated_fn
    page_results = paginated_fn(max_results_per_page, next_page_token)
  File "/Users/jonathan.rioux/miniforge3/envs/test/lib/python3.9/site-packages/mlflow/tracking/fluent.py", line 1154, in pagination_wrapper_func
    return MlflowClient().search_experiments(
  File "/Users/jonathan.rioux/miniforge3/envs/test/lib/python3.9/site-packages/mlflow/tracking/client.py", line 380, in search_experiments
    return self._tracking_client.search_experiments(
  File "/Users/jonathan.rioux/miniforge3/envs/test/lib/python3.9/site-packages/mlflow/tracking/_tracking_service/client.py", line 200, in search_experiments
    return self.store.search_experiments(
  File "/Users/jonathan.rioux/miniforge3/envs/test/lib/python3.9/site-packages/mlflow/store/tracking/rest_store.py", line 75, in search_experiments
    response_proto = self._call_endpoint(SearchExperiments, req_body)
  File "/Users/jonathan.rioux/miniforge3/envs/test/lib/python3.9/site-packages/mlflow/store/tracking/rest_store.py", line 56, in _call_endpoint
    return call_endpoint(self.get_host_creds(), endpoint, method, json_body, response_proto)
  File "/Users/jonathan.rioux/miniforge3/envs/test/lib/python3.9/site-packages/mlflow/utils/rest_utils.py", line 281, in call_endpoint
    response = verify_rest_response(response, endpoint)
  File "/Users/jonathan.rioux/miniforge3/envs/test/lib/python3.9/site-packages/mlflow/utils/rest_utils.py", line 207, in verify_rest_response
    raise RestException(json.loads(response.text))
  File "/Users/jonathan.rioux/miniforge3/envs/test/lib/python3.9/site-packages/mlflow/exceptions.py", line 102, in __init__
    super().__init__(message, error_code=ErrorCode.Value(error_code))
  File "/Users/jonathan.rioux/miniforge3/envs/test/lib/python3.9/site-packages/google/protobuf/internal/enum_type_wrapper.py", line 82, in Value
    raise ValueError('Enum {} has no value defined for name {!r}'.format(
ValueError: Enum ErrorCode has no value defined for name '403'

Under the hood, MLflow has an http_request method that calls the API using the popular Python package requests. I then unpacked the API call into the bare minimum request I could make that would still fail.

# This failed with a 403.

import requests

requests.request(
    method="GET",
    url="https://{production-server}.cloud.databricks.com/api/2.0/mlflow/experiments/get-by-name",
    params={
        "experiment_name": "...",
    },
    headers={
        "User-Agent": "mlflow-python-client/2.0.0",
        "Accept-Encoding": "gzip, deflate",
        "Connection": "keep-alive",
        "Authorization": "Basic ENCODED_TOKEN",  # <---- AUTH HERE
        "Accept": "*/*",
        "Content-Type": "text/json; charset=utf-8",
    },
)

With this in hand, I started toying around with the authentication/authorization setup. By moving the authentication out of the header, I was able to make it work.

# This request worked.

import requests

requests.request(
    method="GET",
    url="https://{production-server}.cloud.databricks.com/api/2.0/mlflow/experiments/get-by-name",
    params={
        "experiment_name": "...",
    },
    headers={
        "User-Agent": "mlflow-python-client/2.0.0",
        "Accept-Encoding": "gzip, deflate",
        "Connection": "keep-alive",
        "Accept": "*/*",
        "Content-Type": "text/json; charset=utf-8",
    },
    auth=("token", "MY_TOKEN"),  # <---- AUTH HERE
)

At this point, I was nearly losing my mind (and precious hours of sleep). Via some well-placed print statements, I was able to see that my encoded token would be different when I put it in the header vs. in the auth parameter.
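If you want to see exactly what requests puts in the Authorization header without sending anything over the wire, you can prepare the request and inspect it. A small sketch, with placeholder host and token:

import base64
import requests

# Prepare (but don't send) a request that uses the auth tuple, then look at
# the Authorization header requests builds from it.
request = requests.Request(
    method="GET",
    url="https://example.cloud.databricks.com/api/2.0/mlflow/experiments/get-by-name",
    params={"experiment_name": "..."},
    auth=("token", "MY_TOKEN"),
)
prepared = requests.Session().prepare_request(request)
print(prepared.headers["Authorization"])
# The header is just "Basic " + base64("login:password"):
print("Basic " + base64.b64encode(b"token:MY_TOKEN").decode())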

By bisecting the code up to the point where the token would "change", I ended up at this piece of code, inside requests' Session.prepare_request.

# Set environment's basic authentication if not explicitly set.
auth = request.auth
if self.trust_env and not auth and not self.auth:
    auth = get_netrc_auth(request.url)

Previous me had set up a ~/.netrc file for an unrelated initiative. That file contained only the address of the production instance of Databricks, along with a (now expired) token. When you don't pass the authentication explicitly, requests falls back to that file, and the credentials it finds there override the Authorization header you set by hand, hence the impression of a token change mid-request. It turns out there is an issue that has been open for more than six years on the topic, where users of other requests-dependent libraries faced the same problem.

Bingo.
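If you run into the same thing, here is how to check what requests would silently fall back to for a given URL; turning off trust_env on a Session skips the lookup entirely. The machine name, URL, and token below are placeholders:

# A ~/.netrc entry of this shape is enough to trigger the fallback:
#
#   machine {production-server}.cloud.databricks.com
#   login token
#   password dapiEXPIREDTOKEN
#
import requests
from requests.utils import get_netrc_auth

# What requests would fall back to when no explicit auth is given:
print(get_netrc_auth("https://example.cloud.databricks.com/api/2.0/mlflow/experiments/get-by-name"))

# One blunt way to opt out of the fallback (this also makes requests ignore
# proxy-related environment variables):
session = requests.Session()
session.trust_env = False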

When open-source breaks, you are not alone

While this ended up taking a lot of time, frantic questions, and some mild panic, it also reminded me of one of the key aspects I love about open-source and free software tools: when they break, you are not left to your own devices. Although the error was 100% user-driven, being able to

  • Dig through the code
  • Modify the source (even just to debug and explore)
  • Check current and previous issues

allowed me to solve my own problem. Had Databricks decided to keep dbx and MLflow closed source, I would have depended on them to figure out the issue and prioritize it. As those tools aren't core to Databricks' business, I think that developing them in the open is a fantastic idea. Even if it means that I need to print() my way out of the tar pit.

I don't expect this error to happen to a lot of folks, but I wasn't able to find an answer by searching online. Hopefully, this will save some time for someone else.
