Swimming downstream: maintaining your data posture from AI to BI
(This is part 1 of a series of posts on data posture for data and AI products. The next posts will focus on key elements of this one in greater detail. I welcome feedback and comments!)
Data quality always felt somewhat esoteric to me. I’ve read plenty of reference material, blog posts, and marchitecture, and it always brought me back to the same question: data quality, what do you mean, data quality? Isn’t your data… your data? Talk of data quality always felt academic, when what I ultimately care about is that my data accurately reflects what I’m examining. This is what I call data posture: how confident you are that your data represents what you care about, and that it is in a format that makes it easy to operate on.
Data is the language of the system you operate in.
Me, anytime anyone talks to me about data
Coming from a data science perspective, I’ve found a profound distance between the users of data and their tools, with the chasm growing the higher-level the tools get. Data dictionaries that look amazing but take copious amounts of training to get proficient in, data engineering tools that create a beautiful assortment of pipelines, and BI tools that somehow turn making a simple chart into a multi-step process. Those tools reinforce the flawed notion that you need distinct skills and teams for each step of the data value chain. Then, when you realize that the data you have isn’t “good enough” for the work you want it to do, you painfully swim upstream from half-baked dashboards to misunderstood features before hitting a “well, nobody told us you needed THAT data specifically” kind of conversation. Infuriating.
In practice, I’ve come to see this as almost an anti-pattern. By splitting the data value chain into stages owned by separate teams, you end up diluting the knowledge of what makes that data useful in the first place. You lose context, control, and precious definitions that you end up recreating haphazardly later in the process. An insurance company ends up with conflicting definitions for a claim, a hospital confuses patient (table A) and patient (table B), a financial institution’s executive becomes annoyed because his teams are arguing about account, account, account, account, or account. An executive screams in a random meeting because he’s watching his business grow and shrink on the same dashboard. Data teams complain about data quality so much that it practically becomes a meme in the organization.
I’m not the only one to think about this problem. Gable made this their primary pitch, advertising a shift-left data platform that reunites data teams (the primary consumers of data) and developers (usually, the primary data producers). I have yet to use their platform, but I really agree with their premise: if you don’t build it, it’s hard to take full responsibility for it. Why not make the data producers responsible for their data?
Credit: The profound programmer
When I was running a data team for a software startup, my biggest productivity gain for improving data quality and usage was to empower developers to be explicit about the data they produce. Starting from their technical documentation, we built with them a complete map of the data the application was producing and mapped it into Unity Catalog, creating documentation everyone could refer to easily. On top of that, we created a concept map (I believe Zhamak Dehghani refers to these shared concepts as polysemes, but, startup vibes, we had the luxury of not having too many concepts shared by several teams) that explained the key concepts of the application (events, actions, observations, and checkpoints) and how they were reflected in the data.
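To make the idea concrete, a concept map like the one described above can be as simple as a small, versioned data structure pairing each concept with its definition, owner, and the catalog tables where it materializes. The sketch below is hypothetical (the `ConceptCard` class, table names, and owner names are illustrative, not our actual schema), shown in Python even though the real documentation lived in Unity Catalog comments:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConceptCard:
    """One card per core application concept."""
    name: str
    definition: str
    tables: tuple  # Unity Catalog tables where the concept shows up
    owner: str     # team accountable for keeping the definition current


# Hypothetical cards; the real map covered events, actions,
# observations, and checkpoints.
CONCEPT_MAP = {
    "event": ConceptCard(
        "event",
        "Something the application recorded happening at a point in time.",
        ("app.telemetry.events",),
        "platform-team",
    ),
    "checkpoint": ConceptCard(
        "checkpoint",
        "A known-good state the application can be restored to.",
        ("app.telemetry.checkpoints",),
        "platform-team",
    ),
}


def table_comment(concept: str) -> str:
    """Render a card as the comment attached to its catalog table(s),
    so the documentation lives right next to the data."""
    card = CONCEPT_MAP[concept]
    return f"{card.name}: {card.definition} (owner: {card.owner})"
```

The payoff is that every table comment is generated from the same card, so a definition can only drift in one place.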
In practice, this was articulated and facilitated by leveraging Databricks’ ecosystem. More specifically, we created a powerful, flexible, and dare I say fun process for developing downstream data products that would be consumed through BI tools or data science models.
i) For each event, action, observation, and checkpoint type, we created cards that explained their significance and associated data. Each deviation from this data model was tracked using the change data feed. Within one week of implementing this, we had completely eliminated schema anomalies.
ii) We created a DSL using Databricks’ SQL functions so data engineers no longer had to remember how to extract specific, often-used facets of the data. You want to know how many times an event fired during a specific time window? how_many_times_did_this_event_fire(...)
iii) We used that DSL to establish great confidence in the data in record time by creating DQ tables, views, dashboards, and alerts.
iv) Finally, we tested the heck out of those data products and came up with a zero-bug guarantee that stood for my entire tenure. We even made a badge for people who found a concept/data bug!
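As a sketch of what a DSL helper like the one in (ii) can look like: the real version was a Databricks SQL user-defined function, so this Python equivalent (and the `how_many_times_did_this_event_fire` signature, with its event log of dicts) is illustrative only:

```python
from datetime import datetime


def how_many_times_did_this_event_fire(events, event_type, start, end):
    """Count events of a given type within the half-open window [start, end).

    `events` is an iterable of dicts with 'type' and 'ts' keys; in the
    real DSL this was a SQL function over the event tables.
    """
    return sum(
        1
        for e in events
        if e["type"] == event_type and start <= e["ts"] < end
    )


log = [
    {"type": "login", "ts": datetime(2024, 1, 1, 9)},
    {"type": "login", "ts": datetime(2024, 1, 1, 17)},
    {"type": "logout", "ts": datetime(2024, 1, 2, 9)},
]

# Both logins fall on Jan 1, so the count for that day is 2.
logins = how_many_times_did_this_event_fire(
    log, "login", datetime(2024, 1, 1), datetime(2024, 1, 2)
)
```

The point of wrapping this behind one well-named function is that nobody has to re-derive the window semantics (inclusive start, exclusive end) every time they query.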
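The DQ views and alerts in (iii) ultimately boil down to rules evaluated over the data. A minimal, hypothetical rule runner (the `null_rate` metric and the `{column: max_null_rate}` rule shape are assumptions for illustration; ours fed Databricks tables and dashboards rather than returning a dict):

```python
def null_rate(rows, column):
    """Fraction of rows where `column` is missing or None."""
    if not rows:
        return 0.0
    return sum(1 for r in rows if r.get(column) is None) / len(rows)


def run_dq_checks(rows, rules):
    """Evaluate {column: max_null_rate} rules and return the columns
    that breach their threshold, with the observed rate."""
    return {
        col: rate
        for col, threshold in rules.items()
        if (rate := null_rate(rows, col)) > threshold
    }


rows = [{"id": 1, "amount": 10.0}, {"id": 2, "amount": None}]
# 'amount' has a null rate of 0.5, above its 0.25 threshold; 'id' passes.
failures = run_dq_checks(rows, {"id": 0.0, "amount": 0.25})
```

Alerting then reduces to "page someone when `failures` is non-empty", which is exactly the kind of check a dashboard or scheduled query can carry.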
The biggest challenge at the time ended up being picking a BI stack that felt natural to the team. Most people picked up the SQL DSL we created in a few hours and wrote ad-hoc queries for their customer, project, or interest.
If I were to do it all over again, I think it would be even easier. Databricks’ ecosystem is more mature, with AI throughout the value chain:
- Avoiding blank-page syndrome when creating new tables through suggested table descriptions and column comments. No excuse for leaving a table without a good description.
- A code assistant to write and correct SQL code, which makes the stack easier than ever to adopt.
- AI/BI and Genie workspaces to go from question to insights more quickly. Ask questions, get answers, get rewarded for tending to your data.
- Native Unity Catalog connections to powerful BI tools such as Sigma, Power BI, and Tableau.
When it comes to data relevance, I think there is tremendous value in swimming in the same direction the data is produced. Bring power and accountability to where the data is being produced, and front-load as much of what makes the data usable (documentation, concepts) as close to inception as possible. Then swim downstream and reap the benefits, knowing that your data is useful and used. Data processes are only as good as their adoption, so why waste cycles swimming upstream when you can flow forward?