As you may know, I was recently on vacation in Switzerland. It was amazing! Switzerland is a stunning country, effortlessly blending jaw-dropping natural landscapes and classical architecture with a modern transportation system.
Here are a couple of photos from the trip.
This one is from an afternoon in Montreux, a town on Lake Geneva near the border with France.
This one is from earlier that day in Interlaken, which is built on a narrow stretch of valley between the emerald-colored waters of Lake Thun and Lake Brienz.
Anyways, back to data matters.
When I came back to work, I was pleasantly surprised to find that my manager (who is also an engineer but had never written dbt code before) had developed 5 new models without needing my input. I did the code review and they were perfect. While our data system is still a work in progress, this was a clear indication to me that what we’ve built is an architecture that’s simple to understand and easy to maintain and expand.
But what even is data architecture?
I’m not going to bother trying to define it. It’s too broad of a topic and everyone has an opinion. Since I’m a pragmatist, in my mind it’s a logical system for organizing data artifacts, plus the architecture of the platform around them: DataOps with its many facets of CI/CD, automated testing, decisions on tools, etc.
If that’s not broad enough I don’t know what is. 😀
Now I’m sure you’ve heard about things like medallion architecture and data mesh, both of which attempt to apply some level of logical organization to data artifacts. I’m still diving into data mesh, but I’ve got plenty of experience working with medallion architecture to explain the basics.
The idea (for the uninitiated) is that you load your data into a data lake using the Extract-Load-Transform (ELT) paradigm and transform it in 3 layers. Bronze contains raw data from operational systems, silver contains cleaned and augmented data, and gold contains business-level aggregates often used in BI tools like Tableau, Looker, Power BI, etc.
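To make that concrete, here’s a minimal sketch of what the silver and gold layers could look like as dbt models. The source name raw_shop, the orders table and every column here are hypothetical, just to illustrate the flow from raw data to a business-level aggregate.

```sql
-- models/silver/stg_orders.sql
-- Silver: a cleaned, typed version of a raw bronze table.
select
    cast(order_id as integer)      as order_id,
    cast(customer_id as integer)   as customer_id,
    lower(trim(status))            as order_status,
    cast(ordered_at as timestamp)  as ordered_at,
    cast(amount as numeric(18, 2)) as order_amount
from {{ source('raw_shop', 'orders') }}  -- bronze source declared in sources.yml
where order_id is not null               -- drop rows we can't key on

-- models/gold/fct_daily_revenue.sql
-- Gold: a business-level aggregate ready for a BI tool.
select
    date_trunc('day', ordered_at) as order_date,
    count(*)                      as order_count,
    sum(order_amount)             as revenue
from {{ ref('stg_orders') }}
group by 1
```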
This is a good start, but it’s not enough.
You need more architecture than you think
I’ve seen many data stacks with medallion architecture and a good amount of DataOps (analytics as code, CI/CD, data quality tests, monitoring and alerting), and yet there are thousands of tables everywhere, queries are slow, dashboards are stale, metrics are not reliable and costs are through the roof.
What’s the problem here?
Even though they organize data artifacts based on purpose, the bronze, silver and gold layers don’t provide enough of a logical structure for organizing things further. You’re supposed to somehow know what constitutes a “business aggregate” vs an “augmented table”. Without this structure you get proliferation. Eventually you start to see queries joining tables across the layers without understanding upstream dependencies, and then you’ll be in a world of hurt.
For example, an analyst might build a dbt model that joins a table that goes through 5 intermediary models and takes 5 hours to build with another table that takes 3 hours, schedule the refresh to run hourly, build a Tableau dashboard on top of it, schedule the Tableau extract to also run hourly, and then wonder why the dashboard is stale.
Now another analyst comes along, sees that new table and says “Oooh, it has everything I want, I just need to add this one column”, builds a new model on top of it, schedules an hourly refresh and then wonders why their metrics don’t line up. Meanwhile you’re burning cash every time these queries run, and by the time the CFO is hounding you to decrease costs, you have an unmanageable mess on your hands.
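If it helps to picture it, here’s a rough sketch of that anti-pattern as a dbt model. The model and table names are made up, and the hourly tag assumes an orchestrator that selects models by tag, which is a common convention rather than anything built into dbt.

```sql
-- models/gold/rpt_sales_dashboard.sql
-- Anti-pattern: joins two heavy upstream tables and is scheduled hourly,
-- even though its inputs only finish rebuilding every few hours.
{{ config(materialized='table', tags=['hourly']) }}

select
    o.order_id,
    o.order_amount,
    c.lifetime_value
from {{ ref('fct_orders_enriched') }} as o         -- ~5 hours to build upstream
left join {{ ref('dim_customer_lifetime') }} as c  -- ~3 hours to build upstream
    on o.customer_id = c.customer_id
```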
You need a logical system for organizing your silver and gold layers that is followed religiously. For example, the models that build your metrics should all follow the same conventions, ideally driven by macros so there’s no deviation. Models designed for ad-hoc requests need very specific rules and paths to be promoted to production.
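As a sketch of what “driven by macros” could mean in practice: a single macro owns the shape of every metric model, and each metric model becomes a one-line call to it. daily_metric is a hypothetical macro of my own, not part of dbt or any package, and the model names are illustrative.

```sql
-- macros/daily_metric.sql
-- One macro owns the grain, naming and aggregation of every metric model,
-- so individual models can't drift from the convention.
{% macro daily_metric(source_model, date_column, measure, agg='sum') %}
    select
        date_trunc('day', {{ date_column }}) as metric_date,
        '{{ measure }}'                      as metric_name,
        {{ agg }}({{ measure }})             as metric_value
    from {{ ref(source_model) }}
    group by 1, 2
{% endmacro %}

-- models/gold/metric_daily_revenue.sql
-- A metric model is then just a call to the macro.
{{ daily_metric('stg_orders', 'ordered_at', 'order_amount') }}
```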
I believe all these things are part of the data architecture. But in order to ensure success, it needs to be both well-designed (logical) and regularly reinforced. Data observability tools can help you find problems, but they won’t help you fix them. This leads me to my next point.
Statistical Control vs Intellectual Control
Data monitoring tools, automated tests, etc. belong in a category known as “statistical control”. But what’s missing is “intellectual control”. This is an idea from George Fairbanks, and if you want to learn more, check out his talk on the topic.
Intellectual control is the idea that you should be able to not only understand and explain your system’s architecture but also make sure it doesn’t start to devolve. Tests and monitoring are NOT enough. You need to make sure every data artifact you build has a clear purpose and fits nicely into the overall design.
To do that you need to dramatically improve your review process. Don’t limit the review to just code correctness. You must ensure that other guidelines are also followed, for example:
Make sure the code is DRY and modular
Make sure the query is performant
Check that the new artifact doesn’t have unnecessary dependencies (see the sketch after this list)
Make sure the new artifact has a clear purpose and place in your design
Ensure the refresh schedule makes sense
etc.
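Some of these checks can even be partially automated. As one example, here’s a sketch of a dbt macro that flags models with an unusually high number of direct parents, using dbt’s graph context variable; the macro itself and the threshold are my own invention, not a standard check, so treat it as a starting point for your own reviewer aids.

```sql
-- macros/audit_model_dependencies.sql
-- Sketch of a reviewer aid: list models whose number of direct parents
-- exceeds a threshold. Run with: dbt run-operation audit_model_dependencies
{% macro audit_model_dependencies(max_parents=5) %}
    {% if execute %}
        {% for node in graph.nodes.values() if node.resource_type == 'model' %}
            {% set parents = node.depends_on.nodes %}
            {% if (parents | length) > max_parents %}
                {% do log(node.name ~ ' has ' ~ (parents | length) ~ ' direct parents', info=true) %}
            {% endif %}
        {% endfor %}
    {% endif %}
{% endmacro %}
```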
This is a topic I’ve only recently started to explore and I plan to dive into more deeply in future issues.
Until then I hope you enjoyed this issue. As always let me know by liking, commenting or replying.