Modular Dimensional Data Modeling
How to apply system design principles to Kimball dimensional modeling
I’ve talked in the past about modeling data using activities, specifically the Activity Schema. You can read that post here. While that’s a unique and interesting way to model data, it’s not something many people are familiar with.
What if we could get the benefits of the activity schema while modeling in a familiar methodology such as star schemas?
Interested? Let’s dive in.
The basics of Dimensional Modeling
A dimensional model is a physical modeling technique that involves having a Fact table in the “center” and several Dimensional tables “circling” the fact. This methodology was introduced by Ralph Kimball in his book The Data Warehouse Toolkit and it has become one of the most popular methods for data warehouse modeling.
This way of modeling data is often used as the final deliverable product before a BI tool such as Tableau, PowerBI, Looker, etc. takes over. In fact, many of these BI tools are optimized for this type of modeling.
There are two types of dimensional models:
One Big Table (OBT) where the facts and dimensions are put in a single table
Star Schema where the facts and dimensions are loaded separately and joined at query time
One Big Table (OBT) became quite popular with the advent of columnar databases because you no longer needed to store dimensions in a separate table to optimize storage.
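To make the OBT shape concrete, here’s a minimal sketch using SQLite as a stand-in for a warehouse. The `sales_obt` table and all of its columns are invented for illustration; the point is that facts and dimension attributes live side by side in one wide table.

```python
import sqlite3

# In-memory database standing in for a columnar warehouse (illustrative only).
con = sqlite3.connect(":memory:")
con.executescript("""
    -- One Big Table: facts and dimensions denormalized into a single table
    CREATE TABLE sales_obt (
        order_id      INTEGER,
        order_date    TEXT,
        revenue       REAL,
        -- dimension attributes stored inline, repeated on every row
        customer_name TEXT,
        product_name  TEXT,
        region        TEXT
    );
    INSERT INTO sales_obt VALUES
        (1, '2024-01-05', 120.0, 'Acme Co',  'Widget', 'EMEA'),
        (2, '2024-01-06',  80.0, 'Acme Co',  'Gadget', 'EMEA'),
        (3, '2024-01-06', 200.0, 'Beta Inc', 'Widget', 'AMER');
""")

# No joins needed -- a BI tool (or a SQL newbie) can aggregate directly.
rows = con.execute("""
    SELECT region, SUM(revenue) AS total_revenue
    FROM sales_obt
    GROUP BY region
    ORDER BY region
""").fetchall()
```

The lack of joins is exactly why this shape is so friendly for end-user reporting.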
Benefits
Combines facts and dimensions on the same table which makes it super easy to build and load data into
Very easy to understand and query with SQL or with a BI tool. It’s perfect for newbies to SQL (which often includes PMs and executives)
Best suited for end-user reporting / data marts
Drawbacks
Cannot handle Slowly Changing Dimensions (SCDs). Only replacement is possible, which means historical data is lost
Cannot easily handle late arriving facts. You often have to build it incrementally with long look-back time periods to have a complete picture
Loading performance suffers when the table is large, especially with late arriving facts or changing dimensions
Schema changes (new columns) require full rebuilds
Pretty rigid for broad analysis. Not designed to be joined with other OBTs (in fact, this is a BAD idea); single purpose only
Star Schemas are more forgiving of the above drawbacks and are the tried and true way of delivering what are known as “data marts.” By separating the facts and dimensions, you can load them independently and thus more easily handle changes and historical reporting.
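Here’s a minimal star schema sketch, again using SQLite as a stand-in warehouse with invented table and column names: dimensions are loaded separately and joined back to the fact at query time via surrogate keys.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Star schema: one fact table plus separately loaded dimensions
    CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, customer_name TEXT);
    CREATE TABLE dim_product  (product_key  INTEGER PRIMARY KEY, product_name TEXT);
    CREATE TABLE fct_sales (
        order_id INTEGER, order_date TEXT,
        customer_key INTEGER, product_key INTEGER, revenue REAL
    );
    INSERT INTO dim_customer VALUES (1, 'Acme Co'), (2, 'Beta Inc');
    INSERT INTO dim_product  VALUES (10, 'Widget'), (11, 'Gadget');
    INSERT INTO fct_sales VALUES
        (1, '2024-01-05', 1, 10, 120.0),
        (2, '2024-01-06', 1, 11,  80.0),
        (3, '2024-01-06', 2, 10, 200.0);
""")

# Dimensions join to the fact at query time via surrogate keys.
rows = con.execute("""
    SELECT c.customer_name, SUM(f.revenue) AS total_revenue
    FROM fct_sales f
    JOIN dim_customer c ON c.customer_key = f.customer_key
    GROUP BY c.customer_name
    ORDER BY c.customer_name
""").fetchall()
```

Because `dim_customer` is its own table, a changed customer attribute only touches the dimension load, not the fact.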
Benefits
Separates facts and dimensions on different tables which makes loading faster and easier. (Fun fact: I once improved the loading performance of an OBT from 4hr to 1hr by turning it into a star schema and separating the dimensions)
Easy to understand and query with SQL or explore with a BI tool, though you have to know the schema and which dimensions can be joined with it
Handles SCDs and history much better. New dimensions can easily be added as new columns in the fact + new dim table
Best suited for end-user reporting / data marts
Drawbacks
Building can be tricky given that you have to perform key lookups for the dimensions by joining on the dim value
Adding new facts can potentially break historical data
Still pretty rigid. Certain types of analysis are nearly impossible (cross system, funnels, event streams)
Fact tables aren’t supposed to be joined with each other, but analysts often do it anyway, leading to incorrect numbers
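The key-lookup drawback mentioned above looks something like this in practice. This is a sketch with invented names, using SQLite as a stand-in: the staging rows carry the natural value, so the load has to join on it to resolve the surrogate key, and any value missing from the dimension surfaces as a NULL key.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, customer_name TEXT);
    INSERT INTO dim_customer VALUES (1, 'Acme Co'), (2, 'Beta Inc');
    -- Raw source rows carry the natural value, not the surrogate key
    CREATE TABLE stg_orders (order_id INTEGER, customer_name TEXT, revenue REAL);
    INSERT INTO stg_orders VALUES (1, 'Acme Co', 120.0), (2, 'Gamma LLC', 50.0);
    CREATE TABLE fct_sales (order_id INTEGER, customer_key INTEGER, revenue REAL);
""")

# The lookup join: resolve each natural value to its surrogate key.
# A LEFT JOIN keeps rows whose dimension member is missing (NULL key),
# which is exactly where star-schema loads get tricky.
con.execute("""
    INSERT INTO fct_sales
    SELECT s.order_id, c.customer_key, s.revenue
    FROM stg_orders s
    LEFT JOIN dim_customer c ON c.customer_name = s.customer_name
""")
unresolved = con.execute(
    "SELECT COUNT(*) FROM fct_sales WHERE customer_key IS NULL"
).fetchone()[0]
```

Handling those unresolved rows (late-arriving dimensions, “unknown member” placeholders) is where the loading complexity comes from.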
Modularizing Dimensional Modeling
With that out of the way, let’s talk about another one of my favorite topics: modularity. Software engineers have this concept drilled into their heads throughout their careers, but data analysts, and especially analytics engineers, often have no idea what it is.
The key to a modular system design is the ability to separate a large system into independent, composable modules. The idea behind it is to isolate these modules in such a way that each module has only one reason to change. This is known as the Single Responsibility Principle (SRP).
So how can we apply this to dimensional models? In order to understand this we need to learn a little bit about fact tables.
Ralph Kimball’s dimensional modeling defines three types of fact tables:
Transaction fact tables
Periodic snapshot tables
Accumulating snapshot tables
Transactional facts are the most common type of fact table. Each row represents a key business event. Now, if you read the article about SOMA above (here’s that link again), you know that the key modeling concept in SOMA is a business activity or event (e.g. Account Executive Opens New Business Opportunity).
Activity fact tables
So, if we think of activities as “mini fact tables” with the right shared dimensional tables in place, we can have a “constellation” of facts. In fact, this technique is called Fact Constellation Modeling. It’s considered an advanced / complex modeling technique, but SOMA makes it easy.
Here’s a diagram of what that looks like:
As you can see, each fact table is independent and represents a single, key business event. The activities are designed to be immutable, which makes them very composable. If you wanted to create a star schema or OBT, you simply join the desired fact tables to the appropriate dims.
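Composing a one-off star schema or OBT from activity fact tables could look like the following sketch. The activity tables (`fct_signed_up`, `fct_placed_order`) and the shared dimension are invented names, with SQLite standing in for the warehouse.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Two independent activity fact tables sharing a customer dimension
    CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, customer_name TEXT);
    CREATE TABLE fct_signed_up (customer_id INTEGER, occurred_at TEXT);
    CREATE TABLE fct_placed_order (customer_id INTEGER, occurred_at TEXT, revenue REAL);
    INSERT INTO dim_customer VALUES (1, 'Acme Co'), (2, 'Beta Inc');
    INSERT INTO fct_signed_up VALUES (1, '2024-01-01'), (2, '2024-01-02');
    INSERT INTO fct_placed_order VALUES (1, '2024-01-05', 120.0);
""")

# Compose an OBT on demand by joining only the activities you need
# to the shared dimension.
rows = con.execute("""
    SELECT c.customer_name, s.occurred_at AS signed_up_at, o.revenue
    FROM fct_signed_up s
    JOIN dim_customer c ON c.customer_id = s.customer_id
    LEFT JOIN fct_placed_order o ON o.customer_id = s.customer_id
    ORDER BY c.customer_name
""").fetchall()
```

Because each activity table is immutable and single-purpose, this join is safe to rebuild at any time.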
And if you wanted to create a funnel using Narrator’s temporal joins, you can easily find the corresponding fact tables, align their schemas (perhaps by using a dbt macro), and then join them using, for example, First After to chain them into a funnel.
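First After isn’t plain SQL, but its semantics can be approximated with a standard join, as in this sketch (invented table names, SQLite stand-in): for each “visited” event, find the earliest “purchased” event that happens after it for the same customer.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE fct_visited   (customer_id INTEGER, occurred_at TEXT);
    CREATE TABLE fct_purchased (customer_id INTEGER, occurred_at TEXT);
    INSERT INTO fct_visited VALUES (1, '2024-01-01'), (2, '2024-01-01');
    INSERT INTO fct_purchased VALUES (1, '2024-01-03'), (1, '2024-01-09');
""")

# "First After" approximation: for each visit, the earliest purchase
# with a strictly later timestamp. Customers who never purchased get NULL.
rows = con.execute("""
    SELECT v.customer_id,
           v.occurred_at AS visited_at,
           MIN(p.occurred_at) AS first_purchase_after
    FROM fct_visited v
    LEFT JOIN fct_purchased p
        ON p.customer_id = v.customer_id
       AND p.occurred_at > v.occurred_at
    GROUP BY v.customer_id, v.occurred_at
    ORDER BY v.customer_id
""").fetchall()
```

Chaining another step (e.g. purchase → renewal) is just another join of the same shape, which is what makes the funnel composable.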
Loading these is very straightforward. Here’s what the code could look like. You can load them incrementally, or if you’re following the functional data engineering approach, you don’t need to bother (more on that below).
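A sketch of an insert-only incremental load, with SQLite standing in for the warehouse and all table names invented: because activities are immutable, each run only appends events newer than the current high-water mark.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE stg_orders (order_id INTEGER, customer_id INTEGER, occurred_at TEXT);
    CREATE TABLE fct_placed_order (order_id INTEGER, customer_id INTEGER, occurred_at TEXT);
    INSERT INTO stg_orders VALUES
        (1, 1, '2024-01-01'), (2, 2, '2024-01-02'), (3, 1, '2024-01-03');
    -- Suppose a previous run already loaded everything up to Jan 2
    INSERT INTO fct_placed_order VALUES (1, 1, '2024-01-01'), (2, 2, '2024-01-02');
""")

# Incremental, insert-only load: append only events past the high-water mark.
con.execute("""
    INSERT INTO fct_placed_order
    SELECT order_id, customer_id, occurred_at
    FROM stg_orders
    WHERE occurred_at > (SELECT COALESCE(MAX(occurred_at), '') FROM fct_placed_order)
""")
loaded = con.execute("SELECT COUNT(*) FROM fct_placed_order").fetchone()[0]
```

In dbt this is the familiar incremental-model pattern; there are no updates or deletes to reason about, which is what makes the load so simple.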
Benefits
Separate important business activities into independent, immutable fact tables
Easily deal with late arriving facts, just put them in a separate fact table and load it independently
Highly composable. Create any star schema or OBT table you want by joining the appropriate fact tables either via keys or through the activity timestamp
Easy to load. (see the code sample above)
You can actually join fact tables with each other as needed. This makes it easy to integrate multiple systems. As long as you have a shared entity id (like customer id, email, etc) you can easily join them
Easy to debug issues. Since there are only three layers (source data, activity facts, metric facts) you can easily trace problems
Drawbacks
If you follow this method as opposed to Narrator’s single activity schema, you end up with potentially hundreds of these fact tables. The number of key activities in a business is actually not that large, so perhaps this is not a problem
You still have to deal with SCDs unless you follow a functional data modeling approach (as described by Maxime Beauchemin here) which stores snapshots of all the tables daily
Metrics fact tables
In the diagram above you can also see a few metric fact tables. These are periodic snapshot tables, which means we store one snapshot every day of a lightly aggregated table. You can see the code below.
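A daily metric snapshot load could look like the following sketch (SQLite stand-in, invented names): each run appends one lightly aggregated row per day to the metric fact table.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE fct_placed_order (customer_id INTEGER, occurred_at TEXT, revenue REAL);
    INSERT INTO fct_placed_order VALUES
        (1, '2024-01-05', 120.0), (2, '2024-01-05', 80.0), (1, '2024-01-06', 50.0);
    -- Metric fact table: one lightly aggregated row per day
    CREATE TABLE fct_daily_revenue (metric_date TEXT, order_count INTEGER, revenue REAL);
""")

# Each day's run appends that day's aggregate -- a snapshot-style metric fact.
con.execute("""
    INSERT INTO fct_daily_revenue
    SELECT occurred_at, COUNT(*), SUM(revenue)
    FROM fct_placed_order
    GROUP BY occurred_at
""")
rows = con.execute(
    "SELECT * FROM fct_daily_revenue ORDER BY metric_date"
).fetchall()
```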
This is the method we recommend for physical modeling of data in SOMA. The key benefit here is that you can create atomic metrics which can later be used to generate composite metrics through formulas.
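Deriving a composite metric from atomic ones can be as simple as a formula over two metric fact tables at the same grain; here’s a sketch with invented names (SQLite stand-in), computing a conversion rate from daily orders and visits.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Two atomic metric fact tables at the same daily grain
    CREATE TABLE fct_daily_orders (metric_date TEXT, order_count INTEGER);
    CREATE TABLE fct_daily_visits (metric_date TEXT, visit_count INTEGER);
    INSERT INTO fct_daily_orders VALUES ('2024-01-05', 50);
    INSERT INTO fct_daily_visits VALUES ('2024-01-05', 1000);
""")

# Composite metric: a formula over atomic metrics, computed at query time.
rate = con.execute("""
    SELECT o.order_count * 1.0 / v.visit_count AS conversion_rate
    FROM fct_daily_orders o
    JOIN fct_daily_visits v USING (metric_date)
""").fetchone()[0]
```

The atomic tables stay simple and reusable; every composite metric is just another query over them.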
This way of calculating metrics is in pure SQL, and it does have a key drawback: it separates the aggregations from the facts, making it harder to debug issues and drill all the way down to a single fact. Many people believe this should be handled by a metrics / semantic layer, so I won’t go into too much detail here.
I’ll write more about this modeling technique in future issues.
Until then.