One of my most recent newsletters introduced you to the concept of “modular dimensional modeling” as a way to simplify and advance the ideas of Ralf Kimball. If you didn’t get a chance to read it, grab it here and bookmark it for later.
Then, yesterday I ran a poll for my LinkedIn followers to see what kind of content they’d like more from me, totally expecting it to weigh heavily into SQL. The results both surprised and delighted me.
As you know data modeling is by far my favorite topic, outside of SQL, so in this article I’m going to delve into more detail of how we’re modeling data for SOMA.
At its core the SOMA model is made up of two concepts, Activities and Entities. I’ve talked about Activities before but today we’ll get into a precise definitions. I’ll also talk about Entities and how they connect with Activities and Metrics.
So get ready, strap in and let’s get to it.
Activities, Entities and Top Down Modeling
The key to SOMA data modeling are metrics. Metrics are how we measure the operational performance of business processes. At their core, these processes are made up of individual activities. For example a sales process could contain the following activities:
SDR Accepts Prospect
SDR Emails Prospect
SDR Calls Prospect
SDR Qualifies Lead
SDR Books Meeting
SOMA’s key tenet is that companies shouldn’t be innovating on metrics. Companies that share a business model (e.g. B2B SaaS) are largely the same in how they operate. This enables us to define all the metrics upfront.
SOMA B2B SaaS for example defines 450+ metrics. And since metrics depend on activities, we also define all the key activities upfront. B2B SaaS has about 140+ key activities.
So what, right? Well with the metrics and activities defined, we can do something never before possible in data modeling. We can model data top down as opposed to the usual bottom up.
What does that mean? Well here’s how bottom-up modeling is usually done:
Suppose you want to model a customer entity.
- You ask marketing and they say “someone who has started a trial”
- You ask sales and they say: “someone who has signed a contract”
- You ask finance and they say: “someone who has paid”
- You ask accounting and they say: “someone whose transaction has cleared”
So what do data engineers do? They start with raw data, clean it (handling nulls, duplicates, type conversions, etc) and eventually end up with a random assortment of tables that are later joined into data marts.
These data marts are driven by reporting demands, and since that’s unknown and constantly changing, you end up with thousands of tables in your cloud data platform and a hefty monthly bill.
Top-down modeling inverts this approach. Rather than starting with raw data and trying to figure out what tables to build to allow for the broadest reporting capabilities and “always changing” metrics, SOMA is prescriptive.
Once you accept SOMA’s metrics definition, you also accept the activity and entity definitions. You could customize them but why would you? Now data modeling becomes easy and straightforward.
Here’s the architecture:
Raw data is mapped to standard Entities (like Customer, Prospect, Employee, etc.) SOMA B2B SaaS defines about 50 or so. These Entities generate Activities which later are turned into Metrics through a Semantic/Metrics Layer and consumed via BI tools.
Here some examples of SOMA entity definitions:
Contacts: An identifiable individual that has provided their information and can be reached out to for sales or marketing purposes
Companies: A unique entity that has or can enter into independent commercial agreements.
Customers: A Contact or Company that has completed at least one historical purchase.
Implementation
Ok let’s get into specifics. How do we implement Entities and Activities?
Entities
Each SOMA Entity is represented by two physical tables:
Entity History (e.g. customers_history)
Current Entity (e.g. current_customers)
Entity History tables are the primary mapping interface for SOMA implementers. These tables are immutable ledgers that track the state of Entities and are structured as follows:
Immutability is very important to SOMA because it allows full historical representation. This means we have to deal with SCDs (slowly changing dimensions)
We recommend taking full snapshots of the table daily (as recommended by Maxime Beauchemin in his article on functional data engineering) but the same can also be achieved through SCD Type 2.
For example the Opportunity entity has the following attributes:
Opportunity OpportunityID
Opportunity Stage
Opportunity EstimatedValue
Opportunity ExpectedCloseDate
Opportunity ProbabilityToClose
Opportunity OpenedBy
Opportunity OwnedBy
Opportunity PrimaryContactID
Opportunity CompanyID
Opportunity Type
Opportunity ActualCloseDate
Opportunity MRRPotential
Opportunity GeneratingOrg
Opportunity CampaignID
When any of these attributes change (for example the Stage becomes “Qualified”) it generates a corresponding activity (AE Qualifies New Business Opportunity)
Entities are mapped directly to raw data. So if you’re using Salesforce for example, you’d map the SOMA Opportunity entity to the SF Opportunity table, but if you use Hubspot, you’d have to map SOMA’s Opportunity to HS Deals table.
Activities
Activities are generated based on changes to Entities. In SOMA, Activities and Metrics are automatically generated once the appropriate Entities are mapped to the raw data. This makes the job of the implementer very easy.
Activities are defined physically as immutable logs using the following schema:
Activities can join to both Entity History tables as well as Entity tables. Joins to the Entity History tables can be made by using the entity_history_id values in the Actor and Object columns of the Activity.
Activities can also be joined with each other via temporal joins as defined by the Activity Schema here using the person_id/company_id as the entity_id.
Facts and Dimensions, Data Vault, etc.
One of the many benefits of modeling data this way is the compatibility with many popular DW modeling techniques.
If you’re a fan of Kimball dimensional modeling / star schemas, you can think of Entities as Dimensions and Activities as Facts. Metrics can then be expressed as star joins between the two.
If you’re a fan of Data Vault, you can build Entities and Activities directly from Hubs and Satellites where the Entity key is the hub key and each attribute can be a separate satellite table.
If you’re a fan of Activity Schema, you can build an Activity ledger by appending together all the activity facts and afterwards run temporal joins to build customer 360 tables.
There’s a lot more I plan to write about modeling data this way so stay tuned.
Until then.
Very excited about what's to come!