Fundamental data engineering concepts - Part 1
Solving technical problems from first principles
Welcome to the latest issue of the data patterns newsletter. If this is your first one, you’ll find all previous issues in the Substack archive.
In this edition we’ll talk about some fundamental data skills and patterns of solving technical problems from first principles. So grab a cup of tea or coffee and strap in.
The more I think about it, the more I realize that what I stand for, what I want to talk about are the boring fundamentals, the enduring skills that are always in demand. I don’t much care about the flashy, the new, the exciting. Survive the economic evolution first, then we talk.
One of these fundamentals is algorithmic (computational) complexity. If you haven’t studied computer science: algorithmic complexity gives you a rough estimate of how the time needed to solve a computational problem grows with the size of the input. It is also known as “Big O notation”.
Without getting too deep into the weeds of complexity, I’ll cover some basic notation:
O(1) (constant time) describes an operation that takes a fixed amount of time regardless of the size of the input. In databases, an operation like “DROP TABLE” executes in the same amount of time no matter how big the table is.
O(n) (linear time) describes an operation whose execution time grows linearly with the size of the input: double the input and you double the execution time. In databases, an operation like “SELECT * FROM table” can be thought of as having linear complexity.
O(n^2) (quadratic time) describes an operation whose execution time grows with the square of the input size: double the input and you quadruple the execution time. This often happens when you have nested loops. Each level of nesting raises the exponent by one, so two levels is squared, three levels is cubed, and so on (the sketch below makes this concrete).
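Here is a rough SQL sketch of each case. The addresses table and its columns are hypothetical, the statements are independent examples rather than a script, and real execution plans depend on indexes and the optimizer:

```sql
-- O(1): a metadata-only operation; runtime doesn't depend on row count
DROP TABLE addresses;

-- O(n): a full sequential scan touches every row exactly once
SELECT * FROM addresses;

-- O(n^2): a join condition that hash and merge joins can't handle, so the
-- planner falls back to a nested loop comparing every row with every other row
SELECT a.id, b.id
FROM addresses AS a
JOIN addresses AS b
  ON a.street LIKE b.street || '%';
```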
So how does this help you solve data problems? In today’s edition I’ll talk about the first of two recent data problems I had to solve from first principles.
The first problem involves running a function on top of a very large table. In PostgreSQL you usually call such a function via a JOIN LATERAL statement (or in a subquery), and when you do that you’re essentially creating a nested loop.
Let me explain.
You can think of scanning a table like a loop: for each row, you perform an operation such as calculating a new column or selecting an existing one. These are typically O(1) (constant time) operations. Scanning the table is O(n), so the worst we could do is linear.
Now when you call a function on every row, depending on what the function does, you could have O(n) complexity or O(n^2) complexity if you’re creating multiple levels of nested loops.
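In PostgreSQL terms, the pattern looks roughly like this (raw_addresses and standardize_address() are hypothetical stand-ins for my actual table and function):

```sql
-- For every row in raw_addresses the function is evaluated once: an outer
-- loop over the table, with the function's own work nested inside it.
SELECT r.id, s.*
FROM raw_addresses AS r
JOIN LATERAL standardize_address(r.raw_address) AS s ON true;
```

If the function itself loops over its input or does other non-trivial work per call, that inner work multiplies with the outer scan.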
This is what happened to me when I tried to run an address standardization function on a 20 million row table in PostgreSQL. After about three hours it failed. Because the function is written in C, I didn’t have access to the source code, so I assumed the worst-case scenario: that I was dealing with O(n^2) (quadratic) complexity or worse.
When you have no control over the algorithms the database uses (and in most cases you don’t), the only thing you can control is the size of the input. My idea was to use the age-old principle of divide and conquer: figure out a way to slice the table into multiple smaller pieces and process each one in parallel using multiple threads.
This table happened to be a union of 20 years’ worth of data, with each year holding roughly 1 million rows. Using dbt, I created one model per year and ran all of them in parallel. To my amazement, all 20 years completed in just a few minutes! Then all I had to do was union them back into a single table, which wasn’t a problem.
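A minimal sketch of what one of those per-year models might look like, with hypothetical source, column and function names:

```sql
-- models/standardized_addresses_2005.sql (one model per year)
-- Filter the big table down to a single year's slice, then run the
-- expensive function on that much smaller input.
SELECT
    r.id,
    s.*
FROM {{ source('raw', 'addresses') }} AS r
JOIN LATERAL standardize_address(r.raw_address) AS s ON true
WHERE r.load_year = 2005
```

Because the models don’t depend on each other, dbt can build them concurrently (via the threads setting in profiles.yml or the --threads flag), and a final model unions the per-year outputs back together.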
By reducing the size of each input and spawning multiple processes, I probably also took advantage of in-memory processing in the database, increasing the efficiency even more.
This idea is the core principle behind “horizontal partitioning”, where a large table is split up (usually by a time component like day, month or year) in order to speed up processing. In fact, Snowflake’s key innovation (besides separating storage from compute) is the idea of micro-partitions.
A large table is split up into smaller independent partitions in such a way that allows for efficient parallelization when optimizing queries. As long as you don’t need to access the other partitions during computation, you can process the entire table in a fraction of the time.
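PostgreSQL supports this pattern natively through declarative partitioning. A minimal sketch, again with hypothetical table and column names:

```sql
-- Parent table partitioned by year; each child partition holds one year's rows.
CREATE TABLE addresses (
    id          bigint,
    raw_address text,
    load_year   int
) PARTITION BY RANGE (load_year);

CREATE TABLE addresses_2023 PARTITION OF addresses
    FOR VALUES FROM (2023) TO (2024);
CREATE TABLE addresses_2024 PARTITION OF addresses
    FOR VALUES FROM (2024) TO (2025);

-- A query that filters on the partition key lets the planner prune the other
-- partitions and work on a much smaller input.
SELECT count(*) FROM addresses WHERE load_year = 2024;
```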
The second problem involves entity resolution, self joins and blocking keys. Since it would take too long to explain and I want to make this newsletter easier to read, I’ll leave it for part 2.
This issue was slightly more technical than others. I’m experimenting with different types of content, so let me know if you enjoyed it by liking it on Substack, commenting or replying directly. I read all replies.
Until then.