Fundamental data engineering concepts - Part 2
Solving technical problems from first principles
Welcome to the latest issue of the data patterns newsletter. If this is your first one, you’ll find all previous issues in the Substack archive.
In the last edition we started talking about a few fundamental data engineering concepts, like algorithmic complexity and “divide and conquer.” You can read that issue here.
In this issue we continue talking about fundamental concepts and cover another tricky technical problem I’ve personally encountered working with data. So grab a cup of tea or coffee and let’s get into it.
One of the most challenging problems in the data field is something you’ve probably never heard of, but it’s quite common with government data. It goes by many names: Entity Resolution, Record Linkage, Record Matching, and so on.
Why is it challenging? Imagine we are a government agency. We have a database of information about people, with attributes like national ID (SSN in the US), name, some kind of address, date of birth, and other identifying information.
We need to share this information with another government agency, for whatever reason, but they don’t have the national ID. They only have name, address, and some other attributes.
How do we match the information across databases?
It might seem easy at first: just join on name and address. But as we know, real-world data is messy. It might be badly formatted, missing, truncated, misspelled, and so on.
This is where Entity Resolution comes in. Using the idea of “fuzzy matching” we can get to roughly 80-90% accuracy, which should suffice for most use cases.
How Entity Resolution Works
So let’s set up the problem. We have either two tables of customer information we want to link, or a single table with multiple rows for the same customer whose identity we want to resolve. The only information they share is customer name and address.
To do this we have to make an assumption about the uniqueness of each row. We can assume that the combination of name + address (city, state, postal code) is fairly unique.
We also assume that the information for a customer is not exactly the same across records, but pretty close (e.g. in one record they’re Chris and in another they’re Christopher), so we need a way to compare two strings for similarity.
String similarity is a well-known problem with multiple algorithms available such as Levenshtein distance, Jaro-Winkler, etc.
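To make that concrete, here’s a quick sketch assuming PostgreSQL with the fuzzystrmatch and pg_trgm contrib extensions installed (they provide levenshtein() and similarity() respectively):

```sql
-- Sketch only; assumes PostgreSQL with:
--   CREATE EXTENSION fuzzystrmatch;  -- provides levenshtein()
--   CREATE EXTENSION pg_trgm;        -- provides similarity()
SELECT
    levenshtein('Chris', 'Christopher') AS edit_distance,  -- 6 single-character edits
    similarity('Chris', 'Christopher')  AS trigram_score;  -- 0 to 1, higher = more similar
```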
The simplest and most naive approach is to cross-join the two tables (or self-join if there’s only one table) and compare every row to every other row, using a similarity threshold (say 75%+ similar) to find the duplicates.
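In SQL, that naive version might look something like this (a sketch assuming PostgreSQL with pg_trgm; the table and column names are made up for illustration):

```sql
-- Naive approach: compare every row in one table to every row in the other.
SELECT a.customer_id AS id_a,
       b.customer_id AS id_b
FROM customers_a a
CROSS JOIN customers_b b
WHERE similarity(a.full_name, b.full_name) >= 0.75  -- keep pairs that are 75%+ similar
  AND similarity(a.address,   b.address)   >= 0.75;
```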
OK, simple enough, right? But wait! What if one table has 10 million rows and the other 15 million? If you compare every row to every other row, you get 1.5 × 10^14 comparisons! That’s O(n^2) complexity. Even if you could do one comparison per nanosecond, the query would take about 42 hours!
This problem isn’t unique to entity resolution.
When I was working for TripAdvisor, I was asked to convert a Python program into a SQL query. The program took rental listings and output the points of interest within a 5-mile radius of each listing (shopping centers, museums, parks, beaches, etc.).
It used two tables with latitude and longitude: one had the locations of rentals and the other had the locations of points of interest. So all I had to do was cross-join the tables, calculate the haversine distance in miles, and filter. Easy, right?
My first attempt was the naive one: I cross-joined the tables, calculated the distance, filtered down to 5 miles, and let it run. The next day it was still running. Why? Imagine cross-joining every point on the globe to every other point. Oy!
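Roughly what that first attempt looked like (a sketch assuming PostgreSQL; the table and column names are illustrative, and 3959 is the Earth’s radius in miles):

```sql
-- Naive version: cross join every rental with every point of interest,
-- compute the haversine distance in miles, and keep pairs within 5 miles.
SELECT r.rental_id, p.poi_id
FROM rentals r
CROSS JOIN points_of_interest p
WHERE 2 * 3959 * asin(sqrt(
          power(sin(radians(p.lat - r.lat) / 2), 2)
        + cos(radians(r.lat)) * cos(radians(p.lat))
        * power(sin(radians(p.lon - r.lon) / 2), 2)
      )) <= 5;
```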
Clearly this approach doesn’t work, so what can we do?
In the previous issue we talked about “divide and conquer.” Another important fundamental concept in software engineering is to reduce your search space as much as possible by removing all the places unlikely to contain your matches.
What does this mean?
In our entity resolution example, we can be fairly certain that a John Smith in Boston, MA is not the same person as a John Smith in New York, NY. This means we can reduce our search space from comparing every row to every other row down to only comparing rows for people with the same state + postal code.
In the case of the Python program I converted, a colleague told me that a single degree of latitude or longitude covers roughly 70 miles. So I could restrict each comparison to points within a single degree of each other, getting the query to run in 10 minutes!
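The fix is one extra join condition that throws away any pair more than a degree apart before the expensive distance math runs (same sketch and assumptions as above):

```sql
-- Bounded version: only compare points within one degree of each other
-- (roughly a 70-mile box), then apply the exact 5-mile haversine filter.
SELECT r.rental_id, p.poi_id
FROM rentals r
JOIN points_of_interest p
  ON  p.lat BETWEEN r.lat - 1 AND r.lat + 1
  AND p.lon BETWEEN r.lon - 1 AND r.lon + 1
WHERE 2 * 3959 * asin(sqrt(
          power(sin(radians(p.lat - r.lat) / 2), 2)
        + cos(radians(r.lat)) * cos(radians(p.lat))
        * power(sin(radians(p.lon - r.lon) / 2), 2)
      )) <= 5;
```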
Here’s what the entity resolution query looks like:
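A minimal sketch of that query, assuming PostgreSQL with the pg_trgm extension and a single customers table with illustrative column names:

```sql
-- Self-join blocked on state + zipcode, then filter on string similarity.
SELECT a.customer_id AS customer_id_a,
       b.customer_id AS customer_id_b
FROM customers a
JOIN customers b
  ON  a.state   = b.state
  AND a.zipcode = b.zipcode
  AND a.customer_id < b.customer_id  -- skip self-pairs and mirrored pairs
WHERE similarity(a.full_name, b.full_name) >= 0.75
  AND similarity(a.address,   b.address)   >= 0.75;
```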
I’m joining the table with itself on state + zipcode to reduce the search space and using string similarity thresholds for filtering potential duplicates. In entity resolution methodology this is known as “blocking.”
The smaller we can make the blocks, the faster the query runs. State + zipcode is good, but we can do better: we can use a method like the Soundex algorithm, which maps similar-sounding names to the same code.
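For instance, PostgreSQL’s fuzzystrmatch extension provides a soundex() function, and the resulting code can be added to the blocking key (again a sketch with illustrative column names):

```sql
-- Soundex collapses similar-sounding names into one code:
SELECT soundex('Smith') AS code_a,   -- 'S530'
       soundex('Smyth') AS code_b;   -- 'S530'

-- Added to the blocking key, it shrinks each block further, e.g.:
--   ON  a.state   = b.state
--   AND a.zipcode = b.zipcode
--   AND soundex(a.last_name) = soundex(b.last_name)
```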
I’m experimenting with more technical types of content, so let me know if you enjoyed it and want more by liking this post on Substack, commenting, or replying directly. I read all replies.
Until then.