You’ve probably heard the metaphor “data is the new oil” or “data is raw material.” As a data professional, I’m sure you’re very familiar with the concepts of “dirty data” and “data cleaning.”
Both of those ideas are in fact WRONG.
There’s no such thing as “dirty data.” Data is either “fit for purpose” or “unfit for purpose.” Data fit for purpose requires no changes and can be used as is. Data unfit for purpose requires “retrofitting,” which will ALWAYS cause problems.
In our parlance this is known as “data modeling,” but I really like “retrofitting” because it reminds me of retrofitting aftermarket parts to a machine they were not intended for.
An interesting corollary here is that data, when it’s produced, is always fit for purpose (whatever that purpose was). It contains exactly the information it was designed to contain.
So it’s quite pointless to get caught up in debates with data providers about whether their data is dirty or not. Nobody likes to think their data is dirty. But when you frame things in terms of fitness and retrofitting, the conversation becomes much more productive.
What about data that’s inconsistent or messy? Is that not “dirty data”? Yes, inconsistency is the number one cause of messy data. In fact, the first thing I do when working with a company is ensure consistent processes for how data is produced. I’m a big fan of activity-based modeling, which makes it really easy to talk to business stakeholders. But that’s still not “dirty data,” because it still has a purpose.
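The core idea behind activity-based modeling is one consistent shape for every business event, no matter which system produced it. Here’s a minimal sketch in Python; all the field names, event shapes, and source systems are hypothetical, just to show the normalization step:

```python
from dataclasses import dataclass
from datetime import datetime

# One consistent "activity" shape that every business event maps onto.
@dataclass
class Activity:
    entity_id: str        # who did it (a customer, an account, ...)
    activity: str         # what happened ("signed_up", "placed_order", ...)
    ts: datetime          # when it happened
    revenue: float = 0.0  # optional numeric feature

# Raw events arrive in inconsistent shapes from different source systems.
raw_events = [
    {"customer": "c1", "type": "signup", "time": "2024-01-05T10:00:00"},
    {"cust_id": "c1", "event": "order", "amount": 49.0, "ts": "2024-02-01T09:30:00"},
]

def normalize(event: dict) -> Activity:
    """Map each source system's shape onto the single activity shape."""
    if "type" in event:  # the (hypothetical) signup system
        return Activity(event["customer"], "signed_up",
                        datetime.fromisoformat(event["time"]))
    # the (hypothetical) order system
    return Activity(event["cust_id"], "placed_order",
                    datetime.fromisoformat(event["ts"]), event["amount"])

activities = [normalize(e) for e in raw_events]
```

Once everything is an activity, conversations with stakeholders get easy: “which activities does a customer do, and in what order?” is a question anyone can reason about.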
Fitness for purpose also extends to data modeling (or, as we’ve been calling it, retrofitting).
Data modeling, in the traditional sense, claims to provide a single form of data that fits multiple (all?) purposes. I might piss off many purists here, but I’m a pragmatist when it comes to this.
I believe you should model data the best way possible to fit your specific purpose. If that purpose were to change in the future, you’d need to “retrofit” it to the new purpose.
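To make retrofitting concrete, here’s a toy sketch (table and column names are made up): data modeled one-row-per-order-line for fulfillment, then derived into a second model fit for a new purpose, monthly revenue reporting, rather than forcing one model to serve both:

```python
from collections import defaultdict

# Hypothetical: rows modeled for order fulfillment (one row per order line).
order_lines = [
    {"order_id": 1, "month": "2024-01", "sku": "A", "amount": 30.0},
    {"order_id": 1, "month": "2024-01", "sku": "B", "amount": 20.0},
    {"order_id": 2, "month": "2024-02", "sku": "A", "amount": 30.0},
]

# New purpose: monthly revenue reporting. Derive a second model fit for it
# instead of bending the fulfillment model out of shape.
def monthly_revenue(lines: list[dict]) -> dict[str, float]:
    totals: defaultdict[str, float] = defaultdict(float)
    for line in lines:
        totals[line["month"]] += line["amount"]
    return dict(totals)

monthly_revenue(order_lines)  # {"2024-01": 50.0, "2024-02": 30.0}
```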
But Ergest, isn’t this going to cause unnecessary duplication of data assets like tables? It doesn’t have to. You see, one of the keys to building useful data models is being able to refactor them when the need arises.
Many data teams don’t do this because they’re too busy answering questions. I’ve already talked about how to deal with that in the previous issue. In my professional experience, when time for it is properly allocated, refactoring leads to joy.
I don’t know about you, but when I get a chance to redesign something I always find a more elegant way to do it, and as an added bonus my code is much cleaner and easier to maintain.
That’s it for this issue, but I’m working on some cool content coming up. I want to start doing a deep dive into metric trees.
I’m going to start by breaking down ARR/MRR as it pertains to recurring-revenue businesses, and we’ll work our way down to the leaf nodes. By the way, if you want my help doing this for your company, just hit reply and let me know.
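As a teaser, here’s the shape of the first split in that tree. The component names are the standard SaaS terms for MRR movement; the numbers are made up purely for illustration:

```python
# Leaf nodes of net-new MRR (standard SaaS terms; figures are illustrative).
components = {
    "new": 5_000.0,          # MRR from brand-new customers
    "expansion": 1_200.0,    # upgrades by existing customers
    "contraction": -400.0,   # downgrades by existing customers
    "churned": -900.0,       # customers who cancelled
}

def net_new_mrr(parts: dict[str, float]) -> float:
    """A parent node in a metric tree is the sum of its leaf nodes."""
    return sum(parts.values())

starting_mrr = 100_000.0
ending_mrr = starting_mrr + net_new_mrr(components)  # 104_900.0
```

Each of those leaves splits further (new MRR by channel, churn by cohort, and so on), which is where the tree gets genuinely useful.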
Until next time.