Welcome to the latest issue of the data patterns newsletter. If this is your first one, you’ll find all previous issues in the Substack archive.
In this edition we’ll talk about one of my favorite topics—data engineering—and why I believe it’s the most interesting career choice for those who enjoy programming but don’t want to build apps.
Years ago, when I was still in college, I got an internship working for a startup.
I was studying computer science and loved solving problems with code. I had particularly enjoyed the classes on algorithms and data structures, so I was quite excited to get a job where I could practice what I was learning.
Well, that didn’t last very long.
As my first project, the CTO asked me to build a web app in Python. I was new to Python. I had been studying Java and C in school, and while Python is similar, I still couldn’t do it. I took the failure very personally. I struggled for a couple of weeks and eventually gave up.
I ended up doing random stuff for the duration of the internship: Linux sysadmin work, cybersecurity and pen testing, even web development. Nothing stuck, and my confidence took a major hit. I thought I had wasted four years of college studying something I no longer wanted to do.
Disillusioned, I started doing odd IT jobs and considered a career change. I went back to school to get an MBA, because that’s what you do when you don’t know what to do.
As luck would have it, a classmate recognized my technical skills and offered me a job as a business analyst. It was the first time I had worked with data and SQL. I immediately fell in love with it.
It combined the best of what I had enjoyed in college, solving real problems with code and working with data structures. Little did I know that there was a whole career where all you did was work with data.
Data engineering is a strange career. Unlike data science, there’s no curriculum out there that can prepare you for it. Nobody decides they want to be a data engineer in college. Many practitioners simply transition into it from software engineering roles. It doesn’t get much coverage because it’s not flashy enough.
A few years ago I was offered an opportunity to switch to data engineering, and I turned it down. Why? Because I thought of data engineers as the “plumbers” of modern data architectures. Who wants to be a plumber, right?
But good plumbers are always in demand. Can you imagine living in a house with no running water? Data engineers get little recognition, but they’re indispensable to many modern companies.
Would you rather be indispensable or have a flashy title?
Everybody is fascinated with the latest advancements in AI like ChatGPT, DALL-E, Midjourney, Stable Diffusion, etc. Everyone wants to learn about them. But nobody talks about the data engineering efforts necessary to build these models.
Just because you don’t see it, doesn’t mean it isn’t there.
ChatGPT’s data architecture
I recently asked ChatGPT what its architecture looked like. Here’s the response:
The first component is data collection. To train GPT-3 and ChatGPT, OpenAI used a massive corpus of text data that was collected from various sources, including books, websites, and other textual resources. The data was pre-processed to remove noise and irrelevant information, and then organized into a format suitable for training the models.
Collecting a massive corpus of text from the web, then cleaning and pre-processing it, requires a massive engineering effort. You have to understand data architecture, design multiple data workflows, and organize the data into a format useful for training.
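To make the cleaning step concrete, here’s a toy sketch (my own illustration, not OpenAI’s actual pipeline) of what a single pre-processing function might look like: stripping leftover HTML noise and normalizing whitespace before the text goes anywhere near a model.

```python
import re

def clean_document(text: str) -> str:
    """Toy cleaning step: strip HTML tags and collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)      # drop leftover HTML tags
    text = re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace
    return text

raw = "<p>Hello,   world!</p>\n<div>More text.</div>"
print(clean_document(raw))  # Hello, world! More text.
```

A real corpus pipeline would run thousands of transformations like this, distributed across many machines, plus deduplication, language detection, and quality filtering.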
The second component is the training pipeline. GPT-3 and ChatGPT were trained using a combination of supervised and unsupervised learning techniques. The models were trained on large-scale distributed systems that used specialized hardware, such as GPUs and TPUs, to accelerate the training process. The training data was fed into the model in batches, and the model was optimized using backpropagation and other techniques to improve its accuracy and performance.
Training on this massive corpus requires a deep understanding of distributed computing, batch processing, and systems architecture for GPU hardware. Yes, this is where the ML algorithms come into play, but that’s only the final layer.
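The “fed into the model in batches” part is a data engineering pattern in its own right. Here’s a minimal sketch (again, an illustration, not the real training code) of slicing a corpus into fixed-size batches:

```python
def batches(items, batch_size):
    """Yield successive fixed-size batches from a list of training examples."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

corpus = list(range(10))  # stand-in for tokenized training examples
for batch in batches(corpus, 4):
    print(batch)
# [0, 1, 2, 3]
# [4, 5, 6, 7]
# [8, 9]
```

At real scale, this turns into streaming data loaders that shard the corpus across machines and keep the GPUs fed without ever holding everything in memory at once.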
The third component is the inference pipeline. Once the models were trained, they were deployed on a distributed inference system that is capable of generating text in real-time. The inference pipeline involves several steps, including tokenization, feature extraction, and prediction, and it is optimized for speed and efficiency.
Finally, we get into ML engineering, which is slightly outside the scope of typical data engineering.
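Still, the shape of that inference pipeline—tokenization, then prediction—is just stages chained together, the same pattern data engineers build every day. A toy sketch (with a fake stand-in model, purely for illustration):

```python
def tokenize(text):
    """Toy tokenizer: lowercase and split on whitespace."""
    return text.lower().split()

def predict(tokens):
    """Stand-in for a trained model: just reports the token count."""
    return f"({len(tokens)} tokens received)"

def infer(text):
    """Chain the pipeline stages: tokenize, then predict."""
    return predict(tokenize(text))

print(infer("Hello from the inference pipeline"))  # (5 tokens received)
```

Swap the stand-ins for a real tokenizer and a real model, add batching and caching for speed, and you have the skeleton of a production inference service.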
And that’s why you should consider becoming a data engineer. Data engineers are indispensable and always in demand. If you enjoy programming but don’t care about building apps (like me), this might be the right career for you.
So how do you get started?
The best way to get started is by grabbing Joe Reis and Matt Housley’s excellent book Fundamentals of Data Engineering. It will teach you the basics and give plenty of direction on what to learn next.
If you enjoyed this topic, let me know by replying, liking, or commenting on Substack so I can write more about it in upcoming newsletters.
Until next time.