In data science, roughly 80% of resources (time, energy, etc.) are spent on collecting, processing and cleaning data, and only 20% on building AI models and analyzing data. I am currently working as a Data Scientist, and I spend a good part of my time doing Data Engineering before getting to the part where I can finally extract insights or make predictions.
In this article, we'll discuss some basic principles of moving data between different repositories reliably. It can be considered an introduction to data engineering. These principles are technology-agnostic and can be applied with any given stack.
Who is the DE?
In a data science team, data engineers are responsible for collecting, processing, cleaning and validating the data used by the rest of the team. DEs should be able to put in place a data infrastructure that lets them deliver data reliably and that empowers whoever works on the analysis with trusted data to draw insights from. The majority of a DE's time is spent building and maintaining data pipelines.
It is not always seen as a rewarding job, since it can appear too technical for people to truly see its value. Nonetheless, it is very important, and demand is rising (around 40%), since the previously hired data scientists are now finding it very challenging to do both analytical work and engineering, and it is expected to stay this way for a while.
ETL
ETL is a common pattern in Data Engineering for moving data from one state to another. ETL is an abbreviation of "Extract, Transform and Load". It is basically a chain of software components, each of which takes the data and hands it over to the next one, either directly or through an intermediate database.
Let's explain each one of them.
Extract
Extract is the step of collecting data from the targeted data sources. Each data source has a specific method of communicating its data:
- websites: web scraping (e.g. flight prices) when no API is available
- databases: a query language (e.g. SQL)
- data streams: subscribing to the data producer (e.g. sensors)
- APIs: ready-to-use data or services provided through an API (e.g. geocoding)
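To make the extract step concrete, here is a minimal sketch in Python, assuming a hypothetical `orders` table in a SQLite database that stands in for a real source; the table and column names are invented for illustration:

```python
import sqlite3

def extract_orders(connection):
    """Extract step: pull raw rows from a source database via SQL."""
    cursor = connection.execute("SELECT id, amount, currency FROM orders ORDER BY id")
    # Return plain dicts so downstream steps don't depend on the source driver.
    columns = [col[0] for col in cursor.description]
    return [dict(zip(columns, row)) for row in cursor.fetchall()]

# Demo source: an in-memory database standing in for a real one.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (id INTEGER, amount TEXT, currency TEXT)")
source.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                   [(1, "19.90", "EUR"), (2, "5.00", "USD")])

raw_rows = extract_orders(source)
```

Whatever the source, the idea is the same: the extract step hides the source-specific protocol and hands plain data structures to the next step.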
Transform
Transform is the step of processing the collected data and reshaping it to match how it should be stored. A few examples of common operations:
- removing duplicates
- changing data types (e.g. string to int)
- normalizing and preparing for the target schema (e.g. renaming properties)
Processing tasks should be efficient, idempotent and have one clear responsibility. We can get into this in another article.
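The three operations above can be sketched in one small transform function; the input fields (`id`, `amount`, `currency`) and the target schema (`order_id`, `amount_cents`) are hypothetical names chosen for illustration:

```python
def transform(raw_rows):
    """Transform step: deduplicate by id, cast string amounts to integer
    cents, and rename properties to match the target schema."""
    seen = set()
    clean = []
    for row in raw_rows:
        if row["id"] in seen:                 # removing duplicates
            continue
        seen.add(row["id"])
        clean.append({
            "order_id": row["id"],            # renaming a property
            "amount_cents": int(round(float(row["amount"]) * 100)),  # string -> int
            "currency": row["currency"].upper(),
        })
    return clean

raw = [
    {"id": 1, "amount": "19.90", "currency": "eur"},
    {"id": 1, "amount": "19.90", "currency": "eur"},   # duplicate
    {"id": 2, "amount": "5.00", "currency": "usd"},
]
rows = transform(raw)
```

Because the function is deterministic and has a single responsibility, re-running it on the same input always yields the same output, which is what makes re-running a failed pipeline safe.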
Load
Load is the process of saving the input data in a target data repository. Data can be pushed to the target in different ways, depending on its type:
- API: pushing data through an API that takes care of the rest
- query language: building queries (e.g. SQL) and running them against a database
- file system: saving data in files (e.g. CSV) on the file system
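A minimal load sketch against a SQLite target, assuming a hypothetical `clean_orders` schema; keying `INSERT OR REPLACE` on the primary key keeps re-runs from duplicating data:

```python
import sqlite3

def load(rows, connection):
    """Load step: upsert rows into the target table. INSERT OR REPLACE
    keyed on order_id means running the load twice leaves the same state."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS clean_orders "
        "(order_id INTEGER PRIMARY KEY, amount_cents INTEGER, currency TEXT)"
    )
    connection.executemany(
        "INSERT OR REPLACE INTO clean_orders "
        "VALUES (:order_id, :amount_cents, :currency)",
        rows,
    )
    connection.commit()

target = sqlite3.connect(":memory:")
rows = [{"order_id": 1, "amount_cents": 1990, "currency": "EUR"},
        {"order_id": 2, "amount_cents": 500, "currency": "USD"}]
load(rows, target)
load(rows, target)  # safe to re-run: still two rows, not four
count = target.execute("SELECT COUNT(*) FROM clean_orders").fetchone()[0]
```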
ETL vs ELT
As the managed data volumes grow, we tend to keep our raw, processed and trusted data separate.
The ELT pattern suggests loading the extracted raw data into a separate database first, as a staging area where engineers can run and test multiple transformations before loading the results into the trusted database, as a safety measure.
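A minimal ELT sketch, with hypothetical `staging_orders` and `trusted_orders` tables in SQLite: raw data lands untouched in staging, and the transformation runs inside the database, where it can be re-run until the result looks right:

```python
import sqlite3

db = sqlite3.connect(":memory:")

# Extract + Load: raw data lands untouched in a staging table.
db.execute("CREATE TABLE staging_orders (id INTEGER, amount TEXT)")
db.executemany("INSERT INTO staging_orders VALUES (?, ?)",
               [(1, "19.90"), (1, "19.90"), (2, "5.00")])

# Transform: run (and safely re-run) SQL against staging, then
# materialize the result into the trusted table.
db.execute("DROP TABLE IF EXISTS trusted_orders")
db.execute("""
    CREATE TABLE trusted_orders AS
    SELECT DISTINCT id, CAST(amount AS REAL) AS amount
    FROM staging_orders
""")
trusted_count = db.execute("SELECT COUNT(*) FROM trusted_orders").fetchone()[0]
```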
Batch vs Stream
Data can move through pipelines either in batches or in streams. In systems that continuously produce data at a high frequency, we can use streaming to move and process the data (e.g. real-time monitoring). But if data is produced less frequently and we don't need updates in real time, it is better processed in batches at a lower frequency (e.g. daily jobs). With this said, the frequency at which we want our target data repository to be updated is the first factor in choosing between batch and stream. We can discuss this in more depth in another article.
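A small sketch of the difference, with a hypothetical `process` function standing in for the business logic: batch mode runs over everything accumulated since the last job, while stream mode handles each event as the producer emits it:

```python
def process(event):
    """Shared business logic, independent of how events arrive."""
    return {"value": event["value"] * 2}

def run_batch(events):
    """Batch: a scheduled job (e.g. daily) processes the accumulated data."""
    return [process(e) for e in events]

def run_stream(event_source):
    """Stream: each event is processed as soon as it is produced."""
    for event in event_source:
        yield process(event)

daily_dump = [{"value": 1}, {"value": 2}]
batch_results = run_batch(daily_dump)

def sensor():
    # A generator standing in for a live producer (e.g. a sensor feed).
    for v in (3, 4):
        yield {"value": v}

stream_results = list(run_stream(sensor()))
```

Note that the processing logic itself is identical; only the cadence and delivery mechanism change, which is why the update-frequency requirement drives the choice.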
Data Quality
A data engineer should be able to ensure data quality throughout the data pipelines, since analysts will build their studies on top of the delivered data and make recommendations to clients. Sometimes, other solutions are built on top of the data as well. In this section, we'll discuss a few practices for ensuring data quality.
While engineers are working on data pipelines, data analysts are waiting for data at the end of the pipeline, and they usually have defined expectations about the data they want to work with (e.g. the attributes to work with, specific data types).
This is where validation contracts between the analysts and the engineers should take place. Both parties should agree on what data to deliver at the end of the pipeline. These contracts should contain detailed requirements that both respect the technical limitations and serve the data necessary for the analysis.
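One lightweight way to encode such a contract is a mapping from attribute names to expected types, checked at the end of the pipeline; the fields below are hypothetical:

```python
# The contract both parties agreed on: attribute name -> required type.
CONTRACT = {"order_id": int, "amount_cents": int, "currency": str}

def validate(row, contract=CONTRACT):
    """Return a list of contract violations; an empty list means the row
    satisfies the analysts' expectations."""
    errors = []
    for field, expected_type in contract.items():
        if field not in row:
            errors.append(f"missing field: {field}")
        elif not isinstance(row[field], expected_type):
            errors.append(f"{field}: expected {expected_type.__name__}, "
                          f"got {type(row[field]).__name__}")
    return errors

good = {"order_id": 1, "amount_cents": 1990, "currency": "EUR"}
bad = {"order_id": "1", "currency": "EUR"}   # wrong type, missing field
```

In practice a schema-validation library would play this role, but even a check this small catches broken deliveries before analysts do.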
Each operation that interacts with the data should leave metadata behind:
- when was it executed?
- where did it run? (e.g. which machine/server)
- what was the exit status? (e.g. success, failure)
- what were the data source and data target?
- which version of the process ran?
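A minimal sketch of recording this operation metadata, assuming a hypothetical version tag; the record would normally be appended to a metadata store rather than just returned:

```python
import datetime
import socket
import traceback

PROCESS_VERSION = "1.2.0"   # hypothetical version tag of this pipeline step

def run_with_metadata(task, source, target):
    """Execute a pipeline task and record the metadata listed above."""
    record = {
        "executed_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "host": socket.gethostname(),        # where did it run?
        "source": source,
        "target": target,
        "version": PROCESS_VERSION,          # which version ran?
    }
    try:
        task()
        record["status"] = "success"
    except Exception:
        record["status"] = "failure"
        record["error"] = traceback.format_exc()
    return record

meta = run_with_metadata(lambda: None, source="raw_db", target="trusted_db")
failed = run_with_metadata(lambda: 1 / 0, source="raw_db", target="trusted_db")
```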
On the other hand, every data element (e.g. row, document, file) in our data repository should have metadata attached to it describing where it came from and how it is changing:
- where did the data come from? (e.g. URL, path)
- created_at and updated_at timestamps
- the query that produced the data
With this metadata stored and kept up to date, it becomes very easy to keep track of how data is changing and of the health of our processes (auditing). We can even recover from failures easily thanks to the metadata.
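A small sketch of attaching such per-element lineage metadata to a record without touching its payload; the source path and query below are hypothetical:

```python
import datetime

def with_lineage(record, source, query=None):
    """Attach provenance metadata to a data element alongside its payload."""
    now = datetime.datetime.now(datetime.timezone.utc).isoformat()
    return {
        **record,
        "_source": source,       # where the data came from (URL, path...)
        "_query": query,         # the query that produced it, if any
        "_created_at": now,
        "_updated_at": now,      # updated on every subsequent write
    }

row = with_lineage({"order_id": 1, "amount_cents": 1990},
                   source="s3://bucket/orders.csv",    # hypothetical path
                   query="SELECT * FROM orders")       # hypothetical query
```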
Recovery
In an environment where sensitive data is constantly moving and being shared between multiple systems that are themselves continuously evolving, it is very common for problems to occur, and we need to be ready to recover from possible incidents.
Some good practices:
- implement retry and skip behaviors where needed: sometimes we can tolerate rerunning or skipping a task, depending on the context
- back up databases regularly, creating incremental checkpoints we can roll back to at any given time
- separate DEV and PROD environments: first run a process against the DEV database and see how it goes before running it on the PROD one
- version control the ETL code (one atomic task per commit): changes to the data pipeline should be committed separately and documented through a version control system (e.g. Git). If a commit causes a problem, we can simply go back to a previous version of the code.
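The first practice, retry and skip behavior, can be sketched as a small wrapper around any task; the retry count and delay below are arbitrary choices for illustration:

```python
import time

def with_retry(task, retries=3, delay_seconds=0.1, skip_on_failure=False):
    """Re-run a failing task a few times; optionally skip it instead of
    crashing the whole pipeline when every attempt fails."""
    for attempt in range(1, retries + 1):
        try:
            return task()
        except Exception:
            if attempt == retries:
                if skip_on_failure:
                    return None        # tolerate the failure and move on
                raise                  # or surface it to the scheduler
            time.sleep(delay_seconds)  # wait before retrying

attempts = {"count": 0}

def flaky():
    # Simulates a task that fails twice, then succeeds (e.g. a flaky network).
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("transient network issue")
    return "ok"

result = with_retry(flaky)
```

Whether to retry, skip, or fail hard depends on the context: a duplicate-tolerant load can be retried freely, while a non-idempotent task should fail loudly instead.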
In this article, we discussed the basics of data engineering starting with the ETL pattern, then we elaborated on some good practices of how we can ensure quality through data pipelines. This is just an introduction to the discipline of Data Engineering. I hope it was useful.
Do not hesitate to read more articles on the blog.