Data Engineering
The key to understanding data engineering lies in the “engineering” part. Engineers design and build things. “Data” engineers design and build pipelines that transform and transport data so that, by the time it reaches the Data Scientists or other end users, it is in a highly usable state. These pipelines must take data from many disparate sources and collect it into a single warehouse that represents the data uniformly as a single source of truth.
This sounds simple enough, but a lot of data literacy skills go into this role. This is why Data Engineers are in such short supply and why there is confusion around the role. The figure below is one example of the activities involved in data engineering.
What Do Data Engineers Do
Data engineering is a skill that is in increasing demand. Data engineers are the people who design the system that unifies data and help you navigate it. They perform many different tasks, including:
Acquisition: Finding all the different data sets around the business
Cleansing: Finding and cleaning any errors in the data
Conversion: Giving all the data a common format
Disambiguation: Interpreting data that could be interpreted in multiple ways
Deduplication: Removing duplicate copies of data
Once this is done, data may be stored in a central repository such as a data lake or data lakehouse. Data engineers may also copy and move subsets of data into a data warehouse.
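As a rough illustration, the five tasks listed above can be sketched as a tiny Python pipeline. Everything here is hypothetical: the two source systems ("crm" and "billing"), the field names, the sample rows, and the assumption that each source uses one known date format are invented for the example and do not come from any particular tool or platform.

```python
from datetime import datetime

# Hypothetical rows from two made-up source systems (illustrative data only).
CRM_ROWS = [
    {"email": "Ann@Example.com ", "signup": "2023-01-05"},
    {"email": "bob@example.com", "signup": "2023-02-10"},
    {"email": "", "signup": "2023-03-01"},  # missing email, dropped later
]
BILLING_ROWS = [
    {"email": "ann@example.com", "signup": "01/05/2023"},
    {"email": "carol@example.com", "signup": "03/04/2023"},
]

# Disambiguation: "03/04/2023" could be March 4 or April 3. We assume each
# source documents its own date format and record that knowledge here.
DATE_FORMATS = {"crm": "%Y-%m-%d", "billing": "%m/%d/%Y"}


def acquire():
    """Acquisition: gather rows from every source and tag each with its origin."""
    rows = [dict(r, source="crm") for r in CRM_ROWS]
    rows += [dict(r, source="billing") for r in BILLING_ROWS]
    return rows


def cleanse(rows):
    """Cleansing: trim and lowercase emails, and drop rows that have none."""
    cleaned = []
    for r in rows:
        email = r["email"].strip().lower()
        if email:
            cleaned.append({**r, "email": email})
    return cleaned


def convert(rows):
    """Conversion: parse each date with its source's format, emit ISO 8601."""
    converted = []
    for r in rows:
        parsed = datetime.strptime(r["signup"], DATE_FORMATS[r["source"]])
        converted.append({"email": r["email"], "signup": parsed.date().isoformat()})
    return converted


def deduplicate(rows):
    """Deduplication: keep the first copy of each (email, signup) pair."""
    seen = set()
    unique = []
    for r in rows:
        key = (r["email"], r["signup"])
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique


def run_pipeline():
    return deduplicate(convert(cleanse(acquire())))


if __name__ == "__main__":
    for row in run_pipeline():
        print(row)
```

In this sketch the five input rows collapse to three uniform records, which could then be loaded into a central repository. A real pipeline would read from actual systems and typically rely on dedicated tooling, but the order of the steps is the same.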