How we think about Data Pipelines is changing | by Hugo Lu | Nov, 2023

Category:

Harness the Potential of AI Tools with ChatGPT. Our blog offers comprehensive insights into the world of AI technology, showcasing the latest advancements and practical applications facilitated by ChatGPT’s intelligent capabilities.

Photo by Ali Kazal on Unsplash

The goal is to reliably and efficiently release data into production

Hugo Lu

Data Pipelines are series of tasks organised in a directed acyclic graph or “DAG”. Historically, these are run on open-source workflow orchestration packages like Airflow or Prefect, and require infrastructure managed by data engineers or platform teams. These data pipelines typically run on a schedule, and allow data engineers to update data in locations such as data warehouses or data lakes.

This is now changing. There is a great shift in mentality happening. As the data engineering industry matures, mindsets are shifting from a “move data to serve the business at all costs” mindset to “reliability and efficiency” / “software engineering” mindset.

Continuous Data Integration and Delivery

I’ve written before about how Data Teams ship data whereas software teams ship code.

This is a process called “Continuous Data Integration and Delivery”, and is the process of reliably and efficiently releasing data into production. There are subtle differences with the definition of “CI/CD” as used in Software Engineer, illustrated below.

Image the author’s

In software engineering, Continuous Delivery is non-trivial because of the importance of having a near exact replica for code to operate in a staging environment.

Within Data Engineering, this is not necessary because the good we ship is data. If there is a table of data, and we know that as long as a few conditions are satisfied, the data is of a sufficient quality to be used, then that is sufficient for it to be “released” into production, so to speak.

The process of releasing data into production — the analog for Continuous Delivery — is very simple, as it simply relates to copying or cloning a dataset.

Furthermore, a key pillar of data engineering is reacting to new data as it arrives or checking to see if new data exists. There is no analog for this in software engineering — software applications do not need to…

Discover the vast possibilities of AI tools by visiting our website at
https://chatgptoai.com/ to delve deeper into this transformative technology.

Reviews

There are no reviews yet.

Be the first to review “How we think about Data Pipelines is changing | by Hugo Lu | Nov, 2023”

Your email address will not be published. Required fields are marked *

Back to top button