Data quality monitoring made easy
Jul 16, 2021 • 11 min read
After the initial excitement that data lakes would help companies maximize the utility of their data, many companies became disillusioned by rapidly diminishing returns from their big data efforts. While it was easy to put large volumes of data in the lakes, turning that data into insights and realizing value from it turned out to be a much more difficult task.
Many of these problems were related to poor quality of data, lack of alignment between business and technology, lack of collaboration, and absence of proper tooling. When software development faced similar challenges, Agile and DevOps techniques helped solve the problem.
In the world of data, the industry invented the term DataOps, which takes Agile and DevOps principles and applies them to data engineering, science, and analytics. We are not going to focus on DataOps techniques in this post, as there are many good articles on the topic. Instead, we will focus on the technology enablers that facilitate DataOps implementation.
In a previous article we wrote about how data lakes are insufficient to satisfy enterprise data analytics needs and that companies need to build fully featured analytical data platforms. In this article we will explore the capabilities of an analytical data platform that facilitates a DataOps methodology and helps companies derive value from the data more quickly and efficiently.
Modern analytical data platforms contain thousands of data transformation jobs and move hundreds of terabytes of data in batch and real-time streaming. Managing complex data pipelines manually is extremely time-consuming and error-prone, leading to stale data and lost productivity.
The goal of automated data orchestration is to take the effort of scheduling execution of data engineering jobs off the shoulders of data engineering and support teams and automate it with tools. A good example of an open source data orchestration tool is Apache Airflow, which has a number of benefits:
Some DataOps articles refer to statistical process controls, which we call data monitoring. Data monitoring is the first step and precursor to data quality. The key idea behind data monitoring is observing data profiles over time and catching potential anomalies.
In the simplest form, it can be implemented by collecting various metrics of the datasets and individual data columns, such as:
Then for each metric, the system would calculate a number of usual statistics, such as:
With this information, we can observe whether a new data item or dataset is substantially different from what the system has observed in the past. The data analytics and data science teams can also use the collected data profiles to learn more about the data and quickly validate hypotheses.
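The idea of comparing a new data profile against history can be sketched in a few lines of Python. This is a minimal illustration, not the accelerator's implementation: the metric names (`row_count`, `null_fraction`, `mean`), the function names, and the three-standard-deviation threshold are all assumptions chosen for the example.

```python
import math
from statistics import mean, stdev


def profile(rows, column):
    """Collect a few illustrative metrics for one column of a dataset."""
    values = [row.get(column) for row in rows]
    present = [v for v in values if v is not None]
    return {
        "row_count": len(values),
        "null_fraction": (len(values) - len(present)) / len(values) if values else 0.0,
        "mean": mean(present) if present else 0.0,
    }


def find_anomalies(history, current, threshold=3.0):
    """Return the metrics of `current` that deviate from past profiles.

    A metric is flagged when it lies more than `threshold` standard
    deviations away from the historical mean of that metric.
    """
    flagged = {}
    for metric, value in current.items():
        past = [p[metric] for p in history]
        if len(past) < 2:
            continue  # not enough history to estimate the spread
        mu, sigma = mean(past), stdev(past)
        if sigma == 0:
            # Perfectly flat history: any change at all is worth a look.
            if not math.isclose(value, mu):
                flagged[metric] = {"value": value, "expected": mu}
        elif abs(value - mu) / sigma > threshold:
            flagged[metric] = {"value": value, "expected": mu}
    return flagged
```

In practice, `profile` would run on every new batch, its result would be checked with `find_anomalies` against the stored profiles of earlier batches, and then appended to that history.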
The simple methods of data monitoring can be augmented by AI-driven anomaly detection. Modern anomaly detection algorithms can learn periodic data patterns, use correlations between various metrics, and minimize the number of false positive alerts. To learn more about this technique, read our recent article on various approaches to real-time anomaly detection. To simplify adding anomaly detection to an existing analytical data platform, we have implemented an accelerator. You can learn more about it in this article, or reach out to us to try it out.
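To show why periodic patterns matter, here is a deliberately simple stand-in for the learned models mentioned above. It is not an ML algorithm and not the accelerator's method: it only keeps a separate baseline for each slot of a repeating period (for example, each day of the week), so that a low weekend value is not compared against weekday traffic. The class name, parameters, and defaults are assumptions made for this sketch.

```python
from collections import defaultdict
from statistics import mean, stdev


class SeasonalBaseline:
    """Tracks a metric separately for each slot of a repeating period."""

    def __init__(self, period=7, threshold=3.0, min_samples=3):
        self.period = period
        self.threshold = threshold
        self.min_samples = max(2, min_samples)  # stdev needs two points
        self._by_slot = defaultdict(list)

    def is_anomaly(self, step, value):
        """Check `value` against earlier values seen in the same slot."""
        past = self._by_slot[step % self.period]
        if len(past) < self.min_samples:
            return False  # too little history for this slot
        mu, sigma = mean(past), stdev(past)
        # Floor sigma so a perfectly flat history does not divide by zero.
        return abs(value - mu) / max(sigma, 1e-9) > self.threshold

    def observe(self, step, value):
        """Record a value that has been accepted as normal."""
        self._by_slot[step % self.period].append(value)
```

With `period=7` and one observation per day, a value that is typical for a Saturday is accepted on a Saturday but flagged on a Monday, which a single global threshold could not do.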
While data monitoring helps data engineers, analysts, and scientists learn additional details about data and get alerted in case of anomalies, data quality capabilities take the idea of improving data trustworthiness, or veracity, to another level. The primary goal of data quality is to automatically detect data corruption in the pipeline and prevent it from spreading.
Data quality uses three main techniques to accomplish that goal:
If a team already uses automated data orchestration tools that support configuration-as-code, such as Apache Airflow, data quality jobs can be automatically embedded in the required steps between, or in parallel to, data processing jobs. This further reduces the time and effort needed to keep the data pipeline monitored. To learn more about data quality, please refer to our recent article. To speed up implementation of data quality in existing analytical data platforms, we have implemented an accelerator based on the open source technology stack. The accelerator is built with cloud-native architecture and works with most types of data sources.
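The following sketch shows the shape of a quality gate placed between two processing jobs. To stay self-contained it uses Python's standard `graphlib` module (Python 3.9+) rather than Apache Airflow, so it is an analogy for the configuration-as-code approach, not Airflow code. The task names, the `DataQualityError` exception, and the sample data are all hypothetical.

```python
from graphlib import TopologicalSorter


class DataQualityError(Exception):
    """Raised by a quality check to stop the pipeline."""


def run_pipeline(tasks, dependencies):
    """Run callables in dependency order, stopping at the first failed check.

    tasks: maps task name -> callable taking a shared context dict.
    dependencies: maps task name -> set of task names it depends on.
    Returns the list of task names that completed.
    """
    context, completed = {}, []
    for name in TopologicalSorter(dependencies).static_order():
        try:
            tasks[name](context)
        except DataQualityError as err:
            print(f"halting pipeline: {name} failed ({err})")
            break
        completed.append(name)
    return completed


def extract(context):
    # Hypothetical input: the second order is corrupted (missing total).
    context["orders"] = [{"id": 1, "total": 20.0}, {"id": 2, "total": None}]


def check_orders(context):
    missing = [o["id"] for o in context["orders"] if o["total"] is None]
    if missing:
        raise DataQualityError(f"orders with missing totals: {missing}")


def publish(context):
    context["revenue"] = sum(o["total"] for o in context["orders"])


# The pipeline is declared as data: the check sits between extract and publish.
tasks = {"extract": extract, "check_orders": check_orders, "publish": publish}
dependencies = {"check_orders": {"extract"}, "publish": {"check_orders"}}
completed = run_pipeline(tasks, dependencies)
```

Because `check_orders` raises before `publish` runs, the corrupted record never reaches the downstream job. A real orchestrator would additionally let independent branches continue, retry tasks, and send alerts, all of which this sketch leaves out.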
Data governance is a ubiquitous term that also encompasses people and process techniques. However, we will focus on its technology and tooling aspects. The two aspects of data governance tooling that have become absolute must-haves for any modern analytical data platform are the data catalog and data lineage.
Data catalog and lineage enable data scientists, analysts, and engineers to quickly find required datasets and learn how they were created. Tools like Apache Atlas, Collibra, Alation, AWS Glue Data Catalog, or the data catalogs from Google Cloud and Azure can be good starting points for implementing this capability.
Adding data catalog, data glossary, and data lineage capabilities increases productivity of the analytics team and improves speed to insights.
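Conceptually, a catalog with lineage is a searchable registry of datasets plus a graph of which datasets feed which. The sketch below illustrates those two lookups in plain Python. It does not reflect the data model or API of any of the tools named above, and every class, field, and dataset name in it is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetEntry:
    name: str
    description: str
    produced_by: str = ""                       # job that writes this dataset
    inputs: list = field(default_factory=list)  # names of upstream datasets


class Catalog:
    def __init__(self):
        self._entries = {}

    def register(self, entry):
        self._entries[entry.name] = entry

    def search(self, keyword):
        """Find datasets whose name or description mentions `keyword`."""
        keyword = keyword.lower()
        return sorted(
            e.name for e in self._entries.values()
            if keyword in e.name.lower() or keyword in e.description.lower()
        )

    def upstream(self, name):
        """Return every dataset that `name` is derived from, directly or not."""
        seen, stack = set(), list(self._entries[name].inputs)
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            if current in self._entries:
                stack.extend(self._entries[current].inputs)
        return sorted(seen)
```

An analyst could call `search` to find a dataset by topic and then `upstream` to see which sources it was built from before deciding whether to trust it.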
The concept of DevOps is one of the cornerstones and inspirations behind the DataOps methodology. While DevOps relies on culture, skills, and collaboration, modern tooling and a lightweight but secure continuous integration and continuous delivery process help reduce time-to-market when implementing new data pipelines or data analytics use cases.
As is the case with regular application development, the continuous delivery process for data needs to follow microservices best practices. Such best practices allow the organization to scale, decrease time to implement and deploy new data or ML pipelines, and improve overall quality and stability of the system.
While having many similarities with application development, continuous delivery processes for data have their own specifics:
Traditional tooling, such as GitHub or other Git-based version control systems, unit testing and static code validation tools, Jenkins for CI/CD, and Harness.io for continuous deployment, is just as useful in the data engineering world. Using data pipeline orchestration tools that allow configuration-as-code, such as Apache Airflow, streamlines the continuous delivery process even further.
DataOps has become an important methodology for a modern data analytics organization. As is the case with Agile and DevOps in traditional software development, DataOps helps recognize value sooner and achieve business goals in a more reliable way. To be successful with DataOps, companies need to learn new skills, adjust their culture, collaboration, and processes, and extend their data lakes with a set of new technical capabilities and tools.
At Grid Dynamics, we’ve helped Fortune-1000 companies adopt a DataOps culture, onboard required processes, acquire the necessary skills, and implement the needed technical capabilities. To help our clients get to insights faster, we’ve created accelerators for all necessary capabilities including data orchestration, data monitoring, data quality, anomaly detection, and continuous delivery. To learn more about the case studies on implementing these capabilities at the enterprise scale, read our whitepaper. To try our accelerators, see the demos, and discuss how to onboard them, please feel free to reach out to us.