Data quality monitoring made easy
Jul 16, 2021 • 11 min read
It’s no secret that leading a healthy lifestyle improves your overall quality of life. A healthy lifestyle involves following a good diet, limiting excessive alcohol or smoking, taking the time to exercise, and getting enough sleep.
This same philosophy can be applied to the way your business treats its data. The better the quality of the data, the better the overall health of the business. According to IBM research, data quality problems cost U.S. businesses more than $3 trillion annually. This is supported by findings from Gartner that indicate company losses from data quality issues average $15M per year. How much is your company contributing to these figures?
For more details on why data quality is so important, refer to our previous post on the issue. The good news is that focusing on rule-based monitoring to improve the quality of your data can boost revenues by up to 70 percent and sales by almost a third.
This article covers the following key questions that can be addressed with rule-based data monitoring:
Data monitoring is essentially about gaining knowledge of different aspects of your data. This can include data volume, velocity, consistency, flow, and mapping. But what people are usually trying to answer during the monitoring process is one question: “Is my data good enough right now or not?”
The following types of data monitoring can be considered:
There are a number of techniques that can be used to measure data quality, with their suitability dependent on the type of data in use. In some cases, some techniques can still be used but with certain limitations.
Going back to the analogy of maintaining a healthy lifestyle - is it enough to visit the gym once a month? Or use your bike only on the 22nd of September (World Car Free Day)? The same principles apply to data. It’s not enough to just check if it's correct once, at the moment when a new data source is first attached.
When you have a rule in place to validate a piece of data, it's a good idea to run this validation again from time to time. Usually you’ll have sets of rules or checks you would like to run against some table or dataframe. Hopefully such checks are easy to automate. Logically, the next step is to attach these automated runs to some continuous system (scheduler) and make it a part of the ETL pipelines.
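As a minimal sketch of what such a set of rules might look like, the snippet below runs a few named checks against an in-memory table. The `Rule` and `run_rules` names, the sample `orders` data, and the rules themselves are assumptions made for illustration; they do not describe any particular tool.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rule:
    """A named check that takes all rows of a table and returns pass/fail."""
    name: str
    check: Callable[[list], bool]


def run_rules(rows, rules):
    """Run every rule against the rows and return (rule name, passed) pairs."""
    return [(rule.name, bool(rule.check(rows))) for rule in rules]


# Hypothetical rules for a hypothetical "orders" table.
rules = [
    Rule("table is not empty", lambda rows: len(rows) > 0),
    Rule("id is never null", lambda rows: all(r.get("id") is not None for r in rows)),
    Rule("id is unique", lambda rows: len({r.get("id") for r in rows}) == len(rows)),
    Rule(
        "amount is non-negative",
        lambda rows: all(r["amount"] >= 0 for r in rows if r.get("amount") is not None),
    ),
]

# Sample data with a duplicated id and a negative amount.
orders = [
    {"id": 1, "amount": 10.0},
    {"id": 2, "amount": 25.5},
    {"id": 2, "amount": -3.0},
]

results = run_rules(orders, rules)
failed = [name for name, passed in results if not passed]
print(failed)
```

Because `run_rules` is a plain function, the same call can be made from a scheduler or as a step in an ETL job, which is what makes repeated runs cheap.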
Once you start monitoring your data, you can identify issues such as master and derived data sources being updated simultaneously, or your system trying to ingest too much data at once instead of chunking it.
The following example illustrates the necessity of regular data monitoring. There was a system that collected data from several heterogeneous data sources. These included internal databases, data taken from third-party services, and some spreadsheets. The system synchronized data with the data sources every 15 to 20 minutes.
Occasionally, some of the users would notice that part of the expected data was missing. They asked the team responsible for this tool to check what was going on. This took some time: the delay between the request and the moment engineers started looking into the issue was about 30 to 45 minutes. During that time, one or two data-sync jobs had already run. Ultimately, the engineers determined that all the data was there and said that everything was ok. The users checked this and confirmed that was the case, but the overall cause of the issue remained unclear.
Once regular data monitoring was added, it showed that once every day or two, some tables were indeed disappearing during data sync. Now the engineering team was notified as soon as it happened, and the evidence was right there. This meant that the cause of the issue could be identified and properly addressed.
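A check of this kind can be very small. The sketch below uses SQLite from the Python standard library purely as a stand-in for the real storage, and the table names, `missing_tables`, and `check_and_notify` are hypothetical. It compares the tables present after a sync with the expected ones and calls a notification hook when something is missing.

```python
import sqlite3


def missing_tables(conn, expected):
    """Return the expected table names that are absent from the database."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'"
    ).fetchall()
    present = {name for (name,) in rows}
    return sorted(set(expected) - present)


def check_and_notify(conn, expected, notify):
    """Run the presence check and call notify(message) if tables are missing."""
    missing = missing_tables(conn, expected)
    if missing:
        notify(f"Tables missing after sync: {', '.join(missing)}")
    return missing


# Simulate a sync that lost the (hypothetical) "invoices" table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER)")
conn.execute("CREATE TABLE orders (id INTEGER)")

alerts = []  # stand-in for an email, chat, or pager integration
check_and_notify(conn, ["customers", "orders", "invoices"], alerts.append)
print(alerts)
```

Run right after every sync job, a check like this records the failure at the moment it happens instead of 30 to 45 minutes later.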
It’s always hard to take the first step. If you’ve decided to start running or cycling, you don’t begin with marathon distances or steep mountain tracks. When you start monitoring to increase data quality, the same step-by-step approach applies. Go rule by rule, from simple ones to more complex business rules.
Here are some key recommendations:
At Grid Dynamics we conducted a review of a number of tools that can help to apply static rules to control data and determined that they each have their pros and cons.
There are 3 types of software:
As a result of this finding, we set out to build our own accelerator that combined the advantages of each type outlined above. We now use it within the company to monitor our own data. It implements rule based techniques and can be used across different types of data.
In contrast to powerful enterprise solutions, our approach does not force us to change the process and is easy to deploy and configure. You also don’t need to write any code to start using it. As with in-house data monitoring approaches, it can be tightly integrated into your data pipelines. In addition, it uses a unified and systematic approach, in contrast to the usual in-house zoo of utilities, tools, scripts, and configurations.
Being based on a mainstream open-source stack of technologies (Apache Spark, PostgreSQL, Elasticsearch, and Grafana), our approach shares their benefits: the majority of developers and DevOps engineers are familiar with these technologies, and the solution can be extended by writing custom modules.
Continuous data monitoring is a great habit to establish. Once you see that it detects and prevents issues, and that this leads to cost savings, you should feel comfortable investing further in it. Using an automated approach increases the chances that you will retain data monitoring as part of your standard data processing procedures.
If you have complex, multi-step ETL pipelines, it’s worth monitoring data at the end of each job. Such validation rules can be declared as ‘must have’ acceptance criteria for development.
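One way to picture validation at the end of each job is the sketch below, where every pipeline step carries its own list of rules and the pipeline stops as soon as a rule fails. The `run_pipeline` helper, the `ValidationError` class, and the two sample jobs are assumptions for illustration, not part of any specific framework.

```python
class ValidationError(Exception):
    """Raised when the output of a job violates one of its rules."""


def run_pipeline(data, steps):
    """Run each (job, checks) step in order, validating every job's output."""
    for job, checks in steps:
        data = job(data)
        for name, check in checks:
            if not check(data):
                raise ValidationError(f"{job.__name__}: rule failed: {name}")
    return data


def parse(lines):
    """Hypothetical job: turn 'id, city' lines into records."""
    records = []
    for line in lines:
        raw_id, city = line.split(",")
        records.append({"id": int(raw_id), "city": city.strip()})
    return records


def deduplicate(rows):
    """Hypothetical job: keep the first record seen for each id."""
    seen, unique = set(), []
    for row in rows:
        if row["id"] not in seen:
            seen.add(row["id"])
            unique.append(row)
    return unique


steps = [
    (parse, [("output is not empty", lambda rows: len(rows) > 0)]),
    (deduplicate, [("id is unique", lambda rows: len({r["id"] for r in rows}) == len(rows))]),
]

clean = run_pipeline(["1, Kyiv", "2, Austin", "2, Austin"], steps)
print(clean)
```

Attaching the rules to the step that produces the data means a failure points directly at the job responsible for it.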
Regular reports and dashboards shared with all stakeholders are also a good idea and will help to anchor the practice of regular data checking.
Regular data monitoring gives you a necessary level of confidence about your data. In most cases, if something goes wrong, you can be sure the reason is not in the data, or you can easily prove it by adding more rules.
In addition, you know that your data is ready to be used by B2B solutions and no additional effort is needed to make it stable. Otherwise, you’ll need to be ready to face issues during the integration stage, which usually takes more time if the data is not properly controlled. Sometimes it takes weeks to find an appropriate workaround, which can significantly delay the release to production.
Finally, when issues related to data appear, you will already be in a good position to speed up the defect detection phase. It also simplifies the process of validating that everything is ok once it’s been fixed.
Q: We don’t have Big Data, just some data. Will your approach help us?
A: Obviously, the Big Data area is the main target. However, in our experience, Grid Dynamics’ approach to data monitoring has a wide range of applications and can be used for:
Q: We have our own infrastructure. Should we be prepared to spend significant amounts of money on additional infrastructure?
A: There’s no way around it - additional computation requires additional infrastructure. So, some additional expenses should be expected. However, because we already use this tooling widely in our organization, we are in a position to make it as cost effective as possible. In most cases only one additional host is needed. Your Spark cluster, as well as the data sources themselves, can be used to process the data. Depending on the number of checks required, the current clusters may also need to be extended.
Q: All our developers are pretty busy. How much time do we need to plan to code the data monitoring rules?
A: It depends on what types of data sources you have. If the data connectors that the solution has out of the box aren’t adequate, you can plug in your own. You need about 1-2 days to implement the first custom connector, or our engineers can assist with this. The majority of the rules can be created by data analysts or data quality engineers, so you don’t need developers to do it.
Q: All our developers are pretty busy and there is no time to change data processing pipelines to add data quality checks. What should we do?
A: Calling the set of checks is as simple as sending a GET request over HTTP. Moreover, in the earlier stages you can follow the ‘Sidecar’ pattern, so no changes are required in your pipelines. However, to realize more benefits it’s better to plan to add data quality jobs into your pipelines.
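To give a feel for how little code such a call needs, here is a sketch using only the Python standard library. The host, the `/checks/<name>/run` path, and the JSON response are all hypothetical, since this article does not document the actual endpoint; only the idea of "one GET request" comes from the answer above.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen


def checks_url(base_url, check_set):
    """Build the URL for a named set of checks (hypothetical path layout)."""
    return f"{base_url.rstrip('/')}/checks/{quote(check_set, safe='')}/run"


def trigger_checks(base_url, check_set, timeout=10.0):
    """Send the GET request and decode the response, assumed here to be JSON."""
    with urlopen(checks_url(base_url, check_set), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Example call, commented out because the host below is made up:
# report = trigger_checks("http://monitoring.example.internal:8080", "orders daily")
print(checks_url("http://monitoring.example.internal:8080", "orders daily"))
```

A call like `trigger_checks` can be issued from a scheduler or a separate process next to the pipeline, which fits the ‘Sidecar’ approach of leaving the pipeline code untouched.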
A large number of decisions are made based on the data contained within organizations. And in a digital world companies also share their data with many other companies and SaaS solutions. So right now is the best time to establish a healthy path for your data.
While you don’t have to take advantage of our solution for your data monitoring, we do strongly encourage you to at least use some kind of data monitoring solution available from the market. You can contact us if you have any questions about data quality or monitoring, and we’re always ready to discuss how you can achieve better results in this important area.