Testing, Monitoring, and Data Observability: What’s the Difference?
This article will give you a better understanding of data quality testing, monitoring, and observability. So let's explore these concepts together.
When it comes to making your data more reliable at scale, you can take a few routes: test your most critical pipelines and hope for the best, automate data quality monitoring, or adopt end-to-end data observability.
The right solution for your needs largely depends on the following:
- The volume of data (or tables) you’re managing;
- Your company’s data maturity; and
- Whether or not data is a mission-critical piece of your broader business strategy.
Data teams are evaluating data observability and data quality monitoring solutions as they start to question if they are getting enough coverage “return” for each hour of data engineering time “invested” in data testing. In this economy, teams can’t afford to spend 30% or more of their workweek firefighting data quality issues. Their time and energy are better spent identifying new avenues of growth and revenue.
So, without further ado, let’s walk through the major differences between these three common approaches to data quality management and how some of the best teams are achieving data trust at scale.
What Is Data Testing?
One of the most common ways to discover data quality issues before they enter your production data pipeline is by testing your data. With open-source data testing tools like Great Expectations and dbt, data engineers can validate their organization’s assumptions about the data and write logic to prevent the issue from working its way downstream.
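For a concrete feel, here is a minimal sketch of that kind of validation using the older pandas-flavored Great Expectations interface. Exact method names and result objects vary across versions, and the file and column names are hypothetical placeholders rather than anything prescribed by the library.

```python
# A minimal sketch using the older pandas-flavored Great Expectations interface;
# exact APIs differ across versions. File and column names are hypothetical.
import pandas as pd
import great_expectations as ge

# Wrap a batch of data so expectation methods become available on it.
orders = ge.from_pandas(pd.read_csv("orders.csv"))

# Encode the team's assumptions about the data as expectations.
results = [
    orders.expect_column_values_to_not_be_null("order_id"),
    orders.expect_column_values_to_be_unique("order_id"),
]

# Fail loudly (e.g., in CI or an orchestration task) so bad data never moves downstream.
if not all(r.success for r in results):
    raise ValueError("Data quality expectations failed; blocking downstream loads")
```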
Advantages of Data Testing
Data testing is a must-have as part of a layered data quality strategy. It can help catch specific, known problems that surface in your data pipelines and will warn you when new data or code breaks your original assumptions.
Some common data quality tests include:
- Null Values: Are any values unknown (NULL) where there shouldn’t be?
- Schema (data type): Did a column change from an integer to a float?
- Volume: Did your data arrive? And if so, did you receive too much or too little?
- Distribution: Are numeric values within an expected/required range?
- Uniqueness: Are there any duplicate values in your unique ID fields?
- Known invariants: Is profit always the difference between revenue and cost?
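To make a few of these checks concrete, here is a hand-rolled sketch written against a pandas DataFrame. The column names (order_id, revenue, cost, profit) and the thresholds are invented placeholders, not a standard.

```python
# A hand-rolled sketch of a few of the checks listed above, using pandas.
# Column names and thresholds are hypothetical placeholders.
import pandas as pd

def run_basic_checks(df: pd.DataFrame, expected_min_rows: int = 1000) -> list[str]:
    """Return human-readable failures; an empty list means every check passed."""
    failures = []

    # Volume: did enough data arrive?
    if len(df) < expected_min_rows:
        failures.append(f"only {len(df)} rows arrived (expected >= {expected_min_rows})")

    # Schema (data type): revenue should still be numeric, not a string.
    if not pd.api.types.is_numeric_dtype(df["revenue"]):
        failures.append("revenue is no longer a numeric column")

    # Distribution: revenue should fall within an expected range.
    if not df["revenue"].between(0, 1_000_000).all():
        failures.append("revenue outside the expected 0-1,000,000 range")

    # Known invariant: profit should always equal revenue minus cost.
    if ((df["revenue"] - df["cost"]) - df["profit"]).abs().gt(0.01).any():
        failures.append("profit != revenue - cost for some rows")

    return failures
```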
One of the best use cases for data testing is to stop bad data from reaching internal- or external-facing consumers. In other words, when no data is better than bad data. Data contracts, for example, are a type of data test that prevent schema changes from causing havoc downstream. Data teams have also found success using the Airflow short-circuit operator to prevent bad data from landing in the data warehouse.
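As a rough sketch of that short-circuit pattern, the Airflow DAG below skips the warehouse load whenever a validation callable returns False. The task names and validation logic are placeholders, and the DAG-level arguments may differ slightly across Airflow versions.

```python
# Sketch of the Airflow short-circuit pattern: if validation fails,
# downstream tasks are skipped and bad data never lands in the warehouse.
# Task names and validation logic are hypothetical.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator, ShortCircuitOperator

def staging_data_looks_healthy(**context) -> bool:
    # Run whatever checks you need here (row counts, null rates, schema checks).
    # Returning False short-circuits the DAG.
    return True  # placeholder

def load_to_warehouse(**context) -> None:
    # Copy validated data from staging into production tables.
    pass

with DAG(
    dag_id="orders_pipeline",
    start_date=datetime(2023, 1, 1),
    schedule="@daily",  # "schedule_interval" in older Airflow 2.x releases
    catchup=False,
) as dag:
    validate = ShortCircuitOperator(
        task_id="validate_staging_data",
        python_callable=staging_data_looks_healthy,
    )
    load = PythonOperator(
        task_id="load_to_warehouse",
        python_callable=load_to_warehouse,
    )
    validate >> load
```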
The other obvious advantage of data testing, especially for smaller data teams, is that it's free. There is no cost to writing code, and open-source testing solutions are plentiful. The less obvious point, however, is that data testing is still almost always the worst economic choice due to the incredible inefficiencies and engineering time wasted (more on this in the next section).
Disadvantages of Data Testing
Is data testing sufficient to create a level of data quality that earns the trust of your data consumers? Almost certainly not.
Data testing requires data engineers to have enough specialized domain knowledge to understand the common characteristics, constraints, and thresholds of the data set. It also requires foresight, or painful experience, of the most likely ways that data could break badly.
If a test isn't written ahead of time to cover a specific incident, no alert is raised. This is why there needs to be a way to be alerted to issues you might not expect to happen (unknown unknowns).
“We had a lot of dbt tests. We had a decent number of other checks that we would run, whether they were manual or automated, but there was always this lingering feeling in the back of my mind that some data pipelines were probably broken somewhere in some way, but I just didn’t have a test written for it,” said Nick Johnson, Dr. Squatch VP of Data, IT, and Security.
In addition to providing poor overall coverage, data testing is not feasible at scale. Applying even the most common, basic tests programmatically can still be radically inefficient, even if your data team were diligent about adding them to each data set the minute it is created…which most teams are not.
Even if you could clone yourself – doubling your capacity to write tests – would this be a good use of your company’s time, resources, and budget?
Another reason data testing is incredibly inefficient is that the table where the incident is detected rarely contains the context of its own data downtime. In other words, a test might alert you to the fact the data isn’t there, but it can’t tell you why. A long and tedious root cause analysis process is still required.
Data is fundamental to the success of your business. Therefore, it needs to be reliable and performant at all times. In the same way that Site Reliability Engineers (SREs) manage application downtime, data engineers need to focus on reducing data downtime: periods when data is missing, inaccurate, or otherwise erroneous.
Much in the same way that unit and integration testing is insufficient for ensuring software reliability (hence the rise of companies like Datadog and New Relic), data testing alone cannot solve for data reliability.
The bottom line: testing is a great place to get started, but it can only scale so far, so quickly. When data debt, model bloat, and other side effects of growth become a reality, it’s time to invest in automation. Enter data observability and data quality monitoring.
What Is Data Quality Monitoring?
Data quality monitoring uses machine learning to automatically detect anomalies in your data, such as unexpected spikes in NULL rates or values outside their usual range, without requiring engineers to write a test for every scenario in advance.
Advantages of Data Quality Monitoring
Data quality monitoring solves a few of the gaps created by data testing. For one, it provides better coverage, especially against unknown unknowns.
That's because machine learning algorithms typically do a better job than humans at setting thresholds for what is anomalous. It's easy to write a no-NULLs data test, but what if a certain percentage of NULL values is acceptable? And what if that acceptable range gradually shifts over time?
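To make that concrete, here is a deliberately simplified stand-in for what a monitoring tool's model does: learn an acceptable NULL-rate range from recent history and flag values that fall well outside it. Real products use far more sophisticated, seasonality-aware models; the numbers below are invented.

```python
# A simplified stand-in for an ML-based monitor: learn an acceptable NULL-rate
# range from a rolling window of history, so the baseline adapts as the
# "normal" rate drifts over time.
import pandas as pd

def is_null_rate_anomalous(history: pd.Series, todays_rate: float, sigmas: float = 3.0) -> bool:
    """history: daily NULL-rate observations (0.0-1.0), most recent last."""
    recent = history.tail(30)
    mean, std = recent.mean(), recent.std()
    # Flag today's rate only if it falls well outside the learned range.
    return abs(todays_rate - mean) > sigmas * max(std, 1e-6)

# Example: a NULL rate that has hovered around 5% suddenly jumps to 40%.
history = pd.Series([0.05, 0.06, 0.04, 0.05, 0.07, 0.05, 0.06] * 5)
print(is_null_rate_anomalous(history, todays_rate=0.40))  # True
```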
Data quality monitoring also frees data engineers from having to write or clone hundreds of data tests for each data set.
The Disadvantages of Data Quality Monitoring
Data quality monitoring can be an effective point solution when you know exactly what tables need to be covered. However, when we talk with data teams interested in applying data quality monitoring narrowly across only a specific set of key tables, we typically advise against it.
The thought process makes sense: out of the hundreds or thousands of tables in their environment, most of the business value will typically derive from 20% of them or less, if we follow the Pareto principle.
However, identifying your important tables is easier said than done, as teams, platforms, and consumption patterns are constantly shifting. Also, remember when we said tables rarely contain the context of their own data downtime? That still applies.
As data environments grow increasingly complex and interconnected, the need to cover more bases faster becomes critical to achieving true data reliability. To do this, you need a solution that drills "deep" into the data using both machine learning and user-defined rules, as well as metadata monitors that scale "broadly" across every production table in your environment, all fully integrated across your stack.
Additionally, like testing, data quality monitoring is limited when it comes to fixing the problem. Teams still need to dedicate time and resources to conduct root cause analysis, impact analysis, and incident mitigation in a manual, time-intensive way. With data quality monitoring, teams are often blind to the impact of an incident on downstream consumers, making triaging based on severity essentially impossible.
So, if testing and data quality monitoring aren't sufficient for creating data trust, what comes next? Two words: data observability.
What Is Data Observability?
Data observability is an organization’s ability to fully understand the health of the data in their systems. Data observability eliminates data downtime by applying best practices learned from DevOps to data pipeline observability.
Data observability tools use automated testing and monitoring, automated root cause analysis, data lineage, and data health insights to detect, resolve, and prevent data anomalies.
The five pillars of data observability are:
1. Freshness: Freshness seeks to understand how up-to-date your data tables are and the cadence at which your tables are updated. Freshness is particularly important when it comes to decision-making; after all, stale data is basically synonymous with wasted time and money.
2. Quality: Your data pipelines might be in working order, but the data flowing through them could be garbage. The quality pillar looks at the data itself and aspects such as percent NULLS, percent unique, and if your data is within an accepted range. Quality gives you insight into whether or not your tables can be trusted based on what can be expected from your data.
3. Volume: Volume refers to the completeness of your data tables and offers insights into the health of your data sources. If 200 million rows suddenly turn into five million, you should know.
4. Schema: Changes in the organization of your data (its schema) often indicate broken data. Monitoring who makes changes to these tables, and when, is foundational to understanding the health of your data ecosystem.
5. Lineage: When data breaks, the first question is always "Where?" Data lineage provides the answer by telling you which upstream sources and downstream ingestors were impacted, which teams are generating the data, and who is accessing it. Good lineage also collects information about the data (also called metadata) that speaks to governance, business, and technical guidelines associated with specific data tables, serving as a single source of truth for all consumers.
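As a rough illustration of how the freshness and volume pillars can be monitored from metadata alone, the sketch below assumes a warehouse (such as Snowflake) that exposes last_altered and row_count in information_schema.tables. The connection object, SLA, and thresholds are placeholders, and parameter style varies by database driver.

```python
# Sketch of metadata-based freshness and volume monitoring. Assumes a warehouse
# that exposes LAST_ALTERED and ROW_COUNT in information_schema.tables (Snowflake
# does); the connection object is a stand-in for your DB-API client.
from datetime import datetime, timedelta, timezone

FRESHNESS_SLA = timedelta(hours=24)   # table should update at least daily
MIN_EXPECTED_ROWS = 1_000_000         # rough lower bound for a healthy table

def check_table_health(conn, schema: str, table: str) -> list[str]:
    cur = conn.cursor()
    cur.execute(
        """
        SELECT last_altered, row_count
        FROM information_schema.tables
        WHERE table_schema = %s AND table_name = %s
        """,
        (schema.upper(), table.upper()),
    )
    last_altered, row_count = cur.fetchone()

    alerts = []
    if datetime.now(timezone.utc) - last_altered > FRESHNESS_SLA:
        alerts.append(f"{schema}.{table} has not been updated within {FRESHNESS_SLA}")
    if row_count < MIN_EXPECTED_ROWS:
        alerts.append(f"{schema}.{table} has only {row_count} rows (expected >= {MIN_EXPECTED_ROWS})")
    return alerts
```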
Data observability solutions free up company time, revenue, and resources by providing out-of-the-box coverage for freshness, volume, and schema changes across all of your critical data sets and the ability to set custom rules for “deeper” quality checks, like distribution anomalies or field health. More advanced data observability tools also offer programmatic functionality to rule-setting, with the ability to write monitors and incident notifications as code.
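Purely as an illustration of the monitors-as-code idea, the configuration below is expressed as a Python structure that could live in version control. The keys and values are hypothetical and not tied to any particular vendor's format.

```python
# Hypothetical "monitors as code": declarative rules and notification routes
# checked into version control alongside the pipelines they protect.
MONITORS = [
    {
        "table": "analytics.orders",
        "type": "freshness",
        "expect_update_within": "24h",
        "notify": ["#data-alerts"],
    },
    {
        "table": "analytics.orders",
        "type": "field_health",
        "field": "discount_pct",
        "expect_between": [0, 100],
        "notify": ["#data-alerts", "oncall@company.com"],
    },
]
```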
Here are three critical ways observability differs from testing and monitoring your data:
1. End-to-End Coverage
Modern data environments are highly complex, and for most data teams, creating and maintaining a robust, high-coverage testing suite or manually configuring every data quality monitor is neither possible nor desirable.
Due to the sheer complexity of data, it is improbable that data engineers can anticipate all eventualities during development. Where testing falls short, observability fills the gap, providing an additional layer of visibility into your entire data stack.
2. Scalability
Writing, maintaining, and deleting tests to adapt to the needs of the business can be challenging, particularly as data teams grow and companies distribute data across multiple domains. A common theme we hear from data engineers is the complexity of scaling data tests across their data ecosystem.
If you are constantly going back to existing pipelines to add test coverage, it's likely a massive investment of your data team's time.
And, when crucial knowledge regarding data pipelines lies with a few members of your data team, retroactively addressing testing debt will become a nuisance—and result in diverting resources and energy that could have otherwise been spent on projects that move the needle.
Observability can help mitigate some of the challenges that come with scaling data reliability across your pipelines. By using an ML-based approach to learn from past data and monitor new incoming data, data teams can gain visibility into existing data pipelines with little to no investment. With the ability to set custom monitors, your team can also ensure that unique business cases and crucial assets are covered.
3. Root Cause and Impact Analysis
Testing can help catch issues before data enters production. But even the most robust tests can’t catch every possible problem that may arise—data downtime will still occur. And when it does, observability makes it faster and easier for teams to solve problems that happen at every stage of the data lifecycle.
Since data observability includes end-to-end data lineage, when data breaks, data engineers can quickly identify upstream and downstream dependencies. This makes root cause analysis much faster and helps team members proactively notify impacted business users and keep them in the loop while they work to troubleshoot the problem.
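As a toy illustration of how lineage powers this, the sketch below models tables and dashboards as a directed graph and walks it in both directions: upstream to find root-cause candidates, downstream to find the assets and owners to notify. The edges are made-up examples.

```python
# Toy lineage graph: walk upstream for root cause candidates and downstream
# for impact analysis. Node names are hypothetical examples.
import networkx as nx

lineage = nx.DiGraph()
lineage.add_edges_from([
    ("raw.orders", "staging.orders"),
    ("raw.customers", "staging.orders"),
    ("staging.orders", "analytics.daily_revenue"),
    ("analytics.daily_revenue", "dashboard.exec_kpis"),
])

broken_table = "staging.orders"

print("Check upstream for the root cause:", nx.ancestors(lineage, broken_table))
print("Notify owners of impacted assets:", nx.descendants(lineage, broken_table))
```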
Data observability acts as your insurance policy against bad data and accelerates the pace at which your team can deploy new pipelines. Not only does it cover those unknown unknowns, but it also helps you catch known issues that scale with your data.
The Future of Data Reliability
Ultimately, the path you choose is up to you.
Data trust is at the heart of the modern data stack, whether that means enabling more self-service analytics, driving data team collaboration and adoption, or ensuring critical data is usable by stakeholders. And in this age of tight budgets, lean teams, and short timelines, the sooner we can deliver trusted data, the better.