Overall, the end goal of data quality is simple: as data engineers, we want to stop firefighting bad data, so we can spend more time building robust and scalable systems. Having complete trust in our data, and knowing that we’ll catch any data quality failures before downstream data consumers do is one of our top priorities.
Upping our data quality game means moving closer to this goal of complete trust in data. In the following article, we’ll cover four key ideas on how to achieve this: (1) connecting to data sources end-to-end, (2) using pre-configured monitors for easier setup and maintenance, (3) validating everything comprehensively, and (4) getting the right type of notifications in case of failures.
Let’s dive in!
Validio’s framework for taking data quality to the next level if you’re already using dbt tests.
When using dbt tests, we know the functionality is warehouse-focused: we apply user-defined, rule-based tests to columns in our data warehouse tables as part of the “T” in the Extract-Load-Transform (ELT) process.
However, the data quality platform should ideally also support streams and object stores (i.e. data lakes) in addition to data warehouse tables. In this way, all data sources can be monitored in a one-stop-shop, making us better positioned to discover data quality failures across the entire pipeline, even before they enter our warehouse.
To illustrate, many of our clients use data from third-party event streams and APIs that arrives in JSON format and lands in an object store. In setups that rely on dbt tests, the data must travel even further and be pushed to a data warehouse before it can be validated. This means data flows quite far from the source before any validation is applied, delaying the point at which data engineers can intervene in the case of a failure. Ideally, data should therefore be validated already in the event stream and object store, which is not possible with dbt tests.
On a similar note, for many machine learning use cases (and other operational use cases) the data never touches a data warehouse. Rather, these use cases often rely solely on streams or object stores. That means that when moving beyond reporting dashboards and analytics use cases into the realm of advanced analytics and machine learning, validating data in the warehouse is not enough.
Contrastingly, Validio can be tool and platform agnostic when applied to each part of the data pipeline: before, during and after transformation, regardless of whether the data is accessible from a warehouse table, object store or stream. This unlocks the power of streamed data for real-time decision making and machine learning use cases based on data that can be trusted.
All to say, Validio enables data quality validation to be applied to a much broader set of business critical data sources than dbt tests allow.
Now let’s take a look at the setup process. By this, we mean the configuration process any team must go through to put validation in place for critical data.
In dbt tests, each test needs to be implemented manually as part of each model. Add to this the complexity that arises when the number of data sources explodes and when those data sources change over time. Any alerting thresholds would need to be manually updated across massive amounts of tables and columns.
To illustrate the magnitude of this challenge, let’s say we have 100 important tables in our data warehouse, and we want basic test coverage in place for all of them. With at least three basic tests per column and only ten columns per table, that’s already 3,000 tests. Add to this that it’s often unclear what the thresholds should be (see the section on auto-thresholding below), which means lots of manual tweaking to get them right. In this situation, we’re looking at more than 100 hours just to get the basics in place. And we haven’t even begun to discuss dynamic or seasonal data that require tedious threshold maintenance over time.
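The back-of-the-envelope arithmetic behind that estimate can be made explicit. The minutes-per-test figure below is an assumption for illustration only:

```python
# Rough workload estimate for manually writing basic dbt tests.
# minutes_per_test is an assumed figure, not a measured one.
tables = 100
columns_per_table = 10
tests_per_column = 3
minutes_per_test = 2  # assumed: write one test and tune its threshold

total_tests = tables * columns_per_table * tests_per_column
total_hours = total_tests * minutes_per_test / 60

print(f"{total_tests} tests, ~{total_hours:.0f} hours of setup work")
# → 3000 tests, ~100 hours of setup work
```

Even at an optimistic two minutes per test, the basics alone consume roughly 100 engineering hours, before any ongoing maintenance.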
Manually creating monitors for many tables can easily become an unfeasible workload.
With Validio, similar monitors can be quickly configured in a graphical user interface. On the platform, we support many types of validations that fall into two buckets: auto-thresholding and rule-based ones. Let’s take a look at them.
Auto-thresholding means Validio can learn the expected ranges of the data. Based on that, Validio can alert when sudden changes fall outside those expectations (the magic behind the scenes is machine learning models). In addition, auto-thresholding adapts to trends and seasonality in the data, which is critical for businesses across a wide variety of industries.
For data teams, this means not having to spend time manually setting or maintaining alert thresholds like with dbt tests. This auto-thresholding is particularly useful when detailed domain knowledge is not available. For example, we may not know what a reasonable rate of data distribution shift is for a specific feature. We may still want to know, however, when the rate of change fluctuates significantly. Auto-thresholding allows for exactly this.
Example of auto-threshold applied to the mean of a feature where a failure was discovered. On the top half, notice the history of the metric (mean in this case). On the lower half is the same history of the metric (but on a different scale), as well as the permitted regions (shown in light green). The permitted region is predicted by leveraging a machine learning model based on the history of the metric—this is the auto-thresholding feature. When a data point is observed outside of the permitted region, it’s considered a potential data quality failure (which shows up as a red point). How much and how fast the auto-threshold changes can easily be tweaked by the user with the sensitivity and smoothing parameters.
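To build intuition for what an auto-threshold does, here is a deliberately simple stand-in: flag any value that falls outside a rolling mean ± sensitivity·std band computed from recent history. The function and its parameters are illustrative only; Validio’s actual models also account for trend and seasonality:

```python
from statistics import mean, stdev

def auto_threshold_flags(series, window=20, sensitivity=3.0):
    """Flag indices whose value falls outside mean ± sensitivity * std
    of the preceding `window` observations. A crude stand-in for a
    learned, seasonality-aware threshold."""
    flags = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if abs(series[i] - mu) > sensitivity * sigma:
            flags.append(i)
    return flags

# A stable metric with one sudden jump at index 25
metric = [10.0 + 0.1 * (i % 5) for i in range(30)]
metric[25] = 50.0
print(auto_threshold_flags(metric))  # → [25]
```

The `sensitivity` parameter plays the same role as the sensitivity knob described in the caption above: a higher value widens the permitted region and yields fewer alerts.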
We naturally think auto-thresholding is a critical feature for any data team. However, we recognize that sometimes, good old rule-based validation is needed, especially in order to capture domain specific knowledge. Validio also supports this out of the box.
Because Validio is decoupled from the warehouse and specifically from any dbt models, ownership of data quality can be delegated more easily and handled independently of the transformation code. In essence, it means that data quality can be owned by anyone in the organization via Validio’s web-based user interface, not just by data engineers when implementing code (the tool is currently designed for data engineers, but the user group will be expanded further down the line). This unlocks a powerful paradigm where validation can be co-owned by data consumers as opposed to data producers only. This is especially valuable when data consumers have specific domain knowledge about how the data should behave, and are uniquely positioned to set up these rule-based validations.
All in all, Validio’s powerful and decoupled rule-based and auto-threshold monitors can save teams enormous amounts of time by avoiding cumbersome setup and maintenance. It can also help avoid back-and-forth communication between domain experts and data teams on what sensible rules should be. This way, there’s more time to spend on the fun data engineering stuff, like implementing scalable and robust data systems, even when the size and complexity of data systems grow.
Now that we’ve covered data source connections and validation setup, we’re ready to move into the most important part of upping the data quality game: the actual validations that need to be done on data. This section is longer than others, but it’s for good reason—a data quality solution is only as good as the data quality failures that can be caught, which in turn depends on what can be validated in the data.
We’ll start by looking briefly at how dbt tests work in terms of what can be validated vs. what can’t. We’ll then dive into what a next-generation data quality platform should be able to offer for comprehensive validation and how Validio can help you achieve this.
dbt tests do a great job of providing a library of basic testing like checking for nulls, that values in a column are unique, that values in a column belong to a specific set of allowed values and that the mean value of a column lies within a predefined range.
With that said, dbt tests can of course also be configured to perform more advanced statistical tests, as long as they can be translated into SQL queries, as evidenced by the plethora of extension packages available for testing, such as dbt-expectations. However, the more sophisticated the tests, the deeper the engineering skills required to write them. But exactly what degree of sophistication is needed to take data quality to the next level?
In order to answer this question, it’s helpful to introduce a framework to structure our thinking. Inspired by former US Secretary of Defense Donald Rumsfeld, we like to think about data quality failures along two dimensions: level of awareness of the failures and level of understanding.
The Rumsfeld matrix for data quality. Data quality failures are either known or unknown (level of awareness) and they’re either understood or not understood (level of understanding). Next-generation data quality should enable us to catch and fix data quality issues regardless of location in this matrix.
The goal of any organization should be to catch failures (i.e. to make sure awareness is never “unknown”) and to understand failures (i.e. to make sure understanding is never “unknown”). Given the complexity of data pipelines and business logic in modern businesses, this isn’t an easy task, especially for data quality failures that are “unknown.”
The same type of data quality failure might fall into different categories for different companies, so it’s not helpful to exemplify the categories with specific scenarios. We can, however, help paint a picture of what some situations might look like:
“Known known” data quality failures mean that we know about the failures and understand them. It might be a situation where a data engineer says something like, “I know that the price of any item_id cannot be lower than 0, although I know it sometimes happens because of a system error, so we should make sure to check for it before that data enters the warehouse.”
Conversely, an “unknown unknown” is a failure we don’t know about and don’t understand. A situation might look something like the following: For the past two months, a retail organization has not been recording any transaction data for customers that have less than two contact persons. No one knows about this hidden loss of data, and no one understands anything about what is causing the problem.
In our experience, companies have plenty of “unknown unknown” data quality failures that in many cases, can have a large impact on the bottom line if they lead to poor or incorrect decision-making.
We won’t cover examples of “known unknowns” and “unknown knowns” here, but for a deep dive on this framework and real-life examples, feel free to check out one of our earlier blog posts.
With the complexity of most modern organizations’ data pipelines and use cases, it’s safe to say that a data quality platform should be able to comprehensively catch both known and unknown data quality failures. Since unknown data quality failures per definition imply that the organization either does not understand or does not know about the potential data quality failure, it’s not possible to manually write rule-based tests to identify them. To cope with the cases of unknown data quality failures, comprehensive end-to-end validation with auto-thresholding is needed.
Specifically, there are three aspects that a comprehensive data quality platform must be able to handle in order to effectively find “unknown unknowns”: partitioning; univariate and multivariate support; and metadata, dataset and data point level validation. Let’s take a closer look at all three.
The first capability required in order to take data quality to the next level is partitioning. This is what helps us compare apples to apples in our data.
Let’s say a retail organization works with customers across the globe and wants to monitor its price data to ensure items are properly priced. One column holds the feature “price” and another holds “currency”. Monitoring the price column as a whole makes very little sense, since prices differ by orders of magnitude across currencies. Before running a data quality validation to catch anomalies, the data must be partitioned by currency. Just consider the difference in magnitude when the very same price for a specific item is expressed in USD versus Iranian rial, where 1 USD ≈ 40,000 rial. Without partitioning, we’d literally be comparing apples with cars. And this is a simple example; in reality, more variables might require partitioning, such as product category, brand, etc.
Illustration of why partitioning is needed—a price of 39.9 IRR is significantly lower than expected, but would have not been discovered if compared to the USD prices.
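A minimal sketch makes the point concrete. The toy data and the simple median-ratio rule below are illustrative assumptions, not Validio functionality; notice how the unpartitioned check flags the legitimate IRR prices as outliers while missing the genuinely bad 39.9 IRR price:

```python
from collections import defaultdict
from statistics import median

# Toy price rows: (price, currency). IRR prices are ~40,000x USD prices.
rows = [(9.9, "USD"), (10.5, "USD"), (11.0, "USD"),
        (405_000.0, "IRR"), (398_000.0, "IRR"), (39.9, "IRR")]

def flag_out_of_range(prices, low=0.1, high=10.0):
    """Flag prices below low*median or above high*median of the group."""
    m = median(prices)
    return [p for p in prices if p < low * m or p > high * m]

def partitioned_check(rows):
    """Run the same range check per currency partition."""
    partitions = defaultdict(list)
    for price, currency in rows:
        partitions[currency].append(price)
    return {cur: flag_out_of_range(vals) for cur, vals in partitions.items()}

# Unpartitioned: the global median mixes currencies, so normal IRR
# prices look extreme and the bad 39.9 IRR price passes unnoticed.
print(flag_out_of_range([p for p, _ in rows]))  # → [405000.0, 398000.0]
# Partitioned: only the genuinely suspicious price is flagged.
print(partitioned_check(rows))  # → {'USD': [], 'IRR': [39.9]}
```

With partitioning, the same simple rule both stops producing false alarms and catches the failure it previously missed.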
To do this with dbt tests, a new table would typically be created per partition (i.e. per currency). A specific dbt test would then be written for each partition, with a reasonable threshold set based on that partition’s distribution. It’s a cumbersome process, but might be manageable for up to ten partitions. In practice, though, the combinatorial explosion that results from partitioning on several variables, such as country, product and currency, can yield hundreds or even thousands of partitions. With just 50 countries and 500 products, we’re already talking about 25,000 partitions.
In summary, it’s safe to say that dbt tests are not designed for these types of use cases, even though they are possible to carry out in theory. In effect, the threshold to implement these more advanced tests is unreasonably high in a world with limited and precious engineering time.
Conversely, Validio has partitioning capabilities as one of the core elements of its platform. These are easy to set up as part of configuring validation of any data source. This means that as data engineers, we get powerful next-generation data quality validation at a fraction of the time investment that otherwise would have been required.
The next capability needed in order to up data quality is support for univariate and multivariate analysis. Univariate analysis is quite straightforward as it revolves around looking at each column or dimension in the data in isolation. Conversely, multivariate analysis would include things like checking for changes in relationships between several data features.
In dbt tests, univariate analysis can be easily implemented. Metrics like the mean and standard deviation of a column can be checked to make sure they stay within some specified interval. However, given that SQL is most commonly used for querying and computing things one feature at a time, SQL-based tools such as dbt tests are not purpose-built for multivariate analyses (even though they are possible in theory if one puts in the effort to write those tests). This means the threshold for implementing meaningful multivariate data quality validation is high.
With the Validio platform, on the other hand, multivariate analysis comes out of the box so setting up multivariate validation is straightforward. Furthermore, as data engineers, we might know that e.g. correlation rate is important, but we might not know what the typical rate should be without spending significant time either analyzing the data or working out thresholds together with domain experts. In this case, as mentioned earlier, we could create an auto-threshold that learns over time and adapts to seasonality and trends.
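As a rough sketch of what a multivariate validation might look like, the snippet below alerts when the correlation between two features drifts away from a historical baseline. The function names, the fixed baseline and the tolerance are all illustrative assumptions; in practice an auto-threshold would learn the bound instead of hard-coding it:

```python
from statistics import mean

def correlation(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)

def correlation_shift_alert(baseline_corr, xs, ys, tolerance=0.3):
    """Alert when the correlation between two features drifts more than
    `tolerance` from its baseline. Illustrative sketch only."""
    return abs(correlation(xs, ys) - baseline_corr) > tolerance

# Two features that were historically strongly correlated (baseline 0.95)
xs = [1, 2, 3, 4, 5, 6]
ys = [6, 5, 4, 3, 2, 1]  # today the relationship has inverted
print(correlation_shift_alert(0.95, xs, ys))  # → True
```

No univariate check would catch this failure: each column on its own can look perfectly normal while the relationship between them has broken.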
Lastly, next-generation data quality requires validation at metadata, dataset and data point levels. Again, it’s probably theoretically possible to validate many of these aspects of data using dbt tests (at least if the data lives in the warehouse like previously mentioned), but doing so in practice is a completely different story, especially for data point validation as we’ll see.
Schematic illustration of the relationship between metadata, dataset and data points. Together, the three aspects form a comprehensive understanding of the data.
Metadata validation means ensuring that things like freshness, schema, data ingestion rate and row count behave as expected.
Dataset level validation includes things like ensuring the null-value distribution, mean, standard deviation or cardinality stays within some expected range, or that the distribution of the data from a certain data source doesn’t shift significantly over time.
Finally, data point validation means looking at things like duplicates, string and date formats, and ensuring data values are within certain ranges or that a single data point is not an outlier or anomaly. Here, Validio comes with a powerful feature that allows bad data to be written out to any stream, object store or data warehouse. There, the bad data points can be inspected using a visualization tool of choice (e.g. Tableau or Looker), making it easier to understand what’s wrong and rapidly implement a fix. This also means not having to rely on alerts being triggered for every single data point failure, which would certainly lead to alert fatigue (as we shall see in the next section).
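To make the three levels concrete, here is a minimal sketch of what each level inspects on a toy batch. Field names, thresholds and checks are illustrative assumptions, not Validio functionality:

```python
from statistics import mean

# A toy batch of records; the "price" field is illustrative.
batch = [{"price": 10.5}, {"price": 11.0}, {"price": -1.0}, {"price": 9.9}]

# Metadata level: does the batch look right as a whole? (Row count here;
# freshness and schema checks are similar in spirit.)
def metadata_ok(batch, min_rows=1):
    return len(batch) >= min_rows

# Dataset level: do aggregate statistics stay within an expected range?
def dataset_ok(batch, lo=5.0, hi=20.0):
    return lo <= mean(r["price"] for r in batch) <= hi

# Data point level: which individual records violate the rule?
def bad_points(batch):
    return [r for r in batch if r["price"] < 0]

print(metadata_ok(batch), dataset_ok(batch), bad_points(batch))
```

Note that the metadata and dataset checks both pass here even though one record is clearly bad; only the data point level catches it, which is why all three levels are needed together.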
More importantly, however, writing out bad data points in real time means these data quality failures can be properly handled as part of the data orchestration flow as soon as they are identified, without adding noticeable latency to the data workflow; i.e., bad data points can be quarantined rather than pushed downstream in the DAG, where they would affect downstream data consumers.
Implementing this in dbt tests would require custom logic and workarounds, especially since dbt tests don’t store test results by default. If you do choose to persist test results, dbt will store test results to a default schema (which can be modified) in the database. However, a test’s results will always replace previous failures for the same test, and so this workflow is more tailored towards a development process rather than for the type of operational data validation that next-generation data quality requires.
The final part of catching data quality failures is about being notified at the right time, in the right way. As data engineers, we are often busy with big responsibilities on our shoulders. This can sometimes come with a hefty burden of colleagues and tools sending messages en masse about broken pipelines and dashboards. Naturally then, a data quality platform must be able to handle notifications in a good way in order for us to be able to catch and handle data quality failures in a scalable way.
As mentioned previously, dbt tests have support for storing failing data flags in a database, which is extremely helpful when debugging pipeline code. However, the problem here is twofold: first, storing failing flags might not be enough from a communication point of view, and second, we might not want to get a flag for every single time a validation fails as it would surely result in alert fatigue.
Validio takes a fundamentally different approach to notifications compared to dbt tests.
First, Validio of course integrates with your team’s messaging tool, e.g. Slack, so notifications can be routed to the channel where they make the most sense and have the greatest chance of being helpful. Second, Validio as a data quality platform does not assume that every data quality failure needs to be communicated. Let’s look at an example that illustrates this point.
Imagine having 0.1% bad data (a very low number in today’s reality among data-driven companies) in a dataset of 1 billion data points. With alerts on the data point level, this would trigger 1 million alerts. Imagine also getting 1 million Slack notifications. This is clearly neither practical nor useful.
If bad data is expected, which it often is, no manual action is necessarily required to handle it. For example, let’s say we have a data stream where we only expect positive numbers for a certain feature, but -1s (or nulls or missing values) sometimes occur, and downstream pipelines are not built to accommodate this. We might then simply want to remove those data points before they affect downstream systems. With this type of automation in place, there’s no need to communicate that bad data was found at an expected ratio, since no manual action is required. Rather, as alluded to in the previous section, Validio can be integrated with the data orchestration so that bad data is automatically filtered out.
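The quarantine pattern itself is simple; the sketch below routes records into good and quarantined lists instead of alerting on each failure (a generic sketch, not Validio’s API):

```python
def split_good_bad(records, is_valid):
    """Route records into (good, quarantined) instead of alerting on
    each bad one; quarantined records can be written to a table or
    bucket for later inspection. Illustrative sketch only."""
    good, bad = [], []
    for r in records:
        (good if is_valid(r) else bad).append(r)
    return good, bad

# A feature where only non-negative values are expected
stream = [3, 7, -1, 12, -1, 5]
good, quarantined = split_good_bad(stream, lambda x: x >= 0)
print(good)         # → [3, 7, 12, 5]
print(quarantined)  # → [-1, -1]
```

Downstream consumers only ever see the good records, and the quarantined ones remain available for inspection, no alert storm required.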
Validio is purpose-built for catching and fixing data quality failures in a streamlined fashion. This means we can be sure to get the right type of notifications where they matter.
All in all, if you’re already using dbt tests, you’re at a fantastic starting point. In this article, we’ve also covered four areas where Validio can help up your data quality game even further: (1) connecting to data sources end-to-end, (2) using pre-configured monitors for easier setup and maintenance, (3) validating everything comprehensively, and (4) getting the right type of notifications in case of failures.
Ultimately, Validio exists to help put out data quality fires, so more time can be spent on the fun stuff—building robust and scalable systems. Our mission is to help data engineers eliminate bad data, because we’re convinced that it can unleash an enormous amount of business value and new ideas. We support the heroes of data—data engineers—on their quest to make low trust in data and broken pipelines a thing of the past, and we hope this article has provided ideas for how to get closer to that goal.
Thanks for reading, and please get in touch if you’re interested in finding out more!