Data is a valuable asset that businesses rely on to make informed decisions, but those decisions are only as accurate as the underlying data. In this insight, we look at how you can monitor data quality in the Modern Data Platform using Great Expectations, an open-source, Python-based framework, running on Databricks.
Great Expectations is a framework that helps you assert what you expect from your data, allowing you to automatically quantify and visualize any violations of these expectations. Because expectations are defined through configuration, you can monitor the quality of your data at any stage of your data pipeline.
Why should you use Great Expectations?
In a state-of-the-art Modern Data Platform, the quality of code, ETL pipelines and infrastructure is safeguarded using techniques such as CI/CD pipelines, unit tests, monitoring dashboards and alerting. Unfortunately, the most important asset of the Modern Data Platform, the data, often lacks a framework to explicitly monitor its quality. Great Expectations fills this gap. Its maintainers describe the solution as ‘unit tests for data’.
Using Great Expectations means actively and explicitly defining quality standards for your data, in the form of expectations that you have about the values of specific columns or records. These expectations are validated against the actual data, and violations are caught, quantified and visualized by the framework.
This gives valuable insights about data quality to stakeholders and allows them to act before poor-quality data is used in downstream tasks.
What is the philosophy of Great Expectations?
The Great Expectations framework is configuration-based and formalizes the concept of an expectation one holds about the data of a certain column using a simple JSON representation, in which only the name of the column and some required parameters need to be configured.
An example of an expectation is that all values of a certain column should lie inside a certain interval. Formalizing it only requires the expectation type, the column name and the interval bounds.
There are over 70 different expectation types in the framework, and you can extend it by writing your own custom expectations, with an active community reviewing and testing the expectations you create along the way.
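As a minimal sketch of such a configuration, assuming the JSON layout used by pre-1.0 releases of Great Expectations (an `expectation_type` plus its `kwargs`), an interval expectation could look like the following. The column name `sales_amount` and the bounds are hypothetical.

```python
import json

# Hypothetical expectation: every value in `sales_amount` lies in [0, 10000].
# The key names follow the pre-1.0 Great Expectations JSON layout, which is an
# assumption about the version in use; column name and bounds are illustrative.
expectation = {
    "expectation_type": "expect_column_values_to_be_between",
    "kwargs": {
        "column": "sales_amount",
        "min_value": 0,
        "max_value": 10000,
    },
}

# Serialize to the JSON form that would be stored alongside other expectations.
expectation_json = json.dumps(expectation, indent=2)
print(expectation_json)
```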
The framework makes use of a set of components, which are each an abstraction or formalization of a certain concept. Instances of such components can be combined and reused, facilitating the maintenance of your Great Expectations implementation.
For example, an expectation is defined apart from the data source the data comes from and apart from the runtime that will do the validation computations. Both are configured in other components. So, the same expectation can be used with different data sources or validated using different runtimes, without the need to duplicate the expectation. Once you define an expectation, it can be validated against a SQL database table directly or against a PySpark data frame using the Spark execution engine.
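The decoupling described above can be illustrated with a small library-free sketch. It is not the Great Expectations API: the two "runtime" functions, the table name `sales` and the column `amount` are hypothetical. It only shows how one expectation definition can be evaluated both in memory and directly in a SQL database.

```python
import sqlite3

# One expectation definition, independent of where the data lives.
expectation = {
    "expectation_type": "expect_column_values_to_be_between",
    "kwargs": {"column": "amount", "min_value": 0, "max_value": 100},
}


def count_violations_in_memory(rows, expectation):
    """Hypothetical in-memory runtime: `rows` is a list of dicts."""
    kw = expectation["kwargs"]
    column = kw["column"]
    return sum(
        1 for row in rows
        if not (kw["min_value"] <= row[column] <= kw["max_value"])
    )


def count_violations_in_sql(conn, table, expectation):
    """Hypothetical SQL runtime: pushes the same check down as a query."""
    kw = expectation["kwargs"]
    column = kw["column"]
    # Identifiers come from trusted configuration in this sketch;
    # the interval bounds are passed as bound parameters.
    query = f'SELECT COUNT(*) FROM "{table}" WHERE "{column}" NOT BETWEEN ? AND ?'
    return conn.execute(query, (kw["min_value"], kw["max_value"])).fetchone()[0]


rows = [{"amount": 10}, {"amount": 250}, {"amount": -5}, {"amount": 99}]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?)", [(r["amount"],) for r in rows])

in_memory_violations = count_violations_in_memory(rows, expectation)
sql_violations = count_violations_in_sql(conn, "sales", expectation)
print(in_memory_violations, sql_violations)
```

Both runtimes read the same `expectation` dictionary, so changing the interval requires a single configuration change rather than an edit per backend.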
Where does Great Expectations fit in the Modern Data Platform?
In the Modern Data Platform, raw data is ingested into the data lake in the bronze layer. It is then cleaned and transformed, the result of which is brought to the silver layer. This curated data is then used to build a data model in the gold layer. Cleaning and transforming happen in Databricks using PySpark.
Using Great Expectations makes the most sense in assessing which data needs to be filtered out or transformed from bronze to silver, and in assessing whether data conforms to your data model, going from silver to gold.
The first step focuses on data engineering expectations, with the goal to assess if ingested data is of good quality. Examples of expectations that could be configured in this step are: Are sales amounts of the correct numeric data type? Are VAT numbers in the correct format? Are all values of a mandatory column filled in? Are column names and values consistent? Are key-column values unique? Are all the dates available after our incremental loading logic?
The second step focuses on business engineering expectations, validating the integrity of the data model. Potential expectations are: Are there outliers for a certain measure? Are these amounts in realistic ranges and standard deviations? Are all the general ledger accounts available in the postings? Are all the customers available in the sales invoices? Is the revenue for each customer realistic? Are we expecting the margins to increase over time?
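Several of the questions in the two steps above map onto built-in expectation types. The sketch below lists one possible configuration per step, assuming the pre-1.0 `expectation_type`/`kwargs` layout. All column names, bounds, account numbers and the VAT pattern are hypothetical, and the small `expectation` helper is only a convenience for this example.

```python
import json
import re

# Hypothetical VAT format: "BE" followed by ten digits, the first being 0 or 1.
VAT_REGEX = r"^BE[01]\d{9}$"


def expectation(expectation_type, **kwargs):
    """Helper for this sketch: build one expectation configuration."""
    return {"expectation_type": expectation_type, "kwargs": kwargs}


# Step 1 (bronze to silver): data engineering expectations.
bronze_to_silver = [
    # Are sales amounts of the correct numeric data type?
    expectation("expect_column_values_to_be_of_type",
                column="sales_amount", type_="DoubleType"),
    # Are VAT numbers in the correct format?
    expectation("expect_column_values_to_match_regex",
                column="vat_number", regex=VAT_REGEX),
    # Are all values of a mandatory column filled in?
    expectation("expect_column_values_to_not_be_null", column="customer_id"),
    # Are key-column values unique?
    expectation("expect_column_values_to_be_unique", column="invoice_id"),
]

# Step 2 (silver to gold): business expectations on the data model.
silver_to_gold = [
    # Are these amounts in realistic ranges and standard deviations?
    expectation("expect_column_values_to_be_between",
                column="revenue", min_value=0, max_value=1_000_000),
    expectation("expect_column_stdev_to_be_between",
                column="revenue", min_value=0, max_value=50_000),
    # Are all the general ledger accounts available in the postings?
    expectation("expect_column_distinct_values_to_contain_set",
                column="gl_account", value_set=["400000", "700000"]),
]

print(json.dumps({"bronze_to_silver": bronze_to_silver,
                  "silver_to_gold": silver_to_gold}, indent=2))
```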
Using formalized expectations to identify and filter out data that does not conform, instead of writing this functionality with custom PySpark code, makes the process a lot more maintainable and transparent.
The framework and corresponding UI dashboard allow stakeholders to get an overview of the currently implemented expectations, even if they do not know Python. Adding, removing or altering expectations is only a matter of updating the configuration instead of writing code.
Validation of expectations using the framework automatically outputs which and how many records violated each expectation. These results can also be consulted in code, in files or in the automatically generated UI dashboard.
In essence, Great Expectations can change a layer-to-layer cleaning procedure from needing a lot of custom implementations with low transparency to an easily maintainable process with full transparency on the criteria on which data will be filtered out or transformed, as well as on the results of applying these criteria.
What are the components of Great Expectations?
Great Expectations has a lot of components that may seem overwhelming at first glance. However, they are all fully configurable and work together so well that the amount of overhead is limited. The benefits of standardization/formalization and maintainability make it worth the overhead.
Expectations are grouped in expectation suites, which are collections of expectations that are validated together. For example, you can define one expectation suite per table that will be upserted to the silver layer, and validate it against the corresponding data frame before upserting.
PySpark data frames can be brought into the framework by embedding them in a batch. Expectations of a certain suite are validated against the batch by a validator, using a configured runtime, in our case PySpark. The validation result contains information on which and how many records failed each of the expectations in the suite. This can be used to filter out or transform the failing data.
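To make the last step concrete, here is a library-free sketch of how a validation result can drive filtering. It does not use the Great Expectations validator. The `validate` function, the two check functions and the sample rows are hypothetical stand-ins that mimic a result reporting which and how many records failed each expectation.

```python
def not_null(row, column):
    return row.get(column) is not None


def between(row, column, min_value, max_value):
    value = row.get(column)
    # Missing values are left to the not-null check in this sketch.
    return value is None or min_value <= value <= max_value


# Maps expectation type names to the hypothetical check functions above.
CHECKS = {
    "expect_column_values_to_not_be_null": not_null,
    "expect_column_values_to_be_between": between,
}


def validate(rows, expectations):
    """Return per-expectation results and the indices of all failing rows."""
    results, failing = [], set()
    for exp in expectations:
        check = CHECKS[exp["expectation_type"]]
        unexpected = [i for i, row in enumerate(rows)
                      if not check(row, **exp["kwargs"])]
        failing.update(unexpected)
        results.append({
            "expectation_type": exp["expectation_type"],
            "success": not unexpected,
            "unexpected_count": len(unexpected),
            "unexpected_index_list": unexpected,
        })
    return results, failing


suite = [
    {"expectation_type": "expect_column_values_to_not_be_null",
     "kwargs": {"column": "amount"}},
    {"expectation_type": "expect_column_values_to_be_between",
     "kwargs": {"column": "amount", "min_value": 0, "max_value": 100}},
]
rows = [
    {"id": 1, "amount": 50},
    {"id": 2, "amount": None},
    {"id": 3, "amount": -10},
    {"id": 4, "amount": 20},
]

results, failing = validate(rows, suite)
# Split the batch: clean rows continue to silver, rejected rows are set aside.
clean_rows = [row for i, row in enumerate(rows) if i not in failing]
rejected_rows = [row for i, row in enumerate(rows) if i in failing]
print(results)
print([r["id"] for r in clean_rows], [r["id"] for r in rejected_rows])
```

In the real pipeline the same split would be applied to the PySpark data frame, with the failing records either dropped, transformed or written to a quarantine location.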
Orchestration of which expectation suites to validate against which batches in an automatic production pipeline is configured in a checkpoint. In the case of our basic flow, a single checkpoint suffices, which simply links the expectation suites to the batches of the corresponding silver tables.
In the checkpoint, one also configures what to do with the results, using actions. The most common action is to automatically generate data docs based on the results. Data docs are an HTML dashboard showing an overview of all validation runs, the results of these runs, and the configured expectation suites.
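The dictionary below sketches what a checkpoint for this basic flow might look like, assuming the pre-1.0 checkpoint layout with a `validations` list and an `action_list`; newer releases configure checkpoints differently. All datasource, connector, table and suite names are hypothetical, as is the `build_checkpoint_config` helper.

```python
import json


def build_checkpoint_config(name, tables):
    """Sketch: link one expectation suite to one batch per silver table."""
    return {
        "name": name,
        "config_version": 1.0,
        "class_name": "Checkpoint",
        "validations": [
            {
                # Hypothetical names; the data frame itself would be
                # supplied at run time as the batch data.
                "batch_request": {
                    "datasource_name": "silver_spark_datasource",
                    "data_connector_name": "runtime_connector",
                    "data_asset_name": table,
                },
                "expectation_suite_name": f"{table}_suite",
            }
            for table in tables
        ],
        "action_list": [
            # Persist the validation results.
            {"name": "store_validation_result",
             "action": {"class_name": "StoreValidationResultAction"}},
            # Regenerate the HTML data docs from the stored results.
            {"name": "update_data_docs",
             "action": {"class_name": "UpdateDataDocsAction"}},
        ],
    }


checkpoint_config = build_checkpoint_config(
    "bronze_to_silver_checkpoint", ["customers", "sales_invoices"]
)
print(json.dumps(checkpoint_config, indent=2))
```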
Great Expectations is a framework used to help organizations assert what they expect from their data and to quantify and visualize violations against their expectations. It enables all stakeholders, not only developers, to define expectations on data using simple configuration rather than complex code, with an extensive catalogue of expectations, and the possibility of creating custom expectations with the support of a community.
An automatically generated UI enables an overview of both the currently implemented expectations and the results of automated validations.
Great Expectations fits well in the Modern Data Platform, leveraging PySpark for distributed computation and storing results back on the data lake. It is a valuable asset for safeguarding the integrity of data transformations going from bronze to silver, and from silver to gold, thereby making your data-driven decision-making more robust.