Why?
Data pipeline development lags far behind the software industry in terms of engineering practice, so tools are needed to bring pipeline design up to speed and to guarantee data quality.
Tools such as dbt and Delta Live Tables address this by focusing on testing, version control, documentation and data quality.
What is dbt?
data build tool (dbt) is a transformation tool that aims to simplify the work of the analytics engineer in the data pipeline workflow. It implements only the T in the ELT process. Its greatest strength is that it brings software engineering principles, normally only seen in the software industry, into the analytics world. Only the open source dbt Command Line Interface (CLI) will be discussed here.
What is Delta Live Tables?
Delta Live Tables (DLT) is a framework that makes it easier to design data pipelines and control data quality. It covers the whole ETL process and is integrated into Databricks. DLT is still in preview, so the features described here may be subject to change.
How does Delta Live Tables compare with dbt?
To decide which tool to use, or whether to use one at all, this section compares their features. Many features are similar, but there are subtle nuances, and some bigger differences can make or break a tool in certain use cases.
(1) Data lineage graph
To make data teams more efficient, a data lineage graph can be used to find problems in the data more easily and faster. It also makes it simpler for new team members, data analysts or other colleagues to understand the data pipeline.
- dbt
The data lineage graph includes the source tables in the data warehouse, the tables after the different transformations and the dashboard in which the business value of the tables is displayed. It is, however, not possible to hover over or click on the tables to see more information.
- Delta Live Tables
The data lineage graph shows the tables that load data from the data lake and the different tables after transformation. More information about the different tables can be obtained by clicking on the tables.
- Comparison
Both show the source tables and the transformed tables, but only dbt also shows the end point of the data, whether that is a dashboard, an application or a data science pipeline.
(2) Documentation
The importance of documentation cannot be overstated. Even the team that designed the data pipeline may forget over time what a certain column was for when it needs to be updated. Good documentation makes the project easier to maintain.
- dbt
The documentation is automatically generated from metadata and is served on localhost. It covers the source tables, dashboards, models and tests. For each table, descriptions can be added to every column, and extra information about references to other tables can be found.
- Delta Live Tables
The documentation is automatically generated from metadata and is integrated into the Databricks UI. For each table the schema can be viewed, and some extra information about the table can be added. Documentation of the tests is also shown in the released demo, but this feature is not yet available in the preview.
- Comparison
Both document the schema and the tests, but only dbt also gives information about the dashboards.
(3) Testing
Small data deviations or a seemingly insignificant error in logic are sometimes not noticed during development because they occur rarely, yet they can result in serious errors later. Testing helps catch these small errors early on.
- dbt
Two kinds of tests are integrated into dbt and can be added to the data pipeline.
- Generic or schema tests - Reusable tests that can be applied to different models. These include checking whether a column's values are unique or contain no nulls. One can also check whether all values in a column belong to an allowed list; the test fails if a value is not found in the list.
- Bespoke or data tests - Specific tests for certain tables. The test consists of a query against a table; if no rows are returned, the test passes. Such a test can, for example, check that a returned column never exceeds a paid column, so that when an item is returned the customer cannot get back more money than was paid for it (a sketch follows below).
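As a minimal sketch of such a data test: the query below assumes a hypothetical payments model with amount_paid and amount_returned columns, and every row it returns counts as a failure.

```sql
-- tests/assert_refund_not_larger_than_payment.sql (hypothetical file name)
-- dbt treats every row returned by a data test as a failure,
-- so the query selects exactly the invalid rows.
select
    order_id,
    amount_paid,
    amount_returned
from {{ ref('payments') }}           -- hypothetical model name
where amount_returned > amount_paid  -- a refund larger than the payment is invalid
```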
- Delta Live Tables
Expectations on a table control the data quality. They work like a CHECK constraint, but offer more flexibility in how to handle data that fails the check. Conditions can be tested to see whether a row satisfies the data requirements; an expectation can, for example, check that a column contains no null values or that all values in a column are valid timestamps. It is not possible, however, to implement a test that consists of a query over a table.
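As an illustrative sketch, assuming a hypothetical clean_orders table built from a raw_orders live table; the exact expectation syntax may still change while DLT is in preview.

```sql
-- Hypothetical Delta Live Tables SQL notebook cell
CREATE OR REFRESH LIVE TABLE clean_orders (
  -- rows that violate the expectation are kept, but the violation is
  -- recorded in the pipeline's quality metrics
  CONSTRAINT valid_timestamp EXPECT (order_ts IS NOT NULL)
)
AS SELECT * FROM LIVE.raw_orders;  -- hypothetical upstream live table
```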
- Comparison
Both can do the kind of testing a CHECK constraint can, but dbt can additionally query a table or a combination of tables to test more complex logic.
(4) Incremental tables
In big data, the volume of data is so vast that reloading everything every time a few rows are added is impossible; the delays would be enormous. Incremental tables solve this by only processing those extra rows, so a transformation that first took a few hours can drop to a few seconds.
- dbt
An incremental model builds the whole table the first time it is run; in subsequent incremental runs the SQL code is adapted so that only the new data is transformed.
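A minimal sketch of an incremental model, assuming a hypothetical raw_events model with an event_time column:

```sql
-- models/events_incremental.sql (hypothetical model name)
{{ config(materialized='incremental') }}

select
    event_id,
    event_time,
    payload
from {{ ref('raw_events') }}  -- hypothetical upstream model

{% if is_incremental() %}
-- on incremental runs, only process rows newer than what is already in the table
where event_time > (select max(event_time) from {{ this }})
{% endif %}
```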
- Delta Live Tables
Both incremental transformation and incremental loading of data from a data lake into the data warehouse are possible.
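A sketch of what incremental loading and transforming could look like with streaming live tables and Auto Loader; the path and table names are hypothetical and the syntax may change while DLT is in preview.

```sql
-- Incrementally ingest new files landing in a data lake folder (hypothetical path)
CREATE OR REFRESH STREAMING LIVE TABLE raw_events
AS SELECT * FROM cloud_files("/mnt/datalake/events/", "json");

-- Incrementally transform: STREAM() only processes rows that arrived since the last update
CREATE OR REFRESH STREAMING LIVE TABLE events_cleaned
AS SELECT event_id, to_timestamp(event_time) AS event_time
FROM STREAM(LIVE.raw_events)
WHERE event_id IS NOT NULL;
```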
- Comparison
Both can incrementally transform data, but Delta Live Tables can also incrementally load data.
(5) Data quality & data reliability
Accurate dashboards are very important in the business world, but they can only be obtained if the data quality is high enough. If the data quality is off, predictions can be far off as well and result in great losses.
- dbt
- In a test, thresholds can be set for when to warn and when to raise an error if too many rows fail, so a limited number of failing rows can be tolerated in the quality control (a sketch follows after this list).
- To further inspect the data quality, failed rows can be stored in a separate table.
  - If data falls outside of the cut-off window, it can be lost. A full refresh that reprocesses the data that has already been ingested can solve this.
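A sketch of what such thresholds could look like on the data test from the testing section; severity, warn_if, error_if and store_failures are existing dbt test configurations, while the numbers and names are hypothetical.

```sql
-- tests/assert_refund_not_larger_than_payment.sql (hypothetical file name)
{{ config(
    severity = 'error',     -- escalate to an error...
    error_if = '>100',      -- ...only when more than 100 rows fail
    warn_if  = '>10',       -- warn when more than 10 rows fail
    store_failures = true   -- keep the failing rows in a separate table for inspection
) }}

select order_id, amount_paid, amount_returned
from {{ ref('payments') }}
where amount_returned > amount_paid
```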
- Delta Live Tables
- In an expectation, one can choose to drop, pass or fail when there are invalid records. This way the handling of data quality can be tuned (a sketch follows after this list).
- Not yet present in the preview, but already shown in the demo, is the integration of the data quality rules in the Databricks UI.
- If data is lost in an incremental run, a full refresh can be done.
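A sketch of the three behaviours on a hypothetical table; which action fits depends on how strict the quality requirement is.

```sql
CREATE OR REFRESH LIVE TABLE orders_quality_checked (
  -- pass: keep the row, only record the violation in the quality metrics
  CONSTRAINT has_customer    EXPECT (customer_id IS NOT NULL),
  -- drop: remove rows that fail the check from the target table
  CONSTRAINT positive_amount EXPECT (amount > 0) ON VIOLATION DROP ROW,
  -- fail: abort the pipeline update as soon as a row violates the constraint
  CONSTRAINT valid_order_id  EXPECT (order_id IS NOT NULL) ON VIOLATION FAIL UPDATE
)
AS SELECT * FROM LIVE.clean_orders;  -- hypothetical upstream live table
```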
- Comparison
Both have rules that determine how many failing records are still allowed, but Delta Live Tables has a clear visual interface where one can see at a glance what the quality of the data is and what is failing.
(6) Detailed logging
For the engineers who maintain the different infrastructures and connections, logging is very important to pinpoint what an error was and, more importantly, where it took place. With good logging, one can efficiently determine exactly when something went wrong and act immediately, so a production pipeline experiences little or no downtime.
- dbt
The logs are generated from running a dbt command in its command line interface and are stored in a folder named logs in the project folder.
- Delta Live Tables
- The logs are directly integrated in the UI. More information about every event can be found when clicking on the particular event.
- There are real-time live updates when running on the cluster.
- Failures of data quality requirements are captured in the event log.
- Automatic monitoring, recovery and management.
- Comparison
Both have an extensive amount of logs, but only Delta Live Tables has real-time updates and a visual interface to follow the process.
(7) Development to production
A lot can go wrong when one has to manually maintain two versions of almost the same code, one in development and the other in production. A data engineer may make an important change in production but not apply the same change in development. Later, when an update needs to be made, that person or a colleague will experience different behaviour in development than in production, and no one will know why; a lot of time will be lost figuring out the problem. This can be solved by using the same code and making it easy to switch from development to production.
- dbt
Changing from development to production can easily be done by switching workspaces (targets). A workspace can hold different data warehouse connections and different sources. The workspace configuration has to be added manually in a YAML file, but all the necessary changes can be made in one or two YAML files, so one does not need to make small changes in all the models.
- Delta Live Tables
To change from development to production, the only thing needed is to create different pipelines with the same underlying code: one pipeline in development and one in production. The only differences between the pipelines are the name and the target, so a copy of the development pipeline's settings can be used to create a new production pipeline, changing only the name and target. If the development environment is no longer needed, the switch from development to production can be made directly with the UI buttons.
- Comparison
Both make it easy to change from development to production, but dbt can also easily change the sources without changing the code. For Delta Live Tables, the UI button makes it simple to change environments.
(8) Programming language
In a team, not everyone knows the same programming languages, so the more languages a tool supports, the better: the language that most people in the team already know can be chosen. This speeds up development, since a smaller group needs to learn a new language.
- dbt
The only language supported by dbt is SQL. Some Jinja needs to be known to create templates.
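A small sketch of what such a Jinja template could look like; the model and column names are hypothetical.

```sql
-- models/payments_pivoted.sql (hypothetical model name)
{% set methods = ['card', 'cash', 'voucher'] %}  {# generate one column per payment method #}

select
    order_id,
    {% for method in methods %}
    sum(case when payment_method = '{{ method }}' then amount else 0 end)
        as {{ method }}_amount{% if not loop.last %},{% endif %}
    {% endfor %}
from {{ ref('stg_payments') }}  -- hypothetical upstream model
group by order_id
```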
- Delta Live Tables
The notebooks in which the tables are defined can be written in SQL or Python. A single notebook cannot mix SQL and Python, but different notebooks in the same project can use different languages.
- Comparison
Both support SQL, but only Delta Live Tables also supports Python.
(9) Data warehouses
Not every company has every kind of data warehouse at its disposal, and setting up a new data warehouse for a specific problem or tool when one already exists is quite excessive. It is worthwhile to look for a tool that works with the data warehouse you already have, unless a tool tied to another data warehouse offers a lot of benefits.
- dbt
Various data warehouses are officially supported by dbt, and more are community supported. For Databricks, for example, dbt gets its tables from the Databricks Hive metastore, and these can in turn link to various sources.
- Delta Live Tables
It is directly integrated into Databricks, so any source that can be loaded into the Databricks Hive metastore can be used.
- Comparison
Both can make use of different data sources such as a data lake, but only dbt can be used in combination with, and run against, other data warehouses.
(10) Running a job
The environment where a job runs can greatly influence cost and complexity. Some environments are cheap but complex to install and work with, while others are a bit more expensive but easy to deploy and use. A well-considered decision needs to be made.
- dbt
Commands can be run in a container instance or locally; the queries are run against the data warehouse. It is, however, quite difficult to deploy dbt to Azure, since a lot of resources and configuration are needed. Even after deployment, the setup is rather opaque and the different features of dbt are hard to find. More about this can be found in the article about dbt.
- Delta Live Tables
In production, Delta Live Tables creates a custom job cluster every time it runs a pipeline; it is not possible to run a pipeline on an existing interactive cluster. However, the Databricks UI is very intuitive: setting up a pipeline only takes a few clicks and a few seconds, and the configuration is done.
- Comparison
Both can run in the cloud, but with Delta Live Tables the run and possible problems can be followed much more closely, with the logging directly in the main interface; its interface is very clear and simple. dbt, on the other hand, will be a lot cheaper, since it only requires running a container instance instead of launching a custom cluster every time. Deployment is a difficult exercise, however, and in the deployed version it is also a challenge to find and use the different features.
(11) Development environment
The importance of the development environment depends on the circumstances. If one works mainly alone, a local development environment is enough, but the bigger the team, the greater the need for a development environment in the cloud.
- dbt
- Development happens locally.
- Via git one can work in a team.
- Delta Live Tables
- Development happens in the cloud with notebooks.
- Real-time collaboration with a team.
- Notebooks are connected to git.
- Comparison
Both are connected to git, but Delta Live Tables makes real-time collaboration and development possible in the cloud.
When to use dbt and when to use Delta Live Tables?
When to use dbt?
- only transformation tool needed
- only SQL needed
- simplification of transformation process
- wide variety of testing possibilities
- extensive documentation
- starting a whole cluster is not needed
- local development
When to use Delta Live Tables?
- full ELT process
- Databricks as data warehouse
- UI needed to observe:
- data quality
- real-time updates of the run
- logs
- development in the cloud
- real-time collaboration in a team
(written by Kyana Bosschaerts - element61 Trainee in 2021)