Testing ETL projects (processes that Extract data, Transform it according to reporting requirements and Load it into the data warehouse) is complex for many reasons. Typically, huge amounts of data are handled, and that data often comes from multiple sources that need to be combined. Sometimes the data is loaded incrementally, sometimes it needs to be reloaded in full. On top of that, ETL-flows can apply all kinds of complex calculations and transformations to the data.
While there are several advanced tools available for software testing (tools that are quite mature for automatically testing programming code), testing ETLs is something completely different. Tools dedicated to ETL testing do exist, but mature ones are still rare.
ETL testing is therefore mostly done manually, which is very labour intensive and error prone. Running these tests by hand, without a dedicated tool to schedule them, also prevents them from being automated.
However, as for any IT project, recurrent testing is important to guarantee a high level of (data) quality. The more you test, the more bugs are solved before going live. And that is crucial in Business Intelligence: when users can't trust the data, the acceptance of the BI solution will definitely fail.
How to test ETL-processes and why it is so difficult
To adequately validate your ETL-processes, several tests need to be executed before you can conclude that the ETL is working as intended.
Data completeness
Probably the most obvious test is to check whether all records have been loaded into the data warehouse. For some ETLs, this can be as easy as comparing a record count of the source and the target table.
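As a minimal sketch of such a completeness test, assuming a staging table stg.Customer is loaded into a dimension dim.Customer (both names are hypothetical), the check could look like this:

```sql
-- Completeness check: compare record counts between source and target.
-- Table names (stg.Customer, dim.Customer) are illustrative only.
SELECT
    (SELECT COUNT(*) FROM stg.Customer) AS SourceCount,
    (SELECT COUNT(*) FROM dim.Customer) AS TargetCount,
    CASE
        WHEN (SELECT COUNT(*) FROM stg.Customer) = (SELECT COUNT(*) FROM dim.Customer)
        THEN 'PASS' ELSE 'FAIL'
    END AS TestResult;
```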
But unfortunately it's not always that easy: data can be transformed into a completely different structure following some kind of data modelling technique, to maintain historical information, to allow complex but fast calculations, or to improve reporting performance by storing aggregated results. In that case, more sophisticated tests might be required.
Most data warehouses also have a layered architecture: often only staging and a Kimball-like data warehouse, but sometimes also an extra landing zone per layer, or even an operational data store as an additional layer. Staging can be persistent, although with big data volumes it often is not.
Big data volumes also make it very inefficient to load all the data from scratch. Most ETLs therefore apply some kind of delta mechanism to load only new and changed records. It is crucial to test not only the results of the full load, but the delta mechanism as well: are all new records coming in without creating duplicates, and are all changes applied correctly while maintaining historical information?
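A sketch of one such delta test, assuming a type 2 dimension dim.Customer with a business key CustomerCode and an IsCurrent flag (all names hypothetical): it verifies that the incremental load did not create duplicate current rows.

```sql
-- Delta load check: every business key should have at most one current row.
-- dim.Customer, CustomerCode and IsCurrent are illustrative names.
SELECT CustomerCode, COUNT(*) AS CurrentRows
FROM dim.Customer
WHERE IsCurrent = 1
GROUP BY CustomerCode
HAVING COUNT(*) > 1;   -- any row returned means the test fails
```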
Data quality
Data transformation rules
Another aspect of ETL is the T of "transformation". During this transformation, column values are changed and newly calculated columns are created. These transformations need to be validated. While all these transformations should be documented in some way, this is easily forgotten. It's obvious that good testing goes hand in hand with good documentation.
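As an illustration, assume a documented rule stating that the target column FullName must be the concatenation of FirstName and LastName from the source (all table and column names are hypothetical). The test can re-apply the rule on the source and compare it with the loaded result:

```sql
-- Transformation check: recompute the documented rule and compare with the target.
-- stg.Customer, dim.Customer and the column names are illustrative only.
SELECT s.CustomerCode
FROM stg.Customer AS s
JOIN dim.Customer AS d
  ON d.CustomerCode = s.CustomerCode
WHERE d.FullName <> CONCAT(s.FirstName, ' ', s.LastName);   -- rows returned = failed records
```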
Business rules
Next, it is also important to check the correct enforcement of the business rules that have been identified: values that should fall within a certain range, columns that cannot be empty, etc. To avoid loading "dirty" data into the data warehouse, this is again an important test that should be executed during ETL testing.
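A sketch of such a rule test, assuming a fact table fact.Sales where Quantity must be positive and CustomerKey may never be empty (hypothetical names and rules):

```sql
-- Business rule check: count records that violate the agreed rules.
-- fact.Sales and its columns are illustrative names.
SELECT COUNT(*) AS ViolationCount
FROM fact.Sales
WHERE Quantity <= 0
   OR CustomerKey IS NULL;   -- expected result: 0
```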
Referential Integrity
Many tables in the DWH, especially fact tables, contain a lot of foreign keys to related tables.
Because referential integrity in a data warehouse should not be enforced by foreign key constraints (in order to improve ETL performance), you should regularly validate the integrity of these keys yourself to prevent orphan records.
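A typical orphan check, assuming a fact table fact.Sales that references dim.Customer through CustomerKey (hypothetical names), looks for fact rows whose key has no matching dimension row:

```sql
-- Referential integrity check: find orphan foreign keys in the fact table.
-- fact.Sales, dim.Customer and CustomerKey are illustrative names.
SELECT f.CustomerKey, COUNT(*) AS OrphanRows
FROM fact.Sales AS f
LEFT JOIN dim.Customer AS d
  ON d.CustomerKey = f.CustomerKey
WHERE d.CustomerKey IS NULL
GROUP BY f.CustomerKey;      -- any row returned means the test fails
```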
Standards and conventions
Besides testing the correctness of the data, it is also recommended to plan some technical tests. Every data warehouse environment should respect some kind of coding standards, policies and naming conventions that have been agreed with the data warehouse architect. Executing a regular test to check whether all developers respect these guidelines helps you to maintain the stability, performance and scalability of the data warehouse environment.
Examples of these technical tests are:
- Does every dimension table have a unique, numeric primary key?
- Does every table have a unique index containing the business key columns?
- Is partitioning done as expected?
- Are all audit columns available in every table?
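The last check in this list, for example, can be automated with a simple metadata query. A sketch for SQL Server, assuming the conventions require an audit column named LoadDate in every table (the column name is an assumption):

```sql
-- Conventions check: list all tables that are missing the agreed audit column.
-- 'LoadDate' is an illustrative audit column name; adapt it to your own conventions.
SELECT t.TABLE_SCHEMA, t.TABLE_NAME
FROM INFORMATION_SCHEMA.TABLES AS t
WHERE t.TABLE_TYPE = 'BASE TABLE'
  AND NOT EXISTS (
        SELECT 1
        FROM INFORMATION_SCHEMA.COLUMNS AS c
        WHERE c.TABLE_SCHEMA = t.TABLE_SCHEMA
          AND c.TABLE_NAME   = t.TABLE_NAME
          AND c.COLUMN_NAME  = 'LoadDate'
  );                          -- expected result: no rows
```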
How to automate ETL testing
Now, how can these tests be automated?
Purchase of a dedicated ETL testing solution
The easiest (but most expensive) option is to buy a dedicated tool for testing your ETLs. For example, if you use Informatica PowerCenter, you could buy Informatica Data Validation, which provides complete and effective data validation and ETL testing, with no programming skills required.
If your ETL solution vendor doesn't offer such a tool, you could have a look at a more generic testing solution. One of the most popular and advanced tools is probably QuerySurge. QuerySurge was built specifically to automate the testing of data and supports several data warehouse technologies and even Hadoop, NoSQL, XML files, etc.
Microsoft offers Visual Studio Test Professional. This is a great tool for testing programming code and it does support database unit testing. However, as it is not dedicated to ETL testing, you will have to build custom test providers. It also misses some nice features like data quality dashboards, out-of-the-box access to multiple platforms, support for testers with no programming skills, etc.
There are some more tools, even open-source ones, for example DbFit (http://dbfit.github.io/dbfit) and AnyDbTest (https://anydbtest.codeplex.com/), but they are often not very mature.
Build it yourself
If you prefer not to purchase a dedicated testing tool, you can also create something yourself. It won't be perfect, but it is still a lot better than not automating your tests at all.
When developers test their ETLs accurately, they are already writing a lot of SQL scripts to do so. Automating these same tests means building a solution that stores these scripts, schedules their execution and automatically compares the outcome of the test scripts with the expected results.
When using, for example, Microsoft technology for your data warehouse, all you need is some kind of configuration table to store the test scripts, the return value of these scripts and the expected results, so you can check whether a test has succeeded or failed. As SQL Server Integration Services can connect to almost any data source, you can easily build an SSIS package that executes all test scripts and compares results between multiple databases and environments. Use the SQL Server Agent to schedule the tests and a SQL Server Reporting Services report to distribute the results.
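A minimal sketch of such a configuration table (the table and column names are assumptions, not a prescribed design):

```sql
-- A minimal test configuration table: each row is one test script plus its expected result.
-- Names are illustrative; an SSIS package can loop over these rows, execute TestScript
-- against the right connection and compare the outcome with ExpectedValue.
CREATE TABLE dbo.EtlTest
(
    TestId          INT IDENTITY(1,1) PRIMARY KEY,
    TestName        NVARCHAR(200)  NOT NULL,
    TargetDatabase  NVARCHAR(128)  NOT NULL,
    TestScript      NVARCHAR(MAX)  NOT NULL,   -- a SELECT returning a single value
    ExpectedValue   NVARCHAR(100)  NOT NULL,
    IsActive        BIT            NOT NULL DEFAULT (1)
);

INSERT INTO dbo.EtlTest (TestName, TargetDatabase, TestScript, ExpectedValue)
VALUES ('Orphan customers in fact.Sales', 'DWH',
        'SELECT COUNT(*) FROM fact.Sales f LEFT JOIN dim.Customer d ON d.CustomerKey = f.CustomerKey WHERE d.CustomerKey IS NULL',
        '0');
```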
Of course, writing reusable scripts is a little harder than writing "quick and dirty" test scripts that are meant to be used only once. But enforcing standards in your BI solution will already make this a lot easier. And don't forget that these scripts can be scheduled to run again and again.
What to automate?
There is no golden rule that tells you what tests you should or should not automate.
But if you have a nice standard way of developing BI solutions using ETL templates, table templates, naming conventions etc., a lot of simple tests can probably even be generated.
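As an illustration of generating tests, assuming the configuration table sketched earlier and a naming convention that places every fact table in a fact schema (both assumptions), a metadata query could generate a simple "table is not empty" test per fact table:

```sql
-- Generate a "row count > 0" test for every fact table, based on metadata.
-- The fact schema, dbo.EtlTest and its columns are illustrative assumptions.
INSERT INTO dbo.EtlTest (TestName, TargetDatabase, TestScript, ExpectedValue)
SELECT
    'Row count > 0 for ' + TABLE_SCHEMA + '.' + TABLE_NAME,
    'DWH',
    'SELECT CASE WHEN COUNT(*) > 0 THEN 1 ELSE 0 END FROM '
        + QUOTENAME(TABLE_SCHEMA) + '.' + QUOTENAME(TABLE_NAME),
    '1'
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_TYPE = 'BASE TABLE'
  AND TABLE_SCHEMA = 'fact';
```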
And for the tests that can't be generated, just be pragmatic. Don't test every bit and byte, but test what is most important.
Conclusion
Automating ETL tests allows daily testing without any user intervention and supports automatic regression tests on old code after every new release.
The decision to buy an extra tool for ETL testing or to build something pragmatic yourself depends on the budget you can spend and on your testing requirements. But remember: better something basic, maybe even built by yourself, than no test automation at all.
In the end, it will save you a lot of time, and your users will appreciate the quality of your Business Intelligence deliverables and accept the data from your BI solution as the "single version of the truth".