How well do you know your data?
Companies hold a lot of data these days, and they are investing more and more effort in putting that data to work. But as reliance on data grows, shouldn't we also invest more in data quality, so that neglecting it does not become a costly mistake?
When you want to know the state of your business, you consult KPIs, account balances, and other data that are indicative of your performance. But what if you want a view of the state of the data itself? Or, better still, of data that you receive from someone else?
After all, this is the data on which you base your business decisions.
To get that view, you need KPIs for the data itself.
These are defined by setting business rules: what should the data comply with to be considered ‘good’ data?
Consider the impact of data quality in the examples below:
| James | Jack |
| --- | --- |
| James just loves working with dashboards. He will drop his old methods lightning-quick to get his hands on the latest and greatest data-driven solutions. | Jack has been burned by bad data too many times. He has a lot of experience in making his own reports and does not trust data that he didn’t fully reconcile himself. |
| Unfortunately, James is a bit too trusting and will consider anything that looks fancy to be 100% true. | Jack would rather use his own datasets than rely on this ‘fancy’ new data lakehouse that the data team has built. He’ll also go around telling everyone they should do the same. |
| Consequence: James relied on inaccurate data, and customers are complaining about their interactions with him. | Consequence: Adoption of the data lakehouse has been extremely low, and the data team is unable to convince Jack and others that the data is being properly monitored. |
As the examples above demonstrate, trust in data is not easy to get right. Blind trust can lead to bad data-driven decisions. A lack of trust, on the other hand, means that all data integration and analytics efforts are in vain, because nobody in the organization adopts the results.
Trust is also a finicky thing. The Dutch have a great expression for this (loosely translated): "Trust arrives on foot and leaves on horseback". In other words, trust is hard-earned and easily lost. This is why data profiling is important: it helps you get a grip on data quality, and it lets you quantify that quality and demonstrate it to others.
Data Profiling
The first step in gaining trust is getting to know each other, and data profiling is how you get to know your data. Before using a dataset for integration or analytical purposes, it is recommended to understand it first, so that you can make the right decisions in the design of your solution. Data profiling is the activity of analyzing data to understand its structure, content, quality and relationships.
- Structure – How is your data stored: file formats, tables, fields, data types, …
- Data statistics – Distributions of your data: min, max, median, frequency, ...
- Relationships – What are the dependencies in your data: primary & foreign keys, cardinality, referential integrity, ...
- Quality – How does your data hold up against business rules?
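As a rough illustration of the structure, statistics and relationship checks listed above, here is a minimal sketch in plain Python. The function names (`profile_column`, `orphaned_keys`) and the sample values are hypothetical and chosen only for this example; a dedicated profiling tool will report far more than this.

```python
from collections import Counter
from statistics import median


def profile_column(values):
    """Return basic profiling statistics for one column (a list of values)."""
    present = [v for v in values if v is not None]
    profile = {
        "count": len(values),
        "nulls": len(values) - len(present),
        "distinct": len(set(present)),
        "types": sorted({type(v).__name__ for v in present}),
        "most_common": Counter(present).most_common(1),
    }
    # Min, max and median are only reported when every present value is numeric.
    if present and all(isinstance(v, (int, float)) for v in present):
        profile.update(min=min(present), max=max(present), median=median(present))
    return profile


def orphaned_keys(foreign_keys, primary_keys):
    """Return foreign-key values that have no matching primary key."""
    return sorted(set(foreign_keys) - set(primary_keys))


ages = [34, 51, None, 28, 51]  # hypothetical sample column
print(profile_column(ages))

# Hypothetical order -> customer relationship: customer 9 does not exist.
print(orphaned_keys([1, 2, 9], [1, 2, 3]))
```

The first call summarizes a single column; the second is a simple referential integrity check between two tables.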
Typically, data profiling is used to get a general insight into the content of the data. It can also help you discover frequent problems:
- Duplicate values
- Out-of-range values compared to expected standards
- Blanks or nulls
- Formatting issues in phone numbers, bank accounts, VAT numbers…
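The four problem types above can each be detected with a few lines of code. The sketch below is illustrative only: the customer records, the 0 to 120 age range and the phone number pattern are assumptions made for this example, not rules taken from a real dataset.

```python
import re
from collections import Counter

# Hypothetical customer records; field names and values are illustrative only.
customers = [
    {"id": 1, "age": 34, "phone": "+32 2 555 01 23"},
    {"id": 2, "age": 212, "phone": "02/555.01.24"},
    {"id": 2, "age": 45, "phone": None},
    {"id": 4, "age": None, "phone": "+32 2 555 01 26"},
]

# Assumed format: "+" and a two-digit country code, then space-separated digit groups.
PHONE_PATTERN = re.compile(r"\+\d{2}( \d+)+")


def find_problems(rows):
    """Report duplicates, out-of-range values, nulls and formatting issues."""
    id_counts = Counter(row["id"] for row in rows)
    return {
        "duplicate_ids": sorted(i for i, n in id_counts.items() if n > 1),
        "age_out_of_range": [
            r["id"] for r in rows
            if r["age"] is not None and not 0 <= r["age"] <= 120
        ],
        "null_counts": {
            field: sum(r[field] is None for r in rows)
            for field in ("id", "age", "phone")
        },
        "bad_phone_format": [
            r["id"] for r in rows
            if r["phone"] is not None and not PHONE_PATTERN.fullmatch(r["phone"])
        ],
    }


problems = find_problems(customers)
for name, result in problems.items():
    print(name, result)
```

Nulls are counted separately and skipped by the range and format checks, so that a missing value is not reported twice.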
Data Quality
To the trained eye, poor-quality data is easy to spot, but confirming that your data is of good quality is not such an easy job. One important goal of data profiling is to assess the quality of the data. This is done by setting business rules and then measuring how well the data complies with those rules.
Very often, these business rules fit into one of the dimensions of Data Quality:
- Completeness
- Consistency
- Accuracy
- Validity
- Uniqueness
- Timeliness
Combining the Data Profiling outcomes with the Data Quality business rules gives an objective measurement of data quality, which helps build trust among the users of the data. Repeating this exercise at regular intervals also allows you to monitor data quality over time and spot any new issues that arise.
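To make the idea of an objective measurement concrete, the sketch below scores a small dataset against one rule for each of four dimensions and reports the percentage of rows that comply. The records, the rule definitions, the 365-day threshold and the names `build_rules` and `compliance` are all assumptions made for illustration.

```python
from collections import Counter
from datetime import date

# Hypothetical records; field names, rules and thresholds are assumptions.
records = [
    {"id": 1, "email": "ann@example.com", "updated": date(2024, 5, 1)},
    {"id": 2, "email": None, "updated": date(2024, 5, 3)},
    {"id": 3, "email": "bob-at-example.com", "updated": date(2021, 1, 15)},
    {"id": 3, "email": "cas@example.com", "updated": date(2024, 4, 20)},
]
AS_OF = date(2024, 6, 1)  # fixed reference date keeps the result reproducible


def build_rules(rows, as_of):
    """Return one row-level check for each of four data quality dimensions."""
    id_counts = Counter(row["id"] for row in rows)
    return {
        "completeness: email is filled in":
            lambda r: r["email"] is not None,
        # Nulls pass here; they are already counted by the completeness rule.
        "validity: email contains '@'":
            lambda r: r["email"] is None or "@" in r["email"],
        "uniqueness: id occurs only once":
            lambda r: id_counts[r["id"]] == 1,
        "timeliness: updated in the last 365 days":
            lambda r: (as_of - r["updated"]).days <= 365,
    }


def compliance(rows, rules):
    """Return the percentage of rows that pass each rule."""
    return {
        name: 100 * sum(check(r) for r in rows) / len(rows)
        for name, check in rules.items()
    }


scores = compliance(records, build_rules(records, AS_OF))
for rule, pct in scores.items():
    print(f"{rule}: {pct:.0f}%")
```

Running the same rules against each new load of the data and storing the percentages gives a simple trend line per rule.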
Example from Ataccama Data Quality
A party table contains personal information about customers. The values in the gender field are being validated for accuracy: if a value does not comply with the reference dataset for gender, the test fails.
A quick data profile shows that null values in the column are the root cause of the issue.
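The same kind of check can be sketched outside the tool. The snippet below is not Ataccama's implementation; it is a plain Python illustration in which the reference values and the sample column are invented. It validates each value against a reference set and then profiles the failures to find the root cause.

```python
from collections import Counter

# Assumed reference dataset and sample column; the article does not list the real values.
GENDER_REFERENCE = {"F", "M", "X"}
party_gender = ["F", "M", None, "M", None, "F", None, "X"]

# The rule: a value passes only if it appears in the reference dataset.
failures = [g for g in party_gender if g not in GENDER_REFERENCE]
pass_rate = 100 * (len(party_gender) - len(failures)) / len(party_gender)

# Profiling the failing values shows what is causing the rule to fail.
failure_profile = Counter(failures)

print(f"pass rate: {pass_rate}%")
print(failure_profile)
```

In this invented sample every failing value is a null, which matches the situation described above: the rule fails because the field is empty, not because it holds unexpected codes.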
Once rules have been defined, the software can scan your data at regular intervals, so you can monitor whether clean-up efforts have paid off and whether your data quality is improving.
If you want to know more about our approach to Data Profiling and Data Quality, contact us for more information or with your specific request.