Data profiling & Data quality

How well do you know your data?

Companies hold a lot of data these days, and they are investing more and more effort in putting that data to work. But as reliance on data grows, should we not invest just as much in avoiding the costly mistake of neglecting data quality?

When you want to know the state of your business, you consult KPIs, account balances, and other data that are indicative of your performance. But what if you want a view of the state of your data instead? Or, better still, of the data that you receive from someone else?

This is important data on which you base your business decisions, so you need to determine some KPIs for the data itself.

This is done by setting business rules: what should the data comply with to be considered ‘good’ data?

Consider the impact of data quality in the examples below:

James
James just loves working with dashboards. He will drop his old methods lightning-quick to get his hands on the latest and greatest data-driven solutions. Unfortunately, James is a bit too trusting and will consider anything that looks fancy to be 100% true.
Consequence: James relied on data that was inaccurate, and customers are complaining about their interactions with him.

Jack
Jack has been burned by bad data too many times. He has a lot of experience in making his own reports and does not trust data that he didn’t fully reconcile himself. Jack would rather use his own datasets than rely on this ‘fancy’ new data lakehouse that the data team has built. He’ll also go around telling everyone they should do the same.
Consequence: Adoption of the data lakehouse has been extremely low, and the data team is unable to convince Jack and others that the data is being properly monitored.

As the examples above demonstrate, trust in data is not easy to get right. The cost of blind trust is that you might end up making bad data-driven decisions. On the other hand, a lack of trust means that all data integration and analytics efforts are in vain, because there is no adoption in the organization.

Trust is also a finicky thing. The Dutch have a great expression for this (liberally translated): "Trust comes on foot and leaves on horseback". In other words, trust is hard-earned and easily lost. This is why data profiling is important: it gives you a grip on data quality and helps you quantify and demonstrate the quality of your data to others.

Data Profiling

The first step in gaining trust is getting to know each other. In this case, data profiling can help you get to know your data. Before undertaking any effort to use data for integration or analytical purposes, it is recommended to understand the dataset you will be using, so that you can make the right decisions in the design of your solution. Data profiling is the activity of analyzing data to understand its structure, content, quality and relationships.

  • Structure – How is your data stored: file formats, tables, fields, data types, …
  • Data statistics – Distributions of your data: min, max, median, frequency, ...
  • Relationships – What are dependencies of your data: primary & foreign keys, cardinality, referential integrity, ...
  • Quality – How does your data hold up against business rules?
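As a rough illustration of the first three perspectives, here is a minimal Python sketch that uses only the standard library. The sample records, the field names and the `profile_column` helper are hypothetical and exist only for this example; a dedicated profiling tool will report far more.

```python
from collections import Counter
from statistics import median

# Hypothetical sample records; all field names are assumptions for illustration.
customers = [
    {"customer_id": 1, "country": "BE", "age": 34},
    {"customer_id": 2, "country": "NL", "age": 51},
    {"customer_id": 3, "country": "BE", "age": None},
    {"customer_id": 4, "country": "BE", "age": 27},
]
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 4},
    {"order_id": 12, "customer_id": 9},  # refers to a customer that does not exist
]


def profile_column(rows, column):
    """Return basic structure and statistics for one column."""
    values = [row.get(column) for row in rows]
    present = [v for v in values if v is not None]
    profile = {
        "types": sorted({type(v).__name__ for v in present}),  # structure
        "nulls": len(values) - len(present),
        "distinct": len(set(present)),
        "most_common": Counter(present).most_common(1),  # frequency
    }
    # Numeric statistics only make sense when every present value is a number.
    if present and all(isinstance(v, (int, float)) for v in present):
        profile.update(min=min(present), max=max(present), median=median(present))
    return profile


age_profile = profile_column(customers, "age")
country_profile = profile_column(customers, "country")

# Relationships: orders whose customer_id has no match in the customer table.
known_ids = {row["customer_id"] for row in customers}
orphan_orders = [o["order_id"] for o in orders if o["customer_id"] not in known_ids]

print(age_profile)
print(country_profile)
print("Orders without a matching customer:", orphan_orders)
```

The orphaned order is a simple referential integrity finding: it shows up only when two tables are profiled together.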

Typically, data profiling is used to gain general insight into the content of the data. Beyond that, it can help you discover frequent problems:

  • Duplicate values
  • Out-of-range values compared to expected standards
  • Blanks or nulls
  • Formatting issues in phone numbers, bank accounts, VAT numbers…
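Each of the four problem types above can be detected with a simple check. The sketch below is a minimal illustration in Python; the supplier records, the accepted range for payment terms and the simplified VAT pattern (`BE` followed by ten digits) are assumptions chosen for the example, not complete validation rules.

```python
import re
from collections import Counter

# Hypothetical supplier records containing one example of each problem.
suppliers = [
    {"vat": "BE0123456789", "payment_days": 30},
    {"vat": "BE0123456789", "payment_days": 60},    # duplicate VAT number
    {"vat": "BE0987654321", "payment_days": 900},   # out-of-range value
    {"vat": "", "payment_days": 30},                # blank
    {"vat": "0123.456.789", "payment_days": None},  # formatting issue and null
]

# Assumed simplified format: "BE" followed by exactly ten digits.
VAT_PATTERN = re.compile(r"BE\d{10}")
# Assumed business expectation for payment terms, in days.
MIN_DAYS, MAX_DAYS = 0, 120


def is_blank(value):
    """Treat None and empty or whitespace-only strings as blank."""
    return value is None or (isinstance(value, str) and not value.strip())


# Duplicates: VAT numbers that occur more than once.
vat_counts = Counter(s["vat"] for s in suppliers if not is_blank(s["vat"]))
duplicates = sorted(vat for vat, count in vat_counts.items() if count > 1)

# Out-of-range values: payment terms outside the expected window.
out_of_range = [
    s["payment_days"]
    for s in suppliers
    if not is_blank(s["payment_days"])
    and not MIN_DAYS <= s["payment_days"] <= MAX_DAYS
]

# Blanks or nulls: counted across all fields.
blank_count = sum(is_blank(value) for s in suppliers for value in s.values())

# Formatting issues: filled-in VAT numbers that do not match the pattern.
bad_format = [
    s["vat"]
    for s in suppliers
    if not is_blank(s["vat"]) and not VAT_PATTERN.fullmatch(s["vat"])
]

print("Duplicate VAT numbers:", duplicates)
print("Out-of-range payment terms:", out_of_range)
print("Blank or null values:", blank_count)
print("Badly formatted VAT numbers:", bad_format)
```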

Data Quality

To the trained eye, poor-quality data is easy to spot, but confirming that your data is of good quality is not such an easy job. One important goal of data profiling is to assess the quality of the data. This is done by setting business rules and then measuring how well the data complies with those rules.
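To make this concrete, a business rule can be thought of as a yes/no test applied to every record, with the share of passing records as the measurement. The following is a minimal sketch in Python; the transactions, the allowed currency list and the `measure` helper are hypothetical.

```python
# Hypothetical transactions; the rule below is an assumption for illustration.
transactions = [
    {"id": 1, "amount": 120.0, "currency": "EUR"},
    {"id": 2, "amount": -15.0, "currency": "EUR"},
    {"id": 3, "amount": 80.0, "currency": "eur"},
    {"id": 4, "amount": 42.5, "currency": "USD"},
]

ALLOWED_CURRENCIES = {"EUR", "USD"}


def has_valid_currency(row):
    """Business rule: the currency must be an upper-case code from the allowed list."""
    return row["currency"] in ALLOWED_CURRENCIES


def measure(rows, rule):
    """Apply one rule to every row and return (pass rate, failing rows)."""
    failures = [row for row in rows if not rule(row)]
    rate = 1 - len(failures) / len(rows) if rows else 1.0
    return rate, failures


rate, failures = measure(transactions, has_valid_currency)
print(f"{rate:.0%} of transactions comply")
print("Failing ids:", [row["id"] for row in failures])
```

Returning the failing records alongside the rate keeps the measurement actionable: the number shows how bad the problem is, and the records show where to start cleaning up.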

Very often, these business rules fit into one of the dimensions of Data Quality:

Completeness
Does the data capture all required information?

e.g. have all transactions been assigned a sales channel for correct reporting by channel?

Consistency
Is the same information represented in the same way?

e.g. are all dates in the same format? Are IBAN bank accounts formatted correctly?

 

Accuracy
Is the data free from mistakes/inaccuracies when compared to a reference?

e.g. is the legal entity name the same as was registered with the government?

Validity
Does the data conform to a specific requirement?

e.g. do the records in the employee table have a valid Belgian National ID number?

Uniqueness
Is the dataset free of duplicates?

e.g. do all the records in the supplier table have a distinct VAT number?

Timeliness
Is the data up to date, and does it reflect current reality?

e.g. is the customer’s address up to date after their move?
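One way to tie rules to these dimensions is to tag each rule with the dimension it belongs to and report a score per dimension. The sketch below assumes a hypothetical supplier table and three illustrative rules. Accuracy, validity and timeliness are left out to keep it short; accuracy and timeliness in particular need an external reference or a timestamp to compare against.

```python
import re
from collections import Counter

# Hypothetical supplier records; names and values are assumptions.
suppliers = [
    {"name": "Acme", "vat": "BE0123456789", "email": "info@acme.example", "onboarded": "2021-03-01"},
    {"name": "Globex", "vat": "BE0123456789", "email": None, "onboarded": "01/04/2020"},
    {"name": "Initech", "vat": "BE0999999999", "email": "ap@initech.example", "onboarded": "2019-11-15"},
    {"name": "Umbrella", "vat": "BE0555555555", "email": "hq@umbrella.example", "onboarded": "2022-07-30"},
]

ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")
vat_counts = Counter(row["vat"] for row in suppliers)

# Each rule is tagged with the dimension it measures: (dimension, description, check).
rules = [
    ("Completeness", "email is filled in", lambda r: r["email"] is not None),
    ("Consistency", "date is written as YYYY-MM-DD", lambda r: bool(ISO_DATE.fullmatch(r["onboarded"]))),
    ("Uniqueness", "VAT number is distinct", lambda r: vat_counts[r["vat"]] == 1),
]


def score_by_dimension(rows, rules):
    """Return, per dimension, the share of rule checks that passed."""
    passed, total = Counter(), Counter()
    for dimension, _description, check in rules:
        for row in rows:
            total[dimension] += 1
            passed[dimension] += bool(check(row))
    return {dimension: passed[dimension] / total[dimension] for dimension in total}


scores = score_by_dimension(suppliers, rules)
for dimension, score in scores.items():
    print(f"{dimension}: {score:.0%}")
```

Note that the uniqueness rule marks both records sharing a VAT number as failing, since the data alone cannot tell which of the two is the duplicate.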

Combining the Data Profiling outcomes with the Data Quality business rules gives an objective measurement of data quality, which can build trust among users of the data. Repeating this exercise regularly will also allow you to monitor data quality over time and spot any new issues that may arise.

Example from Ataccama Data Quality

A party table contains personal information on customers. The values in the gender field are validated for accuracy: if a value does not comply with the reference dataset for gender, the test fails.

[Image: Recording data quality, Ataccama]

A quick data profile shows that null values in the column are the root cause of the issue:

[Image: Screenshot of the data profile]
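The same kind of check can be sketched outside of any particular tool. The Python below is not Ataccama's implementation; it is a minimal illustration with a hypothetical party table and an assumed reference set for gender, showing how a failed reference check can be broken down to reveal nulls as the main cause.

```python
from collections import Counter

# Assumed reference dataset; the actual reference values are not given in this article.
GENDER_REFERENCE = {"F", "M", "X"}

# Hypothetical party records.
parties = [
    {"party_id": 1, "gender": "F"},
    {"party_id": 2, "gender": None},
    {"party_id": 3, "gender": "M"},
    {"party_id": 4, "gender": None},
    {"party_id": 5, "gender": "female"},
    {"party_id": 6, "gender": None},
]


def gender_is_valid(row):
    """The test passes only when the value appears in the reference dataset."""
    return row["gender"] in GENDER_REFERENCE


failing = [row for row in parties if not gender_is_valid(row)]

# Profile the failing values to find out why the rule fails.
failure_reasons = Counter(
    "null" if row["gender"] is None else "not in reference" for row in failing
)

print(f"{len(failing)} of {len(parties)} records fail the gender rule")
print(failure_reasons.most_common())
```

Splitting the failures by reason matters for the follow-up: nulls usually point to a gap at data entry, while unexpected values point to a mapping or standardization problem.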

Once rules have been defined, the software can scan your data at regular intervals so you can monitor whether any clean-up efforts have paid off and if your data quality is improving.
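Monitoring then comes down to storing the result of each scan and comparing it with the previous one. A minimal sketch, assuming a hypothetical in-memory history of pass rates for a single rule (a real setup would persist these results and run the scan on a schedule); the dates, scores and tolerance are made up for the example:

```python
from datetime import date

# Hypothetical pass rates for one rule, recorded after each scheduled scan.
history = [
    (date(2024, 1, 1), 0.82),
    (date(2024, 2, 1), 0.88),
    (date(2024, 3, 1), 0.85),
]

# Assumed tolerance: changes smaller than this are treated as noise.
TOLERANCE = 0.01


def trend(history):
    """Compare each scan with the previous one and label the change."""
    changes = []
    for (_, previous), (scan_date, current) in zip(history, history[1:]):
        delta = current - previous
        if delta > TOLERANCE:
            label = "improved"
        elif delta < -TOLERANCE:
            label = "regressed"
        else:
            label = "stable"
        changes.append((scan_date, label))
    return changes


for scan_date, label in trend(history):
    print(scan_date.isoformat(), label)
```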

If you want to know more about our approach to Data Profiling and Data Quality, contact us for more information or with your specific request.

 
