The Alternative to Garbage In, Garbage Out: a Data Quality Program

Introduction

This is the story of a ‘blue’ and a ‘green’ hospital …

In the ‘blue’ hospital, amongst other things, the medical procedures (diagnoses) and the gender associated with a patient were recorded in a database. Via a data profiling exercise, the ‘blue’ hospital discovered that certain recorded procedures were not at all possible for the associated gender and were thus leading to inaccurate facts and analyses. The conclusion drawn was simply to correct these inaccurate facts. As a result, the data was only clean at one point in time; the next day, new invalid gender and diagnosis combinations were introduced all over again.

The ‘green’ hospital, which had a data quality program and process in place, went a couple of steps further when tackling the same type of problem. Here, in addition to the profiling exercise, a root cause analysis was carried out in order to discover the true reasons behind these inaccurate facts. The root causes found were that the data was first handwritten (with the accompanying mistakes, wrong codes and missing information) and secondly that the data entry staff entered the data as fast as possible, which appeared to be one of their supervisors’ criteria. Besides the wrong combinations of gender and diagnosis, many other combinations of patients and diagnoses were most likely also inaccurate; these went undetected in the ‘blue’ hospital because all diagnosis codes were valid ones.

As a remedy, the two-step process was turned into an online data entry process, eliminating the paper 'middleware'. Furthermore, various data checking mechanisms were put in place to reduce the number of impossible combinations of data (gender and diagnosis, age and diagnosis, incompatible diagnoses, etc.), clear diagnosis code descriptions were at hand at the time of data entry, and finally the personnel was educated on the importance of entering the data correctly.
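
To make such checking mechanisms concrete, the sketch below shows, purely as an illustration, how a few of these rules could be expressed in code at data entry time. The field names, diagnosis codes and rule tables are assumptions made for the example, not the hospital's actual system.

```python
# Illustrative data entry checks of the kind described above.
# The field names, codes and rule tables are hypothetical.

FEMALE_ONLY_DIAGNOSES = {"O80"}   # e.g. a delivery-related code
MALE_ONLY_DIAGNOSES = {"C61"}     # e.g. a prostate-related code
ADULT_ONLY_DIAGNOSES = {"I25"}    # e.g. a chronic heart disease code

def validate_entry(record: dict) -> list[str]:
    """Return the list of rule violations for one data entry record."""
    errors = []
    gender = record.get("gender")
    diagnosis = record.get("diagnosis_code")
    age = record.get("age")

    if gender == "M" and diagnosis in FEMALE_ONLY_DIAGNOSES:
        errors.append(f"diagnosis {diagnosis} is impossible for a male patient")
    if gender == "F" and diagnosis in MALE_ONLY_DIAGNOSES:
        errors.append(f"diagnosis {diagnosis} is impossible for a female patient")
    if age is not None and age < 18 and diagnosis in ADULT_ONLY_DIAGNOSES:
        errors.append(f"diagnosis {diagnosis} is implausible for age {age}")
    return errors

# Rejecting the record at entry time keeps the invalid combination
# from ever reaching the database.
print(validate_entry({"gender": "M", "diagnosis_code": "O80", "age": 34}))
```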

Thanks to the improved, renewed data entry procedures, the errors prevented included not only those that were detectable, but also many others that could not be detected through a first-hand analysis. As a result, the analyses that could be done based on this hospital's data proved to be much more reliable than before.

It is obvious that it is better to be in a ‘green’ position than in a ‘blue’ position. This insight will therefore discuss in detail the need for a data quality program and how to set it up.

Business Case Skepticism

A multitude of reasons exist why typically not much has been done about data quality: low awareness of the extent of inaccurate data in the systems, low awareness of the (magnitude of the) cost of data quality issues, incorrect guesses about the impact of data quality issues, low awareness of the potential value of fixing the problems, etc. But above all, there is often skepticism about the business case for improving quality, which results in the acceptance of poor quality, or in the treatment of known issues as isolated problems instead of solving their root causes. IT might acknowledge the problems, and business analysts might as well; but the higher you move up the management chain, the more the willingness to ‘see and deal with’ the issues diminishes.

Data quality is one of those topics that everyone knows is important, but it rarely generates enough enthusiasm to get something done about it. Low data quality is a threat, but it is perceived as far less real and serious than other, more visible threats.

It is, however, the cost of inaccurate data that builds the business case. Compare it to advertising: you do not know exactly what impact it will have, but you do know that if you do not do it, you will lose a lot of sales. Information systems are becoming more and more important, even critical, today. Data quality can make those information systems more effective ... or the opposite.

Without a data quality program, data is certain to contain inaccuracies, which are normally very costly to a corporation. An active data quality program is required both to improve data quality and to maintain high levels of it.

Data quality program

Identifying inaccurate data, and as such poor data quality, and defining appropriate actions to improve accuracy is not a task for a small, isolated group of people. On the contrary, it requires a ‘quality’ data quality program.

The goal of a data quality program is to reach and maintain high levels of data quality (accuracy) within the critical data stores of the organization. It must therefore encompass all existing, important databases, and it must be part of every project related to data. Its mission is three-fold:

· Improve
This embraces the investigation of databases and information processes in order to fix the existing data quality problems, given the fact that the current state of data quality is not satisfactory.
 
· Prevent
Prevention here means helping departments and individual users to build better data checks, better data capture processes, better screen designs, better policies and so on and so forth.
 
· Monitor
Keeping a close eye on whether the improvements made and the preventive measures are as effective as initially expected.

Components of a data quality program

A data quality program embraces the following components:
· Data quality dimensions
· Organization
· Methodologies for executing activities
· Activities themselves

1 - Data quality dimensions to be addressed

A strong correlation exists between data and its use: data has quality if it satisfies the requirements of its intended use. This ‘usage’ is essential when we consider the different data quality dimensions. In the following sections, each dimension is illustrated by a small example.

Accuracy

This is the most important data quality dimension. When the data is inaccurate, the other dimensions are of little or no importance.
Suppose we have a database containing records of Belgian doctors, where the data quality is assumed to be approximately 85% (records missing, duplicate information, incorrect values, etc.).
· When this database is used for notifying doctors about a new law concerning assisted suicide, the data quality would be considered low.
· When this database is used to find potential customers for a new surgical device manufacturer, however, the data quality would be considered high.

Completeness

Suppose we have a database containing information on repairs done on medical instruments, where from time to time (approximately 5% of cases), due to people working under time pressure, this information is not entered at all.
· When this database is used to get an overall view of repair costs, the data quality, although certain details are missing, would be considered acceptable.
· When this database is used in more detail, however, for example to determine whether a specific part shows low rotation or is missing altogether, the data quality would be considered low.
These first two dimensions clearly focus on the data stored in databases. The following dimensions focus on the use and interpretation of the data by the user community.
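
As a minimal illustration of how such stored-data dimensions can be quantified, the sketch below computes a per-column completeness percentage for a hypothetical extract of the repairs table; the column names and figures are invented for the example.

```python
import pandas as pd

# Hypothetical extract of the repairs table; in practice this would be
# read from the repairs database itself.
repairs = pd.DataFrame({
    "instrument_id": ["A12", "B07", "C33", "D41"],
    "repair_date":   ["2023-01-05", None, "2023-01-09", "2023-01-12"],
    "repair_cost":   [120.0, 80.0, None, 60.0],
})

# Completeness per column: the share of non-missing values.
completeness_pct = repairs.notna().mean() * 100
print(completeness_pct.round(1))

# Whether the resulting completeness is acceptable depends on the intended use:
# fine for an overall view of repair costs, problematic for part-level analysis.
```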

Relevance

Suppose we have a medical inventory database containing part numbers, storage locations, quantity on hand, etc. for medical instruments. It does not, however, contain source information: there is no indication where the parts came from, even though multiple suppliers could supply the same part. All information that is recorded is accurate and current.
· When this database is used for tracking medical inventory transactions and for decision making, the data quality would be considered good.
· When a particular supplier reports a shipment with defective parts or potential hazards when used in surgery, however, it would be very difficult to trace the items back. The data quality would then be considered low, given the missing relevant information.

Timeliness

Suppose we have a database containing three years of invoicing information for a hospital. At the end of each month, the complete set of invoicing records is not always present; it can take until the 7th or 8th of the new month before it is complete, due to late-arriving records, corrections, etc., but at that point it can be considered 100% correct.
· When used for calculating historical trends, the data quality could be considered high.
· When used for calculating medical bonuses at the end of the month, the data quality would be considered low, given that the data is not timely enough for its intended use.

Understood

Suppose we have a database containing hospital invoices. When a complaint is filed in this application, an ‘adjustment’ invoice is created to invert the original invoice, and then a new invoice is written to the database. This procedure thus assigns two new invoice numbers: an adjustment and a replacement.
· For accounting purposes, the data quality would be considered good.
· For a business analyst trying to determine trends in invoice growth, the data quality might be considered low, as the analyst might assume that each invoice number represents a distinct invoice. Understanding the data is therefore vital.

Trusted

Suppose we have a database containing different medical instrument parts, with an application on top to determine the amount and timing of part orders (based on history and time in service). Due to a programming error in the application, ten times the amount actually required was ordered each time. This was not discovered until a large order had been sent, after which the programming error was fixed.
· Due to this lack of trust, the business analysts decided to create a mini-database (read: Excel) instead of simply using the application. Even though the data in the database was 100% correct, the application on top was not trusted, leading to data quality being perceived as low.

2 – Organization

It is essential to set up a data quality department or group responsible for the data quality program. Potentially this role can be fulfilled part-time by certain business owners. However, given the reach of its activities and the skills required, it is advisable that this group has a dedicated role in order to act both independently and proactively. As such, it requires full-time members. It should not be outsourced either, since data quality should be considered a core competency of the company.

The team members should become experts in the concepts and tools needed to identify and correct data quality issues. The following profiles / skills are essential in such a team:
· Experts in data profiling, the investigation of the issues that surface and the design of remedies. This requires a thorough understanding of the business, the ability to dig problems out of the data and a thorough understanding of what inaccurate data means.
· Expert data analysts, referring to database architecture and analytical techniques.
· Expert business analysts, referring to user requirements and business interpretation of data.
· Experts in measuring and quantifying the cost of poor information quality.
It is also key for this data quality group to interact optimally with various other parties in the organisation:
· With other data management staff. This can involve database administrators, data architects, repository owners, application developers ...
· With key users, which can involve business analysts, power users …
· With an advisory group, which includes membership from all the relevant business functions and/or departments. The goal here is to maintain an inventory of quality assurance projects worth doing, and to prioritize and assign work from it. Furthermore, the advisory group should help in assessing the impact of data quality problems, in assessing the impact of corrective measures and in planning their implementation. It is important that the team making the remedy recommendations includes representatives from both IT and the business, in order to avoid recommending something that will be rejected upon further inspection.

3 - Methodologies for executing activities

Before a remedy can be designed and implemented, it is important to identify the data quality problems. Two approaches can be taken here.

Outside-in

The outside-in approach looks for issues in the business. Together with the business, it looks for evidence of negative impacts on the organization that may originate in bad data quality: returned goods, modified orders, complaints, rejected reports, lost customers, missed opportunities, incorrect decisions, corrective activities ... Next, users are interviewed to determine their trust in the data. Finally, the problems are traced back to the data to determine whether the data caused them.

The drawbacks are that this approach requires spending a lot of time interviewing people in other departments, and that it only catches problems which manifest themselves in external behavior, behavior which must moreover be recognizable as undesirable. Knowing that ‘perception is reality’, it is nevertheless crucial to have a good understanding of how end users perceive the data quality. If they think there are data problems, they will start looking for alternative information sources, which will have a negative impact on the correct use of the available BI environment.

Inside-out

The inside-out approach looks for issues in the data. It starts with a complete and correct set of rules that define the accuracy of the data. This is metadata (element descriptions, permitted values, relationships ...). It should be taken into account that this information usually does not exist or is incomplete; data profiling, however, can complete and correct the metadata. The data analyst can then discuss the results with business analysts and key users in order to refine the metadata. Together, the refined metadata and the data profiling produce evidence of inaccurate data. Finally, the inaccurate data is studied to determine the impact on the business that has already occurred or could occur in the future.
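
As a minimal sketch of what such a screening step could look like, the example below turns a few metadata rules into executable tests that flag the violating records. The table, columns and rules are assumptions for illustration only, not a prescribed implementation.

```python
import pandas as pd

# Hypothetical invoice extract, for illustration only.
invoices = pd.DataFrame({
    "invoice_id": [1001, 1002, 1003, 1004],
    "amount":     [250.0, -40.0, 0.0, 180.0],
    "status":     ["PAID", "OPEN", "UNKNOWN", "PAID"],
})

# Each rule derived from the (profiled and refined) metadata becomes a test
# that returns the offending rows.
rules = {
    "amount must be positive":
        lambda df: df[df["amount"] <= 0],
    "status must be a permitted value":
        lambda df: df[~df["status"].isin({"PAID", "OPEN", "CANCELLED"})],
}

for name, rule in rules.items():
    violations = rule(invoices)
    if not violations.empty:
        print(f"Rule violated: {name}")
        print(violations, "\n")
```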

The benefits are that this approach is easier to accomplish than the outside-in approach: it generally requires less time, and it mainly uses members of the data quality group itself, minimally bothering anyone else. Finally, it catches many problems that the outside-in approach will not catch. (For a more detailed insight focused on this technique, please refer to Data quality screening: an inside-out approach to improving data quality.)

The drawback is that this approach will not catch problems where the data is inaccurate but valid. An example might be a wrong but valid part number on an order, leading to wrong shipments; this will not be detected using this approach.

A data quality program should use both approaches in order to capture all relevant data quality issues. It should be kept in mind, however, that using both techniques still does not guarantee a 100% safe haven: what about data that is valid but wrong, for which no sufficient external evidence is produced to raise a flag?

4 – Activities

Project services (for new projects)

This involves working directly with other departments on IT-related projects. In a normal setting, a project team focuses mainly on the core project activities (requirements analysis, design, build & test) and, in the best case, only briefly touches upon certain data quality aspects, potentially dealing with them in an inappropriate manner. The data quality team, on the other hand, focuses on exactly those aspects.

The data quality group provides:
· Data profiling activities, which give accurate and complete metadata descriptions of the data (see the sketch after this list). The target system requirements are matched against the content and structure of the source systems, and input is given for developing the ETL processes. The data profiling activities also result in an inventory of detected data quality problems: is the source data strong enough, or is there a need for a project tackling improvements in the source system?
· Advice on the design of the target database structure in terms of using correct data types, lengths, referential integrity, check constraints, etc.
· Advice on the process for collecting and updating data.
· Advice on how to embed checking & monitoring functions in the new system.
· Advice on how to design user-friendly screens resulting in less error generation.
· A consistent way of dealing with data quality across various projects.
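
A very small sketch of the kind of profiling and checking referred to above is shown below; the tables and columns are invented for the example, but the checks (data types, missing values, distinct values, orphaned foreign keys) are typical of what feeds the metadata descriptions and the inventory of issues.

```python
import pandas as pd

# Hypothetical source extracts for a new project, for illustration only.
patients = pd.DataFrame({"patient_id": [1, 2, 3],
                         "gender": ["F", "M", None]})
admissions = pd.DataFrame({"admission_id": [10, 11, 12, 13],
                           "patient_id": [1, 2, 2, 9]})  # patient 9 does not exist

# Basic column profile of the patients table: type, missing values, distinct values.
profile = pd.DataFrame({
    "dtype": patients.dtypes.astype(str),
    "missing_pct": (patients.isna().mean() * 100).round(1),
    "distinct": patients.nunique(),
})
print(profile)

# Referential integrity: admissions pointing to non-existent patients.
orphans = admissions[~admissions["patient_id"].isin(patients["patient_id"])]
print("Orphaned admissions:\n", orphans)
```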

Stand-alone assessments (for existing databases & applications)

This involves performing assessments entirely within the data quality group and on its own initiative. It encompasses checking the health of an existing application and database (because it is suspect, or because it is a very important database), using the inside-out approach while keeping the other departments involved throughout the entire process.

Teach & preach

This involves educating and encouraging employees in other departments or groups to:
· Perform data auditing functions.
· Deploy best practices in designing and implementing new systems such as embedding quality checkpoints in any project.
· Collect and advertise the value realized by the organization from these activities, which also includes properly informing the executive team.
· Train others on technology for building & maintaining systems, how to develop quality requirements, how to qualify data.
· Create the necessary awareness among business users of quality issues and of the importance of data accuracy. This also means that business users should be given feedback on the quality of the data they generate.
· Define best practices. As more and more issues pass through the process, the data quality group will learn which types of remedy are most effective and which ones can be easily adopted; this can then be converted into best practices.
 

Conclusion

Without proper attention to data quality, data is certain to show inaccuracies, which are normally very costly to the organization. When the mindset to improve data quality is there, a good starting point is to define and execute a proper data quality program in order to reach and maintain high levels of data quality. A data quality program can and will proactively turn the tide in terms of data quality. It is essential that the data quality program integrates as much as possible with other business activities, in order to be as successful as possible.

After all ... everybody is working in data quality, since continuous process quality improvement is what it’s all about.