Data quality screening : an inside-out approach to improving data quality
An undertaking of improving data quality or getting insight into the data quality provided, should answer amongst other the following questions;
- Which data quality issues exist?
- Is data quality getting better or worse?
- Which source systems or data providers generate the most and least data quality issues?
- Are there interesting patterns or trends revealed over time?
- Which business (decisions) is (are) impacted by a lack of data quality?
- Is there any correlation observable between the data quality levels and the performance of the organization as a whole?
- Which of the data quality issues and validations consume the most/least time in the ETL window?
- drives the current operations of an organization
- records the history of activities of an organization
- is used as basis for important future business decisions
- gets more valuable, time after time, as additional ways are found to employ it (e.g. RFID – Radio-Frequency Identification, which is an automatic identification method, relying on storing and remotely retrieving data using devices called RFID tags. RFID technology enables improved insight in item locations, item movements and other item properties, such as temperature, on a very granular level)
- volumes are increasing exponentially whereas data is generated by more people, is used in the execution of more tasks by more people, etc.
- is the main
asset of many large organizations like insurance companies, banks, etc. but also increasingly for
manufacturing organizations, which merely produce and ship products, but use data that drives processes of material acquisition, shipping, billing, just-in-time delivery, ...
Suppose having a database containing records of Belgian doctors, where the data quality is assumed to be approximately 85% (records missing, double information, incorrect values, etc.).
- When this database is used for notifying doctors about a new law concerning assisted suicide, then data quality would be considered as low
- When this database is used to find potential customers for a new surgical device manufacturer, then data quality would be considered as high
Possible impact of poor-quality data
- Transaction rework costs: e.g. Organizations which have entire departments handling customer complaints on mishandled orders where data errors might be the cause: wrong part numbers, wrong amounts, incorrect shipping addresses
- Cost incurred in implementing new systems: e.g. New systems which need to cope with older systems rules, data migration, bad data, un-documentation
- Delays in delivering data to decision makers: e.g. Wrong data can require, in a recurring way, manual massaging of the information before it can be released for consumption
- Poor customer satisfaction or lost customers through poor service: e.g. Customers consistently getting orders shipped incorrectly, customers getting wrong invoices
- Lost production through supply chain problems: e.g. Supply chain systems delivering wrong parts or wrong quantity to the production line, causing an oversupply or a stoppage
Sources for inaccurate data
Initial data entry mistakes are obviously one of the main sources for inaccurate data, but not the only ones:
- Wrong values entered
- Data entry people who do not care to do their job right
- Confusing and contradictory data entry screens or forms
- Procedures that allow for data to not be entered, wrongly entered
or not be entered on time
- Procedures or policies that promote entering wrong values
- Poorly defined database systems and problems due
to data syncronisation between different systems
- Data that is initially created accurately, but becoming inaccurate over time when the database is not updated to reflect the occurred changes
- Inaccurate data is often created from perfectly good data through the processes used to move and restructure data such as extract, cleansing, transformation, loading, integration or migration
- Screen succeeds = data ok
-> Data is loaded further downstream
- Screen fails = data not ok (severe errors)
-> Data is placed in suspension until further investigation or
-> Data will be subject to cleansing or
-> ETL is fully stopped
In the case of different parties delivering their data, and when one or more severe errors are detected, it might be a valid option to not load the data any further, but to ask for new, higher quality data. These severe errors will also aid in reducing the number of iterations required in a pre-production phase, in order to have good quality datasets. In a post-production phase these severe errors will remain as they are.
- Screen fails = data is suspicious (warning)
-> Data is tagged, but loaded further downstream
The sole purpose of ‘warnings' is to flag ‘strange' data, and to communicate this appropriately to the corresponding source owner. In the context of different parties providing their data to a central data warehouse, these warnings will aid in reducing the number of iterations required in a pre-production phase, in order to have initial quality datasets. In a post-production phase the organization can decide to disable these screens. In this case, the warnings do not have the pretention to be an exhaustive formal data content validation. They are merely a help towards the different parties, to improve data quality.
- Column screens – rules based on one single column
- Mandatory columns containing null values
- Values not adhering to a predefined list of values
- Numeric values not fitting within a predefined range
- Column lengths which are too long
- Structure screens – rules based on multiple columns
- Invalid hierarchical roll ups
- Invalid parent - child relationships
- Invalid foreign key – parent key relationships
- Invalid code – description relationships
- Business rule screens – complex rules
- ‘Key account' customers should have at least 100.000 euro revenue in a predefined period
Screening star schema
Most important attributes:
- All foreign surrogate keys referring to the primary surrogate keys in the dimensions parties, sources & objects
- Screen description: describing what the screen is about
- Screen category: column, structure or business rule screen
- Screen type: a grouping of screens which logically belong together. E.g. all screens which validate whether a mandatory column is filled in or all Referential Integrity screens
- Other clusters next to screen type & screen category might be added as well
- Object: source table or flat file to which the screen applies to
- Column: column name to which the screen applies to
- Stage before / after indicator: indication whether the screen should be executed before or after the stage mentioned in the attribute stage
- Stage: in which ETL stage should the screen be run
- Processing order: order within the stage in which the screens should be run
- Corrective action: Possibly a corrective action can be made / defined on the erroneous record. This can be the description of the action, or the actual SQL statement to perform the correction
- Exception action: should the record be passed or rejected when this type of error is encountered? Is the screen a severe error or merely a warning? Should the ETL be stopped overall?
- Active flag: possibly a screen is no longer required (retired) or not yet required for a certain source / object. Depending on the requirements a screen can be flagged active in a pre-production phase and can be retired when going to a production phase.
- Error detail: specifies whether all erroneous records will be logged in the error table, or only a summary will be logged or yet another scenario
- SQL statement: contains the SQL statement to be executed. Other attributes within this table form the metadata which will be used to dynamically create the SQL statement which needs to be executed. Optionally an explicit SQL statement, which doesn't need to be generated, can be stored here
- Generate flag: specifies whether the SQL statement needs to be generated or not.When new screens are added, the corresponding SQL statements should be generated by running a procedure
The batches dimension represents all occurrences of the overall batch process. A batch contains a predefined number of stages (e.g. flat file to staging area stage). Each stage contains a predefined number of processes (e.g. load customer data from flat file to staging area). Within the processes dimension the following attributes might be very interesting:
- Process description: for the screening processes, each executed screen will be shown as a process in the process table. Optionally the entire SQL statement which was executed to satisfy a screen could be stored here for reference purposes.
- Start time: time before a process is started
- End time: time after a process is ended
- Time elapsed: time expressed in seconds between start time & end time. This value expresses the time window required for executing a specific screen. Interesting towards aggregation of screen timings and trend analyses for 1 specific screen.
- Status: expresses whether the screening was successful or a failure
one single entry, up to all delivering subsidiaries for a global company.
- Natural key: business key which is defined for the object
Most important attributes:
- Surrogate primary key in case a single record needs to be identified in the fact table
- All foreign surrogate keys referring to the primary surrogate keys in the dimensions dates & batches
- Time: time in seconds since midnight
- Key: the record id (in case of a flat file) or the business key value (in case of a table) is stored here. To be used as a back pointer to the source system
- Error description: a comprehensive error description is given here
- Occurrence: number of occurrences to which the error applies to. 1 by default
A best practice is to create a reference table containing all possible screens definitions together with clear examples which can be communicated to the source owners. On top of the screening star schema a semantic layer, through means of a business intelligence tool, could be deployed. The objective should be to have a self-explanatory error report. In this way a fully automated loop can be established between data being delivered, data screening and communication returned to the owner.
Advantages of the screening technique
- Framework for capturing all data quality errors
- Framework for measuring data quality over time
- Easily integrated with existing metadata
- Validations, which are no longer needed, can be easily (logically) removed
- Thorough – being as complete as possible
- New validations – screens - can be added easily
- Possibility to qualify severity
- Readable & maintainable by non-technical persons
- Low cost
- One central place for all errors
- Error traceability - backward traceability up to the delivered flat files or sources
- Possibility for trend analysis or other analytics via the star schema