Project Driven Data Quality Improvement Ron Forino DAMA - Washington, DC September 1999
Examples • According to DM Review, one European company discovered through an audit that it was not invoicing 4% of its orders. With $2 billion in revenues, that meant $80 million went unpaid. • Electronic data audits show that invalid data values in the typical customer database average around 15-20%. Physical audits suggest that the number may be closer to 25-30%. • In 1992, 96,000 IRS tax refund checks were returned “undeliverable” due to incorrect addresses. • This year, incorrect price data in retail databases will cost American consumers as much as $2.5 billion in overcharges. • According to organizations such as The Data Warehousing Institute, the Gartner Group and META Group, data quality is among the top three success factors for data warehousing. • The average mid-sized company may have 30,000-50,000 fields in files, tables, screens, reports, etc. [Platinum Technology]
Agenda • Definitions • What is Data Quality? • Tactics and the End Game • Building Blocks to Data Quality • Tactical Initiatives • Strategic Initiatives • Tactical Data Quality • Rule Disclosure • Data Quality Measurement, Analysis and Certification • Meta Data Creation • Validation • Quality Improvement
Definitions • Data Transformation - Changing data values to a format consistent with the integrity and business rules agreed to by data stakeholders. • Data Cleansing - Consolidation of redundant customer records; the term describes the “merging and purging” of customer lists in an effort to reduce duplicate or inaccurate customer records. • Data Quality Improvement - The process of improving data quality to the level required to support the enterprise information demand. • Data Quality - definition to follow…
Data Quality Improvement Decision Tree [decision-tree diagram mapping tasks to processes; nodes include: Transform, Conform to Business Rule, Data Reengineering, Data Quality Improvement, Process Reengineering, Standardize, Validate, Match, Dedupe, Integrate, Enrich, Match & Dedupe, Data Cleansing]
Tactics and The End Game [flow diagram] Elements: “We need better data quality...”, Enterprise Initiative, Select Project, Data Quality Assessment, Report & Recommendations, Source System Clean-up Initiative. A second build of the slide adds the Data Warehouse path, with Staging Specifications produced alongside the assessment report.
How Can We Know Good Data Quality? • Column 1 • 321453 • 212392 • 093255 • 214421 • … Is this Good Data Quality? What can we conclude? (See the profiling sketch below.)
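With only raw values and no definition, the most we can do is profile the column. A minimal sketch, assuming an illustrative in-memory list stands in for the real extracted file (the column name and sample values are hypothetical):

```python
from collections import Counter

# Hypothetical sample of Column 1; in practice the values would be
# extracted from the source file or table under assessment.
column_1 = ["321453", "212392", "093255", "214421"]

profile = {
    "count": len(column_1),
    "distinct": len(set(column_1)),
    "lengths": Counter(len(v) for v in column_1),       # are all values the same width?
    "all_numeric": all(v.isdigit() for v in column_1),  # purely numeric values?
    "leading_zero": sum(v.startswith("0") for v in column_1),
}
print(profile)
```

Profiling tells us these are six-digit numeric values, but without the field's definition and domain rules it still cannot tell us whether they are good data; that is the point of the definition that follows.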
What is Data Quality? • Information Quality = f(Definition + Data + Presentation) • Definition • Defines Data • Domain Value Specification • Business Rules that Govern the Data • Information Architecture Quality • Data Content • Completeness • Validity/Reasonability • Data Presentation • Accessible • Timely • Non-ambiguous
Common Data Quality Problems • Data Content: Missing Data; Invalid Data; Data Outside Legal Domain; Illogical Combinations of Data • Structural: Record Key Integrity; Referential Integrity; Cardinality Integrity • Migration/Integration: Rationalization Anomalies; Duplicate or Lost Entities • Definitions and Standards: Ambiguous Business Rules; Multiple Formats for the Same Data Elements; Different Meanings for the Same Code Value; Multiple Code Values with the Same Meaning; Field Used for Unintended Data; Data in Filler; Y2K Violation
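Several of these problems can be detected mechanically once the rules are known. A minimal sketch of two structural checks, referential integrity and duplicate record keys, assuming hypothetical in-memory record lists rather than real source tables:

```python
# Hypothetical customer and order records; real checks would run against the source tables.
customers = [{"cust_id": "C001"}, {"cust_id": "C002"}]
orders = [
    {"order_id": "O-1", "cust_id": "C001"},
    {"order_id": "O-2", "cust_id": "C999"},  # points at a customer that does not exist
    {"order_id": "O-1", "cust_id": "C001"},  # duplicate primary key
]

# Referential integrity: every order must reference an existing customer.
known_customers = {c["cust_id"] for c in customers}
orphans = [o for o in orders if o["cust_id"] not in known_customers]

# Record key integrity: order_id should be unique.
seen, duplicates = set(), []
for o in orders:
    if o["order_id"] in seen:
        duplicates.append(o["order_id"])
    seen.add(o["order_id"])

print("orphan orders:", orphans)
print("duplicate keys:", duplicates)
```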
Building Blocks to Data Quality
Building Blocks of a Data Quality Program [building-block diagram] • Benefits Realization • Strategic: Defect Prevention; DQ Requirements; Quality Reengineering; Enterprise Cultural Shift; QC/Process Auditing; Data Stewardship • Tactical: Validation; Quality Improvement; Meta Data Creation; Measure, Analyze & Certify; Rule Disclosure
Steps to Tactical Data Quality • Rule Disclosure • Measure Quality • Analyze & Certify • Meta Data Creation • Validation • Quality Improvement
Sources of Meta Data • Legacy Meta Data • Data Models, Process Models • Data Dictionary, Definitions, Aliases • Glossary of Terms • Transformation Meta Data • Data Mapping • Transformation Rules • Error Handling Rules • Access Meta Data • Data Directory • Data Definitions • The Subject Matter Expert • Database Directory • Domain Values, Range of Values • Run Books • Derived Data Calculations • Audit Statistics • Source & Transformation
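One practical way to pull these sources together is a single consolidated metadata record per field. A minimal sketch, assuming a hypothetical dataclass whose attribute names are illustrative rather than taken from any particular repository:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class FieldMetaData:
    """Consolidated metadata for one field, gathered from legacy,
    transformation and access sources."""
    name: str                                   # field name as it appears in the source
    definition: Optional[str] = None            # from the data dictionary or the SME
    aliases: List[str] = field(default_factory=list)
    domain_values: List[str] = field(default_factory=list)  # legal values or ranges
    source_system: Optional[str] = None
    transformation_rule: Optional[str] = None   # mapping / derivation, if any
    audit_notes: List[str] = field(default_factory=list)    # findings from assessments

# Example entry built from collected documentation and SME input (values are illustrative).
status = FieldMetaData(
    name="CUST-STATUS",
    definition="Current standing of the customer account",
    domain_values=["A", "I", "P"],
    source_system="Billing legacy master",
)
print(status)
```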
Acquiring good Meta Data is Essential Meta Data can be gathered before, during or after the Assessment: • Collect Documentation → Validate the Meta Data → Assess the Data → Report Findings • Collect Documentation → Assess the Data → Report Findings → Validate Findings • Preferred: Collect Valid Meta Data → Assess the Data → Report Findings “You can pay me now, or you can pay me later…”
Measuring Data Quality • Techniques • Tools • Methods
How can Data Quality be Measured? “One accurate measurement is worth a thousand expert opinions” [Grace Hopper, Admiral, US Navy] • Customer Complaints • User Interviews & Feedback • Customer Satisfaction Survey • Data Quality Requirements Gathering • Data Quality Assessments
Measuring Data Quality - Tools • Analysis Tools • Specifically designed assessment tools • Quality Manager, Migration Architect • Name & Address (N&A) tools: Trillium, Group-1, ID Centric, Finalist, etc. • Improvisations • SAS, Focus, SQL, other query tools • Other Necessary Tools • File Transfer • Data Conversion
Assessment Measurements • Level 1: Completeness • Nulls or Blanks • Misuse (or overuse) of Default Values • Level 2: Validity • Data Integrity Anomalies • Invalid Data based on Business Rule • Level 3: Structural Integrity • Primary Key Uniqueness • Key Structure (Cardinality, Referential Integrity, Alternate Keys) • Level 4: Business Rule Violations • Relationship between two or more fields • Calculations [diagram labels: Field Integrity, Intuitive Integrity Rules, Business Rule Integrity, Requiring Meta Data] (The four levels are sketched in code below.)
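The four levels translate into simple checks that can be scripted with any query tool. A minimal Python sketch, assuming hypothetical records, one required field, one domain rule and one cross-field business rule (none of which come from the presentation itself):

```python
records = [
    {"cust_id": "C001", "status": "A", "open_dt": "1998-03-01", "close_dt": ""},
    {"cust_id": "C002", "status": "",  "open_dt": "1999-05-10", "close_dt": ""},
    {"cust_id": "C002", "status": "X", "open_dt": "1999-07-01", "close_dt": "1999-06-01"},
]
total = len(records)

# Level 1 - Completeness: nulls or blanks in a required field.
blank_status = sum(1 for r in records if not r["status"].strip())

# Level 2 - Validity: values outside the agreed domain.
valid_status = {"A", "I", "P"}
invalid_status = sum(1 for r in records
                     if r["status"].strip() and r["status"] not in valid_status)

# Level 3 - Structural integrity: primary key uniqueness.
duplicate_keys = total - len({r["cust_id"] for r in records})

# Level 4 - Business rule: close date may not precede open date (ISO dates compare as strings).
rule_violations = sum(1 for r in records if r["close_dt"] and r["close_dt"] < r["open_dt"])

for label, count in [("blank status", blank_status), ("invalid status", invalid_status),
                     ("duplicate keys", duplicate_keys), ("close < open", rule_violations)]:
    print(f"{label}: {count} of {total} ({count / total:.0%})")
```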
Report Card Analyze and Certify • Identifying Problems • Sizing up Problems • “To Certify or Not to Certify…”
Template - field level • Value - the domain occurrence • Frequency - the number of occurrences within the data set • Percent - the % of the whole set • 88 Info - the copybook (level-88) definition for the value • Analysis - comments about our findings
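A report in this template can be produced directly from a frequency distribution of the field. A minimal sketch, assuming an illustrative list of raw values and a hypothetical set of copybook level-88 descriptions; the Analysis column remains the analyst's commentary:

```python
from collections import Counter

# Hypothetical raw values for one field, plus the copybook level-88 meanings we know about.
values = ["A", "A", "A", "I", "I", "", "X"]
copybook_88 = {"A": "ACTIVE", "I": "INACTIVE", "P": "PENDING"}

total = len(values)
print(f"{'Value':<8}{'Frequency':>10}{'Percent':>10}  88 Info")
for value, freq in Counter(values).most_common():
    info = copybook_88.get(value, "<not defined>")
    shown = value if value else "<blank>"
    print(f"{shown:<8}{freq:>10}{freq / total:>9.1%}  {info}")
# Analysis (filled in by the analyst): e.g. "X" and blanks fall outside the
# copybook definitions and need SME review.
```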
Identifying Problems [annotated report excerpt] Analysis (and Discovery) • 1. Is the field required? If so, blanks indicate an anomaly. • 2. Are the values “ID206” and “STANG” allowed? (Is this a problem with the data or with the Meta Data?) • 3. Some values occur in only 1.3% of the records. Is this telling us there is a problem?
Field Analysis [bell-curve distribution chart] In a range of values, in the absence of domain rules, investigate the first and last 0.2% of the distribution. (See the sketch below.)
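The "first and last 0.2%" guideline is easy to apply once the values are sortable. A minimal sketch, assuming hypothetical numeric values and treating the guideline as symmetric tails of the sorted distribution:

```python
def tails_to_investigate(values, fraction=0.002):
    """Return the lowest and highest slices of the sorted values --
    the first and last 0.2% by default -- as candidates for review."""
    ordered = sorted(values)
    k = max(1, int(len(ordered) * fraction))  # always surface at least one value per tail
    return ordered[:k], ordered[-k:]

# Hypothetical field values; real input would come from the extracted column.
amounts = [125.00, 130.25, 98.50, 142.10, 0.01, 99999.99, 101.75, 118.40]
low, high = tails_to_investigate(amounts)
print("low tail:", low)    # e.g. the suspicious 0.01
print("high tail:", high)  # e.g. the suspicious 99999.99
```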
Example: Data Quality Repository [repository screenshot highlighting newly discovered rules]
Meta Data Supply Chain [flow diagram] Elements: Definition & Domain Meta Data Gathering; Data Quality Statistical Reports; Knowledge Management; DQ Assessment; Transformation & Edit Recommendations; Data Quality & Definition Validation; SME Validation; Data Cleansing; Update Meta Data; Field Name; Data Inventory; Meta Data Work Groups; Data Requirements
Report Validation SME validation… an opportunity to improve Meta Data 1. Supply a clear name for the field. 2. Is there a good definition? 3. Can the business rules be made public? 4. Will the SME initiate a data cleansing initiative? 5. Does the SME recommend edit or data transformation rules? 6. Are the findings consistent with the SME's expectations? [Annotated report sections: Identification, Field Definition & Rules, Score & Explanation, Statistical Reports & Analysis, keyed to the numbered questions above]
Quality Improvement
Next Steps [flow diagram] Elements: Continued Monitoring; Legacy Data Extractions; Monthly Reports; Information Management Objectives; Initiatives; Data Clean-up; Legacy System Enhancements & Re-engineering; Management Report & Recommendations; Perform Baseline Assessment (Discovered Business Rules); Steering Committee; Data Migration Transformation & Cleansing Specifications; Metadata, Models, Reports, etc.
Lessons Learned - Data Cleanup [$$-versus-quality chart with axes 100% Completeness and 100% Accuracy: more complete, more error prone; more accurate, less data; most complete, most accurate, most costly, most timely]
Summary • We made the distinction between: - Data Migration - Data Quality - Data Cleansing • We defined what “good” data quality is. • We discussed the 10 or more processes that can take place in building a comprehensive data quality program for the enterprise. - The tactical should precede the strategic [or be the first step of it]. • There are 6 steps to an effective tactical data quality initiative: - Rule Disclosure - Quality Measurement - Analyze and Certify - Meta Data Creation - Validation - Quality Improvement
Reference Material • The Deming Management Method (Total Quality Management), Mary Walton • Data Quality for the Information Age, Tom Redman • The Data Warehouse Challenge: Taming Data Chaos, Michael Brackett • Improving Data Warehouse and Business Information Quality, Larry English • DM Review Magazine, Information Quality series by Larry English
Ron Forino Director, Business Intelligence DMR Consulting Group (732)549-4100 X-8292 rforino@dmr.com ronforino@aol.com