E N D
“A company’s most important asset is information. A corporation’s ability to compete, adapt, and grow in a business climate of rapid change is dependent in large measure on how well the company uses information to make decisions … Sharing information that isn’t clean and consolidated to the fullest extent can substantially reduce the effectiveness of a system of significant investment and considerable pay-off potential.” Stoker, 1999
INTRODUCTION WHY “DIRTY” DATA CLEANSING STEPS CONCLUSION Today’s Coverage INTRODUCTION • Data Cleansing and Data Quality • Steps in Data Cleansing • Why is “Dirty” Data a Problem? • Why is Legacy Data “Dirty”? • To Cleanse or Not To Cleanse • Parsing • Matching • Correcting • Consolidating • Standardizing • Conclusion • Demonstration • Questions WHY “DIRTY” DATA CLEANSING STEPS CONCLUSION
INTRODUCTION WHY “DIRTY” DATA CLEANSING STEPS CONCLUSION Data Cleansing and Data Quality • Data is a product that can be characterized as either “quality” or “non-quality.” The ability to make quality decisions depends in part on the decision-maker’s ability to access quality data. • Data cleansing is the process that insures that the same piece of information is referred to in only ONE way. When data is clean, its users can focus on its use and not its credibility.
INTRODUCTION WHY “DIRTY” DATA CLEANSING STEPS CONCLUSION Steps in Data Cleansing • Parsing • Correcting • Standardizing • Matching • Consolidating
INTRODUCTION WHY “DIRTY” DATA CLEANSING STEPS CONCLUSION Why is Data “Dirty” and Why is This a Problem? • Simply put, dirty data for data warehouses is the product of relying on data from legacy systems. • But if company’s have relied on this data for decades, why is it a problem today? • Because a data warehouse “promises” to deliver “a single version of the truth.” Unfortunately integrating data from different sources magnifies its problems.
INTRODUCTION WHY “DIRTY” DATA CLEANSING STEPS CONCLUSION Why is Legacy Data “Dirty” ? • Dummy Values, • Absence of Data, • Multipurpose Fields, • Cryptic Data, • Contradicting Data, • Inappropriate Use of Address Lines, • Violation of Business Rules, • Reused Primary Keys, • Non-Unique Identifiers, and • Data Integration Problems
INTRODUCTION WHY “DIRTY” DATA CLEANSING STEPS CONCLUSION To Cleanse or Not to Cleanse • CAN the legacy data be cleansed? • Sometimes the answer is “NO” • Then, SHOULD it be cleansed? • Again, sometimes “NO” • Next, WHERE should it be cleansed? • Finally, HOW should it be cleansed?
INTRODUCTION WHY “DIRTY” DATA CLEANSING STEPS CONCLUSION Steps in Cleansing Data • Parsing • Correcting • Standardizing • Matching • Consolidating
INTRODUCTION WHY “DIRTY” DATA CLEANSING STEPS CONCLUSION Parsing Parsing locates and identifies individual data elements in the source files and then isolates these data elements in the target files.
INTRODUCTION WHY “DIRTY” DATA CLEANSING STEPS CONCLUSION Parsing
INTRODUCTION WHY “DIRTY” DATA CLEANSING STEPS CONCLUSION Correcting Corrects parsed individual data components using sophisticated data algorithms and secondary data sources.
INTRODUCTION WHY “DIRTY” DATA CLEANSING STEPS CONCLUSION Correcting
INTRODUCTION WHY “DIRTY” DATA CLEANSING STEPS CONCLUSION Standardizing Standardizing applies conversion routines to transform data into its preferred (and consistent) format using both standard and custom business rules.
INTRODUCTION WHY “DIRTY” DATA CLEANSING STEPS CONCLUSION Standardizing
INTRODUCTION WHY “DIRTY” DATA CLEANSING STEPS CONCLUSION Parsing, Correcting, Standardizing TITLE FIRST CONC. LAST GENER. NAME LINE William Mr. Bill St. John III 101 S. Main Strete Sant. Louis, MO 63181 HSNO ST-DIR ST-NM ST-TYPE St. STREET LINE CITY STATE POST St. 63118 GEOG. LINE
INTRODUCTION WHY “DIRTY” DATA CLEANSING STEPS CONCLUSION Matching Searching and matching records within and across the parsed, corrected and standardized data based on predefined business rules to eliminate duplications.
INTRODUCTION WHY “DIRTY” DATA CLEANSING STEPS CONCLUSION Match Patterns Customer #/Tax ID Branch Type Vendor Code Pattern I.D. Business Name Street City Pattern Exact Exact Exact Exact Exact Exact AAAAAA P110 Exact VClose Exact VClose Exact Blanks ABAAA- P115 Exact VClose Exact Blanks Exact Exact ABA-AA P120 Exact VClose Close Close Exact Exact ABCCAA S300 VClose VClose Exact Close Exact Exact BBACAA S310
INTRODUCTION WHY “DIRTY” DATA CLEANSING STEPS CONCLUSION Matching
INTRODUCTION WHY “DIRTY” DATA CLEANSING STEPS CONCLUSION Consolidating Analyzing and identifying relationships between matched records and consolidating/merging them into ONE representation.
INTRODUCTION WHY “DIRTY” DATA CLEANSING STEPS CONCLUSION Consolidating
William Jones Janet Jones Karen Jones William Jones Jr. INTRODUCTION WHY “DIRTY” DATA CLEANSING STEPS CONCLUSION Consolidating
INTRODUCTION WHY “DIRTY” DATA CLEANSING STEPS CONCLUSION Recommended Best Practices 1. Use metadata to document rules 2. Determine data cleansing schedule 3. Build quality into new and existing systems
INTRODUCTION WHY “DIRTY” DATA CLEANSING STEPS CONCLUSION Legacy Systems View (3 Clients) Account No.83451234 Policy No.ME309451-2 Transaction B498/97
INTRODUCTION WHY “DIRTY” DATA CLEANSING STEPS CONCLUSION The Reality – ONE Client Account No.83451234 Policy No.ME309451-2 Transaction B498/97
INTRODUCTION WHY “DIRTY” DATA CLEANSING STEPS CONCLUSION Demonstration • Valityhttp://www.vality.com • Trillium Software http://www.trilliumsoft.com • First Logichttp://www.firstlogic.com