Chapter 1 Introduction to Data Quality
Data Quality Characteristics
Data quality affects several attributes associated with data:
• Accuracy – Is it realistic or believable?
• Integrity – Is it structured and managed?
• Consistency – Is it consistently defined and maintained?
• Validity – Is the data valid, based on business or industry rules and standards?
What Causes Poor Data Quality?
These factors can contribute to poor data quality:
• Business rules do not exist, or there are no standards for data capture.
• Standards exist but are not enforced at the point of data capture.
• Data is entered inconsistently (incorrect spelling, use of nicknames, middle names, or aliases).
• Data entry mistakes occur (character transposition, misspellings, and so on).
• Data is integrated from systems with different data standards.
• Data quality issues are perceived as time-consuming and expensive to fix.
[Figure: Primary Sources of Data Quality Problems]
Source: The Data Warehousing Institute, Data Quality and the Bottom Line, 2002
How Is Clean Data Achieved?
Clean data is the result of a combination of efforts:
• making sure that data entered into the system is clean
• cleaning up problems after the data is accepted
Typical Data Quality Issues
The most common processes in a data quality initiative are:
• Data Analysis and Standardization
  • consistency analysis
  • standardization schemes
  • gender analysis
  • entity analysis
  • data parsing and casing
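
As a rough illustration of the "data parsing and casing" step listed above, the sketch below splits a free-form personal name into parts and applies proper casing. The function name `parse_name` and the assumed "first [middle] last" token order are hypothetical choices made for this example; they are not taken from any particular data quality product.

```python
def parse_name(raw: str) -> dict:
    """Split a free-form personal name into parts and apply proper casing.

    Assumption: names follow a "first [middle ...] last" order. Prefixes,
    suffixes, and multi-word surnames are not handled in this sketch.
    """
    tokens = raw.split()  # split() also collapses repeated whitespace
    if not tokens:
        return {"first": "", "middle": "", "last": ""}
    cased = [token.capitalize() for token in tokens]
    if len(cased) == 1:
        return {"first": cased[0], "middle": "", "last": ""}
    return {
        "first": cased[0],
        "middle": " ".join(cased[1:-1]),  # empty when there are two tokens
        "last": cased[-1],
    }


parsed = parse_name("  mark   w. CRAVER ")
print(parsed)
```

Once the name is split into separate, consistently cased fields, later steps such as matching can compare like with like.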
Typical Data Quality Issues (continued)
The most common processes in a data quality initiative are:
• Matching and Merging
  • de-duplication
  • householding
• Address Verification – against a CASS-certified database
• Geocoding – data enrichment using third-party data elements
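
Householding, listed above, groups records that belong to the same physical household. A minimal sketch follows, assuming that a household can be approximated by a normalized street address plus ZIP code. The customer records, the field names, and the `household_key` helper are all invented for illustration.

```python
from collections import defaultdict

# Hypothetical customer records, invented for this example.
customers = [
    {"id": 1, "name": "Ann Lee", "address": "12 Oak St.", "zip": "27513"},
    {"id": 2, "name": "Bob Lee", "address": "12  OAK ST", "zip": "27513"},
    {"id": 3, "name": "Cara Diaz", "address": "9 Elm Ave", "zip": "27511"},
]


def household_key(record: dict) -> tuple:
    """Build a grouping key from a lightly normalized address and ZIP code."""
    address = record["address"].lower().replace(".", "").replace(",", "")
    address = " ".join(address.split())  # collapse repeated whitespace
    return (address, record["zip"])


# Collect the ids of all customers that share the same household key.
households = defaultdict(list)
for customer in customers:
    households[household_key(customer)].append(customer["id"])

for key, member_ids in households.items():
    print(key, member_ids)
```

In practice the address would first be verified and standardized, so that variants such as "Street" and "St." also fall into the same group.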
Analysis and Standardization Example
Who is the biggest supplier?

Supplier                  $ Spent
Anderson Construction     $ 2,333.50
Briggs,Inc                $ 8,200.10
Brigs Inc.                $12,900.79
Casper Corp.              $27,191.05
Caspar Corp               $ 6,000.00
Solomon Industries        $43,150.00
The Casper Corp           $11,500.00
...                       ...
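
To see why the question is hard to answer from the raw data, the sketch below totals spending per supplier name exactly as entered. It uses only the seven rows visible in the example (the elided rows are not guessed at), and the variable names are chosen for illustration.

```python
from collections import defaultdict
from decimal import Decimal

# The seven visible rows of the example, with names exactly as entered.
invoices = [
    ("Anderson Construction", "2333.50"),
    ("Briggs,Inc", "8200.10"),
    ("Brigs Inc.", "12900.79"),
    ("Casper Corp.", "27191.05"),
    ("Caspar Corp", "6000.00"),
    ("Solomon Industries", "43150.00"),
    ("The Casper Corp", "11500.00"),
]

# Group by the raw name: every spelling variant becomes its own supplier.
raw_totals = defaultdict(Decimal)
for name, amount in invoices:
    raw_totals[name] += Decimal(amount)

naive_biggest = max(raw_totals, key=raw_totals.get)
print(naive_biggest, raw_totals[naive_biggest])
```

Grouping on the raw names reports Solomon Industries as the biggest supplier, because the Casper spending is split across three spellings.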
Standardization Scheme

Original name       Standardized name
Briggs, Inc         Briggs Inc.
Brigs Inc.          Briggs Inc.
Casper Corp.        Casper Corp.
Caspar Corp         Casper Corp.
The Casper Corp     Casper Corp.
...                 ...
[Bar chart: Supplier Spending. Axis: $ Spent (0 to 50,000). Suppliers: Casper Corp., Solomon Ind., Briggs Inc., Anderson Cons.]
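
The effect of a standardization scheme on the supplier question can be sketched as a simple lookup applied before aggregation. The dictionary below covers only the name variants visible in the example rows, and names without an entry are assumed to pass through unchanged. This is an illustration, not the way any specific tool stores its schemes.

```python
from collections import defaultdict
from decimal import Decimal

# Visible rows of the supplier example, with names as originally entered.
invoices = [
    ("Anderson Construction", "2333.50"),
    ("Briggs,Inc", "8200.10"),
    ("Brigs Inc.", "12900.79"),
    ("Casper Corp.", "27191.05"),
    ("Caspar Corp", "6000.00"),
    ("Solomon Industries", "43150.00"),
    ("The Casper Corp", "11500.00"),
]

# Standardization scheme: raw spelling -> standard name.
scheme = {
    "Briggs,Inc": "Briggs Inc.",
    "Brigs Inc.": "Briggs Inc.",
    "Casper Corp.": "Casper Corp.",
    "Caspar Corp": "Casper Corp.",
    "The Casper Corp": "Casper Corp.",
}

totals = defaultdict(Decimal)
for name, amount in invoices:
    standard_name = scheme.get(name, name)  # unmapped names pass through
    totals[standard_name] += Decimal(amount)

# Rank suppliers by total spending, largest first.
ranking = sorted(totals.items(), key=lambda item: item[1], reverse=True)
for supplier, total in ranking:
    print(f"{supplier:<24}{total:>12,.2f}")
```

With the variants merged, Casper Corp. totals 44,691.05 on these rows and moves ahead of Solomon Industries at 43,150.00.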
Data Matching Example

Operational System of Records:
Mark Carver | SAS | SAS Campus Drive | Cary, N.C.
Mark W. Craver | Mark.Craver@sas.com
Mark Craver | Systems Engineer | SAS

Data Warehouse:
01  Mark Carver | SAS | SAS Campus Drive | Cary, N.C.
02  Mark W. Craver | Mark.Craver@sas.com
03  Mark Craver | Systems Engineer | SAS
... ...
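
One way to detect that the three records above describe the same person is approximate string matching on a normalized name. The sketch below uses `difflib.SequenceMatcher` from the Python standard library. The helper names, the rule of dropping single-letter initials, and the 0.8 threshold are assumptions made for this example; production matching would normally compare several fields, not the name alone.

```python
from difflib import SequenceMatcher


def normalize_name(name: str) -> str:
    """Lowercase the name, strip punctuation, and drop single-letter initials."""
    tokens = [token.strip(".,").lower() for token in name.split()]
    return " ".join(token for token in tokens if len(token) > 1)


def same_person(a: str, b: str, threshold: float = 0.8) -> bool:
    """Treat two names as a match when their similarity ratio meets the threshold."""
    ratio = SequenceMatcher(None, normalize_name(a), normalize_name(b)).ratio()
    return ratio >= threshold


names = ["Mark Carver", "Mark W. Craver", "Mark Craver"]
for other in names[1:]:
    print(names[0], "~", other, "->", same_person(names[0], other))
```

The transposed letters in "Carver" lower the similarity only slightly, so all three spellings fall into one match group while unrelated names stay apart.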
Data Quality Process

Operational System of Records:
Mark Carver | SAS | SAS Campus Drive | Cary, N.C.
Mark W. Craver | Mark.Craver@sas.com
Mark Craver | Systems Engineer | SAS

DQ

Data Warehouse:
01  Mark Craver | Systems Engineer | SAS | SAS Campus Drive | Cary, N.C. 27513 | Mark.Craver@sas.com
... ...
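
The merge step that turns the three matched records into one warehouse record can be sketched as below. The field names and the "last non-empty value wins" rule are assumptions chosen for this example. Real tools offer configurable survivorship rules, and the ZIP code 27513 in the slide would come from address verification, which is not modeled here.

```python
# The three matched source records; the field names are hypothetical.
matched = [
    {"name": "Mark Carver", "company": "SAS",
     "street": "SAS Campus Drive", "city_state": "Cary, N.C."},
    {"name": "Mark W. Craver", "email": "Mark.Craver@sas.com"},
    {"name": "Mark Craver", "title": "Systems Engineer", "company": "SAS"},
]


def merge_records(records: list) -> dict:
    """Combine matched records into one; later non-empty values override earlier ones."""
    survivor = {}
    for record in records:
        for field, value in record.items():
            if value:  # ignore empty or missing values
                survivor[field] = value
    return survivor


golden = merge_records(matched)
for field, value in golden.items():
    print(f"{field}: {value}")
```

The merged record keeps the name from the last source record and gathers the company, address, e-mail, and title fields that were scattered across the three inputs.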