Data Quality David Loshin
Course Structure • Overview of Data Quality • Data Ownership and Data Roles • Cost Analysis of Poor Data Quality • Dimensions of Data Quality • Data models, Data values, Presentation • Data Extraction and Transformation • ETL, Data transformation
Course Structure (2) • Data Quality Improvement • Metadata and Enterprise Reference Data • Domains and Mappings • Data Quality Rules • Definition of Rules • Discovery of Rules
Course Structure (3) • Using Data Quality Rules • Message Transformation and Routing • Data warehouse validation • GUI Generation • Data Warehouse Population
Course Structure (4) • Data Cleansing • Data Parsing • Standardization • Linkage • Duplicate Elimination • Approximate Searching • Scalability Issues
Project • Build a data quality tool • rule definition • data parsing • data element standardization • record linkage • Apply the tool in characterizing real-world data (I’ll supply some, don’t worry ;-)
Some Examples • Frequent Flyer Miles and Long-Distance Service • Corporate Credit Card • Direct Marketing Event • CD Club Scam
What is Data? • Working definitions: • Data: arbitrary values (with their own representation) • Information: data within a context • Knowledge: Understanding of information within its context • Metadata: data about data
Who Owns Data? • Important question, because the answers indicate where responsibility for data quality lies • Data quality can be difficult to effect because of complicating notions • Data processing as an “Information Factory” • Actors in the information factory and their roles
Actors and Their Roles • Supplier • Acquirer • Creator • Processor • Packager • Delivery Agent • Consumer • Middle Manager • Senior Manager • Decision-maker
Ownership Responsibilities • Definition of data • Authorization and security • User support • Data packaging and delivery • Maintenance • Data quality • Management of business rules • Management of metadata • Standards management • Supplier management
Ownership Paradigms • Creator • Consumer • Compiler • Enterprise • Funder • Decoder • Packager • Reader • Subject • Purchaser • Everyone
Complicating Notions • Ownership is affected by the value of data • Privacy • Turf • Fear • Bureaucracy
The Data Ownership Policy • Order of enforcement • Identify stakeholders • Identify data sets • Allocation of ownership • Ownership roles and responsibilities • Dispute Resolution
The Data Ownership Policy (2) • Maintain a metadata database for data ownership • Parties table • Data set table • Roles and responsibilities • Policies (e.g., dispute resolution, communication)
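The metadata database for data ownership could be sketched roughly as follows. This is a minimal illustration using Python's built-in sqlite3 module; the table names, column names, and sample rows are assumptions made for the example and do not come from the slides.

```python
import sqlite3

# Hypothetical schema: all table and column names are illustrative only.
SCHEMA = """
CREATE TABLE party (
    party_id INTEGER PRIMARY KEY,
    name     TEXT NOT NULL
);
CREATE TABLE data_set (
    data_set_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);
CREATE TABLE ownership_role (
    party_id       INTEGER NOT NULL REFERENCES party(party_id),
    data_set_id    INTEGER NOT NULL REFERENCES data_set(data_set_id),
    role           TEXT NOT NULL,   -- e.g. 'Steward', 'Custodian'
    responsibility TEXT NOT NULL,   -- e.g. 'Data quality', 'Maintenance'
    PRIMARY KEY (party_id, data_set_id, role)
);
CREATE TABLE policy (
    policy_id   INTEGER PRIMARY KEY,
    data_set_id INTEGER REFERENCES data_set(data_set_id),
    kind        TEXT NOT NULL,      -- e.g. 'dispute resolution'
    description TEXT NOT NULL
);
"""


def build_ownership_db():
    """Create an in-memory ownership metadata database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def owners_of(conn, data_set_name):
    """Return (party, role, responsibility) rows for one data set."""
    return conn.execute(
        """
        SELECT p.name, r.role, r.responsibility
        FROM ownership_role AS r
        JOIN party AS p ON p.party_id = r.party_id
        JOIN data_set AS d ON d.data_set_id = r.data_set_id
        WHERE d.name = ?
        ORDER BY p.name, r.role
        """,
        (data_set_name,),
    ).fetchall()


# Made-up sample data, for illustration only.
conn = build_ownership_db()
conn.execute("INSERT INTO party VALUES (1, 'Billing Department')")
conn.execute("INSERT INTO party VALUES (2, 'IT Operations')")
conn.execute("INSERT INTO data_set VALUES (1, 'customer_accounts')")
conn.execute("INSERT INTO ownership_role VALUES (1, 1, 'Steward', 'Data quality')")
conn.execute("INSERT INTO ownership_role VALUES (2, 1, 'Custodian', 'Maintenance')")
```

Calling owners_of(conn, 'customer_accounts') then lists who holds which role for that data set, which is the kind of lookup a dispute-resolution policy would likely start from.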
Ownership Roles • CIO • CKO • Trustee • Policy Manager • Registrar • Steward • Custodian • Data Administrator • Security Administrator • Information Flow • Information Processing • Application Development • Data Provider • Data Consumer
The Information Factory • Information processing can be broken down into a graph • Each node in the graph is a data producer, data consumer, or both • The edges represent communication paths
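One way to picture the information factory as a graph is the small sketch below. The class name, method names, and the example chain of stages are assumptions for illustration; the slides only state that nodes produce or consume data and that edges are communication paths.

```python
from collections import defaultdict


class InformationFactory:
    """Directed graph: nodes are processing stages, edges are communication paths."""

    def __init__(self):
        self.nodes = set()
        self.out_edges = defaultdict(set)  # stage -> stages it sends data to
        self.in_edges = defaultdict(set)   # stage -> stages it receives data from

    def add_channel(self, source, target):
        """Record a communication path from source to target."""
        self.nodes.update((source, target))
        self.out_edges[source].add(target)
        self.in_edges[target].add(source)

    def role(self, node):
        """Classify a node as 'producer', 'consumer', 'both', or 'isolated'."""
        produces = bool(self.out_edges.get(node))
        consumes = bool(self.in_edges.get(node))
        if produces and consumes:
            return "both"
        if produces:
            return "producer"
        if consumes:
            return "consumer"
        return "isolated"


# Hypothetical chain of stages, for illustration only.
factory = InformationFactory()
factory.add_channel("Supplier", "Acquirer")
factory.add_channel("Acquirer", "Processor")
factory.add_channel("Processor", "Consumer")
```

In this example the Supplier only produces, the Consumer only consumes, and the Acquirer and Processor do both, so factory.role("Processor") returns "both".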
What is Data Quality? • “Fitness for Use” • Different rules for different data sets • Includes, but is more than: • Data cleansing • Standardization • Deduplication • Merge-purge
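"Fitness for use" with different rules per data set can be illustrated with a small sketch in which the same record is acceptable for one use and not for another. The two uses, the field names, and the validation patterns (a US-style ZIP code and a deliberately loose email shape) are all assumptions made for this example.

```python
import re

# Hypothetical rule sets: one per intended use of the data.
# Each rule maps a field name to a predicate that returns True when the value is acceptable.
RULES = {
    "direct_mail": {
        "street": lambda v: bool(v and v.strip()),
        # Assumes US-style ZIP codes (5 digits, optional +4).
        "postal_code": lambda v: bool(re.fullmatch(r"\d{5}(-\d{4})?", v or "")),
    },
    "email_campaign": {
        # Deliberately loose shape check, not a full email validator.
        "email": lambda v: bool(re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", v or "")),
    },
}


def violations(record, use):
    """Return the sorted list of fields that fail the rules for the given use."""
    return sorted(
        field
        for field, is_ok in RULES[use].items()
        if not is_ok(record.get(field))
    )


# Made-up record: complete postal address, no email address.
record = {"street": "12 Main St", "postal_code": "10001", "email": ""}
```

Here violations(record, "direct_mail") is empty while violations(record, "email_campaign") reports the email field, so the same record is fit for one use and unfit for the other.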
Lather, Rinse, Repeat • Data quality is a process: • 1. Assess the current state of the quality of data • 2. Determine the area that most needs improvement • 3. Determine success criteria • 4. Implement the improvement • 5. Measure against the success threshold • 6. If success: go to step 2
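The improvement cycle can be sketched as a loop. This is a rough illustration only: the function name, the integer percentage scores, the "largest gap to target" selection, and the max_rounds cap are assumptions. The slides do not say what happens when an improvement misses its threshold; this sketch simply goes around again.

```python
def improvement_cycle(scores, targets, improve, max_rounds=10):
    """Run the assess / select / improve / measure loop.

    scores:  area -> current quality score in percent (the initial assessment)
    targets: area -> success threshold in percent (the success criteria)
    improve: callable (area, score) -> new score (the improvement project)
    Returns (history, final_scores); history holds (area, succeeded) pairs.
    """
    scores = dict(scores)  # work on a copy so the caller's assessment is untouched
    history = []
    for _ in range(max_rounds):
        # Select the area that falls furthest below its target.
        gaps = {a: targets[a] - s for a, s in scores.items() if s < targets[a]}
        if not gaps:
            break  # every area meets its success criteria
        area = max(gaps, key=gaps.get)
        # Implement the improvement, then measure against the threshold.
        scores[area] = improve(area, scores[area])
        history.append((area, scores[area] >= targets[area]))
        # Assumption: on success or failure, loop back and select again.
    return history, scores


# Made-up numbers: each project is assumed to add 15 points, capped at 100.
history, final = improvement_cycle(
    scores={"addresses": 70, "phone_numbers": 90},
    targets={"addresses": 95, "phone_numbers": 95},
    improve=lambda area, score: min(100, score + 15),
)
```

With these numbers the address data is selected twice before it reaches its threshold, after which the phone numbers are addressed.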
Data Quality is Hard to Do • No one wants to admit mistakes • Denial of responsibility • Lack of understanding • “Dirty work” • Lack of recognition
Steps to Data Quality • Training • Data ownership policy • Economic model of data quality • Current state assessment and requirements analysis • Project selection and implementation