430 likes | 491 Views
Data Quality. Class 3. Goals. Dimensions of Data Quality Enterprise Reference Data Data Parsing. Dimensions of Data Quality. Poor data quality is similar to obscenity- It seems as if there are no real ways to measure it, but you know it when you see it!
E N D
Data Quality Class 3
Goals • Dimensions of Data Quality • Enterprise Reference Data • Data Parsing
Dimensions of Data Quality • Poor data quality is similar to obscenity- • It seems as if there are no real ways to measure it, but you know it when you see it! • In reality, data quality can be measured • The frame of reference for measurement is different
Dimensions of Data Quality 2 • Data Models • Data Values • Data Presentation • Data Policy
Example: Sales Database • Sales and marketing database • Current customers • Sales leads • Name, address, contact data • For current customers, sales data
Data Quality of Data Models • Clarity of definition • Comprehensiveness • Flexibility • Robustness
Data Quality of Data Models 2 • Essentialness • Attribute granularity • Precision of domains • Homogeneity
Data Quality of Data Models 3 • Naturalness • Identifiability • Obtainability • Relevance
Data Quality of Data Models 4 • Simplicity • Semantic Consistency • Structural Consistency
Data Quality of Data Values • Accuracy • Null values • Completeness • Consistency • Currency
Accuracy • Agreement with established sources • Database of record • Other corroborative sources
Null Values • Null vs. Missing • Unavailable • Not applicable • No value • Not classified • Truly null
Completeness • Mandatory attributes require values • Optional attributes may hold values (when and how?) • Inapplicable attributes may not have a value (also when and how?) • Completeness constraints
Consistency • Are values in one set consistent with values in another set? • Consistency relations between attributes in the same table • Consistency assertions across columns • Consistency relationships between tables
Currency/Timeliness • What data is current? • How is it kept up-to-date? • Time expectations for accessibility to data
Data Quality of Data Presentation • Appropriateness • Correct Interpretation • Flexibility • Format Precision
Data Quality of Data Presentation 2 • Portability • Representation Consistency • Representation of Null Values
Data Quality of Data Policy • Access • Metadata • Privacy • Fault-tolerance • Security
Reference Data • Relatively static • Referred to from within many tables • Shared data • Examples: • Product catalog • Security type classification • Currency codes
Reference Data 2 • Metadata • Data Domains • Mappings between those domains
Domains • A data domain is a subclassed data type • Domains can be described using enumerations • Example: states, cities, product codes • Domains can be described using functions • Example: formatted trouble ticket ids such as CC-NNNN
Domain Membership • Data attributes can be affiliated with data domains • Test a value to make sure it is valid within a domain • For enumerated domains, a lookup works • For generated/described domains, check to see if the value could be generated by function
Domains as Metadata • The existence of a data domain is metadata • Keep track of which tables have attributes making use of which domains • Manage domains as enterprise reference data
Domain Tables • Domains are sets of values • Maintain in two tables: • Domain name table (domain_name, domain_id) • Domain value table (domain_id, value) • Example: • (2, “STATES”) • {(2, “NY”), (2, “MA”), (2, “CT”), (2, “VA”)…}
Mappings • Map values in one domain to values in other domains • 1-1, 1-Many, Many-1, Many-Many • Represents relations from one set to another set • 1-1, Many-1 are functions
Mappings as Metadata • Semi-static mappings are also metadata • Keep track of which tables refer to mappings • Enterprise reference data
Mapping Tables • Maintain mapping information in two tables: • Mapping name table (from_domain, to_domain, mapping_id, mapping name) • Mapping table (mapping_id, source_value, target_value)
Example: • (“Currency”, “Country”, 65, “CurrencyToCountry”) • (65, “USD”, “US)
Tables • Inference of data fields from source data • In our case, this is straightforward based on SGML tags in the data • Determine table structure in context of use of metadata and reference data • In other words, try to maintain normal form
Keys • Primary key: must be unique across all values in table • Primary key may be assigned based on internal increasing value, or maybe extant in the data • Foreign key: relates values in one table to values in another table • Foreign keys must exist in target table (=referential integrity)
Data Profiling and Parsing • Goals: • Identify data domains • Identify mappings • Identify candidate keys
Data Column Profiling • For each table • For each column • Type inference • Subclassed Type inference • Count the number of distinct values • Enumerate the distinct values • Sort the distinct values • Look for patterns of usage • Document discoveries
Data Table Profiling • For each Table • Identify candidate keys • Identify data domains • Identify data mappings • Look for patterns
Multiple Table Profiling • Identify foreign key relationships • Validate referential integrity • Look for patterns
Data Parsing • 2 kinds of data domains • Enumerated, consisting of a predefined set of values • Inferred/Functional, consisting of values that are validated based on a set of rules
Type Inference • Data value sets all conform to a data type • All values within a column must belong to the same type • Propose data type by a series of inferences • Initial assumption is that all values belong to strictest type • Test for violations to type restriction • As a violation is discovered, loosen the restriction and test again for violations • Measure conformance at each level
Subclassed Type Inference • Once type has been inferred, look for additional restrictions within type • Example: • Integer vs. Integer within a range (0..100) • For integers and real values, look for ranges • For character strings, look for length, context, character patterns
String Parsing • Classify characters into groups • Alphabetic • Numeric/Digits • Punctuation (this can be further refined) • Transform all values within a column set into their corresponding pattern • Example: 789-23-1100 would change to • DDDPDDPDDDD
String Parsing • Given a set of representative strings, perform column analysis again • Look for high frequency counts for specific patterns • This will provide proposed functional domains • We can characterize certain domains by format (e.g., telephone numbers, SSNs, Tax Ids, UPC codes, etc.)
Word Parsing • More advanced technique to explore domain types • Classify word tokens in terms of predefined characterization • Example: Name words, business words, transaction words, connector words, titles, etc.
Primary Key Discovery • Iterative process to find candidate keys • A candidate key is one or more attributes whose values, when composed, can uniquely locate a record
Primary Key Discovery 2 • For I = 1, number of attributes • For each set S composed of I attributes do • Are the composed values unique across the tables? • If so, add to candidate key set
Next Assignment • Type Inferencer • Propose data types for analyzed columns • Format Parser • Transform strings into pattern strings • Propose format for a column • Catalog discovered named formats (telno, SSN, etc.) • More details on posting on web site