1 / 43

Data Quality

Data Quality. Class 3. Goals. Dimensions of Data Quality Enterprise Reference Data Data Parsing. Dimensions of Data Quality. Poor data quality is similar to obscenity- It seems as if there are no real ways to measure it, but you know it when you see it!

larios
Download Presentation

Data Quality

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Data Quality Class 3

  2. Goals • Dimensions of Data Quality • Enterprise Reference Data • Data Parsing

  3. Dimensions of Data Quality • Poor data quality is similar to obscenity- • It seems as if there are no real ways to measure it, but you know it when you see it! • In reality, data quality can be measured • The frame of reference for measurement is different

  4. Dimensions of Data Quality 2 • Data Models • Data Values • Data Presentation • Data Policy

  5. Example: Sales Database • Sales and marketing database • Current customers • Sales leads • Name, address, contact data • For current customers, sales data

  6. Data Quality of Data Models • Clarity of definition • Comprehensiveness • Flexibility • Robustness

  7. Data Quality of Data Models 2 • Essentialness • Attribute granularity • Precision of domains • Homogeneity

  8. Data Quality of Data Models 3 • Naturalness • Identifiability • Obtainability • Relevance

  9. Data Quality of Data Models 4 • Simplicity • Semantic Consistency • Structural Consistency

  10. Data Quality of Data Values • Accuracy • Null values • Completeness • Consistency • Currency

  11. Accuracy • Agreement with established sources • Database of record • Other corroborative sources

  12. Null Values • Null vs. Missing • Unavailable • Not applicable • No value • Not classified • Truly null

  13. Completeness • Mandatory attributes require values • Optional attributes may hold values (when and how?) • Inapplicable attributes may not have a value (also when and how?) • Completeness constraints

  14. Consistency • Are values in one set consistent with values in another set? • Consistency relations between attributes in the same table • Consistency assertions across columns • Consistency relationships between tables

  15. Currency/Timeliness • What data is current? • How is it kept up-to-date? • Time expectations for accessibility to data

  16. Data Quality of Data Presentation • Appropriateness • Correct Interpretation • Flexibility • Format Precision

  17. Data Quality of Data Presentation 2 • Portability • Representation Consistency • Representation of Null Values

  18. Data Quality of Data Policy • Access • Metadata • Privacy • Fault-tolerance • Security

  19. Reference Data • Relatively static • Referred to from within many tables • Shared data • Examples: • Product catalog • Security type classification • Currency codes

  20. Reference Data 2 • Metadata • Data Domains • Mappings between those domains

  21. Domains • A data domain is a subclassed data type • Domains can be described using enumerations • Example: states, cities, product codes • Domains can be described using functions • Example: formatted trouble ticket ids such as CC-NNNN

  22. Domain Membership • Data attributes can be affiliated with data domains • Test a value to make sure it is valid within a domain • For enumerated domains, a lookup works • For generated/described domains, check to see if the value could be generated by function

  23. Domains as Metadata • The existence of a data domain is metadata • Keep track of which tables have attributes making use of which domains • Manage domains as enterprise reference data

  24. Domain Tables • Domains are sets of values • Maintain in two tables: • Domain name table (domain_name, domain_id) • Domain value table (domain_id, value) • Example: • (2, “STATES”) • {(2, “NY”), (2, “MA”), (2, “CT”), (2, “VA”)…}

  25. Mappings • Map values in one domain to values in other domains • 1-1, 1-Many, Many-1, Many-Many • Represents relations from one set to another set • 1-1, Many-1 are functions

  26. Mappings as Metadata • Semi-static mappings are also metadata • Keep track of which tables refer to mappings • Enterprise reference data

  27. Mapping Tables • Maintain mapping information in two tables: • Mapping name table (from_domain, to_domain, mapping_id, mapping name) • Mapping table (mapping_id, source_value, target_value)

  28. Example: • (“Currency”, “Country”, 65, “CurrencyToCountry”) • (65, “USD”, “US)

  29. Tables • Inference of data fields from source data • In our case, this is straightforward based on SGML tags in the data • Determine table structure in context of use of metadata and reference data • In other words, try to maintain normal form

  30. Keys • Primary key: must be unique across all values in table • Primary key may be assigned based on internal increasing value, or maybe extant in the data • Foreign key: relates values in one table to values in another table • Foreign keys must exist in target table (=referential integrity)

  31. Data Profiling and Parsing • Goals: • Identify data domains • Identify mappings • Identify candidate keys

  32. Data Column Profiling • For each table • For each column • Type inference • Subclassed Type inference • Count the number of distinct values • Enumerate the distinct values • Sort the distinct values • Look for patterns of usage • Document discoveries

  33. Data Table Profiling • For each Table • Identify candidate keys • Identify data domains • Identify data mappings • Look for patterns

  34. Multiple Table Profiling • Identify foreign key relationships • Validate referential integrity • Look for patterns

  35. Data Parsing • 2 kinds of data domains • Enumerated, consisting of a predefined set of values • Inferred/Functional, consisting of values that are validated based on a set of rules

  36. Type Inference • Data value sets all conform to a data type • All values within a column must belong to the same type • Propose data type by a series of inferences • Initial assumption is that all values belong to strictest type • Test for violations to type restriction • As a violation is discovered, loosen the restriction and test again for violations • Measure conformance at each level

  37. Subclassed Type Inference • Once type has been inferred, look for additional restrictions within type • Example: • Integer vs. Integer within a range (0..100) • For integers and real values, look for ranges • For character strings, look for length, context, character patterns

  38. String Parsing • Classify characters into groups • Alphabetic • Numeric/Digits • Punctuation (this can be further refined) • Transform all values within a column set into their corresponding pattern • Example: 789-23-1100 would change to • DDDPDDPDDDD

  39. String Parsing • Given a set of representative strings, perform column analysis again • Look for high frequency counts for specific patterns • This will provide proposed functional domains • We can characterize certain domains by format (e.g., telephone numbers, SSNs, Tax Ids, UPC codes, etc.)

  40. Word Parsing • More advanced technique to explore domain types • Classify word tokens in terms of predefined characterization • Example: Name words, business words, transaction words, connector words, titles, etc.

  41. Primary Key Discovery • Iterative process to find candidate keys • A candidate key is one or more attributes whose values, when composed, can uniquely locate a record

  42. Primary Key Discovery 2 • For I = 1, number of attributes • For each set S composed of I attributes do • Are the composed values unique across the tables? • If so, add to candidate key set

  43. Next Assignment • Type Inferencer • Propose data types for analyzed columns • Format Parser • Transform strings into pattern strings • Propose format for a column • Catalog discovered named formats (telno, SSN, etc.) • More details on posting on web site

More Related