
Data Quality


rbrigance

Presentation Transcript


  1. Data Quality Class 4

  2. Goals • Discuss Project • Midterm • Statistical Process Control • Data Quality Rules

  3. Project • Information is now on web site • Final version is due on July 26 • Data will be available by end of the week • We will spend some time discussing goals today

  4. Midterm • Written exam on July 5th • Will cover: • Cost of low data quality • Dimensions of data quality • domains and mappings • SPC • Data Quality Rules

  5. Statistical Process Control • Developed by Shewhart at Bell Labs from the 1920s through the 1950s • Notions of Variation vs. Control • Important in original context of both equipment manufacture and service quality

  6. Variation • Natural variations • Defects • Errors • Mistakes • Some variations are meaningful, some are not

  7. Causes of Variation • Common, or Chance causes • minor fluctuations or differences • not necessarily important to correct • observed to form a normal distribution • Assignable, or Special causes • (self explanatory) • We expect to see the normal variations, but assignable cause variations are interesting

  8. Example • Measure railroad on-time performance • Trains are typically on time or a few minutes late • One night, the trains are all 1 hour late due to electrical problems – a special cause

  9. Statistical Control • State in which variations observed can be attributed to common causes that do not change with time

  10. Pareto Principle • In a population that contributes to a common effect, relatively few of the contributors account for the bulk of the effect • Example: code performance analysis • Can be used to direct analysis
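The code performance example on this slide can be sketched as follows. This is a minimal illustration, not something from the slides: the function name `pareto_contributors`, the 80% threshold, and the profile numbers are all assumptions made for the example.

```python
def pareto_contributors(effects, share=0.8):
    """Return the top contributors that together cover `share` of the total effect."""
    total = sum(effects.values())
    selected, running = [], 0.0
    # Walk the contributors from largest to smallest effect.
    for name, value in sorted(effects.items(), key=lambda kv: kv[1], reverse=True):
        if running >= share * total:
            break
        selected.append(name)
        running += value
    return selected

# Hypothetical profile: seconds spent in each function of a program.
profile = {"parse": 2.0, "load": 70.0, "render": 15.0, "log": 3.0, "save": 10.0}
print(pareto_contributors(profile))
```

Here two of the five functions account for 85% of the run time, so analysis effort would be directed at those two first.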

  11. Control Chart

  12. Control Chart 2 • Used to look for distinct variations from the mean • Goal: predictable behavior • Plot series of data over time • Variations are represented as distance from the mean

  13. Control Chart 3 • Center Line: can be computed as mean of variable points • Upper Control Limit: three standard deviations above center line • Lower Control Limit: three standard deviations below center line

  14. Control Chart 4 • As long as all points are between UCL and LCL, the variations are due to common causes, and the process is said to be in control, or stable • Points above UCL or below LCL are indicative of abnormal variation, and are due to special causes – the process is not in control
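The center line, UCL, and LCL described on the last two slides can be sketched in a few lines, using the railroad example from slide 8. The data values and function names are hypothetical. Using the population standard deviation of a baseline period is an assumption of this sketch; practical Shewhart charts often estimate sigma differently (for example, from moving ranges).

```python
from statistics import mean, pstdev

def control_limits(baseline):
    """Return (LCL, center line, UCL) at three standard deviations."""
    center = mean(baseline)
    sigma = pstdev(baseline)  # assumption: population std. dev. of the baseline
    return center - 3 * sigma, center, center + 3 * sigma

def special_cause_points(samples, lcl, ucl):
    """Return (index, value) pairs that fall outside the control limits."""
    return [(i, x) for i, x in enumerate(samples) if x < lcl or x > ucl]

# Hypothetical baseline: minutes late per night while the process was stable.
baseline = [0, 2, 1, 3, 0, 4, 2, 1, 3, 2]
lcl, center, ucl = control_limits(baseline)

# Later nights, including one with a one-hour delay.
new_nights = [2, 1, 60, 3]
print(special_cause_points(new_nights, lcl, ucl))
```

Only the 60-minute night falls outside the limits; the other nights are within the range expected from common causes.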

  15. Control Chart 5 • Select variables chart or attributes chart • Use data quality dimensions as guideline • Select meaningful variables to measure (i.e., stuff that will point at a diagnosable problem)

  16. Interpreting the Control Chart • Lack of stability indicates potential problem • Look for: • points outside of control limits • zone testing (clusters of points within certain standard deviation limits) • potential to split out data points into different logical data sets • Look for cycles
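The slide does not say which zone tests are meant. As one illustration, the sketch below implements a commonly cited rule: flag a point when two of the last three points lie more than two standard deviations from the center line on the same side. The choice of rule, the function name, and the sample values are assumptions.

```python
def zone_test_2_of_3(samples, center, sigma):
    """Return indices where 2 of the last 3 points are beyond 2 sigma on one side."""
    hits = []
    for i in range(2, len(samples)):
        window = samples[i - 2 : i + 1]
        above = sum(1 for x in window if x > center + 2 * sigma)
        below = sum(1 for x in window if x < center - 2 * sigma)
        if above >= 2 or below >= 2:
            hits.append(i)
    return hits

# Hypothetical measurements with center line 10 and sigma 1.
# No point exceeds the 3-sigma limits, yet the cluster near 12 is suspicious.
samples = [10.0, 10.5, 12.5, 10.0, 12.3, 9.8]
print(zone_test_2_of_3(samples, center=10.0, sigma=1.0))
```

A test like this can signal instability even when every individual point is still inside the control limits.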

  17. SPC and Data Quality • “The Information Factory” • Use data quality dimensions as guideline for investigation • Analyze the state of data as it passes through the information chain • Probing can be automated with data quality rules

  18. Inserting the Probes • Find a location in information chain that is: • nondisruptive • easy to access • easy to retool

  19. Data Quality Rules • Definitions • Proscriptive Assertions • Prescriptive Assertions • Conditional Assertions • Operational Assertions

  20. Definitions • Nulls • Domains • Mappings

  21. Proscriptive Assertions • Describe what is not allowed • Used to figure out what is wrong with data • Used for validation

  22. Prescriptive Assertions • Describe what is supposed to happen with data • Can be used for data population, extraction, transformation • Can also be used for validation

  23. Conditional Assertions • Define an assertion that must be true if a condition is true

  24. Operational Assertions • Define an action that must be taken if a condition is true
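The difference between a conditional assertion (something that must be true) and an operational assertion (something that must be done) can be sketched as below. The record layout, the helper names, and the "flag for review" action are hypothetical and only serve the illustration.

```python
def conditional_holds(record, condition, assertion):
    """A conditional assertion holds if the condition is false or the assertion is true."""
    return (not condition(record)) or assertion(record)

def apply_operational(record, condition, action):
    """An operational assertion applies the action to a copy when the condition is true."""
    return action(dict(record)) if condition(record) else record

def is_shipped(r):
    return r["status"] == "SHIPPED"

def has_ship_date(r):
    return r["ship_date"] is not None

def missing_ship_date(r):
    return is_shipped(r) and not has_ship_date(r)

def flag_for_review(r):
    r["needs_review"] = True
    return r

# Hypothetical order record that violates the conditional assertion.
order = {"status": "SHIPPED", "ship_date": None}
print(conditional_holds(order, is_shipped, has_ship_date))
flagged = apply_operational(order, missing_ship_date, flag_for_review)
print(flagged)
```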

  25. 9 Classes of Rules • 1) Null value rules • 2) Value rules • 3) Domain membership rules • 4) Domain Mappings • 5) Relation rules • 6) Table, Cross-table, and Cross-message assertions • 7) In-Process directives • 8) Operational Directives • 9) Other rules

  26. Null Value Rules • Null value specification • Define GETDATE for unavailable as “fill in date” • Null values allowed • Attribute A allowed nulls {GETDATE, U, X} • Null values not allowed • Attribute B nulls not allowed
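A minimal sketch of how the null rules on this slide might be checked. The dictionary layout and function name are assumptions; the tokens `GETDATE`, `U`, and `X` are taken from the slide, and their meanings beyond what the slide states are not assumed. In this sketch a bare `None` counts as a null that matches no declared representation.

```python
# Null representations recognized anywhere in the data set (from the slide).
ALL_NULL_TOKENS = {"GETDATE", "U", "X"}

# Per-attribute rule: which null representations are allowed.
ALLOWED_NULLS = {
    "A": {"GETDATE", "U", "X"},  # Attribute A allowed nulls {GETDATE, U, X}
    "B": set(),                  # Attribute B nulls not allowed
}

def null_violations(record):
    """Return the attributes whose value is a null that the rules do not allow."""
    violations = []
    for attr, value in record.items():
        is_null = value is None or value in ALL_NULL_TOKENS
        if is_null and value not in ALLOWED_NULLS.get(attr, set()):
            violations.append(attr)
    return violations

print(null_violations({"A": "U", "B": "X"}))
```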

  27. Value Rules • Value restriction rule Restrict GRADE: value >= ‘A’ AND value <= ‘F’ AND value != ‘E’
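The GRADE restriction translates almost directly into a predicate. The single-character check is an added assumption: a plain string comparison would otherwise accept multi-character values such as "B+".

```python
def valid_grade(value):
    """Value restriction: 'A' <= value <= 'F' and value != 'E'."""
    return (
        isinstance(value, str)
        and len(value) == 1  # assumption: grades are single letters
        and "A" <= value <= "F"
        and value != "E"
    )

for grade in ["A", "E", "F", "G"]:
    print(grade, valid_grade(grade))
```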

  28. Domain Rules • Domain Definition • Domain Membership • Domain Nonmembership • Domain Assignment

  29. Mapping Rules • Mapping definition • Mapping membership • Mapping nonmembership

  30. Relation Rules • Completeness • Exemption • Consistency • Derivation

  31. Completeness • Defines when a record is complete (i.e., what fields must be present) • IF (Orders.Total > 0.0), Complete With {Orders.Billing_Street, Orders.Billing_City, Orders.Billing_State, Orders.Billing_ZIP}

  32. Exemption • Defines which fields may be missing • IF (Orders.Item_Class != “CLOTHING”) Exempt {Orders.Color, Orders.Size}
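The completeness and exemption rules above could be checked roughly as follows. The dictionary-based record and the helper names are assumptions. The sketch also reads the exemption rule as implying that Color and Size are required for clothing items, which the slide does not state explicitly.

```python
def is_missing(record, field):
    """Treat an absent key, None, or an empty string as missing."""
    return record.get(field) in (None, "")

def completeness_violations(order):
    """IF Total > 0.0, the four billing fields must be present."""
    required = ["Billing_Street", "Billing_City", "Billing_State", "Billing_ZIP"]
    if order.get("Total", 0.0) > 0.0:
        return [f for f in required if is_missing(order, f)]
    return []

def exemption_violations(order):
    """Color and Size may be missing only when Item_Class != CLOTHING."""
    if order.get("Item_Class") != "CLOTHING":
        return []  # exempt: these fields may be missing
    # Assumption: for clothing, the exempted fields are otherwise required.
    return [f for f in ("Color", "Size") if is_missing(order, f)]

# Hypothetical order record.
order = {"Total": 59.90, "Billing_Street": "1 Main St", "Billing_City": "Springfield",
         "Billing_State": "", "Item_Class": "CLOTHING", "Color": "red"}
print(completeness_violations(order))
print(exemption_violations(order))
```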

  33. Consistency • Define a relationship between attributes based on field content • IF (Employees.title == “Staff Member”) Then (Employees.Salary >= 20000 AND Employees.Salary < 30000)

  34. Derivation • Prescriptive form of consistency rule • Details how one attribute’s value is determined based on other attributes IF (Orders.NumberOrdered > 0) Then { Orders.Total = (Orders.NumberOrdered * Orders.Price) * 1.05 }
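The consistency and derivation rules from the last two slides differ in that one tests a value and the other produces it. A minimal sketch, assuming dictionary records and hypothetical function names:

```python
import math

def consistency_ok(employee):
    """IF title == 'Staff Member' THEN 20000 <= Salary < 30000."""
    if employee.get("title") == "Staff Member":
        return 20000 <= employee["Salary"] < 30000
    return True  # the rule says nothing about other titles

def derive_total(order):
    """IF NumberOrdered > 0 THEN Total = (NumberOrdered * Price) * 1.05."""
    updated = dict(order)  # leave the input record unchanged
    if updated["NumberOrdered"] > 0:
        updated["Total"] = (updated["NumberOrdered"] * updated["Price"]) * 1.05
    return updated

print(consistency_ok({"title": "Staff Member", "Salary": 31000}))
print(derive_total({"NumberOrdered": 2, "Price": 10.0}))
```

The same derivation can double as a validation rule: recompute the total and compare it with the stored value, using a tolerance such as `math.isclose` for floating-point amounts.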

  35. Table and Cross-Table Rules • Functional Dependence • Primary Key Assertion • Foreign Key Assertion (=referential integrity)

  36. Functional Dependence • Functional Dependence between columns X and Y: • For any two records R1 and R2 in a table, • if field X of record R1 contains value x and field X of record R2 contains the same value x, then if field Y of record R1 contains the value y, then field Y of record R2 must contain the value y. • In other words, attribute Y is said to be determined by attribute X.
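The definition above can be tested by remembering the first Y value seen for each X value and flagging any later record that disagrees. The function name and the ZIP/city sample data are hypothetical.

```python
def fd_violations(rows, x, y):
    """Return indices of rows whose Y value conflicts with an earlier row sharing the same X."""
    first_seen = {}
    violations = []
    for i, row in enumerate(rows):
        key = row[x]
        if key in first_seen and first_seen[key] != row[y]:
            violations.append(i)
        else:
            first_seen.setdefault(key, row[y])
    return violations

# Hypothetical table in which zip is expected to determine city.
rows = [
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "NYC"},
    {"zip": "60601", "city": "Chicago"},
]
print(fd_violations(rows, "zip", "city"))
```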

  37. Primary Key Assertion • A set of attributes defined as a primary key must uniquely identify a record • Enforcement = testing for duplicates across defined key set
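Testing for duplicates across the defined key set can be sketched with a counter over key tuples. The function name and sample rows are assumptions.

```python
from collections import Counter

def duplicate_keys(rows, key_fields):
    """Return key tuples that occur more than once across the given key fields."""
    counts = Counter(tuple(row[f] for f in key_fields) for row in rows)
    return [key for key, n in counts.items() if n > 1]

# Hypothetical order lines keyed by (order_id, line_no).
rows = [
    {"order_id": 1, "line_no": 1},
    {"order_id": 1, "line_no": 2},
    {"order_id": 1, "line_no": 1},
]
print(duplicate_keys(rows, ["order_id", "line_no"]))
```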

  38. Foreign Key Assertion • When the values in field f of table T are chosen from the key values in field g of table S, field T.f is said to be a foreign key referencing S.g • If f is a foreign key, each of its values must exist in table S, column g (= referential integrity)
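A referential integrity check amounts to finding rows in T whose f value does not appear among the g values of S. The sketch below uses hypothetical table contents and does not treat null foreign keys specially, which a real rule set would need to decide.

```python
def orphaned_rows(t_rows, f, s_rows, g):
    """Return rows of T whose value in field f does not exist in field g of S."""
    s_keys = {row[g] for row in s_rows}
    return [row for row in t_rows if row[f] not in s_keys]

# Hypothetical tables: Orders.customer_id should reference Customers.id.
customers = [{"id": 1}, {"id": 2}]
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 3},
]
print(orphaned_rows(orders, "customer_id", customers, "id"))
```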

  39. In-process Directives • Definition directives (labeling information chain members) • Measurement directives • Trigger directives

  40. Operational Directives • Transformation • Update

  41. Other Rules • Approximate Searching rules • Approximate Matching rules
