1 / 36

Da ta Quality Simon Razniewski

Da ta Quality Simon Razniewski. In collaboration with Werner Nutt Free University of Bozen-Bolzano. Introduction. Simon Razniewski PhD Student at the FUB Data quality Data completeness Werner Nutt Professor in Computer Science at the FUB Focus in research and teaching:

Download Presentation

Da ta Quality Simon Razniewski

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Data QualitySimon Razniewski In collaboration with Werner Nutt Free University of Bozen-Bolzano

  2. Introduction • Simon RazniewskiPhD Student at the FUB • Data quality • Data completeness • Werner NuttProfessor in Computer Science at the FUBFocus in research and teaching: • Data management, data modelling • Data integration • Incomplete information Data Quality

  3. Why data quality? • Data are the basis for (scientific) conclusions about the world • Conclusions only as good as the data they are based on • Low-qualitydatalow-qualityconclusions Data Quality

  4. Some effects of erroneous data are funny Man invited for pre-natal check Data Quality

  5. Some data errors are long-living Spinach contains much iron 100g ofspinachcontain 35mg ofiron 100g spinachcontainonly 3,5mg ofiron Gustav v. Bunge 1890 Data Quality

  6. Some data errors are mysterious Student records in in Georgia (USA), 2009 19.000 students leave their school to change to another … but arrive nowhere ? ? Data Quality

  7. Overview • What are data used for? • Data model the real world • What can go wrong? • Wrong, outdated, missing data • What can one do for • Correctness • Currency • Completeness of data? Data Quality

  8. Data model the real world We analyze the data (instead of the real world) and draw (scientific) conclusions  data determines our conclusions Real world: Students, teachers, classes Database: Tables HOB Bozen Class 2A Anna Diego Paul Maria Data Quality

  9. Questions about students • „How many students are there in the class 2A of the HOB Bozen?“ • „What is the average age of the students of this class?“ • „How many students play an instrument?“ Data Quality

  10. Table „Students“ Whatistheaverageageofthestudentsof theclass2A ofthe HOB Bozen? Data Quality

  11. Many things can go wrong Whatistheaverageageofthestudentsof theclass2A ofthe HOB Bozen? Data Quality

  12. Typos date of birth of Anna school of Paul Data Quality

  13. Factual errors school of Diego (“HOB Meran“ instead of “HOB Bozen“) Data Quality

  14. Outdated entries class of Anna (“1A“ instead of “2A“) Data Quality

  15. Missing values date of birth of Diego (“Null value“) Data Quality

  16. Missing records the record about Maria is missing Data Quality

  17. Missingconcepts no possibility to store information about music instruments Data Quality

  18. Whatcanbedone? There is a distinction between different dimensions of data quality The most important ones are • CorrectnessDoes the data match the real world? • TimelinessIs the data up-to-date? • CompletenessAre all aspects of the domain of interest captured? Further: Comprehensibility, accessability, … Datenqualität

  19. Dimension 1: Correctness IT-techniques: • Detecting typos or statistical outliersstudents born in 1959 • Recognizing duplicatesMohammad Al Zaïn = Muhamad Alzain • Rules for logical consistencyno student can visit two schools at the same time Organisation: Special treatment of core data: Master data managementFor example: students, teachers, schools Datenqualität

  20. Dimension 2: Timeliness • By workflow organisation:Bind workflows onto the IT system Timeliness is guaranteedExample: an enrolment is only valid if it is recorded in the database • Trough data about the currency of the data (metadata) Timeliness can be estimatedExample: “All dropouts until 31th of March are recorded“ Datenqualität

  21. Dimension 3/1: Completenessofvalues • Can be enforced by the IT systemRisk: nonsensical entries • Alternative solution: enforce input of less values • Record reasons for missing values E.g. “Not applicable” or “Unknown” Datenqualität

  22. Dimension 3/3: Conceptualcompleteness • Solid design is important, but not everything can be foreseen • Flexible IT: Schema changes if necessary • Space for comments, additional information • Otherwise: Other fields will be abusedExample: Gasworks in the USA Warning of dogs for meter-readers … later they send bills Datenqualität

  23. Dimension 3/2: Table completeness • Events are completely recorded, if they are bound to the IT system Example: Sales in a supermarket • In general, this binding is not possible  only parts of the database tables are complete • But: Completeness is only necessary for specific uses Example: school statistics from ASTAT  Research Datenqualität

  24. Partial table completeness • Common scenario: Wehave • Some, but not all datacomplete • Questions (‘‘queries“) overdata • Problems: • Do wehavethedatathatisneededtoanswerthequeries? If not: • Whatmoredata do weneed? Data Quality

  25. An (intuitive) example • Supposewehavedataabout all studentsfrom • Italianschools • German schools, exceptoftheprimaryschool‘‘Andreas Hofer“ • Ladin schools, exceptofthe high school“Gherdëna“ • Can wecorrectlyanswerquestionsabouttheitalianstudents in South Tyrol? • Yes, becausewehave all dataaboutstudentsfrom italianschools Data Quality

  26. An (intuitive) example (2) • Supposewehavedataabout all studentsfrom • Italianschools • German schools, exceptoftheprimaryschool‘‘Andreas Hofer“ • Ladin schools, exceptofthe high school“Gherdëna“ • Can weanswerquestionsaboutthehigh-schoolstudents in South Tyrol? • No, becausedatafromthe“Gherdëna“ high schoolismissing • Wecouldbugthemtosubmittheirdata (but maybethesecretaryis on holiday) • Wecouldasksomeoneelseforthedata, e.g., thelocaldistrictadministration Data Quality

  27. Ourresearch • How can one describethat data is complete to a certain extent? • How can one find out whether the data one has is sufficient for a certain use? • How can one find out which data is necessary to serve a certain use? Data Quality

  28. Formal example How many students attend an Italian school? SELECT count(*) FROM student, school WHERE student.school = school.name AND school.language = ‘italian‘; Suppose, we have all Italian students. Can we answer this query completely? Data Quality

  29. Howcanweformalizetablecompleteness? “Wehave all studentsfromitalianschools“ • Weimagine: an idealdatabasethatcontainscompleteinformationabouttheworld • Completenessstatementsrefertothisideal database: => All ideal studentsfromItalianschools occuramongourreal students Data Quality

  30. Howcanweassert(partial) completenessoftables? “Wehave all studentsfromitalianschools“ Table completenessassertion: real.studentCONTAINS( SELECT ideal.student.* FROM ideal.student, ideal.school WHERE ideal.student.school = ideal.school.name AND ideal.school.language = ‘Italian‘) Table completenessassertionsconstitute a logicaltheoryaboutrealandidealdatabase Data Quality

  31. Whatdoesitmeanthatourqueryiscomplete? Considertwoversions: “Real query““Ideal query“ SELECT count(*) SELECT count(*) FROM real.student, real.school FROM ideal.student, ideal.school WHERE WHERE real.student.school = real.school.nameideal.student.school = ideal.school.name AND AND real.school.language = ‘italian‘; ideal.school.language = ‘italian‘; Ourqueryiscompleteiftherealandtheidealqueryreturnthe same results (Can beexpressed in logic, too) => Reasoning Data Quality

  32. Our results so far • Formalization • General reasoning procedures for • Single block SQL queries • With comparisons • Group By • Aggregate functions min, max, count, sum • Complexity analysis (sometimes high!) • Architecture for reasoning system • “Inverse reasoning” (see later slide) This is a start, many things are still missing Data Quality

  33. Reasoning with schema information • To draw interesting inferences, we need to take into account • Keys • Foreign keys • Finite domains ~> Reasoning becomes more complicated (Current research) Data Quality

  34. Inverse reasoning • So far: Given: Assertions about table completeness Question: Can query Q be answered completely? • Also interesting: Given: query Q Question: which are the minimal completeness assertions that assure completeness of Q? • Can be answered by applying our inference methods backwards Data Quality

  35. Perspective: Probabilisticcompletenessmanagement • Our theory so far: Boolean statements: complete/not complete • In practice, it is often sufficient to know “With probability < p, we make an error < ε“ • Probabilistic assertions: “With 90% probability, we are not missing more than 5 students“ => Probabilistic inferences Data Quality

  36. Conclusion • Data quality has several dimensions • Correctness, timeliness, completeness • Our current interest • How can one describe which data are complete • How can one find out which queries can be answered completely? • If not, which additional data is needed? • Perspective: Probabilistic completeness management Data Quality

More Related