Data Quality Issues: ... by Example
• Wolfgang Lehner
• Dresden University of Technology
• Faculty of Computer Science
• Database Technology Group (SyA)
• Dagstuhl Seminar on Data Quality
… my personal background
• Statistical and Scientific Databases (SSDB)
  • very long tradition (1st Berkeley Workshop in 1981)
• modeling perspective
  • come up with statistical data models
  • capture semantics of micro and macro data
• processing perspective
  • "workflow": collection, preprocessing, analysis
  • provide database technology for efficient data analysis
• a small subset of SSDB techniques is used by data warehouse systems …
Example 1: Processing Perspective of DQ
• GfK Group: monitoring the global markets for consumer goods
• Non-Food Tracking: retail panel-based marketing information for manufacturers and retailers in consumer technology industries
• periodical monitoring
  • Which are the top-ranking brands of mobile phones at present?
  • How large is the market share of digital television sets?
  • How much do consumers spend on computers on average?
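[Figure: working areas between retailers and clients. Retailers deliver data in local terms per shop; clients receive global reporting terms. Working-area labels: Data-IN, Data Preparation, MDM, IDAS, Data Warehouse (DWH: extrapolation, reports). Tagline: "Creating value through knowledge".]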
Data & Control Flow from 10,000 feet
[Figure: data flow and control flow across the production systems. Labels, in the order they appear on the slide: RawTrackingData, DataRelease, DataOrder, Data Orders, Reporting projects, Base projects, Identified data, DWH QC, Korrdat, DWH QC project, Target, Actual, DWH Reporting, „Good“ data, DWH2Tools, GIMWinCosSeparation, WebTAS, MDM items & shops, DHWSuite, DWH Explorer, DWH Extrapolation/Projection, DWH Builder, IDAS InfoSystem, Preprocessing, Fact Tool, IDAS Output Pool, DWHLoader, IDAS2DWH, IDAS2MF. Legend: Dataflow, Controlflow.]
Issues for Data Quality Discussion
• Observation
  • subsequent production steps depend on the success and DQ of the result set produced by preceding steps (simplest form: count(*) > x; in most cases expert knowledge is needed)
  • the data context (more technically: the primary key) changes, so outliers found at the end are very hard to trace back to the incoming data object
  • DQ determines the production process, e.g. the TODO lists of >100 workers
• DQ is a key factor for production optimization
  • impact on demand-driven production
    • example: a report of cell phone sales is needed by the end of next week
    • report quality depends on the type of customer: premium customers get higher data quality
    • identify important data providers, prioritize single jobs/data orders/…
    • propagate the individual deadline to the participating working steps
  • impact on error correction
    • more raw data material: manual article identification (extremely expensive)
    • better raw data material: manually "correct" data, re-order tracking data
• data lineage is a critical issue (fine-granular data quality causes data explosion!)
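The simplest form of the dependency between production steps, the count(*) > x check, could be sketched roughly as below. This is only an illustration, not the GfK production logic: the names `Step`, `run_pipeline`, and `min_rows`, the step names, and the sample rows are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

Row = Dict[str, object]


@dataclass
class Step:
    """One production step plus its DQ gate (hypothetical structure)."""
    name: str
    run: Callable[[List[Row]], List[Row]]
    min_rows: int  # the "x" in count(*) > x


def run_pipeline(rows: List[Row], steps: List[Step]) -> Tuple[List[str], Optional[str]]:
    """Run steps in order; stop as soon as a step's result fails its gate.

    Returns the names of the steps that passed and the name of the step
    whose result failed the gate (None if every step passed).
    """
    passed: List[str] = []
    for step in steps:
        rows = step.run(rows)
        if not len(rows) > step.min_rows:  # count(*) > x
            return passed, step.name
        passed.append(step.name)
    return passed, None


def identify(rows: List[Row]) -> List[Row]:
    # Keep only rows whose article could be identified.
    return [r for r in rows if r["item"] is not None]


def extrapolate(rows: List[Row]) -> List[Row]:
    # Toy extrapolation with an invented factor of 10.
    return [dict(r, units=r["units"] * 10) for r in rows]


# Invented sample data: one of three rows has no identified article.
raw = [
    {"shop": "A", "item": "phone-1", "units": 3},
    {"shop": "A", "item": None, "units": 1},
    {"shop": "B", "item": "tv-7", "units": 2},
]

steps = [Step("identify", identify, min_rows=2), Step("extrapolate", extrapolate, min_rows=0)]
completed, failed = run_pipeline(raw, steps)
print(completed, failed)
```

Here only two rows survive identification, so the gate 2 > 2 fails and the extrapolation step never runs. In practice the gate would usually encode expert knowledge rather than a plain row count.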
Example 2: Integration Perspective of DQ
• Perform analysis across different data sources
  • What are similar sub-sequences of amino acid residues?
  • What are stable/typical conformations?
• Current situation
  • independent (mostly non-relational) data sources
  • no integrity constraints within or across the different data sources
• DQ is a key factor for integrated data access
  • problems beyond "regular" integration issues
• data sets are growing …
Example: Protein Structure
• the happy day scenario: Protein Code 1:N Atom Positions
• the real world scenario: conflicting 3D coordinates for protein atoms (protein code 1AL3, atoms 1478 to 1492)
Issues for Data Quality Discussion
• Observation
  • in theory: "nice" entities and relationships
  • in practice: many exceptions due to the experimental nature of the data collection process
  • have to relax the schema constraints → bad "data schema quality"
• Data Schema Quality
  • impact on data analysis: the statistical analysis process requires constraints as a guideline for data exploration (e.g. dimensional structures in OLAP)
  • need outlier management at the conceptual level: schema exceptions
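One possible reading of "schema exceptions" is to keep violating records in a separate, explicitly managed set instead of either rejecting them or dropping the constraint altogether. The sketch below assumes the constraint "each atom of a protein has exactly one position"; the function name `split_schema_exceptions` is hypothetical and the coordinates are invented toy values, not real entries for 1AL3.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

Coord = Tuple[float, float, float]
AtomKey = Tuple[str, int]  # (protein code, atom number)


def split_schema_exceptions(
    records: Iterable[Tuple[str, int, Coord]],
) -> Tuple[Dict[AtomKey, Coord], Dict[AtomKey, List[Coord]]]:
    """Separate atoms that satisfy the one-position constraint from those that do not."""
    positions: Dict[AtomKey, Set[Coord]] = defaultdict(set)
    for code, atom, xyz in records:
        positions[(code, atom)].add(xyz)

    # Atoms with a single distinct position satisfy the constraint.
    clean = {key: next(iter(coords)) for key, coords in positions.items() if len(coords) == 1}
    # Atoms with conflicting positions become schema exceptions.
    exceptions = {key: sorted(coords) for key, coords in positions.items() if len(coords) > 1}
    return clean, exceptions


# Invented toy coordinates; only the protein code and atom numbers follow the slide.
records = [
    ("1AL3", 1477, (0.0, 0.0, 0.0)),
    ("1AL3", 1478, (1.0, 2.0, 3.0)),
    ("1AL3", 1478, (1.5, 2.0, 3.0)),  # conflicting position for the same atom
]

clean, exceptions = split_schema_exceptions(records)
print(clean)
print(exceptions)
```

Analysis tools can then rely on the constraint for the clean part, while the exceptions stay visible for expert review.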
Summary and Conclusion
• Data Quality
  • key issue in statistically analyzing huge data sets
  • data analysis means: a complex, often DQ-driven, process of transforming micro into macro data
• Solution
  • a) no one size fits all ...
  • b) need a general framework considering multiple aspects
    • statistical metadata
    • data lineage (most critical from my perspective)
    • users with expert knowledge (voting concept?)
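To make the lineage aspect concrete, here is a minimal sketch of a micro-to-macro aggregation that remembers which incoming records contributed to each macro cell, so that an outlier in a report can be traced back. The function `aggregate_with_lineage`, the field names, and the sample records are assumptions for illustration only.

```python
from collections import defaultdict
from typing import Dict, List


def aggregate_with_lineage(micro: List[dict], key: str) -> Dict[str, dict]:
    """Sum 'units' per value of `key` and keep the ids of contributing records."""
    macro: Dict[str, dict] = defaultdict(lambda: {"units": 0, "sources": []})
    for rec in micro:
        cell = macro[rec[key]]
        cell["units"] += rec["units"]
        cell["sources"].append(rec["id"])  # lineage: which micro records fed this cell
    return dict(macro)


# Invented micro data: record r2 carries a suspiciously large value.
micro = [
    {"id": "r1", "brand": "X", "units": 3},
    {"id": "r2", "brand": "X", "units": 400},
    {"id": "r3", "brand": "Y", "units": 5},
]

macro = aggregate_with_lineage(micro, "brand")
print(macro["X"])  # the outlier cell, together with the ids it was built from
```

Keeping one source list per macro cell already grows with the size of the micro data, which hints at why fine-granular lineage leads to the data explosion mentioned earlier.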