Data Quality on the Web

Data Quality on the Web. Dagstuhl Seminar No 03362, August 31 – September 5, 2003.



  1. Data Quality on the Web. Dagstuhl Seminar No 03362, August 31 – September 5, 2003

  2. Why we are here…
     • “… there is a significant gap between perception and reality regarding the quality of data in many organizations, and that current data quality problems cost U.S. businesses more than $600 billion a year.” (TDWI's Data Quality Report, 2002)
     • “… 39% of companies have no information quality standards, less than 10% of companies have customer databases that can drive a contact strategy,…” (Qci, London, 2002)
     • “According to experts, data quality issues account for a data warehouse failure of up to 70% and contribute to a 55%-to-70% failure rate for CRM projects. …” (Len Dubois, 2002)
     • …
     Data Quality on the Web Dagstuhl, September 2003

  3. How did we get here…
     Why is data quality such an important and crucial issue in data management and processing?
     • Hypothesis 1: The quality of data has always been poor, but in isolated settings, people know how to deal with this issue.
     • Hypothesis 2: Publishing and exchanging data on the Web/Internet has become much easier.
     Integration of data sources reveals data of poor quality, i.e., data that does not meet user expectations or data/information requirements.

  4. What we can do about it…
     Data quality is a data management issue. First of all, let's have some precise definitions or at least some kind of ontological commitment.
     • What does data of poor quality mean?
     • How can one identify data of poor quality?
     • Who is responsible for data of poor quality?
     • How can one deal with data of poor quality?
     • How can one improve data of poor quality?

  5. Outline
     • The various meanings of data quality (DQ)
     • Fundamental questions
     • Questions in context
     • Working groups

  6. The Various Meanings of DQ
     • How are data quality aspects defined?
     • What entities are involved in dealing with data quality?

  7. Meanings and Definitions for DQ
     • “Data quality is an inexact science in terms of assessments and benchmarks.”
     • “High-quality data is data that is fit for use by data consumers.”
     There is no general (formal) definition for data quality; data quality aspects (or data quality dimensions) heavily depend on the specific application domain or data management context.

  8. Accuracy
     The degree of correctness and precision with which the real-world data of interest to an application domain is represented in an information system.
     • Incorrect data values
     • Data errors
     • Rounding errors, measurements,…

  9. Completeness
     The degree to which the data relevant for an application domain has been recorded in an information system.
     • Missing records
     • Missing attribute values
     • Missing schema information
     • Missing metadata
     • …
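
One of the items on the completeness slide, missing attribute values, can be turned into a simple ratio. The following is a minimal sketch, not a definition from the seminar: the function name `attribute_completeness`, the choice to treat `None` and absent keys as missing, and the sample records are all assumptions made for illustration. Missing records, schema information, and metadata are not covered by it.

```python
from typing import Any, Mapping, Sequence


def attribute_completeness(
    records: Sequence[Mapping[str, Any]],
    attributes: Sequence[str],
) -> float:
    """Fraction of expected attribute values that are present.

    A value counts as missing when the key is absent or maps to None
    (an assumption; other encodings of "missing" need their own check).
    """
    expected = len(records) * len(attributes)
    if expected == 0:
        return 1.0  # nothing expected, so nothing is missing
    present = sum(
        1 for record in records for attr in attributes
        if record.get(attr) is not None
    )
    return present / expected


# Hypothetical customer records with two missing e-mails and one missing city.
customers = [
    {"name": "Ada", "email": "ada@example.org", "city": "London"},
    {"name": "Bob", "email": None, "city": "Paris"},
    {"name": "Cy", "city": None},
]
score = attribute_completeness(customers, ["name", "email", "city"])
```

Here 6 of the 9 expected values are present, so `score` is 6/9. Which attributes count as "relevant for an application domain" remains a per-application decision.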

  10. Timeliness
      The degree to which the data recorded in an information system is up-to-date.
      • What does up-to-date mean; is time relative?
      • What concept of time is of relevance here? Valid time, transaction time,…
      • Is outdated data of poor quality?
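
The timeliness slide leaves open what "up-to-date" means. One possible reading, sketched below under explicit assumptions, scores a record by its age relative to an application-chosen shelf life. The linear decay, the `shelf_life` parameter, and the use of a single last-update timestamp (closer to transaction time than to valid time) are illustrative choices, not something the slides prescribe.

```python
from datetime import datetime, timedelta


def timeliness(last_updated: datetime, now: datetime, shelf_life: timedelta) -> float:
    """Score in [0, 1]: 1.0 for data updated just now, 0.0 once its age
    reaches or exceeds the (assumed) shelf life."""
    if shelf_life <= timedelta(0):
        raise ValueError("shelf_life must be positive")
    age = now - last_updated
    if age < timedelta(0):
        raise ValueError("last_updated lies in the future")
    # Dividing two timedeltas yields a float ratio.
    return max(0.0, 1.0 - age / shelf_life)


# Hypothetical record: updated 7 days before "now", with a 28-day shelf life.
now = datetime(2003, 9, 1)
score = timeliness(datetime(2003, 8, 25), now, timedelta(days=28))
```

With these values the age is a quarter of the shelf life, so `score` is 0.75. Two consumers with different shelf lives would score the same record differently, which is one way to read "is time relative?".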

  11. Consistency
      The degree to which the data recorded in an information system satisfies certain constraints.
      • Not all data resides in a database; even when it does, integrity constraints are often not used or known
      • Consistency of data depends on how good integrity constraints are
      • Consistency is extremely hard to achieve in the context of Web data and data integration
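
Consistency as "the degree to which data satisfies certain constraints" can be sketched as the share of records that pass every constraint. The example below is only an illustration: the constraints, the field names (`age`, `start`, `end`), and the sample records are hypothetical, and the resulting score is only as meaningful as the constraints supplied.

```python
from typing import Any, Callable, Mapping, Sequence

Record = Mapping[str, Any]
Constraint = Callable[[Record], bool]


def consistency(records: Sequence[Record], constraints: Sequence[Constraint]) -> float:
    """Fraction of records that satisfy all given constraints."""
    if not records:
        return 1.0  # no records, so no violations
    satisfied = sum(
        1 for record in records
        if all(check(record) for check in constraints)
    )
    return satisfied / len(records)


# Hypothetical integrity constraints for illustration.
constraints = [
    lambda r: r["age"] >= 0,           # ages cannot be negative
    lambda r: r["start"] <= r["end"],  # intervals must not end before they start
]
records = [
    {"age": 30, "start": 1, "end": 5},
    {"age": -2, "start": 1, "end": 5},  # violates the age constraint
    {"age": 41, "start": 9, "end": 3},  # violates the interval constraint
    {"age": 18, "start": 2, "end": 2},
]
score = consistency(records, constraints)
```

Two of the four records pass both checks, so `score` is 0.5. An empty constraint list yields 1.0 for any data, which mirrors the slide's point that consistency depends on how good the integrity constraints are.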

  12. Other Data Quality Aspects (taken from Strong, Lee, Wang: Data Quality in Context, CACM 45(4), 2002)

  13. Other Data Quality Aspects (2)
      • Wang, Strong: Beyond Accuracy: What Data Quality means to Data Consumers, J. of Management Inf. Systems, 1996.
      179 data quality attributes (or DQ dimensions) have been collected from data consumers:
      Accuracy, completeness, believability, interpretability, documentation, pedigree, availability, expense, verifiable, definability, age, auditable, ease of exchange, origin, integrity, portability, redundancy, security, correctness, conciseness, …
      Convenience, friendliness, usable, ease of retrieval, partitionability, ergonomic, meets requirements, creativity, interesting, …

  14. Entities involved in DQ Aspects
      • Data producers
      • Data custodians (provide and manage resources for processing and storing data)
      • Data consumers
      • A single entity can have more than one role
      • Precise role depends on specific setting and application domain
      • Entities interact with each other through (often complex) workflows/data-flows.

  15. Fundamental Questions

  16. Fundamental Questions
      • Questions serve the study, development, and specification of models and definitions for data quality dimensions
      • Questions suggest starting points for working groups
      • Questions need to be addressed in specific scenarios and settings (data producer, custodian, consumer) to obtain meaningful answers.

  17. DQ Assessment
      • How does one determine DQ requirements?
      • Where and how is data quality assessed and measured?
      • What are the specifics of settings where data is measurable?
      • DQ dimensions can have different importance for data producers, custodians, and consumers.
      • In particular, DQ dimensions can have varying importance among different data consumers and applications.

  18. DQ as Metadata
      • Ideal data management scenario:
        (1) DQ dimensions are well-defined
        (2) Data come with metadata that describe the quality of the data according to DQ dimensions
      • What is required to capture, manage, and utilize DQ metadata within workflows/data-flows?
      • Can this be combined with data lineage aspects?
      • How to ensure that metadata is of good quality?
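
The "ideal scenario" of data that carries its own quality metadata, combined with lineage, might look roughly like the structure below. This is a sketch under stated assumptions: the class names `QualityAnnotation` and `AnnotatedDataset`, their fields, and the `derive` method are invented for illustration and do not come from the slides or from any particular system.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class QualityAnnotation:
    dimension: str    # e.g. "completeness" or "timeliness"
    score: float      # assumed to be normalized to [0, 1]
    assessed_by: str  # role that produced the score (producer, custodian, consumer)


@dataclass
class AnnotatedDataset:
    name: str
    rows: List[Dict[str, Any]]
    quality: List[QualityAnnotation] = field(default_factory=list)
    lineage: List[str] = field(default_factory=list)

    def derive(self, name: str, rows: List[Dict[str, Any]], step: str) -> "AnnotatedDataset":
        """Create a dataset produced from this one by a transformation step.

        The lineage is extended, but quality scores are deliberately not
        copied: a transformation may change them, so they need reassessment.
        """
        return AnnotatedDataset(
            name=name,
            rows=rows,
            quality=[],
            lineage=self.lineage + [f"{self.name} -> {step}"],
        )


raw = AnnotatedDataset(
    name="raw",
    rows=[{"x": 1}, {"x": 1}],
    quality=[QualityAnnotation("completeness", 0.9, "producer")],
)
clean = raw.derive("clean", [{"x": 1}], step="deduplicate")
```

Dropping the scores on `derive` is one conservative design choice. A system could instead propagate them with a per-transformation rule, which is where the open question about the quality of the metadata itself comes in.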

  19. DQ Life-cycle
      • Data has a life-cycle, so do DQ aspects; for example, data can become outdated or data can become less accurate due to transformations
      • DQ life-cycle should be studied in the context of data producer, custodian, and consumer
      • If life-cycle has been determined, DQ assessment and recording of DQ metadata can be automated (?)
      • DQ life-cycle can be used to communicate DQ requirements among entities involved in data management

  20. DQ-based Data Usage
      • Assuming that DQ metadata is available, how can DQ measurements be used by data custodians and/or data consumers?
      • DQ dimensions are typically used in the context of data integration (source selection, query planning and optimization, …)
      • What about query models in which DQ dimensions are explicit to the user and where users can modify DQ measurements?
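
Source selection driven by DQ metadata, with the dimensions made explicit to the user, can be sketched as a weighted ranking. The scoring rule (a normalized weighted sum), the function name `rank_sources`, and the source names and scores are assumptions for illustration only. The slides do not specify how DQ measurements should be combined.

```python
from typing import Dict, List


def rank_sources(
    sources: Dict[str, Dict[str, float]],
    weights: Dict[str, float],
) -> List[str]:
    """Order source names by a weighted average of their DQ scores.

    `sources` maps a source name to its per-dimension scores in [0, 1];
    `weights` expresses how much the user cares about each dimension.
    A dimension missing from a source's metadata counts as 0.0.
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")

    def score(dq: Dict[str, float]) -> float:
        return sum(w * dq.get(dim, 0.0) for dim, w in weights.items()) / total

    return sorted(sources, key=lambda name: score(sources[name]), reverse=True)


# Hypothetical DQ metadata for two sources.
sources = {
    "A": {"accuracy": 0.9, "timeliness": 0.4},
    "B": {"accuracy": 0.6, "timeliness": 0.9},
}
accuracy_first = rank_sources(sources, {"accuracy": 3.0, "timeliness": 1.0})
timeliness_first = rank_sources(sources, {"accuracy": 1.0, "timeliness": 3.0})
```

With accuracy weighted three times as heavily, source A ranks first (0.775 against 0.675). Reversing the weights puts B first (0.825 against 0.525). The same metadata therefore supports different consumers with different priorities.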

  21. DQ Improvement
      • Data cleansing is a big business; mainly concerned with cleansing data close to the consumer’s site (customer DB, sales data warehouse, …)
      • Depending on the degree of autonomy, DQ improvements should occur at all entities involved in DQ management
      • Requires appropriate feedback mechanisms and well-understood workflow/data-flow
      • Improvement should occur close to data producer site; feasible?

  22. DQ and Trust
      • Trust in data/information primarily of concern at data consumer site
      • Is trust in data/information just an aggregated measure for several DQ dimensions or is trust (or trustworthy data) a DQ dimension on its own?

  23. Other Fundamental Questions?
      • …

  24. Questions in Context

  25. Questions in Context
      • (Claim) Previously stated questions can only be answered in a meaningful way in the context of specific data management and application scenarios.
      • There are many data management frameworks where DQ is of concern.
      • Here we outline a few such frameworks…

  26. Scientific Databases
      • Probably the largest collections of data from experiments and observations today.
      • Most valuable asset in computational sciences (Biology, Physics, Chemistry, Neuroscience, Cosmology, …)
      • Issues:
        • Data collection and pre-processing tasks
        • Precise requirements regarding DQ aspects
        • Data provenance

  27. Data Integration
      • Probably the largest body of work that addresses data quality issues
      • Ranging from integration of (heterogeneous) databases to integration of data from the Web
      • Plays a role in almost all other contexts
      • Issues:
        • How to assess and measure DQ aspects?
        • Utilization of DQ in query processing
        • DQ metadata, DQ improvement ??

  28. E-Commerce (and other E-services)
      • Biggest business on the Internet
      • Customer oriented; DQ requirements at data customer side are well-defined (?)
      • Main focus is cleansing of customer databases
      • Issues:
        • How is data of good quality ensured?
        • Many E-services require data integration
        • Feedback mechanisms; DQ aspects in data flow
        • …

  29. Streaming Data
      • Hot topic in database research
      • Data arrives and needs to be processed continuously
      • App domains: monitoring and sensor networks
      • Issues:
        • Correctness, precision etc. of query results
        • Online aggregation
        • Properties of data collection entities (e.g., sensors)
        • …

  30. Web Data
      • Data collected by a Web-crawler; not necessarily extracted from Web-accessible databases
      • Queries are submitted to search engine
      • Issues:
        • What are DQ aspects of interest?
        • Are current quality measures sufficient (e.g., PageRank)?
        • Domain specific quality assessment techniques
        • …

  31. Data Warehouses
      • Data from multiple sources, aggregated over time
      • Despite data cleansing techniques employed during loading data into DW, DQ is still a major problem
      • DQ is crucial because of tools on top of DW; data is mission critical and most valuable asset
      • Issues:
        • Is poor DQ in DW a data integration problem?
        • Data cleansing is a reactive technique; what are proactive techniques? Is the autonomy of OLTP systems the problem?
        • …

  32. Data Mining
      • Data exploration and knowledge discovery are standard applications on top of large DBS & DWs
      • (1) Tool for the discovery of interesting (relevant to business) patterns in data
      • (2) Tool for the analysis of DQ dimensions
      • Issues:
        • Poor-quality data in, poor-quality rules/patterns out
        • DQ assessment and measurements?
        • …

  33. Working Groups

  34. Working Group Scheme
      • 4-5 working groups
      • Work together on specific topic(s) for a whole day; meeting rooms will be announced
      • 30-40 minute presentation (layout will be provided)

  35. Work-Plan
      • Identify specific application domain(s) and setting(s) where DQ is of interest
      • Identify data usage scenarios
      • Identify entities involved in data management and interaction among these entities
      • Detail DQ requirements
      • Develop (formal) models, techniques, mechanisms that address some DQ dimensions
      • Outline remaining open (hard) problems
