Data Quality on the Web
Dagstuhl Seminar No. 03362, August 31 – September 5, 2003
Why we are here…
• “… there is a significant gap between perception and reality regarding the quality of data in many organizations, and that current data quality problems cost U.S. businesses more than $600 billion a year.” (TDWI's Data Quality Report, 2002)
• “… 39% of companies have no information quality standards, less than 10% of companies have customer databases that can drive a contact strategy,…” (Qci, London, 2002)
• “According to experts, data quality issues account for a data warehouse failure of up to 70% and contribute to a 55%-to-70% failure rate for CRM projects. …” (Len Dubois, 2002)
• …
Data Quality on the Web Dagstuhl, September 2003
How did we get here…
Why is data quality such an important and crucial issue in data management and processing?
• Hypothesis 1: The quality of data has always been poor, but in isolated settings, people know how to deal with this issue.
• Hypothesis 2: Publishing and exchanging data on the Web/Internet has become much easier. Integrating data sources reveals data of poor quality, i.e., data that does not meet user expectations or data/information requirements.
What we can do about it…
Data quality is a data management issue. First of all, let's have some precise definitions, or at least some kind of ontological commitment.
• What does data of poor quality mean?
• How can one identify data of poor quality?
• Who is responsible for data of poor quality?
• How can one deal with data of poor quality?
• How can one improve data of poor quality?
Outline
• The various meanings of data quality (DQ)
• Fundamental questions
• Questions in context
• Working groups
The Various Meanings of DQ
• How are data quality aspects defined?
• What entities are involved in dealing with data quality?
Meanings and Definitions for DQ
• “Data quality is an inexact science in terms of assessments and benchmarks.”
• “High-quality data is data that is fit for use by data consumers.”
There is no general (formal) definition for data quality; data quality aspects (or data quality dimensions) heavily depend on the specific application domain or data management context.
Accuracy
The degree of correctness and precision with which the real-world data of interest to an application domain is represented in an information system.
• Incorrect data values
• Data errors
• Rounding errors, measurements, …
Completeness
The degree to which the data relevant for an application domain has been recorded in an information system.
• Missing records
• Missing attribute values
• Missing schema information
• Missing metadata
• …
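As a rough illustration of the "missing attribute values" case, completeness can be scored as the fraction of expected attribute values that are actually present. This is a sketch under assumptions, not a definition given in the slides: the function name `attribute_completeness` and the sample records are hypothetical, and a value counts as missing when it is absent or `None`.

```python
from typing import Any, Iterable, Mapping, Sequence


def attribute_completeness(
    records: Iterable[Mapping[str, Any]],
    required: Sequence[str],
) -> float:
    """Fraction of required attribute values that are present (not None)."""
    total = 0
    present = 0
    for record in records:
        for attr in required:
            total += 1
            if record.get(attr) is not None:
                present += 1
    # An empty input has nothing missing; treat it as fully complete.
    return present / total if total else 1.0


# Hypothetical sample data: 6 of the 9 expected values are present.
customers = [
    {"name": "Ada", "email": "ada@example.org", "city": "London"},
    {"name": "Bob", "email": None, "city": "Paris"},
    {"name": "Cy", "city": None},
]
score = attribute_completeness(customers, ["name", "email", "city"])
```

This only covers missing attribute values. Detecting missing records would additionally require some reference for how many records should exist.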
Timeliness
The degree to which the data recorded in an information system is up-to-date.
• What does up-to-date mean; is time relative?
• What concept of time is of relevance here? Valid time, transaction time, …
• Is outdated data of poor quality?
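One possible way to make "up-to-date" measurable is to let a score decay with the age of a record relative to an application-specific maximum acceptable age. The linear decay, the function name `timeliness`, and the `max_age` parameter are all assumptions made for this sketch; the slides do not prescribe a formula.

```python
from datetime import datetime, timedelta


def timeliness(last_updated: datetime, now: datetime, max_age: timedelta) -> float:
    """Return 1.0 for fresh data, falling linearly to 0.0 at max_age."""
    if max_age <= timedelta(0):
        raise ValueError("max_age must be positive")
    age = now - last_updated
    fraction_used = age / max_age  # timedelta / timedelta yields a float
    # Clamp so very old (or future-dated) records stay within [0, 1].
    return min(1.0, max(0.0, 1.0 - fraction_used))


# Hypothetical timestamps: one record is 6 hours old, the other a month old.
now = datetime(2003, 9, 1, 12, 0)
fresh = timeliness(datetime(2003, 9, 1, 6, 0), now, timedelta(hours=24))
stale = timeliness(datetime(2003, 8, 1, 12, 0), now, timedelta(hours=24))
```

Here `last_updated` plays the role of a transaction time (when the record was last written). Scoring against valid time would need a different timestamp, which is exactly the choice the second question above raises.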
Consistency
The degree to which the data recorded in an information system satisfies certain constraints.
• Not all data resides in a database; even when it does, integrity constraints are often not used or not known
• Consistency of data depends on how good the integrity constraints are
• Consistency is extremely hard to achieve in the context of Web data and data integration
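A minimal sketch of this definition, assuming constraints can be written as per-record predicates: consistency is then the fraction of records that satisfy every constraint. The names `consistency`, `Record`, and `Constraint`, and the two example rules, are hypothetical and chosen only for illustration.

```python
from typing import Any, Callable, Iterable, Mapping

Record = Mapping[str, Any]
Constraint = Callable[[Record], bool]


def consistency(records: Iterable[Record], constraints: Iterable[Constraint]) -> float:
    """Fraction of records that satisfy every constraint."""
    checks = list(constraints)
    checked = 0
    satisfied = 0
    for record in records:
        checked += 1
        if all(check(record) for check in checks):
            satisfied += 1
    # With no records there is nothing to violate a constraint.
    return satisfied / checked if checked else 1.0


# Hypothetical integrity constraints: a domain check and a value-set check.
rules = [
    lambda r: r["age"] >= 0,
    lambda r: r["country"] in {"DE", "FR", "US"},
]
people = [
    {"age": 34, "country": "DE"},
    {"age": -1, "country": "FR"},
    {"age": 51, "country": "XX"},
    {"age": 27, "country": "US"},
]
ratio = consistency(people, rules)
```

The score is only as meaningful as the rules supplied: with an empty rule list every record counts as consistent, which mirrors the point that consistency depends on how good the constraints are.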
Other Data Quality Aspects (taken from Strong, Lee, Wang: Data Quality in Context, CACM 45(4), 2002)
Other Data Quality Aspects (2)
• Wang, Strong: Beyond Accuracy: What Data Quality Means to Data Consumers, J. of Management Inf. Systems, 1996.
179 data quality attributes (or DQ dimensions) have been collected from data consumers:
Accuracy, completeness, believability, interpretability, documentation, pedigree, availability, expense, verifiable, definability, age, auditable, ease of exchange, origin, integrity, portability, redundancy, security, correctness, conciseness, …
Convenience, friendliness, usable, ease of retrieval, partitionability, ergonomic, meets requirements, creativity, interesting, …
Entities involved in DQ Aspects
• Data producers
• Data custodians (provide and manage resources for processing and storing data)
• Data consumers
• A single entity can have more than one role
• Precise role depends on specific setting and application domain
• Entities interact with each other through (often complex) workflows/data-flows.
Fundamental Questions
• Questions serve the study, development, and specification of models and definitions for data quality dimensions
• Questions suggest starting points for working groups
• Questions need to be addressed in specific scenarios and settings (data producer, custodian, consumer) to obtain meaningful answers.
DQ Assessment
• How does one determine DQ requirements?
• Where and how is data quality assessed and measured?
• What are the specifics of settings where data is measurable?
• DQ dimensions can have different importance for data producers, custodians, and consumers.
• In particular, DQ dimensions can have varying importance among different data consumers and applications.
DQ as Metadata
• Ideal data management scenario
  (1) DQ dimensions are well-defined
  (2) Data come with metadata that describe the quality of the data according to DQ dimensions
• What is required to capture, manage, and utilize DQ metadata within workflows/data-flows?
• Can this be combined with data lineage aspects?
• How to ensure that metadata is of good quality?
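To make the "data come with DQ metadata" scenario concrete, the sketch below attaches per-dimension quality scores and a simple lineage trail to a dataset and updates both when a processing step runs. The class `QualityAnnotation`, its fields, the step names, and the score values are hypothetical; this is one possible shape for such metadata, not a format proposed in the slides.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class QualityAnnotation:
    """Quality scores in [0, 1] per DQ dimension, plus a simple lineage trail."""

    scores: Dict[str, float] = field(default_factory=dict)
    lineage: List[str] = field(default_factory=list)

    def after_step(self, step: str, updates: Dict[str, float]) -> "QualityAnnotation":
        """Return a new annotation recording a processing step and revised scores."""
        for dimension, value in updates.items():
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{dimension} score out of range: {value}")
        # Build a fresh object so the annotation of the input data stays unchanged.
        return QualityAnnotation(
            scores={**self.scores, **updates},
            lineage=[*self.lineage, step],
        )


# Hypothetical workflow: a rounding transformation lowers the accuracy score.
raw = QualityAnnotation(
    scores={"accuracy": 0.98, "completeness": 0.80},
    lineage=["sensor-export"],
)
rounded = raw.after_step("round-to-integers", {"accuracy": 0.90})
```

Keeping the lineage next to the scores is one way to combine DQ metadata with data lineage, since each score can be traced back to the steps that produced it.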
DQ Life-cycle
• Data has a life-cycle, and so do DQ aspects; for example, data can become outdated or less accurate due to transformations
• DQ life-cycle should be studied in the context of data producer, custodian, and consumer
• Once the life-cycle has been determined, DQ assessment and recording of DQ metadata can be automated (?)
• DQ life-cycle can be used to communicate DQ requirements among entities involved in data management
DQ-based Data Usage
• Assuming that DQ metadata is available, how can DQ measurements be used by data custodians and/or data consumers?
• DQ dimensions are typically used in the context of data integration (source selection, query planning and optimization, …)
• What about query models in which DQ dimensions are explicit to the user and where users can modify DQ measurements?
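Source selection based on DQ metadata can be sketched as ranking sources by a weighted average of their per-dimension scores, with the weights supplied by the user. This is only one reading of "DQ dimensions explicit to the user": the function `rank_sources`, the source names, and all scores and weights below are assumptions for illustration.

```python
from typing import Dict, List, Tuple


def rank_sources(
    sources: Dict[str, Dict[str, float]],
    weights: Dict[str, float],
) -> List[Tuple[str, float]]:
    """Rank sources by the weighted average of their DQ scores, best first."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    ranked = []
    for name, scores in sources.items():
        # A dimension the source does not report counts as 0.0.
        weighted = sum(w * scores.get(dim, 0.0) for dim, w in weights.items())
        ranked.append((name, weighted / total_weight))
    return sorted(ranked, key=lambda item: item[1], reverse=True)


# Hypothetical DQ metadata for two sources.
sources = {
    "source_a": {"accuracy": 0.9, "timeliness": 0.4},
    "source_b": {"accuracy": 0.7, "timeliness": 0.9},
}
prefers_accuracy = rank_sources(sources, {"accuracy": 3.0, "timeliness": 1.0})
prefers_fresh = rank_sources(sources, {"accuracy": 1.0, "timeliness": 3.0})
```

With these numbers, a consumer who weights accuracy more heavily gets `source_a` first, while one who weights timeliness more heavily gets `source_b` first, which illustrates how the same metadata can serve consumers with different DQ requirements.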
DQ Improvement
• Data cleansing is a big business; mainly concerned with cleansing data close to the consumer’s site (customer DB, sales data warehouse, …)
• Depending on the degree of autonomy, DQ improvements should occur at all entities involved in DQ management
• Requires appropriate feedback mechanisms and well-understood workflow/data-flow
• Improvement should occur close to the data producer's site; feasible?
DQ and Trust
• Trust in data/information is primarily of concern at the data consumer site
• Is trust in data/information just an aggregated measure for several DQ dimensions, or is trust (or trustworthy data) a DQ dimension of its own?
Other Fundamental Questions?
• …
Questions in Context
• (Claim) Previously stated questions can only be answered in a meaningful way in the context of specific data management and application scenarios.
• There are many data management frameworks where DQ is of concern.
• Here we outline a few such frameworks…
Scientific Databases
• Probably the largest collections of data from experiments and observations today.
• Most valuable asset in computational sciences (Biology, Physics, Chemistry, Neuroscience, Cosmology, …)
• Issues:
  • Data collection and pre-processing tasks
  • Precise requirements regarding DQ aspects
  • Data provenance
Data Integration
• Probably the largest body of work that addresses data quality issues
• Ranging from integration of (heterogeneous) databases to integration of data from the Web
• Plays a role in almost all other contexts
• Issues:
  • How to assess and measure DQ aspects?
  • Utilization of DQ in query processing
  • DQ metadata, DQ improvement??
E-Commerce (and other E-services)
• Biggest business on the Internet
• Customer oriented; DQ requirements at data customer side are well-defined (?)
• Main focus is cleansing of customer databases
• Issues:
  • How is data of good quality ensured?
  • Many E-services require data integration
  • Feedback mechanisms; DQ aspects in data flow
  • …
Streaming Data
• Hot topic in database research
• Data arrives and needs to be processed continuously
• Application domains: monitoring and sensor networks
• Issues:
  • Correctness, precision, etc. of query results
  • Online aggregation
  • Properties of data collection entities (e.g., sensors)
  • …
Web Data
• Data collected by a Web crawler; not necessarily extracted from Web-accessible databases
• Queries are submitted to a search engine
• Issues:
  • What are DQ aspects of interest?
  • Are current quality measures sufficient (e.g., PageRank)?
  • Domain-specific quality assessment techniques
  • …
Data Warehouses
• Data from multiple sources, aggregated over time
• Despite data cleansing techniques employed while loading data into the DW, DQ is still a major problem
• DQ is crucial because of the tools on top of the DW; data is mission-critical and the most valuable asset
• Issues:
  • Is poor DQ in the DW a data integration problem?
  • Data cleansing is a reactive technique; what are proactive techniques? Is the autonomy of OLTP systems the problem?
  • …
Data Mining
• Data exploration and knowledge discovery are standard applications on top of large DBSs and DWs
  (1) Tool for the discovery of interesting (relevant to business) patterns in data
  (2) Tool for the analysis of DQ dimensions
• Issues:
  • Poor-quality data in, poor-quality rules/patterns out
  • DQ assessment and measurements?
  • …
Working Group Scheme
• 4-5 working groups
• Work together on specific topic(s) for a whole day; meeting rooms will be announced
• 30-40 minute presentation (layout will be provided)
Work-Plan
• Identify specific application domain(s) and setting(s) where DQ is of interest
• Identify data usage scenarios
• Identify entities involved in data management and interaction among these entities
• Detail DQ requirements
• Develop (formal) models, techniques, mechanisms that address some DQ dimensions
• Outline remaining open (hard) problems