Problem Statement and Objectives

Kai-Uwe Sattler, Michael Gertz, Vipul Kashyap, Cai Ziegler, Cinzia Cappiello, Susanne Boll
Dagstuhl Seminar “Data Quality on the Web”

What is the relationship between trust and data quality?
• What is the meaning of trust in the context of data quality?
• What are the dimensions of data quality in different settings?
• How do we characterize these dimensions?
• Can we determine any metrics for assessment?
How can data quality measurements establish trust?

Data, Fact, Belief, and Trust
[Diagram: Data is either verifiable (Fact) or non-verifiable (Belief). Belief is either evidence-based or non-evidence-based. Further labels in the diagram: experience, reputation, and Trust (atomic, indirect).]

Notions we established in this work
• If you can verify -> Fact
• If you cannot verify -> Belief
  • Webster: a state or habit of mind in which trust or confidence is placed in some person or thing
  • Two different views of belief
    • Evidence-based belief
    • Non-evidence-based belief
• Reputation is the memory and summary of behavior from past transactions
• Trust is a subjective expectation an agent has about another's future behavior
  • Two different variants
    • Atomic trust
    • Indirect trust
• Reputation and Trust are built over time (feedback)

Working Model
• Consumers (Query); Providers (M)
• Distinguish trusted sources M1 from non-trusted sources M2
• Query result r1 from M1, query result r2 from M2
• Question: What relationship between r1 and r2 can be used to estimate the quality of the result?

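Data Management Settings
[Diagram: a 2 × 2 grid of data type against query type.]
• Unstructured / semistructured data (Web data, Doc coll., Inf. retrieval, Multimedia): Setting 2 for exact queries, Setting 3 for imprecise queries
• Structured data (Traditional databases, Inf. retrieval, Databases): Setting 1 for exact queries, Setting 4 for imprecise queries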

The three DQ dimensions for Setting 1
• Completeness: “degree to which the expected values are included in a data collection”
  • In the presence of trusted data sources
  • In the absence of trusted data sources
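The formulas for these two cases are not preserved in this transcript. As a rough sketch only, completeness could be computed as a ratio over the query results of the working model. The function names, the use of the trusted result r1 as the set of "expected values", and the use of the pooled results of all sources when no trusted source exists are assumptions made for illustration, not definitions from the slides.

```python
from typing import Iterable, Set


def completeness_with_trusted(r2: Set[str], r1: Set[str]) -> float:
    """Fraction of the trusted result r1 that also appears in r2.

    Assumption: the trusted result r1 stands in for the expected values.
    """
    if not r1:
        return 1.0  # nothing is expected, so nothing is missing
    return len(r2 & r1) / len(r1)


def completeness_without_trusted(result: Set[str],
                                 all_results: Iterable[Set[str]]) -> float:
    """Fraction of the pooled values (union over all sources) found in result.

    Assumption: without a trusted source, the union of all available
    results approximates the expected values.
    """
    pooled = set().union(*all_results)
    if not pooled:
        return 1.0
    return len(result & pooled) / len(pooled)


# r2 returns two of the four values the trusted source returns.
print(completeness_with_trusted({"a", "b"}, {"a", "b", "c", "d"}))
# One source compared against the pool of two sources.
print(completeness_without_trusted({"a"}, [{"a"}, {"a", "b"}]))
```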

The three DQ dimensions for Setting 1
• Timeliness: relationship between the validity of the data item and the time referred to in the query
• Timeliness is a verifiable notion; if we can assume that validity intervals are trustworthy, there is no distinction between trusted and untrusted sources
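One way to make this notion concrete is to check whether the time referred to in the query falls inside each item's validity interval. The sketch below assumes every item carries an explicit, trustworthy interval; the `Item` class, its field names, and the aggregation into a ratio are hypothetical choices for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Item:
    value: str
    valid_from: float  # start of the validity interval (hypothetical field)
    valid_to: float    # end of the validity interval (hypothetical field)


def is_timely(item: Item, query_time: float) -> bool:
    """True if the query time lies inside the item's validity interval."""
    return item.valid_from <= query_time <= item.valid_to


def timeliness(items: Sequence[Item], query_time: float) -> float:
    """Fraction of items in a result that are valid at the query time."""
    if not items:
        return 1.0
    return sum(is_timely(i, query_time) for i in items) / len(items)


result = [Item("a", 0, 10), Item("b", 5, 6)]
print(timeliness(result, query_time=8))
```

Because the check uses only the interval and the query time, it gives the same answer for trusted and untrusted sources, as long as the intervals themselves can be relied on.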

The three DQ dimensions for Setting 1
• Correctness: similar to completeness, a ratio between the correct values and the total values; here we distinguish between trusted and untrusted sources
  • In the presence of trusted data sources
  • In the absence of trusted data sources
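The ratio described here might be sketched as follows. Two assumptions are made purely for illustration: with a trusted source, a value counts as correct if the trusted result r1 contains it; without one, a value counts as correct if at least `min_votes` other sources also return it. Neither rule is stated in the slides, and the function names are hypothetical.

```python
from collections import Counter
from typing import Iterable, Set


def correctness_with_trusted(r2: Set[str], r1: Set[str]) -> float:
    """Ratio of values in r2 confirmed by the trusted result r1 to all values in r2."""
    if not r2:
        return 1.0
    return len(r2 & r1) / len(r2)


def correctness_without_trusted(result: Set[str],
                                other_results: Iterable[Set[str]],
                                min_votes: int = 1) -> float:
    """Ratio of values in result that at least min_votes other sources also return.

    Assumption: agreement among sources is used as a stand-in for correctness.
    """
    if not result:
        return 1.0
    votes = Counter(v for other in other_results for v in set(other))
    confirmed = sum(1 for v in result if votes[v] >= min_votes)
    return confirmed / len(result)


print(correctness_with_trusted({"a", "x"}, {"a", "b"}))
print(correctness_without_trusted({"a", "x"}, [{"a"}, {"a", "b"}]))
```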

Setting 2 – exact queries / unstructured data
• Completeness
  • The number of sources is large
  • It is harder to establish a notion of completeness than it is in Setting 1
• Timeliness
  • Same as Setting 1
  • Under the assumption that validity intervals are explicit
• Correctness
  • Same as Setting 1 (close to DB scenario)

Setting 3 – imprecise queries / unstructured data
Quality of metadata has a major impact on completeness and correctness
• Completeness
  • Same as above
• Timeliness
  • Same as above
• Correctness
  • The difference here is the ranking

Setting 4 – imprecise queries / structured data
• Completeness
• Timeliness
• Correctness

So …
• DQ is a composite of different DQ dimensions
• For the DQ dimensions, there are measurements for different settings
• DQ – TRUST
  • DQ values need to be fed into the trust values
  • Trust values need to be fed back into DQ values
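The slides leave open how this feedback should work (it is listed among the open issues). Purely as an illustration of the loop, the sketch below combines the three dimension scores into one DQ value with a weighted average, moves a source's trust value toward that DQ value with an exponential moving average, and discounts a DQ value by the current trust. The weights, the update rate, and all function names are assumptions, not a model proposed by the authors.

```python
from typing import Sequence


def combine_dq(completeness: float, timeliness: float, correctness: float,
               weights: Sequence[float] = (1 / 3, 1 / 3, 1 / 3)) -> float:
    """Composite DQ value as a weighted average of the three dimensions."""
    scores = (completeness, timeliness, correctness)
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)


def update_trust(trust: float, dq: float, rate: float = 0.2) -> float:
    """Feed a DQ value into trust: move trust a step toward the observed DQ."""
    return (1 - rate) * trust + rate * dq


def trust_weighted_dq(dq: float, trust: float) -> float:
    """Feed trust back into DQ: discount a measured DQ value by source trust."""
    return dq * trust


trust = 0.5  # hypothetical starting trust for a source
for observed in [(1.0, 1.0, 0.4), (0.9, 1.0, 0.8)]:
    dq = combine_dq(*observed)
    trust = update_trust(trust, dq)
    print(round(dq, 3), round(trust, 3), round(trust_weighted_dq(dq, trust), 3))
```

The moving average is one simple way to reflect that trust is built over time from repeated transactions; other update rules would fit the same loop.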

Open issues
• Scalability
• Metadata quality
• Further dimensions
• How to feed back
• Models for combining quality values
