Managing Uncertain Data

Anish Das Sarma
Stanford University

What is Uncertain Data?

Why Does It Arise?
• Precision of devices
• Lack of information
• Uncertainty about the future
• Anonymization

Applications: Information Extraction

Applications: Information Integration
• name, hPhone, oPhone, hAddr, oAddr
• name, phone, address
• Combined View

Applications: Deduplication
• 80% match?

Applications: Scientific & Medical Experiments
• Probably not cancer

How Do Database Management Systems (DBMS) Handle Uncertainty?
They don’t

What Do (Most) Applications Do?
• Clean: turn into data that DBMSs can handle
• Loss of information
• Errors compound insidiously

Outline of The Talk
• Part 1: Managing Uncertainty in a DBMS (theory → systems)
• Part 2: Handling Uncertainty in Data Integration (systems → theory)
• Other Research (trailer)
• Future Plans

Part 1: Managing Uncertain Data
• Primarily in the context of the Trio project
• Data
• Uncertainty
• Lineage
• Today’s focus: how lineage helps

Uncertain Data
• An uncertain database represents a set of possible instances (or, possible worlds)
• Our work: finite sets of possible instances

Representing Uncertain Data
• 20+ years of work (mostly theoretical)
• Appears to be a fundamental trade-off between expressiveness & intuitiveness
• We spent some time exploring the space of models for uncertainty

Hierarchy of Models [ICDE 06]
[Figure: hierarchy of models, ranging from "+ Expressive, - Complex" to "+ Intuitive, - Inexpressive"]
Next:
• Consider a model M
• Isolate inexpressiveness
• Solve problem with lineage

Running Example: Crime-Solver
• Saw (witness, color, car) // may be uncertain
• Drives (person, color, car) // may be uncertain
• Suspects (person) = πperson(Saw ⋈ Drives)

Simple Model M
1. Alternatives: uncertainty about value
2. ‘?’ (Maybe) Annotations
Three possible instances

Simple Model M
1. Alternatives
2. ‘?’ (Maybe): uncertainty about presence
Six possible instances
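The slide's table did not survive extraction, so the following is only a minimal sketch of how model M's two constructs expand into possible instances. The `saw` relation and its rows are hypothetical, chosen so that one tuple with two alternatives and one tuple with two alternatives plus a ‘?’ yield six instances.

```python
from itertools import product

# Hypothetical Saw relation in model M (not the data from the talk).
# Each tuple has a list of alternatives and a flag for the '?' annotation.
saw = [
    {"alternatives": [("Amy", "red", "Honda"), ("Amy", "red", "Mazda")],
     "maybe": False},
    {"alternatives": [("Betty", "blue", "Acura"), ("Betty", "blue", "Toyota")],
     "maybe": True},
]

ABSENT = None  # marker for "this tuple is not present in the instance"


def possible_instances(relation):
    """Return every possible instance as a frozenset of ordinary rows."""
    per_tuple_choices = []
    for t in relation:
        choices = list(t["alternatives"])
        if t["maybe"]:
            choices.append(ABSENT)  # '?' adds the option of being absent
        per_tuple_choices.append(choices)
    instances = set()
    for pick in product(*per_tuple_choices):
        instances.add(frozenset(row for row in pick if row is not ABSENT))
    return instances


worlds = possible_instances(saw)
print(len(worlds))  # 2 * (2 + 1) = 6
```

Each tuple picks its alternative independently of every other tuple, which is exactly the restriction that makes M intuitive but inexpressive.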

Review: Relational Queries
[Diagram: database D, query Q = πperson(σcolor=red), result S]

Queries on Uncertain Data
[Diagram: D represents possible instances I1, I2, …, In; Q on each instance gives J1, J2, …, Jm; a direct implementation takes D to D′, a representation of J1, J2, …, Jm]
• Closure: up-arrow always exists
• Completeness: all sets of possible instances can be represented

Model M is Not Closed
Suspects = πperson(Saw ⋈ Drives)
[Tables: Saw, Drives, and a Suspects result with three ‘?’ tuples]
CANNOT: does not correctly capture possible instances in the result

Lineage to the Rescue
Model M + Lineage = Completeness

Example with Lineage
Suspects = πperson(Saw ⋈ Drives)
[Tables: Saw, Drives, and a Suspects result with three ‘?’ tuples]

Example with Lineage
Suspects = πperson(Saw ⋈ Drives)
• λ(31) = (11,2) ∧ (21,2) ?
• λ(32,1) = (11,1) ∧ (22,1); λ(32,2) = (11,1) ∧ (22,2) ?
• λ(33) = (11,1) ∧ 23 ?
Correctly captures possible instances in the result
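A minimal sketch of why the lineage formulas above matter. The lineage map is copied from the slide; the number of alternatives per base tuple (`base_alternatives`) is an assumption, since the base tables are not in the transcript, and base tuples are assumed to carry no ‘?’. A bare tuple id such as 23 is written as alternative `(23, 1)`.

```python
from itertools import product

# ASSUMPTION: alternative counts per base tuple (tables not in the transcript).
base_alternatives = {11: 2, 21: 2, 22: 2, 23: 1}

# Lineage from the slide: result alternative -> base alternatives it needs.
lineage = {
    (31, 1): [(11, 2), (21, 2)],
    (32, 1): [(11, 1), (22, 1)],
    (32, 2): [(11, 1), (22, 2)],
    (33, 1): [(11, 1), (23, 1)],
}


def result_instances():
    """Enumerate base choices and collect the distinct result instances."""
    ids = sorted(base_alternatives)
    ranges = [range(1, base_alternatives[i] + 1) for i in ids]
    worlds = set()
    for pick in product(*ranges):
        chosen = set(zip(ids, pick))  # one alternative per base tuple
        present = frozenset(
            alt for alt, needs in lineage.items()
            if all(n in chosen for n in needs)
        )
        worlds.add(present)
    return worlds


instances = result_instances()
for inst in sorted(instances, key=sorted):
    print(sorted(inst))
```

Under these assumptions, tuple 31 never appears together with 32 or 33, because they need different alternatives of base tuple 11. Independent ‘?’ annotations alone cannot express that correlation, while lineage captures it.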

Trio’s Data Model
Uncertainty-Lineage Databases (ULDBs)
• Alternatives
• ‘?’ (Maybe) Annotations
• Confidence values (next)
• Lineage
Theorem: ULDBs are closed and complete [VLDB 06]
Formally studied properties like minimization, equivalence, approximation and membership. [VLDB 06, VLDB J. 08]

Confidence Values in Trio
• Confidence values supplied with base data
• Default probabilistic interpretation
• Problem: Compute confidence values on result data [ICDE 08]
• 5-minute DBClip: search “confidence computation” on YouTube

Problem Description
Cars = πcar(Saw ⋈ Drives)
[Tables: Saw and Drives with confidence values; result tuples with confidences still unknown (?)]

Operator-by-Operator
• Saw ⋈ Drives: join results with confidences 0.5*0.9 = 0.45, 0.4, and 0.6
• πcar: 0.45 + 0.4 - (0.45*0.4) = 0.67
Wrong!!

Operator-by-Operator
• Join results: 0.45, 0.4, 0.6
• 0.45 + 0.4 - (0.45*0.4)
Not independent!

Database Query Processing 101
[Diagram: query Q, space of execution plans, statistics and indexes]
Pick and execute best plan

Operator-by-Operator Confidence Computation
[Diagram: query Q, space of plans]
Can be much smaller or empty

Decouple Data and Confidence Computation
[Diagram: query Q, space of plans]
• Compute data
• Use lineage to compute confidences (on demand)
Theorem: Arbitrary improvement. [ICDE 08]

Our Approach
• λ(41) = 11 ∧ (21 ∨ 22): confidence 0.5 * (0.9 + 0.8 - 0.9*0.8) = 0.49
• λ(42) = 12 ∧ 23: confidence 0.6
Correct!!
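The two computations can be compared in a short sketch. It assumes base tuples 11, 21 and 22 are independent with confidences 0.5, 0.9 and 0.8, values read off the arithmetic on the slides. The brute-force enumeration in `confidence` is an illustrative stand-in, not Trio's algorithm.

```python
from itertools import product

# ASSUMPTION: independent base tuples with these confidences (from the slides).
conf = {11: 0.5, 21: 0.9, 22: 0.8}


def lineage_41(world):
    """Lineage of result tuple 41: 11 AND (21 OR 22)."""
    return world[11] and (world[21] or world[22])


def confidence(formula, conf):
    """Sum the probability of every truth assignment that satisfies formula."""
    ids = sorted(conf)
    total = 0.0
    for bits in product([True, False], repeat=len(ids)):
        world = dict(zip(ids, bits))
        if formula(world):
            p = 1.0
            for i in ids:
                p *= conf[i] if world[i] else 1.0 - conf[i]
            total += p
    return total


# Operator-by-operator: treats the two join results as independent,
# although both depend on base tuple 11.
naive = 0.45 + 0.4 - 0.45 * 0.4
exact = confidence(lineage_41, conf)
print(round(naive, 2), round(exact, 2))  # 0.67 0.49
```

The naive value overcounts because base tuple 11 contributes to both join results, so their disjunction is not a disjunction of independent events.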

Algorithm
1. Expand lineage to base data
2. Get confidence of base data
3. Evaluate the probability of λ(t)
• Detecting independence
• Memoization
• Batch computation
[Figure: lineage of tuple t in R expanded through t1, t2 to base tuples, λ(t) = f(t4, t5, t6, t7), with confidences on the base tuples; t: 0.823]
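The three steps can be sketched as follows. All tuple names, confidences and lineage formulas here are hypothetical (the figure's structure is not recoverable). Step 3 uses plain enumeration, and `lru_cache` stands in for memoization; independence detection and batch computation are not shown.

```python
from functools import lru_cache
from itertools import product

# Hypothetical base confidences and lineage of derived tuples.
base_conf = {"t1": 0.5, "t2": 0.9, "t3": 0.8}
lineage = {
    "t4": ("or", "t2", "t3"),   # derived from base tuples
    "t5": ("and", "t1", "t4"),  # derived from a base and a derived tuple
}


@lru_cache(maxsize=None)
def expand(name):
    """Step 1: rewrite a tuple's lineage so it mentions base tuples only."""
    if name in base_conf:
        return name
    op, *args = lineage[name]
    return (op, *(expand(a) for a in args))


def base_tuples(formula):
    """Collect the base tuple names mentioned in an expanded formula."""
    if isinstance(formula, str):
        return {formula}
    return set().union(*(base_tuples(a) for a in formula[1:]))


def holds(formula, world):
    """Evaluate an expanded formula under a truth assignment."""
    if isinstance(formula, str):
        return world[formula]
    op, *args = formula
    results = [holds(a, world) for a in args]
    return all(results) if op == "and" else any(results)


def confidence(name):
    """Steps 2 and 3: look up base confidences, then sum satisfying worlds."""
    formula = expand(name)
    ids = sorted(base_tuples(formula))
    total = 0.0
    for bits in product([True, False], repeat=len(ids)):
        world = dict(zip(ids, bits))
        if holds(formula, world):
            p = 1.0
            for i in ids:
                p *= base_conf[i] if world[i] else 1.0 - base_conf[i]
            total += p
    return total


print(expand("t5"))
print(round(confidence("t5"), 2))
```

Enumeration is exponential in the number of base tuples, which is why optimizations such as detecting independent subformulas matter in practice.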

Some Other Trio Work
• Modifications and Versioning [TR 08]: stored derived relations; modifications → versions
• Indexes and Statistics [MUD 08]: specialized indexes, histograms
• Functional Dependencies & Schema Design [TR 07]: definitions, sound and complete axiomatization of FDs; lossless decomposition; FD testing, finding, and inference

Related Work (sample)
• Modeling Uncertainty: Plenty, covered in textbooks
• Systems: Avatar, BayesStore, MayBMS, MYSTIQ, ORION, PrDB, ProbView, Trio, others?

Part 2: Data Integration
• Reboot! or, wake up!

Traditional Data Integration: Setup
Who authored the most SIGMOD papers in the 90’s? Mike Carey
1. Mediated Schema: Publication(title, author, conf, year)
2. Schema Mappings
3. Query Answering
Significant up-front effort
Sources D1, D2, D3, D4, D5, for example:
• Author(aid, name), Paper(pid, title, year), AuthoredBy(aid, pid)
• Bib(title, authors, conf, year)
Mapping:
    SELECT P.title AS title, A.name AS author, NULL AS conf, P.year AS year
    FROM Author AS A, Paper AS P, AuthoredBy AS B
    WHERE A.aid = B.aid AND P.pid = B.pid

“Pay-As-You-Go” Data Integration
• Automated best-effort integration from the outset
• Further improve the system over time with feedback
How advanced a starting point can we provide?

Uncertainty to the Rescue
• Automatic integration
• Make guesses
• Model probabilities
• Specifically: probabilistic schema mappings, probabilistic mediated schema
>90% accuracy in automatically integrating 50-800 data sources for several domains [SIGMOD 08]

Next
• Probabilistic mediated schemas
• Probabilistic schema mappings
• Experimental results

Mediated Schema
S1(name, email, phone-num, address)
S2(person-name, phone, mailing-addr)
Med-S (name, email, phone, addr) = {name, person-name} {email} {phone-num, phone} {address, mailing-addr}
• A mediated schema is a clustering of a subset of the set of all attributes appearing in source schemas.
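The definition above can be checked mechanically. This is a minimal sketch using the S1, S2 and Med-S attribute names from the slide; the function name `is_mediated_schema` and the choice of Python sets are illustrative assumptions.

```python
# Source schemas and the mediated schema from the slide, as attribute sets.
s1 = {"name", "email", "phone-num", "address"}
s2 = {"person-name", "phone", "mailing-addr"}

med_s = [
    {"name", "person-name"},
    {"email"},
    {"phone-num", "phone"},
    {"address", "mailing-addr"},
]


def is_mediated_schema(clusters, sources):
    """True if clusters are non-empty, pairwise disjoint, and use only
    attributes that appear in some source schema."""
    all_attrs = set().union(*sources)
    seen = set()
    for cluster in clusters:
        if not cluster or not cluster <= all_attrs or cluster & seen:
            return False
        seen |= cluster
    return True


print(is_mediated_schema(med_s, [s1, s2]))  # True
```

Because the clustering only needs to cover a subset of the source attributes, dropping a cluster such as `{"email"}` still leaves a valid mediated schema.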

Example
S1(name, hPhone, oPhone, hAddr, oAddr)
S2(name, phone, address)
Med1 ({name}, {phone, hPhone, oPhone}, {address, hAddr, oAddr})
Q: SELECT name, hPhone, oPhone FROM Med ?

Example
S1(name, hPhone, oPhone, hAddr, oAddr)
S2(name, phone, address)
Med1 ({name}, {phone, hPhone, oPhone}, {address, hAddr, oAddr})
Med2 ({name}, {phone, hPhone}, {oPhone}, {address, oAddr}, {hAddr})
Q: SELECT name, phone, address FROM Med

Example
S1(name, hPhone, oPhone, hAddr, oAddr)
S2(name, phone, address)
Med1 ({name}, {phone, hPhone, oPhone}, {address, hAddr, oAddr})
Med2 ({name}, {phone, hPhone}, {oPhone}, {address, oAddr}, {hAddr})
Med3 ({name}, {phone, hPhone}, {oPhone}, {address, hAddr}, {oAddr})
Q: SELECT name, phone, address FROM Med

Example
S1(name, hPhone, oPhone, hAddr, oAddr)
S2(name, phone, address)
Med1 ({name}, {phone, hPhone, oPhone}, {address, hAddr, oAddr})
Med2 ({name}, {phone, hPhone}, {oPhone}, {address, oAddr}, {hAddr})
Med3 ({name}, {phone, hPhone}, {oPhone}, {address, hAddr}, {oAddr})
Med4 ({name}, {phone, oPhone}, {hPhone}, {address, oAddr}, {hAddr})
Q: SELECT name, phone, address FROM Med

Example
S1(name, hPhone, oPhone, hAddr, oAddr)
S2(name, phone, address)
Med1 ({name}, {phone, hPhone, oPhone}, {address, hAddr, oAddr})
Med2 ({name}, {phone, hPhone}, {oPhone}, {address, oAddr}, {hAddr})
Med3 ({name}, {phone, hPhone}, {oPhone}, {address, hAddr}, {oAddr})
Med4 ({name}, {phone, oPhone}, {hPhone}, {address, oAddr}, {hAddr})
Med5 ({name}, {phone}, {hPhone}, {oPhone}, {address}, {hAddr}, {oAddr})
Q: SELECT name, phone, address FROM Med

Probabilistic Mediated Schema
S1(name, hPhone, oPhone, hAddr, oAddr)
S2(name, phone, address)
Med3 ({name}, {phone, hPhone}, {oPhone}, {address, hAddr}, {oAddr}) Pr=0.5
Med4 ({name}, {phone, oPhone}, {hPhone}, {address, oAddr}, {hAddr}) Pr=0.5
• A Probabilistic Mediated Schema (p-med-schema) is a set M = {(M1, Pr(M1)), …, (Mk, Pr(Mk))} where
• Mi is a med-schema; i ≠ j ⇒ Mi ≠ Mj
• Pr(Mi) ∈ (0,1]; Σ Pr(Mi) = 1
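A minimal sketch of the three conditions in the definition, using Med3 and Med4 from the slide. The helper names `schema` and `is_p_med_schema` are illustrative, and modelling each med-schema as a frozenset of frozensets is an assumption made so that schemas can be compared for equality.

```python
import math


def schema(*clusters):
    """Build a hashable med-schema: a set of attribute clusters."""
    return frozenset(frozenset(c) for c in clusters)


med3 = schema({"name"}, {"phone", "hPhone"}, {"oPhone"},
              {"address", "hAddr"}, {"oAddr"})
med4 = schema({"name"}, {"phone", "oPhone"}, {"hPhone"},
              {"address", "oAddr"}, {"hAddr"})

p_med_schema = [(med3, 0.5), (med4, 0.5)]


def is_p_med_schema(pairs):
    """Check: schemas pairwise distinct, each Pr in (0, 1], Pr sums to 1."""
    schemas = [m for m, _ in pairs]
    probs = [p for _, p in pairs]
    distinct = len(set(schemas)) == len(schemas)
    in_range = all(0.0 < p <= 1.0 for p in probs)
    sums_to_one = math.isclose(sum(probs), 1.0)
    return distinct and in_range and sums_to_one


print(is_p_med_schema(p_med_schema))  # True
```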

P-Mappings
PM1: four mappings from S1(name, hP, oP, hA, oA) to Med3 (name, hPP, oP, hAA, oA), with Pr = .64, .16, .16, .04
PM2: four mappings from S1(name, hP, oP, hA, oA) to Med4 (name, oPP, hP, oAA, hA), with Pr = .64, .16, .16, .04
[Figure: attribute correspondences for each mapping]

Expressive Power of P-Med-Schema & P-Mapping
• Theorem 1. For one-to-many mappings: (p-med-schema + p-mappings) = (mediated schema + p-mappings) > (p-med-schema + mappings)
• Theorem 2. When restricted to one-to-one mappings: (p-med-schema + p-mappings) = (p-med-schema + mappings) > (mediated schema + p-mappings)
