Managing Uncertain Data • Anish Das Sarma, Stanford University
What is Uncertain Data?
Why Does It Arise? • Precision of devices • Lack of information • Uncertainty about the future • Anonymization
Applications: Information Extraction
Applications: Information Integration • name, hPhone, oPhone, hAddr, oAddr • name, phone, address • Combined View
Applications: Deduplication • ? 80% match
Applications: Scientific & Medical Experiments • Probably not cancer
How Do Database Management Systems (DBMS) Handle Uncertainty? They don’t.
What Do (Most) Applications Do? • Clean: turn into data that DBMSs can handle • Loss of information • Errors compound insidiously
Outline of the Talk • Part 1: Managing Uncertainty in a DBMS (theory, systems) • Part 2: Handling Uncertainty in Data Integration (systems, theory) • Other Research (trailer) • Future Plans
Part 1: Managing Uncertain Data • Primarily in the context of the Trio project • Data • Uncertainty • Lineage • Today’s focus: how lineage helps
Uncertain Data • An uncertain database represents a set of possible instances (or, possible worlds) • Our work: finite sets of possible instances
Representing Uncertain Data • 20+ years of work (mostly theoretical) • There appears to be a fundamental trade-off between expressiveness & intuitiveness • We spent some time exploring the space of models for uncertainty
Hierarchy of Models [ICDE 06] • Top of the hierarchy: + Expressive, - Complex • Bottom of the hierarchy: + Intuitive, - Inexpressive • Next • Consider a model M • Isolate inexpressiveness • Solve problem with lineage
Running Example: Crime-Solver • Saw (witness, color, car) // may be uncertain • Drives (person, color, car) // may be uncertain • Suspects (person) = πperson(Saw ⋈ Drives)
Simple Model M • 1. Alternatives: uncertainty about value • 2. ‘?’ (Maybe) Annotations • Three possible instances
Simple Model M • 1. Alternatives • 2. ‘?’ (Maybe): uncertainty about presence • Six possible instances
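The instance counts on the two Simple Model M slides can be reproduced with a small sketch. The tuple values below are hypothetical (the slide figures are not part of this transcript); the only assumptions are that a tuple with k alternatives contributes k choices and that a ‘?’ annotation adds one more choice, absence.

```python
from itertools import product

def possible_instances(xtuples):
    """Enumerate the possible instances of a relation in model M.

    xtuples is a list of (alternatives, maybe) pairs. Exactly one
    alternative is chosen per tuple, unless the tuple is marked
    maybe ('?'), in which case it may also be absent.
    """
    choices = []
    for alternatives, maybe in xtuples:
        options = list(alternatives)
        if maybe:
            options.append(None)  # None stands for "tuple absent"
        choices.append(options)
    return {
        frozenset(t for t in pick if t is not None)
        for pick in product(*choices)
    }

# Hypothetical Saw(witness, color, car) data, not taken from the slides.
saw_alternatives_only = [
    ([("Amy", "red", "Honda"), ("Amy", "red", "Toyota"), ("Amy", "red", "Mazda")], False),
]
saw_with_maybe = saw_alternatives_only + [
    ([("Betty", "blue", "Acura")], True),
]

print(len(possible_instances(saw_alternatives_only)))
print(len(possible_instances(saw_with_maybe)))
```

One tuple with three alternatives yields three instances; adding a second tuple marked ‘?’ doubles that to six.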
Review: Relational Queries • D, S • Q: πperson(σcolor=red)
Queries on Uncertain Data • Diagram: representation D has possible instances I1, I2, …, In; Q on each instance gives J1, J2, …, Jm; D′ is the representation of those instances; the direct implementation goes from D to D′ • Closure: up-arrow (from J1, J2, …, Jm to D′) always exists • Completeness: All sets of possible instances can be represented
Model M is Not Closed • Suspects = πperson(Saw ⋈ Drives) • CANNOT • Does not correctly capture possible instances in the result • ? ? ?
Lineage to the Rescue • Model M + Lineage = Completeness
Example with Lineage • Suspects = πperson(Saw ⋈ Drives) • ? ? ?
Example with Lineage • Suspects = πperson(Saw ⋈ Drives) • Correctly captures possible instances in the result • λ(31) = (11,2) ∧ (21,2) ? • λ(32,1) = (11,1) ∧ (22,1); λ(32,2) = (11,1) ∧ (22,2) ? • λ(33) = (11,1) ∧ 23 ?
Trio’s Data Model: Uncertainty-Lineage Databases (ULDBs) • Alternatives • ‘?’ (Maybe) Annotations • Confidence values (next) • Lineage • Theorem: ULDBs are closed and complete [VLDB 06] • Formally studied properties like minimization, equivalence, approximation and membership [VLDB 06, VLDB J. 08]
Confidence Values in Trio • Confidence values supplied with base data • Default probabilistic interpretation • Problem: Compute confidence values on result data [ICDE 08] • 5-minute DBClip • Search “confidence computation” on YouTube
Problem Description • Cars = πcar(Saw ⋈ Drives) • : ? : ?
Operator-by-Operator • Saw ⋈ Drives: join results with confidences 0.45 (= 0.5*0.9), 0.4, and 0.6 • πcar: 0.45 + 0.4 - (0.45*0.4) = 0.67 Wrong!!
Operator-by-Operator • Not independent! The join results with confidences 0.45 and 0.4 cannot be combined as 0.45 + 0.4 - (0.45*0.4) • The third result keeps 0.6
Database Query Processing 101 • Query Q • Execution Plans • Statistics, indexes • Pick and execute best plan
Operator-by-Operator • Query Q • Confidence Computation Plans • Can be much smaller or empty
Decouple Data and Confidence Computation • Query Q • Plans • Compute data • Use lineage to compute confidences (on demand) • Theorem: Arbitrary improvement. [ICDE 08]
Our Approach • λ(41) = 11 ∧ (21 ∨ 22): confidence 0.5 * (0.9 + 0.8 - 0.9*0.8) = 0.49 Correct!! • λ(42) = 12 ∧ 23: confidence 0.6
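The two computations can be put side by side as a short arithmetic sketch. The numbers 0.5, 0.9, and 0.8 are the ones on the slides; the variable names are illustrative, and base tuples are assumed to be independent, as the slide's own formula implies.

```python
def p_or(a, b):
    """Probability that at least one of two independent events holds."""
    return a + b - a * b

saw_11 = 0.5                    # confidence of Saw tuple 11
drives_21, drives_22 = 0.9, 0.8  # confidences of Drives tuples 21 and 22

# Operator-by-operator: compute join confidences, then combine them in
# the projection as if the two join results were independent.
join_1 = saw_11 * drives_21
join_2 = saw_11 * drives_22
naive = p_or(join_1, join_2)

# Lineage-based: evaluate 11 AND (21 OR 22) directly on base tuples.
correct = saw_11 * p_or(drives_21, drives_22)

print(round(naive, 2), round(correct, 2))
```

The naive value comes out near 0.67 and the lineage-based value near 0.49. The gap arises because both join results depend on the same Saw tuple 11, so they are not independent.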
Algorithm • 1. Expand lineage to base data • 2. Get confidence of base data • 3. Evaluate the probability of λ(t) • Detecting independence • Memoization • Batch computation • Figure: tuple t in R, with intermediate tuples t1, t2 and base tuples t4, t5, t6, t7 (confidence labels 0.9, 0.4, 1.0, 0.7, 0.4); λ(t) = f(t4,t5,t6,t7); confidence of t: 0.823
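The three steps can be sketched as follows. This is a minimal illustration, not Trio's implementation: the formula encoding (nested tuples with "and"/"or"), the intermediate tuple ids "31" and "32", and the brute-force enumeration in step 3 are all assumptions made here, and base tuples are assumed independent. The lineage reuses the λ(41) = 11 ∧ (21 ∨ 22) example in an unexpanded form.

```python
from itertools import product

def expand(node, lineage):
    """Step 1: rewrite a lineage formula until only base tuples remain."""
    if isinstance(node, str):
        # A derived tuple id is replaced by its own (expanded) lineage.
        return expand(lineage[node], lineage) if node in lineage else node
    op, *args = node
    return (op, *(expand(a, lineage) for a in args))

def base_tuples(formula):
    """Collect the base tuple ids mentioned in an expanded formula."""
    if isinstance(formula, str):
        return {formula}
    return set().union(*(base_tuples(a) for a in formula[1:]))

def holds(formula, world):
    """Evaluate a formula in one truth assignment of the base tuples."""
    if isinstance(formula, str):
        return world[formula]
    op, *args = formula
    results = [holds(a, world) for a in args]
    return all(results) if op == "and" else any(results)

def confidence(tuple_id, lineage, conf):
    """Steps 2 and 3: look up base confidences, then sum the weights
    of all assignments in which the expanded lineage is true."""
    formula = expand(tuple_id, lineage)
    names = sorted(base_tuples(formula))
    total = 0.0
    for bits in product([True, False], repeat=len(names)):
        world = dict(zip(names, bits))
        if holds(formula, world):
            weight = 1.0
            for n in names:
                weight *= conf[n] if world[n] else 1.0 - conf[n]
            total += weight
    return total

# Hypothetical derivation: 41 comes from join results 31 and 32,
# which both depend on base tuple 11.
lineage = {"31": ("and", "11", "21"), "32": ("and", "11", "22"),
           "41": ("or", "31", "32")}
conf = {"11": 0.5, "21": 0.9, "22": 0.8}
print(round(confidence("41", lineage, conf), 2))
```

Enumeration is exponential in the number of base tuples, which is presumably why the slide lists independence detection, memoization, and batch computation alongside the three steps; none of those are implemented in this sketch.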
Some Other Trio Work • Modifications and Versioning [TR 08] • Stored derived relations • Modifications, versions • Indexes and Statistics [MUD 08] • Specialized indexes, histograms • Functional Dependencies & Schema Design [TR 07] • Definitions, sound and complete axiomatization of FDs • Lossless decomposition • FD testing, finding, and inference
Related Work (sample) • Modeling Uncertainty: Plenty, covered in textbooks • Systems: Avatar, BayesStore, MayBMS, MYSTIQ, ORION, PrDB, ProbView, Trio, others?
Part 2: Data Integration • Reboot! or, wake up!
Traditional Data Integration: Setup • Query: Who authored the most SIGMOD papers in the 90’s? Mike Carey • Sources D1, D2, D3, D4, D5 under a Mediated Schema • Source schemas: Author(aid, name), Paper(pid, title, year), AuthoredBy(aid, pid); Bib(title, authors, conf, year) • 1. Mediated Schema: Publication(title, author, conf, year) • 2. Schema Mappings • Mapping: SELECT P.title AS title, A.name AS author, NULL AS conf, P.year AS year FROM Author AS A, Paper AS P, AuthoredBy AS B WHERE A.aid=B.aid AND P.pid=B.pid • 3. Query Answering • Significant up-front effort
“Pay-As-You-Go” Data Integration • Automated best-effort integration from the outset • Further improve the system over time with feedback • How advanced a starting point can we provide?
Uncertainty to the Rescue • Automatic integration • Make guesses • Model probabilities • Specifically • Probabilistic schema mappings • Probabilistic mediated schema • >90% accuracy in automatically integrating 50-800 data sources for several domains [SIGMOD 08]
Next • Probabilistic mediated schemas • Probabilistic schema mappings • Experimental results
Mediated Schema • S1(name, email, phone-num, address) • S2(person-name, phone, mailing-addr) • Med-S (name, email, phone, addr), with clusters {name, person-name}, {email}, {phone-num, phone}, {address, mailing-addr} • A mediated schema is a clustering of a subset of the set of all attributes appearing in source schemas.
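The definition can be checked mechanically. The sketch below uses the attribute names from the slide; the function name is hypothetical, and it reads "clustering" as a collection of non-empty, pairwise disjoint clusters, which is an assumption rather than something the slide spells out.

```python
def is_mediated_schema(clusters, source_schemas):
    """Check that clusters are non-empty, pairwise disjoint, and drawn
    only from attributes that appear in some source schema."""
    all_attrs = set().union(*source_schemas)
    seen = set()
    for cluster in clusters:
        if not cluster or not cluster <= all_attrs:
            return False  # empty cluster or unknown attribute
        if seen & cluster:
            return False  # attribute already used by another cluster
        seen |= cluster
    return True

s1 = {"name", "email", "phone-num", "address"}
s2 = {"person-name", "phone", "mailing-addr"}
med_s = [
    {"name", "person-name"},
    {"email"},
    {"phone-num", "phone"},
    {"address", "mailing-addr"},
]

print(is_mediated_schema(med_s, [s1, s2]))
```

Because the definition says "a subset of" the attributes, a clustering that leaves some source attributes out still passes this check.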
Example Med1 ({name}, {phone, hPhone, oPhone}, {address, hAddr, oAddr}) ? S1(name, hPhone, oPhone, hAddr, oAddr) S2(name,phone,address) Q: SELECT name, hPhone, oPhone FROM Med
Example Med1 ({name}, {phone, hPhone, oPhone}, {address, hAddr, oAddr}) Med2 ({name}, {phone, hPhone}, {oPhone}, {address, oAddr}, {hAddr}) S1(name, hPhone, oPhone, hAddr, oAddr) S2(name,phone,address) Q: SELECT name, phone, address FROM Med
Example Med1 ({name}, {phone, hPhone, oPhone}, {address, hAddr, oAddr}) Med2 ({name}, {phone, hPhone}, {oPhone}, {address, oAddr}, {hAddr}) Med3 ({name}, {phone, hPhone}, {oPhone}, {address, hAddr}, {oAddr}) S1(name, hPhone, oPhone, hAddr, oAddr) S2(name,phone,address) Q: SELECT name, phone, address FROM Med
Example Med1 ({name}, {phone, hPhone, oPhone}, {address, hAddr, oAddr}) Med2 ({name}, {phone, hPhone}, {oPhone}, {address, oAddr}, {hAddr}) Med3 ({name}, {phone, hPhone}, {oPhone}, {address, hAddr}, {oAddr}) Med4 ({name}, {phone, oPhone}, {hPhone}, {address, oAddr}, {hAddr}) S1(name, hPhone, oPhone, hAddr, oAddr) S2(name,phone,address) Q: SELECT name, phone, address FROM Med
Example Med1 ({name}, {phone, hPhone, oPhone}, {address, hAddr, oAddr}) Med2 ({name}, {phone, hPhone}, {oPhone}, {address, oAddr}, {hAddr}) Med3 ({name}, {phone, hPhone}, {oPhone}, {address, hAddr}, {oAddr}) Med4 ({name}, {phone, oPhone}, {hPhone}, {address, oAddr}, {hAddr}) Med5 ({name}, {phone}, {hPhone}, {oPhone}, {address}, {hAddr}, {oAddr}) S1(name, hPhone, oPhone, hAddr, oAddr) S2(name,phone,address) Q: SELECT name, phone, address FROM Med
Probabilistic Mediated Schema • Med3 ({name}, {phone, hPhone}, {oPhone}, {address, hAddr}, {oAddr}) Pr=0.5 • Med4 ({name}, {phone, oPhone}, {hPhone}, {address, oAddr}, {hAddr}) Pr=0.5 • S1(name, hPhone, oPhone, hAddr, oAddr) • S2(name, phone, address) • A Probabilistic Mediated Schema (p-med-schema) is a set M = {(M1, Pr(M1)), …, (Mk, Pr(Mk))} where • Mi is a med-schema; i ≠ j => Mi ≠ Mj • Pr(Mi) ∈ (0,1]; ΣPr(Mi) = 1
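The three conditions in the definition translate directly into a validity check. This is a minimal sketch: the helper names are hypothetical, each med-schema is encoded as a frozenset of attribute clusters, and the probabilities are the Pr=0.5 values from the slide.

```python
import math

def schema(*clusters):
    """Encode a med-schema as a hashable set of attribute clusters."""
    return frozenset(frozenset(c) for c in clusters)

def is_p_med_schema(pairs):
    """Check the definition: distinct med-schemas, each probability
    in (0, 1], and probabilities summing to 1."""
    schemas = [m for m, _ in pairs]
    probs = [p for _, p in pairs]
    return (
        len(set(schemas)) == len(schemas)
        and all(0.0 < p <= 1.0 for p in probs)
        and math.isclose(sum(probs), 1.0)
    )

med3 = schema({"name"}, {"phone", "hPhone"}, {"oPhone"},
              {"address", "hAddr"}, {"oAddr"})
med4 = schema({"name"}, {"phone", "oPhone"}, {"hPhone"},
              {"address", "oAddr"}, {"hAddr"})
p_med = [(med3, 0.5), (med4, 0.5)]

print(is_p_med_schema(p_med))
```

`math.isclose` is used for the sum so that floating-point rounding in the probabilities does not cause a spurious failure.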
P-Mappings • PM1: four possible mappings between S1(name, hP, oP, hA, oA) and Med3 (name, hPP, oP, hAA, oA), with Pr=.64, Pr=.16, Pr=.16, Pr=.04 • PM2: four possible mappings between S1(name, hP, oP, hA, oA) and Med4 (name, oPP, hP, oAA, hA), with Pr=.64, Pr=.16, Pr=.16, Pr=.04
Expressive Power of P-Med-Schema & P-Mapping • Theorem 1. For one-to-many mappings: (p-med-schema + p-mappings) = (mediated schema + p-mappings) > (p-med-schema + mappings) • Theorem 2. When restricted to one-to-one mappings: (p-med-schema + p-mappings) = (p-med-schema + mappings) > (mediated schema + p-mappings)