Trio: A System for Data, Uncertainty, and Lineage Stanford University
The “Trio” in Trio
• Data: Student #123 is majoring in Econ: (123, Econ) ∈ Major
• Uncertainty: Student #123 is majoring in Econ or CS: (123, Econ ∥ CS) ∈ Major. With confidence 60%, student #456 is a CS major: (456, CS : 0.6) ∈ Major
• Lineage: 456 ∈ HardWorker derived from: (456, CS) ∈ Major and “CS is hard” ∈ some web page
Ingredients for Uncertainty • 1. Alternatives • 2. ‘?’ (Maybe) Annotations • 3. Confidences
Our Model for Uncertainty • 1. Alternatives: uncertainty about value • 2. ‘?’ (Maybe) Annotations • 3. Confidences (Figure: three possible instances)
Our Model for Uncertainty • 1. Alternatives • 2. ‘?’ (Maybe): uncertainty about presence • 3. Confidences (Figure: six possible instances)
Our Model for Uncertainty • 1. Alternatives • 2. ‘?’ (Maybe): uncertainty about presence • 3. Confidences (Figure labels: absent, unknown, ‘?’)
Our Model for Uncertainty • 1. Alternatives • 2. ‘?’ (Maybe) Annotations • 3. Confidences: weighted uncertainty (Figure: six possible instances, each with a probability)
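The three ingredients can be sketched in a few lines of Python. This is a minimal illustration, not Trio's implementation: the relation, its values, and its confidences are made up, tuples are assumed independent, and any confidence mass missing from a tuple's alternatives is assumed to be the probability that the tuple is absent (‘?’).

```python
from itertools import product

# Hypothetical relation: each tuple is a list of (value, confidence) alternatives.
# Confidences summing to less than 1 leave room for the tuple being absent ('?').
MAJOR = [
    [((123, "Econ"), 0.5), ((123, "CS"), 0.3), ((123, "Math"), 0.2)],  # 3 alternatives, no '?'
    [((456, "CS"), 0.6)],                                               # 1 alternative, '?'
]


def possible_instances(relation):
    """Return (instance, probability) pairs, assuming independent tuples."""
    per_tuple = []
    for alternatives in relation:
        options = list(alternatives)
        absent = 1.0 - sum(conf for _, conf in alternatives)
        if absent > 1e-12:  # tolerate floating-point rounding
            options.append((None, absent))
        per_tuple.append(options)

    instances = []
    for combo in product(*per_tuple):
        rows = frozenset(value for value, _ in combo if value is not None)
        prob = 1.0
        for _, conf in combo:
            prob *= conf
        instances.append((rows, prob))
    return instances


for rows, prob in possible_instances(MAJOR):
    print(sorted(rows), round(prob, 2))
```

With only the first tuple the sketch yields three instances; adding the ‘?’ tuple doubles that to six, and the six probabilities sum to 1.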
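Our Model is Not Closed
• Suspects = π_person(Saw ⋈ Drives)
• Alternatives, ‘?’, and confidences alone CANNOT represent this result (figure: three result tuples, each marked ‘?’)
• That representation does not correctly capture the possible instances in the result (there are not 12)!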
Lineage to the Rescue • Lineage (provenance): “where data came from” • Internal lineage – from another Trio relation • External lineage – from an external source • In Trio: A Boolean function λ from data elements to other data elements (or external sources)
Example with Lineage
• Suspects = π_person(Saw ⋈ Drives) (figure: three result tuples, each marked ‘?’)
• λ(31) = (11,2) ∧ (21,2)
• λ(32,1) = (11,1) ∧ (22,1); λ(32,2) = (11,1) ∧ (22,2)
• λ(33) = (11,1) ∧ 23
Example with Lineage
• Suspects = π_person(Saw ⋈ Drives) (figure: three result tuples, each marked ‘?’)
• λ(31) = (11,2) ∧ (21,2)
• λ(32,1) = (11,1) ∧ (22,1); λ(32,2) = (11,1) ∧ (22,2)
• λ(33) = (11,1) ∧ 23
• Correctly captures possible instances in the result (7)
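The effect of lineage on the instance count can be checked with a short Python sketch. The lineage formulas are taken from the slide; the base-tuple annotations (which tuples have two alternatives, which carry ‘?’) are assumptions chosen for illustration, as is the "without lineage" reading in which each result tuple is treated as an independent maybe-tuple.

```python
from itertools import product

# Assumed base tuples: id -> (alternative indices, has '?').
# These annotations are hypothetical, not taken from the slides.
BASE = {
    11: ([1, 2], False),  # Saw
    21: ([1, 2], False),  # Drives
    22: ([1, 2], True),   # Drives
    23: ([1], True),      # Drives
}

# Lineage from the slide: result alternative -> conjunction of base alternatives.
# The slide writes 33 and 23 without an index; index 1 is used here.
LINEAGE = {
    (31, 1): [(11, 2), (21, 2)],
    (32, 1): [(11, 1), (22, 1)],
    (32, 2): [(11, 1), (22, 2)],
    (33, 1): [(11, 1), (23, 1)],
}


def base_choices():
    """Yield every way of picking one alternative (or absence) per base tuple."""
    ids = sorted(BASE)
    options = [alts + [None] if maybe else list(alts)
               for alts, maybe in (BASE[i] for i in ids)]
    for combo in product(*options):
        yield dict(zip(ids, combo))


def instances_with_lineage():
    """Result instances obtained by evaluating lineage under each base choice."""
    return {
        frozenset(r for r, conj in LINEAGE.items()
                  if all(choice[t] == a for t, a in conj))
        for choice in base_choices()
    }


def instances_without_lineage():
    """Result instances if the three result tuples were independent maybe-tuples."""
    options = [[(31, 1), None], [(32, 1), (32, 2), None], [(33, 1), None]]
    return {frozenset(x for x in combo if x is not None)
            for combo in product(*options)}


print(len(instances_with_lineage()), len(instances_without_lineage()))
```

Under these assumed annotations the enumeration gives 7 result instances with lineage and 12 without, because lineage rules out combinations such as tuple 31 appearing together with tuple 33 (they need different alternatives of tuple 11).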
Trio Data Model: Uncertainty-Lineage Databases (ULDBs) • Alternatives • ‘?’ (Maybe) Annotations • Confidences • Lineage • ULDBs are closed and complete [Databases with Uncertainty and Lineage, VLDB J. Special Issue (2/08)]
ULDBs: Lineage
• Conjunctive lineage sufficient for most operations (σ, π, ⋈)
• Disjunctive lineage for duplicate-elimination (δ)
• Negative lineage for set difference (−)
• General case after several queries: Boolean formula
• Beyond just lineage:
  • Can capture arbitrary Boolean constraints among tuples
  • Can represent any finite (sub-)set of possible instances
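The three kinds of lineage listed above can be illustrated as small Boolean formulas. The encoding below (nested tuples with "and", "or", "not", and string identifiers for base alternatives) is a hypothetical representation chosen for this sketch, not Trio's internal format.

```python
def holds(formula, present):
    """Evaluate a lineage formula against the set of base alternatives that are present."""
    if isinstance(formula, str):
        return formula in present
    op, *args = formula
    if op == "and":
        return all(holds(a, present) for a in args)
    if op == "or":
        return any(holds(a, present) for a in args)
    if op == "not":
        return not holds(args[0], present)
    raise ValueError(f"unknown operator: {op}")


# Join: the result exists only if both inputs exist (conjunctive lineage).
join_lineage = ("and", "s1", "d1")

# Duplicate elimination: the result exists if any derivation exists (disjunctive lineage).
distinct_lineage = ("or", ("and", "s1", "d1"), ("and", "s2", "d2"))

# Set difference: the result exists if r1 is present and t1 is not (negative lineage).
difference_lineage = ("and", "r1", ("not", "t1"))

print(holds(join_lineage, {"s1", "d1"}))
print(holds(distinct_lineage, {"s2", "d2"}))
print(holds(difference_lineage, {"r1", "t1"}))
```

Composing several queries nests these formulas, which is how the general case becomes an arbitrary Boolean formula.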
Formal Semantics
• Relational (SQL) query Q on ULDB D
• Operational semantics: D → (implementation of Q) → D + Result
• Possible instances of D: D1, D2, …, Dn
• Q on each instance: Q(D1), Q(D2), …, Q(Dn)
• D + Result is a representation of the instances Q(D1), Q(D2), …, Q(Dn)
Applications – SERF
• Stanford Entity Resolution Framework
<id>r1 <name>J. Doe <phone>351-2535
<id>r2 <name>John Doe <phone>357-2635 <email>jdoe@yahoo.com
<id>r3 <name>John D. <phone>351-2535 <email>jdoe@yahoo.com
Iteration 1: <id>r4 <name>{John Doe||John D.} <phone>{357-2635||351-2535} <email>jdoe@yahoo.com
Iteration 2: <id>r5 <name>{John Doe||John D.||J. Doe} <phone>{357-2635||351-2535} <email>jdoe@yahoo.com
Swoosh: A Generic Approach to Entity Resolution. Benjelloun, Garcia-Molina, Whang et al.
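The merging in the slide above can be approximated with a naive pairwise loop. This is only a sketch and not the Swoosh algorithm: the match rule (records match when they share a phone or an email) and the merge rule (keep every distinct value as an alternative) are assumptions made for illustration, and the order of merges may differ from the two iterations shown.

```python
def matches(a, b):
    """Hypothetical match rule: two records match if they share a phone or an email."""
    return any(set(a.get(k, [])) & set(b.get(k, [])) for k in ("phone", "email"))


def merge(a, b):
    """Merge two records by keeping every distinct value as an alternative."""
    merged = {}
    for key in list(a) + [k for k in b if k not in a]:
        values = []
        for v in a.get(key, []) + b.get(key, []):
            if v not in values:
                values.append(v)
        merged[key] = values
    return merged


def resolve(records):
    """Repeatedly merge the first matching pair until no two records match."""
    records = list(records)
    changed = True
    while changed:
        changed = False
        for i in range(len(records)):
            for j in range(i + 1, len(records)):
                if matches(records[i], records[j]):
                    merged = merge(records[i], records[j])
                    records = [r for k, r in enumerate(records) if k not in (i, j)]
                    records.append(merged)
                    changed = True
                    break
            if changed:
                break
    return records


r1 = {"name": ["J. Doe"], "phone": ["351-2535"]}
r2 = {"name": ["John Doe"], "phone": ["357-2635"], "email": ["jdoe@yahoo.com"]}
r3 = {"name": ["John D."], "phone": ["351-2535"], "email": ["jdoe@yahoo.com"]}

resolved = resolve([r1, r2, r3])
print(resolved)
```

The three input records collapse into one record whose name and phone fields hold the alternatives, which is the kind of uncertain data a ULDB is meant to store.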
Applications – SERF
• Stanford Entity Resolution Framework
<id>r1 <name>J. Doe <phone>351-2535
<id>r2 <name>John Doe <phone>357-2635 <email>jdoe@yahoo.com
<id>r3 <name>John D. <phone>351-2535 <email>jdoe@yahoo.com
Iteration 1: <id>r4 <name>{John Doe : 0.4 || John D. : 0.3} <phone>{357-2635 : 0.5 || 351-2535 : 0.5} <email>jdoe@yahoo.com
Iteration 2: <id>r5 <name>{John Doe : 0.4 || John D. : 0.3 || J. Doe : 0.3} <phone>{357-2635 : 0.3 || 351-2535 : 0.6} <email>jdoe@yahoo.com
Swoosh: A Generic Approach to Entity Resolution. Benjelloun, Garcia-Molina, Whang et al.
Applications – PharmGKB • Pharmaceutical / Medical Databases • PharmGKB.org • Relationships between Drugs, Diseases, Genes • Evidence from Literature References, Clinical Outcomes, etc.
Applications – PharmGKB.org
• Relational Schema
  • Drugs(DrugID, Name, AltName)
  • Diseases(DiseaseID, Name, AltName)
  • Genes(GeneID, Name, Symbol, AltSymbol)
  • Relationships(RelID, RelatedTo, Evidence)
• Independent Base Data
  • P[Effective_{A,B,C}] = P[A] P[B] P[C]
(Figure nodes: Drug_A, Disease_B, Gene_C, Effective_{A,B}, Effective_{A,B,C})
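The product rule for independent base data can be worked through numerically. The confidence values below are hypothetical, and the sketch assumes the derived tuple's lineage is a plain conjunction of independent base tuples, as the slide states.

```python
from math import prod

# Hypothetical confidences for the three base facts (not from the slides).
base_confidence = {"Drug_A": 0.9, "Disease_B": 0.8, "Gene_C": 0.5}


def conjunctive_confidence(lineage, confidence):
    """Confidence of a derived tuple whose lineage is a conjunction of independent base tuples."""
    return prod(confidence[t] for t in lineage)


effective_abc = conjunctive_confidence(["Drug_A", "Disease_B", "Gene_C"], base_confidence)
print(round(effective_abc, 2))
```

The product is only valid while the base tuples are independent; once they are correlated, the joint probability has to come from somewhere other than the marginals.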
Applications – PharmGKB.org
• E.g.: A = Warfarin, B = High Blood Pressure, C = Heart Disease
• Relational Schema
  • Drugs(DrugID, Name, AltName)
  • Diseases(DiseaseID, Name, AltName)
  • Genes(GeneID, Name, Symbol, AltSymbol)
  • Relationships(RelID, RelatedTo, Evidence)
• Non-Independent Base Data
  • P[Effective_{A,B,C}] = P[E | A,B,C] ≠ P[A] P[B] P[C]
• ULDBs vs. Bayesian Nets
  • Factorized representation w/ conditional probability tables (CPTs)
(Figure nodes: Drug_A, Disease_B, Gene_C, Effective_{A,B}, Effective_{A,B,C})
UNCERTAINTY LINEAGE DATA
Search “stanford trio”
Trio contributors, past and present: Parag Agrawal, Omar Benjelloun, Julien Chaumond, Ashok Chandra, Anish Das Sarma, Alon Halevy, Chris Hayworth, Ander de Keijzer, Raghotham Murthy, Michi Mutsuzaki, Shubha Nabar, Tomoe Sugihara, Martin Theobald, Jeff Ullman, Jennifer Widom