630 likes | 815 Views
Trio: A System for Data, Uncertainty, and Lineage. Martin Theobald Jennifer Widom Stanford University. Uncertainty in Databases. Not a new idea — proposed 20+ years ago “Classic” probabilistic database systems Uncertainty about attribute values Modeling of missing values/incompleteness
E N D
Trio: A System for Data, Uncertainty, and Lineage Martin Theobald Jennifer Widom Stanford University
Uncertainty in Databases • Not a new idea — proposed 20+ years ago • “Classic” probabilistic database systems • Uncertainty about attribute values • Modeling of missing values/incompleteness [Cavallo ‘87, Abiteboul ‘91, Garcia-Molina ‘92, Fuhr/Rölleke ‘97] • Most initial (18+ years) work largely theoretical; not much systems-building until recently: URank – U Waterloo, Orion – Purdue U, MayBMS - Saarland U • But applications weren’t ready anyway Are they now?
The “Trio” in Trio • Data Student #123 is majoring in Econ:(123,Econ) Major • Uncertainty Student #123 is majoring in Econ or CS: (123, Econ ∥ CS) Major With confidence 60% student #456 is a CS major: (456, CS0.6) Major • Lineage 456HardWorkerderived from: (456, CS) Major and “CS is hard” some web page
Original Motivation for Trio New Application Domains • Many involve data that is uncertain • (probabilistic, imprecise, fuzzy, incomplete...) • Many of the same ones need to track the lineage (provenance) of their data Coincidence or Fate?
Original Motivation for Trio New Application Domains • Many involve data that is uncertain • (probabilistic, imprecise, fuzzy, incomplete...) • Many of the same ones need to track the lineage (provenance) of their data Trio combines uncertainty & lineage for a closed & complete representation model Neither uncertainty nor lineage is supported in current commercial database systems
Sample Applications • Information extraction • Find & label entities in unstructured text • Often probabilistic • Information integration • Combine data from multiple sources • Inconsistencies, probabilistic schema mappings • “Human input” data / Scientific experiments • Inexact/incomplete data • Many levels of “derived data products”
Sample Applications • Sensor data management • Approximate readings • Missing readings • Levels of data aggregation • Deduplication (“data cleaning”) • Object linkage, entity resolution • Often heuristic/probabilistic • But *NOT* approximate query processing • Fast but inexact answers
Our Claim Coincidence or Fate? • The connection between uncertainty and lineage goes deeper than just a shared need by several applications
Substantiation of Claim • [technical] Lineage: • Enables simple and consistent representation of uncertain data • Correlates uncertainty in query results with uncertainty in the input data • Can make computations over uncertain data more efficient • [fluffy] “Applications may use lineage to reduce or resolve uncertainty”
Our Goal • Develop a new kind of database management system (DBMS) in which: • Data • Uncertainty • Lineage • are all first-class interrelated concepts • With all the “usual” DBMS features
The “Usual” DBMS Features • Efficient, • Convenient, • Safe, • Multi-User storage of and access to • Massive amounts of • Persistent data
Standard Relational DBMSs • Persistent; Convenient • All data stored in simple tables (relations) • Queries and updates via simple but powerful declarative language (SQL) • Multi-User; Safe • Transactions, Recovery • Massive; Efficient • Storage and indexing structures • Query optimization
Trio: What Changes • Persistent; Convenient • All data stored in “uncertain” relations • Queries and updates via simple but powerful declarative language • Multi-User; Safe • Transactions, Recovery • Massive; Efficient • Storage and indexing structures • Query optimization
Trio: What Changes • Persistent; Convenient • All data stored in “uncertain” relations • Queries and updates via simple but powerful declarative language • Multi-User; Safe – standard DBMS underneath • Transactions, Recovery • Massive; Efficient – standard DBMS underneath • Storage and indexing structures • Query optimization
Another “Trio” in Trio • Data Model • Simplest extension to relational model that’s sufficiently expressive • Query Language • Simple extension to SQL with well-defined semantics and intuitive behavior • System • A complete open-source DBMS that people want to use
Another “Trio” in Trio • Data Model • Uncertainty-Lineage Databases (ULDBs) • Query Language • TriQL • System • Trio-One— built on top of standard DBMS
Remainder of Talk • Data Model • Uncertainty-Lineage Databases (ULDBs) • Query Language • TriQL • System Trio-One • Real-World Applications PharmGKB.org SERF
Running Example: Crime-Solving • Saw (witness, color, car)// may be uncertain • Drives (person,color, car)// may be uncertain • Suspects (person)= πperson(Saw ⋈ Drives)
In Standard Relational DBMS Create Table Suspects as Select person From Saw, Drives Where Saw.color = Drives.color And Saw.car = Drives.car
Data Model: Uncertainty • An uncertain database enodes a set of possible instances based on different kinds of uncertainty: • Jimmy drives either a Toyota or a Mazda, but not both. • Amy saw either a Honda or a Toyota, or nothing? • Hank is a suspect with confidence 0.7 • Betty saw either an Acura with confidence 0.5 or a Toyota with confidence 0.3 or nothing with confidence 0.2 ….
Our Model for Uncertainty • 1. Alternatives • 2. ‘?’ (Maybe) Annotations • 3. Confidences
Our Model for Uncertainty • 1. Alternatives:uncertainty about value • 2. ‘?’ (Maybe) Annotations • 3. Confidences Three possible instances
Our Model for Uncertainty • 1. Alternatives • 2.‘?’ (Maybe): uncertainty about presence • 3. Confidences ? Six possible instances
Our Model for Uncertainty • 1. Alternatives • 2.‘?’ (Maybe):uncertainty about presence • 3. Confidences absent unknown ?
Our Model for Uncertainty • 1. Alternatives • 2. ‘?’ (Maybe) Annotations • 3. Confidences: weighted uncertainty ? Six possible instances, each with a probability
Our Model is Not Closed Suspects= πperson(Saw ⋈ Drives) CANNOT Does not correctly capture possible instances in the result (not 12)! ? ? ?
to the Rescue Lineage • Lineage (provenance): “where data came from” • Internal lineage – from another Trio relation • External lineage – from an external source • In Trio: A Boolean functionλ from data elements to other data elements (or external sources)
Example with Lineage Suspects= πperson(Saw ⋈ Drives) ? ? ? λ(31) = (11,2)˄(21,2) λ(32,1) = (11,1)˄(22,1) ; λ(32,2) = (11,1)˄(22,2) λ(33) = (11,1) ˄ 23
Example with Lineage Correctly captures possible instances in the result (7) Suspects= πperson(Saw ⋈ Drives) λ(31) = (11,2)˄(21,2) ? λ(32,1) = (11,1)˄(22,1); λ(32,2) = (11,1)˄(22,2) ? λ(33) = (11,1) ˄ 23 ?
Trio Data Model Uncertainty-Lineage Databases (ULDBs) • Alternatives • ‘?’ (Maybe) Annotations • Confidences • Lineage ULDBs are closed and complete [Databases with Uncertainty and Lineage, VLDB J. Special Issue 2/08]
ULDBs: Lineage • Conjunctive lineage sufficient for most operations (,,⋈) • Disjunctive lineage for duplicate-elimination () • Negative lineage for set difference () • General case after several queries: • Boolean formula • Can represent anyfinite (sub)set of possible instances • Can capture arbitraryBoolean constraints among tuples
ULDBs: Minimality • A ULDB relation R represents a set of possible • instances • Does every tuple inRappear in some possible instance? (no extraneous tuples) • Does every maybe-tuple inRnotappear in some possible instance?(no extraneous ‘?’s) • Also Data-minimality Lineage-minimality
Data Minimality Examples • Extraneous ‘?’ λ(40,1)=(10,1) λ(40,2)=(10,2) ? extraneous ? ?
Data Minimality Examples • Extraneous tuple λ(40,1)=(10,1)˄(10,2) ? extraneous ? ?
ULDBs: Complexity Questions • Does a given tuple t appear in some(all)possible instance(s)of an uncertain relationR? • Is a given tableTone of(all of)the possible instances ofR? Polynomial algorithms based on data-minimization • Confidence computation over arbitrary Boolean expressions: • Efficient approximation algorithms • (E.g. Monte Carlo simulations) NP-hard #P-complete
Querying ULDBs TriQL • Simple extension to SQL • Formal semantics, intuitive meaning • Query uncertainty, confidences, and lineage
Formal Semantics • Relational (SQL) query Qon ULDB D implementation of Q D D + Result operational semantics possible instances representation of instances Qon each instance D1, D2, …, Dn Q(D1), Q(D2), …, Q(Dn)
Operational Semantics: Example SELECT person FROM Saw, Drives WHERE Saw.car = Drives.car
Operational Semantics: Example SELECT person FROM Saw, Drives WHERE Saw.car = Drives.car
Operational Semantics: Example SELECT person FROM Saw, Drives WHERE Saw.car = Drives.car
Operational Semantics: Example SELECT person FROM Saw, Drives WHERE Saw.car = Drives.car ?
Operational Semantics: Example SELECT person FROM Saw, Drives WHERE Saw.car = Drives.car ?
Operational Semantics: Example SELECT person FROM Saw, Drives WHERE Saw.car = Drives.car ?
Operational Semantics: Example SELECT person FROM Saw, Drives WHERE Saw.car = Drives.car ? ?
Operational Semantics: Example CREATE TABLE Suspects AS SELECT Drives.person FROM Saw, Drives WHERE Saw.car = Drives.car ? λ( ) = ... λ( ) = ... ? • Efficiently implementable on top of standard SQL • Data computation overhead: O(n log(n))
Confidences • Confidences supplied with base data • Trio computes confidences on query results • Default probabilistic interpretation (#P-complete) • Plug in different arithmetics, e.g., Min(linear) 0.3 0.4 Probabilistic Min 0.6
TriQL: Querying Confidences Built-in function:Conf() • SELECT person • FROM Saw, Drives • WHERE Saw.car = Drives.car • AND Conf(Saw) > 0.5 AND Conf(Drives) > 0.8 Similarly:Conf(*) • SELECT person • FROM Saw, Drives • WHERE Saw.car = Drives.car • AND Conf(*) > 0.5 Naïve data-minimization: WHERE Conf(*) > 0
TriQL: Querying Lineage Built-in join predicate:Lineage() • SELECT Suspects.person • FROM Suspects, Saw • WHERE Saw.witness = ‘Amy’ • AND Lineage(Suspects,Saw) X ==> Y shorthand for Lineage(X,Y)
More TriQL Constructs • “Horizontal subqueries” • Refer to tuple alternatives as a relation • Aggregations: • (sum, count, avg, min, max) (low, high, expected) • Fetch modifiers: • Merged (eliminate horizontal duplicates) • Flatten, GroupAlts, NoLineage, NoConf, NoMaybe • Set operators: Union [All], Intersect, Except • Conjunctive, Disjunctive & Negative Lineage
Horizontal Subquery Example • List suspects with conf values based on accuser credibility