UNCERTAINTY LINEAGE DATA Trio: A System for Data, Uncertainty, and Lineage Jennifer Widom et al Stanford University
Some Context: Trio Project • They’re building a new kind of DBMS in which Data, Accuracy, and Lineage are all first-class interrelated concepts
The “Trio” in Trio • Data: Student #123 is majoring in Econ: (123, Econ) ∈ Major • Uncertainty: Student #123 is majoring in Econ or CS: (123, Econ ∥ CS) ∈ Major; with confidence 60%, student #456 is a CS major: (456, CS : 0.6) ∈ Major • Lineage: 456 ∈ HardWorker, derived from (456, CS) ∈ Major and “CS is hard” ∈ some web page
Depiction Data Uncertainty Lineage (“sourcing”)
Original Motivation for the Project New Application Domains • Many involve data that is uncertain • (approximate, probabilistic, inexact, incomplete, imprecise, fuzzy, inaccurate,...) • Many of the same ones need to track the lineage (provenance) of their data Neither uncertainty nor lineage is supported in current database systems
Importance of Lineage • [technical] Lineage: • Enables simple and consistent representation of uncertain data • Correlates uncertainty in query results with uncertainty in the input data • Can make computation over uncertain data more efficient • [fluffy] Applications use lineage to reduce or resolve uncertainty
Sample Applications • Information extraction • Find & label entities in unstructured text • Often probabilistic • Information integration • Combine data from multiple sources • Inconsistencies • Scientific experiments • Inexact/incomplete data • Many levels of “derived data products”
Sample Applications • Sensor data management • Approximate readings • Missing readings • Levels of data aggregation • Deduplication (“data cleaning”) • Object linkage, entity resolution • Often heuristic/probabilistic • Approximate query processing • Fast but inexact answers
The “Usual” DBMS Features (From first lecture of any Intro to Databases class) • Efficient, • Convenient, • Safe, • Multi-User storage of and access to • Massive amounts of • Persistent data
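Completeness vs. Closure • Completeness: every set of instances is representable (in the figure, blue = yellow, where yellow is all sets-of-instances and blue is the representable sets-of-instances) • Closure: applying an operation (Op1, Op2) to a representable set gives a representable set (the arrow stays in blue) • Proposition: An incomplete representation is still interesting if it’s expressive enough and closed under all required operations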
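Operations: Semantics • Easy and natural (re)definition for any standard database operation (call it Op) • Figure: representation D stands for possible instances I1, I2, …, In; applying Op to each instance gives J1, J2, …, Jm; Op’ is the direct implementation that takes D to D’, the representation of J1, J2, …, Jm • Closure: the up-arrow (from J1, J2, …, Jm to a representation D’) always exists • Note: Completeness ⇒ Closure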
Incompleteness [figure: Instance1, Instance2, Instance3; the representation with ‘?’ annotations generates a 4th instance: the empty relation]
Non-Closure Under Join ⋈ • Result has two possible instances: Instance1, Instance2 • Not representable with or-sets and ‘?’
Another “Trio” in Trio • Data Model • Simplest extension to relational model that’s sufficiently expressive • Query Language • Simple extension to SQL with well-defined semantics and intuitive behavior • System • A complete open-source DBMS that people want to use
Another “Trio” in Trio • Data Model • Uncertainty-Lineage Databases (ULDBs) • Query Language • TriQL • System • Trio-One: built on top of a standard DBMS
Remainder of Talk • Data Model • Uncertainty-Lineage Databases (ULDBs) • Query Language • TriQL • System • Trio-One: built on top of a standard DBMS • Demo
Quote from Jennifer • We are not about machine learning or probabilistic reasoning! • We are about efficient and convenient storage, manipulation, and retrieval of large data sets (with uncertainty and lineage in them)
Running Example: Crime-Solving • Saw (witness, color, car) // may be uncertain • Drives (person, color, car) // may be uncertain • Suspects (person) = πperson(Saw ⋈ Drives)
In Standard Relational DBMS Create Table Suspects as Select person From Saw, Drives Where Saw.color = Drives.color And Saw.car = Drives.car
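For reference, the certain (ordinary relational) version of the running example can be run with Python's built-in sqlite3 module. This is only a sketch: the rows inserted below are made up for illustration and do not come from the slides.

```python
import sqlite3

# In-memory database; the table contents are hypothetical sample data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Saw (witness TEXT, color TEXT, car TEXT);
    CREATE TABLE Drives (person TEXT, color TEXT, car TEXT);
    INSERT INTO Saw VALUES ('Amy', 'red', 'Honda');
    INSERT INTO Drives VALUES ('Jimmy', 'red', 'Honda');
    INSERT INTO Drives VALUES ('Hank', 'blue', 'Honda');
    INSERT INTO Drives VALUES ('Billy', 'red', 'Mazda');
    CREATE TABLE Suspects AS
        SELECT person FROM Saw, Drives
        WHERE Saw.color = Drives.color AND Saw.car = Drives.car;
""")

# Only Jimmy matches Amy's sighting on both color and car.
suspects = [row[0] for row in conn.execute("SELECT person FROM Suspects ORDER BY person")]
print(suspects)
```

Every tuple here is either present or absent with certainty, which is exactly what the uncertain model relaxes.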
Data Model: Uncertainty • An uncertain database represents a set of possible instances • Amy saw either a Honda or a Toyota • Jimmy drives a Toyota, a Mazda, or both • Betty saw an Acura with confidence 0.5 or a Toyota with confidence 0.3 • Hank is a suspect with confidence 0.7
Their Model for Uncertainty • 1. Alternatives • 2. ‘?’ (Maybe) Annotations • 3. Confidences
Our Model for Uncertainty • 1. Alternatives: uncertainty about value • 2. ‘?’ (Maybe) Annotations • 3. Confidences Three possible instances
Our Model for Uncertainty • 1. Alternatives • 2. ‘?’ (Maybe): uncertainty about presence • 3. Confidences ? Six possible instances
Our Model for Uncertainty • 1. Alternatives • 2. ‘?’ (Maybe) Annotations • 3. Confidences: weighted uncertainty ? Six possible instances, each with a probability
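The three constructs can be illustrated by enumerating possible instances directly. The sketch below uses hypothetical data (not the relation pictured on the slides) and assumes that x-tuples in a base relation are independent and that a ‘?’ tuple is absent with the leftover probability mass.

```python
from itertools import product

# An x-tuple is (alternatives, maybe); each alternative is (value, confidence).
# Hypothetical data, chosen so the counts match "three" and "six" instances.
drives = [
    # Three alternatives, no '?': exactly one of them is present.
    ([(("Jimmy", "Toyota"), 0.5), (("Jimmy", "Mazda"), 0.3), (("Jimmy", "Honda"), 0.2)], False),
    # One alternative with '?': the tuple may be absent.
    ([(("Hank", "Honda"), 0.7)], True),
]

def possible_instances(relation):
    """Yield (instance, probability) pairs, assuming independent x-tuples."""
    per_tuple_choices = []
    for alternatives, maybe in relation:
        choices = list(alternatives)
        if maybe:
            # Assumption: absence takes whatever probability the alternatives leave over.
            absent_prob = 1.0 - sum(conf for _, conf in alternatives)
            choices.append((None, absent_prob))
        per_tuple_choices.append(choices)
    for combo in product(*per_tuple_choices):
        instance = frozenset(value for value, _ in combo if value is not None)
        prob = 1.0
        for _, conf in combo:
            prob *= conf
        yield instance, prob

instances = list(possible_instances(drives))
print(len(instances))  # 3 alternatives x (present or absent) = 6
```

Dropping the second x-tuple leaves three instances (alternatives only); adding it back with ‘?’ doubles the count, and the confidences attach a probability to each instance.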
Models for Uncertainty • Our model (so far) is not especially new • We spent some time exploring the space of models for uncertainty • Tension between understandability and expressiveness • Our model is understandable • But it is not complete, or even closed under common operations
Our Model is Not Complete or Closed • Suspects = πperson(Saw ⋈ Drives) • CANNOT correctly capture possible instances in the result (result tuples each marked ‘?’)
Lineage to the Rescue • Lineage (provenance): “where data came from” • Internal lineage • External lineage • In Trio: a function λ from data elements to other data elements (or external sources)
Example with Lineage Correctly captures possible instances in the result Suspects= πperson(Saw ⋈ Drives) λ(31) = (11,2),(21,2) ? λ(32,1) = (11,1),(22,1); λ(32,2) = (11,1),(22,2) ? λ(33) = (11,1), 23 ?
Example with Lineage Suspects= πperson(Saw ⋈ Drives) ? ? ?
Trio Data Model Uncertainty-Lineage Databases (ULDBs) • Alternatives • ‘?’ (Maybe) Annotations • Confidences • Lineage ULDBs are closed and complete
Proof of Completeness • Proof: Given possible instances P1, ..., Pm over relations R1, ..., Rn, construct a ULDB D with x-relations S1, ..., Sn corresponding to R1, ..., Rn • Construct an extra relation PW that encodes the possible instances. PW contains exactly one x-tuple: (1) ∥ (2) ∥ ... ∥ (m) • Each Si is constructed as follows: for every Pj, each tuple t in Ri forms a maybe x-tuple with just one alternative with value t • Duplicates within and across possible instances are preserved in Si • We add (j) in PW to the lineage of alternatives in tuples copied from Pj • This now exactly encodes the data in each of the possible instances
Proof of Completeness (continued) • The correct lineage is obtained as follows: we look at the lineage λj in Pj and mimic it in the x-tuples it contributes in S1 through Sn • For example, if λj(t1) = {t2} in Pj, where t1 is a tuple in R1 and t2 is a tuple in R2, then the x-tuple that t2 gave in S2 is added to the lineage of the x-tuple from t1 in S1 • As a final step, we remove the extra relation PW but retain its symbols as external lineage • Therefore, each possible LDB of D now has the same schema as each Pj, and represents exactly the same data and internal lineage
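The data half of this construction can be sketched for a single relation. The code below is a simplified illustration, not the full proof: it ignores internal lineage between relations, and the names `encode` and `decode` and the sample instances are hypothetical.

```python
# Hypothetical target: three possible instances of a one-column relation.
possible = [
    frozenset({("Billy",), ("Hank",)}),
    frozenset({("Jimmy",)}),
    frozenset(),
]

def encode(possible_instances):
    """Build maybe x-tuples (value, j): one per tuple per instance, with
    lineage pointing at alternative (j) of the extra relation PW."""
    return [
        (t, j)
        for j, instance in enumerate(possible_instances, 1)
        for t in sorted(instance)
    ]

def decode(x_tuples, m):
    """Choosing PW alternative (j) keeps exactly the x-tuples whose lineage is (j)."""
    return [frozenset(t for t, src in x_tuples if src == j) for j in range(1, m + 1)]

s = encode(possible)
recovered = decode(s, len(possible))
```

Because every copied tuple depends on exactly one alternative of PW, picking alternative (j) reproduces Pj and nothing else, which is the sense in which the encoding is exact.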
ULDBs: Minimality • A ULDB relation R represents a set of possible instances • Does every tuple in R appear in some possible instance? (no extraneous tuples) • Does every maybe-tuple in R not appear in some possible instance? (no extraneous ‘?’s) • Also Data-minimality Lineage-minimality
Data Minimality Examples • Extraneous ‘?’ λ(20,1)=(10,1); λ(20,2)=(10,2) ? extraneous
Data Minimality Examples • Extraneous tuple ? extraneous ? ?
Querying ULDBs TriQL • Simple extension to SQL • Formal semantics, intuitive meaning • Query uncertainty, confidences, and lineage
Simple TriQL Example Create Table Suspects as Select person From Saw, Drives Where Saw.car = Drives.car λ(31)=(11,2),(21,2) ? λ(32,1)=(11,1),(22,1); λ(32,2)=(11,1),(22,2) ? ? λ(33)=(11,1),23
Formal Semantics • Relational (SQL) query Q on ULDB D • Figure: D represents possible instances D1, D2, …, Dn; running Q on each instance gives Q(D1), Q(D2), …, Q(Dn); the implementation of Q (operational semantics) takes D to D’ = D + Result, which is the representation of those result instances
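The "Q on each instance" side of this diagram can be computed by brute force. The sketch below uses hypothetical x-relations without ‘?’ or confidences; the data is not taken from the slides, only chosen to be consistent with the lineage values shown in the Suspects example.

```python
from itertools import product

# Hypothetical x-relations: each inner list holds the alternatives of one x-tuple.
saw = [[("Cathy", "Honda"), ("Cathy", "Mazda")]]
drives = [
    [("Jimmy", "Toyota"), ("Jimmy", "Mazda")],
    [("Billy", "Honda"), ("Frank", "Honda")],
    [("Hank", "Honda")],
]

def instances(x_relation):
    """Enumerate possible instances: pick one alternative per x-tuple."""
    return [frozenset(choice) for choice in product(*x_relation)]

def suspects_query(saw_inst, drives_inst):
    """Q: project person from Saw joined with Drives on car, over ordinary relations."""
    return frozenset(
        person
        for (_witness, seen_car) in saw_inst
        for (person, driven_car) in drives_inst
        if seen_car == driven_car
    )

# The set of possible result instances Q(D1), ..., Q(Dn), with duplicates merged.
result_instances = {
    suspects_query(s, d) for s in instances(saw) for d in instances(drives)
}
```

With this data the result has four distinct instances: {Billy, Hank}, {Frank, Hank}, {Jimmy}, and the empty relation. Hank never appears together with Jimmy, a correlation that independent ‘?’ annotations cannot express and that lineage records.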
TriQL: Querying Confidences • Built-in function: Conf() • SELECT person FROM Saw, Drives WHERE Saw.car = Drives.car AND Conf(Saw) > 0.5 AND Conf(Drives) > 0.8
TriQL: Querying Lineage • Built-in join predicate: Lineage() • SELECT Suspects.person FROM Suspects, Saw WHERE Lineage(Suspects,Saw) AND Saw.witness = 'Amy' • X ==> Y is shorthand for Lineage(X,Y)
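A rough sketch of what the lineage query asks for, in plain Python rather than TriQL. The data and the `lineage` helper are hypothetical, and the helper assumes the predicate holds when a Saw alternative appears directly in a suspect's lineage; how Trio itself evaluates Lineage() is not specified on the slide.

```python
# Hypothetical data: alternative ids are (x-tuple id, alternative index).
saw = {
    (11, 1): ("Amy", "Honda"),
    (11, 2): ("Amy", "Mazda"),
    (12, 1): ("Betty", "Honda"),
}
# Each Suspects alternative carries its lineage: the set of alternatives it came from.
suspects = {
    (31, 1): ("Jimmy", {(11, 2), (21, 2)}),
    (32, 1): ("Hank", {(12, 1), (23, 1)}),
}

def lineage(suspect_id, saw_id):
    """Assumed meaning of Lineage(Suspects, Saw): saw_id is in the suspect's lineage."""
    return saw_id in suspects[suspect_id][1]

# Suspects whose derivation involves something Amy saw.
amy_suspects = sorted({
    suspects[sid][0]
    for sid in suspects
    for wid, (witness, _car) in saw.items()
    if lineage(sid, wid) and witness == "Amy"
})
print(amy_suspects)
```

Here only Jimmy is returned, because Hank's lineage points at Betty's sighting rather than Amy's.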
Operational Semantics • SELECT attr-list FROM X1, X2, ..., Xn WHERE predicate • Over standard relational database: for each tuple in cross-product of X1, X2, ..., Xn: • Evaluate the predicate • If true, project attr-list to create result tuple
Operational Semantics • SELECT attr-list FROM X1, X2, ..., Xn WHERE predicate • Over ULDB: for each tuple in cross-product of X1, X2, ..., Xn: • Create “super tuple” T from all combinations of alternatives • Evaluate predicate on each alternative in T; keep only the true ones • Project attr-list on each alternative to create result tuple • Details: ‘?’, lineage, confidences
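The ULDB steps above can be sketched for the Suspects query. The x-relations are hypothetical (chosen to be consistent with the lineage values on the earlier Suspects slide), confidences are omitted, and the rule used for ‘?’ is a simplifying assumption rather than Trio's documented behavior.

```python
from itertools import product

# Hypothetical x-relations keyed by x-tuple id; each value is (alternatives, maybe).
saw = {11: ([("Cathy", "Honda"), ("Cathy", "Mazda")], False)}
drives = {
    21: ([("Jimmy", "Toyota"), ("Jimmy", "Mazda")], False),
    22: ([("Billy", "Honda"), ("Frank", "Honda")], False),
    23: ([("Hank", "Honda")], False),
}

def join_suspects(saw, drives):
    """SELECT person FROM Saw, Drives WHERE Saw.car = Drives.car, over x-tuples."""
    result = []
    for (sid, (s_alts, s_maybe)), (did, (d_alts, d_maybe)) in product(
        saw.items(), drives.items()
    ):
        kept = []
        combos = 0
        # "Super tuple": every combination of one alternative from each x-tuple.
        for (i, s_alt), (j, d_alt) in product(
            enumerate(s_alts, 1), enumerate(d_alts, 1)
        ):
            combos += 1
            if s_alt[1] == d_alt[1]:            # predicate: Saw.car = Drives.car
                person = d_alt[0]               # projection onto person
                lineage = ((sid, i), (did, j))  # input alternatives that produced it
                kept.append((person, lineage))
        if kept:
            # Simplifying assumption: mark '?' when an input was a maybe-tuple
            # or some combination failed the predicate.
            maybe = s_maybe or d_maybe or len(kept) < combos
            result.append((kept, maybe))
    return result

suspects = join_suspects(saw, drives)
for alternatives, maybe in suspects:
    print(alternatives, "?" if maybe else "")
```

Each result x-tuple keeps one alternative per surviving combination, and the recorded lineage is what later ties Hank's presence to the Honda alternative of the Saw tuple.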
Operational Semantics: Example SELECT person FROM Saw, Drives WHERE Saw.car = Drives.car