1 / 67

Trio-One: Layering Uncertainty and Lineage on a Conventional DBMS

UNCERTAINTY. LINEAGE. DATA. Trio-One: Layering Uncertainty and Lineage on a Conventional DBMS. Martin Theobald Jennifer Widom Stanford University. Outline. Motivation and Applications Working Model for ULDBs Trio-One System Architecture Efficient Confidence Computations. Motivation.

Download Presentation

Trio-One: Layering Uncertainty and Lineage on a Conventional DBMS

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. UNCERTAINTY LINEAGE DATA Trio-One:Layering Uncertainty and Lineage on a Conventional DBMS Martin Theobald Jennifer Widom Stanford University

  2. Outline • Motivation and Applications • Working Model for ULDBs • Trio-One System Architecture • Efficient Confidence Computations

  3. Motivation • Many applications involve data that is uncertain • (approximate, probabilistic, inexact, incomplete, • imprecise, fuzzy, inaccurate, ...) • Many of the same applications need to track the lineage of their data • Neither uncertainty nor lineage are supported by conventional Database Management Systems (DBMSs) Coincidence or Fate?

  4. Sample Applications • Deduplication / Entity Resolution • Uncertainty:Match and merge • Lineage:Source records • Information extraction • Uncertainty:Extracted labels and values • Lineage:Original context • Information integration • Uncertainty:Inconsistent information • Lineage:Original sources

  5. Sample Applications • Scientific experiments • Uncertainty: Captured (and derived) data • Lineage: Layers of views • Sensor data • Uncertainty: Sensor values, aggregations, missing readings • Lineage: Original readings, views

  6. Claim Coincidence or Fate? • The connection between uncertainty and lineage goes deeper than just a shared need by several applications

  7. Lineage and Uncertainty • Lineage... • Enables simple and consistent representation of uncertain data • Allows for intuitive and complete working models • Correlates uncertainty in query results with uncertainty in the input data • Can make confidence computations over uncertain data more efficient • Applications use lineage to reduce or resolve • uncertainty

  8. Goal • A new kind of DBMS in which: • Data • Uncertainty • Lineage are all first-class interrelated concepts } Trio

  9. The Trio Trio • Data Model • Simplest extension to relational model that’s sufficiently expressive • Query Language • Simple extension to SQL with well-defined semantics and intuitive behavior • System • A complete open-source DBMS that people want to use

  10. The Present • Data Model • Uncertainty-Lineage Databases (ULDBs) • Query Language • TriQL • System • Trio-One: Layered prototype deployable on top of any standard relational DBMS

  11. Running Example: Crime-Solving • Saw(witness,car) // may be uncertain • Drives(person,car) // may be uncertain • Suspects(person)= πperson(Saw ⋈car Drives)

  12. Data Model: Uncertainty • An uncertain database represents a set of • possible instances of data values • Amy saw either a Honda or a Toyota • Jimmy drives a Toyota, a Mazda, or both • Betty saw an Acura with confidence 0.5 or a Toyota with confidence 0.3 • Hank is a suspect with confidence 0.7

  13. Our Model for Uncertainty • 1. Alternatives (mutually exclusive) • 2. ‘?’ (Maybe) Annotations • 3. Confidences

  14. Our Model for Uncertainty • 1. Alternatives:uncertainty about data value(s) • 2. ‘?’ (Maybe) Annotations • 3. Confidences Three possible instances =

  15. Our Model for Uncertainty • 1. Alternatives • 2.‘?’ (Maybe): uncertainty about presence • 3. Confidences ? Six possible instances

  16. Our Model for Uncertainty • 1. Alternatives • 2. ‘?’ (Maybe) Annotations • 3. Confidences: weighted uncertainty ? Six possible instances, each with a probability

  17. Models for Uncertainty • Our model (so far) is not especially new • We spent some time exploring the space of models for uncertainty • Tension between understandability and expressiveness • Our model is understandable • But it is not complete, or even closed under common operations

  18. Closure and Completeness • Completeness • Can represent any finite set of possible instances • Closure • Can represent any result of relational operators • Note: Completeness Closure

  19. Models for Uncertainty • A-Tuples – Uncertainty on attribute values • (Amy, {Toyota, Mazda}) • Not closed • X-Tuples – Uncertainty on entire tuples • { (Amy, Toyota), (Amy, Mazda) } • Still not closed • C-Tables – Tuples + Boolean constraints • { (Amy, Toyota, X=1), • (Amy, Mazda, X<>1), • (Cathy, Honda, X<>1) } • Closed and complete! • Understandable?

  20. Our Model is Not Closed Suspects= πperson(Saw ⋈car Drives) CANNOT Does not correctly capture possible instances in the result ? ? ?

  21. to the Rescue Lineage • Lineage (provenance): “where data came from” • Internal lineage – pointers to base data • External lineage – files, URLs, web portals, etc. • In Trio: A Boolean functionλ from alternatives to other alternatives (or external sources)

  22. Example with Lineage Correctly captures possible instances in the result Suspects= πperson(Saw ⋈car Drives) λ(31) = (11,2),(21,2) ? λ(32,1) = (11,1),(22,1); λ(32,2) = (11,1),(22,2) ? ? λ(33) = (11,1), 23

  23. Trio Data Model Uncertainty-Lineage Databases (ULDBs) • 1. Alternatives (mutually exclusive) • 2. ‘?’ (Maybe) Annotations • 3. Confidences • 4. Lineage • ULDBs are closed and complete

  24. ULDBs: Lineage • Conjunctive lineage sufficient for most operations • Disjunctive lineage for duplicate-elimination • Negative lineage for difference

  25. ULDBs: Minimality • A ULDB relation R represents a set of possible • instances • Does every tuple inRappear in some possible instance? (no extraneous tuples) • Does every maybe-tuple inRnotappear in some possible instance?(no extraneous ‘?’s) Also Data-minimality Lineage-minimality

  26. Data Minimality Examples • Extraneous ‘?’ λ(40,1)=(10,1); λ(40,2)=(10,2) ? extraneous ? ?

  27. Data Minimality Examples • Extraneous tuple ? extraneous ? ?

  28. ULDBs: Membership Questions • Does a given tuple t appear in some(all)possible instance(s)ofR? • Is a given tableTone of(all of)the possible instances ofR? Polynomial algorithms based on data-minimization NP-Hard

  29. Non-Theorists... • Wake Back Up!

  30. Querying ULDBs TriQL • Simple extension to SQL • Formal semantics, intuitive meaning • Query uncertainty, confidences, and lineage

  31. Initial TriQL Example SELECT Drives.person INTO Suspects FROM Saw, Drives WHERE Saw.car = Drives.car λ(31) = (11,2),(21,2) ? λ(32,1)=(11,1),(22,1); λ(32,2)=(11,1),(22,2) ? ? λ(33) = (11,1), 23

  32. Formal Semantics • Query Qon ULDB D implementation of Q D D’ D + Result operational semantics possible instances representation of instances Qon each instance D1, D2, …, Dn Q(D1), Q(D2), …, Q(Dn)

  33. Operational Semantics SELECT attr-list [ INTO table ] FROM X1, X2, ..., Xn WHERE predicate • Over conventional relational database: For each tuple in cross-product of X1, X2, ..., Xn • Evaluate the predicate • If true, project attr-list to create result tuple • If INTO clause, insert into table

  34. Operational Semantics SELECT attr-list [ INTO table ] FROM X1, X2, ..., Xn WHERE predicate • Over ULDB: For each tuple in cross-product of X1, X2, ..., Xn • Create “super tuple” T from all combinations of alternatives • Evaluate predicate on each alternative in T; keep only the true ones • Project attr-list on each alternative to create result tuple • Details: ‘?’, lineage, confidences

  35. Operational Semantics: Example SELECT Drives.person FROM Saw, Drives WHERE Saw.car = Drives.car

  36. Operational Semantics: Example SELECT Drives.person FROM Saw, Drives WHERE Saw.car = Drives.car

  37. Operational Semantics: Example SELECT Drives.person FROM Saw, Drives WHERE Saw.car = Drives.car

  38. Operational Semantics: Example SELECT Drives.person FROM Saw, Drives WHERE Saw.car = Drives.car

  39. Operational Semantics: Example SELECT Drives.person FROM Saw, Drives WHERE Saw.car = Drives.car

  40. Operational Semantics: Example SELECT Drives.person FROM Saw, Drives WHERE Saw.car = Drives.car

  41. Operational Semantics: Example SELECT Drives.person FROM Saw, Drives WHERE Saw.car = Drives.car

  42. Operational Semantics: Example SELECT Drives.personINTO Suspects FROM Saw, Drives WHERE Saw.car = Drives.car ? λ( ) = ... λ( ) = ... ?

  43. Confidences • Confidences supplied with base data • Trio computes confidences on query results • Default probabilistic interpretation • Can choose to plug in different arithmetic ? ? Probabilistic Min 0.3 0.4 ? 0.6

  44. TriQL: Querying Confidences • Built-in function:Conf() • SELECT Drives.person INTO Suspects • FROM Saw, Drives • WHERE Saw.car = Drives.car • AND Conf(Saw) > 0.5 AND Conf(Drives) > 0.8 • SELECT Drives.person INTO Suspects • FROM Saw, Drives • WHERE Saw.car = Drives.car • AND Conf(*) > 0.4

  45. TriQL: Querying Lineage • Built-in join predicate:Lineage() • SELECT Saw.witness INTO AccusesHank • FROM Suspects, Saw • WHERE Lineage(Suspects,Saw) • AND Suspects.person = ‘Hank’ • Also works for transitive lineage: • SELECT witness • FROM AccusesHank • WHERE Lineage(Saw,AccusesHank)

  46. Additional Query Constructs • “Horizontal subqueries” Refer to tuple alternatives as a relation • Unmerged (allow horizontal duplicates) • Flatten, GroupAlts • NoLineage, NoConf, NoMaybe • Query-computed confidences • SQL-like data modification statements

  47. Final Example Query Credibility • List suspects with conf values based on accuser credibility

  48. Final Example Query Credibility SELECT suspect, score/[sum(score)] as conf FROM (SELECT suspect, (SELECT score FROM Credibility C WHERE C.person = P.accuser) FROM PrimeSuspect P)

  49. Final Example Query Credibility SELECT suspect, score/[sum(score)] as conf FROM (SELECT suspect, (SELECT score FROM Credibility C WHERE C.person = P.accuser) FROM PrimeSuspect P)

  50. Final Example Query Credibility SELECT suspect, score/[sum(score)]as conf FROM (SELECT suspect, (SELECT score FROM Credibility C WHERE C.person = P.accuser) FROM PrimeSuspect P) “Scaled as conf” “Uniform as conf” …

More Related