1 / 43

Trio: A System for Data , Uncertainty , and Lineage

UNCERTAINTY. LINEAGE. DATA. Trio: A System for Data , Uncertainty , and Lineage. Search “stanford trio”. Stanford Report: March ‘06. Trio. Data Student #123 is majoring in Econ: (123,Econ)  Major Uncertainty Student #123 is majoring in Econ or CS: (123, Econ ∥ CS)  Major

keelty
Download Presentation

Trio: A System for Data , Uncertainty , and Lineage

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. UNCERTAINTY LINEAGE DATA Trio: A System for Data, Uncertainty, and Lineage Search “stanford trio”

  2. Stanford Report: March ‘06

  3. Trio • Data Student #123 is majoring in Econ:(123,Econ) Major • Uncertainty Student #123 is majoring in Econ or CS: (123, Econ ∥ CS) Major With confidence 60% student #456 is a CS major: (456, CS: 0.6) Major • Lineage (456)HardWorkerderived from: (456, CS) Major “CS is hard” some web page

  4. The Picture Data Uncertainty Lineage (“sourcing”)

  5. Why Uncertainty + Lineage? • Many applications seem to need both • Information extraction systems • Scientific and sensor data management • Information integration • Deduplication (data cleaning) • Approximate query processing

  6. Why Uncertainty + Lineage? • From a technical standpoint, it turns out that • lineage... • Enables simple and consistent representation of uncertain data • Correlates uncertainty in query results with uncertainty in the input data • Can make computation over uncertain data more efficient

  7. Goal • A new kind of DBMS in which: • Data • Uncertainty • Lineage are all first-class interrelated concepts With all the usual DBMS features Scalable, reliable, efficient, ad-hoc declarative queries and updates, … } Trio

  8. The Trio Trio • Data Model • Simplest extension to relational model that’s sufficiently expressive • Query Language • Simple extension to SQL with well-defined semantics and intuitive behavior • System • A complete open-source DBMS that people want to use

  9. The Trio Trio • Data Model • Uncertainty-Lineage Databases (ULDBs) • Query Language • TriQL • System • First prototype built on top of standard DBMS

  10. Running Example: Crime-Solving • Saw(witness,car) // may be uncertain • Drives(person,car) // may be uncertain • Suspects(person)= πperson(Saw ⋈ Drives)

  11. Data Model: Uncertainty • An uncertain database represents a set of • possible instances • Amy saw either a Honda or a Toyota • Jimmy drives a Toyota, a Mazda, or both • Betty saw an Acura with confidence 0.5 or a Toyota with confidence 0.3 • Hank is a suspect with confidence 0.7

  12. Our Model for Uncertainty • 1. Alternatives • 2. ‘?’ (Maybe) Annotations • 3. Confidences

  13. Our Model for Uncertainty • 1. Alternatives:uncertainty about value • 2. ‘?’ (Maybe) Annotations • 3. Confidences Three possible instances =

  14. Our Model for Uncertainty • 1. Alternatives • 2.‘?’ (Maybe): uncertainty about presence • 3. Confidences ? Six possible instances

  15. Our Model for Uncertainty • 1. Alternatives • 2. ‘?’ (Maybe) Annotations • 3. Confidences: weighted uncertainty ? Six possible instances, each with a probability

  16. Deficiency in Model Suspects= πperson(Saw ⋈ Drives) CANNOT Does not correctly capture possible instances in the result ? ? ?

  17. Lineage to the Rescue • Lineage (provenance): “where data came from” • Internal lineage • External lineage • In Trio: A functionλ from alternatives to other • alternatives (or external sources)

  18. Example with Lineage Correctly captures possible instances in the result Suspects= πperson(Saw ⋈ Drives) λ(31) = (11,2),(21,2) ? λ(32,1) = (11,1),(22,1); λ(32,2) = (11,1),(22,2) ? ? λ(33) = (11,1), 23

  19. Uncertainty-Lineage Databases (ULDBs) • 1. Alternatives • 2. ‘?’ (Maybe) Annotations • 3. Confidences • 4. Lineage • The ULDB model is “complete”

  20. Querying ULDBs TriQL • Simple extension to SQL • Formal semantics, intuitive meaning • Query uncertainty, confidences, and lineage

  21. Initial TriQL Example SELECT Drives.person INTO Suspects FROM Saw, Drives WHERE Saw.car = Drives.car λ(31) = (11,2),(21,2) ? λ(32,1)=(11,1),(22,1); λ(32,2)=(11,1),(22,2) ? ? λ(33) = (11,1), 23

  22. Formal Semantics • Query Qon ULDB D implementation of Q D D’ D + Result operational semantics possible instances representation of instances Qon each instance D1, D2, …, Dn Q(D1), Q(D2), …, Q(Dn)

  23. TriQL: Querying Confidences • Built-in function:Conf() • SELECT Drives.person INTO Suspects • FROM Saw, Drives • WHERE Saw.car = Drives.car • AND Conf(Saw) > 0.5 AND Conf(Drives) > 0.8

  24. TriQL: Querying Lineage • Built-in join predicate:Lineage() • SELECT Saw.witness INTO AccusesHank • FROM Suspects, Saw • WHERE Lineage(Suspects,Saw) • AND Suspects.person = ‘Hank’

  25. Operational Semantics SELECT attr-list [ INTO table ] FROM X1, X2, ..., Xn WHERE predicate • Over conventional relational database: For each tuple in cross-product of X1, X2, ..., Xn • Evaluate the predicate • If true, project attr-list to create result tuple • If INTO clause, insert into table

  26. Operational Semantics SELECT attr-list [ INTO table ] FROM X1, X2, ..., Xn WHERE predicate • Over ULDB: For each tuple in cross-product of X1, X2, ..., Xn • Create “super tuple” T from all combinations of alternatives • Evaluate predicate on each alternative in T; keep only the true ones • Project attr-list on each alternative to create result tuple • Details: ‘?’, lineage, confidences

  27. Operational Semantics: Example SELECT Drives.person FROM Saw, Drives WHERE Saw.car = Drives.car

  28. Operational Semantics: Example SELECT Drives.person FROM Saw, Drives WHERE Saw.car = Drives.car

  29. Operational Semantics: Example SELECT Drives.person FROM Saw, Drives WHERE Saw.car = Drives.car

  30. Operational Semantics: Example SELECT Drives.person FROM Saw, Drives WHERE Saw.car = Drives.car

  31. Operational Semantics: Example SELECT Drives.person FROM Saw, Drives WHERE Saw.car = Drives.car

  32. Operational Semantics: Example SELECT Drives.person FROM Saw, Drives WHERE Saw.car = Drives.car

  33. Operational Semantics: Example SELECT Drives.person FROM Saw, Drives WHERE Saw.car = Drives.car

  34. Operational Semantics: Example SELECT Drives.personINTO Suspects FROM Saw, Drives WHERE Saw.car = Drives.car ? λ( ) = ... λ( ) = ... ?

  35. Confidences • Confidences supplied with base data • Trio computes confidences on query results • Default probabilistic interpretation • Can choose to plug in different arithmetic ? ? Probabilistic Min 0.3 0.4 ? 0.6

  36. Additional Query Constructs • “Horizontal subqueries” Refer to tuple alternatives as a relation • Unmerged (horizontal duplicates) • Flatten, GroupAlts • NoLineage, NoConf, NoMaybe • Query-computed confidences • Data modification statements

  37. Final Example Query Credibility • List suspects with conf values based on accuser credibility

  38. Final Example Query Credibility SELECT suspect, score/[sum(score)] as conf FROM (SELECT suspect, (SELECT score FROM Credibility C WHERE C.person = P.accuser) FROM PrimeSuspect P)

  39. Final Example Query Credibility SELECT suspect, score/[sum(score)] as conf FROM (SELECT suspect, (SELECT score FROM Credibility C WHERE C.person = P.accuser) FROM PrimeSuspect P)

  40. Final Example Query Credibility SELECT suspect, score/[sum(score)]as conf FROM (SELECT suspect, (SELECT score FROM Credibility C WHERE C.person = P.accuser) FROM PrimeSuspect P)

  41. Trio System: Version 1 TrioExplorer (GUI client) • DDL commands • TriQL queries • Schema browsing • Table browsing • Explore lineage • On-demand • confidence • computation Command-line client Trio API and translator (Python) • “Verticalize” • Shared IDs for • alternatives • Columns for • confidence,“?” Standard SQL • Table types • Schema-level • lineage structure Standard relational DBMS • conf() • lineage() “==>” Encoded Data Tables Trio Metadata • One per result • table • Uses unique IDs Lineage Tables Trio Stored Procedures

  42. Future Features (sample) • Uncertainty • Incomplete relations • Continuous uncertainty • Correlated uncertainty • Lineage • External lineage • Update lineage • Query processing • “Top-K” by confidence

  43. UNCERTAINTY LINEAGE DATA but don’t forget the lineage…

More Related