Trio: A System for Data, Uncertainty, and Lineage • Search “stanford trio” • http://i.stanford.edu/trio
People • Current • Jennifer Widom (faculty) • Omar Benjelloun (post-doc) • Parag Agrawal, Anish Das Sarma, Shubha Nabar (PhD) • Michi Mutsuzaki (MS) • Tomoe Sugihara (visitor) • Incoming • Martin Theobald (post-doc) • Raghu Murthy (MS) • Ander de Keijzer (visitor) • Alums • Alon Halevy, Ashok Chandra (visitors) • Chris Hayworth (MS)
Why Uncertainty + Lineage? • Many applications seem to need both • From a technical standpoint, it turns out that lineage: • Enables simple and consistent representation of uncertain data • Correlates uncertainty in query results with uncertainty in the input data • Can make computation over uncertain data more efficient
Trio Components • Data Model • ULDBs (Uncertainty-Lineage Databases): • Simple extension to relational model • Query Language • TriQL: Simple extension to SQL, well-defined semantics and intuitive behavior • System • Version 1: Complete system and GUI built on top of conventional DBMS
Running Example: Crime-Solving • Saw(witness, car) // may be uncertain • Drives(person, car) // may be uncertain • Suspects(person) = π_person(Saw ⋈ Drives)
Our Model for Uncertainty • 1. Alternatives • 2. ‘?’ (Maybe) Annotations • 3. Confidences
Our Model for Uncertainty • 1. Alternatives: uncertainty about value • 2. ‘?’ (Maybe) Annotations • 3. Confidences • (Figure: three possible instances)
Our Model for Uncertainty • 1. Alternatives • 2. ‘?’ (Maybe): uncertainty about presence • 3. Confidences • (Figure: six possible instances)
Our Model for Uncertainty • 1. Alternatives • 2. ‘?’ (Maybe) Annotations • 3. Confidences: weighted uncertainty • (Figure: six possible instances, each with a probability)
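The three constructs above can be sketched as a small enumeration of possible instances. This is a minimal sketch, not Trio's implementation: the function name `possible_instances`, the sample `saw` data, and the confidence values are all hypothetical, and the sketch assumes that x-tuples are independent and that any confidence mass missing from an x-tuple is the probability that the tuple is absent (the ‘?’ case).

```python
from itertools import product
from math import prod

def possible_instances(relation):
    """Enumerate (instance, probability) pairs for a list of x-tuples.

    Each x-tuple is a list of (value, confidence) alternatives. If the
    confidences sum to less than 1, the x-tuple is treated as a maybe-tuple
    ('?') and the leftover mass is the probability that it is absent.
    Assumption: x-tuples are independent of each other.
    """
    per_tuple = []
    for alternatives in relation:
        options = list(alternatives)
        absent = 1.0 - sum(conf for _, conf in alternatives)
        if absent > 1e-9:
            options.append((None, absent))  # None stands for "tuple absent"
        per_tuple.append(options)

    instances = []
    for combo in product(*per_tuple):
        rows = tuple(value for value, _ in combo if value is not None)
        instances.append((rows, prod(conf for _, conf in combo)))
    return instances

# Hypothetical data: one x-tuple with three alternatives, one maybe-tuple.
saw = [
    [(("Amy", "Honda"), 0.5), (("Amy", "Toyota"), 0.3), (("Amy", "Mazda"), 0.2)],
    [(("Betty", "Acura"), 0.6)],
]
instances = possible_instances(saw)
```

With this sample data the three alternatives and the one maybe-tuple combine into six instances, and the instance probabilities sum to 1.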
Models for Uncertainty • Our model (so far) is not especially new • We spent some time exploring the space of models for uncertainty [ICDE 06, journal] • Tension between understandability and expressiveness • Our model is understandable • But it is not complete, or even closed under common operations
Our Model is Not Closed • Suspects = π_person(Saw ⋈ Drives) • Alternatives and ‘?’ annotations alone cannot correctly capture the possible instances in the result
Lineage to the Rescue • Lineage • Captures “where data came from” • In Trio: a function λ from alternatives to other alternatives (or external sources)
Example with Lineage • Suspects = π_person(Saw ⋈ Drives) • λ(31) = (11,2), (21,2) • λ(32,1) = (11,1), (22,1); λ(32,2) = (11,1), (22,2) • λ(33) = (11,1), 23 • Correctly captures possible instances in the result
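A rough sketch of how lineage restricts the possible instances of Suspects follows. The table contents are assumed for illustration only, chosen so that the join produces the lineage listed on the slide; the helper names `join_with_lineage` and `result_instances` are hypothetical, and the base tuples are assumed to carry no ‘?’ annotations.

```python
from itertools import product

# Assumed base data: x-tuple ID -> list of alternatives (numbered from 1).
saw = {11: [("Cathy", "Honda"), ("Cathy", "Mazda")]}
drives = {
    21: [("Jimmy", "Toyota"), ("Jimmy", "Mazda")],
    22: [("Billy", "Honda"), ("Frank", "Honda")],
    23: [("Hank", "Honda")],
}

def join_with_lineage(saw, drives):
    """Project person from Saw joined with Drives on car, recording for each
    result alternative the set of base alternatives (xid, index) it came from."""
    result = []
    for sid, s_alts in saw.items():
        for did, d_alts in drives.items():
            alts = []
            for i, (_, s_car) in enumerate(s_alts, 1):
                for j, (person, d_car) in enumerate(d_alts, 1):
                    if s_car == d_car:
                        alts.append((person, {(sid, i), (did, j)}))
            if alts:
                result.append(alts)
    return result

def result_instances(saw, drives, suspects):
    """Pick one alternative per base x-tuple; a result alternative is present
    exactly when every alternative in its lineage was picked."""
    base = {**saw, **drives}
    ids = sorted(base)
    seen = set()
    for picks in product(*(range(1, len(base[x]) + 1) for x in ids)):
        chosen = set(zip(ids, picks))
        seen.add(frozenset(
            person for alts in suspects for person, lin in alts if lin <= chosen
        ))
    return seen

suspects = join_with_lineage(saw, drives)
instances = result_instances(saw, drives, suspects)
```

Under these assumptions only four result instances arise. A combination such as Jimmy together with Hank never appears, because their lineage requires different alternatives of Saw tuple 11, which ‘?’ annotations alone could not express.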
Uncertainty-Lineage Databases (ULDBs) • 1. Alternatives • 2. ‘?’ (Maybe) Annotations • 3. Confidences • 4. Lineage • ULDBs are closed and complete • [VLDB 06]
ULDBs: Lineage • Conjunctive lineage sufficient for most operations • Duplicate-elimination: Disjunctive lineage • Difference: Negative lineage • General case after multiple operations/queries: Boolean formula
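The forms of lineage listed above can be pictured as Boolean formulas over base alternatives. The sketch below is only an illustration under assumed conventions: the nested-tuple encoding, the function name `holds`, and the alternative IDs are hypothetical, not Trio's internal representation.

```python
def holds(formula, chosen):
    """Evaluate a lineage formula against the set of chosen base alternatives.

    A formula is either a base alternative (xid, index) or a tuple
    ("and" | "or" | "not", subformula, ...).
    """
    op = formula[0]
    if op == "and":
        return all(holds(f, chosen) for f in formula[1:])
    if op == "or":
        return any(holds(f, chosen) for f in formula[1:])
    if op == "not":
        return not holds(formula[1], chosen)
    return formula in chosen  # leaf: a base alternative

# Conjunctive lineage, as produced by a join.
join_lineage = ("and", (11, 1), (22, 1))
# Disjunctive lineage, as produced by duplicate elimination.
dedup_lineage = ("or", ("and", (11, 1), (22, 1)), ("and", (11, 1), (23, 1)))
# Negative lineage, as produced by difference.
diff_lineage = ("and", (11, 1), ("not", (21, 2)))

# One hypothetical choice of base alternatives.
chosen = {(11, 1), (21, 1), (22, 2), (23, 1)}
```

For this choice, `holds(join_lineage, chosen)` is false while the duplicate-eliminated and difference alternatives are both present.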
ULDBs: Interesting Questions • Data-minimality: extraneous alternatives, extraneous “?” • Lineage-minimality: harder • Membership: tuple and table, some-instance and all-instances • Coexistence: multiple tuples • Extraction: remove tables, retain possible-instances
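Example: Extraneous Data • (Figure: example tables with ‘?’ annotations, one element marked extraneous)
Example: Coexistence • (Figure: example tables with ‘?’ annotations, highlighting tuples that can’t coexist)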
Querying ULDBs: Semantics • Query Q on ULDB D • Operational semantics: D → implementation of Q → D′ = D + Result • Possible-instances path: D → possible instances D1, D2, …, Dn → Q on each instance → Q(D1), Q(D2), …, Q(Dn) • D′ is a representation of the instances Q(D1), Q(D2), …, Q(Dn)
Querying ULDBs: TriQL • Basic TriQL: SQL with new semantics • Obeys commutative diagram for uncertain data • Tracks lineage • Query results: new table or on-the-fly • Implemented TriQL: also built-in predicates conf(), lineage(), lineage*()
Additional TriQL Constructs • [Language manual on web site] • “Horizontal subqueries”: refer to tuple alternatives as a relation • Unmerged (horizontal duplicates) • Flatten, GroupAlts • NoLineage, NoConf, NoMaybe • Query-specified confidences [done] • Data modification statements
Confidence Computation • Confidences computed on demand based on lineage • Confidence of alternative A is a function of confidences in λ*(A) • Permits any query plan for data computation • Default probabilistic interpretation, but queries can override:
SELECT person, min(conf(Saw), conf(Drives)) AS conf FROM Saw, Drives WHERE Saw.car = Drives.car
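On-demand confidence computation can be sketched as tracing an alternative down to its base alternatives and then combining their confidences. This is a simplified sketch under stated assumptions, not the system's algorithm: lineage is purely conjunctive, the base alternatives come from distinct, independent x-tuples, and the names `base_lineage`, `confidence`, and the sample confidence values are hypothetical.

```python
from math import prod

def base_lineage(alt, lineage):
    """Trace an alternative through the lineage map down to base alternatives,
    in the spirit of λ*. An alternative with no lineage entry is base data."""
    parents = lineage.get(alt)
    if not parents:
        return {alt}
    found = set()
    for parent in parents:
        found |= base_lineage(parent, lineage)
    return found

def confidence(alt, lineage, base_conf, combine=prod):
    """Combine the confidences of the base alternatives behind alt.

    The default (product) matches a probabilistic reading only under the
    independence assumption above; passing combine=min mimics a
    query-specified override like the min(...) query on the slide.
    """
    return combine(base_conf[b] for b in base_lineage(alt, lineage))

# Hypothetical lineage and confidences for one result alternative.
lineage = {(31, 1): [(11, 2), (21, 2)]}
base_conf = {(11, 2): 0.4, (21, 2): 0.5}

default_conf = confidence((31, 1), lineage, base_conf)       # product
override_conf = confidence((31, 1), lineage, base_conf, min)  # minimum
```

Tracing to base data before combining matters when intermediate results share base alternatives: collecting them in a set counts each base alternative once.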
Trio System: Version 1 • Clients: TrioExplorer (GUI client), command-line client • DDL commands • TriQL queries • Schema browsing • Table browsing • Explore lineage • On-demand confidence computation • Trio API and translator (Python), issuing standard SQL • Standard relational DBMS, holding: • Encoded Data Tables: “verticalize”, shared IDs for alternatives, columns for confidence and “?” • Trio Metadata: table types, schema-level lineage structure • Lineage Tables: one per result table, uses unique IDs • Trio Stored Procedures: conf(), lineage() “==>”, lineage*() “==>>”
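The idea of encoding uncertain tables in a conventional DBMS can be pictured with a toy schema. Everything below is an assumption made for illustration: the table and column names (`saw_enc`, `suspects_enc`, `suspects_lin`, `aid`, `xid`, `maybe`), the ID values, and the helper functions are hypothetical and are not Trio's actual encoding. SQLite stands in for the relational DBMS.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- One row per alternative; alternatives of one x-tuple share xid.
    CREATE TABLE saw_enc (aid INTEGER PRIMARY KEY, xid INTEGER,
                          witness TEXT, car TEXT, conf REAL, maybe INTEGER);
    CREATE TABLE suspects_enc (aid INTEGER PRIMARY KEY, xid INTEGER,
                               person TEXT, conf REAL, maybe INTEGER);
    -- One lineage table per result table, keyed by unique alternative IDs.
    CREATE TABLE suspects_lin (aid INTEGER, src_table TEXT, src_aid INTEGER);
""")
conn.executemany("INSERT INTO saw_enc VALUES (?, ?, ?, ?, ?, ?)", [
    (111, 11, "Cathy", "Honda", None, 0),
    (112, 11, "Cathy", "Mazda", None, 0),
])
conn.execute("INSERT INTO suspects_enc VALUES (311, 31, 'Jimmy', NULL, 1)")
conn.executemany("INSERT INTO suspects_lin VALUES (?, ?, ?)", [
    (311, "saw_enc", 112),
    (311, "drives_enc", 212),
])

def alternatives(conn, table, xid):
    """Return the alternative IDs that make up one x-tuple.
    The table name is interpolated directly, so pass trusted names only."""
    rows = conn.execute(
        f"SELECT aid FROM {table} WHERE xid = ? ORDER BY aid", (xid,))
    return [row[0] for row in rows]

def lineage(conn, aid):
    """Return the (source table, source alternative ID) pairs for one result."""
    rows = conn.execute(
        "SELECT src_table, src_aid FROM suspects_lin "
        "WHERE aid = ? ORDER BY src_table", (aid,))
    return rows.fetchall()
```

In this toy layout, `alternatives(conn, "saw_enc", 11)` recovers the two alternatives of one x-tuple, and `lineage(conn, 311)` is an ordinary SQL lookup, which is the appeal of layering the system on a standard DBMS.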
Current & Future Topics • Algorithms: confidence computation, coexistence, extraneous data • Minimize lineage traversal • Memoization • Batch operations • System • Full query language • More internal processing? • Storage and indexing • Statistics and query optimization
Current & Future Topics • Top-K by confidence • Extend basic uncertainty model • Incomplete relations • Continuous uncertainty • Correlated uncertainty? • External lineage, update lineage, versioning