Uncertainty Lineage Data Bases Very Large Data Bases 1975 2006
ULDBs: Databases with Uncertainty and Lineage • Omar Benjelloun, Anish Das Sarma, Alon Halevy, Jennifer Widom • Stanford InfoLab
Motivation • Many applications involve data that is uncertain (approximate, probabilistic, inexact, incomplete, imprecise, fuzzy, inaccurate, ...) • Many of the same applications need to track the lineage of their data • Neither uncertainty nor lineage is supported by conventional DBMSs • Coincidence or fate?
Sample Applications Needing Uncertainty and Lineage • Scientific databases • Sensor databases • Data cleaning • Data integration • Information extraction
Trio Project Building a new kind of DBMS in which: • Data • Uncertainty • Lineage are all first-class interrelated concepts
Lineage and Uncertainty: Coincidence or Fate? • There is a lot of independent work on lineage and on uncertainty (related work at end of talk) • It turns out that the connection between uncertainty and lineage goes deeper than a need shared by several applications
Lineage and Uncertainty • Lineage enables simple and consistent representation of uncertain data • Lineage correlates uncertainty in query results with uncertainty in the input data • Lineage can make computation over uncertain data more efficient
Outline of the Talk • The ULDB data model • Querying ULDBs • ULDB properties • Membership and extraction operations • Confidences • Current, related, and future work
Running Example: Crime Solver • Saw(witness, car) • Drives(person, car) • Suspects(person) = π_person(Saw ⋈ Drives)
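As a minimal sketch of the running example over ordinary (certain) relations, the query can be read as a join on car followed by a projection on person. The sample rows and the function name `suspects` are illustrative assumptions, not data from the talk.

```python
def suspects(saw, drives):
    """Return pi_person(Saw join Drives), joining on the shared car attribute."""
    return {
        person
        for (_witness, seen_car) in saw
        for (person, driven_car) in drives
        if seen_car == driven_car
    }

# Hypothetical certain data, for illustration only.
saw = {("Amy", "Honda"), ("Betty", "Acura")}
drives = {("Jimmy", "Honda"), ("Hank", "Toyota")}

result = suspects(saw, drives)
print(result)
```

Only Jimmy drives a car that some witness saw, so he is the only suspect here. The rest of the talk asks what this query should return when the rows themselves are uncertain.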
Uncertainty • An uncertain database represents a set of possible instances. Examples: • Amy saw either a Honda or a Toyota • Jimmy drives a Toyota, a Mazda, or both • Betty saw an Acura with confidence 0.5 or a Toyota with confidence 0.3 • Hank is a suspect with confidence 0.7
Uncertainty in a ULDB 1. Alternatives 2. ‘?’ (Maybe) Annotations 3. Confidences
Uncertainty in a ULDB 1. Alternatives: uncertainty about value 2. ‘?’ (Maybe) Annotations 3. Confidences [Figure: example relation with alternatives, equivalent to three possible instances]
Uncertainty in a ULDB 1. Alternatives 2. ‘?’ (Maybe): uncertainty about presence 3. Confidences [Figure: example relation with a ‘?’ annotation, giving six possible instances]
Uncertainty in a ULDB 1. Alternatives 2. ‘?’ (Maybe) Annotations 3. Confidences: weighted uncertainty [Figure: example relation with confidences, giving six possible instances, each with a probability]
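The three constructs above can be sketched as a small enumerator of possible instances. This is an illustration under stated assumptions, not Trio's implementation: tuples are assumed independent, a tuple whose confidences sum to less than 1 is treated as a maybe-tuple, and the rows and confidence values are hypothetical (chosen so that one tuple with three alternatives plus one maybe-tuple yields six instances, matching the counts on the slides).

```python
from itertools import product

ABSENT = None  # stands for "this tuple contributes nothing to the instance"

def possible_instances(xtuples):
    """Enumerate (instance, probability) pairs.

    Each x-tuple is a list of (value, confidence) alternatives. If the
    confidences sum to less than 1, the tuple is treated as a maybe-tuple
    and the remaining mass is the probability that it is absent.
    Tuples are assumed to be independent of each other.
    """
    choices_per_tuple = []
    for alternatives in xtuples:
        choices = list(alternatives)
        leftover = 1.0 - sum(conf for _, conf in alternatives)
        if leftover > 1e-9:
            choices.append((ABSENT, leftover))
        choices_per_tuple.append(choices)

    instances = []
    for combo in product(*choices_per_tuple):
        rows = frozenset(value for value, _ in combo if value is not ABSENT)
        probability = 1.0
        for _, conf in combo:
            probability *= conf
        instances.append((rows, probability))
    return instances

# Hypothetical Saw relation: one tuple with three alternatives,
# and one maybe-tuple (confidence 0.6, so absent with probability 0.4).
saw = [
    [(("Amy", "Honda"), 0.5), (("Amy", "Toyota"), 0.3), (("Amy", "Mazda"), 0.2)],
    [(("Betty", "Acura"), 0.6)],
]
instances = possible_instances(saw)
for rows, probability in instances:
    print(sorted(rows), round(probability, 3))
```

Enumeration is exponential in the number of tuples, so it is useful only for checking small examples by hand.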
Data Models for Uncertainty • Our model (so far) is not especially new • We spent some time exploring the space of models for uncertainty [ICDE 2006] • Tension between understandability and expressiveness • Our model is understandable • But it is not complete, or even closed under common operations
Closure and Completeness • Completeness: can represent all sets of possible instances • Closure: can represent the results of operations • Note: Completeness ⇒ Closure
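Model (so far) Not Closed • Suspects = π_person(Saw ⋈ Drives) • [Figure: example Saw and Drives relations and a Suspects result with three ‘?’ tuples] • The model so far CANNOT correctly capture the possible instances in the result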
Lineage to the Rescue • Lineage: “where data came from” • Internal lineage • External lineage (not covered in this talk) • In ULDBs: a function λ from alternatives to sets of alternatives (or external sources)
Example with Lineage • Suspects = π_person(Saw ⋈ Drives) • λ(31) = (11,2), (21,2) • λ(32,1) = (11,1), (22,1); λ(32,2) = (11,1), (22,2) • λ(33) = (11,1), 23 • Each Suspects tuple is annotated with ‘?’ • Correctly captures the possible instances in the result
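A minimal sketch of why lineage restores correctness: a derived alternative is present exactly when every alternative in its lineage is present. The λ entries below are taken from the slide; the number of alternatives per base tuple, the absence of ‘?’ on base tuples, and the encoding of single-alternative tuples as `(id, 1)` are assumptions made for illustration.

```python
from itertools import product

# Base tuple id -> number of alternatives (assumed; base tuples have no '?').
base = {11: 2, 21: 2, 22: 2, 23: 1}

# Lineage from the slide, with single-alternative tuples written as (id, 1).
lineage = {
    (31, 1): {(11, 2), (21, 2)},
    (32, 1): {(11, 1), (22, 1)},
    (32, 2): {(11, 1), (22, 2)},
    (33, 1): {(11, 1), (23, 1)},
}

def derived_instances(base, lineage):
    """Return the distinct sets of derived alternatives over all base choices."""
    tuple_ids = sorted(base)
    results = set()
    for picks in product(*(range(1, base[t] + 1) for t in tuple_ids)):
        chosen = set(zip(tuple_ids, picks))
        # A derived alternative exists iff its whole lineage was chosen.
        present = frozenset(
            alt for alt, sources in lineage.items() if sources <= chosen
        )
        results.add(present)
    return results

instances = derived_instances(base, lineage)
for instance in sorted(instances, key=sorted):
    print(sorted(instance))
```

Under these assumptions only four result instances exist. Tuples 31 and 33 never appear together, because they depend on different alternatives of tuple 11; three independent ‘?’ tuples could not express that.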
ULDBs • Alternatives • ‘?’ (Maybe) Annotations • Confidences • Lineage ULDBs are Closed and Complete
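Querying ULDBs • Query Q on ULDB D • Semantics: D represents possible instances D1, D2, …, Dn; applying Q on each instance gives Q(D1), Q(D2), …, Q(Dn) • Implementation of Q: operates directly on D and produces D’ = D + Result, a representation of the instances Q(D1), Q(D2), …, Q(Dn)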
Well-Behaved ULDBs • If we start with a well-behaved ULDB and perform standard queries, it remains well-behaved • Intuitively (details in paper): • Acyclic: no cycles in the lineage • Deterministic: non-empty lineages of distinct alternatives are distinct • Uniform: alternatives of the same tuple are derived from the same set of tuples
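The three intuitive conditions above can be sketched as checks over a lineage map. This follows the informal wording on the slide, not the paper's formal definitions; the representation (a dict from `(tuple_id, alternative_index)` to a set of such pairs) and all function names are assumptions.

```python
def is_acyclic(lineage):
    """True if following lineage never returns to the starting alternative."""
    state = {}  # alternative -> "visiting" or "done"

    def visit(alt):
        if state.get(alt) == "done":
            return True
        if state.get(alt) == "visiting":
            return False  # reached an alternative that is still on the path
        state[alt] = "visiting"
        ok = all(visit(source) for source in lineage.get(alt, ()))
        state[alt] = "done"
        return ok

    return all(visit(alt) for alt in lineage)

def is_deterministic(lineage):
    """True if non-empty lineages of distinct alternatives are distinct."""
    non_empty = [frozenset(sources) for sources in lineage.values() if sources]
    return len(non_empty) == len(set(non_empty))

def is_uniform(lineage):
    """True if alternatives of one tuple derive from the same set of tuples."""
    source_tuples = {}
    for (tuple_id, _), sources in lineage.items():
        ids = frozenset(t for t, _ in sources)
        if source_tuples.setdefault(tuple_id, ids) != ids:
            return False
    return True

# Hypothetical lineage in which tuple 32 has two alternatives.
lineage = {
    (31, 1): {(11, 2), (21, 2)},
    (32, 1): {(11, 1), (22, 1)},
    (32, 2): {(11, 1), (22, 2)},
}
well_behaved = (
    is_acyclic(lineage) and is_deterministic(lineage) and is_uniform(lineage)
)
print(well_behaved)
```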
ULDB Minimality • Data-minimality • Does every alternative appear in some possible instance? (no extraneous alternatives) • Is every maybe-tuple in R absent from some possible instance? (no extraneous ‘?’s) • Lineage-minimality
Data-Minimality Examples: Extraneous ‘?’ • λ(20,1) = (10,1); λ(20,2) = (10,2) • [Figure: tuples 10 and 20, with one ‘?’ marked as extraneous]
Data-Minimality Examples: Extraneous alternative • [Figure: relations with ‘?’ tuples, with one alternative marked as extraneous]
Data-Minimization • Extraneous alternative theorem: an alternative is extraneous iff it is (possibly transitively) derived from multiple alternatives of the same tuple • Extraneous ‘?’ theorem: a ‘?’ on tuple t is extraneous iff both hold: • t is derived from base tuples without ‘?’ • t has as many alternatives as the product of the numbers of alternatives of its base tuples • Minimization algorithm based on the theorems (see paper)
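The extraneous alternative theorem can be sketched as a reachability check: collect the transitive lineage of an alternative and test whether it contains two different alternatives of one tuple. This illustrates the theorem as stated on the slide and is not the paper's minimization algorithm; the lineage map and the derived alternative `(40, 1)` are hypothetical.

```python
def transitive_lineage(alt, lineage):
    """Return every alternative reachable from alt by following lineage."""
    seen = set()
    stack = list(lineage.get(alt, ()))
    while stack:
        source = stack.pop()
        if source not in seen:
            seen.add(source)
            stack.extend(lineage.get(source, ()))
    return seen

def is_extraneous(alt, lineage):
    """True if alt depends on two different alternatives of the same tuple."""
    chosen = {}  # tuple id -> alternative index required by alt
    for tuple_id, index in transitive_lineage(alt, lineage):
        if chosen.setdefault(tuple_id, index) != index:
            return True
    return False

# Hypothetical lineage: (40, 1) joins two tuples that need different
# alternatives of tuple 11, so it can never appear in any instance.
lineage = {
    (31, 1): {(11, 2), (21, 2)},
    (33, 1): {(11, 1), (23, 1)},
    (40, 1): {(31, 1), (33, 1)},
}
print(is_extraneous((40, 1), lineage))
print(is_extraneous((31, 1), lineage))
```

Alternative `(40, 1)` requires both `(11, 1)` and `(11, 2)`, which no single possible instance contains, so it can be dropped without changing the represented instances.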
ULDB Properties and Operations • [Diagram relating the operations Queries, Data-minimize, Lineage-minimize, Extraction, and Membership to the properties Data-minimal and Lineage-minimal]
Membership Questions (R has possible instances I1, I2, …, In) • Does a given tuple t appear in some (all) possible instance(s) of R? Polynomial algorithms based on data-minimization • Is a given table T one of (all of) the possible instances of R? NP-hard
Extraction • [Figure: relations Saw, Drives, Eats, and Suspects] • Extraction algorithm in paper
Confidences • Confidences supplied with base data • Trio computes confidences on query results • Default probabilistic interpretation • Can choose to plug in different arithmetic • [Figure: example comparing result confidences under the Probabilistic and Min interpretations; values shown include 0.3, 0.4, and 0.6]
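A minimal sketch of what "plugging in different arithmetic" could look like for a result derived from several base alternatives. The function names and the input confidences are assumptions for illustration and are not Trio's interface; the probabilistic rule below additionally assumes the sources are independent.

```python
import math

def probabilistic(confidences):
    """Multiply confidences, assuming the sources are independent."""
    return math.prod(confidences)

def minimum(confidences):
    """Take the weakest source as the confidence of the result."""
    return min(confidences)

def result_confidence(source_confidences, arithmetic=probabilistic):
    """Combine the confidences of the sources using the chosen arithmetic."""
    return arithmetic(source_confidences)

# Hypothetical confidences of two base alternatives joined by a query.
sources = [0.5, 0.6]
prob_conf = result_confidence(sources)
min_conf = result_confidence(sources, arithmetic=minimum)
print(round(prob_conf, 3), min_conf)
```

Passing the arithmetic as a parameter keeps the data computation unchanged while the interpretation of confidences varies.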
Query Processing with Confidences • Previous approach (probabilistic databases): • Each operator computes confidences during query execution • Only certain query plans allowed • In ULDBs: • Confidence of alternative A is a function of the confidences in its transitive lineage • Our approach: decouple data and confidence computation • Use any query plan for data computation • Compute confidences on demand using lineage • Can give arbitrarily large improvements
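The decoupled approach can be sketched as follows: the query produces data and lineage only, and a confidence is computed later, on demand, from the base alternatives in the transitive lineage. This sketch assumes purely conjunctive lineage (as produced by joins), independent base tuples, and acyclic lineage; the confidence values and the derived alternatives `(41, 1)` and `(42, 1)` are hypothetical, and this is not presented as Trio's algorithm.

```python
import math

def base_lineage(alt, lineage):
    """Return the base alternatives that alt ultimately depends on."""
    sources = lineage.get(alt)
    if not sources:
        return {alt}  # alt is itself a base alternative
    result = set()
    for source in sources:
        result |= base_lineage(source, lineage)
    return result

def confidence(alt, lineage, base_conf):
    """Probability that all base alternatives required by alt co-occur."""
    bases = base_lineage(alt, lineage)
    tuple_ids = [tuple_id for tuple_id, _ in bases]
    if len(tuple_ids) != len(set(tuple_ids)):
        return 0.0  # needs two alternatives of one tuple: never present
    return math.prod(base_conf[b] for b in bases)

# Hypothetical base confidences (alternatives of one tuple sum to 1).
base_conf = {
    (11, 1): 0.7, (11, 2): 0.3,
    (21, 1): 0.4, (21, 2): 0.6,
    (22, 1): 0.5, (22, 2): 0.5,
    (23, 1): 1.0,
}
lineage = {
    (31, 1): {(11, 2), (21, 2)},
    (32, 1): {(11, 1), (22, 1)},
    (33, 1): {(11, 1), (23, 1)},
    (41, 1): {(32, 1), (33, 1)},  # hypothetical second-level result
    (42, 1): {(31, 1), (33, 1)},  # hypothetical, depends on both (11,1) and (11,2)
}

c31 = confidence((31, 1), lineage, base_conf)
c41 = confidence((41, 1), lineage, base_conf)
c42 = confidence((42, 1), lineage, base_conf)
print(round(c31, 3), round(c41, 3), c42)
```

Multiplying the confidences of the direct sources of `(41, 1)` would give 0.35 × 0.7 = 0.245, which counts the shared base alternative `(11, 1)` twice. Going back to the base data through lineage gives 0.35, and it works regardless of the plan that produced the data.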
Current Work: Algorithms • Algorithms: confidence computation, extraneous data, membership questions • Minimize lineage traversal • Memoization • Batch computations
The Trio Trio • Data Model: ULDBs (Coming: incomplete relations; continuous uncertainty; correlation uncertainty) • Query Language: TriQL • Simple extension to SQL • Query uncertainty, confidences, and lineage • System • Did you see our demo? • Version 1: Entirely on top of conventional DBMS • Surprisingly easy and complete, reasonably efficient
Brief Related Work • Uncertainty, modeling: C-tables [IL84], Probabilistic Databases [CP87], using Nested Relations [F90] • Uncertainty, systems: ProbView [LLRS97], MYSTIQ [BDM+05], ORION [CSP05], Trio [BDHW05] • Lineage: DBNotes [CTV05], Data Warehouses [CW03]
Thank You • UNCERTAINTY, LINEAGE, DATA: but don’t forget the lineage… • Search “stanford trio” (or http://i.stanford.edu/trio)