This presentation discusses the concept of approximate lineage in probabilistic databases and its importance in tracking correlations, explaining query results, and improving query processing efficiency. The approach presented focuses on reducing redundancy and approximating the most important correlations in the lineage. It also includes experiments and conclusions.
Approximate Lineage for Probabilistic Databases Christopher Ré and Dan Suciu University of Washington
Approximate Lineage in One Slide • Lineage (Provenance): in a view, lineage is all derivations of a tuple • In QP, used to track correlations (probabilistic databases) • Explains query/view results • VLDBs have lots of lineage, especially with complex queries/views • Chokes QP • Hard for users to understand • Obs: lineage contains a lot of redundancy! • This work: approximate the lineage by keeping only the most important correlations
Overview • Motivation & Preliminaries • An apx lineage approach: Sufficient Lineage • Experiments • Conclusions
A Protein Database • Inspired by the Gene Ontology (GO) database • Standard pDB, e.g. Mystiq, Trio • Data are from somewhere: lineage (l) is important • Lineage comes from a wide variety of sources, not all trusted the same: manually created, machine inferred, some with confidence, too! [Figure: table Process (P), each tuple annotated with its atoms]
Review: Lineage tracking • PRA [Fuhr & Rolleke 97], Trio [Widom 05], Mystiq [Ré, Dalvi, Suciu 07] • Lineage propagates with queries/views • Example: "Proteins related to same process as 'Aac11'": V(y) :- P(x,y), P('Aac11', y), x ≠ 'Aac11' • How do we derive the lineage? Lineage tracks all derivations; l1 is the lineage of an output tuple over table Process (P) • Prob QP: Pr[V('AGO2')] = Pr[l1] • Big DB = Big Lineage: in GO, 1 tuple can have 10MB of lineage! • Big lineage chokes the engine!
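To make "lineage tracks all derivations" concrete, here is a minimal sketch of deriving lineage for the example view. Everything in it is an assumption for illustration: the schema P(protein, process), the rows, the atom names t1..t5, the protein Xyz1 and the helper view_lineage are all hypothetical. Only the names Aac11 and AGO2 come from the slide. Each output tuple gets a DNF with one monomial per derivation.

```python
from collections import defaultdict

# Hypothetical rows of P(protein, process); the third field is the atom
# (Boolean variable) attached to that row. None of this is real GO data.
P = [
    ("Aac11", "proc_a", "t1"),
    ("Aac11", "proc_b", "t2"),
    ("AGO2", "proc_a", "t3"),
    ("AGO2", "proc_b", "t4"),
    ("Xyz1", "proc_b", "t5"),
]

def view_lineage(rows, target="Aac11"):
    """Map each protein sharing a process with `target` to its lineage.

    The lineage is a DNF stored as a set of monomials, and each monomial
    is a frozenset holding the two atoms used by one derivation.
    """
    lineage = defaultdict(set)
    for x, y, atom_x in rows:
        if x == target:
            continue  # the view excludes the target protein itself
        for x2, y2, atom_t in rows:
            if x2 == target and y2 == y:
                lineage[x].add(frozenset({atom_x, atom_t}))
    return dict(lineage)

lin = view_lineage(P)
# AGO2 has two derivations, so its lineage is (t1 and t3) or (t2 and t4).
```

On a large database the same protein can be derived in a very large number of ways, which is how a single output tuple ends up with a very large formula.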
Problems with Large Lineage in pDB (this talk) • Lineage is used to: • Process queries. Large: chokes QP • Give explanations to users. Large: many redundant explanations • Find influential atoms. Large: needle in a haystack • On VLDBs, helpful to shrink (approximate) the lineage
Approximate Lineage Approach • Original VLDB: Level 1 Database (big lineage, l) • Level 2 Database (small lineage): a smaller, approximate formula a, within error e of l • All (most) querying on Level 2 database (using a instead of l) • Focus is on the Level 2 database
Overview • Motivation & Preliminaries • An apx lineage approach: Sufficient Lineage • Experiments • Conclusions
Sufficient lineage (SL) • Represent a's? DNF formulae that logically imply l • Use a's to: • Answer queries? Reuse existing systems! a is a lower bound of l • Provide explanations? Find influential tuples? See paper • Build good a, efficiently? The remainder of this talk • Nugget: an algorithm that always finds small, good SL
Formalizing "good" a's • Choosing an approximation a for a lineage function l • An atom is a Boolean proposition. A world is a set of the true atoms • Formalizing this: E[l - a] ≤ e • The expectation of the difference, over all worlds, should be small • Intuition: a should agree with l on most worlds • NB: really standard ℓ2 distance
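The error condition on this slide can be written out as follows. This is a sketch that writes λ, α and ε for the slide's l, a and e (a notational assumption), and assumes α logically implies λ, as stated for sufficient lineage. The sum ranges over all worlds W.

```latex
\[
  \mathbb{E}[\lambda - \alpha]
  \;=\; \sum_{W} \Pr[W]\,\bigl(\lambda(W) - \alpha(W)\bigr)
  \;\le\; \varepsilon .
\]
% Because \alpha implies \lambda, the difference \lambda(W) - \alpha(W)
% is either 0 or 1 in every world, so it equals its own square:
\[
  \mathbb{E}[\lambda - \alpha]
  \;=\; \mathbb{E}\bigl[(\lambda - \alpha)^{2}\bigr]
  \;=\; \Pr[\lambda] - \Pr[\alpha] .
\]
```

Under that assumption the expected difference coincides with the squared ℓ2 distance between the two Boolean functions, which is one way to read the slide's remark.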
Illustrating Good Lineage • E[l - a] = E[l] - E[a] ≤ e • Pr[l] = 0.9 * (1 - (1 - 0.8)(1 - 0.3)) = 0.9 * 0.86 = 0.774 • Pr[a] = 0.9 * 0.8 = 0.72 • e = 0.774 - 0.72 = 0.054 • Intuition: Pr[a] high means good lineage
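The arithmetic on this slide can be checked by brute force. The sketch below assumes the lineage behind the numbers is l = (x and y) or (x and z) with Pr[x] = 0.9, Pr[y] = 0.8, Pr[z] = 0.3, and that a keeps only the monomial (x and y). The atom names and the helper prob are hypothetical. Enumerating all worlds is only feasible for tiny formulas.

```python
from itertools import product

def prob(dnf, p):
    """Exact probability of a DNF over independent atoms.

    dnf is a list of monomials (sets of atom names) and p maps each atom
    to its marginal probability. Every world is enumerated explicitly.
    """
    atoms = sorted({a for m in dnf for a in m})
    total = 0.0
    for bits in product([False, True], repeat=len(atoms)):
        world = dict(zip(atoms, bits))
        if any(all(world[a] for a in m) for m in dnf):
            weight = 1.0
            for a in atoms:
                weight *= p[a] if world[a] else 1.0 - p[a]
            total += weight
    return total

p = {"x": 0.9, "y": 0.8, "z": 0.3}
l = [{"x", "y"}, {"x", "z"}]   # assumed original lineage
a = [{"x", "y"}]               # assumed sufficient lineage (one monomial)

error = prob(l, p) - prob(a, p)
```

Because a is a subset of the monomials of l, every world that satisfies a also satisfies l, so the error is never negative.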
1st step: Lineage DNFs to "graphs" • Example: (X1 ∧ Y1) ∨ (X2 ∧ Y1) • We can think of DNFs as graphs (a k-DNF is a k-hypergraph) • Atoms = nodes • Monomials = edges • Trick: a matching is an SL formula • Goal: given error e, find a subset of edges with error smaller than e and small size, i.e. a best lower bound [Figure: bipartite graph on nodes X1..Xn and Y1..Ym]
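A minimal sketch of the graph view: monomials are edges, and a greedy pass keeps every edge that shares no atom with the edges already kept. The first two monomials are the slide's example. The third monomial, the probabilities of 0.5 and the function names are assumptions added for illustration. Since the kept monomials share no atoms they are independent, so their probability has a simple closed form.

```python
def greedy_matching(monomials):
    """Scan monomials in order and keep each one that is disjoint from
    all monomials kept so far. The result is a maximal matching."""
    used = set()
    matching = []
    for m in monomials:
        if used.isdisjoint(m):
            matching.append(m)
            used |= m
    return matching

def matching_prob(matching, p):
    """Pr of a disjunction of pairwise disjoint monomials:
    1 minus the product of (1 - Pr[monomial])."""
    miss = 1.0
    for m in matching:
        hit = 1.0
        for atom in m:
            hit *= p[atom]
        miss *= 1.0 - hit
    return 1.0 - miss

dnf = [
    frozenset({"X1", "Y1"}),
    frozenset({"X2", "Y1"}),
    frozenset({"X2", "Y2"}),   # hypothetical extra edge, not on the slide
]
p = {atom: 0.5 for m in dnf for atom in m}  # assumed probabilities

matching = greedy_matching(dnf)
pr = matching_prob(matching, p)
```

The matching is a subset of the original monomials, so it logically implies the full DNF and is a valid lower bound.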
How big a matching could we need? • Assume Pr[Xi] = Pr[Yj] = 0.5 • Pr[M] = 1 - (1 - 0.25)^|M| • A matching of size 9 implies Pr[M] > 0.9 • For any e > 0.1, a matching of size at most 9 is always enough • Subtle: the size bound depends on k, e and Pr[Xi], not on the # of tuples • If l has a small good matching, take a to be the matching. Call this a "good enough matching"
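The size bound can be computed directly. The sketch below assumes every atom has the same probability p and every monomial has exactly k atoms, as in the slide's example. The function name is hypothetical. It returns the smallest matching size whose probability of missing is at most e, which bounds the error because Pr[l] can never exceed 1.

```python
def min_matching_size(p, k, e):
    """Smallest size m with (1 - p**k)**m <= e.

    A matching of that size has probability at least 1 - e, so its error
    against any lineage l is at most e. Assumes 0 < p <= 1 and e > 0.
    """
    miss_one = 1.0 - p ** k   # chance that a single monomial is false
    size = 0
    miss_all = 1.0            # chance that every chosen monomial is false
    while miss_all > e:
        size += 1
        miss_all *= miss_one
    return size

size_for_slide = min_matching_size(p=0.5, k=2, e=0.1)
```

No database size appears among the arguments, which is the point of the slide: the bound depends only on k, e and the atom probabilities.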
There is not always a good-enough matching • Example: (X1 ∧ (Y1 ∨ Y2 ∨ … ∨ Ym)) ∨ (X2 ∧ Z) • Best matching is about 0.4, but the formula is very close to 0.625! • Formally, {X1, X2} is a small cover • (Y1 ∨ Y2 ∨ … ∨ Ym) is a (k-1)-DNF: approximate it as X1 ∧ APX(Y1 ∨ Y2 ∨ … ∨ Ym) • Must apx the (k-1)-DNF with a smaller e to account for correlations • Obs: if there is no "good-enough matching", then the cover (the nodes in any maximal matching) must be small
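The two numbers on this slide can be reproduced under the assumption that every atom has probability 0.5, carried over from the previous slide, and reading the second monomial as X2 ∧ Z, which is the reading that matches the 0.625 limit. The function names are hypothetical.

```python
def best_matching_prob(p):
    """Any matching here has at most two edges, e.g. (X1, Y1) and (X2, Z)."""
    return 1.0 - (1.0 - p * p) ** 2

def formula_prob(p, m):
    """Pr of (X1 and (Y1 or ... or Ym)) or (X2 and Z), independent atoms.

    The two disjuncts share no atoms, so they are independent.
    """
    left = p * (1.0 - (1.0 - p) ** m)   # X1 and at least one Yi
    right = p * p                       # X2 and Z
    return 1.0 - (1.0 - left) * (1.0 - right)

matching = best_matching_prob(0.5)
full = formula_prob(0.5, 20)
```

The matching stays at 0.4375 however large m grows, while the formula approaches 0.625, so no matching can bring the error below roughly 0.19 in this example.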
SL is always small • THM (SL is always small): the size of SL is constant in the data • Two cases: • Small good matching: we're done! • Small cover of important nodes: recurse on the (k-1)-DNF • Requires "non-vanishing" probs; in datasets, usually Pr > 10^-3 • Exponential in the query, similar to data complexity • Problem: maximum matching in general hypergraphs is NP-hard, and approximating it is NP-hard too! We only need a maximal matching, so pick greedily!
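The two cases can be put together in a short recursive sketch. This is an illustration under stated assumptions, not the paper's algorithm. It tests "good enough" by computing exact probabilities with brute-force enumeration, which only works on toy formulas, and it splits the error budget evenly across the cover nodes, which is one simple choice. All function names are hypothetical.

```python
from itertools import product

def prob(dnf, p):
    """Exact Pr[dnf] by enumerating worlds (toy formulas only)."""
    atoms = sorted({a for m in dnf for a in m})
    total = 0.0
    for bits in product([False, True], repeat=len(atoms)):
        world = dict(zip(atoms, bits))
        if any(all(world[a] for a in m) for m in dnf):
            weight = 1.0
            for a in atoms:
                weight *= p[a] if world[a] else 1.0 - p[a]
            total += weight
    return total

def greedy_matching(dnf):
    """Maximal matching: keep monomials disjoint from those already kept."""
    used, matching = set(), set()
    for m in sorted(dnf, key=sorted):   # fixed order for reproducibility
        if used.isdisjoint(m):
            matching.add(m)
            used |= m
    return matching

def sufficient_lineage(dnf, p, e):
    """Return a subset of the monomials of dnf whose error is at most e."""
    dnf = set(dnf)
    if frozenset() in dnf:              # empty monomial = constant True
        return {frozenset()}
    matching = greedy_matching(dnf)
    if prob(dnf, p) - prob(matching, p) <= e:
        return matching                 # case 1: good-enough matching
    # Case 2: every monomial touches the cover, so recurse per cover node
    # on the smaller DNF obtained by removing that node.
    cover = sorted(set().union(*matching))
    result = set()
    for x in cover:
        residual = {m - {x} for m in dnf if x in m}
        for m in sufficient_lineage(residual, p, e / len(cover)):
            result.add(m | {x})
    return result

# Toy lineage: (X1 and (Y1 or ... or Y6)) or (X2 and Z), all atoms at 0.5.
l = {frozenset({"X1", f"Y{i}"}) for i in range(1, 7)} | {frozenset({"X2", "Z"})}
p = {atom: 0.5 for m in l for atom in m}

loose = sufficient_lineage(l, p, 0.2)    # a matching is good enough
tight = sufficient_lineage(l, p, 0.05)   # forces the cover case
```

With the loose budget the greedy matching of two monomials is returned. With the tight budget the recursion takes over, and on this small example it ends up keeping every monomial.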
Summary of Constructing SL • For SL, good lineage = big lineage • Not true in general. • Gave an algorithm that always finds small SL • Constant in the data • Exponential in almost everything else • Main trick: Don’t try to find optimal solutions, when sloppy is good enough!
Other fun results in the paper • Sufficient Lineage (SL) • Error bounds for QP • Finding influential tuples • Polynomial Lineage (PL): DNF to polynomial • Use Taylor/Fourier approximation of poly • Algos for QP, explanations and influential tuples • Leverage extensive prior art! PL smaller than SL, but not usable in pDBs (Mystiq, Trio).
Overview • Motivation & Preliminaries • An apx lineage approach: Sufficient Lineage • Experiments • Conclusions
Experiments • Gene Ontology database • Publicly available • Predefined views • Atoms = "evidence codes" • Discuss a single view: "All proteins associated with a single protein" • 6 tables • 2 sources of evidence • 1119 tuples • 141MB • Similar results on IMDB data (not presented)
Compression Ratio v. Error • 30x compression: 141MB to 4MB • Good compression ratio even for stringent error [Plot: compression ratio vs. e, error level (smaller is more conservative)]
Effect on QP • Compute each tuple in the view [Plot: running time in seconds (log10 scale) vs. e, error level (smaller is more conservative); series: Original Lineage, Sufficient Lineage]
Which l's give the biggest gain? • Win: compressing big terms [Plot: compressing a single view; # terms for the top 500 formulas in descending size (# is rank); series: Original Lineage, Sufficient Lineage]
Conclusion • Discussed approximate lineage approach • Goal: Fast QP, Explanations • Sufficient Lineage • Can be used by standard QPs • Improves QP dramatically • Apx lineage is more general, e.g. Polynomial