Circuits for Datalog Provenance
Daniel Deutch (Tel Aviv Univ.), Tova Milo (Tel Aviv Univ.), Sudeepa Roy (Univ. of Washington), Val Tannen (Univ. of Pennsylvania)
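A Simple Example of Data Provenance
• “Boolean Provenance/Lineage” as a Boolean formula
• Q is true on D $\Leftrightarrow$ $F_{Q,D}$ is true
• Poly-size, poly-time computable (data complexity)
• But Q is an RA+ query
• This talk: What if Q is a Datalog program?
Boolean query Q: $\exists x\, \exists y\; \mathrm{AsthmaPatient}(x) \wedge \mathrm{Friend}(x, y) \wedge \mathrm{Smoker}(y)$
$F_{Q,D} = (x_1 \wedge y_1 \wedge z_1) \vee (x_1 \wedge y_2 \wedge z_2) \vee (x_2 \wedge y_3 \wedge z_2)$
[Figure: database D, whose tuples are annotated with the variables x1, x2, y1, y2, y3, z1, z2]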
Motivation
• Provenance
  • Reliability and repeatability
  • View management and deletion propagation
  • Trust and security management
  • Query answering in probabilistic databases, …
• Datalog
  • Datalog is popular again! (two keynotes this ICDT/EDBT)
  • Data extraction on the Web, declarative networking
  • Academic/commercial systems (Webdamlog, LogicBlox, Dedalus, Dyna)
• Finding a suitable “Provenance for Datalog” is important
  • Both from theoretical and practical viewpoints
• How do we compute, store, and interpret provenance for Datalog programs efficiently and effectively?
Overview of Our Results
• Can we get poly-size Boolean formulas for Datalog provenance?
  • No, even if we allow unbounded time
• Do we have a solution?
  • Yes! Use Boolean circuits!
• What about general “provenance semirings” beyond Boolean provenance? (ref. [Green et al. ’07])
  • It depends on the semiring
Outline
• Background
• Circuits for Boolean Provenance
• Circuits for General Provenance Semirings
Datalog
• Datalog program for Transitive Closure and Single-source Reachability
• EDB (extensional database, base) relation for edges: R
• IDB (intensional database, derived) relations
  • Transitive closure (T)
  • Single-source reachability from vertex ‘a’ (S)
T(x, y) :- R(x, y)
T(x, y) :- R(x, z), T(z, y)
S(x) :- T(a, x)
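The three rules above can be read as a bottom-up fixpoint computation. The following is a minimal Python sketch of naive evaluation, not the evaluation strategy of any particular Datalog engine; the edge relation `R` and the function names are made up for illustration.

```python
def transitive_closure(edges):
    """Naive bottom-up evaluation of
       T(x, y) :- R(x, y)
       T(x, y) :- R(x, z), T(z, y)"""
    t = set(edges)  # first rule: every edge is in T
    while True:
        # second rule: join R(x, z) with the T(z, y) facts derived so far
        new = {(x, y) for (x, z) in edges for (z2, y) in t if z == z2}
        if new <= t:  # nothing new: fixpoint reached
            return t
        t |= new

def reachable_from(source, edges):
    """S(x) :- T(source, x)"""
    return {y for (x, y) in transitive_closure(edges) if x == source}

# Hypothetical edge relation (EDB), not taken from the slides.
R = {("a", "b"), ("b", "c"), ("c", "a"), ("d", "a")}
T = transitive_closure(R)
S = reachable_from("a", R)
```

Here `S` contains a, b, and c but not d, since no edge leads into d.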
Boolean Provenance: PosBool(X)-Database
• Tuples are annotated with variables from a set X
  • Here X = {x1, x2, y1, y2, …}
• For n annotated tuples, there are $2^n$ possible worlds, one per assignment $\mu : X \to \{\mathrm{True}, \mathrm{False}\}$
• Useful in query evaluation on incomplete or probabilistic databases
[Figure: PosBool(X)-database D, with tuples annotated x1, x2, y1, y2, y3, z1, z2]
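RA+ over PosBool(X)-Database
• Annotation propagates from input to output
  • Join = $\wedge$, Projection/Union = $\vee$
• Output tuples are annotated by monotone Boolean formulas
• $F_{Q,D}$ is the annotation of the unique output tuple
RA+ query Q: $\exists x\, \exists y\; \mathrm{AsthmaPatient}(x) \wedge \mathrm{Friend}(x, y) \wedge \mathrm{Smoker}(y)$
$F_{Q,D} = (x_1 \wedge y_1 \wedge z_1) \vee (x_1 \wedge y_2 \wedge z_2) \vee (x_2 \wedge y_3 \wedge z_2)$
[Figure: PosBool(X)-database D, with tuples annotated x1, x2, y1, y2, y3, z1, z2]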
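Two Important Properties: RA+ over PosBool(X)-Database
For every RA+ query Q, database D, and assignment $\mu$:
• (Faithful representation) $Q(\mu(D)) = \mu[Q(D)]$
• (Poly-size overhead) The size of $F_{Q,D}$ is polynomial in |D|, and $F_{Q,D}$ can be computed in poly-time.
RA+ query Q: $\exists x\, \exists y\; \mathrm{AsthmaPatient}(x) \wedge \mathrm{Friend}(x, y) \wedge \mathrm{Smoker}(y)$
$F_{Q,D} = (x_1 \wedge y_1 \wedge z_1) \vee (x_1 \wedge y_2 \wedge z_2) \vee (x_2 \wedge y_3 \wedge z_2)$
[Figure: PosBool(X)-database D under a sample True/False assignment to its annotations; both the query on the resulting world and $F_{Q,D}$ evaluate to False]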
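Both properties can be checked by brute force on a small instance. The sketch below is an illustration under assumptions: the tuple values (ann, bob, joe, tom) are hypothetical and were chosen only so that the annotations reproduce the formula $F_{Q,D}$ from the slide; only the variable names x1 … z2 come from the source. A monotone formula is stored in DNF as a set of clauses, each clause a set of variables.

```python
from itertools import product

# Hypothetical tuples; only the annotation names x1..z2 come from the slides.
asthma = {"ann": "x1", "bob": "x2"}
friend = {("ann", "joe"): "y1", ("ann", "tom"): "y2", ("bob", "tom"): "y3"}
smoker = {"joe": "z1", "tom": "z2"}
variables = sorted([*asthma.values(), *friend.values(), *smoker.values()])

def provenance():
    """Annotated evaluation: each join result is a conjunction (a set of
    variables); projection collects the alternatives as a disjunction."""
    return {frozenset({asthma[x], fv, smoker[y]})
            for (x, y), fv in friend.items()
            if x in asthma and y in smoker}

def formula_value(dnf, mu):
    """Evaluate the DNF formula under the assignment mu."""
    return any(all(mu[v] for v in clause) for clause in dnf)

def query_on_world(mu):
    """Evaluate Q with plain set semantics on the world selected by mu."""
    a = {x for x, v in asthma.items() if mu[v]}
    f = {t for t, v in friend.items() if mu[v]}
    s = {y for y, v in smoker.items() if mu[v]}
    return any(x in a and y in s for (x, y) in f)

F = provenance()
# Faithful representation: Q(mu(D)) == mu[Q(D)] for all 2^7 assignments.
faithful = all(
    query_on_world(mu) == formula_value(F, mu)
    for values in product([False, True], repeat=len(variables))
    for mu in [dict(zip(variables, values))]
)
```

The exhaustive loop is only feasible because the example has seven variables; the point of the faithful-representation property is that the formula alone answers the query for every possible world.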
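Datalog over PosBool(X)-Database
T(x, y) :- R(x, y)
T(x, y) :- R(x, z), T(z, y)
S(x) :- T(a, x)
• Semantics using derivation trees (Green et al. 2007)
• Annotation of T(a, b), where R(a, a) is annotated p and R(a, b) is annotated q:
  $\bigvee_{\text{trees } \tau \text{ of } T(a,b)} \; \bigwedge_{\text{leaves } t \text{ of } \tau} \mathrm{Annot}(t) = q \vee (p \wedge q) \vee (p \wedge p \wedge q) \vee \dots = q$
• Infinitely many trees
• But the annotation always has a finite equivalent form
• But not necessarily poly-size
[Figure: graph with a self-loop on a annotated p and an edge from a to b annotated q, next to the first few derivation trees of T(a, b) with leaves R(a, a) and R(a, b)]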
Lower Bound: Boolean Formulas for Datalog Provenance on PosBool(X)
Theorem: Given a PosBool(X)-database D and a Datalog program P, the provenance of tuples in P(D) cannot, in general, be faithfully represented by Boolean formulas of size polynomial in |D|.
Proof outline:
• st-connectivity on n nodes requires monotone Boolean formulas of size $n^{\Omega(\log n)}$ (Karchmer-Wigderson, 1988)
• Faithful representation requires: for all True/False assignments $\mu$ to X, $P(\mu(D)) = \mu[P(D)]$
• Reduce to the hard instance, with the right assignment $\mu$, when P = transitive closure
Solution: Boolean circuits!
Outline
• Background
• Circuits for Boolean Provenance or PosBool(X)
• Circuits for General Provenance Semirings
Boolean Circuits
• A circuit is a DAG
  • Reuses common subexpressions
  • A Boolean formula is a tree
• Leaf nodes: EDB vars in X
• Internal nodes
  • $\wedge$: IDB/EDB vars used in one derivation
  • $\vee$: alternative derivations
• Roots: IDB vars
T(x, y) :- R(x, y)
T(x, y) :- R(x, z), T(z, y)
S(x) :- T(a, x)
[Figure: graph on nodes a and b with R(a, a) annotated p and R(a, b) annotated q; circuit with leaves $X_{R(a,a)} = p$ and $X_{R(a,b)} = q$ and root $X_{T(a,b)}$]
Upper Bound: Boolean Circuits for PosBool(X)
Theorem: Given any PosBool(X)-database D and Datalog program P, the provenance of tuples in P(D) can be faithfully represented using monotone Boolean circuits of size polynomial in |D| (and can be computed in poly-time).
Proof Sketch
Two key ideas from previous work:
1. Datalog provenance can be represented by a system of equations, obtained by instantiating the vars in the Datalog program P to EDB/IDB tuples [Green et al. 2007]
  • EDB tuples $\to$ constants, IDB tuples $\to$ variables
  • Iteratively solve this system of equations
  • Fixpoint = provenance for all IDB tuples
2. A system of equations with N Boolean variables can be solved in N+1 iterations [Esparza et al. 2011]
  • N = #IDB tuples
  • Build a circuit with N+1 layers from the system of equations
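Illustration
T(x, y) :- R(x, y)
T(x, y) :- R(x, z), T(z, y)
S(x) :- T(a, x)
Step 1: Build the system of equations by all possible instantiations $x, y, z \in \{a, b\}$ (p and q are constants, the $X$'s are variables):
$X_{T(a,a)} = p \vee (p \wedge X_{T(a,a)})$
$X_{T(a,b)} = q \vee (p \wedge X_{T(a,b)})$
$X_{S(b)} = X_{T(a,b)}$
$X_{S(a)} = X_{T(a,a)}$
Step 2: Build a circuit with 4 + 1 layers (N = 4)
[Figure: graph with a self-loop on a annotated p and an edge from a to b annotated q]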
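The iteration behind Step 1 can be sketched directly on the four equations. The code below is an illustration only: it stores each value as an explicit DNF formula (a set of clauses) and applies absorption after every operation, which is convenient for a tiny example but is exactly the representation that can blow up in general. The function names are made up.

```python
def minimize(dnf):
    """Absorption: drop any clause that strictly contains another clause."""
    return frozenset(c for c in dnf if not any(d < c for d in dnf))

def OR(f, g):
    return minimize(f | g)

def AND(f, g):
    return minimize(frozenset(c | d for c in f for d in g))

def var(name):
    return frozenset({frozenset({name})})

FALSE = frozenset()          # the empty disjunction
p, q = var("p"), var("q")    # EDB annotations (constants of the system)

# One equation per IDB tuple; X maps IDB tuple names to current values.
equations = {
    "T(a,a)": lambda X: OR(p, AND(p, X["T(a,a)"])),
    "T(a,b)": lambda X: OR(q, AND(p, X["T(a,b)"])),
    "S(a)":   lambda X: X["T(a,a)"],
    "S(b)":   lambda X: X["T(a,b)"],
}

def solve(equations):
    """Start from false everywhere and apply all equations N+1 times."""
    X = {name: FALSE for name in equations}
    for _ in range(len(equations) + 1):
        X = {name: rhs(X) for name, rhs in equations.items()}
    return X

solution = solve(equations)
```

After the iterations, `solution["T(a,b)"]` is the single clause {q}: the clause {p, q} produced by the recursive rule is absorbed, matching the finite equivalent form of the infinite disjunction.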
Illustration
Multiple roots for multiple IDB vars
$X_{T(a,a)} = p \vee (p \wedge X_{T(a,a)})$
$X_{T(a,b)} = q \vee (p \wedge X_{T(a,b)})$
$X_{S(b)} = X_{T(a,b)}$
$X_{S(a)} = X_{T(a,a)}$
[Figure: layered circuit. Level 0 holds $X_{T(a,a),0}$, $X_{T(a,b),0}$, $X_{S(a),0}$, $X_{S(b),0}$; levels 1 and 2 hold the corresponding gates $X_{\cdot,1}$ and $X_{\cdot,2}$, each built from the leaves p, q and the gates of the level below]
Assign leaf IDB vars to false
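The layered construction can be sketched as follows. This is a minimal illustration, not the authors' implementation: equations are written as nested tuples, gate ids are integers handed out in creation order, and the names `build_circuit` and `evaluate` are hypothetical. Each layer copies the right-hand sides once and points IDB variables at the gates of the layer below, so the size grows linearly with the number of layers.

```python
def build_circuit(equations, layers):
    """Unroll the equations into a layered DAG of gates."""
    gates = {}                       # gate id -> (op, operands)

    def emit(op, operands):
        gid = len(gates)             # ids follow creation (topological) order
        gates[gid] = (op, operands)
        return gid

    edb = {}                         # one shared input gate per EDB variable
    prev = {name: emit("false", ()) for name in equations}   # level 0

    def compile_expr(expr):
        if isinstance(expr, str):
            if expr in equations:    # IDB var: reuse the gate one level down
                return prev[expr]
            if expr not in edb:
                edb[expr] = emit("input", (expr,))
            return edb[expr]
        op, *args = expr
        return emit(op, tuple(compile_expr(a) for a in args))

    for _ in range(layers):
        prev = {name: compile_expr(rhs) for name, rhs in equations.items()}
    return gates, prev               # prev maps IDB tuples to root gates

def evaluate(gates, assignment):
    """Single pass over the gates in creation order."""
    value = {}
    for gid in sorted(gates):
        op, operands = gates[gid]
        if op == "false":
            value[gid] = False
        elif op == "input":
            value[gid] = assignment[operands[0]]
        elif op == "and":
            value[gid] = all(value[i] for i in operands)
        else:  # "or"
            value[gid] = any(value[i] for i in operands)
    return value

equations = {
    "T(a,a)": ("or", "p", ("and", "p", "T(a,a)")),
    "T(a,b)": ("or", "q", ("and", "p", "T(a,b)")),
    "S(a)": "T(a,a)",
    "S(b)": "T(a,b)",
}
gates, roots = build_circuit(equations, layers=len(equations) + 1)
values = evaluate(gates, {"p": False, "q": True})
```

With p false and q true, the roots for T(a, b) and S(b) evaluate to true while those for T(a, a) and S(a) evaluate to false.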
Optimizations
• Store only two levels of the circuit instead of N+1 levels
  • Evaluate iteratively
• Embed circuit construction in semi-naïve evaluation
  • Check for new derivations, not only new IDB variables
  • Sound and complete
• Remove self-dependency of IDB vars
  • Works for PosBool(X) and also some other semirings…
$X_{T(a,a)} = p \vee (p \wedge X_{T(a,a)})$
$X_{T(a,b)} = q \vee (p \wedge X_{T(a,b)})$
$X_{S(b)} = X_{T(a,b)}$
$X_{S(a)} = X_{T(a,a)}$
Illustration (From here…)
[Figure: the unoptimized layered circuit, with levels 0, 1, and 2 over the leaves p and q, and the level-0 IDB vars set to false]
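Illustration (…To here)
With all these optimizations
[Figure: two-level circuit over the leaves p and q. Bottom level: $X_{S(a),\mathrm{bottom}}$, $X_{T(a,b),\mathrm{bottom}}$, $X_{T(a,a),\mathrm{bottom}}$. Top level: $X_{T(a,a),\mathrm{top}}$, $X_{S(a),\mathrm{top}}$, $X_{T(a,b),\mathrm{top}}$]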
Applications of PosBool(X)-Circuits
• Linear-time deletion propagation (in circuit size)
• Approximation for probabilistic databases
  • even when only the circuit (and not the database) is available
• Circuits can be computed “offline”
  • Only linear-time evaluation is required when needed (e.g. deletion propagation)
  • compared to storing and solving a system of equations iteratively, or re-evaluating the Datalog program
• Can use existing techniques for efficient and parallel circuit evaluation
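Deletion propagation on a stored circuit amounts to setting the variables of deleted tuples to false and evaluating once. The sketch below uses a hypothetical circuit for $X_{T(a,c)} = r_3 \vee (r_1 \wedge r_2)$ over a made-up three-edge graph (none of this instance is from the slides); gates are listed in topological order so that one left-to-right pass suffices.

```python
# Gates in topological order; each operand is the index of an earlier gate.
# Hypothetical circuit for X_T(a,c) = r3 OR (r1 AND r2).
circuit = [
    ("input", "r1"),   # annotation of R(a, b)
    ("input", "r2"),   # annotation of R(b, c)
    ("input", "r3"),   # annotation of R(a, c)
    ("and", 0, 1),
    ("or", 2, 3),      # root: X_T(a,c)
]

def survives(circuit, deleted):
    """Return whether the root tuple is still derivable after deleting the
    EDB tuples whose variables are in `deleted` (one pass, linear time)."""
    values = []
    for op, *args in circuit:
        if op == "input":
            values.append(args[0] not in deleted)
        elif op == "and":
            values.append(all(values[i] for i in args))
        else:  # "or"
            values.append(any(values[i] for i in args))
    return values[-1]
```

For instance, `survives(circuit, {"r3"})` is true because the path through b remains, while `survives(circuit, {"r2", "r3"})` is false. Neither call touches the database or re-runs the Datalog program.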
Outline
• Background
• Circuits for Boolean Provenance or PosBool(X)
• Circuits for General Provenance Semirings
Commutative Semirings
• $(K, +_K, \cdot_K, 0_K, 1_K)$
  • domain K
  • $+_K$, $\cdot_K$: associative, commutative, with neutral elements $0_K$ and $1_K$ respectively
  • $\cdot_K$ distributes over $+_K$, i.e. $a \cdot_K (b +_K c) = a \cdot_K b +_K a \cdot_K c$
  • $0_K$ annihilates every element of K, i.e. $a \cdot_K 0_K = 0_K \cdot_K a = 0_K$
Examples:
• $(\mathbb{B}, \vee, \wedge, \mathrm{False}, \mathrm{True})$: set semantics
• $(\mathbb{N}, +, \cdot, 0, 1)$: bag semantics
• $(\mathbb{N} \cup \{\infty\}, \min, +, \infty, 0)$: tropical semiring, to compute cost (e.g. cost of a shortest path)
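The three example semirings and the axioms can be written down compactly. This is a small sketch with hypothetical names (`Semiring`, `satisfies_laws`); the law check only tests the axioms on a finite sample of elements, so it is a sanity check rather than a proof.

```python
import math
from collections import namedtuple
from itertools import product

Semiring = namedtuple("Semiring", "add mul zero one")

BOOLEAN = Semiring(lambda a, b: a or b, lambda a, b: a and b, False, True)
COUNTING = Semiring(lambda a, b: a + b, lambda a, b: a * b, 0, 1)
TROPICAL = Semiring(min, lambda a, b: a + b, math.inf, 0)

def satisfies_laws(k, samples):
    """Check the commutative-semiring axioms on all triples of samples."""
    for a, b, c in product(samples, repeat=3):
        ok = (
            k.add(a, b) == k.add(b, a)                       # commutativity
            and k.mul(a, b) == k.mul(b, a)
            and k.add(a, k.add(b, c)) == k.add(k.add(a, b), c)   # associativity
            and k.mul(a, k.mul(b, c)) == k.mul(k.mul(a, b), c)
            and k.add(a, k.zero) == a                        # neutral elements
            and k.mul(a, k.one) == a
            and k.mul(a, k.add(b, c)) == k.add(k.mul(a, b), k.mul(a, c))
            and k.mul(a, k.zero) == k.zero                   # annihilation
        )
        if not ok:
            return False
    return True
```

In the tropical semiring the "sum" picks the cheaper alternative and the "product" adds costs along one derivation, which is why the zero element is infinity and the one element is 0.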
Provenance Semirings
• Generalization of PosBool(X)
• $(K, +_K, \cdot_K, 0_K, 1_K)$
  • Tuples are annotated with variables from X
  • K is of the form Prov(X)
  • $+_K$ denotes alternative usage
  • $\cdot_K$ denotes joint usage
• Examples:
  • $(\mathrm{PosBool}(X), \vee, \wedge, \mathrm{False}, \mathrm{True})$
  • $(\mathrm{Lin}(X), \cup, \cup, \bot, \emptyset)$, where both operations are set union and $\bot$ is the zero element: tracks contributing tuples [Cui et al. ’00]
  • $(\mathrm{Why}(X), \cup, \Cup, \emptyset, \{\emptyset\})$, where $\Cup$ is the pairwise union of subsets: tracks contributing tuples in alternative derivations [Buneman et al. ’01]
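The Why(X) operations are easy to make concrete. In the sketch below (function names are made up), a Why(X) element is a set of witnesses and each witness is a set of variables; the last line flattens the witnesses into one set of contributing variables, which is one way to read the lineage view and is an assumption of this sketch rather than something the slides spell out.

```python
def why_add(a, b):
    """Alternative use: union of the two witness sets."""
    return a | b

def why_mul(a, b):
    """Joint use: pairwise union of witnesses."""
    return frozenset(s | t for s in a for t in b)

def why_var(x):
    """Annotation of a single input tuple with variable x."""
    return frozenset({frozenset({x})})

WHY_ZERO = frozenset()               # no witness at all
WHY_ONE = frozenset({frozenset()})   # one empty witness

p, q = why_var("p"), why_var("q")
why = why_add(q, why_mul(p, q))      # annotation q + p*q
lineage = frozenset().union(*why)    # all variables appearing in any witness
```

Here `why` keeps both witnesses {q} and {p, q}. PosBool(X) would absorb the second one, which illustrates why Why(X) is more informative than PosBool(X).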
Provenance Specialization
• Key property needed for applications like deletion propagation, trust management, cost computation, …
• Prov(X) specializes correctly to K if any valuation $v : X \to K$ extends uniquely to a homomorphism $h_v : \mathrm{Prov}(X) \to K$ (which correctly maps $+$, $\cdot$ of Prov(X) to those of K)
• Further, some provenance semirings are “more informative” than others
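Specialization can be illustrated with a polynomial in N[X] evaluated under different valuations. The sketch below is a hedged example: the polynomial $r_3 + r_1 r_2$ and all valuations are made up, and `specialize` is a hypothetical helper that applies the homomorphism by replacing each variable with its value and each operation with the target semiring's operation.

```python
import math

# Each semiring is a tuple (add, mul, zero, one).
BOOLEAN = (lambda a, b: a or b, lambda a, b: a and b, False, True)
COUNTING = (lambda a, b: a + b, lambda a, b: a * b, 0, 1)
TROPICAL = (min, lambda a, b: a + b, math.inf, 0)

def specialize(poly, valuation, k):
    """Evaluate a polynomial from N[X] in the semiring k.
    poly is a list of (coefficient, {variable: exponent}) pairs."""
    add, mul, zero, one = k
    total = zero
    for coeff, monomial in poly:
        term = one
        for var, exp in monomial.items():
            for _ in range(exp):
                term = mul(term, valuation[var])
        for _ in range(coeff):       # coefficient n means n-fold sum
            total = add(total, term)
    return total

# Hypothetical provenance polynomial r3 + r1*r2.
poly = [(1, {"r3": 1}), (1, {"r1": 1, "r2": 1})]

derivable = specialize(poly, {"r1": True, "r2": True, "r3": False}, BOOLEAN)
copies = specialize(poly, {"r1": 2, "r2": 1, "r3": 1}, COUNTING)
cost = specialize(poly, {"r1": 2, "r2": 3, "r3": 7}, TROPICAL)
```

The same stored polynomial answers three different questions: whether the tuple is derivable after a deletion, how many copies it has under bag semantics, and the cheapest derivation cost.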
Provenance Semiring Hierarchy
[Figure: hierarchy diagram ordered from more informative to less informative; an edge means “specializes correctly”. Semirings shown: N[X], N (bag), Sorp(X) (defined later), Why(X), Tropical, PosBool(X), Lin(X), Security, Boolean (set)]
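Datalog Provenance for General Semirings
• PosBool(X): $\bigvee_{\text{trees } \tau} \; \bigwedge_{\text{leaves } t \text{ of } \tau} \mathrm{Annot}(t)$
• General Prov(X): $\sum_{\text{trees } \tau} \; \prod_{\text{leaves } t \text{ of } \tau} \mathrm{Annot}(t)$, with the sum and product taken in K (that is, $+_K$ and $\cdot_K$)
• Infinite sums should be well-defined
• Need to consider “$\omega$-continuous semirings” and “$\omega$-continuous homomorphisms”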
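Provenance Semiring Hierarchy
• Need to add $\mathbb{N}[[X]]$ and $\mathbb{N}^{\infty}$
• $\mathbb{N}[[X]]$: most informative provenance semiring [Green et al. ’07]
[Figure: the hierarchy extended with N[[X]] and $\mathbb{N}^{\infty}$; a label marks semirings that are finite, so $\omega$-continuous. Semirings shown: N[X], N (bag), Sorp(X), Why(X), Tropical, PosBool(X), Lin(X), Security, Boolean (set)]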
How good is N[[X]] w.r.t. Size of Datalog Provenance?
• Poly-size overhead is not valid because of the infinite sum
• But can outputs have finite annotations (with X, $\cdot$, $+$) that specialize correctly to semirings with finite domains?
Theorem: It is not possible to annotate the output of Datalog programs under N[[X]]-semantics with finite provenance expressions that specialize “correctly” to the semiring Why(X).
• Finite annotations won’t specialize correctly to Why(X)
Theorem: However, we can generate poly-size circuits in poly-time directly for Why(X).
• Need more levels in the circuit built from the system of equations
• Need a different argument for correctness
Can we still have a good general semiring w.r.t. size?
• We propose Sorp(X)
  • Most general absorptive semiring
    • $a + a \cdot b = a$
  • N[X], but keeping only the monomials that are not “absorbed” by the others
    • e.g. $pq + p^2q^3 = pq$, whereas $p^2q + pq^2$ stays $p^2q + pq^2$
• The same algorithm, proof, and optimizations to construct poly-size circuits hold
  • Circuits are more general than Boolean circuits
• Specializes correctly to interesting semirings
• Outputs can be annotated by poly-size circuits
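The absorption rule can be illustrated on monomials. The sketch below rests on an assumption drawn from the rule $a + a \cdot b = a$: a monomial is absorbed whenever some other monomial in the sum divides it, and repeated monomials collapse to one. Monomials are dictionaries from variable to positive exponent, and the function names are made up.

```python
def divides(m1, m2):
    """True if monomial m1 divides m2, i.e. m2 = m1 * (some monomial)."""
    return all(m2.get(v, 0) >= e for v, e in m1.items())

def absorb(monomials):
    """Keep only monomials that are not absorbed by another one.
    Assumes every exponent stored in a monomial is positive."""
    ms = [dict(m) for m in monomials]
    kept = []
    for i, m in enumerate(ms):
        absorbed = any(
            j != i and divides(other, m) and (other != m or j < i)
            for j, other in enumerate(ms)   # equal monomials: keep the first
        )
        if not absorbed:
            kept.append(m)
    return kept

first = absorb([{"p": 1, "q": 1}, {"p": 2, "q": 3}])    # pq + p^2 q^3
second = absorb([{"p": 2, "q": 1}, {"p": 1, "q": 2}])   # p^2 q + p q^2
```

In `first`, pq divides p²q³, so only pq is kept. In `second`, neither monomial divides the other, so both remain, matching the two examples on the slide.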
Provenance Semiring Hierarchy
[Figure: the hierarchy diagram again. Semirings shown: N[X], N (bag), Sorp(X), Why(X), Tropical, PosBool(X), Lin(X), Security, Boolean (set)]
Related Work
• Data provenance
  • e.g. [Cui et al. ’00, Buneman et al. ’08, Cheney et al. ’09, Benjelloun et al. ’08]
• Circuits
  • Circuit complexity (size, depth, parallelism) has been studied for decades, e.g. [Arora-Barak ’09] (book)
• Provenance for Datalog
  • System of equations, derivation trees, infinite sums [Grahne ’91, Green et al. ’07]
  • Poly-size c-tables with Boolean formulas for Datalog with contradictions [Abiteboul et al. 2014]
Conclusions
• Circuits to represent and store Datalog provenance
  • for PosBool(X) and other semirings
  • Semantics, algorithms, limitations, applicability
• Preliminary experiments support our results
  • We compared circuits for deletion propagation with iteratively solving the system of equations and with re-evaluating the Datalog program from scratch
• Future work:
  • A complete implementation, evaluation, new applications
Thank You Questions?