260 likes | 491 Views
Provenance Semirings. T.J. Green, G. Karvounarakis, V. Tannen University of Pennsylvania. PODS 2007. Provenance. First studied in data warehousing Lineage [ Cui,Widom,Wiener 2000 ] Scientific applications (to assess quality of data) Why-Provenance [ Buneman,Khanna,Tan 2001 ]
E N D
Provenance Semirings T.J. Green, G. Karvounarakis, V. TannenUniversity of Pennsylvania PODS 2007
Provenance • First studied in data warehousing • Lineage [Cui,Widom,Wiener 2000] • Scientific applications (to assess quality of data) • Why-Provenance [Buneman,Khanna,Tan 2001] • Our interest: P2P data sharing in the ORCHESTRA system (project headed by Zack Ives) • Trust conditions based on provenance • Deletion propagation PODS 2007
Annotated relations • Provenance: an annotation on tuples • Our observation: propagating provenance/lineage through views is similar to querying • Incomplete Databases (conditional tables) • Probabilistic Databases (independent tuple tables) • Bag Semantics Databases (tuples with multiplicities) • Hence we look at queries on relations with annotated tuples PODS 2007
Incomplete databases: boolean C-tables R boolean variables semantics: a set of instances { } , ; I(R)= , , , , , , PODS 2007
Imielinski & Lipski (1984): queries on C -tables R union of conjunctive queries (UCQ) r s r q(x,z) :- R(x,_,z), R(_,_,z) q(x,z) :- R(x,y,_), R(_,y,z) r r q(R) p=true r=false s=true = PODS 2007
Why-provenance/lineage Which input tuples contribute to the presence of a tuple in the output? same query q(R) R tuple ids [Cui,Widom,Wiener 2000] [Buneman,Khanna,Tan 2001] PODS 2007
C –tables vs. Lineage c-table calculations lineage calculations The structure of the calculations is the same! PODS 2007
Another analogy, with bag semantics R tuple multiplicities c-table calculations same query q(R) multiplicity calculations The structure of the calculations is the same! PODS 2007
Abstracting the structure of these calculations These expressions capture the abstract structure of the calculations, which encodes the logical derivation of the output tuples We shall use these expressions as provenance abstract calculations PODS 2007
Technical Development • Abstractly annotated relations (K-relations) and their relational algebra • K must be semiring • For provenance, K is semiring of polynomials • Datalog on K-relations • For provenance, K consists of (possibly infinite) formal power series PODS 2007
K-relations • Annotations are elements from an algebraic structure (K,+,¢, 0, 1) • IfD is the domain of database values, an n-ary K-relationis a function: R: Dn! K Although the notation resembles arithmetic, these are abstract operations All possible tuples PODS 2007
K-relations, annotated tables • K-relationcorresponds to table: R: Dn! K • If R(t)=k, then t“is annotated by k” • For all but finitely many tuples t, R(t) = 0 • we omit those tuples from the table representation PODS 2007
Positive K-relational algebra • We define an RA+ on K-relations: • The ¢ corresponds to join: • The + corresponds to union and projection • 0and 1 are used for selection predicates • Details in the paper (but recall how we evaluated the UCQ q earlier and we will see another example later) PODS 2007
RA+ identities imply semiring structure! • Common RA+ identities • Unionandjoinareassociative, commutative • Join distributesoverunion • etc. (but notidempotence!) These identities hold for RA+ onK-relations iff (K, +,¢, 0, 1) is a commutative semiring (K,+,0)is a commutative monoid (K,¢,1)is a commutative monoid ¢distributes over+, etc PODS 2007
Calculations on annotated tables are particular cases PODS 2007
Provenance Semirings • X = {p, r, s, …}: indeterminates (provenance “tokens” for base tuples) • N[X]: multivariate polynomials with coefficients in Nand indeterminates inX • (N[X], +, ¢, 0, 1)is the most “general” commutative semiring: its elements abstract calculations in all semirings • N[X] –relations are the relations with provenance! • The polynomials capture the propagation of provenance through (positive) relational algebra PODS 2007
same lineage, different provenance A provenance calculation q(x,z) :- R(x, _,z), R(_, _,z) q(x,z) :- R(x,y, _), R(_ ,y,z) q(R) R Lineage • Not just why-but alsohow-provenance (encodes derivations)! • More informative than lineage PODS 2007
Trust assesment q(x,z) :- R(x, _,z), R(_, _,z) q(x,z) :- R(x,y, _), R(_ ,y,z) q(R) R 2 alternatives, both need Moe, twice Needs both Moe andLarry One alternative needs Larry and Curly Two others only need Larry, twice p: justified by Moe r: justified by Larry s: justified by Curly Which output tuples can be trusted after Larry is jailed? PODS 2007
More Technical Development • The semiring structure on annotations works out nicely for (positive) relational algebra. • What more do we need for Datalogqueries? • -continuous semirings (so fixed points exist)! • N is not -continuous, butN1≜ N[ {1} is • Here we show only what we need for Datalog provenance (formal power series) PODS 2007
… q(a d) q(a d) r2 r2 r2 r2 q(a b) q(b d) r1 r1 q(a d) R(a b) R(b d) r2 r2 q(d d) q(d d) q(a b) q(b d) r1 r1 r1 r1 R(a b) R(b d) R(d d) R(d d) Beyond RA+: Datalog r1: q(X, Y) :- R(X, Y) r2: q(X, Y) :- q(X, Z), q(Z, Y) R PODS 2007
Provenance: Encoding Infinite Derivations • Polynomials do not suffice, since they are finite! • Instead, we use infinite formal power series • Nonetheless, provenance is finitely representable through a system of equations PODS 2007
Provenance equations r1: q(X, Y) :- R(X, Y) r2: q(X, Y) :- q(X, Z), q(Z, Y) q(R) R Polynomials are the provenance of the immediate consequence operator (in RA+) The provenances x,y etc. are the power series that solve this system of equations (see next) PODS 2007
Coefficients have the form 2k! k!(k+1)! Solutions: formal power series x =m + np y =n z =p v =s + s2 + 2s3 + 5s4 + 14s5 + … u =rv* w =r(m+np)(v*)2 where v*≜ 1 + v + v2 + v3 + … In general we need coefficients from:N1≜ N[ {1} PODS 2007
Algorithmic results for Datalog provenance • Given tq(I), it is decidable whether the provenance of t is a proper (infinite) power series; • From CFG ambiguity, we know that testing whether all coefficients are · 1 is undecidable • However, given tq(I), and a monomial , the coefficient of in the power series that is the provenance oftis computable (including when it is 1) PODS 2007
Related Work • Foundations: semirings/systems of equations/formal power series first used in CS in theory of formal languages [Chomsky,Schutzenberger 1963] • Our work is related to and shares similar goals with “Debugging schema mappings with routes” [Chiticariu,Tan VLDB2006], where “routes” are like minimal finite portions of our how-provenance • See also tutorial at SIGMOD tomorrow! PODS 2007
Further work • Application: P2P data sharing in the ORCHESTRA system (thanks to our collaborator Zack Ives): • Need to express trust conditions based on provenance of tuples • Incremental propagation of deletions • Semiring provenance itself is incrementally maintainable • See demo of ORCHESTRA in SIGMOD on Thursday! • Future extension: full relational algebra. For difference we need semirings with “proper subtraction” PODS 2007