Provenance for Database Transformations. Val Tannen, University of Pennsylvania. Joint work with J.N. Foster, T.J. Green, G. Karvounarakis. IPAW ’08, Salt Lake City, June 17, 2008.
Motivation: Some of the work in IPAW! • Data integration [Wang, Madnick 1990; Lee, Bressan, Madnick 1998] • Data warehousing: lineage [Cui, Widom, Wiener 2000] • Scientific applications: why-provenance [Buneman, Khanna, Tan 2001] • Collaborative data sharing networks in the ORCHESTRA system (project headed by Zack Ives): trust conditions based on provenance, deletion propagation [Green, Ives, Karvounarakis, T. 2007; Karvounarakis, Ives 2008]
Database transformations, e.g., views. View V = q(R), where R has columns 1, 2, 3: CREATE VIEW V AS (SELECT u.1, v.3 FROM R u, R v WHERE u.3 = v.3 UNION SELECT u.1, v.3 FROM R u, R v WHERE u.2 = v.2). White box (Ludäscher). [Figure: tables R and V]
Database transformations, e.g., views. View V = q(R), where R has columns 1, 2, 3. • Datalog without recursion: V(x,z) :- R(x,_,z), R(_,_,z) and V(x,z) :- R(x,y,_), R(_,y,z) • Relational algebra: V := π_12((π_13(R) ⋈ π_23(R)) ∪ (π_12(R) ⋈ π_23(R)))
Provenance questions. View V = q(R), with t a tuple in the output V. • Which input tuples contributed in some way to t being in the output? • Which sets of input tuples support each way for t to be in the output? • What are all possible ways in which t was caused to be in the output?
Provenance answers, in terms of tuple ids, for an output tuple t. • Which input tuples contributed in some way to t being in the output? {r, s}: lineage [CWW 00] • Which sets of input tuples support each way for t to be in the output? {{r}, {r, s}}: proof why-provenance [BKT 01], see [PODS 08] • What are all possible ways in which t was caused to be in the output? 2r² + rs: provenance polynomials [Green, Karvounarakis, T. 2007]
More generality: annotated relations. Provenance: an annotation on tuples. Other instances of relations with annotated tuples: • incomplete databases (conditional tables) [Imielinski, Lipski 1984] • probabilistic databases (independent tuple tables) [Fuhr, Rölleke 1997; Zimányi 1997; others] • bag semantics databases (tuples with multiplicities) […SQL!] How do annotations combine as they propagate through queries? (Is there an algebra of annotations?) Imielinski and Lipski already computed some form of provenance!!
Incomplete databases: boolean c-tables [IL 84]. Table R is annotated with boolean variables; semantics: a set of instances I(R). [Figure: c-table R and its set of instances I(R)]
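Queries on c-tables. V(x,z) :- R(x,_,z), R(_,_,z) and V(x,z) :- R(x,y,_), R(_,y,z). [Figure: c-table R with annotations including r and s, and the annotations computed for V, followed by a simplification step] But… simplifying like this misses the general idea!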
Probabilistic independent-tuple tables. Events “tuple in instance” are independent. View V may not be representable as an independent-tuple table. View computation: similar to c-tables, but for an algebra of sets (events). [Figure: tables R and V]
C-tables vs. lineage. [Figure: c-table calculations side by side with lineage calculations [CWW 00]] The structure of the calculations is the same!
Another analogy, with bag semantics. [Figure: R with tuple multiplicities; c-table calculations side by side with multiplicity calculations for V] Again, the structure of the calculations is the same!
Abstracting the structure of these calculations. [Figure: abstract calculations] These expressions capture the abstract structure of the calculations. We will end up using these expressions as provenance!
Technical Development: K-relations. Annotations are elements of an algebraic structure (K, +, ·, 0, 1). Although the notation resembles arithmetic, these are abstract operations. If D is the domain of database values, an n-ary K-relation is a function R: Dⁿ → K, where Dⁿ is the set of all possible tuples.
K-relations, annotated tables. A K-relation R: Dⁿ → K corresponds to a table: if R(t) = k, then t “is annotated by k”. For all but finitely many tuples t, R(t) = 0; we omit the tuples annotated with 0.
Positive K-relational algebra. We define RA+ on K-relations: • · corresponds to joint use (join) • + corresponds to alternative use (union and projection) • 0 and 1 are used for selection predicates
Positive K-relational algebra: details • Natural join: [R1 ⋈ R2](t) = R1(t1) · R2(t2), where t1 = t on attrs(R1) and t2 = t on attrs(R2) • Union: [R1 ∪ R2](t) = R1(t) + R2(t) • Projection: [π_V R](t) = Σ R(t'), summing over all t' such that t = t' on V and R(t') ≠ 0 • Selection: [σ_P R](t) = R(t) · P(t), where P(t) = 0 or 1
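The four operations above can be sketched in Python. This is only an illustration under stated assumptions, not the authors' implementation: the `Semiring` record, the function names, the attribute names A, B, C, and the sample relation with multiplicities 2, 5, 1 are all invented for the example. The view is written as a union of two projected self-joins, one per Datalog rule for V.

```python
from collections import namedtuple

# A commutative semiring (K, +, ·, 0, 1), passed around as plain functions.
Semiring = namedtuple("Semiring", "add mul zero one")
NAT = Semiring(lambda a, b: a + b, lambda a, b: a * b, 0, 1)             # bags
BOOL = Semiring(lambda a, b: a or b, lambda a, b: a and b, False, True)  # sets

# A K-relation is a pair (attrs, rows): attrs is a tuple of attribute names,
# rows maps value tuples to annotations; tuples annotated with 0 are omitted.

def join(K, r1, r2):
    (a1, rows1), (a2, rows2) = r1, r2
    shared = [a for a in a1 if a in a2]
    extra = [a for a in a2 if a not in a1]
    out = {}
    for t1, k1 in rows1.items():
        d1 = dict(zip(a1, t1))
        for t2, k2 in rows2.items():
            d2 = dict(zip(a2, t2))
            if all(d1[a] == d2[a] for a in shared):
                t = t1 + tuple(d2[a] for a in extra)
                out[t] = K.mul(k1, k2)              # joint use: multiply
    return a1 + tuple(extra), out

def union(K, r1, r2):
    (attrs, rows1), (_, rows2) = r1, r2
    out = dict(rows1)
    for t, k in rows2.items():
        out[t] = K.add(out.get(t, K.zero), k)       # alternative use: add
    return attrs, out

def project(K, r, keep):
    attrs, rows = r
    idx = [attrs.index(a) for a in keep]
    out = {}
    for t, k in rows.items():
        key = tuple(t[i] for i in idx)
        out[key] = K.add(out.get(key, K.zero), k)   # sum over collapsed tuples
    return tuple(keep), out

def select(K, r, pred):
    attrs, rows = r
    # P(t) is 1 when the predicate holds, 0 otherwise; 0-annotated tuples vanish.
    return attrs, {t: K.mul(k, K.one) for t, k in rows.items()
                   if pred(dict(zip(attrs, t)))}

def view(K, R):
    rule1 = project(K, join(K, project(K, R, ["A", "C"]),
                            project(K, R, ["B", "C"])), ["A", "C"])
    rule2 = project(K, join(K, project(K, R, ["A", "B"]),
                            project(K, R, ["B", "C"])), ["A", "C"])
    return union(K, rule1, rule2)

# Hypothetical sample data: three tuples with multiplicities 2, 5 and 1.
R_bag = (("A", "B", "C"),
         {("a", "b", "c"): 2, ("d", "b", "e"): 5, ("f", "g", "e"): 1})
V_bag = view(NAT, R_bag)[1]
R_set = (R_bag[0], {t: True for t in R_bag[1]})
V_set = view(BOOL, R_set)[1]
```

The same `view` function runs unchanged over bags and over sets; only the semiring argument differs, which is the point of parameterizing the algebra by K.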
RA+ identities imply semiring structure! Common RA+ identities: • Union and join are associative and commutative • Join distributes over union • etc. (but not idempotence!) These identities hold for RA+ on K-relations iff (K, +, ·, 0, 1) is a commutative semiring.
Semiring Bestiary • (B, ∨, ∧, ⊥, ⊤): usual relational algebra (sets) • (N, +, ·, 0, 1): bag semantics • (PosBool(X), ∨, ∧, ⊥, ⊤): boolean c-tables, also minimal why-provenance [BKT 01] • (P(Ω), ∪, ∩, ∅, Ω): event tables (probabilistic databases) • (P(P(X)), ∪, ⋓, ∅, {∅}): proof why-provenance, where A ⋓ B := {a ∪ b : a ∈ A, b ∈ B} • (P(X), ∪, ∪, ∅, ∅) ★: lineage • (N[X], +, ·, 0, 1): provenance polynomials
Provenance polynomials. X = {p, r, s, …}: indeterminates (provenance “tokens” for base tuples). N[X]: multivariate polynomials with coefficients in N and indeterminates in X. (N[X], +, ·, 0, 1) is the free commutative semiring generated by X; its elements abstract calculations in all semirings. The polynomials capture the propagation of provenance through (positive) relational algebra in the most general way allowed by commutative semiring-based semantics.
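As a rough illustration of computing with N[X], the sketch below represents a polynomial as a `Counter` mapping monomials to natural-number coefficients and evaluates the two Datalog rules for V by brute force. The representation, the helper names `var`, `padd`, `pmul`, and the three-tuple instance tagged p, r, s are assumptions chosen for the example, not something the slides specify.

```python
from collections import Counter
from itertools import product

# A polynomial in N[X]: Counter {monomial: coefficient}, where a monomial is a
# sorted tuple of (variable, exponent) pairs, e.g. r^2 is (("r", 2),).

def var(x):
    return Counter({((x, 1),): 1})

def padd(f, g):
    return f + g                      # coefficients are positive, so + is safe

def pmul(f, g):
    out = Counter()
    for (m1, c1), (m2, c2) in product(f.items(), g.items()):
        exps = Counter(dict(m1))
        exps.update(dict(m2))         # add exponents variable by variable
        out[tuple(sorted(exps.items()))] += c1 * c2
    return out

# Hypothetical instance: three base tuples tagged with tokens p, r, s.
R = {("a", "b", "c"): var("p"),
     ("d", "b", "e"): var("r"),
     ("f", "g", "e"): var("s")}

def view(R):
    V = {}
    for (t1, k1), (t2, k2) in product(R.items(), repeat=2):
        if t1[2] == t2[2]:            # V(x,z) :- R(x,_,z), R(_,_,z)
            key = (t1[0], t1[2])
            V[key] = padd(V.get(key, Counter()), pmul(k1, k2))
        if t1[1] == t2[1]:            # V(x,z) :- R(x,y,_), R(_,y,z)
            key = (t1[0], t2[2])
            V[key] = padd(V.get(key, Counter()), pmul(k1, k2))
    return V

V = view(R)
```

On this instance the output tuple (d, e) receives 2r² + rs: two derivations that use r twice and one that uses r and s once each.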
Provenance calculations. [Figure: R and V, with the annotations of V shown under each model: proof why-provenance, minimal why-provenance, boolean c-table annotations, provenance polynomials, lineage] Three derivations: two of them use r twice, and the third uses r and s once each.
Trust assessment. p: certified by Moe; r: certified by Larry; s: certified by Curly. [Figure: R and V, with the annotations of V tuples read as follows] • 2 alternatives, both need Moe, twice • Needs both Moe and Larry • One alternative needs Larry and Curly; two others only need Larry, twice. Which output tuples can be trusted after Larry is jailed?
A glimpse at work by T.J. Green: Provenance and Query Optimization • Many kinds of semiring-based provenance annotations to choose from: lineage, proof why-provenance, minimal why-provenance, provenance polynomials, ... • They keep track of more/less information • A fundamental question, asked repeatedly by Peter Buneman: how does this affect query optimization?
Choice of K Affects Query Optimization. K = N (bag semantics) differs from K = B (set semantics); e.g., the conjunctive queries Q1(x) :- R(x,y), R(x,z) and Q2(u) :- R(u,v) are set-equivalent, but not bag-equivalent.
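A small experiment makes the difference concrete. The sketch below evaluates Q1 and Q2 on one hand-picked instance, first over N and then over B; the instance and function names are invented for illustration, and agreement on a single instance of course only illustrates set-equivalence rather than proving it.

```python
def q1(R, add, mul, zero):
    """Q1(x) :- R(x,y), R(x,z)"""
    out = {}
    for (x, _y), k1 in R.items():
        for (x2, _z), k2 in R.items():
            if x == x2:
                out[x] = add(out.get(x, zero), mul(k1, k2))
    return out

def q2(R, add, zero):
    """Q2(u) :- R(u,v)"""
    out = {}
    for (u, _v), k in R.items():
        out[u] = add(out.get(u, zero), k)
    return out

# Hypothetical instance: two tuples sharing the first column.
R_bag = {("a", "b"): 1, ("a", "c"): 1}
R_set = {t: True for t in R_bag}

plus, times = (lambda a, b: a + b), (lambda a, b: a * b)
lor, land = (lambda a, b: a or b), (lambda a, b: a and b)

bag_q1 = q1(R_bag, plus, times, 0)     # multiplicity of "a" under Q1
bag_q2 = q2(R_bag, plus, 0)            # multiplicity of "a" under Q2
set_q1 = q1(R_set, lor, land, False)
set_q2 = q2(R_set, lor, False)
```

Under bag semantics Q1 counts every pair of matching tuples, so the multiplicities diverge, while over B both queries simply report that "a" is present.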
A Hierarchy of Semiring Provenance (1) • Provenance polynomials (N[X], +, ·, 0, 1): tracks calculations abstractly; most general, e.g., 2p²r + 3ps + ps³ • Drop coefficients to get (B[X], +, ·, 0, 1): p²r + ps + ps³ • Drop exponents to get proof why-provenance (P(P(X)), ∪, ⋓, ∅, {∅}): {{p,r}, {p,s}} • Flatten set-of-sets to get lineage: {p,r,s} • Drop, flatten, etc. correspond to surjective semiring homomorphisms
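The three steps down the hierarchy can be replayed on the slide's example polynomial. The sketch below assumes a dictionary representation of polynomials (monomial to coefficient) and hypothetical function names; it only shows the maps acting on one element, not a proof that they are homomorphisms.

```python
# 2p²r + 3ps + ps³ as {monomial: coefficient}; a monomial is a tuple of
# (variable, exponent) pairs.
poly = {
    (("p", 2), ("r", 1)): 2,
    (("p", 1), ("s", 1)): 3,
    (("p", 1), ("s", 3)): 1,
}

def drop_coefficients(poly):
    """N[X] -> B[X]: keep only which monomials occur."""
    return {m for m, c in poly.items() if c > 0}

def drop_exponents(monomials):
    """B[X] -> proof why-provenance: each monomial becomes its set of variables."""
    return {frozenset(x for x, _e in m) for m in monomials}

def flatten(why):
    """Proof why-provenance -> lineage: union of all witness sets."""
    return set().union(*why)

b_poly = drop_coefficients(poly)       # p²r + ps + ps³
why = drop_exponents(b_poly)           # {{p,r}, {p,s}}
lineage = flatten(why)                 # {p,r,s}
```

Note how ps and ps³ collapse to the same witness set {p, s} once exponents are dropped, which is why the why-provenance has two elements rather than three.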
A Hierarchy of Semiring Provenance (2). Definition: K1 ≼_L K2 means that for all queries P, Q in language L, P ≡_K2 Q implies P ≡_K1 Q. Languages of interest: CQ and UCQ (equivalent to RA+). Definition: K1 ≈_L K2 means K1 ≼_L K2 and K2 ≼_L K1. Proposition: If there exists a surjective homomorphism h from K2 onto K1, then K1 ≼_UCQ K2. Proposition (from [GKT 07]): If K is a distributive lattice then B ≈_UCQ K. (In particular B ≈_UCQ PosBool(X).)
A Hierarchy of Semiring Provenance (3). Definition: A semiring is positive if 0 ≠ 1, a + b = 0 implies a = 0 and b = 0, and a · b = 0 implies a = 0 or b = 0. All the semirings we consider are positive. Proposition: For any positive K (and “big enough” X), B ≼_UCQ K ≼_UCQ N[X]. Moreover, Proposition (Provenance Hierarchy): B ≼_UCQ lineage ≼_UCQ proof why-prov. ≼_UCQ B[X] ≼_UCQ N[X]
Separating the Models for ≡ of CQs • B ≺_CQ lineage: Q1(x,y) :- R(x,y), R(x,z) and Q2(x,y) :- R(x,y) satisfy Q1 ≡_B Q2 but Q1 ≢_lin Q2 • lineage ≺_CQ why: Q1(x) :- R(x,y), R(x,z) and Q2(x) :- R(x,y) satisfy Q1 ≡_lin Q2 but Q1 ≢_why Q2
Summary: Provenance Hierarchy More importantly, Green’s results also show decidability for containment and equivalence of CQs and UCQs under the various provenance semantics
Extension to annotated XML • Data model: unordered XML data with semiring annotations (K-UXML) • Query language: positive, unordered XQuery fragment (K-UXQuery) • Sanity checks: agrees with encoded relational queries, bag semantics, probabilistic XML, ... • Applications: security, incomplete XML databases, ...
K-UXML • No attributes, no text values, no repeated children (inessential); no order (essential!) • Each subtree decorated with a value k from semiring K (1 “neutral,” 0 “not present”) • K-collection: a finite set of elements annotated with values from K • The child subtrees of a node form a K-collection
K-UXML Example. In NRC_K: {⟨a, {⟨b, {⟨a, {⟨c, {}⟩^y3, ⟨d, {}⟩^1}⟩^1}⟩^x1, ⟨c, {...}⟩^y1}⟩^1}. [Figure: the corresponding annotated tree, with each K-collection in a different color] Annotations are on elements of K-collections. There are 5 K-collections in this tree (all colored differently). To annotate the whole tree, one must include it in a singleton K-collection.
K-UXQuery Semantics: for-Loops. Query: for $t in $S return $t/*. Source $S: the K-collection {a^x, b^y, c^z}, where a has child d^u, b has children d^v and e^w, and c has child f. Computation: x · {d^u}, y · {d^v, e^w}, z · {f}, which gives d^(x·u), d^(y·v), e^(y·w), f^z. Answer: {d^(x·u + y·v), e^(y·w), f^z}
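The for-loop computation can be mimicked over K = N with ordinary integers standing in for x, y, z, u, v, w. In the sketch below, the tree encoding (a label plus a tuple of annotated children), the helper names, and the numeric values are all assumptions made for the example; the source collection follows one reading of the slide's figure.

```python
def leaf(label):
    # A tree is (label, children); children is a tuple of (subtree, annotation).
    return (label, ())

def for_return_children(S):
    """for $t in $S return $t/*  evaluated over K = N.

    Each child's annotation is multiplied by its parent's annotation, and
    equal children coming from different parents have their annotations added.
    """
    out = {}
    for (_label, kids), k in S.items():
        for child, kc in kids:
            out[child] = out.get(child, 0) + k * kc
    return out

# Hypothetical numeric stand-ins for the symbolic annotations.
x, y, z, u, v, w = 2, 3, 5, 7, 11, 13

S = {
    ("a", ((leaf("d"), u),)): x,
    ("b", ((leaf("d"), v), (leaf("e"), w))): y,
    ("c", ((leaf("f"), 1),)): z,
}

answer = for_return_children(S)
```

The entry for d ends up as x·u + y·v, matching the symbolic answer on the slide.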
K-UXQuery Semantics: // Operator • Annotation of result is a sum over products of annotations along paths to root. Query: <r> $S//c </r>. [Figure: source tree $S, with the same labels and annotations as the K-UXML example, and the answer tree with root r, whose children include c^(y1) with its subtree and c^(x1·y3 + y1·y2)]
Application: Access Control • Data annotated with clearance levels from a total order C: P < C < S < T < 0 • Joint use of data (·) requires access to both (max of clearances); alternative use of data (+) requires access to either (min of clearances) • (C, min, max, 0, P) is a commutative semiring. Query: <p> $S/*/* </p>. [Figure: source tree a with children b^C and c^C, whose children d and e carry levels among C, S, T; answer p with children d^(min(max(P,C,C), max(P,C,S))) and e^(max(P,C,T))]
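The claim that (C, min, max, 0, P) is a commutative semiring can be checked by brute force, since there are only five elements. The sketch below encodes the levels as integers, which is an assumption of the example, and also evaluates the two annotations from the answer tree and the erasure map used for non-interference.

```python
from itertools import product

# Clearance levels encoded as integers so that P < C < S < T < 0,
# where "0" (here ZERO) means "accessible to nobody".
P, C, S, T, ZERO = range(5)
LEVELS = [P, C, S, T, ZERO]
add, mul, zero, one = min, max, ZERO, P

def is_commutative_semiring():
    for a, b, c in product(LEVELS, repeat=3):
        ok = (add(a, add(b, c)) == add(add(a, b), c)
              and add(a, b) == add(b, a)
              and mul(a, mul(b, c)) == mul(mul(a, b), c)
              and mul(a, b) == mul(b, a)
              and mul(a, add(b, c)) == add(mul(a, b), mul(a, c))
              and add(a, zero) == a
              and mul(a, one) == a
              and mul(a, zero) == zero)
        if not ok:
            return False
    return True

# Annotations from the answer tree: d^(min(max(P,C,C), max(P,C,S))), e^(max(P,C,T)).
d_level = add(mul(mul(P, C), C), mul(mul(P, C), S))
e_level = mul(mul(P, C), T)

def erase_above(c, k):
    """Keep k if it is visible at clearance c, otherwise map it to 0."""
    return k if k <= c else ZERO
```

With these values, d is visible to a user with clearance C while e requires T, so erasing everything above C removes e from the answer.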
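Security Condition: Non-Interference • For any given clearance level (e.g., C), want the following diagram to commute: erasing all data with clearance > C and then running the query gives the same result as running the query and then erasing all data with clearance > C. [Figure: commuting square with horizontal “query” arrows and vertical “erase > C” arrows]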
Application: Incomplete XML • Data annotated with Boolean expressions; tree T represents a set of possible worlds Rep(T). [Figure: tree T with nodes annotated by x, y, z, and Rep(T), its 7 possible worlds]
Correctness: Possible Worlds • For every incomplete tree T and every UXQuery query q, want this diagram to commute: q(Rep(T)) = Rep(q(T)). [Figure: commuting square with arrows Rep and q]
Commutation with Homomorphisms. Theorem: Let h : K1 → K2 be a semiring homomorphism. Then for any RA+/NRC/UXQuery query q and any K1-instance D, we have h(q(D)) = q(h(D)). • Ex: access control: h_c : C → C, h_c(k) := if k ≤ c then k else 0 • Ex: incomplete databases: ν : Vars → B, Eval_ν : PosBool(Vars) → B • Ex: duplicate elimination: δ : N → B, δ(k) := if k = 0 then ⊥ else ⊤
Provenance Polynomials are Universal. Corollary: The semantics of RA+/NRC/UXQuery evaluation on K-instances, for any commutative semiring K, factors through evaluation using provenance polynomials N[X]. E.g., for any K-UXML document D and any K-UXQuery q, we have q(D) = Eval_ν(q(D’)), where • D’ is obtained by replacing K-annotations in D with fresh variables from X • ν : X → K is the corresponding valuation • Eval_ν : N[X] → K is the unique semiring homomorphism such that for the one-variable monomials, Eval_ν(x) = ν(x).
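A sketch of Eval_ν follows, assuming polynomials are stored as dictionaries from monomials to coefficients and that a semiring is passed as a small record of functions (both are choices made for this example). It evaluates the polynomial 2r² + rs from the earlier slides into N, recovering a multiplicity, and into B, answering the trust question; the polynomial 2p² is this example's reading of "2 alternatives, both need Moe, twice".

```python
from collections import namedtuple

Semiring = namedtuple("Semiring", "add mul zero one")
NAT = Semiring(lambda a, b: a + b, lambda a, b: a * b, 0, 1)
BOOL = Semiring(lambda a, b: a or b, lambda a, b: a and b, False, True)

def eval_poly(poly, nu, K):
    """Eval_nu: N[X] -> K, the homomorphism extending the valuation nu.

    Exponents become repeated K-multiplication and natural-number
    coefficients become repeated K-addition.
    """
    total = K.zero
    for monomial, coeff in poly.items():
        term = K.one
        for x, e in monomial:
            for _ in range(e):
                term = K.mul(term, nu[x])
        for _ in range(coeff):
            total = K.add(total, term)
    return total

poly_de = {(("r", 2),): 2, (("r", 1), ("s", 1)): 1}   # 2r² + rs
poly_ac = {(("p", 2),): 2}                            # 2p²

# Bag semantics: hypothetical multiplicities for the base tuples.
multiplicity = eval_poly(poly_de, {"p": 2, "r": 5, "s": 1}, NAT)

# Trust: Moe (p) and Curly (s) are trusted, Larry (r) is not.
trust = {"p": True, "r": False, "s": True}
trusted_de = eval_poly(poly_de, trust, BOOL)
trusted_ac = eval_poly(poly_ac, trust, BOOL)
```

Because the polynomial is computed once and then specialized, changing the trust policy or the multiplicities requires no re-evaluation of the query itself.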
Datalog? The semiring structure on annotations works out nicely for positive relational algebra, positive nested relational calculus (NRC), and a large fragment of XQuery. What more do we need to capture recursion, i.e., for Datalog queries? ω-complete semirings with ω-continuous operations (so fixed points exist!): ω-continuous semirings. N is not, but N^∞ ≜ N ∪ {∞} is.
Datalog may have infinite derivations! Polynomials do not suffice, since they are finite! Nonetheless, the calculations are finitely representable through a system of equations. The equations have a least solution in any ω-continuous semiring. For provenance, we must generalize from polynomials to formal power series (in general, infinitely many monomials).
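To make the fixed-point idea concrete, the sketch below solves the equation system of a transitive-closure program by Kleene iteration over the natural numbers. The program, the edge annotations, and the iteration cap are assumptions of this example; on an acyclic graph the iteration stabilizes after finitely many rounds, whereas a cycle would push annotations toward ∞, which plain N cannot represent.

```python
def transitive_closure(E, max_rounds=100):
    """Least solution of T = E + E∘T over N, by Kleene iteration.

    Corresponds to:  T(x,y) :- E(x,y)
                     T(x,y) :- E(x,z), T(z,y)
    """
    T = {}
    for _ in range(max_rounds):
        new = dict(E)
        for (x, z), k1 in E.items():
            for (z2, y), k2 in T.items():
                if z == z2:
                    new[(x, y)] = new.get((x, y), 0) + k1 * k2
        if new == T:
            return T                  # fixed point reached
        T = new
    raise RuntimeError("no fixed point reached; annotations may be infinite")

# Hypothetical acyclic edge relation with bag annotations.
E = {("a", "b"): 2, ("b", "c"): 3, ("a", "c"): 1}
T = transitive_closure(E)
```

Here T(a, c) collects the direct edge plus the two-step path, i.e. 1 + 2·3. Running the same function on a self-loop never stabilizes and hits the iteration cap.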
Related Work • Foundations: semirings/systems of equations/formal power series first used in CS in theory of formal languages [Chomsky, Schützenberger 1963] • Our work is related to and shares similar goals with “Debugging schema mappings with routes” [Chiticariu, Tan VLDB 2006], where “routes” are like minimal finite portions of our provenance polynomials
More Related Work • Bag semantics for NRC [Libkin&Wong 97] • Incomplete XML [Kanza+ 99, Abiteboul+ 06] • Probabilistic XML [Nierman&Jagadish 02, van Keulen+ 05, Abit.&Senellart 06, Sen.&Abit. 07, Hung+ 07] • XML provenance [Buneman+ 01] • NRC provenance [Hidders+ 07] • Soft CSPs [Bistarelli et al] • Semiring-annotated XPath [Grahne+ 07] • Negation, expressiveness of RA_K [Geerts&Poggi 08]
Related Work for T.J. Green • Already mentioned • Set-cont. and equiv. of CQs [Chandra&Merlin 77] • Set-cont. and equiv. of UCQs [Sagiv&Yannakakis 80] • Bag-cont. of UCQs [Ioannidis&Ramakrishnan 95] • Bag-equiv. of CQs [Chaudhuri&Vardi 93] • Containment of CQs with where-provenance [Tan 03] • Bag-set semantics [CV 93], combined semantics [Cohen 06] • For K-relations: support operator of [Geerts&Poggi 08] generalizes duplicate elimination • Bag-containment of CQs [Jayram+ 06]
Conclusion • Annotations forming a commutative semiring seem to fit well with database transformations expressed in positive query languages, be they relational, even recursive, or for complex values or tree data. • We obtained explanations for a number of puzzles related to why-provenance in a broad sense. • Provenance polynomials also capture tuple multiplicity and serve well systems such as Orchestra. • Big open questions: negation (although see work by Geerts, Poggi) and order
Future Work I have the feeling that we have only scratched the surface so far… I am working on marrying this approach with data exchange, with a broader perspective on security, with integrity constraints, with a broader perspective on mapping/view maintenance and update…