Containment of Relational Queries with Annotation Propagation
Wang-Chiew Tan
University of California, Santa Cruz
Annotation Management System
• A system that propagates the meta-data associated with a piece of data along with that data as it is moved around
• Main feature: to trace the provenance and flow of data
[Figure: annotations a1 and a2 carried through a transformation]
Tracing the Provenance and Flow of Data
[Figure: annotated data items (a1, a2, a3, b1, b2, b3) passing through two transformations]
Other Applications
• Keep information that cannot otherwise be stored in the current database design
• Highlight wrong data: erroneous data may be copied, but the comment that it is wrong goes along with it
• Security: annotate the security level of data items
• Quality metric: annotate the quality level of data items
Main Question
• Are the annotated outcomes the same for equivalent queries?
• Why this question? A query optimizer rewrites a query. Will the rewritten query have the same annotation propagation behavior?
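A Simple Example
Given two relation schemas: R(A,B), S(B,C)
SELECT * FROM R NATURAL JOIN S
versus
SELECT r.A, r.B, s.C FROM R r, S s WHERE r.B = s.B
R = { (1, 2) }, with annotation a on the B-value 2
S = { (2, 3) }, with annotation b on the B-value 2
Result1 = { (1, 2, 3) }, where 2 carries annotations a and b
Result2 = { (1, 2, 3) }, where 2 carries annotation a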
In a More Concise Notation
Ans(x,y,z) :- R(x,y), S(y,z) with valuation { x → 1, y → 2, z → 3 }
Ans(x,y,z) :- R(x,y), S(y’,z), y = y’ with valuation { x → 1, y → 2, y’ → 2, z → 3 }
• A location is a triple (R, t, A): a relation R, a tuple t in R, and an attribute A of R
• Annotations of values that reside in different locations but are bound to the same variable are unioned together
Ans(y) :- R(x,y)
Ans(y) :- S(y,z)
Result: Ans(2), where 2 carries annotations a and b
• Annotations that belong to the same output location are unioned together
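The two propagation rules above can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation: the function name `evaluate`, the encoding of a location as `(relation, row index, column index)`, and the `equalities` parameter are all hypothetical. The sample data assumes that annotation a sits on the B-value of the R-tuple (1, 2) and b on the B-value of the S-tuple (2, 3).

```python
from itertools import product


def evaluate(head, body, db, ann, equalities=()):
    """Evaluate a conjunctive query and propagate annotations.

    head:       list of variable names, e.g. ["x", "y", "z"]
    body:       list of (relation name, list of variable names) subgoals
    db:         dict mapping relation name -> list of tuples
    ann:        dict mapping a location (relation, row index, column index)
                -> set of annotations (hypothetical encoding of (R, t, A))
    equalities: pairs of variables that must take equal values (y = y')
    Returns a dict: output tuple -> list of annotation sets, one per column.
    """
    out = {}
    # Try every way of picking one stored tuple per subgoal.
    for rows in product(*(range(len(db[rel])) for rel, _ in body)):
        val, carried = {}, {}
        consistent = True
        for (rel, args), r in zip(body, rows):
            for col, var in enumerate(args):
                value = db[rel][r][col]
                if val.setdefault(var, value) != value:
                    consistent = False
                # Same variable, different locations: union the annotations.
                carried.setdefault(var, set()).update(ann.get((rel, r, col), ()))
        if not consistent or any(val[a] != val[b] for a, b in equalities):
            continue
        answer = tuple(val[v] for v in head)
        # Same output location reached by several valuations: union again.
        columns = out.setdefault(answer, [set() for _ in head])
        for i, v in enumerate(head):
            columns[i] |= carried[v]
    return out


db = {"R": [(1, 2)], "S": [(2, 3)]}
# Assumed placement: a on R's B-value, b on S's B-value.
ann = {("R", 0, 1): {"a"}, ("S", 0, 0): {"b"}}

natural_join = evaluate(["x", "y", "z"],
                        [("R", ["x", "y"]), ("S", ["y", "z"])], db, ann)
explicit_eq = evaluate(["x", "y", "z"],
                       [("R", ["x", "y"]), ("S", ["y2", "z"])], db, ann,
                       equalities=[("y", "y2")])

print(sorted(natural_join[(1, 2, 3)][1]))  # ['a', 'b']
print(sorted(explicit_eq[(1, 2, 3)][1]))   # ['a']
```

The two queries return the same tuple, but only the natural-join form binds one variable to both B-locations, so only it collects both annotations. The nested enumeration is exponential in the number of subgoals and is meant for small examples only.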
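More Examples
R = { (1, 2, 3), (1, 4, 5), (1, 8, 4), (8, 9, 5) }
Q1: Ans(x,v) :- R(x,y,u), R(x,z,v), R(t,w,z)
Q2: Ans(x,v) :- R(p,q,v), R(x,z,v), R(t,w,z)
First answer: Ans(1, 5)
Second answer: Ans(1, 5)
[Figure: values of R carry annotations a, b, c, d; the two answers are the same tuple but carry different annotations]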
A sufficient condition for annotation containment
Theorem: If Q1 and Q2 are equivalent and Q1 is minimal, then Q1 is annotation-contained in Q2
Intuition of proof:
• If Q1 is minimal, then no proper subquery of Q1 is equivalent to Q1
• The minimal query of Q2 is isomorphic to Q1 up to variable renaming. Assume that they are identical.
• Any valuation ν for Q1 can be simulated by the valuation ν ∘ h for Q2, which carries annotations in the same way as ν does for Q1 (h is the homomorphism from Q2 to its minimal subquery)
Is the sufficient condition too strong?
• Is it true that if Q1 is equivalent to Q2, then Q1 is annotation-contained in Q2? Answer: No.
• Is it true that if Q1 is contained in Q2 and Q1 is minimal, then Q1 is annotation-contained in Q2? Answer: No.
Q1: Ans(x) :- R(x, y), S(x, y)
Q2: Ans(x) :- R(x, y)
R = { (1, 2), (1, 3) }, with annotation a on the 1 of (1, 2) and c on the 1 of (1, 3)
S = { (1, 2) }, with annotation b on its 1
Q1 returns Ans(1) with annotations a, b; Q2 returns Ans(1) with annotations a, c
• Both Q1 and Q2 are minimal queries, but neither is annotation-contained in the other
Necessary and Sufficient condition?
• If Q1 carries an annotation of the jth column of some S-tuple to the output, there is a way for Q2 to simulate this behavior via a homomorphism h
Q1: H(… x …) :- … S(… x …) … (x is at the ith column of the head and at the jth column of the pth subgoal)
Q2: H(… y …) :- … S(… y …) … (y is at the ith column of the head and at the jth column of the qth subgoal)
h(y) = x, and h maps the qth subgoal of Q2 to the pth subgoal of Q1
A necessary and sufficient condition for annotation-containment via homomorphisms
Theorem: Q1 is annotation-contained in Q2 iff for every distinguished variable x that occurs at the ith column in the head and at the jth column of the pth subgoal in the body of Q1, there exists a homomorphism h from Q2 to Q1 such that
• h maps the body of Q2 into the body of Q1 and the head of Q2 to the head of Q1
• Let the qth subgoal of Q2 be the preimage of the pth subgoal of Q1 under h. The variable that occurs at the jth column of the qth subgoal of Q2 is identical to the variable that occurs at the ith column in the head of Q2
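Can a single homomorphism do the job?
Q1: Ans(x) :- R(x,y), R(x,z)
Q2: Ans(x) :- R(x,y)
• Every homomorphism from Q2 to Q1 maps the body of Q2 to only one subgoal of Q1
• So no single homomorphism covers both occurrences of x in the body of Q1; a different homomorphism may be needed for each occurrence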
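The homomorphism condition of the theorem can be checked by brute force, as in the sketch below. It is an illustration under stated assumptions, not the author's algorithm: the names `homomorphisms` and `annotation_contained` and the `(head, body)` encoding of a query are hypothetical, and queries are assumed to be safe, constant-free, and to have a non-empty head. The search enumerates every assignment of Q2 subgoals to Q1 subgoals, so it takes exponential time in the worst case.

```python
from itertools import product


def homomorphisms(q_from, q_to):
    """Yield (h, targets) for every homomorphism from q_from to q_to.

    A query is (head, body): head is a list of variables, body a list of
    (relation name, list of variables). targets[q] is the index of the
    q_to subgoal that subgoal q of q_from is mapped onto.
    """
    (head_f, body_f), (head_t, body_t) = q_from, q_to
    if len(head_f) != len(head_t):
        return
    options = [[p for p, (rel_t, args_t) in enumerate(body_t)
                if rel_t == rel_f and len(args_t) == len(args_f)]
               for rel_f, args_f in body_f]
    for targets in product(*options):
        # Head maps to head position-wise; each subgoal maps to its target.
        pairs = list(zip(head_f, head_t))
        for (_, args_f), p in zip(body_f, targets):
            pairs.extend(zip(args_f, body_t[p][1]))
        h = {}
        if all(h.setdefault(src, dst) == dst for src, dst in pairs):
            yield h, targets


def annotation_contained(q1, q2):
    """Check the homomorphism condition for 'q1 annotation-contained in q2'."""
    (head1, body1), (head2, body2) = q1, q2
    homs = list(homomorphisms(q2, q1))
    for i, x in enumerate(head1):
        for p, (_, args1) in enumerate(body1):
            for j, var in enumerate(args1):
                if var != x:
                    continue
                # Some h must send a subgoal q of q2 onto subgoal p of q1,
                # with q's jth variable being the ith head variable of q2.
                if not any(targets[q] == p and body2[q][1][j] == head2[i]
                           for _, targets in homs
                           for q in range(len(body2))):
                    return False
    return True


# Queries from the single-homomorphism slide.
q1 = (["x"], [("R", ["x", "y"]), ("R", ["x", "z"])])
q2 = (["x"], [("R", ["x", "y"])])
# Queries from the "too strong?" slide.
p1 = (["x"], [("R", ["x", "y"]), ("S", ["x", "y"])])
p2 = (["x"], [("R", ["x", "y"])])

print(annotation_contained(q1, q2), annotation_contained(q2, q1))  # True True
print(annotation_contained(p1, p2), annotation_contained(p2, p1))  # False False
```

For `q1` and `q2`, the check succeeds only because each of the two occurrences of x in the body of `q1` is covered by a different homomorphism from `q2`.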
Complexity of Annotation-Containment
Proposition: It is NP-complete to decide whether Q1 is annotation-contained in Q2
Propagating annotations back
• If we wish to attach an annotation to a piece of data in the output, to which source data should we attach the annotation?
• The user should be given the choice
• Alert the user of a side-effect-free annotation when there is one
Annotation Placement Problem
• Given the source database, the query, and the output data that we wish to annotate, it is DP-hard to decide if there is a side-effect-free annotation
• Upper bound is not DP
• Conjecture: in a class slightly above DP
Related Work
• The idea is not new, though annotations were never explicitly stated as provenance-based: Wang & Madnick [VLDB 90], Lee, Bressan & Madnick [WIDM 98], Bernstein & Bergstraesser [IEEE Data Eng. 99]
• Annotations of Web documents
• Annotations on genomic sequences
Open Issues
• Are there polynomial-time algorithms for deciding annotation-containment for the class of queries with bounded treewidth?
• Is query minimization Church-Rosser?
• Exact complexity of the annotation placement problem?
• Annotation and propagation for XML data
• Relationship between annotation-containment and containment of conjunctive queries under bag semantics
Open Issues (contd)
• Annotation propagation semantics other than those based on provenance?
• Querying the annotations?
Other results that do not carry over
• Query Minimization: we can no longer minimize a query and preserve annotation-equivalence by discarding one subgoal at a time
• Answering Queries using Views: some classical results no longer hold, such as [LMSS95]: if a query Q has p subgoals and a query Q’ is a complete minimal rewriting of Q using a set of views V, then Q’ has at most p subgoals
Example
Q: A(x) :- R(x,z,v), R(x,u,z), R(x,z’,t), R(x,s,z’)
Q’: A(x) :- R(x,u,z), R(x,z’,t), R(x,s,z’)
Qmin: A(x) :- R(x,z’,t), R(x,s,z’)
R = { (1, 2, 3), (1, 3, 2), (1, 4, 5), (1, 4, 6) }, annotated a1, a2, a3, a4 respectively
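The example can be replayed with a small script that collects the annotations reaching x under each of the three queries. This is a sketch under an explicit assumption: the slide does not survive extraction well enough to show where a1 to a4 sit, so the code places one annotation on the first column of each row of R. The helper `annotations_on_x` is hypothetical, and z’ is written `z2`.

```python
from itertools import product

# Rows of R as listed in the example. Placing a1..a4 on the first column
# of each row is an assumption made for this sketch.
R = [(1, 2, 3), (1, 3, 2), (1, 4, 5), (1, 4, 6)]
ANN = {(0, 0): "a1", (1, 0): "a2", (2, 0): "a3", (3, 0): "a4"}


def annotations_on_x(body):
    """Collect the annotations that reach x in A(x).

    body is a list of subgoals over R, each a tuple of variable names.
    All rows of R start with 1, so A(1) is the only possible answer and
    the annotations of every valuation are unioned into one set.
    """
    found = set()
    # Try every way of picking one row of R per subgoal.
    for rows in product(range(len(R)), repeat=len(body)):
        val = {}
        consistent = all(
            val.setdefault(var, R[r][col]) == R[r][col]
            for args, r in zip(body, rows)
            for col, var in enumerate(args)
        )
        if not consistent:
            continue
        for args, r in zip(body, rows):
            for col, var in enumerate(args):
                if var == "x" and (r, col) in ANN:
                    found.add(ANN[(r, col)])
    return found


Q = [("x", "z", "v"), ("x", "u", "z"), ("x", "z2", "t"), ("x", "s", "z2")]
Q_PRIME = [("x", "u", "z"), ("x", "z2", "t"), ("x", "s", "z2")]
Q_MIN = [("x", "z2", "t"), ("x", "s", "z2")]

print(sorted(annotations_on_x(Q)))        # ['a1', 'a2']
print(sorted(annotations_on_x(Q_PRIME)))  # ['a1', 'a2', 'a3', 'a4']
print(sorted(annotations_on_x(Q_MIN)))    # ['a1', 'a2']
```

Under this placement, dropping the single subgoal R(x,z,v) leaves R(x,u,z) unconstrained in Q’, so Q’ picks up a3 and a4 as well, while Q and Qmin agree on this database.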