Views as Incomplete Databases – Certain & Possible Answers

• Views: an incomplete representation
• Certain and possible answers
• Complexity results for certain answers

Views: an incomplete representation
Given: a view definition V and a view extension I
• Sound V: I is contained in V(D)
• Complete V: I contains V(D)
• Precise V: I = V(D)
V may also be mixed: some views are sound, others are complete.
In general, more than one db D may exist s.t. I is a possible view extension of D.

Example: teams in the World Cup soccer tournament
Global schema: Team(country, group) (group = assignment for the 1st round)
• Source 1: S-C(C), the countries that participate
• Source 2: S-Q(C), the countries that participated in qualifying games
• Source 3: S-T(C), the teams whose games will be on TV
For all three, the logical mapping is v(X) :- Team(X, Y)

Given V (including a specification of each view as sound/complete/precise) and I:
poss(V, I) = {D | D is a db for which I is a possible view extension}
Since we have only the views, this is the set of possible databases.
• For sound views: an infinite set
• For complete views: contains the empty db
• For precise views: may be empty (inconsistent views)
Example:
v1(X, Y) :- R(X, Y, Z), v1 = {(a, b), (b, c)}
v2(X, Z) :- R(X, Y, Z), v2 = {(a, d), (c, e)}
If both views are precise, poss(V, I) is empty: v1 requires the X-values of R to be {a, b}, while v2 requires them to be {a, c}.
* The above changes when the global db is known to satisfy constraints (e.g. keys)
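The example on this slide can be checked mechanically. The following is a minimal sketch, not part of the original slides: the function names (`view_images`, `in_poss`, `largest_precise_candidate`) and the placeholder values `n1`, `n2` are assumptions made for illustration. It relies on the observation that any R with precise views must be contained in the join of v1 and v2 on X, so the precise case is consistent exactly when that join reproduces both extensions.

```python
def view_images(r):
    """Evaluate v1(X, Y) :- R(X, Y, Z) and v2(X, Z) :- R(X, Y, Z) on database r."""
    return {(x, y) for (x, y, z) in r}, {(x, z) for (x, y, z) in r}


def in_poss(r, v1, v2, mode):
    """Is r a possible database for extensions v1, v2 under the given mode?"""
    img1, img2 = view_images(r)
    if mode == "sound":
        return v1 <= img1 and v2 <= img2
    if mode == "complete":
        return img1 <= v1 and img2 <= v2
    if mode == "precise":
        return img1 == v1 and img2 == v2
    raise ValueError(mode)


def largest_precise_candidate(v1, v2):
    # Any R with precise views is contained in this join on X,
    # so poss is nonempty under "precise" iff the join itself qualifies.
    return {(x, y, z) for (x, y) in v1 for (x2, z) in v2 if x == x2}


v1 = {("a", "b"), ("b", "c")}
v2 = {("a", "d"), ("c", "e")}

precise_consistent = in_poss(largest_precise_candidate(v1, v2), v1, v2, "precise")
# "n1" and "n2" are made-up placeholder values for the unknown attributes.
sound_witness = {("a", "b", "d"), ("b", "c", "n1"), ("c", "n2", "e")}

print(precise_consistent)                        # False
print(in_poss(sound_witness, v1, v2, "sound"))   # True
print(in_poss(set(), v1, v2, "complete"))        # True
```

The three printed values mirror the three bullets: the precise case is inconsistent, a sound witness exists (and infinitely many others could be built by varying the placeholders), and the empty db is always possible for complete views.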

Certain and possible answers
Now, assume also a query Q.
• cert(Q, V, I): seems easier to compute, always finite
• poss(Q, V, I): may be infinite, and where do we obtain values not in I?
A possible approach: a finite representation of a possibly infinite family of partially unknown databases
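The slide uses cert and poss for query answers without spelling them out. The definitions below are the usual ones and agree with how a later proof in these slides uses cert ("for each D in poss(V, I), t in Q(D)"). The union form for possible answers is the standard convention and is an assumption here, since the slide text does not state it.

```latex
\[
\mathrm{cert}(Q, V, I) \;=\; \bigcap_{D \,\in\, \mathrm{poss}(V, I)} Q(D),
\qquad
\mathrm{poss}(Q, V, I) \;=\; \bigcup_{D \,\in\, \mathrm{poss}(V, I)} Q(D).
\]
```

Read this way, a tuple is a certain answer if every possible database returns it, and a possible answer if at least one does.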

We concentrate on certain answers: an absolute notion of answering queries using views.
cert(Q, V, I) depends on the soundness/completeness of the views.
Example:
Global schema: p(x, y)
Views: v1(x) :- p(x, y), v2(y) :- p(x, y)
I = {v1(a), v2(b)}
Q: q(x, y) :- p(x, y)
• Sound views: cert(Q, V, I) is empty
• Precise views: cert(Q, V, I) is {(a, b)}
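The two results in the example can be reproduced by brute force. This is a sketch under a strong assumption: the set of possible databases is infinite for sound views, so the code only enumerates databases over a small finite domain (`a`, `b`, plus one extra constant `c`). In general that restriction can only over-approximate the certain answers; for this example it happens to give exactly the answers stated on the slide. All function names are made up for illustration.

```python
from itertools import combinations, product

DOMAIN = ["a", "b", "c"]                      # assumption: one extra constant "c"
ALL_FACTS = list(product(DOMAIN, repeat=2))   # every candidate fact p(x, y)

I1, I2 = {"a"}, {"b"}                         # I = {v1(a), v2(b)}


def databases():
    """Enumerate every database over DOMAIN (2**9 = 512 of them)."""
    for k in range(len(ALL_FACTS) + 1):
        for facts in combinations(ALL_FACTS, k):
            yield frozenset(facts)


def v1(db):
    return {x for (x, y) in db}               # v1(x) :- p(x, y)


def v2(db):
    return {y for (x, y) in db}               # v2(y) :- p(x, y)


def poss(mode):
    """Possible databases (within DOMAIN) for sound or precise views."""
    for db in databases():
        if mode == "sound" and I1 <= v1(db) and I2 <= v2(db):
            yield db
        elif mode == "precise" and I1 == v1(db) and I2 == v2(db):
            yield db


def certain_answers(query, mode):
    """Intersect the query answers over all possible databases."""
    result = None
    for db in poss(mode):
        answers = query(db)
        result = answers if result is None else result & answers
    return result


def q(db):
    return set(db)                            # q(x, y) :- p(x, y)


def q_prime(db):
    return {x for (x, y) in db}               # s(x) :- p(x, y)


print(certain_answers(q, "sound"))            # set()
print(certain_answers(q, "precise"))          # {('a', 'b')}
print(certain_answers(q_prime, "sound"))      # {'a'}
```

The sound case is empty because, for instance, {p(a, b)} and {p(a, c), p(c, b)} are both possible and share no tuple. The last line evaluates the projection query s(x) :- p(x, y) that the next slide discusses.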

An issue in query processing:
For the same example, let Q': s(x) :- p(x, y)
With sound views, cert(Q', V, I) = {(a)} although cert(Q, V, I) is empty, so cert(Q', V, I) cannot be obtained by projecting cert(Q, V, I).
To allow relational algebra manipulation of certain answers, we need more than a simple relational representation!
We need algorithms for performing operations on representations of partially unknown dbs (not in this course).

From now on: sound views, certain answers
This was investigated for views defined in L1 and queries defined in L2, where L1, L2 are in {CQ, CQ!=, NR-Datalog, Datalog, FO}
Results include:
• Complexity: lower bounds
• Algorithms: upper bounds

Complexity results for certain answers
Thm: for V in L1, Q in L2, the following are equivalent:
(a) computing cert(Q, V, I)
(b) deciding containment: is Q1 (in L1) contained in Q2 (in L2)?
• (a) is decidable iff (b) is
• When decidable, combined complexity of (a) = query complexity of (b)
• Data complexity of (a) <= query complexity of (b)
[Data complexity: a function of db size. Query complexity: a function of query size. Combined complexity: a function of both.]

Proof (sketch): given t, how hard is it to decide whether t is in cert(Q, V, I)?
Let I = {vi(tij)}. Define Q' as follows: Q' contains the rules that define V, and one more "large" rule:
q'(t) :- the conjunction of all facts vi(tij) in I (t follows from the facts in I)
Claim: t is in cert(Q, V, I) iff Q' is contained in Q
Hence deciding whether t is in cert(Q, V, I) is no harder than this containment.
(Note: for L1 = CQ, we need to "massage" Q' into a CQ)

How hard is it to check containment of Q1 in Q2?
Let p be a new predicate.
• Define V by: the rules of Q1, and v(c) :- q1(X), p(X); let I = {v(c)}
• Define Q by: the rules of Q2, and q(c) :- q2(X), p(X)
Then: (c) is in cert(Q, V, I) iff Q1 is contained in Q2

Consequences: computing certain answers (depends on L1, L2) is:
• undecidable for Datalog, FO
• decidable if one side <= Datalog and the other side <= NR-Datalog
For the decidable cases, the above gives combined complexity.
We are more interested in data complexity.
co-NP data complexity is bad: impractical to compute, no Datalog plan!
We will not prove the co-NP complexity results.

Claim: For Q in Datalog, V in CQ(!=), let V~ be the same view definition with the inequalities omitted.
Then cert(Q, V, I) = cert(Q, V~, I)
(Computing the certain answers from I using V without the inequalities gives the same results)
Proof:
(a) V(D) is contained in V~(D), so poss(V, I) is contained in poss(V~, I); hence cert(Q, V~, I) is contained in cert(Q, V, I).
(b) Let t be in cert(Q, V, I), so that t is in Q(D) for each D in poss(V, I). Take any D in poss(V~, I).
• If D is also in poss(V, I): fine
• If D is not in poss(V, I): there exists a larger D' in poss(V, I), so t is in Q(D'), and then t is also in Q(D)
Hence t is in cert(Q, V~, I)

Proof of last claim:
Some s is in I but not in V(D), because of some inequality.
Since s is in V(D''), the inequality involves an attribute in the view body.
• We can add some tuples to D to obtain D1, s.t. s is in V(D1)
• Adding tuples for all such s gives D' that contains D, s.t. D' is in poss(V, I)
• If t is in Q(D'), then, since Q has no inequalities, t is also in Q(D)

For CQ views, Datalog queries:
Query plan: a Datalog program P on V
exp(P): replace the views by their definitions (using fresh names for existential variables)
P is maximally contained in Q if:
• exp(P)(D) is contained in Q(D)
• exp(P')(D) is contained in exp(P)(D) for all other plans P' that satisfy the first condition
Such a plan is best among all plans.
(This is a language-dependent notion: given a more expressive language, P may no longer be best)
But if a plan delivers cert(Q, V, I), it is absolutely best.

Thm: For sound CQ views and Datalog queries, the inverse rules algorithm computes cert(Q, V, I)
(Thus, for this case, a Datalog query plan can give the absolute best possible answer)
Corollary: If P is max-cont(Q), then for all view instances I, P(I) = cert(Q, V, I)
We proceed to prove the theorem.

Def: A tableau is a collection of atoms, with constants and variables.
A tableau T represents a db D if there is a valuation h from T into D.
rep(T) = {D | for some h, D contains h(T)}
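Membership of a database in rep(T) amounts to searching for a valuation. The following is a minimal backtracking sketch, not taken from the slides. It assumes an encoding chosen for illustration: an atom is a pair (predicate, argument tuple), and a term is treated as a variable exactly when it is a string starting with an uppercase letter. The name `find_valuation` is hypothetical.

```python
def is_var(term):
    # Assumed convention: uppercase-initial strings are variables.
    return isinstance(term, str) and term[:1].isupper()


def find_valuation(tableau, db, h=None):
    """Return a valuation h with h(tableau) contained in db, or None."""
    h = {} if h is None else h
    if not tableau:
        return h
    (pred, args), rest = tableau[0], tableau[1:]
    for fact_pred, fact_args in db:
        if fact_pred != pred or len(fact_args) != len(args):
            continue
        candidate = dict(h)
        # Variables must be bound consistently; constants must match exactly.
        if all((candidate.setdefault(a, v) == v) if is_var(a) else a == v
               for a, v in zip(args, fact_args)):
            found = find_valuation(rest, db, candidate)
            if found is not None:
                return found
    return None


# Tableau {p(a, Y), p(X, b)} with variables X, Y.
T = [("p", ("a", "Y")), ("p", ("X", "b"))]

D1 = {("p", ("a", "b"))}
D2 = {("p", ("a", "c")), ("p", ("c", "b"))}
D3 = {("p", ("b", "a"))}

print(find_valuation(T, D1))   # {'Y': 'b', 'X': 'a'}
print(find_valuation(T, D2))   # {'Y': 'c', 'X': 'c'}
print(find_valuation(T, D3))   # None
```

D1 and D2 are in rep(T) because a valuation exists; D3 is not. Any database containing D1 or D2 is also in rep(T), since the definition only asks that D contain h(T).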

Claim: For a Datalog query Q and a tableau T,
cert(Q, rep(T)) = the tuples without variables in Q(T)
Proof:
(a) We can consider only D in rep(T) s.t. D = h(T): every tuple that is in Q(D') but not in Q(D), where D' is larger than h(T), is not in cert(Q, rep(T))
(b) For such D, h(Q(T)) = Q(D), so a ground tuple in Q(T) is in cert(Q, rep(T))
(c) For a non-ground tuple t in Q(T), we can find D1, D2 in rep(T) that give different values to the variables in t, so no instance of this tuple is in cert(Q, rep(T))

The inverse rules of V create, from a view instance I, a database whose elements include Skolem terms.
Consider each Skolem term to be a distinct variable.
• This is a tableau T(V, I)
Claim: T(V, I) represents poss(V, I)
Proof: easy
Corollary: the set of ground tuples in Q(T(V, I)) is cert(Q, V, I)
This is precisely what the inverse rules algorithm produces: for each I, the inverse rules produce T(V, I), and then Q is applied.
End of story.
Next: one more (last) algorithm, for CQ queries and views, that is the fastest so far.
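The construction can be sketched in a few lines for the earlier example (v1(x) :- p(x, y), v2(y) :- p(x, y), I = {v1(a), v2(b)}). This is a simplified illustration, not the algorithm as presented in the course: it evaluates a single conjunctive query instead of a full Datalog program, and the encoding is an assumption (atoms as (predicate, argument tuple), uppercase-initial strings as variables, Skolem terms as tagged tuples). All function names are hypothetical.

```python
def is_var(term):
    # Assumed convention: uppercase-initial strings are variables.
    return isinstance(term, str) and term[:1].isupper()


def inverse_rule_facts(view_name, head_vars, body, extension):
    """Facts of the tableau T(V, I) contributed by one CQ view."""
    facts = set()
    for row in extension:
        binding = dict(zip(head_vars, row))
        for pred, args in body:
            grounded = []
            for arg in args:
                if not is_var(arg):
                    grounded.append(arg)              # constant in the view body
                elif arg in binding:
                    grounded.append(binding[arg])     # head variable
                else:                                 # existential variable
                    grounded.append(("skolem", view_name, arg, row))
            facts.add((pred, tuple(grounded)))
    return facts


def evaluate_cq(head_vars, body, facts):
    """Naive evaluation of one conjunctive query over a set of facts."""
    bindings = [{}]
    for pred, args in body:
        extended = []
        for binding in bindings:
            for fact_pred, fact_args in facts:
                if fact_pred != pred or len(fact_args) != len(args):
                    continue
                new = dict(binding)
                if all((new.setdefault(a, v) == v) if is_var(a) else a == v
                       for a, v in zip(args, fact_args)):
                    extended.append(new)
        bindings = extended
    return {tuple(b[v] for v in head_vars) for b in bindings}


def is_ground(row):
    """A row is ground if it contains no Skolem term."""
    return not any(isinstance(v, tuple) and v[:1] == ("skolem",) for v in row)


def certain_answers(query_head, query_body, views, instance):
    tableau = set()
    for name, (head_vars, body) in views.items():
        tableau |= inverse_rule_facts(name, head_vars, body, instance.get(name, set()))
    answers = evaluate_cq(query_head, query_body, tableau)
    return {row for row in answers if is_ground(row)}


views = {
    "v1": (("X",), [("p", ("X", "Y"))]),   # v1(X) :- p(X, Y)
    "v2": (("Y",), [("p", ("X", "Y"))]),   # v2(Y) :- p(X, Y)
}
instance = {"v1": {("a",)}, "v2": {("b",)}}

print(certain_answers(("X", "Y"), [("p", ("X", "Y"))], views, instance))  # set()
print(certain_answers(("X",), [("p", ("X", "Y"))], views, instance))      # {('a',)}
```

The tableau here contains p(a, f(a)) and p(g(b), b), with f and g standing for the two Skolem terms. Both answers to q(x, y) contain a Skolem term and are discarded, while s(x) keeps the ground tuple (a), matching the sound-view results from the earlier example.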
