Query Containment for Conjunctive Queries With Regular Expressions. Daniela Florescu, Alon Levy, Dan Suciu. PODS 1998. Slides by Gala Yadgar.
Outline
• Semi structured data and conjunctive queries
• Query containment for different query classes
• StruQL0 and the data model for it
• Substitutions and canonical databases
• Semantic criteria for query containment
• Query mappings
• Syntactic criteria for query containment
• Containment for simple StruQL0 queries
Semi Structured Data
• Data is irregular:
  • Attributes may be missing
  • The type and cardinality of an attribute may not be known
  • The set of attributes may not be known in advance
• The schema is unknown in advance
• This is an example of a data model where graphs represent databases
[Figure: family graph over the nodes Abraham, Hagar, Isaac, Ishmael, Rebecca, Jacob, Rachel, Leah, with edges labeled son, wife, y. brother, o. brother]
Languages
• Relational calculus: Datalog
  • Ancestor(X,Y) :- Father(X,Y)
  • Ancestor(X,Y) :- Ancestor(X,Z), Ancestor(Z,Y)
  • Notice we have union and recursion. Can also have negation
• Conjunctive queries:
  • Brother(X,Y) :- Son(Z,X), Son(Z,Y)
  • No union (one rule only), no recursion, no negation
• StruQL:
  • Runs on graphs, the result is a graph
  • Query Q:
    where Person{X}, X ("paper"|"publication") Y
    Collect Page{PersonPage(X), PaperPage(Y)}
    Link RootPage() "person" PersonPage(X), PersonPage(X) "paper" PaperPage(Y)
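The difference between the recursive Datalog program and the single-rule conjunctive query can be seen by evaluating both by hand. The following is a minimal sketch, assuming facts are stored as Python sets of pairs; the fact values and the function names `ancestors` and `brothers` are illustrative assumptions, not part of the slides.

```python
# Hypothetical Father facts, used only for illustration.
father = {("abraham", "isaac"), ("isaac", "jacob")}

def ancestors(father_facts):
    """Naive bottom-up evaluation of the two Ancestor rules up to a fixpoint."""
    anc = set(father_facts)  # Ancestor(X,Y) :- Father(X,Y)
    while True:
        # Ancestor(X,Y) :- Ancestor(X,Z), Ancestor(Z,Y)
        derived = {(x, y) for (x, z1) in anc for (z2, y) in anc if z1 == z2}
        if derived <= anc:   # nothing new: fixpoint reached
            return anc
        anc |= derived

def brothers(son_facts):
    """Brother(X,Y) :- Son(Z,X), Son(Z,Y): one join, no recursion needed."""
    return {(x, y) for (z1, x) in son_facts for (z2, y) in son_facts if z1 == z2}

# Hypothetical Son facts: Son(Z, X) reads "Z has son X".
son = {("abraham", "isaac"), ("abraham", "ishmael")}
```

The recursive rule needs a loop that runs until no new facts appear, while the conjunctive query is a single join. Note that the Brother rule as written also returns pairs such as (isaac, isaac), since nothing forces X and Y to differ.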
Query Containment
• Find out whether the results of one query are contained in the results of another query
  • For all databases
  • Formal definition will be given shortly
• Good for:
  • Finding redundant subgoals in a query
  • Testing whether two formulations of a query are equivalent
  • Determining independence of database updates
  • Rewriting queries using views
Known Results
• Query containment for first order conjunctive queries is decidable (and NP-complete)
  • Brother(X,Y) :- Son(Z,X), Son(Z,Y)
  • OlderBrother(X,Y) :- Son(Z,X), Son(Z,Y), Older(X,Y)
• Queries in StruQL can be translated into datalog
    where Person{X}, X ("paper"|"publication") Y
    Collect Page{PersonPage(X), PaperPage(Y)}
    Link RootPage() "person" PersonPage(X), PersonPage(X) "paper" PaperPage(Y)
  • PaperPage(Y) :- Person(X), WrotePaper(X,Y)
  • PersonPage(X) :- Person(X), WrotePaper(X,Y)
• Containment in datalog programs is undecidable
• All positive results for containment so far are restricted to the case when one of the programs is non-recursive
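For plain conjunctive queries, the classical decision procedure freezes the variables of the contained query into a canonical database and searches for a mapping of the other query into it. The following is a brute-force sketch of that test, assuming every argument is a variable and every head variable also occurs in the body; the encoding of a query as `(head, body)` and the name `contained_in` are assumptions made for this example.

```python
from itertools import product

def contained_in(q, q_prime):
    """Return True if conjunctive query q is contained in q_prime.

    A query is (head_vars, body), where body is a list of (predicate, args).
    q's variables are frozen into constants; q_prime's variables are then
    mapped onto them in every possible way (exponential, as expected for an
    NP-complete problem).
    """
    head, body = q
    head_p, body_p = q_prime
    if len(head) != len(head_p):
        return False
    facts = {(pred, tuple(args)) for pred, args in body}  # canonical database of q
    terms = sorted({a for _, args in body for a in args})
    vars_p = sorted({a for _, args in body_p for a in args} | set(head_p))
    for image in product(terms, repeat=len(vars_p)):
        h = dict(zip(vars_p, image))
        if tuple(h[v] for v in head_p) != tuple(head):
            continue  # the head of q_prime must land on the head of q
        if all((pred, tuple(h[a] for a in args)) in facts for pred, args in body_p):
            return True
    return False

brother = (("X", "Y"), [("Son", ("Z", "X")), ("Son", ("Z", "Y"))])
older_brother = (("X", "Y"), [("Son", ("Z", "X")), ("Son", ("Z", "Y")),
                              ("Older", ("X", "Y"))])
```

With the two queries from the slide, `contained_in(older_brother, brother)` succeeds because dropping the Older subgoal only enlarges the answer, while the converse fails because the frozen Brother body contains no Older fact.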
New Results
• Define StruQL0 as a subset of StruQL
  • Leaving out restructuring capabilities
  • Similar to conjunctive queries for relational calculus
• Give semantic and syntactic criteria for query containment
  • StruQL0 identifies a subset of datalog for which containment is decidable
• Show that query containment for a fragment of StruQL0 is NP-complete
The Data Model
• Labeled directed graphs
  • Nodes correspond to objects
  • Labels on the edges correspond to attributes
• Formally:
  • A universe of constants D
  • A universe of object identifiers I (I ∩ D = ∅)
  • A database DB is a pair (V,E): V ⊆ I is the set of nodes and E ⊆ V × D × V is the set of labeled edges
• In the example:
  • D = {a,b,c,d}
  • V = I = {u1,u2,u3,u4,u5,u6}
  • E = {(u1,c,u6), (u1,a,u5),…}
[Figure: the example database over nodes u1,…,u6 with edge labels a, b, c, d]
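The pair (V,E) maps directly onto a small data structure. The sketch below is one possible encoding, assuming node identifiers and labels are strings and that every node has at least one edge; the names `Database`, `make_database` and `successors` are hypothetical. Only the two edges spelled out on the slide are used, since the rest of E is elided there.

```python
from typing import FrozenSet, NamedTuple, Tuple

Edge = Tuple[str, str, str]  # (source node, label, target node)

class Database(NamedTuple):
    nodes: FrozenSet[str]   # V, drawn from the object identifiers I
    edges: FrozenSet[Edge]  # E, with labels drawn from the constants D

def make_database(edges):
    """Build (V, E) from a list of labeled edges; V is read off the edges."""
    edge_set = frozenset(edges)
    node_set = frozenset(n for (s, _, t) in edge_set for n in (s, t))
    return Database(node_set, edge_set)

def successors(db, node, label):
    """Nodes reachable from `node` over a single edge carrying `label`."""
    return {t for (s, lab, t) in db.edges if s == node and lab == label}

# Only the edges listed explicitly on the slide.
db = make_database([("u1", "c", "u6"), ("u1", "a", "u5")])
```

Storing edges as triples keeps the representation close to the formal definition, at the cost of linear scans in `successors`; an adjacency index would be the natural next step for larger graphs.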
A StruQL0 Query
• Queries are allowed to include regular path expressions over the attributes
  • Give the ability to deal with lack of schema
  • R := ε | a | _ | L | (R1.R2) | (R1|R2) | R*
• Q1 : q1(X,Z) :- X L+ Z, Y a Z, X (a+|(a.b*)) Z, with path expressions R1 = L+, R2 = a, R3 = a+|(a.b*)
• The relation RQ(X,Y,Z,L) has arity 4
• RQ contains 4 tuples: {(u1,u1,u5,c), (u1,u1,u5,a), (u2,u1,u5,d), (u2,u2,u3,a)}
• Q(DB) is the projection of RQ on X and Z: {(u1,u5), (u2,u5), (u2,u3)}
[Figure: the example database over nodes u1,…,u6 with edge labels a, b, c, d]
A StruQL0 Query
• Q1 : q1(X,Z) :- X L+ Z, Y a Z, X (a+|(a.b*)) Z, with path expressions R1 = L+, R2 = a, R3 = a+|(a.b*)
• Formally:
  • Node variables range over the nodes in the graph
    • Denoted by capital letters
  • Arc variables range over the labels of edges in the graph
    • Denoted by L or Li
  • A regular path expression is defined by the grammar: R := ε | a | _ | L | (R1.R2) | (R1|R2) | R*
    • ε is the empty string
    • a is a label constant
    • _ denotes any label
    • L is a label variable
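Each production of the grammar has a direct reading as an operation on binary relations over nodes, which gives a compact way to evaluate a path expression on a graph. The following is a sketch under that reading, assuming expressions are encoded as nested tuples and arc variables are given a fixed binding; the encoding, the function names, and the small graph are all assumptions for illustration (the graph is not the one from the slides).

```python
def compose(r1, r2):
    """Relational composition: (x, z) whenever x r1 y and y r2 z."""
    return {(x, z) for (x, y1) in r1 for (y2, z) in r2 if y1 == y2}

def eval_path(expr, nodes, edges, binding):
    """All node pairs (x, z) joined by a path whose labels spell a word of expr."""
    kind = expr[0]
    if kind == "eps":                      # ε: the empty path
        return {(n, n) for n in nodes}
    if kind == "lab":                      # a: one edge with that label constant
        return {(s, t) for (s, lab, t) in edges if lab == expr[1]}
    if kind == "any":                      # _: one edge with any label
        return {(s, t) for (s, _, t) in edges}
    if kind == "var":                      # L: one edge with the label bound to L
        return {(s, t) for (s, lab, t) in edges if lab == binding[expr[1]]}
    if kind == "cat":                      # (R1.R2)
        return compose(eval_path(expr[1], nodes, edges, binding),
                       eval_path(expr[2], nodes, edges, binding))
    if kind == "alt":                      # (R1|R2)
        return (eval_path(expr[1], nodes, edges, binding)
                | eval_path(expr[2], nodes, edges, binding))
    if kind == "star":                     # R*: reflexive-transitive closure
        base = eval_path(expr[1], nodes, edges, binding)
        result = {(n, n) for n in nodes}
        while True:
            step = compose(result, base) - result
            if not step:
                return result
            result |= step
    raise ValueError(f"unknown expression kind: {kind}")

# Hypothetical graph, not the one drawn on the slides.
nodes = {"u1", "u2", "u3", "u4"}
edges = {("u1", "a", "u2"), ("u2", "a", "u3"), ("u2", "b", "u4")}
a = ("lab", "a")
a_plus = ("cat", a, ("star", a))                          # a+ written as a.a*
l_plus = ("cat", ("var", "L"), ("star", ("var", "L")))    # L+ written as L.L*
a_b_star = ("cat", a, ("star", ("lab", "b")))             # a.b*
```

An automaton-based evaluator would scale better, but the relational version keeps a one-to-one correspondence with the grammar, which is the point here.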
A StruQL0 Query - Components
• Q : q(X) :- Y1R1Z1,…, YnRnZn
• nvar(Q) ≡ {Y1,…,Yn,Z1,…,Zn} (node variables)
  • Need not be distinct
• Regular path expressions: {R1,…,Rn}
• avar(Q) ≡ the set of arc variables occurring in R1,…,Rn
• var(Q) ≡ nvar(Q) ∪ avar(Q)
• X is the tuple of head variables, drawn from var(Q)
• Atoms(Q) ≡ the set of constants occurring in R1,…,Rn
• YiRiZi, i=1,…,n, are the conjuncts
A StruQL0 Query - Semantics
• Q : q(X) :- Y1R1Z1,…, YnRnZn
• Semantics: a substitution is a function φ from var(Q) to I ∪ D
  • Node variables are mapped to I
  • Arc variables are mapped to D
• Denote φ : Q → DB
• φ(YiRiZi) is the path in DB corresponding to the conjunct (YiRiZi)
• Each substitution defines a tuple in the relation RQ
• The answer to Q is the projection of RQ on the head variables X
• The result of applying Q to a database is Q(DB)
A StruQL0 Query
• Notice the advantages for semi-structured data:
  • Regular path expressions
  • Arc variables
For example:
• Q2 : q2(X,Y) :- X L Y
  • Query for first degree relatives
  • L can be older brother, younger brother, son, wife, and maybe more (first wife? ex-wife?)
• Q3 : q3(X,Y) :- X ("son"|"daughter")+(ε|L) Y
  • Query for descendants and their relatives
[Figure: the family graph over Abraham, Isaac, Ishmael, Jacob, Rachel, Leah, with edges labeled son, wife, y. brother, o. brother]
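Containment
• A query Q1 is contained in a query Q2, written Q1 ⊆ Q2, if Q1(DB) ⊆ Q2(DB) for all databases DB
• The queries Q1 and Q2 are equivalent, written Q1 ≡ Q2, if Q1 ⊆ Q2 and Q2 ⊆ Q1
Example:
• Q1 : q1(X,Z) :- X L+ Z, Y a Z, X (a+|(a.b*)) Z
• Q2 : q2(X,Z) :- X a+ Z
• Q1(DB) = {(u1,u5), (u2,u5), (u2,u3)}
• Q2(DB) = {(u1,u5), (u2,u3)}
• The tuple (u2,u5) is in Q1(DB) but not in Q2(DB), so Q1 is not contained in Q2
[Figure: the example database over nodes u1,…,u6 with edge labels a, b, c, d]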
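On a single database, the definition reduces to a subset test on the two answer sets, and any tuple in the difference is a witness against containment. A minimal sketch using the answer sets quoted on the slide (the function name `witnesses` is an assumption):

```python
# Answer sets of Q1 and Q2 on the example database, as given on the slide.
q1_answers = {("u1", "u5"), ("u2", "u5"), ("u2", "u3")}
q2_answers = {("u1", "u5"), ("u2", "u3")}

def witnesses(answers_q, answers_q_prime):
    """Tuples answered by Q but not by Q'; non-empty means Q is not contained in Q'."""
    return answers_q - answers_q_prime
```

A non-empty result refutes containment outright. An empty result only says that this particular database is not a counterexample, since containment quantifies over all databases; this is why the later slides look for a finite family of canonical databases to test.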
Canonical Databases - Intuition
• A canonical database for Q is a pair (DB,ξ)
  • ξ is a substitution
  • A bifurcation node for each node variable
  • A corresponding internal path for each conjunct
• Q: q(X1,X2) :- X1 (a.L.(_)*) X2, X2 (b.c)* Y, X2 (a|L)* Z, Y (c|d) X1
• How many canonical databases for a query?
[Figure: a canonical database for Q; the bifurcation nodes X1, X2, Y, Z are joined by internal paths that run through internal nodes]
Canonical Databases - Formal definition
• Q: q(X1,X2) :- X1 (a.L.(_)*) X2, X2 (b.c)* Y, X2 (a|L)* Z, Y (c|d) X1
• Each internal node belongs to one internal path, with one outgoing and one incoming edge
• The mapping of node variables to bifurcation nodes is surjective
• Each arc variable L is mapped to itself
• For each conjunct YiRiZi, the path ξ(YiRiZi) is internal and the mapping is one to one
[Figure: a canonical database for Q; the bifurcation nodes X1, X2, Y, Z are joined by internal paths that run through internal nodes]
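One way to picture the definition is to pick a word for each conjunct and lay it out as a fresh path between the two endpoint variables. The sketch below does exactly that, under simplifying assumptions that go beyond the slide: distinct node variables are sent to distinct bifurcation nodes (the definition only asks for surjectivity), every chosen word is non-empty, and the code does not check that a word belongs to the language of its path expression. The function name and the chosen words are hypothetical.

```python
from itertools import count

def canonical_database(conjuncts):
    """Build edges of a canonical-style database.

    conjuncts: list of (Y, word, Z), where word is a non-empty list of labels
    chosen for the conjunct Y R Z. Arc variables appear in words as themselves.
    Returns (edges, internal_nodes).
    """
    fresh = count(1)
    edges, internal = set(), set()
    for source, word, target in conjuncts:
        if not word:
            raise ValueError("this sketch assumes a non-empty word per conjunct")
        current = source
        for position, label in enumerate(word):
            if position == len(word) - 1:
                following = target               # last edge reaches the bifurcation node
            else:
                following = f"n{next(fresh)}"    # fresh internal node
                internal.add(following)
            edges.add((current, label, following))
            current = following
    return edges, internal

# One hypothetical choice of words for the query Q on the slide.
chosen = [
    ("X1", ["a", "L", "e"], "X2"),   # a.L.(_)* with one label for _
    ("X2", ["b", "c"], "Y"),         # (b.c)* unfolded once
    ("X2", ["a", "L"], "Z"),         # (a|L)* unfolded twice
    ("Y", ["d"], "X1"),              # (c|d)
]
edges, internal = canonical_database(chosen)
```

Because each star can be unfolded any number of times, different word choices give different canonical databases, which is why a single query has infinitely many of them.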
Semantic Criteria for Query Containment
• Query Q has head variables X1,…,Xn, and canonical database (DB, ξ)
• (ξ(X1),…,ξ(Xn)) is the canonical tuple
Proposition 1
• Given two queries Q, Q’: Q is contained in Q’ if and only if, for any canonical database (DB, ξ) for Q, its canonical tuple is in the answer of Q’
Proposition 1: Proof
• If Q is contained in Q’ then for any canonical database (DB, ξ) for Q, its canonical tuple is in the answer of Q’
• StruQL0 queries are generic: if Q is contained in Q’ for databases over the universe D, then it is also contained in Q’ for databases over D’, where D’ ≡ D ∪ avar(Q)
• (DB, ξ) contains constants in D, with the addition of the arc variables of Q, so it is a database over D’
• If Q is contained in Q’ then, for each DB over D, every tuple of Q(DB) is in Q’(DB)
• By genericity, the same holds over D’; the canonical tuple is in the answer of Q on the canonical database (via ξ), so it is also in the answer of Q’
Proposition 1: Proof
• If Q is contained in Q’ then for any canonical database (DB, ξ) for Q, its canonical tuple is in the answer of Q’
Example:
• Q1 : q1(X,Z) :- X L+ Z, Y a Z, X (a+|(a.b*)) Z
• Q2 : q2(X,Z) :- X a+ Z
• Q1(DB) = {(u1,u5), (u2,u5), (u2,u3)}
• Q2(DB) = {(u1,u5), (u2,u3)}
• D’ ≡ D ∪ avar(Q) = {a,b,c,d,L}
• Canonical database for Q2:
[Figure: the example database over nodes u1,…,u6, next to a canonical database drawn over the nodes X and Z with edge labels a and L]
Proposition 1: Proof
• If for any canonical database (DB, ξ) for Q, its canonical tuple is in the answer of Q’, then Q is contained in Q’
• Assume the contrary: Q is not contained in Q’
• There exists some database DB and some tuple of nodes and/or label constants u=(u1,…,uk) in DB, such that u is in Q(DB) but not in Q’(DB)
• We will construct a canonical database which will contradict the assumption
Proposition 1: Proof
• There exists a substitution φ : Q → DB so that φ(X)=u
• We construct (DB0,ξ)
• The bifurcation nodes are {φ(X) | X is in nvar(Q)}
• Define ξ(X) = φ(X) for all X in nvar(Q)
  • So the mapping of node variables is the same in both databases
• For each conjunct YRZ we consider the path φ(YRZ) in DB
  • This path is not necessarily simple
  • It may contain bifurcation nodes
  • This is because DB is not canonical
Example:
• Q: q(X1,X3) :- X1 a X2, X2 b X3, X1 (L.L.L) X3
• φ(X1,X2,X3,L) = (A,B,C,a)
• Bifurcation nodes in DB0: A, B, C
[Figure: the database DB over nodes A, B, C with edges labeled a and b]
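Proposition 1: Proof
• Introduce a fresh internal node for every occurrence of a node on the path φ(YRZ)
• This results in a simple path
• In the example: Q: q(X1,X3) :- X1 a X2, X2 b X3, X1 (L.L.L) X3
[Figure: DB over nodes A, B, C before and after the fresh internal nodes are added on the path for L.L.L; all labels on that path are still a]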
Proposition 1: Proof
• Now replace some labels:
  • Let A be some non-deterministic automaton equivalent to R, where arc variables are viewed as constants
  • By definition, the labels on ξ(YRZ) are accepted by A
  • Replace each label that causes a transition on an arc variable L in the run of A on ξ(YRZ) with that arc variable L
• In the example: Q: q(X1,X3) :- X1 a X2, X2 b X3, X1 (L.L.L) X3
[Figure: the path for L.L.L from A to C before and after relabeling; its three labels a become L, L, L]
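Proposition 1: Proof
Example:
• Q: q(X1,X3) :- X1 a X2, X2 b X3, X1 (L.L.L) X3
• φ(X1,X2,X3,L) = (A,B,C,a)
• Bifurcation nodes in DB0: A, B, C
• Mappings in the picture: φ’: Q’ → DB0 and ψ: DB0 → DB
[Figure: DB and DB0 side by side over nodes A, B, C; in DB0 the path for L.L.L carries the labels L, L, L]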
Proposition 1: Proof
• DB0 is a canonical database
• We have a graph morphism ψ: DB0 → DB
  • Bifurcation nodes are sent to themselves
  • Internal nodes are sent to their originating nodes
• We assumed that Q is not contained in Q’ even though the canonical tuple for (DB0,ξ) is in the answer of Q’ on DB0
• So we must have a substitution φ’ : Q’ → DB0
• Compose φ’ with ψ and get a substitution φ’ ∘ ψ : Q’ → DB0 → DB
• This implies that u is in the answer of Q’ too, contradicting the assumption □
Decidability of containment
• We still have an infinite number of canonical databases:
  • The internal paths can be of any length
    • Q: q(X,Y) :- X L* Y
  • The number of substitutions can be infinite
    • Q: q(X,Y) :- X _ Y
• It is sufficient to examine only databases whose internal paths are no longer than some N which depends only on Q and Q’
• Only a set of n × N constants is sufficient, with N from above and n the number of conjuncts in Q. (Only the constants in DQ,Q’ ∪ avar(Q))
• The resulting algorithm for containment is of triple exponential space
  • But it shows decidability
Path Length is bounded
• Remember Ai is the non-deterministic automaton equivalent to Ri, for each conjunct YiRiZi
• The path between ξ(Y) and ξ(Z) represents a run of Ai
• Its length is bounded by N = |nvar(Q)| × |states(Ai)| + 2
  • If a variable appears in the path ξ(YiRiZi) more than |states(Ai)| times, it can be cut short, and still satisfy Ri
• Q: q(X1,X3) :- X1 a X2, X2 b X3, X1 L+ X3
• We must check all the runs of automata in Q’ on paths in the canonical DB of Q
• Proof in Appendix A of the full version of the paper
[Figure: canonical database over nodes A, B, C with edges a, b and a path labeled L, L, L]
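The bound is simple arithmetic once the number of node variables and automaton states is known. A small sketch for the example query, assuming a two-state automaton for L+ (a start state and an accepting state with a loop); that state count and the function name are assumptions, not values from the slides.

```python
def path_length_bound(node_vars, automaton_states):
    """N = |nvar(Q)| * |states(Ai)| + 2, as stated on the slide."""
    return len(set(node_vars)) * automaton_states + 2

# Q: q(X1,X3) :- X1 a X2, X2 b X3, X1 L+ X3 has node variables X1, X2, X3.
bound_for_l_plus = path_length_bound(["X1", "X2", "X3"], automaton_states=2)
```

Under these assumptions the internal path for the conjunct X1 L+ X3 never needs more than 3 × 2 + 2 = 8 points, so only finitely many unfoldings of L+ have to be examined.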
Containment by Query mapping
• A query mapping f: Q’ → DB sends conjuncts in Q’ to some path in the canonical database of Q
  • There exist only finitely many mappings
  • They can be encoded in polynomial space
• A query mapping f: Q’ → Q can ‘cover’ a canonical database DB for Q
  • All query mappings together cover all canonical databases
• All canonical DBs for a query can be described in a regular language WQ
• For each mapping f, there is a regular expression for all databases covered by it, Wf
  • Exponential space
[Figure: f maps the conjunct Y’ A’ Z’ of Q’ to a path of points p1, p2, p3, …, pn-1, pn in Q]
Simple StruQL0 queries
• The regular expressions Ri in Q are of the form r1.r2...rn, where each ri is either * or a label constant (a standalone * is read here as _*, any sequence of labels)
• Examples:
  • a.*.b.* and *.*.a.a.* are simple regular expressions
  • a*.b or _._ are not simple regular expressions
• Given two regular expressions, their containment can be checked in polynomial space
• The containment problem of two simple queries is NP-complete
  • By reduction to conjunctive queries
• First subset including recursion for which containment decision is no harder than for conjunctive queries
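The syntactic restriction is easy to check mechanically, and matching a word against a simple expression is a short dynamic program. The sketch below assumes expressions are written as dot-separated strings, label constants are alphanumeric, and a standalone `*` matches any (possibly empty) sequence of labels; the function names are hypothetical, and the code does not implement the containment test itself.

```python
def is_simple(expr):
    """True if expr has the form r1.r2...rn with each ri either '*' or a label."""
    return all(part == "*" or part.isalnum() for part in expr.split("."))

def matches(expr, word):
    """True if the list of labels `word` is in the language of the simple expr."""
    # reachable[j] is True when the parts read so far can match word[:j]
    reachable = [j == 0 for j in range(len(word) + 1)]
    for part in expr.split("."):
        if part == "*":
            # '*' can absorb any number of labels, so every later position opens up
            seen, following = False, []
            for flag in reachable:
                seen = seen or flag
                following.append(seen)
        else:
            # a label constant consumes exactly one matching label
            following = [False] + [reachable[j] and word[j] == part
                                   for j in range(len(word))]
        reachable = following
    return reachable[-1]
```

For instance, `is_simple("a.*.b.*")` holds while `is_simple("a*.b")` and `is_simple("_._")` do not, matching the examples on the slide. The star applies only to the wildcard, never to a specific label, which is what keeps these expressions close to ordinary conjunctive queries.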
Summary
• StruQL0: conjunctive queries with regular expressions
• Canonical databases: semantic criteria for query containment
  • Containment is decidable
  • But in triple exponential space
• Query mappings: syntactic criteria for query containment
  • Exponential space
• Simple StruQL0 queries: a subset for which containment is NP-complete
Backup slides: Containment by Query Mapping
• We will show that Q is contained in Q’ iff a certain condition holds on all query mappings from Q’ to Q
• Q is a query with n conjuncts: Q: q(X) :- Y1R1Z1,…, YnRnZn
• nvar(Q) = {Y1,Z1,…,Yn,Zn}
• Ai is a fixed non-deterministic automaton for each regular expression Ri
• A point in Q is either
  • A node variable (variable-point)
  • A pair (Ai,s) where s is a state in Ai (automaton-point)
• points(Q) is the set of points in Q
Canonical DB and query points
• Nodes in a canonical database DB for Q correspond to points in Q
• Several internal nodes in DB may correspond to the same automaton-point
• Bifurcation nodes in DB correspond both to variable-points and to automaton-points (Ai,s) where s is an initial or terminal state
Path in a Query
• Given a query Q, a path of points in Q is a sequence p1,…,pn, n≥2
• p2,…,pn-1 are all variable-points (p1, pn can be automaton-points)
• Any two adjacent points are connected in Q:
  • If pj, pj+1 are variable-points
    • there is a conjunct YiRiZi in Q with pj=Yi and pj+1=Zi
  • If p1 is an automaton-point (p2 is a variable-point)
    • there exists a conjunct YiRiZi in Q so that Ai is the automaton associated with Ri, and p2 = Zi
  • If pn is an automaton-point (pn-1 is a variable-point)
    • there exists a conjunct YiRiZi in Q so that Ai is the automaton associated with Ri, and pn-1 = Yi
  • If n=2, and both p1 and p2 are automaton-points
    • they refer to the same automaton
Canonical DB and query path
• Let U = u1,u2,…,um be a path in a canonical database DB for Q
• Suppose we drop all internal nodes from u2,…,um-1
• Let u1 = u_i1, u_i2, …, u_i(n-1), u_in = um be the resulting subsequence
• We say that U corresponds to the path of points p1,…,pn iff each u_ik corresponds to pk, for k=1,…,n
• Paths of points rephrase paths in canonical databases
Query mapping
• Consider some other query Q’
• Ai’ is a nondeterministic automaton for each Ri’ in Q’
• Let X, X’ be head variables in Q, Q’ respectively
• A query mapping f: Q’ → Q consists of:
  • Two mappings, f: nvar(Q’) → points(Q) and f: avar(Q’) → DQ,Q’ ∪ avar(Q), so that f(X’) = X
  • A mapping from conjuncts Yi’Ri’Zi’ in Q’ to paths of points in Q, f(Yi’Ri’Zi’) = p1,…,pn, so that n ≤ |nvar(Q)| × |states(Ai)| + 2, f(Yi’) = p1 and f(Zi’) = pn
  • For each conjunct YiRiZi in Q, a total preorder on those variables Z’ in nvar(Q’) for which f(Z’) is an automaton-point corresponding to Ai
    • Whenever X’≤Y’ and Y’≤X’ then f(X’)=f(Y’)
Query mapping
• For some canonical database (DB,ξ), a substitution φ: Q’ → DB is canonical if φ(X’) is the canonical tuple in DB
Condition 1:
• A substitution now sends conjuncts in Q’ to some path in the canonical database, instead of sending node variables to nodes and arc variables to arcs
[Figure: f maps the conjunct Y’ A’ Z’ of Q’ to a path of points p1, p2, p3, …, pn-1, pn in Q]
Path Length is bounded
Condition 2:
• The path of points p1,…,pn may have cycles
• Its length is bounded by |nvar(Q)| × |states(Ai)| + 2
• If a variable appears in the path f(Y’R’Z’) more than |states(Ai)| times, it can be cut short, and still satisfy R’
[Figure: f maps the conjunct Y’ A’ Z’ of Q’ to a path of points p1, p2, p3, …, pn-1, pn in Q]
Preorder
Condition 3:
• The preorder defines:
  • Equivalence classes on the variables (X’≤Y’ and Y’≤X’ ⇒ X’≡Y’)
  • A total order on the equivalence classes
• The query mapping imposes such an order on all variables sent by f to points on the same automaton: (A,s1), (A,s2), (A,s3), …
Substitutions and mappings
• A canonical substitution φ: Q’ → DB corresponds to a query mapping f: Q’ → Q if:
  • For each conjunct Y’R’Z’ in Q’ the path φ(Y’R’Z’) corresponds to the path of points f(Y’R’Z’)
  • For any internal path in DB corresponding to YRZ, the preorder on all variables mapped by φ onto that path coincides with the preorder given by f
• There is always a query mapping between two queries
• For given Q, Q’, there exist only finitely many mappings
• Each mapping can be encoded in polynomial space
Containment
• A query mapping f: Q’ → Q covers a canonical database DB for Q, if there is some canonical substitution φ: Q’ → DB which corresponds to f
  • Some query mappings don’t cover any canonical database
• Q is contained in Q’ iff all query mappings together cover all canonical databases
• All canonical databases for a query can be described in a regular language WQ
• For each mapping f, there is a regular expression for all databases covered by it, Wf
• This can be computed in exponential space
The connection between the syntactic and semantic criteria
• All query mappings together cover all canonical databases
• If a query mapping covers a canonical database for Q, then the canonical tuple in the database is in the answer of Q’
  • This is implied by the definitions of canonical substitution, of correspondence between a mapping and a substitution, and of “covering” a database
• Both criteria (syntactic and semantic) rely on Proposition 1, but present different algorithms to check containment of two queries
Known results for regular expressions
• Containment of regular expressions is PSPACE-complete
  • L.J. Stockmeyer and A.R. Meyer. Word problems requiring exponential time. In 5th STOC, pages 1-9. ACM, 1973.
• Containment of simple regular expressions is in PTIME
  • Tova Milo and Dan Suciu. Index structures for path expressions. In 7th ICDT, pages 277-295. Springer-Verlag, 1999.