Efficient Algorithm For Processing XPath Queries
Authors: Georg Gottlob, Christoph Koch, Reinhard Pichler
Structure
• Introduction
• Experimental results for existing XPath processors
• Basic notation
• XPath axes
• Semantics of XPath
• Bottom-up algorithm
• Top-down algorithm
• Linear-time XPath fragments
• Conclusion
Introduction
• Claim: current implementations of XPath processors do not live up to their potential.
• The way XPath is defined in the W3C XPath recommendation invites an inefficient, exponential-time implementation.
• The paper proposes more efficient, polynomial-time algorithms.
State of the art of XPath systems
• Apache XALAN
• James Clark’s XT
• Microsoft Internet Explorer 6
• Test documents: <a><b/>…<b/></a>, where <b/> occurs n times, so the tree contains n+1 nodes
Basic query evaluation strategy of XALAN and XT
procedure process-location-step(n0, Q)
/* n0 is the context node; query Q is a list of location steps */
begin
  node set S := apply Q.first to node n0;
  if (Q.tail is not empty) then
    for each node n ∈ S do process-location-step(n, Q.tail);
end
Running time: each location step can select up to |D| nodes, and the procedure recurses on every one of them, so
Time(|Q|) = |D| · Time(|Q| − 1) if |Q| > 0, and Time(|Q|) = 1 if |Q| = 0,
which solves to Time(|Q|) = |D|^|Q|.
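The blow-up can be reproduced with a small simulation. The following is a minimal sketch, not the code of XALAN or XT: the `Node` class, `make_doc`, `apply_step` and `count_calls` are hypothetical helpers, only the `child` and `parent` axes are modeled, and the context node is the `a` element itself (standing in for the leading `//a`). It counts how often the procedure is entered for the query `b/parent::a/b…/parent::a/b` on a document with two `b` children.

```python
class Node:
    """Hypothetical minimal tree node: a label, a parent and a child list."""
    def __init__(self, label, parent=None):
        self.label = label
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)


def make_doc(n):
    """Build <a><b/>...<b/></a> with n b-children and return the a node."""
    a = Node("a")
    for _ in range(n):
        Node("b", a)
    return a


def apply_step(step, node):
    """Apply one location step (axis, label) to a single context node."""
    axis, label = step
    if axis == "child":
        candidates = node.children
    elif axis == "parent":
        candidates = [node.parent] if node.parent is not None else []
    else:
        raise ValueError("axis not modeled in this sketch: " + axis)
    return [c for c in candidates if c.label == label]


def process_location_step(n0, steps, counter):
    """Naive strategy from the slide: recurse separately on every selected node."""
    counter[0] += 1
    selected = apply_step(steps[0], n0)
    if len(steps) > 1:
        for n in selected:
            process_location_step(n, steps[1:], counter)


def count_calls(i, n=2):
    """Number of procedure calls for b followed by i-1 times /parent::a/b."""
    doc = make_doc(n)
    steps = [("child", "b")] + [("parent", "a"), ("child", "b")] * (i - 1)
    counter = [0]
    process_location_step(doc, steps, counter)
    return counter[0]


CALLS = [count_calls(i) for i in range(1, 6)]
```

In this sketch the call count roughly doubles each time the query grows by one `/parent::a/b`, even though the document has only three nodes and the result set never changes.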
Experiment 1
• Fixed document size, varying queries.
• Query i is //a/b followed by i−1 repetitions of /parent::a/b: //a/b/parent::a/b…/parent::a/b
Experiment 1 Exponential-time query complexity of XT and XALAN
Experiment 2
• Paths combined with arithmetic, using IE6
• Queries: //a/b[count(parent::a/b)>1], then //a/b[count(parent::a/b[count(parent::a/b)>1])>1], and so on with deeper nesting
• Run for four document sizes (2, 3, 10, 200)
Experiment 2 Exponential-time query complexity of IE6 for document sizes 2, 3, 10, and 200
Experiment 3
• Fixed query, varying document size, using IE6.
• Query: ‘//a’ + q(20) + ‘//b’, where
  q(i) := ‘//b[ancestor::a’ + q(i-1) + ‘//b]/ancestor::a’ if i > 0
  q(i) := ‘’ if i = 0
• Example: ‘//a’ + q(2) + ‘//b’ is
  //a//b[ancestor::a//b[ancestor::a//b]/ancestor::a//b]/ancestor::a//b
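The recursive query definition translates directly into a string builder. This is only an illustration of the definition of q(i); the function names `q` and `experiment3_query` are made up for this sketch.

```python
def q(i):
    """q(i) from Experiment 3: i nested ancestor/descendant round trips."""
    if i == 0:
        return ""
    return "//b[ancestor::a" + q(i - 1) + "//b]/ancestor::a"


def experiment3_query(i):
    """Full query '//a' + q(i) + '//b'; the experiment uses i = 20."""
    return "//a" + q(i) + "//b"


EXAMPLE = experiment3_query(2)
```

For i = 2 this reproduces the example query given on the slide.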
Experiment 3 Quadratic-time data complexity of IE6. f′ and f″ are the first and second derivatives, respectively, of the graph of timings f
Basic Notation (1)
• The paper uses a simplified XML data model.
• An XML document is an unranked, ordered, labeled tree.
• Let dom be the set of all nodes in the tree.
• firstchild: dom → dom returns the first child of a node (null if there is none).
• nextsibling: dom → dom returns the neighboring sibling to the right (null if there is none).
Basic Notation (2)
• firstchild⁻¹ and nextsibling⁻¹ are the inverse functions.
• Each function f is treated as a binary relation: {<x, f(x)> | x ∈ dom, f(x) ≠ null}
XPath Axes (1)
• Axes are binary relations Ҳ ⊆ dom × dom
• R1.R2 (concatenation)
• R1 ∪ R2 (union)
• R1* (reflexive and transitive closure)
• Axes are defined in terms of “firstchild”, “nextsibling”, and their inverses.
XPath Axes (2)
• Definition: Let Ҳ denote an XPath axis relation. Define the function Ҳ : 2^dom → 2^dom as Ҳ(X0) = {x | ∃x0 ∈ X0 : x0 Ҳ x}, where X0 ⊆ dom is a set of nodes.
XPath Axes (3)
• Algorithm (Axis Evaluation)
Input: a set of nodes S and an axis Ҳ
Output: Ҳ(S)
Method: eval_Ҳ(S)
function eval_(R1 ∪ … ∪ Rn)*(S)
begin
  S′ := S; /* S′ is represented as a list */
  while there is a next element x in S′ do
    append {Ri(x) | 1 ≤ i ≤ n, Ri(x) ≠ null, Ri(x) ∉ S′} to S′;
  return S′;
end;
XPath Axes (4)
function eval_Ҳ(S) := eval_E(Ҳ)(S)
function eval_self(S) := S
function eval_e1.e2(S) := eval_e2(eval_e1(S))
function eval_R(S) := {R(x) | x ∈ S}
function eval_Ҳ1∪Ҳ2(S) := eval_Ҳ1(S) ∪ eval_Ҳ2(S)
• E(Ҳ): the regular expression that defines axis Ҳ
• e1, e2: regular expressions
• R, R1, …, Rn: primitive relations
XPath Axes (5)
• Each axis is evaluated in O(|dom|) time.
• Each eval function visits each node at most once.
• The number of calls to eval functions and the number of relations joined by union are constant.
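The axis evaluation scheme can be sketched in a few lines. The code below is a hedged illustration, not the authors' implementation: primitive relations are modeled as Python dicts, the function names are hypothetical, and the two regular expressions used (child as firstchild.nextsibling*, descendant as firstchild.(firstchild ∪ nextsibling)*) are the usual way to express these axes over the primitives and are not spelled out on the slides. The sample tree is made up.

```python
def eval_primitive(relation, nodes):
    """eval_R(S): apply one primitive relation (a dict) to every node of S."""
    return [relation[x] for x in nodes if x in relation]


def eval_closure(relations, nodes):
    """eval_(R1 ∪ ... ∪ Rn)*(S): worklist closure over the given relations.

    Every node is appended to the list at most once, so the loop body runs
    at most |dom| times.
    """
    result = list(dict.fromkeys(nodes))  # drop duplicates, keep order
    seen = set(result)
    i = 0
    while i < len(result):
        x = result[i]
        i += 1
        for relation in relations:
            y = relation.get(x)
            if y is not None and y not in seen:
                seen.add(y)
                result.append(y)
    return result


def child(firstchild, nextsibling, nodes):
    """child := firstchild.nextsibling* (assumed definition, see lead-in)."""
    return eval_closure([nextsibling], eval_primitive(firstchild, nodes))


def descendant(firstchild, nextsibling, nodes):
    """descendant := firstchild.(firstchild ∪ nextsibling)* (assumed definition)."""
    return eval_closure([firstchild, nextsibling],
                        eval_primitive(firstchild, nodes))


# Made-up document: a has children b1, b2, b3; b2 has a single child c.
FIRSTCHILD = {"a": "b1", "b2": "c"}
NEXTSIBLING = {"b1": "b2", "b2": "b3"}

CHILDREN_OF_A = child(FIRSTCHILD, NEXTSIBLING, ["a"])
DESCENDANTS_OF_A = descendant(FIRSTCHILD, NEXTSIBLING, ["a"])
```

The `seen` set plays the role of the membership test Ri(x) ∉ S′ in the pseudocode and keeps that test constant-time.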
Semantics of XPath (1)
• The main structural feature of XPath is the expression. Every expression has one of four types: node set, number, string, or boolean.
• Each expression is evaluated relative to a context c = <x, k, n>, consisting of a context node x, a context position k, and a context size n.
Semantics of XPath (2) • Definition (Semantics of XPath): Each XPath expression returns a value of one of the following four types: node set, number, string, or boolean.
Semantics of XPath (3)
• Semantics [e] : C → T, where C is the domain of contexts, e is an XPath expression, and T is the type of e
• [п](<x,k,n>) := P[п](x)
• [position()](<x,k,n>) := k
• [last()](<x,k,n>) := n
• (where п is a location path)
Semantics of XPath (4)
The function P
P[Ҳ::t[e1]…[em]](x) :=
begin
  S := {y | x Ҳ y, y ∈ T(t)};
  for 1 ≤ i ≤ m (in ascending order) do
    S := {y ∈ S | [ei](<y, idx(y,S), |S|>) = true};
  return S;
end;
P[п1|п2](x) := P[п1](x) ∪ P[п2](x)
P[/п](x) := P[п](root)
P[п1/п2](x) := ∪_{y ∈ P[п1](x)} P[п2](y)
where idx(y,S) is the index of y in S
Bottom-up evaluation of XPath
• The key to an XPath evaluation algorithm with polynomial-time complexity is the notion of a context-value table, i.e. a relation for each expression.
• The context-value table of an expression e lists all valid combinations of a context c and a value v.
• The size of each table has a polynomial bound, and each combination step can be carried out in polynomial time.
Bottom-up evaluation of XPath • Expression types and associated relations
Bottom-up evaluation of XPath
• Definition (Semantics): E↑ : Expression → nset ∪ num ∪ str ∪ bool
• Let e be an arbitrary XPath expression. Then, for context node x, position k, and size n, the value of e is v, where v is the unique value such that <x,k,n,v> ∈ E↑[e]
Bottom-up evaluation of XPath • Expression relations for location paths, position(), and last()
Bottom-up evaluation of XPath
Query Evaluation Algorithm (Bottom-up algorithm for XPath)
Input: an XPath query Q
Output: E↑[Q]
Method:
  let Tree(Q) be the parse tree of query Q;
  R := Ø;
  for each atomic expression l ∈ leaves(Tree(Q)) do
    compute table E↑[l] and add it to R;
  while E↑[root(Tree(Q))] ∉ R do
  begin
    take an Op(l1,…,ln) ∈ nodes(Tree(Q)) s.t. E↑[l1],…,E↑[ln] ∈ R;
    compute E↑[Op(l1,…,ln)] using E↑[l1],…,E↑[ln];
    add E↑[Op(l1,…,ln)] to R;
  end;
  return E↑[root(Tree(Q))]
Bottom-up evaluation of XPath
• Example
• dom = {a, b1, … , b4}
• XPath query: descendant::b/following-sibling::*[position() != last()]
• Parse tree
  • N1: descendant::b/N2
  • N2: following-sibling::*[N3]
  • N3: N4 != N5
  • N4: position()
  • N5: last()
Bottom-up evaluation of XPath • [Figure: context-value tables for the example query]
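The example can be replayed with explicit context-value tables. The sketch below is an illustration under assumptions rather than the algorithm as implemented by the authors: it assumes that a is the parent of b1, …, b4 (the slide only lists dom), stores node-set tables per context node only, and hard-codes the two axes the query needs. All names are hypothetical.

```python
# Assumed document shape: <a><b/><b/><b/><b/></a>, nodes listed in document order.
CHILDREN = {"a": ["b1", "b2", "b3", "b4"], "b1": [], "b2": [], "b3": [], "b4": []}
LABEL = {"a": "a", "b1": "b", "b2": "b", "b3": "b", "b4": "b"}
DOM = list(CHILDREN)


def following_siblings(x):
    """Siblings to the right of x, in document order."""
    for siblings in CHILDREN.values():
        if x in siblings:
            return siblings[siblings.index(x) + 1:]
    return []


def descendants(x):
    """All descendants of x, in document order."""
    out = []
    for c in CHILDREN[x]:
        out.append(c)
        out.extend(descendants(c))
    return out


# All contexts <x, k, n> with 1 <= k <= n <= |dom|.
CONTEXTS = [(x, k, n) for x in DOM
            for n in range(1, len(DOM) + 1) for k in range(1, n + 1)]

# Leaves N4 and N5: tables for position() and last().
TABLE_N4 = {c: c[1] for c in CONTEXTS}
TABLE_N5 = {c: c[2] for c in CONTEXTS}

# N3: N4 != N5, combined row by row from the two child tables.
TABLE_N3 = {c: TABLE_N4[c] != TABLE_N5[c] for c in CONTEXTS}

# N2: following-sibling::*[N3]; the predicate is looked up, never recomputed.
TABLE_N2 = {}
for x in DOM:
    step = following_siblings(x)
    TABLE_N2[x] = [y for k, y in enumerate(step, start=1)
                   if TABLE_N3[(y, k, len(step))]]

# N1: descendant::b/N2, the union of the N2 rows of all b-descendants.
TABLE_N1 = {
    x: sorted({z for y in descendants(x) if LABEL[y] == "b" for z in TABLE_N2[y]})
    for x in DOM
}
```

Each table is filled exactly once from the tables of its children in the parse tree, which is the reason the bottom-up strategy avoids the repeated work of the naive recursive evaluation.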
Bottom-up evaluation of XPath
• O(|Q|) relations are created
• Space
  • bool: O(|D|^3), nset: O(|D|^4), string/number: O(|D|^4·|Q|) per relation
  • Overall space bound: O(|D|^4·|Q|^2)
• Time
  • Each relation is computed in O(|D|^5·|Q|) time, and there are O(|Q|) relations, so a query takes O(|D|^5·|Q|^2) time overall
Top-down evaluation of XPath
• Based on vector computation, which avoids computing the large number of irrelevant results
• Introduces an auxiliary semantics definition S↓
• S↓[п](X1,…,Xk) = <Y1,…,Yk>
• Given a location path п and a list <X1,…,Xk> of node sets, S↓ determines a list <Y1,…,Yk> of node sets such that, for every i, Yi contains exactly the nodes reachable from the context nodes in Xi via п.
• A node y is in Yi iff there is an x ∈ Xi and some p, s such that <x,p,s,y> ∈ Е↑[п]
Top-down evaluation of XPath
• Е↓[e](c1,…,cl) := <r1,…,rl>
• Е↓[п](c1,…,cl) := S↓[п](ss<>(proj1<>(c1,…,cl)))
• Е↓[position()](<x1,k1,n1>,…,<xl,kl,nl>) := <k1,…,kl>
• Е↓[last()](<x1,k1,n1>,…,<xl,kl,nl>) := <n1,…,nl>
• Е↓[Op(e1,…,em)](c1,…,cl) := F[Op]<>(Е↓[e1](c1,…,cl),…,Е↓[em](c1,…,cl))
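The vector style of these definitions can be illustrated on the predicate position() != last(). This is a minimal sketch covering only position(), last() and operators; location paths are left out, and the names `e_position`, `e_last` and `e_op` are invented here.

```python
import operator


def e_position(contexts):
    """E↓[position()]: one result per context, the context position k."""
    return [k for (_x, k, _n) in contexts]


def e_last(contexts):
    """E↓[last()]: one result per context, the context size n."""
    return [n for (_x, _k, n) in contexts]


def e_op(op, *subexpressions):
    """E↓[Op(e1,…,em)]: evaluate every ei once on the whole context list,
    then apply Op component-wise (the role of F[Op]<>)."""
    def evaluate(contexts):
        columns = [e(contexts) for e in subexpressions]
        return [op(*args) for args in zip(*columns)]
    return evaluate


# position() != last(), evaluated for three contexts in a single pass.
not_last = e_op(operator.ne, e_position, e_last)
CONTEXTS = [("b2", 1, 3), ("b3", 2, 3), ("b4", 3, 3)]
RESULT = not_last(CONTEXTS)
```

Each subexpression is evaluated once for the whole list of contexts that actually occur, rather than once per context or for every possible context.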
Top-down evaluation of XPath
S↓[Ҳ::t[e1]…[em]](X1,…,Xk) :=
begin
  S := {<x,y> | x ∈ X1 ∪ … ∪ Xk, x Ҳ y and y ∈ T(t)};
  for each 1 ≤ i ≤ m (in ascending order) do
  begin
    fix some order S→ = <<x1,y1>,…,<xl,yl>> for S;
    <r1,…,rl> := Е↓[ei](t1,…,tl)
      where tj = <yj, idx(yj,Sj), |Sj|> and Sj := {z | <xj,z> ∈ S};
    S := {<xj,yj> | rj is true};
  end;
  for each 1 ≤ i ≤ k do
    Ri := {y | <x,y> ∈ S, x ∈ Xi};
  return <R1,…,Rk>;
end;
S↓[/п](X1,…,Xk) := S↓[п]({root},…,{root})
S↓[п1/п2](X1,…,Xk) := S↓[п2](S↓[п1](X1,…,Xk))
S↓[п1|п2](X1,…,Xk) := S↓[п1](X1,…,Xk) ∪<> S↓[п2](X1,…,Xk)
Top-down evaluation of XPath
• Example
• Query: /descendant::a[count(descendant::b/child::c) + position() < last()]/child::d
• List L = <<y1,1,l>,…,<yl,l,l>>, where the yi are the nodes labeled ‘a’ that are reachable from the root via the descendant axis.
• Top-down evaluation
  • S↓[child::d](S↓[descendant::a[e]]({root}))
  • Е↓[e](L) := count(п) + Е↓[position()](L) < Е↓[last()](L), where п = S↓[child::c](S↓[descendant::b](ss(proj1(L))))
Top-down evaluation of XPath • The functional implementation of Е↓ evaluates XPath queries in polynomial time (combined complexity), since the recursions in the definitions of S↓ and Е↓ correspond to recursive function calls of the respective evaluation functions.
Linear-time fragments of XPath
• Core XPath
  • A fragment of XPath
  • Manipulates only sets of nodes (no arithmetic or string operations)
  • Supports condition predicates (‘exist’) and the boolean operators ‘and’, ‘or’, and ‘not’
  • Each query is mapped to a simple algebra over the set operations (∩, ∪, −, Ҳ) and an operation dom/root(S) := {x ∈ dom | root ∈ S} (that is, dom if root ∈ S and ∅ otherwise), plus the inverses of the axes.
Linear-time fragments of XPath • Example: • /descendant::a/child::b[child::c/child::d or not(following::*)] • Query tree
Linear-time fragments of XPath
• Core XPath queries can be evaluated in O(|D|·|Q|) time
• Rewriting the query into an algebraic expression E takes O(|Q|) time, and E contains O(|Q|) operations
• Each operation takes O(|D|) time
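The set-algebra view can be tried out on the example query /descendant::a/child::b[child::c/child::d or not(following::*)]. The sketch below is an assumption-laden illustration, not the paper's translation procedure: the document, the array encoding (nodes numbered in document order, node 0 standing for the root) and all function names are made up, and the algebraic expression is written by hand. Predicates are evaluated backwards with inverse axes: `parent_of` is the inverse of child, and `preceding` is the inverse of following.

```python
# Made-up document, nodes numbered in document order (preorder):
# 0:#root  1:a  2:b  3:c  4:d  5:b  6:a  7:b
LABELS = ["#root", "a", "b", "c", "d", "b", "a", "b"]
PARENT = [None, 0, 1, 2, 3, 1, 0, 6]
DOM = set(range(len(LABELS)))


def T(label):
    """Node test: all nodes carrying the given label."""
    return {x for x in DOM if LABELS[x] == label}


def child(nodes):
    return {y for y in DOM if PARENT[y] in nodes}


def parent_of(nodes):
    """Inverse of child."""
    return {PARENT[x] for x in nodes if PARENT[x] is not None}


def descendant(nodes):
    """One pass in document order; a parent is always seen before its children."""
    out = set()
    for y in range(len(LABELS)):
        if PARENT[y] in nodes or PARENT[y] in out:
            out.add(y)
    return out


def subtree_end():
    """end[y] = largest node number inside the subtree rooted at y."""
    end = list(range(len(LABELS)))
    for y in reversed(range(1, len(LABELS))):
        end[PARENT[y]] = max(end[PARENT[y]], end[y])
    return end


def preceding(nodes):
    """Inverse of following: y qualifies iff some node of the set lies
    after the whole subtree of y in document order."""
    if not nodes:
        return set()
    last, end = max(nodes), subtree_end()
    return {y for y in DOM if end[y] < last}


# Predicate sets, computed once for the whole document.
HAS_C_WITH_D = parent_of(T("c") & parent_of(T("d")))   # child::c/child::d
NO_FOLLOWING = DOM - preceding(DOM)                    # not(following::*)
PREDICATE = HAS_C_WITH_D | NO_FOLLOWING

# Main path, evaluated forwards from the root.
RESULT = child(descendant({0}) & T("a")) & T("b") & PREDICATE
```

Every function above makes a constant number of passes over the nodes, so with one operation per query step the total work in this sketch stays proportional to |D|·|Q|.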
Conclusion
• The paper presents the first XPath query evaluation algorithm that runs in polynomial time with respect to the size of both the data and the query (linear in the size of queries and quadratic in the size of data)
• The implementation uses no optimizations and strictly adheres to the specification given in the paper.
• Benchmark results in seconds for IE6 vs. top-down algorithm