XPath Query Processing DBPL9 Tutorial, Sept. 8, 2003, Part 2

XPath Query Processing DBPL9 Tutorial, Sept. 8, 2003, Part 2. Georg Gottlob, TU Wien Christoph Koch, U. Edinburgh. Based on joint work with R. Pichler . Contents. Part 1 Xpath Basics Axis Evaluation Experiments with current systems Polynomial-time evaluation of Core Xpath


Presentation Transcript


  1. XPath Query Processing. DBPL9 Tutorial, Sept. 8, 2003, Part 2. Georg Gottlob, TU Wien; Christoph Koch, U. Edinburgh. Based on joint work with R. Pichler.

  2. Contents Part 1 • XPath Basics • Axis Evaluation • Experiments with current systems • Polynomial-time evaluation of Core XPath • Core XPath and datalog • Polynomial-time evaluation of full XPath Part 2 • Context simplification and efficient evaluation of XPath • Parallel complexity of XPath • Automata-based techniques: • XPath on Streaming XML • Expressive queries and automata • Further relevant work

  3. Context Simplification and Efficient Evaluation of XPath

  4. Time and space bound Bottom-up evaluation based on context-value tables (CVTs): • Time O(|data|^5 * |query|^2), space O(|data|^4 * |query|^2). Space bound (n = number of nodes in the input document): • Contexts are at most triples: at most n^3 contexts. • Sizes of values: • Node sets: at most O(n) • Strings, numbers: at most O(|data| * |query|) (iterated concatenation of strings, multiplication of numbers) • Each CVT is of size O(|data|^4 * |query|). Time bound: the most expensive single computation is O(n^2), namely the relational operation “=” on node sets (e.g. a/b//c[d//e/f/g = h/i//j]).

  5. Alternative context representation • Contexts are represented as (“previous context node”, “current context node”) pairs rather than (“context node”, “position”, “size”) triples. • “position” and “size” must be recomputed on demand. • Complexity is lowered to time O(|data|^4 * |query|^2), space O(|data|^3 * |query|^2). Example query: //a/b[position() + 1 = last()]. Example tree: root 0:c with children 1:a and 5:a; 1:a has children 2:b, 3:b, 4:b; 5:a has children 6:b, 7:b. • child::b … { (1,2), (1,3), (1,4), (5,6), (5,7) } • child::b[position() + 1 = last()] … { (1,3), (5,6) }
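
The pair representation on this slide can be tried out with a small sketch. The code below is only an illustration under stated assumptions: the dictionary `DOC` and the helper names `child_step` and `with_position_and_size` are hypothetical, and the tree is the one read off the slide's example. It stores only (previous, current) pairs and derives position and size when the predicate needs them.

```python
from collections import defaultdict

# Hypothetical encoding of the slide's tree: node id -> (label, parent id).
DOC = {
    0: ("c", None),
    1: ("a", 0), 2: ("b", 1), 3: ("b", 1), 4: ("b", 1),
    5: ("a", 0), 6: ("b", 5), 7: ("b", 5),
}

def child_step(context_nodes, label):
    """Evaluate child::label as (previous, current) pairs in document order."""
    pairs = []
    for prev in sorted(context_nodes):
        for node in sorted(DOC):
            node_label, parent = DOC[node]
            if parent == prev and node_label == label:
                pairs.append((prev, node))
    return pairs

def with_position_and_size(pairs):
    """Recompute position and size on demand from the stored pairs."""
    groups = defaultdict(list)
    for prev, cur in pairs:
        groups[prev].append(cur)
    for prev, curs in groups.items():
        for pos, cur in enumerate(curs, start=1):
            yield prev, cur, pos, len(curs)

a_nodes = [n for n, (label, _) in DOC.items() if label == "a"]  # //a
pairs = child_step(a_nodes, "b")
# Predicate position() + 1 = last(): keep the second-to-last b child.
selected = [(p, c) for p, c, pos, size in with_position_and_size(pairs)
            if pos + 1 == size]
print(pairs)     # [(1, 2), (1, 3), (1, 4), (5, 6), (5, 7)]
print(selected)  # [(1, 3), (5, 6)]
```

The sketch makes no attempt to meet the stated complexity bounds; it only shows that the triples never need to be stored.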

  6. Context Simplification Technique • Only materialize the relevant part of the context. • Use the Core XPath evaluation algorithm for outermost and innermost paths, e.g. //a/b/c//d[…]/e[…(a/b/c)]. • Treat “position” and “size” in a loop. • Because of the tree shape of the query, loops never have to be nested. [Figure: query tree with the nodes descendant::a, child::b[ ], position() = count( ), child::b[ ] and position() + 1 = last(); each subexpression is annotated with the context it needs, either (cn) alone or a (cn, cp, cs) loop. Figure note: compute the node set for which child::b[…] is true.]

  7. Linear Space Fragment • “Wadler Fragment” [Wadler, 1999]: Core XPath + position(), last(), and arithmetic. • Evaluation in quadratic time and linear space. Example query: //a/b[position() * 2 = last() and c/d[position() * 3 = last()]]//e • For x ∈ [[//a]], compute the contexts (y,p,n) ∈ x.[[b]]; compute Y = { y | (y,p,n) ∈ x.[[b]] and p * 2 = n }. • Similarly, compute Z = { z | z.[[ d[position() * 3 = last()] ]] is true }. • Compute X = { x | z ∈ Z, x ∈ z.[[child::c]]^-1 } in linear time. • Result is { w | v ∈ X ∩ Y, w ∈ v.[[descendant::e]] }. In the query, only the two predicates that use position() and last() need full (cn,cp,cs) contexts; the remaining subexpressions need only the context node (cn).
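
The four steps on this slide can be followed on a toy document. This is a rough sketch, not the authors' implementation: the document `DOC` and the helpers `children` and `descendants` are hypothetical, and the naive tree scans do not achieve the quadratic-time bound. What it does show is that only the node sets Y, Z and X are kept, never a table of (cn,cp,cs) triples.

```python
# Hypothetical document: node id -> (label, parent id); ids follow document order.
DOC = {
    0: ("r", None),
    1: ("a", 0),
    2: ("b", 1), 3: ("c", 2), 4: ("d", 3), 5: ("d", 3), 6: ("d", 3), 7: ("e", 2),
    8: ("b", 1), 9: ("c", 8), 10: ("d", 9), 11: ("d", 9), 12: ("d", 9), 13: ("e", 8),
    14: ("a", 0),
    15: ("b", 14), 16: ("e", 15),
    17: ("b", 14),
}

def children(node, label):
    """Children of node with the given label, in document order."""
    return [n for n in sorted(DOC) if DOC[n][1] == node and DOC[n][0] == label]

def descendants(node, label):
    """Proper descendants of node with the given label."""
    found = []
    for n in sorted(DOC):
        if DOC[n][1] == node:
            if DOC[n][0] == label:
                found.append(n)
            found.extend(descendants(n, label))
    return found

# Y: b children of some a node with position() * 2 = last().
Y = set()
for x in (n for n in sorted(DOC) if DOC[n][0] == "a"):
    bs = children(x, "b")
    Y |= {y for p, y in enumerate(bs, start=1) if p * 2 == len(bs)}

# Z: nodes z for which d[position() * 3 = last()] is non-empty.
Z = set()
for z in DOC:
    ds = children(z, "d")
    if any(p * 3 == len(ds) for p in range(1, len(ds) + 1)):
        Z.add(z)

# X: inverse of child::c applied to Z (parents of the c nodes in Z).
X = {DOC[z][1] for z in Z if DOC[z][0] == "c"}

result = sorted({w for v in X & Y for w in descendants(v, "e")})
print(result)  # [7]
```

In this document node 8 fails the position test and node 15 has no c child, so only node 2 lies in X ∩ Y.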

  8. Summary Full XPath • Bottom-up algorithm based on CVT • Time O(|data|^5 * |query|^2), space O(|data|^4 * |query|^2). • Top-down evaluation • Time O(|data|^4 * |query|^2), space O(|data|^3 * |query|^2). • Context-reduction technique • Time O(|data|^4 * |query|^2), space O(|data|^2 * |query|^2). Wadler fragment • Time O(|data|^2 * |query|^2), space O(|data| * |query|). Core XPath • Time and space O(|data| * |query|).

  9. Parallel Complexity of XPath

  10. Parallel Complexity of XPath • Known: XPath is in P w.r.t. combined complexity [G., K., and Pichler, VLDB 2002]. • P-hardness => unlikely that there is an efficient parallel algorithm (conjecture: NC ≠ P). • Even quite restrictive fragments of XPath are P-hard: • Core XPath using only child, parent, and descendant axes, no “branching” of tree patterns. • Proof by encoding circuits, somewhat involved! • But: without negation, Core XPath is in LOGCFL (⊆ NC^2, highly parallelizable!)

  11. PF – Path Query Fragment • PF = Core XPath without conditions. • E.g. //a/b//c/parent::d//f/g/ancestor::a/* • Theorem: PF is NL-complete w.r.t. combined complexity (and L-reductions). • Membership: paths easy to guess and check in NL. • NL-Hardness by reduction from Graph Reachability …

  12. Where can we go from v2 in one step?

  18. Where can we go from v2 in one step? • Reachable from v2 in one step: v1, v3!

  19. PF is NL-hard. • The reduction yields reachability in precisely m steps. • Adding a loop at each node of the graph turns this into reachability in at most m steps. • Set m = |E|.
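
The effect of the added loops can be checked on a small graph. The sketch below is an illustration only: the graph is hypothetical (apart from v2 having the one-step successors v1 and v3, as on the preceding slide), and `reach_exactly` and `add_loops` are made-up helper names. With a self-loop at every node, a walk of exactly m edges can idle at a node, so it reaches precisely the nodes reachable in at most m steps; since a simple path uses at most |E| edges, m = |E| suffices.

```python
def reach_exactly(edges, source, m):
    """Nodes reachable from source by a walk of exactly m edges."""
    frontier = {source}
    for _ in range(m):
        frontier = {w for (v, w) in edges if v in frontier}
    return frontier

def add_loops(nodes, edges):
    """Add a self-loop at every node."""
    return set(edges) | {(v, v) for v in nodes}

# Hypothetical graph; only the two edges leaving v2 are taken from the slides.
nodes = {"v1", "v2", "v3", "v4"}
edges = {("v2", "v1"), ("v2", "v3"), ("v3", "v4")}

m = len(edges)  # m = |E|
one_step = reach_exactly(edges, "v2", 1)
exact_without_loops = reach_exactly(edges, "v2", m)
exact_with_loops = reach_exactly(add_loops(nodes, edges), "v2", m)
print(sorted(one_step))          # ['v1', 'v3']
print(sorted(exact_with_loops))  # ['v1', 'v2', 'v3', 'v4']
```

Without the loops no walk of exactly three edges leaves v2 in this graph, which is why the loops are needed before fixing m.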

  20. Further fragments with low parallel complexity Core XPath is in L w.r.t. combined complexity if: • Only one-step axes are used (child, parent; self). • Only transitive downward axes are used (descendant, descendant-or-self, …).

  21. Increasing the Size of the LOGCFL Fragment • “positive Wadler fragment” [Wadler, 2000]: just like positive Core XPath, but with position arithmetics in conditions. • child::a[position()+1 = last()] … get the second-last child labeled “a”. • No iteration of predicates: child::a[…][…]. • Theorem (combined complexity): the positive WF is • LOGCFL-complete; • with iterated predicates (already when iterated at most twice), it is P-complete.

  22. Increasing the Size of the LOGCFL Fragment • pXPath: “positive”/parallel XPath, defined by five restrictions: (1) No negation. (2) No iterated predicates […][…]. (3) Depth of nesting of arithmetic operations inside a predicate is bounded by some constant. (4) Forbidden built-in functions: count, sum, string, local-name, name, namespace-uri, string-length, normalize-space. (5) Forbidden: relational operations on booleans. • Theorem. pXPath is LOGCFL-complete (combined complexity). • Maximal parallelizable fragment of XPath, unless P = NC. • Lifting any of the restrictions (1) – (5) leads to P-hardness.

  23. Combined Complexity of XPath

  24. Data and Query Complexity • Theorem. PF is L-complete under NC^1-reductions (data complexity). • Theorem. XPath without multiplication and concatenation is in L w.r.t. query complexity. • Surprisingly, data complexity and query complexity are low; combined complexity is higher! [Figure: data complexity of the fragments PF and XPath; labels: L-complete (NC^1-red.), L.]

  25. Processing Xpath on Streams using Finite Automata

  26. FSA on Streams • Translate an XPath path query into a finite-state automaton (FSA) and process a stream of (e.g.) SAX events. • Very good scalability, low memory consumption (a stack is needed). • Selective dissemination of information (SDI) / publish-subscribe (cf. XFilter [Altinel and Franklin, VLDB 2000], XTrie [Chan et al., ICDE 2002]). • Boolean queries. • Extensions to support branching tree patterns, condition predicates, backward axes, … • Goal is to evaluate multiple queries at once (10^4 – 10^6 queries).
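
A minimal sketch of the idea, assuming a simplified event stream of ("start", tag) and ("end", tag) pairs in place of real SAX callbacks. The function names `parse_path`, `step` and `select` are hypothetical, and only absolute path queries built from `/`, `//`, names and `*` are handled. The automaton state is a set of NFA states; a stack holds one such set per open element, so memory grows with document depth only.

```python
def parse_path(query):
    """Split a path such as '//a/b' into (axis, label) steps.

    Assumes the query starts with '/' or '//'.
    """
    steps, i = [], 0
    while i < len(query):
        axis = "//" if query.startswith("//", i) else "/"
        i += len(axis)
        j = query.find("/", i)
        j = len(query) if j == -1 else j
        steps.append((axis, query[i:j]))
        i = j
    return steps

def step(states, tag, steps):
    """NFA move on an opening tag; state i means 'the first i steps are matched'."""
    nxt = set()
    for s in states:
        if s < len(steps):
            axis, label = steps[s]
            if axis == "//":
                nxt.add(s)        # descendant axis: this element may be skipped
            if label in ("*", tag):
                nxt.add(s + 1)    # this element matches the next step
    return frozenset(nxt)

def select(events, query):
    """Return the indices of the start events of the selected elements."""
    steps = parse_path(query)
    stack = [frozenset({0})]
    matches = []
    for index, (kind, tag) in enumerate(events):
        if kind == "start":
            current = step(stack[-1], tag, steps)
            stack.append(current)
            if len(steps) in current:
                matches.append(index)
        else:
            stack.pop()
    return matches

# Event stream for <a><a><b/></a><b/><c><b/></c></a>.
events = [("start", "a"), ("start", "a"), ("start", "b"), ("end", "b"),
          ("end", "a"), ("start", "b"), ("end", "b"), ("start", "c"),
          ("start", "b"), ("end", "b"), ("end", "c"), ("end", "a")]
matches = select(events, "//a/b")
print(matches)  # [2, 5]
```

The third b element is not selected because its parent is c. For `//a/b` the state sets that occur are {0}, {0,1} and {0,2}.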

  27. Example: $x in //a/b [Animation over slides 27 to 41: an NFA and the equivalent DFA for //a/b, run over a sample document tree with a and b nodes. The DFA states are (0), (01) and (02); the state assigned to each node is shown as the traversal descends and removed again as it returns, and $x is bound at each b node reached in state (02).]

  42. Size of DFAs //a/*/*/b

  43. Size of DFAs • Exponential in the size of the XPath statement, but • For a single query, only exponential in the number of occurrences of “*”. • In the case of an automaton for multiple queries, exponential in the number of occurrences of “//”. • Lazy evaluation of the DFA • Computation of states and transitions only on demand. • Saves much time and space in practice: documents usually come from a quite restrictive language. [Green, Miklau, Onizuka, Suciu, ICDT 2003]
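
The growth in the number of “*” steps can be observed with a small subset construction. This is a sketch under assumptions: the three-letter alphabet and the helper names `nfa_step`, `dfa_states` and `wildcard_query` are made up for the illustration, and the construction is eager, not lazy. It counts the reachable DFA states for queries of the form //a/*/…/*/b with k wildcard steps.

```python
def nfa_step(states, tag, steps):
    """NFA move on an opening tag; state i means 'the first i steps are matched'."""
    nxt = set()
    for s in states:
        if s < len(steps):
            axis, label = steps[s]
            if axis == "//":
                nxt.add(s)
            if label in ("*", tag):
                nxt.add(s + 1)
    return frozenset(nxt)

def dfa_states(steps, alphabet):
    """Eager subset construction: all DFA states reachable from {0}."""
    start = frozenset({0})
    seen, todo = {start}, [start]
    while todo:
        state = todo.pop()
        for tag in alphabet:
            succ = nfa_step(state, tag, steps)
            if succ not in seen:
                seen.add(succ)
                todo.append(succ)
    return seen

def wildcard_query(k):
    """Steps of //a/*/.../*/b with k wildcard steps."""
    return [("//", "a")] + [("/", "*")] * k + [("/", "b")]

sizes = [len(dfa_states(wildcard_query(k), ["a", "b", "c"])) for k in range(5)]
print(sizes)
```

For k = 0 and k = 1 this yields 3 and 6 states, and the count at least doubles with each further wildcard, because the DFA has to remember which of the last k + 1 open elements were labeled a. A lazy variant would call `nfa_step` only for the (state, tag) pairs that actually occur in the stream.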

  44. Extensions • Branching tree patterns. • Condition predicates. • Backward axes • Boolean queries (“Can tree pattern be embedded into XML document?”) • Rather than node-selecting queries.

  45. Highly Expressive Queries and Automata

  46. Motivation • Scalability in databases = (all three points at the same time) • Strictly linear time. • Little main memory required (DB in secondary storage). • Little jumping around in the data, sequential scans of disk preferred (streaming). • Paged sequential reading much faster than random access. • Node-selecting queries on unranked trees (XML) • Higher expressiveness than what is possible with single pass. • Folklore: unary MSO queries can be evaluated in two passes through the tree.

  47. The Arb Query Processor Evaluates node-selecting queries • In two sequential scans of the data. • Memory requirements: O(depth(tree)), otherwise independent of size of DB. • Highly parallelizable. • Tree Automata-based. • High expressiveness: unary Monadic Second Order Logic (MSO). • Succinct representation of automata. [Frick, Grohe, K., LICS 2003; K., VLDB 2003]

  48. Selecting Tree Automata (STAs) • STA: nondeterministic bottom-up tree automaton with a set of selecting states. • Select a node if it is assigned a selecting state in all accepting runs or, alternatively, in at least one accepting run. • Expressive power: unary MSO queries on trees. [Neven’s thesis]; [Frick, Grohe & K., LICS 2003]

  49. Two-Phase Query Evaluation From the STA, derive two automata: • Deterministic bottom-up tree automaton • Computes the reachable states. • Deterministic top-down tree automaton (with selection) • Eliminates state-to-node assignments that do not lead to an accepting run. • Selects the nodes of the query result. [Frick, Grohe & K., LICS 2003]
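
The two phases can be sketched on an in-memory tree. Everything in the block is an assumption made for illustration and not the Arb implementation: trees are binary, the toy STA (states n, p, s) selects b nodes that have a proper ancestor labeled a, selection uses the “at least one accepting run” semantics, and the state sets are stored on the nodes instead of being streamed.

```python
from dataclasses import dataclass, field

@dataclass(eq=False)
class Node:
    label: str
    children: tuple = ()                       # no children or exactly two
    reach: set = field(default_factory=set)    # filled in by phase 1
    live: set = field(default_factory=set)     # filled in by phase 2

# Toy STA: "s" = this b node guesses it has an a ancestor,
# "p" = some guess below is still unverified, "n" = nothing pending.
ACCEPTING = {"n"}
SELECTING = {"s"}

def leaf_states(label):
    return {"n", "s"} if label == "b" else {"n"}

def inner_states(label, q1, q2):
    if label == "a":
        return {"n"}                           # an a node verifies all guesses below
    pending = q1 in ("p", "s") or q2 in ("p", "s")
    result = {"p" if pending else "n"}
    if label == "b":
        result.add("s")
    return result

def bottom_up(node):
    """Phase 1: states reachable at each node by some run on its subtree."""
    if not node.children:
        node.reach = leaf_states(node.label)
        return
    left, right = node.children
    bottom_up(left)
    bottom_up(right)
    node.reach = set()
    for q1 in left.reach:
        for q2 in right.reach:
            node.reach |= inner_states(node.label, q1, q2)

def top_down(node, live, selected):
    """Phase 2: keep states that occur in an accepting run; collect selected nodes."""
    node.live = live
    if live & SELECTING:
        selected.append(node)
    if node.children:
        left, right = node.children
        live_left, live_right = set(), set()
        for q1 in left.reach:
            for q2 in right.reach:
                if inner_states(node.label, q1, q2) & live:
                    live_left.add(q1)
                    live_right.add(q2)
        top_down(left, live_left, selected)
        top_down(right, live_right, selected)

def evaluate(root):
    bottom_up(root)
    selected = []
    top_down(root, root.reach & ACCEPTING, selected)
    return selected

# Tree c(a(b, c), b): only the b below the a node should be selected.
b1, c1, b2 = Node("b"), Node("c"), Node("b")
root = Node("c", (Node("a", (b1, c1)), b2))
selected = evaluate(root)
```

Here `selected` contains only `b1`: the guess "s" at `b2` survives phase 1 but is eliminated in phase 2, because it would leave the root in the non-accepting state "p".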
