Efficient Physical Operators for a cost-based XPath Execution Engine
Haris Georgiadis, Minas Charalambides, Vasilis Vassalos
Athens University of Economics and Business
Motivation (1)
• XPath query: /s/r/*/it[mb/m/to='x']//k
• Three navigation alternatives (among others):
  • Straightforward navigation: retrieve all it elements under /s/r/*/it; keep those having at least one to descendant reachable via the relative path mb/m/to with text value 'x'. For the it elements left, return their k descendants.
  • Starting from k: return all k elements with at least one it ancestor, which in turn:
    • has a to descendant reachable via the relative path mb/m/to with text value 'x', and
    • has an s document element ancestor via the relative path parent::*/parent::r/parent::s.
  • Starting from to: return all to elements under /s/r/*/it/mb/m/to, keep only those with text value 'x', then go backward via parent::m/parent::mb/parent::it and, for the it elements left, return their k descendants.
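The three alternatives can be checked against each other on a toy document. The sketch below is only an illustration with Python's standard xml.etree.ElementTree, not the GeCOEX engine; the sample document and the helper names (PARENT, ORDER, doc_order, has_x, forward, from_k, from_to) are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; root is the <s> document element.
root = ET.fromstring(
    "<s><r>"
    "<a><it><mb><m><to>x</to></m></mb><d><k>k1</k></d><k>k2</k></it>"
    "<it><mb><m><to>y</to></m></mb><k>k3</k></it></a>"
    "<b><it><mb><m><to>x</to></m></mb><k>k4</k></it></b>"
    "</r></s>"
)
# ElementTree has no parent pointers, so build a child -> parent map.
PARENT = {child: node for node in root.iter() for child in node}
ORDER = {node: i for i, node in enumerate(root.iter())}

def doc_order(nodes):
    """Duplicate-free list of elements in document order."""
    return sorted(set(nodes), key=ORDER.__getitem__)

def has_x(it):
    """Predicate [mb/m/to='x'] evaluated on one it element."""
    return any(to.text == "x" for to in it.findall("mb/m/to"))

def forward():
    """Alternative 1: navigate /s/r/*/it, filter, then descend to k."""
    its = [it for it in root.findall("r/*/it") if has_x(it)]
    return doc_order(k for it in its for k in it.iter("k"))

def from_k():
    """Alternative 2: start from every k and check its it ancestors."""
    def anchored(it):  # parent::*/parent::r/parent::s
        r = PARENT.get(PARENT.get(it))
        return r is not None and r.tag == "r" and PARENT.get(r) is root
    out = []
    for k in root.iter("k"):
        n = k
        while n in PARENT:
            n = PARENT[n]
            if n.tag == "it" and has_x(n) and anchored(n):
                out.append(k)
                break
    return doc_order(out)

def from_to():
    """Alternative 3: start from to, go backward to it, then descend."""
    out = []
    for to in root.findall("r/*/it/mb/m/to"):
        if to.text == "x":
            it = PARENT[PARENT[PARENT[to]]]  # parent::m/parent::mb/parent::it
            out.extend(it.iter("k"))
    return doc_order(out)

print([k.text for k in forward()])
print([k.text for k in from_k()])
print([k.text for k in from_to()])
```

All three functions return the same k elements; which one is cheapest depends on how selective each step is, which is the motivation for choosing among them by cost.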
Motivation (2)
• Many XPath processing algorithms
  • PPFS+, Staircase Join, Sort Merge-based structural joins, PathStack, Twig2Stack etc.
• Many physical data models and storage techniques:
  • Shredding on relations:
    • Schema-based mapping vs. edge-based mapping
  • Storage into disk pages preserving XML hierarchy
  • Structural encodings:
    • Region encoding vs. prefix-based encoding
  • Data structures: XB-trees, F&B Index, Path indexes
Contribution I
• GeCOEX: the first generic XPath cost-based execution and optimization framework
• Agnostic to the underlying XML storage system and the access methods it supports
• Independent of the techniques and algorithms available for XPath processing
  • These are encapsulated in operator implementations and rewriting rules
• Cost-based optimization
Contribution II
• XPalgebra: a novel XPath logical algebra
  • Good fit with many XPath processing techniques
• Lookup and SM: two novel and efficient families of physical operators for XPath
• Multiple storage engines
• Experimental evaluation: direct comparison of operator implementations
GeCOEX System Architecture
[Architecture diagram. Labels recovered from the figure: XPath query, Parser, Query Optimization, Rewriting Rules, Physical Plan Selector, Physical Operator Descriptors, Cost Models, Query Execution, Physical Plan Executor, Physical Operators, XPA API, XPA Driver, Database Statistics, Primitive Access Method Cost Models, Primitive Access Methods, Data Model, result]
XPalgebra
• Generic sequence-based logical algebra for a subset of XPath
  • Forward and backward axes
  • Non-positional predicates involving conjunctive Boolean expressions
• Maintains the navigational nature of XPath
• Data Model
  • Element
  • Sequence: duplicate-free list of elements in document order
• Sequence Operators: (mainly) navigation
  • Input and Output: Sequence
• Boolean Operators: used for filtering
  • Input: Element
  • Output: True or False
XPalgebra – Sequence Operators
• Both the input and the output of a Sequence operator are sequences of nodes
• The input sequence is called the context sequence
BoolExpr: const | Ъ1 ∧ Ъ2 ∧ … ∧ Ъn, where each Ъi is a Boolean operator
XPalgebra – Boolean Operators
• Applied to single nodes only
• The input element is called the context element
• Return Boolean values
• Example: the predicate …[d//c] corresponds to f(S, Ъfp_{/d//c})
BoolExpr: const | Ъ1 ∧ Ъ2 ∧ … ∧ Ъn, where each Ъi is a Boolean operator
XPalgebra - examples
• Query: /s/r/*/it[mb/m/to='x']//k
• Algebra expression: d_k( f( fp_{/s/r/*/it}(root), Ъfp_{/mb/m/to}( Ъvf_{text()=x} ) ) )
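To make the nesting of the expression concrete, here is a minimal sketch that mimics the operators as Python functions over xml.etree.ElementTree. The operator names d, f, fp, Ъfp and Ъvf come from the slide (written b_fp and b_vf below); the function bodies, the sample document and the synthetic document node are assumptions for illustration and say nothing about how GeCOEX implements them.

```python
import xml.etree.ElementTree as ET

s = ET.fromstring(
    "<s><r><a>"
    "<it><mb><m><to>x</to></m></mb><k>k1</k></it>"
    "<it><mb><m><to>y</to></m></mb><k>k2</k></it>"
    "</a></r></s>"
)
doc = ET.Element("document")  # synthetic document node standing in for 'root'
doc.append(s)
ORDER = {n: i for i, n in enumerate(doc.iter())}

def sequence(nodes):
    """Duplicate-free list of elements in document order."""
    return sorted(set(nodes), key=ORDER.__getitem__)

# Sequence operators: sequence in, sequence out.
def fp(path, seq):
    """Forward path: follow a multi-step child path from each context element."""
    return sequence(m for n in seq for m in n.findall(path))

def d(tag, seq):
    """Descendant step: all proper descendants with the given tag."""
    return sequence(m for n in seq for m in n.iter(tag) if m is not n)

def f(seq, pred):
    """Filter: keep the context elements for which the Boolean operator holds."""
    return [n for n in seq if pred(n)]

# Boolean operators: element in, True or False out.
def b_vf(value):
    """Value filter on the text of the context element."""
    return lambda n: n.text == value

def b_fp(path, pred=lambda n: True):
    """Existential forward path: some element reached via path satisfies pred."""
    return lambda n: any(pred(m) for m in n.findall(path))

# d_k( f( fp_{/s/r/*/it}(root), Ъfp_{/mb/m/to}( Ъvf_{text()=x} ) ) )
result = d("k", f(fp("s/r/*/it", [doc]), b_fp("mb/m/to", b_vf("x"))))
print([k.text for k in result])  # prints ['k1']
```

Each function can be swapped for a different implementation without touching the expression, which is the property the physical operators rely on.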
Physical Operators
• Implement the Sequence interface of the XPA API
• Access the XML data using the AccessMethods interface of the XPA API
Example: a physical operator implementation [code listing not recovered from the slide]
This is how physical operators remain agnostic to the physical data model
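Since the slide's own listing is not available, the following is a hedged sketch of the idea only. The names AccessMethods, Descs and next() appear in the slides; the Python signatures, the two toy drivers (NestedTupleDriver, RegionTableDriver), the operator class and the drain helper are assumptions for illustration.

```python
from abc import ABC, abstractmethod

class AccessMethods(ABC):
    """Hypothetical rendering of the AccessMethods interface."""
    @abstractmethod
    def descs(self, node, tag):
        """Yield the descendants of node with the given tag, in document order."""

class NestedTupleDriver(AccessMethods):
    """Toy driver: a node is a (tag, children) tuple."""
    def descs(self, node, tag):
        for child in node[1]:
            if child[0] == tag:
                yield child
            yield from self.descs(child, tag)

class RegionTableDriver(AccessMethods):
    """Toy driver: a node is a (pre, post, tag) row of a flat table."""
    def __init__(self, table):
        self.table = sorted(table)
    def descs(self, node, tag):
        pre, post, _ = node
        return (n for n in self.table
                if n[2] == tag and n[0] > pre and n[1] < post)

class DescendantOperator:
    """Pull-based operator; it only ever talks to the AccessMethods interface."""
    def __init__(self, driver, context_element, tag):
        self._results = iter(driver.descs(context_element, tag))
    def next(self):
        return next(self._results, None)  # None signals exhaustion

def drain(op):
    out = []
    while (n := op.next()) is not None:
        out.append(n)
    return out

# The same document r[b[f, c[f]]] stored in two different ways.
tree = ("r", [("b", [("f", []), ("c", [("f", [])])])])
table = [(0, 4, "r"), (1, 3, "b"), (2, 0, "f"), (3, 2, "c"), (4, 1, "f")]

from_tree = drain(DescendantOperator(NestedTupleDriver(), tree, "f"))
from_table = drain(DescendantOperator(RegionTableDriver(table), table[0], "f"))
print(len(from_tree), len(from_table))  # prints 2 2
```

The operator code is identical for both storage layouts; only the driver differs.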
Physical Operators
• Large number of physical operators, divided roughly into four 'families':
  • Lookup operators (LU)
    • Inspired by indexed nested loops join
    • dLUa: for each element n from the input sequence S, make a lookup using XPAAPI.Descs(n, a)
  • SortMerge-based operators (SM)
    • Inspired by Sort Merge join
    • dSMa: scan all elements from the input sequence S and all a elements (using XPAAPI.Descs(root, a)) and find 'ancestor-descendant' matches
  • Staircase Join operators [Grust 2003]
  • PathStack operators [Bruno 2002]
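The lookup idea behind dLUa can be sketched as follows, assuming a region (pre/post) encoding and one sorted list per tag name standing in for a per-tag B-Tree. This is a simplified illustration, not the authors' algorithm: the El tuple, the sample data, and the "skip anything not beyond the last emitted pre" rule for duplicate elimination are assumptions.

```python
from bisect import bisect_right
from collections import namedtuple

El = namedtuple("El", "pre post tag name")

# Document r[ b1[ a1, b2[ a2 ] ], a3, b3[ a4 ] ] with pre/post ranks.
DOC = [
    El(0, 7, "r", "r"), El(1, 3, "b", "b1"), El(2, 0, "a", "a1"),
    El(3, 2, "b", "b2"), El(4, 1, "a", "a2"), El(5, 4, "a", "a3"),
    El(6, 6, "b", "b3"), El(7, 5, "a", "a4"),
]

# One list per tag, sorted by pre (DOC is already in document order).
TAG_INDEX = {}
for el in DOC:
    TAG_INDEX.setdefault(el.tag, []).append(el)
TAG_PRES = {tag: [e.pre for e in els] for tag, els in TAG_INDEX.items()}

def descs(n, tag):
    """Index lookup: descendants of n with this tag form one contiguous run."""
    entries = TAG_INDEX.get(tag, [])
    i = bisect_right(TAG_PRES.get(tag, []), n.pre)
    while i < len(entries) and entries[i].post < n.post:
        yield entries[i]
        i += 1

def d_lu(context, tag):
    """One lookup per context element; the context must be in document order."""
    last_pre = -1
    for n in context:
        for e in descs(n, tag):
            if e.pre > last_pre:  # already emitted via an enclosing context element
                yield e
                last_pre = e.pre

context = [DOC[1], DOC[3], DOC[6]]  # b1, b2, b3
print([e.name for e in d_lu(context, "a")])  # prints ['a1', 'a2', 'a4']
```

The cost is roughly one index probe per context element plus the size of each window, so this style tends to pay off when the context sequence is small.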
Physical Operators
**: inspired by original
5 XML Storage Systems and their XPA drivers
• The PE-basic Native XML storage system
  • Dewey encoding, 1 B-Tree per tag name
• The RE-basic Native XML storage system
  • Pre/Post/Level encoding, 1 B-Tree per tag name
• The PE-Path Native XML storage system
  • Dewey encoding, 1 B-Tree per tag name, Paths B-Tree
• The RE-Path Native XML storage system
  • Pre/Post/Level encoding, 1 B-Tree per tag name, Paths B-Tree
• The Edge-RE Native XML storage system
  • Pre/Post/Level encoding, 1 B-Tree for all elements
[Architecture diagram repeated from the GeCOEX System Architecture slide, with an XML Storage System component added]
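The two structural encodings named above answer the same ancestor question in different ways. The sketch below computes both for a small tree; the Node class, the encode function and the sample tree are assumptions for illustration and do not reflect how the five storage systems lay out their B-Trees.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Node:
    tag: str
    pre: int       # rank in a preorder traversal
    post: int      # rank in a postorder traversal
    level: int     # depth, document element = 0
    dewey: tuple   # path of sibling positions from the document element

def encode(tree):
    """tree is (tag, [children]); returns the nodes in document order."""
    nodes = []
    counter = {"pre": 0, "post": 0}
    def visit(t, level, dewey):
        tag, children = t
        pre = counter["pre"]
        counter["pre"] += 1
        slot = len(nodes)
        nodes.append(None)  # reserve the document-order position
        for i, child in enumerate(children, start=1):
            visit(child, level + 1, dewey + (i,))
        post = counter["post"]
        counter["post"] += 1
        nodes[slot] = Node(tag, pre, post, level, dewey)
    visit(tree, 0, (1,))
    return nodes

def is_ancestor_region(a, d):
    """Region encoding: a starts before d and ends after it."""
    return a.pre < d.pre and a.post > d.post

def is_ancestor_dewey(a, d):
    """Prefix encoding: a's label is a proper prefix of d's label."""
    return len(a.dewey) < len(d.dewey) and d.dewey[:len(a.dewey)] == a.dewey

tree = ("r", [("b", [("c", [("f", [])])]), ("b", [("f", [])])])
nodes = encode(tree)
for n in nodes:
    print(n.tag, n.pre, n.post, n.level, ".".join(map(str, n.dewey)))
```

With a prefix encoding the labels of all ancestors can be derived from a node's own label, which a region encoding does not allow; this is what the cheap Ancs() implementation in the PE-Path driver builds on.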
Lookup Operators
• Novel efficient algorithms for holistically evaluating forward and backward multi-step paths
• Based on root-to-node filtering
• buffered-leaping: a new technique for pipelined duplicate elimination and document order preservation
• Search a minimum window of elements for each element in the context sequence
  • window: the result of calling the method from the AccessMethods interface of the XPA API (e.g. Descs(), Ancs()) corresponding to the XPath axis (e.g. descendant, ancestor) for a given context element
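Root-to-node filtering can be illustrated with a small path matcher. This is a sketch under stated assumptions, not the regExprFilter used in the slides: the function names path_to_regex and matches_below are hypothetical, the pattern language is limited to child steps, descendant steps and the * wildcard, and context_depth is assumed to be the number of steps in the context element's own root-to-node path.

```python
import re

def path_to_regex(pattern):
    """Translate a path such as /c//f into a regex over '/'-separated tag names."""
    parts = []
    for sep, name in re.findall(r"(//?)([^/]+)", pattern):
        if sep == "//":
            parts.append(r"(?:/[^/]+)*")  # any number of intermediate steps
        parts.append("/" + ("[^/]+" if name == "*" else re.escape(name)))
    return re.compile("".join(parts))

def matches_below(rtn_path, pattern, context_depth):
    """True if the part of rtn_path below the context element matches pattern.

    rtn_path is a root-to-node path such as /r/b/c/f; the first context_depth
    steps belong to the context element and are skipped before matching.
    """
    steps = rtn_path.strip("/").split("/")
    tail = "/" + "/".join(steps[context_depth:])
    return path_to_regex(pattern).fullmatch(tail) is not None

print(matches_below("/r/b/c/f", "/c//f", 2))    # prints True
print(matches_below("/r/b/c/d/f", "/c//f", 2))  # prints True
print(matches_below("/r/b/d/f", "/c//f", 2))    # prints False
```

Because the test needs only the candidate's root-to-node path and the depth of the context element, a whole multi-step path can be checked without visiting the intermediate elements.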
Example: fpLU/c/f
[Animated figure: an XML tree rooted at r with b, c, d, e and f elements, stepped through successive next() calls. The annotations recovered from the animation are listed below.]
• Operator state shown: contextEl, chain, rootAnc, window
• Context elements mentioned: b1, b2, b3, b5, b7, b9
• Windows: window = XPAPI.Descs(b1, 'f'); window = XPAPI.Descs(b2, 'f'); window = XPAPI.Descs(b3, 'f'); window = XPAPI.Descs(b9, 'f')
• Successive next() calls return f1, f3, f5, f7, f12, f13, f17 and finally null
• Filtering decisions:
  • regExprFilter(f1.getRTNPath(), /c//f, 1) = true
  • regExprFilter(f2.getRTNPath(), /c//f, 1) = false
  • regExprFilter(f3.getRTNPath(), /c//f, 1) = true
  • regExprFilter(f4.getRTNPath(), /c//f, 1) = false
  • regExprFilter(f5.getRTNPath(), /c//f, 1) = true
  • f6 descendant of b3 and regExprFilter(f6.getRTNPath(), /c//f, 1) = false
  • f7 descendant of b3 and regExprFilter(f7.getRTNPath(), /c//f, 1) = false
  • f8 descendant of b3 and regExprFilter(f8.getRTNPath(), /c//f, 1) = false
  • f6 descendant of b5 and regExprFilter(f6.getRTNPath(), /c//f, 3) = false
  • f7 descendant of b5 and regExprFilter(f7.getRTNPath(), /c//f, 3) = true
  • f6 not descendant of b7; f8 not descendant of b5; f8 not descendant of b7
  • f9, f10 and f11 not reachable from any of b3, b5, b7 via /c//f
  • f12 and f13 reachable from b7 via /c//f
  • f16 not reachable from b9 via /c//f; f17 reachable from b9 via /c//f
• Chain maintenance: b2 not a descendant of b1; b3 not a descendant of b2; b5 is a descendant of b3; b7 is a descendant of b3; b9 is not a descendant of b3; context sequence is exhausted
• The size of chain at any time is very small and upper bounded by the depth of the XML document
Example: bpLUparent::c/ancestor::b
[Animated figure: the same XML tree, stepped through successive next() calls. The annotations recovered from the animation are listed below.]
• reverseOf(parent::c/ancestor::b) = /c//f
• Operator state shown: contextEl, window, sortedElements (virtual elements #b1, #b5, #b4, #b3, #b7, #b2 appear in the figure)
• Context elements mentioned: f2, f3, f5, f6, f8, f11
• Windows: window = XPAPI.Ancs(f2, 'b'); window = XPAPI.Ancs(f3, 'b'); window = XPAPI.Ancs(f5, 'b'); window = XPAPI.Ancs(f6, 'b'); window = XPAPI.Ancs(f8, 'b'); window = XPAPI.Ancs(f11, 'b')
• V: regExprFilter(f3.getRTNPath(), /c//f, 1) = true
• Successive next() calls shown: b1, b2, b4, and finally null
• Checks shown: f3 is a descendant of b1; f5 not a descendant of b1; f6 not a descendant of b2; f8 is a descendant of b3; f11 is a descendant of b3
• Cheap implementation of Ancs() in the PE-Path driver
  • Dewey(f2) = 1.1.2.1.1
  • RTN(f2) = /r/b/c/f => there is a 'b' ancestor b' at level 2
  • Dewey(b') = substr(dewey(f2), …) = 1.1 and RTN(b') = substr(RTN(f2), …) = /r/b
  • Ancs() outputs n without actually retrieving b1 from the database. n is the virtual representation of b1, denoted as #b1
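The substring trick for deriving virtual ancestors can be sketched in a few lines. The function name virtual_ancs and its signature are assumptions, and the example labels are hypothetical ones chosen so that the Dewey label and the root-to-node path have the same number of components; the sketch is not the PE-Path driver code.

```python
def virtual_ancs(dewey, rtn_path, tag):
    """Derive (dewey, rtn_path) pairs for every proper ancestor with this tag.

    Both results are prefixes of the node's own labels, so no ancestor has to
    be fetched from storage. Ancestors come out in document order.
    """
    labels = dewey.split(".")
    steps = rtn_path.strip("/").split("/")
    if len(labels) != len(steps):
        raise ValueError("Dewey label and root-to-node path disagree on depth")
    out = []
    for depth in range(1, len(steps)):      # proper prefixes only
        if steps[depth - 1] == tag:
            out.append((".".join(labels[:depth]),
                        "/" + "/".join(steps[:depth])))
    return out

# Hypothetical f element with label 1.1.2.1 and path /r/b/c/f.
print(virtual_ancs("1.1.2.1", "/r/b/c/f", "b"))
# prints [('1.1', '/r/b')]
print(virtual_ancs("1.2.1.3.1.1", "/r/b/c/b/c/f", "b"))
# prints [('1.2', '/r/b'), ('1.2.1.3', '/r/b/c/b')]
```

Each pair plays the role of a virtual element such as #b1: enough to identify and order the ancestor, without reading it.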
SM Operators
• Inspired by sort-merge join algorithms
• Traverse two sequences of elements, left and right
  • left: the context sequence (the input sequence)
  • right: always consists of all the elements of the requested tag name
• Keeping track of the current elements on left and right, try to find matching pairs according to the appropriate navigation axis and condition
• Novel techniques for holistic SM-based forward path and backward path operators with guaranteed low memory requirements
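For the single-step descendant axis, the merge over left and right can be sketched as below, assuming a region (pre/post) encoding with both inputs sorted by pre. The El tuple, the sample data and the "largest post seen so far" shortcut are assumptions for illustration; the holistic multi-step SM operators in the talk are more involved.

```python
from collections import namedtuple

El = namedtuple("El", "pre post tag name")

# Document r[ b1[ a1, b2[ a2 ] ], a3, b3[ a4 ] ] with pre/post ranks.
DOC = [
    El(0, 7, "r", "r"), El(1, 3, "b", "b1"), El(2, 0, "a", "a1"),
    El(3, 2, "b", "b2"), El(4, 1, "a", "a2"), El(5, 4, "a", "a3"),
    El(6, 6, "b", "b3"), El(7, 5, "a", "a4"),
]

def d_sm(left, right):
    """Single pass over both inputs; output is duplicate-free and in document order.

    left:  the context sequence, sorted by pre
    right: all elements of the requested tag name, sorted by pre
    """
    out = []
    i = 0
    max_post = -1  # largest post among context elements that start before e
    for e in right:
        while i < len(left) and left[i].pre < e.pre:
            max_post = max(max_post, left[i].post)
            i += 1
        # Some context element starts before e and ends after it,
        # so it is an ancestor of e.
        if max_post > e.post:
            out.append(e)
    return out

left = [DOC[1], DOC[3], DOC[6]]            # b1, b2, b3
right = [el for el in DOC if el.tag == "a"]
print([e.name for e in d_sm(left, right)])  # prints ['a1', 'a2', 'a4']
```

The state kept is one cursor and one integer, and every element of both inputs is read exactly once, so the cost depends on the total number of elements with the requested tag rather than on how selective the context is.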
Sensitivity to context selectivity
[Plots not recovered. Panel titles: descendant, ancestor, forward path]
Conclusions I
• Novel techniques for evaluating forward and backward multi-step paths
  • Pipelined duplicate elimination and document order preservation
  • Lookup fp, Lookup bp, Lookup cs, SM fp, SM bp, SM cs
• Fast backward navigation that fully exploits the capabilities of the underlying storage system
• Algorithms perform well across a variety of physical storage models
• First steps towards building cost models for XPath
Conclusions II
• Operator-based XPath processing provides significant optimization opportunities
• Different implementations of logical operators can provide benefits in different circumstances
  • E.g. context selectivity
• Query plans can be much more efficient than (existing) monolithic (twig) techniques in most circumstances
Thank you!