XML Native Query Processing

XML Native Query Processing • Chun-shek Chan, Mahesh Marathe • Wednesday, February 12, 2003

Topics • XML Indexing • “Accelerating XPath Location Steps”, Torsten Grust, ACM SIGMOD 2002 • XML Query Optimization • “Multi-level Operator Combination in XML Query Processing”, Shurug Al-Khalifa and H.V. Jagadish, ACM CIKM 2002

XML Query Languages • XPath • Developed by the World Wide Web Consortium • Version 1.0 became a W3C Recommendation on November 16, 1999 • Version 2.0 is a working draft.

XML Query Languages • XQuery • Developed by the World Wide Web Consortium as well • Currently a working draft

Axes on XPath Tree • There are 13 axes according to the XPath 2.0 Technical Report • Forward Axes • child, descendant, attribute, self, descendant-or-self, following-sibling, following, namespace (deprecated) • Reverse Axes • parent, ancestor, preceding-sibling, preceding, ancestor-or-self

XML Traversal and Storage • Tree-based traversal • Efficient storage is challenging • Especially for relational databases, which deal with flat tuples and are not designed to handle recursion or nested elements

Proposed Solutions • “Indexing and Querying XML Data for Regular Path Expressions”, Li and Moon, VLDB 2001 • “A Fast Index for Semistructured Data”, Cooper, Sample, Franklin, Hjaltason and Shadmon, VLDB 2001 • “DataGuides: Enabling Query Formulation and Optimization in Semistructured Databases”, Goldman and Widom, VLDB 1997

Problems with Proposed Solutions • Solutions focus on the / and // location steps and give inadequate support to the remaining XPath axes. • Proposals rely on technologies outside the relational domain.

Author’s Proposal • XPath Accelerator • Works entirely within a relational database. • Uses traditional relational syntax for queries. • Benefits from advanced index technologies, such as R-trees.

XPath Tree Traversal • Context Node: starting point of any traversal • Location Steps: syntactically separated by /, evaluated from left to right • A step’s axis establishes a subset of document nodes (a document region)

XPath Forward Axes • Child • Descendant • Attribute • Self • Descendant-or-self • Following-sibling • Following • Namespace

XPath Reverse Axes • Parent • Ancestor • Preceding-sibling • Preceding • Ancestor-or-self

Sample XML Tree

Encoding XML Document Regions • Formula: v/descendant ∪ v/ancestor ∪ v/following ∪ v/preceding ∪ v/self • Each node of the document appears exactly once in this union • What are the ways to uniquely identify different nodes?

Numbering Nodes • Grust: preorder and postorder ranks • Tatarinov: Global, Local, Dewey • Li & Moon: Order-size pairs
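The preorder/postorder numbering can be sketched in a few lines of Python. This is only an illustration: the six-element document and the function name `rank_nodes` are made up for the example, ranks are assumed to start at 0, and only element nodes are numbered.

```python
import xml.etree.ElementTree as ET

def rank_nodes(root):
    """Return a list of (tag, pre, post) with 0-based preorder/postorder ranks."""
    ranked = []
    counters = {"pre": 0, "post": 0}

    def visit(node):
        pre = counters["pre"]            # preorder rank: assigned when the node is entered
        counters["pre"] += 1
        for child in node:
            visit(child)
        ranked.append((node.tag, pre, counters["post"]))  # postorder rank: assigned on exit
        counters["post"] += 1

    visit(root)
    return sorted(ranked, key=lambda r: r[1])  # report in document (preorder) order

# Hypothetical sample document, not taken from the slides.
doc = ET.fromstring("<a><b><c/><d/></b><e><f/></e></a>")
ranks = rank_nodes(doc)
for tag, pre, post in ranks:
    print(tag, pre, post)
```

A single depth-first pass is enough: the preorder counter advances when a node is opened and the postorder counter when it is closed.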

XML Document Regions • Descendants? Ancestors? Preceding? Following?

XPath Tree Node Descriptor • desc(v) = { pre(v), post(v), par(v), att(v), tag(v) } • window(α,v) = { condition for each field in desc() } • Example: window(child,v) = { (pre(v),∞), [0,post(v)), pre(v), false, * }
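A window is just a per-field predicate over descriptors, which the following Python sketch makes explicit. The `child` window mirrors the example above. The descendant, ancestor, following and preceding windows are written from the usual reading of the pre/post plane and are an assumption here, as are the names `Desc`, `ACCEL`, `in_window` and `axis` and the six-node sample table.

```python
from typing import NamedTuple, Optional

class Desc(NamedTuple):
    pre: int
    post: int
    par: Optional[int]   # preorder rank of the parent, None for the root
    att: bool            # True for attribute nodes
    tag: str

# Hypothetical descriptor table for <a><b><c/><d/></b><e><f/></e></a>.
ACCEL = [
    Desc(0, 5, None, False, "a"),
    Desc(1, 2, 0, False, "b"),
    Desc(2, 0, 1, False, "c"),
    Desc(3, 1, 1, False, "d"),
    Desc(4, 4, 0, False, "e"),
    Desc(5, 3, 4, False, "f"),
]

def in_window(axis_name, v, u):
    """True if descriptor u lies inside window(axis_name, v)."""
    if u.att:
        return False                      # all five windows require att = false
    if axis_name == "child":
        return u.pre > v.pre and u.post < v.post and u.par == v.pre
    if axis_name == "descendant":
        return u.pre > v.pre and u.post < v.post
    if axis_name == "ancestor":
        return u.pre < v.pre and u.post > v.post
    if axis_name == "following":
        return u.pre > v.pre and u.post > v.post
    if axis_name == "preceding":
        return u.pre < v.pre and u.post < v.post
    raise ValueError(f"unsupported axis: {axis_name}")

def axis(axis_name, v):
    """Tags of all nodes reached from context node v along the given axis."""
    return [u.tag for u in ACCEL if in_window(axis_name, v, u)]

nodes = {d.tag: d for d in ACCEL}
print(axis("child", nodes["a"]), axis("following", nodes["b"]))
```

For any context node, the descendant, ancestor, following and preceding results plus the node itself cover the sample table exactly once, which is the partition property the region encoding relies on.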

XPath Query Windows

XPath Evaluation • Given an XPath expression e, an axis α, and a context node v, a location step can be evaluated as follows: • query(e/α) = SELECT v’.* FROM query(e) v, accel v’ WHERE v’ INSIDE window(α,v) • This pseudo-SQL code can be flattened into a plain relational query with a flat n-ary self-join.
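A minimal sketch of the flattened self-join, using Python's built-in `sqlite3` module. The table layout follows the descriptor fields, but the column types, the sample rows and the two-step path (all grandchildren of `a`, i.e. two `child` steps) are assumptions made for the example; the INSIDE predicate is expanded by hand into comparisons on `pre`, `post`, `par` and `att`.

```python
import sqlite3

# Hypothetical descriptor rows (pre, post, par, att, tag) for
# <a><b><c/><d/></b><e><f/></e></a>; att is stored as 0/1.
ROWS = [
    (0, 5, None, 0, "a"),
    (1, 2, 0, 0, "b"),
    (2, 0, 1, 0, "c"),
    (3, 1, 1, 0, "d"),
    (4, 4, 0, 0, "e"),
    (5, 3, 4, 0, "f"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accel (pre INTEGER PRIMARY KEY, post INTEGER, "
    "par INTEGER, att INTEGER, tag TEXT)"
)
conn.executemany("INSERT INTO accel VALUES (?, ?, ?, ?, ?)", ROWS)

# Two child steps starting at the nodes tagged 'a': one accel alias per step,
# each joined to the previous alias through the child window.
TWO_STEP_SQL = """
SELECT DISTINCT v2.pre, v2.tag
FROM accel v0, accel v1, accel v2
WHERE v0.tag = 'a'
  AND v1.pre > v0.pre AND v1.post < v0.post AND v1.par = v0.pre AND v1.att = 0
  AND v2.pre > v1.pre AND v2.post < v1.post AND v2.par = v1.pre AND v2.att = 0
ORDER BY v2.pre
"""
result = conn.execute(TWO_STEP_SQL).fetchall()
print(result)
```

Each additional location step adds one more alias of `accel` and one more window predicate, which is why a path of n steps becomes an n-ary self-join.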

XML Instance Loading • Loading an XML instance into the database means mapping its nodes into the descriptor table. • The callback procedures described in the paper can be used to load element nodes into a relational table. • Element contents go into a separate table.
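One way to picture callback-based loading is a SAX handler that assigns the preorder rank when an element opens and the postorder rank when it closes. The sketch below uses Python's standard `xml.sax` module; the class name `AccelLoader`, the tuple layout and the sample document are assumptions, and attribute nodes and element contents are left out to keep it short.

```python
import xml.sax

class AccelLoader(xml.sax.ContentHandler):
    """Collect (pre, post, par, att, tag) descriptors for element nodes."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished descriptors
        self._stack = []   # currently open elements as (pre, par, tag)
        self._pre = 0
        self._post = 0

    def startElement(self, name, attrs):
        # The parent is whatever element is currently open, if any.
        par = self._stack[-1][0] if self._stack else None
        self._stack.append((self._pre, par, name))
        self._pre += 1

    def endElement(self, name):
        # The postorder rank is only known once the element closes.
        pre, par, tag = self._stack.pop()
        self.rows.append((pre, self._post, par, False, tag))
        self._post += 1

loader = AccelLoader()
xml.sax.parseString(b"<a><b><c/><d/></b><e><f/></e></a>", loader)
rows = sorted(loader.rows)   # order by preorder rank
for row in rows:
    print(row)
```

The stack never grows beyond the height of the document, so a loader of this shape can stream a large instance without building the tree in memory.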

Potential Issues • Insertion of node • Need to renumber all nodes to reflect changes • Deletion of node • Only need to remove its entry in accelerator table

Node Descriptor Indexing • Efficiently supported by R-trees. • Can also be supported by B-trees.

Example of pre/post rank distribution

Shrink-wrapping the //-axis • Optimizing the window for the descendant axis • For each node, we need to determine the ranges of pre and post ranks for its leftmost and rightmost leaf nodes. • For any node v in a tree t, where size(v) is the number of nodes in the subtree below v, we have pre(v) − post(v) + size(v) = level(v) • For a leaf node v’, size(v’) = 0, therefore pre(v’) − post(v’) = level(v’) ≤ height(t)

Shrink-wrapping the //-axis • For the rightmost leaf v’ of node v: post(v) = post(v’) + (level(v’) − level(v)) • Using the previous equations, we have: pre(v’) ≤ post(v) + height(t) • For the leftmost leaf v’’ of node v, we have a similar result: post(v’’) ≥ pre(v) − height(t) • These formulas can be used to shrink the windows
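The step from the leaf identity to the two bounds can be written out as follows. The rightmost-leaf relation is the one stated on the slide. The leftmost-leaf relation pre(v'') = pre(v) + (level(v'') − level(v)) is its mirror image and is filled in here as an assumption, justified by the fact that a preorder walk goes from v straight down to its leftmost leaf.

```latex
\begin{align*}
\mathit{pre}(v') &= \mathit{post}(v') + \mathit{level}(v')
  && \text{leaf identity, } \mathit{size}(v') = 0 \\
 &= \mathit{post}(v) - \bigl(\mathit{level}(v') - \mathit{level}(v)\bigr) + \mathit{level}(v')
  && \text{rightmost-leaf relation} \\
 &= \mathit{post}(v) + \mathit{level}(v) \;\le\; \mathit{post}(v) + \mathit{height}(t) \\[1ex]
\mathit{post}(v'') &= \mathit{pre}(v'') - \mathit{level}(v'')
  && \text{leaf identity, } \mathit{size}(v'') = 0 \\
 &= \mathit{pre}(v) + \bigl(\mathit{level}(v'') - \mathit{level}(v)\bigr) - \mathit{level}(v'')
  && \text{leftmost-leaf relation (assumed)} \\
 &= \mathit{pre}(v) - \mathit{level}(v) \;\ge\; \mathit{pre}(v) - \mathit{height}(t)
\end{align*}
```

Since every descendant of v has a preorder rank no larger than that of the rightmost leaf and a postorder rank no smaller than that of the leftmost leaf, both bounds carry over to the whole descendant window.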

Shrink-wrapping the //-axis • Original window: { (pre(v),∞), [0,post(v)), *, false, * } • New window: { (pre(v),post(v)+height(t)], [pre(v)−height(t),post(v)), *, false, * } • Similar techniques can be used to optimize the query windows of other axes.
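The identity and the shrunk window can be checked on a small example. The sketch below assumes 0-based ranks, a hypothetical six-node table, and helper names (`level`, `size`, `descendants_original`, `descendants_shrunk`) invented for the illustration. It verifies pre − post + size = level for every node and confirms that both windows select the same descendants.

```python
# Hypothetical nodes (tag, pre, post, par) for <a><b><c/><d/></b><e><f/></e></a>.
NODES = [
    ("a", 0, 5, None),
    ("b", 1, 2, 0),
    ("c", 2, 0, 1),
    ("d", 3, 1, 1),
    ("e", 4, 4, 0),
    ("f", 5, 3, 4),
]
BY_PRE = {pre: (tag, pre, post, par) for tag, pre, post, par in NODES}

def level(pre):
    """Number of edges from the root down to the node with this preorder rank."""
    depth = 0
    par = BY_PRE[pre][3]
    while par is not None:
        depth += 1
        par = BY_PRE[par][3]
    return depth

def size(pre, post):
    """Number of nodes strictly below the node (pre, post)."""
    return sum(1 for _, p, q, _ in NODES if p > pre and q < post)

height = max(level(pre) for pre in BY_PRE)

# pre(v) - post(v) + size(v) = level(v) for every node v
identity_holds = all(
    pre - post + size(pre, post) == level(pre) for _, pre, post, _ in NODES
)

def descendants_original(pre, post):
    # window: pre in (pre(v), inf), post in [0, post(v))
    return [t for t, p, q, _ in NODES if p > pre and 0 <= q < post]

def descendants_shrunk(pre, post):
    # window: pre in (pre(v), post(v) + height], post in [pre(v) - height, post(v))
    return [
        t for t, p, q, _ in NODES
        if pre < p <= post + height and pre - height <= q < post
    ]

same_results = all(
    descendants_original(pre, post) == descendants_shrunk(pre, post)
    for _, pre, post, _ in NODES
)
print(height, identity_holds, same_results)
```

On a tree this small the shrunk window saves nothing, but on a large and shallow document the bound post(v) + height(t) replaces an open-ended range scan with one whose width is tied to the subtree.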

Shrink-wrapping the //-axis

Finding Leaves in an XML Tree

XPath Traversals with and without shrunk windows

XPath Accelerator v. Edge Map

R-Tree v. B-Tree

Performance for the ancestor axis

Performance: XPath Accelerator v. EE/EA-Join

Capabilities of XPath Accelerator • Runs on top of a relational backend to leverage its stability, scalability, and performance. • Supports the whole family of XPath axes in an adequate manner. • Allows XPath traversals to originate at arbitrary context nodes. • Provides the groundwork for effective cost estimation for XPath queries.

XML Query Optimization • Macro-level algebra: manipulates sets of trees directly • heavyweight, but more directly expressive • Micro-level algebra: manipulates sets of elements • In both algebras, basic operators are “intuitive” unit operations such as selections, projections, joins and set operations.

XQuery Expression and Pattern Tree

Macro-algebra • A macro-algebra would implement this entire expression as a single pattern-tree based selection operator (to select matching books), followed by a projection operator (to return titles).

Micro-algebra • A micro-algebra would break up the selection pattern into one selection operator per node (e.g. (tag=“book”), (tag=“year” && content > 1995)) and one containment join operator per edge. • Result of sequence of joins would then be projected on the book element, after which its title can be obtained as before.
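The micro-level plan described above can be sketched as a small pipeline: one selection per pattern node, one containment join per edge, then a projection. Everything concrete in the sketch is an assumption made for illustration: the `Element` record, the pre/post containment test, and the two-book sample data are not taken from the paper.

```python
from typing import NamedTuple, Optional

class Element(NamedTuple):
    pre: int
    post: int
    tag: str
    content: Optional[str]

# Hypothetical document: <bib> with two <book> elements, each with a title and a year.
ELEMENTS = [
    Element(0, 6, "bib", None),
    Element(1, 2, "book", None),
    Element(2, 0, "title", "Old Book"),
    Element(3, 1, "year", "1994"),
    Element(4, 5, "book", None),
    Element(5, 3, "title", "New Book"),
    Element(6, 4, "year", "2000"),
]

def select(predicate):
    """Selection operator: one per node of the pattern tree."""
    return [e for e in ELEMENTS if predicate(e)]

def contains(u, v):
    """Containment test: u is an ancestor of v."""
    return u.pre < v.pre and u.post > v.post

def containment_join(outer, inner):
    """Containment join operator: one per edge of the pattern tree."""
    return [(u, v) for u in outer for v in inner if contains(u, v)]

books = select(lambda e: e.tag == "book")
recent_years = select(lambda e: e.tag == "year" and int(e.content) > 1995)

# Join along the book-year edge, then project the result on the book element.
matching_books = sorted({book for book, _ in containment_join(books, recent_years)})

# The titles are then obtained from the surviving books.
all_titles = select(lambda e: e.tag == "title")
titles = [t.content for _, t in containment_join(matching_books, all_titles)]
print(titles)
```

The macro-level alternative would hide all four steps inside one pattern-tree selection, which is easier to state but leaves the optimizer fewer pieces to reorder.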

Query Processing Implementation • Identify lists of candidate elements in the database to match each node in the specified structural pattern. • Find combinations of candidate elements, one from each list, that satisfy the required structural relationships. • Apply any conditions that involve multiple nodes in the structural pattern to eliminate some combinations.

Containment Join • Given two sets of elements U and V, a containment join returns pairs of elements (u,v) such that • u ∈ U and v ∈ V • u “contains” v • i.e. node u is an ancestor of node v in the tree representation

Containment Join Implementation • Three main options: • Scan the entire database • Use an index to find candidate nodes for one end of the join, and navigate from there • Use indices to find candidate nodes for both ends of the join, and compute a containment join between these candidate sets
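The third option can be illustrated with a tag index that supplies both candidate lists, followed by a join over them. This is a simplified sketch and not the join algorithm evaluated in the paper: it assumes elements are numbered with preorder/postorder ranks, and the names `tag_index` and `containment_join` and the sample data are made up for the example.

```python
from bisect import bisect_right
from collections import defaultdict

# Hypothetical elements (pre, post, tag) for a <bib> with two <book> children,
# each holding a <title> and a <year>.
NODES = [
    (0, 6, "bib"),
    (1, 2, "book"),
    (2, 0, "title"),
    (3, 1, "year"),
    (4, 5, "book"),
    (5, 3, "title"),
    (6, 4, "year"),
]

# Index: tag -> list of (pre, post), kept in preorder (document) order.
tag_index = defaultdict(list)
for pre, post, tag in sorted(NODES):
    tag_index[tag].append((pre, post))

def containment_join(ancestors, descendants):
    """Pair each u in ancestors with every v in descendants that u contains.

    Both inputs must be sorted by preorder rank. For a given u, candidates with
    a larger preorder rank are either descendants of u (smaller postorder rank)
    or nodes following u, and the descendants come first, so the scan can stop
    at the first candidate whose postorder rank is not below post(u).
    """
    pres = [pre for pre, _ in descendants]
    pairs = []
    for u_pre, u_post in ancestors:
        i = bisect_right(pres, u_pre)          # first candidate after u in document order
        while i < len(descendants) and descendants[i][1] < u_post:
            pairs.append(((u_pre, u_post), descendants[i]))
            i += 1
    return pairs

pairs = containment_join(tag_index["book"], tag_index["year"])
print(pairs)
```

Compared with scanning the whole database, only the `book` and `year` entries are touched, and the binary search skips candidates that precede each ancestor.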

Projection Merging

Set Operations • Union compatibility is not an issue. • In the relational world, union compatibility is an important consideration with respect to set operations. • In XML, since heterogeneous collections are allowed, this is not an issue.

Union in XML • Given two pattern trees PT1 and PT2, let PTC be a common component of the two pattern trees such that: • PT1 − PTC = PT’1 and PT2 − PTC = PT’2, where PT’1 and PT’2 are both trees • Node i in PTC has node j in PT’1 such that edge (i,j) is in PT1, if and only if node i also has some node k in PT’2 such that edge (i,k) is in PT2.

Different Pattern Trees and Plans

Micro-operator Merging: New Access Methods • At macro-level, we considered a pattern tree selection as a single heavyweight operator. • At micro-level, the approach is to break up a pattern tree selection into multiple containment join operators.

Performance: Union

Performance: Intersection

Performance by Query Structure

Parent-Child Join Performance
