CS 561 Presentation: Indexing and Querying XML Data for Regular Path Expressions

A paper by Quanzhong Li and Bongki Moon. Presented by Ming Li.

Our Objective • Developing a system that will enable us to perform XML data queries efficiently.

XML Query Languages • Used for retrieving data from XML files. • Use a regular path expression syntax. • e.g. XPath, XQuery.

Queries Today - Inefficient • Usually evaluated by XML tree traversals, which are inefficient: • Top-down approach • Bottom-up approach • Example: the query /chapter/_*/figure (find all figures in all chapters).

Our Objective - Refined • Developing a system that will enable us to perform XML data queries efficiently • Developing such a system consists of: • Developing a way to efficiently store XML data. • Developing efficient algorithms for processing regular path expressions (e.g. XQuery expressions).

Storing XML Documents - XISS • XISS - XML Indexing and Storage System. • Provides us with ways to: • efficiently find all elements or attributes with the same name string, grouped by the document they belong to. • quickly determine the ancestor-descendant relationship between elements and/or attributes in the hierarchy of XML data.

Determining Ancestor-Descendant Relationship • According to Dietz's numbering scheme: for two given nodes x and y of a tree T, x is an ancestor of y iff x occurs before y in the preorder traversal and after y in the postorder traversal. • Example:
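The preorder/postorder test can be sketched in a few lines of Python. This is only an illustration: the tree, the node names and the helper functions `number_tree` and `is_ancestor` are assumptions made for the example, not part of the paper.

```python
def number_tree(tree, root):
    """Return (pre, post): each node's rank in preorder and in postorder.

    `tree` maps a node name to the list of its children (hypothetical layout).
    """
    pre, post = {}, {}

    def visit(node):
        pre[node] = len(pre)            # rank assigned on the way down
        for child in tree.get(node, []):
            visit(child)
        post[node] = len(post)          # rank assigned on the way back up

    visit(root)
    return pre, post


def is_ancestor(x, y, pre, post):
    """Dietz's test: x comes before y in preorder and after y in postorder."""
    return pre[x] < pre[y] and post[x] > post[y]


# Hypothetical document tree used only for this example.
tree = {"book": ["chapter", "appendix"], "chapter": ["section", "figure"]}
pre, post = number_tree(tree, "book")

print(is_ancestor("book", "figure", pre, post))       # True
print(is_ancestor("chapter", "appendix", pre, post))  # False
```

Once the two ranks are stored, each test is two integer comparisons, which is where the constant-time claim comes from.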

Determining Ancestor-Descendant Relationship – cont. • Advantage: the ancestor-descendant relationship can be determined in constant time. • Disadvantage: a lack of flexibility. • e.g. inserting a new node requires recomputing the preorder and postorder numbers of many tree nodes.

Determining Ancestor-Descendant Relationship – cont. • A new numbering scheme: • Each node is associated with an <order, size> pair: • For a tree node y and its parent x: [order(y), order(y) + size(y)] ⊂ (order(x), order(x) + size(x)] • For two sibling nodes x and y, if x is the predecessor of y in preorder traversal, then order(x) + size(x) < order(y).

Determining Ancestor-Descendant Relationship – cont. • Fact: for two given nodes x and y of a tree T, x is an ancestor of y iff: order(x) < order(y) ≤ order(x) + size(x)
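The containment test on <order, size> pairs translates directly into a chained comparison. The sketch below is illustrative only; the `Node` type and the sample numbers are assumptions chosen to satisfy the numbering rules, not values from the paper.

```python
from typing import NamedTuple


class Node(NamedTuple):
    name: str
    order: int
    size: int


def is_ancestor(x: Node, y: Node) -> bool:
    """x is an ancestor of y iff order(x) < order(y) <= order(x) + size(x)."""
    return x.order < y.order <= x.order + x.size


# Hypothetical numbering with unused gaps left between nodes.
book = Node("book", 1, 100)
chapter = Node("chapter", 10, 30)
figure = Node("figure", 25, 5)
appendix = Node("appendix", 50, 20)

print(is_ancestor(book, figure))       # True:  1 < 25 <= 101
print(is_ancestor(chapter, figure))    # True:  10 < 25 <= 40
print(is_ancestor(chapter, appendix))  # False: 50 > 40
```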

Determining Ancestor-Descendant Relationship – cont. • Properties: • the ancestor-descendant relationship can be determined in constant time. • flexibility – node insertion usually doesn't require renumbering existing tree nodes. • an element can be uniquely identified in a document by its order value.
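The flexibility property follows from the slack that the size values can reserve. As a rough sketch, assuming a new node is appended as the last child of its parent, the hypothetical helper `find_slot` below looks for an unused order value and reports when none is left; it is not the paper's insertion procedure.

```python
def find_slot(parent, children, size):
    """Return an order value for a new last child of `parent`, or None.

    `parent` and each entry of `children` are (order, size) pairs, and
    `children` is sorted by order. `size` is the size reserved for the
    new node. None means the parent's interval is full, so some
    renumbering would be needed.
    """
    p_order, p_size = parent
    # The new node must come after the parent itself and after the
    # interval of the last existing child (sibling rule).
    last_end = p_order
    if children:
        c_order, c_size = children[-1]
        last_end = c_order + c_size
    new_order = last_end + 1
    # The new interval must stay inside the parent's interval (child rule).
    if new_order + size <= p_order + p_size:
        return new_order
    return None


# Parent <1,100> with children <10,30> and <50,20>: plenty of room.
print(find_slot((1, 100), [(10, 30), (50, 20)], 5))  # 71
# Parent <1,3> with children <2,0> and <3,1>: no room left.
print(find_slot((1, 3), [(2, 0), (3, 1)], 0))        # None
```

In the first call no existing node changes its <order, size> pair; the second call shows the case where the reserved space is exhausted.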

XISS System Overview

Name Index and Value Table • Objective: minimizing the storage and computation overhead by eliminating replicated strings and string comparisons. • Name Index - mapping distinct name strings into unique name identifiers (nid). • Value Table - mapping distinct value strings (i.e. attribute values and text values) into unique value identifiers (vid). • Both are implemented as B+-trees.
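The idea of replacing strings by identifiers can be sketched as a small interning table. Note the substitution: XISS uses B+-trees, while this example uses an in-memory Python dict purely for illustration, and the class name `StringTable` is an assumption.

```python
class StringTable:
    """Maps each distinct string to a small integer id, and back."""

    def __init__(self):
        self._ids = {}       # string -> id
        self._strings = []   # id -> string

    def intern(self, s):
        """Return the id of s, assigning a new one on first sight."""
        if s not in self._ids:
            self._ids[s] = len(self._strings)
            self._strings.append(s)
        return self._ids[s]

    def lookup(self, ident):
        """Return the string stored under the given id."""
        return self._strings[ident]


names = StringTable()    # plays the role of the Name Index (nid)
values = StringTable()   # plays the role of the Value Table (vid)

nid_chapter = names.intern("chapter")
nid_figure = names.intern("figure")

# A repeated name yields the same nid, so later comparisons are on integers.
print(names.intern("chapter") == nid_chapter)  # True
print(names.lookup(nid_figure))                # figure
```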

The Element Index • Objective: quickly finding all elements with the same name string. • Structure:

The Attribute Index • Objective: quickly finding all attributes with the same name string. • Structure: • Same structure as the Element Index, except that each record in the Attribute Index has a value identifier (vid), which is the key used to obtain the attribute value from the Value Table.

The Structure Index • Objectives: • Finding the parent element and child elements (or attributes) for a given element. • Finding the parent element for a given attribute. • Structure:

The Structure Index – cont. • Structure: • B+-tree using document identifier (did) as a key. • Leaf nodes: linear arrays with records for all elements and attributes from an XML document. • Each record: {nid, <order,size>, Parent order, Child order, Sibling order, Attribute order}. • Records are ordered by order value.
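A leaf array of the Structure Index might be modeled as below. This is a sketch under assumptions: the field names, the reading of "Child order", "Sibling order" and "Attribute order" as first child, next sibling and first attribute, and the helpers `find_record` and `parent_of` are all illustrative rather than taken from the paper.

```python
from bisect import bisect_left
from dataclasses import dataclass
from typing import Optional


@dataclass
class Record:
    nid: int
    order: int
    size: int
    parent_order: Optional[int] = None
    child_order: Optional[int] = None      # assumed: first child element
    sibling_order: Optional[int] = None    # assumed: next sibling
    attribute_order: Optional[int] = None  # assumed: first attribute


def find_record(records, order):
    """Binary-search a leaf array that is sorted by order value."""
    # Rebuilding the key list on every call keeps the example short;
    # a real index would search the stored array directly.
    orders = [r.order for r in records]
    i = bisect_left(orders, order)
    if i < len(records) and records[i].order == order:
        return records[i]
    return None


def parent_of(records, rec):
    """Follow the parent order stored in a record."""
    if rec.parent_order is None:
        return None
    return find_record(records, rec.parent_order)


# Hypothetical two-element document; nid 0 is an element name, nid 1 an attribute name.
records = [
    Record(nid=0, order=1, size=3, child_order=3, attribute_order=2),
    Record(nid=1, order=2, size=0, parent_order=1),
    Record(nid=0, order=3, size=1, parent_order=1, attribute_order=4),
    Record(nid=1, order=4, size=0, parent_order=3),
]

print(parent_of(records, records[3]).order)  # 3
```

Because the records are kept in order-value sequence, following a parent, child, sibling or attribute link is a lookup by order within a single array.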

Querying Method • Decomposing path expressions into simple path expressions. • Applying algorithms on simple path expressions and their intermediate results.

Decomposition of Path Expressions • The main idea: • A complex path expression is decomposed into several simple path expressions. • Each simple path expression produces an intermediate result that can be used in the subsequent stage of processing. • The results of the simple path expressions are then combined or joined together to obtain the final result of the given query.

Basic Subexpressions - Example • Subexpression types: (1) Single Element/Attribute, (2) Element-Attribute, (3) Element-Element, (4) Kleene Closure, (5) Union. • Decomposition of (E1/E2)*/E3/((E4[@a=V]) | (E5/_*/E6)): [Figure: decomposition tree. The seven leaves (E1 to E6 and @a) are type (1); each / and /_*/ is type (3); [ ] is type (2); * is type (4); | is type (5).]

Example: EA-Join: Element and Attribute Join

EA-Join: Element and Attribute Join • Input: {E1,…,Em}: each Ei is a set of elements having a common document identifier (did); {A1,…,An}: each Aj is a set of attributes having a common document identifier (did). • Output: a set of (e,a) pairs such that the element e is the parent of the attribute a.

EA-Join: Element and Attribute Join • The Algorithm:
    // Sort-merge {Ei} and {Aj} by did.
    foreach Ei and Aj with the same did do
        // Sort-merge Ei and Aj by PARENT-CHILD relationship.
        foreach e ∈ Ei and a ∈ Aj do
            if (e is a parent of a) then output (e,a)
        end
    end
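The two loops above can be sketched in Python as follows. Several things here are assumptions made for the example: the `Element` and `Attribute` tuples, the representation of the inputs as lists of (did, items) sorted by did, and the use of a stored parent order to decide "e is a parent of a". The inner step is written as a plain nested loop for clarity rather than as a merge.

```python
from collections import namedtuple

Element = namedtuple("Element", "order size")
Attribute = namedtuple("Attribute", "order parent_order")


def ea_join(element_sets, attribute_sets):
    """Join elements with their attributes, document by document.

    Both inputs are lists of (did, items) pairs sorted by did.
    Returns a list of (did, element, attribute) triples.
    """
    out = []
    i = j = 0
    # Outer stage: merge the two inputs on document identifier.
    while i < len(element_sets) and j < len(attribute_sets):
        e_did, elems = element_sets[i]
        a_did, attrs = attribute_sets[j]
        if e_did < a_did:
            i += 1
        elif e_did > a_did:
            j += 1
        else:
            # Inner stage: pair each element with its own attributes.
            for e in elems:
                for a in attrs:
                    if a.parent_order == e.order:
                        out.append((e_did, e, a))
            i += 1
            j += 1
    return out


# Hypothetical input: document 1 has elements and attributes,
# document 2 has elements only and therefore contributes nothing.
element_sets = [(1, [Element(1, 3), Element(3, 1)]), (2, [Element(1, 0)])]
attribute_sets = [(1, [Attribute(2, 1), Attribute(4, 3)])]

for did, e, a in ea_join(element_sets, attribute_sets):
    print(did, e.order, a.order)
# 1 1 2
# 1 3 4
```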

EA-Join – Example • Consider the XML document: <Ele Att="A1"> <Ele Att="A2"> </Ele> </Ele> • Its nodes are numbered: outer Ele <1,3>, its Att <2,0>, inner Ele <3,1>, its Att <4,0>. • And the query: /Ele[@Att="A1"]

EA-Join – Querying /Ele[@Att="A1"] • Document: <Ele Att="A1"> <Ele Att="A2"> </Ele> </Ele> • Sort-merging the "Ele"s and "Att"s by parent-child relationship gives the list: <1,3>, <2,0>, <3,1>, <4,0> • From this list, finding the "Ele" elements that have a child attribute "Att" with the value "A1" is easy using the information in the element record.

EA-Join – Comments • Only a two-stage sort-merge operation without additional cost of sorting: • First merge: by did. • Second merge: by examining the parent-child relationship. • This merge is based on the order values of the elements and attributes as defined by the numbering scheme. • Attributes should be placed before their sibling elements in the order of the numbering scheme. • This placement guarantees that elements and attributes with the same did can be merged in a single scan.
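The single-scan claim can be illustrated with a sketch of the second merge for one document. It relies on the placement rule above: if attributes are numbered right after their parent and before any child elements, then the closest preceding element in order is the only possible parent. The tuple types, the stored parent order and the function name `merge_by_order` are assumptions for the example.

```python
from collections import namedtuple

Element = namedtuple("Element", "order size")
Attribute = namedtuple("Attribute", "order parent_order")


def merge_by_order(elems, attrs):
    """Pair attributes with their parent elements in one forward pass.

    Both lists belong to the same document and are sorted by order.
    """
    out = []
    i = 0
    for a in attrs:
        # Advance to the last element whose order precedes the attribute.
        # The pointer never moves backwards, so each list is read once.
        while i + 1 < len(elems) and elems[i + 1].order < a.order:
            i += 1
        # The candidate may be absent from the input list (for example when
        # the list holds only elements of one name), so confirm it.
        if i < len(elems) and elems[i].order == a.parent_order:
            out.append((elems[i], a))
    return out


elems = [Element(1, 3), Element(3, 1)]
attrs = [Attribute(2, 1), Attribute(4, 3)]
print([(e.order, a.order) for e, a in merge_by_order(elems, attrs)])
# [(1, 2), (3, 4)]
```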

Conclusions • XISS can efficiently process regular path expression queries. • Performance improvement over the conventional methods by up to an order of magnitude. • Future work: optimal page size or the break-even point between the two criteria.

Thank you so much!
