Indexing and Querying XML Data for Regular Path Expressions

Indexing and Querying XML Data for Regular Path Expressions • A paper by Quanzhong Li and Bongki Moon • Presented by Amnon Shochot

Our Objective • Developing a system that will enable us to perform XML data queries efficiently.

XML Query Languages • Used for retrieving data from XML files. • Use a regular path expression syntax. • e.g. XPath, XQuery.

Queries Today - Inefficient • Usually evaluated by XML tree traversals, which are inefficient: • Top-Down Approach • Bottom-Up Approach • An example: the query /chapter/_*/figure (finding all figures in all chapters).
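To make the cost of a traversal concrete, here is a minimal sketch of a top-down evaluation of /chapter/_*/figure. The sample document and the names `figures_top_down`, `visited`, and `figure_ids` are assumptions made for illustration and do not come from the paper; the sketch only shows that a plain traversal touches every node, whether or not it can contribute to the result.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, not taken from the paper.
XML = "<chapter><section><figure id='f1'/></section><figure id='f2'/><para/></chapter>"


def figures_top_down(node, visited):
    """Depth-first walk that records every node it touches."""
    visited.append(node.tag)
    hits = [node] if node.tag == "figure" else []
    for child in node:
        hits.extend(figures_top_down(child, visited))
    return hits


root = ET.fromstring(XML)
visited = []  # tags of all nodes the traversal had to inspect
figures = figures_top_down(root, visited) if root.tag == "chapter" else []
figure_ids = [f.get("id") for f in figures]
```

Here two figures are found, but all five nodes of the document are visited, including the unrelated `para` element.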

Our Objective - Refined • Developing a system that will enable us to perform XML data queries efficiently • Developing such a system consists of: • Developing a way to efficiently store XML data. • Developing efficient algorithms for processing regular path expressions (e.g. XQuery expressions).

Storing XML Documents • Question: What would we need from a data structure to be able to perform an efficient query? • Answer: A mechanism for: • Efficiently finding all elements/attributes with a given name. • Efficiently finding all values with a given name. • Efficiently resolving ancestor-descendant relationship.

Storing XML Documents - XISS • XISS - XML Indexing and Storage System. • Provides us with ways to: • efficiently find all elements or attributes with the same name string, grouped by the document to which they belong. • quickly determine the ancestor-descendant relationship between elements and/or attributes in the hierarchy of XML data.

Determining Ancestor-Descendant Relationship • According to Dietz's numbering scheme: for two given nodes x and y of a tree T, x is an ancestor of y iff x occurs before y in the preorder traversal and after y in the postorder traversal.
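The preorder/postorder test can be sketched in a few lines. The `Node` class, the helper names, and the sample tree below are assumptions made for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    children: list = field(default_factory=list)
    pre: int = 0   # rank in the preorder traversal
    post: int = 0  # rank in the postorder traversal


def number_tree(root):
    """Assign preorder and postorder ranks, both starting at 1."""
    counters = {"pre": 0, "post": 0}

    def visit(node):
        counters["pre"] += 1
        node.pre = counters["pre"]
        for child in node.children:
            visit(child)
        counters["post"] += 1
        node.post = counters["post"]

    visit(root)


def is_ancestor(x, y):
    """x is an ancestor of y iff x comes first in preorder and last in postorder."""
    return x.pre < y.pre and x.post > y.post


# Hypothetical tree: book -> (chapter -> (title, figure), appendix)
title, figure = Node("title"), Node("figure")
chapter = Node("chapter", [title, figure])
appendix = Node("appendix")
book = Node("book", [chapter, appendix])
number_tree(book)
```

After numbering, `is_ancestor(book, figure)` and `is_ancestor(chapter, figure)` hold, while `is_ancestor(chapter, appendix)` does not, and each check is two integer comparisons.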

Determining Ancestor-Descendant Relationship – cont. • Advantage: the ancestor-descendant relationship can be determined in constant time. • Disadvantage: a lack of flexibility. • e.g. inserting a new node requires recomputing the numbers of many tree nodes.

Determining Ancestor-Descendant Relationship – cont. • A new numbering scheme: • Each node is associated with an <order, size> pair: • For a tree node y and its parent x: [order(y), order(y) + size(y)] ⊂ (order(x), order(x) + size(x)], where the left end of the parent's interval is exclusive. • For two sibling nodes x and y, if x is the predecessor of y in preorder traversal, then order(x) + size(x) < order(y).

Determining Ancestor-Descendant Relationship – cont. • Fact: for two given nodes x and y of a tree T, x is an ancestor of y iff: order(x) < order(y) ≤ order(x) + size(x)
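A minimal sketch of one way to assign <order, size> pairs that satisfy the two conditions, followed by the constant-time ancestor test. The nested-tuple tree encoding, the function names, and the `slack` parameter (extra unused numbers reserved for later insertions) are assumptions for illustration; the paper does not prescribe this particular assignment. Attributes are treated as child nodes that precede their sibling elements.

```python
def assign(node, order, out, slack=0):
    """Number `node` and its subtree; store <order, size> in `out`; return size."""
    name, children = node
    next_order = order + 1
    for child in children:
        # The next sibling starts strictly after the previous sibling's interval.
        next_order += assign(child, next_order, out, slack) + 1
    # Cover every descendant interval, plus optional room for future inserts.
    size = next_order - 1 - order + slack
    out[name] = (order, size)
    return size


def is_ancestor(x, y):
    """x, y are (order, size) pairs: order(x) < order(y) <= order(x) + size(x)."""
    return x[0] < y[0] <= x[0] + x[1]


# Hypothetical four-node tree: an element with one attribute and one child
# element, which in turn has one attribute.
TREE = ("Ele1", [("Att1", []), ("Ele2", [("Att2", [])])])

numbers = {}
assign(TREE, 1, numbers)
```

With `slack=0` this yields Ele1 <1,3>, Att1 <2,0>, Ele2 <3,1>, Att2 <4,0>. A positive `slack` leaves gaps inside each interval, so a new node can often be numbered without touching existing ones.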

Determining Ancestor-Descendant Relationship – cont. • Properties: • the ancestor-descendant relationship can be determined in constant time. • flexibility: node insertion usually doesn't require renumbering existing tree nodes. • an element can be uniquely identified in a document by its order value.

XISS System Overview

XISS System Overview • How the system works: • XML documents are loaded into the XISS system. • These documents are added to the XISS data structures. • Each document is assigned a document id (did). • Index structures are organized as paged files for efficient disk IO. • When a query is performed the query processor interacts with XISS in order to obtain the information required for the query.

XISS - cont. • XISS consists of 5 components: • Name Index • Value Table • Element Index • Attribute Index • Structure Index

Name Index and Value Table • Objective: minimizing the storage and computation overhead by eliminating replicated strings and string comparisons. • Name Index - mapping distinct name strings into unique name identifiers (nid). • Value Table - mapping distinct value strings (i.e. attribute values and text values) into unique value identifiers (vid). • Both are implemented as B+-trees.
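The idea of replacing strings with identifiers can be sketched as follows. An in-memory dictionary stands in for the B+-tree, and the class name `StringTable` and its methods are hypothetical, chosen only for this illustration.

```python
class StringTable:
    """Maps each distinct string to a small integer id and back.

    A dict stands in for the B+-tree used by XISS (assumption for brevity).
    """

    def __init__(self):
        self._ids = {}       # string -> id
        self._strings = []   # id -> string

    def intern(self, s):
        """Return the id of `s`, assigning a new one on first sight."""
        if s not in self._ids:
            self._ids[s] = len(self._strings)
            self._strings.append(s)
        return self._ids[s]

    def lookup(self, ident):
        """Return the string stored under `ident`."""
        return self._strings[ident]


name_index = StringTable()   # name string  -> nid
value_table = StringTable()  # value string -> vid

nid_chapter = name_index.intern("chapter")
nid_figure = name_index.intern("figure")
vid_caption = value_table.intern("Tree Frogs")
```

Once names are interned, comparing two element names is an integer comparison, and each distinct string is stored only once.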

The Element Index • Objective: quickly finding all elements with the same name string. • Structure:

The Element Index – cont. • Structure: • B+-tree using nid as a key. • Leaf nodes: pointers to a set of records for elements (or attributes) having an identical name string, grouped by the document they belong to. • Element Record = {<order,size>, Depth, Parent ID} • where Depth is the depth of the element in the XML tree. • Element Records are ordered by <order,size>.
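The layout described above can be sketched with nested dictionaries in place of the B+-tree. The class and field names are assumptions for illustration; in particular, storing the parent's order value as the "Parent ID" is a guess about the record encoding, not something the slides state.

```python
import bisect
from collections import defaultdict
from typing import NamedTuple


class ElementRecord(NamedTuple):
    order: int
    size: int
    depth: int
    parent: int  # assumed: order value of the parent, 0 for the root


class ElementIndex:
    """nid -> did -> element records kept sorted by <order, size>."""

    def __init__(self):
        self._by_nid = defaultdict(lambda: defaultdict(list))

    def add(self, nid, did, record):
        # insort keeps each per-document list ordered by (order, size, ...).
        bisect.insort(self._by_nid[nid][did], record)

    def find(self, nid):
        """All records with this name, grouped by document id."""
        groups = self._by_nid.get(nid, {})
        return {did: list(recs) for did, recs in sorted(groups.items())}


index = ElementIndex()
NID_ELE = 7  # hypothetical name identifier
index.add(NID_ELE, did=1, record=ElementRecord(3, 1, depth=2, parent=1))
index.add(NID_ELE, did=1, record=ElementRecord(1, 3, depth=1, parent=0))
index.add(NID_ELE, did=2, record=ElementRecord(1, 0, depth=1, parent=0))
found = index.find(NID_ELE)
```

Keeping each group sorted by order is what later allows the join algorithms to merge groups without a separate sorting step.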

The Attribute Index • Objective: quickly finding all attributes with the same name string. • Structure: • Same structure as the Element Index, except that each record in the Attribute Index also holds a value identifier (vid), the key used to obtain the attribute value from the Value Table.

The Structure Index • Objectives: • Finding the parent element and child elements (or attributes) for a given element. • Finding the parent element for a given attribute. • Structure:

The Structure Index – cont. • Structure: • B+-tree using document identifier (did) as a key. • Leaf nodes: linear arrays with records for all elements and attributes from an XML document. • Each record: {nid, <order,size>, Parent order, Child order, Sibling order, Attribute order}. • Records are ordered by order value.
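A rough sketch of how such a per-document array could answer parent and child lookups. The field encoding (an order value of 0 meaning "none"), the class names, and the sample records are assumptions made for illustration; a plain dict stands in for the B+-tree keyed by did.

```python
import bisect
from typing import NamedTuple


class StructRecord(NamedTuple):
    nid: int
    order: int
    size: int
    parent: int     # order of the parent element, 0 if none (assumed encoding)
    child: int      # order of the first child element, 0 if none
    sibling: int    # order of the next sibling element, 0 if none
    attribute: int  # order of the first attribute, 0 if none


class DocumentStructure:
    """Linear array of records for one document, ordered by order value."""

    def __init__(self, records):
        self.records = sorted(records, key=lambda r: r.order)
        self.orders = [r.order for r in self.records]

    def get(self, order):
        # Binary search works because the array is sorted by order.
        i = bisect.bisect_left(self.orders, order)
        if i == len(self.orders) or self.orders[i] != order:
            raise KeyError(order)
        return self.records[i]

    def parent(self, order):
        p = self.get(order).parent
        return self.get(p) if p else None

    def children(self, order):
        # Follow the first-child link, then the sibling chain.
        result = []
        nxt = self.get(order).child
        while nxt:
            rec = self.get(nxt)
            result.append(rec)
            nxt = rec.sibling
        return result


ELE, ATT = 1, 2  # hypothetical nids
doc = DocumentStructure([
    StructRecord(ELE, 1, 3, parent=0, child=3, sibling=0, attribute=2),
    StructRecord(ATT, 2, 0, parent=1, child=0, sibling=0, attribute=0),
    StructRecord(ELE, 3, 1, parent=1, child=0, sibling=0, attribute=4),
    StructRecord(ATT, 4, 0, parent=3, child=0, sibling=0, attribute=0),
])
structure_index = {1: doc}  # did -> per-document array
```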

Querying Method • Decomposing path expressions into simple path expressions. • Applying algorithms on simple path expressions and their intermediate results.

Decomposition of Path Expressions • The main idea: • A complex path expression is decomposed into several simple path expressions. • Each simple path expression produces an intermediate result that can be used in the subsequent stage of processing. • The results of the simple path expressions are then combined or joined together to obtain the final result of the given query.

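Basic Subexpressions - Example • Decomposition of (E1/E2)*/E3/((E4[@a=V]) | (E5/_*/E6)): • E1, E2, E3, E4, @a, E5, E6: (1) Single Element/Attribute • E4[@a=V]: (2) Element-Attribute • E1/E2 and E5/_*/E6: (3) Element-Element • (E1/E2)*: (4) Kleene Closure • (E4[@a=V]) | (E5/_*/E6): (5) Union • The two remaining "/" steps, joining (E1/E2)* with E3 and the result with the union: (3) Element-Element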

Basic Subexpressions 5 basic subexpressions: (1) A subexpression with a single element or a single attribute. (2) A subexpression with an element and an attribute. • e.g. figure[@caption = “Tree Frogs”] (3) A subexpression with two elements • e.g. chapter/_*/figure where ‘_’ denotes any kind of node.

Basic Subexpressions - cont. 5 basic subexpressions - cont.: (4) A subexpression that is a Kleene closure (+,*) of another subexpression. (5) A subexpression that is a union of two other subexpressions.

3 Algorithms • EA-Join: Element and Attribute Join. • EE-Join: Element and Element Join. • KC-Join: Kleene Closure.

EA-Join: Element and Attribute Join Input: {E1,…,Em}: Ei is a set of elements having a common document identifier (did); {A1,…,An}: Aj is a set of attributes having a common document identifier (did); Output: A set of (e,a) pairs such that the element e is the parent of the attribute a.

EA-Join: Element and Attribute Join The Algorithm:
    // Sort-merge {Ei} and {Aj} by did.
(1) foreach Ei and Aj with the same did do
        // Sort-merge Ei and Aj by PARENT-CHILD relationship.
(2)     foreach e ∈ Ei and a ∈ Aj do
(3)         if (e is a parent of a) then output (e, a)
        end
    end
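The two-stage merge can be sketched in Python as below. The `Rec` type, its `parent` field (the order value of the parent node), and the dict-of-lists input format are assumptions for illustration. The first stage is simplified to an intersection of document ids rather than a true merge of two did-sorted streams. The second stage is a single forward scan, which relies on both lists being sorted by order and on attributes being numbered before their sibling elements.

```python
from typing import NamedTuple


class Rec(NamedTuple):
    order: int
    size: int
    parent: int  # assumed field: order value of the parent node, 0 if none


def ea_join(elements, attributes):
    """elements, attributes: did -> list of Rec sorted by order.

    Returns (did, e, a) triples where element e is the parent of attribute a.
    """
    out = []
    # Stage 1: pair up groups with the same did (simplified to a key intersection).
    for did in sorted(elements.keys() & attributes.keys()):
        es, attrs = elements[did], attributes[did]
        i = j = 0
        # Stage 2: one forward scan over both sorted lists.
        while i < len(es) and j < len(attrs):
            e, a = es[i], attrs[j]
            if e.order == a.parent:
                out.append((did, e, a))
                j += 1          # e may own further attributes
            elif e.order < a.parent:
                i += 1          # e has no (more) attributes in this list
            else:
                j += 1          # a's parent is not among the candidate elements
    return out


# Hypothetical data: Ele <1,3> with Att <2,0>, nested Ele <3,1> with Att <4,0>.
eles = {1: [Rec(1, 3, 0), Rec(3, 1, 1)]}
atts = {1: [Rec(2, 0, 1), Rec(4, 0, 3)]}
pairs = [(e.order, a.order) for _, e, a in ea_join(eles, atts)]
```

For the sample data the scan pairs element 1 with attribute 2 and element 3 with attribute 4, advancing each pointer at most once per record.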

EA-Join – Example • Consider the XML document: <Ele Att=“A1”> <Ele Att=“A2”> </Ele> </Ele> • Its nodes are numbered <order,size> as: outer Ele <1,3>, its Att <2,0>, inner Ele <3,1>, its Att <4,0>. • And the query: /Ele[@Att=“A1”]

EA-Join – Querying /Ele[@Att=“A1”] • Document: <Ele Att=“A1”> <Ele Att=“A2”> </Ele> </Ele>, numbered Ele <1,3>, Att <2,0>, Ele <3,1>, Att <4,0>. • Sort-merging the “Ele”s and “Att”s by parent-child relationship gives the list: <1,3>, <2,0>, <3,1>, <4,0> • From the resulting list, finding the “Ele” elements that have a child attribute “Att” with the value “A1” is easy using the information in the Element Record.

EA-Join – Comments • Only a two-stage sort-merge operation, without the additional cost of sorting: • First merge: by did. • Second merge: by examining the parent-child relationship. • This merge is based on the order values of the elements and attributes as defined by the numbering scheme. • Attributes should be placed before their sibling elements in the order of the numbering scheme. • This guarantees that elements and attributes with the same did can be merged in a single scan.

EE-Join: Element and Element Join Input: {E1,…,Em} and {F1,…,Fn}: Ei or Fj is a set of elements having a common document identifier (did). Output: A set of (e,f) pairs such that element e is an ancestor of element f.

EE-Join: Element and Element Join The Algorithm:
    // Sort-merge {Ei} and {Fj} by did.
(1) foreach Ei and Fj with the same did do
        // Sort-merge Ei and Fj by the ANCESTOR-DESCENDANT relationship.
(2)     foreach e ∈ Ei and f ∈ Fj do
(3)         if (e is an ancestor of f) then output (e, f);
        end
    end
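One possible way to realize the second-stage merge is sketched below; it is not claimed to be the paper's exact procedure. The `Rec` type and the dict-of-lists input are assumptions for illustration. Because nested ancestors share descendants, part of the F list is rescanned for each candidate ancestor, while a `start` pointer skips records that can no longer match.

```python
from typing import NamedTuple


class Rec(NamedTuple):
    order: int
    size: int


def ee_join(es_by_did, fs_by_did):
    """Inputs: did -> list of Rec sorted by order.

    Returns (did, e, f) triples where e is an ancestor of f, i.e.
    order(e) < order(f) <= order(e) + size(e).
    """
    out = []
    # Stage 1: match groups by did (simplified to a key intersection).
    for did in sorted(es_by_did.keys() & fs_by_did.keys()):
        es, fs = es_by_did[did], fs_by_did[did]
        start = 0
        for e in es:
            # Records at or before e can never descend from e or any later e.
            while start < len(fs) and fs[start].order <= e.order:
                start += 1
            # Descendants of e form a contiguous run beginning at `start`.
            k = start
            while k < len(fs) and fs[k].order <= e.order + e.size:
                out.append((did, e, fs[k]))
                k += 1
    return out


# Hypothetical document: chapter <1,4> contains section <2,2> (with figures
# <3,0> and <4,0>) and a further figure <5,0>.
ancestors = {1: [Rec(1, 4), Rec(2, 2)]}
figures = {1: [Rec(3, 0), Rec(4, 0), Rec(5, 0)]}
pairs = [(e.order, f.order) for _, e, f in ee_join(ancestors, figures)]
```

In the sample, figures 3 and 4 are reported twice, once under the chapter and once under the section, which illustrates why a single pass over F does not suffice.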

EE-Join – Comments • Only a two-stage sort-merge operation, without the additional cost of sorting: • First merge: by did. • Second merge: by examining the ancestor-descendant relationship. • Unlike EA-Join, the sets of elements with a matching did cannot be merged in a single scan.

Kleene Closure Input: {E1,…,Em}, where Ei is a group of elements from an XML document. Output: A Kleene closure of {E1,…,Em}.

Kleene Closure The Algorithm:
(1) set i ← 1;
(2) set K_i^C ← {E1,…,Em};
(3) repeat
(4)     set i ← i + 1;
(5)     set K_i^C ← EE-Join(K_{i-1}^C, K_1^C);
    until (K_i^C is empty);
(6) output the union of K_1^C, K_2^C, …, K_i^C;
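The iterate-until-empty loop can be sketched as follows. This is a simplification under stated assumptions: each intermediate result is represented as a chain (tuple) of records, and the hypothetical `join` helper is a hash join that extends a chain by one parent-child step, standing in for the EE-Join call in step (5). The `Rec` type and its `parent` field are likewise illustrative.

```python
from typing import NamedTuple


class Rec(NamedTuple):
    order: int
    size: int
    parent: int  # assumed field: order value of the parent, 0 if none


def join(chains, base):
    """Extend every chain by each record in `base` whose parent ends the chain.

    Stand-in for EE-Join(K_{i-1}, K_1); a hash join on the parent order.
    """
    by_parent = {}
    for rec in base:
        by_parent.setdefault(rec.parent, []).append(rec)
    return [chain + (child,)
            for chain in chains
            for child in by_parent.get(chain[-1].order, [])]


def kleene_closure(elements):
    """Union of K_1, K_2, ... until a level comes back empty."""
    levels = [[(e,) for e in elements]]   # K_1: one-element chains
    while True:
        nxt = join(levels[-1], elements)  # K_i from K_{i-1} and K_1
        if not nxt:                       # until K_i is empty
            break
        levels.append(nxt)
    return [chain for level in levels for chain in level]


# Hypothetical data: three nested elements with the same name.
nested = [Rec(1, 2, 0), Rec(2, 1, 1), Rec(3, 0, 2)]
closure = [tuple(r.order for r in chain) for chain in kleene_closure(nested)]
```

The loop terminates because every iteration lengthens the chains by one level and the tree has finite depth.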

Performance Experiments • EE-Join: • Results: • Real World: an order of magnitude faster. • Synthetic Data: 6 to 10 times faster.

Performance Experiments • EA-Join: • Results: • Compared to Top-Down: a better performance. • Compared to Bottom-Up: no winner - close results.

Performance Results - Conclusions • The proposed algorithms can achieve performance improvement over the conventional methods (top-down and bottom-up tree traversals) by up to an order of magnitude.
