Efficient Structural Joins on Indexed XML Documents

Shu-Yao Chien, Zografoula Vagena, Donghui Zhang, Vassilis J. Tsotras, Carlo Zaniolo
VLDB 2002

Overview
• Motivation
• Problem Description
• Structural Joins
• Structural Joins using B+-trees
• Structural Joins using R-trees
• Problem Variations
• Experimental Results

Motivation (1)
Query languages for XML qualify documents for retrieval both by their structure and by the values of their elements.
Example (path-expression query): section[title="Overview"]//figure[caption="R-tree"]

Motivation (2)
• Numbering schemes
  • Each node is assigned a unique interval.
  • The interval of a parent node contains the intervals of all its children.
When the XML document is combined with a numbering scheme, path expression queries require the computation of structural joins.

Motivation (2), continued
• From path expressions to structural joins: two nodes qualify for a path expression query if one is an ancestor of the other. With intervals, this is equivalent to containment: a is an ancestor of d exactly when a.start < d.start and d.end < a.end.

Problem Description
• Structural join: let A and D be two lists containing the instances of two particular tags in an XML document. Join A and D using their containment associations as the join condition.
• [Al-Khalifa et al. 2002] proposed non-indexed structural join algorithms.
• We extend their algorithms to take advantage of existing indices on the two lists.

Structural Joins, no indices
• Let a, d be the first elements of A and D
• while (A and D are not empty or the stack is not empty) do
•   if (a.start > stack.top.end and d.start > stack.top.end) then
•     stack.pop()
•   else if (a.start < d.start) then
•     stack.push(a)
•     let a be the next element in A
•   else
•     output d as a descendant of all elements in the stack
•     let d be the next element in D
•   endif
• endwhile
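The stack-based merge above can be sketched in Python as follows. This is a minimal illustration, not the authors' code: elements are assumed to be (start, end) tuples sorted by start, the name stack_tree_join is hypothetical, the list A_EXAMPLE reuses the intervals from the containment forest example slide, and D_EXAMPLE is made up for the demonstration.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), as assigned by the numbering scheme


def stack_tree_join(A: List[Interval], D: List[Interval]) -> List[Tuple[Interval, Interval]]:
    """Return (ancestor, descendant) pairs; A and D must be sorted by start."""
    out: List[Tuple[Interval, Interval]] = []
    stack: List[Interval] = []  # currently "open" ancestors, outermost at the bottom
    i = j = 0
    while j < len(D):  # once D is exhausted no further output is possible
        d = D[j]
        a = A[i] if i < len(A) else None
        next_start = d[0] if a is None else min(a[0], d[0])
        # Pop ancestors that end before the next element begins.
        while stack and stack[-1][1] < next_start:
            stack.pop()
        if a is not None and a[0] < d[0]:
            stack.append(a)  # a may contain upcoming descendants
            i += 1
        else:
            # Every element left on the stack contains d.
            out.extend((anc, d) for anc in stack)
            j += 1
    return out


# Ancestor intervals taken from the containment forest example slide.
A_EXAMPLE = [(10, 500), (150, 250), (300, 400), (800, 900),
             (830, 860), (1400, 2000), (1530, 1560), (1700, 1800)]
# Hypothetical descendant intervals, chosen only for illustration.
D_EXAMPLE = [(160, 170), (600, 610), (840, 850), (1750, 1760)]

pairs = stack_tree_join(A_EXAMPLE, D_EXAMPLE)
```

Both lists are read front to back exactly once, which is the sequential-scan behaviour that the indexed variants try to improve on.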

Example
(Figure: elements a1, a2, a3, a4 of list A and d1, d2, d3 of list D.)

Structural Joins using B+-trees
• Existing structural join algorithms sequentially scan the input lists.
• Durable numbering schemes have enabled indexing of XML files with mainstream indices.
• Such indices can result in sub-linear access time, as they provide the facility to skip elements that don’t participate in the join.

Motivation for using the B+-tree index (1)
(Figure: ancestor elements a1 to a12 and descendant elements d1, d2.)

Motivation for using the B+-tree index (2)
(Figure: ancestor elements a1, a2 and descendant elements d1 to d14.)

Structural Joins using B+-trees
• Put pointers a and d at the beginning of lists A and D
• while (not at the end of A or D) do
•   if (a is an ancestor of d) then
•     push into the stack all elements in A that are ancestors of d
•     join d with all elements in the stack and let d = d->next
•   else if (a.start < d.start) then // jump ancestor A
•     pop all elements in the stack which are before d
•     move a forward by skipping the sub-trees of the last element popped
•   else // a is after d; jump descendant D
•     join d with all elements in the stack
•     move d forward by skipping all D elements with start < a.start
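A rough Python sketch of the skipping idea follows. It is an assumption-laden illustration rather than the algorithm from the paper: sorted Python lists searched with bisect stand in for B+-tree range searches on the start key, the name indexed_join is hypothetical, and descendants are skipped only when the stack is empty, which is the case where no join result can be lost.

```python
from bisect import bisect_right
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end)


def indexed_join(A: List[Interval], D: List[Interval]) -> List[Tuple[Interval, Interval]]:
    """Structural join that skips non-joining elements; A and D sorted by start."""
    a_starts = [s for s, _ in A]  # stand-in for a B+-tree on A.start
    d_starts = [s for s, _ in D]  # stand-in for a B+-tree on D.start
    out: List[Tuple[Interval, Interval]] = []
    stack: List[Interval] = []
    i = j = 0
    while j < len(D):
        d = D[j]
        # Pop stack elements that lie entirely before d.
        while stack and stack[-1][1] < d[0]:
            stack.pop()
        if i < len(A) and A[i][0] < d[0]:
            a = A[i]
            if a[1] > d[1]:
                stack.append(a)  # a is an ancestor of d
                i += 1
            else:
                # a ends before d starts, so nothing inside a can contain d:
                # jump to the first A element that starts after a.end.
                i = bisect_right(a_starts, a[1], i + 1)
        elif stack:
            out.extend((anc, d) for anc in stack)  # join d with the stack
            j += 1
        elif i < len(A):
            # No open ancestor: D elements starting before a cannot join.
            j = bisect_right(d_starts, A[i][0], j + 1)
        else:
            break  # A exhausted and stack empty: nothing left to join
    return out


# Ancestor intervals from the containment forest example slide.
A_EXAMPLE = [(10, 500), (150, 250), (300, 400), (800, 900),
             (830, 860), (1400, 2000), (1530, 1560), (1700, 1800)]
# Hypothetical descendant intervals, chosen only for illustration.
D_EXAMPLE = [(160, 170), (600, 610), (840, 850), (1750, 1760)]

indexed_pairs = indexed_join(A_EXAMPLE, D_EXAMPLE)
```

Each bisect call plays the role of an index probe: it moves a pointer forward without touching the elements in between, which is where the sub-linear access time would come from on a real index.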

Containment forest
• A structure linking elements that belong to the same tag.
• Each element corresponds to a node in the structure and is linked to other elements via parent, first-child and right-sibling pointers.
• Can be embedded within the associated B+-tree.
• Improves CPU time.

Containment forest example
A (10,500)
  A (150,250)
  A (300,400)
A (800,900)
  A (830,860)
A (1400,2000)
  A (1530,1560)
  A (1700,1800)

Containment forest properties
• The (start, end) interval of each node contains all intervals in its subtree.
• The start numbers in the forest follow a preorder traversal.
• The start (end) numbers of sibling nodes are in increasing order.
The containment forest can be dynamically maintained: there are efficient algorithms for element insertion and deletion.
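To make the pointer structure concrete, here is a small Python sketch that builds a containment forest from a list of intervals and checks the preorder property. The class ForestNode and the functions build_forest and preorder are hypothetical names; the slides do not show how the forest is constructed, so the single-pass stack construction below is only one plausible way to do it.

```python
from typing import List, Optional, Tuple


class ForestNode:
    """One element, linked via parent, first-child and right-sibling pointers."""

    def __init__(self, start: int, end: int) -> None:
        self.start = start
        self.end = end
        self.parent: Optional["ForestNode"] = None
        self.first_child: Optional["ForestNode"] = None
        self.right_sibling: Optional["ForestNode"] = None


def build_forest(intervals: List[Tuple[int, int]]) -> List[ForestNode]:
    """Build the forest in one pass over the intervals in start order."""
    roots: List[ForestNode] = []
    stack: List[ForestNode] = []  # chain of still-open ancestors
    for start, end in sorted(intervals):
        node = ForestNode(start, end)
        left = None
        # Close every open element that ends before this one starts;
        # the last one closed is the new node's left sibling.
        while stack and stack[-1].end < start:
            left = stack.pop()
        if stack:
            node.parent = stack[-1]
        if left is not None:
            left.right_sibling = node
        elif stack:
            stack[-1].first_child = node
        if not stack:
            roots.append(node)
        stack.append(node)
    return roots


def preorder(node: Optional[ForestNode]) -> List[Tuple[int, int]]:
    """Visit a node, its subtree, then its right siblings."""
    out: List[Tuple[int, int]] = []
    while node is not None:
        out.append((node.start, node.end))
        out.extend(preorder(node.first_child))
        node = node.right_sibling
    return out


# Intervals from the containment forest example slide.
EXAMPLE = [(10, 500), (800, 900), (1400, 2000), (150, 250),
           (300, 400), (830, 860), (1530, 1560), (1700, 1800)]
roots = build_forest(EXAMPLE)
order = preorder(roots[0])
```

In this sketch the roots are chained through their right-sibling pointers as well, so a traversal starting at the first root reaches every element in increasing start order.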

Structural Join using R-trees (1)
• The interval (start, end) of an element e can be mapped to a point (e.start, e.end) in 2-D space, which is then indexed by an R-tree.
• An R-tree can also be used to index the element (start, end) ranges as 1-D intervals.
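Under the point mapping, containment becomes a 2-D window condition: the ancestors of d are the points with start < d.start and end > d.end, and the descendants of a are the points with start > a.start and end < a.end. The Python sketch below only illustrates these query windows. A linear scan stands in for the R-tree search, and the function names are hypothetical; a real R-tree would answer the same windows while pruning pages whose bounding rectangles fall outside them.

```python
from typing import List, Tuple

Point = Tuple[int, int]  # an element mapped to the 2-D point (start, end)


def ancestors_of(points: List[Point], d: Point) -> List[Point]:
    """Window query: points to the left of d.start and above d.end."""
    d_start, d_end = d
    return [(s, e) for (s, e) in points if s < d_start and e > d_end]


def descendants_of(points: List[Point], a: Point) -> List[Point]:
    """Window query: points to the right of a.start and below a.end."""
    a_start, a_end = a
    return [(s, e) for (s, e) in points if s > a_start and e < a_end]


# Intervals from the containment forest example slide, viewed as points.
POINTS = [(10, 500), (150, 250), (300, 400), (800, 900),
          (830, 860), (1400, 2000), (1530, 1560), (1700, 1800)]

anc = ancestors_of(POINTS, (840, 850))   # (840, 850) is a made-up descendant
desc = descendants_of(POINTS, (10, 500))
```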

Structural Join using R-trees (2)
(Figure: two points, two pages.)

Problem Variations
• Self joins
  • A non-indexed algorithm that traverses the element list exactly once.
• Structural join in a pipelining environment
  • Feedback between modules can help to skip elements that don’t take part in the join.

Performance Analysis (1)
Effect of skipping only ancestors on join performance

Performance Analysis (2)
Effect of skipping only descendants on join performance

Performance Analysis (3)
Effect of skipping both ancestors and descendants

Performance Analysis (4)
Comparison of B+-tree and B+psp algorithms

Conclusions
• We presented efficient ways to perform structural joins over XML data utilizing existing indices.
• Experimental results showed that, among the indexed approaches, the B+-tree with sibling pointers performs the best.
• The solution is easily maintainable and provides a drastic improvement over the no-index case.
