200 likes | 279 Views
<TITLE> Indexing & Querying XML Data for ../Regular Path Expressions/* </TITLE>. <AUTHORS> <NAME ID=1>Quanzhong li</NAME> <NAME ID=2>Bongki MOON</NAME> <AUTHORS>. <PRESENTERS> <NAME UFID=1234567> SUNDAR </NAME> <NAME UFID=7654321> sUPRIYA </NAME> <PRESENTERS>.
E N D
<TITLE>Indexing & Querying XML Data for ../Regular Path Expressions/*</TITLE> <AUTHORS> <NAME ID=1>Quanzhong li</NAME> <NAME ID=2>Bongki MOON</NAME> <AUTHORS> <PRESENTERS> <NAME UFID=1234567>SUNDAR</NAME> <NAME UFID=7654321>sUPRIYA</NAME> <PRESENTERS>
Need for this paper • XML – emerged as a popular standard for data representation and data exchange on the Internet • XML Query Languages use Regular Path Expressions to query the data • Conventional approaches (for indexing & searching this data) based on Tree traversals goes for a toss! – under heavy access requests • Traversing this hierarchy of XML data becomes a overhead if the path lengths are long or unknown • What can be done???
Try our System and the Algorithms !!! • New system for indexing & storing XML data – XISS • New numbering scheme for elements and attributes • Quick in figuring-out ‘ancestor-descendant’ relationship • New index structures • Easier to find all elements and attributes with a particular given name string • Join algorithms for processing Reg-Path-Exp queries • EE-Join – to search paths from element to element • EA-Join – to find element-attribute pairs • KC-Join – to find KC (*) on repeated paths or elements
Go XISS!!! • In general, XML data can be queried for a particular value (or) a structure • By Value: get me “document”; get me “element=‘node1’ ” or “attribute=10” • By Structure: get me parent and child elements/attributes for a given element • Components: • Index Structure: element, attribute and structure (index) • Data Loader • Query Processor • Numbering Scheme first…..
Deitz vs. Li-Moon… • Deitz says, “If x and y are the nodes of a tree T, x is an ancestor of y iff x comes before y when I climb down the tree (pre-order), and after y when I climb up (post-order)” and shows us his scheme, • Ancestor-Descendant relationship determination in constant time • Li-Moon says, “but this lacks flexibility” • This leads to many re-computations when a new node is inserted. • Hmm… let us check-out Li-Moon’s….
Li-Moon’s Numbering… • Hey folks, we are going to extend this preorder and cover up a range of descendants • Just associate a pair of numbers <order, size> with each node • Parent node x says to its child node y, “I came before you so my order is less than yours & my size is >= (your order + your size) and so your interval is always contained in my interval” • If there are siblings x & y (same parent), say, x is before y, then order(x) + size(x) < order(y)
Voila! • Here it goes, • So, for any node x, size(x) >= size of all its direct children [ size(x) is Laarrrge!] • That being said, “Given nodes x and y of a tree T, x is an ancestor of y iff order(x) < order(y) <= order(x) + size(x)
Good news! • Easy accommodation of future insertions – more flexible • Global reordering not necessary until no more reserved spaces • order in <order, size> pair is an unique identifier for each element and attribute in the document • Attribute nodes are placed before their sibling elements in the order – why? • How this scheme helps? – wait till the algorithms! • Switching back to XISS…
Internals of XISS • Index Structure Overview
More structures… • Element Index • Structure Index
Path Join Algorithms • Conventional approaches (top down, bottom up and hybrid traversals) – not effective • Main Idea of proposed algorithm: For a given query “chapter/-*/figure”, - find all ‘chapter’ elements - find all ‘figure’ elements - join the qualified ‘chapter-figure’ pairs without traversing XML data trees (if ancestor- descendant relationship is obtained quickly)
Complex -> Simple • Complex path expression decomposed to many simple path expressions • Intermediate results are joined to get the final result. • Different types of sub-expressions
EA-Join Algorithm • To join intermediate results from sub-expressions with a list of elements and a list of attributes • E.g. “figure[@caption=‘flowchart’]” • Attributes should be placed before sibling elements in the order by the numbering scheme
EA-Join Algorithm • Input: List of “figure” elements and List of “caption” attributes grouped by documents • Steps: (2 stages) • Element sets and attribute sets merged by doc. Id (single scan) • Elements and attributes are merged by figuring out the parent-child relationship using <order> value (single scan) • Output: A set of (e, a) pairs where e is the parent of a
EE-Join Algorithm • To join intermediate results each of which is a list of elements from a sub-expression • E.g. “chapter/-*/figure” • Input: List of “chapter” elements and List of “figure” elements • Steps (2 stages) are similar to EA-Algorithm • Both element sets are merged by doc. Id (single scan) • Chapter element and Figure element are merged by finding the ancestor-descendant relationship using <order, size> values • Output: A set of (e, f) pairs where e is the ancestor of f
EE-Algorithm • The second stage cannot be done in a single scan • In this E.g. , a “figure” element can be descendant of more than one “chapter” element (see book1.xml) • order(figure) will lie in more than one chapter interval ([order(chapter), order(chapter) + size(chapter)]) • This multiple-times scan is still highly effective in searching long or unknown length paths when compared to the conventional tree traversals.
KC-Algorithm • Processes a regular path expression with zero, one or more occurrences of a subexpression • E.g. “chapter*”, “chapter+” • Input: Set of elements from an XML document • Steps: • In each stage applies EE-Algorithm to previous stage’s result • Repeat until no change in result • Output: Kleene Closure of all elements in the given input set
Experiments.. • Prototype of XISS was implemented • Query Interface – C++; Parse XML – Gnome XML Parser; B+-Tree - GiST C++ Library • Workstation: • Sun Ultrasparc-II running on Solaris 2.7 • RAM: 256 MB; Hard-disk: 20GB • Data Sets • Shakespeare’s Plays • SIGMOD Record • NITF100 and NITF1
Performance Comparison • EE-Join Query: • Outperformed bottom-up method by a wide margin • Real-World data set: an order of magnitude faster • Synthetic data set: 6 to 10 times faster • Disk IO was a dominant Cost factor – 60% to 90% of total elapsed time • EA-Join Query: • It was comparatively better than top-down and bottom-up approaches • KC-Join Query: • Performance was not measured; dependent on EE’s performance
THE END! • Hope this presentation was useful • THANKS!