
A Document-based Approach to Indexing XML Data

Ya-Hui Chang and Tsan-Lung Hsieh, Department of Computer Science, National Taiwan Ocean University, yahui@cs.ntou.edu.tw. Sept. 10th, 2002.


Presentation Transcript


  1. A Document-based Approach to Indexing XML Data. Ya-Hui Chang and Tsan-Lung Hsieh, Department of Computer Science, National Taiwan Ocean University, yahui@cs.ntou.edu.tw. Sept. 10th, 2002.

  2. Overview
     • XML introduction
     • Element block
     • Element tree
     • Two types of index structures
       • Document index
       • Element index
     • Experiment results
     • Conclusion

  3. Element Block
       <Books>
         <Book>
           <Title>Principles of database systems</Title>
           <Author>
             <Lastname>Ullman</Lastname>
             <Firstname>Jeffrey</Firstname>
           </Author>
           <Publisher>Computer Science Press</Publisher>
           <Date>1999</Date>
           <Keyword>database</Keyword>
         </Book>
       </Books>

  4. Element Tree [figure: element tree, with an example of offset blocks]

  5. The Query Processor [figure: a Query passes through three stages, Identifying Document, Determining Position, and Retrieving Data, which draw on the Document Index, the Element Index, and the XML Document, and produce the Result]

  6. The Index Structures
     • Purpose:
       • Providing efficient query processing over multiple XML documents
     • Two types:
       • Document index
         • Representing the correspondence of document identifiers and element values
       • Element index
         • Representing the positions of elements

  7. Document Index
     • Based on B+-Tree:
       • the size of each node is restricted by order;
       • the tree is balanced.
     [figure: example B+-tree with Order=5]

  8. Document Index (cont)
     • Each node is represented as an XML document.
     • The search-key value is represented as the attribute key of the element Pointer, while the document identifier is represented as the content.
     XML:
       <?xml version="1.0" encoding="Big5" ?>
       <!DOCTYPE BTree-Node SYSTEM "Btree.dtd">
       <Node type="ex">
         <Pointer key="Office">B0001</Pointer>
         <Pointer key="Win98">B0002</Pointer>
         <Pointer key="XML">B0001</Pointer>
         <Next>B3.bt</Next>
       </Node>
     DTD:
       <?xml version="1.0" encoding="Big5"?>
       <!ELEMENT Node (Pointer*,Next?)>
       <!ATTLIST Node type (ex|in) "ex">
       <!ELEMENT Pointer (#PCDATA)>
       <!ATTLIST Pointer key CDATA #REQUIRED>
       <!ELEMENT Next (#PCDATA)>
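The node layout above can be read with a few lines of Python. This is only a minimal sketch, assuming that type="ex" marks a leaf node and that Next names the file of the following leaf; the function names lookup and next_leaf are hypothetical, and the XML declaration and DOCTYPE are left out of the sample string for brevity.

```python
import xml.etree.ElementTree as ET

# One index node, copied from the slide (declaration and DOCTYPE omitted).
LEAF_NODE = """<Node type="ex">
  <Pointer key="Office">B0001</Pointer>
  <Pointer key="Win98">B0002</Pointer>
  <Pointer key="XML">B0001</Pointer>
  <Next>B3.bt</Next>
</Node>"""


def lookup(node_xml, key):
    """Return the document identifiers stored under `key` in one node."""
    node = ET.fromstring(node_xml)
    return [p.text for p in node.findall("Pointer") if p.get("key") == key]


def next_leaf(node_xml):
    """Return the content of <Next>, or None when the element is absent."""
    nxt = ET.fromstring(node_xml).find("Next")
    return nxt.text if nxt is not None else None


print(lookup(LEAF_NODE, "XML"))   # identifiers of documents containing "XML"
print(next_leaf(LEAF_NODE))       # file assumed to hold the next leaf
```

A full search would start at the root node file and descend through the internal (type="in") nodes before reaching a leaf like this one; that traversal is not shown in the slides and is left out here.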

  9. Element Index
     • The position information of elements is represented based on the order specified in the DTD, or the element tree.
     • The element indexes are partitioned into offset blocks corresponding to element blocks to capture the nesting structures of elements.
     • It is named “offset” since we keep the relative position of elements, to reduce the cost of maintenance.
     • Offset tuples constitute the offset block:
       • the first component records the offset to the parent element;
       • the last component records the pointer to the offset tuple for the next sibling element;
       • the other components record the relative positions of sub-elements.

  10. Example of Offset Blocks [figure: offset tuples drawn beside the element tree; pointers are labelled as child links or sibling links]
      Books:   [pointer, null]
      Book1:   [Title1, pointer, Publisher1, Date1, Keyword1, pointer]
      Author1: [Lastname1, Firstname1, pointer]
      Author2: [Lastname2, Firstname2, null]
      Book2:   [Title2, pointer, Publisher2, Date2, Keyword2, null]
      Author3: [Lastname3, Firstname3, null]

  11. Example of Retrieving Offsets
      • Suppose we plan to retrieve all the data corresponding to the path “/Books/Book/Title”.
      • Based on the element tree, Book is the first child of Books, and Title is the first child of Book.
      • This information tells us which components to retrieve in the offset tuples of Books and Book.
      • We also need to follow the sibling links.

  12. Example of Retrieving Offsets (cont)
      • Suppose the input path is “/Books/Book/Author/Lastname”, where Book is the first child of Books, Author is the second child of Book, and Lastname is the first child of Author.
      • We need to process the sibling elements for both Author and Book.
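The two retrieval examples can be sketched in Python as follows. This is an illustrative reading of the slides, not the authors' code: it assumes the offset tuples that the construction walkthrough arrives at for the sample document, reads a pointer value of 0 as null, and assumes each leaf element occurs at most once under its parent. The names ELEMENT_TREE, OFFSET_TUPLES and positions are hypothetical.

```python
# Child order of every internal element, as given by the element tree.
ELEMENT_TREE = {
    "Books": ["Book"],
    "Book": ["Title", "Author", "Publisher", "Date", "Keyword"],
    "Author": ["Lastname", "Firstname"],
}

# Offset tuples of the sample document: [offset to parent, one component
# per child, pointer to the next sibling's tuple].
OFFSET_TUPLES = [
    [0, 1, 0],                      # 0: Books
    [9, 9, 2, 145, 193, 213, 0],    # 1: Book
    [57, 12, 43, 0],                # 2: Author
]
NULL = 0  # assumption: pointer 0 means "no tuple" (tuple 0 is the root)


def positions(path):
    """Return the absolute start positions of the elements matching `path`."""
    steps = path.strip("/").split("/")
    # Each frontier entry is (tuple index, absolute position) of a match.
    frontier = [(0, OFFSET_TUPLES[0][0])]
    for parent, name in zip(steps, steps[1:]):
        slot = ELEMENT_TREE[parent].index(name) + 1  # component to read
        if name not in ELEMENT_TREE:
            # Leaf: the component is an offset relative to the parent.
            return [pos + OFFSET_TUPLES[idx][slot] for idx, pos in frontier]
        matched = []
        for idx, pos in frontier:
            child = OFFSET_TUPLES[idx][slot]  # child link
            while child != NULL:              # follow the sibling links
                entry = OFFSET_TUPLES[child]
                matched.append((child, pos + entry[0]))
                child = entry[-1]
        frontier = matched
    return [pos for _, pos in frontier]


print(positions("/Books/Book/Title"))            # [18]
print(positions("/Books/Book/Author/Lastname"))  # [78]
```

The element tree supplies the component index for each step, so only the tuples along the path are touched; the sibling loop is what lets one path expression return several Book or Author elements when a document has them.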

  13. Constructing Algorithm
      • Idea: performing a linear scan on the XML document; retrieving the absolute positions of all tags to calculate offsets.
      • Data structures used:
        • StartTagList: the sequence of start-tags and their absolute positions
        • EndTagList: the sequence of end-tags and their absolute positions
        • Stack: all unfinished elements; on top is the most recent one, which is also the parent of the current element
      • Each internal node of the element tree will need to record how many child nodes it has.
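A simplified Python sketch of this construction is given below. It is an approximation under stated assumptions rather than the authors' algorithm: it finds tags with a single regular-expression scan instead of keeping separate StartTagList and EndTagList, it sets every sibling pointer to 0 (read as null) when the tuple is created instead of when the element ends, and it assumes each leaf element occurs at most once under its parent. The sample string assumes CRLF line breaks and tab indentation, which reproduces the start-tag positions shown in the walkthrough. All names are hypothetical.

```python
import re

# Child order of every internal element (the element tree).
ELEMENT_TREE = {
    "Books": ["Book"],
    "Book": ["Title", "Author", "Publisher", "Date", "Keyword"],
    "Author": ["Lastname", "Firstname"],
}
NULL = 0  # assumption: pointer 0 means "no tuple"

SAMPLE = (
    "<Books>\r\n"
    "<Book>\r\n"
    "\t<Title>Principles of database systems</Title>\r\n"
    "\t<Author>\r\n"
    "\t\t<Lastname>Ullman</Lastname>\r\n"
    "\t\t<Firstname>Jeffrey</Firstname>\r\n"
    "\t</Author>\r\n"
    "\t<Publisher>Computer Science Press</Publisher>\r\n"
    "\t<Date>1999</Date>\r\n"
    "\t<Keyword>database</Keyword>\r\n"
    "</Book>\r\n"
    "</Books>"
)


def build_offset_tuples(xml_text):
    """Scan the document once and return its list of offset tuples."""
    tuples = []
    stack = [("/", 0, -1)]  # (name, absolute position, tuple index)
    for match in re.finditer(r"<(/?)(\w+)>", xml_text):
        closing, name, pos = match.group(1), match.group(2), match.start()
        if closing:
            if name in ELEMENT_TREE:
                stack.pop()  # an internal element is finished
            continue
        parent_name, parent_pos, parent_idx = stack[-1]
        is_root = parent_idx < 0
        slot = None if is_root else ELEMENT_TREE[parent_name].index(name) + 1
        if name not in ELEMENT_TREE:
            # Leaf: store its offset relative to the parent's start tag.
            tuples[parent_idx][slot] = pos - parent_pos
            continue
        # Internal element: allocate a tuple with one slot per child.
        idx = len(tuples)
        tuples.append([pos - parent_pos] + [None] * len(ELEMENT_TREE[name]) + [NULL])
        if not is_root:
            parent = tuples[parent_idx]
            if parent[slot] is None:
                parent[slot] = idx            # child link
            else:
                last = parent[slot]
                while tuples[last][-1] != NULL:
                    last = tuples[last][-1]
                tuples[last][-1] = idx        # sibling link
        stack.append((name, pos, idx))
    return tuples


for number, entry in enumerate(build_offset_tuples(SAMPLE)):
    print(number, entry)
```

Because every component is relative to the parent's start tag, an edit inside one Book shifts only the offsets stored in that Book's tuples and the offsets of the elements that follow it under the same parent, which is the maintenance argument made for keeping offsets instead of absolute positions.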

  14. Initial Data
      StartTagList: … ['Title', 18] ['Book', 9] ['Books', 0]
      EndTagList: … ['Firstname', 138] ['Lastname', 104] ['Title', 62]
      Offset Tuples: (none yet)
      Stack: ['/', 0, -1]
      Document:
        <Books>
          <Book>
            <Title>Principles of database systems</Title>
            <Author>
              <Lastname>Ullman</Lastname>
              <Firstname>Jeffrey</Firstname>
            </Author>
            <Publisher>Computer Science Press</Publisher>
            <Date>1999</Date>
            <Keyword>database</Keyword>
          </Book>
        </Books>

  15. Round 1
      StartTagList: … ['Title', 18] ['Book', 9] ['Books', 0]
      EndTagList: … ['Firstname', 138] ['Lastname', 104] ['Title', 62]
      Offset Tuples:
        0 [0, _, _]
      Stack: ['Books', 0, 0] ['/', 0, -1]
      Document: <Books> <Book> <Title>Principles of database systems</Title> <Author> <Lastname>Ullman</Lastname> …

  16. Round 2
      StartTagList: … ['Author', 66] ['Title', 18] ['Book', 9]
      EndTagList: … ['Firstname', 138] ['Lastname', 104] ['Title', 62]
      Offset Tuples:
        0 [0, 1, _]
        1 [9, _, _, _, _, _, _]
      Stack: ['Book', 9, 1] ['Books', 0, 0] ['/', 0, -1]
      Document: <Books> <Book> <Title>Principles of database systems</Title> <Author> <Lastname>Ullman</Lastname> …

  17. Round 3
      StartTagList: … ['Lastname', 78] ['Author', 66] ['Title', 18]
      EndTagList: … ['Firstname', 138] ['Lastname', 104] ['Title', 62]
      Offset Tuples:
        0 [0, 1, _]
        1 [9, 9, _, _, _, _, _]
      Stack: ['Book', 9, 1] ['Books', 0, 0] ['/', 0, -1]
      Document: <Books> <Book> <Title>Principles of database systems</Title> <Author> <Lastname>Ullman</Lastname> …

  18. Round 4
      StartTagList: … ['Firstname', 109] ['Lastname', 78] ['Author', 66]
      EndTagList: … ['Author', 150] ['Firstname', 138] ['Lastname', 104]
      Offset Tuples:
        0 [0, 1, _]
        1 [9, 9, 2, _, _, _, _]
        2 [57, _, _, _]
      Stack: ['Author', 66, 2] ['Book', 9, 1] ['Books', 0, 0] ['/', 0, -1]
      Document: <Books> <Book> <Title>Principles of database systems</Title> <Author> <Lastname>Ullman</Lastname> …

  19. Round 5
      StartTagList: … ['Publisher', 154] ['Firstname', 109] ['Lastname', 78]
      EndTagList: … ['Author', 150] ['Firstname', 138] ['Lastname', 104]
      Offset Tuples:
        0 [0, 1, _]
        1 [9, 9, 2, _, _, _, _]
        2 [57, 12, _, _]
      Stack: ['Author', 66, 2] ['Book', 9, 1] ['Books', 0, 0] ['/', 0, -1]
      Document: <Books> <Book> <Title>Principles of database systems</Title> <Author> <Lastname>Ullman</Lastname> …

  20. Round 6
      StartTagList: … ['Date', 202] ['Publisher', 154] ['Firstname', 109]
      EndTagList: … ['Publisher', 198] ['Author', 150] ['Firstname', 138]
      Offset Tuples:
        0 [0, 1, _]
        1 [9, 9, 2, _, _, _, _]
        2 [57, 12, 43, _]
      Stack: ['Author', 66, 2] ['Book', 9, 1] ['Books', 0, 0] ['/', 0, -1]
      Document: <Lastname>Ullman</Lastname> <Firstname>Jeffrey</Firstname> </Author> <Publisher>Computer Science Press</Publisher> <Date>1999</Date> …

  21. Round 7
      StartTagList: ['Keyword', 222] ['Date', 202] ['Publisher', 154]
      EndTagList: … ['Date', 218] ['Publisher', 198] ['Author', 150]
      Offset Tuples:
        0 [0, 1, _]
        1 [9, 9, 2, _, _, _, _]
        2 [57, 12, 43, 0]
      Stack: ['Author', 66, 2] ['Book', 9, 1] ['Books', 0, 0] ['/', 0, -1]
      Document: <Lastname>Ullman</Lastname> <Firstname>Jeffrey</Firstname> </Author> <Publisher>Computer Science Press</Publisher> <Date>1999</Date> …

  22. Round 8
      StartTagList: ['Keyword', 222] ['Date', 202] ['Publisher', 154]
      EndTagList: … ['Keyword', 248] ['Date', 218] ['Publisher', 198]
      Offset Tuples:
        0 [0, 1, _]
        1 [9, 9, 2, 145, _, _, _]
        2 [57, 12, 43, 0]
      Stack: ['Book', 9, 1] ['Books', 0, 0] ['/', 0, -1]
      Document: <Lastname>Ullman</Lastname> <Firstname>Jeffrey</Firstname> </Author> <Publisher>Computer Science Press</Publisher> <Date>1999</Date> …

  23. Round 9
      StartTagList: ['Keyword', 222] ['Date', 202]
      EndTagList: ['Books', 266] ['Book', 257] ['Keyword', 248] ['Date', 218]
      Offset Tuples:
        0 [0, 1, _]
        1 [9, 9, 2, 145, 193, _, _]
        2 [57, 12, 43, 0]
      Stack: ['Book', 9, 1] ['Books', 0, 0] ['/', 0, -1]
      Document: <Publisher>Computer Science Press</Publisher> <Date>1999</Date> <Keyword>database</Keyword> </Book> </Books>

  24. Round 10
      StartTagList: ['Keyword', 222]
      EndTagList: ['Books', 266] ['Book', 257] ['Keyword', 248]
      Offset Tuples:
        0 [0, 1, _]
        1 [9, 9, 2, 145, 193, 213, _]
        2 [57, 12, 43, 0]
      Stack: ['Book', 9, 1] ['Books', 0, 0] ['/', 0, -1]
      Document: <Publisher>Computer Science Press</Publisher> <Date>1999</Date> <Keyword>database</Keyword> </Book> </Books>

  25. Round 11
      StartTagList: (empty)
      EndTagList: ['Books', 266] ['Book', 257]
      Offset Tuples:
        0 [0, 1, _]
        1 [9, 9, 2, 145, 193, 213, 0]
        2 [57, 12, 43, 0]
      Stack: ['Book', 9, 1] ['Books', 0, 0] ['/', 0, -1]
      Document: <Publisher>Computer Science Press</Publisher> <Date>1999</Date> <Keyword>database</Keyword> </Book> </Books>

  26. Round 12
      StartTagList: (empty)
      EndTagList: ['Books', 266]
      Offset Tuples:
        0 [0, 1, 0]
        1 [9, 9, 2, 145, 193, 213, 0]
        2 [57, 12, 43, 0]
      Stack: ['Books', 0, 0] ['/', 0, -1]
      Document: <Publisher>Computer Science Press</Publisher> <Date>1999</Date> <Keyword>database</Keyword> </Book> </Books>

  27. Final Data
      StartTagList: (empty)
      EndTagList: (empty)
      Offset Tuples:
        0 [0, 1, 0]
        1 [9, 9, 2, 145, 193, 213, 0]
        2 [57, 12, 43, 0]
      Stack: ['/', 0, -1]
      Document:
        <Books>
          <Book>
            <Title>Principles of database systems</Title>
            <Author>
              <Lastname>Ullman</Lastname>
              <Firstname>Jeffrey</Firstname>
            </Author>
            <Publisher>Computer Science Press</Publisher>
            <Date>1999</Date>
            <Keyword>database</Keyword>
          </Book>
        </Books>

  28. Performance Evaluation
      • Comparison with DOM: showing the efficiency of utilizing the pre-built element index
        • DOM (Document Object Model): a tree-based parsing mechanism where each element is a node
        • Using Microsoft MSXML 3.0 DOM API
      • Construction of the cost model: showing the scalability of our indexing scheme
      • Comparison with Lore: showing the performance of the whole query processor
        • Lore: a specialized database system for semi-structured/XML data
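The difference between the two access styles in the DOM comparison can be illustrated with a small Python sketch. It uses the standard-library xml.dom.minidom as a stand-in for the MSXML 3.0 DOM API used in the experiments, and the helper text_at is hypothetical; it is not a benchmark and says nothing about the measured results.

```python
from xml.dom import minidom

SAMPLE = "<Books><Book><Title>Principles of database systems</Title></Book></Books>"


def title_via_dom(xml_text):
    """Parse the whole document into a node tree, then navigate to Title."""
    dom = minidom.parseString(xml_text)
    return dom.getElementsByTagName("Title")[0].firstChild.data


def text_at(xml_text, start):
    """Read an element's text given the absolute position of its start tag."""
    content_start = xml_text.index(">", start) + 1
    content_end = xml_text.index("</", content_start)
    return xml_text[content_start:content_end]


# Stands in for a position that would come from the pre-built element index.
TITLE_POS = SAMPLE.index("<Title>")

print(title_via_dom(SAMPLE))
print(text_at(SAMPLE, TITLE_POS))
```

The DOM route has to build a node for every element before the query can start, while the position-based route reads only the bytes around the requested element once the index has supplied the position.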

  29. Comparison with DOM

  30. Cost Model
      • The I/O cost consists of processing the following four portions of data:
        • The internal nodes of the document index
        • The leaf nodes of the document index
        • The offset blocks
        • The XML files
      • The cost model is as follows:

  31. Experiment Setups

  32. Experiment Data

  33. Queries to Compare with Lore

  34. Experiment Results

  35. Conclusions
      • Summary
        • We construct a query processor to retrieve data from multiple XML documents, which utilizes two index structures:
          • the document index can quickly identify the required document;
          • the maintainable element index can quickly determine the precise location of the desired data.
        • Experiment results show the efficiency of our approach.
      • Future work
        • Supporting more complicated queries
        • Improving space utilization
