A Document-based Approach to Indexing XML Data

Ya-Hui Chang and Tsan-Lung Hsieh
Department of Computer Science, National Taiwan Ocean University
yahui@cs.ntou.edu.tw
Sept. 10th, 2002

Overview
• XML introduction
• Element block
• Element tree
• Two types of index structures
  • Document index
  • Element index
• Experiment results
• Conclusion

Element Block
<Books>
  <Book>
    <Title>Principles of database systems</Title>
    <Author>
      <Lastname>Ullman</Lastname>
      <Firstname>Jeffrey</Firstname>
    </Author>
    <Publisher>Computer Science Press</Publisher>
    <Date>1999</Date>
    <Keyword>database</Keyword>
  </Book>
</Books>

Element Tree
[Figure: Example of Offset Blocks]

The Query Processor
[Figure: a Query passes through three stages and yields the Result]
• Identifying Document, using the Document Index
• Determining Position, using the Element Index
• Retrieving Data, from the XML Document

The Index Structures
• Purpose: providing efficient query processing over multiple XML documents
• Two types:
  • Document index: represents the correspondence between document identifiers and element values
  • Element index: represents the positions of elements

Document Index
• Based on a B+-tree:
  • the size of each node is restricted by the order (the illustrated tree has order 5);
  • the tree is balanced.

Document Index (cont)
• Each node is represented as an XML document.
• The search-key value is represented as the attribute key of the element Pointer, while the document identifier is represented as its content.

XML:
<?xml version="1.0" encoding="Big5" ?>
<!DOCTYPE BTree-Node SYSTEM "Btree.dtd">
<Node type="ex">
  <Pointer key="Office">B0001</Pointer>
  <Pointer key="Win98">B0002</Pointer>
  <Pointer key="XML">B0001</Pointer>
  <Next>B3.bt</Next>
</Node>

DTD:
<?xml version="1.0" encoding="Big5"?>
<!ELEMENT Node (Pointer*,Next?)>
<!ATTLIST Node type (ex|in) "ex">
<!ELEMENT Pointer (#PCDATA)>
<!ATTLIST Pointer key CDATA #REQUIRED>
<!ELEMENT Next (#PCDATA)>
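As a hedged illustration of how a node in this format could be searched, the sketch below parses the example node with Python's standard xml.etree.ElementTree and collects the document identifiers stored under a search key, together with the file named by Next. The function name lookup_leaf is an assumption made for this example, and the XML declaration and DOCTYPE line are left out to keep the snippet self-contained; none of this is taken from the original system.

```python
import xml.etree.ElementTree as ET

# The example node from the slide, without the declaration and DOCTYPE.
LEAF_NODE = """
<Node type="ex">
  <Pointer key="Office">B0001</Pointer>
  <Pointer key="Win98">B0002</Pointer>
  <Pointer key="XML">B0001</Pointer>
  <Next>B3.bt</Next>
</Node>
"""

def lookup_leaf(node_xml, key):
    """Return (document ids stored under key, content of Next or None).

    Hypothetical helper: it only reads one node and does not walk the tree.
    """
    node = ET.fromstring(node_xml)
    doc_ids = [p.text for p in node.findall("Pointer") if p.get("key") == key]
    next_file = node.findtext("Next")  # None when the node has no Next element
    return doc_ids, next_file

print(lookup_leaf(LEAF_NODE, "XML"))
```

A full search would start at the root node file and descend through nodes of type "in"; the slide does not show how those nodes are laid out, so the sketch stops at a single node.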

Element Index
• The position information of elements is represented in the order specified in the DTD, or the element tree.
• The element index is partitioned into offset blocks, which correspond to element blocks, to capture the nesting structure of elements.
• It is named “offset” because we keep the relative positions of elements, which reduces the cost of maintenance.
• Offset tuples constitute an offset block:
  • the first component records the offset to the parent element;
  • the last component records the pointer to the offset tuple of the next sibling element;
  • the other components record the relative positions of sub-elements.

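Example of Offset Blocks
[Figure: offset blocks shown next to the element tree; child links lead from a tuple to a nested block, sibling links lead to the next tuple in the same block]
• Books block: (Books, pointer, null)
• Book block: (Book1, Title1, pointer, Publisher1, Date1, Keyword1, pointer), (Book2, Title2, pointer, Publisher2, Date2, Keyword2, null)
• Author block of Book1: (Author1, Lastname1, Firstname1, pointer), (Author2, Lastname2, Firstname2, null)
• Author block of Book2: (Author3, Lastname3, Firstname3, null)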

Example of Retrieving Offsets
• Suppose we plan to retrieve all the data corresponding to the path “/Books/Book/Title”.
• Based on the element tree, Book is the first child of Books, and Title is the first child of Book.
• This information tells us which components to retrieve in the offset tuples of Books and Book.
• We also need to follow the sibling links.

Example of Retrieving Offsets (cont)
• Suppose the input path is “/Books/Book/Author/Lastname”, where Book is the first child of Books, Author is the second child of Book, and Lastname is the first child of Author.
• We need to process the sibling elements for both Author and Book.
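The two retrieval examples can be sketched roughly as follows. This is one reading of the slides, not the authors' implementation: it uses the three offset tuples that the construction walkthrough later in this deck ends with, and it assumes that the slot of an internal child holds the number of that child's offset tuple, that the slot of a leaf child holds its offset from the parent's start tag, and that 0 in the last slot means "no next sibling". The names ELEMENT_TREE, OFFSET_TUPLES and resolve are hypothetical.

```python
# Children of each internal element, in DTD order (assumed from the example).
ELEMENT_TREE = {
    "Books": ["Book"],
    "Book": ["Title", "Author", "Publisher", "Date", "Keyword"],
    "Author": ["Lastname", "Firstname"],
}

# Offset tuples as shown on the "Final Data" slide.
OFFSET_TUPLES = {
    0: [0, 1, 0],                       # Books
    1: [9, 9, 2, 145, 193, 213, 0],     # Book
    2: [57, 12, 43, 0],                 # Author
}

def resolve(path):
    """Return the absolute start-tag positions of all elements on path."""
    names = path.strip("/").split("/")
    # Each entry is (tuple number or None for a leaf, absolute position).
    frontier = [(0, OFFSET_TUPLES[0][0])]
    for parent, child in zip(names, names[1:]):
        slot = ELEMENT_TREE[parent].index(child) + 1  # slot 0 is the parent offset
        found = []
        for tuple_id, parent_pos in frontier:
            value = OFFSET_TUPLES[tuple_id][slot]
            if child not in ELEMENT_TREE:
                # Leaf child: the slot is its offset from the parent.
                found.append((None, parent_pos + value))
                continue
            # Internal child: follow the child link, then the sibling links.
            while value != 0:
                child_tuple = OFFSET_TUPLES[value]
                found.append((value, parent_pos + child_tuple[0]))
                value = child_tuple[-1]
        frontier = found
    return [pos for _, pos in frontier]

print(resolve("/Books/Book/Title"))
print(resolve("/Books/Book/Author/Lastname"))
```

The positions returned are the start-tag positions recorded in StartTagList on the walkthrough slides, so the data itself would then be read from the XML file at those positions.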

Constructing Algorithm
• Idea: perform a linear scan of the XML document, retrieving the absolute positions of all tags to calculate offsets.
• Data structures used:
  • StartTagList: the sequence of start-tags and their absolute positions
  • EndTagList: the sequence of end-tags and their absolute positions
  • Stack: all unfinished elements; on top is the most recent one, which is also the parent of the current element
• Each internal node of the element tree needs to record how many child nodes it has.
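A minimal sketch of this construction, under stated assumptions, is given below. The tag positions are copied from the walkthrough slides rather than computed from the text, the two lists are merged by position, stack entries are (name, absolute position, tuple number) as on the slides, and None stands for a slot drawn as "_". The function name build_offset_tuples and the handling of repeated siblings are assumptions; the slides only show a document with one Book and one Author.

```python
ELEMENT_TREE = {
    "Books": ["Book"],
    "Book": ["Title", "Author", "Publisher", "Date", "Keyword"],
    "Author": ["Lastname", "Firstname"],
}
START_TAGS = [("Books", 0), ("Book", 9), ("Title", 18), ("Author", 66),
              ("Lastname", 78), ("Firstname", 109), ("Publisher", 154),
              ("Date", 202), ("Keyword", 222)]
END_TAGS = [("Title", 62), ("Lastname", 104), ("Firstname", 138),
            ("Author", 150), ("Publisher", 198), ("Date", 218),
            ("Keyword", 248), ("Book", 257), ("Books", 266)]

def build_offset_tuples(start_tags, end_tags, tree):
    events = sorted([(pos, "start", name) for name, pos in start_tags] +
                    [(pos, "end", name) for name, pos in end_tags])
    tuples = []
    stack = [("/", 0, -1)]  # virtual root, as on the slides
    for pos, kind, name in events:
        parent_name, parent_pos, parent_id = stack[-1]
        if name not in tree:
            # Leaf element: record its offset from the parent at the start tag.
            if kind == "start":
                slot = tree[parent_name].index(name) + 1
                tuples[parent_id][slot] = pos - parent_pos
            continue
        if kind == "end":
            # Internal element finished: no next sibling known yet.
            _, _, tuple_id = stack.pop()
            tuples[tuple_id][-1] = 0
            continue
        # Start of an internal element: allocate its tuple and link it.
        tuple_id = len(tuples)
        tuples.append([pos - parent_pos] + [None] * len(tree[name]) + [None])
        if parent_id >= 0:
            slot = tree[parent_name].index(name) + 1
            if tuples[parent_id][slot] is None:
                tuples[parent_id][slot] = tuple_id       # child link
            else:
                last = tuples[parent_id][slot]
                while tuples[last][-1]:                  # walk the sibling chain
                    last = tuples[last][-1]
                tuples[last][-1] = tuple_id              # sibling link
        stack.append((name, pos, tuple_id))
    return tuples

for number, offsets in enumerate(build_offset_tuples(START_TAGS, END_TAGS, ELEMENT_TREE)):
    print(number, offsets)
```

For the example document this reproduces the three tuples on the final slide of the walkthrough. Because only relative offsets are stored, an edit inside one element changes the slots of its own block and leaves the other blocks as they are, which is the maintenance argument made for the element index.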

Initial Data
StartTagList: … ['Title', 18] ['Book', 9] ['Books', 0]
EndTagList: … ['Firstname', 138] ['Lastname', 104] ['Title', 62]
Offset Tuples: (empty)
Stack: ['/', 0, -1]
Document:
<Books>
  <Book>
    <Title>Principles of database systems</Title>
    <Author>
      <Lastname>Ullman</Lastname>
      <Firstname>Jeffrey</Firstname>
    </Author>
    <Publisher>Computer Science Press</Publisher>
    <Date>1999</Date>
    <Keyword>database</Keyword>
  </Book>
</Books>

Round 1
StartTagList: … ['Title', 18] ['Book', 9] ['Books', 0]
EndTagList: … ['Firstname', 138] ['Lastname', 104] ['Title', 62]
Offset Tuples:
  0: [0, _, _]
Stack: ['Books', 0, 0] ['/', 0, -1]
Document (excerpt): <Books> <Book> <Title>Principles of database systems</Title> <Author> <Lastname>Ullman</Lastname> …

Round 2
StartTagList: … ['Author', 66] ['Title', 18] ['Book', 9]
EndTagList: … ['Firstname', 138] ['Lastname', 104] ['Title', 62]
Offset Tuples:
  0: [0, 1, _]
  1: [9, _, _, _, _, _, _]
Stack: ['Book', 9, 1] ['Books', 0, 0] ['/', 0, -1]
Document (excerpt): <Books> <Book> <Title>Principles of database systems</Title> <Author> <Lastname>Ullman</Lastname> …

Round 3
StartTagList: … ['Lastname', 78] ['Author', 66] ['Title', 18]
EndTagList: … ['Firstname', 138] ['Lastname', 104] ['Title', 62]
Offset Tuples:
  0: [0, 1, _]
  1: [9, 9, _, _, _, _, _]
Stack: ['Book', 9, 1] ['Books', 0, 0] ['/', 0, -1]
Document (excerpt): <Books> <Book> <Title>Principles of database systems</Title> <Author> <Lastname>Ullman</Lastname> …

Round 4
StartTagList: … ['Firstname', 109] ['Lastname', 78] ['Author', 66]
EndTagList: … ['Author', 150] ['Firstname', 138] ['Lastname', 104]
Offset Tuples:
  0: [0, 1, _]
  1: [9, 9, 2, _, _, _, _]
  2: [57, _, _, _]
Stack: ['Author', 66, 2] ['Book', 9, 1] ['Books', 0, 0] ['/', 0, -1]
Document (excerpt): <Books> <Book> <Title>Principles of database systems</Title> <Author> <Lastname>Ullman</Lastname> …

Round 5
StartTagList: … ['Publisher', 154] ['Firstname', 109] ['Lastname', 78]
EndTagList: … ['Author', 150] ['Firstname', 138] ['Lastname', 104]
Offset Tuples:
  0: [0, 1, _]
  1: [9, 9, 2, _, _, _, _]
  2: [57, 12, _, _]
Stack: ['Author', 66, 2] ['Book', 9, 1] ['Books', 0, 0] ['/', 0, -1]
Document (excerpt): <Books> <Book> <Title>Principles of database systems</Title> <Author> <Lastname>Ullman</Lastname> …

Round 6
StartTagList: … ['Date', 202] ['Publisher', 154] ['Firstname', 109]
EndTagList: … ['Publisher', 198] ['Author', 150] ['Firstname', 138]
Offset Tuples:
  0: [0, 1, _]
  1: [9, 9, 2, _, _, _, _]
  2: [57, 12, 43, _]
Stack: ['Author', 66, 2] ['Book', 9, 1] ['Books', 0, 0] ['/', 0, -1]
Document (excerpt): <Lastname>Ullman</Lastname> <Firstname>Jeffrey</Firstname> </Author> <Publisher>Computer Science Press</Publisher> <Date>1999</Date> …

Round 7
StartTagList: ['Keyword', 222] ['Date', 202] ['Publisher', 154]
EndTagList: … ['Date', 218] ['Publisher', 198] ['Author', 150]
Offset Tuples:
  0: [0, 1, _]
  1: [9, 9, 2, _, _, _, _]
  2: [57, 12, 43, 0]
Stack: ['Author', 66, 2] ['Book', 9, 1] ['Books', 0, 0] ['/', 0, -1]
Document (excerpt): <Lastname>Ullman</Lastname> <Firstname>Jeffrey</Firstname> </Author> <Publisher>Computer Science Press</Publisher> <Date>1999</Date> …

Round 8
StartTagList: ['Keyword', 222] ['Date', 202] ['Publisher', 154]
EndTagList: … ['Keyword', 248] ['Date', 218] ['Publisher', 198]
Offset Tuples:
  0: [0, 1, _]
  1: [9, 9, 2, 145, _, _, _]
  2: [57, 12, 43, 0]
Stack: ['Book', 9, 1] ['Books', 0, 0] ['/', 0, -1]
Document (excerpt): <Lastname>Ullman</Lastname> <Firstname>Jeffrey</Firstname> </Author> <Publisher>Computer Science Press</Publisher> <Date>1999</Date> …

Round 9
StartTagList: ['Keyword', 222] ['Date', 202]
EndTagList: ['Books', 266] ['Book', 257] ['Keyword', 248] ['Date', 218]
Offset Tuples:
  0: [0, 1, _]
  1: [9, 9, 2, 145, 193, _, _]
  2: [57, 12, 43, 0]
Stack: ['Book', 9, 1] ['Books', 0, 0] ['/', 0, -1]
Document (excerpt): <Publisher>Computer Science Press</Publisher> <Date>1999</Date> <Keyword>database</Keyword> </Book> </Books>

Round 10
StartTagList: ['Keyword', 222]
EndTagList: ['Books', 266] ['Book', 257] ['Keyword', 248]
Offset Tuples:
  0: [0, 1, _]
  1: [9, 9, 2, 145, 193, 213, _]
  2: [57, 12, 43, 0]
Stack: ['Book', 9, 1] ['Books', 0, 0] ['/', 0, -1]
Document (excerpt): <Publisher>Computer Science Press</Publisher> <Date>1999</Date> <Keyword>database</Keyword> </Book> </Books>

Round 11
StartTagList: (empty)
EndTagList: ['Books', 266] ['Book', 257]
Offset Tuples:
  0: [0, 1, _]
  1: [9, 9, 2, 145, 193, 213, 0]
  2: [57, 12, 43, 0]
Stack: ['Book', 9, 1] ['Books', 0, 0] ['/', 0, -1]
Document (excerpt): <Publisher>Computer Science Press</Publisher> <Date>1999</Date> <Keyword>database</Keyword> </Book> </Books>

Round 12
StartTagList: (empty)
EndTagList: ['Books', 266]
Offset Tuples:
  0: [0, 1, 0]
  1: [9, 9, 2, 145, 193, 213, 0]
  2: [57, 12, 43, 0]
Stack: ['Books', 0, 0] ['/', 0, -1]
Document (excerpt): <Publisher>Computer Science Press</Publisher> <Date>1999</Date> <Keyword>database</Keyword> </Book> </Books>

Final Data
StartTagList: (empty)
EndTagList: (empty)
Offset Tuples:
  0: [0, 1, 0]
  1: [9, 9, 2, 145, 193, 213, 0]
  2: [57, 12, 43, 0]
Stack: ['/', 0, -1]
Document:
<Books>
  <Book>
    <Title>Principles of database systems</Title>
    <Author>
      <Lastname>Ullman</Lastname>
      <Firstname>Jeffrey</Firstname>
    </Author>
    <Publisher>Computer Science Press</Publisher>
    <Date>1999</Date>
    <Keyword>database</Keyword>
  </Book>
</Books>

Performance Evaluation
• Comparison with DOM: showing the efficiency of utilizing the pre-built element index
  • DOM (Document Object Model): a tree-based parsing mechanism where each element is a node
  • Using the Microsoft MSXML 3.0 DOM API
• Construction of the cost model: showing the scalability of our indexing scheme
• Comparison with Lore: showing the performance of the whole query processor
  • Lore: a specialized database system for semi-structured/XML data
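To make the DOM baseline concrete, here is a small sketch of tree-based retrieval. It uses Python's standard xml.dom.minidom as a stand-in for the MSXML 3.0 DOM API named above, so it illustrates the style of access only and says nothing about the measured system. The function name select_text is hypothetical.

```python
from xml.dom.minidom import parseString

DOC = (
    "<Books><Book>"
    "<Title>Principles of database systems</Title>"
    "<Author><Lastname>Ullman</Lastname><Firstname>Jeffrey</Firstname></Author>"
    "<Publisher>Computer Science Press</Publisher>"
    "<Date>1999</Date><Keyword>database</Keyword>"
    "</Book></Books>"
)

def select_text(xml_text, path):
    """Parse the whole document into a tree, then walk it along path."""
    nodes = [parseString(xml_text)]  # start from the Document node
    for name in path.strip("/").split("/"):
        nodes = [child
                 for node in nodes
                 for child in node.childNodes
                 if child.nodeType == child.ELEMENT_NODE and child.tagName == name]
    # Concatenate the text children of every element that matched.
    return ["".join(t.data for t in node.childNodes if t.nodeType == t.TEXT_NODE)
            for node in nodes]

print(select_text(DOC, "/Books/Book/Title"))
```

The point of contrast is that the whole document is parsed into nodes before any element can be reached, whereas the pre-built element index gives the position of the wanted data directly.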

Comparison with DOM

Cost Model
• The I/O cost consists of processing the following four portions of data:
  • The internal nodes of the document index
  • The leaf nodes of the document index
  • The offset blocks
  • The XML files
• The cost model is as follows:

Experiment Setups

Experiment Data

Queries to Compare with Lore

Experiment Results

Conclusions
• Summary
  • We construct a query processor that retrieves data from multiple XML documents using two index structures:
    • the document index can quickly identify the required document;
    • the maintainable element index can quickly determine the precise location of the desired data.
  • Experiment results show the efficiency of our approach.
• Future work
  • Supporting more complicated queries
  • Improving space utilization
