This paper presents ViST, a dynamic index method for querying XML data using tree structures. It introduces structure-encoded sequences, indexing techniques, and the ViST algorithm for efficient subsequence matching. Experimental results demonstrate the advantages of ViST over other methods.
ViST: a dynamic index method for querying XML data by tree structures Authors: Haixun Wang, Sanghyun Park, Wei Fan, Philip Yu Presenter: Elena Zheleva, November 2003
Overview • Modeling XML Queries • Structure-encoded sequences • Indexing • ViST • Experimental Results
DTD of purchase records: <!ELEMENT purchases (purchase*)> <!ELEMENT purchase (seller, buyer)> <!ATTLIST seller ID ID location CDATA name CDATA> <!ELEMENT seller (item*)> <!ATTLIST buyer ID ID location CDATA name CDATA> <!ELEMENT item (item*)> <!ATTLIST item name CDATA manufacturer CDATA>
Modeling XML Queries • Focus in XML query language design: ability to express complex structural or graphical queries
Modeling XML Queries • Querying XML data = finding substructures of the data graph that match the structure of the query • Structure-encoded sequences: a sequential representation of both XML data and XML queries
Structure-Encoded Sequences • Maps the data and the queries • Matches the subsequence • Purpose: to avoid as many join operations as possible • Def. Sequence of (symbol, prefix) pairs
Mapping Data • Represent XML document/tree in preorder • Represent in structure-encoded seq
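The mapping on this slide can be sketched in a few lines of Python. This is a minimal illustration, assuming a document is given as nested (symbol, children) tuples and that the prefix of a node is the path of symbols from the root down to its parent. The function name `encode` and the miniature record (P = purchase, S = seller, B = buyer, I = item, N = name, M = manufacturer, L = location) are hypothetical and not taken from the paper.

```python
def encode(node, prefix=()):
    """Return the preorder list of (symbol, prefix) pairs for the subtree at node.

    node   : (symbol, [child, child, ...])
    prefix : tuple of symbols on the path from the root to node's parent
    """
    symbol, children = node
    pairs = [(symbol, prefix)]            # visit the node itself first (preorder)
    for child in children:                # then each child, left to right
        pairs.extend(encode(child, prefix + (symbol,)))
    return pairs


# Hypothetical miniature purchase record.
doc = ("P", [
    ("S", [("N", []), ("I", [("M", [])])]),
    ("B", [("L", [])]),
])

seq = encode(doc)
for symbol, prefix in seq:
    print(symbol, "".join(prefix) or "(empty)")
```

Each pair carries enough context (the prefix) to place its symbol in the tree, which is what lets a flat sequence stand in for the document.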
Mapping Queries • Benefit of sequence matching: query gets processed as whole • Path Expression
Structure-Encoded Sequences • Query • Data
Querying XML • through Structure-Encoded Sequence Matching
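As a rough illustration of querying by sequence matching, the sketch below checks whether a query sequence occurs as a (not necessarily contiguous) subsequence of a data sequence. It is a plain linear scan that assumes exact (symbol, prefix) equality, with no wildcards and no index; the name `matches` and the sample sequences are hypothetical. The index-based algorithms on the following slides exist precisely to avoid this kind of scan.

```python
def matches(query, data):
    """Return True if query is a non-contiguous subsequence of data.

    Both arguments are lists of (symbol, prefix) pairs. The scan is greedy:
    each query pair is matched to the earliest remaining data pair equal to it.
    """
    position = 0
    for pair in query:
        while position < len(data) and data[position] != pair:
            position += 1
        if position == len(data):
            return False                  # ran out of data before matching pair
        position += 1                     # consume the matched data pair
    return True


# Hypothetical data sequence for a small purchase record.
data = [
    ("P", ()), ("S", ("P",)), ("N", ("P", "S")), ("I", ("P", "S")),
    ("M", ("P", "S", "I")), ("B", ("P",)), ("L", ("P", "B")),
]

# Path query P/S/I/M: every pair appears in order, so it matches.
path_query = [("P", ()), ("S", ("P",)), ("I", ("P", "S")), ("M", ("P", "S", "I"))]

# Query asking for an item under the buyer: no such pair in the data.
missing_query = [("P", ()), ("B", ("P",)), ("I", ("P", "B"))]

print(matches(path_query, data))
print(matches(missing_query, data))
```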
Role of Indexing • To provide an algorithm to perform this sequence matching • Desired features for algorithm: • Efficient support for subsequence matching • Use well-supported DB indexing techniques such as B+ trees • Allow dynamic index insertion
What is indexing useful for • Auxiliary access structures • Used to speed up the retrieval of records • In response to certain search conditions • Provide efficient support for arbitrary structured queries • Using wild-cards // and *
Indexing • State-of-the-art approaches • Indexes on paths • Indexes on nodes • Indexes on both (structures) – ViST
Algorithms • Naïve Algorithm based on Suffix Trees • RIST: Relationships Indexed Suffix Tree • ViST: Virtual Suffix Tree
Algorithm Using Suffix Trees • Suffix Tree: a compact index to all distinct, contiguous substrings of a string • D-Ancestorship – in XML doc tree • Through structure-encoded sequence • S-Ancestorship – in suffix tree
Algorithm Using Suffix Trees • Searches • first by S-Ancestorship: searching under suffix tree • then by D-Ancestorship: matching nodes and prefixes • Disadvantages: • Costly – traverses a large portion of the subtree • Suffix trees are not supported by most commercial DBMSs
RIST: Indexing by Ancestor-Descendant Relationships • Jumps directly to the nodes Y of which X is both a D-Ancestor and an S-Ancestor • Index Construction: uses B+ trees
RIST: Indexing by Ancestor-Descendant Relationships • Subsequence Matching • Determine D-Ancestorship by prefixes • Determine S-Ancestorship by label <n_x, size_x> • x – suffix tree node (root of S-tree) • n_x – prefix traversal order • size_x – number of descendants
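The <n, size> labels can be illustrated with a small sketch. It assumes a simple in-memory trie in place of the suffix tree, numbers the nodes in preorder, and stores the number of descendants as size, so that S-Ancestorship reduces to a range test on n (the kind of test a B+ tree range scan can answer). The class and function names are hypothetical, and single characters stand in for (symbol, prefix) pairs.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # item -> TrieNode
        self.n = 0           # preorder number of this node
        self.size = 0        # number of descendants below this node


def insert(root, sequence):
    """Insert a sequence into the trie and return the node of its last item."""
    node = root
    for item in sequence:
        node = node.children.setdefault(item, TrieNode())
    return node


def label(node, counter=0):
    """Assign n in preorder and size as descendant count; return the next free number."""
    node.n = counter
    next_free = counter + 1
    for child in node.children.values():
        next_free = label(child, next_free)
    node.size = next_free - node.n - 1
    return next_free


def is_s_ancestor(x, y):
    """x is an ancestor of y in the trie iff n_y lies in (n_x, n_x + size_x]."""
    return x.n < y.n <= x.n + x.size


root = TrieNode()
ab = insert(root, "ab")      # node for the last item of "ab"
ac = insert(root, "ac")      # node for the last item of "ac"
a = root.children["a"]       # shared node for the common first item
label(root)

print(is_s_ancestor(a, ab), is_s_ancestor(ab, ac))
```

Labels assigned this way are static: inserting a new sequence changes preorder numbers and forces relabeling, which is the limitation the next slides address.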
ViST: the Virtual Suffix Tree • Same sequence algorithm as RIST • BUT supports dynamic insertions • Uses dynamic method to assign labels • Once assigned, the labels are fixed and are not affected by subsequent data insertion or deletion • Labeling the suffix tree w/o building it • Relies on statistical information about the XML data
ViST: the Virtual Suffix Tree • Index structure contains the sequence: • Sequence to be inserted: • Dynamic scope of x = <n_x, size_x, k_x>
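To make the idea of a dynamic scope more concrete, here is a hedged sketch of one possible allocation rule. It assumes that each new child receives a fixed fraction 1/LAMBDA of the parent's still-unallocated labels; this fixed-fraction rule, the constant `LAMBDA`, the class name `Scope`, and the helper field `next_free` are illustrative assumptions, not the paper's exact formulas, and the sketch uses no statistical information about the data.

```python
LAMBDA = 2  # assumed constant: each new child gets 1/LAMBDA of what remains


class Scope:
    """A dynamic scope <n, size, k>: labels n+1 .. n+size are reserved for descendants."""

    def __init__(self, n, size):
        self.n = n                # label of this node, fixed once assigned
        self.size = size          # number of labels reserved for descendants
        self.k = 0                # number of child subscopes handed out so far
        self.next_free = n + 1    # helper (assumption): first label not yet given away

    def allocate_child(self):
        """Carve a subscope for a new child out of the unallocated range."""
        remaining = self.n + self.size - self.next_free + 1
        if remaining < 1:
            raise OverflowError("no labels left in this scope")
        total = max(1, remaining // LAMBDA)      # child's own label plus its descendants
        child = Scope(self.next_free, total - 1)
        self.next_free += total
        self.k += 1
        return child


root = Scope(0, 15)
first = root.allocate_child()
second = root.allocate_child()
print((first.n, first.size), (second.n, second.size), root.k)
```

Because a child's range is cut from labels nobody else holds, earlier labels never change when later children arrive, and the range test n_x < n_y <= n_x + size_x still decides ancestorship without the tree ever being materialized.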
Experimental Results • Datasets used • DBLP: CS bibliography DB • 289,627 records/publications • Each publication – tree of max depth 6 • Avg length of structure-encoded seq = 31 • XMARK • 1 record • Complicated tree structure • Synthetic
Experimental Results • Comparison Methods • Index Fabric – indexes XML paths • XISS – uses nodes as the basic query unit • ViST – takes approx. 1/10 of the query time of the other methods, which need (multiple) join operations
Experimental Results • Index Structure and Size (1/3 less than suffix tree) • DocId B+ Tree – N elements • Combined D-ancestor and S-ancestor B+ tree – N x L elements • Index Construction
Conclusion • XML Queries = Subsequence Matching • Advantages of ViST – algorithm for subsequence matching • Avoids expensive join operations • Index on both content and structure of XML documents • Relies on B+ trees, which are well supported for disk-based data • Dynamic data insertion and deletion