ViST: a dynamic index method for querying XML data by tree structures

ViST: a dynamic index method for querying XML data by tree structures Authors: Haixun Wang, Sanghyun Park, Wei Fan, Philip Yu Presenter: Elena Zheleva, November 2003

Overview • Modeling XML Queries • Structure-encoded sequences • Indexing • ViST • Experimental Results

Modeling XML Queries

DTD of purchase records: (!ELEMENT purchases (purchase*)) (!ELEMENT purchase (seller, buyer)) (!ATTRIST seller ID ID location CDATA name CDATA) (!ELEMENT seller (item*)) (!ATTRIST buyer ID ID location CDATA name CDATA) (!ELEMENT item (item*)) (!ATTRIST item name CDATA manufacturer CDATA)

Modeling XML Queries • Focus in XML query language design: ability to express complex structural or graphical queries

Modeling XML Queries • Querying XML data = finding sub structures of the data graph that match the sequence • Structure-encoded sequences: a sequential representation of both XML data and XML queries

Structure-Encoded Sequences

Structure-Encoded Sequences • Maps the data and the queries • Matches the subsequence • Purpose: to avoid as many join operations as possible • Def. Sequence of (symbol, prefix) pairs

Mapping Data • Represent XML document/tree in preorder • Represent in structure-encoded seq

Mapping Queries • Benefit of sequence matching: query gets processed as whole • Path Expression

Structure-Encoded Sequences • Query • Data

Querying XML • through Structure-Encoded Sequence Matching

Indexing

Role of Indexing • To provide an algorithm to perform this sequence matching • Desired features for algorithm: • Efficient support for subsequence matching • Use well-supported DB indexing techniques such as B+ trees • Allow dynamic index insertion

What is indexing useful for • Auxiliary access structures • Used to speed up the retrieval of records • In response to certain search conditions • Provide efficient support for arbitrary structured queries • Using wild-cards // and *

Indexing • State-of the-art approaches • Indexes on paths • Indexes on nodes • Indexes on both (structures) – ViST

ViST

Algorithms • Naïve Algorithm based on Suffix Trees • RIST: Relationships Indexed Suffix Tree • ViST: Virtual Suffix Tree

Algorithm Using Suffix Trees • Suffix Tree: a compact index to all distinct, contiguous substrings of a string • D-Ancestorship – in XML doc tree • Through structure-encoded sequence • S-Ancestorship – in suffix tree

Example Using Suffix Trees

Algorithm Using Suffix Trees • Searches • first by S-Ancestorship: searching under suffix tree • then by D-Ancestorship: matching nodes and prefixes • Disadvantages: • Costly – traverse large portion of subtree • Most commercial DBMSs do not support

RIST: Indexing by Ancestor-Descendant Relationships • Jumps directly to the nodes Y to which X is both a D-Ancestor and S-Ancestor • Index Construction: uses B+ trees

RIST: Indexing by Ancestor-Descendant Relationships • Subsequence Matching • Determine D-Ancestorship by prefixes • Determine S-Ancestorship by label <nx,sizex> • x – suffix tree node (root of S-tree) • nx – prefix traversal order • sizex – number of descendants

ViST: the Virtual Suffix Tree • Same sequence algorithm as RIST • BUT supports dynamic insertions • Uses dynamic method to assign labels • Once assigned, the labels are fixed and are not affected by subsequent data insertion or deletion • Labeling the suffix tree w/o building it • Relies on statistical information about the XML data

ViST: the Virtual Suffix Tree Index structure contains the sequence: Sequence to be inserted: Dynamic scope of x = <nx, sizex,kx>

ViST: the Virtual Suffix Tree

Experimental Results • Datasets used • DBLP: CS bibliography DB • 289,627 records/publications • Each publication – tree of max depth 6 • Avg length of structure-encoded seq = 31 • XMARK • 1 record • Complicated tree structure • Synthetic

Experimental Results • Comparison Methods • Index Fabric Algorithm – XML paths • XISS – uses nodes as basic query unit • ViST – appx. 1/10 of time to perform queries due to (multiple) join operations

Experimental Results - remove • Index Structure and Size (1/3 less from suffix tree) • DocId B+ Tree – N elements • Combined D-ancestor and S-ancestor B+ tree - N x L elements • Index Construction

Conclusion • XML Queries = Subsequence Matching • Advantages of ViST – algorithm for subsequence matching • Avoids expensive join operations • Index on both content and structure of XML documents • B+ trees – supported by disk-based data • Dynamic data insertion and deletion

ViST: a dynamic index method for querying XML data by tree structures