Presentation for Cmpe-521 VIST – Virtual Suffix Tree Prepared by: Evren CEYLAN – 2003700163 Aslı UYAR - 2003701321
VIST: A Dynamic Index Method for Querying XML Data by Tree Structures Written by: Haixun Wang, Sanghyun Park, Wei Fan, Philip S. Yu – SIGMOD 2003
What is XML? • XML: Extensible Markup Language • XML is of great importance in data exchange. • Consequently, much research has gone into flexible query mechanisms for extracting data from XML documents.
VIST : Virtual Suffix Tree • The paper proposes VIST as an index for searching XML documents. • XML documents and XML queries are both represented as structure-encoded sequences (explained on later slides). • With this representation, the paper shows that querying XML data is equivalent to finding subsequence matches.
Index Methods in XML • Previous index methods: Disassemble a query into multiple sub-queries, and then join the results of these sub-queries to provide final answers.
What does VIST do? • Converts both XML data and XML queries to structure-encoded sequences. • Uses tree structures as the basic unit of query in order to avoid highly expensive join operations. • In other words, uses structure-encoded sequences instead of nodes or paths.
What does VIST do? • Matches structured queries against structured data as a whole, without breaking down the queries into sub-queries of paths or nodes and relying on join operations. • Supports dynamic index update.
What does VIST do? • The paper shows that VIST is effective and efficient in supporting structural queries.
Introduction • XML has a growing importance in data exchange (extracting data from XML documents). • XML provides a flexible way to define semi-structured data. • The paper introduces a novel index structure called VIST (Virtual Suffix Tree). • VIST offers better performance and usability than previous approaches to XML indexing.
In XML query language design, expressing complex structural or graphical queries is one of the major concerns. • (Figure 2 displays four sample queries in graph form.)
In previous approaches: • i. Indexes are created on paths (e.g. “/P/S/I/M” in Q1). Path indexes can answer simple queries efficiently (Q1 has no branches). • ii. However, queries that involve branching structures (such as Q2) have to be disassembled into sub-queries, whose results are then combined by expensive join operations to produce the final results. • iii. So, these methods are inefficient in handling branching queries.
In the VIST approach: Objective: to provide a general method so that structural XML queries need not be decomposed into sub-queries. Result: no need to perform expensive join operations.
Method: • XML data and XML queries are transformed into “structure-encoded sequences”. • A virtual suffix tree is used to organize the structure-encoded sequences. • VIST also speeds up the matching process.
Structure: • VIST’s index structure has two parts: the D-Ancestor index and the S-Ancestor index (explained on later slides). • VIST unifies structural indexes and value indexes into a single index. • To achieve this, a method called “dynamic virtual suffix tree labeling” is proposed (index updates can be performed directly on B+Trees).
Structure-Encoded Sequences • Sequential representation of both XML Data and XML Queries.
Objective: Modeling XML queries through sequence matching lets us avoid unnecessary join operations in query processing. • Result: Structure-encoded sequences are used instead of paths or nodes.
Mapping Data and Queries to Structure-Encoded Sequences: Stage 1: • Let's consider the purchase record example in Figure 3. • Notation: Capital letters represent names of attributes. • Lowercase letters represent attribute values. • To encode attribute values as integers we use a hash function h( ). • e.g. v1 = h(“dell”) and v2 = h(“ibm”) • v1 and v2 are used to represent “dell” and “ibm” respectively.
Stage 2: • Represent an XML document by the preorder sequence of its tree structure. • e.g. the preorder sequence of the tree in Figure 3 is: P S N v1 I M v2 N v3 I M v4 I N v5 L v6 B L v7 N v8
Stage 3: • Definition: A structure-encoded sequence is a sequence of (symbol, prefix) pairs: D = (a1,p1), (a2,p2), . . . , (an,pn) • ai: a node in the XML document tree. • pi: the path from the root node to node ai.
Figure 3 can be converted into the structure-encoded sequence. • D = ... ... (Figure 4)
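The three stages can be sketched in Python as follows. This is a minimal illustration, not the paper's code: the `Node` class, the `encode` function, the use of an empty string for the root's empty prefix, and the small example tree are all assumptions made here. Attribute values are left as symbolic names (v1, v2) instead of being hashed.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A node of the XML document tree (hypothetical helper type)."""
    symbol: str
    children: list = field(default_factory=list)


def encode(node, prefix=""):
    """Return the structure-encoded sequence of the subtree rooted at `node`.

    The tree is visited in preorder; each node contributes one
    (symbol, prefix) pair, where prefix is the path from the root
    down to the node's parent.
    """
    sequence = [(node.symbol, prefix)]
    for child in node.children:
        sequence.extend(encode(child, prefix + node.symbol))
    return sequence


# Small illustrative tree: P -> S -> (N -> v1, L -> v2)
tree = Node("P", [Node("S", [Node("N", [Node("v1")]),
                             Node("L", [Node("v2")])])])
sequence = encode(tree)
for pair in sequence:
    print(pair)
```

Because the prefix is carried along with every symbol, the tree shape can be recovered from the flat sequence, which is what allows a tree query to be answered by sequence matching.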
Benefits: • The benefit of modeling XML queries through sequence matching is that structural queries can be processed as a whole instead of being broken into smaller query units (paths or nodes of the XML document tree). • This matters because combining the results of sub-queries by join operations is expensive.
The VIST Approach: Presented in 3 stages: • Naïve algorithm based on suffix trees • RIST: improves the naïve algorithm by using B+Trees to index suffix tree nodes • VIST: an index structure that relies only on B+Trees
Requirements • An XML indexing method should meet the following requirements: • It should support structural queries directly. This is done by “structure-encoded sequences”. • Instead of relying on suffix trees, it should use well-established indexing techniques such as B+Trees. • The index structure should allow dynamic data insertion and deletion, etc.
A Naïve Algorithm Based on Suffix Trees • The most widely used index structure for subsequence matching is the suffix tree.
Example: • Two XML documents, Doc1 and Doc2, and two XML queries, Q1 and Q2, as structure-encoded sequences (e denotes the empty prefix): • Doc1 : (P,e) (S,P) (N,PS) (v1,PSN) (L,PS) (v2,PSL) • Doc2 : (P,e) (B,P) (L,PB) (v2,PBL) • Q1 : (P,e) (B,P) (L,PB) (v2,PBL) • Q2 : (P,e) (L,P*) (v2,P*L)
Example: (Cont’d) • A tree structure for Doc1 and Doc2 is shown in Figure 5
Example: (Cont’d) • As shown above, elements of the sequences are represented by nodes in the suffix tree. • Since the nodes are involved in two different trees (the XML document tree and the suffix tree), there are two kinds of ancestor-descendant relationships among the nodes. i ) D-Ancestorship e.g. (S,P) is a D-ancestor of (L,PS) ii ) S-Ancestorship e.g. (v1,PSN) is an S-ancestor of (L,PS)
Naïve Algorithm based on suffix trees: • The NaiveSearch algorithm is based on suffix trees. • It is a naïve method for non-contiguous subsequence matching.
For example, to match Q2: • Start with the root node, which matches the 1st element of Q2, that is (P,e). • Then search under the root for all nodes that match (L,P*), which yields (L,PS) and (L,PB). • Finally, search for - (v2,PSL) under the node labeled (L,PS) - (v2,PBL) under the node labeled (L,PB) • Algorithm 1 searches nodes first by S-Ancestorship, and then by D-Ancestorship.
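The walk-through above can be mimicked on plain sequences. The sketch below is not the paper's NaiveSearch (it scans one sequence at a time instead of walking a suffix tree); it only illustrates non-contiguous subsequence matching with a wildcard that must be bound consistently. The names `bind`, `matches` and `UNBOUND` are hypothetical, the empty prefix is written as an empty string, and it is assumed here that `*` may stand for any run of symbols, including an empty one.

```python
UNBOUND = object()  # marker: the wildcard has not been bound yet


def bind(pattern, prefix, star):
    """Return the wildcard binding if `pattern` matches `prefix`, else None."""
    if "*" not in pattern:
        return star if pattern == prefix else None
    head, tail = pattern.split("*", 1)
    if star is not UNBOUND:
        # The wildcard was bound by an earlier element; reuse that binding.
        return star if head + star + tail == prefix else None
    if (len(prefix) >= len(head) + len(tail)
            and prefix.startswith(head) and prefix.endswith(tail)):
        return prefix[len(head):len(prefix) - len(tail)]
    return None


def matches(doc, query, di=0, qi=0, star=UNBOUND):
    """True if `query` is a (non-contiguous) subsequence of `doc`."""
    if qi == len(query):
        return True
    symbol, pattern = query[qi]
    for k in range(di, len(doc)):
        doc_symbol, doc_prefix = doc[k]
        if doc_symbol != symbol:
            continue
        bound = bind(pattern, doc_prefix, star)
        if bound is not None and matches(doc, query, k + 1, qi + 1, bound):
            return True
    return False


DOC1 = [("P", ""), ("S", "P"), ("N", "PS"), ("v1", "PSN"), ("L", "PS"), ("v2", "PSL")]
DOC2 = [("P", ""), ("B", "P"), ("L", "PB"), ("v2", "PBL")]
Q1 = [("P", ""), ("B", "P"), ("L", "PB"), ("v2", "PBL")]
Q2 = [("P", ""), ("L", "P*"), ("v2", "P*L")]

print(matches(DOC1, Q1), matches(DOC2, Q1))  # Q1 against Doc1, Doc2
print(matches(DOC1, Q2), matches(DOC2, Q2))  # Q2 against Doc1, Doc2
```

Once (L,P*) is matched against (L,PS), the wildcard is fixed to S, so the last query element can only match (v2,PSL); this mirrors the step "search for (v2,PSL) under the node labeled (L,PS)".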
Difficulties of the Naïve Algorithm: • There are difficulties in using a suffix tree to index structure-encoded sequences. • The major difficulty: searching for nodes that satisfy both S-Ancestorship and D-Ancestorship is extremely costly (because we need to traverse a large portion of the subtree for each match).
RIST: Indexing by Ancestor-Descendant Relationships • Improves the naïve algorithm by eliminating the expensive subtree traversals in the suffix tree. • When we reach node X after matching, we can jump directly to those nodes Y of which X is both a D-Ancestor and an S-Ancestor. • So, we no longer need to search among the descendants of X to find the Ys one by one.
RIST Algorithm: • 1. Index the nodes in the suffix tree by their (Symbol,Prefix) pairs. This index is a B+Tree. • i. This enables us to search nodes by their (Symbol,Prefix) pairs, that is, by D-Ancestorship. • ii. This B+Tree is called the D-Ancestorship B+Tree.
RIST Algorithm: • 2. Among all the nodes satisfying D-Ancestorship, we are interested in the ones satisfying S-Ancestorship as well. • i. Labels are created for suffix tree nodes in order to tell the relationship between two nodes. • ii. We use B+Trees to index nodes by labels. • iii. This B+Tree is called the S-Ancestorship B+Tree.
Labeling Notation • <nx, sizex> • nx: preorder (prefix) traversal order of x in the suffix tree. • sizex: total number of descendants of x in the suffix tree. • This labeling is shown in Figure 5.
Labeling Notation • Note: with this labeling, the S-Ancestorship between any two nodes can be decided easily: • If x and y are labeled <nx, sizex> and <ny, sizey>, node x is an S-Ancestor of y if ny ∈ (nx, nx + sizex]
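A small sketch of this labeling follows. It is an illustration under stated assumptions: a trie of whole sequences stands in for the suffix tree, preorder numbering is assumed to start at 0 (the numbers in the original figure may differ), and `build_trie`, `label` and `is_s_ancestor` are hypothetical names. The two sequences are the Doc1 and Doc2 of the earlier example, with the empty prefix written as an empty string.

```python
def build_trie(sequences):
    """Insert each sequence into a nested-dict trie keyed by (symbol, prefix)."""
    root = {}
    for seq in sequences:
        node = root
        for element in seq:
            node = node.setdefault(element, {})
    return root


def label(trie):
    """Map each trie node (identified by its path) to a label (n, size)."""
    labels = {}
    counter = 0

    def visit(children, path):
        nonlocal counter
        for element, subtree in children.items():
            n = counter            # preorder number of this node
            counter += 1
            node_path = path + (element,)
            visit(subtree, node_path)
            labels[node_path] = (n, counter - n - 1)  # size = number of descendants

    visit(trie, ())
    return labels


def is_s_ancestor(x, y):
    """x is an S-ancestor of y if n_y lies in (n_x, n_x + size_x]."""
    (nx, size_x), (ny, _) = x, y
    return nx < ny <= nx + size_x


DOC1 = (("P", ""), ("S", "P"), ("N", "PS"), ("v1", "PSN"), ("L", "PS"), ("v2", "PSL"))
DOC2 = (("P", ""), ("B", "P"), ("L", "PB"), ("v2", "PBL"))

labels = label(build_trie([DOC1, DOC2]))
v1_node = labels[DOC1[:4]]   # node reached by (P)(S)(N)(v1)
l_ps_node = labels[DOC1[:5]] # node reached by ...(L,PS)
print(v1_node, l_ps_node, is_s_ancestor(v1_node, l_ps_node))
```

With these labels the S-Ancestorship test is a constant-time range check, which is what makes it suitable as a B+Tree key range.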
Constructing the B+Trees: • Insert all suffix tree nodes into the D-Ancestorship B+Tree using their (Symbol,Prefix) pairs as keys. • All nodes x inserted with the same (Symbol,Prefix) are indexed by an S-Ancestorship B+Tree, using the nx values of their labels as keys. • Shown in Figure 6
Building the DocID B+Tree: • DocID B+Tree stores for each node x ( using nx as key ), the document IDs of those XML sequences that end up at node x when they are inserted into the suffix tree. • Shown in DocID B+Tree
In summary: • Unlike the naïve algorithm, RIST does not use suffix trees for subsequence matching (it uses the D-Ancestorship B+Tree and the S-Ancestorship B+Tree). • From any node, instead of searching the entire subtree under the node, we can jump to the descendant nodes that match the next element in the query. • So, RIST supports non-contiguous subsequence matching efficiently.
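The "jump" can be sketched with ordinary Python containers. This is not a B+Tree implementation: a dict stands in for the D-Ancestorship B+Tree and sorted lists searched with `bisect` stand in for the S-Ancestorship B+Trees. The node labels below are hard-coded for the Doc1/Doc2 example under the assumption that preorder numbering starts at 0; `NODES`, `DOC_IDS`, `build_index`, `jump`, `search` and `doc_ids` are hypothetical names, and wildcards are not handled.

```python
import bisect
from collections import defaultdict

# (symbol, prefix, n, size) for every suffix tree node of Doc1 and Doc2
NODES = [
    ("P", "", 0, 8), ("S", "P", 1, 4), ("N", "PS", 2, 3), ("v1", "PSN", 3, 2),
    ("L", "PS", 4, 1), ("v2", "PSL", 5, 0),
    ("B", "P", 6, 2), ("L", "PB", 7, 1), ("v2", "PBL", 8, 0),
]
# Stand-in for the DocID B+Tree: n of the node where each sequence ends
DOC_IDS = {5: ["Doc1"], 8: ["Doc2"]}


def build_index(nodes):
    """(symbol, prefix) -> list of (n, size) labels sorted by n."""
    index = defaultdict(list)
    for symbol, prefix, n, size in nodes:
        bisect.insort(index[(symbol, prefix)], (n, size))
    return index


def jump(index, key, scope):
    """Nodes with the given key whose n lies in (n_x, n_x + size_x]."""
    n, size = scope
    entries = index.get(key, [])
    lo = bisect.bisect_right(entries, (n, float("inf")))
    hi = bisect.bisect_right(entries, (n + size, float("inf")))
    return entries[lo:hi]


def search(index, query):
    """Match the query elements one by one, jumping between index entries."""
    frontier = list(index.get(query[0], []))
    for key in query[1:]:
        frontier = [hit for scope in frontier for hit in jump(index, key, scope)]
    return frontier


def doc_ids(hits):
    """Documents that end inside the scope of any finally matched node."""
    return sorted({doc for n, size in hits
                   for m, docs in DOC_IDS.items() if n <= m <= n + size
                   for doc in docs})


index = build_index(NODES)
q1 = [("P", ""), ("B", "P"), ("L", "PB"), ("v2", "PBL")]
q3 = [("P", ""), ("S", "P"), ("L", "PS"), ("v2", "PSL")]  # skips N and v1 in Doc1
print(doc_ids(search(index, q1)), doc_ids(search(index, q3)))
```

Each query step is one key lookup followed by one range scan, so no subtree is traversed node by node.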
VIST: The Virtual Suffix Tree • RIST uses a static scheme to label suffix tree nodes, which prevents it from supporting dynamic insertions. • For any node x labeled <n,size>, later insertions can change the number of nodes that appear before x (in prefix order), • as well as the size of the subtree rooted at x, which means neither n nor size can be fixed.
VIST: The Virtual Suffix Tree • The purpose of the suffix tree is to provide a labeling mechanism to encode S-Ancestorship. • Suppose a node x is created for element di during the insertion of sequence d1, … , di, … , dk.
VIST: The Virtual Suffix Tree • If we can estimate i. how many different elements may follow di in future insertions, and ii. the occurrence probability of each of these elements, • then we can label x’s child nodes immediately instead of waiting until all sequences are inserted.
VIST: The Virtual Suffix Tree (Cont’d) • This also means: • the suffix tree itself no longer needs to be built, because its static labeling mechanism is replaced by labels assigned at insertion time. • dynamic data insertion and deletion are supported.
Top down scope allocation: • A tree structure defines nested scopes: the scope of a child node is a subscope of its parent node, and the root node has the max scope which covers the scope of each node.
Top down scope allocation: • In dynamic scope allocation there is a parameter called λ, which is the expected number of child nodes of any node. • λ is usually assumed to be 2. • Without knowledge of the occurrence rate of each child node, 1/λ of the remaining scope is allocated to each newly inserted child of x, starting with the 1st. • Child1 : <n+1, size/2> • Child2 : <n+1+size/2, size/4>
Dynamic scope of a Suffix Tree Node: • The dynamic scope of a node is a triple <n,size,k>, • where k is the number of subscopes allocated inside the current scope.
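A minimal sketch of this allocation rule is given below. It follows the slide's rule that each new child receives 1/λ of the scope still unallocated in its parent; everything else is an assumption made for illustration: the `Scope` class and function names, the use of integer division, the error raised when a scope is exhausted, and the convention that a scope <n,size> covers the half-open range [n, n+size) so that neighbouring subscopes never share a label.

```python
from dataclasses import dataclass

LAMBDA = 2  # expected number of children per node (the value used on the slide)


@dataclass
class Scope:
    n: int      # label of the node itself
    size: int   # width of the scope reserved for the node and its subtree
    k: int = 0  # number of subscopes already allocated inside this scope


def allocate_child(parent, lam=LAMBDA):
    """Allocate the scope of the next child of `parent` and update parent.k."""
    start = parent.n + 1
    remaining = parent.size
    for _ in range(parent.k):       # replay the k earlier allocations
        taken = remaining // lam
        start += taken
        remaining -= taken
    child_size = remaining // lam   # 1/lam of what is still free
    if child_size < 1:
        raise OverflowError("scope exhausted; a real index would have to handle this case")
    parent.k += 1
    return Scope(start, child_size)


def is_s_ancestor(x, y):
    """Assumed test for dynamic scopes: y lies strictly inside x's range."""
    return x.n < y.n < x.n + x.size


root = Scope(0, 20480)
child1 = allocate_child(root)       # <n+1, size/2>
child2 = allocate_child(root)       # <n+1+size/2, size/4>
grandchild = allocate_child(child1)
print(child1, child2, grandchild)
```

Because every label is fixed at the moment the node is created, later insertions only consume unused parts of existing scopes and never force earlier nodes to be relabeled.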
Algorithm of VIST: • VIST uses the same sequence matching algorithm as RIST. • A dynamic method labels the suffix tree nodes without building the suffix tree.
Algorithm of VIST: • The method relies on rough estimates of the number of attribute values. • Because no suffix tree is actually built, the labeling mechanism is said to be based on a virtual suffix tree.
Example: - let's look at the index structure before and after insertion
Algorithm of VIST: • Suppose that, before the insertion, the index structure already contains the following sequence: Doc1 = (P,e) (S,P) (N,PS) (v1,PSN) (L,PS) (v2,PSL) • The sequence to be inserted => Doc2 = (P,e) (S,P) (L,PS) (v2,PSL)
Assumptions of the Example: • There are 2 assumptions for the algorithm: • Max = 20480 (the scope of the root node) • The dynamic scope allocation method uses the parameter λ = 2