On Boosting Holism in XML Twig Pattern Matching Using Structural Indexing Techniques Ting Chen, Jiaheng Lu, Tok Wang Ling
Outline
• Background
• XML Twig Pattern Query
• Previous Twig Join algorithms
• Limit of the original holistic method TwigStack
• Our holistic Twig Pattern Matching algorithms
• Two Refined Indexing Schemes: Tag+Level and PPS
• A generalized holistic matching theory
• iTwigJoin: a generalized holistic matching algorithm
• Experiments
• Conclusion
On Boosting Holism in XML Twig Pattern Matching using Structural Indexing
Background: XML and Region Coding
• An XML document is modeled as a tree in our work
• Region coding for the XML document tree
• Each element carries a <start, end, level> label
• Containment property: a.start < b.start AND a.end > b.end if and only if a is an ancestor of b
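The labelling and the containment test can be sketched in a few lines of Python. This is only an illustration under assumptions: the `Node` class, the single running counter for start and end positions, and the function names are hypothetical and not taken from the slides; the parent test additionally assumes that a child sits exactly one level below its parent.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    tag: str
    children: List["Node"] = field(default_factory=list)
    start: int = 0
    end: int = 0
    level: int = 0


def assign_region_codes(root: Node) -> None:
    """Label every node with <start, end, level> in one depth-first pass."""
    counter = 0

    def visit(node: Node, level: int) -> None:
        nonlocal counter
        counter += 1
        node.start = counter  # position where the element opens
        node.level = level
        for child in node.children:
            visit(child, level + 1)
        counter += 1
        node.end = counter  # position where the element closes

    visit(root, 1)


def is_ancestor(a: Node, b: Node) -> bool:
    """Containment property from the slide."""
    return a.start < b.start and a.end > b.end


def is_parent(a: Node, b: Node) -> bool:
    """Assumed refinement: a parent is an ancestor exactly one level up."""
    return is_ancestor(a, b) and a.level + 1 == b.level


# Hypothetical document: a contains b and d, b contains c.
c = Node("c")
b = Node("b", [c])
d = Node("d")
a = Node("a", [b, d])
assign_region_codes(a)
```

With labels assigned this way, structural relationships are decided from the labels alone, without walking the tree again.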
Background: XML twig pattern queries
• An XML twig query is a small tree whose edges represent parent-child or ancestor-descendant relationships.
• Given an XML document D and an XML twig query Q, our problem is to find all occurrences of Q in D.
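To make "all occurrences of Q in D" concrete, here is a brute-force reference matcher over region-coded elements. It is a naive sketch for illustration only, not a holistic join and not any algorithm from the talk; `Elem`, `QNode`, the axis strings "/" and "//", and the sample document are all assumptions.

```python
from itertools import product
from typing import List, NamedTuple, Tuple


class Elem(NamedTuple):
    name: str
    tag: str
    start: int
    end: int
    level: int


class QNode(NamedTuple):
    tag: str
    # Each child is (axis, QNode); axis is "/" (parent-child)
    # or "//" (ancestor-descendant).
    children: Tuple = ()


def related(a: Elem, d: Elem, axis: str) -> bool:
    """Check one query edge using region codes only."""
    if not (a.start < d.start and a.end > d.end):
        return False
    return axis == "//" or a.level + 1 == d.level


def matches(q: QNode, e: Elem, doc: List[Elem]) -> List[Tuple[str, ...]]:
    """Matches of the sub-twig rooted at q with q bound to e.

    Each match lists element names in preorder of the query.
    """
    if e.tag != q.tag:
        return []
    per_child = []
    for axis, child in q.children:
        sub = [m for d in doc if related(e, d, axis)
               for m in matches(child, d, doc)]
        if not sub:
            return []  # one branch cannot be satisfied
        per_child.append(sub)
    return [(e.name,) + sum(combo, ()) for combo in product(*per_child)]


def twig_matches(q: QNode, doc: List[Elem]) -> List[Tuple[str, ...]]:
    return [m for e in doc for m in matches(q, e, doc)]


# Hypothetical document: r contains a1 (with b1) and a2 (with b2, c1).
doc = [
    Elem("r", "R", 1, 12, 1),
    Elem("a1", "A", 2, 5, 2), Elem("b1", "B", 3, 4, 3),
    Elem("a2", "A", 6, 11, 2), Elem("b2", "B", 7, 8, 3),
    Elem("c1", "C", 9, 10, 3),
]
path_query = QNode("A", (("//", QNode("B")),))
twig_query = QNode("A", (("//", QNode("C")), ("//", QNode("B"))))
```

On this sample, `twig_matches(path_query, doc)` finds two A//B pairs, while `twig_matches(twig_query, doc)` keeps only the one rooted at `a2`, because `a1` has no C descendant.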
Previous XML Twig Join algorithms
• Edge based: Binary Structural Join [Al-Khalifa et al ICDE02], Join Order Selection [Wu et al ICDE03]
• Path based: BLAS [Chen et al SIGMOD04]
• Tree (holistic) based: TwigStack [Bruno et al SIGMOD02], TwigStackList [Lu et al CIKM04]
• Index based: B tree [Chien et al VLDB02], XR tree [Jiang et al ICDE02], TSGeneric+ [Jiang et al VLDB03]
Holistic Twig Matching
• TwigStack [Bruno et al SIGMOD02]: a holistic twig join algorithm
• E.g., for the query A[.//C]//B there may be many matches to A//B alone, but TwigStack outputs results only for A elements that have both B and C descendants.
• No join order selection is required
• TwigStack is optimal only for twig patterns whose edges are all ancestor-descendant.
• Reordering of elements in a stream does not help. [Choi et al DEXA03]
Sub-optimality of TwigStack
• Not optimal for twigs with parent-child edges
[Figure: example document (elements a1 … an, b1 … bn, c1 … cn) and query (nodes A, B, C)]
Two Refined Streaming Schemes (1)
• To enlarge the class of twig patterns for which holistic matching is optimal, our paper proposes two refined streaming schemes.
• Tag+Level: elements with the same tag and level are grouped together
[Figure: example document (elements a1 … an, b1 … bn, c1 … cn) and query (nodes A, B, C)]
Two Refined Streaming Schemes (1)
• For this query, the Tag+Level streaming scheme can guarantee optimality.
[Figure: example document (elements a1 … an, b1 … bn, c1 … cn) and query (nodes A, B, C)]
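A Tag+Level partition can be sketched as a dictionary keyed by (tag, level). The code below is an illustration under assumptions: the `Elem` record, the function names and the sample document are hypothetical. The helper `child_stream_key` shows one reason the scheme helps with parent-child edges: the children of an element at level L can only sit in the stream at level L + 1.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class Elem(NamedTuple):
    name: str
    tag: str
    start: int
    end: int
    level: int


StreamKey = Tuple[str, int]


def tag_level_streams(doc: List[Elem]) -> Dict[StreamKey, List[str]]:
    """Group elements by (tag, level), each stream in document order."""
    streams: Dict[StreamKey, List[str]] = defaultdict(list)
    for e in sorted(doc, key=lambda e: e.start):
        streams[(e.tag, e.level)].append(e.name)
    return dict(streams)


def child_stream_key(parent_key: StreamKey, child_tag: str) -> StreamKey:
    """The only stream that can hold children with the given tag."""
    _, level = parent_key
    return (child_tag, level + 1)


# Hypothetical document: a1 contains a2 (which contains b1), b2 and c1.
doc = [
    Elem("a1", "A", 1, 10, 1),
    Elem("a2", "A", 2, 5, 2),
    Elem("b1", "B", 3, 4, 3),
    Elem("b2", "B", 6, 7, 2),
    Elem("c1", "C", 8, 9, 2),
]
streams = tag_level_streams(doc)
```

For the edge A/B with `a1` taken from stream `("A", 1)`, only stream `("B", 2)` needs to be read, so `b1` at level 3 is never considered as a child of `a1`.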
Two Refined Streaming Schemes (1)
• But given a more complex query and document, Tag+Level cannot guarantee optimality. For example:
[Figure: example document (elements a1, a2, b1, b2, d1, d2, d3, e1, c1, c2) and query (nodes A, B, C, D)]
Two Refined Streaming Schemes (2)
• Prefix Path Streaming (PPS): elements with the same root-to-node path are grouped together
• In this example, every element in the document is stored in its own stream.
[Figure: document D (elements a1, a2, b1, b2, d1, d2, d3, e1, c1, c2)]
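Grouping by prefix path can be sketched with one preorder traversal. This is an illustrative sketch only: the nested-tuple document format, the function names and the sample document are assumptions, and the sample deliberately puts two elements on the same path to show a shared stream.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# A document node is (name, tag, children); this format is hypothetical.
DocNode = Tuple[str, str, list]
Path = Tuple[str, ...]


def prefix_path_streams(root: DocNode) -> Dict[Path, List[str]]:
    """Group elements by root-to-node tag path, in document order."""
    streams: Dict[Path, List[str]] = defaultdict(list)

    def visit(node: DocNode, prefix: Path) -> None:
        name, tag, children = node
        path = prefix + (tag,)
        streams[path].append(name)  # preorder visit = document order
        for child in children:
            visit(child, path)

    visit(root, ())
    return dict(streams)


def streams_for_tag(streams: Dict[Path, List[str]], tag: str) -> List[Path]:
    """All prefix-path streams whose elements carry the given tag."""
    return [path for path in streams if path[-1] == tag]


# Hypothetical document.
doc = ("a1", "A", [
    ("a2", "A", [
        ("b1", "B", [("c1", "C", [])]),
        ("d1", "D", []),
    ]),
    ("b2", "B", [("c2", "C", []), ("c3", "C", [])]),
])
streams = prefix_path_streams(doc)
```

Here `c2` and `c3` share the path A/B/C and therefore one stream, while `c1` on path A/A/B/C gets its own. A plain tag stream for C is the union of the streams returned by `streams_for_tag(streams, "C")`.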
Two Refined Streaming Schemes (2)
• PPS is optimal for the following example: d1, d2, c1 and c2 are separated into different streams.
[Figure: example document (elements a1, a2, b1, b2, d1, d2, d3, e1, c1, c2) and query (nodes A, B, C, D)]
Two Refined Streaming Schemes (2)
• A natural question: can PPS guarantee optimality for all queries and data?
• The answer is NO.
• For example, c1 and c2 are in the same stream; similarly, e1 and e2 are in the same stream.
[Figure: example document (elements a1 … a4, b1 … b5, c1, c2, d1, d2, e1, e2) with head elements marked, and query (nodes A, B, C, D, E)]
A general algorithm: iTwigJoin
• We propose a general algorithm, called iTwigJoin, which can be used with various data streaming schemes.
• Our key idea is to classify all current head elements into three classes:
• Subtree-matching
• Useless
• Blocked
Classifying Head Elements
• Subtree-matching element: an element e of tag E is called a subtree-matching element for query Q if
• e is in a match to QE (QE is the subtree of Q rooted at E); and
• e is NOT in any future match to QP, where P is the parent of E in Q
• Useless element: an element e is called useless if e is not in any future match to QE.
• Blocked element: an element that is neither subtree-matching nor useless
Example: Classifying Head Elements (Tag+Level Streaming)
[Figure, animated over several frames: document D (elements a1, a2, b1, b2, d1, d2, d3, e1, c1, c2) with the current head elements of the Tag+Level streams marked, shown against queries Q1 and Q2 (nodes A, B, C, D)]
Example: Classifying Head Elements (Tag+Level Streaming)
[Figure: document (elements a1, a2, b1, b2, d1, d2, d3, e1, c1, c2) and two queries over nodes A, B, C, D]
• A useless element can be discarded safely
• A subtree-matching element is pushed onto the corresponding stack
• A blocked element causes problems:
• It CANNOT be discarded, because that may cause loss of results
• It CANNOT be pushed onto the stack, because that may produce useless results
• When all head elements are blocked, optimal holistic matching CANNOT be guaranteed
iTwigJoin
• In our algorithm, in order to output all correct answers, we push blocked elements onto the stack, which may produce useless intermediate results in some cases.
• Example (Tag+Level streaming, query Q1): since all head elements are blocked, we have to push a1 onto the stack and output one path solution (a1, d1).
• If there is no c2, then (a1, d1) is a useless path solution.
[Figure: document (elements a1, a2, b1, b2, d1, d2, d3, e1, c1, c2) and query Q1 (nodes A, B, C, D)]
iTwigJoin
• Two main components
• Stream Manager: controls the advance operation of the streams and sends elements to temporary storage
• Temporary Storage: pushes elements onto stacks and outputs intermediate paths
[Figure: Stream Manager reading streams SA, SB and SC and feeding Temporary Storage]
Flowchart of iTwigJoin
• Label current head elements as subtree-matching, useless or blocked
• If a useless element is found, discard the useless elements
• If not all streams have ended, select a subtree-matching or blocked element e
• Pop some elements from the stack
• Push e onto the stack and output intermediate paths if e is a leaf
Optimal classes of iTwigJoin for three streaming schemes
Streaming scheme: optimal class
• Tag Streaming: A-D only patterns
• Tag+Level Streaming: A-D only or P-C only patterns
• Prefix Path Streaming: A-D only, P-C only or 1-branch-node patterns
• The more refined the streaming scheme, the larger the optimal class; each class contains the one before it.
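The table can be read as a lookup from query shape to the schemes that come with an optimality guarantee. The sketch below encodes that reading under stated assumptions: "A-D/P-C only" is taken to mean "all edges A-D, or all edges P-C", and "1-branch node" is taken to mean exactly one query node with more than one child. The edge-list format and all names are hypothetical.

```python
from collections import Counter
from typing import List, Set, Tuple

# A query edge is (parent, child, axis) with axis "/" (P-C) or "//" (A-D).
Edge = Tuple[str, str, str]

# Optimal classes per streaming scheme, as listed on the slide.
OPTIMAL = {
    "Tag": {"A-D only"},
    "Tag+Level": {"A-D only", "P-C only"},
    "PPS": {"A-D only", "P-C only", "1-branch node"},
}


def query_classes(edges: List[Edge]) -> Set[str]:
    """Classes from the table that a twig query falls into."""
    axes = {axis for _, _, axis in edges}
    fanout = Counter(parent for parent, _, _ in edges)
    branching = sum(1 for n in fanout.values() if n > 1)
    classes = set()
    if axes <= {"//"}:
        classes.add("A-D only")
    if axes <= {"/"}:
        classes.add("P-C only")
    if branching == 1:  # assumed meaning of "1-branch node"
        classes.add("1-branch node")
    return classes


def schemes_with_guarantee(edges: List[Edge]) -> List[str]:
    """Streaming schemes whose optimal class covers the query."""
    classes = query_classes(edges)
    return [scheme for scheme, ok in OPTIMAL.items() if classes & ok]


ad_twig = [("A", "B", "//"), ("A", "C", "//")]
pc_twig = [("A", "B", "/"), ("A", "C", "/")]
mixed_one_branch = [("A", "B", "/"), ("A", "C", "//")]
mixed_two_branch = [("A", "B", "/"), ("A", "D", "//"),
                    ("B", "C", "/"), ("B", "E", "//")]
```

An empty result from `schemes_with_guarantee` only means the query lies outside the listed classes; it does not say that a scheme performs badly on it.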
Experiments
• Benchmarks
• XMark: synthetic data
• TreeBank: real data from the Wall Street Journal
Experiments: I/O Performance
• Queries: Tree1: A-D only; Tree2: P-C only; Tree3: P-C only; Tree4: 1-branch node; Tree5: 1-branch node
• By pruning irrelevant streams, PPS usually scans the fewest elements.
Experiments: Number of Intermediate Paths
• Queries: Tree1: A-D only; Tree2: P-C only; Tree3: P-C only; Tree4: 1-branch node; Tree5: 1-branch node
• For TreeBank 5 there are no matching results, so Tag+Level and PPS do not output any intermediate results.
Experiments: Running Time
• Queries: XMark1: path pattern; XMark2: A-D only; XMark3: P-C only; XMark4: 1-branch node; XMark5: non-optimal
• Tag+Level and PPS perform better than TwigStack and TwigStackList on XMark data.
Experiments: Summary
• Both PPS and Tag+Level help to reduce I/O costs, and PPS saves more.
• PPS may result in too many streams for deep XML data; Tag+Level seems to be a good compromise.
• PPS and Tag+Level completely avoid the output of redundant intermediate paths in all cases we tested, though they cannot guarantee optimality in theory.
Conclusions
• We develop a general algorithm to perform holistic twig joins on the Tag+Level and PPS streaming schemes.
• We identify two I/O-optimal classes for the Tag+Level and PPS streaming schemes.
• Since our experiments show that the Tag+Level streaming scheme produces very few useless intermediate results in most cases, we recommend the Tag+Level scheme for efficient XML twig pattern matching.
END
• Thank you!
• Q & A
Backup: iTwigJoin Algorithm
• While not all streams have ended:
• Label current head elements as matching, useless or blocked
• If any head element is useless, discard it and continue
• Let e1 be the matching element with the smallest startPos
• Let e2 be the blocked element with the smallest endPos
• If e2.endPos < e1.startPos, let e be the blocked element with the smallest startPos; else let e be e1
• Advance the stream that e belongs to
• Pop from e's stack the elements whose endPos < e.startPos
• Push e onto its stack if e has a parent/ancestor in the temporary storage
• Output all paths involving e if the tag of e is a leaf node in Q
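Two steps of the loop above, choosing the next element and cleaning a stack, are simple enough to sketch in isolation. This is not a full iTwigJoin implementation: labelling head elements as matching, useless or blocked is assumed to have been done elsewhere, the fallbacks for an empty matching or blocked list are an assumption, and `Elem`, `select_next` and `pop_closed` are hypothetical names.

```python
from typing import List, NamedTuple, Optional


class Elem(NamedTuple):
    name: str
    start: int  # startPos
    end: int    # endPos


def select_next(matching: List[Elem], blocked: List[Elem]) -> Optional[Elem]:
    """Choose the head element to process next.

    Follows the e1/e2 rule from the slide. When one of the lists is
    empty, falling back to the smallest startPos of the other list is
    an assumption, not something the slide states.
    """
    if not matching and not blocked:
        return None
    if not blocked:
        return min(matching, key=lambda e: e.start)
    if not matching:
        return min(blocked, key=lambda e: e.start)
    e1 = min(matching, key=lambda e: e.start)
    e2 = min(blocked, key=lambda e: e.end)
    if e2.end < e1.start:
        return min(blocked, key=lambda e: e.start)
    return e1


def pop_closed(stack: List[Elem], e: Elem) -> None:
    """Pop elements whose region ends before e starts."""
    while stack and stack[-1].end < e.start:
        stack.pop()


# Hypothetical head elements.
pick_blocked = select_next(
    matching=[Elem("b1", 6, 7)],
    blocked=[Elem("a2", 2, 5), Elem("c1", 3, 4)],
)
pick_matching = select_next(
    matching=[Elem("b1", 3, 4)],
    blocked=[Elem("a1", 1, 10)],
)

stack = [Elem("a1", 1, 10), Elem("a2", 2, 5)]
pop_closed(stack, Elem("b2", 6, 7))
```

In the first call the blocked element `c1` ends before the matching element `b1` starts, so the blocked element with the smallest startPos (`a2`) is chosen; in the second call the matching element wins. After `pop_closed`, only `a1` remains on the stack, since `a2` closed before `b2` opened.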