450 likes | 626 Views
Prefix Path Streaming: a New Clustering Method for XML Twig Pattern Matching. Ting Chen, Tok Wang Ling, Chee-Yong Chan School of Computing, National University of Singapore Sept, 2004. Outline. XML Twig Pattern Matching Background State of the Art: TwigStack Sub-optimality of TwigStack
E N D
Prefix Path Streaming: a New Clustering Method for XML Twig Pattern Matching Ting Chen, Tok Wang Ling, Chee-Yong Chan School of Computing, National University of Singapore Sept, 2004
Outline • XML Twig Pattern Matching • Background • State of the Art: TwigStack • Sub-optimality of TwigStack • Contributions of our Work • PPS: Prefix-Path-Streaming • Notations • Difference of Tag Streaming and PP Streaming • PPS Stream Pruning • PPS Based Twig Pattern Matching • Algorithm: PPS-TwigStack • Analysis • Performance • Conclusion
XML Twig Pattern Matching • XML Data Model • A XML document is commonly modeled as a rooted, ordered and labeled tree. • E.g. Note that identifiers (e.g. b1) are given to tree nodes for easy reference b1 book D1: c1 pf1 c2 chapter preface chapter s1 p1 …………. paragraph section p4 t1 s3 s2 section paragraph title section t2 p2 p3 f3 title paragraph figure paragraph f2 f1 figure figure
XML Twig Pattern Matching • Regional Coding [1] • Node Label: (startPos: endPos, LevelNum) • startPos and endPos are calculated by performing a pre-order traversal of the document tree; • LevelNum is the level of the node in the tree. • E.g. book (0: 50, 1) D1: preface (1:3, 2) chapter (4:22, 2) chapter(23:45, 2) section (5:21, 3) paragraph (2:2, 3) section(13:17, 4) title: (6:6, 4) section(7:12, 4) paragraph(18:20, 4) paragraph(14:16, 5) title: (8:8, 5) figure (19:19, 5) paragraph(9:11, 5) figure (15:15, 6) figure (10:10, 6) M.P. Consens and T.Milo. Optimizing queries on files. In In Proceedings of ACM SIGMOD, 1994.
XML Twig Pattern Matching • What is a Twig Pattern? • A twig pattern is a small tree whose nodes are predicates (e.g. element type test) and edges are either Parent-Child (P-C) edges or Ancestor-Descendant (A-D) edges. • E.g. An XPath query Q1 selects Figure elements which are descendants of some Paragraph elements which in turn are children of Section elements having at least one child element Title Q1: Section[Title]/Paragraph//Figure Section Paragraph Title Figure
XML Twig Pattern Matching • Twig Pattern Matching • Problem Statement • Given a query twig pattern Q, and a XML database D that has index structures (e.g. regional coding scheme) to identify database nodes that satisfy each of Q’s node predicates, compute ALL the answers to Q in D. • E.g. The matches for twig pattern Section[Title]/Paragraph//Figure in the document D1 are: (s1, t1, p4, f3) (s2, t2, p2, f1) b1 D1: c1 c2 pf1 s1 p1 t1 s2 s3 s4 t2 p2 p3 f3 f2 f1
XML Twig Pattern Matching • TwigStack[2]: a Holistic approach • Tag Streaming: all elements of tag q are grouped in a stream Tq ordered by their startPos • Optimal when all the edges in twig pattern are A-D edges • Two-phase algorithm: • Phase 1 TwigJoin: a list of intermediate paths are outputted • Phase 2 Merge: merge the intermediate path list to get the result N. Bruno, D. Srivastava, and N. Koudas. Holistic twig joins: optimal xml pattern matching. In In Proceedings of ACM SIGMOD, 2002.
XML Twig Pattern Matching • TwigStack Review • A node q in a twig pattern Q is coupled with a stack Sq • An element e with tag E is pushed into its stack SE if and only if e is in some match to Q. • E.g.Only color highlighted elements are pushed into their stacks. • Thus it is ensured that no redundant paths are output. • An element e is popped out from its stack if all matches involving it have been reported • Thus we ensure that the memory space used by stacks is bounded. D1: Q: Section[//Title]//Paragraph//Figure b1 SSection c1 c2 pf1 s1 SParagraph p1 t1 s2 s3 s4 STitle t2 p2 p3 f3 f2 SFigure f1
XML Twig Pattern Matching • Optimality of TwigStack for A-D edge only twig pattern • Each stream Tq is scanned only once • Obviously q should appear the twig pattern • No redundant intermediate result: All intermediate paths output in Phase 1 appear in the final result; • CPU and I/O cost: O(|Input| + |Output|) • Space Complexity: O(|Longest Path in the XML tree|) • E.g. Q2:Section[//Title]//Paragraph//Figure on document D1 • Phase 1 Output: • <s1,t1>, <s1,t2>, <s2,t2> • <s1,p2,f1>, <s2,p2,f1>, <s1,p3,f2>, <s1,p4,f3>, Note that <s3,p3,f2> will not be output ; although it matches one root to leaf path of Q2, it is not in a match to Q2. • Phase 2 (i.e. the final) output: • < s1,t1,p2,f1>, < s1,t1,p3,f2>, < s1,t1,p4,f3>, < s1,t2,p2,f1>, < s1,t2,p3,f2>, < s1,t2,p4,f3>, < s2,t2,p2,f1>
XML Twig Pattern Matching • TwigStack is notoptimal for twig pattern with P-C edge • E.g. Q1:Section[Title]/Paragraph//Figure on document D1 • The only match involving s1 is • <s1,t1,p4,f3> • To know s1 is in a match requires to advance the stream Tpara top4. p2 and p3 should be read. • p2 is in a match <s2,t2,p2,f1> to Q1 but p3 is not. However, since s2 and s3 have not been read, both p2 and p3 should be stored. • Space complexity is no longer • O(|Longest Path in the XML tree|) Tsection s1 s2 s3 Ttitle Tpara t1 t2 p1 p2 p3 p4 Tfigure f1 f2 f3
XML Twig Pattern Matching • TwigStack is notoptimal for twig pattern with P-C edge • Remove elements p4 and f3 in document D1 to get D2 • To answer Q1:Section[Title]/Paragraph//Figure on document D2 • TwigStack will push s1 into stack Ssection because s1 is in a match to Section[//Title]//Paragraph//Figure (instead of Q1) • Consequently, a redundant intermediate path s1/t1 will be output. D1: Matches to Q1 are highlighted D2: Matches to Q1 are highlighted b1 b1 c1 c2 c1 c2 pf1 pf1 s1 s1 p1 p1 t1 s2 s3 s4 t1 s2 s3 t2 p2 p3 f3 t2 p2 p3 f2 f1 f2 f1
XML Twig Pattern Matching • Why is TwigStacknot optimal for twig pattern with P-C edge ? • Match Order • The only match involving s1 is • m1 = <s1,t1,p4,f3> • The only match involving s2 is • m2 = <s2,t2,p2,f1> • s1 is ahead of s2 in Tsection • but • p4 is behind p2 in stream Tpara • m1 doesn’t have all its elements be ahead of the corresponding elements of m2 in their streams. Neither does m2. • Q1:Section[Title]/Paragraph//Figure Tsection s1 s2 s3 Ttitle Tpara t1 t2 p1 p2 p3 p4 Tfigure f1 f2 f3
XML Twig Pattern Matching • Why is TwigStackoptimal for twig pattern with only A-D edge? • Match Order • s1 is in a match m1 = <s1,t1,p2,f1> to Q • The only match involving s2 is • m2 = <s2,t2,p2,f1> • Now m1 has all its elements be aheadof the corresponding elements of m2 in their streams. • It is possible to know whether s1 is in a match to Qwithout skipping any matching elements of s2 to Q. • Q:Section[//Title]//Paragraph//Figure Tsection s1 s2 s3 Ttitle Tpara t1 t2 p1 p2 p3 p4 Tfigure f1 f2 f3
XML Twig Pattern Matching • Tag Streaming can not achieve optimality. How about other streaming method? • Solution: Divide elements of the same tag into more than one stream as the following diagram shows • E.g: Notice that now the match of s1 is not “blocked” by that of s2 any more. • Question: How can the division be done? • Q1:Section[Title]/Paragraph//Figure Tsection s1 s1 s2 s3 t1 s2 s3 p4 Ttitle Tpara p1 p2 p3 p4 t1 t2 f3 t2 p2 p3 Tfigure f1 f2 f3 f1 f2
Our Contributions • Prefix Path Streaming: a New XML streaming scheme for XML twig pattern matching. • PPS reduces I/O cost by pruning irrelevant streams. • PPS allows optimal processing of a large class of twig patterns with both A-D and P-C edges. • Optimality achieved by outputting only intermediate paths (as Phase 1 output) which appears in final results for the class of twig patterns. • The class of twig pattern includes: • A-D only twig pattern (with any number of branch nodes) ; and • P-C only twig pattern (with any number of branch nodes); and • 1-branch node twig pattern (which can have both A-D and P-C edges)
XML Twig Pattern Matching • Our contributions XML Twig Pattern P-C edge only Twig Pattern A-D edge only Twig Pattern 1-branchnode Twig Pattern Optimal for PP Streaming Optimal for Tag Streaming
Prefix Path Streaming • The prefix-path of an element in a XML document tree is the path from the document root to the element. • A prefix-path stream (PP stream)contains all elements in the document tree with the same prefix-path, ordered by their startPos numbers. • Example: • For the sample XML document D1, there are 11 PP streams compared with 7 streams generated by Tag Streaming. • For example, instead of a single stream Tsectionin Tag Streaming, now we have two PP streams: Tbook/chapter/sectionand Tbook/chapter/section/section.
Prefix Path Streaming • Notations • Twig Pattern • Q: a twig pattern • q : a node in twig pattern. • Qq: the sub-twig of Q rooted at q • We do not allow nodes with the same tag in a twig pattern; therefore we will use node q and tag q interchangeably according to the context • Element e is used to refer to a node in an XML document tree; the capital letter E is used to denote the tag of e • Element Stream • Tis used to denote an elementstream • Tag Stream is denoted by Tqwhere q is the tag name of elements in Tq. • PP Stream is denoted by Tprefix-path where prefix-path is the common prefix path of elements inTprefix-path • A PP Stream is of class q if its elements are of tag q. • A match to a twig pattern is denoted by M
Prefix Path Streaming • PP Streaming allows some degree of “parallel” access to sequential files containing elements with the same tag. More formally, • Definition 1: (Match Order) • Intuitively a match MAis ahead of another match MB if no element of MAis placed behind any element of MB in the same stream (either tag stream or PP stream) • Note that the definition of Match Order can be applied to both Tag Streaming and PP streaming
Tag Streaming Neither M1 ≤ M2 nor M2 ≤ M1 because of (s1,s2) and (p2, p4) Informally, we can think M1 and M2 are “tangled”. PP Streaming M1 ≤ M2 and M2 ≤ M1 M1 and M2 are not “tangled” Prefix Path Streaming • Match Order • Example: Q1:Section[Title]/Paragraph//Figure • M1: <s1,t1,p4,f3> and M2: <s2,t2,p2,f1> TB/C/S s1 Tsection s1s2 s3 s2 s3 p4 TB/C/S/P TB/C/S/T TB/C/S/S t1 Ttitle Tpara TB/C/S/S/T t1t2 p1 p2 p3 p4 t2 p2 p3 f3 TB/C/S/S/P TB/C/S/P/F Tfigure TB/C/S/S/P/F f1 f2f3 f1 f2
Properties of Prefix Path Streaming • Definition (Element Order) • Example • Under Tag Streaming for the XML document and Twig Pattern Q1 in the previous slide • t1 ≤ Q1 s2 ; but s2 ≤ Q1 t1 does not hold because t2 is placed behind t1 in the stream TTitle • Neither s1 ≤ Q1 s2 nor s2 ≤ Q1 s1 • Under PP Streaming • s1 ≤ Q1 s2 and s2 ≤ Q1 s1 • Intuitively, a ≤Q b means it is possible to know a is in match to QA without skipping any match involving b to QB.
Properties of Prefix Path Streaming • Why can Tag Streaming provide optimal solution for A-D edge only twig pattern? • PP Streaming has similar property for a larger class of twig pattern • Informally, referring back to slide 20, Lemma 1 says under Tag Streaming, there is no problem of “tangle” for A-D only twig pattern; Lemma 2 says under PP Streaming, there is no problem of “tangle” for all three classes of twig patterns. • Lemma 2 makes it possible to devise optimal algorithms for a larger class of twig patterns on PP streams
Prefix Path Stream Pruning • Using the set of PP Stream labels, we can prune away many PP streams without accessing the data • E.g for Q1:Section[Title]/Paragraph//Figure , the PP stream in document D1 TBook/Preface/Paragraph can be pruned • Details please refer to our paper [3].
PP Stream Based Twig Pattern Matching • PPS-TwigStack: • Similar to TwigStack, a stack Sq is coupled with a node q in the twig pattern. • Using stacks, exponential number of matches can be encoded in linear space. For more details, see [2]. • E.g. SSection Section Title Paragraph STitle SParagraph SFigure Figure
PP Stream Based Twig Pattern Matching TA …… TB e := getNext() • Overview • Similar to TwigStack, our algorithm operates in two phases (See slide 7). • Our goal is to reduce the number of intermediate paths output in Phase 1 Twig Join (See slide 7). • Flow Chart of the procedure of Phase 1 • In each step getNext() finds a “potential” matching element e • Depending on the current stack states, e is then being pushed into its stack SE or being used only for updating stack matching states. (See TwigStack for details) • Phase 1 ends when all leaf streams (streams whose tag is leave node of Q) • Phase 2 “Merge” of TwigStack remains unchanged. Pop elements from stack SE and Sparent(E)which are not ancestor of e NO If e should be pushed into its stack* YES Push e to its stack and output a path if e is a leaf element Advance the stream e belongs to NO If all leaf streams end YES END *: e is pushed into its stack SEif it has a parent or ancestor element in stack Sparent(E)
PP Stream Based Twig Pattern Matching • Which element will getNext() select in the remaining elements? • e of tag E is in a match to QEand e ≤Q e’ for each remaining element e’; and • e does not have a parent (ancestor)* element p of tag parent(E) such that p is in a match to Qparent(E). And Additionally, to correctly maintain the stack states, we require • e has the smallest startPosamong elements which satisfy condition (1) and (2) and have tag E or Sibling(E). Sibling(E) is a node which is sibling node of E in Q. • Why? • To find e, no element appearing in matches will be skipped because of (1) and thus we don’t need to store any elements. • (2) ensures that an ancestor element in a match to Q will always be found before its descendant element. • (3) ensures that when an element is popped out of its stack, all of its matches have been output. *: depending on the edge type of <E,parent(E)>
PP Stream Based Twig Pattern Matching • Example: PPS-TwigStack • Q1:Section[Title]/Paragraph//Figure • 1st call getNext() returns s1 • s1is then pushed intoSsection • Note that s1 is in a match to QSection which is Q1 itself. TB/C/S s1 TB/C/S/T p4 SSection s2 s3 TB/C/S/P TB/C/S/S t1 s1 TB/C/S/S/T t2 p2 p3 f3 STitle SParagraph TB/C/S/P/F TB/C/S/S/P TB/C/S/S/P/F f1 f2 SFigure current stream cursor position current stream end
PP Stream Based Twig Pattern Matching • Example: PPS-TwigStack • Q1:Section[Title]/Paragraph//Figure • 2nd call getNext() returns t1 • t1 is then pushed into STitle • Path s1/t1 outputted; since s1 is in a match to Q1, s1/t1 will be in the final match TB/C/S s1 TB/C/S/T p4 s2 s3 TB/C/S/P TB/C/S/S t1 SSection TB/C/S/S/T s1 t2 p2 p3 f3 TB/C/S/S/P TB/C/S/P/F SParagraph STitle t1 TB/C/S/S/P/F f1 f2 SFigure current stream cursor position current stream end
PP Stream Based Twig Pattern Matching • Example: PPS-TwigStack • Q1:Section[Title]/Paragraph//Figure • 3rd call getNext() returns s2 • s2 is then pushed into Ssection TB/C/S s1 s2 SSection TB/C/S/T s1 p4 s2 s3 TB/C/S/P TB/C/S/S t1 TB/C/S/S/T SParagraph STitle t2 p2 p3 f3 TB/C/S/S/P TB/C/S/P/F SFigure TB/C/S/S/P/F f1 f2 current stream cursor position current stream end
PP Stream Based Twig Pattern Matching • Example: PPS-TwigStack • Q1:Section[Title]/Paragraph//Figure • 4th call getNext() returns t2 • t2 is then pushed into Stitle • Path s2/t2outputted • Note that p4 is not selected because it satisfies all conditions but (3); the reason is that by selecting p4, s2will be popped out prematurely. TB/C/S s1 TB/C/S/T p4 s2 s3 TB/C/S/P TB/C/S/S t1 TB/C/S/S/T t2 s2 p2 p3 f3 SSection TB/C/S/S/P TB/C/S/P/F s1 TB/C/S/S/P/F SParagraph f1 f2 STitle t2 current stream cursor position SFigure current stream end
PP Stream Based Twig Pattern Matching • Example: PPS-TwigStack • Q1:Section[Title]/Paragraph//Figure • 5th call getNext() returns p2 • p2 is pushed into Sparagraph TB/C/S s1 TB/C/S/T p4 s2 s3 TB/C/S/P TB/C/S/S t1 s2 SSection s1 TB/C/S/S/T t2 p2 p3 f3 SParagraph TB/C/S/S/P TB/C/S/P/F STitle p2 TB/C/S/S/P/F f1 f2 SFigure current stream cursor position current stream end
PP Stream Based Twig Pattern Matching • Example: PPS-TwigStack • Q1:Section[Title]/Paragraph//Figure • 6th call getNext() returns f1 • f1 is pushed into SFigure • Path s2/p2/f1outputted TB/C/S s1 TB/C/S/T p4 s2 s3 TB/C/S/P TB/C/S/S t1 s2 SSection s1 TB/C/S/S/T t2 p2 p3 f3 SParagraph TB/C/S/S/P TB/C/S/P/F STitle p2 TB/C/S/S/P/F f1 f2 SFigure f1 current stream cursor position current stream end
PP Stream Based Twig Pattern Matching • Example: PPS-TwigStack • Q1:Section[Title]/Paragraph//Figure • 7th call getNext() returns p3 • s3 is skipped because it is not in any match to Qsection • p3 is selected but discarded because it doesn’t have a parent element in Ssection • p2 and s2 are popped out because they are not ancestor of p3 TB/C/S s1 TB/C/S/T p4 s2 s3 TB/C/S/P TB/C/S/S t1 TB/C/S/S/T t2 s2 SSection p2 p3 f3 s1 TB/C/S/S/P TB/C/S/P/F SParagraph STitle p2 TB/C/S/S/P/F f1 f2 current stream cursor position SFigure current stream end
PP Stream Based Twig Pattern Matching • Example: PPS-TwigStack • Q1:Section[Title]/Paragraph//Figure • 8th call getNext() returns f2 • f2 is then discarded because it does not have an ancestor element in Sparagraph TB/C/S s1 TB/C/S/T p4 s2 s3 TB/C/S/P TB/C/S/S t1 SSection s1 TB/C/S/S/T t2 p2 p3 f3 SParagraph TB/C/S/S/P TB/C/S/P/F STitle TB/C/S/S/P/F f1 f2 SFigure current stream cursor position current stream end
PP Stream Based Twig Pattern Matching • Example: PPS-TwigStack • Q1:Section[Title]/Paragraph//Figure • 9th call getNext() returns p4 • p4 is pushed into Sparagraph TB/C/S s1 TB/C/S/T p4 s2 s3 SSection TB/C/S/P TB/C/S/S t1 s1 TB/C/S/S/T t2 SParagraph STitle p4 p2 p3 f3 TB/C/S/S/P TB/C/S/P/F TB/C/S/S/P/F SFigure f1 f2 current stream cursor position current stream end
PP Stream Based Twig Pattern Matching • Example: PPS-TwigStack • Q1:Section[Title]/Paragraph//Figure • 10th call getNext() returns f3 • f3 is pushed into SFigure • Path s1/p4/f3 outputted • Matching ends because all streams end. TB/C/S s1 TB/C/S/T p4 s2 s3 TB/C/S/P TB/C/S/S t1 SSection TB/C/S/S/T s1 t2 p2 p3 f3 TB/C/S/S/P TB/C/S/P/F SParagraph STitle p4 TB/C/S/S/P/F f1 f2 SFigure f3 current stream cursor position current stream end
PP Stream Based Twig Pattern Matching • Just like TwigStack for A-D only twig patterns, for the three classes of twig pattern PPS-TwigStack • Push into stacks only elements which appear in the final results • No redundant paths are output as the intermediate results • getNext() in PPS-TwigStack selects element e of tag Esuch that it is in a match to sub-twigQE • getNext() is a recursive method as its counterpart in TwigStack • It works by ensuring for each child node Ei of E in Q, • there exists an element ei such that ei is in a match of sub-twig QEi • e and ei satisfy the A-D or P-C relationship of edge <E,Ei> • getNext() in PPS-TwigStack is more complex than that of TwigStack because the descendant elements of an element e can scatter on several PP streams • For each child node Ei of E in Q, we consider all possible PP streams of class Ei to find the suitable child/descendant element ei of e.
PP Stream Based Twig Pattern Matching • Example: Call stacks of the 1stgetNext() • element s1 is returned • The current cursor position of each stream is denoted in parentheses. e.g. TBCS(s1) getNext(Section) <section, {TB/C/S(s1),TB/C/S/S(s2)}> getNext(title) getNext(Paragraph) <paragraph, {TB/C/S/S/P(p2),TB/C/S/P(p4)} > <title, {TB/C/S/T(t1),TB/C/S/S/T(t2)}> getNext(Figure) < figure, {TB/C/S/S/P/F(f1),TB/C/S/P/F(f3)} >
PP Stream Based Twig Pattern Matching • Analysis of PPS-TwigStack • Perform a single forward scan of all useful PP streams. • No redundant intermediate paths • The worst case I/O complexity is linear in the sum of sizes of the useful PP streams and the output list. • CPU complexity is O((|no. of PP streams after pruning|) * (|Input list|+|Output list|)), • which is also independent of the size of partial matches to any root-to-leaf path of the twig. • The space complexity is the maximum length of a root-to-leaf path. • If a twig pattern is not in any of the three classes afore-mentioned, we can simply run TwigStack by simulating a single stream Tq in Tag Streaming using multiple PP streams of class q.
Performance • The impact of Prefix-Path streaming on XML document storage. • In most practical cases, the total number of PPS streams in XML documents can be considered as linear to the number of different tags. • Compared with the space used by Tag Streaming approach, PP streaming requires little extra space. • Fragmentation Factor = space used by PPS / space used by Tag Streaming
Performance • Effects of Prefix-Path stream pruning. • We measure the effect of PPS pruning on several twig queries in XMark benchmark . • Pruning can reduce the I/O cost significantly Query 1: site/people/person/name Query 2: site/open_auctions/open_auction/bidder/increase/ Query 3: site/closed_auctions/closed_auction/price Query 4: site/regions//item Query 5: site/people/person[//age and //income]/name
Performance • Performance Comparison of PPS-TwigStack and TwigStack • The running times of PPS-TwigStack (with pruning) and the TwigStack on the five queries on a XMark document of size 200M Bytes • PPS-TwigStack out-performs TwigStack in all the queries except Q3 which PPS-pruning can’t find any irrelevant PPS stream.
Performance • Performance Comparison of PPS-TwigStack and TwigStack • Even there is no PP streams which can be pruned, PPS-TwigStack still outperforms TwigStack because it outputs fewer intermediate paths. • We synthesize XML documents which have a lot of partial matches to branches of twig pattern and there is no PP stream can be pruned. • We control and vary the ratio of portions of XML document that contain partial matches and that contain full matches. • Run PPS-TwigStack and TwigStack on a 300 MB documents and twig pattern A/B[C//D]//E. Notice PPS-TwigStack is optimal for the query.
Conclusion • We propose a novel clustering method Prefix-Path Streaming (PPS) for XML twig pattern matching. • PP Streaming can reduce I/O cost by avoiding scanning of irrelevant PP streams. • PP Streaming optimal processing of a large class of twig patterns with both A-D and P-C edges. • The work is published in [3].
References: • M.P. Consens and T.Milo. Optimizing queries on files. In In Proceedings of ACM SIGMOD, 1994. • N. Bruno, D. Srivastava, and N. Koudas. Holistic twig joins: optimal xml pattern matching. In In Proceedings of ACM SIGMOD, 2002. • T. Chen, T. W. Ling, C.Y. Chan, Prefix Path Streaming: a New Clustering Method for XML Twig Pattern Matching. In In Proceedings of DEXA, 2004.