CIKM 2004, Washington, D.C., U.S.A.
Efficient Processing of XML Twig Patterns with Parent Child Edges: A Look-ahead Approach
Jiaheng Lu, Ting Chen, Tok Wang Ling
National University of Singapore
Nov. 11, 2004
Outline
• ☞ XML Twig Pattern Matching
• Problem definition
• State of the Art: TwigStack
• Sub-optimality of TwigStack
• Our algorithm: TwigStackList
• Performance
• Conclusion
XML Twig Pattern Matching
• An XML document is commonly modeled as a rooted, ordered, and labeled tree.
[Figure: an example document tree rooted at book, with preface and chapter children; the nested section elements contain title, paragraph, and figure elements, and the text values are “Intro”, “Data”, and “XML”.]
Regional Coding
• Node label [1]: (startPos : endPos, LevelNum)
• E.g., in document order: book (0:32, 1), preface (1:3, 2), “Intro” (2:2, 3), chapter (4:29, 2), section (5:28, 3), title (6:8, 4), “Data” (7:7, 3), section (9:17, 4), title (10:12, 5), “XML” (11:11, 3), paragraph (13:16, 5), figure (14:15, 6), section (18:23, 4), paragraph (19:22, 5), figure (20:21, 6), paragraph (24:27, 4), figure (25:26, 5), chapter (30:31, 2)
1. M. P. Consens and T. Milo. Optimizing queries on files. In Proceedings of ACM SIGMOD, 1994.
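The region codes above let structural relationships be decided from the labels alone, without walking the tree. The following is a minimal sketch, assuming the usual containment reading of (startPos : endPos, LevelNum); the names `Region`, `is_ancestor`, and `is_parent` are illustrative and do not come from the slides. The three labels are taken from the example above.

```python
from typing import NamedTuple


class Region(NamedTuple):
    """Region code of one node: (startPos : endPos, LevelNum)."""
    start: int
    end: int
    level: int


def is_ancestor(a: Region, d: Region) -> bool:
    # a is an ancestor of d when d's interval lies strictly inside a's interval.
    return a.start < d.start and d.end < a.end


def is_parent(p: Region, c: Region) -> bool:
    # A parent is an ancestor exactly one level above the child.
    return is_ancestor(p, c) and c.level == p.level + 1


# Labels copied from the slide: section (5:28, 3), paragraph (24:27, 4), figure (25:26, 5).
section = Region(5, 28, 3)
paragraph = Region(24, 27, 4)
figure = Region(25, 26, 5)

print(is_ancestor(section, figure))  # True: 5 < 25 and 26 < 28
print(is_parent(section, figure))    # False: the levels differ by 2
print(is_parent(paragraph, figure))  # True: contained and one level deeper
```

An ancestor-descendant (A-D) edge needs only the interval test, while a parent-child (P-C) edge additionally needs the level test.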
What is a Twig Pattern?
• A twig pattern is a small tree whose nodes are tags, attributes, or text values and whose edges are either Parent-Child (P-C) edges or Ancestor-Descendant (A-D) edges.
• E.g., select Figure elements that are descendants of Paragraph elements, which in turn are children of Section elements having a child element Title.
• Twig pattern: Section[Title]/Paragraph//Figure
XML Twig Pattern Matching
• Problem Statement
• Given a query twig pattern Q and an XML database D, we need to compute ALL the answers to Q in D.
• E.g., consider Q1 and Doc1.
[Figure: Q1 is a twig with root Section and two branches, title and figure; Doc1 is a tree with elements s1, t1, s2, t2, p1, and f1.]
• Query solutions:
• (s1, t1, f1)
• (s2, t2, f1)
• (s1, t2, f1)
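The problem statement can be made concrete with a brute-force evaluator, which is what holistic algorithms are designed to avoid. This is only a sketch under stated assumptions: the shape of Doc1 (s1 contains t1 and s2; s2 contains t2 and p1; p1 contains f1), the region codes, and the reading of both edges of Q1 as ancestor-descendant edges are guesses chosen to be consistent with the three solutions listed above. The names `doc`, `contains`, `by_tag`, and `match_q1` are illustrative.

```python
from itertools import product

# Hypothetical region codes for Doc1: name -> (tag, startPos, endPos, level).
# The slide does not give these numbers; they are assumptions for illustration.
doc = {
    "s1": ("section", 1, 12, 1),
    "t1": ("title", 2, 3, 2),
    "s2": ("section", 4, 11, 2),
    "t2": ("title", 5, 6, 3),
    "p1": ("paragraph", 7, 10, 3),
    "f1": ("figure", 8, 9, 4),
}


def contains(anc: str, desc: str) -> bool:
    # Ancestor-descendant test on region codes (interval containment).
    return doc[anc][1] < doc[desc][1] and doc[desc][2] < doc[anc][2]


def by_tag(tag: str) -> list:
    # All element names carrying the given tag, in document order.
    return [name for name, rec in doc.items() if rec[0] == tag]


def match_q1() -> list:
    # Q1, assumed here to be section[//title]//figure: try every combination.
    return [
        (s, t, f)
        for s, t, f in product(by_tag("section"), by_tag("title"), by_tag("figure"))
        if contains(s, t) and contains(s, f)
    ]


print(match_q1())
```

Under these assumptions the evaluator returns the same three tuples as the slide. It inspects every combination of candidate elements, so its cost grows with the product of the stream sizes.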
Previous work: TwigStack
• TwigStack [2]: a holistic approach
• Two-phase algorithm:
• Phase 1 (TwigJoin): intermediate root-to-leaf path solutions are output.
• Phase 2 (Merge): the intermediate path lists are merged to get the result.
2. N. Bruno, D. Srivastava, and N. Koudas. Holistic twig joins: optimal XML pattern matching. In Proceedings of ACM SIGMOD, 2002.
Previous work: TwigStack
• Each node q in a twig pattern Q is associated with a stack S_q.
• Insertion into and deletion from a stack S_q:
• Insertion: an element e_q from stream T_q is pushed onto its stack S_q if and only if
• e_q has a descendant e_qi in each T_qi, where q_i is a child of q, and
• each element e_qi recursively has the first property.
• Deletion: an element e_q is popped from its stack once all matches involving it have been output.
Sub-optimality of TwigStack
• TwigStack is I/O optimal only for queries whose edges are all ancestor-descendant edges.
• Unfortunately, TwigStack is sub-optimal for queries with any parent-child edge.
• For queries with parent-child relationships, TwigStack may output a large number of intermediate results that are not merge-joinable to any final solution.
Sub-optimality of TwigStack: an example
[Figure: a simple XML tree with elements s1, t1, p1, t2, and f1, next to the twig pattern with nodes Section, title, paragraph, and figure.]
• Since s1 has descendants t1 and p1, and p1 in turn has descendant f1, TwigStack outputs an intermediate path solution <s1,t1>.
• But this path is useless, because there is no solution for this example at all.
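The gap in this example can be reproduced with a small check that evaluates the same twig twice: once treating every edge as ancestor-descendant, as in the push condition described for TwigStack, and once honoring parent-child edges. This is a brute-force sketch, not the streaming algorithm. The tree shape (p1 contains t2, which contains f1), the region codes, the choice of paragraph/figure as the only P-C edge, and the names `QNode`, `related`, and `has_extension` are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class QNode:
    """One twig node; `edge` is the edge type from its parent ("AD" or "PC")."""
    tag: str
    edge: str = "AD"
    children: list = field(default_factory=list)


# Hypothetical region codes: name -> (tag, startPos, endPos, level).
elements = {
    "s1": ("section", 1, 10, 1),
    "t1": ("title", 2, 3, 2),
    "p1": ("paragraph", 4, 9, 2),
    "t2": ("title", 5, 8, 3),
    "f1": ("figure", 6, 7, 4),
}


def related(anc: str, desc: str, edge: str) -> bool:
    # Containment for A-D edges; containment plus adjacent levels for P-C edges.
    _, a_start, a_end, a_level = elements[anc]
    _, d_start, d_end, d_level = elements[desc]
    inside = a_start < d_start and d_end < a_end
    return inside and (edge == "AD" or d_level == a_level + 1)


def has_extension(e: str, q: QNode, honor_pc: bool) -> bool:
    # True if element e can be extended to a match of the sub-twig rooted at q.
    for child in q.children:
        edge = child.edge if honor_pc else "AD"
        if not any(
            elements[c][0] == child.tag
            and related(e, c, edge)
            and has_extension(c, child, honor_pc)
            for c in elements
        ):
            return False
    return True


twig = QNode("section", children=[
    QNode("title"),
    QNode("paragraph", children=[QNode("figure", edge="PC")]),
])

print(has_extension("s1", twig, honor_pc=False))  # descendant-only test
print(has_extension("s1", twig, honor_pc=True))   # test that respects P-C edges
```

With these assumed labels the descendant-only test accepts s1, which is why a path such as <s1,t1> is emitted, while the test that respects the P-C edge rejects it.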
Main problem and our experiment
• TwigStack might output intermediate results that do not contribute to any query answer.
• To understand the problem better, we ran TwigStack on a real data set.
• Data set: TreeBank (from the University of Washington XML datasets)
• Queries:
• Q1: VP[/DT]//PRP_DOLLAR_
• Q2: S//NP[//PP/TO][/VP/_NONE_]/JJ
• Q3: S[/JJ]/NP
• All queries contain parent-child relationships.
Our experimental results
• Most intermediate paths do not contribute to final answers because of parent-child edges!
• Improving TwigStack to answer queries with parent-child edges is a big challenge.
Intuition for improvement
[Figure: the same simple XML tree with elements s1, t1, p1, t2, and f1, next to the twig pattern with nodes Section, title, paragraph, and figure.]
• Our intuitive observation: why not read more paragraph elements and cache them in main memory?
• For example, after we scan p1, we do not stop but continue to read the next paragraph element. We then find that there is only one paragraph element and that f1 is not a child of it. So we should not output any intermediate solution.
Outline
• XML Twig Pattern Matching
• Problem definition
• State of the Art: TwigStack
• Sub-optimality of TwigStack
• ☞ Our algorithm: TwigStackList
• Experimental results
• Conclusion
Our main idea
• Main idea: we read more elements from the input streams and cache some of them in main memory, so that we can decide more accurately whether an element can contribute to a final answer.
• But we cannot cache too many elements in main memory. For each node q in the twig query, the number of elements with tag q cached in main memory should not be greater than the length of the longest path in the XML data set.
Our caching method
• Which elements should be cached in main memory?
• Only those that might contribute to final answers.
[Figure: the twig pattern with nodes Section, title, paragraph, and figure, next to a simple XML tree with elements s1, t1, p1, p2, p3, and f1.]
• We only need to cache p1 and p3 in main memory. Why not p2?
• Because if p2 contributed to final answers, there would have to be an element before f1 that is a child of p2. But f1 is the first element, so p2 is guaranteed not to contribute to final answers.
Our criteria for pushing an element to stack
• The criterion for pushing an element onto a stack is very important for controlling intermediate results. Why?
• Because once an element is pushed onto a stack, it is ready to be output. So the fewer elements are pushed onto stacks, the fewer intermediate results are output.
• Our criteria: given an element e_q from stream T_q, before e_q is pushed onto stack S_q, we ensure that
• (i) e_q has a descendant e_q' for each child q' of q;
• (ii) if (q, q') is a parent-child relationship, e_q' has a parent with tag q on the path from e_q to e_qmax, where e_qmax is the descendant of e_q with the maximal start value, qmax being a child of q; and
• (iii) each e_q' recursively satisfies the first two conditions.
Examples
[Figure: the twig pattern with nodes Section, title, paragraph, and figure, next to a simple XML tree with elements s1, t1, p1, p2, p3, and f1.]
• Element p3 can be pushed onto its stack, but p1 and p2 cannot.
• This is because p3 has a child f1.
• Although p1 has a descendant f1, f1 is not a child of p1.
Our algorithm: TwigStackList
• We propose a novel holistic twig join algorithm, TwigStackList, to evaluate a twig query.
• Unique features of TwigStackList:
• It takes the parent-child edges in the query into account.
• There is a list for each query node to cache elements that are likely to participate in final solutions.
• It identifies a broader class of optimal queries: TwigStackList guarantees I/O optimality for queries with only ancestor-descendant edges connecting branching nodes and their children.
TwigStackList: an example
[Figure: an XML tree under Root with elements s1, s2, t1, t2, t3, p1, p2, p3, f1, and f2; the twig pattern with nodes Section, title, paragraph, and figure; and the contents of the stacks and lists at each step.]
• Step 1: Scan s1, t1, p1, f1.
• Step 2: Since p1 is not the parent of f1 (only an ancestor), we continue to scan p2 and put p1 into the list.
• Step 3: Put p2 and p3 into the list; the cursor points to p3, because it is the parent of f2.
• Step 4: Output the intermediate path solutions <s2,t3> and <s2,p3,f2>. Merging them gives the final answer <s2,t3,p3,f2>.
TwigStackList vs. TwigStack
• TwigStackList is I/O optimal for this query. In contrast, TwigStack is sub-optimal, because it outputs the useless path solution <s1,t1>.
[Figure: the same XML tree under Root and twig pattern as in the preceding example.]
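The merge phase in this example joins root-to-leaf path solutions that share the same element for the branching node. The sketch below uses a plain nested loop for clarity; it is not the merge procedure of either algorithm, and the function name `merge_paths` is illustrative. It assumes the branching node is the first element of every path, as it is for this twig.

```python
def merge_paths(left: list, right: list) -> list:
    """Join path solutions that agree on the branching (first) element."""
    merged = []
    for l in left:
        for r in right:
            if l[0] == r[0]:
                # Keep the shared element once, then append the rest of the right path.
                merged.append(l + r[1:])
    return merged


# Path solutions output by TwigStackList in the example.
title_paths = [("s2", "t3")]
figure_paths = [("s2", "p3", "f2")]
print(merge_paths(title_paths, figure_paths))  # [('s2', 't3', 'p3', 'f2')]

# The extra path <s1,t1> output by TwigStack finds no partner and is wasted work.
print(merge_paths([("s1", "t1")], figure_paths))  # []
```

The second call shows why useless paths matter: they are written out and read back during the merge without ever producing an answer.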
Sub-optimality of TwigStackList
• Although TwigStackList broadens the class of optimal queries compared to TwigStack, it is still sub-optimal for queries with parent-child edges connecting branching nodes.
[Figure: a simple XML tree with elements s1, t1, s2, p1, and p2, next to the twig pattern with nodes Section, title, and paragraph.]
• Observe that there is no matching solution for this data set. But TwigStackList caches s1 and s2 in the list and pushes s1 onto the stack, so (s1,t1) is output as a useless solution.
• Here the behavior of TwigStackList is still reasonable, since we do not know whether s1 has a child p2 following p1 before we advance p1.
Outline
• XML Twig Pattern Matching
• Problem definition
• State of the Art: TwigStack
• Sub-optimality of TwigStack
• Our algorithm: TwigStackList
• ☞ Experimental results
• Conclusion
Experimental Setting
• Pentium 4 CPU, 768 MB RAM, 2 GB disk
• TreeBank
• Downloaded from the University of Washington XML datasets
• Maximal depth 36, 2.4 million nodes
• Random
• Seven tags: a, b, c, d, e, f, g, uniformly distributed
• Fan-out of elements varied from 2 to 100, depth varied from 10 to 100
Performance against TreeBank
• Queries with XPath expressions:
• Number of intermediate path solutions for TwigStackList vs. TwigStack
Performance analysis
• We have three observations:
• (1) When queries contain only ancestor-descendant edges, the two algorithms have similar performance. See Q1.
• (2) When the edges connecting branching nodes are all ancestor-descendant edges, TwigStackList is optimal, but TwigStack is sub-optimal. See Q3, Q5.
• (3) When the edges connecting branching nodes include parent-child relationships, both TwigStack and TwigStackList are sub-optimal. But TwigStackList typically outputs far fewer useless intermediate solutions (<5%) than TwigStack. See Q2, Q4, Q6.
Performance against random dataset
• From the following table, we see that for all queries, TwigStackList is again more efficient than TwigStack in terms of the size of intermediate results.
Outline
• XML Twig Pattern Matching
• Problem definition
• State of the Art: TwigStack
• Sub-optimality of TwigStack
• Our algorithm: TwigStackList
• Experimental results
• ☞ Conclusion
Conclusion
• The previous algorithm, TwigStack, is sub-optimal for queries with parent-child edges.
• We propose a new algorithm, TwigStackList, to address this problem.
• TwigStackList broadens the class of queries with I/O optimality.
• Experiments show that TwigStackList typically outputs far fewer useless intermediate results whenever the query contains parent-child edges.
• We recommend using TwigStackList as a new holistic join algorithm to evaluate queries with parent-child edges.
Thank You! • Q & A