
Benchmarking Holistic Approaches to XML TPQ Processing

BenchmarX 2010. Benchmarking Holistic Approaches to XML TPQ Processing. Jiaheng Lu Renmin University of China. A little bit of history. Database world 1970 relational databases 1990 object oriented database 1995 semi-structured databases Document world


Presentation Transcript


  1. BenchmarX 2010 Benchmarking Holistic Approaches to XML TPQ Processing Jiaheng Lu Renmin University of China

  2. A little bit of history • Database world • 1970 relational databases • 1990 object-oriented databases • 1995 semi-structured databases • Document world • 1974 SGML (Standard Generalized Markup Language) • 1990 HTML (Hypertext Markup Language) • 1992 URL (Uniform Resource Locator) • 1996 XML (eXtensible Markup Language) Benchmarking Holistic Approaches to TPQ Processing

  3. What is XML • The eXtensible Markup Language (XML) is the universal format for structured documents and data on the Web. • Advantages of XML: • Human- and machine-readable format • More flexible than HTML, less complicated than SGML • Unlike a relational table, XML can describe tree- and graph-structured data

  4. What is XML • Basic specification: XML 1.0, W3C Recommendation, February 1998 • Example: <book year="1967"> <title>The politics of experience</title> <author> <firstname>Ronald</firstname> <lastname>Laing</lastname> </author> </book>

  5. XML Tree • An XML document is commonly modeled as a rooted, ordered tree. • Tree for the example document: book has the children @year ("1967"; "year" is an attribute), title ("The politics…") and author; author has the children firstname ("Ronald") and lastname ("Laing").

  6. XML query language • Major standards for querying XML data: XPath and XQuery • "XPath is a language for addressing parts of an XML document" (XPath 1.0, W3C, Nov 1999) • E.g. paper[title="XML"]/author • "XQuery is an XML query language which provides features for retrieving and interpreting information from XML documents." (XQuery 1.0, Nov 2005)

  7. An XQuery example • Create a flat list of all the title-author pairs for every book in the bibliography. • XQuery: <results> { for $b in doc("bib.xml")/bib//book, $t in $b/title, $a in $b/author return <result> { $t } { $a } </result> } </results>

  8. XML Twig Pattern • The XML Twig Pattern Query (TPQ) is a core operation in XPath and XQuery. • Definition: an XML twig pattern is a small tree whose nodes are tags, attributes or text values, and whose edges are either parent-child (P-C) or ancestor-descendant (A-D) relationships.

  9. An XML twig pattern example • Task: create a flat list of all the title-author pairs for every book in the bibliography. • XQuery: <results> { for $b in doc("bib.xml")/bib//book, $t in $b/title, $a in $b/author return <result> { $t } { $a } </result> } </results> • To answer the XQuery, we first need to match the following XML twig pattern: bib with descendant book ($b), where book has the children title ($t) and author ($a).

  10. Research Problem • Given an XML twig pattern Q and an XML database D, we need to find ALL the matches of Q on D efficiently. • E.g. consider the following twig pattern and document: • Twig pattern: section with the branches title and figure • XML tree: s1 contains t1 and s2; s2 contains t2 and p1; p1 contains f1 • Query solutions: (s1, t1, f1), (s2, t2, f1), (s1, t2, f1)

  11. Why study XML twig pattern matching • An XML query includes two parts: value match and twig match. • XPath: paper[title="XML"]/author • Value (content) match: title="XML" • Twig match (paper with the children title and author): new challenge!

  12. Approach Overview • (1) Labeling: assign each element in the XML document tree a label built from integers that captures the structural information of the document • (2) Computing: use the labels to answer the twig pattern without traversing the original document

  13. Related work: XML TPQ algorithms • Labeling schemes: Containment scheme [SIGMOD'01], Dewey scheme [SIGMOD'02], Dynamic Dewey scheme [SIGMOD'09] • Computing algorithms: Stack-merge [ICDE'02], XPath-SQL [SIGMOD'02], TwigStack [SIGMOD'02], TJFast [VLDB'05], Twig2Stack [VLDB'06], TreeMatch [TKDE 2010]

  14. Approach Overview • (1) Labeling • Region encoding (also called containment) labeling scheme: (start, end, level) • An example XML tree with region encoding labels: s1 (1,12,1); t1 (2,3,2); s2 (4,11,2); t2 (5,6,3); p1 (7,10,3); f1 (8,9,4)
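The region encoding above can be reproduced with a single depth-first traversal. The following is a minimal sketch, not the authors' implementation; the `Node` class and the function names are hypothetical and chosen only for illustration. It assigns (start, end, level) labels to the example tree and shows how A-D and P-C relationships can be tested from the labels alone.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    # Hypothetical in-memory tree node used only for this illustration.
    name: str
    children: list = field(default_factory=list)


def label_regions(root):
    """Return {name: (start, end, level)} using one depth-first traversal."""
    labels = {}
    counter = 0

    def visit(node, level):
        nonlocal counter
        counter += 1              # position where the element opens
        start = counter
        for child in node.children:
            visit(child, level + 1)
        counter += 1              # position where the element closes
        labels[node.name] = (start, counter, level)

    visit(root, 1)
    return labels


def is_ancestor(a, d):
    """a is an ancestor of d if a's region strictly contains d's region."""
    return a[0] < d[0] and d[1] < a[1]


def is_parent(p, c):
    """p is the parent of c if it is an ancestor exactly one level above."""
    return is_ancestor(p, c) and c[2] == p[2] + 1


# The example tree: s1 contains t1 and s2; s2 contains t2 and p1; p1 contains f1.
tree = Node("s1", [Node("t1"),
                   Node("s2", [Node("t2"), Node("p1", [Node("f1")])])])
labels = label_regions(tree)
```

With these labels, `is_ancestor(labels["s1"], labels["f1"])` holds while `is_parent(labels["s1"], labels["f1"])` does not, because f1 is three levels below s1.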

  15. Approach Overview • (1) Labeling • Dewey (also called prefix) labeling scheme: each label is an integer sequence • An example XML tree with Dewey labels: s1 ε; t1 0; s2 1; t2 1.0; p1 1.1; f1 1.1.0

  16. Approach Overview • (2) Computing • Inverted data list: each data list contains all labels of elements with the same tag name • XML tree: s1 (1,12,1); t1 (2,3,2); s2 (4,11,2); t2 (5,6,3); p1 (7,10,3); f1 (8,9,4) • Query: s with the branches t and f • Data lists: s: (1,12,1), (4,11,2) • t: (2,3,2), (5,6,3) • f: (8,9,4)
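As a rough illustration of the inverted data lists, the sketch below groups the example labels by tag name and then answers the query by brute force. The brute-force matcher is deliberately naive and is not one of the holistic algorithms discussed in this talk; it only shows that the labels suffice to decide matches without touching the document. It assumes both query edges are A-D, and all function names are hypothetical.

```python
from collections import defaultdict
from itertools import product


def is_ancestor(a, d):
    """Region containment test on (start, end, level) labels."""
    return a[0] < d[0] and d[1] < a[1]


def build_streams(elements):
    """Group labels by tag name and sort each list in document order."""
    streams = defaultdict(list)
    for tag, label in elements:
        streams[tag].append(label)
    for labels in streams.values():
        labels.sort()
    return dict(streams)


def naive_matches(streams):
    """Brute-force matches of s with descendants t and f (assumed A-D edges)."""
    return [(s, t, f)
            for s, t, f in product(streams["s"], streams["t"], streams["f"])
            if is_ancestor(s, t) and is_ancestor(s, f)]


# Labeled elements of the example tree, tagged s, t, p and f.
elements = [("s", (1, 12, 1)), ("t", (2, 3, 2)), ("s", (4, 11, 2)),
            ("t", (5, 6, 3)), ("p", (7, 10, 3)), ("f", (8, 9, 4))]
streams = build_streams(elements)
matches = naive_matches(streams)
```

The three matches correspond to (s1, t1, f1), (s1, t2, f1) and (s2, t2, f1). The cost of this approach grows with the product of the list sizes, which is the kind of blow-up the holistic algorithms are designed to avoid.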

  17. Previous work: TwigStack [1] • (2) Computing • TwigStack [1] is a holistic algorithm for XML twig matching based on the containment labeling scheme. • Two steps in TwigStack: • (1) intermediate path solutions are output to match each root-to-leaf path of the query; and • (2) these intermediate path solutions are merged to get the final results. [1] N. Bruno, D. Srivastava, and N. Koudas. Holistic twig joins: optimal XML pattern matching. In Proceedings of ACM SIGMOD, 2002.

  18. Running example: TwigStack algorithm • Query: s with the branches t and f • Data streams: s: (1,12,1), (4,11,2) • t: (2,3,2), (5,6,3) • f: (8,9,4) • Output intermediate path solutions: s//t: (1,12,1) (2,3,2); (1,12,1) (5,6,3); (4,11,2) (5,6,3) • s//f: (1,12,1) (8,9,4); (4,11,2) (8,9,4) • Final results: (1,12,1) (2,3,2) (8,9,4); (1,12,1) (5,6,3) (8,9,4); (4,11,2) (5,6,3) (8,9,4)
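The second step of the running example, merging the path solutions on their shared root element, can be sketched as follows. This is only an illustration of the merge on the data above: it uses a simple hash join on the s element, which is an assumption made for brevity and not necessarily how TwigStack implements the merge. The function name is hypothetical.

```python
def merge_path_solutions(paths_t, paths_f):
    """Join s//t and s//f path solutions that share the same s element."""
    # Index the s//f solutions by their root element (hash join; an assumption).
    f_by_root = {}
    for s, f in paths_f:
        f_by_root.setdefault(s, []).append(f)
    # Combine every s//t solution with the f elements under the same s.
    return [(s, t, f) for s, t in paths_t for f in f_by_root.get(s, [])]


# Intermediate path solutions taken from the running example.
s_t = [((1, 12, 1), (2, 3, 2)), ((1, 12, 1), (5, 6, 3)), ((4, 11, 2), (5, 6, 3))]
s_f = [((1, 12, 1), (8, 9, 4)), ((4, 11, 2), (8, 9, 4))]
final = merge_path_solutions(s_t, s_f)
```

Every path solution here takes part in at least one final result, which is the situation TwigStack guarantees when all query edges are A-D.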

  19. Limitations of TwigStack • (1) TwigStack may output many useless intermediate results for queries with parent-child relationships. • (2) TwigStack cannot process XML twig queries with ordered predicates, such as the "preceding" and "following" axes in XPath. • (3) TwigStack cannot answer queries with wildcards in branching nodes. E.g. for a query whose branching node is the wildcard * with the branches B and C, the parent of B should be an ancestor of C.

  20. Outline • Introduction • Holistic algorithms: • TwigStackList (CIKM2005) • OrderedTJ (DEXA2006) • iTwigJoin (SIGMOD2005) • TJFast (VLDB2005) • Twig2Stack (VLDB2006) • TreeMatch (TKDE2010) • Benchmark experiments • Conclusions and future work

  21. Inefficiency of TwigStack • TwigStack is inefficient at answering twig queries with parent-child edges. • [Chart: number of intermediate path solutions for Q1=VP[/DT]//PRP DOLLAR, Q2=S[/JJ]/NP, Q3=S[//VP/IN]//NP on TreeBank data] • More than 99% of the intermediate results are useless; TwigStack wastes too much time outputting them!

  22. Example to illustrate the inefficiency of TwigStack for queries with a P-C edge • Twig pattern (figure): nodes A, B, C and D, where D must be a child of A. • XML tree (figure): A1 with the elements B1, B2, …, Bn, C1, C2, …, Cn, E1 and D1, where D1 is not a child of A1. • TwigStack outputs the useless root-to-leaf intermediate path solutions (A1, B1, C1), (A1, B2, C1), ……, (A1, Bn, Cn). • The reason for the inefficiency of TwigStack: in its first step, TwigStack assumes that all edges are A-D relationships and does not consider level information.

  23. Naïve improvement is incorrect • Naïve improvement: by considering level information, we see that A1 is not the parent of D1, so we do not output the path solutions (A1, B1, C1), (A1, B2, C1), ……, (A1, Bn, Cn). • Twig pattern and XML tree (figure): nodes A, B, C, D and elements A1, B1, …, Bn, C1, …, Cn, E1, D1. • But this naïve approach is NOT correct in some cases!

  24. Problem of the naïve approach • The naïve approach may make a wrong decision about whether the current element contributes to the final results. • Example (figure): twig pattern with nodes A, B, C and D, where D must be a child of C; XML tree with A1, B1, a chain C1, C2, …, Cn, and the elements D1, …, Dm below Cn. • When we read A1, B1, C1 and D1, C1 is not the parent of D1, so according to the naïve approach we decide that C1 and D1 do not belong to the query answers. • But that is wrong!

  25. Our solution: Look-ahead • New technique used in our new algorithm, TwigStackList: look-ahead. • When we read A1, B1, C1 and D1, we do not decide immediately whether C1 or D1 belongs to the final solutions; instead we buffer C1 to Cn in a main-memory list. Since Cn is the parent of D1, we are sure that (A1, B1, Cn, D1) is a real match. • Why not buffer D1 to Dm? Too many!

  26. Running example: TwigStackList algorithm • Query paths: A//B and A//C/D • XML tree: A1 (1,11,1); B1 (2,2,2); C1 (3,10,2); C2 (4,8,3); C3 (5,7,4); D1 (6,6,5); D2 (9,9,3) • Data streams: A: (1,11,1) • B: (2,2,2) • C: (3,10,2), (4,8,3), (5,7,4) • D: (6,6,5), (9,9,3) • Data structures: stacks SA, SB, SC, SD and list LC • Output path solutions: A//C/D: (1,11,1) (5,7,4) (6,6,5); (1,11,1) (3,10,2) (9,9,3) • A//B: (1,11,1) (2,2,2)
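The difference between the naïve level check and look-ahead can be illustrated on the labels of the running example. The sketch below is a simplification under stated assumptions, not the TwigStackList algorithm itself: it scans a small in-memory C stream and buffers the C elements that enclose a given D element, then looks for the parent among them. The function names are hypothetical.

```python
def is_ancestor(a, d):
    """Region containment test on (start, end, level) labels."""
    return a[0] < d[0] and d[1] < a[1]


def is_parent(p, c):
    """Ancestor that sits exactly one level above."""
    return is_ancestor(p, c) and c[2] == p[2] + 1


def find_parent_with_lookahead(c_stream, d):
    """Buffer the C elements enclosing d, then search the buffer for d's parent."""
    # The enclosing elements form a nested chain, so the buffer never holds
    # more elements than the depth of the document.
    buffered = [c for c in c_stream if is_ancestor(c, d)]
    for c in buffered:
        if is_parent(c, d):
            return c
    return None


c_stream = [(3, 10, 2), (4, 8, 3), (5, 7, 4)]   # C1, C2, C3
d1, d2 = (6, 6, 5), (9, 9, 3)

# Naive decision: only the head of the C stream is compared with D1.
naive_says_match = is_parent(c_stream[0], d1)
# Look-ahead: the whole chain of enclosing C elements is considered.
parent_of_d1 = find_parent_with_lookahead(c_stream, d1)
parent_of_d2 = find_parent_with_lookahead(c_stream, d2)
```

Here the naïve check rejects D1 because C1 is only an ancestor, while the buffered chain reveals C3 as its parent; D2 is paired with C1. Both pairs agree with the A//C/D path solutions listed in the running example.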

  27. Features of TwigStackList • Main-memory efficient: the size of each stack and list is no more than the depth of the document tree, so TwigStackList can process very large documents with a small main-memory cost. • I/O efficient: each element is scanned once. • For a large query class, TwigStackList guarantees that each output path solution is useful to the final answers.

  28. Optimal query classes • If an algorithm does not output any useless intermediate path solution for a query Q on any given document, we call the algorithm optimal with respect to Q. • If an algorithm has a larger optimal query class, it is better able to control the size of the intermediate results.

  29. Optimal query classes • Optimal class of TwigStack: only A-D in all edges • Optimal class of TwigStackList: only A-D in branching edges • [Figure: example twig queries over the nodes A, B, C and D for each class]

  30. Outline • Introduction • Holistic algorithms: • TwigStackList (CIKM2005) • OrderedTJ (DEXA2006) • iTwigJoin (SIGMOD2005) • TJFast (VLDB2005) • Twig2Stack (VLDB2006) • TreeMatch (TKDE2010) • Benchmark experiments • Conclusions

  31. Motivation • TwigStack and TwigStackList cannot handle order-based twig queries. • XPath and XQuery include ordered axes such as following, preceding, following-sibling and preceding-sibling. • Example: the XPath expression A/B[following-sibling::C] corresponds to a twig with root A and children B and C, where the symbol "<" shows that B and C are ordered.

  32. Ordered twig query pattern • Ordered XML twig pattern: sibling query nodes should be matched according to their order in the twig query. • Example (figure): ordered query "<" with the nodes A, B, C and D; document with the elements A1, D1, B1, C1, D2 and D3. Only D2 and D3 contribute to the final results.

  33. OrderedTJ • OrderedTJ is a new algorithm proposed for evaluating ordered twig query patterns. • OrderedTJ extends TwigStackList and also uses stack and list data structures. • What is the main modification of OrderedTJ over TwigStackList? OrderedTJ additionally checks the order conditions of elements before outputting intermediate paths.

  34. OrderedTJ • Before any element is pushed onto the stack, OrderedTJ checks the order condition. • Query: ordered twig "<" with the paths A/B/C and A//D • Data: A1 (1,9,1); D1 (2,2,2); B1 (3,5,2); C1 (4,4,3); D2 (6,8,2); D3 (7,7,3) • Data streams: A: (1,9,1) • B: (3,5,2) • C: (4,4,3) • D: (2,2,2), (6,8,2), (7,7,3) • Stacks: SA, SB, SC, SD • Output intermediate path solutions: A/B/C: (1,9,1) (3,5,2) (4,4,3) • A//D: (1,9,1) (6,8,2); (1,9,1) (7,7,3)
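The order check on this example can be sketched with region labels. The snippet below assumes that the order condition between the sibling query nodes B and D means "the B element ends before the D element starts"; this reading and the function name `precedes` are assumptions made for illustration, not a statement of how OrderedTJ is implemented.

```python
def precedes(left, right):
    """left comes entirely before right in document order (assumed condition)."""
    return left[1] < right[0]


b1 = (3, 5, 2)
d_stream = [(2, 2, 2), (6, 8, 2), (7, 7, 3)]   # D1, D2, D3

# Keep only the D elements that satisfy the order condition with respect to B1.
ordered_d = [d for d in d_stream if precedes(b1, d)]
```

Under this assumption D1 is filtered out because it appears before B1, and only D2 and D3 remain, which agrees with the A//D path solutions listed above.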

  35. The optimal query classes of OrderedTJ • OrderedTJ guarantees optimality for ordered queries whose branching edges are A-D relationships from the second branching edge onwards. • In other words, OrderedTJ remains optimal for queries with a P-C relationship in the first branching edge. • [Figure: unordered query Q1 and ordered query Q2 ("<"), both with root A and branches B and C] TwigStackList is non-optimal for Q1. OrderedTJ is optimal for Q2.

  36. Outline • Introduction • Holistic algorithms: • TwigStackList (CIKM2005) • OrderedTJ (DEXA2006) • iTwigJoin (SIGMOD2005) • TJFast (VLDB2005) • Twig2Stack (VLDB2006) • TreeMatch (TKDE2010) • Benchmark experiments • Conclusions and future work

  37. iTwigJoin algorithm • TwigStack and OrderedTJ partition data into streams according to tag names alone. • We propose two new data partition schemes: • (1) Tag+level scheme • (2) Prefix path scheme • Potential benefits: • Enlarge the optimal query classes • Reduce I/O cost

  38. Data partition scheme • Tag partition, refined by level, gives the Tag+level partition; refined by path, it gives the Prefix path partition. • Document: A1 has the children B1 and C2; B1 has the child C1; C2 has the child C3. • Tag partition: TA: A1 • TB: B1 • TC: C1, C2, C3 • Tag+level partition: A at level 1: A1 • B at level 2: B1 • C at level 2: C2 • C at level 3: C1, C3 • Prefix path partition: TA: A1 • TAB: B1 • TAC: C2 • TABC: C1 • TACC: C3
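The three partition schemes differ only in the key used to group elements into streams. A minimal sketch on the small document above follows; the tuple layout and the `partition` helper are hypothetical and only meant to make the grouping keys explicit.

```python
from collections import defaultdict

# (element name, tag, tag path from the root) for the example document.
elements = [
    ("A1", "A", ("A",)),
    ("B1", "B", ("A", "B")),
    ("C1", "C", ("A", "B", "C")),
    ("C2", "C", ("A", "C")),
    ("C3", "C", ("A", "C", "C")),
]


def partition(elements, key):
    """Group element names into streams according to the given key function."""
    streams = defaultdict(list)
    for name, tag, path in elements:
        streams[key(tag, path)].append(name)
    return dict(streams)


# Tag scheme: one stream per tag name.
by_tag = partition(elements, lambda tag, path: tag)
# Tag+level scheme: one stream per (tag, level); level is the path length.
by_tag_level = partition(elements, lambda tag, path: (tag, len(path)))
# Prefix path scheme: one stream per distinct root-to-element tag path.
by_prefix_path = partition(elements, lambda tag, path: path)
```

On this document the schemes produce 3, 4 and 5 streams respectively, so each refinement yields more but smaller streams.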

  39. Properties of the three schemes • The tag scheme is refined by level into the tag+level scheme, and by path into the prefix path scheme. • Moving from the tag scheme to the prefix path scheme: • 1. the number of inverted lists increases (CPU cost increases correspondingly) • 2. the optimal query classes enlarge (output cost decreases correspondingly) • 3. the number of elements scanned decreases (input cost decreases correspondingly)

  40. The number of inverted lists: increasing • Tag partition (3 lists): TA: A1 • TB: B1 • TC: C1, C2, C3 • Tag+level partition (4 lists): A at level 1: A1 • B at level 2: B1 • C at level 2: C2 • C at level 3: C1, C3 • Prefix path partition (5 lists): TA: A1 • TAB: B1 • TAC: C2 • TABC: C1 • TACC: C3

  41. The optimal query classes: enlarging • Optimal class of the tag scheme: queries with only A-D in branching edges • Optimal class of the tag+level scheme: the above, plus queries with only P-C in all edges • Optimal class of the prefix path scheme: the above, plus queries with only one branching node • [Figure: example twig queries over the nodes A, B, C, D and E for each class]

  42. The number of elements scanned: decreasing • Query: A with the branches C and B • Data: level 1: D1 • level 2: A1, C1 • level 3: B1, C2 • Tag scheme (4 elements scanned): TA: A1 • TB: B1 • TC: C1, C2 • Tag+level scheme (3 elements scanned): A at level 2: A1 • B at level 3: B1 • C at level 2: C1 • C at level 3: C2 • Prefix path scheme (0 elements scanned): TDA: A1 • TDAB: B1 • TDC: C1 • TDCC: C2

  43. iTwigJoin algorithm • A general algorithm that can be applied to all three schemes. • For different schemes, iTwigJoin achieves different performance. • The main technical difficulty in designing iTwigJoin is handling many current nodes for one tag name. We classify the currently visited elements into three categories: current-match, current-useless and current-blocked.

  44. Three kinds of elements • Current-match: the element is guaranteed to contribute to final answers with the current elements. • Current-useless: the element is guaranteed not to contribute to final answers with the current and remaining elements. • Current-blocked: the element is neither current-match nor current-useless. • A current-blocked element becomes a match when matching data appears, and becomes useless when it cannot get any matching data.

  45. Example of the three kinds of elements • Query: A with the branches C and B • Document: level 1: A1 • level 2: B1, A2, A3, C2 • level 3: B2, C1 • Tag+level scheme: A at level 1: A1 • A at level 2: A2, A3 • B at level 2: B1 • B at level 3: B2 • C at level 2: C2 • C at level 3: C1 • Current-match: A1, B1, C2 • Current-blocked: B2, C1 • Current-useless: A2

  46. Example of the three kinds of elements • Query: A with the branches C and B • Document: level 1: A1 • level 2: B1, A2, A3, C2 • level 3: B2, C1 • Tag+level scheme: A at level 1: A1 • A at level 2: A2, A3 • B at level 2: B1 • B at level 3: B2 • C at level 2: C2 • C at level 3: C1 • B2 and C1 are converted from current-blocked to current-match due to the appearance of A3.

  47. Main flowchart of iTwigJoin • Are all elements scanned? If yes: end of the algorithm. • If no: is there any current-useless element? If yes: see whether it contributes to a previous match, and advance to the next element. • If no: is there any current-match element? If yes: output intermediate path solutions, and advance to the next element. • If no: choose the smallest current-blocked element and output intermediate path solutions, then advance to the next element.

  48. Outline • Introduction • Holistic algorithms: • TwigStackList (CIKM2005) • OrderedTJ (DEXA2006) • iTwigJoin (SIGMOD2005) • TJFast (VLDB2005) • Twig2Stack (VLDB2006) • TreeMatch (TKDE2010) • Benchmark experiments • Conclusions and future work

  49. Motivation: new labeling scheme • TwigStackList, OrderedTJ and iTwigJoin are all based on the containment labeling scheme. • Why not try the Dewey labeling scheme for XML twig pattern queries? Oh, it is really a novel idea!

  50. Original Dewey Labeling Scheme • In the Dewey labeling scheme, each element is represented by an integer sequence: • (i) the root is labeled by the empty string ε • (ii) for a non-root element u, label(u) = label(s).x, where u is the x-th child of s. • For example: s1 ε; t1 1; s2 2; f2 3; t2 2.1; f1 2.2
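The two labeling rules translate directly into a short recursive function. The following is a minimal sketch under the definition above, with labels represented as Python tuples (the empty tuple standing for ε); the tree representation and function names are hypothetical. It also shows that A-D and P-C tests reduce to prefix checks on the labels.

```python
def dewey_labels(tree, label=()):
    """tree is (name, [children]); returns {name: Dewey label as a tuple}."""
    name, children = tree
    labels = {name: label}                  # rule (i): the root gets the empty label
    for x, child in enumerate(children, start=1):
        # rule (ii): the x-th child of s gets label(s).x
        labels.update(dewey_labels(child, label + (x,)))
    return labels


def is_ancestor(a, d):
    """a is an ancestor of d if a's label is a proper prefix of d's label."""
    return len(a) < len(d) and d[:len(a)] == a


def is_parent(p, c):
    """p is the parent of c if c's label is p's label plus one component."""
    return len(c) == len(p) + 1 and c[:len(p)] == p


# The example tree: s1 has the children t1, s2 and f2; s2 has t2 and f1.
tree = ("s1", [("t1", []), ("s2", [("t2", []), ("f1", [])]), ("f2", [])])
labels = dewey_labels(tree)
```

For instance, `labels["f1"]` is `(2, 2)`, that is 2.2, and its prefix `(2,)` identifies s2 as its parent without consulting any other element.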
