
Unordered Tree Matching and Strict Unordered Tree Matching: the Evaluation of Tree Pattern Queries

Dr. Yangjun Chen, Dept. Applied Computer Science, University of Winnipeg, 515 Portage Ave., Winnipeg, Manitoba, Canada R3B 2E9.


Presentation Transcript


  1. Unordered Tree Matching and Strict Unordered Tree Matching: the Evaluation of Tree Pattern Queries. Dr. Yangjun Chen, Dept. Applied Computer Science, University of Winnipeg, 515 Portage Ave., Winnipeg, Manitoba, Canada R3B 2E9

  2. Outline • Motivation • Algorithm for unordered tree pattern query evaluation - Tree encoding - Algorithm description • On the strict unordered tree matching • Experiment results • Summary

  3. Motivation • Efficient method to evaluate XPath expression queries: XML query processing [Figure: a tree pattern query evaluated against XML documents.]

  4. Motivation • Document:
     <Purchase>
       <Seller>
         <Name>dell</Name>
         <Item>
           <Manufacturer>IBM</Manufacturer>
           <Name>part#1</Name>
           <Item>
             <Manufacturer>Intel</Manufacturer>
           </Item>
         </Item>
         <Item>
           <Name>Part#2</Name>
         </Item>
         <Location>Houston</Location>
       </Seller>
       <Buyer>
         <Location>Winnipeg</Location>
         <Name>Y-Chen</Name>
       </Buyer>
     </Purchase>
     [Figure: the document drawn as a tree, with node labels abbreviated to P, S, B, N, I, M and L.]

  5. Motivation • Query: XPath expressions
     Q1: /Purchase[Seller[Loc = ‘Boston’]]/Buyer[Loc = ‘New York’]
     Q2: /Purchase//Item[Manufacturer = ‘Intel’]
     • d-edge: ancestor-descendant relationship
     • c-edge: parent-child relationship
     [Figure: the document tree together with two tree patterns: Purchase with the branches Seller/Location ‘Houston’ and Buyer/Location ‘Winnipeg’, and Item with the child Manufacturer ‘Intel’.]

  6. Motivation • XPath evaluation against XML documents - XPath expression
     a[b[c and .//d]]/b[c and e//d]
     book[title = ‘Art of Programming’]//author[fn = ‘Donald’ and ln = ‘Knuth’]
     <document> <book> <title> Art of Programming </title> <author> <fn>Donald Knuth</fn> … …
     [Figure: the two expressions drawn as tree patterns, one over the labels a, b, c, d, e and one over book, title, author, fn, ln.]

  7. Motivation • XPath evaluation against XML documents - Evaluation based on unordered tree matching:
     Definition 1 An embedding of a tree pattern Q into an XML document T is a mapping f: Q → T, from the nodes of Q to the nodes of T, which satisfies the following conditions:
     (i) Preserve node label: for each u ∈ Q, label(u) matches label(f(u)).
     (ii) Preserve parent-child/ancestor-descendant relationships: if u → v is a c-edge (parent-child) in Q, then f(v) is a child of f(u) in T; if u ⇒ v is a d-edge (ancestor-descendant) in Q, then f(v) is a descendant of f(u) in T.
     [Figure: a document tree T with nodes v1 to v6 and a tree pattern Q with nodes q1 to q3, over the labels a, b and c.]
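The two conditions of the embedding definition can be checked mechanically for a given mapping. The following is a minimal sketch, assuming that "matches" means plain label equality and that trees are given as parent maps; the function names and the four-node example tree are hypothetical and are not taken from the slides' figure.

```python
def is_descendant(parent, node, ancestor):
    """True if `ancestor` is a proper ancestor of `node` in the given parent map."""
    cur = parent.get(node)
    while cur is not None:
        if cur == ancestor:
            return True
        cur = parent.get(cur)
    return False


def is_embedding(f, q_label, q_edges, t_label, t_parent):
    """Check conditions (i) and (ii) for a mapping f from query nodes to tree nodes.

    q_edges is a list of (u, v, kind) with kind 'c' (parent-child)
    or 'd' (ancestor-descendant).
    """
    # (i) labels must be preserved (assumption: exact equality, no wildcards)
    if any(q_label[u] != t_label[f[u]] for u in q_label):
        return False
    # (ii) every query edge must be preserved in T
    for u, v, kind in q_edges:
        if kind == "c":
            if t_parent.get(f[v]) != f[u]:
                return False
        elif not is_descendant(t_parent, f[v], f[u]):
            return False
    return True


# Hypothetical document tree: v1(a) has children v2(b) and v4(b); v2 has child v3(c).
T_LABEL = {"v1": "a", "v2": "b", "v3": "c", "v4": "b"}
T_PARENT = {"v2": "v1", "v3": "v2", "v4": "v1"}

# Hypothetical pattern: q1(a) with c-child q2(b) and d-child q3(c).
Q_LABEL = {"q1": "a", "q2": "b", "q3": "c"}
Q_EDGES = [("q1", "q2", "c"), ("q1", "q3", "d")]

ok = is_embedding({"q1": "v1", "q2": "v2", "q3": "v3"}, Q_LABEL, Q_EDGES, T_LABEL, T_PARENT)
bad = is_embedding({"q1": "v1", "q2": "v2", "q3": "v4"}, Q_LABEL, Q_EDGES, T_LABEL, T_PARENT)
```

Note that nothing in the definition requires the mapping to be one-to-one, which is why the checker treats each query edge independently.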

  9. Algorithm for query evaluation • Tree encoding Let T be a document tree. We associate each node v in T with a quadruple (DocId, LeftPos, RightPos, LevelNum), where DocId is the document identifier; LeftPos and RightPos are generated by counting word numbers from the beginning of the document until the start and end of the element, respectively; and LevelNum is the nesting depth of the element in the document.
     Example document, with the position number of each tag:
     <A> 1
       <B> 2
         <C> 3 </C> 3
         <B> 4
           <C> 5 </C> 5
           <C> 6 </C> 6
           <D> 7 </D> 7
         </B> 8
       </B> 9
       <B> 10 </B> 10
     </A> 11
     Encoded tree T:
     v1: A (1, 1, 11, 1)
     v2: B (1, 2, 9, 2)
     v3: C (1, 3, 3, 3)
     v4: B (1, 4, 8, 3)
     v5: C (1, 5, 5, 4)
     v6: C (1, 6, 6, 4)
     v7: D (1, 7, 7, 4)
     v8: B (1, 10, 10, 2)
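The numbering in the example can be reproduced with a short tag scanner. This is only a sketch under a stated assumption: the document contains no text, so the counter advances on every start tag and on every end tag of a non-empty element, while an empty element keeps the same number for its start and end. The function name `encode` and the regular expression are hypothetical; a real encoder would count words of text as well.

```python
import re


def encode(xml, doc_id=1):
    """Return (label, (DocId, LeftPos, RightPos, LevelNum)) pairs in end-tag order."""
    nodes = []
    stack = []              # open elements as (label, LeftPos, LevelNum)
    counter = 0
    last_was_open = False   # True if the previous tag was a start tag
    for match in re.finditer(r"<(/?)(\w+)>", xml):
        closing, label = match.group(1) == "/", match.group(2)
        if not closing:
            counter += 1
            stack.append((label, counter, len(stack) + 1))
            last_was_open = True
        else:
            # An element that closes right after it opened is empty and
            # keeps LeftPos == RightPos (assumption matching the example).
            if not last_was_open:
                counter += 1
            open_label, left, level = stack.pop()
            if open_label != label:
                raise ValueError(f"mismatched tag </{label}>")
            nodes.append((label, (doc_id, left, counter, level)))
            last_was_open = False
    return nodes


DOC = "<A> <B> <C> </C> <B> <C> </C> <C> </C> <D> </D> </B> </B> <B> </B> </A>"
encoded = encode(DOC)
```

Because a node is emitted when its end tag is seen, the result is already ordered by RightPos, which is the order a bottom-up traversal needs.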

  10. Tree encoding Let T be a document tree. We associate each node v in T with a quadruple (DocId, LeftPos, RightPos, LevelNum), denoted as a(v), where DocId is the document identifier; LeftPos and RightPos are generated by counting word numbers from the beginning of the document until the start and end of the element, respectively; and LevelNum is the nesting depth of the element in the document.
      (i) ancestor-descendant: a node v1 associated with (d1, l1, r1, ln1) is an ancestor of another node v2 with (d2, l2, r2, ln2) iff d1 = d2, l1 < l2, and r1 > r2.
      (ii) parent-child: a node v1 associated with (d1, l1, r1, ln1) is the parent of another node v2 with (d2, l2, r2, ln2) iff d1 = d2, l1 < l2, r1 > r2, and ln2 = ln1 + 1.
      (iii) from left to right: a node v1 associated with (d1, l1, r1, ln1) is to the left of another node v2 with (d2, l2, r2, ln2) iff d1 = d2 and r1 < l2.
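The three tests (i) to (iii) translate directly into comparisons on the quadruples. A minimal sketch follows; the type name `Enc` and the function names are hypothetical, and the sample values are the quadruples of the example tree.

```python
from typing import NamedTuple


class Enc(NamedTuple):
    doc: int     # DocId
    left: int    # LeftPos
    right: int   # RightPos
    level: int   # LevelNum


def is_ancestor(a: Enc, b: Enc) -> bool:
    """(i) a is an ancestor of b."""
    return a.doc == b.doc and a.left < b.left and a.right > b.right


def is_parent(a: Enc, b: Enc) -> bool:
    """(ii) a is the parent of b."""
    return is_ancestor(a, b) and b.level == a.level + 1


def is_left_of(a: Enc, b: Enc) -> bool:
    """(iii) a is to the left of b."""
    return a.doc == b.doc and a.right < b.left


v1 = Enc(1, 1, 11, 1)   # A
v2 = Enc(1, 2, 9, 2)    # B
v3 = Enc(1, 3, 3, 3)    # C
v4 = Enc(1, 4, 8, 3)    # B
v7 = Enc(1, 7, 7, 4)    # D
v8 = Enc(1, 10, 10, 2)  # B
```

Each test costs a constant number of integer comparisons, so no tree traversal is needed to decide how two nodes are related.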

  11. Algorithm for query evaluation • Tree encoding • Data streams, sorted by LeftPos values:
      A: (1, 1, 11, 1)
      B: (1, 2, 9, 2), (1, 4, 8, 3), (1, 10, 10, 2)
      C: (1, 3, 3, 3), (1, 5, 5, 4), (1, 6, 6, 4)
      D: (1, 7, 7, 4)
      [Figure: the encoded tree T and the numbered example document from slide 9.]

  12. Algorithm for query evaluation • Algorithm description • Our algorithm works bottom-up. Therefore, we need to sort the XML streams by (DocId, RightPos) values. • Each time a query Q is submitted to the system, we associate each query node q with a data stream L(q) such that label(v) = label(q) for each v ∈ L(q). In this way, each query node is attached to a list of matching nodes of the document tree.
      Streams for the example, sorted by RightPos values:
      L(q1) = {v1} = (1, 1, 11, 1)
      L(q2) = L(q5) = {v4, v2, v8} = (1, 4, 8, 3), (1, 2, 9, 2), (1, 10, 10, 2)
      L(q3) = L(q4) = {v3, v5, v6} = (1, 3, 3, 3), (1, 5, 5, 4), (1, 6, 6, 4)
      [Figure: the document tree T and the query tree Q with nodes q1 (A), q2 and q5 (B), q3 and q4 (C), each annotated with its stream L(q).]
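Building the streams amounts to grouping the encoded nodes by label and sorting each group. The sketch below uses the example's quadruples; `build_streams`, `NODES` and `QUERY` are hypothetical names, and the in-memory lists stand in for whatever index the streams are actually read from.

```python
from collections import defaultdict

# (label, (DocId, LeftPos, RightPos, LevelNum)) for the example tree T
NODES = [
    ("A", (1, 1, 11, 1)), ("B", (1, 2, 9, 2)), ("C", (1, 3, 3, 3)),
    ("B", (1, 4, 8, 3)), ("C", (1, 5, 5, 4)), ("C", (1, 6, 6, 4)),
    ("D", (1, 7, 7, 4)), ("B", (1, 10, 10, 2)),
]

# Labels of the query nodes q1 .. q5
QUERY = {"q1": "A", "q2": "B", "q3": "C", "q4": "C", "q5": "B"}


def build_streams(nodes, key):
    """Group the encodings by label and sort every group with `key`."""
    streams = defaultdict(list)
    for label, enc in nodes:
        streams[label].append(enc)
    for stream in streams.values():
        stream.sort(key=key)
    return dict(streams)


# Bottom-up processing: order by (DocId, RightPos)
by_right = build_streams(NODES, key=lambda e: (e[0], e[2]))
# Top-down processing would order by (DocId, LeftPos) instead
by_left = build_streams(NODES, key=lambda e: (e[0], e[1]))

# Attach a stream L(q) to every query node; nodes with equal labels share one
L = {q: by_right.get(label, []) for q, label in QUERY.items()}
```

Query nodes with the same label, such as q2 and q5, refer to the same stream, so the stream is stored once.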

  13. Algorithm for query evaluation • Algorithm description
      1. S(v): a set of query nodes q associated with document tree node v such that Q[q] can be embedded in T[v].
      2. (q): a variable associated with query node q. During the process, (q) will be dynamically assigned a series of values a0, a1, ..., am for some m in sequence, where a0 is the initial value and the ai’s (i = 1, ..., m) are different nodes of T. Initially, (q) is set to a0. (q) will be changed from ai-1 to ai = v (i = 1, ..., m) when the following conditions are satisfied.
      i) v is the node currently encountered.
      ii) q appears in S(u) for some child node u of v.
      iii) q is a d-child, or q is a c-child, and u is a c-child with label(u) = label(q).
      [Figure: a node v of T annotated with S(v) and a node q of Q annotated with (q).]

  14. Algorithm for query evaluation • Algorithm description
      i) v is the node currently encountered.
      ii) q appears in S(u) for some child node u of v.
      iii) q is a d-child, or q is a c-child, and u is a c-child with label(u) = label(q).
      • d-child: with S(u) = {…, q, …}, (q) is changed from ai-1 to ai = v.
      • c-child: with S(u) = {…, q, …} and label(u) = label(q), u must be a direct child of v.
      [Figure: the two cases, each showing q’ with child q in Q and v with child u in T.]

  15. Algorithm for query evaluation • Algorithm description
      3. Subtree embedding by using (q): let q1, ..., ql be the children of q’. If label(v) = label(q’) and (q1) = (q2) = … = (ql) = v, then T[v] embeds Q[q’].

  16. Algorithm for query evaluation • Algorithm description
      4. Construction of S(v)
      • The nodes of Q are numbered in postorder, so the nodes in Q will be referenced by their postorder numbers.
      • For a leaf node v in T, any encountered leaf node q in Q will be inserted into S(v) if label(v) = label(q). (Since Q is explored bottom-up, the query nodes in S(v) must be increasingly sorted.)
      • For a non-leaf node v with children v1, …, vk, S(v) = S(v1) ∪ … ∪ S(vk) ∪ {all nodes q with T[v] embedding Q[q]}. Since S(v) is sorted, ‘union’ can be implemented using the ‘merge’ operation.
      [Figure: the query tree Q with nodes q1 (A), q2 and q5 (B), q3 and q4 (C), annotated with the postorder numbers 1 to 5.]
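The bottom-up idea behind the construction of S(v) can be illustrated with a small recursive dynamic program. This is a simplified sketch, not the algorithm of the slides: it uses plain Python sets instead of merged sorted lists, does not read data streams, and keeps two sets per document node (`exact` for query nodes embedded at the node itself, `below` for those embedded at the node or anywhere beneath it). The edge kinds of the query are an assumption, since the slides' figure does not survive, and the tree shape of Q is inferred from its postorder numbers.

```python
def evaluate(t, query):
    """Return (exact, below) for the document subtree t = (label, children).

    exact: query nodes q such that Q[q] embeds in t with q mapped to t's root.
    below: query nodes embeddable at the root of t or at any descendant.
    query: {postorder id: (label, [(child id, 'c' or 'd'), ...])}
    """
    label, children = t
    results = [evaluate(child, query) for child in children]
    child_exact = set().union(*(exact for exact, _ in results))
    child_below = set().union(*(below for _, below in results))
    exact = set()
    for q, (q_label, q_children) in query.items():
        if q_label != label:
            continue
        # c-child: must embed at a direct child; d-child: at any proper descendant
        if all(qc in (child_exact if kind == "c" else child_below)
               for qc, kind in q_children):
            exact.add(q)
    return exact, exact | child_below


# Example document tree T (v1 .. v8)
T = ("A", [("B", [("C", []),
                  ("B", [("C", []), ("C", []), ("D", [])])]),
           ("B", [])])

# Query with postorder ids 1..5: C, C, B(1, 2), B, A(3, 4); all d-edges (assumed)
Q_D = {1: ("C", []), 2: ("C", []), 3: ("B", [(1, "d"), (2, "d")]),
       4: ("B", []), 5: ("A", [(3, "d"), (4, "d")])}
# The same query with all c-edges (assumed)
Q_C = {1: ("C", []), 2: ("C", []), 3: ("B", [(1, "c"), (2, "c")]),
       4: ("B", []), 5: ("A", [(3, "c"), (4, "c")])}

root_exact, root_below = evaluate(T, Q_D)
```

The query root (id 5) appearing in `root_exact` means the whole pattern embeds at the document root. Because the embedding need not be one-to-one, the two C children of query node 3 may both map to the same document node, which is why the all-c-edge variant also succeeds here.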

  17. Strict unordered tree matching • Definition
      Definition 2 A strict embedding of a tree pattern Q into an XML document T is a mapping f: Q → T, from the nodes of Q to the nodes of T, which satisfies the following conditions:
      (i) Same as (i) in Definition 1.
      (ii) Same as (ii) in Definition 1.
      (iii) For any two nodes v1 ∈ Q and v2 ∈ Q, if v1 and v2 are not related by an ancestor-descendant relationship, then f(v1) and f(v2) in T are not related by an ancestor-descendant relationship.
      [Figure: a document tree T with nodes v1 to v6 and a tree pattern Q with nodes q1 to q3, over the labels a, b and c.]

  18. Strict unordered tree matching • Definition
      By Definition 2, the mapping shown in the figure is not allowed.
      [Figure: the same trees T and Q, with a mapping that sends two unrelated nodes of Q to an ancestor-descendant pair in T.]
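Condition (iii) of the strict definition can be tested on top of any mapping that already satisfies conditions (i) and (ii). The sketch below reads the condition literally and therefore does not reject two query nodes that share one image; the parent maps, node names and function names are hypothetical and do not reproduce the slides' figure.

```python
from itertools import combinations


def is_ancestor(parent, a, b):
    """True if a is a proper ancestor of b in the given parent map."""
    cur = parent.get(b)
    while cur is not None:
        if cur == a:
            return True
        cur = parent.get(cur)
    return False


def related(parent, x, y):
    """True if x and y stand in an ancestor-descendant relationship."""
    return is_ancestor(parent, x, y) or is_ancestor(parent, y, x)


def satisfies_condition_iii(f, q_parent, t_parent):
    """Unrelated query nodes must be mapped to unrelated document nodes."""
    for u, w in combinations(f, 2):
        if not related(q_parent, u, w) and related(t_parent, f[u], f[w]):
            return False
    return True


# Hypothetical document tree: v1 has children v2 and v4; v2 has child v3.
T_PARENT = {"v2": "v1", "v3": "v2", "v4": "v1"}
# Hypothetical pattern: q1 has children q2 and q3 (siblings, so unrelated).
Q_PARENT = {"q2": "q1", "q3": "q1"}

not_strict = satisfies_condition_iii({"q1": "v1", "q2": "v2", "q3": "v3"}, Q_PARENT, T_PARENT)
strict = satisfies_condition_iii({"q1": "v1", "q2": "v4", "q3": "v3"}, Q_PARENT, T_PARENT)
```

The first mapping sends the siblings q2 and q3 to v2 and its child v3, which is exactly the kind of mapping the strict definition rules out.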

  19. Strict unordered tree matching • Strict unordered tree matching requires exponential time: O(|T||Q|d^k), where d is the largest out-degree of the nodes in T and k is the largest out-degree of the nodes in Q. • Using the concept of hypergraphs, the time complexity can be slightly improved to O(|T||Q|2^k). [Figure: example trees Q and T over the labels a, b, c, d, e, x and y.]

  20. Algorithm for query evaluation • Experiments • We conducted our experiments on a DELL desktop PC equipped with a Pentium(R) 4 CPU at 2.80GHz, 0.99GB of RAM and a 20GB hard disk. The code was compiled using the Microsoft Visual C++ compiler version 6.0, running standalone. • Tested methods • In the experiments, we have tested five methods: - TwigStack (TS for short), - Twig2Stack (T2S for short), - Twig-List (TL for short), - One-Phase Holistic (OPH for short), - tree-embedding (discussed in this paper, TE for short).

  21. Algorithm for query evaluation • Experiments • Tested methods • In the experiments, we have tested five methods: - TwigStack (TS for short) [1], - Twig2Stack (T2S for short) [2], - Twig-List (TL for short) [3], - One-Phase Holistic (OPH for short) [4], - tree-embedding (discussed in this paper, TE for short).
      [1] N. Bruno, N. Koudas, and D. Srivastava, “Holistic Twig Joins: Optimal XML Pattern Matching,” in Proc. SIGMOD Int. Conf. on Management of Data, Madison, Wisconsin, June 2002, pp. 310-321.
      [2] S. Chen, H-G. Li, J. Tatemura, W-P. Hsiung, D. Agrawal, and K.S. Candan, “Twig2Stack: Bottom-up Processing of Generalized-Tree-Pattern Queries over XML Documents,” in Proc. VLDB, Seoul, Korea, Sept. 2006, pp. 283-294.
      [3] L. Qin, J.X. Yu, and B. Ding, “TwigList: Make Twig Pattern Matching Fast,” in Proc. 12th Int’l Conf. on Database Systems for Advanced Applications (DASFAA), Apr. 2007, pp. 850-862.
      [4] Z. Jiang, C. Luo, W.-C. Hou, Q. Zhu, and D. Che, “Efficient Processing of XML Twig Pattern: A Novel One-Phase Holistic Solution,” in Proc. 18th Int’l Conf. on Database and Expert Systems Applications (DEXA), Sept. 2007, pp. 87-97.

  22. Algorithm for query evaluation • Experiments • Theoretical computational complexities. D: the largest data stream associated with a node q of Q, such that label(v) = label(q) for each v in the data stream.

  23. Algorithm for query evaluation • Experiments • Indexes All the tested methods use XB-trees as their index structure.

  24. Algorithm for query evaluation • Experiments • Data sets The data sets used for the tests are DBLP data set [30] and a synthetic XMARK data set [35]. The DBLP data set is another real data set with high similarity in structure. It is in fact a wide and shallow document.

  25. Algorithm for query evaluation • Experiments • Queries: altogether 20 queries, divided into 4 groups
      Group I:
      Q1: //inproceedings[author]//year[text() = ‘2004’]
      Q2: //inproceedings[author and title]//year[text() = ‘2004’]
      Q3: //inproceedings[author and title and .//pages]//year[text() = ‘2004’]
      Q4: //inproceedings[author and title and .//pages and .//url]//year[text() = ‘2004’]
      Q5: //articles[author and title and .//volume and .//pages and .//url]//year[text() = ‘2004’]

  26. Algorithm for query evaluation • Experiments • Test results For all the experiments, the buffer pool size was fixed at 2000 pages, and a page size of 8KB was used. For each data set, all the tag names are stored in a single list, and each tag name is then represented by its order number in that list during the evaluation of queries. In our implementation, each DocId occupies 4 bytes, while a number in a Prüfer sequence, a LeftPos or a RightPos occupies 2 bytes. A LevelNum value takes only 1 byte.
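The field sizes above imply a 9-byte encoding per node when the fields are stored without padding. The following sketch works through that arithmetic with Python's `struct` module; the field order, the little-endian layout and the absence of padding are assumptions made for illustration, not details given in the slides.

```python
import struct

# DocId: 4 bytes, LeftPos: 2 bytes, RightPos: 2 bytes, LevelNum: 1 byte.
# "<" selects little-endian with no padding (assumed layout).
RECORD = struct.Struct("<IHHB")

PAGE_SIZE = 8 * 1024   # 8KB pages
BUFFER_PAGES = 2000    # buffer pool size in pages

records_per_page = PAGE_SIZE // RECORD.size   # whole records that fit in one page
buffer_bytes = PAGE_SIZE * BUFFER_PAGES       # total buffer pool size in bytes

# Round-trip the quadruple (1, 4, 8, 3) of node v4 from the running example
packed = RECORD.pack(1, 4, 8, 3)
unpacked = RECORD.unpack(packed)
```

Under these assumptions a page holds 910 encodings and the buffer pool is about 16 MB. A 2-byte unsigned field can represent values up to 65535.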

  27. Algorithm for query evaluation • Experiments [Figure: execution time (sec.) of TS, T2S, OPH, TL and TM for the Group I queries Q1 to Q5.]

  28. Algorithm for query evaluation • Experiments • Queries: altogether 20 queries, divided into 4 groups
      Group II:
      Q6: //inproceedings[author/* and ./*]/year
      Q7: //inproceedings[author/* and title and ./*]/year
      Q8: //inproceedings[author/* and title and .//pages and ./*]/year
      Q9: //inproceedings[author/* and title and .//pages and .//url and ./*]/year
      Q10: //articles[author/* and title and .//volume and .//pages and .//pages and .//url and ./*]/year

  29. Algorithm for query evaluation • Experiments [Figure: execution time (sec.) of TS, T2S, OPH, TL and TM for the Group II queries Q6 to Q10.]

  30. Algorithm for query evaluation • Experiments • Queries: altogether 20 queries, divided into 4 groups
      Group III:
      Q11: /site//open_auction[.//seller/person]//date[text() = ‘10/23/2006’]
      Q12: /site//open_auction[.//seller/person and .//bidder]//date[text() = ‘10/23/2006’]
      Q13: /site//open_auction[.//seller/person and .//bidder/increase]//date[text() = ‘10/23/2006’]
      Q14: /site//open_auction[.//seller/person and .//bidder/increase and .//initial]//date[text() = ‘10/23/2006’]
      Q15: /site//open_auction[.//seller/person and .//bidder/increase and .//initial and .//description]//date[text() = ‘10/23/2006’]

  31. Algorithm for query evaluation • Experiments [Figure: execution time (sec.) of TS, T2S, OPH, TL and TM for the Group III queries Q11 to Q15.]

  32. Algorithm for query evaluation • Experiments • Queries: altogether 20 queries, divided into 4 groups
      Group IV:
      Q16: /site//open_auction[.//seller/person/* and ./*]/date
      Q17: /site//open_auction[.//seller/person/* and .//bidder and ./*]/date
      Q18: /site//open_auction[.//seller/person/* and .//bidder/increase]/date
      Q19: /site//open_auction[.//seller/person/* and .//bidder/increase and .//initial and ./*]/date
      Q20: /site//open_auction[.//seller/person/* and .//bidder/increase and .//initial and .//description and ./*]/date

  33. Algorithm for query evaluation • Experiments [Figure: execution time (sec.) of TS, T2S, OPH, TL and TM for the Group IV queries Q16 to Q20.]

  34. Algorithm for query evaluation • Experiments • Test results: run-time space usage. In the following figure, we compare the run-time memory usage of all five tested approaches for the second group of queries. By memory usage, we mean the intermediate data structures, not including the data streams (concretely, path stacks for TwigStack; hierarchical stacks for Twig2Stack, TL and OPH; and QSs for ours).

  35. Summary • An efficient method for evaluating unordered tree pattern queries in XML document databases - parent/child and ancestor/descendant relation - O(|D||Q|) time and O(|D||Q|) space • Strict unordered tree matching - exponential time • Experiments - TreeBank database, DBLP and XMark documents - I/O time and CPU processing time

  36. Thank you.
