Unordered Tree Matching and Strict Unordered Tree Matching: the Evaluation of Tree Pattern Queries

Unordered Tree Matching and Strict Unordered Tree Matching: the Evaluation of Tree Pattern Queries Dr. Yangjun Chen Dept. Applied Computer Science, University of Winnipeg 515 Portage Ave. Winnipeg, Manitoba, Canada R3B 2E9

Outline • Motivation • Algorithm for unordered tree pattern query evaluation - Tree encoding - Algorithm description • On the strict unordered tree matching • Experiment results • Summary

a tree pattern query XML documents Motivation • Efficient method to evaluate XPath expression queries – XML query processing

<Purchase> <Seller> <Name>dell</Name> <Item> <Manufacturer>IBM</Manufacturer> <Name>part#1</Name> <Item> <Manufacturer>Intel</Manufacturer> </Item> </Item> <Item> <Name>Part#2</Name> </Item> <Location>Houston</Location> </Seller> <Buyer> <Location>Winnipeg</Location> <Name>Y-Chen</Name> </Buyer> </Purchase> P S B N I I L L N Houston Winnipeg Y-Chen Dell M N I N IBM Part#1 Part#2 M Intel Motivation Document:

P S B N I I L L N Houston Winnipeg Y-Chen Dell M N I N IBM Part#1 Part#2 M Intel Motivation Document: Query – XPath expressions: Q1: /Purchase[Seller[Loc=‘Boston’]]/ Buyer[Loc = ‘New York’ Purchase Buyer Seller Location Location ‘Winnipeg’ ‘Houston’ Q2: /Purchase//Item[Manufacturer = ‘Intel’] Buyer d-edge: ancestor- descendant relationship Item Manufacturer c-edge: parent-child relationship ‘Intel’

a b b c d c e d book title author Art of Programming fn ln Knuth Donald Motivation • XPath evaluation against XML documents - XPath expression a[b[c and .//d]]/b[c and e//d] book[title = ‘Art of Programming’]//author[fn = ‘Donald’ and ln = ‘Knuth’] <document> <book> <title> Art of Programming </title> <author> <fn>Donald Knuth</fn> … …

a c c a b b c b b Motivation • XPath evaluation against XML documents -Evaluation based on unordered tree matching: Definition An embedding of a tree pattern Q into an XML document T is a mapping f: Q  T, from the nodes of Q to the nodes of T, which satisfies the following conditions: (i) Preserve node label: For each u  Q, label(u) matches label(f(u)). (ii) Preserve parent-child/ancestor-descendant relationships: If uv in Q, then f(v) is a child of f(u) in T; if uv in Q, then f(v) is a descendant of f(u) in T. T: Q: q3 v6 q1 q2 v4 v5 v1 v3 v2

a c c a b b c b b Motivation • XPath evaluation against XML documents -Evaluation based on unordered tree matching: Definition 1 An embedding of a tree pattern Q into an XML document T is a mapping f: Q  T, from the nodes of Q to the nodes of T, which satisfies the following conditions: (i) Preserve node label: For each u  Q, label(u) matches label(f(u)). (ii) Preserve parent-child/ancestor-descendant relationships: If uv in Q, then f(v) is a child of f(u) in T; if uv in Q, then f(v) is a descendant of f(u) in T. T: Q: q3 v6 q1 q2 v4 v5 v1 v3 v2

T: (1, 1, 11, 1) A v1 B v8 (1, 2, 9, 2) B v2 (1, 10, 10, 2) v3 C (1, 3, 3, 3) B v4 (1, 4, 8, 3) v5 C v6 C D v7 (1, 7, 7, 4) (1, 5, 5, 4) (1, 6, 6, 4) Algorithm for query evaluation • Tree encoding Let T be a document tree. We associate each node v in T with a quadruple (DocId, LeftPos, RightPos, LevelNum), where DocId is the document identifier; LeftPos and RightPos are generated by counting word numbers from the beginning of the document until the start and end of the element, respectively; and LevelNum is the nesting depth of the element in the document. 1 <A> <C> </C> <C> </C> <C> </C> <D> </D> </A> 2 3 3 4 5 5 6 6 7 7 8 9 10 10 11

Tree encoding Let T be a document tree. We associate each node v in T with a quadruple (DocId, LeftPos, RightPos, LevelNum), denoted as a(v), where DocId is the document identifier; LeftPos and RightPos are generated by counting word numbers from the beginning of the document until the start and end of the element, respectively; and LevelNum is the nesting depth of the element in the document. (i) ancestor-descendant: a node v1 associated with (d1, l1, r1, ln1) is an ancestor of another node v2 with (d2, l2, r2, ln2) iff d1= d2, l1< l2, and r1> r2. (ii) parent-child: a node v1 associated with (d1, l1, r1, ln1) is the parent of another node v2 with (d2, l2, r2, ln2) iff d1= d2, l1< l2, r1> r2, and ln2 = ln1 + 1. (iii)from left to right: a node v1 associated with (d1, l1, r1, ln1) is to the left of another node v2 with (d2, l2, r2, ln2) iff d1= d2, r1< l2.

T: (1, 1, 11, 1) A v1 B v8 (1, 2, 9, 2) B v2 (1, 10, 10, 2) v3 C (1, 3, 3, 3) B v4 (1, 4, 8, 3) v5 C v6 C D v7 (1, 7, 7, 4) (1, 5, 5, 4) (1, 6, 6, 4) Algorithm for query evaluation • Tree encoding 1 <A> <C> </C> <C> </C> <C> </C> <D> </D> </A> 2 3 3 4 5 5 6 6 7 7 8 9 10 Data streams: 10 sorted by LeftPos values 11 A:(1, 1, 11, 1) B:(1, 2, 9, 2)(1, 4, 8, 3), (1, 10, 10, 2) C:(1, 3, 3, 3)(1, 5, 5, 4), (1, 6, 6, 4) D:(1, 7, 7, 4)

A q1 L(q1) Q: B q5 q2 B L(q2) L(q5) q3 C C q4 L(q4) L(q3) Algorithm for query evaluation • Algorithm description • Our algorithm works bottom-up. Therefore, we need to sort XML • streams by (DocID, RightPos) values. • Each time a query Q is submitted to the system, we will associate • each query node q with a data stream L(q) such that for • each vL(q) label(v) = label(q), in which each query node is • attached with a list of matching nodes of the document tree. T: Q: sorted by RightPos values {v1} L(q1 ) =(1, 1, 11, 1) - {v4, v2, v8} L(q2 ) = L(q5)=(1, 4, 8, 3),(1, 2, 9, 2)(1, 10, 10, 2) - {v3, v5, v6} L(q3) = L(q4) = (1, 3, 3, 3)(1, 5, 5, 4), (1, 6, 6, 4) -

Algorithm for query evaluation • Algorithm description 1. S(v) – a set of query nodes q associated with document tree node v such that Q[q] can be embedded in T[v]. • (q) – a variable associated with query node q. • During the process, (q) will be dynamically assigned a series • of values a0, a1, ..., am for some m in sequence, where a0 =  • and ai’s (i = 1, ..., m) are different nodes of T. • Initially, (q) is set to a0 = . (q) will be changed from ai-1to • ai = v (i = 1, ..., m) when the following conditions are satisfied. i) v is the node currently encountered. ii) q appears in S(u) for some child node u of v. iii) q is a d-child, or q is a c-child, and u is a c-child with label(u) = label(q). T: Q: S(v) (q) v q

q’ v u q S(u) = {…, q, …} Algorithm for query evaluation • Algorithm description i) v is the node currently encountered. ii) q appears in S(u) for some child node u of v. iii) q is a d-child, or q is a c-child, and u is a c-child with label(u) = label(q). d-child (q) is changed from ai-1 to ai = v q’ v u c-child q label(u) = label(q). S(u) = {…, q, …} u must be a direct child of v.

Algorithm for query evaluation • Algorithm description 3.Subtree embedding by using(q) label(v) = label(q’} v q’ ql q1 (q1)= (q2) = … = (ql) =v T[v] embedsQ[q’].

Q : A q1 5 3 4 q2 B B q5 2 1 q3 C C q4 Algorithm for query evaluation • Algorithm description 4.Construction of S(v) • The nodes of Q are numberedin postorder. So the nodes in Q will be referenced by their postorder numbers. • For a leaf node v in T, any encountered leaf node q in Q will be inserted into S(v) if label(u) = label(q). (Since Q is explored bottom-up, the query nodes in S(v)must be increasingly sorted.) • For a non-leaf node v with children v1, …, vk, S(v)  S(v1) …  S(vk)  {all nodes q with T[v] embedding Q[q]}. Since S(v) is sorted, ‘union’ can be implemented using the ‘merge’ operation.

a a c c c b b b b Strict unordered tree matching • Definition Definition 2 A strict embedding of a tree pattern Q into an XML document T is a mapping f: Q  T, from the nodes of Q to the nodes of T, which satisfies the following conditions: • same as (i) in Definition 1. • same as (ii) in Definition 1. • For any two nodes v1 Q and v2 Q, if v1and v2 are not • related by an ancestor-descendant relationship, then f(v1)and • f(v2) in T are not related by an ancestor-descendant relationship. T: Q: q3 v6 q1 q2 v4 v5 v1 v3 v2

a c c a b c b b b Strict unordered tree matching • Definition Definition 2 A strict embedding of a tree pattern Q into an XML document T is a mapping f: Q  T, from the nodes of Q to the nodes of T, which satisfies the following conditions: • same as (i) in Definition 1. • same as (ii) in Definition 1. • For any two nodes v1 Q and v2 Q, if v1and v2 are not • related by an ancestor-descendant relationship, then f(v1)and • f(v2) in T are not related by an ancestor-descendant relationship. T: Q: q3 v6 q1 q2 v4 v5 v1 v3 By Definition 2, this mapping is not allowed. v2

Q: T: … c b y e d c d a a e b x Strict unordered tree matching Strict unordered tree matching needs the exponential time. O(|T||Q|dk) d – the largest out-degree of the nodes in T. k – the largest out-degree of the nodes in Q. Using the concept of hypergraphs, the time complexity can be slightly improved to O(|T||Q|2k).

Algorithm for query evaluation • Experiments • We conducted our experiments on a DELL desktop PC • equipped with Pentium(R) 4 CPU 2.80GHz, 0.99GB RAM • and 20GB hard disk. The code was compiled using • Microsoft Visual C++ compiler version 6.0, running • standalone. • Tested methods • In the experiments, we have tested four methods: • - TwigStack (TS for short), • - Twig2Stack (T2S for short), • - Twig-List (TL for short), • - One-Phase Holistic (OPH for short), • - tree-embedding (discussed in this paper, TE for short).

Algorithm for query evaluation • Experiments • Tested methods • In the experiments, we have tested five methods: • - TwigStack (TS for short) [1], • - Twig2Stack (T2S for short) [2], • - Twig-List (TL for short) [3], • - One-Phase Holistic (OPH for short) [4], • - tree-embedding (discussed in this paper, TE for short). [1] N. Bruno, N. Koudas, and D. Srivastava, Holistic Twig Joins: Optimal XML Pattern Matching, in Proc. SIGMOD Int. Conf. on Management of Data, Madison, Wisconsin, June 2002, pp. 310-321. [2] S. Chen, H-G. Li, J. Tatemura, W-P. Hsiung, D. Agrawa, and K.S. Canda, Twig2Stack: Bottom-up Processing of Generalized-Tree-Pattern Queries over XML Documents, in Proc. VLDB, Seoul, Korea, Sept. 2006, pp. 283-294. [3] Qin, L., Yu, J. X., and Ding, B., “TwigList: Make Twig Pattern Matching fast,” In Proc. 12th Int’l Conf. on Database Systems for Advanced Applications (DASFAA), pp. 850-862, Apr. 2007. [4] Jiang, Z., Luo, C., Hou, W.-C., Zhu, Q., and Che, D., “Efficient Processing of XML Twig Pattern: A Novel One-Phase Holistic Solution,” In Proc. The 18th Int’l Conf. on Database and Expert Systems Applications (DEXA), pp. 87-97, Sept. 2007.

Algorithm for query evaluation • Experiments • Theoretical computational complexities D- a largest data stream associated with a node q of Q such that for each v in the data stream we have label(v) = label(q).

Algorithm for query evaluation • Experiments • Indexes All the tested methods use XB-trees as their index structure.

Algorithm for query evaluation • Experiments • Data sets The data sets used for the tests are DBLP data set [30] and a synthetic XMARK data set [35]. The DBLP data set is another real data set with high similarity in structure. It is in fact a wide and shallow document.

Algorithm for query evaluation • Experiments • Queries – altogether 20 queries, divided into 4 groups Group I: Q1: //inproceedings [author]//year [text() = ‘2004’] Q2: //inproceedings [author and title]//year [text() = ‘2004’] Q3: //inproceedings [author and title and .//pages]//year[text() = ‘2004’] Q4: //inproceedings [author and title and .//pages and .//url] //year [text() = 2004’] Q5: //articles [author and title and .//volume and .//pages and //url]//year [text() = ‘2004’]

Algorithm for query evaluation • Experiments • Test results For all the experiments, the buffer pool size was fixed at 2000 pages. The page size of 8KB was used. For each data set, all the tag names are stored in a single list and then each tag name is represented by its order number in that list during the evaluation of queries. In our implementation, each DocId occupies 4 bytes while a number in a Prüfer sequence, a LeftPos or a RightPos occupies 2 bytes. A levelNum value takes only 1 byte.

* Algorithm for query evaluation • Experiments execution time (sec.) 5 + TS T2S OPH TL TM 4 + * 3 + * + 2 1 Q1 Q2 Q3 Q4 Q5

Algorithm for query evaluation • Experiments • Queries – altogether 20 queries, divided into 4 groups Group II: Q6: //inproceedings[author/* and ./*]/year Q7: //inproceedings[author/* and title and ./*]/year Q8: //inproceedings[author/* and title and .//pages and ./*]/year Q9: //inproceeding[author/* and title and .//pages and .//url and ./*]/year Q10: //articles[author/* and title and .//volume and .//pages and .//pages and .//url and ./*]/year

+ TS T2S OPH TL TM * Algorithm for query evaluation • Experiments execution time (sec.) 60 + 50 40 * + * * 30 * 20 Q6 Q7 Q8 Q9 Q10

Algorithm for query evaluation • Experiments • Queries – altogether 20 queries, divided into 4 groups Group III: Q11: /site//open_auction[.//seller/person]//date[text() = ‘10/23/2006’] Q12: /site//open_auction[.//seller/person and .//bidder]//date[text() = ‘10/23/2006’] Q13: /site//open_auction[.//seller/person and /./bidder/increase]//date [text() = ‘10/23/2006’] Q14: /site//open_auction[.//seller/person and .//bidder/increase and .//initial]//date [text() = ‘10/23/ 2006’] Q15: /site//open_auction[.//seller/person and .//bidder/increase and //initial and .//description]//date [text() = ‘10/23/2006’]

+ TS T2S OPH TL TM * Algorithm for query evaluation • Experiments execution time (sec.) 5 + 4 + 3 + * 2 * * * 1 Q11 Q12 Q13 Q14 Q15

Algorithm for query evaluation • Experiments • Queries – altogether 20 queries, divided into 4 groups Group IV: Q16: /site//open_auction[.//seller/person/* and ./*]/date Q17: /site//open_auction[.//seller/person/* and .//bidder and ./*]/date Q18: /site//open_auction[.//seller/person/* and .//bidder/increase]/date Q19: /site//open_auction[.//seller/person/* and .//bidder/increase and .//initial and ./*]/date Q20: /site//open_auction[.//seller/person/* and .//bidder/increase and .//initial and .//description and ./*]/ date

+ TS T2S OPH TL TM * Algorithm for query evaluation • Experiments execution time (sec.) 5 + 4 * 3 * * 2 * 1 Q16 Q17 Q18 Q19 Q20

Run time space usage Algorithm for query evaluation • Experiments • Test results In the following figure, we compare the runtime memory usage of all the five tested approaches for the second group of queries. By the memory usage, we mean the intermediate data structures, not including data stream (concretely, path stacks for TwigStack; hierarchical stacks for Twig2Stack, TL and OPH; and QSs for ours.)

Summary • An efficient method for evaluating unordered tree pattern queries in XML document databases • - parent/child and ancestor/descendant relation • - O(|D||Q|) time and O(|D||Q|) space • Strict unordered tree matching • - exponential time • Experiments • - TreeBank database, DBLP and XMark documents • - I/O time and CPU processing time

Thank you.

Unordered Tree Matching and Strict Unordered Tree Matching: the Evaluation of Tree Pattern Queries

Unordered Tree Matching and Strict Unordered Tree Matching: the Evaluation of Tree Pattern Queries

Presentation Transcript

Containment of Partially Specified Tree-Pattern Queries

Pattern Matching: Suffix Tree Applications

Pattern Matching

Multi-view matching for unordered image sets

Tree Binary Tree

Tree Evaluation Overview

Choice and Strict Matching

On Testing Satisfiability of Tree Pattern Queries

Tree-Pattern Queries on a Lightweight XML Processor

Frequent-Pattern Tree

Game Tree Evaluation

Answering Tree Pattern Queries Using Views

Choice and Strict Matching

Tree Pattern Matching to Subset Matching in Linear Time

Evaluation of Tree Pattern Queries

Tree Pattern Matching in Phylogenetic Trees

Tree Inclusion, Signatures, and Evaluation of Path-Oriented Queries

Tree Inclusion, Signatures, and Evaluation of Path-Oriented Queries

Frequent-Pattern Tree

Tree Evaluation

Scale and Rotation Invariant Matching Using Linearly Augmented Tree

Tree-Pattern Queries on a Lightweight XML Processor