
Tree Inclusion, Signatures, and Evaluation of Path-Oriented Queries

Tree Inclusion, Signatures, and Evaluation of Path-Oriented Queries Dr. Yangjun Chen Dept. Applied Computer Science, University of Winnipeg, Canada. Motivation Path-Oriented Queries and Tree Inclusion Problem Evaluation of Path-Oriented Queries - Top-down Algorithm for Tree Inclusion


Presentation Transcript


  1. Tree Inclusion, Signatures, and Evaluation of Path-Oriented Queries Dr. Yangjun Chen, Dept. Applied Computer Science, University of Winnipeg, Canada • Motivation • Path-Oriented Queries and Tree Inclusion Problem • Evaluation of Path-Oriented Queries • - Top-down Algorithm for Tree Inclusion • - Integration of Signatures into Top-down Tree Inclusion • Experiment Results • Summary and Future Work

  2. Motivation • Local Information Resource Management – document databases • Internet – Distributed Document Databases • Document Databases • - Storage of documents in relational databases • non-structured data, semi-structured data • - Evaluation of path-oriented queries in document databases • path-oriented languages: XQL, XPath, and XML-QL • Query evaluation methods: • inverse-file based • signature based • string-matching based: suffix trees, Pat-trees • tree-inclusion based • Integrating signatures into top-down tree inclusion algorithm

  3. Path-Oriented Queries and Tree Inclusion Problem • XML Documents and Path-Oriented Queries

  4. Path-Oriented Queries and Tree Inclusion Problem • Tree Inclusion Problem • Definition (tree embedding) Let T and P be two labeled trees. A mapping M from the nodes of P to the nodes of T is an embedding of P into T if it preserves labels and ancestorship. That is, for all nodes u and v of P, we require that • a) M(u) = M(v) if and only if u = v, • b) label(u) = label(M(u)), • c) u is an ancestor of v in P if and only if M(u) is an ancestor of M(v) in T, and • d) v is to the left of u iff M(v) is to the left of M(u). • An embedding is root preserving if M(root(P)) = root(T). It can be shown that restricting to root-preserving embeddings does not lose generality.
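The four conditions of the definition can be checked mechanically for a given mapping. The following is a minimal sketch, not taken from the slides: trees are assumed to be written as (label, [children]) pairs, nodes are identified by their preorder numbers, and the names flatten, is_ancestor, is_left_of, and is_embedding are hypothetical.

```python
def flatten(tree):
    """Return (labels, post), both indexed by preorder number, for tree = (label, [children])."""
    labels, post = [], []
    counter = 0

    def visit(node):
        nonlocal counter
        label, children = node
        idx = len(labels)          # preorder number of this node
        labels.append(label)
        post.append(None)
        for child in children:
            visit(child)
        post[idx] = counter        # postorder number of this node
        counter += 1

    visit(tree)
    return labels, post


def is_ancestor(post, u, v):
    # u is a proper ancestor of v iff u is earlier in preorder and later in postorder.
    return u < v and post[u] > post[v]


def is_left_of(post, u, v):
    # u is to the left of v iff u is earlier in both preorder and postorder.
    return u < v and post[u] < post[v]


def is_embedding(P, T, M):
    """Check conditions a) to d) for a mapping M: preorder number in P -> preorder number in T."""
    p_labels, p_post = flatten(P)
    t_labels, t_post = flatten(T)
    nodes = range(len(p_labels))
    if set(M) != set(nodes) or any(not 0 <= M[u] < len(t_labels) for u in nodes):
        return False
    if len(set(M.values())) != len(M):                      # a) M is one-to-one
        return False
    if any(p_labels[u] != t_labels[M[u]] for u in nodes):   # b) labels are preserved
        return False
    for u in nodes:
        for v in nodes:
            if u == v:
                continue
            if is_ancestor(p_post, u, v) != is_ancestor(t_post, M[u], M[v]):   # c)
                return False
            if is_left_of(p_post, u, v) != is_left_of(t_post, M[u], M[v]):     # d)
                return False
    return True


# Hypothetical example trees (not the hotel example from the slides).
T = ("a", [("b", [("c", []), ("d", [])]), ("e", [])])
P = ("a", [("c", []), ("e", [])])
P_SWAPPED = ("a", [("e", []), ("c", [])])
```

With these trees, the mapping {0: 0, 1: 2, 2: 4} sends a, c, e of P to a, c, e of T and satisfies all four conditions, while P_SWAPPED with {0: 0, 1: 4, 2: 2} violates the left-to-right condition d).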

  5. Path-Oriented Queries and Tree Inclusion Problem • Example: [Figure: a target tree T and a pattern tree P, both rooted at Hotel-room-reservation. Element labels appearing in the figure: name, location, type, reservation, address, country, state, City-or-district, Post-code, number, street, price, from, to, rooms. Text values: Travel-lodge, one-bed-room, $119.00, April 20, 2005, April 28, 2005, Winnipeg, Manitoba, Canada, R3B 2E9, 515 Portage Ave. The pattern contains the variable ?x.]

  6. Path-Oriented Queries and Tree Inclusion Problem • - Algorithms for Tree Inclusion Problem • Bottom-up algorithms: • Kilpelainen-Mannila’s Algorithm (Pekka Kilpelainen and Heikki Mannila, Ordered and unordered tree inclusion, SIAM Journal on Computing, 24:340-356, 1995.) • O(|T| |P|) time • O(|T| |P|) space • Chen’s Algorithm (W. Chen, More efficient algorithm for ordered tree inclusion, Journal of Algorithms, 26:370-385, 1998.) • O(|T| |leaves(P)|) time • O(|leaves(P)| min{height(P), |leaves(T)|}) space

  7. Path-Oriented Queries and Tree Inclusion Problem • - Algorithms for Tree Inclusion Problem • Top-down algorithms: • Y. Chen and Y.B. Chen, An Efficient Top-down Algorithm for Tree Inclusion, in Proc. of 18th Intl. Conf. Symposium on High Performance Computing System and Application, Winnipeg, Canada: IEEE, May 2004, pp. 183-187. • O(|T| |leaves(P)|) time, no extra space needed • Y. Chen and Y.B. Chen, On the Top-down Tree Inclusion Algorithm, submitted to Information Processing Letters. • O(|T||height(P)|) time, no extra space needed • Advantages of top-down over bottom-up: • - better computational complexities • - checking trees page-wise (suitable for large data volumes) • - integrating signatures into tree inclusion to cut useless subtree checks as early as possible

  8. Evaluation of Path-Oriented Queries • - Top-down Algorithm • Target tree: T = <t; T1, ..., Tk>, where t = root(T) and each Ti (i = 1, …, k) is a subtree of t; • Pattern forest: G = <P1, ..., Pq>, where each Pj (j = 1, …, q) is a subtree. • Main idea: • The algorithm attempts to find the number j (≥ 0) of subtrees within an ordered forest G = <P1, ..., Pq> (q ≥ 1) that are embedded in a target tree T. If j = q, we say that G is embedded in T. If j < q, then only the trees P1, ..., and Pj are embedded in T. Let p1, ..., pq and t be the roots of P1, ..., Pq and T, respectively. Since a forest does not have a root, we use a virtual node pv to serve as a substitute for root(G). Thus, root(G) will return pv if G = <P1, ..., Pq> with q > 1, and will return p1 if q = 1.

  9. Evaluation of Path-Oriented Queries - Top-down Algorithm Case 1: root(G) ≠ pv (i.e., G = <P> is a tree and root(G) = p), and label(p) ≠ label(t). If G is embedded in T, then there must exist a subtree Ti of t that contains the whole of G. The algorithm should return 1 if an embedding can be found and 0 if it cannot. [Figure: label(root(T)) ≠ label(root(G)); tree G is included in a subtree Ti of T.]

  10. Evaluation of Path-Oriented Queries - Top-down Algorithm Case 2: root(G) ≠ pv (i.e., G = <P> and root(G) = p), and label(p) = label(t). Let <P1, ..., Pl> (l ≥ 0) be the forest of subtrees of p and <T1, ..., Tk> the forest of subtrees of t. If G is embedded in T, there must exist two sequences of integers k1, ..., kg and l1, ..., lg (g ≤ l) such that T_{k_i} includes <P_{l_{i-1}+1}, ..., P_{l_i}> (i = 1, ..., g, l0 = 0, lg = l), where <P_{l_{i-1}+1}, ..., P_{l_i}> represents a forest containing the subtrees P_{l_{i-1}+1}, ..., and P_{l_i}. Thus, if lg = l, the algorithm should return 1 since we have a root preserving inclusion of G in T. Otherwise, it should return 0. [Figure: label(root(T)) = label(root(G)); the subtrees P1, ..., Pl of p are included, in order, in subtrees among T1, ..., Tk of t.]

  11. Evaluation of Path-Oriented Queries - Top-down Algorithm Case 3: root(G) = pv and there exists an integer j (0 ≤ j ≤ q) such that <P1, ..., Pj> is included in T. If j = q, then the whole G is embedded in T. There are two possibilities to be considered when looking for j. The first possibility is similar to Case 2, where there are two sequences of integers k1, ..., kg and l1, ..., lg (g ≤ q) that represent the order in which the subtrees of root(G) are embedded in the subtrees of root(T). In this case, j = lg. If j = 0, we check the second possibility to see whether there exists a root preserving inclusion of P1 in T, i.e., label(p1) = label(t) and the subtrees of p1 are included in the subtrees of t. In this case, j = 1.

  12. Evaluation of Path-Oriented Queries - Top-down Algorithm [Figure: Possibility 1: root(G) is a virtual node, and the subtrees P1, ..., Pl of G are included in subtrees among T1, ..., Tk of t. Possibility 2: root(G) is a virtual node and label(root(T)) = label(root(P1)); the subtrees of P1’s root are included in the subtrees of t.]

  13. Evaluation of Path-Oriented Queries - Top-down Algorithm
  function top-down-process(T, G)
  input: T = <t; T1, ..., Tk>, G = <p; P1, ..., Pq> (*p may or may not be a virtual node.*)
  output: if root(G) is virtual, returns j ≥ 0; else returns 1 if T includes G; otherwise returns 0.
  begin
    if root(G) is virtual then
      if (|T| < |P1| + |P2| or p has only one child)
      then G := P1;
      else { j := bottom-up-process(T, G);
             if (j = 0 and label(t) = label(P1’s root)) (*second possibility in Case 3*)
             then { change P1’s root to a virtual node; x := bottom-up-process(T, P1);
                    if (x = the number of the children of P1’s root) then j := 1 else j := 0; }
             return j; }
    if |T| < |G| then return 0;
    else { if (label(t) = label(p)) (*handling Case 2*)
           then { p := virtual node;
                  j := bottom-up-process(T, G);
                  if (j = l) then return 1 else return 0; } (*l is the number of subtrees of p, as in Case 2*)
           else { if t is a leaf then return 0;
                  (*handling Case 1*)
                  i := 1;
                  while (i ≤ k) do
                  { if top-down-process(Ti, G) > 0 then return 1;
                    i := i + 1; }
                  return 0; } }
  end
  function bottom-up-process(T, G)
  input: T = <t; T1, ..., Tk>, G = <p; P1, ..., Pq>
  output: j - an integer
  begin
    j := 0; i := 1;
    while (j < q and i ≤ k) do
    { x := top-down-process(Ti, G);
      j := j + x; G := <p; Pj+1, ..., Pq>; i := i + 1; }
    return j;
  end
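When experimenting with an implementation of the procedure above, it can help to have an independent check of whether one ordered tree includes another. The sketch below is not the top-down algorithm from the slides: it is a plain memoized recursion over forests that either skips the rightmost root of the target forest (promoting its children) or matches it with the rightmost root of the pattern forest. Trees are assumed to be nested tuples (label, (child, ...)), and the name includes is hypothetical.

```python
from functools import lru_cache


def includes(T, P):
    """True if ordered tree T includes ordered tree P; trees are (label, (children...)) tuples."""

    @lru_cache(maxsize=None)
    def forest_includes(ft, fp):
        if not fp:              # nothing left to embed
            return True
        if not ft:              # pattern nodes remain but no target nodes
            return False
        (t_label, t_children) = ft[-1]
        (p_label, p_children) = fp[-1]
        # Option 1: the rightmost target root is not used; its children take its place.
        if forest_includes(ft[:-1] + t_children, fp):
            return True
        # Option 2: map the rightmost pattern root to the rightmost target root.
        return (t_label == p_label
                and forest_includes(t_children, p_children)
                and forest_includes(ft[:-1], fp[:-1]))

    return forest_includes((T,), (P,))


# Hypothetical example trees.
T = ("a", (("b", (("c", ()), ("d", ()))), ("e", ())))
P_OK = ("a", (("c", ()), ("e", ())))       # c before e, as in T
P_BAD = ("a", (("e", ()), ("c", ())))      # sibling order reversed
P_SUB = ("b", (("d", ()),))                # embedded below the root of T
```

This check is meant only as a reference for small inputs; it makes no attempt to match the time or space bounds quoted for the bottom-up and top-down algorithms.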

  14. Integration of Signatures into Top-down Inclusion • Definition A signature for a key word or an attribute value is a hash-coded bit string. • - Example: (constructing a signature for a word with m = 4 and F = 12) • “database” • ↓ • letter triplets: dat, ata, tab, aba, bas, ase • ↓ • H(dat) = 5, H(ata) = 1, H(tab) = 8, H(aba) = 1, H(bas) = 10, H(ase) = 8. • ↓ • 100 010 010 100 • D. Dervos, Y. Manolopulos and P. Linardis, “Comparison of signature file models with superimposed coding,” Information Processing Letters 65 (1998) 101-106.
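The “database” example can be replayed in a few lines. This is a minimal sketch: the hash function H is not specified on the slide, so the hash values are simply copied from it into a lookup table, and bit positions are assumed to be counted from 1 starting at the left. The names triplets, SLIDE_HASH, and word_signature are hypothetical.

```python
def triplets(word):
    """All overlapping three-letter substrings of a word."""
    return [word[i:i + 3] for i in range(len(word) - 2)]


# Hash values as listed on the slide; H itself is not given there.
SLIDE_HASH = {"dat": 5, "ata": 1, "tab": 8, "aba": 1, "bas": 10, "ase": 8}


def word_signature(word, hash_values, F=12):
    """Set bit H(t) for every triplet t; positions are 1-based from the left (assumption)."""
    bits = ["0"] * F
    for t in triplets(word):
        bits[hash_values[t] - 1] = "1"
    return "".join(bits)


signature = word_signature("database", SLIDE_HASH)
```

The six triplets hit only four distinct positions (1, 5, 8, and 10), which gives the m = 4 bits set to 1 in the 12-bit string shown on the slide.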

  15. Integration of Signatures into Top-down Inclusion • Definition A signature for a key word or an attribute value is a hash-coded bit string. • - Important parameters: • m: number of 1s in bit string • F: length of bit string • D: size of a block (or average number of the key words of an element) • optimal choice of the parameters: F · ln 2 = m · D (1) • S. Christodoulakis and C. Faloutsos, “Design consideration for a message file server,” IEEE Trans. Software Engineering, 10(2) (1984) 201-210.
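Formula (1) can be turned around to pick F once m and D are known. The sketch below assumes that F is rounded up to the next integer, which the slide does not state; the function name signature_length and the sample value D = 2 are hypothetical.

```python
import math


def signature_length(m, D):
    """Smallest integer F with F * ln 2 >= m * D (rounding up is an assumption)."""
    return math.ceil(m * D / math.log(2))


# With m = 4 bits per key word and an assumed average of D = 2 key words per element:
F = signature_length(4, 2)
```

For these sample values the result is F = 12, the signature length used in the “database” example.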

  16. Integration of Signatures into Top-down Inclusion - Assigning signatures to tree nodes Let v be a node in a tree T. If v is a leaf node, its signature sv is equal to the signature assigned to its label. Otherwise, sv = s ∨ s1 ∨ ... ∨ sn, where s represents the signature for the label associated with v, and s1, ..., and sn are the signatures of v’s children v1, ..., vn, respectively. Label signatures: a: 0101 0000, b: 0011 1000, c: 0001 0101, d: 0010 1000, e: 1010 1000, f: 1100 0000. [Figure: a tree T with nodes t0, t1, t2, t11, t12, t21, t22, shown first with labels (root a, inner nodes e and b, leaves f, e, c, d) and then with node signatures (root a: 1111 1101; inner nodes: 1111 1101 and 1111 1000; leaves c: 0001 0101, d: 0010 1000, e: 1010 1000, f: 1100 0000).]
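The rule for inner nodes is a bitwise OR over the node's own label signature and the signatures of its children. A minimal sketch follows; the label signatures are the 8-bit strings for a to f read from the slide, while the tree shape used here is a hypothetical example rather than the tree in the figure, and the names LABEL_SIG and node_signature are made up for illustration.

```python
# 8-bit label signatures read from the slide's table.
LABEL_SIG = {
    "a": 0b01010000,
    "b": 0b00111000,
    "c": 0b00010101,
    "d": 0b00101000,
    "e": 0b10101000,
    "f": 0b11000000,
}


def node_signature(tree):
    """Signature of the root of tree = (label, [children]): label signature OR-ed with the children's."""
    label, children = tree
    sig = LABEL_SIG[label]
    for child in children:
        sig |= node_signature(child)
    return sig


# Hypothetical tree: a has children b and f; b has children c and d.
B_SUBTREE = ("b", [("c", []), ("d", [])])
TREE = ("a", [B_SUBTREE, ("f", [])])
```

For this tree, format(node_signature(B_SUBTREE), "08b") gives 00111101 and the root gets 11111101. A leaf simply keeps the signature of its label.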

  17. Integration of Signatures into Top-down Inclusion • - Cutting off useless subtree checks by examining signatures • We assign each node v in T a bit string sv (called a signature), and each node u in P a bit string su, in such a way that if su matches sv then the subtree Tv rooted at v may include the subtree Pu rooted at u. Otherwise, Tv definitely does not contain Pu. By “matching”, we mean that for each bit set to 1 in su, the corresponding bit in sv is also set to 1, while for a bit set to 0 in su, the corresponding bit in sv can be 0 or 1. In the following, we discuss this technique in great detail. [Figure: target tree T (nodes t0, t1, t2, t11, t12, t21, t22; root a: 1111 1101) and pattern tree P (nodes p0, p1, p2, below a virtual node), both annotated with signatures. A subtree of T whose signature is not matched by the pattern’s signature is marked “This subtree will not be explored.”]
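The matching test described here is a single bitwise operation. The sketch below is illustrative only: the function name matches is hypothetical, and the sample bit strings are values that appear in the slide's figures, used without claiming which nodes they belong to.

```python
def matches(s_u, s_v):
    """True if every bit set to 1 in s_u is also set to 1 in s_v."""
    return (s_u & s_v) == s_u


PATTERN_SIG = 0b00111101      # signature of a pattern subtree
TARGET_SIG_A = 0b11111101     # target subtree that may include the pattern
TARGET_SIG_B = 0b11111000     # target subtree that cannot include the pattern

explore_a = matches(PATTERN_SIG, TARGET_SIG_A)   # keep checking this subtree
explore_b = matches(PATTERN_SIG, TARGET_SIG_B)   # prune this subtree
```

A match is only a necessary condition, since different labels can set the same bits, so a matching subtree still has to be checked by the tree inclusion algorithm itself.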

  18. Integration of Signatures into Top-down Inclusion • - Determine the length of signatures • Consider s = s1 ∨ s2, where s1 and s2 are of length F and have m1 and m2 bits set to 1, respectively. • How to determine the length of s? • λ - the number of 1s in s • d = λ - m’, where m’ = max(m1, m2). • length(s) = F + cd, where c is a constant that should be tuned for different applications. • The value of d can be estimated as follows. • l - random variable representing the number of positions in which both s1 and s2 have 1s.

  19. Integration of Signatures into Top-down Inclusion • - Determine the length of signatures • E[l] = 1 · p(l = 1) + 2 · p(l = 2) + ... + m’’ · p(l = m’’) (2) • m’’ = min(m1, m2) and p(l = i) represents the probability that l is equal to i. • p(l = i) = (3) • d = λ - m’ = m1 + m2 - l - max(m1, m2), where λ = m1 + m2 - l is the number of 1s in s = s1 ∨ s2 and m’ = max(m1, m2).
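The right-hand side of formula (3) did not survive in this transcript. As an illustration only, the sketch below assumes that the 1s of s2 fall uniformly at random among the F positions, independently of s1, which makes the overlap l hypergeometric; this model is an assumption and not necessarily the formula on the original slide. All function names are hypothetical.

```python
from math import comb


def overlap_probability(F, m1, m2, i):
    """p(l = i) under the assumed uniform-placement (hypergeometric) model."""
    return comb(m1, i) * comb(F - m1, m2 - i) / comb(F, m2)


def expected_overlap(F, m1, m2):
    """E[l] as in formula (2): the sum of i * p(l = i) for i = 1, ..., min(m1, m2)."""
    return sum(i * overlap_probability(F, m1, m2, i)
               for i in range(1, min(m1, m2) + 1))


def estimated_d(F, m1, m2):
    """d = m1 + m2 - l - max(m1, m2), with l replaced by its expectation."""
    return m1 + m2 - expected_overlap(F, m1, m2) - max(m1, m2)


def estimated_length(F, m1, m2, c):
    """length(s) = F + c * d, with c a tuning constant chosen per application."""
    return F + c * estimated_d(F, m1, m2)
```

Under this model E[l] reduces to m1 · m2 / F, so for F = 12 and m1 = m2 = 4 the expected overlap is 4/3 and d is about 2.67.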

  20. Evaluation of Path-Oriented Queries • - Procedure for calculating signature length • 1) Identify the key words in a document, which can be done by using Connexor-analyzer (http://www.connexor.com/demos/index.html). • 2) Determine the length of the signatures for the nodes of a document tree, which can be done in two steps: • - First, use formula (1) to determine the initial length of the signatures according to the number of the chosen key words and their distribution. • - Secondly, use formulas (2) and (3) to determine the length of the signatures for each document according to the initial length set for signatures.

  21. Evaluation of Path-Oriented Queries • - Procedure for calculating signature length • In the figure, F stands for the initial length of the signatures and m for the initial number of bits set to 1.

  22. Experiment Results • - Test Platform • Computer - DELL desktop PC equipped with Pentium III 864Ghz processor, 512MB RAM and 20GB hard disk. • Database system - Oracle-9i Enterprise Edition. The default buffer cache of Oracle-9i is of size 4MB. • Language - Oracle PL/SQL. • Data - all 37 of Shakespeare’s plays, stored in a database

  23. Experiment Results • - Storage of XML documents in databases • All the documents are stored in three tables. • The relation Element has the following structure: • {DocID: <integer>, ID: <integer>, Ename: <string>, • firstChildID: <integer>, siblingID: <integer>, attributeID: <integer>}

  24. Experiment Results • - Storage of XML documents in databases • The relation Text is of a simpler structure: • {DocID: <integer>, textID: <integer>, value: <string>}, • where “textID” is for the identifiers of texts as the values of the corresponding elements in the original document. Note that a text always has an element as its parent node. See the following table for illustration.

  25. Experiment Results • - Storage of XML documents in databases • The relation Attribute has five data fields: • {DocID: <integer>, att-ID: <integer>, parentID: <integer>, att-name: <string>, att-value: <string>}.
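The three relations can be written down as a small schema. The sketch below uses Python's built-in sqlite3 module as a stand-in for the Oracle-9i setup of the experiments; the column types are assumptions derived from the <integer> and <string> annotations, and the hyphenated field names are rewritten with underscores (att_ID, att_name, att_value) to make them valid identifiers.

```python
import sqlite3

# Schema sketch for the relations Element, Text, and Attribute.
SCHEMA = """
CREATE TABLE Element (
    DocID INTEGER, ID INTEGER, Ename TEXT,
    firstChildID INTEGER, siblingID INTEGER, attributeID INTEGER
);
CREATE TABLE "Text" (DocID INTEGER, textID INTEGER, value TEXT);
CREATE TABLE Attribute (
    DocID INTEGER, att_ID INTEGER, parentID INTEGER,
    att_name TEXT, att_value TEXT
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# List the tables that were created.
tables = sorted(row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"))
```

The firstChildID and siblingID columns suggest a first-child/next-sibling encoding of the document tree, which would let a traversal fetch one Element row per step.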

  26. Experiment Results - Tested queries Group I - for testing path length impact • Group II - for testing node degree impact

  27. Experiment Results - Tested queries Group III - for testing impact of matching at higher level • Group IV - for testing impact of matching at middle level

  28. Experiment Results - Tested queries Group V- for testing impact of matching at lower level

  29. Experiment Results • - Tested methods • Inversion on Elements and Words (IEW) (C. Zhang, J. Naughton, D. DeWitt, Q. Luo and G. Lohman, “On Supporting Containment Queries in Relational Database Management Systems,” in Proc. of ACM SIGMOD Intl. Conf. on Management of Data, California, USA, 2001.) • Inversion on Paths and Words (IPW) (C. Seo, S. Lee, and H. Kim, An Efficient Index Technique for XML Documents Using RDBMS, Information and Software Technology 45 (2003) 11-22, Elsevier Science B.V.) • Tree Inclusion Algorithm (TIA) • Tree Inclusion with Signatures (TIS)

  30. Experiment Results • - Tested methods • Inversion on Elements and Words (IEW) • - (Dno, Wposition, level) for a text word • - (Dno, Eposition, level) for an element • Example: • E-index: hotel-room-reservation: (1, <1, 45>, 0) ...; name: (1, <2, 4>, 1) ...; location: (1, <5, 28>, 2) ...; ... • T-index: Travel-lodge: (1, 3, 2) ...; Winnipeg: (1, 7, 3) ...; Manitoba: (1, 10, 3) ...; ...

  31. Experiment Results • - Tested methods • To evaluate the query: /hotel-room-reservation/location/address [street = Portage Ave.], • four joins are performed: • self-joins on E-index relation to connect ‘hotel-room-reservation’ and ‘location’, • ‘location’ and ‘address’, • ‘address’ and ‘street’, • the join between E-index and T-index relations to connect ‘street’ and ‘Portage Ave.’
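The four joins can be imitated on in-memory lists. This is a sketch under assumptions: the join condition (same document, the child's interval strictly inside the parent's, level larger by one) is the usual containment test for such position-based indexes and is not spelled out on the slides; all positions and levels below are invented; and the text “Portage Ave.” is treated as a single T-index entry for brevity.

```python
# Hypothetical index entries; positions and levels are made up for illustration.
E_INDEX = [  # (element, docno, begin, end, level)
    ("hotel-room-reservation", 1, 1, 45, 0),
    ("name", 1, 2, 4, 1),
    ("location", 1, 5, 28, 1),
    ("address", 1, 6, 20, 2),
    ("street", 1, 10, 13, 3),
]
T_INDEX = [  # (word, docno, position, level)
    ("Travel-lodge", 1, 3, 2),
    ("Portage Ave.", 1, 11, 4),
]


def elements(name):
    return [e for e in E_INDEX if e[0] == name]


def child_join(parents, children):
    """One self-join on E-index: keep children lying directly below some parent."""
    return [c for c in children
            if any(p[1] == c[1] and p[2] < c[2] and c[3] < p[3] and c[4] == p[4] + 1
                   for p in parents)]


def text_join(parents, word):
    """Join of E-index with T-index: keep elements that directly contain the word."""
    return [p for p in parents
            if any(w[0] == word and w[1] == p[1] and p[2] < w[2] < p[3] and w[3] == p[4] + 1
                   for w in T_INDEX)]


def evaluate(path, word):
    current = elements(path[0])
    for name in path[1:]:                 # three self-joins for a four-step path
        current = child_join(current, elements(name))
    return text_join(current, word)       # the fourth join


result = evaluate(["hotel-room-reservation", "location", "address", "street"],
                  "Portage Ave.")
```

The number of joins grows with the length of the path, which is the cost the path-based IPW scheme tries to avoid.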

  32. Experiment Results • - Tested methods • Inversion on Paths and Words (IPW) • - Path(path, pathID), • - PathIndex(pathID, docno, begin, end) • - Word(word, wordID) • - WordIndex(wordID, docno, pathID, position)

  33. Experiment Results • - Tested methods • In order to process the same query: • /hotel-room-reservation/location/address [street = Portage Ave.], • two joins are needed. • The first join is between the Path and WordIndex relations, with the following join condition: • Path.path = ‘hotel-room-reservation/location/address/street’ and Path.pathID = WordIndex.pathID. • The second join is between the result R of the first join and the Word relation, with the join condition: • R.wordID = Word.wordID and Word.word = ‘Portage Ave.’.
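The two joins can be tried out on a toy database. The sketch below uses Python's sqlite3 module in place of the Oracle system from the experiments; the table and column names follow the IPW relations Path, Word, and WordIndex, while every row (IDs, positions) is invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Path (path TEXT, pathID INTEGER);
CREATE TABLE Word (word TEXT, wordID INTEGER);
CREATE TABLE WordIndex (wordID INTEGER, docno INTEGER, pathID INTEGER, position INTEGER);

-- Hypothetical sample rows.
INSERT INTO Path VALUES ('hotel-room-reservation/location/address/street', 7);
INSERT INTO Path VALUES ('hotel-room-reservation/name', 2);
INSERT INTO Word VALUES ('Portage Ave.', 31);
INSERT INTO Word VALUES ('Travel-lodge', 30);
INSERT INTO WordIndex VALUES (31, 1, 7, 12);
INSERT INTO WordIndex VALUES (30, 1, 2, 3);
""")

# First join: Path with WordIndex on pathID; second join: the result with Word on wordID.
QUERY = """
SELECT wi.docno, wi.position
FROM Path AS p
JOIN WordIndex AS wi ON p.pathID = wi.pathID
JOIN Word AS w ON wi.wordID = w.wordID
WHERE p.path = ? AND w.word = ?
"""

rows = conn.execute(
    QUERY,
    ("hotel-room-reservation/location/address/street", "Portage Ave."),
).fetchall()
```

Because the whole path is stored as one string, the number of joins stays at two regardless of the path length, in contrast to the four joins needed by IEW for the same query.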

  34. Experiment Results - Tested results [Figure: four plots of execution time (sec.) per query, titled Results of Group I, Results of Group II, Results of Group III, and Results of Group IV, comparing IEW, IPW, TIA, and TIS.]

  35. Experiment Results - Tested results [Figure: plot of execution time (sec.) for queries Q1 to Q5, titled Results of Group V, comparing IPW, TIS, and TIA.]

  36. Summary and Future Work • Path-oriented queries in document databases • Evaluation of path-oriented queries • - Top-down algorithm for tree inclusion problem • - Integration of signatures into top-down tree inclusion • Future work: • document recognition using • tree inclusion • probabilistic analysis • Benford’s law • Zipf’s law
