250 likes | 398 Views
A Linear-Space Top-down Algorithm for Tree Inclusion Problem. Prof. Dr. Yangjun Chen Dept. Applied Computer Science, University of Winnipeg 515 Portage Ave. Winnipeg, Manitoba, Canada R3B 2E9. Outline. Motivation Basic algorithm for tree inclusion problem - Definition
E N D
A Linear-Space Top-down Algorithmfor Tree Inclusion Problem Prof. Dr. Yangjun Chen Dept. Applied Computer Science, University of Winnipeg 515 Portage Ave. Winnipeg, Manitoba, Canada R3B 2E9
Outline • Motivation • Basic algorithm for tree inclusion problem - Definition - Algorithm description • Improvements • Summary
T: T: a d a d b b c e f e f Motivation Given two ordered labeled trees P and T, called the pattern and the target, respectively. An interesting problem is: Can we obtain pattern P by deleting some nodes from target T? That is, is there a sequence v1 , ..., vk of nodes such that for T0 = T and Ti+1 = delete(Ti, vi +1) for i = 0, ..., k - 1, we have Tk = P. If this is the case, we say, P is included in T, T contains P, or say, T covers P. delete(T, c) By “ordered trees”, we mean those trees, in which the order of siblings issignificant.
s s np vp vp det n v np adv v n adv “The” “student” “reads” det adj n “again and again” “reads” “book” “the” “interesting” “book” Motivation • Linguistic analysis
a b a b b d e b b Tree inclusion algorithm • Definition Definition 1 Let F and G be labeled ordered forests. We define an ordered embedding (, G, F) as an injective function : V(G) V(F) such that for all nodes v, u V(G), i) label(v) = label((v)); (label preservation condition) ii) v is an ancestor of u iff (v) is an ancestor of (u);(ancestor condition) iii) v is to the left of u iff (v) is to the left of (u); (Sibling condition) F: G: By “ordered trees”, we mean those trees, in which the order of siblings issignificant.
Tree inclusion algorithm • Algorithm • Let T = <t; T1, ..., Tk> (k 1) be a tree and G = <P1 , ..., Pl> • (l 1) be a forest. We handle G as a tree P = <pv; P1, ..., Pl>, • where pv represents a virtual node, matching any node in T. • Consider a node v in P with children v1, ..., vj. We use a pair <i, v> • (i j) to represent an ordered forest containing the first i subtrees • of v: <P[v1], ..., P[vi]>. Then, <j, pv> represents the first j trees • in G. P: v vi vk v1 … … referred to as a left corner <i, v>
Tree inclusion algorithm • Algorithm • In addition, h(v) represents the height of v in a tree; and (v) • represents a link from v in P to the leaf node on the left-most • path in P[v]. P: Let v’ be a leaf node in P. We denote by -1(v’) a set of nodes x such that for each v x (v) = v’. -1(v3) = {v1, v2, v3} (v1) v1 (v2) v2 v5 v4 v3
… … Pl Tree inclusion algorithm • Algorithm The tree inclusion checking is done by calling two functions recursively: top-down(T, G), bottom-up(T’, G), where T is a tree, and T’ and G are two forests. G: pv T = <t; T1, ..., Tk> T’ = <T1’, ..., Tk’> v P1 P2 G = <P1, ..., PL> • Each of the two functions returns a pair <i, v> with v being pv or a node on • the left-most path in P1. • Each node in T will be associated with a data structure, (t). - Initially, (t) is set . - By each call of top-down(<t; T1, ..., Tk>, G), (t) is changed to <i, v>, where <i, v> is the return value of top-down(<t; T1, ..., Tk>, G). ((t) is mainly used in the execution of bottom-up(T’, G) to avoid repeated computation.)
T: t … … P2 Pl Tree inclusion algorithm • Function: top-down(T, G) In top-down(T, G), two cases will be handled. Case 1: G = <P1>; or G = <P1, ..., Pl> (l >1), but |T| |P1| + |P2|. In this case, we try to find a pair <i, v> such that T contains the first i subtrees of v, where v = pv , or v -1(v’) and v’ is the leaf node on the left-most path in P1. pv T: G: t p1 P1 G: pv p1 |T| |P1| + |P2|. P1
… … Pl Tree inclusion algorithm • Function: top-down(T, G) case 1: i) If t is a leaf node, we will check whether label(t) = label((p1)), where p1 is the root of P1. If it is the case, return <1, parent of (p1)>. Otherwise, return <0, parent of (p1)>. T = <t; T1, ..., Tk>: pv G: t P1 G: pv T = <t; T1, ..., Tk>: t |T| |P1| + |P2|. P1 P2
p1 P11 P1i P1j … … … … Pl Tree inclusion algorithm • Function: top-down(T, G) case 1: • If |T| < |P1| or height(t) < height(p1), we will make a recursive call • top-down(T , <P11, ..., P1j>), where <P11, ..., P1j> be a forest of • the subtrees of p1. The return value of top-down(T , <P11, ..., P1j>) • is used as the return value of top-down(T, G) pv G: T: t |T| < |P1|
t p1 label(t) = label(p1) T1 P11 Ti Tk P1i P1j … … … … Tree inclusion algorithm • Function: top-down(T, G) case 1: • If |T| |P1| (but |T | |P1| + |P2|) and height(t) height(p1), two cases • need to be considered: • label(t) = label(p1). Call bottom-up(<T1, ..., Tk>, <P11, ..., P1j>). • label(t) label(p1). Call bottom-up(<T1, ..., Tk>, <P1>). t p1 label(t) label(p1) T1 P11 Ti Tk P1i P1j … … … …
P1: p1 v Tree inclusion algorithm • Function: top-down(T, G) case 1: In both sub-cases, assume that the return value of the corresponding bottom-up function is <i, v>. A further checking needs to be conducted: • If label(t) = label(v) and i = the outdegree of v, the return value should • be <1, v’s parent>. • Otherwise, the return value is the same as <i, v>. label(t) =label(v) T: t or label(t) label(v)
… … … … Tk Pl Tree inclusion algorithm • Function: top-down(T, G) Case 1: G = <P1>; or G = <P1, ..., Pl> (l >1), but |T| |P1| + |P2|. Case 2: G = <P1, ..., Pl> (l >1), and |T| > |P1| + |P2|. In this case, we will call bottom-up(<T1, ..., Tk>, G). Assume that the return value is <i, v>. The following checkings will be continually conducted. T: G: t pv |T| > |P1| + |P2| T1 T2 P1 P2
… … Pl Pl Tree inclusion algorithm • Function: top-down(T, G) Case 2: G = <P1, ..., Pl> (l >1), and |T| > |P1| + |P2|. In this case, we will call bottom-up(<T1, ..., Tk>, G). Assume that the return value is <i, v>. The following checkings will be continually conducted. iv) If v = p1’s parent, the return value is the same as <i, v>. v) If v p1’s parent, check whether label(t) = label(v)) and i = the outdegree of v. If so, the return value will be changed to <1, v’s parent>. Otherwise, the return value remains <i, v>. v p1’s parent v = p1’s parent = pv G: pv pv … … v P1 P2 P1 P2 Pi
P1 T2 Ti Tk Pi Pq … … Tree inclusion algorithm • Function: bottom-up(T’, G) bottom-up(T’, G)is designed to handle the case that both T’ and G are forests. Let T’ = <T1, ..., Tk> and G = <P1, ..., Pq>. In bottom-up(T’, G), we will make a series of calls top-down(Tl, <Pjl, ..., Pq>), where l = 1, ..., k, j1 = 0, and j1j2 ... jhq (for some h k), controlled as follows. G: T’: T1 … … … top-down(Tl, <Pjl, ..., Pq>) Before each call of top-down(Tl, <Pjl, ..., Pq>), (tl) will be checked to determine whether the call will be conducted.
Tree inclusion algorithm • Function: bottom-up(T’, G) bottom-up(T’, G) is designed to handle the case that both T’ and G are forests. Let T’ = <T1, ..., Tk> and G = <P1, ..., Pq>. In bottom-up(T’, G), we will make a series of calls top-down(Tl, <Pjl, ..., Pq>), where l = 1, ..., k, j1 = 0, and j1j2 ... jhq (for some h k), controlled as follows. • Two index variables l, j are used to scan T1, ..., Tk and P1, ..., Pq, • respectively. • Let <il, vl>be the return value of top-down(Tl, <Pj, ..., Pq>). If vl = pj’s • parent, set j to be j + il - 1. Otherwise, j is not changed. Set l to be l + 1. • Go to (2). • The loop terminates when all Tl’s or all Pj’s are examined. In the above process, (tl) will be checked before each top-down(Tl, <Pj, ..., Pq>) to determine whether the call will be conducted
P1 Ti T2 Tk Pi Pj Pq … … Tree inclusion algorithm • Function: bottom-up(T’, G) • If j > 0 when the loop terminates, bottom-up(T’, G) returns • <j, p1’s parent>. T1 … … …
P2 Tree inclusion algorithm • Function: bottom-up(T’, G) • If j > 0 when the loop terminates, bottom-up(T’, G) returns • <j, p1’s parent>. • Otherwise, j = 0. In this case, we will continue to search for a pair • <i, v> such that T’ contains the first i subtrees of v, where v -1(v’) and • v’ is the leaf node on the left-most path in P1, as described below. • Let <i1, v1>, <i2, v2>, ..., <ik, vk> be the respective return values of • top-down(T1, <P1, ..., Pq>), • top-down(T2, <P1, ..., Pq>), • ... ... • top-down(Tk, <P1, ..., Pq>). • Since j = 0, each vl-1(v’) (l = 1, ..., k). P1 v2 … v1 … vk
T2 Tree inclusion algorithm • Function: bottom-up(T’, G) i) Let <i1, v1>, ..., <ik, vk> be the return values of top-down(T1, <P1, ..., Pq>), ..., top-down(Tk, <P1, ..., Pq>), respectively. Since j = 0, each vl-1(v’) (l = 1, ..., k). ii) If each il = 0, return <0, >, where is considered to be a descendant of any node in G. Otherwise, find the first vg with children w1, ..., wh such that vg is not a descendant of any other vj, and ig > 0. P1 T1 Tg Tg+1 Tk vg … … … … v1 vk ig
Tg+2 Tg+i Tk … … Tree inclusion algorithm • Function: bottom-up(T’, G) iii) Check <P[wig+1], ..., P[wh]>) against <Tg+1, ..., Tk>. This can be done by a series of calls top-down(Tl, <P[wjl] ..., P[wh]>), where l = g + 1, …, k, and j1j2 ... jhh (for some h k), as illustrated below. <Tg+1, ..., Tk>: <P[wig+1], ..., P[wh]>: Tg+1 P[wig+1] P[wig+i] P[wh] … … … • Let <x, y> be its return value. Ify = vg,then thereturn value of • bottom-up(T’, G) is set to be <ig + x, vg>. • Otherwise, the return value is <ig, vg>.
T2 Tree inclusion algorithm • Further improvements In the case j = 0: Let <i1, v1>, ..., <ik, vk> be the return values of top-down(T1, <P1, ..., Pq>), ..., top-down(Tk, <P1, ..., Pq>). We will find the first vg such that it is not a descendant of any other vj and ig > 0. Then, <Tg+1, ..., Tk> will be checked against <P[wig+1], ..., P[wh]>. This shows that all the return values except <ig, vg> are not used in the subsequent computation. Thus, the work for looking for such values should be avoided. P1 T1 Tg Tg+1 Tk vg … … … … v1 vk Any of these subtrees should not be checked against <P1, ..., Pq>).
Tree inclusion algorithm • Further improvements Let <ij, vj> be the return value of top-down(Tj, <P1, ..., Pq>) such that ij > 0 and vj is p1 or a descendant of p1. Then, during the execution of top-down(Tj+1, <P1, ..., Pq>), once we have detected that it can only produce a return value <ij+1, vj+1> with vj+1 being a descendant of vj, we should stop the corresponding computation immediately since this return value will not be used in the subsequent searching. For this purpose, we rearrange top-down(Tj+1, <P1, ..., Pq>) to top-down(Tj+1, <P1, ..., Pq>, vj) with vj being used to transfer information, called a controlling-node. Assume that in the execution of top-down(Tj+1, <P1, ..., Pq>, vj), we have the following function calls: top-down(Tj+1, <P1, ..., Pq>, u1) returns <a1, u1>, top-down(Tj+2, <P1, ..., Pq>, u2) returns <a1, u2>, … … with all uj’s being a proper descendant of vj. Then the bottom-up function call with some ui as a controlling node should not be conducted. bottom-up(<Tj+i, ... >, <… …>, ui).
Summary • An efficient method for tree inclusion problem • - O|T|leaves(P)|)time and • - O(|T| + |P|) space • where leaves(P) - set of the leaf nodes of P. • Future work • - adapt the algorithm to a data stream environment • - adapt the algorithm to an indexing environment