A Polynomial Time Matching Algorithm of Ordered Tree Patterns having Height-Constrained Variables
Kazuhide Aikou (1), Yusuke Suzuki (1,2), Takayoshi Shoudai (1), Tomoyuki Uchida (2), Tetsuhiro Miyahara (2)
• (1) Department of Informatics, Kyushu University, Japan
• (2) Faculty of Information Sciences, Hiroshima City University, Japan
Contents
• Backgrounds and Motivations
• Preliminaries
• Ordered Term Trees
• Height-Constrained Variables
• A Matching Algorithm of Ordered Term Trees having Height-Constrained Variables
• Conclusions and Future Works
Backgrounds
Increase of tree-structured data (Web documents, HTML/XML, etc.)
• Our works:
• COLT for term trees
• Web mining systems using learning algorithms for term trees
• Application: knowledge discovery from Web documents, that is, discovery of tree-structured patterns common to tree-structured data
Example XML data: <Salesperiod> <Quarter>Winter1998</Quarter> <Design> <Designnumber>C365</Designnumber> <Description>North Star Polo</Description> <Unitssold>35500</Unitssold> </Design> </Salesperiod>
[Figure: the XML data drawn as an ordered tree (Salesperiod; Quarter, Design; Designnumber, Description, Unitssold; Winter1998, C365, North Star Polo, 35500), and an ordered term tree over <HTML>, <Head>, <Body>, <Title>, <Table> and Text_university.]
Preliminaries
Ordered trees express semi-structured data (HTML, XML, etc.).
HTML data: <HTML> <HEAD>text1</HEAD> <BODY> <DIV>text2</DIV> <FONT>text3</FONT> <FONT>text4</FONT> </BODY> </HTML>
[Figure: the HTML data drawn in the Object Exchange Model with TAG and TEXT vertices, and as an ordered tree with edge labels.]
Ordered Term Trees with Multi-Child Port Variables
(Ordered tree patterns with internal structured variables)
An ordered term tree t = (V, E, H)
• V: a vertex set
• E: an edge set
• H: a variable set
• x, y, ...: variable labels
• Single-child port variables: variables with exactly one child port
• Multi-child port variables: variables with at least one child port
A variable can be substituted with an arbitrary ordered tree.
[Figure: a term tree on vertices u1 to u8 with a variable h1 labeled x and a variable h2 labeled y, marking the parent port and the child port of h1 and the parent port and the child ports of h2.]
Substitutions
Replacing the variables of an ordered term tree t with an ordered tree T1 and an ordered tree T2 gives a new ordered tree T.
• Identify the root of T1 with the parent port, and identify the two leaves with the two child ports.
• Identify the root of T2 with the parent port, then choose one of the leaves of T2 and identify it with the child port.
[Figure: the ordered trees T1 and T2, the ordered term tree t with variables x and y, and the new ordered tree T obtained by the replacements.]
Linear Ordered Term Trees
All variables have mutually distinct variable labels, so all variable replacements are decided independently.
[Figure: a linear ordered term tree with variables x and y matches an ordered tree through a substitution.]
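The definitions above can be sketched as a small data structure. This is a minimal illustration, not the authors' implementation: the class names `Variable` and `TermTree`, the field names, and the example tree are all assumptions made here for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Variable:
    label: str          # variable label such as "x" or "y"
    parent_port: str    # vertex identified with the root of the substituted tree
    child_ports: list   # ordered vertices identified with leaves of that tree

    def is_single_child_port(self):
        return len(self.child_ports) == 1

@dataclass
class TermTree:
    vertices: list                                  # V
    edges: list                                     # E, as (parent, child) pairs
    variables: list = field(default_factory=list)   # H

    def is_linear(self):
        # Linear: all variables carry mutually distinct labels.
        labels = [h.label for h in self.variables]
        return len(labels) == len(set(labels))

# Hypothetical term tree; the shape is illustrative, not the one on the slide.
h1 = Variable("x", "u1", ["u2"])
h2 = Variable("y", "u3", ["u5", "u6"])
t = TermTree(["u1", "u2", "u3", "u4", "u5", "u6"],
             [("u1", "u3"), ("u2", "u4")],
             [h1, h2])
```

Here `t.is_linear()` holds because x and y are distinct labels, and `h1` is a single-child port variable while `h2` is not.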
Matching Problem for Linear Ordered Term Trees with Multi-Child Port Variables
INPUT: T: an ordered tree; t: a linear ordered term tree with multi-child port variables.
PROBLEM: Does t match T?
This matching problem can be solved in O(nN) time, where n is the number of vertices in t and N is the number of vertices in T [Suzuki et al., ILP 02].
Observation: Most ordered trees obtained from HTML files have low height.
An HTML file: <HTML> <HEAD>text1</HEAD> <BODY> <DIV>text2</DIV> <FONT>text3</FONT> <FONT>text4</FONT> </BODY> </HTML>
[Figure: the ordered tree of this HTML file, with its height marked.]
Relationship between the size of the tree representing an HTML file and its height
A tree of large height is rare, so a long branch, when present, is a distinctive feature.
[Figure: scatter plot of Height (0 to 40) against Size, the number of vertices in a tree (0 to 2000).]
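Height-Constrained Single-Child Port Variables
A height-constrained variable is labeled with a pair (i, j), where 0 < i ≤ j:
• i: the trunk length
• j: the height
Trunk length: the length of the path between the root and the leaf which are identified with the ports.
[Figure: two height-constrained variables labeled (i, j) and (i', j').]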
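The constraint can be sketched as a check on a candidate replacement tree. This sketch assumes the reading that the trunk length must be at least i (the theorem later calls i the lowest trunk length) and that the height must be at most j; the slide itself only names the two quantities. The function names and the dictionary encoding of a tree (vertex to ordered list of children) are illustrative choices.

```python
def height(children, root):
    """Number of edges on the longest downward path from root."""
    kids = children.get(root, [])
    return 0 if not kids else 1 + max(height(children, c) for c in kids)

def trunk_length(children, root, port_leaf):
    """Path length in edges from root to port_leaf, or None if unreachable."""
    if root == port_leaf:
        return 0
    for c in children.get(root, []):
        d = trunk_length(children, c, port_leaf)
        if d is not None:
            return d + 1
    return None

def satisfies(children, root, port_leaf, i, j):
    """Assumed reading: trunk length >= i and height <= j, port_leaf being a leaf."""
    trunk = trunk_length(children, root, port_leaf)
    is_leaf = not children.get(port_leaf)
    return trunk is not None and is_leaf and trunk >= i and height(children, root) <= j

# Hypothetical replacement tree: trunk r-a-b-c has length 3, height is 3.
T = {"r": ["a", "d"], "a": ["b"], "b": ["c"]}
```

Under these assumptions the tree `T` with port leaf `c` is accepted for a (2,4) variable and rejected for a (2,2) variable, because its height 3 exceeds 2.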
Example
[Figure: an ordered term tree t with variables labeled (2,4) and (2,2) and an ordered tree T; one candidate replacement is marked N.G. and another is marked O.K.]
MATCHING PROBLEM for Linear Ordered Term Trees with Height-Constrained Single-Child Port Variables
INPUT: a linear ordered term tree t and an ordered tree T.
PROBLEM: Does t match T?
[Figure: an example term tree t with variables labeled (4,6) and (1,2), and an ordered tree T.]
Main Theorem
• MATCHING PROBLEM for Linear Ordered Term Trees with Height-Constrained Single-Child Port Variables can be solved in O(N max{n Dmax, S}) time, where
n: the number of vertices of t,
N: the number of vertices of T,
S: the total amount of the lowest trunk lengths of all variables of t,
Dmax: the maximum number of children of a vertex of T.
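Sub Term Tree and Subtree
• t[u']: the sub term tree of t rooted at u'
• T[u]: the subtree of T rooted at u
• T[u] - T[v]: u and all descendants of u which are not proper descendants of v
[Figure: a linear ordered term tree t with variables labeled (1,1), (4,6) and (1,2) and a sub term tree t[u'], and an ordered tree T with T[u] - T[v] highlighted.]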
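The operation T[u] - T[v] can be sketched directly from its definition. The function names and the dictionary encoding of T (vertex to ordered list of children) are assumptions made for this illustration; v itself is kept, and only its proper descendants are dropped.

```python
def subtree_minus(children, u, v):
    """Vertices of T[u] - T[v]: u and every descendant of u
    that is not a proper descendant of v."""
    out = [u]
    if u != v:  # below v nothing is kept, but v itself stays
        for c in children.get(u, []):
            out.extend(subtree_minus(children, c, v))
    return out

def height_minus(children, u, v):
    """Height of T[u] - T[v], counted in edges."""
    if u == v:
        return 0
    kids = children.get(u, [])
    return 0 if not kids else 1 + max(height_minus(children, c, v) for c in kids)

# Hypothetical tree: u has children a and v; v has a chain w, x below it.
T = {"u": ["a", "v"], "a": ["b"], "v": ["w"], "w": ["x"]}
```

For this example tree, T[u] has height 3 through the chain under v, while T[u] - T[v] keeps only u, a, b and v and has height 2.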
Idea: Corresponding Sets CS(u)
t = (Vt, Et, Ht): a term tree, T = (VT, ET): a tree.
CS(u) ⊆ Vt × N × N: a corresponding set of a vertex u ∈ VT.
• (v', i, j) ∈ CS(u) shows that there is a descendant v of u such that
(1) t[v'] matches T[v],
(2) the length between u and v is i (if i < i'-1), and
(3) the height of T[u] - T[v] is j.
[Figure: in t, the vertex v' lies below a variable labeled (i', j'); in T, the vertex v lies below u at distance i, T[u] - T[v] has height j, and t[v'] matches T[v].]
Therefore, (v', 0, 0) ∈ CS(u) if and only if t[v'] matches T[u].
In particular, (the root of t, 0, 0) ∈ CS(the root of T) if and only if t matches T.
Algorithm Matching(t, T)
  Initialization;
  while there is an unmarked vertex u of T do
  begin
    Mark u;
    VID-Inheriting(u);
    C-Set-Attaching(u)
  end
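The control flow of the algorithm can be sketched as a bottom-up driver. This is only a skeleton under assumptions: the three steps are passed in as hypothetical callbacks (`initialize_leaf`, `vid_inheriting`, `c_set_attaching`), and the driver visits vertices in post-order so that every child is processed before its parent, which is one way to realize "from leaves to the root". The final test is the membership condition stated for the roots.

```python
def bottom_up_order(children, root):
    """Post-order: every vertex appears after all of its children."""
    order, stack = [], [(root, False)]
    while stack:
        v, expanded = stack.pop()
        if expanded:
            order.append(v)
        else:
            stack.append((v, True))
            for c in reversed(children.get(v, [])):
                stack.append((c, False))
    return order

def matching(children, root, t_root,
             initialize_leaf, vid_inheriting, c_set_attaching):
    """Driver only; the callbacks are hypothetical stand-ins for the three steps."""
    cs = {}
    for u in bottom_up_order(children, root):
        if not children.get(u):
            cs[u] = initialize_leaf(u)      # Initialization for leaves of T
        else:
            cs[u] = set()
            vid_inheriting(u, cs)           # adds inherited triples to cs[u]
            c_set_attaching(u, cs)          # adds (u', 0, 0) triples to cs[u]
    return (t_root, 0, 0) in cs[root]
```

Passing the steps as callbacks keeps the sketch independent of how each step is implemented.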
Initialization: Vertex Identifiers
The vertices of the linear ordered term tree t receive vertex identifiers in breadth-first search order.
The children of an internal vertex have consecutive vertex identifiers. This saves computation time in the main processes.
[Figure: a linear ordered term tree t with vertex identifiers 1 to 9 and variables labeled (1,2) and (2,2).]
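The numbering can be sketched with a standard breadth-first traversal. The function name and the example tree are illustrative assumptions; for a term tree, the child lists would include the child ports of variables as well as the children reached by edges.

```python
from collections import deque

def assign_vertex_ids(children, root):
    """Number vertices 1, 2, ... in breadth-first order.
    Siblings are enqueued together, so they get consecutive identifiers."""
    ids, queue = {}, deque([root])
    while queue:
        v = queue.popleft()
        ids[v] = len(ids) + 1
        queue.extend(children.get(v, []))
    return ids

# Hypothetical tree with 9 vertices.
t = {"r": ["a", "b"], "a": ["c", "d", "e"], "b": ["f"], "d": ["g", "h"]}
ids = assign_vertex_ids(t, "r")
```

Because the children of a vertex occupy one contiguous range of identifiers, that range can be stored as a pair (first, last) per internal vertex.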
Initialization
Compute the corresponding set of each vertex from leaves to the root.
Initialization: For all leaves u of T, Mark u; CS(u) := {(u', 0, 0) | u' is a leaf of t}; height(u) := 0;
Example: the leaves of t are 4, 6, 7, 8, 9, so for each leaf u of T (D, F, H, J, K, L, M, Q), CS(u) = {(4,0,0), (6,0,0), (7,0,0), (8,0,0), (9,0,0)} and height(u) = 0.
[Figure: the term tree t on vertices 1 to 9 with variables labeled (3,6) and (1,2), and the ordered tree T on vertices A to Q.]
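The initialization step can be sketched as follows. The function names and the dictionary encoding are assumptions; the term tree shape below is chosen only so that its leaves are 4, 6, 7, 8, 9 as in the example, and the tree T is a small hypothetical one rather than the tree on the slide.

```python
def leaves(children, root):
    """Leaves of the tree in left-to-right order."""
    out, stack = [], [root]
    while stack:
        v = stack.pop()
        kids = children.get(v, [])
        if kids:
            stack.extend(reversed(kids))
        else:
            out.append(v)
    return out

def initialize(T_children, T_root, t_children, t_root):
    """CS(u) := {(u', 0, 0) | u' is a leaf of t} and height(u) := 0 for leaves u of T."""
    t_leaves = leaves(t_children, t_root)
    cs, height = {}, {}
    for u in leaves(T_children, T_root):
        cs[u] = {(v, 0, 0) for v in t_leaves}
        height[u] = 0
    return cs, height

# Assumed shape of t (leaves 4, 6, 7, 8, 9) and a hypothetical T.
t = {1: [2, 3], 2: [4, 5, 6], 3: [7], 5: [8, 9]}
T = {"A": ["B", "C"], "B": ["D"]}
cs, height = initialize(T, "A", t, 1)
```

Every leaf of T starts with the same set, since any leaf of t trivially matches any leaf of T when vertex labels are ignored, as they are in this sketch.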
VID-Inheriting (1/3)
Let v' be the child port of an (i,j)-height-constrained variable. For an internal vertex u of a tree, if there is an element (v', i', j') in the CS of a child of u, add (v', min{i'+1, i-1}, *) to CS(u). The third component * is determined as shown on the next slide.
If i' = i-1 then the parent of u can match the parent port u'.
Example: a variable labeled (3,6) with parent port 3 and child port 7, and the vertices C, I, J, N, O, P, Q of T.
• (7,0,0) ∈ CS(Q)
• Add (7,1,1) to CS(P)
• Add (7,2,2) to CS(O)
• Add (7,2,3) to CS(N)
• Add (7,2,4) to CS(I)
• (7,0,0) ∈ CS(J)
N can become vertex 3.
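VID-Inheriting (2/3)
Case: At least two children have (v', i', *) for a vertex v' and an integer i'.
Choose the smallest height.
Example: a variable labeled (4,6) with parent port 3 and child port 7. The vertex a of T has children c and b with (7,1,3) ∈ CS(c), (7,1,1) ∈ CS(b), height(c) = 3 and height(b) = 4. The candidates are (7,2,4) and (7,2,5), and (7,2,4) ∈ CS(a).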
VID-Inheriting (3/3)
Case: A child has (v', i', *) and another child has (v', i'', *) for distinct integers i' and i''.
Add all triplets to CS(u) (at most i triplets).
Example: a variable labeled (4,6) with parent port 3 and child port 7. The vertex a of T has children c and b with (7,2,2) ∈ CS(c), (7,1,3) ∈ CS(b), height(c) = 3 and height(b) = 4. Then (7,2,4), (7,3,5) ∈ CS(a).
• CS(a) contains at most S triplets.
• Then the total time complexity of Inheriting of a vertex a is O(S m_a), where m_a is the number of the children of a.
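The three VID-Inheriting slides can be sketched as one function. The rule for the third component is not spelled out in the slides; this sketch assumes it is the height of T[u] - T[v], computed as the larger of j'+1 and one plus the height of the tallest sibling of the child the triple comes from, which reproduces the numbers in all three slide examples. The function name, the `min_trunk` map (child port to the i of its variable) and the dictionary encoding are illustrative.

```python
def vid_inheriting(u, children, cs, height, min_trunk):
    """Return the triples that u inherits from its children.
    min_trunk maps a child-port id v' to the trunk bound i of its variable."""
    kids = children[u]
    # Tallest child and second-tallest height, for the "tallest sibling" lookup.
    ranked = sorted(kids, key=lambda c: height[c], reverse=True)
    tallest = ranked[0]
    second = height[ranked[1]] if len(ranked) > 1 else -1
    best = {}  # (v', new_i) -> smallest new_j seen so far
    for c in kids:
        sibling = second if c == tallest else height[tallest]
        for (vp, i_old, j_old) in cs[c]:
            if vp not in min_trunk:          # only child ports are inherited
                continue
            new_i = min(i_old + 1, min_trunk[vp] - 1)
            new_j = max(j_old + 1, sibling + 1)   # assumed rule for "*"
            key = (vp, new_i)
            if key not in best or new_j < best[key]:
                best[key] = new_j            # slide 2/3: choose the smallest height
    return {(vp, i, j) for (vp, i), j in best.items()}

# Slide 2/3 data: variable (4,6) with child port 7.
children = {"a": ["c", "b"]}
height = {"c": 3, "b": 4}
case2 = vid_inheriting("a", children, {"c": {(7, 1, 3)}, "b": {(7, 1, 1)}},
                       height, {7: 4})
# Slide 3/3 data: distinct first components, so both triples are kept.
case3 = vid_inheriting("a", children, {"c": {(7, 2, 2)}, "b": {(7, 1, 3)}},
                       height, {7: 4})
```

With the slide data, `case2` contains only (7,2,4) and `case3` contains (7,2,4) and (7,3,5). Keying the candidates by (v', new_i) is what makes equal first components collapse to the smallest height while distinct ones are all kept.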
C-Set-Attaching (Small Examples)
• Example 1. In t, vertex 2 has children 4, 5, 6. In T, vertex B has children D, F, H. (4,0,0) ∈ CS(D), (5,0,0) ∈ CS(F), (6,0,0) ∈ CS(H). (2,0,0) should be added to CS(B).
• Example 2. In t, vertex 2 has children 4, 5, 6, with a variable labeled (1,2). In T, vertex B has children D, E, F, G, H. (4,0,0) ∈ CS(D), (5,0,0) ∈ CS(G), (6,0,0) ∈ CS(H), height(E) = 1, height(F) = 2. (5,0,0) ∈ CS(G) covers [E,G]. (2,0,0) is added to CS(B).
• Example 3. Same t and T. (4,0,0) ∈ CS(D), (5,1,1) ∈ CS(F), (6,0,0) ∈ CS(H), height(E) = 1, height(G) = 2. (5,1,1) ∈ CS(F) covers [E,G]. (2,0,0) is added to CS(B).
• Example 4. Same t and T. (4,0,0) ∈ CS(D), (5,1,1) ∈ CS(F), (6,0,0) ∈ CS(H), height(E) = 3, height(G) = 2. (5,1,1) ∈ CS(F) covers [F,G] but cannot cover E. (2,0,0) may not be added to CS(B).
C-Set-Attaching (A Big Example)
[Figure: an ordered term tree t on vertices 1 to 11 with variables labeled (4,8), (3,4), (4,7) and (5,5).]
[Figure: an ordered tree T with root O and vertices A to N, each vertex annotated with its corresponding set and its height.]
Heights: height(A) = 9, height(B) = 5, height(C) = 4, height(D) = 5, height(E) = 3, height(F) = 2, height(G) = 5, height(H) = 6, height(I) = 5, height(J) = 7, height(K) = 1, height(L) = 9, height(M) = 4, height(N) = 4.
Corresponding sets include CS(F) = {(1,0,0), (4,0,0), (7,2,3)}, (8,4,4) ∈ CS(G), (8,4,4) ∈ CS(H) and CS(K) = φ.
First, we prepare a virtual table for a new graph. Rows and columns represent vertices of T and t, respectively.
• (7,2,3) ∈ CS(F) covers [E,F]. Add a vertex labeled with [E,F] to F7 in the table.
• (8,4,4) ∈ CS(G) covers [E,G]. Add a vertex labeled with [E,G] to G8 in the table.
• (8,4,4) ∈ CS(H) covers [H,H]. Add a vertex labeled with [H,H] to H8 in the table.
• Add a directed edge from [E,F] at F7 to [E,G] at G8, because two consecutive variables cover all vertices from E to G.
[Figure: the part of t below vertex 11 with variables labeled (3,4) and (5,5) and vertices 7 and 8, and the children D, E, F, G, H, I of O in T, with height(D) = 5, height(E) = 3, height(F) = 2, height(G) = 5, height(H) = 6, height(I) = 5.]
• If there is a directed path from vstart to vgoal, (11,0,0) is added to CS(O).
• The total time complexity of C-Set-Attaching of a vertex u of T and a vertex u' of t is O(m_u^2 m'_u'), where m_u and m'_u' are the numbers of the children of u and u', respectively.
[Figure: the completed table as a directed graph from vstart to vgoal, with vertices labeled by intervals such as [B,K], [J,K], [K,N], [M,N], [E,F], [E,G] and [H,H].]
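The path search on the table can be sketched as a column-by-column reachability pass. This is one reading of the slides, not a confirmed reconstruction: each table cell is taken to be a triple (row, lo, hi), where row is the index of a child of u and [lo, hi] is the interval of children it covers; an edge is assumed to join cells in consecutive columns when the second row lies to the right of the first and the two intervals leave no child uncovered between them. How the intervals are derived from the corresponding sets and heights is left out. The function name and encoding are illustrative.

```python
def has_covering_path(num_rows, columns):
    """columns[k] lists the cells (row, lo, hi) for the k-th child of u'.
    Returns True if a path from vstart to vgoal exists, i.e. one cell per
    column can be chosen so that all rows 0..num_rows-1 are covered in order."""
    if not columns:
        return num_rows == 0
    # vstart connects to first-column cells whose interval starts at the first child.
    reachable = [(row, hi) for (row, lo, hi) in columns[0] if lo == 0]
    for col in columns[1:]:
        nxt = []
        for (row, lo, hi) in col:
            # Edge condition (assumed): strictly further right, no gap between intervals.
            if any(p_row < row and p_hi + 1 >= lo for p_row, p_hi in reachable):
                nxt.append((row, hi))
        reachable = nxt
    # vgoal is reached from last-column cells whose interval ends at the last child.
    return any(hi == num_rows - 1 for _, hi in reachable)

# Small Example 1: children D, F, H matched one to one.
ex1 = has_covering_path(3, [[(0, 0, 0)], [(1, 1, 1)], [(2, 2, 2)]])
# Small Example 2: rows D..H = 0..4, cell at G covering [E,G].
ex2 = has_covering_path(5, [[(0, 0, 0)], [(3, 1, 3)], [(4, 4, 4)]])
# Small Example 4: cell at F covering only [F,G], so E stays uncovered.
ex4 = has_covering_path(5, [[(0, 0, 0)], [(2, 2, 3)], [(4, 4, 4)]])
```

Under this reading `ex1` and `ex2` succeed and `ex4` fails, in line with the small examples. Each pair of consecutive columns costs at most m_u^2 comparisons, which is consistent with the stated O(m_u^2 m'_u') bound.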
Total Time Complexity
• VID-Inheriting(u): O(S m_u)
• C-Set-Attaching(u): O(m_u^2 m'_u')
m_u: the number of children of a vertex u of T,
m'_u': the number of children of a vertex u' of t.
• Total: O(N max{n Dmax, S})
n: the number of vertices of t,
N: the number of vertices of T,
S: the total amount of the lowest trunk lengths of all variables of t,
Dmax: the maximum number of children of a vertex of T.
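One way to see how the per-vertex bounds combine into the total is sketched below. The derivation is an illustration added here, not taken from the slides; it uses only the facts that the child counts of a tree sum to one less than its number of vertices and that every m_u is at most Dmax.

```latex
\begin{align*}
\sum_{u \in V_T} O(S\, m_u)
  &= O\Bigl(S \sum_{u \in V_T} m_u\Bigr) = O(SN),
  && \text{since } \sum_{u \in V_T} m_u = N - 1, \\
\sum_{u \in V_T} \sum_{u' \in V_t} O\bigl(m_u^2\, m'_{u'}\bigr)
  &= O\Bigl(D_{\max} \sum_{u \in V_T} m_u \sum_{u' \in V_t} m'_{u'}\Bigr)
   = O(nN D_{\max}),
  && \text{since } m_u \le D_{\max},\ \sum_{u' \in V_t} m'_{u'} \le n - 1, \\
O(SN) + O(nN D_{\max})
  &= O\bigl(N \max\{n D_{\max},\, S\}\bigr).
\end{align*}
```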
Conclusions
• An O(N max{n Dmax, S}) Time Matching Algorithm for Ordered Term Trees with Height-Constrained Variables.
• [Our Related Works] Polynomial-Time Learning Algorithms for Ordered Term Trees with Height-Constrained Variables [Suzuki et al., PRICAI'04], [Matsumoto and Shoudai, ALT'04].
Future Works:
• An Efficient Matching Algorithm for Ordered Term Trees with Height-Constrained Multi-Child Port Variables.
• Polynomial-Time Learning Algorithms for Ordered Term Trees with Height-Constrained Multi-Child Port Variables.