Computing Longest Common Substring/Subsequence of Non-linear Texts. Kouji Shimohira, Shunsuke Inenaga, Hideo Bannai, Masayuki Takeda, Kyushu University, Japan. Outline. Non-linear Text Computing Longest Common Subsequence of Acyclic Non-linear Texts
Computing Longest Common Substring/Subsequence of Non-linear Texts
Kouji Shimohira, Shunsuke Inenaga, Hideo Bannai, Masayuki Takeda
Kyushu University, Japan
Outline Non-linear Text Computing Longest Common Subsequence of Acyclic Non-linear Texts Computing Longest Common Subsequence of Cyclic Non-linear Texts Conclusions and Future works
Non-linear Text
Non-linear text G = (V, E, L): a directed graph with vertices labeled by characters.
V : the set of vertices.
E : the set of arcs.
L : V → Σ : a labeling function.
e.g.
V = {v1, v2, v3, v4, v5, v6, v7, v8, v9}
E = {(v1,v2), (v2,v3), (v2,v7), (v3,v4), (v4,v5), (v5,v6), (v6,v7), (v6,v8), (v7,v8), (v8,v9)}
L = {v1 → a, v2 → b, v3 → c, v4 → a, v5 → a, v6 → b, v7 → b, v8 → a, v9 → c}
[Figure: the graph G of this example]
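The example above can be written down directly as data. The following is a minimal sketch, assuming a plain dictionary and list representation; the helper names `predecessors` and `spell` are hypothetical and do not come from the slides.

```python
# Vertices, arcs and labels copied from the example above.
V = ["v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "v9"]
E = [("v1", "v2"), ("v2", "v3"), ("v2", "v7"), ("v3", "v4"), ("v4", "v5"),
     ("v5", "v6"), ("v6", "v7"), ("v6", "v8"), ("v7", "v8"), ("v8", "v9")]
L = {"v1": "a", "v2": "b", "v3": "c", "v4": "a", "v5": "a",
     "v6": "b", "v7": "b", "v8": "a", "v9": "c"}


def predecessors(vertices, arcs):
    """Map each vertex to the list of vertices that have an arc into it."""
    preds = {v: [] for v in vertices}
    for u, v in arcs:
        preds[v].append(u)
    return preds


def spell(path, labels):
    """Return the string spelled by a sequence of vertices."""
    return "".join(labels[v] for v in path)


PREDS = predecessors(V, E)
print(PREDS["v7"])                  # the two vertices with an arc into v7
print(spell(["v1", "v2", "v7"], L))
```

Storing incoming arcs per vertex is a convenient choice here because the dynamic programming later in the deck looks up the vertices that precede a given vertex.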
Non-linear Text
L(v) : the character label of vertex v
P(v) : the set of paths that end at vertex v
L(P(v)) : the set of strings spelled by paths in P(v)
P(G) : the set of paths in G (the union of P(v) over all v ∈ V)
L(P(G)) : the set of strings spelled by paths in P(G) (= L(G))
substr(L(G)) : the set of substrings of strings in L(G)
subseq(L(G)) : the set of subsequences of strings in L(G)
e.g.
L(v7) = b
P(v7) = {v1v2v3v4v5v6v7, v1v2v7}
L(P(v7)) = {abcaabb, abb}
bcaa ∈ substr(L(G))
aca ∈ subseq(L(G))
[Figure: the graph G of the previous example]
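Membership in subseq(L(G)) can be tested without enumerating paths. The sketch below is one possible way to do it for an acyclic text, not a method taken from the slides: for every vertex it records the longest prefix of the query that is a subsequence of some path ending there. The function name `in_subseq` is hypothetical, and `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter

# The example text G from the slides.
V = ["v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "v9"]
E = [("v1", "v2"), ("v2", "v3"), ("v2", "v7"), ("v3", "v4"), ("v4", "v5"),
     ("v5", "v6"), ("v6", "v7"), ("v6", "v8"), ("v7", "v8"), ("v8", "v9")]
L = {"v1": "a", "v2": "b", "v3": "c", "v4": "a", "v5": "a",
     "v6": "b", "v7": "b", "v8": "a", "v9": "c"}


def in_subseq(s, vertices, arcs, labels):
    """Return True if s is a subsequence of a string spelled by some path.

    Assumes the text is acyclic. matched[v] is the length of the longest
    prefix of s that is a subsequence of a path ending at v.
    """
    preds = {v: [] for v in vertices}
    for u, v in arcs:
        preds[v].append(u)
    matched = {}
    # TopologicalSorter takes a mapping from each node to its predecessors.
    for v in TopologicalSorter(preds).static_order():
        g = max((matched[u] for u in preds[v]), default=0)
        if g < len(s) and labels[v] == s[g]:
            g += 1  # v extends the matched prefix by one character
        matched[v] = g
    return max(matched.values(), default=0) == len(s)


print(in_subseq("aca", V, E, L))  # True, as stated on the slide
print(in_subseq("ccc", V, E, L))  # False: no path visits three c's
```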
Algorithms on Non-linear Text |E1| : the number of arcs in text 1, |E2| : the number of arcs in text 2, |Σ| : the alphabet size, |V1| : the number of vertices in text 1, |V2| : the number of vertices in text 2.
Outline Non-linear Text Computing Longest Common Subsequence of Acyclic Non-linear Texts Computing Longest Common Subsequence of Cyclic Non-linear Texts Conclusions and Future works
Longest Common Subsequence Problem for Acyclic Non-linear Texts
Problem 1
Input : Acyclic non-linear texts G1 = (V1, E1, L1) and G2 = (V2, E2, L2)
Output : Length of a longest string in subseq(G1) ∩ subseq(G2)
e.g.
subseq(G1) = {a, b, c, d, ab, ac, ad, bb, bc, bd, cb, cd, db, abb, abc, abd, acb, acd, adb, bcb, bcd, bdb, cdb, abcb, abcd, abdb, acdb, bcdb, abcdb}
subseq(G2) = {a, b, c, d, ab, ac, ad, ba, ca, cb, cd, da, db, aba, aca, acb, acd, ada, adb, cba, cda, cdb, dba, acba, acda, acdb, adba, cdba, acdba}
Output = 4 (the longest common string is acdb)
[Figure: the example texts G1 and G2]
Algorithm 1 : Computing the length of a longest common subsequence of acyclic non-linear texts
Sort vertices of G1 and G2 in topological order.
Let C be a |V1|×|V2| integer table (Ci,j : the length of a longest string in subseq(P(v1,i)) ∩ subseq(P(v2,j))).
Compute Ci,j using dynamic programming for all 1 ≦ i ≦ |V1| and 1 ≦ j ≦ |V2|.
Return max Ci,j.
Topological Sort
Sort vertices of G1 and G2 in topological order.
[Figure: G1 and G2 before and after sorting, with vertices renumbered v1,1, ..., v1,6 and v2,1, ..., v2,6]
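The sorting step can be sketched with Kahn's algorithm, which runs in time linear in the number of vertices and arcs. The slides do not say which sorting method is used, so this choice is an assumption; the function name `topological_order` and the small example graph are hypothetical.

```python
from collections import deque


def topological_order(vertices, arcs):
    """Return the vertices in an order where every arc goes left to right.

    Raises ValueError if the graph contains a cycle.
    """
    succs = {v: [] for v in vertices}
    indegree = {v: 0 for v in vertices}
    for u, v in arcs:
        succs[u].append(v)
        indegree[v] += 1

    # Start from vertices with no incoming arcs.
    queue = deque(v for v in vertices if indegree[v] == 0)
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in succs[u]:
            indegree[v] -= 1
            if indegree[v] == 0:
                queue.append(v)

    if len(order) != len(vertices):
        raise ValueError("graph has a cycle")
    return order


# Hypothetical acyclic graph, not one of the graphs in the figures.
example_vertices = ["p", "q", "r", "s"]
example_arcs = [("r", "q"), ("p", "r"), ("p", "q"), ("q", "s")]
print(topological_order(example_vertices, example_arcs))
```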
Dynamic Programming Table
Let C be a |V1|×|V2| integer table (Ci,j : the length of a longest string in subseq(P(v1,i)) ∩ subseq(P(v2,j))).
[Figure: the empty table C, indexed by the vertices of G1 and G2]
Case L1(v1,i) ≠ L2(v2,j)
Compute Ci,j using dynamic programming for all 1 ≦ i ≦ |V1| and 1 ≦ j ≦ |V2|.
If L1(v1,i) ≠ L2(v2,j) then
Ci,j = max({Ck,j | (v1,k, v1,i) ∈ E1} ∪ {Ci,ℓ | (v2,ℓ, v2,j) ∈ E2} ∪ {0})
[Figure: table C for G1 and G2, illustrating this case]
Case L1(v1,i) = L2(v2,j)
Compute Ci,j using dynamic programming for all 1 ≦ i ≦ |V1| and 1 ≦ j ≦ |V2|.
If L1(v1,i) = L2(v2,j) then
Ci,j = 1 + max({Ck,ℓ | (v1,k, v1,i) ∈ E1, (v2,ℓ, v2,j) ∈ E2} ∪ {0})
[Figure: table C for G1 and G2, illustrating this case]
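The two cases of the recurrence can be put together as follows. This is a minimal sketch of Algorithm 1 under assumptions: the table C is a dictionary keyed by vertex pairs, and the names `lcs_length_acyclic` and `chain` are hypothetical. The arcs of G1 and G2 cannot be recovered from the figures, so the usage example runs on hypothetical texts instead.

```python
from graphlib import TopologicalSorter


def lcs_length_acyclic(labels1, arcs1, labels2, arcs2):
    """Length of a longest common subsequence of two acyclic texts."""

    def prepare(labels, arcs):
        preds = {v: [] for v in labels}
        for u, v in arcs:
            preds[v].append(u)
        # TopologicalSorter takes a mapping from node to predecessors.
        order = list(TopologicalSorter(preds).static_order())
        return preds, order

    preds1, order1 = prepare(labels1, arcs1)
    preds2, order2 = prepare(labels2, arcs2)

    C = {}
    best = 0
    for i in order1:
        for j in order2:
            if labels1[i] == labels2[j]:
                # Equal labels: extend the best pair of predecessors.
                c = 1 + max((C[k, l] for k in preds1[i] for l in preds2[j]),
                            default=0)
            else:
                # Different labels: step back along one text only.
                c = max([C[k, j] for k in preds1[i]] +
                        [C[i, l] for l in preds2[j]], default=0)
            C[i, j] = c
            best = max(best, c)
    return best


def chain(s):
    """Hypothetical helper: a linear text spelling the string s."""
    labels = dict(enumerate(s))
    arcs = [(n, n + 1) for n in range(len(s) - 1)]
    return labels, arcs


# On linear texts the result is the classical LCS length.
print(lcs_length_acyclic(*chain("abcdb"), *chain("acdba")))  # 4 (acdb)

# A branching text spelling "ab" and "ac", compared with the chain "abc".
branch_labels = {0: "a", 1: "b", 2: "c"}
branch_arcs = [(0, 1), (0, 2)]
print(lcs_length_acyclic(branch_labels, branch_arcs, *chain("abc")))  # 2
```

The branching example returns 2 rather than 3 because no single path of the first text spells abc.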
Output
Return max Ci,j.
max Ci,j = 4
Output : 4
[Figure: the completed table C for G1 and G2]
Time Complexity
Sort vertices of G1 and G2 in topological order : linear time
Let C be a |V1|×|V2| integer table (Ci,j : the length of a longest string in subseq(P(v1,i)) ∩ subseq(P(v2,j))).
Compute Ci,j using dynamic programming for all 1 ≦ i ≦ |V1| and 1 ≦ j ≦ |V2| : O(|E1||E2|) time
Return max Ci,j
Time Complexity
Compute Ci,j using dynamic programming for all 1 ≦ i ≦ |V1| and 1 ≦ j ≦ |V2|.
Case L1(v1,i) ≠ L2(v2,j)
ei : the number of arcs incoming to v1,i
fj : the number of arcs incoming to v2,j
To compute the value of Ci,j, ei + fj elements of table C are used.
To compute Ci,j for all 1 ≦ i ≦ |V1| and 1 ≦ j ≦ |V2|, the number of elements used is the sum of (ei + fj) over all i and j, which is |V2||E1| + |V1||E2| = O(|E1||E2|) (assuming |V1| = O(|E1|) and |V2| = O(|E2|)).
[Figure: table C for G1 and G2, illustrating this case]
Time Complexity
Compute Ci,j using dynamic programming for all 1 ≦ i ≦ |V1| and 1 ≦ j ≦ |V2|.
Case L1(v1,i) = L2(v2,j)
ei : the number of arcs incoming to v1,i
fj : the number of arcs incoming to v2,j
To compute the value of Ci,j, ei·fj elements of table C are used.
To compute Ci,j for all 1 ≦ i ≦ |V1| and 1 ≦ j ≦ |V2|, the number of elements used is the sum of ei·fj over all i and j, which is (sum of ei)·(sum of fj) = |E1||E2| = O(|E1||E2|).
[Figure: table C for G1 and G2, illustrating this case]
Time Complexity
Sort vertices of G1 and G2 in topological order : linear time
Let C be a |V1|×|V2| integer table (Ci,j : the length of a longest string in subseq(P(v1,i)) ∩ subseq(P(v2,j))).
Compute Ci,j using dynamic programming for all 1 ≦ i ≦ |V1| and 1 ≦ j ≦ |V2| : O(|E1||E2|) time
Return max Ci,j : O(|V1||V2|) time
The total time complexity is O(|E1||E2|).
Outline Non-linear Text Computing Longest Common Subsequence of Acyclic Non-linear Texts Computing Longest Common Subsequence of Cyclic Non-linear Texts Conclusions and Future works
Longest Common Subsequence Problem for Cyclic Non-linear Texts
Problem 2
Input : Non-linear texts G1 = (V1, E1, L1) and G2 = (V2, E2, L2)
Output : ∞ (if subseq(G1) ∩ subseq(G2) is infinite)
The length of a longest string in subseq(G1) ∩ subseq(G2) (otherwise)
e.g. 1
Character “d” is in a cycle in both G1 and G2.
ccdccdccd······ ∈ L(G1)
bdbdbdbd······ ∈ L(G2)
dddddd········ ∈ subseq(G1) ∩ subseq(G2)
Output = ∞
[Figure: the cyclic texts G1 and G2 of example 1]
Longest Common Subsequence Problem for Cyclic Non-linear Texts
Problem 2
Input : Non-linear texts G1 = (V1, E1, L1) and G2 = (V2, E2, L2)
Output : ∞ (if subseq(G1) ∩ subseq(G2) is infinite)
The length of a longest string in subseq(G1) ∩ subseq(G2) (otherwise)
e.g. 2
subseq(G1) ∩ subseq(G2) = {a, b, c, d, aa, ab, ac, ad, cd, aab, aac, aad, acd, aacd}
Output = 4
[Figure: the cyclic texts G1 and G2 of example 2]
Algorithm 2 : Computing the length of a longest common subsequence of cyclic non-linear texts
Transform G1 and G2 into G’1 and G’2 based on the strongly connected components.
Check whether subseq(G1) ∩ subseq(G2) is infinite or not.
Sort vertices of G’1 and G’2 in topological order.
Let C be a |V’1|×|V’2| integer table (Ci,j : the length of a longest string in subseq(P(v’1,i)) ∩ subseq(P(v’2,j))).
Compute Ci,j using dynamic programming for all 1 ≦ i ≦ |V’1| and 1 ≦ j ≦ |V’2|.
Return max Ci,j.
Strongly Connected Component
Transform G1 and G2 into G’1 and G’2 based on the strongly connected components.
Each strongly connected component becomes a single vertex labeled by the set of characters it contains, which turns the cyclic non-linear texts into acyclic non-linear texts.
[Figure: G1 and G2 transformed into G’1 and G’2, whose vertices carry label sets such as {a}, {c}, {d}, {b}, {a,b} and {c,d}]
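The transformation can be sketched with Tarjan's algorithm for strongly connected components, which runs in linear time. The slides do not name a specific method, so this choice is an assumption, as are the function name `condense` and the example graph. The recursive version is only suitable for small graphs.

```python
def condense(labels, arcs):
    """Contract each strongly connected component into one vertex.

    Returns (label_sets, cond_arcs, comp_of): label_sets[c] is the set of
    characters in component c, cond_arcs is the set of arcs between
    components (an arc inside a component becomes a self-loop), and
    comp_of maps each original vertex to its component.
    """
    succs = {v: [] for v in labels}
    for u, v in arcs:
        succs[u].append(v)

    index, low, on_stack, stack = {}, {}, set(), []
    comp_of, label_sets = {}, []

    def visit(v):
        index[v] = low[v] = len(index)
        stack.append(v)
        on_stack.add(v)
        for w in succs[v]:
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:  # v is the root of a component
            members = set()
            while True:
                w = stack.pop()
                on_stack.discard(w)
                comp_of[w] = len(label_sets)
                members.add(labels[w])
                if w == v:
                    break
            label_sets.append(members)

    for v in labels:
        if v not in index:
            visit(v)

    cond_arcs = {(comp_of[u], comp_of[v]) for u, v in arcs}
    return label_sets, cond_arcs, comp_of


# Hypothetical cyclic text: vertices 2 and 3 form a cycle.
example_labels = {1: "a", 2: "c", 3: "d", 4: "b"}
example_arcs = [(1, 2), (2, 3), (3, 2), (3, 4)]
sets, carcs, comp = condense(example_labels, example_arcs)
print(sets[comp[2]])                # the label set of the cycle's component
print((comp[2], comp[2]) in carcs)  # True: the cycle became a self-loop
```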
Check whether output is infinity or not
Check whether subseq(G1) ∩ subseq(G2) is infinite or not.
S1, S2 : the union of the sets of labels of vertices that have a self-loop in G’1, G’2
Case S1 ∩ S2 ≠ Ø
Let c be any character in S1 ∩ S2. An infinite repetition c* of c is a common subsequence of G1 and G2. Hence, output = ∞.
Case S1 ∩ S2 = Ø
subseq(G1) ∩ subseq(G2) is finite.
e.g.
S1 = {c,d}
S2 = {a} ∪ {a,b} = {a,b}
S1 ∩ S2 = {c,d} ∩ {a,b} = Ø
[Figure: G’1 and G’2 with the self-loop vertices marked]
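The check above amounts to one set intersection. The sketch below assumes that each transformed text is given as a list of label sets plus a set of arcs between their indices, with self-loops kept. This representation and the function names are hypothetical, and the arcs in the example are invented to reproduce S1 = {c,d} and S2 = {a,b}.

```python
def self_loop_labels(label_sets, cond_arcs):
    """Union of the label sets of vertices that have a self-loop."""
    result = set()
    for u, v in cond_arcs:
        if u == v:
            result |= label_sets[u]
    return result


def has_infinite_common_subsequence(text1, text2):
    """True if some character lies on a cycle in both texts."""
    s1 = self_loop_labels(*text1)
    s2 = self_loop_labels(*text2)
    return bool(s1 & s2)


# Hypothetical transformed texts with S1 = {c,d} and S2 = {a} ∪ {a,b}.
t1 = ([{"a"}, {"c", "d"}], {(0, 1), (1, 1)})
t2 = ([{"a"}, {"a", "b"}], {(0, 0), (0, 1), (1, 1)})
print(has_infinite_common_subsequence(t1, t2))  # False: S1 ∩ S2 is empty

# Hypothetical text in which d is on a cycle, as in example 1.
t3 = ([{"b", "d"}], {(0, 0)})
print(has_infinite_common_subsequence(t1, t3))  # True: d is in both
```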
Algorithm 2
Sort vertices of G’1 and G’2 in topological order.
[Figure: G’1 and G’2 before and after sorting, with vertices renumbered v’1,1, ..., v’1,4 and v’2,1, ..., v’2,4]
Algorithm 2
Let C be a |V’1|×|V’2| integer table (Ci,j : the length of a longest string in subseq(P(v’1,i)) ∩ subseq(P(v’2,j))).
[Figure: the empty table C, indexed by the vertices of G’1 and G’2]
Algorithm 2
Compute Ci,j using dynamic programming for all 1 ≦ i ≦ |V’1| and 1 ≦ j ≦ |V’2|.
If L’1(v’1,i) ∩ L’2(v’2,j) ≠ Ø then
Ci,j = 1 + max({Ck,ℓ | (v’1,k, v’1,i) ∈ E’1, (v’2,ℓ, v’2,j) ∈ E’2} ∪ {0})
[Figure: table C for G’1 and G’2, illustrating this case]
Algorithm 2
Compute Ci,j using dynamic programming for all 1 ≦ i ≦ |V’1| and 1 ≦ j ≦ |V’2|.
If L’1(v’1,i) ∩ L’2(v’2,j) = Ø then
Ci,j = max({Ck,j | (v’1,k, v’1,i) ∈ E’1} ∪ {Ci,ℓ | (v’2,ℓ, v’2,j) ∈ E’2} ∪ {0})
[Figure: table C for G’1 and G’2, illustrating this case]
Output
Return max Ci,j.
[Figure: the completed table C for G’1 and G’2]
Time Complexity
Transform G1 and G2 into G’1 and G’2 based on the strongly connected components : linear time
Check whether subseq(G1) ∩ subseq(G2) is infinite or not : O(|Σ| log |Σ|) time
Sort vertices of G’1 and G’2 in topological order : linear time
Let C be a |V’1|×|V’2| integer table (Ci,j : the length of a longest string in subseq(P(v’1,i)) ∩ subseq(P(v’2,j))).
Compute Ci,j using dynamic programming for all 1 ≦ i ≦ |V’1| and 1 ≦ j ≦ |V’2| : O(|E’1||E’2| + |V’1||V’2| log |Σ|) time
(the first term is for computing Ci,j, the second for comparing L’1(v’1,i) and L’2(v’2,j))
Return max Ci,j
Time Complexity
Transform G1 and G2 into G’1 and G’2 based on the strongly connected components : linear time
Check whether subseq(G1) ∩ subseq(G2) is infinite or not : O(|Σ| log |Σ|) time
Sort vertices of G’1 and G’2 in topological order : linear time
Let C be a |V’1|×|V’2| integer table (Ci,j : the length of a longest string in subseq(P(v’1,i)) ∩ subseq(P(v’2,j))).
Compute Ci,j using dynamic programming for all 1 ≦ i ≦ |V’1| and 1 ≦ j ≦ |V’2| : O(|E’1||E’2| + |V’1||V’2| log |Σ|) time
Return max Ci,j : O(|V’1||V’2|) time
The total time complexity is O(|E1||E2| + |V1||V2| log |Σ|).
Conclusions and Future works
Future works
・Longest Common Substring Problem on cyclic non-linear texts
・The case where the number of times a vertex can be used is bounded
・Pattern matching with non-linear patterns