
Computing Longest Common Substring/Subsequence of Non-linear Texts

Kouji Shimohira, Shunsuke Inenaga, Hideo Bannai, Masayuki Takeda. Kyushu University, Japan.


Presentation Transcript


  1. Computing Longest Common Substring/Subsequence of Non-linear Texts. Kouji Shimohira, Shunsuke Inenaga, Hideo Bannai, Masayuki Takeda. Kyushu University, Japan

  2. Outline: Non-linear Text; Computing Longest Common Subsequence of Acyclic Non-linear Texts; Computing Longest Common Subsequence of Cyclic Non-linear Texts; Conclusions and Future works

  3. Outline: Non-linear Text; Computing Longest Common Subsequence of Acyclic Non-linear Texts; Computing Longest Common Subsequence of Cyclic Non-linear Texts; Conclusions and Future works

  4. Non-linear Text
     A non-linear text G = (V, E, L) is a directed graph with vertices labeled by characters.
     V : the set of vertices.
     E : the set of arcs.
     L : V → Σ : a labeling function.
     e.g.
     V = {v1, v2, v3, v4, v5, v6, v7, v8, v9}
     E = {(v1,v2), (v2,v3), (v2,v7), (v3,v4), (v4,v5), (v5,v6), (v6,v7), (v6,v8), (v7,v8), (v8,v9)}
     L = {v1 → a, v2 → b, v3 → c, v4 → a, v5 → a, v6 → b, v7 → b, v8 → a, v9 → c}
     [Figure: the graph G, drawn with vertices a1, b2, c3, a4, a5, b6, b7, a8, c9]

  5. Non-linear Text
     L(v) : the character label of vertex v.
     P(v) : the set of paths that end at vertex v.
     L(P(v)) : the set of strings spelled by paths in P(v).
     P(G) : the set of paths in G, i.e. the union of P(v) over all v ∈ V.
     L(P(G)) : the set of strings spelled by paths in P(G) (= L(G)).
     substr(L(G)) : the set of substrings of strings in L(G).
     subseq(L(G)) : the set of subsequences of strings in L(G).
     e.g. (for the example graph with vertices a1, b2, c3, a4, a5, b6, b7, a8, c9)
     L(v7) = b
     P(v7) = {v1v2v3v4v5v6v7, v1v2v7}
     L(P(v7)) = {abcaabb, abb}
     bcaa ∈ substr(L(G))
     aca ∈ subseq(L(G))
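
The notation on this slide can be checked mechanically on the example graph. The following is a minimal sketch, not the authors' code: the helper names (`paths_from_to`, `spell`, `is_subsequence`) are made up for illustration, and it assumes, as the slide's example appears to, that the listed paths start at the source vertex v1.

```python
# Example non-linear text from the slides; the integers 1..9 stand for v1..v9.
LABEL = {1: "a", 2: "b", 3: "c", 4: "a", 5: "a", 6: "b", 7: "b", 8: "a", 9: "c"}
ARCS = [(1, 2), (2, 3), (2, 7), (3, 4), (4, 5), (5, 6),
        (6, 7), (6, 8), (7, 8), (8, 9)]


def paths_from_to(source, target, arcs):
    """Return every path (tuple of vertex ids) from source to target in a DAG."""
    successors = {}
    for u, v in arcs:
        successors.setdefault(u, []).append(v)
    found = []

    def walk(path):
        if path[-1] == target:
            found.append(tuple(path))
            return  # in a DAG no path can leave the target and come back
        for nxt in successors.get(path[-1], []):
            walk(path + [nxt])

    walk([source])
    return found


def spell(path, label):
    """Concatenate the labels along a path."""
    return "".join(label[v] for v in path)


def is_subsequence(s, t):
    """True if s can be obtained from t by deleting characters."""
    remaining = iter(t)
    return all(ch in remaining for ch in s)


strings_v7 = {spell(p, LABEL) for p in paths_from_to(1, 7, ARCS)}
strings_v9 = {spell(p, LABEL) for p in paths_from_to(1, 9, ARCS)}
print(sorted(strings_v7))
print(any("bcaa" in s for s in strings_v9))
print(any(is_subsequence("aca", s) for s in strings_v9))
```

Enumerating paths explicitly is exponential in general; it is used here only to make the definitions concrete on a nine-vertex example.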

  6. Algorithms on Non-linear Text |E1| : the number of arcs in text 1, |E2| : the number of arcs in text 2, |Σ| : the alphabet size, |V1| : the number of vertices in text 1, |V2| : the number of vertices in text 2.

  7. Outline: Non-linear Text; Computing Longest Common Subsequence of Acyclic Non-linear Texts; Computing Longest Common Subsequence of Cyclic Non-linear Texts; Conclusions and Future works

  8. Longest Common Subsequence Problem for Acyclic Non-linear Texts
     Problem 1
     Input : Acyclic non-linear texts G1 = (V1, E1, L1) and G2 = (V2, E2, L2)
     Output : Length of a longest string in subseq(G1) ∩ subseq(G2)
     e.g.
     subseq(G1) = { a, b, c, d, ab, ac, ad, bb, bc, bd, cb, cd, db, abb, abc, abd, acb, acd, adb, bcb, bcd, bdb, cdb, abcb, abcd, abdb, acdb, bcdb, abcdb }
     subseq(G2) = { a, b, c, d, ab, ac, ad, ba, ca, cb, cd, da, db, aba, aca, acb, acd, ada, adb, cba, cda, cdb, dba, acba, acda, acdb, adba, cdba, acdba }
     Output = 4
     [Figure: the example texts G1 and G2]

  9. Algorithm 1
     Algorithm 1 : Computing the length of a longest common subsequence of acyclic non-linear texts
     Step 1 : Sort the vertices of G1 and G2 in topological order.
     Step 2 : Let C be a |V1|×|V2| integer table (Ci,j : the length of a longest string in subseq(P(v1,i)) ∩ subseq(P(v2,j))).
     Step 3 : Compute Ci,j using dynamic programming for all 1 ≦ i ≦ |V1| and 1 ≦ j ≦ |V2|.
     Step 4 : Return max Ci,j.

  10. Topological Sort
      Sort the vertices of G1 and G2 in topological order.
      [Figure: G1 and G2 with their vertices listed in topological order as v1,1 ... v1,6 and v2,1 ... v2,6]

  11. Dynamic Programming table
      Let C be a |V1|×|V2| integer table (Ci,j : the length of a longest string in subseq(P(v1,i)) ∩ subseq(P(v2,j))).
      [Figure: the empty table C, with rows indexed by the vertices of G1 and columns by the vertices of G2]

  12. Case L1(v1,i) ≠ L2(v2,j)
      Compute Ci,j using dynamic programming for all 1 ≦ i ≦ |V1| and 1 ≦ j ≦ |V2|.
      If L1(v1,i) ≠ L2(v2,j) then
      Ci,j = max({Ck,j | (v1,k, v1,i) ∈ E1} ∪ {Ci,ℓ | (v2,ℓ, v2,j) ∈ E2} ∪ {0})
      [Figure: a cell of table C whose row and column labels differ, computed from the cells of the predecessors in G1 (same column) and in G2 (same row)]

  13. Case L1(v1,i) = L2(v2,j)
      Compute Ci,j using dynamic programming for all 1 ≦ i ≦ |V1| and 1 ≦ j ≦ |V2|.
      If L1(v1,i) = L2(v2,j) then
      Ci,j = 1 + max({Ck,ℓ | (v1,k, v1,i) ∈ E1, (v2,ℓ, v2,j) ∈ E2} ∪ {0})
      [Figure: a cell of table C whose row and column labels are equal, computed as +1 over the cells of predecessor pairs]

  14. Output
      Return max Ci,j.
      max Ci,j = 4
      Output : 4
      [Figure: the completed table C for G1 and G2]
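
The four steps of Algorithm 1 can be sketched as follows. This is one possible reading of the slides, not the authors' implementation: the function name `lcs_acyclic`, the `chain` helper, and the dictionary-based graph encoding are assumptions made for illustration. Topological sorting uses the standard-library `graphlib.TopologicalSorter` (Python 3.9+).

```python
from graphlib import TopologicalSorter


def lcs_acyclic(labels1, arcs1, labels2, arcs2):
    """Length of a longest common subsequence of two acyclic non-linear texts.

    labels: dict mapping vertex -> character; arcs: list of (u, v) pairs.
    """
    def predecessors(labels, arcs):
        preds = {v: [] for v in labels}
        for u, v in arcs:
            preds[v].append(u)
        return preds

    preds1 = predecessors(labels1, arcs1)
    preds2 = predecessors(labels2, arcs2)
    # TopologicalSorter expects vertex -> predecessors; predecessors come first.
    order1 = list(TopologicalSorter(preds1).static_order())
    order2 = list(TopologicalSorter(preds2).static_order())

    C = {}      # C[i, j] as defined on the slides
    best = 0
    for i in order1:
        for j in order2:
            if labels1[i] == labels2[j]:
                # Equal labels: extend the best pair of predecessors by one.
                value = 1 + max(
                    (C[k, l] for k in preds1[i] for l in preds2[j]), default=0)
            else:
                # Different labels: step back in G1 or in G2.
                value = max(
                    max((C[k, j] for k in preds1[i]), default=0),
                    max((C[i, l] for l in preds2[j]), default=0),
                )
            C[i, j] = value
            best = max(best, value)
    return best


def chain(text):
    """Encode an ordinary (linear) string as a path-shaped non-linear text."""
    labels = dict(enumerate(text))
    arcs = [(n, n + 1) for n in range(len(text) - 1)]
    return labels, arcs


# Hypothetical diamond-shaped text spelling "abd" and "acd".
DIAMOND = ({1: "a", 2: "b", 3: "c", 4: "d"}, [(1, 2), (1, 3), (2, 4), (3, 4)])

print(lcs_acyclic(*chain("abcbdab"), *chain("bdcaba")))  # 4
print(lcs_acyclic(*DIAMOND, *chain("abcd")))             # 3
```

The diamond example shows why the table is indexed by vertex pairs rather than by positions: "abcd" is not spelled by any single path of the diamond, so the answer against the linear text "abcd" is 3, not 4.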

  15. Time Complexity
      Sort the vertices of G1 and G2 in topological order : linear time
      Let C be a |V1|×|V2| integer table (Ci,j : the length of a longest string in subseq(P(v1,i)) ∩ subseq(P(v2,j))).
      Compute Ci,j using dynamic programming for all 1 ≦ i ≦ |V1| and 1 ≦ j ≦ |V2| : O(|E1||E2|) time
      Return max Ci,j

  16. Time Complexity
      Compute Ci,j using dynamic programming for all 1 ≦ i ≦ |V1| and 1 ≦ j ≦ |V2|.
      Case of L1(v1,i) ≠ L2(v2,j)
      ei : the number of arcs incoming to v1,i
      fj : the number of arcs incoming to v2,j
      To compute the value of Ci,j, ei + fj elements in table C are used.
      To compute Ci,j for all 1 ≦ i ≦ |V1| and 1 ≦ j ≦ |V2| : Σi Σj (ei + fj) = O(|E1||E2|)
      [Figure: a cell of table C and the ei + fj cells it depends on]

  17. Time Complexity
      Compute Ci,j using dynamic programming for all 1 ≦ i ≦ |V1| and 1 ≦ j ≦ |V2|.
      Case of L1(v1,i) = L2(v2,j)
      ei : the number of arcs incoming to v1,i
      fj : the number of arcs incoming to v2,j
      To compute the value of Ci,j, ei·fj elements in table C are used.
      To compute Ci,j for all 1 ≦ i ≦ |V1| and 1 ≦ j ≦ |V2| : Σi Σj ei·fj = O(|E1||E2|)
      [Figure: a cell of table C and the ei·fj cells it depends on]
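
The two bounds on the preceding slides follow from the fact that every arc enters exactly one vertex, so the in-degrees sum to the number of arcs. The derivation below is a reconstruction under that observation, since the slide formulas themselves did not survive extraction:

```latex
\begin{align*}
\sum_{i=1}^{|V_1|}\sum_{j=1}^{|V_2|} (e_i + f_j)
  &= |V_2| \sum_{i=1}^{|V_1|} e_i + |V_1| \sum_{j=1}^{|V_2|} f_j
   = |V_2|\,|E_1| + |V_1|\,|E_2|, \\
\sum_{i=1}^{|V_1|}\sum_{j=1}^{|V_2|} e_i f_j
  &= \Bigl(\sum_{i=1}^{|V_1|} e_i\Bigr)\Bigl(\sum_{j=1}^{|V_2|} f_j\Bigr)
   = |E_1|\,|E_2|.
\end{align*}
```

The first sum is O(|E1||E2|) under the assumption that |V1| = O(|E1|) and |V2| = O(|E2|), which the slides appear to take for granted (it holds, for instance, when both texts are connected).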

  18. Time Complexity
      Sort the vertices of G1 and G2 in topological order : linear time
      Let C be a |V1|×|V2| integer array (Ci,j : the length of a longest string in subseq(P(v1,i)) ∩ subseq(P(v2,j))).
      Compute Ci,j using dynamic programming for all 1 ≦ i ≦ |V1| and 1 ≦ j ≦ |V2| : O(|E1||E2|) time
      Return max Ci,j : O(|V1||V2|) time
      The total time complexity is O(|E1||E2|).

  19. Outline: Non-linear Text; Computing Longest Common Subsequence of Acyclic Non-linear Texts; Computing Longest Common Subsequence of Cyclic Non-linear Texts; Conclusions and Future works

  20. Longest Common Subsequence Problem for Cyclic Non-linear Texts
      Problem 2
      Input : Non-linear texts G1 = (V1, E1, L1) and G2 = (V2, E2, L2)
      Output : ∞ if subseq(G1) ∩ subseq(G2) is infinite; otherwise the length of a longest string in subseq(G1) ∩ subseq(G2)
      e.g. 1 : Character “d” is in a cycle in both G1 and G2.
      ccdccdccd······ ∈ L(G1)
      bdbdbdbd······ ∈ L(G2)
      dddddd········ ∈ subseq(G1) ∩ subseq(G2)
      Output = ∞
      [Figure: the cyclic non-linear texts G1 and G2]

  21. Longest Common Subsequence Problem for Cyclic Non-linear Texts
      Problem 2
      Input : Non-linear texts G1 = (V1, E1, L1) and G2 = (V2, E2, L2)
      Output : ∞ if subseq(G1) ∩ subseq(G2) is infinite; otherwise the length of a longest string in subseq(G1) ∩ subseq(G2)
      e.g. 2
      subseq(G1) ∩ subseq(G2) = {a, b, c, d, aa, ab, ac, ad, cd, aab, aac, aad, acd, aacd}
      Output = 4
      [Figure: the cyclic non-linear texts G1 and G2]

  22. Algorithm 2
      Algorithm 2 : Computing the length of a longest common subsequence of cyclic non-linear texts
      Step 1 : Transform G1 and G2 into G’1 and G’2 based on the strongly connected components.
      Step 2 : Check whether subseq(G1) ∩ subseq(G2) is infinite or not.
      Step 3 : Sort the vertices of G’1 and G’2 in topological order.
      Step 4 : Let C be a |V’1|×|V’2| integer table (Ci,j : the length of a longest string in subseq(P(v’1,i)) ∩ subseq(P(v’2,j))).
      Step 5 : Compute Ci,j using dynamic programming for all 1 ≦ i ≦ |V’1| and 1 ≦ j ≦ |V’2|.
      Step 6 : Return max Ci,j.

  23. Strongly Connected Component
      Transform G1 and G2 into G’1 and G’2 based on the strongly connected components.
      [Figure: the cyclic non-linear texts G1, G2 are transformed into the acyclic non-linear texts G’1, G’2; each strongly connected component becomes a single vertex labeled by the set of its characters, e.g. {c,d} in G’1 and {a,b} in G’2]

  24. Check whether output is infinity or not.
      Check whether subseq(G1) ∩ subseq(G2) is infinite or not.
      S1, S2 : the union of the sets of labels of vertices that have a self-loop in G’1, G’2.
      Case of S1 ∩ S2 ≠ Ø : Let c be any character in S1 ∩ S2. An infinite repetition c* of c is a common subsequence of G1 and G2. Hence, output = ∞.
      Case of S1 ∩ S2 = Ø : subseq(G1) ∩ subseq(G2) is finite.
      e.g. S1 = {c,d}, S2 = {a} ∪ {a,b} = {a,b}, S1 ∩ S2 = {c,d} ∩ {a,b} = Ø
      [Figure: G’1 and G’2 with their self-loop vertices highlighted]
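
The transformation and the infiniteness check can be sketched together. This is a hedged illustration rather than the authors' code: the function names (`condense`, `loop_labels`, `has_infinite_common_subsequence`) and the small example texts are hypothetical, and Tarjan's algorithm is used here simply as one well-known linear-time way to find strongly connected components (the slides do not say which method was used).

```python
def condense(labels, arcs):
    """Collapse each strongly connected component into a single vertex.

    Returns (comp_labels, comp_arcs). comp_labels[c] is the set of characters
    in component c. comp_arcs is a set of (c, d) pairs and contains the
    self-loop (c, c) exactly when component c contains a cycle.
    """
    succ = {v: [] for v in labels}
    for u, v in arcs:
        succ[u].append(v)

    index, low, comp_of = {}, {}, {}
    stack, on_stack = [], set()
    counter = 0
    n_comps = 0

    def visit(v):  # Tarjan's algorithm, recursive for brevity
        nonlocal counter, n_comps
        index[v] = low[v] = counter
        counter += 1
        stack.append(v)
        on_stack.add(v)
        for w in succ[v]:
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:  # v is the root of a component
            while True:
                w = stack.pop()
                on_stack.discard(w)
                comp_of[w] = n_comps
                if w == v:
                    break
            n_comps += 1

    for v in labels:
        if v not in index:
            visit(v)

    comp_labels = {}
    for v, c in comp_of.items():
        comp_labels.setdefault(c, set()).add(labels[v])
    # An arc inside a component (or an original self-loop) becomes a self-loop.
    comp_arcs = {(comp_of[u], comp_of[v]) for u, v in arcs}
    return comp_labels, comp_arcs


def loop_labels(comp_labels, comp_arcs):
    """Union of the label sets of components with a self-loop (S1 or S2)."""
    result = set()
    for c, d in comp_arcs:
        if c == d:
            result |= comp_labels[c]
    return result


def has_infinite_common_subsequence(text1, text2):
    s1 = loop_labels(*condense(*text1))
    s2 = loop_labels(*condense(*text2))
    return bool(s1 & s2)


# Hypothetical texts: T1 has a cycle over {c, d}, T2 over {b, d}, T3 over {a, b}.
T1 = ({1: "a", 2: "c", 3: "d"}, [(1, 2), (2, 3), (3, 2)])
T2 = ({1: "a", 2: "b", 3: "d"}, [(1, 2), (2, 3), (3, 2)])
T3 = ({1: "a", 2: "b", 3: "c"}, [(1, 2), (2, 1), (2, 3)])

print(has_infinite_common_subsequence(T1, T2))  # True: "d" lies on a cycle in both
print(has_infinite_common_subsequence(T1, T3))  # False: {c, d} and {a, b} are disjoint
```

The recursive traversal keeps the sketch short; for large texts an iterative version would avoid Python's recursion limit.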

  25. Algorithm 2
      Sort the vertices of G’1 and G’2 in topological order.
      [Figure: G’1 with vertices {a}, {c}, {d}, {c,d} and G’2 with vertices {a}, {a}, {b}, {a,b}, listed in topological order as v’1,1 ... v’1,4 and v’2,1 ... v’2,4]

  26. Algorithm 2
      Let C be a |V’1|×|V’2| integer table (Ci,j : the length of a longest string in subseq(P(v’1,i)) ∩ subseq(P(v’2,j))).
      [Figure: the empty table C, with rows and columns indexed by the vertices of G’1 and G’2]

  27. Algorithm 2
      Compute Ci,j using dynamic programming for all 1 ≦ i ≦ |V’1| and 1 ≦ j ≦ |V’2|.
      If L’1(v’1,i) ∩ L’2(v’2,j) ≠ Ø then
      Ci,j = 1 + max({Ck,ℓ | (v’1,k, v’1,i) ∈ E’1, (v’2,ℓ, v’2,j) ∈ E’2} ∪ {0})
      [Figure: a cell of table C whose row and column label sets intersect, computed as +1 over the cells of predecessor pairs]

  28. Algorithm 2
      Compute Ci,j using dynamic programming for all 1 ≦ i ≦ |V’1| and 1 ≦ j ≦ |V’2|.
      If L’1(v’1,i) ∩ L’2(v’2,j) = Ø then
      Ci,j = max({Ck,j | (v’1,k, v’1,i) ∈ E’1} ∪ {Ci,ℓ | (v’2,ℓ, v’2,j) ∈ E’2} ∪ {0})
      [Figure: a cell of table C whose row and column label sets are disjoint, computed from the cells of its predecessors]

  29. Algorithm 2
      Compute Ci,j using dynamic programming for all 1 ≦ i ≦ |V’1| and 1 ≦ j ≦ |V’2|.
      If L’1(v’1,i) ∩ L’2(v’2,j) = Ø then
      Ci,j = max({Ck,j | (v’1,k, v’1,i) ∈ E’1} ∪ {Ci,ℓ | (v’2,ℓ, v’2,j) ∈ E’2} ∪ {0})
      [Figure: a further cell of table C filled in by the same rule]

  30. Output
      Return max Ci,j.
      [Figure: the completed table C for G’1 and G’2]

  31. Time Complexity
      Transform G1 and G2 into G’1 and G’2 based on the strongly connected components : linear time
      Check whether subseq(G1) ∩ subseq(G2) is infinite or not : O(|Σ|log|Σ|) time
      Sort the vertices of G’1 and G’2 in topological order : linear time
      Let C be a |V’1|×|V’2| integer table (Ci,j : the length of a longest string in subseq(P(v’1,i)) ∩ subseq(P(v’2,j))).
      Compute Ci,j using dynamic programming for all 1 ≦ i ≦ |V’1| and 1 ≦ j ≦ |V’2| : O(|E’1||E’2| + |V’1||V’2|log|Σ|) time (|E’1||E’2| to compute Ci,j, |V’1||V’2|log|Σ| to compare L’(v1,i) and L’(v2,j))
      Return max Ci,j

  32. Time Complexity
      Transform G1 and G2 into G’1 and G’2 based on the strongly connected components : linear time
      Check whether subseq(G1) ∩ subseq(G2) is infinite or not : O(|Σ|log|Σ|) time
      Sort the vertices of G’1 and G’2 in topological order : linear time
      Let C be a |V’1|×|V’2| integer table (Ci,j : the length of a longest string in subseq(P(v’1,i)) ∩ subseq(P(v’2,j))).
      Compute Ci,j using dynamic programming for all 1 ≦ i ≦ |V’1| and 1 ≦ j ≦ |V’2| : O(|E’1||E’2| + |V’1||V’2|log|Σ|) time
      Return max Ci,j : O(|V’1||V’2|) time
      The total time complexity is O(|E1||E2| + |V1||V2|log|Σ|).
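
The step from the primed bound to the total stated on this slide is left implicit. A short justification, assuming only that each vertex of G’ is a non-empty group of vertices of G and that each arc of G’ is the image of at least one arc of G:

```latex
\[
|V'_k| \le |V_k|,\quad |E'_k| \le |E_k| \quad (k = 1, 2)
\;\Longrightarrow\;
O\bigl(|E'_1||E'_2| + |V'_1||V'_2|\log|\Sigma|\bigr)
\subseteq
O\bigl(|E_1||E_2| + |V_1||V_2|\log|\Sigma|\bigr).
\]
```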

  33. Conclusions and Future works
      Future works
      ・Longest Common Substring Problem on Cyclic Non-linear Texts
      ・The case where the number of times a vertex can be used is bounded
      ・Pattern matching with non-linear patterns

  34. Thank You For Listening
