
Advisor: Prof. R. C. T. Lee Speaker: L. Y. Huang

Fast Parallel and Serial Approximate String Matching. G. Landau and U. Vishkin. Journal of Algorithms, Vol. 10 (1989), pp. 157-169. Advisor: Prof. R. C. T. Lee. Speaker: L. Y. Huang.


Presentation Transcript


  1. Fast Parallel and Serial Approximate String Matching. Journal of Algorithms, Vol. 10 (1989), pp. 157-169. G. Landau and U. Vishkin. Advisor: Prof. R. C. T. Lee. Speaker: L. Y. Huang.

  2. Problem
     • Given two arrays, the pattern $P = p_1 p_2 \cdots p_m$ and the text $T = t_1 t_2 \cdots t_n$, and an integer k (k ≥ 1), find all occurrences of the pattern in the text with edit distance at most k.

  3. This algorithm improves on the Alternative Dynamic Programming Computation.
     • First, we introduce the basic Dynamic Programming Computation.

  4. The Dynamic Programming Algorithm [S80]
     • In the dynamic programming approach, we construct an (n+1) × (m+1) matrix D, where $D_{i,j}$ is the minimum edit distance between P(1, j) and any substring of T that ends at $T_i$.
     • Example: T = gggtcta, P = gttc, k = 2

       i:               0  1  2  3  4  5  6  7
                           g  g  g  t  c  t  a
       j = 0            0  0  0  0  0  0  0  0
       j = 1   g        1  0  0  0  1  1  1  1
       j = 2   t        2  1  1  1  0  1  1  2
       j = 3   t        3  2  2  2  1  1  1  2
       j = 4   c        4  3  3  3  2  1  2  2
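The table on this slide can be reproduced with a few lines of code. The following is a minimal sketch, not the presenter's implementation: the function name `sellers_table` is hypothetical, and the table is stored row-first as `D[j][i]` (pattern index first) to mirror the printed layout, whereas the slides write $D_{i,j}$.

```python
def sellers_table(text, pattern):
    """Return D as a list of rows indexed by j (pattern prefix length).

    D[j][i] is the minimum edit distance between pattern[:j] and any
    substring of text ending at position i (1-based; i = 0 is the empty prefix).
    """
    n, m = len(text), len(pattern)
    # Row 0 is all zeros: the empty pattern prefix matches anywhere for free.
    D = [[0] * (n + 1)]
    for j in range(1, m + 1):
        row = [j] + [0] * n  # column 0: j pattern characters left unmatched
        for i in range(1, n + 1):
            if text[i - 1] == pattern[j - 1]:
                row[i] = D[j - 1][i - 1]           # characters match: copy the diagonal
            else:
                row[i] = 1 + min(D[j - 1][i - 1],  # diagonal neighbour
                                 D[j - 1][i],      # same column, previous row
                                 row[i - 1])       # same row, previous column
        D.append(row)
    return D


# Last row for the slide's example; entries <= k mark the ends of occurrences.
print(sellers_table("gggtcta", "gttc")[4])  # [4, 3, 3, 3, 2, 1, 2, 2]
```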

  5. In the last row (j = 4) of the table for T = gggtcta and P = gttc, every entry that is at most k = 2 marks the end of an occurrence. We found:
     • (1) gt, ending at $T_4$, aligned with gttc: distance = 2
     • (2) gtc, ending at $T_5$, aligned with gttc: distance = 1

  6. The remaining occurrences are:
     • (3) gtct, ending at $T_6$, aligned with gttc: distance = 2
     • (4) gtct, ending at $T_6$, aligned with gttc in a second way: distance = 2
     • (5) gtcta, ending at $T_7$, aligned with gttc: distance = 2
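The end positions listed on the last two slides can be read off the last row of the table without storing the whole matrix. The sketch below keeps a single column at a time; the function name `approx_match_ends` and the dictionary return format are assumptions made for illustration.

```python
def approx_match_ends(text, pattern, k):
    """Return {end position i (1-based): distance} for every i at which the
    whole pattern matches some substring of text ending at i with <= k edits."""
    m = len(pattern)
    col = list(range(m + 1))  # column i = 0 of the table: col[j] = j
    hits = {}
    for i, ch in enumerate(text, start=1):
        prev_diag = col[0]    # value of the previous column at row j - 1
        col[0] = 0            # row 0 is always 0
        for j in range(1, m + 1):
            old = col[j]      # previous column, same row
            if ch == pattern[j - 1]:
                new = prev_diag
            else:
                # previous column/previous row, previous column/same row,
                # current column/previous row (already updated)
                new = 1 + min(prev_diag, old, col[j - 1])
            prev_diag = old
            col[j] = new
        if col[m] <= k:
            hits[i] = col[m]
    return hits


print(approx_match_ends("gggtcta", "gttc", 2))  # {4: 2, 5: 1, 6: 2, 7: 2}
```

Occurrences (3) and (4) share the end position 6, which is why the dictionary has four keys for five listed alignments.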

  7. An Alternative Dynamic Programming Computation
     • This computation relies heavily on the concept of a diagonal.
     • Diagonal d is defined as the set of all $D_{i,j}$ with d = i - j.
     • Example: T = abc, P = bc. Diagonal 0 starts at $D_{0,0}$ and Diagonal 2 starts at $D_{2,0}$.

       i:               0  1  2  3
                           a  b  c
       j = 0            0  0  0  0
       j = 1   b        1  1  0  1
       j = 2   c        2  2  1  0

  8. We first have the following:
     • (a) If $T_i = P_j$, then $D_{i,j} = D_{i-1,j-1}$.
     • (b) Otherwise, $D_{i,j} = \min\{D_{i-1,j-1} + 1,\ D_{i,j-1} + 1,\ D_{i-1,j} + 1\}$, where the three terms correspond to substitution, deletion and insertion, respectively.

  9. Consider any Diagonal d. Let us find the largest j, if it exists, such that (i, j) is on Diagonal d (i - j = d) and $D_{i,j} = 0$. Let us now label all of these locations.
     • For T = gggtcta and P = gttc, these locations are (i, j) = (1, 1) on Diagonal 0, (2, 1) on Diagonal 1 and (4, 2) on Diagonal 2.

  10. Having found the above locations (i, j) where $D_{i,j} = 0$, we can further find the largest j, if it exists, such that (i, j) is on Diagonal d and $D_{i,j} = 1$.
     • To do this, we use the following observation: each element on Diagonal d can only influence elements on Diagonals d-1, d and d+1.

  11. Let us consider any location (i, j) on Diagonal d. Why can $D_{i,j}$ suddenly become 1? It can only be influenced by three neighbours:
     • $D_{i-1,j-1}$ on Diagonal d (substitution)
     • $D_{i,j-1}$ on Diagonal d+1 (deletion)
     • $D_{i-1,j}$ on Diagonal d-1 (insertion)
     Thus, we conclude that we only need to consider Diagonals d-1, d and d+1.

  12. Let us consider the table for T = gggtcta and P = gttc, with the locations where $D_{i,j} = 0$ already marked.
     • Question: what is the value of $D_{4,3}$, which lies on Diagonal 1 (d = 1)?
     • It cannot be 0, because we have already determined that the largest j on Diagonal 1 with $D_{i,j} = 0$ is 1. It is at most 1, because $D_{4,2} = 0$ on Diagonal 2 and $D_{4,3} \le D_{4,2} + 1$. Thus $D_{4,3} = 1$.

  13. Continuing on Diagonal 1 (d = 1) with $D_{4,3} = 1$:
     • Question: what is the value of $D_{5,4}$?
     • Since $T_5 = P_4$, $D_{5,4} = D_{4,3} = 1$.

  14. Based upon the above discussion, we can find all (i, j) where $D_{i,j} = 1$ after finding all (i', j') where $D_{i',j'} = 0$.
     • In fact, after finding all (i, j) where $D_{i,j} = e$, we can find all (i', j') where $D_{i',j'} = e+1$. Thus the full dynamic programming table does not have to be computed.
     • In the following, we give the Alternative Dynamic Programming Computation formally.

  15. Let $L_{d,e}$ denote the largest row j such that $D_{i,j}$ is on Diagonal d (i - j = d) and $D_{i,j} = e$.
     • By this definition, e is the minimum edit distance between $P(1, L_{d,e})$ and any substring of T ending at $T_{L_{d,e}+d}$, and $P_{L_{d,e}+1} \ne T_{L_{d,e}+d+1}$.
     • Example: let d = 3. In the table for T = gggtcta and P = gttc, $L_{3,0} = 0$, $L_{3,1} = 3$ and $L_{3,2} = 4$.
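To make the definition concrete, the values of $L_{d,e}$ can be read directly from a fully computed table. This is only an illustrative sketch (the alternative computation exists precisely to avoid building the table); the helper names `dp_table` and `L_from_table` are hypothetical, and `None` is an assumed convention for "no such row".

```python
def dp_table(text, pattern):
    """Full table with the slides' index order: D[i][j], text index first."""
    n, m = len(text), len(pattern)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for j in range(1, m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if text[i - 1] == pattern[j - 1]:
                D[i][j] = D[i - 1][j - 1]
            else:
                D[i][j] = 1 + min(D[i - 1][j - 1], D[i][j - 1], D[i - 1][j])
    return D


def L_from_table(D, d, e):
    """Largest row j on diagonal d (i - j = d) with D[i][j] == e, else None."""
    n, m = len(D) - 1, len(D[0]) - 1
    best = None
    for j in range(max(0, -d), m + 1):  # rows for which i = j + d is >= 0
        i = j + d
        if i > n:
            break
        if D[i][j] == e:
            best = j
    return best


D = dp_table("gggtcta", "gttc")
print([L_from_table(D, 3, e) for e in range(3)])  # [0, 3, 4]
```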

  16. Example: T = gggtcta, P = gttc, k = 2
     • Now, $L_{3,1} = 3$. This means that we have found a substring A = T(3, 6) = gtct, ending at $T_{L_{d,e}+d} = T_{3+3} = T_6$, such that the edit distance between A and P(1, 3) = gtt is 1.
     • Moreover, $P_{L_{d,e}+1} \ne T_{L_{d,e}+d+1}$, that is, $P_{3+1} \ne T_{3+3+1}$.

  17. Example: T = gggtcta, P = gttc, k = 2
     • Now, $L_{1,1} = 4 = m$. This means that we have found a substring A = T(2, 5) = ggtc, ending at $T_{L_{d,e}+d} = T_{4+1} = T_5$, such that the edit distance between A and P(1, 4) = gttc is 1.
     • The two strings are T(2, 5) = ggtc and P = gttc.

  18. The Alternative Dynamic Programming Computation consists of computing the values $L_{d,e}$.

  19. An Alternative Dynamic Programming Computation
     • First, we set the initial values.
     • Example: T = gggtcta, P = gttc

  20. e = 0
     • For d = 0 to d = n, find the largest j such that P(1, j) is equal to T(d+1, d+j), and set $L_{d,0} = j$.
     • d = 0: $P_1 = T_1$ and $P_2 \ne T_2$, so $L_{0,0} = 1$.

  21. e = 0
     • d = 1: $P_1 = T_2$ and $P_2 \ne T_3$, so $L_{1,0} = 1$.

  22. e = 0
     • d = 2: $P_1 = T_3$, $P_2 = T_4$ and $P_3 \ne T_5$, so $L_{2,0} = 2$.
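The e = 0 step on the last three slides amounts to a direct character comparison along each diagonal. The following sketch assumes a hypothetical helper `initial_L` that returns the values $L_{0,0}, \ldots, L_{n,0}$ as a list; it is meant only to reproduce the slide values.

```python
def initial_L(text, pattern):
    """Return [L_{0,0}, L_{1,0}, ..., L_{n,0}]: for each diagonal d, the length
    of the longest prefix of pattern that equals the text starting at t_{d+1}."""
    n, m = len(text), len(pattern)
    result = []
    for d in range(n + 1):
        j = 0
        # pattern[j] is p_{j+1}; text[d + j] is t_{d+j+1} (0-based strings)
        while j < m and d + j < n and pattern[j] == text[d + j]:
            j += 1
        result.append(j)
    return result


print(initial_L("gggtcta", "gttc"))  # [1, 1, 2, 0, 0, 0, 0, 0]
```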

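  23. Our approach is based upon Rule 1, proposed by Professor Lee.
     • Consider two strings $A_1$ and $A_2$, where $A_1$ consists of a prefix $P_1$ followed by a suffix $S_1$, and $A_2$ consists of a prefix $P_2$ followed by a suffix $S_2$.
     • If $d(A_1, A_2) \le k$ and $S_1 = S_2$, then $d(P_1, P_2) \le k$.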

  24. Observe the following: if $d(A_1, A_2) = k$, $S_1 = S_2$ and $x \ne y$, then $d(A_1 + S_1 + x,\ A_2 + S_2 + y) \le k + 1$.

  25. For e ≠ 0, we search from d = -e to d = n.
     • Let row = max[$L_{d,e-1} + 1$ (substitution), $L_{d-1,e-1}$ (deletion), $L_{d+1,e-1} + 1$ (insertion)].
     • Find the largest j, if it exists, such that P(row+1, j) = T(row+1+d, i) = T(row+1+i-j, i), and set $L_{d,e} = j$. If no such j exists, set $L_{d,e}$ = row.

  26. Let row = max[$L_{d,e-1} + 1$ (substitution), $L_{d-1,e-1}$ (deletion), $L_{d+1,e-1} + 1$ (insertion)].
     (Figure: the three moves onto Diagonal d, coming from Diagonal d, Diagonal d-1 and Diagonal d+1.)
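A single update of this rule can be sketched as a small function. The name `next_L` and the dictionary argument `prev` (mapping a diagonal to its value at level e-1) are assumptions for illustration; the extension step here is the plain character-by-character scan, not the suffix-tree lookup discussed later.

```python
def next_L(text, pattern, d, prev):
    """Compute L_{d,e} from level e-1, where prev[x] = L_{x,e-1} must be
    available for x = d-1, d and d+1."""
    n, m = len(text), len(pattern)
    row = max(prev[d] + 1,      # stay on diagonal d, one row further
              prev[d - 1],      # arrive from diagonal d-1, same row
              prev[d + 1] + 1)  # arrive from diagonal d+1, one row further
    row = min(row, m)
    # Slide down diagonal d while p_{row+1} == t_{row+1+d} (0-based indexing below).
    while row < m and row + d < n and pattern[row] == text[row + d]:
        row += 1
    return row


# Level e = 0 for T = gggtcta, P = gttc is L_{0,0}=1, L_{1,0}=1, L_{2,0}=2, L_{3,0}=0, L_{4,0}=0.
print(next_L("gggtcta", "gttc", 1, {0: 1, 1: 1, 2: 2}))  # 4
print(next_L("gggtcta", "gttc", 3, {2: 2, 3: 0, 4: 0}))  # 3
```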

  27. Example: e = 1, d = -1
     • row = max[$L_{d,e-1} + 1$, $L_{d-1,e-1}$, $L_{d+1,e-1} + 1$] = max[$L_{-1,0} + 1$, $L_{-2,0}$, $L_{0,0} + 1$] = max[0+1, 0, 1+1] = max[1, 0, 2] = 2, using the boundary values $L_{-1,0} = 0$ and $L_{-2,0} = 0$.
     • P(row+1, j) ≠ T(row+1+d, i), since $P_3 \ne T_2$.
     • $L_{-1,1} = 2$

  28. Example: e = 1, d = 0
     • row = max[$L_{d,e-1} + 1$, $L_{d-1,e-1}$, $L_{d+1,e-1} + 1$] = max[$L_{0,0} + 1$, $L_{-1,0}$, $L_{1,0} + 1$] = max[1+1, 0, 1+1] = max[2, 0, 2] = 2, using the boundary value $L_{-1,0} = 0$.
     • P(row+1, j) ≠ T(row+1+d, i), since $P_3 \ne T_3$.
     • $L_{0,1} = 2$

  29. Example: e = 1, d = 1
     • row = max[$L_{d,e-1} + 1$, $L_{d-1,e-1}$, $L_{d+1,e-1} + 1$] = max[1+1, 1, 2+1] = max[2, 1, 3] = 3
     • P(row+1, j) = T(row+1+d, i) holds for j = 4: $P_4 = T_5$ = c.
     • $L_{1,1} = 4 = m$
     • We find an occurrence of the pattern in the text with edit distance at most 1 that ends at $T_{d+m} = T_{1+4} = T_5$.

  30. Example: e = 1, d = 3
     • row = max[$L_{d,e-1} + 1$, $L_{d-1,e-1}$, $L_{d+1,e-1} + 1$] = max[0+1, 2, 0+1] = max[1, 2, 1] = 2
     • P(row+1, j) = T(row+1+d, i) holds for j = 3 only: $P_3 = T_6$, but $P_4 \ne T_7$.
     • $L_{3,1} = 3$

  31. Example: e = 2, d = 3
     • row = max[$L_{d,e-1} + 1$, $L_{d-1,e-1}$, $L_{d+1,e-1} + 1$] = max[3+1, 3, 2+1] = max[4, 3, 3] = 4
     • $L_{3,2} = 4 = m$
     • We find an occurrence of the pattern in the text with edit distance at most 2 that ends at $T_{d+m} = T_{3+4} = T_7$.

  32. An Alternative Dynamic Programming Computation
     Initialization:
       for all d, 0 ≤ d ≤ n: $L_{d,-1} = -1$
       for all d, -(k+1) ≤ d ≤ -1: $L_{d,|d|-1} = |d|-1$, $L_{d,|d|-2} = |d|-2$
       for all e, -1 ≤ e ≤ k: $L_{n+1,e} = -1$
     for e = 0 to k do
       for d = -e to n do
         row = max[$L_{d,e-1} + 1$, $L_{d-1,e-1}$, $L_{d+1,e-1} + 1$]
         row = min(row, m)
         while row < m and row + d < n and $p_{row+1} = t_{row+1+d}$ do
           row = row + 1
         $L_{d,e}$ = row
         if $L_{d,e} = m$ then
           print *there is an occurrence ending at $t_{d+m}$*
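The procedure on this slide can be turned into a short runnable sketch. The function name `landau_vishkin_simple` and the dictionary-based storage are assumptions for illustration. The sketch uses the boundary values $L_{d,|d|-1} = |d|-1$ and $L_{d,|d|-2} = |d|-2$ on the negative diagonals, and it adds one guard that is not on the slide: `row` is also capped at n - d so that a reported occurrence never ends beyond the text. The extension loop is the plain scan, so this is the version without the suffix tree.

```python
def landau_vishkin_simple(text, pattern, k):
    """Return {end position (1-based): smallest e <= k} such that the pattern
    matches a substring of text ending there with e differences."""
    n, m = len(text), len(pattern)
    L = {}
    # Boundary values for the diagonals and levels outside the computed range.
    for d in range(0, n + 1):
        L[(d, -1)] = -1
    for d in range(-(k + 1), 0):
        L[(d, -d - 1)] = -d - 1
        L[(d, -d - 2)] = -d - 2
    for e in range(-1, k + 1):
        L[(n + 1, e)] = -1

    hits = {}
    for e in range(k + 1):
        for d in range(-e, n + 1):
            row = max(L[(d, e - 1)] + 1, L[(d - 1, e - 1)], L[(d + 1, e - 1)] + 1)
            # min(row, m) is from the slide; n - d is an added guard (assumption)
            # that keeps the cell (row + d, row) inside the text.
            row = min(row, m, n - d)
            while row < m and row + d < n and pattern[row] == text[row + d]:
                row += 1
            L[(d, e)] = row
            if row == m and d + m > 0:
                hits.setdefault(d + m, e)  # keep the first (smallest) e seen
    return hits


print(landau_vishkin_simple("gggtcta", "gttc", 2) == {4: 2, 5: 1, 6: 2, 7: 2})  # True
```

Because the levels are processed in increasing order of e, `setdefault` records the smallest number of differences for each end position.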

  33. What This Algorithm Does Differently
     • In the Alternative Dynamic Programming Computation, we must search for the largest j such that P(row+1, j) = T(row+1+d, i) = T(row+1+i-j, i).
     • Essentially, we are looking for two equal substrings, $S_1$ in T and $S_2$ in P.
     • This paper uses the LCA (lowest common ancestor) to speed up this search.

  34. Algorithm
     • This algorithm has two steps:
       1. Concatenate the text and the pattern into one string $t_1, \ldots, t_n, p_1, \ldots, p_m$, and compute the suffix tree of this string.
       2. Find all occurrences of the pattern in the text with edit distance at most k.

  35. Example: T = ABCDEA, P = DDBE, S = ABCDEADDBE.
     • The suffix tree of a string of length n can be constructed in O(n) time (Weiner, 1973; McCreight, 1976; Ukkonen, 1995).

  36. After O(n) preprocessing of the tree, the lowest common ancestor of two leaf nodes can be found in O(1) time (Harel and Tarjan, 1984).

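  37. To find such an S, if it exists, we concatenate T and P into a new string. On the suffix tree of this string, the suffix starting at $S_1$ in T and the suffix starting at $S_2$ in P have a lowest common ancestor, and the string S spelled out on the path from the root to that ancestor is their longest common prefix.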
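The role of the suffix tree and the LCA query is to answer "how far do the text and the pattern keep matching from here?" in constant time. The sketch below is a deliberately simple stand-in, not the structure used in the paper: it precomputes the same answers in a table with O(nm) work instead of building a suffix tree in linear time. The names `build_lce` and `extend` are hypothetical.

```python
def build_lce(text, pattern):
    """lce[i][j] = length of the longest common prefix of text[i:] and
    pattern[j:] (0-based). A quadratic stand-in for suffix tree + LCA."""
    n, m = len(text), len(pattern)
    lce = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if text[i] == pattern[j]:
                lce[i][j] = lce[i + 1][j + 1] + 1
    return lce


def extend(lce, row, d):
    """Number q of further rows gained on diagonal d after `row`, i.e. the
    largest q with p_{row+1..row+q} == t_{row+d+1..row+d+q} (1-based)."""
    return lce[row + d][row]


lce = build_lce("gggtcta", "gttc")
# For L_{1,1}: row = 3 on diagonal 1, one more matching character, so L_{1,1} = 4.
print(3 + extend(lce, 3, 1))  # 4
```

With such a constant-time lookup, the `while` loop of the alternative computation is replaced by a single addition `row + q`.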

  38. If we want to compute $L_{3,1}$, we use $L_{2,0}$, $L_{3,0}$ and $L_{4,0}$ to decide the row value (row = 2).
     • In this paper, we find that the length of $LCA_{2,3}$ is 2, so q = 2.
     • $L_{3,1}$ = row + q = 2 + 2 = 4
     (Figure: partially filled table for the text gggtctac with Diagonal d = 3 highlighted; $S_1$ and $S_2$ mark the matching substrings.)

  39. S = gggtctacgttac (the text followed by the pattern)

  40. Time Complexity
     • The Alternative Dynamic Programming Computation takes O(mn) time.
     • The suffix tree has O(n) nodes.
     • Each LCA query is answered in O(1) time.
     • For each of the n+k+1 diagonals, we evaluate k+1 values $L_{d,e}$.
     • This algorithm therefore takes O(nk) time.
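As a rough check of the final bound, the count on this slide can be written out as below. This is only a sketch under the assumptions k ≤ n and m ≤ n, which the slides do not state explicitly; the O(n + m) term stands for building the suffix tree and the LCA structure on the concatenated string.

```latex
\underbrace{O(n + m)}_{\text{preprocessing}}
+ \underbrace{(n + k + 1)}_{\text{diagonals}}
\cdot \underbrace{(k + 1)}_{\text{values of } e}
\cdot \underbrace{O(1)}_{\text{LCA query}}
= O(n + m + nk + k^2) = O(nk)
```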

  41. References
     • [AHU-74] A. V. AHO, J. E. HOPCROFT, AND J. D. ULLMAN, “The Design and Analysis of Computer Algorithms,” Addison-Wesley, Reading, MA, 1974.
     • [AILSV-88] A. APOSTOLICO, C. ILIOPOULOS, G. M. LANDAU, B. SCHIEBER, AND U. VISHKIN, Parallel construction of a suffix tree with applications, Algorithmica 3 (1988), 347-365.
     • [BM-77] R. S. BOYER AND J. S. MOORE, A fast string searching algorithm, Comm. ACM 20 (1977), 762-772.
     • [CS-85] M. T. CHEN AND J. SEIFERAS, Efficient and elegant subword tree construction, in “Combinatorial Algorithms on Words” (A. Apostolico and Z. Galil, Eds.), NATO ASI Series F: Computer and System Sciences Vol. 12, pp. 97-107, Springer-Verlag, New York/Berlin, 1985.
     • [G-84] Z. GALIL, Optimal parallel algorithms for string matching, in “Proceedings, 16th ACM Symposium on Theory of Computing, 1984,” pp. 240-248; Inform. and Control 67 (1985), 144-157.
     • [GG-86] Z. GALIL AND R. GIANCARLO, Improved string matching with k mismatches, SIGACT News 17, No. 4 (1986), 52-54.
     • [GG-87] Z. GALIL AND R. GIANCARLO, Parallel string matching with k mismatches, Theoret. Comput. Sci. 51 (1987), 341-348.
     • [GS-83] Z. GALIL AND J. I. SEIFERAS, Time-space-optimal string matching, J. Comput. System Sci. 26 (1983), 280-294.
     • [HT-84] D. HAREL AND R. E. TARJAN, Fast algorithms for finding nearest common ancestors, SIAM J. Comput. 13, No. 2 (1984), 338-355.
     • [KMP-77] D. E. KNUTH, J. H. MORRIS, AND V. R. PRATT, Fast pattern matching in strings, SIAM J. Comput. 6 (1977), 323-350.
     • [KR-87] R. M. KARP AND M. O. RABIN, Efficient randomized pattern-matching algorithms, IBM J. Res. Develop. 31, No. 2 (1987), 249-260.

  42. References (continued)
     • [LSV-87] G. M. LANDAU, B. SCHIEBER, AND U. VISHKIN, Parallel construction of a suffix tree, in “Proceedings 14th ICALP,” Lecture Notes in Computer Science Vol. 267, pp. 314-325, Springer-Verlag, New York/Berlin, 1987.
     • [LV-86a] G. M. LANDAU AND U. VISHKIN, Introducing efficient parallelism into approximate string matching, in “Proceedings, 18th ACM Symposium on Theory of Computing, 1986,” pp. 220-230.
     • [LV-86b] G. M. LANDAU AND U. VISHKIN, Efficient string matching with k mismatches, Theoret. Comput. Sci. 43 (1986), 239-249.
     • [LV-88] G. M. LANDAU AND U. VISHKIN, Fast string matching with k differences, J. Comput. System Sci. 37, No. 1 (1988), 63-78.
     • [S80] P. H. SELLERS, The theory and computation of evolutionary distances: Pattern recognition, J. Algorithms 1 (1980), 359-373.
     • [SK-83] D. SANKOFF AND J. B. KRUSKAL (Eds.), “Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison,” Addison-Wesley, Reading, MA, 1983.
     • [SV-88] B. SCHIEBER AND U. VISHKIN, Parallel computation of lowest common ancestor in trees, SIAM J. Comput., in press.
     • [U-83] E. UKKONEN, On approximate string matching, in “Proceedings Int. Conf. Found. Comput. Theory,” Lecture Notes in Computer Science Vol. 158, pp. 487-495, Springer-Verlag, Berlin/New York, 1983.
     • [U-85] E. UKKONEN, Finding approximate patterns in strings, J. Algorithms 6 (1985), 132-137.
     • [V-83] U. VISHKIN, “Synchronous parallel computation: A survey,” TR-71, Department of Computer Science, Courant Institute, NYU, 1983.
     • [V-85] U. VISHKIN, Optimal parallel pattern matching in strings, in “Proceedings 12th ICALP,” Lecture Notes in Computer Science Vol. 194, pp. 497-508, Springer-Verlag, New York/Berlin, 1985; Inform. and Control 67 (1985), 91-113.

  43. Thank you
