Advisor: Prof. R. C. T. Lee Speaker: C. W. Lu

Approximate String Matching Using Compressed Suffix Arrays Trinh N. D. Huynh, W. K. Hon, T. W. Lam and W. K. Sung, Theoretical Computer Science, Vol. 352, 2006, pp. 240-249. Advisor: Prof. R. C. T. Lee Speaker: C. W. Lu.



  1. Approximate String Matching Using Compressed Suffix Arrays. Trinh N. D. Huynh, W. K. Hon, T. W. Lam and W. K. Sung, Theoretical Computer Science, Vol. 352, 2006, pp. 240-249. Advisor: Prof. R. C. T. Lee. Speaker: C. W. Lu.

  2. Let x and y be two strings. The edit distance d(x, y) is the minimum number of character insertions, deletions, and replacements needed to convert string x to y. • k-difference string matching problem: • Given a text T of length n, a pattern P of length m, and an error bound k, • find all positions i of T such that there exists a suffix S of T(1, i) with d(S, P) ≦ k.
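
The definition of d(x, y) can be made concrete with the standard dynamic-programming recurrence. The following is a minimal sketch; the function name `edit_distance` is an illustrative choice and does not come from the slides.

```python
def edit_distance(x: str, y: str) -> int:
    """Minimum number of insertions, deletions and replacements turning x into y."""
    # prev[j] holds d(x[:i-1], y[:j]); it starts as d("", y[:j]) = j.
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        curr = [i]  # d(x[:i], "") = i deletions
        for j, cy in enumerate(y, start=1):
            curr.append(min(
                prev[j] + 1,               # delete cx from x
                curr[j - 1] + 1,           # insert cy into x
                prev[j - 1] + (cx != cy),  # replace cx by cy, or keep a match
            ))
        prev = curr
    return prev[-1]


print(edit_distance("aba", "ba"))     # one deletion
print(edit_distance("AAAA", "AGAB"))  # two replacements
```

The table is filled row by row, so only two rows are kept in memory at any time.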

  3. The approach of this paper is as follows: • Given a pattern P and an error bound k, we generate all possible edited patterns P’ that can be derived from P with at most k errors. • Then we conduct an exact match of every such P’ against T.

  4. Example: T=abbaaa, P=aba and k=1. From P and k, we generate the following P’s: ba, aaba, baba, bba, aa, abba, aaa, ab, abaa, abab, abb, aba.
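
The generate-then-match idea can be sketched by brute force for this small example. The function `neighbors` and the use of Python's `in` operator for exact matching are assumptions made for illustration; the paper replaces both with the suffix-array machinery described in the later slides.

```python
def neighbors(P: str, k: int, alphabet: str) -> set:
    """All strings within edit distance k of P (brute-force enumeration)."""
    result = {P}
    frontier = {P}
    for _ in range(k):
        nxt = set()
        for s in frontier:
            for i in range(len(s) + 1):
                for c in alphabet:
                    nxt.add(s[:i] + c + s[i:])          # insertion before position i
                if i < len(s):
                    nxt.add(s[:i] + s[i + 1:])          # deletion of s[i]
                    for c in alphabet:
                        nxt.add(s[:i] + c + s[i + 1:])  # replacement of s[i]
        frontier = nxt - result   # only expand strings not seen before
        result |= nxt
    return result


T, P, k = "abbaaa", "aba", 1
candidates = neighbors(P, k, "ab")
# Naive exact matching: keep the edited patterns that occur in T.
matched = sorted(s for s in candidates if s in T)
print(len(candidates), matched)
```

For P=aba, k=1 and the alphabet {a, b}, this enumerates the twelve strings listed in the example.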

  5. Then we conduct an exact matching of every P’ against T. Any success indicates that there is a substring S in T such that d(S, P) ≦ k. • How can we generate all the P’s we want? • We use the following observation.

  6. Observation: Let S be a substring of T with S = S1S2, and let P = P1P2. If d(S1, P1) ≦ k and d(S2, P2) = 0, then d(S, P) ≦ k. [Figure: S = S1S2 inside T, aligned with P = P1P2.]

  7. Example: k = 2.
     T = A C A C A A A A A C A C C (positions 1 to 13)
     P = A G A B C A (positions 1 to 6), with P1 = AGAB and P2 = CA.
     Consider the substring S = T(6, 11) = AAAACA. Let S1 = T(6, 9) = AAAA and S2 = T(10, 11) = CA. Then d(S1, P1) = 2 ≦ k and d(S2, P2) = 0, so d(S, P) = 2 ≦ k.

  8. Example: k = 2.
     T = A C A C A A A A A C A C C (positions 1 to 13)
     P = A G A B C A (positions 1 to 6), with P1 = AGAB and P2 = CA.
     Consider the substring S = T(8, 11) = AACA. Let S1 = T(8, 9) = AA and S2 = T(10, 11) = CA. Then d(S1, P1) = 2 ≦ k and d(S2, P2) = 0, so d(S, P) = 2 ≦ k.

  9. Based upon the above observation, we can generate all edited pattern P’s by editing the prefix and keeping the suffix untouched, in some manner. • Consider P=aba, k=1.

  10. P = aba, k = 1. Edited patterns produced at each position i (each uses one error; moving past a position without editing leaves aba with zero errors):
     i = 1: ba (deletion), aaba (insertion), baba (insertion), bba (substitution)
     i = 2: aa (deletion), aaba (insertion), abba (insertion), aaa (substitution)
     i = 3: ab (deletion), abaa (insertion), abba (insertion), abb (substitution)
     i = 4: abaa (insertion), abab (insertion)

  11. P = aba, k = 2. First level of edits, produced at each position i (each uses one error; moving past a position without editing leaves aba with zero errors):
     i = 1: ba (deletion), aaba (insertion), baba (insertion), bba (substitution)
     i = 2: aa (deletion), aaba (insertion), abba (insertion), aaa (substitution)
     i = 3: ab (deletion), abaa (insertion), abba (insertion), abb (substitution)
     i = 4: abaa (insertion), abab (insertion)

  12. P = aba, k = 2. Second level of edits, starting from ba (one error, obtained by deleting at position 1). Each pattern below uses two errors; moving past a position without editing leaves ba with one error:
     i = 2: a (deletion), aba (insertion), bba (insertion), aa (substitution)
     i = 3: b (deletion), baa (insertion), bba (insertion), bb (substitution)
     i = 4: baa (insertion), bab (insertion)

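  13. [Figure: P = PL PR is split at position i, and the edited pattern is P’ = PL’ PR’.] For i = 1 to m+1, one of the following is applied at position i: • Deletion of the character at i, k’++. • Replacement of the character at i by a character of the alphabet (A, C, …), k’++. • Insertion of a character of the alphabet (A, C, …) at i, k’++. • No operation. Throughout, k’ = d(PL’, PL) ≦ k and d(PR’, PR) = 0. Terminate if k’ > k.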

  14. Our problem now becomes the following: given a pattern P, we produce a modified pattern P’, and our job is to determine whether P’ exactly matches some substring of T. • For example, suppose P=aba. We have ba as one of the modified patterns, so we would like to find out whether ba matches exactly with a substring of T.

  15. This exact matching can be found by using the suffix array and the inverse suffix array.

  16. Suffix Array • Let T = t0 t1 … tn, where t0, t1, …, tn-1 are characters from an alphabet A and tn = $ is a special symbol that is not in A and is smaller than any symbol in A. • The jth suffix of T is defined as T(j, n) = tj…tn and is denoted by Tj. • The suffix array SA[0..n] of T is an array of the integers j, each representing the suffix Tj, sorted in the lexicographic order of the corresponding suffixes.

  17. Example: T = GACAGTTCG$ (positions 0 to 9).
     Suffixes of T: {GACAGTTCG$, ACAGTTCG$, CAGTTCG$, AGTTCG$, GTTCG$, TTCG$, TCG$, CG$, G$, $}
     Lexicographic order: $, ACAGTTCG$, AGTTCG$, CAGTTCG$, CG$, G$, GACAGTTCG$, GTTCG$, TCG$, TTCG$ = T9, T1, T3, T2, T7, T8, T0, T4, T6, T5.
     i:     0 1 2 3 4 5 6 7 8 9
     SA[i]: 9 1 3 2 7 8 0 4 6 5
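
The suffix array in this example can be reproduced with a naive construction that simply sorts the suffix start positions. This sketch is for illustration only: `build_suffix_array` is a hypothetical name, and sorting whole suffixes is far slower than the linear-time constructions cited in the talk. It relies on `$` comparing smaller than the letters, which holds for ASCII.

```python
def build_suffix_array(text: str) -> list:
    """Start positions of all suffixes of text, in lexicographic order of the suffixes."""
    return sorted(range(len(text)), key=lambda j: text[j:])


T = "GACAGTTCG$"
SA = build_suffix_array(T)
print(SA)
for i, j in enumerate(SA):
    print(i, j, T[j:])  # rank, start position, suffix
```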

  18. Inverse Suffix Array • The inverse suffix array of T is denoted by SA⁻¹. • SA⁻¹[i] equals the number of suffixes that are lexicographically smaller than Ti.

  19. Example: T = GACAGTTCG$ (positions 0 to 9).
     Lexicographic order: $ (T9), ACAGTTCG$ (T1), AGTTCG$ (T3), CAGTTCG$ (T2), CG$ (T7), G$ (T8), GACAGTTCG$ (T0), GTTCG$ (T4), TCG$ (T6), TTCG$ (T5).
     i:       0 1 2 3 4 5 6 7 8 9
     SA[i]:   9 1 3 2 7 8 0 4 6 5
     SA⁻¹[i]: 6 1 3 2 7 9 8 4 5 0
     SA⁻¹[0] = 6 because there are 6 suffixes smaller than T0 = GACAGTTCG$. In general, SA⁻¹[SA[x]] = x.
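
Since SA⁻¹[SA[x]] = x, the inverse suffix array can be filled in with a single pass over SA. A minimal sketch follows; the function name `inverse_suffix_array` is an illustrative assumption.

```python
def inverse_suffix_array(SA: list) -> list:
    """inv[j] = rank of suffix T_j, i.e. the number of suffixes smaller than T_j."""
    inv = [0] * len(SA)
    for rank, j in enumerate(SA):
        inv[j] = rank
    return inv


SA = [9, 1, 3, 2, 7, 8, 0, 4, 6, 5]   # suffix array of GACAGTTCG$
SA_inv = inverse_suffix_array(SA)
print(SA_inv)
```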

  20. SA and SA⁻¹ each occupy O(n log n) bits. Both data structures can be constructed in linear time [13, 15, 17].

  21. In this paper, an interval [st..ed] is called the range of the suffix array of T corresponding to a string P if [st..ed] is the largest interval such that P is a prefix of every suffix Tj for j = SA[st], SA[st+1], …, SA[ed]. We write [st..ed ] = range(T, P).

  22. Example: T = GACAGTTCG$ (positions 0 to 9), P = G.
     i:     0 1 2 3 4 5 6 7 8 9
     SA[i]: 9 1 3 2 7 8 0 4 6 5
     Lexicographic order: $ (T9), ACAGTTCG$ (T1), AGTTCG$ (T3), CAGTTCG$ (T2), CG$ (T7), G$ (T8), GACAGTTCG$ (T0), GTTCG$ (T4), TCG$ (T6), TTCG$ (T5).
     G is a prefix of T8, T0 and T4. Since T8 = T_{SA[5]}, T0 = T_{SA[6]} and T4 = T_{SA[7]}, we have st = 5, ed = 7, and range(T, P) = [5..7].
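
One simple way to obtain range(T, P) from scratch is two binary searches over the suffix array, comparing P with the first |P| characters of each suffix. This is a sketch under that assumption; `sa_range` is a hypothetical helper and is not the incremental method of the lemmas in the talk.

```python
def sa_range(T: str, SA: list, P: str):
    """Return (st, ed) = range(T, P), or None if P does not occur in T."""
    m = len(P)
    # Leftmost suffix whose first m characters are >= P.
    lo, hi = 0, len(SA)
    while lo < hi:
        mid = (lo + hi) // 2
        if T[SA[mid]:SA[mid] + m] < P:
            lo = mid + 1
        else:
            hi = mid
    st = lo
    # Leftmost suffix whose first m characters are > P.
    hi = len(SA)
    while lo < hi:
        mid = (lo + hi) // 2
        if T[SA[mid]:SA[mid] + m] <= P:
            lo = mid + 1
        else:
            hi = mid
    ed = lo - 1
    return (st, ed) if st <= ed else None


T = "GACAGTTCG$"
SA = sorted(range(len(T)), key=lambda j: T[j:])  # naive suffix array
print(sa_range(T, SA, "G"))
```

Each comparison inspects up to |P| characters, so this search costs O(|P| log n) character comparisons.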

  23. Lemma 1 (Gusfield [12]) Given a text T together with its suffix array, assume [st..ed] = range(T, P). Then, for any character c, the interval [st’..ed’] = range(T, Pc) can be computed in O(log n) time.
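
One way to realize Lemma 1 is a binary search restricted to [st..ed]: every suffix in that interval starts with P, so the suffixes are ordered by the character that follows P. The sketch below assumes that P does not contain `$` and that the caller passes |P| as `plen`; the name `extend_right` is hypothetical.

```python
def extend_right(T: str, SA: list, st: int, ed: int, plen: int, c: str):
    """Given [st..ed] = range(T, P) with |P| = plen, return range(T, Pc) or None."""
    # Inside [st..ed] the character at offset plen is non-decreasing.
    lo, hi = st, ed + 1
    while lo < hi:                      # first suffix whose next character is >= c
        mid = (lo + hi) // 2
        if T[SA[mid] + plen] < c:
            lo = mid + 1
        else:
            hi = mid
    new_st = lo
    hi = ed + 1
    while lo < hi:                      # first suffix whose next character is > c
        mid = (lo + hi) // 2
        if T[SA[mid] + plen] <= c:
            lo = mid + 1
        else:
            hi = mid
    new_ed = lo - 1
    return (new_st, new_ed) if new_st <= new_ed else None


T = "GACAGTTCG$"
SA = sorted(range(len(T)), key=lambda j: T[j:])  # naive suffix array
print(extend_right(T, SA, 5, 7, 1, "A"))  # extend range(T, "G") to range(T, "GA")
```

Only one character is inspected per step, so the cost is O(log n) regardless of |P|.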

  24. Lemma 2 Given the interval [st1..ed1] = range(T , P1) and the interval [st2..ed2] = range(T , P2), we can find the interval [st..ed] = range(T , P1P2) in O(logn) time using the suffix array and the inverse suffix array of T.

  25. Let [st1..ed1] = range(T , P1), [st2..ed2] = range(T , P2), [st..ed] = range(T , P1P2). • [st..ed] is a subinterval of [st1..ed1].

  26. Example: T = GACAGTTCG$ (positions 0 to 9), P1 = G, P2 = A.
     i:     0 1 2 3 4 5 6 7 8 9
     SA[i]: 9 1 3 2 7 8 0 4 6 5
     Lexicographic order: $ (T9), ACAGTTCG$ (T1), AGTTCG$ (T3), CAGTTCG$ (T2), CG$ (T7), G$ (T8), GACAGTTCG$ (T0), GTTCG$ (T4), TCG$ (T6), TTCG$ (T5).
     range(T, P1) = [5..7], so range(T, P1P2) must lie within [5..7]. How can we find the exact interval within [5..7]?

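  27. By the definition of the suffix array, the suffixes T_{SA[st1]}, T_{SA[st1+1]}, …, T_{SA[ed1]} are in increasing lexicographic order. • Since they all share the prefix P1, the suffixes T_{SA[st1]+|P1|}, T_{SA[st1+1]+|P1|}, …, T_{SA[ed1]+|P1|} are also in increasing lexicographic order.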

  28. T2 = CAGTTCG$ and T2+1 = T3 = AGTTCG$: T3 is obtained by deleting the first character of T2. In general, Ti+1 is obtained by deleting the first character of Ti. Lexicographic order: $ (T9), ACAGTTCG$ (T1), AGTTCG$ (T3), CAGTTCG$ (T2), CG$ (T7), G$ (T8), GACAGTTCG$ (T0), GTTCG$ (T4), TCG$ (T6), TTCG$ (T5).

  29. Example: T = GACAGTTCG$ (positions 0 to 9), P1 = G, P2 = A.
     i:     0 1 2 3 4 5 6 7 8 9
     SA[i]: 9 1 3 2 7 8 0 4 6 5
     Lexicographic order: $ (T9), ACAGTTCG$ (T1), AGTTCG$ (T3), CAGTTCG$ (T2), CG$ (T7), G$ (T8), GACAGTTCG$ (T0), GTTCG$ (T4), TCG$ (T6), TTCG$ (T5).
     range(T, P1) = [5..7], and the suffixes in this range satisfy T8 < T0 < T4. • Deleting the first character of each gives T8+1 = T9, T0+1 = T1 and T4+1 = T5. • These are still in increasing order: T9 < T1 < T5.

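  30. The suffixes T_{SA[st1]+|P1|}, T_{SA[st1+1]+|P1|}, …, T_{SA[ed1]+|P1|} are in increasing lexicographic order. • Thus SA⁻¹[SA[st1]+|P1|] < SA⁻¹[SA[st1+1]+|P1|] < … < SA⁻¹[SA[ed1]+|P1|]. • To find st and ed, we search [st1..ed1] (by binary search, since the values are increasing) for the smallest st such that st2 ≦ SA⁻¹[SA[st]+|P1|] ≦ ed2 and the largest ed such that st2 ≦ SA⁻¹[SA[ed]+|P1|] ≦ ed2.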

  31. Example: T = GACAGATCG$ (positions 0 to 9), P1 = G, P2 = A.
     Lexicographic order: $ (T9), ACAGATCG$ (T1), AGATCG$ (T3), ATCG$ (T5), CAGATCG$ (T2), CG$ (T7), G$ (T8), GACAGATCG$ (T0), GATCG$ (T4), TCG$ (T6).
     i:       0 1 2 3 4 5 6 7 8 9
     SA[i]:   9 1 3 5 2 7 8 0 4 6
     SA⁻¹[i]: 7 1 4 2 8 3 9 5 6 0
     range(T, P1) = [6..8] and range(T, P2) = [1..3]. Let range(T, P1P2) = [st..ed], so 6 ≦ st, ed ≦ 8. For i = 6, 7, 8 the values SA⁻¹[SA[i]+1] are 0, 1, 3, and only the last two lie in [1..3]. Hence st = 7 and ed = 8.
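
The search described for Lemma 2 can be sketched as two binary searches over [st1..ed1] on the values SA⁻¹[SA[i]+|P1|]. The function name `combine_ranges` and its argument layout are assumptions for illustration; the sketch also assumes P1 does not contain `$`, so that SA[i]+|P1| is always a valid position.

```python
def combine_ranges(SA, SA_inv, r1, r2, len1):
    """Given r1 = range(T, P1), r2 = range(T, P2) and len1 = |P1|, return range(T, P1P2) or None."""
    st1, ed1 = r1
    st2, ed2 = r2

    def rank_after(i):
        # Rank of the suffix obtained by stripping P1 from the i-th smallest suffix.
        return SA_inv[SA[i] + len1]

    lo, hi = st1, ed1 + 1
    while lo < hi:                    # smallest i with rank_after(i) >= st2
        mid = (lo + hi) // 2
        if rank_after(mid) < st2:
            lo = mid + 1
        else:
            hi = mid
    st = lo
    hi = ed1 + 1
    while lo < hi:                    # smallest i with rank_after(i) > ed2
        mid = (lo + hi) // 2
        if rank_after(mid) <= ed2:
            lo = mid + 1
        else:
            hi = mid
    ed = lo - 1
    return (st, ed) if st <= ed else None


T = "GACAGATCG$"
SA = sorted(range(len(T)), key=lambda j: T[j:])  # naive suffix array
SA_inv = [0] * len(SA)
for rank, j in enumerate(SA):
    SA_inv[j] = rank
print(combine_ranges(SA, SA_inv, (6, 8), (1, 3), 1))  # range(T, "GA")
```

No characters of T are compared here; each step uses only one SA lookup and one SA⁻¹ lookup.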

  32. To find the interval of the first character p1 of P: we construct an array C such that for any c in A, C[c] stores the total number of occurrences in T of all characters c’ ≦ c. Then, with c = p1, range(T, p1) = [C[c2]+1 … C[c]], where c2 is the character immediately before c in A (take C[c2] = 0 if c is the smallest character in A).

  33. Example: T = GACAGTTCG$ (positions 0 to 9).
     C[A] = 2, C[C] = 4, C[G] = 7, C[T] = 9.
     i:     0 1 2 3 4 5 6 7 8 9
     SA[i]: 9 1 3 2 7 8 0 4 6 5
     Lexicographic order: $ (T9), ACAGTTCG$ (T1), AGTTCG$ (T3), CAGTTCG$ (T2), CG$ (T7), G$ (T8), GACAGTTCG$ (T0), GTTCG$ (T4), TCG$ (T6), TTCG$ (T5).
     For P = GACAGCA, the first character is p1 = G, so range(T, p1) = [C[C]+1 … C[G]] = [5…7].
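
The array C and the first-character interval can be computed as in the following sketch. The names `build_C` and `first_char_range` are illustrative, and the sketch assumes, as in the example, that `$` is not counted and occupies index 0 of the suffix array.

```python
from collections import Counter


def build_C(T: str, alphabet: str) -> dict:
    """C[c] = number of occurrences in T of all alphabet characters c' <= c."""
    counts = Counter(T)
    C, total = {}, 0
    for c in sorted(alphabet):
        total += counts[c]
        C[c] = total
    return C


def first_char_range(C: dict, alphabet: str, c: str):
    """Return range(T, c) = [C[c2]+1 .. C[c]], with C[c2] taken as 0 for the smallest character."""
    order = sorted(alphabet)
    idx = order.index(c)
    st = C[order[idx - 1]] + 1 if idx > 0 else 1
    ed = C[c]
    return (st, ed) if st <= ed else None


T = "GACAGTTCG$"
C = build_C(T, "ACGT")
print(C)
print(first_char_range(C, "ACGT", "G"))
```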

  34. Lemma 3 Given the suffix array and the inverse suffix array of T, assume [st..ed] = range(T, P) and that the array C has been computed in advance. Then, for any character c, we can find the interval [st’..ed’] = range(T, cP) in O(log n) time.
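
One reading of Lemma 3 is that it is the special case of Lemma 2 in which the first pattern is the single character c, whose interval comes directly from the array C. The sketch below follows that reading; `first_true` and `extend_left` are hypothetical helpers, and the interval of c is passed in precomputed (from the C values of the example) rather than looked up.

```python
def first_true(lo: int, hi: int, pred) -> int:
    """Smallest i in [lo, hi) with pred(i) true, or hi if none; pred must be monotone."""
    while lo < hi:
        mid = (lo + hi) // 2
        if pred(mid):
            hi = mid
        else:
            lo = mid + 1
    return lo


def extend_left(SA, SA_inv, c_range, p_range):
    """Given c_range = range(T, c) and p_range = range(T, P), return range(T, cP) or None."""
    c_st, c_ed = c_range
    st, ed = p_range

    def rank_after(i):
        # Rank of the suffix that remains after removing the leading character c.
        return SA_inv[SA[i] + 1]

    new_st = first_true(c_st, c_ed + 1, lambda i: rank_after(i) >= st)
    new_ed = first_true(new_st, c_ed + 1, lambda i: rank_after(i) > ed) - 1
    return (new_st, new_ed) if new_st <= new_ed else None


T = "GACAGTTCG$"
SA = sorted(range(len(T)), key=lambda j: T[j:])  # naive suffix array
SA_inv = [0] * len(SA)
for rank, j in enumerate(SA):
    SA_inv[j] = rank
# range(T, "G") = [C[C]+1 .. C[G]] = [5..7]; range(T, "A") = [1..2].
print(extend_left(SA, SA_inv, (5, 7), (1, 2)))  # range(T, "GA")
```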

  35. Algorithm:
     I. Construct Fst[1..m+1] and Fed[1..m+1] such that [Fst[i]..Fed[i]] = range(T, P[i..m]).
     II. Call kapproximate([0..n], 1, 0, ε, ε).
     kapproximate([s’..e’], i, k’, PL’, Υ)
     begin
       1. Given [Fst[i]..Fed[i]] = range(T, P[i..m]) and [s’..e’] = range(T, PL’), by Lemma 2 find [st..ed] = range(T, PL’P[i..m]).
       2. Report occurrences of P∗ = PL’P[i..m] in [st..ed] if the interval exists.
       3. If (k’ = k) return.
       4. For j := i to m+1
          (a) (when j ≦ m, deletion at j) Call kapproximate([s’..e’], j+1, k’+1, PL’, dΥ).
          (b) (when j ≦ m, replacement at j) For each c in A:
              i. Given [s’..e’] = range(T, PL’), by Lemma 1 find [s’’..e’’] = range(T, PL’c).
              ii. Call kapproximate([s’’..e’’], j+1, k’+1, PL’c, rΥ).
          (c) (insertion at j) For each c in A:
              i. Given [s’..e’] = range(T, PL’), by Lemma 1 find [s’’..e’’] = range(T, PL’c).
              ii. Call kapproximate([s’’..e’’], j, k’+1, PL’c, iΥ).
          (d) (when j ≦ m) Given [s’..e’] = range(T, PL’), by Lemma 1 find [s’’..e’’] = range(T, PL’P[j]). s’ := s’’; e’ := e’’; PL’ := PL’P[j]; Υ := uΥ;
     end
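
The recursion above can be traced with a simplified Python sketch. It follows the same enumeration (deletion, replacement, insertion, no operation at each position j) but, as a simplification, recomputes range(T, P∗) by a direct binary search instead of maintaining intervals through Lemmas 1 and 2, and it omits the trace Υ. The names `sa_range` and `k_approximate` are illustrative, T is assumed not to contain `$`, and k < m is assumed so that no edited pattern is empty.

```python
def sa_range(text, SA, P):
    """(st, ed) = range(text, P) by binary search on |P|-character prefixes, or None."""
    m = len(P)
    lo, hi = 0, len(SA)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[SA[mid]:SA[mid] + m] < P:
            lo = mid + 1
        else:
            hi = mid
    st, hi = lo, len(SA)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[SA[mid]:SA[mid] + m] <= P:
            lo = mid + 1
        else:
            hi = mid
    return (st, lo - 1) if st <= lo - 1 else None


def k_approximate(T, P, k, alphabet):
    """1-based end positions i such that some suffix S of T(1, i) has d(S, P) <= k."""
    m = len(P)
    if not 0 <= k < m:
        raise ValueError("this sketch assumes 0 <= k < len(P)")
    text = T + "$"
    SA = sorted(range(len(text)), key=lambda j: text[j:])  # naive suffix array
    ends = set()

    def recurse(i, used, left):
        # left plays the role of PL'; P[i:] is the untouched suffix (0-based i).
        pattern = left + P[i:]
        r = sa_range(text, SA, pattern)
        if r is not None:                       # report occurrences of P*
            for idx in range(r[0], r[1] + 1):
                ends.add(SA[idx] + len(pattern))
        if used == k:
            return
        for j in range(i, m + 1):
            if j < m:
                recurse(j + 1, used + 1, left)          # deletion at j
                for c in alphabet:
                    recurse(j + 1, used + 1, left + c)  # replacement at j
            for c in alphabet:
                recurse(j, used + 1, left + c)          # insertion at j
            if j < m:
                left += P[j]                            # no operation at j

    recurse(0, 0, "")
    return sorted(ends)


print(k_approximate("abbaaa", "aba", 1, "ab"))
```

Because the same edited pattern can be reached along several edit sequences, the end positions are collected in a set to avoid duplicate reports.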

  36. After an O(n)-time preprocessing of the text T into an O(n log n)-bit data structure, the algorithm solves the k-difference problem in O(|A|^k m^k log n + output time) time.

  37. References
     [1] A. Amir, D. Keselman, G.M. Landau, M. Lewenstein, N. Lewenstein, M. Rodeh, Indexing and dictionary matching with one error, in: Proc. Sixth WADS, Lecture Notes in Computer Science, vol. 1663, Springer, Berlin, 1999, pp. 181–192.
     [2] A. Amir, M. Lewenstein, E. Porat, Faster algorithms for string matching with k mismatches, in: Proc. 11th Ann. ACM-SIAM Symp. on Discrete Algorithms, 2000, pp. 794–803.
     [3] R.A. Baeza-Yates, G. Navarro, A faster algorithm for approximate string matching, in: Proc. Seventh Ann. Symp. on Combinatorial Pattern Matching (CPM’96), pp. 1–23.
     [4] R.A. Baeza-Yates, G. Navarro, A practical index for text retrieval allowing errors, in: CLEI, vol. 1, November 1997, pp. 273–282.
     [5] R. Boyer, S. Moore, A fast string matching algorithm, CACM 20 (1977) 762–772.
     [6] A.L. Buchsbaum, M.T. Goodrich, J. Westbrook, Range searching over tree cross products, in: ESA 2000, pp. 120–131.
     [7] A. Cobbs, Fast approximate matching using suffix trees, in: Proc. Sixth Ann. Symp. on Combinatorial Pattern Matching (CPM’95), Lecture Notes in Computer Science, vol. 807, Springer, Berlin, 1995, pp. 41–54.
     [8] R. Cole, L.A. Gottlieb, M. Lewenstein, Dictionary matching and indexing with errors and don’t cares, in: Proc. 36th Ann. ACM Symp. on Theory of Computing, 2004, pp. 91–100.
     [9] P. Ferragina, G. Manzini, Opportunistic data structures with applications, in: Proc. 41st IEEE Symp. on Foundations of Computer Science (FOCS’00), 2000, pp. 390–398.

  38. References (continued)
     [10] G. Gonnet, A tutorial introduction to computational biochemistry using Darwin, Technical Report, Informatik E.T.H., Zurich, Switzerland, 1992.
     [11] R. Grossi, J.S. Vitter, Compressed suffix arrays and suffix trees with applications to text indexing and string matching, in: Proc. 32nd ACM Symp. on Theory of Computing, 2000, pp. 397–406.
     [12] D. Gusfield, Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology, Cambridge University Press, Cambridge, 1997.
     [13] W.K. Hon, K. Sadakane, W.K. Sung, Breaking a time-and-space barrier in constructing full-text indices, in: Proc. IEEE Symp. on Foundations of Computer Science, 2003.
     [14] P. Jokinen, E. Ukkonen, Two algorithms for approximate string matching in static texts, in: Proc. MFCS’91, Lecture Notes in Computer Science, vol. 520, Springer, Berlin, 1991, pp. 240–248.
     [15] D.K. Kim, J.S. Sim, H. Park, K. Park, Linear-time construction of suffix arrays, in: CPM 2003, pp. 186–199.
     [16] D.E. Knuth, J. Morris, V. Pratt, Fast pattern matching in strings, SIAM J. Comput. 6 (1977) 323–350.
     [17] P. Ko, S. Aluru, Space efficient linear time construction of suffix arrays, in: CPM 2003, pp. 200–210.
     [18] G.M. Landau, U. Vishkin, Fast parallel and serial approximate string matching, J. Algorithms 10 (1989) 157–169.
     [19] U. Manber, G. Myers, Suffix arrays: a new method for on-line string searches, SIAM J. Comput. 22 (5) (1993) 935–948.

  39. References (continued)
     [20] E.M. McCreight, A space economical suffix tree construction algorithm, J. ACM 23 (2) (1976) 262–272.
     [21] G. Navarro, A guided tour to approximate string matching, ACM Comput. Surveys 33 (1) (2001) 31–88.
     [22] G. Navarro, R.A. Baeza-Yates, A new indexing method for approximate string matching, in: Proc. 10th Ann. Symp. on Combinatorial Pattern Matching (CPM’99), pp. 163–185.
     [23] G. Navarro, R.A. Baeza-Yates, A hybrid indexing method for approximate string matching, J. Discrete Algorithms 1 (1) (2000) 205–239.
     [24] G. Navarro, R. Baeza-Yates, E. Sutinen, J. Tarhio, Indexing methods for approximate string matching, IEEE Data Eng. Bull. 24 (4) (2001) 19–27.
     [25] G. Navarro, E. Sutinen, J. Tanninen, J. Tarhio, Indexing text with approximate q-grams, in: Proc. 11th Ann. Symp. on Combinatorial Pattern Matching, Lecture Notes in Computer Science, vol. 1848, Springer, Berlin, 2000.
     [26] K. Sadakane, T. Shibuya, Indexing huge genome sequences for solving various problems, Genome Informatics 12 (2001) 175–183.
     [27] F. Shi, Fast approximate string matching with q-blocks sequences, in: Proc. Third South American Workshop on String Processing (WSP’96), Carleton University Press, 1996.
     [28] E. Sutinen, J. Tarhio, Filtration with q-samples in approximate string matching, in: Proc. Seventh Ann. Symp. on Combinatorial Pattern Matching (CPM’96), pp. 50–63.
     [29] E. Ukkonen, Approximate matching over suffix trees, in: Proc. Combinatorial Pattern Matching 1993, vol. 4, Springer, Berlin, June 1993, pp. 228–242.
     [30] R.A. Wagner, M.J. Fischer, The string-to-string correction problem, J. ACM 21 (1974) 168–173.

  40. Thank you!
