Advisor: Prof. R. C. T. Lee Reporter: C. C. Yen

The MaxSuffix-Matching Algorithm On maximal suffixes and constant-space versions of KMPalgorithm LATIN 2002: Theoretical Informatics : 5th Latin American Symposium, Cancun, Mexico, April 3-6, 2002. Proceedings. Rytter, W. Advisor: Prof. R. C. T. Lee Reporter: C. C. Yen. Maximal Suffix.

Presentation Transcript


  1. The MaxSuffix-Matching Algorithm • On maximal suffixes and constant-space versions of KMP algorithm • LATIN 2002: Theoretical Informatics: 5th Latin American Symposium, Cancun, Mexico, April 3-6, 2002. Proceedings. • Rytter, W. • Advisor: Prof. R. C. T. Lee • Reporter: C. C. Yen

  2. Maximal Suffix • A maximal suffix of a string is the suffix which is lexicographically maximal among all suffixes of the string. • The maximal suffix of a string w is denoted by MaxSuf(w). • Ex: Consider the string w = abaaba. • The set of its suffixes: {a, ba, aba, aaba, baaba, abaaba} • The set of its sorted suffixes: {a, aaba, aba, abaaba, ba, baaba} • Thus MaxSuf(w) = baaba.
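The definition can be checked directly with a brute-force sketch in Python. The function name `max_suffix` is only illustrative, and enumerating all suffixes is chosen for clarity rather than efficiency.

```python
def max_suffix(w: str) -> str:
    """Return the lexicographically maximal suffix of a non-empty string w."""
    # Python compares strings lexicographically; a proper prefix is smaller
    # than the longer string, which matches the ordering used on the slide.
    return max(w[i:] for i in range(len(w)))


w = "abaaba"
print(sorted(w[i:] for i in range(len(w))))  # the sorted suffixes of w
print(max_suffix(w))                         # the last one in that order
```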

  3. Self-Maximal String • A string w is said to be self-maximal if MaxSuf(w) = w. • Ex: Consider the strings w = abaaba and x = baaba. • MaxSuf(w) = baaba ≠ w. • MaxSuf(x) = baaba = x. • Hence x is a self-maximal string but w is not.

  4. Important Properties of Self-Maximal Strings • By definition, we have the following observation about self-maximal strings: • For a self-maximal string P, suppose a prefix P1,P2,…,Pi of P is equal to a substring Pk,Pk+1,…,Pk+i-1 of P; then Pi+1 >= Pk+i. • [Figure: P with its prefix u followed by x and a later occurrence of u followed by y; if x ≠ y, then x > y.]

  5. Example: TCATBTCATA is a self-maximal string. • But TBATATBATB is not a self-maximal string, because the B after the later occurrence of TBAT is lexicographically larger than the A after the prefix TBAT.
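A minimal sketch for testing self-maximality, assuming the plain definition MaxSuf(w) = w. The helper name `is_self_maximal` is hypothetical, and the check is deliberately brute force.

```python
def is_self_maximal(w: str) -> bool:
    """True if no proper suffix of w is lexicographically larger than w."""
    return all(w >= w[i:] for i in range(1, len(w)))


for word in ("TCATBTCATA", "TBATATBATB", "baaba", "abaaba"):
    print(word, is_self_maximal(word))

# For TBATATBATB the suffix TBATB beats the whole string at the fifth
# character (B > A), which is the situation described on the slide.
```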

  6. The Period of a String • A period of a string w is an integer p, 1 ≤ p ≤ |w|, such that w[i] = w[i + p] for all 1 ≤ i ≤ |w| - p. • Ex: Consider the string w = bbabbabbabba. bbabbabbabba → period = 3 and period = 6. abcdefg → period = word length = 7. abcdeab → period = 5. • We define period(w) as the smallest period of w. If w = bbabbabbabba, period(w) is 3.
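As a sketch of this definition, the following Python lists every period of a string by testing w[i] = w[i + p] directly (with 0-based indices). The names `periods` and `period` are illustrative only.

```python
def periods(w: str) -> list[int]:
    """All p in 1..len(w) such that w[i] == w[i + p] wherever both exist."""
    n = len(w)
    return [p for p in range(1, n + 1)
            if all(w[i] == w[i + p] for i in range(n - p))]


def period(w: str) -> int:
    """The smallest period of a non-empty string w."""
    return periods(w)[0]


for word in ("bbabbabbabba", "abcdefg", "abcdeab"):
    print(word, periods(word), period(word))
```

Under this definition the word length is always a period, so the list is never empty for a non-empty string.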

  7. Given a string P, we are actually interested in the period of every prefix. Note that the period of the prefix of length i is i - prefix(i), where prefix is the prefix function of the MP algorithm; it is the number of steps by which we can move the pattern. (The index starts from 1 in this case.)
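The relation between the two functions can be sketched as follows. This is a standard textbook computation of the MP prefix (border) table, not code from the slides, and the names `prefix_function` and `prefix_periods` are assumptions made for this example.

```python
def prefix_function(p: str) -> list[int]:
    """border[i] = length of the longest proper border of p[:i], for i = 0..len(p)."""
    border = [0] * (len(p) + 1)
    k = 0
    for i in range(1, len(p)):
        # Follow the table backwards until p[i] extends a border (or k == 0).
        while k > 0 and p[i] != p[k]:
            k = border[k]
        if p[i] == p[k]:
            k += 1
        border[i + 1] = k
    return border


def prefix_periods(p: str) -> list[int]:
    """Period of every prefix: entry i-1 holds i - prefix(i) for i = 1..len(p)."""
    border = prefix_function(p)
    return [i - border[i] for i in range(1, len(p) + 1)]


print(prefix_periods("bbabbabbab"))
print(prefix_periods("abaab"))
```

The `while` loop is the "looking back" through the table that the following slides aim to avoid.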

  8. Why are we interested in the period function? • If the period function carries the same information as the prefix function of the MP algorithm, why are we interested in it? • To calculate the prefix function, we must use pointers which point back to characters far behind the current position. • In the following, we shall introduce a naïve period function which never looks back.

  9. Naive-Period Function • Function Naive-Period can be used to compute the period of a string if this string is self-maximal. • For a general string, the Naive-Period function does not work. This is why the algorithm applies it only to self-maximal strings.

  10. Algorithm of Naive-Period Function
      Function Naive-Period(j); { computes the periods of the prefixes of a self-maximal pat }
        period(1) := 1;
        for i := 2 to j do
          if pat[i] ≠ pat[i - period(i - 1)] then period(i) := i
          else period(i) := period(i - 1);
        return period;

  12. An Example of Naive-Period Function • Consider the string w = bbabbabbab. • w is a self-maximal string and period(w) = 3.
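The step-by-step animation of this example did not survive in the transcript, so here is a small Python sketch that reproduces the trace. It assumes the comparison used by the Naive-Period rule is pat[i] against pat[i - period(i - 1)], as in the matching algorithm later in the deck; the name `naive_periods` is illustrative.

```python
def naive_periods(pat: str) -> list[int]:
    """Periods of all prefixes of pat, valid when pat is self-maximal."""
    if not pat:
        return []
    result = [1]                      # period(1) = 1
    for i in range(1, len(pat)):      # 0-based i is position i + 1 on the slides
        prev = result[-1]
        # Keep the old period if the new character continues it,
        # otherwise the prefix has no shorter period than its own length.
        result.append(prev if pat[i] == pat[i - prev] else i + 1)
    return result


w = "bbabbabbab"
for length, p in enumerate(naive_periods(w), start=1):
    print(w[:length].ljust(len(w)), "period =", p)
```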

  20. Why does the naïve period function work for self-maximal strings? • Given any pattern P, let k be the length of the longest proper suffix of P[1, i-1] equal to a prefix P[1, k] of P[1, i-1]. • Let k’ be the length of the longest proper suffix of P[1, i] equal to a prefix P[1, k’] of P[1, i]. • For any i, we consider the possibilities listed on the next slide. • [Figure: P[1, i-1] with its prefix and suffix of length k marked; P[1, i] with its prefix and suffix of length k’ marked.]

  21. • 1. k ≠ 0 and P[k + 1] = P[i]: Period(i) = Period(i - 1) • 2. k ≠ 0, P[k + 1] ≠ P[i] and k’ ≠ 0: Period(i) = i – k’ • 3. k ≠ 0, P[k + 1] ≠ P[i] and k’ = 0: Period(i) = i • 4. k = 0 and k’ ≠ 0: Period(i) = i – k’ • 5. k = 0 and k’ = 0: Period(i) = i

  22. 1. k ≠ 0 and P[k + 1] = P[i]: Period(i) = Period(i - 1) • For i = 8, the substring “abc” of length 3 (k = 3) is the longest suffix of P(1, 7) which is equal to a prefix of P(1, 7), and P(8) = P(4). • ⇒ period(8) = period(7) = 4.

  23. 2. k ≠ 0, P[k + 1] ≠ P[i] and k’ ≠ 0: Period(i) = i – k’ • For i = 9, the substring “abca” of length 4 (k = 4) is the longest suffix of P(1, 8) which is equal to a prefix of P(1, 8), and P(9) ≠ P(5). • There is a suffix of P(1, 9) which is equal to a prefix of P(1, 9): P(1, 2) = ab, of length 2 (k’ = 2). • ⇒ period(9) = i - |P(1, 2)| = 9 - 2 = 7.

  24. 3. k ≠ 0, P[k + 1] ≠ P[i] and k’ = 0: Period(i) = i • For i = 9, the substring “abcc” of length 4 (k = 4) is the longest suffix of P(1, 8) which is equal to a prefix of P(1, 8), and P(9) ≠ P(5). • There is no suffix of P(1, 9) which is equal to a prefix of P(1, 9) (k’ = 0). • ⇒ period(9) = i = 9.

  25. 4. k = 0 and k’ ≠ 0: Period(i) = i – k’ • For i = 9, there is no suffix of P(1, 8) which is equal to a prefix of P(1, 8) (k = 0). • The substring “a” of length 1 (k’ = 1) is a suffix of P(1, 9) which is equal to a prefix of P(1, 9): P(1, 1) = a. • ⇒ period(9) = i - |P(1, 1)| = 9 - 1 = 8.

  26. 5. k = 0 and k’ = 0: Period(i) = i • For i = 9, there is no suffix of P(1, 8) which is equal to a prefix of P(1, 8) (k = 0). • There is no suffix of P(1, 9) which is equal to a prefix of P(1, 9) (k’ = 0). • ⇒ period(9) = i = 9.

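  27. But for self-maximal strings, condition 2 (k ≠ 0, P[k + 1] ≠ P[i] and k’ ≠ 0) cannot occur, and condition 4 (k = 0 and k’ ≠ 0) reduces to Period(i) = Period(i - 1). Why? In both conditions k’ ≠ 0, so P(1, i) has a suffix which is equal to a prefix. Consider the longest such suffix.

  28. 2. k ≠ 0, P[k + 1] ≠ P[i] and k’ ≠ 0 • Let u be the longest proper suffix of P(1, i - 1) equal to a prefix of P(1, i - 1), so |u| = k, and let j = k + 1, x = P[j] and y = P[i]. • Suppose that P is self-maximal. The prefix u is followed by x and the suffix u is followed by y; since x ≠ y, we have x > y. • Since k’ ≠ 0, there is a string v·y which is the longest suffix of P(1, i) equal to a prefix of P(1, i). • [Figure: P with the prefix u followed by x at position j, the suffix u followed by y at position i, and v·y marked both as a prefix and as a suffix of P(1, i).]

  29. Since v is shorter than u and both are suffixes of P(1, i - 1), v is a suffix of u, so v also occurs at the end of the prefix u, where it is followed by x. • [Figure: P = v y … v x … v y, where the first "v x" ends the prefix u and the last "v y" ends P(1, i).] • Since P is a self-maximal string, comparing the prefix v (followed by y) with this occurrence of v (followed by x), we may conclude that y > x. Contradiction! • k ≠ 0, P[k + 1] ≠ P[i] and k’ ≠ 0 cannot hold for self-maximal strings.

  30. For condition 4 (k = 0 and k’ ≠ 0), k’ ≤ k + 1 forces k’ = 1, so Period(i) = i - 1; since k = 0 means Period(i - 1) = i - 1, we get Period(i) = Period(i - 1). (Example: P = bab, i = 3.) • Thus we have the following: for self-maximal strings, Period(i) = Period(i - 1) or Period(i) = i. • That is, the naïve period function works for self-maximal strings.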
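The claim that the naive rule yields the true period on self-maximal strings can be spot-checked by brute force. The sketch below is only an experiment over a small alphabet, not a proof; all function names are hypothetical. It also shows a string that is not self-maximal on which the naive rule gives a wrong answer.

```python
from itertools import product


def is_self_maximal(w: str) -> bool:
    return all(w >= w[i:] for i in range(1, len(w)))


def true_period(w: str) -> int:
    """Smallest period of w, straight from the definition."""
    n = len(w)
    return next(p for p in range(1, n + 1)
                if all(w[i] == w[i + p] for i in range(n - p)))


def naive_period(w: str) -> int:
    """Naive rule with a single variable: keep the period or reset it to i."""
    period = 1
    for i in range(1, len(w)):
        if w[i] != w[i - period]:
            period = i + 1
    return period


def counterexamples(alphabet: str, max_len: int) -> list[str]:
    """Self-maximal strings on which the naive rule disagrees with the definition."""
    bad = []
    for n in range(1, max_len + 1):
        for letters in product(alphabet, repeat=n):
            w = "".join(letters)
            if is_self_maximal(w) and naive_period(w) != true_period(w):
                bad.append(w)
    return bad


print(counterexamples("abc", 7))                    # expected to be empty
print(naive_period("abaab"), true_period("abaab"))  # abaab is not self-maximal
```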

  31. What is the advantage of the naïve-period function? • It runs in linear time, and we never need to look back at characters far behind the current position, as we must when calculating the prefix function in the MP algorithm.

  32. For a string which is not self-maximal, we use the following algorithm, called the MaxSuffix-Matching Algorithm.

  33. MaxSuffix-Matching Algorithm • First, we decompose the pattern string P into u · v, where v = MaxSuf(P) and u is the remaining prefix of P. • Note that v occurs only once in the string P, and this is a very important property. • Property 1: No non-empty suffix of u is equal to a prefix of v; this also follows from the maximality of v. • Example: P = dababdadad, MaxSuf(P) = dadad, P = u · v = dabab · dadad
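A brute-force sketch of the decomposition, together with checks of the two properties stated on this slide. The helper names are assumptions for the example, and the decomposition here is quadratic; computing it in constant extra space is a separate matter.

```python
def decompose(p: str) -> tuple[str, str]:
    """Split non-empty p into (u, v) with v = MaxSuf(p)."""
    start = max(range(len(p)), key=lambda i: p[i:])
    return p[:start], p[start:]


def occurrences(p: str, v: str) -> int:
    """Number of (possibly overlapping) occurrences of v in p."""
    return sum(p.startswith(v, i) for i in range(len(p) - len(v) + 1))


def property_1_holds(u: str, v: str) -> bool:
    """No non-empty suffix of u is a prefix of v."""
    return all(not v.startswith(u[i:]) for i in range(len(u)))


P = "dababdadad"
u, v = decompose(P)
print(u, v)                    # the two parts of P
print(occurrences(P, v))       # how often v occurs in P
print(property_1_holds(u, v))  # Property 1 for this decomposition
```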

  34. MaxSuffix-Matching Algorithm • If v is found in T, we next test whether the part u of P occurs immediately to the left of v, using a naive test. • Assume i is the location of an occurrence of v in T, and let prev denote the location of the previous occurrence of v. Because v occurs only once in P, u needs to be tested only if i - prev > |u|. • [Figure: Text with positions prev and i marked.]

  35. MaxSuffix-Matching Algorithm
      Algorithm MaxSuffix-Matching
        i := 0; j := 0; period := 1; prev := 0;
        while i ≤ n - |v| do
        begin
          while j < |v| and v[j + 1] = T[i + j + 1] do
          begin
            j := j + 1;
            { Naive-Period Function }
            if j > period and v[j] ≠ v[j - period] then period := j
          end;
          { MATCH OF v }
          if j = |v| then
          begin
            { test u by using any algorithm }
            if i - prev > |u| and u = T[i - |u| + 1 … i] then report match at i - |u|;
            prev := i
          end;
          i := i + period;
          if j ≥ 2 · period then j := j - period
          else begin j := 0; period := 1 end
        end;
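A runnable Python sketch of this pseudocode, under several assumptions that are not in the slides: indices are 0-based, the decomposition P = u·v is computed by brute force (so only the search loop is constant-space), and `prev` starts at -1 so that an occurrence of P at the very beginning of the text is not skipped. The function name is hypothetical.

```python
def max_suffix_matching(text: str, pattern: str) -> list[int]:
    """Return the 0-based start positions of pattern in text."""
    if not pattern:
        raise ValueError("pattern must be non-empty")
    # Assumption: brute-force decomposition pattern = u + v, v = MaxSuf(pattern).
    split = max(range(len(pattern)), key=lambda s: pattern[s:])
    u, v = pattern[:split], pattern[split:]

    matches = []
    i, j, period = 0, 0, 1
    prev = -1  # assumption: -1 lets a match at text position 0 pass the distance test
    while i <= len(text) - len(v):
        # Extend the match of v, maintaining the period of v[:j] with the naive rule.
        while j < len(v) and v[j] == text[i + j]:
            j += 1
            if j > period and v[j - 1] != v[j - 1 - period]:
                period = j
        if j == len(v):  # v occurs at text[i:i + len(v)]
            # u is tested naively, and only if the previous v is far enough away.
            if i - prev > len(u) and text[i - len(u):i] == u:
                matches.append(i - len(u))
            prev = i
        i += period
        if j >= 2 * period:
            j -= period          # the remaining matched part keeps the same period
        else:
            j, period = 0, 1     # too short to reuse: restart the comparison
    return matches


print(max_suffix_matching("adadaddadabababadada", "abababadada"))
print(max_suffix_matching("aaaa", "aa"))
```

The first call uses the text and pattern of the example slides that follow; the second shows that overlapping occurrences are reported.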

  36. Example • Text = adadaddadabababadada • P = u·v = abababa · dada • Case 1 • If i < |u|, then there is no occurrence of u·v here, because u does not fit before position i at the beginning of the text. • [Figure: Text with positions 1 to 19 numbered and position i marked.]

  37. Example • Text = adadaddadabababadada • P = u·v = abababa · dada • Case 2 • If i – prev ≤ |u|, then there is no occurrence of u·v at position i - |u|. This is because the maximal suffix v of P can start at only one position in P. • i = 7, |u| = 7, prev = 2 • [Figure: Text with positions 1 to 19 numbered.]

  38. Example • Text = adadaddadabababadada • P = u·v = abababa · dada • So, in this example we only need to check whether u occurs to the left of the third occurrence of v. • [Figure: Text with positions 1 to 19 numbered and the first, second and third occurrences of v marked.]
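The three occurrences of v in this example, and what happens at each of them, can be listed with a short sketch. It uses 1-based start positions to match the numbers on the slides (i = 7, prev = 2); the function name and the wording of the verdicts are illustrative.

```python
def classify_occurrences(text: str, u: str, v: str) -> list[tuple[int, str]]:
    """For each 1-based start s of v in text, say how that occurrence is handled."""
    rows, prev = [], None
    for s in range(1, len(text) - len(v) + 2):
        if not text.startswith(v, s - 1):
            continue
        if prev is not None and s - prev <= len(u):
            verdict = "case 2: too close to the previous v, u is not tested"
        elif s <= len(u):
            verdict = "case 1: no room for u before v"
        elif text[s - 1 - len(u):s - 1] == u:
            verdict = "u matches, P occurs at position %d" % (s - len(u))
        else:
            verdict = "u tested, no match"
        rows.append((s, verdict))
        prev = s
    return rows


for s, verdict in classify_occurrences("adadaddadabababadada", "abababa", "dada"):
    print(s, verdict)
```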

  39. Time Complexity and Space Complexity • Hence, the MaxSuffix-Matching Algorithm finds all occurrences of a pattern in O(1) extra space (the variables i, j, period and prev) and linear time.

  40. Reference • M. Crochemore, String-matching on ordered alphabets, Theoretical Computer Science 92 (1) (1992) 33-47. • M. Crochemore, D. Perrin, Two-way string-matching, Journal of the ACM 38 (3) (1991) 650-674. • M. Crochemore, W. Rytter, Text Algorithms, Oxford University Press, New York, NY, 1994. • M. Crochemore, W. Rytter, Cubes, squares and time space efficient string matching, Algorithmica 13 (5) (1995) 405-425. • J.-P. Duval, Factorizing words over an ordered alphabet, J. Algorithms 4 (1983) 363-381.

  41. Reference • Z. Galil, J. Seiferas, Time-space-optimal string matching, J. Comput. System Sci. 26 (1983) 280-294. • L. Gasieniec, W. Plandowski, W. Rytter, Constant-space string matching with smaller number of comparisons: sequential sampling, in: Z. Galil, E. Ukkonen (Eds.), Combinatorial Pattern Matching, 6th Annual Symposium, CPM 95, Lecture Notes in Computer Science, Vol. 937, Springer, Berlin, 1995, pp. 78-89. • L. Gasieniec, W. Plandowski, W. Rytter, The zooming method: a recursive approach to time-space efficient string-matching, Theoretical Computer Science 147 (1-2) (1995) 19-30. • D.E. Knuth, J.H. Morris, V.R. Pratt, Fast pattern matching in strings, SIAM J. Comput. 6 (1977) 322-350. • M. Lothaire, Combinatorics on Words, Addison-Wesley, Reading, MA, USA, 1983.

  42. Two Way Algorithm Two-way string-matching Journal of the ACM 38(3):651-675, 1991 Crochemore M., Perrin D. Advisor: Prof. R. C. T. Lee Speaker: C. C. Yen

  43. In 2003, Rytter proposed a constant-space and linear-time string matching algorithm. • To achieve constant space, this algorithm avoids the preprocessed function table of the KMP algorithm. • Before introducing this algorithm, we shall define some characteristics of strings.

  44. The Property of Maximal Suffix • Consider a string P. Let P = uv where v = MaxSuf(P). The property of the maximal suffix of a string is: if u is non-empty, no non-empty suffix of u is equal to a prefix of v. • Example: Consider the pattern P = ababadada. Let P = uv = ababa · dada. No suffix of u is equal to a prefix of v.

  45. Short Maximal Suffix • If the maximal suffix v of a string x satisfies |v| ≤ |x|/2, we say that this maximal suffix of x is a short maximal suffix of x. • Example: Consider the string x = abcdda. dda is the maximal suffix of x and |dda| = 3 ≤ |x|/2 = 3. Hence we say that dda is a short maximal suffix of x.
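A small sketch of this test. The condition on this slide was lost in extraction, so the code assumes it is |v| ≤ |x|/2, which the slide's example abcdda satisfies with equality; the function name is hypothetical.

```python
def has_short_maximal_suffix(x: str) -> bool:
    """Assumed condition: len(MaxSuf(x)) <= len(x) / 2, for non-empty x."""
    v = max(x[i:] for i in range(len(x)))
    return 2 * len(v) <= len(x)   # integer form of len(v) <= len(x) / 2


for x in ("abcdda", "ababadada", "baaba"):
    print(x, has_short_maximal_suffix(x))
```

A self-maximal string such as baaba never passes this test, because its maximal suffix is the whole string.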

  46. Short Prefixes Lemma • Let P = uv be the decomposition of P, where v is the maximal suffix of P and v is also a short maximal suffix. Suppose that we start to match v with T at position i, a part of v of length j is matched, and a mismatch occurs at the (j + 1)-th position of v. Then we can shift P safely by j + 1 positions without missing any occurrence of P in T. • [Figure: T with positions i and i + j + 1 marked and the mismatch shown; P = u v before the shift, with j matched characters of v, and after the shift.]

  47. Why do we have to use a short maximal suffix? • Suppose v’ is very long; then moving the pattern in this way is incorrect. • [Figure: T and P = u v aligned at position i, with the long part v’ and the shift by j + 1 marked.]

  48. In the following, we introduce the basic rules of the Two-Way matching algorithm for pattern strings with a short maximal suffix. The basic rules are given in the next slide.

  49. Basic rules of the Two-Way algorithm with a short maximal suffix • 1. Let P = uv be the decomposition of P, where v is the maximal suffix of P and v is also a short maximal suffix. • 2. We then find where v appears in T from left to right. Assume the comparison starts at position i. When a mismatch occurs at v[j + 1], we shift v by j + 1 characters and start the next comparison at v[1] with T[i + j + 1]. • 3. When v has been found in T, we scan u from right to left. If a mismatch occurs when scanning u, we shift P by Period(P). • 4. If we find both v and u in T, we report an occurrence of P in T. We then shift P by Period(P).

  50. Full Example • T = adadadaddadababadada • P = u·v = ababa · dada
