Understanding Maximal Suffixes and Naive-Period Function for Pattern Matching

Learn about self-maximal strings, suffix properties, period computation, and Naive-Period function for efficient pattern matching algorithms.

  1. The MaxSuffix-Matching Algorithm • Paper: W. Rytter, "On maximal suffixes and constant-space versions of KMP algorithm", LATIN 2002: Theoretical Informatics, 5th Latin American Symposium, Cancun, Mexico, April 3-6, 2002, Proceedings. • Advisor: Prof. R. C. T. Lee • Reporter: L. Y. Huang

  2. Maximal Suffix • The maximal suffix of a string is the suffix that is lexicographically largest among all suffixes of the string. • The maximal suffix of a string w is denoted by MaxSuf(w). • Ex: Consider the string w = abaaba. The set of its suffixes: {a, ba, aba, aaba, baaba, abaaba}. The suffixes in sorted order: {a, aaba, aba, abaaba, ba, baaba}. • Thus MaxSuf(w) = baaba.
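
The definition can be checked directly with a brute-force sketch. The helper names `suffixes` and `max_suf` are chosen here for illustration; the sketch sorts whole suffixes, so it is neither linear-time nor constant-space.

```python
def suffixes(w):
    """All non-empty suffixes of w, shortest first."""
    return [w[i:] for i in range(len(w) - 1, -1, -1)]


def max_suf(w):
    """Lexicographically largest suffix of w (brute force, for illustration)."""
    return max(suffixes(w))


w = "abaaba"
print(sorted(suffixes(w)))  # ['a', 'aaba', 'aba', 'abaaba', 'ba', 'baaba']
print(max_suf(w))           # baaba
```

Python compares strings lexicographically and treats a proper prefix as smaller than the longer string, which is the order used on this slide.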

  3. Self-Maximal String • A string w is said to be self-maximal if MaxSuf(w) = w. • Ex: Consider the strings w = abaaba and x = baaba. • MaxSuf(w) = baaba. • MaxSuf(x) = baaba. • Hence x is a self-maximal string but w is not.
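
A minimal sketch of the self-maximality test, assuming the hypothetical helper name `is_self_maximal`: a string is self-maximal exactly when none of its proper suffixes is lexicographically larger than the string itself.

```python
def is_self_maximal(w):
    """True if w equals its own maximal suffix (brute force)."""
    # A proper suffix can never equal w, so a strict comparison is enough.
    return all(w[i:] < w for i in range(1, len(w)))


print(is_self_maximal("baaba"))   # True
print(is_self_maximal("abaaba"))  # False
```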

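  4. Important Properties of Self-Maximal Strings • By definition, we have the following observation about self-maximal strings: • For a self-maximal string P, suppose a prefix P1, P2, …, Pi of P is equal to a substring Pk, Pk+1, …, Pk+i-1 of P; then Pi+1 ≥ Pk+i. • In other words, if the prefix u of P is followed by the character x and another occurrence of u in P is followed by a different character y, then x > y.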

  5. Example: TCATBTCATA is a self-maximal string. • But TBATATBATB is not a self-maximal string, because the B that follows the second occurrence of TBAT is lexicographically larger than the A that follows the prefix TBAT.
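
The property can be turned into a small checker. This is a sketch under the assumption that the hypothetical function `find_violation` reports the first place where a repeated prefix is followed by a larger character than the prefix itself; positions are 0-based.

```python
def find_violation(p):
    """Return (i, k) if the prefix p[:i] reoccurs at offset k and is followed
    there by a larger character than p[i]; return None if no such place exists."""
    n = len(p)
    for k in range(1, n):            # candidate start of a later occurrence
        i = 0
        while k + i < n and p[i] == p[k + i]:
            i += 1                   # extend the common part p[:i] == p[k:k+i]
        if k + i < n and p[k + i] > p[i]:
            return i, k              # the suffix starting at k beats p itself
    return None


print(find_violation("TCATBTCATA"))  # None
print(find_violation("TBATATBATB"))  # (4, 5)
```

For TBATATBATB the result (4, 5) says that the prefix of length 4 (TBAT) reoccurs at offset 5 and is followed there by B, which is larger than the A after the prefix.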

  6. The Period of a String • A period of a string w is an integer p, 0 < p ≤ |w|, such that w[i] = w[i + p] for all 1 ≤ i ≤ |w| - p. • Ex: Consider the string w = bbabbabbabba. bbabbabbabba → period = 3 and period = 6. abcdefg → period = word length = 7. abcdeab → period = 5. • We define period(w) as the smallest period of w. If w = bbabbabbabba, period(w) is 3.
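
The examples can be reproduced with a brute-force sketch. It assumes the standard definition (p is a period when w[i] = w[i + p] wherever both positions exist) and a non-empty string; the names `periods` and `period` are illustrative.

```python
def periods(w):
    """All periods of the non-empty string w, in increasing order."""
    n = len(w)
    return [p for p in range(1, n + 1)
            if all(w[i] == w[i + p] for i in range(n - p))]


def period(w):
    """The smallest period of w."""
    return periods(w)[0]


print(periods("bbabbabbabba"))  # [3, 6, 9, 12]
print(period("abcdefg"))        # 7
print(period("abcdeab"))        # 5
```

The full length of a string is always a period, which is why abcdefg has period 7.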

  7. Given a string P, we are actually interested in the period of every prefix; period(i) denotes the period of the prefix P[1, i]. • Note that period(i) = i - prefix(i), where prefix(i) is the prefix function of the MP algorithm; this is the number of positions by which we can shift the pattern. (The index starts from 1 in this case.)
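
For comparison, here is a sketch of the usual table-based computation, assuming the textbook prefix (border) function of the MP algorithm. The names `prefix_function` and `prefix_periods` are illustrative, and the table is indexed by prefix length.

```python
def prefix_function(p):
    """border[i] = length of the longest proper suffix of p[:i] that is also
    a prefix of p[:i], for i = 0..len(p)."""
    border = [0] * (len(p) + 1)
    k = 0
    for i in range(1, len(p)):
        while k > 0 and p[i] != p[k]:
            k = border[k]        # follow a stored pointer back to a shorter border
        if p[i] == p[k]:
            k += 1
        border[i + 1] = k
    return border


def prefix_periods(p):
    """Period of every prefix p[:i], i = 1..len(p), as i - border[i]."""
    border = prefix_function(p)
    return [i - border[i] for i in range(1, len(p) + 1)]


print(prefix_periods("bbabbabbab"))  # [1, 1, 3, 3, 3, 3, 3, 3, 3, 3]
```

The `border` table is the memory cost of this approach: it needs one entry per pattern position.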

  8. Why are we interested in the period function? • If the period function carries the same information as the prefix function of the MP algorithm, why are we interested in it? • To calculate the prefix function, we must use pointers which point back to characters far behind. • In the following, we shall introduce a naïve period function which never looks back.

  9. Naive-Period Function • The function Naive-Period can be used to compute the period of a string if this string is self-maximal. • For a general string, the Naive-Period function does not work. This is why our algorithm works only for self-maximal strings.

  10. Algorithm of Naive-Period Function
      function Naive-Period(j); { computes the period of self-maximal pat }
        period(1) := 1;
        for i := 2 to j do
          if pat[i] ≠ pat[i - period(i - 1)] then period(i) := i
          else period(i) := period(i - 1);
        return period;

  12. An Example of Naive-Period Function • Consider the string w = bbabbabbab. • w is a self-maximal string and period(w) = 3.
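
The example can be traced with a short sketch. It assumes that the comparison in Naive-Period is pat[i] ≠ pat[i - period(i - 1)], which is the same test that the MaxSuffix-Matching pseudocode uses to update its period. The names `naive_period` and `smallest_period` are illustrative, and the input is assumed non-empty.

```python
def naive_period(pat):
    """Period of every prefix of pat by the Naive-Period rule.
    The result is only guaranteed to be correct when pat is self-maximal."""
    period = [0] * (len(pat) + 1)      # 1-based: period[i] belongs to pat[:i]
    period[1] = 1
    for i in range(2, len(pat) + 1):
        if pat[i - 1] != pat[i - 1 - period[i - 1]]:
            period[i] = i              # mismatch: the whole prefix is its own period
        else:
            period[i] = period[i - 1]  # match: the period carries over
    return period[1:]


def smallest_period(w):
    """Brute-force smallest period, used only as a reference."""
    n = len(w)
    return next(p for p in range(1, n + 1)
                if all(w[t] == w[t + p] for t in range(n - p)))


print(naive_period("bbabbabbab"))                      # [1, 1, 3, 3, 3, 3, 3, 3, 3, 3]
print(naive_period("abaa")[-1], smallest_period("abaa"))  # 4 3
```

The second line shows why self-maximality matters: abaa is not self-maximal, and the naive rule reports 4 although the true period is 3.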

  20. Why does the naïve period function work for self-maximal strings? • Given any pattern P, let k be the length of the longest proper suffix of P[1, i-1] that is equal to a prefix P[1, k] of P[1, i-1]. • Let k’ be the length of the longest proper suffix of P[1, i] that is equal to a prefix P[1, k’] of P[1, i]. • For any i, we consider the following possibilities:

  21. • Case 1: k ≠ 0 and P[k + 1] = P[i]: Period(i) = Period(i - 1) • Case 2: k ≠ 0, P[k + 1] ≠ P[i] and k’ ≠ 0: Period(i) = i - k’ • Case 3: k ≠ 0, P[k + 1] ≠ P[i] and k’ = 0: Period(i) = i • Case 4: k = 0 and k’ ≠ 0: Period(i) = i - k’ • Case 5: k = 0 and k’ = 0: Period(i) = i

  22. 1. k ≠ 0 and P[k + 1] = P[i]: Period(i) = Period(i - 1) • For i = 8, the substring “abc” of length 3 (k = 3) is the longest proper suffix of P(1, 7) that equals a prefix of P(1, 7), and P(8) = P(4). → period(8) = period(7) = 4.

  23. 2. k ≠ 0, P[k + 1] ≠ P[i] and k’ ≠ 0: Period(i) = i - k’ • For i = 9, the substring “abca” of length 4 (k = 4) is the longest proper suffix of P(1, 8) that equals a prefix of P(1, 8), and P(9) ≠ P(5). • The longest proper suffix of P(1, 9) that equals a prefix of P(1, 9) is P(1, 2) = ab, of length 2 (k’ = 2). → period(9) = i - |P(1, 2)| = 9 - 2 = 7.

  24. 3. k ≠ 0, P[k + 1] ≠ P[i] and k’ = 0: Period(i) = i • For i = 9, the substring “abcc” of length 4 (k = 4) is the longest proper suffix of P(1, 8) that equals a prefix of P(1, 8), and P(9) ≠ P(5). • There is no proper suffix of P(1, 9) that equals a prefix of P(1, 9) (k’ = 0). → period(9) = i = 9.

  25. 4. k = 0 and k’ ≠ 0: Period(i) = i - k’ • For i = 9, there is no proper suffix of P(1, 8) that equals a prefix of P(1, 8) (k = 0). • The substring “a” of length 1 (k’ = 1) is a suffix of P(1, 9) that equals a prefix of P(1, 9), P(1, 1) = a. → period(9) = i - |P(1, 1)| = 9 - 1 = 8.

  26. 5. k = 0 and k’ = 0: Period(i) = i • For i = 9, there is no proper suffix of P(1, 8) that equals a prefix of P(1, 8) (k = 0). • There is no proper suffix of P(1, 9) that equals a prefix of P(1, 9) (k’ = 0). → period(9) = i = 9.
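
The five cases can be evaluated mechanically. The sketch below uses hypothetical helpers `border` and `classify` with brute-force border computation. The two sample strings are assumptions chosen to agree with the numbers quoted for cases 1 and 2, since the original pattern strings are not given on the slides.

```python
def border(s):
    """Length of the longest proper suffix of s that is also a prefix of s."""
    for k in range(len(s) - 1, 0, -1):
        if s[:k] == s[-k:]:
            return k
    return 0


def classify(p, i):
    """For a 1-based position i >= 2, return (case number, Period(i))
    using the formula attached to that case."""
    k = border(p[:i - 1])            # k  in the text
    k2 = border(p[:i])               # k' in the text
    if k != 0 and p[k] == p[i - 1]:  # P[k + 1] = P[i] in 1-based terms
        return 1, (i - 1) - k        # Period(i - 1)
    if k != 0:
        return (2, i - k2) if k2 != 0 else (3, i)
    return (4, i - k2) if k2 != 0 else (5, i)


print(classify("abcdabcd", 8))   # (1, 4)
print(classify("abcaabcab", 9))  # (2, 7)
```

In every case the returned value equals i - k’, so the five formulas are simply the general rule period(i) = i - k’ split by how k’ relates to k.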

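  27. But condition 2 (k ≠ 0, P[k + 1] ≠ P[i] and k’ ≠ 0) cannot occur in a self-maximal string. Condition 4 (k = 0 and k’ ≠ 0) can occur only with k’ = 1, that is, with P[i] = P[1]; then Period(i) = i - 1 = Period(i - 1), so it behaves like condition 1. • Why can condition 2 not occur? Assume that it holds. Then P(1, i - 1) has a proper suffix which is equal to a prefix. Let u be the longest such suffix, so |u| = k.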

  28. 2. k ≠ 0, P[k + 1] ≠ P[i] and k’ ≠ 0 • Suppose that P is self-maximal. Let j = k + 1, x = P[j] and y = P[i]. The prefix u is followed by x, and the occurrence of u that ends at position i - 1 is followed by y. Since P[i] = y ≠ P[j] = x holds, x > y. • Since k’ ≠ 0, there is a string v·y which is the longest proper suffix of P(1, i) equal to a prefix of P(1, i).

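  29. Here |v| = k’ - 1 < k, because k’ = k + 1 would force P[k + 1] = P[i]. Since k ≠ 0, v is therefore a suffix of u, so the prefix u ends with an occurrence of v that is followed by x = P[j]. The prefix v, on the other hand, is followed by y. • Since P is a self-maximal string, from the prefix v we may conclude that y ≥ x, which contradicts x > y. • Hence k ≠ 0, P[k + 1] ≠ P[i] and k’ ≠ 0 cannot hold for self-maximal strings.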

  30. For condition 4 (k = 0 and k’ ≠ 0), a suffix of P(1, i) of length k’ ≥ 2 that equals a prefix would give P(1, i - 1) such a suffix of length k’ - 1, so k = 0 forces k’ = 1. Then P[i] = P[1] and Period(i) = i - 1 = Period(i - 1). • Thus we have the following: for self-maximal strings, Period(i) = Period(i - 1) or Period(i) = i. That is, the naïve period function works for self-maximal strings.
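
The conclusion can also be checked empirically on short strings. This sketch is not a proof: it enumerates every string up to an assumed length bound over an assumed three-letter alphabet and compares the naive rule with a brute-force period on the self-maximal ones. All function names are illustrative.

```python
from itertools import product


def is_self_maximal(w):
    return all(w[i:] < w for i in range(1, len(w)))


def true_periods(w):
    """Smallest period of every prefix of w, by brute force."""
    out = []
    for n in range(1, len(w) + 1):
        out.append(next(p for p in range(1, n + 1)
                        if all(w[t] == w[t + p] for t in range(n - p))))
    return out


def naive_periods(w):
    """Period of every prefix of w by the naive rule, keeping one value only."""
    period, out = 1, [1]
    for i in range(2, len(w) + 1):
        if w[i - 1] != w[i - 1 - period]:
            period = i
        out.append(period)
    return out


def check(alphabet="abc", max_len=7):
    """Return how many self-maximal strings were tested; raise on a disagreement."""
    tested = 0
    for n in range(1, max_len + 1):
        for letters in product(alphabet, repeat=n):
            w = "".join(letters)
            if is_self_maximal(w):
                assert naive_periods(w) == true_periods(w), w
                tested += 1
    return tested


print(check() > 0)  # True: no self-maximal string in range breaks the naive rule
```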

  31. What is the advantage of the naïve period function? • It runs in linear time, and it never needs to follow stored pointers far back into the pattern, as the prefix function of the MP algorithm does.

  32. For a string which is not self-maximal, we use the following algorithm, called the MaxSuffix-Matching Algorithm.

  33. MaxSuffix-Matching Algorithm • First, we decompose the pattern string P into u · v, where v = MaxSuf(P) and u is the rest of P. • Note that v occurs only once in the string P; this is a very important property. • Property 1: No suffix of u is equal to a prefix of v, because v is unique in P. • Example: P = dababdadad, MaxSuf(P) = dadad, P = u·v = dabab · dadad

  34. MaxSuffix-Matching Algorithm • If v is found in T, we next test naively whether the part u of P occurs immediately to the left of v. • Assume i is the location of an occurrence of v in T, and let prev denote the location of the previous occurrence of v. Because v occurs only once in P, u needs to be tested only when i - prev > |u|.

  35. MaxSuffix-Matching Algorithm
      Algorithm MaxSuffix-Matching
        i := 0; j := 0; period := 1; prev := 0;
        while i ≤ n - |v| do begin
          while j < |v| and v[j + 1] = T[i + j + 1] do begin
            j := j + 1;
            if j > period and v[j] ≠ v[j - period] then period := j   { Naive-Period function }
          end;
          { MATCH OF v }
          if j = |v| then begin
            if i - prev > |u| and u = T[i - |u| + 1 … i] then report match at i - |u|;   { test u by using any algorithm }
            prev := i
          end;
          i := i + period;
          if j ≥ 2 · period then j := j - period
          else begin j := 0; period := 1 end
        end;
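
The following Python sketch follows the structure of the pseudocode with 0-based indices. Several parts are assumptions of the sketch rather than of the slides: the helper `split_at_max_suffix` finds v by brute force (so the sketch as a whole is not constant-space), the pattern is assumed non-empty, and "no previous occurrence of v" is tracked with `None` plus an explicit room check so that an occurrence of P at the very start of the text is not skipped.

```python
def split_at_max_suffix(p):
    """Return (u, v) with p == u + v and v = MaxSuf(p); brute force."""
    start = max(range(len(p)), key=lambda s: p[s:])
    return p[:start], p[start:]


def max_suffix_matching(text, pattern):
    """0-based start positions of all occurrences of pattern in text."""
    u, v = split_at_max_suffix(pattern)
    n, m = len(text), len(v)
    matches = []
    i = j = 0        # i: current shift of v in text, j: matched prefix length of v
    period = 1       # period of v[:j], maintained by the naive rule
    prev = None      # shift of the previous occurrence of v, if any
    while i <= n - m:
        while j < m and v[j] == text[i + j]:
            j += 1
            if j > period and v[j - 1] != v[j - 1 - period]:
                period = j
        if j == m:   # occurrence of v at shift i
            far_enough = prev is None or i - prev > len(u)
            if i >= len(u) and far_enough and all(
                    text[i - len(u) + t] == u[t] for t in range(len(u))):
                matches.append(i - len(u))
            prev = i
        i += period
        if j >= 2 * period:
            j -= period          # keep the overlap whose period is already known
        else:
            j, period = 0, 1     # forget the overlap and restart the naive rule
    return matches


print(max_suffix_matching("adadaddadabababadada", "abababadada"))  # [9]
```

The search for v uses only the variables i, j, period and prev; the reset to j = 0 in the last branch is what removes the need for a stored prefix table.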

  36. Example • Text = adadaddadabababadada • P = u·v = abababa · dada • Case 1 • If i < |u|, then there is no occurrence of u·v at the beginning of the text, because there is no room for u to the left of v.

  37. Example • Text = adadaddadabababadada • P = u·v = abababa · dada • Case 2 • If i - prev ≤ |u|, then there is no occurrence of u·v at position i - |u|. This is because the maximal suffix v of P starts at only one position in P. Here i = 7, |u| = 7 and prev = 2.

  38. Example • Text = adadaddadabababadada • P = u·v = abababa · dada • So, in this example, we only need to check whether u occurs to the left of the third occurrence of v; the first and second occurrences are ruled out by Case 1 and Case 2.
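
The decisions in this example can be replayed with a small sketch. It uses Python's `str.find` to locate v instead of the constant-space search, and the function name `explain` and the wording of the decisions are illustrative. Positions are 1-based start positions, as in the example.

```python
def explain(text, u, v):
    """Return (position, decision) for each occurrence of v in text."""
    result = []
    prev = None
    start = text.find(v)
    while start != -1:
        pos = start + 1                   # 1-based start of this occurrence of v
        if prev is None and start < len(u):
            decision = "case 1: no room for u on the left"
        elif prev is not None and pos - prev <= len(u):
            decision = "case 2: too close to the previous occurrence of v"
        elif text[start - len(u):start] == u:
            decision = "match: u.v starts at position %d" % (pos - len(u))
        else:
            decision = "tested: u does not precede v"
        result.append((pos, decision))
        prev = pos
        start = text.find(v, start + 1)   # allow overlapping occurrences
    return result


for pos, decision in explain("adadaddadabababadada", "abababa", "dada"):
    print(pos, decision)
# 2 case 1: no room for u on the left
# 7 case 2: too close to the previous occurrence of v
# 17 match: u.v starts at position 10
```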

  39. Time Complexity and Space Complexity • The MaxSuffix-Matching Algorithm finds all occurrences of a pattern in linear time, using O(1) extra space (the variables i, j, period and prev).

