Advisor: Prof. R. C. T. Lee Reporter: C. C. Yen

The MaxSuffix-Matching Algorithm
On maximal suffixes and constant-space versions of KMP algorithm
Rytter, W.
LATIN 2002: Theoretical Informatics: 5th Latin American Symposium, Cancun, Mexico, April 3-6, 2002. Proceedings.

Maximal Suffix
• The maximal suffix of a string is the suffix that is lexicographically largest among all suffixes of the string.
• The maximal suffix of a string w is denoted by MaxSuf(w).
• Ex: Consider the string w = abaaba.
  The set of its suffixes: {a, ba, aba, aaba, baaba, abaaba}
  The suffixes in sorted order: {a, aaba, aba, abaaba, ba, baaba}
• Thus MaxSuf(w) = baaba.
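The definition can be tried out directly. The following is a minimal Python sketch that enumerates all suffixes and picks the largest one; the function name max_suffix is hypothetical, and the quadratic enumeration is chosen for clarity only, not for efficiency.

```python
def max_suffix(w: str) -> str:
    """Return the lexicographically largest non-empty suffix of w.

    Naive version: builds every suffix and compares them, so it takes
    quadratic time and is meant only to illustrate the definition.
    """
    if not w:
        raise ValueError("w must be non-empty")
    return max(w[k:] for k in range(len(w)))


w = "abaaba"
print(sorted(w[k:] for k in range(len(w))))  # the suffixes in sorted order
print(max_suffix(w))                         # the last element of that list
```

Python compares strings character by character and treats a proper prefix as smaller, which is the lexicographic order used on the slide.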

Self-Maximal String
• A string w is said to be self-maximal if MaxSuf(w) = w.
• Ex: Consider the strings w = abaaba and x = baaba.
• MaxSuf(w) = baaba ≠ w.
• MaxSuf(x) = baaba = x.
• Hence x is a self-maximal string but w is not.

Important Properties of Self-Maximal Strings
• By definition, we have the following observation about self-maximal strings:
• For a self-maximal string P, suppose a prefix P1 P2 … Pi of P is equal to a later substring Pk Pk+1 … Pk+i-1 of P. Then Pi+1 ≥ Pk+i whenever both characters exist.
• [Figure: P contains the prefix u followed by x and a later occurrence of u followed by y; if x ≠ y, then x > y.]

Example:
• TCATBTCATA is a self-maximal string.
• But TBATATBATB is not a self-maximal string, because the B after the second occurrence of TBAT is lexicographically larger than the A after the prefix TBAT.
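The two examples can be checked mechanically. The sketch below tests self-maximality straight from the definition (no proper suffix may be larger than the whole string); is_self_maximal is a hypothetical helper name.

```python
def is_self_maximal(w: str) -> bool:
    """True if no proper suffix of w is lexicographically larger than w."""
    return all(w[k:] <= w for k in range(1, len(w)))


for s in ["baaba", "abaaba", "TCATBTCATA", "TBATATBATB"]:
    print(s, is_self_maximal(s))
```

For TBATATBATB the check fails at the suffix TBATB, which agrees with the prefix TBATA on four characters and then has B where the prefix has A.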

The Period of a String
• A period of a string w is an integer p, 0 < p ≤ |w|, such that w[i] = w[i + p] for all i with 1 ≤ i ≤ |w| − p.
• Ex: w = bbabbabbabba has periods 3 and 6.
• abcdefg → period = word length = 7
• abcdeab → period = 5
• We define period(w) as the smallest period of w. If w = bbabbabbabba, period(w) is 3.
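The examples can be reproduced with a brute-force sketch. It assumes the usual definition of a period (p is a period of w if w[i] = w[i + p] wherever both positions exist, with 0 < p ≤ |w|); the function names are hypothetical.

```python
def periods(w: str) -> list[int]:
    """All periods p of w with 0 < p <= len(w), by direct checking."""
    n = len(w)
    return [p for p in range(1, n + 1)
            if all(w[i] == w[i + p] for i in range(n - p))]


def period(w: str) -> int:
    """The smallest period of a non-empty string w."""
    return periods(w)[0]


print(periods("bbabbabbabba"))  # contains 3 and 6, plus the multiples 9 and 12
print(period("abcdefg"))        # the whole length, since no letter repeats
print(period("abcdeab"))
```

Under this definition the length of the string is always a period, which is why abcdefg has period 7.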

Given a string P, we are actually interested in the period of every prefix of P. Note that the period of the i-prefix is i − prefix(i), where prefix is the prefix function of the MP algorithm; this is the number of steps by which we can move the pattern. (The index starts from 1 in this case.)
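The relation between the two functions can be sketched as follows. The code assumes that prefix(i) means the length of the longest proper suffix of the i-prefix that is also a prefix of it, computed in the standard MP way with a table of back pointers; the names prefix_function and prefix_periods are hypothetical.

```python
def prefix_function(p: str) -> list[int]:
    """border[i] = length of the longest proper suffix of p[:i] that is
    also a prefix of p[:i], for i = 0..len(p) (border[0] = border[1] = 0)."""
    n = len(p)
    border = [0] * (n + 1)
    k = 0
    for i in range(2, n + 1):
        c = p[i - 1]                 # last character of the i-prefix
        while k > 0 and p[k] != c:
            k = border[k]            # follow a pointer back into the table
        if p[k] == c:
            k += 1
        border[i] = k
    return border


def prefix_periods(p: str) -> list[int]:
    """Period of every i-prefix, i = 1..len(p), as i - prefix(i)."""
    border = prefix_function(p)
    return [i - border[i] for i in range(1, len(p) + 1)]


print(prefix_periods("bbabbabbab"))
```

The while loop is where the table lookups happen; these are the pointers to earlier characters that a constant-space method has to avoid.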

Why are we interested in the period function?
• If the period function carries the same information as the prefix function of the MP algorithm, why are we interested in it?
• To calculate the prefix function, we must use pointers that point back to characters far behind the current position.
• In the following, we shall introduce a naive period function which never looks back.

Naive-Period Function • Function Naive-Period can be used to compute the period of a string if this string is self-maximal. • For a general string, the Naive-Period function will not work. This is why our algorithm only works for the self-maximal strings.

Algorithm of Naive-Period Function
Function Naive-Period(j); { computes the period of self-maximal pat }
    period(1) := 1;
    for i := 2 to j do
        if pat[i] ≠ pat[i − period(i − 1)] then period(i) := i
        else period(i) := period(i − 1);
    return period;


An Example of Naive-Period Function
• Consider a string w = bbabbabbab.
• w is a self-maximal string and period(w) = 3.
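The example can be traced with a short sketch of the Naive-Period function. It assumes that the test missing from the slide is pat[i] ≠ pat[i − period(i − 1)], which is the same comparison as the period update in the MaxSuffix-Matching pseudocode; naive_periods is a hypothetical name, and the list of all prefix periods is kept only to make the trace visible.

```python
def naive_periods(pat: str) -> list[int]:
    """Periods of all prefixes of pat as computed by the naive rule.

    The result is correct only when pat is self-maximal. Indices in the
    comments are 1-based as on the slides; the code itself is 0-based.
    """
    if not pat:
        return []
    result = [1]                          # period(1) = 1
    for i in range(2, len(pat) + 1):
        prev_period = result[-1]          # period(i - 1)
        if pat[i - 1] != pat[i - 1 - prev_period]:
            result.append(i)              # mismatch: the period jumps to i
        else:
            result.append(prev_period)    # match: the period is unchanged
    return result


print(naive_periods("bbabbabbab"))  # self-maximal input
print(naive_periods("abaab"))       # not self-maximal: the values can be wrong
```

For abaab the prefix abaa has true period 3, but the naive rule reports 4, which illustrates why the function is restricted to self-maximal strings. Since only the most recent value is ever read, a single variable would suffice in place of the list.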


Why can Naive-Period work for self-maximal strings?
• Given any pattern P, let k be the length of the longest proper suffix of P[1, i-1] that is equal to a prefix P[1, k] of P[1, i-1].
• Let k’ be the length of the longest proper suffix of P[1, i] that is equal to a prefix P[1, k’] of P[1, i].
• For any i, we consider the following possibilities:

1. k ≠ 0 and P[k + 1] = P[i]: Period(i) = Period(i - 1)
2. k ≠ 0, P[k + 1] ≠ P[i] and k’ ≠ 0: Period(i) = i – k’
3. k ≠ 0, P[k + 1] ≠ P[i] and k’ = 0: Period(i) = i
4. k = 0 and k’ ≠ 0: Period(i) = i – k’
5. k = 0 and k’ = 0: Period(i) = i

1. k ≠ 0 and P[k + 1] = P[i]: Period(i) = Period(i - 1)
For i = 8, the substring “abc” of length 3 (k = 3) is the longest suffix of P[1, 7] that equals a prefix of P[1, 7], and P[8] = P[4]. Hence period(8) = period(7) = 4.

2. k ≠ 0, P[k + 1] ≠ P[i] and k’ ≠ 0: Period(i) = i – k’
For i = 9, the substring “abca” of length 4 (k = 4) is the longest suffix of P[1, 8] that equals a prefix of P[1, 8], and P[9] ≠ P[5]. There is a suffix of P[1, 9] that equals a prefix of P[1, 9], namely P[1, 2] = ab of length 2 (k’ = 2). Hence period(9) = i - |P[1, 2]| = 9 - 2 = 7.

3. k ≠ 0, P[k + 1] ≠ P[i] and k’ = 0: Period(i) = i
For i = 9, the substring “abcc” of length 4 (k = 4) is the longest suffix of P[1, 8] that equals a prefix of P[1, 8], and P[9] ≠ P[5]. There is no suffix of P[1, 9] that equals a prefix of P[1, 9] (k’ = 0). Hence period(9) = i = 9.

4. k = 0 and k’ ≠ 0: Period(i) = i – k’
For i = 9, there is no suffix of P[1, 8] that equals a prefix of P[1, 8] (k = 0). The substring “a” of length 1 (k’ = 1) is a suffix of P[1, 9] that equals a prefix of P[1, 9], namely P[1, 1] = a. Hence period(9) = i - |P[1, 1]| = 9 - 1 = 8.

5. k = 0 and k’ = 0: Period(i) = i
For i = 9, there is no suffix of P[1, 8] that equals a prefix of P[1, 8] (k = 0). There is no suffix of P[1, 9] that equals a prefix of P[1, 9] (k’ = 0). Hence period(9) = i = 9.

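But condition 2 (k ≠ 0, P[k + 1] ≠ P[i] and k’ ≠ 0) cannot occur in a self-maximal string, and condition 4 (k = 0 and k’ ≠ 0) never produces a new period value. Why? Assume first that condition 2 holds. Then P[1, i - 1] has a non-empty suffix which is equal to a prefix. Let u be the longest such suffix, so |u| = k.

2. k ≠ 0, P[k + 1] ≠ P[i] and k’ ≠ 0
Let j = k + 1, x = P[j] and y = P[i]. Suppose that P is self-maximal. The prefix u is followed by x, and the occurrence of u ending at position i - 1 is followed by y, so x ≥ y. Since P[i] = y ≠ P[j] = x holds, x > y.
Since k’ ≠ 0, there is a string v·y which is the longest suffix of P[1, i] equal to a prefix of P[1, i], where v is shorter than u.
[Figure: P with the prefix u followed by x at position j and the suffix u of P[1, i - 1] followed by y at position i; v·y is both a prefix and a suffix of P[1, i].]

Since k ≠ 0, v is a suffix of the prefix u, so v·x occurs in P ending at position j, while v·y is a prefix of P.
Since P is a self-maximal string, comparing the prefix v·y with this occurrence of v·x gives y ≥ x, which contradicts x > y.
Hence k ≠ 0, P[k + 1] ≠ P[i] and k’ ≠ 0 cannot hold for self-maximal strings.

For condition 4, k = 0 and k’ ≠ 0 force k’ = 1, because a longer suffix of P[1, i] equal to a prefix would leave a non-empty suffix of P[1, i - 1] equal to a prefix. Then Period(i) = i - 1, and k = 0 gives Period(i - 1) = i - 1, so Period(i) = Period(i - 1). Thus we have the following: for self-maximal strings, Period(i) = Period(i - 1) or Period(i) = i, and the test P[i] = P[i - Period(i - 1)] decides between the two. That is, the naive period function works for self-maximal strings.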
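The conclusion can also be checked empirically. The sketch below enumerates every string up to a small length over a small alphabet, keeps the self-maximal ones, and compares the naive period with the period computed by brute force. It is a sanity check under these assumptions and not a proof; all function names are hypothetical.

```python
from itertools import product


def is_self_maximal(w: str) -> bool:
    """True if no proper suffix of w is lexicographically larger than w."""
    return all(w[k:] <= w for k in range(1, len(w)))


def true_period(w: str) -> int:
    """Smallest p > 0 with w[i] == w[i + p] wherever both positions exist."""
    n = len(w)
    return next(p for p in range(1, n + 1)
                if all(w[i] == w[i + p] for i in range(n - p)))


def naive_period(w: str) -> int:
    """Naive period of w using one variable (1-based i as on the slides)."""
    period = 1
    for i in range(2, len(w) + 1):
        if w[i - 1] != w[i - 1 - period]:
            period = i
    return period


def first_counterexample(alphabet: str = "ab", max_len: int = 8):
    """Return a self-maximal string on which the naive rule fails, or None."""
    for n in range(1, max_len + 1):
        for letters in product(alphabet, repeat=n):
            w = "".join(letters)
            if is_self_maximal(w) and naive_period(w) != true_period(w):
                return w
    return None


print(first_counterexample())
```

Every prefix of a self-maximal string is itself self-maximal, so checking the period of each enumerated string also covers the period of every prefix.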

What is the advantage of the naïve-period function? • It is linear and we never need to look back to some characters way back, as we need in calculating the prefix function in MP-algorithm.

For a string which is not self-maximal, we use the following algorithm, called the Max-Suffix Matching Algorithm.

MaxSuffix-Matching Algorithm
• First, we decompose the pattern string P into u · v, where v = MaxSuf(P) and u is the remaining prefix of P.
• Note that v occurs only once in P; this is a very important property.
• Property 1: No non-empty suffix of u is equal to a prefix of v, because v is the unique maximal suffix of P.
• Example: P = dababdadad, MaxSuf(P) = dadad, P = u · v = dabab · dadad
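The decomposition and Property 1 can be illustrated with a short sketch. It finds the maximal suffix by naive enumeration, which is a simplification made here for readability and not the constant-space computation the algorithm ultimately needs; decompose and property_1_holds are hypothetical names.

```python
def decompose(p: str) -> tuple[str, str]:
    """Split a non-empty p into (u, v) with v = MaxSuf(p), found naively."""
    v = max(p[k:] for k in range(len(p)))
    u = p[:len(p) - len(v)]
    return u, v


def property_1_holds(u: str, v: str) -> bool:
    """True if no non-empty suffix of u is a prefix of v."""
    return not any(v.startswith(u[k:]) for k in range(len(u)))


u, v = decompose("dababdadad")
print(u, v)
print(property_1_holds(u, v))
print("dababdadad".find(v), "dababdadad".rfind(v))  # equal: v occurs once
```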

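MaxSuffix-Matching Algorithm
• If v is found in T, we next test naively whether the part u of P occurs immediately to the left of v.
• Assume i is the location of the current occurrence of v in T, and let prev denote the location of the previous occurrence of v. Because v occurs only once in P, u has to be tested only when i − prev > |u|.
• [Figure: the text T with the positions prev and i marked.]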

MaxSuffix-Matching Algorithm
Algorithm MaxSuffix-Matching
i := 0; j := 0; period := 1; prev := 0;
while i ≤ n − |v| do
begin
    while j < |v| and v[j+1] = T[i+j+1] do
    begin
        j := j + 1;
        if j > period and v[j] ≠ v[j − period] then period := j   { Naive-Period function }
    end;
    { MATCH OF v }
    if j = |v| then
    begin
        if i − prev > |u| and u = T[i − |u| + 1 … i] then report match at i − |u|;   { test u by using any algorithm }
        prev := i
    end;
    i := i + period;
    if j ≥ 2 · period then j := j − period
    else begin j := 0; period := 1 end
end;
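The pseudocode can be turned into a runnable Python sketch. Three points are assumptions of this sketch and not taken from the slide: the inner comparison reads v[j+1] against T[i+j+1] and the match test is j = |v|; the maximal suffix is found by naive enumeration, so the preprocessing here is not constant-space; and prev starts as a "no previous occurrence" sentinel so that an occurrence of P at the very beginning of T is still tested. Indices are 0-based, and maxsuffix_match is a hypothetical name.

```python
def maxsuffix_match(text: str, pattern: str) -> list[int]:
    """Return the 0-based start positions of pattern in text (pattern non-empty)."""
    v = max(pattern[k:] for k in range(len(pattern)))   # naive MaxSuf
    u = pattern[:len(pattern) - len(v)]
    n, m = len(text), len(v)
    matches = []
    i = j = 0          # i: current shift of v, j: matched characters of v
    period = 1         # period of the matched prefix v[:j]
    prev = None        # shift of the previous occurrence of v, if any
    while i <= n - m:
        while j < m and v[j] == text[i + j]:
            j += 1
            # Naive-Period update, valid because v is self-maximal.
            if j > period and v[j - 1] != v[j - 1 - period]:
                period = j
        if j == m:     # v occurs at shift i
            far_enough = prev is None or i - prev > len(u)
            if far_enough and i >= len(u) and text[i - len(u):i] == u:
                matches.append(i - len(u))
            prev = i
        i += period
        if j >= 2 * period:
            j -= period            # the overlap keeps the same period
        else:
            j, period = 0, 1       # restart matching v from scratch
    return matches


print(maxsuffix_match("adadaddadabababadada", "abababadada"))
```

In the run on the slide's text, v = dada is found three times, but u is compared against the text only before the last occurrence; the first is too close to the start of the text and the second too close to the first.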

Example
• Text = adadaddadabababadada
• P = u·v = abababa · dada
• Case 1
• If i < |u|, then there is no occurrence of u·v at the beginning of the text.
• [Figure: the text with its positions numbered and the position i of the first occurrence of v marked.]

Example
• Text = adadaddadabababadada
• P = u·v = abababa · dada
• Case 2
• If i − prev < |u|, then there is no occurrence of u·v at position i − |u|. This is because the maximal suffix v of P starts at only one position in P.
• Here i = 7, |u| = 7 and prev = 2.
• [Figure: the text with its positions numbered.]

Example
• Text = adadaddadabababadada
• P = u·v = abababa · dada
• So, in this example we only need to check whether u occurs to the left of the third occurrence of v.
• [Figure: the text with its positions numbered and the first, second and third occurrences of v marked.]

Time Complexity and Space Complexity
• Hence, the MaxSuffix-Matching Algorithm finds all occurrences of a pattern in linear time using O(1) extra space (the variables i, j, period and prev).

Reference
• M. Crochemore, String-matching on ordered alphabets, Theoretical Computer Science 92 (1) (1992) 33-47.
• M. Crochemore, D. Perrin, Two-way string-matching, Journal of the ACM 38 (3) (1991) 650-674.
• M. Crochemore, W. Rytter, Text Algorithms, Oxford University Press, New York, NY, 1994.
• M. Crochemore, W. Rytter, Cubes, squares and time space efficient string matching, Algorithmica 13 (5) (1995) 405-425.
• J.-P. Duval, Factorizing words over an ordered alphabet, J. Algorithms 4 (1983) 363-381.

Reference
• Z. Galil, J. Seiferas, Time-space-optimal string matching, J. Comput. System Sci. 26 (1983) 280-294.
• L. Gasieniec, W. Plandowski, W. Rytter, Constant-space string matching with smaller number of comparisons: sequential sampling, in: Z. Galil, E. Ukkonen (Eds.), Combinatorial Pattern Matching, 6th Annual Symposium, CPM 95, Lecture Notes in Computer Science, Vol. 937, Springer, Berlin, 1995, pp. 78-89.
• L. Gasieniec, W. Plandowski, W. Rytter, The zooming method: a recursive approach to time-space efficient string-matching, Theoretical Computer Science 147 (1-2) (1995) 19-30.
• D.E. Knuth, J.H. Morris, V.R. Pratt, Fast pattern matching in strings, SIAM J. Comput. 6 (1977) 323-350.
• M. Lothaire, Combinatorics on Words, Addison-Wesley, Reading, MA, USA, 1983.

Two Way Algorithm
Two-way string-matching
Crochemore M., Perrin D.
Journal of the ACM 38(3):651-675, 1991

• In 2003, Rytter proposed a constant-space, linear-time string matching algorithm.
• To achieve constant space, this algorithm avoids the preprocessed function table of the KMP algorithm.
• Before introducing this algorithm, we shall define some characteristics of strings.

The Property of Maximal Suffix
• Consider a string P. Let P = uv where v = MaxSuf(P).
• The property of the maximal suffix of a string is: if u is non-empty, no suffix of u is equal to a prefix of v.
• Example: Consider the pattern P = ababadada. Let P = uv = ababa · dada. No suffix of u is equal to a prefix of v.

Short Maximal Suffix
• If the maximal suffix of a string x satisfies |MaxSuf(x)| ≤ |x|/2, we say that it is a short maximal suffix of x.
• Example: Consider the string x = abcdda. dda is the maximal suffix of x and |dda| = 3 ≤ |x|/2 = 3. Hence we say that dda is a short maximal suffix of x.
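The example can be checked with a small sketch. It assumes that the condition for "short" is |MaxSuf(x)| ≤ |x|/2, which is consistent with the example x = abcdda; has_short_max_suffix is a hypothetical name.

```python
def has_short_max_suffix(x: str) -> bool:
    """True if the maximal suffix of non-empty x has length at most len(x) / 2.

    The threshold len(x) / 2 is an assumption of this sketch.
    """
    v = max(x[k:] for k in range(len(x)))   # naive maximal suffix
    return 2 * len(v) <= len(x)


print(has_short_max_suffix("abcdda"))   # maximal suffix dda, length 3 of 6
print(has_short_max_suffix("baaba"))    # self-maximal, so the suffix is the whole string
```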

Short Prefixes Lemma
• Let P = uv be the decomposition of P, where v is the maximal suffix of P and v is also a short maximal suffix.
• Suppose that we start to match v with T at position i, the first j characters of v are matched, and a mismatch occurs at the (j + 1)-th position of v.
• Then we can shift P safely by j + 1 positions without missing any occurrence of P in T.
• [Figure: T with the mismatch at position i + j + 1, and P = u v before and after the shift by j + 1 positions.]

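Why do we have to use a short maximal suffix?
• Suppose the matched part v’ is very long; then moving the pattern in this way is incorrect.
• [Figure: T and P = u v with a long matched part v’ starting at position i, before and after a shift of j + 1 positions.]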

In the following, we introduce the basic rules of the Two-Way matching algorithm for patterns whose maximal suffix is short. The basic rules are given in the next slides.

Basic rules of the Two-Way algorithm with a short maximal suffix
1. Let P = uv be the decomposition of P, where v is the maximal suffix of P and v is also a short maximal suffix.
2. We then find where v appears in T from left to right. Assume the comparison starts at position i. When a mismatch occurs at v[j + 1], we shift v by j + 1 characters and start the next comparison at v[1] with T[i + j + 1].
3. When the part v has been found in T, we scan the part u from right to left. If a mismatch occurs when scanning u, we shift P by Period(P).
4. If we find both parts v and u in T, we report an occurrence of P in T. We then shift P by Period(P).

Full Example
T = adadadaddadababadada
P = u·v = ababa · dada
