340 likes | 593 Views
KMP algorithm. KNUTH D.E., MORRIS (Jr) J.H., PRATT V.R.,, Fast pattern matching in strings, SIAM Journal on Computing 6(1), 1977, pp.323-350. Advisor: Prof. R. C. T. Lee Reporter: C. W. Lu. 0. -1. -1. 4. 0. 9. 6. 0. 2. 7. 11. 5. 10. 8. 3. 1. -1. 12. 15. 16. 1. -1. 1. 4.
E N D
KMP algorithm KNUTH D.E., MORRIS (Jr) J.H., PRATT V.R.,, Fast pattern matching in strings, SIAM Journal on Computing 6(1), 1977, pp.323-350. Advisor: Prof. R. C. T. Lee Reporter: C. W. Lu
0 -1 -1 4 0 9 6 0 2 7 11 5 10 8 3 1 -1 12 15 16 1 -1 1 4 0 -1 -1 0 -1 1 13 1 14 -1 b c a c a b b b b b a b a b c c e KMP Table • The KMP algorithm constructs a table in preprocessing phase. • In searching phase, the window shifting will be determined easily by looking up the table. • Example: P = bcbabcbaebcbabcba KMP Table i P
Once the KMP table is constructed, whenever a mismatch occurs at location i, for the KMP algorithm, we move the pattern i-KMPtable(i) steps under the assumption that the location starts with 0.
5 1 6 1 7 8 -1 2 0 -1 11 12 16 10 13 9 1 0 1 3 4 14 -1 15 -1 1 0 -1 0 -1 -1 0 -1 4 c b a b c b b e a b b c b b a c a e a b c a b c b b b b a b c b c b a b a c c b c b a a c b … … c c b c a b a b a b b b c b e e b a a c b b i P T: P: Mismatch occurs at location 4 of P. Move P (4 - KMPtable[4]) = 4 - (-1) = 5 steps.
5 1 6 1 7 8 -1 2 0 -1 11 12 16 10 13 9 1 0 1 3 4 14 -1 15 -1 1 0 -1 0 -1 -1 0 -1 4 c b a b c b b e a b b c b b a c a e a b c a b c b b b b a b c b c b a b a c b b c b a a c b … … c c b c a b a b a b b b c b b e b a a c b b i P T: P: Mismatch occurs in position 8 of P. Move P (8 - KMPtable[8]) = 8 - 4 = 4 steps.
3 -1 1 6 7 8 2 1 3 5 4 -1 0 0 -1 0 -1 0 e b c a c b b b b The Definition of the KMP Table • For location i, if j is the largest such that P(0,j-1) is a suffix of P(0,i-1) and P(i) not equal to P(j), then KMPtable(i)=j. • Example ∵P(0, 2) is the longest prefix which is equal to a suffix of P(0, 6), and P(7)≠P(3). ∴KMPtable[7] = 3. i P
Condition for KMPtable[i] = -1 Condition A: P(0) = P(i) Condition B: P(0, j) is a suffix of P(0, i-1) Condition C: P(j+1) = P(i) KMPtable(i) = -1 :
0 -1 5 6 16 7 8 15 9 14 -1 10 12 13 1 1 2 -1 3 0 4 11 e a a c b b b c c b a b b c a b b i P • There is no suffix of P(0, 3) which is equal to a prefix of P(0, 3). ( ) • P(0) = P(4). (A) • KMPtable[4] = -1 because it satisfies the condition .
13 7 5 6 2 11 3 9 10 12 8 -1 4 -1 -1 1 1 -1 1 14 15 16 0 0 -1 -1 0 4 0 -1 -1 0 1 b c c a a b b b c a c b b e a b b i P • There are two suffixes of P(0, 14) which are equal to a prefix of P(0, 14): bcbabc (P(0, 5)) and P(6) = P(15); bc (P(0, 1)), and P(2) = P(15). ( ) • P(0) = P(4). (A) • KMPtable[15] = -1 because it satisfies the condition .
Condition for KMPtable[i] = 0 • Condition A: P(0) = P(i) • Condition B: P(0, j) is a suffix of P(0, i-1) • Condition C: P(j+1) = P(i) KMPtable(i) = 0 :
0 10 9 11 8 7 12 1 5 2 1 3 -1 -1 0 -1 6 0 -1 1 -1 13 15 16 14 -1 4 0 1 4 -1 0 a c b b b a b c c c a e a b b b b i P • There are two suffixes of P(0, 13) which are equal to a prefix of P(0, 13): bcbab (P(0, 4)) and P(5) = P(14); b (P(0)), and P(1) = P(14). ( ) • P(0) = P(4). ( ) • KMPtable[14] = 0 because it satisfies the condition .
How to Construct the KMP Table Efficiently? • Note that the KMP algorithm is actually an improvement of the MP algorithm. Therefore, we may now take a look at the table used in the MP algorithm. • We call the table used in the MP algorithm the prefix table.
The Definition of the Prefix Table. • For location i, let j be the largest j, if it exists, such that P(0,j-1) is a suffix of P(0,i), Prefix(i)=j. • If, for P(0,i), there is no prefix equal to a suffix, Prefix (i)=0.
2 16 4 3 6 8 6 15 9 5 8 1 2 12 7 10 14 11 4 7 5 1 0 3 0 1 4 3 0 0 1 0 13 2 a e a b b c b a b b b c b b a c c Example i • Note that, in the MP algorithm, we move the pattern i-Prefix(i)+1 steps when a mismatch occurs at location i. P Prefix
3 6 7 5 2 9 8 11 12 1 4 0 10 g 4 a a c a g c a 0 0 c c 5 1 3 2 1 g 0 0 1 a 0 a 0 How can we construct the Prefix Table Efficiently? • To compute Prefix(i), we look at Prefix(i-1). • In the following example, since Prefix(11)=4, we know that there exists a prefix of length 4 which is equal to a suffix with length 4 of P(0,11). Besides, P(4)=P(12). We may conclude that Prefix(12)=Prefix(11)+1=4+1=5. i P Prefix
3 6 7 2 8 5 10 1 4 0 9 c ? g g g c a 4 g 0 c 2 1 0 c 0 0 1 2 3 c g Another Case • Consider the following example. • Prefix(9)=4. But P(4)≠P(10). • Can we conclude that Prefix(10)=0? • No, we cannot. i P Prefix
5 10 4 9 2 1 0 7 3 6 8 ? 0 g a g c g c g c 0 4 2 c 3 2 1 0 0 1 c g • There exists a shorter prefix with length 2 which is equal to a suffix of P(0, 9), and P(10)=P(2). We should conclude that Prefix(10)=2+1=3. i P Prefix
j i-1 • In other words, we may use the pointer idea expressed below: • It may be necessary to examine P(0, j) to see whether there exists a prefix of P(0, j) equal to a suffix of P(0, j). • Thus the Prefix function can be found recursively.
f [0]=0 • For ( i=1 ; i<m ; i++ ){ • t = f (i-1); • While(t>=0){ • if ( P(i) = P(t) ) { • f [i] = t + 1; • break; • } • else{ • if ( t != 0) • t = f [t-1]; /*recursive*/ • else{ • f [i] = 0; • break; • } • } • } • } Construct the Prefix Function f
Example: 15 16 12 0 5 7 11 14 12 1 2 6 10 5 3 4 8 13 9 0 0 15 16 0 0 6 7 8 14 10 13 11 1 2 3 4 9 b c c b a b b a b b c e b b b c a a b b c a a b b a b c e c a b c b i P Prefix i P Prefix t = f[i-1] = f[0] = 0; ∵P[1] = c ≠ P[t] = P[0] = b ∴f [1] = 0.
Example: 0 0 0 5 6 7 16 8 9 1 2 10 15 0 4 1 3 14 12 1 12 13 11 1 2 10 11 1 1 0 13 0 4 14 3 15 2 0 16 0 9 8 2 3 6 4 5 0 7 b c b a b a b a c b e c b b a b c b e b c b b a a a c b c b c b a b i P Prefix t = f[i-1] = f[4] = 1; ∵P[5] = c = P[t] = P[1] = c ∴f [5] = t +1 = 2. i P Prefix t = f[i-1] = f[7] = 4; ∵P[8] = e ≠ P[t] = P[4] = b, t != 0; t = f[t-1] = f[3] = 0; ∵P[8] = e ≠ P[t] = P[0] = b, ∴f [8] = 0.
Example: 0 5 0 6 8 7 8 7 9 6 10 5 11 12 16 1 1 2 15 3 4 14 1 0 3 2 0 1 2 3 4 3 4 0 7 10 0 0 7 6 5 0 5 16 0 14 13 4 13 1 15 3 1 4 8 0 9 4 6 11 3 12 1 1 2 2 2 b b a a b b b a b a c c b b e c c a b a b b b c b c a c b a e b c b i P Prefix t = f[i-1] = f[14] = 6; ∵P[15] = b = P[t] = P[6] = b ∴f [15] = t +1 = 7. i P Prefix
The KMPtable KMPtable[0] = -1 For ( i=1 ; i<m ; i++ ) { t=f (i-1) While ( t > 0 ) { if ( P(i) ≠ P(t) ) { KMPtable[i]=t break } else t = f ( t – 1 ) /*recursive*/ } if ( KMPtable[i] = ψ) if ( P(i) = P(0)) . KMPtable[i] = -1 else KMPtable[i] = 0 }
Example: 1 2 3 4 3 13 2 5 0 0 -1 0 -1 1 1 14 2 12 9 -1 16 0 2 3 4 0 1 1 8 15 11 5 6 7 10 3 8 0 7 8 7 7 9 10 11 12 1 2 3 4 13 6 4 6 5 14 0 1 4 3 0 0 1 0 4 2 6 0 16 15 8 1 5 c b b b b c c a b a b a a b a a b b c a c c b e b e b b c b c b b a i P Prefix KMPtable i P Prefix KMPtable t = f[i-1] = f[2] = 1; ∵P[3] = a ≠ P[t] = P[1] = c, ∴ KMPtable [3] = t = 1.
Example: 5 6 4 7 0 8 1 0 5 10 11 3 12 2 1 1 0 0 0 2 9 -1 3 7 4 3 2 13 1 14 6 15 8 16 4 -1 0 -1 1 -1 1 0 b c a a a b c b b b e a b c b b c i P Prefix KMPtable t = f[i-1] = f[6] = 3; ∵P[7] = a = P[t] = P[3] = a; t = f[t-1] = f[2] = 1; ∵P[7] = a ≠P[t] = P[1] = c; ∴ KMPtable [3] = t = 1.
Example: 1 0 4 5 8 6 7 10 5 14 6 13 12 7 4 11 3 8 2 9 1 0 0 15 2 16 3 -1 0 -1 1 4 -1 0 -1 -1 0 -1 0 1 4 3 2 0 -1 1 1 1 c b b a a b a b b c b a b c b e c i P Prefix KMPtable t = f[i-1] = f[12] = 4; ∵P[13] = b = P[t] = P[4] = b; t = f[t-1] = f[3] = 0; ∵P[13] = b =P[0] = b; ∴ KMPtable [13] = -1.
Example: 2 1 0 4 3 0 8 0 10 2 9 3 4 8 11 13 12 14 6 15 16 5 7 3 0 1 1 0 1 1 1 -1 0 -1 -1 0 -1 4 1 -1 0 -1 0 5 6 7 4 1 -1 -1 2 c c c b a a b b b b b b e b a c a i P f KMPtable t = f[i-1] = f[15] = 7; ∵P[16] = a = P[t] = P[7] = a; t = f[t-1] = f[6] = 3; ∵P[16] = a = P[t] = P[3] = a; t = f[t-1] = f[2] = 1; ∵P[16] = a ≠P[t] = P[1] = c; ∴ KMPtable [16] = t = 1.
Time Complexity Preprocessing phase in O(m) space and time complexity. Searching phase in O(n+m) time complexity.
References • AHO, A.V., 1990, Algorithms for finding patterns in strings. in Handbook of Theoretical Computer Science, Volume A, Algorithms and complexity, J. van Leeuwen ed., Chapter 5, pp 255-300, Elsevier, Amsterdam. • AOE, J.-I., 1994, Computer algorithms: string pattern matching strategies, IEEE Computer Society Press. • BAASE, S., VAN GELDER, A., 1999, Computer Algorithms: Introduction to Design and Analysis, 3rd Edition, Chapter 11, pp. ??-??, Addison-Wesley Publishing Company. • BAEZA-YATES R., NAVARRO G., RIBEIRO-NETO B., 1999, Indexing and Searching, in Modern Information Retrieval, Chapter 8, pp 191-228, Addison-Wesley. • BEAUQUIER, D., BERSTEL, J., CHRÉTIENNE, P., 1992, Éléments d'algorithmique, Chapter 10, pp 337-377, Masson, Paris. • CORMEN, T.H., LEISERSON, C.E., RIVEST, R.L., 1990. Introduction to Algorithms, Chapter 34, pp 853-885, MIT Press. • CROCHEMORE, M., 1997. Off-line serial exact string searching, in Pattern Matching Algorithms, ed. A. Apostolico and Z. Galil, Chapter 1, pp 1-53, Oxford University Press. • CROCHEMORE, M., HANCART, C., 1999, Pattern Matching in Strings, in Algorithms and Theory of Computation Handbook, M.J. Atallah ed., Chapter 11, pp 11-1--11-28, CRC Press Inc., Boca Raton, FL. • CROCHEMORE, M., LECROQ, T., 1996, Pattern matching and text compression algorithms, in CRC Computer Science and Engineering Handbook, A. Tucker ed., Chapter 8, pp 162-202, CRC Press Inc., Boca Raton, FL. • CROCHEMORE, M., RYTTER, W., 1994, Text Algorithms, Oxford University Press. • GONNET, G.H., BAEZA-YATES, R.A., 1991. Handbook of Algorithms and Data Structures in Pascal and C, 2nd Edition, Chapter 7, pp. 251-288, Addison-Wesley Publishing Company.
References • GOODRICH, M.T., TAMASSIA, R., 1998, Data Structures and Algorithms in JAVA, Chapter 11, pp 441-467, John Wiley & Sons. • GUSFIELD, D., 1997, Algorithms on strings, trees, and sequences: Computer Science and Computational Biology, Cambridge University Press. • HANCART, C., 1992, Une analyse en moyenne de l'algorithme de Morris et Pratt et de ses raffinements, in Théorie des Automates et Applications, Actes des 2e Journées Franco-Belges, D. Krob ed., Rouen, France, 1991, PUR 176, Rouen, France, 99-110. • HANCART, C., 1993. Analyse exacte et en moyenne d'algorithmes de recherche d'un motif dans un texte, Ph. D. Thesis, University Paris 7, France. • KNUTH D.E., MORRIS (Jr) J.H., PRATT V.R., 1977, Fast pattern matching in strings, SIAM Journal on Computing 6(1):323-350. • SEDGEWICK, R., 1988, Algorithms, Chapter 19, pp. 277-292, Addison-Wesley Publishing Company. • SEDGEWICK, R., 1988, Algorithms in C, Chapter 19, Addison-Wesley Publishing Company. • SEDGEWICK, R., FLAJOLET, P., 1996, An Introduction to the Analysis of Algorithms, Chapter ?, pp. ??-??, Addison-Wesley Publishing Company. • STEPHEN, G.A., 1994, String Searching Algorithms, World Scientific. • WATSON, B.W., 1995, Taxonomies and Toolkits of Regular Language Algorithms, Ph. D. Thesis, Eindhoven University of Technology, The Netherlands. • WIRTH, N., 1986, Algorithms & Data Structures, Chapter 1, pp. 17-72, Prentice-Hall.