KMP Skip Search Algorithm
Very Fast String Matching Algorithm for Small Alphabets and Long Patterns, Charras, C., Lecroq, T. and Pehoushek, J. D., Lecture Notes in Computer Science, Vol. 1448, 1998, pp. 55-64.
Advisor: Prof. R. C. T. Lee
Speaker: Z. H. Pan
Definition
• String Matching Problem:
Input: a text string T of length n and a pattern string P of length m.
Output: all occurrences of P in T.
• Example: [figure showing a text T and a pattern P] The occurrences of P in T: T5
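As a baseline for the problem statement, the following is a minimal brute-force sketch in Python. It is not part of the KMP Skip Search algorithm; the function name `naive_search` and the use of 0-based positions are assumptions made for illustration. The text and pattern are the ones used in the preprocessing example of these slides.

```python
def naive_search(text, pattern):
    """Return every 0-based position where pattern occurs in text.

    Brute force: try each alignment and compare the whole window.
    """
    n, m = len(text), len(pattern)
    return [s for s in range(n - m + 1) if text[s:s + m] == pattern]


if __name__ == "__main__":
    print(naive_search("GCATCGCAGAGAGTATACAGTACG", "GCAGAGAG"))  # [5]
```

Any faster matcher for this problem should report the same list of positions as this reference function.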
• The KMP Skip Search algorithm consists of two phases: preprocessing and searching.
• The KMP Skip Search algorithm uses the KMP table to improve the Skip Search algorithm.
Preprocessing
• The preprocessing phase computes the buckets for all characters of the alphabet (Z), the list table (List), the MP table (mpNext) and the KMP table (kmpNext).
Example:
Text string T = GCATCGCAGAGAGTATACAGTACG
Pattern string P = GCAGAGAG

i        0  1  2  3  4  5  6  7
P[i]     G  C  A  G  A  G  A  G

c        A  C  G  T
Z[c]     6  1  7 -1

i        0  1  2  3  4  5  6  7
List[i] -1 -1 -1  0  2  3  4  5

i        0  1  2  3  4  5  6  7  8
mpNext  -1  0  0  0  1  0  1  0  1
kmpNext -1  0  0 -1  1 -1  1 -1  1
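The four tables can be computed in one pass each. The sketch below is one possible Python rendering, assuming the usual definitions: Z[c] is the last position of c in P, List[i] is the previous position of the character P[i], and mpNext/kmpNext are the standard Morris-Pratt and Knuth-Morris-Pratt failure tables. The names `pre_skip` and `pre_next` are hypothetical, and a dictionary stands in for the bucket array, so a character that is absent from P has no entry instead of the value -1.

```python
def pre_skip(pattern):
    """Return (z, lst): the buckets Z and the table List.

    z[c]   = last position of character c in the pattern
    lst[i] = closest position to the left of i holding pattern[i], or -1
    """
    z, lst = {}, [-1] * len(pattern)
    for i, c in enumerate(pattern):
        lst[i] = z.get(c, -1)
        z[c] = i
    return z, lst


def pre_next(pattern, strong):
    """Return mpNext (strong=False) or kmpNext (strong=True), length m + 1."""
    m = len(pattern)
    nxt = [-1] * (m + 1)
    i, j = 0, -1
    while i < m:
        while j > -1 and pattern[i] != pattern[j]:
            j = nxt[j]
        i += 1
        j += 1
        if strong and i < m and pattern[i] == pattern[j]:
            nxt[i] = nxt[j]  # KMP refinement: skip a border that would fail again
        else:
            nxt[i] = j
    return nxt


if __name__ == "__main__":
    p = "GCAGAGAG"
    z, lst = pre_skip(p)
    print([z.get(c, -1) for c in "ACGT"])  # [6, 1, 7, -1]
    print(lst)                             # [-1, -1, -1, 0, 2, 3, 4, 5]
    print(pre_next(p, strong=False))       # [-1, 0, 0, 0, 1, 0, 1, 0, 1]
    print(pre_next(p, strong=True))        # [-1, 0, 0, -1, 1, -1, 1, -1, 1]
```

The printed values agree with the tables shown for P = GCAGAGAG.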
A general situation for the search phase
• First, the Skip Search algorithm aligns P with T so that T[i] = P[j].
• start is the position in T at which P is currently aligned.
• wall is the position in T of the first mismatch for the current alignment of P.
• k is the length of the prefix of P that equals the substring of T beginning at start.
• kmpStart is the next alignment position given by the KMP shift.
• skipStart is the next alignment position given by the Skip Search shift.
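The relationship between start, k and wall can be sketched as a single comparison routine. This is an illustrative Python sketch, assuming that the characters of T between start and a previously known wall have already been matched and need not be compared again; the function name `attempt` and its signature are assumptions.

```python
def attempt(text, pattern, start, wall):
    """Compare pattern against text at alignment `start`.

    Characters text[start:wall] are assumed to be already matched, so the
    comparison resumes at offset wall - start. Returns k, the length of the
    matched prefix of pattern; the new wall is then start + k.
    """
    k = max(wall - start, 0)
    while (k < len(pattern) and start + k < len(text)
           and pattern[k] == text[start + k]):
        k += 1
    return k


if __name__ == "__main__":
    T = "ACTACATATAGGACTACGTACCAGCATTACTACGTT"
    P = "ACTACGT"
    k = attempt(T, P, start=0, wall=0)
    print(k, 0 + k)  # 5 5  (k = 5, wall = 5)
```

With the text and pattern of the worked example, the first attempt at start = 0 gives k = 5 and wall = 5.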
• If k = 0, no prefix of P equals the substring of T at the current alignment, so the Skip Search algorithm is used. Otherwise, when k > 0, a prefix of P equals a substring of T, and kmpStart, wall and skipStart must be compared. There are four cases.
• Case 1. skipStart < kmpStart: a shift according to the Skip Search algorithm is applied, which gives a new value for skipStart. skipStart and kmpStart must then be compared again.
• Case 2. kmpStart < skipStart < wall: a shift according to the shift table of Morris-Pratt is applied, which gives a new value for kmpStart. skipStart and kmpStart must then be compared again.
• Case 3. skipStart = kmpStart: another attempt can be performed with start = skipStart.
• Case 4. kmpStart < skipStart and wall ≤ skipStart: another attempt can be performed with start = skipStart.
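The case analysis above can be written as a small decision function. The following Python sketch is illustrative only; the name `shift_case` is an assumption, and Case 4 is treated as the remaining situation in which skipStart is beyond kmpStart and not before wall, which also covers alignments where kmpStart equals wall.

```python
def shift_case(skip_start, kmp_start, wall):
    """Return the case number (1 to 4) for the current shift candidates."""
    if skip_start < kmp_start:
        return 1  # apply a Skip Search shift, then compare again
    if skip_start == kmp_start:
        return 3  # both shifts agree: next attempt at skip_start
    if skip_start < wall:
        return 2  # apply a Morris-Pratt shift, then compare again
    return 4      # skip_start is past kmp_start and wall: next attempt there


if __name__ == "__main__":
    print(shift_case(skip_start=4, kmp_start=3, wall=5))     # 2
    print(shift_case(skip_start=4, kmp_start=5, wall=5))     # 1
    print(shift_case(skip_start=21, kmp_start=21, wall=21))  # 3
    print(shift_case(skip_start=12, kmp_start=10, wall=10))  # 4
```

Cases 1 and 2 change one of the two candidates and require another comparison; Cases 3 and 4 end the comparison and fix the next value of start.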
Example: step 1
T = ACTACATATAGGACTACGTACCAGCATTACTACGTT (positions 0 to 35)
P = ACTACGT (positions 0 to 6)
• First, the Skip Search algorithm is used to align T and P: start = 0.
• k = 5, wall = 5.
• KMP's shift: kmpStart = 3.
• Skip's shift: skipStart = 4.
• Case 2. kmpStart < skipStart < wall (3 < 4 < 5): a shift according to the shift table of Morris-Pratt is applied. This gives a new value for kmpStart, and skipStart and kmpStart must be compared again.
Example: step 1-1
• start = 0, wall = 5.
• k = 2.
• KMP's shift: kmpStart = 5.
• Skip's shift: skipStart = 4.
• Case 1. skipStart < kmpStart (4 < 5): a shift according to the Skip Search algorithm is applied, which gives a new value for skipStart, and skipStart and kmpStart must be compared again.
Example: step 1-2
• start = 0.
• k = 0, so the Skip Search algorithm is used.
• The next attempt is performed with start = 9.
Example: step 2
• start = 9, wall = 10.
• k = 1.
• KMP's shift: kmpStart = 10.
• Skip's shift: skipStart = 12.
• Case 4. Here kmpStart = wall = 10 and skipStart = 12, so skipStart lies beyond both kmpStart and wall: another attempt can be performed with start = skipStart = 12.
Example: step 3
• start = 12, wall = 19.
• Match: k = 7, so P occurs in T at position 12.
• KMP's shift: kmpStart = 19.
• Skip's shift: skipStart = 16.
• Case 1. skipStart < kmpStart (16 < 19): a shift according to the Skip Search algorithm is applied, which gives a new value for skipStart, and skipStart and kmpStart must be compared again.
Example: step 3-1
• start = 12.
• k = 0, so the Skip Search algorithm is used.
• The next attempt is performed with start = 19.
Example: step 4
• start = 19, wall = 21.
• k = 2.
• KMP's shift: kmpStart = 21.
• Skip's shift: skipStart = 21.
• Case 3. skipStart = kmpStart (21 = 21): another attempt can be performed with start = skipStart = 21.
Example: step 5
• start = 21.
• k = 0, so the Skip Search algorithm is used.
• The next attempt is performed with start = 25.
Example: step 6
• start = 25, wall = 26.
• k = 1.
• KMP's shift: kmpStart = 26.
• Skip's shift: skipStart = 28.
• Case 4. Here kmpStart = wall = 26 and skipStart = 28, so skipStart lies beyond both kmpStart and wall: another attempt can be performed with start = skipStart = 28.
Example: step 7
• start = 28.
• Match: k = 7, so P occurs in T at position 28.
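The whole walk-through can be reproduced with a compact search routine. The Python sketch below is one possible implementation that follows the case analysis of these slides; it is a sketch under stated assumptions, not the authors' code. The names `kmp_skip_search`, `next_table`, `next_sample` and `trace` are illustrative, the text is sampled every m characters as in Skip Search, and a dictionary stands in for the bucket array.

```python
def next_table(p, strong):
    """mpNext (strong=False) or kmpNext (strong=True) for pattern p."""
    m, nxt, i, j = len(p), [-1] * (len(p) + 1), 0, -1
    while i < m:
        while j > -1 and p[i] != p[j]:
            j = nxt[j]
        i, j = i + 1, j + 1
        nxt[i] = nxt[j] if strong and i < m and p[i] == p[j] else j
    return nxt


def kmp_skip_search(text, pattern):
    """Return (matches, trace); trace holds (start, k, wall) per attempt."""
    n, m = len(text), len(pattern)
    matches, trace = [], []
    if m == 0 or m > n:
        return matches, trace
    z, lst = {}, [-1] * m            # buckets Z and table List
    for i, c in enumerate(pattern):
        lst[i] = z.get(c, -1)
        z[c] = i
    mp_next = next_table(pattern, False)
    kmp_next = next_table(pattern, True)
    per = m - kmp_next[m]            # shift applied after a full match

    def next_sample(j):
        """Advance j by m until text[j] occurs in the pattern."""
        j += m
        while j < n and text[j] not in z:
            j += m
        return (z[text[j]], j) if j < n else None

    sample = next_sample(-1)
    if sample is None:
        return matches, trace
    i, j = sample
    start, wall = j - i, 0
    while start <= n - m:
        wall = max(wall, start)
        k = wall - start             # prefix already known to match
        while k < m and pattern[k] == text[start + k]:
            k += 1
        wall = start + k
        trace.append((start, k, wall))
        if k == m:
            matches.append(start)
            i -= per
        else:
            i = lst[i]
        if i < 0:
            sample = next_sample(j)
            if sample is None:
                break
            i, j = sample
        kmp_start = start + k - kmp_next[k]
        k = kmp_next[k]
        start = j - i                # skipStart
        while start < kmp_start or kmp_start < start < wall:
            if start < kmp_start:    # Case 1: Skip Search shift
                i = lst[i]
                if i < 0:
                    sample = next_sample(j)
                    if sample is None:
                        return matches, trace
                    i, j = sample
                start = j - i
            else:                    # Case 2: Morris-Pratt shift
                kmp_start += k - mp_next[k]
                k = mp_next[k]
        # Loop exit corresponds to Case 3 or Case 4: attempt at start.
    return matches, trace


if __name__ == "__main__":
    T = "ACTACATATAGGACTACGTACCAGCATTACTACGTT"
    found, steps = kmp_skip_search(T, "ACTACGT")
    print(found)                     # [12, 28]
    print([s for s, _, _ in steps])  # [0, 9, 12, 19, 21, 25, 28]
```

The successive values of start in the trace are 0, 9, 12, 19, 21, 25 and 28, which are the alignments of steps 1 to 7, and the two reported occurrences are at positions 12 and 28.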
Time Complexity
• The preprocessing phase of the KMP Skip Search algorithm runs in O(m + σ) time, where σ is the size of the alphabet.
• The searching phase of the KMP Skip Search algorithm runs in O(n) time.