Knuth-Morris-Pratt Algorithm

Knuth-Morris-Pratt Algorithm
• left-to-right scan like the naïve algorithm
• one main improvement
  • on a mismatch, calculate the maximum possible shift to the right for the pattern

Basic Idea
• Definition
  • For each position i in pattern P, define sp_i(P) to be the length of the longest proper suffix of P[1..i] that matches a prefix of P
  • Define sp'_i(P) to have the added condition that P(i+1) is not equal to P(sp'_i(P) + 1)
  • may denote as sp_i and sp'_i when P is clear from context
• Usage
  • a mismatch occurs between P(i+1) and T(k)
  • shift P to the right so that P(sp'_i + 1) aligns with T(k)
  • this shifts P by i - sp'_i places in total
  • if P is found, shift by n - sp'_n places (n = |P|)
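The definitions above can be checked by brute force. The following is a minimal sketch, not an efficient method: the function name `sp_values` is hypothetical, the lists are indexed 1..n to match the slides (index 0 is unused), and it assumes the convention sp'_n = sp_n, since there is no P(n+1) to compare against.

```python
def sp_values(P):
    """Return (sp, sp_prime), indexed 1..n (index 0 unused), straight from
    the definitions. Quadratic or worse; for illustration only."""
    n = len(P)
    sp = [0] * (n + 1)
    sp_prime = [0] * (n + 1)
    for i in range(1, n + 1):
        # Try proper-suffix lengths of P[1..i], longest first.
        for length in range(i - 1, 0, -1):
            if P[i - length:i] == P[:length]:
                if sp[i] == 0:
                    sp[i] = length  # first hit is the longest one
                # sp' also needs P(i+1) != P(length+1). Assumption: at
                # i = n the extra condition is treated as satisfied.
                if i == n or P[i] != P[length]:
                    sp_prime[i] = length
                    break
    return sp, sp_prime


sp, sp_prime = sp_values("abcabdabd")
i = 8                     # mismatch between P(9) and some T(k)
print(i - sp_prime[i])    # shift distance i - sp'_i, prints 6
```

Here sp'_8 = 2, so after the shift P(3) is aligned with the text character that caused the mismatch.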

Illustration of sp and sp'
i         1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17
P(i)      a  b  c  d  a  b  c  e  a  b  c  d  a  b  c  e  f
sp_i      0  0  0  0  1  2  3  0  1  2  3  4  5  6  7  8  0
sp'_i     0  0  0  0  0  0  3  0  0  0  0  0  0  0  3  8  0

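Illustration 1 of KMP shift
i         1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17
T         x  y  a  b  c  a  b  d  a  b  c  f  q  f  e  a  b
P               a  b  c  a  b  d  a  b  d
P shifted                         a  b  c  a  b  d  a  b  d
Mismatch between P(9) and T(11); sp'_8 = 2, so P shifts 8 - 2 = 6 places and P(3) = c aligns with T(11) = c.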

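Illustration 2 of KMP shift
i         1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17
T         x  y  a  b  x  a  b  d  a  b  c  f  q  f  e  a  b
P               a  b  x  a  b  d  a  b  d
P shifted                         a  b  x  a  b  d  a  b  d
Mismatch between P(9) and T(11); sp'_8 = 2, so P shifts 8 - 2 = 6 places and P(3) = x aligns with T(11) = c.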

sp'_i and Z-boxes
• Definitions
  • Position j > 1 maps to i if i is the right end of a Z-box that starts at j
  • Note: i = j + Z_j - 1 in this case
• Observation
  • For any i > 1, sp'_i = 0 if no j maps to i
  • Otherwise, sp'_i = max Z_j over all j that map to i
  • Choosing the smallest j that maps to i gives the maximum possible Z_j value

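Z-based computation of sp'_i
    for (i = 1; i <= n; i++)
        sp'[i] = 0;
    for (j = n; j >= 2; j--) {
        i = j + Z[j] - 1;    /* right end of the Z-box that starts at j */
        sp'[i] = Z[j];       /* a smaller j overwrites, so the largest Z[j] wins */
    }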
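A runnable sketch of the Z-based computation, under stated assumptions: the slides take the Z values as given, so the helper `z_values` below is one standard linear-time way to obtain them, and both function names are hypothetical. Arrays are indexed from 1 to match the slides.

```python
def z_values(P):
    """Z[j] for j = 2..n (1-indexed; Z[0] and Z[1] unused): length of the
    longest substring of P starting at j that matches a prefix of P."""
    n = len(P)
    Z = [0] * (n + 1)
    l = r = 0  # rightmost Z-box found so far, as positions l..r
    for j in range(2, n + 1):
        if j <= r:
            # Reuse the value computed for the matching prefix position.
            Z[j] = min(Z[j - l + 1], r - j + 1)
        # Extend by explicit comparisons: P(j + Z[j]) against P(Z[j] + 1).
        while j + Z[j] <= n and P[j + Z[j] - 1] == P[Z[j]]:
            Z[j] += 1
        if j + Z[j] - 1 > r:
            l, r = j, j + Z[j] - 1
    return Z


def sp_prime_from_z(P):
    """sp'[i] for i = 1..n (index 0 unused), following the loop on the slide."""
    n = len(P)
    Z = z_values(P)
    sp_prime = [0] * (n + 1)
    for j in range(n, 1, -1):      # j = n down to 2
        i = j + Z[j] - 1           # right end of the Z-box starting at j
        sp_prime[i] = Z[j]         # smallest j is written last, so it wins
    return sp_prime


print(sp_prime_from_z("abcabdabd")[1:])  # [0, 0, 0, 0, 2, 0, 0, 2, 0]
```

Scanning j downward is what implements the "smallest j" rule: when several Z-boxes end at the same i, the last write comes from the leftmost start, which has the largest Z value.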

Observations
• Original KMP defined in terms of failure functions F(i) and F'(i)
  • F'(i) = sp'_(i-1) + 1 and F(i) = sp_(i-1) + 1 for i = 1 to n+1, with sp_0 = sp'_0 = 0
• 2m upper bound on the number of comparisons, where m = |T|
  • once a position in T matches, it is never compared again to any position in P
  • positions in T that mismatch may be compared against multiple positions in P, but this can happen at most m times in total
• Full implementation of KMP is on page 27
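The comparison bound can be observed on a small search routine. This is a sketch under stated assumptions, not the page-27 implementation: the name `kmp_search` is hypothetical, it works with sp' directly rather than with failure functions, and it obtains sp' from sp by the recurrence sp'_i = sp_i if P(i+1) differs from P(sp_i + 1), else sp'_(sp_i), which is a swapped-in alternative to the Z-based method. P is assumed to be non-empty.

```python
def kmp_search(P, T):
    """Return (1-indexed start positions of P in T, number of character
    comparisons between P and T). Assumes P is non-empty."""
    n, m = len(P), len(T)

    # sp[i] for i = 1..n by the classic linear-time recurrence.
    sp = [0] * (n + 1)
    k = 0
    for i in range(2, n + 1):
        while k > 0 and P[i - 1] != P[k]:
            k = sp[k]
        if P[i - 1] == P[k]:
            k += 1
        sp[i] = k

    # sp'[i] from sp[i]; at i = n there is no P(n+1), so sp'_n = sp_n.
    sp_prime = [0] * (n + 1)
    for i in range(1, n + 1):
        v = sp[i]
        if i == n or v == 0 or P[i] != P[v]:
            sp_prime[i] = v
        else:
            sp_prime[i] = sp_prime[v]

    occurrences, comparisons = [], 0
    p, c = 0, 0  # p: pattern characters matched so far; c: 0-based index in T
    while c + (n - p) <= m:          # current alignment still fits inside T
        while p < n:
            comparisons += 1
            if P[p] != T[c]:
                break
            p += 1
            c += 1
        if p == n:
            occurrences.append(c - n + 1)
        if p == 0:
            c += 1                   # mismatch on P(1): move on in T
        else:
            p = sp_prime[p]          # shift P by p - sp'_p places
    return occurrences, comparisons


hits, count = kmp_search("abab", "abababab")
print(hits, count <= 2 * len("abababab"))  # [1, 3, 5] True
```

After a shift the scan resumes at the same text position c, which is why only mismatching text positions can be compared more than once.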

FSA KMP algorithm
• Definition
  • For each position i in pattern P and each character x in Σ, define sp_(i,x)(P) to be the length of the longest proper suffix of P[1..i] that matches a prefix of P, with the added condition that P(sp_(i,x) + 1) = x
• Observation
  • Now each position in T will be compared exactly once, even on a mismatch

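Z-based computation of sp_(i,x)
    for (i = 1; i <= n; i++)
        for (all x in Σ)
            sp(i,x) = 0;
    for (j = n; j >= 2; j--) {
        i = j + Z[j] - 1;    /* right end of the Z-box that starts at j */
        x = P(Z[j] + 1);     /* character that follows the matched prefix */
        sp(i,x) = Z[j];
    }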
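A sketch of the same loop in runnable form, with some assumptions made for illustration: the names `z_values` and `sp_table` are hypothetical, the Z values come from a standard linear-time routine that the slides do not spell out, and the table is stored as one dictionary per position so that characters never written default to 0 instead of being initialized for all of Σ.

```python
def z_values(P):
    """Z[j] for j = 2..n (1-indexed; Z[0] and Z[1] unused)."""
    n = len(P)
    Z = [0] * (n + 1)
    l = r = 0  # rightmost Z-box found so far, as positions l..r
    for j in range(2, n + 1):
        if j <= r:
            Z[j] = min(Z[j - l + 1], r - j + 1)
        while j + Z[j] <= n and P[j + Z[j] - 1] == P[Z[j]]:
            Z[j] += 1
        if j + Z[j] - 1 > r:
            l, r = j, j + Z[j] - 1
    return Z


def sp_table(P):
    """table[i][x] = sp_(i,x) for i = 1..n; a missing key means 0."""
    n = len(P)
    Z = z_values(P)
    table = [dict() for _ in range(n + 1)]   # index 0 unused
    for j in range(n, 1, -1):                # j = n down to 2
        i = j + Z[j] - 1                     # right end of the Z-box at j
        x = P[Z[j]]                          # P(Z_j + 1) in 1-indexed terms
        table[i][x] = Z[j]                   # smallest j is written last
    return table


table = sp_table("abcabdabd")
print(table[8].get("c", 0))   # 2
print(table[8].get("f", 0))   # 0
```

For the pattern abcabdabd, a mismatch after 8 matched characters against the text character c yields a shift that keeps 2 characters matched and already knows that P(3) = c agrees with the text, so that text position need not be compared again.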
