Bioinformatics Algorithms and Data Structures

Bioinformatics Algorithms and Data Structures Chapter 2: KMP Algorithm Lecturer: Dr. Rose Slides by: Dr. Rose January 30, 2007

KMP Algorithm • Preliminaries: • KMP can be easily explained in terms of finite state machines. • KMP has an easily proved linear bound. • KMP is usually not the method of choice.

KMP Algorithm • Recall that the naïve approach to string matching is Θ(mn). • How can we reduce this complexity? • Avoid redundant comparisons • Use larger shifts • Boyer-Moore good suffix rule • Boyer-Moore extended bad character rule

KMP Algorithm • KMP finds larger shifts by recognizing patterns in P. • Let spi(P) denote the length of the longest proper suffix of P[1..i] that matches a prefix of P. • By definition sp1 = 0 for any string. • Q: Why does this make sense? • A: The proper suffix must be the empty string

KMP Algorithm • Example: P = abcaeabcabd • P[1..2] = ab hence sp2 = ? • sp2 = 0 • P[1..3] = abc hence sp3 = ? • sp3 = 0 • P[1..4] = abca hence sp4 = ? • sp4 = 1 • P[1..5] = abcae hence sp5 = ? • sp5 = 0 • P[1..6] = abcaea hence sp6 = ? • sp6 = 1

KMP Algorithm • Example Continued • P[1..7] = abcaeab hence sp7 = ? • sp7 = 2 • P[1..8] = abcaeabc hence sp8 = ? • sp8 = 3 • P[1..9] = abcaeabca hence sp9= ? • sp9 = 4 • P[1..10] = abcaeabcab hence sp10 = ? • sp10 = 2 • P[1..11] = abcaeabcabd hence sp11 = ? • sp11 = 0
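The sp values in the example above can be checked mechanically. The following is a minimal sketch that applies the definition of spi directly; it is deliberately slow and only meant for verification. The function name `sp_values` is a hypothetical helper, not something from the slides.

```python
def sp_values(pattern: str) -> list[int]:
    """Return [sp_1, ..., sp_n], computed straight from the definition.

    Brute force for checking small examples; not the linear-time method.
    """
    n = len(pattern)
    sp = []
    for i in range(1, n + 1):
        best = 0
        # A proper suffix of P[1..i] has length at most i - 1;
        # try the longest candidates first.
        for length in range(i - 1, 0, -1):
            if pattern[i - length:i] == pattern[:length]:
                best = length
                break
        sp.append(best)
    return sp


print(sp_values("abcaeabcabd"))  # [0, 0, 0, 1, 0, 1, 2, 3, 4, 2, 0]
```

The printed list agrees with sp1 through sp11 as worked out on the two example slides.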

KMP Algorithm • Like the a/a concept for Boyer-Moore, there is an analogous spi/sp´i concept. • Let sp´i(P) denote the length of the longest proper suffix of P[1..i] that matches a prefix of P, with the added condition that characters P(i + 1) and P(sp´i + 1) are unequal. • Example: P = abcdabce, sp´7 = 3. • Obviously sp´i(P) <= spi(P), since the latter is less restrictive.
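To see the extra condition in action, here is a small brute-force sketch of sp´i taken directly from the definition. The name `sp_prime_values` is hypothetical, and the treatment of i = n (where P(n + 1) does not exist, so the extra condition is taken to hold) is an assumption chosen to agree with the convention spn = sp´n used later in the preprocessing slides.

```python
def sp_prime_values(pattern: str) -> list[int]:
    """Return [sp'_1, ..., sp'_n] by brute force from the definition."""
    n = len(pattern)
    result = []
    for i in range(1, n + 1):
        best = 0
        for length in range(i - 1, 0, -1):
            suffix_matches = pattern[i - length:i] == pattern[:length]
            # 1-based P(i + 1) is pattern[i]; P(length + 1) is pattern[length].
            # Assumption: for i = n there is no P(n + 1), so the condition holds.
            next_differs = i == n or pattern[i] != pattern[length]
            if suffix_matches and next_differs:
                best = length
                break
        result.append(best)
    return result


print(sp_prime_values("abcdabce"))  # [0, 0, 0, 0, 0, 0, 3, 0]
```

For P = abcdabce only position 7 has a nonzero value: the suffix abc matches the prefix abc, and P(8) = e differs from P(4) = d. Positions 5 and 6 have sp values 1 and 2 but sp´ values 0, because the following characters agree.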

KMP Algorithm • KMP Shift Rule: • Mismatch case: • Let position i+1 in P and position k in T be the first mismatch in a left-to-right scan. • Shift P to the right, aligning P[1..sp´i] with T[k-sp´i..k-1]. • Match case: • If no mismatch is found, an occurrence of P has been found. • Shift P by n – sp´n spaces to continue searching for other occurrences.

KMP Algorithm • Observations: • The prefix P[1..sp´i] of the shifted P lines up with a substring of T that it is already known to match. • Subsequent character matching proceeds from position sp´i + 1. • Unlike Boyer-Moore, the matched substring is not compared again. • The shift rule based on sp´i guarantees that the exact same mismatch won’t occur at sp´i + 1, but it doesn’t guarantee that P(sp´i + 1) = T(k).

KMP Algorithm • Example: P = abcxabcde • If a mismatch occurs at position 8, P will be shifted 4 positions to the right. • Q: Where did the 4 position shift come from? • A: The number of positions is given by i - sp´i; in this example i = 7, sp´7 = 3, and 7 – 3 = 4. • Notice that we know the amount of shift without knowing anything about T other than that there was a mismatch at position 8.

KMP Algorithm • Example Continued: P = abcxabcde • After the shift, P[1..3] lines up with T[k-3..k-1]. • Since it is known that P[1..3] must match T[k-3..k-1], no comparison is needed. • The scan continues from P(4) & T(k). • Advantages of KMP Shift Rule • P is often shifted by more than 1 character (i - sp´i). • The left-most sp´i characters in the shifted P are known to match the corresponding characters in T.

KMP Algorithm Full Example: T = xyabcxabcxadcdqfeg, P = abcxabcde. Assume that we have already shifted past the first two positions in T.
T: xyabcxabcxadcdqfeg
P:   abcxabcde
            ^ 8: positions 1 to 7 match, d != x, shift 4 places
P:       abcxabcde
            ^ 4: start again from position 4

Preprocessing for KMP Approach: show how to derive sp´ values from Z values. Definition: Position j > 1 maps to i if i = j + Zj(P) – 1. • Recall that Zj(P) denotes the length of the Z-box starting at position j. • This says that j maps to i if i is the right end of a Z-box starting at j.

Preprocessing for KMP Theorem. For any i > 1, sp´i(P) = Zj = i – j + 1, where j > 1 is the smallest position that maps to i. If no such j exists then sp´i(P) = 0. Similarly for sp: For any i > 1, spi(P) = i – j + 1, where j, i ≥ j > 1, is the smallest position that maps to i or beyond. If no such j exists then spi(P) = 0.

Preprocessing for KMP Given the theorem from the preceding slide, the sp´i and spi values can be computed in linear time using the Zj values:
For i = 1 to n { sp´i = 0; }
For j = n downto 2 { i = j + Zj(P) – 1; sp´i = Zj; }
spn(P) = sp´n(P);
For i = n – 1 downto 2 { spi(P) = max[spi+1(P) – 1, sp´i(P)]; }
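The preprocessing loops above translate almost line for line into code. The sketch below assumes a standard linear-time Z algorithm (the slides rely on Z values from an earlier chapter, so `z_values` here is one common way to compute them, not necessarily the lecturer's). The function names and the choice to leave index 0 unused are illustrative assumptions.

```python
def z_values(p: str) -> list[int]:
    """z[j] (0-based) = length of the longest substring starting at j that
    matches a prefix of p. z[0] is left at 0 by convention here."""
    n = len(p)
    z = [0] * n
    left = right = 0  # current rightmost Z-box is p[left:right]
    for j in range(1, n):
        if j < right:
            z[j] = min(right - j, z[j - left])
        while j + z[j] < n and p[z[j]] == p[j + z[j]]:
            z[j] += 1
        if j + z[j] > right:
            left, right = j, j + z[j]
    return z


def sp_from_z(p: str) -> tuple[list[int], list[int]]:
    """Return (sp_prime, sp), indexed 1..n as on the slides (index 0 unused)."""
    n = len(p)
    z = z_values(p)
    sp_prime = [0] * (n + 1)
    for j in range(n, 1, -1):       # j = n downto 2 (1-based positions)
        zj = z[j - 1]
        i = j + zj - 1              # right end of the Z-box starting at j
        sp_prime[i] = zj            # smaller j overwrite later, as required
    sp = sp_prime[:]                # starts from sp_n = sp'_n
    for i in range(n - 1, 1, -1):   # i = n - 1 downto 2
        sp[i] = max(sp[i + 1] - 1, sp_prime[i])
    return sp_prime, sp


sp_prime, sp = sp_from_z("abcaeabcabd")
print(sp[1:])        # [0, 0, 0, 1, 0, 1, 2, 3, 4, 2, 0]
print(sp_prime[1:])  # [0, 0, 0, 1, 0, 0, 0, 0, 4, 2, 0]
```

Running j from n down to 2 means that when several positions map to the same i, the smallest j is written last and therefore wins, which is exactly what the theorem asks for.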

Preprocessing for KMP Defn. Failure function F´(i) = sp´i-1 + 1 (subscript i – 1), 1 ≤ i ≤ n + 1, sp´0 = 0 (similarly F(i) = spi-1 + 1, 1 ≤ i ≤ n + 1, sp0 = 0) • Idea: • We maintain a pointer i in P and c in T. • After a mismatch at P(i+1) with T(c), shift P to align P(sp´i + 1) with T(c), i.e., i = sp´i + 1. • Special case 1: i = 1 ⇒ set i = F´(1) = 1 & c = c + 1 • Special case 2: we find P in T ⇒ shift n - sp´n spaces, i.e., i = F´(n + 1) = sp´n + 1.

Full KMP Algorithm
Preprocess P to find F´(k) = sp´k-1 + 1 for k from 1 to n + 1
c = 1; p = 1;
While c + (n – p) ≤ m {
  While p ≤ n and P(p) = T(c) { p = p + 1; c = c + 1; }
  If (p = n + 1) then report an occurrence of P at position c – n of T.
  If (p = 1) then c = c + 1;
  p = F´(p);
}
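The pseudocode above can be turned into a runnable sketch with very few changes. The version below keeps the 1-based pointers c and p from the slide and converts to 0-based indexing only when reading characters. The helper names (`z_values`, `failure_function`, `kmp_search`) are hypothetical, and the Z-based preprocessing is assumed to be the one from the preprocessing slides.

```python
def z_values(p: str) -> list[int]:
    """Standard Z algorithm; z[j] is 0-based and z[0] is left at 0."""
    n = len(p)
    z = [0] * n
    left = right = 0
    for j in range(1, n):
        if j < right:
            z[j] = min(right - j, z[j - left])
        while j + z[j] < n and p[z[j]] == p[j + z[j]]:
            z[j] += 1
        if j + z[j] > right:
            left, right = j, j + z[j]
    return z


def failure_function(p: str) -> list[int]:
    """f[k] = F'(k) = sp'_{k-1} + 1 for k = 1..n+1 (f[0] unused)."""
    n = len(p)
    z = z_values(p)
    sp_prime = [0] * (n + 1)            # sp'_0 = 0
    for j in range(n, 1, -1):
        sp_prime[j + z[j - 1] - 1] = z[j - 1]
    return [0] + [sp_prime[k - 1] + 1 for k in range(1, n + 2)]


def kmp_search(pattern: str, text: str) -> list[int]:
    """Return the 1-based start positions of all occurrences of pattern."""
    n, m = len(pattern), len(text)
    if n == 0 or n > m:
        return []
    f = failure_function(pattern)
    occurrences = []
    c, p = 1, 1                          # pointers into T and P, 1-based
    while c + (n - p) <= m:
        while p <= n and pattern[p - 1] == text[c - 1]:
            p += 1
            c += 1
        if p == n + 1:
            occurrences.append(c - n)
        if p == 1:
            c += 1                       # mismatch at P(1): advance in T
        p = f[p]
    return occurrences


print(kmp_search("abcxabcde", "xyabcxabcxabcdefeg"))  # [7]
```

The outer loop condition guarantees that c never runs past the end of T while p ≤ n, so no separate bounds check on c is needed inside the inner loop.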

Full KMP Algorithm Trace of the loop above on T = xyabcxabcxabcdefeg, P = abcxabcde. Iteration 1 (c = 1, p = 1): P(1) = a != T(1) = x. p != n+1; p = 1 ⇒ c = 2; p = F´(1) = 1.
T: xyabcxabcxabcdefeg
P: abcxabcde
   ^ 1: a != x

Full KMP Algorithm Trace, iteration 2 (c = 2, p = 1): P(1) = a != T(2) = y. p != n+1; p = 1 ⇒ c = 3; p = F´(1) = 1.
T: xyabcxabcxabcdefeg
P:  abcxabcde
    ^ 1: a != y

Full KMP Algorithm Trace, iteration 3 (c = 3, p = 1): P[1..7] matches T[3..9], then P(8) = d != T(10) = x. p != n+1; p = 8, not 1 ⇒ don’t change c; p = F´(8) = 4.
T: xyabcxabcxabcdefeg
P:   abcxabcde
            ^ 8: d != x

Full KMP Algorithm Trace, iteration 4 (p = 4, c = 10): P[4..9] matches T[10..15], so p = n+1 and an occurrence of P is reported at position c – n = 16 – 9 = 7.
T: xyabcxabcxabcdefeg
P:       abcxabcde
            ^^^^^^ positions 4 to 9 match

Real-Time KMP • Q: What is meant by real-time algorithms? • A: Typically these are algorithms that are meant to interact synchronously with the real world. • This implies a known, fixed turn-around time for processing a task. • Many embedded scheduling systems are examples involving real-time algorithms. • For KMP this means that we require a constant bound on the time spent processing each character of T.

Real-Time KMP • Q: Why is KMP not real-time? • A: For any mismatched character in T, we may try matching it several times. • Recall that sp´i only guarantees that P(i + 1) and P(sp´i + 1) differ. • There is NO guarantee that P(sp´i + 1) and T(k) match. • We need to ensure that a mismatch at T(k) does NOT entail additional comparisons at T(k). • This means that we have to compute sp´i values with respect to all characters in the alphabet Σ, since any could appear in T.

Real-Time KMP • Define: sp´(i,x)(P) to be the length of the longest proper suffix of P[1..i] that matches a prefix of P, with the added condition that character P(sp´(i,x) + 1) is x. • This will tell us exactly what shift to use for each possible mismatch. • A mismatched character T(k) will never be involved in subsequent comparisons.

Real-Time KMP • Q: How do we know that the mismatched character T(k) will never be involved in subsequent comparisons? • A: Because the shift moves P so that either a matching character of P aligns with T(k) or P is shifted past T(k). • This results in a real-time version of KMP. • Let’s consider how we can find the sp´(i,x)(P) values in linear time.

Real-Time KMP Thm. For P(i + 1) ≠ x, sp´(i,x)(P) = i – j + 1 • Here j is the smallest position such that j maps to i and P(Zj + 1) = x. • If there is no such j then sp´(i,x)(P) = 0
For i = 1 to n { sp´(i,x) = 0 for every character x; }
For j = n downto 2 { i = j + Zj(P) – 1; x = P(Zj + 1); sp´(i,x) = Zj; }

Real-Time KMP
For i = 1 to n { sp´(i,x) = 0 for every character x; }
For j = n downto 2 { i = j + Zj(P) – 1; x = P(Zj + 1); sp´(i,x) = Zj; }
• Notice how this works: • Starting from the right • Find i, the right end of the Z box associated with j. • Find x, the character immediately following the prefix corresponding to this Z box. • Set sp´(i,x) = Zj, the length of this Z box.
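The table construction above, together with a matcher that uses it, can be sketched as follows. This is one possible realization under stated assumptions, not the lecturer's code: the table is stored as one dictionary per position so that only nonzero-relevant entries are kept (instead of initializing an entry for every character x), lookups are assumed to take expected constant time, and the fallback for a missing entry (restart at 1 if x equals P(1), else at 0) is added here to cover the empty-suffix case after a full match. All function names are hypothetical.

```python
def z_values(p: str) -> list[int]:
    """Standard Z algorithm; z[j] is 0-based and z[0] is left at 0."""
    n = len(p)
    z = [0] * n
    left = right = 0
    for j in range(1, n):
        if j < right:
            z[j] = min(right - j, z[j - left])
        while j + z[j] < n and p[z[j]] == p[j + z[j]]:
            z[j] += 1
        if j + z[j] > right:
            left, right = j, j + z[j]
    return z


def sp_prime_by_char(p: str) -> list[dict[str, int]]:
    """table[i][x] = sp'(i,x) for 1-based i; a missing key means no entry."""
    n = len(p)
    z = z_values(p)
    table = [dict() for _ in range(n + 1)]
    for j in range(n, 1, -1):          # j = n downto 2
        zj = z[j - 1]
        i = j + zj - 1                 # right end of the Z-box at j
        x = p[zj]                      # 1-based P(Z_j + 1)
        table[i][x] = zj
    return table


def real_time_kmp(pattern: str, text: str) -> list[int]:
    """Return 1-based start positions; each text character is read once."""
    n = len(pattern)
    if n == 0:
        return []
    table = sp_prime_by_char(pattern)
    occurrences = []
    q = 0                              # number of pattern characters matched
    for k, x in enumerate(text, start=1):
        if q < n and pattern[q] == x:
            q += 1
        else:
            border = table[q].get(x)
            if border is not None:
                q = border + 1         # P(border + 1) = x, so x is consumed
            else:
                q = 1 if pattern[0] == x else 0
        if q == n:
            occurrences.append(k - n + 1)
    return occurrences


print(real_time_kmp("abcxabcde", "xyabcxabcxabcdefeg"))  # [7]
```

In the example, the mismatch of P(8) = d against T(10) = x is resolved by the single entry sp´(7,x) = 3, so matching resumes with four characters matched and T(10) is never examined again.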
