
Bioinformatics Algorithms and Data Structures

Learn how the Knuth-Morris-Pratt (KMP) Algorithm improves string matching efficiency by avoiding redundant comparisons and using smart shifting strategies. Explore its applications and comparison with other algorithms.


Presentation Transcript


  1. Bioinformatics Algorithms and Data Structures • Chapter 2: KMP Algorithm • Lecturer: Dr. Rose • Slides by: Dr. Rose • January 28, 2003

  2. KMP Algorithm • Preliminaries: • KMP can be easily explained in terms of finite state machines. • KMP has an easily proved linear time bound. • KMP is usually not the method of choice.

  3. KMP Algorithm • Recall that the naïve approach to string matching is Θ(mn). • How can we reduce this complexity? • Avoid redundant comparisons • Use larger shifts • Boyer-Moore good suffix rule • Boyer-Moore extended bad character rule

  4. KMP Algorithm • KMP finds larger shifts by recognizing patterns in P. • Let sp_i(P) denote the length of the longest proper suffix of P[1..i] that matches a prefix of P. • By definition sp_1 = 0 for any string. • Q: Why does this make sense? • A: The only proper suffix of the single character P[1..1] is the empty string.

  5. KMP Algorithm • Example: P = abcaeabcabd • P[1..2] = ab hence sp_2 = ? • sp_2 = 0 • P[1..3] = abc hence sp_3 = ? • sp_3 = 0 • P[1..4] = abca hence sp_4 = ? • sp_4 = 1 • P[1..5] = abcae hence sp_5 = ? • sp_5 = 0 • P[1..6] = abcaea hence sp_6 = ? • sp_6 = 1

  6. KMP Algorithm • Example Continued • P[1..7] = abcaeab hence sp_7 = ? • sp_7 = 2 • P[1..8] = abcaeabc hence sp_8 = ? • sp_8 = 3 • P[1..9] = abcaeabca hence sp_9 = ? • sp_9 = 4 • P[1..10] = abcaeabcab hence sp_10 = ? • sp_10 = 2 • P[1..11] = abcaeabcabd hence sp_11 = ? • sp_11 = 0
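The sp values in this example can be checked mechanically. The following is a minimal brute-force sketch that applies the definition of sp_i directly; the function name `sp_values` is an assumption made for illustration, and the method is deliberately naive (it is not the linear-time preprocessing KMP uses).

```python
def sp_values(pattern: str) -> list[int]:
    """Return [sp_1, ..., sp_n] straight from the definition (slow; for checking only)."""
    n = len(pattern)
    result = []
    for i in range(1, n + 1):
        prefix = pattern[:i]                    # P[1..i] in the slides' 1-based notation
        best = 0
        for length in range(i - 1, 0, -1):      # proper suffixes, longest first
            if prefix[i - length:] == pattern[:length]:
                best = length
                break
        result.append(best)
    return result


print(sp_values("abcaeabcabd"))  # [0, 0, 0, 1, 0, 1, 2, 3, 4, 2, 0]
```

The printed list agrees with sp_1 through sp_11 as worked out on the two example slides.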

  7. KMP Algorithm • Like the a/a concept for Boyer-Moore, there is an analogous sp_i/sp´_i concept. • Let sp´_i(P) denote the length of the longest proper suffix of P[1..i] that matches a prefix of P, with the added condition that characters P(i + 1) and P(sp´_i + 1) are unequal. • Example: P = abcdabce, sp´_7 = 3 • Obviously sp´_i(P) <= sp_i(P), since the latter is less restrictive.
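The extra condition on sp´_i can also be checked by brute force. This is a sketch under stated assumptions: the name `sp_prime_values` is hypothetical, and at i = n, where P(i + 1) does not exist, the inequality is treated as satisfied so that sp´_n = sp_n.

```python
def sp_prime_values(pattern: str) -> list[int]:
    """Return [sp'_1, ..., sp'_n] straight from the definition (slow; for checking only)."""
    n = len(pattern)
    result = []
    for i in range(1, n + 1):
        best = 0
        for length in range(i - 1, 0, -1):      # proper suffixes of P[1..i], longest first
            suffix_matches = pattern[i - length:i] == pattern[:length]
            # pattern[i] is P(i+1) and pattern[length] is P(length+1) in 1-based terms.
            # Assumption: at i == n there is no P(i+1), so the condition holds trivially.
            next_differs = i == n or pattern[i] != pattern[length]
            if suffix_matches and next_differs:
                best = length
                break
        result.append(best)
    return result


print(sp_prime_values("abcdabce"))  # [0, 0, 0, 0, 0, 0, 3, 0]
```

For P = abcdabce the plain sp values at positions 5 and 6 are 1 and 2, but sp´_5 and sp´_6 are 0 because the character following the suffix equals the character following the prefix; only sp´_7 = 3 survives the extra condition.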

  8. KMP Algorithm • KMP Shift Rule: • Mismatch case: • Let position i + 1 in P and position k in T be the first mismatch in a left-to-right scan. • Shift P to the right, aligning P[1..sp´_i] with T[k – sp´_i..k – 1] • Match case: • If no mismatch is found, an occurrence of P has been found. • Shift P by n – sp´_n spaces (n is the length of P) to continue searching for other occurrences.

  9. KMP Algorithm • Observations: • The prefix P[1..sp´_i] of the shifted P is shifted to match the corresponding substring in T. • Subsequent character matching proceeds from position sp´_i + 1 • Unlike Boyer-Moore, the matched substring is not compared again. • The shift rule based on sp´_i guarantees that the exact same mismatch won’t occur at sp´_i + 1 but doesn’t guarantee that P(sp´_i + 1) = T(k)

  10. KMP Algorithm • Example: P = abcxabcde • If a mismatch occurs at position 8, P will be shifted 4 positions to the right. • Q: Where did the 4 position shift come from? • A: The number of positions is given by i – sp´_i; in this example i = 7, sp´_7 = 3, 7 – 3 = 4 • Notice that we know the amount of shift without knowing anything about T other than there was a mismatch at position 8.

  11. KMP Algorithm • Example Continued: P = abcxabcde • After the shift, P[1..3] lines up with T[k – 3..k – 1] • Since it is known that P[1..3] must match T[k – 3..k – 1], no comparison is needed. • The scan continues from P(4) & T(k) • Advantages of KMP Shift Rule • P is often shifted by more than 1 character (i – sp´_i) • The left-most sp´_i characters in the shifted P are known to match the corresponding characters in T.
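The shift computation in this example depends only on P and the mismatch position. A minimal sketch, assuming a hypothetical helper `shift_after_mismatch` that finds sp´_i by brute force and returns both the shift i – sp´_i and the position in P where comparison resumes:

```python
def shift_after_mismatch(pattern: str, mismatch_pos: int) -> tuple[int, int]:
    """Return (shift, resume_pos) for a mismatch at 1-based position mismatch_pos of P.

    Assumes 1 <= mismatch_pos <= len(pattern). Nothing about T is needed.
    """
    if mismatch_pos == 1:
        return 1, 1                      # special case: nothing matched, move one place
    i = mismatch_pos - 1                 # P[1..i] matched, P(i+1) mismatched
    for length in range(i - 1, 0, -1):   # candidate values of sp'_i, longest first
        border = pattern[i - length:i] == pattern[:length]
        if border and pattern[i] != pattern[length]:
            return i - length, length + 1
    return i, 1                          # sp'_i = 0: shift by i, resume at P(1)


print(shift_after_mismatch("abcxabcde", 8))  # (4, 4)
```

The result (4, 4) restates the slide: shift P by 7 – 3 = 4 places and continue comparing from P(4).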

  12. KMP Algorithm • Full Example: T = xyabcxabcxadcdqfeg, P = abcxabcde • Assume that we have already shifted past the first two positions in T.
      T: xyabcxabcxadcdqfeg
      P:   abcxabcde          positions 1 to 8 of P are compared; mismatch at position 8: d != x, shift 4 places
      P:       abcxabcde      P[1..3] is known to match; start again from position 4

  13. Preprocessing for KMP • Approach: show how to derive sp´ values from Z values. • Definition: Position j > 1 maps to i if i = j + Z_j(P) – 1 • Recall that Z_j(P) denotes the length of the Z-box starting at position j. • This says that j maps to i if i is the right end of a Z-box starting at j.

  14. Preprocessing for KMP • Theorem. For any i > 1, sp´_i(P) = Z_j = i – j + 1, where j > 1 is the smallest position that maps to i. If there is no such j then sp´_i(P) = 0. • Similarly for sp: for any i > 1, sp_i(P) = i – j + 1, where j, i ≥ j > 1, is the smallest position that maps to i or beyond. If there is no such j then sp_i(P) = 0.

  15. Preprocessing for KMP • Given the theorem from the preceding slide, the sp´_i and sp_i values can be computed in linear time using the Z_j values:
      For i = 1 to n { sp´_i = 0; }
      For j = n downto 2 { i = j + Z_j(P) – 1; sp´_i = Z_j; }
      sp_n(P) = sp´_n(P);
      For i = n – 1 downto 2 { sp_i(P) = max[sp_(i+1)(P) – 1, sp´_i(P)]; }
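The two loops above translate almost line for line into code once Z values are available. The sketch below is one possible rendering: `z_values` is a standard linear-time Z computation written for this example (the slides assume it from an earlier chapter), and the names `z_values` and `sp_tables` are assumptions. Lists are padded so that index i corresponds to the slides' 1-based i.

```python
def z_values(s: str) -> list[int]:
    """z[k] = length of the longest substring starting at 0-based k that matches a prefix of s."""
    n = len(s)
    z = [0] * n
    left = right = 0                       # current Z-box is s[left:right]
    for k in range(1, n):
        if k < right:
            z[k] = min(right - k, z[k - left])
        while k + z[k] < n and s[z[k]] == s[k + z[k]]:
            z[k] += 1
        if k + z[k] > right:
            left, right = k, k + z[k]
    return z


def sp_tables(pattern: str) -> tuple[list[int], list[int]]:
    """Return (sp_prime, sp), each of length n + 1 with index 0 unused (kept at 0)."""
    n = len(pattern)
    z = z_values(pattern)
    sp_prime = [0] * (n + 1)
    for j in range(n, 1, -1):              # j = n downto 2
        zj = z[j - 1]                      # Z_j in 1-based notation
        i = j + zj - 1                     # right end of the Z-box starting at j
        sp_prime[i] = zj                   # smaller j overwrites later, giving the longest
    sp = [0] * (n + 1)
    sp[n] = sp_prime[n]
    for i in range(n - 1, 1, -1):          # i = n - 1 downto 2
        sp[i] = max(sp[i + 1] - 1, sp_prime[i])
    return sp_prime, sp


sp_prime, sp = sp_tables("abcaeabcabd")
print(sp[1:])        # [0, 0, 0, 1, 0, 1, 2, 3, 4, 2, 0]
print(sp_prime[1:])  # [0, 0, 0, 1, 0, 0, 0, 0, 4, 2, 0]
```

Scanning j from n down to 2 means the smallest j that maps to i writes last, which is what the theorem requires.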

  16. Preprocessing for KMP • Defn. Failure function F´(i) = sp´_(i–1) + 1, 1 ≤ i ≤ n + 1, sp´_0 = 0 (similarly F(i) = sp_(i–1) + 1, 1 ≤ i ≤ n + 1, sp_0 = 0) • Idea: • We maintain a pointer i in P and c in T. • After a mismatch at P(i + 1) with T(c), shift P to align P(sp´_i + 1) with T(c), i.e., the pointer into P becomes sp´_i + 1 = F´(i + 1). • Special case 1: i = 1 ⇒ set i = F´(1) = 1 & c = c + 1 • Special case 2: we find P in T ⇒ shift n – sp´_n spaces, i.e., i = F´(n + 1) = sp´_n + 1.

  17. Full KMP Algorithm • Preprocess P to find F´(k) = sp´_(k–1) + 1 for k from 1 to n + 1
      c = 1; p = 1;
      While c + (n – p) ≤ m {
        While P(p) = T(c) and p ≤ n { p = p + 1; c = c + 1; }
        If (p = n + 1) then report an occurrence of P at position c – n of T.
        If (p = 1) then c = c + 1;
        p = F´(p);
      }
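The loop on this slide can be run as written with small adjustments for 0-based strings. The following is a sketch, not a reference implementation: the names `z_values`, `failure_function`, and `kmp_search` are assumptions, Z values are computed with a standard linear-time routine, and the bound check `p <= n` is placed before the character comparison so the pattern is never indexed out of range.

```python
def z_values(s: str) -> list[int]:
    """z[k] = length of the longest substring starting at 0-based k that matches a prefix of s."""
    n = len(s)
    z = [0] * n
    left = right = 0
    for k in range(1, n):
        if k < right:
            z[k] = min(right - k, z[k - left])
        while k + z[k] < n and s[z[k]] == s[k + z[k]]:
            z[k] += 1
        if k + z[k] > right:
            left, right = k, k + z[k]
    return z


def failure_function(pattern: str) -> list[int]:
    """Return f with f[k] = F'(k) = sp'_(k-1) + 1 for k = 1..n+1 (f[0] is unused)."""
    n = len(pattern)
    z = z_values(pattern)
    sp_prime = [0] * (n + 1)               # sp_prime[0] = sp'_0 = 0
    for j in range(n, 1, -1):
        sp_prime[j + z[j - 1] - 1] = z[j - 1]
    return [0] + [sp_prime[k - 1] + 1 for k in range(1, n + 2)]


def kmp_search(pattern: str, text: str) -> list[int]:
    """Return the 1-based start positions of all occurrences of pattern in text."""
    n, m = len(pattern), len(text)
    if n == 0:
        return []                          # assumption: an empty pattern reports nothing
    f = failure_function(pattern)
    occurrences = []
    c = p = 1                              # 1-based pointers into T and P, as on the slide
    while c + (n - p) <= m:
        while p <= n and pattern[p - 1] == text[c - 1]:
            p += 1
            c += 1
        if p == n + 1:
            occurrences.append(c - n)
        if p == 1:
            c += 1
        p = f[p]
    return occurrences


print(kmp_search("abcxabcde", "xyabcxabcxabcdefeg"))  # [7]
```

Because c never decreases and each failure step strictly shifts P to the right, the outer loop performs a linear number of character comparisons, which is the linear bound mentioned in the preliminaries.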

  18. Full KMP Algorithm (trace of the loop from slide 17) • c = 1, p = 1: P(1) = a != T(1) = x • p != n + 1 • p = 1 ⇒ c = 2 • p = F´(1) = 1
      T: xyabcxabcxabcdefeg
      P: abcxabcde

  19. Full KMP Algorithm (trace continued) • c = 2, p = 1: P(1) = a != T(2) = y • p != n + 1 • p = 1 ⇒ c = 3 • p = F´(1) = 1
      T: xyabcxabcxabcdefeg
      P:  abcxabcde

  20. Full KMP Algorithm (trace continued) • c = 3, p = 1: P[1..7] matches T[3..9]; at p = 8, c = 10: d != x • p != n + 1 • p = 8, not 1 ⇒ don’t change c • p = F´(8) = 4
      T: xyabcxabcxabcdefeg
      P:   abcxabcde

  21. Full KMP Algorithm (trace continued) • p = 4, c = 10: P[4..9] matches T[10..15] • p = n + 1 ⇒ report an occurrence of P at position c – n = 16 – 9 = 7 of T
      T: xyabcxabcxabcdefeg
      P:       abcxabcde

  22. Real-Time KMP • Q: What is meant by real-time algorithms? • A: Typically these are algorithms that are meant to interact synchronously in the real world. • This implies a known fixed turn-around time for processing a task • Many embedded scheduling systems are examples involving real-time algorithms. • For KMP this means that we require a constant bound on the time spent processing each character of T, for any pattern of length n.

  23. Real-Time KMP • Q: Why is KMP not real-time? • A: For any mismatched character in T, we may try matching it several times. • Recall that sp´_i only guarantees that P(i + 1) and P(sp´_i + 1) differ • There is NO guarantee that P(sp´_i + 1) and T(k) match • We need to ensure that a mismatch at T(k) does NOT entail additional matches at T(k). • This means that we have to compute sp´_i values with respect to all characters in Σ since any could appear in T.

  24. Real-Time KMP • Define: sp´_(i,x)(P) to be the length of the longest proper suffix of P[1..i] that matches a prefix of P, with the added condition that character P(sp´_(i,x) + 1) is x. • This will tell us exactly what shift to use for each possible mismatch. • A mismatched character T(k) will never be involved in subsequent comparisons.

  25. Real-Time KMP • Q: How do we know that the mismatched character T(k) will never be involved in subsequent comparisons? • A: Because the shift will shift P so that either the matching character aligns with T(k) or P will be shifted past T(k). • This results in a real-time version of KMP. • Let’s consider how we can find the sp´_(i,x)(P) values in linear time.

  26. Real-Time KMP • Thm. For P[i + 1] ≠ x, sp´_(i,x)(P) = i – j + 1 • Here j is the smallest position such that j maps to i and P(Z_j + 1) = x. • If there is no such j then sp´_(i,x)(P) = 0
      For i = 1 to n { sp´_(i,x) = 0 for every character x; }
      For j = n downto 2 { i = j + Z_j(P) – 1; x = P(Z_j + 1); sp´_(i,x) = Z_j; }

  27. Real-Time KMP
      For i = 1 to n { sp´_(i,x) = 0 for every character x; }
      For j = n downto 2 { i = j + Z_j(P) – 1; x = P(Z_j + 1); sp´_(i,x) = Z_j; }
      • Notice how this works: • Starting from the right • Find i, the right end of the Z box associated with j • Find x, the character immediately following the prefix corresponding to this Z box. • Set sp´_(i,x) = Z_j, the length of this Z box.
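The per-character table can be built with the same right-to-left scan. In this sketch the table is a list of dictionaries rather than a full n × |Σ| array, so a missing entry stands for the initial value 0; that storage choice and the names `z_values` and `sp_prime_by_char` are assumptions of the example, not part of the slides. As in the theorem, entries are only meaningful for x different from P(i + 1).

```python
def z_values(s: str) -> list[int]:
    """z[k] = length of the longest substring starting at 0-based k that matches a prefix of s."""
    n = len(s)
    z = [0] * n
    left = right = 0
    for k in range(1, n):
        if k < right:
            z[k] = min(right - k, z[k - left])
        while k + z[k] < n and s[z[k]] == s[k + z[k]]:
            z[k] += 1
        if k + z[k] > right:
            left, right = k, k + z[k]
    return z


def sp_prime_by_char(pattern: str) -> list[dict[str, int]]:
    """table[i][x] = sp'_(i,x) for 1-based i; absent keys mean 0 (index 0 unused)."""
    n = len(pattern)
    z = z_values(pattern)
    table: list[dict[str, int]] = [{} for _ in range(n + 1)]
    for j in range(n, 1, -1):              # j = n downto 2, i.e. starting from the right
        zj = z[j - 1]                      # Z_j: length of the Z-box starting at j
        i = j + zj - 1                     # right end of that Z-box
        x = pattern[zj]                    # P(Z_j + 1): character following the matched prefix
        table[i][x] = zj
    return table


table = sp_prime_by_char("abcxabcde")
print(table[7].get("x", 0))  # 3
print(table[7].get("q", 0))  # 0
```

For P = abcxabcde, a mismatch at position 8 against the text character x gives sp´_(7,x) = 3, so P(4) = x lands on that text character; against a character such as q, which is not recorded for i = 7, the value is 0.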
