
Bioinformatics Algorithms and Data Structures

Learn about the key concepts of the Boyer-Moore algorithm: the right-to-left scan, the bad character rule, and the suffix shift rule for efficient pattern matching. See how to implement and apply these rules to speed up pattern search in bioinformatics.

mbuell




Presentation Transcript


  1. Bioinformatics Algorithms and Data Structures Chapter 2: Boyer-Moore Algorithm Lecturer: Dr. Rose Slides by: Dr. Rose January 21, 2003

  2. Boyer-Moore Algorithm • Basic ideas: • Previously discussed ideas for naïve matching: • Successively align P and T to check for a match. • Shift P to the right on match failure. • New concepts with respect to the naïve algorithm: • Scan from right to left • Special bad character rule • Suffix shift rule

  3. Concept: Right-to-left Scan • How can we check for a match of pattern P at location i in target T? • The naïve algorithm scanned left to right, i.e., it compared T[i+k] with P[1+k] for k = 0 to length(P)-1. Example: P = adab, T = abaracadabara
     T: a b a r a c a d a b a r a
     P: a d a b
        ^ 1: a == a
          ^ 2: d != b

  4. Concept: Right-to-left Scan • Alternative: scan right to left, i.e., compare T[i+k] with P[1+k] for k = length(P)-1 down to 0. Example: P = adab, T = abaracadabara
     T: a b a r a c a d a b a r a
     P: a d a b
              ^ 1: b != r

  5. Concept: Right-to-left Scan • Why is scanning right-to-left a good idea? • Answer: by itself, it isn’t any better than left-to-right. • A naïve approach with right-to-left scanning is also Θ(nm). • Larger shifts, supported by a clever bad character rule and a suffix shift rule, make it better.
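
A minimal sketch of the naïve right-to-left scan described above, assuming Python and 1-based reporting of match positions. The function name `naive_right_to_left` is illustrative and not from the slides. It always shifts by one position, which is why it gains nothing over the left-to-right version.

```python
def naive_right_to_left(P: str, T: str) -> list[int]:
    """Return the 1-based start positions of P in T, comparing right to left."""
    n, m = len(P), len(T)
    hits = []
    for start in range(m - n + 1):       # naive: always shift by one position
        k = n - 1
        while k >= 0 and T[start + k] == P[k]:
            k -= 1                       # move leftwards through P
        if k < 0:                        # every character of P matched
            hits.append(start + 1)
    return hits


print(naive_right_to_left("adab", "abaracadabara"))  # [7]
```

With the slide's example, the first alignment already fails on its first comparison (b against r), yet the pattern still moves only one place.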

  6. Concept: Bad Character Rule • Idea: the mismatched character indicates a safe minimum shift. Example: P = adacara, T = abaracadabara
     T: a b a r a c a d a b a r a
     P: a d a c a r a
                    ^ 1: a == a
                  ^ 2: r != c
     Here the bad character is c. Perhaps we should shift to align this character with its rightmost occurrence in P?

  7. Concept: Bad Character Rule Shift two positions to align the rightmost occurrence of the mismatched character c in P.
     T:      a b a r a c a d a b a r a
     before: a d a c a r a
     after:      a d a c a r a
     Now, start matching again from right to left.

  8. Concept: Bad Character Rule Second example: P = adacara, T = abaxaradabara
     T: a b a x a r a d a b a r a
     P: a d a c a r a
                    ^ 1: a == a
                  ^ 2: r == r
                ^ 3: a == a
              ^ 4: c != x
     Here the bad character is x. The minimum shift should align this character with its occurrence in P. But x doesn’t occur in P!

  9. Concept: Bad Character Rule Since x doesn’t occur in P, we can shift past it. Second example: P = adacara, T = abaxaradabara
     T:      a b a x a r a d a b a r a
     before: a d a c a r a
     after:          a d a c a r a
     Now, start matching again from right to left.

  10. Concept: Bad Character Rule • We will define a bad character rule that uses the concept of the rightmost occurrence of each letter. • Let R(x) be the rightmost position of the letter x in P for each letter x in our alphabet. • If x doesn’t occur in P, define R(x) to be 0.

  11. Concept: Bad Character Rule • Bad Character Rule: If P[i] mismatches T[k], shift P along T by max[1, i - R(T[k])]. • This rule allows us to shift by more than 1 when R(T[k]) + 1 < i. • Otherwise, i.e., when R(T[k]) >= i - 1, we have i - R(T[k]) <= 1 and we shift P by one position. • In particular, the rule tells us nothing useful when R(T[k]) > i, i.e., when the rightmost occurrence lies to the right of the mismatch.
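
The rule above can be sketched in a few lines of Python. This is a sketch under stated assumptions: positions are 1-based as on the slides, and the names `rightmost_positions` and `bad_character_shift` are illustrative rather than taken from the lecture.

```python
def rightmost_positions(P: str) -> dict[str, int]:
    """R(x): 1-based rightmost position of each character that occurs in P."""
    # Later occurrences overwrite earlier ones, so the rightmost one survives.
    return {ch: pos for pos, ch in enumerate(P, start=1)}


def bad_character_shift(R: dict[str, int], i: int, bad_char: str) -> int:
    """Shift for a mismatch of P[i] against text character bad_char."""
    # Characters absent from P have R(x) = 0, as defined on the slides.
    return max(1, i - R.get(bad_char, 0))


R = rightmost_positions("adacara")
print(bad_character_shift(R, 6, "c"))   # 2: align c with position 4 of P
print(bad_character_shift(R, 4, "x"))   # 4: x is not in P, shift past it
print(bad_character_shift(rightmost_positions("aracara"), 4, "r"))  # 1
```

The last call shows the weak case: the rightmost r in aracara is at position 6, to the right of the mismatch at position 4, so the rule falls back to a shift of one.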

  12. Concept: Extended Bad Character Rule Extended Bad Character Rule: If P[i] mismatches T[k], shift P along T so that the closest occurrence of the letter T[k] in P to the left of i in P is aligned with T[k]. Example: P = aracara, T = abararadabara
      T: a b a r a r a d a b a r a
      P: a r a c a r a
                     ^ 1: a == a
                   ^ 2: r == r
                 ^ 3: a == a
               ^ 4: c != r
      The rightmost occurrence of r in P is at position 6. Notice that i - R(T[k]) < 0, i.e., 4 - 6 < 0. The rightmost occurrence of r to the left of i in P is at position 2. Notice that 4 - 2 > 0, i.e., this gives us a positive shift.

  13. Concept: Extended Bad Character Rule The amount of shift is i - j, where: • i is the index of the mismatch in P. • j is the position of the rightmost occurrence of T[k] to the left of i in P. Example: P = aracara, T = abataradabara
      T: a b a t a r a d a b a r a
      P: a r a c a r a
                     ^ 1: a == a
                   ^ 2: r == r
                 ^ 3: a == a
               ^ 4: c != t
      There is no occurrence of t in P, thus j = 0. Notice that i - j = 4, i.e., this gives us a positive shift past the point of mismatch.

  14. Concept: Extended Bad Character Rule • How do we implement this rule? • We preprocess P (from right to left), recording the position of each occurrence of the letters. • For each character x in Σ, the alphabet, create a list of its occurrences in P. If x doesn’t occur in P, then it has an empty list.

  15. Concept: Extended Bad Character Rule Example: Σ = {a, b, c, d, r, t}, P = abataradabara • a_list = <13, 11, 9, 7, 5, 3, 1> since ‘a’ occurs at these positions in P, i.e., abataradabara • b_list = <10, 2> (abataradabara) • c_list = Ø • d_list = <8> (abataradabara) • r_list = <12, 6> (abataradabara) • t_list = <4> (abataradabara)
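
The occurrence lists and the extended rule might be implemented along these lines. This is a sketch, not the lecture's code: the names `occurrence_lists` and `extended_bad_character_shift` are assumptions, positions are 1-based, and a character missing from the dictionary stands for an empty list.

```python
def occurrence_lists(P: str) -> dict[str, list[int]]:
    """Map each character of P to its 1-based positions, rightmost first."""
    lists: dict[str, list[int]] = {}
    for pos in range(len(P), 0, -1):     # scan P from right to left
        lists.setdefault(P[pos - 1], []).append(pos)
    return lists


def extended_bad_character_shift(lists: dict[str, list[int]],
                                 i: int, bad_char: str) -> int:
    """Return i - j, where j is the closest occurrence of bad_char left of i."""
    for pos in lists.get(bad_char, []):  # positions in decreasing order
        if pos < i:
            return i - pos
    return i                             # no occurrence to the left: j = 0


print(occurrence_lists("abataradabara")["a"])        # [13, 11, 9, 7, 5, 3, 1]
lists = occurrence_lists("aracara")
print(extended_bad_character_shift(lists, 4, "r"))   # 2
print(extended_bad_character_shift(lists, 4, "t"))   # 4
```

Because each list is stored in decreasing order, the scan stops at the first position smaller than i, so it never looks at occurrences further left than it needs to.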

  16. Concept: Suffix Shift Rule • Recall that we investigated finding prefixes last week. • Since we are matching P to T from right to left, we will instead need to use suffixes. • Note: historically, the preprocessing method for finding good suffixes for Boyer-Moore has been regarded as inscrutable. • If you are confused, that is OK. • If you are not confused, does that mean you aren’t paying close enough attention?

  17. Concept: Suffix Shift Rule • Consider the partial right-to-left matching of P to T below. • This partial match involves α, a suffix of P.

  18. Concept: Suffix Shift Rule • This partial match ends where the first mismatch occurs, where x is aligned with d.

  19. Concept: Suffix Shift Rule We want to find a right-most copy α´ of this substring α in P such that: • α´ is not a suffix of P and • The character to the left of α´ is not the same as the character to the left of α

  20. Concept: Suffix Shift Rule • If α´ exists, shift P to the right such that α´ is now aligned with the substring in T that was previously aligned with α.

  21. Concept: Suffix Shift Rule • If α´ doesn’t exist, shift P right by the least amount such that a prefix of P is aligned with a suffix of α in T.

  22. Concept: Suffix Shift Rule • If α´ doesn’t exist, and there is no prefix of P that matches a suffix of α in T, shift P right by n positions.

  23. Concept: Suffix Shift Rule • Let L(i) denote the largest position less than n s.t. P[i..n] matches a suffix of P[1..L(i)]. • If there is no such position, then L(i) = 0 • Example 1: If i = 17 then L(i) = 9 • Example 2: If i = 16 then L(i) = 0

  24. Concept: Suffix Shift Rule • Let L´(i) denote the largest position less than n s.t. P[i..n] matches a suffix of P[1..L´(i)] and s.t. the character preceding the suffix is not equal to P[i-1]. • If there is no such position, then L´(i) = 0. • Example 1 (P = slydogsaddogdbadbaddog): If i = 20 then L(i) = 12 and L´(i) = 6

  25. Concept: Suffix Shift Rule • Example 2 (P = slydogsaddogdbadbaddog): If i = 19 then L(i) = 12 and L´(i) = 0

  26. Concept: Suffix Shift Rule • Notice that L(i) indicates the right-most copy of P[i..n] that is not a suffix of P. • In contrast, L´(i) indicates the right-most copy of P[i..n] that is not a suffix of P and whose preceding character doesn’t match P[i-1]. • The relation between L´(i) and L(i) is analogous to the relation between α´ and α.
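
To make the two definitions concrete, here is a brute-force sketch that reads L(i) and L´(i) straight off their definitions. It is deliberately not the linear-time method, and the name `L_and_Lprime` is illustrative. One assumption is labeled in the code: a copy that starts at position 1 has no preceding character and is treated as qualifying for L´(i).

```python
def L_and_Lprime(P: str, i: int) -> tuple[int, int]:
    """Brute-force (L(i), L'(i)) for a 1-based position i with 2 <= i <= n."""
    n = len(P)
    suffix = P[i - 1:]                    # P[i..n]
    length = len(suffix)
    L = Lp = 0
    for end in range(length, n):          # candidate end positions, all < n
        if P[end - length:end] == suffix:
            L = end                       # rightmost copy seen so far
            start = end - length + 1      # 1-based start of this copy
            # Assumption: a copy starting at position 1 has no preceding
            # character, so it counts as differing from P[i-1].
            if start == 1 or P[start - 2] != P[i - 2]:
                Lp = end
    return L, Lp


P = "slydogsaddogdbadbaddog"
print(L_and_Lprime(P, 20))   # (12, 6)
print(L_and_Lprime(P, 19))   # (12, 0)
```

Both results agree with the two examples given for P = slydogsaddogdbadbaddog.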

  27. Concept: Suffix Shift Rule • Q: What is the point? • A: If P[i-1] causes the mismatch and L´(i) > 0, then we can shift P right by n - L´(i) positions.

  28. Concept: Suffix Shift Rule • If L(i) and L´(i) are different, then L´(i) < L(i), so shifting by n - L´(i) positions is a greater shift than n - L(i).

  29. Concept: Suffix Shift Rule • Let N_j(P) denote the length of the longest suffix of P[1..j] that is also a suffix of P. • Example 1: N_6(P) = 3 and N_12(P) = 5. • Example 2: N_3(P) = 2, N_9(P) = 3, N_15(P) = 5, N_19(P) = 0.

  30. Concept: Suffix Shift Rule • Q: How are the concepts of N_i and Z_i related? • Recall that Z_i is the length of a maximal substring starting at position i of P that matches a prefix of P. • In contrast, N_i is the length of a maximal substring ending at position i in P that matches a suffix of P. • In the case of Boyer-Moore, we are naturally interested in suffixes since we are scanning right to left.

  31. Concept: Suffix Shift Rule • Let P^r denote the mirror image of P; then the relationship can be expressed as N_j(P) = Z_{n-j+1}(P^r). • In words, the length of the substring matching a suffix at position j in P is equal to the length of the corresponding substring matching a prefix in the reverse of P. • Q: Why must this be true? • A: Because they are the same substring, except that one is the reverse of the other.

  32. Concept: Suffix Shift Rule • Since N_j(P) = Z_{n-j+1}(P^r), we can use the Z algorithm to compute N in O(n). • Q: How do we do this? • A: We create P^r, the reverse of P, and process it with the Z algorithm.
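
The reversal trick can be sketched as follows. The Z computation here is a standard linear-time formulation with 0-based indices, which may differ in detail from the version presented in the earlier lecture; the names `z_values` and `n_values` are illustrative.

```python
def z_values(S: str) -> list[int]:
    """Z[i] (0-based): length of the longest substring of S starting at i
    that matches a prefix of S. Z[0] is left at 0 by convention here."""
    n = len(S)
    Z = [0] * n
    l = r = 0                             # current Z-box is S[l:r]
    for i in range(1, n):
        if i < r:                         # reuse work done inside the Z-box
            Z[i] = min(r - i, Z[i - l])
        while i + Z[i] < n and S[i + Z[i]] == S[Z[i]]:
            Z[i] += 1                     # extend by explicit comparison
        if i + Z[i] > r:
            l, r = i, i + Z[i]
    return Z


def n_values(P: str) -> list[int]:
    """Return [N_1(P), ..., N_{n-1}(P)] via N_j(P) = Z_{n-j+1}(P^r)."""
    n = len(P)
    Z = z_values(P[::-1])
    # The 1-based index n-j+1 in the reversed string is 0-based index n-j.
    return [Z[n - j] for j in range(1, n)]


print(n_values("golgol"))       # [0, 0, 3, 0, 0]
print(n_values("asdbasasas"))   # [0, 2, 0, 0, 0, 2, 0, 4, 0]
```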

  33. Concept: Suffix Shift Rule • We can then find the L´(i) and L(i) values from the N values in linear time with the following:
      For i = 1 to n { L´(i) = 0; }
      For j = 1 to n - 1 { i = n - N_j(P) + 1; L´(i) = j; }
      L(2) = L´(2);
      For i = 3 to n { L(i) = max(L(i - 1), L´(i)); }

  34. Concept: Suffix Shift Rule
      For i = 1 to n { L´(i) = 0; }
      For j = 1 to n - 1 { i = n - N_j(P) + 1; L´(i) = j; }
      L(2) = L´(2);
      For i = 3 to n { L(i) = max(L(i - 1), L´(i)); }
      • Example: P = asdbasasas, n = 10 • Values of N_j(P) for j = 1..9: 0, 2, 0, 0, 0, 2, 0, 4, 0 • Computed values of i: 11, 9, 11, 11, 11, 9, 11, 7, 11 (i = 11 = n + 1 arises whenever N_j(P) = 0; that slot is never used) • Values of L´(i) for i = 1..9: 0, 0, 0, 0, 0, 0, 8, 0, 6
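
The pseudocode above translates almost line for line into Python. In this sketch the N values come from a brute-force helper so that the block stays self-contained (an assumption made for brevity; it costs quadratic time in the worst case, whereas the Z algorithm on the reversed pattern gives them in linear time). The names are illustrative.

```python
def n_values_bruteforce(P: str) -> list[int]:
    """N[j] for j = 1..n-1 (index 0 unused), by direct comparison."""
    n = len(P)
    N = [0] * n
    for j in range(1, n):
        k = 0
        while k < j and P[j - 1 - k] == P[n - 1 - k]:
            k += 1                        # extend the common suffix
        N[j] = k
    return N


def l_prime_and_l(P: str) -> tuple[list[int], list[int]]:
    """Return ([L'(1..n)], [L(1..n)]) following the slide's pseudocode."""
    n = len(P)
    N = n_values_bruteforce(P)
    Lp = [0] * (n + 2)                    # slot n+1 absorbs the N_j = 0 case
    for j in range(1, n):
        Lp[n - N[j] + 1] = j              # later (larger) j overwrites
    L = [0] * (n + 2)
    if n >= 2:
        L[2] = Lp[2]
    for i in range(3, n + 1):
        L[i] = max(L[i - 1], Lp[i])
    return Lp[1:n + 1], L[1:n + 1]


Lp, L = l_prime_and_l("asdbasasas")
print(Lp)   # [0, 0, 0, 0, 0, 0, 8, 0, 6, 0]
print(L)    # [0, 0, 0, 0, 0, 0, 8, 8, 8, 8]
```

The first nine entries of `Lp` match the L´ values listed in the example; the tenth is L´(10) = 0.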

  35. Concept: Suffix Shift Rule • Let l´(i) denote the length of the longest suffix of P[i..n] that is also a prefix of P. Let l´(i) = 0 if no such suffix exists. Example: P = asasbsasas • l´(1) = 10 (P itself) • l´(2) = l´(3) = l´(4) = l´(5) = l´(6) = l´(7) = 4 • l´(8) = l´(9) = 2 • l´(10) = 0
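
A direct, quadratic-time reading of this definition can be used to check such values by hand. The sketch below is intentionally not the linear-time method; the name `l_prime_lower` is illustrative, and it reports only i = 2..n, which are the positions the search actually consults.

```python
def l_prime_lower(P: str) -> dict[int, int]:
    """Return {i: l'(i)} for i = 2..n, straight from the definition."""
    n = len(P)
    result = {}
    for i in range(2, n + 1):
        best = 0
        # A suffix of P[i..n] has length at most n - i + 1.
        for length in range(1, n - i + 2):
            if P[n - length:] == P[:length]:   # suffix of P equals prefix of P
                best = length
        result[i] = best
    return result


print(l_prime_lower("asasbsasas"))
# {2: 4, 3: 4, 4: 4, 5: 4, 6: 4, 7: 4, 8: 2, 9: 2, 10: 0}
```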

  36. Concept: Suffix Shift Rule • Thm: l´(i) = largest j <= n - i + 1 s.t. N_j(P) = j. • Q: How can we compute l´(i) values in linear time? • A: This is problem #9 in Chapter 2. This would make an interesting homework problem.

  37. Boyer-Moore Algorithm Preprocessing: Compute L´(i) and l´(i) for each position i in P. Compute R(x), the right-most occurrence of x in P, for each character x in Σ. Search:
      k = n;
      While k <= m {
        i = n; h = k;
        While i > 0 and P(i) = T(h) { i = i - 1; h = h - 1; }
        if i = 0 { report occurrence of P in T ending at position k; k = k + n - l´(2); }
        else shift P (increase k) by the max amount indicated by the extended bad character rule and the good suffix rule.
      }

  38. Boyer-Moore Algorithm Example: P = golgol Preprocessing: Compute L´(i) and l´(i) for each position i in P.
      For i = 1 to n { L´(i) = 0; }
      For j = 1 to n - 1 { i = n - N_j(P) + 1; L´(i) = j; }
      Notice that we first need the N_j(P) values in order to compute L´(i) and l´(i) for each position i in P.

  39. Boyer-Moore Algorithm Example: P = golgol Recall that N_j(P) is the length of the longest suffix of P[1..j] that is also a suffix of P.
      N_1(P) = 0, there is no suffix of P that ends with g
      N_2(P) = 0, there is no suffix of P that ends with o
      N_3(P) = 3, there is a suffix of P that ends with l
      N_4(P) = 0, there is no suffix of P that ends with g
      N_5(P) = 0, there is no suffix of P that ends with o
      N_1(P) = N_2(P) = N_4(P) = N_5(P) = 0 and N_3(P) = 3

  40. Boyer-Moore Algorithm • Preprocessing: P = golgol, n = 6 • N_1(P) = N_2(P) = N_4(P) = N_5(P) = 0 and N_3(P) = 3 • Compute L´(i) and l´(i) for each position i in P.
      For i = 1 to n { L´(i) = 0; }
      For j = 1 to n - 1 { i = n - N_j(P) + 1; L´(i) = j; }
      j = 1: i = 7, therefore L´(7) = 1
      j = 2: i = 7, therefore L´(7) = 2
      j = 3: i = 4, therefore L´(4) = 3
      j = 4: i = 7, therefore L´(7) = 4
      j = 5: i = 7, therefore L´(7) = 5
      L´(1) = L´(2) = L´(3) = L´(5) = L´(6) = 0 and L´(4) = 3 (the slot L´(7) lies outside P and is never used)

  41. Boyer-Moore Algorithm • Preprocessing: P = golgol, n = 6 • N_1(P) = N_2(P) = N_4(P) = N_5(P) = 0 and N_3(P) = 3 • L´(1) = L´(2) = L´(3) = L´(5) = L´(6) = 0 and L´(4) = 3 • Compute l´(i) for each position i in P. • Recall that l´(i) is the length of the longest suffix of P[i..n] that is also a prefix of P.
      l´(1) = 6 since golgol (P itself) is the longest suffix of P[1..n] that is a prefix of P.
      l´(2) = 3 since gol is the longest suffix of P[2..n] that is a prefix of P.
      l´(3) = 3 since gol is the longest suffix of P[3..n] that is a prefix of P.
      l´(4) = 3 since gol is the longest suffix of P[4..n] that is a prefix of P.
      l´(5) = 0 since there is no suffix of P[5..n] that is a prefix of P.
      l´(6) = 0 since there is no suffix of P[6..n] that is a prefix of P.
      l´(1) = 6, l´(2) = l´(3) = l´(4) = 3 and l´(5) = l´(6) = 0

  42. Boyer-Moore Algorithm • Preprocessing: P = golgol, n = 6 • N_1(P) = N_2(P) = N_4(P) = N_5(P) = 0 and N_3(P) = 3 • L´(1) = L´(2) = L´(3) = L´(5) = L´(6) = 0 and L´(4) = 3 • l´(1) = 6, l´(2) = l´(3) = l´(4) = 3 and l´(5) = l´(6) = 0 • Compute the list R(x), the occurrences of x in P from right to left, for each character x in Σ = {g, o, l}: R(g) = <4, 1>, R(o) = <5, 2>, R(l) = <6, 3>

  43. Boyer-Moore Algorithm • Preprocessing: P = golgol, n = 6, T = lolgolgol, m = 9 • L´(1) = L´(2) = L´(3) = L´(5) = L´(6) = 0 and L´(4) = 3 • l´(1) = 6, l´(2) = l´(3) = l´(4) = 3 and l´(5) = l´(6) = 0 • R(g) = <4, 1>, R(o) = <5, 2>, R(l) = <6, 3> Search:
      k = n;
      While k <= m {
        i = n; h = k;
        While i > 0 and P(i) = T(h) { i = i - 1; h = h - 1; }
        if i = 0 { report occurrence of P in T ending at position k; k = k + n - l´(2); }
        else shift P (increase k) by the max amount indicated by the extended bad character rule and the good suffix rule.
      }

  44. Search
      k = 6;
      While k <= 9 {
        i = 6; h = k;
        While i > 0 and P(i) = T(h) { i = i - 1; h = h - 1; }
        if i = 0 { report occurrence of P in T ending at position k; k = k + 6 - l´(2); }
        else shift P (increase k) by the max amount indicated by the extended bad character rule and the good suffix rule.
      }
      T: l o l g o l g o l
      P: g o l g o l
                   ^ i = 6, h = 6
                 ^ i = 5, h = 5
               ^ i = 4, h = 4
             ^ i = 3, h = 3
           ^ i = 2, h = 2
         ^ i = 1, h = 1, P(1) != T(1)
      But i = 1! Bad Character Rule: there is no occurrence of l, the mismatched character in T, to the left of P(1). This suggests shifting only 1 place. Good Suffix Rule: since L´(2) = 0 and l´(2) = 3, shift P by n - l´(2) places, i.e., 6 - 3 = 3 places. Thus k = k + 3 = 9.

  45. Search
      k = 6;
      While k <= 9 {
        i = 6; h = k;
        While i > 0 and P(i) = T(h) { i = i - 1; h = h - 1; }
        if i = 0 { report occurrence of P in T ending at position k; k = k + 6 - l´(2); }
        else shift P (increase k) by the max amount indicated by the extended bad character rule and the good suffix rule.
      }
      T: l o l g o l g o l
      P:       g o l g o l
                         ^ i = 6, h = 9
                       ^ i = 5, h = 8
                     ^ i = 4, h = 7
                   ^ i = 3, h = 6
                 ^ i = 2, h = 5
               ^ i = 1, h = 4
      The inner loop stops with i = 0, h = 3. • i = 0, so report an occurrence of P in T starting at position 4 (ending at k = 9), then k = k + 6 - l´(2) = 9 + 6 - 3 = 12. • k = 12 > 9, we are done!
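
Putting the pieces together, the whole search might look like the sketch below. Several things here are assumptions rather than the lecture's code: the preprocessing uses brute-force loops for N_j, L´ and l´ to keep the block short (so preprocessing is not linear time), a mismatch at position n falls back to a good-suffix shift of 1, occurrences are reported by 1-based start position, and the name `boyer_moore` is illustrative.

```python
def boyer_moore(P: str, T: str) -> list[int]:
    """Return 1-based start positions of P in T (strong good suffix rule
    plus extended bad character rule, 1-based indices as on the slides)."""
    n, m = len(P), len(T)
    if n == 0 or n > m:
        return []
    # N[j]: longest suffix of P[1..j] that is also a suffix of P (brute force).
    N = [0] * (n + 1)
    for j in range(1, n):
        k = 0
        while k < j and P[j - 1 - k] == P[n - 1 - k]:
            k += 1
        N[j] = k
    # L'(i) from the N values; slot n+1 absorbs the N_j = 0 case.
    Lp = [0] * (n + 2)
    for j in range(1, n):
        Lp[n - N[j] + 1] = j
    # l'(i) for i = 2..n, straight from the definition (brute force).
    lp = [0] * (n + 2)
    for i in range(2, n + 1):
        for length in range(1, n - i + 2):
            if P[n - length:] == P[:length]:
                lp[i] = length
    # Occurrence lists for the extended bad character rule, rightmost first.
    occ: dict[str, list[int]] = {}
    for pos in range(n, 0, -1):
        occ.setdefault(P[pos - 1], []).append(pos)

    hits = []
    k = n                                  # T position aligned with P[n]
    while k <= m:
        i, h = n, k
        while i > 0 and P[i - 1] == T[h - 1]:
            i -= 1
            h -= 1
        if i == 0:
            hits.append(k - n + 1)         # occurrence ends at k
            k += n - lp[2]
        else:
            # Extended bad character rule: closest occurrence left of i.
            j = next((p for p in occ.get(T[h - 1], []) if p < i), 0)
            bad = i - j
            # Good suffix rule for a mismatch at position i of P.
            if i == n:
                good = 1                   # nothing matched yet
            elif Lp[i + 1] > 0:
                good = n - Lp[i + 1]
            else:
                good = n - lp[i + 1]
            k += max(bad, good)
    return hits


print(boyer_moore("golgol", "lolgolgol"))   # [4]
```

On the slide's example the first alignment mismatches at i = 1, the bad character rule suggests 1 and the good suffix rule 3, so k moves from 6 to 9, where the occurrence at position 4 is found.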

  46. Homework 1: Due 2/4/03 • Problems from Chapter 1, pages 12-14: • #2 • #4 • #6 • For P = tuttifruttiohrutti, calculate: • R(x) for all x in Σ. Assume that Σ consists of exactly the characters that occur in P. • L(i) for each position i. • L´(i) for each position i. • N_j(P) for each position 0 < j < n. • l´(i) for each position i. Additional problem for graduate students: • Problem from Chapter 2, page 30: • #9
