Bioinformatics Algorithms and Data Structures

Chapter 2: Boyer-Moore Algorithm
Lecturer: Dr. Rose
Slides by: Dr. Rose
January 21, 2003

Boyer-Moore Algorithm
• Basic ideas:
  • Previously discussed ideas for naïve matching:
    • Successively align P and T to check for a match.
    • Shift P to the right on match failure.
  • New concepts with respect to the naïve algorithm:
    • Scan from right to left.
    • Special bad character rule.
    • Suffix shift rule.

Concept: Right-to-left Scan
• How can we check for a match of pattern P at location i in target T?
• The naïve algorithm scans left to right, i.e., it compares T[i+k] with P[1+k] for k = 0 to length(P)-1.
Example: P = adab, T = abaracadabara
  T: a b a r a c a d a b a r a
  P: a d a b
     ^ 1: a == a
       ^ 2: d != b

Concept: Right-to-left Scan
• Alternative: scan right to left, i.e., compare T[i+k] with P[1+k] for k = length(P)-1 down to 0.
Example: P = adab, T = abaracadabara
  T: a b a r a c a d a b a r a
  P: a d a b
           ^ 1: b != r

Concept: Right-to-left Scan
• Why is scanning right-to-left a good idea?
• Answer: by itself, it isn’t any better than left-to-right.
• A naïve approach with right-to-left scanning is also Θ(nm).
• Larger shifts, supported by a clever bad character rule and a suffix shift rule, make it better.
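The naïve right-to-left scan can be sketched in a few lines of Python. This is only an illustration of the point above: the function name `naive_right_to_left` is made up here, and the shift is always one position, which is why the running time stays at Θ(nm) in the worst case.

```python
def naive_right_to_left(pattern, text):
    """Return the 1-based start positions of pattern in text, shifting by one each time."""
    n, m = len(pattern), len(text)
    hits = []
    for start in range(m - n + 1):           # 0-based left end of the alignment
        i = n - 1                            # scan the pattern from its right end
        while i >= 0 and pattern[i] == text[start + i]:
            i -= 1
        if i < 0:                            # every character matched
            hits.append(start + 1)
    return hits

print(naive_right_to_left("adab", "abaracadabara"))   # [7]
```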

Concept: Bad Character Rule
• Idea: the mismatched character indicates a safe minimum shift.
Example: P = adacara, T = abaracadabara
  T: a b a r a c a d a b a r a
  P: a d a c a r a
                   ^ 1: a == a
                 ^ 2: r != c
Here the bad character is c. Perhaps we should shift to align this character with its rightmost occurrence in P?

Concept: Bad Character Rule
Shift two positions to align the rightmost occurrence of the mismatched character c in P.
  T: a b a r a c a d a b a r a
     a d a c a r a          (before the shift)
         a d a c a r a      (after the shift)
Now, start matching again from right to left.

Concept: Bad Character Rule
Second Example: P = adacara, T = abaxaradabara
  T: a b a x a r a d a b a r a
  P: a d a c a r a
                   ^ 1: a == a
                 ^ 2: r == r
               ^ 3: a == a
             ^ 4: c != x
Here the bad character is x. The minimum shift should align this character with its occurrence in P. But x doesn’t occur in P!

Concept: Bad Character Rule
Since x doesn’t occur in P, we can shift past it.
Second Example: P = adacara, T = abaxaradabara
  T: a b a x a r a d a b a r a
     a d a c a r a          (before the shift)
             a d a c a r a  (after the shift)
Now, start matching again from right to left.

Concept: Bad Character Rule • We will define a bad character rule that uses the concept of the rightmost occurrence of each letter. • Let R(x) be the rightmost position of the letter x in P for each letter x in our alphabet. • If x doesn’t occur in P, define R(x) to be 0.

Concept: Bad Character Rule
• Bad Character Rule: If P[i] mismatches T[k], shift P along T by max[1, i - R(T[k])].
• This rule allows us to shift by more than 1 when R(T[k]) + 1 < i.
• Otherwise, i.e., when R(T[k]) >= i - 1, we have 1 >= i - R(T[k]) and we shift P by one position.
• In particular, this rule is not very useful when R(T[k]) > i, i.e., when the rightmost occurrence of T[k] in P lies to the right of the mismatch.
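A minimal sketch of this rule in Python, assuming 1-based positions as on the slides. The helper names `rightmost_positions` and `bad_character_shift` are illustrative and not part of the lecture.

```python
def rightmost_positions(pattern):
    """R(x): 1-based rightmost position of each character x that occurs in pattern."""
    R = {}
    for pos, ch in enumerate(pattern, start=1):
        R[ch] = pos                          # later occurrences overwrite earlier ones
    return R

def bad_character_shift(R, i, text_char):
    """Shift for a mismatch of P[i] against text_char: max(1, i - R(text_char))."""
    return max(1, i - R.get(text_char, 0))   # R(x) = 0 when x does not occur in P

R = rightmost_positions("adacara")
print(bad_character_shift(R, 6, "c"))   # 2: first example, P[6] = r against c
print(bad_character_shift(R, 4, "x"))   # 4: second example, x does not occur in P
```

With P = aracara, a mismatch at i = 4 against r gives max(1, 4 - 6) = 1, which is the weak case the last bullet describes.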

Concept: Extended Bad Character Rule
Extended Bad Character Rule: If P[i] mismatches T[k], shift P along T so that the closest occurrence of the letter T[k] in P to the left of i in P is aligned with T[k].
Example: P = aracara, T = abararadabara
  T: a b a r a r a d a b a r a
  P: a r a c a r a
                   ^ 1: a == a
                 ^ 2: r == r
               ^ 3: a == a
             ^ 4: c != r
• The rightmost occurrence of r in P is at position 6. Notice that i - R(T[k]) < 0, i.e., 4 - 6 < 0.
• The rightmost occurrence of r to the left of i in P is at position 2. Notice that 4 - 2 > 0, i.e., this gives us a positive shift.

Concept: Extended Bad Character Rule
The amount of shift is i - j, where:
• i is the index of the mismatch in P.
• j is the rightmost occurrence of T[k] to the left of i in P.
Example: P = aracara, T = abataradabara
  T: a b a t a r a d a b a r a
  P: a r a c a r a
                   ^ 1: a == a
                 ^ 2: r == r
               ^ 3: a == a
             ^ 4: c != t
There is no occurrence of t in P, thus j = 0. Notice that i - j = 4, i.e., this gives us a positive shift past the point of mismatch.

Concept: Extended Bad Character Rule
• How do we implement this rule?
• We preprocess P (from right to left), recording the position of each occurrence of the letters.
• For each character x in Σ, the alphabet, create a list of its occurrences in P. If x doesn’t occur in P, then it has an empty list.

Concept: Extended Bad Character Rule
Example: Σ = {a, b, c, d, r, t}, P = abataradabara
• a_list = <13, 11, 9, 7, 5, 3, 1>, since ‘a’ occurs at these positions in P
• b_list = <10, 2>
• c_list = Ø
• d_list = <8>
• r_list = <12, 6>
• t_list = <4>
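The occurrence lists and the resulting shift can be sketched as follows. The function names are hypothetical, positions are 1-based as on the slides, and a character that is absent from P is simply missing from the dictionary, which plays the role of the empty list.

```python
def occurrence_lists(pattern):
    """Map each character to its 1-based positions in pattern, in decreasing order."""
    lists = {}
    for pos in range(len(pattern), 0, -1):   # right-to-left pass over P
        lists.setdefault(pattern[pos - 1], []).append(pos)
    return lists

def extended_bad_character_shift(lists, i, text_char):
    """Return i - j, where j is the closest occurrence of text_char left of i (0 if none)."""
    for pos in lists.get(text_char, []):     # positions are in decreasing order
        if pos < i:
            return i - pos
    return i                                 # j = 0: shift past the point of mismatch

print(occurrence_lists("abataradabara")["r"])            # [12, 6]
lists = occurrence_lists("aracara")
print(extended_bad_character_shift(lists, 4, "r"))       # 2
print(extended_bad_character_shift(lists, 4, "t"))       # 4
```

Scanning a list costs at most as many steps as characters already matched plus one, so the lookup does not change the overall cost of a comparison phase.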

Concept: Suffix Shift Rule
• Recall that we investigated finding prefixes last week.
• Since we are matching P to T from right to left, we will instead need to use suffixes.
• Note: historically, the preprocessing method for finding good suffixes for Boyer-Moore has been regarded as inscrutable.
• If you are confused, that is OK.
• If you are not confused, does that mean you aren’t paying close enough attention?

Concept: Suffix Shift Rule
• Consider the partial right-to-left matching of P to T below.
• This partial match involves α, a suffix of P.

Concept: Suffix Shift Rule
• This partial match ends where the first mismatch occurs, where x is aligned with d.

Concept: Suffix Shift Rule
We want to find a right-most copy α´ of this substring α in P such that:
• α´ is not a suffix of P, and
• the character to the left of α´ in P is not the same as the character to the left of α in P.

Concept: Suffix Shift Rule
• If α´ exists, shift P to the right such that α´ is now aligned with the substring in T that was previously aligned with α.

Concept: Suffix Shift Rule
• If α´ doesn’t exist, shift P right by the least amount such that a prefix of P is aligned with a suffix of α in T.

Concept: Suffix Shift Rule
• If α´ doesn’t exist, and there is no prefix of P that matches a suffix of α in T, shift P right by n positions.

Concept: Suffix Shift Rule • Let L(i) denote the largest position less than n s.t. P[i..n] matches a suffix of P[1..L(i)]. • If there is no such position, then L(i) = 0 • Example 1: If i = 17 then L(i) = 9 • Example 2: If i = 16 then L(i) = 0

Concept: Suffix Shift Rule
• Let L´(i) denote the largest position less than n s.t. P[i..n] matches a suffix of P[1..L´(i)] and s.t. the character preceding the suffix is not equal to P(i-1).
• If there is no such position, then L´(i) = 0
• Example 1 (P = slydogsaddogdbadbaddog): If i = 20 then L(i) = 12 and L´(i) = 6

Concept: Suffix Shift Rule
• P = slydogsaddogdbadbaddog
• Example 2: If i = 19 then L(i) = 12 and L´(i) = 0

Concept: Suffix Shift Rule
• Notice that L(i) indicates the right-most copy of P[i..n] that is not a suffix of P.
• In contrast, L´(i) indicates the right-most copy of P[i..n] that is not a suffix of P and whose preceding character doesn’t match P(i-1).
• The relation between L´(i) and L(i) is analogous to the relation between α´ and α.
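The two definitions can be checked directly with a brute-force sketch. It is deliberately quadratic or worse and only meant for experimenting with small patterns; the name `L_values_bruteforce` is made up here. It assumes that a copy starting at position 1 of P, which has no preceding character, counts for L´(i).

```python
def L_values_bruteforce(P):
    """Return (L, Lp) as dicts over i = 2..n, straight from the definitions (1-based)."""
    n = len(P)
    L, Lp = {}, {}
    for i in range(2, n + 1):
        suffix = P[i - 1:]                   # P[i..n]
        L[i] = Lp[i] = 0
        # candidate end positions of a copy, largest first, all less than n
        for end in range(n - 1, len(suffix) - 1, -1):
            start = end - len(suffix) + 1    # 1-based start of the copy
            if P[start - 1:end] != suffix:
                continue
            if L[i] == 0:
                L[i] = end                   # right-most copy of P[i..n]
            # the character before the copy must differ from P(i-1)
            if start == 1 or P[start - 2] != P[i - 2]:
                Lp[i] = end
                break
    return L, Lp

L, Lp = L_values_bruteforce("slydogsaddogdbadbaddog")
print(L[20], Lp[20])   # 12 6
print(L[19], Lp[19])   # 12 0
```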

Concept: Suffix Shift Rule • Q: What is the point? • A: If P(i - 1) causes the mismatch and L´(i) > 0, then we can shift P right by n - L´(i) positions. Example:

Concept: Suffix Shift Rule • If L(i) and L´(i) are different, then obviously shifting by n - L´(i) positions is a greater shift than n - L(i). • Example:

Concept: Suffix Shift Rule • Let Nj(P) denote the length of the longest suffix of P[1..j] that is also a suffix of P. • Example 1: N6(P) = 3 and N12(P) = 5. • Example 2: N3(P) = 2, N9(P) = 3, N15(P) = 5, N19(P) = 0.

Concept: Suffix Shift Rule • Q: How are the concepts of Ni and Zi related? • Recall that Zi is the length of a maximal substring starting at position i of P that matches a prefix of P. • In contrast, Ni is the length of a maximal substring ending at position i in P that matches a suffix of P. • In the case of Boyer-Moore, we are naturally interested in suffixes since we are scanning right-to-left

Concept: Suffix Shift Rule
• Let P^r denote the mirror image of P; then the relationship can be expressed as N_j(P) = Z_{n-j+1}(P^r).
• In words, the length of the substring matching a suffix at position j in P is equal to the length of the corresponding substring matching a prefix in the reverse of P.
• Q: Why must this be true?
• A: Because they are the same substring, except that one is the reverse of the other.

Concept: Suffix Shift Rule
• Since N_j(P) = Z_{n-j+1}(P^r), we can use the Z algorithm to compute N in O(n).
• Q: How do we do this?
• A: We create P^r, the reverse of P, and process it with the Z algorithm.
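One way to carry this out is sketched below. The Z routine is a standard linear-time formulation written with 0-based indices, not necessarily the exact version from last week's slides, and the names `z_array` and `n_array` are illustrative.

```python
def z_array(s):
    """z[k] (0-based) = length of the longest substring starting at k that matches a prefix of s."""
    n = len(s)
    z = [0] * n
    left = right = 0                         # current Z-box is s[left:right]
    for k in range(1, n):
        if k < right:
            z[k] = min(right - k, z[k - left])
        while k + z[k] < n and s[z[k]] == s[k + z[k]]:
            z[k] += 1
        if k + z[k] > right:
            left, right = k, k + z[k]
    return z

def n_array(P):
    """Return N with N[j] = N_j(P) for j = 1..n (index 0 unused)."""
    n = len(P)
    z = z_array(P[::-1])                     # Z values of the reversed pattern
    N = [0] * (n + 1)
    for j in range(1, n + 1):
        N[j] = z[n - j]                      # 1-based index n-j+1 is 0-based index n-j
    N[n] = n                                 # z[0] is left at 0; all of P is a suffix of P
    return N

print(n_array("asdbasasas")[1:10])   # [0, 2, 0, 0, 0, 2, 0, 4, 0]
```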

Concept: Suffix Shift Rule
• We can then find L´(i) and L(i) values from N values in linear time with the following:
    For i = 1 to n { L´(i) = 0; }
    For j = 1 to n - 1 {
        i = n - Nj(P) + 1;
        L´(i) = j;
    }
    L(2) = L´(2);
    For i = 3 to n { L(i) = max(L(i - 1), L´(i)); }

Concept: Suffix Shift Rule
    For i = 1 to n { L´(i) = 0; }
    For j = 1 to n - 1 {
        i = n - Nj(P) + 1;
        L´(i) = j;
    }
    L(2) = L´(2);
    For i = 3 to n { L(i) = max(L(i - 1), L´(i)); }
• Example: P = asdbasasas, n = 10
• Values of Nj(P) for j = 1..9: 0, 2, 0, 0, 0, 2, 0, 4, 0
• Computed values of i for j = 1..9: 11, 9, 11, 11, 11, 9, 11, 7, 11
• Values of L´(i) for i = 1..9: 0, 0, 0, 0, 0, 0, 8, 0, 6
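The loops above translate almost line for line into Python. In this sketch the N values come from a simple quadratic helper so that the block stands on its own; the helper and the function names are illustrative. An extra slot n + 1 is allocated because i = n + 1 whenever Nj(P) = 0.

```python
def n_values_bruteforce(P):
    """N[j] for j = 1..n by direct comparison (quadratic; a stand-in for the Z-based method)."""
    n = len(P)
    N = [0] * (n + 1)
    for j in range(1, n + 1):
        while N[j] < j and P[j - 1 - N[j]] == P[n - 1 - N[j]]:
            N[j] += 1
    return N

def l_tables(P):
    """Return (Lp, L), 1-based lists with Lp[i] = L'(i) and L[i] = L(i)."""
    n = len(P)
    N = n_values_bruteforce(P)
    Lp = [0] * (n + 2)                       # slot n+1 absorbs the writes made when N_j(P) = 0
    for j in range(1, n):
        i = n - N[j] + 1
        Lp[i] = j                            # a larger j overwrites: the right-most copy wins
    L = [0] * (n + 2)
    if n >= 2:
        L[2] = Lp[2]
    for i in range(3, n + 1):
        L[i] = max(L[i - 1], Lp[i])
    return Lp[:n + 1], L[:n + 1]

Lp, L = l_tables("asdbasasas")
print(Lp[1:10])   # [0, 0, 0, 0, 0, 0, 8, 0, 6]
```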

Concept: Suffix Shift Rule
• Let l´(i) denote the length of the largest suffix of P[i..n] that is also a prefix of P. Let l´(i) = 0 if no such suffix exists.
Example: P = asasbsasas
• l´(1) = 10 (P itself)
• l´(2) = l´(3) = l´(4) = l´(5) = l´(6) = l´(7) = 4 (asas)
• l´(8) = l´(9) = 2 (as)
• l´(10) = 0

Concept: Suffix Shift Rule • Thm: l´(i) = largest j <= n – i + 1 s.t. Nj(P) = j. • Q: How can we compute l´(i) values in linear time? • A: This is problem #9 in Chapter 2. This would make an interesting homework problem.
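One possible approach, sketched here under the assumption that the N values are already available: as i decreases from n to 1, the bound n - i + 1 grows by one each step, so l´(i) is either the new length j = n - i + 1 (when Nj(P) = j) or the previous value l´(i + 1). The N helper below is a quadratic stand-in that keeps the block self-contained; only the final loop is linear. All names are illustrative.

```python
def n_values_bruteforce(P):
    """N[j] for j = 1..n by direct comparison (quadratic; a stand-in for the Z-based method)."""
    n = len(P)
    N = [0] * (n + 1)
    for j in range(1, n + 1):
        while N[j] < j and P[j - 1 - N[j]] == P[n - 1 - N[j]]:
            N[j] += 1
    return N

def small_l_prime(P):
    """Return lp with lp[i] = l'(i) for i = 1..n (index 0 unused)."""
    n = len(P)
    N = n_values_bruteforce(P)
    lp = [0] * (n + 2)                       # lp[n + 1] = 0 is the base case
    for i in range(n, 0, -1):
        j = n - i + 1                        # length of P[i..n]
        lp[i] = j if N[j] == j else lp[i + 1]
    return lp[:n + 1]

print(small_l_prime("golgol")[1:])   # [6, 3, 3, 3, 0, 0]
```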

Boyer-Moore Algorithm
Preprocessing:
    Compute L´(i) and l´(i) for each position i in P.
    Compute R(x), the right-most occurrence of x in P, for each character x in Σ.
Search:
    k = n;
    While k <= m {
        i = n; h = k;
        While i > 0 and P(i) = T(h) { i = i - 1; h = h - 1; }
        if i = 0 {
            report occurrence of P in T ending at position k;
            k = k + n - l´(2);
        }
        else shift P (increase k) by the maximum amount indicated by the extended bad character rule and the good suffix rule.
    }

Boyer-Moore Algorithm
Example: P = golgol
Preprocessing: Compute L´(i) and l´(i) for each position i in P.
    For i = 1 to n { L´(i) = 0; }
    For j = 1 to n - 1 {
        i = n - Nj(P) + 1;
        L´(i) = j;
    }
Notice that first we need Nj(P) values in order to compute L´(i) and l´(i) for each position i in P.

Boyer-Moore Algorithm
Example: P = golgol
Recall that Nj(P) is the length of the longest suffix of P[1..j] that is also a suffix of P.
• N1(P) = 0: P[1..1] ends with g, but P ends with l.
• N2(P) = 0: P[1..2] ends with o, but P ends with l.
• N3(P) = 3: gol, all of P[1..3], is also a suffix of P.
• N4(P) = 0: P[1..4] ends with g, but P ends with l.
• N5(P) = 0: P[1..5] ends with o, but P ends with l.
N1(P) = N2(P) = N4(P) = N5(P) = 0 and N3(P) = 3

Boyer-Moore Algorithm
• Preprocessing: P = golgol, n = 6
• N1(P) = N2(P) = N4(P) = N5(P) = 0 and N3(P) = 3
• Compute L´(i) and l´(i) for each position i in P
    For i = 1 to n { L´(i) = 0; }
    For j = 1 to n - 1 {
        i = n - Nj(P) + 1;
        L´(i) = j;
    }
    j = 1: i = 7, therefore L´(7) = 1
    j = 2: i = 7, therefore L´(7) = 2
    j = 3: i = 4, therefore L´(4) = 3
    j = 4: i = 7, therefore L´(7) = 4
    j = 5: i = 7, therefore L´(7) = 5
Position 7 = n + 1 lies outside P, so only the assignment to L´(4) matters.
L´(1) = L´(2) = L´(3) = L´(5) = L´(6) = 0 and L´(4) = 3

Boyer-Moore Algorithm
• Preprocessing: P = golgol, n = 6
• N1(P) = N2(P) = N4(P) = N5(P) = 0 and N3(P) = 3
• L´(1) = L´(2) = L´(3) = L´(5) = L´(6) = 0 and L´(4) = 3
• Compute l´(i) for each position i in P.
• Recall that l´(i) is the length of the longest suffix of P[i..n] that is also a prefix of P.
    l´(1) = 6 since golgol is the longest suffix of P[1..n] that is a prefix of P.
    l´(2) = 3 since gol is the longest suffix of P[2..n] that is a prefix of P.
    l´(3) = 3 since gol is the longest suffix of P[3..n] that is a prefix of P.
    l´(4) = 3 since gol is the longest suffix of P[4..n] that is a prefix of P.
    l´(5) = 0 since there is no suffix of P[5..n] that is a prefix of P.
    l´(6) = 0 since there is no suffix of P[6..n] that is a prefix of P.
l´(1) = 6, l´(2) = l´(3) = l´(4) = 3 and l´(5) = l´(6) = 0

Boyer-Moore Algorithm
• Preprocessing: P = golgol, n = 6
• N1(P) = N2(P) = N4(P) = N5(P) = 0 and N3(P) = 3
• L´(1) = L´(2) = L´(3) = L´(5) = L´(6) = 0 and L´(4) = 3
• l´(1) = 6, l´(2) = l´(3) = l´(4) = 3 and l´(5) = l´(6) = 0
• Compute the list R(x), the right-most occurrences of x in P, for each character x in Σ = {g, o, l}
    R(g) = <4, 1>
    R(o) = <5, 2>
    R(l) = <6, 3>

Boyer-Moore Algorithm
• Preprocessing: P = golgol, n = 6, T = lolgolgol, m = 9
• L´(1) = L´(2) = L´(3) = L´(5) = L´(6) = 0 and L´(4) = 3
• l´(1) = 6, l´(2) = l´(3) = l´(4) = 3 and l´(5) = l´(6) = 0
• R(g) = <4, 1>, R(o) = <5, 2>, R(l) = <6, 3>
Search:
    k = n;
    While k <= m {
        i = n; h = k;
        While i > 0 and P(i) = T(h) { i = i - 1; h = h - 1; }
        if i = 0 {
            report occurrence of P in T ending at position k;
            k = k + n - l´(2);
        }
        else shift P (increase k) by the maximum amount indicated by the extended bad character rule and the good suffix rule.
    }

Search
    k = 6;
    While k <= 9 {
        i = 6; h = k;
        While i > 0 and P(i) = T(h) { i = i - 1; h = h - 1; }
        if i = 0 {
            report occurrence of P in T ending at position k;
            k = k + 6 - l´(2);
        }
        else shift P (increase k) by the maximum amount indicated by the extended bad character rule and the good suffix rule.
    }
First alignment, k = 6:
  T: l o l g o l g o l
  P: g o l g o l
    i = 6, h = 6: l == l
    i = 5, h = 5: o == o
    i = 4, h = 4: g == g
    i = 3, h = 3: l == l
    i = 2, h = 2: o == o
    i = 1, h = 1: g != l, i.e., P(1) != T(1)
The mismatch occurs at i = 1.
Bad Character Rule: there is no occurrence of l, the mismatched character in T, to the left of P(1). This suggests shifting only 1 place.
Good Suffix Rule: since L´(2) = 0, use l´(2) = 3 and shift P by n - l´(2) places, i.e., 6 - 3 = 3 places.
Thus k = k + 3 = 9

Search
    k = 6;
    While k <= 9 {
        i = 6; h = k;
        While i > 0 and P(i) = T(h) { i = i - 1; h = h - 1; }
        if i = 0 {
            report occurrence of P in T ending at position k;
            k = k + 6 - l´(2);
        }
        else shift P (increase k) by the maximum amount indicated by the extended bad character rule and the good suffix rule.
    }
Second alignment, k = 9:
  T: l o l g o l g o l
  P:       g o l g o l
    i = 6, h = 9: l == l
    i = 5, h = 8: o == o
    i = 4, h = 7: g == g
    i = 3, h = 6: l == l
    i = 2, h = 5: o == o
    i = 1, h = 4: g == g
    i = 0, h = 3: the inner loop stops
• i = 0: report the occurrence of P in T that ends at position k = 9, i.e., starts at position 4; k = k + 6 - l´(2) = 9 + 6 - 3 = 12
• k = 12 > 9, we are done!
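The whole procedure can be put together in a short Python sketch. Several choices here are assumptions rather than part of the slides: the N values are computed by a quadratic stand-in instead of the Z algorithm, the function reports start positions rather than end positions, and when the very first comparison of an alignment fails (i = n, nothing matched yet) the good suffix rule is taken to contribute a shift of 1.

```python
def boyer_moore(P, T):
    """Return the 1-based start positions of P in T (a sketch of the search loop above)."""
    n, m = len(P), len(T)
    if n == 0 or n > m:
        return []
    # N_j(P) by direct comparison (quadratic stand-in for the Z-based method).
    N = [0] * (n + 1)
    for j in range(1, n + 1):
        while N[j] < j and P[j - 1 - N[j]] == P[n - 1 - N[j]]:
            N[j] += 1
    # L'(i) from the N values; slot n+1 absorbs writes made when N_j(P) = 0.
    Lp = [0] * (n + 2)
    for j in range(1, n):
        Lp[n - N[j] + 1] = j
    # l'(i): largest j <= n - i + 1 with N_j(P) = j; lp[n + 1] = 0.
    lp = [0] * (n + 2)
    for i in range(n, 0, -1):
        j = n - i + 1
        lp[i] = j if N[j] == j else lp[i + 1]
    # Occurrence lists for the extended bad character rule (decreasing positions).
    occ = {}
    for pos in range(n, 0, -1):
        occ.setdefault(P[pos - 1], []).append(pos)

    hits = []
    k = n                                    # position in T aligned with the right end of P
    while k <= m:
        i, h = n, k
        while i > 0 and P[i - 1] == T[h - 1]:
            i -= 1
            h -= 1
        if i == 0:
            hits.append(k - n + 1)           # start position of the occurrence
            k += n - lp[2]
        else:
            # extended bad character rule: i - j, j = closest T[h] left of i (0 if none)
            j = next((pos for pos in occ.get(T[h - 1], []) if pos < i), 0)
            bad = i - j
            # good suffix rule, applied to the matched suffix P[i+1..n]
            if i == n:
                good = 1                     # nothing matched yet
            elif Lp[i + 1] > 0:
                good = n - Lp[i + 1]
            else:
                good = n - lp[i + 1]
            k += max(bad, good)
    return hits

print(boyer_moore("golgol", "lolgolgol"))   # [4]
```

On the worked example the sketch takes the same two steps as the trace: a shift of 3 after the mismatch at i = 1, then the occurrence starting at position 4.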

Homework 1: Due 2/4/03
• Problems from Chapter 1, pages 12-14:
  • #2
  • #4
  • #6
• For P = tuttifruttiohrutti, calculate:
  • R(x) for all x in Σ. Assume that P contains all x.
  • L(i) for each position i.
  • L´(i) for each position i.
  • Nj(P) for each position 0 < j < n.
  • l´(i) for each position i.
Additional problem for graduate students:
• Problem from Chapter 2, page 30:
  • #9
