Boyer-Moore Algorithm

Boyer-Moore Algorithm • 3 main ideas • right to left scan • bad character rule • good suffix rule

Right to left scan 0 1 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 x z b c t z x t b p t x c t b p q t b c b b t b c b b t b cb b t b cb b

Bad character rule • Definition • For each character x in the alphabet, let R(x) denote the position of the right-most occurrence of character x in P. • R(x) is defined to be 0 if x is not in P • Usage • Suppose characters in P[i+1..n] match T[k+1..k+n-i] and P[i] mismatches T[k]. • Shift P to the right by max(1, i - R(T[k])) places • Hopefully more than 1

Illustration of bad character rule 0 1 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 x z b c t z x t b p t x c t b p q t b cb b t b cb b i = 5, R(z) = 0, so max(1, 5-0) = 5 t b c bb i = 5, R(t) = 1, so max(1, 5-1) = 4 i = 4, R(t) = 1, so max(1, 4-1) = 3 t b cb b

Extended bad character rule • Definition • For each character x in the alphabet, let R(x,i) denote the position of the right-most occurrence of character x P[1..i-1]. • R(x,i) is defined to be 0 if x is not in P[1..i-1]. • Usage • Suppose characters in P[i+1..n] match T[k+1..k+n-i] and P[i] mismatches T[k]. • Shift P to the right by max(1, i - R(T[k],i)) places • Hopefully more than 1

Illustration of extended bad character rule 0 1 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 x z b c b b x t b p t x c t b p q b t c tb i = 4, R(b) = 5, so max(1, 4-5) = 1 b t ct b i = 4, R(b,4) = 1, so max(1, 4-1) = 3 b t ct b

Implementation Issues • Bad character rule • Space required: O(|S|) for the number of characters in the alphabet • Calculate R[] matrix in O(n) time (exercise) • Extended bad character rule • Space required: full table is O(n|S|) • Smaller implementation: O(n) • Preprocess time: O(n) • Search time impact: increases search time by at worst twice the number of mismatches • See book for details (pg 18)

Observations • Bad character rules • work well in practice with large alphabets like the English alphabet • work less well with small alphabets like DNA • Do not guarantee linear worst-case run-time • Give an example of such a case

Strong good suffix rule part 1 • Situation • P[i..n] matches text T[j..j+n-i] but T[(j-1) does not match P(i-1) • The rightmost non-suffix substring t’ of P that matches the suffix P[i..n] AND the character to the left of t’ in P is different than P(i-1) • Shift P so that t’ matches up with T[j..j+n-i]

Illustration of suffix rule part 1 0 1 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 p r s t a b s t u b a b v q x r s t q c a b d a b d a b q c ab d a b d a b q c ab d a b d a b

Preprocessing for suffix rule part 1 • Definitions • For each i, L’(i) is the largest position less than n such that P[i..n] matches a suffix of P[1..L’(i)] and that the character preceding that suffix is not equal to P(i-1). • For string P, Nj(P) is the length of the longest suffix of the substring P[1..j] that is also a suffix of P • Observations • Nj(P) = Zn-j+1(Pr) • L’(i) is the largest j < n such that Nj(P) = |P[i..n]| which equals n-i+1 • If L’(i) > 0, shift P by n-L’(i) places to the right

Z-based computation of L’(i) for (i=1;i<=n;i++) L’(i) = 0; for (j=1; j<=n-1; j++) { k = n-Nj(P)+1; L’(k) = j; }

Strong good suffix rule part 2 • If L’(i) = 0 then … • Let t’’ = the largest suffix of P[i..n] that is also a prefix of P, if one exists. • If t’’ exists, shift P so that prefix of P matches up with t’’ at end of T[j..j+n-i]. • Otherwise, shift P past T[j+n-i].

Illustration of suffix rule part 2 0 1 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 p r s t a b s t a b a b v q x r s t a b a b s t ab a b a ba b s t a b a b a ba b s t a b a b

Preprocessing for suffix rule part 2 • Definitions • For each i, let l’(i) denote the length of the largest suffix of P[i..n] that is also a prefix of P, if one exists. Otherwise, let l’(i)=0. • Observations • l’(i) = the largest j <= |P[i..n]| such that Nj(P) = j • Question • How does l’(i) relate to l’(i+1)? • The same unless Nn-i+1(P) = n-i+1

Z-based computation of l’(i) l’[n+1] = 0; for (i=n;i>=2;i--) if (N[n-i+1] = = (n-i+1)) l’[i] = n-i+1; else l’[i] = l’[i+1]; }

Addendum to suffix rule • Shift by 1 if there is an immediate mismatch • That is, if P(n) mismatches with the corresponding character in T

Boyer-Moore Overview • Precompute L’(i), l’(i) for each position in P • Precompute R(x) or R(x,i) for each character x in S • Align P to T • Compare right to left • On mismatch, shift by the max possible from (extended) bad character rule and good suffix rule and return to compare

Observations I • Original Boyer-Moore algorithm uses “weak good suffix rule” without using the mismatch information • This is not sufficient to prove that the search part of Boyer-Moore runs in linear time in the worst case • Using only strong good suffix rule, can prove a worst-case time of O(n) provided P is not in T • If P is in T, original Boyer-Moore runs in Q(nm) time in the worst case, but this can be corrected with simple modifications • Using only the bad character shift rule leads to O(nm) time in the worst-case, but works in sublinear time on random strings

Boyer-Moore Algorithm