Boyer-Moore string search algorithm Book by Dan Gusfield: Algorithms on Strings, Trees and Sequences (1997) Original: Robert S. Boyer, J Strother Moore (1977) Presented by: Vladimir Zoubritsky
Agenda • Problem Statement • Bad character rule • Boyer-Moore-Horspool algorithm • Good Suffix Rule • Preprocessing • Analysis
Problem Statement • Given a pattern P(1..n) and a text T(1..m) defined over an alphabet Σ, find one or all occurrences of P in T. • The Boyer-Moore algorithm (1977) provides an efficient solution. The algorithm runs in linear time in the worst case and in sub-linear time in most practical cases.
Right to left matching idea • Other known algorithms, e.g. Brute Force, match the pattern from left to right. • Algorithm: Align the left end of P with index k of T. Start matching at T(k+n-1) against P(n) and move leftward; if all letters match, report an occurrence. • By itself, matching from right to left has the same running time as Brute Force. • Knowing which suffix of P has already matched, however, lets us decide to skip over ranges of characters.
Algorithm Skeleton • Align P with the beginning of T and match from right to left. • If all of P matched, report an occurrence. • Otherwise shift P by the larger of the bad character shift and the good suffix shift. Conditional correctness: If neither shift ever skips over an occurrence of P in T, the algorithm reports all occurrences.
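The skeleton above can be sketched as a small driver loop. This is only an illustration under assumptions: the names `search_skeleton`, `bad_char_shift` and `good_suffix_shift` are hypothetical, offsets in the result are 0-based, and the slide does not say how far to shift after a full match, so the sketch asks the good suffix function with `i = 0` for that case.

```python
from typing import Callable, List


def search_skeleton(
    pattern: str,
    text: str,
    bad_char_shift: Callable[[int, str], int],
    good_suffix_shift: Callable[[int], int],
) -> List[int]:
    """Return 0-based offsets of pattern in text.

    bad_char_shift(i, x): shift for a mismatch at 1-based position i of P
    against text character x. good_suffix_shift(i): shift for a mismatch at
    1-based position i (i == 0 means the whole pattern matched; this
    convention is an assumption of the sketch).
    """
    n, m = len(pattern), len(text)
    occurrences: List[int] = []
    k = 0  # 0-based offset of the left end of P in T
    while n > 0 and k <= m - n:
        i = n  # 1-based position in P, scanning right to left
        while i >= 1 and pattern[i - 1] == text[k + i - 1]:
            i -= 1
        if i == 0:
            occurrences.append(k)
            k += max(1, good_suffix_shift(0))
        else:
            bad_char = text[k + i - 1]
            k += max(1, bad_char_shift(i, bad_char), good_suffix_shift(i))
    return occurrences
```

With both shift functions returning 1 the driver degenerates to Brute Force with right-to-left comparison, which is a convenient baseline for checking any real shift functions plugged in later.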
Bad Character rule • Definition: For each character x, let R(x) be the position of the right-most occurrence of character x in P. R(x) is defined to be zero if x does not occur in P.
Bad character shift • Definition: Suppose that for a particular alignment of P against T, the rightmost n-i characters of P match their counterparts in T, but the character P(i) mismatches its counterpart, say at position k of T. If the right-most position of the character T(k) in P is j, with j < i, then shift P so that character j of P is below character k of T; otherwise shift by 1. • The shift is therefore max[1, i-R(T(k))].
Bad character shift • Simple case: The character T(k) aligned with P(n) does not appear in P, so P is shifted by n positions (to start just after k).
Bad character shift • General case: Shift by i - R(x), where x = T(k) is the mismatched text character. Correctness is easy to prove: any smaller shift would place T(k) under a position of P to the right of R(x), and none of those positions holds x.
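The definition of R(x) and the shift max[1, i-R(T(k))] can be sketched in a few lines. The function names are hypothetical; positions are 1-based to match the slides, and a missing character is treated as R(x) = 0.

```python
from typing import Dict


def rightmost_positions(pattern: str) -> Dict[str, int]:
    """R(x): 1-based position of the right-most occurrence of x in pattern.

    Later occurrences overwrite earlier ones, so each character ends up
    mapped to its right-most position. Characters absent from the pattern
    are simply missing from the dict and are read as 0 by the caller.
    """
    return {ch: pos for pos, ch in enumerate(pattern, start=1)}


def bad_character_shift(R: Dict[str, int], i: int, bad_char: str) -> int:
    """Shift for a mismatch at 1-based position i of P against bad_char."""
    return max(1, i - R.get(bad_char, 0))
```

For P = "abcab" the table is R(a) = 4, R(b) = 5, R(c) = 3. A mismatch at i = 5 against 'c' shifts by 2, against a character not in P by 5, and a mismatch at i = 3 against 'a' falls back to a shift of 1 because R(a) lies to the right of i.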
Boyer-Moore-Horspool algorithm • Described by Horspool in 1980. • Basic idea: use the Boyer-Moore algorithm, but with only the bad character shift rule. • Worst case running time in degenerate cases may be O(nm). • Best case is sub-linear: O(m/n).
Boyer-Moore-Horspool worst case • A pair of pattern and text could be constructed to have a shift of 1 each time (same as Brute Force).
Boyer-Moore-Horspool best case • When the text characters aligned with the last character of the pattern do not appear in the pattern at all, each shift is n positions.
Boyer-Moore-Horspool Time • Preprocessing: Scanning the pattern is done in O(n) time, using O(|Σ|) space. • Worst case: O(nm). • Best case: O(m/n). • Average time: An expression for the average number of comparisons in the general case of Boyer-Moore-Horspool was established [Baeza-Yates 1990]. • The bad character rule is not strong enough to guarantee linear time (see worst case above).
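The best and worst cases can be observed with a short sketch that uses only the bad character shift and counts character comparisons. It follows the slide's formulation (shift keyed on the mismatched character); Horspool's published variant is usually stated with the shift keyed on the text character under P(n) instead, so treat this as an approximation of BMH rather than a reference implementation. The function name and the returned tuple are assumptions of the sketch.

```python
from typing import List, Tuple


def bad_character_search(pattern: str, text: str) -> Tuple[List[int], int]:
    """Return (0-based occurrences, number of character comparisons)."""
    n, m = len(pattern), len(text)
    if n == 0 or n > m:
        return [], 0
    # R(x): 1-based right-most position of x in the pattern.
    R = {ch: pos for pos, ch in enumerate(pattern, start=1)}
    occurrences: List[int] = []
    comparisons = 0
    k = 0  # 0-based offset of the left end of P in T
    while k <= m - n:
        i = n  # 1-based position in P, scanning right to left
        while i >= 1:
            comparisons += 1
            if pattern[i - 1] != text[k + i - 1]:
                break
            i -= 1
        if i == 0:
            occurrences.append(k)
            k += 1  # conservative shift after a full match
        else:
            k += max(1, i - R.get(text[k + i - 1], 0))
    return occurrences, comparisons
```

Searching for "abcd" in sixteen 'x' characters takes 4 comparisons (one per alignment, each followed by a shift of 4), while searching for "baaa" in ten 'a' characters takes 28 comparisons (7 alignments of 4 comparisons each, every shift equal to 1).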
Good Suffix Rule • Definition: Suppose for a given alignment of P and T, a substring t of T matches a suffix of P, but a mismatch occurs at the next character to the left. Then find, if it exists, the rightmost copy t' of t in P, such that t' is not a suffix of P, and the character to the left of t' in P differs from the one to the left of t in P. Shift P to the right, so that substring t' in P is below substring t in T.
Good suffix rule (cont'd) • If t' does not exist, then shift the left end of P past the left end of t in T by the least amount, so that a prefix of P matches a suffix of t in T. If no such shift is possible then shift P by n places to the right.
Correctness of the good-suffix shift • Recall: Suppose for a given alignment of P and T, a substring t of T matches a suffix of P, but a mismatch occurs at the next character to the left. • If there is only one occurrence of t in P, then any alignment with the left end of P aligned before the left end of t will not yield a match. • If we align t with a previous copy t' of t in P, and the character before t' is equal to the character before t, this alignment will fail the same way.
Preprocessing of P • The originally published preprocessing algorithm was complex and erroneous. An updated version was still complex. • We will use a simpler version based on the Z algorithm. • We want the preprocessing to compute values for the functions L'(i) and l'(i), defined later.
Preprocessing of P (cont'd) • An intermediate value we will require is N_j(P). N_j(P) is defined as the length of the longest suffix of P[1..j] which is also a suffix of P. • Recall that Z_i(S) is the length of the longest substring of S that starts at position i and is also a prefix of S. • We can compute the values N_j(P) by running the Z-algorithm on the reverse of P.
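A minimal sketch of this step, assuming a standard linear-time Z-algorithm and 0-based Python lists (so `N[j-1]` holds N_j(P)). The function names are hypothetical, and `z[0]` is set to the full length by convention, which makes N_n(P) = n.

```python
from typing import List


def z_array(s: str) -> List[int]:
    """z[i] = length of the longest substring starting at i that is a prefix of s."""
    n = len(s)
    z = [0] * n
    if n == 0:
        return z
    z[0] = n  # convention chosen for this sketch
    left = right = 0  # current Z-box is s[left:right]
    for i in range(1, n):
        if i < right:
            # Reuse the value computed for the mirrored position in the prefix.
            z[i] = min(right - i, z[i - left])
        # Extend the match by explicit comparisons.
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1
        if i + z[i] > right:
            left, right = i, i + z[i]
    return z


def n_array(pattern: str) -> List[int]:
    """N[j-1] = N_j(P): longest suffix of P[1..j] that is also a suffix of P."""
    # Z-values of the reversed pattern, read back to front, are the N-values.
    return z_array(pattern[::-1])[::-1]
```

For P = "cabdabdab" this gives N_3(P) = 2 (the suffix "ab" of "cab") and N_6(P) = 5 (the suffix "abdab" of "cabdab").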
Preprocessing of P: calculating L'(i) • L'(i) gives the right-end position of the right-most copy of P[i..n] which is not a suffix of P and is preceded by a different character. L'(i) is zero if no such position exists. • Using N_j(P), we can define L'(i) as the largest j < n so that N_j(P) = |P[i..n]| = n-i+1. • L'(i) can be accumulated in linear time from the values of N_j(P).
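The linear-time accumulation can be sketched as a single pass over j = 1..n-1. The function name is hypothetical; the input `N` is assumed to be a 0-based list with `N[j-1]` = N_j(P), and the result is 1-indexed (slot 0 unused) so that `Lp[i]` reads as L'(i).

```python
from typing import List


def big_l_prime(N: List[int]) -> List[int]:
    """Return Lp with Lp[i] = L'(i) for 1 <= i <= n; Lp[0] is unused."""
    n = len(N)
    Lp = [0] * (n + 1)
    for j in range(1, n):  # j = 1 .. n-1, so the copy is never the suffix itself
        length = N[j - 1]
        if length > 0:
            i = n - length + 1  # the suffix P[i..n] has exactly this length
            Lp[i] = j  # increasing j overwrites, leaving the largest j
    return Lp
```

For P = "cabdabdab", with N = [0, 0, 2, 0, 0, 5, 0, 0, 9], this yields L'(8) = 3 and L'(5) = 6, and zero everywhere else.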
Preprocessing of P: calculating l'(i) • l'(i) is the length of the largest suffix of P[i..n] that is also a prefix of P, if one exists, and zero otherwise. • We can also define l'(i) in terms of N_j(P): l'(i) is the largest j ≤ |P[i..n]| = n-i+1 so that N_j(P) = j. • In a similar way, l'(i) can be accumulated in linear time from the N_j(P) values.
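A sketch of the corresponding pass for l'(i), under the same assumptions as before: hypothetical function name, `N[j-1]` = N_j(P) as a 0-based list, and a 1-indexed result. Walking i from n down to 1 makes the bound n-i+1 grow, so the running maximum is always the largest valid j.

```python
from typing import List


def small_l_prime(N: List[int]) -> List[int]:
    """Return lp with lp[i] = l'(i) for 1 <= i <= n; lp[0] is unused."""
    n = len(N)
    lp = [0] * (n + 1)
    longest = 0  # largest j seen so far with N_j(P) == j
    for i in range(n, 0, -1):
        j = n - i + 1  # the only new candidate length at this step
        if N[j - 1] == j:  # P[1..j] is a suffix of P, i.e. a prefix that is a suffix
            longest = j
        lp[i] = longest
    return lp
```

For P = "abab", with N = [0, 2, 0, 4], this gives l'(4) = 0, l'(3) = l'(2) = 2 and l'(1) = 4.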
Using the preprocessing results • The first part of the good suffix rule says we should find a copy of t which is preceded by a different character, i.e. use a non-zero value of L'(i). • The second part looks for the least shift that lets a prefix of P match a suffix of t, i.e. use a non-zero value of l'(i).
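Putting the two tables to use, a good suffix shift might be selected as below. This is a sketch under assumptions: the function name is hypothetical, `Lp` and `lp` are 1-indexed lists of length n+1 holding L'(i) and l'(i), and `i` is the 1-based mismatch position in P as in the bad character slides (so the matched suffix is P[i+1..n]). The shift of 1 when nothing matched and the shift n - l'(2) after a full match follow Gusfield's formulation, which the slides do not spell out.

```python
from typing import List


def good_suffix_shift(i: int, Lp: List[int], lp: List[int]) -> int:
    """Shift for a mismatch at 1-based position i of P (i == 0: full match)."""
    n = len(Lp) - 1
    if i == n:
        return 1  # mismatch on the first comparison: no matched suffix t
    if i == 0:
        # Whole pattern matched: align the longest proper prefix that is a suffix.
        return n - lp[2] if n > 1 else 1
    if Lp[i + 1] > 0:
        return n - Lp[i + 1]  # first part of the rule: a copy t' exists
    return n - lp[i + 1]  # second part: prefix of P against a suffix of t
```

For P = "cabdabdab", a mismatch at i = 7 (after matching "ab") shifts by 9 - L'(8) = 6, which places the copy "ab" at positions 2..3 of P under t; a mismatch at i = 5 finds neither table entry set and shifts by the full length 9.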
Boyer-Moore Time • Using the linear time implementation of the Z algorithm, the preprocessing takes O(n) time and O(n) space. • The original Boyer-Moore algorithm could take O(nm) time in cases where P appears in T; a few simple modifications remove this behaviour [Galil 1979]. • A tight bound of 3m comparisons was established for the Boyer-Moore running time [Cole 1991]. • An average case analysis has been proposed, but it remains difficult to reduce to a simple expression as in BMH [Tsai 2005]. • For other, “Boyer-Moore-like” algorithms the following time bounds were established:
Experimental Analysis • On average, for sufficiently large alphabets (8 characters), Boyer-Moore-Horspool has a fast running time and a sub-linear number of character comparisons. • On average and in the worst case, Boyer-Moore is faster than “Boyer-Moore-like” algorithms. Data from Michailidis and Margaritis [2001]