A Fast String Searching Algorithm. Robert S. Boyer and J Strother Moore. Communications of the ACM, vol. 20, no. 10, Oct. 1977.
Outline:
• Introduction
• The Knuth-Morris-Pratt algorithm
• The Boyer-Moore algorithm
  • Bad Character heuristic
  • Good Suffix heuristic
  • Matching Algorithm
• Experimental Result
• Conclusion
Introduction
• String matching:
  • Searching for a pattern in a text or a longer string.
  • If the pattern exists in the string, return the position of the first character of the substring that matches the pattern.
[Figure: a pattern aligned under a string at shift s.]
Introduction (cont.)
• Some definitions:
  • m: the length of the pattern.
  • n: the length of the string (or text).
  • s (shift): the offset of the first character of the matched substring from the first character of the string.
  • w ⊏ x: the string w is a prefix of the string x.
  • w ⊐ x: the string w is a suffix of the string x.
Introduction (cont.)
• The naive string-matching algorithm:
    for s ← 0 to n-m
        do if pattern[1..m] = string[s+1..s+m]
            then print “Pattern occurs with shift” s
• Time complexity:
  • Θ((n-m+1)m) in the worst case.
  • Θ(n²) when m is proportional to n, e.g. m = ⌊n/2⌋.
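The naive algorithm above can be sketched in Python as follows. This is a minimal illustration, not code from the slides: the function name `naive_search` is an assumption, and shifts are 0-based, which matches the shift s in the pseudocode.

```python
def naive_search(string, pattern):
    """Return every shift s at which pattern occurs in string (0-based)."""
    n, m = len(string), len(pattern)
    shifts = []
    # Try each of the n - m + 1 candidate shifts in turn.
    for s in range(n - m + 1):
        # Slice comparison costs up to m character comparisons,
        # which gives the Theta((n - m + 1) m) worst case.
        if string[s:s + m] == pattern:
            shifts.append(s)
    return shifts


if __name__ == "__main__":
    print(naive_search("BACBABABAABCBAB", "ABA"))
```

Every candidate shift is re-examined from scratch, which is the redundancy the Knuth-Morris-Pratt and Boyer-Moore algorithms avoid.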
Knuth-Morris-Pratt Algorithm
[Figure: the pattern A B A B A C A aligned against the string B A C B A B A B A A B C B A B at shift s with q characters matched, and again at the next shift s’ with k characters matched.]
• s + q = s’ + k
Knuth-Morris-Pratt Algorithm (cont.)
• Prefix function:
  • f(j) = the largest i < j such that P[1..i] = P[j-i+1..j], or 0 if no such i exists.
[Figure: the prefix Pq = A B A B A of the pattern, with Pk appearing as both a prefix and a suffix of Pq.]
Knuth-Morris-Pratt Algorithm (cont.)
• Prefix function algorithm:
    f[1] ← 0
    k ← 0
    for q ← 2 to m
        do while k > 0 and P[k+1] ≠ P[q]
               do k ← f[k]
           if P[k+1] = P[q]
               then k ← k+1
           f[q] ← k
    return f[1..m]
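A direct Python transcription of the prefix function pseudocode may help when tracing it by hand. The name `prefix_function` is an assumption; the returned list is 0-indexed, so entry `q - 1` holds f[q] from the slides.

```python
def prefix_function(P):
    """Return [f[1], ..., f[m]] for pattern P, following the slide pseudocode."""
    m = len(P)
    f = [0] * m          # f[q - 1] stores the slide's f[q]
    k = 0                # length of the current matched prefix
    for q in range(2, m + 1):
        # Fall back through shorter borders until P[k+1] can extend one.
        while k > 0 and P[k] != P[q - 1]:   # P[k] is the slide's P[k+1]
            k = f[k - 1]                    # the slide's k <- f[k]
        if P[k] == P[q - 1]:
            k += 1
        f[q - 1] = k
    return f


if __name__ == "__main__":
    print(prefix_function("ABABACABABA"))
```

The inner `while` loop can only undo increments made by earlier iterations, which is the amortized argument behind the O(m) bound.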
Knuth-Morris-Pratt Algorithm (cont.)
• Example:
    k     1  2  3  4  5  6  7  8  9  10 11
    P[k]  A  B  A  B  A  C  A  B  A  B  A
    f[k]  0  0  1  2  3  0  1  2  3  4  5
• Time complexity:
  • Prefix function: O(m), by amortized analysis
  • Matching function: O(n)
  • Total: O(m+n), i.e. linear
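The slides give only the cost of the matching function, not its code. The sketch below shows one common way to write it, reusing the prefix function; the name `kmp_search` and the choice to return all 0-based shifts are assumptions made for illustration.

```python
def kmp_search(text, pat):
    """Return all 0-based shifts at which pat occurs in text (KMP)."""
    m = len(pat)
    if m == 0:
        return list(range(len(text) + 1))
    # Prefix function, 0-indexed: f[q - 1] is the slide's f[q].
    f = [0] * m
    k = 0
    for q in range(1, m):
        while k > 0 and pat[k] != pat[q]:
            k = f[k - 1]
        if pat[k] == pat[q]:
            k += 1
        f[q] = k
    shifts = []
    q = 0  # number of pattern characters currently matched
    for i, c in enumerate(text):
        # On a mismatch, slide the pattern so that s + q = s' + k holds.
        while q > 0 and pat[q] != c:
            q = f[q - 1]
        if pat[q] == c:
            q += 1
        if q == m:
            shifts.append(i - m + 1)
            q = f[q - 1]  # keep going to find overlapping occurrences
    return shifts


if __name__ == "__main__":
    print(kmp_search("ABABABAB", "ABAB"))
```

The text index `i` never moves backwards, so the scan is O(n) after the O(m) preprocessing.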
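The Boyer-Moore Algorithm
• Symbols used:
  • Σ: the alphabet
  • patlen: the length of the pattern
  • m: the number of characters at the end of the pattern that have been matched
  • char: the mismatched character in the string
[Figure: the pattern aligned under the string, with the last m characters matched and char marking the mismatch.]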
Characteristic
• The pattern is matched from its rightmost character to its leftmost character.
• When the pattern is relatively long and Σ is reasonably large, this algorithm is likely to be the most efficient string-matching algorithm.
Bad Character heuristic
• Observation 1:
  • If char does not occur in pat:
    • Pattern shift: j characters
    • String pointer shift: patlen characters
• Example: [figure not recoverable]
Bad Character heuristic (cont.)
• Observation 2:
  • char occurs in the pattern.
  • The rightmost occurrence of char in the pattern is at position δ1[char], and the pattern pointer is at position j.
  • If j < δ1[char], shift the pattern right by 1.
  • If j > δ1[char], shift the pattern right by j - δ1[char].
  • δ1[] is an array whose size is the size of Σ.
Bad Character heuristic (cont.)
• Example: j = 3 and δ1[B] = 2
  • Pattern shift: j - δ1[B] = 1
  • String pointer shift: 1 (m + pattern shift)
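The two observations can be combined into one small function. The sketch below is illustrative only: the names `rightmost_positions` and `bad_char_pattern_shift` are assumptions, positions are 1-based as on the slides, and a dictionary stands in for the array indexed by Σ. The pattern "ABC" is a hypothetical choice that reproduces j = 3 and δ1[B] = 2.

```python
def rightmost_positions(pat):
    """delta1 as used on these slides: 1-based rightmost position of each char."""
    # Later occurrences overwrite earlier ones, leaving the rightmost.
    return {c: idx + 1 for idx, c in enumerate(pat)}


def bad_char_pattern_shift(pat, j, char):
    """Pattern shift after a mismatch of pat(j) against char (1-based j)."""
    pos = rightmost_positions(pat).get(char)
    if pos is None:
        return j            # Observation 1: char does not occur in pat
    if j < pos:
        return 1            # Observation 2, rightmost occurrence is right of j
    # j == pos cannot happen on a real mismatch, since pat(j) != char.
    return j - pos          # Observation 2, align the rightmost occurrence


if __name__ == "__main__":
    # Hypothetical pattern chosen so that j = 3 and delta1[B] = 2.
    print(bad_char_pattern_shift("ABC", 3, "B"))
```

Adding m, the number of characters already matched, to the pattern shift gives the string pointer shift.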
Good Suffix heuristic
• Two sequences [c1..cn] and [d1..dn] unify if, for every i from 1 to n, either ci = di, or ci = $, or di = $, where $ is a character that does not occur in pat.
• The position of the rightmost plausible reoccurrence, rpr(j), is the greatest k such that [pat(j+1)..pat(patlen)] and [pat(k)..pat(k+patlen-j-1)] unify, and either k ≤ 1 or pat(k-1) ≠ pat(j). Positions before the start of pat are treated as $.
Good Suffix heuristic (cont.)
• Example: [Figure: pat with position j and rpr(j) marked.]
• Pattern shift: j + 1 - rpr(j)
• String pointer shift: m + j + 1 - rpr(j) = patlen - j + j + 1 - rpr(j) = patlen + 1 - rpr(j) = δ2[j]
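The definition of rpr(j) and the formula δ2[j] = patlen + 1 - rpr(j) can be checked by brute force. The sketch below evaluates the definition directly and is not an efficient preprocessing algorithm; the function names are assumptions, j is 1-based, and positions before the pattern are modelled as `None` in place of $.

```python
def rpr(pat, j):
    """Rightmost plausible reoccurrence of pat(j+1..patlen), for 1-based j."""
    patlen = len(pat)
    sufflen = patlen - j            # length of the matched suffix (m on the slides)

    def ch(i):
        # 1-based access; positions before the pattern behave like '$'.
        return pat[i - 1] if i >= 1 else None

    # Candidates run from k = j down to k = 1 - sufflen, where the
    # reoccurrence lies entirely before the pattern and always unifies.
    for k in range(j, -sufflen, -1):
        if k > 1 and ch(k - 1) == pat[j - 1]:
            continue                # preceded by pat(j): not plausible
        if all(ch(k + t) in (None, pat[j + t]) for t in range(sufflen)):
            return k
    return 1 - sufflen              # not reached for 1 <= j <= patlen


def delta2(pat):
    """Return [delta2[1], ..., delta2[patlen]] as string pointer shifts."""
    patlen = len(pat)
    return [patlen + 1 - rpr(pat, j) for j in range(1, patlen + 1)]


if __name__ == "__main__":
    print(delta2("ABCXXXABC"))
```

Each call to `rpr` costs O(patlen²) here, so this version is only suitable for checking small examples by hand.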
Good Suffix heuristic (cont.) • Algorithm:
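Boyer-Moore Matching Algorithm
    if n < patlen then return false
    i ← patlen
    while i ≤ n do {
        j ← patlen
        while j > 0 and string(i) = pat(j) do {
            j ← j - 1
            i ← i - 1
        }
        if j = 0 then return i + 1
        i ← i + max(δ1(string(i)), δ2(j))
    }
    return false
• Here δ1(char) is the string pointer shift of the bad character heuristic: patlen if char does not occur in pat, otherwise patlen minus the rightmost position of char in pat.
• j is reset to patlen after every shift, and the match position i + 1 is returned once j reaches 0.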
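The matching loop can be put together with both heuristics in a short Python sketch. Several choices here are assumptions rather than part of the slides: the name `boyer_moore_search`, returning a 0-based index or -1, using a dictionary for δ1, and computing δ2 by brute force from the rpr definition instead of with a linear-time preprocessing step.

```python
def boyer_moore_search(string, pat):
    """Return the 0-based index of the first occurrence of pat, or -1."""
    n, patlen = len(string), len(pat)
    if patlen == 0:
        return 0

    # delta1 as a string pointer shift: patlen - (1-based rightmost position).
    delta1 = {c: patlen - 1 - idx for idx, c in enumerate(pat)}

    def ch(i):
        # 1-based access; positions before the pattern behave like '$'.
        return pat[i - 1] if i >= 1 else None

    def rpr(j):
        sufflen = patlen - j
        for k in range(j, -sufflen, -1):
            if k > 1 and ch(k - 1) == pat[j - 1]:
                continue
            if all(ch(k + t) in (None, pat[j + t]) for t in range(sufflen)):
                return k
        return 1 - sufflen          # not reached for 1 <= j <= patlen

    # delta2[j] for 1-based j; index 0 is unused padding.
    delta2 = [0] + [patlen + 1 - rpr(j) for j in range(1, patlen + 1)]

    i = patlen                      # 1-based pointer into string
    while i <= n:
        j = patlen                  # restart at the right end of the pattern
        while j > 0 and string[i - 1] == pat[j - 1]:
            j -= 1
            i -= 1
        if j == 0:
            return i                # 1-based position i + 1, i.e. 0-based i
        i += max(delta1.get(string[i - 1], patlen), delta2[j])
    return -1


if __name__ == "__main__":
    print(boyer_moore_search("HERE IS A SIMPLE EXAMPLE", "EXAMPLE"))
```

Because δ2[j] is always at least m + 1, the pointer i ends each iteration strictly to the right of where the iteration started, so the loop terminates.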
Boyer-Moore Matching Algorithm
• Time complexity:
  • Bad Character heuristic: O(patlen)
  • Good Suffix heuristic: O(patlen)
  • Matching: O(n)
  • Total: O(n + patlen)
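Conclusion
• The Boyer-Moore algorithm runs in O(n + patlen) time by the analysis above and is sublinear on average, since it usually skips over many characters of the string without inspecting them.
• Boyer-Moore is likely to be the most efficient string-matching algorithm when the pattern is long and the alphabet is reasonably large.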