String Searching CSCI 2720 Spring 2007 Eileen Kraemer
String Search • A common word processor facility is to search for a given word in a document. • Generally, the problem is to search for occurrences of a short string in a long string. • Example: search for the word "the" in the document "Do the first then do the other one".
History of String Search • The brute force algorithm: invented in the dawn of computer history; re-invented many times, still common • Knuth & Pratt invented a better one in 1970; invented independently by Morris; published 1977 as “Knuth-Morris-Pratt” • Boyer & Moore found a better one before 1976; found independently by Gosper • Karp & Rabin found a “better” one in 1980
The obvious algorithm is to try the word at each possible place, and compare all the characters:
    for i := 0 to n-m do                    (doc length n)
        for j := 0 to m-1 do                (word length m)
            compare word[j] with doc[i+j]
            if not equal, exit the inner loop
• The complexity is at worst O(m*n) and best O(n).
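Improving String Search • Surprisingly, there is a faster algorithm where you compare the last characters first. • Example: search for "the" in "Do the first then do the other one". With the word at the start of the document, compare 'e' with ' ': the comparison fails, and ' ' does not occur in the word, so move along 3 places. • When the document character does occur in the word, the move is shorter: in the second alignment shown on the slide, the word can only move along 2 places.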
Improved string search, continued • In every case where the document character is not one of the characters in the word, we can move along m places. Sometimes, it is less.
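The last-character-first idea can be sketched in a few lines of Python. This is a minimal sketch, not the slides' own code: it uses the simplified shift rule usually attributed to Horspool (shift by the document character under the last position of the word), and the names `last_char_first_search` and `shift` are made up for illustration.

```python
def last_char_first_search(word, doc):
    """Return the index of the first occurrence of word in doc, or -1."""
    m, n = len(word), len(doc)
    if m == 0:
        return 0
    # For each character of the word except the last one, record the distance
    # from its rightmost occurrence to the end of the word.
    shift = {ch: m - 1 - j for j, ch in enumerate(word[:-1])}
    i = 0
    while i <= n - m:
        # Compare right to left, starting with the last character.
        j = m - 1
        while j >= 0 and word[j] == doc[i + j]:
            j -= 1
        if j < 0:
            return i
        # A document character that is not in the table lets us move m places.
        i += shift.get(doc[i + m - 1], m)
    return -1


doc = "Do the first then do the other one"
print(last_char_first_search("the", doc))  # first alignment fails on ' ', then moves 3
```

For the word "the" the table is `{'t': 2, 'h': 1}`; every other document character, including the space, gives the full move of 3 places.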
Problem Definition, terminology • Let p be the pattern string • Let t be the target string • Let k be the index of the character in the target string that “lies over” the first character of the pattern • Given two strings, p and t, over the alphabet Σ, determine whether p occurs as a substring of t • That is, determine whether there exists k such that p = Substring(t,k,|p|).
Straightforward string searching
function SimpleStringSearch(string p, t): integer
{Find p in t; return its location or -1 if p is not a substring of t}
    for k from 0 to Length(t) - Length(p) do
        i <- 0
        while i < Length(p) and p[i] = t[k+i] do
            i <- i+1
        if i == Length(p) then return k
    return -1
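A direct Python transcription of SimpleStringSearch, given as a sketch; the function name `simple_string_search` and the sample strings are chosen here for illustration only.

```python
def simple_string_search(p, t):
    """Find p in t; return its location, or -1 if p is not a substring of t."""
    # k is the index in t that lies over the first character of p.
    for k in range(len(t) - len(p) + 1):
        i = 0
        while i < len(p) and p[i] == t[k + i]:
            i += 1
        if i == len(p):
            return k
    return -1


print(simple_string_search("the", "Do the first then do the other one"))
```

When p is longer than t the range is empty and the function returns -1 without any comparison.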
SimpleStringSearch [Figure, eight frames: a 4-character pattern p[0..3] slides one box at a time under the target t[0..10]. In the first alignment three characters match before the fourth fails (Y Y Y N); the next six alignments fail on the first comparison (N); the last alignment matches all four characters (Y Y Y Y).]
Straightforward string searching • Worst case: pattern string always matches completely except for last character • Example: search for XXXXXXY in target string of XXXXXXXXXXXXXXXXXXXX • Outer loop executed roughly once for every character in target string • Inner loop executed once for every character in pattern • O(|p| * |t|) • Okay if patterns are short, but better algorithms exist
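To make the worst case concrete, the sketch below counts character comparisons made by the straightforward search. The helper name `count_comparisons` is hypothetical, and the target is taken to be 20 X's for illustration.

```python
def count_comparisons(p, t):
    """Run the straightforward search; return (location, comparisons made)."""
    comparisons = 0
    for k in range(len(t) - len(p) + 1):
        for i in range(len(p)):
            comparisons += 1
            if p[i] != t[k + i]:
                break  # mismatch: give up on this alignment
        else:
            return k, comparisons  # inner loop finished: full match at k
    return -1, comparisons


# 14 alignments, each matching 6 X's before failing on the Y: 14 * 7 comparisons.
print(count_comparisons("XXXXXXY", "X" * 20))  # (-1, 98)
```

Every alignment pays the full |p| comparisons, which is where the product |p| * |t| comes from.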
Knuth-Morris-Pratt • Runs in O(|p| + |t|) time in its standard form, compared with O(|p| * |t|) for the straightforward search • Key idea: if pattern fails to match, slide pattern to right by as many boxes as possible without permitting a match to go unnoticed
Knuth-Morris-Pratt [Figure: a 5-character pattern p[0..4] aligned under the target t[0..10]; p[0] through p[3] match (Y Y Y Y) and p[4] fails (N). The slide then asks how far the pattern can slide to the right.]
Knuth-Morris-Pratt • Correct motion of pattern depends on both location of mismatch and the mismatching character • In the example, with c the target character at the mismatch: • If c == X : move 2 boxes to right • If c == E : move 5 boxes to right • If c == Z : target found; alg terminates
Knuth-Morris-Pratt • Goal: determine d, number of boxes to right pattern should move; smallest d such that: • p[0] = t[k+d] • p[1] = t[k+d+1] • p[2] = t[k+d+2] • … • p[i-d] = t[k+i]
Knuth-Morris-Pratt • Note: can be stated largely in terms of pattern alone. • Value of d depends only on: • The pattern • The value of i • The mismatching character c (at t[k+i])
Knuth-Morris-Pratt • Can define a function KMPskip(p,i,c) to give correct d • Return smallest integer d such that 0 <= d <= i, p[i-d] == c, and p[j] == p[j+d] for each 0 <= j <= i-d-1 • Return i+1 if no such d exists • Calculate all values of KMPskip for pattern p and store them in KMPskiparray • Do lookup at each mismatch
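The definition of KMPskip can be evaluated directly by trying each d in increasing order. The sketch below does exactly that; it is a brute-force illustration of the definition, not an efficient table construction, and the names `kmp_skip` and `compute_kmp_skip_array` are assumptions made here.

```python
def kmp_skip(p, i, c):
    """Smallest d in 0..i with p[i-d] == c and p[j] == p[j+d] for 0 <= j <= i-d-1;
    i+1 if no such d exists."""
    for d in range(i + 1):
        # range(i - d) yields j = 0 .. i-d-1, the part of the pattern that
        # must still line up after sliding d boxes.
        if p[i - d] == c and all(p[j] == p[j + d] for j in range(i - d)):
            return d
    return i + 1


def compute_kmp_skip_array(p, alphabet):
    """Table of skips indexed by (position in pattern, target character)."""
    return {(i, c): kmp_skip(p, i, c) for i in range(len(p)) for c in alphabet}


table = compute_kmp_skip_array("XYXYZ", "XYZE")
print(table[(4, "X")], table[(4, "E")], table[(4, "Z")])  # 2 5 0
```

A result of 0 means the characters match, so the pattern stays put and the comparison moves on to the next position.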
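Knuth-Morris-Pratt • For pattern ABCD (columns: position i in the pattern; rows: target character c; entries derived from the definition of KMPskip):
                 A   B   C   D
        A        0   1   2   3
        B        1   0   3   4
        C        1   2   0   4
        D        1   2   3   0
        other    1   2   3   4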
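Knuth-Morris-Pratt • For pattern XYXYZ (columns: position i in the pattern; rows: target character c; entries derived from the definition of KMPskip):
                 X   Y   X   Y   Z
        X        0   1   0   3   2
        Y        1   0   3   0   5
        Z        1   2   3   4   0
        other    1   2   3   4   5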
Knuth-Morris-Pratt
function KMPSearch(string p, t): integer
{Find p in t; return its location or -1 if p is not a substring of t}
    KMPskiparray <- ComputeKMPskiparray(p)
    k <- 0
    i <- 0
    while k <= Length(t) - Length(p) do
        if i == Length(p) then return k
        d <- KMPskiparray[i, t[k+i]]
        k <- k + d
        i <- i + 1 - d
    return -1
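A runnable sketch of KMPSearch in Python follows. As a simplification it evaluates the skip function on demand instead of precomputing KMPskiparray, and the names `kmp_skip` and `kmp_search` are assumptions for illustration.

```python
def kmp_skip(p, i, c):
    """Skip value from the definition: smallest d in 0..i such that p[i-d] == c
    and p[j] == p[j+d] for 0 <= j <= i-d-1; i+1 if there is none."""
    for d in range(i + 1):
        if p[i - d] == c and all(p[j] == p[j + d] for j in range(i - d)):
            return d
    return i + 1


def kmp_search(p, t):
    """Find p in t; return its location, or -1 if p is not a substring of t."""
    k = 0  # index in t lying over p[0]
    i = 0  # number of pattern characters matched so far
    while k <= len(t) - len(p):
        if i == len(p):
            return k
        d = kmp_skip(p, i, t[k + i])
        k += d        # slide the pattern d boxes to the right
        i += 1 - d    # k + i always advances by exactly one
    return -1


print(kmp_search("XYXYZ", "XYXYXYXYZ"))  # 4
```

Since k + i grows by one on every iteration, each target character is examined once; no character of the target is ever re-read.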
The Boyer-Moore Algorithm • Similar to KMP in that: • Pattern compared against target • On mismatch, move as far to right as possible • Different from KMP in that: • Compares the pattern with the target from right to left instead of left to right • Does that make a difference? • Yes!! Much faster on long targets; many characters in target string are never examined at all
Boyer-Moore example [Figure, three frames: a 4-character pattern p[0..3] under the target t[0..10], compared from right to left.]
Frame 1: the comparison at t[3] fails (N). There is no E in the pattern: thus the pattern can’t match if any of its characters lie under t[3]. So, move four boxes to the right.
Frame 2: Again, no match (N). But there is a B in the pattern. So move two boxes to the right.
Frame 3: all four characters match (Y Y Y Y).
Boyer-Moore : another example [Figure: pattern p[0] … p[m-1] aligned under target t[k] … t[k+m-1]; comparing from right to left, p[i+1] through p[m-1] match (Y) and p[i] fails against t[k+i] (N).] Problem: determine d, the number of boxes that the pattern can be moved to the right. d should be the smallest integer such that t[k+m-1] = p[m-1-d], t[k+m-2] = p[m-2-d], … t[k+i] = p[i-d]
The Boyer-Moore Algorithm • We said: d should be smallest integer such that: • t[k+m-1] = p[m-1-d] • t[k+m-2] = p[m-2-d] • … • t[k+i] = p[i-d] • Reminder: • k = starting index in target string • m = length of pattern • i = index of mismatch in pattern string • Problem: statement is valid only for d <= i • Need to ensure that we don’t “fall off” the left edge of the pattern
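Boyer-Moore : another example [Figure: a 9-character pattern p[0..8] aligned under target t[k] … t[k+8]; p[6], p[7] and p[8] match (Y Y Y) and p[5] fails against the target character c = t[k+5] (N).] If c == W, then d should be 3. If c == R, then d should be 7.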
BMPSkip • Let m = |p| • For any character c and any i such that 0 <= i < m, define BMPSkip(p,i,c) to be: • The amount the pattern can move to the right when characters i+1 through m-1 of the pattern match corresponding characters in the target but p[i] doesn’t match character c. • Then BMPSkip(p,i,c) should return the smallest d such that: • p[j] = p[j-d] for all j such that max(i+1,d) <= j <= m-1, and • p[i-d] = c if d <= i
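The two conditions in the definition can be checked directly for each candidate d. The sketch below is a brute-force reading of the definition rather than an efficient construction; the name `bm_skip` is an assumption, and the search starts at d = 1 because the function is only consulted on a mismatch.

```python
def bm_skip(p, i, c):
    """Smallest d >= 1 such that p[j] == p[j-d] for max(i+1, d) <= j <= m-1,
    and p[i-d] == c whenever d <= i."""
    m = len(p)
    for d in range(1, m):
        # The already-matched suffix must still agree with the shifted pattern.
        suffix_ok = all(p[j] == p[j - d] for j in range(max(i + 1, d), m))
        # If some pattern character lands under the mismatch, it must equal c.
        char_ok = d > i or p[i - d] == c
        if suffix_ok and char_ok:
            return d
    return m  # moving the whole pattern past the mismatch always works


print(bm_skip("ABCD", 3, "E"), bm_skip("ABCD", 3, "B"))  # 4 2
```

The `d > i` case is what keeps the check from falling off the left edge of the pattern.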
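Boyer-Moore • For pattern ABCD (columns: the position in the pattern where the mismatch occurs; rows: the mismatching character in the target; entries: how many spaces to skip, derived from the definition of BMPSkip; "-" marks a match, where the table is not consulted):
                 A   B   C   D
        A        -   4   4   3
        B        4   -   4   2
        C        4   4   -   1
        D        4   4   4   -
        other    4   4   4   4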
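Boyer-Moore • For pattern XYXYZ (columns: the position in the pattern where the mismatch occurs; rows: the mismatching character in the target; entries: how many spaces to skip, derived from the definition of BMPSkip; "-" marks a match, where the table is not consulted):
                 X   Y   X   Y   Z
        X        -   5   -   5   2
        Y        5   -   5   -   1
        Z        5   5   5   5   -
        other    5   5   5   5   5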
Note: • entries in the Boyer-Moore arrays are generally larger than with KMP; thus, the pattern will move faster • Table not consulted on a match (thus, the blank entries)
BMSearch
function BMSearch(string p, t): integer
{Find p in t; return its location or -1 if p is not a substring of t}
    BMSkiparray <- ComputeBMSkipArray(p)
    k <- 0
    while k <= Length(t) - Length(p) do
        i <- Length(p) - 1
        while i >= 0 and p[i] = t[k+i] do
            i <- i - 1
        if i = -1 then return k
        k <- k + BMSkiparray[i, t[k+i]]
    return -1
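A runnable sketch of BMSearch in Python follows. It computes each skip on demand from the definition rather than from a precomputed array; the names `bm_skip` and `bm_search` are assumptions, and the target string is a hypothetical one chosen to agree with the captions of the earlier example (an E at t[3], a B at t[7]).

```python
def bm_skip(p, i, c):
    """Smallest d >= 1 such that p[j] == p[j-d] for max(i+1, d) <= j <= m-1,
    and p[i-d] == c whenever d <= i."""
    m = len(p)
    for d in range(1, m):
        suffix_ok = all(p[j] == p[j - d] for j in range(max(i + 1, d), m))
        char_ok = d > i or p[i - d] == c
        if suffix_ok and char_ok:
            return d
    return m


def bm_search(p, t):
    """Find p in t; return its location, or -1 if p is not a substring of t."""
    k = 0
    while k <= len(t) - len(p):
        # Compare from the right end of the pattern towards the left.
        i = len(p) - 1
        while i >= 0 and p[i] == t[k + i]:
            i -= 1
        if i == -1:
            return k
        k += bm_skip(p, i, t[k + i])
    return -1


# Moves 4 boxes on the E, then 2 boxes on the B, then matches at index 6.
print(bm_search("ABCD", "XXXEXXABCDX"))  # 6
```

In this run only t[3], t[7] and t[6..9] are ever examined; the other target characters are skipped entirely.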
The Karp-Rabin Algorithm Idea • Karp & Rabin found an algorithm which is: • almost as fast as Boyer-Moore • simple enough to understand easily • can be adapted for 2-dimensional searches for patterns in pictures • Go back to the brute force idea, but now use a single number to represent the word you are searching for, and a single number for the current portion of the document you are comparing against.
The Karp-Rabin Algorithm • Suppose we are searching for 4-letter words. Then the whole (English) word fits in one (computer) word w of 4 bytes. If the current 4 bytes of the document are also in one word d, a single comparison can match the two in one step. To move along the document, shift d by one character and add in the next character. • For longer words, use hashing. The characters of the word and of the current portion of the document are combined into single hash numbers wh and dh. To move along, dh is updated in constant time: remove the contribution of the character leaving the window and add in the code for the next character. • Because different strings can share a hash number, a match between wh and dh is confirmed by comparing the characters.
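The hashing version can be sketched as below, using a polynomial hash taken modulo a fixed number. The particular hash, the `base` and `modulus` values, and the function name `karp_rabin_search` are assumptions for illustration; the slides only name the hash numbers wh and dh.

```python
def karp_rabin_search(word, doc, base=256, modulus=1_000_003):
    """Return the index of the first occurrence of word in doc, or -1."""
    m, n = len(word), len(doc)
    if m == 0:
        return 0
    if m > n:
        return -1
    # Weight of the leftmost character in the window: base**(m-1) mod modulus.
    high = pow(base, m - 1, modulus)
    wh = dh = 0
    for j in range(m):
        wh = (wh * base + ord(word[j])) % modulus
        dh = (dh * base + ord(doc[j])) % modulus
    for i in range(n - m + 1):
        # Equal hashes are only a hint; confirm with a real comparison.
        if wh == dh and doc[i:i + m] == word:
            return i
        if i < n - m:
            # Drop doc[i] from the window, shift, and add in doc[i + m].
            dh = ((dh - ord(doc[i]) * high) * base + ord(doc[i + m])) % modulus
    return -1


print(karp_rabin_search("other", "Do the first then do the other one"))
```

Each step costs a constant amount of arithmetic regardless of the word length, which is what lets the brute force structure stay competitive.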