Module 5: String Matching Algorithms. Week 9.
String Matching
Given a text array T[1 .. n] and a pattern array P[1 .. m] of characters drawn from an alphabet Σ, find all shifts s such that T[s+1 .. s+m] = P[1 .. m], i.e., P occurs with shift s in T.
Example: Σ = {a, b, o, r, t, u, w, y} plus the space character, T = row row row your boat
If P = yo then s = 12
If P = ro then s = 0, 4 and 8
Real-world applications: text editing, pattern recognition
Naive Brute Force String Matching Algorithm
Naive(T, P)
    n = length(T)
    m = length(P)
    for s = 0 to n-m
        if P[1..m] = T[s+1..s+m]
            then print "Pattern occurs with shift " s
The brute-force algorithm checks every candidate position in the text, from 1 to n-m+1 (that is, shifts s = 0 to n-m). We start from the first position of the text and, after each attempt, shift the pattern by exactly 1 position.
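The pseudocode above can be sketched in Python as follows. This is a minimal illustration, not part of the original slides: the function name `naive_match` is hypothetical, and shifts are returned as a list rather than printed. Python strings are 0-based, so the slice `text[s:s + m]` corresponds to T[s+1 .. s+m].

```python
def naive_match(text, pattern):
    """Return every shift s at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    shifts = []
    for s in range(n - m + 1):          # s = 0 .. n-m, as in the pseudocode
        if text[s:s + m] == pattern:    # T[s+1 .. s+m] in 1-based notation
            shifts.append(s)
    return shifts


print(naive_match("row row row your boat", "ro"))  # [0, 4, 8]
print(naive_match("row row row your boat", "yo"))  # [12]
```

The two calls reproduce the shifts given in the String Matching example.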
Naive Brute Force String Matching Algorithm Contd.
Time complexity: O((n - m + 1) m) in the worst case.
Observation: information gained about the text for one value of s is totally ignored when considering other values of s.
Example: T = aaab…, P = aaab, and s = 0 is a valid shift. That alone implies that s = 1, s = 2 and s = 3 are invalid, since T[4] = b.
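To make the (n - m + 1) m bound concrete, the sketch below counts character comparisons for an input that forces the worst case: a text of all a's and a pattern that mismatches only on its last character. The helper `count_comparisons` and the sizes n = 20, m = 4 are assumptions chosen for illustration.

```python
def count_comparisons(text, pattern):
    """Count character comparisons made by the naive matcher,
    which stops comparing a window at its first mismatch."""
    n, m = len(text), len(pattern)
    comparisons = 0
    for s in range(n - m + 1):
        for j in range(m):
            comparisons += 1
            if text[s + j] != pattern[j]:
                break
    return comparisons


n, m = 20, 4
worst = count_comparisons("a" * n, "a" * (m - 1) + "b")
print(worst, (n - m + 1) * m)  # 68 68
```

Every one of the n - m + 1 windows needs all m comparisons before the mismatch is found, so the count meets the bound exactly.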
Rabin-Karp Algorithm
Uses a hashing function.
Example: Σ = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}, P = 12 (decimal value of the pattern), q = 3 (kept small here for illustration; in practice q is a large prime), n = 9
T = 5 5 3 1 2 2 7 3 1, i.e., T[1] = 5, T[2] = 5, T[3] = 3, …, T[9] = 1
P mod q = 0
t_s = T[s+1 .. s+m] mod q = 1, 2, 1, 0, 1, 0, 1, 1 for s = 0, …, 7
The hash values agree with P mod q at s = 3 (window 12, a valid match) and at s = 5 (window 27, a spurious hit).
Rabin-Karp pseudocode:
• Compute (P mod q).
• Compute (T[s+1 .. s+m] mod q) for 0 <= s <= n-m.
• Test against P only those windows of T that have the same (mod q) value. If the window does not match P, it is a spurious hit.
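The example values can be checked with a short Python sketch. It recomputes each window's value from scratch, so it illustrates only the hashing and filtering steps and not the constant-time update. The helper name `value` is hypothetical.

```python
T = [5, 5, 3, 1, 2, 2, 7, 3, 1]
P = [1, 2]
q = 3


def value(digits):
    """Decimal value of a list of digits, e.g. [1, 2] -> 12."""
    v = 0
    for digit in digits:
        v = 10 * v + digit
    return v


m = len(P)
p_hash = value(P) % q
t_hashes = [value(T[s:s + m]) % q for s in range(len(T) - m + 1)]

# A "hit" is a shift whose hash equals the pattern's hash.
hits = [s for s, t in enumerate(t_hashes) if t == p_hash]
valid = [s for s in hits if T[s:s + m] == P]
spurious = [s for s in hits if T[s:s + m] != P]

print(p_hash)           # 0
print(t_hashes)         # [1, 2, 1, 0, 1, 0, 1, 1]
print(valid, spurious)  # [3] [5]
```

Only two of the eight windows need a character-by-character check, and one of those two is a spurious hit.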
Rabin-Karp Algorithm Contd.
Uses a hashing function.
Preprocessing phase: O(m + n). Determine the hashed values for the pattern and the text.
Worst-case searching time: O((n - m + 1) m).
Expected searching time: O(m + n).
Choose a prime number q such that 10q (in general, |Σ| · q) fits within a computer word, to speed up computation.
Rabin-Karp Algorithm Contd.
Computing the decimal value of the next window (T[s+2] … T[s+m+1]) from the current one:
t_{s+1} = 10 (t_s - 10^{m-1} T[s+1]) + T[s+m+1]    // constant time
Example: m = 5, t_s = 31415, T[s+1] = 3, T[s+6] = 2
t_{s+1} = 10 (31415 - 10000 · 3) + 2 = 14152
Use the following recurrence to compute t_{s+1} when p and t_s may not fit into 1 word. Assume 10q fits within 1 computer word, which allows all the necessary computations to be performed in 1 word.
// Computing all the t_s values modulo q
t_{s+1} = (10 (t_s - T[s+1] h) + T[s+m+1]) mod q
h = 10^{m-1} (mod q)
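Both forms of the update can be tried out in Python. The function names `roll` and `roll_mod` are hypothetical, and the modulus q = 13 is an assumption chosen only for the demonstration. Python's `%` returns a non-negative result for a positive modulus, so the subtraction inside the modular update needs no extra correction; other languages may need one.

```python
def roll(t_s, old_digit, new_digit, m):
    """Exact update: drop the high-order digit, shift left, append the new digit."""
    return 10 * (t_s - 10 ** (m - 1) * old_digit) + new_digit


def roll_mod(t_s, old_digit, new_digit, m, q):
    """Same update with every value reduced modulo q."""
    h = pow(10, m - 1, q)  # h = 10^(m-1) mod q
    return (10 * (t_s - old_digit * h) + new_digit) % q


print(roll(31415, 3, 2, 5))  # 14152

q = 13  # assumed modulus for this demonstration
print(roll_mod(31415 % q, 3, 2, 5, q), 14152 % q)  # 8 8
```

The modular update gives the same residue as reducing the exact value 14152 modulo q, which is what the matcher relies on.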
Rabin-Karp Algorithm Contd.
Each symbol in alphabet Σ can be represented by an ordinal value in {0, 1, 2, ..., d-1}, where |Σ| = d ("radix-d digits").
RABIN-KARP-MATCHER(T, P, d, q)
    n ← length[T]
    m ← length[P]
    h ← d^{m-1} mod q
    p ← 0
    t_0 ← 0
    for i ← 1 to m                          ► Preprocessing
        do p ← (d*p + P[i]) mod q
           t_0 ← (d*t_0 + T[i]) mod q
    for s ← 0 to n - m                      ► Matching
        do if p = t_s
              then if P[1..m] = T[s+1 .. s+m]
                      then print "Pattern occurs with shift" s
           if s < n - m
              then t_{s+1} ← (d * (t_s - T[s+1] * h) + T[s+m+1]) mod q
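A possible Python rendering of RABIN-KARP-MATCHER is sketched below. Several details are assumptions rather than part of the slides: characters are mapped to numbers with `ord`, the defaults d = 256 and q = 101 are arbitrary, an empty pattern returns no shifts, and shifts are collected in a list instead of being printed.

```python
def rabin_karp(text, pattern, d=256, q=101):
    """Return all shifts at which pattern occurs in text (Rabin-Karp sketch)."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    h = pow(d, m - 1, q)          # d^(m-1) mod q
    p = t = 0
    for i in range(m):            # preprocessing: hash pattern and first window
        p = (d * p + ord(pattern[i])) % q
        t = (d * t + ord(text[i])) % q
    shifts = []
    for s in range(n - m + 1):    # matching
        # Compare characters only when the hashes agree (rules out spurious hits).
        if p == t and text[s:s + m] == pattern:
            shifts.append(s)
        if s < n - m:             # roll the hash to the next window
            t = (d * (t - ord(text[s]) * h) + ord(text[s + m])) % q
    return shifts


print(rabin_karp("row row row your boat", "ro"))  # [0, 4, 8]
print(rabin_karp("553122731", "12", d=10, q=3))   # [3]
```

The explicit string comparison after a hash match keeps the result correct for any choice of q; a poor q only increases the number of spurious hits.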
Boyer-Moore String Matching Algorithm
Is most efficient (on average) when P is long and the alphabet Σ is large.
Matches the pattern from right to left.
Uses two heuristics that operate independently:
• The bad character heuristic.
• The good suffix heuristic.
When a mismatch occurs, each heuristic proposes an amount by which s can be safely increased without missing a valid shift. The larger of the two proposed shifts is chosen.
Boyer-Moore String Matching Algorithm Contd.
The bad character heuristic: uses information about where the bad text character T[s+j] occurs in the pattern (if it occurs at all) to propose a new shift.
The pattern is matched from right to left. When a mismatch occurs, the algorithm looks for the rightmost occurrence of the bad character in P. If the character occurs in P, the proposed shift is the number of places the pattern must move so that this occurrence in P lines up under the bad character in T. Otherwise, the pattern is shifted completely past the bad character.
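The bad character heuristic on its own already gives a working matcher, sketched below in Python. This is a simplified illustration, not the full Boyer-Moore algorithm: the good suffix heuristic is left out, and the names `bad_character_table` and `search_bad_character` are hypothetical. When the rightmost occurrence of the bad character lies to the right of the mismatch, the proposed shift would be zero or negative, so the sketch falls back to a shift of 1.

```python
def bad_character_table(pattern):
    """Map each character to the 0-based index of its rightmost occurrence."""
    return {ch: i for i, ch in enumerate(pattern)}


def search_bad_character(text, pattern):
    """Right-to-left matching using only the bad character heuristic."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    last = bad_character_table(pattern)
    shifts = []
    s = 0
    while s <= n - m:
        j = m - 1
        while j >= 0 and pattern[j] == text[s + j]:  # compare right to left
            j -= 1
        if j < 0:
            shifts.append(s)   # full match at shift s
            s += 1
        else:
            # Align the rightmost occurrence of the bad character under it,
            # or move completely past it if it is absent (index -1).
            s += max(1, j - last.get(text[s + j], -1))
    return shifts


print(search_bad_character("row row row your boat", "ro"))  # [0, 4, 8]
```

For text characters that do not appear in the pattern, such as `w` here, a mismatch at the last pattern position moves the pattern forward by its full length.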
Boyer-Moore String Matching Algorithm Contd.
The good suffix heuristic: uses information about whether the matched suffix occurs elsewhere in the pattern to propose a new shift.
The pattern is matched from right to left. The characters of T that matched before the mismatch form the good suffix. The algorithm searches for another occurrence of the good suffix further to the left in P. The shift value is the number of places the pattern must move so that this occurrence in P lines up under the good suffix in T.
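The shift proposed by the good suffix heuristic can be computed by brute force, as in the Python sketch below. It is meant to show what the shift means, not how production implementations precompute it (they build a table in O(m) time). The function name `good_suffix_shift` is hypothetical, indices are 0-based, and the sketch uses the simpler form of the rule, which does not require the character before the re-occurrence to differ from the mismatched one.

```python
def good_suffix_shift(pattern, j):
    """Smallest shift k >= 1 that keeps the pattern consistent with the
    already matched suffix pattern[j+1:], where j is the 0-based index of
    the mismatch (j = -1 means the whole pattern matched)."""
    m = len(pattern)
    for k in range(1, m + 1):
        # After shifting by k, text position i is covered by pattern[i - k],
        # or by nothing at all if i - k < 0.
        if all(i - k < 0 or pattern[i - k] == pattern[i]
               for i in range(j + 1, m)):
            return k
    return m  # not reached: k = m always passes the check above


print(good_suffix_shift("abcab", 2))   # 3: suffix "ab" reappears at the start
print(good_suffix_shift("abcab", 4))   # 1: nothing matched yet
print(good_suffix_shift("abcab", -1))  # 3: shift after a full match
```

In a full Boyer-Moore matcher this value would be compared with the bad character proposal and the larger of the two used.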
Boyer-Moore String Matching Algorithm Contd.
[Figure: bad character heuristic. Proposed shift s = 1 (shift by 1).]
Boyer-Moore String Matching Algorithm Contd.
[Figure: good suffix heuristic. Proposed shift s = 2 (shift by 2).]