Exact String Matching Algorithms
Presented by Dr. Shazzad Hosain, Asst. Prof., EECS, NSU
Exact Matching: What’s the Problem
Position:  1  2  3  4  5  6  7  8  9 10 11 12
T =        b  b  a  b  a  x  a  b  a  b  a  y
P = aba
• P occurs in T starting at locations 3, 7, and 9
• Occurrences of P may overlap, as found at 7 and 9
The Naive Method
• The problem is to find whether a pattern P[1..m] occurs within a text T[1..n]
• Let P = abxyabxz and T = xabxyabxyabxz
• Here m = 8 and n = 13
The Naive Method
• If P = aaa and T = aaaaaaaaaa, then m = 3 and n = 10
• In the worst case, exactly m(n-m+1) comparisons are made
• In this case that is 3 × 8 = 24 comparisons, which is on the order of Θ(mn)
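The count of 24 can be checked by instrumenting the naive search. The sketch below is illustrative only: `count_naive_comparisons` is a hypothetical helper that does not appear on the slides, it uses 0-based C strings rather than 1-based positions, and it assumes m = |P| and n = |T|.

```c
#include <string.h>

/* Hypothetical helper (not from the slides): counts the character
 * comparisons made by the naive method. Uses 0-based C strings,
 * with m = |pat| and n = |text|. */
long count_naive_comparisons(const char *text, const char *pat)
{
    size_t n = strlen(text), m = strlen(pat);
    long comparisons = 0;

    if (m == 0 || m > n)
        return 0;
    for (size_t i = 0; i + m <= n; i++) {      /* each alignment of pat */
        for (size_t j = 0; j < m; j++) {
            comparisons++;                     /* compare text[i+j] with pat[j] */
            if (text[i + j] != pat[j])
                break;                         /* a mismatch ends this alignment */
        }
    }
    return comparisons;
}
```

With P = aaa and T = aaaaaaaaaa, every one of the 8 alignments matches all 3 characters, so the helper returns 3 × 8 = 24.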
The Naive Algorithm
    /* text[1..n] and pat[1..m] are treated as 1-indexed, as on the slides */
    void naive_search(const char text[], const char pat[], int n, int m)
    {
        int i, j, k, lim;
        lim = n - m + 1;
        for (i = 1; i <= lim; i++) {      /* try each alignment of pat */
            k = i;
            for (j = 1; j <= m && text[k] == pat[j]; j++)
                k++;
            if (j > m)                    /* all m characters matched */
                Report_match_at_position(i);
        }
    }
• The worst-case bound can be reduced to O(m+n)
• For applications with m = 1000 and n = 10,000,000 the improvement is significant
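For readers who want to run the naive method directly, here is a minimal 0-based sketch. The name `naive_find_all` and its output-array interface are assumptions made for illustration; they are not part of the slides. Positions are reported 1-based to agree with the slide examples.

```c
#include <stddef.h>
#include <string.h>

/* Illustrative sketch: naive_find_all is a hypothetical name. It writes the
 * 1-based start positions of pat in text to out[] (at most max_out of them)
 * and returns the total number of occurrences found. */
size_t naive_find_all(const char *text, const char *pat,
                      size_t out[], size_t max_out)
{
    size_t n = strlen(text), m = strlen(pat), found = 0;

    if (m == 0 || m > n)
        return 0;
    for (size_t i = 0; i + m <= n; i++) {
        size_t j = 0;
        while (j < m && text[i + j] == pat[j])
            j++;
        if (j == m) {                 /* whole pattern matched at offset i */
            if (found < max_out)
                out[found] = i + 1;   /* report 1-based, as on the slides */
            found++;
        }
    }
    return found;
}
```

Called with T = bbabaxababay and P = aba, it reports the three overlapping occurrences at 3, 7, and 9.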
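The Smart Algorithm
• Suppose you know that the first character of P (namely a) does not occur again in P until position 5 of P
• Then, instead of shifting P one position at a time, the algorithm skips over three comparisons
• Reasoning of this sort is the key to shifting by more than one character
[Figure: P, with positions 1 to 8 marked, aligned against T at successive shifts]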
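The fact used here is a property of P alone, so it can be found before any text is examined. A minimal sketch, assuming the pattern P = abxyabxz from the earlier slide; `next_occurrence_of_first_char` is a hypothetical helper name, not something the slides define.

```c
#include <string.h>

/* Hypothetical helper: returns the 1-based position of the next occurrence
 * of pat's first character after position 1, or 0 if there is none. */
size_t next_occurrence_of_first_char(const char *pat)
{
    size_t m = strlen(pat);

    for (size_t j = 1; j < m; j++)
        if (pat[j] == pat[0])
            return j + 1;          /* convert 0-based index to 1-based position */
    return 0;
}
```

For P = abxyabxz the result is 5, which is the position the slide refers to.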
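The Smarter Algorithm
• Skips over three comparisons, as the smart algorithm does
• Skips another three by choosing where the next comparison starts
[Figure: alignments of P against T, labelled "Instead of", "Starts at", "Skips over three comparisons" and "Skips another three"]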
The Smart Algorithms
• Knuth-Morris-Pratt (KMP) Algorithm
• Boyer-Moore Algorithm
• Reduce the run time to O(n+m)
• The additional knowledge requires preprocessing of the strings
• Usually P is much shorter than T, so P is preprocessed
The Preprocessing Approach
• Usually P is preprocessed instead of T
• Sometimes T is preprocessed, e.g. into a suffix tree
• The preprocessing methods are similar in spirit, but often quite different in detail and conceptual difficulty
• Fundamental preprocessing of P is independent of any particular algorithm
• Each algorithm uses this information
Basic String Definitions/Notations
• Let S be a string
• S[i..j] is the substring of S starting at position i and ending at position j; S[i..j] is empty if i > j
• |S| is the length of the string
• S[1..i] is the prefix of S that ends at position i
• S[i..|S|] is the suffix of S that begins at position i
Example:
Position:  1  2  3  4  5  6  7  8  9 10 11 12
S =        b  b  a  b  a  x  a  b  a  b  a  y
• Here, |S| = 12
• Substring: S[3..7] = abaxa
• Prefix: S[1..4] = bbab
• Suffix: S[9..12] = abay
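The 1-based, inclusive S[i..j] notation differs from C's 0-based indexing, so a small conversion helper can make the examples concrete. This is only a sketch; `substring` is a hypothetical helper, and returning NULL for out-of-range positions is a design choice made here, not something the slides specify.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a newly allocated copy of S[i..j] using the
 * slides' 1-based, inclusive positions. The result is the empty string when
 * i > j. Returns NULL for positions outside 1..|S| or on allocation
 * failure. The caller frees the result. */
char *substring(const char *s, size_t i, size_t j)
{
    size_t len = strlen(s);
    size_t out_len;
    char *out;

    if (i == 0 || (i <= j && j > len))
        return NULL;
    out_len = (i > j) ? 0 : j - i + 1;
    out = malloc(out_len + 1);
    if (out == NULL)
        return NULL;
    if (out_len > 0)
        memcpy(out, s + (i - 1), out_len);   /* position i is index i - 1 */
    out[out_len] = '\0';
    return out;
}
```

With S = bbabaxababay, `substring(S, 3, 7)` yields abaxa, `substring(S, 1, 4)` yields the prefix bbab, and `substring(S, 9, 12)` yields the suffix abay.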
Basic String Definitions/Notations
• A proper prefix, suffix or substring of S is, respectively, a prefix, suffix or substring that is neither the entire string S nor the empty string
• For any string S, S(i) denotes the ith character of S
Preprocessing
• Goal: to gather the information needed for speeding up the algorithm
• Definitions:
• Zi: for i > 1, the length of the longest substring of S that starts at i and matches a prefix of S
• Z-box: for any position i > 1 where Zi > 0, the Z-box at i starts at i and ends at i+Zi-1
• ri: for every i > 1, ri is the right-most endpoint of the Z-boxes that begin at or before i
• li: for every i > 1, li is the left endpoint of the Z-box that ends at ri
Preprocessing
• Zi(S) = the length of the longest prefix of S[i..|S|] that matches a prefix of S, where i > 1
• We will use Zi in place of Zi(S)
Example:
Position:  1  2  3  4  5  6  7  8  9 10 11
S =        a  a  b  c  a  a  b  x  a  a  z
• Z5(S) = 3 (aabc…aabx…)
• Z6(S) = 1 (aa…ab…)
• Z7(S) = Z8(S) = 0
• Z9(S) = 2 (aab…aaz)
• A Z-box exists for each i > 1 where Zi is greater than zero
[Figure 1.2: From Gusfield]
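The definition can be turned into code directly, before any speed-up is considered. The sketch below is deliberately the slow, quadratic version and is not the linear-time Z algorithm; `z_by_definition` is a hypothetical name, and the choice to store results at 1-based indices (leaving `z[0]` unused) is an assumption made to match the slide notation.

```c
#include <stddef.h>

/* Illustrative only: computes Z_i straight from the definition, in
 * quadratic worst-case time. s holds n characters (0-based in C); z must
 * have room for n + 1 entries and is filled at 1-based indices, with
 * z[1] set to 0 because Z_i is defined only for i > 1. z[0] is unused. */
void z_by_definition(const char *s, size_t n, size_t z[])
{
    if (n == 0)
        return;
    z[1] = 0;
    for (size_t i = 2; i <= n; i++) {
        size_t len = 0;
        /* extend while S(i + len) equals S(1 + len) */
        while (i + len <= n && s[i - 1 + len] == s[len])
            len++;
        z[i] = len;
    }
}
```

Run on S = aabcaabxaaz, it reproduces the values on the slide: Z5 = 3, Z6 = 1, Z7 = Z8 = 0 and Z9 = 2.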
The li and ri of Z-Box
• ri = the right-most endpoint of the Z-boxes that begin at or before position i
• li = the left end of the Z-box that ends at ri
[Figure: Z-boxes drawn along S, with positions 40, 50, 55, 62, 70, 78, 82, 85, 89 and 95 marked]
• r78 = 95, l78 = 78
• r82 = 95, l82 = 78
• r52 = 50, l52 = 40
• r75 = 85, l75 = 70
Preprocessing
i:    1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17
S:    a  a  b  a  a  b  c  a  x  a  a  b  a  a  b  c  y
Z:    0  1  0  3  1  0  0  1  0  7  1  0  3  1  0  0  0
ri:   0  2  2  6  6  6  6  8  8 16 16 16 16 16 16 16 16
li:   0  2  2  4  4  4  4  8  8 10 10 10 10 10 10 10 10
[Figure: the Z-boxes drawn over S]
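Given the Z row, the ri and li rows follow mechanically from their definitions. A minimal sketch, assuming 1-based arrays with index 0 unused; `z_box_bounds` is a hypothetical helper name, and when several Z-boxes end at the same right-most point it keeps the earliest one, which is one valid choice.

```c
#include <stddef.h>

/* Hypothetical helper: given z[2..n] (1-based), fills r[i] and l[i] for
 * i = 1..n. r[i] is the right-most endpoint of the Z-boxes beginning at or
 * before i, and l[i] is the left end of a Z-box ending at r[i]. Both are 0
 * while no Z-box has been seen. All arrays need n + 1 entries. */
void z_box_bounds(const size_t z[], size_t n, size_t r[], size_t l[])
{
    size_t cur_r = 0, cur_l = 0;

    if (n >= 1) {
        r[1] = 0;
        l[1] = 0;
    }
    for (size_t i = 2; i <= n; i++) {
        /* the Z-box at i spans i .. i + z[i] - 1 */
        if (z[i] > 0 && i + z[i] - 1 > cur_r) {
            cur_r = i + z[i] - 1;   /* this box reaches further right */
            cur_l = i;
        }
        r[i] = cur_r;
        l[i] = cur_l;
    }
}
```

Fed the Z values for S = aabaabcaxaabaabcy, it gives, for example, r7 = 6 with l7 = 4 and r17 = 16 with l17 = 10, in agreement with the table.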
Z-Algorithm
• Goal: to calculate Zi for an input string S of length n in linear time
• Starting from k = 2, calculate Z2, r2 and l2
• For k = 3, …, n: in iteration k, calculate Zk, rk and lk based on Zj, rj and lj for j = 2, …, k-1
• In iteration k, the algorithm needs only rk-1 and lk-1, so there is no need to keep all ri and li
• We use r and l to denote rk-1 and lk-1
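Z-Algorithm
In iteration k:
(I) if k ≤ r
• k lies inside the Z-box α = S[l..r], which matches the prefix α′ = S[1..r′], where r′ = r-l+1
• β = S[k..r] matches β′ = S[k′..r′], where k′ = k-l+1
• So α = α′ and β = β′
[Figure: S with α′ and β′ marked at positions k′ and r′, and α and β marked at positions l, k and r]
Example string (positions 1 to 17): a a b a a b c a x a a b a a b c y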
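Z-Algorithm
A) If |γ′| < |β′|, that is, Zk′ < r-k+1, then Zk = Zk′
• γ′ is the Z-box at k′, γ″ is the prefix of S that it matches, and γ is the copy of γ′ that starts at k, so γ = γ′ = γ″
• x is the character after γ″ and y is the character after γ′ (and after γ), so x ≠ y
[Figure: S with α′, β′, γ″ and γ′ marked near positions k′ and r′, and α, β and γ marked near positions l, k and r]
i:    1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17
S:    a  a  b  a  a  b  c  a  x  a  a  b  a  a  b  c  y
Z:    0  1  0  3  1  0  0  1  0  7  1  0  3
• Example: k = 13, l = 10, r = 16, so k′ = 4; Z4 = 3 < r-k+1 = 4, hence Z13 = 3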
Z-Algorithm
B) If |γ′| > |β′|, that is, Zk′ > r-k+1, then Zk = |β|, i.e., r-k+1
• β = β′ = β″
• γ′ = γ″
• x ≠ y (because α is a Z-box)
[Figure: S with α′, β″, β′, γ″ and γ′ marked near positions k′ and r′, and α and β marked near positions l, k and r]
i:    1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
S:    a  a  b  a  a  b  c  a  x  a  a  b  a  a  c  d
Z:    0  1  0  3  1  0  0  1  0  5  1  0  2  1  0  0
• Example: k = 13, l = 10, r = 14, so k′ = 4; Z4 = 3 > r-k+1 = 2, hence Z13 = 2
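Z-Algorithm
C) If |γ′| = |β′|, that is, Zk′ = r-k+1, then Zk ≥ |β|, i.e., Zk ≥ r-k+1
• β = β′ = β″ and γ = γ′ = γ″
• x ≠ y (because α is a Z-box)
• z ≠ x (because γ′ is a Z-box)
• Whether z equals y is not yet known
• Compare S[r+1..] with S[|β|+1..] until a mismatch occurs; update Zk, r, and l
[Figure: S with α′, β″, β′, γ″ and γ′ marked near positions k′ and r′, and α, β and γ marked near positions l, k and r]
i:    1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
S:    a  a  b  a  a  e  c  a  x  a  a  b  a  a  b  d
Z:    0  1  0  2  1  0  0  1  0  5  1  0  3  1  0  0
• Example: k = 13, l = 10, r = 14, so k′ = 4; Z4 = 2 = r-k+1, so comparison continues with S(15) against S(3), giving Z13 = 3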
Z-Algorithm
(II) if k > r
• Compare the characters starting at k with those starting at 1
• Update r and l if necessary
[Figure: position k lies to the right of the Z-box from l to r]
Z-Algorithm
Input: Pattern P (a string of length n)
Output: Zi
Z Algorithm
    Calculate Z2, r2 and l2 explicitly by comparisons
    r = r2 and l = l2
    for k = 3; k <= n; k++
        if k <= r
            if Z[k-l+1] < r-k+1, then Z[k] = Z[k-l+1]
            else if Z[k-l+1] > r-k+1, then Z[k] = r-k+1
            else compare the characters starting at r+1 with those starting at |β|+1; update Z[k], r and l
        else
            compare the characters starting at k with those starting at 1; set Z[k]; update r and l if necessary
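The pseudocode above can be sketched in C as follows. This is one possible rendering under stated assumptions, not the presenter's code: `z_algorithm` is a hypothetical name, the input is a 0-based C buffer, and results are stored at 1-based indices (with `z[0]` unused) so that they line up with the slide tables.

```c
#include <stddef.h>

/* A sketch of the Z algorithm following the case analysis on the slides.
 * s holds n characters (0-based in C); z[] needs n + 1 entries and is
 * filled at the slides' 1-based positions, with z[1] = 0. */
void z_algorithm(const char *s, size_t n, size_t z[])
{
    size_t l = 0, r = 0;            /* current Z-box S[l..r]; r = 0 means none yet */

    if (n == 0)
        return;
    z[1] = 0;
    for (size_t k = 2; k <= n; k++) {
        if (k > r) {
            /* Case II: compare S[k..] with S[1..] explicitly */
            size_t len = 0;
            while (k + len <= n && s[k - 1 + len] == s[len])
                len++;
            z[k] = len;
            if (len > 0) {
                l = k;
                r = k + len - 1;
            }
        } else {
            size_t kp = k - l + 1;      /* k' */
            size_t beta = r - k + 1;    /* |beta| */
            if (z[kp] < beta) {
                z[k] = z[kp];           /* Case A: no comparisons needed */
            } else if (z[kp] > beta) {
                z[k] = beta;            /* Case B: no comparisons needed */
            } else {
                /* Case C: compare S[r+1..] with S[|beta|+1..] */
                size_t q = r + 1;       /* 1-based position in S */
                while (q <= n && s[q - 1] == s[q - k])
                    q++;
                z[k] = q - k;           /* q is the first mismatch position */
                r = q - 1;
                l = k;
            }
        }
    }
}
```

On S = aabaabcaxaabaabcy this gives Z4 = 3, Z10 = 7 and Z13 = 3, the last one through Case A without any character comparisons.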
Preprocessing
i:    1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17
S:    a  a  b  a  a  b  c  a  x  a  a  b  a  a  b  c  y
Z:    0  1  0  3  1  0  0  1  0  7  1  0  3  1  0  0  0
r:    0  2  2  6  6  6  6  8  8 16 16 16 16 16 16 16 16
l:    0  2  2  4  4  4  4  8  8 10 10 10 10 10 10 10 10
Z-Algorithm: Time complexity
• #mismatches <= number of iterations <= n
• #matches: let q be the number of matches in iteration k; then r increases by at least q
• r <= n, thus total #matches <= n
• T = O(#matches + #mismatches + #iterations) = O(n)
i:    1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17
S:    a  a  b  a  a  b  c  a  x  a  a  b  a  a  b  c  y
Z:    0  1  0  3  1  0  0  1  0  7  1  0  3  1  0  0  0
r:    0  2  2  6  6  6  6  8  8 16 16 16 16 16 16 16 16
l:    0  2  2  4  4  4  4  8  8 10 10 10 10 10 10 10 10
#m:   0  1  0  3  0  0  0  1  0  7  0  0  0  0  0  0  0
#mis: 0  1  1  1  0  0  1  1  1  1  0  0  0  0  0  0  1
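The counting argument can be written as a short chain of inequalities. This is a sketch of the slide's reasoning rather than a full proof; the symbols q_k (matches found in iteration k) and r_k (the value of r after iteration k, with r_1 = 0) are introduced here for illustration.

```latex
\begin{align*}
\#\text{mismatches} &\le n - 1 \le n
  && \text{(each iteration ends at its first mismatch)} \\
\#\text{matches} &= \sum_{k=2}^{n} q_k
  \le \sum_{k=2}^{n} \left( r_k - r_{k-1} \right)
  = r_n - r_1 \le n
  && \text{(} q_k \text{ matches raise } r \text{ by at least } q_k \text{)} \\
T &= O(\#\text{iterations} + \#\text{matches} + \#\text{mismatches}) = O(n)
\end{align*}
```

In the table, the #m row sums to 12 matches and the #mis row to 8 mismatches, both within the bound n = 17.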
Simplest Linear Time Exact Matching Algorithm
Input: Pattern P, Text T
Output: Occurrences of P in T
Algorithm Simplest
    S = P$T, where $ is a character that does not appear in P or T
    for i = 2; i <= |S|; i++
        calculate Zi
        if Zi = |P|, then report that there is an occurrence of P in T starting at position i-|P|-1 of T
Running time = O(|P|+|T|+1) = O(n+m)
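Algorithm Simplest can be sketched in C as below. The names `z_values` and `z_match`, the output-array interface, and passing the separator as a parameter are all assumptions made for this illustration. The Z routine here is a compact variant that folds cases A, B and C into a single extension loop, so it is not a line-by-line copy of the slide pseudocode. The caller must supply a separator that occurs in neither P nor T.

```c
#include <stdlib.h>
#include <string.h>

/* Compact Z computation at 1-based positions (z[1] = 0). Cases A, B and C
 * are folded together: start from min(Z[k'], |beta|) and extend. */
static void z_values(const char *s, size_t n, size_t z[])
{
    size_t l = 0, r = 0;

    z[1] = 0;
    for (size_t k = 2; k <= n; k++) {
        size_t len = 0;
        if (k <= r) {
            size_t beta = r - k + 1;
            len = z[k - l + 1] < beta ? z[k - l + 1] : beta;
        }
        while (k + len <= n && s[k - 1 + len] == s[len])
            len++;
        z[k] = len;
        if (len > 0 && k + len - 1 > r) {
            l = k;
            r = k + len - 1;
        }
    }
}

/* Hypothetical function: builds S = P sep T, computes its Z values and
 * writes the 1-based start positions of pat in text to out[] (at most
 * max_out). Assumes sep occurs in neither pat nor text. Returns the number
 * of occurrences, or 0 if pat is empty or memory cannot be allocated. */
size_t z_match(const char *pat, const char *text, char sep,
               size_t out[], size_t max_out)
{
    size_t m = strlen(pat), n = strlen(text);
    size_t len = m + 1 + n, found = 0;
    char *s;
    size_t *z;

    if (m == 0)
        return 0;
    s = malloc(len + 1);
    z = malloc((len + 1) * sizeof *z);
    if (s == NULL || z == NULL) {
        free(s);
        free(z);
        return 0;
    }
    memcpy(s, pat, m);
    s[m] = sep;
    memcpy(s + m + 1, text, n);
    s[len] = '\0';

    z_values(s, len, z);
    for (size_t i = m + 2; i <= len; i++) {   /* positions inside T */
        if (z[i] == m) {
            if (found < max_out)
                out[found] = i - m - 1;       /* position in T, as on the slide */
            found++;
        }
    }
    free(s);
    free(z);
    return found;
}
```

With P = aba, T = bbabaxababay and `'$'` as the separator, it reports positions 3, 7 and 9. For simplicity this sketch stores Z values for all of S rather than only for the pattern part.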
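Simplest Linear Time Exact Matching Algorithm
• Takes only O(|P|) extra space, beyond the space for P and T
• Alphabet-independent linear time
[Figure: Z-box diagram for S = P$T, showing α′, α, β′, β, the $ separator and the positions l′, k′, l, k, r′, r]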
Reference • Chapter 1, 2: Exact Matching: Fundamental Preprocessing and First Algorithms