
Exact String Matching Algorithms

Exact String Matching Algorithms. Presented by Dr. Shazzad Hosain, Asst. Prof., EECS, NSU. Exact Matching: What’s the Problem. T = bbabaxababay, P = aba. P occurs in T starting at locations 3, 7, and 9; occurrences of P may overlap, as at 7 and 9. The Naive Method.


Presentation Transcript


  1. Exact String Matching Algorithms • Presented by Dr. Shazzad Hosain, Asst. Prof., EECS, NSU

  2. Exact Matching: What’s the Problem • T = bbabaxababay (positions 1 to 12) and P = aba • P occurs in T starting at locations 3, 7, and 9 • Occurrences of P may overlap, as at 7 and 9

  3. The Naive Method • The problem is to find whether a pattern P[1..m] occurs within a text T[1..n] • Let P = abxyabxz and T = xabxyabxyabxz • Here m = 8 and n = 13

  4. The Naive Method • If P = aaa and T = aaaaaaaaaa, then m = 3 and n = 10 • In the worst case there are exactly m(n-m+1) comparisons • In this case that is 3 × 8 = 24 comparisons, which is in the order of Θ(mn)

  5. The Naive Algorithm
     char text[], pat[]; int n, m;   /* 1-indexed: text[1..n], pat[1..m] */
     {
       int i, j, k, lim;
       lim = n - m + 1;
       for (i = 1; i <= lim; i++) {   /* search: try every alignment i */
         k = i;
         for (j = 1; j <= m && text[k] == pat[j]; j++) k++;
         if (j > m) Report_match_at_position(i);   /* all m characters matched */
       }
     }
     • The worst-case bound can be reduced to O(m+n) • For applications with m = 1000 and n = 10,000,000 the improvement is significant
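The naive method can be sketched as a compilable C function. This is a minimal illustration, not the presenter's code: the name `naive_search`, the output array, and the comparison counter are assumptions added here. It uses C's 0-indexed strings internally and reports 1-indexed positions to match the slides.

```c
#include <stddef.h>
#include <string.h>

/* Naive exact matching (hypothetical helper, not from the slides).
 * Stores up to max_out 1-indexed start positions of pat in text into out[]
 * and returns the total number of occurrences found.
 * If comparisons is non-NULL, it receives the number of character
 * comparisons performed. */
size_t naive_search(const char *text, const char *pat,
                    size_t *out, size_t max_out, size_t *comparisons)
{
    size_t n = strlen(text), m = strlen(pat);
    size_t found = 0, cmp = 0;

    if (m == 0 || m > n) {
        if (comparisons) *comparisons = 0;
        return 0;
    }
    for (size_t i = 0; i + m <= n; i++) {   /* each alignment of P against T */
        size_t j = 0;
        while (j < m) {
            cmp++;                           /* one character comparison */
            if (text[i + j] != pat[j]) break;
            j++;
        }
        if (j == m) {                        /* all m characters matched */
            if (found < max_out) out[found] = i + 1;
            found++;
        }
    }
    if (comparisons) *comparisons = cmp;
    return found;
}
```

Called with T = bbabaxababay and P = aba, the sketch should report positions 3, 7, and 9. The counter makes the Θ(mn) worst case visible on inputs such as P = aaa against a text of ten a's.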

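  6. The Smart Algorithm • If you know that the first character of P (namely a) does not occur again in P until position 5 of P, you can shift P further instead of trying the alignments in between, which skips over three comparisons • Reasoning of this sort is the key to shifting by more than one character • [Figure: successive alignments of P against T, naive shifting versus smart shifting]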

  7. The Smarter Algorithm • [Figure: alignments of P against T. The smart algorithm skips over three comparisons; the smarter algorithm also starts its next comparison later and skips another three]

  8. The Smart Algorithms • Knuth-Morris-Pratt (KMP) Algorithm • Boyer-Moore Algorithm • Reduce the run time to O(n+m) • The additional knowledge requires preprocessing of the strings • Usually P is much shorter than T, so P is preprocessed

  9. The Preprocessing Approach • Usually P is preprocessed instead of T • Sometimes T is preprocessed, e.g. suffix tree • The preprocessing methods are similar in spirit, but often quite different in detail and conceptual difficulty • Fundamental preprocessing of P is independent of any particular algorithm • Each algorithm uses this information

  10. Basic String Definitions/Notations • Let S be a string, for example S = bbabaxababay (positions 1 to 12) • S[i..j] is the substring of S starting at position i and ending at position j; S[i..j] is empty if i > j. Example: S[3..7] = abaxa • |S| is the length of the string. Here, |S| = 12 • S[1..i] is the prefix of S that ends at position i. Example: S[1..4] = bbab • S[i..|S|] is the suffix of S that begins at position i. Example: S[9..12] = abay
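Since the slides index strings from 1 and C indexes from 0, a small helper can make the S[i..j] notation concrete. The function name `substring` and its buffer convention are assumptions for illustration only.

```c
#include <stddef.h>
#include <string.h>

/* Copies S[i..j] (1-indexed, inclusive, as on the slides) into out and
 * NUL-terminates it. The result is empty when i > j.
 * Assumes 1 <= i and j <= |S|, and that out has room for j - i + 2 bytes. */
void substring(const char *s, size_t i, size_t j, char *out)
{
    size_t len = (i > j) ? 0 : j - i + 1;
    memcpy(out, s + (i - 1), len);   /* position i on the slides is s[i - 1] in C */
    out[len] = '\0';
}
```

With S = bbabaxababay, `substring(S, 1, 4, buf)` gives the prefix bbab and `substring(S, 9, 12, buf)` gives the suffix abay.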

  11. Basic String Definitions/Notations • A proper prefix, suffix or substring of S is, respectively, a prefix, suffix or substring that is neither the entire string S nor the empty string • For any string S, S(i) denotes the ith character of S

  12. Preprocessing • Goal: to gather the information needed for speeding up the algorithm • Definitions: • Zi: for i > 1, the length of the longest substring of S that starts at i and matches a prefix of S • Z-box: for any position i > 1 where Zi > 0, the Z-box at i starts at i and ends at i+Zi-1 • ri: for every i > 1, ri is the right-most endpoint of the Z-boxes that begin at or before i • li: for every i > 1, li is the left endpoint of the Z-box that ends at ri

  13. Preprocessing • Zi(S) = the length of the longest prefix of S[i..|S|] that matches a prefix of S, where i > 1 • Example: S = aabcaabxaaz (positions 1 to 11) • Z5(S) = 3 (aabc…aabx…) • Z6(S) = 1 (aa…ab…) • Z7(S) = Z8(S) = 0 • Z9(S) = 2 (aab…aaz) • We will use Zi in place of Zi(S) • A Z-box exists for each i > 1 where Zi is greater than zero (Figure 1.2, from Gusfield)
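The definition of Zi can be checked directly by comparing characters. The sketch below is a deliberately simple, quadratic-time reading of the definition, not the linear-time Z-algorithm; the name `z_by_definition` is an assumption.

```c
#include <stddef.h>
#include <string.h>

/* Z_i computed straight from the definition: the length of the longest
 * prefix of S[i..|S|] that matches a prefix of S.
 * i is 1-indexed with i > 1, as on the slides. */
size_t z_by_definition(const char *s, size_t i)
{
    size_t n = strlen(s);
    size_t len = 0;

    /* s[len] is S[1 + len]; s[i - 1 + len] is S[i + len] */
    while (i - 1 + len < n && s[len] == s[i - 1 + len])
        len++;
    return len;
}
```

For S = aabcaabxaaz this should reproduce the values on the slide, for example `z_by_definition(S, 5)` for Z5 and `z_by_definition(S, 9)` for Z9.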

  14. The li and ri of Z-Box • ri = the right-most endpoint of the Z-boxes that begin at or before position i • li = the left end of the Z-box that ends at ri • Examples from the figure: r78 = 95, l78 = 78; r82 = 95, l82 = 78; r52 = 50, l52 = 40; r75 = 85, l75 = 70 • [Figure: Z-boxes drawn over positions 40, 50, 55, 62, 70, 78, 82, 85, 89, 95]

  15. Preprocessing
      i:  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17
      S:  a  a  b  a  a  b  c  a  x  a  a  b  a  a  b  c  y
      Z:  0  1  0  3  1  0  0  1  0  7  1  0  3  1  0  0  0
     ri:  0  2  2  6  6  6  6  8  8 16 16 16 16 16 16 16 16
     li:  0  2  2  4  4  4  4  8  8 10 10 10 10 10 10 10 10
     [Figure: the Z-boxes marked on S]
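The ri and li rows follow mechanically from the Z row: scan left to right and remember the Z-box that reaches furthest right. A minimal sketch, assuming 1-indexed arrays with index 0 unused (the function name `z_box_bounds` is hypothetical):

```c
#include <stddef.h>

/* Fills r[i] and l[i] for i = 1..n from the Z values z[2..n].
 * Arrays are 1-indexed to match the slides; index 0 is unused.
 * r[i] is the right-most endpoint of any Z-box starting at or before i,
 * l[i] is the left end of a Z-box ending at r[i]; both are 0 if no
 * Z-box has been seen yet. */
void z_box_bounds(const size_t *z, size_t n, size_t *r, size_t *l)
{
    size_t best_r = 0, best_l = 0;

    if (n >= 1) { r[1] = 0; l[1] = 0; }
    for (size_t i = 2; i <= n; i++) {
        /* the Z-box at i spans [i, i + z[i] - 1] when z[i] > 0 */
        if (z[i] > 0 && i + z[i] - 1 > best_r) {
            best_r = i + z[i] - 1;
            best_l = i;
        }
        r[i] = best_r;
        l[i] = best_l;
    }
}
```

When several Z-boxes end at the same position, this sketch keeps the left-most one; the definition allows any of them.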

  16. Z-Algorithm • Goal: to calculate Zi for an input string S of length n in linear time • Starting from i = 2, calculate Z2, r2 and l2 • For k = 3, …, n: in iteration k, calculate Zk, rk and lk based on Zj, rj and lj for j = 2, …, k-1 • For iteration k, the algorithm only needs rk-1 and lk-1, so there is no need to keep all ri and li • We use r and l to denote rk-1 and lk-1

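  17. Z-Algorithm • In iteration k: (I) if k <= r, position k lies inside the Z-box α = S[l..r], which matches the prefix α’ = S[1..r’] • k’ = k-l+1; r’ = r-l+1 • α = α’ and β = β’, where β = S[k..r] and β’ = S[k’..r’] • [Figure: α’ and α, with β’ and β inside them, marked on S = aabaabcaxaabaabcy at positions l’, k’, r’ and l, k, r]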

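  18. Z-Algorithm • A) If |γ’| < |β’|, that is, Zk’ < r-k+1, then Zk = Zk’ • Here γ’ is the Z-box at k’, γ is its copy starting at k, and γ’’ is the matching prefix of S, so γ = γ’ = γ’’ • The character x after γ’’ differs from the character y after γ’ (x ≠ y), and the character after γ is also y, so the match at k cannot be extended • Example: at k = 13 (l = 10, r = 16), k’ = 4 and Z4 = 3 < r-k+1 = 4, so Z13 = 3
      i:  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17
      S:  a  a  b  a  a  b  c  a  x  a  a  b  a  a  b  c  y
      Z:  0  1  0  3  1  0  0  1  0  7  1  0  3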

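  19. Z-Algorithm • B) If |γ’| > |β’|, that is, Zk’ > r-k+1, then Zk = |β|, i.e., r-k+1 • β = β’ = β’’, where β’’ is the prefix of S of length |β|; γ’ = γ’’ • The character x follows both β’’ and β’, and the character y follows β; x ≠ y because α is a Z-box, so the match at k stops at r • Example: at k = 13 (l = 10, r = 14), k’ = 4 and Z4 = 3 > r-k+1 = 2, so Z13 = 2
      i:  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
      S:  a  a  b  a  a  b  c  a  x  a  a  b  a  a  c  d
      Z:  0  1  0  3  1  0  0  1  0  5  1  0  2  1  0  0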

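  20. Z-Algorithm • C) If |γ’| = |β’|, that is, Zk’ = r-k+1, then Zk ≥ |β|, i.e., Zk ≥ r-k+1 • β = β’ = β’’ and γ = γ’ = γ’’ • Let z, x and y be the characters that follow β’’, β’ and β: x ≠ y (because α is a Z-box) and z ≠ x (because γ’ is a Z-box), but whether z equals y is unknown • Compare S[r+1..] with S[|β|+1..] until a mismatch occurs, then update Zk, r, and l • Example: at k = 13 (l = 10, r = 14), k’ = 4 and Z4 = 2 = r-k+1, so S[15..] is compared with S[3..]; one more character matches, giving Z13 = 3, l = 13, r = 15
      i:  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
      S:  a  a  b  a  a  e  c  a  x  a  a  b  a  a  b  d
      Z:  0  1  0  2  1  0  0  1  0  5  1  0  3  1  0  0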

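  21. Z-Algorithm • (II) if k > r: position k lies outside every Z-box found so far • Compare the characters starting at k with those starting at 1 until a mismatch occurs; this gives Zk • Update r and l if necessary (when Zk > 0, set l = k and r = k+Zk-1)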

  22. Z-Algorithm • Input: Pattern P (treated as the string S, with n = |S|) • Output: Zi
     Z Algorithm
       Calculate Z2, r2 and l2 explicitly by comparisons; set r = r2 and l = l2
       for k = 3; k <= n; k++
         if k <= r
           if Z[k-l+1] < r-k+1 then Z[k] = Z[k-l+1]
           else if Z[k-l+1] > r-k+1 then Z[k] = r-k+1
           else compare the characters starting at r+1 with those starting at |β|+1; update Z[k], r and l
         else
           compare the characters starting at k with those starting at 1; set Z[k]; update r and l if necessary
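The pseudocode above can be turned into a compilable C function. The following is a sketch under stated assumptions rather than a reference implementation: the name `z_algorithm` and the 1-indexed output array are choices made here, and the loop starts at k = 2 with r = l = 0 so that Z2 falls out of the k > r case instead of being computed separately.

```c
#include <stddef.h>
#include <string.h>

/* Z-algorithm following the slides' case analysis (I-A, I-B, I-C, II).
 * s is NUL-terminated with length n; z needs room for n + 1 entries.
 * z[k] holds Z_k for k = 2..n (1-indexed); z[0] and z[1] are set to 0. */
void z_algorithm(const char *s, size_t *z)
{
    size_t n = strlen(s);
    size_t l = 0, r = 0;            /* Z-box [l, r] with the largest r so far */

    z[0] = 0;
    if (n >= 1) z[1] = 0;
    for (size_t k = 2; k <= n; k++) {
        if (k > r) {
            /* Case II: compare S[k..] with S[1..] explicitly */
            size_t q = 0;
            while (k + q <= n && s[k + q - 1] == s[q]) q++;
            z[k] = q;
            if (q > 0) { l = k; r = k + q - 1; }
        } else {
            size_t kp = k - l + 1;      /* k' */
            size_t beta = r - k + 1;    /* |beta| */
            if (z[kp] < beta) {
                z[k] = z[kp];           /* Case A: copy, no comparisons */
            } else if (z[kp] > beta) {
                z[k] = beta;            /* Case B: stops at r, no comparisons */
            } else {
                /* Case C: compare S[r+1..] with S[|beta|+1..] */
                size_t q = r + 1;       /* 1-indexed position in S */
                while (q <= n && s[q - 1] == s[q - k]) q++;
                z[k] = q - k;
                l = k;
                r = q - 1;
            }
        }
    }
}
```

Running it on S = aabaabcaxaabaabcy should reproduce the Z row of the preprocessing table. Only r and l are carried between iterations, which is the point made on slide 16.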

  23. Preprocessing
      i:  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17
      S:  a  a  b  a  a  b  c  a  x  a  a  b  a  a  b  c  y
      Z:  0  1  0  3  1  0  0  1  0  7  1  0  3  1  0  0  0
      r:  0  2  2  6  6  6  6  8  8 16 16 16 16 16 16 16 16
      l:  0  2  2  4  4  4  4  8  8 10 10 10 10 10 10 10 10

  24. Z-Algorithm • Time complexity • #mismatches <= number of iterations <= n, since each iteration ends at its first mismatch • #matches: let q be the number of matches at iteration k; then r increases by at least q • r <= n, thus total #matches <= n • T = O(#matches + #mismatches + #iterations) = O(n)
        i:  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17
        S:  a  a  b  a  a  b  c  a  x  a  a  b  a  a  b  c  y
        Z:  0  1  0  3  1  0  0  1  0  7  1  0  3  1  0  0  0
        r:  0  2  2  6  6  6  6  8  8 16 16 16 16 16 16 16 16
        l:  0  2  2  4  4  4  4  8  8 10 10 10 10 10 10 10 10
       #m:  0  1  0  3  0  0  0  1  0  7  0  0  0  0  0  0  0
     #mis:  0  1  1  1  0  0  1  1  1  1  0  0  0  0  0  0  1

  25. Simplest Linear Time Exact Matching Algorithm • Input: Pattern P, Text T • Output: Occurrences of P in T • Algorithm Simplest: let S = P$T, where $ is a character that does not appear in P or T • For i = 2; i <= |S|; i++: calculate Zi; if Zi = |P|, report an occurrence of P in T starting at position i-|P|-1 of T • Running time T = O(|P|+|T|+1) = O(n+m)

  26. Simplest Linear Time Exact Matching Algorithm • Takes only O(n) extra space • Alphabet-independent linear time • [Figure: the Z-box α and its prefix copy α’ in S = P$T, with positions l’, k’, r’ and l, k, r]
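The matching procedure on slides 25 and 26 can be sketched in C as follows. This is an illustration under assumptions, not the presenter's code: `z_values` and `z_search` are hypothetical names, the separator is passed in by the caller (who must ensure it occurs in neither P nor T), and the Z computation is a compact 0-indexed variant that folds the three inner cases into one clamp-then-extend step. For simplicity it stores Z values for all of S rather than the reduced space mentioned on slide 26.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* 0-indexed Z values: z[i] is the length of the longest common prefix
 * of s and s + i, for i >= 1. */
static void z_values(const char *s, size_t n, size_t *z)
{
    size_t l = 0, r = 0;                 /* half-open Z-box [l, r) */

    if (n > 0) z[0] = 0;
    for (size_t i = 1; i < n; i++) {
        size_t k = 0;
        if (i < r) {                     /* inside a Z-box: reuse z[i - l] */
            k = z[i - l];
            if (k > r - i) k = r - i;    /* never trust beyond the box */
        }
        while (i + k < n && s[k] == s[i + k]) k++;
        z[i] = k;
        if (i + k > r) { l = i; r = i + k; }
    }
}

/* Finds pat in text via Z values of S = pat + sep + text.
 * Stores up to max_out 1-indexed start positions (in text) into out[]
 * and returns the number of occurrences, or (size_t)-1 if allocation fails. */
size_t z_search(const char *pat, const char *text, char sep,
                size_t *out, size_t max_out)
{
    size_t m = strlen(pat), n = strlen(text);
    size_t total = m + 1 + n, found = 0;

    if (m == 0) return 0;
    char *s = malloc(total + 1);
    size_t *z = malloc(total * sizeof *z);
    if (!s || !z) { free(s); free(z); return (size_t)-1; }

    memcpy(s, pat, m);
    s[m] = sep;
    memcpy(s + m + 1, text, n);
    s[total] = '\0';
    z_values(s, total, z);

    for (size_t i = m + 1; i < total; i++) {
        if (z[i] == m) {                 /* a full copy of pat starts here */
            if (found < max_out) out[found] = i - m;   /* 1-indexed in text */
            found++;
        }
    }
    free(s);
    free(z);
    return found;
}
```

Because the separator never occurs in P, no Z value to the right of it can exceed |P|, so the test `z[i] == m` identifies exactly the occurrences. A call such as `z_search("aba", "bbabaxababay", '$', pos, 8)` should report positions 3, 7, and 9.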

  27. Reference • Chapter 1, 2: Exact Matching: Fundamental Preprocessing and First Algorithms
