
Presented By Dr. Shazzad Hosain Asst. Prof. EECS, NSU



Presentation Transcript


  1. Exact String Matching Algorithms Presented By Dr. Shazzad Hosain, Asst. Prof., EECS, NSU

  2. Classical Comparison Based Methods • Boyer-Moore Algorithm • Knuth-Morris-Pratt Algorithm (KMP Algorithm)

  3. Boyer-Moore Algorithm • Basic ideas: • Previously discussed ideas for naïve matching • Successively align P and T to check for a match. • Shift P to the right on match failure. • New concepts with respect to the naïve algorithm • Scan from right to left • Special bad character rule • Suffix shift rule

  4. Concept: Right-to-left Scan • How can we check for a match of pattern P at location i in target T? • The naïve algorithm scanned left to right, i.e., it compared T[i+k] with P[1+k] for k = 0 to length(P)-1. Example: P = adab, T = abaracadabara
        T: a b a r a c a d a b a r a
        P: a d a b
      Comparison 1: P[1] = a == T[1] = a. Comparison 2: P[2] = d != T[2] = b.

  5. Concept: Right-to-left Scan • Alternatively, scan right to left, i.e., compare T[i+k] with P[1+k] for k = length(P)-1 down to 0. Example: P = adab, T = abaracadabara
        T: a b a r a c a d a b a r a
        P: a d a b
      Comparison 1: P[4] = b != T[4] = r.

  6. Concept: Right-to-left Scan • Why is scanning right to left a good idea? • Answer: by itself, it isn't any better than left to right. • A naïve approach with right-to-left scanning is also Θ(nm). • Larger shifts, supported by a clever bad character rule and a suffix shift rule, make it better.
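The point of slide 6 can be checked with a minimal sketch of the naïve matcher that scans each alignment right to left but still shifts by one position every time. The function name `naive_right_to_left` is an assumption made for this illustration; positions are reported 1-based to match the slides.

```python
def naive_right_to_left(pattern, text):
    """Return the 1-based start positions of pattern in text.

    Every alignment is tried, so the worst case is still Theta(n*m);
    only the comparison order inside one alignment changes.
    """
    n, m = len(pattern), len(text)
    hits = []
    for start in range(m - n + 1):           # shift by exactly one each time
        i = n - 1
        while i >= 0 and pattern[i] == text[start + i]:
            i -= 1                            # compare right to left
        if i < 0:                             # ran off the left end: full match
            hits.append(start + 1)
    return hits


print(naive_right_to_left("adab", "abaracadabara"))
```

The shift rules on the following slides replace the fixed step of one with larger, still safe, steps.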

  7. Concept: Bad Character Rule • Idea: the mismatched character indicates a safe minimum shift. Example: P = adacara, T = abaracadabara
        T: a b a r a c a d a b a r a
        P: a d a c a r a
      Comparison 1: P[7] = a == T[7] = a. Comparison 2: P[6] = r != T[6] = c. Here the bad character is c. Perhaps we should shift to align this character with its rightmost occurrence in P?

  8. Concept: Bad Character Rule • Shift two positions to align the mismatched character c with its rightmost occurrence in P.
        T: a b a r a c a d a b a r a
        P: a d a c a r a              (before the shift)
        P:     a d a c a r a          (after the shift)
      Now, start matching again from right to left.

  9. Concept: Bad Character Rule • Second example: P = adacara, T = abaxaradabara
        T: a b a x a r a d a b a r a
        P: a d a c a r a
      Comparison 1: P[7] = a == T[7] = a. Comparison 2: P[6] = r == T[6] = r. Comparison 3: P[5] = a == T[5] = a. Comparison 4: P[4] = c != T[4] = x. Here the bad character is x. The minimum shift should align this character with its occurrence in P. But x doesn't occur in P!

  10. Concept: Bad Character Rule • Since x doesn't occur in P, we can shift past it. Second example: P = adacara, T = abaxaradabara
        T: a b a x a r a d a b a r a
        P: a d a c a r a              (before the shift)
        P:         a d a c a r a      (after the shift)
      Now, start matching again from right to left.

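  11. Concept: Bad Character Rule • The idea of the bad character rule is to shift P by more than one character when possible. • But if the rightmost occurrence of the mismatched character in P lies to the right of the mismatch position, the rule cannot suggest a positive shift, and P moves by only one position. • Unfortunately, this is often the case.
        position: 12345678901234567
        T:        spbctbsatpqsctbpq
        P:          tpabsat
        P:           tpabsat   (after a shift of one)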

  12. Concept: Bad Character Rule • We will define a bad character rule that uses the concept of the rightmost occurrence of each letter. • Let R(x) be the rightmost position of the letter x in P for each letter x in our alphabet. • If x doesn't occur in P, define R(x) to be 0. • Example: for P = adacara (positions 1234567), R(a) = 7, R(c) = 4, R(d) = 2, R(r) = 6, and R(x) = 0 for every other letter x.

  13. Concept: Bad Character Rule
        position: 12345678901234567
        T:        spbctbsabpqsctbpq
        P:          tpabsab
        P:            tpabsab   (after the shift)
      R(t) = 1, R(s) = 5. • i: the position of the mismatch in P; here i = 3. • k: the corresponding position in T; here k = 5 and T[k] = t. • The bad character rule says P should be shifted right by max{1, i - R(T[k])}; i.e., if the right-most occurrence of character T[k] in P is at position j (j < i), then P[j] should lie below T[k] after the shift. Here the shift is max{1, 3 - 1} = 2. • Otherwise, i.e., when R(T[k]) >= i and hence i - R(T[k]) <= 0, we shift P by one position. • Obviously this rule is not very useful when R(T[k]) >= i, which is usually the case for DNA sequences.
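The R(x) table and the shift max{1, i - R(T[k])} from slides 12 and 13 can be sketched as follows. The helper names `rightmost_positions` and `bad_character_shift` are assumptions made for this illustration; positions are 1-based as on the slides.

```python
def rightmost_positions(pattern):
    """R(x): 1-based rightmost position of each letter x in pattern.

    Later positions overwrite earlier ones, so each letter ends up
    mapped to its rightmost occurrence. Letters absent from the
    pattern are simply missing and are treated as R(x) = 0 below.
    """
    return {ch: pos for pos, ch in enumerate(pattern, start=1)}


def bad_character_shift(R, i, bad_char):
    """Shift suggested when P[i] mismatches the text character bad_char."""
    return max(1, i - R.get(bad_char, 0))


R = rightmost_positions("tpabsab")
print(R["t"], R["s"])                   # R(t) and R(s) from slide 13
print(bad_character_shift(R, 3, "t"))   # mismatch at i = 3 against t
```

With P = tpabsat instead, R(t) = 7 exceeds i = 3 and the same call falls back to a shift of one, which is the weakness the extended rule addresses.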

  14. Concept: Extended Bad Character Rule • Extended bad character rule: if P[i] mismatches T[k], shift P along T so that the closest occurrence of the letter T[k] in P to the left of i is aligned with T[k]. Example: P = aracara, T = abararadabara
        T: a b a r a r a d a b a r a
        P: a r a c a r a
      Comparison 1: P[7] = a == T[7] = a. Comparison 2: P[6] = r == T[6] = r. Comparison 3: P[5] = a == T[5] = a. Comparison 4: P[4] = c != T[4] = r. • Position 6 is the rightmost occurrence of r in P. Notice that i - R(T[k]) < 0, i.e., 4 - 6 < 0. • Position 2 is the rightmost occurrence of r to the left of i in P. Notice that 4 - 2 > 0, i.e., this gives us a positive shift.

  15. Concept: Extended Bad Character Rule • The amount of shift is i - j, where: • i is the index of the mismatch in P. • j is the rightmost occurrence of T[k] to the left of i in P. Example: P = aracara, T = abataradabara
        T: a b a t a r a d a b a r a
        P: a r a c a r a
      Comparison 1: P[7] = a == T[7] = a. Comparison 2: P[6] = r == T[6] = r. Comparison 3: P[5] = a == T[5] = a. Comparison 4: P[4] = c != T[4] = t. • There is no occurrence of t in P, thus j = 0. Notice that i - j = 4, i.e., this gives us a positive shift past the point of mismatch.

  16. Concept: Extended Bad Character Rule • How do we implement this rule? • We preprocess P (from right to left), recording the position of each occurrence of the letters. • For each character x in Σ, the alphabet, create a list of its occurrences in P. If x doesn't occur in P, then it has an empty list.

  17. Concept: Extended Bad Character Rule • Example: Σ = {a, b, c, d, r, t}, P = abataradabara • a_list = <13, 11, 9, 7, 5, 3, 1> since 'a' occurs at these positions in P • b_list = <10, 2> • c_list = Ø • d_list = <8> • r_list = <12, 6> • t_list = <4>
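The preprocessing of slides 16 and 17 and the shift i - j of slide 15 can be sketched as below. The names `occurrence_lists` and `extended_bad_character_shift` are assumptions made for this illustration, and a plain dictionary stands in for the per-letter lists.

```python
def occurrence_lists(pattern):
    """Map each letter to its 1-based positions in pattern, largest first."""
    lists = {}
    for pos in range(len(pattern), 0, -1):   # scan P from right to left
        lists.setdefault(pattern[pos - 1], []).append(pos)
    return lists


def extended_bad_character_shift(lists, i, bad_char):
    """Shift i - j, where j is the closest occurrence of bad_char left of i.

    If bad_char has no occurrence to the left of i, j = 0 and the
    pattern is moved completely past the mismatched text character.
    """
    for pos in lists.get(bad_char, []):      # positions in decreasing order
        if pos < i:
            return i - pos
    return i


lists = occurrence_lists("aracara")
print(extended_bad_character_shift(lists, 4, "r"))   # slide 14
print(extended_bad_character_shift(lists, 4, "t"))   # slide 15
```

Scanning a list costs at most as many steps as characters were matched in the current alignment plus one, so the lookup does not change the asymptotic cost of a comparison phase.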

  18. Concept: Suffix Shift Rule • Recall that we investigated finding prefixes before. • Since we are matching P to T from right-to-left, we will instead need to use suffixes.

  19. Suffix Shift Rule • t is a suffix of P that matches a substring t of T; the character x of T to the left of t mismatches the character y of P to the left of that suffix (x ≠ y). • t′ is the right-most copy of t in P such that t′ is not a suffix of P and the character z to the left of t′ differs from y (z ≠ y).

  20. Concept: Suffix Shift Rule • Consider the partial right-to-left matching of P to T below. • This partial match involves α, a suffix of P.

  21. Concept: Suffix Shift Rule • This partial match ends where the first mismatch occurs, where x is aligned with d.

  22. Concept: Suffix Shift Rule We want to find a right-most copy α′ of this substring α in P such that: • α′ is not a suffix of P and • The character to the left of α′ is not the same as the character to the left of α

  23. Concept: Suffix Shift Rule • If α′ exists, shift P to the right such that α′ is now aligned with the substring in T that was previously aligned with α.

  24. Concept: Suffix Shift Rule • If α′ doesn't exist, shift P right by the least amount such that a prefix of P matches a suffix of α in T.

  25. Concept: Suffix Shift Rule • If α′ doesn't exist, and there is no prefix of P that matches a suffix of α in T, shift P right by n positions.

  26. Preprocessing for the good suffix rule • Let L(i) denote the largest position less than n such that P[i..n] matches a suffix of P[1..L(i)]. • If there is no such position, then L(i) = 0 • Example 1: If i = 17 then L(i) = 9 • Example 2: If i = 16 then L(i) = 0

  27. Concept: Suffix Shift Rule • Let L´(i) denote the largest position less than n such that P[i..n] matches a suffix of P[1..L´(i)] and such that the character preceding that suffix is not equal to P(i-1). • If there is no such position, then L´(i) = 0 • Example 1: for P = slydogsaddogdbadbaddog, if i = 20 then L(i) = 12 and L´(i) = 6

  28. Concept: Suffix Shift Rule • Example 2: for P = slydogsaddogdbadbaddog, if i = 19 then L(i) = 12 and L´(i) = 0

  29. Concept: Suffix Shift Rule • Notice that L(i) indicates the right-most copy of P[i..n] that is not a suffix of P. • In contrast, L´(i) indicates the right-most copy of P[i..n] that is not a suffix of P and whose preceding character doesn't match P(i-1). • The relation between L´(i) and L(i) is analogous to the relation between α′ and α.
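To make the two definitions concrete, here is a minimal brute-force sketch that evaluates L(i) and L´(i) directly from their definitions. It is deliberately not the linear-time method developed on the later slides; the function name `brute_force_L` and its `strong` flag are assumptions made for this illustration, and a copy that starts at position 1 (so has no preceding character) is assumed to qualify for L´.

```python
def brute_force_L(pattern, i, strong=False):
    """Return L(i), or L'(i) when strong=True, for 1-based i with 2 <= i <= n."""
    n = len(pattern)
    suffix = pattern[i - 1:]                  # P[i..n]
    length = len(suffix)
    # Try candidate end positions below n, largest first.
    for end in range(n - 1, length - 1, -1):
        start = end - length + 1              # 1-based start of the copy
        if pattern[start - 1:end] != suffix:
            continue
        if strong and start > 1 and pattern[start - 2] == pattern[i - 2]:
            continue                          # same preceding character as P(i-1)
        return end
    return 0


P = "slydogsaddogdbadbaddog"
print(brute_force_L(P, 20), brute_force_L(P, 20, strong=True))
print(brute_force_L(P, 19), brute_force_L(P, 19, strong=True))
```

For i = 20 the copy of dog ending at position 12 is preceded by d, the same character as P(19), so the strong version skips it and settles on the copy ending at position 6.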

  30. Concept: Suffix Shift Rule • Q: What is the point? • A: If P(i - 1) causes the mismatch and L´(i) > 0, then we can shift P right by n - L´(i) positions. Example:

  31. Concept: Suffix Shift Rule • If L(i) and L´(i) are different, then obviously shifting by n - L´(i) positions is a greater shift than n - L(i). • Example:

  32. Concept: Suffix Shift Rule • Let Nj(P) denote the length of the longest suffix of P[1..j] that is also a suffix of P. • Example 1: N6(P) = 3 and N12(P) = 5. • Example 2: N3(P) = 2, N9(P) = 3, N15(P) = 5, N19(P) = 0.

  33. Concept: Suffix Shift Rule • Q: How are the concepts of Ni and Zi related? • Recall that Zi = length of a maximal substring starting at position i that is a prefix of P. • In contrast, Ni = length of a maximal substring ending at position i that is a suffix of P. • In the case of Boyer-Moore, we are naturally interested in suffixes since we are scanning right to left.

  34. Concept: Suffix Shift Rule • Let Pr denote the mirror image of P; then the relationship can be expressed as Nj(P) = Z_{n-j+1}(Pr). • In words, the length of the substring matching a suffix at position j in P is equal to the length of the corresponding substring matching a prefix in the reverse of P. • Q: Why must this be true? • A: Because they are the same substring, except that one is the reverse of the other.

  35. Concept: Suffix Shift Rule • Since Nj(P) = Z_{n-j+1}(Pr), we can use the Z algorithm to compute N in O(n). • Q: How do we do this? • A: We create Pr, the reverse of P, and process it with the Z algorithm.

  36. Concept: Suffix Shift Rule • N is the reverse of Z! • P: the pattern; Pr: the string obtained by reversing P. Then Nj(P) = Z_{n-j+1}(Pr).
        position: 1 2 3 4 5 6 7 8 9 10
        P:        q c a b d a b d a b
        Nj(P):    0 0 0 2 0 0 5 0 0 0
        Pr:       b a d b a d b a c q
        Zi(Pr):   0 0 0 5 0 0 2 0 0 0
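The reversal trick of slides 34 to 36 can be sketched as follows. The Z computation is the usual linear-time Z-box scan; the names `z_array` and `n_array` are assumptions made for this illustration. Following the table on slide 36, the entries for Z1 and Nn are left at 0 rather than set to n.

```python
def z_array(s):
    """z[i] (0-based) = length of the longest substring starting at i
    that is also a prefix of s; z[0] is left as 0 by convention."""
    n = len(s)
    z = [0] * n
    left = right = 0                          # current Z-box is s[left:right]
    for i in range(1, n):
        if i < right:                         # reuse information inside the box
            z[i] = min(right - i, z[i - left])
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1                         # extend by explicit comparison
        if i + z[i] > right:
            left, right = i, i + z[i]
    return z


def n_array(pattern):
    """Entry j-1 holds N_j(P), using N_j(P) = Z_{n-j+1}(P reversed)."""
    return z_array(pattern[::-1])[::-1]


print(z_array("badbadbacq"))
print(n_array("qcabdabdab"))
```

Reversing the Z array of the reversed pattern is exactly the index map j to n - j + 1, which is why no further bookkeeping is needed.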

  37. Concept: Suffix Shift Rule • For pattern P, Nj (for j = 1, …, n) can be calculated in O(n) using the Z algorithm. • Why do we need to define Nj? To use the strong good suffix rule, we need to find L´(i) for every i = 1, …, n. • We can get L´(i) from Nj!

  38. Concept: Suffix Shift Rule • We can then find the L´(i) and L(i) values from the N values in linear time with the following:
        For i = 1 to n { L´(i) = 0; }
        For j = 1 to n - 1 { i = n - Nj(P) + 1; L´(i) = j; }
        // L values (if desired) can be obtained as follows
        L(2) = L´(2);
        For i = 3 to n { L(i) = max(L(i - 1), L´(i)); }

  39. Concept: Suffix Shift Rule • Example: P = asdbasasas, n = 10 • Values of Nj(P) for j = 1 to 9: 0, 2, 0, 0, 0, 2, 0, 4, 0 • Computed values i = n - Nj(P) + 1: 11, 9, 11, 11, 11, 9, 11, 7, 11 • Values of L´(i) for i = 1 to 10: 0, 0, 0, 0, 0, 0, 8, 0, 6, 0 (assignments to the out-of-range position i = 11 are discarded; L´(9) is first set to 2 and then overwritten by 6).
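The loops for L´ and L translate almost line for line into the sketch below. To keep the block self-contained, the N values come from a simple quadratic helper, `n_values`, which is an assumption of this example and stands in for the Z-based computation; `strong_and_weak_L` is likewise a hypothetical name. A non-empty pattern is assumed.

```python
def n_values(pattern):
    """N_j(P) for j = 1..n-1 by direct comparison (quadratic stand-in)."""
    n = len(pattern)
    result = []
    for j in range(1, n):
        length = 0
        while length < j and pattern[j - 1 - length] == pattern[n - 1 - length]:
            length += 1
        result.append(length)
    return result


def strong_and_weak_L(pattern):
    """Return (L', L) as lists whose entry i-1 holds the value for position i."""
    n = len(pattern)
    N = n_values(pattern)
    L_strong = [0] * (n + 2)        # 1-based; slot n+1 absorbs writes for N_j = 0
    for j in range(1, n):
        i = n - N[j - 1] + 1
        L_strong[i] = j             # later (larger) j overwrites earlier ones
    L_weak = [0] * (n + 2)
    L_weak[2] = L_strong[2]
    for i in range(3, n + 1):
        L_weak[i] = max(L_weak[i - 1], L_strong[i])
    return L_strong[1:n + 1], L_weak[1:n + 1]


strong, weak = strong_and_weak_L("asdbasasas")
print(strong)
print(weak)
```

Because j increases through the loop, the value that survives in each slot is the largest qualifying j, which is what the definition of L´(i) asks for.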

  40. Concept: Suffix Shift Rule • Let l´(i) denote the length of the largest suffix of P[i..n] that is also a prefix of P. Let l´(i) = 0 if no such suffix exists. • Example: P = asasbsasas. l´(1) = 10 (P itself), l´(2) = l´(3) = l´(4) = l´(5) = l´(6) = l´(7) = 4 (the suffix asas), l´(8) = l´(9) = 2 (the suffix as), l´(10) = 0.

  41. Concept: Suffix Shift Rule • Theorem: l´(i) = largest j <= n - i + 1 such that Nj(P) = j. • Q: How can we compute l´(i) values in linear time? • A: This is problem #9 in Chapter 2. This would make an interesting homework problem.
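One possible way to apply the theorem in linear time is sketched below: walk i from n down to 1, so that the bound n - i + 1 grows by one at each step, and remember the largest j seen so far with Nj(P) = j. This is a sketch under the definition as stated on slide 40, with Nn(P) taken to be n, so l´(1) comes out as n. The names `z_array` and `small_l_prime` are assumptions made for this illustration.

```python
def z_array(s):
    """Standard Z array; z[0] is left as 0."""
    n = len(s)
    z = [0] * n
    left = right = 0
    for i in range(1, n):
        if i < right:
            z[i] = min(right - i, z[i - left])
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1
        if i + z[i] > right:
            left, right = i, i + z[i]
    return z


def small_l_prime(pattern):
    """Entry i-1 holds l'(i): largest j <= n - i + 1 with N_j(P) = j."""
    n = len(pattern)
    z = z_array(pattern[::-1])      # N_j(P) = z[n - j] for j < n
    result = [0] * n
    longest = 0
    for i in range(n, 0, -1):
        j = n - i + 1               # the only new candidate at this step
        n_j = n if j == n else z[n - j]
        if n_j == j:                # prefix of length j is also a suffix of P
            longest = j
        result[i - 1] = longest
    return result


print(small_l_prime("asasbsasas"))
```

Each step inspects a single new candidate j, so the pass over i is linear once the Z values are available.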

  42. Boyer-Moore Algorithm • Preprocessing: compute L´(i) and l´(i) for each position i in P, and compute R(x), the right-most occurrence of x in P, for each character x in Σ. • Search:
        k = n;
        While k <= m {
          i = n; h = k;
          While i > 0 and P(i) = T(h) { i = i - 1; h = h - 1; }
          if i = 0 {
            report an occurrence of P in T ending at position k;
            k = k + n - l´(2);
          } else
            shift P (increase k) by the maximum amount indicated by the extended bad character rule and the good suffix rule.
        }

  43. Boyer-Moore Algorithm Example: P = golgol Preprocessing: Compute L´(i) and l´(i) for each position i in P For i = 1 to n {L´(i) = 0;} For j = 1 to n – 1 { i = n - Nj(P) + 1; L´(i) = j; } Notice that first we need Nj(P) values in order to compute L´(i) and l´(i) for each position i in P.

  44. Boyer-Moore Algorithm • Example: P = golgol • Recall that Nj(P) is the length of the longest suffix of P[1..j] that is also a suffix of P. • N1(P) = 0: no suffix of P ends with g • N2(P) = 0: no suffix of P ends with o • N3(P) = 3: gol, all of P[1..3], is a suffix of P • N4(P) = 0: no suffix of P ends with g • N5(P) = 0: no suffix of P ends with o • So N1(P) = N2(P) = N4(P) = N5(P) = 0 and N3(P) = 3

  45. Boyer-Moore Algorithm • Preprocessing: P = golgol, n = 6 • N1(P) = N2(P) = N4(P) = N5(P) = 0 and N3(P) = 3 • Compute L´(i) for each position i in P: For i = 1 to n {L´(i) = 0;} For j = 1 to n - 1 { i = n - Nj(P) + 1; L´(i) = j; } • j = 1: i = 7, therefore L´(7) = 1 • j = 2: i = 7, therefore L´(7) = 2 • j = 3: i = 4, therefore L´(4) = 3 • j = 4: i = 7, therefore L´(7) = 4 • j = 5: i = 7, therefore L´(7) = 5 • Position 7 lies outside P, so those assignments are ignored: L´(1) = L´(2) = L´(3) = L´(5) = L´(6) = 0 and L´(4) = 3

  46. Boyer-Moore Algorithm • Preprocessing: P = golgol, n = 6 • N1(P) = N2(P) = N4(P) = N5(P) = 0 and N3(P) = 3 • L´(1) = L´(2) = L´(3) = L´(5) = L´(6) = 0 and L´(4) = 3 • Compute l´(i) for each position i in P. • Recall that l´(i) is the length of the longest suffix of P[i..n] that is also a prefix of P. • l´(1) = 6 since golgol itself is the longest suffix of P[1..n] that is a prefix of P. • l´(2) = 3 since gol is the longest suffix of P[2..n] that is a prefix of P. • l´(3) = 3 since gol is the longest suffix of P[3..n] that is a prefix of P. • l´(4) = 3 since gol is the longest suffix of P[4..n] that is a prefix of P. • l´(5) = 0 since there is no suffix of P[5..n] that is a prefix of P. • l´(6) = 0 since there is no suffix of P[6..n] that is a prefix of P. • So l´(1) = 6, l´(2) = l´(3) = l´(4) = 3 and l´(5) = l´(6) = 0

  47. Boyer-Moore Algorithm • Preprocessing: P = golgol, n = 6 • N1(P) = N2(P) = N4(P) = N5(P) = 0 and N3(P) = 3 • L´(1) = L´(2) = L´(3) = L´(5) = L´(6) = 0 and L´(4) = 3 • l´(1) = 6, l´(2) = l´(3) = l´(4) = 3 and l´(5) = l´(6) = 0 • Compute the list R(x) of occurrences of x in P, right-most first, for each character x in Σ = {g, o, l}: R(g) = <4, 1>, R(o) = <5, 2>, R(l) = <6, 3>

  48. Boyer-Moore Algorithm • Preprocessing: P = golgol, n = 6, T = lolgolgol, m = 9 • L´(1) = L´(2) = L´(3) = L´(5) = L´(6) = 0 and L´(4) = 3 • l´(1) = 6, l´(2) = l´(3) = l´(4) = 3 and l´(5) = l´(6) = 0 • R(g) = <4, 1>, R(o) = <5, 2>, R(l) = <6, 3> • Search: k = n; While k <= m { i = n; h = k; While i > 0 and P(i) = T(h) { i = i - 1; h = h - 1; } if i = 0 { report an occurrence of P in T ending at position k; k = k + n - l´(2); } else shift P (increase k) by the maximum amount indicated by the extended bad character rule and the good suffix rule. }

  49. Search, first iteration: n = 6, m = 9, k = 6, so P is aligned with T[1..6].
        T: l o l g o l g o l
        P: g o l g o l
      Comparing right to left, (i, h) = (6, 6), (5, 5), (4, 4), (3, 3) and (2, 2) all match; at i = 1, h = 1 we find P(1) = g != T(1) = l. The inner loop stops with i = 1, not i = 0, so there is no occurrence here. • Bad character rule: l, the mismatched character of T, does not occur to the left of P(1), so this rule suggests shifting only 1 place. • Good suffix rule: the matched suffix is P[2..6]; since L´(2) = 0, use l´(2) = 3 and shift P by n - l´(2) = 6 - 3 = 3 places. • Thus k = k + 3 = 9.

  50. Search, second iteration: k = 9, so P is aligned with T[4..9].
        T: l o l g o l g o l
        P:       g o l g o l
      Comparing right to left, (i, h) = (6, 9), (5, 8), (4, 7), (3, 6), (2, 5) and (1, 4) all match, leaving i = 0, h = 3. • Since i = 0, report an occurrence of P in T starting at position 4 (ending at k = 9), then set k = k + 6 - l´(2) = 9 + 6 - 3 = 12. • k = 12 > m = 9, so we are done!
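The whole procedure traced on the last slides can be put together in one short sketch. It is an illustration under stated assumptions, not the presenter's implementation: the names `z_array` and `boyer_moore` are hypothetical, the inner loop compares P(i) with T(h), occurrences are reported by their 1-based start position (4 in the trace, for k = 9), and a mismatch on the very first comparison falls back to a good-suffix shift of 1.

```python
def z_array(s):
    """Standard Z array; z[0] is left as 0."""
    n = len(s)
    z = [0] * n
    left = right = 0
    for i in range(1, n):
        if i < right:
            z[i] = min(right - i, z[i - left])
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1
        if i + z[i] > right:
            left, right = i, i + z[i]
    return z


def boyer_moore(pattern, text):
    """Return the 1-based start positions of pattern in text."""
    n, m = len(pattern), len(text)
    if n == 0 or n > m:
        return []
    # N[j] = N_j(P) for j = 1..n (index 0 unused), via Z on the reversed pattern.
    N = [0] + z_array(pattern[::-1])[::-1]
    N[n] = n
    # Strong good suffix table L'(i); slot n+1 absorbs writes for N_j = 0.
    Lp = [0] * (n + 2)
    for j in range(1, n):
        Lp[n - N[j] + 1] = j
    # l'(i): largest j <= n - i + 1 with N_j(P) = j.
    lp = [0] * (n + 2)
    longest = 0
    for i in range(n, 0, -1):
        j = n - i + 1
        if N[j] == j:
            longest = j
        lp[i] = longest
    # Occurrence lists for the extended bad character rule, largest first.
    occ = {}
    for pos in range(n, 0, -1):
        occ.setdefault(pattern[pos - 1], []).append(pos)

    hits = []
    k = n                                    # text position under P(n)
    while k <= m:
        i, h = n, k
        while i > 0 and pattern[i - 1] == text[h - 1]:
            i -= 1
            h -= 1
        if i == 0:
            hits.append(k - n + 1)           # occurrence ends at k
            k += n - lp[2]
        else:
            # Extended bad character rule: closest occurrence left of i.
            bad = next((i - p for p in occ.get(text[h - 1], []) if p < i), i)
            # Good suffix rule: the matched suffix is P[i+1..n].
            if i == n:
                good = 1
            elif Lp[i + 1] > 0:
                good = n - Lp[i + 1]
            else:
                good = n - lp[i + 1]
            k += max(bad, good)
    return hits


print(boyer_moore("golgol", "lolgolgol"))
```

On the slides' example this follows the same two iterations as the trace: a shift of 3 after the mismatch at P(1), then a full match with P aligned to T[4..9].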
