CSE331 – Lecture 24 String Matching

jules

Presentation Transcript


  1. CSE331 – Lecture 24: String Matching • Simple (Brute-Force) Approach • Knuth-Morris-Pratt Algorithm • Boyer-Moore Algorithm

  2. The Problem • Find the first occurrence of the pattern P in text T. • The number of characters in P is m • The number of characters in T is n

  3. The Simple Approach • For each position j in the text • If T[ j .. j+m) matches P[0..m) • stop: pattern found at position j • Advantage: • simple to implement • Disadvantage: • may require the ability to push previously read characters back into the input stream • Worst-Case Efficiency: O(m*n) • The pattern is moved forward only one position each time a mismatch is found, no matter how much of the pattern matched before the mismatched character
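The simple approach can be sketched as follows. This is a minimal illustration, assuming the text and pattern are held in memory as std::string; the name bruteForceScan is hypothetical and does not come from the lecture.

```cpp
#include <string>

// Hypothetical helper (not from the lecture): returns the index of the
// first occurrence of P in T, or -1 if P does not occur.
int bruteForceScan(const std::string& P, const std::string& T) {
    const int m = static_cast<int>(P.size());
    const int n = static_cast<int>(T.size());
    for (int j = 0; j + m <= n; ++j) {           // each candidate position j
        int k = 0;
        while (k < m && T[j + k] == P[k]) ++k;   // compare T[j..j+m) with P[0..m)
        if (k == m) return j;                    // pattern found at position j
    }
    return -1;                                   // every alignment failed
}
```

After a mismatch the inner comparison restarts at k = 0 and j advances by only one, which is where the O(m*n) worst case comes from.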

  4. Knuth-Morris-Pratt (KMP) • Based on an FSA for recognizing the pattern P • The FSA is represented by a KMP flowchart • States are letters in the pattern P • Arcs are SUCCESS or FAIL • On success ( T[ j ] == P[ k ] ) • move forward with match ( j++ & k++ ) • On failure ( T[ j ] != P[ k ] ) • Move backward in the pattern (or shift the pattern forward over the text) to align the rightmost character P [ fail [ k ] ] with text character T [ j ], preserving the longest matching prefix

  5. KMP Fail Links: hubbahubba • Example pattern: hubbahubba
       P:        H  U  B  B  A  H  U  B  B  A
       k:        0  1  2  3  4  5  6  7  8  9
       fail[k]: -1  0  0  0  0  0  1  2  3  4
     • Match to text: hubbahubbletelescope...
       hubbahubba              last A != L, fail[9] = 4
            hubbahubba         first A != L, fail[4] = 0
                hubbahubba     H != L, fail[0] = -1
                 hubbahubba
       hubbahubbletelescope...
                ^

  6. KMP – Building Fail Links • Pattern: ABABDD • If P [ k ] != T [ j ] then • knew = fail [ k ] is the position of the pattern character with the longest prefix matching the text T prior to the mismatch character T [ j ] • Finding fail[k]: • Go to P [ k-1 ] & find its fail [ k-1 ] (prefix that matches up to T[ k-2 ] ) • If P [ fail[k-1] ] matches P[k-1], then fail [ k ] becomes fail[k-1] + 1 • Else follow the next fail arrow fail [ fail [ k-1 ] ] and repeat • [Figure: KMP flowchart for ABABDD: a "Read char" node, then states A B A B D D (positions 0 to 5), then *]

  7. KMP – Building Fail Links
     void kmpSetup(char P[], int m, int fail[])
     {
         int k, s;
         fail[0] = -1;                  // ch != P[0], read another ch
         for (k = 1; k < m; k++)        // for each P[k], left to right
         {
             s = fail[k-1];             // s is previous fail link
             while (s >= 0)             // if not back to start
             {
                 if (P[s] == P[k-1])    // duplicate char found
                     break;             // so, stop following links
                 s = fail[s];           // follow next fail link
             }
             fail[k] = s + 1;           // one past the matched prefix (0 if none)
         }
     }
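As a quick check of the fail-link construction, the sketch below repeats the same loop with std::string and std::vector so the tables for the two example patterns can be recomputed. The name buildFail is hypothetical; only the loop structure is taken from the slide.

```cpp
#include <string>
#include <vector>

// Hypothetical wrapper (not from the lecture) around the same loop as
// kmpSetup: fail[k] is where to resume in P after a mismatch at P[k].
std::vector<int> buildFail(const std::string& P) {
    const int m = static_cast<int>(P.size());
    std::vector<int> fail(m, -1);                // fail[0] stays -1
    for (int k = 1; k < m; ++k) {
        int s = fail[k - 1];                     // start from the previous fail link
        while (s >= 0 && P[s] != P[k - 1])       // follow links until P[k-1] extends a prefix
            s = fail[s];
        fail[k] = s + 1;                         // one past the matched prefix (0 if none)
    }
    return fail;
}
```

For ABABDD this yields -1 0 0 1 2 0, and for hubbahubba it yields -1 0 0 0 0 0 1 2 3 4, the values shown on the example slides.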

  8. KMP Fail Links: on mismatch, new k = fail[k] • Example pattern: ABABDD
       ABABDD   .ABABDD       A != X so fail[0] = -1
       X?????   X??????       Skip X & k=0
       ABABDD   .ABABDD       B != X so fail[1] = 0
       AX????   AX?????       Shift=1 & k=0
       ABABDD   ..ABABDD      2nd A != X so fail[2] = 0
       ABX???   ABX?????      Shift=2 & k=0
       ABABDD   ..ABABDD      2nd B != X so fail[3] = 1
       ABAX??   ABAX????      Shift=2 & k=1

  9. KMP Fail Links: on mismatch, new k = fail[k] • Example pattern: ABABDD (cont)
       ABABDD   ..ABABDD      D != X so fail[4] = 2
       ABABX?   ABABX???      Shift=2 & k=2
       ABABDD   .....ABABDD   2nd D != X so fail[5] = 0
       ABABDX   ABABDX?????   Shift=5 & k=0

  10. KMP Scan Algorithm
      // atEndOfText(T, j) is the lecture's helper; it is assumed to be true
      // once j is past the last text character.
      int kmpScan(char P[], char T[], int m, int fail[])
      {
          int match = -1;               // position of match in text
          int j = 0, k = 0;
          while (k < m && !atEndOfText(T, j)) {   // pattern unfinished and there is more text
              if (k == -1) {            // nothing in pattern matched last text char, so
                  j++;                  // get next text character
                  k = 0;                // start pattern over
              } else if (T[j] == P[k]) {
                  j++; k++;             // move forward one character in pattern and text
              } else {
                  k = fail[k];          // follow fail link to best restart in pattern
              }
          }
          if (k == m)                   // matched entire pattern, even at the very end of the text
              match = j - m;
          return match;
      }
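The scan and the fail-link setup can be combined into one self-contained function. The sketch below is an illustration under stated assumptions: it uses std::string in place of the char arrays and the end-of-text helper, and the name kmpSearch is hypothetical.

```cpp
#include <string>
#include <vector>

// Hypothetical all-in-one KMP search (not from the lecture): returns the
// index of the first occurrence of P in T, or -1 if there is none.
int kmpSearch(const std::string& P, const std::string& T) {
    const int m = static_cast<int>(P.size());
    const int n = static_cast<int>(T.size());
    if (m == 0) return 0;                        // empty pattern matches at 0

    std::vector<int> fail(m, -1);                // build fail links, O(m)
    for (int k = 1; k < m; ++k) {
        int s = fail[k - 1];
        while (s >= 0 && P[s] != P[k - 1]) s = fail[s];
        fail[k] = s + 1;
    }

    int j = 0, k = 0;                            // scan text, O(n)
    while (j < n && k < m) {
        if (k == -1) { ++j; k = 0; }             // skip this text char, restart pattern
        else if (T[j] == P[k]) { ++j; ++k; }     // success arc
        else k = fail[k];                        // fail arc; j never moves backward
    }
    return (k == m) ? j - m : -1;
}
```

Because j only ever moves forward, the text can be read as a stream with no need to push characters back, unlike the simple approach.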

  11. KMP – Efficiency • Building Fail Links – O(m) • Scanning text – O(n) • Overall – O(m+n), which is O(n) when m ≤ n

  12. Boyer-Moore (BM) • Heuristic #1 • Match the pattern right-to-left • Create a charJump[ch] array with an entry for each character in the alphabet (ASCII code) • If T[ j ] != P[ k ] then • If T[j] appears in P[0..k) then • its rightmost occurrence is aligned with T[j] • Else • the pattern P is aligned beginning at T[j+1] • jnew = j + charJump[ T[ j ] ] • matching resumes with T[ jnew ] and P[m-1] • This skips multiple text characters WITHOUT ever examining them
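One way to build the charJump table is sketched below. It assumes a 256-entry table indexed by unsigned char and 0-based pattern positions; the name buildCharJump is hypothetical. Each entry is the amount added to j so that matching resumes at the new right end of the pattern.

```cpp
#include <array>
#include <string>

// Hypothetical builder (not from the lecture): charJump[c] is how far to
// advance j when text character c causes a mismatch.
std::array<int, 256> buildCharJump(const std::string& P) {
    const int m = static_cast<int>(P.size());
    std::array<int, 256> charJump;
    charJump.fill(m);                            // c not in P: pattern restarts at T[j+1]
    for (int k = 0; k < m; ++k) {
        // Later (more rightward) occurrences overwrite earlier ones, so the
        // final value reflects the rightmost occurrence of each character.
        charJump[static_cast<unsigned char>(P[k])] = m - 1 - k;
    }
    return charJump;
}
```

Note that this table records the rightmost occurrence in the whole pattern, not just in P[0..k). When that occurrence lies to the right of k the jump is too small to move the pattern forward, which is one reason a scan would combine it with a second heuristic.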

  13. Boyer-Moore Algorithm • Heuristic #2 • matchJump[k] = slide[k] + m – k • slide[k] is the amount of slide needed to align substrings • m – k is the length of the suffix (substring) being realigned • Similar to KMP fail links, but calculated right to left • If a suffix has matched in P & T and that same substring appears elsewhere in P, then upon a mismatch the pattern P is “slid” to align the rightmost such matching substring with the suffix in T • Matching resumes at the new end of the pattern determined by matchJump [ k ]
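The matchJump idea can be illustrated with a brute-force computation straight from the definition. This is a sketch, not the linear-time construction: it runs in O(m^3), uses 0-based positions (so the suffix term becomes m - 1 - k), and the name buildMatchJump is hypothetical.

```cpp
#include <string>
#include <vector>

// Hypothetical brute-force builder (not from the lecture). For a mismatch
// at P[k], find the smallest slide >= 1 that keeps the already matched
// suffix P[k+1..m) consistent with the slid pattern.
std::vector<int> buildMatchJump(const std::string& P) {
    const int m = static_cast<int>(P.size());
    std::vector<int> matchJump(m, 0);
    for (int k = 0; k < m; ++k) {
        int slide = 1;
        for (; slide < m; ++slide) {             // slide == m always works
            bool ok = true;
            for (int i = k + 1; i < m && ok; ++i) {
                // Positions that slide off the left end impose no constraint.
                if (i - slide >= 0 && P[i - slide] != P[i]) ok = false;
            }
            if (ok) break;
        }
        // Jump from the mismatch position to the new right end of the pattern.
        matchJump[k] = slide + (m - 1 - k);
    }
    return matchJump;
}
```

For BATSANDCATS with a mismatch at k = 7 (after ATS has matched), the smallest consistent slide is 7, so matchJump[7] = 7 + 3 = 10.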

  14. BM – Example • Pattern: BATSANDCATS
        BATSANDCATS                       first pattern alignment
          BATSANDCATS                     charJump[T[j]] aligns N’s
               BATSANDCATS                matchJump[k] aligns ATS’s
        TWOOLDGNATSCANBELIKEBATSANDCATS   the text
      • New j (where matching resumes) is at the end of pattern P, but which one: (S =?= A) or (S =?= I)? • Use MAX(charJump[T[j]], matchJump[k])

  15. BM Scan Algorithm
      // endOfText(T, j) is the lecture's helper; it is assumed to be true
      // for any j at or beyond the end of the text.
      int boyerMooreScan(char P[], char T[], int m, int charJump[], int matchJump[])
      {
          int match = -1, j = m-1, k = m-1;
          int jump;
          while (k >= 0 && !endOfText(T, j)) {   // pattern unfinished and there is more text
              if (T[j] == P[k]) {
                  j--; k--;                      // continue match right-to-left
              } else {
                  jump = matchJump[k];
                  if (charJump[(unsigned char)T[j]] > jump)
                      jump = charJump[(unsigned char)T[j]];   // use the larger jump
                  j += jump;                     // jump forward & restart matching at right
                  k = m-1;
              }
          }
          if (k < 0)
              match = j + 1;                     // entire pattern matches
          return match;
      }
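For experimenting with the scan, the following self-contained sketch builds both jump tables and runs the same loop. It assumes std::string inputs, a 256-entry charJump table, and a brute-force matchJump computed from the definition with 0-based positions; the name boyerMooreSearch is hypothetical.

```cpp
#include <algorithm>
#include <array>
#include <string>
#include <vector>

// Hypothetical all-in-one Boyer-Moore search (not from the lecture).
int boyerMooreSearch(const std::string& P, const std::string& T) {
    const int m = static_cast<int>(P.size());
    const int n = static_cast<int>(T.size());
    if (m == 0) return 0;

    std::array<int, 256> charJump;               // heuristic 1
    charJump.fill(m);
    for (int k = 0; k < m; ++k)
        charJump[static_cast<unsigned char>(P[k])] = m - 1 - k;

    std::vector<int> matchJump(m);               // heuristic 2, brute force
    for (int k = 0; k < m; ++k) {
        int slide = 1;
        for (; slide < m; ++slide) {
            bool ok = true;
            for (int i = k + 1; i < m && ok; ++i)
                if (i - slide >= 0 && P[i - slide] != P[i]) ok = false;
            if (ok) break;
        }
        matchJump[k] = slide + (m - 1 - k);
    }

    int j = m - 1, k = m - 1;                    // start at the right end
    while (k >= 0 && j < n) {
        if (T[j] == P[k]) { --j; --k; }          // continue match right-to-left
        else {
            j += std::max(charJump[static_cast<unsigned char>(T[j])], matchJump[k]);
            k = m - 1;                           // restart at right end of pattern
        }
    }
    return (k < 0) ? j + 1 : -1;
}
```

On the lecture's example, searching for BATSANDCATS in TWOOLDGNATSCANBELIKEBATSANDCATS jumps j from 7 to 17, then to 28 and 30, and reports the match at position 20.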

  16. BM Algorithm Efficiency • Building charJump[ ] – O(S + m), where S is the alphabet size • Building matchJump[ ] – O(m) • Scanning text – O(n) • In practice, only about one in every 3 or 4 text characters is examined, so BM is quite fast • Overall – O(n)

  17. String Matching Program • Program to demonstrate all three approaches to string matching • Strscan.cpp
