CSE331 – Lecture 24 String Matching

CSE331 – Lecture 24 String Matching • Simple (Brute-Force) Approach • Knuth-Morris-Pratt Algorithm • Boyer-Moore Algorithm

The Problem • Find the first occurrence of the pattern P in text T. • The number of characters in P is m • The number of characters in T is n

The Simple Approach • For each position j in the text • If T[ j .. j+m) matches P[0..m) • stop : pattern found at position j • Advantage: • simple to implement • Disadvantage: • may require the ability to push previously read characters back into the input stream • Worst-Case Efficiency: O(m*n) • The pattern moves forward only one position after each mismatch, no matter how much of the pattern matched before the mismatched character
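The simple approach can be sketched as follows. This is a minimal illustration, not the course's own code: the name bruteForceScan and the use of std::string (instead of an input stream) are assumptions made for the sketch.

```cpp
#include <string>

// Returns the index of the first occurrence of P in T, or -1 if there is none.
// bruteForceScan is a hypothetical name chosen for this sketch.
int bruteForceScan(const std::string& P, const std::string& T)
{
    const int m = static_cast<int>(P.size());
    const int n = static_cast<int>(T.size());
    for (int j = 0; j + m <= n; j++) {       // each candidate position j
        int k = 0;
        while (k < m && T[j + k] == P[k])    // compare T[j..j+m) with P[0..m)
            k++;
        if (k == m)
            return j;                        // pattern found at position j
    }
    return -1;                               // pattern does not occur
}
```

Because the whole text is held in memory here, the "push back" disadvantage does not show up; it matters when T is read one character at a time from a stream.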

Knuth-Morris-Pratt (KMP) • Based on an FSA for recognizing the pattern P • The FSA is represented by a KMP flowchart • States are letters in the pattern P • Arcs are SUCCESS or FAIL • On success ( T[ j ] == P[ k ] ) • move forward with match ( j++ & k++ ) • On failure ( T[ j ] != P[ k ] ) • Move backward in the pattern (or, equivalently, shift the pattern forward over the text) to align pattern character P [ fail [ k ] ] with text character T [ j ], preserving the longest matching prefix

KMP Fail Links: hubbahubba
• Example pattern: hubbahubba
  P:        H  U  B  B  A  H  U  B  B  A
  k:        0  1  2  3  4  5  6  7  8  9
  fail[k]: -1  0  0  0  0  0  1  2  3  4
• Match to text: hubbahubbletelescope...
  hubbahubba                  last A != L    fail[9] = 4
       hubbahubba             first A != L   fail[4] = 0
           hubbahubba         H != L         fail[0] = -1
            hubbahubba
  hubbahubbletelescope...
           ^

KMP – Building Fail Links • Pattern: ABABDD • If P [ k ] != T [ j ] then • k_new = fail [ k ] is the pattern position at which matching resumes: P[0..fail[k]) is the longest prefix of P that matches the text just before the mismatched character T [ j ] • Finding fail[k]: • Go to P [ k-1 ] & look up its fail [ k-1 ] (the length of the prefix that matches up to P[ k-2 ] ) • If P [ fail[k-1] ] matches P[k-1], then fail [ k ] becomes fail[k-1] + 1 • Else follow the next fail arrow, fail [ fail [ k-1 ] ], and repeat • [Flowchart: start state "Read char" (*), followed by states A B A B D D for k = 0 1 2 3 4 5]

KMP – Building Fail Links

void kmpSetup(char P[], int m, int fail[])
{
    int k, s;
    fail[0] = -1;                  // ch != P[0], read another ch
    for (k = 1; k < m; k++)        // for each P[k], left to right
    {
        s = fail[k-1];             // s is previous fail link
        while (s >= 0)             // if not back to start
        {
            if (P[s] == P[k-1])    // duplicate char found
                break;             // so, stop following links
            s = fail[s];           // follow next fail link
        }
        fail[k] = s + 1;           // resume one past the matched prefix
    }
}
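To check the fail links against the two example patterns in these slides, the same construction can be wrapped so that it returns the table. This is a sketch under assumptions: buildFailLinks is a hypothetical name, and std::string / std::vector replace the raw arrays.

```cpp
#include <string>
#include <vector>

// Same link-following logic as kmpSetup, returning the table by value.
// buildFailLinks is a hypothetical name used only in this sketch.
std::vector<int> buildFailLinks(const std::string& P)
{
    const int m = static_cast<int>(P.size());
    std::vector<int> fail(m);
    if (m == 0)
        return fail;                       // empty pattern: no links
    fail[0] = -1;                          // mismatch at P[0]: read another char
    for (int k = 1; k < m; k++) {
        int s = fail[k - 1];               // previous fail link
        while (s >= 0 && P[s] != P[k - 1]) // follow links until P[k-1] extends a prefix
            s = fail[s];
        fail[k] = s + 1;
    }
    return fail;
}
```

For ABABDD this yields -1 0 0 1 2 0, and for hubbahubba it yields -1 0 0 0 0 0 1 2 3 4, matching the tables on the example slides.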

KMP Fail Links: on mismatch, new k = fail[k]
• Example pattern: ABABDD (pattern above text; left: at the mismatch X, right: after the shift)
• ABABDD    .ABABDD     A != X so fail[0] = -1
  X?????    X?????      Skip X & k=0
• ABABDD    .ABABDD     B != X so fail[1] = 0
  AX????    AX????      Shift=1 & k=0
• ABABDD    ..ABABDD    2nd A != X so fail[2] = 0
  ABX???    ABX???      Shift=2 & k=0
• ABABDD    ..ABABDD    2nd B != X so fail[3] = 1
  ABAX??    ABAX??      Shift=2 & k=1

KMP Fail Links: on mismatch, new k = fail[k]
• Example pattern: ABABDD (cont)
• ABABDD    ..ABABDD       D != X so fail[4] = 2
  ABABX?    ABABX?         Shift=2 & k=2
• ABABDD    .....ABABDD    2nd D != X so fail[5] = 0
  ABABDX    ABABDX         Shift=5 & k=0

KMP Scan Algorithm

int kmpScan(char P[], char T[], int m, int fail[])
{
    int match = -1;                // position of match in text
    int j = 0, k = 0;
    while (k < m && !atEndOfText(T, j)) {  // pattern unfinished and more text
        if (k == -1) {             // nothing in pattern matched last text char, so
            j++;                   // get next text character
            k = 0;                 // start pattern over
        } else if (T[j] == P[k]) {
            j++; k++;              // move forward one character in pattern and text
        } else {
            k = fail[k];           // follow fail link to best restart in pattern
        }
    }
    if (k == m)                    // checked after the loop so that a match ending
        match = j - m;             // at the last text character is not missed
    return match;
}
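A self-contained version of the scan, combining the fail-link setup with the search loop, might look like the following. It is a sketch under assumptions: kmpSearch is a hypothetical name, the text is a std::string, and the atEndOfText helper is replaced by a plain length check.

```cpp
#include <string>
#include <vector>

// Returns the index of the first occurrence of P in T, or -1 if there is none.
// kmpSearch is a hypothetical name used only in this sketch.
int kmpSearch(const std::string& P, const std::string& T)
{
    const int m = static_cast<int>(P.size());
    const int n = static_cast<int>(T.size());
    if (m == 0)
        return 0;                          // empty pattern matches at position 0

    std::vector<int> fail(m);              // fail links, built as in kmpSetup
    fail[0] = -1;
    for (int k = 1; k < m; k++) {
        int s = fail[k - 1];
        while (s >= 0 && P[s] != P[k - 1])
            s = fail[s];
        fail[k] = s + 1;
    }

    int j = 0, k = 0;
    while (j < n && k < m) {
        if (k == -1) {                     // nothing matched: next text char,
            j++;                           // restart the pattern
            k = 0;
        } else if (T[j] == P[k]) {
            j++; k++;                      // advance in both text and pattern
        } else {
            k = fail[k];                   // back up in the pattern only
        }
    }
    return (k == m) ? j - m : -1;
}
```

Note that j never decreases, which is why no text characters have to be pushed back into the input.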

KMP - Efficiency • Building Fail Links – O(m) • Scanning text – O(n) • Overall – O(m+n) = O(n), since m <= n

Boyer-Moore (BM) • Heuristic # 1 • Match pattern Right-to-Left • Create a charJump[ch] array with an entry for each character in the alphabet (ASCII code) • If T[ j ] != P[ k ] then • If T[j] appears in P[0..k) then • its rightmost occurrence is aligned with T[j] • Else • the pattern P is aligned beginning at T[j+1] • jnew = j + charJump[ T[ j ] ] • matching resumes with T[ jnew ] and P[m-1] • This skips multiple text characters WITHOUT ever examining them
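One common way to fill the charJump table is sketched below. It assumes 8-bit character codes (256 entries) and stores the distance from the rightmost occurrence of each character to the end of the pattern, or m for characters that do not occur; buildCharJump is a hypothetical name. Since the table is indexed only by the character, it records the rightmost occurrence in all of P, not just in P[0..k).

```cpp
#include <array>
#include <string>

// charJump[ch] = m - 1 - (index of the rightmost ch in P), or m if ch is not in P.
// buildCharJump is a hypothetical name used only in this sketch.
std::array<int, 256> buildCharJump(const std::string& P)
{
    const int m = static_cast<int>(P.size());
    std::array<int, 256> charJump;
    charJump.fill(m);                      // character absent: jump a full pattern length
    for (int k = 0; k < m; k++)            // later (rightmost) occurrences overwrite earlier ones
        charJump[static_cast<unsigned char>(P[k])] = m - 1 - k;
    return charJump;
}
```

For the pattern BATSANDCATS (m = 11) this gives charJump['N'] = 5, charJump['S'] = 0, and 11 for any character that is not in the pattern.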

Boyer-Moore Algorithm • Heuristic # 2 • matchJump[k] = slide[k] + m – k • slide[k] is the amount of slide needed to align the substrings • m – k is the length of the suffix (substring) being realigned, counting pattern positions 1..m (with 0-based k, the matched suffix P[k+1..m) has length m – 1 – k) • Similar to KMP fail links, but calculated right to left • If a suffix has matched in P & T and that same substring appears elsewhere in P, then upon a mismatch the pattern P is “slid” to align the rightmost such matching substring with the suffix in T • Matching resumes at the new end of the pattern determined by matchJump [ k ]
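To make the definition concrete, matchJump can be computed directly from it by trying every slide in turn. The sketch below is a deliberately naive O(m^3) illustration, not the O(m) construction used in practice. It assumes 0-based indices, so the matched suffix is P[k+1..m) and the jump is slide + (m - 1 - k); buildMatchJumpNaive is a hypothetical name.

```cpp
#include <string>
#include <vector>

// For each mismatch position k, find the smallest slide >= 1 that keeps the
// already matched suffix P[k+1..m) consistent with the shifted pattern.
// buildMatchJumpNaive is a hypothetical name used only in this sketch.
std::vector<int> buildMatchJumpNaive(const std::string& P)
{
    const int m = static_cast<int>(P.size());
    std::vector<int> matchJump(m);
    for (int k = 0; k < m; k++) {
        int slide = 1;
        for (; slide < m; slide++) {
            bool ok = true;
            for (int i = k + 1; i < m && ok; i++)   // positions of the matched suffix
                if (i - slide >= 0 && P[i - slide] != P[i])
                    ok = false;                     // shifted pattern would disagree
            if (ok)
                break;
        }
        // a slide of m always works, so slide <= m here
        matchJump[k] = slide + (m - 1 - k);         // move j to the new right end
    }
    return matchJump;
}
```

For BATSANDCATS, a mismatch at the C (k = 7, after ATS has matched) gives slide 7 and a jump of 10, which lines up the earlier ATS with the text.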

BM - Example
• Pattern: BATSANDCATS
  BATSANDCATS                        first pattern alignment
    BATSANDCATS                      charJump[T[j]] aligns N’s
         BATSANDCATS                 matchJump[k] aligns ATS’s
  TWOOLDGNATSCANBELIKEBATSANDCATS    the text
• New j (where matching resumes) is at the end of pattern P, but which one: (S =?= A) or (S =?= I)?
• Use MAX(charJump[T[j]], matchJump[k])

BM Scan Algorithm

int boyerMooreScan(char P[], char T[], int m, int charJump[], int matchJump[])
{
    int match = -1;
    int j = m - 1, k = m - 1;              // start at right end of pattern
    int jump;
    while (k >= 0 && !endOfText(T, j)) {   // pattern unfinished and more text
        if (T[j] == P[k]) {
            j--; k--;                      // continue match right-to-left
        } else {
            jump = matchJump[k];           // take the larger of the two jumps
            if (charJump[(unsigned char)T[j]] > jump)
                jump = charJump[(unsigned char)T[j]];
            j += jump;                     // jump forward & restart matching at right
            k = m - 1;
        }
    }
    if (k < 0)
        match = j + 1;                     // entire pattern matched
    return match;
}
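For experimenting without a matchJump table, a simplified scan that uses only the charJump heuristic is sketched below. This is a swapped-in simplification, not the full algorithm above: in place of matchJump[k] it uses the smallest safe jump, m - k, which slides the pattern one position. The name boyerMooreCharJumpScan, the 256-entry table, and the use of std::string are assumptions of the sketch.

```cpp
#include <algorithm>
#include <array>
#include <string>

// Boyer-Moore with heuristic #1 only. Returns the index of the first
// occurrence of P in T, or -1. Hypothetical name, for illustration only.
int boyerMooreCharJumpScan(const std::string& P, const std::string& T)
{
    const int m = static_cast<int>(P.size());
    const int n = static_cast<int>(T.size());
    if (m == 0)
        return 0;                          // empty pattern matches at position 0

    std::array<int, 256> charJump;
    charJump.fill(m);                      // character not in P: full-length jump
    for (int k = 0; k < m; k++)
        charJump[static_cast<unsigned char>(P[k])] = m - 1 - k;

    int j = m - 1, k = m - 1;              // start at right end of pattern
    while (j < n) {
        if (T[j] == P[k]) {
            if (k == 0)
                return j;                  // entire pattern matched, starting at j
            j--; k--;                      // continue match right-to-left
        } else {
            // m - k slides the pattern by one; charJump may allow more
            j += std::max(charJump[static_cast<unsigned char>(T[j])], m - k);
            k = m - 1;                     // restart at right end of pattern
        }
    }
    return -1;
}
```

On the BATSANDCATS example this finds the match at position 20 while skipping most of the first twenty text characters without examining them.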

BM Algorithm Efficiency • Building charJump[ ] – O(S), where S is the size of the alphabet • Building matchJump[ ] – O(m) • Scanning text – O(n) • In practice, only about one in every 3 or 4 text characters is examined, so BM is quite fast • Overall – O(n)

String Matching Program • Program to demonstrate all three approaches to string matching • Strscan.cpp
