CSE 30331 Lecture 23 – String Matching

CSE 30331 Lecture 23 – String Matching • Simple (Brute-Force) Approach • Knuth-Morris-Pratt Algorithm • Boyer-Moore Algorithm

The Problem • Find the first occurrence of the pattern P in text T. • The number of characters in P is m • The number of characters in T is n

The Simple Approach • For each position j in the text • If T[j .. j+m) matches P[0..m) • stop: pattern found at position j • Advantage: • simple to implement • Disadvantage: • may require the ability to push previously read characters back into the input stream • Worst Case Efficiency: O(m*n) • The pattern is moved forward only one position each time a mismatch is found, no matter how much of the pattern matched prior to the mismatch character
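The simple approach can be sketched as follows. This is a minimal illustration, not the lecture's demo code: the function name bruteForceFind is hypothetical, and it assumes the whole text is available in memory as a std::string.

```cpp
#include <string>

// Hypothetical helper (not from the slides): returns the index of the
// first occurrence of P in T, or -1 if P does not occur.
int bruteForceFind(const std::string& T, const std::string& P)
{
    const int n = static_cast<int>(T.size());
    const int m = static_cast<int>(P.size());
    for (int j = 0; j + m <= n; j++) {       // each candidate alignment
        int k = 0;
        while (k < m && T[j + k] == P[k])    // compare T[j..j+m) with P[0..m)
            k++;
        if (k == m)
            return j;                        // pattern found at position j
    }
    return -1;                               // no alignment matched
}
```

After any mismatch the pattern moves forward by exactly one position, which is where the O(m*n) worst case comes from.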

Knuth-Morris-Pratt (KMP) • Based on a finite-state automaton (FSA) for recognizing the pattern P • The FSA is represented by a KMP flowchart • States are letters in the pattern P • Arcs are SUCCESS or FAIL • On success (T[j] == P[k]) • move forward with the match (j++ and k++) • On failure (T[j] != P[k]) • move backward in the pattern (equivalently, shift the pattern forward over the text) so that P[fail[k]] is aligned with text character T[j], preserving the longest matching prefix

KMP Fail Links: hubbahubba
• Example pattern: hubbahubba
  P:        H  U  B  B  A  H  U  B  B  A
  k:        0  1  2  3  4  5  6  7  8  9
  fail[k]: -1  0  0  0  0  0  1  2  3  4
• Match to text: hubbahubbletelescope...
  hubbahubba               last A != L, fail[9] = 4
       hubbahubba          first A != L, fail[4] = 0
           hubbahubba      H != L, fail[0] = -1
            hubbahubba
  hubbahubbletelescope...
           ^
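The chain of fail links followed in the example above can be reproduced with a small sketch. The function name followFailLinks is hypothetical, and the fail table is passed in by the caller rather than computed here.

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not from the slides): starting at pattern position k,
// follow fail links until P[k] equals text character c or k becomes -1.
// Returns every value k takes along the way.
std::vector<int> followFailLinks(const std::string& P,
                                 const std::vector<int>& fail,
                                 int k, char c)
{
    std::vector<int> visited;
    while (k >= 0 && P[k] != c) {
        k = fail[k];              // fall back to a shorter matching prefix
        visited.push_back(k);
    }
    return visited;
}
```

With P = "hubbahubba", the table from the slide, k = 9 and c = 'l', the visited values are 4, 0, -1, matching the three mismatches listed above.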

KMP – Building Fail Links • Pattern: ABABDD • If P[k] != T[j] then • knew = fail[k] is the pattern position at which to resume: the length of the longest prefix of P that matches the text just before the mismatch character T[j] • Finding fail[k]: • Go to P[k-1] and look up its fail[k-1] (the length of the prefix that matches up to P[k-2]) • If P[fail[k-1]] matches P[k-1], then fail[k] becomes fail[k-1] + 1 • Else follow the next fail link, fail[fail[k-1]], and repeat • Flowchart: Read char -> A B A B D D * (k = 0 1 2 3 4 5)

KMP – Building Fail Links
void kmpSetup(char P[], int m, int fail[])
{
   int k, s;
   fail[0] = -1;               // ch != P[0], read another ch
   for (k=1; k<m; k++)         // for each P[k], left to right
   {
      s = fail[k-1];           // s is previous fail link
      while (s >= 0)           // if not back to start
      {
         if (P[s] == P[k-1])   // duplicate char found
            break;             // so, stop following links
         s = fail[s];          // follow next fail link
      }
      fail[k] = s + 1;         // resume one past the matched prefix
   }
}

KMP – Building Fail Links
• Pattern: A  B  A  B  D  D
• Fail:   -1  0
void kmpSetup(char P[], int m, int fail[])
{
   int k, s;
   fail[0] = -1;               // ch != P[0], read another ch
   for (k=1; k<m; k++) {       // for P[1]:'B'
      s = fail[k-1];           // s is fail[0]:-1
      while (s >= 0) {         // skip loop
         if (P[s] == P[k-1])
            break;
         s = fail[s];
      }
      fail[k] = s + 1;         // set fail[1] = -1 + 1 = 0
   }
}
Flowchart: Read char -> A B A B D D * (k = 0 1 2 3 4 5)

KMP – Building Fail Links
• Pattern: A  B  A  B  D  D
• Fail:   -1  0  0
void kmpSetup(char P[], int m, int fail[])
{
   int k, s;
   fail[0] = -1;               // ch != P[0], read another ch
   for (k=1; k<m; k++) {       // for P[2]:'A'
      s = fail[k-1];           // s is fail[1]:0
      while (s >= 0) {         // loop once
         if (P[s] == P[k-1])   // P[0]:'A' != P[1]:'B'
            break;
         s = fail[s];          // so s is fail[0]:-1
      }
      fail[k] = s + 1;         // fail[2] = -1+1 = 0
   }
}
Flowchart: Read char -> A B A B D D * (k = 0 1 2 3 4 5)

KMP – Building Fail Links
• Pattern: A  B  A  B  D  D
• Fail:   -1  0  0  1
void kmpSetup(char P[], int m, int fail[])
{
   int k, s;
   fail[0] = -1;               // ch != P[0], read another ch
   for (k=1; k<m; k++) {       // for P[3]:'B'
      s = fail[k-1];           // s is fail[2]:0
      while (s >= 0) {         // loop once
         if (P[s] == P[k-1])   // P[0]:'A' == P[2]:'A'
            break;             // so, break
         s = fail[s];
      }
      fail[k] = s + 1;         // fail[3] = 0+1 = 1
   }
}
Flowchart: Read char -> A B A B D D * (k = 0 1 2 3 4 5)

KMP – Building Fail Links
• Pattern: A  B  A  B  D  D
• Fail:   -1  0  0  1  2
void kmpSetup(char P[], int m, int fail[])
{
   int k, s;
   fail[0] = -1;               // ch != P[0], read another ch
   for (k=1; k<m; k++) {       // for P[4]:'D'
      s = fail[k-1];           // s is fail[3]:1
      while (s >= 0) {         // loop once
         if (P[s] == P[k-1])   // P[1]:'B' == P[3]:'B'
            break;             // so, break
         s = fail[s];
      }
      fail[k] = s + 1;         // fail[4] = 1+1 = 2
   }
}
Flowchart: Read char -> A B A B D D * (k = 0 1 2 3 4 5)

KMP – Building Fail Links
• Pattern: A  B  A  B  D  D
• Fail:   -1  0  0  1  2  0
void kmpSetup(char P[], int m, int fail[])
{
   int k, s;
   fail[0] = -1;               // ch != P[0], read another ch
   for (k=1; k<m; k++) {       // for P[5]:'D'
      s = fail[k-1];           // s is fail[4]:2
      while (s >= 0) {         // loop twice
         if (P[s] == P[k-1])   // P[2]:'A' != P[4]:'D', P[0]:'A' != P[4]:'D'
            break;
         s = fail[s];          // s = fail[2]:0, s = fail[0]:-1
      }
      fail[k] = s + 1;         // fail[5] = -1+1 = 0
   }
}
Flowchart: Read char -> A B A B D D * (k = 0 1 2 3 4 5)
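The step-by-step construction traced above can be checked with a compact variant of the setup routine. This is a sketch under assumptions: it uses std::string and std::vector instead of raw arrays, and the name buildFailLinks is hypothetical. The logic is the same as the slide's setup routine.

```cpp
#include <string>
#include <vector>

// Hypothetical wrapper (not from the slides): builds the KMP fail links
// for P and returns them as a vector. fail[0] is -1 by convention.
std::vector<int> buildFailLinks(const std::string& P)
{
    const int m = static_cast<int>(P.size());
    std::vector<int> fail(m, 0);
    if (m == 0)
        return fail;
    fail[0] = -1;
    for (int k = 1; k < m; k++) {
        int s = fail[k - 1];                  // previous fail link
        while (s >= 0 && P[s] != P[k - 1])    // follow links until a match
            s = fail[s];                      // or until we fall off the start
        fail[k] = s + 1;
    }
    return fail;
}
```

For "ABABDD" this yields -1 0 0 1 2 0, and for "hubbahubba" it yields -1 0 0 0 0 0 1 2 3 4, the tables shown on the slides.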

KMP Fail Links: on mismatch, new k = fail[k]
• Example pattern: ABABDD   fail: -1 0 0 1 2 0
• Left: alignment at the mismatch (X is the mismatching text character). Right: alignment after following the fail link.
  ABABDD    .ABABDD      A != X so fail[0] = -1
  X?????    X??????      skip X & k=0 (shifts pattern 1)

  ABABDD    .ABABDD      B != X so fail[1] = 0
  AX????    AX?????      k=0 (shifts pattern 1)

  ABABDD    ..ABABDD     2nd A != X so fail[2] = 0
  ABX???    ABX?????     k=0 (shifts pattern 2)

  ABABDD    ..ABABDD     2nd B != X so fail[3] = 1
  ABAX??    ABAX????     k=1 (shifts pattern 2)

KMP Fail Links: on mismatch, new k = fail[k]
• Example pattern: ABABDD   fail: -1 0 0 1 2 0
  ABABDD    ..ABABDD       D != X so fail[4] = 2
  ABABX?    ABABX???       k=2 (shifts pattern 2)

  ABABDD    .....ABABDD    2nd D != X so fail[5] = 0
  ABABDX    ABABDX?????    k=0 (shifts pattern 5)
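In each case above, the pattern moves forward by k - fail[k] positions over the text. A small sketch makes that arithmetic explicit; the function name patternShifts is hypothetical and the fail table is supplied by the caller.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the slides): for each mismatch position k,
// the distance the pattern slides forward when k is replaced by fail[k].
std::vector<int> patternShifts(const std::vector<int>& fail)
{
    std::vector<int> shift(fail.size());
    for (std::size_t k = 0; k < fail.size(); k++)
        shift[k] = static_cast<int>(k) - fail[k];   // k - fail[k]
    return shift;
}
```

For the ABABDD table -1 0 0 1 2 0 the shifts come out as 1 1 2 2 2 5, the values listed on the two slides.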

KMP Scan Algorithm
// atEndOfText(T,j) is a helper not shown on the slides; it is assumed
// to be true when there is no text character at position j.
int kmpScan (char P[], char T[], int m, int fail[])
{
   int match = -1;            // position of match in text
   int j = 0, k = 0;
   while (k < m && ! atEndOfText(T,j)) {  // more pattern and more text
      if (k == -1) {          // nothing in pattern matched last text char, so
         j++;                 // get next text character
         k = 0;               // start pattern over
      } else if (T[j] == P[k]) {
         j++; k++;            // move forward one character in pattern and text
      } else {
         k = fail[k];         // follow fail link to best restart in pattern
      }
   }
   if (k == m)                // matched entire pattern, even when the
      match = j - m;          // match ends exactly at the end of the text
   return match;
}
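Putting setup and scan together gives a self-contained search. The following is a sketch, not the lecture's demo program: kmpFind is a hypothetical name, the text is assumed to be fully in memory as a std::string, and the fail table is built inline with the same rule as the setup routine.

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not from the slides): index of the first
// occurrence of P in T using KMP, or -1 if there is none.
int kmpFind(const std::string& T, const std::string& P)
{
    const int n = static_cast<int>(T.size());
    const int m = static_cast<int>(P.size());
    if (m == 0)
        return 0;                       // empty pattern matches at 0

    std::vector<int> fail(m, 0);        // build fail links, O(m)
    fail[0] = -1;
    for (int k = 1; k < m; k++) {
        int s = fail[k - 1];
        while (s >= 0 && P[s] != P[k - 1])
            s = fail[s];
        fail[k] = s + 1;
    }

    int j = 0, k = 0;                   // scan the text, O(n)
    while (j < n) {
        if (k == -1 || T[j] == P[k]) {  // restart or successful compare
            j++; k++;
            if (k == m)
                return j - m;           // whole pattern matched
        } else {
            k = fail[k];                // j never moves backward
        }
    }
    return -1;
}
```

Because j never decreases, the text can be consumed as a stream without pushing characters back, unlike the simple approach.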

KMP - Efficiency • Building Fail Links – O(m) • Scanning text – O(n) • Overall – O(m+n), which is O(n) when m <= n

Boyer-Moore (BM) • Heuristic #1 • Match pattern right-to-left • Create a charJump[ch] array with an entry for each character in the alphabet (ASCII code) • If T[j] != P[k] then • If T[j] appears in P[0..k) then • its rightmost occurrence is aligned with T[j] • Else • the pattern P is aligned beginning at T[j+1] • jnew = j + charJump[T[j]] • matching resumes with T[jnew] and P[m-1] • This skips multiple text characters WITHOUT ever examining them

Boyer-Moore Algorithm • Heuristic #2 • matchJump[k] = slide[k] + (m – k) • slide[k] is the amount of slide needed to align the substrings • m – k is the length of the matched suffix (substring) being realigned, with k counted from 1; with the 0-based indices used in the code it is m – (k+1) • Similar to KMP fail links, but calculated right to left • If a suffix has matched in P and T and that same substring appears elsewhere in P, then upon a mismatch the pattern P is “slid” to align the rightmost such matching substring with the suffix in T • Matching resumes at the new end of the pattern determined by matchJump[k]

BM - Example
• Pattern: BATSANDCATS
  BATSANDCATS                        first pattern alignment
    BATSANDCATS                      charJump[T[j]] aligns N's
         BATSANDCATS                 matchJump[k] aligns ATS's
  TWOOLDGNATSCANBELIKEBATSANDCATS    <- the text
• New j (where matching resumes) is at the end of pattern P, but which end: (S =?= A) or (S =?= I)?
• Use MAX(charJump[T[j]], matchJump[k])

Computing individual charJumps
// find cJ[ch] for each character ch in pattern P
void computeJumps (char P[], int m, int alpha, int charJump[])
{
   // assume jump distance is entire pattern length for all
   // characters that do not match a pattern letter.
   for (int ch=0; ch<alpha; ch++)
      charJump[ch] = m;
   // for each pattern letter find the minimum jump to align
   // rightmost occurrence in string, with same current char
   // in the text
   for (int k=0; k<m; k++)
      charJump[(int)P[k]] = m - (k + 1);
}
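A self-contained variant of the routine above can be used to check the charJump values quoted in the examples. This is a sketch under assumptions: the name buildCharJumps is hypothetical, the alphabet is fixed at 256 byte values, and characters are cast to unsigned char before indexing.

```cpp
#include <array>
#include <string>

// Hypothetical helper (not from the slides): charJump table for P over
// a 256-symbol alphabet. Characters absent from P jump the full length m.
std::array<int, 256> buildCharJumps(const std::string& P)
{
    const int m = static_cast<int>(P.size());
    std::array<int, 256> charJump;
    charJump.fill(m);
    // Later (more rightward) occurrences overwrite earlier ones, so each
    // entry ends up as the distance from the rightmost occurrence to the
    // end of the pattern.
    for (int k = 0; k < m; k++)
        charJump[static_cast<unsigned char>(P[k])] = m - (k + 1);
    return charJump;
}
```

For WOWWOW this gives 'W' = 0, 'O' = 1 and 6 for every other character; for BATSANDCATS it gives 'N' = 5.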

Computing substring matchJumps
void computeMatchJumps (char P[], int m, int matchJump[])
{
   int k, s, low, shift, *sufx = new int[m+1];
   // note: sufx[0] tells what suffix matches a prefix of P
   for (k=0; k<m; k++)
      matchJump[k] = m + 1;   // initially, an impossibly large slide
   // Compute sufx links (like KMP fail links, but right-to-left).
   // Detect if substring equals matched suffix and is preceded
   // by mismatch at s; compute its slide.
   sufx[m] = m + 1;
   for (k=m-1; k>=0; k--)  // s indexes sufx array, s-1 indexes P and matchJump
   {
      s = sufx[k+1];
      while (s <= m)
      {
         if (P[k] == P[s-1])  // P indices 0..m-1, sufx indices 0,1..m
            break;
         if (s-(k+1) < matchJump[s-1])  // Mismatch between P[k] and P[s-1]
            matchJump[s-1] = s-(k+1);
         s = sufx[s];
      }
      sufx[k] = s - 1;
   }
   // (function continues on the next slide)

Computing substring matchJumps (continued)
   // if no suffix match at k+1, compute slide based on prefix that
   // matches suffix. Prefix length = (m - shift).
   low = 1;
   shift = sufx[0];
   while (shift <= m)
   {
      for (k=low; k<=shift; k++)
      {
         if (shift < matchJump[k-1])
            matchJump[k-1] = shift;
      }
      low = shift + 1;
      shift = sufx[shift];
   }
   // Add number of matched characters to slide amount
   for (k=0; k<m; k++)
      matchJump[k] += (m-(k+1));
   delete [] sufx;   // release the temporary sufx array
}
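The two halves of the routine can be combined into one self-contained function for experimentation. This is a sketch that follows the slide code step by step; the name buildMatchJumps is hypothetical, and std::vector replaces the raw arrays so no manual cleanup is needed.

```cpp
#include <string>
#include <vector>

// Hypothetical wrapper (not from the slides): matchJump table for P,
// using the same sufx-link computation as the slide code.
std::vector<int> buildMatchJumps(const std::string& P)
{
    const int m = static_cast<int>(P.size());
    std::vector<int> matchJump(m, m + 1);   // impossibly large slide
    std::vector<int> sufx(m + 1);
    sufx[m] = m + 1;
    for (int k = m - 1; k >= 0; k--) {      // right-to-left suffix links
        int s = sufx[k + 1];
        while (s <= m) {
            if (P[k] == P[s - 1])
                break;
            if (s - (k + 1) < matchJump[s - 1])
                matchJump[s - 1] = s - (k + 1);   // slide after mismatch at s-1
            s = sufx[s];
        }
        sufx[k] = s - 1;
    }
    int low = 1, shift = sufx[0];           // prefixes that match a suffix
    while (shift <= m) {
        for (int k = low; k <= shift; k++)
            if (shift < matchJump[k - 1])
                matchJump[k - 1] = shift;
        low = shift + 1;
        shift = sufx[shift];
    }
    for (int k = 0; k < m; k++)             // add matched-suffix length
        matchJump[k] += m - (k + 1);
    return matchJump;
}
```

For the pattern WOWWOW this produces 8 7 6 7 3 1.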

BM Scan Algorithm
// endOfText(T,j) is a helper not shown on the slides; it is assumed
// to be true for every j at or beyond the end of the text.
int boyerMooreScan (char P[], char T[], int m, int charJump[], int matchJump[])
{
   int match = -1, j = m-1, k = m-1, jump;
   while (! endOfText(T,j)) {
      if (T[j] == P[k]) {
         if (k == 0) {
            match = j;       // entire pattern matches, so stop
            break;
         }
         j--; k--;           // continue match right-to-left
      } else {
         jump = matchJump[k];
         if (charJump[(int)T[j]] > jump)
            jump = charJump[(int)T[j]];
         j += jump;          // jump forward & restart matching at right
         k = m-1;
      }
   }
   return match;
}

BM - Example
• Pattern: WOWWOW
• mJump: 876731   cJump: 'W'=0, 'O'=1, others=6
  WOWTHISISWOWXOWWOWWOW    the TEXT (21 chars)
       1  1111111111121    # of comparisons (15)
  WOWWOW                   W != I, cJ[I]=6, mJ[5]=1
        WOWWOW             W != S, cJ[S]=6, mJ[2]=6
           WOWWOW          W != X, cJ[X]=6, mJ[3]=7
                WOWWOW     W != O, cJ[O]=1, mJ[5]=1
                 WOWWOW    match
• Note: cJump['W']=0 means simply that if the TEXT character is 'W', the pattern realignment placing the rightmost pattern 'W' over the text 'W' is achieved by not moving the pattern
• Note: the algorithm will NOT work using only cJump
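The comparison count in the example above can be reproduced with an instrumented scan. This is a sketch under assumptions: the names ScanResult and bmScanCounting are hypothetical, the text is a std::string of known length, and both jump tables are supplied by the caller (here they would be the values quoted on the slide).

```cpp
#include <algorithm>
#include <array>
#include <string>
#include <vector>

// Hypothetical result type (not from the slides).
struct ScanResult {
    int match;         // index of first match, or -1
    int comparisons;   // number of character comparisons performed
};

// Hypothetical helper: Boyer-Moore scan that also counts comparisons.
ScanResult bmScanCounting(const std::string& T, const std::string& P,
                          const std::array<int, 256>& charJump,
                          const std::vector<int>& matchJump)
{
    const int n = static_cast<int>(T.size());
    const int m = static_cast<int>(P.size());
    ScanResult r{-1, 0};
    if (m == 0) { r.match = 0; return r; }
    int j = m - 1, k = m - 1;
    while (j < n) {
        r.comparisons++;
        if (T[j] == P[k]) {
            if (k == 0) { r.match = j; break; }   // whole pattern matched
            j--; k--;                             // continue right-to-left
        } else {
            int cj = charJump[static_cast<unsigned char>(T[j])];
            j += std::max(cj, matchJump[k]);      // take the larger jump
            k = m - 1;                            // restart at pattern end
        }
    }
    return r;
}
```

Run on the text WOWTHISISWOWXOWWOWWOW with the slide's tables, this reports a match at index 15 after 15 comparisons. Taking the maximum matters because cJump alone can be 0, which would leave the pattern in place.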

BM Algorithm Efficiency • Building charJump[ ] – O(S + m), where S is the alphabet size • Building matchJump[ ] – O(m) • Scanning text – O(n) • In practice, only about every 3rd or 4th text character is examined, so BM is quite fast • Overall – O(n)

String Matching Program • Program to demonstrate all three approaches to string matching • demos\strScan.cpp
