Learn how to use the Deterministic Finite Automaton (DFA) algorithm for efficient pattern matching in text strings. Understand the concept of states and transitions in building a DFA and simulate its operation on different texts. Discover how DFA minimizes backtracking and accelerates pattern recognition strategies. Explore step-by-step examples to grasp the practical implementation of the DFA algorithm for pattern searching.
KMP algorithm
public class KMP {
    private final int R;         // alphabet size (number of possible characters)
    private final String pat;    // the pattern
    private final int[][] dfa;   // dfa[c][j] = next state when character c is read in state j

    public KMP(String pat) {
        this.R = 256;
        this.pat = pat;
        // build DFA from pattern
        int m = pat.length();
        dfa = new int[R][m];
        dfa[pat.charAt(0)][0] = 1;
        for (int x = 0, j = 1; j < m; j++) {
            for (int c = 0; c < R; c++)
                dfa[c][j] = dfa[c][x];        // Copy mismatch cases.
            dfa[pat.charAt(j)][j] = j + 1;    // Set match case.
            x = dfa[pat.charAt(j)][x];        // Update restart state.
        }
    }

    public int search(String txt) {
        // simulate operation of DFA on text
        int m = pat.length();
        int n = txt.length();
        int i, j;
        for (i = 0, j = 0; i < n && j < m; i++) {
            j = dfa[txt.charAt(i)][j];
        }
        if (j == m) return i - m;    // found
        return n;                    // not found
    }
}
General idea • Avoid backing up in the text string on a mismatch • For example • Text: 00000000000000000000000000000000000001 • Pattern: 000000001 • When we find a mismatch, how can we keep moving forward in the text? • Is there a cleverer way than brute force? • How should we analyze the pattern?
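For comparison, the brute-force approach mentioned above can be sketched as follows. This is only an illustrative baseline, not code from the slides: the class name `BruteForceDemo` and the `comparisons` counter are assumptions added to make the backing-up visible on a shorter text of the same shape.

```java
class BruteForceDemo {
    // Character comparisons made by the most recent call to search (illustrative counter).
    static long comparisons;

    // Returns the index of the first occurrence of pat in txt, or txt.length() if absent.
    static int search(String pat, String txt) {
        int m = pat.length(), n = txt.length();
        comparisons = 0;
        for (int i = 0; i + m <= n; i++) {
            int j = 0;
            while (j < m) {
                comparisons++;
                if (txt.charAt(i + j) != pat.charAt(j)) break;
                j++;
            }
            if (j == m) return i;   // all m pattern characters matched
            // On a mismatch the next attempt starts at i + 1,
            // so up to j text characters are examined again (the "backup").
        }
        return n;                   // not found
    }

    public static void main(String[] args) {
        String txt = "0000000000001";   // shortened version of the slide's text
        String pat = "00001";
        System.out.println("match at index " + search(pat, txt));
        System.out.println("comparisons: " + comparisons + ", text length: " + txt.length());
    }
}
```

On inputs like this the number of comparisons grows roughly with n times m, whereas the DFA-based search reads each text character once.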
How ? Build a DFA • DFA: deterministic finite-state automaton • DFA = States + Transitions • States • For a pattern with m characters, there are (m + 1) states in the DFA, numbered 0 to m • `At state j` means the first j characters of the pattern are matched (state 0: nothing matched yet) • The last state, m, indicates ACCEPT (AC), i.e. all characters in the pattern are matched • We do not allocate a table entry for this state, since the search stops as soon as it is reached
How ? Build a DFA • DFA: deterministic finite-state automaton • DFA = States + Transitions • Transitions • At each state, there are R possible transitions, where R is the number of possible characters (the alphabet size) • Formalize transitions as dfa[next_char][current_state] = next_state
How ? Build a DFA • Explanation: dfa[next_char][current_state] = next_state • Suppose we are now at current_state • If the next character is next_char, then we move to next_state • Therefore, dfa[R][m] is a 2-dimensional table that exhaustively enumerates all possible cases • It has only m columns: we do not allocate an entry for the accept state
How ? Build a DFA • Explanation: dfa[next_char][current_state] = next_state • Pattern: ABABAC (assume R=3 and the only characters are A,B,C) • 2D array representation • Directed graph representation
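The 2D array for ABABAC can be reproduced with a small sketch. It uses the same construction as the KMP constructor in this deck, but indexes characters through an assumed three-letter alphabet string so that R = 3; the class name `DfaTableDemo` and the `ALPHABET` constant are illustrative choices, not part of the original code.

```java
class DfaTableDemo {
    // Assumed alphabet; a character's position in this string is its row index, so R = 3.
    static final String ALPHABET = "ABC";

    // Builds dfa[c][j]: next state when the character with index c is read in state j.
    // Assumes every pattern character occurs in ALPHABET.
    static int[][] build(String pat) {
        int R = ALPHABET.length(), m = pat.length();
        int[][] dfa = new int[R][m];
        dfa[ALPHABET.indexOf(pat.charAt(0))][0] = 1;
        for (int x = 0, j = 1; j < m; j++) {
            for (int c = 0; c < R; c++) dfa[c][j] = dfa[c][x];   // copy mismatch cases
            int pc = ALPHABET.indexOf(pat.charAt(j));
            dfa[pc][j] = j + 1;                                  // set match case
            x = dfa[pc][x];                                      // update restart state
        }
        return dfa;
    }

    public static void main(String[] args) {
        int[][] dfa = build("ABABAC");
        // One row per character, one column per state 0..m-1.
        for (int c = 0; c < dfa.length; c++) {
            StringBuilder row = new StringBuilder();
            row.append(ALPHABET.charAt(c)).append(" :");
            for (int j = 0; j < dfa[c].length; j++) row.append(' ').append(dfa[c][j]);
            System.out.println(row);
        }
    }
}
```

Each printed row is one character and each column one state; the entry j + 1 in column j is the match transition, and every other entry is a mismatch transition.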
How to use DFA ? • Example • Text: ABCABABABACA • Pattern: ABABAC
How to use DFA ? • Start with state 0 • ABCABABABACA • Read `A` (i = 0) • Goto state 1
public int search(String txt) {
    // simulate operation of DFA on text
    int m = pat.length();
    int n = txt.length();
    int i, j;
    for (i = 0, j = 0; i < n && j < m; i++) {
        j = dfa[txt.charAt(i)][j];
    }
    if (j == m) return i - m;    // found
    return n;                    // not found
}
How to use DFA ? • Current state 1 • ABCABABABACA • Read `B` (i = 1) • Goto state 2
How to use DFA ? • Current state 2 • ABCABABABACA • Read `C` (i = 2), a mismatch • Goto state 0
How to use DFA ? • Current state 0 • ABCABABABACA • Read `A` (i = 3) • Goto state 1
How to use DFA ? • Current state 1 • ABCABABABACA • Read `B` (i = 4) • Goto state 2
How to use DFA ? • Current state 2 • ABCABABABACA • Read `A` (i = 5) • Goto state 3
How to use DFA ? • Current state 3 • ABCABABABACA • Read `B` (i = 6) • Goto state 4
How to use DFA ? • Current state 4 • ABCABABABACA • Read `A` (i = 7) • Goto state 5
How to use DFA ? • Current state 5 • ABCABABABACA • Read `B` (i = 8), a mismatch, but the text pointer does not back up • Goto state 4
How to use DFA ? • Current state 4 • ABCABABABACA • Read `A` (i = 9) • Goto state 5
How to use DFA ? • Current state 5 • ABCABABABACA • Read `C` (i = 10) • Goto state 6
How to use DFA ? • Current state 6 (ACCEPT) • ABCABABABACA • j == m, we are now at the (6+1)th state • search returns i - m = 11 - 6 = 5, the index where the match starts
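The walkthrough above can be replayed in code. The sketch below records the state after each text character; the class name `DfaTraceDemo` and the `trace` helper are illustrative additions, and it assumes every character has a code below 256, as the deck's R = 256 does.

```java
import java.util.ArrayList;
import java.util.List;

class DfaTraceDemo {
    // Same DFA construction as the KMP constructor, with R = 256.
    static int[][] build(String pat) {
        int R = 256, m = pat.length();
        int[][] dfa = new int[R][m];
        dfa[pat.charAt(0)][0] = 1;
        for (int x = 0, j = 1; j < m; j++) {
            for (int c = 0; c < R; c++) dfa[c][j] = dfa[c][x];   // copy mismatch cases
            dfa[pat.charAt(j)][j] = j + 1;                       // set match case
            x = dfa[pat.charAt(j)][x];                           // update restart state
        }
        return dfa;
    }

    // Returns the state reached after each text character, stopping at the accept state.
    static List<Integer> trace(String pat, String txt) {
        int[][] dfa = build(pat);
        List<Integer> states = new ArrayList<>();
        int m = pat.length();
        for (int i = 0, j = 0; i < txt.length() && j < m; i++) {
            j = dfa[txt.charAt(i)][j];
            states.add(j);
        }
        return states;
    }

    public static void main(String[] args) {
        String pat = "ABABAC", txt = "ABCABABABACA";
        List<Integer> states = trace(pat, txt);
        for (int i = 0; i < states.size(); i++) {
            System.out.println("i=" + i + " read '" + txt.charAt(i) + "' -> state " + states.get(i));
        }
    }
}
```

The index i only ever increases, which is the "no backing up in the text" property the general idea asks for.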
How to build DFA ? • If we can match the next character • If we see the expected character, go to the next state • Pattern: ABABAC (assume R=3 and the only characters are A,B,C) • We only need dfa[R][m] since there is no transition information for the last state • Match transitions: 0 -A-> 1 -B-> 2 -A-> 3 -B-> 4 -A-> 5 -C-> 6
How to build DFA ? • If we can match the next character • The match case is the line dfa[pat.charAt(j)][j] = j+1
public KMP(String pat) {
    this.R = 256;
    this.pat = pat;
    // build DFA from pattern
    int m = pat.length();
    dfa = new int[R][m];
    dfa[pat.charAt(0)][0] = 1;
    for (int x = 0, j = 1; j < m; j++) {
        for (int c = 0; c < R; c++)
            dfa[c][j] = dfa[c][x];        // Copy mismatch cases.
        dfa[pat.charAt(j)][j] = j + 1;    // Set match case.
        x = dfa[pat.charAt(j)][x];        // Update restart state.
    }
}
How to build DFA ? • If we fail to match the next character • Copy data from column x • Mimic the transitions of state x • Similar to `I am now in state x` or `restart from state x` • x is a restart state • Update restart state x • x state • Restart state: if we fail to match the j-th character, we restart from state x • How to restart? Since we copied the entries from column x for the failed cases, it is equivalent to restarting from x. • At the very beginning, the x state is one state behind our DFA building process (x = 0 when j = 1). • The x state is updated using the partially built DFA! It records how much of the beginning of the pattern is still matched after a mismatch.
How to build DFA ? j=0 • An example (ABABAC) • j (current state): 0 • dfa[`A`][0] ← 1; all other entries in column 0 stay 0
How to build DFA ? j=1 • An example (ABABAC) • j (current state): 1 • x (restart state): 0 • Process • Copy dfa[][0] to dfa[][1] • dfa[`B`][1] ← 2 • x ← dfa[`B`][0] = 0
How to build DFA ? j=1 • An example (ABABAC) • Understand restart state x • You are actually at state 1, but if the next character is `A` or `C`, act as if you were at state 0. Recall the meaning of states in the DFA: state 0 means you have matched nothing.
How to build DFA ? j=2 • An example (ABABAC) • j (current state): 2 • x (restart state): 0 • Process • Copy dfa[][0] to dfa[][2] • dfa[`A`][2] ← 3 • x ← dfa[`A`][0] = 1
How to build DFA ? j=2 • An example (ABABAC) • j (current state): 2 • x (restart state): 0 • Understand restart state x • At state 2, you have matched `AB`, but if the next character is `B` or `C`, you have to start from the very beginning (state 0).
How to build DFA ? j=2 • An example (ABABAC) • j (current state): 2 • x (restart state): 0 • Understand restart state x • x ← dfa[`A`][0] = 1, why? At current state 2, the expected char is `A`, which means that if we fail to match at the next state 3, we do not need to start from the very beginning, since we have at least an `A` matched (x=1).
How to build DFA ? j=3 • An example (ABABAC) • j (current state): 3 • x (restart state): 1 • Process • Copy dfa[][1] to dfa[][3] • dfa[`B`][3] ← 4 • x ← dfa[`B`][1] = 2
How to build DFA ? j=3 • An example (ABABAC) • j (current state): 3 • x (restart state): 1 • Understand restart state x • We restart from state 1 if we fail to match the expected `B`. The reason is that we know we have at least an `A` already matched (x=1). Restart state x was set in the previous step.
How to build DFA ? j=3 • An example (ABABAC) • j (current state): 3 • x (restart state): 1 • Understand restart state x • x ← dfa[`B`][1] = 2, why? At current state 3, the expected char is `B`, and restart state 1 tells us `A` is already matched in the pattern. Thus, if we fail at the next state 4, `AB` is already matched, i.e. we can update restart state x to 2.
How to build DFA ? j=4 • An example (ABABAC) • j (current state): 4 • x (restart state): 2 • Process • Copy dfa[][2] to dfa[][4] • dfa[`A`][4] ← 5 • x ← dfa[`A`][2] = 3
How to build DFA ? j=4 • An example (ABABAC) • j (current state): 4 • x (restart state): 2 • Understand restart state x • Explanation: at state 4, you have already matched `ABAB`. If you fail to match the next `A`, you assume you have still matched `AB`, since the restart state is 2. This assumption is implemented by copying column 2 for the failed cases (`B` and `C`).
How to build DFA ? j=5 • An example (ABABAC) • j (current state): 5 • x (restart state): 3 • Process • Copy dfa[][3] to dfa[][5] • dfa[`C`][5] ← 6 • x ← dfa[`C`][3] = 0
How to build DFA ? j=5 • An example (ABABAC) • j (current state): 5 • x (restart state): 3 • Understand restart state x • Explanation: we have already matched `ABABA`. If we fail to match the expected `C`, we assume we have matched `ABA`, since the restart state is 3.
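The step-by-step construction above can be checked with a short sketch that records which restart state x is used when column j is filled. The class name `DfaBuildLogDemo` and the `restartStates` helper are illustrative additions; the loop body is the same as in the KMP constructor.

```java
import java.util.Arrays;

class DfaBuildLogDemo {
    // Returns used[j] = restart state x at the moment column j is built (used[0] is unused and stays 0).
    static int[] restartStates(String pat) {
        int R = 256, m = pat.length();
        int[][] dfa = new int[R][m];
        int[] used = new int[m];
        dfa[pat.charAt(0)][0] = 1;
        for (int x = 0, j = 1; j < m; j++) {
            used[j] = x;                                         // x is read before it is updated
            for (int c = 0; c < R; c++) dfa[c][j] = dfa[c][x];   // copy mismatch cases from column x
            dfa[pat.charAt(j)][j] = j + 1;                       // set match case
            x = dfa[pat.charAt(j)][x];                           // update restart state for the next column
        }
        return used;
    }

    public static void main(String[] args) {
        int[] used = restartStates("ABABAC");
        for (int j = 1; j < used.length; j++) {
            System.out.println("j=" + j + " copies column x=" + used[j]);
        }
        System.out.println(Arrays.toString(used));
    }
}
```

For ABABAC the recorded values agree with the slides: x is 0, 0, 1, 2, 3 for j = 1 through 5.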
Understand state x • Updating x while we build the DFA is similar to the state transition while we match the pattern in the text • Build step: x = dfa[pat.charAt(j)][x]; // Update restart state. • Search step: j = dfa[txt.charAt(i)][j];
Understand state x • The transition of state x: match the pattern against itself using the partially constructed DFA table • To build the next state of the DFA, we need to know the restart state x • An example (ABABAC) • x ← 0 • x ← dfa[`B`][0] = 0 • x ← dfa[`A`][0] = 1 • x ← dfa[`B`][1] = 2 • x ← dfa[`A`][2] = 3 • x ← dfa[`C`][3] = 0
Conclusion • Updating x while we build the DFA is similar to the state transition while we match the pattern in the text • Building the DFA amounts to matching the pattern against itself, starting from its second character. • By analyzing the pattern in advance, we know how to keep moving forward in the text when a character fails to match.
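The claim that building the DFA is like matching the pattern against itself can be tested directly. The sketch below is an illustrative check, not part of the original deck: for every j it compares the restart state used during construction with the state the finished DFA reaches after reading pattern characters 1 through j - 1. The class and method names are assumptions.

```java
class SelfMatchDemo {
    // True if, for every j, the restart state used for column j equals the state
    // reached by running the DFA from state 0 on pat.charAt(1) .. pat.charAt(j - 1).
    static boolean restartMatchesSelfSimulation(String pat) {
        int R = 256, m = pat.length();
        int[][] dfa = new int[R][m];
        int[] restart = new int[m];
        dfa[pat.charAt(0)][0] = 1;
        for (int x = 0, j = 1; j < m; j++) {
            restart[j] = x;
            for (int c = 0; c < R; c++) dfa[c][j] = dfa[c][x];   // copy mismatch cases
            dfa[pat.charAt(j)][j] = j + 1;                       // set match case
            x = dfa[pat.charAt(j)][x];                           // update restart state
        }
        for (int j = 1; j < m; j++) {
            int state = 0;
            // Feed the pattern, minus its first character, to the DFA itself.
            for (int i = 1; i < j; i++) state = dfa[pat.charAt(i)][state];
            if (state != restart[j]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        for (String pat : new String[] {"ABABAC", "AAAAAB", "ABCDABD"}) {
            System.out.println(pat + ": " + restartMatchesSelfSimulation(pat));
        }
    }
}
```

Skipping the first pattern character is what makes x lag behind j: x tracks the longest start of the pattern that is still matched after shifting the pattern by at least one position.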