280 likes | 759 Views
2. Outline. Problem descriptionBrute Force algorithmKnuth-Morris-Pratt algorithmBoyer-Moore algorithm. 3. Problem description. What is pattern matching?Pattern: quchjfngText: kadogqlxwwenkpjkmolpevnmqcsjakoheorplkzbzfjliar Iacwzjlvkxromavcdquchjfnganqxharnhkweweg tkrmwm fjduwhqvhwpfg pymyaj
E N D
1. 1 Pattern Matching
2. 2 Outline Problem description
Brute Force algorithm
Knuth-Morris-Pratt algorithm
Boyer-Moore algorithm
3. 3 Problem description What is pattern matching?
Pattern: quchjfng
Text:
kadogqlxwwenkpjkmolpevnmqcsjakoheorplkzb
zfjliar Iacwzjlvkxromavcdquchjfnganqxharnhkw
eweg tkrmwm fjduwhqvhwpfg pymyajsaywf aas
bwxwcmrncmslgeqztyoygdbnxjqoctslbrvngoyxe
4. 4 Problem description Given two strings P (pattern) and T (text), find the first occurrence of P in T.
Input: two strings P & T
Output: The starting index of the occurrence of P in T or Null .
Applications:
Text editors
Search engines
Biological research
5. 5 Strings [1/2] A string is a sequence of data elements, such as numbers or characters.
An alphabet S is a set of basic symbols for a category of strings
Examples of alphabet:
Character: {A, BZ, a, b z}
Number: {0, 1, 29}
DNA: {A, C, G, T}
6. 6 Strings [2/2] Let P be a string of size m
A substring P[i .. j] is a string containing characters of string P with ranks between i and j. P = P[0 .. m-1].
A prefix of P is a substring of the type P[0 .. i]
A suffix of P is a substring of the type P[i ..m - 1]
7. 7 Brute Force Algorithm [1/2]
8. 8 Brute Force Algorithm [2/2]
9. 9 Shift P by more than one position when mismatch occurs
Never miss a occurrence of P in T
Preprocessing approach
Preprocessing P
Knuth-Morris-Pratt Algorithm
Boyer-Moore Algorithm
Preprocessing T
Suffix trees
10. 10 The Knuth-Morris-Pratt algorithm Best known
For linear time for exact matching
Preprocessing pattern P
Compares from Left to Right
11. 11 Shift more than one position
Reduce comparison The KMP Ideas
12. 12
13. 13 KMP Failure Function Knuth-Morris-Pratts algorithm preprocesses the pattern to find matches of prefixes of the pattern with the pattern itself
The failure function F(j) is defined as the size of the largest prefix of P[0..j] that is also a suffix of P[1..j]
14. 14 The KMP Algorithm
15. 15 Example
16. 16 The KMP Algorithm The failure function can be represented by an array and can be computed in O(m) time
At each iteration of the while-loop, either
i increases by one, or
the shift amount i - j increases by at least one (observe that F(j - 1) < j)
Hence, there are no more than 2n iterations of the while-loop
Thus, KMPs algorithm runs in optimal time O(m + n)
17. 17 Computing the Failure Function The failure function can be represented by an array and can be computed in O(m) time
The construction is similar to the KMP algorithm itself
At each iteration of the while-loop, either
i increases by one, or
the shift amount i - j increases by at least one (observe that F(j - 1) < j)
Hence, there are no more than 2m iterations of the while-loop
18. 18 Boyer-Moores Algorithm The Boyer-Moores pattern matching algorithm is based on two principles
Right to Left Scan
Bad character shift
Reference :
http://www.cs.utexas.edu/users/moore/best-ideas/string-searching
19. 19 Right to left Scan Rule
20. 20 Bad Character Shift Rule [1/3] Shift more than one position at a time.
When a mismatch occurs at T[i] = c
If P contains c : shift P to align the last occurrence of c in P with T[i]
21. 21 Bad Character Shift Rule [2/3] When a mismatch occurs at T[i] = c
If P contains c : shift P to align the last occurrence of c in P with T[i]
Else if P doesnt contains c : shift P to align P[0] with T[i + 1]
22. 22 Bad Character Shift Rule [3/3] When a mismatch occurs at T[i] = c
If P contains c : shift P to align the last occurrence of c in P with T[i]
Else if P doesnt contains c : shift P to align P[0] with T[i + 1]
What if we have already crossed the last occurrence?
23. 23 Example of BM Algorithm
24. 24 Last-Occurrence Function Preprocess the pattern P and the alphabet S to build the last-occurrence function L mapping S to integers, where L(c) is defined as L(c) =
Example:
S = {a, b, c, d}
P = abacab
The last-occurrence function can be computed in time O(m + s), where m is the size of P and s is the size of S
25. 25 The Boyer-Moore Algorithm
26. 26 Time complexity Analysis Boyer-Moores algorithm is significantly faster than the Brute-Force Algorithm on English text
Pattern size = m, Text size = n ,Number of alphabet = s ?Time complexity : O(nm+s)
Example of worst case:T = aaaaaaaa..aP = caaaaa
27. 27 Extended Boyer-Moore algorithm Extended Bad character shift rule
then shift P to the right so that the closest c to the left of position i in P is below the mismatched c in T.
The original Boyer-Moore algorithm use the simpler bad character rule
28. 28 Extended Boyer-Moore algorithm Extended Bad character shift rule