Data Structures and Algorithms Analysis. String Matching Dr. Ken Cosh. Review. Memory Management Memory Allocation Garbage Collection. This Week. String Matching String matching is a common task for many computer users; Internet Searches String manipulation in word processing
Data Structures and Algorithms Analysis String Matching Dr. Ken Cosh
Review • Memory Management • Memory Allocation • Garbage Collection
This Week • String Matching • String matching is a common task for many computer users; • Internet Searches • String manipulation in word processing • Advanced DNA sequence matching • Therefore effective pattern matching algorithms are essential.
Brute Force • Our first simple string matching algorithm is brute force. • We check the first character: if it matches, we check the second character, and so on; if it does not match, we step the pattern forward one character and start again. • Any information gained during a failed comparison, which could have been used in subsequent comparisons, is lost.
Brute Force
bruteForceStringMatching(pattern P, text T)
  i = 0;
  while i ≤ |T| - |P|
    j = 0;
    while j < |P| && Ti == Pj
      i++; j++;
    if j == |P| return match at i - |P|;
    i = i - j + 1;
  return no match;
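The pseudocode above can be sketched in Python as follows. This is a minimal illustration rather than the lecture's own code: the function name `brute_force_search` and the convention of returning -1 for "no match" are assumptions. Unlike the pseudocode, it keeps i fixed at the start of the current alignment, so no `i = i - j + 1` reset is needed.

```python
def brute_force_search(text: str, pattern: str) -> int:
    """Return the index of the first occurrence of pattern in text, or -1."""
    n, m = len(text), len(pattern)
    # Try every alignment of the pattern against the text, left to right.
    for i in range(n - m + 1):
        j = 0
        # Check the bound on j first so pattern[j] is never read out of range.
        while j < m and text[i + j] == pattern[j]:
            j += 1
        if j == m:
            return i  # all m characters matched at alignment i
    return -1  # no alignment matched


if __name__ == "__main__":
    print(brute_force_search("ababcdababababababad", "babab"))
```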
Brute Force • T = ababcdababababababad, P = babab
ababcdababababababad
babab (try 1)
 babab
  babab
   babab
    babab
     babab
      babab
       babab (try 8, match)
In this case the match is found on the 8th try.
Brute Force Complexity • The best case for the algorithm is that the string is matched straight away (consider searching this sentence for “The”). Here |P| comparisons are required – O(|P|). • The worst case is when the string isn’t found and, for each starting position in T, we are required to make |P| comparisons – here the worst case is O(|T||P|). • The average case depends on the size and frequencies of the character set.
Brute Force Complexity • Notice the nested while loops in the Brute Force algorithm; while i ≤ |T| - |P| while j < |P| && Ti == Pj • Shortly we’ll investigate how we can reduce the number of iterations of each loop. • For the worst case to occur we could search for a string such as aaaaaaaaaaaab within a string aaaaaaaaaaaaaaaaaaaaaaaaaaa etc.
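To make the best and worst cases concrete, the sketch below counts character comparisons made by a brute force search. The function name `count_comparisons` and the exact string lengths used are illustrative assumptions, chosen in the spirit of the "aaa...ab" example above.

```python
def count_comparisons(text: str, pattern: str) -> tuple:
    """Brute force search returning (match index or -1, comparisons made)."""
    n, m = len(text), len(pattern)
    comparisons = 0
    for i in range(n - m + 1):
        j = 0
        while j < m:
            comparisons += 1  # one text/pattern character comparison
            if text[i + j] != pattern[j]:
                break
            j += 1
        if j == m:
            return i, comparisons
    return -1, comparisons


if __name__ == "__main__":
    # Best case: the pattern matches at the very first alignment.
    print(count_comparisons("The best case", "The"))
    # Worst case: every alignment matches all but the last pattern character.
    print(count_comparisons("a" * 27, "a" * 12 + "b"))
```

In the second call there are 27 - 13 + 1 = 15 alignments and each one costs 13 comparisons, which is where the O(|T||P|) bound comes from.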
Improving Brute Force • A key problem with brute force is that each time we abort the comparison we have to start from the beginning of the pattern again. • We could reduce the number of comparisons by skipping alignments that cannot possibly match. • Hancart’s algorithm allows the search to step forward 2 characters if a match won’t be found.
Hancart • Hancart’s algorithm refines brute force in a couple of ways. • First, the first two characters of the pattern are compared with each other. • Either they are the same, or they are different. • Second, at each alignment comparisons begin with the 2nd character of the pattern, checked against the corresponding 2nd character of the text window.
Hancart • Hancart’s revision works by allowing us to skip forward 2 characters in situations where there can’t be a match. • Notice that the situations where 2 steps forward are allowed depend on whether the first 2 characters of the pattern are the same. • We can refine the search further by extending this observation – that the number of steps forward allowed depends on the contents of the pattern. • The Knuth Morris Pratt algorithm observes that the pattern contains enough information to determine where the next match could begin.
Hancart • Hancart’s algorithm reduces the number of iterations through the outer loop by sometimes allowing the increment to be i = i - j + 2;
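The two-character skip can be sketched as below. This follows the variant usually published as the "Not So Naive" algorithm, which is assumed here to be what these slides call Hancart's algorithm. The function name `not_so_naive_search` and the fallback to `str.find` for patterns shorter than two characters are illustrative choices, not part of the lecture.

```python
def not_so_naive_search(text: str, pattern: str) -> int:
    """Return the index of the first occurrence of pattern in text, or -1."""
    n, m = len(text), len(pattern)
    if m < 2:
        # The two-character trick needs at least two pattern characters.
        return text.find(pattern)

    # Preprocessing: compare the first two pattern characters once.
    if pattern[0] == pattern[1]:
        # If text[i+1] != pattern[1], alignment i+1 also fails, so skip 2.
        shift_if_second_differs, shift_if_second_matches = 2, 1
    else:
        # If text[i+1] == pattern[1], it cannot equal pattern[0], so skip 2.
        shift_if_second_differs, shift_if_second_matches = 1, 2

    i = 0
    while i <= n - m:
        # Comparisons start with the 2nd character of the pattern.
        if text[i + 1] != pattern[1]:
            i += shift_if_second_differs
        else:
            # Then the remaining characters, and finally the first one.
            if text[i + 2:i + m] == pattern[2:] and text[i] == pattern[0]:
                return i
            i += shift_if_second_matches
    return -1


if __name__ == "__main__":
    print(not_so_naive_search("ababcdababababababad", "babab"))
```

The worst case remains O(|T||P|); the gain is only that some alignments are skipped outright.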
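Knuth Morris Pratt • The Knuth Morris Pratt algorithm begins by finding, for each prefix of the pattern, the length of the longest proper suffix of that prefix which is also a prefix of it. • Substring: A,B,C,D,A,B,D • Longest Suffix: 0,0,0,0,1,2,0 • i.e. when the 2nd A comes it is both a suffix and a prefix for the substring. The following B forms ‘AB’, a 2 character prefix and suffix. • Now for each iteration of the outer loop, if a mismatch occurs after j characters have matched, i can be increased by j-x, where x is the longest suffix value for the last matched character (if j = 0, i is simply increased by 1). • e.g. if a mismatch is found when comparing the second A, then j=4 (A,B,C,D have matched) and x=0, so i can be increased by 4. If instead the mismatch is found at the final D, then j=6 and x=2, so i is again increased by 4.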
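The table and the search can be sketched in Python as follows. This is a minimal sketch of the standard formulation, in which the text position never moves backwards and j falls back through the table, which is equivalent to shifting the alignment by j-x. The names `build_prefix_table` and `kmp_search` are illustrative assumptions.

```python
def build_prefix_table(pattern: str) -> list:
    """table[q] = length of the longest proper suffix of pattern[:q+1]
    that is also a prefix of it."""
    table = [0] * len(pattern)
    k = 0  # length of the currently matched prefix
    for q in range(1, len(pattern)):
        # Fall back to shorter prefix-suffixes until one can be extended.
        while k > 0 and pattern[q] != pattern[k]:
            k = table[k - 1]
        if pattern[q] == pattern[k]:
            k += 1
        table[q] = k
    return table


def kmp_search(text: str, pattern: str) -> int:
    """Return the index of the first occurrence of pattern in text, or -1."""
    if not pattern:
        return 0
    table = build_prefix_table(pattern)
    j = 0  # number of pattern characters matched so far
    for pos, ch in enumerate(text):
        # On a mismatch, reuse the table instead of rechecking text characters.
        while j > 0 and ch != pattern[j]:
            j = table[j - 1]
        if ch == pattern[j]:
            j += 1
        if j == len(pattern):
            return pos - j + 1  # start index of the match
    return -1


if __name__ == "__main__":
    print(build_prefix_table("ABCDABD"))
    print(kmp_search("ababcdababababababad", "babab"))
```

Building the table for A,B,C,D,A,B,D reproduces the 0,0,0,0,1,2,0 row given above.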
Test • Try searching for the substring A,B,C,D,A,B,D within the string ABCDABCABCDABDE.
Knuth Morris Pratt complexity • Knuth Morris Pratt removes some of the complexity of the brute force algorithm by preprocessing the substring being searched for (to create the suffix table). • Now as we don’t need to recheck characters in the text it is O(|T|) for the outer loop. • Preprocessing can be performed quickly, in O(|P|) time, leaving a total complexity of O(|T|+|P|)