
Efficient String Matching Algorithms for Pattern Search

This chapter discusses various string matching algorithms, including Sequential Search, Rabin-Karp, and Knuth-Morris-Pratt, and their performance in finding occurrences of a pattern in a given text.

karendennis

Presentation Transcript


  1. Chapter 3 String Matching

  2. String Matching • Given: two strings T[1..n] and P[1..m] over an alphabet Σ. • Want: all occurrences of P[1..m], “the pattern”, in T[1..n], “the text”. • Example: Σ = {a, b, c} [figure: text T and pattern P, aligned at shift s = 3] • P occurs with shift s. • P occurs beginning at position s+1. • s is a valid shift. • The string matching problem asks for every occurrence (every valid shift) of the pattern P in the given text T.

  3. Sequential Search

  4. Naïve String Matching Using Brute Force Technique

  5. Naïve String Matching method • n ≡ size of input string • m ≡ size of pattern to be matched • Running time: O((n-m+1)m) • Θ(n^2) if m = floor(n/2) • We can do better
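The brute-force method on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the presenter's code: the function name `naive_match` and the sample strings are assumptions, and shifts are reported 0-based, which agrees with the slides' definition of a shift s for a 1-based text.

```python
def naive_match(text: str, pattern: str) -> list[int]:
    """Return every valid shift s (0-based) at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    shifts = []
    for s in range(n - m + 1):            # n - m + 1 candidate shifts
        if text[s:s + m] == pattern:      # up to m character comparisons each
            shifts.append(s)
    return shifts


# Hypothetical sample input over the alphabet {a, b, c}.
print(naive_match("abcabaabcabac", "abaa"))   # prints [3]
```

The two nested costs (n - m + 1 shifts, up to m comparisons per shift) are exactly where the O((n-m+1)m) bound comes from.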

  6. Rabin-Karp String Matching Consider a hashing scheme • Let the characters in both arrays T and P be digits in radix-d notation, e.g. Σ = {0, 1, ..., 9} with d = 10. • Let p be the numeric value of the characters in P. • Choose a prime number q such that d·q fits within a computer word, to speed computations. • Compute (p mod q). • The value p mod q is what we use to find all matches of the pattern P in T.

  7. Compute (T[s+1..s+m] mod q) for s = 0..n-m • Test against P only those substrings of T whose value mod q equals p mod q

  8. Assume each character is a digit in radix-d notation (e.g. d=10) • p = decimal value of the pattern • t_s = decimal value of the substring T[s+1..s+m] for s = 0, 1, ..., n-m • s is a valid shift exactly when t_s = p. We never compute each new value from scratch. We simply adjust the existing value as we move over one character: t_(s+1) = d·(t_s - d^(m-1)·T[s+1]) + T[s+m+1], with all arithmetic done mod q.
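The scheme of slides 6 to 8 can be sketched as follows. This is a minimal sketch under stated assumptions: the function name `rabin_karp`, the small prime q = 13, and the digit strings are illustrative choices, and the input is assumed to consist of decimal digits so that `int()` gives each character's value.

```python
def rabin_karp(text: str, pattern: str, d: int = 10, q: int = 13) -> list[int]:
    """Return 0-based shifts where pattern occurs; characters must be digits."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    h = pow(d, m - 1, q)                   # d^(m-1) mod q: weight of the leading digit
    p = t = 0
    for i in range(m):                     # preprocessing, Θ(m)
        p = (d * p + int(pattern[i])) % q  # p mod q
        t = (d * t + int(text[i])) % q     # t_0 mod q
    shifts = []
    for s in range(n - m + 1):
        # Only a hash "hit" is verified character by character.
        if p == t and text[s:s + m] == pattern:
            shifts.append(s)
        if s < n - m:                      # roll the window one character to the right
            t = (d * (t - int(text[s]) * h) + int(text[s + m])) % q
    return shifts


# Assumed sample input (digits only); q = 13 is deliberately tiny to force collisions.
print(rabin_karp("2359023141526739921", "31415"))   # prints [6]
```

With such a small q, some windows hash to the same value as the pattern without matching it; the explicit comparison after a hit is what filters those spurious hits out.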

  9. Performance of Rabin-Karp • Preprocessing (computing the pattern hash): Θ(m) • Worst-case running time: Θ((n-m+1)m), no better than the naïve method • Expected case: if the number of hits is constant compared to n, we expect O(n) • Only hash “hits” are verified character by character, not all shifts

  10. The Knuth-Morris-Pratt Algorithm Knuth, Morris and Pratt proposed a linear-time algorithm for the string matching problem. A matching time of O(n) is achieved by avoiding comparisons with elements of ‘S’ that have previously been involved in a comparison with some element of the pattern ‘p’ to be matched, i.e., backtracking on the string ‘S’ never occurs.

  11. Components of KMP algorithm • The prefix function, Π: the prefix function Π for a pattern encapsulates knowledge about how the pattern matches against shifts of itself. This information can be used to avoid useless shifts of the pattern ‘p’. In other words, it makes backtracking on the string ‘S’ unnecessary. • The KMP Matcher: with string ‘S’, pattern ‘p’ and prefix function ‘Π’ as inputs, it finds the occurrences of ‘p’ in ‘S’ and reports the shift of ‘p’ at which each occurrence is found.

  12. The prefix function, Π The following pseudocode computes the prefix function, Π:
      Compute-Prefix-Function(p)
      1  m ← length[p]                      // ‘p’ is the pattern to be matched
      2  Π[1] ← 0
      3  k ← 0
      4  for q ← 2 to m
      5      do while k > 0 and p[k+1] != p[q]
      6             do k ← Π[k]
      7         if p[k+1] = p[q]
      8            then k ← k + 1
      9         Π[q] ← k
      10 return Π

  13. Example: compute Π for the pattern ‘p’ [figure: pattern p]. Initially: m = length[p] = 7, Π[1] = 0, k = 0. Step 1: q = 2, k = 0, Π[2] = 0. Step 2: q = 3, k = 0, Π[3] = 1. Step 3: q = 4, k = 1, Π[4] = 2.

  14. Step 4: q = 5, k = 2, Π[5] = 3. Step 5: q = 6, k = 3, Π[6] = 1. Step 6: q = 7, k = 1, Π[7] = 1. After iterating 6 times, the prefix function computation is complete:
      q     1  2  3  4  5  6  7
      Π[q]  0  0  1  2  3  1  1
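The prefix-function pseudocode translates almost line for line into Python once the indices are made 0-based. This is a sketch, not the presenter's code: the name `compute_prefix_function` and the sample pattern are assumptions. The pattern figure is not available in this transcript, so the pattern below is a hypothetical one chosen only because it reproduces the Π values listed in the walkthrough.

```python
def compute_prefix_function(p: str) -> list[int]:
    """pi[q] = length of the longest proper prefix of p[:q+1] that is also its suffix."""
    m = len(p)
    pi = [0] * m
    k = 0                                  # number of characters currently matched
    for q in range(1, m):
        while k > 0 and p[k] != p[q]:
            k = pi[k - 1]                  # fall back to the next shorter border
        if p[k] == p[q]:
            k += 1
        pi[q] = k
    return pi


# Hypothetical 7-character pattern (an assumption, see the lead-in).
print(compute_prefix_function("ababaaa"))   # prints [0, 0, 1, 2, 3, 1, 1]
```

Because Python lists are 0-based, `pi[q - 1]` here plays the role of Π[q] in the 1-based pseudocode.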

  15. The KMP Matcher The KMP Matcher, with pattern ‘p’, string ‘S’ and prefix function ‘Π’ as input, finds every match of p in S. The following pseudocode computes the matching component of the KMP algorithm:
      KMP-Matcher(S, p)
      1  n ← length[S]
      2  m ← length[p]
      3  Π ← Compute-Prefix-Function(p)
      4  q ← 0                              // number of characters matched
      5  for i ← 1 to n                     // scan S from left to right
      6      do while q > 0 and p[q+1] != S[i]
      7             do q ← Π[q]             // next character does not match
      8         if p[q+1] = S[i]
      9            then q ← q + 1           // next character matches
      10        if q = m                    // is all of p matched?
      11           then print “Pattern occurs with shift” i - m
      12                q ← Π[q]            // look for the next match
      Note: KMP finds every occurrence of ‘p’ in ‘S’. That is why KMP does not terminate at step 12; instead it searches the remainder of ‘S’ for any more occurrences of ‘p’.

  16. Illustration: given a string ‘S’ and a pattern ‘p’ [figure: string S and pattern p], let us execute the KMP algorithm to find whether ‘p’ occurs in ‘S’. For ‘p’, the prefix function Π was computed previously: Π[1..7] = 0, 0, 1, 2, 3, 1, 1.

  17. Initially: n = size of S = 15; m = size of p = 7. Step 1: i = 1, q = 0. Comparing p[1] with S[1]: p[1] does not match S[1], so ‘p’ is shifted one position to the right. Step 2: i = 2, q = 0. Comparing p[1] with S[2]: p[1] matches S[2]. Since there is a match, p is not shifted.

  18. Step 3: i = 3, q = 1. Comparing p[2] with S[3]: p[2] does not match S[3]. Backtracking on p (q = Π[1] = 0), comparing p[1] with S[3]. Step 4: i = 4, q = 0. Comparing p[1] with S[4]: p[1] does not match S[4]. Step 5: i = 5, q = 0. Comparing p[1] with S[5]: p[1] matches S[5].

  19. Step 6: i = 6, q = 1. Comparing p[2] with S[6]: p[2] matches S[6]. Step 7: i = 7, q = 2. Comparing p[3] with S[7]: p[3] matches S[7]. Step 8: i = 8, q = 3. Comparing p[4] with S[8]: p[4] matches S[8].

  20. Step 9: i = 9, q = 4. Comparing p[5] with S[9]: p[5] matches S[9]. Step 10: i = 10, q = 5. Comparing p[6] with S[10]: p[6] does not match S[10]. Backtracking on p, comparing p[4] with S[10], because after the mismatch q = Π[5] = 3. Step 11: i = 11, q = 4. Comparing p[5] with S[11]: p[5] matches S[11].

  21. Step 12: i = 12, q = 5. Comparing p[6] with S[12]: p[6] matches S[12]. Step 13: i = 13, q = 6. Comparing p[7] with S[13]: p[7] matches S[13]. The pattern ‘p’ occurs completely in the string ‘S’. The shift at which the match is found is i - m = 13 - 7 = 6.
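The walkthrough above can be replayed with a short Python sketch of the matcher. The function name `kmp_search` is an assumption, and so are the sample strings: the slide's ‘S’ and ‘p’ appear only in figures, so the 15-character text and 7-character pattern below are hypothetical values constructed to be consistent with the steps described (n = 15, m = 7, match found at shift 6). Indices are 0-based, so the shift is `i - m + 1` rather than i - m.

```python
def kmp_search(s: str, p: str) -> list[int]:
    """Return every 0-based shift at which p occurs in s, scanning s once."""
    m = len(p)
    if m == 0:
        return []
    # Prefix function (0-based): pi[q] = longest border of p[:q+1].
    pi = [0] * m
    k = 0
    for q in range(1, m):
        while k > 0 and p[k] != p[q]:
            k = pi[k - 1]
        if p[k] == p[q]:
            k += 1
        pi[q] = k
    shifts = []
    q = 0                                  # number of characters matched so far
    for i, ch in enumerate(s):             # i never moves backwards
        while q > 0 and p[q] != ch:
            q = pi[q - 1]                  # next character does not match
        if p[q] == ch:
            q += 1                         # next character matches
        if q == m:                         # all of p matched
            shifts.append(i - m + 1)
            q = pi[q - 1]                  # keep looking for further matches
    return shifts


# Hypothetical strings (assumptions, see the lead-in).
print(kmp_search("bacbabababaaaab", "ababaaa"))   # prints [6]
```

Note that the text index only ever advances; all the backtracking happens on `q`, the position inside the pattern.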

  22. Running-time analysis
      Compute-Prefix-Function(p)
      1  m ← length[p]                      // ‘p’ is the pattern to be matched
      2  Π[1] ← 0
      3  k ← 0
      4  for q ← 2 to m
      5      do while k > 0 and p[k+1] != p[q]
      6             do k ← Π[k]
      7         if p[k+1] = p[q]
      8            then k ← k + 1
      9         Π[q] ← k
      10 return Π
      In the above pseudocode for computing the prefix function, the for loop from step 4 to step 9 runs m-1 times, and steps 1 to 3 take constant time. The inner while loop only decreases k, and k increases by at most 1 per iteration of the for loop, so the while loop executes at most m-1 times in total. Hence the running time of the prefix function computation is Θ(m).
      KMP-Matcher(S, p)
      1  n ← length[S]
      2  m ← length[p]
      3  Π ← Compute-Prefix-Function(p)
      4  q ← 0
      5  for i ← 1 to n
      6      do while q > 0 and p[q+1] != S[i]
      7             do q ← Π[q]
      8         if p[q+1] = S[i]
      9            then q ← q + 1
      10        if q = m
      11           then print “Pattern occurs with shift” i - m
      12                q ← Π[q]
      The for loop beginning in step 5 runs n times, i.e., once per character of the string ‘S’. Steps 1, 2 and 4 take constant time and step 3 takes Θ(m), so the matching time is dominated by this for loop. By the same argument as above (q increases by at most 1 per iteration, so the while loop runs at most n times in total), the running time of the matching function is Θ(n).

  23. Closest-Pair Problem Find the two closest points in a set of n points (in the two-dimensional Cartesian plane). Brute-force algorithm Compute the distance between every pair of distinct points and return the indexes of the points for which the distance is the smallest.

  24. Closest-Pair Brute-Force Algorithm (cont.) Efficiency: Θ(n^2) multiplications (or sqrt). How to make it faster? Using divide-and-conquer!
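The brute-force closest-pair algorithm can be sketched as below. This is an illustrative sketch: the function name `closest_pair` and the sample points are assumptions. It compares squared distances and takes a single square root at the end, which is one way to avoid the per-pair sqrt mentioned on the slide.

```python
from itertools import combinations
from math import inf, sqrt


def closest_pair(points):
    """Return (distance, i, j) for the closest pair of distinct points, by index."""
    best_sq, best_i, best_j = inf, None, None
    # Every unordered pair is examined once: n(n-1)/2 pairs, i.e. Θ(n^2).
    for (i, (x1, y1)), (j, (x2, y2)) in combinations(enumerate(points), 2):
        d_sq = (x1 - x2) ** 2 + (y1 - y2) ** 2   # squared distance, no sqrt needed
        if d_sq < best_sq:
            best_sq, best_i, best_j = d_sq, i, j
    return sqrt(best_sq), best_i, best_j


# Hypothetical sample points.
d, i, j = closest_pair([(0, 0), (5, 5), (1, 1), (9, 0)])
print(i, j, round(d, 3))   # prints 0 2 1.414
```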

  25. Brute-Force Strengths and Weaknesses • Strengths: wide applicability; simplicity; yields reasonable algorithms for some important problems (e.g., matrix multiplication, sorting, searching, string matching) • Weaknesses: rarely yields efficient algorithms; some brute-force algorithms are unacceptably slow; not as constructive as some other design techniques

  26. Convex Hull

  27. Exhaustive Search A brute force solution to a problem involving search for an element with a special property, usually among combinatorial objects such as permutations, combinations, or subsets of a set. Method: • generate a list of all potential solutions to the problem in a systematic manner (see algorithms in Sec. 5.4) • evaluate potential solutions one by one, disqualifying infeasible ones and, for an optimization problem, keeping track of the best one found so far • when search ends, announce the solution(s) found

  28. Example 1: Traveling Salesman Problem • Given n cities with known distances between each pair, find the shortest tour that passes through all the cities exactly once before returning to the starting city • Alternatively: find the shortest Hamiltonian circuit in a weighted connected graph • Example: [figure: weighted graph on vertices a, b, c, d with edge weights a-b = 2, a-c = 8, a-d = 5, b-c = 3, b-d = 4, c-d = 7] • How do we represent a solution (Hamiltonian circuit)?

  29. TSP by Exhaustive Search
      Tour          Cost
      a→b→c→d→a     2+3+7+5 = 17
      a→b→d→c→a     2+4+7+8 = 21
      a→c→b→d→a     8+3+4+5 = 20
      a→c→d→b→a     8+7+4+2 = 21
      a→d→b→c→a     5+4+3+8 = 20
      a→d→c→b→a     5+7+3+2 = 17
      Efficiency: Θ((n-1)!)
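The tour table can be reproduced by enumerating the (n-1)! orderings of the non-start cities. The sketch below is illustrative: the function name `tsp_exhaustive` and the nested-dictionary representation of the graph are assumptions, while the edge weights are the ones implied by the cost sums in the table.

```python
from itertools import permutations


def tsp_exhaustive(dist, start):
    """Try every tour that starts and ends at `start`; return (cost, tour)."""
    others = [c for c in dist if c != start]
    best_cost, best_tour = float("inf"), None
    for perm in permutations(others):          # (n-1)! candidate tours
        tour = (start, *perm, start)
        cost = sum(dist[u][v] for u, v in zip(tour, tour[1:]))
        if cost < best_cost:                   # keep the first best tour found
            best_cost, best_tour = cost, tour
    return best_cost, best_tour


# Edge weights taken from the cost sums in the table above.
DIST = {
    "a": {"b": 2, "c": 8, "d": 5},
    "b": {"a": 2, "c": 3, "d": 4},
    "c": {"a": 8, "b": 3, "d": 7},
    "d": {"a": 5, "b": 4, "c": 7},
}
print(tsp_exhaustive(DIST, "a"))   # prints (17, ('a', 'b', 'c', 'd', 'a'))
```

Each tour appears twice in the enumeration, once per direction, which is why the table shows three distinct costs for six rows.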

  30. Example 2: Knapsack Problem Given n items: • weights: w1 w2 … wn • values: v1 v2 … vn • a knapsack of capacity W Find the most valuable subset of the items that fits into the knapsack. Example: knapsack capacity W = 16
      item  weight  value
      1     2       $20
      2     5       $30
      3     10      $50
      4     5       $10

  31. Knapsack Problem by Exhaustive Search
      Subset      Total weight  Total value
      {1}         2             $20
      {2}         5             $30
      {3}         10            $50
      {4}         5             $10
      {1,2}       7             $50
      {1,3}       12            $70
      {1,4}       7             $30
      {2,3}       15            $80
      {2,4}       10            $40
      {3,4}       15            $60
      {1,2,3}     17            not feasible
      {1,2,4}     12            $60
      {1,3,4}     17            not feasible
      {2,3,4}     20            not feasible
      {1,2,3,4}   22            not feasible
      Efficiency: Θ(2^n) Each subset can be represented by a binary string (bit vector, Ch 5).
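The subset table can be generated by walking through all 2^n bit vectors, as the last remark on the slide suggests. This is a minimal sketch: the function name `knapsack_exhaustive` is an assumption, and items are reported with the 1-based numbers used in the table.

```python
def knapsack_exhaustive(weights, values, capacity):
    """Check all 2^n subsets; return (best value, 1-based item numbers)."""
    n = len(weights)
    best_value, best_items = 0, []
    for mask in range(1 << n):                 # each mask is one bit vector
        items = [i for i in range(n) if (mask >> i) & 1]
        if sum(weights[i] for i in items) > capacity:
            continue                           # infeasible subset, disqualify it
        value = sum(values[i] for i in items)
        if value > best_value:                 # keep track of the best so far
            best_value, best_items = value, [i + 1 for i in items]
    return best_value, best_items


# The instance from the slides: W = 16, four items.
print(knapsack_exhaustive([2, 5, 10, 5], [20, 30, 50, 10], 16))   # prints (80, [2, 3])
```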

  32. Example 3: The Assignment Problem There are n people who need to be assigned to n jobs, one person per job. The cost of assigning person i to job j is C[i,j]. Find an assignment that minimizes the total cost.
                 Job 0  Job 1  Job 2  Job 3
      Person 0   9      2      7      8
      Person 1   6      4      3      7
      Person 2   5      8      1      8
      Person 3   7      6      9      4
      Algorithmic Plan: Generate all legitimate assignments, compute their costs, and select the cheapest one. How many assignments are there? n! Pose the problem as one about a cost matrix (or as a cycle cover in a graph).

  33. Assignment Problem by Exhaustive Search
      C =  9 2 7 8
           6 4 3 7
           5 8 1 8
           7 6 9 4
      Assignment (col. #s)  Total cost
      1, 2, 3, 4            9+4+1+4 = 18
      1, 2, 4, 3            9+4+8+9 = 30
      1, 3, 2, 4            9+3+8+4 = 24
      1, 3, 4, 2            9+3+8+6 = 26
      1, 4, 2, 3            9+7+8+9 = 33
      1, 4, 3, 2            9+7+1+6 = 23
      etc.
      (For this particular instance, the optimal assignment can be found by exploiting the specific features of the numbers given. It is 2, 1, 3, 4, with total cost 2+6+1+4 = 13.)
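Enumerating all n! column permutations of the cost matrix can be sketched as follows. The function name `assignment_exhaustive` is an assumption. Columns are 0-based in the code, so the result (1, 0, 2, 3) corresponds to the assignment 2, 1, 3, 4 in the slide's 1-based column numbers.

```python
from itertools import permutations


def assignment_exhaustive(cost):
    """Try all n! assignments; return (total cost, job column for each person)."""
    n = len(cost)

    def total(cols):
        # Person i is assigned to job cols[i].
        return sum(cost[i][cols[i]] for i in range(n))

    best = min(permutations(range(n)), key=total)
    return total(best), best


# The cost matrix C from the slides.
C = [
    [9, 2, 7, 8],
    [6, 4, 3, 7],
    [5, 8, 1, 8],
    [7, 6, 9, 4],
]
print(assignment_exhaustive(C))   # prints (13, (1, 0, 2, 3))
```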

  34. Final Comments on Exhaustive Search • Exhaustive-search algorithms run in a realistic amount of time only on very small instances • In some cases, there are much better alternatives: Euler circuits, shortest paths, minimum spanning tree, assignment problem (the Hungarian method runs in O(n^3) time) • In many cases, exhaustive search or a variation of it is the only known way to get an exact solution
