Explore the naïve algorithm for string matching and compare it with the Rabin-Karp algorithm for improved efficiency, covering methods, examples, and worst-case analysis. Learn about modular arithmetic and overcoming spurious hits.
Outline — String Matching
• Introduction
• Naïve Algorithm
• Rabin-Karp Algorithm
• Knuth-Morris-Pratt (KMP) Algorithm
Introduction
• What is string matching? Finding all occurrences of a pattern in a given text (or body of text)
• Many applications:
  • While using an editor/word processor/browser
  • Login name & password checking
  • Virus detection
  • Header analysis in data communications
  • DNA sequence analysis, Web search engines (e.g. Google), image analysis
String-Matching Problem
• The text is in an array T[1..n] of length n
• The pattern is in an array P[1..m] of length m
• Elements of T and P are characters from a finite alphabet Σ
  • E.g., Σ = {0,1} or Σ = {a, b, …, z}
• Usually T and P are called strings of characters
String-Matching Problem …contd
• We say that pattern P occurs with shift s in text T if:
  • 0 ≤ s ≤ n−m, and
  • T[(s+1)..(s+m)] = P[1..m]
• If P occurs with shift s in T, then s is a valid shift; otherwise s is an invalid shift
• String-matching problem: finding all valid shifts for a given T and P
Example 1
[figure: a 13-character text T with a 4-character pattern P aligned below it at shift s = 3]
• Shift s = 3 is a valid shift (n = 13, m = 4, and 0 ≤ s ≤ n−m holds)
Example 2
[figure: a 13-character text T with a 4-character pattern P aligned below it at shifts s = 3 and s = 9; both are valid shifts]
Naïve String-Matching Algorithm
Input: Text strings T[1..n] and P[1..m]
Result: All valid shifts displayed

NAÏVE-STRING-MATCHER(T, P)
  n ← length[T]
  m ← length[P]
  for s ← 0 to n−m
    if P[1..m] = T[(s+1)..(s+m)]
      print "pattern occurs with shift" s
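The pseudocode above translates directly to runnable code. Below is a 0-indexed Python sketch (the function name is illustrative, not from the slides):

```python
def naive_string_matcher(T, P):
    """Return all valid shifts s (0-indexed) at which P occurs in T."""
    n, m = len(T), len(P)
    shifts = []
    for s in range(n - m + 1):       # try every shift 0, 1, ..., n-m
        if T[s:s + m] == P:          # compare the m-character window to P
            shifts.append(s)
    return shifts

print(naive_string_matcher("aabbaabb", "aabb"))  # [0, 4]
```

Returning a list of shifts instead of printing them makes the function easy to reuse in the later algorithms.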
Naïve Algorithm
• The naïve algorithm checks, at every position in the text between 0 and n−m, whether an occurrence of the pattern starts there
• After each attempt, it shifts the pattern by exactly one position to the right

Example (from left to right), matching pattern "abca" against text "abcabcaabca":
a b c a b c a a b c a
a b c a               (shift = 0)
  a b c a             (shift = 1)
    a b c a           (shift = 2)
      a b c a         (shift = 3)
Analysis: Worst-case Example
[figure: a 13-character text T (e.g. all a's) and a 4-character pattern P for which all m character comparisons are performed at every one of the n−m+1 shifts]
Worst-case Analysis
• There are m comparisons for each shift in the worst case
• There are n−m+1 shifts
• So, the worst-case running time is Θ((n−m+1)m)
• In the example on the previous slide, we have (13−4+1)·4 = 40 comparisons in total
• The naïve method is inefficient because information gained from one shift is not reused for later shifts
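To see the Θ((n−m+1)m) bound concretely, here is an instrumented variant of the naïve matcher (my own sketch, not from the slides) that counts character comparisons on a worst-case input:

```python
def naive_comparisons(T, P):
    """Count the character comparisons the naive matcher performs."""
    n, m = len(T), len(P)
    count = 0
    for s in range(n - m + 1):       # n-m+1 shifts
        for j in range(m):
            count += 1               # one character comparison
            if T[s + j] != P[j]:
                break                # stop this shift at first mismatch
    return count

# Worst case: every shift matches m-1 characters and fails on the last,
# so all m comparisons happen at each of the n-m+1 shifts.
print(naive_comparisons("a" * 13, "aaab"))  # (13-4+1)*4 = 40
```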
Naïve Algorithm …contd
Example (from right to left), matching pattern "abca" against text "abcabcaabca":
a b c a b c a a b c a
      a b c a         (shift = 3)
    a b c a           (shift = 2)
  a b c a             (shift = 1)
a b c a               (shift = 0)
Among the shifts shown, the pattern occurs with shifts 0 and 3
Rabin-Karp Algorithm
• Has a worst-case running time of O((n−m+1)m), but its average case is O(n+m)
• Also works well in practice
• Based on the number-theoretic notion of modular equivalence
• We assume that Σ = {0, 1, 2, …, 9}, i.e., each character is a decimal digit
• In general, use radix d where d = |Σ|
Rabin-Karp Approach
• We can view a string of k characters (digits) as a length-k decimal number
  • E.g., the string "31425" corresponds to the decimal number 31,425
• Given a pattern P[1..m], let p denote the corresponding decimal value
• Given a text T[1..n], let t_s denote the decimal value of the length-m substring T[(s+1)..(s+m)] for s = 0, 1, …, n−m
Rabin-Karp Approach …contd
• t_s = p iff T[(s+1)..(s+m)] = P[1..m]
• s is a valid shift iff t_s = p
• p can be computed in O(m) time (Horner's rule):
  p = P[m] + 10(P[m−1] + 10(P[m−2] + … ))
• t_0 can similarly be computed in O(m) time
• The remaining values t_1, t_2, …, t_{n−m} can be computed in O(n−m) time, since t_{s+1} can be computed from t_s in constant time
Rabin-Karp Approach …contd
• t_{s+1} = 10(t_s − 10^{m−1}·T[s+1]) + T[s+m+1]
• E.g., if T = {…, 3, 1, 4, 1, 5, 2, …}, m = 5 and t_s = 31,415, then
  t_{s+1} = 10(31415 − 10000·3) + 2 = 14152
• Thus we can compute p in Θ(m) time and t_0, t_1, …, t_{n−m} in Θ(n−m+1) time
• So all occurrences of the pattern P[1..m] in text T[1..n] can be found with Θ(m) preprocessing time and Θ(n−m+1) matching time
• But there is a problem: this assumes p and t_s are small numbers
• They may be too large to work with easily
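Horner's rule and the constant-time window update above can be sketched in Python (helper names are invented for illustration; no modulus yet, so values can grow large):

```python
def decimal_value(digits):
    """Horner's rule: numeric value of a digit string, O(m) time."""
    p = 0
    for d in digits:
        p = 10 * p + int(d)
    return p

def window_values(T, m):
    """Values t_0, t_1, ..., t_{n-m}; each next window costs O(1)."""
    t = decimal_value(T[:m])                  # t_0, computed in O(m)
    values = [t]
    for s in range(len(T) - m):
        # drop the high-order digit, shift left, append the next digit
        t = 10 * (t - 10 ** (m - 1) * int(T[s])) + int(T[s + m])
        values.append(t)
    return values

print(window_values("31415926", 5))  # [31415, 14159, 41592, 15926]
```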
Rabin-Karp Approach …contd
• Solution: use modular arithmetic with a suitable modulus q
• E.g.,
  t_{s+1} ≡ (10(t_s − T[s+1]·h) + T[s+m+1]) (mod q)
  where h = 10^{m−1} (mod q)
• q is chosen as a small prime number; e.g., 13 for radix 10
• Generally, if the radix is d, then dq should fit within one computer word
How values modulo 13 are computed
The window 31415 (≡ 7 mod 13) slides one digit right to become 14152: the old high-order digit 3 is dropped and the new low-order digit 2 is appended.
  14152 ≡ ((31415 − 3·10000)·10 + 2) (mod 13)
        ≡ ((7 − 3·3)·10 + 2) (mod 13)      [since 10000 ≡ 3 (mod 13)]
        ≡ 8 (mod 13)
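The mod-13 update above can be checked directly; this small Python sketch reproduces the slide's numbers:

```python
q = 13                      # small prime modulus
d = 10                      # radix (decimal digits)
m = 5
h = pow(d, m - 1, q)        # h = 10^(m-1) mod 13, which is 3

t_old = 31415 % q           # 31415 is congruent to 7 (mod 13)
# slide the window from 31415 to 14152:
# drop high-order digit 3, append new low-order digit 2
t_new = (d * (t_old - 3 * h) + 2) % q
print(t_old, t_new)         # prints: 7 8  (and indeed 14152 % 13 == 8)
```

Python's `%` always returns a non-negative result, so the subtraction inside cannot produce a negative hash; in languages like C this needs an extra `+ q` adjustment.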
Problem of Spurious Hits
• t_s ≡ p (mod q) does not imply that t_s = p
• Modular equivalence does not necessarily mean that two integers are equal
• A case in which t_s ≡ p (mod q) but t_s ≠ p is called a spurious hit
• On the other hand, if two integers are not modular equivalent, then they cannot be equal
Example
pattern: 3 1 4 1 5                    p = 31415 ≡ 7 (mod 13)
text:    2 3 1 4 1 5 2 6 7 3 9 9 2 1
window values mod 13 (shifts s = 0, 1, …, 9):
         1 7 8 4 5 10 11 7 9 11
The window 31415 (value 7) is a valid match; the window 67399 (also ≡ 7 mod 13) is a spurious hit.
Rabin-Karp Algorithm
• Basic structure like the naïve algorithm, but uses modular arithmetic as described
• For each hit, i.e., for each s where t_s ≡ p (mod q), verify character by character whether s is a valid shift or a spurious hit
• In the worst case, every shift is verified
• The worst-case running time is therefore O((n−m+1)m)
• Average-case running time is O(n+m)
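Putting the pieces together, here is a Python sketch of the full Rabin-Karp matcher over general character strings. The radix d = 256 (byte values) and prime q = 101 are illustrative choices of mine, not prescribed by the slides:

```python
def rabin_karp(T, P, d=256, q=101):
    """Rabin-Karp matcher; returns 0-indexed shifts where P occurs in T."""
    n, m = len(T), len(P)
    if m == 0 or m > n:
        return []
    h = pow(d, m - 1, q)                 # d^(m-1) mod q
    p = t = 0
    for i in range(m):                   # O(m) preprocessing
        p = (d * p + ord(P[i])) % q
        t = (d * t + ord(T[i])) % q
    shifts = []
    for s in range(n - m + 1):
        # hit: hashes agree; verify characters to rule out a spurious hit
        if p == t and T[s:s + m] == P:
            shifts.append(s)
        if s < n - m:                    # roll the hash to the next window
            t = (d * (t - ord(T[s]) * h) + ord(T[s + m])) % q
    return shifts

print(rabin_karp("2359023141526739921", "31415"))  # [6]
```

The character-by-character check on each hit is what keeps the algorithm correct even when the small modulus produces spurious hits.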
3. The KMP Algorithm
• The Knuth-Morris-Pratt (KMP) algorithm looks for the pattern in the text in left-to-right order (like the brute-force algorithm)
• But it shifts the pattern more intelligently than the brute-force algorithm
• Note: the KMP slides switch to 0-based indexing (P[0..m−1]), unlike the 1-based notation used earlier
If a mismatch occurs between the text and pattern P at P[j], what is the most we can shift the pattern to avoid wasteful comparisons? • Answer: the largest prefix of P[0 .. j-1] that is a suffix of P[1 .. j-1]
Example
[figure: pattern P (beginning "a b a a b") aligned with text T; a mismatch occurs at P[j] with j = 5, and the pattern is shifted so that the new j value is 2]
Why the new j == 2
• Find the largest prefix (start) of "a b a a b" (P[0..j−1]) which is a suffix (end) of "b a a b" (P[1..j−1])
• Answer: "a b"
• Set j = 2 // the new j value
KMP Failure Function
• KMP preprocesses the pattern to find matches of prefixes of the pattern with the pattern itself
• j = mismatch position in P[]
• k = position before the mismatch (k = j−1)
• The failure function F(k) is defined as the size of the largest prefix of P[0..k] that is also a suffix of P[1..k]
Failure Function Example (k == j−1)
• P: "abaaba"
  j:  012345

  k    : 0 1 2 3 4
  F(k) : 0 0 1 1 2

• In code, F() is represented by an array, like the table; F(k) is the size of the largest prefix
Why is F(4) == 2?
P: "abaaba"
• F(4) means: find the size of the largest prefix of P[0..4] that is also a suffix of P[1..4]
  = find the size of the largest prefix of "abaab" that is also a suffix of "baab"
  = the size of "ab" = 2
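The failure function can be computed straight from its definition. This Python sketch deliberately uses a naive O(m²) check to stay close to the slides' wording (a linear-time construction also exists and is shown later with the full matcher):

```python
def failure_function(P):
    """F[k] = size of the largest prefix of P[0..k] that is also
    a suffix of P[1..k], computed by direct (O(m^2)) comparison."""
    m = len(P)
    F = [0] * m
    for k in range(1, m):
        # try the longest candidate length first
        for j in range(k, 0, -1):
            if P[:j] == P[k - j + 1:k + 1]:
                F[k] = j
                break
    return F

print(failure_function("abaaba"))  # [0, 0, 1, 1, 2, 3]
```

Entries F(0)..F(4) match the table on the slide: 0, 0, 1, 1, 2.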
Using the Failure Function
• Knuth-Morris-Pratt's algorithm modifies the brute-force algorithm:
• If a mismatch occurs at P[j] (i.e. P[j] != T[i]) and j > 0, then
  k = j−1; j = F(k); // obtain the new j
• The text index i never moves backwards (if j == 0, i simply advances)
Example
• P: "abacab"

  k    : 0 1 2 3 4
  F(k) : 0 0 1 0 1

[figure: the pattern P matched against a text T using these failure values]
Why is F(4) == 1?
P: "abacab"
• F(4) means: find the size of the largest prefix of P[0..4] that is also a suffix of P[1..4]
  = find the size of the largest prefix of "abaca" that is also a suffix of "baca"
  = the size of "a" = 1
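Combining the failure function with the left-to-right scan gives the complete matcher. A Python sketch (names are illustrative; this version builds the failure table with the standard linear-time construction rather than the quadratic definition-based check):

```python
def kmp_search(T, P):
    """KMP matcher; returns 0-indexed positions where P occurs in T."""
    m = len(P)
    # build failure table F in O(m): F[k] = length of the longest
    # proper prefix of P[0..k] that is also a suffix of P[0..k]
    F = [0] * m
    j = 0
    for k in range(1, m):
        while j > 0 and P[k] != P[j]:
            j = F[j - 1]
        if P[k] == P[j]:
            j += 1
        F[k] = j
    # scan the text in O(n); the text index i never moves backwards
    matches = []
    j = 0
    for i, c in enumerate(T):
        while j > 0 and c != P[j]:
            j = F[j - 1]            # shift the pattern via the failure table
        if c == P[j]:
            j += 1
        if j == m:                  # full match ending at position i
            matches.append(i - m + 1)
            j = F[j - 1]            # continue looking for further matches
    return matches

print(kmp_search("abacaabaccabacabaabb", "abacab"))  # [10]
```

Because j only ever increases with i and each failure-table fallback consumes earlier increases, the scan is O(n) and the total time is O(m+n).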
KMP Advantages
• KMP runs in optimal time: O(m+n), which is very fast
• The algorithm never needs to move backwards in the input text T
• This makes KMP good for processing very large files that are read in from external devices or through a network stream
KMP Disadvantages
• KMP's advantage shrinks as the size of the alphabet increases
• With a larger alphabet, mismatches are more likely and tend to occur early in the pattern
• KMP saves the most work when mismatches occur later, after long partial matches