String Matching
Input: Strings P (pattern) and T (text); |P| = m, |T| = n.
Output: Indices of all occurrences of P in T.
Example: T = discombobulate

P        output
combo    4 (i.e., with shift 3)
ate      12
later    15 > |T| (no occurrence of P)
Applications
Text retrieval
Computational biology
- DNA is a one-dimensional (1-D) string of characters: A's, G's, C's, T's.
- All information for 3-D protein folding is contained in the protein sequence itself and is independent of the environment.
Searching for DNA patterns
Comparing two or more DNA strings for similarities
Reconstructing DNA strings from overlapping fragments
Sliding the Pattern Template
T = b i o l o g y, P = l o g i c; n = 7, m = 5

b i o l o g y
l o g i c            T[1] ≠ P[1]: no match!

b i o l o g y
  l o g i c          T[2] ≠ P[1]

b i o l o g y
    l o g i c        T[3] ≠ P[1]

b i o l o g y
      l o g i c      T[4] = P[1], T[5] = P[2], T[6] = P[3], but T[7] ≠ P[4]
Another Example
T = b i o l o g i c a l, P = l o g i c; n = 10, m = 5

b i o l o g i c a l
      l o g i c      Match found! Return 4.
The Naive Matcher
Pattern: P[1..m]
Text: T[1..n]

Naive-String-Matcher(T, P)
// find all occurrences of P in T.
for s = 1 to n - m + 1 do
    if P[1 .. m] = T[s .. s+m-1]
        then print "Pattern occurs at index" s

At each step, P[1 .. m] is aligned with T[s .. s+m-1].
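The pseudocode above can be sketched in Python roughly as follows. This is a minimal illustration, not the slides' own code: the function name `naive_string_matcher` is made up here, and since Python strings are 0-based, the sketch adds 1 to each shift to report the 1-based indices used in the slides.

```python
def naive_string_matcher(text: str, pattern: str) -> list[int]:
    """Return the 1-based index of every occurrence of pattern in text."""
    n, m = len(text), len(pattern)
    hits = []
    # s is the 0-based shift; the slides' index is s + 1.
    for s in range(n - m + 1):
        # Compare the whole pattern against the block T[s .. s+m-1].
        if text[s:s + m] == pattern:
            hits.append(s + 1)
    return hits


print(naive_string_matcher("discombobulate", "combo"))  # [4]
print(naive_string_matcher("biological", "logic"))      # [4]
print(naive_string_matcher("biology", "logic"))         # []
```

Returning a list instead of printing keeps the function easy to check against the examples from the earlier slides.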
Time Complexity
m(n - m + 1) comparisons in the worst case: the pattern (m chars) is tried against n - m + 1 blocks of T, starting at positions 1, 2, 3, ..., n - m + 1, and each block requires m comparisons.
Time complexity is O(mn)!
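To make the worst-case count concrete, here is a small sketch that counts character comparisons one at a time. The helper `count_comparisons` and the all-`a` input are assumptions chosen for illustration: when every character of T and P is the same, no block fails early, so each of the n - m + 1 blocks costs the full m comparisons.

```python
def count_comparisons(text: str, pattern: str) -> int:
    """Count single-character comparisons made by the naive matcher."""
    n, m = len(text), len(pattern)
    comparisons = 0
    for s in range(n - m + 1):
        for j in range(m):
            comparisons += 1
            if text[s + j] != pattern[j]:
                break  # mismatch: give up on this block early
    return comparisons


n, m = 12, 4
worst = count_comparisons("a" * n, "a" * m)
print(worst, m * (n - m + 1))  # 36 36
```

On friendlier inputs the count is far lower; for T = biology and P = logic every block fails on its first character.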
Finite Automaton
A finite automaton consists of
- a finite set Q of states
- a start state
- a set A of accepting states
- a finite input alphabet Σ
- a transition function δ: Q × Σ → Q.

Example (start state 0, accepting state 1)
Transition function:

         input
state    a    b
0        1    0
1        0    0
Accepting a String
The automaton always begins at the start state. It accepts a string if it ends at an accepting state after reading all of the string's chars. Otherwise, it rejects the string.

input    state sequence    accepts?
aabba    010001            Yes
bbabb    000100            No
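The acceptance rule can be tried out with a short simulation. This is a sketch under assumptions: the table below is the two-state example automaton, with state 0 as the start state and state 1 as the only accepting state (read off from the two state sequences), and the names `DELTA`, `START`, `ACCEPTING` and `run` are made up for this illustration.

```python
# Transition function of the two-state example: (state, input char) -> next state.
DELTA = {
    (0, "a"): 1, (0, "b"): 0,
    (1, "a"): 0, (1, "b"): 0,
}
START = 0
ACCEPTING = {1}


def run(string: str) -> tuple[list[int], bool]:
    """Return the visited state sequence and whether the string is accepted."""
    states = [START]
    for ch in string:
        states.append(DELTA[(states[-1], ch)])
    # Accept only if the final state is an accepting state.
    return states, states[-1] in ACCEPTING


print(run("aabba"))  # ([0, 1, 0, 0, 0, 1], True)
print(run("bbabb"))  # ([0, 0, 0, 1, 0, 0], False)
```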
A String Matching Automaton
Ex. Pattern P = a a b a

Transition function:

         input
state    a    b    P
0        1    0    a
1        2    0    a
2        2    3    b
3        4    0    a
4        2    0

T = a b b a a a b a a b a
state sequence: 0 1 0 0 1 2 2 3 4 2 3 4
Pattern occurs at indices 5 and 8!
aba is not rescanned, due to transition 4 → 2.
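The trace on this slide can be reproduced by hard-coding the transition table for P = aaba and feeding T through it one character at a time. This is only an illustrative sketch: the names `DELTA`, `M` and `trace` are not from the slides, and the table is copied from the example rather than computed.

```python
# Transition table for P = "aaba": DELTA[state][char] -> next state.
DELTA = {
    0: {"a": 1, "b": 0},
    1: {"a": 2, "b": 0},
    2: {"a": 2, "b": 3},
    3: {"a": 4, "b": 0},
    4: {"a": 2, "b": 0},
}
M = 4  # pattern length; state M means the whole pattern has just been read


def trace(text: str) -> tuple[list[int], list[int]]:
    """Return the state sequence and the 1-based match indices."""
    q = 0
    states, hits = [q], []
    for i, ch in enumerate(text, start=1):
        q = DELTA[q][ch]  # each text char is read exactly once
        states.append(q)
        if q == M:
            hits.append(i - M + 1)
    return states, hits


states, hits = trace("abbaaabaaba")
print(states)  # [0, 1, 0, 0, 1, 2, 2, 3, 4, 2, 3, 4]
print(hits)    # [5, 8]
```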
Key Ideas of Automaton Matching
- Slide the pattern forward by more than one position if possible.
- Do not rescan chars of T that have already been examined.
The Automaton Matcher
Finite-Automaton-Matcher(T, δ, m)
n = length[T]
q = 0 // current state
for i = 1 to n do
    q = δ(q, T[i]) // δ function precomputed
    if q = m // match succeeds
        then print "Pattern occurs at index" i - m + 1

O(n) if the state transition function δ is available.
But computing δ requires O(m³ |Σ|)! // details omitted.
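Since the slides omit how δ is built, the sketch below fills that gap with the textbook brute-force construction; this is an assumption, not necessarily the method the slides have in mind. For each state q and character c it looks for the longest prefix of P that is also a suffix of P[1..q] followed by c. The function names and the explicit `alphabet` argument are made up for this illustration.

```python
def compute_transition_function(pattern: str, alphabet: str) -> dict:
    """Brute-force construction of delta: (state, char) -> next state."""
    m = len(pattern)
    delta = {}
    for q in range(m + 1):
        for ch in alphabet:
            seen = pattern[:q] + ch  # what has been read so far
            k = min(m, q + 1)
            # Shrink k until P[1..k] is a suffix of `seen` (k = 0 always works).
            while k > 0 and not seen.endswith(pattern[:k]):
                k -= 1
            delta[(q, ch)] = k
    return delta


def finite_automaton_matcher(text: str, delta: dict, m: int) -> list[int]:
    """Scan text once; assumes every char of text is in delta's alphabet."""
    q, hits = 0, []
    for i, ch in enumerate(text, start=1):
        q = delta[(q, ch)]
        if q == m:  # match succeeds
            hits.append(i - m + 1)
    return hits


delta = compute_transition_function("aaba", "ab")
print(finite_automaton_matcher("abbaaabaaba", delta, 4))  # [5, 8]
```

The construction loop does far more work than the scan itself, which is the trade-off noted above: matching is linear in n only once δ has been paid for.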