Exact Matching Charles Yan 2008
Naïve Method
Input: P: pattern; T: text
Output: Occurrences of P in T
Algorithm Naive
  Align P with the left end of T
  Compare from left to right until a mismatch or an occurrence of P is found
  Shift P one place to the right and repeat until P passes the right end of T
Time: O(n*m), where n=|P| and m=|T|
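
The naïve method above can be sketched in Python as follows. This is a minimal illustration, not code from the slides: the function name `naive_match` and the choice to report 1-based start positions (to match the slides' indexing) are assumptions.

```python
from typing import List


def naive_match(pattern: str, text: str) -> List[int]:
    """Return the 1-based start positions of pattern in text (naive method)."""
    n, m = len(pattern), len(text)
    occurrences = []
    if n == 0:
        return occurrences  # assumption: an empty pattern reports nothing
    # Try every alignment of the pattern against the text, left to right.
    for shift in range(m - n + 1):
        j = 0
        # Compare left to right until a mismatch or the end of the pattern.
        while j < n and pattern[j] == text[shift + j]:
            j += 1
        if j == n:
            occurrences.append(shift + 1)  # convert to 1-based position
    return occurrences
```

For example, `naive_match("aab", "aabaabcaxaabaabcy")` returns `[1, 4, 10, 13]`. Each of the up to m-n+1 alignments may cost up to n comparisons, which gives the O(n*m) bound.
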
Speeding Up The Naïve Algorithm
• Shift P by more than one place at a time
• Skip comparisons that have already been made
Preprocessing
• Goal: To gather the information needed for speeding up the algorithm
• Definitions:
  • substring, prefix, suffix, proper prefix, proper suffix
  • Zi: For i>1, the length of the longest substring of S that starts at i and matches a prefix of S
  • Z-box: For any position i>1 where Zi>0, the Z-box at i starts at i and ends at i+Zi-1
  • ri: For every i>1, ri is the right-most endpoint of the Z-boxes that begin at or before i
  • li: For every i>1, li is the left endpoint of the Z-box that ends at ri
Preprocessing
i:   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17
S:   a  a  b  a  a  b  c  a  x  a  a  b  a  a  b  c  y
Z:   0  1  0  3  1  0  0  1  0  7  1  0  3  1  0  0  0
ri:  0  2  2  6  6  6  6  8  8 16 16 16 16 16 16 16 16
li:  0  2  2  4  4  4  4  8  8 10 10 10 10 10 10 10 10
(Figure: the Z-boxes drawn over S. The entries for i=1 are shown as 0; the definitions apply only for i>1.)
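
To make the definitions concrete, the sketch below computes Zi, ri and li directly from their definitions. It is deliberately the slow quadratic version, not the linear Z-algorithm, and the function name `z_by_definition` is hypothetical. Lists are padded internally so that index i matches the slides' 1-based positions.

```python
def z_by_definition(s):
    """Compute Z, r, l for positions 1..n straight from the definitions."""
    n = len(s)
    z = [0] * (n + 1)  # index 0 unused so z[i] is the slides' Zi
    r = [0] * (n + 1)
    l = [0] * (n + 1)
    for i in range(2, n + 1):
        # Zi: length of the longest substring starting at i that matches a prefix.
        length = 0
        while i + length <= n and s[i - 1 + length] == s[length]:
            length += 1
        z[i] = length
        # ri, li: carry over the right-most Z-box seen so far ...
        r[i], l[i] = r[i - 1], l[i - 1]
        # ... unless the Z-box at i reaches further to the right.
        if length > 0 and i + length - 1 > r[i]:
            r[i], l[i] = i + length - 1, i
    return z[1:], r[1:], l[1:]
```

Calling `z_by_definition("aabaabcaxaabaabcy")` reproduces the three rows of the table above.
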
Z-Algorithm
Goal: To calculate Zi for an input string S of length n in linear time
• Starting from i=2, calculate Z2, r2 and l2
• For i=3; i<=n; i++: in iteration k, calculate Zk, rk and lk based on Zj, rj and lj for j=2,…,k-1
• For iteration k, the algorithm only needs rk-1 and lk-1. Thus, there is no need to keep all ri and li. We use r and l to denote rk-1 and lk-1
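Z-Algorithm
In iteration k:
(I) if k<=r
• a = S[l..r] is the Z-box containing k, and a’ = S[1..r’] is the prefix it matches, so a=a’
• b = S[k..r], and b’ = S[k’..r’] is its copy inside a’, so b=b’
• k’=k-l+1; r’=r-l+1
(Figure: l’, k’, r’ marked on the prefix a’ and l, k, r marked on the Z-box a, drawn over S = a a b a a b c a x a a b a a b c y, positions 1 to 17)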
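A) If |g’|<|b’|, that is, Zk’ < r-k+1, then Zk = Zk’
• g’ is the Z-box at k’ (so |g’| = Zk’), g’’ is the prefix it matches, and g is the copy of g’ that starts at k: g=g’=g’’
• x is the character that follows g’’ and y is the character that follows g’; x≠y because g’ is a Z-box. Since g’ ends inside b’, the same y follows g, so g cannot be extended
(Figure: a’, b’, g’’, g’ in the prefix and a, b, g in the Z-box, with l’, k’, r’ and l, k, r marked)
Example: for k=13, l=10 and r=16, so k’=4 and Z4=3 < r-k+1=4; hence Z13=Z4=3
i:   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17
S:   a  a  b  a  a  b  c  a  x  a  a  b  a  a  b  c  y
Z:   0  1  0  3  1  0  0  1  0  7  1  0  3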
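Z-Algorithm
B) If |g’|>|b’|, that is, Zk’ > r-k+1, then Zk = |b|, i.e., r-k+1
• b’’ is the prefix of length |b|: b=b’=b’’; g’=g’’
• x is the character that follows b’ (position r’+1) and y is the character that follows b (position r+1); x≠y because a is a Z-box
• Because g’ extends past b’, the character that follows b’’ is also x, so b cannot be extended: Zk =|b|, i.e., r-k+1
(Figure: a’, b’’, b’, g’’, g’ in the prefix and a, b in the Z-box, with l’, k’, r’ and l, k, r marked)
Example: for k=13, l=10 and r=14, so k’=4 and Z4=3 > r-k+1=2; hence Z13=2
i:   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
S:   a  a  b  a  a  b  c  a  x  a  a  b  a  a  c  d
Z:   0  1  0  3  1  0  0  1  0  5  1  0  2  1  0  0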
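Z-Algorithm
C) If |g’|=|b’|, that is, Zk’ = r-k+1, then Zk >= |b|, i.e., Zk >= r-k+1
• b=b’=b’’; g=g’=g’’
• x is the character that follows b’, y is the character that follows b, and z is the character that follows b’’
• x≠y (because a is a Z-box); z≠x (because g’ is a Z-box); whether z=y is unknown
• Compare S[r+1,…] with S[|b|+1,…] until a mismatch occurs. Update Zk, r, and l
(Figure: a’, b’’, b’, g’’, g’ in the prefix and a, b, g in the Z-box, with l’, k’, r’ and l, k, r marked)
Example: for k=13, l=10 and r=14, so k’=4 and Z4=2=r-k+1. Comparing S[15,…] with S[3,…] gives one match (b=b) and then a mismatch (d≠a), so Z13=3, r=15 and l=13
i:   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
S:   a  a  b  a  a  e  c  a  x  a  a  b  a  a  b  d
Z:   0  1  0  2  1  0  0  1  0  5  1  0  3  1  0  0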
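Z-Algorithm
(II) if k>r
• Compare the characters starting at k with those starting at 1 until a mismatch occurs; the number of matches is Zk
• Update r and l if necessary (if Zk>0, then l=k and r=k+Zk-1)
(Figure: k lies to the right of the Z-box from l to r)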
Z-Algorithm
Input: String S of length n
Output: Zi for i=2,…,n
Z Algorithm
  Calculate Z2, r2 and l2 explicitly by comparisons. r=r2 and l=l2
  for k=3; k<=n; k++
    if k<=r
      k’=k-l+1
      if Zk’ < r-k+1, then Zk = Zk’
      else if Zk’ > r-k+1, then Zk = r-k+1
      else compare the characters starting at r+1 with those starting at |b|+1 = r-k+2. Update Zk, r, and l if necessary
    else
      compare the characters starting at k with those starting at 1. Update Zk, r, and l if necessary
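
The pseudocode above might be written in Python roughly as follows. This is a sketch under a few assumptions: the function name `z_algorithm` is hypothetical, the list is padded internally so that `z[k]` lines up with the slides' 1-based Zk, and the special handling of k=2 is folded into the general loop by starting with r=0.

```python
def z_algorithm(s):
    """Return [Z1, ..., Zn] for s, with Z1 reported as 0 as in the slides."""
    n = len(s)
    z = [0] * (n + 1)  # 1-based: z[k] is Zk; z[0] is unused
    l = r = 0          # right-most Z-box found so far is S[l..r]
    for k in range(2, n + 1):
        if k <= r:
            k_prime = k - l + 1  # position of k's copy in the prefix
            b_len = r - k + 1    # |b|, the length of S[k..r]
            if z[k_prime] < b_len:   # case A: copy the value
                z[k] = z[k_prime]
                continue
            if z[k_prime] > b_len:   # case B: clipped at r
                z[k] = b_len
                continue
            q = b_len                # case C: extend past r
        else:
            q = 0                    # case II: start from scratch
        # Explicit comparisons: S[k+q] against S[1+q] in 1-based terms.
        while k + q <= n and s[k - 1 + q] == s[q]:
            q += 1
        z[k] = q
        if q > 0 and k + q - 1 > r:
            l, r = k, k + q - 1
    return z[1:]
```

For example, `z_algorithm("aaaa")` returns `[0, 3, 2, 1]`: the second call of the loop hits case B and the third also clips at r, so only the first iteration performs comparisons.
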
Preprocessing
i:   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17
S:   a  a  b  a  a  b  c  a  x  a  a  b  a  a  b  c  y
Z:   0  1  0  3  1  0  0  1  0  7  1  0  3  1  0  0  0
r:   0  2  2  6  6  6  6  8  8 16 16 16 16 16 16 16 16
l:   0  2  2  4  4  4  4  8  8 10 10 10 10 10 10 10 10
Z-Algorithm
Time complexity
• #mismatches <= number of iterations <= n, because each iteration stops at its first mismatch
• #matches: Let q be the number of matches at iteration k; then r increases by at least q. Since r<=n, total #matches <= n
• T=O(#matches + #mismatches + #iterations)=O(n)
i:     1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17
S:     a  a  b  a  a  b  c  a  x  a  a  b  a  a  b  c  y
Z:     0  1  0  3  1  0  0  1  0  7  1  0  3  1  0  0  0
r:     0  2  2  6  6  6  6  8  8 16 16 16 16 16 16 16 16
l:     0  2  2  4  4  4  4  8  8 10 10 10 10 10 10 10 10
#m:    0  1  0  3  0  0  0  1  0  7  0  0  0  0  0  0  0
#mis:  0  1  1  1  0  0  1  1  1  1  0  0  0  0  0  0  1
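
The counting argument can be checked empirically with an instrumented variant of the algorithm. The sketch below is an illustration only: the function name `z_comparison_counts` is hypothetical, and cases A and B are merged into one branch because neither performs any comparison.

```python
def z_comparison_counts(s):
    """Return per-position (#matches, #mismatches) made by the Z-algorithm."""
    n = len(s)
    z = [0] * (n + 1)          # 1-based Z values, z[0] unused
    matches = [0] * (n + 1)
    mismatches = [0] * (n + 1)
    l = r = 0
    for k in range(2, n + 1):
        if k <= r:
            b_len = r - k + 1
            zk_prime = z[k - l + 1]
            if zk_prime != b_len:
                # Cases A and B: no character comparison is needed.
                z[k] = min(zk_prime, b_len)
                continue
            q = b_len  # case C: continue comparing past r
        else:
            q = 0      # case II
        while k + q <= n:
            if s[k - 1 + q] == s[q]:
                matches[k] += 1
                q += 1
            else:
                mismatches[k] += 1  # at most one mismatch per iteration
                break
        z[k] = q
        if q > 0 and k + q - 1 > r:
            l, r = k, k + q - 1
    return matches[1:], mismatches[1:]
```

On `"aabaabcaxaabaabcy"` this reproduces the `#m` and `#mis` rows above: 12 matches and 8 mismatches in total, both below n=17.
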
Simplest Linear Time Exact Matching Algorithm
Input: Pattern P, Text T
Output: Occurrences of P in T
Algorithm Simplest
  S=P$T, where $ is a character that does not appear in P or T
  For i=2; i<=|S|; i++
    Calculate Zi
    If Zi=|P|, then report that there is an occurrence of P in T starting at position i-|P|-1 of T
Time: O(|P|+|T|+1)=O(n+m)
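
A minimal Python sketch of this matcher is given below. The names `z_values` and `z_match` are hypothetical, the separator is passed as a parameter that the caller must choose so that it appears in neither string, and the Z computation uses a compact form that merges cases A, B and C into a single `min` followed by the comparison loop. This costs at most one extra comparison per position and stays linear.

```python
def z_values(s):
    """1-based Z values of s: the result's index i holds Zi; index 0 is unused."""
    n = len(s)
    z = [0] * (n + 1)
    l = r = 0
    for k in range(2, n + 1):
        # Start from what the enclosing Z-box already guarantees, if anything.
        q = min(z[k - l + 1], r - k + 1) if k <= r else 0
        while k + q <= n and s[k - 1 + q] == s[q]:
            q += 1
        z[k] = q
        if q > 0 and k + q - 1 > r:
            l, r = k, k + q - 1
    return z


def z_match(pattern, text, sep="$"):
    """Return 1-based start positions of pattern in text using S = P$T."""
    if not pattern:
        return []  # assumption: an empty pattern reports nothing
    if sep in pattern or sep in text:
        raise ValueError("separator must not occur in pattern or text")
    s = pattern + sep + text
    z = z_values(s)
    n = len(pattern)
    # Zi = |P| at position i of S means an occurrence at position i-|P|-1 of T.
    return [i - n - 1 for i in range(2, len(s) + 1) if z[i] == n]
```

For example, `z_match("aab", "aabaabcaxaabaabcy")` returns `[1, 4, 10, 13]`. Because the separator stops every match at the end of P, no Zi can exceed |P|.
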
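Simplest Linear Time Exact Matching Algorithm
• Takes only O(n) extra space, where n=|P|: because $ appears in neither P nor T, Zi<=n for every i, so k’=k-l+1 always falls inside P and only the Z values of P need to be stored
• Alphabet-independent linear time
(Figure: a’ and b’ with l’, k’, r’, the separator $, and a and b with l, k, r)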
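
One way to realize the O(n) extra space claim is to store Z values for P only and scan T without ever building P$T. The sketch below is an assumption about how this could be done, not the slides' own code: the names `z_values_0` and `z_match_low_memory` are hypothetical, indices are 0-based internally, and the bound `q < n` plays the role of the separator $.

```python
def z_values_0(s):
    """0-based Z values of s (z[0] is left as 0)."""
    n = len(s)
    z = [0] * n
    l, r = 0, -1  # s[l..r] is the right-most Z-box; empty at the start
    for k in range(1, n):
        q = min(z[k - l], r - k + 1) if k <= r else 0
        while k + q < n and s[k + q] == s[q]:
            q += 1
        z[k] = q
        if q > 0 and k + q - 1 > r:
            l, r = k, k + q - 1
    return z


def z_match_low_memory(pattern, text):
    """Return 1-based start positions of pattern in text with O(|P|) extra space."""
    n, m = len(pattern), len(text)
    if n == 0:
        return []  # assumption: an empty pattern reports nothing
    zp = z_values_0(pattern)  # the only table kept: Z values of P
    occurrences = []
    l, r = 0, -1  # text[l..r] matches pattern[0..r-l]; empty at the start
    for k in range(m):
        # Reuse the Z value of the corresponding position inside P.
        q = min(zp[k - l], r - k + 1) if k <= r else 0
        # "q < n" stands in for the separator: a match never runs past P.
        while q < n and k + q < m and text[k + q] == pattern[q]:
            q += 1
        if q > 0 and k + q - 1 > r:
            l, r = k, k + q - 1
        if q == n:
            occurrences.append(k + 1)
    return occurrences
```

The scan keeps only `zp`, `l`, `r` and the output list, so the working space beyond the inputs and the reported occurrences is proportional to |P|.
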