Boyer-Moore Charles Yan 2007
Exact Matching
• Boyer-Moore (worst case: linear time; typical: sublinear time)
• Aho-Corasick (a set of patterns)
Boyer-Moore
Idea 1: Right-to-left comparison
   12345678901234567
T: xpbctbxabpqxctbpq
P:   tpabxab
Boyer-Moore
Idea 2: Bad character rule
   12345678901234567
T: spbctbsabpqsctbpq
P:   tpabsab
R(x): the position of the right-most occurrence of x in P; R(x)=0 if x does not occur in P. Here R(t)=1 and R(s)=5.
i: the position of the mismatch in P. Here i=3.
k: the corresponding position in T. Here k=5 and T[k]=t.
The bad character rule says P should be shifted right by max{1, i-R(T[k])}. That is, if the right-most occurrence of character T[k] in P is at position j (j<i), then P[j] should be below T[k] after the shift. Here the shift is max{1, 3-1}=2:
   12345678901234567
T: spbctbsabpqsctbpq
P:     tpabsab
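The bad character rule can be sketched in Python as follows. This is a minimal illustration rather than code from the slides: the names rightmost_occurrence and bad_character_shift are made up here, and positions are 1-based to match the slides.

```python
def rightmost_occurrence(pattern):
    """R(x): 1-based position of the right-most occurrence of x in pattern."""
    table = {}
    for pos, ch in enumerate(pattern, start=1):
        table[ch] = pos  # later occurrences overwrite earlier ones
    return table


def bad_character_shift(pattern, i, text_char):
    """Shift for a mismatch at 1-based pattern position i against text_char."""
    r = rightmost_occurrence(pattern).get(text_char, 0)  # R(x)=0 if x is absent
    return max(1, i - r)


R = rightmost_occurrence("tpabsab")
print(R["t"], R["s"])                           # 1 5
print(bad_character_shift("tpabsab", 3, "t"))   # 2
```

In a real search the table would be built once per pattern; it is rebuilt on every call here only to keep the sketch short.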
Boyer-Moore
• The idea of the bad character rule is to shift P by more than one character when possible.
• But it has no effect if j>i: the shift is then only one position.
• Unfortunately, it is often the case that j>i.
   12345678901234567
T: spbctbsatpqsctbpq
P:   tpabsat
P:    tpabsat
Here i=3, T[k]=t and R(t)=7>i, so P moves by just one position.
Boyer-Moore
Let x=T[k], the mismatched character in T.
Idea 3: The extended bad character rule says P should be shifted right so that the closest occurrence of x to the left of position i in P is below T[k].
   12345678901234567
T: spbctbsatpqsctbpq
P:   tpabsat
P:     tpabsat
Boyer-Moore
To use the extended bad character rule we need, for each position i of P and each character x in the alphabet Σ, the position of the closest occurrence of x to the left of i.
Approach 1: A two-dimensional array of size n*|Σ|. Space and time: expensive.
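Approach 1 might look like the following Python sketch. It assumes the alphabet is passed in explicitly; closest_left_table is a hypothetical name, and rows are 1-based to match the slides.

```python
def closest_left_table(pattern, alphabet):
    """table[i][x] = closest position of x strictly to the left of position i
    in pattern (1-based), or 0 if there is none. Uses n*|alphabet| entries."""
    n = len(pattern)
    last_seen = {x: 0 for x in alphabet}   # 0 means "no occurrence so far"
    table = [dict(last_seen)]              # row 0 is a placeholder so rows are 1-based
    for i in range(1, n + 1):
        table.append(dict(last_seen))      # occurrences strictly to the left of i
        last_seen[pattern[i - 1]] = i
    return table


table = closest_left_table("tpabsat", "abpst")
print(table[3]["t"])   # 1
print(table[7]["a"])   # 6
```

Every row copies the whole alphabet, which is where the n*|Σ| cost comes from.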
Boyer-Moore
Approach 2: Scan P from right to left and, for each character x, maintain a list of the positions where x occurs (in decreasing order).
P: tpabsat
t: 7, 1
a: 6, 3
…
When P[i] mismatches T[k] (let x=T[k]), scan x’s list for the first number j that is less than i, and shift P to the right so that P[j] is below T[k]. If no such j is found, shift P past T[k].
Space and time: linear.
   12345678901234567
T: spbctbsatpqsctbpq
P:   tpabsat
P:     tpabsat
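A small Python sketch of Approach 2, under the same 1-based convention. The names occurrence_lists and extended_bad_character_shift are illustrative only; "shift P past T[k]" is taken to mean a shift of i positions, so that P[1] lands just to the right of T[k].

```python
def occurrence_lists(pattern):
    """For each character, its 1-based positions in pattern, in decreasing order."""
    lists = {}
    for pos in range(len(pattern), 0, -1):      # scan P from right to left
        lists.setdefault(pattern[pos - 1], []).append(pos)
    return lists


def extended_bad_character_shift(lists, i, text_char):
    """Shift for a mismatch at pattern position i against text character x."""
    for j in lists.get(text_char, []):
        if j < i:            # closest occurrence of x to the left of i
            return i - j     # puts P[j] below T[k]
    return i                 # no such occurrence: move P past T[k]


lists = occurrence_lists("tpabsat")
print(lists["t"], lists["a"])                       # [7, 1] [6, 3]
print(extended_bad_character_shift(lists, 3, "t"))  # 2
```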
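Boyer-Moore
Idea 4: Strong good suffix rule
• t is a suffix of P that matches a substring t of T.
• x is the character of T just to the left of t, y is the character of P just to the left of the suffix t, and x≠y.
• t’ is the right-most copy of t in P such that t’ is not a suffix of P and the character z just to the left of t’ satisfies z≠y.
T: … x t …
P: … z t’ … y t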
Boyer-Moore
The strong good suffix rule says: (1) if t’ exists, then shift P to the right so that t’ in P is below t in T.
   123456789012345678
T: prstabstubabvqxrst
P:   qcabdabdab
P:         qcabdabdab
Here t=ab, x=b, y=d, and t’ is the copy of ab at positions 3-4 of P, preceded by z=c.
Boyer-Moore
The extended bad character rule focuses on characters. The strong good suffix rule focuses on substrings.
How do we get the information needed for the strong good suffix rule? That is, for a given t, how do we find t’?
Boyer-Moore
L’(i): For each i, L’(i) is the largest position less than n such that the substring P[i,…,n] matches a suffix of P[1,…,L’(i)], with the additional requirement that the character preceding that suffix is not equal to P[i-1]. If there is no such position, L’(i)=0.
Let t=P[i,…,n]; then L’(i) is the right end-position of t’.
T: prstabstubabvqxrst
   1234567890
P: qcabdabdab
L’(9)=4, L’(10)=0, L’(8)=?, L’(7)=?, L’(6)=?
Boyer-Moore
Let t=P[i,…,n]; then L’(i) is the right end-position of t’. Thus, to use the strong good suffix rule, we need to find L’(i) for every i=1,…,n.
For pattern P, N_j is the length of the longest substring that ends at j and that is also a suffix of P.
(Figure: P with a substring t’ ending at position j and preceded by y, and the suffix t preceded by x, where t=t’, N_j=|t’|=|t|, and x≠y.)
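Boyer-Moore
N_j: the length of the longest substring of P that ends at j and that is also a suffix of P.
Z_i: the length of the longest substring of P that starts at i and matches a prefix of P.
(Figure: for N_j, a substring t’ ending at j and preceded by y matches the suffix t preceded by x; for Z_i, a substring t’ starting at i and followed by x matches the prefix t followed by y.)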
Boyer-Moore
N is the reverse of Z!
P: the pattern. P^r: the string obtained by reversing P. Then N_j(P) = Z_{n-j+1}(P^r).
          1 2 3 4 5 6 7 8 9 0
P:        q c a b d a b d a b
P^r:      b a d b a d b a c q
N_j(P):   0 0 0 2 0 0 5 0 0 0
Z_i(P^r): 0 0 0 5 0 0 2 0 0 0
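The relation between N and Z can be checked with a short Python sketch. The slides do not give the Z algorithm itself, so z_array below is one standard linear-time formulation, included as an assumption; it is 0-based internally and leaves the first entry at 0, which matches the convention in the table above. The names z_array and n_array are illustrative.

```python
def z_array(s):
    """z[i] = length of the longest substring of s starting at index i (0-based)
    that matches a prefix of s. z[0] is left at 0 by convention."""
    n = len(s)
    z = [0] * n
    left = right = 0                      # current Z-box is s[left:right]
    for i in range(1, n):
        if i < right:                     # reuse information from the Z-box
            z[i] = min(right - i, z[i - left])
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1                     # extend by explicit comparison
        if i + z[i] > right:
            left, right = i, i + z[i]
    return z


def n_array(pattern):
    """N_j for j=1..n as a list whose index 0 is unused padding."""
    n = len(pattern)
    z = z_array(pattern[::-1])
    # N_j(P) = Z_{n-j+1}(P^r); with a 0-based z this is z[n - j]
    return [0] + [z[n - j] for j in range(1, n + 1)]


print(z_array("badbadbacq"))       # [0, 0, 0, 5, 0, 0, 2, 0, 0, 0]
print(n_array("qcabdabdab")[1:])   # [0, 0, 0, 2, 0, 0, 5, 0, 0, 0]
```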
Boyer-Moore
For pattern P, N_j (for j=1,…,n) can be calculated in O(n) time using the Z algorithm.
Why do we need to define N_j? To use the strong good suffix rule, we need to find L’(i) for every i=1,…,n, and we can get L’(i) from N_j!
Boyer-Moore
For position i, let t=P[i,…,n]. L’(i) is the largest position j less than n such that N_j=|t|.
          1 2 3 4 5 6 7 8 9 0
P:        q c a b d a b d a b
P^r:      b a d b a d b a c q
N_j(P):   0 0 0 2 0 0 5 0 0 0
Z_i(P^r): 0 0 0 5 0 0 2 0 0 0
L’(i):    0 0 0 0 0 7 0 0 4 0
Boyer-Moore
How do we obtain L’(i) from N_j in linear time?
Input: Pattern P
Output: L’(i) for i=1,…,n
Algorithm
  Calculate N_j for j=1,…,n using the Z algorithm
  for i=1; i<=n; i++
    L’(i)=0
  for j=1; j<n; j++
    if N_j>0            // N_j=0 would give i=n+1, which is outside 1,…,n
      i=n-N_j+1
      L’(i)=j
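The same computation as a Python sketch. The function name big_l_prime is illustrative; the input is assumed to be the list of N_j values with an unused entry at index 0 so that indices match the 1-based slides, and entries with N_j=0 are skipped because they would point at position n+1.

```python
def big_l_prime(n_values):
    """Compute L'(i) for i=1..n from N_j (both lists have unused index 0)."""
    n = len(n_values) - 1
    l_prime = [0] * (n + 1)
    for j in range(1, n):                 # j = 1 .. n-1, in increasing order
        if n_values[j] > 0:               # N_j = 0 would give i = n + 1
            i = n - n_values[j] + 1
            l_prime[i] = j                # later (larger) j overwrites smaller j
    return l_prime


N = [0, 0, 0, 0, 2, 0, 0, 5, 0, 0, 0]     # N_j for P = qcabdabdab
print(big_l_prime(N)[1:])                 # [0, 0, 0, 0, 0, 7, 0, 0, 4, 0]
```

Because j increases, the value left in L’(i) at the end is the largest j with N_j=|P[i,…,n]|, as the definition requires.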
Boyer-Moore
The strong good suffix rule says: (1) if t’ exists, then shift P to the right so that t’ in P is below t in T.
   123456789012345678
T: prstabstubabvqxrst
P:   qcabdabdab
i=9; L’(9)=4, so P is shifted by n-L’(9)=10-4=6 positions:
P:         qcabdabdab
Boyer-Moore
The strong good suffix rule:
(1) If a mismatch occurs at position i-1 of P and L’(i)>0 (i.e., t’ exists), then using the strong good suffix rule we can shift P by n-L’(i) positions to the right.
(2) What if a mismatch occurs at position i-1 of P and L’(i)=0 (i.e., t’ does not exist)? We can shift P at least as far as shown in the figure.
(Figure: T with x followed by t; P with y followed by its suffix t at positions i,…,n; and the shifted P.)
Boyer-Moore
But we can do more than that!
(Figure: T with x followed by t; P with y followed by its suffix t at positions i,…,n; and the shifted P.)
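Boyer-Moore
Observation 1: If b is a prefix of P that is also a suffix of P, then…
(Figure: T with x followed by t, where b’ is a suffix of t; P with prefix b and with y followed by t at positions i,…,n; and the shifted P.)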
Boyer-Moore
Observation 2: If there is more than one candidate for b, then shift P by the least amount.
(Figure: two candidate shifts P1 and P2 of P, with prefixes b and b’ aligned with suffixes of t.)
Boyer-Moore
The strong good suffix rule: When a mismatch occurs at position i-1 of P,
(1) If L’(i)>0 (i.e., t’ exists), then using the strong good suffix rule we can shift P by n-L’(i) positions to the right.
(2) Else, if L’(i)=0 (i.e., t’ does not exist), we can shift P past the left end of t by the least amount such that a prefix of the shifted pattern matches a suffix of t.
Boyer-Moore
l’(i): the length of the largest suffix of P[i,…,n] that is also a prefix of P. If none exists, then l’(i)=0.
l’(i) is the length of the overlap between the unshifted and shifted patterns.
Boyer-Moore
l’(i) equals the largest j≤|P[i,…,n]| such that N_j=j.
1. If N_j=j, then the prefix b=P[1,…,j] of P is also a suffix of P.
2. We want the largest such j.
Boyer-Moore
l’(i) equals the largest j≤|P[i,…,n]| such that N_j=j.
        1 2 3 4 5 6 7 8 9 0
P:      a b d a b a b d a b
N_j:    0 2 0 0 5 0 2 0 0 0
l’(i):  5 5 5 5 5 5 2 2 2 0
Boyer-Moore
How do we calculate l’(i) from N_j in linear time?
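One possible answer, sketched in Python: walk i from n down to 1, so that the allowed length j=n-i+1 grows by one each step, and remember the largest j seen so far with N_j=j. The name small_l_prime is illustrative, and the input is assumed to use the slides' table convention (index 0 unused, N_n=0).

```python
def small_l_prime(n_values):
    """Compute l'(i) for i=1..n from N_j (both lists have unused index 0)."""
    n = len(n_values) - 1
    l_prime = [0] * (n + 1)
    best = 0                              # largest j so far with N_j = j
    for i in range(n, 0, -1):
        j = n - i + 1                     # the newly allowed length |P[i..n]|
        if n_values[j] == j:              # prefix of length j is also a suffix
            best = j
        l_prime[i] = best
    return l_prime


N = [0, 0, 2, 0, 0, 5, 0, 2, 0, 0, 0]     # N_j for P = abdababdab
print(small_l_prime(N)[1:])               # [5, 5, 5, 5, 5, 5, 2, 2, 2, 0]
```

Each position is visited once, so the running time is linear in n.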
Boyer-Moore
The strong good suffix rule: When a mismatch occurs at position i-1 of P,
(1) If L’(i)>0 (i.e., t’ exists), then using the strong good suffix rule we can shift P by n-L’(i) positions to the right.
(2) Else, if L’(i)=0 (i.e., t’ does not exist), we can shift P past the left end of t by the least amount such that a prefix of the shifted pattern matches a suffix of t, that is, by n-l’(i) positions to the right.
Boyer-Moore
What if a match is found? We could shift P by one position… but we can do better: shift P by the least amount such that a prefix of the shifted pattern matches a suffix of t (here t is the whole of P), that is, shift P to the right by n-l’(2).
Boyer-Moore
The strong good suffix rule: When a mismatch occurs at position i-1 of P,
(1) If L’(i)>0 (i.e., t’ exists), then using the strong good suffix rule we can shift P by n-L’(i) positions to the right.
(2) Else, if L’(i)=0 (i.e., t’ does not exist), we can shift P past the left end of t by the least amount such that a prefix of the shifted pattern matches a suffix of t, that is, by n-l’(i) positions to the right.
(3) If a match is found, then shift P to the right by n-l’(2).
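The three cases can be collected into one Python function. This is a sketch under stated assumptions: good_suffix_shift is a hypothetical name, both tables are 1-based lists with an unused entry at index 0, and a full match is signalled by passing 0 as the mismatch position. The slides do not cover a mismatch on the very last character of P (where t is empty); the sketch assumes a shift of one position in that case.

```python
def good_suffix_shift(big_l, small_l, mismatch_pos):
    """Shift given by the strong good suffix rule.

    big_l[i] = L'(i), small_l[i] = l'(i), both with index 0 unused.
    mismatch_pos is the 1-based position i-1 of the mismatch in P,
    or 0 if all of P matched.
    """
    n = len(big_l) - 1
    if mismatch_pos == 0:                       # case (3): a match was found
        overlap = small_l[2] if n >= 2 else 0
        return n - overlap
    i = mismatch_pos + 1
    if i > n:                                   # assumption: t is empty, shift by 1
        return 1
    if big_l[i] > 0:                            # case (1): t' exists
        return n - big_l[i]
    return n - small_l[i]                       # case (2): use prefix overlap


# Tables for P = qcabdabdab (no prefix of P is also a proper suffix, so l' is all 0)
big_l = [0, 0, 0, 0, 0, 0, 7, 0, 0, 4, 0]
small_l = [0] * 11
print(good_suffix_shift(big_l, small_l, 8))     # 6
```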
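Boyer-Moore
The extended bad character rule vs. the strong good suffix rule
   123456789012345678
T: prstabstuqabvqxrst
P:   qcabdabdab
P:          qcabdabdab   (extended bad character rule: shift by 7)
P:         qcabdabdab    (strong good suffix rule: shift by 6)

   123456789012345678
T: prstabstubabvqxrst
P:   qcabdabdab
P:    qcabdabdab         (extended bad character rule: shift by 1)
P:         qcabdabdab    (strong good suffix rule: shift by 6)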
Boyer-Moore
Shift P by the largest amount given by either of the rules. That results in the Boyer-Moore algorithm!
Input: Text T (length m) and pattern P (length n)
Output: The occurrences of P in T
Algorithm Boyer-Moore
  Compute L’(i), l’(i), and R(x)
  k=n
  while (k≤m) do
    i=n
    h=k
    while i>0 and P[i]=T[h] do
      i--
      h--
    if i=0
      report an occurrence of P in T ending at position k
      k=k+n-l’(2)
    else
      shift P (increase k) by the maximum amount determined by the extended bad character rule and the good suffix rule
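Putting the pieces together, the algorithm might be sketched in Python as below. This is an illustration under assumptions rather than a reference implementation: all function names are made up here, the Z algorithm is a standard formulation not given on the slides, occurrence lists stand in for R(x) to implement the extended bad character rule, and a mismatch on the last character of P uses a good suffix shift of 1. Occurrences are reported by their 1-based end position k, as in the pseudocode.

```python
def z_array(s):
    """z[i] = length of the longest substring of s starting at index i
    (0-based) that matches a prefix of s; z[0] is left at 0."""
    n = len(s)
    z = [0] * n
    left = right = 0
    for i in range(1, n):
        if i < right:
            z[i] = min(right - i, z[i - left])
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1
        if i + z[i] > right:
            left, right = i, i + z[i]
    return z


def preprocess(pattern):
    """Return (L', l', occurrence lists); L' and l' are 1-based with padding."""
    n = len(pattern)
    z = z_array(pattern[::-1])
    big_n = [0] + [z[n - j] for j in range(1, n + 1)]   # N_j = Z_{n-j+1}(P^r)
    big_l = [0] * (n + 2)
    for j in range(1, n):
        if big_n[j] > 0:
            big_l[n - big_n[j] + 1] = j
    small_l = [0] * (n + 2)
    best = 0
    for i in range(n, 0, -1):
        j = n - i + 1
        if big_n[j] == j:
            best = j
        small_l[i] = best
    positions = {}
    for pos in range(n, 0, -1):                         # decreasing positions
        positions.setdefault(pattern[pos - 1], []).append(pos)
    return big_l, small_l, positions


def boyer_moore(text, pattern):
    """Return the 1-based end positions of all occurrences of pattern in text."""
    n, m = len(pattern), len(text)
    if n == 0 or n > m:
        return []
    big_l, small_l, positions = preprocess(pattern)
    hits = []
    k = n                                   # text position aligned with P[n]
    while k <= m:
        i, h = n, k
        while i > 0 and pattern[i - 1] == text[h - 1]:
            i -= 1
            h -= 1
        if i == 0:
            hits.append(k)
            k += n - small_l[2]             # small_l has n+2 entries, so index 2 exists
        else:
            bad = i                         # default: move P past T[h]
            for j in positions.get(text[h - 1], []):
                if j < i:
                    bad = i - j
                    break
            if i == n:
                good = 1                    # assumption: no matched suffix yet
            elif big_l[i + 1] > 0:
                good = n - big_l[i + 1]
            else:
                good = n - small_l[i + 1]
            k += max(bad, good)
    return hits


print(boyer_moore("abababab", "abab"))      # [4, 6, 8]
```

The inner scan over an occurrence list is the simple version described in Approach 2; it keeps the sketch short at the cost of a scan per mismatch.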