Faster algorithms for string matching with k mismatches. Adviser : R. C. T. Lee Speaker: C. C. Yen. Journal of Algorithms , Volume 50, Issue 2 , February 2004 , Pages 257-275 Amihood Amir, Moshe Lewenstein and Ely Porat. String matching with k mismatches.
Faster algorithms for string matching with k mismatches
Adviser: R. C. T. Lee
Speaker: C. C. Yen
Amihood Amir, Moshe Lewenstein and Ely Porat, Journal of Algorithms, Volume 50, Issue 2, February 2004, Pages 257-275
String matching with k mismatches
Input: A text T of length n, a pattern P of length m, and a mismatch threshold k.
Output: Every substring S of T of length m with HD(S,P) ≤ k, where HD(S,P) is the Hamming distance between S and P.
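To make the problem statement concrete, here is a minimal brute-force sketch in Python. The function names `hamming` and `k_mismatch_naive` are made up for illustration; this is the O(nm) baseline that the faster algorithms improve on, not a method from the paper.

```python
def hamming(s, p):
    """Number of positions where the equal-length strings s and p differ."""
    return sum(1 for a, b in zip(s, p) if a != b)


def k_mismatch_naive(text, pattern, k):
    """Return every start index i with HD(text[i:i+m], pattern) <= k."""
    m = len(pattern)
    return [i for i in range(len(text) - m + 1)
            if hamming(text[i:i + m], pattern) <= k]
```

Every later case must report exactly the same positions as this baseline, only faster.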
The basic idea of the following algorithms
• The authors consider the number of distinct symbols in the pattern and design a separate algorithm for each case.
Example: P = ACAABD. The number of distinct symbols in P is 4.
Three cases of the number of distinct symbols in the pattern
The paper discusses the following three cases; k is the maximal number of mismatches allowed.
• There are at least 2k distinct symbols.
• There are fewer than 2√k distinct symbols.
• The number of distinct symbols is between 2√k and 2k.
Case 1: At least 2k distinct symbols
There are two stages in the algorithm.
1. Marking: identify potential starts of the pattern and do a crude pruning of the potential candidates.
2. Verification: verify which of the potential candidates are actual occurrences of the pattern.
In this case, the algorithm solves the string matching with k mismatches problem in linear time.
The basic idea for this case is as follows:
• Let A = {a1, a2, …, a2k} be a set of 2k distinct symbols appearing in P.
• Let P’ be the shortest prefix of P containing all symbols of A.
• Let the length of P’ be C.
• Let S be a substring of T of length C.
• Consider the 2k positions of P’ where the symbols of A first appear, and suppose S matches P’ at d of them.
• Then, obviously, among these 2k positions of P’, there are 2k-d mismatches.
• If d < k, then 2k-d > k, and we may ignore S entirely.
[Figure: S aligned with P’, both of length C, with d matching positions]
But how can we determine d? We may use a position table.
Marking stage of Case 1
• Let {a1, …, a2k} be 2k different symbols appearing in the pattern, and let ij be the smallest index in the pattern where aj appears, j = 1, …, 2k.
• Create a position table in which S0 … S2k-1 are the distinct symbols of pattern P and pos0 … pos2k-1 are the locations of their first appearance in P.
Example (k = 2):
index: 0 1 2 3 4 5 6
P = A C A B D A E
T = ACBBDACTADIKQDABD…. = T0 … Tn-1
symbols: S0 S1 S2 S3
positions: pos0 pos1 pos2 pos3
Position table (k = 2):
         S0    S1    S2    S3
symbols  A     C     B     D
pos      0     1     3     4
         pos0  pos1  pos2  pos3
S0 … S3 are the 2k distinct symbols of pattern P, and pos0 … pos3 are the locations of their first appearance in P.
X is an array of size n, and all elements of X are 0 initially. We scan the text T: for each ti, 0 ≤ i ≤ n-1, if we can find a j, 0 ≤ j ≤ 2k-1, such that ti = sj, we add 1 to location i - posj of X. If i - posj is less than 0, we ignore it.
P = ACABDAE (indices 0 … 6)
T = ACBBDACTADIKQDABD…. = T0 … Tn-1
X = 00000000000000000….
After the scanning is completed, we obtain the following array:
index: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
X =    4 0 0 0 0 3 0 0 1 1 0  0  0  0  0  0  0
For every X(a) = b, we know that there are b matches among the 2k distinct characters between T(a, a+c-1) and P(0, c-1), where c is the length of P’. Hence there are at least 2k-b mismatches. If b < k, then 2k-b > k, and we may ignore T(a, a+c-1).
In our example, we need to examine only T(0,4) and T(5,9). We ignore all other substrings.
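The marking stage can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: the name `mark_candidates` is made up, the 2k symbols are taken to be the first 2k distinct symbols of P, and X is kept only for valid start positions 0 … n-m, so it is shorter than the size-n array on the slide.

```python
def mark_candidates(text, pattern, k):
    """Marking stage for a pattern with at least 2k distinct symbols.

    Returns (pos, x, candidates): the position table, the counter array X
    restricted to valid start positions, and the positions with X >= k.
    """
    # Position table: first occurrence of each of the first 2k distinct symbols.
    pos = {}
    for idx, ch in enumerate(pattern):
        if ch not in pos:
            pos[ch] = idx
            if len(pos) == 2 * k:
                break
    if len(pos) < 2 * k:
        raise ValueError("pattern has fewer than 2k distinct symbols")

    n, m = len(text), len(pattern)
    x = [0] * (n - m + 1)
    for i, ch in enumerate(text):
        if ch in pos:
            start = i - pos[ch]
            if 0 <= start <= n - m:   # ignore starts outside the text
                x[start] += 1

    # A start with fewer than k marks has more than k mismatches: prune it.
    candidates = [a for a, b in enumerate(x) if b >= k]
    return pos, x, candidates
```

On the slide's example (P = ACABDAE, T = ACBBDACTADIKQDABD, k = 2) this keeps only positions 0 and 5, the same two substrings that the slide says must be examined.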
Lemma 1
• For Case 1, let n denote the length of the text and k the maximal number of mismatches allowed. There are at most n/k candidate locations.
Proof: The total number of additions to the X array is at most n, because the algorithm tests each T(i), i = 1, 2, …, n, once and adds at most 1 for it. Let a be the number of locations whose counts are at least k. Then a·k ≤ n, so a ≤ n/k.
By Lemma 1, we know that at most n/k candidate locations remain.
• But not all candidate locations are starting points of matches with at most k mismatches.
Take T(5) as an example:
P = ACABDAE
T = ACBBDACTADIKQDABD….
X = 40000300000000000….
Location 5 has three matches among the marked symbols, but there are four other mismatches, so it is not a starting point of a match with at most k = 2 mismatches.
We must verify which candidate locations are starting points of matches with at most k mismatches.
Verification stage of Case 1
• The authors use the Kangaroo Method to verify in O(k) time whether a candidate location has at most k mismatches.
P = ETBDBCCDFDC
T = ABCCABDADBDETADBAADFDAAEERDXTDADCT…
We shall not elaborate on this method because it was presented before.
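As a brief reminder of how the verification works, here is a minimal Python sketch of the kangaroo jumps. Both function names are made up, and `lce_naive` is an assumption that stands in for the constant-time longest-common-extension query (normally answered with a suffix tree and LCA preprocessing). With the naive query the loop structure is the same, but the O(k) bound per candidate does not hold.

```python
def lce_naive(text, i, pattern, j):
    """Length of the longest common prefix of text[i:] and pattern[j:]."""
    length = 0
    while (i + length < len(text) and j + length < len(pattern)
           and text[i + length] == pattern[j + length]):
        length += 1
    return length


def kangaroo_verify(text, pattern, start, k):
    """True if pattern occurs at text[start:] with at most k mismatches."""
    m = len(pattern)
    if start < 0 or start + m > len(text):
        return False
    j = 0
    mismatches = 0
    while j < m:
        j += lce_naive(text, start + j, pattern, j)  # jump over a matching run
        if j >= m:
            break
        mismatches += 1          # pattern[j] differs from text[start + j]
        if mismatches > k:
            return False         # stop after at most k + 1 jumps
        j += 1
    return True
```

For the Case 1 example (P = ACABDAE, T = ACBBDACTADIKQDABD, k = 2), candidate 0 is accepted and candidate 5 is rejected after its third mismatch.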
Time complexity of Case 1
• The marking stage takes O(n) time, where n is the length of the text.
• According to Lemma 1, we have at most n/k candidate locations.
• Using the Kangaroo Method, we take O(k) time to verify one remaining candidate location.
• Thus, the verification stage takes (n/k)·O(k) = O(n) time.
Case 2: Fewer than 2√k distinct symbols
• We can use the Boolean Convolution method to solve the problem for this case.
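Thus, it is obvious that the Hamming distance can be found by convolution.
• Let A = abac and B = acdc. For this case, HD(A,B) = 2.
Convolution (B is written in reverse; an entry is 1 where the two symbols are equal):
            a  b  a  c
            c  d  c  a
            ----------
            1  0  1  0    (a)
         0  0  0  1       (c)
      0  0  0  0          (d)
   0  0  0  1             (c)
   -------------------
   0  0  0  2  0  2  0
The middle entry gives 2 matches, so HD(A,B) = 4 - 2 = 2.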
Using Fast Fourier Transforms (FFT), one Boolean Convolution can be done in O(n log m) time.
• We need one convolution per symbol, and our alphabet size is less than 2√k.
• We take O(n√k log m) time to solve the problem for Case 2.
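The per-symbol counting behind Case 2 can be sketched as follows. This is an illustration only: the function names are made up, and the inner sum is a direct O(nm) correlation standing in for the FFT-based convolution, so it shows what is computed rather than how fast it can be computed.

```python
def match_counts(text, pattern):
    """For each alignment i, count positions j with text[i+j] == pattern[j]."""
    n, m = len(text), len(pattern)
    counts = [0] * (n - m + 1)
    for sym in set(pattern):
        # Indicator (Boolean) vectors for this one symbol.
        t_ind = [1 if c == sym else 0 for c in text]
        p_ind = [1 if c == sym else 0 for c in pattern]
        for i in range(n - m + 1):
            # One entry of the correlation of the two indicator vectors.
            counts[i] += sum(t_ind[i + j] * p_ind[j] for j in range(m))
    return counts


def hamming_by_convolution(text, pattern):
    """Hamming distance at every alignment: m minus the number of matches."""
    m = len(pattern)
    return [m - c for c in match_counts(text, pattern)]
```

The outer loop runs once per distinct pattern symbol, which is why a small alphabet keeps the total cost low once each inner correlation is replaced by an FFT.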
Case 3: The number of distinct symbols is between 2√k and 2k
Definition: a frequent symbol is a symbol that appears in the pattern at least 2√k times.
Example: k = 2, 2√k ≈ 2.83, P = baccdbdd. d appears 3 times, so d is a frequent symbol.
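The definition of a frequent symbol can be checked with a short Python sketch. It assumes the threshold of 2√k occurrences, which is what the slide examples imply; the function name `frequent_symbols` is made up. The test c·c ≥ 4k is the same as c ≥ 2√k but avoids floating point.

```python
from collections import Counter


def frequent_symbols(pattern, k):
    """Symbols occurring at least 2*sqrt(k) times in pattern (assumed threshold)."""
    counts = Counter(pattern)
    return sorted(s for s, c in counts.items() if c * c >= 4 * k)
```

For P = baccdbdd and k = 2 only d qualifies; for P = ABCAABBDBAA and k = 4 both A and B qualify.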
Two sub-cases of Case 3
Case 3-1: There are at least √k frequent symbols in the pattern.
Case 3-2: There are fewer than √k frequent symbols in the pattern.
Case 3-1: at least √k frequent symbols
• There are two stages in the algorithm for this case.
(1) Marking stage
• Identify potential starts of the pattern and do a crude pruning of the potential candidates.
(2) Verification stage
• Verify which of the potential candidates are actual occurrences of the pattern.
• The verification stage is done by the Kangaroo Method.
Marking stage of Case 3-1
We arbitrarily pick √k frequent symbols, keep the first 2√k occurrences of each in the pattern, replace every other pattern position with Φ (“don’t care”), and so convert this problem to a mismatch problem with “don’t care”.
Example: Let P = ABCAABBDBAA and k = 4. There are 4 distinct symbols in P (4 is between 2√k = 4 and 2k = 8), and ‘A’ and ‘B’ are frequent symbols. There are 2 (= √k) frequent symbols.
T = ABCABDCABBCFADDABC
Mismatch problem with “don’t care”
Input: A text T of length n and a pattern P of length m, in which g characters are not “don’t care” symbols and the rest are Φ (“don’t care”).
Output: The number of mismatches between the pattern and each substring of T of length m. Only mismatches of the g pattern characters that are not “don’t care” are counted.
Example:
P = A B Φ A A B B Φ B A Φ
T = A B C A B D C A B B C F A D D A B D
The number of mismatches at the first four alignments: 4 7 7 2
The mismatch problem with “don’t care” can be solved in O(n√g log m) time (Amir et al., 1997), where n is the length of text T, m is the length of pattern P, and g is the number of characters in the pattern which are not “don’t care” symbols.
Example (k = 4):
P = A B Φ A A B B Φ B A Φ
T = A B C A B D C A B B C F A D D A B D
The number of mismatches: 4 7 7 2 6 8 7 6
All locations with at most k mismatches of frequent symbols are our candidate locations, where matches with at most k mismatches may start.
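The numbers in this example can be reproduced with a small Python sketch. Two assumptions to note: the names `mask_pattern` and `mismatches_with_dont_care` are made up, and the counting is a direct O(nm) scan rather than the faster “don’t care” algorithm cited on the slides. The mask keeps the first four occurrences of each chosen symbol, which is the choice that yields the masked pattern shown above.

```python
WILDCARD = "Φ"


def mask_pattern(pattern, keep, per_symbol):
    """Keep the first per_symbol occurrences of each symbol in keep;
    replace every other position with the wildcard."""
    seen = {}
    out = []
    for ch in pattern:
        seen[ch] = seen.get(ch, 0) + 1
        out.append(ch if ch in keep and seen[ch] <= per_symbol else WILDCARD)
    return "".join(out)


def mismatches_with_dont_care(text, masked):
    """Mismatch count at every alignment; wildcard positions never count."""
    m = len(masked)
    return [
        sum(1 for j in range(m)
            if masked[j] != WILDCARD and masked[j] != text[i + j])
        for i in range(len(text) - m + 1)
    ]
```

With k = 4, the locations whose count is at most k are 0 and 3, so only these two are passed on to the verification stage.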
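Lemma 2 for Case 3-1
Let {a1, …, a√k} be frequent symbols. Then there exist in the text at most 2n/√k locations where there is a pattern occurrence with no more than k errors.
Proof: Each text symbol T(i), i = 1, 2, …, n, can match at most 2√k of the 2k retained pattern positions, so the total number of marks is at most 2n√k. A location with at most k mismatches among the 2k retained positions has at least k marks. Let a be the number of such locations. Then a·k ≤ 2n√k, so a ≤ 2n/√k.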
We convert the marking stage to a mismatch problem with “don’t care”, which takes O(n√k log m) time because g = 2k.
• According to Lemma 2 for Case 3-1, there are at most 2n/√k candidate locations, and we take O(k) time to verify one candidate location.
• The verification stage for Case 3-1 takes O(n√k) time.
Case 3-2: fewer than √k frequent symbols
First, we count the mismatches of the frequent symbols, and then convert all frequent symbols to Φ (“don’t care” symbol) to count the mismatches of the remaining symbols.
Example: Let P = ABCAABGDBAA and k = 5. There are 5 distinct symbols in P (5 is between 2√k ≈ 4.47 and 2k = 10), and ‘A’ is the only frequent symbol. There is 1 (< √k ≈ 2.24) frequent symbol.
T = ABCABDCABBCFADDABC
Two cases are discussed after we convert all frequent symbols to Φ.
3-2-1: There are fewer than 2k remaining symbols.
3-2-2: There are at least 2k remaining symbols.
Case 3-2-1: There are fewer than 2k remaining symbols
There are fewer than 2k remaining symbols and the rest are “don’t care” symbols. Finding the mismatches of the remaining symbols can be solved as a mismatch problem with “don’t care” and takes O(n√k log m) time.
Example:
P’ = Φ B C Φ Φ B G D B Φ Φ
T = A B C A B D C A B B C F A D D A B C
Mismatches of remaining symbols at the first five alignments: 3 5 6 4 4
All locations which have at most k mismatches in total, counting both the frequent symbols and the remaining symbols, are the matches we want.
Conclusion: The problem for Case 3-2-1 can be solved in O(n√k log m) time.
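The split used in Case 3-2-1 can be illustrated with the following Python sketch. It assumes the 2√k threshold for frequent symbols, uses a made-up function name, and counts both parts by a direct scan instead of convolutions and the “don’t care” algorithm. The point is only that the two partial counts add up to the full Hamming distance.

```python
from collections import Counter


def case_3_2_1_counts(text, pattern, k):
    """Return (remaining, total) mismatch counts for every alignment.

    remaining[i]: mismatches at pattern positions holding non-frequent symbols
    total[i]:     remaining[i] plus mismatches at frequent-symbol positions
    """
    n, m = len(text), len(pattern)
    occ = Counter(pattern)
    # Assumed threshold: frequent means at least 2*sqrt(k) occurrences.
    frequent = {s for s, c in occ.items() if c * c >= 4 * k}
    remaining, total = [], []
    for i in range(n - m + 1):
        window = text[i:i + m]
        freq_mis = sum(1 for p, t in zip(pattern, window)
                       if p in frequent and p != t)
        rest_mis = sum(1 for p, t in zip(pattern, window)
                       if p not in frequent and p != t)
        remaining.append(rest_mis)
        total.append(freq_mis + rest_mis)
    return remaining, total
```

For P = ABCAABGDBAA, T = ABCABDCABBCFADDABC and k = 5, the first five entries of `remaining` are the values 3 5 6 4 4 from the example, and a location is reported when its entry in `total` is at most k.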
Case 3-2-2: There are at least 2k remaining symbols
• There are two stages in the algorithm for this case.
(1) Marking stage
• Identify potential starts of the pattern and do a crude pruning of the potential candidates.
(2) Verification stage
• Verify which of the potential candidates are actual occurrences of the pattern.
• The verification stage is done by the Kangaroo Method.
Marking stage of Case 3-2-2
We arbitrarily pick 2k remaining symbols and convert all other symbols to Φ (“don’t care” symbols). The marking stage of Case 3-2-2 can then be solved as a mismatch problem with “don’t care” in O(n√k log m) time.
Conclusion: The problem for Case 3-2-2 can be solved in O(n√k log m) time.