Rules for Approximate String Matching R.C.T. Lee
Rule 1 Consider two substrings A1 = P1S1 and A2 = P2S2, where P1 and P2 are prefixes and S1 and S2 are the corresponding suffixes. If ed(A1, A2) ≦ k and S1 = S2, then ed(P1, P2) ≦ k.
Rule 1:[AKLLLR2000], [H2005], [HHLS2006], [JB2000], [LV89], [NB99], [NB2000], [S80], [TU93], and [WM92].
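Rule 1 can be checked on a small case with a standard dynamic-programming edit distance. This is only a sketch: the function name `edit_distance` and the sample strings are illustrative assumptions, and ed is assumed to be the unit-cost Levenshtein distance (insertion, deletion, substitution).

```python
def edit_distance(a: str, b: str) -> int:
    """Unit-cost Levenshtein distance between a and b (assumed meaning of ed)."""
    prev = list(range(len(b) + 1))          # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # deleting the first i characters of a
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


# Hypothetical example: A1 = P1 + S and A2 = P2 + S share the suffix S.
p1, p2, s = "kitten", "sitting", "xyz"
k = edit_distance(p1 + s, p2 + s)
# Rule 1: the prefixes are within the same bound k.
assert edit_distance(p1, p2) <= k
```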
Rule 2 Let m be the length of B. If ed(A, B) ≦ k, then the length of A must be between m-k and m+k.
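Rule 2 gives a constant-time filter that can run before any distance computation. A minimal sketch, where the function name `passes_length_filter` is an assumption made for illustration:

```python
def passes_length_filter(a: str, b: str, k: int) -> bool:
    """Necessary condition from Rule 2: ed(a, b) <= k implies |len(a) - len(b)| <= k."""
    m = len(b)
    return m - k <= len(a) <= m + k


# A candidate that fails the filter can be discarded without computing ed.
assert passes_length_filter("abcde", "abc", 2)
assert not passes_length_filter("abcdef", "abc", 2)
```

The filter is only necessary, not sufficient: a candidate that passes still has to be verified.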
Rule 3 If S1 contains S1’ completely and the distance between S1’ and every substring of P is larger than k, then ed(S1, P) > k.
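The precondition of Rule 3 needs the smallest distance between S1’ and any substring of P. One way to compute it, sketched here under the assumption that ed is the unit-cost Levenshtein distance, is the usual dynamic program with the first row set to zero so that an alignment may start anywhere in P. The function name `min_dist_to_any_substring` is hypothetical.

```python
def min_dist_to_any_substring(s: str, p: str) -> int:
    """Smallest ed(s, x) over all substrings x of p (the empty substring included)."""
    prev = [0] * (len(p) + 1)       # the empty prefix of s matches anywhere in p at cost 0
    for i, cs in enumerate(s, start=1):
        curr = [i]                  # s[:i] against the empty substring
        for j, cp in enumerate(p, start=1):
            cost = 0 if cs == cp else 1
            curr.append(min(prev[j] + 1,
                            curr[j - 1] + 1,
                            prev[j - 1] + cost))
        prev = curr
    return min(prev)                # the alignment may end anywhere in p


# Hypothetical example: "qq" is far from every substring of P, so by Rule 3
# any S1 that contains "qq" has ed(S1, P) > k for k = 1.
P, k = "abcde", 1
assert min_dist_to_any_substring("qq", P) > k
```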
Rule 4 For any substring S1 in T, if there exists a substring S2 in P to the left of S1 such that ed(S1, S2) ≦ k, and S2 is the rightmost such substring, then move P to align S1 and S2. (Figure: T containing S1, with P shown before and after the shift that aligns S2 with S1.)
Based upon Rule 3 and Rule 2, we have Rule 5: If the window size is m-k and there exists a substring S1 in the window such that the distance between S1 and every substring of P is larger than k, then we can safely move P. (Figure: T with a window of size m-k containing S1, and P shown before and after the move.)
If Rule 5 is not satisfied, it means the following: For every substring S1 in the window of T, there exists a substring S2 in P such that ed(S1, S2) ≦ k.
Rule 5-1 If Rule 5 is not satisfied, we can only move P by 1 step. (Figure: T with a window of size m-k containing S1, and P shown before and after the 1-step move.)
Rule 6 For two strings A and B of equal length, Hamming Distance(A, B) ≧ Edit Distance(A, B).
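A small comparison makes Rule 6 concrete. The sketch below assumes unit-cost Levenshtein distance for the edit distance; the function names and the sample pair are illustrative only.

```python
def hamming_distance(a: str, b: str) -> int:
    """Number of mismatching positions; defined only for equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def edit_distance(a: str, b: str) -> int:
    """Unit-cost Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


# A cyclic shift: every position mismatches, yet one deletion and one
# insertion are enough to turn a into b.
a, b = "abcdef", "bcdefa"
assert hamming_distance(a, b) >= edit_distance(a, b)
```

For this pair the Hamming distance is 6 while the edit distance is 2, so the gap between the two can be large.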
Rule 7 For strings A and B, if there are k+1 characters in A which do not appear in B, then ed(A, B) > k.
Rule 7-1 Let A and B be two strings. Let there be k+1 characters a1, a2, …, ak+1 in A, where ai is aligned with bi in B. If no ai appears in B[i-k, i+k], then ed(A, B) > k.
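Both filters reduce to counting positions of A that cannot be matched. The sketch below is one possible reading: it assumes 0-based positions, that "ai is aligned with bi" means the two characters sit at the same index, and that B[i-k, i+k] is clipped at the ends of B. The function names are hypothetical.

```python
def rule7_rejects(a: str, b: str, k: int) -> bool:
    """True if more than k positions of a hold a character absent from b."""
    chars_of_b = set(b)
    unmatched = sum(1 for ch in a if ch not in chars_of_b)
    return unmatched > k


def rule7_1_rejects(a: str, b: str, k: int) -> bool:
    """True if more than k positions i of a hold a character absent from b[i-k .. i+k]."""
    unmatched = 0
    for i, ch in enumerate(a):
        neighbourhood = b[max(0, i - k): i + k + 1]   # clipped to the bounds of b
        if ch not in neighbourhood:
            unmatched += 1
    return unmatched > k


# Rule 7: 'x', 'y', 'z' never occur in b, so ed(a, b) > 2.
assert rule7_rejects("abcxyz", "abcdef", 2)
# Rule 7-1 is stronger: every character of a occurs in b, but three of them
# are too far from any occurrence, so ed(a, b) > 1.
assert not rule7_rejects("abcabc", "abcdef", 1)
assert rule7_1_rejects("abcabc", "abcdef", 1)
```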
Rule 8 Let there be two strings A and B. Let B be divided into j pieces B1, B2, …, Bj. If ed(A, B) ≦ k, there is at least one piece Bi and one substring Ai in A such that ed(Ai, Bi) ≦ ⌊k/j⌋.
Rule 8-1 Let A and B be two strings. Let B be divided into j pieces B1, B2, …, Bj. If for every Bi and every substring S of A, ed(S, Bi) > ⌊k/j⌋, then ed(A, B) > k.
Rule 8-2 Let A and B be two strings. Let the lengths of A and B be m+k and m respectively. Let B be divided into j pieces B1, B2, …, Bj. Let AP be a prefix of A. If for every Bi and every substring S of A, ed(S, Bi) > ⌊k/j⌋, then ed(AP, B) > k.
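The partition rules can be turned into a filter along the following lines. This is a sketch under stated assumptions: the per-piece threshold is taken to be ⌊k/j⌋ (the usual pigeonhole bound for j pieces and k errors), B is split into pieces of nearly equal length, and the helper names are hypothetical.

```python
def min_dist_to_any_substring(s: str, p: str) -> int:
    """Smallest unit-cost edit distance between s and any substring of p."""
    prev = [0] * (len(p) + 1)
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, cp in enumerate(p, start=1):
            cost = 0 if cs == cp else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return min(prev)


def split_into_pieces(b: str, j: int) -> list:
    """Divide b into j consecutive pieces whose lengths differ by at most one."""
    base, extra = divmod(len(b), j)
    pieces, start = [], 0
    for idx in range(j):
        end = start + base + (1 if idx < extra else 0)
        pieces.append(b[start:end])
        start = end
    return pieces


def partition_filter_rejects(a: str, b: str, k: int, j: int) -> bool:
    """True if no piece of b is within floor(k/j) of any substring of a (assumed threshold)."""
    threshold = k // j
    return all(min_dist_to_any_substring(piece, a) > threshold
               for piece in split_into_pieces(b, j))


# No piece of b comes close to anything in a, so a can be discarded for k = 3.
assert partition_filter_rejects("xxxxxxxx", "abcdefgh", 3, 2)
# Here the piece "abcd" occurs in a, so the filter cannot discard a.
assert not partition_filter_rejects("zzabcdzz", "abcdefgh", 3, 2)
```

Because every substring of a prefix AP is also a substring of A, the same check on A covers every prefix at once, which is the situation Rule 8-2 describes.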
Rule 9 Let A and B be two strings with lengths m+k and m respectively. Let A’ be the prefix of A with length m-k. Let there be j characters a1, a2, …, aj in A’. Let the number of times that ai appears in A’ and B be N(A’, ai) and N(B, ai) respectively. Let Ci=N(A’, ai)-N(B, ai). Let AP be any prefix of A. If , ed(AP, B)>k.
Rule 9-1 Let A and B be two strings with lengths m+k and m respectively. Let there be j characters a1, a2, …, aj in A. Let the number of times that ai appears in A and B be N(A, ai) and N(B, ai) respectively. Let Ci=N(B, ai)-N(A, ai). Let AP be any prefix of A. If , ed(AP, B)>k.
Rule 10 Let P and T be two strings with lengths m and n respectively. If P matches a substring P’ of T at position i, then any substring S of the window T[i-k, i+m+k], which has length m+2k, may satisfy ed(S, P) ≦ k.
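The window of Rule 10 is the only part of T that needs to be verified around a match. A minimal sketch of the bounds computation, assuming 0-based positions, a half-open slice, and clipping at the ends of T; the function name `candidate_window` and the sample text are illustrative.

```python
def candidate_window(n: int, m: int, i: int, k: int) -> tuple[int, int]:
    """Half-open bounds of T[i-k, i+m+k], clipped to a text of length n."""
    start = max(0, i - k)
    end = min(n, i + m + k)
    return start, end


# Hypothetical example: locate an exact occurrence, then cut out the window
# in which approximate occurrences with at most k errors may lie.
text = "xxxxxxxxxx" + "pattern" + "xxxxxxxxxx"
pattern, k = "pattern", 2
i = text.find(pattern)
start, end = candidate_window(len(text), len(pattern), i, k)
window = text[start:end]
assert len(window) == len(pattern) + 2 * k
```

Near either end of T the window is shorter than m+2k because of the clipping.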
Rule 11 Let P and Q be two strings. Let P be divided into pieces P1, P2, …, Pn. Let Qi be the substring in Q such that ed(Pi, Qi) is the smallest. (Figure: P divided into P1, P2, …, Pn above Q, in which the substrings Q1, Q2, …, Qn are marked.) If
Application of Rule 11 (Figure: a window W in T containing substrings t1, t2, …, tn, with P divided into P1, P2, …, Pn.) ed(ti, Pi) is the smallest. If for some n,
[AKLLLR2000] Text Indexing and Dictionary Matching with One Error, Amir, A., Keselman, D., Landau, G. M., Lewenstein, M., Lewenstein, N. and Rodeh, M., Journal of Algorithms, Vol. 37, 2000, pp. 309-325.
[ALP2004] Faster Algorithms for String Matching with k Mismatches, Amir, A., Lewenstein, M. and Porat, E., Journal of Algorithms, Vol. 50, 2004, pp. 257-275.
[FN2004] Average-Optimal Multiple Approximate String Matching, Kimmo Fredriksson and Gonzalo Navarro, ACM Journal of Experimental Algorithmics, Vol. 9, Article No. 1.4, 2004, pp. 1-47.
[GG86] Improved String Matching with k Mismatches, Galil, Z. and Giancarlo, R., SIGACT News, Vol. 17, No. 4, 1986, pp. 52-54.
[H2005] Bit-parallel Approximate String Matching Algorithms with Transposition, Heikki Hyyrö, Journal of Discrete Algorithms, Vol. 3, 2005, pp. 215-229.
[HHLS2006] Approximate String Matching Using Compressed Suffix Arrays, Trinh N. D. Huynh, W. K. Hon, T. W. Lam and W. K. Sung, Theoretical Computer Science, Vol. 352, 2006, pp. 240-249.
[HN2005] Bit-parallel Witnesses and their Applications to Approximate String Matching, Heikki Hyyrö and Gonzalo Navarro, Algorithmica, Vol. 4, No. 3, 2005, pp. 203-231.
[JB2000] Approximate String Matching Using Factor Automata, Jan Holub and Borivoj Melichar, Theoretical Computer Science, Vol. 249, 2000, pp. 305-311.
[LV86] String Matching with k Mismatches by Using Kangaroo Method, Landau, G. M. and Vishkin, U., Theoretical Computer Science, Vol. 43, 1986, pp. 239-249.
[LV89] Fast Parallel and Serial Approximate String Matching, G. Landau and U. Vishkin, Journal of Algorithms, Vol. 10, 1989, pp. 157-169.
[NB99] Very Fast and Simple Approximate String Matching, G. Navarro and R. Baeza-Yates, Information Processing Letters, Vol. 72, 1999, pp. 65-70.
[NB2000] A Hybrid Indexing Method for Approximate String Matching, Gonzalo Navarro and Ricardo Baeza-Yates, Vol. 1, No. 1, 2000, pp. 205-239.
[S80] String Matching with Errors, Sellers, P. H., Journal of Algorithms, Vol. 20, No. 1, 1980, pp. 359-373.
[TU93] Approximate Boyer-Moore String Matching, J. Tarhio and E. Ukkonen, SIAM Journal on Computing, Vol. 22, No. 2, 1993, pp. 243-260.
[WM92] Fast Text Searching: Allowing Errors, Sun Wu and Udi Manber, Communications of the ACM, Vol. 35, 1992, pp. 83-91.