A Hybrid Indexing Method for Approximate String Matching. Gonzalo Navarro and Ricardo Baeza-Yates, Journal of Discrete Algorithms, Vol. 1, No. 1, 2000, pp. 205-239. Advisor: Prof. R. C. T. Lee. Speaker: Y. K. Shieh.
The approximate string matching problem is: given a text T of length n, a pattern P of length m (n > m), and a threshold k on the number of "errors" allowed in a match, find all occurrences of P in T with at most k errors.
This paper uses an exhaustive searching mechanism. We open a window T’ in T of size m+k (Rule 2) and try to determine whether every prefix T’’ of this window T’ is guaranteed to have ed(T’’,P) > k. If the answer is yes, we ignore this window; otherwise, we use dynamic programming to examine whether any prefix T’’ of the window T’ has ed(T’’,P) ≦ k.
We use dynamic programming to compute the edit distance between two strings. A matrix C[0…m, 0…n] is filled, where C[j,i] is the minimum number of operations needed to match T1…i to P1…j. It is computed as follows: C[j,0] = j and C[0,i] = i, since these entries are the edit distance between a string of length j or i and the empty string; for j, i > 0, C[j,i] = C[j-1,i-1] if Pj = Ti, and C[j,i] = 1 + min(C[j-1,i], C[j,i-1], C[j-1,i-1]) otherwise.
Example: T = surgery, P = survey, k = 2. Exactly three prefixes of T, namely surge, surger and surgery, have an edit distance from P = survey that is smaller than or equal to k = 2.
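The dynamic programming step and the surgery/survey example can be sketched as follows. This is a minimal illustration using the standard edit-distance recurrence; the function names `edit_distance` and `matching_prefixes` are hypothetical and do not come from the paper.

```python
def edit_distance(p: str, t: str) -> int:
    """Edit distance between p and t via the full (m+1) x (n+1) matrix C."""
    m, n = len(p), len(t)
    C = [[0] * (n + 1) for _ in range(m + 1)]
    for j in range(m + 1):
        C[j][0] = j          # distance between p[:j] and the empty string
    for i in range(n + 1):
        C[0][i] = i          # distance between the empty string and t[:i]
    for j in range(1, m + 1):
        for i in range(1, n + 1):
            if p[j - 1] == t[i - 1]:
                C[j][i] = C[j - 1][i - 1]
            else:
                C[j][i] = 1 + min(C[j - 1][i], C[j][i - 1], C[j - 1][i - 1])
    return C[m][n]


def matching_prefixes(t: str, p: str, k: int) -> list[str]:
    """All non-empty prefixes of t whose edit distance from p is at most k."""
    return [t[:i] for i in range(1, len(t) + 1) if edit_distance(p, t[:i]) <= k]


result = matching_prefixes("surgery", "survey", 2)
print(result)
```

Calling `matching_prefixes` on a window T’ of size m+k is the check that the method applies to windows it cannot rule out in advance.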
Let us now see how we can be sure that, for a window T’ of size m+k, every prefix T’’ of T’ has ed(T’’,P) > k. We present Lemma 1 of the paper as follows.
Lemma 1. Let T’ in T and P be two strings such that ed(T’, P) ≦ k. Let P = P1 x1 P2 x2 … xj-1 Pj, for strings Pi and xi and for any j ≧ 1. Then, at least one string Pi appears in T’ with at most ⌊k/j⌋ errors. Thus, we always divide the pattern into j pieces. We shall point out how to divide it later.
To be more precise, if ed(T’,P) ≦ k, there exist a Pi in P and a substring T’’ of T’ such that ed(Pi,T’’) ≦ ⌊k/j⌋.
Lemma 1 tells us that if, for all Pi in P and every substring b of T’, ed(Pi,b) > ⌊k/j⌋, then ed(P,T’) > k. Suppose that there is a window T’ of size m+k such that, for all Pi in P and every substring b of T’, ed(Pi,b) > ⌊k/j⌋. Every substring of a prefix T’’ of T’ is also a substring of T’. Therefore we can be sure that, for every prefix T’’ of T’, for all Pi in P and every substring b of T’’, ed(Pi,b) > ⌊k/j⌋.
Let us define the following condition. Condition A: for all Pi in P and every substring b of T’, ed(Pi, b) > ⌊k/j⌋. If Condition A is satisfied, then for every prefix T’’ of T’, ed(T’’,P) > k. In such a case, we ignore T’ and shift P one step to the right.
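Condition A can be tested directly, without any index, by a variant of the edit-distance matrix whose first row is all zeros, so that a match may start anywhere in the window. The sketch below is only an illustration of the condition; the names `min_substring_distance` and `condition_a` are hypothetical, and the paper itself avoids this per-window computation by using the suffix tree.

```python
def min_substring_distance(piece: str, window: str) -> int:
    """Smallest ed(piece, b) over all substrings b of window (b may be empty)."""
    m = len(piece)
    col = list(range(m + 1))      # column for the empty text: C[j][0] = j
    best = col[m]
    for ch in window:
        new = [0]                 # C[0][i] = 0: a match may start at any position
        for j in range(1, m + 1):
            if piece[j - 1] == ch:
                new.append(col[j - 1])
            else:
                new.append(1 + min(col[j], new[j - 1], col[j - 1]))
        col = new
        best = min(best, col[m])
    return best


def condition_a(pieces: list[str], window: str, k: int) -> bool:
    """True if every piece is farther than floor(k/j) from every substring of window."""
    threshold = k // len(pieces)
    return all(min_substring_distance(p, window) > threshold for p in pieces)


# Pieces CA and AG with k = 1, so the threshold is floor(1/2) = 0.
skip_window = condition_a(["CA", "AG"], "ACGGA", 1)   # neither piece occurs
keep_window = condition_a(["CA", "AG"], "CACGG", 1)   # CA occurs exactly
print(skip_window, keep_window)
```

When `condition_a` returns True the window can be ignored; when it returns False the window still has to be verified by dynamic programming.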
Question: how can we be sure that the above condition is satisfied? The approach: for each Pi, we generate all possible modified strings Pi’ whose edit distances from Pi are smaller than or equal to k. After generating all modified strings Pi’, we may use the suffix tree of T to find all occurrences of Pi’, for all i, in T with at most ⌊k/j⌋ errors.
We still have the following questions:
• Question 1. How to divide P into j pieces?
• Question 2. How to generate all modified Pi’s?
• Question 3. How to find the occurrences of the Pi’s in T with edit distance less than or equal to ⌊k/j⌋?
Question 1. How to divide P into j pieces? It can be proved that an optimal method is to partition P into j pieces with j = (m + k) / logσ n, where σ is the alphabet size. We get j pieces of P, and the size of every piece is around logσ n.
Question 2. How to generate all modified Pi’s? Generating all modified strings whose distances from a given string are within the error bound is straightforward. One method can be found in [HHLS2006], which was reported by C. W. Lu. Another method can be found in [HM2007], reported by L. C. Chen. In this paper, the authors used the second method, the one in [HM2007].
We can use non-deterministic finite automata (NFAs). An NFA is a five-tuple M = (Q, Σ, δ, q0, F), where Q is a finite set of states, Σ is a finite input alphabet, δ is a mapping from Q × (Σ ∪ {ε}) into the set of subsets of Q, q0 ∈ Q is the initial state, and F ⊆ Q is a set of final states.
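To make the five-tuple concrete, here is a small simulation of an NFA with ε-transitions for a pattern with up to k errors. It assumes the textbook Levenshtein construction, in which state (i, e) means "i pattern characters consumed, e errors used". This may not coincide with the automaton of [HM2007], so the language it accepts need not equal the set Lk(P) listed on the slides. The names `eps_closure` and `nfa_accepts` are hypothetical.

```python
def eps_closure(states: set, m: int, k: int) -> set:
    """Follow epsilon moves (i, e) -> (i+1, e+1): skip a pattern character (deletion)."""
    closure = set(states)
    stack = list(states)
    while stack:
        i, e = stack.pop()
        nxt = (i + 1, e + 1)
        if i < m and e < k and nxt not in closure:
            closure.add(nxt)
            stack.append(nxt)
    return closure


def nfa_accepts(pattern: str, k: int, word: str) -> bool:
    """Simulate the NFA on word; final states are those with i == len(pattern)."""
    m = len(pattern)
    active = eps_closure({(0, 0)}, m, k)      # q0 = (0, 0)
    for ch in word:
        nxt = set()
        for i, e in active:
            if i < m and pattern[i] == ch:
                nxt.add((i + 1, e))           # match: no new error
            if e < k:
                nxt.add((i, e + 1))           # ch is an extra (inserted) character
                if i < m:
                    nxt.add((i + 1, e + 1))   # ch substitutes pattern[i]
        active = eps_closure(nxt, m, k)
        if not active:
            return False                      # out of active states
    return any(i == m for i, _ in active)


print(nfa_accepts("abac", 2, "aa"))
```

Running out of active states, as in the early return above, is the event that lets a traversal abandon a branch early.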
P = abac, k = 2. The finite automaton M accepts Lk(P). Lk(P)={aa, ab, ac, ba, bc, aaa, aab, aac, aba, abb, abc, acc, baa, bab, bac, bbc, bcc, aaaa, aaab, aaac, aaba, aabc, aaca, aacb, aacc, abaa, abab, abac, abba, abbb, abbc, abca, abcb, abcc, baac, babc, bbac, bbbc, bcac}.
For example, M recognizes the string aa.
Full example: T = GACACGGACCAAAGCAG, n = 17, P = CAAG, m = 4, k = 1.
P = CAAG. j = (m + k) / logσ n = (4 + 1) / log3 17 = 1.9388. Therefore, we partition P into two pieces: P1 = CA and P2 = AG. According to Lemma 1, at least one piece appears in a substring of T with at most ⌊k/j⌋ = ⌊1/2⌋ = 0 errors. This means that we look for exact matches of P1 and P2.
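The arithmetic above can be reproduced with a few lines. The sketch assumes that j is rounded to the nearest integer (the slides only say that 1.9388 leads to two pieces) and that P is cut into contiguous pieces of nearly equal length; `ideal_piece_count` and `split_pattern` are hypothetical names.

```python
import math


def ideal_piece_count(m: int, k: int, n: int, sigma: int) -> float:
    """j = (m + k) / log_sigma(n), before rounding."""
    return (m + k) / math.log(n, sigma)


def split_pattern(pattern: str, j: int) -> list[str]:
    """Cut pattern into j contiguous pieces whose lengths differ by at most one."""
    base, extra = divmod(len(pattern), j)
    pieces, start = [], 0
    for idx in range(j):
        size = base + (1 if idx < extra else 0)
        pieces.append(pattern[start:start + size])
        start += size
    return pieces


raw = ideal_piece_count(m=4, k=1, n=17, sigma=3)
j = max(1, round(raw))            # assumption: round to the nearest integer
pieces = split_pattern("CAAG", j)
threshold = 1 // j                # floor(k / j) errors allowed per piece
print(round(raw, 4), j, pieces, threshold)
```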
[Figures: the NFA with k = 1 for P1 = CA and the NFA with k = 1 for P2 = AG.]
T = GACACGGACCAAAGCAG. We construct the suffix tree of T. [Figure: suffix tree of T$, with each leaf labeled by the starting position (1 to 17) of its suffix.]
We only need to consider the tree levels from the root down to depth 3. [Figure: the suffix tree of T = GACACGGACCAAAGCAG truncated at depth 3.]
We traverse the truncated suffix tree of T = GACACGGACCAAAGCAG with the NFA of P1 and the NFA of P2 (k = 1). On each branch, a traversal stops when the NFA runs out of active states, and a branch that is not an exact match is not recorded. Two branches give exact matches: for AG we record positions 13 and 16, and for CA we record positions 3, 10 and 15. [Figures: step-by-step traversal of the truncated suffix tree, one branch per slide.]
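The effect of the traversal, for exact matches of the pieces, can be imitated with a much simpler structure than a suffix tree. The sketch below is a stand-in, not the paper's index: it stores every substring of T of length at most 3 in a dictionary together with its 1-based starting positions, and it omits the terminator $. The name `depth_limited_index` is hypothetical.

```python
from collections import defaultdict


def depth_limited_index(text: str, depth: int) -> dict:
    """Map each substring of length <= depth to its 1-based starting positions."""
    index = defaultdict(list)
    for start in range(len(text)):
        for length in range(1, depth + 1):
            if start + length > len(text):
                break
            index[text[start:start + length]].append(start + 1)
    return index


T = "GACACGGACCAAAGCAG"
index = depth_limited_index(T, 3)
ca_positions = index["CA"]
ag_positions = index["AG"]
probable = sorted(ca_positions + ag_positions)
print(ca_positions, ag_positions, probable)
```

A real suffix tree gives the same answers in far less space than this quadratic-size dictionary, which is why the paper builds on it.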
After we find all probable positions in T, we verify the text around each of them. The probable positions of T are 3, 10, 13, 15 and 16. We use dynamic programming to verify whether an approximate match of P occurs in T at these locations.
k = 1. The probable positions of T are 3, 10, 13, 15 and 16. We slide a window of size m+k = 5 over T, one position at a time, and apply dynamic programming only to windows that include a probable position:
• Windows starting at positions 1 and 2: no approximate matching with k = 1 found.
• Window starting at position 3: CACG is found.
• Windows starting at positions 4 and 5: these windows do not include any probable position. Therefore we can ignore them and shift the window directly.
• Windows starting at positions 6, 7, 8 and 9: no approximate matching with k = 1 found.
• Window starting at position 10: CAA, CAAA and CAAAG are found.
• Window starting at position 11: AAAG is found.
• Window starting at position 12: AAG is found.
• Window starting at position 13: no approximate matching with k = 1 found.
• Last window, starting at position 14 and of size m: no approximate matching with k = 1 found.
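The verification pass can be sketched as below. The code mirrors the walk-through on the slides rather than the paper's exact procedure: windows start at positions 1 to n-m+1, the last one is cut to size m, and a window is examined only if it contains a probable position. The names `edit_distance` and `verify_windows` are hypothetical.

```python
def edit_distance(p: str, t: str) -> int:
    """Edit distance between p and t, keeping one matrix column at a time."""
    col = list(range(len(p) + 1))
    for i, ch in enumerate(t, start=1):
        new = [i]
        for j in range(1, len(p) + 1):
            if p[j - 1] == ch:
                new.append(col[j - 1])
            else:
                new.append(1 + min(col[j], new[j - 1], col[j - 1]))
        col = new
    return col[-1]


def verify_windows(text: str, pattern: str, k: int, probable: list[int]):
    """Return (window start, prefix) pairs with ed(prefix, pattern) <= k."""
    m, n = len(pattern), len(text)
    found = []
    for start in range(1, n - m + 2):          # 1-based window starts
        end = min(start + m + k - 1, n)        # window of size m+k, cut at the end of T
        if not any(start <= pos <= end for pos in probable):
            continue                           # no probable position: skip the window
        window = text[start - 1:end]
        for length in range(1, len(window) + 1):
            prefix = window[:length]
            if edit_distance(pattern, prefix) <= k:
                found.append((start, prefix))
    return found


matches = verify_windows("GACACGGACCAAAGCAG", "CAAG", 1, [3, 10, 13, 15, 16])
print(matches)
```

Windows shorter than m are not examined here because the walk-through stops at the window of size m; a complete implementation would have to decide how to treat the end of the text.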