
A Hybrid Indexing Method for Approximate String Matching

A Hybrid Indexing Method for Approximate String Matching. Journal of Discrete Algorithms, No. 1, Vol. 1, 2000, pp. 205-239, Gonzalo Navarro and Ricardo Baeza-Yates. Advisor: Prof. R. C. T. Lee Speaker: Y. K. Shieh. The approximate string matching problem is:





Presentation Transcript


  1. A Hybrid Indexing Method for Approximate String Matching. Journal of Discrete Algorithms, No. 1, Vol. 1, 2000, pp. 205-239. Gonzalo Navarro and Ricardo Baeza-Yates. Advisor: Prof. R. C. T. Lee. Speaker: Y. K. Shieh.

  2. The approximate string matching problem is: given a text T of length n, a pattern P of length m (n > m), and a threshold k on the number of "errors" allowed in a match, find all occurrences of P in T with at most k errors.

  3. This paper uses an exhaustive searching mechanism. We open a window T’ in T with size m+k (Rule 2) and try to determine whether we can be sure that every prefix T’’ of this window T’ has ed(T’’,P) > k. If so, we ignore this window; otherwise, we use dynamic programming to examine whether any prefix T’’ of the window T’ has ed(T’’,P) ≦ k.

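  4. We use dynamic programming to compute the edit distance between two strings. A matrix $C_{0 \ldots m,\, 0 \ldots n}$ is filled, where $C_{j,i}$ is the minimum number of operations needed to match $T_{1 \ldots i}$ to $P_{1 \ldots j}$. It is computed as follows: $C_{j,0} = j$ and $C_{0,i} = i$ are the edit distances between a string of length j or i and the empty string, and for $j, i \geq 1$, $C_{j,i} = C_{j-1,i-1}$ if $P_j = T_i$, and $C_{j,i} = 1 + \min(C_{j-1,i},\, C_{j,i-1},\, C_{j-1,i-1})$ otherwise.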
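The matrix on this slide can be filled in a few lines. The following is a minimal sketch of the standard edit-distance recurrence, not code from the paper; the function names `edit_distance_matrix` and `ed` are chosen here for illustration.

```python
def edit_distance_matrix(P: str, T: str) -> list[list[int]]:
    """Fill C[j][i] = edit distance between P[:j] and T[:i]."""
    m, n = len(P), len(T)
    C = [[0] * (n + 1) for _ in range(m + 1)]
    for j in range(m + 1):
        C[j][0] = j          # P[:j] against the empty string
    for i in range(n + 1):
        C[0][i] = i          # empty string against T[:i]
    for j in range(1, m + 1):
        for i in range(1, n + 1):
            if P[j - 1] == T[i - 1]:
                C[j][i] = C[j - 1][i - 1]              # characters match, no cost
            else:
                C[j][i] = 1 + min(C[j - 1][i],         # drop a pattern character
                                  C[j][i - 1],         # drop a text character
                                  C[j - 1][i - 1])     # substitute
    return C


def ed(P: str, T: str) -> int:
    """Edit distance between the whole of P and the whole of T."""
    return edit_distance_matrix(P, T)[len(P)][len(T)]


print(ed("survey", "surgery"))
```

Filling the matrix costs O(mn) time, which is why the method tries to avoid running it on every window.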

  5. Example: T = surgery, P = survey, k = 2. Only three prefixes of T, namely surge, surger and surgery, have an edit distance to P = survey that is smaller than or equal to k = 2.
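The claim on this slide can be checked mechanically: the last row of the dynamic programming matrix holds ed(T’’,P) for every prefix T’’ of T. The sketch below keeps only one row at a time; the helper name `prefix_distances` is an assumption made for this illustration.

```python
def prefix_distances(P: str, T: str) -> list[int]:
    """Return d where d[i] = ed(T[:i], P), for i = 0..len(T)."""
    n = len(T)
    row = list(range(n + 1))                 # row for the empty pattern prefix
    for j in range(1, len(P) + 1):
        new = [j] + [0] * n                  # P[:j] against the empty text prefix
        for i in range(1, n + 1):
            if P[j - 1] == T[i - 1]:
                new[i] = row[i - 1]
            else:
                new[i] = 1 + min(row[i], new[i - 1], row[i - 1])
        row = new
    return row


T, P, k = "surgery", "survey", 2
dist = prefix_distances(P, T)
close = [T[:i] for i in range(len(T) + 1) if dist[i] <= k]
print(close)  # ['surge', 'surger', 'surgery']
```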

  6. Let us now see how we can be sure that, for a window T’ of size m+k, every prefix T’’ of T’ has ed(T’’,P) > k. We present Lemma 1 of this paper as follows.

  7. Lemma 1. Let T’ in T and P be two strings such that ed(T’, P) ≦ k. Let P = P1 x1 P2 x2 … xj-1 Pj, for strings Pi and xi and for any j ≧ 1. Then, at least one string Pi appears in T’ with at most ⌊k/j⌋ errors. Thus, we always divide the pattern into j pieces. We shall point out how to divide it later.

  8. To be more precise, we may say that if ed(T’,P) ≦ k, there exist a Pi in P and a substring T’’ of T’ such that ed(Pi,T’’) ≦ ⌊k/j⌋.
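One way to see where the bound ⌊k/j⌋ comes from is a counting argument. The derivation below is a sketch added for illustration and is not taken from the slides; the symbol $e_i$ is introduced here and denotes the number of edit operations that fall on piece $P_i$ in an optimal alignment of P with T’.

```latex
\sum_{i=1}^{j} e_i \;\le\; ed(T', P) \;\le\; k .
\qquad
\text{If } e_i \ge \lfloor k/j \rfloor + 1 \text{ for every } i, \text{ then }
\sum_{i=1}^{j} e_i \;\ge\; j\left(\lfloor k/j \rfloor + 1\right) \;>\; j \cdot \frac{k}{j} \;=\; k ,
```

which is a contradiction. Hence some piece $P_i$ has $e_i \le \lfloor k/j \rfloor$, and the substring of T’ aligned with that piece is within ⌊k/j⌋ errors of it.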

  9. Lemma 1 tells us that if for all Pi in P and every substring b in T’, ed(Pi,b) > ⌊k/j⌋, then ed(P,T’) > k. Suppose that there is a window T’ with size m+k and for all Pi in P and for every substring b in T’, ed(Pi,b) > ⌊k/j⌋. Then, we can be sure that for every prefix T’’ of T’, for all Pi in P and every substring b in T’’, ed(Pi,b) > ⌊k/j⌋.

  10. Let us define the following condition. Condition A: for all Pi in P and every substring b in T’, ed(Pi, b) > ⌊k/j⌋. Thus, if Condition A is satisfied, then for every prefix T’’ of T’, ed(T’’,P) > k. In such a case, we ignore T’ and shift P one step to the right.
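Condition A can be tested directly on a single window, which helps make the definition concrete. The sketch below is a brute-force check, not the index-based method of the paper: it uses the classical variant of the edit-distance table in which a match may start anywhere in the window. The function names, the pieces CA and AG, and the sample windows are assumptions borrowed from the full example later in the talk.

```python
def best_substring_distance(piece: str, window: str) -> int:
    """min over all substrings b of window of ed(piece, b)."""
    n = len(window)
    row = [0] * (n + 1)          # the empty piece matches an empty substring anywhere
    for j in range(1, len(piece) + 1):
        new = [j] + [0] * n
        for i in range(1, n + 1):
            if piece[j - 1] == window[i - 1]:
                new[i] = row[i - 1]
            else:
                new[i] = 1 + min(row[i], new[i - 1], row[i - 1])
        row = new
    return min(row)              # the best match may end anywhere


def condition_a(pieces: list[str], window: str, k: int) -> bool:
    """True if every piece is farther than floor(k/j) from every substring."""
    bound = k // len(pieces)
    return all(best_substring_distance(p, window) > bound for p in pieces)


print(condition_a(["CA", "AG"], "ACGGA", 1))  # window can be skipped
print(condition_a(["CA", "AG"], "CAAAG", 1))  # window must be verified
```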

  11. Question: how can we be sure that the above condition is satisfied? The approach: for each Pi, we generate all possible modified strings Pi’ whose distances with Pi are smaller than or equal to k. After generating all possible modified strings Pi’, we may use the suffix tree of T to find all occurrences of Pi’, for all i, in T with at most ⌊k/j⌋ errors.

  12. We still have the following questions:
      • Question 1. How to divide P into j pieces?
      • Question 2. How to generate all modified Pi’s?
      • Question 3. How to find the occurrences of Pi’s in T with edit distance less than or equal to ⌊k/j⌋?

  13. Question 1: How to divide P into j pieces? It can be proved that an optimal method is to partition P into j pieces with j = (m + k) / logσ n, where σ is the alphabet size. We get j pieces of P, and the size of every piece is around logσ n.

  14. Question 2. How to generate all modified Pi’s? Generating all modified strings within the required distance of Pi is straightforward. One method can be found in [HHLS2006], which was reported by C. W. Lu. Another method can be found in [HM2007], reported by L. C. Chen. In this paper, the authors used the second method mentioned in [HM2007].

  15. We can use non-deterministic finite automata (NFA). An NFA is a five-tuple M = (Q, Σ, δ, q0, F), where Q is a finite set of states, Σ is a finite input alphabet, δ is a mapping from Q × (Σ ∪ {ε}) into the set of subsets of Q, q0 ∈ Q is an initial state, and F ⊆ Q is a set of final states.
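The five-tuple translates almost literally into a small data structure. The sketch below simulates an NFA with ε-moves by tracking the set of active states. To have something to run, it also builds the standard edit-distance automaton (state (i, e) means i pattern characters consumed with e errors); that construction is an assumption made here and may differ from the automaton drawn on the slides.

```python
from dataclasses import dataclass

EPS = ""  # stands for the empty move ε


@dataclass
class NFA:
    states: set
    alphabet: set
    delta: dict      # (state, symbol or EPS) -> set of states
    start: object
    finals: set

    def eps_closure(self, S):
        stack, seen = list(S), set(S)
        while stack:
            q = stack.pop()
            for r in self.delta.get((q, EPS), ()):
                if r not in seen:
                    seen.add(r)
                    stack.append(r)
        return seen

    def accepts(self, w: str) -> bool:
        active = self.eps_closure({self.start})
        for c in w:
            moved = set()
            for q in active:
                moved |= self.delta.get((q, c), set())
            active = self.eps_closure(moved)
            if not active:           # "out of active states"
                return False
        return bool(active & self.finals)


def edit_distance_nfa(P: str, k: int, alphabet: str) -> NFA:
    """Automaton accepting the strings within edit distance k of P."""
    m, delta = len(P), {}

    def add(q, sym, r):
        delta.setdefault((q, sym), set()).add(r)

    for e in range(k + 1):
        for i in range(m + 1):
            if i < m:
                add((i, e), P[i], (i + 1, e))              # match
            if e < k:
                for c in alphabet:
                    add((i, e), c, (i, e + 1))             # extra input character
                    if i < m:
                        add((i, e), c, (i + 1, e + 1))     # substitution
                if i < m:
                    add((i, e), EPS, (i + 1, e + 1))       # skipped pattern character
    states = {(i, e) for i in range(m + 1) for e in range(k + 1)}
    finals = {(m, e) for e in range(k + 1)}
    return NFA(states, set(alphabet), delta, (0, 0), finals)


M = edit_distance_nfa("CA", 1, "ACG")
print([w for w in ["CA", "C", "GA", "CAG", "GG", ""] if M.accepts(w)])
```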

  16. P = abac, k = 2. The finite automaton M accepts Lk(P). Lk(P)={aa, ab, ac, ba, bc, aaa, aab, aac, aba, abb, abc, acc, baa, bab, bac, bbc, bcc, aaaa, aaab, aaac, aaba, aabc, aaca, aacb, aacc, abaa, abab, abac, abba, abbb, abbc, abca, abcb, abcc, baac, babc, bbac, bbbc, bcac}.

  17. P = abac, k = 2. The finite automaton M accepts Lk(P). Lk(P)={aa, ab, ac, ba, bc, aaa, aab, aac, aba, abb, abc, acc, baa, bab, bac, bbc, bcc, aaaa, aaab, aaac, aaba, aabc, aaca, aacb, aacc, abaa, abab, abac, abba, abbb, abbc, abca, abcb, abcc, baac, babc, bbac, bbbc, bcac}. Recognize aa

  18. Full example: T = GACACGGACCAAAGCAG, n = 17; P = CAAG, m = 4; k = 1.

  19. P = CAAG. j = (m + k) / logσ n = (4 + 1) / log3 17 = 1.9388. Therefore, we partition P into two pieces: P1 = CA and P2 = AG. According to Lemma 1, at least one piece appears in substrings of T with at most ⌊k/j⌋ = ⌊1/2⌋ = 0 errors. This means that we want to find exact matches of P1 and P2.
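The arithmetic on this slide can be reproduced as follows. This is a sketch under two assumptions that the slides do not state explicitly: σ is taken as the number of distinct characters in T (which gives the base 3 used above), and P is cut into contiguous pieces of near-equal length. The helper names are hypothetical.

```python
import math


def number_of_pieces(m: int, k: int, n: int, sigma: int) -> float:
    """j = (m + k) / log_sigma(n), before rounding."""
    return (m + k) / math.log(n, sigma)


def split_pattern(P: str, j: int) -> list[str]:
    """Cut P into j contiguous pieces of near-equal length (assumed rule)."""
    m = len(P)
    bounds = [round(i * m / j) for i in range(j + 1)]
    return [P[bounds[i]:bounds[i + 1]] for i in range(j)]


T, P, k = "GACACGGACCAAAGCAG", "CAAG", 1
sigma = len(set(T))                       # assumption: alphabet = characters seen in T
j_real = number_of_pieces(len(P), k, len(T), sigma)
j = max(1, round(j_real))
pieces = split_pattern(P, j)
print(round(j_real, 4), j, pieces, k // j)
```

With j = 2 the error bound per piece is ⌊1/2⌋ = 0, so each piece is searched for exactly.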

  20. NFA with k = 1 of P1 = CA, and NFA with k = 1 of P2 = AG. [Figures: the two automata.]

  21. T = GACACGGACCAAAGCAG. We construct the suffix tree of T. [Figure: suffix tree of T$; each leaf is labeled with the starting position, 1 to 17, of its suffix.]

  22. T = GACACGGACCAAAGCAG. We only need to consider the tree levels from the root down to depth 3. [Figure: suffix tree of T truncated at depth 3; the leaf labeled 1,7 stands for the suffixes starting at positions 1 and 7, which share the prefix GAC.]
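A simple stand-in for the truncated suffix tree is a dictionary that maps every substring of length at most 3 to the positions where it starts. The sketch below is not a real suffix tree, and the name `truncated_index` is hypothetical, but it records the same information as the nodes of the tree cut at depth 3.

```python
from collections import defaultdict


def truncated_index(T: str, depth: int) -> dict[str, list[int]]:
    """Map each substring of length <= depth to its 1-based start positions."""
    index = defaultdict(list)
    for s in range(len(T)):
        for length in range(1, depth + 1):
            if s + length <= len(T):
                index[T[s:s + length]].append(s + 1)
    return index


T = "GACACGGACCAAAGCAG"
index = truncated_index(T, 3)
print(index["GAC"])  # [1, 7]
print(index["CA"])   # [3, 10, 15]
print(index["AG"])   # [13, 16]
```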

  23. T = GACACGGACCAAAGCAG, k = 1. [Figure: truncated suffix tree of T, with the NFA of P1 and the NFA of P2.]

  24. [Figure: suffix tree traversal, k = 1.] (not exact match) (not exact match)

  25. [Figure: suffix tree traversal, k = 1.] Out of active states. (not exact match)

  26. [Figure: suffix tree traversal, k = 1.] Out of active states. (exact match) We record positions 13 and 16, where AG occurs.

  27. [Figure: suffix tree traversal, k = 1.]

  28. [Figure: suffix tree traversal, k = 1.] (exact match) We record positions 3, 10 and 15, where CA occurs. Out of active states.

  29. [Figure: suffix tree traversal, k = 1.] (not exact match) Out of active states.

  30. [Figure: suffix tree traversal, k = 1.] (not exact match) (not exact match)

  31. [Figure: suffix tree traversal, k = 1.]

  32. [Figure: suffix tree traversal, k = 1.] Out of active states. Out of active states.

  33. [Figure: suffix tree traversal, k = 1.] (not exact match) Out of active states.

  34. [Figure: suffix tree traversal, k = 1.] Out of active states. Out of active states.

  35. [Figure: suffix tree traversal, k = 1.] Out of active states. (not exact match)
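The traversal in the preceding slides can be imitated in code. The sketch below walks a depth-limited suffix trie and carries one dynamic programming column per node in place of the NFA's active-state set, which is a swapped-in but equivalent bookkeeping; a branch is abandoned when no entry of the column is within the error bound, which plays the role of "out of active states". The trie layout and the function names are assumptions made for this illustration.

```python
def build_trie(T: str, depth: int) -> dict:
    """Suffix trie of T cut at the given depth; nodes keep 1-based start positions."""
    root = {"pos": [], "kids": {}}
    for s in range(len(T)):
        node = root
        for c in T[s:s + depth]:
            node = node["kids"].setdefault(c, {"pos": [], "kids": {}})
            node["pos"].append(s + 1)
    return root


def search(node: dict, piece: str, bound: int, col=None, found=None) -> set:
    """Collect start positions where piece occurs with at most `bound` errors."""
    if col is None:
        col = list(range(len(piece) + 1))   # column for the empty path
        found = set()
    for c, child in node["kids"].items():
        new = [col[0] + 1]
        for j in range(1, len(piece) + 1):
            if piece[j - 1] == c:
                new.append(col[j - 1])
            else:
                new.append(1 + min(col[j], new[j - 1], col[j - 1]))
        if new[-1] <= bound:
            found.update(child["pos"])      # match: record every position below
        elif min(new) <= bound:
            search(child, piece, bound, new, found)
        # otherwise nothing can match below this node: prune the branch
    return found


T = "GACACGGACCAAAGCAG"
trie = build_trie(T, 3)
probable = sorted(search(trie, "CA", 0) | search(trie, "AG", 0))
print(probable)  # [3, 10, 13, 15, 16]
```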

  36. After we find all probable positions in T, we verify the windows that contain those positions. The probable positions of T are: 3, 10, 13, 15, 16. We use dynamic programming to verify whether any approximate match between T and P occurs at the above locations.
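The verification step can be sketched as a loop over windows of size m+k. The code below assumes that a window is opened at every start position from 1 to n-m+1, that it is cut short at the end of T, and that it is checked only if it contains a probable position; these rules are read off the slides that follow and are not quoted from the paper. The function names are hypothetical.

```python
def prefix_distances(P: str, W: str) -> list[int]:
    """d[i] = ed(W[:i], P) for i = 0..len(W)."""
    n = len(W)
    row = list(range(n + 1))
    for j in range(1, len(P) + 1):
        new = [j] + [0] * n
        for i in range(1, n + 1):
            if P[j - 1] == W[i - 1]:
                new[i] = row[i - 1]
            else:
                new[i] = 1 + min(row[i], new[i - 1], row[i - 1])
        row = new
    return row


def verify(T: str, P: str, k: int, probable: list[int]) -> dict[int, list[str]]:
    """Map each 1-based window start to the window prefixes within k errors of P."""
    m, n = len(P), len(T)
    hits = {}
    for s in range(1, n - m + 2):
        e = min(s + m + k - 1, n)                 # inclusive end of the window
        if not any(s <= p <= e for p in probable):
            continue                              # no probable position: skip
        W = T[s - 1:e]
        dist = prefix_distances(P, W)
        good = [W[:i] for i in range(1, len(W) + 1) if dist[i] <= k]
        if good:
            hits[s] = good
    return hits


hits = verify("GACACGGACCAAAGCAG", "CAAG", 1, [3, 10, 13, 15, 16])
for start, prefixes in hits.items():
    print(start, prefixes)
```

Under these assumptions the loop reports CACG at window start 3, CAA, CAAA and CAAAG at 10, AAAG at 11, and AAG at 12.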

  37. k = 1. The probable positions of T are: 3, 10, 13, 15, 16. [Figure: window of size m+k.] No approximate match with k = 1 is found.

  38. k = 1. The probable positions of T are: 3, 10, 13, 15, 16. [Figure: window of size m+k.] No approximate match with k = 1 is found.

  39. k = 1. The probable positions of T are: 3, 10, 13, 15, 16. [Figure: window of size m+k.] CACG is found.

  40. The probable positions of T are: 3, 10, 13, 15, 16. [Figure: window of size m+k.] This window does not include any probable position. Therefore, we can ignore this window.

  41. The probable positions of T are: 3, 10, 13, 15, 16. [Figure: window of size m+k.] This window does not include any probable position. Therefore, we can shift the window directly.

  42. k = 1. The probable positions of T are: 3, 10, 13, 15, 16. [Figure: window of size m+k.] No approximate match with k = 1 is found.

  43. k = 1. The probable positions of T are: 3, 10, 13, 15, 16. [Figure: window of size m+k.] No approximate match with k = 1 is found.

  44. k = 1. The probable positions of T are: 3, 10, 13, 15, 16. [Figure: window of size m+k.] No approximate match with k = 1 is found.

  45. k = 1. The probable positions of T are: 3, 10, 13, 15, 16. [Figure: window of size m+k.] No approximate match with k = 1 is found.

  46. k = 1. The probable positions of T are: 3, 10, 13, 15, 16. [Figure: window of size m+k.] CAA, CAAA and CAAAG are found.

  47. k = 1. The probable positions of T are: 3, 10, 13, 15, 16. [Figure: window of size m+k.] AAAG is found.

  48. k = 1. The probable positions of T are: 3, 10, 13, 15, 16. [Figure: window of size m+k.] AAG is found.

  49. k = 1. The probable positions of T are: 3, 10, 13, 15, 16. [Figure: window of size m+k.] No approximate match with k = 1 is found.

  50. k = 1. The probable positions of T are: 3, 10, 13, 15, 16. [Figure: window of size m.] No approximate match with k = 1 is found.
