
Fast Algorithm for String Matching with k Mismatches

Fast Algorithm for String Matching with k Mismatches, by Amihood Amir, Moshe Lewenstein, and Ely Porat, Journal of Algorithms, to appear, 2003/2004. Speakers: R92921097 李宜益, R92921084 何明彥, R92921083 余宗恩. Advisor: Prof. 呂學一.



  1. Fast Algorithm for String Matching with k Mismatches, by Amihood Amir, Moshe Lewenstein, and Ely Porat, Journal of Algorithms, to appear, 2003/2004 • Speakers: R92921097 李宜益, R92921084 何明彥, R92921083 余宗恩 • Advisor: Prof. 呂學一

  2. General Case Speaker: R92921097 李宜益

  3. Outline • Introduction • Problem Definition and Preliminaries • Large and Small Alphabets • General Alphabets

  4. Introduction • Two types of matching problems • Generalized matching problem • Approximate matching problem • Previous research • Landau and Vishkin: $O(nk)$ • Abrahamson: $O(n\sqrt{m \log m})$

  5. Introduction • Complexity: $O(n\sqrt{k \log k})$ • Contributions: • The fastest known algorithm for string matching with k mismatches. • Identifying and exploiting a technique that has been used implicitly in some recent papers: counting.

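  6. Problem Definition and Preliminaries • Let $a, b \in \Sigma$. Define $neq(a,b) = 1$ if $a \neq b$ and $neq(a,b) = 0$ if $a = b$. • Let $X = x_0 x_1 \ldots x_{n-1}$ and $Y = y_0 y_1 \ldots y_{n-1}$ be two strings over alphabet $\Sigma$. Then the Hamming distance between X and Y is defined as $ham(X,Y) = \sum_{i=0}^{n-1} neq(x_i, y_i)$.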

  7. Problem Definition and Preliminaries • The string matching with k mismatches problem is defined as follows: • Input: text $T = t_0 \ldots t_{n-1}$, pattern $P = p_0 \ldots p_{m-1}$, where $t_i, p_j \in \Sigma$ for $i = 0, \ldots, n-1$ and $j = 0, \ldots, m-1$, and a natural number k. • Output: all pairs $\langle i, ham(P, T^{(i)}) \rangle$, where i is a text location for which $ham(P, T^{(i)}) \le k$ and $T^{(i)} = t_i t_{i+1} \ldots t_{i+m-1}$.
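As a reference point for the problem statement above, here is a minimal brute-force sketch in Python. It is not the algorithm of the paper; it simply evaluates the definition in O(nm) time. The function names `ham` and `k_mismatch_naive` are chosen here for illustration.

```python
def ham(x: str, y: str) -> int:
    """Hamming distance: number of positions where two equal-length strings differ."""
    if len(x) != len(y):
        raise ValueError("ham is only defined for strings of equal length")
    return sum(1 for a, b in zip(x, y) if a != b)


def k_mismatch_naive(text: str, pattern: str, k: int) -> list[tuple[int, int]]:
    """Return all pairs (i, ham(P, T(i))) with ham(P, T(i)) <= k.

    Brute force over every text location, so O(n * m) time.
    """
    n, m = len(text), len(pattern)
    result = []
    for i in range(n - m + 1):
        d = ham(pattern, text[i:i + m])
        if d <= k:
            result.append((i, d))
    return result
```

A baseline like this is mainly useful for checking the output of the faster methods on small inputs.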

  8. Large and Small Alphabets • Large alphabets • The number of different symbols in the pattern exceeds 2k • Small alphabets • The number of different symbols in the pattern is less than

  9. Large Alphabets(1) • Two stages • Marking stage • Identifying the potential starts of the pattern. • Verification stage • Verifying which of the potential candidates is indeed a pattern occurrence.

  10. Large Alphabets(2) • The Marking Stage • Let $\{a_1, \ldots, a_{2k}\}$ be 2k different alphabet symbols appearing in the pattern, and let $i_j$ be the smallest index in the pattern where $a_j$ appears, $j = 1, \ldots, 2k$. • Figure: each occurrence of $a_j$ in the text is lined up with position $i_j$, the first occurrence of $a_j$ in the pattern.

  11. Large Alphabets(3) • M.1. For every text symbol $t_i$: if $t_i = a_j$, then mark text location $i - i_j$ • M.2. Discard every text location that has fewer than k marks • Time: O(n)

  12. Example • Text: aabcabc • Pattern: abc • k: 1 • Number of marks at text locations 0 to 6: 1 3 0 0 3 0 0
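The marking stage can be sketched in Python as follows. This is an illustrative sketch, not the authors' code. The name `marking_stage` and the optional `symbols` parameter are assumptions made here. By default every distinct pattern symbol is used, which reproduces the mark counts of the example above; the slides choose 2k symbols.

```python
def marking_stage(text, pattern, k, symbols=None):
    """Return (marks, candidates) for the marking stage.

    marks[i] counts how many text symbols vote for a pattern start at i;
    candidates are the locations that keep at least k marks.
    `symbols` (hypothetical parameter) restricts the voting symbols.
    """
    # i_j: smallest index in the pattern at which each symbol appears
    first = {}
    for idx, ch in enumerate(pattern):
        first.setdefault(ch, idx)
    if symbols is not None:
        first = {s: first[s] for s in symbols}

    marks = [0] * len(text)
    for i, ch in enumerate(text):
        if ch in first:
            loc = i - first[ch]      # text location i - i_j
            if loc >= 0:
                marks[loc] += 1

    # keep locations with at least k marks where the pattern still fits
    last_start = len(text) - len(pattern)
    candidates = [i for i, c in enumerate(marks) if c >= k and i <= last_start]
    return marks, candidates
```

Each text symbol casts at most one vote, so the loop runs in O(n) time after an O(m) scan of the pattern.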

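  13. Lemma 1 • After the marking stage, there are at most $n/k$ undiscarded locations. • Proof: each text symbol produces at most one mark, so there are at most n marks in total; every undiscarded location holds at least k marks, so at most $n/k$ locations survive.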

  14. Verification Stage • Use a suffix tree and Lowest Common Ancestor queries to check whether a candidate location has a match with at most k mismatches. • Takes O(k) for each candidate • Total time: $O((n/k) \cdot k) = O(n)$
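The verification of a single candidate can be sketched as a sequence of "longest common extension" jumps. This sketch makes one loud assumption: the helper `lce` below compares characters one by one, as a stand-in for the constant-time suffix tree plus Lowest Common Ancestor query named on the slide. With that stand-in the code is correct but does not reach the O(k) bound per candidate. Both function names are chosen here for illustration.

```python
def lce(text, i, pattern, j):
    """Length of the longest common prefix of text[i:] and pattern[j:].

    Naive stand-in (assumption): a suffix tree with LCA queries would
    answer this in O(1) after linear preprocessing.
    """
    length = 0
    while (i + length < len(text) and j + length < len(pattern)
           and text[i + length] == pattern[j + length]):
        length += 1
    return length


def verify(text, pattern, start, k):
    """Return the number of mismatches at `start` if it is <= k, else None."""
    m = len(pattern)
    if start < 0 or start + m > len(text):
        return None
    mismatches = 0
    j = 0
    while j < m:
        j += lce(text, start + j, pattern, j)   # skip the matching run
        if j >= m:
            break
        mismatches += 1                          # position j is a mismatch
        if mismatches > k:
            return None                          # stop after k + 1 mismatches
        j += 1
    return mismatches
```

The loop performs at most k + 1 extension queries per candidate, which is where the O(k) cost per candidate comes from once each query takes constant time.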

  15. Small Alphabets • Use convolutions, as introduced by Fischer and Paterson • Notation: for a string $S = s_0 \ldots s_{n-1}$, $S^R = s_{n-1} \ldots s_0$ denotes the reverse of S

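  16. Example • Text (T): a a b c a b c • Pattern (P): a b c • k: 2 • For each symbol, a 0/1 vector marks the text positions that differ from the symbol and the positions of the reversed pattern $P^R$ = c b a that equal it: • a: text 0011011, $P^R$ 001 • b: text 1101101, $P^R$ 010 • c: text 1110110, $P^R$ 100

  17. Example • The three products, aligned as in long multiplication: 000011011 (a), 011011010 (b), 111011000 (c) • Adding them column by column gives 1 2 2 0 3 3 0 2 1; the middle five entries 2 0 3 3 0 are the numbers of mismatches at text locations 0 to 4.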

  18. Time complexity • Each multiplication takes O(n log m) using the FFT • We do $|\Sigma|$ multiplications • The problem can be solved in $O(|\Sigma| n \log m)$
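The convolution method can be sketched in pure Python as below. This is a sketch under stated assumptions, not the implementation behind the slides. The small recursive `fft` is written here for self-containment. One convolution over the whole text costs O(n log n) in this form; the O(n log m) bound comes from processing the text in pieces of length 2m, which the sketch omits. All function names are illustrative.

```python
import cmath


def fft(a, invert=False):
    """Recursive radix-2 FFT; len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return a[:]
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for t in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * t / n) * odd[t]
        out[t] = even[t] + w
        out[t + n // 2] = even[t] - w
    return out


def convolve(x, y):
    """Integer convolution of two 0/1 lists via the FFT."""
    out_len = len(x) + len(y) - 1
    size = 1
    while size < out_len:
        size *= 2
    fx = fft([complex(v) for v in x] + [0j] * (size - len(x)))
    fy = fft([complex(v) for v in y] + [0j] * (size - len(y)))
    prod = fft([p * q for p, q in zip(fx, fy)], invert=True)
    return [round(v.real / size) for v in prod[:out_len]]


def mismatches_by_convolution(text, pattern):
    """Number of mismatches at every text location, one convolution per symbol."""
    n, m = len(text), len(pattern)
    counts = [0] * max(n - m + 1, 0)
    for sym in set(pattern):
        t_neq = [1 if c != sym else 0 for c in text]                  # text differs from sym
        p_eq_rev = [1 if c == sym else 0 for c in reversed(pattern)]  # P^R equals sym
        conv = convolve(t_neq, p_eq_rev)
        for i in range(len(counts)):
            counts[i] += conv[i + m - 1]
    return counts
```

On the text aabcabc and pattern abc this yields the counts 2 0 3 3 0 for locations 0 to 4; locations with a count of at most k are the answers.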

  19. General Alphabets • Cases in which the size of the pattern alphabet is between 2 and 2k • Definition • A symbol that appears in the pattern at least 2 times is called frequent. A symbol that is not frequent is called rare.
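The frequent/rare split is a simple counting pass over the pattern. In the sketch below the cut-off is passed in as `threshold`, a hypothetical parameter: the exact bound used on the slide is not legible in this transcript, so no particular value is assumed. The function name is also illustrative.

```python
from collections import Counter


def split_frequent_rare(pattern, threshold):
    """Partition the pattern's symbols into (frequent, rare).

    A symbol is frequent if it occurs at least `threshold` times in the
    pattern; `threshold` is an assumption supplied by the caller.
    """
    counts = Counter(pattern)
    frequent = {s for s, c in counts.items() if c >= threshold}
    rare = set(counts) - frequent
    return frequent, rare
```

The number of frequent symbols then decides which of the two cases that follow applies.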

  20. Many Frequent Symbols • More than frequent symbols • Lemma 2 • Let be frequent symbols. Then there exist in the text at most locations where there is a pattern occurrence with no more than k errors. proof: Choose 2 occurrences of every frequent symbol in pattern and call them relevant occurrences The total number of marks is at most There are at most

  21. Finding the Potential Locations • Example: k = 2 • Positions 1 to 14: a a b c a b c a a b c a b d • Frequent symbols: a, b

  22. Finding the Potential Locations • φ: don’t care • Using the solution to the “less-than matching with don’t cares” problem proposed by Amir et al., this can be done in O( ) • Positions 1 to 14: a a b c φ a b c φ a a b c φ a b d φ

  23. The Verification Stage • By Lemma 2, we have at most candidates • Use a suffix tree and Lowest Common Ancestor queries to check whether a candidate location has a match with at most k mismatches. • Takes O(k) for each candidate • Total time: O(n + )=O( )

  24. Few Frequent Symbols • Use the convolutions described in “Small Alphabets” to deal with the frequent symbols • Takes O( ) • Then replace all frequent symbols in P by “don’t cares” • Case 1: the remaining symbols, counting all their occurrences, number fewer than 2k • Case 2: the remaining symbols, counting all their occurrences, number at least 2k

  25. Case 1 • Using the algorithm from “Pattern Matching with Swaps” by Amir et al., this can be done in O( ) • Total time complexity: O( )

  26. Case 2 • Choose any 2k symbols • # of chosen symbols does not exceed • Using the previous method “finding the potential positions” • We have at most O( ) potential positions and verifying each location is O(k) • Total time complexity: O( )

  27. Introduction to Break Speaker : R92921084 何明彥

  28. OUTLINE • Assumption • Periodicity • Break • Counting Argument • P has 2k disjoint k-breaks • P has 2k disjoint l-breaks • Local matches

  29. OUTLINE (Partition) • Assumption • Periodicity • Break • Counting Argument • P has 2k disjoint k-breaks • P has 2k disjoint l-breaks • Local matches

  30. Assumption(1/2) • Text T: |T| = n • Pattern P: |P| = m • Assume n = 2m

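  31. Assumption(2/2) • Therefore, split the text into overlapping substrings of length 2m, one starting at every m-th position. • Every pattern occurrence then lies entirely inside some substring. • An algorithm that takes time f(m) on each substring of length 2m yields an algorithm of time O((n/m) · f(m)) for the whole text.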
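The reduction to texts of length 2m can be sketched as a generator of overlapping windows. The name `overlapping_windows` is chosen here for illustration, and the choice of one window every m positions is the usual way to guarantee coverage, stated here as an assumption about what the slide intends.

```python
def overlapping_windows(text, m):
    """Yield (offset, piece) with pieces of length at most 2m, one every m positions.

    A pattern occurrence starting at i lies entirely inside the window
    that starts at (i // m) * m.
    """
    n = len(text)
    for offset in range(0, max(n - m + 1, 1), m):
        yield offset, text[offset:offset + 2 * m]
```

A match found at position p inside a window with offset o corresponds to text location o + p; the same location can show up in two adjacent windows, so results should be deduplicated.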

  32. OUTLINE • Assumption • Periodicity • Break • Counting Argument • P has 2k disjoint k-breaks • P has 2k disjoint l-breaks • Local matches

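  33. Periodicity(1/2) • Def: • A string S[1..n] is periodic if there exists i, $1 < i \le n/2 + 1$, such that S[j] = S[i+j-1] for all $1 \le j \le n-i+1$. • Equivalently, S is periodic if $S = u^j u'$ with $j \ge 2$ and $u'$ a prefix of $u$; otherwise S is aperiodic. • Examples: ABCABCAB is periodic, ABCDABC is aperiodic.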
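The periodicity test can be sketched with the classical prefix (failure) function: a string is periodic in the sense above exactly when its smallest period is at most half its length. The function names are chosen here for illustration.

```python
def smallest_period(s):
    """Smallest p > 0 such that s[i] == s[i + p] for all valid i (0 for empty s)."""
    n = len(s)
    if n == 0:
        return 0
    fail = [0] * n          # fail[i]: longest proper border of s[:i + 1]
    j = 0
    for i in range(1, n):
        while j > 0 and s[i] != s[j]:
            j = fail[j - 1]
        if s[i] == s[j]:
            j += 1
        fail[i] = j
    return n - fail[-1]


def is_periodic(s):
    """True if s = u^j u' with j >= 2 and u' a prefix of u."""
    return len(s) > 0 and 2 * smallest_period(s) <= len(s)
```

For the two examples, ABCABCAB has smallest period 3 and is periodic, while ABCDABC has smallest period 4, more than half its length of 7, and is aperiodic.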

  34. Periodicity(2/2) • If P is periodic with a short period, it is quite simple to come up with a quick algorithm for string matching with k mismatches.

  35. OUTLINE • Assumption • Periodicity • Break • Counting Argument • P has 2k disjoint k-breaks • P has 2k disjoint l-breaks • Local matches

  36. Break(1/4) • Def: • A break of a string S is an aperiodic substring of S. • An l-break is a break of length l. • Figure: a string whose aperiodic stretches are labelled as l-breaks, separated by periodic stretches. • A large number of breaks is useful for a fast algorithm for string matching with k mismatches.

  37. Break(2/4) • Lemma 3: Let P be a pattern with 2k disjoint l-breaks and let T be a text. In each match of P in T, at least k of the l-breaks match exactly.

  38. Break(3/4) • Proof: A match has at most k mismatches, and P has 2k disjoint l-breaks. Because the breaks are disjoint, each mismatch falls inside at most one of them, so at most k breaks fail to match exactly and at least k must match exactly.

  39. Break(4/4) • Lemma 4: Let P be a pattern of length m with fewer than 2k l-breaks, and let T be a text of length 2m. Then all matches of P in T lie in a substring of T that has at most O(k) l-breaks. • Proved in Section 6 of Cole and Hariharan, “Approximate String Matching: A Simpler Faster Algorithm”.

  40. OUTLINE • Assumption • Periodicity • Break • Counting Argument • P has 2k disjoint k-breaks • P has 2k disjoint l-breaks • Local matches

  41. Counting Arguments(1/3) • Theorem 1: Let P be a pattern with 2k disjoint k-breaks. In every k contiguous locations in T, there are at most 4 matches of the pattern.

  42. Counting Arguments(2/3) • Proof (figure): a k-break ABCDABC of P is slid across k contiguous locations of the text T = ABCDABCDABC.

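  43. Counting Arguments(3/3) • For k contiguous locations in T, the overall number of exact matches of the k-breaks is at most 4k: a k-break is aperiodic, so two of its exact matches must start more than k/2 positions apart, which leaves each of the 2k breaks at most 2 exact matches. • By Lemma 3 every match needs at least k exactly matching k-breaks, so at most 4 locations have k k-breaks with an exact match in their respective locations.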

  44. OUTLINE • Assumption • Periodicity • Break • Counting Argument • P has 2k disjoint k-breaks • P has 2k disjoint l-breaks • Local matches

  45. P has 2k disjoint k-breaks (1/4) • Corollary 1: If P has 2k disjoint k-breaks, then there are $O(n/k)$ matches of P in T. These matches can be found in O(n+m) time.

  46. P has 2k disjoint k-breaks (2/4) • Proof: From Theorem 1 there are $O(n/k)$ matches of P in T. Therefore, if we knew these locations in advance, verification would take O(k) per location, O(n) in total. • Next we describe a method of finding the candidate locations in time O(n).

  47. P has 2k disjoint k-breaks (3/4) • Find all exact matches of all breaks in the text. • There are O(n) exact matches of breaks, and they can be found in linear time. • For every such match, mark the text location at which the pattern would have to start for this break to be aligned with it. • There is a total of O(n) marks. • Discard every text location that has fewer than k marks.
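The break-based marking can be sketched as follows. This is an illustrative sketch with assumptions: the breaks are handed in as pattern offsets of equal length (`break_offsets` and `break_len` are hypothetical parameters), and `str.find` is used as a simple stand-in for the linear-time search for all breaks at once that the slide calls for.

```python
def mark_with_breaks(text, pattern, break_offsets, break_len, k):
    """Return text locations that collect at least k marks from exact break matches.

    Each exact occurrence of the break pattern[off:off + break_len] at text
    position pos votes for a pattern start at pos - off.
    """
    n, m = len(text), len(pattern)
    marks = [0] * max(n - m + 1, 0)
    for off in break_offsets:
        piece = pattern[off:off + break_len]
        pos = text.find(piece)            # stand-in for multi-pattern matching
        while pos != -1:
            start = pos - off
            if 0 <= start < len(marks):
                marks[start] += 1
            pos = text.find(piece, pos + 1)
    return [i for i, c in enumerate(marks) if c >= k]
```

For instance, with the pattern abcdefgh split into the four 2-breaks ab, cd, ef, gh (so k = 2), the text xxabcdefghxxabcxefgxyy keeps locations 2 (an exact match) and 12 (two mismatches) as candidates, which would then go to the verification step.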

  48. P has 2k disjoint k-breaks (4/4) • There are l distinct breaks, appearing $a_1, \ldots, a_l$ times respectively in the pattern, with $a_1 + \cdots + a_l = 2k$. • Each distinct k-break is aperiodic, so it can appear at most $2n/k$ times in the text. • The total number of marks is therefore at most $\sum_{i=1}^{l} a_i \cdot 2n/k = 4n$. • The total length of all k-breaks is at most m, so all exact matches of all k-breaks in the text can be found in O(n+m).

  49. OUTLINE • Assumption • Periodicity • Break • Counting Argument • P has 2k disjoint k-breaks • P has 2k disjoint l-breaks • Local matches

  50. P has 2k disjoint l-breaks(1/7) • The pattern does not always contain 2k k-breaks. Nevertheless, there may be an l such that there are 2k l-breaks. • By Corollary 1, finding them may be costly. • To circumvent this problem, rather than searching for all matches, we need a way to seek local matches.
