Fast Algorithm for String Matching with k Mismatches

Fast Algorithm for String Matching with k Mismatches • By Amihood Amir, Moshe Lewenstein, and Ely Porat, Journal of Algorithms, to appear, 2003/2004 • Speakers: R92921097 李宜益, R92921084 何明彥, R92921083 余宗恩 • Advisor: Prof. 呂學一

General Case Speaker: R92921097 李宜益

Outline • Introduction • Problem Definition and Preliminaries • Large and Small Alphabets • General Alphabets

Introduction • Two types of matching problems • Generalized matching problem • Approximate matching problem • Previous research • Landau and Vishkin: O(nk) • Abrahamson: $O(n\sqrt{m \log m})$

Introduction • Complexity: $O(n\sqrt{k \log k})$ • Contribution: • The fastest known algorithm for string matching with k mismatches. • Identifying and exploiting a technique that has been implicitly used in some recent papers: counting.

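Problem Definition and Preliminaries • Let $a, b \in \Sigma$. Define $\mathrm{neq}(a, b) = 1$ if $a \neq b$, and $0$ if $a = b$ • Let $X = x_0 \ldots x_{n-1}$ and $Y = y_0 \ldots y_{n-1}$ be two strings over alphabet $\Sigma$. Then the Hamming distance between X and Y, ham(X, Y), is defined as $\mathrm{ham}(X, Y) = \sum_{i=0}^{n-1} \mathrm{neq}(x_i, y_i)$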

Problem Definition and Preliminaries • The string matching with k mismatches problem is defined as follows: • Input: text T = t_0…t_{n-1}, pattern P = p_0…p_{m-1}, where t_i, p_j ∈ Σ for i = 0,…,n-1 and j = 0,…,m-1, and a natural number k • Output: all pairs <i, ham(P, T(i))>, where i is a text location for which ham(P, T(i)) ≤ k, and T(i) = t_i t_{i+1}…t_{i+m-1}
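As a baseline for the definition above, here is a minimal brute-force sketch in Python. It is not the algorithm of the talk: it simply evaluates ham(P, T(i)) at every text location in O(nm) time. The function names `ham` and `k_mismatch_naive` are illustrative choices, not names from the paper.

```python
def ham(x, y):
    """Hamming distance between two equal-length strings."""
    if len(x) != len(y):
        raise ValueError("strings must have equal length")
    return sum(1 for a, b in zip(x, y) if a != b)


def k_mismatch_naive(text, pattern, k):
    """Return all pairs (i, ham(P, T(i))) with ham(P, T(i)) <= k.

    Brute force: O(n * m) character comparisons.
    """
    n, m = len(text), len(pattern)
    result = []
    for i in range(n - m + 1):
        d = ham(pattern, text[i:i + m])
        if d <= k:
            result.append((i, d))
    return result
```

For example, `k_mismatch_naive("aabcabc", "abc", 1)` reports locations 1 and 4, both with zero mismatches.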

Large and Small Alphabets • Large alphabets • The number of distinct symbols in the pattern exceeds 2k • Small alphabets • The number of distinct symbols in the pattern is less than

Large Alphabets(1) • Two stages • Marking stage • Identifying the potential starts of the pattern. • Verification stage • Verifying which of the potential candidates is indeed a pattern occurrence.

Large Alphabets(2) • The Marking Stage • Let {a_1,…,a_2k} be 2k different alphabet symbols appearing in the pattern, and let i_j be the smallest index in the pattern where a_j appears, j = 1,…,2k • [Figure: text symbols a_1, a_2, a_3, …, a_j, …, a_2k aligned against the pattern, with i_j marking the first pattern position of a_j]

Large Alphabets(3) • M.1. For every text symbol t_i: if t_i = a_j, then mark text location i - i_j • M.2. Discard every text location that has fewer than k marks • Time: O(n)

Example • Text: aabcabc • Pattern: abc • k = 1
Text symbol: a a b c a b c
# of marks:  1 3 0 0 3 0 0
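The marking stage can be sketched in Python as below. This is a sketch under assumptions: `marking_stage` and its `symbols` parameter are hypothetical names, and the caller chooses which pattern symbols to use (the slides choose 2k of them; the example above uses all three symbols of the pattern). A mark for text position i goes to location i - i_j, where i_j is the first pattern index of the symbol (0-based here).

```python
def marking_stage(text, pattern, k, symbols):
    """Return (marks, candidates) for the marking stage.

    symbols: the chosen distinct pattern symbols (an assumption of this
    sketch: the caller picks them, e.g. any 2k distinct ones).
    """
    # first[a] = smallest index in the pattern where symbol a appears
    first = {}
    for idx, sym in enumerate(pattern):
        if sym in symbols and sym not in first:
            first[sym] = idx

    n, m = len(text), len(pattern)
    marks = [0] * n
    for i, sym in enumerate(text):
        if sym in first and i - first[sym] >= 0:
            marks[i - first[sym]] += 1  # M.1: mark location i - i_j

    # M.2: keep only locations with at least k marks
    candidates = [i for i in range(n - m + 1) if marks[i] >= k]
    return marks, candidates
```

On the example, `marking_stage("aabcabc", "abc", 1, "abc")` gives the mark counts `[1, 3, 0, 0, 3, 0, 0]`. Location 0 survives with one mark and is only rejected later, in the verification stage.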

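Lemma 1 • After the marking stage, there are at most n/k undiscarded locations. • Proof: each text symbol contributes at most one mark, so there are at most n marks in total; every undiscarded location holds at least k marks, so at most n/k locations survive.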

Verification Stage • Use a suffix tree with lowest common ancestor (LCA) queries to check whether each candidate location matches the pattern with at most k mismatches. • This takes O(k) time per candidate • Total time: O(n)
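The verification step can be sketched as a sequence of "jumps" over matching runs. In the sketch below, the longest-common-extension query `lce` is computed by direct character comparison, which is a simplification swapped in for the suffix tree and LCA structure named on the slide; with that structure each query would take O(1) and each candidate O(k). The names `lce` and `verify` are illustrative.

```python
def lce(text, pattern, i, j):
    """Length of the longest common prefix of text[i:] and pattern[j:].

    Naive stand-in for a constant-time suffix tree + LCA query.
    """
    length = 0
    while (i + length < len(text) and j + length < len(pattern)
           and text[i + length] == pattern[j + length]):
        length += 1
    return length


def verify(text, pattern, start, k):
    """Return the mismatch count at `start` if it is at most k, else None."""
    m = len(pattern)
    if start < 0 or start + m > len(text):
        return None
    mismatches = 0
    j = 0
    while j < m:
        j += lce(text, pattern, start + j, j)  # jump over a matching run
        if j >= m:
            break
        mismatches += 1                        # pattern position j mismatches
        if mismatches > k:
            return None                        # stop after k + 1 mismatches
        j += 1
    return mismatches
```

Each candidate costs at most k + 1 queries, which is where the O(k) per candidate comes from once `lce` is replaced by a constant-time query.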

Small Alphabets • Use convolutions, as introduced by Fischer and Paterson • For a string S = s_0…s_{n-1}, define S^R as the reversed string s_{n-1}…s_0

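Example • Text T = aabcabc • Pattern P = abc, so P^R = cba • k = 2
Symbol   "not this symbol" indicator of T   indicator of P^R
a        0011011                            001
b        1101101                            010
c        1110110                            100

Example • Multiplying each pair of vectors gives the rows
000011011
011011010
111011000
• Summing the rows column by column gives the mismatch counts; columns 2 to 6 (2 0 3 3 0) correspond to text locations 0 to 4.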

Time complexity • Each multiplication takes O(n log m) using FFT • We do |Σ| multiplications, one per alphabet symbol • Can be solved in O(|Σ| n log m)
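The convolution method can be sketched in Python as follows. To stay dependency-free, the sketch multiplies the vectors with a direct quadratic convolution; this is a stand-in for the FFT-based multiplication on the slide, which is what gives O(n log m) per symbol. The names `convolve` and `mismatches_by_convolution` are illustrative.

```python
def convolve(a, b):
    """Direct polynomial multiplication of two integer vectors.

    Stand-in for an FFT-based product; same output, slower.
    """
    out = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        if x:
            for j, y in enumerate(b):
                out[i + j] += x * y
    return out


def mismatches_by_convolution(text, pattern):
    """Number of mismatches of the pattern at every text location."""
    n, m = len(text), len(pattern)
    total = [0] * (n + m - 1)
    for sym in set(pattern):
        # 1 where the text symbol differs from sym
        not_sym_text = [0 if c == sym else 1 for c in text]
        # 1 where the reversed pattern equals sym
        sym_pattern_rev = [1 if c == sym else 0 for c in reversed(pattern)]
        for idx, v in enumerate(convolve(not_sym_text, sym_pattern_rev)):
            total[idx] += v
    # entry s + m - 1 of the product corresponds to text location s
    return total[m - 1:n]
```

For T = aabcabc and P = abc this returns `[2, 0, 3, 3, 0]`, the mismatch counts at text locations 0 to 4.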

General Alphabets • Cases in which the size of the pattern alphabet is between 2 and 2k • Definition • A symbol that appears in the pattern at least 2 times is called frequent. A symbol that is not frequent is called rare.

Many Frequent Symbols • More than frequent symbols • Lemma 2 • Let be frequent symbols. Then there exist in the text at most locations where there is a pattern occurrence with no more than k errors. proof: Choose 2 occurrences of every frequent symbol in pattern and call them relevant occurrences The total number of marks is at most There are at most

Finding the Potential Locations • Example: k = 2 • Positions 1 to 14: a a b c a b c a a b c a b d • Frequent symbols: a, b

Finding the Potential Locations • φ: don’t care • Use the “less-than matching with don’t cares” problem proposed by Amir et al. This can be done in O( )
Positions 1 to 14, with don’t cares: a a b c φ a b c φ a a b c φ a b d φ

The Verification Stage • By Lemma 2, we have at most   candidates • Use a suffix tree with lowest common ancestor (LCA) queries to check whether each candidate location matches the pattern with at most k mismatches. • This takes O(k) time per candidate • Total time: O(n +  ) = O( )

Few Frequent Symbols • Use convolutions, as described in “Small Alphabets”, to handle the frequent symbols • This takes O( ) • Then replace all frequent symbols in P by “don’t cares” • Case 1: the remaining symbols have fewer than 2k occurrences in total • Case 2: the remaining symbols have at least 2k occurrences in total

Case 1 • Use the algorithm from “Pattern Matching with Swaps” by Amir et al. This can be done in O( ) • Total time complexity: O( )

Case 2 • Choose any 2k symbols • # of chosen symbols does not exceed • Using the previous method “finding the potential positions” • We have at most O( ) potential positions and verifying each location is O(k) • Total time complexity: O( )

Introduction to Break Speaker : R92921084 何明彥

OUTLINE • Assumption • Periodicity • Break • Counting Argument • P has 2k disjoint k-breaks • P has 2k disjoint l-breaks • Local matches

OUTLINE (Partition) • Assumption • Periodicity • Break • Counting Argument • P has 2k disjoint k-breaks • P has 2k disjoint l-breaks • Local matches

Assumption(1/2) • Text T: |T| = n • Pattern P: |P| = m • Assume n = 2m

Assumption(2/2) • Therefore, split the text into overlapping substrings of length 2m. • Every pattern occurrence appears in some substring. •   for 2m length substrings of the text yields an algorithm of
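The reduction to texts of length 2m can be sketched as below, assuming the substrings start at multiples of m so that consecutive ones overlap by m symbols. The helper names (`overlapping_chunks`, `match_by_chunks`, `naive_matcher`) are hypothetical, and the brute-force matcher is only a placeholder for whatever algorithm handles a single length-2m substring.

```python
def overlapping_chunks(text, m):
    """Yield (offset, chunk) pairs: chunks of length <= 2m, starting every m."""
    n = len(text)
    for offset in range(0, max(n - m, 0) + 1, m):
        yield offset, text[offset:offset + 2 * m]


def naive_matcher(text, pattern, k):
    """Placeholder matcher: brute-force k-mismatch search on one chunk."""
    m = len(pattern)
    out = []
    for i in range(len(text) - m + 1):
        d = sum(1 for a, b in zip(pattern, text[i:i + m]) if a != b)
        if d <= k:
            out.append((i, d))
    return out


def match_by_chunks(text, pattern, k, matcher=naive_matcher):
    """Run `matcher` on each chunk and translate results to text locations."""
    m = len(pattern)
    result = []
    for offset, chunk in overlapping_chunks(text, m):
        for local, dist in matcher(chunk, pattern, k):
            if local < m:  # locations >= m belong to the next chunk
                result.append((offset + local, dist))
    return result
```

An occurrence starting at location i lies entirely inside the chunk that starts at m * floor(i / m), so no occurrence is lost, and the `local < m` filter keeps each one from being reported twice.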

Periodicity(1/2) • Def: • A string S[1..n] is periodic if   such that S[j] = S[i+j-1]. • S is periodic if $S = u^j u'$ with $j \ge 2$, where $u'$ is a prefix of $u$; otherwise S is aperiodic. • Examples: ABCABCAB is periodic; ABCDABC is aperiodic.
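The definition can be checked with a short Python sketch. It relies on a standard fact rather than anything stated on the slide: S can be written as u^j u' with j ≥ 2 and u' a prefix of u exactly when the smallest period of S is at most |S|/2, and the smallest period equals |S| minus the length of the longest proper border, which the KMP failure function gives. The function names are illustrative.

```python
def smallest_period(s):
    """Smallest p >= 1 such that s[j] == s[j + p] for all valid j."""
    n = len(s)
    if n == 0:
        return 0
    fail = [0] * n  # fail[i] = longest proper border of s[:i + 1]
    b = 0
    for i in range(1, n):
        while b > 0 and s[i] != s[b]:
            b = fail[b - 1]
        if s[i] == s[b]:
            b += 1
        fail[i] = b
    return n - fail[-1]


def is_periodic(s):
    """True if s = u^j u' with j >= 2 and u' a prefix of u."""
    return len(s) > 0 and 2 * smallest_period(s) <= len(s)
```

On the slide's examples, ABCABCAB has smallest period 3 (periodic) and ABCDABC has smallest period 4 (aperiodic).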

Periodicity(2/2) • If P is periodic with a short period, it is quite simple to come up with a quick algorithm for string matching with k mismatches.

Break(1/4) • Def: • A break of a string S is an aperiodic substring of S. • An l-break is a break of length l. • Having a large number of breaks is useful for a fast algorithm for string matching with k mismatches. • [Figure: a string divided into periodic stretches and aperiodic l-breaks]

Break(2/4) • Lemma 3: Let P be a pattern with 2k disjoint l-breaks and let T be a text. In each match of P in T, at least k of the l-breaks match exactly.

Break(3/4) • Proof: There are at most k mismatches in a match, and P has 2k disjoint l-breaks. Each mismatch falls in at most one of the disjoint breaks, so at most k breaks fail to match exactly, and at least k must match exactly.

Break(4/4) • Lemma 4: Let P be a pattern of length m with fewer than 2k l-breaks, and let T be a text of length 2m. Then all matches of P in T lie in a substring of T that has at most O(k) l-breaks. • Proved in Section 6 of Cole and Hariharan, “Approximate String Matching: A Simpler Faster Algorithm”.

Counting Arguments(1/3) • Theorem 1: Let P be a pattern with 2k disjoint k-breaks. Then every k contiguous locations in T contain at most 4 matches of the pattern.

Counting Arguments(2/3) • Proof: • [Figure: a k-break of P (ABCDABC) aligned at successive shifts against the text T = ABCDABCDABC, within a window of k locations]

Counting Arguments(3/3) • For k contiguous locations in T, the overall number of exact matches of the k-breaks is at most 4k. • This means that at most 4 locations have k k-breaks with an exact match in their respective locations.
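The counting behind the bounds 4k and 4 can be written out as follows. The first line is an assumption supplied here, not stated on the slides: it is the standard property that two occurrences of an aperiodic string of length k start more than k/2 positions apart, so at most two of them start within k contiguous locations. The last line uses Lemma 3 with l = k: each pattern match needs at least k exactly matching breaks.

```latex
\begin{align*}
\#\{\text{exact matches of one } k\text{-break within } k \text{ contiguous locations}\} &\le 2, \\
\#\{\text{exact matches of all } 2k \text{ disjoint } k\text{-breaks}\} &\le 2 \cdot 2k = 4k, \\
\#\{\text{pattern matches within } k \text{ contiguous locations}\} &\le \frac{4k}{k} = 4.
\end{align*}
```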

P has 2k disjoint k-breaks (1/4) • Corollary 1: If P has 2k disjoint k-breaks, then there are O(n/k) matches of P in T. These matches can be found in O(n+m) time.

P has 2k disjoint k-breaks (2/4) • Proof: From Theorem 1 there are O(n/k) matches of P in T. Therefore, if we knew these locations in advance, verification would take O(k) per location. • Next we describe a method of finding the candidate locations in O(n) time.

P has 2k disjoint k-breaks (3/4) • Find all exact matches of all breaks in the text. • For every such match, mark the text location at which the pattern would have to start for this break to align. • Discard every text location that has fewer than k marks. • There are O(n) exact matches of breaks, and they can be found in linear time. • There is a total of O(n) marks.
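The candidate-finding steps above can be sketched in Python as below. The sketch assumes the break positions in the pattern are already known and passed in as `break_starts` (a hypothetical parameter), and it locates exact matches with repeated `str.find` calls, which is a simple stand-in for the linear-time multi-pattern matching the slide refers to.

```python
def find_all(text, piece):
    """Yield every start position of `piece` in `text` (overlaps allowed)."""
    pos = text.find(piece)
    while pos != -1:
        yield pos
        pos = text.find(piece, pos + 1)


def mark_by_breaks(text, pattern, break_starts, length, k):
    """Return the text locations that collect at least k marks.

    break_starts: start indices in the pattern of the disjoint breaks
    length: common length of the breaks
    """
    n, m = len(text), len(pattern)
    marks = [0] * n
    for b in break_starts:
        piece = pattern[b:b + length]
        for pos in find_all(text, piece):
            start = pos - b  # pattern start implied by this break match
            if 0 <= start <= n - m:
                marks[start] += 1
    return [i for i in range(n - m + 1) if marks[i] >= k]
```

With k = 2, pattern abcdefgh and breaks ab, cd, ef, gh, the text zzabcdefghabxdeyghzz yields candidates 2 (four marks) and 10 (two marks). Each candidate would then go through the O(k) verification.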

P has 2k disjoint k-breaks (4/4) • There are l distinct breaks, appearing a_1,…,a_l times respectively in the pattern. • Each distinct k-break can appear at most 2n/k times in the text, and since there are 2k k-breaks, the number of all k-break occurrences in the text does not exceed 4n. • The total number of marks is therefore at most 4n. • The total length of all k-breaks is ≤ m. • All exact matches of all k-breaks in the text can be found in O(n+m).

P has 2k disjoint l-breaks(1/7) • The pattern does not always contain 2k k-breaks. Nevertheless, there may be an l such that there are 2k l-breaks. • By Corollary 1, finding them may be costly. • To circumvent this problem, rather than searching for all matches, we need a way to look for local matches.
