160 likes | 445 Views
Approximate String Matching using Compressed Suffix Arrays. T.N.D. Huynh, W.K. Sung National University of Singapore W.K. Hon , T.W. Lam The University of Hong Kong. String Matching Problem.
E N D
Approximate String Matching using Compressed Suffix Arrays T.N.D. Huynh, W.K. Sung National University of Singapore W.K. Hon, T.W. Lam The University of Hong Kong
String Matching Problem • Given a text T of length n over an alphabetΣ, a pattern P of lengthm, find all occurrences of P inside the text T • E.g.,T = barbara P = bar 2 occurrences, at position 1 and 4 in T
Index for String Matching • Often, T is given ahead, which is going to be matched with various P later • Also, n » m. E.g., T = Human Genome ~ 3 * 109 P = Gene ~ 103 It pays to waste some space to build an index for T that will facilitate later matching
k-Approximate String Matching • Find all occurrences of P in T that have at most k “errors” (mismatch, edits) from P • E.g., T = barbara P = rba 5 occurrences, at positions 1 (delete r from P), 2 (insert atoP), 3 (match), 4 (delete r from P), 6 (delete b from P)
Our Index • Our index is Suffix Array + Inverse • Definition 1: The suffix array of T is an array SA such that SA[i] stores the starting position of the i-th smallest suffix of T
An Example of Suffix Array • E.g., T = barbara
Our Index: Suffix Array + Inverse • Lemma 1: Given a pattern P. Suppose P occurs in T. Then all (exact) occurrences of P corresponds toa range, say[st,ed], in SA such that SA[st], SA[st+1], …, SA[ed] are position of all such occurrences.
Our Index: Suffix Array + Inverse • Lemma 2: Given the range [st1,ed1] for P1and the range [st2,ed2] for P2. Then, the range [st,ed] for P1P2 can be found in O(log n) time, based on SA and its inverse. • Idea of proof: Similar to Manber & Myers’ algorithm, using binary search.
Our Index: Suffix Array + Inverse • Corollary 3: Given the range [st,ed] for P, and an array C such that C[c] stores the total occurrences of a character in T that is smaller than the character c. Then, the range of cP can be found in O(log n) time. • Proof: Directly follows from Lemma 2 since [C[c-1]+1, C[c]] is the range of SA that corresponds to c.
1-Approximate Matching Algorithm [The delete case] • Find the range [sti,edi] for P[1…i], for every i [1,m] • Find the range [sti’,edi’] for P[i…m], for every i [1,m] • For every i [1,m], find the range of P[1…i-1] P[i+1..m]. Report the occurrences. Time complexity: O(m log n + occ)
1-Approximate Matching Algorithm • For the mismatch case or other edit cases, the algorithm is similar, except that in Step 3, we have to find the range for |Σ|m strings (instead of m strings in the delete case). Time complexity: O(|Σ|m log n+occ)
The General Case • Our algorithm can be extended to solve the general k-approximate matching problem. The time complexity will be: O(|Σ|k mk log n + occ) • Further, if we replaceSA + InversebyCSA of Grossi & Vitter,the space becomes O(n) bits, and the time will be blown up by an O(log n) factor
Future Work • Can we improve the time to O(m + occ) for the 1-approximate matchingproblem?