A Sublinear Algorithm For Weakly Approximating Edit Distance. Batu, Ergun, Kilian, Magen, Raskhodnikova, Rubinfeld, Sami. Presentation by Itai Dinur.
Edit Distance (Levenshtein distance) • Let A, B be two strings over a fixed alphabet Σ. The edit distance D(A,B) between A and B is defined as the minimum number of character insertions, deletions, and substitutions that transform A into B, or vice versa.
Applications • Bioinformatics • Text processing • Web search
Algorithms • Wagner and Fischer gave a dynamic programming algorithm that runs in time O(n^2) • Masek and Paterson gave an improved algorithm that runs in time O(n^2 / log n)
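As a point of reference for the quadratic baseline, here is a minimal Python sketch of the Wagner and Fischer dynamic program. The function name `edit_distance` and the rolling-row layout are choices made for this illustration; the slides only state the O(n^2) bound.

```python
def edit_distance(a: str, b: str) -> int:
    """Wagner-Fischer dynamic program; O(len(a) * len(b)) time."""
    # prev[j] holds the distance between the first i-1 chars of a
    # and the first j chars of b; only two rows are kept in memory.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or free match
            ))
        prev = curr
    return prev[-1]
```

For example, `edit_distance("kitten", "sitting")` evaluates to 3. Every cell of the table is filled, which is exactly the cost the sublinear tester avoids.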
The Edit Distance Testing Problem • On input A, B and parameters 0<α<1, C>1: • If D(A,B) ≤ n^α, output CLOSE with probability at least 2/3 • If D(A,B) > n/C, output FAR with probability at least 2/3 • Note that the output is unrestricted for n^α < D(A,B) ≤ n/C • E.g. the algorithm cannot distinguish between distances n^0.1 and n^0.9 • The algorithm presented for the problem runs in time Õ(n^(max{α/2, 2α-1}))
Motivation • In some applications, given many pairs of strings, one is interested in computing the edit distance only for close strings • For string pairs where the edit distance is above a certain threshold, the actual value of the distance is irrelevant
Lower Bound • Any probabilistic algorithm for the edit distance test problem requires Ω(n^(α/2)) queries • The algorithm presented for the problem runs in time Õ(n^(max{α/2, 2α-1})), which is close to optimal for α≤2/3
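To see why α≤2/3 is the regime where the upper and lower bounds meet, the two exponents can be compared directly. The helper names below are hypothetical and the code only evaluates the exponents stated on the slide, ignoring polylogarithmic factors.

```python
def runtime_exponent(alpha: float) -> float:
    """Exponent of n in the stated running time: max{alpha/2, 2*alpha - 1}."""
    if not 0 < alpha < 1:
        raise ValueError("alpha must lie strictly between 0 and 1")
    return max(alpha / 2, 2 * alpha - 1)


def lower_bound_exponent(alpha: float) -> float:
    """Exponent of n in the stated query lower bound: alpha/2."""
    return alpha / 2


def gap(alpha: float) -> float:
    """Difference between the upper-bound and lower-bound exponents."""
    return runtime_exponent(alpha) - lower_bound_exponent(alpha)
```

The gap is zero exactly when 2α-1 ≤ α/2, that is, when α ≤ 2/3. Above that threshold the 2α-1 term dominates and the two bounds separate.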
Other Approximations • There are several papers that give better approximation results, but none run in sublinear time • Andoni and Onak give an algorithm that computes the edit distance between two strings up to a factor of 2^(Õ(√log n)) in n^(1+o(1)) time
Algorithm Overview • A recursive divide and conquer algorithm • B is broken into substrings which are recursively matched against A • The matches are pieced together to form a matching for A • It is too expensive to match all the substrings • A small number of them are sampled and matched, relying on statistical properties of the matchings
Approximate Matching • Definition 1: An interval I = B[s…e] has a (t,E)-(approximate) matching with respect to A if for some interval A[s’…e’], s’=s+t and D(A[s’…e’],I)≤E • Example: A = abcd1234efgh5678, B = cd02; I has a (2,1)-(approximate) matching with respect to A
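Definition 1 can be checked by brute force on small inputs. The sketch below assumes that the example interval I is the whole string cd02 (s=1, e=4), keeps the slides' 1-based inclusive indices, and uses hypothetical names `edit_distance` and `has_approx_matching`. It is a definition checker, not part of the sublinear algorithm.

```python
def edit_distance(a: str, b: str) -> int:
    """Plain quadratic edit distance, used only to check the definition."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def has_approx_matching(A: str, B: str, s: int, e: int, t: int, E: int) -> bool:
    """True if I = B[s..e] has a (t, E)-matching with respect to A."""
    I = B[s - 1:e]
    s2 = s + t                      # forced start position s' = s + t
    if s2 < 1:
        return False
    # Only lengths within E of len(I) can give distance <= E.
    lo = max(s2 - 1, s2 - 1 + len(I) - E)
    hi = min(len(A), s2 - 1 + len(I) + E)
    return any(edit_distance(A[s2 - 1:e2], I) <= E
               for e2 in range(lo, hi + 1))
```

With A = abcd1234efgh5678 and I = cd02, the call `has_approx_matching(A, "cd02", 1, 4, 2, 1)` returns True, because A[3…6] = cd12 differs from cd02 by one substitution.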
Coordinated Matching • Definition 2: Let I = (I_1,…,I_k) be a collection of intervals. We say that I has a (t,σ,E,D)-coordinated matching with A if for all but D of the intervals I_i ∈ I, I_i has a (t_i,E)-matching with A, where |t-t_i|≤σ • Example: A = abcd1234efgh5678, B = cd0236gjfkl5; I has a (1,1,2,1)-coordinated matching with A
Coordinated Matching to Approximate Matching • We decompose an interval I of size S into k disjoint contiguous subintervals, I=(I_1,…,I_k), each of size S’=S/k (assuming k|S) • Lemma 1: If (I_1,…,I_k) has a (t,σ,εS’,δk)-coordinated matching with A, then I has a (t,βS)-(approximate) matching with A, where β = 2σ/S’ + ε + δ
Approximate Matching to Coordinated Matching • Lemma 2: Let c>1 and S>cE. If I has a (t,E)-matching with A, then I=(I_1,…,I_k) has a (t,E,cE/k,k/c)-coordinated matching with A • Lemma 3: If I has a (t,E)-matching with A, and k≥E, then I=(I_1,…,I_k) has a (t,E,0,E)-coordinated matching with A
To match A and B • Decompose B into a set of continuous disjoint intervals I • Lemma 2 argues that a match for A and B gives a coordinated matching for A and I • Use a subroutine (COORD-MATCHES) to find coordinated matches for I • Lemma 1 infers the existence of good matches for B from coordinated matches for I
COORD-MATCHES • COORD-MATCHES(A,I,σ,E,D,ε,c) • Let d be a constant, l = d·log(n). Choose samples i_1,…,i_l uniformly and independently from [1,…,k] • For each chosen sample i_j compute T_j = MATCHES(A, I_{i_j}, E) • Let Δ = (D/k + ε/2)·l • Return the set T, where t ∈ T iff T_j ∩ [t-σ…t+σ] = Ø for at most Δ sets T_j
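The steps of COORD-MATCHES can be sketched in Python as follows. Several pieces are assumptions made for this illustration and do not come from the slides: the `matches` callback stands in for MATCHES, the explicit `candidates` iterable says which shifts t are tested, the constant `d` is passed directly instead of being derived from c, and `rng` allows seeding.

```python
import math
import random
from typing import Callable, Iterable, Sequence, Set, Tuple

Interval = Tuple[int, int]  # hypothetical representation: (start, end) in B


def coord_matches(
    A: str,
    intervals: Sequence[Interval],
    sigma: float,
    E: float,
    D: float,
    eps: float,
    matches: Callable[[str, Interval, float], Set[int]],
    candidates: Iterable[int],
    d: float = 4.0,
    rng: random.Random = None,
) -> Set[int]:
    rng = rng or random.Random()
    k = len(intervals)
    # l = d * log(n) samples, drawn uniformly and independently.
    l = max(1, math.ceil(d * math.log(max(len(A), 2))))
    samples = [rng.randrange(k) for _ in range(l)]
    shift_sets = [matches(A, intervals[i], E) for i in samples]
    delta = (D / k + eps / 2) * l
    result = set()
    for t in candidates:
        # Count sampled intervals with no shift inside [t - sigma, t + sigma].
        misses = sum(
            1 for T_j in shift_sets
            if not any(abs(t - tj) <= sigma for tj in T_j)
        )
        if misses <= delta:
            result.add(t)
    return result
```

Testing every candidate shift explicitly is the simplest way to express the acceptance rule; an implementation aiming for the stated running time would merge the sets T_j instead.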
Sampling Lemma • Lemma 4: Suppose that a random element of a set S of size n has a property Z with probability p. For any positive ε and c, there exists d such that for d·log(n) random samples from S, the fraction p’ of these samples with property Z satisfies p-ε/2 ≤ p’ ≤ p+ε/2 with probability 1-1/n^c
COORD-MATCHES • Lemma 5: With probability 1-1/n^(c-1) over the random coins of COORD-MATCHES, the output T of COORD-MATCHES(A,I,σ,E,D,ε,c) has the following properties: • If I has a (t,σ,E,D)-coordinated matching then t ∈ T • If t ∈ T then I has a (t,σ,E,D+εk)-coordinated matching
MATCHES(A,I,E) • If E≥1, use a recursive call to COORD-MATCHES • If E<1 (i.e. E=0), then A must contain the interval I unchanged. The set of t values is computed directly using the algorithm SHIFTS
Implementing SHIFTS • A naïve implementation of SHIFTS may give an output set T consisting of n elements • We may restrict the allowed shifts to [-n^α,…,+n^α] • However, we need a running time of o(n^α), so we must further restrict the set of possible outputs
The Approximate Matching problem • Actually, we will solve the approximate matching problem: given a block I=B[s…e] of length b=e-s+1 and a constant c_2>1, find all indexes s’ such that A[s’…(s’+b-1)] matches I, in the sense that the two substrings have Hamming distance at most b/c_2 • Note that if D(A,B)<n^α, it is enough to consider s’ in the interval [s-n^α, s+n^α]
The Approximate Matching problem • Naively, for a given shift t we can randomly sample O(log(n)) indexes to determine (with high probability) whether the substring A[(t+1)…(t+b)] matches I, and try all 2n^α possible shifts • This requires Ω(n^α) queries to A
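The naive per-shift test can be sketched as below. The acceptance rule (sampled mismatch fraction at most 1/c_2), the constant `d`, and the 0-based shift `t` are assumptions made for this illustration; the slides only say that O(log(n)) sampled positions suffice with high probability.

```python
import math
import random


def sampled_block_match(A: str, I: str, t: int, c2: float = 4.0,
                        d: float = 8.0, rng: random.Random = None) -> bool:
    """Estimate whether A[t:t+b] is within Hamming distance b/c2 of I."""
    rng = rng or random.Random()
    b = len(I)
    if t < 0 or t + b > len(A):
        return False
    # Number of sampled positions: d * log(n), rounded up.
    l = max(1, math.ceil(d * math.log(len(A) + 1)))
    positions = (rng.randrange(b) for _ in range(l))
    mismatches = sum(A[t + x] != I[x] for x in positions)
    return mismatches / l <= 1 / c2


def naive_shifts(A: str, I: str, rng: random.Random = None) -> set:
    """Try every shift; this costs on the order of one test per shift."""
    rng = rng or random.Random()
    return {t for t in range(len(A) - len(I) + 1)
            if sampled_block_match(A, I, t, rng=rng)}
```

Because each shift is tested separately, the number of queries grows linearly with the number of shifts, which is the cost the ruler procedure is designed to avoid.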
The Ruler Procedure • We can compare pairs of characters A[i], I[j] such that a pair is compared for every i-j from 0 to u=2n^α, with √u queries to each string, given that b>√u • In A, the character positions divisible by √u are queried: A[√u, 2√u, …, u]. In I, √u consecutive positions are queried: I[1…√u] • Define cen = ⌊t/√u⌋ + 1 and mil = t mod √u; then for i = cen·√u and j = √u - mil, i-j = t
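The cen/mil decomposition can be verified mechanically. The sketch below assumes u is a perfect square and uses the slide's 1-based positions; the function names are hypothetical. With these marks the covered differences are 0 through u-1.

```python
import math


def ruler_marks(u: int):
    """Queried positions: multiples of sqrt(u) in A, and 1..sqrt(u) in I."""
    r = math.isqrt(u)
    if r * r != u:
        raise ValueError("this sketch assumes u is a perfect square")
    cen_marks = [c * r for c in range(1, r + 1)]  # A[sqrt(u), 2*sqrt(u), ..., u]
    mil_marks = list(range(1, r + 1))             # I[1 .. sqrt(u)]
    return cen_marks, mil_marks


def decompose(t: int, u: int):
    """Return (i, j) with i a cen mark, j a mil mark, and i - j == t."""
    r = math.isqrt(u)
    cen = t // r + 1
    mil = t % r
    return cen * r, r - mil
```

For u = 9 the marks are 3, 6, 9 in A and 1, 2, 3 in I. These six queried positions produce all nine differences 0 through 8, which is the source of the √u savings.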
The Ruler Procedure • To test whether a block matches: pick l=Θ(log(n)) random numbers m_1, m_2, …, m_l from [0, b-√u] • For each cen and mil mark, construct a fingerprint with l offsets, e.g. f(√u) = A[√u+m_1, √u+m_2, …, √u+m_l] • Detect with high probability whether a block matches with shift t by comparing the cen and mil fingerprints, i.e. f(cen·√u) = A[cen·√u+m_1, …, cen·√u+m_l] and f(t mod √u) = I[(t mod √u)+m_1, …, (t mod √u)+m_l]
The Ruler Procedure • If b≤√u we have only O(b) mil marks and Ω(u/b) cen marks • We can find all matching shifts by using O(max{√u,u/b}log(n)) queries
Efficient Implementation of the Ruler • We need an efficient algorithm to compare all fingerprints and return valid shifts • Example: u = |A|-|B| = 9, √u = 3, l = 2, m_1 = 1, m_2 = 3, A = dbadaabcdabddcd, B = abcdab
Quantizing the Ruler • The explicit list of all matching t can have Ω(u) values • We round the values of t to multiples of some integer Q and return all quantized shifts • The running time is O(max{√u,u/b,u/Q}log(n))
SHIFTS(A,I,Q) • Initialize the fingerprint data structure • Pick l=Θ(log(n)) random numbers m_1, m_2, …, m_l • Add all the fingerprints f(i) of A to the data structure, adding i to the A-list of f(i) • Add all the fingerprints f(j) of I to the data structure, adding j to the B-list of f(j) • Quantize all A-lists and B-lists • For each fingerprint, output the list of quantized shifts (differences)
SHIFTS(A,I,Q) • Theorem 1: Procedure SHIFTS finds all quantized shifts of interval I in A, with high probability. It runs in time O(max{√u,u/b,u/Q}log(n)), where u=|A|-b
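A simplified Python sketch of the fingerprint idea behind SHIFTS follows. It departs from the slides in several labeled ways: indices are 0-based, cen marks are the multiples of ⌊√u⌋ starting at 0, mil marks are 0…⌊√u⌋-1, a dictionary plays the role of the fingerprint data structure, and shifts are quantized after differencing rather than by quantizing the lists first. The `offsets` parameter is a hypothetical hook for fixing m_1,…,m_l in examples.

```python
import math
import random
from collections import defaultdict


def shifts(A: str, I: str, Q: int = 1, offsets=None, rng=None) -> set:
    """Candidate shifts t (rounded down to multiples of Q) with A[t:t+b] ~ I."""
    b = len(I)
    u = len(A) - b
    if u < 0:
        return set()
    r = max(1, math.isqrt(u))
    if b < r:
        raise ValueError("this sketch assumes b >= sqrt(u)")
    if offsets is None:
        rng = rng or random.Random()
        l = max(1, math.ceil(math.log2(len(A) + 1)))
        offsets = [rng.randint(0, b - r) for _ in range(l)]
    # A-lists: fingerprint -> cen marks of A carrying that fingerprint.
    a_lists = defaultdict(list)
    for i in range(0, u + r, r):
        a_lists[tuple(A[i + m] for m in offsets)].append(i)
    result = set()
    for j in range(r):  # mil marks of I
        fingerprint = tuple(I[j + m] for m in offsets)
        for i in a_lists.get(fingerprint, ()):
            t = i - j
            if 0 <= t <= u:
                result.add((t // Q) * Q)  # quantize the reported shift
    return result
```

On the slide's strings, `shifts("dbadaabcdabddcd", "abcdab", offsets=[0, 1, 2, 3])` returns `{5}`, the 0-based position where abcdab occurs. With only the two offsets `[1, 3]` the result is `{0, 5, 9}`: two extra candidates collide on both sampled characters, which illustrates why Θ(log(n)) offsets are used.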
MATCHES(A,I,E) • If E<1, use SHIFTS to compute T • If E≥1: set k = min{ε·n^(1-α), 2c_1E}; decompose I into a set I of contiguous disjoint intervals of size |I|/k; compute T = COORD-MATCHES(A, I, E, c_1E/k, k/c_1) • Return T
DECIDE(A,B,α,C) • Choose sufficiently small ε and sufficiently large c_1 (given α, C) • Let the quantization parameter be Q = ε·min{n^(1-α), n^(α/2)} • Set T = MATCHES(A,B,n^α) • If T is nonempty, output CLOSE, otherwise output FAR
DECIDE(A,B,α,C) • For any fixed α<1, we can choose constants ε and c_1 such that procedure DECIDE solves the edit distance testing problem with high probability
Running Time Analysis • Note that when k = 2c_1E, COORD-MATCHES is called with edit distance parameter c_1E/k = 1/2 < 1, i.e. the next call to MATCHES will call SHIFTS and end the recursion • At each level, the length of the interval passed to MATCHES goes down by a factor of k = Ω(n^(1-α)); after r = α/(1-α) levels the intervals are of length n/n^(r(1-α)) = O(n^(1-α)), E = O(n^α/n^(r(1-α))) = O(1), and SHIFTS will be called next
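The exponent arithmetic in the recursion-depth argument can be checked numerically. The helper names are hypothetical, and constants and the ε factor are ignored, so only exponents of n are tracked under the assumption that each level divides both the interval length and E by n^(1-α).

```python
def levels_bound(alpha: float) -> float:
    """r = alpha / (1 - alpha), the number of levels quoted on the slide."""
    if not 0 < alpha < 1:
        raise ValueError("alpha must lie strictly between 0 and 1")
    return alpha / (1 - alpha)


def length_exponent_after(alpha: float, r: float) -> float:
    """Exponent of n in the interval length n / n^(r(1-alpha))."""
    return 1 - r * (1 - alpha)


def error_exponent_after(alpha: float, r: float) -> float:
    """Exponent of n in E = n^alpha / n^(r(1-alpha))."""
    return alpha - r * (1 - alpha)
```

For α = 0.75 this gives r = 3, an interval length exponent of 0.25 = 1-α, and an error exponent of 0, matching the O(n^(1-α)) and O(1) claims.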
Running Time Analysis: α<1/2 • One level of recursion • B is broken into intervals of size O(n^α) • d·log(n) calls to SHIFTS with Q = ε·n^(α/2) • Each call takes O(max{√u, u/b, u/Q}·log(n)) = O(max{n^(α/2), 1, n^(α/2)}·log(n)) = O(n^(α/2)·log(n)) • One merge taking O(n^(α/2)·log(n)) • Total running time O(n^(α/2)·log^2(n))
Running Time Analysis: 1/2<α<2/3 • Two levels of recursion • At the last level, B is broken into intervals of size O(n^(α/2)) • log^2(n) calls to SHIFTS with Q = ε·n^(α/2) • Each call takes O(n^(α/2)·log(n)) • log(n) merges, each taking O(n^(α/2)·log(n)) • Total running time O(n^(α/2)·log^3(n))
Running Time Analysis: α>2/3 • r>2 levels of recursion • At the last level, B is broken into intervals of size O(n^(1-α)) • log^O(1)(n) calls to SHIFTS with Q = ε·n^(1-α) • Note that n^(1-α) < n^(α/2) • Each call takes O(max{√u, u/b, u/Q}·log(n)) = O((u/b)·log(n)) = O(n^(2α-1)·log(n)) • Total running time Õ(n^(2α-1))
Conclusion • We saw an algorithm for the edit distance test problem that runs in time Õ(n^(max{α/2, 2α-1})) • Any probabilistic algorithm for the edit distance test problem requires Ω(n^(α/2)) queries