Approximating Edit Distance in Near-Linear Time

Alexandr Andoni (MIT)
Joint work with Krzysztof Onak (MIT)

Edit Distance
• For two strings x, y ∈ Σ^n
• ed(x,y) = minimum number of edit operations to transform x into y
• Edit operations = insertion/deletion/substitution
• Important in: computational biology, text processing, etc.
• Example: ed(0101010, 1010101) = 2
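As a point of reference for the definition above, here is a minimal sketch of the textbook quadratic-time dynamic program for ed(x,y). It is the exact baseline, not the near-linear algorithm of the talk; the function name `edit_distance` is chosen for illustration.

```python
def edit_distance(x: str, y: str) -> int:
    """Exact edit distance by dynamic programming, O(|x| * |y|) time.

    Only one row of the DP table is kept at a time.
    """
    prev = list(range(len(y) + 1))  # distance from the empty prefix of x
    for i, cx in enumerate(x, start=1):
        curr = [i]  # distance from x[:i] to the empty prefix of y
        for j, cy in enumerate(y, start=1):
            curr.append(min(
                prev[j] + 1,               # delete cx
                curr[j - 1] + 1,           # insert cy
                prev[j - 1] + (cx != cy),  # substitute, or match for free
            ))
        prev = curr
    return prev[-1]


print(edit_distance("0101010", "1010101"))  # 2, as in the example on the slide
```

The two edits in the example are: delete the leading 0, then append a 1.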

Computing Edit Distance
• Problem: compute ed(x,y) for given x, y ∈ {0,1}^n
• Exactly:
  • O(n^2) [Levenshtein’65]
  • O(n^2 / log^2 n) for |Σ| = O(1) [Masek-Paterson’80]
• Approximately in n^{1+o(1)} time:
  • n^{1/3+o(1)} approximation [Batu-Ergun-Sahinalp’06], improving over [Sahinalp-Vishkin’96, Cole-Hariharan’02, BarYossef-Jayram-Krauthgamer-Kumar’04]
• Sublinear time:
  • distinguish ed(x,y) ≤ n^{1-ε} vs ed(x,y) ≥ n/100 in n^{1-2ε} time [Batu-Ergun-Kilian-Magen-Raskhodnikova-Rubinfeld-Sami’03]

Computing via embedding into ℓ_1
• Embedding: f: {0,1}^n → ℓ_1
  • such that ed(x,y) ≈ ||f(x) - f(y)||_1
  • up to some distortion (= approximation)
• Can then approximate ed(x,y) in the time it takes to compute f(x) and f(y)
• Best embedding by [Ostrovsky-Rabani’05]:
  • distortion = 2^{Õ(√log n)}
  • Computation time: ~n^2 randomized (and similar dimension)
• Helps for nearest neighbor search, sketching, but not computation…

Our result
• Theorem: Can compute ed(x,y) in
  • n · 2^{Õ(√log n)} time with
  • 2^{Õ(√log n)} approximation
• While it uses some ideas of the [OR’05] embedding, it is not an algorithm for computing the [OR’05] embedding

Review of Ostrovsky-Rabani embedding
• φ_m = embedding of strings of length m
• δ(m) = distortion of φ_m
• Embedding is recursive
  • Partition into b blocks (b later chosen to be exp(√log m))
  • Use embeddings φ_k for k ≤ m/b
  • Embed each block separately as follows…
[Figure: string X partitioned into blocks of length m/b]

Ostrovsky-Rabani embedding (II)
• Want to approximate ed(x,y) by
  • ∑_{i=1..b} ∑_{s∈S} TEMD_s(E_i^s(x), E_i^s(y))
• EMD(A,B) = min-cost bipartite matching
• T = thresholded
• Finish by embedding TEMD into ℓ_1 with small distortion
[Figure: string X split into blocks with sets E_1^s, E_2^s, E_3^s, …, E_b^s; E_1^s = rec. embedding of the s substrings]
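To make "EMD = min-cost bipartite matching" and the thresholding concrete, here is a brute-force sketch for tiny point sets. It assumes the points live in ℓ_1 and that thresholding means capping each matched pair's cost at a value `threshold`; the slide does not give the exact normalization of TEMD, so none is applied here. The names `l1` and `emd` are illustrative.

```python
from itertools import permutations


def l1(u, v):
    """l_1 distance between two equal-length tuples."""
    return sum(abs(a - b) for a, b in zip(u, v))


def emd(A, B, threshold=None):
    """Min-cost perfect matching between equal-size point lists A and B.

    Brute force over all |A|! matchings, so only usable for tiny inputs.
    If `threshold` is given, each pair's cost is capped at that value
    (an assumed reading of "thresholded"; no normalization is applied).
    """
    assert len(A) == len(B)

    def cost(a, b):
        d = l1(a, b)
        return d if threshold is None else min(d, threshold)

    return min(
        sum(cost(a, B[j]) for a, j in zip(A, perm))
        for perm in permutations(range(len(B)))
    )


A = [(0,), (10,)]
B = [(1,), (100,)]
print(emd(A, B))               # 91: match 0 with 1 and 10 with 100
print(emd(A, B, threshold=5))  # 6: the far pair now costs only 5
```

Capping the per-pair cost keeps a single far-away point from dominating the total, which is the effect the threshold is meant to have.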

Distortion of [OR] embedding
• Suppose we can embed TEMD into ℓ_1 with distortion (log m)^{O(1)}
• Then [Ostrovsky-Rabani’05] show that the distortion of φ_m is
  • δ(m) ≤ (log m)^{O(1)} · [δ(m/b) + b]
• For b = exp[√log m]
  • δ(m) ≤ exp[Õ(√log m)]
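A rough unrolling of the recurrence shows where the final bound comes from. This is a simplified sketch, assuming the factor C = (log m)^{O(1)} and the block count b are held fixed at their top-level values across all levels of the recursion, that C ≥ 1, and that δ(1) = O(1); the original analysis may handle these details differently.

```latex
\begin{align*}
\delta(m) &\le C\,[\delta(m/b) + b]
  && C = (\log m)^{O(1)} \\
&\le C^{t}\,\delta(m/b^{t}) + b\sum_{j=1}^{t} C^{j}
  && \text{unrolling } t \text{ times} \\
&\le C^{t}\,\bigl(\delta(1) + t\,b\bigr)
  && t = \log m / \log b,\ C \ge 1 \\
&\le \exp\!\bigl[O(\sqrt{\log m}\,\log\log m)\bigr]
  && b = \exp[\sqrt{\log m}],\ t = \sqrt{\log m} \\
&= \exp\!\bigl[\tilde{O}(\sqrt{\log m})\bigr].
\end{align*}
```

The choice b = exp[√log m] balances the two costs: the additive b term and the factor C^t accumulated over t = √log m levels are both exp[Õ(√log m)].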

Why it is expensive to compute [OR] embedding
• In first step, need to compute recursive embedding for ~n/b strings of length ~n/b
• The dimension blows up
[Figure: string X with a substring of length s marked; E_1^s = rec. embedding of the s substrings]

Our Algorithm
[Figure: z = x followed by y, with the substring z[i:i+m] starting at position i marked]
• For each length m in some fixed set L ⊂ [n], compute vectors v_i^m ∈ ℓ_1 such that
  • ||v_i^m - v_j^m||_1 ≈ ed(z[i:i+m], z[j:j+m])
  • up to distortion δ(m)
• Dimension of v_i^m is only O(log^2 n)
• Vectors v_i^m are computed inductively from v_i^k for k ≤ m/b (k ∈ L)
• Output: ed(x,y) ≈ ||v_1^{n/2} - v_{n/2+1}^{n/2}||_1 (i.e., for m = n/2 = |x| = |y|)

Idea: intuition
||v_i^m - v_j^m||_1 ≈ ed(z[i:i+m], z[j:j+m])
• For each m ∈ L, compute φ_m(z[i:i+m])
  • as in the O-R recursive step, except we use vectors v_i^k, k < m/b & k ∈ L, in place of recursive embeddings of shorter substrings (sets E_i^s)
• Resulting φ_m(z[i:i+m]) have high dimension, > m/b…
• Apply Bourgain’s Lemma to the vectors φ_m(z[i:i+m]), i = 1..n-m
  • [Bourgain]: given n vectors q_i, construct n vectors q̃_i of O(log^2 n) dimension such that ||q_i - q_j||_1 ≈ ||q̃_i - q̃_j||_1 up to O(log n) distortion
  • Apply to vectors φ_m(z[i:i+m]) to obtain vectors v_i^m of polylogarithmic dimension
  • Incurs O(log n) distortion at each step of the recursion, but that is OK as there are only ~√log n steps, giving an additional distortion of only exp[Õ(√log n)]

Idea: implementation
• Essential step is:
• Main Lemma: fix n vectors v_i ∈ ℓ_1, of dimension p = O(log^2 n).
  • Let s < n. Define A_i = {v_i, v_{i+1}, …, v_{i+s-1}}.
  • Then we can compute vectors q_i ∈ ℓ_1^k for k = O(log^2 n) such that
  • ||q_i - q_j||_1 ≈ TEMD(A_i, A_j) up to distortion log^{O(1)} n
  • Computing the q_i’s takes Õ(n) time.

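Proof of Main Lemma
• Chain of embeddings (distortion of each step in parentheses):
  • TEMD over n sets A_i → min_{low} ℓ_1^{high} (O(log^2 n))
  • min_{low} ℓ_1^{high} → min_{low} ℓ_1^{low} (O(1))
  • min_{low} ℓ_1^{low} → min_{low} tree-metric (O(log n))
  • min_{low} tree-metric → sparse graph-metric (O(log^3 n))
  • sparse graph-metric → ℓ_1^{low} (O(log n)) [Bourgain] (efficient)
• Graph-metric: shortest path on a weighted graph
• Sparse: Õ(n) edges
• “low” = log^{O(1)} n
• min_k M is a semi-metric on M^k with “distance” d_{min,M}(x,y) = min_{i=1..k} d_M(x_i, y_i)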
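The min-product "distance" is simple to state in code, and a three-point example shows why it is only a semi-metric. This is an illustrative sketch with M taken to be the real line; the names `min_product_distance` and `abs_diff` are not from the talk.

```python
def abs_diff(a, b):
    """Base metric d_M on the real line."""
    return abs(a - b)


def min_product_distance(x, y, d=abs_diff):
    """d_min(x, y) = min over coordinates i of d(x_i, y_i).

    x and y are k-tuples of points of the base space M.
    """
    return min(d(xi, yi) for xi, yi in zip(x, y))


x, y, z = (0, 0), (0, 10), (10, 10)
print(min_product_distance(x, y))  # 0: the first coordinates agree
print(min_product_distance(y, z))  # 0: the second coordinates agree
print(min_product_distance(x, z))  # 10: larger than 0 + 0
```

Since d_min(x,z) exceeds d_min(x,y) + d_min(y,z), the triangle inequality fails, which is why the later steps only need the min-product to be close to a metric.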

Step 1: TEMD over n sets A_i → min_{low} ℓ_1^{high}, distortion O(log^2 n)
• Lemma 1: can embed TEMD over n sets in ({0..M}^p, ℓ_1) into min_{O(log n)} ℓ_1^{M^p} with O(log^2 n) distortion, w.h.p.
• Use [A-Indyk-Krauthgamer’08]
  • (similar to Ostrovsky-Rabani embedding)
• Embedding: for each Δ = power of 2
  • impose a randomly-shifted grid
  • one coordinate per cell, equal to # of points in the cell
• Theorem [AIK]:
  • no contraction w.h.p.
  • expected expansion = O(log^2 n)
• Just repeat O(log n) times
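The grid construction can be sketched as follows: sample one random shift per scale Δ, then map a point set to its per-cell counts. This is a minimal illustration of the bullet points only. The slide does not say how coordinates at different scales are weighted, so raw counts are used here, and the names `sample_grids`, `embed`, and `l1_sparse` are hypothetical.

```python
import random
from collections import Counter


def sample_grids(p, max_coord, rng):
    """One randomly shifted grid per scale delta = 1, 2, 4, ..., <= max_coord."""
    grids = []
    delta = 1
    while delta <= max_coord:
        shift = tuple(rng.randrange(delta) for _ in range(p))
        grids.append((delta, shift))
        delta *= 2
    return grids


def embed(points, grids):
    """Sparse vector with one coordinate per (grid, cell): the number of points in that cell."""
    hist = Counter()
    for g, (delta, shift) in enumerate(grids):
        for pt in points:
            cell = tuple((c + s) // delta for c, s in zip(pt, shift))
            hist[(g, cell)] += 1
    return hist


def l1_sparse(u, v):
    """l_1 distance between two sparse count vectors."""
    return sum(abs(u[k] - v[k]) for k in set(u) | set(v))


rng = random.Random(0)
grids = sample_grids(p=2, max_coord=16, rng=rng)  # the same grids must be used for every set
A = [(0, 0), (3, 5), (9, 9)]
B = [(0, 1), (3, 5), (15, 2)]
print(l1_sparse(embed(A, grids), embed(B, grids)))
```

Because each coordinate is a count, the map is additive over point sets: the vector for a union of two disjoint lists is the sum of their vectors.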

Step 2: min_{low} ℓ_1^{high} → min_{low} ℓ_1^{low}, distortion O(1)
• Lemma 2: can embed an n point set from ℓ_1^M into min_{O(log n)} ℓ_1^k, for k = O(log^3 n), with O(1) distortion.
• Use (weak) dimensionality reduction in ℓ_1
• Thm [Indyk’06]: Let A be a matrix of size M by k = O(log^3 n) with each element chosen from the Cauchy distribution. Then for any x̃ = Ax, ỹ = Ay:
  • no contraction: ||x̃ - ỹ||_1 ≥ ||x - y||_1 (w.h.p.)
  • 5-expansion: ||x̃ - ỹ||_1 ≤ 5 · ||x - y||_1 (with 0.01 probability)
• Just use O(log n) of such embeddings
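A small sketch of a Cauchy random projection, and of the 1-stability property that makes it useful for ℓ_1: each coordinate of A(x - y) is distributed as ||x - y||_1 times a standard Cauchy variable. The median check below is a standard way to see this property; it is not the norm-based guarantee quoted in the theorem, and the scaling used in the talk is not specified on the slide. Function names are illustrative.

```python
import math
import random
import statistics


def cauchy_matrix(k, M, rng):
    """k x M matrix of i.i.d. standard Cauchy entries (inverse-CDF sampling)."""
    return [
        [math.tan(math.pi * (rng.random() - 0.5)) for _ in range(M)]
        for _ in range(k)
    ]


def project(A, x):
    """Matrix-vector product A x."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def median_abs_difference(px, py):
    """Median of |px_i - py_i|; estimates ||x - y||_1 by 1-stability."""
    return statistics.median(abs(a - b) for a, b in zip(px, py))


rng = random.Random(0)
A = cauchy_matrix(k=2001, M=5, rng=rng)
x = [1, 0, 2, 0, 3]
y = [0, 0, 0, 0, 0]  # ||x - y||_1 = 6
print(median_abs_difference(project(A, x), project(A, y)))
```

The Cauchy distribution is heavy-tailed, so a plain sum of the projected coordinates occasionally overshoots badly; this is consistent with the theorem above stating expansion only with small probability and then repeating O(log n) times.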

Efficiency of Step 1+2
• From steps 1+2, we get some embedding f() of sets A_i = {v_i, v_{i+1}, …, v_{i+s-1}} into min_{low} ℓ_1^{low}
• Naively it would take Ω(n·s) = Ω(n^2) time to compute all f(A_i)
• More efficiently:
  • Note that f() is linear: f(A) = ∑_{a∈A} f(a)
  • Then f(A_i) = f(A_{i-1}) - f(v_{i-1}) + f(v_{i+s-1})
  • Compute f(A_i) in order, for a total of Õ(n) time
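The sliding-window update can be sketched as below. It assumes f maps each single vector to a dense list of some small fixed length k, so that each window update costs O(k); the per-point map used in the usage example is a toy stand-in, not the embedding from steps 1+2. Indices are 0-based, so window i is `vectors[i:i+s]`.

```python
def sliding_window_sums(vectors, s, f):
    """Return [f(A_0), f(A_1), ...] where A_i = vectors[i:i+s] and f(A) = sum of f(a).

    Each window after the first is obtained from the previous one by
    subtracting the point that leaves and adding the point that enters.
    """
    per_point = [f(v) for v in vectors]
    k = len(per_point[0])
    current = [sum(fv[c] for fv in per_point[:s]) for c in range(k)]
    out = [current]
    for i in range(1, len(vectors) - s + 1):
        leaving, entering = per_point[i - 1], per_point[i + s - 1]
        current = [c - a + b for c, a, b in zip(current, leaving, entering)]
        out.append(current)
    return out


def toy_f(v):
    """Hypothetical per-point map, used only to exercise the update rule."""
    return [v[0], v[1], v[0] * v[1]]


vectors = [(1, 2), (3, 4), (5, 6), (7, 8), (9, 10)]
print(sliding_window_sums(vectors, s=3, f=toy_f))
```

With n vectors this computes all n-s+1 window sums in O(n·k) time, instead of the O(n·s·k) of summing every window from scratch.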

Step 3: min_{low} ℓ_1^{low} → min_{low} tree-metric, distortion O(log n)
• Lemma 3: can embed ℓ_1 over {0..M}^p into min_{O(log^2 n)} tree-m, with O(log n) distortion.
• For each Δ = a power of 2, take O(log n) random grids. Each grid gives a min-coordinate
[Figure: grid with cell side Δ; labels Δ and ∞]

Step 4: min_{low} tree-metric → sparse graph-metric, distortion O(log^3 n)
• Lemma 4: suppose we have n points in min_{low} tree-m, which approximates a metric up to distortion D. Can embed into a graph-metric of size Õ(n) with distortion D.

Step 5: sparse graph-metric → ℓ_1^{low}, distortion O(log n)
• Lemma 5: Given a graph with m edges, can embed the graph-metric into ℓ_1^{low} with O(log n) distortion in Õ(m) time.
• Just implement Bourgain’s embedding:
  • Choose O(log^2 n) sets B_i
  • Need to compute the distance from each node to each B_i
  • For each B_i, can compute its distance to each node using Dijkstra’s algorithm in Õ(m) time
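The steps above can be sketched with a multi-source Dijkstra: one run per random set B gives the distance from every node to B, and that distance is one coordinate of the embedding. The sampling scheme (each node kept with probability 2^{-j} for j = 1..log n, repeated about log n times) is the usual form of Bourgain's construction. Replacing an empty sample by a single random node, leaving coordinates unscaled, and assuming a connected graph with comparable node labels are simplifications made here.

```python
import heapq
import math
import random


def distances_to_set(adj, sources):
    """Multi-source Dijkstra: distance from every node to the nearest source.

    adj maps node -> list of (neighbor, nonnegative weight).
    """
    dist = {u: float("inf") for u in adj}
    heap = []
    for s in sources:
        dist[s] = 0.0
        heap.append((0.0, s))
    heapq.heapify(heap)
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in adj[u]:
            nd = d + w
            if nd < dist[v]:
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


def bourgain_embedding(adj, rng):
    """One coordinate d(u, B) per random set B, using about log^2 n sets in total."""
    nodes = list(adj)
    levels = max(1, math.ceil(math.log2(len(nodes))))
    coords = {u: [] for u in nodes}
    for j in range(1, levels + 1):
        for _ in range(levels):
            B = [u for u in nodes if rng.random() < 2.0 ** -j]
            if not B:
                B = [rng.choice(nodes)]  # simplification: avoid an empty set
            dist = distances_to_set(adj, B)
            for u in nodes:
                coords[u].append(dist[u])
    return coords


# Path graph 0 - 1 - 2 - 3 with unit weights.
adj = {0: [(1, 1.0)], 1: [(0, 1.0), (2, 1.0)], 2: [(1, 1.0), (3, 1.0)], 3: [(2, 1.0)]}
emb = bourgain_embedding(adj, random.Random(0))
print(len(emb[0]), emb[0][:4])
```

Each coordinate is 1-Lipschitz in the graph metric, since |d(u,B) - d(v,B)| ≤ d(u,v), so no single coordinate can expand a distance.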

Summary of Main Lemma
• Min-product helps to get low dimension (~small-size sketch)
  • bypasses impossibility of dim-reduction in ℓ_1
• OK that it is not a metric, as long as it is close to a metric
[Figure: chain of embeddings with per-step distortion: TEMD over n sets A_i → min_{low} ℓ_1^{high} (O(log^2 n)) → min_{low} ℓ_1^{low} (O(1)) → min_{low} tree-metric (O(log n)) → sparse graph-metric (O(log^3 n)) → ℓ_1^{low} (O(log n)); the earlier steps are labeled “oblivious”, the later steps “non-oblivious”]

Conclusion + a question
• Theorem: can compute ed(x,y) in n · 2^{Õ(√log n)} time with 2^{Õ(√log n)} approximation
• Question: can we do the following “oblivious” dimensionality reduction in ℓ_1?
  • Given n, construct a randomized embedding φ: ℓ_1^M → ℓ_1^{polylog n} such that for any v_1…v_n ∈ ℓ_1^M, with high probability, φ has distortion log^{O(1)} n on these vectors?
  • If φ exists, it cannot be linear [Charikar-Sahai’02]
