
Approximating Edit Distance in Near-Linear Time

Alexandr Andoni (MIT). Joint work with Krzysztof Onak (MIT). Edit distance: for two strings x, y ∈ Σ^n, ed(x,y) = the minimum number of edit operations needed to transform x into y. Edit operations = insertion/deletion/substitution.

Presentation Transcript


  1. Approximating Edit Distance in Near-Linear Time
     • Alexandr Andoni (MIT)
     • Joint work with Krzysztof Onak (MIT)

  2. Edit Distance
     • For two strings x, y ∈ Σ^n
     • ed(x,y) = minimum number of edit operations to transform x into y
     • Edit operations = insertion/deletion/substitution
     • Important in: computational biology, text processing, etc.
     • Example: ed(0101010, 1010101) = 2
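As a baseline for the definition above, here is a minimal sketch of the classic quadratic-time dynamic program for exact edit distance. The function name `edit_distance` is chosen here for illustration; this is the textbook method, not the near-linear algorithm of the talk.

```python
def edit_distance(x: str, y: str) -> int:
    """Exact edit distance via the O(|x|*|y|) dynamic program, two rolling rows."""
    # prev[j] = distance between the empty prefix of x and y[:j]
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        curr = [i]  # distance between x[:i] and the empty prefix of y
        for j, cy in enumerate(y, start=1):
            curr.append(min(
                prev[j] + 1,               # delete cx
                curr[j - 1] + 1,           # insert cy
                prev[j - 1] + (cx != cy),  # substitute, or free match
            ))
        prev = curr
    return prev[-1]


# The example from the slide: delete the leading 0, then append a 1.
print(edit_distance("0101010", "1010101"))  # 2
```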

  3. Computing Edit Distance
     • Problem: compute ed(x,y) for given x, y ∈ {0,1}^n
     • Exactly:
       • O(n^2) [Levenshtein’65]
       • O(n^2 / log^2 n) for |Σ| = O(1) [Masek-Paterson’80]
     • Approximately, in n^{1+o(1)} time:
       • n^{1/3+o(1)} approximation [Batu-Ergun-Sahinalp’06], improving over [Sahinalp-Vishkin’96, Cole-Hariharan’02, BarYossef-Jayram-Krauthgamer-Kumar’04]
     • Sublinear time:
       • distinguish ed(x,y) ≤ n^{1-ε} from ed(x,y) ≥ n/100 in n^{1-2ε} time [Batu-Ergun-Kilian-Magen-Raskhodnikova-Rubinfeld-Sami’03]

  4. Computing via embedding into ℓ1
     • Embedding: f: {0,1}^n → ℓ1
       • such that ed(x,y) ≈ ||f(x) - f(y)||_1
       • up to some distortion (= approximation)
     • Can then approximate ed(x,y) in the time needed to compute f(x)
     • Best embedding by [Ostrovsky-Rabani’05]:
       • distortion = 2^{Õ(√log n)}
       • Computation time: ~n^2, randomized (and similar dimension)
     • Helps for nearest neighbor search and sketching, but not for computation…

  5. Our result
     • Theorem: can compute ed(x,y) in n·2^{Õ(√log n)} time with 2^{Õ(√log n)} approximation
     • While the algorithm uses some ideas of the [OR’05] embedding, it is not an algorithm for computing the [OR’05] embedding

  6. Review of Ostrovsky-Rabani embedding
     • φ_m = embedding of strings of length m
     • δ(m) = distortion of φ_m
     • Embedding is recursive
       • Partition into b blocks (b later chosen to be exp(√log m))
       • Use embeddings φ_k for k ≤ m/b
       • Embed each block separately as follows…
     [Figure: x partitioned into blocks of length m/b]

  7. Ostrovsky-Rabani embedding (II)
     • Want to approximate ed(x,y) by
       • Σ_{i=1..b} Σ_{s∈S} TEMD_s(E_i^s(x), E_i^s(y))
       • EMD(A,B) = min-cost bipartite matching
       • T = thresholded
     • Finish by embedding TEMD into ℓ1 with small distortion
     [Figure: x split into blocks with sets E_1^s, E_2^s, E_3^s, …, E_b^s; E_1^s = rec. embedding of the s substrings]
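To make "EMD = min-cost bipartite matching" concrete, here is a brute-force sketch for tiny point sets. The slide only says that the T in TEMD stands for "thresholded"; capping each pairwise ℓ1 cost at a threshold `t` is an assumption made for this illustration, and any normalization, as well as the way the different s ∈ S are combined, is left out. The names `l1` and `temd` are hypothetical.

```python
from itertools import permutations


def l1(a, b):
    """l1 distance between two equal-length tuples of numbers."""
    return sum(abs(p - q) for p, q in zip(a, b))


def temd(A, B, t):
    """Min-cost perfect matching between equal-size point lists A and B.

    Each pairwise cost is min(l1 distance, t). The capping rule is an
    assumption for illustration. Tries all |A|! matchings, so it is only
    suitable for very small inputs.
    """
    if len(A) != len(B):
        raise ValueError("A and B must have the same number of points")
    return min(
        sum(min(l1(a, B[j]), t) for a, j in zip(A, perm))
        for perm in permutations(range(len(B)))
    )


A = [(0, 0), (10, 0)]
B = [(10, 1), (0, 1)]
print(temd(A, B, t=100))  # matches each point to its near neighbor: cost 1 + 1
```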

  8. Distortion of [OR] embedding
     • Suppose we can embed TEMD into ℓ1 with distortion (log m)^{O(1)}
     • Then [Ostrovsky-Rabani’05] show that the distortion of φ_m is
       • δ(m) ≤ (log m)^{O(1)} · [δ(m/b) + b]
     • For b = exp[√log m]
       • δ(m) ≤ exp[Õ(√log m)]
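The step from the recurrence to the final bound can be sketched by unrolling it. The derivation below assumes that b and the polylogarithmic factor C are held fixed at their top-level values across all levels of the recursion, and that δ(1) ≤ b; these simplifications are assumptions of this sketch, not statements from the slides.

```latex
\begin{align*}
\delta(m) &\le C\,\bigl[\delta(m/b) + b\bigr],
  \qquad C = (\log m)^{O(1)},\quad b = e^{\sqrt{\log m}} \\
&\le C^{t}\,\delta(1) + b\sum_{j=1}^{t} C^{j},
  \qquad t = \frac{\log m}{\log b} = \sqrt{\log m} \\
&\le (t+1)\, b\, C^{t} \\
&= \exp\!\bigl[O(\sqrt{\log m}\,\log\log m)\bigr]
 = \exp\!\bigl[\tilde{O}(\sqrt{\log m})\bigr].
\end{align*}
```

The recursion has only about √log m levels, so the polylogarithmic loss per level compounds to (log m)^{O(√log m)}, which is absorbed into the Õ in the exponent.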

  9. Why it is expensive to compute the [OR] embedding
     • In the first step, we need to compute the recursive embedding for ~n/b strings of length ~n/b
     • The dimension blows up
     [Figure: x with the set E_1^s = rec. embedding of the s substrings]

  10. Our Algorithm
     • For each length m in some fixed set L ⊂ [n], compute vectors v_i^m ∈ ℓ1 such that
       • ||v_i^m – v_j^m||_1 ≈ ed(z[i:i+m], z[j:j+m])
       • up to distortion δ(m)
       • Dimension of v_i^m is only O(log^2 n)
     • Vectors v_i^m are computed inductively from v_i^k for k ≤ m/b (k ∈ L)
     • Output: ed(x,y) ≈ ||v_1^{n/2} – v_{n/2+1}^{n/2}||_1 (i.e., for m = n/2 = |x| = |y|)
     [Figure: z = xy, the concatenation of x and y, with the substring z[i:i+m] starting at position i]

  11. Idea: intuition
     • Goal: ||v_i^m – v_j^m||_1 ≈ ed(z[i:i+m], z[j:j+m])
     • For each m ∈ L, compute φ_m(z[i:i+m])
       • as in the O-R recursive step, except that we use vectors v_i^k, k < m/b and k ∈ L, in place of recursive embeddings of shorter substrings (sets E_i^s)
     • Resulting φ_m(z[i:i+m]) have high dimension, > m/b…
     • Apply Bourgain’s Lemma to the vectors φ_m(z[i:i+m]), i = 1..n-m
       • [Bourgain]: given n vectors q_i, construct n vectors q̃_i of O(log^2 n) dimension such that ||q_i - q_j||_1 ≈ ||q̃_i - q̃_j||_1 up to O(log n) distortion
       • Applied to the vectors φ_m(z[i:i+m]), this yields vectors v_i^m of polylogarithmic dimension
       • This incurs O(log n) distortion at each step of the recursion, but that is OK as there are only ~√log n steps, giving an additional distortion of only exp[Õ(√log n)]

  12. Idea: implementation
     • Essential step is:
     • Main Lemma: fix n vectors v_i ∈ ℓ1, of dimension p = O(log^2 n).
       • Let s < n. Define A_i = {v_i, v_{i+1}, …, v_{i+s-1}}.
       • Then we can compute vectors q_i ∈ ℓ1^k for k = O(log^2 n) such that
       • ||q_i – q_j||_1 ≈ TEMD(A_i, A_j) up to distortion log^{O(1)} n
       • Computing the q_i’s takes Õ(n) time.

  13. Proof of Main Lemma
     • Chain of embeddings (distortion of each step in brackets):
       TEMD over n sets A_i
       → [O(log^2 n)] min^{low} ℓ1^{high}
       → [O(1)] min^{low} ℓ1^{low}
       → [O(log n)] min^{low} tree-metric
       → [O(log^3 n)] sparse graph-metric
       → [O(log n), [Bourgain] (efficient)] ℓ1^{low}
     • Graph-metric: shortest path on a weighted graph
     • Sparse: Õ(n) edges
     • “low” = log^{O(1)} n
     • min^k M is a semi-metric on M^k with “distance” d_{min,M}(x,y) = min_{i=1..k} d_M(x_i, y_i)
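The min-product "distance" defined above is simple to state in code. The sketch below uses ℓ1 as the base metric and hypothetical names (`l1`, `min_product_distance`); the small example shows why it is only a semi-metric: the triangle inequality can fail.

```python
def l1(a, b):
    """l1 distance between two equal-length tuples of numbers."""
    return sum(abs(p - q) for p, q in zip(a, b))


def min_product_distance(x, y, d=l1):
    """d_min(x, y) = min over i of d(x[i], y[i]).

    x and y are k-tuples whose entries are points of the base space M.
    """
    if len(x) != len(y) or not x:
        raise ValueError("x and y must be non-empty and of equal length")
    return min(d(xi, yi) for xi, yi in zip(x, y))


# k = 2 copies of the real line (points written as 1-tuples).
x = ((0,), (0,))
y = ((0,), (10,))
z = ((10,), (10,))
print(min_product_distance(x, y))  # 0: the first coordinates agree
print(min_product_distance(y, z))  # 0: the second coordinates agree
print(min_product_distance(x, z))  # 10: larger than 0 + 0
```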

  14. Step 1: TEMD over n sets A_i → min^{low} ℓ1^{high}, distortion O(log^2 n)
     • Lemma 1: can embed TEMD over n sets in ({0..M}^p, ℓ1) into min^{O(log n)} ℓ1^{M^p} with O(log^2 n) distortion, w.h.p.
     • Use [A-Indyk-Krauthgamer’08]
       • (similar to the Ostrovsky-Rabani embedding)
     • Embedding: for each Δ = power of 2
       • impose a randomly-shifted grid
       • one coordinate per cell, equal to the number of points in the cell
     • Theorem [AIK]:
       • no contraction w.h.p.
       • expected expansion = O(log^2 n)
     • Just repeat O(log n) times
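The grid construction described above can be sketched as follows: one randomly shifted grid per power-of-two scale, with a sparse vector holding one coordinate per non-empty cell. This is a single copy (the lemma repeats it O(log n) times to form the min-product), thresholding is ignored, and weighting each cell count by Δ is an assumption borrowed from the standard grid embedding for EMD, since the slide does not state the weights. All names are hypothetical.

```python
import random
from collections import Counter


def make_grid_embedding(dim, max_coord, seed=0):
    """Return embed(points): multiset of integer points -> sparse vector."""
    rng = random.Random(seed)
    scales = []
    delta = 1
    while delta <= 2 * max_coord:
        # One random shift per scale, shared by every set that gets embedded.
        shift = tuple(rng.randrange(delta) for _ in range(dim))
        scales.append((delta, shift))
        delta *= 2

    def embed(points):
        vec = Counter()
        for delta, shift in scales:
            for pt in points:
                cell = tuple((c + s) // delta for c, s in zip(pt, shift))
                # Count of points in the cell, weighted by delta (assumption).
                vec[(delta, cell)] += delta
        return vec

    return embed


def l1_sparse(u, v):
    """l1 distance between two sparse vectors stored as Counters."""
    return sum(abs(u[k] - v[k]) for k in set(u) | set(v))


embed = make_grid_embedding(dim=2, max_coord=15)
A = [(0, 0), (3, 4)]
B = [(7, 7)]
# The map is linear in the multiset: f(A ∪ B) = f(A) + f(B).
print(l1_sparse(embed(A + B), embed(A) + embed(B)))  # 0
```

Linearity is the property that matters later: it lets f be maintained incrementally as points enter and leave a set.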

  15. Step 2: min^{low} ℓ1^{high} → min^{low} ℓ1^{low}, distortion O(1)
     • Lemma 2: can embed an n-point set from ℓ1^M into min^{O(log n)} ℓ1^k, for k = O(log^3 n), with O(1) distortion.
     • Use (weak) dimensionality reduction in ℓ1
     • Thm [Indyk’06]: Let A be a matrix of size M by k = O(log^3 n) with each element chosen from the Cauchy distribution. Then for any x̃ = Ax, ỹ = Ay:
       • no contraction: ||x̃ - ỹ||_1 ≥ ||x - y||_1 (w.h.p.)
       • 5-expansion: ||x̃ - ỹ||_1 ≤ 5·||x - y||_1 (with 0.01 probability)
     • Just use O(log n) of such embeddings

  16. Efficiency of Steps 1+2
     • From steps 1+2, we get some embedding f() of the sets A_i = {v_i, v_{i+1}, …, v_{i+s-1}} into min^{low} ℓ1^{low}
     • Naively, it would take Ω(n·s) time to compute all f(A_i), i.e. Ω(n^2) for s = Θ(n)
     • More efficiently:
       • Note that f() is linear: f(A) = Σ_{a∈A} f(a)
       • Then f(A_i) = f(A_{i-1}) - f(v_{i-1}) + f(v_{i+s-1})
       • Compute f(A_i) in order, for a total of Õ(n) time
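The sliding-window update above can be sketched directly. The code assumes the per-vector images f(v_i) are already available as dense lists of numbers and uses 0-based indices; the function name `window_sums` is hypothetical.

```python
def window_sums(fvals, s):
    """Compute f(A_i) for every window A_i = {v_i, ..., v_{i+s-1}}.

    fvals[i] is f(v_i) as a list of numbers. Because f is linear, each
    window is obtained from the previous one by subtracting the vector
    that leaves and adding the vector that enters, so the work per
    window is O(dimension) instead of O(s * dimension).
    """
    n = len(fvals)
    if not 1 <= s <= n:
        raise ValueError("need 1 <= s <= len(fvals)")
    dim = len(fvals[0])
    # First window is summed from scratch.
    current = [sum(fv[c] for fv in fvals[:s]) for c in range(dim)]
    out = [current]
    for i in range(1, n - s + 1):
        leaving, entering = fvals[i - 1], fvals[i + s - 1]
        current = [cur - a + b for cur, a, b in zip(current, leaving, entering)]
        out.append(current)
    return out


fvals = [[1, 0], [2, 1], [3, 5], [4, -1], [0, 2]]
print(window_sums(fvals, 3))  # [[6, 6], [9, 5], [7, 6]]
```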

  17. Step 3: min^{low} ℓ1^{low} → min^{low} tree-metric, distortion O(log n)
     • Lemma 3: can embed ℓ1 over {0..M}^p into min^{O(log^2 n)} tree-m, with O(log n) distortion.
     • For each Δ = a power of 2, take O(log n) random grids. Each grid gives a min-coordinate
     [Figure labels: ∞, Δ]

  18. Step 4: min^{low} tree-metric → sparse graph-metric, distortion O(log^3 n)
     • Lemma 4: suppose we have n points in min^{low} tree-m, which approximates a metric up to distortion D. Then we can embed them into a graph-metric of size Õ(n) with distortion D.

  19. Step 5: sparse graph-metric → ℓ1^{low}, distortion O(log n)
     • Lemma 5: given a graph with m edges, can embed the graph-metric into ℓ1^{low} with O(log n) distortion in Õ(m) time.
     • Just implement Bourgain’s embedding:
       • Choose O(log^2 n) sets B_i
       • Need to compute the distance from each node to each B_i
       • For each B_i, can compute its distance to each node using Dijkstra’s algorithm in Õ(m) time
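The steps above can be sketched with one multi-source Dijkstra run per set B_i. The sampling scheme (each node kept with probability 2^-j, repeated `reps` times per level j), the fallback for empty sets, and the omission of any normalization are assumptions of this illustration; the function names are hypothetical, and the graph is assumed connected and undirected.

```python
import heapq
import math
import random


def distances_to_set(adj, sources):
    """Multi-source Dijkstra: distance from each node to its nearest source.

    adj maps a node to a list of (neighbor, weight) pairs.
    """
    dist = {u: math.inf for u in adj}
    heap = []
    for s in sources:
        dist[s] = 0.0
        heap.append((0.0, s))
    heapq.heapify(heap)
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in adj[u]:
            nd = d + w
            if nd < dist[v]:
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


def bourgain_embedding(adj, reps=None, seed=0):
    """Map each node to the vector of its distances to random node sets B_i."""
    rng = random.Random(seed)
    nodes = list(adj)
    levels = max(1, math.ceil(math.log2(len(nodes))))
    reps = reps or levels
    coords = {u: [] for u in nodes}
    for j in range(1, levels + 1):
        for _ in range(reps):
            B = [u for u in nodes if rng.random() < 2.0 ** -j]
            if not B:
                B = [rng.choice(nodes)]  # keep every coordinate finite
            dist = distances_to_set(adj, B)
            for u in nodes:
                coords[u].append(dist[u])
    return coords


# Path graph 0 - 1 - 2 - 3 with edge weights 1, 2, 3.
adj = {
    0: [(1, 1.0)],
    1: [(0, 1.0), (2, 2.0)],
    2: [(1, 2.0), (3, 3.0)],
    3: [(2, 3.0)],
}
coords = bourgain_embedding(adj)
print(len(coords[0]))  # number of coordinates: levels * reps
```

Each coordinate u ↦ d(u, B_i) changes by at most d(u, v) between two nodes u and v, so no single coordinate expands distances; the O(log n) distortion comes from how much the coordinates can jointly contract.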

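  20. Summary of Main Lemma
     • Chain of embeddings (distortion of each step in brackets):
       TEMD over n sets A_i
       → [O(log^2 n)] min^{low} ℓ1^{high}
       → [O(1)] min^{low} ℓ1^{low}
       → [O(log n)] min^{low} tree-metric
       → [O(log^3 n)] sparse graph-metric
       → [O(log n)] ℓ1^{low}
       (figure labels the upper part of the chain “oblivious” and the lower part “non-oblivious”)
     • Min-product helps to get low dimension (~small-size sketch)
       • bypasses the impossibility of dimension reduction in ℓ1
     • OK that it is not a metric, as long as it is close to a metric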

  21. Conclusion + a question
     • Theorem: can compute ed(x,y) in n·2^{Õ(√log n)} time with 2^{Õ(√log n)} approximation
     • Question: can we do the following “oblivious” dimensionality reduction in ℓ1?
       • Given n, construct a randomized embedding φ: ℓ1^M → ℓ1^{polylog n} such that for any v_1…v_n ∈ ℓ1^M, with high probability, φ has distortion log^{O(1)} n on these vectors
       • If φ exists, it cannot be linear [Charikar-Sahai’02]
