
Efficient Approximation of Edit Distance

Robert Krauthgamer, Weizmann Institute of Science. SPIRE 2013.



Presentation Transcript

  1. Efficient Approximation of Edit Distance
     Robert Krauthgamer, Weizmann Institute of Science. SPIRE 2013.

  2. Edit Distance (Levenshtein distance)
     Given two strings x ∈ Σ^n, y ∈ Σ^m:
     ed(x,y) = minimum number of character operations (insertion/deletion/substitution) that transform x into y.
     Examples: ed(banana, ananas) = 2; ed(00000, 1111) = 5
     Applications:
     • Computational Biology
     • Text processing
     • Web search
     For simplicity: m = n.

  3. Dynamic Programming Algorithm
     • Compute ed(x,y) for input x,y ∈ Σ^n
     • O(n^2) time by dynamic programming [WF’74]
     • O(n^2 / log^2 n) time when |Σ| = O(1) [MP’80]
     D(i,j) = ed( x[1:i], y[1:j] )
     D(i,j) = min of:
        D(i-1, j-1)       if x[i]=y[j]  (D(i-1, j-1) + 1 otherwise, for a substitution)
        D(i-1, j) + 1
        D(i, j-1) + 1
     Example table of D values:
        1 1 2 3 4 5
        2 2 1 2 3 4
        3 2 2 1 2 3
        4 3 2 2 1 2
        5 4 3 2 2 1
        6 5 4 3 3 2
     Faster algorithms?
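The quadratic-time dynamic program on this slide can be sketched in a few lines of Python. This is a minimal illustration rather than the speaker's code: the function name `edit_distance` is an assumption, and the table is kept one row at a time to save memory.

```python
def edit_distance(x: str, y: str) -> int:
    """Levenshtein distance via the classic O(|x| * |y|) dynamic program."""
    n, m = len(x), len(y)
    # prev[j] holds D(i-1, j); row 0 is D(0, j) = j (insert j characters).
    prev = list(range(m + 1))
    for i in range(1, n + 1):
        cur = [i] + [0] * m  # D(i, 0) = i (delete i characters)
        for j in range(1, m + 1):
            # Match costs 0, substitution costs 1.
            diagonal = prev[j - 1] + (x[i - 1] != y[j - 1])
            cur[j] = min(diagonal, prev[j] + 1, cur[j - 1] + 1)
        prev = cur
    return prev[m]


if __name__ == "__main__":
    # The two examples from the definition slide.
    print(edit_distance("banana", "ananas"))  # 2
    print(edit_distance("00000", "1111"))     # 5
```

Keeping only two rows does not change the O(n^2) running time; it only reduces the space to O(n).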

  4. Focus of This Talk
     • Approximating edit distance
        • Multiplicatively: ed(x,y) ≤ output ≤ A·ed(x,y)
        • Decision version: ed(x,y) ≤ r or ed(x,y) > A·r
     • Different computational models
        • RAM, Sampling and query complexity, Sketching, (Streaming)
        • Interactions (is it surprising?), Techniques
     • Variants of the problem

  5. RAM Model: Sampling
     • Idea 1: quickly estimate ed(x,y) by sampling a few positions
     • Intuition:
        • If ed(x,y) is small, then “many” large blocks should “match”
        • “Test” this by reading few (randomly chosen) blocks
        • Apply this idea recursively (inside blocks)
     • Theorem [Batu-Ergun-Kilian-Magen-Raskhodnikova-Rubinfeld-Sami’03]: Factor n^c “weak” approximation in sublinear time.
     • Obstacles:
        • “Block match” means both “similar pattern” and “similar location”
        • Argue that if and only if ed(x,y) is small then …
        • Can only distinguish ed(x,y) ≤ n/(8A) from ed(x,y) > n/8.
     Best approximation in (near) linear time?

  6. Learn from Past Success
     • Suppose x,y are permutations
        • Every symbol of Σ appears exactly once
     • Consider transpositions = block moves (“block edit distance”)
        • No insert/delete (unreasonable), no substitution (not needed)
        • Example: bed(0123456789, 0457689123) = 2
     • An easy estimate (based on breakpoints)
        • Compute S_x = {all length-2 substrings of x} = {x[i:i+1] | i=1,…,n-1}
        • Lemma: bed(x,y) ≤ ½ |S_x Δ S_y| ≤ 3 bed(x,y)
        • Proof idea: Fix x (wlog the identity) and consider y
           • Each block move “creates” at most 3 new breakpoints
           • Break y at breakpoints, and move (rearrange) the blocks to get x
        • Can compute |S_x Δ S_y| in linear time!
     • Best approximation known in poly-time is 1.375 [Elias-Hartman’06]
     Open: better approximation in linear time?
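The breakpoint estimate on this slide can be computed directly with Python sets. A minimal sketch: the helper names `adjacent_pairs` and `breakpoint_estimate` are illustrative assumptions, and hashing makes the running time linear in expectation.

```python
def adjacent_pairs(x):
    """S_x: the set of ordered pairs of consecutive symbols of x."""
    return {(x[i], x[i + 1]) for i in range(len(x) - 1)}


def breakpoint_estimate(x, y):
    """Return (1/2) * |S_x symmetric-difference S_y|."""
    return len(adjacent_pairs(x) ^ adjacent_pairs(y)) / 2


if __name__ == "__main__":
    # The slide's example, where bed = 2.
    est = breakpoint_estimate("0123456789", "0457689123")
    print(est)  # 5.0
```

On this example the estimate is 5, which sits between bed = 2 and 3·bed = 6, as the lemma requires.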

  7. Reduction to Hamming Distance
     • |S_x Δ S_y| = Hamming distance between their characteristic vectors
        • In fact, each vector has |Σ|^2 = n^2 coordinates, but only n-1 are non-zero
     • We thus obtain f: Permutations → {0,1}^(n^2) such that ∀x,y, bed(x,y) ≤ ½ ||f(x)-f(y)||_1 ≤ 3 bed(x,y).
     • Such a reduction from one metric space (BED on permutations) to another (L1) is called an embedding. This one has distortion D=3.
     • Known lower bound: distortion into L1 must be ≥ 4/3 [Polak-K.’12]
     A sweet spot of fruitful interaction between Math/Geometry (“comparing” metric spaces using embeddings) and CS/Algorithms (solving new problems by “reducing” to old ones).
     More benefits of “good” embeddings?

  8. Sketching Model
     • Idea: “summarize” each string separately, then estimate ed(x,y) only from the short sketches s(x), s(y).
     • Possible at all?
     • YES for Hamming distance, and even L1/L2 [Indyk-Motwani’98, Kushilevitz-Ostrovsky-Rabani’00]
        • Approximation factor A=1+ε using sketch size O(ε^-2) bits
        • It’s essentially a “dimension reduction” [Johnson-Lindenstrauss’84]
        • Achieved by projection on (inner product with) a random direction in space
     • Consequently, YES also for block edit distance on permutations:
        • Applies whenever there is an embedding into L1!
     Pipeline: BED on permutations → (embedding f, distortion D=3) → Hamming → (sketch s, approximation 1+ε) → O(ε^-2) bits
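To make the “inner product with a random direction” idea concrete for Hamming distance, here is a minimal parity sketch in the spirit of the cited results; it is a simplified illustration, not the exact construction from those papers. Each sketch bit is the GF(2) inner product of the input with a random sparse 0/1 vector that includes each coordinate with probability 1/(2r). For inputs at Hamming distance d, a given sketch bit differs with probability ½(1 − (1 − 1/r)^d), which grows with d, so the fraction of differing bits separates d ≤ r from d > (1+ε)r once the number of bits k is around ε^-2. The names `make_parity_sketch` and `differing_fraction`, and the parameters `n`, `r`, `k`, `seed`, are assumptions made for this example.

```python
import random


def make_parity_sketch(n, r, k, seed=0):
    """Build a k-bit sketch function for 0/1 vectors of length n, tuned to threshold r."""
    rng = random.Random(seed)  # shared randomness: both strings must use the same subsets
    p = 1.0 / (2 * r)
    subsets = [[i for i in range(n) if rng.random() < p] for _ in range(k)]

    def sketch(v):
        # Each bit is the parity of v restricted to one random subset.
        return [sum(v[i] for i in s) % 2 for s in subsets]

    return sketch


def differing_fraction(s1, s2):
    """Fraction of sketch bits on which two sketches disagree."""
    return sum(a != b for a, b in zip(s1, s2)) / len(s1)


if __name__ == "__main__":
    n, r, k = 200, 10, 400
    sketch = make_parity_sketch(n, r, k)
    x = [0] * n
    near = [1] * 5 + [0] * (n - 5)    # Hamming distance 5 from x
    far = [1] * 60 + [0] * (n - 60)   # Hamming distance 60 from x
    print(differing_fraction(sketch(x), sketch(near)))
    print(differing_fraction(sketch(x), sketch(far)))
```

The sketch is linear over GF(2), so the sketch of x XOR y equals the XOR of the two sketches; this is why only the Hamming distance between the inputs affects the comparison.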

  9. Applications of Sketching
     • Input: large database M, with |M| strings of length n each.
     • Output: all pairwise distances or closest pair (BED on permutations)
     • Naively: in time O(|M|^2 n)
     • Sketching [3+ε approx., decision version]: sketch each string, then estimate all pairs in time O(|M| n + |M|^2/ε^2)
     • Practical viewpoint: filtering, i.e., fast pruning of “bad” pairs
     • Works similarly for Nearest Neighbor Search (NNS):
        • Reduce NNS for permutations under BED to NNS for Hamming (L1)
     Q1. More embeddings? Q2. Sketching directly? Q3. Lower bounds?

  10. Embedding ED on Permutations
     Theorem [Charikar-K.’06]: Edit distance on permutations of length n embeds into L1 with distortion O(log n).
     Proof. Define f(P) = (f_{a,b}(P)) over pairs of symbols a,b, where f_{a,b}(P) = 1 / (P^-1[b] − P^-1[a]).
     Intuition:
     • sign(f_{a,b}(P)) is an indicator for “a appears before b” in P
     • Thus, |f_{a,b}(P) − f_{a,b}(Q)| “measures” if {a,b} is an inversion in P vs. Q
     Lemma 1: ||f(P)−f(Q)||_1 ≤ O(log n) ed(P,Q)
     • Suppose Q is obtained from P by moving one symbol, say ‘s’
        • General case then follows by applying the triangle inequality on P, P’, P’’, …, Q
     • Total contribution of
        • coordinates with s ∈ {a,b} is 2 Σ_k (1/k) ≤ O(log n)
        • other coordinates is Σ_k k(1/k − 1/(k+1)) ≤ O(log n)
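The embedding can be tried out on small permutations. The sketch below assumes the coordinates are f_{a,b}(P) = 1 / (P^-1[b] − P^-1[a]), one per unordered pair a < b; this form is an assumption chosen to match the slide's intuition (the sign records the relative order, the magnitude decays with the distance between the two symbols). The names `embed` and `l1_distance` are illustrative. Exact rational arithmetic avoids rounding noise.

```python
from fractions import Fraction
from itertools import combinations


def embed(perm):
    """Map a permutation to the vector (f_{a,b}) indexed by pairs a < b."""
    pos = {symbol: i for i, symbol in enumerate(perm)}  # pos[s] = P^-1[s]
    return {
        (a, b): Fraction(1, pos[b] - pos[a])
        for a, b in combinations(sorted(perm), 2)
    }


def l1_distance(fp, fq):
    """L1 distance between two embedded permutations over the same symbols."""
    return sum(abs(fp[key] - fq[key]) for key in fp)


if __name__ == "__main__":
    p = [0, 1, 2]
    q = [1, 0, 2]  # one symbol moved
    print(l1_distance(embed(p), embed(q)))  # 3
```

In this tiny case the inverted pair (0,1) contributes 2 and the other two pairs contribute ½ each, for a total of 3.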

  11. Embedding ED on Permutations (2)
     Recall f(P) and its coordinates f_{a,b}(P) from the previous slide.
     Lemma 1: ||f(P)−f(Q)||_1 ≤ O(log n) ed(P,Q)
     Lemma 2: ||f(P)−f(Q)||_1 ≥ ½ ed(P,Q)
     • Assume wlog that P = identity
     • Edit Q into an increasing sequence (thus into P) using quicksort:
        • Choose a random pivot
        • Delete all characters inverted with respect to the pivot
        • Repeat recursively on left and right portions
        • Surviving subsequence is increasing ⇒ ed(P,Q) ≤ 2 #deletions
     • Now argue ||f(P)−f(Q)||_1 ≥ E[ #quicksort deletions ] ≥ ½ ed(P,Q)
        • For every inversion (a,b) in Q: Pr[a deleted “by” pivot b] ≤ 1/|Q^-1[a] − Q^-1[b] + 1| ≤ 2 |f_{a,b}(P) − f_{a,b}(Q)|
     • QED
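The quicksort-style editing procedure from the proof of Lemma 2 can be sketched as follows. It picks a random pivot, deletes every symbol that is inverted relative to the pivot, and recurses on both sides. The function name `quicksort_edit` and its return convention (surviving subsequence plus deletion count) are assumptions made for illustration.

```python
import random


def quicksort_edit(q, rng):
    """Return (surviving increasing subsequence of q, number of deleted symbols)."""
    if len(q) <= 1:
        return list(q), 0
    i = rng.randrange(len(q))
    pivot = q[i]
    # Keep only symbols that are NOT inverted with respect to the pivot:
    # smaller symbols to its left, larger symbols to its right.
    left = [v for v in q[:i] if v < pivot]
    right = [v for v in q[i + 1:] if v > pivot]
    deleted = len(q) - 1 - len(left) - len(right)
    kept_left, deleted_left = quicksort_edit(left, rng)
    kept_right, deleted_right = quicksort_edit(right, rng)
    return kept_left + [pivot] + kept_right, deleted + deleted_left + deleted_right


if __name__ == "__main__":
    rng = random.Random(0)
    q = [3, 0, 4, 1, 2, 6, 5]
    kept, deletions = quicksort_edit(q, rng)
    print(kept, deletions)
```

Every deleted symbol can be re-inserted at its sorted position, which is where the bound ed(P,Q) ≤ 2 #deletions comes from.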

  12. Embedding Edit Distance
     • Theorem [Ostrovsky-Rabani’05]: Edit distance on all strings (not only permutations) embeds into L1 with distortion 2^Õ(√log n).
        • Previously, distortion n^c was known [Bar-Yossef-Jayram-K.-Kumar’04, Batu-Ergun-Sahinalp’06]
        • Clever recursive method to match blocks much more accurately
        • Penalizes both pattern and location errors
        • Not very fast (quadratic time), but influenced later work on near-linear time algorithms [Andoni-Onak’09, Andoni-Onak-K.’10]
     • Immediate consequences:
        • NNS algorithms for edit distance
        • Sketching

  13. Lower Bounds
     • Theorem [Khot-Naor’05, K.-Rabani’06]: Embedding edit distance into L1 requires distortion Ω(log n)
        • Main technique: Fourier analysis [Kahn-Kalai-Linial’88]
        • L1 embedding ↔ sparsest cuts ↔ Boolean functions f: {0,1}^n → {0,1}
     • Stronger assertion: O(1)-size sketches for edit distance require Ω̃(log n) approximation, even only for permutations [Andoni-K.’06]
        • Actually a tradeoff between approximation and sketch size
        • Techniques: communication complexity and Fourier analysis reduce the problem to sketches that are linear functions (of their input x)
     Q2’. Sketching vs embedding?

  14. RAM Model: Asymmetric Sampling
     • Idea 1’: Read all of y, and sampled positions of x
     • Motivations:
        • Better chances to “obtain” information
        • Which y’s are easier/harder?
     • Sampling issues:
        • Focus on query complexity bounds (tight?)
        • Adaptive vs non-adaptive queries
        • Queries depend on y?
        • Use dynamic programming in time O(n^(1+ε))?

  15. Asymmetric Sampling Results [Andoni-Onak-K.’10]
     • Problem: Decide ed(x,y) ≥ n/10 vs ed(x,y) ≤ n/(10A)
     • Complexity = #queries into x (unlimited access to y)
     [Figure: #queries as a function of the approximation factor A. Axis marks for A: n^(1/(t+1)), n^(1/t−ε), n^(1/4), n^(1/3), n^(1/2−ε), n^(1/2), n^(1−ε). Region labels for #queries: Θ(log^t n), Θ(log^3 n), Θ(log^2 n), Θ(log n).]

  16. Overview of Upper Bound
     • Theorem 1: Can distinguish ed(x,y) ≥ n/10 vs ed(x,y) ≤ n/(10A) for A = (log n)^O(1/ε) approximation with n^ε queries into x (for any ε>0).
     • Proof structure:
        1. Characterize edit distance by “tree-distance” T_xy
           • Parameter b≥2 (degree)
           • T_xy ≈ ed(x,y) up to 6b*log n factor
        2. Prune the tree to subsample x
     [Figure: tree of degree b over positions x_1, x_2, …, x_n, with the sampled positions in x marked.]

  17. Step 1: Tree Distance
     • Partition x into b blocks, recursively, for h = log_b n levels
     • T_i(s,u) = tree-distance between x[s:s+ℓ_i] and y[u:u+ℓ_i], where ℓ_i is the block length at level i
     [Figure: x[1:n] split into x[1:⅓n], x[⅓n:⅔n], x[⅔n:n]; a block x[s:s+⅓n] is compared against y[u:u+⅓n] inside y[1:n].]

  18. Tree Distance: Recursive Definition
     Recall T_i(s,u) = tree-distance between x[s:s+ℓ_i] and y[u:u+ℓ_i]
     Base case: T_h(s,u) = Hamming(x[s], y[u])
     Output: T_xy = T_0(s=1, u=1)
     [Figure: block x[s:s+ℓ_i] in x matched against block y[u:u+ℓ_i] in y.]

  19. Tree Approximates Edit Distance
     • Lemma: T_xy ≈ ed(x,y) up to 6b*log_b n factor.
     • Hierarchical decomposition inspired by earlier approaches [BEKMRRS’03, OR’05]
        • All had approximation recurrence of the type A(n) = c*A(n/b) + b for c≥2
        • Solves to A(n) ≥ 2^√log n factor for every choice of b
     • Our characterization has no multiplicative loss (c=1): A(n) = A(n/b) + b
     • Analysis inspired by algorithms for smoothed instances [Andoni-K.’08]
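The two behaviours of the recurrence can be checked by unrolling it over h = log_b n levels. The derivation below is a sketch under the assumptions that A(1) ≥ 1 is a constant and that logarithms are base 2; these normalizations are not stated on the slide.

```latex
\begin{align*}
A(n) &= c\,A(n/b) + b
      \;=\; c^{h} A(1) + b \sum_{i=0}^{h-1} c^{i},
      \qquad h = \log_b n = \frac{\log n}{\log b}. \\[4pt]
\text{If } c = 1:\quad
A(n) &= A(1) + b\,h \;=\; O(b \log_b n). \\[4pt]
\text{If } c \ge 2:\quad
A(n) &\ge \max\{\,2^{h},\; b\,\}
      \;=\; \max\{\,2^{\log n/\log b},\; 2^{\log b}\,\}
      \;\ge\; 2^{\sqrt{\log n}} .
\end{align*}
```

The last inequality holds because either log b ≥ √(log n), or else log n / log b > √(log n); so with c ≥ 2 no choice of b avoids the 2^√log n loss, whereas c = 1 leaves only the additive b per level.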

  20. Step 2: Compute the Tree Distance
     • For b=2, tree-distance gives O(log n) approximation!
        • BUT we know only how to compute T-distance in Õ(n^2) time
     • Instead, for b = (log n)^(1/ε), can prune the tree to n^O(ε) nodes, and approximate T-distance within factor 1+ε
     • Pruning: subsample (log n)^O(1) children out of each node
        • Works only when ed(x,y) ≥ Ω(n)
        • Generally, must subsample the tree non-uniformly, using the Precision Sampling Lemma

  21. Key tool: non-uniform sampling
     Goal: For unknown a_1, a_2, …, a_n ∈ [0,1], estimate their sum, up to an additive constant error, using only “weak” estimates ã_1, ã_2, …, ã_n.
     Game between a Sum Estimator and an Adversary:
        0. Sum Estimator: fix distribution U
        1. Adversary: fix a_1, a_2, …, a_n (unknown)
        2. Sum Estimator: pick “precisions” u_i (our algorithm: u_i ~ U[0,1] i.i.d.)
        3. Adversary: provide ã_1, ã_2, …, ã_n s.t. |a_i − ã_i| < 1/u_i
        4. Sum Estimator: report S̃ = S̃(ã_1, …, u_1, …) with |S̃ − ∑a_i| < 1.

  22. Precision Sampling
     Goal: estimate ∑a_i from {ã_i} s.t. |a_i − ã_i| < 1/u_i.
     Precision Sampling Lemma: Can achieve WHP additive error 1 and multiplicative error 1.5 with expected precision E_{u_i~U}[u_i] = O(log n).
     • Inspired by a technique from [IW’05] for streaming (F_k moments)
     • In fact, PSL gives simple & improved algorithms for F_k moments, cascaded (mixed) norms, ℓ_p-sampling problems [AKO’11]
     • Also a distant relative of Priority Sampling [DLT’07]

  23. Precision Sampling for Edit Distance
     • Apply Precision Sampling to the tree from the characterization, recursively at each node
     • If a node has very weak precision, can trim the entire sub-tree

  24. Fast Approximation Algorithm
     • Theorem [Andoni-Onak-K.’10]: Can approximate ed(x,y) within factor (log n)^O(1/ε) using n^ε queries to x and in time n^(1+ε) (for any ε>0).
        • Exponential improvement over previous factor 2^Õ(√log n) [Andoni-Onak’09]
        • Asymmetric sampling approach, implemented faster by data structure tricks
        • Sampling is non-adaptive, independent of y

  25. Smoothed Instances
     • Smooth instance (x,y) constructed by:
        • Start with arbitrary x*, y* ∈ {0,1}^n and their optimal alignment A*
        • Replace each position w/probability p by a random bit, but respect A*
     • Theorem [Andoni-K.’08]: Can approximate ed(x,y) within constant factor, in smoothed runtime that is (whp) near-linear n^(1+ε).
        • Some extensions to sublinear time
     • Techniques:
        • Match blocks of length L = O((1/p)·log n) that have edit distance ≤ εL.
        • A known heuristic technique (e.g. PatternHunter)
        • To find block matches quickly, we use a naive NNS algorithm
        • Because of smoothing, blocks are likely to be distinct (and even far), so modulo overlaps between blocks, we “effectively” have permutations
     Open: Better time n·polylog(n)? Approximation independent of p?

  26. Variants of Edit Distance
     • Edit distance with block operations
        • Admits O(log n · log* n) approximation in near-linear time, via embedding into L1 [Muthukrishnan-Sahinalp’00, Cormode-Muthukrishnan’02]
        • Open: Distortion lower bounds? Better approximation in polytime?
     • Edit distance between trees (generalizes strings)
        • Basic operations: insert/delete/relabel vertex
        • Can be computed in O(n^3) time [Demaine-Mozes-Rossman-Weimann’07]
        • Open: Embedding?
     • Edit distance with “rich” alphabet
        • Can model shape matching [Klein-Tirthapura-Sharvit-Kimia’00]
        • Challenge: Cost of basic operation varies with symbols

  27. Conclusion
     • Having multiple computational models is fruitful
        • New ideas, techniques, viewpoints, applications (can come full circle)
        • Lower bounds (in certain models) highlight limitations of methods
        • Explore which instances are easy/hard
        • “Asymmetric algorithms” can work well for symmetric problems
     • Connections to other fields (sampling, embeddings, communication complexity, Fourier analysis) and computational problems (NNS)
     • Much progress so far, but still many gaps, and much more to go
     Thank You!
