Efficient Approximation of Edit Distance
Robert Krauthgamer, Weizmann Institute of Science
SPIRE 2013
Edit Distance (Levenshtein distance)
Given two strings x ∈ Σ^n, y ∈ Σ^m:
ed(x,y) = minimum number of character operations (insertion/deletion/substitution) that transform x into y.
Examples:
• ed(banana, ananas) = 2
• ed(00000, 1111) = 5
Applications:
• Computational biology
• Text processing
• Web search [image: generic search engine]
For simplicity: m = n.
Dynamic Programming Algorithm
• Compute ed(x,y) for input x,y ∈ Σ^n
  • O(n^2) time by dynamic programming [WF’74]
  • O(n^2/log^2 n) time when |Σ| ≤ O(1) [MP’80]
D(i,j) = ed( x[1:i], y[1:j] )
D(i,j) = min of:
  D(i-1, j-1)       if x[i] = y[j]
  D(i-1, j-1) + 1   if x[i] ≠ y[j] (substitution)
  D(i-1, j) + 1
  D(i, j-1) + 1
Example table of D values (as shown on the slide):
  1 1 2 3 4 5
  2 2 1 2 3 4
  3 2 2 1 2 3
  4 3 2 2 1 2
  5 4 3 2 2 1
  6 5 4 3 3 2
Faster algorithms?
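As an illustration of the quadratic-time dynamic program, here is a minimal Python sketch. The function name `edit_distance` and the two-row memory layout are choices made for this example, not taken from the slides; the recurrence is the standard one for insertion/deletion/substitution.

```python
def edit_distance(x: str, y: str) -> int:
    """Levenshtein distance via the O(n*m) dynamic program, keeping two rows."""
    prev = list(range(len(y) + 1))  # D(0, j) = j
    for i, xi in enumerate(x, start=1):
        curr = [i]  # D(i, 0) = i
        for j, yj in enumerate(y, start=1):
            curr.append(min(
                prev[j - 1] + (xi != yj),  # match (cost 0) or substitution (cost 1)
                prev[j] + 1,               # delete x[i]
                curr[j - 1] + 1,           # insert y[j]
            ))
        prev = curr
    return prev[-1]


print(edit_distance("banana", "ananas"))  # 2
print(edit_distance("00000", "1111"))     # 5
```

The two printed values reproduce the examples from the definition slide.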
Focus of This Talk
• Approximating edit distance
  • Multiplicatively: ed(x,y) ≤ output ≤ A·ed(x,y)
  • Decision version: ed(x,y) ≤ r or ed(x,y) > A·r
• Different computational models
  • RAM, sampling and query complexity, sketching, (streaming)
• Interactions (is it surprising?), techniques
• Variants of the problem
RAM Model: Sampling
• Idea 1: quickly estimate ed(x,y) by sampling a few positions
• Intuition:
  • If ed(x,y) is small, then “many” large blocks should “match”
  • “Test” this by reading few (randomly chosen) blocks
  • Apply this idea recursively (inside blocks)
• Theorem [Batu-Ergun-Kilian-Magen-Raskhodnikova-Rubinfeld-Sami’03]: Factor n^c “weak” approximation in sublinear time.
• Obstacles:
  • “Block match” means both “similar pattern” and “similar location”
  • Argue that if and only if ed(x,y) is small then …
  • Can only distinguish ed(x,y) ≤ n/(8A) from ed(x,y) > n/8.
Best approximation in (near) linear time?
Learn from Past Success
• Suppose x,y are permutations
  • Every symbol of Σ appears exactly once
• Consider transpositions = block moves (“block edit distance”)
  • No insert/delete (unreasonable), no substitution (not needed)
  • Example: bed(0123456789, 0457689123) = 2
• An easy estimate (based on breakpoints)
  • Compute S_x = {all length-2 substrings of x} = {x[i:i+1] | i=1,…,n-1}
  • Lemma: bed(x,y) ≤ ½ |S_x Δ S_y| ≤ 3 bed(x,y)
  • Proof idea: Fix x (wlog identity), let y = [figure: y drawn as blocks A B C D]
    • Each block move “creates” at most 3 new breakpoints
    • Break y at breakpoints, and move (rearrange) the blocks to get x
  • Can compute |S_x Δ S_y| in linear time!!
• Best approximation known in poly-time is 1.375 [Elias-Hartman’06]
Open: better approximation in linear-time?
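The breakpoint estimate is easy to try on the slide's example. The sketch below is illustrative only: the helper names `adjacent_pairs` and `breakpoint_estimate` are made up for this example, and hash sets are used so that the computation takes expected linear time.

```python
def adjacent_pairs(x):
    """S_x: the set of all length-2 substrings (ordered adjacent pairs) of x."""
    return {(x[i], x[i + 1]) for i in range(len(x) - 1)}


def breakpoint_estimate(x, y):
    """Return |S_x Δ S_y| / 2, which the lemma places between bed and 3*bed."""
    sx, sy = adjacent_pairs(x), adjacent_pairs(y)
    return len(sx ^ sy) / 2  # ^ is symmetric difference on Python sets


est = breakpoint_estimate("0123456789", "0457689123")
print(est)  # 5.0
```

For this pair bed = 2, so the estimate 5 falls inside the interval [2, 6] promised by the lemma.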
Reduction to Hamming Distance
• |S_x Δ S_y| = Hamming distance between their characteristic vectors
  • In fact, each vector has |Σ|^2 = n^2 coordinates, but only n-1 are non-zero
• We thus obtain f: Permutations → {0,1}^(n^2) such that ∀x,y, bed(x,y) ≤ ½ ||f(x)-f(y)||_1 ≤ 3 bed(x,y).
• Such a reduction from one metric space (BED on permutations) to another (L1) is called an embedding. This one has distortion D=3.
• Known lower bound: distortion into L1 must be ≥ 4/3 [Polak-K.’12]
A sweet spot of fruitful interaction between Math/Geometry (“comparing” metric spaces using embeddings) and CS/Algorithms (solving new problems by “reducing” to old ones)
More benefits of “good” embeddings?
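To make the embedding concrete, the following sketch builds the characteristic vector explicitly for the permutation pair from the block-edit-distance example. The coordinate ordering (row-major over pairs of symbols) and the names `characteristic_vector` and `l1_distance` are assumptions of this example; a practical implementation would keep the vector sparse instead of materializing all n^2 coordinates.

```python
def characteristic_vector(perm, alphabet):
    """f(x) in {0,1}^(n^2): coordinate (a, b) is 1 iff 'ab' is a substring of x."""
    index = {s: k for k, s in enumerate(alphabet)}
    n = len(alphabet)
    vec = [0] * (n * n)
    for a, b in zip(perm, perm[1:]):
        vec[index[a] * n + index[b]] = 1
    return vec


def l1_distance(u, v):
    """L1 distance; on 0/1 vectors this equals the Hamming distance."""
    return sum(abs(p - q) for p, q in zip(u, v))


alphabet = "0123456789"
fx = characteristic_vector("0123456789", alphabet)
fy = characteristic_vector("0457689123", alphabet)
print(len(fx), sum(fx))          # 100 9
print(l1_distance(fx, fy) / 2)   # 5.0
```

The vector has n^2 = 100 coordinates of which n-1 = 9 are non-zero, and half the L1 distance is 5, between bed = 2 and 3·bed = 6.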
Sketching Model
• Idea: “summarize” each string separately, then estimate ed(x,y) only from the short sketches s(x), s(y).
• Possible at all??
• YES for Hamming distance, and even L1/L2 [Indyk-Motwani’98, Kushilevitz-Ostrovsky-Rabani’00]
  • Approximation factor A=1+ε using sketch size O(ε^-2) bits
  • It’s essentially a “dimension reduction” [Johnson-Lindenstrauss’86]
  • Achieved by projection on (inner product with) random direction in space
• Consequently, YES also for block edit distance on permutations:
  • Applies whenever there is an embedding into L1!!
[Diagram: BED on permutations → (embedding f, distortion D=3) → Hamming → (sketch s, approximation 1+ε) → O(ε^-2)-bit sketch]
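One way to picture a Hamming-distance sketch for the decision version is with random parity bits, in the spirit of the random-projection idea on the slide. This is a simplified sketch under stated assumptions, not the construction from the cited papers: the threshold `r`, the number of bits `k`, the sampling probability 1/(2r), and all function names are illustrative. Each sketch bit is the parity of the input restricted to a shared random subset; two sketch bits differ with probability (1 - (1 - 1/r)^d)/2, where d is the Hamming distance, so the fraction of differing bits separates small d from large d.

```python
import random


def make_masks(n, r, k, seed=0):
    """Shared randomness: k random subsets of [n], each index kept w.p. 1/(2r)."""
    rng = random.Random(seed)
    p = 1.0 / (2 * r)
    return [[i for i in range(n) if rng.random() < p] for _ in range(k)]


def sketch(bits, masks):
    """k-bit sketch: parity of the input on each random subset."""
    return [sum(bits[i] for i in mask) % 2 for mask in masks]


def differing_fraction(s1, s2):
    """Fraction of sketch positions where the two sketches disagree."""
    return sum(a != b for a, b in zip(s1, s2)) / len(s1)


def expected_fraction(d, r):
    """Probability that one sketch bit differs when the Hamming distance is d."""
    return (1 - (1 - 1 / r) ** d) / 2
```

Both strings must be sketched with the same `masks`. Because parity is linear over GF(2), XOR-ing two sketches gives the sketch of the XOR of the inputs, which is why the number of differing sketch bits depends only on the positions where the inputs differ.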
Applications of Sketching
• Input: large database M, with |M| strings of length n each.
• Output: all pairwise distances or closest pair (BED on perm)
• Naively: in time O(|M|^2 n)
• Sketching [3+ε approx., decision version]: sketch each string, then estimate all pairs in time O(|M|n + |M|^2/ε^2)
• Practical viewpoint: filtering, i.e., fast pruning of “bad” pairs
• Works similarly for Nearest Neighbor Search (NNS):
  • Reduce NNS for permutations under BED, to NNS for Hamming (L1)
Q1. More embeddings? Q2. Sketching directly? Q3. Lower bounds?
Embedding ED on Permutations
Theorem [Charikar-K.’06]: Edit distance on permutations of length n embeds into L1 with distortion O(log n).
Proof. Define f(P) with coordinates f_{a,b}(P) [defining formulas missing from the transcript].
Intuition:
• sign(f_{a,b}(P)) is an indicator for “a appears before b” in P
• Thus, |f_{a,b}(P) - f_{a,b}(Q)| “measures” whether {a,b} is an inversion in P vs. Q
Lemma 1: ||f(P)-f(Q)||_1 ≤ O(log n)·ed(P,Q)
• Suppose Q is obtained from P by moving one symbol, say ‘s’
  • General case then follows by applying the triangle inequality on P, P’, P’’, …, Q
• Total contribution of
  • coordinates with s ∈ {a,b} is 2 Σ_k (1/k) ≤ O(log n)
  • other coordinates is Σ_k k(1/k - 1/(k+1)) ≤ O(log n)
Embedding ED on Permutations (2)
Recall f(P) and its coordinates f_{a,b}(P) [defining formulas missing from the transcript].
Lemma 1: ||f(P)-f(Q)||_1 ≤ O(log n)·ed(P,Q)
Lemma 2: ||f(P)-f(Q)||_1 ≥ ½ ed(P,Q)
• Assume wlog that P = identity
• Edit Q into an increasing sequence (thus into P) using quicksort:
  • Choose a random pivot
  • Delete all characters inverted wrt the pivot
  • Repeat recursively on left and right portions
  • Surviving subsequence is increasing, so ed(P,Q) ≤ 2 #deletions
• Now argue ||f(P)-f(Q)||_1 ≥ E[#quicksort deletions] ≥ ½ ed(P,Q)
  • For every inversion (a,b) in Q: Pr[a deleted “by” pivot b] ≤ 1/|Q^-1[a] - Q^-1[b] + 1| ≤ 2 |f_{a,b}(P) - f_{a,b}(Q)|
• QED
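The quicksort-style editing procedure from the proof of Lemma 2 can be simulated directly. This is a minimal sketch, assuming Q is given as a list of distinct comparable symbols and P is the sorted order; the function name `quicksort_deletions` and the explicit `rng` argument are choices made for this example.

```python
import random


def quicksort_deletions(q, rng):
    """Return (survivors, deletions) for the random pivot-deletion process on q."""
    if len(q) <= 1:
        return list(q), 0
    p = rng.randrange(len(q))  # position of a uniformly random pivot
    pivot = q[p]
    # A symbol is inverted wrt the pivot if it is left of it but larger,
    # or right of it but smaller; those symbols are deleted.
    left = [v for v in q[:p] if v < pivot]
    right = [v for v in q[p + 1:] if v > pivot]
    deleted = (len(q) - 1) - len(left) - len(right)
    left_surv, left_del = quicksort_deletions(left, rng)
    right_surv, right_del = quicksort_deletions(right, rng)
    return left_surv + [pivot] + right_surv, deleted + left_del + right_del


rng = random.Random(0)
q = [0, 4, 5, 7, 6, 8, 9, 1, 2, 3]
survivors, deletions = quicksort_deletions(q, rng)
print(survivors, deletions)
```

The survivors always form an increasing subsequence of Q, so re-inserting each deleted symbol in its sorted position turns Q into the identity with at most 2·deletions edit operations, as stated on the slide.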
Embedding Edit Distance
• Theorem [Ostrovsky-Rabani’05]: Edit distance on all strings (not only permutations) embeds into L1 with distortion 2^Õ(√log n).
  • Previously, distortion n^c was known [Bar-Yossef-Jayram-K.-Kumar’04, Batu-Ergun-Sahinalp’06]
  • Clever recursive method to match blocks much more accurately
  • Penalizes both pattern and location errors
  • Not very fast (quadratic time), but influenced later work on near-linear time algorithms [Andoni-Onak’09, Andoni-Onak-K.’10]
• Immediate consequences:
  • NNS algorithms for edit distance
  • Sketching
Lower Bounds
• Theorem [Khot-Naor’05, K.-Rabani’06]: Embedding edit distance into L1 requires distortion Ω(log n)
  • Main technique: Fourier analysis [Kahn-Kalai-Linial’88]
  • L1 embedding ↔ sparsest-cuts ↔ Boolean functions f: {0,1}^n → {0,1}
• Stronger assertion: O(1)-size sketches for edit distance require Ω̃(log n) approximation, even only for permutations [Andoni-K.’06]
  • Actually a tradeoff between approximation and sketch size
  • Techniques: communication complexity and Fourier analysis reduce the problem to sketches that are linear functions (of their input x)
Q2’. Sketching vs embedding?
RAM Model: Asymmetric Sampling
• Idea 1’: Read all of y, and sampled positions of x
• Motivations:
  • Better chances to “obtain” information
  • Which y’s are easier/harder?
• Sampling issues:
  • Focus on query complexity bounds (tight?)
  • Adaptive vs non-adaptive queries
  • Queries depend on y?
  • Use dynamic programming in time O(n^(1+ε))?
[Figure: the strings y and x]
Asymmetric Sampling Results [Andoni-Onak-K.’10]
• Problem: Decide ed(x,y) ≥ n/10 vs ed(x,y) ≤ n/(10A)
• Complexity = #queries into x (unlimited access to y)
[Figure: #queries as a function of the approximation factor A. Labels on the A axis: n^(1/(t+1)), n^(1/t-ε), …, n^(1/4), n^(1/3), n^(1/2-ε), n^(1/2), n^(1-ε). Query-complexity labels: Θ(log^t n), …, Θ(log^3 n), Θ(log^2 n), Θ(log n).]
Overview of Upper Bound
• Theorem 1: Can distinguish ed(x,y) ≥ n/10 vs ed(x,y) ≤ n/(10A) for A = (log n)^O(1/ε) approximation with n^ε queries into x (for any ε>0).
• Proof structure:
  1. Characterize edit distance by “tree-distance” T_xy
    • Parameter b ≥ 2 (degree)
    • T_xy ≈ ed(x,y) up to 6b·log n factor
  2. Prune the tree to subsample x
[Figure: tree of degree b over x1 x2 … xn, with the sampled positions in x marked]
Step 1: Tree Distance
• Partition x into b blocks, recursively, for h = log_b n levels
[Figure: x[1:n] split into x[1:⅓n], x[⅓n:⅔n], x[⅔n:n], down to leaves x[1], x[2], x[3], …; a block x[s:s+⅓n] is compared with a block y[u:u+⅓n] of y[1:n]]
• T_i(s,u) = tree-distance between x[s:s+ℓ_i] and y[u:u+ℓ_i], where ℓ_i is the block length at level i
Tree Distance: Recursive Definition
Recall T_i(s,u) = tree-distance between x[s:s+ℓ_i] and y[u:u+ℓ_i]
[Recursive formula missing from the transcript]
Base case: T_h(s,u) = Hamming(x[s], y[u])
Output: T_xy = T_0(s=1, u=1)
[Figure labels: x[s:s+ℓ_i] within x, r_0, y[u:u+ℓ_i] within y]
Tree Approximates Edit Distance
• Lemma: T_xy ≈ ed(x,y) up to 6b·log_b n factor.
• Hierarchical decomposition inspired by earlier approaches [BEKMRRS’03, OR’05]
  • All had approximation recurrence of the type A(n) = c·A(n/b) + b for c ≥ 2
  • Solves to A(n) ≥ 2^√log n factor for every choice of b
• Our characterization has no multiplicative loss (c=1): A(n) = A(n/b) + b
• Analysis inspired by algorithms for smoothed instances [Andoni-K.’08]
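The effect of the multiplicative constant c can be seen by unrolling the recurrence. The derivation below is a sketch under assumed normalizations that are not stated on the slides: base case A(1) = 1, n ≥ b, exactly h = log_b n levels, and logarithms to base 2.

```latex
\begin{align*}
A(n) &= c\,A(n/b) + b
      \;=\; c^{h}\,A(1) + b\sum_{i=0}^{h-1} c^{i},
      \qquad h = \log_b n = \frac{\log n}{\log b}. \\[4pt]
\text{For } c \ge 2:\quad
A(n) &\ge \max\bigl(b,\; c^{h}\bigr)
      \;\ge\; \max\bigl(2^{\log b},\; 2^{\log n/\log b}\bigr)
      \;\ge\; 2^{\sqrt{\log n}}, \\[4pt]
\text{For } c = 1:\quad
A(n) &= 1 + b\,h \;=\; 1 + b\log_b n \;=\; O(b \log_b n).
\end{align*}
```

The last inequality for c ≥ 2 holds for every b: if log b ≥ √(log n) the first term is already at least 2^√(log n), and otherwise log n / log b > √(log n), so the second term is. With c = 1 the loss is only additive per level, which matches the 6b·log_b n factor in the lemma up to the constant.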
Step 2: Compute the Tree Distance
• For b=2, tree-distance gives O(log n) approximation!
  • BUT know only how to compute T-distance in Õ(n^2) time
• Instead, for b = (log n)^(1/ε), can prune the tree to n^O(ε) nodes, and approximate T-distance within factor 1+ε
• Pruning: subsample (log n)^O(1) children out of each node
  • Works only when ed(x,y) ≥ Ω(n)
  • Generally, must subsample the tree non-uniformly, using the Precision Sampling Lemma
[Figure: tree of degree b, with the sampled positions in x marked]
Key tool: non-uniform sampling
Goal: For unknown a_1, a_2, …, a_n ∈ [0,1], estimate their sum, up to an additive constant error, using only “weak” estimates ã_1, ã_2, …, ã_n.
Game between a Sum Estimator and an Adversary:
0. (Sum Estimator) fix distribution U
1. (Adversary) fix a_1, a_2, …, a_n (unknown)
2. (Sum Estimator) pick “precisions” u_i (our algorithm: u_i ~ U[0,1] i.i.d.)
3. (Adversary) provide ã_1, ã_2, …, ã_n s.t. |a_i - ã_i| < 1/u_i
4. (Sum Estimator) report S̃ = S̃(ã_1,…,u_1,…) with |S̃ - Σ a_i| < 1.
Precision Sampling
Goal: estimate Σ a_i from {ã_i} s.t. |a_i - ã_i| < 1/u_i.
Precision Sampling Lemma: Can achieve WHP additive error 1 and multiplicative error 1.5 with expected precision E_{u_i~U}[u_i] = O(log n).
• Inspired by a technique from [IW’05] for streaming (F_k moments)
• In fact, PSL gives simple & improved algorithms for F_k moments, cascaded (mixed) norms, ℓ_p-sampling problems [AKO’11]
• Also a distant relative of Priority Sampling [DLT’07]
Precision Sampling for Edit Distance
• Apply Precision Sampling to the tree from the characterization, recursively at each node
• If a node has very weak precision, can trim the entire sub-tree
Fast Approximation Algorithm
• Theorem [Andoni-Onak-K.’10]: Can approximate ed(x,y) within factor (log n)^O(1/ε) using n^ε queries to x and in time n^(1+ε) (for any ε>0).
  • Exponential improvement over previous factor 2^Õ(√log n) [Andoni-Onak’09]
  • Asymmetric sampling approach, implemented faster by data structure tricks
  • Sampling is non-adaptive, independent of y
Smoothed Instances
• Smooth instance (x,y) constructed by:
  • Start with arbitrary x*, y* ∈ {0,1}^n and their optimal alignment A*
  • Replace each position w/probability p by a random bit, but respect A*
• Theorem [Andoni-K.’08]: Can approximate ed(x,y) within constant factor, in smoothed runtime that is (whp) near-linear n^(1+ε).
  • Some extensions to sublinear time
• Techniques:
  • Match blocks of length L = O(1/p · log n) that have edit distance ≤ εL.
  • A known heuristic technique (e.g. PatternHunter)
  • To find block matches quickly, we use a naive NNS algorithm
  • Because of smoothing, blocks are likely to be distinct (and even far), so modulo overlaps between blocks, we “effectively” have permutations
Open: Better time n·polylog(n)? Approximation independent of p?
Variants of Edit Distance
• Edit distance with block operations
  • Admits O(log n · log* n) approximation in near-linear time, via embedding into L1 [Muthukrishnan-Sahinalp’00, Cormode-Muthukrishnan’02]
  • Open: Distortion lower bounds? Better approximation in polytime?
• Edit distance between trees (generalizes strings)
  • Basic operations: insert/delete/relabel vertex
  • Can be computed in O(n^3) time [Demaine-Mozes-Rossman-Weimann’07]
  • Open: Embedding?
• Edit distance with “rich” alphabet
  • Can model shape matching [Klein-Tirthapura-Sharvit-Kimia’00]
  • Challenge: Cost of basic operation varies with symbols
Conclusion
• Having multiple computational models is fruitful
  • New ideas, techniques, viewpoints, applications can come full circle
  • Lower bounds, in certain models, highlight limitations of methods
  • Explore which instances are easy/hard
  • “Asymmetric algorithms” can work well for symmetric problems
• Connections to other fields (sampling, embeddings, communication complexity, Fourier analysis) and computational problems (NNS)
• Had much progress, but still many gaps, and much more to go
Thank You!