Overcoming the L1 Non-Embeddability Barrier. Robert Krauthgamer (Weizmann Institute). Joint work with Alexandr Andoni and Piotr Indyk (MIT).
Overcoming the L1 Non-Embeddability Barrier Robert Krauthgamer (Weizmann Institute) Joint work with Alexandr Andoni and Piotr Indyk (MIT)
Algorithms on Metric Spaces
• Fix a metric M
  • Examples: Hamming distance, Ulam metric, Earthmover distance, edit distance
  • ED(x,y) = minimum number of edit operations that transform x into y
  • edit operation = insert/delete/substitute a character
  • ED(0101010, 1010101) = 2
• Fix a computational problem
  • Compute distance between x,y
  • Nearest Neighbor Search: preprocess n strings so that, given a query string, we can find the closest string to it
• Solve problem under M
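The definition of ED above can be checked with the textbook dynamic program. This is only a minimal sketch for illustration: the function name `edit_distance` is ours, and the quadratic-time algorithm is the standard one, not the approach of this talk.

```python
def edit_distance(x: str, y: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning x into y."""
    prev = list(range(len(y) + 1))  # cost of turning the empty prefix of x into y[:j]
    for i, cx in enumerate(x, start=1):
        curr = [i]  # turning x[:i] into the empty string takes i deletions
        for j, cy in enumerate(y, start=1):
            curr.append(min(
                prev[j] + 1,               # delete cx
                curr[j - 1] + 1,           # insert cy
                prev[j - 1] + (cx != cy),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


print(edit_distance("0101010", "1010101"))  # 2: delete the leading 0, append a 1
```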
Motivation for Nearest Neighbor
• Many applications:
  • Image search (Euclidean dist., Earth-mover dist.)
  • Processing of genetic information, text processing (edit dist.)
  • many others…
• Figure: generic search engine
A General Tool: Embeddings
• An embedding of M into a host metric (H, d_H) is a map f : M → H
  • preserves distances approximately
  • has distortion A ≥ 1 if for all x,y ∈ M: d_M(x,y) ≤ d_H(f(x),f(y)) ≤ A*d_M(x,y)
• Why?
  • If H is “easy” (= can efficiently solve computational problems like NNS)
  • Then get good algorithms for the original space M!
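For a finite set of points, the distortion of a given map can be computed by brute force. The sketch below is illustrative only: the helper name `distortion` and the toy example (a 4-cycle mapped onto a line) are our assumptions, and the map is allowed to be rescaled so that the lower bound d_M ≤ d_H holds.

```python
from itertools import combinations


def distortion(points, d_M, d_H, f):
    """Smallest A with d_M <= c * d_H(f(x), f(y)) <= A * d_M for some scaling c.

    Assumes the points are distinct and f is injective, so no ratio is 0 or undefined.
    """
    ratios = [d_H(f(x), f(y)) / d_M(x, y) for x, y in combinations(points, 2)]
    return max(ratios) / min(ratios)


# Toy example (hypothetical): shortest-path metric on a 4-cycle, embedded into the real line.
cycle = [0, 1, 2, 3]
d_cycle = lambda i, j: min(abs(i - j), 4 - abs(i - j))
d_line = lambda a, b: abs(a - b)

print(distortion(cycle, d_cycle, d_line, lambda i: i))  # 3.0: nodes 0 and 3 are adjacent but land 3 apart
```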
Host space?
• Popular target metric: ℓ1
  • ℓ1 = real space with d_1(x,y) = ∑_i |x_i - y_i|
• Have efficient algorithms:
  • Distance estimation: O(d) for d-dimensional space (often less)
  • NNS: c-approx with O(n^{1/c}) query time and O(n^{1+1/c}) space [IM98]
• Powerful enough for some things…
Below logarithmic?
• Cannot work with ℓ1
• Other possibilities?
  • (ℓ2)^p is bigger and algorithmically tractable
    • (ℓ2)^p = real space with dist_{2^p}(x,y) = ||x-y||_2^p
    • but not rich enough (often same lower bounds)
  • ℓ∞ is rich (includes all metrics),
    • ℓ∞ = real space with dist_∞(x,y) = max_i |x_i - y_i|
    • but usually not efficient computationally (high dimension)
• And that’s roughly it…
  • (at least for efficient NNS)
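Meet our new host
• Iterated product space, Ρ_{2²,∞,1}
• Figure: a point consists of γ blocks; each block consists of β vectors; each vector has α coordinates
  • d_1: ℓ1 distance between two vectors of α coordinates
  • d_{∞,1}: maximum of d_1 over the β vectors of a block
  • d_{2²,∞,1}: sum of the squares of d_{∞,1} over the γ blocks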
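The nested distance can be written out directly. This is a sketch under our reading of the diagram (an ℓ1 distance on α coordinates, a maximum over β vectors, a sum of squares over γ blocks); the function names and the nested-list representation of a point are assumptions made for illustration.

```python
def d_1(u, v):
    """l_1 distance between two vectors with alpha coordinates."""
    return sum(abs(a - b) for a, b in zip(u, v))


def d_inf_1(U, V):
    """l_inf product: maximum l_1 distance over the beta vectors of a block."""
    return max(d_1(u, v) for u, v in zip(U, V))


def d_22_inf_1(X, Y):
    """(l_2)^2 product: sum over the gamma blocks of the squared d_inf_1 distance."""
    return sum(d_inf_1(U, V) ** 2 for U, V in zip(X, Y))


# A point is a gamma x beta x alpha nested list; here gamma = beta = alpha = 2.
X = [[[0, 0], [1, 1]], [[2, 0], [0, 0]]]
Y = [[[1, 0], [1, 3]], [[0, 0], [0, 1]]]
print(d_22_inf_1(X, Y))  # 8 = max(1, 2)**2 + max(2, 1)**2
```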
Why Ρ_{2²,∞,1}?
• Because we can…
• Theorem 1. Ulam embeds into Ρ_{2²,∞,1} with O(1) distortion (the host is rich)
  • Dimensions (γ,β,α) = (d, log d, d)
• Theorem 2. Ρ_{2²,∞,1} admits NNS on n points with (the host is algorithmically tractable)
  • O(log log n) approximation
  • O(n^ε) query time and O(n^{1+ε}) space
• In fact, there is more for Ulam…
Our Algorithms for Ulam
• Ulam = edit distance on strings where each symbol appears at most once
  • ED(1234567, 7123456) = 2
  • A classical distance between rankings
  • Exhibits hardness of misalignments (as in general edit)
  • All lower bounds same as for general edit (up to Θ̃(·))
  • Distortion of embedding into ℓ1 (and (ℓ2)^p, etc.): Θ̃(log d)
• Our approach implies new algorithms for Ulam:
  1. NNS with O(log log n) approx, O(n^ε) query time
     • Can improve to O(log log d) approx
  2. Sketching with O(1)-approx in log^{O(1)} d space
  3. Distance estimation with O(1)-approx in time …
     • [BEKMRRS03]: when ED ≈ d, approx d^ε in O(d^{1-2ε}) time
• If we ever hope for approximation << log d for NNS under general edit, first we have to get it under Ulam!
Theorem 1
• Theorem 1. Can embed Ulam into Ρ_{2²,∞,1} with O(1) distortion
  • Dimensions (γ,β,α) = (d, log d, d)
• Proof
  • “Geometrization” of Ulam characterizations
  • Previously studied in the context of testing monotonicity (sortedness):
    • Sublinear algorithms [EKKRV98, ACCL04]
    • Data-stream algorithms [GJKK07, GG07, EH08]
Thm 1: Characterizing Ulam
• Consider permutations x,y over [d]
  • Assume for now: x = identity permutation
• Idea:
  • Count # chars in y to delete to obtain increasing sequence (≈ Ulam(x,y))
  • Call them faulty characters
• Issues:
  • Ambiguity…
  • How do we count them?
• Examples:
  • x = 123456789, y = 234657891
  • x = 123456789, y = 341256789
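When x is the identity, the number of characters to delete from y equals d minus the length of a longest increasing subsequence. The sketch below uses the standard patience-sorting method for that length; the function name `deletions_to_sorted` is ours, and this exact computation only illustrates the quantity being counted, not the characterization developed in the talk.

```python
from bisect import bisect_left


def deletions_to_sorted(y):
    """Minimum number of symbols to delete from y so the rest is increasing."""
    tails = []  # tails[i] = smallest possible last symbol of an increasing subsequence of length i + 1
    for c in y:
        i = bisect_left(tails, c)
        if i == len(tails):
            tails.append(c)   # c extends the longest subsequence found so far
        else:
            tails[i] = c      # c gives a smaller tail for length i + 1
    return len(y) - len(tails)


print(deletions_to_sorted([2, 3, 4, 6, 5, 7, 8, 9, 1]))  # 2
print(deletions_to_sorted([3, 4, 1, 2, 5, 6, 7, 8, 9]))  # 2: delete either {3, 4} or {1, 2}
```

The second example shows the ambiguity named on the slide: the count is well defined, but the set of deleted characters is not.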
Thm 1: Characterization – inversions
• Definition: chars a<b form an inversion if b precedes a in y
• How to identify a faulty char?
  • Has an inversion?
    • Doesn’t work: all chars might have an inversion
  • Has many inversions?
    • Still can miss “faulty” chars
  • Has many inversions locally?
    • Same problem
  • Check if either is true!
• Examples (x = 123456789 in each): y = 567981234, y = 234567891, y = 213456798
Thm 1: Characterization – faulty chars
• Definition 1: a is faulty if there exists K>0 s.t.
  • a is inverted w.r.t. a majority of the K symbols preceding a in y
  • (ok to consider K = 2^k)
• Lemma [ACCL04, GJKK07]: # faulty chars = Θ(Ulam(x,y)).
• Example: x = 123456789, y = 234567891
  • the 4 characters preceding 1 in y all form inversions with 1
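Definition 1 translates almost literally into code when x is the identity. This is a minimal sketch: the name `faulty_chars` is ours, and two details are assumptions the slide leaves open, namely that "majority" is strict and that only windows K = 2^k that fit entirely before a are checked.

```python
def faulty_chars(y):
    """Symbols of y that are faulty w.r.t. the identity permutation (Definition 1)."""
    faulty = []
    for i, a in enumerate(y):
        K = 1
        while K <= i:                       # assumption: the window must fit before position i
            window = y[i - K:i]             # the K symbols preceding a in y
            inverted = sum(b > a for b in window)  # larger symbols before a are inversions
            if 2 * inverted > K:            # assumption: strict majority
                faulty.append(a)
                break
            K *= 2                          # only K = 2^k is considered
    return faulty


print(faulty_chars([2, 3, 4, 5, 6, 7, 8, 9, 1]))  # [1]
print(faulty_chars([2, 1, 3, 4, 5, 6, 7, 9, 8]))  # [1, 8]
```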
Thm 1: Characterization → Embedding
• To get an embedding, need:
  1. Symmetrization (neither string is identity)
  2. Deal with “exists”, “majority”…?
• To resolve (1), use instead X[a;K] …
• Definition 2: a is faulty if there exists K = 2^k such that
  • |X[a;2^k] Δ Y[a;2^k]| > 2^k (symmetric difference)
• Example: x = 123456789, y = 123467895, with windows X[5;4] and Y[5;4]
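The symmetric test of Definition 2 can be sketched as follows. We assume X[a;K] denotes the set of K symbols preceding a in x (the slide elides the definition), and that a window is simply truncated when fewer than K symbols precede a; the names `preceding` and `is_faulty` are ours.

```python
def preceding(p, a, K):
    """Set of (up to) K symbols immediately preceding a in p; truncation is an assumption."""
    i = p.index(a)
    return set(p[max(0, i - K):i])


def is_faulty(x, y, a):
    """Definition 2: some K = 2^k has |X[a;K] symmetric-difference Y[a;K]| > K."""
    K = 1
    while K <= len(x):
        if len(preceding(x, a, K) ^ preceding(y, a, K)) > K:
            return True
        K *= 2
    return False


x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [1, 2, 3, 4, 6, 7, 8, 9, 5]
print(preceding(x, 5, 4), preceding(y, 5, 4))  # the two windows {1,2,3,4} and {6,7,8,9}
print(is_faulty(x, y, 5))                      # True
```

Unlike Definition 1, this test treats x and y symmetrically, so neither string has to be the identity.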
Thm 1: Embedding – final step
• Example: x = 123456789, y = 123467895, with windows X[5;2²] and Y[5;2²]
• We have: one indicator per symbol a, equal to 1 iff a is faulty (Definition 2)
• Replace the indicator by a weight?
• Final embedding: … (the per-symbol term is squared, (·)²)
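The formula on this slide did not survive, so the following is only one plausible form, consistent with the stated dimensions (γ,β,α) = (d, log d, d): an outer sum of squares over the d symbols, a maximum over the scales K = 2^k, and an ℓ1 distance between indicator vectors of the windows. The 1/(2K) weight and the truncated windows are assumptions, as are the names `preceding` and `product_distance`.

```python
def preceding(p, a, K):
    """Set of (up to) K symbols immediately preceding a in p; truncation is an assumption."""
    i = p.index(a)
    return set(p[max(0, i - K):i])


def product_distance(x, y):
    """Sum over symbols a of (max over K = 2^k of |X[a;K] sym-diff Y[a;K]| / (2K)) squared."""
    d = len(x)
    total = 0.0
    for a in x:                         # outer (l_2)^2 product over the d symbols
        K, best = 1, 0.0
        while K <= d:                   # l_inf product over about log d scales
            # l_1 distance between the indicator vectors of the two windows
            diff = len(preceding(x, a, K) ^ preceding(y, a, K))
            best = max(best, diff / (2 * K))   # assumed weight 1/(2K) replaces the 0/1 indicator
            K *= 2
        total += best ** 2
    return total


x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [1, 2, 3, 4, 6, 7, 8, 9, 5]
print(product_distance(x, x))  # 0.0
print(product_distance(x, y))
```

Each per-symbol term lies in [0, 1], so it behaves like a soft version of the faulty indicator.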
Theorem 2
• Theorem 2. Ρ_{2²,∞,1} admits NNS on n points with
  • O(log log n) approximation
  • O(n^ε) query time and O(n^{1+ε}) space for any small ε
  • (ignoring (αβγ)^{O(1)} factors)
• A rather general approach
  • “LSH” on ℓ1-products of general metric spaces
  • Of course, cannot do, but can reduce to ℓ∞-products
Thm 2: Proof
• Let’s start from basics: ℓ1^α
  • [IM98]: c-approx with O(n^{1/c}) query time and O(n^{1+1/c}) space
  • (ignoring α^{O(1)} factors)
• Ok, what about …
• [I02]: Suppose there is NNS for M with
  • c_M-approx
  • Q_M query time
  • S_M space.
• Then there is NNS for the ℓ∞-product of M with
  • O(c_M * log log n)-approx
  • Õ(Q_M) query time
  • O(S_M * n^{1+ε}) space.
Thm 2: What about (ℓ2)²-product?
• Enough to consider …
  • (for us, M is the ℓ1-product)
• Off-the-shelf?
  • [I04]: gives space ~n or >log n approximation
• We reduce to multiple NNS queries under …
• Instructive to first look at NNS for standard ℓ1 …
Thm 2: Review of NNS for ℓ1
• LSH family: collection H of hash functions such that:
  • For random h ∈ H (parameter R>0): Pr[h(q)=h(p)] ≈ 1 - ||q-p||_1 / R
• Query just uses the primitive:
  • “return all points p such that h(q)=h(p)”
• Can obtain H by imposing a randomly-shifted grid of side-length R
• Then for h defined by r_i ∈ [0, R] at random, the primitive becomes:
  • “return all p s.t. |q_i-p_i| < r_i for all i ∈ [d]”
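A randomly-shifted grid is easy to sketch. In the code below the side length is called `R` (our name for the scale parameter) and `make_grid_hash` is a hypothetical helper. In one dimension the collision probability is exactly max(0, 1 - |q-p|/R); over several coordinates it is the product of these terms, which is close to 1 - ||q-p||_1/R when the distance is small compared to R.

```python
import math
import random


def make_grid_hash(dim, R, rng):
    """Hash function given by a grid of side length R with a random shift per coordinate."""
    shifts = [rng.uniform(0, R) for _ in range(dim)]

    def h(p):
        # Index of the grid cell containing p after shifting each coordinate
        return tuple(math.floor((p_i + s_i) / R) for p_i, s_i in zip(p, shifts))

    return h


rng = random.Random(0)
q, p = (0.0,), (0.25,)
trials = 20000
hits = 0
for _ in range(trials):
    h = make_grid_hash(1, 1.0, rng)
    hits += h(q) == h(p)
print(hits / trials)  # close to 1 - 0.25 / 1.0 = 0.75
```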
Thm 2: LSH for ℓ1-product
• Intuition: abstract LSH!
• Recall we had, for ℓ1: for r_i random from [0, R], point p is returned if for all i: |q_i-p_i| < r_i
  • “return all p s.t. |q_i-p_i| < r_i for all i ∈ [d]”
• Equivalently, for all i: |q_i-p_i| / r_i < 1, i.e. max_i |q_i-p_i| / r_i < 1
  • ℓ∞ product of ℝ!
• For the ℓ1-product of M:
  • “return all points p such that max_i d_M(q_i,p_i)/r_i < 1”
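The abstract primitive can be sketched as a brute-force filter. The linear scan is only for illustration, since the point of the reduction is to answer this query with an NNS structure for the ℓ∞-product rather than by scanning; the name `primitive` and the toy data are assumptions.

```python
def primitive(points, q, radii, d_M):
    """Return all p with max_i d_M(q_i, p_i) / r_i < 1 (a scaled l_inf-product ball around q)."""
    return [
        p for p in points
        if max(d_M(q_i, p_i) / r_i for q_i, p_i, r_i in zip(q, p, radii)) < 1
    ]


# With M = the real line, this is the l_1 primitive |q_i - p_i| < r_i for all i.
abs_diff = lambda a, b: abs(a - b)
points = [(0, 0), (1, 5), (3, 1)]
print(primitive(points, (0, 0), (2, 2), abs_diff))  # [(0, 0)]
print(primitive(points, (0, 0), (4, 2), abs_diff))  # [(0, 0), (3, 1)]
```

Passing a different `d_M` (any metric on the coordinates) gives the version for the ℓ1-product of M.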
Thm 2: Final
• Thus, sufficient to solve the primitive:
  • “return all points p such that max_i d_M(q_i,p_i)/r_i < 1”
  • (in fact, for k independent choices of (r_1,…,r_d))
• We reduced NNS over the ℓ1-product of M to several instances of NNS over the ℓ∞-product of M (with appropriately scaled coordinates)
• Approximation is O(1)*O(log log n)
• Done!
Take-home message:
• Can embed combinatorial metrics into iterated product spaces
  • Works for Ulam (= edit on non-repetitive strings)
• Approach bypasses non-embeddability results into usual-suspect spaces like ℓ1, (ℓ2)² …
Open:
• Embeddings for edit over {0,1}^d, EMD, other metrics?
• Understanding product spaces?
  • [Jayram-Woodruff]: sketching
Thank you!