Efficient Sketches for Earth-Mover Distance, with Applications

David Woodruff, IBM Almaden
Joint work with Alexandr Andoni, Khanh Do Ba, and Piotr Indyk

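(Planar) Earth-Mover Distance
• For multisets A, B of points in [∆]^2 with |A| = |B| = N: EMD(A,B) = min over bijections π: A → B of ∑_{a ∈ A} ||a − π(a)||, i.e., the min cost of a perfect matching between A and B
• Example (figure): EMD of the two pictured point sets = 6 + 3√2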
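The matching definition can be checked by brute force on tiny inputs. The sketch below is illustrative only: it assumes Euclidean distance between points (which is consistent with the 6 + 3√2 example), and the function name `emd` is ours, not from the slides.

```python
from itertools import permutations
from math import dist

def emd(A, B):
    """Minimum total Euclidean cost of a perfect matching between A and B."""
    if len(A) != len(B):
        raise ValueError("EMD as defined here needs |A| == |B|")
    # Try every bijection A -> B; only viable for very small N.
    return min(
        sum(dist(a, b) for a, b in zip(A, perm))
        for perm in permutations(B)
    )

A = [(0, 0), (0, 1)]
B = [(1, 0), (1, 1)]
print(emd(A, B))  # 2.0: match each point to its horizontal neighbour
```

The N! enumeration is only meant to make the definition concrete; it says nothing about how the sketches in this talk work.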

Geometric Representation of EMD
• Map A, B to k-dimensional vectors F(A), F(B) ∈ R^k
• Image space of F is “simple,” e.g., k is small
• Can estimate EMD(A,B) from F(A), F(B) via some efficient recovery algorithm E: E(F(A), F(B)) ≈ EMD(A,B)

Geometric Representation of EMD: Motivation
• Visual search and recognition:
  • Approximate nearest neighbor under EMD
  • Reduces to approximate NN under simpler distances
  • Has been applied to fast image search and recognition in large collections of images [Indyk-Thaper’03, Grauman-Darrell’05, Lazebnik-Schmid-Ponce’06]
• Data streaming computation:
  • Estimating the EMD between two point sets given as a stream
  • Need the mapping F to be linear: adding a new point a to A translates to adding F(a) to F(A)
  • Important open problem in streaming [“Kanpur List ’06”]
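The linearity requirement can be illustrated with any linear map applied to characteristic vectors. In the sketch below, the random ±1 matrix is a stand-in chosen purely for illustration (it is not the mapping constructed in this work), and `DELTA`, `char_vector`, `make_linear_map`, and `apply_map` are hypothetical names. Grid coordinates are zero-indexed.

```python
import random

DELTA = 4  # hypothetical grid side length

def char_vector(points):
    """Characteristic vector of a multiset in [DELTA]^2 (count per grid cell)."""
    x = [0] * (DELTA * DELTA)
    for (i, j) in points:
        x[i * DELTA + j] += 1
    return x

def make_linear_map(k, seed=0):
    """A random k x DELTA^2 sign matrix standing in for F (illustration only)."""
    rng = random.Random(seed)
    return [[rng.choice((-1, 1)) for _ in range(DELTA * DELTA)] for _ in range(k)]

def apply_map(F, x):
    return [sum(f * v for f, v in zip(row, x)) for row in F]

F = make_linear_map(k=3)
A = [(0, 1), (2, 2)]
a = (3, 0)

sketch = apply_map(F, char_vector(A))
# Streaming update: add F(a) to F(A) without revisiting the points of A.
update = apply_map(F, char_vector([a]))
sketch_after = [s + u for s, u in zip(sketch, update)]

# Same result as sketching the enlarged multiset from scratch.
assert sketch_after == apply_map(F, char_vector(A + [a]))
```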

Prior and New Results
Geometric representation of EMD:
Main Theorem: For any ε ∈ (0,1), there exists a distribution over linear mappings F: R^(∆^2) → R^(∆^ε) such that, for multisets A, B ⊆ [∆]^2 of equal size, we can produce an O(1/ε)-approximation to EMD(A,B) from F(A), F(B) with probability 2/3.

Implications
• Streaming:
• Approximate nearest neighbor:
Legend: N = number of points; s = number of data points (multisets) to preprocess; α > 1 is a free parameter.

Proof Outline
• Old [Agarwal-Varadarajan’04, Indyk’07]:
  • Extend EMD to EEMD, which:
    • Handles sets of unequal size |A| ≤ |B| in a grid of side length k
    • EEMD(A,B) = min over S ⊆ B with |S| = |A| of [EMD(A,S) + k·|B\S|]
    • Is induced by a norm ||·||_EEMD, i.e., EEMD(A,B) = ||χ(A) − χ(B)||_EEMD, where χ(A) ∈ R^(∆^2) is the characteristic vector of A
  • Decomposition of EEMD into a weighted sum of small EEMDs
    • O(1/ε) distortion
• New:
  • Linear sketching of “sum-norms”
[Figure: EMD over [∆]^2 written as a sum of ∆^O(1) terms, each an EEMD over [∆^ε]^2]
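The EEMD definition can also be evaluated by brute force on small inputs. This is a sketch under stated assumptions: Euclidean point distances, exhaustive search over subsets S, and a swap of the arguments when |A| > |B| (a convenience we add, justified by EEMD being induced by a norm and hence symmetric). The function names are ours.

```python
from itertools import combinations, permutations
from math import dist

def emd(A, B):
    """Brute-force EMD for equal-size point lists (Euclidean cost assumed)."""
    return min(sum(dist(a, b) for a, b in zip(A, p)) for p in permutations(B))

def eemd(A, B, k):
    """EEMD(A,B) = min over S in B, |S| = |A|, of EMD(A,S) + k * |B minus S|."""
    if len(A) > len(B):
        A, B = B, A  # the definition assumes |A| <= |B|
    unmatched = len(B) - len(A)  # every choice of S leaves this many points out
    best_match = min(emd(A, list(S)) for S in combinations(B, len(A)))
    return best_match + k * unmatched

A = [(0, 0)]
B = [(0, 1), (3, 3)]
print(eemd(A, B, k=4))  # 5.0: match (0,0)-(0,1) at cost 1, pay k = 4 for (3,3)
```

When |A| = |B| the penalty term vanishes and `eemd` coincides with `emd`.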

Old Idea [Indyk ’07]
[Figure: EMD over [∆]^2 is written as a sum of EEMD problems over [∆^(1/2)]^2; recursing yields a sum of ∆^O(1) EEMD problems over [∆^ε]^2]

Old Idea [Indyk ’07]
Solve EEMD in each of ∆ cells, each a problem in [∆^(1/2)]^2
[Figure: EMD over [∆]^2, with the grid partitioned into cells]

Old Idea [Indyk ’07]
Solve one additional EEMD problem in [∆^(1/2)]^2
Should also scale edge lengths by ∆^(1/2)

Old Idea [Indyk ’07]
• Total cost is the sum of the two phases
• The algorithm outputs a matching, so its cost is at least the EMD cost
• Indyk shows that if we put a randomly shifted [∆^(1/2)]^2 grid on top of the [∆]^2 grid, the algorithm’s cost is at most a constant factor times the true EMD cost
• Recursive application gives multiple [∆^ε]^2 grids on top of each other, and results in an O(1/ε)-approximation

Main New Technical Theorem
For a normed space X = (R^t, ||·||_X) and M ∈ X^n, denote ||M||_{1,X} = ∑_i ||M_i||_X = ||M_1||_X + ||M_2||_X + … + ||M_n||_X.
Given C > 0 and λ > 0, if C/λ ≤ ||M||_{1,X} ≤ C, there is a distribution over linear mappings μ: X^n → X^((λ log n)^O(1)) such that we can produce an O(1)-approximation to ||M||_{1,X} from μ(M) w.h.p.
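To make the quantity being sketched concrete, here is a direct (non-sketched) evaluation of ||M||_{1,X}. The choices X = ℓ1 and X = ℓ2 and the helper names are assumptions for illustration; the theorem itself allows any norm X.

```python
def l1_norm(v):
    return sum(abs(x) for x in v)

def l2_norm(v):
    return sum(x * x for x in v) ** 0.5

def sum_norm(M, norm):
    """||M||_{1,X}: add up the X-norm of every block M_i."""
    return sum(norm(block) for block in M)

M = [[3, 4], [0, 0], [-1, 1]]   # n = 3 blocks, each in R^2
print(sum_norm(M, l1_norm))     # 7 + 0 + 2 = 9
print(sum_norm(M, l2_norm))     # 5 + 0 + sqrt(2)
```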

Proof Outline: Sum of Norms
• First attempt:
  • Sample (uniformly) a few M_i’s to compute ||M_i||_X
  • Problem: the sum could be concentrated in 1 block
• Second attempt:
  • Sample M_i with probability proportional to ||M_i||_X [Indyk’07]
  • Problem: how to do this online?
  • Techniques from [JW09, MW10]?
    • Need to sample/retrieve blocks, not just individual coordinates
[Figure: blocks M_1, M_2, M_3, …, M_n, where M_2 contains most of the mass]
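The failure mode of the first attempt can be seen by enumerating every uniform sample on a toy input. The numbers below are hypothetical, and the estimator (n/s times the sum of the sampled norms) is one natural choice we assume here rather than something fixed by the slides.

```python
from itertools import combinations
from fractions import Fraction

def uniform_estimates(norms, s):
    """All estimates (n/s) * (sum of s sampled norms), one per size-s subset."""
    n = len(norms)
    return [Fraction(n, s) * sum(norms[i] for i in S)
            for S in combinations(range(n), s)]

norms = [0, 100, 0, 0, 0, 0]    # hypothetical: one block carries all the mass
ests = uniform_estimates(norms, s=2)
missed = sum(1 for e in ests if e == 0)
print(missed, len(ests))        # 10 15: two thirds of the samples see nothing
print(sum(ests) / len(ests))    # 100: right on average, but never close
```

Each individual estimate is either 0 or 300, which is why sampling proportionally to ||M_i||_X is the natural next idea.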

Proof Outline: Sum of Norms (cont.)
• Our approach:
  • Split into exponential levels:
    • Assume ||M||_{1,X} ≤ C
    • S_k = {i ∈ [n] s.t. ||M_i||_X ∈ (T_k, 2T_k]}, T_k = C/2^k
  • Suffices to estimate |S_k| for each level k. How?
    • For each level k, subsample from [n] at a rate such that event E_k (“isolation” of level k) holds with probability proportional to |S_k|
    • Repeat the experiment several times, count the number of successes
[Figure: the blocks of M = (M_1, M_2, …, M_n) are grouped into levels S_1, S_2, S_3, …, S_ℓ; each subsampling trial tests whether E_k holds (Y/N)]
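The reason level counts suffice is that every block in S_k has norm between T_k and 2T_k, so ∑_k T_k·|S_k| is within a factor 2 of ||M||_{1,X}. A small offline sketch of the level decomposition follows; it takes the block norms as given, which a streaming algorithm of course cannot do, and the function names are ours.

```python
def level_sets(norms, C):
    """Map level k -> indices i with T_k < norms[i] <= 2*T_k, where T_k = C / 2**k."""
    levels = {}
    for i, v in enumerate(norms):
        if v == 0:
            continue  # zero blocks belong to no level
        if not 0 < v <= C:
            raise ValueError("each block norm must lie in [0, C]")
        k = 1
        while v <= C / 2 ** k:  # descend until T_k < v
            k += 1
        levels.setdefault(k, []).append(i)
    return levels

def level_estimate(levels, C):
    """Sum over levels of T_k * |S_k|; a lower bound within a factor 2 of the total."""
    return sum((C / 2 ** k) * len(idx) for k, idx in levels.items())

C = 16
norms = [8, 3, 2, 2, 1, 0]           # hypothetical block norms with sum <= C
levels = level_sets(norms, C)
print(levels)                         # {2: [0], 3: [1], 4: [2, 3], 5: [4]}
print(level_estimate(levels, C))      # 8.5, and 8.5 <= 16 <= 17
```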

Proof Outline: Event E_k
• E_k ⇔ “isolation” of level k:
  • Exactly one i ∈ S_k gets subsampled
  • Nothing from S_k’ for k’ < k
• Verification of trial success/failure
  • Hash subsampled elements
  • Each cell maintains the vector sum of the subsampled M_i’s that hash there
  • E_k holds roughly (we “accept”) when:
    • 1 cell has X-norm in (0.9T_k, 2.1T_k]
    • All other cells have X-norm ≤ 0.9T_k
  • Check fails only if:
    • Elements from lighter levels contribute a lot to 1 cell
    • Elements from heavier levels are subsampled and collide
    • Both unlikely if the hash table is big enough
• Under-estimates |S_k|. If |S_k| > 2^k/polylog(n), gives an O(1)-approximation
• Remark: the triangle inequality of the norm gives control over the impact of collisions
[Figure: subsampled blocks (e.g., M_1, M_4, M_5, M_6, M_9, M_11, M_{n−1}) are hashed into cells, each holding the sum ∑ of its blocks]

Sketch and Recovery Algorithm
Sketch:
• For each level k, create t hash tables
• For each hash table:
  • Subsample from [n], including each i ∈ [n] w.p. p_k = 2^(-k)
  • Each cell maintains the sum of the M_i’s that hash to it
Recovery algorithm:
• For each level k, count the number c_k of “accepting” hash tables
• Return ∑_k T_k · (c_k/t) · (1/p_k)
Properties of the per-level estimator (c_k/t) · (1/p_k):
• For every k, the estimator under-estimates |S_k|
• If |S_k| > 2^k/polylog n, the estimator is Ω(|S_k|)
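The sketch and recovery steps can be simulated offline as follows. This is a minimal illustration under several assumptions that are ours, not the slides': X = ℓ1, Python's seeded `random.Random` in place of the limited-independence hash functions a real streaming algorithm would use, and arbitrary small values for `levels`, `t`, and `cells`. The acceptance test is the one from the Event E_k slide.

```python
import random

def l1(v):
    return sum(abs(x) for x in v)

def build_sketch(blocks, levels, t, cells, seed):
    """For each level k and each of t tables: subsample indices w.p. 2**-k and
    add every kept block into a hashed cell. Linear in `blocks` for a fixed seed."""
    rng = random.Random(seed)  # randomness depends only on the seed, not the data
    dim = len(blocks[0])
    sketch = {}
    for k in range(1, levels + 1):
        for j in range(t):
            table = [[0] * dim for _ in range(cells)]
            for block in blocks:
                keep = rng.random() < 2.0 ** -k
                cell = rng.randrange(cells)
                if keep:
                    table[cell] = [a + b for a, b in zip(table[cell], block)]
            sketch[(k, j)] = table
    return sketch

def recover(sketch, C, levels, t, norm=l1):
    """Return sum_k T_k * (c_k / t) * (1 / p_k), with c_k = accepting tables."""
    total = 0.0
    for k in range(1, levels + 1):
        T = C / 2 ** k
        accepted = 0
        for j in range(t):
            norms = [norm(cell) for cell in sketch[(k, j)]]
            in_band = sum(1 for v in norms if 0.9 * T < v <= 2.1 * T)
            too_heavy = sum(1 for v in norms if v > 2.1 * T)
            # Accept: exactly one cell in the band, all others at most 0.9 * T.
            if in_band == 1 and too_heavy == 0:
                accepted += 1
        total += T * (accepted / t) * 2 ** k
    return total

blocks = [[8, 0], [0, 0], [0, 0], [0, 0]]  # hypothetical input, ||M||_{1,X} = 8
sk = build_sketch(blocks, levels=3, t=200, cells=16, seed=1)
print(recover(sk, C=8, levels=3, t=200))   # an under-estimate, between 0 and 8
```

Because the subsampling and hashing decisions never look at the block contents, sketches built with the same seed add cell by cell, which is the property a streaming update relies on.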

EMD Wrapup
• We achieve a linear embedding of EMD
  • with constant distortion, namely O(1/ε),
  • into a space of strongly sublinear dimension, namely ∆^ε.
• Open problems:
  • Getting a (1+ε)-approximation / proving impossibility
  • Reducing the dimension to log^O(1) ∆ / proving a lower bound

What We Did
• We showed that in a data stream, one can sketch ||M||_{1,X} = ∑_i ||M_i||_X with space about the space complexity of computing (or sketching) ||·||_X
• This quantity is known as a cascaded norm, written as L_1(X)
• Cascaded norms have many applications [CM, JW]
• Can we generalize this? E.g., what about L_2(X), i.e., (∑_i ||M_i||_X^2)^(1/2)

Cascaded Norms [JW09]
• No!
• L_2(L_1), i.e., (∑_i ||M_i||_1^2)^(1/2), requires Ω(n^(1/2)) space, where n is the number of different i, but the sketching complexity of L_1 is O(log n)
• More generally, for p ≥ 1, L_p(L_1), i.e., (∑_i ||M_i||_1^p)^(1/p), requires Θ(n^(1-1/p)) space
• So, L_1(X) is very special
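For reference, the cascaded norms discussed here can be evaluated directly (without any sketching) as below. The helper name `cascaded_norm` and the restriction of the inner norm to ℓ_q are assumptions made for illustration.

```python
def cascaded_norm(M, p, q):
    """L_p(L_q) of M: take the L_q norm of each block, then the L_p norm of those values."""
    inner = [sum(abs(x) ** q for x in block) ** (1.0 / q) for block in M]
    return sum(v ** p for v in inner) ** (1.0 / p)

M = [[3, -1], [0, 2], [1, 1]]
print(cascaded_norm(M, p=1, q=1))  # L_1(L_1) = 4 + 2 + 2
print(cascaded_norm(M, p=2, q=1))  # L_2(L_1) = sqrt(16 + 4 + 4)
```

Setting p = 1 gives the sum-norm L_1(X) with X = ℓ_q, the case the sketching result covers.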

Thank You!
