Summer School on Hashing’14: Dimension Reduction. Alex Andoni (Microsoft Research)
Nearest Neighbor Search (NNS) • Preprocess: a set P of n points in R^d • Query: given a query point q, report a point p in P with the smallest distance to q
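For concreteness, a minimal linear-scan query under Euclidean distance (a sketch; the function name and the example data are illustrative):

```python
import numpy as np

def nns_linear_scan(P, q):
    """Return the point of P closest to q in Euclidean distance (brute-force scan)."""
    dists = np.linalg.norm(P - q, axis=1)  # distance from q to every point of P
    return P[np.argmin(dists)]

# Example: 1000 points in 128 dimensions
rng = np.random.default_rng(0)
P = rng.standard_normal((1000, 128))
q = rng.standard_normal(128)
nearest = nns_linear_scan(P, q)
print(np.linalg.norm(nearest - q))
```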
Motivation • Generic setup: • Points model objects (e.g. images) • Distance models (dis)similarity measure • Application areas: • machine learning: k-NN rule • speech/image/video/music recognition, vector quantization, bioinformatics, etc… • Distance can be: • Hamming, Euclidean, edit distance, Earth-mover distance, etc… • Primitive for other problems: • find the similar pairs in a set D, clustering… (figure: example bit vectors compared under Hamming distance)
2D case • Compute the Voronoi diagram of the n points • Given query q, perform point location • Performance: • Space: O(n) • Query time: O(log n)
High-dimensional case • All known exact algorithms degrade rapidly with the dimension d: space or query time becomes exponential in d
Dimension Reduction • If high dimension is an issue, reduce it?! • “flatten” dimension d into dimension k << d • Not possible in general: packing bound • But can if we only need to preserve distances for a fixed subset of R^d: Johnson-Lindenstrauss Lemma [JL’84] • Application: NNS in R^d • Trivial scan: O(n·d) query time • Reduce to O(n·k + T) query time if we store the reduced points at preprocessing, where T = time to reduce the dimension of the query point
Johnson Lindenstrauss Lemma • There is a randomized linear map f: R^d -> R^k, with k = O(log(1/δ)/ε²), that preserves the distance between any two fixed vectors • up to a 1±ε factor: ||f(x) − f(y)|| = (1±ε)·||x − y|| • with probability at least 1 − δ (δ some small constant) • Preserves all pairwise distances among n points for k = O(log(n)/ε²) (union bound over pairs) • Time to compute the map: O(d·k)
Idea: • Project onto a random subspace of dimension k!
1D embedding (figure: Gaussian pdf (1/√(2π))·e^(−x²/2)) • How about one dimension (k = 1)? • Map f(x) = g·x = Σ_i g_i·x_i, • where the g_i are iid normal (Gaussian) random variables • Why Gaussian? • Stability property: g_1·x_1 + … + g_d·x_d is distributed as ||x||·g, where g is also Gaussian • Equivalently: g = (g_1, …, g_d) is spherically symmetric, i.e., has a uniformly random direction, and the projection of x onto a random direction depends only on the length of x
1D embedding (continued) • Map f(x) = g·x, • for any x, y in R^d, • Linear: f(x) − f(y) = f(x − y) • Want: |f(x) − f(y)| ≈ ||x − y|| • Ok to consider only z = x − y since f is linear • Claim: for any z, we have • Expectation: E[(g·z)²] = ||z||² • Standard deviation: σ[(g·z)²] = O(||z||²) • Proof: • Expectation: E[(g·z)²] = E[Σ_i Σ_j g_i·g_j·z_i·z_j] = Σ_i z_i²·E[g_i²] = ||z||², since the cross terms vanish (E[g_i·g_j] = 0 for i ≠ j)
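A quick numerical check of the expectation and standard-deviation claim (a sketch; the sample sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 100
z = rng.uniform(-1, 1, size=d)           # an arbitrary fixed vector

# Draw many independent Gaussian vectors g and look at (g.z)^2
G = rng.standard_normal((100_000, d))
proj_sq = (G @ z) ** 2

print("E[(g.z)^2] ~", proj_sq.mean())     # close to ||z||^2
print("||z||^2    =", np.dot(z, z))
print("std of (g.z)^2 ~", proj_sq.std())  # on the order of ||z||^2: a single
                                          # 1D projection is unbiased but noisy
```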
Full Dimension Reduction • Just repeat the 1D embedding k times! • f(x) = (1/√k)·A·x, where A is a k×d matrix of iid Gaussian random variables • Again, want to prove: • ||f(z)||² = (1±ε)·||z||² for each fixed z = x − y • with probability at least 1 − δ
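A minimal sketch of the full map in numpy, with an empirical distortion check over random pairs (the constant in the choice of k is an illustrative guess):

```python
import numpy as np

def jl_transform(X, k, rng):
    """Dense JL: multiply by a k x d matrix of iid N(0,1) entries, scaled by 1/sqrt(k)."""
    d = X.shape[1]
    A = rng.standard_normal((k, d)) / np.sqrt(k)
    return X @ A.T

rng = np.random.default_rng(2)
n, d, eps = 500, 2000, 0.25
X = rng.standard_normal((n, d))
k = int(np.ceil(4 * np.log(n) / eps**2))    # constant 4 is an illustrative choice

Y = jl_transform(X, k, rng)

# Check distortion over a sample of pairs
i, j = rng.integers(0, n, size=(2, 2000))
mask = i != j                               # avoid zero-distance pairs
orig = np.linalg.norm(X[i[mask]] - X[j[mask]], axis=1)
new = np.linalg.norm(Y[i[mask]] - Y[j[mask]], axis=1)
ratio = new / orig
print("k =", k, " distortion range:", ratio.min(), ratio.max())
```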
Concentration • ||f(z)||² is distributed as (||z||²/k)·(h_1² + h_2² + … + h_k²) • where each h_i is distributed as a standard Gaussian • The sum h_1² + … + h_k² • is called the chi-squared distribution with k degrees of freedom • Fact: chi-squared is very well concentrated: • equal to k·(1±ε) with probability at least 1 − e^(−Ω(ε²·k)) • Akin to the central limit theorem
Johnson Lindenstrauss: wrap-up • ||f(x) − f(y)|| = (1±ε)·||x − y|| with high probability for k = O(log(n)/ε²) • Beyond Gaussians: • Can use random ±1 entries instead of Gaussians [AMS’96, Ach’01, TZ’04…]
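A sketch of the sign-matrix variant, identical to the Gaussian map above except for the entry distribution:

```python
import numpy as np

def jl_sign_transform(X, k, rng):
    """JL with random ±1 (Rademacher) entries instead of Gaussians, scaled by 1/sqrt(k)."""
    d = X.shape[1]
    A = rng.choice([-1.0, 1.0], size=(k, d)) / np.sqrt(k)
    return X @ A.T

# Usage mirrors the Gaussian version
rng = np.random.default_rng(7)
X = rng.standard_normal((100, 1000))
Y = jl_sign_transform(X, 200, rng)
print(X.shape, "->", Y.shape)
```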
Faster JL? • Time: O(d·k) = O(d·log(n)/ε²) • to compute A·x • Can we get O(d·log d) time? • Yes! • [AC’06, AL’08’11, DKS’10, KN’12…] • Will show: a map computable in about O(d·log d) time (plus additive lower-order terms) [Ailon-Chazelle’06]
Fast JL Transform • Costly because A is dense • Meta-approach: use a sparse matrix A? • Suppose we sample only s nonzero entries per row • Analysis of one row A_1: • want (A_1·x)² = (1±ε)·||x||² with good probability • Expectation of (A_1·x)²: can be made equal to ||x||² by setting the normalization constant of the entries appropriately • What about the variance??
Fast JLT: sparse projection • Variance of (A_1·x)² can be large • Bad case: x is sparse • think: x = (1, 0, …, 0) • Even for s·k ≈ d (each coordinate of x goes somewhere on average), the sampled entries hit the few nonzero coordinates of x too unreliably: the bad event has probability that is only polynomially small • but we want failure probability exponentially small in k • for such x we would really need a nearly dense A • But, take away: a sparse A may work if x is “spread around” • New plan: • “spread around” x first • then use a sparse A
FJLT: full construction f(x) = P·H·D·x • D = d×d diagonal matrix with iid ±1 random variables on the diagonal • H = Hadamard matrix: • H·x can be computed in O(d·log d) time (like the Fast Fourier Transform) • composed of only ±1 entries (up to scaling) • P = sparse projection matrix as before, of size k×d • H·D does the “spreading around”; P is the sparse projection
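A runnable sketch of the P·H·D pipeline, implementing H via the fast Walsh-Hadamard transform and taking P to be sampling of k random coordinates (as in the “Projection” slide below); the function names are illustrative:

```python
import numpy as np

def fwht(x):
    """Fast Walsh-Hadamard transform (unnormalized), O(d log d); len(x) must be a power of 2."""
    x = x.copy()
    d = len(x)
    h = 1
    while h < d:
        for i in range(0, d, 2 * h):
            a = x[i:i + h].copy()
            b = x[i + h:i + 2 * h].copy()
            x[i:i + h] = a + b
            x[i + h:i + 2 * h] = a - b
        h *= 2
    return x

def fjlt(x, k, rng):
    """Sketch of P·H·D: random signs, fast Hadamard, then sample k coordinates."""
    d = len(x)
    signs = rng.choice([-1.0, 1.0], size=d)        # D: random diagonal signs
    y = fwht(signs * x) / np.sqrt(d)               # H/sqrt(d): orthonormal, computed fast
    coords = rng.choice(d, size=k, replace=False)  # P: sample k coordinates...
    return y[coords] * np.sqrt(d / k)              # ...rescaled so norms are preserved in expectation

rng = np.random.default_rng(3)
d, k = 2**12, 256
x = np.zeros(d); x[0] = 1.0                        # very sparse input: the hard case for a sparse A alone
print(np.linalg.norm(x), np.linalg.norm(fjlt(x, k, rng)))
```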
Spreading around: intuition • Idea for the Hadamard/Fourier transform: • “Uncertainty principle”: if the original x is sparse, then its transform H·x is dense! • Though H can “break” x’s that are already dense; that is what the random signs D protect against
Spreading around: proof • Suppose ||x|| = 1 • Ideal spreading around: every coordinate of (1/√d)·H·D·x has magnitude about 1/√d • Lemma: with probability at least 1 − 1/poly(d), every coordinate j satisfies |(1/√d)·(H·D·x)_j| ≤ O(√(log d)/√d) • Proof: • (1/√d)·(H·D·x)_j = b·x for some vector b with ±1/√d entries and random signs • Hence b·x is approximately Gaussian with variance ||x||²/d = 1/d • Hence standard concentration (Chernoff/Hoeffding) gives the bound, and a union bound covers all d coordinates
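A small empirical check of the spreading effect, using scipy's dense Hadamard matrix for simplicity (d must be a power of 2; parameters are illustrative):

```python
import numpy as np
from scipy.linalg import hadamard

rng = np.random.default_rng(4)
d = 2**10
H = hadamard(d) / np.sqrt(d)             # orthonormal Hadamard matrix
signs = rng.choice([-1.0, 1.0], size=d)  # the random diagonal D

x = np.zeros(d); x[0] = 1.0              # worst case for sparsity: a single spike
y = H @ (signs * x)
print("spike: max |coord| of HDx =", np.abs(y).max(), " ideal 1/sqrt(d) =", 1 / np.sqrt(d))

x2 = np.zeros(d); x2[rng.choice(d, 5, replace=False)] = 1 / np.sqrt(5)
y2 = H @ (signs * x2)
print("5-sparse: max |coord| =", np.abs(y2).max(), " vs sqrt(log d / d) =", np.sqrt(np.log(d) / d))
```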
Why projection? • Why aren’t we done? • choose the first few coordinates of (1/√d)·H·D·x • each coordinate has the same distribution: approximately Gaussian with variance ||x||²/d • Issue: the coordinates are not independent • Nevertheless: • ||(1/√d)·H·D·x|| = ||x||, since (1/√d)·H·D is a change of basis (a rotation of R^d)
Projection • Have: y = (1/√d)·H·D·x with ||y|| = ||x|| and every |y_j| ≤ O(√(log d)/√d), with probability at least 1 − 1/poly(d) • P = projection onto just k random coordinates (rescaled by √(d/k))! • Proof: standard concentration • Chernoff: since each term y_j² is small relative to ||y||², sampling k = O(log(1/δ)·log(d)/ε²) coordinates gives a 1±ε approximation of ||y||² • Hence this k suffices
FJLT: wrap-up • Obtain: ||P·H·D·x|| = (1±ε)·||x|| • with probability at least 1 − δ • dimension of P·H·D·x is k' = O(log(1/δ)·log(d)/ε²), larger than optimal by the log d factor • time: O(d·log d + k') • Dimension not optimal: apply a regular (dense) JL map on R^{k'} to reduce further to O(log(1/δ)/ε²) • Final time: O(d·log d) plus additive lower-order terms • [AC’06, AL’08’11, WDLSA’09, DKS’10, KN’12, BN, NPW’14]
Dimension Reduction: beyond Euclidean • Johnson-Lindenstrauss: for Euclidean space (ℓ2) • dimension O(log(n)/ε²), oblivious to the data • Other norms, such as ℓ1? • Essentially no: [CS’02, BC’03, LN’04, JN’10…] • For n points and approximation D: the required dimension is at least n^Ω(1/D²) [BC03], with upper bounds polynomial in n [NR10, ANN10…] • even if the map is allowed to depend on the dataset! • But can do weak dimension reduction
Towards dimension reduction for ℓ1 • Can we do the “analog” of Euclidean projections? • For ℓ2, we used: the Gaussian distribution • it has the 2-stability property: • g_1·z_1 + … + g_d·z_d is distributed as ||z||_2·g • Is there something similar for the 1-norm? • Yes: the Cauchy distribution! • 1-stable: • c_1·z_1 + … + c_d·z_d is distributed as ||z||_1·c, where c, c_1, …, c_d are iid Cauchy • What’s wrong then? • Cauchy random variables are heavy-tailed… • |c| doesn’t even have a finite expectation
Weak embedding [Indyk’00] • Still, can consider the map f(x) = (c_1·x, …, c_k·x) as before, with rows c_1, …, c_k of iid Cauchy entries • Each coordinate is distributed as ||x||_1 × Cauchy • ||f(x)|| does not concentrate at all, but… • Can estimate ||x||_1 by: • the median of the absolute values of the k coordinates! • Concentrates because |Cauchy| is in the correct range (around its median, which equals 1) for 90% of the time! • Gives a sketch • OK for nearest neighbor search
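A minimal sketch of the Cauchy sketch and its median estimator (function names and parameters are illustrative):

```python
import numpy as np

def cauchy_sketch(x, k, rng):
    """k-dimensional 1-stable (Cauchy) sketch of x: each coordinate ~ ||x||_1 * Cauchy."""
    C = rng.standard_cauchy((k, len(x)))
    return C @ x

def estimate_l1(sketch):
    """Estimate ||x||_1 as the median of absolute sketch coordinates (median of |Cauchy| is 1)."""
    return np.median(np.abs(sketch))

rng = np.random.default_rng(5)
x = rng.uniform(-1, 1, size=1000)
k = 200
print("true ||x||_1 :", np.abs(x).sum())
print("estimate     :", estimate_l1(cauchy_sketch(x, k, rng)))
```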