A Nonlinear Approach to Dimension Reduction
Robert Krauthgamer, Weizmann Institute of Science
Joint work with Lee-Ad Gottlieb
Data As High-Dimensional Vectors
• Data is often represented by vectors in R^d:
  • Image: color or intensity histogram
  • Document: word frequencies
• A typical goal is Nearest Neighbor Search: preprocess the data so that, given a query vector, the closest vector in the data set can be found quickly (see the sketch below).
• Common in various data-analysis tasks: classification, learning, clustering.
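For concreteness, a minimal brute-force rendering of the nearest-neighbor task (my illustration; names and data are hypothetical, not from the talk):

```python
import numpy as np

def nearest_neighbor(data, query):
    """Brute-force NN: scan all n vectors, O(n*d) per query. This linear
    scan in d dimensions is the baseline that dimension reduction and the
    search structures cited below try to beat."""
    dists = np.linalg.norm(data - query, axis=1)   # distance to every vector
    return data[np.argmin(dists)]

# Example: 1000 random histogram-like vectors in R^64.
data = np.random.default_rng(0).random((1000, 64))
print(np.allclose(nearest_neighbor(data, data[3] + 0.01), data[3]))  # True
```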
Curse of Dimensionality [Bellman'61]
• The cost of maintaining data is exponential in the dimension.
• This observation extends to many useful operations, e.g. Nearest Neighbor Search [Clarkson'94].
• Dimension reduction: represent high-dimensional data in a low-dimensional space.
  • Map the given vectors into a low-dimensional space while preserving most of the data.
  • Goal: trade off accuracy for computational efficiency.
  • Common interpretation: preserve pairwise distances.
The JL Lemma
Theorem [Johnson-Lindenstrauss, 1984]: For every n-point set X ⊂ R^m and every 0 < ε < 1, there is a map φ: X → R^k, for k = O(ε^(-2) log n), that preserves all distances within a factor 1+ε:
  ||x-y|| ≤ ||φ(x)-φ(y)|| ≤ (1+ε)·||x-y|| for all x,y ∈ X.
• Can be realized by a simple linear transformation: a random k×m matrix works, with entries from {-1,0,1} [Achlioptas'01] or Gaussian [Dasgupta-Gupta'98, Indyk-Motwani'98] (see the sketch below).
• Applications in a host of problems in computational geometry.
• Can we do better?
  • A nearly matching lower bound was given by [Alon'03]...
  • ...but it is existential.
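A minimal sketch of the JL map as a random Gaussian matrix (the constant 4 in the choice of k is illustrative only; the lemma promises k = O(ε^(-2) log n)):

```python
import numpy as np

def jl_embed(X, eps, seed=0):
    """Map the rows of X (n points in R^m) to R^k, k = O(eps^-2 log n),
    via a random k x m Gaussian matrix scaled by 1/sqrt(k)."""
    n, m = X.shape
    k = int(np.ceil(4 * np.log(n) / eps**2))       # illustrative constant
    A = np.random.default_rng(seed).normal(size=(k, m)) / np.sqrt(k)
    return X @ A.T

# Sanity check: pairwise distance ratios concentrate around 1.
X = np.random.default_rng(1).normal(size=(50, 500))
Y = jl_embed(X, eps=0.5)
i, j = np.triu_indices(50, k=1)
ratios = np.linalg.norm(Y[i] - Y[j], axis=1) / np.linalg.norm(X[i] - X[j], axis=1)
print(ratios.min(), ratios.max())                  # expect both near 1
```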
Doubling Dimension
• Definition: the ball B(x,r) is the set of all points within distance r of x.
• The doubling constant λ(M) of a metric M is the minimum λ > 0 such that every ball can be covered by λ balls of half the radius. (In the figure, λ ≤ 7.)
  • First used by [Assouad'83], algorithmically by [Clarkson'97].
• The doubling dimension is dim(M) = log λ(M) [Gupta-K.-Lee'03].
• Applications:
  • Approximate nearest neighbor search [Clarkson'97, K.-Lee'04, ..., Cole-Gottlieb'06, Indyk-Naor'06, ...]
  • Spanners [Talwar'04, ..., Gottlieb-Roditty'08]
  • Compact routing [Chan-Gupta-Maggs-Zhou'05, ...]
  • Network triangulation [Kleinberg-Slivkins-Wexler'04, ...]
  • Distance oracles [Har-Peled-Mendel'06]
  • Embeddings [Gupta-K.-Lee'03, K.-Lee-Mendel-Naor'04, Abraham-Bartal-Neiman'08]
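A hedged sketch of how one might upper-bound the doubling constant of a finite point set empirically (my addition; function names are mine). Greedy covering by half-radius balls yields a valid cover of size at most λ², since the greedy centers form an (r/2)-packing:

```python
import numpy as np

def greedy_half_cover(ball_pts, r):
    """Greedily cover ball_pts with balls of radius r/2 centered at points
    of the set; returns the number of balls used (at most lambda^2)."""
    count, remaining = 0, ball_pts
    while len(remaining) > 0:
        c = remaining[0]
        d = np.linalg.norm(remaining - c, axis=1)
        remaining = remaining[d > r / 2]   # drop points covered by B(c, r/2)
        count += 1
    return count

def doubling_estimate(P, radii):
    """Empirical upper bound on the doubling constant of a finite P in R^d."""
    best = 1
    for r in radii:
        for x in P:
            ball = P[np.linalg.norm(P - x, axis=1) <= r]
            best = max(best, greedy_half_cover(ball, r))
    return best
```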
A Stronger Version of JL?
Theorem [Johnson-Lindenstrauss, 1984]: Every n-point set X ⊂ ℓ_2 and every 0 < ε < 1 admit a linear embedding φ: X → ℓ_2^k, for k = O(ε^(-2) log n), such that for all x,y ∈ X,
  ||x-y|| ≤ ||φ(x)-φ(y)|| ≤ (1+ε)·||x-y||.
• The near-tight lower bound of [Alon'03] is existential: it holds for X = uniform metric, where dim(X) = log n.
• Open: extend the lower bound to spaces with doubling dimension ≪ log n?
• Open: a JL-like embedding into dimension k = O(dim(X))?
  • Even constant distortion would be interesting [Lang-Plaut'01, Gupta-K.-Lee'03].
  • Cannot be attained by linear transformations [Indyk-Naor'06]. Example [Kahane'81, Talagrand'92]: Wilson's helix, x_j = (1,1,...,1,0,...,0) ∈ R^n with j ones, i.e. ||x_i-x_j|| = |i-j|^(1/2) (checked numerically below); it embeds with distortion 1+ε.
• We present two partial resolutions, using Õ(dim²(X)) dimensions:
  1. Distortion 1+ε for a single scale, i.e. for pairs with ||x-y|| ∈ [εr, r].
  2. A global embedding of the snowflake metric ||x-y||^(1/2).
  2'. Hence the conjecture is correct whenever ||x-y||² is a Euclidean metric.
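A quick numeric check of the helix example (a sanity check I added, not from the slides):

```python
import numpy as np

# Wilson's helix: x_j = (1,...,1,0,...,0) with j ones, so x_i - x_j has
# exactly |i-j| unit coordinates and ||x_i - x_j|| = |i-j|^(1/2).
n = 10
X = np.tril(np.ones((n, n)))                   # row j has j+1 leading ones
i, j = 2, 7
print(np.linalg.norm(X[i] - X[j]), abs(i - j) ** 0.5)   # both sqrt(5)
```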
I. Embedding for a Single Scale
• Theorem 1: For every finite subset X ⊂ ℓ_2 and all 0 < ε < 1, r > 0, there is an embedding f: X → ℓ_2^k, for k = Õ(log(1/ε)·(dim X)²), satisfying:
  1. Lipschitz: ||f(x)-f(y)|| ≤ ||x-y|| for all x,y ∈ X
  2. Bi-Lipschitz at scale r: ||f(x)-f(y)|| ≥ Ω(||x-y||) whenever ||x-y|| ∈ [εr, r]
  3. Boundedness: ||f(x)|| ≤ r for all x ∈ X
• Compared to the open question:
  • Bi-Lipschitz only at one scale (weaker)...
  • ...but the distortion achieved is an absolute constant (stronger).
• This talk illustrates the proof for constant distortion; the 1+ε distortion is attained later, for distances ∈ [ε'r, r].
• Overall approach: divide and conquer.
Step 1: Net Extraction
• Extract from the point set X an r-net (greedy sketch below). Net properties:
  • r-covering: every point of X is within distance r of some net point.
  • r-packing: net points are at pairwise distance > r.
• Packing ⇒ a ball of radius s contains O((s/r)^dim(X)) net points.
• We shall build a good embedding for the net points. Why can we ignore the non-net points?
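A minimal greedy sketch of Step 1 (this is the standard greedy net construction; the talk does not prescribe a specific algorithm):

```python
import numpy as np

def extract_net(P, r):
    """Greedy r-net of a finite P in R^d: scan points, keep one iff it is
    > r from every net point kept so far. The result is r-covering (every
    rejected point was within r of a kept one) and r-packing."""
    net = []
    for p in P:
        if all(np.linalg.norm(p - q) > r for q in net):
            net.append(p)
    return np.array(net)
```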
Step 1: Net Extraction and Extension
• Recall: ||f||_Lip = min { L ≥ 0 : ||f(x)-f(y)|| ≤ L·||x-y|| for all x,y }.
• Lipschitz Extension Theorem [Kirszbraun'34]: for every X ⊂ ℓ_2, every map f: X → ℓ_2^k can be extended to f': ℓ_2 → ℓ_2^k such that ||f'||_Lip ≤ ||f||_Lip.
• Therefore, a good embedding just for the net points suffices.
• Smaller net resolution ⇒ less distortion for the non-net points.
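Kirszbraun's theorem is existential; as a crude constructive stand-in (my simplification, explicitly not the paper's method), one can map each non-net point like its nearest net point, incurring only additive error proportional to the net resolution:

```python
import numpy as np

def extend_via_net(x, net, f_net):
    """Map a non-net point x to the image of its nearest net point.
    NOT Kirszbraun's extension -- just a cheap stand-in whose additive
    error is bounded by the Lipschitz constant times the net resolution."""
    i = np.argmin(np.linalg.norm(net - x, axis=1))
    return f_net[i]
```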
Step 2: Padded Decomposition
• Partition the space probabilistically into clusters. Properties [Gupta-K.-Lee'03, Abraham-Bartal-Neiman'08] (simplified sketch below):
  • Cluster diameter: bounded by dim(X)·r.
  • Size: by the doubling property, bounded by (dim(X)/ε)^dim(X).
  • Padding: each point is r-padded (its r-ball lies inside its cluster) with probability > 9/10.
  • Support: O(dim(X)) partitions.
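A simplified random ball-carving sketch in the spirit of padded decompositions (my illustration; the actual [GKL'03]/[ABN'08] constructions and their padding guarantees are more delicate):

```python
import numpy as np

def ball_carving(P, centers, R, seed=0):
    """Visit centers in random order; each grabs all still-unassigned
    points within a single random radius drawn from [R, 2R]. Returns a
    cluster index per point (-1 never occurs if centers form an R-net)."""
    rng = np.random.default_rng(seed)
    radius = R * (1 + rng.random())            # shared random radius
    cluster = -np.ones(len(P), dtype=int)
    for c in rng.permutation(len(centers)):
        d = np.linalg.norm(P - centers[c], axis=1)
        cluster[(cluster == -1) & (d <= radius)] = c
    return cluster
```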
Step 3: JL on Individual Clusters
• For each partition, consider each individual cluster and reduce its dimension using the JL lemma:
  • Constant distortion.
  • Target dimension (logarithmic in the cluster size): O(log((dim(X)/ε)^dim(X))) = Õ(dim(X)).
• Then translate some point of each cluster to the origin.
The story so far...
• To review:
  • Step 1: extract net points.
  • Step 2: build a family of partitions.
  • Step 3: for each partition, apply JL in each cluster and translate to the origin.
• Embedding guarantees for a single partition:
  • Intracluster distance: constant distortion.
  • Intercluster distance: min distance 0, max distance dim(X)·r. Not good enough!
• Let's backtrack...
The story so far... (take two)
• The revised plan inserts a new step:
  • Step 1: extract net points.
  • Step 2: build a family of partitions.
  • Step 3 (new): for each partition, apply in each cluster a Gaussian transform.
  • Step 4: for each partition, apply JL in each cluster and translate to the origin.
• Recall the problem: with the old plan, the intercluster max distance was dim(X)·r, which is not good enough when ||x-y|| ≈ r.
Step 3: Gaussian Transform
• For each partition, apply within each cluster the Gaussian transform (kernel) to the distances [Schoenberg'38]:
  G(t) = (1 - e^(-t²))^(1/2)
• Adaptation to scale r: G_r(t) = r·(1 - e^(-t²/r²))^(1/2)
• This is realized by a map f: ℓ_2 → ℓ_2 such that ||f(x)-f(y)|| = G_r(||x-y||).
• Threshold: the cluster diameter is now at most r (instead of dim(X)·r), since G_r ≤ r.
• Distortion: small distortion of distances in the relevant range, since G_r(t) = Θ(t) for t ≤ r.
• The transform can increase the dimension... but JL is the next step.
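Schoenberg's map is into infinite-dimensional ℓ_2. As a finite-dimensional approximation (my addition, not the paper's construction), random Fourier features [Rahimi-Recht'07] for the kernel k(x,y) = exp(-||x-y||²/r²) give features z with ||z(x)-z(y)||² ≈ 2(1-k(x,y)) = (2/r²)·G_r(||x-y||)², so scaling by r/√2 approximates G_r:

```python
import numpy as np

def gaussian_transform_rff(X, r, D=2000, seed=0):
    """Approximate the Gaussian transform at scale r in D dimensions:
    returns Z with ||Z[a]-Z[b]|| ~= G_r(||X[a]-X[b]||)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = rng.normal(scale=np.sqrt(2) / r, size=(D, d))  # E[cos(w.(x-y))] = e^{-t^2/r^2}
    b = rng.uniform(0, 2 * np.pi, size=D)
    Z = np.sqrt(2.0 / D) * np.cos(X @ W.T + b)         # z(x).z(y) ~= kernel value
    return (r / np.sqrt(2)) * Z                        # rescale: ||z(x)-z(y)||^2 ~= 2(1-k)
```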
Step 4: JL on Individual Clusters
• Steps 3 & 4 together: the Gaussian transform shrinks the cluster diameter, and JL then shrinks the dimension.
• New embedding guarantees:
  • Intracluster: constant distortion.
  • Intercluster: min distance 0, max distance r (instead of dim(X)·r).
• Caveat: still not good enough when ||x-y|| ≈ r.
• Also smooth the map near cluster boundaries.
Step 5: Glue Partitions
• We have an embedding for each partition:
  • For padded points, the guarantees are perfect.
  • For non-padded points, the guarantees are weak.
• "Glue" together the embeddings for the different partitions: concatenate all O(dim(X)) embeddings (and scale down); see the sketch below.
  • For example, f1(x) = (1,7), f2(x) = (2,8), f3(x) = (4,9) ⇒ F(x) = f1(x) ∘ f2(x) ∘ f3(x) = (1,7,2,8,4,9).
• The non-padded case occurs ≤ 1/10 of the time, so it gets "averaged away":
  ||F(x)-F(y)||² = ||f1(x)-f1(y)||² + ... + ||fm(x)-fm(y)||² ≈ m·||x-y||².
• Final dimension = Õ(dim²(X)):
  • Number of partitions: O(dim(X)).
  • Dimension per partition: Õ(dim(X)).
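Step 5 in code is just concatenation with a 1/√m rescaling, a direct transcription of the formula above (the slide's example omits the rescaling):

```python
import numpy as np

def glue(embeddings):
    """Concatenate m per-partition embeddings (each an n x k_i array) and
    rescale so that ||F(x)-F(y)||^2 = (1/m) * sum_i ||f_i(x)-f_i(y)||^2."""
    m = len(embeddings)
    return np.concatenate(embeddings, axis=1) / np.sqrt(m)

# The slide's example: f1(x)=(1,7), f2(x)=(2,8), f3(x)=(4,9)
F = glue([np.array([[1, 7]]), np.array([[2, 8]]), np.array([[4, 9]])])
print(F)   # (1,7,2,8,4,9) / sqrt(3)
```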
Step 6: Kirszbraun Extension Theorem
• Kirszbraun's theorem extends the embedding from the net points to the non-net points, without increasing the dimension.
Single Scale Embedding – Review
• Steps:
  1. Net extraction
  2. Padded decomposition
  3. Gaussian transform
  4. JL
  5. Glue partitions
  6. Extension theorem
• Theorem 1: every finite X ⊂ ℓ_2 admits an embedding f: X → ℓ_2^k, for k = Õ((dim X)²), such that:
  1. Lipschitz: ||f(x)-f(y)|| ≤ ||x-y|| for all x,y ∈ X
  2. Bi-Lipschitz at scale r: ||f(x)-f(y)|| ≥ Ω(||x-y||) whenever ||x-y|| ∈ [εr, r]
  3. Boundedness: ||f(x)|| ≤ r for all x ∈ X
Single Scale Embedding – Strengthened
• Revisit the steps to get distortion 1±ε:
  1. Net extraction: finer (εr-)nets.
  2. Padded decomposition: padding probability 1-ε.
  3. Gaussian transform.
  4. JL: already 1+ε distortion.
  5. Glue partitions: a higher percentage of padded points.
  6. Extension theorem.
• Theorem 1 (strengthened): every finite X ⊂ ℓ_2 admits an embedding f: X → ℓ_2^k, for k = Õ((dim X)²), such that:
  1. Lipschitz: ||f(x)-f(y)|| ≤ ||x-y|| for all x,y ∈ X
  2. Gaussian at scale r: ||f(x)-f(y)|| = (1±ε)·G_r(||x-y||) whenever ||x-y|| ∈ [εr, r]
  3. Boundedness: ||f(x)|| ≤ r for all x ∈ X
II. Snowflake Embedding
• Theorem 2: for every 0 < ε < 1 and every finite subset X ⊂ ℓ_2, there is an embedding F: X → ℓ_2^k of the snowflake metric ||x-y||^(1/2), achieving dimension k = Õ(ε^(-4)·(dim X)²) and distortion 1+ε, i.e. (up to scaling)
  ||x-y||^(1/2) ≤ ||F(x)-F(y)|| ≤ (1+ε)·||x-y||^(1/2) for all x,y ∈ X.
• Compared to the open question:
  • We embed the snowflake metric (weaker)...
  • ...but achieve distortion 1+ε (stronger).
• This generalizes [Kahane'81, Talagrand'92], who study the Euclidean embedding of Wilson's helix (the real line with distances |x-y|^(1/2)).
• We'll illustrate the construction for constant distortion.
  • The snowflake technique is due to [Assouad'83]; we give the first implementation achieving distortion 1+ε.
Assouad's Technique
• Basic idea: consider the single-scale embeddings f_r for all scales r = 2^i.
• Fix points x,y ∈ X and suppose ||x-y|| ≈ s. At each scale r, the single-scale guarantees give:
  • Lipschitz: ||f_r(x)-f_r(y)|| ≤ ||x-y|| = s
  • Gaussian: ||f_r(x)-f_r(y)|| = (1±ε)·G_r(||x-y||)
  • Boundedness: ||f_r(x)|| ≤ r
• Relevant scales: r = 16s, 8s, 4s, 2s, s, s/2, s/4, s/8, s/16, ...
Assouad's Technique
• Now scale down the embedding at scale r by r^(1/2) (the snowflake):

  r      ||f_r(x)-f_r(y)||   ||f_r(x)-f_r(y)|| / r^(1/2)
  16s    s                   s^(1/2)/4
  8s     s                   s^(1/2)/8^(1/2)
  4s     s                   s^(1/2)/2
  2s     s                   s^(1/2)/2^(1/2)
  s      s                   s^(1/2)
  s/2    s/2                 s^(1/2)/2^(1/2)
  s/4    s/4                 s^(1/2)/2
  s/8    s/8                 s^(1/2)/8^(1/2)
  s/16   s/16                s^(1/2)/4

• Combine these embeddings by addition (and staggering).
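A tiny numeric rendering of this table, modeling the single-scale guarantees as ||f_r(x)-f_r(y)|| ≈ min(s, r) (my simplification: Lipschitz caps the value at s for r ≥ s, boundedness caps it at roughly r for r < s):

```python
import numpy as np

s = 1.0
for i in range(4, -5, -1):
    r = s * 2.0 ** i
    contrib = min(s, r) / np.sqrt(r)       # ||f_r(x)-f_r(y)|| / r^(1/2)
    print(f"r = 2^{i:+d} s:  {contrib:.3f} * s^(1/2)")
# Contributions peak at r ~ s and decay geometrically in both directions,
# so the staggered sum is dominated by scales near s, i.e. it is within a
# constant factor of the snowflake distance s^(1/2).
```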
Snowflake Embedding – Review
• Steps:
  1. Compute single-scale embeddings for all scales r = 2^i.
  2. Scale down the embedding at scale r by r^(1/2).
  3. Combine the embeddings by addition (with some staggering).
• By taking more refined scales (powers of 1+ε instead of 2) and further staggering, distortion 1+ε can be achieved for the snowflake.
• Theorem 2: for all 0 < ε < 1, every finite subset X ⊂ ℓ_2 embeds into ℓ_2^k, for k = Õ(ε^(-4)·(dim X)²), with distortion 1+ε for the snowflake.
Conclusion
• We gave two (1+ε)-distortion, low-dimension embeddings for doubling subsets of ℓ_2:
  • Single scale
  • Snowflake
• This framework can be extended to ℓ_1 and ℓ_∞, with some obstacles:
  • Dimension reduction: can't use JL.
  • Lipschitz extension: can't use Kirszbraun.
  • Threshold: can't use the Gaussian transform.
• Many of the steps in the single-scale embedding are nonlinear, although most "localities" are mapped (nearly) linearly.
  • Does this explain the empirical success of heuristics such as Locally Linear Embeddings?
• Applications? Clustering is one potential area...
Thank you!