A Nonlinear Approach to Dimension Reduction

Lee-Ad Gottlieb, Weizmann Institute of Science • Joint work with Robert Krauthgamer

Data As High-Dimensional Vectors • Data is often represented by vectors in $\mathbb{R}^m$ • For images, color or intensity • For documents, word frequency • A typical goal: Nearest Neighbor Search • Preprocess the data so that, given a query vector, the closest vector in the data set can be found quickly. • Common in various data analysis tasks: classification, learning, clustering.

Curse of Dimensionality • The cost of many useful operations is exponential in the dimension • First noted by Bellman [Bel-61] in the context of PDFs • Nearest Neighbor Search [Cla-94] • Dimension reduction: • Represent high-dimensional data in a low-dimensional space • Specifically: map the given vectors into a low-dimensional space while preserving most of the data’s “structure” • Trade off accuracy for computational efficiency

The JL Lemma • Theorem (Johnson-Lindenstrauss, 1984): • For every n-point Euclidean set X, with dimension d, there is a linear map : XY (Euclidean Y) with • Interpoint distortion 1± • Dimension ofY: k = O(--2 log n) • Can be realized by a trivial linear transformation • Multiply d x n point matrix by a k x d matrix of random entries {-1,0,1} [Ach-01] • An near matching lower bound was given by [Alon-03] • Applications in a host of problems in computational geometry • But can we do better? A Nonlinear Approach to Dimension Reduction

Doubling Dimension • Definition: Ball $B(x,r)$ = all points within distance $r$ of $x$. • The doubling constant $\lambda$ of a metric $M$ is the minimum value $\lambda > 0$ such that every ball can be covered by $\lambda$ balls of half the radius • First used by [Ass-83], algorithmically by [Cla-97]. • The doubling dimension is $\dim(M) = \log \lambda(M)$ [GKL-03] • Applications: • Approximate nearest neighbor search [KL-04, CG-06] • Distance oracles [HM-06] • Spanners [GR-08a, GR-08b] • Embeddings [ABN-08, BRS-07] • Figure: here $\lambda \le 7$.
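
To make the definition concrete, here is a small sketch that covers one ball with half-radius balls centered at data points. It is a greedy procedure, so the count it returns is only an upper bound for that single ball; the doubling constant itself is the worst case over all balls. The function name and the integer-line example are assumptions made for illustration.

```python
import math

def greedy_half_cover(points, center, r, dist):
    """Greedily pick centers inside B(center, r) so that the balls of
    radius r/2 around them cover every point of B(center, r)."""
    ball = [p for p in points if dist(p, center) <= r]
    centers = []
    for p in ball:
        # p becomes a new center only if no chosen center covers it yet.
        if all(dist(p, c) > r / 2 for c in centers):
            centers.append(p)
    return centers

# Hypothetical example: the integers 0..16 on a line.
line = list(range(17))
dist = lambda a, b: abs(a - b)
cover = greedy_half_cover(line, center=8, r=4, dist=dist)

# log2 of the cover size bounds the "local" dimension seen at this ball.
dim_estimate = math.log2(len(cover))
```

For points on a line the estimate stays small no matter how many points there are, which is the sense in which doubling dimension measures intrinsic rather than ambient dimension.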

The JL Lemma • Theorem (Johnson-Lindenstrauss, 1984): • For every n-point Euclidean set X, with dimension d, there is a linear map : XY with • Interpoint distortion 1± • Dimension ofY: O(-2 log n) • An almost matching lower bound was given by [Alon-03] • This lower bound considered n roughly equidistant points • So it had dim(X) = log n • So in fact the lower bound is (-2 dim(X)) A Nonlinear Approach to Dimension Reduction

A stronger version of JL? • Open questions: • Can the JL $\log n$ lower bound be strengthened to apply to spaces with low doubling dimension ($\dim(X) \ll \log n$)? • Does there exist a JL-like embedding into $O(\dim(X))$ dimensions? [LP-01, GKL-03] • Even constant distortion would be interesting • A linear transformation cannot attain this result [IN-07] • Here, we present a partial resolution to these questions: • Two embeddings that use $\tilde{O}(\dim^2(X))$ dimensions • Result I: $(1 \pm \epsilon)$ embedding for a single scale, that is, for interpoint distances close to some $r$. • Result II: $(1 \pm \epsilon)$ global embedding into the snowflake metric, where every interpoint distance $s$ is replaced by $s^{1/2}$

Result I – Embedding for Single Scale • Theorem 1 [GK-09]: • Fix scale $r > 0$ and range $0 < \delta < 1$. • Every finite $X \subset \ell_2$ admits an embedding $f : X \to \ell_2^k$ for $k = \tilde{O}(\log(1/\delta) (\dim X)^2)$, such that 1. Lipschitz: $\|f(x)-f(y)\| \le \|x-y\|$ for all $x, y \in X$ 2. Bi-Lipschitz at scale $r$: $\|f(x)-f(y)\| \ge \Omega(\|x-y\|)$ whenever $\|x-y\| \in [\delta r, r]$ 3. Boundedness: $\|f(x)\| \le r$ for all $x \in X$ • We’ll illustrate the proof for constant range and distortion.
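
The three guarantees of the theorem can be read as a checklist on a candidate map. The sketch below tests them on a finite point set; the function name, the explicit constant `c` standing in for the hidden $\Omega(\cdot)$ constant, and the symbol `delta` for the range are assumptions made for this example.

```python
import math
from itertools import combinations

def norm(v):
    return math.sqrt(sum(a * a for a in v))

def dist(x, y):
    return norm([a - b for a, b in zip(x, y)])

def check_single_scale(points, f, r, delta, c):
    """Report which of the three single-scale guarantees hold for f."""
    tol = 1e-9
    lipschitz = bi_lipschitz = True
    for x, y in combinations(points, 2):
        d_in, d_out = dist(x, y), dist(f(x), f(y))
        if d_out > d_in + tol:                      # 1. never expand
            lipschitz = False
        if delta * r <= d_in <= r and d_out < c * d_in - tol:
            bi_lipschitz = False                    # 2. no collapse at scale r
    bounded = all(norm(f(x)) <= r + tol for x in points)   # 3. ||f(x)|| <= r
    return {"lipschitz": lipschitz, "bi_lipschitz": bi_lipschitz,
            "bounded": bounded}

pts = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0), (3.0, 4.0)]
identity = check_single_scale(pts, lambda p: p, r=5.0, delta=0.5, c=1.0)
# Dropping the second coordinate collapses (0,0) and (0,4), so it fails 2.
projection = check_single_scale(pts, lambda p: (p[0],), r=5.0, delta=0.5, c=1.0)
```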

Result I: The construction • We begin by considering the entire point set. Take for example scale $r = 20$ and range $\delta = 1/2$ • Assume the minimum interpoint distance is 1

Step 1: Net extraction • From the point set, we extract a net • For example, a 4-net • Net properties: • Covering (covering radius: 4) • Packing (packing distance: 4) • A consequence of the packing property is that a ball of radius $s$ contains $O(s^{\dim(X)})$ points
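
A net with both properties can be extracted greedily. The following is a minimal sketch of the standard greedy construction, not necessarily the procedure used in the paper; the grid of points is a hypothetical input.

```python
import math

def dist(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def greedy_net(points, radius):
    """Scan the points once; keep a point only if it is at distance
    >= radius from every net point kept so far."""
    net = []
    for p in points:
        if all(dist(p, q) >= radius for q in net):
            net.append(p)
    return net

# Hypothetical input: a 10 x 10 integer grid, and a 4-net of it.
grid = [(x, y) for x in range(10) for y in range(10)]
net = greedy_net(grid, 4.0)

# Packing: net points are pairwise at distance >= 4.
packing_ok = all(dist(p, q) >= 4.0 for p in net for q in net if p != q)
# Covering: every point has a net point within distance 4.
covering_ok = all(any(dist(p, q) <= 4.0 for q in net) for p in grid)
```

A point is rejected only because some net point is already within the radius, which gives covering; a point is accepted only when it is far from all earlier net points, which gives packing.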

Step 1: Net extraction • We want a good embedding for just the net points • From here on, our embedding will ignore non-net points • Why is this valid?

Step 1: Net extraction • Kirszbraun theorem (Lipschitz extension, 1934): • Given an embedding $f : X \to Y$, with $X \subset S$ ($S$ a Euclidean space) • there exists an extension $f' : S \to Y$ • The restriction of $f'$ to $X$ is equal to $f$ • $f'$ is contractive for $S \setminus X$ • Therefore, a good embedding just for the net points suffices • A smaller net radius gives less distortion for the non-net points

Step 2: Padded decomposition • Decompose the space into probabilistic padded clusters • Cluster properties for a given random partition [GKL-03, ABN-08]: • Diameter: bounded by $20 \dim(X)$ • Size: by the doubling property, bounded by $(20 \dim(X))^{\dim(X)}$ • Padding: a point is 20-padded with probability $1-c$, say 9/10 • Support: $O(\dim(X))$ partitions
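
For intuition, here is a sketch of one simple way to draw a random partition with bounded cluster diameter: ball carving with a random radius and a random order of centers. This is a swapped-in textbook scheme, assumed here for illustration; it is not claimed to be the construction of [GKL-03, ABN-08], and the padding probability stated above is not proved for it. All names and parameters are hypothetical.

```python
import math
import random

def dist(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def random_partition(points, diameter, rng):
    """Each point joins the first center (in a random order) that lies
    within a random radius drawn from [diameter/4, diameter/2]."""
    radius = rng.uniform(diameter / 4, diameter / 2)
    order = list(range(len(points)))
    rng.shuffle(order)
    cluster_of = {}
    for i, p in enumerate(points):
        # Always succeeds: point i itself is a center at distance 0.
        cluster_of[i] = next(c for c in order if dist(p, points[c]) <= radius)
    return cluster_of

def is_padded(i, points, cluster_of, pad):
    """True if every point within distance `pad` of point i is in its cluster."""
    return all(cluster_of[j] == cluster_of[i]
               for j, q in enumerate(points) if dist(points[i], q) <= pad)

pts = [(x, y) for x in range(6) for y in range(6)]
D = 8.0
cluster_of = random_partition(pts, D, random.Random(1))
padded_fraction = sum(is_padded(i, pts, cluster_of, 1.0)
                      for i in range(len(pts))) / len(pts)
```

Every cluster lies inside a ball of radius at most `D / 2` around its center, so its diameter is at most `D`; points near a cluster boundary are the ones that fail the padding test.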

Step 3: JL on individual clusters • For each partition, consider each individual cluster • Reduce dimension using the JL Lemma • Constant distortion • Target dimension: • logarithmic in the cluster size: $O(\log (20 \dim(X))^{\dim(X)}) = \tilde{O}(\dim(X))$ • Then translate some point to the origin
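
The target dimension follows from the cluster size bound in one line. Here $|C|$ denotes the number of points in a cluster (a symbol introduced for this sketch), the distortion is constant, and $\tilde{O}$ hides factors logarithmic in $\dim(X)$.

```latex
\begin{align*}
  k = O(\log |C|)
    &= O\!\left(\log\, (20 \dim(X))^{\dim(X)}\right) \\
    &= O\!\left(\dim(X) \cdot \log(20 \dim(X))\right)
     = \tilde{O}(\dim(X))
\end{align*}
```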

The story so far… • To review • Step 1: Extract net points • Step 2: Build family of partitions • Step 3: For each partition, apply JL to each cluster, and translate a cluster point to the origin • Embedding guarantees for a single partition • Intracluster distance: Constant distortion • Intercluster distance: • Min distance: 0 • Max distance: $20 \dim(X)$ • Not good enough • Let’s backtrack…

The story so far… • To review • Step 1: Extract net points • Step 2: Build family of partitions • Step 3: For each partition, apply the Gaussian transform to each cluster • Step 4: For each partition, apply JL to each cluster, and translate a cluster point to the origin • Embedding guarantees for a single partition • Intracluster distance: Constant distortion • Intercluster distance: • Min distance: 0 • Max distance: $20 \dim(X)$ • Not good enough • Let’s backtrack…

Step 3: Gaussian transform • For each partition, apply the Gaussian transform to distances within each cluster (Schoenberg’s theorem, 1938) • $f(t) = (1 - e^{-t^2})^{1/2}$ • Threshold at $s$: $f_s(t) = s (1 - e^{-t^2/s^2})^{1/2}$ • Properties for $s = 20$: • Threshold: Cluster diameter is at most 20 (instead of $20 \dim(X)$) • Distortion: Small distortion of distances in the relevant range • The transform can increase dimension… but JL is the next step
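
The thresholding behaviour of $f_s$ is easy to see numerically. The sketch below only evaluates the scalar function on distances; it does not construct the embedding whose existence Schoenberg's theorem guarantees. The sample distances are arbitrary choices for illustration.

```python
import math

def gaussian_transform(t, s):
    """f_s(t) = s * sqrt(1 - exp(-t^2 / s^2)) for a distance t >= 0.
    expm1 keeps precision when t is much smaller than s."""
    return s * math.sqrt(-math.expm1(-(t * t) / (s * s)))

s = 20.0
# (original distance, transformed distance) for a few sample distances
table = [(t, gaussian_transform(t, s)) for t in (1.0, 5.0, 10.0, 20.0, 100.0)]
```

Since $1 - e^{-u} \le u$, the transform never expands a distance ($f_s(t) \le t$); since $1 - e^{-u} < 1$, every transformed distance stays below $s$. For $t$ much smaller than $s$ the value is very close to $t$, and for $t$ much larger than $s$ it saturates near $s$.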

Step 4: JL on individual cluster • Steps 3 & 4: Gaussian transform (smaller diameter), then JL (smaller dimension) • New embedding guarantees • Intracluster: Constant distortion • Intercluster: • Min distance: 0 • Max distance: 20 (instead of $20 \dim(X)$) • Caveat: Also smooth the edges

Step 5: Glue partitions • We have an embedding for a single partition • For padded points, the guarantees are perfect • For non-padded points, the guarantees are weak • “Glue” together the embeddings for all $O(\dim(X))$ partitions • Concatenate images (and scale down) • The non-padded case occurs 1/10 of the time, so it gets “averaged away” • Final dimension: • Number of partitions: $O(\dim(X))$ • Dimension of each embedding: $\tilde{O}(\dim(X))$ • Total: $\tilde{O}(\dim^2(X))$ • Example: $f_1(x) = (1,7,2)$, $f_2(x) = (5,2,3)$, $f_3(x) = (4,8,5)$, so $F(x) = f_1(x) \oplus f_2(x) \oplus f_3(x) = (1,7,2,5,2,3,4,8,5)$
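
The gluing step can be sketched using the three example vectors from the slide. The scaling factor $1/\sqrt{m}$ for $m$ partitions is an assumption about what "scale down" means here: it is the natural choice because it keeps the glued map 1-Lipschitz whenever each piece is 1-Lipschitz.

```python
import math

def glue(embeddings):
    """Concatenate the per-partition images of one point and scale by
    1/sqrt(m), where m is the number of partitions (assumed scaling)."""
    m = len(embeddings)
    scale = 1.0 / math.sqrt(m)
    return [scale * c for f in embeddings for c in f]

f1, f2, f3 = (1, 7, 2), (5, 2, 3), (4, 8, 5)
F = glue([f1, f2, f3])          # 9 coordinates, each divided by sqrt(3)
```

If each piece satisfies $\|f_i(x) - f_i(y)\| \le \|x - y\|$, the concatenation has squared distance at most $m \|x - y\|^2$, so dividing by $\sqrt{m}$ restores the Lipschitz bound. A pair that is padded in most partitions keeps most of its squared distance, which is the "averaging away" of the bad partitions.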

Step 6: Kirszbraun extension theorem • Kirszbraun’s theorem extends the embedding to non-net points without increasing the dimension • Figure panels: Embedding; Embedding + Kirszbraun extension

Result I – Review • Steps: • Net extraction • Padded Decomposition • Gaussian Transform • JL • Glue partitions • Extension theorem • Theorem 1 [GK-09]: • Every finite $X \subset \ell_2$ admits an embedding $f : X \to \ell_2^k$ for $k = \tilde{O}((\dim X)^2)$, such that 1. Lipschitz: $\|f(x)-f(y)\| \le \|x-y\|$ for all $x, y \in X$ 2. Bi-Lipschitz at scale $r$: $\|f(x)-f(y)\| \ge \Omega(\|x-y\|)$ whenever $\|x-y\| \in [\delta r, r]$ 3. Boundedness: $\|f(x)\| \le r$ for all $x \in X$

Result I – Extension • Steps: • Net extraction: finer nets • Padded Decomposition: larger padding, probabilistic guarantees • Gaussian Transform • JL: already $(1 \pm \epsilon)$ • Glue partitions: higher percentage of padded points • Extension theorem • Theorem 1 [GK-09]: • Every finite $X \subset \ell_2$ admits an embedding $f : X \to \ell_2^k$ for $k = \tilde{O}((\dim X)^2)$, such that 1. Lipschitz: $\|f(x)-f(y)\| \le \|x-y\|$ for all $x, y \in X$ 2. Gaussian at scale $r$: $\|f(x)-f(y)\| \ge (1 \pm \epsilon) G(\|x-y\|)$ whenever $\|x-y\| \in [\delta r, r]$ 3. Boundedness: $\|f(x)\| \le r$ for all $x \in X$

Result II – Snowflake Embedding • Theorem 2 [GK-09]: • For $0 < \epsilon < 1$, every finite subset $X \subset \ell_2$ admits an embedding $F : X \to \ell_2^k$ for $k = \tilde{O}(\epsilon^{-4} (\dim X)^2)$ with distortion $(1 \pm \epsilon)$ to the snowflake: $s \to s^{1/2}$ • We’ll illustrate the construction for constant distortion. • The constant distortion construction is due to [Ass-83] (for non-Euclidean metrics) • In the paper, we implement the same construction with $(1 \pm \epsilon)$ distortion

Snowflake embedding • Basic idea. • Fix points $x, y \in X$, and suppose $\|x-y\| \sim s$ • Now consider many single scale embeddings • $r = 16s$ • $r = 8s$ • $r = 4s$ • $r = 2s$ • $r = s$ • $r = s/2$ • $r = s/4$ • $r = s/8$ • $r = s/16$ • Each single scale embedding satisfies: Lipschitz: $\|f(x)-f(y)\| \le \|x-y\|$; Gaussian: $\|f(x)-f(y)\| \ge (1 \pm \epsilon) G(\|x-y\|)$; Boundedness: $\|f(x)\| \le r$

Snowflake embedding • Now scale down each embedding by $r^{1/2}$ (snowflake) • $r = 16s$: $s \to s^{1/2}/4$ • $r = 8s$: $s \to s^{1/2}/8^{1/2}$ • $r = 4s$: $s \to s^{1/2}/2$ • $r = 2s$: $s \to s^{1/2}/2^{1/2}$ • $r = s$: $s \to s^{1/2}$ • $r = s/2$: $s/2 \to s^{1/2}/2^{1/2}$ • $r = s/4$: $s/4 \to s^{1/2}/2$ • $r = s/8$: $s/8 \to s^{1/2}/8^{1/2}$ • $r = s/16$: $s/16 \to s^{1/2}/4$

Snowflake embedding • Join levels by concatenation and addition of coordinates • $r = 16s$: $s \to s^{1/2}/4$ • $r = 8s$: $s \to s^{1/2}/8^{1/2}$ • $r = 4s$: $s \to s^{1/2}/2$ • $r = 2s$: $s \to s^{1/2}/2^{1/2}$ • $r = s$: $s \to s^{1/2}$ • $r = s/2$: $s/2 \to s^{1/2}/2^{1/2}$ • $r = s/4$: $s/4 \to s^{1/2}/2$ • $r = s/8$: $s/8 \to s^{1/2}/8^{1/2}$ • $r = s/16$: $s/16 \to s^{1/2}/4$
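
The per-level values in the list follow from two of the single scale guarantees, and summing them shows why the result behaves like $s^{1/2}$. The sketch below is an idealization: it assumes the level-$r$ distance is exactly $\min(s, r)$ (the Lipschitz bound for $r \ge s$, the boundedness cap for $r < s$), and it joins the nine levels by plain concatenation, ignoring the addition of coordinates that the actual construction uses to keep the dimension bounded. Function names are hypothetical.

```python
import math

def level_contribution(s, r):
    """Idealized ||f_r(x) - f_r(y)|| / sqrt(r) when ||x - y|| = s:
    the distance is s at scales r >= s and is capped at r below that."""
    return min(s, r) / math.sqrt(r)

def snowflake_norm(s, exponents=range(-4, 5)):
    """Norm of the concatenation of the levels r = 2^j * s."""
    return math.sqrt(sum(level_contribution(s, s * 2.0 ** j) ** 2
                         for j in exponents))

top = level_contribution(1.0, 16.0)            # level r = 16s with s = 1
ratio = snowflake_norm(4.0) / snowflake_norm(1.0)
```

The squared contributions form two geometric series, $s, s/2, s/4, \dots$ going up in scale and $s/2, s/4, \dots$ going down, so their sum is a constant times $s$ and the norm is a constant times $s^{1/2}$. Quadrupling $s$ therefore doubles the norm, which is the snowflake behaviour.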

Result II – Review • Steps: • Take a collection of single scale embeddings • Scale down the embedding for scale $r$ by $r^{1/2}$ • Join the embeddings by concatenation and addition • By taking more refined scales (jump by $1 \pm \epsilon$ instead of 2), can achieve $(1 \pm \epsilon)$ distortion to the snowflake • Theorem 2 [GK-09]: • For $0 < \epsilon < 1$, every finite subset $X \subset \ell_2$ admits an embedding $F : X \to \ell_2^k$ for $k = \tilde{O}(\epsilon^{-4} (\dim X)^2)$ with distortion $(1 \pm \epsilon)$ to the snowflake: $s \to s^{1/2}$

Conclusion • Gave two $(1 \pm \epsilon)$ distortion low-dimensional embeddings for doubling spaces • Single scale • Snowflake • This framework can be extended to $L_1$ and $L_\infty$ • Dimension reduction: Can’t use JL • Extension: Can’t use Kirszbraun • Threshold: Can’t use the Gaussian • Thank you!
