Learning Near-Isometric Linear Embeddings Chinmay Hegde MIT Aswin Sankaranarayanan CMU Wotao Yin UCLA Richard Baraniuk Rice University
Learning Near-Isometric Linear Embeddings Chinmay Hegde MIT Aswin Sankaranarayanan CMU Wotao Yin UCLA Pope Francis Vatican Institute Tech Richard Baraniuk Rice University
Case in Point: DARPA ARGUS-IS • 1.8 Gigapixel image sensor
Case in Point: DARPA ARGUS-IS • 1.8 Gpixel image sensor • video rate output: 444 Gbits/s • comm data rate: 274 Mbits/s, a factor of 1600x gap, way out of reach of existing compression technology • Reconnaissance without conscience • too much data to transmit to a ground station • too much data to make effective real-time decisions
Case in Point: MR Imaging • Measurements very expensive • $1-3 million per machine • 30 minutes per scan
Intrinsic Dimensionality • Why? Geometry, that’s why • Exploit to perform more efficient analysis and processing of large-scale data Intrinsic dimension << Extrinsic dimension!
Linear Dimensionality Reduction [figure: linear measurement model, measurements y = Φx of a signal x]
Linear Dimensionality Reduction Goal: Create a (linear) mapping Φ from R^N to R^M with M < N that preserves the key geometric properties of the data ex: configuration of the data points
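To make the mapping concrete, here is a minimal sketch (not from the slides; the dimensions and data are illustrative) of a linear map Φ taking a signal x ∈ R^N to M measurements y = Φx:

```python
# Minimal sketch of linear dimensionality reduction: y = Phi @ x.
# N, M, and the random data are illustrative, not from the slides.
import numpy as np

N, M = 256, 40                       # extrinsic and embedded dimensions
rng = np.random.default_rng(0)
x = rng.standard_normal(N)           # a signal in R^N
Phi = rng.standard_normal((M, N))    # any M x N linear map
y = Phi @ x                          # M measurements in R^M
print(y.shape)                       # (40,)
```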
Dimensionality Reduction • Given a training set of signals, find the “best” Φ that preserves its geometry
Dimensionality Reduction • Given a training set of signals, find the “best” Φ that preserves its geometry • Approach 1: Principal Component Analysis (PCA) via SVD of the training signals • find the “average” best-fitting subspace in the least-squares sense • the average error metric can distort point cloud geometry
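A hedged sketch of Approach 1, using synthetic data in place of the training signals: PCA finds the least-squares subspace from the SVD of the centered training matrix.

```python
# Sketch of PCA via SVD; X is a stand-in for the Q x N training matrix.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 256))     # Q = 1000 synthetic signals as rows
Xc = X - X.mean(axis=0)                  # center the point cloud
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
M = 40
Phi_pca = Vt[:M]                         # M x N embedding: top principal directions
Y = Xc @ Phi_pca.T                       # embedded (M-dimensional) training data
```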
Embedding • Given a training set of signals, find the “best” Φ that preserves its geometry • Approach 2: Inspired by the Restricted Isometry Property (RIP) and the Whitney Embedding Theorem
Isometric Embedding • Given a training set of signals, find the “best” Φ that preserves its geometry • Approach 2: Inspired by RIP and Whitney • design Φ to preserve inter-point distances (secants) • more faithful to training data
Near-Isometric Embedding • Given a training set of signals, find the “best” Φ that preserves its geometry • Approach 2: Inspired by RIP and Whitney • design Φ to preserve inter-point distances (secants) • more faithful to training data • but exact isometry can be too much to ask
Why Near-Isometry? • Sensing • guarantees existence of a recovery algorithm • Machine learning applications • kernel matrix depends only on pairwise distances • Approximate nearest neighbors for classification • efficient dimensionality reduction
Existence of Near Isometries • Johnson-Lindenstrauss Lemma • Given a set of Q points, there exists a Lipschitz map that achieves near-isometry (with constant δ) provided M = O(log Q / δ²) • Random matrices with iid sub-Gaussian entries work • compressive sensing, locality sensitive hashing, database monitoring, cryptography • Existence of solution! • but constants are poor • oblivious to data structure [J-L, 84] [Frankl and Maehara, 88] [Indyk and Motwani, 99] [Achlioptas, 01] [Dasgupta and Gupta, 02]
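A small numerical sketch of the J-L statement (sizes are illustrative): an iid Gaussian matrix scaled by 1/√M approximately preserves all pairwise distances, with the worst-case distortion shrinking as M grows.

```python
# Empirical check of random-projection near-isometry (illustrative sizes).
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
Q, N, M = 200, 512, 64
X = rng.standard_normal((Q, N))                    # Q points in R^N
Phi = rng.standard_normal((M, N)) / np.sqrt(M)     # iid (sub-)Gaussian entries
d_orig = pdist(X)                                  # all Q(Q-1)/2 pairwise distances
d_proj = pdist(X @ Phi.T)
delta = np.max(np.abs(d_proj**2 / d_orig**2 - 1))  # worst-case distortion
print(f"empirical isometry constant: {delta:.3f}")
```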
Designed Embeddings • Unfortunately, random projections are data-oblivious (by definition) • Q: Can we beat random projections? • Our quest: A new approach for designing linear embeddings for specific datasets
Designing Embeddings • Normalized secants v_ij = (x_i − x_j) / ‖x_i − x_j‖ [Whitney; Kirby; Wakin, B ’09] • Goal: approximately preserve the length of every secant under Φ • Obviously, projecting in the direction of a secant is a bad idea
Designing Embeddings • Normalized secants • Goal: approximately preserve the length of every secant under Φ • Note: total number of secants is large: S = Q(Q−1)/2 for Q training points
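A sketch of how the normalized secants could be enumerated (a naive O(Q²) loop; the helper name and sizes are illustrative):

```python
# Build the S x N matrix of normalized secants v_ij = (x_i - x_j)/||x_i - x_j||.
import numpy as np
from itertools import combinations

def normalized_secants(X):
    """X: Q x N array of training signals; returns all Q(Q-1)/2 unit secants."""
    secants = []
    for i, j in combinations(range(len(X)), 2):
        d = X[i] - X[j]
        secants.append(d / np.linalg.norm(d))
    return np.array(secants)

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 16))   # small synthetic training set
V = normalized_secants(X)            # shape (4950, 16)
```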
“Good” Linear Embedding Design • Given: normalized secants v_1, …, v_S • Seek: the “shortest” matrix Φ (fewest rows M) such that 1 − δ ≤ ‖Φ v_i‖² ≤ 1 + δ for every secant v_i • Think of δ as the knob that controls the “maximum distortion” that you are willing to tolerate
Lifting Trick • Convert quadratic constraints in Φ into linear constraints in the lifted variable P = ΦᵀΦ, since ‖Φ v‖² = vᵀ P v • Given P, obtain Φ via matrix square root (rank-M factorization)
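A sketch of the lift and the square-root step, assuming the lifted variable is P = ΦᵀΦ: the quadratic quantity ‖Φv‖² equals vᵀPv, which is linear in P, and a rank-M solution factors back into an M x N matrix Φ via its eigendecomposition.

```python
# Lifting trick sketch: P = Phi^T Phi makes ||Phi v||^2 = v^T P v linear in P;
# a symmetric square root recovers Phi from a (low-rank) PSD solution P.
import numpy as np

def factorize(P, tol=1e-8):
    """Recover Phi with P ~= Phi^T Phi; the rank of P sets the number of rows M."""
    w, U = np.linalg.eigh(P)                       # P is symmetric PSD
    keep = w > tol
    return (U[:, keep] * np.sqrt(w[keep])).T       # M x N matrix Phi

# quick check that v^T P v equals ||Phi v||^2
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 16))
P = A.T @ A                                        # a rank-5 PSD matrix
Phi = factorize(P)
v = rng.standard_normal(16)
assert np.isclose(v @ P @ v, np.linalg.norm(Phi @ v) ** 2)
```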
Relaxation • Relax rank minimization problem to nuclear norm minimization problem
NuMax • Nuclear norm minimization with Max-norm constraints (NuMax) • Semi-Definite Program (SDP) • solvable by standard interior point methods • Rank of the solution is determined by δ
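A hedged cvxpy sketch of the NuMax program (the function name and solver choice are illustrative, not the authors' code): since P is constrained to be PSD, its nuclear norm equals its trace, and each max-norm constraint |vᵢᵀPvᵢ − 1| ≤ δ is linear in P.

```python
# Sketch of NuMax as an SDP in the lifted variable P (illustrative helper).
import cvxpy as cp
import numpy as np

def numax_sdp(V, delta):
    """V: S x N array of normalized secants; delta: allowed distortion."""
    N = V.shape[1]
    P = cp.Variable((N, N), PSD=True)
    quad = cp.sum(cp.multiply(V @ P, V), axis=1)   # vector of v_i^T P v_i
    constraints = [quad <= 1 + delta, quad >= 1 - delta]
    # nuclear norm of a PSD matrix = trace; minimizing it promotes low rank
    cp.Problem(cp.Minimize(cp.trace(P)), constraints).solve()
    return P.value
```

The rank of the returned P gives the embedding dimension M, and Φ can be recovered with the square-root factorization sketched earlier.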
Accelerating NuMax • Poor scaling with N and S • least squares involves matrices with S rows • SVD of an N×N matrix • Several avenues to accelerate: • Alternating Direction Method of Multipliers (ADMM) • exploit fact that intermediate estimates of P are low-rank • exploit fact that only a few secants define the optimal embedding (“column generation”)
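A rough sketch of the column-generation idea (reusing numax_sdp from the previous sketch; the batching scheme below is an assumption, not the authors' exact NuMax-CG algorithm): solve the SDP on a small active set of secants, add the most violated constraints, and repeat.

```python
# Column-generation outer loop: only a few secants end up active at the optimum.
import numpy as np

def numax_cg(V, delta, batch=100, tol=1e-6, max_iter=20):
    S = V.shape[0]
    active = np.arange(min(batch, S))                    # initial active set
    for _ in range(max_iter):
        P = numax_sdp(V[active], delta)                  # small SDP
        distortion = np.abs(np.sum((V @ P) * V, axis=1) - 1)
        violated = np.where(distortion > delta + tol)[0]
        if violated.size == 0:                           # all S constraints hold
            break
        worst = violated[np.argsort(distortion[violated])[::-1][:batch]]
        active = np.union1d(active, worst)               # grow the active set
    return P
```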
Accelerated NuMax Can solve for datasets with Q=100k points in N=1000 dimensions in a few hours
App: Linear Compression • Images of translating blurred squares live on a K=2 dimensional smooth “surface” (manifold) in N=256 dimensional space • Project a collection of 1000 such images into M-dimensional space while preserving structure (as measured by distortion constant δ) N=16x16=256
Rows of the “Optimal” Φ [figure: rows of the learned Φ displayed as 16x16 images; measurements y = Φx of a signal x, N=16x16=256]
App: Linear Compression • M=40 linear measurements enough to ensure isometry constant of δ = 0.01
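The reported isometry constant can be checked directly on the secants of the dataset; a small helper (not from the slides):

```python
# Empirical isometry constant of an embedding Phi on unit-norm secants V.
import numpy as np

def isometry_constant(Phi, V):
    lengths_sq = np.sum((V @ Phi.T) ** 2, axis=1)   # ||Phi v_i||^2 for each secant
    return np.max(np.abs(lengths_sq - 1))           # worst-case distortion delta
```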
Secant Distortion • Distribution of secant distortions for the translating squares dataset • Embedding dimension M=30 • Input distortion to NuMax is δ = 0.03 • Unlike PCA and random projections, NuMax provides distortions sharply concentrated at δ.
Secant Distortion • Translating squares dataset • N = 16x16 = 256 • M = 30 • δ = 0.03 • Histograms of normalized secant distortions [figure: histograms for random, PCA, and NuMax embeddings]
App: Image Retrieval Goal: Preserve neighborhood structure of a set of images • N = 512, Q = 4000, M = 45 suffices to preserve 80% of neighborhoods LabelMe Image Dataset
App: Classification • MNIST digits dataset • N = 20x20 = 400-dim images • 10 classes: digits 0-9 • Q = 60000 training images • Nearest neighbor (NN) classifier • Test on 10000 images • Mis-classification rate of NN classifier using original dataset: 3.63%
App: Classification • MNIST dataset • N = 20x20 = 400-dim images • 10 classes: digits 0-9 • Q = 60000 training images, so S = 1.8 billion secants! • NuMax-CG took 3 hours to process • Mis-classification rate of NN classifier: 3.63% • NuMax provides the best NN-classification rates!
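A sketch of the classification pipeline, assuming scikit-learn and an already-learned Φ (names are illustrative): nearest-neighbor classification is run in the compressed M-dimensional domain, where a near-isometric Φ nearly preserves all pairwise distances and hence the NN decisions.

```python
# 1-NN classification in the compressed domain y = Phi x (illustrative helper).
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def nn_error_rate(Phi, X_train, y_train, X_test, y_test):
    clf = KNeighborsClassifier(n_neighbors=1)
    clf.fit(X_train @ Phi.T, y_train)                # train on M-dim projections
    return 1.0 - clf.score(X_test @ Phi.T, y_test)   # mis-classification rate
```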
NuMax and Task Adaptivity • Prune the secants according to the task at hand • If goal is reconstruction / retrieval, then preserve all secants • If goal is signal classification, then preserve inter-class secants differently from intra-class secants • This preferential weighting approach is akin to “boosting”
Optimized Classification Inter-class secants are not shrunk Intra-class secants are not expanded This simple modification improves NN classification rates while using even fewer measurements
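A hedged sketch of this one-sided variant, in the same cvxpy form as the NuMax sketch above (V_inter and V_intra are illustrative names for the two secant sets):

```python
# Task-adaptive NuMax sketch: inter-class secants keep only the lower bound
# (not shrunk); intra-class secants keep only the upper bound (not expanded).
import cvxpy as cp

def numax_classification(V_inter, V_intra, delta):
    N = V_inter.shape[1]
    P = cp.Variable((N, N), PSD=True)
    q_inter = cp.sum(cp.multiply(V_inter @ P, V_inter), axis=1)
    q_intra = cp.sum(cp.multiply(V_intra @ P, V_intra), axis=1)
    constraints = [q_inter >= 1 - delta,   # do not shrink inter-class secants
                   q_intra <= 1 + delta]   # do not expand intra-class secants
    cp.Problem(cp.Minimize(cp.trace(P)), constraints).solve()
    return P.value
```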
Optimized Classification • MNIST dataset • N = 20x20 = 400-dim images • 10 classes: digits 0-9 • Q = 60000 training images, so >1.8 billion secants! • NuMax-CG took 3-4 hours to process • Significant reduction in number of measurements (M) • Significant improvement in classification rate