Learning Near-Isometric Linear Embeddings Chinmay Hegde MIT Aswin Sankaranarayanan CMU Wotao Yin UCLA Edward Snowden Ex-NSA Richard Baraniuk Rice University
NSA PRISM 4972 Gbps Source: Wikipedia.org
NSA PRISM Source: Wikipedia.org
NSA PRISM DIMENSIONALITY REDUCTION Source: Wikipedia.org
Intrinsic Dimensionality • Why? Geometry, that’s why • Exploit it to perform more efficient analysis and processing of large-scale data Intrinsic dimension << Extrinsic dimension!
Dimensionality Reduction Goal: Create a (linear) mapping from R^N to R^M with M < N that preserves the key geometric properties of the data ex: the configuration of the data points
Dimensionality Reduction • Given a training set of signals, find the “best” linear map that preserves its geometry
Dimensionality Reduction • Given a training set of signals, find the “best” linear map that preserves its geometry • Approach 1: PCA via SVD of training signals • find the average best-fitting subspace in the least-squares sense • average error metric can distort point-cloud geometry
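A minimal NumPy sketch of Approach 1, assuming the training signals are the rows of a Q×N array; the dimensions and variable names below are illustrative, not the paper's settings.

```python
import numpy as np

def pca_embedding(X, M):
    """PCA: project Q training signals (rows of X, each in R^N) onto the
    top-M principal directions of the centered data, via an SVD."""
    Xc = X - X.mean(axis=0)                          # center the point cloud
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:M]                                    # M x N projection matrix

# usage: Q=500 signals in N=256 dimensions, reduced to M=40
X = np.random.randn(500, 256)
Phi = pca_embedding(X, M=40)
Y = X @ Phi.T                                        # Q x M embedded signals
```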
Isometric Embedding • Given a training set of signals, find the “best” linear map that preserves its geometry • Approach 2: Inspired by RIP
Isometric Embedding • Given a training set of signals, find the “best” linear map that preserves its geometry • Approach 2: Inspired by RIP • but not the Restricted Itinerary Property [Maduro, Snowden ’13]
Isometric Embedding • Given a training set of signals, find the “best” linear map that preserves its geometry • Approach 2: Inspired by RIP and Whitney • design to preserve inter-point distances (secants) • more faithful to training data
Near-Isometric Embedding • Given a training set of signals, find the “best” linear map that preserves its geometry • Approach 2: Inspired by RIP and Whitney • design to preserve inter-point distances (secants) • more faithful to training data • but exact isometry can be too much to ask
Why Near-Isometry? • Sensing • guarantees existence of a recovery algorithm • Machine learning applications • kernel matrix depends only on pairwise distances • Approximate nearest neighbors for classification • efficient dimensionality reduction
Existence of Near Isometries • Johnson-Lindenstrauss Lemma • Given a set of Q points, there exists a Lipschitz map that achieves near-isometry (with constant δ) provided M = O(log(Q)/δ²) • Random matrices with i.i.d. subGaussian entries work • c.f. so-called “compressive sensing” [J-L, 84] [Frankl and Maehara, 88] [Indyk and Motwani, 99] [Achlioptas, 01] [Dasgupta and Gupta, 02]
L1 Energy http://dealbook.nytimes.com/2013/06/28/oligarchs-assemble-team-for-oil-deals/?_r=0
Existence of Near Isometries • Johnson-Lindenstrauss Lemma • Given a set of Q points, there exists a Lipschitz map that achieves near-isometry (with constant δ) provided M = O(log(Q)/δ²) • Random matrices with i.i.d. subGaussian entries work • c.f. so-called “compressive sensing” • Existence of a solution! • but constants are poor • oblivious to data structure [J-L, 84] [Frankl and Maehara, 88] [Indyk and Motwani, 99] [Achlioptas, 01] [Dasgupta and Gupta, 02]
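As a baseline, the J-L construction can be checked numerically: draw an i.i.d. Gaussian matrix and measure the worst-case distortion over all pairwise distances. A sketch, with illustrative sizes:

```python
import numpy as np
from itertools import combinations

def jl_distortion(X, M, seed=0):
    """Project points with a random Gaussian map and return the worst-case
    relative distortion of squared pairwise distances."""
    rng = np.random.default_rng(seed)
    Q, N = X.shape
    Phi = rng.normal(size=(M, N)) / np.sqrt(M)   # i.i.d. Gaussian, scaled
    Y = X @ Phi.T
    delta = 0.0
    for i, j in combinations(range(Q), 2):
        d_orig = np.sum((X[i] - X[j]) ** 2)
        d_proj = np.sum((Y[i] - Y[j]) ** 2)
        delta = max(delta, abs(d_proj / d_orig - 1.0))
    return delta

X = np.random.randn(200, 400)                    # Q=200 points in R^400
print(jl_distortion(X, M=100))                   # distortion of one random map
```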
Near-Isometric Embedding • Q. Can we beat random projections? • A. … • on the one hand: lower bounds for JL [Alon ’03]
Near-Isometric Embedding • Q. Can we beat random projections? • A. … • on the one hand: lower bounds for JL [Alon ’03] • on the other hand: carefully constructed linear projections can often do better • Our quest: An optimization-based approach for learning “good” linear embeddings
Normalized Secants • Normalized pairwise difference vectors v_ij = (x_i − x_j) / ||x_i − x_j|| [Whitney; Kirby; Wakin, B ’09] • Goal is to approximately preserve the length of every projected secant • Obviously, a projection that sends some secant (nearly) to zero is a bad idea
Normalized Secants • Normalized pairwise difference vectors v_ij = (x_i − x_j) / ||x_i − x_j|| • Goal is to approximately preserve the length of every projected secant • Note: total number of secants is large: Q(Q−1)/2 = O(Q²)
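A sketch of how the normalized secants might be formed from a Q×N training set (the function and variable names are mine, not the paper's):

```python
import numpy as np
from itertools import combinations

def normalized_secants(X):
    """All Q(Q-1)/2 normalized pairwise difference vectors of the rows of X."""
    secants = []
    for i, j in combinations(range(len(X)), 2):
        d = X[i] - X[j]
        secants.append(d / np.linalg.norm(d))
    return np.array(secants)                     # (#secants) x N

V = normalized_secants(np.random.randn(100, 64)) # 4950 unit-norm secants
```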
“Good” Linear Embedding Design • Given: normalized secants v_1, …, v_Q • Seek: the “shortest” matrix (fewest rows M) such that every projected secant has squared length within δ of 1 Erratum alert: we will use Q to denote both the number of data points and the number of secants
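A sketch of the design criterion itself: for a candidate M×N matrix and unit-norm secants, the isometry constant is the largest deviation of a projected secant's squared length from 1 (names are illustrative).

```python
import numpy as np

def isometry_constant(Phi, V):
    """delta = max_k | ||Phi v_k||^2 - 1 | over unit-norm secants v_k (rows of V)."""
    lengths_sq = np.sum((V @ Phi.T) ** 2, axis=1)
    return np.max(np.abs(lengths_sq - 1.0))

# a "good" embedding has few rows M and a small isometry_constant(Phi, V)
```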
Lifting Trick • Convert quadratic constraints in the embedding matrix Φ into linear constraints in the lifted matrix P = ΦᵀΦ • After designing P, obtain Φ via a matrix square root
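Writing P for the lifted N×N PSD variable, a rank-r solution factors as P = ΦᵀΦ with Φ of size r×N; a sketch of that factorization via an eigendecomposition:

```python
import numpy as np

def factor_embedding(P, tol=1e-6):
    """Matrix square root: factor a PSD matrix P ≈ Phi.T @ Phi, where the
    number of rows of Phi equals the numerical rank of P."""
    w, U = np.linalg.eigh(P)                     # eigenvalues ascending
    keep = w > tol * w.max()                     # retain significant eigenvalues
    Phi = (U[:, keep] * np.sqrt(w[keep])).T      # r x N embedding matrix
    return Phi
```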
Relaxation • Convert quadratic constraints in Φ into linear constraints in P = ΦᵀΦ • Relax rank minimization to nuclear-norm minimization
NuMax • Semi-Definite Program (SDP) • Nuclear norm minimization with Max-norm constraints (NuMax) • Solvable by standard interior-point techniques • Rank of solution is determined by δ
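For small N and Q, the SDP can be prototyped with an off-the-shelf modeling tool. A CVXPY sketch, assuming unit-norm secants in the rows of V and a target isometry constant delta (this is the interior-point route, so it does not scale):

```python
import cvxpy as cp
import numpy as np

def numax_sdp(V, delta):
    """Nuclear-norm minimization (= trace, for a PSD matrix) with max-norm constraints."""
    N = V.shape[1]
    P = cp.Variable((N, N), PSD=True)
    constraints = []
    for v in V:
        q = cp.sum(cp.multiply(P, np.outer(v, v)))   # = v.T @ P @ v, affine in P
        constraints += [q <= 1 + delta, q >= 1 - delta]
    prob = cp.Problem(cp.Minimize(cp.trace(P)), constraints)
    prob.solve()
    return P.value
```

Factoring the returned P (as in the matrix square-root sketch above) then yields the embedding matrix.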
Practical Considerations • In practice N is large and Q is very large! • Computational cost per iteration scales poorly in both N and Q
Solving NuMax • Alternating Direction Method of Multipliers (ADMM) • solve for P using spectral thresholding • solve for L using least-squares • solve for q using “clipping” • Computational/memory cost per iteration: dominated by a least-squares solve with Q² rows and an SVD of an N×N matrix
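The P-update in this splitting is a spectral (singular-value) soft-thresholding step; a generic sketch of that proximal operator, with the ADMM bookkeeping around it omitted:

```python
import numpy as np

def spectral_threshold(Z, tau):
    """Proximal operator of tau * (nuclear norm): soft-threshold the singular values."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    s = np.maximum(s - tau, 0.0)                 # shrink each singular value
    return (U * s) @ Vt                          # rebuild the thresholded matrix
```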
Accelerating NuMax • Poor scaling with N and Q • least squares involves matrices with Q² rows • SVD of an N×N matrix • Observation 1 • intermediate estimates of P are low-rank • use low-rank representations to reduce memory and accelerate computations • use incremental SVD for faster computations
Accelerating NuMax • Observation 2 • by the KKT complementary-slackness conditions, only constraints that are satisfied with equality determine the solution (“active constraints”) Analogy: recall support vector machines (SVMs), where we solve a margin-maximization problem. The solution is determined only by the support vectors – the training points whose margin constraints are active (satisfied with equality)
NuMax-CG • Observation 2 • by the KKT complementary-slackness conditions, only constraints that are satisfied with equality determine the solution (“active constraints”) • Hence, given a feasible solution P*, only secants v_k for which |v_kᵀ P* v_k − 1| = δ determine the value of P* • Key: Number of “support secants” << total number of secants • and so we only need to track the support secants • “column generation” approach to solving NuMax
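A sketch of the column-generation outer loop; here `solver` stands for any routine (for instance the SDP or ADMM sketches above) that solves NuMax on a subset of secants, and the initialization and tolerances are illustrative assumptions:

```python
import numpy as np

def active_secants(P, V, delta, tol=1e-4):
    """Indices of secants whose constraints are active or violated at P."""
    dev = np.abs(np.einsum('ki,ij,kj->k', V, P, V) - 1.0)  # |v_k' P v_k - 1|
    return np.where(dev >= delta - tol)[0]

def numax_cg(V, delta, solver, n_init=200, max_rounds=20):
    """Column generation: solve NuMax on a growing working set of 'support secants'."""
    work = set(range(min(n_init, len(V))))       # small initial working set
    for _ in range(max_rounds):
        P = solver(V[sorted(work)], delta)       # solve the restricted problem
        new = set(active_secants(P, V, delta)) - work
        if not new:                              # no remaining violated/active secants
            break
        work |= new                              # add them and re-solve
    return P
```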
Computation Time Can solve for datasets with Q=100k points in N=1000 dimensions in a few hours
Squares – Near Isometry • Images of translating blurred squares live on a K=2 dimensional smooth manifold in N=256 dimensional space • Project a collection of these images into M-dimensional space while preserving structure (as measured by isometry constant δ) N=16x16=256
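A sketch of how such a training set might be generated; the square size, blur width, and sampling are assumptions, not the paper's exact settings:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def translating_squares(n_img=500, side=16, sq=5, sigma=1.0, seed=0):
    """Blurred squares at random translations: a 2-D manifold in R^(side*side)."""
    rng = np.random.default_rng(seed)
    X = np.zeros((n_img, side * side))
    for k in range(n_img):
        img = np.zeros((side, side))
        r, c = rng.integers(0, side - sq, size=2)    # top-left corner: 2 intrinsic dofs
        img[r:r + sq, c:c + sq] = 1.0
        X[k] = gaussian_filter(img, sigma).ravel()
    return X
```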
Squares – Near Isometry • M=40 linear measurements enough to ensure isometry constant of δ = 0.01 N=16x16=256
Squares – CS Recovery • Signal recovery in AWGN N=16x16=256
MNIST (8) – Near Isometry N=20x20=400 M = 14 basis functions achieve δ = 0.05
MNIST (8) – Near Isometry N=20x20=400
MNIST – NN Classification • MNIST dataset • N = 20x20 = 400-dim images • 10 classes: digits 0-9 • Q = 60000 training images • Nearest neighbor (NN) classifier • Test on 10000 images • Misclassification rate of NN classifier: 3.63%
MNIST – Naïve NuMax Classification • MNIST dataset • N = 20x20 = 400-dim images • 10 classes: digits 0-9 • Q = 60000 training images, so >1.8 billion secants! • NuMax-CG took 3-4 hours to process • Misclassification rate of NN classifier: 3.63% • NuMax provides the best NN-classification rates
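A sketch of this evaluation: the misclassification rate of a 1-NN classifier run either on raw pixels (baseline) or on features projected by a learned M×N matrix; scikit-learn handles only the nearest-neighbor search, and loading MNIST is omitted.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def nn_error(Xtrain, ytrain, Xtest, ytest, Phi=None):
    """1-NN misclassification rate, optionally in a projected M-dim space."""
    if Phi is not None:                          # e.g. a NuMax or PCA embedding
        Xtrain, Xtest = Xtrain @ Phi.T, Xtest @ Phi.T
    clf = KNeighborsClassifier(n_neighbors=1).fit(Xtrain, ytrain)
    return 1.0 - clf.score(Xtest, ytest)
```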
Task Adaptivity • Prune the secants according to the task at hand • If goal is signal reconstruction, then preserve all secants • If goal is signal classification, then preserve inter-class secants differently from intra-class secants • Can preferentially weight the training set vectors according to their importance (connections with boosting)
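One way such pruning/weighting could be realized, as a sketch: label each secant as inter-class or intra-class and weight (or drop) it before running the embedding design. The weighting scheme and names here are illustrative, not the paper's.

```python
import numpy as np
from itertools import combinations

def labeled_secants(X, y, inter_weight=1.0, intra_weight=0.0):
    """Normalized secants with task-dependent weights: inter-class pairs can be
    emphasized (e.g. for classification) and intra-class pairs down-weighted."""
    secants, weights = [], []
    for i, j in combinations(range(len(X)), 2):
        d = X[i] - X[j]
        secants.append(d / np.linalg.norm(d))
        weights.append(inter_weight if y[i] != y[j] else intra_weight)
    return np.array(secants), np.array(weights)

# e.g. drop intra-class secants entirely for a classification-oriented design:
# V, w = labeled_secants(X, y, inter_weight=1.0, intra_weight=0.0)
# V = V[w > 0]
```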