SDM06 TUTORIAL: Randomized Algorithms for Matrices and Massive Data Sets
Michael W. Mahoney, Yahoo! Research
Petros Drineas, CS - RPI
Tutorial given at the SIAM Data Mining Meeting, April 22, 2006. (Most recent copy) available at: http://www.cs.yale.edu/homes/mmahoney and http://www.cs.rpi.edu/~drinep
Randomized Linear Algebra Algorithms • Goal: To develop and analyze (fast) Monte Carlo algorithms for performing useful computations on large (and later not so large!) matrices and tensors. • Matrix Multiplication • Computation of the Singular Value Decomposition • Computation of the CUR Decomposition • Testing Feasibility of Linear Programs • Least Squares Approximation • Tensor computations: SVD generalizations • Tensor computations: CUR generalization • Such computations generally require time which is superlinear in the number of nonzero elements of the matrix/tensor, e.g., O(n^3) for n x n matrices.
Example: the CUR decomposition. [Figure: A is approximated by C·U·R, where C consists of a few (O(1)) columns of A, R consists of a few (O(1)) rows of A, and U is a carefully chosen small matrix.] Algorithmic Motivation: To speed up computations in applications where extremely large data sets are modeled by matrices and, e.g., O(n^2) space and O(n^3) time is not an option. Structural Motivation: To reveal novel structural properties of the datasets, given sufficient computational time, that are useful in applications. Goal: make ||A-CUR|| small. Why? (Algorithmic) After making two passes over A, we can compute provably good C, U, and R and store them (a "sketch") instead of A: O(m+n) vs. O(n^2) space. Why? (Structural) Given sufficient time, we can find C, U, and R such that A - CUR is "very" small. This might lead to better understanding of the data. Why? Given a sample consisting of a few columns (C) and a few rows (R) of A, we can compute U and "reconstruct" A as CUR. If the sampling probabilities are not "too bad", we get provably good accuracy.
Applications of such algorithms • Matrices arise, e.g., since m objects (documents, genomes, images, web pages), each with n features, may be represented by an m x n matrix A. • Covariance Matrices • Latent Semantic Indexing • DNA Microarray Data • Eigenfaces and Image Recognition • Similarity Queries • Matrix Reconstruction • LOTS of other data applications!! • More generally, • Linear and Nonlinear Programming Applications • Design of Approximation Algorithms • Statistical Learning Theory Applications
Overview (1/2) • Data Streaming Models and Random Sampling • Matrix Multiplication • Singular Value Decomposition • CUR Matrix Decomposition • Applications of Matrix CUR • Data mining • DNA microarray (and DNA SNP) data • Recommendation Systems • Kernel-CUR and the Nystrom Method
Overview (2/2) • Tensor-based data sets • Tensor-CUR • Hyperspectral data • Recommendation systems • From Very-Large to Medium-Sized Data • Relative-error CX and CUR Matrix Decompositions • L2 Regression Problems • Application to DNA SNP Data • Conclusions and Open Problems
The Pass Efficient Model • Motivation: Amount of disk/tape space has increased enormously; RAM and computing speeds have increased less rapidly. • Can store large amounts of data, but • Cannot process these data with traditional algorithms. • In the Pass-Efficient Model: • Data are assumed to be stored on disk/tape. • Algorithm has access to the data via a pass over the data. • Algorithm is allowed additional RAM space and additional computation time. • An algorithm is pass-efficient if it uses a small constant number of passes and sublinear additional time and space to compute a description of the solution. • Note: If data are an m x n matrix A, then algorithms which require additional time and space that is O(m+n) or O(1) are pass-efficient.
Random Sampling • Randomized Algorithms for Linear Algebra problems: A "sketch" consisting of a small number of judiciously chosen and randomly sampled rows and columns (or elements) is sufficient for provably rapid and efficient approximation of many matrix operations. • Random Sampling and Randomized Algorithms: • Better complexity properties (randomization as a resource). • Simpler algorithms and/or analysis (maybe de-randomize later). • Uniform Sampling: • Typically things work in expectation, but poor variance properties. • Non-uniform Sampling: • With "good" probabilities, can make the variance small.
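Aside: the sampling step itself can be made pass-efficient. A minimal sketch (not from the slides; the function name is illustrative) of one-pass weighted selection, which picks an index with probability proportional to a weight that is seen only once, using O(1) additional state:

```python
import random

def select_one_pass(weights):
    """Select index i with probability w_i / sum(w), reading the weights once.

    Keeps O(1) additional space: a running total and the current candidate.
    """
    total = 0.0
    chosen = None
    for i, w in enumerate(weights):
        total += w
        # Replace the current candidate with probability w / total.
        if w > 0 and random.random() < w / total:
            chosen = i
    return chosen

# Example: sampling an index proportional to squared column norms,
# as used by the matrix algorithms that follow.
weights = [4.0, 1.0, 9.0, 16.0]
counts = [0] * len(weights)
for _ in range(10000):
    counts[select_one_pass(weights)] += 1
print(counts)  # roughly proportional to the weights
```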
Overview (1/2) • Data Streaming Models and Random Sampling • Matrix Multiplication • Singular Value Decomposition • CUR Matrix Decomposition • Applications of Matrix CUR • Data mining • DNA microarray (and DNA SNP) data • Recommendation Systems • Kernel-CUR and the Nystrom Method
Approximating Matrix Multiplication … (D. & Kannan FOCS ’01, and D., Kannan, & M. TR ’04, SICOMP ’06) • Problem Statement: Given an m-by-n matrix A and an n-by-p matrix B, approximate the product A·B, OR, equivalently, approximate the sum of n rank-one matrices: A·B = sum_{i=1..n} A^(i) B_(i), where A^(i) is the i-th column of A and B_(i) is the i-th row of B. • Each term in the summation is a rank-one matrix.
…by random sampling • Algorithm: • Fix a set of probabilities p_i, i=1,…,n, summing up to 1. • For t=1 up to s, set j_t = i with probability Pr(j_t = i) = p_i. (Pick s terms of the sum, with replacement, with respect to the p_i.) • Approximate AB by the sum of the s sampled terms, after scaling: AB ≈ sum_{t=1..s} (1/(s·p_{j_t})) A^(j_t) B_(j_t).
Random sampling (cont’d): keep only the sampled terms j_1, j_2, …, j_s of the sum, each corresponding to a column of A and the matching row of B.
The algorithm (matrix notation) • Create C and R by performing s i.i.d. trials, with replacement. • For t=1 up to s, pick a column A^(jt) and a row B_(jt) with probability p_j = |A^(j)| |B_(j)| / sum_{j'} |A^(j')| |B_(j')| (column/row Euclidean norms). • Include A^(jt)/(s·p_jt)^{1/2} as a column of C, and B_(jt)/(s·p_jt)^{1/2} as a row of R.
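A minimal numpy sketch of this sampling scheme (illustrative code, not from the tutorial; it uses the norm-based probabilities above and assumes dense in-memory matrices):

```python
import numpy as np

def approx_matmul(A, B, s, rng=np.random.default_rng(0)):
    """Approximate A @ B by sampling s column/row pairs with norm-based probabilities."""
    n = A.shape[1]
    # p_j proportional to |A^(j)|_2 * |B_(j)|_2 (the variance-minimizing choice).
    weights = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=1)
    p = weights / weights.sum()
    idx = rng.choice(n, size=s, replace=True, p=p)
    scale = 1.0 / np.sqrt(s * p[idx])
    C = A[:, idx] * scale            # m x s: rescaled sampled columns of A
    R = B[idx, :] * scale[:, None]   # s x p: rescaled sampled rows of B
    return C @ R

A = np.random.randn(200, 500)
B = np.random.randn(500, 100)
approx = approx_matmul(A, B, s=100)
exact = A @ B
print(np.linalg.norm(exact - approx, 'fro') / np.linalg.norm(exact, 'fro'))
```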
Simple Lemmas • The input matrices are given in "sparse unordered representation"; i.e., their non-zero entries are presented as triples (i, j, A_ij) in any order. • The expectation of CR (element-wise) is AB. • Our nonuniform sampling minimizes the variance of the estimator. • It is easy to implement the sampling in two passes. • If the matrices are dense, the algorithm runs in O(smp) time, instead of O(nmp) time. • It requires O(sm+sp) RAM space. • It does not tamper with the sparsity of the matrices.
Error Bounds • For the above algorithm, E[ ||AB - CR||_F ] ≤ (1/√s) ||A||_F ||B||_F. • For the above algorithm, with probability at least 1-δ, ||AB - CR||_F ≤ O(log(1/δ)) (1/√s) ||A||_F ||B||_F. • This is a relative error bound if ||AB||_F = Ω(||A||_F ||B||_F), i.e., if there is "not much cancellation" in the multiplication. • We removed the expectation (by applying a martingale argument) and so have an extra log(1/δ) factor. • Markov’s inequality would also remove the expectation, introducing an extra 1/δ factor.
Special case: B = A^T • If B = A^T, then the sampling probabilities are p_i = |A^(i)|^2 / ||A||_F^2. • Also, R = C^T, and the error bounds become ||AA^T - CC^T||_F ≤ (1/√s) ||A||_F^2 (in expectation, and similarly with high probability up to the log(1/δ) factor).
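A quick empirical check of this special case (illustrative code): sample s columns with p_i = |A^(i)|^2/||A||_F^2, form C, and compare ||AA^T - CC^T||_F against the bound's leading term ||A||_F^2/√s.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((100, 2000))
s = 400

# Norm-based sampling probabilities for the columns of A.
p = np.linalg.norm(A, axis=0) ** 2 / np.linalg.norm(A, 'fro') ** 2
idx = rng.choice(A.shape[1], size=s, replace=True, p=p)
C = A[:, idx] / np.sqrt(s * p[idx])

err = np.linalg.norm(A @ A.T - C @ C.T, 'fro')
bound = np.linalg.norm(A, 'fro') ** 2 / np.sqrt(s)
print(err, bound)   # observed error vs. the leading term of the bound
```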
Special case: B = A^T (cont’d) (Rudelson & Vershynin ’04, Vershynin ’04) Improvement of the spectral norm bound for the special case B = A^T. • Uses a result of M. Rudelson for random vectors in isotropic position. • Tight concentration results can be proven using Talagrand’s theory. • The sampling procedure is slightly different; s columns/rows are kept in expectation, i.e., column i is picked (independently) with probability p_i = min{1, s |A^(i)|^2 / ||A||_F^2}.
Overview (1/2) • Data Streaming Models and Random Sampling • Matrix Multiplication • Singular Value Decomposition • CUR Matrix Decomposition • Applications of Matrix CUR • Data mining • DNA microarray (and DNA SNP) data • Recommendation Systems • Kernel-CUR and the Nystrom Method
Singular Value Decomposition (SVD) • Any m-by-n matrix A can be written as A = U S V^T. • U (V): orthogonal matrix containing the left (right) singular vectors of A. • S: diagonal matrix containing the singular values of A. • Exact computation of the SVD takes O(min{mn^2, m^2n}) time. • The top few singular vectors/values can be approximated faster (Lanczos/Arnoldi methods).
Rank k approximations (A_k) • A_k = U_k S_k V_k^T, where: • U_k (V_k): orthogonal matrix containing the top k left (right) singular vectors of A. • S_k: diagonal matrix containing the top k singular values of A. • Also, A_k = U_k U_k^T A. • A_k is a matrix of rank k such that ||A - A_k||_{2,F} is minimized over all rank k matrices! • This property is very useful in the context of Principal Component Analysis.
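For concreteness, a small numpy illustration of A_k and of the identity A_k = U_k U_k^T A (illustrative code):

```python
import numpy as np

A = np.random.default_rng(2).standard_normal((50, 30))
k = 5

U, s, Vt = np.linalg.svd(A, full_matrices=False)
Ak = (U[:, :k] * s[:k]) @ Vt[:k, :]     # best rank-k approximation U_k S_k V_k^T
Ak_proj = U[:, :k] @ (U[:, :k].T @ A)   # same matrix, written as U_k U_k^T A
print(np.allclose(Ak, Ak_proj))         # True
```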
Approximating SVD in O(n) time • (D., Frieze, Kannan, Vempala & Vinay SODA ’99, JML ’04; D., Kannan, & M. TR ’04, SICOMP ’06) • Given: m x n matrix A • Sample c columns from A and rescale to form the m x c matrix C. • Compute the m x k matrix H_k of the top k left singular vectors of C. • Structural Theorem: For any probabilities and number of columns: ||A - H_k H_k^T A||_{2,F}^2 ≤ ||A - A_k||_{2,F}^2 + 2√k ||AA^T - CC^T||_F. • Algorithmic Theorem: If p_i = |A^(i)|^2/||A||_F^2 and c = O(k/ε^2), then: ||A - H_k H_k^T A||_{2,F}^2 ≤ ||A - A_k||_{2,F}^2 + ε ||A||_F^2. • Proof: via the matrix multiplication theorem and matrix perturbation theory.
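A minimal sketch of this column-sampling ("LinearTimeSVD"-style) approximation (illustrative code, assuming the probabilities stated above and a dense in-memory matrix):

```python
import numpy as np

def approx_left_singvecs(A, c, k, rng=np.random.default_rng(3)):
    """Sample c rescaled columns of A; return the top k left singular vectors of C."""
    p = np.linalg.norm(A, axis=0) ** 2 / np.linalg.norm(A, 'fro') ** 2
    idx = rng.choice(A.shape[1], size=c, replace=True, p=p)
    C = A[:, idx] / np.sqrt(c * p[idx])
    H, _, _ = np.linalg.svd(C, full_matrices=False)
    return H[:, :k]

A = np.random.default_rng(4).standard_normal((300, 1000))
Hk = approx_left_singvecs(A, c=200, k=10)
err = np.linalg.norm(A - Hk @ (Hk.T @ A), 'fro') ** 2
_, s, _ = np.linalg.svd(A, full_matrices=False)
print(err, np.sum(s[10:] ** 2))   # compare to ||A - A_k||_F^2
```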
Example of randomized SVD • [Figure: the original 512-by-512 baboon test image as the matrix A, and the matrix C obtained after sampling columns of A.] • Compute the top k left singular vectors of the matrix C and store them in the 512-by-k matrix H_k.
Example of randomized SVD (cont’d) • [Figure: the original image A next to its approximation H_k H_k^T A.] • A and H_k H_k^T A are close.
Element-wise sampling (Achlioptas & McSherry, STOC ’01, JACM ’05) • The Algorithm in 2 lines: • To approximate a matrix A, keep a few elements of the matrix (instead of rows or columns) and zero out the remaining elements. • Compute a rank k approximation to this sparse matrix (using Lanczos methods). • More details: Let p_ij ∈ [0,1] for all i,j. Create the matrix S from A such that: S_ij = A_ij/p_ij with probability p_ij, and S_ij = 0 otherwise. • ||A-S||_2 is bounded! Hence (i) the singular values of A and S are close, and (ii, under additional assumptions) the top k left (right) singular vectors of S span a subspace that is close to the subspace spanned by the top k left (right) singular vectors of A.
Element-wise sampling (cont’d) • Approximating singular values fast: • Zero out (a large number of) elements of A, scale the remaining ones appropriately. • Compute the singular values of the resulting sparse matrix using iterative techniques. • (Good choice for p_ij: p_ij = s·A_ij^2 / sum_{i,j} A_ij^2, where s denotes the expected number of elements that we seek to keep in S.) • Note: Each element is kept or discarded independently of the others. • Similar ideas have been used to: • explain the success of Latent Semantic Indexing (LSI): (Papadimitriou, Raghavan, Tamaki, Vempala, PODS ’98 & Azar, Fiat, Karlin, McSherry, Saia STOC ’01) • design recommendation systems: (Azar, Fiat, Karlin, McSherry, Saia STOC ’01) • speedup kernel computations: (Achlioptas, McSherry, and Schölkopf, NIPS ’02)
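A minimal sketch of this element-wise sparsification (illustrative code; a dense mask is used for clarity, although the point of the method is that S is sparse):

```python
import numpy as np

def sparsify(A, s, rng=np.random.default_rng(5)):
    """Keep roughly s entries of A: keep A_ij / p_ij with probability p_ij, else zero."""
    p = s * A ** 2 / np.sum(A ** 2)
    p = np.minimum(p, 1.0)            # probabilities must not exceed 1
    mask = rng.random(A.shape) < p
    S = np.zeros_like(A)
    S[mask] = A[mask] / p[mask]
    return S

A = np.random.default_rng(6).standard_normal((200, 200))
S = sparsify(A, s=8000)
print(np.count_nonzero(S), np.linalg.norm(A - S, 2))  # kept entries, spectral error
```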
Element-wise vs. row/column sampling • Quite different techniques! • Row/column sampling preserves subspace/structural properties of the matrices. • Element-wise sampling explains how adding noise and/or quantizing the elements of a matrix perturbs its singular values/vectors. • The two techniques should be complementary! • Some similarities and differences: • Similar error bounds. • Element-wise sampling is doable in one pass! • Running time of element-wise sampling depends on the speed of, e.g., Arnoldi methods. • Element-wise methods do not seem amenable to many of the extensions we will present.
Overview (1/2) • Data Streaming Models and Random Sampling • Matrix Multiplication • Singular Value Decomposition • CUR Matrix Decomposition • Applications of Matrix CUR • Data mining • DNA microarray (and DNA SNP) data • Recommendation Systems • Kernel-CUR and the Nystrom Method
A novel CUR matrix decomposition (D. & Kannan, SODA ’03, D., Kannan, & M. TR ’04, SICOMP ’06) • Create an approximation to the original matrix of the following form: A ≈ C·U·R, where C consists of a few (O(1)) columns of A, R consists of a few (O(1)) rows of A, and U is a carefully chosen small matrix. • An application: Given a query vector x, instead of computing A·x, compute CUR·x to identify its nearest neighbors.
The CUR decomposition • Given a large m-by-n matrix A (stored on disk), compute a decomposition CUR of A such that: • C consists of c = O(k/ε^2) columns of A. • R consists of r = O(k/ε^2) rows of A. • C (R) is created using importance sampling, e.g., columns (rows) are picked in i.i.d. trials with respect to probabilities proportional to their squared Euclidean norms, p_i = |A^(i)|^2 / ||A||_F^2. • C, U, R can be stored in O(m+n) space, after making two passes through the entire matrix A, using O(m+n) additional space and time. • The product CUR satisfies (with high probability): ||A - CUR||_F ≤ ||A - A_k||_F + ε ||A||_F.
Computing U • Intuition (which can be formalized - see later): • The CUR algorithm expresses every row of the matrix A as a linear combination of a small subset of the rows of A. • This small subset consists of the rows in R. • Given a row of A – say A_(i) – the algorithm computes a good fit for the row A_(i) using the rows in R as the basis, by approximately solving min_u ||A_(i) - u·R||_2. • But, only c = O(1) elements of the i-th row are given as input. • So, we only approximate the optimal vector u instead of computing it exactly. • Actually, the pass-efficient CUR decomposition approximates the approximation.
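A minimal sketch of a sampling-based CUR along these lines (illustrative code; for simplicity U is fit by a least-squares/pseudoinverse solve that uses all of A, rather than the pass-efficient construction of the papers):

```python
import numpy as np

def cur(A, c, r, rng=np.random.default_rng(7)):
    """Sample c columns and r rows with norm-based probabilities; fit U by least squares.

    Note: computing U as pinv(C) @ A @ pinv(R) touches all of A, unlike the
    two-pass construction in the papers; it is the simplest choice for illustration.
    """
    pc = np.linalg.norm(A, axis=0) ** 2 / np.linalg.norm(A, 'fro') ** 2
    pr = np.linalg.norm(A, axis=1) ** 2 / np.linalg.norm(A, 'fro') ** 2
    cols = rng.choice(A.shape[1], size=c, replace=True, p=pc)
    rows = rng.choice(A.shape[0], size=r, replace=True, p=pr)
    C = A[:, cols] / np.sqrt(c * pc[cols])
    R = A[rows, :] / np.sqrt(r * pr[rows])[:, None]
    U = np.linalg.pinv(C) @ A @ np.linalg.pinv(R)
    return C, U, R

# Example on a synthetic low-rank matrix (illustrative data).
rng = np.random.default_rng(8)
A = rng.standard_normal((500, 20)) @ rng.standard_normal((20, 400))
C, U, R = cur(A, c=60, r=60)
print(np.linalg.norm(A - C @ U @ R, 'fro') / np.linalg.norm(A, 'fro'))
```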
Error bounds for CUR • Assume A_k is the "best" rank k approximation to A (through SVD). • Then, if we pick O(k/ε^2) rows and O(k/ε^2) columns, ||A - CUR||_F ≤ ||A - A_k||_F + ε ||A||_F. • If we pick O(1/ε^2) rows and O(1/ε^2) columns, ||A - CUR||_2 ≤ ||A - A_k||_2 + ε ||A||_F.
Previous CUR-type decompositions (For details see Drineas & Mahoney, “A Randomized Algorithm for a Tensor-Based Generalization of the SVD”, ‘05.)
Lower Bounds • Question: How many queries does a sampling algorithm need to approximate a given function accurately with high probability? • (Ziv Bar-Yossef ’03, ’04) Lower bounds for the low rank matrix approximation problem and the matrix reconstruction problem. • Any sampling algorithm that w.h.p. finds a good low rank approximation requires Ω(m+n) queries. • Even if the algorithm is given the exact weight distribution over the columns of a matrix, it will still require Ω(k/ε^4) queries. • Finding a matrix D such that ||A-D||_F ≤ ε||A||_F requires Ω(mn) queries, and finding a D such that ||A-D||_2 ≤ ε||A||_2 requires Ω(m+n) queries. • Applied to our results: • The LinearTimeSVD algorithm is optimal w.r.t. ||·||_F bounds. • The ConstantTimeSVD algorithm is optimal w.r.t. ||·||_2 bounds up to poly factors. • The CUR algorithm is optimal for constant ε.
Overview (1/2) • Data Streaming Models and Random Sampling • Matrix Multiplication • Singular Value Decomposition • CUR Matrix Decomposition • Applications of Matrix CUR • Data mining • DNA microarray (and DNA SNP) data • Recommendation Systems • Kernel-CUR and the Nystrom Method
CUR application: Data Mining • Database: An m-by-n matrix A, e.g., m (> 10^6) objects and n (> 10^5) features. • Queries: Given a new object x, find similar objects (nearest neighbors) in A. • Closeness: Two normalized objects x and d are "close" if x^T·d = cos(x,d) is high. • Given a query vector x, the matrix product A·x computes all the angles/distances. • Key observation: The exact value x^T·d might not be necessary. • The feature values in the vectors are set by coarse heuristics. • It is in general enough to see if x^T·d > Threshold. • Algorithm: Given a query vector x, compute CUR·x to identify nearest neighbors. • Theorem: We have a worst-case bound on the error of using CUR instead of A: ||A·x - CUR·x||_2 ≤ ||A - CUR||_2 ||x||_2.
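As a usage sketch (illustrative; it assumes C, U, R were built as in the earlier cur() sketch and that the rows of A are normalized), the query step replaces A·x by C·(U·(R·x)) and never touches A:

```python
import numpy as np

def nearest_neighbors(C, U, R, x, threshold):
    """Return indices of objects whose approximate cosine score with x exceeds threshold.

    C, U, R: a CUR sketch of the normalized object-by-feature matrix A,
    e.g. as produced by the cur() sketch above; x: normalized query vector.
    """
    scores = C @ (U @ (R @ x))   # approximates A @ x using only the sketch
    return np.nonzero(scores > threshold)[0]
```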
CUR application: Genetic Microarray Data • Exploit structural properties of CUR in biological applications. [Figure: a genes-by-experimental-conditions microarray data matrix.] • Find a "good" set of genes and arrays to include in C and R? Provable and/or heuristic strategies are acceptable. • Common in Biological/Chemical/Medical applications of PCA: • Explain the singular vectors, by mapping them to meaningful biological processes. • This is a "challenging" task (think: reification)! • CUR is a low-rank decomposition in terms of the data that practitioners understand. • Use it to explain the data and do dimensionality reduction, classification, clustering. • Gene microarray data: M., D., & Alter (UT Austin) (sporulation and cell cycle data).
CUR application: Recommendation Systems (D., Raghavan, & Kerenidis, STOC ’02) [Figure: the customers-by-products utility matrix approximated by CUR, built from a customer sample ("guinea pigs") and a sample of purchases/small surveys.] • The problem: m customers and n products; A_ij is the (unknown) utility of product j for customer i. • The goal: recreate A from a few samples to recommend high utility products. • (KRRT98): Assuming strong clustering of the products, competitive algorithms even with only 2 samples/customer. • (AFKMS01): Assuming sampling of Ω(mn) entries of A and a gap requirement, accurately recreate A. • Question: Can we get competitive performance by sampling o(mn) elements? • Answer: Apply the CUR decomposition.
Kernel-CUR Motivation • Kernel-based learning methods to extract non-linear structure: • Choose features to define a (dot product) space F. • Map the data, X, to F by Φ: X -> F. • Do classification, regression, and clustering in F with linear methods (SVMs, GPs, SVD). • If the Gram matrix G, where G_ij = k_ij = (Φ(X^(i)), Φ(X^(j))), is dense but has low numerical rank, then calculations of interest need O(n^2) space and O(n^3) time: • matrix inversion in GP prediction, • quadratic programming problems in SVMs, • computation of the eigendecomposition of G. • Relevant recent work using low-rank methods: • (Williams and Seeger, NIPS ’01, etc.): Nystrom method for out-of-sample extensions. • (Achlioptas, McSherry, and Schölkopf, NIPS ’02): randomized kernels.
Kernel-CUR Decomposition (D. & M., COLT ’05, TR ’05, JMLR ‘05) • Input: n x n SPSD matrix G, probabilities {p_i, i=1,…,n}, c <= n, and k <= c. • Algorithm: • Let C be the n x c matrix containing c randomly sampled columns of G. • Let W be the c x c matrix containing the intersection of C and C^T. • Approximate G by C W_k^+ C^T. • Theorem: Let p_i = G_ii^2 / sum_i G_ii^2. If c = O(k log(1/δ)/ε^4), then w.p. at least 1-δ, ||G - C W_k^+ C^T||_F ≤ ||G - G_k||_F + ε sum_i G_ii^2. If c = O(log(1/δ)/ε^4), then with probability at least 1-δ, ||G - C W_k^+ C^T||_2 ≤ ||G - G_k||_2 + ε sum_i G_ii^2.
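A minimal sketch of this construction (illustrative code; W_k^+ denotes the pseudoinverse of the best rank-k approximation of W, and the Gaussian-kernel Gram matrix is just example data):

```python
import numpy as np

def kernel_cur(G, c, k, rng=np.random.default_rng(12)):
    """Approximate an SPSD Gram matrix G by C @ Wk_pinv @ C.T (a Nystrom-style sketch)."""
    p = np.diag(G) ** 2 / np.sum(np.diag(G) ** 2)
    idx = rng.choice(G.shape[0], size=c, replace=True, p=p)
    scale = 1.0 / np.sqrt(c * p[idx])
    C = G[:, idx] * scale                              # n x c rescaled sampled columns
    W = G[np.ix_(idx, idx)] * np.outer(scale, scale)   # c x c intersection block
    # Pseudoinverse of the best rank-k approximation of W (W is SPSD).
    vals, vecs = np.linalg.eigh(W)
    top = np.argsort(vals)[::-1][:k]
    inv = np.where(vals[top] > 1e-12, 1.0 / vals[top], 0.0)
    Wk_pinv = (vecs[:, top] * inv) @ vecs[:, top].T
    return C @ Wk_pinv @ C.T

# Example on a Gaussian-kernel Gram matrix (illustrative data).
rng = np.random.default_rng(13)
X = rng.standard_normal((500, 5))
sq = np.sum(X ** 2, axis=1)
G = np.exp(-(sq[:, None] + sq[None, :] - 2 * X @ X.T))
Gk = kernel_cur(G, c=80, k=20)
print(np.linalg.norm(G - Gk, 'fro') / np.linalg.norm(G, 'fro'))
```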
Overview (2/2) • Tensor-based data sets • Tensor-CUR • Hyperspectral data • Recommendation systems • From Very-Large to Medium-Sized Data • Relative-error CX and CUR Matrix Decompositions • L2 Regression Problems • Application to DNA SNP Data • Conclusions and Open Problems
Datasets modeled as tensors • Tensors: (naively, a dataset subscripted by multiple indices) appear both in Math and CS. • Represent high dimensional functions. • Connections to complexity theory (e.g., matrix multiplication complexity). • Statistical applications (e.g., ICA, HOS, etc.). • Large data-set applications (e.g., Medical Imaging & Hyperspectral Imaging) • Problem: There does not exist a definition of tensor rank (and associated tensor SVD) with the – nice – properties found in the matrix case. • (Lek-Heng Lim ’05: strong impossibility results!) • Common heuristic: "unfold" the tensor along a mode and apply Linear Algebra. • We will do this, but note that this kills the essential tensor structure.
Datasets modeled as tensors (cont’d) [Figure: an m x n x p tensor A with its three modes labeled.] Goal: Extract structure from a tensor dataset A (naively, a dataset subscripted by multiple indices) using a small number of samples. • Tensor rank (minimum number of rank-one tensors) is NP-hard to compute. • Tensor α-rank ("unfold" along the α-th mode and take the matrix SVD) is a commonly-used heuristic. • Randomized-Tensor-CUR: unfold along a "distinguished" mode and reconstruct. • Randomized-Tensor-SVD: unfold along every mode and choose columns. (Drineas & Mahoney, "A Randomized Algorithm for a Tensor-Based Generalization of the SVD," TR05.)
The TensorCUR algorithm (3-modes) • Choose the preferred mode (e.g., time). • Pick a few representative “slabs” (let R denote the tensor of the sampled slabs). • Use only information in a small number of representative “fibers” (let C denote the tensor of sampled fibers and U a low-dimensional encoding tensor). • Express the remaining slabs as linear combinations of the basis of sampled slabs.
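A minimal sketch of the 3-mode version (illustrative code; it samples slabs and fibers uniformly for simplicity, whereas TensorCUR uses data-dependent probabilities):

```python
import numpy as np

def tensor_cur(A, n_slabs, n_fibers, rng=np.random.default_rng(14)):
    """Approximate an m x n x p tensor from a few mode-3 slabs and mode-3 fibers.

    Uniform sampling is used here for simplicity; TensorCUR samples slabs and
    fibers with data-dependent probabilities.
    """
    m, n, p = A.shape
    M = A.reshape(m * n, p).T                 # unfold: p slabs, each flattened to length m*n
    slabs = rng.choice(p, size=n_slabs, replace=False)
    fibers = rng.choice(m * n, size=n_fibers, replace=False)
    R = M[slabs, :]                           # sampled slabs: the "basis"
    # Fit every slab as a linear combination of the sampled slabs,
    # using only the entries lying on the sampled fibers.
    coeffs, *_ = np.linalg.lstsq(R[:, fibers].T, M[:, fibers].T, rcond=None)
    return (coeffs.T @ R).T.reshape(m, n, p)

# Example on a synthetic low-rank tensor (illustrative data).
rng = np.random.default_rng(15)
A = np.einsum('ir,jr,kr->ijk', rng.standard_normal((60, 4)),
              rng.standard_normal((70, 4)), rng.standard_normal((50, 4)))
Ahat = tensor_cur(A, n_slabs=8, n_fibers=400)
print(np.linalg.norm(A - Ahat) / np.linalg.norm(A))
```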
Tensor-CUR application: Hyperspectral Image Analysis (with M. Maggioni and R. Coifman at Yale) [Figure: a hyperspectral data cube of ca. 500 x ca. 500 pixels at 128 frequencies, viewed as an m x n x p tensor A or as an mn x p matrix A.] Goal: Extract structure from temporally-resolved images or spectrally-resolved images of medical interest using a small number of samples (images and/or pixels). Note: A temporally or spectrally resolved image may be viewed as a tensor (naively, a dataset subscripted by multiple indices) or as a matrix (whose columns have internal structure that is not modeled). Note: The chosen images are a dictionary from the data to express every image. Note: The chosen pixels are a dictionary from the data to express every pixel.
Sampling hyperspectral data • Sample slabs depending on total absorption (for example, absorption at two pixel types, shown in the figure). • Sample fibers uniformly (since intensity depends on stain).
The (65-th) slab approximately reconstructed This slab was reconstructed by approximate least-squares fit to the basis from slabs 41 and 50, using 1000 (of 250K) pixels/fibers.