This tutorial introduces fast Monte Carlo algorithms for performing computations on large matrices and tensors, including matrix multiplication, singular value decomposition, CUR decomposition, and more. The goal is to develop algorithms that can handle extremely large data sets efficiently.
VLDB06 TUTORIAL: Randomized Algorithms for Matrices and Massive Data Sets. Michael W. Mahoney (Yahoo! Research) and Petros Drineas (CS, RPI). Tutorial given at VLDB 2006 in Seoul, Korea. http://www.cs.yale.edu/homes/mmahoney http://www.cs.rpi.edu/~drinep
Randomized Linear Algebra Algorithms • Goal: To develop and analyze (fast) Monte Carlo algorithms for performing useful computations on large (and later not so large!) matrices and tensors. • Matrix Multiplication • Computation of the Singular Value Decomposition • Computation of the CUR Decomposition • Testing Feasibility of Linear Programs • Least Squares Approximation • Tensor computations: SVD generalizations • Tensor computations: CUR generalization • Such computations generally require time which is superlinear in the number of nonzero elements of the matrix/tensor, e.g., O(n^3) for n x n matrices.
Example: the CUR decomposition. [Figure: A ≈ C · U · R, where C consists of O(1) columns of A, R consists of O(1) rows of A, and U is a carefully chosen small matrix.] Algorithmic Motivation: To speed up computations in applications where extremely large data sets are modeled by matrices and, e.g., O(n^2) space and O(n^3) time is not an option. Structural Motivation: To reveal novel structural properties of the datasets, given sufficient computational time, that are useful in applications. Goal: make ||A-CUR|| small. How? Given a sample consisting of a few columns (C) and a few rows (R) of A, we can compute U and “reconstruct” A as CUR. If the sampling probabilities are not “too bad”, we get provably good accuracy. Why? (Algorithmic) After making two passes over A, we can compute provably good C, U, and R and store them (a “sketch”) instead of A: O(m+n) vs. O(n^2) space. Why? (Structural) Given sufficient time, we can find C, U and R such that A – CUR is “very” small. This might lead to better understanding of the data.
Applications of such algorithms • Matrices arise, e.g., since m objects (documents, genomes, images, web pages), each with n features, may be represented by an m x n matrix A. • Covariance Matrices • Latent Semantic Indexing • DNA Microarray Data • Eigenfaces and Image Recognition • Similarity Queries • Matrix Reconstruction • LOTS of other data applications!! • More generally, • Linear and Nonlinear Programming Applications • Design of Approximation Algorithms • Statistical Learning Theory Applications
Overview (1/2) • Data Streaming Models and Random Sampling • Matrix Multiplication • Singular Value Decomposition • CUR Matrix Decomposition • Applications of Matrix CUR • Data mining • DNA microarray (and DNA SNP) data • Recommendation Systems • Kernel-CUR and the Nystrom Method
Overview (2/2) • Tensor-based data sets • Tensor-CUR • Hyperspectral data • Recommendation systems • From Very-Large to Medium-Sized Data • Relative-error CX and CUR Matrix Decompositions • L2 Regression Problems • Application to DNA SNP Data • Conclusions and Open Problems
The Pass-Efficient Model • Motivation: Amount of disk/tape space has increased enormously; RAM and computing speeds have increased less rapidly. • Can store large amounts of data, but • Cannot process these data with traditional algorithms. • In the Pass-Efficient Model: • Data are assumed to be stored on disk/tape. • Algorithm has access to the data via a pass over the data. • Algorithm is allowed additional RAM space and additional computation time. • An algorithm is pass-efficient if it uses a small constant number of passes and sublinear additional time and space to compute a description of the solution. • Note: If data are an m x n matrix A, then algorithms which require additional time and space that is O(m+n) or O(1) are pass-efficient.
Random Sampling. Randomized Algorithms for Linear Algebra problems: A “sketch” consisting of a small number of judiciously chosen and randomly sampled rows and columns (or elements) is sufficient for provably rapid and efficient approximation of many matrix operations. • Random Sampling and Randomized Algorithms: • Better complexity properties (randomization as a resource). • Simpler algorithms and/or analysis (maybe de-randomize later). • Uniform Sampling: • Typically things work in expectation, but poor variance properties. • Non-uniform Sampling: • With “good” probabilities, can make the variance small.
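As a concrete illustration of norm-based sampling in the pass-efficient setting, here is a minimal numpy sketch of selecting a single row with probability proportional to its squared norm in one pass over the rows (in the spirit of the tutorial's pass-efficient sampling primitives; the function name and the stream simulation are illustrative assumptions, not the papers' exact routine):

```python
import numpy as np

def select_row_one_pass(row_stream, rng=np.random.default_rng(0)):
    """Select one row with probability proportional to its squared Euclidean norm,
    using a single pass over the rows and O(1) extra storage."""
    total = 0.0          # running sum of squared norms seen so far
    chosen = None        # current candidate row
    for row in row_stream:
        w = float(np.dot(row, row))          # squared norm of this row
        total += w
        # Replace the candidate with probability w / total; a short induction shows
        # the final candidate is row i with probability |A_(i)|^2 / ||A||_F^2.
        if total > 0 and rng.random() < w / total:
            chosen = row.copy()
    return chosen

# Toy usage: stream the rows of a small matrix A.
A = np.array([[1.0, 0.0], [0.0, 3.0], [2.0, 2.0]])
print(select_row_one_pass(iter(A)))
```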
Overview (1/2) • Data Streaming Models and Random Sampling • Matrix Multiplication • Singular Value Decomposition • CUR Matrix Decomposition • Applications of Matrix CUR • Data mining • DNA microarray (and DNA SNP) data • Recommendation Systems • Kernel-CUR and the Nystrom Method
Approximating Matrix Multiplication … (D. & Kannan FOCS ’01, and D., Kannan, & M. TR ’04, SICOMP ’06) Problem Statement: Given an m-by-n matrix A and an n-by-p matrix B, approximate the product A·B, OR, equivalently, approximate the sum of n rank-one matrices: A·B = Σ_{i=1..n} A^(i) · B_(i), where A^(i) is the i-th column of A and B_(i) is the i-th row of B. Each term in the summation is a rank-one matrix.
…by random sampling • Algorithm • Fix a set of probabilities p_i, i=1,…,n, summing up to 1. • For t=1 up to s, set j_t = i, where Pr(j_t = i) = p_i. • (Pick s terms of the sum, with replacement, with respect to the p_i.) • Approximate AB by the sum of the s terms, after scaling.
Random sampling (cont’d): keeping the terms j_1, j_2, …, j_s (i.e., the corresponding columns A^(j_t) of A and rows B_(j_t) of B).
The algorithm (matrix notation). Create C and R by performing s i.i.d. trials, with replacement. • For t=1 up to s, pick a column A^(j_t) and a row B_(j_t) with probability Pr(j_t = i) = |A^(i)| |B_(i)| / Σ_j |A^(j)| |B_(j)|. • Include A^(j_t)/(s p_{j_t})^{1/2} as a column of C, and B_(j_t)/(s p_{j_t})^{1/2} as a row of R.
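The following is a minimal in-memory numpy sketch of this sampling scheme (the function name is illustrative; a true pass-efficient implementation would compute the probabilities and draw the sample in streaming passes rather than holding A and B in RAM):

```python
import numpy as np

def approx_matmul(A, B, s, rng=np.random.default_rng(0)):
    """Approximate A @ B by sampling s rank-one terms (columns of A / rows of B)
    with probabilities proportional to |A^(i)| * |B_(i)|, then rescaling."""
    n = A.shape[1]
    col_norms = np.linalg.norm(A, axis=0)      # |A^(i)|
    row_norms = np.linalg.norm(B, axis=1)      # |B_(i)|
    p = col_norms * row_norms
    p = p / p.sum()                            # the sampling probabilities above
    idx = rng.choice(n, size=s, replace=True, p=p)
    scale = 1.0 / np.sqrt(s * p[idx])
    C = A[:, idx] * scale                      # m x s sampled, rescaled columns
    R = B[idx, :] * scale[:, None]             # s x p sampled, rescaled rows
    return C @ R                               # E[C @ R] = A @ B (element-wise)

A = np.random.randn(100, 50)
B = np.random.randn(50, 80)
err = np.linalg.norm(A @ B - approx_matmul(A, B, s=30), 'fro')
print(err / (np.linalg.norm(A, 'fro') * np.linalg.norm(B, 'fro')))
```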
Simple Lemmas • The input matrices are given in “sparse unordered representation”; e.g., their non-zero entries are presented as triples (i, j, A_ij) in any order. • The expectation of CR (element-wise) is AB. • Our nonuniform sampling minimizes the variance of the estimator. • It is easy to implement the sampling in two passes. • If the matrices are dense, the algorithm runs in O(smp) time, instead of O(nmp) time. • It requires O(sm+sp) RAM space. • It does not tamper with the sparsity of the matrices.
Error Bounds. For the above algorithm, E[ ||AB - CR||_F ] ≤ ||A||_F ||B||_F / √s. For the above algorithm, with probability at least 1-δ, ||AB - CR||_F ≤ O(√(log(1/δ))) ||A||_F ||B||_F / √s. • This is a relative error bound if ||AB||_F = Ω(||A||_F ||B||_F), i.e., if there is “not much cancellation” in the multiplication. • We removed the expectation (by applying a martingale argument) and so have an extra log(1/δ) factor. • Markov’s inequality would also remove the expectation, introducing an extra 1/δ factor instead.
Special case: B = A^T. If B = A^T, then the sampling probabilities are p_i = |A^(i)|^2 / ||A||_F^2. Also, R = C^T, and the error bounds become E[ ||AA^T - CC^T||_F ] ≤ ||A||_F^2 / √s (with an extra √(log(1/δ)) factor for the bound that holds with probability at least 1-δ).
Special case: B = A^T (cont’d) (Rudelson & Vershynin ’04, Vershynin ’04) Improvement of the spectral norm bound for the special case B = A^T. • Uses a result of M. Rudelson for random vectors in isotropic position. • Tight concentration results can be proven using Talagrand’s theory. • The sampling procedure is slightly different; s columns/rows are kept in expectation, i.e., column i is picked with probability p_i = min{1, s |A^(i)|^2 / ||A||_F^2}.
Overview (1/2) • Data Streaming Models and Random Sampling • Matrix Multiplication • Singular Value Decomposition • CUR Matrix Decomposition • Applications of Matrix CUR • Data mining • DNA microarray (and DNA SNP) data • Recommendation Systems • Kernel-CUR and the Nystrom Method
Singular Value Decomposition (SVD). Write A = U S V^T, where U (V) is an orthogonal matrix containing the left (right) singular vectors of A, and S is a diagonal matrix containing the singular values of A. • Exact computation of the SVD takes O(min{mn^2, m^2n}) time. • The top few singular vectors/values can be approximated faster (Lanczos/Arnoldi methods).
Rank k approximations (A_k). Write A_k = U_k S_k V_k^T, where U_k (V_k) is an orthogonal matrix containing the top k left (right) singular vectors of A, and S_k is a diagonal matrix containing the top k singular values of A. Also, A_k = U_k U_k^T A. A_k is the matrix of rank k such that ||A-A_k||_{2,F} is minimized over all rank k matrices! This property is very useful in the context of Principal Component Analysis.
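A minimal numpy sketch of the exact rank-k approximation A_k via the (full) SVD, for later comparison with the sampling-based approximations (the helper name is illustrative):

```python
import numpy as np

def best_rank_k(A, k):
    """Exact best rank-k approximation A_k = U_k S_k V_k^T via the SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

A = np.random.randn(60, 40)
A5 = best_rank_k(A, 5)
print(np.linalg.matrix_rank(A5), np.linalg.norm(A - A5, 'fro'))
```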
Approximating SVD in O(n) time (D., Frieze, Kannan, Vempala & Vinay SODA ’99, JML ’04, D., Kannan, & M. TR ’04, SICOMP ’06) • Given: m x n matrix A • Sample c columns from A and rescale to form the m x c matrix C. • Compute the m x k matrix H_k of the top k left singular vectors of C. Structural Theorem: For any probabilities and number of columns: ||A - H_k H_k^T A||_{2,F}^2 ≤ ||A - A_k||_{2,F}^2 + 2√k ||AA^T - CC^T||_F. Algorithmic Theorem: If p_i = |A^(i)|^2 / ||A||_F^2 and c = O(k/ε^2), then: ||A - H_k H_k^T A||_{2,F}^2 ≤ ||A - A_k||_{2,F}^2 + ε ||A||_F^2. Proof: via the matrix multiplication theorem and matrix perturbation theory.
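A small numpy sketch of this column-sampling approach (the function name and the synthetic test matrix are illustrative assumptions; the point is the construction of C and H_k, not a faithful reproduction of the papers' pass-efficient implementation):

```python
import numpy as np

def approx_left_singvecs(A, c, k, rng=np.random.default_rng(1)):
    """Sample c rescaled columns of A into C, then return the top-k left singular
    vectors of C (the m x k matrix H_k) as a stand-in for those of A."""
    p = np.linalg.norm(A, axis=0) ** 2
    p = p / p.sum()                                  # p_i = |A^(i)|^2 / ||A||_F^2
    idx = rng.choice(A.shape[1], size=c, replace=True, p=p)
    C = A[:, idx] / np.sqrt(c * p[idx])              # m x c rescaled column sample
    Hk, _, _ = np.linalg.svd(C, full_matrices=False)
    return Hk[:, :k]

# Compare ||A - H_k H_k^T A||_F^2 with the optimal ||A - A_k||_F^2.
A = np.random.randn(300, 60) @ np.random.randn(60, 400)   # numerically low rank
Hk = approx_left_singvecs(A, c=100, k=10)
_, s, _ = np.linalg.svd(A, full_matrices=False)
opt = np.sum(s[10:] ** 2)                                  # ||A - A_k||_F^2
print(np.linalg.norm(A - Hk @ (Hk.T @ A), 'fro') ** 2, opt)
```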
Example of randomized SVD. [Figure: the original matrix A (a 512 x 512 baboon test image) and the matrix C obtained after sampling columns.] Compute the top k left singular vectors of the matrix C and store them in the 512-by-k matrix H_k.
Example of randomized SVD (cont’d). [Figure: the original matrix A next to the reconstruction H_k H_k^T A.] A and H_k H_k^T A are close.
Element-wise sampling (Achlioptas & McSherry, STOC ’01, JACM ’05) • The Algorithm in 2 lines: • To approximate a matrix A, keep a few elements of the matrix (instead of rows or columns) and zero out the remaining elements. • Compute a rank k approximation to this sparse matrix (using Lanczos methods). More details: Let p_ij ∈ [0,1] for all i,j. Create the matrix S from A such that S_ij = A_ij / p_ij with probability p_ij, and S_ij = 0 otherwise. ||A-S||_2 is bounded ⇒ (i) the singular values of A and S are close, and (ii, under additional assumptions) the top k left (right) singular vectors of S span a subspace that is close to the subspace spanned by the top k left (right) singular vectors of A.
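A minimal numpy sketch of element-wise sparsification. For simplicity it uses a single uniform keep-probability p rather than entry-dependent p_ij, and a dense SVD of the surrogate matrix rather than Lanczos; both simplifications are assumptions of this sketch:

```python
import numpy as np

def elementwise_sparsify(A, p, rng=np.random.default_rng(0)):
    """Keep each entry A_ij independently with probability p, rescaled by 1/p,
    and zero out the rest; element-wise, E[S] = A."""
    mask = rng.random(A.shape) < p
    return np.where(mask, A / p, 0.0)

A = np.random.randn(200, 200)
S = elementwise_sparsify(A, p=0.1)
# Rank-k approximation of the sparse surrogate S (dense SVD here for brevity;
# in practice one would run Lanczos-type methods on the sparse S).
U, s, Vt = np.linalg.svd(S, full_matrices=False)
k = 10
Sk = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(np.linalg.norm(A - Sk, 2))
```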
Overview (1/2) • Data Streaming Models and Random Sampling • Matrix Multiplication • Singular Value Decomposition • CUR Matrix Decomposition • Applications of Matrix CUR • Data mining • DNA microarray (and DNA SNP) data • Recommendation Systems • Kernel-CUR and the Nystrom Method
CX and CUR matrix decompositions. [Figure: A ≈ C · U · R, with O(1) columns in C, O(1) rows in R, and a carefully chosen U.] Recall: Matrices are about their rows and columns. Recall: Low-rank matrices have redundancy in their rows and columns. Def: A CX matrix decomposition is a low-rank approximation explicitly expressed in terms of a small number of columns of the original matrix A (e.g., A ≈ CC^+A). Def: A CUR matrix decomposition is a low-rank approximation explicitly expressed in terms of a small number of columns and rows of the original matrix A.
A novel CUR matrix decomposition (D. & Kannan, SODA ’03, D., Kannan, & M. TR ’04, SICOMP ’06) Create an approximation to the original matrix of the following form: A ≈ C · U · R, where C consists of O(1) columns of A, R consists of O(1) rows of A, and U is a carefully chosen small matrix. An application: Given a query vector x, instead of computing A · x, compute CUR · x to identify its nearest neighbors.
The CUR decomposition. Given a large m-by-n matrix A (stored on disk), compute a decomposition CUR of A such that: • C consists of c = O(k/ε^2) columns of A. • R consists of r = O(k/ε^2) rows of A. • C (R) is created using importance sampling, e.g., columns (rows) are picked in i.i.d. trials with respect to probabilities p_i = |A^(i)|^2 / ||A||_F^2 (respectively |A_(i)|^2 / ||A||_F^2). • C, U, R can be stored in O(m+n) space, after making two passes through the entire matrix A, using O(m+n) additional space and time. • The product CUR satisfies (with high probability): ||A - CUR||_F ≤ ||A - A_k||_F + ε ||A||_F.
Computing U • Intuition (which can be formalized – see later): • The CUR algorithm expresses every row of the matrix A as a linear combination of a small subset of the rows of A. • This small subset consists of the rows in R. • Given a row of A – say A_(i) – the algorithm computes a good fit for the row A_(i) using the rows in R as the basis, by approximately solving min_u ||A_(i) - uR||_2. But only c = O(1) elements of the i-th row are given as input. So, we only approximate the optimal vector u instead of computing it exactly. Actually, the pass-efficient CUR decomposition approximates the approximation.
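A compact numpy sketch of the overall CUR construction. Note the caveat: for simplicity this sketch computes an idealized linking matrix U = C^+ A R^+ via full least-squares fits using all of A, whereas the pass-efficient algorithm described above builds U only from the sampled rows, columns, and their intersection; the function name and the toy test matrix are illustrative assumptions:

```python
import numpy as np

def cur_decomposition(A, c, r, rng=np.random.default_rng(0)):
    """Sample c columns into C and r rows into R (squared-norm probabilities,
    rescaled), and compute a small matrix U so that A ≈ C @ U @ R."""
    pc = np.linalg.norm(A, axis=0) ** 2; pc /= pc.sum()
    pr = np.linalg.norm(A, axis=1) ** 2; pr /= pr.sum()
    jc = rng.choice(A.shape[1], size=c, replace=True, p=pc)
    jr = rng.choice(A.shape[0], size=r, replace=True, p=pr)
    C = A[:, jc] / np.sqrt(c * pc[jc])
    R = A[jr, :] / np.sqrt(r * pr[jr])[:, None]
    U = np.linalg.pinv(C) @ A @ np.linalg.pinv(R)   # idealized least-squares link
    return C, U, R

A = np.random.randn(100, 10) @ np.random.randn(10, 120)   # exactly rank 10
C, U, R = cur_decomposition(A, c=40, r=40)
print(np.linalg.norm(A - C @ U @ R, 'fro') / np.linalg.norm(A, 'fro'))
```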
Error bounds for CUR. Assume A_k is the “best” rank k approximation to A (through the SVD). Then, if we pick O(k/ε^2) rows and O(k/ε^2) columns, ||A - CUR||_F ≤ ||A - A_k||_F + ε ||A||_F. If we pick O(1/ε^2) rows and O(1/ε^2) columns, ||A - CUR||_2 ≤ ||A - A_k||_2 + ε ||A||_F.
Previous CUR-type decompositions (For details see Drineas & Mahoney, “A Randomized Algorithm for a Tensor-Based Generalization of the SVD”, ‘05.)
Overview (1/2) • Data Streaming Models and Random Sampling • Matrix Multiplication • Singular Value Decomposition • CUR Matrix Decomposition • Applications of Matrix CUR • Data mining • DNA microarray (and DNA SNP) data • Recommendation Systems • Kernel-CUR and the Nystrom Method
CUR application: Data Mining • Database: An m-by-n matrix A, e.g., m (>10^6) objects and n (>10^5) features. • Queries: Given a new object x, find similar objects (nearest neighbors) in A. • Closeness: Two normalized objects x and d are “close” if x^T·d = cos(x,d) is high. • Given a query vector x, the matrix product A·x computes all the angles/distances. • Key observation: The exact value x^T·d might not be necessary. • The feature values in the vectors are set by coarse heuristics. • It is in general enough to see if x^T·d > Threshold. • Algorithm: Given a query vector x, compute CUR · x to identify nearest neighbors. • Theorem: We have a bound on the worst-case error of using CUR instead of A: for every query x, ||Ax - CURx||_2 ≤ ||A - CUR||_2 ||x||_2.
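A shape-only numpy illustration of why answering queries with a stored sketch is cheap: once C, U, R are precomputed (see the CUR sketch above; here random placeholders of the right shapes stand in for a real sketch), a query costs roughly O((m+n)·c) work instead of O(mn) for A·x:

```python
import numpy as np

# Random placeholders standing in for a precomputed CUR sketch of an
# m x n object-feature matrix; shapes are what matters here.
m, n, c, r = 100_000, 5_000, 40, 40
C, U, R = np.random.randn(m, c), np.random.randn(c, r), np.random.randn(r, n)

x = np.random.randn(n); x /= np.linalg.norm(x)     # normalized query vector
scores = C @ (U @ (R @ x))                         # ~O((m + n) * c) per query
top5 = np.argsort(scores)[-5:]                     # candidate nearest neighbors
print(top5)
```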
CUR application: Genetic Microarray Data. Exploit structural properties of CUR in biological applications. [Figure: a genes-by-experimental-conditions data matrix.] Find a “good” set of genes and arrays to include in C and R? Provable and/or heuristic strategies are acceptable. • Common in Biological/Chemical/Medical applications of PCA: • Explain the singular vectors by mapping them to meaningful biological processes. • This is a “challenging” task (think: reification)! • CUR is a low-rank decomposition in terms of the data that practitioners understand. • Use it to explain the data and do dimensionality reduction, classification, clustering. • Gene microarray data: M., D., & Alter (UT Austin) (sporulation and cell cycle data).
CUR application: Recommendation Systems (D., Raghavan, & Kerenidis, STOC ’02) [Figure: the customers-by-products matrix, with a customer sample (guinea pigs) and a customer sample (purchases, small surveys).] • The problem: m customers and n products; A_ij is the (unknown) utility of product j for customer i. • The goal: recreate A from a few samples to recommend high-utility products. • (KRRT98): Assuming strong clustering of the products, competitive algorithms even with only 2 samples/customer. • (AFKMS01): Assuming sampling of Ω(mn) entries of A and a gap requirement, accurately recreate A. • Question: Can we get competitive performance by sampling o(mn) elements? • Answer: Apply the CUR decomposition.
Kernel-CUR Motivation • Kernel-based learning methods to extract non-linear structure: • Choose features to define a (dot product) space F. • Map the data, X, to F by Φ: X → F. • Do classification, regression, and clustering in F with linear methods (SVMs, GPs, SVD). • If the Gram matrix G, where G_ij = k_ij = (Φ(X^(i)), Φ(X^(j))), is dense but has low numerical rank, then calculations of interest need O(n^2) space and O(n^3) time: • matrix inversion in GP prediction, • quadratic programming problems in SVMs, • computation of the eigendecomposition of G. • Relevant recent work using low-rank methods: • (Williams and Seeger, NIPS ’01, etc.): Nystrom method for out-of-sample extensions. • (Achlioptas, McSherry, and Schölkopf, NIPS ’02): randomized kernels.
Kernel-CUR Decomposition (D. & M., COLT ’05, TR ’05, JMLR ‘05) • Input: n x n SPSD matrix G, probabilities {p_i, i=1,…,n}, c ≤ n, and k ≤ c. • Algorithm: • Let C be the n x c matrix containing c randomly sampled columns of G. • Let W be the c x c matrix containing the intersection of C and C^T. • Theorem: Let p_i = G_ii^2 / Σ_i G_ii^2. If c = O(k log(1/δ)/ε^4), then with probability at least 1-δ, ||G - C W_k^+ C^T||_F ≤ ||G - G_k||_F + ε Σ_i G_ii^2 (where W_k is the best rank-k approximation to W). If c = O(log(1/δ)/ε^4), then with probability at least 1-δ, ||G - C W_k^+ C^T||_2 ≤ ||G - G_k||_2 + ε Σ_i G_ii^2.
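A small numpy sketch of this Nystrom-style construction G ≈ C W_k^+ C^T. The toy RBF Gram matrix and the function name are assumptions of the sketch, not part of the tutorial:

```python
import numpy as np

def nystrom_approx(G, c, k, rng=np.random.default_rng(0)):
    """Approximate an SPSD Gram matrix G by C @ pinv(W_k) @ C.T, where C holds c
    sampled, rescaled columns of G and W is the c x c intersection block."""
    n = G.shape[0]
    p = np.diag(G) ** 2
    p = p / p.sum()                                   # p_i proportional to G_ii^2
    idx = rng.choice(n, size=c, replace=True, p=p)
    scale = 1.0 / np.sqrt(c * p[idx])
    C = G[:, idx] * scale                             # n x c
    W = (G[np.ix_(idx, idx)] * scale) * scale[:, None]  # c x c intersection
    U, s, Vt = np.linalg.svd(W)                       # rank-k pseudoinverse of W
    s_k = s[:k]
    s_inv = np.zeros_like(s_k)
    nz = s_k > 1e-12
    s_inv[nz] = 1.0 / s_k[nz]
    Wk_pinv = Vt[:k, :].T @ np.diag(s_inv) @ U[:, :k].T
    return C @ Wk_pinv @ C.T

# Toy SPSD Gram matrix: RBF kernel on random points.
X = np.random.randn(500, 5)
sq = np.sum(X**2, axis=1)
G = np.exp(-(sq[:, None] + sq[None, :] - 2 * X @ X.T))
Gk = nystrom_approx(G, c=100, k=20)
print(np.linalg.norm(G - Gk, 'fro') / np.linalg.norm(G, 'fro'))
```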
Overview (2/2) • Tensor-based data sets • Tensor-CUR • Hyperspectral data • Recommendation systems • From Very-Large to Medium-Sized Data • Relative-error CX and CUR Matrix Decompositions • L2 Regression Problems • Application to DNA SNP Data • Conclusions and Open Problems
Datasets modeled as tensors • Tensors (naively, a dataset subscripted by multiple indices) appear both in Math and CS. • Represent high-dimensional functions. • Connections to complexity theory (e.g., matrix multiplication complexity). • Statistical applications (e.g., ICA, HOS, etc.). • Large data-set applications (e.g., Medical Imaging & Hyperspectral Imaging). • Problem: There does not exist a definition of tensor rank (and an associated tensor SVD) with the – nice – properties found in the matrix case. • (Lek-Heng Lim ’05: strong impossibility results!) • Common heuristic: “unfold” the tensor along a mode and apply Linear Algebra. • We will do this, but note that this kills the essential tensor structure.
Datasets modeled as tensors (cont’d). [Figure: an m x n x p tensor A with modes 1, 2, and 3.] Goal: Extract structure from a tensor dataset A (naively, a dataset subscripted by multiple indices) using a small number of samples. • Tensor rank (the minimum number of rank-one tensors) is NP-hard to compute. • Tensor α-rank (“unfold” along the α-th mode and take the matrix SVD) is a commonly-used heuristic. Randomized-Tensor-CUR: unfold along a “distinguished” mode and reconstruct. Randomized-Tensor-SVD: unfold along every mode and choose columns. (Drineas & Mahoney, “A Randomized Algorithm for a Tensor-Based Generalization of the SVD,” TR05.)
The TensorCUR algorithm (3-modes) • Choose the preferred mode (e.g., time). • Pick a few representative “slabs” (let R denote the tensor of the sampled slabs). • Use only information in a small number of representative “fibers” (let C denote the tensor of sampled fibers and U a low-dimensional encoding tensor). • Express the remaining slabs as linear combinations of the basis of sampled slabs.
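A minimal numpy sketch of this slab-and-fiber idea for a 3-mode tensor. It is a simplified stand-in for the TensorCUR algorithm: slabs are sampled by squared norm, fibers uniformly (for simplicity), and the linking step is an ordinary least-squares fit on the sampled fiber entries; the function name and the synthetic low-rank test tensor are assumptions:

```python
import numpy as np

def tensor_cur_3mode(A, n_slabs, n_fibers, rng=np.random.default_rng(2)):
    """Sample a few mode-3 slabs A[:, :, t] and a few mode-3 fibers, then express
    every slab as a least-squares combination of the sampled slabs, fitted using
    only the sampled fiber entries."""
    m, n, p = A.shape
    M = A.reshape(m * n, p)                          # unfold: one column per slab
    w = np.sum(M ** 2, axis=0)                       # squared norms of the slabs
    slab_idx = rng.choice(p, size=n_slabs, replace=False, p=w / w.sum())
    fiber_idx = rng.choice(m * n, size=n_fibers, replace=False)
    R = M[:, slab_idx]                               # sampled slabs: the "basis"
    # Fit all p slabs to the basis using only the sampled fiber entries.
    coeffs, *_ = np.linalg.lstsq(R[fiber_idx, :], M[fiber_idx, :], rcond=None)
    return (R @ coeffs).reshape(m, n, p)

# Toy low-rank tensor: 3 rank-one components varying over the third mode.
U, V, W = np.random.randn(40, 3), np.random.randn(50, 3), np.random.randn(20, 3)
A = np.einsum('ik,jk,tk->ijt', U, V, W)
A_hat = tensor_cur_3mode(A, n_slabs=5, n_fibers=200)
print(np.linalg.norm(A - A_hat) / np.linalg.norm(A))
```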
Tensor-CUR application: Hyperspectral Image Analysis (with M. Maggioni and R. Coifman at Yale) [Figure: a hyperspectral data cube of ca. 500 x ca. 500 pixels at 128 frequencies, viewed as an m x n x p tensor A or as an mn x p matrix A.] Goal: Extract structure from temporally-resolved images or spectrally-resolved images of medical interest using a small number of samples (images and/or pixels). Note: A temporally or spectrally resolved image may be viewed as a tensor (naively, a dataset subscripted by multiple indices) or as a matrix (whose columns have internal structure that is not modeled). Note: The chosen images are a dictionary from the data to express every image. Note: The chosen pixels are a dictionary from the data to express every pixel.
The (65-th) slab, approximately reconstructed. This slab was reconstructed by an approximate least-squares fit to the basis from slabs 41 and 50, using 1000 (of 250K) pixels/fibers.
Tensor-CUR application: Recommendation Systems. [Figure: an n-products-by-n-products-by-m-customers tensor.] • Important Comment: • Utility is an ordinal, not a cardinal, concept. • Compare products; don’t assign utility values. • Recommendation Model Revisited: • Every customer has an n-by-n matrix (whose entries are +1/-1) representing pair-wise product comparisons. • There are m such matrices, forming an n-by-n-by-m 3-mode tensor A. • Extract the “structure” of this tensor.
Application to Jester Joke Recommendations. Use just the 14,140 “full” users who rated all 100 Jester jokes. For each user, convert the utility vector to a 100 x 100 pair-wise preference matrix. Choose, e.g., 300 slabs/users, and a small number of fibers/comparisons.
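A tiny numpy sketch of the utility-vector-to-preference-matrix conversion described above (the treatment of ties and of the diagonal as 0 is an assumption of this sketch; the slides only specify +1/-1 entries for pair-wise comparisons):

```python
import numpy as np

def pairwise_preferences(utilities):
    """Convert a user's utility vector (one rating per item) into an n x n matrix
    of pair-wise comparisons: +1 if item i is preferred to item j, -1 if j is
    preferred to i, and 0 on ties and on the diagonal."""
    u = np.asarray(utilities, dtype=float)
    return np.sign(u[:, None] - u[None, :])

ratings = np.array([3.5, -1.0, 7.2, 0.0])     # toy utilities for 4 "jokes"
print(pairwise_preferences(ratings))
```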