Matrix Factorization & Principal Component Analysis
Bamshad Mobasher, DePaul University
Principal Component Analysis
• PCA is a widely used data compression and dimensionality reduction technique
• PCA takes a data matrix, A, of n objects by p variables, which may be correlated, and summarizes it by uncorrelated axes (principal components or principal axes) that are linear combinations of the original p variables
• The first k components display most of the variance among objects
• The remaining components can be discarded, resulting in a lower-dimensional representation of the data that still captures most of the relevant information
• PCA is computed by determining the eigenvectors and eigenvalues of the covariance matrix
• Recall: the covariance of two random variables is their tendency to vary together
Principal Component Analysis (PCA)
The covariance of variables i and j over the n objects is

    cov(i, j) = 1/(n-1) Σ_m (X_mi − X̄_i)(X_mj − X̄_j)

where X_mi is the value of variable i in object m, X̄_i is the mean of variable i, and the sum runs over all n objects.
• Notes:
  • For a variable X, cov(X, X) = var(X)
  • For independent variables X and Y, cov(X, Y) = 0
  • The covariance matrix is a matrix C with elements C_ij = cov(i, j)
  • The covariance matrix is square and symmetric
  • For independent variables, the covariance matrix is a diagonal matrix with the variances along the diagonal and zeros in the off-diagonal elements
  • To calculate the covariance matrix from a dataset, first center the data by subtracting the mean of each variable, then compute C = 1/(n-1) (A^T A), where A is the centered data matrix (the worked example below uses this 1/(n-1) normalization)
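A minimal NumPy sketch of this computation; the small data matrix here is made up for illustration and is not the example on the next slide:

```python
import numpy as np

# Made-up data matrix: n = 5 objects (rows) by p = 3 variables (columns)
X = np.array([[2.0,  4.1, 1.0],
              [3.0,  6.2, 0.5],
              [4.0,  7.9, 1.5],
              [5.0, 10.1, 0.8],
              [6.0, 12.0, 1.2]])

n = X.shape[0]

# Center the data by subtracting each variable's mean
A = X - X.mean(axis=0)

# Covariance matrix: C = 1/(n-1) A^T A  (square, symmetric, p x p)
C = (A.T @ A) / (n - 1)

# Sanity check against NumPy's built-in covariance (rowvar=False: columns are variables)
assert np.allclose(C, np.cov(X, rowvar=False))
print(C)
```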
Geometric Interpretation of PCA
[Figure: objects plotted on the original two variables, with the rotated axes PC 1 and PC 2 overlaid]
• The goal is to rotate the axes of the p-dimensional space to new positions (principal axes) that have the following properties:
  • ordered such that principal axis 1 has the highest variance, axis 2 has the next highest variance, ..., and axis p has the lowest variance
  • the covariance between each pair of principal axes is zero (the principal axes are uncorrelated)
Note: Each principal axis is a linear combination of the original two variables
Credit: Loretta Battaglia, Southern Illinois University
Covariance Matrix - Example
[Slide shows the original data matrix X, the centered data matrix A, and the resulting covariance matrix Cov(X) = 1/(n-1) A^T A]
Eigenvalues and Eigenvectors
• Finding the principal axes involves finding the eigenvalues and eigenvectors of the covariance matrix C
  • eigenvalues are values λ such that C·Z = λ·Z (the Z are special vectors called eigenvectors)
  • This can be re-written as: (C − λI)·Z = 0
  • So, eigenvalues can be found by solving the characteristic equation: det(C − λI) = 0
• The eigenvalues λ1, λ2, ..., λp are the variances of the coordinates on each principal component axis
  • the sum of all p eigenvalues equals the trace of C (the sum of the variances of the original variables)
• The eigenvectors of the covariance matrix are the axes of maximum variance
  • a good approximation of the full matrix can be computed using only a subset of the eigenvectors and eigenvalues
  • the eigenvalues are truncated below some threshold; the data is then reprojected onto the remaining r eigenvectors to get a rank-r approximation
Eigenvalues and Eigenvectors - Example
Eigenvalues of the covariance matrix: λ1 = 73.718, λ2 = 0.384, λ3 = 0.298
Note: λ1 + λ2 + λ3 = 74.4 = trace of C (sum of the variances on the diagonal)
[Slide shows the covariance matrix C and the matrix of eigenvectors Z]
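A hedged sketch of this eigen-decomposition step in NumPy; the covariance matrix below is made up for illustration, not the matrix whose eigenvalues are quoted on the slide:

```python
import numpy as np

# C: a p x p covariance matrix (example values are made up)
C = np.array([[60.0, 28.0, 2.0],
              [28.0, 14.0, 1.0],
              [ 2.0,  1.0, 0.4]])

# eigh is appropriate for symmetric matrices; it returns real eigenvalues/eigenvectors
eigenvalues, Z = np.linalg.eigh(C)

# Sort from largest to smallest eigenvalue (eigh returns them in ascending order)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, Z = eigenvalues[order], Z[:, order]

# The eigenvalues are the variances along the principal axes;
# their sum equals the trace of C (the total variance of the original variables)
print(eigenvalues)
print(np.isclose(eigenvalues.sum(), np.trace(C)))   # True
```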
Reduced Dimension Space
• Coordinates of each object i on the kth principal axis, known as the scores on PC k, are computed as U = X·Z, where U is the n x k matrix of PC scores, X is the n x p centered data matrix, and Z is the p x k matrix of eigenvectors
• Variance of the scores on each PC axis is equal to the corresponding eigenvalue for that axis
  • the eigenvalue represents the variance displayed ("explained" or "extracted") by the kth axis
  • the sum of the first k eigenvalues is the variance explained by the k-dimensional reduced matrix
Reduced Dimension Space
• So, to generate the data in the new space:
  • RowFeatureVector: matrix with the eigenvectors in the columns, transposed so that the eigenvectors are now in the rows, with the most significant eigenvector at the top
  • RowZeroMeanData: the mean-adjusted data transposed, i.e. the data items are in the columns, with each row holding a separate dimension
FinalData = RowFeatureVector x RowZeroMeanData
Reduced Dimension Space
U = Z^T · A^T = [full matrix of PC scores shown on slide]
Taking only the top k = 1 principal component:
U = Z_k^T · A^T = [scores on the first principal component shown on slide]
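A minimal sketch of the projection step; the data and eigenvectors below are generated for illustration, not the slide's example:

```python
import numpy as np

def pc_scores(A, Z, k):
    """Scores of each object on the first k principal axes: U = A Z_k (n x k)."""
    Zk = Z[:, :k]          # keep only the top-k eigenvectors (p x k)
    return A @ Zk          # equivalently (Zk^T A^T)^T, as written on the slide

# Made-up correlated data for illustration (n = 100 objects, p = 3 variables)
X = np.random.default_rng(0).normal(size=(100, 3)) @ np.array([[3.0, 0.0, 0.0],
                                                               [2.0, 1.0, 0.0],
                                                               [1.0, 0.0, 0.5]])
A = X - X.mean(axis=0)                       # center the data
C = (A.T @ A) / (len(A) - 1)                 # covariance matrix
eigenvalues, Z = np.linalg.eigh(C)
order = np.argsort(eigenvalues)[::-1]        # sort by decreasing eigenvalue
eigenvalues, Z = eigenvalues[order], Z[:, order]

U = pc_scores(A, Z, k=2)
# The variance of the scores on each PC axis equals the corresponding eigenvalue
print(np.allclose(U.var(axis=0, ddof=1), eigenvalues[:2]))   # True
```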
Matrix Decomposition
• Matrix D is m x n
  • e.g., a ratings matrix with m customers and n items
  • e.g., a term-document matrix with m terms and n documents
• Typically:
  • D is sparse, e.g., less than 1% of entries have ratings
  • n is large, e.g., 18,000 movies (Netflix), millions of docs, etc.
  • so finding matches to less popular items will be difficult
• Basic idea: compress the columns (items) into a lower-dimensional representation
Credit: Based on lecture notes from Padhraic Smyth, University of California, Irvine
Singular Value Decomposition (SVD)
D = U S V^T, where D is m x n, U is m x n, S is n x n, and V^T is n x n
• rows of V^T are eigenvectors of D^T D = basis functions
• S is diagonal, with s_ii = sqrt(λ_i) (the ith eigenvalue)
• rows of U are coefficients for the basis functions in V
(here we assumed that m > n, and rank(D) = n)
Credit: Based on lecture notes from Padhraic Smyth, University of California, Irvine
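A hedged NumPy sketch of the decomposition and the shapes involved; the matrix here is random and only illustrates the mechanics:

```python
import numpy as np

# Any m x n data matrix with m > n (values made up for illustration)
D = np.random.default_rng(1).normal(size=(6, 3))

# full_matrices=False returns the "economy" shapes used on this slide:
# U is m x n, the singular values come back as a length-n vector, V^T is n x n
U, s, Vt = np.linalg.svd(D, full_matrices=False)
S = np.diag(s)

print(U.shape, S.shape, Vt.shape)     # (6, 3) (3, 3) (3, 3)
print(np.allclose(D, U @ S @ Vt))     # True: D = U S V^T
```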
SVD Example
• Data D = [5 x 3 data matrix shown on slide]
Note the pattern in the data: the center column values are typically about twice the 1st and 3rd column values
• So there is redundancy in the columns, i.e., the column values are correlated
Credit: Based on lecture notes from Padhraic Smyth, University of California, Irvine
SVD Example
D = U S V^T, where

U   =   0.50   0.14  -0.19
        0.12  -0.35   0.07
        0.41  -0.54   0.66
        0.49  -0.35  -0.67
        0.56   0.66   0.27

S   =  48.6    0      0
        0      1.5    0
        0      0      1.2

V^T =   0.41   0.82   0.40
        0.73  -0.56   0.41
        0.55   0.12  -0.82

Note that the first singular value is much larger than the others
The first basis function (or eigenvector) carries most of the information and it "discovers" the pattern of column dependence
Credit: Based on lecture notes from Padhraic Smyth, University of California, Irvine
Rows in D = weighted sums of basis vectors
1st row of D = [10 20 10]
Since D = U S V^T, then D[0,:] = U[0,:] * S * V^T = [24.5  0.2  -0.22] * V^T
• D[0,:] = 24.5 v1 + 0.2 v2 − 0.22 v3, where v1, v2, v3 are the rows of V^T and are our basis vectors
Thus, [24.5, 0.2, -0.22] are the weights that characterize row 1 of D
In general, the ith row of U*S is the set of weights for the ith row of D
Credit: Based on lecture notes from Padhraic Smyth, University of California, Irvine
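A sketch of this weighted-sum view. The data matrix below is an assumption: it matches the first row [10 20 10] quoted above and the "middle column is about twice the outer columns" pattern, but it is not necessarily the slide's exact matrix.

```python
import numpy as np

# Assumed data matrix consistent with the pattern described on the slides
D = np.array([[10, 20, 10],
              [ 2,  5,  2],
              [ 8, 17,  7],
              [ 9, 20, 10],
              [12, 22, 11]], dtype=float)

U, s, Vt = np.linalg.svd(D, full_matrices=False)
S = np.diag(s)

weights = U @ S                 # ith row = the weights for the ith row of D
print(weights[0])               # roughly [24.5, 0.2, -0.22] (column signs may be flipped)

# Row 0 of D as a weighted sum of the basis vectors (rows of V^T)
row0 = weights[0, 0] * Vt[0] + weights[0, 1] * Vt[1] + weights[0, 2] * Vt[2]
print(np.allclose(row0, D[0]))  # True
```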
Summary of SVD Representation
D = U S V^T
• Data matrix D: rows = data vectors
• V^T matrix: rows = our basis functions
• U*S matrix: rows = weights for the rows of D
Credit: Based on lecture notes from Padhraic Smyth, University of California, Irvine
How do we compute U, S, and V?
• Computing the SVD is a standard eigenvector/eigenvalue problem
  • The eigenvectors of D^T D = the rows of V^T (the columns of V)
  • The eigenvectors of D D^T = the columns of U
  • The diagonal elements of S are the square roots of the eigenvalues of D^T D
  => finding U, S, V is equivalent to finding the eigenvectors of D^T D
• Solving eigenvalue problems is equivalent to solving a set of linear equations – time complexity is O(mn^2 + n^3)
Credit: Based on lecture notes from Padhraic Smyth, University of California, Irvine
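A sketch that checks these relationships numerically, reusing the assumed example matrix from the previous sketch:

```python
import numpy as np

D = np.array([[10, 20, 10],
              [ 2,  5,  2],
              [ 8, 17,  7],
              [ 9, 20, 10],
              [12, 22, 11]], dtype=float)

U, s, Vt = np.linalg.svd(D, full_matrices=False)

# Eigen-decomposition of D^T D: eigenvalues = squared singular values,
# eigenvectors = columns of V (rows of V^T), up to sign and ordering
eigvals, eigvecs = np.linalg.eigh(D.T @ D)
order = np.argsort(eigvals)[::-1]            # sort descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

print(np.allclose(np.sqrt(eigvals), s))                  # True
print(np.allclose(np.abs(eigvecs), np.abs(Vt.T)))        # True, up to sign
```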
Matrix Approximation with SVD
D ≈ U S V^T, where D is m x n, U is m x k, S is k x k, and V^T is k x n
• columns of V are the first k eigenvectors of D^T D
• S is diagonal with the k largest eigenvalues
• rows of U are coefficients in the reduced-dimension V-space
This approximation gives the best rank-k approximation to matrix D in a least-squares sense (this is also known as principal components analysis)
Credit: Based on lecture notes from Padhraic Smyth, University of California, Irvine
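A minimal sketch of the rank-k truncation, again using the assumed example matrix; k = 1 keeps only the dominant singular value:

```python
import numpy as np

D = np.array([[10, 20, 10],
              [ 2,  5,  2],
              [ 8, 17,  7],
              [ 9, 20, 10],
              [12, 22, 11]], dtype=float)

U, s, Vt = np.linalg.svd(D, full_matrices=False)

k = 1
D_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]    # best rank-k approximation in a least-squares sense

# Because the first singular value dominates, the rank-1 reconstruction is already close to D
print(np.round(D_k, 1))
# The Frobenius norm of the error equals sqrt of the sum of the discarded squared singular values
print(np.isclose(np.linalg.norm(D - D_k), np.sqrt(np.sum(s[k:] ** 2))))   # True
```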