An Introduction To Matrix Decomposition and Graphical Model. Lei Zhang, Lead Researcher, Microsoft Research Asia. 2012-04-17.
Outline • Matrix Decomposition • PCA, SVD, NMF • LDA, ICA, Sparse Coding, etc. • Graphical Model • Basic concepts in probabilistic machine learning • EM • pLSA • LDA • Two Applications • Document decomposition for “long query” retrieval • Modeling Threaded Discussions
What Is Matrix Decomposition? • We wish to decompose the matrix A by writing it as a product of two or more matrices: $A_{n \times m} = B_{n \times k} C_{k \times m}$ • Suppose A, B, C are column matrices • $A_{n \times m} = (a_1, a_2, \ldots, a_m)$, where each $a_i$ is an n-dim data sample • $B_{n \times k} = (b_1, b_2, \ldots, b_k)$, where each $b_j$ is an n-dim basis vector, and space B is spanned by the k basis vectors • $C_{k \times m} = (c_1, c_2, \ldots, c_m)$, where each $c_i$ holds the k-dim coordinates of $a_i$ projected onto space B
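Why Do We Need Matrix Decomposition? • Given one data sample: $a_1 = B_{n \times k} c_1$, i.e. $(a_{11}, a_{12}, \ldots, a_{1n})^T = (b_1, b_2, \ldots, b_k)(c_{11}, c_{12}, \ldots, c_{1k})^T$ • Another data sample: $a_2 = B_{n \times k} c_2$ • More data samples: $a_m = B_{n \times k} c_m$ • Together (m data samples): $(a_1, a_2, \ldots, a_m) = B_{n \times k}(c_1, c_2, \ldots, c_m)$, i.e. $A_{n \times m} = B_{n \times k} C_{k \times m}$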
Why Do We Need Matrix Decomposition? • $(a_1, a_2, \ldots, a_m) = B_{n \times k}(c_1, c_2, \ldots, c_m)$, i.e. $A_{n \times m} = B_{n \times k} C_{k \times m}$ • We wish to find a new set of basis vectors B to represent the data samples A; A becomes C in the new space. • In general, B captures the common features in A, while C carries the specific characteristics of the original samples. • In PCA: B holds the eigenvectors of the covariance matrix • In SVD: B holds the left (column) singular vectors • In LDA: B holds the discriminant directions • In NMF: B holds the local features
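The basis/coordinate reading of A = BC can be checked numerically. The following is a minimal sketch, assuming NumPy and randomly generated matrices (the sizes and the random data are illustrative, not from the slides): every column of A is the basis B applied to the matching column of C, and when B is known the coordinates can be recovered by least squares.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, m = 5, 2, 4            # sample dimension, number of bases, number of samples

B = rng.normal(size=(n, k))  # columns b_1..b_k: the basis vectors
C = rng.normal(size=(k, m))  # columns c_1..c_m: coordinates of each sample
A = B @ C                    # columns a_1..a_m: the data samples

# Each sample is a linear combination of the basis vectors: a_i = B c_i
for i in range(m):
    assert np.allclose(A[:, i], B @ C[:, i])

# Given A and B, the coordinates can be recovered by least squares
C_rec, *_ = np.linalg.lstsq(B, A, rcond=None)
```

In practice neither B nor C is known in advance; the methods on the following slides (PCA, SVD, NMF) differ in the constraints they put on B and C.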
Definition – Eigenvalue & Eigenvector • Given an m × m matrix C, if a scalar λ and a nonzero vector w satisfy $Cw = \lambda w$, then λ is called an eigenvalue of C and w is called an eigenvector.
Definition – Principal Component Analysis • Principal Component Analysis (PCA) • Karhunen-Loève transformation (KL transformation) • Let A be an n × m data matrix in which the rows represent data samples • Each row is a data vector, each column represents a variable • A is centered: the estimated mean is subtracted from each column, so each column has zero mean. • Covariance matrix C (m × m): $C = A^T A$ (up to a normalization constant)
Principal Component Analysis • C can be decomposed as follows: $C = U \Lambda U^T$ • Λ is a diagonal matrix $\mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_m)$, where each $\lambda_i$ is an eigenvalue • U is an orthogonal matrix whose columns are the eigenvectors: $U^T U = I$, $U^{-1} = U^T$
Maximizing Variance • The objective of the rotation transformation is to find the direction of maximal variance • The projection of the data along w is Aw. • Variance: $\sigma_w^2 = (Aw)^T (Aw) = w^T A^T A w = w^T C w$, where $C = A^T A$ is the covariance matrix of the data (A is centered!) • Task: maximize the variance subject to the constraint $w^T w = 1$.
Optimization Problem • Maximize $w^T C w - \lambda (w^T w - 1)$, where λ is the Lagrange multiplier • Differentiating with respect to w and setting the result to zero yields $2Cw - 2\lambda w = 0$ • Eigenvalue equation: $Cw = \lambda w$, where $C = A^T A$. • Once the first principal component is found, we continue in the same fashion to look for the next one, which is orthogonal to (all) the principal component(s) already found.
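The eigenvalue formulation above can be tried out directly. This is a minimal sketch, assuming NumPy and a synthetic data set (the data and variable names are illustrative): it centers the data, forms C = AᵀA, and takes the eigenvector with the largest eigenvalue as the first principal component.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 200 samples (rows), 3 variables (columns)
X = rng.normal(size=(200, 3)) @ np.array([[3.0, 0.0, 0.0],
                                          [1.0, 1.0, 0.0],
                                          [0.0, 0.5, 0.2]])
A = X - X.mean(axis=0)       # center each column
C = A.T @ A                  # unnormalized covariance matrix, as on the slides

# eigh is meant for symmetric matrices; eigenvalues come back in ascending order
eigvals, U = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
eigvals, U = eigvals[order], U[:, order]

w1 = U[:, 0]                 # first principal component (unit length)
var_w1 = w1 @ C @ w1         # variance along w1, equal to the largest eigenvalue
```

Because C is symmetric, `np.linalg.eigh` returns orthonormal eigenvectors, so the remaining columns of `U` are the subsequent principal components.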
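Property: Data Decomposition • PCA can be treated as data decomposition: $a = U U^T a = (u_1, u_2, \ldots, u_n)(u_1, u_2, \ldots, u_n)^T a = (u_1, u_2, \ldots, u_n)(\langle u_1, a \rangle, \langle u_2, a \rangle, \ldots, \langle u_n, a \rangle)^T = (u_1, u_2, \ldots, u_n)(b_1, b_2, \ldots, b_n)^T = \sum_i b_i u_i$, where $b_i = \langle u_i, a \rangle$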
Face Recognition – Eigenface • Turk, M.A.; Pentland, A.P. Face recognition using eigenfaces, CVPR 1991 (Citation: 2654) • The eigenface approach • images are points in a vector space • use PCA to reduce dimensionality • face space • compare projections onto face space to recognize faces
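PageRank – Power Iteration • Define the link matrix A by $A_{ij} = 1/N_j$ if page j links to page i, and $A_{ij} = 0$ otherwise • Column j has nonzero elements in the positions corresponding to the outlinks of j ($N_j$ in total) • Row i has nonzero elements in the positions corresponding to the inlinks of i (the set $I_i$).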
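Column-Stochastic & Irreducible • Column-stochastic: every column sums to one, $e^T A = e^T$, where $e = (1, 1, \ldots, 1)^T$ • Irreducible: the link graph is strongly connected, i.e. every page can be reached from every other page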
Iterative PageRank Calculation • For k = 1, 2, …: $r_k = A r_{k-1}$ • Equivalently, the limit r satisfies $\lambda r = A r$ (λ = 1, A is a Markov chain transition matrix) • Why can we use power iteration to find the first eigenvector?
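The iteration can be sketched in a few lines. The example below assumes NumPy, a made-up four-page graph, and a random-jump step with damping factor 0.85 (a commonly used value, not one given on the slides). The random jump keeps the matrix column-stochastic and makes it irreducible, so the iteration converges to a unique vector.

```python
import numpy as np

# Hypothetical 4-page web graph: links[j] lists the pages that page j links to
links = {0: [1, 2], 1: [2], 2: [0], 3: [0, 2]}
n = len(links)

# Column-stochastic link matrix: A[i, j] = 1/N_j if page j links to page i
A = np.zeros((n, n))
for j, outlinks in links.items():
    for i in outlinks:
        A[i, j] = 1.0 / len(outlinks)

# Random-jump step; alpha = 0.85 is an assumed damping factor
alpha = 0.85
G = alpha * A + (1 - alpha) / n * np.ones((n, n))

r = np.full(n, 1.0 / n)              # initial approximation r_0
for _ in range(1000):
    r_next = G @ r                   # r_k = G r_{k-1}
    converged = np.linalg.norm(r_next - r, 1) < 1e-12
    r = r_next
    if converged:
        break
```

After the loop, `r` is (numerically) the eigenvector of `G` for eigenvalue 1, and its entries sum to one because `G` is column-stochastic.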
Convergence of the power iteration • Expand the initial approximation r0 in terms of the eigenvectors
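The expansion can be written out as follows. This is a sketch under stated assumptions rather than the slide's own derivation: A is assumed diagonalizable with eigenpairs $(\lambda_i, v_i)$, the dominant eigenvalue $\lambda_1 = 1$ is assumed strictly larger in modulus than the others, and the coefficient $c_1$ is assumed nonzero.

```latex
r_0 = \sum_{i=1}^{n} c_i v_i, \qquad A v_i = \lambda_i v_i, \qquad
1 = \lambda_1 > |\lambda_2| \ge \cdots \ge |\lambda_n|

r_k = A^k r_0 = \sum_{i=1}^{n} c_i \lambda_i^k v_i
    = c_1 v_1 + \sum_{i=2}^{n} c_i \lambda_i^k v_i
    \;\longrightarrow\; c_1 v_1 \quad (k \to \infty)
```

The error terms decay like $|\lambda_2|^k$, so the gap between the first and second eigenvalue governs the convergence speed.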
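SVD - Definition • Any m × n matrix A, with m ≥ n, can be factorized as $A = U \begin{pmatrix} \Sigma \\ 0 \end{pmatrix} V^T$, where $U \in \mathbb{R}^{m \times m}$ and $V \in \mathbb{R}^{n \times n}$ are orthogonal and $\Sigma = \mathrm{diag}(\sigma_1, \sigma_2, \ldots, \sigma_n)$ with $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_n \ge 0$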
Singular Values And Singular Vectors • The diagonal elements $\sigma_j$ of Σ are the singular values of the matrix A. • The columns of U and V are the left singular vectors and right singular vectors respectively. • Equivalent form of SVD: $A v_j = \sigma_j u_j$, $A^T u_j = \sigma_j v_j$
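Matrix approximation • Theorem: Let $U_k = (u_1\, u_2\, \ldots\, u_k)$, $V_k = (v_1\, v_2\, \ldots\, v_k)$ and $\Sigma_k = \mathrm{diag}(\sigma_1, \sigma_2, \ldots, \sigma_k)$, and define $A_k = U_k \Sigma_k V_k^T$ • Then $\min_{\mathrm{rank}(B) \le k} \|A - B\|_2 = \|A - A_k\|_2 = \sigma_{k+1}$ • This means that the best rank-k approximation of the matrix A is $A_k$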
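The theorem is easy to verify numerically. The following minimal sketch assumes NumPy and a random test matrix (the matrix and the choice k = 2 are illustrative): it builds the truncated SVD and compares the spectral norm of the residual with the (k+1)-th singular value.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(8, 5))          # hypothetical m x n matrix with m >= n

# Thin SVD: U is 8x5, s holds the singular values in descending order, Vt is 5x5
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # A_k = U_k Sigma_k V_k^T

err = np.linalg.norm(A - A_k, 2)     # spectral norm of the residual
```

Since NumPy indexes from zero, `s[k]` is the singular value written as σ_{k+1} on the slide, and `err` should match it up to rounding.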
SVD and PCA • We can write: $A^T A = V \Sigma^2 V^T$ • Remember that in PCA, we treat A as a row matrix • The columns of V are the eigenvectors of $A^T A$, the covariance matrix of the rows of A • We use V to approximate a row of A • Equivalently, we can write: $A A^T = U \Sigma^2 U^T$ • The columns of U are the eigenvectors of $A A^T$, which is the case where A is treated as a column matrix • We use U to approximate a column of A
Example - LSI • Build a term-by-document matrix A • Compute the SVD of A: $A = U \Sigma V^T$ • Approximate A by $A_k = U_k \Sigma_k V_k^T = U_k D_k$, where $D_k = \Sigma_k V_k^T$ • $U_k$: orthogonal basis that we use to approximate all the documents • $D_k$: column j holds the coordinates of document j in the new basis • $D_k$ is the projection of A onto the subspace spanned by $U_k$.
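A toy version of this procedure is sketched below, assuming NumPy and a made-up 5-term, 5-document count matrix (the terms, counts, and the `cosine` helper are illustrative only). A query containing only "car" is projected onto U_k and compared with the document coordinates D_k.

```python
import numpy as np

# Hypothetical term-by-document counts (rows: terms, columns: documents)
terms = ["car", "automobile", "engine", "cow", "sheep"]
A = np.array([[1, 0, 1, 0, 0],
              [0, 1, 1, 0, 0],
              [1, 1, 0, 0, 0],
              [0, 0, 0, 1, 1],
              [0, 0, 0, 1, 0]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
U_k = U[:, :k]                       # orthonormal basis for the documents
D_k = np.diag(s[:k]) @ Vt[:k, :]     # column j: coordinates of document j

def cosine(x, y):
    """Cosine similarity between two vectors."""
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

# A query containing only "car", projected onto the same basis
q = np.array([1.0, 0.0, 0.0, 0.0, 0.0])
q_k = U_k.T @ q
scores = [cosine(q_k, D_k[:, j]) for j in range(A.shape[1])]
```

In this toy data, document 1 contains "automobile" and "engine" but not "car", yet it scores high against the query in the reduced space, while the animal documents score near zero.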
SVD and PCA • For symmetric A, SVD is closely related to PCA • PCA: $A = U \Lambda U^T$ • U and Λ are the eigenvectors and eigenvalues. • SVD: $A = U \Lambda V^T$ • U holds the left (column) singular vectors • V holds the right (row) singular vectors • Λ holds the singular values, which equal the eigenvalues • For symmetric positive semidefinite A, the left and right singular vectors coincide with the eigenvectors • Note the difference of A in PCA and SVD • SVD: A is directly the data, e.g. a term-by-document matrix • PCA: A is the covariance matrix, $A = X^T X$, where each row of X is a sample
Latent Semantic Indexing (LSI) • Document file preparation/preprocessing: • Indexing: collecting terms • Use a stop list: eliminate "meaningless" words • Stemming • Construct the term-by-document matrix, using sparse matrix storage. • Query matching: distance measures. • Data compression by low-rank approximation: SVD • Ranking and relevance feedback.
Latent Semantic Indexing • Assumption: there is some underlying latent semantic structure in the data. • E.g. car and automobile occur in similar documents, as do cows and sheep. • This structure can be enhanced by projecting the data (the term-by-document matrix and the queries) onto a lower dimensional space using SVD.
Similarity Measures • Term to term: $A A^T = U \Sigma^2 U^T = (U\Sigma)(U\Sigma)^T$, where $U\Sigma$ are the coordinates of A (rows) projected to space V • Document to document: $A^T A = V \Sigma^2 V^T = (V\Sigma)(V\Sigma)^T$, where $V\Sigma$ are the coordinates of A (columns) projected to space U
Similarity Measures • Term to document: $A = U \Sigma V^T = (U\Sigma^{1/2})(V\Sigma^{1/2})^T$, where $U\Sigma^{1/2}$ are the coordinates of A (rows) projected to space V and $V\Sigma^{1/2}$ are the coordinates of A (columns) projected to space U
HITS (Hyperlink Induced Topic Search) • Idea: the Web includes two flavors of prominent pages: • authorities contain high-quality information, • hubs are comprehensive lists of links to authorities • A page is a good authority if many hubs point to it. • A page is a good hub if it points to many authorities. • Good authorities are pointed to by good hubs and good hubs point to good authorities. (Figure: hubs pointing to authorities.)
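Power Iteration • Each page i has both a hub score $h_i$ and an authority score $a_i$. • HITS successively refines these scores by computing $a_i = \sum_{j:\, j \to i} h_j$ and $h_i = \sum_{j:\, i \to j} a_j$ • Define the adjacency matrix L of the directed web graph: $L_{ij} = 1$ if page i links to page j, and $L_{ij} = 0$ otherwise • Now $a = L^T h$ and $h = L a$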
HITS and SVD • L: rows are outlinks, columns are inlinks. • a will be the dominant eigenvector of the authority matrix $L^T L$ • h will be the dominant eigenvector of the hub matrix $L L^T$ • They are in fact the first right singular vector (a) and the first left singular vector (h) of L! • We are in fact running SVD on the adjacency matrix.
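The link between HITS and the SVD can be checked on a small graph. The sketch below assumes NumPy and a made-up four-page adjacency matrix (per-step normalization is added to keep the scores bounded; it does not change the direction of the vectors). The converged hub and authority scores are compared with the first left and right singular vectors of L.

```python
import numpy as np

# Hypothetical directed graph: L[i, j] = 1 if page i links to page j
L = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [1, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)

n = L.shape[0]
a = np.ones(n)                 # authority scores
h = np.ones(n)                 # hub scores
for _ in range(200):
    a = L.T @ h                # good authorities are pointed to by good hubs
    h = L @ a                  # good hubs point to good authorities
    a /= np.linalg.norm(a)     # normalize to keep the scores bounded
    h /= np.linalg.norm(h)

U, s, Vt = np.linalg.svd(L)
# Singular vectors are only defined up to sign; take absolute values to compare
u1, v1 = np.abs(U[:, 0]), np.abs(Vt[0, :])
```

Here `h` should agree with `u1` and `a` with `v1`, which illustrates that the iteration is a power method on $L L^T$ and $L^T L$.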
HITS vs PageRank • PageRank may be computed once, HITS is computed per query. • HITS takes the query into account, PageRank doesn't. • PageRank has no concept of hubs • HITS is sensitive to local topology: insertion or deletion of a small number of nodes may change the scores a lot. • PageRank is more stable, because of its random jump step.
Definition • Given a non-negative matrix $V_{n \times m}$, find non-negative matrix factors $W_{n \times k}$ and $H_{k \times m}$ such that $V_{n \times m} \approx W_{n \times k} H_{k \times m}$ • V: column matrix, each column is a data sample (n-dimensional) • W: k basis vectors, each column $w_i$ is one basis vector • H: coordinates of V projected to W, $v_j \approx W_{n \times k} h_j$
Motivation • Non-negativity is natural in many applications... • Probability is also non-negative • Additive model to capture local structure
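Multiplicative Update Algorithm • Cost function: Euclidean distance $\|A - B\|^2 = \sum_{ij} (A_{ij} - B_{ij})^2$, with $A = V$ and $B = WH$ • Multiplicative update: $H_{a\mu} \leftarrow H_{a\mu} \frac{(W^T V)_{a\mu}}{(W^T W H)_{a\mu}}$, $W_{ia} \leftarrow W_{ia} \frac{(V H^T)_{ia}}{(W H H^T)_{ia}}$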
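The Euclidean multiplicative updates from Lee and Seung can be sketched as below. This is a minimal illustration, assuming NumPy, random initialization, a fixed number of iterations, and a small constant `eps` added to the denominators to avoid division by zero (the function name `nmf_euclidean`, `eps`, and the test data are assumptions, not part of the slides).

```python
import numpy as np

def nmf_euclidean(V, k, n_iter=200, eps=1e-9, seed=0):
    """Multiplicative updates for min ||V - WH||^2 subject to W, H >= 0."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, k)) + eps       # non-negative random initialization
    H = rng.random((k, m)) + eps
    errors = []
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # update the coordinates
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # update the basis
        errors.append(np.linalg.norm(V - W @ H) ** 2)
    return W, H, errors

rng = np.random.default_rng(42)
V = rng.random((6, 3)) @ rng.random((3, 8))    # hypothetical non-negative rank-3 data
W, H, errors = nmf_euclidean(V, k=3)
```

Because each update multiplies by a non-negative ratio, W and H stay non-negative without any explicit projection step.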
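Multiplicative Update Algorithm • Cost function: divergence $D(A \| B) = \sum_{ij} \left( A_{ij} \log \frac{A_{ij}}{B_{ij}} - A_{ij} + B_{ij} \right)$ • Reduces to the Kullback-Leibler divergence when $\sum_{ij} A_{ij} = \sum_{ij} B_{ij} = 1$ • A and B can then be regarded as normalized probability distributions. • Multiplicative update: $H_{a\mu} \leftarrow H_{a\mu} \frac{\sum_i W_{ia} V_{i\mu} / (WH)_{i\mu}}{\sum_k W_{ka}}$, $W_{ia} \leftarrow W_{ia} \frac{\sum_\mu H_{a\mu} V_{i\mu} / (WH)_{i\mu}}{\sum_\nu H_{a\nu}}$ • PLSA is NMF with KL divergence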
NMF vs PCA • n = 19×19 = 361 pixels, m = 2429 faces (each column of V is one face image) • Positive values are illustrated with black pixels and negative values with red pixels • NMF: parts-based representation • PCA: holistic representation
Reference • D. D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. NIPS, 2001. • D. D. Lee and H. S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature 401, 788-791 (1999).
Major Reference • Saara Hyvönen, Linear Algebra Methods for Data Mining, Spring 2007, University of Helsinki (highly recommended)
Outline • Basic concepts • Likelihood, i.i.d. • ML, MAP and Bayesian Inference • Expectation-Maximization • Mixture Gaussian Parameter estimation • pLSA • Motivation • Derivation & Geometry properties • Applications • LDA • Motivation - Why add a hyperparameter • Dirichlet Distribution • Variational EM • Relations with other topic models • Incorporating category information • Summary
Not Included • General graphical model theories • Markov random field (belief propagation) • Detailed derivation of LDA
What Is Machine Learning? • Data: Let $x = (x_1, x_2, \ldots, x_D)^T$ denote a data point, and $\mathcal{D} = \{x^{(1)}, x^{(2)}, \ldots, x^{(N)}\}$ a data set. $\mathcal{D}$ is sometimes associated with desired outputs $y_1, y_2, \ldots$ • Predictions: We are generally interested in predicting something based on the observed data set. Given $\mathcal{D}$, what can we say about $x^{(N+1)}$? • Model: To make predictions, we need to make some assumptions. We can often express these assumptions in the form of a model with some parameters θ. Given data $\mathcal{D}$, we learn the model parameters θ, from which we can predict new data points. The model can often be expressed as a probability distribution over data points.
Likelihood Function • Given a set of parameter values, the probability density function (PDF) shows that some data are more probable than other data. • Inversely, given the observed data and a model of interest, the likelihood function is defined as: $L(\theta) = f_\theta(x \mid \theta) = p(x \mid \theta)$ • That is, the likelihood function $L(\theta)$ shows that some parameters are more likely than others to have produced the data
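Maximum Likelihood (ML) • Maximum likelihood finds the model parameters under which the data are "most likely" to have been generated from this model. • Suppose we are given n data samples $(x_1, x_2, \ldots, x_n)$ • Maximum likelihood finds the θ that maximizes $L(\theta)$: $\theta_{ML} = \arg\max_\theta L(\theta) = \arg\max_\theta p(x_1, x_2, \ldots, x_n \mid \theta)$ • Predictive distribution: $p(x_{n+1} \mid x_1, \ldots, x_n) \approx p(x_{n+1} \mid \theta_{ML})$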
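I.I.D. – Independent, Identically Distributed • I.I.D. means $p(x_1, x_2, \ldots, x_n \mid \theta) = \prod_{i=1}^{n} p(x_i \mid \theta)$ • The problem is considerably simplified as: $\theta_{ML} = \arg\max_\theta \prod_{i=1}^{n} p(x_i \mid \theta)$ • Usually, the log likelihood is used: $\log L(\theta) = \sum_{i=1}^{n} \log p(x_i \mid \theta)$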
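As a concrete instance of maximizing an i.i.d. log likelihood, the sketch below assumes NumPy and a one-dimensional Gaussian model with synthetic samples (the model choice, the data, and the helper `log_likelihood` are illustrative assumptions). For a Gaussian, the maximizer has a closed form: the sample mean and the 1/n standard deviation.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=1000)   # hypothetical i.i.d. samples

def log_likelihood(mu, sigma, x):
    """Sum of log N(x_i | mu, sigma^2) over the i.i.d. samples."""
    return float(np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                        - (x - mu) ** 2 / (2 * sigma**2)))

# Closed-form maximum likelihood estimates for a Gaussian
mu_ml = x.mean()
sigma_ml = x.std()          # np.std divides by n by default, which is the ML estimate

ll_ml = log_likelihood(mu_ml, sigma_ml, x)
```

Any other parameter setting, for example shifting the mean or rescaling the standard deviation, gives a log likelihood no larger than `ll_ml`.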
Reference • Zoubin Ghahramani, Machine Learning (4F13), 2006, Cambridge (Introduction to Machine Learning, Lectures 1-2 Slides) • Gregor Heinrich, Parameter estimation for text analysis, Technical Note, 2005-2008