
An Introduction To Matrix Decomposition and Graphical Model

An Introduction To Matrix Decomposition and Graphical Model. Lei Zhang/Lead Researcher Microsoft Research Asia 2012-04-17. Outline. Matrix Decomposition PCA, SVD, NMF LDA, ICA, Sparse Coding, etc. Graphical Model Basic concepts in probabilistic machine learning EM pLSA LDA


Presentation Transcript


  1. An Introduction To Matrix Decomposition and Graphical Model Lei Zhang/Lead Researcher Microsoft Research Asia 2012-04-17

  2. Outline • Matrix Decomposition • PCA, SVD, NMF • LDA, ICA, Sparse Coding, etc. • Graphical Model • Basic concepts in probabilistic machine learning • EM • pLSA • LDA • Two Applications • Document decomposition for “long query” retrieval • Modeling Threaded Discussions

  3. What Is Matrix Decomposition? • We wish to decompose the matrix A by writing it as a product of two or more matrices: $A_{n \times m} = B_{n \times k} C_{k \times m}$ • Suppose A, B, C are column matrices • $A_{n \times m} = (a_1, a_2, \dots, a_m)$, each $a_i$ is an n-dim data sample • $B_{n \times k} = (b_1, b_2, \dots, b_k)$, each $b_j$ is an n-dim basis vector, and space B consists of k bases. • $C_{k \times m} = (c_1, c_2, \dots, c_m)$, each $c_i$ is the k-dim coordinates of $a_i$ projected to space B

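  4. Why Do We Need Matrix Decomposition? • Given one data sample: $a_1 = B_{n \times k} c_1$, i.e. $(a_{11}, a_{12}, \dots, a_{1n})^T = (b_1, b_2, \dots, b_k)(c_{11}, c_{12}, \dots, c_{1k})^T$ • Another data sample: $a_2 = B_{n \times k} c_2$ • More data samples: $a_m = B_{n \times k} c_m$ • Together (m data samples): $(a_1, a_2, \dots, a_m) = B_{n \times k}(c_1, c_2, \dots, c_m)$, i.e. $A_{n \times m} = B_{n \times k} C_{k \times m}$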
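The column-by-column reading of A = BC on this slide can be checked numerically. The following is a minimal sketch in plain Python; the matrices `B` and `C` are hypothetical toy values (n = 3, k = 2, m = 2) chosen only for illustration, and `matmul` and `column` are helper names introduced here.

```python
def matmul(X, Y):
    """Plain-Python matrix product of X (p x q) and Y (q x r)."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))]
            for i in range(len(X))]

def column(M, j):
    """Return column j of matrix M as a list."""
    return [row[j] for row in M]

# Hypothetical basis B (n=3, k=2): each column is one basis vector b_j.
B = [[1, 0],
     [0, 1],
     [1, 1]]
# Hypothetical coordinates C (k=2, m=2): each column c_i describes one sample.
C = [[2, 0],
     [1, 3]]

# All samples at once: A = B C.
A = matmul(B, C)

# One sample at a time: a_1 = B c_1 is a linear combination of the basis vectors.
c1 = column(C, 0)
a1 = [sum(B[r][k] * c1[k] for k in range(len(c1))) for r in range(len(B))]
```

Here `column(A, 0)` and `a1` hold the same vector, which is the statement $a_1 = B c_1$.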

  5. Why Do We Need Matrix Decomposition? $(a_1, a_2, \dots, a_m) = B_{n \times k}(c_1, c_2, \dots, c_m)$, i.e. $A_{n \times m} = B_{n \times k} C_{k \times m}$ • We wish to find a new set of basis vectors B to represent the data samples A; A becomes C in the new space. • In general, B captures the common features in A, while C carries specific characteristics of the original samples. • In PCA: B is the eigenvectors • In SVD: B is the left (column) singular vectors • In LDA: B is the discriminant directions • In NMF: B is the local features

  6. Principal Component Analysis

  7. Definition – Eigenvalue & Eigenvector Given an m × m matrix C, a scalar λ and a nonzero vector w, if $Cw = \lambda w$, then λ is called an eigenvalue and w is called an eigenvector of C.

  8. Definition – Principal Component Analysis • Principal Component Analysis (PCA) • Karhunen-Loève transformation (KL transformation) • Let A be an n × m data matrix in which the rows represent data samples • Each row is a data vector, each column represents a variable • A is centered: the estimated mean is subtracted from each column, so each column has zero mean. • Covariance matrix C (m × m): $C = A^T A$ (up to a constant scaling factor)

  9. Principal Component Analysis • C can be decomposed as follows: $C = U \Lambda U^T$ • Λ is a diagonal matrix $\mathrm{diag}(\lambda_1, \lambda_2, \dots, \lambda_m)$, each $\lambda_i$ is an eigenvalue • U is an orthogonal matrix, each column is an eigenvector: $U^T U = I \Rightarrow U^{-1} = U^T$

  10. Maximizing Variance • The objective of the rotation transformation is to find the direction of maximal variance • Projection of data along w is Aw. • Variance: $\sigma_w^2 = (Aw)^T (Aw) = w^T A^T A w = w^T C w$, where $C = A^T A$ is the covariance matrix of the data (A is centered!) • Task: maximize variance subject to the constraint $w^T w = 1$.

  11. Optimization Problem • Maximize $w^T C w - \lambda (w^T w - 1)$, where λ is the Lagrange multiplier • Differentiating with respect to w yields $2Cw - 2\lambda w = 0$ • Eigenvalue equation: $Cw = \lambda w$, where $C = A^T A$. • Once the first principal component is found, we continue in the same fashion to look for the next one, which is orthogonal to (all) the principal component(s) already found.
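To make the eigenvalue equation concrete, here is a minimal sketch that finds the first principal component of a small data set. It assumes the rows of `data` are samples, builds C = AᵀA after centering, and uses power iteration as the eigen-solver. Power iteration is a choice made for this sketch, not something the slide prescribes, and the data values and function names are hypothetical.

```python
import math

def center(rows):
    """Subtract the column mean from every column (rows are samples)."""
    n, m = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    return [[r[j] - means[j] for j in range(m)] for r in rows]

def gram(A):
    """C = A^T A for a centered data matrix A (unnormalized covariance)."""
    m = len(A[0])
    return [[sum(r[i] * r[j] for r in A) for j in range(m)] for i in range(m)]

def matvec(M, v):
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def first_principal_component(rows, iters=200):
    """Return (w, lam) with C w ~= lam w and w^T w = 1, found by power iteration."""
    C = gram(center(rows))
    w = normalize([1.0] * len(C))
    for _ in range(iters):
        w = normalize(matvec(C, w))
    lam = sum(wi * ci for wi, ci in zip(w, matvec(C, w)))  # variance w^T C w
    return w, lam

# Hypothetical 2-D samples lying roughly along the line y = x.
data = [[1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.8], [5.0, 5.0]]
w, lam = first_principal_component(data)
```

The returned `w` satisfies the constraint $w^T w = 1$ and `lam` is the variance $w^T C w$ along that direction; for this data `w` points roughly along the diagonal.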

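  12. Property: Data Decomposition • PCA can be treated as data decomposition: $a = U U^T a = (u_1, u_2, \dots, u_n)(u_1, u_2, \dots, u_n)^T a = (u_1, u_2, \dots, u_n)(\langle u_1, a \rangle, \langle u_2, a \rangle, \dots, \langle u_n, a \rangle)^T = (u_1, u_2, \dots, u_n)(b_1, b_2, \dots, b_n)^T = \sum_i b_i u_i$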
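The identity $a = \sum_i b_i u_i$ with $b_i = \langle u_i, a \rangle$ holds for any orthonormal basis. The sketch below checks it in two dimensions; the basis (a 30-degree rotation) and the vector `a` are hypothetical values picked for illustration.

```python
import math

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

# Hypothetical orthonormal basis of R^2: the standard basis rotated by 30 degrees.
t = math.radians(30)
u1 = (math.cos(t), math.sin(t))
u2 = (-math.sin(t), math.cos(t))
basis = [u1, u2]

a = (2.0, 1.0)

# Coordinates in the new basis: b_i = <u_i, a>.
b = [dot(u, a) for u in basis]

# Reconstruction: a = sum_i b_i * u_i.
recon = [sum(b_i * u[d] for b_i, u in zip(b, basis)) for d in range(2)]
```

`recon` reproduces `a` up to floating-point rounding. Keeping only the first few terms of the sum gives the reduced-dimension approximation used in PCA.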

  13. Face Recognition – Eigenface • Turk, M.A.; Pentland, A.P. Face recognition using eigenfaces, CVPR 1991 (Citation: 2654) • The eigenface approach • images are points in a vector space • use PCA to reduce dimensionality • face space • compare projections onto face space to recognize faces

  14. PageRank – Power Iteration • Column j has nonzero elements in positions corresponding to outlinks of j ($N_j$ in total) • Row i has nonzero elements in positions corresponding to inlinks of i ($I_i$).

  15. Column-Stochastic & Irreducible • Column-Stochastic • where • Irreducible

  16. Iterative PageRank Calculation • For k = 1, 2, …: $r_k = A r_{k-1}$ • Equivalently, r solves $A r = \lambda r$ (λ = 1, A is a Markov chain transition matrix) • Why can we use power iteration to find the first eigenvector?

  17. Convergence of the power iteration • Expand the initial approximation r0 in terms of the eigenvectors
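The iterative PageRank calculation can be sketched in a few lines. This is a minimal illustration under stated assumptions: a hypothetical three-page link graph that is irreducible and aperiodic (so plain power iteration converges), no random-jump term, and a hypothetical function name `pagerank_power_iteration`.

```python
def pagerank_power_iteration(outlinks, iters=200):
    """Power iteration r <- A r on a column-stochastic link matrix A.

    outlinks maps each page j to the list of pages it links to.
    Assumes every page has at least one outlink.
    """
    n = len(outlinks)
    # A[i][j] = 1/N_j if page j links to page i, else 0 (columns sum to 1).
    A = [[0.0] * n for _ in range(n)]
    for j, targets in outlinks.items():
        for i in targets:
            A[i][j] = 1.0 / len(targets)
    r = [1.0 / n] * n  # uniform initial approximation r_0
    for _ in range(iters):
        r = [sum(A[i][j] * r[j] for j in range(n)) for i in range(n)]
    return r

# Hypothetical graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0.
links = {0: [1, 2], 1: [2], 2: [0]}
rank = pagerank_power_iteration(links)
```

Because A is column-stochastic, the entries of `rank` keep summing to 1. Every component of $r_0$ along an eigenvector with $|\lambda| < 1$ shrinks geometrically, which is why the iteration settles on the eigenvector for λ = 1.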

  18. Singular value decomposition

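  19. SVD - Definition • Any m × n matrix A, with m ≥ n, can be factorized as $A = U \Sigma V^T$, where U and V have orthonormal columns and Σ is diagonal with nonnegative entries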

  20. Singular Values And Singular Vectors • The diagonal elements $\sigma_j$ of Σ are the singular values of the matrix A. • The columns of U and V are the left singular vectors and right singular vectors respectively. • Equivalent form of SVD:

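  21. Matrix approximation • Theorem: Let $U_k = (u_1\ u_2\ \dots\ u_k)$, $V_k = (v_1\ v_2\ \dots\ v_k)$ and $\Sigma_k = \mathrm{diag}(\sigma_1, \sigma_2, \dots, \sigma_k)$, and define $A_k = U_k \Sigma_k V_k^T$ • Then $\min_{\mathrm{rank}(B) \le k} \|A - B\|_2 = \|A - A_k\|_2 = \sigma_{k+1}$ • This means that the best rank-k approximation of the matrix A is $A_k$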

  22. SVD and PCA • We can write: $A^T A = V \Sigma^2 V^T$ • Remember that in PCA, we treat A as a row matrix • The columns of V are the eigenvectors of $A^T A$, i.e. the principal directions of the row matrix A • We use V to approximate a row in A • Equivalently, we can write: $A A^T = U \Sigma^2 U^T$ • The columns of U are the eigenvectors of $A A^T$, i.e. the principal directions of the column matrix A • We use U to approximate a column in A

  23. Example - LSI • Build a term-by-document matrix A • Compute the SVD of A: $A = U \Sigma V^T$ • Approximate A by $A_k = U_k \Sigma_k V_k^T = U_k D_k$, where $D_k = \Sigma_k V_k^T$ • $U_k$: orthogonal basis that we use to approximate all the documents • $D_k$: column j holds the coordinates of document j in the new basis • $D_k$ is the projection of A onto the subspace spanned by $U_k$.

  24. SVD and PCA • For symmetric A, SVD is closely related to PCA • PCA: $A = U \Lambda U^T$ • U and Λ are the eigenvectors and eigenvalues. • SVD: $A = U \Lambda V^T$ • U is the left (column) eigenvectors • V is the right (row) eigenvectors • Λ is the same eigenvalues • For symmetric A, the column eigenvectors equal the row eigenvectors • Note the difference of A in PCA and SVD • SVD: A is directly the data, e.g. a term-by-document matrix • PCA: A is the covariance matrix, $A = X^T X$, each row in X is a sample

  25. Latent Semantic Indexing (LSI) • Document file preparation / preprocessing: • Indexing: collecting terms • Use a stop list: eliminate "meaningless" words • Stemming • Construct the term-by-document matrix, using sparse matrix storage. • Query matching: distance measures. • Data compression by low-rank approximation: SVD • Ranking and relevance feedback.

  26. Latent Semantic Indexing • Assumption: there is some underlying latent semantic structure in the data. • E.g. car and automobile occur in similar documents, as do cows and sheep. • This structure can be enhanced by projecting the data (the term-by-document matrix and the queries) onto a lower dimensional space using SVD.

  27. Similarity Measures • Term to term: $A A^T = U \Sigma^2 U^T = (U \Sigma)(U \Sigma)^T$; $U \Sigma$ are the coordinates of A (rows) projected to space V • Document to document: $A^T A = V \Sigma^2 V^T = (V \Sigma)(V \Sigma)^T$; $V \Sigma$ are the coordinates of A (columns) projected to space U

  28. Similarity Measures • Term to document: $A = U \Sigma V^T = (U \Sigma^{1/2})(V \Sigma^{1/2})^T$; $U \Sigma^{1/2}$ are the coordinates of A (rows) projected to space V; $V \Sigma^{1/2}$ are the coordinates of A (columns) projected to space U

  29. HITS (Hyperlink Induced Topic Search) • Idea: the Web includes two flavors of prominent pages: • authorities contain high-quality information, • hubs are comprehensive lists of links to authorities • A page is a good authority if many hubs point to it. • A page is a good hub if it points to many authorities. • Good authorities are pointed to by good hubs and good hubs point to good authorities.

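  30. Power Iteration • Each page i has both a hub score $h_i$ and an authority score $a_i$. • HITS successively refines these scores by computing $a_i \leftarrow \sum_{j:\, j \to i} h_j$ and $h_i \leftarrow \sum_{j:\, i \to j} a_j$ • Define the adjacency matrix L of the directed web graph: $L_{ij} = 1$ if page i links to page j, and 0 otherwise • Now $a = L^T h$ and $h = L a$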

  31. HITS and SVD • L: rows are outlinks, columns are inlinks. • a will be the dominant eigenvector of the authority matrix $L^T L$ • h will be the dominant eigenvector of the hub matrix $L L^T$ • h and a are in fact the first left and right singular vectors of L! • We are in fact running SVD on the adjacency matrix.
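The HITS iteration can be sketched as follows, assuming the adjacency convention L[i][j] = 1 when page i links to page j (rows are outlinks). The four-page graph, the normalization to unit length after each step, and the function names are assumptions made for this illustration.

```python
import math

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def hits(L, iters=100):
    """Alternate a <- L^T h and h <- L a, rescaling to unit length each time."""
    n = len(L)
    h = [1.0] * n
    a = [1.0] * n
    for _ in range(iters):
        # Authority score: sum of hub scores of the pages pointing to j.
        a = normalize([sum(L[i][j] * h[i] for i in range(n)) for j in range(n)])
        # Hub score: sum of authority scores of the pages i points to.
        h = normalize([sum(L[i][j] * a[j] for j in range(n)) for i in range(n)])
    return a, h

# Hypothetical graph: pages 0 and 1 act as hubs, pages 2 and 3 as authorities.
# Links: 0 -> 2, 0 -> 3, 1 -> 2.
L = [[0, 0, 1, 1],
     [0, 0, 1, 0],
     [0, 0, 0, 0],
     [0, 0, 0, 0]]
a, h = hits(L)
```

Page 2 receives links from both hubs and ends up with the largest authority score, while page 0 links to both authorities and ends up with the largest hub score. Substituting one update into the other gives $a \leftarrow L^T L a$, which is power iteration on the authority matrix.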

  32. HITS vs. PageRank • PageRank may be computed once, HITS is computed per query. • HITS takes the query into account, PageRank doesn't. • PageRank has no concept of hubs • HITS is sensitive to local topology: insertion or deletion of a small number of nodes may change the scores a lot. • PageRank is more stable, because of its random jump step.

  33. NMF – Non-Negative Matrix Factorization

  34. Definition • Given a non-negative matrix $V_{n \times m}$, find non-negative matrix factors $W_{n \times k}$ and $H_{k \times m}$ such that $V_{n \times m} \approx W_{n \times k} H_{k \times m}$ • V: column matrix, each column is a data sample (n-dimensional) • W: k bases, each column $w_i$ is one basis vector • H: coordinates of V projected onto W, i.e. $v_j \approx W_{n \times k} h_j$

  35. Motivation • Non-negativity is natural in many applications... • Probability is also non-negative • Additive model to capture local structure

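  36. Multiplicative Update Algorithm • Cost function: Euclidean distance $\|V - WH\|^2 = \sum_{ij} (V_{ij} - (WH)_{ij})^2$ • Multiplicative update: $H_{a\mu} \leftarrow H_{a\mu} \frac{(W^T V)_{a\mu}}{(W^T W H)_{a\mu}}$, $W_{ia} \leftarrow W_{ia} \frac{(V H^T)_{ia}}{(W H H^T)_{ia}}$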
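A minimal plain-Python sketch of the multiplicative updates for the Euclidean cost, following the rules in the Lee and Seung NIPS paper cited in the references. The small constant `eps` in the denominators, the random initialization, the toy matrix `V`, and all function names are assumptions of this sketch rather than part of the slides.

```python
import random

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))]
            for i in range(len(X))]

def transpose(X):
    return [list(col) for col in zip(*X)]

def squared_error(V, W, H):
    """Euclidean cost ||V - WH||^2 summed over all entries."""
    WH = matmul(W, H)
    return sum((V[i][j] - WH[i][j]) ** 2
               for i in range(len(V)) for j in range(len(V[0])))

def nmf(V, k, iters=200, seed=0, eps=1e-12):
    """Factor non-negative V (n x m) as W (n x k) times H (k x m)."""
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    # Strictly positive random start (an assumption; any positive start works).
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    errors = [squared_error(V, W, H)]
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H), element-wise.
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[H[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)]
             for a in range(k)]
        # W <- W * (V H^T) / (W H H^T), element-wise.
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[W[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)]
             for i in range(n)]
        errors.append(squared_error(V, W, H))
    return W, H, errors

# Hypothetical non-negative data: 4-dimensional samples in 3 columns.
V = [[1, 2, 3],
     [2, 4, 6],
     [1, 0, 1],
     [3, 2, 5]]
W, H, errors = nmf(V, k=2)
```

Because each update only multiplies by a non-negative ratio, W and H stay non-negative without any explicit projection step, and the recorded `errors` do not increase from one iteration to the next.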

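  37. Multiplicative Update Algorithm • Cost function: divergence $D(A \| B) = \sum_{ij} \left( A_{ij} \log \frac{A_{ij}}{B_{ij}} - A_{ij} + B_{ij} \right)$ • Reduces to the Kullback-Leibler divergence when $\sum_{ij} A_{ij} = \sum_{ij} B_{ij} = 1$ • A and B can then be regarded as normalized probability distributions. • Multiplicative update: $H_{a\mu} \leftarrow H_{a\mu} \frac{\sum_i W_{ia} V_{i\mu} / (WH)_{i\mu}}{\sum_k W_{ka}}$, $W_{ia} \leftarrow W_{ia} \frac{\sum_\mu H_{a\mu} V_{i\mu} / (WH)_{i\mu}}{\sum_\nu H_{a\nu}}$ • PLSA is NMF with KL divergence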

  38. NMF vs PCA • m = 2429 faces, n = 19×19 pixels • Positive values are illustrated with black pixels and negative values with red pixels • NMF: parts-based representation • PCA: holistic representation

  39. Reference • D. D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. NIPS, 2001. • D. D. Lee and H. S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature 401, 788-791 (1999).

  40. Major Reference • Saara Hyvönen, Linear Algebra Methods for Data Mining, Spring 2007, University of Helsinki (highly recommended)

  41. Outline • Basic concepts • Likelihood, i.i.d. • ML, MAP and Bayesian Inference • Expectation-Maximization • Gaussian mixture parameter estimation • pLSA • Motivation • Derivation & geometric properties • Applications • LDA • Motivation - why add a hyperparameter • Dirichlet Distribution • Variational EM • Relations with other topic models • Incorporating category information • Summary

  42. Not Included • General graphical model theories • Markov random field (belief propagation) • Detailed derivation of LDA

  43. Basic Concepts

  44. What Is Machine Learning? Data • Let $x = (x_1, x_2, \dots, x_D)^T$ denote a data point, and $D = \{x^{(1)}, x^{(2)}, \dots, x^{(N)}\}$ a data set. D is sometimes associated with desired outputs $y_1, y_2, \dots$ Predictions • We are generally interested in predicting something based on the observed data set. • Given D, what can we say about $x^{(N+1)}$? Model • To make predictions, we need to make some assumptions. We can often express these assumptions in the form of a model, with some parameters θ • Given data D, we learn the model parameters θ, from which we can predict new data points. • The model can often be expressed as a probability distribution over data points

  45. Likelihood Function • Given a set of parameter values, the probability density function (PDF) shows that some data are more probable than other data. • Conversely, given the observed data and a model of interest, the likelihood function is defined as: $L(\theta) = f_\theta(x \mid \theta) = p(x \mid \theta)$ • That is, the likelihood function L(θ) shows that some parameters are more likely than others to have produced the data

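  46. Maximum Likelihood (ML) • Maximum likelihood finds the model parameters under which the data are "most likely" to have been generated by the model. • Suppose we are given n data samples $(x_1, x_2, \dots, x_n)$, with $L(\theta) = p(x_1, x_2, \dots, x_n \mid \theta)$ • Maximum likelihood finds the θ that maximizes L(θ): $\theta_{ML} = \arg\max_\theta L(\theta)$ • Predictive distribution: $p(x \mid D) \approx p(x \mid \theta_{ML})$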

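  47. I.I.D. – Independent, Identically Distributed • I.I.D. means $p(x_1, x_2, \dots, x_n \mid \theta) = \prod_{i=1}^{n} p(x_i \mid \theta)$ • The problem is considerably simplified as: $L(\theta) = \prod_{i=1}^{n} p(x_i \mid \theta)$ • Usually, the log-likelihood is used: $\log L(\theta) = \sum_{i=1}^{n} \log p(x_i \mid \theta)$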
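As a worked instance of maximum likelihood under the i.i.d. assumption, the sketch below fits a single one-dimensional Gaussian. The Gaussian model, the sample values, and the function names are assumptions chosen for illustration; the slide itself does not fix a model.

```python
import math

def gaussian_log_likelihood(xs, mu, sigma2):
    """log L(theta) = sum_i log N(x_i | mu, sigma2) for i.i.d. samples."""
    n = len(xs)
    return (-0.5 * n * math.log(2 * math.pi * sigma2)
            - sum((x - mu) ** 2 for x in xs) / (2 * sigma2))

def gaussian_mle(xs):
    """Closed-form maximizer of the Gaussian log-likelihood."""
    n = len(xs)
    mu = sum(xs) / n                               # sample mean
    sigma2 = sum((x - mu) ** 2 for x in xs) / n    # variance with 1/n, not 1/(n-1)
    return mu, sigma2

# Hypothetical observations.
samples = [2.1, 1.9, 2.4, 2.0, 1.6]
mu_ml, sigma2_ml = gaussian_mle(samples)
best = gaussian_log_likelihood(samples, mu_ml, sigma2_ml)
```

Evaluating `gaussian_log_likelihood` at any other `(mu, sigma2)` pair gives a value no larger than `best`. Working with the sum of logs rather than the product of densities also avoids numerical underflow when n is large.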

  48. Reference • Zoubin Ghahramani, Machine Learning (4F13), 2006, Cambridge (Introduction to Machine Learning, Lectures 1-2 Slides) • Gregor Heinrich, Parameter estimation for text analysis, Technical Note, 2005-2008
