Ch 9. Mixture Models and EM. Pattern Recognition and Machine Learning, C. M. Bishop, 2006. Summarized by Ho-Sik Seok, Biointelligence Laboratory, Seoul National University, http://bi.snu.ac.kr/
Contents • 9.1 K-means Clustering • 9.2 Mixtures of Gaussians • 9.2.1 Maximum likelihood • 9.2.2 EM for Gaussian mixtures • 9.3 An Alternative View of EM • 9.3.1 Gaussian mixtures revisited • 9.3.2 Relation to K-means • 9.3.3 Mixtures of Bernoulli distributions • 9.3.4 EM for Bayesian linear regression • 9.4 The EM Algorithm in General (C) 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/
9.1 K-means Clustering (1/3) • Problem of identifying groups, or clusters, of data points in a multidimensional space • Partitioning the data set into some number K of clusters • Cluster: a group of data points whose inter-point distances are small compared with the distances to points outside of the cluster • Goal: an assignment of data points to clusters such that the sum of the squares of the distances of each data point to its closest prototype vector (the center of the cluster) is a minimum; this objective is the distortion measure J written out below (C) 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/
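For reference, the distortion measure J that the slide refers to is the standard one from the chapter, with binary indicator variables r_nk describing the assignment of data point x_n to cluster k:

```latex
J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \,\lVert \mathbf{x}_n - \boldsymbol{\mu}_k \rVert^{2},
\qquad r_{nk} \in \{0,1\}, \quad \sum_{k=1}^{K} r_{nk} = 1 .
```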
9.1 K-means Clustering (2/3) • Two-stage optimization • In the 1st stage: minimizing J with respect to the rnk, keeping the μk fixed • In the 2nd stage: minimizing J with respect to the μk, keeping the rnk fixed; the resulting μk is the mean of all of the data points assigned to cluster k (both closed-form updates are restated below) (C) 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/
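The two stages correspond to the familiar closed-form updates, restated here since the slide equations are not reproduced above:

```latex
r_{nk} =
\begin{cases}
1 & \text{if } k = \arg\min_{j} \lVert \mathbf{x}_n - \boldsymbol{\mu}_j \rVert^{2} \\
0 & \text{otherwise}
\end{cases},
\qquad
\boldsymbol{\mu}_k = \frac{\sum_{n} r_{nk}\, \mathbf{x}_n}{\sum_{n} r_{nk}} .
```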
9.1 K-means Clustering (3/3) (C) 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/
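A minimal NumPy sketch of the two-stage procedure above; the function name kmeans, the random initialization, and the convergence test on the prototypes are illustrative choices rather than anything prescribed by the slides:

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Minimal K-means: alternate the assignment step and the mean-update step."""
    rng = np.random.default_rng(seed)
    # Initialize the K prototype vectors with randomly chosen data points
    mu = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iters):
        # Stage 1: assign each point to its nearest prototype (minimize J w.r.t. r_nk)
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1)  # (N, K) squared distances
        r = d2.argmin(axis=1)
        # Stage 2: set each prototype to the mean of its assigned points (minimize J w.r.t. mu_k)
        new_mu = np.array([X[r == k].mean(axis=0) if np.any(r == k) else mu[k] for k in range(K)])
        if np.allclose(new_mu, mu):
            break
        mu = new_mu
    return mu, r
```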
9.2 Mixtures of Gaussians (1/3) A formulation of Gaussian mixtures in terms of discrete latent variables • The Gaussian mixture distribution can be written as a linear superposition of Gaussians • An equivalent formulation of the Gaussian mixture involves an explicit latent variable • Graphical representation of a mixture model • A binary random variable z having a 1-of-K representation • The marginal distribution of x is a Gaussian mixture of the form (*), restated below; for every observed data point xn, there is a corresponding latent variable zn (C) 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/
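The mixture density labelled (*) and its latent-variable form are the standard ones from the chapter:

```latex
p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k \,\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)
\quad\text{...(*)}
\qquad\text{with}\qquad
p(z_k = 1) = \pi_k, \quad
p(\mathbf{x} \mid z_k = 1) = \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k),
```
so that marginalizing the latent variable, p(x) = Σ_z p(z) p(x|z), recovers the superposition above.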
9.2 Mixtures of Gaussians (2/3) • γ(zk) can also be viewed as the responsibility that component k takes for 'explaining' the observation x (C) 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/
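The responsibility is simply the posterior probability of the component given the observation, obtained from Bayes' theorem:

```latex
\gamma(z_k) \equiv p(z_k = 1 \mid \mathbf{x})
= \frac{\pi_k \,\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}
       {\sum_{j=1}^{K} \pi_j \,\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)} .
```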
9.2 Mixtures of Gaussians (3/3) • Generating random samples distributed according to the Gaussian mixture model • Generate a value for z, denoted ẑ, from the marginal distribution p(z), and then generate a value for x from the conditional distribution p(x | ẑ) • The three states of z, corresponding to the three components of the mixture, are depicted in red, green, and blue • The corresponding samples from the marginal distribution p(x) • The same samples in which the colors represent the value of the responsibilities γ(znk) associated with each data point • Illustrating the responsibilities by evaluating the posterior probability for each component in the mixture distribution from which this data set was generated (C) 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/
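A minimal NumPy sketch of this ancestral sampling procedure; the function name sample_gmm and its argument names (pi, mus, covs) are illustrative:

```python
import numpy as np

def sample_gmm(pi, mus, covs, n_samples, seed=0):
    """Ancestral sampling: draw z from p(z), then x from p(x | z)."""
    rng = np.random.default_rng(seed)
    K = len(pi)
    # Draw component indices z_n ~ Categorical(pi)
    z = rng.choice(K, size=n_samples, p=pi)
    # Draw x_n ~ N(mu_{z_n}, Sigma_{z_n}) for each sampled component
    x = np.stack([rng.multivariate_normal(mus[k], covs[k]) for k in z])
    return x, z
```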
9.2.1 Maximum likelihood (1/3) • Graphical representation of a Gaussian mixture model for a set of N i.i.d. data points {xn}, with corresponding latent points {zn} • The log of the likelihood function …(*1) (C) 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/
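The log likelihood labelled (*1) is the standard incomplete-data log likelihood of the Gaussian mixture:

```latex
\ln p(\mathbf{X} \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma})
= \sum_{n=1}^{N} \ln \Big\{ \sum_{k=1}^{K} \pi_k \,\mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \Big\}.
\quad\text{...(*1)}
```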
9.2.1 Maximum likelihood (2/3) • For simplicity, consider a Gaussian mixture whose components have covariance matrices given by Σk = σk²I • Suppose that one of the components of the mixture model has its mean μj exactly equal to one of the data points, so that μj = xn • This data point will contribute a term to the likelihood function of the form shown below • Once there are at least two components in the mixture, one of the components can have a finite variance and therefore assign finite probability to all of the data points, while the other component can shrink onto one specific data point and thereby contribute an ever-increasing additive value to the log likelihood → over-fitting problem (C) 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/
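Assuming the simplified covariances Σk = σk²I for D-dimensional data, the contribution of such a collapsing component is

```latex
\mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_j, \sigma_j^{2}\mathbf{I})\,\Big|_{\boldsymbol{\mu}_j = \mathbf{x}_n}
= \frac{1}{(2\pi)^{D/2}\,\sigma_j^{D}}
\;\longrightarrow\; \infty
\quad\text{as } \sigma_j \to 0 ,
```
so the log likelihood can be made arbitrarily large by shrinking one component onto a single data point.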
9.2.1 Maximum likelihood (3/3) • Over-fitting problem • Example of over-fitting in a maximum likelihood approach • This problem does not occur in the case of a Bayesian approach • In applying maximum likelihood to Gaussian mixture models, there should be steps to avoid finding such pathological solutions and instead seek local maxima of the likelihood function that are well behaved • Identifiability problem • A K-component mixture will have a total of K! equivalent solutions corresponding to the K! ways of assigning K sets of parameters to K components • Difficulty of maximizing the log likelihood function (*1): the difficulty arises from the presence of the summation over k that appears inside the logarithm in (*1) (C) 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/
9.2.2 EM for Gaussian mixtures (1/4) • Assign some initial values for the means, covariances, and mixing coefficients • Expectation or E step • Using the current values of the parameters to evaluate the posterior probabilities, or responsibilities • Maximization or M step • Using the E-step responsibilities to re-estimate the means, covariances, and mixing coefficients • It is common to run the K-means algorithm in order to find suitable initial values for the means • The covariance matrices can be initialized to the sample covariances of the clusters found by the K-means algorithm • The mixing coefficients can be initialized to the fractions of data points assigned to the respective clusters (C) 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/
9.2.2 EM for Gaussian mixtures (2/4) • Given a Gaussian mixture model, the goal is to maximize the likelihood function with respect to the parameters • Initialize the means μk, covariances Σk, and mixing coefficients πk • E step • M step • Evaluate the log likelihood (the update equations are restated below) (C) 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/
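The E step, M step, and log-likelihood evaluations referred to above are the standard Gaussian-mixture EM updates:

```latex
\text{E step:}\quad
\gamma(z_{nk}) = \frac{\pi_k \,\mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}
                     {\sum_{j=1}^{K} \pi_j \,\mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}

\text{M step:}\quad
N_k = \sum_{n=1}^{N} \gamma(z_{nk}), \qquad
\boldsymbol{\mu}_k^{\text{new}} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\,\mathbf{x}_n, \qquad
\pi_k^{\text{new}} = \frac{N_k}{N},

\boldsymbol{\Sigma}_k^{\text{new}} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\,
(\mathbf{x}_n - \boldsymbol{\mu}_k^{\text{new}})(\mathbf{x}_n - \boldsymbol{\mu}_k^{\text{new}})^{\mathsf{T}}

\text{Log likelihood:}\quad
\ln p(\mathbf{X} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}, \boldsymbol{\pi})
= \sum_{n=1}^{N} \ln \Big\{ \sum_{k=1}^{K} \pi_k \,\mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \Big\}.
```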
9.2.2 EM for Gaussian mixtures (3/4) • Setting the derivatives of the log likelihood (*2) with respect to the means of the Gaussian components to zero gives a weighted mean of all of the points in the data set • Each data point is weighted by the corresponding posterior probability (the responsibility γ(znk)), and the denominator is given by the effective number of points associated with the corresponding component • Setting the derivatives of (*2) with respect to the covariances of the Gaussian components to zero gives an analogous responsibility-weighted covariance (C) 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/
9.2.2 EM for Gaussian mixtures (4/4) (C) 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/
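A minimal NumPy/SciPy sketch of the EM iteration summarized in 9.2.2; the function name em_gmm, the fixed iteration count, and the use of scipy.stats.multivariate_normal are illustrative choices:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, pi, mus, covs, n_iters=50):
    """EM for a Gaussian mixture: alternate responsibilities (E step) and parameter updates (M step)."""
    N, D = X.shape
    K = len(pi)
    for _ in range(n_iters):
        # E step: responsibilities gamma(z_nk) via Bayes' theorem
        dens = np.stack([pi[k] * multivariate_normal.pdf(X, mus[k], covs[k])
                         for k in range(K)], axis=1)           # shape (N, K)
        # Log likelihood under the current parameters (for monitoring convergence)
        log_lik = np.log(dens.sum(axis=1)).sum()
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M step: re-estimate means, covariances, and mixing coefficients
        Nk = gamma.sum(axis=0)                                  # effective number of points per component
        mus = (gamma.T @ X) / Nk[:, None]
        covs = [((gamma[:, k, None] * (X - mus[k])).T @ (X - mus[k])) / Nk[k] for k in range(K)]
        pi = Nk / N
    return pi, mus, covs, log_lik
```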
9.3 An Alternative View of EM • General EM • Maximizing the log likelihood function • Given a joint distribution p(X, Z|Θ) over observed variables X and latent variables Z, governed by parameters Θ • Choose an initial setting for the parameters Θold • E step: Evaluate p(Z|X, Θold) • M step: Evaluate Θnew given by Θnew = argmaxΘ Q(Θ, Θold), where Q(Θ, Θold) = ΣZ p(Z|X, Θold) ln p(X, Z|Θ) • If the convergence criterion is not satisfied, then let Θold ← Θnew and return to the E step (C) 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/
9.3.1 Gaussian mixtures revisited (1/2) • For the complete data set {X, Z}, the likelihood function takes the form restated below • The logarithm acts directly on the Gaussian distribution → much simpler solution to the maximum likelihood problem • Because zn uses a 1-of-K coding, the complete-data log likelihood function is a sum of K independent contributions, one for each mixture component → the maximization with respect to a mean or a covariance is exactly as for a single Gaussian • The mixing coefficients are equal to the fractions of data points assigned to the corresponding components (C) 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/
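The complete-data likelihood and log likelihood referred to above are

```latex
p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}, \boldsymbol{\pi})
= \prod_{n=1}^{N} \prod_{k=1}^{K}
  \pi_k^{z_{nk}} \,\mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)^{z_{nk}},

\ln p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}, \boldsymbol{\pi})
= \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk}
  \big\{ \ln \pi_k + \ln \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \big\}.
```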
9.3.1 Gaussian mixtures revisited (2/2) • Because the latent variables are unknown, we consider the expectation of the complete-data log likelihood with respect to the posterior distribution of the latent variables • The posterior distribution of the latent variables • The expected value of the indicator variable znk under this posterior distribution is the responsibility γ(znk) • The expected value of the complete-data log likelihood function …(*3), restated below (C) 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/
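The expectations referred to above, including the quantity labelled (*3), are

```latex
\mathbb{E}[z_{nk}] = \gamma(z_{nk})
= \frac{\pi_k \,\mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}
       {\sum_{j=1}^{K} \pi_j \,\mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)},

\mathbb{E}_{\mathbf{Z}}\big[\ln p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}, \boldsymbol{\pi})\big]
= \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma(z_{nk})
  \big\{ \ln \pi_k + \ln \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \big\}.
\quad\text{...(*3)}
```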
9.4 The EM Algorithm in General (1/3) • The EM algorithm is a general technique for finding maximum likelihood solutions for probabilistic models having latent variables (C) 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/
9.4 The EM Algorithm in General (2/3) • Illustration of the decomposition of the log likelihood into a lower bound L(q, Θ) and a Kullback-Leibler divergence (written out below) • Illustration of the E step - The q distribution is set equal to the posterior distribution for the current parameter values, causing the lower bound to move up to the same value as the log likelihood function • Illustration of the M step - The distribution q(Z) is held fixed and the lower bound L(q, Θ) is maximized with respect to the parameter Θ to give a revised value Θnew (C) 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/
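The decomposition illustrated above holds for any choice of distribution q(Z):

```latex
\ln p(\mathbf{X} \mid \boldsymbol{\Theta})
= \mathcal{L}(q, \boldsymbol{\Theta}) + \mathrm{KL}(q \,\|\, p),

\mathcal{L}(q, \boldsymbol{\Theta})
= \sum_{\mathbf{Z}} q(\mathbf{Z}) \ln \frac{p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\Theta})}{q(\mathbf{Z})},
\qquad
\mathrm{KL}(q \,\|\, p)
= -\sum_{\mathbf{Z}} q(\mathbf{Z}) \ln \frac{p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\Theta})}{q(\mathbf{Z})},
```
and since KL(q‖p) ≥ 0, L(q, Θ) is a lower bound on the log likelihood.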
9.4 The EM Algorithm in General (3/3) • E step • The lower bound L(q, Θold) is maximized with respect to q(Z) while holding Θold fixed • The largest value of L(q, Θold) will occur when the Kullback-Leibler divergence vanishes, i.e., when q(Z) equals the posterior distribution p(Z|X, Θold) • M step • The distribution q(Z) is held fixed and the lower bound L(q, Θ) is maximized with respect to Θ • This will cause the lower bound L to increase (unless it is already at a maximum), which in turn causes the corresponding log likelihood function to increase • The quantity that is being maximized is the expectation of the complete-data log likelihood (as shown below) (C) 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/
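With q(Z) set equal to p(Z|X, Θold) in the E step, the quantity maximized in the M step is the expected complete-data log likelihood Q(Θ, Θold) plus a term that does not depend on Θ:

```latex
\mathcal{L}(q, \boldsymbol{\Theta})
= \sum_{\mathbf{Z}} p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\Theta}^{\text{old}})
  \ln p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\Theta})
- \sum_{\mathbf{Z}} p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\Theta}^{\text{old}})
  \ln p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\Theta}^{\text{old}})
= Q(\boldsymbol{\Theta}, \boldsymbol{\Theta}^{\text{old}}) + \text{const}.
```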