
Unsupervised Learning II: Soft Clustering with Gaussian Mixture Models


  1. Unsupervised Learning II: Soft Clustering with Gaussian Mixture Models

  2. K-means assigns each data point to exactly one cluster. But many data sets have points that could plausibly belong to more than one cluster, and a hard assignment throws that ambiguity away.

  3. Soft Clustering with Gaussian mixture models • A “soft” version of K-means clustering • Given: Training set S = {x1, ..., xN}, and K. • Assumption: Data is generated by sampling probabilistically from a “mixture” (linear combination) of K Gaussians. [Figure: probability density of the mixture plotted over the data points]

  4. Gaussian Mixture Models Assumptions • K clusters • Each cluster is modeled by a Gaussian distribution with a certain mean and standard deviation (or covariance). [This contrasts with K-means, in which each cluster is modeled only by a mean.] • Assume that each data instance we have was generated by the following procedure: 1. Select cluster ci with probability P(ci) = πi. 2. Sample a point from ci's Gaussian distribution.
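
The two-step generative procedure above is easy to simulate. Below is a minimal sketch; the mixing weights `pis`, means `mus`, and standard deviations `sigmas` are made-up illustrative values, not parameters taken from the slides:

```python
# Minimal sketch of the generative procedure:
# 1. pick a cluster c_i with probability pi_i, 2. sample from that cluster's Gaussian.
import numpy as np

rng = np.random.default_rng(0)
pis = np.array([0.5, 0.3, 0.2])      # mixing weights P(c_i) = pi_i (must sum to 1)
mus = np.array([-2.0, 0.0, 3.0])     # cluster means (illustrative)
sigmas = np.array([0.5, 1.0, 0.8])   # cluster standard deviations (illustrative)

def sample_gmm(n):
    clusters = rng.choice(len(pis), size=n, p=pis)        # step 1: choose a cluster
    return rng.normal(mus[clusters], sigmas[clusters])    # step 2: sample from its Gaussian

x = sample_gmm(1000)
```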

  5. Mixture of three Gaussians (one-dimensional data). [Figure: the mixture's probability density plotted over the data points]

  6. Clustering via Gaussian mixture models • Clusters: Each cluster will correspond to a single Gaussian. Each point x ∈ S will have some probability distribution over the K clusters. • Goal: Given the data S, find the Gaussians! (And their probabilities πk.) That is, find parameters {θk} of these K Gaussians such that P(S | {θk}) is maximized. • This is called a Maximum Likelihood method: S is the data, {θk} is the “hypothesis” or “model”, and P(S | {θk}) is the “likelihood”.
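
Before deriving anything by hand, it may help to see the end goal in code. The sketch below uses scikit-learn's GaussianMixture as one possible off-the-shelf fitter; the placeholder data X and the choice K = 3 are assumptions for illustration only:

```python
# Sketch: fit a K-component GMM and read off the soft cluster assignments.
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.default_rng(1).normal(size=(500, 1))   # placeholder data S (500 points, 1-D)
K = 3

gmm = GaussianMixture(n_components=K, random_state=0).fit(X)
responsibilities = gmm.predict_proba(X)   # row n: P(cluster k | x_n), the "soft" assignment
print(gmm.weights_)                       # learned mixing probabilities pi_k
print(gmm.means_)                         # learned cluster means
```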

  7. General form of the one-dimensional (univariate) Gaussian Mixture Model: p(x) = Σk=1..K πk N(x | μk, σk²), where N(x | μk, σk²) is a Gaussian density with mean μk and variance σk², and the mixing weights satisfy Σk πk = 1.
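
Read as code, the mixture density at x is just the π-weighted sum of K univariate Gaussian densities. A small sketch, with illustrative parameter arrays rather than values from the slides:

```python
# Sketch: evaluate p(x) = sum_k pi_k * N(x | mu_k, sigma_k^2) for a univariate GMM.
import numpy as np

def gmm_pdf(x, pis, mus, sigmas):
    x = np.asarray(x, dtype=float)[..., None]    # shape (..., 1) so it broadcasts over K
    comp = np.exp(-0.5 * ((x - mus) / sigmas) ** 2) / (sigmas * np.sqrt(2 * np.pi))
    return (pis * comp).sum(axis=-1)             # weighted sum of the K component densities

p = gmm_pdf([0.0, 1.5],
            pis=np.array([0.5, 0.5]),
            mus=np.array([-1.0, 2.0]),
            sigmas=np.array([0.7, 1.2]))
```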

  8. Learning a GMM (Simplest Case): Maximum Likelihood for a Single Univariate Gaussian • Assume training set S has N values generated by a univariate Gaussian distribution: p(x | μ, σ) = (1 / (σ√(2π))) exp(-(x - μ)² / (2σ²)). • Likelihood function: probability of the data given the model: p(S | μ, σ) = Πn=1..N p(xn | μ, σ).
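
As a concrete (assumed) example, the likelihood of a sample under one candidate (μ, σ) is just the product of per-point Gaussian densities. Note that this product underflows quickly as N grows, which is one practical reason the next slides switch to the log-likelihood:

```python
# Sketch: p(S | mu, sigma) as a product of per-point densities (toy data and parameters).
import numpy as np

rng = np.random.default_rng(0)
S = rng.normal(loc=1.0, scale=2.0, size=50)   # N values drawn from one univariate Gaussian

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

likelihood = np.prod(gaussian_pdf(S, mu=1.0, sigma=2.0))   # p(S | mu, sigma)
```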

  9. How to estimate parameters μ and σ from S? • Maximize the likelihood function with respect to μ and σ. • We want the μ and σ that maximize the probability of the data. • Use the log of the likelihood to make this tractable.

  10. Log-Likelihood version: ln p(S | μ, σ) = ln Πn p(xn | μ, σ) = Σn=1..N ln p(xn | μ, σ). Find a simplified expression for this; substituting the Gaussian density gives ln p(S | μ, σ) = -(1/(2σ²)) Σn (xn - μ)² - N ln σ - (N/2) ln(2π).
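
A quick numerical sanity check of the simplified expression, using scipy's norm.logpdf as a reference; the data S and the candidate parameters are assumed toy values:

```python
# Sketch: verify the closed-form log-likelihood against scipy's reference implementation.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
S = rng.normal(loc=1.0, scale=2.0, size=50)
mu, sigma = 1.0, 2.0
N = len(S)

closed_form = (-np.sum((S - mu) ** 2) / (2 * sigma ** 2)
               - N * np.log(sigma)
               - N / 2 * np.log(2 * np.pi))
reference = norm.logpdf(S, loc=mu, scale=sigma).sum()
assert np.isclose(closed_form, reference)
```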

  11. Now, given the simplified expression for ln p(S | μ, σ), find the maximum likelihood parameters μ and σ. First, maximize with respect to μ: setting ∂/∂μ ln p(S | μ, σ) = (1/σ²) Σn (xn - μ) = 0 gives the result μML = (1/N) Σn xn, the sample mean. (ML = “Maximum Likelihood”)

  12. Now, maximize with respect to σ. It's actually easier to maximize with respect to σ². Setting ∂/∂σ² ln p(S | μ, σ) = (1/(2σ⁴)) Σn (xn - μ)² - N/(2σ²) = 0 gives the result σ²ML = (1/N) Σn (xn - μML)², the sample variance (with denominator N, not N - 1).
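
Putting the two results together, the maximum likelihood estimates are just the sample mean and the divide-by-N sample variance. A minimal sketch with assumed synthetic data:

```python
# Sketch: closed-form ML estimates for a single univariate Gaussian.
import numpy as np

rng = np.random.default_rng(0)
S = rng.normal(loc=1.0, scale=2.0, size=1000)   # synthetic data with true (mu, sigma) = (1, 2)

mu_ml = S.mean()                        # mu_ML = (1/N) * sum_n x_n
sigma2_ml = ((S - mu_ml) ** 2).mean()   # sigma^2_ML = (1/N) * sum_n (x_n - mu_ML)^2
print(mu_ml, np.sqrt(sigma2_ml))        # should be close to 1.0 and 2.0
```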

  13. The resulting distribution is called a generative model because it can generate new data values. • We say that θML = (μML, σML) parameterizes the model. • In general, θ is used to denote the (learnable) parameters of a probabilistic model.
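
Because the fitted Gaussian is a full probability distribution, it can be sampled to produce new values. A tiny sketch, with placeholder numbers standing in for the ML estimates computed above:

```python
# Sketch: drawing new data from the learned (generative) model.
import numpy as np

rng = np.random.default_rng(2)
mu_ml, sigma_ml = 0.97, 1.94                        # placeholder ML estimates
new_data = rng.normal(mu_ml, sigma_ml, size=10)     # freshly generated data values
```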

  14. Discriminative vs. generative models in machine learning • Discriminative model: creates a discrimination curve (decision boundary) to separate the data into classes. • Generative model: gives the parameters θML of a probability distribution that maximizes the likelihood p(S | θ) of the data.

  15. Learning a GMM, more general case: Multivariate Gaussian Distribution. Multivariate (D-dimensional) Gaussian: N(x | μ, Σ) = (1 / ((2π)^(D/2) |Σ|^(1/2))) exp(-(1/2) (x - μ)ᵀ Σ⁻¹ (x - μ)), where μ is the D-dimensional mean vector and Σ is the D×D covariance matrix.
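
The formula can be checked numerically. The sketch below evaluates the D-dimensional density both directly from the expression above and with scipy's multivariate_normal; the 2-D mean and covariance are illustrative assumptions:

```python
# Sketch: evaluate the multivariate Gaussian density two ways and compare.
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 1.0])                 # illustrative mean vector
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])            # illustrative covariance matrix
x = np.array([0.5, 0.5])
D = len(mu)

diff = x - mu
direct = (np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff))
          / np.sqrt((2 * np.pi) ** D * np.linalg.det(Sigma)))
assert np.isclose(direct, multivariate_normal(mean=mu, cov=Sigma).pdf(x))
```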

  16. Variance: var(xi) = E[(xi - μi)²]. Covariance: cov(xi, xj) = E[(xi - μi)(xj - μj)]. Covariance matrix Σ: Σi,j = cov(xi, xj), so the diagonal entries are the variances.
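
Estimating Σ from data is a one-liner in numpy; the layout assumption here is that each row of X is one data point:

```python
# Sketch: estimate the covariance matrix of 2-D data.
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 2))          # 200 two-dimensional points (rows are points)
Sigma_hat = np.cov(X, rowvar=False)    # Sigma_hat[i, j] = cov(x_i, x_j)
# Note: np.cov uses the unbiased N-1 denominator by default;
# pass bias=True for the divide-by-N (maximum likelihood) version.
```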

  17. Let S be a set of multivariate data points (vectors): S = {x1, ..., xN}. • General expression for a finite Gaussian mixture model: p(x) = Σk=1..K πk N(x | μk, Σk), with mixing weights πk ≥ 0 and Σk πk = 1.
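
In code, the multivariate mixture density is the same weighted sum as in the univariate case, just with vector means and covariance matrices. A minimal sketch with assumed parameters for K = 2 components:

```python
# Sketch: p(x) = sum_k pi_k * N(x | mu_k, Sigma_k) for a 2-D, 2-component mixture.
import numpy as np
from scipy.stats import multivariate_normal

pis = [0.6, 0.4]                                          # illustrative mixing weights
mus = [np.array([0.0, 0.0]), np.array([3.0, 1.0])]        # illustrative component means
Sigmas = [np.eye(2), np.array([[1.0, 0.5], [0.5, 1.5]])]  # illustrative covariances

def mixture_pdf(x):
    return sum(pi * multivariate_normal(mean=mu, cov=S).pdf(x)
               for pi, mu, S in zip(pis, mus, Sigmas))

p = mixture_pdf(np.array([1.0, 0.5]))
```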
