Unsupervised Learning II: Soft Clustering with Gaussian Mixture Models
K-means assigns each data point to exactly one cluster. But some points may plausibly belong to more than one cluster, which motivates a “soft” assignment.
Soft Clustering with Gaussian mixture models
• A “soft” version of K-means clustering
• Given: Training set S = {x1, ..., xN}, and K.
• Assumption: Data is generated by sampling probabilistically from a “mixture” (linear combination) of K Gaussians.
[Figure: the mixture’s probability density plotted over the data points.]
Gaussian Mixture Models: Assumptions
• K clusters
• Each cluster is modeled by a Gaussian distribution with a certain mean and standard deviation (or covariance). [This contrasts with K-means, in which each cluster is modeled only by a mean.]
• Assume that each data instance we have was generated by the following procedure:
  1. Select cluster ci with probability P(ci) = πi
  2. Sample a point from ci’s Gaussian distribution
Mixture of three Gaussians (one-dimensional data)
[Figure: the three-component mixture’s probability density plotted over the data points.]
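To make the two-step generative procedure above concrete, here is a minimal Python sketch; the mixing weights, means, and standard deviations are made-up values for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative (made-up) parameters for a three-component 1-D mixture.
pis    = np.array([0.5, 0.3, 0.2])   # mixing weights pi_i = P(c_i), must sum to 1
mus    = np.array([-4.0, 0.0, 3.0])  # component means
sigmas = np.array([1.0, 0.5, 1.5])   # component standard deviations

def sample_gmm(n):
    """Two-step generative procedure: (1) pick a cluster with probability pi_i,
    (2) sample a point from that cluster's Gaussian."""
    ks = rng.choice(len(pis), size=n, p=pis)      # step 1: cluster indices
    return rng.normal(mus[ks], sigmas[ks]), ks    # step 2: points, plus their clusters

points, clusters = sample_gmm(1000)
```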
Clustering via Gaussian mixture models
• Clusters: Each cluster will correspond to a single Gaussian. Each point x ∈ S will have some probability distribution over the K clusters.
• Goal: Given the data S, find the Gaussians! (And their probabilities πi.)
  i.e., find parameters {θk} of these K Gaussians such that P(S | {θk}) is maximized.
• This is called a Maximum Likelihood method.
  • S is the data
  • {θk} is the “hypothesis” or “model”
  • P(S | {θk}) is the “likelihood”.
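As a concrete reading of this objective, here is a small sketch (1-D case) of the quantity ln P(S | {θk}) that maximum likelihood tries to make as large as possible; the variable names match the hypothetical sampling example above.

```python
import numpy as np

def gmm_log_likelihood(points, pis, mus, sigmas):
    """ln P(S | {theta_k}) for a 1-D Gaussian mixture:
    sum over data points of ln( sum_k pi_k * N(x | mu_k, sigma_k^2) )."""
    x = np.asarray(points)[:, None]                    # shape (N, 1)
    dens = (pis / (np.sqrt(2.0 * np.pi) * sigmas)      # shape (N, K): pi_k * N(x_n | mu_k, sigma_k^2)
            * np.exp(-0.5 * ((x - mus) / sigmas) ** 2))
    return np.log(dens.sum(axis=1)).sum()

# e.g., with points, pis, mus, sigmas from the sampling sketch above:
# gmm_log_likelihood(points, pis, mus, sigmas)
```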
General form of one-dimensional (univariate) Gaussian Mixture Model
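In standard notation, that general form is a weighted sum of K univariate Gaussian densities:

```latex
p(x) = \sum_{k=1}^{K} \pi_k \,\mathcal{N}(x \mid \mu_k, \sigma_k^2),
\qquad
\mathcal{N}(x \mid \mu_k, \sigma_k^2)
  = \frac{1}{\sqrt{2\pi\sigma_k^2}}
    \exp\!\left(-\frac{(x-\mu_k)^2}{2\sigma_k^2}\right),
\qquad
\sum_{k=1}^{K} \pi_k = 1 .
```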
Learning a GMM. Simplest case: Maximum Likelihood for a single univariate Gaussian
• Assume training set S has N values generated by a univariate Gaussian distribution.
• Likelihood function: probability of the data given the model (written out below).
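With the xn drawn independently from N(μ, σ²), the likelihood is the product of the individual densities:

```latex
p(S \mid \mu, \sigma^2)
  = \prod_{n=1}^{N} \mathcal{N}(x_n \mid \mu, \sigma^2)
  = \prod_{n=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}}
      \exp\!\left(-\frac{(x_n-\mu)^2}{2\sigma^2}\right).
```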
How to estimate parameters μ and σ from S?
• Maximize the likelihood function with respect to μ and σ.
• We want the μ and σ that maximize the probability of the data.
• Use the log of the likelihood to make this tractable.
Log-Likelihood version: Find a simplified expression for this.
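Taking the log turns the product into a sum, and expanding the Gaussian gives the simplified expression:

```latex
\ln p(S \mid \mu, \sigma^2)
  = \sum_{n=1}^{N} \ln \mathcal{N}(x_n \mid \mu, \sigma^2)
  = -\frac{1}{2\sigma^2}\sum_{n=1}^{N}(x_n-\mu)^2
    - \frac{N}{2}\ln\sigma^2
    - \frac{N}{2}\ln(2\pi).
```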
Now, given the simplified expression for ln p(S | μ, σ), find the maximum likelihood parameters μ and σ.
• First, maximize with respect to μ: set the derivative to 0 and solve (see below).
• Result: the estimate μ_ML (ML = “Maximum Likelihood”).
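Setting the derivative with respect to μ to zero yields the sample mean:

```latex
\frac{\partial}{\partial \mu}\,\ln p(S \mid \mu, \sigma^2)
  = \frac{1}{\sigma^2}\sum_{n=1}^{N}(x_n-\mu) = 0
\quad\Longrightarrow\quad
\mu_{ML} = \frac{1}{N}\sum_{n=1}^{N} x_n .
```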
Now, maximize with respect to σ. It is actually easier to maximize with respect to σ²: set the derivative with respect to σ² to 0 and solve (see below).
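Differentiating with respect to σ² and setting the result to zero gives the sample variance around μ_ML:

```latex
\frac{\partial}{\partial \sigma^2}\,\ln p(S \mid \mu, \sigma^2)
  = \frac{1}{2\sigma^4}\sum_{n=1}^{N}(x_n-\mu)^2 - \frac{N}{2\sigma^2} = 0
\quad\Longrightarrow\quad
\sigma^2_{ML} = \frac{1}{N}\sum_{n=1}^{N}\bigl(x_n-\mu_{ML}\bigr)^2 .
```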
The resulting distribution is called a generative model because it can generate new data values.
• We say that θ_ML (here, μ_ML and σ_ML) parameterizes the model.
• In general, θ is used to denote the (learnable) parameters of a probabilistic model.
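A small sketch of the generative use of the fitted model; the training data here is synthetic and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(2.0, 1.5, size=500)             # stand-in for the training set S

# Maximum-likelihood parameters of a single univariate Gaussian.
mu_ml = data.mean()                                # mu_ML = (1/N) * sum x_n
sigma_ml = np.sqrt(((data - mu_ml) ** 2).mean())   # from sigma^2_ML = (1/N) * sum (x_n - mu_ML)^2

# The fitted distribution is generative: it can produce new data values.
new_values = rng.normal(mu_ml, sigma_ml, size=10)
```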
Discriminative vs. generative models in machine learning
• Discriminative model: Creates a decision boundary (discrimination curve) to separate the data into classes.
• Generative model: Gives the parameters θ_ML of a probability distribution that gives maximum likelihood p(S | θ) for the data.
Learning a GMM. More general case: Multivariate Gaussian Distribution
• Multivariate (D-dimensional) Gaussian:
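With mean vector μ (D-dimensional) and covariance matrix Σ, the density is:

```latex
\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \Sigma)
  = \frac{1}{(2\pi)^{D/2}\,|\Sigma|^{1/2}}
    \exp\!\left(-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{\mathsf{T}}
    \Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})\right).
```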
• Variance: var(xi) = E[(xi - μi)²]
• Covariance: cov(xi, xj) = E[(xi - μi)(xj - μj)]
• Covariance matrix Σ: Σi,j = cov(xi, xj)
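A quick numpy illustration with synthetic data (np.cov treats rows as variables by default, so rowvar=False is needed when each data point is a row):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 3))        # 500 data points, D = 3 dimensions

Sigma = np.cov(X, rowvar=False)      # D x D matrix with Sigma[i, j] = cov(x_i, x_j)
variances = np.diag(Sigma)           # diagonal entries are the variances var(x_i)
```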
Let S be a set of multivariate data points (vectors): S = {x1, ..., xN}.
• General expression for the finite Gaussian mixture model:
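Each component is now a multivariate Gaussian with its own mean vector and covariance matrix:

```latex
p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k \,\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \Sigma_k),
\qquad
\sum_{k=1}^{K} \pi_k = 1, \quad \pi_k \ge 0 .
```

In practice, such a mixture can be fit with scikit-learn's GaussianMixture. A brief sketch, assuming X is an (N, D) array of data points such as the synthetic one above, with n_components chosen purely for illustration:

```python
from sklearn.mixture import GaussianMixture

gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0)
gmm.fit(X)                       # estimate pi_k, mu_k, Sigma_k from the data
resp = gmm.predict_proba(X)      # soft clustering: P(cluster k | x_n) for every point
```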