580.691 Learning Theory, Reza Shadmehr. EM and the expected complete log-likelihood; mixture of experts; identification of a linear dynamical system.
The log-likelihood of the unlabeled data. Let z denote the hidden variable (the component label), x the measured variable, and x^(1), ..., x^(N) the unlabeled data. In the last lecture we assumed that in the M step we knew the posterior probabilities, and we found the derivative of the log-likelihood with respect to mu and Sigma in order to maximize it. Today we take a more general approach that expresses both the E and the M steps in terms of the log-likelihood.
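For an m-component Gaussian mixture, a standard form of the unlabeled-data log-likelihood consistent with the setup above is

\[
l(\theta; x) \;=\; \sum_{n=1}^{N} \log p\big(x^{(n)} \mid \theta\big)
\;=\; \sum_{n=1}^{N} \log \sum_{i=1}^{m} \pi_i \, \mathcal{N}\big(x^{(n)}; \mu_i, \Sigma_i\big),
\]

where the sum over the hidden label i inside the logarithm is what makes direct maximization awkward.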
A more general formulation of EM: the expected complete log-likelihood. The real data are not labeled, but for now assume that someone has labeled them, producing the "complete data". From the complete log-likelihood we form the expected complete log-likelihood. In EM, in the E step we fix theta and maximize the expected complete log-likelihood by setting the expected value of the hidden variables z to their posterior probabilities. In the M step, we fix the expected value of z and maximize the expected complete log-likelihood with respect to the parameters theta.
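With one indicator z_i^(n) per component per data point, the standard forms of the complete and expected complete log-likelihood for the Gaussian mixture, matching the description above, are

\[
l_c(\theta; x, z) = \sum_{n=1}^{N} \sum_{i=1}^{m} z_i^{(n)} \Big[ \log \pi_i + \log \mathcal{N}\big(x^{(n)}; \mu_i, \Sigma_i\big) \Big],
\]
\[
\big\langle l_c \big\rangle = \sum_{n=1}^{N} \sum_{i=1}^{m} p\big(z_i^{(n)}=1 \mid x^{(n)}, \theta\big) \Big[ \log \pi_i + \log \mathcal{N}\big(x^{(n)}; \mu_i, \Sigma_i\big) \Big],
\]

where the posterior replaces the unknown indicator because \( \langle z_i^{(n)} \rangle = p(z_i^{(n)}=1 \mid x^{(n)}, \theta) \).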
A more general formulation of EM: the expected complete log-likelihood, continued. In the M step, we fix the expected value of z and maximize the expected complete log-likelihood with respect to the parameters theta.
Function to maximize: the terms of the expected complete log-likelihood that involve the priors pi. The value of pi that maximizes this function alone is one, but that is not admissible, because we also have a constraint: the priors must sum to one. So we maximize this function subject to the constraint that the sum of the pi_i equals 1.
Function to maximize, function to minimize (its negative with the constraint adjoined through a Lagrange multiplier), and the constraint. Setting the derivative with respect to each pi_i to zero gives three such equations, one for each pi_i. If we add the equations together and use the constraint that the priors sum to one, the multiplier equals N, so each prior pi_i becomes the average of its posterior probabilities over the data points.
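A compact version of this Lagrange-multiplier step, standard for Gaussian mixtures and written in our own notation with responsibilities \( h_i^{(n)} = p(z_i^{(n)}=1 \mid x^{(n)}, \theta) \):

\[
\frac{\partial}{\partial \pi_i}\left[ \sum_{n}\sum_{j} h_j^{(n)} \log \pi_j + \lambda\Big(1-\sum_j \pi_j\Big) \right]
= \frac{\sum_n h_i^{(n)}}{\pi_i} - \lambda = 0
\;\;\Rightarrow\;\; \lambda \pi_i = \sum_n h_i^{(n)}.
\]

Summing over i and using \( \sum_i \pi_i = 1 \) and \( \sum_i h_i^{(n)} = 1 \) gives \( \lambda = N \), hence

\[
\pi_i = \frac{1}{N}\sum_{n=1}^{N} h_i^{(n)}.
\]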
EM algorithm: summary. We begin with an initial guess about the mixture parameters theta^(0). The "E" step: calculate the expected complete log-likelihood; in the mixture example, this reduces to computing the posterior probabilities. The "M" step: maximize the expected complete log-likelihood with respect to the model parameters theta.
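As a concrete illustration, here is a minimal sketch of this E/M loop for a one-dimensional Gaussian mixture (numpy; the function and variable names are our own, not taken from the slides):

import numpy as np

def em_gmm_1d(x, m, n_iter=100, seed=0):
    # Minimal sketch of the E/M loop above for a 1-D Gaussian mixture.
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    N = x.size
    pi = np.full(m, 1.0 / m)                      # initial guess for the priors
    mu = rng.choice(x, size=m, replace=False)     # initial guess for the means
    var = np.full(m, x.var())                     # initial guess for the variances
    for _ in range(n_iter):
        # E step: posterior probability of each component for each data point
        lik = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        post = pi * lik
        post /= post.sum(axis=1, keepdims=True)
        # M step: maximize the expected complete log-likelihood
        Nk = post.sum(axis=0)
        pi = Nk / N
        mu = (post * x[:, None]).sum(axis=0) / Nk
        var = (post * (x[:, None] - mu) ** 2).sum(axis=0) / Nk
    return pi, mu, var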
Selecting the number of mixture components. A simple idea that helps with selecting the number of mixture components is to form a cost that depends on both the log-likelihood of the data and the number of parameters used in the model. As the number of parameters increases, the log-likelihood increases, so we want a cost that balances the gain in log-likelihood against the cost of adding parameters. A common technique is to choose the number of components m that minimizes the "description length"

\[
\mathrm{DL}(m) = -\log p\big(x \mid \hat{\theta}_m\big) + \frac{d_m}{2} \log N,
\]

where \( \hat{\theta}_m \) is the maximum-likelihood estimate of the parameters for m mixture components, \( d_m \) is the effective number of parameters in the model, and N is the number of data points.
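A minimal sketch of this criterion in code (our own helper, assuming the maximized log-likelihood has already been computed for each candidate m):

import numpy as np

def description_length(log_lik, d_m, N):
    # MDL-style cost: a higher likelihood lowers the cost, more parameters
    # (d_m) raise it; choose the m with the smallest value.
    return -log_lik + 0.5 * d_m * np.log(N)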
Mixture of Experts. The data set (x, y) is clearly nonlinear, but we can break it up into two linear problems. We will try to switch from one "expert" to the other at around x = 0. [Figure: the data set (x, y) with the fits of Expert 1 and Expert 2; a diagram of the two experts combined through a moderator; and the moderator's conditional probability of choosing Expert 2 as a function of x.]
We have observed a sequence of data points (x, y) and believe that they were generated by the process shown to the right: y depends both on x (which we can measure) and on z, which is hidden from us. For example, the dependence of y on x might be a simple linear model, but conditioned on z, where z is a multinomial variable. The moderator (gating network): when there are only two experts, the moderator can be a logistic function; when there are multiple experts, the moderator can be a soft-max function.
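In standard notation (gating parameters eta_i; the symbols are our own), these two choices are

\[
p(z = 1 \mid x, \eta) = \frac{1}{1 + e^{-\eta^{T} x}} \qquad \text{(two experts, logistic)},
\]
\[
p(z_i = 1 \mid x, \eta) = \frac{e^{\eta_i^{T} x}}{\sum_{j=1}^{m} e^{\eta_j^{T} x}} \qquad \text{(m experts, soft-max)}.
\]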
Based on our hypothesis, the observed data should have the following distribution: the probability of y given x mixes the experts' output densities with the moderator's gating probabilities, where eta are the parameters of the moderator and theta_i are the parameters of the i-th expert. A key quantity is the posterior probability of the latent variable z: the posterior probability that the observed y "belongs" to the i-th expert. Note that the posterior probability for the i-th expert is updated according to how probable the observed y was under that expert. In other words, the expression tells us how strongly we should assign the observed data y to expert i.
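Written out in the standard mixture-of-experts form (our notation), the density and the posterior are

\[
p(y \mid x, \theta, \eta) = \sum_{i=1}^{m} p(z_i = 1 \mid x, \eta)\, p(y \mid z_i = 1, x, \theta_i),
\]
\[
h_i \;\equiv\; p(z_i = 1 \mid x, y) = \frac{p(z_i = 1 \mid x, \eta)\, p(y \mid z_i = 1, x, \theta_i)}
{\sum_{j=1}^{m} p(z_j = 1 \mid x, \eta)\, p(y \mid z_j = 1, x, \theta_j)}.
\]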
The output of the i-th expert, the output of the moderator, and the parameters of the i-th expert together determine the output of the whole network. Suppose there are two experts (m = 2). For a given value of x, the two regressions each give us a Gaussian distribution centered at their prediction. Therefore, for each value of x we have a bimodal probability distribution for y: a mixture distribution in the output space y for each input value x. Summing the log of this mixture over the data points gives the log-likelihood of the observed data.
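For linear experts with Gaussian noise (expert i predicting \( w_i^{T} x \) with variance \( \sigma_i^2 \); the symbols are our own), the observed-data log-likelihood takes the form

\[
l(\theta, \eta; D) = \sum_{n=1}^{N} \log \sum_{i=1}^{m} p\big(z_i = 1 \mid x^{(n)}, \eta\big)\,
\mathcal{N}\big(y^{(n)};\, w_i^{T} x^{(n)},\, \sigma_i^2\big).
\]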
The complete log-likelihood for the mixture-of-experts problem. The "completed" data include, for each observation, the indicator of which expert generated it. From the complete log-likelihood we form the expected complete log-likelihood by replacing each indicator with its posterior probability (assuming that someone had given us theta).
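In the same indicator notation as before, the standard forms are

\[
l_c(\theta, \eta; D, z) = \sum_{n=1}^{N}\sum_{i=1}^{m} z_i^{(n)}
\Big[ \log p\big(z_i = 1 \mid x^{(n)}, \eta\big) + \log p\big(y^{(n)} \mid z_i = 1, x^{(n)}, \theta_i\big) \Big],
\]
\[
\big\langle l_c \big\rangle = \sum_{n=1}^{N}\sum_{i=1}^{m} h_i^{(n)}
\Big[ \log p\big(z_i = 1 \mid x^{(n)}, \eta\big) + \log p\big(y^{(n)} \mid z_i = 1, x^{(n)}, \theta_i\big) \Big].
\]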
The E step for the mixture of experts problem In the E step, we begin by assuming that we have theta. To compute the expected complete log-likelihood, all we need are the posterior probabilities. The posterior for each expert depends on the likelihood that the observed data y came from that expert.
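A minimal numpy sketch of this E step for linear-Gaussian experts with a soft-max moderator (the function and variable names are our own, under the assumptions stated in the comments):

import numpy as np

def moe_posteriors(X, y, eta, W, sigma2):
    # Assumed shapes: X is (N, d), y is (N,), eta and W are (m, d),
    # sigma2 is (m,); experts are linear-Gaussian, the moderator is soft-max.
    g = X @ eta.T                                    # gating scores, (N, m)
    g = np.exp(g - g.max(axis=1, keepdims=True))
    g /= g.sum(axis=1, keepdims=True)                # moderator probabilities
    mu = X @ W.T                                     # expert predictions, (N, m)
    lik = np.exp(-0.5 * (y[:, None] - mu) ** 2 / sigma2) / np.sqrt(2 * np.pi * sigma2)
    h = g * lik
    h /= h.sum(axis=1, keepdims=True)                # posterior responsibilities
    return h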
The M step for the mixture-of-experts problem: the moderator. The cost is exactly the same as the IRLS cost function; we compute its first and second derivatives to obtain a learning rule. The moderator learns from the posterior probabilities.
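For the two-expert logistic moderator, a standard Newton/IRLS update with the posteriors \( h^{(n)} \) as targets (our notation; X is the matrix of inputs and \( g^{(n)} \) the moderator's current output) is

\[
\frac{\partial J}{\partial \eta} = \sum_{n=1}^{N} \big(h^{(n)} - g^{(n)}\big)\, x^{(n)}, \qquad
\eta \leftarrow \eta + \big(X^{T} S X\big)^{-1} X^{T} (h - g), \qquad
S = \mathrm{diag}\big(g^{(n)}(1 - g^{(n)})\big).
\]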
The M step for the mixture-of-experts problem: the weights of the experts. This is a weighted least-squares problem: expert i learns from the observed data point y, weighted by the posterior probability that the error came from that expert.
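With \( H_i = \mathrm{diag}\big(h_i^{(1)}, \ldots, h_i^{(N)}\big) \) (our notation), the weighted least-squares solution for the i-th expert's weights is

\[
w_i = \big(X^{T} H_i X\big)^{-1} X^{T} H_i\, y .
\]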
The M step for the mixture of experts problem: variance of the expert
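The standard weighted update for the i-th expert's variance, consistent with the weighted least-squares step above, is

\[
\sigma_i^{2} = \frac{\sum_{n=1}^{N} h_i^{(n)} \big(y^{(n)} - w_i^{T} x^{(n)}\big)^{2}}{\sum_{n=1}^{N} h_i^{(n)}}.
\]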
Parameter Estimation for Linear Dynamical Systems using EM. Objective: find the parameters A, B, C, Q, and R of a linear dynamical system from a set of data that includes the inputs u and the outputs y. To do so, we need the expected complete log-likelihood.
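The generative model assumed here (the standard linear-Gaussian state-space form; the hidden state x_t plays the role of the latent variable) is

\[
x_{t+1} = A x_t + B u_t + w_t, \qquad w_t \sim \mathcal{N}(0, Q),
\]
\[
y_t = C x_t + v_t, \qquad v_t \sim \mathcal{N}(0, R),
\]

so the complete log-likelihood is a sum of Gaussian terms in the states and outputs, and the E step requires the expectations of the hidden states given the data, which are obtained with a Kalman smoother.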