
580.691 Learning Theory, Reza Shadmehr: EM and expected complete log-likelihood; Mixture of Experts






Presentation Transcript


  1. 580.691 Learning Theory, Reza Shadmehr. EM and expected complete log-likelihood; Mixture of Experts; Identification of a linear dynamical system.

  2. The log-likelihood of the unlabeled data. Hidden variable, measured variable, and the unlabeled data. In the last lecture we assumed that in the M step we knew the posterior probabilities, and we found the derivatives of the log-likelihood with respect to mu and sigma in order to maximize the log-likelihood. Today we take a more general approach that includes both the E and M steps in the log-likelihood.
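The equation on this slide is not reproduced in the transcript; a standard form of the unlabeled-data log-likelihood for a K-component Gaussian mixture (the symbols K, N, and the Gaussian form are assumptions here) is:

```latex
% Log-likelihood of the unlabeled data x^{(1)},...,x^{(N)},
% with the hidden label z marginalized out (assumed Gaussian mixture):
\ell(\theta) = \sum_{n=1}^{N} \log p\!\left(x^{(n)} \mid \theta\right)
             = \sum_{n=1}^{N} \log \sum_{k=1}^{K} \pi_k \,
               \mathcal{N}\!\left(x^{(n)} \mid \mu_k, \Sigma_k\right)
```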

  3. A more general formulation of EM: the expected complete log-likelihood. The real data are not labeled, but for now assume that someone labeled them, resulting in the "complete data". Complete log-likelihood and expected complete log-likelihood. In EM, in the E step we fix theta and maximize the expected complete log-likelihood by setting the expected value of our hidden variables z to the posterior probabilities. In the M step, we fix the expected value of z and maximize the expected complete log-likelihood with respect to the parameters theta.
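As a sketch of the two quantities named on the slide (the indicator variables z_k^(n) for the labels and the posteriors h_k^(n) are notational assumptions):

```latex
% Complete log-likelihood (labels z_k^{(n)} \in \{0,1\} assumed known):
\ell_c(\theta) = \sum_{n=1}^{N} \sum_{k=1}^{K} z_k^{(n)}
                 \log\!\left[ \pi_k \, \mathcal{N}\!\left(x^{(n)} \mid \mu_k, \Sigma_k\right) \right]

% Expected complete log-likelihood: replace z_k^{(n)} by its posterior
% expectation h_k^{(n)} = p(z_k^{(n)} = 1 \mid x^{(n)}, \theta):
\left\langle \ell_c(\theta) \right\rangle
  = \sum_{n=1}^{N} \sum_{k=1}^{K} h_k^{(n)}
    \log\!\left[ \pi_k \, \mathcal{N}\!\left(x^{(n)} \mid \mu_k, \Sigma_k\right) \right]
```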

  4. A more general formulation of EM: the expected complete log-likelihood. In the M step, we fix the expected value of z and maximize the expected complete log-likelihood with respect to the parameters theta.
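Setting the derivatives of the expected complete log-likelihood with respect to the Gaussian parameters to zero gives the usual M-step updates (a sketch in the same assumed notation):

```latex
% M-step updates for a Gaussian mixture (posteriors h_k^{(n)} held fixed):
\mu_k    = \frac{\sum_{n} h_k^{(n)} x^{(n)}}{\sum_{n} h_k^{(n)}}, \qquad
\Sigma_k = \frac{\sum_{n} h_k^{(n)} \left(x^{(n)} - \mu_k\right)\left(x^{(n)} - \mu_k\right)^{\!\top}}
                {\sum_{n} h_k^{(n)}}
```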

  5. Function to maximize. The value of pi that maximizes this function alone is one, but that is not interesting because we also have a constraint: the priors must sum to one. So we want to maximize this function subject to the constraint that the sum of the pi_i is 1.
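The function and constraint referred to here are, in the standard formulation (an assumed reconstruction of the slide's equations):

```latex
% Function to maximize in \pi, with the normalization constraint:
J(\pi) = \sum_{n=1}^{N} \sum_{k=1}^{K} h_k^{(n)} \log \pi_k
\qquad \text{subject to} \qquad \sum_{k=1}^{K} \pi_k = 1
```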

  6. Function to maximize, function to minimize, and the constraint. We have three such stationarity equations, one for each pi. Adding the equations together determines the Lagrange multiplier, as in the sketch below.
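A sketch of the missing derivation via a Lagrange multiplier (the symbol lambda is an assumption):

```latex
% Minimize the negative objective with a Lagrange multiplier \lambda:
\frac{\partial}{\partial \pi_k}
\left[ -\sum_{n} \sum_{j} h_j^{(n)} \log \pi_j
       + \lambda \Big( \sum_{j} \pi_j - 1 \Big) \right] = 0
\;\;\Rightarrow\;\;
\pi_k = \frac{1}{\lambda} \sum_{n} h_k^{(n)}

% Summing over k and using \sum_k \pi_k = 1 and \sum_k h_k^{(n)} = 1
% gives \lambda = N, so:
\pi_k = \frac{1}{N} \sum_{n=1}^{N} h_k^{(n)}
```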

  7. EM algorithm: summary. We begin with a guess about the mixture parameters. The "E" step: calculate the expected complete log-likelihood; in the mixture example, this reduces to just computing the posterior probabilities. The "M" step: maximize the expected complete log-likelihood with respect to the model parameters theta. A small code sketch of this loop follows.
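A minimal Python sketch of this E/M loop for a 1-D Gaussian mixture (all function and variable names here are illustrative, not from the slides):

```python
import numpy as np

def em_gaussian_mixture(x, K, n_iter=100):
    """EM for a 1-D Gaussian mixture: alternate E (posteriors) and M (parameters) steps."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    # Initial guess about the mixture parameters
    pi = np.full(K, 1.0 / K)
    mu = np.random.choice(x, K, replace=False)
    var = np.full(K, np.var(x) + 1e-6)

    for _ in range(n_iter):
        # E step: posterior probability h[n, k] that point n came from component k
        lik = np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        h = pi * lik
        h /= h.sum(axis=1, keepdims=True)

        # M step: maximize the expected complete log-likelihood w.r.t. pi, mu, var
        Nk = h.sum(axis=0)
        pi = Nk / N
        mu = (h * x[:, None]).sum(axis=0) / Nk
        var = (h * (x[:, None] - mu) ** 2).sum(axis=0) / Nk + 1e-6

    return pi, mu, var
```

For example, calling em_gaussian_mixture on data drawn from two well-separated Gaussians with K = 2 should recover priors near 0.5 and means near the two true centers.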

  8. Selecting the number of mixture components. A simple idea that helps with selecting the number of mixture components is to form a cost that depends on both the log-likelihood of the data and the number of parameters used in the model. As the number of parameters increases, the log-likelihood increases, so we want a cost that balances the change in the log-likelihood against the cost of adding parameters. A common technique is to choose the m mixture components that minimize the "description length", which combines the maximum-likelihood estimate of the parameters for m mixture components, the effective number of parameters in the model, and the number of data points.
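The description-length cost referred to on the slide is commonly written as follows (a standard form; the exact expression on the slide is not in the transcript):

```latex
% Description length for m mixture components:
% \hat{\theta}_m = ML estimate, d_m = effective number of parameters, N = number of data points.
\mathrm{DL}(m) = -\log p\!\left(\text{data} \mid \hat{\theta}_m\right)
                 + \frac{d_m}{2} \log N
% Choose the m that minimizes \mathrm{DL}(m).
```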

  9. Mixture of Experts. [Figure: the data set (x, y) fit by Expert 1 and Expert 2 combined through a moderator, and the conditional probability of choosing expert 2 as a function of x.] The data set (x, y) is clearly non-linear, but we can break it up into two linear problems. We will try to switch from one "expert" to another at around x = 0.

  10. We have observed a sequence of data points (x, y), and believe that it was generated by the process shown to the right. Note that y depends on both x (which we can measure) and z, which is hidden from us. For example, the dependence of y on x might be a simple linear model, but conditioned on z, where z is a multinomial variable. The moderator (gating network): when there are only two experts, the moderator can be a logistic function; when there are multiple experts, the moderator can be a soft-max function (see the sketch below).
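The logistic and soft-max moderators named here, written out (the moderator parameters nu are an assumed notation):

```latex
% Two experts: logistic moderator
g_1(x, \nu) = \frac{1}{1 + \exp\!\left(-\nu^{\top} x\right)}, \qquad
g_2(x, \nu) = 1 - g_1(x, \nu)

% m experts: soft-max moderator
g_i(x, \nu) = \frac{\exp\!\left(\nu_i^{\top} x\right)}
                   {\sum_{j=1}^{m} \exp\!\left(\nu_j^{\top} x\right)}
```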

  11. Based on our hypothesis, we should have the following distribution of observed data. A key quantity is the posterior probability of the latent variable z, which involves the parameters of the moderator and the parameters of the experts: the posterior probability that the observed y "belongs" to the i-th expert. Note that the posterior probability for the i-th expert is updated based on how probable the observed data y was for this expert. In a way, the expression tells us, given the observed data y, how strongly we should assign it to expert i.
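A sketch of the two quantities described, assuming linear-Gaussian experts with weights w_i and variance sigma_i^2 (these symbols are assumptions):

```latex
% Distribution of the observed data under the hypothesis:
p(y \mid x, \theta) = \sum_{i=1}^{m} g_i(x, \nu)\,
    \mathcal{N}\!\left(y \mid w_i^{\top} x,\ \sigma_i^{2}\right)

% Posterior probability that y "belongs" to the i-th expert:
h_i(x, y) = \frac{g_i(x, \nu)\,\mathcal{N}\!\left(y \mid w_i^{\top} x,\ \sigma_i^{2}\right)}
                 {\sum_{j=1}^{m} g_j(x, \nu)\,\mathcal{N}\!\left(y \mid w_j^{\top} x,\ \sigma_j^{2}\right)}
```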

  12. Output of the i-th expert, output of the moderator, parameters of the i-th expert, and output of the whole network. Suppose there are two experts (m = 2). For a given value of x, the two regressions each give us a Gaussian distribution centered at their mean. Therefore, for each value of x, we have a bimodal probability distribution for y: a mixture distribution in the output space y for each input value of x. From this we can write the log-likelihood of the observed data (see the sketch below).
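The network output and the observed-data log-likelihood, in the same assumed notation:

```latex
% Output of the whole network: moderator-weighted mixture of expert outputs
\hat{y}(x) = \sum_{i=1}^{m} g_i(x, \nu)\, w_i^{\top} x

% Log-likelihood of the observed data \{(x^{(n)}, y^{(n)})\}:
\ell(\theta) = \sum_{n=1}^{N} \log \sum_{i=1}^{m}
    g_i\!\left(x^{(n)}, \nu\right)
    \mathcal{N}\!\left(y^{(n)} \mid w_i^{\top} x^{(n)},\ \sigma_i^{2}\right)
```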

  13. The complete log-likelihood for the mixture of experts problem. The "completed" data give the complete log-likelihood and the expected complete log-likelihood (assuming that someone had given us theta), as sketched below.
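A sketch of the complete and expected complete log-likelihoods for this model (indicators z_i^(n) and posteriors h_i^(n) as before):

```latex
% Complete log-likelihood for the mixture of experts:
\ell_c(\theta) = \sum_{n=1}^{N} \sum_{i=1}^{m} z_i^{(n)}
    \log\!\left[ g_i\!\left(x^{(n)}, \nu\right)
    \mathcal{N}\!\left(y^{(n)} \mid w_i^{\top} x^{(n)},\ \sigma_i^{2}\right) \right]

% Expected complete log-likelihood: replace z_i^{(n)} by h_i^{(n)}:
\left\langle \ell_c(\theta) \right\rangle
  = \sum_{n=1}^{N} \sum_{i=1}^{m} h_i^{(n)}
    \log\!\left[ g_i\!\left(x^{(n)}, \nu\right)
    \mathcal{N}\!\left(y^{(n)} \mid w_i^{\top} x^{(n)},\ \sigma_i^{2}\right) \right]
```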

  14. The E step for the mixture of experts problem. In the E step, we begin by assuming that we have theta. To compute the expected complete log-likelihood, all we need are the posterior probabilities. The posterior for each expert depends on the likelihood that the observed data y came from that expert.

  15. The M step for the mixture of experts problem: the moderator. The cost is exactly the same as the IRLS cost function. We find first and second derivatives and obtain a learning rule (see the sketch below). The moderator learns from the posterior probabilities.
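A hedged sketch of the Newton/IRLS update implied here, for the two-expert logistic moderator (X is the design matrix and W a diagonal weight matrix; these are standard IRLS symbols, not taken from the slide):

```latex
% Cost for the moderator (same form as the IRLS / logistic-regression cost),
% with the posterior h_1^{(n)} playing the role of the target:
J(\nu) = -\sum_{n=1}^{N} \left[ h_1^{(n)} \log g_1^{(n)}
         + \left(1 - h_1^{(n)}\right) \log\!\left(1 - g_1^{(n)}\right) \right]

% Newton (IRLS) learning rule, with W = \mathrm{diag}\!\left(g_1^{(n)}\left(1 - g_1^{(n)}\right)\right):
\nu \leftarrow \nu + \left(X^{\top} W X\right)^{-1} X^{\top}\!\left(h_1 - g_1\right)
```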

  16. The M step for the mixture of experts problem: weights of the expert. This is a weighted least-squares problem: expert i learns from the observed data point y, weighted by the posterior probability that the error came from that expert.
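The weighted least-squares solution for expert i's weights, in the same assumed notation (H_i is the diagonal matrix of posteriors h_i^(n)):

```latex
% Weighted least-squares M-step for expert i:
w_i = \arg\min_{w} \sum_{n=1}^{N} h_i^{(n)}
      \left( y^{(n)} - w^{\top} x^{(n)} \right)^{2}
    = \left( X^{\top} H_i X \right)^{-1} X^{\top} H_i\, y
```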

  17. The M step for the mixture of experts problem: variance of the expert
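The body of this slide is not reproduced in the transcript; the standard posterior-weighted variance update would be (an assumed reconstruction):

```latex
% M-step update for the variance of expert i:
\sigma_i^{2} = \frac{\sum_{n=1}^{N} h_i^{(n)}
               \left( y^{(n)} - w_i^{\top} x^{(n)} \right)^{2}}
              {\sum_{n=1}^{N} h_i^{(n)}}
```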

  18. Parameter estimation for linear dynamical systems using EM. Objective: to find the parameters A, B, C, Q, and R of a linear dynamical system from a set of data that includes inputs u and outputs y. We need to find the expected complete log-likelihood.
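The generative model assumed for this identification problem (a standard linear-Gaussian state-space form; the slide's exact notation is not in the transcript):

```latex
% Linear dynamical system with hidden state x_t, input u_t, output y_t:
x_{t+1} = A x_t + B u_t + w_t, \qquad w_t \sim \mathcal{N}(0, Q)
y_t     = C x_t + v_t,         \qquad v_t \sim \mathcal{N}(0, R)
```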

  19. Posterior estimate of state and variance
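The equations for this slide are not in the transcript; as a sketch, the forward (Kalman filter) recursion that gives a posterior estimate of the state and its variance is (assuming the state-space model above):

```latex
% Prediction:
\hat{x}_{t\mid t-1} = A \hat{x}_{t-1\mid t-1} + B u_{t-1}, \qquad
P_{t\mid t-1}       = A P_{t-1\mid t-1} A^{\top} + Q

% Measurement update with Kalman gain K_t:
K_t = P_{t\mid t-1} C^{\top}\!\left( C P_{t\mid t-1} C^{\top} + R \right)^{-1}
\hat{x}_{t\mid t} = \hat{x}_{t\mid t-1} + K_t \left( y_t - C \hat{x}_{t\mid t-1} \right), \qquad
P_{t\mid t} = \left( I - K_t C \right) P_{t\mid t-1}
```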
