
Hierarchical Mixture of Experts


Presentation Transcript


  1. Hierarchical Mixture of Experts. Presented by Qi An, Machine Learning Reading Group, Duke University, 07/15/2005

  2. Outline • Background • Hierarchical tree structure • Gating networks • Expert networks • E-M algorithm • Experimental results • Conclusions

  3. Background • The idea of mixture of experts • First presented by Jacobs and Hinton in 1988 • Hierarchical mixture of experts • Proposed by Jordan and Jacobs in 1994 • Difference from previous mixture models • The mixing weights depend on both the input and the output

  4. Example (ME)

  5. One-layer structure [diagram]: a single gating network with ellipsoidal gating functions maps the input x to mixing proportions g1, g2, g3; three expert networks map the same x to outputs μ1, μ2, μ3, which are combined into the overall output μ.
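  In the standard mixture-of-experts formulation, this combination is the gating-weighted sum below (notation as in the diagram; the gating outputs are nonnegative and sum to one):

    \[ \mu = \sum_i g_i(x)\,\mu_i(x), \qquad \sum_i g_i(x) = 1, \quad g_i(x) \ge 0 \]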

  6. Example (HME)

  7. Hierarchical tree structure [diagram]: gating networks with linear gating functions at the internal nodes of the tree, expert networks at the leaves.

  8. Expert network • At the leaves of the tree, each expert forms a linear predictor of the input • The output of the expert is obtained by passing this linear predictor through a link function • For example: the logistic function for binary classification
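  Concretely, in the notation of Jordan and Jacobs (1994), expert (i,j) has a weight matrix U_ij and link function f:

    \[ \mu_{ij} = f(U_{ij}\, x), \qquad \text{e.g. } f(s) = \frac{1}{1 + e^{-s}} \text{ for binary classification} \]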

  9. Gating network • At the nonterminal nodes of the tree, a gating network computes the mixing proportions from the input: one set of gating functions at the top layer and one set at each lower layer, as given below
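  In the standard formulation these are softmax functions of linear predictors, with gating weight vectors v_i at the top node and v_ij at node i:

    \[ g_i = \frac{e^{\xi_i}}{\sum_k e^{\xi_k}}, \quad \xi_i = v_i^{\mathsf T} x \qquad \text{(top layer)} \]
    \[ g_{j|i} = \frac{e^{\xi_{ij}}}{\sum_k e^{\xi_{ik}}}, \quad \xi_{ij} = v_{ij}^{\mathsf T} x \qquad \text{(lower layers)} \]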

  10. Output • At the non-leaf nodes, the output is the gating-weighted sum of the outputs of the children: the top node produces the overall output, the other nodes produce intermediate outputs, as given below
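  Written out for the two-level tree:

    \[ \mu_i = \sum_j g_{j|i}\, \mu_{ij} \quad \text{(internal nodes)}, \qquad \mu = \sum_i g_i\, \mu_i \quad \text{(top node)} \]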

  11. Probability model • For each expert, assume the true output y is generated from a distribution P with mean μij • Therefore, the total probability of generating y from x is the two-level mixture given below
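  With θij denoting the parameters of expert (i,j) and vi, vij the gating parameters:

    \[ P(y \mid x, \theta) = \sum_i g_i(x, v_i) \sum_j g_{j|i}(x, v_{ij})\, P(y \mid x, \theta_{ij}) \]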

  12. Posterior probabilities • Since the gij and gi are computed based only on the input x, we refer to them as prior probabilities • With knowledge of both the input x and the output y, we can define the posterior probabilities using Bayes’ rule, as given below
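  The node and branch posteriors (written h to distinguish them from the priors g) are:

    \[ h_i = \frac{g_i \sum_j g_{j|i}\, P(y \mid x, \theta_{ij})}{\sum_k g_k \sum_l g_{l|k}\, P(y \mid x, \theta_{kl})}, \qquad h_{j|i} = \frac{g_{j|i}\, P(y \mid x, \theta_{ij})}{\sum_l g_{l|i}\, P(y \mid x, \theta_{il})}, \qquad h_{ij} = h_i\, h_{j|i} \]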

  13. E-M algorithm • Introduce auxiliary variables zij, which can be interpreted as the labels corresponding to the experts • With knowledge of the auxiliary variables, the probability model simplifies as shown below
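  Taking zij = 1 when expert (i,j) generated the data point and zij = 0 otherwise, the complete-data model is:

    \[ P(y, z \mid x) = \prod_i \prod_j \left[ g_i\, g_{j|i}\, P(y \mid x, \theta_{ij}) \right]^{z_{ij}} \]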

  14. E-M algorithm • The complete-data likelihood • The E-step (both are given below)
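  For a training set {(x(t), y(t))}, the logarithm of the complete-data likelihood is

    \[ l_c(\theta) = \sum_t \sum_i \sum_j z_{ij}^{(t)} \left[ \ln g_i^{(t)} + \ln g_{j|i}^{(t)} + \ln P\!\left(y^{(t)} \mid x^{(t)}, \theta_{ij}\right) \right] \]

  and the E-step replaces each zij(t) by its expectation, which is exactly the posterior of the previous slide evaluated at the current parameter values:

    \[ E\!\left[ z_{ij}^{(t)} \mid x^{(t)}, y^{(t)} \right] = h_{ij}^{(t)} \]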

  15. E-M algorithm • The M-step: maximizing the expected complete-data log-likelihood decouples into separate weighted maximum-likelihood problems, given below
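  One problem per expert network and one per gating network:

    \[ \theta_{ij}^{\text{new}} = \arg\max_{\theta_{ij}} \sum_t h_{ij}^{(t)} \ln P\!\left(y^{(t)} \mid x^{(t)}, \theta_{ij}\right) \]
    \[ \{v_k\}^{\text{new}} = \arg\max_{\{v_k\}} \sum_t \sum_k h_k^{(t)} \ln g_k^{(t)}, \qquad \{v_{il}\}^{\text{new}} = \arg\max_{\{v_{il}\}} \sum_t \sum_l h_i^{(t)}\, h_{l|i}^{(t)} \ln g_{l|i}^{(t)} \]

  The expert problems are weighted maximum-likelihood fits; the gating problems are weighted multinomial logistic regressions, solved by IRLS (next slide).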

  16. IRLS • Iteratively reweighted least squares algorithm • An iterative algorithm for computing the maximum-likelihood estimates of the parameters of a generalized linear model • A special case of the Fisher scoring method
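  As a concrete illustration, here is a minimal IRLS sketch for a Bernoulli GLM (logistic link) with per-example weights, which is the kind of weighted fit the M-step subproblems require; the function name, the ridge term, and the fixed iteration count are illustrative choices, not part of the original algorithm:

    import numpy as np

    def irls_logistic(X, y, sample_weight=None, n_iter=20, ridge=1e-6):
        """IRLS for a weighted Bernoulli GLM: X is (n, d), y is (n,) in {0, 1}."""
        n, d = X.shape
        w = np.ones(n) if sample_weight is None else sample_weight
        beta = np.zeros(d)
        for _ in range(n_iter):
            mu = 1.0 / (1.0 + np.exp(-X @ beta))                 # current means
            var = np.clip(mu * (1.0 - mu), 1e-8, None)           # Bernoulli variances
            s = w * var                                          # IRLS weights W
            z = X @ beta + (y - mu) / var                        # working response
            A = X.T @ (s[:, None] * X) + ridge * np.eye(d)       # X^T W X (+ ridge)
            beta = np.linalg.solve(A, X.T @ (s * z))             # weighted LS step
        return beta

  Calling irls_logistic(X, y, sample_weight=h) with the posteriors h from the E-step fits one binary expert (or one two-way gate) under those weights.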

  17. Algorithm • Alternate the E-step (compute the posteriors) and the M-step (refit the experts and the gating networks) until convergence; a sketch of one iteration follows
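  The following is a minimal sketch of one EM iteration for a two-level HME with linear-Gaussian experts and softmax gates, written only to make the flow concrete. It is not the paper's exact procedure: the expert M-step uses weighted least squares, the gating M-step is replaced here by a single gradient step on the expected log-likelihood instead of a full IRLS fit, and sigma2, lr, and all names are illustrative.

    import numpy as np

    def softmax(Z):
        Z = Z - Z.max(axis=1, keepdims=True)
        E = np.exp(Z)
        return E / E.sum(axis=1, keepdims=True)

    def em_iteration(X, y, V, W, U, sigma2=1.0, lr=0.1):
        """One EM iteration for a two-level HME with linear-Gaussian experts.

        X: (n, d) inputs (with a bias column), y: (n,) targets.
        V: (d, I) top-level gating weights, W: (I, d, J) lower gating weights,
        U: (I, J, d) expert weights, so mu_ij(x) = U[i, j] @ x.
        """
        n, d = X.shape
        I, J = U.shape[0], U.shape[1]

        # E-step: priors g_i, g_{j|i}, expert likelihoods, posteriors h_ij.
        g_top = softmax(X @ V)                                           # (n, I)
        g_low = np.stack([softmax(X @ W[i]) for i in range(I)], axis=1)  # (n, I, J)
        mu = np.einsum('ijd,nd->nij', U, X)                              # (n, I, J)
        lik = np.exp(-0.5 * (y[:, None, None] - mu) ** 2 / sigma2)
        joint = g_top[:, :, None] * g_low * lik
        h = joint / joint.sum(axis=(1, 2), keepdims=True)                # h_ij
        h_top = h.sum(axis=2)                                            # h_i

        # M-step (experts): weighted least squares with weights h_ij.
        for i in range(I):
            for j in range(J):
                s = h[:, i, j]
                A = X.T @ (s[:, None] * X) + 1e-6 * np.eye(d)
                U[i, j] = np.linalg.solve(A, X.T @ (s * y))

        # M-step (gates): one gradient step on the expected log-likelihood
        # (the paper solves these subproblems with IRLS instead).
        V += lr * X.T @ (h_top - g_top)
        for i in range(I):
            h_low = h[:, i, :] / np.clip(h_top[:, [i]], 1e-12, None)     # h_{j|i}
            W[i] += lr * X.T @ (h_top[:, [i]] * (h_low - g_low[:, i, :]))
        return V, W, U

  Repeating em_iteration until the log-likelihood of P(y | x) stops improving gives the batch algorithm of the previous slides.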

  18. Online algorithm • This algorithm can also be used for online regression • For the expert networks, the weights are updated recursively, where Rij is the inverse covariance matrix for EN(i,j); a sketch of this style of update follows the next slide

  19. Online algorithm • For the gating networks, the weights are updated in the same recursive fashion, where Si is the inverse covariance matrix for the top-level gating network and Sij is the inverse covariance matrix for the gating network at node (i,j)
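  As an illustration of this style of update, here is a posterior-weighted recursive least-squares step with a forgetting factor. It is a hedged sketch rather than the paper's exact recursion: the function name, the forgetting factor lam, and the use of the posterior h as the example weight are assumptions, and the precise updates of Rij, Si, and Sij in the paper may differ in detail.

    import numpy as np

    def weighted_rls_step(theta, R, x, target, h, lam=0.99):
        """One weighted RLS update.

        theta: (d,) current parameters (e.g. one output row of U_ij, or one gate).
        R:     (d, d) current inverse covariance matrix (R_ij, S_i, or S_ij).
        x:     (d,) input, target: scalar, h: posterior weight, lam: forgetting factor.
        """
        Rx = R @ x
        denom = lam + h * (x @ Rx)
        R_new = (R - np.outer(h * Rx, Rx) / denom) / lam   # Sherman-Morrison update
        gain = h * (R_new @ x)                             # weighted gain vector
        theta_new = theta + gain * (target - theta @ x)    # prediction-error correction
        return theta_new, R_new

  For an expert network the target is the corresponding component of y; for a gating network, one simple least-squares approximation is to move the gate's linear predictor toward its posterior responsibility.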

  20. Results • Simulated data of a four-joint robot arm moving in three-dimensional space

  21. Results

  22. Conclusions • Introduces a tree-structured architecture for supervised learning • Much faster than the traditional back-propagation algorithm • Can be used for on-line learning

  23. Thank you Questions?
