Hierarchical Mixture of Experts • Presented by Qi An • Machine Learning Reading Group, Duke University • 07/15/2005
Outline • Background • Hierarchical tree structure • Gating networks • Expert networks • E-M algorithm • Experimental results • Conclusions
Background • The idea of mixture of experts • First presented by Jacobs and Hinton in 1988 • Hierarchical mixture of experts • Proposed by Jordan and Jacobs in 1994 • Difference from previous mixture models • Mixing weights depend on both the input and the output
One-layer structure • [Figure: a single-layer mixture of experts — the input x feeds three expert networks (outputs μ1, μ2, μ3) and a gating network with an ellipsoidal gating function (weights g1, g2, g3); the expert outputs are blended into the overall output μ]
Hierarchical tree structure • [Figure: a two-level tree of gating networks with linear gating functions, with expert networks at the leaves]
Expert network • At the leaves of the tree, each expert forms a linear predictor Uij x • The output of the expert is μij = f(Uij x), where f is a link function • For example: the identity function for regression, or the logistic function for binary classification
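To make the leaf computation concrete, here is a minimal Python sketch of one expert, assuming a logistic link for the binary-classification case and an identity link for regression; the function name expert_output and the weight matrix U are illustrative, not from the paper.

```python
import numpy as np

def expert_output(U, x, link="logistic"):
    """One expert at a leaf: linear predictor U @ x passed through a link function.

    U    : (d_out, d_in) expert weight matrix (illustrative name)
    x    : (d_in,) input vector
    link : "identity" for regression, "logistic" for binary classification
    """
    xi = U @ x                               # linear predictor
    if link == "logistic":
        return 1.0 / (1.0 + np.exp(-xi))     # logistic link
    return xi                                # identity link
```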
Gating network • At the nonterminal nodes of the tree, the mixing weights are softmax (multinomial logit) functions of the input • Top layer: gi = exp(ξi) / Σk exp(ξk), with ξi = vi·x • Other layers: gij = exp(ξij) / Σk exp(ξik), with ξij = vij·x
Output • At the non-leaf nodes, the outputs of the children are blended by the gating weights • Top node: μ = Σi gi μi • Other nodes: μi = Σj gij μij
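Putting the gating and expert slides together, below is a hedged sketch of the forward pass for a two-level tree with softmax gating and identity-link (regression) experts; the parameter names V_top, V_low, U and their shapes are assumptions made for the example.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def hme_predict(x, V_top, V_low, U):
    """Forward pass of a two-level HME with identity-link experts.

    V_top : (I, d) weights of the top gating network
    V_low : (I, J, d) weights of the lower gating networks
    U     : (I, J, d_out, d) weights of the experts
    Returns mu = sum_i gi * sum_j gij * mu_ij.
    """
    g_top = softmax(V_top @ x)                       # gi: top-level mixing weights
    mu = 0.0
    for i in range(V_top.shape[0]):
        g_low = softmax(V_low[i] @ x)                # gij: nested mixing weights
        mu_i = sum(g_low[j] * (U[i, j] @ x)          # mu_i = sum_j gij * mu_ij
                   for j in range(V_low.shape[1]))
        mu = mu + g_top[i] * mu_i                    # mu = sum_i gi * mu_i
    return mu
```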
Probability model • For each expert, assume the true output y is chosen from a distribution Pij with mean μij • Therefore, the total probability of generating y from x is given by P(y|x) = Σi gi Σj gij Pij(y)
Posterior probabilities • Since gi and gij are computed based only on the input x, we refer to them as prior probabilities • With knowledge of both the input x and the output y, Bayes' rule gives the posterior probabilities: hi = gi Σj gij Pij(y) / Σi gi Σj gij Pij(y) and hij = gij Pij(y) / Σj gij Pij(y)
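A minimal sketch of these prior and posterior computations, assuming Gaussian expert distributions Pij with mean μij (the slide only says "a distribution P with mean μij", so the Gaussian choice is an assumption); parameter names and shapes follow the forward-pass sketch above.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def posteriors(x, y, V_top, V_low, U, sigma=1.0):
    """Posterior branch probabilities hi and hij from Bayes' rule,
    assuming Gaussian experts Pij(y) with mean U[i, j] @ x and std sigma."""
    I, J = U.shape[0], U.shape[1]
    g_top = softmax(V_top @ x)                                     # prior gi
    g_low = np.stack([softmax(V_low[i] @ x) for i in range(I)])    # prior gij

    lik = np.empty((I, J))                 # expert likelihoods Pij(y)
    for i in range(I):                     # (Gaussian normalizer omitted; it
        for j in range(J):                 #  cancels in the ratios below)
            r = y - U[i, j] @ x
            lik[i, j] = np.exp(-0.5 * np.dot(r, r) / sigma**2)

    joint = g_top[:, None] * g_low * lik                    # gi * gij * Pij(y)
    h_top = joint.sum(axis=1) / joint.sum()                 # hi
    h_low = joint / joint.sum(axis=1, keepdims=True)        # hij, given branch i
    return h_top, h_low
```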
E-M algorithm • Introduce auxiliary indicator variables zij, which have an interpretation as the labels that correspond to the experts • With knowledge of the auxiliary variables the probability model can be simplified, since the indicators select a single expert for each data point
E-M algorithm • Complete-data log-likelihood: lc(θ) = Σt Σi Σj zij(t) [ln gi(t) + ln gij(t) + ln Pij(y(t))] • The E-step replaces the unobserved indicators zij by their posterior expectations, the joint posteriors hi(t) hij(t)
E-M algorithm • The M-step maximizes the expected complete-data log-likelihood • It decouples into separate posterior-weighted maximum-likelihood problems, one for each expert network and one for each gating network, each solvable by IRLS
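As an illustration of the decoupled M-step, here is a sketch of the update for a single linear (Gaussian) expert, where the posterior-weighted least-squares problem has a closed form; the gating networks would instead be fit by IRLS (next slide). The function name, the scalar-output restriction, and the shapes are assumptions for the example.

```python
import numpy as np

def m_step_expert(X, Y, h_joint):
    """M-step for one linear expert: posterior-weighted least squares.

    X       : (T, d) inputs
    Y       : (T,) scalar targets (scalar output assumed for simplicity)
    h_joint : (T,) joint posterior hi * hij for this expert at each data point
    Solves argmin_u  sum_t h_t (y_t - u . x_t)^2 in closed form.
    """
    W = h_joint[:, None]                 # per-sample weights
    A = X.T @ (W * X)                    # weighted normal equations
    b = X.T @ (h_joint * Y)
    return np.linalg.solve(A, b)         # new expert weights u_ij
```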
IRLS • Iteratively reweighted least squares algorithm • An iterative algorithm for computing the maximum likelihood estimates of the parameters of a generalized linear model • A special case of the Fisher scoring method
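A minimal IRLS sketch for weighted binary logistic regression, the kind of weighted GLM fit the M-step requires for logistic experts and (with a multinomial extension) for the gating networks; the ridge term and iteration count are implementation choices, not from the slides.

```python
import numpy as np

def irls_logistic(X, y, sample_w=None, n_iter=20, ridge=1e-6):
    """IRLS for weighted logistic regression.

    X        : (T, d) inputs
    y        : (T,) binary targets in {0, 1}
    sample_w : (T,) optional per-sample weights (e.g. posteriors from the E-step)
    """
    T, d = X.shape
    if sample_w is None:
        sample_w = np.ones(T)
    beta = np.zeros(d)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))            # current predictions
        var = np.clip(p * (1.0 - p), 1e-12, None)
        w = sample_w * var                               # IRLS working weights
        z = X @ beta + (y - p) / var                     # working response
        A = X.T @ (w[:, None] * X) + ridge * np.eye(d)   # weighted normal equations
        beta = np.linalg.solve(A, X.T @ (w * z))
    return beta
```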
Algorithm • E-step: compute the posterior probabilities hi and hij for every data point using the current parameters • M-step: update the expert and gating network parameters by solving the resulting posterior-weighted IRLS problems
Online algorithm • This algorithm can be used for on-line regression • For the expert networks, the weights are updated with a posterior-weighted recursive least-squares step, where Rij is the inverse covariance matrix for expert network (i,j)
Online algorithm • For the gating networks, analogous recursive updates are used, where Si is the inverse covariance matrix for the top-level gating network and Sij is the inverse covariance matrix for the gating network at node (i,j)
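The exact on-line update formulas did not survive extraction, so the following is only a hedged sketch of a standard posterior-weighted recursive least-squares step for one expert, maintaining an inverse covariance matrix as the slides describe; the forgetting factor, function name, and exact form are assumptions, and the update in the original paper may differ in detail.

```python
import numpy as np

def online_expert_update(u, R, x, y, h, decay=1.0):
    """One posterior-weighted recursive least-squares step for a linear expert.

    u     : (d,) current expert weights
    R     : (d, d) current inverse covariance matrix for this expert
    x, y  : new input (d,) and scalar target
    h     : posterior weight of this expert for the new sample
    decay : forgetting factor (1.0 = none); a modelling choice, not from the slides
    """
    Rx = R @ x
    denom = decay / max(h, 1e-12) + x @ Rx
    R_new = (R - np.outer(Rx, Rx) / denom) / decay   # matrix inversion lemma
    err = y - u @ x                                  # prediction error
    u_new = u + h * err * (R_new @ x)                # posterior-weighted gain step
    return u_new, R_new
```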
Results • Simulated data from a four-joint robot arm moving in three-dimensional space
Conclusions • Introduces a tree-structured mixture architecture for supervised learning • Trains much faster than the traditional back-propagation algorithm • Can be used for on-line learning
Thank you Questions?