MSRC Summer School - 30/06/2009
Hybrids of generative and discriminative methods for machine learning
Cambridge – UK
Motivation • Generative models • prior knowledge • handle missing data such as labels • Discriminative models • perform well at classification • However, there is no straightforward way to combine them
Content • Generative and discriminative methods • A principled hybrid framework • Study of the properties on a toy example • Influence of the amount of labelled data
Content • Generative and discriminative methods • A principled hybrid framework • Study of the properties on a toy example • Influence of the amount of labelled data
Generative methods • Answer: “what does a cat look like? and a dog?” => model the joint distribution of data and labels • x : data, c : label, θ : parameters
Generative methods • Objective function:
G(θ) = p(θ) p(X, C|θ)
G(θ) = p(θ) Πn p(xn, cn|θ)
• 1 reusable model per class, can deal with incomplete data • Example: GMMs
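As a rough illustration of the objective above, here is a minimal sketch (not from the talk) of computing log G(θ) for a model with one spherical Gaussian per class plus class priors; the function names, the flat prior on θ, and the toy numbers are all assumptions of this sketch.

```python
# Minimal sketch (illustrative): log G(theta) = log p(theta) + sum_n log p(x_n, c_n | theta)
# for a model with one spherical Gaussian per class.
import numpy as np
from scipy.stats import multivariate_normal

def log_generative_objective(X, c, means, var, class_priors):
    """X: (N, D) data, c: (N,) integer labels, means: (K, D), var: shared spherical variance."""
    D = X.shape[1]
    log_g = 0.0                          # flat prior p(theta) assumed, so its log is dropped
    for x_n, c_n in zip(X, c):
        log_joint = (np.log(class_priors[c_n])                       # log p(c_n | theta)
                     + multivariate_normal.logpdf(x_n, mean=means[c_n], cov=var * np.eye(D)))
        log_g += log_joint               # accumulate log p(x_n, c_n | theta)
    return log_g

# Tiny usage example: two well-separated classes in 2-D
X = np.array([[0.0, 0.0], [0.2, 0.1], [3.0, 3.0], [3.1, 2.9]])
c = np.array([0, 0, 1, 1])
means = np.array([[0.1, 0.05], [3.05, 2.95]])
print(log_generative_objective(X, c, means, var=0.1, class_priors=np.array([0.5, 0.5])))
```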
Discriminative methods • Answer: “is it a cat or a dog?” => model the posterior distribution of the labels • x : data, c : label, θ : parameters
Discriminative methods • The objective function is:
D(θ) = p(θ) p(C|X, θ)
D(θ) = p(θ) Πn p(cn|xn, θ)
• Focus on regions of ambiguity, make faster predictions • Examples: neural networks, SVMs
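For comparison, a minimal sketch (again not from the talk) of log D(θ) using a linear softmax classifier as the conditional model p(c|x, θ); the parameterisation and names are illustrative assumptions.

```python
# Minimal sketch (illustrative): log D(theta) = log p(theta) + sum_n log p(c_n | x_n, theta)
# with a linear softmax classifier as the conditional model.
import numpy as np
from scipy.special import logsumexp

def log_discriminative_objective(X, c, W, b):
    """X: (N, D) data, c: (N,) integer labels, W: (K, D) weights, b: (K,) biases."""
    scores = X @ W.T + b                                            # (N, K) class scores
    log_post = scores - logsumexp(scores, axis=1, keepdims=True)    # log p(k | x_n, theta)
    return log_post[np.arange(len(c)), c].sum()                     # flat prior on theta dropped

# Tiny usage example on the same toy data as above
X = np.array([[0.0, 0.0], [0.2, 0.1], [3.0, 3.0], [3.1, 2.9]])
c = np.array([0, 0, 1, 1])
W = np.array([[-1.0, -1.0], [1.0, 1.0]])
b = np.zeros(2)
print(log_discriminative_objective(X, c, W, b))
```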
Example of discriminative model SVMs / NNs
Generative versus discriminative • A second mode in the data distribution has no effect on the decision boundary
Content • Generative and discriminative methods • A principled hybrid framework • Study of the properties on a toy example • Influence of the amount of labelled data
Semi-supervised learning • Few labelled data / lots of unlabelled data • Discriminative methods overfit; generative models only help classification if they are “good” • Need the modelling power of generative models combined with discriminative classification performance => hybrid models
Discriminative training (Bach et al., ICASSP 05) • Discriminative objective function:
D(θ) = p(θ) Πn p(cn|xn, θ)
• Using a generative model:
D(θ) = p(θ) Πn [ p(xn, cn|θ) / p(xn|θ) ]
D(θ) = p(θ) Πn [ p(xn, cn|θ) / Σc p(xn, c|θ) ]
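A minimal sketch of this idea, under the same spherical-Gaussian-per-class assumption as earlier: the conditional p(c|x, θ) is obtained by normalising the generative joint over all classes. Function names are illustrative.

```python
# Sketch (illustrative) of discriminatively training a generative model: the conditional
# p(c_n | x_n, theta) is the class joint normalised over all classes.
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def log_joint_all_classes(X, means, var, class_priors):
    """Return an (N, K) matrix of log p(x_n, k | theta), one spherical Gaussian per class."""
    D, K = X.shape[1], len(class_priors)
    return np.stack([np.log(class_priors[k])
                     + multivariate_normal.logpdf(X, mean=means[k], cov=var * np.eye(D))
                     for k in range(K)], axis=1)

def log_discriminative_of_generative(X, c, means, var, class_priors):
    """sum_n [ log p(x_n, c_n | theta) - log sum_c p(x_n, c | theta) ]."""
    lj = log_joint_all_classes(X, means, var, class_priors)
    return (lj[np.arange(len(c)), c] - logsumexp(lj, axis=1)).sum()
```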
Convex combination (Bouchard et al., COMPSTAT 04) • Generative objective function:
G(θ) = p(θ) Πn p(xn, cn|θ)
• Discriminative objective function:
D(θ) = p(θ) Πn p(cn|xn, θ)
• Convex combination:
log L(θ) = α log D(θ) + (1 - α) log G(θ),  α ∈ [0, 1]
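Under the same toy assumptions, a sketch of the blended objective, reusing `log_joint_all_classes` from the previous snippet; α = 0 recovers the generative objective and α = 1 the discriminative one.

```python
# Sketch (illustrative): log L(theta) = alpha * log D(theta) + (1 - alpha) * log G(theta),
# reusing log_joint_all_classes from the snippet above.
import numpy as np
from scipy.special import logsumexp

def log_convex_combination(alpha, X, c, means, var, class_priors):
    lj = log_joint_all_classes(X, means, var, class_priors)           # (N, K) log joints
    log_G = lj[np.arange(len(c)), c].sum()                            # sum_n log p(x_n, c_n | theta)
    log_D = (lj[np.arange(len(c)), c] - logsumexp(lj, axis=1)).sum()  # sum_n log p(c_n | x_n, theta)
    return alpha * log_D + (1.0 - alpha) * log_G                      # alpha in [0, 1]
```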
A principled hybrid model • θ - models the posterior distribution of the labels • θ’ - models the marginal distribution of the data • θ and θ’ communicate through a prior • Hybrid objective function:
L(θ, θ’) = p(θ, θ’) Πn p(cn|xn, θ) Πn p(xn|θ’)
A principled hybrid model • θ = θ’ => p(θ, θ’) = p(θ) δ(θ - θ’)
L(θ, θ’) = p(θ) δ(θ - θ’) Πn p(cn|xn, θ) Πn p(xn|θ’)
L(θ) = G(θ): the generative case
• θ ⊥ θ’ => p(θ, θ’) = p(θ) p(θ’)
L(θ, θ’) = [ p(θ) Πn p(cn|xn, θ) ] [ p(θ’) Πn p(xn|θ’) ]
L(θ, θ’) = D(θ) f(θ’): the discriminative case
A principled hybrid model • Anything in between – hybrid case • Choice of prior:
p(θ, θ’) = p(θ) N(θ’|θ, σ(α))
α → 0 => σ → 0 => θ = θ’ (generative case)
α → 1 => σ → ∞ => θ ⊥ θ’ (discriminative case)
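A minimal sketch of the hybrid objective with this Gaussian coupling prior. Taking θ and θ’ to be two copies of the class means of the toy spherical-Gaussian model is a simplifying assumption of this sketch, as are all the names.

```python
# Sketch (illustrative) of the hybrid objective
# log L(theta, theta') = log p(theta, theta') + sum_n log p(c_n | x_n, theta) + sum_n log p(x_n | theta'),
# with the coupling prior p(theta, theta') = p(theta) N(theta' | theta, sigma(alpha) I).
import numpy as np
from scipy.stats import multivariate_normal, norm
from scipy.special import logsumexp

def log_hybrid_objective(theta, theta_prime, sigma, X, c, var, class_priors):
    """theta, theta_prime: (K, D) class means; sigma: coupling scale, growing with alpha."""
    D, K = X.shape[1], len(class_priors)

    def log_joint(means):                # (N, K) matrix of log p(x_n, k | means)
        return np.stack([np.log(class_priors[k])
                         + multivariate_normal.logpdf(X, mean=means[k], cov=var * np.eye(D))
                         for k in range(K)], axis=1)

    lj = log_joint(theta)
    log_disc = (lj[np.arange(len(c)), c] - logsumexp(lj, axis=1)).sum()  # sum_n log p(c_n | x_n, theta)
    log_marg = logsumexp(log_joint(theta_prime), axis=1).sum()           # sum_n log p(x_n | theta')
    # Coupling prior: sigma -> 0 forces theta' = theta (generative case),
    # sigma -> infinity decouples them (discriminative case).
    log_prior = norm.logpdf(theta_prime.ravel(), loc=theta.ravel(), scale=sigma).sum()
    return log_disc + log_marg + log_prior
```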
Why principled? • Consistent with the likelihood of graphical models => one way to train a system • Everything can now be modelled => potential to be Bayesian • Potential to learn α
Learning • EM / Laplace approximation / MCMC: either intractable or too slow • Conjugate gradients: flexible, easy to check, BUT sensitive to initialisation and slow • Variational inference
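As a rough illustration of the conjugate-gradient option, a sketch that maximises the hybrid objective with scipy's CG optimiser, reusing `log_hybrid_objective` from the previous snippet; flattening (θ, θ’) into one vector is an assumption of this sketch, and, as the slide warns, the result depends on the initialisation.

```python
# Sketch (illustrative) of fitting the hybrid parameters with conjugate gradients,
# reusing log_hybrid_objective from the previous snippet.
import numpy as np
from scipy.optimize import minimize

def fit_hybrid_cg(X, c, var, class_priors, sigma, theta_init, theta_prime_init):
    K, D = theta_init.shape

    def negative_log_L(flat):
        theta = flat[:K * D].reshape(K, D)
        theta_prime = flat[K * D:].reshape(K, D)
        return -log_hybrid_objective(theta, theta_prime, sigma, X, c, var, class_priors)

    x0 = np.concatenate([theta_init.ravel(), theta_prime_init.ravel()])
    result = minimize(negative_log_L, x0, method="CG")   # gradients approximated numerically here
    flat = result.x
    return flat[:K * D].reshape(K, D), flat[K * D:].reshape(K, D)
```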
Content • Generative and discriminative methods • A principled hybrid framework • Study of the properties on a toy example • Influence of the amount of labelled data
Toy example • 2 elongated distributions • Only spherical Gaussians allowed => wrong model • 2 labelled points per class => strong risk of overfitting
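A sketch that reproduces the flavour of this set-up; the covariances, means, and sample sizes are purely illustrative assumptions, not the talk's actual data.

```python
# Sketch (illustrative numbers) of the toy set-up: two elongated class distributions,
# a deliberately misspecified spherical-Gaussian model, and only 2 labelled points per class.
import numpy as np

rng = np.random.default_rng(0)
cov_elongated = np.array([[5.0, 0.0], [0.0, 0.1]])                   # long along the first axis
X0 = rng.multivariate_normal([0.0, -1.0], cov_elongated, size=200)   # class 0
X1 = rng.multivariate_normal([0.0, 1.0], cov_elongated, size=200)    # class 1
X_labelled = np.vstack([X0[:2], X1[:2]])                             # 2 labelled points per class
c_labelled = np.array([0, 0, 1, 1])
X_unlabelled = np.vstack([X0[2:], X1[2:]])                           # the rest is unlabelled
# A spherical Gaussian cannot capture the elongation (wrong generative model), and with only
# 4 labels a purely discriminative fit is prone to overfitting; the hybrid trades off the two.
```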
Content • Generative and discriminative methods • A principled hybrid framework • Study of the properties on a toy example • Influence of the amount of labelled data
A real example • Images are a special case, as they contain several features each • 2 levels of supervision: at the image level, and at the feature level • Image label only => weakly labelled • Image label + segmentation => fully labelled
The underlying generative model [graphical-model diagram with multinomial, multinomial and Gaussian nodes]
The underlying generative model [diagram contrasting the weakly labelled and fully labelled cases]
Experimental set-up • 3 classes: bikes, cows, sheep • 1 Gaussian per class => poor generative model • 75 training images for each category
Results • When increasing the proportion of fully labelled data, the best-performing model shifts: generative → hybrid → discriminative • Weakly labelled data has little influence on this trend • With sufficient fully labelled data, the hybrid framework (HF) tends to perform better than the convex combination (CC)
Experimental set-up • 3 classes: lions, tigers and cheetahs • 1 Gaussian per class => poor generative model • 75 training images for each category
Results • Hybrid models consistently perform better • However, generative and discriminative models haven’t reached saturation • No clear difference between HF and CC
Conclusion • Principled hybrid framework • Possibility to learn the best trade-off • Helps on ambiguous datasets when labelled data is scarce • Optimisation remains an open problem
Future avenues • Bayesian version (posterior distribution of α) under study • Replace σ by a diagonal matrix to allow flexibility => needs the Bayesian version • Choice of priors