PAC-Bayesian Theorems for Gaussian Process Classifications
Matthias Seeger, University of Edinburgh
Overview
• PAC-Bayesian theorem for Gibbs classifiers
• Application to Gaussian process classification
• Experiments
• Conclusions
What Is a PAC Bound?
• Sample S = {(x_i, t_i) | i = 1,…,n}, drawn i.i.d. from an unknown distribution P*
• Algorithm: maps the sample S to a predictor of t* from x*; generalisation error: gen(S)
• PAC / distribution-free bound: holds uniformly over all data distributions P*
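The exact expression was an equation on the slide; as an assumption, a generic PAC bound has the shape

\Pr_{S \sim (P^*)^n}\!\Bigl[\, \mathrm{gen}(S) \;\le\; \mathrm{emp}(S) + \varepsilon(n,\delta) \,\Bigr] \;\ge\; 1-\delta ,

for every distribution P*, where emp(S) is the training error and \varepsilon(n,\delta) is a computable complexity term.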
Nonuniform PAC Bounds
• A PAC bound has to hold independently of the correctness of prior knowledge
• It does not have to be independent of prior knowledge
• Unfortunately, most standard VC bounds depend only vaguely on the prior/model they are applied to, and therefore lack tightness
Gibbs Classifiers
(Figure: graphical model with weights w generating latent outputs y1, y2, y3 and targets t1, t2, t3)
• Each w defines a classifier R^3 → {-1,+1}
• Bayes classifier: average the predictions over the posterior on w, then predict with the resulting single rule
• Gibbs classifier: draw a new independent w for each prediction
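As an illustration (not from the talk), the two prediction rules can be sketched for a toy Gaussian posterior over the weights of a linear classifier on R^3; the posterior parameters and test point below are made up:

import numpy as np

rng = np.random.default_rng(0)

# Toy posterior Q(w): Gaussian over the weights of a linear classifier on R^3.
post_mean = np.array([0.5, -1.0, 0.2])
post_cov = 0.1 * np.eye(3)

def sample_w(n):
    return rng.multivariate_normal(post_mean, post_cov, size=n)

def gibbs_predict(x):
    # Gibbs classifier: draw a fresh w from Q for every single prediction.
    w = sample_w(1)[0]
    return np.sign(x @ w)

def bayes_predict(x, n_samples=1000):
    # Bayes classifier: average the predictions of many w's, then take the sign.
    W = sample_w(n_samples)
    return np.sign(np.mean(np.sign(W @ x)))

x_star = np.array([1.0, 0.3, -0.7])
print(gibbs_predict(x_star), bayes_predict(x_star))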
PAC-Bayesian Theorem
A result for Gibbs classifiers:
• Prior P(w), independent of S
• Posterior Q(w), may depend on S
• Expected generalisation error: gen(Q) = E_{w~Q}[ gen(w) ]
• Expected empirical error: emp(S,Q) = E_{w~Q}[ emp(S,w) ]
PAC-Bayesian Theorem (II)
McAllester (1999): with probability at least 1 - δ over S, the bound (stated below) holds simultaneously for all posteriors Q
• D[Q || P]: relative entropy between posterior and prior
• If Q(w) is a feasible approximation to the Bayesian posterior, we can compute D[Q || P]
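The theorem's equation appeared as a figure; as an assumption, a standard statement in the KL form used in this line of work is: with probability at least 1 - \delta over S, simultaneously for all Q,

\mathrm{KL}\bigl[\,\mathrm{emp}(S,Q) \,\big\|\, \mathrm{gen}(Q)\,\bigr] \;\le\; \frac{D[Q\,\|\,P] + \ln\frac{n+1}{\delta}}{n} .

McAllester's original 1999 statement bounds gen(Q) - emp(S,Q) by a square-root term with essentially the same numerator; it follows from the KL form via Pinsker's inequality.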
The Proof Idea
Step 1: An inequality for a dumb classifier
• Let D(w) = KL[ emp(S,w) || gen(w) ]
• A large deviation bound holds for fixed w (use the asymptotic equipartition property)
• Since P(w) is independent of S, the bound also holds "on average" over w ~ P
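As an assumption about the omitted large-deviation statement, the fixed-w inequality is of the type

\mathbb{E}_{S}\bigl[\, e^{\,n\,D(w)} \,\bigr] \;\le\; n+1 ,

which follows from counting types (the AEP argument) for a binomial empirical error; since P(w) does not depend on S, taking the expectation over w \sim P preserves the inequality.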
The Proof Idea (II)
• Could use Jensen's inequality to turn the averaged bound of Step 1 into a bound on E_{w~P}[ D(w) ]. But so what? P is fixed a priori, giving a pretty dumb classifier!
• Can we exchange P for Q? Yes!
• What do we have to pay? An extra n^{-1} D[Q || P]
Convex Duality
• Could finish the proof using tricks and Jensen. Let's see what's behind it instead!
• Convex (Legendre) duality: a very simple but powerful concept: parameterise linear lower bounds to a convex function
• Behind the scenes (almost) everywhere: EM, variational bounds, primal-dual optimisation, …, the PAC-Bayesian theorem
The Proof Idea (III)
• Convex duality works just as well for spaces of functions and distributions
• For our purpose: the log moment generating functional Λ(l) = ln E_{w~P}[ e^{l(w)} ] is convex in l and has the dual Λ*(Q) = D[Q || P]
The Proof Idea (IV)
• This gives the bound E_{w~Q}[ l(w) ] ≤ D[Q || P] + ln E_{w~P}[ e^{l(w)} ] for all Q, l
• Set l(w) = n D(w). Then the second term on the right has already been bounded in Step 1, and on the left (Jensen again): E_{w~Q}[ D(w) ] ≥ KL[ emp(S,Q) || gen(Q) ]
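Chaining the pieces (a sketch under the assumed formulas above): with probability at least 1 - \delta over S, for all Q,

n\,\mathrm{KL}\bigl[\mathrm{emp}(S,Q)\,\|\,\mathrm{gen}(Q)\bigr]
\;\le\; \mathbb{E}_{w\sim Q}[\,n D(w)\,]
\;\le\; D[Q\,\|\,P] + \ln \mathbb{E}_{w\sim P}\bigl[e^{\,n D(w)}\bigr]
\;\le\; D[Q\,\|\,P] + \ln\frac{n+1}{\delta},

where the last step applies Markov's inequality to the S-expectation bounded in Step 1. Dividing by n gives the theorem.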
Comments
• The PAC-Bayesian technique is generic: use specific large deviation bounds for the Q-independent term
• Choice of Q: trade-off between emp(S,Q) and the divergence D[Q || P]; the Bayesian posterior is a good candidate
Gaussian Process Classification
• Recall from yesterday's talk: we approximate the true posterior process by a Gaussian one
The Relative Entropy
• But then the relative entropy D[Q || P] is just the KL divergence between two finite-dimensional Gaussians over the latent values at the training inputs
• Straightforward to compute for all GPC approximations in this class
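A minimal sketch of that computation, assuming the approximate posterior over the n latent values at the training inputs is N(mu, Sigma) and the GP prior over the same latents is N(0, K); the function is illustrative, not the talk's code:

import numpy as np

def gaussian_kl(mu, Sigma, K):
    """D[ N(mu, Sigma) || N(0, K) ] for two n-dimensional Gaussians."""
    n = mu.shape[0]
    K_chol = np.linalg.cholesky(K)
    # tr(K^{-1} Sigma) and mu^T K^{-1} mu via Cholesky solves
    Kinv_Sigma = np.linalg.solve(K_chol.T, np.linalg.solve(K_chol, Sigma))
    Kinv_mu = np.linalg.solve(K_chol.T, np.linalg.solve(K_chol, mu))
    logdet_K = 2.0 * np.sum(np.log(np.diag(K_chol)))
    logdet_Sigma = np.linalg.slogdet(Sigma)[1]
    return 0.5 * (np.trace(Kinv_Sigma) + mu @ Kinv_mu - n + logdet_K - logdet_Sigma)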
Concrete GPC Methods
Considered so far:
• Laplace GPC [Barber/Williams]
• Sparse greedy GPC (IVM) [Csato/Opper, Lawrence/Seeger/Herbrich]
Setup: downsampled MNIST (2s vs. 3s), RBF kernels, model selection using independent holdout sets (no ML-II allowed here!)
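A hedged sketch of the kind of setup described (RBF kernel plus holdout-based selection of the kernel width); the simple regularised kernel classifier below stands in for a full GPC fit, and all names are placeholders:

import numpy as np

def rbf_kernel(X1, X2, width):
    """RBF (squared-exponential) kernel matrix."""
    sq = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return np.exp(-sq / (2 * width**2))

def holdout_select(X_tr, t_tr, X_ho, t_ho, widths, jitter=1e-3):
    """Pick the kernel width with the smallest holdout error."""
    best = None
    for w in widths:
        K = rbf_kernel(X_tr, X_tr, w) + jitter * np.eye(len(X_tr))
        alpha = np.linalg.solve(K, t_tr)             # fit on the training set
        pred = np.sign(rbf_kernel(X_ho, X_tr, w) @ alpha)
        err = np.mean(pred != t_ho)                  # holdout error
        if best is None or err < best[1]:
            best = (w, err)
    return best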
Results: Sparse Greedy GPC
• Extremely tight for a kernel classifier bound
• Note: these results are for Gibbs classifiers. Bayes classifiers do better, but the (original) PAC-Bayesian theorem does not hold for them
Comparison: Compression Bound
• Compression bound for sparse greedy GPC (Bayes version, not Gibbs)
• Problem: the bound is not configurable by prior knowledge and not specific to the algorithm
Comparison With SVM
• Compression bound (the best we could find!)
• Note: bound values are lower than for sparse GPC only because of the sparser solution: the compression bound does not depend on the algorithm!
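For orientation (not from the slides), a Littlestone-Warmuth style compression bound in its simplest, zero-training-error form: if the classifier can be reconstructed from d of the n training examples and classifies the remaining n - d correctly, then with probability at least 1 - \delta,

\mathrm{gen} \;\le\; \frac{1}{\,n-d\,}\Bigl(\ln\binom{n}{d} + \ln\frac{n}{\delta}\Bigr),

so the value depends only on the sparsity d, not on how the algorithm chose those examples; the bound actually used in this comparison may be an agnostic variant of this form.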
The Bayes Classifier
• Very recently, Meir and Zhang obtained a PAC-Bayesian bound for Bayes-type classifiers
• It uses recent Rademacher complexity bounds together with a convex duality argument
• It can be applied to GP classification as well (not yet done)
Conclusions
• The PAC-Bayesian technique (convex duality) leads to tighter bounds than previously available for Bayes-type classifiers (to our knowledge)
• Easy extension to multi-class scenarios
• Application to GP classification: tighter bounds than previously available for kernel machines (to our knowledge)
Conclusions (II)
• Value in practice: the bound holds for any posterior approximation, not just the true posterior itself
• Some open problems:
  • Unbounded loss functions
  • Characterising the slack in the bound
  • Incorporating ML-II model selection over a continuous hyperparameter space