PAC-Bayesian Theorems for Gaussian Process Classifications
Matthias Seeger, University of Edinburgh
Overview
• PAC-Bayesian theorem for Gibbs classifiers
• Application to Gaussian process classification
• Experiments
• Conclusions
What Is a PAC Bound?
• Sample S = {(x_i, t_i) | i = 1,…,n}, drawn i.i.d. from an unknown distribution P*
• Algorithm: maps the sample S to a predictor of t* from x*; generalisation error: gen(S)
• PAC / distribution-free bound: holds uniformly over all data distributions P*
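The exact expression was an equation on the slide; as an assumption, a generic PAC bound has the shape

\Pr_{S \sim (P^*)^n}\!\Bigl[\, \mathrm{gen}(S) \;\le\; \mathrm{emp}(S) + \varepsilon(n,\delta) \,\Bigr] \;\ge\; 1-\delta ,

for every distribution P*, where emp(S) is the training error and \varepsilon(n,\delta) is a computable complexity term.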
Nonuniform PAC Bounds
• A PAC bound has to hold independently of the correctness of prior knowledge
• It does not have to be independent of prior knowledge
• Unfortunately, most standard VC bounds depend only vaguely on the prior/model they are applied to, and therefore lack tightness
Gibbs Classifiers
(Figure: graphical model with weights w generating latent outputs y1, y2, y3 and targets t1, t2, t3)
• Each w defines a classifier R^3 → {-1,+1}
• Bayes classifier: average the predictions over the posterior on w, then predict with the resulting single rule
• Gibbs classifier: draw a new independent w for each prediction
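As an illustration (not from the talk), the two prediction rules can be sketched for a toy Gaussian posterior over the weights of a linear classifier on R^3; the posterior parameters and test point below are made up:

import numpy as np

rng = np.random.default_rng(0)

# Toy posterior Q(w): Gaussian over the weights of a linear classifier on R^3.
post_mean = np.array([0.5, -1.0, 0.2])
post_cov = 0.1 * np.eye(3)

def sample_w(n):
    return rng.multivariate_normal(post_mean, post_cov, size=n)

def gibbs_predict(x):
    # Gibbs classifier: draw a fresh w from Q for every single prediction.
    w = sample_w(1)[0]
    return np.sign(x @ w)

def bayes_predict(x, n_samples=1000):
    # Bayes classifier: average the predictions of many w's, then take the sign.
    W = sample_w(n_samples)
    return np.sign(np.mean(np.sign(W @ x)))

x_star = np.array([1.0, 0.3, -0.7])
print(gibbs_predict(x_star), bayes_predict(x_star))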
PAC-Bayesian Theorem
A result for Gibbs classifiers:
• Prior P(w), independent of S
• Posterior Q(w), may depend on S
• Expected generalisation error: gen(Q) = E_{w~Q}[ gen(w) ]
• Expected empirical error: emp(S,Q) = E_{w~Q}[ emp(S,w) ]
PAC-Bayesian Theorem (II)
McAllester (1999): with probability at least 1 - δ over S, the bound (stated below) holds simultaneously for all posteriors Q
• D[Q || P]: relative entropy between posterior and prior
• If Q(w) is a feasible approximation to the Bayesian posterior, we can compute D[Q || P]
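The theorem's equation appeared as a figure; as an assumption, a standard statement in the KL form used in this line of work is: with probability at least 1 - \delta over S, simultaneously for all Q,

\mathrm{KL}\bigl[\,\mathrm{emp}(S,Q) \,\big\|\, \mathrm{gen}(Q)\,\bigr] \;\le\; \frac{D[Q\,\|\,P] + \ln\frac{n+1}{\delta}}{n} .

McAllester's original 1999 statement bounds gen(Q) - emp(S,Q) by a square-root term with essentially the same numerator; it follows from the KL form via Pinsker's inequality.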
The Proof Idea
Step 1: An inequality for a dumb classifier
• Let D(w) = KL[ emp(S,w) || gen(w) ]
• A large deviation bound holds for fixed w (use the asymptotic equipartition property)
• Since P(w) is independent of S, the bound also holds "on average" over w ~ P
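As an assumption about the omitted large-deviation statement, the fixed-w inequality is of the type

\mathbb{E}_{S}\bigl[\, e^{\,n\,D(w)} \,\bigr] \;\le\; n+1 ,

which follows from counting types (the AEP argument) for a binomial empirical error; since P(w) does not depend on S, taking the expectation over w \sim P preserves the inequality.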
The Proof Idea (II)
• Could use Jensen's inequality to turn the averaged bound of Step 1 into a bound on E_{w~P}[ D(w) ]. But so what? P is fixed a priori, giving a pretty dumb classifier!
• Can we exchange P for Q? Yes!
• What do we have to pay? An extra n^{-1} D[Q || P]
Convex Duality
• Could finish the proof using tricks and Jensen. Let's see what's behind it instead!
• Convex (Legendre) duality: a very simple but powerful concept: parameterise linear lower bounds to a convex function
• Behind the scenes (almost) everywhere: EM, variational bounds, primal-dual optimisation, …, the PAC-Bayesian theorem
The Proof Idea (III)
• Convex duality works just as well for spaces of functions and distributions
• For our purpose: the log moment generating functional Λ(l) = ln E_{w~P}[ e^{l(w)} ] is convex in l and has the dual Λ*(Q) = D[Q || P]
The Proof Idea (IV)
• This gives the bound E_{w~Q}[ l(w) ] ≤ D[Q || P] + ln E_{w~P}[ e^{l(w)} ] for all Q, l
• Set l(w) = n D(w). Then the second term on the right has already been bounded in Step 1, and on the left (Jensen again): E_{w~Q}[ D(w) ] ≥ KL[ emp(S,Q) || gen(Q) ]
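Chaining the pieces (a sketch under the assumed formulas above): with probability at least 1 - \delta over S, for all Q,

n\,\mathrm{KL}\bigl[\mathrm{emp}(S,Q)\,\|\,\mathrm{gen}(Q)\bigr]
\;\le\; \mathbb{E}_{w\sim Q}[\,n D(w)\,]
\;\le\; D[Q\,\|\,P] + \ln \mathbb{E}_{w\sim P}\bigl[e^{\,n D(w)}\bigr]
\;\le\; D[Q\,\|\,P] + \ln\frac{n+1}{\delta},

where the last step applies Markov's inequality to the S-expectation bounded in Step 1. Dividing by n gives the theorem.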
Comments
• The PAC-Bayesian technique is generic: use specific large deviation bounds for the Q-independent term
• Choice of Q: trade-off between emp(S,Q) and the divergence D[Q || P]; the Bayesian posterior is a good candidate
Gaussian Process Classification
• Recall from yesterday's talk: we approximate the true posterior process by a Gaussian one
The Relative Entropy
• But then the relative entropy D[Q || P] is just the KL divergence between two finite-dimensional Gaussians over the latent values at the training inputs
• Straightforward to compute for all GPC approximations in this class
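A minimal sketch of that computation, assuming the approximate posterior over the n latent values at the training inputs is N(mu, Sigma) and the GP prior over the same latents is N(0, K); the function is illustrative, not the talk's code:

import numpy as np

def gaussian_kl(mu, Sigma, K):
    """D[ N(mu, Sigma) || N(0, K) ] for two n-dimensional Gaussians."""
    n = mu.shape[0]
    K_chol = np.linalg.cholesky(K)
    # tr(K^{-1} Sigma) and mu^T K^{-1} mu via Cholesky solves
    Kinv_Sigma = np.linalg.solve(K_chol.T, np.linalg.solve(K_chol, Sigma))
    Kinv_mu = np.linalg.solve(K_chol.T, np.linalg.solve(K_chol, mu))
    logdet_K = 2.0 * np.sum(np.log(np.diag(K_chol)))
    logdet_Sigma = np.linalg.slogdet(Sigma)[1]
    return 0.5 * (np.trace(Kinv_Sigma) + mu @ Kinv_mu - n + logdet_K - logdet_Sigma)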
Concrete GPC Methods
Considered so far:
• Laplace GPC [Barber/Williams]
• Sparse greedy GPC (IVM) [Csato/Opper, Lawrence/Seeger/Herbrich]
Setup: downsampled MNIST (2s vs. 3s), RBF kernels, model selection using independent holdout sets (no ML-II allowed here!)
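A hedged sketch of the kind of setup described (RBF kernel plus holdout-based selection of the kernel width); the simple regularised kernel classifier below stands in for a full GPC fit, and all names are placeholders:

import numpy as np

def rbf_kernel(X1, X2, width):
    """RBF (squared-exponential) kernel matrix."""
    sq = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return np.exp(-sq / (2 * width**2))

def holdout_select(X_tr, t_tr, X_ho, t_ho, widths, jitter=1e-3):
    """Pick the kernel width with the smallest holdout error."""
    best = None
    for w in widths:
        K = rbf_kernel(X_tr, X_tr, w) + jitter * np.eye(len(X_tr))
        alpha = np.linalg.solve(K, t_tr)             # fit on the training set
        pred = np.sign(rbf_kernel(X_ho, X_tr, w) @ alpha)
        err = np.mean(pred != t_ho)                  # holdout error
        if best is None or err < best[1]:
            best = (w, err)
    return best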
Results: Sparse Greedy GPC
• Extremely tight for a kernel classifier bound
• Note: these results are for Gibbs classifiers. Bayes classifiers do better, but the (original) PAC-Bayesian theorem does not hold for them
Comparison: Compression Bound
• Compression bound for sparse greedy GPC (Bayes version, not Gibbs)
• Problem: the bound is not configurable by prior knowledge and not specific to the algorithm
Comparison With SVM
• Compression bound (the best we could find!)
• Note: bound values are lower than for sparse GPC only because of the sparser solution: the compression bound does not depend on the algorithm!
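For orientation (not from the slides), a Littlestone-Warmuth style compression bound in its simplest, zero-training-error form: if the classifier can be reconstructed from d of the n training examples and classifies the remaining n - d correctly, then with probability at least 1 - \delta,

\mathrm{gen} \;\le\; \frac{1}{\,n-d\,}\Bigl(\ln\binom{n}{d} + \ln\frac{n}{\delta}\Bigr),

so the value depends only on the sparsity d, not on how the algorithm chose those examples; the bound actually used in this comparison may be an agnostic variant of this form.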
The Bayes Classifier
• Very recently, Meir and Zhang obtained a PAC-Bayesian bound for Bayes-type classifiers
• It uses recent Rademacher complexity bounds together with a convex duality argument
• It can be applied to GP classification as well (not yet done)
Conclusions
• The PAC-Bayesian technique (convex duality) leads to tighter bounds than previously available for Bayes-type classifiers (to our knowledge)
• Easy extension to multi-class scenarios
• Application to GP classification: tighter bounds than previously available for kernel machines (to our knowledge)
Conclusions (II)
• Value in practice: the bound holds for any posterior approximation, not just the true posterior itself
• Some open problems:
  • Unbounded loss functions
  • Characterising the slack in the bound
  • Incorporating ML-II model selection over a continuous hyperparameter space