Sparse Approximations to Bayesian Gaussian Processes
Matthias Seeger, University of Edinburgh
Collaborators • Neil Lawrence (Sheffield) • Chris Williams (Edinburgh) • Ralf Herbrich (MSR Cambridge)
Overview of the Talk • Gaussian processes and approximations • Understanding sparse schemes as likelihood approximations • Two schemes and their relationships • Fast greedy selection for the projected latent variables scheme (GP regression)
Why Sparse Approximations? • GPs lead to very powerful Bayesian methods for function fitting, classification, etc. Yet: (Almost) Nobody uses them! • Reason: Horrible scaling O(n^3) • If sparse approximations work, there is a host of applications, e.g. as building blocks in Bayesian networks, etc.
Gaussian Process Models • [Graphical model: inputs x_1,…,x_n, latent outputs u_1,…,u_n, targets y_1,…,y_n] • Target y_i separated by latent u_i from all other variables • Inference a finite problem • Gaussian prior (dense), kernel K
Conditional GP (Prior) • Data D = {(x_i, y_i) | i=1,…,n}. Latent outputs u = (u_1,…,u_n): n-dim. Gaussian parameterisation • Approximate posterior process P(u(·) | D) by a GP Q(u(·) | D)
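The prior's equations were lost in extraction; as a hedged reconstruction from the definitions above (standard GP algebra, not necessarily the slide's exact parameterisation), the finite-dimensional prior and the conditional GP are:

    u ~ N(0, K),   K = (K(x_i, x_j))_{i,j=1,…,n}
    P(u(x_*) | u) = N( k_*^T K^{-1} u,  K(x_*, x_*) − k_*^T K^{-1} k_* ),   k_* = (K(x_*, x_1),…,K(x_*, x_n))^T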
GP Approximations • Most (non-MCMC) GP approximations use this representation • Exact computation of Q(u | D) is intractable in general and scales as O(n^3) • Attractive for sparse approximations: sequential fitting of Q(u | D) to P(u | D)
Assumed Density Filtering Update (ADF step):
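The update equation itself did not survive extraction; a hedged reconstruction of the standard ADF step (moment matching, as in Opper's Bayesian online learning, which the next slide cites) is:

    P̂(u) ∝ P(y_i | u_i) Q(u)
    Q_new = argmin over Gaussians Q' of KL[ P̂ || Q' ]

i.e. Q_new matches the mean and covariance of the tilted distribution P̂.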
Towards Sparsity • ADF = Bayesian Online [Opper]. Multiple updates: Cavity method [Opper, Winther], EP [Minka] • Generalizations: EP [Minka], ADATAP [Csato, Opper, Winther: COW] • Sequential updates suitable for sparse online or greedy methods
Likelihood Approximations • Active set: I ⊂ {1,…,n}, |I| = d ≪ n • Several sparse schemes can be understood as likelihood approximations in which the approximate likelihood depends on u_I only
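In symbols (a hedged sketch of the common form; the slide's own expression was lost in extraction), the exact likelihood is replaced by a term involving only the active latent variables:

    P(y | u) = ∏_{i=1,…,n} P(y_i | u_i)  ≈  t(u_I),    so that    Q(u | D) ∝ P(u) t(u_I)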
Likelihood Approximations (II) • Active set I = {2,3} • [Graphical model with n=4: nodes x_1,…,x_4, u_1,…,u_4, y_1,…,y_4, illustrating the likelihood approximation for this active set]
Likelihood Approximations (III) For such sparse schemes: • O(d^2) parameters at most • Prediction in O(d^2), O(d) for mean only • Approximations to the marginal likelihood (variational lower bound, ADATAP [COW]), PAC bounds [Seeger], etc., become cheap as well!
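For example (a hedged illustration in generic notation, not the slide's): with a likelihood approximation depending on u_I only, the predictive mean of such a scheme is a kernel expansion over the d active points,

    E_Q[u(x_*)] = Σ_{i ∈ I} β_i K(x_*, x_i)

so a mean prediction costs O(d) kernel evaluations, while the predictive variance involves a d × d matrix and costs O(d^2).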
Two Schemes • IVM [Lawrence, Seeger, Herbrich: LSH]: ADF with fast greedy forward selection • Sparse Greedy GPR [Smola, Bartlett: SB]: greedy, expensive. Can be sped up: Projected Latent Variables [Seeger, Lawrence, Williams]. More general: Sparse batch ADATAP [COW] • Not here: Sparse Online GP [Csato, Opper]
Informative Vector Machine • ADF, stopped after d inclusions [could do deletions, exchanges]; only d of the parameters are non-zero • Fast greedy forward selection using criteria known from active learning (see the sketch below) • Faster than SVM on hard MNIST binary tasks, yet probabilistic (error bars, etc.)
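A minimal sketch of this kind of greedy forward selection, assuming GP regression with Gaussian noise so that each ADF inclusion is exact (the IVM itself targets general, e.g. probit, likelihoods via ADF; the function and variable names here are illustrative, not from the paper):

    import numpy as np

    def greedy_forward_selection(K, y, noise_var, d):
        # K: n x n kernel matrix, y: targets, d: number of points to include.
        n = K.shape[0]
        mu = np.zeros(n)             # current marginal means of Q(u_i)
        var = np.diag(K).copy()      # current marginal variances of Q(u_i)
        M = np.zeros((d, n))         # rows hold L^{-1} K_{I,.} for the active set I
        beta = np.zeros(d)           # L^{-1} y_I
        active, remaining = [], list(range(n))

        for k in range(d):
            # Differential-entropy score of including point i (Gaussian-noise ADF step):
            # 0.5 * log(var_new / var_old) with var_new = var_i * noise / (var_i + noise).
            scores = [0.5 * np.log(noise_var / (var[i] + noise_var)) for i in remaining]
            j = remaining.pop(int(np.argmin(scores)))   # largest entropy reduction

            # O(n k) rank-one update of the representation after including j.
            l = M[:k, j]
            ljj = np.sqrt(var[j] + noise_var)
            M[k] = (K[j] - l @ M[:k]) / ljj
            beta[k] = (y[j] - l @ beta[:k]) / ljj
            mu += M[k] * beta[k]
            var -= M[k] ** 2
            active.append(j)

        return active, mu, var

Scoring is O(n) per inclusion and the representation update O(n d), giving O(n d^2) in total; the actual IVM replaces the Gaussian-noise update by ADF moment matching for the likelihood at hand.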
Why So Simple? • Locality property of ADF: the marginal Q_new(u_i) is obtained in O(1) from Q(u_i) • Locality property and Gaussianity yield relations that allow fast evaluation of differential criteria
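The relations shown on the slide were lost; one standard example (from ADF/EP with Gaussian Q, hedged as a reconstruction rather than the slide's exact equations) expresses the updated marginal moments through derivatives of the log partition function:

    Z_i = ∫ P(y_i | u_i) N(u_i | μ_i, ς_i^2) du_i
    μ_i^new = μ_i + ς_i^2 ∂ log Z_i / ∂μ_i,    (ς_i^new)^2 = ς_i^2 + ς_i^4 ∂^2 log Z_i / ∂μ_i^2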
KL-Optimal Projections • Csato/Opper observed: the KL-optimal way to restrict the likelihood part to depend on u_I only is to replace each u_i by its conditional prior mean E[u_i | u_I]
KL-Optimal Projections (II) • For Gaussian likelihood this gives the projected Gaussian likelihood on the next slide • Can be used online or batch • A bit unfortunate: we use relative entropy both ways around (the ADF step and the projection minimize KL in opposite directions)!
Projected Latent Variables • Full GPR samples u_I ~ P(u_I), u_R ~ P(u_R | u_I), y ~ N(y | u, σ²I) • Instead: y ~ N(y | E[u | u_I], σ²I). Latent variables u_R replaced by projections in the likelihood [SB] (without this interpretation) • Note: Sparse batch ADATAP [COW] is more general (non-Gaussian likelihoods)
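Spelled out (standard conditional-Gaussian algebra, so this should agree with the slide up to notation):

    E[u | u_I] = K_{·,I} K_I^{-1} u_I,    so    y ~ N( y | K_{·,I} K_I^{-1} u_I, σ²I ),

a likelihood that depends on u_I only, as required.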
Fast Greedy Selections • With this likelihood approximation, typical forward selection criteria (MAP [SB]; diff. entropy, info-gain [LSH]) are too expensive • Problem: upon inclusion, the latent u_i is coupled with all targets y • Cheap criterion: ignore most couplings for score evaluation (not for inclusion!)
Yet Another Approximation • To score x_i, we approximate Q_new(u | D) after inclusion of i by ignoring most of the new couplings between u_i and the targets • Example: information gain
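As a hedged illustration of why such scores become cheap (generic notation, not the slide's): an information-gain type score compares the marginal of u_i before and after the approximate inclusion, and the KL divergence between two univariate Gaussians has the closed form

    KL[ N(μ', ς'^2) || N(μ, ς^2) ] = log(ς/ς') + (ς'^2 + (μ' − μ)^2) / (2 ς^2) − 1/2,

so once the updated marginal moments are available in O(1), the score itself is O(1).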
Fast Greedy Selections (II) • Leads to O(1) criteria. Cost of searching over all remaining points dominated by cost of inclusion • Can easily be generalized to allow for couplings between u_i and some targets, if desired • Can be done for sparse batch ADATAP as well
Marginal Likelihood • The marginal likelihood of the projected scheme (sketched below) • Can be optimized efficiently w.r.t. σ and kernel parameters, O(n d (d+p)) per gradient, p the number of parameters • Keep I fixed during line searches, reselect for search directions
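Integrating u_I out of the projected likelihood above with prior u_I ~ N(0, K_I) gives, by standard Gaussian marginalisation (a hedged reconstruction; the slide's own expression was lost):

    P(y) = N( y | 0, σ²I + K_{·,I} K_I^{-1} K_{I,·} ),

and gradients of log P(y) w.r.t. σ and the kernel parameters can be computed without ever forming an n × n matrix.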
Conclusions • Most sparse approximations can be understood as likelihood approximations • Several schemes available, all O(n d^2), yet constants do matter here! • Fast information-theoretic criteria effective for classification • Extension to active learning straightforward
Conclusions (II) • Missing: experimental comparison, esp. to test effectiveness of marginal likelihood optimization • Extensions: • C classes: easy in O(n d^2 C^2), maybe in O(n d^2 C) • Integrate with Bayesian networks [Friedman, Nachman]