Sparse Approximations to Bayesian Gaussian Processes
Matthias Seeger, University of Edinburgh
Collaborators • Neil Lawrence (Sheffield) • Chris Williams (Edinburgh) • Ralf Herbrich (MSR Cambridge)
Overview of the Talk • Gaussian processes and approximations • Understanding sparse schemes as likelihood approximations • Two schemes and their relationships • Fast greedy selection for the projected latent variables scheme (GP regression)
Why Sparse Approximations? • GPs lead to very powerful Bayesian methods for function fitting, classification, etc. Yet: (Almost) Nobody uses them! • Reason: Horrible scaling O(n^3) • If sparse approximations work, there is a host of applications, e.g. as building blocks in Bayesian networks, etc.
Gaussian Process Models • [Graphical model: inputs x_1,…,x_n, latent outputs u_1,…,u_n, targets y_1,…,y_n] • Target y_i separated by latent u_i from all other variables • Inference a finite problem • Gaussian prior (dense), kernel K
Conditional GP (Prior) • Data D = {(x_i, y_i) | i=1,…,n}. Latent outputs u = (u_1,…,u_n): n-dim. Gaussian parameterisation • Approximate posterior process P(u(·) | D) by a GP Q(u(·) | D)
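The prior's equations were lost in extraction; as a hedged reconstruction from the definitions above (standard GP algebra, not necessarily the slide's exact parameterisation), the finite-dimensional prior and the conditional GP are:

    u ~ N(0, K),   K = (K(x_i, x_j))_{i,j=1,…,n}
    P(u(x_*) | u) = N( k_*^T K^{-1} u,  K(x_*, x_*) − k_*^T K^{-1} k_* ),   k_* = (K(x_*, x_1),…,K(x_*, x_n))^T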
GP Approximations • Most (non-MCMC) GP approximations use this representation • Exact computation of Q(u | D) is intractable in general and scales as O(n^3) • Attractive for sparse approximations: sequential fitting of Q(u | D) to P(u | D)
Assumed Density Filtering Update (ADF step):
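The update equation itself did not survive extraction; a hedged reconstruction of the standard ADF step (moment matching, as in Opper's Bayesian online learning, which the next slide cites) is:

    P̂(u) ∝ P(y_i | u_i) Q(u)
    Q_new = argmin over Gaussians Q' of KL[ P̂ || Q' ]

i.e. Q_new matches the mean and covariance of the tilted distribution P̂.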
Towards Sparsity • ADF = Bayesian Online [Opper]. Multiple updates: Cavity method [Opper, Winther], EP [Minka] • Generalizations: EP [Minka], ADATAP [Csato, Opper, Winther: COW] • Sequential updates suitable for sparse online or greedy methods
Likelihood Approximations • Active set: I ⊂ {1,…,n}, |I| = d ≪ n • Several sparse schemes can be understood as likelihood approximations in which the approximate likelihood depends on u_I only
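In symbols (a hedged sketch of the common form; the slide's own expression was lost in extraction), the exact likelihood is replaced by a term involving only the active latent variables:

    P(y | u) = ∏_{i=1,…,n} P(y_i | u_i)  ≈  t(u_I),    so that    Q(u | D) ∝ P(u) t(u_I)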
Likelihood Approximations (II) • Active set I = {2,3} • [Graphical model with n=4: nodes x_1,…,x_4, u_1,…,u_4, y_1,…,y_4, illustrating the likelihood approximation for this active set]
Likelihood Approximations (III) For such sparse schemes: • O(d^2) parameters at most • Prediction in O(d^2), O(d) for mean only • Approximations to the marginal likelihood (variational lower bound, ADATAP [COW]), PAC bounds [Seeger], etc., become cheap as well!
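For example (a hedged illustration in generic notation, not the slide's): with a likelihood approximation depending on u_I only, the predictive mean of such a scheme is a kernel expansion over the d active points,

    E_Q[u(x_*)] = Σ_{i ∈ I} β_i K(x_*, x_i)

so a mean prediction costs O(d) kernel evaluations, while the predictive variance involves a d × d matrix and costs O(d^2).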
Two Schemes • IVM [Lawrence, Seeger, Herbrich: LSH]: ADF with fast greedy forward selection • Sparse Greedy GPR [Smola, Bartlett: SB]: greedy, expensive. Can be sped up: Projected Latent Variables [Seeger, Lawrence, Williams]. More general: Sparse batch ADATAP [COW] • Not here: Sparse Online GP [Csato, Opper]
Informative Vector Machine • ADF, stopped after d inclusions [could do deletions, exchanges]; only d of the parameters are non-zero • Fast greedy forward selection using criteria known from active learning (see the sketch below) • Faster than SVM on hard MNIST binary tasks, yet probabilistic (error bars, etc.)
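A minimal sketch of this kind of greedy forward selection, assuming GP regression with Gaussian noise so that each ADF inclusion is exact (the IVM itself targets general, e.g. probit, likelihoods via ADF; the function and variable names here are illustrative, not from the paper):

    import numpy as np

    def greedy_forward_selection(K, y, noise_var, d):
        # K: n x n kernel matrix, y: targets, d: number of points to include.
        n = K.shape[0]
        mu = np.zeros(n)             # current marginal means of Q(u_i)
        var = np.diag(K).copy()      # current marginal variances of Q(u_i)
        M = np.zeros((d, n))         # rows hold L^{-1} K_{I,.} for the active set I
        beta = np.zeros(d)           # L^{-1} y_I
        active, remaining = [], list(range(n))

        for k in range(d):
            # Differential-entropy score of including point i (Gaussian-noise ADF step):
            # 0.5 * log(var_new / var_old) with var_new = var_i * noise / (var_i + noise).
            scores = [0.5 * np.log(noise_var / (var[i] + noise_var)) for i in remaining]
            j = remaining.pop(int(np.argmin(scores)))   # largest entropy reduction

            # O(n k) rank-one update of the representation after including j.
            l = M[:k, j]
            ljj = np.sqrt(var[j] + noise_var)
            M[k] = (K[j] - l @ M[:k]) / ljj
            beta[k] = (y[j] - l @ beta[:k]) / ljj
            mu += M[k] * beta[k]
            var -= M[k] ** 2
            active.append(j)

        return active, mu, var

Scoring is O(n) per inclusion and the representation update O(n d), giving O(n d^2) in total; the actual IVM replaces the Gaussian-noise update by ADF moment matching for the likelihood at hand.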
Why So Simple? • Locality property of ADF: the marginal Q_new(u_i) is obtained in O(1) from Q(u_i) • Locality property and Gaussianity yield relations that allow fast evaluation of differential criteria
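The relations shown on the slide were lost; one standard example (from ADF/EP with Gaussian Q, hedged as a reconstruction rather than the slide's exact equations) expresses the updated marginal moments through derivatives of the log partition function:

    Z_i = ∫ P(y_i | u_i) N(u_i | μ_i, ς_i^2) du_i
    μ_i^new = μ_i + ς_i^2 ∂ log Z_i / ∂μ_i,    (ς_i^new)^2 = ς_i^2 + ς_i^4 ∂^2 log Z_i / ∂μ_i^2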
KL-Optimal Projections • Csato/Opper observed: the KL-optimal way to restrict the likelihood part to depend on u_I only is to replace each u_i by its conditional prior mean E[u_i | u_I]
KL-Optimal Projections (II) • For Gaussian likelihood this gives the projected Gaussian likelihood on the next slide • Can be used online or batch • A bit unfortunate: we use relative entropy both ways around (the ADF step and the projection minimize KL in opposite directions)!
Projected Latent Variables • Full GPR samples u_I ~ P(u_I), u_R ~ P(u_R | u_I), y ~ N(y | u, σ²I) • Instead: y ~ N(y | E[u | u_I], σ²I). Latent variables u_R replaced by projections in the likelihood [SB] (without this interpretation) • Note: Sparse batch ADATAP [COW] is more general (non-Gaussian likelihoods)
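Spelled out (standard conditional-Gaussian algebra, so this should agree with the slide up to notation):

    E[u | u_I] = K_{·,I} K_I^{-1} u_I,    so    y ~ N( y | K_{·,I} K_I^{-1} u_I, σ²I ),

a likelihood that depends on u_I only, as required.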
Fast Greedy Selections • With this likelihood approximation, typical forward selection criteria (MAP [SB]; diff. entropy, info-gain [LSH]) are too expensive • Problem: upon inclusion, the latent u_i is coupled with all targets y • Cheap criterion: ignore most couplings for score evaluation (not for inclusion!)
Yet Another Approximation • To score x_i, we approximate Q_new(u | D) after inclusion of i by ignoring most of the new couplings between u_i and the targets • Example: information gain
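As a hedged illustration of why such scores become cheap (generic notation, not the slide's): an information-gain type score compares the marginal of u_i before and after the approximate inclusion, and the KL divergence between two univariate Gaussians has the closed form

    KL[ N(μ', ς'^2) || N(μ, ς^2) ] = log(ς/ς') + (ς'^2 + (μ' − μ)^2) / (2 ς^2) − 1/2,

so once the updated marginal moments are available in O(1), the score itself is O(1).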
Fast Greedy Selections (II) • Leads to O(1) criteria. Cost of searching over all remaining points dominated by cost of inclusion • Can easily be generalized to allow for couplings between u_i and some targets, if desired • Can be done for sparse batch ADATAP as well
Marginal Likelihood • The marginal likelihood of the projected scheme (sketched below) • Can be optimized efficiently w.r.t. σ and kernel parameters, O(n d (d+p)) per gradient, p the number of parameters • Keep I fixed during line searches, reselect for search directions
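Integrating u_I out of the projected likelihood above with prior u_I ~ N(0, K_I) gives, by standard Gaussian marginalisation (a hedged reconstruction; the slide's own expression was lost):

    P(y) = N( y | 0, σ²I + K_{·,I} K_I^{-1} K_{I,·} ),

and gradients of log P(y) w.r.t. σ and the kernel parameters can be computed without ever forming an n × n matrix.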
Conclusions • Most sparse approximations can be understood as likelihood approximations • Several schemes available, all O(n d^2), yet constants do matter here! • Fast information-theoretic criteria effective for classification • Extension to active learning straightforward
Conclusions (II) • Missing: experimental comparison, esp. to test effectiveness of marginal likelihood optimization • Extensions: • C classes: easy in O(n d^2 C^2), maybe in O(n d^2 C) • Integrate with Bayesian networks [Friedman, Nachman]