  1. Sparse Gaussian Process Classification With Multiple Classes
     Matthias W. Seeger, Michael I. Jordan
     University of California, Berkeley
     www.cs.berkeley.edu/~mseeger

  2. Gaussian Processes are different
     • Kernel Machines: Estimate single “best” function to solve problem
     • Bayesian Gaussian Processes: Inference over random functions, giving mean predictions and uncertainty estimates
     • Gives posterior distribution over functions
     • More expressive
     • Powerful empirical Bayesian model selection
     • Combination in larger probabilistic structure
     → Harder to run, but worth it!

  3. The Need for Linear Time
     • “So Gaussian Processes aim for more than Kernel Machines. Do they run much slower, then?” Not necessarily (anymore)!
     • GP multi-way classification:
     • Linear in the number of datapoints
     • Linear in the number of classes
     • No artificial “output coding”
     • Predictive uncertainties
     • Empirical Bayesian model selection

  4. Sparse GP Approximations
     • Lawrence, Seeger, Herbrich: IVM (NIPS 02)
     • Home in on active set I, size d
     • Replace likelihood by a likelihood approximation, a Gaussian function of the latent values at the active set I only
     • Use information criteria to find I greedily
     • Restricted to models with one process only (like other sparse GP methods)

  5. Multi-Class Models
     • Multinomial Likelihood (“Softmax”)
     • Use one process u_c(·) for each class
     • Processes independent a priori
     • Different kernels K(c) for each class
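
     A minimal sketch of the multinomial ("softmax") likelihood for one datapoint, assuming the standard form P(y | u) = exp(u_y) / sum_c exp(u_c), where u holds the C latent process values at that point. The function name and the toy values are illustrative only, not taken from the slides.

```python
import math

def softmax_likelihood(u, y):
    """P(y | u) for latent values u = [u_1, ..., u_C] at a single datapoint."""
    m = max(u)                                # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in u]
    return exps[y] / sum(exps)

# Toy example with C = 5 classes (values are made up for illustration).
u_example = [0.3, -1.2, 2.0, 0.0, 0.5]
probs = [softmax_likelihood(u_example, c) for c in range(len(u_example))]
```

     The class probabilities sum to one, and equal latent values give the uniform distribution 1/C.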

  6. “But That’s Easy…”
     • … we thought back then, but: Posterior covariance
     • Both are block-diagonal, but in different systems! Together: A has no simple structure!
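
     The structural point can be illustrated with sparsity patterns alone. The sketch below assumes that the two matrices in question are the prior covariance (blocks per class, since the processes are independent a priori) and the likelihood precision (blocks per datapoint, since the likelihood factorizes over datapoints). It counts the independent diagonal blocks each pattern admits under the best reordering, which equals the number of connected components of the pattern's graph. Sizes and names are illustrative.

```python
def count_blocks(num_nodes, linked):
    """Connected components of the graph whose edges are the nonzero matrix entries."""
    seen = set()
    count = 0
    for start in range(num_nodes):
        if start in seen:
            continue
        count += 1
        stack = [start]
        seen.add(start)
        while stack:
            a = stack.pop()
            for b in range(num_nodes):
                if b not in seen and linked(a, b):
                    seen.add(b)
                    stack.append(b)
    return count

n, C = 4, 3                      # toy sizes: 4 datapoints, 3 classes

def cls(k):                      # class index of latent variable k
    return k // n

def pt(k):                       # datapoint index of latent variable k
    return k % n

def prior_nonzero(a, b):         # prior couples variables of the same class
    return cls(a) == cls(b)

def lik_nonzero(a, b):           # likelihood couples variables of the same datapoint
    return pt(a) == pt(b)

def both_nonzero(a, b):          # pattern of the sum of the two matrices
    return prior_nonzero(a, b) or lik_nonzero(a, b)

blocks_prior = count_blocks(n * C, prior_nonzero)   # C blocks
blocks_lik = count_blocks(n * C, lik_nonzero)       # n blocks
blocks_both = count_blocks(n * C, both_nonzero)     # a single block
```

     Each matrix alone splits into many small blocks, but no single ordering of the variables splits their combination.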

  7. Second Order Approximation
     • u(c) should be coupled a posteriori → Diagonal not useful
     • Hessian of the log likelihood has simple form
     • Allow for likelihood coupling to be represented exactly up to second order: diagonal minus rank 1
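
     The "diagonal minus rank 1" form can be checked numerically. For the softmax likelihood with class probabilities π, the negative Hessian of log P(y | u) with respect to u is diag(π) − π πᵀ, independent of y. The sketch below compares this closed form against central finite differences; the function names, the step size and the test point are illustrative choices.

```python
import math

def softmax(u):
    m = max(u)
    e = [math.exp(v - m) for v in u]
    s = sum(e)
    return [v / s for v in e]

def log_lik(u, y):
    """log P(y | u) for the softmax likelihood."""
    m = max(u)
    return u[y] - m - math.log(sum(math.exp(v - m) for v in u))

def neg_hessian(u):
    """Closed form: diag(pi) - pi pi^T (diagonal minus rank 1)."""
    p = softmax(u)
    C = len(u)
    return [[(p[a] if a == b else 0.0) - p[a] * p[b] for b in range(C)]
            for a in range(C)]

def neg_hessian_fd(u, y, h=1e-3):
    """Negative Hessian of log_lik by central finite differences."""
    C = len(u)
    H = [[0.0] * C for _ in range(C)]
    for a in range(C):
        for b in range(C):
            def f(da, db):
                v = list(u)
                v[a] += da
                v[b] += db
                return log_lik(v, y)
            H[a][b] = -(f(h, h) - f(h, -h) - f(-h, h) + f(-h, -h)) / (4 * h * h)
    return H

u0 = [0.3, -1.2, 2.0, 0.0, 0.5]
exact = neg_hessian(u0)
approx = neg_hessian_fd(u0, y=1)
max_err = max(abs(exact[a][b] - approx[a][b])
              for a in range(len(u0)) for b in range(len(u0)))
```

     Each row of diag(π) − π πᵀ sums to zero, so the matrix is singular along the all-ones direction.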

  8. Subproblems
     • Efficient representation exploiting the prior independence and constrained form
     • ADF projections to constrained Gaussian to compute site precision blocks
     • Forward selection of I
     • Extensions of simple myopic scheme
     • Model selection based on conditional inference

  9. Representation
     • Exploits block-diagonal matrix structures
     • Nontrivial to get numerics right (Cholesky factors)
     • Dominating: stub buffers, used to compute marginal moments
     • Update after inclusion (stubs) in total

  10. Restricted ADF Projection
     • Hard (non-convex) because constrained
     • Use double-loop scheme: outer loop analytic, inner loop convex → very fast
     • Initialization matters. Our choice can be motivated from second order approximation (once more)

  11. Information Gain Criterion
     • Selection score measures “informativeness” of candidates, given current belief, relative to belief after inclusion of candidate i
     • Points close to or on the wrong side of class boundaries
     • Requires marginal, computed from stubs
     • Score candidates prior to each inclusion
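
     The score formula itself did not survive on the slide. As a hedged illustration only, the sketch below assumes an information-gain score given by the KL divergence from the candidate's current marginal to its marginal after inclusion, and simplifies the C-dimensional marginal to a one-dimensional Gaussian. The function name and all numbers are hypothetical.

```python
import math

def gaussian_kl(m_new, v_new, m_old, v_old):
    """KL( N(m_new, v_new) || N(m_old, v_old) ) for univariate Gaussians."""
    return 0.5 * (math.log(v_old / v_new)
                  + (v_new + (m_new - m_old) ** 2) / v_old
                  - 1.0)

# Hypothetical candidate near the class boundary: inclusion changes its marginal a lot.
score_near = gaussian_kl(m_new=0.8, v_new=0.5, m_old=0.1, v_old=1.0)

# Hypothetical candidate far from the boundary: inclusion barely changes its marginal.
score_far = gaussian_kl(m_new=3.05, v_new=0.19, m_old=3.0, v_old=0.2)

# Candidates are ranked by score; the largest information gain wins.
winner = "near" if score_near > score_far else "far"
```

     Under this assumed score, a point whose marginal would move substantially on inclusion ranks above one that is already well explained.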

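  12. Extensions of Myopic Scheme
     [Diagram: active set I split into a solid set and a liquid set; inclusion adds pattern i to the liquid set, freezing moves patterns from the liquid set to the solid set]
     • Solid set: growing, fixed site parameters (for efficiency)
     • Liquid set: fixed size, site parameters iteratively updated using EP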

  13. Overview Inference Algorithm
     • Selection Phase: Compute marginals, score O(n/C) candidates. Select winner
     • Inclusion Phase: Include pattern. Move oldest liquid to solid active set
     • EP Phase: Run EP updates iteratively on liquid set site parameters
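
     The three phases can be sketched as a bookkeeping loop over the solid and liquid sets. This is a skeleton under stated assumptions, not the authors' implementation: `score` and `ep_update` are hypothetical callbacks standing in for the information-gain score and the EP site update, and drawing the n/C candidates uniformly at random is an assumption of this sketch.

```python
import random
from collections import deque

def run_inference_skeleton(n, num_classes, d_final, liquid_size,
                           score, ep_update, seed=0):
    """Grow the active set to d_final points, keeping the newest ones 'liquid'."""
    rng = random.Random(seed)
    solid = []                 # frozen points: site parameters stay fixed
    liquid = deque()           # newest points: site parameters refined by EP
    remaining = set(range(n))
    while len(solid) + len(liquid) < d_final and remaining:
        # Selection phase: score about n/C candidates and pick the winner.
        k = min(max(1, n // num_classes), len(remaining))
        candidates = rng.sample(sorted(remaining), k)
        winner = max(candidates, key=score)
        # Inclusion phase: include the winner, freeze the oldest liquid point.
        remaining.remove(winner)
        liquid.append(winner)
        if len(liquid) > liquid_size:
            solid.append(liquid.popleft())
        # EP phase: update site parameters of liquid points only.
        for i in liquid:
            ep_update(i)
    return solid, list(liquid)

# Toy run with placeholder callbacks (sizes follow the preliminary experiment).
ep_calls = []
solid_set, liquid_set = run_inference_skeleton(
    n=800, num_classes=5, d_final=150, liquid_size=25,
    score=lambda i: -i,            # placeholder score, not a real criterion
    ep_update=ep_calls.append)     # placeholder: just records the call
```

     A deque keeps the liquid set in insertion order, so the oldest liquid point is always the one that gets frozen.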

  14. Model Selection
     • Use variational bound on marginal likelihood based on inference approximation
     • Gradient costs inference plus
     • Minimize using Quasi Newton, reselecting I and site parameters for new search directions (non-standard optimization problem)

  15. Preliminary Experiments
     • Small part of MNIST (even digits, C=5, n=800)
     • No model selection (MS not yet tested), all K(c) the same:
     • d_final=150, L=25 (liquid set)

  16. Preliminary Experiments (2)

  17. Preliminary Experiments (3)

  18. Preliminary Experiments (4)

  19. Future Experiments
     • Much larger experiments are in preparation, including model selection
     • Uses novel powerful object-oriented Matlab/C++ interface
     • Control over very large persistent C++ objects from Matlab
     • Faster transition: prototype (Matlab) → product (C++)
     • Powerful matrix classes (masking, LAPACK/BLAS)
     • Optimization code
     • Will be released into public domain

  20. Future Work
     • Experiments on much larger tasks
     • Model selection with independent, heavily parameterized kernels (ARD, …)
     • Present scheme cannot be used for large C

  21. Future Work (2)
     Gaussian process priors in large structured networks
     Gaussian process conditional random fields, …
     • Previous work addresses function “point estimation”. We aim for GP inference including uncertainty estimates
     • Have to deal with huge random field: correlations not only between datapoints, but also along time. Automatic factorizations will be crucial
     • The multi-class scheme will be a major building block

