
Sparse Gaussian Process Classification With Multiple Classes


Presentation Transcript


  1. Sparse Gaussian Process Classification With Multiple Classes. Matthias W. Seeger, Michael I. Jordan. University of California, Berkeley. www.cs.berkeley.edu/~mseeger

  2. Gaussian Processes are different • Kernel Machines: estimate a single “best” function to solve the problem • Bayesian Gaussian Processes: inference over random functions → mean predictions and uncertainty estimates (a minimal sketch follows below) • Gives a posterior distribution over functions • More expressive • Powerful empirical Bayesian model selection • Combination in larger probabilistic structures → Harder to run, but worth it!
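To make the contrast concrete, here is a minimal sketch (mine, not from the talk) of what “mean predictions and uncertainty estimates” means in the simplest GP setting, regression with a squared-exponential kernel: the posterior yields both a predictive mean and a predictive variance at every test input, whereas a kernel machine would return only a single point estimate. All names and numbers below are illustrative.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel matrix between row-wise point sets A and B."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return variance * np.exp(-0.5 * sq / lengthscale**2)

def gp_posterior(X, y, Xstar, noise=0.1):
    """GP regression posterior: predictive mean AND variance at the test inputs."""
    K = rbf_kernel(X, X) + noise**2 * np.eye(len(X))
    Ks = rbf_kernel(X, Xstar)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # K^{-1} y
    mean = Ks.T @ alpha
    v = np.linalg.solve(L, Ks)
    var = np.diag(rbf_kernel(Xstar, Xstar)) - np.sum(v**2, axis=0)
    return mean, var

# Toy data: the posterior variance grows away from the training inputs.
X = np.linspace(-3, 3, 20)[:, None]
y = np.sin(X[:, 0]) + 0.1 * np.random.randn(20)
Xstar = np.linspace(-5, 5, 50)[:, None]
mean, var = gp_posterior(X, y, Xstar)
```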

  3. The Need for Linear Time • “So Gaussian Processes aim for more than Kernel Machines --- do they run much slower, then?” Not necessarily (anymore)! • GP multi-way classification: • Linear in the number of datapoints • Linear in the number of classes • No artificial “output coding” • Predictive uncertainties • Empirical Bayesian model selection

  4. Sparse GP Approximations • Lawrence, Seeger, Herbrich: IVM (NIPS 02) • Home in on an active set I of size d ≪ n • Replace the likelihood by a likelihood approximation: a Gaussian function of the active-set variables only • Use information criteria to find I greedily • Restricted to models with a single process only (like other sparse GP methods)
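As a rough illustration of the IVM idea (single-process case, dense linear algebra for clarity, not the paper's efficient representation): the exact likelihood is replaced by Gaussian “site” terms attached only to the d active points, so the approximate posterior stays Gaussian and is determined by the active set alone. The names `site_prec` and `site_b` stand in for whatever site parameters ADF/EP would produce; they are my own placeholders.

```python
import numpy as np

def ivm_posterior(K, active, site_prec, site_b):
    """Sparse Gaussian approximation: Gaussian 'site' terms on the d active
    points only, giving posterior
        A = (K^{-1} + Pi)^{-1},   h = A b,
    where Pi (diagonal) and b vanish outside the active set I."""
    n = K.shape[0]
    Pi = np.zeros((n, n))
    b = np.zeros(n)
    Pi[active, active] = site_prec             # site precisions on the active set
    b[active] = site_b                         # site 'pseudo-observations'
    A = np.linalg.inv(np.linalg.inv(K) + Pi)   # dense formula, for clarity only
    return A @ b, A                            # posterior mean h and covariance A

# Toy usage: n = 6 points, active set I of size d = 2, made-up site parameters.
K = np.exp(-0.5 * (np.arange(6)[:, None] - np.arange(6)[None, :]) ** 2)
h, A = ivm_posterior(K + 1e-8 * np.eye(6), np.array([1, 4]),
                     np.array([2.0, 1.5]), np.array([0.7, -0.3]))
```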

  5. Multi-Class Models • Multinomial likelihood (“softmax”) • Use one process u^(c)(·) for each class • Processes independent a priori • Different kernels K^(c) for each class
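A minimal sketch of this model, under my own naming: one latent value per class at each input, combined through the softmax, with the C processes drawn independently from priors that may use different kernel matrices K^(c).

```python
import numpy as np

def softmax_likelihood(u, y):
    """Multinomial ('softmax') likelihood P(y | u) at one datapoint.
    u : length-C vector of latent values u^(c)(x_i), one per class process
    y : integer class label in {0, ..., C-1}"""
    u = u - np.max(u)                          # subtract max for numerical stability
    p = np.exp(u) / np.sum(np.exp(u))
    return p[y]

def sample_class_processes(kernel_mats):
    """Draw the C class processes at n inputs, independently a priori,
    each with its own n x n kernel matrix K^(c)."""
    n = kernel_mats[0].shape[0]
    return np.stack([np.random.multivariate_normal(np.zeros(n), K)
                     for K in kernel_mats])

# Toy usage: C = 3 classes, n = 5 inputs, all kernels equal here for simplicity.
Ks = [np.eye(5) + 0.5 for _ in range(3)]
U = sample_class_processes(Ks)                 # shape (C, n)
print(softmax_likelihood(U[:, 0], y=2))        # likelihood of class 2 at input 0
```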

  6. “But That’s Easy…” • … we thought back then, but: the posterior covariance A combines the prior covariance (block-diagonal across classes) with the likelihood’s second-order contribution (block-diagonal across datapoints) • Both are block-diagonal, but in different systems! Together, A has no simple structure! (A tiny numerical illustration follows below.)
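The structural clash can be seen on a tiny example (my own construction, with made-up numbers): order the nC latents class-by-class and the prior covariance is block-diagonal; order them point-by-point and the likelihood's precision contribution is block-diagonal; permute both into one common ordering and their combination in the posterior covariance has no simple block structure.

```python
import numpy as np
from scipy.linalg import block_diag

n, C = 4, 3   # tiny example: 4 datapoints, 3 classes

# Prior covariance: block-diagonal when latents are ordered class-by-class,
# one n x n block K^(c) per class (processes independent a priori).
Ks = [np.eye(n) + 0.5 * np.ones((n, n)) for _ in range(C)]
K_classmajor = block_diag(*Ks)

# Likelihood (second-order) contribution: block-diagonal when ordered
# point-by-point, one C x C block per datapoint (the softmax couples classes).
Pis = [np.diag(np.full(C, 0.3)) - 0.1 * np.ones((C, C)) for _ in range(n)]
Pi_pointmajor = block_diag(*Pis)

# Permutation mapping class-major positions (c, i) to point-major indices (i, c).
perm = np.arange(n * C).reshape(n, C).T.ravel()
Pi_classmajor = Pi_pointmajor[np.ix_(perm, perm)]

# In the common (class-major) ordering, the Gaussian-approximate posterior
# covariance A = (K^{-1} + Pi)^{-1} mixes the two block systems:
# A itself has no simple block structure.
A = np.linalg.inv(np.linalg.inv(K_classmajor) + Pi_classmajor)
print(np.round(A, 2))
```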

  7. Second Order Approximation • The u^(c) should be coupled a posteriori → a diagonal approximation is not useful • The Hessian of the log multinomial likelihood has a simple form • Allow the likelihood coupling to be represented exactly up to second order: use the same constrained form, diagonal minus rank 1
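The “diagonal minus rank 1” form is a standard property of the softmax: the Hessian of the negative log multinomial likelihood at one datapoint is diag(π) − π πᵀ, independent of the observed label. A quick numerical check (my own, illustrative):

```python
import numpy as np

def softmax_hessian(u):
    """Hessian of -log P(y|u) for the softmax likelihood at one datapoint:
    diag(pi) - pi pi^T, i.e. diagonal minus rank 1 (independent of the label y)."""
    u = u - np.max(u)
    pi = np.exp(u) / np.sum(np.exp(u))
    return np.diag(pi) - np.outer(pi, pi)

def grad_neg_log_lik(u, y):
    """Gradient of -log P(y|u): pi - e_y."""
    u = u - np.max(u)
    pi = np.exp(u) / np.sum(np.exp(u))
    g = pi.copy()
    g[y] -= 1.0
    return g

# Finite-difference check of the 'diagonal minus rank 1' Hessian.
u0, y0, eps = np.array([0.3, -1.2, 0.7]), 1, 1e-6
num_hess = np.stack([(grad_neg_log_lik(u0 + eps * e, y0) - grad_neg_log_lik(u0, y0)) / eps
                     for e in np.eye(3)], axis=1)
assert np.allclose(num_hess, softmax_hessian(u0), atol=1e-5)
```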

  8. Subproblems • Efficient representation exploiting the prior independence and the constrained form • ADF projections onto a constrained Gaussian to compute the site precision blocks • Forward selection of the active set I • Extensions of the simple myopic scheme • Model selection based on conditional inference

  9. Representation • Exploits block-diagonal matrix structures • Nontrivial to get the numerics right (Cholesky factors) • Dominating cost: stub buffers, used to compute the marginal moments • Stubs are updated after each inclusion

  10. Restricted ADF Projection • Hard (non-convex) because constrained • Use a double-loop scheme: outer loop analytic, inner loop convex → very fast • Initialization matters; our choice can be motivated from the second order approximation (once more)

  11. Information Gain Criterion • The selection score measures the “informativeness” of each candidate i: how much the current belief would change after inclusion of candidate i • Favours points close to, or on the wrong side of, class boundaries • Requires the candidate’s marginal, computed from the stubs • Candidates are scored prior to each inclusion (a generic sketch follows below)
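A generic sketch of such a score (my own entropy-reduction surrogate, not necessarily the paper's exact criterion): given a candidate's current C x C marginal covariance (read off the stubs) and an approximate site precision block for it, measure how much the marginal's differential entropy would shrink if the candidate were included; ambiguous points near or beyond a class boundary produce large scores.

```python
import numpy as np

def entropy_score(A_marg, Pi_site):
    """Entropy-reduction surrogate for one candidate: decrease of the
    differential entropy of its C x C marginal (covariance A_marg) if an
    approximate site precision block Pi_site were included."""
    A_new = np.linalg.inv(np.linalg.inv(A_marg) + Pi_site)
    return 0.5 * (np.linalg.slogdet(A_marg)[1] - np.linalg.slogdet(A_new)[1])

def select_candidate(marginals, site_blocks):
    """Score all candidates and return the index of the winner."""
    scores = [entropy_score(A, P) for A, P in zip(marginals, site_blocks)]
    return int(np.argmax(scores))

# Toy usage: 2 candidates with C = 3; the more strongly constrained one wins.
A = np.eye(3) * 2.0
weak = 0.05 * np.eye(3)
strong = np.diag([1/3, 1/3, 1/3]) - np.full((3, 3), 1/9)   # softmax-like block
print(select_candidate([A, A], [weak, strong]))            # -> 1
```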

  12. Extensions of Myopic Scheme • The active set I is split into a solid set and a liquid set: a new point i is first included into the liquid set and later frozen into the solid set • Solid set: growing; fixed site parameters (for efficiency) • Liquid set: fixed size; site parameters iteratively updated using EP

  13. Overview of Inference Algorithm • Selection phase: compute marginals, score O(n/C) candidates, select the winner • Inclusion phase: include the pattern; move the oldest liquid point to the solid active set • EP phase: run EP updates iteratively on the liquid-set site parameters (an outline in code follows below)
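Putting the three phases together, here is a hedged outline in code; `score`, `include`, and `ep_update` are hypothetical callables standing in for the selection criterion, the ADF inclusion step, and the EP refit of the liquid-set site parameters.

```python
import random
from collections import deque

def sparse_multiclass_gp_loop(n, C, d_final, L, score, include, ep_update):
    """Hedged outline of the loop above (assumes d_final <= n). The callables
    score(i), include(i), ep_update(points) are hypothetical placeholders for
    the selection criterion, ADF inclusion, and EP refit of site parameters."""
    solid, liquid = [], deque()            # active set I = solid set + liquid set
    remaining = set(range(n))
    while len(solid) + len(liquid) < d_final:
        # Selection phase: compute marginals, score O(n / C) candidates, pick winner.
        pool = sorted(remaining)
        candidates = random.sample(pool, k=max(1, len(pool) // C))
        winner = max(candidates, key=score)

        # Inclusion phase: include the pattern into the liquid set; once the
        # liquid set exceeds size L, freeze its oldest member into the solid set.
        include(winner)
        remaining.discard(winner)
        liquid.append(winner)
        if len(liquid) > L:
            solid.append(liquid.popleft())

        # EP phase: run EP updates iteratively on the liquid-set site parameters.
        ep_update(list(liquid))
    return solid + list(liquid)
```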

  14. Model Selection • Use a variational bound on the marginal likelihood, based on the inference approximation • Gradient costs one run of inference plus additional terms • Minimize using quasi-Newton, re-selecting I and the site parameters for new search directions (a non-standard optimization problem)
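For the outer optimization, a plain quasi-Newton driver would look like the sketch below; `neg_bound_and_grad` is a hypothetical placeholder that reruns the sparse inference approximation (re-selecting I and the site parameters) and returns the negative bound and its gradient. The re-selection per search direction is what makes the real problem non-standard; this only shows the generic outer loop.

```python
from scipy.optimize import minimize

def fit_hyperparameters(neg_bound_and_grad, theta0):
    """Generic quasi-Newton outer loop for model selection: minimise the
    negative variational bound over kernel hyperparameters theta.
    neg_bound_and_grad(theta) is a hypothetical placeholder returning
    (bound value, gradient) after one run of the inference approximation."""
    result = minimize(neg_bound_and_grad, theta0, jac=True, method="L-BFGS-B")
    return result.x
```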

  15. Preliminary Experiments • Small part of MNIST (even digits, C = 5, n = 800) • No model selection (not yet tested); all K^(c) the same • d_final = 150, L = 25 (liquid set size)

  16. Preliminary Experiments (2)

  17. Preliminary Experiments (3)

  18. Preliminary Experiments (4)

  19. Future Experiments • Much larger experiments are in preparation, including model selection • Uses a novel, powerful object-oriented Matlab/C++ interface • Control over very large persistent C++ objects from Matlab • Faster transition: prototype (Matlab) → product (C++) • Powerful matrix classes (masking, LAPACK/BLAS) • Optimization code • Will be released into the public domain

  20. Future Work • Experiments on much larger tasks • Model selection with independent, heavily parameterized kernels (ARD, …) • The present scheme cannot be used for large C

  21. Future Work (2) • Gaussian process priors in large structured networks: Gaussian process conditional random fields, … • Previous work addresses function “point estimation”; we aim for GP inference including uncertainty estimates • Have to deal with a huge random field: correlations not only between datapoints, but also along time → automatic factorizations will be crucial • The multi-class scheme will be a major building block
