Sparse Gaussian Process Classification With Multiple Classes
Matthias W. Seeger, Michael I. Jordan
University of California, Berkeley
www.cs.berkeley.edu/~mseeger
Gaussian Processes are different • Kernel Machines: estimate a single "best" function to solve the problem • Bayesian Gaussian Processes: inference over random functions → mean predictions and uncertainty estimates • Gives a posterior distribution over functions • More expressive • Powerful empirical Bayesian model selection • Combination in larger probabilistic structures → Harder to run, but worth it!
The Need for Linear Time • "So Gaussian Processes aim for more than Kernel Machines --- do they run much slower, then?" Not necessarily (anymore)! • GP multi-way classification: • Linear in the number of datapoints • Linear in the number of classes • No artificial "output coding" • Predictive uncertainties • Empirical Bayesian model selection
Sparse GP Approximations • Lawrence, Seeger, Herbrich: IVM (NIPS 02) • Home in on an active set I ⊂ {1, …, n} of size d ≪ n • Replace the likelihood by a likelihood approximation: a Gaussian function of u_I only • Use information criteria to find I greedily • Restricted to models with one process only (like other sparse GP methods)
Multi-Class Models • Multinomial likelihood ("softmax") • Use one process u^(c)(·) for each class c • Processes independent a priori • Different kernels K^(c) for each class
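A minimal sketch of this model in code (kernel choices and variable names are assumptions for illustration only; the paper's own representation is more elaborate): independent GP priors, one per class, with a multinomial ("softmax") likelihood coupling the C latent values at each datapoint.

```python
import numpy as np

def rbf_kernel(X, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel matrix (one possible choice for K^(c))."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2)

rng = np.random.default_rng(0)
n, C = 50, 3
X = rng.normal(size=(n, 2))

# Independent GP prior per class: u^(c) ~ N(0, K^(c)), possibly different kernels.
kernels = [rbf_kernel(X, lengthscale=ls) for ls in (0.5, 1.0, 2.0)]
U = np.stack([rng.multivariate_normal(np.zeros(n), K + 1e-8 * np.eye(n))
              for K in kernels], axis=1)          # latent values, shape (n, C)

# Multinomial ("softmax") likelihood couples the C processes at each point i:
# P(y_i = c | u_i) = exp(u_i^(c)) / sum_c' exp(u_i^(c')).
P = np.exp(U - U.max(axis=1, keepdims=True))
P /= P.sum(axis=1, keepdims=True)
y = np.array([rng.choice(C, p=p) for p in P])     # sampled class labels
```

Sampling from this generative view makes the structure explicit: the classes are independent a priori, and coupling between them enters only through the softmax likelihood.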
"But That's Easy…" • … we thought back then, but: the posterior covariance is A = (K^{-1} + W)^{-1}, with prior covariance K and likelihood coupling W • K is block-diagonal between classes, W is block-diagonal between datapoints • Both are block-diagonal, but in different systems! Together: A has no simple structure!
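A tiny numerical illustration of the structure clash (a toy with made-up matrices, not the paper's representation): in a class-major ordering of the latent vector, the prior covariance is block-diagonal over classes, the likelihood coupling is a permuted block-diagonal over datapoints, and their combination is dense.

```python
import numpy as np
from scipy.linalg import block_diag

rng = np.random.default_rng(1)
n, C = 4, 3

def rand_spd(m):
    A = rng.normal(size=(m, m))
    return A @ A.T + m * np.eye(m)

# Order the latent vector class-major: u = (u^(1)_1..n, ..., u^(C)_1..n).
# Prior covariance K: block-diagonal over classes (C blocks of size n x n).
K = block_diag(*[rand_spd(n) for _ in range(C)])

# Likelihood coupling W: block-diagonal over datapoints (n blocks of size C x C),
# which in class-major ordering becomes a *permuted* block-diagonal matrix.
W = np.zeros((n * C, n * C))
for i in range(n):
    idx = np.arange(C) * n + i            # positions of u_i^(1..C)
    W[np.ix_(idx, idx)] = rand_spd(C)

# Posterior covariance A = (K^{-1} + W)^{-1}: neither block structure survives.
A = np.linalg.inv(np.linalg.inv(K) + W)
print("fraction of (near-)zero entries in A:", np.mean(np.abs(A) < 1e-10))
```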
Second Order Approximation • The u^(c) should be coupled a posteriori → a diagonal site approximation is not useful • The Hessian of the (negative) log multinomial likelihood has a simple form: diag(π_i) − π_i π_i^T • Allow the likelihood coupling to be represented exactly up to second order: site precision blocks constrained to be diagonal minus rank 1
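As a concrete check of the "diagonal minus rank 1" structure: the Hessian of the negative log softmax likelihood at one datapoint is diag(π) − π πᵀ, where π is the softmax of the latent values. A small finite-difference verification (illustrative only):

```python
import numpy as np

def softmax(u):
    e = np.exp(u - u.max())
    return e / e.sum()

def nll(u, y):
    """-log P(y | u) for the multinomial (softmax) likelihood at one datapoint."""
    return -u[y] + np.log(np.exp(u - u.max()).sum()) + u.max()

rng = np.random.default_rng(2)
C, y = 4, 1
u = rng.normal(size=C)
pi = softmax(u)

# Analytic Hessian of the NLL: diag(pi) - pi pi^T (independent of the label y)
H = np.diag(pi) - np.outer(pi, pi)

# Centered finite-difference check of all second derivatives
eps, E = 1e-4, np.eye(C)
H_fd = np.empty((C, C))
for a in range(C):
    for b in range(C):
        H_fd[a, b] = (nll(u + eps*E[a] + eps*E[b], y) - nll(u + eps*E[a] - eps*E[b], y)
                      - nll(u - eps*E[a] + eps*E[b], y) + nll(u - eps*E[a] - eps*E[b], y)
                      ) / (4 * eps**2)

assert np.allclose(H, H_fd, atol=1e-6)
```

Constraining the site precisions to this same family is what lets the coupling between the C processes at each point be captured exactly up to second order while keeping the representation tractable.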
Subproblems • Efficient representation exploiting the prior independence and the constrained site form • ADF projections onto the constrained Gaussian family to compute the site precision blocks • Forward selection of I • Extensions of the simple myopic scheme • Model selection based on conditional inference
Representation • Exploits the block-diagonal matrix structures • Nontrivial to get the numerics right (Cholesky factors) • Dominating cost: "stub" buffers, maintained to compute the marginal moments • Updated after each inclusion (stubs); total cost stays linear in n and in C
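As a generic illustration of one ingredient such representations rely on (not the paper's exact stub-buffer scheme; function and variable names are made up for the sketch): growing the Cholesky factor of the active-set kernel matrix incrementally, one included point at a time, instead of refactorising from scratch.

```python
import numpy as np
from scipy.linalg import solve_triangular

def chol_append(L, k_new, k_diag, jitter=1e-10):
    """Grow the lower Cholesky factor of the active-set kernel matrix by one point.

    Avoids refactorising: O(d^2) instead of O(d^3) per inclusion, which is the kind
    of incremental update that keeps sparse-GP representations cheap.
    """
    d = L.shape[0]
    if d == 0:
        return np.array([[np.sqrt(k_diag + jitter)]])
    l = solve_triangular(L, k_new, lower=True)
    lam = np.sqrt(max(k_diag + jitter - l @ l, jitter))
    top = np.hstack([L, np.zeros((d, 1))])
    bot = np.hstack([l[None, :], [[lam]]])
    return np.vstack([top, bot])

# One such factor would be kept per class, since the prior is block-diagonal
# between classes; here a single shared kernel keeps the toy short.
rng = np.random.default_rng(3)
X = rng.normal(size=(20, 2))
K = np.exp(-0.5 * ((X[:, None] - X[None, :]) ** 2).sum(-1))

L, active = np.zeros((0, 0)), []
for i in [3, 7, 11]:                       # pretend these were selected greedily
    L = chol_append(L, K[np.ix_(active, [i])].ravel(), K[i, i])
    active.append(i)
assert np.allclose(L @ L.T, K[np.ix_(active, active)], atol=1e-8)
```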
Restricted ADF Projection • Hard (non-convex) because of the constraint • Use a double-loop scheme: outer loop analytic, inner loop convex → very fast • Initialization matters. Our choice can be motivated from the second order approximation (once more)
Information Gain Criterion • Selection score measures the "informativeness" of a candidate i, given the current belief, after a (hypothetical) inclusion of i • Favours points close to, or on the wrong side of, the class boundaries • Requires the candidate's marginal, computed from the stubs • Score candidates prior to each inclusion
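To give the flavour of such a score (a toy, not the exact criterion derived in the paper): measure how far a hypothetical second-order inclusion would move a candidate's own marginal belief, and pick the candidate that moves it most. Confidently classified points barely change the belief and score low.

```python
import numpy as np

def gauss_kl(m_new, S_new, m_old, S_old):
    """KL( N(m_new, S_new) || N(m_old, S_old) ): how much an update moves a belief."""
    d = len(m_old)
    S_old_inv = np.linalg.inv(S_old)
    diff = m_old - m_new
    return 0.5 * (np.trace(S_old_inv @ S_new) + diff @ S_old_inv @ diff - d
                  + np.log(np.linalg.det(S_old) / np.linalg.det(S_new)))

rng = np.random.default_rng(4)
C, scores = 3, {}
for i in range(10):                               # candidate points
    m = rng.normal(size=C)                        # current marginal Q(u_i) = N(m, V)
    A = rng.normal(size=(C, C)); V = A @ A.T + np.eye(C)
    pi = np.exp(m) / np.exp(m).sum()
    Pi = np.diag(pi) - np.outer(pi, pi)           # "diagonal minus rank 1" site precision
    V_new = np.linalg.inv(np.linalg.inv(V) + Pi)  # hypothetical second-order inclusion
    scores[i] = gauss_kl(m, V_new, m, V)          # mean kept fixed in this toy

winner = max(scores, key=scores.get)              # most informative candidate
```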
Extensions of Myopic Scheme
[Diagram: active set I split into a solid set and a liquid set; new patterns are included into the liquid set and later frozen into the solid set]
• Solid set: growing; site parameters kept fixed (for efficiency)
• Liquid set: fixed size; site parameters iteratively updated using EP
Overview Inference Algorithm • Selection Phase: Compute marginals, score O(n/C) candidates. Select the winner • Inclusion Phase: Include the pattern. Move the oldest liquid point to the solid active set • EP Phase: Run EP updates iteratively on the liquid-set site parameters
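A skeleton of that loop, with the three routines passed in as callbacks since their actual implementations (stub updates, restricted ADF, EP refreshes) are the substance of the paper; the names and signatures below are placeholders:

```python
def sparse_gp_multiclass_loop(candidate_pool, d_final, liquid_size,
                              score_candidates, adf_include, ep_update, state):
    """Skeleton of the selection / inclusion / EP loop.

    The three callbacks stand in for the actual routines:
      score_candidates(cands, state) -> {index: information-gain score}
      adf_include(i, state)          -> state after ADF-projecting site i
      ep_update(i, state)            -> state after one EP refresh of site i
    """
    solid, liquid = [], []
    while len(solid) + len(liquid) < d_final:
        # Selection phase: score remaining candidates, pick the winner.
        cands = [i for i in candidate_pool if i not in solid and i not in liquid]
        scores = score_candidates(cands, state)
        winner = max(scores, key=scores.get)

        # Inclusion phase: include the winner; oldest liquid point freezes to solid.
        state = adf_include(winner, state)
        liquid.append(winner)
        if len(liquid) > liquid_size:
            solid.append(liquid.pop(0))

        # EP phase: iteratively refresh site parameters on the liquid set only.
        for i in list(liquid):
            state = ep_update(i, state)
    return solid, liquid, state
```

The point of the solid/liquid split shows up here: only the small liquid set ever has its site parameters revisited, so the per-iteration cost stays bounded.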
Model Selection • Use a variational bound on the marginal likelihood, based on the inference approximation • Computing the gradient costs roughly one run of inference plus some extra overhead • Minimize using quasi-Newton, reselecting I and the site parameters for new search directions (a non-standard optimization problem)
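The outer optimisation might look like the sketch below: quasi-Newton (here scipy's L-BFGS-B) over log kernel hyperparameters, where each objective evaluation would rerun conditional inference to reselect I and refit the site parameters. That inner routine is stubbed out with a made-up `toy_refit` so the snippet runs; it is not the paper's bound.

```python
import numpy as np
from scipy.optimize import minimize

def neg_bound_and_grad(log_hyp, refit_inference):
    """Negative bound and gradient w.r.t. log hyperparameters.

    refit_inference(hypers) is assumed to rerun conditional inference
    (reselect I, refit site parameters) and return (bound, d bound / d hypers).
    """
    value, grad = refit_inference(np.exp(log_hyp))
    return -value, -np.asarray(grad) * np.exp(log_hyp)   # chain rule for log params

def toy_refit(hyp):
    """Made-up stand-in for the real inference routine, just to make this runnable."""
    value = -np.sum((np.log(hyp) - 0.3) ** 2)             # pretend bound
    grad = -2.0 * (np.log(hyp) - 0.3) / hyp                # its gradient w.r.t. hyp
    return value, grad

res = minimize(neg_bound_and_grad, x0=np.zeros(3), args=(toy_refit,),
               jac=True, method="L-BFGS-B")
best_hypers = np.exp(res.x)
```

Because I and the site parameters are reselected as the hyperparameters move, the objective is not smooth across reselections, which is what makes this a non-standard optimisation problem.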
Preliminary Experiments • Small part of MNIST (even digits, C=5, n=800) • No model selection (MS not yet tested); all K^(c) the same • d_final = 150, L = 25 (liquid set)
Future Experiments • Much larger experiments are in preparation, including model selection • Uses a novel, powerful object-oriented Matlab/C++ interface • Control over very large persistent C++ objects from Matlab • Faster transition: prototype (Matlab) → product (C++) • Powerful matrix classes (masking, LAPACK/BLAS) • Optimization code • Will be released into the public domain
Future Work • Experiments on much larger tasks • Model selection with independent, heavily parameterized kernels (ARD,…) • Present scheme cannot be used for large C
Future Work (2) • Gaussian process priors in large structured networks: Gaussian process conditional random fields, … • Previous work addresses function "point estimation". We aim for GP inference including uncertainty estimates • Have to deal with a huge random field: correlations not only between datapoints, but also along time → automatic factorizations will be crucial • The multi-class scheme will be a major building block