Learning Kernel Classifiers: Chap. 3.3 Relevance Vector Machine, Chap. 3.4 Bayes Point Machines. Summarized by Sang Kyun Lee, 13th May 2005
3.3 Relevance Vector Machine • [M. Tipping, JMLR 2001] • A modification of the Gaussian process (GP) model • GP: prior, likelihood, posterior • RVM: prior (an individual variance per weight), likelihood same as GP, posterior
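As a hedged sketch in standard notation (the symbols X, t, Θ, σ_t are assumed here, not taken from the slides), the two models differ only in the prior covariance over the weights:

```latex
% Hedged reconstruction in standard notation (X: design matrix, t: targets,
% Theta = diag(theta_1,...,theta_n), sigma_t^2: noise variance).
\begin{align*}
\text{GP prior:}  \quad & \mathbf{w} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}_n) \\
\text{RVM prior:} \quad & \mathbf{w} \sim \mathcal{N}(\mathbf{0}, \boldsymbol{\Theta}),
  \qquad \boldsymbol{\Theta} = \operatorname{diag}(\theta_1, \dots, \theta_n) \\
\text{Likelihood (both):} \quad & \mathbf{t} \mid \mathbf{X}, \mathbf{w}
  \sim \mathcal{N}(\mathbf{X}\mathbf{w},\; \sigma_t^2 \mathbf{I}_m) \\
\text{Posterior (both Gaussian):} \quad & \boldsymbol{\Sigma}
  = \left(\sigma_t^{-2} \mathbf{X}^\top \mathbf{X} + \boldsymbol{\Theta}^{-1}\right)^{-1},
  \qquad \boldsymbol{\mu} = \sigma_t^{-2}\, \boldsymbol{\Sigma}\, \mathbf{X}^\top \mathbf{t}
\end{align*}
```

For the GP, Θ = I_n; the RVM replaces it by a diagonal matrix with one variance θ_i per weight.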
Reasons • To get a sparse representation of the weight vector w • An argument about the expected risk of the classifier suggests that fewer non-zero coefficients generalize better • Thus, we favor weight vectors with a small number of non-zero coefficients • One way to achieve this is to modify the prior: consider an individual variance θ_i for each weight w_i • Then w_i = 0 is only possible if θ_i → 0 • Computation of the posterior is easier than before
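A brief sketch of why this prior yields sparsity (the factorized form below is the standard RVM prior; the δ-notation is ours):

```latex
% Sparsity mechanism: one variance theta_i per weight; theta_i -> 0 pins w_i at zero.
\begin{align*}
p(\mathbf{w} \mid \boldsymbol{\theta}) \;=\; \prod_{i=1}^{n} \mathcal{N}\!\left(w_i;\, 0,\, \theta_i\right),
\qquad
\theta_i \to 0 \;\Longrightarrow\; p(w_i \mid \theta_i) \to \delta(w_i).
\end{align*}
```

So any weight whose variance is driven to zero during learning is effectively removed from the model.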
Prediction function • GP: a kernel expansion over all m training inputs • RVM: an expansion in which only the training inputs with non-zero coefficients (the relevance vectors) contribute
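As a hedged reconstruction in standard notation (not copied from the slides), the two prediction functions read:

```latex
% Hedged reconstruction; G is the Gram matrix, G_ij = k(x_i, x_j).
\begin{align*}
\text{GP:}  \quad & \hat{f}(x) = \sum_{i=1}^{m} \alpha_i \, k(x_i, x),
  \qquad \boldsymbol{\alpha} = \left(\mathbf{G} + \sigma_t^2 \mathbf{I}_m\right)^{-1} \mathbf{t} \\
\text{RVM:} \quad & \hat{f}(x) = \sum_{i:\, \hat{\theta}_i > 0} \hat{\alpha}_i \, k(x_i, x)
  \qquad \text{(only the relevance vectors contribute)}
\end{align*}
```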
How can we learn the sparse weight vector? • To find the best hyperparameters θ (and noise variance σ_t²), employ evidence maximization • The evidence is given explicitly by the marginal likelihood P(t | X, θ, σ_t²) = N(t; 0, σ_t² I_m + X Θ Xᵀ) • Maximizing it yields the update rules derived in App'x B.8
Evidence Maximization • Interestingly, many of the θ_i decrease quickly toward zero, which leads to high sparsity in w • For faster convergence, delete the ith column of the design matrix whenever θ_i falls below a pre-defined threshold • After termination, set w_i = 0 for every i with θ_i below the threshold; the remaining w_i are set equal to the corresponding components of the posterior mean (a minimal sketch of this loop follows)
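A minimal Python sketch of the evidence-maximization loop with pruning, assuming the standard update rules from Tipping (2001); the variable names and the threshold value are illustrative, not the book's App'x B.8 code:

```python
import numpy as np

def rvm_regression(X, t, n_iter=100, prune_thresh=1e-6):
    """Minimal sketch of RVM evidence maximization with pruning.

    X: (m, n) design matrix, t: (m,) real-valued targets.
    Assumes Tipping's (2001) update rules, written for theta_i = 1/alpha_i.
    """
    m, n = X.shape
    active = np.arange(n)                 # columns not yet pruned
    theta = np.ones(n)                    # per-weight prior variances theta_i
    sigma2 = max(np.var(t) * 0.1, 1e-6)   # noise variance sigma_t^2 (heuristic init)

    for _ in range(n_iter):
        Xa, th = X[:, active], theta[active]
        # Posterior over the active weights: Sigma = (X'X / sigma2 + Theta^-1)^-1
        Sigma = np.linalg.inv(Xa.T @ Xa / sigma2 + np.diag(1.0 / th))
        mu = Sigma @ Xa.T @ t / sigma2
        # gamma_i = 1 - Sigma_ii / theta_i measures how well w_i is determined
        gamma = 1.0 - np.diag(Sigma) / th
        # Evidence-maximization updates for the hyperparameters
        theta[active] = mu ** 2 / np.maximum(gamma, 1e-12)
        sigma2 = np.sum((t - Xa @ mu) ** 2) / max(m - gamma.sum(), 1e-12)
        # Prune weights whose prior variance has collapsed toward zero
        active = active[theta[active] > prune_thresh]

    # Surviving weights are set to the posterior mean; pruned ones stay at zero
    Xa, th = X[:, active], theta[active]
    Sigma = np.linalg.inv(Xa.T @ Xa / sigma2 + np.diag(1.0 / th))
    w = np.zeros(n)
    w[active] = Sigma @ Xa.T @ t / sigma2
    return w, active, sigma2
```

On a kernelized problem the design matrix is the Gram matrix, so each surviving column corresponds to a relevance vector.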
Application to Classification • Consider latent target variables t • Training objects: x_1, …, x_m • Test object: x_{m+1} • Compute the predictive distribution of the latent variable t_{m+1} at the new object x_{m+1} • by applying a latent weight vector to all the m+1 objects • and marginalizing over all weight vectors, we get the desired predictive distribution
Note • As in the case of GPs, we cannot solve this analytically because the posterior is no longer Gaussian • Laplace approximation: approximate this density by a Gaussian density whose mean is the posterior mode and whose covariance is the inverse Hessian at that mode (sketched below)
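A minimal sketch of such a Laplace approximation: a Newton-Raphson search for the posterior mode under a Bernoulli-logit likelihood and the N(0, diag(θ)) prior. The function and variable names are assumptions of this sketch, not the book's notation.

```python
import numpy as np

def laplace_mode(X, y, theta, n_iter=25):
    """Sketch of the Laplace approximation for RVM classification.

    Finds the mode w_hat of the (non-Gaussian) posterior by Newton-Raphson
    and returns the Gaussian approximation N(w_hat, Sigma_hat).
    Labels y are assumed in {0, 1}.
    """
    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    m, n = X.shape
    prior_prec = np.diag(1.0 / theta)                  # Theta^{-1}
    w = np.zeros(n)
    for _ in range(n_iter):
        p = sigmoid(X @ w)
        grad = X.T @ (y - p) - prior_prec @ w          # gradient of log posterior
        B = np.diag(p * (1.0 - p))                     # Bernoulli variances
        hess = -(X.T @ B @ X) - prior_prec             # Hessian of log posterior
        w = w - np.linalg.solve(hess, grad)            # Newton step toward the mode

    p = sigmoid(X @ w)
    B = np.diag(p * (1.0 - p))
    Sigma_hat = np.linalg.inv(X.T @ B @ X + prior_prec)  # covariance of the Gaussian approx.
    return w, Sigma_hat
```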
Kernel trick • Think about the RKHS generated by a kernel k • Then each training object is represented by its vector of kernel evaluations against the training sample, i.e. the design matrix becomes the Gram matrix • Now, think about regression: the weight vector becomes the vector of expansion coefficients of the desired hyperplane, so that the prediction is a kernel expansion over the training objects • In this sense, all the training objects which have a non-zero expansion coefficient are termed relevance vectors
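A small illustrative snippet of this view; the RBF kernel choice and all names are assumptions of this sketch. The point is that prediction touches only the relevance vectors.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Gaussian RBF kernel matrix: k(a, b) = exp(-gamma * ||a - b||^2)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def rvm_kernel_predict(X_train, X_new, alpha, rv_idx, gamma=1.0):
    """Prediction as a kernel expansion over the relevance vectors only.

    alpha holds the expansion coefficients; only alpha[rv_idx] are non-zero,
    so only those training objects enter the sum.
    """
    K = rbf_kernel(X_new, X_train[rv_idx], gamma)   # k(x_new, relevance vectors)
    return K @ alpha[rv_idx]
```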
3.4 Bayes Point Machines • [R. Herbrich, JMLR 2000] • In GPs and RVMs, we tried to solve the classification problem via regression estimation • Before, we assumed a prior distribution over weights and used a logit transformation to model the likelihood • Now we try to model the classification likelihood directly
Prior • For classification, only the spatial direction of the weight vector w matters; note that sign(⟨x, w⟩) = sign(⟨x, w/‖w‖⟩) • Thus we consider only the vectors on the unit sphere • Then assume a uniform prior over this hypothesis space
Likelihood • Use the PAC likelihood (0-1 loss): the likelihood is 1 if the classifier labels a training example correctly and 0 otherwise • Posterior • Remark: using the PAC likelihood, the posterior is the uniform prior restricted to version space, the set of weight vectors consistent with all m training examples
Predictive distribution • In the two-class case, the Bayes classification can be written as Bayes_z(x) = sign(E_{W|Z=z}[sign(⟨x, W⟩)]) • That is, the Bayes classification strategy performs majority voting involving all version-space classifiers • However, the expectation is hard to compute • Hence we approximate it by a single classifier (a voting sketch follows)
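A naive illustration of the voting view. A simple rejection sampler stands in here for any particular algorithm from the paper; labels are taken in {-1, +1} and all names are illustrative.

```python
import numpy as np

def bayes_vote(X, y, x_new, n_keep=2000, rng=None):
    """Sketch of the Bayes classification strategy as majority voting.

    Draws weight vectors uniformly from the unit sphere, keeps those lying
    in version space (PAC likelihood 1, i.e. zero training error), and lets
    them vote on the new object x_new.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = X.shape[1]
    votes, kept, tried = 0.0, 0, 0
    while kept < n_keep and tried < 10_000_000:
        tried += 1
        w = rng.standard_normal(n)
        w /= np.linalg.norm(w)                 # uniform direction on the unit sphere
        if np.all(y * (X @ w) > 0):            # w is in version space
            votes += np.sign(x_new @ w)
            kept += 1
    return np.sign(votes)
```

Rejection sampling becomes hopeless as version space shrinks, which is exactly why a single-classifier approximation is wanted.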
That is, the Bayes point is the optimal projection of the Bayes classification strategy onto a single classifier w.r.t. generalization error • However, this too is intractable because we would need to know the input distribution and the posterior • Another reasonable approximation: use the centre of mass of version space, i.e. the posterior expectation of the weight vector
Now the Bayes classification of a new object reduces to classification with this single weight vector • Estimate it by MCMC sampling (the 'kernel billiard' algorithm); a naive sampling sketch follows
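A hedged sketch of the centre-of-mass estimate, again with a naive rejection sampler standing in for the book's kernel billiard algorithm; all names are illustrative.

```python
import numpy as np

def bayes_point(X, y, n_keep=2000, rng=None):
    """Sketch: estimate the Bayes point as the centre of mass of version space.

    Labels y are assumed in {-1, +1}. The paper's kernel billiard MCMC
    sampler is replaced here by simple rejection sampling for illustration.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = X.shape[1]
    acc, kept, tried = np.zeros(n), 0, 0
    while kept < n_keep and tried < 10_000_000:
        tried += 1
        w = rng.standard_normal(n)
        w /= np.linalg.norm(w)
        if np.all(y * (X @ w) > 0):            # keep only version-space classifiers
            acc += w
            kept += 1
    w_bp = acc / np.linalg.norm(acc)           # centre of mass, renormalized to the sphere
    return w_bp
```

A new object x is then classified simply as sign(x @ w_bp).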