Learning Kernel Classifiers: Chap. 3.3 Relevance Vector Machine, Chap. 3.4 Bayes Point Machines. Summarized by Sang Kyun Lee, 13th May 2005
3.3 Relevance Vector Machine • [M. Tipping, JMLR 2001] • A modification of the Gaussian process (GP) model • GP: prior, likelihood, posterior • RVM: prior (an individual variance per weight), likelihood same as GP, posterior
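As a hedged sketch in standard notation (the symbols X, t, Θ, σ_t are assumed here, not taken from the slides), the two models differ only in the prior covariance over the weights:

```latex
% Hedged reconstruction in standard notation (X: design matrix, t: targets,
% Theta = diag(theta_1,...,theta_n), sigma_t^2: noise variance).
\begin{align*}
\text{GP prior:}  \quad & \mathbf{w} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}_n) \\
\text{RVM prior:} \quad & \mathbf{w} \sim \mathcal{N}(\mathbf{0}, \boldsymbol{\Theta}),
  \qquad \boldsymbol{\Theta} = \operatorname{diag}(\theta_1, \dots, \theta_n) \\
\text{Likelihood (both):} \quad & \mathbf{t} \mid \mathbf{X}, \mathbf{w}
  \sim \mathcal{N}(\mathbf{X}\mathbf{w},\; \sigma_t^2 \mathbf{I}_m) \\
\text{Posterior (both Gaussian):} \quad & \boldsymbol{\Sigma}
  = \left(\sigma_t^{-2} \mathbf{X}^\top \mathbf{X} + \boldsymbol{\Theta}^{-1}\right)^{-1},
  \qquad \boldsymbol{\mu} = \sigma_t^{-2}\, \boldsymbol{\Sigma}\, \mathbf{X}^\top \mathbf{t}
\end{align*}
```

For the GP, Θ = I_n; the RVM replaces it by a diagonal matrix with one variance θ_i per weight.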
Reasons • To get a sparse representation of the weight vector w • An argument about the expected risk of the classifier suggests that fewer non-zero coefficients generalize better • Thus, we favor weight vectors with a small number of non-zero coefficients • One way to achieve this is to modify the prior: consider an individual variance θ_i for each weight w_i • Then w_i = 0 is only possible if θ_i → 0 • Computation of the posterior is easier than before
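A brief sketch of why this prior yields sparsity (the factorized form below is the standard RVM prior; the δ-notation is ours):

```latex
% Sparsity mechanism: one variance theta_i per weight; theta_i -> 0 pins w_i at zero.
\begin{align*}
p(\mathbf{w} \mid \boldsymbol{\theta}) \;=\; \prod_{i=1}^{n} \mathcal{N}\!\left(w_i;\, 0,\, \theta_i\right),
\qquad
\theta_i \to 0 \;\Longrightarrow\; p(w_i \mid \theta_i) \to \delta(w_i).
\end{align*}
```

So any weight whose variance is driven to zero during learning is effectively removed from the model.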
Prediction function • GP: a kernel expansion over all m training inputs • RVM: an expansion in which only the training inputs with non-zero coefficients (the relevance vectors) contribute
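As a hedged reconstruction in standard notation (not copied from the slides), the two prediction functions read:

```latex
% Hedged reconstruction; G is the Gram matrix, G_ij = k(x_i, x_j).
\begin{align*}
\text{GP:}  \quad & \hat{f}(x) = \sum_{i=1}^{m} \alpha_i \, k(x_i, x),
  \qquad \boldsymbol{\alpha} = \left(\mathbf{G} + \sigma_t^2 \mathbf{I}_m\right)^{-1} \mathbf{t} \\
\text{RVM:} \quad & \hat{f}(x) = \sum_{i:\, \hat{\theta}_i > 0} \hat{\alpha}_i \, k(x_i, x)
  \qquad \text{(only the relevance vectors contribute)}
\end{align*}
```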
How can we learn the sparse weight vector? • To find the best hyperparameters θ (and noise variance σ_t²), employ evidence maximization • The evidence is given explicitly by the marginal likelihood P(t | X, θ, σ_t²) = N(t; 0, σ_t² I_m + X Θ Xᵀ) • Maximizing it yields the update rules derived in App'x B.8
Evidence Maximization • Interestingly, many of the θ_i decrease quickly toward zero, which leads to high sparsity in w • For faster convergence, delete the ith column of the design matrix whenever θ_i falls below a pre-defined threshold • After termination, set w_i = 0 for every i with θ_i below the threshold; the remaining w_i are set equal to the corresponding components of the posterior mean (a minimal sketch of this loop follows)
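A minimal Python sketch of the evidence-maximization loop with pruning, assuming the standard update rules from Tipping (2001); the variable names and the threshold value are illustrative, not the book's App'x B.8 code:

```python
import numpy as np

def rvm_regression(X, t, n_iter=100, prune_thresh=1e-6):
    """Minimal sketch of RVM evidence maximization with pruning.

    X: (m, n) design matrix, t: (m,) real-valued targets.
    Assumes Tipping's (2001) update rules, written for theta_i = 1/alpha_i.
    """
    m, n = X.shape
    active = np.arange(n)                 # columns not yet pruned
    theta = np.ones(n)                    # per-weight prior variances theta_i
    sigma2 = max(np.var(t) * 0.1, 1e-6)   # noise variance sigma_t^2 (heuristic init)

    for _ in range(n_iter):
        Xa, th = X[:, active], theta[active]
        # Posterior over the active weights: Sigma = (X'X / sigma2 + Theta^-1)^-1
        Sigma = np.linalg.inv(Xa.T @ Xa / sigma2 + np.diag(1.0 / th))
        mu = Sigma @ Xa.T @ t / sigma2
        # gamma_i = 1 - Sigma_ii / theta_i measures how well w_i is determined
        gamma = 1.0 - np.diag(Sigma) / th
        # Evidence-maximization updates for the hyperparameters
        theta[active] = mu ** 2 / np.maximum(gamma, 1e-12)
        sigma2 = np.sum((t - Xa @ mu) ** 2) / max(m - gamma.sum(), 1e-12)
        # Prune weights whose prior variance has collapsed toward zero
        active = active[theta[active] > prune_thresh]

    # Surviving weights are set to the posterior mean; pruned ones stay at zero
    Xa, th = X[:, active], theta[active]
    Sigma = np.linalg.inv(Xa.T @ Xa / sigma2 + np.diag(1.0 / th))
    w = np.zeros(n)
    w[active] = Sigma @ Xa.T @ t / sigma2
    return w, active, sigma2
```

On a kernelized problem the design matrix is the Gram matrix, so each surviving column corresponds to a relevance vector.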
Application to Classification • Consider latent target variables t • Training objects: x_1, …, x_m • Test object: x_{m+1} • Compute the predictive distribution of the latent variable t_{m+1} at the new object x_{m+1} • by applying a latent weight vector to all the m+1 objects • and marginalizing over all weight vectors, we get the desired predictive distribution
Note • As in the case of GPs, we cannot solve this analytically because the posterior is no longer Gaussian • Laplace approximation: approximate this density by a Gaussian density whose mean is the posterior mode and whose covariance is the inverse Hessian at that mode (sketched below)
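A minimal sketch of such a Laplace approximation: a Newton-Raphson search for the posterior mode under a Bernoulli-logit likelihood and the N(0, diag(θ)) prior. The function and variable names are assumptions of this sketch, not the book's notation.

```python
import numpy as np

def laplace_mode(X, y, theta, n_iter=25):
    """Sketch of the Laplace approximation for RVM classification.

    Finds the mode w_hat of the (non-Gaussian) posterior by Newton-Raphson
    and returns the Gaussian approximation N(w_hat, Sigma_hat).
    Labels y are assumed in {0, 1}.
    """
    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    m, n = X.shape
    prior_prec = np.diag(1.0 / theta)                  # Theta^{-1}
    w = np.zeros(n)
    for _ in range(n_iter):
        p = sigmoid(X @ w)
        grad = X.T @ (y - p) - prior_prec @ w          # gradient of log posterior
        B = np.diag(p * (1.0 - p))                     # Bernoulli variances
        hess = -(X.T @ B @ X) - prior_prec             # Hessian of log posterior
        w = w - np.linalg.solve(hess, grad)            # Newton step toward the mode

    p = sigmoid(X @ w)
    B = np.diag(p * (1.0 - p))
    Sigma_hat = np.linalg.inv(X.T @ B @ X + prior_prec)  # covariance of the Gaussian approx.
    return w, Sigma_hat
```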
Kernel trick • Think about the RKHS generated by a kernel k • Then each training object is represented by its vector of kernel evaluations against the training sample, i.e. the design matrix becomes the Gram matrix • Now, think about regression: the weight vector becomes the vector of expansion coefficients of the desired hyperplane, so that the prediction is a kernel expansion over the training objects • In this sense, all the training objects which have a non-zero expansion coefficient are termed relevance vectors
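A small illustrative snippet of this view; the RBF kernel choice and all names are assumptions of this sketch. The point is that prediction touches only the relevance vectors.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Gaussian RBF kernel matrix: k(a, b) = exp(-gamma * ||a - b||^2)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def rvm_kernel_predict(X_train, X_new, alpha, rv_idx, gamma=1.0):
    """Prediction as a kernel expansion over the relevance vectors only.

    alpha holds the expansion coefficients; only alpha[rv_idx] are non-zero,
    so only those training objects enter the sum.
    """
    K = rbf_kernel(X_new, X_train[rv_idx], gamma)   # k(x_new, relevance vectors)
    return K @ alpha[rv_idx]
```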
3.4 Bayes Point Machines • [R. Herbrich, JMLR 2000] • In GPs and RVMs, we tried to solve the classification problem via regression estimation • Before, we assumed a prior distribution over weights and used a logit transformation to model the likelihood • Now we try to model the classification likelihood directly
Prior • For classification, only the spatial direction of the weight vector w matters; note that sign(⟨x, w⟩) = sign(⟨x, w/‖w‖⟩) • Thus we consider only the vectors on the unit sphere • Then assume a uniform prior over this hypothesis space
Likelihood • Use the PAC likelihood (0-1 loss): the likelihood is 1 if the classifier labels a training example correctly and 0 otherwise • Posterior • Remark: using the PAC likelihood, the posterior is the uniform prior restricted to version space, the set of weight vectors consistent with all m training examples
Predictive distribution • In the two-class case, the Bayes classification can be written as Bayes_z(x) = sign(E_{W|Z=z}[sign(⟨x, W⟩)]) • That is, the Bayes classification strategy performs majority voting involving all version-space classifiers • However, the expectation is hard to compute • Hence we approximate it by a single classifier (a voting sketch follows)
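A naive illustration of the voting view. A simple rejection sampler stands in here for any particular algorithm from the paper; labels are taken in {-1, +1} and all names are illustrative.

```python
import numpy as np

def bayes_vote(X, y, x_new, n_keep=2000, rng=None):
    """Sketch of the Bayes classification strategy as majority voting.

    Draws weight vectors uniformly from the unit sphere, keeps those lying
    in version space (PAC likelihood 1, i.e. zero training error), and lets
    them vote on the new object x_new.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = X.shape[1]
    votes, kept, tried = 0.0, 0, 0
    while kept < n_keep and tried < 10_000_000:
        tried += 1
        w = rng.standard_normal(n)
        w /= np.linalg.norm(w)                 # uniform direction on the unit sphere
        if np.all(y * (X @ w) > 0):            # w is in version space
            votes += np.sign(x_new @ w)
            kept += 1
    return np.sign(votes)
```

Rejection sampling becomes hopeless as version space shrinks, which is exactly why a single-classifier approximation is wanted.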
That is, the Bayes point is the optimal projection of the Bayes classification strategy onto a single classifier w.r.t. generalization error • However, this too is intractable because we would need to know the input distribution and the posterior • Another reasonable approximation: use the centre of mass of version space, i.e. the posterior expectation of the weight vector
Now the Bayes classification of a new object reduces to classification with this single weight vector • Estimate it by MCMC sampling (the 'kernel billiard' algorithm); a naive sampling sketch follows
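A hedged sketch of the centre-of-mass estimate, again with a naive rejection sampler standing in for the book's kernel billiard algorithm; all names are illustrative.

```python
import numpy as np

def bayes_point(X, y, n_keep=2000, rng=None):
    """Sketch: estimate the Bayes point as the centre of mass of version space.

    Labels y are assumed in {-1, +1}. The paper's kernel billiard MCMC
    sampler is replaced here by simple rejection sampling for illustration.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = X.shape[1]
    acc, kept, tried = np.zeros(n), 0, 0
    while kept < n_keep and tried < 10_000_000:
        tried += 1
        w = rng.standard_normal(n)
        w /= np.linalg.norm(w)
        if np.all(y * (X @ w) > 0):            # keep only version-space classifiers
            acc += w
            kept += 1
    w_bp = acc / np.linalg.norm(acc)           # centre of mass, renormalized to the sphere
    return w_bp
```

A new object x is then classified simply as sign(x @ w_bp).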