Ch 7. Sparse Kernel Machines. Pattern Recognition and Machine Learning, C. M. Bishop, 2006. Summarized by S. Kim, Biointelligence Laboratory, Seoul National University, http://bi.snu.ac.kr/
Contents • Maximum Margin Classifiers • Overlapping Class Distributions • Relation to Logistic Regression • Multiclass SVMs • SVMs for Regression • Relevance Vector Machines • RVM for Regression • Analysis of Sparsity • RVMs for Classification
Maximum Margin Classifiers • Problem setting • Two-class classification using linear models • Assume that the training data set is linearly separable • Support vector machine approach • The decision boundary is chosen to be the one for which the margin is maximized; the closest training points, which determine the margin, are the support vectors
Maximum Margin Solution • For all correctly classified data points, t_n y(x_n) > 0 • The distance of a point x_n to the decision surface is t_n y(x_n) / ||w|| • The maximum margin solution maximizes the distance of the closest point to the decision surface (equations sketched below)
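The formulas on this slide did not survive extraction; the following is a reconstruction from the standard maximum-margin derivation in PRML Section 7.1, assumed to match what the slide showed:

```latex
% linear model:
y(\mathbf{x}) = \mathbf{w}^{\mathsf T}\phi(\mathbf{x}) + b
% distance of x_n to the decision surface (separable data, t_n y(x_n) > 0):
\frac{t_n\,y(\mathbf{x}_n)}{\lVert\mathbf{w}\rVert}
% maximum margin solution:
\arg\max_{\mathbf{w},b}\left\{\frac{1}{\lVert\mathbf{w}\rVert}
  \min_n\left[t_n\big(\mathbf{w}^{\mathsf T}\phi(\mathbf{x}_n)+b\big)\right]\right\}
% canonical rescaling turns this into the quadratic program:
\min_{\mathbf{w},b}\ \frac{1}{2}\lVert\mathbf{w}\rVert^2
\quad\text{s.t.}\quad t_n\big(\mathbf{w}^{\mathsf T}\phi(\mathbf{x}_n)+b\big)\ge 1
```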
Dual Representation • Introduce Lagrange multipliers a_n ≥ 0, one per constraint • At a minimum, the derivatives of L with respect to w and b equal zero • Substituting back gives the dual representation (sketched below); see Appendix E for more details on Lagrange multipliers
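The Lagrangian and dual that this slide displayed, reconstructed from the standard derivation (assumed to match the slide):

```latex
% Lagrangian:
L(\mathbf{w},b,\mathbf{a}) = \frac{1}{2}\lVert\mathbf{w}\rVert^2
 - \sum_{n=1}^{N} a_n\left\{t_n\big(\mathbf{w}^{\mathsf T}\phi(\mathbf{x}_n)+b\big)-1\right\},
 \qquad a_n \ge 0
% setting the derivatives w.r.t. w and b to zero:
\mathbf{w} = \sum_{n=1}^{N} a_n t_n \phi(\mathbf{x}_n), \qquad \sum_{n=1}^{N} a_n t_n = 0
% substituting back gives the dual: maximize
\widetilde{L}(\mathbf{a}) = \sum_{n=1}^{N} a_n
 - \frac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N} a_n a_m t_n t_m k(\mathbf{x}_n,\mathbf{x}_m)
\quad\text{s.t.}\quad a_n \ge 0,\ \ \sum_n a_n t_n = 0
```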
Classifying New Data • The dual optimization is subject to the Karush-Kuhn-Tucker (KKT) conditions (Appendix E) • It is solved as a quadratic programming problem, with cost O(N³) in general • Points with a_n > 0 are the support vectors; all other points drop out of the predictor (equations sketched below)
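A reconstruction of the prediction rule and KKT conditions the slide referred to (standard form, assumed):

```latex
% prediction for a new input uses only the support vector set S:
y(\mathbf{x}) = \sum_{n\in S} a_n t_n k(\mathbf{x},\mathbf{x}_n) + b
% KKT conditions:
a_n \ge 0,\qquad t_n y(\mathbf{x}_n) - 1 \ge 0,\qquad a_n\{t_n y(\mathbf{x}_n)-1\} = 0
% numerically stable estimate of b, averaged over support vectors:
b = \frac{1}{N_S}\sum_{n\in S}\Big(t_n - \sum_{m\in S} a_m t_m k(\mathbf{x}_n,\mathbf{x}_m)\Big)
```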
Example of Separable Data Classification (Figure 7.2)
Overlapping Class Distributions • Allow some examples to be misclassified: a soft margin • Introduce slack variables ξ_n ≥ 0, one per data point (definition sketched below)
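The slack variable definition the slide presumably displayed, reconstructed from the standard treatment:

```latex
\xi_n = 0 \ \text{for points on or inside the correct margin boundary},\qquad
\xi_n = \lvert t_n - y(\mathbf{x}_n)\rvert \ \text{otherwise}
% the margin constraints become
t_n y(\mathbf{x}_n) \ge 1 - \xi_n,\qquad \xi_n \ge 0
```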
Soft Margin Solution • Minimize the regularized objective sketched below, where C > 0 controls the trade-off between minimizing training errors and controlling model complexity • The KKT conditions force a_n = 0 for points that neither touch nor violate the margin; points with a_n > 0 are the support vectors
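A reconstruction of the soft-margin objective and KKT conditions (standard form, assumed to match the slide):

```latex
% soft-margin objective:
\min_{\mathbf{w},b,\boldsymbol{\xi}}\ C\sum_{n=1}^{N}\xi_n + \frac{1}{2}\lVert\mathbf{w}\rVert^2
% Lagrangian with multipliers a_n >= 0 and mu_n >= 0:
L = \frac{1}{2}\lVert\mathbf{w}\rVert^2 + C\sum_n\xi_n
 - \sum_n a_n\{t_n y(\mathbf{x}_n) - 1 + \xi_n\} - \sum_n \mu_n\xi_n
% KKT conditions:
a_n \ge 0,\quad t_n y(\mathbf{x}_n)-1+\xi_n \ge 0,\quad a_n\{t_n y(\mathbf{x}_n)-1+\xi_n\} = 0
\mu_n \ge 0,\quad \xi_n \ge 0,\quad \mu_n\xi_n = 0
```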
Dual Representation • The dual takes the same form as before, now with box constraints 0 ≤ a_n ≤ C (sketched below) • Classifying new data and obtaining b work as for hard-margin classifiers, using the margin points with 0 < a_n < C
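A reconstruction of the soft-margin dual and the expression for b (standard form, assumed):

```latex
% dual: maximize the same objective as the hard-margin case
\widetilde{L}(\mathbf{a}) = \sum_n a_n
 - \frac{1}{2}\sum_n\sum_m a_n a_m t_n t_m k(\mathbf{x}_n,\mathbf{x}_m)
\quad\text{s.t.}\quad 0 \le a_n \le C,\ \ \sum_n a_n t_n = 0
% b is averaged over the set M of margin points with 0 < a_n < C:
b = \frac{1}{N_M}\sum_{n\in M}\Big(t_n - \sum_{m\in S} a_m t_m k(\mathbf{x}_n,\mathbf{x}_m)\Big)
```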
Alternative Formulation • ν-SVM (Schölkopf et al., 2000) • The parameter ν is an upper bound on the fraction of margin errors and a lower bound on the fraction of support vectors (dual sketched below)
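The ν-SVM dual for classification, reconstructed in the standard form (assumed to match the slide):

```latex
% maximize
\widetilde{L}(\mathbf{a}) = -\frac{1}{2}\sum_n\sum_m a_n a_m t_n t_m k(\mathbf{x}_n,\mathbf{x}_m)
% subject to
0 \le a_n \le \frac{1}{N},\qquad \sum_n a_n t_n = 0,\qquad \sum_n a_n \ge \nu
```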
Example of Nonseparable Data Classification (ν-SVM)
Solutions of the QP Problem • Chunking (Vapnik, 1982) • Idea: the value of the Lagrangian is unchanged if we remove the rows and columns of the kernel matrix corresponding to Lagrange multipliers that have value zero • Protected conjugate gradients (Burges, 1998) • Decomposition methods (Osuna et al., 1996) • Sequential minimal optimization (Platt, 1999); a practical sketch follows
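As a practical aside (my addition, not part of the slides): these QP solvers are wrapped by standard libraries. A minimal sketch using scikit-learn, whose SVC is backed by libsvm's SMO-style solver; the toy data and the C and gamma values are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# toy two-class problem: two overlapping Gaussian blobs
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(2.0, 1.0, (50, 2))])
t = np.hstack([-np.ones(50), np.ones(50)])

# C is the trade-off parameter of the soft-margin objective C*sum(xi_n) + ||w||^2 / 2
clf = SVC(kernel="rbf", C=1.0, gamma=0.5).fit(X, t)

# the solution is sparse: only the support vectors enter the predictor
print("support vectors:", int(clf.n_support_.sum()), "of", len(X))
print("prediction at (1, 1):", clf.predict([[1.0, 1.0]]))
```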
Relation to Logistic Regression (Section 4.3.2) • For data points on the correct side of the margin, ξ_n = 0; for the remaining points, ξ_n = 1 - t_n y(x_n) • The soft-margin objective can then be rewritten in terms of the hinge error function (sketched below)
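A reconstruction of the hinge-loss form of the objective (standard, assumed to match the slide):

```latex
% soft-margin objective rewritten as regularized hinge loss:
\sum_{n=1}^{N} E_{SV}(y_n t_n) + \lambda\lVert\mathbf{w}\rVert^2,
\qquad \lambda = (2C)^{-1}
% hinge error function:
E_{SV}(y\,t) = [\,1 - y\,t\,]_{+}
```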
Relation to Logistic Regression (Cont'd) • Maximum likelihood logistic regression, with targets t ∈ {-1, +1} and a quadratic regularizer, gives an error function of the analogous form sketched below
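The corresponding regularized logistic regression error (standard form, assumed):

```latex
\sum_{n=1}^{N} E_{LR}(y_n t_n) + \lambda\lVert\mathbf{w}\rVert^2,
\qquad E_{LR}(y\,t) = \ln\big(1 + \exp(-y\,t)\big)
```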
Comparison of Error Functions • Figure comparing the hinge error function, the error function for logistic regression, the misclassification error, and the squared error
Multiclass SVMs • One-versus-the-rest: K separate SVMs • Can lead to inconsistent results (Figure 4.2) • Suffers from imbalanced training sets • Fix: positive class +1, negative class -1/(K-1) (Lee et al., 2001) • An objective function for training all SVMs simultaneously (Weston and Watkins, 1999) • One-versus-one: K(K-1)/2 SVMs • Error-correcting output codes (Allwein et al., 2000): a generalization of the voting scheme of one-versus-one (a toy comparison of the two decompositions follows)
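A toy sketch (my addition, not from the slides) contrasting the two decomposition schemes via scikit-learn's generic wrappers; the dataset and kernel choice are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X, t = load_iris(return_X_y=True)  # K = 3 classes

ovr = OneVsRestClassifier(SVC(kernel="rbf")).fit(X, t)  # trains K binary SVMs
ovo = OneVsOneClassifier(SVC(kernel="rbf")).fit(X, t)   # trains K(K-1)/2 binary SVMs

print(len(ovr.estimators_), len(ovo.estimators_))  # 3 and 3 for K = 3
```

Note that SVC on its own already handles multiclass input by applying one-versus-one internally; the explicit wrappers just make the decomposition visible.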
SVMs for Regression • Simple linear regression minimizes a regularized sum of squared errors • SVM regression instead uses the ε-insensitive error function, which is zero inside a tube of half-width ε, in contrast to the quadratic error function (definitions sketched below)
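A reconstruction of the two error functions this slide compared (standard form, assumed):

```latex
% regularized least squares (simple linear regression):
\frac{1}{2}\sum_{n=1}^{N}\{y_n - t_n\}^2 + \frac{\lambda}{2}\lVert\mathbf{w}\rVert^2
% epsilon-insensitive error function:
E_\epsilon\big(y(\mathbf{x}) - t\big) =
\begin{cases}
0, & \lvert y(\mathbf{x})-t\rvert < \epsilon \\
\lvert y(\mathbf{x})-t\rvert - \epsilon, & \text{otherwise}
\end{cases}
```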
SVMs for Regression (Cont'd) • Introduce two slack variables per data point, ξ_n ≥ 0 for points above the ε-tube and ξ̂_n ≥ 0 for points below it, and minimize the objective sketched below
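The tube constraints and regularized objective, reconstructed in the standard form (assumed):

```latex
% tube constraints:
t_n \le y(\mathbf{x}_n) + \epsilon + \xi_n, \qquad
t_n \ge y(\mathbf{x}_n) - \epsilon - \hat{\xi}_n
% regularized objective:
\min\ C\sum_{n=1}^{N}\big(\xi_n + \hat{\xi}_n\big) + \frac{1}{2}\lVert\mathbf{w}\rVert^2
```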
Dual Problem
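The regression dual this slide displayed, reconstructed from the standard derivation (assumed to match the slide):

```latex
% maximize over multipliers a_n >= 0 and \hat{a}_n >= 0:
\widetilde{L}(\mathbf{a},\hat{\mathbf{a}}) =
 -\frac{1}{2}\sum_n\sum_m (a_n-\hat{a}_n)(a_m-\hat{a}_m)k(\mathbf{x}_n,\mathbf{x}_m)
 -\epsilon\sum_n (a_n+\hat{a}_n) + \sum_n (a_n-\hat{a}_n)t_n
% subject to the box constraints
0 \le a_n \le C,\qquad 0 \le \hat{a}_n \le C,\qquad \sum_n(a_n-\hat{a}_n) = 0
% predictions for a new input:
y(\mathbf{x}) = \sum_{n=1}^{N}(a_n-\hat{a}_n)\,k(\mathbf{x},\mathbf{x}_n) + b
```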
Predictions • The KKT conditions (from the derivatives of the Lagrangian) imply that only points on or outside the ε-tube receive nonzero a_n or â_n • b can be computed from any point with 0 < a_n < C (sketched below)
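A reconstruction of the regression KKT conditions and the expression for b (standard form, assumed):

```latex
% KKT conditions:
a_n\big(\epsilon + \xi_n + y_n - t_n\big) = 0,\qquad
\hat{a}_n\big(\epsilon + \hat{\xi}_n - y_n + t_n\big) = 0
(C - a_n)\,\xi_n = 0,\qquad (C - \hat{a}_n)\,\hat{\xi}_n = 0
% b from a point with 0 < a_n < C (so that xi_n = 0):
b = t_n - \epsilon - \mathbf{w}^{\mathsf T}\phi(\mathbf{x}_n)
```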
Alternative Formulation • ν-SVM (Schölkopf et al., 2000) • The parameter ν bounds the fraction of points lying outside the tube (dual sketched below)
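The ν-SVM regression dual, reconstructed in the standard form (assumed to match the slide):

```latex
% maximize
\widetilde{L}(\mathbf{a},\hat{\mathbf{a}}) =
 -\frac{1}{2}\sum_n\sum_m (a_n-\hat{a}_n)(a_m-\hat{a}_m)k(\mathbf{x}_n,\mathbf{x}_m)
 + \sum_n (a_n-\hat{a}_n)t_n
% subject to
0 \le a_n,\hat{a}_n \le \frac{C}{N},\qquad
\sum_n(a_n-\hat{a}_n) = 0,\qquad
\sum_n(a_n+\hat{a}_n) \le \nu C
```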
Example of ν-SVM Regression
Relevance Vector Machines • Limitations of the SVM • Outputs are decisions rather than posterior probabilities • The extension to K > 2 classes is problematic • There is a complexity parameter C that must be set by cross-validation • Kernel functions are centered on training data points and are required to be positive definite • The RVM • A Bayesian sparse kernel technique • Yields much sparser models • Hence faster performance on test data
RVM for Regression • The RVM is a linear model of the form studied in Chapter 3, with a modified prior that has a separate precision hyperparameter α_i for each weight (model sketched below)
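A reconstruction of the RVM regression model and its prior (standard form, assumed to match the slide):

```latex
% conditional distribution and SVM-like basis expansion:
p(t \mid \mathbf{x}, \mathbf{w}, \beta) = \mathcal{N}\big(t \mid y(\mathbf{x}), \beta^{-1}\big),
\qquad y(\mathbf{x}) = \sum_{n=1}^{N} w_n k(\mathbf{x},\mathbf{x}_n) + b
% ARD prior with one precision hyperparameter per weight:
p(\mathbf{w}\mid\boldsymbol{\alpha}) = \prod_i \mathcal{N}\big(w_i \mid 0, \alpha_i^{-1}\big)
```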
RVM for Regression (Cont'd) • From the result (3.49) for linear regression models, the posterior over w is Gaussian • α and β are determined using the evidence approximation (type-2 maximum likelihood) (Section 3.5): maximize the marginal likelihood (sketched below)
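A reconstruction of the posterior and the marginal likelihood being maximized (standard form, assumed):

```latex
% posterior (cf. (3.49)):
p(\mathbf{w}\mid\mathbf{t},\mathbf{X},\boldsymbol{\alpha},\beta)
 = \mathcal{N}(\mathbf{w}\mid\mathbf{m},\boldsymbol{\Sigma})
\mathbf{m} = \beta\,\boldsymbol{\Sigma}\,\boldsymbol{\Phi}^{\mathsf T}\mathbf{t},\qquad
\boldsymbol{\Sigma} = \big(\mathbf{A} + \beta\,\boldsymbol{\Phi}^{\mathsf T}\boldsymbol{\Phi}\big)^{-1},
\qquad \mathbf{A} = \mathrm{diag}(\alpha_i)
% log marginal likelihood maximized over alpha and beta:
\ln p(\mathbf{t}\mid\mathbf{X},\boldsymbol{\alpha},\beta) =
 -\frac{1}{2}\Big\{N\ln(2\pi) + \ln\lvert\mathbf{C}\rvert
 + \mathbf{t}^{\mathsf T}\mathbf{C}^{-1}\mathbf{t}\Big\},
\qquad \mathbf{C} = \beta^{-1}\mathbf{I} + \boldsymbol{\Phi}\mathbf{A}^{-1}\boldsymbol{\Phi}^{\mathsf T}
```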
RVM for Regression (Cont'd) • Two approaches to the maximization • Setting derivatives of the marginal likelihood to zero and re-estimating (sketched below) • The EM algorithm (Section 9.3.4) • The predictive distribution then follows as in Section 3.3.2
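The re-estimation equations and predictive distribution, reconstructed in the standard form (assumed):

```latex
% fixed-point re-estimation (gamma_i measures how well w_i is determined by the data):
\gamma_i = 1 - \alpha_i\Sigma_{ii},\qquad
\alpha_i^{\text{new}} = \frac{\gamma_i}{m_i^2},\qquad
(\beta^{\text{new}})^{-1} = \frac{\lVert\mathbf{t}-\boldsymbol{\Phi}\mathbf{m}\rVert^2}{N - \sum_i \gamma_i}
% predictive distribution at a new input x:
p(t\mid\mathbf{x},\mathbf{X},\mathbf{t},\boldsymbol{\alpha}^\ast,\beta^\ast)
 = \mathcal{N}\big(t \mid \mathbf{m}^{\mathsf T}\phi(\mathbf{x}),\ \sigma^2(\mathbf{x})\big),
\qquad \sigma^2(\mathbf{x}) = (\beta^\ast)^{-1}
 + \phi(\mathbf{x})^{\mathsf T}\boldsymbol{\Sigma}\,\phi(\mathbf{x})
```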
Example of RVM Regression (ν-SVM regression vs. RVM regression) • The RVM model is more compact than the SVM • Its parameters are determined automatically • It requires more training time than the SVM (a runnable sketch of the training loop follows)
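A minimal runnable sketch (my addition) of the re-estimation loop from the previous slides, assuming Gaussian basis functions centred on the training inputs and synthetic sinc data; the iteration count, pruning threshold, and numerical guards are arbitrary illustrative choices, not Bishop's.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 50
X = np.linspace(-5.0, 5.0, N)
t = np.sinc(X) + 0.1 * rng.normal(size=N)

# one Gaussian basis function centred on every training input (N x N design matrix)
Phi = np.exp(-0.5 * (X[:, None] - X[None, :]) ** 2)

alpha = np.ones(N)  # one precision hyperparameter per weight (ARD prior)
beta = 10.0         # noise precision

for _ in range(200):
    Sigma = np.linalg.inv(np.diag(alpha) + beta * Phi.T @ Phi)  # posterior covariance
    m = beta * Sigma @ Phi.T @ t                                # posterior mean
    gamma = np.clip(1.0 - alpha * np.diag(Sigma), 1e-10, 1.0)   # well-determinedness
    alpha = np.minimum(gamma / (m ** 2 + 1e-12), 1e12)          # alpha_i <- gamma_i / m_i^2
    beta = (N - gamma.sum()) / np.sum((t - Phi @ m) ** 2)       # noise re-estimation

# weights whose alpha_i has diverged are pruned; the rest are the relevance vectors
print("relevance vectors kept:", int((alpha < 1e9).sum()), "of", N)
```

A production implementation would prune diverged basis functions explicitly rather than capping alpha, which is what makes the RVM efficient at test time.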
Mechanism for Sparsity • Compare the marginal likelihood when a basis function is pruned (α = ∞, leaving only isotropic noise) with a finite value of α: if the basis vector is poorly aligned with the data, pruning it increases the marginal likelihood
Sparse Solution • Pull out the contribution of a single αᵢ from the covariance C in the marginal likelihood, using (C.7) and (C.15) in Appendix C (decomposition sketched below)
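The decomposition this slide displayed, reconstructed in the standard form (assumed):

```latex
% separating out the contribution of basis function i to C:
\mathbf{C} = \beta^{-1}\mathbf{I}
 + \sum_{j\ne i}\alpha_j^{-1}\boldsymbol{\varphi}_j\boldsymbol{\varphi}_j^{\mathsf T}
 + \alpha_i^{-1}\boldsymbol{\varphi}_i\boldsymbol{\varphi}_i^{\mathsf T}
 = \mathbf{C}_{-i} + \alpha_i^{-1}\boldsymbol{\varphi}_i\boldsymbol{\varphi}_i^{\mathsf T}
```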
Sparse Solution (Cont'd) • The log marginal likelihood L(α) splits into L(α_{-i}) plus a term λ(αᵢ) involving only αᵢ • Its stationary points with respect to αᵢ are given below • The sparsity sᵢ measures the extent to which the basis vector φᵢ overlaps with the other basis vectors • The quality qᵢ measures the alignment of φᵢ with the error between t and the prediction y_{-i} of the model with φᵢ removed
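A reconstruction of the sparsity analysis (standard form, assumed to match the slide):

```latex
% decomposition of the log marginal likelihood:
L(\boldsymbol{\alpha}) = L(\boldsymbol{\alpha}_{-i}) + \lambda(\alpha_i),\qquad
\lambda(\alpha_i) = \frac{1}{2}\left[\ln\alpha_i - \ln(\alpha_i + s_i)
 + \frac{q_i^2}{\alpha_i + s_i}\right]
% sparsity and quality of basis function i:
s_i = \boldsymbol{\varphi}_i^{\mathsf T}\mathbf{C}_{-i}^{-1}\boldsymbol{\varphi}_i,\qquad
q_i = \boldsymbol{\varphi}_i^{\mathsf T}\mathbf{C}_{-i}^{-1}\mathbf{t}
% stationary point:
\alpha_i = \frac{s_i^2}{q_i^2 - s_i}\ \ \text{if } q_i^2 > s_i,\qquad
\alpha_i = \infty\ \ \text{otherwise (basis function pruned)}
```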
Sequential Sparse Bayesian Learning Algorithm • 1. Initialize β • 2. Initialize with a single basis function φ₁, with α₁ set from its stationary-point formula and the remaining αⱼ = ∞ • 3. Evaluate Σ and m, together with qᵢ and sᵢ for all basis functions • 4. Select a candidate basis function φᵢ • 5. If qᵢ² > sᵢ and αᵢ < ∞ (φᵢ is already in the model), update αᵢ • 6. If qᵢ² > sᵢ and αᵢ = ∞, add φᵢ to the model and evaluate αᵢ • 7. If qᵢ² ≤ sᵢ and αᵢ < ∞, remove φᵢ from the model and set αᵢ = ∞ • 8. Update β (in the regression case) • 9. Go to 3 until converged
RVM for Classification • Apply the probabilistic linear classification model of Chapter 4 with an ARD prior over w - Initialize α - Build a Gaussian approximation to the posterior distribution - Obtain an approximation to the marginal likelihood - Maximize the marginal likelihood (re-estimate α) until converged
RVM for Classification (Cont'd) • The posterior mode is obtained by maximizing the log posterior using iterative reweighted least squares (IRLS) from Section 4.3.3 • This yields the Gaussian approximation to the posterior distribution sketched below
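A reconstruction of the Laplace-approximation quantities this slide showed (standard form, assumed):

```latex
% log posterior (y_n = sigma(w^T phi(x_n)), targets t_n in {0,1}):
\ln p(\mathbf{w}\mid\mathbf{t},\boldsymbol{\alpha}) =
 \sum_{n=1}^{N}\big\{t_n\ln y_n + (1-t_n)\ln(1-y_n)\big\}
 - \frac{1}{2}\mathbf{w}^{\mathsf T}\mathbf{A}\mathbf{w} + \text{const}
% mode and covariance of the Gaussian approximation, with B = diag(y_n(1-y_n)):
\mathbf{w}^\ast = \mathbf{A}^{-1}\boldsymbol{\Phi}^{\mathsf T}(\mathbf{t}-\mathbf{y}),\qquad
\boldsymbol{\Sigma} = \big(\boldsymbol{\Phi}^{\mathsf T}\mathbf{B}\boldsymbol{\Phi}
 + \mathbf{A}\big)^{-1}
```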
RVM for Classification (Cont'd) • The marginal likelihood is approximated using the Laplace approximation (Section 4.4) • Setting its derivative with respect to αᵢ equal to zero and rearranging gives the same re-estimation equation as in the regression case • Defining t̂ = Φw* + B⁻¹(t - y) puts the problem in the same form as the regression case (sketched below)
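A reconstruction of the approximated marginal likelihood and re-estimation equations (standard form, assumed):

```latex
% Laplace approximation to the marginal likelihood:
p(\mathbf{t}\mid\boldsymbol{\alpha}) \simeq
 p(\mathbf{t}\mid\mathbf{w}^\ast)\,p(\mathbf{w}^\ast\mid\boldsymbol{\alpha})\,
 (2\pi)^{M/2}\lvert\boldsymbol{\Sigma}\rvert^{1/2}
% re-estimation, exactly as in the regression case:
\alpha_i^{\text{new}} = \frac{\gamma_i}{(w_i^\ast)^2},\qquad
\gamma_i = 1 - \alpha_i\Sigma_{ii}
% with \hat{\mathbf{t}} = \boldsymbol{\Phi}\mathbf{w}^\ast + \mathbf{B}^{-1}(\mathbf{t}-\mathbf{y}),
% the mode satisfies the regression-like normal equations
\mathbf{w}^\ast = \big(\boldsymbol{\Phi}^{\mathsf T}\mathbf{B}\boldsymbol{\Phi}
 + \mathbf{A}\big)^{-1}\boldsymbol{\Phi}^{\mathsf T}\mathbf{B}\,\hat{\mathbf{t}}
```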
Example of RVM Classification