Classification: Support Vector Machine
10/10/07
What hyperplane (line) can separate the two classes of data?
But there are many other choices! Which one is the best?
M: the margin, the width of the gap between the two classes measured perpendicular to the separating hyperplane.
Optimal separating hyperplane
The best hyperplane is the one that maximizes the margin, M.
Computing the margin width
The separating hyperplane is xTb + b0 = 0; the "plus" and "minus" planes are xTb + b0 = 1 and xTb + b0 = -1.
Find x+ on the plus plane and x- on the minus plane so that x+ - x- is perpendicular to b. Then M = |x+ - x-|.
Since x+Tb + b0 = 1 and x-Tb + b0 = -1, subtracting gives (x+ - x-)Tb = 2.
Because x+ - x- is parallel to b, |x+ - x-| |b| = 2, and therefore M = |x+ - x-| = 2/|b|.
Computing the margin width
With b and b0 scaled so that the closest points lie on the plus and minus planes, the hyperplane is separating if yi (xiTb + b0) ≥ 1 for all i.
The maximization problem is: maximize M = 2/|b| subject to yi (xiTb + b0) ≥ 1 for all i.
The points that lie exactly on the plus or minus plane are the support vectors.
Optimal separating hyperplane
Rewrite the problem as: minimize (1/2)|b|^2 subject to yi (xiTb + b0) ≥ 1 for all i.
Lagrange (primal) function: LP = (1/2)|b|^2 - Σi αi [ yi (xiTb + b0) - 1 ].
To minimize, set the partial derivatives to 0: b = Σi αi yi xi and Σi αi yi = 0.
Can be solved by quadratic programming.
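As an illustration (not part of the original slides), the sketch below fits a maximum-margin linear SVM on separable toy data with scikit-learn, using a very large C to approximate the hard-margin problem, and recovers the margin M = 2/|b| from the fitted coefficients. The data and parameter values are made up for the example.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_pos = rng.normal(loc=[2, 2], scale=0.3, size=(20, 2))    # class +1
X_neg = rng.normal(loc=[-2, -2], scale=0.3, size=(20, 2))  # class -1
X = np.vstack([X_pos, X_neg])
y = np.array([1] * 20 + [-1] * 20)

# A very large C approximates the hard-margin (separable) problem.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

b = clf.coef_.ravel()             # the vector b in xTb + b0
b0 = clf.intercept_[0]            # the offset b0
margin = 2.0 / np.linalg.norm(b)  # M = 2 / |b|
print("margin M =", margin)
print("number of support vectors:", len(clf.support_))
```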
When the two classes are non-separable
What is the best hyperplane? Idea: allow some points to lie on the wrong side, but not by much.
Support vector machine
When the two classes are not separable, the problem is slightly modified: find b, b0 to minimize (1/2)|b|^2 + C Σi ξi subject to yi (xiTb + b0) ≥ 1 - ξi and ξi ≥ 0 for all i, where the ξi are slack variables and C controls the penalty for points on the wrong side.
Can be solved using quadratic programming.
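A minimal sketch of the soft-margin formulation (illustrative, not from the slides): in scikit-learn's SVC the parameter C plays the role of the slack penalty above, so a small C tolerates more points on the wrong side while a large C penalizes them. The blob data below are assumed for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Overlapping (non-separable) two-class data.
X, y = make_blobs(n_samples=200, centers=[[0, 0], [1.5, 1.5]],
                  cluster_std=1.0, random_state=0)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 2.0 / np.linalg.norm(clf.coef_)
    print(f"C={C:>6}: margin={margin:.3f}, "
          f"support vectors={len(clf.support_)}")
```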
Convert a non-separable case to a separable one by a nonlinear transformation
The original data are non-separable in 1D; after a nonlinear transformation of x they become separable in 1D.
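A toy numeric illustration of this idea (assumed, not from the slides): one class sits near the origin and the other lies on both sides of it, so no single threshold on x separates them, but the transformation x → x² makes the same data separable by a threshold.

```python
import numpy as np

x_inner = np.array([-0.5, -0.2, 0.1, 0.4])   # class -1, near zero
x_outer = np.array([-2.0, -1.5, 1.6, 2.2])   # class +1, on both sides of zero

# No threshold on x separates the classes, because class +1 lies on both
# sides of class -1.  After the transformation h(x) = x**2 the threshold 1.0
# separates them, so the transformed data are separable in 1D.
print("inner squared:", x_inner ** 2)   # all below 1.0
print("outer squared:", x_outer ** 2)   # all above 1.0
```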
Kernel functions
• Introduce nonlinear basis functions h(x) and fit the separating hyperplane in the transformed space. The separating function becomes f(x) = h(x)Tb + b0 = Σi αi yi <h(x), h(xi)> + b0.
• In fact, all you need is the kernel function K(x, x') = <h(x), h(x')>.
• Common kernels: dth-degree polynomial K(x, x') = (1 + <x, x'>)^d; radial basis K(x, x') = exp(-γ |x - x'|^2); neural network K(x, x') = tanh(κ1 <x, x'> + κ2).
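A small sketch of the kernel trick in practice (illustrative; the data set and kernel parameters are assumptions, not from the slides): scikit-learn's SVC never forms h(x) explicitly and only evaluates K(x, x'), so an RBF or polynomial kernel can separate data that no line in the original space can.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Concentric circles: not linearly separable in the original 2-D space.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear = SVC(kernel="linear").fit(X, y)
rbf = SVC(kernel="rbf", gamma=1.0).fit(X, y)              # radial basis kernel
poly = SVC(kernel="poly", degree=2, coef0=1.0).fit(X, y)  # (1 + <x, x'>)^2

print("linear kernel training accuracy:", linear.score(X, y))   # poor
print("RBF kernel training accuracy:   ", rbf.score(X, y))      # near 1.0
print("poly-2 kernel training accuracy:", poly.score(X, y))     # near 1.0
```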
Prediction of central nervous system embryonic tumor outcome
• 42 patient samples
• 5 cancer types
• Array contains 6817 genes
• Question: are different tumor types distinguishable from their gene expression patterns? (Pomeroy et al. 2002)
Gene expressions within a cancer type cluster together (Pomeroy et al. 2002)
PCA based on all genes (Pomeroy et al. 2002)
PCA based on a subset of informative genes (Pomeroy et al. 2002)
Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks
• Four different cancer types
• 88 samples
• 6567 genes
• Goal: to predict cancer types from gene expression data (Khan et al. 2001)
Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks (Khan et al. 2001)
Procedures
• Filter out genes that have low expression values (retain 2308 genes).
• Reduce dimension with PCA; keep the top 10 principal components.
• Evaluate with 3-fold cross-validation, repeated 1250 times (see the sketch below). (Khan et al. 2001)
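A rough sketch of this kind of pipeline (the expression threshold, the classifier, and the helper name run_pipeline are illustrative assumptions; in particular a linear SVM stands in here for the committee of neural networks used by Khan et al. 2001): filter low-expression genes, keep the top 10 principal components, and score with repeated 3-fold cross-validation.

```python
from sklearn.decomposition import PCA
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def run_pipeline(X, y, min_expression=1.0, n_repeats=10):
    """X: samples-by-genes expression matrix, y: cancer-type labels."""
    # Step 1: drop genes whose maximum expression is low (threshold assumed).
    keep = X.max(axis=0) > min_expression
    X = X[:, keep]

    # Steps 2-3: top 10 principal components, a classifier on top,
    # scored by repeated 3-fold cross-validation.
    model = make_pipeline(StandardScaler(),
                          PCA(n_components=10),
                          SVC(kernel="linear"))
    cv = RepeatedStratifiedKFold(n_splits=3, n_repeats=n_repeats,
                                 random_state=0)
    return cross_val_score(model, X, y, cv=cv)
```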
Acknowledgement • Sources of slides: • Cheng Li • http://www.cs.cornell.edu/johannes/papers/2001/kdd2001-tutorial-final.pdf • www.cse.msu.edu/~lawhiu/intro_SVM_new.ppt
Aggregating predictors
• Aggregating several predictors can sometimes perform better than any single predictor alone. Aggregation is achieved by a weighted sum of different predictors, which can be the same kind of predictor obtained from slightly perturbed training datasets.
• The key to the improvement in accuracy is the instability of the individual classifiers, such as classification trees.
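One common instance of this idea is bagging: the sketch below (assumed, not from the slides) fits the same kind of classification tree on bootstrap-perturbed training sets and aggregates their votes, which typically beats any single tree precisely because trees are unstable.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for a real training set.
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=0)

single_tree = DecisionTreeClassifier(random_state=0)
bagged_trees = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                                 random_state=0)

print("single tree CV accuracy: ",
      cross_val_score(single_tree, X, y, cv=5).mean())
print("bagged trees CV accuracy:",
      cross_val_score(bagged_trees, X, y, cv=5).mean())
```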
AdaBoost
• Step 1: Initialize the observation weights wi = 1/N, i = 1, ..., N.
• Step 2: For m = 1 to M:
(a) Fit a classifier Gm(x) to the training data using weights wi.
(b) Compute the weighted error errm = Σi wi I(yi ≠ Gm(xi)) / Σi wi.
(c) Compute αm = log((1 - errm) / errm).
(d) Set wi ← wi · exp(αm · I(yi ≠ Gm(xi))), so misclassified observations are given more weight.
• Step 3: Output G(x) = sign(Σm αm Gm(x)). (A code sketch of these steps follows.)
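A minimal from-scratch sketch of the steps above (the decision-stump base classifier and the -1/+1 label coding are illustrative assumptions):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, M=50):
    """y is assumed to be coded as -1 / +1."""
    n = len(y)
    w = np.full(n, 1.0 / n)                       # Step 1: w_i = 1/N
    classifiers, alphas = [], []
    for m in range(M):                            # Step 2: m = 1, ..., M
        G = DecisionTreeClassifier(max_depth=1)   # a decision stump as G_m
        G.fit(X, y, sample_weight=w)              # (a) fit G_m with weights w_i
        miss = (G.predict(X) != y)
        err = np.sum(w * miss) / np.sum(w)        # (b) weighted error err_m
        alpha = np.log((1.0 - err) / err)         # (c) alpha_m = log((1-err_m)/err_m)
        w = w * np.exp(alpha * miss)              # (d) up-weight misclassified points
        classifiers.append(G)
        alphas.append(alpha)
    return classifiers, alphas

def adaboost_predict(X, classifiers, alphas):
    # Step 3: G(x) = sign( sum_m alpha_m * G_m(x) )
    votes = sum(a * G.predict(X) for G, a in zip(classifiers, alphas))
    return np.sign(votes)
```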
Optimal separating hyperplane
• Substituting, we get the Lagrange (Wolfe) dual function LD = Σi αi - (1/2) Σi Σk αi αk yi yk xiTxk, to be maximized subject to αi ≥ 0 and Σi αi yi = 0. To complete the steps, see Burges et al.
• If αi > 0, then yi (xiTb + b0) = 1, i.e., xi lies on the margin boundary. These xi's are called the support vectors.
• Since b = Σi αi yi xi, the solution is determined only by the support vectors.
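A small numeric check of this point (illustrative, not from the slides): refitting a hard-margin linear SVM on its support vectors alone recovers essentially the same b and b0 as fitting on the full data set.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal([2, 2], 0.4, (30, 2)),
               rng.normal([-2, -2], 0.4, (30, 2))])
y = np.array([1] * 30 + [-1] * 30)

full = SVC(kernel="linear", C=1e6).fit(X, y)                 # fit on all points
sv_only = SVC(kernel="linear", C=1e6).fit(full.support_vectors_,
                                          y[full.support_])  # support vectors only

print(full.coef_, full.intercept_)        # b and b0 from the full data
print(sv_only.coef_, sv_only.intercept_)  # the same, up to numerical error
```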
Support vector machine
The Lagrange (primal) function is LP = (1/2)|b|^2 + C Σi ξi - Σi αi [ yi (xiTb + b0) - (1 - ξi) ] - Σi μi ξi.
Setting the partial derivatives to 0 gives b = Σi αi yi xi, Σi αi yi = 0, and αi = C - μi.
Substituting, we get the dual LD = Σi αi - (1/2) Σi Σk αi αk yi yk xiTxk, to be maximized subject to 0 ≤ αi ≤ C and Σi αi yi = 0.