An Introduction to Support Vector Machines
CSE 802. Prepared by Martin Law
Outline
• What is a good decision boundary for a binary classification problem?
• From minimizing the misclassification error to maximizing the margin
• Two classes, linearly inseparable
• How to deal with some noisy data
• How to make SVM non-linear: kernel
• Conclusion
Two-Class Problem: Linearly Separable Case
• Many decision boundaries can separate these two classes without misclassification
• Which one should we choose?
• Minimizing the misclassification error does not single out a boundary: all of these boundaries achieve zero error
(Figure: Class 1 and Class 2 data points with several candidate decision boundaries)
Maximizing the Margin
• The decision boundary should be as far away from the data of both classes as possible
• We should maximize the margin, m
(Figure: decision boundary between Class 1 and Class 2 with margin m)
The Optimization Problem
• Let {x_1, ..., x_n} be our data set and let y_i ∈ {1, -1} be the class label of x_i
• The decision boundary should classify all points correctly: y_i (w^T x_i + b) ≥ 1 for all i
• This gives a constrained optimization problem: minimize (1/2)||w||^2 subject to y_i (w^T x_i + b) ≥ 1, i = 1, ..., n
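As a rough illustration (not part of the original slides), this primal problem can be solved on a toy 2-D dataset with a general-purpose constrained optimizer; the data, variable names, and the use of scipy here are my own assumptions for the sketch.

```python
import numpy as np
from scipy.optimize import minimize

# Toy, linearly separable 2-D data (illustrative only)
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])

# Decision variables packed as theta = [w1, w2, b]
def objective(theta):
    w = theta[:2]
    return 0.5 * np.dot(w, w)          # (1/2)||w||^2

def margin_constraints(theta):
    w, b = theta[:2], theta[2]
    return y * (X @ w + b) - 1         # y_i (w^T x_i + b) - 1 >= 0 for every i

res = minimize(objective, x0=np.zeros(3),
               constraints=[{"type": "ineq", "fun": margin_constraints}],
               method="SLSQP")
w, b = res.x[:2], res.x[2]
print("w =", w, "b =", b)
```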
The Dual Problem
• We can transform the problem to its dual: maximize W(α) = Σ_i α_i - (1/2) Σ_i Σ_j α_i α_j y_i y_j x_i^T x_j subject to α_i ≥ 0 and Σ_i α_i y_i = 0
• This is a quadratic programming (QP) problem: the global maximum over the α_i can always be found
• w can be recovered as w = Σ_i α_i y_i x_i
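A minimal sketch of solving this dual as a QP, assuming the cvxopt package is available; it reuses the toy X and y from the previous sketch and is only an illustration, not the slides' own code.

```python
import numpy as np
from cvxopt import matrix, solvers

# Dual written as a minimization: minimize (1/2) a^T P a - 1^T a
# subject to a_i >= 0 and y^T a = 0
n = X.shape[0]
K = X @ X.T                               # Gram matrix of inner products x_i^T x_j
P = matrix(np.outer(y, y) * K * 1.0)
q = matrix(-np.ones(n))
G = matrix(-np.eye(n))                    # encodes -a_i <= 0, i.e. a_i >= 0
h = matrix(np.zeros(n))
A = matrix(y.reshape(1, -1).astype(float))
b_eq = matrix(0.0)

sol = solvers.qp(P, q, G, h, A, b_eq)
alpha = np.ravel(sol["x"])
w = (alpha * y) @ X                       # w = sum_i alpha_i y_i x_i
print("alpha =", alpha.round(3))
```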
A Geometrical Interpretation
(Figure: data points labelled with their α values; most have α_i = 0, and only three, with α_1 = 0.8, α_6 = 1.4, α_8 = 0.6, are support vectors)
Characteristics of the Solution
• Many of the α_i are zero
• w is a linear combination of a small number of data points: a sparse representation
• The x_i with non-zero α_i are called support vectors (SV)
• The decision boundary is determined only by the SVs
• Let t_j (j = 1, ..., s) be the indices of the s support vectors; we can write w = Σ_j α_{t_j} y_{t_j} x_{t_j}
• For testing with a new data point z, compute f(z) = Σ_j α_{t_j} y_{t_j} (x_{t_j}^T z) + b and classify z as class 1 if the sum is positive, and class 2 otherwise
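Continuing the dual sketch above (my own illustration): the bias b can be recovered from any support vector via y_k (w^T x_k + b) = 1, and a new point is then classified using only the support vectors.

```python
sv = alpha > 1e-6                         # support vectors have non-zero alpha
b = y[sv][0] - X[sv][0] @ w               # since y_k is +/-1, b = y_k - w^T x_k on a support vector

def decision(z):
    # f(z) = sum_j alpha_tj * y_tj * (x_tj^T z) + b, summed over support vectors only
    return (alpha[sv] * y[sv]) @ (X[sv] @ z) + b

z = np.array([1.0, 0.5])
print("class 1" if decision(z) > 0 else "class 2")
```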
Some Notes
• There are theoretical upper bounds on the error on unseen data for SVM
• The larger the margin, the smaller the bound
• The smaller the number of SVs, the smaller the bound
• Note that in both training and testing, the data are referenced only through inner products, x^T y
• This is important for generalizing to the non-linear case
How About the Not Linearly Separable Case?
• We allow an “error” ξ_i in classification to tolerate some noisy data
(Figure: Class 1 and Class 2 data points that no straight line separates perfectly)
Soft Margin Hyperplane
• Define ξ_i = 0 if there is no error for x_i; the ξ_i are just “slack variables” in optimization theory
• We want to minimize (1/2)||w||^2 + C Σ_i ξ_i
• C is a tradeoff parameter between error and margin
• The optimization problem becomes: minimize (1/2)||w||^2 + C Σ_i ξ_i subject to y_i (w^T x_i + b) ≥ 1 - ξ_i and ξ_i ≥ 0
The Optimization Problem
• The dual of the problem is: maximize W(α) = Σ_i α_i - (1/2) Σ_i Σ_j α_i α_j y_i y_j x_i^T x_j subject to 0 ≤ α_i ≤ C and Σ_i α_i y_i = 0
• w is also recovered as w = Σ_i α_i y_i x_i
• The only difference from the linearly separable case is the upper bound C on the α_i
• Once again, a QP solver can be used to find the α_i
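In the cvxopt sketch above (again an illustration under my own assumptions), the only change for the soft-margin dual is the extra box constraint α_i ≤ C:

```python
# Soft-margin dual: the same QP, except 0 <= alpha_i <= C
C = 1.0                                                  # illustrative value of the tradeoff parameter
G_soft = matrix(np.vstack([-np.eye(n), np.eye(n)]))      # -a_i <= 0 and a_i <= C
h_soft = matrix(np.hstack([np.zeros(n), C * np.ones(n)]))

sol_soft = solvers.qp(P, q, G_soft, h_soft, A, b_eq)
alpha_soft = np.ravel(sol_soft["x"])
```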
Extension to Non-linear Decision Boundary
• In most situations, the decision boundary we are looking for should NOT be a straight line
(Figure: a mapping φ(·) takes each point from the input space to the feature space)
Extension to Non-linear Decision Boundary
• Key idea: use a function φ(x) to transform the x_i to a higher-dimensional space to “make life easier”
• Input space: the space the x_i live in
• Feature space: the space of the φ(x_i) after the transformation
• Search for a hyperplane in feature space that maximizes the margin
• A hyperplane in feature space corresponds to a curved boundary in input space
• Why transform? We still like the idea of maximizing the margin, and the resulting classifier is more flexible and more powerful
Transformation and Kernel
Kernel: Efficient Computation
• Define the kernel function K(x, y) as K(x, y) = φ(x)^T φ(y)
• For suitable transformations, K(x, y) can be computed directly from x and y without ever forming φ(x) explicitly (see the sketch below)
• In practice we don’t need to worry about the transformation function φ(x); what we have to do is select a good kernel for our problem
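An illustrative check (my own example, not from the slides): for 2-D inputs, the degree-2 polynomial kernel (x^T y)^2 equals the inner product after an explicit feature map φ, so the kernel value can be computed without ever building φ(x).

```python
import numpy as np

def phi(v):
    # Explicit degree-2 feature map for 2-D input (illustrative choice)
    return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

def poly2_kernel(x, z):
    # Computed directly in input space, without forming phi
    return (x @ z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
print(phi(x) @ phi(z))       # 1.0
print(poly2_kernel(x, z))    # 1.0 -- identical, but much cheaper in high dimension
```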
Examples of Kernel Functions
• Polynomial kernel with degree d: K(x, y) = (x^T y + 1)^d
• Radial basis function (RBF) kernel with width σ: K(x, y) = exp(-||x - y||^2 / (2σ^2))
• Closely related to radial basis function neural networks
• Research on different kernel functions for different applications is very active
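A small sketch of these two kernels as plain Python functions; the parameter values and the toy Gram-matrix computation are illustrative only.

```python
import numpy as np

def polynomial_kernel(x, z, d=3):
    return (x @ z + 1) ** d

def rbf_kernel(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

# Kernel (Gram) matrix over a small illustrative dataset
X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])
K = np.array([[rbf_kernel(a, c) for c in X] for a in X])
print(K.round(3))
```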
Summary: Steps for Classification
• Prepare the data matrix
• Select the kernel function to use
• Select the parameters of the kernel function and the value of C
• You can use the values suggested by the SVM software, or you can set apart a validation set to determine them
• Execute the training algorithm to obtain the α_i
• Unseen data can then be classified using the α_i and the support vectors
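These steps map onto a few lines with an off-the-shelf SVM library; the sketch below assumes scikit-learn and uses made-up data and parameter grids purely for illustration.

```python
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_classification

# Illustrative data; in practice this is your prepared data matrix
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Choose the kernel, then pick C and the kernel parameter on held-out folds
grid = GridSearchCV(SVC(kernel="rbf"),
                    param_grid={"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]},
                    cv=5)
grid.fit(X, y)                 # training solves the QP and finds the alpha_i internally
print(grid.best_params_)
print(grid.predict(X[:5]))     # unseen data would be classified the same way
```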
Classification Result of SVM
Conclusion
• SVMs are among the most popular tools for binary classification of numeric data
• Key ideas of SVM:
• Maximizing the margin leads to a “good” classifier
• Transformation to a higher-dimensional space makes the classifier more flexible
• The kernel trick makes the computation efficient
• Weaknesses of SVM:
• A “good” kernel function is needed
Resources
• http://www.kernel-machines.org/
• http://www.support-vector.net/
• http://www.support-vector.net/icml-tutorial.pdf
• http://www.kernel-machines.org/papers/tutorial-nips.ps.gz
• http://www.clopinet.com/isabelle/Projects/SVM/applist.html