An Introduction to Support Vector Machines
CSE 802. Prepared by Martin Law
Outline
• What is a good decision boundary for a binary classification problem?
• From minimizing the misclassification error to maximizing the margin
• Two classes, linearly inseparable
• How to deal with some noisy data
• How to make SVM non-linear: kernel
• Conclusion
Two-Class Problem: Linearly Separable Case
• Many decision boundaries can separate these two classes without misclassification
• Which one should we choose?
• Minimizing the misclassification error does not single out a boundary: all of these boundaries achieve zero error
(Figure: Class 1 and Class 2 data points with several candidate decision boundaries)
Maximizing the Margin
• The decision boundary should be as far away from the data of both classes as possible
• We should maximize the margin, m
(Figure: decision boundary between Class 1 and Class 2 with margin m)
The Optimization Problem
• Let {x_1, ..., x_n} be our data set and let y_i ∈ {1, -1} be the class label of x_i
• The decision boundary should classify all points correctly: y_i (w^T x_i + b) ≥ 1 for all i
• This gives a constrained optimization problem: minimize (1/2)||w||^2 subject to y_i (w^T x_i + b) ≥ 1, i = 1, ..., n
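As a rough illustration (not part of the original slides), this primal problem can be solved on a toy 2-D dataset with a general-purpose constrained optimizer; the data, variable names, and the use of scipy here are my own assumptions for the sketch.

```python
import numpy as np
from scipy.optimize import minimize

# Toy, linearly separable 2-D data (illustrative only)
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])

# Decision variables packed as theta = [w1, w2, b]
def objective(theta):
    w = theta[:2]
    return 0.5 * np.dot(w, w)          # (1/2)||w||^2

def margin_constraints(theta):
    w, b = theta[:2], theta[2]
    return y * (X @ w + b) - 1         # y_i (w^T x_i + b) - 1 >= 0 for every i

res = minimize(objective, x0=np.zeros(3),
               constraints=[{"type": "ineq", "fun": margin_constraints}],
               method="SLSQP")
w, b = res.x[:2], res.x[2]
print("w =", w, "b =", b)
```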
The Dual Problem
• We can transform the problem to its dual: maximize W(α) = Σ_i α_i - (1/2) Σ_i Σ_j α_i α_j y_i y_j x_i^T x_j subject to α_i ≥ 0 and Σ_i α_i y_i = 0
• This is a quadratic programming (QP) problem: the global maximum over the α_i can always be found
• w can be recovered as w = Σ_i α_i y_i x_i
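A minimal sketch of solving this dual as a QP, assuming the cvxopt package is available; it reuses the toy X and y from the previous sketch and is only an illustration, not the slides' own code.

```python
import numpy as np
from cvxopt import matrix, solvers

# Dual written as a minimization: minimize (1/2) a^T P a - 1^T a
# subject to a_i >= 0 and y^T a = 0
n = X.shape[0]
K = X @ X.T                               # Gram matrix of inner products x_i^T x_j
P = matrix(np.outer(y, y) * K * 1.0)
q = matrix(-np.ones(n))
G = matrix(-np.eye(n))                    # encodes -a_i <= 0, i.e. a_i >= 0
h = matrix(np.zeros(n))
A = matrix(y.reshape(1, -1).astype(float))
b_eq = matrix(0.0)

sol = solvers.qp(P, q, G, h, A, b_eq)
alpha = np.ravel(sol["x"])
w = (alpha * y) @ X                       # w = sum_i alpha_i y_i x_i
print("alpha =", alpha.round(3))
```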
A Geometrical Interpretation
(Figure: data points labelled with their α values; most have α_i = 0, and only three, with α_1 = 0.8, α_6 = 1.4, α_8 = 0.6, are support vectors)
Characteristics of the Solution
• Many of the α_i are zero
• w is a linear combination of a small number of data points: a sparse representation
• The x_i with non-zero α_i are called support vectors (SV)
• The decision boundary is determined only by the SVs
• Let t_j (j = 1, ..., s) be the indices of the s support vectors; we can write w = Σ_j α_{t_j} y_{t_j} x_{t_j}
• For testing with a new data point z, compute f(z) = Σ_j α_{t_j} y_{t_j} (x_{t_j}^T z) + b and classify z as class 1 if the sum is positive, and class 2 otherwise
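Continuing the dual sketch above (my own illustration): the bias b can be recovered from any support vector via y_k (w^T x_k + b) = 1, and a new point is then classified using only the support vectors.

```python
sv = alpha > 1e-6                         # support vectors have non-zero alpha
b = y[sv][0] - X[sv][0] @ w               # since y_k is +/-1, b = y_k - w^T x_k on a support vector

def decision(z):
    # f(z) = sum_j alpha_tj * y_tj * (x_tj^T z) + b, summed over support vectors only
    return (alpha[sv] * y[sv]) @ (X[sv] @ z) + b

z = np.array([1.0, 0.5])
print("class 1" if decision(z) > 0 else "class 2")
```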
Some Notes
• There are theoretical upper bounds on the error on unseen data for SVM
• The larger the margin, the smaller the bound
• The smaller the number of SVs, the smaller the bound
• Note that in both training and testing, the data are referenced only through inner products, x^T y
• This is important for generalizing to the non-linear case
How About the Not Linearly Separable Case?
• We allow an “error” ξ_i in classification to tolerate some noisy data
(Figure: Class 1 and Class 2 data points that no straight line separates perfectly)
Soft Margin Hyperplane
• Define ξ_i = 0 if there is no error for x_i; the ξ_i are just “slack variables” in optimization theory
• We want to minimize (1/2)||w||^2 + C Σ_i ξ_i
• C is a tradeoff parameter between error and margin
• The optimization problem becomes: minimize (1/2)||w||^2 + C Σ_i ξ_i subject to y_i (w^T x_i + b) ≥ 1 - ξ_i and ξ_i ≥ 0
The Optimization Problem
• The dual of the problem is: maximize W(α) = Σ_i α_i - (1/2) Σ_i Σ_j α_i α_j y_i y_j x_i^T x_j subject to 0 ≤ α_i ≤ C and Σ_i α_i y_i = 0
• w is also recovered as w = Σ_i α_i y_i x_i
• The only difference from the linearly separable case is the upper bound C on the α_i
• Once again, a QP solver can be used to find the α_i
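In the cvxopt sketch above (again an illustration under my own assumptions), the only change for the soft-margin dual is the extra box constraint α_i ≤ C:

```python
# Soft-margin dual: the same QP, except 0 <= alpha_i <= C
C = 1.0                                                  # illustrative value of the tradeoff parameter
G_soft = matrix(np.vstack([-np.eye(n), np.eye(n)]))      # -a_i <= 0 and a_i <= C
h_soft = matrix(np.hstack([np.zeros(n), C * np.ones(n)]))

sol_soft = solvers.qp(P, q, G_soft, h_soft, A, b_eq)
alpha_soft = np.ravel(sol_soft["x"])
```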
Extension to Non-linear Decision Boundary
• In most situations, the decision boundary we are looking for should NOT be a straight line
(Figure: a mapping φ(·) takes each point from the input space to the feature space)
Extension to Non-linear Decision Boundary
• Key idea: use a function φ(x) to transform the x_i to a higher-dimensional space to “make life easier”
• Input space: the space the x_i live in
• Feature space: the space of the φ(x_i) after the transformation
• Search for a hyperplane in feature space that maximizes the margin
• A hyperplane in feature space corresponds to a curved boundary in input space
• Why transform? We still like the idea of maximizing the margin, and the resulting classifier is more flexible and more powerful
Transformation and Kernel
Kernel: Efficient Computation
• Define the kernel function K(x, y) as K(x, y) = φ(x)^T φ(y)
• For suitable transformations, K(x, y) can be computed directly from x and y without ever forming φ(x) explicitly (see the sketch below)
• In practice we don’t need to worry about the transformation function φ(x); what we have to do is select a good kernel for our problem
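An illustrative check (my own example, not from the slides): for 2-D inputs, the degree-2 polynomial kernel (x^T y)^2 equals the inner product after an explicit feature map φ, so the kernel value can be computed without ever building φ(x).

```python
import numpy as np

def phi(v):
    # Explicit degree-2 feature map for 2-D input (illustrative choice)
    return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

def poly2_kernel(x, z):
    # Computed directly in input space, without forming phi
    return (x @ z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
print(phi(x) @ phi(z))       # 1.0
print(poly2_kernel(x, z))    # 1.0 -- identical, but much cheaper in high dimension
```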
Examples of Kernel Functions
• Polynomial kernel with degree d: K(x, y) = (x^T y + 1)^d
• Radial basis function (RBF) kernel with width σ: K(x, y) = exp(-||x - y||^2 / (2σ^2))
• Closely related to radial basis function neural networks
• Research on different kernel functions for different applications is very active
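A small sketch of these two kernels as plain Python functions; the parameter values and the toy Gram-matrix computation are illustrative only.

```python
import numpy as np

def polynomial_kernel(x, z, d=3):
    return (x @ z + 1) ** d

def rbf_kernel(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

# Kernel (Gram) matrix over a small illustrative dataset
X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])
K = np.array([[rbf_kernel(a, c) for c in X] for a in X])
print(K.round(3))
```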
Summary: Steps for Classification
• Prepare the data matrix
• Select the kernel function to use
• Select the parameters of the kernel function and the value of C
• You can use the values suggested by the SVM software, or you can set apart a validation set to determine them
• Execute the training algorithm to obtain the α_i
• Unseen data can then be classified using the α_i and the support vectors
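These steps map onto a few lines with an off-the-shelf SVM library; the sketch below assumes scikit-learn and uses made-up data and parameter grids purely for illustration.

```python
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_classification

# Illustrative data; in practice this is your prepared data matrix
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Choose the kernel, then pick C and the kernel parameter on held-out folds
grid = GridSearchCV(SVC(kernel="rbf"),
                    param_grid={"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]},
                    cv=5)
grid.fit(X, y)                 # training solves the QP and finds the alpha_i internally
print(grid.best_params_)
print(grid.predict(X[:5]))     # unseen data would be classified the same way
```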
Classification Result of SVM
Conclusion
• SVMs are among the most popular tools for binary classification of numeric data
• Key ideas of SVM:
• Maximizing the margin leads to a “good” classifier
• Transformation to a higher-dimensional space makes the classifier more flexible
• The kernel trick makes the computation efficient
• Weaknesses of SVM:
• A “good” kernel function is needed
Resources
• http://www.kernel-machines.org/
• http://www.support-vector.net/
• http://www.support-vector.net/icml-tutorial.pdf
• http://www.kernel-machines.org/papers/tutorial-nips.ps.gz
• http://www.clopinet.com/isabelle/Projects/SVM/applist.html