
An Introduction to Support Vector Machines



  1. An Introduction to Support Vector Machines

  2. Outline • What is a good decision boundary for a binary classification problem? • From minimizing the misclassification error to maximizing the margin • Two classes, linearly inseparable • How to deal with noisy data • How to make SVM non-linear: kernels • Conclusion

  3. Two-Class Problem: Linearly Separable Case • Many decision boundaries can separate these two classes without misclassification • Which one should we choose? • The problem with minimizing only the misclassification error: many boundaries achieve zero training error, so that criterion alone does not tell us which one to pick (figure: Class 1 and Class 2 with several candidate separating lines)

  4. Maximizing the Margin • The decision boundary should be as far away from the data of both classes as possible • We should maximize the margin, m (figure: Class 1 and Class 2 separated by a boundary with margin m)

  5. The Optimization Problem • Let {x1, ..., xn} be our data set and let yi ∈ {1, -1} be the class label of xi • The decision boundary should classify all points correctly ⇒ yi (w^T xi + b) ≥ 1 for all i • Since the margin equals 2/||w||, maximizing it gives a constrained optimization problem: minimize (1/2)||w||^2 subject to yi (w^T xi + b) ≥ 1, i = 1, ..., n
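A minimal sketch of this constrained problem in Python, assuming a small linearly separable toy data set and the cvxpy QP solver (the data, variable names, and solver choice are illustrative additions, not from the slides):

```python
# Hard-margin SVM primal: minimize (1/2)||w||^2  subject to  y_i (w^T x_i + b) >= 1
import numpy as np
import cvxpy as cp

# Toy linearly separable data: class +1 around (2, 2), class -1 around (-2, -2)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(2.0, 0.5, (20, 2)), rng.normal(-2.0, 0.5, (20, 2))])
y = np.hstack([np.ones(20), -np.ones(20)])

w = cp.Variable(2)
b = cp.Variable()
problem = cp.Problem(
    cp.Minimize(0.5 * cp.sum_squares(w)),      # maximizing the margin = minimizing ||w||^2
    [cp.multiply(y, X @ w + b) >= 1],          # classify every point correctly
)
problem.solve()

print("w =", w.value, " b =", b.value)
print("margin m = 2/||w|| =", 2 / np.linalg.norm(w.value))
```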

  6. The Dual Problem • We can transform the problem to its dual: maximize W(α) = Σi αi - (1/2) Σi Σj αi αj yi yj xi^T xj subject to αi ≥ 0 and Σi αi yi = 0 • This is a quadratic programming (QP) problem • A global maximum over the αi can always be found • w can be recovered as w = Σi αi yi xi
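The dual can be handed to the same kind of QP solver; a hedged sketch on the same toy data as above (cvxpy again, with psd_wrap used only to avoid numerical non-PSD warnings):

```python
# Dual SVM: maximize  sum_i alpha_i - (1/2) sum_ij alpha_i alpha_j y_i y_j x_i^T x_j
#           subject to  alpha_i >= 0  and  sum_i alpha_i y_i = 0
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(2.0, 0.5, (20, 2)), rng.normal(-2.0, 0.5, (20, 2))])
y = np.hstack([np.ones(20), -np.ones(20)])

G = X * y[:, None]                       # row i is y_i x_i
Q = G @ G.T                              # Q_ij = y_i y_j x_i^T x_j
alpha = cp.Variable(len(y))
problem = cp.Problem(
    cp.Maximize(cp.sum(alpha) - 0.5 * cp.quad_form(alpha, cp.psd_wrap(Q))),
    [alpha >= 0, y @ alpha == 0],
)
problem.solve()

w = G.T @ alpha.value                    # w = sum_i alpha_i y_i x_i
print("support vector indices:", np.where(alpha.value > 1e-5)[0])
print("w =", w)
```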

  7. A Geometrical Interpretation (figure: the two classes with the maximum-margin boundary; each point is labeled with its αi, and only the three support vectors have non-zero values, α1 = 0.8, α6 = 1.4, α8 = 0.6, while all other αi are 0)

  8. Characteristics of the Solution • Many of the αi are zero • w is a linear combination of a small number of data points • This is a sparse representation • The xi with non-zero αi are called support vectors (SV) • The decision boundary is determined only by the SV • Let tj (j = 1, ..., s) be the indices of the s support vectors; we can write w = Σj αtj ytj xtj • For testing with a new data point z, compute w^T z + b = Σj αtj ytj (xtj^T z) + b and classify z as class 1 if the sum is positive and class 2 otherwise
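A brief illustration of this sparsity and of the support-vector sum for a new point z, using scikit-learn's SVC (the toy data and the very large C, used here to approximate a hard margin, are assumptions for the example):

```python
# Only the support vectors are needed to classify new data z:
#   w^T z + b = sum_j alpha_tj y_tj (x_tj^T z) + b
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(2.0, 0.5, (20, 2)), rng.normal(-2.0, 0.5, (20, 2))])
y = np.hstack([np.ones(20), -np.ones(20)])

clf = SVC(kernel="linear", C=1e6).fit(X, y)      # very large C ~ hard margin
print("support vectors:", len(clf.support_vectors_), "out of", len(X), "points")

z = np.array([1.0, 0.5])
# dual_coef_ stores alpha_tj * y_tj for the support vectors only
score = (clf.dual_coef_ @ (clf.support_vectors_ @ z) + clf.intercept_)[0]
print("SV sum:", score, " library decision value:", clf.decision_function([z]))
print("class 1" if score > 0 else "class 2")
```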

  9. Some Notes • There are theoretical upper bounds on the error of SVM on unseen data • The larger the margin, the smaller the bound • The smaller the number of SV, the smaller the bound • Note that in both training and testing, the data appear only through inner products, x^T y • This is what makes it possible to generalize to the non-linear case

  10. What About Data That Are Not Linearly Separable? • We allow an "error" ξi in classification to tolerate noisy data (figure: Class 1 and Class 2 overlapping, with some points on the wrong side of the boundary)

  11. Soft Margin Hyperplane • Define ξi = 0 if there is no error for xi • The ξi are just "slack variables" in optimization theory • We want to minimize (1/2)||w||^2 + C Σi ξi • C is the tradeoff parameter between error and margin • The optimization problem becomes: minimize (1/2)||w||^2 + C Σi ξi subject to yi (w^T xi + b) ≥ 1 - ξi and ξi ≥ 0 for all i

  12. The Optimization Problem • The dual of the problem is: maximize W(α) = Σi αi - (1/2) Σi Σj αi αj yi yj xi^T xj subject to 0 ≤ αi ≤ C and Σi αi yi = 0 • w is again recovered as w = Σi αi yi xi • The only difference from the linearly separable case is the upper bound C on the αi • Once again, a QP solver can be used to find the αi
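A small sketch of how the tradeoff parameter C behaves in practice, using scikit-learn's SVC on noisy, overlapping toy data (the data set and the particular C values are illustrative assumptions):

```python
# Soft-margin SVM: the upper bound C on alpha_i trades training error against margin
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(1.0, 1.0, (50, 2)), rng.normal(-1.0, 1.0, (50, 2))])
y = np.hstack([np.ones(50), -np.ones(50)])       # overlapping, noisy classes

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C:>6}: {len(clf.support_)} support vectors, "
          f"training accuracy {clf.score(X, y):.2f}")
# Small C tolerates more slack (wider margin, more bounded support vectors);
# large C penalizes slack heavily and behaves closer to the hard-margin case.
```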

  13. Extension to Non-linear Decision Boundary • In most situations, the decision boundary we are looking for should NOT be a straight line (figure: a transformation φ(.) mapping points from the input space to the feature space)

  14. Extension to Non-linear Decision Boundary • Key idea: use a transformation φ(x) to map the xi to a higher-dimensional space, to "make life easier" • Input space: the space the xi live in • Feature space: the space of the φ(xi) after the transformation • We search for a hyperplane in feature space that maximizes the margin • This hyperplane in feature space corresponds to a curved boundary in input space • Why transform? We still like the idea of maximizing the margin, and the classifier becomes more powerful and flexible

  15. Transformation and Kernel

  16. Kernel: Efficient Computation • Define the kernel function K(x, y) as the inner product in feature space, K(x, y) = φ(x)^T φ(y) • For a suitable transformation φ, this inner product can be computed directly from x and y without ever constructing φ(x) • In practice we don't need to worry about the transformation φ(x); what we have to do is select a good kernel for our problem
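A numeric check of this idea for one concrete case, the degree-2 polynomial kernel K(x, y) = (x^T y + 1)^2 and its explicit 6-dimensional feature map (this particular φ is an assumed example, not taken from the slide):

```python
# The kernel evaluates phi(x)^T phi(y) without ever constructing phi explicitly
import numpy as np

def phi(x):
    # Explicit feature map matching K(x, y) = (x^T y + 1)^2 for 2-D inputs
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2])

def K(x, y):
    return (x @ y + 1.0) ** 2

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])
print(phi(x) @ phi(y))   # inner product computed in the 6-D feature space
print(K(x, y))           # same value, computed directly in the 2-D input space
```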

  17. Examples of Kernel Functions • Polynomial kernel with degree d: K(x, y) = (x^T y + 1)^d • Radial basis function (RBF) kernel with width σ: K(x, y) = exp(-||x - y||^2 / (2σ^2)) • The RBF kernel is closely related to radial basis function neural networks • Research on kernel functions for different applications is very active
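In a library such as scikit-learn these two kernels correspond to the kernel, degree, coef0, and gamma parameters of SVC; a hedged sketch on an assumed toy problem with a circular class boundary (gamma = 1/(2σ^2) matches the RBF width σ above):

```python
# Using polynomial and RBF kernels with scikit-learn's SVC
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0).astype(int)   # non-linear (circular) boundary

poly = SVC(kernel="poly", degree=3, gamma=1.0, coef0=1.0).fit(X, y)  # (x^T y + 1)^3
sigma = 0.5
rbf = SVC(kernel="rbf", gamma=1.0 / (2 * sigma ** 2)).fit(X, y)      # exp(-||x-y||^2 / (2 sigma^2))

print("polynomial kernel, training accuracy:", poly.score(X, y))
print("RBF kernel,        training accuracy:", rbf.score(X, y))
```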

  18. Summary: Steps for Classification • Prepare the data matrix • Select the kernel function to use • Select the parameters of the kernel function and the value of C • You can use the values suggested by the SVM software, or set apart a validation set to determine them • Run the training algorithm to obtain the αi • Unseen data can then be classified using the αi and the support vectors
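A short end-to-end sketch of these steps in scikit-learn, using a held-out test set and cross-validated grid search in place of a manually chosen validation set (the data and parameter grid are illustrative):

```python
# Steps: data matrix -> choose kernel -> tune C and kernel parameter -> train -> predict
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 2))                          # the data matrix
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5).fit(X_train, y_train)

print("chosen parameters:", search.best_params_)       # C and kernel width
print("accuracy on unseen data:", search.score(X_test, y_test))
```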

  19. Classification result of SVM

  20. Conclusion • SVMs are among the most popular tools for binary classification of numeric data • Key ideas of SVM: maximizing the margin leads to a "good" classifier; transforming the data to a higher-dimensional space makes the classifier more flexible; the kernel trick keeps the computation efficient • Weakness of SVM: it needs a "good" kernel function

  21. Resources • http://www.kernel-machines.org/ • http://www.support-vector.net/ • http://www.support-vector.net/icml-tutorial.pdf • http://www.kernel-machines.org/papers/tutorial-nips.ps.gz • http://www.clopinet.com/isabelle/Projects/SVM/applist.html CSE 802. Prepared by Martin Law
