
Understanding Support Vector Machines: From Perceptrons to SVMs

This tutorial explores the evolution from Perceptrons to Support Vector Machines, discussing their differences, applications, and optimization. Learn about margins, kernels, linearization, and the role of support vectors in SVMs. Discover the importance of maximizing margins and the concept of simple classifiers based on large margins. Dive into the mathematical foundations, assumptions, and formulation of SVMs, making complex ideas more accessible. Uncover the power of feature expansion, Lagrange multipliers, and the Kernel Trick in improving SVM performance.

thunter

Presentation Transcript


  1. Support Vector Machines Piyush Kumar

  2. Perceptrons revisited Class 2 : (-1) Class 1 : (+1) Is this unique?
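The "Is this unique?" question can be made concrete with a small sketch. The code below is a plain textbook perceptron, not code from the presentation; the function name `train_perceptron` and the four sample points are assumptions made for illustration. Training on the same data in two different orders yields two different separating hyperplanes, both of which classify every point correctly.

```python
def train_perceptron(points, labels, max_epochs=100):
    """Classic perceptron updates; stops after an epoch with no mistakes."""
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(points, labels):
            activation = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * activation <= 0:  # misclassified (or on the boundary)
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                mistakes += 1
        if mistakes == 0:
            break
    return w, b


def separates(points, labels, w, b):
    """True if every point lies strictly on its own side of w.x + b = 0."""
    return all(
        y * (sum(wi * xi for wi, xi in zip(w, x)) + b) > 0
        for x, y in zip(points, labels)
    )


# Hypothetical toy data: class 1 is +1, class 2 is -1.
points = [(2.0, 1.0), (3.0, 3.0), (-1.0, -2.0), (-2.0, 0.5)]
labels = [1, 1, -1, -1]

w_a, b_a = train_perceptron(points, labels)
w_b, b_b = train_perceptron(points[::-1], labels[::-1])  # same data, reversed order
print(w_a, b_a)
print(w_b, b_b)
```

Both results separate the data, yet they differ, which is why a criterion for choosing among separators is needed.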

  3. Which one is the best? • Perceptron outputs :

  4. Perceptrons: What went wrong? • Slow convergence • Can overfit • Can't do complicated functions easily • Theoretical guarantees are not as strong

  5. Perceptron: The first NN • Proposed by Frank Rosenblatt in 1957 • Neural net researchers accuse Rosenblatt of promising ‘too much’ • Numerous variants • Also helps to study LP • One of the simplest neural networks.

  6. From Perceptrons to SVMs • Margins • Linearization • Kernels • Core-Sets and support vectors • Solvers: As simple as perceptrons!

  7. Support Vector Machines Margin

  8. Classification Margin • Distance from an example x to the separator w.x + b = 0 is r = |w.x + b| / ||w|| • Examples closest to the hyperplane are support vectors. • Margin ρ of the separator is the width of separation between classes.
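As a rough illustration of these definitions, the sketch below computes the distance of each labeled example to a given separator w.x + b = 0 and reports the closest examples as support vectors. The helper names, the tolerance, and the sample data are assumptions; the separator is simply given, not learned.

```python
import math


def distances_to_separator(points, labels, w, b):
    """Signed distance y * (w.x + b) / ||w||; positive means correctly classified."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return [
        y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
        for x, y in zip(points, labels)
    ]


def closest_examples(points, labels, w, b, tol=1e-9):
    """Return (smallest distance, indices of the examples attaining it)."""
    r = distances_to_separator(points, labels, w, b)
    r_min = min(r)
    support = [i for i, ri in enumerate(r) if ri - r_min <= tol]
    return r_min, support


# Hypothetical data separated by the vertical line x = 0.
points = [(1.0, 0.0), (3.0, 0.0), (-1.0, 0.0), (-2.0, 1.0)]
labels = [1, 1, -1, -1]
r_min, support = closest_examples(points, labels, w=(1.0, 0.0), b=0.0)
print(r_min, support)
```

Here the closest example on each side is at distance 1, so for this centered separator the width of separation between the classes is twice that value.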

  9. Support Vector Machines • Maximizing the margin is good according to intuition and PAC theory.

  10. Support Vector Machines • Implies that only support vectors are important; other training examples are ignorable. • Leads to Simple classifiers and hence better? (Simple = large margin)

  11. Let’s start some math… N samples: (x_1, y_1), …, (x_N, y_N), where y_j = +/- 1 are labels for the data. Can we find a hyperplane w.x + b = 0 that separates the two classes (labeled by y)? I.e.: for all j such that y_j = +1: w.x_j + b > 0; for all j such that y_j = -1: w.x_j + b < 0

  12. Further assumption 1 (which we will relax later!) Let's assume that the hyperplane that we are looking for passes through the origin

  13. Relax now!! Further assumption 2 • Let's assume that we are looking for a halfspace that contains a set of points

  14. Let's relax FA 1 now • “Homogenize” the coordinates by adding a new coordinate to the input. • Think of it as moving all the red and blue points into one higher dimension • From 2D to 3D it is just the x-y plane shifted to z = 1. This takes care of the “bias”, i.e. our assumption that the halfspace can pass through the origin.

  15. Relax now! Further Assumption 3 • Assume all points lie on a unit sphere! • If they do not after applying the transformations for FA 1 and FA 2, make them so.
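The three assumptions can be read as a preprocessing pipeline. The sketch below is one possible reading, with hypothetical helper names: append a coordinate equal to 1 (FA 1), multiply each point by its label so that a single halfspace through the origin must contain all points (FA 2), and rescale each point to unit length (FA 3). Rescaling by a positive factor does not change which side of a hyperplane through the origin a point lies on.

```python
import math


def homogenize(x):
    """FA 1: lift the point to the plane z = 1 in one higher dimension."""
    return tuple(x) + (1.0,)


def mirror(x, y):
    """FA 2: flip class -1 points through the origin so all labels become +1."""
    return tuple(y * xi for xi in x)


def to_unit_sphere(x):
    """FA 3: scale the point onto the unit sphere."""
    norm = math.sqrt(sum(xi * xi for xi in x))
    return tuple(xi / norm for xi in x)


def preprocess(points, labels):
    return [to_unit_sphere(mirror(homogenize(x), y)) for x, y in zip(points, labels)]


# Hypothetical data: one point per class.
points = [(2.0, 1.0), (-1.0, 0.0)]
labels = [1, -1]
transformed = preprocess(points, labels)
print(transformed)
```

After this step the learning problem only asks for a vector w with w.z > 0 for every transformed point z.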

  16. What did we want? • Maximize the margin. • What does it mean in the new space?

  17. What’s the new optimization problem? • Max |ρ| • subject to • x_i.w >= ρ • (Note that we have gotten rid of the y’s by mirroring around the origin). • Here w is a unit vector: ||w|| = 1.

  18. Same Problem • Min 1/ρ • subject to x_i.((1/ρ)w) >= 1 • Let v = (1/ρ) w • Then the constraint becomes x_i.v >= 1. • Objective = Min 1/ρ = Min || (1/ρ) w || = Min ||v||, which is the same as Min ||v||^2

  19. New formulation Min ||v||^2 Subject to : v.x_i >= 1 Using MATLAB, this is a piece of cake to solve. Decision boundary sign(w.x_i) Only for support vectors v.x_i = 1.
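The slide solves this problem with MATLAB. As a stand-in, here is a minimal pure-Python sketch that uses a different technique: coordinate ascent on the Lagrangian dual, which keeps one multiplier alpha_i >= 0 per constraint and maintains v as the sum of alpha_i x_i. The function name, the stopping tolerance, and the three sample points (assumed to be already mirrored) are all assumptions for illustration.

```python
def solve_min_norm(points, sweeps=1000, tol=1e-10):
    """Approximately minimize ||v||^2 subject to v.x_i >= 1 for every i."""
    alpha = [0.0] * len(points)
    v = [0.0] * len(points[0])
    sq_norms = [sum(xi * xi for xi in x) for x in points]
    for _ in range(sweeps):
        largest_change = 0.0
        for i, x in enumerate(points):
            # How far constraint i is from being tight at the current v.
            slack = 1.0 - sum(vi * xi for vi, xi in zip(v, x))
            new_alpha = max(0.0, alpha[i] + slack / sq_norms[i])
            delta = new_alpha - alpha[i]
            if delta != 0.0:
                v = [vi + delta * xi for vi, xi in zip(v, x)]
                alpha[i] = new_alpha
                largest_change = max(largest_change, abs(delta))
        if largest_change < tol:
            break
    return v, alpha


# Hypothetical mirrored points; all of them must satisfy v.x >= 1.
points = [(1.0, 2.0), (2.0, 1.0), (2.0, 2.0)]
v, alpha = solve_min_norm(points)
support = [i for i, a in enumerate(alpha) if a > 0]
print(v, alpha, support)
```

For this data the first two points end up with v.x_i = 1 and positive multipliers, while the third has a zero multiplier and v.x_i > 1, matching the remark that equality holds only for support vectors.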

  20. Support Vector Machines • Linear learning machines like perceptrons. • Map non-linearly to a higher dimension to overcome the linearity constraint. • Select between hyperplanes, using margin as a test (this is what perceptrons don’t do). • From learning theory, maximum margin is good

  21. We will revisit this soon… Another Reformulation Unlike perceptrons, SVMs have a unique solution but are harder to solve. <QP>

  22. Support Vector Machines • There are very simple algorithms to solve SVMs ( as simple as perceptrons ) • If you are interested in learning those, come and talk to me.

  23. Another twist : Linearization • If the data is separable with, say, a sphere, how would you use an SVM to separate it? (Ellipsoids?)

  24. Delaunay!?? Linearization a.k.a. Feature Expansion Lift the points to a paraboloid in one higher dimension. For instance, if the data is in 2D, (x,y) -> (x,y,x^2+y^2)
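A short sketch of why the lifting helps: a circle in the plane corresponds to a plane in the lifted space, so "inside or outside the circle" becomes a linear test. The helper names and the example circle are assumptions for illustration.

```python
def lift(p):
    """Map (x, y) onto the paraboloid z = x^2 + y^2."""
    x, y = p
    return (x, y, x * x + y * y)


def plane_for_circle(cx, cy, r):
    """Plane w.q + b = 0 in lifted space that matches the circle's boundary.

    Expanding (x - cx)^2 + (y - cy)^2 - r^2 gives
    z - 2*cx*x - 2*cy*y + cx^2 + cy^2 - r^2 with z = x^2 + y^2.
    """
    w = (-2.0 * cx, -2.0 * cy, 1.0)
    b = cx * cx + cy * cy - r * r
    return w, b


def side(p, w, b):
    """Negative inside the circle, positive outside, zero on the boundary."""
    return sum(wi * qi for wi, qi in zip(w, lift(p))) + b


w, b = plane_for_circle(1.0, 1.0, 2.0)  # hypothetical circle: center (1, 1), radius 2
print(side((1.0, 2.0), w, b))  # a point inside the circle
print(side((4.0, 1.0), w, b))  # a point outside the circle
```

A linear separator found in the lifted 3D space therefore acts as a circular separator in the original 2D data.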

  25. Linearization • Note that replacing x by φ(x) changes the decision boundary from w.x = 0 to w.φ(x) = 0 • This helps us get non-linear separators compared to linear separators when φ is non-linear (as in the last example). • Another feature expansion example: • (x,y) -> (x^2, xy, y^2, x, y) • What kind of separators are there?
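One way to approach the closing question: with the quadratic expansion, a linear boundary in feature space is a general second-degree curve in the plane, i.e. a conic section such as an ellipse, parabola, or hyperbola. The sketch below checks this for one ellipse. Appending a constant 1 to the feature vector (so that the boundary can be written without a separate bias term) is an assumption added here, as are the function names.

```python
def expand(p):
    """Quadratic feature expansion, with a constant 1 appended for the bias."""
    x, y = p
    return (x * x, x * y, y * y, x, y, 1.0)


def decision_value(p, w):
    """Value of w.phi(p); its sign tells which side of the boundary p is on."""
    return sum(wi * fi for wi, fi in zip(w, expand(p)))


# Hypothetical weights encoding the ellipse x^2/4 + y^2 = 1.
w_ellipse = (0.25, 0.0, 1.0, 0.0, 0.0, -1.0)

print(decision_value((0.0, 0.0), w_ellipse))  # center of the ellipse
print(decision_value((3.0, 0.0), w_ellipse))  # outside the ellipse
print(decision_value((2.0, 0.0), w_ellipse))  # on the ellipse
```

Changing the weights changes the conic: a nonzero xy weight rotates it, and opposite signs on the x^2 and y^2 weights give a hyperbola.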

  26. Linearization • The more features, the more power. • There is a danger of overfitting. • When there are a lot of features (sometimes even infinitely many), we can use the “kernel trick” to solve the optimization problem faster. • Let's look back at optimization for a moment again…

  27. Lagrange Multipliers

  28. Lagrangian function

  29. At optimum

  30. More precisely

  31. The optimization Problem Revisited

  32. Removing v

  33. Support Vectors v is a linear combination of ‘some examples’ or support vectors. More than likely if we see too many support vectors, we are overfitting. Simple and Short classifiers are preferable.

  34. Substitution

  35. Gram Matrix

  36. The decision surface Recovered

  37. What is Gram Matrix reduction good for? • The Kernel Trick • Even if the number of features is infinite, G might still be small and hence the optimization problem solvable. • We could compute G without computing X, at least sometimes (by redefining the dot product in the feature space).
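To illustrate computing G without computing the expanded features, the sketch below compares two routes to the same Gram matrix for a degree-2 feature map: explicit expansion followed by dot products, and the shortcut k(p, q) = (p.q)^2 evaluated on the raw 2D points. The particular feature map (with its sqrt(2) factor) and the sample points are assumptions chosen so that the identity holds exactly.

```python
import math


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def phi(p):
    """Explicit degree-2 feature map for which phi(p).phi(q) = (p.q)^2."""
    x, y = p
    return (x * x, math.sqrt(2.0) * x * y, y * y)


def poly2_kernel(p, q):
    """Same inner product, computed without ever building phi."""
    return dot(p, q) ** 2


def gram(points, inner):
    """Matrix of pairwise inner products under the given inner-product function."""
    return [[inner(p, q) for q in points] for p in points]


points = [(1.0, 2.0), (3.0, -1.0), (0.5, 0.5)]  # hypothetical data
g_explicit = gram(points, lambda p, q: dot(phi(p), phi(q)))
g_kernel = gram(points, poly2_kernel)
print(g_explicit)
print(g_kernel)
```

The Gram matrix is always N x N for N examples, regardless of how many features the expansion has, which is why the dual formulation stays tractable.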

  38. Recall

  39. The kernel Matrix • The trick that the ML community uses for linearization is to use a function that redefines distances between points. • Example : • The optimization problem no longer needs φ to be explicitly evaluated. As long as we can figure out the distance between two mapped points, it's enough.
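A minimal sketch of training without ever evaluating the feature map: the solver below touches the data only through kernel values. It uses a Gaussian (RBF) kernel, a bias-free decision function, and coordinate ascent on the dual; all three are choices made for this illustration and are not taken from the slides. The XOR-style data set is likewise hypothetical, picked because no line in the plane separates it.

```python
import math


def rbf_kernel(p, q, gamma=1.0):
    """Gaussian kernel; corresponds to an infinite-dimensional feature map."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(p, q)))


def train_kernel_machine(points, labels, kernel, sweeps=500, tol=1e-10):
    """Coordinate ascent on the dual; returns one multiplier per example."""
    n = len(points)
    K = [[kernel(p, q) for q in points] for p in points]
    alpha = [0.0] * n
    for _ in range(sweeps):
        largest_change = 0.0
        for i in range(n):
            f_i = sum(alpha[j] * labels[j] * K[j][i] for j in range(n))
            new_alpha = max(0.0, alpha[i] + (1.0 - labels[i] * f_i) / K[i][i])
            largest_change = max(largest_change, abs(new_alpha - alpha[i]))
            alpha[i] = new_alpha
        if largest_change < tol:
            break
    return alpha


def predict(x, points, labels, alpha, kernel):
    """Sign of the kernel expansion sum_i alpha_i * y_i * k(x_i, x)."""
    score = sum(a * y * kernel(p, x) for a, y, p in zip(alpha, labels, points))
    return 1 if score >= 0 else -1


points = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0), (1.0, 0.0)]
labels = [1, 1, -1, -1]
alpha = train_kernel_machine(points, labels, rbf_kernel)
predictions = [predict(x, points, labels, alpha, rbf_kernel) for x in points]
print(alpha, predictions)
```

Swapping `rbf_kernel` for another valid kernel changes the family of decision surfaces without changing the solver.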

  40. Example Kernels

  41. The decision Surface?

  42. A demo using libsvm • Some implementations of SVM • libsvm • svmlight • svmtorch

  43. Checkerboard Dataset

  44. k-Nearest Neighbor Algorithm

  45. LSVM on Checkerboard

  46. Conclusions • SVM is a step towards improving perceptrons • They use a large margin for good generalization • In order to make large feature expansions, we can use the Gram matrix formulation of the optimization problem (or use kernels). • SVMs are popular classifiers because they achieve good accuracy on real-world data.

  47. Geometric Solvers for SVM • Frank-Wolfe algorithm • A.k.a. Gilbert’s algorithm
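A sketch of the geometric idea, under the assumption that the points have already been mirrored so that a halfspace through the origin should contain them all: Gilbert's algorithm searches for the point of the convex hull closest to the origin, and the direction of that point is the maximum-margin direction. The function name, iteration cap, and tolerance are assumptions; this is a textbook form of the iteration, not code from the presentation.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def gilbert_min_norm_point(points, max_iters=1000, tol=1e-12):
    """Approximate the point of conv(points) that is closest to the origin."""
    x = list(points[0])
    for _ in range(max_iters):
        # Hull vertex that is most extreme in the direction of -x.
        p = min(points, key=lambda q: dot(x, q))
        gap = dot(x, x) - dot(x, p)
        if gap <= tol:
            break  # no vertex improves on x, so x is (near) optimal
        # Move to the point of segment [x, p] that is closest to the origin.
        d = [pi - xi for pi, xi in zip(p, x)]
        t = min(1.0, max(0.0, -dot(x, d) / dot(d, d)))
        x = [xi + t * di for xi, di in zip(x, d)]
    return x


points = [(1.0, 2.0), (2.0, 1.0), (2.0, 2.0)]  # hypothetical mirrored points
closest = gilbert_min_norm_point(points)
scale = dot(closest, closest)
v = [c / scale for c in closest]  # rescaled so that v.x_i >= 1 holds
print(closest, v)
```

Each iteration needs only dot products and one linear scan over the points, which is the sense in which such solvers are about as simple as a perceptron.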

  48. Recall • Minkowski sum • The sweeping of one convex object with another • Defined as: A ⊕ B = {a + b : a ∈ A, b ∈ B}
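For convex polygons the Minkowski sum can be computed from the vertices alone: take every pairwise vertex sum and keep the convex hull. The sketch below does exactly that, using Andrew's monotone chain for the hull; the function names and the square-plus-triangle example are assumptions for illustration, and faster merge-based methods exist.

```python
def cross(o, a, b):
    """Z-component of (a - o) x (b - o); positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def minkowski_sum(poly_a, poly_b):
    """Vertices of A + B = {a + b} for two convex polygons given by their vertices."""
    sums = [(ax + bx, ay + by) for ax, ay in poly_a for bx, by in poly_b]
    return convex_hull(sums)


square = [(0, 0), (1, 0), (1, 1), (0, 1)]
triangle = [(0, 0), (1, 0), (0, 1)]
result = minkowski_sum(square, triangle)
print(result)
```

Sweeping the triangle around the square yields a pentagon, with one edge contributed by the triangle's slanted side.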
