
Support Vector Machines (and Kernel Methods in general)



Presentation Transcript


  1. Support Vector Machines (and Kernel Methods in general) Machine Learning

  2. Last Time • Multilayer Perceptron/Logistic Regression Networks • Neural Networks • Error Backpropagation

  3. Today • Support Vector Machines • Note: we’ll rely on some math from optimization theory that we won’t derive.

  4. Maximum Margin • Perceptron (and other linear classifiers) can lead to many equally valid choices for the decision boundary. • Are these really “equally valid”?

  5. Max Margin • How can we pick which is best? • Maximize the size of the margin. (Figure: small margin vs. large margin.) • Are these really “equally valid”?

  6. Support Vectors • Support Vectors are those input points (vectors) closest to the decision boundary • 1. They are vectors • 2. They “support” the decision hyperplane

  7. Support Vectors • Define this as a decision problem • The decision hyperplane: $\vec{w}^T\vec{x} + b = 0$ • No fancy math, just the equation of a hyperplane.

  8. Support Vectors • Aside: Why do some classifiers use labels in {0, 1} and others labels in {−1, +1}? • Simplicity of the math and interpretation. • For probability density function estimation, {0, 1} has a clear correlate. • For classification, a decision boundary of 0 is more easily interpretable than 0.5.

  9. Support Vectors • Define this as a decision problem • The decision hyperplane: • Decision Function:
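
A minimal sketch of the decision function, assuming the common form sign(w·x + b) with labels in {−1, +1}. The weights, bias, and test points are made-up values for illustration only.

```python
import numpy as np

def decision_function(w, b, x):
    """Return +1 or -1 depending on which side of the hyperplane w.x + b = 0 x lies."""
    return 1 if np.dot(w, x) + b >= 0 else -1

# Hypothetical hyperplane parameters (not taken from the slides).
w = np.array([1.0, -1.0])
b = 0.5

label_pos = decision_function(w, b, np.array([2.0, 0.0]))  # w.x + b = 2.5
label_neg = decision_function(w, b, np.array([0.0, 2.0]))  # w.x + b = -1.5
```

Points exactly on the hyperplane are assigned +1 here; that tie-breaking rule is an arbitrary choice.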

  10. Support Vectors • Define this as a decision problem • The decision hyperplane: • Margin hyperplanes:

  11. Support Vectors • The decision hyperplane: • Scale invariance

  12. Support Vectors • The decision hyperplane: • Scale invariance

  13. Support Vectors • The decision hyperplane: • Scale invariance • This scaling does not change the decision hyperplane or the support vector hyperplanes, but it lets us eliminate a variable from the optimization.
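
The scale invariance can be checked numerically. The sketch below assumes the hyperplane is w·x + b = 0 and uses made-up data: multiplying w and b by any positive constant c leaves every predicted sign unchanged, so we are free to pick the scale at which the closest point satisfies |w·x + b| = 1.

```python
import numpy as np

# Hypothetical data points and hyperplane parameters.
X = np.array([[2.0, 0.0], [0.0, 2.0], [1.0, 1.0], [3.0, -1.0]])
w = np.array([1.0, -1.0])
b = 0.5

# Scaling by a positive constant does not change which side each point is on.
c = 7.0
signs = np.sign(X @ w + b)
signs_scaled = np.sign(X @ (c * w) + c * b)
same = bool(np.array_equal(signs, signs_scaled))

# Choose the scale so that the closest point has |w.x + b| = 1.
scale = 1.0 / np.min(np.abs(X @ w + b))
w_c, b_c = scale * w, scale * b
min_abs = np.min(np.abs(X @ w_c + b_c))
```

Fixing the scale this way is what removes the margin width as a separate variable: the margin can then be written in terms of w alone.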

  14. What are we optimizing? • We will represent the size of the margin in terms of w. • This will allow us to simultaneously • Identify a decision boundary • Maximize the margin

  15. How do we represent the size of the margin in terms of w? • There must be at least one point that lies on each support hyperplane. • Proof outline: If not, we could define a larger-margin support hyperplane that does touch the nearest point(s).

  16. How do we represent the size of the margin in terms of w? • There must be at least one point that lies on each support hyperplane. • Proof outline: If not, we could define a larger-margin support hyperplane that does touch the nearest point(s).

  17. How do we represent the size of the margin in terms of w? • There must be at least one point that lies on each support hyperplane. • Thus: And:

  18. How do we represent the size of the margin in terms of w? • There must be at least one point that lies on each support hyperplane. • Thus: And:

  19. How do we represent the size of the margin in terms of w? • The vector w is perpendicular to the decision hyperplane • If the dot product of two vectors equals zero, the two vectors are perpendicular.

  20. How do we represent the size of the margin in terms of w? • The margin is the projection of x1 – x2 onto w, the normal of the hyperplane.

  21. Aside: Vector Projection

  22. How do we represent the size of the margin in terms of w? • The margin is the projection of x1 – x2 onto w, the normal of the hyperplane. Projection: Size of the Margin:
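
A small numeric check of the projection argument, assuming the scaled support hyperplanes are w·x + b = +1 and w·x + b = −1. The values of w, b, x1 and x2 are made up so that x1 and x2 lie exactly on those two hyperplanes.

```python
import numpy as np

# Hypothetical parameters with ||w|| = 5.
w = np.array([3.0, 4.0])
b = -1.0

x1 = np.array([2.0, -1.0])  # w.x1 + b = 6 - 4 - 1 = +1
x2 = np.array([4.0, -3.0])  # w.x2 + b = 12 - 12 - 1 = -1

# Scalar projection of (x1 - x2) onto the unit normal w / ||w||.
margin_by_projection = np.dot(w, x1 - x2) / np.linalg.norm(w)

# Since w.(x1 - x2) = (1 - b) - (-1 - b) = 2, the margin is 2 / ||w||.
margin_by_formula = 2.0 / np.linalg.norm(w)
```

Both quantities come out to 0.4 here, which is why maximizing the margin amounts to minimizing the norm of w.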

  23. Maximizing the margin • Goal: maximize the margin • Constraint: linear separability of the data by the decision boundary

  24. Max Margin Loss Function • For constrained optimization, use Lagrange multipliers • Optimize the “Primal”

  25. Max Margin Loss Function • Optimize the “Primal” • Partial derivative with respect to b

  26. Max Margin Loss Function • Optimize the “Primal” • Partial derivative with respect to w

  27. Max Margin Loss Function • Optimize the “Primal” • Partial derivative with respect to w • Now we have to find αi. • Substitute back into the loss function

  28. Max Margin Loss Function • Construct the “dual”
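
The steps from the primal to the dual can be sketched as follows. This is the standard hard-margin derivation, written under the assumption that labels are $t_i \in \{-1, +1\}$ and the support hyperplanes are scaled to $\pm 1$; the symbols $t_i$, $L$ and $W$ are notation chosen here for illustration.

```latex
% Primal problem: maximize the margin 2/||w|| by minimizing ||w||^2 / 2.
\min_{\vec{w}, b} \; \tfrac{1}{2}\,\vec{w}^T\vec{w}
\quad \text{subject to} \quad t_i(\vec{w}^T\vec{x}_i + b) \ge 1 \;\; \forall i

% Lagrangian with one multiplier \alpha_i \ge 0 per constraint.
L(\vec{w}, b, \alpha) = \tfrac{1}{2}\,\vec{w}^T\vec{w}
  - \sum_i \alpha_i \left[ t_i(\vec{w}^T\vec{x}_i + b) - 1 \right]

% Setting the partial derivatives to zero.
\frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_i \alpha_i t_i = 0
\qquad
\frac{\partial L}{\partial \vec{w}} = 0 \;\Rightarrow\; \vec{w} = \sum_i \alpha_i t_i \vec{x}_i

% Substituting back gives the dual, which depends on the data only through dot products.
W(\alpha) = \sum_i \alpha_i
  - \tfrac{1}{2} \sum_i \sum_j \alpha_i \alpha_j t_i t_j \, \vec{x}_i^T \vec{x}_j
\quad \text{subject to} \quad \alpha_i \ge 0, \;\; \sum_i \alpha_i t_i = 0
```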

  29. Dual formulation of the error • Optimize this quadratic program to identify the Lagrange multipliers and thus the weights • There exist (rather) fast approaches to quadratic optimization in C, C++, Python, Java and R

  30. Quadratic Programming • If Q is positive semi-definite, then f(x) is convex. • If f(x) is convex, then any local minimum is a global minimum.
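
To illustrate the convexity condition, the sketch below builds the matrix Q of the SVM dual, assuming Q_ij = t_i t_j x_i·x_j with labels t_i in {−1, +1}, and checks that its eigenvalues are non-negative. The data are made up. Maximizing the dual is the same as minimizing ½ αᵀQα − Σα, which is convex when Q is positive semi-definite.

```python
import numpy as np

# Hypothetical training points and labels.
X = np.array([[1.0, 1.0], [-1.0, -1.0], [3.0, 3.0]])
t = np.array([1.0, -1.0, 1.0])

# Q is the Gram matrix of the label-scaled points, so it is PSD by construction.
Z = t[:, None] * X
Q = Z @ Z.T

# eigvalsh is appropriate because Q is symmetric; allow a small numerical tolerance.
eigenvalues = np.linalg.eigvalsh(Q)
is_psd = bool(np.all(eigenvalues >= -1e-9))
```

Because Q is always a Gram matrix of this form, the check passes for any data set, not only this toy one.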

  31. Support Vector Expansion • When αi is non-zero, xi is a support vector • When αi is zero, xi is not a support vector • New decision function: independent of the dimension of x!

  32. Kuhn-Tucker Conditions • In constrained optimization: At the optimal solution • Constraint * Lagrange Multiplier = 0 • Only points on the margin hyperplanes (the support vectors) contribute to the solution!
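
A minimal sketch of the support vector expansion, assuming the decision function sign(Σ αi ti xi·x + b) with labels ti in {−1, +1}. The multipliers below are the hand-computed hard-margin solution for this made-up three-point data set, not the output of a solver.

```python
import numpy as np

def svm_decision(alphas, targets, X_train, b, x):
    """Classify x using only the training points with non-zero alpha."""
    sv = alphas > 1e-8  # mask selecting the support vectors
    score = np.sum(alphas[sv] * targets[sv] * (X_train[sv] @ x)) + b
    return 1 if score >= 0 else -1

# Hypothetical training set: the third point lies well outside the margin.
X_train = np.array([[1.0, 1.0], [-1.0, -1.0], [3.0, 3.0]])
targets = np.array([1.0, -1.0, 1.0])
alphas = np.array([0.25, 0.25, 0.0])  # zero alpha: not a support vector
b = 0.0

n_support = int(np.sum(alphas > 1e-8))
pred_pos = svm_decision(alphas, targets, X_train, b, np.array([2.0, 0.0]))
pred_neg = svm_decision(alphas, targets, X_train, b, np.array([-2.0, 0.0]))
```

The new point enters only through dot products with the support vectors, so the cost of a prediction grows with the number of support vectors rather than with an explicit weight vector.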
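
The “constraint times multiplier equals zero” condition can be verified on a toy problem. The sketch assumes the constraint is ti (w·xi + b) − 1 ≥ 0 and reuses a hand-computed hard-margin solution for three made-up points.

```python
import numpy as np

# Hypothetical data and its hard-margin solution.
X = np.array([[1.0, 1.0], [-1.0, -1.0], [3.0, 3.0]])
t = np.array([1.0, -1.0, 1.0])
alphas = np.array([0.25, 0.25, 0.0])
b = 0.0

# Recover the weights from the multipliers: w = sum_i alpha_i t_i x_i.
w = (alphas * t) @ X

# Constraint value per point: zero on a margin hyperplane, positive beyond it.
constraint = t * (X @ w + b) - 1.0

# Complementary slackness: each product must vanish at the optimum.
products = alphas * constraint
```

The first two points sit on the margin hyperplanes (constraint value 0, non-zero alpha), while the third has a positive constraint value and therefore a zero multiplier.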

  33. Visualization of Support Vectors

  34. Interpretability of SVM parameters • What else can we tell from alphas? • If alpha is large, then the associated data point is quite important. • It’s either an outlier, or incredibly important. • But this only gives us the best solution for linearly separable data sets…

  35. Basis of Kernel Methods • The decision process doesn’t depend on the dimensionality of the data. • We can map to a higher dimensionality of the data space. • Note: data points only appear within a dot product. • The error is based on the dot product of data points – not the data points themselves.

  36. Basis of Kernel Methods • Data points only appear within a dot product. • Thus we can map to another space by replacing that dot product. • The error is based on the dot product of data points – not the data points themselves.
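
As an illustration of replacing the dot product, the sketch below uses one well-known example, the degree-2 polynomial kernel K(x, z) = (x·z)², and the explicit feature map φ(x) = (x1², √2·x1·x2, x2²) that it corresponds to for 2-D inputs. The function names and input vectors are chosen here for illustration.

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for a 2-D input."""
    return np.array([x[0] ** 2, np.sqrt(2.0) * x[0] * x[1], x[1] ** 2])

def poly2_kernel(x, z):
    """Same quantity computed in the original space, without building phi."""
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])

explicit = np.dot(phi(x), phi(z))  # dot product in the mapped 3-D space
implicit = poly2_kernel(x, z)      # (1*3 + 2*4)^2 = 121
```

Both routes give the same number, so any algorithm that touches the data only through dot products can work in the mapped space without ever computing φ.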

  37. Learning Theory bases of SVMs • Theoretical bounds on testing error. • The upper bound doesn’t depend on the dimensionality of the space • The upper bound is minimized by maximizing the margin, γ, associated with the decision boundary.

  38. Why we like SVMs • They work • Good generalization • Easily interpreted. • Decision boundary is based on the data in the form of the support vectors. • Not so in multilayer perceptron networks • Principled bounds on testing error from Learning Theory (VC dimension)

  39. SVM vs. MLP • SVMs have many fewer parameters • SVM: Maybe just a kernel parameter • MLP: Number and arrangement of nodes and eta learning rate • SVM: Convex optimization task • MLP: likelihood is non-convex -- local minima

  40. Soft margin classification • There can be outliers on the other side of the decision boundary, or leading to a small margin. • Solution: Introduce a penalty term to the constraint function

  41. Soft Margin Dual • Still Quadratic Programming!
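
For reference, the usual soft-margin formulation can be sketched as below. It assumes labels $t_i \in \{-1, +1\}$, slack variables $\xi_i$, and a penalty weight $C$; these symbols are notation chosen here for illustration.

```latex
% Soft-margin primal: each point may violate the margin by \xi_i, at a cost C \xi_i.
\min_{\vec{w}, b, \xi} \; \tfrac{1}{2}\,\vec{w}^T\vec{w} + C \sum_i \xi_i
\quad \text{subject to} \quad
t_i(\vec{w}^T\vec{x}_i + b) \ge 1 - \xi_i, \;\; \xi_i \ge 0

% Soft-margin dual: the same quadratic objective as the hard-margin case,
% with the multipliers now bounded above by C.
\max_{\alpha} \; \sum_i \alpha_i
  - \tfrac{1}{2} \sum_i \sum_j \alpha_i \alpha_j t_i t_j \, \vec{x}_i^T \vec{x}_j
\quad \text{subject to} \quad
0 \le \alpha_i \le C, \;\; \sum_i \alpha_i t_i = 0
```

The only change in the dual is the upper bound on each multiplier, which is why the problem remains a quadratic program.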

  42. Soft margin example • Points are allowed within the margin, but cost is introduced. Hinge Loss
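
A minimal sketch of the hinge loss, assuming the usual definition max(0, 1 − t·f(x)) with labels t in {−1, +1} and f(x) the raw (unthresholded) classifier output. The example scores are made up.

```python
import numpy as np

def hinge_loss(t, score):
    """Zero outside the margin on the correct side, growing linearly otherwise."""
    return np.maximum(0.0, 1.0 - t * score)

loss_outside = hinge_loss(1.0, 2.0)   # correct side, outside the margin
loss_inside = hinge_loss(1.0, 0.5)    # correct side, but inside the margin
loss_wrong = hinge_loss(1.0, -1.0)    # wrong side of the decision boundary
```

Points inside the margin are penalized even when they are classified correctly, which matches the idea that they are allowed but come at a cost.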

  43. Probabilities from SVMs • Support Vector Machines are discriminant functions • Discriminant functions: f(x) = c • Discriminative models: f(x) = argmax_c p(c|x) • Generative models: f(x) = argmax_c p(x|c)p(c)/p(x) • No (principled) probabilities from SVMs • SVMs are not based on probability distribution functions of class instances.

  44. Efficiency of SVMs • Not especially fast. • Training – O(n^3) • Quadratic Programming efficiency • Evaluation – O(n) • Need to evaluate against each support vector (potentially n)

  45. Good Bye • Next time: • The Kernel “Trick” -> Kernel Methods • or • How can we use SVMs on data that are not linearly separable?
