Support Vector Machines

Presentation Transcript


  1. Support Vector Machines Lecturer: Yishay Mansour Itay Kirshenbaum

  2. Lecture Overview In this lecture we present in detail one of the most theoretically well-motivated and practically effective classification algorithms in modern machine learning: Support Vector Machines (SVMs).

  3. Lecture Overview – Cont. • We begin by building the intuition behind SVMs • We then define the SVM as an optimization problem and discuss how to solve it efficiently • We conclude with an analysis of the error rate of SVMs using two techniques: Leave-One-Out and VC-dimension.

  4. Introduction • The Support Vector Machine is a supervised learning algorithm • It is used to learn a hyperplane that solves the binary classification problem • Binary classification is among the most extensively studied problems in machine learning.

  5. Binary Classification Problem • Input space: points in R^n • Output space: labels in {-1, +1} • Training data: m labeled examples • S is drawn i.i.d. from a distribution D • Goal: select a hypothesis that best predicts the labels of new points drawn i.i.d. from D
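
A standard formalization of this setup, in the usual notation (assumed here, not taken verbatim from the slides):

    \mathcal{X} \subseteq \mathbb{R}^n \ \text{(input space)}, \qquad
    \mathcal{Y} = \{-1, +1\} \ \text{(output space)},
    \qquad S = \{(x_1, y_1), \ldots, (x_m, y_m)\}, \quad (x_i, y_i) \sim D \ \text{i.i.d.}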

  6. Binary Classification – Cont. • Consider the problem of predicting the success of a new drug based on a patient's height and weight • m ill patients are selected and treated • This generates m two-dimensional vectors (height, weight) • Each point is labeled +1 if the treatment was successful and -1 otherwise • These labeled points can be used as training data

  7. Binary classification – Cont. • There are infinitely many ways to classify the data • Occam's razor – simple classification rules tend to give better results • We therefore use a linear classifier, i.e. a hyperplane • Our class of linear classifiers: hypotheses of the form sign(w*x + b), as sketched below
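
A minimal sketch of such a linear classifier in Python; the weight vector, bias, and sample points below are made-up illustrations, not values from the lecture:

    import numpy as np

    def predict(w, b, X):
        # Label each row x of X with sign(w·x + b).
        # (np.sign returns 0 for points exactly on the decision boundary.)
        return np.sign(X @ w + b)

    # Hypothetical (height, weight) points from the drug example.
    X = np.array([[170.0, 65.0], [160.0, 90.0]])
    w = np.array([0.02, -0.03])   # made-up weights
    b = -1.0                      # made-up bias
    print(predict(w, b, X))       # -> [ 1. -1.]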

  8. Choosing a Good Hyperplane • Intuition • Consider two cases of positive classification: • w*x + b = 0.1 • w*x + b = 100 • We are more confident in the decision made by the latter than in the former • So we choose a hyperplane with maximal margin

  9. Good Hyperplane – Cont. • Definition: the functional margin of a linear classifier (w, b), with respect to a single example and with respect to the training set S:
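
In the usual notation (assumed), the functional margin with respect to an example and with respect to S is:

    \hat{\gamma}_i = y_i (w \cdot x_i + b), \qquad
    \hat{\gamma} = \min_{i = 1, \ldots, m} \hat{\gamma}_i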

  10. Maximal Margin • w and b can be scaled to increase the functional margin • sign(w*x + b) = sign(5w*x + 5b) for all x • yet the functional margin of (5w, 5b) is 5 times that of (w, b) • We cope with this by adding an additional constraint: • ||w|| = 1

  11. Maximal Margin – Cont. • Geometric Margin • Consider the geometric distance between the hyperplane and the closest points

  12. Geometric Margin • Definition: the geometric margin with respect to a single example • Definition: the geometric margin with respect to S • Relation to the functional margin • The two are equal when ||w|| = 1
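
In the same assumed notation, the geometric margin and its relation to the functional margin are:

    \gamma_i = y_i \left( \frac{w}{\|w\|} \cdot x_i + \frac{b}{\|w\|} \right), \qquad
    \gamma = \min_{i} \gamma_i = \frac{\hat{\gamma}}{\|w\|},

so the two coincide exactly when \|w\| = 1.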

  13. The Algorithm • We saw: • two definitions of the margin • the intuition behind seeking a margin-maximizing hyperplane • Goal: write an optimization program that finds such a hyperplane • We always look for the (w, b) maximizing the margin

  14. The Algorithm – Take 1 • First try: • Idea: maximize the margin, requiring every sample's functional margin to be at least that value, under the constraint ||w|| = 1 • With ||w|| = 1 the functional and geometric margins are the same • This gives the largest possible geometric margin with respect to the training set
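
Written out in standard form (notation assumed), the first attempt is:

    \max_{\gamma, w, b} \ \gamma
    \quad \text{s.t.} \quad y_i (w \cdot x_i + b) \ge \gamma \ \ \forall i, \qquad \|w\| = 1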

  15. The Algorithm – Take 2 • The first try can't be solved by any off-the-shelf optimization software • The constraint ||w|| = 1 is non-linear • In fact, it's even non-convex • How can we discard this constraint? • Use the geometric margin!

  16. The Algorithm – Take 3 • We now have a non-convex objective function – the problem remains • Remember: we can scale (w, b) as we wish • Force the functional margin to be 1 • The objective function then becomes maximizing 1 / ||w|| • This is the same as minimizing (1/2)||w||^2 • The factor of 0.5 and the power of 2 do not change the optimum – they just make things easier

  17. The algorithm – Final version • The final program (sketched below): • The objective is convex (quadratic) • All constraints are linear • It can be solved efficiently using standard quadratic programming (QP) software
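
A minimal sketch of the final program solved with cvxpy, an off-the-shelf convex/QP solver (not a tool mentioned in the lecture); the toy data below is made up and linearly separable:

    import numpy as np
    import cvxpy as cp

    # Made-up, linearly separable toy data.
    X = np.array([[2.0, 2.0], [1.0, 3.0], [-1.0, -1.0], [-2.0, 0.0]])
    y = np.array([1.0, 1.0, -1.0, -1.0])

    w = cp.Variable(X.shape[1])
    b = cp.Variable()

    # minimize (1/2) ||w||^2   subject to   y_i (w·x_i + b) >= 1  for all i
    objective = cp.Minimize(0.5 * cp.sum_squares(w))
    constraints = [cp.multiply(y, X @ w + b) >= 1]
    cp.Problem(objective, constraints).solve()

    print("w =", w.value, "b =", b.value)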

  18. Convex Optimization • We want to solve the optimization problem more efficiently than generic QP • Solution – Use convex optimization techniques

  19. Convex Optimization – Cont. • Definition: A convex function • Theorem
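
For reference, the standard definition of a convex function (assumed to be the one intended above):

    f : \mathbb{R}^n \to \mathbb{R} \ \text{is convex if} \
    f(\lambda x + (1 - \lambda) y) \le \lambda f(x) + (1 - \lambda) f(y)
    \quad \forall x, y, \ \forall \lambda \in [0, 1]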

  20. Convex Optimization Problem • A convex optimization problem • We look for a value of x that • minimizes a convex function f(x) • under the constraints g_i(x) ≤ 0, with each g_i convex

  21. Lagrange Multipliers • Used to find maxima or minima of a function subject to constraints • We use them to solve our optimization problem • Definition:
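
For a problem of the form min f(x) subject to g_i(x) ≤ 0, the Lagrangian in its standard form (notation assumed) is:

    \mathcal{L}(x, \alpha) = f(x) + \sum_{i=1}^{k} \alpha_i \, g_i(x), \qquad \alpha_i \ge 0,

where the \alpha_i are the Lagrange multipliers.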

  22. Primal Program • Plan • Use the Lagrangian to write a program called the Primal Program • Its value is equal to f(x) if all the constraints are met • Otherwise its value is infinite • Definition – Primal Program:
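
The standard definition of the primal program built from this Lagrangian (notation assumed):

    \theta_P(x) = \max_{\alpha \ge 0} \mathcal{L}(x, \alpha) =
    \begin{cases} f(x) & \text{if } g_i(x) \le 0 \ \forall i \\ +\infty & \text{otherwise} \end{cases}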

  23. Primal Program – Cont. • The constraints are of the form g_i(x) ≤ 0 • If they are met, the inner maximization over the multipliers is attained when all of them are 0, and the summation term is 0 • Otherwise, some g_i(x) > 0, and the expression is maximized by letting the corresponding multiplier grow to infinity

  24. Primal Program – Cont. • Our convex optimization problem is now: minimize the primal program over x • Define p* as the value of the primal program

  25. Dual Program • We define the Dual Program as the minimization of the Lagrangian over x • We'll look at its maximization over the multipliers • This is the same as our primal program, except that the order of min / max is different • Define d* as the value of our Dual Program

  26. Dual Program – Cont. • We want to show d* = p* • Then, if we find a solution to one problem, we find the solution to the second problem • Start with d* ≤ p* • A "max min" is always less than or equal to the corresponding "min max" • Now on to the other direction
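
The weak-duality step in standard notation, with p* and d* the primal and dual values:

    d^* = \max_{\alpha \ge 0} \min_{x} \mathcal{L}(x, \alpha)
    \ \le \ \min_{x} \max_{\alpha \ge 0} \mathcal{L}(x, \alpha) = p^*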

  27. Dual Program – Cont. • Claim • Proof • Conclude
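
In standard treatments, the claim at this point is strong duality (stated here in its usual textbook form, assumed to be what the slide's Claim refers to): if f and the g_i are convex and the constraints are strictly feasible (Slater's condition), then

    d^* = p^*,

so solving the dual also solves the primal.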

  28. Karush-Kuhn-Tucker (KKT) conditions • The KKT conditions give a characterization of an optimal solution to a convex problem. • Theorem
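
The KKT conditions in their standard form (notation assumed), for a point x* with multipliers α*:

    \nabla_x \mathcal{L}(x^*, \alpha^*) = 0            \quad \text{(stationarity)}
    g_i(x^*) \le 0 \ \ \forall i                       \quad \text{(primal feasibility)}
    \alpha_i^* \ge 0 \ \ \forall i                     \quad \text{(dual feasibility)}
    \alpha_i^* \, g_i(x^*) = 0 \ \ \forall i           \quad \text{(complementary slackness)}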

  29. KKT Conditions – Cont. • Proof • The other direction holds as well

  30. KKT Conditions – Cont. • Example • Consider the following optimization problem: • We have • The Lagrangian will be
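
A small worked example of this kind (hypothetical, not necessarily the one on the slide): minimize f(x) = x^2 subject to x ≥ 1, i.e. g(x) = 1 - x ≤ 0.

    \mathcal{L}(x, \alpha) = x^2 + \alpha (1 - x)
    \text{Stationarity: } 2x - \alpha = 0 \ \Rightarrow \ \alpha = 2x
    \text{Complementary slackness: } \alpha (1 - x) = 0
    \Rightarrow \ x^* = 1, \ \alpha^* = 2 \ \ (\alpha^* \ge 0, \ g(x^*) = 0)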

  31. Optimal Margin Classifier • Back to SVM • Rewrite our optimization program so the constraints have the form g_i ≤ 0 • Following the KKT conditions, the multipliers can be non-zero only for points in the training set with a functional margin of exactly 1 • These are the support vectors of the training set
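
Concretely, with the constraints rewritten as g_i(w, b) = 1 - y_i(w·x_i + b) ≤ 0, complementary slackness gives (standard notation, assumed):

    \alpha_i \left[ y_i (w \cdot x_i + b) - 1 \right] = 0
    \quad \Rightarrow \quad \alpha_i > 0 \ \text{only if} \ y_i (w \cdot x_i + b) = 1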

  32. Optimal Margin – Cont. • Optimal margin classifier and its support vectors

  33. Optimal Margin – Cont. • Construct the Lagrangian • Find the dual form • First minimize over w and b • Do so by setting the derivatives to zero

  34. Optimal Margin – Cont. • Take the derivative with respect to w, and then with respect to b • Substitute the results back into the Lagrangian • We saw the last term is zero
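
In the usual notation (assumed), the Lagrangian of the final program and the two stationarity conditions are:

    \mathcal{L}(w, b, \alpha) = \tfrac{1}{2}\|w\|^2 - \sum_{i=1}^{m} \alpha_i \left[ y_i (w \cdot x_i + b) - 1 \right]
    \nabla_w \mathcal{L} = 0 \ \Rightarrow \ w = \sum_{i=1}^{m} \alpha_i y_i x_i, \qquad
    \frac{\partial \mathcal{L}}{\partial b} = 0 \ \Rightarrow \ \sum_{i=1}^{m} \alpha_i y_i = 0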

  35. Optimal Margin – Cont. • The dual optimization problem (see below) • The KKT conditions hold • We can solve it by finding the multipliers that maximize the dual objective • Assuming we have them, define w as the corresponding combination of training points • This gives the solution to the primal problem
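
Substituting these back gives the dual in its standard form (assumed):

    \max_{\alpha} \ \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j \, (x_i \cdot x_j)
    \quad \text{s.t.} \quad \alpha_i \ge 0 \ \forall i, \qquad \sum_{i=1}^{m} \alpha_i y_i = 0,

with the primal solution recovered as w = \sum_i \alpha_i y_i x_i.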

  36. Optimal Margin – Cont. • We still need to find b • Assume some training point is a support vector • We get b from the requirement that its functional margin is exactly 1
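
One standard way to recover b (assumed notation): if x_j is a support vector, then y_j (w·x_j + b) = 1, and since y_j ∈ {-1, +1},

    b = y_j - w \cdot x_j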

  37. Error Analysis Using Leave-One-Out • The Leave-One-Out (LOO) method • Remove one point at a time from the training set • Train an SVM on the remaining points • Test the result on the removed point • Definition: the LOO error is the fraction of removed points that are misclassified • The indicator function I(exp) is 1 if exp is true, otherwise 0
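
A short sketch of the LOO estimate using scikit-learn (not a library used in the lecture); a linear-kernel SVC with a large C stands in for the hard-margin SVM, and the data is made up:

    import numpy as np
    from sklearn.model_selection import LeaveOneOut
    from sklearn.svm import SVC

    # Made-up, linearly separable toy data.
    X = np.array([[2.0, 2.0], [1.0, 3.0], [-1.0, -1.0], [-2.0, 0.0]])
    y = np.array([1, 1, -1, -1])

    # LOO error = (1/m) * sum_i I[ h_{S \ {i}}(x_i) != y_i ]
    errors = 0
    for train_idx, test_idx in LeaveOneOut().split(X):
        clf = SVC(kernel="linear", C=1e6).fit(X[train_idx], y[train_idx])
        errors += int(clf.predict(X[test_idx])[0] != y[test_idx][0])

    print("LOO error estimate:", errors / len(X))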

  38. LOO Error Analysis – Cont. • Expected error • It follows that the expected LOO error over training sets of size m equals the expected generalization error of a hypothesis trained on a set of size m-1
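
Written out in standard notation (assumed), with err_D(h) the generalization error of h on D:

    \mathbb{E}_{S \sim D^m} \left[ \hat{\varepsilon}_{\mathrm{LOO}}(S) \right]
    = \mathbb{E}_{S' \sim D^{m-1}} \left[ \mathrm{err}_D(h_{S'}) \right]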

  39. LOO Error Analysis – Cont. • Theorem • Proof
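
The usual theorem at this point (stated here in its standard form, assumed to match the slide): if a removed point is not a support vector, the SVM trained on the remaining points is unchanged and classifies it correctly, so only support vectors can contribute LOO mistakes. Hence

    \hat{\varepsilon}_{\mathrm{LOO}}(S) \le \frac{\#\mathrm{SV}(S)}{m}
    \quad \Rightarrow \quad
    \mathbb{E}_{S' \sim D^{m-1}} \left[ \mathrm{err}_D(h_{S'}) \right]
    \le \frac{\mathbb{E}_{S \sim D^m} \left[ \#\mathrm{SV}(S) \right]}{m}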

  40. Generalization Bounds Using VC-dimension • Theorem • Proof
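
The bound typically proved here (a commonly cited form, assumed; exact constants vary between presentations): for points contained in a ball of radius R, the class of separating hyperplanes with geometric margin at least γ has VC-dimension at most on the order of

    \frac{R^2}{\gamma^2},

which, combined with the standard VC generalization bound, gives an error bound that does not depend on the ambient dimension.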

  41. Generalization Bounds Using VC-dimension – Cont. • Proof – Cont.
