1. What is a Support Vector Machine? (CS 540, University of Wisconsin-Madison, C. R. Dyer)
An optimally defined surface
Typically nonlinear in the input space
Linear in a higher-dimensional space
Implicitly defined by a kernel function
2. What are Support Vector Machines Used For?
Classification
Regression and data-fitting
Supervised and unsupervised learning
3. Linear Classifiers (figure only)
4. Linear Classifiers (aka Linear Discriminant Functions)
Definition: a linear classifier is a function that is a linear combination of the components of the input x, f(x) = w · x + b, where w is the weight vector and b the bias
A two-category classifier then uses the rule:
Decide class c1 if f(x) > 0 and class c2 if f(x) < 0, i.e., decide c1 if w · x > -b and c2 otherwise
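A minimal numpy sketch of this decision rule (the weight vector, bias, and test point below are illustrative values, not from the slides):

    import numpy as np

    def linear_classify(x, w, b):
        # f(x) = w . x + b; decide c1 if f(x) > 0, c2 if f(x) < 0
        f = np.dot(w, x) + b
        return "c1" if f > 0 else "c2"

    w = np.array([2.0, -1.0])   # illustrative weight vector
    b = -0.5                    # illustrative bias
    print(linear_classify(np.array([1.0, 0.3]), w, b))   # f = 1.2 > 0, so "c1"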
5-8. Linear Classifiers (figures only)
9. Classifier Margin (figure only)
10-11. Maximum Margin (figures only)
12. Why Maximum Margin? (figure only)
13. Specifying a Line and Margin
How do we represent this mathematically, in d input dimensions?
An example: x = (x1, …, xd)ᵀ
14. Specifying a Line and Margin
Plus-plane = { x : w · x + b = +1 }
Minus-plane = { x : w · x + b = -1 }
15-17. Computing the Margin
Plus-plane = { x : w · x + b = +1 }
Minus-plane = { x : w · x + b = -1 }
Claim: the vector w is perpendicular to the plus-plane. Why? (See the note below.)
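A brief justification of the claim (the slide leaves it as a question): if u and v are any two points on the plus-plane, then w · u + b = 1 and w · v + b = 1, so w · (u - v) = 0. That is, w is orthogonal to every direction lying in the plane, hence perpendicular to the plane itself (and, by the same argument, to the minus-plane).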
18-20. Computing the Margin
Plus-plane = { x : w · x + b = +1 }
Minus-plane = { x : w · x + b = -1 }
The vector w is perpendicular to the plus-plane
Let x- be any point on the minus-plane
Let x+ be the closest plus-plane point to x-
Claim: x+ = x- + λw for some value of λ. Why? Because the shortest segment from x- to the plus-plane is perpendicular to that plane, and w points in exactly that perpendicular direction.
21-23. Computing the Margin
What we know:
w · x+ + b = +1
w · x- + b = -1
x+ = x- + λw
|x+ - x-| = M
It is now easy to get M in terms of w and b (see the derivation below)
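The derivation these slides build toward, written out in the slides' notation:

w · (x- + λw) + b = +1
(w · x- + b) + λ (w · w) = +1
-1 + λ ||w||² = +1, so λ = 2 / ||w||²
M = |x+ - x-| = |λw| = λ ||w|| = 2 / ||w||

So maximizing the margin M is equivalent to minimizing ||w|| (or, more conveniently, ½ w · w).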
24. Learning the Maximum Margin Classifier
Given a guess of w and b, we can:
Compute whether all data points are in the correct half-planes
Compute the width of the margin
So now we just need to write a program that searches the space of w's and b's to find the widest margin consistent with all the data points. How?
25. Learning via Quadratic Programming
QP is a well-studied class of algorithms for optimizing a quadratic function of real-valued variables subject to linear constraints
Minimize ½ w · w subject to:
w · xk + b ≥ +1 if xk is in class 1
w · xk + b ≤ -1 if xk is in class 2
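A minimal sketch of this hard-margin QP, using scipy's general-purpose SLSQP solver as a stand-in for a dedicated QP package; the toy data set is illustrative and linearly separable:

    import numpy as np
    from scipy.optimize import minimize

    X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])  # toy examples
    y = np.array([1.0, 1.0, -1.0, -1.0])                                # labels +1 / -1
    d = X.shape[1]

    def objective(v):
        w = v[:d]                      # v packs [w, b]
        return 0.5 * np.dot(w, w)      # minimize (1/2) w . w

    # One constraint per example: yk (w . xk + b) - 1 >= 0
    constraints = [{"type": "ineq",
                    "fun": lambda v, xk=xk, yk=yk: yk * (np.dot(v[:d], xk) + v[d]) - 1.0}
                   for xk, yk in zip(X, y)]

    res = minimize(objective, x0=np.zeros(d + 1), method="SLSQP", constraints=constraints)
    w, b = res.x[:d], res.x[d]
    print("w =", w, " b =", b, " margin =", 2.0 / np.linalg.norm(w))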
26-27. Learning the Maximum Margin Classifier
Given a guess of w and b, we can:
Compute whether all data points are in the correct half-planes
Compute the margin width
Assume N examples, each (xk, yk) where yk = ±1
28-32. Uh-oh! (figures only)
33-37. Learning Maximum Margin with Noise
Given a guess of w and b, we can:
Compute the sum of distances of points to their correct zones
Compute the margin width
Assume N examples, each (xk, yk) where yk = ±1
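The objective these slides build up is the standard soft-margin formulation (reconstructed here in the usual notation; εk is the slack, i.e. the distance of example k to its correct zone, and C is a trade-off parameter):

Minimize ½ w · w + C Σk εk
subject to yk (w · xk + b) ≥ 1 - εk and εk ≥ 0 for k = 1, …, N

A large C penalizes violations heavily (approaching the hard-margin case); a small C tolerates more violations in exchange for a wider margin.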
38-40. An Equivalent QP (equations only)
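The equivalent QP referred to here is the standard dual formulation (reconstructed in the usual notation; the αk are Lagrange multipliers, one per example):

Maximize Σk αk - ½ Σk Σl αk αl yk yl (xk · xl)
subject to 0 ≤ αk ≤ C and Σk αk yk = 0

The solution gives w = Σk αk yk xk, and the examples with αk > 0 are exactly the support vectors. The training data enter only through the dot products xk · xl, which is what makes the kernel trick on the following slides possible.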
41-42. Suppose we're in 1 Dimension (figures only)
43-45. Harder 1-Dimensional Dataset (figures only)
46. (figure only)
47. Project examples into some higher-dimensional space where the data are linearly separable, defined by z = Φ(x)
Training depends only on dot products of the form Φ(xi) · Φ(xj)
Example: K(xi, xj) = Φ(xi) · Φ(xj) = (xi · xj)²
The dimensionality of the z space is generally much larger than that of the input space x
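A quick numpy check of the example above for 2-D inputs: one explicit feature map whose dot product reproduces the quadratic kernel is Φ(x) = (x1², √2 x1 x2, x2²) (this particular map is a standard choice, not stated on the slide):

    import numpy as np

    def phi(x):
        # Explicit 3-D feature map matching the quadratic kernel (x . z)^2 for 2-D inputs
        return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

    x = np.array([1.0, 2.0])
    z = np.array([3.0, 0.5])
    print((x @ z) ** 2)      # kernel value computed in the input space: 16.0
    print(phi(x) @ phi(z))   # same value via the explicit feature map: 16.0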
48. Common SVM Basis Functions (equations only)
49. SVM Kernel Functions
K(a, b) = (a · b + 1)^d is an example of an SVM kernel function
Beyond polynomials, there are other very high-dimensional basis functions that can be made practical by finding the right kernel function
Radial-basis-style and neural-net-style kernel functions (their usual forms are sketched below)
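A small numpy sketch of these kernel families; the radial-basis and neural-net forms below follow the usual conventions (the slide only names them), and the parameters d, sigma, kappa, and delta are illustrative:

    import numpy as np

    def poly_kernel(a, b, d=2):
        return (np.dot(a, b) + 1.0) ** d                           # K(a,b) = (a . b + 1)^d

    def rbf_kernel(a, b, sigma=1.0):
        return np.exp(-np.sum((a - b) ** 2) / (2.0 * sigma ** 2))  # radial-basis style

    def neural_net_kernel(a, b, kappa=1.0, delta=0.0):
        return np.tanh(kappa * np.dot(a, b) - delta)               # neural-net (sigmoid) style

    a, b = np.array([1.0, 0.5]), np.array([0.2, -1.0])
    print(poly_kernel(a, b), rbf_kernel(a, b), neural_net_kernel(a, b))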
50. The Federalist Papers
51. Description of the Data
52. Function Words Based on Relative Frequencies
53. SLA Feature Selection for Classifying the Disputed Federalist Papers
54. Hyperplane Classifier Using 3 Words
55. Results: 3D Plot of Hyperplane
(slides 50-55 are figures and tables only)
56. Multi-Class Classification
SVMs can only handle two-class outputs. What can be done?
Answer: for an N-class problem, learn N SVMs:
SVM 1, f1, learns "Output = 1" vs "Output ≠ 1"
SVM 2, f2, learns "Output = 2" vs "Output ≠ 2"
…
SVM N, fN, learns "Output = N" vs "Output ≠ N"
57. Multi-Class Classification
Ideally only one fi(x) > 0 and all the others are < 0, but this is often not the case in practice
Instead, to predict the output for a new input, run every SVM and see which one puts the prediction furthest into its positive region:
Classify as class Ci where i = argmax over j of fj(x)
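A minimal sketch of this one-vs-rest scheme using scikit-learn's SVC as the binary SVM trainer (the three-class toy data is illustrative):

    import numpy as np
    from sklearn.svm import SVC

    X = np.array([[0.0, 0.0], [0.2, 0.1], [3.0, 3.0], [3.1, 2.9], [0.0, 3.0], [0.2, 3.1]])
    y = np.array([1, 1, 2, 2, 3, 3])        # toy labels for classes 1, 2, 3

    classes = np.unique(y)
    svms = {c: SVC(kernel="linear").fit(X, (y == c).astype(int)) for c in classes}
    # svms[c] plays the role of fc, trained on "Output = c" vs "Output != c"

    def predict(x):
        # Pick the class whose SVM pushes x furthest into its positive region
        scores = {c: svms[c].decision_function(x.reshape(1, -1))[0] for c in classes}
        return max(scores, key=scores.get)

    print(predict(np.array([2.8, 3.2])))    # should print 2 on this toy data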
58. Summary
Learning linear functions
Pick the separating plane that maximizes the margin
The separating plane is defined in terms of the support vectors only
Learning non-linear functions
Project examples into a higher-dimensional space
Use kernel functions for efficiency
Generally avoids the over-fitting problem
Global optimization method; no local optima
Can be expensive to apply, especially for multi-class problems