1. What is a Support Vector Machine? (CS 540, University of Wisconsin-Madison, C. R. Dyer)
An optimally defined surface
Typically nonlinear in the input space
Linear in a higher-dimensional space
Implicitly defined by a kernel function
2. What are Support Vector Machines Used For?
Classification
Regression and data-fitting
Supervised and unsupervised learning
3. Linear Classifiers (figure only)
4. Linear Classifiers (aka Linear Discriminant Functions)
Definition: a linear classifier is a function that is a linear combination of the components of the input x, f(x) = w · x + b, where w is the weight vector and b the bias
A two-category classifier then uses the rule:
Decide class c1 if f(x) > 0 and class c2 if f(x) < 0, i.e., decide c1 if w · x > -b and c2 otherwise
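A minimal numpy sketch of this decision rule (the weight vector, bias, and test point below are illustrative values, not from the slides):

    import numpy as np

    def linear_classify(x, w, b):
        # f(x) = w . x + b; decide c1 if f(x) > 0, c2 if f(x) < 0
        f = np.dot(w, x) + b
        return "c1" if f > 0 else "c2"

    w = np.array([2.0, -1.0])   # illustrative weight vector
    b = -0.5                    # illustrative bias
    print(linear_classify(np.array([1.0, 0.3]), w, b))   # f = 1.2 > 0, so "c1"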
5-8. Linear Classifiers (figures only)
9. Classifier Margin (figure only)
10-11. Maximum Margin (figures only)
12. Why Maximum Margin? (figure only)
13. Specifying a Line and Margin
How do we represent this mathematically, in d input dimensions?
An example: x = (x1, …, xd)ᵀ
14. Specifying a Line and Margin
Plus-plane = { x : w · x + b = +1 }
Minus-plane = { x : w · x + b = -1 }
15-17. Computing the Margin
Plus-plane = { x : w · x + b = +1 }
Minus-plane = { x : w · x + b = -1 }
Claim: the vector w is perpendicular to the plus-plane. Why? (See the note below.)
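A brief justification of the claim (the slide leaves it as a question): if u and v are any two points on the plus-plane, then w · u + b = 1 and w · v + b = 1, so w · (u - v) = 0. That is, w is orthogonal to every direction lying in the plane, hence perpendicular to the plane itself (and, by the same argument, to the minus-plane).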
18-20. Computing the Margin
Plus-plane = { x : w · x + b = +1 }
Minus-plane = { x : w · x + b = -1 }
The vector w is perpendicular to the plus-plane
Let x- be any point on the minus-plane
Let x+ be the closest plus-plane point to x-
Claim: x+ = x- + λw for some value of λ. Why? Because the shortest segment from x- to the plus-plane is perpendicular to that plane, and w points in exactly that perpendicular direction.
21-23. Computing the Margin
What we know:
w · x+ + b = +1
w · x- + b = -1
x+ = x- + λw
|x+ - x-| = M
It is now easy to get M in terms of w and b (see the derivation below)
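The derivation these slides build toward, written out in the slides' notation:

w · (x- + λw) + b = +1
(w · x- + b) + λ (w · w) = +1
-1 + λ ||w||² = +1, so λ = 2 / ||w||²
M = |x+ - x-| = |λw| = λ ||w|| = 2 / ||w||

So maximizing the margin M is equivalent to minimizing ||w|| (or, more conveniently, ½ w · w).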
24. Learning the Maximum Margin Classifier
Given a guess of w and b, we can:
Compute whether all data points are in the correct half-planes
Compute the width of the margin
So now we just need to write a program that searches the space of w's and b's to find the widest margin consistent with all the data points. How?
25. Learning via Quadratic Programming
QP is a well-studied class of algorithms for optimizing a quadratic function of real-valued variables subject to linear constraints
Minimize ½ w · w subject to:
w · xk + b ≥ +1 if xk is in class 1
w · xk + b ≤ -1 if xk is in class 2
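A minimal sketch of this hard-margin QP, using scipy's general-purpose SLSQP solver as a stand-in for a dedicated QP package; the toy data set is illustrative and linearly separable:

    import numpy as np
    from scipy.optimize import minimize

    X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])  # toy examples
    y = np.array([1.0, 1.0, -1.0, -1.0])                                # labels +1 / -1
    d = X.shape[1]

    def objective(v):
        w = v[:d]                      # v packs [w, b]
        return 0.5 * np.dot(w, w)      # minimize (1/2) w . w

    # One constraint per example: yk (w . xk + b) - 1 >= 0
    constraints = [{"type": "ineq",
                    "fun": lambda v, xk=xk, yk=yk: yk * (np.dot(v[:d], xk) + v[d]) - 1.0}
                   for xk, yk in zip(X, y)]

    res = minimize(objective, x0=np.zeros(d + 1), method="SLSQP", constraints=constraints)
    w, b = res.x[:d], res.x[d]
    print("w =", w, " b =", b, " margin =", 2.0 / np.linalg.norm(w))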
26-27. Learning the Maximum Margin Classifier
Given a guess of w and b, we can:
Compute whether all data points are in the correct half-planes
Compute the margin width
Assume N examples, each (xk, yk) where yk = ±1
28-32. Uh-oh! (figures only)
33-37. Learning Maximum Margin with Noise
Given a guess of w and b, we can:
Compute the sum of distances of points to their correct zones
Compute the margin width
Assume N examples, each (xk, yk) where yk = ±1
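The objective these slides build up is the standard soft-margin formulation (reconstructed here in the usual notation; εk is the slack, i.e. the distance of example k to its correct zone, and C is a trade-off parameter):

Minimize ½ w · w + C Σk εk
subject to yk (w · xk + b) ≥ 1 - εk and εk ≥ 0 for k = 1, …, N

A large C penalizes violations heavily (approaching the hard-margin case); a small C tolerates more violations in exchange for a wider margin.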
38-40. An Equivalent QP (equations only)
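The equivalent QP referred to here is the standard dual formulation (reconstructed in the usual notation; the αk are Lagrange multipliers, one per example):

Maximize Σk αk - ½ Σk Σl αk αl yk yl (xk · xl)
subject to 0 ≤ αk ≤ C and Σk αk yk = 0

The solution gives w = Σk αk yk xk, and the examples with αk > 0 are exactly the support vectors. The training data enter only through the dot products xk · xl, which is what makes the kernel trick on the following slides possible.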
41-42. Suppose we're in 1 Dimension (figures only)
43-45. Harder 1-Dimensional Dataset (figures only)
46. (figure only)
47. Project examples into some higher-dimensional space where the data are linearly separable, defined by z = Φ(x)
Training depends only on dot products of the form Φ(xi) · Φ(xj)
Example: K(xi, xj) = Φ(xi) · Φ(xj) = (xi · xj)²
The dimensionality of the z space is generally much larger than that of the input space x
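A quick numpy check of the example above for 2-D inputs: one explicit feature map whose dot product reproduces the quadratic kernel is Φ(x) = (x1², √2 x1 x2, x2²) (this particular map is a standard choice, not stated on the slide):

    import numpy as np

    def phi(x):
        # Explicit 3-D feature map matching the quadratic kernel (x . z)^2 for 2-D inputs
        return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

    x = np.array([1.0, 2.0])
    z = np.array([3.0, 0.5])
    print((x @ z) ** 2)      # kernel value computed in the input space: 16.0
    print(phi(x) @ phi(z))   # same value via the explicit feature map: 16.0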
48. Common SVM Basis Functions (equations only)
49. SVM Kernel Functions
K(a, b) = (a · b + 1)^d is an example of an SVM kernel function
Beyond polynomials, there are other very high-dimensional basis functions that can be made practical by finding the right kernel function
Radial-basis-style and neural-net-style kernel functions (their usual forms are sketched below)
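A small numpy sketch of these kernel families; the radial-basis and neural-net forms below follow the usual conventions (the slide only names them), and the parameters d, sigma, kappa, and delta are illustrative:

    import numpy as np

    def poly_kernel(a, b, d=2):
        return (np.dot(a, b) + 1.0) ** d                           # K(a,b) = (a . b + 1)^d

    def rbf_kernel(a, b, sigma=1.0):
        return np.exp(-np.sum((a - b) ** 2) / (2.0 * sigma ** 2))  # radial-basis style

    def neural_net_kernel(a, b, kappa=1.0, delta=0.0):
        return np.tanh(kappa * np.dot(a, b) - delta)               # neural-net (sigmoid) style

    a, b = np.array([1.0, 0.5]), np.array([0.2, -1.0])
    print(poly_kernel(a, b), rbf_kernel(a, b), neural_net_kernel(a, b))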
50. The Federalist Papers
51. Description of the Data
52. Function Words Based on Relative Frequencies
53. SLA Feature Selection for Classifying the Disputed Federalist Papers
54. Hyperplane Classifier Using 3 Words
55. Results: 3D Plot of Hyperplane
(slides 50-55 are figures and tables only)
56. Multi-Class Classification
SVMs can only handle two-class outputs. What can be done?
Answer: for an N-class problem, learn N SVMs:
SVM 1, f1, learns "Output = 1" vs "Output ≠ 1"
SVM 2, f2, learns "Output = 2" vs "Output ≠ 2"
…
SVM N, fN, learns "Output = N" vs "Output ≠ N"
57. Multi-Class Classification
Ideally only one fi(x) > 0 and all the others are < 0, but this is often not the case in practice
Instead, to predict the output for a new input, run every SVM and see which one puts the prediction furthest into its positive region:
Classify as class Ci where i = argmax over j of fj(x)
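A minimal sketch of this one-vs-rest scheme using scikit-learn's SVC as the binary SVM trainer (the three-class toy data is illustrative):

    import numpy as np
    from sklearn.svm import SVC

    X = np.array([[0.0, 0.0], [0.2, 0.1], [3.0, 3.0], [3.1, 2.9], [0.0, 3.0], [0.2, 3.1]])
    y = np.array([1, 1, 2, 2, 3, 3])        # toy labels for classes 1, 2, 3

    classes = np.unique(y)
    svms = {c: SVC(kernel="linear").fit(X, (y == c).astype(int)) for c in classes}
    # svms[c] plays the role of fc, trained on "Output = c" vs "Output != c"

    def predict(x):
        # Pick the class whose SVM pushes x furthest into its positive region
        scores = {c: svms[c].decision_function(x.reshape(1, -1))[0] for c in classes}
        return max(scores, key=scores.get)

    print(predict(np.array([2.8, 3.2])))    # should print 2 on this toy data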
58. Summary
Learning linear functions
Pick the separating plane that maximizes the margin
The separating plane is defined in terms of the support vectors only
Learning non-linear functions
Project examples into a higher-dimensional space
Use kernel functions for efficiency
Generally avoids the over-fitting problem
Global optimization method; no local optima
Can be expensive to apply, especially for multi-class problems