Support Vector Machines (and Kernel Methods in general) Machine Learning
Last Time • Multilayer Perceptron/Logistic Regression Networks • Neural Networks • Error Backpropagation
Today • Support Vector Machines • Note: we’ll rely on some math from optimization theory that we won’t derive.
Maximum Margin • Perceptron (and other linear classifiers) can lead to many equally valid choices for the decision boundary. • Are these really “equally valid”?
Max Margin • How can we pick which is best? • Maximize the size of the margin. [Figure: small margin vs. large margin]
Support Vectors • Support Vectors are those input points (vectors) closest to the decision boundary • 1. They are vectors • 2. They “support” the decision hyperplane
Support Vectors • Define this as a decision problem • The decision hyperplane: • No fancy math, just the equation of a hyperplane.
Support Vectors • Aside: Why do some classifiers use labels {0, 1} and others {−1, +1}? • Simplicity of the math and interpretation. • For probability density function estimation, {0, 1} has a clear correlate. • For classification, a decision boundary of 0 is more easily interpretable than 0.5.
Support Vectors • Define this as a decision problem • The decision hyperplane: • Decision Function:
Support Vectors • Define this as a decision problem • The decision hyperplane: • Margin hyperplanes:
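The equations on these slides did not survive extraction. A sketch of the standard forms, assuming a weight vector $w$, a bias $b$, and margin hyperplanes at an offset $c > 0$ (the symbol $c$ is an assumption introduced here, not taken from the slides):

```latex
\begin{align*}
\text{Decision hyperplane:} \quad & w^{\top}x + b = 0 \\
\text{Decision function:}   \quad & D(x) = \operatorname{sign}\left(w^{\top}x + b\right) \\
\text{Margin hyperplanes:}  \quad & w^{\top}x + b = +c, \qquad w^{\top}x + b = -c
\end{align*}
```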
Support Vectors • The decision hyperplane: • Scale invariance
Support Vectors • The decision hyperplane: • Scale invariance • This scaling does not change the decision hyperplane or the support vector hyperplanes, but it lets us eliminate a variable from the optimization.
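Scale invariance can be checked numerically. The following is a minimal sketch, assuming a decision function of the form sign(w·x + b); the weights, bias, points, and scale factor are made-up values for illustration only. Because any positive rescaling leaves the decisions untouched, we are free to pick the scale at which the margin hyperplanes sit at w·x + b = ±1.

```python
def decision(w, b, x):
    """Return +1 or -1 depending on which side of w.x + b = 0 the point x lies."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1


# Hypothetical weights, bias, and query points (not from the slides).
w, b = [2.0, -1.0], 0.5
points = [[1.0, 1.0], [-1.0, 2.0], [0.0, 3.0], [3.0, -2.0]]

labels = [decision(w, b, x) for x in points]

# Multiply w and b by the same positive constant: the hyperplane is unchanged,
# so every point keeps its label.
c = 10.0
scaled_labels = [decision([c * wi for wi in w], c * b, x) for x in points]

print(labels, scaled_labels)
```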
What are we optimizing? • We will represent the size of the margin in terms of w. • This will allow us to simultaneously • Identify a decision boundary • Maximize the margin
How do we represent the size of the margin in terms of w? • There must be at least one point that lies on each support hyperplane. • Proof outline: if not, we could define a larger-margin support hyperplane that does touch the nearest point(s).
How do we represent the size of the margin in terms of w? • There must be at least one point that lies on each support hyperplane. • Thus: And:
How do we represent the size of the margin in terms of w? • The vector w is perpendicular to the decision hyperplane • If the dot product of two vectors equals zero, the two vectors are perpendicular.
How do we represent the size of the margin in terms of w? • The margin is the projection of x1 – x2 onto w, the normal of the hyperplane.
How do we represent the size of the margin in terms of w? • The margin is the projection of x1 – x2 onto w, the normal of the hyperplane. Projection: Size of the Margin:
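The missing steps can be reconstructed as follows. This is a sketch that assumes the margin hyperplanes have been rescaled to $w^{\top}x + b = \pm 1$, with $x_1$ a point on the positive one and $x_2$ a point on the negative one:

```latex
\begin{align*}
w^{\top}x_1 + b &= +1 \\
w^{\top}x_2 + b &= -1 \\
\Rightarrow \quad w^{\top}(x_1 - x_2) &= 2 \\[4pt]
\text{Projection of } x_1 - x_2 \text{ onto } w: \quad
\frac{w^{\top}(x_1 - x_2)}{\lVert w \rVert} &= \frac{2}{\lVert w \rVert}
\end{align*}
```

So the size of the margin is $2 / \lVert w \rVert$: a smaller $\lVert w \rVert$ means a larger margin.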
Maximizing the margin • Goal: maximize the margin • Constraint: linear separability of the data by the decision boundary
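Written out, the goal and the constraint take the usual hard-margin form. This sketch assumes labels $t_i \in \{-1, +1\}$ (the symbol $t_i$ is an assumption) and a margin of $2 / \lVert w \rVert$:

```latex
\begin{align*}
\max_{w,\,b} \; \frac{2}{\lVert w \rVert}
\quad \Longleftrightarrow \quad
\min_{w,\,b} \; \frac{1}{2}\lVert w \rVert^{2}
\qquad \text{subject to} \quad t_i\left(w^{\top}x_i + b\right) \ge 1 \quad \text{for all } i
\end{align*}
```

The constraint says every training point lies on the correct side of its margin hyperplane, which is only possible when the data are linearly separable.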
Max Margin Loss Function • This is constrained optimization, so we use Lagrange multipliers • Optimize the “Primal”
Max Margin Loss Function • Optimize the “Primal” • Partial derivative with respect to b
Max Margin Loss Function • Optimize the “Primal” • Partial derivative with respect to w
Max Margin Loss Function • Optimize the “Primal” • Partial derivative with respect to w • Now we have to find αi • Substitute back into the loss function
Max Margin Loss Function • Construct the “dual”
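A sketch of the derivation these slides walk through, assuming labels $t_i \in \{-1, +1\}$, multipliers $\alpha_i \ge 0$, and the hard-margin constraints $t_i(w^{\top}x_i + b) \ge 1$ (the names $L$ and $W$ are assumptions):

```latex
\begin{align*}
L(w, b, \alpha) &= \frac{1}{2}\lVert w \rVert^{2}
  - \sum_{i} \alpha_i \left[ t_i\left(w^{\top}x_i + b\right) - 1 \right] \\[4pt]
\frac{\partial L}{\partial b} = 0 \quad &\Rightarrow \quad \sum_{i} \alpha_i t_i = 0 \\
\frac{\partial L}{\partial w} = 0 \quad &\Rightarrow \quad w = \sum_{i} \alpha_i t_i x_i \\[4pt]
W(\alpha) &= \sum_{i} \alpha_i
  - \frac{1}{2} \sum_{i}\sum_{j} \alpha_i \alpha_j t_i t_j \, x_i^{\top}x_j
\qquad \text{subject to} \quad \alpha_i \ge 0, \;\; \sum_{i} \alpha_i t_i = 0
\end{align*}
```

The dual $W(\alpha)$ is maximized over $\alpha$. Note that the training points enter only through the dot products $x_i^{\top}x_j$.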
Dual formulation of the error • Optimize this quadratic program to identify the Lagrange multipliers and thus the weights • There exist (rather) fast approaches to quadratic optimization in C, C++, Python, Java and R
Quadratic Programming • If Q is positive semi-definite, then f(x) is convex. • If f(x) is convex, then every local minimum is a global minimum.
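For reference, a sketch of the general quadratic program and how the SVM dual fits it. The symbols $c$, $G$, $h$, $A$, $d$ are assumed names for the standard form, and $t_i \in \{-1, +1\}$ are assumed labels:

```latex
\begin{align*}
\min_{x} \; f(x) &= \frac{1}{2} x^{\top} Q x + c^{\top} x
\qquad \text{subject to} \quad Gx \le h, \;\; Ax = d \\[4pt]
\text{SVM dual:} \quad
\min_{\alpha} \; & \frac{1}{2} \alpha^{\top} Q \alpha - \mathbf{1}^{\top}\alpha,
\qquad Q_{ij} = t_i t_j \, x_i^{\top}x_j,
\qquad \alpha \ge 0, \;\; t^{\top}\alpha = 0
\end{align*}
```

Here $Q$ is a Gram matrix of the vectors $t_i x_i$, so it is positive semi-definite and the problem is convex.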
Support Vector Expansion • When αi is non-zero, xi is a support vector • When αi is zero, xi is not a support vector • New decision function: • Independent of the dimension of x!
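The expansion can be illustrated on a tiny example. This is a sketch under stated assumptions: labels t in {-1, +1}, a decision function of the form sign(sum_i alpha_i t_i (x_i · x) + b), and a made-up three-point training set whose dual solution was worked out by hand rather than by a solver.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


# Hypothetical training set: two points nearest the boundary and one far away.
X = [[1.0, 1.0], [-1.0, -1.0], [3.0, 3.0]]
t = [1, -1, 1]
alpha = [0.25, 0.25, 0.0]  # hand-derived multipliers for this toy set
b = 0.0

# Weights recovered from the expansion w = sum_i alpha_i * t_i * x_i.
w = [sum(a * ti * xi[d] for a, ti, xi in zip(alpha, t, X)) for d in range(2)]

# Only points with non-zero alpha are support vectors.
support_vectors = [xi for a, xi in zip(alpha, X) if a > 0]


def decide(x):
    """Classify x using only dot products with the support vectors."""
    score = sum(a * ti * dot(xi, x) for a, ti, xi in zip(alpha, t, X) if a > 0) + b
    return 1 if score >= 0 else -1


print(w, support_vectors, decide([2.0, 0.5]), decide([-2.0, 0.0]))
```

The far-away point (3, 3) has alpha equal to zero, so dropping it from the training set would leave the decision function unchanged.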
Kuhn-Tucker Conditions • In constrained optimization, at the optimal solution: • Constraint * Lagrange Multiplier = 0 • Only points on the margin hyperplanes (the support vectors) contribute to the solution!
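For the hard-margin SVM this condition can be written out as below. The sketch assumes labels $t_i \in \{-1, +1\}$ and constraints $t_i(w^{\top}x_i + b) \ge 1$:

```latex
\begin{align*}
\alpha_i \left[ t_i\left(w^{\top}x_i + b\right) - 1 \right] &= 0 \quad \text{for all } i \\[4pt]
t_i\left(w^{\top}x_i + b\right) > 1 \quad &\Rightarrow \quad \alpha_i = 0
  \quad \text{(point lies beyond the margin and drops out)} \\
\alpha_i > 0 \quad &\Rightarrow \quad t_i\left(w^{\top}x_i + b\right) = 1
  \quad \text{(point lies exactly on a margin hyperplane)}
\end{align*}
```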
Interpretability of SVM parameters • What else can we tell from alphas? • If alpha is large, then the associated data point is quite important. • It’s either an outlier, or incredibly important. • But this only gives us the best solution for linearly separable data sets…
Basis of Kernel Methods • The decision process doesn’t depend on the dimensionality of the data. • We can map to a higher dimensionality of the data space. • Note: data points only appear within a dot product. • The error is based on the dot product of data points – not the data points themselves.
Basis of Kernel Methods • Data points only appear within a dot product. • Thus we can map to another space by replacing that dot product. • The error is based on the dot product of data points – not the data points themselves.
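One way to see the replacement at work is to compare a dot product in an explicitly mapped space with a function evaluated on the original points. The sketch below uses the homogeneous quadratic kernel as an illustrative choice; the slides do not name a particular kernel, and the function names `phi` and `poly_kernel` and the sample points are assumptions.

```python
import math


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def phi(x):
    """Explicit degree-2 feature map for 2-D input: (x1^2, sqrt(2)*x1*x2, x2^2)."""
    x1, x2 = x
    return [x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2]


def poly_kernel(x, z):
    """Homogeneous quadratic kernel: K(x, z) = (x . z)^2."""
    return dot(x, z) ** 2


# Hypothetical sample points.
x, z = [1.0, 2.0], [3.0, -1.0]

explicit = dot(phi(x), phi(z))  # dot product computed in the 3-D mapped space
implicit = poly_kernel(x, z)    # same value computed from the 2-D dot product

print(explicit, implicit)
```

Both values agree up to floating-point rounding, so swapping the dot product for the kernel function has the same effect as mapping the data first, without ever constructing the mapped vectors.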
Learning Theory basis of SVMs • Theoretical bounds on testing error. • The upper bound doesn’t depend on the dimensionality of the space. • The upper bound is minimized by maximizing the margin, γ, associated with the decision boundary.
Why we like SVMs • They work • Good generalization • Easily interpreted. • Decision boundary is based on the data in the form of the support vectors. • Not so in multilayer perceptron networks • Principled bounds on testing error from Learning Theory (VC dimension)
SVM vs. MLP • SVMs have many fewer parameters • SVM: Maybe just a kernel parameter • MLP: Number and arrangement of nodes and the learning rate η • SVM: Convex optimization task • MLP: likelihood is non-convex -- local minima
Soft margin classification • There can be outliers on the other side of the decision boundary, or outliers that lead to a small margin. • Solution: Introduce slack variables into the constraints and penalize them in the objective
Soft Margin Dual • Still quadratic programming!
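A sketch of the usual soft-margin formulation, assuming labels $t_i \in \{-1, +1\}$, slack variables $\xi_i$, and a penalty weight $C > 0$ (the symbols $\xi_i$ and $C$ are assumptions; the slide equations did not survive):

```latex
\begin{align*}
\text{Primal:} \quad
& \min_{w,\,b,\,\xi} \; \frac{1}{2}\lVert w \rVert^{2} + C \sum_{i} \xi_i
\qquad \text{subject to} \quad t_i\left(w^{\top}x_i + b\right) \ge 1 - \xi_i, \;\; \xi_i \ge 0 \\[4pt]
\text{Dual:} \quad
& \max_{\alpha} \; \sum_{i} \alpha_i
  - \frac{1}{2} \sum_{i}\sum_{j} \alpha_i \alpha_j t_i t_j \, x_i^{\top}x_j
\qquad \text{subject to} \quad 0 \le \alpha_i \le C, \;\; \sum_{i} \alpha_i t_i = 0
\end{align*}
```

Compared with the hard-margin dual, the only change is the upper bound $C$ on each multiplier, which is why the problem is still a quadratic program.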
Soft margin example • Points are allowed within the margin, but a cost is introduced. • Hinge loss
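The hinge loss makes that cost concrete. A minimal sketch, assuming labels t in {-1, +1} and the loss max(0, 1 - t(w·x + b)); the weights and sample points are made up for illustration.

```python
def hinge_loss(w, b, x, t):
    """Cost of point x with label t: zero beyond the margin, growing linearly inside it."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return max(0.0, 1.0 - t * score)


# Hypothetical classifier whose margin hyperplanes are w.x + b = +1 and -1.
w, b = [0.5, 0.5], 0.0

outside = hinge_loss(w, b, [3.0, 3.0], 1)        # well beyond the margin: no cost
on_margin = hinge_loss(w, b, [1.0, 1.0], 1)      # exactly on the margin: no cost
inside = hinge_loss(w, b, [0.5, 0.5], 1)         # inside the margin, correct side
wrong_side = hinge_loss(w, b, [-1.0, -1.0], 1)   # on the wrong side of the boundary

print(outside, on_margin, inside, wrong_side)
```

A point inside the margin pays a cost even when it is classified correctly, and the cost keeps growing the further a point sits on the wrong side.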
Probabilities from SVMs • Support Vector Machines are discriminant functions • Discriminant functions: f(x) = c • Discriminative models: f(x) = argmax_c p(c|x) • Generative models: f(x) = argmax_c p(x|c)p(c)/p(x) • No (principled) probabilities from SVMs • SVMs are not based on probability distribution functions of class instances.
Efficiency of SVMs • Not especially fast. • Training – O(n^3) • Quadratic Programming efficiency • Evaluation – O(n) • Need to evaluate against each support vector (potentially n)
Good Bye • Next time: • The Kernel “Trick” -> Kernel Methods • or • How can we use SVMs on data that are not linearly separable?