Support Vector Machines (and Kernel Methods in general) Machine Learning March 23, 2010
Last Time • Multilayer Perceptron/Logistic Regression Networks • Neural Networks • Error Backpropagation
Today • Support Vector Machines • Note: we’ll rely on some results from constrained optimization (Lagrange multipliers and optimality conditions) that we won’t derive.
Maximum Margin • Perceptron (and other linear classifiers) can lead to many equally valid choices for the decision boundary. Are these really “equally valid”?
Max Margin • How can we pick which is best? • Maximize the size of the margin. (Figure: a small-margin boundary vs. a large-margin boundary.)
Support Vectors • Support Vectors are those input points (vectors) closest to the decision boundary • 1. They are vectors • 2. They “support” the decision hyperplane
Support Vectors • Define this as a decision problem • The decision hyperplane: wᵀx + b = 0 • No fancy math, just the equation of a hyperplane.
Support Vectors • Aside: why do some classifiers use labels in {0, 1} while others use {-1, +1}? • Simplicity of the math and interpretation. • For probability density function estimation, {0, 1} has a clear correlate. • For classification, a decision boundary of 0 is more easily interpretable than 0.5.
Support Vectors • Define this as a decision problem • The decision hyperplane: wᵀx + b = 0 • Decision function: f(x) = sign(wᵀx + b)
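A minimal sketch of that decision function in NumPy (not from the original slides; the names `decide`, `w`, and `b` are illustrative placeholders):

```python
import numpy as np

def decide(x, w, b):
    """Classify x as +1 or -1 by which side of the hyperplane w.x + b = 0 it falls on."""
    return 1 if float(np.dot(w, x) + b) >= 0 else -1

# Toy usage: the hyperplane x1 + x2 - 1 = 0 in 2D
w = np.array([1.0, 1.0])
b = -1.0
print(decide(np.array([2.0, 2.0]), w, b))   # +1 (above the line)
print(decide(np.array([0.0, 0.0]), w, b))   # -1 (below the line)
```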
Support Vectors • Define this as a decision problem • The decision hyperplane: wᵀx + b = 0 • Margin hyperplanes: wᵀx + b = +1 and wᵀx + b = -1
Support Vectors • The decision hyperplane: wᵀx + b = 0 • Scale invariance: for any c > 0, (cw)ᵀx + cb = 0 defines the same hyperplane. • This scaling does not change the decision hyperplane or the support vector hyperplanes, but it lets us eliminate a variable from the optimization: we fix the scale so that wᵀx + b = ±1 on the support hyperplanes.
What are we optimizing? • We will represent the size of the margin in terms of w. • This will allow us to simultaneously • Identify a decision boundary • Maximize the margin
How do we represent the size of the margin in terms of w? • There must be at least one point that lies on each support hyperplane. Proof outline: if not, we could shift that support hyperplane outward until it does touch the nearest point(s), giving a larger margin.
How do we represent the size of the margin in terms of w? • There must be at least one point that lies on each support hyperplane. • Thus, for some point x₁: wᵀx₁ + b = +1. And for some point x₂: wᵀx₂ + b = -1.
How do we represent the size of the margin in terms of w? • The vector w is perpendicular to the decision hyperplane. • For any two points xₐ, x_b on the hyperplane, wᵀxₐ + b = 0 and wᵀx_b + b = 0, so wᵀ(xₐ - x_b) = 0. • If the dot product of two vectors equals zero, the two vectors are perpendicular.
How do we represent the size of the margin in terms of w? • The margin is the projection of x₁ - x₂ onto w, the normal of the hyperplane. • Projection onto the unit normal: wᵀ(x₁ - x₂) / ||w||. • Size of the margin: 2 / ||w||.
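The standard derivation behind that margin size, written out here since the slide's equations are not reproduced in the text: subtract the two support-hyperplane equations and project onto the unit normal.

```latex
\begin{aligned}
w^\top x_1 + b &= +1 \\
w^\top x_2 + b &= -1 \\
\Rightarrow\quad w^\top (x_1 - x_2) &= 2 \\
\text{margin} \;=\; \frac{w^\top (x_1 - x_2)}{\lVert w \rVert} &= \frac{2}{\lVert w \rVert}
\end{aligned}
```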
Maximizing the margin • Goal: maximize the margin 2/||w|| (equivalently, minimize ||w||²/2) subject to yᵢ(wᵀxᵢ + b) ≥ 1 for all i. • The constraints enforce linear separability of the data by the decision boundary.
Max Margin Loss Function • For constrained optimization, use Lagrange multipliers. • Optimize the “Primal”.
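For reference, the standard hard-margin primal Lagrangian (written out here because the slide's equation is not reproduced in the text):

```latex
L_p(w, b, \alpha) \;=\; \tfrac{1}{2}\lVert w \rVert^2
  \;-\; \sum_{i=1}^{N} \alpha_i \bigl[\, y_i (w^\top x_i + b) - 1 \,\bigr],
\qquad \alpha_i \ge 0
```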
Max Margin Loss Function • Optimize the “Primal” • Partial derivative with respect to b: ∂L_p/∂b = -Σᵢ αᵢyᵢ = 0, so Σᵢ αᵢyᵢ = 0.
Max Margin Loss Function • Optimize the “Primal” • Partial derivative with respect to w: ∂L_p/∂w = w - Σᵢ αᵢyᵢxᵢ = 0, so w = Σᵢ αᵢyᵢxᵢ. • Now we have to find the αᵢ: substitute back into the loss function.
Max Margin Loss Function • Construct the “dual” by substituting w = Σᵢ αᵢyᵢxᵢ and Σᵢ αᵢyᵢ = 0 back into L_p.
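Carrying out that substitution gives the standard dual (reproduced here for reference, since the slide's equation is not included in the text):

```latex
\begin{aligned}
\max_{\alpha}\quad & W(\alpha) \;=\; \sum_{i=1}^{N} \alpha_i
  \;-\; \frac{1}{2} \sum_{i=1}^{N}\sum_{j=1}^{N} \alpha_i \alpha_j\, y_i y_j\, x_i^\top x_j \\
\text{s.t.}\quad & \alpha_i \ge 0 \;\;\forall i, \qquad \sum_{i=1}^{N} \alpha_i y_i = 0
\end{aligned}
```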
Dual formulation of the error • Optimize this quadratic program to identify the Lagrange multipliers and thus the weights. • There exist (extremely) fast approaches to quadratic optimization in C, C++, Python, Java, and R.
Quadratic Programming • General form: optimize f(x) = ½ xᵀQx + cᵀx subject to linear constraints. • If Q is positive semidefinite, then f(x) is convex. • If f(x) is convex, then every local optimum is a global optimum: there are no spurious local solutions to get stuck in.
Support Vector Expansion • When αᵢ is non-zero, xᵢ is a support vector. • When αᵢ is zero, xᵢ is not a support vector. • New decision function: f(x) = sign(Σᵢ∈SV αᵢyᵢ xᵢᵀx + b) • Independent of the dimension of x!
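As an illustrative sketch only (not the course's code), the dual above can be handed to an off-the-shelf QP solver. This version assumes the cvxopt package is available; the function name `fit_hard_margin_svm` and the 1e-6 support-vector threshold are choices made for the example.

```python
import numpy as np
from cvxopt import matrix, solvers

def fit_hard_margin_svm(X, y):
    """Solve the hard-margin dual QP and return (alphas, w, b).

    X: (N, d) array of inputs; y: (N,) array of labels in {-1, +1}.
    """
    N = X.shape[0]
    K = X @ X.T                                    # Gram matrix of dot products x_i . x_j
    P = matrix(np.outer(y, y) * K)                 # Q_ij = y_i y_j x_i . x_j
    q = matrix(-np.ones(N))                        # maximize sum(alpha)  <=>  minimize -sum(alpha)
    G = matrix(-np.eye(N))                         # -alpha_i <= 0, i.e. alpha_i >= 0
    h = matrix(np.zeros(N))
    A = matrix(y.reshape(1, -1).astype(float))     # equality constraint: sum_i alpha_i y_i = 0
    b = matrix(0.0)

    solvers.options['show_progress'] = False
    alphas = np.ravel(solvers.qp(P, q, G, h, A, b)['x'])

    sv = alphas > 1e-6                             # support vectors: non-zero alpha
    w = ((alphas[sv] * y[sv])[:, None] * X[sv]).sum(axis=0)   # w = sum_i alpha_i y_i x_i
    b_out = float(np.mean(y[sv] - X[sv] @ w))      # each SV sits on its margin hyperplane
    return alphas, w, b_out

# Toy usage on a linearly separable set
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
alphas, w, b = fit_hard_margin_svm(X, y)
print(np.sign(X @ w + b))                          # recovers the training labels
```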
Kuhn-Tucker Conditions • In constrained optimization, at the optimal solution: constraint × Lagrange multiplier = 0, i.e. αᵢ[yᵢ(wᵀxᵢ + b) - 1] = 0 for every i. • Only points on the margin (support) hyperplanes contribute to the solution!
Interpretability of SVM parameters • What else can we tell from the alphas? • If αᵢ is large, then the associated data point strongly constrains the solution. • It’s either an outlier, or a genuinely important example. • But this only gives us the best solution for linearly separable data sets…
Basis of Kernel Methods • The decision process doesn’t depend on the dimensionality of the data. • We can map the data into a higher-dimensional space. • Note: data points only appear within a dot product. • The error is based on the dot products of data points, not the data points themselves.
Basis of Kernel Methods • Since data points only appear within a dot product, we can map to another space through a replacement: xᵢᵀxⱼ → φ(xᵢ)ᵀφ(xⱼ). • The error is based on the dot products of data points, not the data points themselves.
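A minimal sketch of that replacement, peeking ahead to the kernel trick: the names `rbf_kernel` and `kernel_decision` are made up for the example, and the Gaussian (RBF) kernel is used only as one familiar instance of a K(xᵢ, xⱼ) = φ(xᵢ)ᵀφ(xⱼ).

```python
import numpy as np

def rbf_kernel(x_i, x_j, gamma=1.0):
    """K(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2): a dot product in an implicit feature space."""
    diff = x_i - x_j
    return np.exp(-gamma * np.dot(diff, diff))

def kernel_decision(x, support_X, support_y, alphas, b, kernel=rbf_kernel):
    """Kernelized support vector expansion: replace x_i . x with K(x_i, x)."""
    score = sum(a * y_i * kernel(x_i, x)
                for a, y_i, x_i in zip(alphas, support_y, support_X)) + b
    return 1 if score >= 0 else -1
```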
Learning Theory bases of SVMs • Theoretical bounds on testing error. • The upper bound doesn’t depend on the dimensionality of the space. • The bound is minimized by maximizing the margin, γ, associated with the decision boundary.
Why we like SVMs • They work • Good generalization • Easily interpreted. • Decision boundary is based on the data in the form of the support vectors. • Not so in multilayer perceptron networks • Principled bounds on testing error from Learning Theory (VC dimension)
SVM vs. MLP • SVMs have many fewer parameters • SVM: maybe just a kernel parameter • MLP: number and arrangement of nodes, plus the learning rate η • SVM: convex optimization task • MLP: the likelihood is non-convex, so training can get stuck in local minima
Soft margin classification • There can be outliers on the wrong side of the decision boundary, or points that force a very small margin. • Solution: introduce a penalty term into the constrained objective (slack that lets constraints be violated at a cost).
Soft-Margin Dual • Same dual as before, but with box constraints 0 ≤ αᵢ ≤ C. • Still quadratic programming!
Soft margin example • Points are allowed within the margin, but a cost is incurred. • Hinge loss: max(0, 1 - yᵢ(wᵀxᵢ + b)).
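For reference, the standard soft-margin primal with slack variables ξᵢ, which is equivalent to the hinge-loss penalty the slide refers to (written out here since the slide's equations are not reproduced in the text):

```latex
\begin{aligned}
\min_{w,\,b,\,\xi}\;\; & \tfrac{1}{2}\lVert w \rVert^2 \;+\; C \sum_{i=1}^{N} \xi_i \\
\text{s.t.}\;\; & y_i (w^\top x_i + b) \;\ge\; 1 - \xi_i, \qquad \xi_i \ge 0
\end{aligned}
\qquad\Longleftrightarrow\qquad
\min_{w,\,b}\; \tfrac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{N} \max\bigl(0,\; 1 - y_i (w^\top x_i + b)\bigr)
```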
Probabilities from SVMs • Support Vector Machines are discriminant functions • Discriminant functions: f(x) = c • Discriminative models: f(x) = argmax_c p(c|x) • Generative models: f(x) = argmax_c p(x|c)p(c)/p(x) • No (principled) probabilities from SVMs • SVMs are not based on probability distribution functions of class instances.
Efficiency of SVMs • Not especially fast. • Training: O(n³) (quadratic programming efficiency) • Evaluation: O(n) (need to evaluate the kernel against each support vector, potentially n of them)
Good Bye • Next time: • The Kernel “Trick” -> Kernel Methods • or • How can we use SVMs on data that are not linearly separable?