This text provides an overview of Support Vector Machines, including a reminder of the perceptron algorithm and the concept of large-margin linear classifiers. It explores both the separable and non-separable cases, discussing the use of slack variables and the optimization problem involved. The text also introduces the concept of basis functions and the kernel trick to improve the flexibility and performance of SVMs.
Support Vector Machines
• Reminder of the perceptron
• Large-margin linear classifier
• Non-separable case
Linearly separable case Every vector in the grey region is a solution vector. The region is called the “solution region”. A vector in the middle looks better. We can impose conditions to select it.
Perceptron
Y(a) is the set of samples misclassified by a. The perceptron criterion is
J(a) = Σ_{y ∈ Y(a)} (−aᵀy).
When Y(a) is empty, define J(a) = 0. Because aᵀy ≤ 0 whenever y is misclassified, J(a) is non-negative. The gradient is simple:
∇J(a) = Σ_{y ∈ Y(a)} (−y).
The update rule is:
a(k+1) = a(k) + η(k) Σ_{y ∈ Y(a(k))} y,
where η(k) is the learning rate.
Perceptron
The perceptron adjusts a only according to misclassified samples; correctly classified samples are ignored. The final a is a linear combination of the training points. Good test-sample performance requires a large training set; however, a large training set is almost certainly not linearly separable. When the data are not linearly separable, the iteration never stops. To force convergence, we can let η(k) → 0 as k → ∞. But then, how should the rate of decrease be chosen?
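The update rule above can be sketched in pure Python. This is a minimal illustration, not the slides' own code; the toy dataset, learning rate, and epoch cap are made-up assumptions.

```python
# Minimal perceptron sketch: update a only on misclassified samples.
# Samples are label-normalized and augmented, so "misclassified" means a^T y <= 0.

def perceptron(samples, labels, eta=1.0, max_epochs=100):
    """Train an augmented weight vector a = [w0, w1, ..., wd]."""
    d = len(samples[0])
    a = [0.0] * (d + 1)                                  # start from zero
    for _ in range(max_epochs):
        mistakes = 0
        for x, lab in zip(samples, labels):
            y = [lab] + [lab * xi for xi in x]           # label-normalized augmented sample
            if sum(ai * yi for ai, yi in zip(a, y)) <= 0:  # misclassified: a^T y <= 0
                a = [ai + eta * yi for ai, yi in zip(a, y)]  # gradient step on J(a)
                mistakes += 1
        if mistakes == 0:                                # Y(a) empty, so J(a) = 0: stop
            break
    return a

# Tiny linearly separable toy set (illustrative values):
X = [(2.0, 2.0), (1.0, 3.0), (-1.0, -1.0), (-2.0, -1.5)]
t = [1, 1, -1, -1]
a = perceptron(X, t)
```

Because this toy set is separable, the loop terminates with every training point on the correct side of the learned hyperplane.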
Large-margin linear classifier
• Assume the linearly separable case.
• The optimal separating hyperplane separates the two classes and maximizes the distance to the closest point from either class: f(x) = wᵀx + w0
• Unique solution
• Better test-sample performance
Large-margin linear classifier
• {x1, ..., xn}: our training dataset in d dimensions
• yi ∈ {1, −1}: class label
• Our goal: among all f(x) = xᵀβ + β0 with ||β|| = 1,
• find the optimal separating hyperplane, i.e.
• find the largest margin M such that yi(xiᵀβ + β0) ≥ M, i = 1, ..., n.
Large-margin linear classifier
• The border lies M away from the hyperplane on either side; M is called the "margin".
• Drop the ||β|| = 1 requirement and set M = 1/||β||. Then the easier, equivalent version is:
min ½||β||² subject to yi(xiᵀβ + β0) ≥ 1, i = 1, ..., n.
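The identity M = 1/||β|| can be checked numerically: the signed distance from a point x to the hyperplane βᵀx + β0 = 0 is (βᵀx + β0)/||β||, so a point on the canonical margin border βᵀx + β0 = 1 lies exactly 1/||β|| from the hyperplane. The coefficients below are hypothetical illustrative values.

```python
import math

# Hypothetical hyperplane in canonical scaling (closest points satisfy |beta^T x + beta0| = 1):
beta = (1.0, 1.0)
beta0 = -3.0

def signed_distance(x):
    """Signed Euclidean distance from x to the hyperplane beta^T x + beta0 = 0."""
    norm = math.sqrt(sum(b * b for b in beta))
    return (sum(b * xi for b, xi in zip(beta, x)) + beta0) / norm

# x = (2, 2) satisfies beta^T x + beta0 = 1, so it sits on the margin border:
M = abs(signed_distance((2.0, 2.0)))
print(M, 1.0 / math.sqrt(2.0))  # both equal 1/||beta||
```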
Non-separable case
• When the two classes are not linearly separable, allow slack variables ξi ≥ 0 for the points on the wrong side of their margin border:
yi(xiᵀβ + β0) ≥ M(1 − ξi), i = 1, ..., n, with Σ ξi bounded by a constant.
Non-separable case
• The optimization problem becomes:
min ½||β||² subject to yi(xiᵀβ + β0) ≥ 1 − ξi, ξi ≥ 0, Σ ξi ≤ constant.
• ξi = 0 when the point is on the correct side of the margin;
• ξi > 1 when the point passes the hyperplane to the wrong side;
• 0 < ξi < 1 when the point is inside the margin but still on the correct side.
Non-separable case
• When a point is outside the margin (on its correct side), ξi = 0 and it plays little role in determining the boundary; the classifier is not forcing any special class distribution.
Computation
• An equivalent form:
min ½||β||² + C Σ ξi subject to ξi ≥ 0, yi(xiᵀβ + β0) ≥ 1 − ξi for all i.
• The cost parameter C replaces the constant bound on Σ ξi.
• For the separable case, C = ∞.
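The penalized form can be evaluated directly once we note that, at the optimum, each slack equals the hinge loss ξi = max(0, 1 − yi f(xi)). A small sketch, with made-up data and a made-up hyperplane:

```python
def soft_margin_objective(beta, beta0, X, y, C):
    """0.5*||beta||^2 + C * sum of slacks, with slack_i = max(0, 1 - y_i * f(x_i))."""
    reg = 0.5 * sum(b * b for b in beta)
    slack = 0.0
    for x_vec, yi in zip(X, y):
        f = sum(b * v for b, v in zip(beta, x_vec)) + beta0
        slack += max(0.0, 1.0 - yi * f)   # zero for points outside the margin
    return reg + C * slack

# Illustrative data: the point (0.5, 0) falls inside the margin of the
# hyperplane beta = (1, 0), beta0 = 0, so it incurs slack 1 - 0.5 = 0.5.
X = [(2.0, 0.0), (0.5, 0.0), (-2.0, 0.0)]
y = [1, 1, -1]
print(soft_margin_objective((1.0, 0.0), 0.0, X, y, C=1.0))  # 0.5 + 1.0*0.5 = 1.0
```

Raising C penalizes slack more heavily, pushing the solution toward the separable-case behavior (C = ∞ forbids slack entirely).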
Computation
• This is a quadratic programming problem. The Lagrange (primal) function is:
L_P = ½||β||² + C Σ ξi − Σ αi [yi(xiᵀβ + β0) − (1 − ξi)] − Σ μi ξi.
• Take derivatives with respect to β, β0, ξi, and set them to zero:
β = Σ αi yi xi,
0 = Σ αi yi,
αi = C − μi, for all i.
• And the positivity constraints: αi ≥ 0, μi ≥ 0, ξi ≥ 0 for all i.
Computation
• Substituting the three lower equations into the top one gives the Lagrangian dual objective function:
L_D = Σ αi − ½ Σi Σj αi αj yi yj xiᵀxj,
maximized subject to 0 ≤ αi ≤ C and Σ αi yi = 0.
• The Karush-Kuhn-Tucker conditions include:
αi [yi(xiᵀβ + β0) − (1 − ξi)] = 0,
μi ξi = 0,
yi(xiᵀβ + β0) − (1 − ξi) ≥ 0, for all i.
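Evaluating the dual objective for a candidate α is a direct translation of the formula; a small sketch with hypothetical multipliers (not a solved QP):

```python
def dual_objective(alpha, y, X):
    """L_D = sum(alpha) - 0.5 * sum_ij alpha_i*alpha_j*y_i*y_j*<x_i, x_j>."""
    dot = lambda a, b: sum(u * v for u, v in zip(a, b))
    n = len(alpha)
    quad = sum(alpha[i] * alpha[j] * y[i] * y[j] * dot(X[i], X[j])
               for i in range(n) for j in range(n))
    return sum(alpha) - 0.5 * quad

# Two symmetric points with equal multipliers; note the constraint
# sum_i alpha_i * y_i = 0.5 - 0.5 = 0 is satisfied.
print(dual_objective([0.5, 0.5], [1, -1], [(1.0, 0.0), (-1.0, 0.0)]))  # 1.0 - 0.5 = 0.5
```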
Computation
• From β = Σ αi yi xi, the solution of β has the form:
β̂ = Σ α̂i yi xi.
• Non-zero coefficients α̂i occur only for those points i for which the constraint yi(xiᵀβ + β0) = 1 − ξi holds exactly.
• These points are called "support vectors".
• Some support vectors lie on the edge of the margin (ξ̂i = 0), characterized by 0 < α̂i < C;
• the remainder have ξ̂i > 0 and α̂i = C; they are on the wrong side of their margin.
Computation
• (Figure: with a smaller C, 85% of the points are support points.)
Support Vector Machines
• Enlarge the feature space to make the procedure more flexible.
• Basis functions: h(x) = (h1(x), ..., hM(x))
• Use the same procedure to construct an SV classifier in the transformed space: f(x) = h(x)ᵀβ + β0
• The decision is made by G(x) = sign(f(x)).
SVM
• Recall in the linear space:
f(x) = xᵀβ + β0 = Σ αi yi ⟨x, xi⟩ + β0.
• With the new basis:
f(x) = h(x)ᵀβ + β0 = Σ αi yi ⟨h(x), h(xi)⟩ + β0.
SVM
When domain knowledge is available, we can sometimes use explicit transformations. Often, however, we cannot.
SVM
• h(x) is involved ONLY in the form of inner products!
• So as long as we define the kernel function
K(x, x′) = ⟨h(x), h(x′)⟩,
which computes the inner product in the transformed space, we don't need to know h(x) itself! This is the "kernel trick".
• Some commonly used kernels:
dth-degree polynomial: K(x, x′) = (1 + ⟨x, x′⟩)^d,
radial basis: K(x, x′) = exp(−γ||x − x′||²),
neural network: K(x, x′) = tanh(κ1⟨x, x′⟩ + κ2).
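These kernels are one-liners in plain Python. The parameter values below (d, gamma, the kappas) are illustrative defaults, not prescribed settings; the closing comment shows the kernel trick for a 1-dimensional input and d = 2, where the implicit basis is h(x) = (1, √2·x, x²).

```python
import math

def poly_kernel(x, xp, d=2):
    """d-th degree polynomial: K(x, x') = (1 + <x, x'>)^d."""
    return (1.0 + sum(a * b for a, b in zip(x, xp))) ** d

def rbf_kernel(x, xp, gamma=1.0):
    """Radial basis: K(x, x') = exp(-gamma * ||x - x'||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, xp)))

def nn_kernel(x, xp, kappa1=1.0, kappa2=0.0):
    """Neural-network kernel: K(x, x') = tanh(kappa1 * <x, x'> + kappa2)."""
    return math.tanh(kappa1 * sum(a * b for a, b in zip(x, xp)) + kappa2)

# Kernel trick check for d = 2 in one dimension:
# (1 + x*x')^2 = <h(x), h(x')> with h(x) = (1, sqrt(2)*x, x^2),
# so the inner product in the 3-d transformed space costs one scalar product.
print(poly_kernel((3.0,), (2.0,)))  # (1 + 6)^2 = 49.0
```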
SVM
• Recall that αi = 0 for non-support vectors, so
f(x) = Σ αi yi K(x, xi) + β0
depends only on the support vectors.
SVM • K(x,x’) can be seen as a similarity measure between x and x’. • The decision is made essentially by a weighted sum of similarity of the object to all the support vectors.
SVM
• (Figure: test-error comparison; the Bayes error of the example is 0.029.)
• When noise features are present, the SVM suffers from not being able to concentrate on a subspace: with the polynomial kernel, all terms of the form 2XjXj′ are given equal weight.
SVM
• How to select the kernel and its parameters?
• Domain knowledge:
  - How complex should the space partition be?
  - Should the surface be smooth?
• Compare the models by their approximate testing error rate (cross-validation):
  - Fit the data using multiple kernels/parameter settings
  - Estimate the error rate for each setting
  - Select the best-performing one
• Parameter optimization methods