Support Vector Machines: Text Book Slides
Support Vector Machines • Find a linear hyperplane (decision boundary) that will separate the data
Support Vector Machines • One possible solution
Support Vector Machines • Another possible solution
Support Vector Machines • Other possible solutions
Support Vector Machines • Which one is better, B1 or B2? How do you define "better"?
Support Vector Machines • Find the hyperplane that maximizes the margin => B1 is better than B2
Support Vector Machines • We want to maximize the margin: 2 / ||w|| • Which is equivalent to minimizing: ||w||² / 2 • But subject to the following constraints: y_i (w·x_i + b) ≥ 1 for every training point (x_i, y_i) • This is a constrained optimization problem • Numerical approaches exist to solve it (e.g., quadratic programming)
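To make this constrained optimization concrete, here is a minimal sketch using scikit-learn's SVC (not part of the original slides); a very large C approximates the hard-margin problem, and the toy points are hypothetical. The margin width recovered from the fitted w matches the 2/||w|| expression above.

```python
# Minimal sketch (assumes scikit-learn): a very large C approximates the
# hard-margin linear SVM described above. The toy data is hypothetical.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 2.5], [3.0, 1.5],         # class +1
              [-1.0, -1.0], [-2.0, -2.5], [-3.0, -1.0]])  # class -1
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e10)    # huge C ~ no slack allowed
clf.fit(X, y)

w = clf.coef_[0]                      # orientation of the separating hyperplane
b = clf.intercept_[0]                 # offset term
print("w =", w, "b =", b)
print("margin width = 2/||w|| =", 2.0 / np.linalg.norm(w))
```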
Support Vector Machines • What if the problem is not linearly separable?
Support Vector Machines • What if the problem is not linearly separable? • Introduce slack variables ξ_i ≥ 0 • Need to minimize: ||w||² / 2 + C Σ_i ξ_i • Subject to: y_i (w·x_i + b) ≥ 1 − ξ_i
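A hedged sketch of this soft-margin version: in scikit-learn the parameter C weights the slack penalty, so smaller C tolerates more margin violations. The dataset and C values below are illustrative only.

```python
# Sketch (assumes scikit-learn): C weights the slack-variable penalty.
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Hypothetical overlapping data, so that slack variables are actually needed.
X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.5, random_state=0)

for C in (0.01, 1.0, 100.0):                      # illustrative values
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C:>6}: support vectors per class = {clf.n_support_}")
```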
Nonlinear Support Vector Machines • What if decision boundary is not linear?
Nonlinear Support Vector Machines • Transform data into higher dimensional space
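One way to see the effect of such a transform is via a kernel, which performs the higher-dimensional mapping implicitly. The sketch below is my own illustration (scikit-learn, a synthetic two-circles dataset) comparing a linear boundary against an RBF kernel.

```python
# Sketch: nonlinear data handled by an implicit mapping to a higher-
# dimensional space (RBF kernel). Dataset and gamma are illustrative.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear = SVC(kernel="linear").fit(X, y)
rbf = SVC(kernel="rbf", gamma=1.0).fit(X, y)

print("linear accuracy:", linear.score(X, y))   # typically poor: not linearly separable
print("rbf accuracy:   ", rbf.score(X, y))      # typically close to 1.0
```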
Support Vector Machines • To understand the power and elegance of SVMs, one must grasp three key ideas: • Margins • Duality • Kernels
Support Vector Machines • Consider the simple case of linear classification • Binary classification task: training points x_i (i = 1, 2, …, m) with class labels y_i ∈ {+1, −1} in a d-dimensional attribute space • Let the classification function be f(x) = sign(w·x − b), where the vector w determines the orientation of the discriminant plane and the scalar b is the offset of the plane from the origin • Assume that the two sets are linearly separable, i.e., there exists a plane that correctly classifies all the points in the two sets
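The classification function is simple to write down; here is a small sketch with hypothetical w and b values, just to make the form f(x) = sign(w·x − b) concrete.

```python
# Sketch of f(x) = sign(w·x - b); w and b below are hypothetical values.
import numpy as np

def f(x, w, b):
    """Linear discriminant: +1 on one side of the plane, -1 on the other."""
    return np.sign(np.dot(w, x) - b)

w = np.array([2.0, -1.0])   # orientation of the discriminant plane
b = 0.5                     # offset of the plane from the origin

print(f(np.array([1.0, 0.0]), w, b))   # ->  1.0
print(f(np.array([0.0, 1.0]), w, b))   # -> -1.0
```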
Support Vector Machines • The solid line is preferred • Geometrically, we can characterize the solid plane as being "furthest" from both classes • How can we construct the plane "furthest" from both classes?
Support Vector Machines • Examine the convex hull of each class’ training data (indicated by dotted lines) and then find the closest points in the two convex hulls (circles labeled d and c). • The convex hull of a set of points is the smallest convex set containing the points. • If we construct the plane that bisects these two points (w=d-c), the resulting classifier should be robust in some sense. Figure – Best plane bisects closest points in the convex hulls
Convex Sets • Figure: a convex set vs. a non-convex (concave) set • A function (in blue) is convex if and only if the region above its graph (in green) is a convex set.
Convex Hulls • Elastic band analogy: for planar objects, i.e., those lying in the plane, the convex hull may be easily visualized by imagining an elastic band stretched open to encompass the given object; when released, it assumes the shape of the required convex hull.
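For completeness, a small sketch (not from the slides) of computing a planar convex hull with SciPy; the random point set stands in for the "object" in the elastic-band picture.

```python
# Sketch (assumes SciPy): the convex hull of a planar point set.
import numpy as np
from scipy.spatial import ConvexHull

rng = np.random.default_rng(0)
points = rng.random((20, 2))          # 20 arbitrary points in the plane

hull = ConvexHull(points)
print("hull vertices (point indices):", hull.vertices)  # the "elastic band"
print("hull area:", hull.volume)      # for 2-D input, .volume is the area
```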
SVM: Margins • The best plane maximizes the margin.
SVM: Duality • Maximize the margin • Best plane bisects the closest points in the convex hulls • These are two views of the same problem: duality!
SVM: Mathematics behind it! • 2-class problem (extends to the multi-class problem) • Linearly separable case (the linearly inseparable case is treated later) • Decision boundary is a line (plane, hyper-plane) • Maximal Margin Hyper-plane (MMH), with 2 equidistant parallel hyper-planes on either side of it • Separating hyper-plane equation: W·X + b = 0, where W = {w1, w2, …, wn} is the weight vector and b is a scalar (called the bias) • Considering 2 input attributes A1 and A2, X = (x1, x2) and the separating hyper-plane becomes w1·x1 + w2·x2 + b = 0
SVM: Mathematics behind it! • Separating hyper-plane (SH): W·X + b = 0 • Any point lying above the SH satisfies W·X + b > 0 • Any point lying below the SH satisfies W·X + b < 0 • Adjusting the weights gives the margin hyper-planes H1: W·X + b ≥ 1 for y_i = +1 and H2: W·X + b ≤ −1 for y_i = −1 • Combining, we get y_i (W·X_i + b) ≥ 1 for all i • Any training tuple that falls on H1 or H2 (i.e., satisfies this inequality with equality) is called a support vector (SV) • SVs are the most difficult tuples to classify and give the most important information regarding classification
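To make the H1/H2 picture concrete, the sketch below fits a linear SVM with scikit-learn (my own illustration on synthetic, roughly separable data) and checks that the reported support vectors satisfy y_i (W·X_i + b) ≈ 1.

```python
# Sketch: support vectors are the tuples lying on H1 or H2, so for them
# y_i (w·x_i + b) ≈ 1 (assuming the synthetic data is separable).
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=60, centers=2, cluster_std=0.6, random_state=1)
clf = SVC(kernel="linear", C=1e6).fit(X, y)          # large C ~ hard margin

w, b = clf.coef_[0], clf.intercept_[0]
sv = clf.support_vectors_
signs = np.where(y[clf.support_] == 1, 1.0, -1.0)    # map labels {0,1} -> {-1,+1}

print(signs * (sv @ w + b))   # each value should be close to 1
```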
SVM: Size of maximal margin • The distance of any point on H1 or H2 from the SH is 1/||W||, so the maximal margin is 2/||W||.
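In code, the margin size follows directly from the weight vector; the sketch below simply evaluates 1/||W|| and 2/||W|| for an illustrative w.

```python
# Sketch: distance from H1 (or H2) to the separating hyper-plane is 1/||w||,
# so the full margin is 2/||w||. The weight vector here is illustrative.
import numpy as np

w = np.array([0.8, -0.6])                 # example weight vector, ||w|| = 1.0
print("H1 to SH distance:", 1.0 / np.linalg.norm(w))   # 1.0
print("maximal margin:   ", 2.0 / np.linalg.norm(w))   # 2.0
```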
SVM: Some Important Points • The complexity of the learned classifier is characterized by the number of SVs rather than by the number of dimensions • SVs are the critical training tuples: if all other training tuples were removed and training were repeated, the same SH would be found • The number of SVs can be used to compute an upper bound on the expected error rate • An SVM with a small number of SVs can have good generalization, even if the dimensionality of the data is high
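The claim that the same SH would be found from the SVs alone can be checked empirically; the sketch below (scikit-learn and synthetic data, both my own choices) refits using only the support vectors and compares the two hyperplanes.

```python
# Sketch: retraining on the support vectors alone recovers (approximately)
# the same separating hyper-plane. Data and parameters are illustrative.
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, cluster_std=0.8, random_state=2)

full = SVC(kernel="linear", C=1e6).fit(X, y)
sv_only = SVC(kernel="linear", C=1e6).fit(X[full.support_], y[full.support_])

print("SVs used:", len(full.support_), "out of", len(X))
print("w, b (all data): ", full.coef_[0], full.intercept_[0])
print("w, b (SVs only): ", sv_only.coef_[0], sv_only.intercept_[0])
```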