Support Vector Machines and Kernel Methods
Kenan Gençol
Department of Electrical and Electronics Engineering, Anadolu University
Submitted in the course MAT592 Seminar
Advisor: Prof. Dr. Yalçın Küçük, Department of Mathematics
Agenda • Linear Discriminant Functions and Decision Hyperplanes • Introduction to SVM • Support Vector Machines • Introduction to Kernels • Nonlinear SVM • Kernel Methods
Linear Discriminant Functions and Decision Hyperplanes Figure 1. Two classes of patterns and a linear decision function
Linear Discriminant Functions and Decision Hyperplanes • Each pattern is represented by a vector x = [x1 x2]T • The linear decision function has the equation w1x1 + w2x2 + w0 = 0 • where w1, w2 are weights and w0 is the bias term
Linear Discriminant Functions and Decision Hyperplanes • The general decision hyperplane equation in d-dimensional space has the form: wTx + w0 = w1x1 + w2x2 + .... + wdxd + w0 = 0 • where w = [w1 w2 ....wd] is the weight vector and w0 is the bias term.
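As a small illustration (not part of the original slides), a minimal NumPy sketch of evaluating such a linear decision function and classifying by its sign; the weight and bias values are assumed purely for the example:

```python
import numpy as np

# Assumed example weights and bias for a 2-D problem (illustrative only).
w = np.array([2.0, -1.0])   # weight vector [w1, w2]
w0 = 0.5                    # bias term

def g(x):
    """Linear decision function g(x) = w^T x + w0."""
    return np.dot(w, x) + w0

def classify(x):
    """Assign class +1 or -1 according to the sign of g(x)."""
    return 1 if g(x) >= 0 else -1

print(classify(np.array([1.0, 0.0])))   # g = 2.5  -> +1
print(classify(np.array([-1.0, 1.0])))  # g = -2.5 -> -1
```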
Introduction to SVM • There are many hyperplanes that separate the two classes Figure 2. An example of two possible classifiers
Introduction to SVM • THE GOAL: • Our goal is to find the direction w and bias w0 that give the maximum possible margin, in other words, to orient the hyperplane so that it is as far as possible from the closest members of both classes.
SVM: Linearly Separable Case Figure 3. Hyperplane through two linearly separable classes
SVM: Linearly Separable Case • Our training data is of the form: {xi, yi}, i = 1, 2, ...., L, where yi ∈ {-1, +1} and xi ∈ Rd • This hyperplane can be described by x . w + b = 0 • and is called the separating hyperplane.
SVM: Linearly Separable Case • Select variables w and b so that: xi . w + b >= +1 for yi = +1, and xi . w + b <= -1 for yi = -1 • These equations can be combined into: yi(xi . w + b) - 1 >= 0 for all i
SVM: Linearly Separable Case • The points that lie closest to the separating hyperplane are called support vectors (circled points in the diagram) and • H1: xi . w + b = +1, H2: xi . w + b = -1 • are called supporting hyperplanes.
SVM: Linearly Separable Case Figure 3. Hyperplane through two linearly separable classes (repeated)
SVM: Linearly Separable Case • The hyperplane’s equidistance from H1 and H2 means that d1 = d2, and this quantity is known as the SVM margin: • d1 + d2 = 2 / ||w|| • d1 = d2 = 1 / ||w||
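A minimal numeric sketch of these quantities, assuming example values for w and b (not taken from the slides): the margin 2 / ||w|| and the signed distance of a point to the separating hyperplane.

```python
import numpy as np

# Assumed example separating hyperplane (illustrative values only).
w = np.array([1.0, 1.0])
b = -1.0

margin = 2.0 / np.linalg.norm(w)          # d1 + d2 = 2 / ||w||
print("margin:", margin)                  # 2 / sqrt(2) ~ 1.414

# Signed distance of a point x to the hyperplane x.w + b = 0:
x = np.array([2.0, 1.0])
dist = (np.dot(w, x) + b) / np.linalg.norm(w)
print("distance:", dist)
```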
SVM: Linearly Separable Case • Maximizing the margin 2 / ||w|| is equivalent to minimizing ||w||: • min ||w|| such that yi(xi . w + b) - 1 >= 0 • Minimizing ||w|| is equivalent to minimizing (1/2)||w||^2, which makes it possible to perform Quadratic Programming (QP) optimization
SVM: Linearly Separable Case • Optimization problem: • Minimize (1/2)||w||^2 • subject to yi(xi . w + b) >= 1, i = 1, 2, ...., L
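A hedged sketch of this primal problem on a tiny assumed toy data set, using SciPy's general-purpose SLSQP solver (the data and the packing of the variables into a single vector are assumptions made only for illustration; a dedicated QP solver would normally be used):

```python
import numpy as np
from scipy.optimize import minimize

# Tiny linearly separable toy set (assumed, illustrative only).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

# Variables packed as v = [w1, w2, b].
def objective(v):
    w = v[:2]
    return 0.5 * np.dot(w, w)             # (1/2)||w||^2

# SLSQP 'ineq' constraints mean fun(v) >= 0, matching yi(xi . w + b) - 1 >= 0.
constraints = [{"type": "ineq",
                "fun": lambda v, xi=xi, yi=yi: yi * (np.dot(v[:2], xi) + v[2]) - 1.0}
               for xi, yi in zip(X, y)]

res = minimize(objective, x0=np.zeros(3), method="SLSQP", constraints=constraints)
w, b = res.x[:2], res.x[2]
print("w =", w, "b =", b, "margin =", 2.0 / np.linalg.norm(w))
```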
SVM: Linearly Separable Case • This is an inequality constrained optimization problem with Lagrangian function: • L(w, b, α) = (1/2)||w||^2 - Σi αi [yi(xi . w + b) - 1]   (1) • where αi >= 0, i = 1, 2, ...., L are the Lagrange multipliers.
SVM • The corresponding KKT conditions are: • ∂L/∂w = 0 ⇒ w = Σi αi yi xi   (2) • ∂L/∂b = 0 ⇒ Σi αi yi = 0   (3)
SVM • This is a convex optimization problem. The cost function is convex and the constraints are linear, so they define a convex set of feasible solutions. Such problems can be solved by considering the so-called Lagrangian duality.
SVM • Substituting (2) and (3) into (1) gives a new formulation which, being dependent only on α, we need to maximize.
SVM • This is called the dual form (Lagrangian dual) of the primal form. The dual form only requires the dot products of the input vectors to be calculated. • This is important for the Kernel Trick, which will be described later.
SVM • So the problem becomes a dual problem: • Maximize LD(α) = Σi αi - (1/2) Σi Σj αi αj yi yj (xi . xj) • subject to αi >= 0, i = 1, 2, ...., L, and Σi αi yi = 0
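A minimal sketch of solving this dual numerically on the same assumed toy data (again with SciPy's SLSQP for convenience; in practice a dedicated QP solver is used):

```python
import numpy as np
from scipy.optimize import minimize

# Same assumed toy data as before (illustrative only).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
L = len(y)

# H_ij = y_i y_j (x_i . x_j): the matrix appearing in the dual objective.
H = (y[:, None] * X) @ (y[:, None] * X).T

def neg_dual(a):
    # Negative of L_D(a) = sum(a) - 0.5 * a^T H a (we minimize the negative).
    return 0.5 * a @ H @ a - np.sum(a)

constraints = [{"type": "eq", "fun": lambda a: np.dot(a, y)}]   # sum_i a_i y_i = 0
bounds = [(0.0, None)] * L                                      # a_i >= 0

res = minimize(neg_dual, x0=np.zeros(L), method="SLSQP",
               bounds=bounds, constraints=constraints)
alpha = res.x
w = ((alpha * y)[:, None] * X).sum(axis=0)      # w = sum_i a_i y_i x_i, from (2)
print("alpha =", alpha.round(4), "w =", w.round(4))
```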
SVM • Differentiating with respect to the αi’s and using the constraint equation, a system of equations is obtained. Solving the system, the Lagrange multipliers are found and the optimum hyperplane is given according to the formula: • w = Σi αi yi xi
SVM • Some Notes: • SUPPORT VECTORS are the feature vectors with αi > 0, i = 1, 2, ...., L • The cost function is strictly convex. • The Hessian matrix is positive definite. • Any local minimum is also global and unique. The optimal hyperplane classifier of an SVM is UNIQUE. • Although the solution is unique, the resulting Lagrange multipliers are not unique.
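A quick cross-check of these facts with scikit-learn (assuming scikit-learn is available; the data and C value are illustrative assumptions): the support vectors are the points with αi > 0, and w is recovered from w = Σi αi yi xi.

```python
import numpy as np
from sklearn.svm import SVC

# Assumed toy data (illustrative only).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])

# A hard-margin SVM is approximated with a very large C.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

# dual_coef_ holds alpha_i * y_i for the support vectors only.
alpha_y = clf.dual_coef_.ravel()
w_from_dual = alpha_y @ clf.support_vectors_       # w = sum_i alpha_i y_i x_i
print("support vectors:\n", clf.support_vectors_)
print("w from dual:", w_from_dual, " w from coef_:", clf.coef_.ravel())
print("b:", clf.intercept_[0])
```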
Kernels: Introduction • When applying our SVM to linearly separable data we started by creating a matrix H from the dot products of our input variables: • Hij = yi yj k(xi, xj), with k(xi, xj) = xi . xj • k(xi, xj) = xi . xj is known as the Linear Kernel, an example of a family of functions called kernel functions.
Kernels: Introduction • Kernel functions are all based on calculating the inner product of two vectors. • This means that if the data are mapped to a higher-dimensional space by a nonlinear mapping Ф, only the inner products Ф(xi) . Ф(xj) of the mapped inputs need to be determined, without having to calculate Ф explicitly. • This is called the “Kernel Trick”
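A small numerical illustration of the trick for the (assumed) homogeneous quadratic kernel k(x, y) = (x . y)^2 in two dimensions, whose explicit feature map is φ(x) = [x1^2, √2 x1x2, x2^2]:

```python
import numpy as np

# Explicit feature map for the homogeneous quadratic kernel in 2-D:
# phi(x) = [x1^2, sqrt(2)*x1*x2, x2^2], so that phi(x).phi(y) = (x.y)^2.
def phi(x):
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def k(x, y):
    return np.dot(x, y) ** 2          # kernel computed directly in the input space

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

print(np.dot(phi(x), phi(y)))         # inner product in the feature space
print(k(x, y))                        # same value, without computing phi
```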
Kernels: Introduction • The Kernel Trick is useful because many classification/regression problems are not linearly separable/regressable in the input space, but are separable/regressable in a higher-dimensional feature space.
Kernels: Introduction • Popular Kernel Families: • Radial Basis Function (RBF) Kernel: k(x, y) = exp(-||x - y||^2 / (2σ^2)) • Polynomial Kernel: k(x, y) = (x . y + c)^d • Sigmoidal (Hyperbolic Tangent) Kernel: k(x, y) = tanh(κ (x . y) + θ)
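Minimal NumPy versions of these three families (the parameter values sigma, c, d, kappa, theta are assumptions chosen only for illustration):

```python
import numpy as np

def rbf_kernel(x, y, sigma=1.0):
    # k(x, y) = exp(-||x - y||^2 / (2 sigma^2))
    return np.exp(-np.linalg.norm(x - y) ** 2 / (2.0 * sigma ** 2))

def polynomial_kernel(x, y, c=1.0, d=3):
    # k(x, y) = (x . y + c)^d
    return (np.dot(x, y) + c) ** d

def sigmoid_kernel(x, y, kappa=0.1, theta=0.0):
    # k(x, y) = tanh(kappa * (x . y) + theta)
    return np.tanh(kappa * np.dot(x, y) + theta)

x, y = np.array([1.0, 2.0]), np.array([2.0, 0.0])
print(rbf_kernel(x, y), polynomial_kernel(x, y), sigmoid_kernel(x, y))
```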
Nonlinear Support Vector Machines • The support vector machine with kernel functions becomes: • Maximize LD(α) = Σi αi - (1/2) Σi Σj αi αj yi yj k(xi, xj), subject to αi >= 0 and Σi αi yi = 0 • and the resulting classifier: • f(x) = sign( Σi αi yi k(xi, x) + b )
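A hedged end-to-end sketch using scikit-learn's SVC with an RBF kernel on an assumed XOR-style toy set, which is not linearly separable in the input space; the kernel and regularization parameters are illustrative choices:

```python
import numpy as np
from sklearn.svm import SVC

# XOR-like data: not linearly separable in the input space (assumed toy example).
X = np.array([[0, 0], [1, 1], [0, 1], [1, 0]], dtype=float)
y = np.array([-1, -1, 1, 1])

clf = SVC(kernel="rbf", gamma=1.0, C=10.0).fit(X, y)
print(clf.predict(X))                      # recovers the XOR labelling

# The decision function is sum_i alpha_i y_i k(x_i, x) + b over the support vectors.
print(clf.decision_function(X))
```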
Nonlinear Support Vector Machines Figure 4. The SVM architecture employing kernel functions.
Kernel Methods • Recall that a kernel function computes the inner product of the images of two data points under an embedding φ: k(x, y) = <φ(x), φ(y)> • k is a kernel if • 1. k is symmetric: k(x, y) = k(y, x) • 2. k is positive semi-definite, i.e., the “Gram Matrix” Kij = k(xi, xj) is positive semi-definite.
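A numerical spot-check of these two properties for an RBF Gram matrix built on assumed random data:

```python
import numpy as np

# Check that an RBF Gram matrix is symmetric and positive semi-definite.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                    # assumed random data

sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists / 2.0)                     # RBF Gram matrix, sigma = 1

print(np.allclose(K, K.T))                      # symmetry
print(np.linalg.eigvalsh(K).min() >= -1e-10)    # all eigenvalues (numerically) >= 0
```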
Kernel Methods • Mercer’s condition answers the question of for which kernels there exists a pair {H, φ} with the properties described above, and for which there does not.
Mercer’s condition • Let X be a compact subset of Rn, let x ∈ X, and let φ: x → φ(x) ∈ H be a mapping, • where H is a Euclidean space. Then the inner product operation has an equivalent representation • <φ(x), φ(z)> = K(x, z) • where K(x, z) is a symmetric function satisfying the following condition: • ∫X ∫X K(x, z) g(x) g(z) dx dz >= 0 • for any g(x), x ∈ X, such that ∫X g(x)^2 dx < +∞
Mercer’s Theorem • Theorem. Suppose K is a continuous symmetric non-negative definite kernel. Then there is an orthonormal basis {ei}i of L2[a, b] consisting of eigenfunctions of TK, • [TK φ](x) = ∫ab K(x, s) φ(s) ds, • such that the corresponding sequence of eigenvalues {λi}i is nonnegative. The eigenfunctions corresponding to non-zero eigenvalues are continuous on [a, b] and K has the representation • K(s, t) = Σj λj ej(s) ej(t) • where the convergence is absolute and uniform.
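A finite-dimensional analogue of this expansion, as an assumption-laden sketch: the kernel is discretized on a grid over [a, b], so the eigendecomposition of the resulting symmetric matrix plays the role of the eigenfunction expansion.

```python
import numpy as np

# Discretize an RBF kernel on [a, b], eigendecompose the resulting symmetric
# matrix, and rebuild it from K = sum_j lambda_j e_j e_j^T.
a, b, n = 0.0, 1.0, 50
s = np.linspace(a, b, n)
K = np.exp(-(s[:, None] - s[None, :]) ** 2 / 0.1)

lam, E = np.linalg.eigh(K)            # eigenvalues lam, eigenvectors in columns of E
print(lam.min() >= -1e-10)            # nonnegative (up to round-off)

K_rebuilt = (E * lam) @ E.T           # sum_j lam_j e_j e_j^T
print(np.allclose(K, K_rebuilt))
```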
Kernel Methods • Suppose k1 and k2 are valid (symmetric, positive definite) kernels on X. Then the following are valid kernels: • 1. k(x, y) = k1(x, y) + k2(x, y) • 2. k(x, y) = c k1(x, y), for c > 0 • 3. k(x, y) = k1(x, y) k2(x, y)
Kernel Methods • 4. k(x, y) = f(x) k1(x, y) f(y), for any function f • 5. k(x, y) = k3(φ(x), φ(y)), for a valid kernel k3 and a mapping φ • 6. k(x, y) = xT B y, for a symmetric positive semi-definite matrix B • 7. k(x, y) = exp(k1(x, y))
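A numerical spot-check of closure under sums and products (properties 1 and 3) on assumed random data: the resulting Gram matrices remain positive semi-definite.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(15, 2))                            # assumed random data

K1 = X @ X.T                                            # linear kernel
sq = np.sum((X[:, None] - X[None, :]) ** 2, axis=-1)
K2 = np.exp(-sq / 2.0)                                  # RBF kernel

for K in (K1 + K2, K1 * K2):                            # sum and elementwise product
    print(np.linalg.eigvalsh(K).min() >= -1e-8)         # eigenvalues stay >= 0
```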
References • [1] C.J.C. Burges, “A Tutorial on Support Vector Machines for Pattern Recognition”, Data Mining and Knowledge Discovery 2, 121-167, 1998. • [2] J.P. Marques de Sa, “Pattern Recognition: Concepts, Methods and Applications”, Springer, 2001. • [3] S. Theodoridis, “Pattern Recognition”, Elsevier Academic Press, 2003.
References • [4] T. Fletcher, “Support Vector Machines Explained”, UCL, March 2005. • [5] N. Cristianini, J. Shawe-Taylor, “Kernel Methods for Pattern Analysis”, Cambridge University Press, 2004. • [6] “Mercer’s Theorem”, Wikipedia: http://en.wikipedia.org/wiki/Mercer’s_theorem