
Support Vector Machines and Kernel Methods



  1. Support Vector Machines and Kernel Methods Kenan Gençol Department of Electrical and Electronics Engineering Anadolu University submitted in the course MAT592 Seminar Advisor: Prof. Dr. Yalçın Küçük Department of Mathematics

  2. Agenda • Linear Discriminant Functions and Decision Hyperplanes • Introduction to SVM • Support Vector Machines • Introduction to Kernels • Nonlinear SVM • Kernel Methods

  3. Linear Discriminant Functions and Decision Hyperplanes Figure 1. Two classes of patterns and a linear decision function

  4. Linear Discriminant Functions and Decision Hyperplanes • Each pattern is represented by a vector x = [x1 x2]T • The linear decision function has the equation g(x) = w1x1 + w2x2 + w0 = 0 • where w1, w2 are weights and w0 is the bias term

  5. Linear Discriminant Functions and Decision Hyperplanes • The general decision hyperplane equation in d-dimensional space has the form g(x) = w . x + w0 = 0 • where w = [w1 w2 .... wd] is the weight vector and w0 is the bias term.
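As a minimal sketch of this decision rule (Python with NumPy; the weights, bias and pattern below are purely illustrative, not taken from the slides), classification reduces to the sign of w . x + w0:

```python
import numpy as np

def g(x, w, w0):
    """Linear discriminant g(x) = w . x + w0; its sign gives the predicted class."""
    return np.dot(w, x) + w0

w = np.array([1.0, -2.0, 0.5])   # illustrative weight vector, d = 3
w0 = 0.25                        # illustrative bias term
x = np.array([0.4, 0.1, 1.0])    # a pattern to classify
label = +1 if g(x, w, w0) >= 0 else -1
```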

  6. Introduction to SVM • There are many hyperplanes that separate the two classes Figure 2. An example of two possible classifiers

  7. Introduction to SVM • THE GOAL: • Our goal is to find the direction w and bias w0 that give the maximum possible margin, in other words, to orient this hyperplane so that it is as far as possible from the closest members of both classes.

  8. SVM: Linearly Separable Case Figure 3. Hyperplane through two linearly separable classes

  9. SVM: Linearly Separable Case • Our training data is of the form {(xi, yi)}, i = 1, ...., L, with xi ∈ Rd and labels yi ∈ {-1, +1} • The hyperplane can be described by x . w + b = 0 • and is called the separating hyperplane.

  10. SVM: Linearly Separable Case • Select variables w and b so that: xi . w + b >= +1 for yi = +1 and xi . w + b <= -1 for yi = -1 • These equations can be combined into: yi(xi . w + b) - 1 >= 0 for all i
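A small sketch, assuming NumPy arrays X (one pattern per row), labels y in {-1, +1}, and a candidate pair (w, b), of checking this combined constraint over the training set:

```python
import numpy as np

def satisfies_margin_constraints(X, y, w, b):
    # combined constraint: y_i (x_i . w + b) - 1 >= 0 for every training point
    return bool(np.all(y * (X @ w + b) - 1 >= 0))
```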

  11. SVM: Linearly Separable Case • The points that lie closest to the separating hyperplane are called support vectors (the circled points in the diagram), and the hyperplanes • H1: x . w + b = +1 and H2: x . w + b = -1 • are called supporting hyperplanes.

  12. SVM: Linearly Separable Case Figure 3. Hyperplane through two linearly separable classes (repeated)

  13. SVM: Linearly Separable Case • The hyperplane's equidistance from H1 and H2 means that d1 = d2, and this quantity is known as the SVM margin: • d1 + d2 = 2 / ||w|| • d1 = d2 = 1 / ||w||

  14. SVM: Linearly Separable Case • Maximizing the margin 1 / ||w|| is equivalent to minimizing ||w||: • min ||w|| such that yi(xi . w + b) - 1 >= 0 • Minimizing ||w|| is in turn equivalent to minimizing (1/2)||w||^2, which lets us perform Quadratic Programming (QP) optimization

  15. SVM: Linearly Separable Case • Optimization problem: • Minimize (1/2)||w||^2 • subject to yi(xi . w + b) - 1 >= 0, i = 1, 2, ...., L

  16. SVM: Linearly Separable Case • This is an inequality-constrained optimization problem with Lagrangian function: • L(w, b, α) = (1/2)||w||^2 - Σi αi [yi(xi . w + b) - 1]   (1) • where αi >= 0, i = 1, 2, ...., L are the Lagrange multipliers.

  17. SVM • The corresponding KKT conditions are: • ∂L/∂w = 0  =>  w = Σi αi yi xi   (2) • ∂L/∂b = 0  =>  Σi αi yi = 0   (3)

  18. SVM • This is a convex optimization problem. The cost function is convex and the constraints are linear and define a convex set of feasible solutions. Such problems can be solved by considering the so-called Lagrangian duality.

  19. SVM • Substituting (2) and (3) into (1) gives a new formulation which, being dependent only on α, we need to maximize: • LD(α) = Σi αi - (1/2) Σi Σj αi αj yi yj (xi . xj)
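The intermediate algebra, omitted on the slide, can be sketched in LaTeX with the notation of (1)-(3):

```latex
\begin{aligned}
L(\mathbf{w}, b, \boldsymbol{\alpha})
  &= \tfrac{1}{2}\,\mathbf{w}\cdot\mathbf{w}
     - \mathbf{w}\cdot\sum_{i}\alpha_i y_i \mathbf{x}_i
     - b\sum_{i}\alpha_i y_i
     + \sum_{i}\alpha_i \\
  &= \sum_{i}\alpha_i
     - \tfrac{1}{2}\sum_{i}\sum_{j}\alpha_i \alpha_j\, y_i y_j\,(\mathbf{x}_i\cdot\mathbf{x}_j)
  \;=\; L_D(\boldsymbol{\alpha}),
\end{aligned}
```

since w = Σi αi yi xi from (2) collapses the first two terms to -(1/2) w . w, and Σi αi yi = 0 from (3) removes the term in b.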

  20. SVM • This is called the dual form (Lagrangian dual) of the primal form. The dual form requires only the dot products of the input vectors to be calculated. • This is important for the kernel trick, which will be described later.

  21. SVM • So the problem becomes a dual problem: • Maximize LD(α) = Σi αi - (1/2) Σi Σj αi αj yi yj (xi . xj) • subject to Σi αi yi = 0 and αi >= 0, i = 1, 2, ...., L
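A hedged sketch of solving this dual numerically for linearly separable data, assuming the cvxopt QP solver is available (cvxopt minimizes (1/2) a^T P a + q^T a subject to G a <= h and A a = b, so the signs are flipped accordingly):

```python
import numpy as np
from cvxopt import matrix, solvers   # assumes cvxopt is installed

def solve_svm_dual(X, y):
    """Return the Lagrange multipliers alpha for the hard-margin dual."""
    L = X.shape[0]
    Yx = y[:, None] * X                          # rows y_i * x_i
    P = matrix(Yx @ Yx.T)                        # P_ij = y_i y_j (x_i . x_j)
    q = matrix(-np.ones(L))                      # maximize sum(a)  ->  minimize -sum(a)
    G = matrix(-np.eye(L))                       # -a_i <= 0, i.e. a_i >= 0
    h = matrix(np.zeros(L))
    A = matrix(y.reshape(1, -1).astype(float))   # equality constraint: sum_i y_i a_i = 0
    b = matrix(0.0)
    sol = solvers.qp(P, q, G, h, A, b)
    return np.ravel(sol['x'])                    # the multipliers alpha_i
```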

  22. SVM • Differentiating with respect to the αi's and using the constraint equation, a system of equations is obtained. Solving the system, the Lagrange multipliers are found and the optimum hyperplane is given according to the formula: • w = Σi αi yi xi
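A sketch of recovering w and the bias b from the multipliers, treating every αi above a small numerical tolerance as a support vector and using ys(xs . w + b) = 1 for those points:

```python
import numpy as np

def recover_hyperplane(X, y, alphas, tol=1e-6):
    w = ((alphas * y)[:, None] * X).sum(axis=0)   # w = sum_i alpha_i y_i x_i
    sv = alphas > tol                             # support vectors have alpha_i > 0
    b = float(np.mean(y[sv] - X[sv] @ w))         # from y_s (x_s . w + b) = 1
    return w, b
```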

  23. SVM • Some notes: • SUPPORT VECTORS are the feature vectors for which αi > 0, i = 1, 2, ...., L • The cost function is strictly convex. • The Hessian matrix is positive definite. • Any local minimum is also global and unique. The optimal hyperplane classifier of an SVM is UNIQUE. • Although the solution is unique, the resulting Lagrange multipliers are not unique.

  24. Kernels: Introduction • When applying our SVM to linearly separable data we started by creating a matrix H from the dot products of our input variables: • Hij = yi yj k(xi, xj) = yi yj (xi . xj) • with k(xi, xj) = xi . xj being known as the linear kernel, an example of a family of functions called kernel functions.

  25. Kernels: Introduction • The set of kernel functions are all based on calculating inner products of two vectors. • This means that if the inputs are mapped to a higher-dimensional space by a nonlinear mapping function Ф, only the inner products of the mapped inputs need to be determined, without needing to explicitly calculate Ф. • This is called the “Kernel Trick”.
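A tiny numerical sketch of the trick: for the homogeneous degree-2 polynomial kernel k(x, z) = (x . z)^2 in two dimensions, the explicit feature map is Ф(x) = (x1^2, √2 x1x2, x2^2), and the kernel value matches the inner product of the mapped vectors:

```python
import numpy as np

def phi(x):
    # explicit feature map for k(x, z) = (x . z)^2 in 2-D
    return np.array([x[0] ** 2, np.sqrt(2.0) * x[0] * x[1], x[1] ** 2])

x, z = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(np.dot(phi(x), phi(z)))   # 1.0, computed through the explicit mapping
print(np.dot(x, z) ** 2)        # 1.0, the same value from the kernel alone
```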

  26. Kernels: Introduction • Kernel Trick is useful because there are many classification/regression problems that are not fully separable/regressable in the input space but separable/regressable in a higher dimensional space.

  27. Kernels: Introduction • Popular kernel families: • Radial Basis Function (RBF) Kernel • Polynomial Kernel • Sigmoidal (Hyperbolic Tangent) Kernel
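Common textbook forms of these three kernels, sketched in Python (the parameters sigma, degree, c, kappa and delta are illustrative defaults, not values fixed by the slides):

```python
import numpy as np

def rbf_kernel(x, z, sigma=1.0):
    return np.exp(-np.linalg.norm(x - z) ** 2 / (2.0 * sigma ** 2))

def polynomial_kernel(x, z, degree=3, c=1.0):
    return (np.dot(x, z) + c) ** degree

def sigmoid_kernel(x, z, kappa=1.0, delta=1.0):
    return np.tanh(kappa * np.dot(x, z) - delta)
```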

  28. Nonlinear Support Vector Machines • The support vector machine with kernel functions becomes: • Maximize LD(α) = Σi αi - (1/2) Σi Σj αi αj yi yj k(xi, xj) • and the resulting classifier: • f(x) = sgn( Σi αi yi k(xi, x) + b )
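A sketch of this classifier, assuming the support vectors, their labels and multipliers, the bias b, and a kernel function (such as those above) are already available:

```python
import numpy as np

def svm_classify(x, support_X, support_y, support_alphas, b, kernel):
    # f(x) = sgn( sum_i alpha_i y_i k(x_i, x) + b ), summed over the support vectors
    s = sum(a * yi * kernel(xi, x)
            for a, yi, xi in zip(support_alphas, support_y, support_X))
    return int(np.sign(s + b))
```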

  29. Nonlinear Support Vector Machines Figure 4. The SVM architecture employing kernel functions.

  30. Kernel Methods • Recall that a kernel function computes the inner product of the images of two data points under an embedding: k(x, y) = φ(x) . φ(y) • k is a kernel if • 1. k is symmetric: k(x, y) = k(y, x) • 2. k is positive semi-definite, i.e., the “Gram matrix” Kij = k(xi, xj) is positive semi-definite.
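These two properties can be checked numerically on any finite sample: the Gram matrix must be symmetric and its eigenvalues nonnegative up to round-off. A minimal sketch:

```python
import numpy as np

def looks_like_valid_kernel(X, kernel, tol=1e-10):
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])   # Gram matrix
    symmetric = np.allclose(K, K.T)
    psd = bool(np.all(np.linalg.eigvalsh(K) >= -tol))          # eigenvalues >= 0
    return symmetric and psd
```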

  31. Kernel Methods • The question of for which kernels there exists a pair {H, φ} with the properties described above, and for which there does not, is answered by Mercer’s condition.

  32. Mercer’s condition • Let X be a compact subset of Rn, let x ∈ X, and let φ be a mapping x → φ(x) ∈ H, • where H is a Euclidean space. Then the inner product operation has an equivalent representation • φ(x) . φ(z) = k(x, z) • and k(x, z) is a symmetric function satisfying the following condition • ∫∫ k(x, z) g(x) g(z) dx dz >= 0 • for any g(x), x ∈ X, such that ∫ g(x)^2 dx < +∞

  33. Mercer’s Theorem • Theorem. Suppose K is a continuous symmetric non-negative definite kernel. Then there is an orthonormal basis {ei}i of L2[a, b] consisting of eigenfunctions of TK, • [TK f](x) = ∫ab K(x, s) f(s) ds, • such that the corresponding sequence of eigenvalues {λi}i is nonnegative. The eigenfunctions corresponding to non-zero eigenvalues are continuous on [a, b] and K has the representation • K(s, t) = Σj λj ej(s) ej(t) • where the convergence is absolute and uniform.
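A crude numerical illustration, not a proof: sampling an RBF kernel on a grid over [0, 1] (grid size and kernel width chosen arbitrarily), the resulting matrix has nonnegative eigenvalues up to round-off, and its eigenvector expansion reconstructs it, mirroring the expansion above in discrete form:

```python
import numpy as np

t = np.linspace(0.0, 1.0, 200)
K = np.exp(-(t[:, None] - t[None, :]) ** 2 / 0.1)   # RBF kernel sampled on the grid
lam, E = np.linalg.eigh(K)                          # eigenvalues and eigenvectors
print(lam.min() > -1e-10)                           # True: nonnegative up to round-off
print(np.allclose(K, (E * lam) @ E.T))              # True: sum_i lam_i e_i e_i^T rebuilds K
```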

  34. Kernel Methods • Suppose k1 and k2 are valid (symmetric, positive definite) kernels on X. Then the following are valid kernels: • 1. • 2. • 3.

  35. Kernel Methods • 4. • 5. • 6. • 7.

  36. References • [1] C. J. C. Burges, “A Tutorial on Support Vector Machines for Pattern Recognition”, Data Mining and Knowledge Discovery, 2, 121-167, 1998. • [2] J. P. Marques de Sa, “Pattern Recognition: Concepts, Methods and Applications”, Springer, 2001. • [3] S. Theodoridis, “Pattern Recognition”, Elsevier Academic Press, 2003.

  37. References • [4] T. Fletcher, “Support Vector Machines Explained”, UCL, March 2005. • [5] N. Cristianini, J. Shawe-Taylor, “Kernel Methods for Pattern Analysis”, Cambridge University Press, 2004. • [6] “Mercer’s Theorem”, Wikipedia: http://en.wikipedia.org/wiki/Mercer’s_theorem

  38. Thank You
