Support Vector Machines
1. Introduction to SVMs
2. Linear SVMs
3. Non-linear SVMs
References:
1. S.Y. Kung, M.W. Mak, and S.H. Lin. Biometric Authentication: A Machine Learning Approach, Prentice Hall, to appear.
2. S.R. Gunn, 1998. Support Vector Machines for Classification and Regression. (http://www.isis.ecs.soton.ac.uk/resources/svminfo/)
3. Bernhard Schölkopf. Statistical Learning and Kernel Methods. MSR-TR 2000-23, Microsoft Research, 2000. (ftp://ftp.research.microsoft.com/pub/tr/tr-2000-23.pdf)
4. For more resources on support vector machines, see http://www.kernel-machines.org/
Introduction
• SVMs were developed by Vapnik and co-workers in 1995 and have become popular due to their attractive features and promising performance.
• Conventional neural networks are based on empirical risk minimization, where network weights are determined by minimizing the mean squared error between the actual outputs and the desired outputs.
• SVMs are based on the structural risk minimization principle, where parameters are chosen to minimize an upper bound on the generalization error rather than only the training error.
• SVMs have been shown to possess better generalization capability than conventional neural networks.
Introduction (Cont.)
• Given N labeled empirical data points:
  (x₁, y₁), …, (x_N, y_N) ∈ X × {−1, +1}   (1)
  where X ⊆ R^d is the domain of the input data and the y_i are the class labels.
[Figure: labeled data points of the two classes in the domain X]
Introduction (Cont.)
• We construct a simple classifier by computing the means of the two classes:
  c₁ = (1/N₁) Σ_{i: y_i = +1} x_i,   c₂ = (1/N₂) Σ_{i: y_i = −1} x_i   (2)
  where N₁ and N₂ are the numbers of data points in the classes with positive and negative labels, respectively.
• We assign a new point x to the class whose mean is closer to it.
• To achieve this, we compute the midpoint c = (c₁ + c₂)/2 and the difference vector w = c₁ − c₂.
Introduction (Cont.)
• Then, we determine the class of x by checking whether the vector connecting x and c encloses an angle smaller than π/2 with the vector w = c₁ − c₂, i.e.
  y = sgn⟨(x − c), w⟩ = sgn(⟨x, c₁⟩ − ⟨x, c₂⟩ + b),   where b = (||c₂||² − ||c₁||²)/2.
[Figure: the class means c₁ and c₂, their midpoint c, and a new point x in the domain X]
Introduction (Cont.)
• In the special case where b = 0, we have
  y = sgn( (1/N₁) Σ_{i: y_i = +1} ⟨x, x_i⟩ − (1/N₂) Σ_{i: y_i = −1} ⟨x, x_i⟩ )   (3)
• This means that we use ALL data points x_i, each being weighted equally by 1/N₁ or 1/N₂, to define the decision plane.
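The mean-based classifier of Eqs. (2)-(3) can be written in a few lines of NumPy. This is a minimal illustrative sketch, not taken from the slides or the references; the toy data and all names are assumptions.

```python
import numpy as np

def nearest_mean_classifier(X_train, y_train):
    """Simple classifier of Eqs. (2)-(3): assign x to the class whose mean is closer."""
    c1 = X_train[y_train == +1].mean(axis=0)          # c1, Eq. (2)
    c2 = X_train[y_train == -1].mean(axis=0)          # c2, Eq. (2)
    b = 0.5 * (np.dot(c2, c2) - np.dot(c1, c1))       # offset b; b = 0 gives Eq. (3)

    def predict(X):
        # y = sgn(<x, c1> - <x, c2> + b)
        scores = X @ c1 - X @ c2 + b
        return np.where(scores >= 0, +1, -1)

    return predict

# Toy data: two Gaussian blobs (illustrative only)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+2.0, 1.0, size=(20, 2)),
               rng.normal(-2.0, 1.0, size=(20, 2))])
y = np.array([+1] * 20 + [-1] * 20)

predict = nearest_mean_classifier(X, y)
print("training accuracy:", np.mean(predict(X) == y))
```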
Introduction (Cont.)
[Figure: the decision plane induced by the class means in the domain X]
Introduction (Cont.)
• However, we might want to remove the influence of patterns that are far away from the decision boundary, because their influence is usually small.
• We may also select only a few important data points (called support vectors) and weight them differently.
• Then, we have a support vector machine.
Introduction (Cont.)
[Figure: the decision plane, the margin, and the support vectors in the domain X]
• We aim to find a decision plane that maximizes the margin.
Linear SVMs
• Assume that all training data satisfy the constraints:
  w·x_i + b ≥ +1 for y_i = +1,   w·x_i + b ≤ −1 for y_i = −1   (4)
  which means
  y_i(w·x_i + b) ≥ 1,   i = 1, …, N   (5)
• Training data points for which the above equality holds lie on hyperplanes parallel to the decision plane.
Linear SVMs (Cont.)
[Figure: the decision plane w·x + b = 0 and the two margin hyperplanes w·x + b = ±1; the margin is d = 2/||w||]
• Since the margin is d = 2/||w||, maximizing the margin is equivalent to minimizing ||w||².
Linear SVMs (Lagrangian)
• We minimize ||w||² subject to the constraint that
  y_i(w·x_i + b) ≥ 1,   i = 1, …, N   (6)
• This can be achieved by introducing Lagrange multipliers α_i ≥ 0 and a Lagrangian
  L(w, b, α) = ½||w||² − Σ_i α_i [ y_i(w·x_i + b) − 1 ]   (7)
• The Lagrangian has to be minimized with respect to w and b and maximized with respect to α_i.
Linear SVMs (Lagrangian)
• Setting ∂L/∂w = 0 and ∂L/∂b = 0, we obtain
  w = Σ_i α_i y_i x_i,   Σ_i α_i y_i = 0   (8)
• Patterns for which α_i > 0 are called Support Vectors. These vectors lie on the margin and satisfy
  y_i(w·x_i + b) = 1,   i ∈ S
  where S contains the indexes of the support vectors.
• Patterns for which α_i = 0 are considered to be irrelevant to the classification.
Linear SVMs (Wolfe Dual)
• Substituting (8) into (7), we obtain the Wolfe dual:
  maximize W(α) = Σ_i α_i − ½ Σ_i Σ_j α_i α_j y_i y_j x_i·x_j
  subject to α_i ≥ 0 and Σ_i α_i y_i = 0   (9)
• The decision hyperplane is thus
  f(x) = w·x + b = Σ_{i∈S} α_i y_i x_i·x + b = 0
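As a concrete illustration, the Wolfe dual (9) can be solved with a general-purpose constrained optimizer and w, b recovered via Eq. (8). This is only a sketch under the assumption that the data are linearly separable; real SVM packages use dedicated QP or SMO solvers.

```python
import numpy as np
from scipy.optimize import minimize

def train_hard_margin_svm(X, y):
    """Solve the Wolfe dual (Eq. 9) and recover w, b from Eq. (8).
    Assumes X (N x d) is linearly separable and y contains +1/-1 labels."""
    N = X.shape[0]
    K = X @ X.T                                   # Gram matrix of inner products x_i . x_j
    H = (y[:, None] * y[None, :]) * K

    def neg_dual(alpha):                          # minimize the negative dual objective
        return 0.5 * alpha @ H @ alpha - alpha.sum()

    res = minimize(neg_dual, np.zeros(N), method='SLSQP',
                   bounds=[(0.0, None)] * N,                              # alpha_i >= 0
                   constraints=[{'type': 'eq', 'fun': lambda a: a @ y}])  # sum_i alpha_i y_i = 0
    alpha = res.x

    sv = alpha > 1e-6                             # support vectors have alpha_i > 0
    w = ((alpha * y)[:, None] * X).sum(axis=0)    # Eq. (8)
    b = np.mean(y[sv] - X[sv] @ w)                # from y_i (w . x_i + b) = 1 on the margin
    return w, b, sv
```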
Linear SVMs (Example)
• Analytical example (3-point problem):
• Objective function:
Linear SVMs (Example)
• We introduce another Lagrange multiplier λ (for the equality constraint Σ_i α_i y_i = 0) to obtain the Lagrangian F(α, λ).
• Differentiating F(α, λ) with respect to λ and α_i and setting the results to zero, we obtain a set of linear equations whose solution gives the Lagrange multipliers α_i.
Linear SVMs (Example)
• Substituting the Lagrange multipliers into Eq. 8 gives the weight vector w (and then b from any support vector).
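The specific three points used in the slides are not reproduced here. As an illustrative check, the dual solver sketched after Eq. (9) can be run on an assumed 3-point problem (these points are hypothetical, not the slides' example):

```python
import numpy as np

# Hypothetical 3-point problem: one negative point at the origin,
# two positive points on the axes. Uses train_hard_margin_svm from the earlier sketch.
X3 = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
y3 = np.array([-1, +1, +1])

w, b, sv = train_hard_margin_svm(X3, y3)
print("w =", w, "b =", b)                             # for these points: w ~ (2, 2), b ~ -1
print("support vector indexes:", np.flatnonzero(sv))  # all three points are SVs here
```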
Linear SVMs (Example)
• 4-point linearly separable problem:
[Figure: solutions of 4-point problems, one with 4 support vectors and one with 3 support vectors]
Linear SVMs (Non-linearly separable)
• Non-linearly separable: patterns that cannot be separated by a linear decision boundary without incurring classification errors.
[Figure: data that cause classification errors in linear SVMs]
Linear SVMs (Non-linearly separable)
• We introduce a set of slack variables ξ_i with ξ_i ≥ 0, i = 1, …, N.
• The slack variables allow some data to violate the constraints defined for the linearly separable case (Eq. 6):
  y_i(w·x_i + b) ≥ 1 − ξ_i,   i = 1, …, N
• Therefore, for some ξ_i > 0 we have y_i(w·x_i + b) < 1.
Linear SVMs (Non-linearly separable)
• E.g. ξ₁₀ > 0 and ξ₁₉ > 0 because x₁₀ and x₁₉ are inside the margin, i.e. they violate the constraint (Eq. 6).
[Figure: the points x₁₀ and x₁₉ lying inside the margin]
Linear SVMs (Non-linearly separable)
• For non-separable cases, we minimize
  ½||w||² + C Σ_i ξ_i   subject to   y_i(w·x_i + b) ≥ 1 − ξ_i,  ξ_i ≥ 0
  where C is a user-defined penalty parameter that penalizes any violation of the margins.
• The Lagrangian becomes
  L(w, b, ξ; α, β) = ½||w||² + C Σ_i ξ_i − Σ_i α_i [ y_i(w·x_i + b) − 1 + ξ_i ] − Σ_i β_i ξ_i
Linear SVMs (Non-linearly separable)
• Wolfe dual optimization:
  maximize W(α) = Σ_i α_i − ½ Σ_i Σ_j α_i α_j y_i y_j x_i·x_j
  subject to 0 ≤ α_i ≤ C and Σ_i α_i y_i = 0
• The output weight vector and bias term are
  w = Σ_{i∈S} α_i y_i x_i,   b = (1/|S′|) Σ_{i∈S′} ( y_i − w·x_i )
  where S contains the indexes of the support vectors and S′ ⊆ S contains those with 0 < α_i < C (on the margin).
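A minimal soft-margin example using scikit-learn's SVC (an assumed tool, not one named in the slides); its dual_coef_ attribute stores α_i·y_i for the support vectors, from which the w and b above can be recovered:

```python
import numpy as np
from sklearn.svm import SVC

# Overlapping classes, so some slack variables are non-zero (illustrative data)
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(+1.0, 1.0, size=(50, 2)),
               rng.normal(-1.0, 1.0, size=(50, 2))])
y = np.array([+1] * 50 + [-1] * 50)

C = 1.0
clf = SVC(kernel='linear', C=C).fit(X, y)       # C penalizes margin violations

w = clf.coef_[0]                                # output weight vector
b = clf.intercept_[0]                           # bias term
alpha_y = clf.dual_coef_[0]                     # alpha_i * y_i for the support vectors
print("number of support vectors:", len(clf.support_))
print("w =", w, " b =", b)
print("all |alpha_i| <= C:", bool(np.all(np.abs(alpha_y) <= C + 1e-8)))
```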
2. Linear SVMs (Types of SVs)
• Three types of support vectors:
  1. On the margin: 0 < α_i < C and ξ_i = 0
  2. Inside the margin (correctly classified): α_i = C and 0 < ξ_i ≤ 1
  3. Outside the margin (misclassified): α_i = C and ξ_i > 1
2. Linear SVMs (Types of SVs)
[Figure: the same example after swapping Class 1 and Class 2]
2. Linear SVMs (Types of SVs)
• Effect of varying C:
[Figure: decision boundaries and support vectors for C = 0.1 and C = 100]
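The trend shown on this slide can be checked by counting support vectors as C varies (a sketch reusing X and y from the soft-margin example above; smaller C typically gives a wider margin and more support vectors):

```python
from sklearn.svm import SVC

for C in (0.1, 1.0, 100.0):
    clf = SVC(kernel='linear', C=C).fit(X, y)
    print(f"C = {C:6.1f}: {len(clf.support_)} support vectors")
```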
3. Non-linear SVMs
• In case the training data X are not linearly separable, we may use a non-linear mapping (computed implicitly through a kernel function) to map the data from the input space to a feature space in which the data become linearly separable.
[Figure: a non-linear decision boundary in the input space (domain X) corresponds to a linear decision boundary in the feature space]
3. Non-linear SVMs (Cont.)
• The decision function becomes
  f(x) = sgn( Σ_{i∈S} α_i y_i ⟨φ(x), φ(x_i)⟩ + b ) = sgn( Σ_{i∈S} α_i y_i K(x, x_i) + b )   (a)
  where φ(·) is the mapping to the feature space and K(x, x_i) = ⟨φ(x), φ(x_i)⟩ is the kernel function.
3. Non-linear SVMs (Cont.)
• The decision function becomes
  f(x) = sgn( Σ_{i∈S} α_i y_i K(x, x_i) + b )
• For RBF kernels:
  K(x, x_i) = exp( −||x − x_i||² / (2σ²) )
• For polynomial kernels:
  K(x, x_i) = (x·x_i + 1)^d
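The two kernels can be written directly as functions of a pair of input vectors; a brief sketch (the parameters sigma and degree correspond to σ and d above):

```python
import numpy as np

def rbf_kernel(x1, x2, sigma=1.0):
    """K(x, x') = exp(-||x - x'||^2 / (2 sigma^2))"""
    diff = np.asarray(x1, dtype=float) - np.asarray(x2, dtype=float)
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))

def polynomial_kernel(x1, x2, degree=2):
    """K(x, x') = (<x, x'> + 1)^degree"""
    return (np.dot(x1, x2) + 1.0) ** degree
```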
3. Non-linear SVMs (Cont.)
• The optimization problem becomes:
  maximize W(α) = Σ_i α_i − ½ Σ_i Σ_j α_i α_j y_i y_j K(x_i, x_j)
  subject to 0 ≤ α_i ≤ C and Σ_i α_i y_i = 0
• The decision function becomes
  f(x) = sgn( Σ_{i∈S} α_i y_i K(x, x_i) + b )
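As a sanity check on the kernelized decision function, the sum Σ_i α_i y_i K(x, x_i) + b can be computed by hand from a fitted model and compared with the library's own output. This sketch uses scikit-learn and an assumed toy data set; note that scikit-learn's RBF kernel is exp(−γ||x − x_i||²), i.e. γ = 1/(2σ²):

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Illustrative non-linearly separable data
X, y = make_moons(n_samples=100, noise=0.15, random_state=0)
y = np.where(y == 1, 1, -1)

gamma = 0.5
clf = SVC(kernel='rbf', C=10.0, gamma=gamma).fit(X, y)

x_test = np.array([0.5, 0.0])
sv = clf.support_vectors_                                # support vectors x_i
alpha_y = clf.dual_coef_[0]                              # alpha_i * y_i
k = np.exp(-gamma * np.sum((sv - x_test) ** 2, axis=1))  # K(x_i, x_test)
f_manual = alpha_y @ k + clf.intercept_[0]               # sum_i alpha_i y_i K(x_i, x) + b

print(f_manual, clf.decision_function([x_test])[0])      # the two values should agree
```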
3. Non-linear SVMs (Cont.)
• The effect of varying C on RBF-SVMs:
[Figure: decision boundaries for C = 1000 and C = 10]
3. Non-linear SVMs (Cont.)
• The effect of varying C on Polynomial-SVMs:
[Figure: decision boundaries for C = 1000 and C = 10]
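The C sweeps on these two slides can be reproduced in spirit with scikit-learn on a toy non-linearly separable set; the data, kernel parameters, and C values here are assumptions, not the ones used in the slides:

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Toy non-linearly separable data: two concentric circles (illustrative only)
X, y = make_circles(n_samples=200, factor=0.4, noise=0.1, random_state=0)
y = np.where(y == 1, 1, -1)

for kernel, params in [('rbf', {'gamma': 1.0}), ('poly', {'degree': 2})]:
    for C in (10.0, 1000.0):
        clf = SVC(kernel=kernel, C=C, **params).fit(X, y)
        print(f"{kernel:5s} kernel, C = {C:7.1f}: "
              f"{len(clf.support_)} SVs, training accuracy = {clf.score(X, y):.2f}")
```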