Tutorial: Interior Point Optimization Methods in Support Vector Machines Training
Part 1: Fundamentals of SVMs
Theodore Trafalis, email: trafalis@ecn.ou.edu
ANNIE'99, St. Louis, Missouri, U.S.A., Nov. 7, 1999
Outline • Statistical Learning Theory • Empirical Risk Minimization • Structural Risk Minimization • Linear SVM and Linearly Separable Case • Primal Optimization Problem • Dual Optimization Problem • Non-Linear Case • Support Vector Regression • Dual Problem for Regression • Kernel Functions in SVMs • Open Problem
Statistical Learning Theory (Vapnik 1995, 1998) — Empirical Risk Minimization
• Given a set of decision functions {f(x, α): α ∈ Λ}, f: ℝⁿ → [-1, 1], where Λ is a set of abstract parameters.
• Suppose (x₁, y₁), (x₂, y₂), ..., (x_l, y_l), with xᵢ ∈ ℝⁿ and yᵢ ∈ {-1, 1}, are drawn from an unknown distribution P(x, y). We want to find the f* that minimizes the expected risk functional
  R(α) = ∫ (1/2) |y − f(x, α)| dP(x, y),
  where f(x, α) and {f(x, α): α ∈ Λ} are called the hypothesis and the hypothesis space, respectively.
Empirical Risk Minimization
• The problem is that the distribution function P(x, y) is unknown, so we cannot compute the expected risk. Instead we compute the empirical risk
  R_emp(α) = (1/l) Σᵢ (1/2) |yᵢ − f(xᵢ, α)|.
• The idea behind minimizing the empirical risk is that if R_emp converges to the expected risk, then the minimum of R_emp may converge to the minimum of the expected risk.
• A typical uniform VC bound, which holds with probability 1 − δ, has the following form:
  R(α) ≤ R_emp(α) + sqrt( ( h (ln(2l/h) + 1) − ln(δ/4) ) / l ),
  where h is the VC dimension of the hypothesis space.
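As a small illustration (the toy data and the candidate decision rule are assumptions for this sketch, not from the tutorial), the empirical risk is just the sample average of the loss over the training pairs:

```python
# Empirical risk R_emp = (1/l) * sum_i (1/2)|y_i - f(x_i)| for a candidate decision function f.
import numpy as np

def empirical_risk(f, X, y):
    """Average loss of decision function f over the l training pairs (x_i, y_i)."""
    return np.mean(0.5 * np.abs(y - np.array([f(x) for x in X])))

# Toy example: a fixed linear decision rule on four labeled points (assumed data).
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -1.0], [-2.0, 0.5]])
y = np.array([1, 1, -1, -1])
f = lambda x: np.sign(x[0] + x[1])          # candidate hypothesis
print(empirical_risk(f, X, y))              # 0.0 if f classifies every training point correctly
```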
Structural Risk Minimization
• A small value of the empirical risk does not necessarily imply a small value of the expected risk.
• Structural Risk Minimization (SRM) principle (Vapnik 1982, 1995): the VC dimension and the empirical risk should be minimized at the same time.
• We need a nested structure of hypothesis spaces
  H₁ ⊂ H₂ ⊂ H₃ ⊂ ... ⊂ Hₙ ⊂ ...
  with the property that h(n) ≤ h(n+1), where h(n) is the VC dimension of Hₙ.
• We then need to solve the following problem: choose the element Hₙ of the structure and the function f ∈ Hₙ that minimize the right-hand side of the VC bound above, i.e. the sum of the empirical risk and the VC confidence term.
Linear SVM and Linearly Separable Case
• Assume that we are given a set S of points xᵢ ∈ ℝⁿ, i = 1, ..., l, where each xᵢ belongs to one of two classes defined by yᵢ ∈ {-1, 1}. The objective is to find a hyperplane that divides S, leaving all points of the same class on the same side, while maximizing the minimum distance between either of the two classes and the hyperplane [Vapnik 1995].
• Definition 1. The set S is linearly separable if there exist w ∈ ℝⁿ and b ∈ ℝ such that
  yᵢ (w · xᵢ + b) ≥ 1,  i = 1, ..., l.
• In order to make each decision surface correspond to one unique pair (w, b), the following canonical-form constraint is imposed:
  minᵢ |w · xᵢ + b| = 1.
Relationship between the VC dimension and the canonical hyperplane
• Suppose all points x₁, x₂, ..., x_l lie inside the n-dimensional unit sphere. The set of canonical hyperplanes with ‖w‖ ≤ A then has a VC dimension h that satisfies the bound
  h ≤ min{A², n} + 1.
• Maximizing the margin (i.e., minimizing ‖w‖) therefore minimizes the complexity of the function class.
Continued
• The distance from a point x to the hyperplane associated with the pair (w, b) is
  d(x; w, b) = |w · x + b| / ‖w‖.
• The distance between the canonical hyperplane and the closest point is therefore 1/‖w‖.
• The goal of the SVM is to find, among all hyperplanes that correctly classify the data, the one with minimum norm ‖w‖, or equivalently minimum ‖w‖². Minimizing ‖w‖² is equivalent to finding the separating hyperplane for which the distance between the two classes, equal to 2/‖w‖, is maximized. This distance is called the margin (a small numerical check follows).
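As a quick numeric illustration (the weight vector, bias, and test point below are assumed toy values, not from the tutorial), the distance formula and the 2/‖w‖ margin can be checked directly:

```python
# Toy check of d(x; w, b) = |w.x + b| / ||w|| and margin = 2 / ||w||.
import numpy as np

w = np.array([2.0, 1.0])          # assumed weight vector
b = -1.0                          # assumed bias
x = np.array([1.0, 3.0])          # assumed test point

distance = abs(w @ x + b) / np.linalg.norm(w)   # point-to-hyperplane distance
margin = 2.0 / np.linalg.norm(w)                # width between the two canonical hyperplanes
print(distance, margin)
```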
Primal Optimization Problem
• Primal problem (see the sketch below):
  minimize (1/2) ‖w‖²
  subject to yᵢ (w · xᵢ + b) ≥ 1,  i = 1, ..., l.
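Below is a minimal sketch of solving this primal QP with a general-purpose quadratic programming solver (cvxopt); the toy data, the variable stacking, and the tiny regularization on b are assumptions of the sketch, not part of the tutorial:

```python
# Hard-margin primal QP:  min (1/2)||w||^2  s.t.  y_i (w . x_i + b) >= 1.
import numpy as np
from cvxopt import matrix, solvers

X = np.array([[2.0, 2.0], [2.5, 3.0], [-1.0, -1.5], [-2.0, -1.0]])   # assumed toy data
y = np.array([1.0, 1.0, -1.0, -1.0])
l, n = X.shape

# Stack the variables as z = [w; b].  Objective (1/2) z^T P z with P = diag(1,...,1,0);
# a tiny value on the b-entry keeps the solver's KKT system well conditioned.
P = np.zeros((n + 1, n + 1)); P[:n, :n] = np.eye(n); P[n, n] = 1e-8
q = np.zeros(n + 1)

# y_i (w . x_i + b) >= 1  rewritten as  G z <= h  with G = -[y_i x_i, y_i], h = -1.
G = -np.hstack([y[:, None] * X, y[:, None]])
h = -np.ones(l)

z = np.array(solvers.qp(matrix(P), matrix(q), matrix(G), matrix(h))['x']).ravel()
w, b = z[:n], z[n]
print("w =", w, "b =", b, "margin =", 2.0 / np.linalg.norm(w))
```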
Computing Saddle Points
• The Lagrangian is
  L(w, b, λ) = (1/2) ‖w‖² − Σᵢ λᵢ [ yᵢ (w · xᵢ + b) − 1 ],   λᵢ ≥ 0, i = 1, ..., l.
• Optimality conditions:
  ∂L/∂w = 0 ⇒ w = Σᵢ λᵢ yᵢ xᵢ,   ∂L/∂b = 0 ⇒ Σᵢ λᵢ yᵢ = 0.
• Substituting these conditions back into the Lagrangian yields the dual optimization problem:
  maximize Σᵢ λᵢ − (1/2) Σᵢ Σⱼ λᵢ λⱼ yᵢ yⱼ (xᵢ · xⱼ)
  subject to Σᵢ λᵢ yᵢ = 0,   λᵢ ≥ 0, i = 1, ..., l.
Optimal point
• At the optimal point, w* = Σᵢ λᵢ* yᵢ xᵢ and the decision function is f(x) = sign( Σᵢ λᵢ* yᵢ (xᵢ · x) + b* ).
• Support vector: a training vector for which λᵢ* > 0 (see the sketch below).
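The following is a hypothetical sketch (not the tutorial's code) of solving the dual stated above with cvxopt on assumed toy data, then reading off the support vectors, w*, and b*:

```python
# Dual QP:  max sum(lam) - (1/2) lam^T (yy^T * XX^T) lam,  s.t. lam >= 0, y^T lam = 0.
import numpy as np
from cvxopt import matrix, solvers

X = np.array([[2.0, 2.0], [2.5, 3.0], [-1.0, -1.5], [-2.0, -1.0]])   # assumed toy data
y = np.array([1.0, 1.0, -1.0, -1.0])
l = len(y)

P = matrix(np.outer(y, y) * (X @ X.T) + 1e-8 * np.eye(l))   # tiny ridge for numerical stability
q = matrix(-np.ones(l))                                     # maximizing sum(lam) = minimizing -sum(lam)
G, h = matrix(-np.eye(l)), matrix(np.zeros(l))              # lam_i >= 0
A, b_eq = matrix(y.reshape(1, -1)), matrix(0.0)             # sum_i lam_i y_i = 0

lam = np.array(solvers.qp(P, q, G, h, A, b_eq)['x']).ravel()
sv = lam > 1e-6                                             # numerical threshold for "lam_i > 0"
w = (lam[sv] * y[sv]) @ X[sv]                               # w* = sum_i lam_i* y_i x_i
b = np.mean(y[sv] - X[sv] @ w)                              # from y_i (w . x_i + b) = 1 at support vectors
print("support vectors:", np.where(sv)[0], "w =", w, "b =", b)
```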
The Idea of SVM
[Figure: data in the input space mapped into a feature space]
Non-Linear Case
• If the data are not linearly separable, we map the input vector x into a higher-dimensional feature space.
• After mapping the input space to the feature space, we look for a hyperplane that separates the data into two groups in the feature space.
• Kernel function: K(xᵢ, xⱼ) = Φ(xᵢ) · Φ(xⱼ), where Φ is the mapping into the feature space.
Dual problem in nonlinear case
• In the linearly non-separable case, we replace the dot product of the inputs with the kernel function in the dual problem (see the sketch below):
  maximize Σᵢ λᵢ − (1/2) Σᵢ Σⱼ λᵢ λⱼ yᵢ yⱼ K(xᵢ, xⱼ)
  subject to Σᵢ λᵢ yᵢ = 0,   λᵢ ≥ 0, i = 1, ..., l.
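A hedged sketch of the same dual with the kernel substitution; relative to the linear dual above, only the Gram matrix changes. The Gaussian RBF kernel, its width, and the XOR-like toy data are assumptions of the sketch:

```python
# Kernelized dual QP: the matrix X X^T is replaced by a kernel matrix K.
import numpy as np
from cvxopt import matrix, solvers

X = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])   # assumed XOR-like data
y = np.array([-1.0, -1.0, 1.0, 1.0])
l, sigma = len(y), 1.0

# RBF kernel matrix K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)).
D = np.sum(X**2, axis=1)[:, None] + np.sum(X**2, axis=1)[None, :] - 2 * X @ X.T
K = np.exp(-D / (2 * sigma**2))

P = matrix(np.outer(y, y) * K + 1e-8 * np.eye(l))   # only this line changes vs. the linear dual
q = matrix(-np.ones(l))
G, h = matrix(-np.eye(l)), matrix(np.zeros(l))
A, b_eq = matrix(y.reshape(1, -1)), matrix(0.0)

lam = np.array(solvers.qp(P, q, G, h, A, b_eq)['x']).ravel()
sv = lam > 1e-6
b = np.mean(y[sv] - (lam * y) @ K[:, sv])           # b from the support vectors
f = lambda Kx: np.sign((lam * y) @ Kx + b)          # decision uses only kernel evaluations
print("decisions on the training points:", f(K))
```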
Support Vector Regression
• The ε-insensitive support vector regression: find a function f(x) that has at most ε deviation from the actually obtained targets yᵢ for all the training data, and at the same time is as flat as possible. If f(x) = w · x + b, flatness means a small ‖w‖.
• Primal Regression Problem:
  minimize (1/2) ‖w‖²
  subject to yᵢ − (w · xᵢ + b) ≤ ε,   (w · xᵢ + b) − yᵢ ≤ ε,   i = 1, ..., l.
Soft Margin Formulation
• Soft margin formulation:
  minimize (1/2) ‖w‖² + C Σᵢ (ξᵢ + ξᵢ*)
  subject to yᵢ − (w · xᵢ + b) ≤ ε + ξᵢ,   (w · xᵢ + b) − yᵢ ≤ ε + ξᵢ*,   ξᵢ, ξᵢ* ≥ 0.
• C determines the trade-off between the flatness of f(x) and the amount up to which deviations larger than ε are tolerated.
• The ε-insensitive loss function |ξ|_ε (Vapnik 1995) is defined as
  |ξ|_ε = 0 if |ξ| ≤ ε, and |ξ|_ε = |ξ| − ε otherwise (a small sketch follows).
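A small sketch of the ε-insensitive loss just defined; the residual values and the choice ε = 0.1 are assumptions for illustration:

```python
# epsilon-insensitive loss |xi|_eps = max(0, |xi| - eps), vectorized with numpy.
import numpy as np

def eps_insensitive_loss(y_true, y_pred, eps=0.1):
    """Return |y_true - y_pred|_eps: zero inside the eps-tube, linear outside."""
    return np.maximum(0.0, np.abs(y_true - y_pred) - eps)

# Toy illustration (assumed values): residuals of 0.05 and 0.3 with eps = 0.1.
print(eps_insensitive_loss(np.array([1.0, 1.0]), np.array([1.05, 1.3])))  # -> [0.0, 0.2]
```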
Saddle Point Optimality Conditions
• The Lagrangian function helps us formulate the dual problem:
  L = (1/2) ‖w‖² + C Σᵢ (ξᵢ + ξᵢ*) − Σᵢ λᵢ (ε + ξᵢ − yᵢ + w · xᵢ + b) − Σᵢ λᵢ* (ε + ξᵢ* + yᵢ − w · xᵢ − b) − Σᵢ (ηᵢ ξᵢ + ηᵢ* ξᵢ*),
  with multipliers λᵢ, λᵢ*, ηᵢ, ηᵢ* ≥ 0.
• Optimality conditions:
  ∂L/∂w = 0 ⇒ w = Σᵢ (λᵢ − λᵢ*) xᵢ,
  ∂L/∂b = 0 ⇒ Σᵢ (λᵢ − λᵢ*) = 0,
  ∂L/∂ξᵢ = 0 ⇒ ηᵢ = C − λᵢ,   ∂L/∂ξᵢ* = 0 ⇒ ηᵢ* = C − λᵢ*.
Dual Problem for Regression
• Dual problem:
  maximize −(1/2) Σᵢ Σⱼ (λᵢ − λᵢ*)(λⱼ − λⱼ*)(xᵢ · xⱼ) − ε Σᵢ (λᵢ + λᵢ*) + Σᵢ yᵢ (λᵢ − λᵢ*)
  subject to Σᵢ (λᵢ − λᵢ*) = 0,   λᵢ, λᵢ* ∈ [0, C].
• Solving it gives w = Σᵢ (λᵢ − λᵢ*) xᵢ, and therefore f(x) = Σᵢ (λᵢ − λᵢ*)(xᵢ · x) + b.
KKT Optimality Conditions and b*
• KKT optimality conditions:
  λᵢ (ε + ξᵢ − yᵢ + w · xᵢ + b) = 0,   λᵢ* (ε + ξᵢ* + yᵢ − w · xᵢ − b) = 0,
  (C − λᵢ) ξᵢ = 0,   (C − λᵢ*) ξᵢ* = 0.
• Only samples (xᵢ, yᵢ) with corresponding λᵢ = C lie outside the ε-insensitive tube around f. If λᵢ is nonzero, then λᵢ* is zero and vice versa. Finally, if λᵢ ∈ (0, C), then the corresponding ξᵢ is zero.
• b can then be computed as follows:
  b = yᵢ − w · xᵢ − ε  for λᵢ ∈ (0, C),
  b = yᵢ − w · xᵢ + ε  for λᵢ* ∈ (0, C).
QP SV Regression Problem in Feature Space
• Mapping into the feature space (i.e., replacing xᵢ · xⱼ with the kernel K(xᵢ, xⱼ)) we obtain the following quadratic SV regression problem:
  maximize −(1/2) Σᵢ Σⱼ (λᵢ − λᵢ*)(λⱼ − λⱼ*) K(xᵢ, xⱼ) − ε Σᵢ (λᵢ + λᵢ*) + Σᵢ yᵢ (λᵢ − λᵢ*)
  subject to Σᵢ (λᵢ − λᵢ*) = 0,   λᵢ, λᵢ* ∈ [0, C].
• At the optimal solution, we obtain f(x) = Σᵢ (λᵢ − λᵢ*) K(xᵢ, x) + b (a usage-level sketch follows).
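As a usage-level illustration (the noisy sine data and the parameter values are assumptions, and scikit-learn's SVR is used here as a stand-in solver for this kernelized QP rather than the interior point methods this tutorial is about):

```python
# Fitting an epsilon-SVR with an RBF kernel on assumed 1-D toy data via scikit-learn.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.linspace(0, 2 * np.pi, 40).reshape(-1, 1)
y = np.sin(X).ravel() + 0.1 * rng.standard_normal(40)   # noisy sine targets (assumed)

model = SVR(kernel="rbf", C=10.0, epsilon=0.1, gamma=1.0)
model.fit(X, y)
print("number of support vectors:", len(model.support_))
print("prediction at x = pi/2:", model.predict([[np.pi / 2]]))
```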
Kernel Functions in SVMs
• An inner product in the feature space has an equivalent kernel in the input space:
  K(x, x') = Φ(x) · Φ(x').
• Any symmetric positive semi-definite function (Smola 1998) which satisfies Mercer's conditions can be used as a kernel function in the SVM context. Mercer's conditions can be written as
  ∫∫ K(x, x') g(x) g(x') dx dx' ≥ 0  for all g with ∫ g(x)² dx < ∞.
Some kernel functions (sketch implementations follow)
• Polynomial: K(x, x') = (x · x' + 1)^d
• Gaussian Radial Basis Function (GRBF): K(x, x') = exp( −‖x − x'‖² / (2σ²) )
• Exponential Radial Basis Function: K(x, x') = exp( −‖x − x'‖ / (2σ²) )
• Multi-Layer Perceptron: K(x, x') = tanh( ρ (x · x') + θ )
• Fourier Series: K(x, x') = sin( (N + 1/2)(x − x') ) / sin( (x − x') / 2 )
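Below is a short sketch of the first four kernels, using the parametrizations written above; the default parameter values and the test vectors are assumptions for illustration:

```python
# Sketch implementations of the kernels listed above (parameter choices are assumptions).
import numpy as np

def polynomial(x, z, d=3):
    return (np.dot(x, z) + 1.0) ** d

def gaussian_rbf(x, z, sigma=1.0):
    return np.exp(-np.linalg.norm(np.asarray(x) - np.asarray(z)) ** 2 / (2 * sigma ** 2))

def exponential_rbf(x, z, sigma=1.0):
    return np.exp(-np.linalg.norm(np.asarray(x) - np.asarray(z)) / (2 * sigma ** 2))

def mlp(x, z, rho=1.0, theta=-1.0):
    # Note: the tanh "kernel" satisfies Mercer's conditions only for some (rho, theta).
    return np.tanh(rho * np.dot(x, z) + theta)

x, z = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(polynomial(x, z), gaussian_rbf(x, z), exponential_rbf(x, z), mlp(x, z))
```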
Open Problem
• We have more than one kernel to map the input space into the feature space.
• Question: which kernel functions provide good generalization for a particular problem?
• Validation techniques such as bootstrapping and cross-validation can be used to determine a good kernel (see the sketch below).
• Even when we decide on a kernel function, we still have to choose its parameters (e.g., the RBF kernel has a parameter σ whose value must be decided before the experiment).
• There is no theory yet for the selection of optimal kernels (Smola 1998, Amari 1999).
• For a more extensive literature and software on SVMs, check the web page http://svm.first.gmd.de/
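A brief sketch of the cross-validation approach mentioned above, using scikit-learn to pick a kernel and its parameters; the synthetic data set and the parameter grid are assumptions of the sketch:

```python
# Cross-validated selection of a kernel and its parameters (assumed grid, toy data).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

param_grid = [
    {"kernel": ["rbf"], "gamma": [0.01, 0.1, 1.0], "C": [1, 10]},
    {"kernel": ["poly"], "degree": [2, 3], "C": [1, 10]},
]
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)
print("best kernel/parameters:", search.best_params_)
print("cross-validated accuracy:", search.best_score_)
```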