Tutorial: Interior Point Optimization Methods in Support Vector Machines Training
Part 1: Fundamentals of SVMs
Theodore Trafalis, email: trafalis@ecn.ou.edu
ANNIE'99, St. Louis, Missouri, U.S.A., Nov. 7, 1999
Outline • Statistical Learning Theory • Empirical Risk Minimization • Structural Risk Minimization • Linear SVM and Linearly Separable Case • Primal Optimization Problem • Dual Optimization Problem • Non-Linear Case • Support Vector Regression • Dual Problem for Regression • Kernel Functions in SVMs • Open Problem
Statistical Learning Theory (Vapnik 1995, 1998) — Empirical Risk Minimization
• Given a set of decision functions {f(x, α): α ∈ Λ}, f: ℝⁿ → [-1, 1], where Λ is a set of abstract parameters.
• Suppose (x₁, y₁), (x₂, y₂), ..., (x_l, y_l), with xᵢ ∈ ℝⁿ and yᵢ ∈ {-1, 1}, are drawn from an unknown distribution P(x, y). We want to find the f* that minimizes the expected risk functional
  R(α) = ∫ (1/2) |y − f(x, α)| dP(x, y),
  where f(x, α) and {f(x, α): α ∈ Λ} are called the hypothesis and the hypothesis space, respectively.
Empirical Risk Minimization
• The problem is that the distribution function P(x, y) is unknown, so we cannot compute the expected risk. Instead we compute the empirical risk
  R_emp(α) = (1/l) Σᵢ (1/2) |yᵢ − f(xᵢ, α)|.
• The idea behind minimizing the empirical risk is that if R_emp converges to the expected risk, then the minimum of R_emp may converge to the minimum of the expected risk.
• A typical uniform VC bound, which holds with probability 1 − δ, has the following form:
  R(α) ≤ R_emp(α) + sqrt( ( h (ln(2l/h) + 1) − ln(δ/4) ) / l ),
  where h is the VC dimension of the hypothesis space.
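As a small illustration (the toy data and the candidate decision rule are assumptions for this sketch, not from the tutorial), the empirical risk is just the sample average of the loss over the training pairs:

```python
# Empirical risk R_emp = (1/l) * sum_i (1/2)|y_i - f(x_i)| for a candidate decision function f.
import numpy as np

def empirical_risk(f, X, y):
    """Average loss of decision function f over the l training pairs (x_i, y_i)."""
    return np.mean(0.5 * np.abs(y - np.array([f(x) for x in X])))

# Toy example: a fixed linear decision rule on four labeled points (assumed data).
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -1.0], [-2.0, 0.5]])
y = np.array([1, 1, -1, -1])
f = lambda x: np.sign(x[0] + x[1])          # candidate hypothesis
print(empirical_risk(f, X, y))              # 0.0 if f classifies every training point correctly
```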
Structural Risk Minimization
• A small value of the empirical risk does not necessarily imply a small value of the expected risk.
• Structural Risk Minimization (SRM) principle (Vapnik 1982, 1995): the VC dimension and the empirical risk should be minimized at the same time.
• We need a nested structure of hypothesis spaces
  H₁ ⊂ H₂ ⊂ H₃ ⊂ ... ⊂ Hₙ ⊂ ...
  with the property that h(n) ≤ h(n+1), where h(n) is the VC dimension of Hₙ.
• We then need to solve the following problem: choose the element Hₙ of the structure and the function f ∈ Hₙ that minimize the right-hand side of the VC bound above, i.e. the sum of the empirical risk and the VC confidence term.
Linear SVM and Linearly Separable Case
• Assume that we are given a set S of points xᵢ ∈ ℝⁿ, i = 1, ..., l, where each xᵢ belongs to one of two classes defined by yᵢ ∈ {-1, 1}. The objective is to find a hyperplane that divides S, leaving all points of the same class on the same side, while maximizing the minimum distance between either of the two classes and the hyperplane [Vapnik 1995].
• Definition 1. The set S is linearly separable if there exist w ∈ ℝⁿ and b ∈ ℝ such that
  yᵢ (w · xᵢ + b) ≥ 1,  i = 1, ..., l.
• In order to make each decision surface correspond to one unique pair (w, b), the following canonical-form constraint is imposed:
  minᵢ |w · xᵢ + b| = 1.
Relationship between the VC dimension and the canonical hyperplane
• Suppose all points x₁, x₂, ..., x_l lie inside the n-dimensional unit sphere. The set of canonical hyperplanes with ‖w‖ ≤ A then has a VC dimension h that satisfies the bound
  h ≤ min{A², n} + 1.
• Maximizing the margin (i.e., minimizing ‖w‖) therefore minimizes the complexity of the function class.
Continued
• The distance from a point x to the hyperplane associated with the pair (w, b) is
  d(x; w, b) = |w · x + b| / ‖w‖.
• The distance between the canonical hyperplane and the closest point is therefore 1/‖w‖.
• The goal of the SVM is to find, among all hyperplanes that correctly classify the data, the one with minimum norm ‖w‖, or equivalently minimum ‖w‖². Minimizing ‖w‖² is equivalent to finding the separating hyperplane for which the distance between the two classes, equal to 2/‖w‖, is maximized. This distance is called the margin (a small numerical check follows).
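As a quick numeric illustration (the weight vector, bias, and test point below are assumed toy values, not from the tutorial), the distance formula and the 2/‖w‖ margin can be checked directly:

```python
# Toy check of d(x; w, b) = |w.x + b| / ||w|| and margin = 2 / ||w||.
import numpy as np

w = np.array([2.0, 1.0])          # assumed weight vector
b = -1.0                          # assumed bias
x = np.array([1.0, 3.0])          # assumed test point

distance = abs(w @ x + b) / np.linalg.norm(w)   # point-to-hyperplane distance
margin = 2.0 / np.linalg.norm(w)                # width between the two canonical hyperplanes
print(distance, margin)
```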
Primal Optimization Problem
• Primal problem (see the sketch below):
  minimize (1/2) ‖w‖²
  subject to yᵢ (w · xᵢ + b) ≥ 1,  i = 1, ..., l.
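Below is a minimal sketch of solving this primal QP with a general-purpose quadratic programming solver (cvxopt); the toy data, the variable stacking, and the tiny regularization on b are assumptions of the sketch, not part of the tutorial:

```python
# Hard-margin primal QP:  min (1/2)||w||^2  s.t.  y_i (w . x_i + b) >= 1.
import numpy as np
from cvxopt import matrix, solvers

X = np.array([[2.0, 2.0], [2.5, 3.0], [-1.0, -1.5], [-2.0, -1.0]])   # assumed toy data
y = np.array([1.0, 1.0, -1.0, -1.0])
l, n = X.shape

# Stack the variables as z = [w; b].  Objective (1/2) z^T P z with P = diag(1,...,1,0);
# a tiny value on the b-entry keeps the solver's KKT system well conditioned.
P = np.zeros((n + 1, n + 1)); P[:n, :n] = np.eye(n); P[n, n] = 1e-8
q = np.zeros(n + 1)

# y_i (w . x_i + b) >= 1  rewritten as  G z <= h  with G = -[y_i x_i, y_i], h = -1.
G = -np.hstack([y[:, None] * X, y[:, None]])
h = -np.ones(l)

z = np.array(solvers.qp(matrix(P), matrix(q), matrix(G), matrix(h))['x']).ravel()
w, b = z[:n], z[n]
print("w =", w, "b =", b, "margin =", 2.0 / np.linalg.norm(w))
```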
Computing Saddle Points
• The Lagrangian is
  L(w, b, λ) = (1/2) ‖w‖² − Σᵢ λᵢ [ yᵢ (w · xᵢ + b) − 1 ],   λᵢ ≥ 0, i = 1, ..., l.
• Optimality conditions:
  ∂L/∂w = 0 ⇒ w = Σᵢ λᵢ yᵢ xᵢ,   ∂L/∂b = 0 ⇒ Σᵢ λᵢ yᵢ = 0.
• Substituting these conditions back into the Lagrangian yields the dual optimization problem:
  maximize Σᵢ λᵢ − (1/2) Σᵢ Σⱼ λᵢ λⱼ yᵢ yⱼ (xᵢ · xⱼ)
  subject to Σᵢ λᵢ yᵢ = 0,   λᵢ ≥ 0, i = 1, ..., l.
Optimal point
• At the optimal point, w* = Σᵢ λᵢ* yᵢ xᵢ and the decision function is f(x) = sign( Σᵢ λᵢ* yᵢ (xᵢ · x) + b* ).
• Support vector: a training vector for which λᵢ* > 0 (see the sketch below).
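The following is a hypothetical sketch (not the tutorial's code) of solving the dual stated above with cvxopt on assumed toy data, then reading off the support vectors, w*, and b*:

```python
# Dual QP:  max sum(lam) - (1/2) lam^T (yy^T * XX^T) lam,  s.t. lam >= 0, y^T lam = 0.
import numpy as np
from cvxopt import matrix, solvers

X = np.array([[2.0, 2.0], [2.5, 3.0], [-1.0, -1.5], [-2.0, -1.0]])   # assumed toy data
y = np.array([1.0, 1.0, -1.0, -1.0])
l = len(y)

P = matrix(np.outer(y, y) * (X @ X.T) + 1e-8 * np.eye(l))   # tiny ridge for numerical stability
q = matrix(-np.ones(l))                                     # maximizing sum(lam) = minimizing -sum(lam)
G, h = matrix(-np.eye(l)), matrix(np.zeros(l))              # lam_i >= 0
A, b_eq = matrix(y.reshape(1, -1)), matrix(0.0)             # sum_i lam_i y_i = 0

lam = np.array(solvers.qp(P, q, G, h, A, b_eq)['x']).ravel()
sv = lam > 1e-6                                             # numerical threshold for "lam_i > 0"
w = (lam[sv] * y[sv]) @ X[sv]                               # w* = sum_i lam_i* y_i x_i
b = np.mean(y[sv] - X[sv] @ w)                              # from y_i (w . x_i + b) = 1 at support vectors
print("support vectors:", np.where(sv)[0], "w =", w, "b =", b)
```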
The Idea of SVM
[Figure: data in the input space mapped into a feature space]
Non-Linear Case
• If the data are not linearly separable, we map the input vector x into a higher-dimensional feature space.
• After mapping the input space to the feature space, we look for a hyperplane that separates the data into two groups in the feature space.
• Kernel function: K(xᵢ, xⱼ) = Φ(xᵢ) · Φ(xⱼ), where Φ is the mapping into the feature space.
Dual problem in nonlinear case
• In the linearly non-separable case, we replace the dot product of the inputs with the kernel function in the dual problem (see the sketch below):
  maximize Σᵢ λᵢ − (1/2) Σᵢ Σⱼ λᵢ λⱼ yᵢ yⱼ K(xᵢ, xⱼ)
  subject to Σᵢ λᵢ yᵢ = 0,   λᵢ ≥ 0, i = 1, ..., l.
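A hedged sketch of the same dual with the kernel substitution; relative to the linear dual above, only the Gram matrix changes. The Gaussian RBF kernel, its width, and the XOR-like toy data are assumptions of the sketch:

```python
# Kernelized dual QP: the matrix X X^T is replaced by a kernel matrix K.
import numpy as np
from cvxopt import matrix, solvers

X = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])   # assumed XOR-like data
y = np.array([-1.0, -1.0, 1.0, 1.0])
l, sigma = len(y), 1.0

# RBF kernel matrix K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)).
D = np.sum(X**2, axis=1)[:, None] + np.sum(X**2, axis=1)[None, :] - 2 * X @ X.T
K = np.exp(-D / (2 * sigma**2))

P = matrix(np.outer(y, y) * K + 1e-8 * np.eye(l))   # only this line changes vs. the linear dual
q = matrix(-np.ones(l))
G, h = matrix(-np.eye(l)), matrix(np.zeros(l))
A, b_eq = matrix(y.reshape(1, -1)), matrix(0.0)

lam = np.array(solvers.qp(P, q, G, h, A, b_eq)['x']).ravel()
sv = lam > 1e-6
b = np.mean(y[sv] - (lam * y) @ K[:, sv])           # b from the support vectors
f = lambda Kx: np.sign((lam * y) @ Kx + b)          # decision uses only kernel evaluations
print("decisions on the training points:", f(K))
```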
Support Vector Regression
• The ε-insensitive support vector regression: find a function f(x) that has at most ε deviation from the actually obtained targets yᵢ for all the training data, and at the same time is as flat as possible. If f(x) = w · x + b, flatness means a small ‖w‖.
• Primal Regression Problem:
  minimize (1/2) ‖w‖²
  subject to yᵢ − (w · xᵢ + b) ≤ ε,   (w · xᵢ + b) − yᵢ ≤ ε,   i = 1, ..., l.
Soft Margin Formulation
• Soft margin formulation:
  minimize (1/2) ‖w‖² + C Σᵢ (ξᵢ + ξᵢ*)
  subject to yᵢ − (w · xᵢ + b) ≤ ε + ξᵢ,   (w · xᵢ + b) − yᵢ ≤ ε + ξᵢ*,   ξᵢ, ξᵢ* ≥ 0.
• C determines the trade-off between the flatness of f(x) and the amount up to which deviations larger than ε are tolerated.
• The ε-insensitive loss function |ξ|_ε (Vapnik 1995) is defined as
  |ξ|_ε = 0 if |ξ| ≤ ε, and |ξ|_ε = |ξ| − ε otherwise (a small sketch follows).
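A small sketch of the ε-insensitive loss just defined; the residual values and the choice ε = 0.1 are assumptions for illustration:

```python
# epsilon-insensitive loss |xi|_eps = max(0, |xi| - eps), vectorized with numpy.
import numpy as np

def eps_insensitive_loss(y_true, y_pred, eps=0.1):
    """Return |y_true - y_pred|_eps: zero inside the eps-tube, linear outside."""
    return np.maximum(0.0, np.abs(y_true - y_pred) - eps)

# Toy illustration (assumed values): residuals of 0.05 and 0.3 with eps = 0.1.
print(eps_insensitive_loss(np.array([1.0, 1.0]), np.array([1.05, 1.3])))  # -> [0.0, 0.2]
```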
Saddle Point Optimality Conditions
• The Lagrangian function helps us formulate the dual problem:
  L = (1/2) ‖w‖² + C Σᵢ (ξᵢ + ξᵢ*) − Σᵢ λᵢ (ε + ξᵢ − yᵢ + w · xᵢ + b) − Σᵢ λᵢ* (ε + ξᵢ* + yᵢ − w · xᵢ − b) − Σᵢ (ηᵢ ξᵢ + ηᵢ* ξᵢ*),
  with multipliers λᵢ, λᵢ*, ηᵢ, ηᵢ* ≥ 0.
• Optimality conditions:
  ∂L/∂w = 0 ⇒ w = Σᵢ (λᵢ − λᵢ*) xᵢ,
  ∂L/∂b = 0 ⇒ Σᵢ (λᵢ − λᵢ*) = 0,
  ∂L/∂ξᵢ = 0 ⇒ ηᵢ = C − λᵢ,   ∂L/∂ξᵢ* = 0 ⇒ ηᵢ* = C − λᵢ*.
Dual Problem for Regression
• Dual problem:
  maximize −(1/2) Σᵢ Σⱼ (λᵢ − λᵢ*)(λⱼ − λⱼ*)(xᵢ · xⱼ) − ε Σᵢ (λᵢ + λᵢ*) + Σᵢ yᵢ (λᵢ − λᵢ*)
  subject to Σᵢ (λᵢ − λᵢ*) = 0,   λᵢ, λᵢ* ∈ [0, C].
• Solving it gives w = Σᵢ (λᵢ − λᵢ*) xᵢ, and therefore f(x) = Σᵢ (λᵢ − λᵢ*)(xᵢ · x) + b.
KKT Optimality Conditions and b*
• KKT optimality conditions:
  λᵢ (ε + ξᵢ − yᵢ + w · xᵢ + b) = 0,   λᵢ* (ε + ξᵢ* + yᵢ − w · xᵢ − b) = 0,
  (C − λᵢ) ξᵢ = 0,   (C − λᵢ*) ξᵢ* = 0.
• Only samples (xᵢ, yᵢ) with corresponding λᵢ = C lie outside the ε-insensitive tube around f. If λᵢ is nonzero, then λᵢ* is zero and vice versa. Finally, if λᵢ ∈ (0, C), then the corresponding ξᵢ is zero.
• b can then be computed as follows:
  b = yᵢ − w · xᵢ − ε  for λᵢ ∈ (0, C),
  b = yᵢ − w · xᵢ + ε  for λᵢ* ∈ (0, C).
QP SV Regression Problem in Feature Space
• Mapping into the feature space (i.e., replacing xᵢ · xⱼ with the kernel K(xᵢ, xⱼ)) we obtain the following quadratic SV regression problem:
  maximize −(1/2) Σᵢ Σⱼ (λᵢ − λᵢ*)(λⱼ − λⱼ*) K(xᵢ, xⱼ) − ε Σᵢ (λᵢ + λᵢ*) + Σᵢ yᵢ (λᵢ − λᵢ*)
  subject to Σᵢ (λᵢ − λᵢ*) = 0,   λᵢ, λᵢ* ∈ [0, C].
• At the optimal solution, we obtain f(x) = Σᵢ (λᵢ − λᵢ*) K(xᵢ, x) + b (a usage-level sketch follows).
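As a usage-level illustration (the noisy sine data and the parameter values are assumptions, and scikit-learn's SVR is used here as a stand-in solver for this kernelized QP rather than the interior point methods this tutorial is about):

```python
# Fitting an epsilon-SVR with an RBF kernel on assumed 1-D toy data via scikit-learn.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.linspace(0, 2 * np.pi, 40).reshape(-1, 1)
y = np.sin(X).ravel() + 0.1 * rng.standard_normal(40)   # noisy sine targets (assumed)

model = SVR(kernel="rbf", C=10.0, epsilon=0.1, gamma=1.0)
model.fit(X, y)
print("number of support vectors:", len(model.support_))
print("prediction at x = pi/2:", model.predict([[np.pi / 2]]))
```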
Kernel Functions in SVMs
• An inner product in the feature space has an equivalent kernel in the input space:
  K(x, x') = Φ(x) · Φ(x').
• Any symmetric positive semi-definite function (Smola 1998) which satisfies Mercer's conditions can be used as a kernel function in the SVM context. Mercer's conditions can be written as
  ∫∫ K(x, x') g(x) g(x') dx dx' ≥ 0  for all g with ∫ g(x)² dx < ∞.
Some kernel functions (sketch implementations follow)
• Polynomial: K(x, x') = (x · x' + 1)^d
• Gaussian Radial Basis Function (GRBF): K(x, x') = exp( −‖x − x'‖² / (2σ²) )
• Exponential Radial Basis Function: K(x, x') = exp( −‖x − x'‖ / (2σ²) )
• Multi-Layer Perceptron: K(x, x') = tanh( ρ (x · x') + θ )
• Fourier Series: K(x, x') = sin( (N + 1/2)(x − x') ) / sin( (x − x') / 2 )
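Below is a short sketch of the first four kernels, using the parametrizations written above; the default parameter values and the test vectors are assumptions for illustration:

```python
# Sketch implementations of the kernels listed above (parameter choices are assumptions).
import numpy as np

def polynomial(x, z, d=3):
    return (np.dot(x, z) + 1.0) ** d

def gaussian_rbf(x, z, sigma=1.0):
    return np.exp(-np.linalg.norm(np.asarray(x) - np.asarray(z)) ** 2 / (2 * sigma ** 2))

def exponential_rbf(x, z, sigma=1.0):
    return np.exp(-np.linalg.norm(np.asarray(x) - np.asarray(z)) / (2 * sigma ** 2))

def mlp(x, z, rho=1.0, theta=-1.0):
    # Note: the tanh "kernel" satisfies Mercer's conditions only for some (rho, theta).
    return np.tanh(rho * np.dot(x, z) + theta)

x, z = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(polynomial(x, z), gaussian_rbf(x, z), exponential_rbf(x, z), mlp(x, z))
```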
Open Problem
• We have more than one kernel to map the input space into the feature space.
• Question: which kernel functions provide good generalization for a particular problem?
• Validation techniques such as bootstrapping and cross-validation can be used to determine a good kernel (see the sketch below).
• Even when we decide on a kernel function, we still have to choose its parameters (e.g., the RBF kernel has a parameter σ whose value must be decided before the experiment).
• There is no theory yet for the selection of optimal kernels (Smola 1998, Amari 1999).
• For a more extensive literature and software on SVMs, check the web page http://svm.first.gmd.de/
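A brief sketch of the cross-validation approach mentioned above, using scikit-learn to pick a kernel and its parameters; the synthetic data set and the parameter grid are assumptions of the sketch:

```python
# Cross-validated selection of a kernel and its parameters (assumed grid, toy data).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

param_grid = [
    {"kernel": ["rbf"], "gamma": [0.01, 0.1, 1.0], "C": [1, 10]},
    {"kernel": ["poly"], "degree": [2, 3], "C": [1, 10]},
]
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)
print("best kernel/parameters:", search.best_params_)
print("cross-validated accuracy:", search.best_score_)
```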