
New Horizon in Machine Learning — Support Vector Machine for non-Parametric Learning


  1. New Horizon in Machine Learning — Support Vector Machine for non-Parametric Learning Zhao Lu, Ph.D. Associate Professor Department of Electrical Engineering, Tuskegee University

  2. Introduction • As an innovative non-parametric learning strategy, the Support Vector Machine (SVM) gained increasing popularity in the late 1990s. It is currently among the best performers for a variety of tasks, such as pattern recognition, regression, and signal processing. • Support vector learning algorithms: • Support vector classification for nonlinear pattern recognition; • Support vector regression for highly nonlinear function approximation;

  3. Part I. Support Vector Learning for Classification

  4. Overfitting in linearly separable classification

  5. What is a good Decision Boundary? • Consider a two-class, linearly separable classification problem. Construct a separating hyperplane w^T x + b = 0 so that all training points are classified correctly. • Many decision boundaries can do this! Are all decision boundaries equally good? [Figure: Class 1 and Class 2 with several candidate decision boundaries]

  6. Examples of Bad Decision Boundaries • For linearly separable classes, new data from a class will typically lie close to the training data of that class, so a decision boundary that passes very close to the training points of either class is a poor choice. [Figure: two boundaries, each passing very close to the training points of Class 1 or Class 2]

  7. Optimal separating hyperplane • The optimal separating hyperplane (OSH) is defined as the separating hyperplane that maximizes the margin m between the two classes. It can be proved that the OSH is unique and is located halfway between the margin hyperplanes. [Figure: Class 1 and Class 2 with the margin m between the two margin hyperplanes]

  8. Canonical separating hyperplane • A hyperplane w^T x + b = 0 is in canonical form with respect to all training data if min_i |w^T xi + b| = 1. • Margin hyperplanes: w^T x + b = +1 and w^T x + b = -1. • A canonical hyperplane having maximal margin is the ultimate learning goal, i.e. the optimal separating hyperplane.

  9. Margin in terms of the norm of w • According to conclusions from statistical learning theory, a large-margin decision boundary has excellent generalization capability. • For the canonical hyperplane, it can be proved that the margin is m = 2 / ||w||. Hence, maximizing the margin is equivalent to minimizing the square of the norm of w.

  10. Finding the optimal decision boundary • Let {x1, ..., xn} be our data set and let yi ∈ {1, -1} be the class label of xi. • The optimal decision boundary should classify all points correctly ⇒ yi (w^T xi + b) ≥ 1 for all i. • The decision boundary can be found by solving the following constrained optimization problem: minimize (1/2)||w||^2 subject to yi (w^T xi + b) ≥ 1, i = 1, ..., n. • This is a quadratic optimization problem with linear inequality constraints.
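The sketch below (not from the slides) solves this primal problem directly with a general-purpose constrained optimizer on hypothetical toy data; scipy's SLSQP method handles the linear inequality constraints yi (w^T xi + b) ≥ 1.

```python
# Hard-margin primal SVM as a constrained minimization, on toy 2-D data.
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [2.5, 3.0], [3.0, 2.5],    # class +1
              [0.0, 0.0], [0.5, -0.5], [-0.5, 0.5]])  # class -1
y = np.array([1, 1, 1, -1, -1, -1], dtype=float)

def objective(z):
    w = z[:2]                    # z = (w1, w2, b)
    return 0.5 * np.dot(w, w)    # minimize (1/2)||w||^2

# One inequality constraint y_i (w^T x_i + b) - 1 >= 0 per training point.
constraints = [{'type': 'ineq',
                'fun': lambda z, xi=xi, yi=yi: yi * (np.dot(z[:2], xi) + z[2]) - 1.0}
               for xi, yi in zip(X, y)]

res = minimize(objective, x0=np.zeros(3), method='SLSQP', constraints=constraints)
w, b = res.x[:2], res.x[2]
print("w =", w, "b =", b, "margin =", 2.0 / np.linalg.norm(w))
```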

  11. Generalized Lagrangian Function • Consider the general (primal) optimization problem: minimize f(w) subject to gi(w) ≤ 0 (i = 1, ..., k) and hj(w) = 0 (j = 1, ..., m), where the functions f, gi, hj are defined on a domain Ω. The generalized Lagrangian is defined as L(w, α, β) = f(w) + Σi αi gi(w) + Σj βj hj(w), with αi ≥ 0.

  12. Dual Problem and Strong Duality Theorem • Given the primal optimization problem, its dual problem is defined as: maximize θ(α, β) = inf_w L(w, α, β) subject to αi ≥ 0. • Strong Duality Theorem: Given the primal optimization problem, where the domain Ω is convex, the objective is convex, and the constraints are affine functions, the optimum of the primal problem occurs at the same value as the optimum of the dual problem.

  13. Karush-Kuhn-Tucker Conditions • Given the primal optimization problem with the objective function f convex and the constraints gi, hj affine, necessary and sufficient conditions for w* to be an optimum are the existence of α*, β* such that: ∂L(w*, α*, β*)/∂w = 0; ∂L(w*, α*, β*)/∂β = 0; αi* gi(w*) = 0 (KKT complementarity condition); gi(w*) ≤ 0; αi* ≥ 0, for i = 1, ..., k.

  14. Lagrangian of the optimization problem • The Lagrangian is L(w, b, α) = (1/2)||w||^2 − Σi αi [yi (w^T xi + b) − 1], with αi ≥ 0. • Setting the gradient of L w.r.t. w and b to zero, we have w = Σi αi yi xi and Σi αi yi = 0.

  15. The Dual Problem • If we substitute w = Σi αi yi xi into the Lagrangian L, we have W(α) = Σi αi − (1/2) Σi Σj αi αj yi yj xi^T xj. • Note that Σi αi yi = 0, and the data points appear only in terms of their inner products; this is a quadratic function of the αi only.

  16. The Dual Problem • The new objective function is in terms of the αi only. • The original problem is known as the primal problem. • The objective function of the dual problem needs to be maximized! • The dual problem is therefore: maximize W(α) = Σi αi − (1/2) Σi Σj αi αj yi yj xi^T xj subject to αi ≥ 0 (a property of the αi when we introduce the Lagrange multipliers) and Σi αi yi = 0 (the result when we differentiate the original Lagrangian w.r.t. b).

  17. The Dual Problem • This is a quadratic programming (QP) problem, and therefore a global optimum of W(α) can always be found. • w can be recovered by w = Σi αi yi xi, so the decision function can be written in the following non-parametric form: f(x) = Σi αi yi xi^T x + b.
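A minimal sketch of solving this dual QP with the cvxopt package (assumed available) on hypothetical toy data: the dual is rewritten as minimizing (1/2) α^T P α − 1^T α with P_ij = yi yj xi^T xj, subject to αi ≥ 0 and Σi αi yi = 0, after which w is recovered from the multipliers.

```python
# Hard-margin dual SVM via quadratic programming (cvxopt).
import numpy as np
from cvxopt import matrix, solvers

X = np.array([[2.0, 2.0], [2.5, 3.0], [3.0, 2.5],
              [0.0, 0.0], [0.5, -0.5], [-0.5, 0.5]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
n = len(y)

P = matrix(np.outer(y, y) * (X @ X.T))   # quadratic term: y_i y_j x_i^T x_j
q = matrix(-np.ones(n))                  # maximizing sum(alpha) -> minimize -1^T alpha
G = matrix(-np.eye(n))                   # -alpha_i <= 0, i.e. alpha_i >= 0
h = matrix(np.zeros(n))
A = matrix(y.reshape(1, -1))             # equality constraint sum_i alpha_i y_i = 0
b = matrix(0.0)

solvers.options['show_progress'] = False
alpha = np.array(solvers.qp(P, q, G, h, A, b)['x']).ravel()

w = (alpha * y) @ X                      # w = sum_i alpha_i y_i x_i
sv = alpha > 1e-6                        # support vectors have alpha_i > 0
b_off = np.mean(y[sv] - X[sv] @ w)       # bias from the margin condition
f = lambda z: z @ w + b_off              # non-parametric decision function f(x) = w^T x + b
print("alpha =", np.round(alpha, 3), "\nw =", w, "b =", b_off)
```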

  18. Conception of Support Vectors (SVs) • According to the Karush-Kuhn-Tucker (KKT) complementarity condition, the solution must satisfy αi [yi (w^T xi + b) − 1] = 0. Thus αi > 0 only for those points that are closest to the classifying hyperplane, i.e. those with yi (w^T xi + b) = 1. These points are called support vectors. • From the KKT complementarity condition, the bias term b can be calculated by using the support vectors: b = ys − w^T xs for any support vector xs.
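As a short illustration (toy data assumed, with a very large C to approximate the hard-margin case), a library SVM such as scikit-learn's SVC exposes exactly the quantities discussed here: the support vectors, the coefficients αi yi, and the bias b.

```python
# Support vectors, dual coefficients and bias exposed by a fitted SVC.
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [2.5, 3.0], [3.0, 2.5],
              [0.0, 0.0], [0.5, -0.5], [-0.5, 0.5]])
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel='linear', C=1e6).fit(X, y)
print("support vector indices:", clf.support_)     # only the points on the margin
print("support vectors:\n", clf.support_vectors_)
print("alpha_i * y_i:", clf.dual_coef_)             # nonzero only for support vectors
print("bias b:", clf.intercept_)
```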

  19. Sparseness of the solution [Figure: ten training points from Class 1 and Class 2, with α1 = 0.8, α6 = 1.4, α8 = 0.6 for the support vectors and α2 = α3 = α4 = α5 = α7 = α9 = α10 = 0 for all other points]

  20. The use of slack variables • We allow "errors" ξi in classification for noisy data. [Figure: Class 1 and Class 2 with points violating the margin]

  21. Soft Margin Hyperplane • The use of slack variables ξi enables the soft margin classifier. • The ξi are "slack variables" in the optimization. • Note that ξi = 0 if there is no error for xi. • The objective function becomes (1/2)||w||^2 + C Σi ξi, where C is a tradeoff parameter between error and margin. • The primal optimization problem becomes: minimize (1/2)||w||^2 + C Σi ξi subject to yi (w^T xi + b) ≥ 1 − ξi and ξi ≥ 0.

  22. Dual Soft-Margin Optimization Problem • The dual of this new constrained optimization problem is: maximize W(α) = Σi αi − (1/2) Σi Σj αi αj yi yj xi^T xj subject to 0 ≤ αi ≤ C and Σi αi yi = 0. • w can be recovered as w = Σi αi yi xi. • This is very similar to the optimization problem in the hard-margin case, except that there is now an upper bound C on the αi. • Once again, a QP solver can be used to find the αi.
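A sketch of the only change the soft margin makes to the earlier cvxopt dual QP: the box constraint 0 ≤ αi ≤ C. The data and the value of C below are hypothetical; one deliberately "noisy" point is included so that a margin violation occurs.

```python
# Soft-margin dual SVM: same QP as before, plus the upper bound alpha_i <= C.
import numpy as np
from cvxopt import matrix, solvers

def soft_margin_dual(X, y, C):
    n = len(y)
    P = matrix(np.outer(y, y) * (X @ X.T))
    q = matrix(-np.ones(n))
    # Two stacked inequality blocks: -alpha <= 0 and alpha <= C.
    G = matrix(np.vstack([-np.eye(n), np.eye(n)]))
    h = matrix(np.hstack([np.zeros(n), C * np.ones(n)]))
    A = matrix(y.reshape(1, -1))
    b = matrix(0.0)
    solvers.options['show_progress'] = False
    return np.array(solvers.qp(P, q, G, h, A, b)['x']).ravel()

X = np.array([[2.0, 2.0], [2.5, 3.0], [0.2, 0.3],    # last class-1 point is "noisy"
              [0.0, 0.0], [0.5, -0.5], [-0.5, 0.5]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
alpha = soft_margin_dual(X, y, C=1.0)
print(np.round(alpha, 3))   # margin-violating points sit at the upper bound alpha_i = C
```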

  23. Nonlinearly separable problems

  24. Extension to Non-linear Decision Boundary • How can the linear large-margin classifier be extended to the nonlinear case? • Cover's theorem • Consider a space made up of nonlinearly separable patterns. • Cover's theorem states that such a multi-dimensional space can be transformed into a new feature space where the patterns are linearly separable with high probability, provided two conditions are satisfied: (1) the transformation is nonlinear; (2) the dimensionality of the feature space is high enough.

  25. Non-linear SVMs: Feature spaces • General idea: the data in the original input space can always be mapped into some higher-dimensional feature space where the training data become linearly separable, by using a nonlinear transformation Φ: x → φ(x). Kernel visualization: http://www.youtube.com/watch?v=9NrALgHFwTo
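A small illustration of this idea on a hypothetical dataset: points on two concentric circles are not linearly separable in R^2, but adding the single nonlinear feature x1^2 + x2^2 makes them linearly separable in R^3.

```python
# Mapping non-separable data into a feature space where it becomes separable.
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import LinearSVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# Linear classifier in the original input space: poor training accuracy.
print("input space   :", LinearSVC(max_iter=10000).fit(X, y).score(X, y))

# Nonlinear transformation phi(x) = (x1, x2, x1^2 + x2^2).
Phi = np.column_stack([X, (X ** 2).sum(axis=1)])
print("feature space :", LinearSVC(max_iter=10000).fit(Phi, y).score(Phi, y))
```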

  26. Transforming the data • Key idea: transform to a higher-dimensional space by using a nonlinear transformation. • Input space: the space where the points xi are located. • Feature space: the space of the φ(xi) after transformation. • Curse of dimensionality: computation in the feature space can be very costly because it is high-dimensional; the feature space may even be infinite-dimensional! • This 'curse of dimensionality' can be surmounted on the strength of the kernel function, because the inner product in the feature space is just a scalar; this is the most appealing characteristic of SVM.

  27. Kernel trick • Recall the SVM dual optimization problem. • The data points only appear as inner products xi^T xj. • With the aid of the inner-product representation in the feature space, the nonlinear mapping φ can be used implicitly by defining the kernel function K by K(xi, xj) = φ(xi)^T φ(xj).

  28. What functions can be used as kernels? • Mercer's theorem in operator theory: every positive semi-definite symmetric function is a kernel. • Positive semi-definite symmetric functions correspond to a positive semi-definite symmetric Gram matrix on the data points: K = [K(xi, xj)], the matrix of kernel values over all pairs of training points.

  29. An Example for φ(.) and K(.,.) • Suppose the nonlinear mapping φ(.): R^2 → R^3 is φ(x) = (x1^2, √2 x1 x2, x2^2). • An inner product in the feature space is φ(x)^T φ(z) = (x^T z)^2. • So, if we define the kernel function as K(x, z) = (x^T z)^2, there is no need to carry out φ(.) explicitly. • This use of a kernel function to avoid carrying out φ(.) explicitly is known as the kernel trick.
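A quick numeric check (on randomly drawn vectors) that the degree-2 mapping above satisfies φ(x)^T φ(z) = (x^T z)^2, so the kernel computes the feature-space inner product without ever forming φ explicitly.

```python
# Verify that the explicit mapping and the kernel give the same inner product.
import numpy as np

def phi(x):
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

def K(x, z):
    return np.dot(x, z) ** 2

rng = np.random.default_rng(0)
x, z = rng.normal(size=2), rng.normal(size=2)
print(np.dot(phi(x), phi(z)), K(x, z))   # the two numbers agree
```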

  30. Kernel functions • In the practical use of SVM, the user specifies the kernel function; the transformation φ(.) is not explicitly stated. • Given a kernel function K(x, z), the transformation φ(.) is given by its eigenfunctions (a concept in functional analysis). • Eigenfunctions can be difficult to construct explicitly. • This is why people only specify the kernel function without worrying about the exact transformation.

  31. Examples of kernel functions • Polynomial kernel with degree d: K(x, z) = (x^T z + 1)^d. • Radial basis function kernel with width σ: K(x, z) = exp(−||x − z||^2 / (2σ^2)). • Closely related to radial basis function neural networks. • The feature space induced is infinite-dimensional. • Sigmoid kernel with parameters κ and θ: K(x, z) = tanh(κ x^T z + θ). • It does not satisfy the Mercer condition for all κ and θ. • Closely related to feedforward neural networks.
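Plain-numpy sketches of the three kernels listed above; the parameter values (d, σ, κ, θ) are illustrative choices, not prescribed by the slides.

```python
# The three classical kernels as simple functions of two input vectors.
import numpy as np

def polynomial_kernel(x, z, d=3):
    return (np.dot(x, z) + 1.0) ** d

def rbf_kernel(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

def sigmoid_kernel(x, z, kappa=1.0, theta=-1.0):
    # Not positive semi-definite for every (kappa, theta), as noted above.
    return np.tanh(kappa * np.dot(x, z) + theta)

x, z = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(polynomial_kernel(x, z), rbf_kernel(x, z), sigmoid_kernel(x, z))
```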

  32. Kernel: Bridge from linear to nonlinear • Change all inner products to kernel functions. • For training, the optimization problem changes from the linear form, maximize W(α) = Σi αi − (1/2) Σi Σj αi αj yi yj xi^T xj, to the nonlinear form, maximize W(α) = Σi αi − (1/2) Σi Σj αi αj yi yj K(xi, xj), subject to the same constraints 0 ≤ αi ≤ C and Σi αi yi = 0.

  33. Kernel expansion for decision function • Linear: f(z) = Σi αi yi xi^T z + b; nonlinear: f(z) = Σi αi yi K(xi, z) + b. • For classifying the new data z: it belongs to class 1 if f(z) ≥ 0, and to class 2 if f(z) < 0.
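A small sketch of this kernel expansion: given multipliers αi, labels yi, training points xi, a bias b, and a kernel K (all assumed to come from a previously solved dual problem), a new point z is classified by the sign of f(z).

```python
# Kernel expansion of the SVM decision function.
import numpy as np

def decision_function(z, alpha, y, X, b, kernel):
    # f(z) = sum_i alpha_i y_i K(x_i, z) + b
    return sum(a * yi * kernel(xi, z) for a, yi, xi in zip(alpha, y, X)) + b

def classify(z, alpha, y, X, b, kernel):
    return 1 if decision_function(z, alpha, y, X, b, kernel) >= 0 else -1

# With a linear kernel this reduces to f(z) = w^T z + b.
linear_kernel = lambda u, v: np.dot(u, v)
```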

  34. Compared to neural networks • SVMs are explicitly based on a theoretical model of learning rather than on loose analogies with natural learning systems or other heuristics. • Modularity: Any kernel-based learning algorithm is composed of two modules: • A general purpose learning machine • A problem specific kernel function • SVMs are not affected by the problem of local minima because their training amounts to convex optimization.

  35. Key features in SV classifier • All of these features were already present and had been used in machine learning since the 1960s: • maximum (large) margin; • kernel method; • duality in nonlinear programming; • sparseness of the solution; • slack variables. • However, it was not until 1995 that all of these features were combined, and it is striking how naturally and elegantly they fit together and complement each other in SVM.

  36. SVM classification for 2D data [Figure: visualization of SVM classification]

  37. Part II. Support Vector Learning for Regression

  38. Overfitting in nonlinear regression

  39. The linear regression problem

  40. Linear regression • The problem of linear regression is much older than the classification one. Least squares linear interpolation was first used by Gauss in the 18th century for astronomical problems. • Given a training set S = {(x1, y1), ..., (xn, yn)}, with xi ∈ R^d and yi ∈ R, the problem of linear regression is to find a linear function g(x) = w^T x + b that models the data.

  41. Least squares • The least squares approach prescribes choosing the parameters w to minimize the sum of the squared deviations of the data, L(w) = Σi (yi − w^T xi)^2. • Setting X = (x1, ..., xn)^T and y = (y1, ..., yn)^T, where X is the matrix whose rows are the training inputs and y is the vector of target values.

  42. Least squares • The square loss function can be written as L(w) = ||y − Xw||^2 = (y − Xw)^T (y − Xw). • Taking derivatives of the loss and setting them equal to zero yields the well-known 'normal equations' X^T X w = X^T y and, if the inverse of X^T X exists, the solution is w = (X^T X)^(−1) X^T y.
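A sketch of the normal equations on hypothetical data; in practice a QR/SVD-based solver such as np.linalg.lstsq is preferred over forming (X^T X)^(−1) explicitly, and both routes are compared below.

```python
# Least squares via the normal equations, on synthetic data.
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(size=(50, 2)), np.ones(50)])  # last column of 1s for the bias
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=50)

w_normal = np.linalg.solve(X.T @ X, X.T @ y)    # solve X^T X w = X^T y
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None) # numerically preferred route
print(w_normal, w_lstsq)                         # both close to w_true
```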

  43. Ridge regression • If the matrix X^T X in the least squares problem is not of full rank, or in other situations where numerical stability problems occur, one can use the following solution instead: w = (X^T X + λ I)^(−1) X^T y, where I is the identity matrix with the entry corresponding to the bias term set to zero (so the bias is not penalized). This solution is called ridge regression. Ridge regression minimizes the penalized loss function L(w) = ||y − Xw||^2 + λ||w||^2, in which the term λ||w||^2 acts as a regularizer.
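A minimal sketch of this ridge solution, following the convention above of zeroing the diagonal entry for the bias column; the value λ = 0.1 and the data are illustrative.

```python
# Ridge regression: w = (X^T X + lambda * I)^(-1) X^T y, bias unpenalized.
import numpy as np

def ridge(X, y, lam):
    I = np.eye(X.shape[1])
    I[-1, -1] = 0.0          # do not regularize the bias (last column of X is all ones)
    return np.linalg.solve(X.T @ X + lam * I, X.T @ y)

rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(size=(50, 2)), np.ones(50)])
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=50)
print(ridge(X, y, lam=0.1))
```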

  44. ε-insensitive loss function • Instead of the square loss function, the ε-insensitive loss function L_ε(y, f(x)) = max(0, |y − f(x)| − ε) is used in SV regression, which leads to sparsity of the solution. [Figure: ε-insensitive loss vs. square loss, plotting the penalty against the value off target; the ε-insensitive loss is zero on the interval (−ε, ε)]
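A direct sketch of the ε-insensitive loss next to the square loss; ε = 0.5 and the residual grid are illustrative.

```python
# epsilon-insensitive loss vs. square loss on a grid of residuals.
import numpy as np

def eps_insensitive_loss(y, f, eps=0.5):
    return np.maximum(0.0, np.abs(y - f) - eps)   # zero inside the tube |y - f| <= eps

def square_loss(y, f):
    return (y - f) ** 2

residuals = np.linspace(-2, 2, 9)
print(eps_insensitive_loss(residuals, 0.0))       # flat at zero for |residual| <= 0.5
print(square_loss(residuals, 0.0))
```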

  45. The linear regression problem

  46. Primal problem in SVR (ε-SVR) • Given a data set {(x1, y1), ..., (xn, yn)} with values yi ∈ R, ε-SVR was formulated as the following (primal) convex optimization problem: minimize (1/2)||w||^2 + C Σi (ξi + ξi*) subject to yi − w^T xi − b ≤ ε + ξi, w^T xi + b − yi ≤ ε + ξi*, and ξi, ξi* ≥ 0. • The constant C > 0 determines the trade-off between the flatness of f and the amount up to which deviations larger than ε are tolerated.
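A sketch of ε-SVR on a hypothetical 1-D regression problem using scikit-learn; C and ε are the two trade-off constants discussed above, and the target function and noise level are arbitrary choices.

```python
# epsilon-SVR on synthetic data; only points on or outside the tube become SVs.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(-3, 3, size=(80, 1)), axis=0)
y = np.sinc(X).ravel() + 0.05 * rng.normal(size=80)

svr = SVR(kernel='rbf', C=10.0, epsilon=0.1).fit(X, y)
print("number of support vectors:", len(svr.support_))
print("fit quality (R^2):", svr.score(X, y))
```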

  47. Lagrangian • Construct the Lagrange function from the objective function and the corresponding constraints: L = (1/2)||w||^2 + C Σi (ξi + ξi*) − Σi αi (ε + ξi − yi + w^T xi + b) − Σi αi* (ε + ξi* + yi − w^T xi − b) − Σi (ηi ξi + ηi* ξi*), where the Lagrange multipliers satisfy the positivity constraints αi, αi*, ηi, ηi* ≥ 0.

  48. Karush-Kuhn-Tucker Conditions • It follows from the saddle point condition that the partial derivatives of L with respect to the primal variables (w, b, ξi, ξi*) have to vanish for optimality: ∂L/∂b = Σi (αi* − αi) = 0; ∂L/∂w = w − Σi (αi − αi*) xi = 0; ∂L/∂ξi = C − αi − ηi = 0; ∂L/∂ξi* = C − αi* − ηi* = 0. • The second equation indicates that w = Σi (αi − αi*) xi can be written as a linear combination of the training patterns xi.

  49. Dual problem • Substituting the equations above into the Lagrangian yields the following dual problem: maximize −(1/2) Σi Σj (αi − αi*)(αj − αj*) xi^T xj − ε Σi (αi + αi*) + Σi yi (αi − αi*) subject to Σi (αi − αi*) = 0 and αi, αi* ∈ [0, C]. • The function f can be written in the non-parametric form f(x) = Σi (αi − αi*) xi^T x + b by substituting w = Σi (αi − αi*) xi into f(x) = w^T x + b.
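A sketch verifying this non-parametric (kernelized) form on hypothetical data: scikit-learn stores (αi − αi*) in dual_coef_, so f(x) = Σi (αi − αi*) K(xi, x) + b can be reassembled by hand and compared with the library's own predictions.

```python
# Reassemble f(x) = sum_i (alpha_i - alpha_i*) K(x_i, x) + b from a fitted SVR.
import numpy as np
from sklearn.svm import SVR
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(80, 1))
y = np.sinc(X).ravel() + 0.05 * rng.normal(size=80)

svr = SVR(kernel='rbf', gamma=0.5, C=10.0, epsilon=0.1).fit(X, y)

X_new = rng.uniform(-3, 3, size=(5, 1))
K = rbf_kernel(X_new, svr.support_vectors_, gamma=0.5)   # K(x, x_i) over support vectors
f_manual = K @ svr.dual_coef_.ravel() + svr.intercept_   # sum_i (a_i - a_i*) K(x_i, x) + b
print(np.allclose(f_manual, svr.predict(X_new)))         # True
```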

  50. KKT complementarity conditions • At the optimal solution the following Karush-Kuhn-Tucker complementarity conditions must be fulfilled: αi (ε + ξi − yi + w^T xi + b) = 0, αi* (ε + ξi* + yi − w^T xi − b) = 0, (C − αi) ξi = 0, and (C − αi*) ξi* = 0. • Obviously, αi = 0 holds for points lying strictly inside the ε-tube, and similarly αi* = 0; only points on or outside the ε-tube can have nonzero multipliers.
