
CH. 13: Kernel Machines
(A) Support Vector Machine (SVM)
-- classifier, feedforward neural network, supervised learning
Difficulties with SVM: i) binary classifier, ii) linearly separable patterns

SVM finds the optimal separating hyperplane (OSH) with the maximal margin between two support hyperplanes, which are formed by support vectors.

Data points: $\{(x_i, y_i)\}_{i=1}^{n}$, $y_i \in \{+1, -1\}$
Let the equation of the OSH be $w^T x + b = 0$
$w$: normal vector, points toward the positive data
$|b| / \|w\|$: distance from the OSH to the origin

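Let $w^T x + b = +1$ and $w^T x + b = -1$: support hyperplanes ($w$ and $b$ rescaled so that the right-hand sides are $\pm 1$)
Distances between them and the OSH:
For a point $x$ on $w^T x + b = +1$, the distance to the OSH is $|w^T x + b| / \|w\| = 1 / \|w\|$
Likewise, for a point on $w^T x + b = -1$ the distance is $1 / \|w\|$
Margin: $1 / \|w\| + 1 / \|w\| = 2 / \|w\|$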
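The margin computation can be checked numerically. The following is a minimal sketch, assuming the support hyperplanes are written as w.x + b = +1 and w.x + b = -1 so that the margin is 2/||w||; the hyperplane and the two points are hypothetical values chosen for illustration.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def signed_distance(w, b, x):
    """Signed distance from point x to the hyperplane w.x + b = 0."""
    return (dot(w, x) + b) / math.sqrt(dot(w, w))

def margin(w):
    """Width between the hyperplanes w.x + b = +1 and w.x + b = -1."""
    return 2.0 / math.sqrt(dot(w, w))

# Hypothetical hyperplane and one point on each support hyperplane.
w, b = [0.5, 0.5], 0.0
x_pos, x_neg = [1.0, 1.0], [-1.0, -1.0]

d_pos = signed_distance(w, b, x_pos)   # positive side
d_neg = signed_distance(w, b, x_neg)   # negative side
m = margin(w)
print(d_pos, d_neg, m)
```

The gap between the two signed distances equals the margin, which is the quantity the SVM maximizes.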

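Maximizing the margin $2 / \|w\|$
Replace by: Minimizing $\frac{1}{2} \|w\|^2$
subject to $y_i (w^T x_i + b) \ge 1$, $i = 1, \dots, n$
Data points $(x_i, y_i)$ satisfying the equality $y_i (w^T x_i + b) = 1$ are called support vectors
Lagrange Multiplier Method: converts a constrained problem to an unconstrained problem.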

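The objective function: $L(w, b, \alpha) = \frac{1}{2} \|w\|^2 - \sum_{i=1}^{n} \alpha_i [ y_i (w^T x_i + b) - 1 ]$, $\alpha_i \ge 0$
The optimal solution is given by the saddle point of $L$, which is minimized w.r.t. $w$ and $b$, while maximized w.r.t. $\alpha$, i.e., $\max_{\alpha} \min_{w, b} L(w, b, \alpha)$.
Through Karush–Kuhn–Tucker (KKT) conditions, $L$ defined in the primal space of $w$, $b$ is translated to the dual space of $\alpha$.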

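$\partial L / \partial w = 0 \Rightarrow w = \sum_i \alpha_i y_i x_i$ --- (A)
$\partial L / \partial b = 0 \Rightarrow \sum_i \alpha_i y_i = 0$ --- (B)
$\alpha_i [ y_i (w^T x_i + b) - 1 ] = 0$, $\alpha_i \ge 0$ --- (C)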

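Expanding: $L = \frac{1}{2} w^T w - w^T \sum_i \alpha_i y_i x_i - b \sum_i \alpha_i y_i + \sum_i \alpha_i$
From (B), $b \sum_i \alpha_i y_i = 0$
From (A), $w^T w = w^T \sum_i \alpha_i y_i x_i = \sum_i \sum_j \alpha_i \alpha_j y_i y_j x_i^T x_j$
The problem becomes: maximize $L_d(\alpha) = \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j x_i^T x_j$
subject to $\sum_i \alpha_i y_i = 0$, $\alpha_i \ge 0$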

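After solving the dual problem for $\alpha$, find $w$ by (A): $w = \sum_i \alpha_i y_i x_i$.
For non-support vectors, $y_i (w^T x_i + b) > 1$. From (C), $\alpha_i = 0$.
Support vectors are those whose $\alpha_i > 0$.
Determine $b$ using any support vector. Consider any support vector $x_s$: $y_s (w^T x_s + b) = 1 \Rightarrow b = y_s - w^T x_s$
For numerical stability, average over all support vectors: $b = \frac{1}{N_s} \sum_{s} (y_s - w^T x_s)$, $N_s$: # support vectors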
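The recovery of w and b from the dual solution can be sketched on a two-point toy problem. This is an illustration only: the data set is hypothetical, and the multipliers alpha = (1/4, 1/4) are assumed to be the dual solution for it (they satisfy the stationarity, balance, and complementarity conditions checked in the code).

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

# Hypothetical toy data: one positive and one negative point.
X = [[1.0, 1.0], [-1.0, -1.0]]
y = [1, -1]
alpha = [0.25, 0.25]  # assumed dual solution for this toy set

# (A): w = sum_i alpha_i * y_i * x_i
w = [sum(a * yi * xi[k] for a, yi, xi in zip(alpha, y, X)) for k in range(2)]

# Support vectors are the points with alpha_i > 0.
sv = [i for i, a in enumerate(alpha) if a > 1e-12]

# b = y_s - w.x_s, averaged over all support vectors.
b = sum(y[i] - dot(w, X[i]) for i in sv) / len(sv)

# (B): sum_i alpha_i * y_i should be zero.
balance = sum(a * yi for a, yi in zip(alpha, y))

# (C): alpha_i * (y_i (w.x_i + b) - 1) should be zero for every i.
comp = [a * (yi * (dot(w, xi) + b) - 1.0) for a, yi, xi in zip(alpha, y, X)]

# Dual and primal objective values coincide at the optimum.
n = len(X)
dual = sum(alpha) - 0.5 * sum(
    alpha[i] * alpha[j] * y[i] * y[j] * dot(X[i], X[j])
    for i in range(n) for j in range(n)
)
primal = 0.5 * dot(w, w)
print(w, b, balance, comp, dual, primal)
```

Both points are support vectors here, so the average for b runs over the whole data set; in larger problems most multipliers are zero and drop out of the sums.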

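Overlapping patterns: the patterns that violate $y_i (w^T x_i + b) \ge 1$
Define the constraint as $y_i (w^T x_i + b) \ge 1 - \xi_i$
Soft margin: $\xi_i \ge 0$: slack variables
Two ways of violation:
i) $0 < \xi_i < 1$: correctly classified but inside the margin
ii) $\xi_i \ge 1$: misclassified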

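Problem: Find a separating hyperplane for which
(i) $\|w\|$ is minimal
(ii) $y_i (w^T x_i + b) \ge 1 - \xi_i$, $\xi_i \ge 0$
(iii) $\sum_i \xi_i$ is minimal (soft error)
Lagrange objective function in the primal space:
$L_p = \frac{1}{2} \|w\|^2 + C \sum_i \xi_i - \sum_i \alpha_i [ y_i (w^T x_i + b) - 1 + \xi_i ] - \sum_i \mu_i \xi_i$
$C$: penalty factor; $\alpha_i \ge 0$, $\mu_i \ge 0$: Lagrange multipliers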
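The slack variables and the soft-margin cost can be computed directly for a fixed hyperplane. This is a minimal sketch, assuming the slack is taken as xi = max(0, 1 - y (w.x + b)) and the cost as 0.5 ||w||^2 + C * sum(xi); the hyperplane, data, and C are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def slack(w, b, x, y):
    """Smallest xi >= 0 with y (w.x + b) >= 1 - xi."""
    return max(0.0, 1.0 - y * (dot(w, x) + b))

def violation(xi):
    """Classify a slack value into the cases discussed for the soft margin."""
    if xi == 0.0:
        return "none"
    if xi < 1.0:
        return "inside margin"
    return "misclassified"  # xi >= 1 (xi == 1 lies exactly on the hyperplane)

def soft_margin_cost(w, b, X, y, C):
    """0.5 ||w||^2 plus C times the soft error."""
    return 0.5 * dot(w, w) + C * sum(slack(w, b, xi, yi) for xi, yi in zip(X, y))

# Hypothetical hyperplane and three positive-class points.
w, b, C = [1.0, 0.0], 0.0, 1.0
X = [[2.0, 0.0], [0.5, 0.0], [-1.0, 0.0]]
y = [1, 1, 1]

slacks = [slack(w, b, xi, yi) for xi, yi in zip(X, y)]
kinds = [violation(s) for s in slacks]
cost = soft_margin_cost(w, b, X, y, C)
print(slacks, kinds, cost)
```

A larger C makes each unit of slack more expensive, so the optimizer tolerates fewer violations at the price of a narrower margin.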

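Through KKT conditions,
Dual space: the space of $\alpha$
Maximize $L_d(\alpha) = \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j x_i^T x_j$
subject to $\sum_i \alpha_i y_i = 0$, $0 \le \alpha_i \le C$
Different from the separable case in that $\alpha_i$ is bounded above by $C$ ($0 \le \alpha_i \le C$ instead of $\alpha_i \ge 0$).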

(B) Kernel Machines
13.5 Kernel Trick
Cover's theorem: nonlinearly separable data are more likely to become linearly separable when they are mapped from a low- to a high-dimensional space, e.g., 2D → 3D
$x$: a vector in the original N-D space

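$\{\varphi_j(x)\}_{j=1}^{\infty}$: a set of functions that transform $x$ to a space of infinite dimensionality.
Let $\varphi(x) = [\varphi_1(x), \varphi_2(x), \dots]^T$
The OSH in the new space: $w^T \varphi(x) + b = 0$ --- (1)
where $w = \sum_i \alpha_i y_i \varphi(x_i)$ --- (2)
and, for any support vector $x_s$, $b = y_s - w^T \varphi(x_s)$ --- (3)

Substitute (2), (3) into (1): $\sum_i \alpha_i y_i K(x_i, x) + y_s - \sum_i \alpha_i y_i K(x_i, x_s) = 0$
$K(x_i, x) = \varphi(x_i)^T \varphi(x)$: kernel function
The dual depends on the data only through $K$: $L_d(\alpha) = \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j K(x_i, x_j)$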
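The kernel trick can be illustrated on the XOR problem, which is not linearly separable in the original 2D space. This is a sketch under stated assumptions: the decision function is taken as g(x) = sum_i alpha_i y_i K(x_i, x) + b, the kernel is the degree-2 polynomial (x.y + 1)^2, and the multipliers alpha_i = 1/8 with b = 0 are assumed values that make every training point satisfy y_i g(x_i) = 1.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def poly2_kernel(x, y):
    """Polynomial kernel of degree 2: (x.y + 1)^2."""
    return (dot(x, y) + 1.0) ** 2

# XOR data with labels in {+1, -1}.
X = [[-1.0, -1.0], [-1.0, 1.0], [1.0, -1.0], [1.0, 1.0]]
y = [-1, 1, 1, -1]

alpha = [0.125] * 4  # assumed multipliers for this toy problem
b = 0.0              # assumed bias

def g(x):
    """Kernel decision function; phi(x) is never computed explicitly."""
    return sum(a * yi * poly2_kernel(xi, x) for a, yi, xi in zip(alpha, y, X)) + b

outputs = [g(x) for x in X]
predictions = [1 if v > 0 else -1 for v in outputs]
print(outputs, predictions)
```

All four points sit exactly on a support hyperplane in the induced feature space, so every training point is a support vector in this example.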

Mercer conditions: requirements of a kernel function
A kernel function can be considered as a measure of similarity between data points.
1. Symmetric: $K(x, y) = K(y, x)$
2.
3.
4.

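13.6 Examples of Kernel Functions
i) Linear kernel: $K(x, y) = x^T y$
ii) Polynomial kernel with degree d: $K(x, y) = (x^T y + 1)^d$
e.g., d = 2, with 2D vectors $x = (x_1, x_2)^T$, $y = (y_1, y_2)^T$:
$K(x, y) = (x_1 y_1 + x_2 y_2 + 1)^2 = 1 + 2 x_1 y_1 + 2 x_2 y_2 + 2 x_1 x_2 y_1 y_2 + x_1^2 y_1^2 + x_2^2 y_2^2$
which corresponds to $\varphi(x) = [1, \sqrt{2} x_1, \sqrt{2} x_2, \sqrt{2} x_1 x_2, x_1^2, x_2^2]^T$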
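The equivalence between the degree-2 polynomial kernel and an explicit feature map can be verified numerically. This is a minimal sketch, assuming the kernel (x.y + 1)^2 on 2D inputs and the six-dimensional map [1, sqrt(2) x1, sqrt(2) x2, sqrt(2) x1 x2, x1^2, x2^2]; the test vectors are arbitrary.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def poly_kernel(x, y, d=2):
    """Polynomial kernel (x.y + 1)^d."""
    return (dot(x, y) + 1.0) ** d

def phi2(x):
    """Explicit feature map matching poly_kernel with d = 2 on 2D inputs."""
    x1, x2 = x
    r2 = math.sqrt(2.0)
    return [1.0, r2 * x1, r2 * x2, r2 * x1 * x2, x1 * x1, x2 * x2]

x, y = [1.0, 2.0], [3.0, -1.0]
k_direct = poly_kernel(x, y)        # computed in the original 2D space
k_mapped = dot(phi2(x), phi2(y))    # computed in the 6D feature space
print(k_direct, k_mapped)
```

The kernel needs one 2D inner product, while the explicit route needs a 6D one; the gap grows quickly with the input dimension and the degree d.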

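iii) Perceptron kernel:
iv) Sigmoidal kernel: $K(x, y) = \tanh(\kappa x^T y + \theta)$
v) Radial basis function kernel: $K(x, y) = \exp( -\|x - y\|^2 / (2 \sigma^2) )$
13.8 Multiple Kernel Learning
A new kernel can be constructed by combining simpler kernels, e.g.,
$K(x, y) = c K_1(x, y)$ with $c > 0$
$K(x, y) = K_1(x, y) + K_2(x, y)$
$K(x, y) = K_1(x, y) K_2(x, y)$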
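Combining kernels amounts to combining functions. The following sketch assumes a linear kernel x.y and a radial basis function kernel exp(-||x - y||^2 / (2 sigma^2)), builds their sum and product, and forms the Gram matrix on a few hypothetical points; the helper names are illustrative.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def linear_kernel(x, y):
    return dot(x, y)

def rbf_kernel(x, y, sigma=1.0):
    """Radial basis function kernel with width sigma."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2.0 * sigma ** 2))

def sum_kernel(k1, k2):
    """New kernel K = K1 + K2."""
    return lambda x, y: k1(x, y) + k2(x, y)

def product_kernel(k1, k2):
    """New kernel K = K1 * K2."""
    return lambda x, y: k1(x, y) * k2(x, y)

def gram(kernel, X):
    """Matrix of kernel values between all pairs of points in X."""
    return [[kernel(a, b) for b in X] for a in X]

X = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]  # hypothetical points
k_sum = sum_kernel(linear_kernel, rbf_kernel)
k_prod = product_kernel(linear_kernel, rbf_kernel)

G = gram(k_sum, X)
n = len(X)
is_symmetric = all(abs(G[i][j] - G[j][i]) < 1e-12 for i in range(n) for j in range(n))
print(G, is_symmetric, k_prod(X[1], X[1]))
```

The Gram matrix of the combined kernel stays symmetric, as the symmetry requirement on kernel functions demands.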

13.9 Multiclass Kernel Machines (K > 2 classes)
1. Train K 2-class classifiers $g_i(x)$, $i = 1, \dots, K$, each one distinguishing one class from all other classes combined. During testing, compute all $g_i(x)$ and choose the class with the maximum value.
2. Train K(K-1)/2 pairwise classifiers
3. Train a single multiclass classifier
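The one-vs-all testing rule and the pairwise classifier count can be sketched as follows. The three discriminant functions are hypothetical stand-ins for trained 2-class kernel machines, and class indices start at 0 in the code.

```python
def one_vs_all_predict(discriminants, x):
    """Return the index of the discriminant with the largest output at x."""
    scores = [g(x) for g in discriminants]
    return max(range(len(scores)), key=scores.__getitem__)

def num_pairwise_classifiers(K):
    """Number of classifiers needed when every pair of classes gets its own."""
    return K * (K - 1) // 2

# Hypothetical discriminants for K = 3 classes (not trained models).
discriminants = [
    lambda x: x[0],            # class 0 vs. rest
    lambda x: x[1],            # class 1 vs. rest
    lambda x: -x[0] - x[1],    # class 2 vs. rest
]

label_a = one_vs_all_predict(discriminants, [0.2, 1.5])
label_b = one_vs_all_predict(discriminants, [-1.0, -1.0])
print(label_a, label_b, num_pairwise_classifiers(3), num_pairwise_classifiers(10))
```

One-vs-all needs K classifiers trained on all the data, while the pairwise scheme needs K(K-1)/2 classifiers that each see only two classes.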

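13.10 Kernel Machines for Regression
Consider a linear model $f(x) = w^T x + b$
Define constraints ($\epsilon$: half-width of the tube around $f$, $y_i$: target value of $x_i$):
$y_i - (w^T x_i + b) \le \epsilon + \xi_i^+$
$(w^T x_i + b) - y_i \le \epsilon + \xi_i^-$
$\xi_i^+ \ge 0$, $\xi_i^- \ge 0$: slack variables
Problem: minimize $\frac{1}{2} \|w\|^2 + C \sum_i (\xi_i^+ + \xi_i^-)$ subject to the constraints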
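The regression slacks can be computed for a fixed linear model. This is a minimal sketch, assuming an epsilon-insensitive tube of half-width eps around f(x) = w.x + b, slacks measuring how far a target lies above or below the tube, and the cost 0.5 ||w||^2 + C * sum of slacks; model, data, eps, and C are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def tube_slacks(f_x, target, eps):
    """Return (xi_plus, xi_minus): distance above / below the eps-tube."""
    xi_plus = max(0.0, target - f_x - eps)
    xi_minus = max(0.0, f_x - target - eps)
    return xi_plus, xi_minus

def regression_cost(w, b, X, targets, C, eps):
    """0.5 ||w||^2 plus C times the total slack."""
    total = 0.0
    for x, t in zip(X, targets):
        xi_plus, xi_minus = tube_slacks(dot(w, x) + b, t, eps)
        total += xi_plus + xi_minus
    return 0.5 * dot(w, w) + C * total

# Hypothetical 1D model f(x) = 2x + 1 and three targets.
w, b, C, eps = [2.0], 1.0, 1.0, 0.5
X = [[0.0], [1.0], [2.0]]
targets = [1.1, 4.0, 4.0]

slacks = [tube_slacks(dot(w, x) + b, t, eps) for x, t in zip(X, targets)]
cost = regression_cost(w, b, X, targets, C, eps)
print(slacks, cost)
```

The first target lies inside the tube and costs nothing; the other two lie 0.5 above and 0.5 below it, so only they contribute to the soft error.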

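The Lagrangian:
$L_p = \frac{1}{2} \|w\|^2 + C \sum_i (\xi_i^+ + \xi_i^-) - \sum_i \alpha_i^+ [ \epsilon + \xi_i^+ - y_i + (w^T x_i + b) ] - \sum_i \alpha_i^- [ \epsilon + \xi_i^- + y_i - (w^T x_i + b) ] - \sum_i (\mu_i^+ \xi_i^+ + \mu_i^- \xi_i^-)$
Through KKT conditions:
$\partial L_p / \partial w = 0 \Rightarrow w = \sum_i (\alpha_i^+ - \alpha_i^-) x_i$
$\partial L_p / \partial b = 0 \Rightarrow \sum_i (\alpha_i^+ - \alpha_i^-) = 0$
$\partial L_p / \partial \xi_i^+ = 0 \Rightarrow C - \alpha_i^+ - \mu_i^+ = 0$
$\partial L_p / \partial \xi_i^- = 0 \Rightarrow C - \alpha_i^- - \mu_i^- = 0$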

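The dual: maximize
$L_d = -\frac{1}{2} \sum_i \sum_j (\alpha_i^+ - \alpha_i^-)(\alpha_j^+ - \alpha_j^-) x_i^T x_j - \epsilon \sum_i (\alpha_i^+ + \alpha_i^-) + \sum_i y_i (\alpha_i^+ - \alpha_i^-)$
subject to $0 \le \alpha_i^+ \le C$, $0 \le \alpha_i^- \le C$, $\sum_i (\alpha_i^+ - \alpha_i^-) = 0$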

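(a) The examples that fall in the tube have $\alpha_i^+ = \alpha_i^- = 0$
(b) The support vectors satisfy $\alpha_i^+ > 0$ or $\alpha_i^- > 0$: those on the tube boundary have $0 < \alpha_i^{\pm} < C$, those outside the tube have $\alpha_i^{\pm} = C$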

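The fitted line: $f(x) = w^T x + b = \sum_i (\alpha_i^+ - \alpha_i^-) x_i^T x + b$
With a kernel function: $f(x) = \sum_i (\alpha_i^+ - \alpha_i^-) K(x_i, x) + b$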
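The fitted function can be evaluated from the dual coefficients alone. This is a sketch, assuming the prediction takes the form f(x) = sum_i (alpha_plus_i - alpha_minus_i) K(x_i, x) + b; the coefficients and bias below are hypothetical values, not the result of training, chosen so that the differences sum to zero.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def linear_kernel(x, y):
    return dot(x, y)

def kernel_regression_predict(x, X, alpha_plus, alpha_minus, b, kernel):
    """f(x) = sum_i (alpha_plus_i - alpha_minus_i) K(x_i, x) + b."""
    return sum(
        (ap - am) * kernel(xi, x)
        for ap, am, xi in zip(alpha_plus, alpha_minus, X)
    ) + b

# Hypothetical training inputs and dual coefficients.
X = [[0.0], [1.0], [2.0]]
alpha_plus = [0.0, 0.5, 0.0]
alpha_minus = [0.0, 0.0, 0.5]
b = 1.0

x_new = [4.0]
f_kernel = kernel_regression_predict(x_new, X, alpha_plus, alpha_minus, b, linear_kernel)

# With the linear kernel this matches the explicit weight vector
# w = sum_i (alpha_plus_i - alpha_minus_i) x_i.
w = [sum((ap - am) * xi[0] for ap, am, xi in zip(alpha_plus, alpha_minus, X))]
f_linear = dot(w, x_new) + b
print(f_kernel, f_linear)
```

Swapping `linear_kernel` for a nonlinear kernel changes the fitted line into a nonlinear curve without touching the prediction code; only the examples with a nonzero coefficient difference contribute to the sum.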
