Support Vector Machine - SVM
Outline
• Background: Classification Problem
• SVM
  • Linear Separable SVM
  • Lagrange Multiplier Method
  • Karush-Kuhn-Tucker (KKT) Conditions
• Non-linear SVM: Kernel
  • Non-Linear Separable SVM
  • Lagrange Multiplier Method
  • Karush-Kuhn-Tucker (KKT) Conditions
Background – Classification Problem
• The goal of classification is to organize and categorize data into distinct classes
• A model is first created based on previously seen data (training samples)
• This model is then used to classify new data (unseen samples)
• A sample is characterized by a set of features
• Classification is essentially finding the best boundary between classes
Classification Formulation
• Given
  • an input space X
  • a set of classes C = {c1, c2, …, cm}
• the Classification Problem is
  • to define a mapping f: X → C where each x in X is assigned to one class
• This mapping function is called a Decision Function
Decision Function
• The basic problem in classification is to find c decision functions d1(x), d2(x), …, dc(x) with the property that, if a pattern x belongs to class i, then di(x) yields the largest value (see the sketch below)
• di(x) is some similarity measure between x and class i, such as a distance or a probability
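The following is a minimal Python sketch (not from the original slides) of this idea: each class supplies a decision function, and a pattern is assigned to the class whose function scores highest. The toy decision functions (negative distance to a hypothetical class center) are illustrative assumptions.

```python
import numpy as np

def classify(x, decision_functions):
    """Assign x to the class whose decision function d_i(x) is largest."""
    scores = [d(x) for d in decision_functions]
    return int(np.argmax(scores))

# Toy usage with two hypothetical classes, scored by negative distance to a center.
centers = [np.array([0.0, 0.0]), np.array([5.0, 5.0])]
ds = [lambda x, c=c: -np.linalg.norm(x - c) for c in centers]
print(classify(np.array([4.0, 6.0]), ds))   # -> 1 (closer to the second center)
```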
Decision Function
• Example: [Figure] the boundaries d1 = d2, d2 = d3, and d1 = d3 partition the feature space into three regions: Class 1 where d2, d3 < d1; Class 2 where d1, d3 < d2; Class 3 where d1, d2 < d3
Single Classifier
• Most popular single classifiers:
  • Minimum Distance Classifier
  • Bayes Classifier
  • K-Nearest Neighbor
  • Decision Tree
  • Neural Network
  • Support Vector Machine
Minimum Distance Classifier
• Simplest approach to the selection of decision boundaries
• Each class ωj is represented by a prototype (or mean) vector: mj = (1/Nj) Σx∈ωj x, where Nj = the number of pattern vectors from class ωj
• A new unlabelled sample is assigned to the class whose prototype is closest to the sample (see the sketch below)
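A minimal sketch of the minimum distance classifier, assuming Euclidean distance and the class mean as the prototype (the function names and toy data are illustrative, not from the slides):

```python
import numpy as np

def fit_prototypes(X, y):
    """Prototype (mean) vector of each class: m_j = (1/N_j) * sum of its samples."""
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def predict_min_distance(x, prototypes):
    """Assign x to the class whose prototype is closest (Euclidean distance)."""
    return min(prototypes, key=lambda c: np.linalg.norm(x - prototypes[c]))

# Toy data: two clusters around (0, 0) and (5, 5).
X = np.array([[0., 0.], [1., 0.], [0., 1.], [5., 5.], [6., 5.], [5., 6.]])
y = np.array([0, 0, 0, 1, 1, 1])
protos = fit_prototypes(X, y)
print(predict_min_distance(np.array([4.5, 5.5]), protos))   # -> 1
```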
Bayes Classifier
• Bayes rule: P(ωi | x) = p(x | ωi) P(ωi) / p(x)
• p(x) is the same for each class, therefore
• Assign x to class j if p(x | ωj) P(ωj) ≥ p(x | ωi) P(ωi) for all i
Bayes Classifier
• The following information must be known:
  • The probability density functions of the patterns in each class
  • The probability of occurrence of each class
• Training samples may be used to estimate these probability functions
• Samples are assumed to follow a known distribution (see the sketch below)
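As an illustration only (the slides do not fix a particular distribution), here is a sketch that assumes Gaussian class-conditional densities estimated from training samples; SciPy's multivariate_normal supplies the density.

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_gaussian_bayes(X, y):
    """Estimate p(x|w_i) as a Gaussian and P(w_i) as the class frequency."""
    model = {}
    for c in np.unique(y):
        Xc = X[y == c]
        density = multivariate_normal(Xc.mean(axis=0), np.cov(Xc, rowvar=False))
        model[c] = (density, len(Xc) / len(X))
    return model

def predict_bayes(x, model):
    """Assign x to the class maximizing p(x|w_j) * P(w_j); p(x) cancels out."""
    return max(model, key=lambda c: model[c][0].pdf(x) * model[c][1])
```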
K-Nearest Neighbor
• K-Nearest Neighbor Rule (k-NNR)
• Examine the labels of the k nearest samples and classify by using a majority voting scheme
• [Figure] Example: the query point (7, 3) is classified by the vote of its 1, 3, 5, 7, or 9 nearest neighbours
• Intuition: similar samples group together ("birds of a feather flock together"); see the sketch below
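A minimal k-NN sketch with a majority vote; the training points below are hypothetical stand-ins for the figure, and only the query point (7, 3) comes from the slide.

```python
import numpy as np
from collections import Counter

def knn_predict(x, X_train, y_train, k=3):
    """Classify x by majority vote among its k nearest training samples."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]              # indices of the k closest samples
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Hypothetical training data; (7, 3) is the query point from the slide's figure.
X_train = np.array([[6., 3.], [7., 4.], [8., 2.], [2., 8.], [1., 7.]])
y_train = np.array([0, 0, 0, 1, 1])
print(knn_predict(np.array([7., 3.]), X_train, y_train, k=3))   # -> 0
```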
Decision Tree
• The decision boundaries are hyperplanes parallel to the feature axes
• A sequential classification procedure may be developed by considering successive partitions of the feature space R
Decision Trees
• Example: [Figure]
Neural Network
• A Neural Network generally maps a set of inputs to a set of outputs
• The number of inputs and outputs may vary
• The network itself is composed of an arbitrary number of nodes connected in an arbitrary topology
• It is a universal approximator
Neural Network
• A popular NN is the feed-forward neural network, e.g.
  • Multi-Layer Perceptron (MLP)
  • Radial Basis Function (RBF) network
• Learning algorithm: Back-Propagation
• Weights of nodes are adjusted based on how well the current weights match an objective (see the sketch below)
• [Figure] Feed-forward NN vs. RBF network
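To make "maps a set of inputs to a set of outputs" concrete, here is a tiny forward pass through a one-hidden-layer feed-forward network with sigmoid activations; the weights are random placeholders, and back-propagation (the weight update) is not shown.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlp_forward(x, W1, b1, W2, b2):
    """Feed-forward pass: input -> hidden layer -> output."""
    h = sigmoid(W1 @ x + b1)        # hidden activations
    return sigmoid(W2 @ h + b2)     # output in (0, 1)

# Hypothetical weights for a 2-input, 3-hidden-unit, 1-output network.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 2)), np.zeros(3)
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)
print(mlp_forward(np.array([0.5, -1.0]), W1, b1, W2, b2))
```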
What is SVM?
• SVM is a classification algorithm
• Based on Statistical Learning Theory
• Similar to NN
• Widely used in binary classification
• Users do not need to supply any classification rules
• When new data arrives, SVM can predict which set it should belong to
SVM for Classification
• Given a set of training vectors (x1, y1), …, (xN, yN), where each xi is a feature vector and yi ∈ {−1, +1} is its class label
Hyperplane (Optimal Hyperplane)
• SVM solves the following problem: find a hyperplane that separates two different sets
• A hyperplane is a plane in a high-dimensional space
• We want to find an equation that separates Class 1 and Class 2
• The larger the margin between this hyperplane and the two sets, the better
Hyperplane (cont.)
• [Figure] A separating hyperplane between the two classes, with the coordinate origin marked
Hyperplane (cont.)
• Given a set S = {(x1, y1), …, (xN, yN)}
• Find a hyperplane f(x) = w · x + b = 0, where w · x is the dot product (inner product) of w and x
Hyperplane (cont.)
• We can classify the data set according to the sign of the function f(x) (see the sketch below)
• A hyperplane that separates the two classes: Separating Hyperplane
• The separating hyperplane that maximizes the margin on both sides: Optimal Separating Hyperplane (OSH)
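A minimal sketch of classifying by the sign of f(x) = w · x + b; the hyperplane (w, b) below is a made-up example, not one computed by an SVM.

```python
import numpy as np

def svm_decision(x, w, b):
    """f(x) = w . x + b; its sign decides the class (+1 or -1)."""
    return np.dot(w, x) + b

def svm_classify(x, w, b):
    return 1 if svm_decision(x, w, b) >= 0 else -1

# Hypothetical separating hyperplane x1 + x2 - 5 = 0.
w, b = np.array([1.0, 1.0]), -5.0
print(svm_classify(np.array([4.0, 4.0]), w, b))   # -> +1
print(svm_classify(np.array([1.0, 2.0]), w, b))   # -> -1
```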
Support Hyperplane
• A hyperplane parallel to the optimal separating hyperplane and closest to each of the two classes
• The support hyperplanes are defined as H1: w · x + b = +1 and H2: w · x + b = −1
• The right-hand side is scaled to 1 by multiplying the equation by a constant (this narrows the solution space, since a hyperplane equation has infinitely many scaled forms)
• w is normal to the hyperplane; |b| / ||w|| is the perpendicular distance from the origin to the hyperplane
Support Hyperplane (cont.)
• D: distance between the separating hyperplane and each of the two support hyperplanes
• Margin = distance between H1 and H2 = 2D = 2/||w||
• Hence D = 1/||w||: the smaller ||w|| is, the larger D becomes (a short derivation follows below)
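The margin formula follows from the point-to-hyperplane distance; a short standard derivation (not spelled out on the original slide) is:

```latex
D(x_0) = \frac{|w \cdot x_0 + b|}{\|w\|}
\quad\Longrightarrow\quad
\text{on } H_1, H_2:\ |w \cdot x_0 + b| = 1
\ \Rightarrow\ D = \frac{1}{\|w\|},
\qquad
\text{margin} = 2D = \frac{2}{\|w\|}.
```

Hence maximizing the margin is equivalent to minimizing ||w|| (or, more conveniently, (1/2)||w||²), which is exactly the objective on the following slides.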
Support Hyperplane (cont.)
• After scaling, the constraint function can be defined as yi (w · xi + b) ≥ 1 for all i
SVM Problem
• Goal: find a separating hyperplane with the largest margin
• An SVM is to find w and b that minimize (1/2)||w||² subject to yi (w · xi + b) ≥ 1 for all i
SVM Problem (cont.)
• Switch the above problem to a Lagrangian formulation for two reasons:
  • Easier to handle by transforming it into a quadratic form
  • Training data only appear in the form of dot products (inner products) between vectors => can be generalized to the nonlinear case
Lagrange Multiplier Method
• A method to find the extremum of a multivariate function f(x1, x2, …, xn) subject to the constraint g(x1, x2, …, xn) = 0
• For an extremum of f to exist on g, the gradient of f must line up with the gradient of g: ∂f/∂xk = λ ∂g/∂xk for all k = 1, …, n, where the constant λ is called the Lagrange multiplier
• The Lagrangian transformation of the SVM problem is
  LP = (1/2)||w||² − Σi αi [ yi (w · xi + b) − 1 ], with αi ≥ 0
Lagrange Multiplier Method
• To find the minimum of LP, we need the gradient of LP with respect to w and b to vanish (partial derivatives):
  (1) ∂LP/∂w = 0 => w = Σi αi yi xi
  (2) ∂LP/∂b = 0 => Σi αi yi = 0
• Substituting them into the Lagrangian form, we obtain a dual problem:
  maximize LD = Σi αi − (1/2) Σi Σj αi αj yi yj (xi · xj) subject to αi ≥ 0 and Σi αi yi = 0
• The data appear only in inner-product form => can be generalized to the nonlinear case by applying a kernel (see the sketch below)
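A sketch of solving this dual numerically, assuming SciPy's general-purpose minimize (SLSQP) rather than the dedicated QP/SMO solvers real SVM packages use; the toy data are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def train_linear_svm_dual(X, y):
    """Maximize L_D = sum(a) - 1/2 sum_ij a_i a_j y_i y_j (x_i . x_j)
    subject to a_i >= 0 and sum_i a_i y_i = 0 (hard-margin, linear kernel)."""
    n = len(y)
    Z = y[:, None] * X
    Q = Z @ Z.T                                   # Q_ij = y_i y_j (x_i . x_j)

    def neg_dual(a):
        return 0.5 * a @ Q @ a - a.sum()

    cons = {'type': 'eq', 'fun': lambda a: a @ y}
    alpha = minimize(neg_dual, np.zeros(n), bounds=[(0, None)] * n,
                     constraints=cons).x

    w = ((alpha * y)[:, None] * X).sum(axis=0)    # eq. (1): w = sum_i a_i y_i x_i
    sv = int(np.argmax(alpha))                    # any index with alpha_i > 0
    b = y[sv] - w @ X[sv]                         # from the complementary condition
    return w, b, alpha

# Toy linearly separable data.
X = np.array([[2., 2.], [3., 3.], [-2., -2.], [-3., -1.]])
y = np.array([1., 1., -1., -1.])
w, b, alpha = train_linear_svm_dual(X, y)
print(np.sign(X @ w + b))                         # expected: [ 1.  1. -1. -1.]
```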
KKT Conditions
• Since the SVM problem is convex, the KKT conditions are necessary and sufficient for w, b and α to be a solution
• w is determined by the training procedure: w = Σi αi yi xi
• b is easily found from the KKT complementary slackness condition αi [ yi (w · xi + b) − 1 ] = 0: choose any i for which αi ≠ 0 and solve b = yi − w · xi
Lagrange Multiplier Method
• Karush-Kuhn-Tucker (KKT) conditions for the linear SVM:
  • ∂LP/∂w = w − Σi αi yi xi = 0
  • ∂LP/∂b = −Σi αi yi = 0
  • yi (w · xi + b) − 1 ≥ 0
  • αi ≥ 0
  • αi [ yi (w · xi + b) − 1 ] = 0
Support Vector
• A training sample xi is a support vector if it
  • satisfies the KKT conditions with αi > 0
  • lies on a support hyperplane, i.e. yi (w · xi + b) = 1
Non-Linear Separable SVM: Kernel
• To extend to the non-linear case, we need to map the data to some other (usually higher-dimensional) Euclidean space through a mapping Φ
Kernel (cont.)
• Project the non-linearly separable data into a higher-dimensional space, i.e. a feature space
Kernel Function
• Since the training algorithm depends on the data only through dot products, we can use a “kernel function” K such that K(xi, xj) = Φ(xi) · Φ(xj)
• One commonly used example is the radial basis function (RBF)
• An RBF is a real-valued function whose value depends only on the distance from the origin, so that Φ(x) = Φ(||x||); or alternatively on the distance from some other point c, called a center, so that Φ(x, c) = Φ(||x − c||)
• (Example kernels are sketched below)
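Minimal sketches of two kernel functions and of the Gram matrix that the training algorithm actually needs; gamma is a hypothetical width parameter, not a value from the slides.

```python
import numpy as np

def linear_kernel(x1, x2):
    """K(x1, x2) = x1 . x2 (recovers the linear SVM)."""
    return np.dot(x1, x2)

def rbf_kernel(x1, x2, gamma=0.5):
    """Gaussian RBF kernel: K(x1, x2) = exp(-gamma * ||x1 - x2||^2)."""
    return np.exp(-gamma * np.linalg.norm(x1 - x2) ** 2)

def gram_matrix(X, kernel):
    """All pairwise kernel values K(x_i, x_j) -- the only data the dual problem needs."""
    n = len(X)
    return np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
```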
Non-Separable SVM
• Real-world applications usually have no OSH, so we need to add an error (slack) term ξi:
  yi (w · xi + b) ≥ 1 − ξi, with ξi ≥ 0
• To penalize the error terms, define the objective as (1/2)||w||² + C Σi ξi
• The new Lagrangian form is
  LP = (1/2)||w||² + C Σi ξi − Σi αi [ yi (w · xi + b) − 1 + ξi ] − Σi μi ξi (compare with the linear SVM)
Non-Separable SVM
• New KKT conditions: the dual problem is unchanged except that the multipliers are now bounded, 0 ≤ αi ≤ C, and the complementary conditions become αi [ yi (w · xi + b) − 1 + ξi ] = 0 and μi ξi = 0 (see the illustration below)
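As an illustration of the soft-margin, kernelized SVM in practice, here is a short example using scikit-learn's SVC (assuming scikit-learn is available); C weights the error/slack terms and kernel='rbf' applies the Gaussian kernel sketched above. The data are made up.

```python
import numpy as np
from sklearn.svm import SVC

# Toy two-class data.
X = np.array([[0., 0.], [1., 1.], [1., 0.], [4., 4.], [5., 5.], [4., 5.]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(C=1.0, kernel='rbf', gamma=0.5)       # C: penalty on the slack terms
clf.fit(X, y)
print(clf.predict([[0.5, 0.5], [4.5, 4.5]]))    # expected: [-1  1]
print(clf.support_)                             # indices of the support vectors
```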