Machine Learning. Chen Yu, Institute of Computer Science and Technology, Peking University, Information Security Engineering Research Center
Course Information • Instructor: Chen Yu, chen_yu@pku.edu.cn, Tel: 82529680 • Teaching assistant: Cheng Zaixing, Tel: 62763742, wataloo@hotmail.com • Course page: http://www.icst.pku.edu.cn/course/jiqixuexi/jqxx2011.mht
Ch9 Maximum Margin Classifier • Support vector machine • Overlapping class distribution • Multiclass SVM • SVM for regression • Computational learning theory
Preview • Recall what we learned in Ch8: • Linear parametric models (e.g. the linear regression model) can be recast into an equivalent ‘dual representation’ in which the whole model depends only on kernel functions evaluated at the training inputs. • Some benefits of using kernel functions: • They can represent infinite-dimensional feature spaces (e.g. the Gaussian kernel) • They can handle symbolic objects • They provide a way of extending many well-known algorithms via the kernel trick
Preview (2) • A major drawback of kernel methods: during both training and prediction we have to evaluate the kernel function at every training sample, which is computationally expensive. It is desirable to evaluate the kernel function only at a subset of the training data → sparse kernel machines • One kind of sparse kernel machine starts from the idea of the maximum margin → support vectors and the support vector machine (SVM) • The SVM is a decision machine rather than a probabilistic model: it outputs class decisions, not posterior probabilities
Preview (3) • Another kind of sparse kernel machine is based on a Bayesian viewpoint, e.g. the relevance vector machine (RVM), which is typically even sparser than the SVM.
Ch9 Maximum Margin Classifier • Support vector machine • Overlapping class distribution • Multiclass SVM • SVM for regression • Computational learning theory
Introduction: Margin and Support Vector • Consider a two-class classification problem using a linear model of the form y(x) = wTφ(x) + b (7.1) • Note: in the above formula we make the bias parameter b explicit. • Assume a training set consisting of N samples {<x1,t1>, …, <xN,tN>}, where tn∈{1,-1}; a new instance x is classified according to the sign of y(x). • Furthermore, we assume that the training set is linearly separable in the feature space, i.e. there exists a function of the form (7.1) s.t. y(xn)>0 for all xn with tn=1, and y(xn)<0 otherwise.
Introduction (2) • Many such linear functions may exist; if so, we try to find the one that minimizes a suitable error function. The SVM approaches this problem through the concept of the margin, defined as the smallest distance from the training points to the hyperplane y(x)=0. • In the SVM the decision boundary is chosen to be the one for which the margin is maximized • One motivation for maximizing the margin comes from computational learning theory.
An Illustration • Left figure: the margin is defined as the perpendicular distance between the decision boundary and the closest of the data points • Right figure: maximizing the margin leads to a particular choice of decision boundary; it is determined by a subset of the data points known as support vectors (indicated by circles in the figure).
Finding the Maximum Margin Solution • Recall that the distance from a point x to the hyperplane y(x)=0, where y(x) takes the form (7.1), is |y(x)|/‖w‖. Since we are only interested in solutions for which all training points are correctly classified, i.e. tny(xn)>0 for n=1,…,N, we can rewrite the distance as tny(xn)/‖w‖ = tn(wTφ(xn)+b)/‖w‖ (7.2) • It follows that the maximum margin solution is found by solving arg maxw,b { (1/‖w‖) minn [tn(wTφ(xn)+b)] } (7.3)
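As an aside (not from the slides), a minimal NumPy sketch of how the quantity maximized in (7.3), the geometric margin minn tny(xn)/‖w‖, could be evaluated for a candidate linear classifier; the toy data, w, and b are made up for illustration:

```python
import numpy as np

# Toy linearly separable data (made up for illustration); identity feature map phi(x) = x.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.5], [-2.0, -1.0]])
t = np.array([1, 1, -1, -1])

# Some candidate parameters of y(x) = w^T x + b.
w = np.array([1.0, 1.0])
b = -0.5

def geometric_margin(X, t, w, b):
    """Smallest signed distance t_n * y(x_n) / ||w|| over the training set (eq. 7.2)."""
    distances = t * (X @ w + b) / np.linalg.norm(w)
    return distances.min()

print(geometric_margin(X, t, w, b))  # maximum-margin training maximizes this over (w, b)
```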
Finding the Maximum Margin Solution (2) • Observe that if we rescale both w and b by a factor k, the distance tny(xn)/‖w‖ is unchanged; therefore we can rescale w and b s.t. tn(wTφ(xn)+b)=1 for the point(s) closest to the hyperplane. For such w and b it follows that all training examples <xn,tn> satisfy the constraint tny(xn)≥1. Such a representation of the hyperplane is called the canonical representation of the decision boundary. • Notice that for a finite training set, at least one point attains equality in the constraint.
Equivalent Quadratic Programming Problem • Notice that maximizing 1/‖w‖ is equivalent to minimizing ‖w‖². It follows that the optimization problem (7.3) is equivalent to the following quadratic programming problem: minimize (1/2)‖w‖² over w and b (7.6), subject to tny(xn)≥1, n=1,…,N (7.5). • The constrained quadratic programming problem is equivalent to minimizing the following Lagrangian function w.r.t. w and b, and maximizing it w.r.t. the multipliers a=(a1,…,aN)T with an≥0: L(w,b,a) = (1/2)‖w‖² − Σn an{tny(xn) − 1} (7.7)
Dual Representation • Setting the partial derivatives of L w.r.t. the components of w, and w.r.t. b, to zero, we obtain w = Σn antnφ(xn) (7.8) and Σn antn = 0 (7.9) • Plugging (7.8) into L and using (7.9), we can eliminate w and b and rewrite L in terms of a alone (which is then maximized): L̃(a) = Σn an − (1/2)ΣnΣm anamtntmk(xn,xm) (7.10), where k(xn,xm)=φ(xn)Tφ(xm), subject to an≥0, n=1,…,N (7.11), and Σn antn = 0 (7.12)
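For reference, the same derivation written out in LaTeX (a reconstruction in Bishop's PRML notation):

```latex
% Lagrangian (7.7), minimized over w, b and maximized over the a_n >= 0:
L(\mathbf{w}, b, \mathbf{a}) = \tfrac{1}{2}\|\mathbf{w}\|^{2}
  - \sum_{n=1}^{N} a_n \bigl\{ t_n (\mathbf{w}^{\top}\phi(\mathbf{x}_n) + b) - 1 \bigr\}

% Stationarity conditions (7.8), (7.9):
\mathbf{w} = \sum_{n=1}^{N} a_n t_n \phi(\mathbf{x}_n), \qquad
\sum_{n=1}^{N} a_n t_n = 0

% Substituting back gives the dual (7.10)-(7.12), with k(x, x') = \phi(x)^{\top}\phi(x'):
\tilde{L}(\mathbf{a}) = \sum_{n=1}^{N} a_n
  - \tfrac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N} a_n a_m t_n t_m k(\mathbf{x}_n, \mathbf{x}_m),
\qquad a_n \ge 0, \qquad \sum_{n=1}^{N} a_n t_n = 0
```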
Dual Representation (2) • (7.10) together with the constraints (7.11) & (7.12) is called the dual representation of the original maximum margin problem (7.6) subject to the constraint (7.5) • The solution of a quadratic programming problem in M variables in general has computational complexity O(M³). • When the kernel function k is positive definite, the dual objective (as a function of a) is bounded below, giving rise to a well-defined optimization problem.
Prediction • To classify a new input x using the trained model, evaluate the sign of y(x) = Σn antnk(x,xn) + b (7.13), obtained by substituting (7.8) into (7.1).
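A minimal NumPy sketch of the dual prediction rule (7.13); the variable names (training inputs X, targets t, multipliers a, bias b) and the Gaussian kernel are illustrative assumptions:

```python
import numpy as np

def rbf_kernel(x, z, gamma=0.5):
    """Gaussian kernel k(x, z) = exp(-gamma * ||x - z||^2)."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

def predict(x, X, t, a, b, kernel=rbf_kernel):
    """Dual-form prediction y(x) = sum_n a_n t_n k(x, x_n) + b (eq. 7.13).

    Only support vectors (a_n > 0) contribute, so the sum is restricted to them.
    """
    sv = np.flatnonzero(a > 1e-8)       # indices of support vectors
    y = sum(a[n] * t[n] * kernel(x, X[n]) for n in sv) + b
    return np.sign(y)
```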
Support Vectors • It can be shown (Appendix E of the book) that this constrained optimization satisfies the Karush-Kuhn-Tucker (KKT) conditions: an≥0, tny(xn)−1≥0, and an{tny(xn)−1}=0, for n=1,…,N (7.14)-(7.16) • It follows that for every data point, either an=0 or tny(xn)=1. Any point with an=0 plays no role in the prediction of a new instance; the remaining points are called support vectors, and they lie on the maximum margin hyperplanes in the feature space (Fig 7.1)
Determine the Parameter b • Notice that any support vector xn satisfies tny(xn)=1. Let S denote the set of indices of the support vectors; for any n∈S, plugging (7.13) into this condition, we obtain tn(Σm∈S amtmk(xn,xm) + b) = 1 (7.17) • For the sake of numerical stability, we solve for b by multiplying both sides of (7.17) by tn, making use of tn²=1, and then averaging over all support vectors: b = (1/NS) Σn∈S (tn − Σm∈S amtmk(xn,xm)) (7.18), where NS is the number of support vectors
Error Function • We can express the maximum margin classifier as minimizing an error function with a quadratic regularizer: Σn E∞(tny(xn) − 1) + λ‖w‖² (7.19), where E∞(z)=0 if z≥0, and ∞ otherwise, so that the constraints tny(xn)≥1 are enforced exactly.
An Illustrative Example • Example of synthetic data from two classes in a 2-dim input space, showing contours of constant y(x) obtained from an SVM with a Gaussian kernel. Also shown are the decision boundary, the margin boundaries, and the support vectors.
Summary of this Section • Consider a two-class classification problem and assume that the training points are linearly separable in the feature space • Maximize the margin → minimize the length of the parameter vector subject to the constraints → dual representation (containing kernel functions) → the optimal solution satisfies the KKT conditions → the kernel function only needs to be evaluated at the support vectors, giving a sparse solution
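To make this concrete, a small scikit-learn sketch (not part of the slides): fitting a linear-kernel SVM on made-up separable data; a very large C approximates the hard-margin classifier described above, and the support vectors found by the solver can be inspected directly.

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data, made up for illustration.
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.5],
              [-1.0, -1.5], [-2.0, -1.0], [-1.5, -2.5]])
t = np.array([1, 1, 1, -1, -1, -1])

# A very large C approximates the hard-margin SVM (almost no slack allowed).
clf = SVC(kernel='linear', C=1e6)
clf.fit(X, t)

print(clf.support_vectors_)            # the support vectors found by the solver
print(clf.dual_coef_)                  # a_n * t_n for each support vector
print(clf.predict([[1.0, 1.0]]))       # classify a new point by the sign of y(x)
```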
Ch9 Maximum Margin Classifier • Support vector machine • Overlapping class distribution • Multiclass SVM • SVM for regression • Computational learning theory
Overlapping Class Distribution • In the previous section we assumed that the training data points are linearly separable in the feature space • If in reality this assumption does not hold, exact separation can lead to poor generalization • Recall that the maximum margin classifier is equivalent to minimizing the error function (7.19) above • We might lessen the penalty imposed by E∞ on misclassification so that data points are allowed to be on the wrong side of the margin boundary, but with a penalty that increases with the distance from that boundary.
Slack Variables • For each training data point <xn,tn> introduce a slack variable ξn defined as follows: ξn=0 if xn is on or inside the correct margin boundary, and ξn=|tn−y(xn)| otherwise. • Thus ξn=0 for points correctly classified with margin, 0<ξn≤1 for points inside the margin but on the correct side of the decision boundary, and ξn>1 for misclassified points; the figure in the slides illustrates these cases.
Error Function • The error function corresponding to relaxing the hard constraint is C Σn ξn + (1/2)‖w‖² (7.21), where the parameter C is a positive constant controlling the trade-off between the slack-variable penalty and the margin, i.e. between classification error and model complexity • Furthermore, the constraint (7.5): tny(xn)≥1 becomes tny(xn)≥1−ξn (7.20), with ξn≥0, for n=1,…,N.
Lagrangian of the Optimization • The corresponding Lagrangian for minimizing (7.21) subject to the constraints (7.20) and ξn≥0 is L = C Σn ξn + (1/2)‖w‖² − Σn an{tny(xn) − 1 + ξn} − Σn μnξn (7.22), where {an} and {μn} are Lagrange multipliers with an≥0 and μn≥0 • The corresponding set of KKT conditions is reconstructed below.
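A reconstruction of the KKT conditions referenced above, written in LaTeX and following Bishop's numbering (7.23)-(7.28):

```latex
% KKT conditions for the soft-margin primal, for n = 1, ..., N:
a_n \ge 0, \qquad
t_n y(\mathbf{x}_n) - 1 + \xi_n \ge 0, \qquad
a_n \bigl\{ t_n y(\mathbf{x}_n) - 1 + \xi_n \bigr\} = 0,

\mu_n \ge 0, \qquad
\xi_n \ge 0, \qquad
\mu_n \xi_n = 0
```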
Solving the Lagrangian • Taking partial derivatives of L w.r.t. the components of w, b, and ξn, and setting them equal to 0, we obtain w = Σn antnφ(xn) (7.29), Σn antn = 0 (7.30), and an = C − μn (7.31) • Plugging (7.29) into (7.22) and using (7.30) & (7.31) to simplify, we can eliminate not only w, but also b and ξn from L, and obtain L̃(a) = Σn an − (1/2)ΣnΣm anamtntmk(xn,xm) (7.32)
Solving the Lagrangian (2) • In summary, we want to maximize (7.32) w.r.t. a, subject to the following constraints: 0 ≤ an ≤ C (7.33) and Σn antn = 0 (7.34), for n=1,…,N • The constraints (7.33) are known as box constraints • Notice that (7.32) is the same as (7.10); however, (7.33) is more demanding than (7.11) • For prediction, (7.13) still holds, since it is again derived from (7.29)
Interpreting the Solution • As before, data points with an=0 do not contribute to the prediction; the remaining ones are the support vectors, which satisfy tny(xn) = 1 − ξn (7.35) • We now use the additional constraints obtained by relaxing the hard margin: • If an<C, then by (7.31) μn>0, and consequently ξn=0 and tny(xn)=1, i.e. the corresponding point xn lies on the margin boundary • If an=C, then tny(xn)≤1: the corresponding point can lie anywhere beyond that, and <xn,tn> is classified correctly if ξn≤1 and incorrectly if ξn>1
Determine the Parameter b • To compute the parameter b, we only consider those support vectors with 0<an<C. For these points tny(xn)=1, and we can apply the same trick as in the previous section to compute b (both cases share the same prediction formula (7.13)). We obtain b = (1/NM) Σn∈M (tn − Σm∈S amtmk(xn,xm)) (7.37), where M denotes the set of indices of points with 0<an<C and S the set of all support vectors.
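A short NumPy sketch of the bias computation just described (an illustration only), assuming the dual solution a and the Gram matrix K are already available, e.g. from the QP sketch given later in this section:

```python
import numpy as np

def compute_bias(a, t, K, C, tol=1e-8):
    """Average b over margin support vectors (0 < a_n < C), as in eq. (7.37).

    a : Lagrange multipliers, t : targets in {-1, +1}, K : precomputed Gram matrix.
    """
    sv = a > tol                          # all support vectors
    margin_sv = sv & (a < C - tol)        # support vectors exactly on the margin
    # y(x_n) without bias = sum_m a_m t_m k(x_n, x_m), restricted to support vectors
    y_wo_b = K[np.ix_(margin_sv, sv)] @ (a[sv] * t[sv])
    return np.mean(t[margin_sv] - y_wo_b)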
ν-SVM • An equivalent formulation of the soft-margin SVM involves maximizing the dual objective L̃(a) = −(1/2)ΣnΣm anamtntmk(xn,xm) (7.38), subject to the constraints 0≤an≤1/N (7.39), Σn antn = 0 (7.40), and Σn an ≥ ν (7.41) • Here the parameter ν (replacing C) can be interpreted as both an upper bound on the fraction of margin errors and a lower bound on the fraction of support vectors
An Illustration of the ν-SVM • The following figure illustrates an example of applying the ν-SVM to synthetic data. The ν-SVM uses a Gaussian kernel of the form exp(−γ‖x−x'‖²) with γ=0.45, and in the figure the support vectors are indicated by circles.
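A scikit-learn sketch of roughly this setting (an illustration: the synthetic data and ν=0.5 are assumptions; only the kernel width γ=0.45 is taken from the slide):

```python
import numpy as np
from sklearn.svm import NuSVC

rng = np.random.default_rng(0)

# Synthetic two-class data with overlap (made up for illustration).
X = np.vstack([rng.normal(loc=[-1, -1], scale=1.0, size=(50, 2)),
               rng.normal(loc=[+1, +1], scale=1.0, size=(50, 2))])
t = np.hstack([-np.ones(50), np.ones(50)])

# nu upper-bounds the fraction of margin errors and lower-bounds the fraction of SVs.
clf = NuSVC(nu=0.5, kernel='rbf', gamma=0.45)
clf.fit(X, t)

print(len(clf.support_))               # number of support vectors
print(clf.score(X, t))                 # training accuracy
```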
Algorithms for the Quadratic Programming Problem • Notice that the dual Lagrangian L̃ is a quadratic function of the an, and the constraints define a convex region, so any local optimum is also a global optimum. • Practical approaches to the constrained quadratic programming problem: • Chunking: since the value of L̃ is unchanged if we remove the rows and columns of the kernel matrix corresponding to zero Lagrange multipliers, the original problem can be broken into a series of smaller ones. This idea can be implemented using protected conjugate gradients.
Algorithms (2) • Decomposition methods • These also break the original problem into a series of smaller ones; however, all of the sub-problems are of fixed size, so the method can be applied to training sets of any size. • Sequential minimal optimization (SMO) (a popular method) • It takes the idea of chunking to the extreme and considers just two Lagrange multipliers at a time, so that each sub-problem can be solved analytically. At each step the pair of multipliers is chosen by heuristics (the original heuristics are based on the KKT conditions) • See http://kernel-machines.org for a collection of software for SVMs and Gaussian processes.
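As a generic baseline (not one of the specialized algorithms above), the soft-margin dual (7.32)-(7.34) can also be handed to an off-the-shelf QP solver. The sketch below assumes the cvxopt package and is only practical for small N, since it forms the full Gram matrix:

```python
import numpy as np
from cvxopt import matrix, solvers

def svm_dual_qp(X, t, C=1.0, kernel=np.dot):
    """Solve the soft-margin SVM dual (7.32)-(7.34) with a generic QP solver.

    cvxopt minimizes (1/2) a^T P a + q^T a  s.t.  G a <= h,  A a = b,
    so we negate the dual objective and encode the box constraints in G, h.
    """
    N = len(t)
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])      # Gram matrix
    P = matrix((np.outer(t, t) * K).astype(float))                # t_n t_m k(x_n, x_m)
    q = matrix(-np.ones(N))                                       # from -sum_n a_n
    G = matrix(np.vstack([-np.eye(N), np.eye(N)]))                # 0 <= a_n <= C
    h = matrix(np.hstack([np.zeros(N), C * np.ones(N)]))
    A = matrix(np.asarray(t, dtype=float).reshape(1, -1))         # sum_n a_n t_n = 0
    b = matrix(0.0)
    sol = solvers.qp(P, q, G, h, A, b)
    return np.array(sol['x']).ravel()                             # multipliers a_n
```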
Some Remarks on SVMs • Dimensionality of the feature space in kernel methods • Consider a kernel function on a 2-dim space given by k(x,z)=(xTz)²; expanding gives k(x,z)=φ(x)Tφ(z) with the 3-dim feature map φ(x)=(x1², √2·x1x2, x2²)T • In this example the original 2-dim space can be regarded as a 2-dim manifold embedded in a 3-dim feature space. • The SVM can be modified so that it provides a probabilistic output for the prediction of a new instance, e.g. by fitting a logistic sigmoid to the outputs of a trained SVM (Platt scaling).
Ch9 Maximum Margin Classifier • Support vector machine • Overlapping class distribution • Multiclass SVM • SVM for regression • Computational learning theory
Multiple 2-class SVMs for Multiclass Classification • Consider K-class classification • One-versus-the-rest: the k-th model yk(x) is trained using the data from the k-th class as positive examples and the remaining data as negative examples • Some challenges: • An input might be assigned to multiple classes (or to none) • The resulting training sets are imbalanced. One solution is to modify the target values • We might also define a single objective function for training all K SVMs simultaneously, based on maximizing the margin from each class to the remaining classes • One-versus-one: train an SVM on every possible pair of classes and classify by majority vote (see the sketch below)
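A brief scikit-learn illustration of the two decompositions (an example, not from the slides): SVC uses one-versus-one internally, while OneVsRestClassifier wraps it in a one-versus-the-rest scheme.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)        # a standard 3-class data set

# One-versus-one: SVC trains K(K-1)/2 pairwise classifiers and votes.
ovo = SVC(kernel='rbf', C=1.0).fit(X, y)

# One-versus-the-rest: K classifiers, each separating one class from all others.
ovr = OneVsRestClassifier(SVC(kernel='rbf', C=1.0)).fit(X, y)

print(ovo.score(X, y), ovr.score(X, y))  # training accuracies of the two schemes
```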
Single-Class SVM • Consider an unsupervised learning problem related to probability density estimation. Instead of modeling the full density, in certain situations it suffices to find a smooth boundary enclosing a region of high density, i.e. a data point drawn from the density falls inside the boundary with a predetermined probability (desirably close to 1). • Sample application scenario: abnormality detection, where the training data consist mostly of normal samples plus a few abnormal ones. • Approach 1: find a hyperplane that separates all but a fixed fraction of the training data from the origin while maximizing the distance of the hyperplane from the origin (Schölkopf et al. 2001) • Approach 2: find the smallest sphere in feature space that contains all but a fraction of the data points (Tax & Duin, 1998)
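A scikit-learn sketch of the first approach (the ν-parameterized one-class SVM of Schölkopf et al.); the data and parameter values are made up for illustration:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 2))              # mostly "normal" training samples

# nu bounds the fraction of training points allowed outside the learned region.
oc = OneClassSVM(kernel='rbf', gamma=0.5, nu=0.05)
oc.fit(X_train)

X_test = np.array([[0.1, -0.2], [4.0, 4.0]])     # one typical point, one outlier
print(oc.predict(X_test))                        # +1 = inside the boundary, -1 = outlier
```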
Ch9 Maximum Margin Classifier • Support vector machine • Overlapping class distribution • Multiclass SVM • SVM for regression • Computational learning theory
Recall: Regularized Least Squares • With the sum-of-squares error function and a quadratic regularizer, the error function is (1/2)Σn(y(xn)−tn)² + (λ/2)‖w‖², i.e. a data term plus a regularizer
Error Function for Support Vector Regression • To obtain sparse solutions, the quadratic error function above is replaced by an ε-insensitive error function, which is zero whenever |y(x)−t|<ε and grows linearly beyond that • In the figure: red, the ε-insensitive error function; green, the quadratic one
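A tiny NumPy sketch of the ε-insensitive error just described, with the quadratic error included for comparison (names and the default ε are illustrative):

```python
import numpy as np

def eps_insensitive(y, t, eps=0.1):
    """E_eps(y - t) = 0 if |y - t| < eps, else |y - t| - eps (linear outside the tube)."""
    return np.maximum(np.abs(y - t) - eps, 0.0)

def quadratic(y, t):
    """Ordinary squared error (1/2)(y - t)^2, for comparison."""
    return 0.5 * (y - t) ** 2
```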
Error Function (2) • Therefore the error function for support vector regression is given by C Σn Eε(y(xn)−tn) + (1/2)‖w‖² • Furthermore, we can re-express the optimization problem by introducing slack variables. • Since the ε-insensitive error function has two branches, we need to introduce two slack variables for each xn: • ηn≥0, where ηn>0 corresponds to the case tn>y(xn)+ε • ῆn≥0, where ῆn>0 corresponds to the case tn<y(xn)−ε
Error Function (3) • Introducing the slack variables allows points to lie outside the ε-tube, subject to tn ≤ y(xn)+ε+ηn (7.53) and tn ≥ y(xn)−ε−ῆn (7.54), as illustrated in the figure
Error Function (4) • Therefore the error function can be written as C Σn (ηn + ῆn) + (1/2)‖w‖² (7.55), which must be minimized subject to the constraints ηn≥0 and ῆn≥0, as well as (7.53) & (7.54), for n=1,…,N.
Lagrangian of the Optimization • Introducing Lagrange multipliers an≥0, ân≥0, μn≥0, μ̂n≥0 for the four sets of constraints, the corresponding Lagrangian of the constrained minimization is given by (7.56), reconstructed below
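A reconstruction of the Lagrangian (7.56) in LaTeX, writing ξn, ξ̂n for the slack variables that the slides denote ηn, ῆn:

```latex
% Lagrangian (7.56) for epsilon-insensitive support vector regression:
L = C \sum_{n=1}^{N} (\xi_n + \hat{\xi}_n) + \tfrac{1}{2}\|\mathbf{w}\|^{2}
    - \sum_{n=1}^{N} (\mu_n \xi_n + \hat{\mu}_n \hat{\xi}_n)
    - \sum_{n=1}^{N} a_n \bigl( \epsilon + \xi_n + y(\mathbf{x}_n) - t_n \bigr)
    - \sum_{n=1}^{N} \hat{a}_n \bigl( \epsilon + \hat{\xi}_n - y(\mathbf{x}_n) + t_n \bigr)
```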
Solving the Lagrangian • Taking partial derivatives of L w.r.t. the components of w, b, ηn, and ῆn, and setting them equal to 0, we obtain w = Σn (an−ân)φ(xn) (7.57), Σn (an−ân) = 0 (7.58), an + μn = C (7.59), and ân + μ̂n = C (7.60)
Solving the Lagrangian (2) • Plugging (7.57) into (7.56) and using (7.58)-(7.60) to simplify the resulting expression, we obtain the dual L̃(a,â) = −(1/2)ΣnΣm (an−ân)(am−âm)k(xn,xm) − ε Σn (an+ân) + Σn (an−ân)tn (7.61), to be maximized subject to the box constraints 0≤an≤C (7.62) and 0≤ân≤C (7.63), together with (7.58)
Prediction • Substituting (7.57) into (7.1), the prediction for a new input x is given by y(x) = Σn (an−ân)k(x,xn) + b (7.64)
KKT Conditions • Similarly, the corresponding KKT conditions (omitting the inequality constraints to save space) are given by an(ε + ηn + y(xn) − tn) = 0 (7.65), ân(ε + ῆn − y(xn) + tn) = 0 (7.66), (C−an)ηn = 0 (7.67), and (C−ân)ῆn = 0 (7.68) • Notice that an and ân cannot both be nonzero for the same data point, and only points on or outside the ε-tube have nonzero multipliers; these are the support vectors • Similarly, we can also estimate b from the above KKT conditions, e.g. by averaging over points with 0<an<C, for which ηn=0 and hence ε + y(xn) − tn = 0
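To close the regression part, a scikit-learn sketch (an illustrative example only) showing that a fitted ε-SVR model depends on a sparse subset of the training points, its support vectors:

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 2 * np.pi, size=(80, 1)), axis=0)
t = np.sin(X).ravel() + 0.1 * rng.normal(size=80)      # noisy sinusoid (made up)

# epsilon sets the width of the insensitive tube; C trades data fit vs. flatness.
reg = SVR(kernel='rbf', C=10.0, epsilon=0.1, gamma=1.0)
reg.fit(X, t)

print(len(reg.support_), "of", len(X), "points are support vectors")
print(reg.predict([[np.pi / 2]]))                      # prediction uses only the SVs
```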