sibley

Presentation Transcript


  1. Support Vector Machine

  2. Classifiers • Difference between Classification and Clustering • What Is SVM?

  3. Classifiers • The goal of a classifier is to use an object's characteristics to identify which class (or group) it belongs to. • Labels are available for some points. • Supervised learning. [Figure: points plotted against Feature X and Feature Y; labels: Genes, Proteins]

  4. Difference between Classification and Clustering • In general, in classification you have a set of predefined classes and want to know which class a new object belongs to. • Clustering tries to group a set of objects without predefined classes. • In the context of machine learning, classification is supervised learning and clustering is unsupervised learning.

  5. What Is SVM? • Support Vector Machines are based on the concept of decision planes that define decision boundaries. • A decision plane is one that separates sets of objects having different class memberships.

  6. In this example, the objects belong either to class GREEN or to class RED. • The separating line defines a boundary: all objects to its right are GREEN and all objects to its left are RED. Any new object (white circle) falling to the right is labelled, i.e., classified, as GREEN (or as RED should it fall to the left of the separating line). • This is a classic example of a linear classifier.

  7. Most classification tasks are not as simple as the previous example. • More complex structures are needed in order to make an optimal separation. • Full separation of the GREEN and RED objects would require a curve (which is more complex than a line).

  8. The figure shows the original objects (left side of the schematic) mapped, i.e., rearranged, using a set of mathematical functions known as kernels. • The process of rearranging the objects is known as mapping (transformation). Note that in this new setting the mapped objects (right side of the schematic) are linearly separable, so instead of constructing the complex curve (left schematic), all we have to do is find an optimal line that separates the GREEN and the RED objects.

  9. Support Vector Machine (SVM) is primarily a classifier method that performs classification tasks by constructing hyperplanes in a multidimensional space that separate cases of different class labels. • SVM supports both regression and classification tasks and can handle multiple continuous and categorical variables. For categorical variables, a dummy variable is created for each level, with case values of either 0 or 1. Thus, a categorical dependent variable consisting of three levels, say (A, B, C), is represented by a set of three dummy variables: • A: {1 0 0}, B: {0 1 0}, C: {0 0 1}
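The dummy-variable coding of the three levels (A, B, C) can be sketched in a few lines of Python. This is only an illustration; the helper name `dummy_encode` is made up for this example and does not come from the slides.

```python
def dummy_encode(value, levels):
    """Return one 0/1 indicator per level, with a 1 only at `value`'s position."""
    if value not in levels:
        raise ValueError(f"unknown level: {value!r}")
    return [1 if level == value else 0 for level in levels]


levels = ["A", "B", "C"]  # the three levels used on the slide
encoded = {level: dummy_encode(level, levels) for level in levels}
for level, vector in encoded.items():
    print(level, vector)
```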

  10. Support Vector Machines

  11. Supervised Learning • Training set: a number of expression profiles with known labels which represent the true population. Difference from clustering: there you don't know the labels and have to find a structure on your own. • Learning/Training: find a decision rule which explains the training set well. This is the easy part, because we know the labels of the training set! • Generalisation ability: how well does the decision rule learned from the training set generalise to new specimens? • Goal: find a decision rule with high generalisation ability.

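  12. Linear Separators • Binary classification can be viewed as the task of separating classes in feature space: the separator is the hyperplane $w^T x + b = 0$, with $w^T x + b > 0$ on one side and $w^T x + b < 0$ on the other. The classifier is $f(x) = \operatorname{sign}(w^T x + b)$.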

  13. Linear separation of the training set • A separating hyperplane is defined by the normal vector $w$ and the offset $b$: $H = \{x \mid \langle w, x \rangle + b = 0\}$. • $\langle \cdot, \cdot \rangle$ is called the inner product, scalar product or dot product. • Training: choose $w$ and $b$ from the labelled examples in the training set.

  14. Predict the label of a new point • Prediction: on which side of the hyperplane does the new point lie? Points in the direction of the normal vector are classified as POSITIVE. Points in the opposite direction are classified as NEGATIVE.
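The prediction rule can be illustrated with a minimal sketch that evaluates the sign of $w^T x + b$. The values of `w`, `b` and the test points are hypothetical, chosen only so the arithmetic is easy to follow.

```python
def decision_value(w, b, x):
    """Compute w^T x + b for plain Python sequences."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def predict(w, b, x):
    """Return +1 on the side the normal vector points to, otherwise -1.

    A point exactly on the hyperplane (value 0) is assigned -1 here;
    that tie-break is an arbitrary choice made for this sketch.
    """
    return 1 if decision_value(w, b, x) > 0 else -1


w, b = [1.0, -1.0], 0.0                  # hypothetical hyperplane: x1 - x2 = 0
new_points = [(2.0, 1.0), (1.0, 3.0)]    # hypothetical new points
labels = [predict(w, b, x) for x in new_points]
print(labels)
```

The first point lies in the direction of the normal vector and is labelled POSITIVE (+1); the second lies on the opposite side and is labelled NEGATIVE (-1).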

  15. Which of the linear separators is optimal?

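  16. Classification Margin • The distance from an example $x_i$ to the separator is $r = \frac{y_i (w^T x_i + b)}{\lVert w \rVert}$. • The examples closest to the hyperplane are the support vectors. • The margin $\rho$ of the separator is the width of the gap between the support vectors of the two classes, measured perpendicular to the hyperplane.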
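As a rough illustration, the distance of each example to a separator, the support vectors and the margin can be computed directly. The sketch uses the standard point-to-hyperplane distance $y_i (w^T x_i + b) / \lVert w \rVert$; the separator and the four data points are hypothetical.

```python
import math


def signed_distance(w, b, x, y):
    """Distance y * (w^T x + b) / ||w||; positive when x is on the correct side."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm


w, b = [1.0, 0.0], 0.0  # hypothetical separator: the vertical line x1 = 0
data = [((1.0, 0.0), 1), ((3.0, 1.0), 1), ((-1.0, 0.0), -1), ((-2.0, 2.0), -1)]

distances = [signed_distance(w, b, x, y) for x, y in data]

# Closest example in each class; these are the support vectors.
r_pos = min(r for r, (_, y) in zip(distances, data) if y == 1)
r_neg = min(r for r, (_, y) in zip(distances, data) if y == -1)
support_vectors = [
    x for r, (x, y) in zip(distances, data) if r == (r_pos if y == 1 else r_neg)
]

margin = r_pos + r_neg  # width of the gap between the two classes
print(distances, support_vectors, margin)
```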

  17. Maximum Margin Classification • Maximizing the margin is good according to intuition and PAC theory. • Implies that only support vectors matter; other training examples are ignorable.

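  18. Linear SVM Mathematically • Let the training set $\{(x_i, y_i)\}_{i=1..n}$, $x_i \in \mathbb{R}^d$, $y_i \in \{-1, 1\}$, be separated by a hyperplane with margin $\rho$. Then for each training example $(x_i, y_i)$: $w^T x_i + b \le -\rho/2$ if $y_i = -1$, and $w^T x_i + b \ge \rho/2$ if $y_i = 1$; equivalently, $y_i (w^T x_i + b) \ge \rho/2$. • For every support vector $x_s$ the above inequality is an equality. After rescaling $w$ and $b$ by $\rho/2$ in the equality, we obtain that the distance between each $x_s$ and the hyperplane is $r = \frac{y_s (w^T x_s + b)}{\lVert w \rVert} = \frac{1}{\lVert w \rVert}$. • Then the margin can be expressed through the (rescaled) $w$ and $b$ as $\rho = 2r = \frac{2}{\lVert w \rVert}$.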

  19. Linear SVMs Mathematically (cont.) • Then we can formulate the quadratic optimization problem: find $w$ and $b$ such that $\rho = \frac{2}{\lVert w \rVert}$ is maximized and for all $(x_i, y_i)$, $i = 1..n$: $y_i (w^T x_i + b) \ge 1$. • Which can be reformulated as: find $w$ and $b$ such that $\Phi(w) = \lVert w \rVert^2 = w^T w$ is minimized and for all $(x_i, y_i)$, $i = 1..n$: $y_i (w^T x_i + b) \ge 1$.
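The reformulation step can be spelled out as follows, assuming the rescaled margin $2 / \lVert w \rVert$ (the standard result for constraints of the form $y_i (w^T x_i + b) \ge 1$). Both problems share the same constraints, and only monotone transformations are applied to the objective, so the optimal $w$ and $b$ do not change.

```latex
\begin{aligned}
(w^*, b^*)
  &= \arg\max_{w,b}\ \frac{2}{\lVert w \rVert}
  && \text{s.t. } y_i (w^T x_i + b) \ge 1,\ i = 1..n \\
  &= \arg\min_{w,b}\ \lVert w \rVert
  && \text{since } t \mapsto 2/t \text{ is strictly decreasing for } t > 0 \\
  &= \arg\min_{w,b}\ \lVert w \rVert^2
   = \arg\min_{w,b}\ w^T w
  && \text{since } t \mapsto t^2 \text{ is strictly increasing for } t \ge 0
\end{aligned}
```

The squared norm is preferred because it gives a smooth quadratic objective, which is what makes the problem a quadratic program.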

  20. Solving the Optimization Problem • Find $w$ and $b$ such that $\Phi(w) = w^T w$ is minimized and for all $(x_i, y_i)$, $i = 1..n$: $y_i (w^T x_i + b) \ge 1$. • We need to optimize a quadratic function subject to linear constraints. • Quadratic optimization problems are a well-known class of mathematical programming problems for which several (non-trivial) algorithms exist. • The solution involves constructing a dual problem where a Lagrange multiplier $\alpha_i$ is associated with every inequality constraint in the primal (original) problem: find $\alpha_1 \dots \alpha_n$ such that $Q(\alpha) = \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j x_i^T x_j$ is maximized and (1) $\sum_i \alpha_i y_i = 0$, (2) $\alpha_i \ge 0$ for all $\alpha_i$.
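To make the dual objective concrete, the sketch below evaluates $Q(\alpha)$ on a hypothetical two-point training set. With one positive and one negative example, constraint (1) forces $\alpha_1 = \alpha_2$, so a one-dimensional grid search is enough here. The grid search is only a stand-in for a real quadratic programming solver and is not how SVMs are trained in practice.

```python
X = [(1.0, 0.0), (-1.0, 0.0)]  # hypothetical training points
y = [1, -1]                    # their labels


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def dual_objective(alpha):
    """Q(alpha) = sum_i alpha_i - 1/2 sum_i sum_j alpha_i alpha_j y_i y_j x_i^T x_j."""
    n = len(alpha)
    linear = sum(alpha)
    quadratic = sum(
        alpha[i] * alpha[j] * y[i] * y[j] * dot(X[i], X[j])
        for i in range(n)
        for j in range(n)
    )
    return linear - 0.5 * quadratic


# Constraint (1) with y = (+1, -1) means alpha_1 = alpha_2 = a; constraint (2) means a >= 0.
candidates = [k / 100 for k in range(0, 201)]
best_a = max(candidates, key=lambda a: dual_objective([a, a]))
print(best_a, dual_objective([best_a, best_a]))
```

On this toy problem $Q$ reduces to $2a - 2a^2$, which peaks at $a = 0.5$.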

  21. The Optimization Problem Solution • Given a solution $\alpha_1 \dots \alpha_n$ to the dual problem, the solution to the primal is: $w = \sum_i \alpha_i y_i x_i$, $b = y_k - \sum_i \alpha_i y_i x_i^T x_k$ for any $\alpha_k > 0$. • Each non-zero $\alpha_i$ indicates that the corresponding $x_i$ is a support vector. • Then the classifying function is (note that we don't need $w$ explicitly): $f(x) = \sum_i \alpha_i y_i x_i^T x + b$. • Notice that it relies on an inner product between the test point $x$ and the support vectors $x_i$; we will return to this later. • Also keep in mind that solving the optimization problem involved computing the inner products $x_i^T x_j$ between all training points.
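The recovery of $w$, $b$ and $f(x)$ from the multipliers can be sketched as below. The three training points and the values in `alpha` are hypothetical and assumed to come from an already solved dual problem; the sketch only shows that the inner-product form of $f$ agrees with the explicit $w^T x + b$.

```python
X = [(1.0, 0.0), (-1.0, 0.0), (3.0, 1.0)]  # hypothetical training points
y = [1, -1, 1]
alpha = [0.5, 0.5, 0.0]                    # assumed dual solution; the third point is not a support vector


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# w = sum_i alpha_i y_i x_i
w = [sum(alpha[i] * y[i] * X[i][d] for i in range(len(X))) for d in range(2)]

# b = y_k - sum_i alpha_i y_i x_i^T x_k for any k with alpha_k > 0
k = next(i for i, a in enumerate(alpha) if a > 0)
b = y[k] - sum(alpha[i] * y[i] * dot(X[i], X[k]) for i in range(len(X)))


def f(x):
    """Classifying function using only inner products with the support vectors."""
    return sum(alpha[i] * y[i] * dot(X[i], x) for i in range(len(X)) if alpha[i] > 0) + b


test_point = (2.0, 5.0)
print(w, b, f(test_point), dot(w, test_point) + b)
```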

  22. Soft Margin Classification • What if the training set is not linearly separable? • Slack variables $\xi_i$ can be added to allow misclassification of difficult or noisy examples; the resulting margin is called soft.

  23. Soft Margin Classification Mathematically • The old formulation: find $w$ and $b$ such that $\Phi(w) = w^T w$ is minimized and for all $(x_i, y_i)$, $i = 1..n$: $y_i (w^T x_i + b) \ge 1$. • The modified formulation incorporates slack variables: find $w$ and $b$ such that $\Phi(w) = w^T w + C \sum_i \xi_i$ is minimized and for all $(x_i, y_i)$, $i = 1..n$: $y_i (w^T x_i + b) \ge 1 - \xi_i$, $\xi_i \ge 0$. • Parameter $C$ can be viewed as a way to control overfitting: it "trades off" the relative importance of maximizing the margin and fitting the training data.
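For a fixed $w$ and $b$, the smallest slack that satisfies the modified constraints is $\xi_i = \max(0,\ 1 - y_i (w^T x_i + b))$. The sketch below uses that observation to evaluate the soft-margin objective on hypothetical data for two values of $C$; it evaluates the objective only and does not optimize it.

```python
def slack(w, b, x, y):
    """Smallest xi >= 0 such that y * (w^T x + b) >= 1 - xi holds."""
    functional_margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
    return max(0.0, 1.0 - functional_margin)


def soft_margin_objective(w, b, data, C):
    """Phi(w) = w^T w + C * sum_i xi_i for a fixed candidate (w, b)."""
    return sum(wi * wi for wi in w) + C * sum(slack(w, b, x, y) for x, y in data)


w, b = [1.0, 0.0], 0.0  # hypothetical candidate separator
data = [
    ((2.0, 0.0), 1),    # outside the margin: no slack
    ((0.5, 0.0), 1),    # inside the margin but correctly classified
    ((0.5, 1.0), -1),   # misclassified
    ((-3.0, 0.0), -1),  # outside the margin: no slack
]

slacks = [slack(w, b, x, y) for x, y in data]
print(slacks)
print(soft_margin_objective(w, b, data, C=1.0))
print(soft_margin_objective(w, b, data, C=10.0))
```

A larger $C$ makes the same margin violations more expensive, which pushes the optimizer towards fitting the training data rather than widening the margin.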

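  24. Soft Margin Classification: Solution • The dual problem is identical to the separable case except for the upper bound $C$ on each $\alpha_i$ (it would not be identical if the 2-norm penalty for slack variables, $C \sum_i \xi_i^2$, were used in the primal objective; we would need additional Lagrange multipliers for the slack variables): find $\alpha_1 \dots \alpha_n$ such that $Q(\alpha) = \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j x_i^T x_j$ is maximized and (1) $\sum_i \alpha_i y_i = 0$, (2) $0 \le \alpha_i \le C$ for all $\alpha_i$. • Again, $x_i$ with non-zero $\alpha_i$ will be support vectors. • The solution to the dual problem is: $w = \sum_i \alpha_i y_i x_i$, $b = y_k (1 - \xi_k) - \sum_i \alpha_i y_i x_i^T x_k$ for any $k$ s.t. $\alpha_k > 0$. • Again, we don't need to compute $w$ explicitly for classification: $f(x) = \sum_i \alpha_i y_i x_i^T x + b$.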

  25. Theoretical Justification for Maximum Margins • Vapnik has proved the following: the class of optimal linear separators has VC dimension $h$ bounded from above as $h \le \min\left\{ \left\lceil D^2 / \rho^2 \right\rceil, m_0 \right\} + 1$, where $\rho$ is the margin, $D$ is the diameter of the smallest sphere that can enclose all of the training examples, and $m_0$ is the dimensionality. • Intuitively, this implies that regardless of the dimensionality $m_0$ we can minimize the VC dimension by maximizing the margin $\rho$. • Thus, the complexity of the classifier is kept small regardless of dimensionality.

  26. Linear SVMs: Overview • The classifier is a separating hyperplane. • The most "important" training points are the support vectors; they define the hyperplane. • Quadratic optimization algorithms can identify which training points $x_i$ are support vectors with non-zero Lagrangian multipliers $\alpha_i$. • Both in the dual formulation of the problem and in the solution, training points appear only inside inner products: find $\alpha_1 \dots \alpha_n$ such that $Q(\alpha) = \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j x_i^T x_j$ is maximized and (1) $\sum_i \alpha_i y_i = 0$, (2) $0 \le \alpha_i \le C$ for all $\alpha_i$; the classifier is $f(x) = \sum_i \alpha_i y_i x_i^T x + b$.

  27. Non-linear SVMs • Datasets that are linearly separable with some noise work out great. • But what are we going to do if the dataset is just too hard? • How about mapping the data to a higher-dimensional space? [Figure: data plotted along the x axis, then the same data plotted against the axes x and x²]

  28. Non-linear SVMs: Feature spaces • General idea: the original feature space can always be mapped to some higher-dimensional feature space where the training set is separable: $\Phi: x \to \varphi(x)$
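A small sketch of the idea, assuming the mapping $\varphi(x) = (x, x^2)$ and a hypothetical one-dimensional dataset in which the positive class lies on both sides of the negative class. No single threshold on $x$ separates the classes, but a line in the mapped space does. The separator `w`, `b` is chosen by hand for this example, not learned.

```python
def separable_by_threshold(xs, ys):
    """True if some rule sign(s * (x - t)) labels every 1-D point correctly."""
    points = sorted(xs)
    thresholds = [(a + c) / 2 for a, c in zip(points, points[1:])]
    for t in thresholds:
        for s in (1, -1):
            if all((1 if s * (x - t) > 0 else -1) == y for x, y in zip(xs, ys)):
                return True
    return False


def phi(x):
    """Hypothetical feature map from the line to the plane: x -> (x, x^2)."""
    return (x, x * x)


xs = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys = [1, 1, -1, -1, -1, 1, 1]  # positives on both ends, negatives in the middle

# Hand-picked separator in the mapped space: the horizontal line x^2 = 2.5.
w, b = (0.0, 1.0), -2.5
mapped_predictions = [
    1 if w[0] * p[0] + w[1] * p[1] + b > 0 else -1 for p in map(phi, xs)
]

print(separable_by_threshold(xs, ys))  # is the original 1-D data separable?
print(mapped_predictions == ys)        # does the mapped data separate linearly?
```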
