Support Vector Machine Ke Chen COMP24111 Machine Learning

Outline
• Review of Linear Classifiers
• Motivation and Concept
• SVM Learning
• Nonlinear SVM
• Kernel-Based SVM
• SVM Demo
• Conclusions

Review of Linear Classifiers
• Linear classifiers
  • One of the simplest classifiers
  • Linear decision boundary
  • Applicable to linearly separable tasks
• Perceptron
  • One of the most popular linear classifiers
• Perceptron learning
  • Given a training set containing a number of examples with labels
  • Iteratively update the weights of the linear equation until convergence
  • Sensitive to initialisation and to the order in which examples are presented during learning
  • Can generate decision boundaries with different generalisation capabilities
[Figure: examples labelled +1 and -1 in the (x1, x2) plane]

Motivation and Concept
• Perceptron learning
  Q: How would you classify this data set?
  A: Use the perceptron learning rule to learn the weights w and the bias b.
  Q: What happens with different initial values of w and b and different example orders?
  A: They lead to many different decision boundaries.
  Q: Which decision boundary is the best for generalisation?
[Figure: two-class data set; markers denote +1 and -1]
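
The sensitivity to initialisation and example order can be sketched with a few lines of plain Python. This is only an illustration under assumptions: the toy data set, the seeds and the function names `train_perceptron` and `predict` are made up here and do not come from the slides.

```python
import random

def train_perceptron(points, labels, seed, max_epochs=1000):
    """Perceptron learning rule with random initial weights and random example order."""
    rng = random.Random(seed)
    w = [rng.uniform(-1, 1), rng.uniform(-1, 1)]  # random initial weights
    b = rng.uniform(-1, 1)                        # random initial bias
    order = list(range(len(points)))
    for _ in range(max_epochs):
        rng.shuffle(order)                        # random presentation order
        mistakes = 0
        for i in order:
            x, y = points[i], labels[i]
            if y * (w[0] * x[0] + w[1] * x[1] + b) <= 0:  # misclassified example
                w = [w[0] + y * x[0], w[1] + y * x[1]]
                b += y
                mistakes += 1
        if mistakes == 0:                         # converged: every example is correct
            break
    return w, b

def predict(w, b, x):
    return 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else -1

# Hypothetical linearly separable toy data.
points = [(2.0, 2.0), (3.0, 3.0), (2.0, 3.0), (-2.0, -2.0), (-3.0, -3.0), (-2.0, -3.0)]
labels = [1, 1, 1, -1, -1, -1]

boundaries = [train_perceptron(points, labels, seed) for seed in range(5)]
for w, b in boundaries:
    print(w, b)
```

Every run separates the training data, yet the printed (w, b) pairs differ from seed to seed, which is exactly why a criterion for choosing among them is needed.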

Motivation and Concept
• Margin of a linear classifier
  Definition: the margin of a linear classifier is the width by which the boundary could be increased before hitting a data point.
[Figure: two-class data set; markers denote +1 and -1]
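
The definition can be turned into a small computation. The sketch below assumes the boundary is widened symmetrically, so the margin is twice the distance from the boundary to the nearest data point; the function name `margin_width` and the toy points are illustrative only.

```python
import math

def margin_width(w, b, points):
    """Width of the widest band centred on w.x + b = 0 that contains no data point."""
    norm = math.hypot(w[0], w[1])
    nearest = min(abs(w[0] * x[0] + w[1] * x[1] + b) for x in points) / norm
    return 2 * nearest

# Hypothetical toy data: the first and third points are closest to the other class.
points = [(1.0, 1.0), (3.0, 3.0), (-1.0, -1.0), (-3.0, -2.0)]

diagonal = margin_width((1.0, 1.0), 0.0, points)   # boundary x1 + x2 = 0
vertical = margin_width((1.0, 0.0), 0.0, points)   # boundary x1 = 0
print(diagonal, vertical)
```

Both boundaries separate the toy data, but the diagonal one leaves a wider margin.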

Motivation and Concept
• Maximum margin
  The maximum-margin linear classifier is the linear classifier whose boundary has the widest margin. This maximum-margin linear classifier is the linear Support Vector Machine (linear SVM).
[Figure: two-class data set with the margin marked]

Motivation and Concept
• Support vectors
  Support vectors are those data points that the margin pushes up against.
[Figure: two-class data set with the margin and the support vectors marked]

Motivation and Concept
• SVM: the best solution in terms of generalisation
  • Intuitively this feels safest: if we have made a small error in the location of the boundary, the maximum margin gives us the least chance of causing a misclassification.
  • The model is immune to the removal of any non-support-vector data points.
  • There is some theory (based on the VC dimension) supporting the proposition that the maximum margin gives the best generalisation.
  • Empirically it works very well.
[Figure: two-class data set with the margin and the support vectors marked]

Where does a machine learning algorithm come from?
• “Model”: a mechanism that can encode knowledge for problem solving, which is often a parametric model. Learning means finding “appropriate” parameters.
• “Error/Cost function”: the performance criterion, a function defined on the learning model to judge how well the parameters of this model are set.
• “Learning algorithm”: the algorithm comes from an optimisation process that minimises the error/cost function with respect to the parameters; this yields a learning rule for finding “appropriate” parameters from a given training data set.

SVM Learning
• Objectives: find appropriate weights w and bias b to
  • minimise training errors (similar to the Perceptron)
  • maximise the margin for the best generalisation
• What is the relationship between the weights and the margin?
  • Using analytic geometry, we obtain the margin as a function of w. [Equation not preserved]
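
The formula from the original slide is not available here. As a sketch, the usual textbook derivation assumes the canonical scaling in which the closest points of each class satisfy w·x + b = ±1; the points x⁺, x⁻ and the scalar λ below are notation introduced only for this derivation.

```latex
\begin{aligned}
&\text{Boundary: } \mathbf{w}^{\top}\mathbf{x} + b = 0, \qquad
 \text{margin planes: } \mathbf{w}^{\top}\mathbf{x} + b = \pm 1 \\
&\text{Take } \mathbf{x}^{+},\ \mathbf{x}^{-} \text{ on the two planes with }
 \mathbf{x}^{+} = \mathbf{x}^{-} + \lambda\,\mathbf{w}: \\
&\mathbf{w}^{\top}(\mathbf{x}^{+} - \mathbf{x}^{-}) = 2
 \;\Rightarrow\; \lambda\,\|\mathbf{w}\|^{2} = 2
 \;\Rightarrow\; \lambda = \frac{2}{\|\mathbf{w}\|^{2}} \\
&\text{margin} = \|\mathbf{x}^{+} - \mathbf{x}^{-}\| = \lambda\,\|\mathbf{w}\|
 = \frac{2}{\|\mathbf{w}\|}
\end{aligned}
```

Under this scaling, a smaller weight norm means a wider margin.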

SVM Learning
• Learning via optimisation
  Given a set of linearly separable training examples, learning amounts to solving a constrained minimisation problem. [Equations for the training set and the minimisation problem not preserved]
[Figure: two-class data set; markers denote +1 and -1]
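
The equations for this slide were lost. The standard hard-margin formulation, which is presumably what the slide showed, is sketched below; the notation (N examples, inputs x_n, labels y_n) is an assumption. Under the canonical scaling the margin is 2/‖w‖, so maximising it is equivalent to minimising ½‖w‖².

```latex
\begin{aligned}
&D = \{(\mathbf{x}_n, y_n)\}_{n=1}^{N}, \qquad y_n \in \{+1, -1\} \\[4pt]
&\min_{\mathbf{w},\, b} \ \tfrac{1}{2}\,\|\mathbf{w}\|^{2}
 \quad \text{subject to} \quad
 y_n\,(\mathbf{w}^{\top}\mathbf{x}_n + b) \ge 1, \qquad n = 1, \dots, N
\end{aligned}
```

The constraints state that every training example lies on the correct side of the boundary and outside the margin.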

SVM Learning
• Support vectors: property (their signature) [equation not preserved]
• The learning rule is no longer as simple as the perceptron's!
  • We need to search the space of w's and b's to find the widest margin that is consistent with all the data points, in particular the support vectors
• How? Using a Quadratic Programming (QP) algorithm!

SVM Learning
• Quadratic programming
  Find the optimum of a quadratic criterion [equation not preserved],
  subject to n additional linear inequality constraints,
  and subject to e additional linear equality constraints.
  There exist QP algorithms for finding such constrained quadratic optima much more efficiently and reliably than the gradient-based search used in the Perceptron. (But they are very fiddly… you probably don't want to know the details, but you can use them as if they were black boxes.)
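
The criterion itself did not survive extraction. A generic quadratic programme of the kind described can be written as follows; all symbol names (u, c, d, R, a_i, b_i) are assumptions chosen for this sketch, not taken from the slide.

```latex
\begin{aligned}
\text{Find}\quad & \mathbf{u}^{*} = \arg\max_{\mathbf{u}} \;
   c + \mathbf{d}^{\top}\mathbf{u} + \tfrac{1}{2}\,\mathbf{u}^{\top} R\,\mathbf{u} \\
\text{subject to}\quad & \mathbf{a}_i^{\top}\mathbf{u} \le b_i, \qquad i = 1, \dots, n \\
\text{and}\quad & \mathbf{a}_j^{\top}\mathbf{u} = b_j, \qquad j = n+1, \dots, n+e
\end{aligned}
```

SVM learning fits this template because its objective is quadratic in the unknowns and its constraints are linear.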

SVM Learning
• Training data
  • Given a set of linearly separable training examples, D
• QP algorithm for SVM learning
  • Actually a search procedure to find all the support vectors in D
  • Input
  • Output: αn for each training example
• Solution
• Decision boundary
[Equations for the training set, input, output, solution and decision boundary not preserved]
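
To make the "search for support vectors" concrete, here is a minimal sketch that solves the standard hard-margin dual on a toy problem. It is not the QP algorithm from the slides: a simplified SMO-style pairwise update is swapped in so that the example needs no solver library, and the function name `train_hard_margin_svm` and the data are illustrative assumptions.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def train_hard_margin_svm(X, y, sweeps=100, tol=1e-12):
    """Maximise sum_n a_n - 0.5 * sum_nm a_n a_m y_n y_m x_n.x_m
    subject to a_n >= 0 and sum_n a_n y_n = 0, two multipliers at a time."""
    n = len(X)
    K = [[dot(X[i], X[j]) for j in range(n)] for i in range(n)]
    alpha = [0.0] * n
    for _ in range(sweeps):
        changed = False
        for i in range(n):
            for j in range(i + 1, n):
                eta = K[i][i] + K[j][j] - 2 * K[i][j]
                if eta <= tol:
                    continue
                # g_k = sum_m alpha_m y_m K[m][k]; the bias cancels in E_i - E_j.
                g_i = sum(alpha[m] * y[m] * K[m][i] for m in range(n))
                g_j = sum(alpha[m] * y[m] * K[m][j] for m in range(n))
                new_j = alpha[j] + y[j] * ((g_i - y[i]) - (g_j - y[j])) / eta
                # Keep both multipliers non-negative while preserving sum_n a_n y_n.
                if y[i] != y[j]:
                    low, high = max(0.0, alpha[j] - alpha[i]), float("inf")
                else:
                    low, high = 0.0, alpha[i] + alpha[j]
                new_j = min(max(new_j, low), high)
                if abs(new_j - alpha[j]) > tol:
                    alpha[i] += y[i] * y[j] * (alpha[j] - new_j)
                    alpha[j] = new_j
                    changed = True
        if not changed:
            break
    # Solution: w = sum_n a_n y_n x_n; b is averaged over the support vectors.
    w = [sum(alpha[m] * y[m] * X[m][d] for m in range(n)) for d in range(len(X[0]))]
    support = [m for m in range(n) if alpha[m] > 1e-8]
    b = sum(y[m] - dot(w, X[m]) for m in support) / len(support)
    return alpha, w, b

# Hypothetical toy data: only the first and third points touch the margin.
X = [(1.0, 1.0), (3.0, 3.0), (-1.0, -1.0), (-3.0, -2.0)]
y = [1, 1, -1, -1]
alpha, w, b = train_hard_margin_svm(X, y)
print(alpha, w, b)
```

Only the examples with non-zero multipliers contribute to w and b, which is the sense in which learning amounts to finding the support vectors.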

Nonlinear SVM
• Datasets that are linearly separable work out great
• But what are we going to do if the dataset is just too hard?
• How about mapping the data to a two-dimensional space?
[Figures: data on a one-dimensional x axis, and the same data in a two-dimensional plot with axes x and x2]
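
The mapping idea can be sketched on a one-dimensional toy set. The mapping x → (x, x²), the data and the hand-picked separating line are assumptions made for this illustration.

```python
def phi(x):
    """Assumed mapping from the 1-D input to a 2-D feature space."""
    return (x, x * x)

# Hypothetical 1-D data: the outer points are +1, the inner points are -1.
xs = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys = [1, 1, -1, -1, -1, 1, 1]

def threshold_separable(xs, ys):
    """True if a single threshold on the 1-D axis puts each class on its own side."""
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == -1]
    return max(pos) < min(neg) or max(neg) < min(pos)

# In the mapped space the horizontal line x^2 = 2.5 separates the classes.
w, b = (0.0, 1.0), -2.5
mapped = [phi(x) for x in xs]
predictions = [1 if w[0] * z[0] + w[1] * z[1] + b > 0 else -1 for z in mapped]

print(threshold_separable(xs, ys))
print(predictions == ys)
```

No threshold works on the original axis, while a straight line does the job after the mapping.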

Nonlinear SVM
• General idea
  • The original input space can always be mapped to some higher-dimensional feature space, with an appropriate function Φ, so that the training set is linearly separable: x → Φ(x)

Nonlinear SVM Learning
• Training data
• QP algorithm for SVM learning
  • Actually a search procedure to find all the support vectors in D
  • Input
  • Output: αn for each training example
• Solution
• Decision boundary
[Equations for the training data, input, output, solution and decision boundary not preserved]
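
The slide's equations are missing. As a hedged sketch, the standard dual formulation in the feature space reads as follows, assuming N training examples (x_n, y_n) with y_n ∈ {+1, -1} and multipliers α_n as the output of the QP algorithm.

```latex
\begin{aligned}
&\max_{\boldsymbol{\alpha}} \ \sum_{n=1}^{N} \alpha_n
  - \tfrac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N}
    \alpha_n \alpha_m\, y_n y_m\, \Phi(\mathbf{x}_n)^{\top} \Phi(\mathbf{x}_m)
  \quad \text{s.t.} \quad \alpha_n \ge 0, \quad \sum_{n=1}^{N} \alpha_n y_n = 0 \\[4pt]
&\mathbf{w} = \sum_{n=1}^{N} \alpha_n y_n\, \Phi(\mathbf{x}_n), \qquad
  b = y_s - \mathbf{w}^{\top} \Phi(\mathbf{x}_s) \ \text{ for any support vector } \mathbf{x}_s \ (\alpha_s > 0) \\[4pt]
&f(\mathbf{x}) = \operatorname{sign}\!\Big( \sum_{n=1}^{N} \alpha_n y_n\,
  \Phi(\mathbf{x}_n)^{\top} \Phi(\mathbf{x}) + b \Big)
\end{aligned}
```

Note that the data enter both the objective and the decision function only through dot products in the feature space.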

Kernel SVM
• Motivation of Kernel SVM
  • Directly computing the nonlinear mapping is time-consuming and sometimes computationally intractable!
  • Fortunately, SVM learning does not need to know Φ explicitly; it requires only the dot product of two points in the feature space
  • For certain nonlinear mappings, there is a function K that works on the original space but equals the dot product in the new high-dimensional feature space, i.e., K(x, z) = Φ(x) · Φ(z)

Kernel SVM
• Motivation of Kernel SVM (Cont.)
  • Illustrative example: [not preserved]
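
The slide's own example is missing, so here is a commonly used one in its place: for two-dimensional inputs, the kernel K(x, z) = (x · z)² equals the dot product under the mapping Φ(x) = (x1², √2·x1·x2, x2²). Both the kernel and the mapping are assumptions chosen for this sketch.

```python
import math

def phi(x):
    """Explicit feature map for the degree-2 homogeneous polynomial kernel."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def kernel(x, z):
    """Same quantity computed in the original 2-D space."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2

x, z = (1.0, 2.0), (3.0, -1.0)
explicit = sum(a * b for a, b in zip(phi(x), phi(z)))  # dot product in feature space
implicit = kernel(x, z)                                # no mapping needed
print(explicit, implicit)
```

The two numbers agree up to floating-point rounding, but the kernel never builds the three-dimensional feature vectors.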

Kernel SVM
• Kernel functions
  • A kernel function is some function that corresponds to a vector dot product in the new feature space
  • Linear kernel
  • Polynomial kernel of order p
  • Radial Basis Function (RBF) kernel
  • Sigmoid kernel
  [Kernel formulas not preserved]
• Only the kernel function is required by the QP algorithm
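
The four kernels can be sketched in their common textbook forms. The exact parameterisations on the slide are unknown, so the forms and the parameter names `p`, `sigma`, `beta0` and `beta1` below are assumptions.

```python
import math

def dot(x, z):
    return sum(a * b for a, b in zip(x, z))

def linear_kernel(x, z):
    return dot(x, z)

def polynomial_kernel(x, z, p=2):
    """Polynomial kernel of order p, in the form (1 + x.z)^p."""
    return (1.0 + dot(x, z)) ** p

def rbf_kernel(x, z, sigma=1.0):
    """Radial Basis Function kernel with width sigma."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))

def sigmoid_kernel(x, z, beta0=1.0, beta1=0.0):
    return math.tanh(beta0 * dot(x, z) + beta1)

x, z = (1.0, 2.0), (3.0, 4.0)
print(linear_kernel(x, z), polynomial_kernel(x, z), rbf_kernel(x, z), sigmoid_kernel(x, z))
```

Each function takes two points of the original input space and returns a single number, which is all the learning procedure needs.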

Kernel SVM
• Kernel trick: solution and decision boundary
  • Solution for the weights and bias
  • Decision boundary
  [Equations not preserved]
• In practice, we never use an explicit transformation; kernel SVM learning uses only a kernel function.
IEEE CIS/Surry Summer School on Computational Intelligence 2010
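
A kernel decision function in its standard form, sign(Σ αn yn K(xn, x) + b), can be sketched as below. The function name `decide` and the toy values are illustrative; the multipliers shown are the hard-margin solution for the two-point linear problem used here, not values from the slides.

```python
def linear_kernel(x, z):
    return sum(a * b for a, b in zip(x, z))

def decide(x, support_vectors, labels, alphas, b, kernel):
    """Classify x using only kernel evaluations against the support vectors."""
    score = sum(a * y * kernel(sv, x)
                for a, y, sv in zip(alphas, labels, support_vectors)) + b
    return 1 if score >= 0 else -1

# Toy problem with two support vectors; alphas and b solve its hard-margin dual.
support_vectors = [(1.0, 1.0), (-1.0, -1.0)]
labels = [1, -1]
alphas = [0.25, 0.25]
b = 0.0

first = decide((2.0, 0.5), support_vectors, labels, alphas, b, linear_kernel)
second = decide((-1.0, -3.0), support_vectors, labels, alphas, b, linear_kernel)
print(first, second)
```

Swapping `linear_kernel` for another kernel changes the shape of the boundary without changing this function, provided the multipliers were learned with that same kernel.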

Kernel SVM
• Example: SVM with polynomial kernel of order 5
[Figure: decision boundary of the example not preserved]

SVM Demo

Conclusions
• (Linear) SVM is a state-of-the-art linear classifier
  • Developed based on statistical learning theory
  • The learning process seeks the support vectors of the maximum margin
  • Best generalisation performance guaranteed for linearly separable data sets
• Kernel “trick”: extending linear SVM to non-linear SVM
  • In principle, data points are mapped onto a higher-dimensional feature space so that they are linearly separable in that space
  • In reality, a kernel function directly computes the dot product of data points in the new feature space
  • Nonlinear SVM learning is very efficient due to the kernel “trick”
• SVM can be extended to multi-category classification
  • Decompose the problem into multiple binary classification tasks
  • There are also variants of SVM that tackle multi-category classification directly
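
The decomposition into binary tasks can be sketched with a one-vs-rest scheme, which is one common choice and an assumption here. To keep the example short and dependency-free, a perceptron stands in for the binary SVM; all function names and the data are hypothetical.

```python
def train_binary(points, labels, max_epochs=1000):
    """Stand-in binary learner (a perceptron); in practice this would be an SVM."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(points, labels):
            if y * (w[0] * x[0] + w[1] * x[1] + b) <= 0:
                w = [w[0] + y * x[0], w[1] + y * x[1]]
                b += y
                mistakes += 1
        if mistakes == 0:
            break
    return w, b

def train_one_vs_rest(points, classes):
    """One binary task per class: that class is +1, every other class is -1."""
    models = {}
    for c in sorted(set(classes)):
        binary_labels = [1 if k == c else -1 for k in classes]
        models[c] = train_binary(points, binary_labels)
    return models

def score(model, x):
    w, b = model
    return w[0] * x[0] + w[1] * x[1] + b

def predict(models, x):
    """Pick the class whose binary classifier gives the highest score."""
    return max(models, key=lambda c: score(models[c], x))

# Hypothetical three-class toy data; each class is linearly separable from the rest.
points = [(0.0, 5.0), (1.0, 6.0), (-5.0, -5.0), (-6.0, -4.0), (5.0, -5.0), (6.0, -4.0)]
classes = ["a", "a", "b", "b", "c", "c"]

models = train_one_vs_rest(points, classes)
print([predict(models, x) for x in points])
```

With three classes this trains three binary classifiers; a new point is assigned to the class whose classifier is most confident.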