Support Vector Machines and Kernel Methods (a very short introduction)
Laurent Orseau, AgroParisTech, laurent.orseau@agroparistech.fr
Based on slides by Antoine Cornuéjols
Introduction
• Linear separation is well understood
  • Efficient algorithms (a quadratic optimization problem)
• Non-linear separation is more difficult
  • Neural networks: mostly heuristic, prone to local minima
• Support Vector Machines use linear separation methods to perform non-linear separation in an optimal way (see the sketch below)
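A minimal sketch of this contrast, assuming scikit-learn and a synthetic two-circles dataset (neither appears in the slides): a purely linear SVM fails on data that is not linearly separable, while a kernel SVM separates it.

```python
# Hypothetical illustration: linear vs kernel SVM on non-linearly-separable data.
from sklearn.datasets import make_circles
from sklearn.svm import LinearSVC, SVC

# Two concentric rings: no straight line can separate the classes.
X, y = make_circles(n_samples=200, noise=0.05, factor=0.4, random_state=0)

linear = LinearSVC().fit(X, y)        # linear separation in the input space
kernel = SVC(kernel="rbf").fit(X, y)  # non-linear separation via a kernel

print("linear SVM accuracy:", linear.score(X, y))  # low: not linearly separable
print("kernel SVM accuracy:", kernel.score(X, y))  # close to 1.0
```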
SVM: how it works
• We want a non-linear separation in the input space
  • Called the "primal" representation of the problem
  • Difficult to do as is
• Feature space
  • Idea: project the input space into a higher-dimensional space in which linear separation can be done
  • = non-linear separation in the input space
  • Still difficult: the complexity depends on the number of dimensions
• Kernel trick (see the sketch below)
  • Make the complexity depend on the number of examples instead
  • Use a "dual" representation of the problem
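A small numeric sketch of why the trick matters (the numbers n, d and the random vectors are assumptions, not from the slides): the explicit degree-d polynomial feature space over n inputs has C(n+d, d) coordinates, yet one kernel value only needs a dot product in the original n dimensions.

```python
# Hypothetical illustration: feature-space size vs the cost of one kernel value.
import math
import numpy as np

n, d = 100, 5
# Dimension of the explicit degree-d polynomial feature space over n inputs.
print("explicit feature-space dimension:", math.comb(n + d, d))  # ~9.7e7

rng = np.random.default_rng(0)
x, x_prime = rng.normal(size=n), rng.normal(size=n)
# Kernel trick: the same scalar product, computed with O(n) work,
# without ever building the feature vectors.
K = (1.0 + x @ x_prime) ** d
print("K(x, x') =", K)
```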
SVM: properties
• Global optimum!
  • No local minima
• Fast!
  • Quadratic optimization
• Can safely replace 3-layer perceptrons
• Kernels
  • Generic kernels exist for solving a wide class of problems
  • Kernels can be combined to create new kernels (see the sketch below)
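For instance, the sum (or product) of two valid kernels is again a valid kernel. A sketch of such a combination, assuming scikit-learn and a toy dataset (none of which appear in the slides):

```python
# Hypothetical illustration: building a new kernel as the sum of two kernels.
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import polynomial_kernel, rbf_kernel
from sklearn.svm import SVC

def combined_kernel(A, B):
    # New kernel = RBF kernel + degree-2 polynomial kernel.
    return rbf_kernel(A, B, gamma=1.0) + polynomial_kernel(A, B, degree=2)

X, y = make_moons(n_samples=200, noise=0.1, random_state=0)
clf = SVC(kernel=combined_kernel).fit(X, y)  # SVC accepts a callable kernel
print("training accuracy with the combined kernel:", clf.score(X, y))
```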
Illustration: polynomial separation on a 1-D axis, with two regions of class 1 around a region of class 2 (points at x = 1, 2, 4, 5, 6); {x=2, x=5, x=6} are the support vectors.
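A sketch reproducing this 1-D illustration with a degree-2 polynomial kernel. The exact class labels are assumptions read off the figure (class 1 at x = 1, 2, 6 and class 2 at x = 4, 5), so take the expected output as indicative only.

```python
# Hypothetical reconstruction of the 1-D polynomial-separation figure.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0], [2.0], [4.0], [5.0], [6.0]])
y = np.array([1, 1, -1, -1, 1])  # assumed labels: class 1 at x = 1, 2, 6

# Large C approximates a hard margin; degree-2 polynomial kernel.
clf = SVC(kernel="poly", degree=2, coef0=1.0, C=1e6).fit(X, y)
print("support vectors:", clf.support_vectors_.ravel())  # slide reports 2, 5, 6
```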
Projection in higher dimension
• Higher dimension means that the problem is reformulated in another description space, so as to express a solution more succinctly
• Can turn a combinatorial explosion into a polynomial expression
  • Ex: parity, majority, …
• More succinct solutions generalize better!
Kernels
• A kernel is the scalar product of two input examples in the feature space
  • K(xi, xj)
• But there is no need to actually enter the full feature space!
  • The scalar product is sufficient
  • = kernel trick (see the sketch below)
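A sketch of "the scalar product is sufficient", assuming scikit-learn (not from the slides): the learner can be trained from the matrix of pairwise kernel values alone, without ever seeing feature-space coordinates.

```python
# Hypothetical illustration: training an SVM from precomputed kernel values only.
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

X, y = make_moons(n_samples=100, noise=0.1, random_state=0)

K_train = rbf_kernel(X, X, gamma=1.0)            # n_examples x n_examples Gram matrix
clf = SVC(kernel="precomputed").fit(K_train, y)  # never touches the feature space

K_new = rbf_kernel(X[:5], X, gamma=1.0)          # kernel values vs the training set
print(clf.predict(K_new))                        # predictions for 5 points
```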
SVM: what it does
• Optimal linear separation
  • Maximizes the margin between the sets of positive and negative examples
  • The margin is defined by the closest points/examples
  • Only a small number of examples
  • Examples "supporting" the margin are called support vectors
• Kernel: scalar product between an example to classify and the support vectors (see the sketch below)
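In other words, classifying a new point only involves kernel values against the support vectors: f(x) = Σᵢ αᵢ yᵢ K(xᵢ, x) + b, summed over the support vectors. A sketch checking this with scikit-learn's stored coefficients (the dataset is an assumption):

```python
# Hypothetical check: the decision function only uses the support vectors.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

X, y = make_moons(n_samples=100, noise=0.1, random_state=0)
clf = SVC(kernel="rbf", gamma=1.0).fit(X, y)

x_new = X[:3]
K = rbf_kernel(x_new, clf.support_vectors_, gamma=1.0)  # kernels vs support vectors
f_manual = K @ clf.dual_coef_.ravel() + clf.intercept_  # sum_i alpha_i y_i K(x_i, x) + b
print(np.allclose(f_manual, clf.decision_function(x_new)))  # True
```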
Illustration: the XOR case
Polynomial kernel of degree 2: K(x, x') = [1 + (xᵀ · x')]²
K(x, xᵢ) = 1 + x₁²xᵢ₁² + 2 x₁x₂xᵢ₁xᵢ₂ + x₂²xᵢ₂² + 2 x₁xᵢ₁ + 2 x₂xᵢ₂
Corresponding to the projection into the feature space F: [1, x₁², √2 x₁x₂, x₂², √2 x₁, √2 x₂]ᵀ
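A quick numeric check (a sketch, not from the slides) that the scalar product of these explicit 6-dimensional feature vectors equals the degree-2 polynomial kernel value:

```python
# Hypothetical check: explicit feature map vs the degree-2 polynomial kernel.
import numpy as np

def phi(x):
    # The 6-dimensional feature map stated on the slide.
    x1, x2 = x
    return np.array([1.0, x1**2, np.sqrt(2) * x1 * x2, x2**2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2])

def K(x, x_prime):
    # Degree-2 polynomial kernel: (1 + x . x')^2
    return (1.0 + np.dot(x, x_prime)) ** 2

rng = np.random.default_rng(0)
x, x_prime = rng.normal(size=2), rng.normal(size=2)
print(np.isclose(phi(x) @ phi(x_prime), K(x, x_prime)))  # True
```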
Illustration: the XOR case
Separation in the input space: D(x) = -x₁x₂
Separation in the feature space F(x) (a 6-dimensional space)
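A sketch fitting a degree-2 polynomial SVM on the four XOR points (the exact ±1 coordinates are an assumption) to recover a decision function that behaves like -x₁x₂:

```python
# Hypothetical reconstruction of the XOR example.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 1], [-1, -1], [1, -1], [-1, 1]], dtype=float)
y = np.array([-1, -1, 1, 1])  # XOR: +1 when the two coordinates disagree

# Kernel (1 + x . x')^2, large C for a hard margin.
clf = SVC(kernel="poly", degree=2, gamma=1.0, coef0=1.0, C=1e6).fit(X, y)
print(clf.decision_function(X))  # approx. [-1, -1, +1, +1], i.e. -x1*x2
print(clf.n_support_)            # all four points are support vectors
```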
Applications
• Text categorization
• Recognition of handwritten characters
• Face detection
• Breast cancer diagnosis
• Protein classification
• Electricity consumption forecasting
• …
Trained SVM classifiers for pedestrian and face object detection (Papageorgiou, Oren, Osuna and Poggio, 1998)
SVM: limits
• Can only separate 2 classes
  • For multi-class problems, all combinations of binary classifiers must be built (see the sketch below)
• The kernel must be chosen carefully
  • Not always easy
  • Overfitting problem
• Limits to compactness
  • There is a limit to the redescription capacity of SVMs
  • Only 1 projection phase
• Deep Belief Networks
  • Can represent some solutions more compactly than SVMs
  • Similar to an MLP with more than one hidden layer
  • Each layer is a projection of the previous feature space into a higher space
  • Can represent even more compact solutions and find interesting intermediate representations
  • Learning?
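A sketch of one such combination scheme, assuming scikit-learn (not part of the slides): SVC handles a 3-class problem by training one binary SVM per pair of classes (one-vs-one), i.e. n(n-1)/2 of them.

```python
# Hypothetical illustration: multi-class handled as pairwise binary SVMs.
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)  # 3 classes
clf = SVC(kernel="rbf", decision_function_shape="ovo").fit(X, y)

# One score per pair of classes: 3 * (3 - 1) / 2 = 3 binary classifiers.
print(clf.decision_function(X[:1]).shape)  # (1, 3)
```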