Support Vector Machines and Kernel Methods (a very short introduction)
Laurent Orseau, AgroParisTech, laurent.orseau@agroparistech.fr
Based on slides by Antoine Cornuéjols
Introduction
• Linear separation is well understood
  • Efficient algorithms (a quadratic optimization problem)
• Non-linear separation is more difficult
  • Neural networks: mostly heuristic, prone to local minima
• Support Vector Machines use linear separation methods to perform non-linear separation in an optimal way (see the sketch below)
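A minimal sketch of this contrast, assuming scikit-learn and a synthetic two-circles dataset (neither appears in the slides): a purely linear SVM fails on data that is not linearly separable, while a kernel SVM separates it.

```python
# Hypothetical illustration: linear vs kernel SVM on non-linearly-separable data.
from sklearn.datasets import make_circles
from sklearn.svm import LinearSVC, SVC

# Two concentric rings: no straight line can separate the classes.
X, y = make_circles(n_samples=200, noise=0.05, factor=0.4, random_state=0)

linear = LinearSVC().fit(X, y)        # linear separation in the input space
kernel = SVC(kernel="rbf").fit(X, y)  # non-linear separation via a kernel

print("linear SVM accuracy:", linear.score(X, y))  # low: not linearly separable
print("kernel SVM accuracy:", kernel.score(X, y))  # close to 1.0
```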
SVM: how it works
• We want a non-linear separation in the input space
  • Called the "primal" representation of the problem
  • Difficult to do as is
• Feature space
  • Idea: project the input space into a higher-dimensional space in which linear separation can be done
  • = non-linear separation in the input space
  • Still difficult: the complexity depends on the number of dimensions
• Kernel trick (see the sketch below)
  • Make the complexity depend on the number of examples instead
  • Use a "dual" representation of the problem
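A small numeric sketch of why the trick matters (the numbers n, d and the random vectors are assumptions, not from the slides): the explicit degree-d polynomial feature space over n inputs has C(n+d, d) coordinates, yet one kernel value only needs a dot product in the original n dimensions.

```python
# Hypothetical illustration: feature-space size vs the cost of one kernel value.
import math
import numpy as np

n, d = 100, 5
# Dimension of the explicit degree-d polynomial feature space over n inputs.
print("explicit feature-space dimension:", math.comb(n + d, d))  # ~9.7e7

rng = np.random.default_rng(0)
x, x_prime = rng.normal(size=n), rng.normal(size=n)
# Kernel trick: the same scalar product, computed with O(n) work,
# without ever building the feature vectors.
K = (1.0 + x @ x_prime) ** d
print("K(x, x') =", K)
```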
SVM: properties
• Global optimum!
  • No local minima
• Fast!
  • Quadratic optimization
• Can safely replace 3-layer perceptrons
• Kernels
  • Generic kernels exist for solving a wide class of problems
  • Kernels can be combined to create new kernels (see the sketch below)
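For instance, the sum (or product) of two valid kernels is again a valid kernel. A sketch of such a combination, assuming scikit-learn and a toy dataset (none of which appear in the slides):

```python
# Hypothetical illustration: building a new kernel as the sum of two kernels.
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import polynomial_kernel, rbf_kernel
from sklearn.svm import SVC

def combined_kernel(A, B):
    # New kernel = RBF kernel + degree-2 polynomial kernel.
    return rbf_kernel(A, B, gamma=1.0) + polynomial_kernel(A, B, degree=2)

X, y = make_moons(n_samples=200, noise=0.1, random_state=0)
clf = SVC(kernel=combined_kernel).fit(X, y)  # SVC accepts a callable kernel
print("training accuracy with the combined kernel:", clf.score(X, y))
```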
Illustration: polynomial separation on a 1-D axis, with two regions of class 1 around a region of class 2 (points at x = 1, 2, 4, 5, 6); {x=2, x=5, x=6} are the support vectors.
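A sketch reproducing this 1-D illustration with a degree-2 polynomial kernel. The exact class labels are assumptions read off the figure (class 1 at x = 1, 2, 6 and class 2 at x = 4, 5), so take the expected output as indicative only.

```python
# Hypothetical reconstruction of the 1-D polynomial-separation figure.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0], [2.0], [4.0], [5.0], [6.0]])
y = np.array([1, 1, -1, -1, 1])  # assumed labels: class 1 at x = 1, 2, 6

# Large C approximates a hard margin; degree-2 polynomial kernel.
clf = SVC(kernel="poly", degree=2, coef0=1.0, C=1e6).fit(X, y)
print("support vectors:", clf.support_vectors_.ravel())  # slide reports 2, 5, 6
```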
Projection in higher dimension
• Higher dimension means that the problem is reformulated in another description space, so as to express a solution more succinctly
• Can turn a combinatorial explosion into a polynomial expression
  • Ex: parity, majority, …
• More succinct solutions generalize better!
Kernels
• A kernel is the scalar product of two input examples in the feature space
  • K(xi, xj)
• But there is no need to actually enter the full feature space!
  • The scalar product is sufficient
  • = kernel trick (see the sketch below)
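A sketch of "the scalar product is sufficient", assuming scikit-learn (not from the slides): the learner can be trained from the matrix of pairwise kernel values alone, without ever seeing feature-space coordinates.

```python
# Hypothetical illustration: training an SVM from precomputed kernel values only.
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

X, y = make_moons(n_samples=100, noise=0.1, random_state=0)

K_train = rbf_kernel(X, X, gamma=1.0)            # n_examples x n_examples Gram matrix
clf = SVC(kernel="precomputed").fit(K_train, y)  # never touches the feature space

K_new = rbf_kernel(X[:5], X, gamma=1.0)          # kernel values vs the training set
print(clf.predict(K_new))                        # predictions for 5 points
```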
SVM: what it does
• Optimal linear separation
  • Maximizes the margin between the sets of positive and negative examples
  • The margin is defined by the closest points/examples
  • Only a small number of examples
  • Examples "supporting" the margin are called support vectors
• Kernel: scalar product between an example to classify and the support vectors (see the sketch below)
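In other words, classifying a new point only involves kernel values against the support vectors: f(x) = Σᵢ αᵢ yᵢ K(xᵢ, x) + b, summed over the support vectors. A sketch checking this with scikit-learn's stored coefficients (the dataset is an assumption):

```python
# Hypothetical check: the decision function only uses the support vectors.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

X, y = make_moons(n_samples=100, noise=0.1, random_state=0)
clf = SVC(kernel="rbf", gamma=1.0).fit(X, y)

x_new = X[:3]
K = rbf_kernel(x_new, clf.support_vectors_, gamma=1.0)  # kernels vs support vectors
f_manual = K @ clf.dual_coef_.ravel() + clf.intercept_  # sum_i alpha_i y_i K(x_i, x) + b
print(np.allclose(f_manual, clf.decision_function(x_new)))  # True
```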
Illustration: the XOR case
Polynomial kernel of degree 2: K(x, x') = [1 + (xᵀ · x')]²
K(x, xᵢ) = 1 + x₁²xᵢ₁² + 2 x₁x₂xᵢ₁xᵢ₂ + x₂²xᵢ₂² + 2 x₁xᵢ₁ + 2 x₂xᵢ₂
Corresponding to the projection into the feature space F: [1, x₁², √2 x₁x₂, x₂², √2 x₁, √2 x₂]ᵀ
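A quick numeric check (a sketch, not from the slides) that the scalar product of these explicit 6-dimensional feature vectors equals the degree-2 polynomial kernel value:

```python
# Hypothetical check: explicit feature map vs the degree-2 polynomial kernel.
import numpy as np

def phi(x):
    # The 6-dimensional feature map stated on the slide.
    x1, x2 = x
    return np.array([1.0, x1**2, np.sqrt(2) * x1 * x2, x2**2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2])

def K(x, x_prime):
    # Degree-2 polynomial kernel: (1 + x . x')^2
    return (1.0 + np.dot(x, x_prime)) ** 2

rng = np.random.default_rng(0)
x, x_prime = rng.normal(size=2), rng.normal(size=2)
print(np.isclose(phi(x) @ phi(x_prime), K(x, x_prime)))  # True
```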
Illustration: the XOR case
Separation in the input space: D(x) = -x₁x₂
Separation in the feature space F(x) (a 6-dimensional space)
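A sketch fitting a degree-2 polynomial SVM on the four XOR points (the exact ±1 coordinates are an assumption) to recover a decision function that behaves like -x₁x₂:

```python
# Hypothetical reconstruction of the XOR example.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 1], [-1, -1], [1, -1], [-1, 1]], dtype=float)
y = np.array([-1, -1, 1, 1])  # XOR: +1 when the two coordinates disagree

# Kernel (1 + x . x')^2, large C for a hard margin.
clf = SVC(kernel="poly", degree=2, gamma=1.0, coef0=1.0, C=1e6).fit(X, y)
print(clf.decision_function(X))  # approx. [-1, -1, +1, +1], i.e. -x1*x2
print(clf.n_support_)            # all four points are support vectors
```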
Applications
• Text categorization
• Recognition of handwritten characters
• Face detection
• Breast cancer diagnosis
• Protein classification
• Electricity consumption forecasting
• …
Trained SVM classifiers for pedestrian and face object detection (Papageorgiou, Oren, Osuna and Poggio, 1998)
SVM: limits
• Can only separate 2 classes
  • For multi-class problems, all combinations of binary classifiers must be built (see the sketch below)
• The kernel must be chosen carefully
  • Not always easy
  • Overfitting problem
• Limits to compactness
  • There is a limit to the redescription capacity of SVMs
  • Only 1 projection phase
• Deep Belief Networks
  • Can represent some solutions more compactly than SVMs
  • Similar to an MLP with more than one hidden layer
  • Each layer is a projection of the previous feature space into a higher space
  • Can represent even more compact solutions and find interesting intermediate representations
  • Learning?
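A sketch of one such combination scheme, assuming scikit-learn (not part of the slides): SVC handles a 3-class problem by training one binary SVM per pair of classes (one-vs-one), i.e. n(n-1)/2 of them.

```python
# Hypothetical illustration: multi-class handled as pairwise binary SVMs.
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)  # 3 classes
clf = SVC(kernel="rbf", decision_function_shape="ovo").fit(X, y)

# One score per pair of classes: 3 * (3 - 1) / 2 = 3 binary classifiers.
print(clf.decision_function(X[:1]).shape)  # (1, 3)
```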