SVM: Support Vector Machines Based on the statistical learning theory of Vapnik, Chervonenkis, Burges, Scholkopf, Smola, Bartlett, Mendelson, and Cristianini Presented by: Tamer Salman
The Addressed Problems • SVM can deal with three kinds of problems: • Pattern Recognition / Classification. • Regression Estimation. • Density Estimation.
Pattern Recognition • Given: • A set of M labeled patterns: {(x_1, y_1), …, (x_M, y_M)}. • The patterns are drawn i.i.d. from an unknown distribution P(X, Y). • A set of functions F. • Goal: choose a function f in F such that an unseen pattern x will be correctly classified with high probability. • Binary classification: two classes, +1 and -1.
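To make this setup concrete, here is a minimal sketch (my own illustration, not from the slides) that draws M labeled patterns from a synthetic P(X, Y) using numpy and evaluates one candidate f from a linear function class F:

```python
import numpy as np

rng = np.random.default_rng(0)
M = 100  # number of labeled patterns

# Draw M i.i.d. samples from a synthetic P(X, Y): two Gaussian blobs, one per label.
y = rng.choice([-1, +1], size=M)
X = rng.normal(size=(M, 2)) + 2.0 * y[:, None]

# F: the class of linear classifiers f(x) = sign(w . x + b); one member of F:
def f(x, w=np.array([1.0, 1.0]), b=0.0):
    return np.sign(x @ w + b)

print("training error of this f:", np.mean(f(X) != y))
```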
The Actual Risk • What is the probability of error of a function f? R(f) = ∫ c(f(x), y) dP(x, y), where c is some cost function on errors. • The risk is not computable because dP(x, y) is unknown. • A proper estimate must be found.
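Since the risk is an expectation over P(X, Y), it can only be approximated; the hedged sketch below does so by Monte Carlo, and only because the distribution here is synthetic and known. The cost c is taken to be the 0/1 loss (my choice):

```python
import numpy as np

# Monte Carlo approximation of the actual risk R(f) = E[ c(f(X), Y) ] with the 0/1 cost.
# Possible here only because P(X, Y) is synthetic; in practice dP(x, y) is unknown
# and the risk must be estimated from a finite sample.
def estimated_risk(w, b, n=100_000, seed=1):
    rng = np.random.default_rng(seed)
    y = rng.choice([-1, +1], size=n)
    X = rng.normal(size=(n, 2)) + 2.0 * y[:, None]
    return np.mean(np.sign(X @ w + b) != y)

print(estimated_risk(w=np.array([1.0, 1.0]), b=0.0))
```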
Linear SVM: Linearly Separable Case [Figure: separating hyperplanes produced by a linear neural network vs. a linear SVM] • Linear SVM produces the maximal margin hyperplane, which is as far as possible from the closest training points.
Linearly Separable Case. Cont. • Given the training set, we seek w and b such that: y_i (w·x_i + b) ≥ 1 for all i. • In addition, we seek the maximal margin hyperplane. • What is the margin? • How do we maximize it?
Margin Maximization • The margin is the sum of the distances of the two closest points, one from each side, to the hyperplane. • The distance of the hyperplane (w, b) from the origin is |b| / ||w||. • The margin is 2 / ||w||. • Maximizing the margin is equivalent to minimizing ½||w||².
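A small numerical illustration of these two geometric quantities (the values of w and b are assumptions of mine, not from the slides):

```python
import numpy as np

w = np.array([3.0, 4.0])   # hypothetical canonical hyperplane normal
b = -5.0

# Distance of the hyperplane {x : w.x + b = 0} from the origin: |b| / ||w||.
dist_from_origin = abs(b) / np.linalg.norm(w)   # 5 / 5 = 1.0

# For a canonical hyperplane (|w.x_i + b| = 1 at the closest points), margin = 2 / ||w||.
margin = 2.0 / np.linalg.norm(w)                # 2 / 5 = 0.4

print(dist_from_origin, margin)
```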
Linear SVM. Cont. • The Lagrangian is: L(w, b, α) = ½||w||² − Σ_i α_i [y_i (w·x_i + b) − 1], with multipliers α_i ≥ 0.
Linear SVM. Cont. • Requiring the derivatives with respect to w and b to vanish yields: w = Σ_i α_i y_i x_i and Σ_i α_i y_i = 0. • The KKT conditions yield: α_i [y_i (w·x_i + b) − 1] = 0 for all i. • Where: the α_i ≥ 0 are the Lagrange multipliers of the constraints.
Linear SVM. Cont. • The resulting separating function is: f(x) = sign(Σ_i α_i y_i (x_i·x) + b). • Notes: • The points with α_i = 0 do not affect the solution. • The points with α_i ≠ 0 are called support vectors (SVs). • The constraint y_i (w·x_i + b) ≥ 1 holds with equality only for the SVs.
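As an illustration of this solution structure, the sketch below (assuming scikit-learn, which is not mentioned in the slides) fits a nearly hard-margin linear SVM on synthetic data and inspects the support vectors, whose coefficients α_i y_i are exposed by the library as dual_coef_:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
y = rng.choice([-1, 1], size=60)
X = rng.normal(size=(60, 2)) + 2.5 * y[:, None]   # (almost) linearly separable blobs

# A very large C approximates the hard-margin linear SVM.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

# Only the support vectors enter the solution; dual_coef_ holds alpha_i * y_i.
print("number of SVs:", clf.support_.size)
print("w =", clf.coef_[0], "b =", clf.intercept_[0])
print("alpha_i * y_i:", clf.dual_coef_[0])
```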
Linear SVM. Non-separable Case • We introduce slack variables ξ_i and allow mistakes. • We demand: y_i (w·x_i + b) ≥ 1 − ξ_i, with ξ_i ≥ 0. • And minimize: ½||w||² + C Σ_i ξ_i.
Non-separable Case. Cont. • The modifications yield the following dual problem: maximize Σ_i α_i − ½ Σ_{i,j} α_i α_j y_i y_j (x_i·x_j) subject to 0 ≤ α_i ≤ C and Σ_i α_i y_i = 0.
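A hedged sketch of the effect of the box constraint 0 ≤ α_i ≤ C, again using scikit-learn's SVC as a stand-in solver: the dual coefficients of the soft-margin solution never exceed C (the data and C values are my own choices):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
y = rng.choice([-1, 1], size=200)
X = rng.normal(size=(200, 2)) + 0.8 * y[:, None]   # heavily overlapping classes

for C in (0.1, 1.0, 10.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    alphas = np.abs(clf.dual_coef_[0])              # |alpha_i * y_i| = alpha_i
    print(f"C={C}: {clf.support_.size} SVs, max alpha = {alphas.max():.3f} <= C")
```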
Non-Linear SVM • Note that the training data appear in the solution only through inner products. • If we pre-map the data into a higher-dimensional and sparser space, we can get more separability and a stronger family of separating functions. • The pre-mapping might make the problem computationally infeasible. • We want to avoid the pre-mapping and still have the same separation ability. • If we have a simple function that operates on two training points and implements the inner product of their pre-mappings, then we achieve the improved separation at no added cost.
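This kernel idea can be checked directly. The sketch below (my own example) uses the homogeneous quadratic kernel on R², whose explicit pre-mapping φ(x) = (x₁², x₂², √2·x₁x₂) is small enough to write out, and verifies that k(x, z) = (x·z)² equals ⟨φ(x), φ(z)⟩:

```python
import numpy as np

def phi(x):
    """Explicit feature map for the homogeneous quadratic kernel on R^2."""
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2) * x1 * x2])

def k(x, z):
    """Kernel that computes the same inner product without the pre-mapping."""
    return float(np.dot(x, z)) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
print(np.dot(phi(x), phi(z)), k(x, z))   # both equal (x . z)^2 = 1.0
```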
Mercer Kernels • A Mercer kernel is a function k: X × X → R for which there exists a function φ: X → H such that: k(x, x') = ⟨φ(x), φ(x')⟩. • A function k(·,·) is a Mercer kernel if, for any function g(·) such that ∫ g(x)² dx is finite, the following holds true: ∫∫ k(x, x') g(x) g(x') dx dx' ≥ 0.
Some Mercer Kernels • Homogeneous polynomial kernels: k(x, x') = (x·x')^d. • Non-homogeneous polynomial kernels: k(x, x') = (x·x' + c)^d, c > 0. • Radial basis function (RBF) kernels: k(x, x') = exp(−||x − x'||² / (2σ²)).
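Hedged implementations of the three kernels; as a finite-sample surrogate of the Mercer condition, the code checks that each Gram matrix is positive semi-definite (the parameter values d, c, σ are arbitrary choices of mine):

```python
import numpy as np

def poly_homogeneous(X, Z, d=2):
    return (X @ Z.T) ** d

def poly_non_homogeneous(X, Z, d=2, c=1.0):
    return (X @ Z.T + c) ** d

def rbf(X, Z, sigma=1.0):
    sq = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * sigma**2))

X = np.random.default_rng(0).normal(size=(50, 3))
for kern in (poly_homogeneous, poly_non_homogeneous, rbf):
    K = kern(X, X)
    # Finite-sample check of the Mercer condition: the Gram matrix is PSD.
    print(kern.__name__, np.linalg.eigvalsh(K).min() >= -1e-8)
```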
Solution of Non-Linear SVM • The problem: maximize Σ_i α_i − ½ Σ_{i,j} α_i α_j y_i y_j k(x_i, x_j) subject to 0 ≤ α_i ≤ C and Σ_i α_i y_i = 0. • The separating function: f(x) = sign(Σ_i α_i y_i k(x_i, x) + b).
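To illustrate that the separating function really is a kernel expansion over the support vectors, the sketch below (assuming scikit-learn's RBF parameterization k(x, x') = exp(−γ||x − x'||²), not from the slides) rebuilds the decision function by hand and compares it with the library's output:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))
y = np.where(np.linalg.norm(X, axis=1) < 1.0, 1, -1)   # non-linearly separable "ring" labels

clf = SVC(kernel="rbf", gamma=0.5, C=1.0).fit(X, y)

# Rebuild f(x) = sum_i alpha_i y_i k(x_i, x) + b using the support vectors only.
x_new = rng.normal(size=(5, 2))
sq = ((clf.support_vectors_[:, None, :] - x_new[None, :, :]) ** 2).sum(-1)
K = np.exp(-0.5 * sq)                                   # gamma = 0.5
by_hand = clf.dual_coef_ @ K + clf.intercept_
print(np.allclose(by_hand.ravel(), clf.decision_function(x_new)))
```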
Notes • The solution of non-linear SVM is linear in the feature space H. • In non-linear SVM, w exists in H. • The complexity of computing the kernel values is not higher than the complexity of the solution, and they can be computed a priori and stored in a kernel matrix. • SVM is suitable for large-scale problems due to its chunking ability.
Error Estimates • Because the actual risk is not computable, we seek to estimate the error rate of a machine given a finite set of m patterns. • Empirical risk. • Training and testing. • k-fold cross-validation. • Leave-one-out (LOO).
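A short sketch of the resampling estimates listed above, using scikit-learn's cross-validation utilities on synthetic data (the model, data, and parameter values are my own choices):

```python
import numpy as np
from sklearn.model_selection import cross_val_score, LeaveOneOut
from sklearn.svm import SVC

rng = np.random.default_rng(3)
y = rng.choice([-1, 1], size=80)
X = rng.normal(size=(80, 2)) + 1.5 * y[:, None]
clf = SVC(kernel="rbf", C=1.0, gamma=0.5)

# k-fold cross-validation (k = 5) and leave-one-out estimates of the error rate.
cv_err = 1.0 - cross_val_score(clf, X, y, cv=5).mean()
loo_err = 1.0 - cross_val_score(clf, X, y, cv=LeaveOneOut()).mean()
print(f"5-fold error ~ {cv_err:.3f}, LOO error ~ {loo_err:.3f}")
```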
Error Bounds • We seek error estimates that are faster to compute than the resampling methods above. • The bound should be tight and informative. • Theoretical VC bound: Risk < Empirical Risk + Complexity(VC-dimension / m). Loose and not always informative. • Margin-radius bound: Risk < R² / margin², where R is the radius of the smallest enclosing sphere of the data in feature space. Tight and informative.
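A rough sketch of evaluating the margin-radius quantity R² / margin² for a trained kernel SVM. The enclosing-sphere radius R is approximated here by the largest distance to the feature-space centroid, which is only an upper bound on the true radius (my simplification, not the slides'); ||w||² is recovered from the dual coefficients:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(4)
y = rng.choice([-1, 1], size=100)
X = rng.normal(size=(100, 2)) + 2.0 * y[:, None]

gamma = 0.5
clf = SVC(kernel="rbf", gamma=gamma, C=10.0).fit(X, y)

# ||w||^2 in feature space: sum_ij alpha_i alpha_j y_i y_j k(x_i, x_j).
K_sv = rbf_kernel(clf.support_vectors_, clf.support_vectors_, gamma=gamma)
w_norm_sq = (clf.dual_coef_ @ K_sv @ clf.dual_coef_.T).item()
margin = 2.0 / np.sqrt(w_norm_sq)

# Approximate R by the largest distance to the feature-space centroid.
K = rbf_kernel(X, X, gamma=gamma)
dist_sq = np.diag(K) - 2.0 * K.mean(axis=1) + K.mean()
R_sq = dist_sq.max()

print("R^2 / margin^2 ~", R_sq / margin**2)
```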
Error Bounds. Cont. [Figure: error bound and LOO error as a function of a kernel parameter]
Rademacher Complexity • One of the tightest sample-based bounds depends on the Rademacher complexity term, defined as follows: R_m(F) = E_P(x) E_σ [ sup_{f∈F} | (2/m) Σ_i σ_i f(x_i) | ], where: F is the class of functions mapping the input domain into R; E_P(x) is the expectation with respect to the probability distribution of the input data; E_σ is the expectation with respect to the σ_i: independent uniform random variables taking values in {±1}. • Rademacher complexity is a measure of the ability of the class of functions to classify the input samples when they are assigned random labels.
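An empirical sketch of this quantity for a small, fixed class of linear classifiers (a hypothetical F of my choosing), estimating the expectation over σ by Monte Carlo on one fixed sample:

```python
import numpy as np

rng = np.random.default_rng(5)
m = 200
X = rng.normal(size=(m, 2))

# A small, fixed class F of {-1, +1}-valued linear classifiers (hypothetical example).
W = rng.normal(size=(50, 2))
F_values = np.sign(X @ W.T)          # shape (m, |F|): f(x_i) for every f in F

# Monte Carlo estimate of the empirical Rademacher complexity:
# E_sigma [ sup_f | (2/m) * sum_i sigma_i f(x_i) | ].
n_trials = 2000
sigma = rng.choice([-1, 1], size=(n_trials, m))
sup_terms = np.abs(sigma @ F_values * (2.0 / m)).max(axis=1)
print("estimated Rademacher complexity:", sup_terms.mean())
```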
Rademacher Risk Bound • The following bound holds true with probability (1 − δ): the risk is at most the empirical error Ê_m plus a term proportional to the Rademacher complexity R_m(F) plus a confidence term that shrinks like sqrt(ln(1/δ) / m). • Where: Ê_m is the error on the input data measured through a loss function h(·) with Lipschitz constant L. • The loss function can be one of: Vapnik's loss or Bartlett & Mendelson's loss.