SVM and SVR as Convex Optimization Techniques
Mohammed Nasser, Department of Statistics, Rajshahi University, Rajshahi 6205
Acknowledgement
Andrew W. Moore, Professor, School of Computer Science, Carnegie Mellon University
Kenji Fukumizu, Institute of Statistical Mathematics, ROIS; Department of Statistical Science, Graduate University for Advanced Studies
• Georgi Nalbantov, Econometric Institute, School of Economics, Erasmus University Rotterdam
Contents • Glimpses of Historical Development • Optimal Separating Hyperplane • Soft Margin Support Vector Machine • Support Vector Regression • Convex Optimization • Use of Lagrange and Duality Theory • Example • Conclusion
Early History • In 1900 Karl Pearson published his famous article on goodness of fit, later judged one of the twelve best scientific articles of the twentieth century. • In 1902 Jacques Hadamard pointed out that mathematical models of physical phenomena should have the properties that • a solution exists, • the solution is unique, and • the solution depends continuously on the data, in some reasonable topology (a well-posed problem).
Early History • In 1940 Fréchet, a PhD student of Hadamard, strongly criticized the mean and standard deviation as measures of location and scale respectively, but he only expressed his belief in the development of a better statistics without proposing any alternative. • During the sixties and seventies Tukey, Huber and Hampel tried to develop robust statistics in order to remove the ill-posedness of classical statistics. • Robustness means insensitivity to minor changes in both model and sample, high tolerance to major changes, and good performance at the model. • The onslaught of data mining and the problems of non-linearity and non-vectorial data have made robust statistics somewhat less attractive. Let us see what kernel methods (KM) present.
Recent History • Support Vector Machines (SVM), introduced in COLT-92 (Conference on Learning Theory), have been greatly developed since then. • Result: a class of algorithms for pattern recognition (kernel machines). • Now: a large and diverse community, from machine learning, optimization, statistics, neural networks, functional analysis, etc. • Centralized website: www.kernel-machines.org • First textbook (2000): see www.support-vector.net • Now (2012): at least twenty books of different tastes are available in the international market. • The book "The Elements of Statistical Learning" (2001) by Hastie, Tibshirani and Friedman went into a second edition within seven years.
Kernel methods: Heuristic View • What is the common characteristic (structure) among the following statistical methods? 1. Principal components analysis 2. (Ridge) regression 3. Fisher discriminant analysis 4. Canonical correlation analysis 5. Singular value decomposition 6. Independent component analysis • In all of them we consider linear combinations of the input vector, and we make use of the concepts of length and dot product available in Euclidean space.
Kernel methods: Heuristic View • Linear learning typically has nice properties: unique optimal solutions, fast learning algorithms, and better statistical analysis. • But it has one big problem: insufficient capacity. That means, in many data sets it fails to detect nonlinear relationships among the variables. • The other demerit: it cannot handle non-vectorial data.
Kernel Methods • Outlier detection • Test of independence • Data depth function • Test of equality of distributions • and more.
Kernel methods: Heuristic View • In classical multivariate analysis we consider linear combinations of the input vector, making use of the concepts of length and dot product available in Euclidean space. • In modern multivariate analysis we consider linear combinations of the feature vector, making use of the concepts of length and dot product/inner product available in Euclidean/non-Euclidean space.
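As a brief sketch of this contrast (the symbols Φ, k, αi and H are the standard kernel-methods notation, introduced here for illustration rather than taken from the slides):

```latex
\[
\text{Classical: } f(x) = \langle w, x\rangle = \sum_{j=1}^{d} w_j x_j,
\qquad w, x \in \mathbb{R}^{d};
\]
\[
\text{Modern: } f(x) = \langle w, \Phi(x)\rangle_{\mathcal H}
 = \sum_{i=1}^{n} \alpha_i\, k(x_i, x),
\qquad k(x, x') = \langle \Phi(x), \Phi(x')\rangle_{\mathcal H},
\]
```

where Φ maps inputs into a (possibly non-Euclidean) feature space H and k is the associated kernel.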
Some Review of College Geometry (figure): the line y + x − 1 = 0 passes through (1, 1) and is the same set of points as ky + kx − k = 0 for any k ≠ 0; its normal direction is at 90° to it, and it divides the plane into the signed regions y + x − 1 > 0 and y + x − 1 < 0. Multiplying by k has a different effect on the two signed regions (a negative k swaps which side is positive).
Some Review of College Geometry, in general form (figure): the hyperplane wx + b = 0, with normal vector w at 90° to it, is the same set of points as kwx + kb = 0 for any k ≠ 0; it separates the regions wx + b > 0 and wx + b < 0, and scaling by k has a different effect on the two signed regions.
Some Review of College Geometry, in general form (figure): changing b translates the hyperplane wx + b = 0 parallel to itself, while changing w changes its orientation (w stays normal, at 90°, to the hyperplane).
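A minimal numerical sketch of these facts (the point and the scale factors are made up for illustration): scaling (w, b) by k rescales the signed value wx + b, and a negative k swaps the two signed regions, but the boundary itself and the geometric distance |wx + b| / ||w|| are unchanged.

```python
import numpy as np

w = np.array([1.0, 1.0])   # normal vector of the line y + x - 1 = 0
b = -1.0
x = np.array([2.0, 0.5])   # an arbitrary test point

def signed_value(w, b, x):
    """Value of w.x + b; its sign tells which side of the hyperplane x lies on."""
    return w @ x + b

def distance(w, b, x):
    """Geometric (unsigned) distance from x to the hyperplane w.x + b = 0."""
    return abs(w @ x + b) / np.linalg.norm(w)

for k in (1.0, 2.0, -3.0):
    print(k, signed_value(k * w, k * b, x), distance(k * w, k * b, x))
# the signed value scales with k (and flips sign for negative k),
# but the distance is invariant under scaling of (w, b)
```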
Linear Kernel: k(x, z) = xTz for x, z in Rd. Let H be its RKHS; then H = { fw(·) = wT(·) : w in Rd }, with ||fw||H = ||w||. It can be shown that fw(x) = ⟨fw, k(x, ·)⟩H, the reproducing property.
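A small numerical sketch of the linear kernel (the sample matrix X and the vector w are random, made-up data): the Gram matrix XXT is positive semi-definite, and any fw whose w lies in the span of the samples can be written as a combination of the kernel sections k(xi, ·), so that its values are K·alpha.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))          # 5 sample points in R^3 (made-up data)
w = rng.normal(size=3)               # f_w(x) = w.x, an element of the RKHS

K = X @ X.T                          # Gram matrix of the linear kernel k(x, z) = x.z
print("PSD check, smallest eigenvalue:", np.linalg.eigvalsh(K).min())

# write w as a combination of the sample points: w = X.T @ alpha
alpha = np.linalg.pinv(X.T) @ w
# then f_w(x_j) = sum_i alpha_i k(x_i, x_j), i.e. X @ w == K @ alpha
print(np.allclose(X @ w, K @ alpha))
```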
Linear Classifiers (figures: data points labelled +1 and −1, classified by f(x, w, b) = sign(wx + b); the line wx + b = 0 separates the region wx + b > 0 from wx + b < 0). How would you classify this data? Several different separating lines are shown over a sequence of data sets; any of these would be fine, but which is the best? A poorly chosen line can leave a point misclassified to the +1 class.
Classifier Margin (figure: the same data, with f(x, w, b) = sign(wx + b)). Define the margin of a linear classifier as the width by which the boundary could be increased before hitting a datapoint.
Maximum Margin (figure: f(x, w, b) = sign(wx + b) with points denoting +1 and −1). The maximum margin linear classifier is the linear classifier with the, um, maximum margin. This is the simplest kind of SVM (called an LSVM, linear SVM). Support vectors are those datapoints that the margin pushes up against. • Maximizing the margin is good according to intuition and PAC theory. • It implies that only the support vectors are important; the other training examples are ignorable. • Empirically it works very, very well.
Linear SVM Mathematically — Our Goal • 1) Correctly classify all training data: wxi + b ≥ 1 if yi = +1 and wxi + b ≤ −1 if yi = −1, i.e. yi(wxi + b) ≥ 1 for all i. • 2) Maximize the margin M = 2/||w||, which is the same as minimizing ½ wTw.
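A one-line sketch of why maximizing the margin is the same as minimizing ½ wTw: the distance between the two margin hyperplanes wx + b = 1 and wx + b = −1 is

```latex
\[
M \;=\; \frac{|1 - (-1)|}{\lVert w\rVert} \;=\; \frac{2}{\lVert w\rVert},
\qquad\text{so}\qquad
\max_{w,b}\ \frac{2}{\lVert w\rVert}
\;\Longleftrightarrow\;
\min_{w,b}\ \tfrac{1}{2}\lVert w\rVert^{2}.
\]
```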
Linear SVM Mathematically • We can formulate a quadratic optimization problem and solve for w and b: minimize Φ(w) = ½ wTw (a strictly convex quadratic function) subject to yi(wxi + b) ≥ 1 for all i (linear inequality constraints).
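A minimal sketch of this quadratic program, assuming the CVXPY modelling library and a made-up, linearly separable toy data set (the arrays X and y are hypothetical):

```python
import numpy as np
import cvxpy as cp

# made-up, linearly separable toy data
X = np.array([[2.0, 2.0], [2.5, 3.0], [3.0, 2.5],
              [0.5, 0.5], [1.0, 0.2], [0.2, 1.0]])
y = np.array([1, 1, 1, -1, -1, -1])

n, d = X.shape
w = cp.Variable(d)
b = cp.Variable()

# minimize (1/2) w'w  subject to  y_i (w.x_i + b) >= 1 for all i
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print("w =", w.value, " b =", b.value)
```

In practice the same problem is usually attacked through its Lagrangian dual, which is where the duality theory listed in the Contents comes in.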
Soft Margin Classification (figure: the hyperplanes wx + b = 1, wx + b = 0 and wx + b = −1, with slack variables ξ2, ξ7, ξ11 marking points on the wrong side of their margin). Slack variables ξi can be added to allow misclassification of difficult or noisy examples. What should our quadratic optimization criterion be? Minimize ½ wTw + C Σξi.
Hard Margin vs. Soft Margin • The old formulation: find w and b such that Φ(w) = ½ wTw is minimized and, for all {(xi, yi)}, yi(wTxi + b) ≥ 1. • The new formulation, incorporating slack variables: find w and b such that Φ(w) = ½ wTw + CΣξi is minimized and, for all {(xi, yi)}, yi(wTxi + b) ≥ 1 − ξi and ξi ≥ 0 for all i. • The parameter C can be viewed as a way to control overfitting.
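As a hedged illustration of the role of C, here is a sketch using scikit-learn's SVC with a linear kernel on a made-up toy set containing one noisy point (all data values are hypothetical):

```python
import numpy as np
from sklearn.svm import SVC

# made-up toy data with one noisy -1 point sitting among the +1 class
X = np.array([[2.0, 2.0], [2.5, 3.0], [1.8, 2.2],
              [0.5, 0.5], [1.0, 0.2], [2.1, 1.9]])
y = np.array([1, 1, 1, -1, -1, -1])

# C trades margin width against slack: a small C tolerates more violations,
# a very large C approaches the hard-margin formulation.
clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

print("w =", clf.coef_, " b =", clf.intercept_)
print("support vectors:\n", clf.support_vectors_)
```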
Linear Support Vector Regression — a marketing problem (figure: scatter plot of Expenditures against Age). Given variables: person's age, income group, season, holiday duration, location, number of children, etc. (12 variables). Predict: the level of holiday expenditures.
Linear Support Vector Regression (figures: three fits of Expenditures against Age): the "lazy case" (underfitting), the "suspiciously smart case" (overfitting), and the "compromise case", SVR (good generalizability).
Linear Support Vector Regression • The epsilon-insensitive loss function (figure: penalty plotted against error; the penalty is 0 for errors inside the ε-tube and then rises linearly, at 45°, outside it).
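A minimal sketch of that loss (the function name and the ε value are illustrative, not from the slides): the penalty is zero whenever the prediction lies within ε of the target and grows linearly, with unit slope, outside that tube.

```python
import numpy as np

def eps_insensitive_loss(y_true, y_pred, eps=1.0):
    """max(0, |y_true - y_pred| - eps): zero inside the eps-tube, linear outside."""
    return np.maximum(0.0, np.abs(y_true - y_pred) - eps)

errors = np.array([-3.0, -1.0, -0.5, 0.0, 0.5, 2.0, 4.0])
print(eps_insensitive_loss(errors, np.zeros_like(errors)))
# -> [2. 0. 0. 0. 0. 1. 3.]
```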
Linear Support Vector Regression (the same three fits of Expenditures against Age, now drawn with their ε-tubes): the "lazy case" (underfitting) has the biggest tube area, the "suspiciously smart case" (overfitting) a small one, and the "compromise case", SVR (good generalizability), a middle-sized one; the points lying on or outside the tube are the "support vectors". • The thinner the "tube", the more complex the model.
Non-linear Support Vector Regression (figure: Expenditures against Age with a curved fit) • Map the data into a higher-dimensional space and perform linear SVR there.
Non-linear Support Vector Regression (figure: Expenditures against Age) • Finding the value of a new point:
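In the dual form this is the usual kernel expansion f(x) = Σi (αi − αi*) k(xi, x) + b. A hedged sketch with scikit-learn's SVR (RBF kernel; the data and the hyperparameter values are made up for illustration):

```python
import numpy as np
from sklearn.svm import SVR

# made-up toy data: holiday expenditures against age
age = np.array([[18.], [25.], [31.], [40.], [47.], [55.], [63.], [70.]])
expenditures = np.array([300., 520., 640., 710., 690., 600., 450., 380.])

# The RBF kernel implicitly maps age into a higher-dimensional space;
# epsilon is the half-width of the insensitive tube, C the slack trade-off.
model = SVR(kernel="rbf", C=100.0, epsilon=20.0, gamma=0.01)
model.fit(age, expenditures)

new_age = np.array([[35.]])
print("predicted expenditure:", model.predict(new_age))
```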
Linear SVR: Derivation (figure: Expenditures against Age) • Given training data (x1, y1), …, (xn, yn) • Find w and b such that f(x) = wTx + b optimally describes the data. (1)
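A hedged sketch of where this derivation leads: the standard ε-insensitive linear SVR primal, with slack variables ξi, ξi* and trade-off parameter C, in notation matching the classification case above.

```latex
\[
\min_{w,\;b,\;\xi,\;\xi^{*}}\ \tfrac{1}{2}\lVert w\rVert^{2}
 + C\sum_{i=1}^{n}\bigl(\xi_{i}+\xi_{i}^{*}\bigr)
\quad\text{subject to}\quad
\begin{cases}
y_{i} - \bigl(w^{\mathsf T}x_{i} + b\bigr) \le \varepsilon + \xi_{i},\\[2pt]
\bigl(w^{\mathsf T}x_{i} + b\bigr) - y_{i} \le \varepsilon + \xi_{i}^{*},\\[2pt]
\xi_{i},\ \xi_{i}^{*} \ge 0, \qquad i = 1,\dots,n.
\end{cases}
\]
```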