Explore soft-margin SVM optimization and extensions covering primal, dual, and classification-calibration theory in machine learning. Understand the zero-one loss, hinge loss, and calibration principles in relation to SVMs. Delve into the effects of different hyperparameters and optimization tricks to enhance SVM performance. Learn about the history of optimization algorithms and the applications of SVM for regression and large-scale training scenarios.
CS480/680: Intro to ML, Lecture 08: Soft-margin SVM (Yao-Liang Yu)
Outline • Formulation • Dual • Optimization • Extension
Hard-margin SVM: the primal and dual formulations, with hard constraints.
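The formulas themselves appeared only as images on the slide; as a reference, the standard hard-margin pair is (my notation: n training pairs (x_i, y_i) with y_i in {-1, +1}):

Primal:  \min_{w,b}\ \tfrac{1}{2}\|w\|_2^2 \quad \text{s.t.}\quad y_i(w^\top x_i + b) \ge 1,\ i=1,\dots,n

Dual:  \max_{\alpha \ge 0}\ \sum_{i=1}^n \alpha_i - \tfrac{1}{2}\sum_{i,j} \alpha_i\alpha_j y_i y_j\, x_i^\top x_j \quad \text{s.t.}\quad \sum_{i=1}^n \alpha_i y_i = 0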
What if the data are not linearly separable?
Soft-margin SVM (Cortes & Vapnik '95). Compared with the hard-margin primal, the hard constraints become soft constraints: the objective trades a term proportional to 1/margin against the training error, weighted by a hyper-parameter C, and w^T x_i + b is the prediction without the sign.
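For reference, the standard soft-margin primal (again in my notation), with slack variables ξ_i and hyper-parameter C:

\min_{w,b,\xi}\ \tfrac{1}{2}\|w\|_2^2 + C\sum_{i=1}^n \xi_i \quad \text{s.t.}\quad y_i(w^\top x_i + b) \ge 1 - \xi_i,\ \ \xi_i \ge 0

Eliminating ξ gives the equivalent unconstrained hinge-loss form \min_{w,b}\ \tfrac{1}{2}\|w\|_2^2 + C\sum_i \max(0,\, 1 - y_i(w^\top x_i + b)).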
Zero-one loss • Find a prediction rule f so that, on an unseen random X, our prediction sign(f(X)) has a small chance of differing from the true label Y.
The hinge loss: an upper bound on the zero-one loss. Correctly classified points with margin below 1 still suffer a loss! Related surrogates: the squared hinge, exponential, and logistic losses.
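Written in terms of the margin t = y f(x), the losses named on the slide are (standard definitions; the exact scaling of the logistic loss varies by presentation):

zero-one: \mathbf{1}[t \le 0]
hinge: \max(0,\, 1 - t)
squared hinge: \max(0,\, 1 - t)^2
exponential: e^{-t}
logistic: \log_2(1 + e^{-t})  (base 2 so that it upper-bounds the zero-one loss)

Each surrogate upper-bounds the zero-one loss, and the hinge loss is nonzero even for correctly classified points whose margin is below 1.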
Classification-calibration • We want to minimize the zero-one loss • We end up minimizing some other loss. Theorem (Bartlett, Jordan, McAuliffe '06). A convex margin loss ℓ is classification-calibrated iff ℓ is differentiable at 0 and ℓ'(0) < 0. Classification calibration: the minimizer of the expected surrogate loss has the same sign as 2η(x) − 1, where η(x) = P(Y = 1 | X = x), i.e., it agrees with the Bayes rule.
Outline • Formulation • Dual • Optimization • Extension
Important optimization trick: a minimization involving a pointwise maximum can be rewritten as a joint minimization over x and an auxiliary variable t.
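In generic form (the slide's concrete instance was graphical, so this is my reconstruction):

\min_x\ \max(a(x),\, b(x)) \;=\; \min_{x,\,t}\ t \quad \text{s.t.}\quad t \ge a(x),\ t \ge b(x)

Applied to the hinge loss, \max(0,\, 1 - y_i(w^\top x_i + b)) becomes the smallest \xi_i satisfying \xi_i \ge 0 and \xi_i \ge 1 - y_i(w^\top x_i + b), which is exactly how the slack variables enter the soft-margin primal.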
Slack for a "wrong" prediction: ξ_i is positive exactly when example i violates the margin.
The Lagrangian of the soft-margin primal.
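Introducing multipliers α_i ≥ 0 for the margin constraints and β_i ≥ 0 for ξ_i ≥ 0, the Lagrangian is (standard form, my notation):

L(w, b, \xi, \alpha, \beta) = \tfrac{1}{2}\|w\|_2^2 + C\sum_i \xi_i - \sum_i \alpha_i\big[y_i(w^\top x_i + b) - 1 + \xi_i\big] - \sum_i \beta_i \xi_i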
The dual problem: only dot products between training points are needed!
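Setting the derivatives of the Lagrangian with respect to w, b, ξ to zero and substituting back gives the standard soft-margin dual, in which the data enter only through the dot products x_i^\top x_j:

\max_{\alpha}\ \sum_{i=1}^n \alpha_i - \tfrac{1}{2}\sum_{i,j} \alpha_i\alpha_j y_i y_j\, x_i^\top x_j \quad \text{s.t.}\quad 0 \le \alpha_i \le C,\ \ \sum_{i=1}^n \alpha_i y_i = 0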
The effect of C (the primal variables live in R^d × R, the dual variables in R^n) • What happens as C → 0? • As C → ∞?
Karush-Kuhn-Tucker conditions • Primal constraints on w, b and ξ • Dual constraints on α and β • Complementary slackness • Stationarity
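Spelled out for the soft-margin problem (standard KKT system, my notation):

Primal feasibility:  y_i(w^\top x_i + b) \ge 1 - \xi_i,\quad \xi_i \ge 0
Dual feasibility:  \alpha_i \ge 0,\quad \beta_i \ge 0
Complementary slackness:  \alpha_i\big[y_i(w^\top x_i + b) - 1 + \xi_i\big] = 0,\quad \beta_i\,\xi_i = 0
Stationarity:  w = \sum_i \alpha_i y_i x_i,\quad \sum_i \alpha_i y_i = 0,\quad \alpha_i + \beta_i = C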
Parsing the equations: if α_i = 0 then ξ_i = 0 and x_i is on the correct side of the margin; if 0 < α_i < C then x_i lies exactly on the margin; if α_i = C then x_i may violate the margin (ξ_i ≥ 0).
Support vectors: the training examples with α_i > 0; by stationarity, w = Σ_i α_i y_i x_i depends only on them.
Recover b • Take any i such that 0 < α_i < C • Then x_i is on the margin hyperplane: y_i(w^T x_i + b) = 1 • How to recover ξ? • What if there is no such i?
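Concretely (standard recovery argument): for any i with 0 < α_i < C, complementary slackness forces β_i > 0, hence ξ_i = 0 and y_i(w^\top x_i + b) = 1, so

b = y_i - w^\top x_i \qquad \text{(often averaged over all such } i \text{ for numerical stability)}

Once b is known, the slacks follow from \xi_i = \max(0,\, 1 - y_i(w^\top x_i + b)).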
More examples
Outline • Formulation • Dual • Optimization • Extension
Gradient Descent • Step size (learning rate): constant if the objective L is smooth, diminishing otherwise • Each step uses the (generalized) gradient over all n examples, so it costs O(nd)! A sketch follows below.
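As a concrete (unofficial) illustration, here is a minimal NumPy sketch of full-batch subgradient descent on the primal hinge-loss objective \tfrac{1}{2}\|w\|^2 + C\sum_i \max(0, 1 - y_i(w^\top x_i + b)); the function name, step-size schedule, and defaults are my own choices, not the course's.

import numpy as np

def svm_subgradient_descent(X, y, C=1.0, lr=0.01, n_iters=1000):
    """Full-batch subgradient descent on the soft-margin SVM primal.

    X: (n, d) array of features; y: (n,) array of labels in {-1, +1}.
    Cost per iteration is O(n*d), matching the slide's note.
    """
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for t in range(n_iters):
        margins = y * (X @ w + b)          # y_i (w^T x_i + b)
        violated = margins < 1             # points whose hinge term is active
        # Subgradient of 0.5*||w||^2 + C * sum_i max(0, 1 - margins_i)
        grad_w = w - C * (X[violated].T @ y[violated])
        grad_b = -C * np.sum(y[violated])
        step = lr / np.sqrt(t + 1)         # diminishing step (objective is non-smooth)
        w -= step * grad_w
        b -= step * grad_b
    return w, b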
Stochastic Gradient Descent (SGD) • diminishing step size, e.g., 1/sqrt(t) or 1/t • refinements: averaging, momentum, variance reduction, etc. • in practice, sample without replacement: cycle or permute in each pass • instead of averaging the gradient over all n samples, a single random sample suffices, so each update costs only O(d); see the sketch below.
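A matching stochastic sketch (again with my own names and defaults): each step touches a single random example, so the update costs O(d).

import numpy as np

def svm_sgd(X, y, C=1.0, lr=0.1, n_epochs=10, seed=0):
    """Per-example subgradient SGD for the soft-margin SVM primal."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    t = 0
    for _ in range(n_epochs):
        for i in rng.permutation(n):       # permute in each pass
            t += 1
            step = lr / np.sqrt(t)         # diminishing step size
            grad_w = w / n                 # regularizer spread across the n samples
            grad_b = 0.0
            if y[i] * (X[i] @ w + b) < 1:  # hinge term is active for this sample
                grad_w = grad_w - C * y[i] * X[i]
                grad_b = -C * y[i]
            w -= step * grad_w
            b -= step * grad_b
    return w, b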
The derivative of the hinge loss: −1 when the margin is below 1, 0 when it is above, and a subgradient at the kink. What about the perceptron? What about the zero-one loss? All of the other losses are differentiable.
Solving the dual • Can choose a constant step size η_t = η • Each full gradient of the dual costs O(n²) • Faster algorithms exist: e.g., choose a pair α_p and α_q and derive a closed-form update (the idea behind SMO); see the sketch below.
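One possible sketch (not the slide's own algorithm): if the bias b is dropped, the equality constraint \sum_i \alpha_i y_i = 0 disappears and a simple coordinate-wise closed-form update works; the pairwise update the slide hints at handles the constraint by moving two coordinates at once.

import numpy as np

def svm_dual_coordinate_ascent(X, y, C=1.0, n_epochs=10, seed=0):
    """Coordinate ascent on the soft-margin SVM dual with the bias term dropped.

    Maintains w = sum_i alpha_i y_i x_i so each coordinate update is O(d).
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    alpha = np.zeros(n)
    w = np.zeros(d)
    Q_diag = np.einsum("ij,ij->i", X, X)       # x_i . x_i
    for _ in range(n_epochs):
        for i in rng.permutation(n):
            if Q_diag[i] <= 0:                 # skip all-zero rows
                continue
            G = y[i] * (X[i] @ w) - 1.0        # gradient of the negated dual in alpha_i
            new_alpha = np.clip(alpha[i] - G / Q_diag[i], 0.0, C)
            w += (new_alpha - alpha[i]) * y[i] * X[i]
            alpha[i] = new_alpha
    return alpha, w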
A little history on optimization • Gradient descent first mentioned in (Cauchy, 1847) • First rigorous convergence proof in (Curry, 1944) • SGD proposed and analyzed by (Robbins & Monro, 1951)
Herbert Robbins (1915–2001)
Outline • Formulation • Dual • Optimization • Extension
Multiclass (Crammer & Singer '01): separate the prediction for the correct class from the predictions for the wrong classes by a "safety margin" (see the formulation below) • The soft-margin version is similar • Many other variants exist • The calibration theory is more involved.
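In the usual formulation (my notation): one weight vector w_k per class, prediction \hat{y}(x) = \arg\max_k w_k^\top x, and the hard-margin constraints require the correct class to win by a margin of 1,

w_{y_i}^\top x_i - w_k^\top x_i \ge 1 \quad \text{for all } k \ne y_i.

The soft-margin version uses a single slack per example:

\min_{\{w_k\},\,\xi}\ \tfrac{1}{2}\sum_k \|w_k\|_2^2 + C\sum_i \xi_i \quad \text{s.t.}\quad w_{y_i}^\top x_i - w_k^\top x_i \ge 1 - \xi_i\ \ \forall k \ne y_i,\ \ \xi_i \ge 0.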
Regression (Drucker et al. '97): support vector regression.
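The slide's content was graphical; for reference, the standard ε-insensitive support vector regression primal (an assumed reconstruction, not taken from the slide) is

\min_{w,b,\xi,\xi^*}\ \tfrac{1}{2}\|w\|_2^2 + C\sum_i (\xi_i + \xi_i^*) \quad \text{s.t.}\quad y_i - w^\top x_i - b \le \epsilon + \xi_i,\quad w^\top x_i + b - y_i \le \epsilon + \xi_i^*,\quad \xi_i, \xi_i^* \ge 0,

i.e., errors within the ε-tube are not penalized at all.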
Large-scale training (You, Demmel, et al. '17) • Randomly partition the training data evenly into p nodes • Train an SVM independently on each node • Compute the data center on each node • For a test sample: find the nearest center (node / SVM) and predict using the corresponding node's SVM. A sketch follows below.
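A small sketch of this divide-and-conquer scheme (my own structuring; scikit-learn's LinearSVC is plugged in as the per-node solver purely for illustration, which is an assumption, not something the slide specifies).

import numpy as np
from sklearn.svm import LinearSVC

def fit_partitioned_svm(X, y, p=4, seed=0):
    """Randomly split the data into p parts, train one SVM per part,
    and remember each part's center for routing at test time."""
    rng = np.random.default_rng(seed)
    idx_parts = np.array_split(rng.permutation(len(X)), p)
    models, centers = [], []
    for idx in idx_parts:
        models.append(LinearSVC().fit(X[idx], y[idx]))
        centers.append(X[idx].mean(axis=0))
    return models, np.stack(centers)

def predict_partitioned_svm(models, centers, X_test):
    """Route each test point to the node with the nearest center.
    Assumes integer class labels (e.g., -1/+1)."""
    dists = np.linalg.norm(X_test[:, None, :] - centers[None, :, :], axis=2)
    nearest = dists.argmin(axis=1)              # index of nearest center per test point
    preds = np.empty(len(X_test), dtype=int)
    for j, model in enumerate(models):
        mask = nearest == j
        if mask.any():
            preds[mask] = model.predict(X_test[mask])
    return preds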
Questions?