Pattern Analysis using Convex Optimization: Part 2 of Chapter 7 Discussion Presenter: Brian Quanz
About today’s discussion… • Last time: discussed convex opt. • Today: Will apply what we learned to 4 pattern analysis problems given in book: • (1) Smallest enclosing hypersphere (one-class SVM) • (2) SVM classification • (3) Support vector regression (SVR) • (4) On-line classification and regression
About today’s discussion… • This time for the most part: • Describe problems • Derive solutions ourselves on the board! • Apply convex opt. knowledge to solve • Mostly board work today
Recall: KKT Conditions • What we will use: • Key to remember ch. 7: • Complementary slackness -> sparse dual rep. • Convexity -> efficient global solution
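For reference, the standard KKT conditions for a convex problem min f(w) s.t. g_i(w) <= 0, h_j(w) = 0, with Lagrangian L(w, α, β) = f(w) + Σ_i α_i g_i(w) + Σ_j β_j h_j(w), are:

    \nabla_w L(w^*, \alpha^*, \beta^*) = 0                       (stationarity)
    g_i(w^*) \le 0, \quad h_j(w^*) = 0                           (primal feasibility)
    \alpha_i^* \ge 0                                             (dual feasibility)
    \alpha_i^* \, g_i(w^*) = 0                                   (complementary slackness)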
Novelty Detection: Hypersphere • From the training data, learn the support of the distribution • Capture it with a hypersphere • Points falling outside are 'novel', 'abnormal', or anomalies • A smaller sphere gives more fine-tuned novelty detection
1st: Smallest Enclosing Hypersphere • Given: a training set S = {x_1, …, x_ℓ} • Find the center c of the smallest hypersphere containing S
S.E.H. Optimization Problem • O.P.: • Let’s solve using Lagrangian and KKT and discuss
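A sketch of the optimization problem and the derivation we work through, in standard form (φ the feature map, k the kernel):

    \min_{c, r}\; r^2 \quad \text{s.t.} \quad \|\phi(x_i) - c\|^2 \le r^2, \; i = 1, \dots, \ell
    L(c, r, \alpha) = r^2 + \sum_i \alpha_i \left( \|\phi(x_i) - c\|^2 - r^2 \right)
    \partial L / \partial r = 0 \;\Rightarrow\; \sum_i \alpha_i = 1, \qquad \partial L / \partial c = 0 \;\Rightarrow\; c = \sum_i \alpha_i \phi(x_i)
    \max_{\alpha}\; \sum_i \alpha_i k(x_i, x_i) - \sum_{i,j} \alpha_i \alpha_j k(x_i, x_j) \quad \text{s.t.} \quad \sum_i \alpha_i = 1, \; \alpha_i \ge 0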
S.E.H.: Solution • The resulting novelty-detection function uses the Heaviside step H(x) = 1 if x >= 0, 0 otherwise: f(x) = H(||φ(x) − c||² − r²) • By strong duality (convexity), the dual optimum equals the primal optimum at the solution
Hypersphere that only contains some data – soft hypersphere • Balance missing some points against reducing the radius • Robustness: a single outlying point could throw off the whole sphere • Introduce slack variables (the approach we will see repeatedly) • Slack is 0 for points within the sphere, the squared distance beyond it otherwise
Hypersphere optimization problem • Now with trade off between radius and training point error: • Let’s derive solution again
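A sketch of the soft version in standard form, writing the trade-off parameter as C (an equivalent parameterization may be used):

    \min_{c, r, \xi}\; r^2 + C \sum_i \xi_i \quad \text{s.t.} \quad \|\phi(x_i) - c\|^2 \le r^2 + \xi_i, \; \xi_i \ge 0

The dual is the same as before except that the constraint on the multipliers becomes the box constraint 0 ≤ α_i ≤ C.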
Remarks • If the data lies in a subspace of the feature space: • The hypersphere overestimates the support in the perpendicular directions • Can use kernel PCA first (next week's discussion) • If the data is normalized (k(x,x) = 1): • The problem corresponds to separating the data from the origin with a hyperplane (the one-class SVM view)
Maximal Margin Classifier • Data and a linear classifier f(x) = ⟨w, φ(x)⟩ + b • Hinge loss, with margin γ • Linearly separable if there exist w, b such that y_i(⟨w, φ(x_i)⟩ + b) ≥ γ for all i, for some γ > 0
Typical formulation • The typical formulation fixes the functional margin γ to 1 and lets w vary, since rescaling (w, b) doesn't affect the decision boundary; the geometric margin is then proportional to 1/||w|| • Here we instead fix ||w|| = 1 and let the functional margin γ vary
Hard Margin SVM • Arrive at optimization problem • Let’s solve
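A sketch of the two equivalent primal formulations (the fixed-norm form used on these slides, and the more common fixed-functional-margin form):

    \max_{w, b, \gamma}\; \gamma \quad \text{s.t.} \quad y_i(\langle w, \phi(x_i)\rangle + b) \ge \gamma, \; \|w\|^2 = 1
    \min_{w, b}\; \tfrac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i(\langle w, \phi(x_i)\rangle + b) \ge 1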
Solution • Recall the KKT conditions above; applying them to the Lagrangian gives the dual problem and a sparse expansion for w
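A sketch of the resulting dual and solution, written for the fixed-functional-margin form (the fixed-norm form gives an equivalent dual up to normalization):

    \max_{\alpha}\; \sum_i \alpha_i - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j k(x_i, x_j) \quad \text{s.t.} \quad \sum_i \alpha_i y_i = 0, \; \alpha_i \ge 0
    w^* = \sum_i \alpha_i^* y_i \phi(x_i), \qquad f(x) = \operatorname{sgn}\Big( \sum_i \alpha_i^* y_i k(x_i, x) + b^* \Big)

Complementary slackness gives α_i* > 0 only for points lying exactly on the margin — the support vectors, hence the sparse dual representation.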
Soft Margin Classifier • Non-separable case – introduce slack variables as before • Trade the margin off against the 1-norm of the slack (error) vector
Solve Soft Margin SVM • Let’s solve it!
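A sketch of the 1-norm soft margin problem and its dual; as with the hypersphere, the only change from the hard margin dual is the box constraint:

    \min_{w, b, \xi}\; \tfrac{1}{2}\|w\|^2 + C \sum_i \xi_i \quad \text{s.t.} \quad y_i(\langle w, \phi(x_i)\rangle + b) \ge 1 - \xi_i, \; \xi_i \ge 0
    \max_{\alpha}\; \sum_i \alpha_i - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j k(x_i, x_j) \quad \text{s.t.} \quad \sum_i \alpha_i y_i = 0, \; 0 \le \alpha_i \le C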
Support Vector Regression • Similar idea to classification, except turned inside-out: we want points inside a band around the function rather than outside the margin • ε-insensitive loss instead of hinge loss • Compare ridge regression, which uses the squared-error loss
Support Vector Regression • But we want to encourage sparseness in the dual representation • For that we need inequality constraints • Hence the ε-insensitive loss
Epsilon-insensitive • Defines band around function for 0-loss
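In symbols, the ε-insensitive loss of f on an example (x, y) is

    |y - f(x)|_\varepsilon = \max\big(0, \; |y - f(x)| - \varepsilon\big)

so any prediction within ε of the target incurs zero loss.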
SVR (linear epsilon) • Opt. problem: • Let’s solve again
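A sketch of the standard ε-SVR primal, with slacks ξ, ξ̂ for errors above and below the band:

    \min_{w, b, \xi, \hat\xi}\; \tfrac{1}{2}\|w\|^2 + C \sum_i (\xi_i + \hat\xi_i)
    \text{s.t.} \quad (\langle w, \phi(x_i)\rangle + b) - y_i \le \varepsilon + \xi_i, \quad y_i - (\langle w, \phi(x_i)\rangle + b) \le \varepsilon + \hat\xi_i, \quad \xi_i, \hat\xi_i \ge 0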
SVR Dual and Solution • Dual problem
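A sketch of the corresponding dual and the resulting regression function:

    \max_{\alpha, \hat\alpha}\; \sum_i y_i(\hat\alpha_i - \alpha_i) - \varepsilon \sum_i (\hat\alpha_i + \alpha_i) - \tfrac{1}{2} \sum_{i,j} (\hat\alpha_i - \alpha_i)(\hat\alpha_j - \alpha_j) k(x_i, x_j)
    \text{s.t.} \quad \sum_i (\hat\alpha_i - \alpha_i) = 0, \quad 0 \le \alpha_i, \hat\alpha_i \le C
    f(x) = \sum_i (\hat\alpha_i - \alpha_i) k(x_i, x) + b

Only points on or outside the ε-band receive non-zero coefficients, which gives the sparseness we wanted.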
Online • So far everything was batch: all data processed at once • Many tasks require data to be processed one example at a time from the start • At each step the learner: • Makes a prediction • Gets feedback (the correct value) • Updates its hypothesis • A conservative algorithm updates only when it suffers non-zero loss
Simple On-line Alg.: Perceptron • Thresholded linear function h(x) = sgn(⟨w, x⟩) • At step t+1 the weight vector is updated only if an error is made: w_{t+1} = w_t + y_i x_i • Dual update rule (zero threshold): • If y_i Σ_j α_j y_j k(x_j, x_i) ≤ 0, then α_i ← α_i + 1
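A minimal sketch of the dual (kernelized) perceptron in Python, assuming a zero threshold and a user-supplied kernel; the function and parameter names here are illustrative, not from the book.

    import numpy as np

    def kernel_perceptron(X, y, kernel, epochs=10):
        # X: (l, d) inputs, y: labels in {-1, +1}, kernel: k(x, z) -> float
        y = np.asarray(y, dtype=float)
        l = len(X)
        alpha = np.zeros(l)                                    # one dual variable per example
        K = np.array([[kernel(xi, xj) for xj in X] for xi in X])
        for _ in range(epochs):
            mistakes = 0
            for i in range(l):
                # predict with the current dual representation (zero threshold)
                if y[i] * np.dot(alpha * y, K[:, i]) <= 0:
                    alpha[i] += 1.0                            # conservative update: only on error
                    mistakes += 1
            if mistakes == 0:                                  # a full error-free pass: converged
                break
        return alpha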
Novikoff Theorem • Convergence bound for the hard-margin case • If the training points are contained in a ball of radius R around the origin • and w* is the hard margin SVM with no bias, ||w*|| = 1, and geometric margin γ • Initial weight: w_0 = 0 • Then the number of updates is bounded by (R/γ)²
Proof • From 2 inequalities: • Putting these together we have: • Which leads to bound:
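A sketch of the standard argument, where w_t is the weight vector after the t-th update, made on a mistaken example (x_i, y_i):

    \langle w^*, w_t \rangle = \langle w^*, w_{t-1} \rangle + y_i \langle w^*, x_i \rangle \ge \langle w^*, w_{t-1} \rangle + \gamma \;\Rightarrow\; \langle w^*, w_t \rangle \ge t\gamma
    \|w_t\|^2 = \|w_{t-1}\|^2 + 2 y_i \langle w_{t-1}, x_i \rangle + \|x_i\|^2 \le \|w_{t-1}\|^2 + R^2 \;\Rightarrow\; \|w_t\|^2 \le t R^2

(the middle term is non-positive because an update is only made on a mistake). Putting these together, using ||w*|| = 1:

    \sqrt{t}\, R \ge \|w_t\| \ge \langle w^*, w_t \rangle \ge t\gamma \;\Rightarrow\; t \le (R/\gamma)^2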
Kernel Adatron • A simple modification to the perceptron; it models the hard margin SVM with zero threshold • Update with learning rate η: α_i ← α_i + η(1 − y_i Σ_j α_j y_j k(x_j, x_i)), then clip α_i at 0 • At convergence each α_i stops changing: either α_i > 0 and the right-hand term is 0, or the right-hand term is negative and α_i = 0 — exactly the KKT conditions of the hard margin dual
Kernel Adatron – Soft Margin • 1-norm soft margin version: add an upper bound C to the values of α • 2-norm soft margin version: add a constant to the diagonal of the kernel matrix • To allow a variable threshold, updates must be made on a pair of examples at once • This results in SMO (sequential minimal optimization) • The rate of convergence of both algorithms is sensitive to the order in which examples are presented • Good heuristics exist, e.g. choose first the points that most violate the optimality conditions • A sketch of the 1-norm soft margin version follows
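A minimal sketch of the 1-norm soft margin kernel adatron in Python, assuming a zero threshold and clipping α to [0, C]; the names, learning rate, and stopping rule are illustrative, not from the book.

    import numpy as np

    def kernel_adatron(X, y, kernel, C=1.0, eta=0.1, epochs=100):
        # X: (l, d) inputs, y: labels in {-1, +1}, kernel: k(x, z) -> float
        y = np.asarray(y, dtype=float)
        l = len(X)
        alpha = np.zeros(l)
        K = np.array([[kernel(xi, xj) for xj in X] for xi in X])
        for _ in range(epochs):
            for i in range(l):
                # functional margin of example i under the current dual representation
                g_i = y[i] * np.dot(alpha * y, K[:, i])
                # gradient-style update, then clip to the box [0, C]
                alpha[i] = np.clip(alpha[i] + eta * (1.0 - g_i), 0.0, C)
        return alpha

With C = np.inf this reduces to the hard margin version of the previous slide; the 2-norm variant would instead add a constant to the diagonal of K and leave α unbounded above.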
On-line regression • The same on-line approach also works for the regression case • Basic gradient ascent on the SVR dual, subject to the additional constraints (a sketch follows)
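One plausible form of the update, assuming a fixed zero threshold as with the adatron (so the equality constraint drops) and a learning rate η; this is a sketch of gradient ascent on the SVR dual above, not necessarily the book's exact algorithm:

    \hat\alpha_i \leftarrow \min\!\Big(C, \; \max\big(0, \; \hat\alpha_i + \eta\,(y_i - \varepsilon - \textstyle\sum_j (\hat\alpha_j - \alpha_j) k(x_j, x_i))\big)\Big)
    \alpha_i \leftarrow \min\!\Big(C, \; \max\big(0, \; \alpha_i + \eta\,(-y_i - \varepsilon + \textstyle\sum_j (\hat\alpha_j - \alpha_j) k(x_j, x_i))\big)\Big)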
Questions • Questions, Comments?