Support Vector Machines • R&N: Chap. 20, Sec. 20.6
Machine Learning Classes • CS659 (Hauser) Principles of Intelligent Robot Motion • CS657 (Yu) Computer Vision • STAT520 (Trosset) Introduction to Statistics • STAT682 (Rocha) Statistical Model Selection
Schedule • 11/24 – This class (Support Vector Machines) • No class 11/26 (Thanksgiving) • 12/1 – Final project presentations • 12/3 – Last class, review and wrap-up
Thanksgiving week • Vote: • Class as usual (reinforcement learning, unsupervised learning) • Review class • No class
Average Grades HW1-5 • HW1: 83.2 • HW2: 85.0 • HW3: 75.6 • HW4: 81.2 • HW5: 82.4
Final Projects • Mid-term report • 1-2 paragraphs on status of project • Checkpoint for choosing HW instead of project • In-class presentation • High-level overview of: • Project topic and research questions • Demos / findings / major ideas • Relationship with material learned in class • Aim for 5 minutes (~5 slides) • Final report due any time before final
Neural Networks in the News • http://spectrum.ieee.org/computing/hardware/ibm-unveils-a-new-brain-simulator • 1 billion neurons • 10 trillion synapses • Speed: 100 times slower than a brain • Power: 1.4 MW ~= 500 homes
Agenda • Support Vector Machines • Kernels • Regularization • False negatives/positives, precision/recall curves
Motivation: Feature Mappings • Given attributes x, learn in the space of features f(x) • E.g., parity, FACE(card), RED(card) • Hope CONCEPT is easier in feature space
Example • [Figure: example data plotted in the original attribute space, with axes x1 and x2]
Example • Choose f1 = x1², f2 = x2², f3 = √2·x1x2 • [Figure: the same data plotted in feature space, with axes f1, f2, f3]
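A minimal NumPy sketch of this mapping (NumPy and the sample points are my additions; the values are arbitrary). It shows the explicit map f(x) = [x1², x2², √2·x1x2] and, anticipating the kernel trick below, that an inner product in feature space equals the squared inner product of the original attributes:

```python
import numpy as np

def quad_features(x):
    """Map (x1, x2) to (x1^2, x2^2, sqrt(2)*x1*x2)."""
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2) * x1 * x2])

# Two illustrative 2D points (values are made up).
a = np.array([1.0, 2.0])
b = np.array([3.0, -1.0])

# Inner product in feature space equals (a.b)^2 in attribute space.
print(quad_features(a) @ quad_features(b))   # 1.0
print((a @ b) ** 2)                          # 1.0
```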
SVM Intuition • Find “best” linear classifier in feature space • Hope to avoid overfitting
Maximum Margin Classification • Find the boundary that maximizes the margin between positive and negative examples • [Figure: separating boundary with the margin band highlighted]
Margin • The farther away from the boundary we are, the more “confident” the classification • [Figure: points near the boundary are marked “not as confident”, points far from it “very confident”]
Geometric Margin • The farther away from the boundary we are, the more “confident” the classification • The distance of an example to the boundary is its geometric margin
Geometric Margin • Let y(i) = -1 or +1 • Boundary θᵀx + b = 0, with ‖θ‖ = 1 • Margin of example i: y(i)(θᵀx(i) + b) • The distance of an example to the boundary is its geometric margin
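A small worked example (the boundary and points below are invented for illustration): with a unit-norm θ, each example's margin is its signed distance to the boundary, and the classifier's margin is the minimum over examples.

```python
import numpy as np

# Boundary theta^T x + b = 0 with ||theta|| = 1 (example values).
theta = np.array([0.6, 0.8])      # unit vector: 0.36 + 0.64 = 1
b = -1.0

# Labeled examples (x, y) with y in {-1, +1} (made up).
examples = [(np.array([2.0, 2.0]), +1),
            (np.array([0.0, 0.0]), -1)]

# Geometric margin of example i: y(i) * (theta^T x(i) + b).
margins = [y * (theta @ x + b) for x, y in examples]
print(margins)       # roughly [1.8, 1.0]
print(min(margins))  # the classifier's margin: 1.0
```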
Optimizing the Margin • Maximum margin classifier given by argmax_{θ,b} min_i y(i)(θᵀx(i) + b)
Optimizing the Margin • Maximum margin classifier given by argmax_{θ,b} min_i y(i)(θᵀx(i) + b) • Also written as the constrained optimization: argmax_{θ,b} γ such that y(i)(θᵀx(i) + b) ≥ γ (for all i) and ‖θ‖ = 1
Optimizing the Margin • Maximum margin classifier given by argmax_{θ,b} min_i y(i)(θᵀx(i) + b) • Also written as the constrained optimization: argmax_{θ,b} γ such that y(i)(θᵀx(i) + b) ≥ γ (for all i) and ‖θ‖ = 1 • The constraint ‖θ‖ = 1 is nasty (nonconvex)
Optimizing the Margin, part 2 • Better formulation: argmin_{w,b} ‖w‖² such that y(i)(wᵀx(i) + b) ≥ 1 (for all i) • Can be solved with “off the shelf” numerical packages (e.g., Matlab)
Deconstructing the formulation • Better formulation: argmin_{w,b} ‖w‖² such that y(i)(wᵀx(i) + b) ≥ 1 (for all i) • Letting θ = w/‖w‖, the constraint becomes y(i)(θᵀx(i) + b/‖w‖) ≥ 1/‖w‖ • So minimizing ‖w‖² maximizes the margin
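As a sketch of the “off-the-shelf solver” idea (using SciPy rather than Matlab, with a tiny invented dataset), the quadratic program above can be handed to a general-purpose constrained optimizer; a dedicated QP or SVM solver would be more robust in practice.

```python
import numpy as np
from scipy.optimize import minimize

# Tiny linearly separable 2D dataset (values invented for illustration).
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.5]])
y = np.array([+1, +1, -1, -1])

def objective(v):
    w = v[:2]
    return w @ w                          # minimize ||w||^2

# One constraint per example: y(i) (w^T x(i) + b) - 1 >= 0.
constraints = [{"type": "ineq",
                "fun": lambda v, i=i: y[i] * (v[:2] @ X[i] + v[2]) - 1.0}
               for i in range(len(y))]

res = minimize(objective, x0=np.zeros(3), constraints=constraints)
w, b = res.x[:2], res.x[2]
print(w, b)
print("margin:", 1.0 / np.linalg.norm(w))
```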
Insights • The optimal classification boundary is defined by just a few (d+1) points: the support vectors • [Figure: the margin is determined only by the support vectors]
The Magic of Duality • The Lagrangian dual problem of argmin_{w,b} ‖w‖² such that y(i)(wᵀx(i) + b) ≥ 1 (for all i) • is: max_α Σi αi − ½ Σi,j αi αj y(i) y(j) (x(i)ᵀ x(j)) such that αi ≥ 0 and Σi αi y(i) = 0 • with classification boundary w = Σi αi y(i) x(i)
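A short sketch of this relationship using scikit-learn (not mentioned on the slides; the data are invented): with a linear kernel, SVC exposes αi·y(i) for the support vectors through dual_coef_, so w = Σi αi y(i) x(i) can be reconstructed and checked against the fitted weight vector.

```python
import numpy as np
from sklearn.svm import SVC

# Toy separable data (values invented for illustration).
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.5]])
y = np.array([+1, +1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)    # large C approximates a hard margin

# dual_coef_ holds alpha_i * y(i) for the support vectors only.
alpha_times_y = clf.dual_coef_.ravel()
w_from_dual = alpha_times_y @ clf.support_vectors_   # w = sum_i alpha_i y(i) x(i)

print(w_from_dual)            # reconstructed from the dual variables
print(clf.coef_.ravel())      # the same vector, as fitted
print(clf.support_vectors_)   # only a few points carry nonzero alpha_i
```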
The Kernel Trick • Only a few of the αi's are nonzero, so wᵀx = Σi αi y(i) x(i)ᵀx can be evaluated quickly • We've rewritten everything in terms of (x(i)ᵀ x(j))… so what?
The Kernel Trick • Only a few of the αi's are nonzero, so wᵀx = Σi αi y(i) x(i)ᵀx can be evaluated quickly • We've rewritten everything in terms of (x(i)ᵀ x(j))… so what? • Replace the inner product (aᵀb) with a kernel function K(a,b)
Kernel Functions • A kernel can implicitly compute a feature mapping to a high-dimensional space, without having to construct the features!
Kernel Functions • A kernel can implicitly compute a feature mapping to a high-dimensional space, without having to construct the features! • Example: K(a,b) = (aᵀb)² • (a1b1 + a2b2)² = a1²b1² + 2a1b1a2b2 + a2²b2² = [a1², a2², √2·a1a2]ᵀ[b1², b2², √2·b1b2] • An implicit mapping to a feature space of dimension 3 (for n attributes, dimension n(n+1)/2)
Types of Kernels • Polynomial: K(a,b) = (aᵀb + 1)^d • Feature space is exponential in d • Gaussian: K(a,b) = exp(−‖a−b‖²/σ²) • Feature space is infinite-dimensional • Sigmoid, etc. • Any symmetric positive definite function
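A brief scikit-learn sketch comparing these kernel choices (the ring-shaped dataset and all parameter values are my own illustration; sklearn's poly kernel matches the slide's (aᵀb + 1)^d only up to its internal gamma scaling, and its rbf gamma plays the role of 1/σ²):

```python
from sklearn.svm import SVC
from sklearn.datasets import make_circles

# Ring-shaped data that no linear boundary separates (toy example).
X, y = make_circles(n_samples=200, factor=0.4, noise=0.05, random_state=0)

for kernel, params in [("linear", {}),
                       ("poly", {"degree": 2, "coef0": 1.0}),  # ~ (a.b + 1)^d
                       ("rbf", {"gamma": 1.0})]:               # exp(-gamma ||a-b||^2)
    clf = SVC(kernel=kernel, C=1.0, **params).fit(X, y)
    print(kernel, clf.score(X, y))   # training accuracy: nonlinear kernels separate the rings
```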
Nonseparable Data • Linear classifiers often cannot achieve perfect accuracy
Regularization • Allows trading off accuracy against generality • Tolerate some errors, with the cost of each error determined by a parameter C
Formulating Regularization and Nonseparable Data • argmin_{w,b,ξ} ‖w‖² + C Σi ξi such that y(i)(wᵀx(i) + b) ≥ 1 − ξi (for all i) and ξi ≥ 0 • The ξi's are error (slack) terms measuring how much each classification constraint is violated • C is the regularization weight that penalizes larger errors
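A scikit-learn sketch of the effect of C (dataset and values invented; SVC's C parameter plays the role of this regularization weight): small C tolerates more violations and gives a wider margin, large C penalizes violations heavily and narrows the margin.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

# Overlapping classes, so some slack is unavoidable.
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.5, random_state=0)

for C in [0.01, 1.0, 100.0]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 1.0 / np.linalg.norm(clf.coef_)
    print(f"C={C:7}: margin={margin:.3f}, "
          f"support vectors={clf.support_vectors_.shape[0]}, "
          f"train acc={clf.score(X, y):.3f}")
```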
Comments • SVMs often have very good performance • E.g., digit classification, face recognition, etc. • Still need parameter tweaking • Kernel type • Kernel parameters • Regularization weight • Fast optimization • Off-the-shelf libraries, e.g., SVMlight
Motivation • Predicting risk of terrorist attack: predicting “no attack” when one will occur is far worse than predicting “attack” when none will occur • Searching for images: returning irrelevant images is worse than omitting relevant ones
Classification Thresholds • Many learning algorithms give a real-valued output v(x) that needs thresholding for classification: v(x) > t => positive label for x; v(x) < t => negative label for x • May want to tune the threshold t to get fewer false positives or false negatives • Can also use weights to favor accuracy on positive or negative examples
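A sketch of threshold tuning (scikit-learn and the generated data are my additions; SVC's decision_function serves as the real-valued score v(x) here): sliding t trades false positives against false negatives.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.0, random_state=1)
clf = SVC(kernel="linear", C=1.0).fit(X, y)
scores = clf.decision_function(X)          # real-valued v(x)

for t in [-1.0, 0.0, 1.0]:                 # slide the threshold
    pred = (scores > t).astype(int)        # v(x) > t  =>  positive label
    fp = np.sum((pred == 1) & (y == 0))    # false positives
    fn = np.sum((pred == 0) & (y == 1))    # false negatives
    print(f"t={t:+.1f}: false positives={fp}, false negatives={fn}")
# Raising t gives fewer false positives but more false negatives.
```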
False Positives • [Figure: true concept vs. learned concept on axes x1, x2; a new query falls inside the learned concept but outside the true concept, so it is incorrectly classified positive]
False Negatives • [Figure: true concept vs. learned concept on axes x1, x2; a new query falls inside the true concept but outside the learned concept, so it is incorrectly classified negative]
Reducing False Positive Rate • [Figure: true concept vs. a learned concept adjusted to produce fewer false positives; axes x1, x2]
Reducing False Negative Rate • [Figure: true concept vs. a learned concept adjusted to produce fewer false negatives; axes x1, x2]
Precision vs. Recall • Precision • # of relevant documents retrieved / # of total documents retrieved • Recall • # of relevant documents retrieved / # of total relevant documents • Numbers between 0 and 1
Precision vs. Recall • Precision: # of true positives / (# true positives + # false positives) • Recall: # of true positives / (# true positives + # false negatives) • A precise classifier is selective; a classifier with high recall is permissive rather than selective
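A tiny helper making the two definitions concrete (the counts below are invented):

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); Recall = TP / (TP + FN)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall

# Illustrative counts: 40 relevant documents retrieved, 10 irrelevant
# documents retrieved, 20 relevant documents missed.
print(precision_recall(tp=40, fp=10, fn=20))   # (0.8, 0.666...)
```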
Precision-Recall Curves • Measure precision vs. recall as the tolerance (or weighting) is tuned • [Figure: curve in precision-recall space; a perfect classifier sits at precision = recall = 1, while actual performance traces a curve below that point]
Precision-Recall Curves • Measure precision vs. recall as the tolerance (or weighting) is tuned • [Figure: along the curve, penalizing false negatives yields high recall, penalizing false positives yields high precision, and equal weighting falls in between]
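A sketch of tracing such a curve by sweeping the threshold over a classifier's scores (scikit-learn, the generated data, and the train/test split are my additions):

```python
from sklearn.svm import SVC
from sklearn.metrics import precision_recall_curve
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

X, y = make_blobs(n_samples=400, centers=2, cluster_std=2.5, random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=2)

clf = SVC(kernel="linear", C=1.0).fit(X_tr, y_tr)
scores = clf.decision_function(X_te)       # real-valued scores on held-out data

# One (precision, recall) point per candidate threshold.
precision, recall, thresholds = precision_recall_curve(y_te, scores)
for p, r in list(zip(precision, recall))[::20]:
    print(f"precision={p:.2f}  recall={r:.2f}")
```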
Comments • Combined scores (e.g., F scores, correlation coefficients) are better measures of performance on class-imbalanced (biased) datasets than raw accuracy • Related to receiver operating characteristic (ROC) curves
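For example, the F1 score (the harmonic mean of precision and recall; the numbers below are invented) penalizes a classifier that achieves high recall only by sacrificing precision, a case where plain accuracy on an imbalanced dataset can still look good:

```python
def f1(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# A classifier that labels everything positive gets perfect recall
# but poor precision on an imbalanced dataset, and a poor F1:
print(f1(precision=0.1, recall=1.0))   # ~0.18
print(f1(precision=0.8, recall=0.7))   # ~0.75
```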
Administrative • Graded HW5