Support Vector Machines • R&N: Chap. 20, Sec. 20.6
Machine Learning Classes • CS659 (Hauser) Principles of Intelligent Robot Motion • CS657 (Yu) Computer Vision • STAT520 (Trosset) Introduction to Statistics • STAT682 (Rocha) Statistical Model Selection
Schedule • 11/24 – This class (Support Vector Machines) • No class 11/26 (Thanksgiving) • 12/1 – Final project presentations • 12/3 – Last class, review and wrap-up
Thanksgiving week • Vote: • Class as usual (reinforcement learning, unsupervised learning) • Review class • No class
Average Grades HW1-5 • HW1: 83.2 • HW2: 85.0 • HW3: 75.6 • HW4: 81.2 • HW5: 82.4
Final Projects • Mid-term report • 1-2 paragraphs on status of project • Checkpoint for choosing HW instead of project • In-class presentation • High-level overview of: • Project topic and research questions • Demos / findings / major ideas • Relationship with material learned in class • Aim for 5 minutes (~5 slides) • Final report due any time before final
Neural Networks in the News • http://spectrum.ieee.org/computing/hardware/ibm-unveils-a-new-brain-simulator • 1 billion neurons • 10 trillion synapses • Speed: 100 times slower than a brain • Power: 1.4 MW ~= 500 homes
Agenda • Support Vector Machines • Kernels • Regularization • False negatives/positives, precision/recall curves
Motivation: Feature Mappings • Given attributes x, learn in the space of features f(x) • E.g., parity, FACE(card), RED(card) • Hope CONCEPT is easier in feature space
Example • [Figure: example data plotted in the original attribute space, with axes x1 and x2]
Example • Choose f1 = x1², f2 = x2², f3 = √2·x1x2 • [Figure: the same data plotted in feature space, with axes f1, f2, f3]
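A minimal NumPy sketch of this mapping (NumPy and the sample points are my additions; the values are arbitrary). It shows the explicit map f(x) = [x1², x2², √2·x1x2] and, anticipating the kernel trick below, that an inner product in feature space equals the squared inner product of the original attributes:

```python
import numpy as np

def quad_features(x):
    """Map (x1, x2) to (x1^2, x2^2, sqrt(2)*x1*x2)."""
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2) * x1 * x2])

# Two illustrative 2D points (values are made up).
a = np.array([1.0, 2.0])
b = np.array([3.0, -1.0])

# Inner product in feature space equals (a.b)^2 in attribute space.
print(quad_features(a) @ quad_features(b))   # 1.0
print((a @ b) ** 2)                          # 1.0
```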
SVM Intuition • Find “best” linear classifier in feature space • Hope to avoid overfitting
Maximum Margin Classification • Find the boundary that maximizes the margin between positive and negative examples • [Figure: separating boundary with the margin band highlighted]
Margin • The farther away from the boundary we are, the more “confident” the classification • [Figure: points near the boundary are marked “not as confident”, points far from it “very confident”]
Geometric Margin • The farther away from the boundary we are, the more “confident” the classification • The distance of an example to the boundary is its geometric margin
Geometric Margin • Let y(i) = -1 or +1 • Boundary θᵀx + b = 0, with ‖θ‖ = 1 • Margin of example i: y(i)(θᵀx(i) + b) • The distance of an example to the boundary is its geometric margin
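A small worked example (the boundary and points below are invented for illustration): with a unit-norm θ, each example's margin is its signed distance to the boundary, and the classifier's margin is the minimum over examples.

```python
import numpy as np

# Boundary theta^T x + b = 0 with ||theta|| = 1 (example values).
theta = np.array([0.6, 0.8])      # unit vector: 0.36 + 0.64 = 1
b = -1.0

# Labeled examples (x, y) with y in {-1, +1} (made up).
examples = [(np.array([2.0, 2.0]), +1),
            (np.array([0.0, 0.0]), -1)]

# Geometric margin of example i: y(i) * (theta^T x(i) + b).
margins = [y * (theta @ x + b) for x, y in examples]
print(margins)       # roughly [1.8, 1.0]
print(min(margins))  # the classifier's margin: 1.0
```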
Optimizing the Margin • Maximum margin classifier given by argmax_{θ,b} min_i y(i)(θᵀx(i) + b)
Optimizing the Margin • Maximum margin classifier given by argmax_{θ,b} min_i y(i)(θᵀx(i) + b) • Also written as the constrained optimization: argmax_{θ,b} γ such that y(i)(θᵀx(i) + b) ≥ γ (for all i) and ‖θ‖ = 1
Optimizing the Margin • Maximum margin classifier given by argmax_{θ,b} min_i y(i)(θᵀx(i) + b) • Also written as the constrained optimization: argmax_{θ,b} γ such that y(i)(θᵀx(i) + b) ≥ γ (for all i) and ‖θ‖ = 1 • The constraint ‖θ‖ = 1 is nasty (nonconvex)
Optimizing the Margin, part 2 • Better formulation: argmin_{w,b} ‖w‖² such that y(i)(wᵀx(i) + b) ≥ 1 (for all i) • Can be solved with “off the shelf” numerical packages (e.g., Matlab)
Deconstructing the formulation • Better formulation: argmin_{w,b} ‖w‖² such that y(i)(wᵀx(i) + b) ≥ 1 (for all i) • Letting θ = w/‖w‖, the constraint becomes y(i)(θᵀx(i) + b/‖w‖) ≥ 1/‖w‖ • So minimizing ‖w‖² maximizes the margin
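As a sketch of the “off-the-shelf solver” idea (using SciPy rather than Matlab, with a tiny invented dataset), the quadratic program above can be handed to a general-purpose constrained optimizer; a dedicated QP or SVM solver would be more robust in practice.

```python
import numpy as np
from scipy.optimize import minimize

# Tiny linearly separable 2D dataset (values invented for illustration).
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.5]])
y = np.array([+1, +1, -1, -1])

def objective(v):
    w = v[:2]
    return w @ w                          # minimize ||w||^2

# One constraint per example: y(i) (w^T x(i) + b) - 1 >= 0.
constraints = [{"type": "ineq",
                "fun": lambda v, i=i: y[i] * (v[:2] @ X[i] + v[2]) - 1.0}
               for i in range(len(y))]

res = minimize(objective, x0=np.zeros(3), constraints=constraints)
w, b = res.x[:2], res.x[2]
print(w, b)
print("margin:", 1.0 / np.linalg.norm(w))
```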
Insights • The optimal classification boundary is defined by just a few (d+1) points: the support vectors • [Figure: the margin is determined only by the support vectors]
The Magic of Duality • The Lagrangian dual problem of argmin_{w,b} ‖w‖² such that y(i)(wᵀx(i) + b) ≥ 1 (for all i) • is: max_α Σi αi − ½ Σi,j αi αj y(i) y(j) (x(i)ᵀ x(j)) such that αi ≥ 0 and Σi αi y(i) = 0 • with classification boundary w = Σi αi y(i) x(i)
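A short sketch of this relationship using scikit-learn (not mentioned on the slides; the data are invented): with a linear kernel, SVC exposes αi·y(i) for the support vectors through dual_coef_, so w = Σi αi y(i) x(i) can be reconstructed and checked against the fitted weight vector.

```python
import numpy as np
from sklearn.svm import SVC

# Toy separable data (values invented for illustration).
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.5]])
y = np.array([+1, +1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)    # large C approximates a hard margin

# dual_coef_ holds alpha_i * y(i) for the support vectors only.
alpha_times_y = clf.dual_coef_.ravel()
w_from_dual = alpha_times_y @ clf.support_vectors_   # w = sum_i alpha_i y(i) x(i)

print(w_from_dual)            # reconstructed from the dual variables
print(clf.coef_.ravel())      # the same vector, as fitted
print(clf.support_vectors_)   # only a few points carry nonzero alpha_i
```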
The Kernel Trick • Only a few of the αi's are nonzero, so wᵀx = Σi αi y(i) x(i)ᵀx can be evaluated quickly • We've rewritten everything in terms of (x(i)ᵀ x(j))… so what?
The Kernel Trick • Only a few of the αi's are nonzero, so wᵀx = Σi αi y(i) x(i)ᵀx can be evaluated quickly • We've rewritten everything in terms of (x(i)ᵀ x(j))… so what? • Replace the inner product (aᵀb) with a kernel function K(a,b)
Kernel Functions • A kernel can implicitly compute a feature mapping to a high-dimensional space, without having to construct the features!
Kernel Functions • A kernel can implicitly compute a feature mapping to a high-dimensional space, without having to construct the features! • Example: K(a,b) = (aᵀb)² • (a1b1 + a2b2)² = a1²b1² + 2a1b1a2b2 + a2²b2² = [a1², a2², √2·a1a2]ᵀ[b1², b2², √2·b1b2] • An implicit mapping to a feature space of dimension 3 (for n attributes, dimension n(n+1)/2)
Types of Kernels • Polynomial: K(a,b) = (aᵀb + 1)^d • Feature space is exponential in d • Gaussian: K(a,b) = exp(−‖a−b‖²/σ²) • Feature space is infinite-dimensional • Sigmoid, etc. • Any symmetric positive definite function
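A brief scikit-learn sketch comparing these kernel choices (the ring-shaped dataset and all parameter values are my own illustration; sklearn's poly kernel matches the slide's (aᵀb + 1)^d only up to its internal gamma scaling, and its rbf gamma plays the role of 1/σ²):

```python
from sklearn.svm import SVC
from sklearn.datasets import make_circles

# Ring-shaped data that no linear boundary separates (toy example).
X, y = make_circles(n_samples=200, factor=0.4, noise=0.05, random_state=0)

for kernel, params in [("linear", {}),
                       ("poly", {"degree": 2, "coef0": 1.0}),  # ~ (a.b + 1)^d
                       ("rbf", {"gamma": 1.0})]:               # exp(-gamma ||a-b||^2)
    clf = SVC(kernel=kernel, C=1.0, **params).fit(X, y)
    print(kernel, clf.score(X, y))   # training accuracy: nonlinear kernels separate the rings
```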
Nonseparable Data • Linear classifiers often cannot achieve perfect accuracy
Regularization • Allows trading off accuracy against generality • Tolerate some errors, with the cost of each error determined by a parameter C
Formulating Regularization and Nonseparable Data • argmin_{w,b,ξ} ‖w‖² + C Σi ξi such that y(i)(wᵀx(i) + b) ≥ 1 − ξi (for all i) and ξi ≥ 0 • The ξi's are error (slack) terms measuring how much each classification constraint is violated • C is the regularization weight that penalizes larger errors
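A scikit-learn sketch of the effect of C (dataset and values invented; SVC's C parameter plays the role of this regularization weight): small C tolerates more violations and gives a wider margin, large C penalizes violations heavily and narrows the margin.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

# Overlapping classes, so some slack is unavoidable.
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.5, random_state=0)

for C in [0.01, 1.0, 100.0]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 1.0 / np.linalg.norm(clf.coef_)
    print(f"C={C:7}: margin={margin:.3f}, "
          f"support vectors={clf.support_vectors_.shape[0]}, "
          f"train acc={clf.score(X, y):.3f}")
```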
Comments • SVMs often have very good performance • E.g., digit classification, face recognition, etc. • Still need parameter tweaking • Kernel type • Kernel parameters • Regularization weight • Fast optimization • Off-the-shelf libraries, e.g., SVMlight
Motivation • Predicting risk of terrorist attack: predicting “no attack” when one will occur is far worse than predicting “attack” when none will occur • Searching for images: returning irrelevant images is worse than omitting relevant ones
Classification Thresholds • Many learning algorithms give a real-valued output v(x) that needs thresholding for classification: v(x) > t => positive label for x; v(x) < t => negative label for x • May want to tune the threshold t to get fewer false positives or false negatives • Can also use weights to favor accuracy on positive or negative examples
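A sketch of threshold tuning (scikit-learn and the generated data are my additions; SVC's decision_function serves as the real-valued score v(x) here): sliding t trades false positives against false negatives.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.0, random_state=1)
clf = SVC(kernel="linear", C=1.0).fit(X, y)
scores = clf.decision_function(X)          # real-valued v(x)

for t in [-1.0, 0.0, 1.0]:                 # slide the threshold
    pred = (scores > t).astype(int)        # v(x) > t  =>  positive label
    fp = np.sum((pred == 1) & (y == 0))    # false positives
    fn = np.sum((pred == 0) & (y == 1))    # false negatives
    print(f"t={t:+.1f}: false positives={fp}, false negatives={fn}")
# Raising t gives fewer false positives but more false negatives.
```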
False Positives • [Figure: true concept vs. learned concept on axes x1, x2; a new query falls inside the learned concept but outside the true concept, so it is incorrectly classified positive]
False Negatives • [Figure: true concept vs. learned concept on axes x1, x2; a new query falls inside the true concept but outside the learned concept, so it is incorrectly classified negative]
Reducing False Positive Rate • [Figure: true concept vs. a learned concept adjusted to produce fewer false positives; axes x1, x2]
Reducing False Negative Rate • [Figure: true concept vs. a learned concept adjusted to produce fewer false negatives; axes x1, x2]
Precision vs. Recall • Precision • # of relevant documents retrieved / # of total documents retrieved • Recall • # of relevant documents retrieved / # of total relevant documents • Numbers between 0 and 1
Precision vs. Recall • Precision: # of true positives / (# true positives + # false positives) • Recall: # of true positives / (# true positives + # false negatives) • A precise classifier is selective; a classifier with high recall is permissive rather than selective
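A tiny helper making the two definitions concrete (the counts below are invented):

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); Recall = TP / (TP + FN)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall

# Illustrative counts: 40 relevant documents retrieved, 10 irrelevant
# documents retrieved, 20 relevant documents missed.
print(precision_recall(tp=40, fp=10, fn=20))   # (0.8, 0.666...)
```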
Precision-Recall Curves • Measure precision vs. recall as the tolerance (or weighting) is tuned • [Figure: curve in precision-recall space; a perfect classifier sits at precision = recall = 1, while actual performance traces a curve below that point]
Precision-Recall Curves • Measure precision vs. recall as the tolerance (or weighting) is tuned • [Figure: along the curve, penalizing false negatives yields high recall, penalizing false positives yields high precision, and equal weighting falls in between]
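A sketch of tracing such a curve by sweeping the threshold over a classifier's scores (scikit-learn, the generated data, and the train/test split are my additions):

```python
from sklearn.svm import SVC
from sklearn.metrics import precision_recall_curve
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

X, y = make_blobs(n_samples=400, centers=2, cluster_std=2.5, random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=2)

clf = SVC(kernel="linear", C=1.0).fit(X_tr, y_tr)
scores = clf.decision_function(X_te)       # real-valued scores on held-out data

# One (precision, recall) point per candidate threshold.
precision, recall, thresholds = precision_recall_curve(y_te, scores)
for p, r in list(zip(precision, recall))[::20]:
    print(f"precision={p:.2f}  recall={r:.2f}")
```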
Comments • Combined scores (e.g., F scores, correlation coefficients) are better measures of performance on class-imbalanced (biased) datasets than raw accuracy • Related to receiver operating characteristic (ROC) curves
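For example, the F1 score (the harmonic mean of precision and recall; the numbers below are invented) penalizes a classifier that achieves high recall only by sacrificing precision, a case where plain accuracy on an imbalanced dataset can still look good:

```python
def f1(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# A classifier that labels everything positive gets perfect recall
# but poor precision on an imbalanced dataset, and a poor F1:
print(f1(precision=0.1, recall=1.0))   # ~0.18
print(f1(precision=0.8, recall=0.7))   # ~0.75
```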
Administrative • Graded HW5