
Support Vector Machines R&N: Chap. 20, Sec. 20.6


Presentation Transcript


  1. Support Vector Machines • R&N: Chap. 20, Sec. 20.6

  2. Machine Learning Classes • CS659 (Hauser) Principles of Intelligent Robot Motion • CS657 (Yu) Computer Vision • STAT520 (Trosset) Introduction to Statistics • STAT682 (Rocha) Statistical Model Selection

  3. Schedule • 11/24 • No class 11/26 (Thanksgiving) • 12/1 – Final project presentations • 12/3 – Last class, review and wrap-up

  4. Thanksgiving week • Vote: • Class as usual (reinforcement learning, unsupervised learning) • Review class • No class

  5. Average Grades HW1-5 • HW1: 83.2 • HW2: 85.0 • HW3: 75.6 • HW4: 81.2 • HW5: 82.4

  6. Final Projects • Mid-term report • 1-2 paragraphs on status of project • Checkpoint for choosing HW instead of project • In-class presentation • High-level overview of: • Project topic and research questions • Demos / findings / major ideas • Relationship with material learned in class • Aim for 5 minutes (~5 slides) • Final report due any time before final

  7. Neural Networks in the News • http://spectrum.ieee.org/computing/hardware/ibm-unveils-a-new-brain-simulator • 1 billion neurons • 10 trillion synapses • Speed: 100 times slower than a brain • Power: 1.4 MW ~= 500 homes

  8. Agenda • Support Vector Machines • Kernels • Regularization • False negatives/positives, precision/recall curves

  9. Motivation: Feature Mappings • Given attributes x, learn in the space of features f(x) • E.g., parity, FACE(card), RED(card) • Hope CONCEPT is easier in feature space

  10. Example (figure: data plotted on axes $x_1$, $x_2$)

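  11. Example • Choose $f_1 = x_1^2$, $f_2 = x_2^2$, $f_3 = \sqrt{2}\, x_1 x_2$ (figure: the same data on axes $x_1$, $x_2$ and in feature space on axes $f_1$, $f_2$, $f_3$)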
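A minimal sketch of why such a mapping can help, not taken from the slides: the data points and the "inside the unit circle" concept are hypothetical, and taking the third feature as $\sqrt{2}\, x_1 x_2$ is an assumption chosen to match the kernel expansion later in the deck. A circular concept is not linear in $(x_1, x_2)$ but becomes the linear rule $f_1 + f_2 < 1$ in feature space.

```python
import math

def feature_map(x1, x2):
    """Map attributes (x1, x2) to features (x1^2, x2^2, sqrt(2)*x1*x2)."""
    return (x1 * x1, x2 * x2, math.sqrt(2) * x1 * x2)

# Hypothetical data: label +1 iff the point lies inside the unit circle.
points = [(0.1, 0.2), (-0.5, 0.3), (1.5, 0.0), (-1.0, 1.2)]
labels = [1, 1, -1, -1]

def linear_in_features(x1, x2):
    """Linear classifier in feature space: weights (-1, -1, 0), bias +1."""
    f1, f2, f3 = feature_map(x1, x2)
    score = -1.0 * f1 - 1.0 * f2 + 0.0 * f3 + 1.0
    return 1 if score > 0 else -1

predictions = [linear_in_features(x1, x2) for x1, x2 in points]
print(predictions == labels)
```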

  12. SVM Intuition • Find “best” linear classifier in feature space • Hope to avoid overfitting

  13. Maximum Margin Classification • Find the boundary that maximizes the margin between positive and negative examples (figure label: Margin)

  14. Margin • The farther away from the boundary we are, the more “confident” the classification (figure labels: Margin, Not as confident, Very confident)

  15. Geometric Margin • The farther away from the boundary we are, the more “confident” the classification • The distance of an example to the boundary is its geometric margin (figure label: Margin)

  16. Geometric Margin • Let $y^{(i)} = -1$ or $1$ • Boundary $\theta^T x + b = 0$, $\|\theta\| = 1$ • Margin for example $i$ = $y^{(i)} (\theta^T x^{(i)} + b)$ • The distance of an example to the boundary is its geometric margin (figure label: Margin)
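The margin formula can be sketched in a few lines. This is an illustrative helper, not code from the lecture: the function name and the example boundary are hypothetical, and the division by the norm is added so the boundary vector need not be unit length (with a unit-length vector it reduces to the formula on the slide).

```python
import math

def geometric_margin(theta, b, x, y):
    """Signed distance y * (theta^T x + b) / ||theta|| of example x to the boundary."""
    norm = math.sqrt(sum(t * t for t in theta))
    return y * (sum(t * xi for t, xi in zip(theta, x)) + b) / norm

# Hypothetical boundary 3*x1 + 4*x2 - 5 = 0 (norm of theta is 5).
theta, b = (3.0, 4.0), -5.0

far_positive = geometric_margin(theta, b, (3.0, 4.0), +1)   # correct and far away
near_negative = geometric_margin(theta, b, (0.0, 0.0), -1)  # correct but closer
misclassified = geometric_margin(theta, b, (0.0, 0.0), +1)  # wrong side: negative margin
print(far_positive, near_negative, misclassified)
```

A negative value means the example lies on the wrong side of the boundary.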

  17. Optimizing the Margin • Maximum margin classifier given by $\arg\max_{\theta,b} \min_i y^{(i)} (\theta^T x^{(i)} + b)$

  18. Optimizing the Margin • Maximum margin classifier given by $\arg\max_{\theta,b} \min_i y^{(i)} (\theta^T x^{(i)} + b)$ • Also written as constrained optimization: $\arg\max_{\theta,b} \lambda$ such that $y^{(i)} (\theta^T x^{(i)} + b) \ge \lambda$ (for all $i$), $\|\theta\| = 1$

  19. Optimizing the Margin • Maximum margin classifier given by $\arg\max_{\theta,b} \min_i y^{(i)} (\theta^T x^{(i)} + b)$ • Also written as constrained optimization: $\arg\max_{\theta,b} \lambda$ such that $y^{(i)} (\theta^T x^{(i)} + b) \ge \lambda$ (for all $i$), $\|\theta\| = 1$ • $\|\theta\| = 1$ is a nasty (nonconvex) constraint

  20. Optimizing the Margin, part 2 • Better formulation: $\arg\min_{w,b} \|w\|^2$ such that $y^{(i)} (w^T x^{(i)} + b) \ge 1$ (for all $i$) • Can be solved with “off the shelf” numerical packages (e.g., Matlab)

  21. Deconstructing the formulation • Better formulation: $\arg\min_{w,b} \|w\|^2$ such that $y^{(i)} (w^T x^{(i)} + b) \ge 1$ (for all $i$) • Letting $\theta = w / \|w\|$ • $y^{(i)} (\theta^T x^{(i)} + b / \|w\|) \ge 1 / \|w\|$ • So minimizing $\|w\|^2$ will maximize the margin
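A small numeric check of this argument, using a hypothetical two-point dataset and hand-picked values of w and b (none of these numbers come from the slides): when the tightest constraint holds with equality, the geometric margin equals 1/||w||, and a rescaled feasible solution has the same boundary but a larger ||w||².

```python
import math

def functional_margins(w, b, data):
    """Return y * (w^T x + b) for every (x, y) pair."""
    return [y * (sum(wi * xi for wi, xi in zip(w, x)) + b) for x, y in data]

def norm(w):
    return math.sqrt(sum(wi * wi for wi in w))

# Hypothetical data: the best boundary is x1 = 1, with geometric margin 2.
data = [((3.0, 0.0), +1), ((-1.0, 0.0), -1)]

w, b = (0.5, 0.0), -0.5            # both constraints hold with equality
geometric = min(functional_margins(w, b, data)) / norm(w)
one_over_norm = 1.0 / norm(w)

w2, b2 = (1.0, 0.0), -1.0          # same boundary, scaled by 2: still feasible
feasible_scaled = all(m >= 1.0 for m in functional_margins(w2, b2, data))
print(geometric, one_over_norm, feasible_scaled, norm(w2) ** 2 > norm(w) ** 2)
```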

  22. Insights • The optimal classification boundary is defined by just a few ($d+1$) points: support vectors (figure label: Margin)

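  23. The Magic of Duality • The Lagrangian dual problem of $\arg\min_{w,b} \|w\|^2$ such that $y^{(i)} (w^T x^{(i)} + b) \ge 1$ (for all $i$), with the objective scaled to $\tfrac{1}{2}\|w\|^2$ (same minimizer) • Is: $\max_{\alpha} \sum_i \alpha_i - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y^{(i)} y^{(j)} (x^{(i)T} x^{(j)})$ such that $\alpha_i \ge 0$ and $\sum_i \alpha_i y^{(i)} = 0$ • With classification boundary $w = \sum_i \alpha_i y^{(i)} x^{(i)}$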
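To make the dual concrete, here is a toy sketch on a hypothetical two-point dataset. It is not a practical solver: because the equality constraint forces both multipliers to be equal here, a one-dimensional grid search stands in for a real quadratic-programming routine. The dual is taken in its standard form, which pairs with the ½||w||² scaling of the primal objective.

```python
def dual_objective(alpha, data):
    """sum_i alpha_i - 1/2 * sum_{i,j} alpha_i alpha_j y_i y_j (x_i^T x_j)."""
    total = sum(alpha)
    for a_i, (x_i, y_i) in zip(alpha, data):
        for a_j, (x_j, y_j) in zip(alpha, data):
            dot = sum(p * q for p, q in zip(x_i, x_j))
            total -= 0.5 * a_i * a_j * y_i * y_j * dot
    return total

# Hypothetical data: one positive and one negative example.
data = [((1.0, 0.0), +1), ((-1.0, 0.0), -1)]

# sum_i alpha_i y_i = 0 forces alpha_1 = alpha_2 = a, so search over a >= 0.
candidates = [k / 1000.0 for k in range(0, 2001)]
best_a = max(candidates, key=lambda a: dual_objective((a, a), data))

# Recover w = sum_i alpha_i y_i x_i from the dual solution.
alpha = (best_a, best_a)
w = tuple(sum(a * y * x[d] for a, (x, y) in zip(alpha, data)) for d in range(2))
print(best_a, w)
```

For this dataset the objective reduces to 2a − 2a², so the maximizer is a = 0.5 and the recovered normal is w = (1, 0), giving margin 1/||w|| = 1.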

  24. The Kernel Trick • Only a few of the $\alpha_i$'s are nonzero, so $w^T x = \sum_i \alpha_i y^{(i)} x^{(i)T} x$ can be evaluated quickly • We’ve rewritten everything in terms of $(x^{(i)T} x^{(j)})$… so what?

  25. The Kernel Trick • Only a few of the $\alpha_i$'s are nonzero, so $w^T x = \sum_i \alpha_i y^{(i)} x^{(i)T} x$ can be evaluated quickly • We’ve rewritten everything in terms of $(x^{(i)T} x^{(j)})$… so what? • Replace inner product $(a^T b)$ with a kernel function $K(a,b)$
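A hedged sketch of how a classifier might be evaluated once the inner product is replaced by a kernel. The function names, the support-vector values, and the bias are all illustrative assumptions, not output of a trained model; the point is only that the sum runs over support vectors and touches the data solely through K.

```python
def linear_kernel(u, v):
    """Plain inner product u^T v."""
    return sum(p * q for p, q in zip(u, v))

def decision_value(x, support, bias, kernel):
    """sum_i alpha_i * y_i * K(x_i, x) + bias, summed over support vectors only."""
    return sum(alpha * y * kernel(x_i, x) for alpha, y, x_i in support) + bias

# Hypothetical support vectors as (alpha_i, y_i, x_i) triples.
support = [(0.5, +1, (1.0, 0.0)), (0.5, -1, (-1.0, 0.0))]

positive_side = decision_value((2.0, 3.0), support, 0.0, linear_kernel)
negative_side = decision_value((-0.5, 1.0), support, 0.0, linear_kernel)
print(positive_side, negative_side)
```

Passing a different `kernel` function changes the feature space without changing `decision_value` itself.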

  26. Kernel Functions • A kernel can implicitly compute a feature mapping to a high dimensional space, without having to construct the features!

  27. Kernel Functions • A kernel can implicitly compute a feature mapping to a high dimensional space, without having to construct the features! • Example: $K(a,b) = (a^T b)^2$ • $(a_1 b_1 + a_2 b_2)^2 = a_1^2 b_1^2 + 2 a_1 b_1 a_2 b_2 + a_2^2 b_2^2 = [a_1^2, a_2^2, \sqrt{2}\, a_1 a_2]^T [b_1^2, b_2^2, \sqrt{2}\, b_1 b_2]$ • An implicit mapping to feature space of dimension 3 (for $n$ attributes, dimension $n(n+1)/2$)
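The identity above can be checked numerically. In this sketch the test vectors are arbitrary, and the explicit map uses the coefficient √2 on the cross term, which is what the expansion requires for the two sides to agree.

```python
import math

def phi(a):
    """Explicit feature map for two attributes: (a1^2, a2^2, sqrt(2)*a1*a2)."""
    return (a[0] ** 2, a[1] ** 2, math.sqrt(2) * a[0] * a[1])

def quadratic_kernel(a, b):
    """K(a, b) = (a^T b)^2, computed without building any features."""
    return (a[0] * b[0] + a[1] * b[1]) ** 2

a, b = (1.0, 2.0), (3.0, -1.0)
explicit = sum(p * q for p, q in zip(phi(a), phi(b)))  # inner product in feature space
implicit = quadratic_kernel(a, b)                      # same value via the kernel
print(abs(explicit - implicit) < 1e-9)
```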

  28. Types of Kernel • Polynomial $K(a,b) = (a^T b + 1)^d$ • Feature space is exponential in $d$ • Gaussian $K(a,b) = \exp(-\|a-b\|^2 / \sigma^2)$ • Feature space is infinite dimensional • Sigmoid, etc… • Any symmetric positive definite function
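The two named kernels can be written directly from their formulas. This is a plain-Python sketch; the default values for `d` and `sigma` are arbitrary choices for illustration.

```python
import math

def polynomial_kernel(a, b, d=2):
    """K(a, b) = (a^T b + 1)^d."""
    return (sum(p * q for p, q in zip(a, b)) + 1.0) ** d

def gaussian_kernel(a, b, sigma=1.0):
    """K(a, b) = exp(-||a - b||^2 / sigma^2)."""
    squared_distance = sum((p - q) ** 2 for p, q in zip(a, b))
    return math.exp(-squared_distance / sigma ** 2)

a, b = (1.0, 2.0), (3.0, -1.0)
print(polynomial_kernel(a, b), gaussian_kernel(a, b), gaussian_kernel(a, a))
```

Both functions are symmetric in their arguments, and the Gaussian kernel equals 1 when the two points coincide and decays toward 0 as they move apart.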

  29. Overfitting / underfitting

  30. Nonseparable Data • Linear classifiers often cannot achieve perfect accuracy

  31. Regularization • Helps trade off accuracy against generality • Tolerate some errors; the cost of an error is set by a parameter $C$

  32. Formulating Regularization and Nonseparable data • $\arg\min_{w,b,\epsilon} \|w\|^2 + C \sum_i \epsilon_i$ such that $y^{(i)} (w^T x^{(i)} + b) \ge 1 - \epsilon_i$ (for all $i$), $\epsilon_i \ge 0$ • The $\epsilon_i$'s are error terms measuring how much the classification is violated • $C$ is the regularization weight that penalizes larger errors
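For a fixed w and b, the smallest error term that satisfies both constraints is max(0, 1 − y(wᵀx + b)), so the objective can be evaluated directly. The sketch below does only that evaluation (it does not optimize), on a hypothetical dataset whose last point sits on the wrong side of the boundary.

```python
def soft_margin_objective(w, b, data, C):
    """Return (||w||^2 + C * sum_i eps_i, eps) with eps_i = max(0, 1 - y_i (w^T x_i + b))."""
    slacks = [
        max(0.0, 1.0 - y * (sum(wi * xi for wi, xi in zip(w, x)) + b))
        for x, y in data
    ]
    return sum(wi * wi for wi in w) + C * sum(slacks), slacks

# Hypothetical data; the third point is a positive example on the negative side.
data = [((2.0, 0.0), +1), ((-2.0, 0.0), -1), ((-0.5, 0.0), +1)]

value, slacks = soft_margin_objective((1.0, 0.0), 0.0, data, C=10.0)
print(value, slacks)
```

Raising `C` makes the violated point more expensive, pushing an optimizer toward boundaries that fit the training data more tightly.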

  33. Comments • SVMs often have very good performance • E.g., digit classification, face recognition, etc. • Still need parameter tweaking • Kernel type • Kernel parameters • Regularization weight • Fast optimization • Off-the-shelf libraries • SVMlight

  34. Precision and Recall

  35. Motivation • Predicting risk of terrorist attack • Predicting “no attack” when one will occur is far worse than predicting “attack” when none will occur • Searching for images • Returning irrelevant images is worse than omitting relevant ones

  36. Classification Thresholds • Many learning algorithms give real-valued output $v(x)$ that needs thresholding for classification • $v(x) > t$ => positive label given to $x$ • $v(x) < t$ => negative label given to $x$ • May want to tune threshold to get fewer false positives or false negatives • Use weights to favor accuracy on positive or negative examples
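A minimal sketch of threshold tuning on made-up scores and labels. The slide leaves the case v(x) = t undefined; treating a tie as negative is an assumption made here.

```python
def count_errors(scores, labels, t):
    """Return (false_positives, false_negatives) when v(x) > t means positive."""
    fp = sum(1 for v, y in zip(scores, labels) if v > t and y == -1)
    fn = sum(1 for v, y in zip(scores, labels) if v <= t and y == +1)
    return fp, fn

# Hypothetical classifier outputs v(x) and true labels.
scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
labels = [+1, +1, -1, +1, -1, -1]

for t in (0.2, 0.5, 0.7):
    print(t, count_errors(scores, labels, t))
```

On this data, raising the threshold from 0.2 to 0.7 removes the false positives at the cost of one false negative.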

  37. False Positives (figure: true concept and learned concept on axes $x_1$, $x_2$)

  38. False Positives (figure: true concept, learned concept, and a new query on axes $x_1$, $x_2$)

  39. False Negatives (figure: true concept, learned concept, and a new query on axes $x_1$, $x_2$)

  40. Reducing False Positive Rate (figure: true concept and learned concept on axes $x_1$, $x_2$)

  41. Reducing False Negative Rate (figure: true concept and learned concept on axes $x_1$, $x_2$)

  42. Precision vs. Recall • Precision • # of relevant documents retrieved / # of total documents retrieved • Recall • # of relevant documents retrieved / # of total relevant documents • Numbers between 0 and 1

  43. Precision vs. Recall • Precision • # of true positives / (# true positives + # false positives) • Recall • # of true positives / (# true positives + # false negatives) • A precise classifier is selective • A classifier with high recall is not
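These two definitions translate directly into code. In the sketch below the prediction and label lists are invented, and returning 0.0 when a denominator is zero is a convention chosen here, not something the slides specify.

```python
def precision_recall(predicted, actual):
    """Compute (precision, recall) from 0/1 predictions and 0/1 true labels."""
    tp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 1)
    fp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 0)
    fn = sum(1 for p, a in zip(predicted, actual) if p == 0 and a == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# A selective (hypothetical) classifier: it labels few items positive.
predicted = [1, 1, 0, 0, 0, 0]
actual = [1, 1, 0, 1, 1, 0]
precision, recall = precision_recall(predicted, actual)
print(precision, recall)
```

Here every positive prediction is correct (precision 1.0), but half of the true positives are missed (recall 0.5).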

  44. Precision-Recall curves • Measure Precision vs Recall as tolerance (or weighting) is tuned (figure: axes Precision and Recall; curves labeled Perfect classifier and Actual performance)

  45. Precision-Recall curves • Measure Precision vs Recall as tolerance (or weighting) is tuned (figure: axes Precision and Recall; points labeled Penalize false negatives, Equal weight, Penalize false positives)

  46. Precision-Recall curves • Measure Precision vs Recall as tolerance (or weighting) is tuned (figure: axes Precision and Recall)
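One way such a curve might be traced is to sweep the threshold over the observed scores and record precision and recall at each setting. The scores and labels below are hypothetical, and using v(x) ≥ t as the positive rule (so that every observed score is a usable cut-off) is an assumption of this sketch.

```python
def pr_curve(scores, labels):
    """Return [(threshold, precision, recall)], thresholds from high to low."""
    total_positive = sum(labels)
    curve = []
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for v, y in zip(scores, labels) if v >= t and y == 1)
        fp = sum(1 for v, y in zip(scores, labels) if v >= t and y == 0)
        curve.append((t, tp / (tp + fp), tp / total_positive))
    return curve

# Hypothetical classifier outputs and 0/1 true labels.
scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]

curve = pr_curve(scores, labels)
for t, precision, recall in curve:
    print(t, round(precision, 3), round(recall, 3))
```

Lowering the threshold never decreases recall, while precision typically falls as more negatives are let through.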

  47. Comments • Combined scores (e.g., F scores, correlation coefficients) are better measures of performance on biased datasets than accuracy • Related to receiver-operating-characteristic (ROC) curves
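A small illustration of the first point, on an invented dataset with 5% positives and a classifier that always predicts negative. The F1 score is used as one example of an F score; defining it as 0 when there are no true positives is a convention assumed here.

```python
def accuracy_and_f1(predicted, actual):
    """Return (accuracy, F1), with F1 = 2*tp / (2*tp + fp + fn)."""
    tp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 1)
    fp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 0)
    fn = sum(1 for p, a in zip(predicted, actual) if p == 0 and a == 1)
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    denominator = 2 * tp + fp + fn
    f1 = 2 * tp / denominator if denominator else 0.0
    return correct / len(actual), f1

# Hypothetical biased dataset: 5 positives, 95 negatives.
actual = [1] * 5 + [0] * 95
predicted = [0] * 100  # trivial classifier: always "negative"

accuracy, f1 = accuracy_and_f1(predicted, actual)
print(accuracy, f1)
```

The trivial classifier scores 95% accuracy yet has an F1 of 0, because it never finds a positive example.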

  48. Administrative • Graded HW5
