
CS 480/680 : Intro to ML

Dive into the world of the perceptron algorithm with a focus on supervised learning, linear threshold functions, convergence theorems, margin analysis, and more. Discover how to improve model performance and handle non-separable data.


Presentation Transcript


  1. CS 480/680: Intro to ML Lecture 01: Perceptron Yao-Liang Yu

  2. Outline • Announcements • Perceptron • Extensions • Next Yao-Liang Yu

  3. Outline • Announcements • Perceptron • Extensions • Next Yao-Liang Yu

  4. Announcements • Assignment 1 unleashed tonight • Projects • NIPS: https://papers.nips.cc/ • ICML: http://proceedings.mlr.press/v80/ • COLT: http://proceedings.mlr.press/v75/ • AISTATS: http://proceedings.mlr.press/v84/ • ICLR: https://iclr.cc/Conferences/2018/Schedule?type=Poster • JMLR: http://www.jmlr.org/papers/v18/ • Be ambitious and work hard Yao-Liang Yu

  5. Outline • Announcements • Perceptron • Extensions • Next Yao-Liang Yu

  6. Supervised learning Yao-Liang Yu

  7. Spam filtering example • Training set (X = [x1, x2, … , xn], y = [y1, y2, … , yn]) • xi in X = Rd: instance i with d-dimensional features • yi in Y = {-1, 1}: instance i is spam or ham? • A good feature representation is of utmost importance Yao-Liang Yu
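
A minimal sketch (not from the slides) of how such a training set could be represented; the feature values and labels below are invented purely for illustration.

```python
import numpy as np

# Toy spam-filtering training set: n = 4 instances with d = 3 features each
# (e.g., counts of suspicious words). All values are made up for illustration.
X = np.array([[3, 1, 0],    # x1
              [0, 0, 2],    # x2
              [5, 2, 1],    # x3
              [0, 1, 0]])   # x4
y = np.array([1, -1, 1, -1])  # +1 = spam, -1 = ham
```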

  8. Thought Experiment • Repeat the following game • Observe instance xi • Predict its label (in whatever way you like) • Reveal the true label • Suffer a mistake if the prediction differs from the true label • How many mistakes will you make? • Predict first, reveal next: no peeking into the future! Yao-Liang Yu
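
The game can be written down generically; this is a hedged sketch of the online protocol (the predictor and the update rule are left abstract), not code from the lecture.

```python
def online_game(stream, predict, update):
    """Online protocol: predict first, then the true label is revealed.

    stream  : iterable of (x, y) pairs, revealed one at a time
    predict : callable x -> label in {-1, +1}
    update  : callable (x, y) -> None, run after the true label is revealed
    """
    mistakes = 0
    for x, y in stream:
        y_hat = predict(x)   # commit to a prediction before seeing y
        if y_hat != y:       # suffer a mistake if prediction != truth
            mistakes += 1
        update(x, y)         # the learner may adapt afterwards
    return mistakes
```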

  9. Batch vs. Online • Batch learning • Interested in performance on test set X’ • Training set (X, y) is just a means • Statistical assumption on X and X’ • Online learning • Data comes one by one (streaming) • Need to predict y before knowing its true value • Interested in making as few mistakes as possible • Compare against some baseline • Online to Batch conversion Yao-Liang Yu

  10. Linear threshold function • Find (w, b) such that for all i: yi = sign(<w, xi> + b) • w in Rd: weight vector for the separating hyperplane • b in R: offset (threshold, bias) of the separating hyperplane • sign: thresholding function Yao-Liang Yu
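
A minimal sketch of this prediction rule; mapping the tie case <w, x> + b = 0 to +1 is one common convention (an assumption here, since the slides treat the zero case separately later).

```python
import numpy as np

def predict(w, b, x):
    """Linear threshold function: return sign(<w, x> + b) in {-1, +1}.
    The tie case <w, x> + b == 0 is mapped to +1 here (one common convention)."""
    return 1 if np.dot(w, x) + b >= 0 else -1
```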

  11. Perceptron [Rosenblatt’58] [Diagram: inputs x1, x2, x3, x4 weighted by w1, w2, w3, w4, summed (linear superposition), then passed through the threshold ≥ 0? (nonlinear activation function)] Yao-Liang Yu

  12. History 1969 Yao-Liang Yu

  13. The perceptron algorithm • Typically δ = 0, w0 = 0, b0 = 0 • y(<x, w> + b) > 0 implies y = sign(<x, w> + b) • y(<x, w> + b) < 0 implies y ≠ sign(<x, w> + b) • y(<x, w> + b) = 0 implies… • Lazy update: if it ain’t broke, don’t fix it (y: truth; sign(<x, w> + b): prediction) Yao-Liang Yu
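
A minimal sketch of the algorithm as described (cycle through the data, apply the lazy update only on mistakes); the details are a plausible rendering, not the instructor's exact pseudocode.

```python
import numpy as np

def perceptron(X, y, delta=0.0, max_epochs=100):
    """Perceptron with lazy updates: (w, b) changes only on a mistake.

    X : (n, d) array of instances; y : (n,) array of labels in {-1, +1}.
    A point counts as a mistake when y_i * (<w, x_i> + b) <= delta.
    """
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x_i, y_i in zip(X, y):
            if y_i * (np.dot(w, x_i) + b) <= delta:  # wrong, or not confident enough
                w = w + y_i * x_i                    # additive update toward the mistake
                b = b + y_i
                mistakes += 1
        if mistakes == 0:   # a full pass with no mistakes: stop
            break
    return w, b
```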

  14. Does it work? • w0 = [0, 0, 0, 0, 0], b0 = 0, pred -1 on x1, wrong • w1 = [1, 1, 0, 1, 1], b1 = 1, pred 1 on x2, wrong • w2 = [1, 1, -1, 0, 1], b2 = 0, pred -1 on x3, wrong • w3 = [1, 2, 0, 0, 1], b3 = 1, pred 1 on x4, wrong • w4 = [0, 2, 0, -1, 1], b4 = 0, pred 1 on x5, correct • w4 = [0, 2, 0, -1, 1], b4 = 0, pred -1 on x6, correct Yao-Liang Yu

  15. Simplification: Linear Feasibility • Pad a constant 1 to the end of each x; denote the result as z • Pre-multiply with its label y; denote the result as a • Find z such that ATz > 0, where A = [a1, a2, …, an] Yao-Liang Yu
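
A small sketch of this reduction (my variable names, stated as an assumption about what the slide intends): padding each x with a 1 lets z = (w; b) absorb the bias, and multiplying by the label turns "classify every point correctly" into the linear feasibility condition ATz > 0.

```python
import numpy as np

def to_linear_feasibility(X, y):
    """Build A whose columns are a_i = y_i * [x_i; 1].

    With z = (w; b), the condition y_i * (<w, x_i> + b) > 0 for all i
    is exactly A.T @ z > 0 (componentwise).
    """
    n, d = X.shape
    X_pad = np.hstack([X, np.ones((n, 1))])   # append the constant feature 1
    A = (y[:, None] * X_pad).T                # shape (d+1, n), column i is a_i
    return A
```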

  16. Perceptron Convergence Theorem Theorem (Block’62; Novikoff’62). Assume there exists some z such that ATz > 0; then the perceptron algorithm converges to some z*. If each column of A is selected indefinitely often, then ATz* > δ. Corollary. Let δ = 0 and z0 = 0. Then perceptron converges after at most (R/γ)2 steps, where R and γ are as in the sketch below. Yao-Liang Yu
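
The slide's defining formula for R and γ is an image; the standard quantities in this bound (stated here as an assumption about what the slide shows, with a_i the columns of A) are:

```latex
% Assumed standard definitions in the Block--Novikoff mistake bound,
% where a_i = y_i [x_i; 1] are the columns of A:
R = \max_i \|a_i\|_2,
\qquad
\gamma = \max_{\|z\|_2 = 1}\ \min_i\ \langle a_i, z \rangle,
\qquad
\#\text{mistakes} \le \left(\tfrac{R}{\gamma}\right)^2 .
```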

  17. The Margin • There exists z s.t. ATz > 0 iff for some (hence all) s > 0 there exists z s.t. ATz > s1 • From the proof, perceptron convergence depends on the margin γ • The larger the margin is, the more (linearly) separable the data is, hence the faster perceptron learns. Yao-Liang Yu

  18. What does the bound mean? Corollary. Let δ = 0 and z0 = 0. Then perceptron converges after at most (R/γ)2 steps, where R and γ are as above. • Treating R and γ as constants, the # of mistakes is independent of n and d! • Otherwise may need exponential time… • Can we improve the bound? Yao-Liang Yu

  19. But • The larger the margin, the faster perceptron converges • But perceptron stops at an arbitrary linear separator… • Which one do you prefer? Support vector machines Yao-Liang Yu

  20. What if non-separable? • Find a better feature representation • Use a deeper model • Soft margin: can replace the following in the perceptron proof Yao-Liang Yu

  21. Perceptron Boundedness Theorem • Perceptron convergence requires the existence of a separating hyperplane • How to check this assumption in practice? • What if it fails? (trust me, it will) Theorem (Minsky and Papert’67; Block and Levin’70). The iterate z = (w; b) of the perceptron algorithm is always bounded. In particular, if there is no separating hyperplane, then perceptron cycles. “…proof of this theorem is complicated and obscure. So are the other proofs we have since seen...” --- Minsky and Papert, 1987 Yao-Liang Yu

  22. When to stop perceptron? • Online learning: never • Batch learning • Maximum number of iterations reached or run out of time • Training error stops changing • Validation error stops decreasing • Weights stop changing much, if using a diminishing step size ηt, i.e., wt+1 = wt + ηt yi xi (see the sketch below) Yao-Liang Yu
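
A hedged sketch of the batch stopping criteria on this slide, combining a diminishing step size with a "weights stopped changing" check; the specific schedule ηt = η0/(t+1) and the tolerance are assumptions for illustration, not the lecture's prescription.

```python
import numpy as np

def perceptron_diminishing(X, y, eta0=1.0, tol=1e-4, max_epochs=1000):
    """Perceptron with a diminishing step size eta_t = eta0 / (t + 1).

    Batch stopping rules illustrated here: stop when the weights stop
    changing much between epochs, or when max_epochs is reached.
    """
    n, d = X.shape
    w, b, t = np.zeros(d), 0.0, 0
    for _ in range(max_epochs):
        w_old = w.copy()
        for x_i, y_i in zip(X, y):
            if y_i * (np.dot(w, x_i) + b) <= 0:  # mistake
                eta = eta0 / (t + 1)             # diminishing step size
                w = w + eta * y_i * x_i          # w_{t+1} = w_t + eta_t * y_i * x_i
                b = b + eta * y_i
            t += 1
        if np.linalg.norm(w - w_old) < tol:      # weights stopped changing much
            break
    return w, b
```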

  23. Outline • Announcements • Perceptron • Extensions • Next Yao-Liang Yu

  24. Multiclass Perceptron • One vs. all • Class c as positive • All other classes as negative • Highest activation wins: pred = argmaxc wcT x • One vs. one • Class c as positive • Class c’ as negative • Voting Yao-Liang Yu
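
A minimal one-vs-all sketch: one weight vector per class, and the highest activation wins. The mistake-driven update shown is a common multiclass perceptron variant, given as an assumption rather than the lecture's exact rule.

```python
import numpy as np

def ova_predict(W, x):
    """One-vs-all prediction: W holds one weight vector per class (as rows);
    the class with the highest activation w_c^T x wins."""
    return int(np.argmax(W @ x))

def ova_update(W, x, y):
    """Mistake-driven update (a common variant): on an error, promote the
    true class's weights and demote the predicted class's weights."""
    y_hat = ova_predict(W, x)
    if y_hat != y:
        W[y] += x
        W[y_hat] -= x
    return W
```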

  25. Winnow (Littlestone’88) • Multiplicative vs. additive Theorem (Littlestone’88). Assume there exists some nonnegative z such that ATz > 0; then winnow converges to some z*. If each column of A is selected indefinitely often, then ATz* > δ. Yao-Liang Yu
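
A sketch of the classic Winnow update for binary features x in {0,1}^d and labels in {-1, +1}; the promotion factor α and the threshold d/2 are textbook defaults assumed here, and the lecture's exact variant may differ. The point of contrast with the perceptron is that the weight update is multiplicative rather than additive.

```python
import numpy as np

def winnow(X, y, alpha=2.0, max_epochs=100):
    """Classic Winnow: binary features in {0,1}^d, labels in {-1, +1}.

    Weights stay nonnegative and are rescaled (not shifted) on mistakes:
    promoted on false negatives, demoted on false positives.
    """
    n, d = X.shape
    w = np.ones(d)        # nonnegative weights, initialized to 1
    theta = d / 2.0       # decision threshold
    for _ in range(max_epochs):
        mistakes = 0
        for x_i, y_i in zip(X, y):
            y_hat = 1 if np.dot(w, x_i) >= theta else -1
            if y_hat != y_i:
                # multiplicative update, applied only to active features
                w *= alpha ** (y_i * x_i)
                mistakes += 1
        if mistakes == 0:
            break
    return w
```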

  26. Outline • Announcements • Perceptron • Extensions • Next Yao-Liang Yu

  27. Linear regression • Perceptron is a linear rule for classification • y is in a finite set, e.g., {+1, -1} • What if y can be any real number? • Known as regression • Again, there is a linear rule • Regularization to avoid overfitting • Cross-validation to select the hyperparameter [slide labels the parts of the objective: empirical risk, regularization, hyperparameter] Yao-Liang Yu
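
A hedged sketch of one such regularized linear rule, assuming the usual ridge (squared-error plus squared-norm) objective; the slide does not spell out the exact formula, so this is an illustration rather than the lecture's definition.

```python
import numpy as np

def ridge_regression(X, y, lam=1.0):
    """Regularized linear regression (ridge):
        min_w ||X w - y||^2 + lam * ||w||^2
    where lam is the hyperparameter, typically chosen by cross-validation."""
    n, d = X.shape
    # Closed-form solution: w = (X^T X + lam * I)^{-1} X^T y
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
```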

  28. Questions? Yao-Liang Yu

  29. Another example: pricing • Selling a product Z to user x with price y = f(x, w) • If y is too high, the user won’t buy; update w • If y is acceptable, the user buys [update w] • How to measure our performance? Yao-Liang Yu
