
Lecture 3: Perceptron



Presentation Transcript


  1. Lecture 3: Perceptron

  2. Recap: Perceptron algorithm
  Datapoints (x1, y1), (x2, y2), …, with xt ∈ R^d and yt ∈ {+1, −1}, are separable by a hyperplane through the origin.
  Perceptron: w = 0; for t = 1, 2, …: if yt(w · xt) ≤ 0, update w = w + yt xt.
  Claim: Suppose
  • ||xt|| ≤ R for all t
  • there is some unit vector u ∈ R^d and some “margin” γ > 0 such that yt(u · xt) ≥ γ for all t
  Then Perceptron makes at most (R/γ)^2 mistakes/updates.
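As a concrete reference, here is a minimal sketch of this algorithm in Python/NumPy (the function name `perceptron_train` and the `max_passes` safeguard are mine, not from the slides):

```python
import numpy as np

def perceptron_train(X, y, max_passes=100):
    """Perceptron on (X, y) with labels in {+1, -1}.

    Assumes the data are (nearly) separable by a hyperplane through the
    origin; returns the weight vector w and the number of updates made.
    """
    n, d = X.shape
    w = np.zeros(d)
    updates = 0
    for _ in range(max_passes):
        made_update = False
        for t in range(n):
            # A mistake is y_t (w . x_t) <= 0; points lying exactly on
            # the hyperplane also trigger an update.
            if y[t] * np.dot(w, X[t]) <= 0:
                w = w + y[t] * X[t]
                updates += 1
                made_update = True
        if not made_update:      # every point is correctly classified
            break
    return w, updates
```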

  3. Preprocessing step
  Points (x, y) where x ∈ R^d, y ∈ {+1, −1}.
  Add an extra feature to x, and set it to 1: x′ = (x, 1) ∈ R^(d+1).
  Then: points (x, y) linearly separable ⟺ points (x′, y) linearly separable by a hyperplane through the origin.
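This preprocessing step is one line in code; a small sketch (the helper name `add_bias_feature` is mine):

```python
import numpy as np

def add_bias_feature(X):
    """Append a constant-1 feature to every point, so a separator with
    offset b in R^d corresponds to a separator through the origin in
    R^(d+1)."""
    return np.hstack([X, np.ones((X.shape[0], 1))])
```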

  4. Fisher’s IRIS data
  Four features: sepal length, sepal width, petal length, petal width.
  Three classes (species of iris): setosa, versicolor, virginica.
  50 instances of each.
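The slides do not say how the data was loaded; if you want to reproduce the experiments, one convenient option (assuming scikit-learn is available) is:

```python
from sklearn.datasets import load_iris
import numpy as np

iris = load_iris()
X, species = iris.data, iris.target       # 150 x 4 features, labels 0/1/2
# Binary task used in the slides: setosa (+1) vs. the other two species (-1).
y = np.where(species == 0, 1, -1)
```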

  5. Features 1 and 2 (sepal width/length)

  6. Features 3 and 4 (petal width/length)

  7. Features 1 and 2; goal: separate setosa from the other two. 1500 updates (with a different permutation of the data: 900).

  8. Features 3 and 4; goal: separate setosa from the other two.
  Misclassified points per pass (plot highlights point 51 and points 1, 2): iteration 1: [1, 51]; iteration 2: [1, 2]; iteration 3: [ ] (no mistakes).

  9. Linear separator vs nearest neighbor
  Linear separators: a parametric model, with a fixed number of parameters to learn.
  Nearest neighbor: nonparametric; the prediction on a test point x depends only on the training data near x, not on the rest of the training data.
  Advantages of linear separators: compact, fast convergence, potentially meaningful.

  10. Nonseparable data
  What if the data is not linearly separable? In this case: almost separable… how will the perceptron perform?

  11. Online perceptron
  Data comes in an endless stream… convergence is not an issue. But how many mistakes does it make?
  Suppose that for all t ≥ 0: there is some u ∈ R^d and some kt ≥ 0 such that for all but kt of the first t datapoints (x, y), y(u · x) ≥ γ.
  Then for all t ≥ 0: the perceptron algorithm makes at most (R/γ)^2 + kt(1 + 2R/γ) updates (i.e. mistakes) up to time t.

  12. Batch perceptron
  Batch algorithm:
  w = 0
  while some (xi, yi) is misclassified: w = w + yi xi
  Nonseparable data: this will never converge. How can it be fixed?
  Dream: somehow find the separator that misclassifies the fewest points… but this is NP-hard (in fact, even NP-hard to solve approximately).

  13. Fixing the batch perceptron
  Idea one: only go through the data once, or a fixed number of times.
  w = 0
  for k = 1 to K:
    for i = 1 to m:
      if (xi, yi) is misclassified: w = w + yi xi
  At least this stops! Problem: the final w might not be good. E.g. right before terminating, the algorithm might perform an update on a total outlier…
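A runnable sketch of this K-pass variant (assuming NumPy; the name `perceptron_fixed_passes` is mine):

```python
import numpy as np

def perceptron_fixed_passes(X, y, K=10):
    """Batch perceptron that sweeps the data exactly K times and then
    stops, whether or not the data are separable."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(K):
        for i in range(n):
            if y[i] * np.dot(w, X[i]) <= 0:   # misclassified (or on the boundary)
                w = w + y[i] * X[i]
    return w   # may be poor if the very last update was caused by an outlier
```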

  14. Voted-perceptron
  Idea two: keep around intermediate hypotheses, and have them “vote” [Freund and Schapire, 1998].
  n = 1; w1 = 0; c1 = 0
  for k = 1 to K:
    for i = 1 to m:
      if (xi, yi) is misclassified:
        wn+1 = wn + yi xi
        cn+1 = 1
        n = n + 1
      else:
        cn = cn + 1
  At the end, a collection of linear separators w1, w2, …, along with survival times: cn = the amount of time that wn survived.
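A minimal Python transcription of this pseudocode, assuming NumPy (the function name `voted_perceptron_train` and the list-based bookkeeping are mine):

```python
import numpy as np

def voted_perceptron_train(X, y, K=10):
    """Voted perceptron training: returns the list of intermediate
    separators w_1, w_2, ... and their survival counts c_1, c_2, ..."""
    m, d = X.shape
    ws = [np.zeros(d)]   # current hypothesis w_n is ws[-1]
    cs = [0]             # survival count c_n of the current hypothesis
    for _ in range(K):
        for i in range(m):
            if y[i] * np.dot(ws[-1], X[i]) <= 0:   # mistake: spawn a new hypothesis
                ws.append(ws[-1] + y[i] * X[i])
                cs.append(1)
            else:                                  # survived one more point
                cs[-1] += 1
    return ws, cs
```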

  15. Voted-perceptron, cont’d
  Idea two: keep around intermediate hypotheses, and have them “vote” [Freund and Schapire, 1998].
  At the end, a collection of linear separators w1, w2, …, along with survival times: cn = the amount of time that wn survived.
  This cn is a good measure of the reliability of wn.
  To classify a test point x, use a weighted majority vote: predict sign( Σn cn · sign(wn · x) ).

  16. Voted-perceptron, cont’d
  • Problem: need to keep around a lot of wn vectors.
  • Solutions:
    • Find “representatives”
    • Alternative prediction rule: use the single averaged vector wavg = Σn cn wn and predict sign(wavg · x).
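Both prediction rules, written against the training sketch above (function names are mine; the averaging rule is the standard averaged-perceptron shortcut):

```python
import numpy as np

def predict_voted(ws, cs, x):
    """Weighted majority vote over all surviving hypotheses."""
    vote = sum(c * np.sign(np.dot(w, x)) for w, c in zip(ws, cs))
    return 1 if vote >= 0 else -1

def predict_averaged(ws, cs, x):
    """Single averaged hypothesis w_avg = sum_n c_n w_n: cheaper to store
    and evaluate, and often nearly as accurate as full voting."""
    w_avg = sum(c * w for w, c in zip(ws, cs))
    return 1 if np.dot(w_avg, x) >= 0 else -1
```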

  17. IRIS: features 3 and 4; goal: separate setosa (circle) from the rest, with one corrupted setosa point.
  Run Voted-Perceptron for five rounds. [Slide shows the survival counts cn of the intermediate hypotheses; values range from 1 up to 222.]
  Final hypothesis: 1 wrong (either voting or averaging).

  18. IRIS: features 3 and 4; goal: separate + from o/x.
  100 rounds, 1595 updates (5 errors). Final hypothesis: 5 errors for voting, 6 for averaging.

  19. Postscript: multiclass
  What if there are k classes? Reduce to binary: all-vs-one (one separator per class, that class vs. the rest).
  Not always easy to do: [figure shows an arrangement of points labeled with classes 1–4 for which no single class is linearly separable from the rest].
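A sketch of the all-vs-one reduction, built on top of the `perceptron_train` sketch from earlier (helper names are mine):

```python
import numpy as np

def train_one_vs_rest(X, labels, classes, K=10):
    """Train one perceptron per class: class c (+1) vs. everything else (-1)."""
    separators = {}
    for c in classes:
        y = np.where(labels == c, 1, -1)
        w, _ = perceptron_train(np.asarray(X), y, max_passes=K)
        separators[c] = w
    return separators

def predict_one_vs_rest(separators, x):
    """Predict the class whose separator gives the largest score w . x."""
    return max(separators, key=lambda c: np.dot(separators[c], x))
```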

  20. Some open problems
  Modify the (voted) perceptron algorithm to:
  [1] find a linear separator with large margin
  [2] “give up” on troublesome points after a while
