1 / 22

Announcements

Announcements. See Chapter 5 of Duda, Hart, and Stork. Tutorial by Burge linked to on web page. “Learning quickly when irrelevant attributes abound,” by Littlestone, Machine Learning 2:285-318, 1988. Supervised Learning. Classification with labeled examples.

Download Presentation

Announcements

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Announcements • See Chapter 5 of Duda, Hart, and Stork. • Tutorial by Burge linked to on web page. • “Learning quickly when irrelevant attributes abound,” by Littlestone, Machine Learning 2:285-318, 1988.

  2. Supervised Learning • Classification with labeled examples. • Images vectors in high-D space.

  3. Supervised Learning • Labeled examples called training set. • Query examples called test set. • Training and test set must come from same distribution.

  4. Linear Discrimants • Images represented as vectors, x1, x2, …. • Use these to find hyperplane defined by vector w and w0. xis on hyperplane: wTx + w0 = 0. • Notation: aT= [w0, w1, …]. yT = [1,x1,x2, …] So hyperplane is aTy=0. • A query, q, is classified based on whether aTq > 0 or aTq < 0.

  5. Why linear discriminants? • Optimal if classes are Gaussian with same covariances. • Linear separators easier to find. • Hyperplanes have few parameters, prevents overfitting. • Have lower VC dimension, but we don’t have time for this.

  6. Linearly Separable Classes

  7. Linearly Separable Classes • For one set of classes, aTy > 0. For other set: aTy < 0. • Notationally convenient if, for second class, make y negative. • Then, finding a linear separator means finding a such that, for all i, aTy > 0. • Note, this is a linear program. • Problem convex, so descentalgorithms converge to global optimum.

  8. Perceptron Algorithm Perceptron Error Function Y is set of misclassified vectors. So update a by: Simplify by cycling through y and whenever one is misclassified, update a  a + cy. This converges after finite # of steps.

  9. a(k+1) a(k) Perceptron Intuition

  10. Support Vector Machines • Extremely popular. • Find linear separator with maximum margin. • Some guarantees this generalizes well. • Can work in high-dimensional space without overfitting. • Nonlinear map to high-dim. space, then find linear separator. • Special tricks allow efficiency in ridiculously high dimensional spaces. • Can handle non-separable classes also. • Not as important if space very high-dimensional.

  11. Maximum Margin Maximize the minimum distance from hyperplane to points. Points at this minimum distance are support vectors.

  12. Geometric Intuitions • Maximum margin between points -> Maximum margin between convex sets

  13. This implies max margin hyperplane is orthogonal to vector connecting nearest points of convex sets, and halfway between.

  14. Non-statistical learning • There are a class of functions that could label the data. • Our goal is to select the correct function, with as little information as possible. • Don’t think of data coming from a class described by probability distributions. • Look at worst-case performance. • This is CS’ey approach. • In statistical model, worst case not meaningful.

  15. On-Line Learning • Let X be a set of objects (eg., vectors in a high-dimensional space). • Let C be a class of possible classifying functions (eg., hyperplanes). • f in C: X-> {0,1} • One of these correctly classifies all data. • The learner is asked to classify an item in X, then told the correct class. • Eventually, learner determines correct f. • Measure number of mistakes made. • Worst case bound for learning strategy.

  16. VC Dimension • S, a subset of X, is shattered by C if, for any U, a subset of S, there exists f in C such that f is 1 on U and 0 on S-U. • The VC Dimension of C is the size of the largest set shattered by C.

  17. VC Dimension and worst case learning • Any learning strategy makes VCdim(C) mistakes in the worst case. • If S is shattered by C • Then for any assignment of values to S, there is an f in C that makes this assignment. • So any set of choices the learner makes for S can be entirely wrong.

  18. Winnow • x’selements have binary values. • Find weights. Classify x by whether wTx > 1 or wTx < 1. • Algorithm requires that there exist weights u, such that: • uTx > 1 when f(x) = 1 • uTx < 1 – d when f(x) = 0. • That is, there is a margin of d.

  19. Winnow Algorithm • Initialize w = (1,1, …1). • Let a = 1+d/2. • Decision: wTx > q • If learner correct, weights don’t change. If wrong:

  20. Some intuitions • Note that this is just like Perceptron, but with multiplicative change, not additive. • Moves weights in right direction; • if wTx was too big, shrink weights that effect inner product. • If wTx too small, make weights bigger. • Weights change more rapidly (exponential with mistakes, not linear).

  21. Theorem Number of errors bounded by: Set q = n Note: if xiis an irrelevant feature, ui = 0. Errors grow logarithmically with irrelevant features. Empirically, Perceptron errors grow linearly. This is optimal for k-monotone disjunctions: f(x1, … xn) = xi1V xi2V … V xik

  22. Winnow Summary • Simple algorithm; variation on Perceptron. • Great with irrelevant attributes. • Optimal for monotone disjunctions.

More Related