
CS546: Machine Learning and Natural Language Lecture 7: Introduction to Classification: Linear Learning Algorithms 2009


gustave



Presentation Transcript


  1. CS546: Machine Learning and Natural Language. Lecture 7: Introduction to Classification: Linear Learning Algorithms. 2009

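  2. Linear Functions: $f(x) = 1$ if $w_1 x_1 + w_2 x_2 + \dots + w_n x_n \ge \theta$, and $0$ otherwise. • Disjunctions: $y = x_1 \vee x_3 \vee x_5$, that is, $y = (1 \cdot x_1 + 1 \cdot x_3 + 1 \cdot x_5 \ge 1)$ • At least m of n: $y$ = at least 2 of $\{x_1, x_3, x_5\}$, that is, $y = (1 \cdot x_1 + 1 \cdot x_3 + 1 \cdot x_5 \ge 2)$ • Exclusive-OR: $y = (x_1 \wedge \neg x_2) \vee (\neg x_1 \wedge x_2)$ • Non-trivial DNF: $y = (x_1 \wedge x_2) \vee (x_3 \wedge x_4)$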
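The examples on this slide can be checked mechanically. The following is a minimal sketch, not part of the lecture: `ltu`, `disjunction`, `at_least_two`, and `xor_is_linear` are hypothetical helper names. The XOR search only covers a small grid of integer weights and thresholds, so it is an illustration rather than a proof that no linear threshold function represents XOR.

```python
from itertools import product

def ltu(w, theta, x):
    """Linear threshold unit: 1 if w . x >= theta, else 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= theta else 0

# Five Boolean features x1..x5; weight 1 on x1, x3, x5 and 0 elsewhere.
W = (1, 0, 1, 0, 1)

def disjunction(x):
    # x1 v x3 v x5, written as 1*x1 + 1*x3 + 1*x5 >= 1
    return ltu(W, 1, x)

def at_least_two(x):
    # at least 2 of {x1, x3, x5}, written as 1*x1 + 1*x3 + 1*x5 >= 2
    return ltu(W, 2, x)

def xor_is_linear(grid=range(-3, 4)):
    """Search a small grid of integer (w1, w2, theta) for a unit computing XOR."""
    target = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}
    return any(
        all(ltu((w1, w2), t, x) == y for x, y in target.items())
        for w1, w2, t in product(grid, repeat=3)
    )
```

The same weight vector serves both the disjunction and the 2-of-3 function; only the threshold changes.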

  3. Linear Functions [Figure: the hyperplane $w \cdot x = \theta$ with negative examples (-) on one side, and the parallel hyperplane $w \cdot x = 0$]

  4. Perceptron learning rule • On-line, mistake-driven algorithm. • Rosenblatt (1959) suggested that when a target output value is provided for a single neuron with fixed input, it can incrementally change weights and learn to produce the output using the Perceptron learning rule. • Perceptron == Linear Threshold Unit [Figure: a threshold unit T with numbered inputs]

  5. Perceptron learning rule • We learn $f: X \to \{-1, +1\}$ represented as $f = \mathrm{sgn}(w \cdot x)$, where $X = \{0,1\}^n$ or $X = \mathbb{R}^n$ and $w \in \mathbb{R}^n$ • Given labeled examples: • 1. Initialize $w = 0$ • 2. Cycle through all examples: • a. Predict the label of instance $x$ to be $y' = \mathrm{sgn}(w \cdot x)$ • b. If $y' \ne y$, update the weight vector: $w = w + r\,y\,x$ ($r$ is a constant, the learning rate) • Otherwise, if $y' = y$, leave the weights unchanged.
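The rule on this slide can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the lecture's code: the names `sgn`, `dot`, and `perceptron` are hypothetical, `sgn(0)` is taken to be +1, and the `max_epochs` cap is added so the loop terminates even on data that is not linearly separable.

```python
def sgn(z):
    # Assumed convention: sgn(0) = +1.
    return 1 if z >= 0 else -1

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def perceptron(examples, r=1.0, max_epochs=100):
    """examples: list of (x, y), x a tuple of numbers, y in {-1, +1}.
    Returns the learned weights and the number of mistakes made."""
    n = len(examples[0][0])
    w = [0.0] * n                        # 1. initialize w = 0
    mistakes = 0
    for _ in range(max_epochs):          # 2. cycle through all examples
        clean_pass = True
        for x, y in examples:
            if sgn(dot(w, x)) != y:      # mistake: w = w + r * y * x
                w = [wi + r * y * xi for wi, xi in zip(w, x)]
                mistakes += 1
                clean_pass = False
        if clean_pass:                   # a full pass with no mistakes: stop
            break
    return w, mistakes
```

For example, `perceptron([((0, 0, 1), -1), ((0, 1, 1), 1), ((1, 0, 1), 1), ((1, 1, 1), 1)])` learns the OR of the first two features, with the constant third feature playing the role of a threshold.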

  6. Footnote About the Threshold • On previous slide, Perceptron has no threshold • But we don’t lose generality:
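The argument after the colon did not survive in this transcript. One standard way to see the claim, offered here as an assumption about what the slide showed, is to append a constant feature of -1 to every example and treat the threshold as one more weight: $w \cdot x \ge \theta$ holds exactly when $(w, \theta) \cdot (x, -1) \ge 0$. The helper names below are hypothetical.

```python
def augment(x):
    """Append a constant -1 feature so the threshold can be learned as a weight."""
    return tuple(x) + (-1,)

def thresholded(w, theta, x):
    # Original form: w . x >= theta
    return sum(wi * xi for wi, xi in zip(w, x)) >= theta

def through_origin(w_aug, x_aug):
    # Equivalent form with no threshold: w' . x' >= 0
    return sum(wi * xi for wi, xi in zip(w_aug, x_aug)) >= 0
```

With `w_aug = tuple(w) + (theta,)`, the two predicates agree on every input, so a threshold-free Perceptron in n+1 dimensions is as expressive as a thresholded one in n dimensions.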

  7. Geometric View

  8. Perceptron learning rule • 1. Initialize $w = 0$ • 2. Cycle through all examples: • a. Predict the label of instance $x$ to be $y' = \mathrm{sgn}(w \cdot x)$ • b. If $y' \ne y$, update the weight vector to $w = w + r\,y\,x$ ($r$ is a constant, the learning rate) • Otherwise, if $y' = y$, leave the weights unchanged. • If $x$ is Boolean, only weights of active features are updated.

  9. Perceptron Learnability • Obviously can’t learn what it can’t represent • Only linearly separable functions • Minsky and Papert (1969) wrote an influential book demonstrating Perceptron’s representational limitations • Parity functions can’t be learned (XOR) • In vision, if patterns are represented with local features, can’t represent symmetry, connectivity • Research on Neural Networks stopped for years • Rosenblatt himself (1959) asked, “What pattern recognition problems can be transformed so as to become linearly separable?”

  10. (x1 x2) v (x3 x4) y1 y2

  11. Perceptron Convergence • Perceptron Convergence Theorem: If there exists a set of weights that is consistent with the data (i.e., the data is linearly separable), the perceptron learning algorithm will converge. • How long would it take to converge? • Perceptron Cycling Theorem: If the training data is not linearly separable, the perceptron learning algorithm will eventually repeat the same set of weights and therefore enter an infinite loop. • How to provide robustness, more expressivity?

  12. Perceptron: Mistake Bound Theorem • Maintains a weight vector $w \in \mathbb{R}^N$, $w_0 = (0, \dots, 0)$. • Upon receiving an example $x \in \mathbb{R}^N$, predicts according to the linear threshold function $w \cdot x \ge 0$. Theorem [Novikoff, 1963]: Let $(x_1, y_1), \dots, (x_t, y_t)$ be a sequence of labeled examples with $x_i \in \mathbb{R}^N$, $\|x_i\| \le R$ and $y_i \in \{-1, 1\}$ for all $i$. Let $u \in \mathbb{R}^N$, $\gamma > 0$ be such that $\|u\| = 1$ and $y_i \, u \cdot x_i \ge \gamma$ for all $i$. Then Perceptron makes at most $R^2 / \gamma^2$ mistakes on this example sequence (see additional notes). Slide labels: Margin; Complexity Parameter.

  13. Perceptron Mistake Bound Proof: Let $v_k$ be the hypothesis before the $k$-th mistake. Assume that the $k$-th mistake occurs on the input example $(x_i, y_i)$. Assumptions: $v_1 = 0$; $\|u\| \le 1$; $y_i \, u \cdot x_i \ge \gamma$. Proof steps: multiply by $u$; by definition of $u$; by induction; projection. Result: $K < R^2 / \gamma^2$
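Only the step labels of this proof survive in the transcript. The standard argument they appear to follow is sketched below, assuming learning rate 1 and the update $v_{k+1} = v_k + y_i x_i$ on the $k$-th mistake; it yields the bound with $\le$, in line with the "at most" in the theorem statement.

```latex
\begin{align*}
v_{k+1} &= v_k + y_i x_i
  && \text{(update on the $k$-th mistake)} \\
v_{k+1} \cdot u &= v_k \cdot u + y_i (u \cdot x_i) \ge v_k \cdot u + \gamma
  && \text{(multiply by $u$; definition of $u$)} \\
v_{k+1} \cdot u &\ge k\gamma
  && \text{(induction from $v_1 = 0$)} \\
\|v_{k+1}\|^2 &= \|v_k\|^2 + 2 y_i (v_k \cdot x_i) + \|x_i\|^2 \le \|v_k\|^2 + R^2
  && \text{(mistake: $y_i (v_k \cdot x_i) \le 0$)} \\
\|v_{k+1}\|^2 &\le k R^2
  && \text{(induction)} \\
k\gamma &\le v_{k+1} \cdot u \le \|v_{k+1}\| \, \|u\| \le \sqrt{k}\, R
  && \text{(projection, $\|u\| \le 1$)} \\
k &\le R^2 / \gamma^2
\end{align*}
```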

  14. Perceptron for Boolean Functions • How many mistakes will the Perceptron algorithm make when learning a k-disjunction? • It can make O(n) mistakes on a k-disjunction on n attributes. • Our bound: $R^2 / \gamma^2$ • $w$: $1/k^{1/2}$ for the k components, 0 for others • $\gamma$: difference only in one variable: $1/k^{1/2}$ • $R$: $n^{1/2}$ • Thus, we get: $nk$ • Is it possible to do better? • This is important if n, the number of features, is very large

  15. Winnow Algorithm • The Winnow Algorithm learns Linear Threshold Functions. • For the class of disjunction, • instead of demotion we can use elimination.
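The update rules themselves are not in this transcript. Below is a minimal sketch of the standard Winnow formulation (Littlestone, 1988) under assumptions chosen to match the total-weight accounting in the mistake-bound slides: all weights start at 1 (so the total weight is n), the threshold is n, promotion doubles the weights of active features, and demotion halves them. The function name, the `eliminate` flag, and the `max_epochs` cap are hypothetical.

```python
def winnow(examples, n, alpha=2.0, max_epochs=100, eliminate=False):
    """examples: list of (x, y), x a 0/1 tuple of length n, y in {0, 1}.
    Returns (weights, promotions, demotions)."""
    w = [1.0] * n                  # assumed initialization: total weight = n
    theta = float(n)               # assumed threshold
    promotions = demotions = 0
    for _ in range(max_epochs):
        clean_pass = True
        for x, y in examples:
            active_sum = sum(wi for wi, xi in zip(w, x) if xi)
            pred = 1 if active_sum >= theta else 0
            if pred == y:
                continue
            clean_pass = False
            if y == 1:             # mistake on a positive: promote active weights
                w = [wi * alpha if xi else wi for wi, xi in zip(w, x)]
                promotions += 1
            else:                  # mistake on a negative: demote (or eliminate)
                factor = 0.0 if eliminate else 1.0 / alpha
                w = [wi * factor if xi else wi for wi, xi in zip(w, x)]
                demotions += 1
        if clean_pass:             # a full pass with no mistakes: stop
            break
    return w, promotions, demotions
```

Setting `eliminate=True` gives the variant mentioned on the slide for disjunctions, where a feature that is active in a misclassified negative example is dropped outright instead of being halved.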

  16. Winnow - Example

  17. Winnow - Example • Notice that the same algorithm will learn a conjunction over • these variables (w=(256,256,0,…32,…256,256) )

  18. Winnow - Mistake Bound Claim: Winnow makes O(k log n) mistakes on k-disjunctions u - # of mistakes on positive examples (promotions) v - # of mistakes on negative examples (demotions)

  19. Winnow - Mistake Bound Claim: Winnow makes O(k log n) mistakes on k-disjunctions u - # of mistakes on positive examples (promotions) v - # of mistakes on negative examples (demotions) 1. u < k log(2n)

  20. Winnow - Mistake Bound Claim: Winnow makes O(k log n) mistakes on k-disjunctions u - # of mistakes on positive examples (promotions) v - # of mistakes on negative examples (demotions) 1. u < k log(2n) A weight that corresponds to a good variable is only promoted. When these weights get to n there will be no more mistakes on positives.

  21. Winnow - Mistake Bound u - # of mistakes on positive examples (promotions) v - # of mistakes on negative examples (demotions) 2. v < 2(u + 1)

  22. Winnow - Mistake Bound u - # of mistakes on positive examples (promotions) v - # of mistakes on negative examples (demotions) 2. v < 2(u + 1) Total weight: TW=n initially

  23. Winnow - Mistake Bound u - # of mistakes on positive examples (promotions) v - # of mistakes on negative examples (demotions) 2. v < 2(u + 1) Total weight: TW=n initially Mistake on positive: TW(t+1) < TW(t) + n

  24. Winnow - Mistake Bound u - # of mistakes on positive examples (promotions) v - # of mistakes on negative examples (demotions) 2. v < 2(u + 1) Total weight TW=n initially Mistake on positive: TW(t+1) < TW(t) + n Mistake on negative: TW(t+1) < TW(t) - n/2

  25. Winnow - Mistake Bound u - # of mistakes on positive examples (promotions) v - # of mistakes on negative examples (demotions) 2. v < 2(u + 1) Total weight TW=n initially Mistake on positive: TW(t+1) < TW(t) + n Mistake on negative: TW(t+1) < TW(t) - n/2 Therefore 0 < TW < n + u n - v n/2, which gives v < 2(u+1)

  26. Winnow - Mistake Bound u - # of mistakes on positive examples (promotions) v - # of mistakes on negative examples (demotions) # of mistakes: u+ v < 3u + 2 = O(k log n)

  27. Winnow - Extensions • This algorithm learns monotone functions • in Boolean algebra sense • For the general case: • - Duplicate variables • For the negation of variable x, introduce a new variable y. • Learn monotone functions over 2n variables • - Balanced version: • Keep two weights for each variable; effective weight is the difference
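Both extensions are easy to state in code. The sketch below is illustrative only, with hypothetical helper names: `with_negations` implements the duplicate-variables trick by appending the negation of every Boolean feature, and `balanced_score` shows how the balanced version scores an example using the difference of two weight vectors.

```python
def with_negations(x):
    """Map a 0/1 vector of length n to length 2n: (x_1..x_n, not x_1..not x_n)."""
    return tuple(x) + tuple(1 - xi for xi in x)

def balanced_score(w_pos, w_neg, x):
    """Balanced version: the effective weight of feature i is w_pos[i] - w_neg[i]."""
    return sum((p - q) * xi for p, q, xi in zip(w_pos, w_neg, x))
```

A monotone learner run on `with_negations(x)` can then place weight on a negated variable, which is what allows it to represent non-monotone targets.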

  28. Winnow - A Robust Variation • Winnow is robust in the presence of various kinds of noise. • (classification noise, attribute noise) • Importance: sometimes we learn under some distribution • but test under a slightly different one. • (e.g., natural language applications)

  29. Winnow - A Robust Variation • Modeling: • Adversary’s turn: may change the target concept by adding or • removing some variable from the target disjunction. • Cost of each addition move is 1. • Learner’s turn: makes prediction on the examples given, and is • then told the correct answer (according to current target function) • Winnow-R: Same as Winnow, only doesn’t let weights go below 1/2 • Claim: Winnow-R makes O(c log n) mistakes, (c - cost of adversary) • (generalization of previous claim)
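The only change Winnow-R makes is in the demotion step. A minimal sketch, assuming the same halving update as standard Winnow and using a hypothetical function name:

```python
def winnow_r_demote(w, x, alpha=2.0, floor=0.5):
    """Demote the weights of active features, but never let one fall below `floor`."""
    return [max(wi / alpha, floor) if xi else wi for wi, xi in zip(w, x)]
```

Because no weight can shrink below 1/2, a variable that the adversary adds to the target later needs only about log n promotions to become influential again, which is the intuition behind a bound that scales with the adversary's cost.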

  30. Algorithmic Approaches • Focus: two families of algorithms (one on-line representative of each) • Additive update algorithms: Perceptron; SVM (not on-line, but a close relative of Perceptron) • Multiplicative update algorithms: Winnow; close relatives: Boosting, Max Entropy • Which algorithm to choose?

  31. Algorithm Descriptions • Additive weight update algorithm (Perceptron, Rosenblatt, 1958. Variations exist) • Multiplicative weight update algorithm (Winnow, Littlestone, 1988. Variations exist)

  32. How to Compare? • Generalization (since the representation is the same): How many examples are needed to get to a given level of accuracy? • Efficiency: How long does it take to learn a hypothesis and evaluate it (per-example)? • Robustness; adaptation to a new domain, ….

  33. Sentence Representation S= I don’t know whether to laugh or cry - Define a set of features: features are relations that hold in the sentence - Map a sentence to its feature-based representation The feature-based representation will give some of the information in the sentence - Use this as an example to your algorithm

  34. Sentence Representation S= I don’t know whether to laugh or cry - Define a set of features: features are relations that hold in the sentence - Conceptually, there are two steps in coming up with a feature-based representation 1. What are the information sources available? Sensors: words, order of words, properties (?) of words 2. What features to construct based on these? Why needed?
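As a concrete illustration of the two steps, the sketch below maps the example sentence to a set of active Boolean features. The feature types (word presence and adjacent word pairs) and the name `extract_features` are illustrative choices, not the lecture's feature set.

```python
def extract_features(sentence):
    """Return the set of active Boolean features for a sentence."""
    tokens = sentence.lower().split()
    features = set()
    for i, tok in enumerate(tokens):
        features.add(f"word={tok}")                        # sensor: the words
        if i + 1 < len(tokens):
            features.add(f"bigram={tok}_{tokens[i + 1]}")  # sensor: word order
    return features

S = "I don't know whether to laugh or cry"
active = extract_features(S)
```

Only the features in `active` are "on" for this example; every other feature in the (very large) feature space is implicitly zero, which is the sparse representation the later slides rely on.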

  35. Embedding: Whether / Weather. New discriminator is functionally simpler

  36. Domain Characteristics • The number of potential features is very large • The instance space is sparse • Decisions depend on a small set of features (sparse) • Want to learn from a number of examples that is small relative to the dimensionality

  37. Which Algorithm to Choose? • Generalization • Multiplicative algorithms: • Bounds depend on $\|u\|$, the separating hyperplane • $M = 2 \ln n \, \|u\|_1^2 \max_i \|x^{(i)}\|_\infty^2 / \min_i (u \cdot x^{(i)})^2$ • Advantage with few relevant features in concept • Additive algorithms: • Bounds depend on $\|x\|$ (Kivinen / Warmuth, ‘95) • $M = \|u\|_2^2 \max_i \|x^{(i)}\|_2^2 / \min_i (u \cdot x^{(i)})^2$ • Advantage with few active features per example • The $l_1$ norm: $\|x\|_1 = \sum_i |x_i|$ • The $l_2$ norm: $\|x\|_2 = (\sum_{i=1}^n |x_i|^2)^{1/2}$ • The $l_p$ norm: $\|x\|_p = (\sum_{i=1}^n |x_i|^p)^{1/p}$ • The $l_\infty$ norm: $\|x\|_\infty = \max_i |x_i|$
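To make the comparison concrete, the two bounds can be evaluated on a toy case. This is a sketch under stated assumptions: the function names are hypothetical, and the data (a target with 2 relevant features out of 1000, and dense examples with every feature active) is made up to show the regime where the multiplicative bound is much smaller.

```python
import math

def dot(u, x):
    return sum(ui * xi for ui, xi in zip(u, x))

def norm_p(x, p):
    """l_p norm of x; p = math.inf gives the max norm."""
    if p == math.inf:
        return max(abs(xi) for xi in x)
    return sum(abs(xi) ** p for xi in x) ** (1.0 / p)

def multiplicative_bound(u, xs):
    # M = 2 ln n * ||u||_1^2 * max_i ||x_i||_inf^2 / min_i (u . x_i)^2
    n = len(u)
    margin_sq = min(dot(u, x) ** 2 for x in xs)
    r_inf = max(norm_p(x, math.inf) for x in xs)
    return 2 * math.log(n) * norm_p(u, 1) ** 2 * r_inf ** 2 / margin_sq

def additive_bound(u, xs):
    # M = ||u||_2^2 * max_i ||x_i||_2^2 / min_i (u . x_i)^2
    margin_sq = min(dot(u, x) ** 2 for x in xs)
    r_2 = max(norm_p(x, 2) for x in xs)
    return norm_p(u, 2) ** 2 * r_2 ** 2 / margin_sq

n = 1000
u = [1.0, 1.0] + [0.0] * (n - 2)      # 2 relevant features
xs = [[1.0] * n, [-1.0] * n]          # dense examples: all n features active
```

On this data the additive bound grows linearly with n while the multiplicative bound grows only with ln n; with sparse examples and a dense target the comparison can go the other way.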

  38. Generalization • Dominated by the sparseness of the function space • Most features are irrelevant • # of examples required by multiplicative algorithms depends mostly on # of relevant features • (Generalization bounds depend on $\|w\|$.) • Lesser issue: sparseness of the feature space: advantage to additive. Generalization depends on $\|x\|$ (Kivinen/Warmuth 95); see additional notes.

  39. Mistake bounds for 10 of 100 of n • Function: at least 10 out of a fixed 100 variables are active • Dimensionality is n [Figure: # of mistakes to convergence versus n, the total # of variables (dimensionality), with curves for Perceptron/SVMs and Winnow]

  40. Dual Perceptron • We can replace $x_i \cdot x_j$ with $K(x_i, x_j)$, which can be regarded as a dot product in some large (or infinite) space • $K(x, y)$ can often be computed efficiently without computing the mapping to this space
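The dual form itself is not spelled out in this transcript. A minimal sketch of the usual construction follows: the weight vector is never stored, only a mistake count per training example, and every dot product goes through the kernel. The names `kernel_perceptron` and `quadratic_kernel`, the `max_epochs` cap, and the choice of a quadratic kernel are assumptions made for illustration.

```python
def kernel_perceptron(examples, kernel, max_epochs=100):
    """Dual Perceptron: w is kept implicitly as sum_j alpha[j] * y_j * phi(x_j).
    examples: list of (x, y) with y in {-1, +1}. Returns (alpha, score)."""
    alpha = [0] * len(examples)       # alpha[j] = number of mistakes on example j

    def score(x):
        return sum(a * yj * kernel(xj, x)
                   for a, (xj, yj) in zip(alpha, examples) if a)

    for _ in range(max_epochs):
        clean_pass = True
        for i, (x, y) in enumerate(examples):
            pred = 1 if score(x) >= 0 else -1
            if pred != y:
                alpha[i] += 1         # primal equivalent: w = w + y * phi(x)
                clean_pass = False
        if clean_pass:                # a full pass with no mistakes: stop
            break
    return alpha, score

def quadratic_kernel(a, b):
    """(1 + a . b)^2: a dot product in the space of monomials of degree <= 2."""
    return (1 + sum(ai * bi for ai, bi in zip(a, b))) ** 2
```

Run on the four XOR points with `quadratic_kernel`, this learner reaches a consistent hypothesis even though XOR is not linearly separable in the original two features, because the implicit feature space contains the product $x_1 x_2$.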

  41. Efficiency • Dominated by the size of the feature space • Most features are functions (e.g., conjunctions) of raw attributes • Additive algorithms allow the use of Kernels • No need to explicitly generate the complex features • Could be more efficient since work is done in the original feature space. • In practice: explicit Kernels (feature space blow-up) are often more efficient.

  42. Practical Issues and Extensions • There are many extensions that can be made to these basic algorithms. • Some are necessary for them to perform well. • Infinite attribute domain • Regularization

  43. Extensions: Regularization • In general, regularization is used to bias the learner in the direction of a low-expressivity (low VC dimension) separator • Thick Separator (Perceptron or Winnow) • Promote if: $w \cdot x > \theta + \gamma$ • Demote if: $w \cdot x < \theta - \gamma$ [Figure: the hyperplane $w \cdot x = \theta$ with negative examples (-) on one side, and the parallel hyperplane $w \cdot x = 0$]

  44. SNoW • A learning architecture that supports several linear update rules (Winnow, Perceptron, naïve Bayes) • Allows regularization; voted Winnow/Perceptron; pruning; many options • True multi-class classification • Variable size examples; very good support for large scale domains in terms of number of examples and number of features. • “Explicit” kernels (blowing up feature space). • Very efficient (1-2 orders of magnitude faster than SVMs) • Stand alone, implemented in LBJ [Download from: http://L2R.cs.uiuc.edu/~cogcomp ]

  45. COLT approach to explaining Learning • No Distributional Assumption • Training Distribution is the same as the Test Distribution • Generalization bounds depend on this view and affect model selection. $Err_D(h) < Err_{TR}(h) + P(VC(H), \log(1/\delta), 1/m)$ • This is also called the “Structural Risk Minimization” principle.

  46. COLT approach to explaining Learning • No Distributional Assumption • Training Distribution is the same as the Test Distribution • Generalization bounds depend on this view and affect model selection. $Err_D(h) < Err_{TR}(h) + P(VC(H), \log(1/\delta), 1/m)$ • As presented, the VC dimension is a combinatorial parameter that is associated with a class of functions. • We know that the class of linear functions has a lower VC dimension than the class of quadratic functions. • But this notion can be refined to depend on a given data set, and in this way directly affect the hypothesis chosen for this data set.

  47. Data Dependent VC dimension • Consider the class of linear functions, parameterized by their margin. • Although both classifiers separate the data, the distance with which the separation is achieved is different. • Intuitively, we can agree that: Large Margin $\Rightarrow$ Small VC dimension
