
Second Order Learning


Presentation Transcript


  1. Second Order Learning Koby Crammer Department of Electrical Engineering ECML PKDD 2013 Prague

  2. Thanks • Mark Dredze • Alex Kulesza • Avihai Mejer • Edward Moroshko • Francesco Orabona • Fernando Pereira • Yoram Singer • Nina Vaitz

  3. Tutorial Context: Online Learning, SVMs, Optimization Theory, Real-World Data

  4. Outline • Background: • Online learning + notation • Perceptron • Stochastic-gradient descent • Passive-aggressive • Second-Order Algorithms • Second order Perceptron • Confidence-Weighted and AROW • AdaGrad • Properties • Kernels • Analysis • Empirical Evaluation • Synthetic • Real Data

  5. Online Learning Tyrannosaurus rex

  6. Online Learning Triceratops

  7. Online Learning Velociraptor Tyrannosaurus rex

  8. Formal Setting – Binary Classification • Instances • Images, Sentences • Labels • Parse tree, Names • Prediction rule • Linear prediction rules • Loss • No. of mistakes

  9. Predictions • Discrete predictions: • Hard to optimize • Continuous predictions: • Label • Confidence

  10. Loss Functions • Natural Loss: • Zero-One loss: • Real-valued-predictions loss: • Hinge loss: • Exponential loss (Boosting) • Log loss (Max Entropy, Boosting)
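
  The losses listed above can be written explicitly for a linear predictor scored by s = y (w · x). The Python sketch below is illustrative and not part of the original slides; the function names are ours.

```python
import numpy as np

# Illustrative definitions of the losses on slide 10, as functions of the
# signed score s = y * (w @ x); names are ours, not the tutorial's.
def zero_one_loss(s):
    return float(s <= 0)          # 1 on a mistake, 0 otherwise

def hinge_loss(s):
    return max(0.0, 1.0 - s)      # zero once the margin reaches 1

def exp_loss(s):
    return np.exp(-s)             # exponential loss (Boosting)

def log_loss(s):
    return np.log1p(np.exp(-s))   # log loss (Max Entropy, Boosting)
```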

  11. Loss Functions — plot of the hinge loss and the zero-one loss as a function of the margin

  12. Online Learning Maintain a model M • Get instance x • Predict label ŷ = M(x) • Get true label y • Suffer loss ℓ(ŷ, y) • Update the model M
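
  A minimal Python sketch of this protocol loop (ours, for illustration); `stream` and `model` are placeholders we introduce, not objects from the tutorial.

```python
# Online protocol of slide 12: keep a model, predict on each incoming
# example, then update after the true label is revealed.
def online_loop(stream, model):
    mistakes = 0
    for x, y in stream:                  # get instance x, then its true label y
        y_hat = model.predict(x)         # predict label y_hat = M(x)
        mistakes += int(y_hat != y)      # suffer the zero-one loss l(y_hat, y)
        model.update(x, y)               # update the model M
    return mistakes
```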

  13. Notation Abuse Linear Classifiers • Any features • W.l.o.g. • Binary classifiers of the form f(x) = sign(w · x)

  14. Linear Classifiers (cntd.) • Prediction: ŷ = sign(w · x) • Confidence in prediction: |w · x|

  15. Linear Classifiers • Input instance to be classified • Weight vector of the classifier

  16. Margin • Margin of an example (x, y) with respect to the classifier w: y (w · x) • Note: the margin is positive iff the example is classified correctly • The set is separable iff there exists w such that y (w · x) > 0 for every example (x, y)
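
  A small Python sketch of these definitions (ours, not the slides'): the margin of (x, y) under w, and a check that a candidate w separates a data set.

```python
import numpy as np

# Margin of example (x, y) with respect to classifier w, and a separability
# check for a candidate weight vector (all margins strictly positive).
def margin(w, x, y):
    return y * np.dot(w, x)

def separates(w, X, Y):
    return all(margin(w, x, y) > 0 for x, y in zip(X, Y))
```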

  17. Geometrical Interpretation

  18. Geometrical Interpretation

  19. Geometrical Interpretation

  20. Geometrical Interpretation — regions by margin: Margin << 0, Margin < 0, Margin > 0, Margin >> 0

  21. Hinge Loss

  22. Why Online Learning? • Fast • Memory efficient – process one example at a time • Simple to implement • Formal guarantees – mistake bounds • Online-to-batch conversions • No statistical assumptions • Adaptive • Not as good as well-designed batch algorithms

  23. Outline • Background: • Online learning + notation • Perceptron • Stochastic-gradient descent • Passive-aggressive • Second-Order Algorithms • Second order Perceptron • Confidence-Weighted and AROW • AdaGrad • Properties • Kernels • Analysis • Empirical Evaluation • Synthetic • Real Data

  24. Rosenblatt 1958 The Perceptron Algorithm • If no mistake • Do nothing • If mistake • Update: w ← w + y x • Margin after the update: y (w · x) + ‖x‖²
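
  A minimal Python sketch of the Perceptron update described above (illustrative; the tutorial itself gives no code):

```python
import numpy as np

# Perceptron (Rosenblatt, 1958) in the notation of the slides:
# update only on a mistake, w <- w + y * x.
def perceptron(X, Y, epochs=1):
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x, y in zip(X, Y):
            if y * np.dot(w, x) <= 0:   # mistake (non-positive margin)
                w = w + y * x           # the margin on x grows by ||x||^2
            # otherwise: do nothing
    return w
```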

  25. Geometrical Interpretation

  26. Outline • Background: • Online learning + notation • Perceptron • Stochastic-gradient descent • Passive-aggressive • Second-Order Algorithms • Second order Perceptron • Confidence-Weighted and AROW • AdaGrad • Properties • Kernels • Analysis • Empirical Evaluation • Synthetic • Real Data

  27. Gradient Descent • Consider the batch problem: minimize the sum of per-example losses over w • Simple algorithm: • Initialize w = 0 • Iterate, for k = 1, 2, … • Compute the gradient g of the objective at the current w • Set w ← w − η g

  28. Stochastic Gradient Descent • Consider the same batch problem • Simple algorithm: • Initialize w = 0 • Iterate, for k = 1, 2, … • Pick a random index i • Compute the gradient g of the loss on example i at the current w • Set w ← w − η g

  29. Stochastic Gradient Descent • “Hinge” loss max(0, −y (w · x)) • The gradient: −y x when y (w · x) ≤ 0, zero otherwise • Simple algorithm: • Initialize w = 0 • Iterate, for k = 1, 2, … • Pick a random index i • If yi (w · xi) ≤ 0 then w ← w + η yi xi, else leave w unchanged • The Perceptron is stochastic gradient descent on a sum of such “hinge” losses with a specific order of examples
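
  A minimal Python sketch of this stochastic-gradient view (ours, for illustration): SGD on the “hinge”-type loss max(0, −y (w · x)). With step size 1 and the examples visited in order rather than at random, each step is exactly the Perceptron update.

```python
import numpy as np

# SGD on the loss max(0, -y * (w @ x)): the (sub)gradient is -y * x on a
# non-positive margin and zero otherwise.
def sgd_hinge(X, Y, steps=1000, eta=1.0, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        i = rng.integers(len(Y))        # pick a random index i
        x, y = X[i], Y[i]
        if y * np.dot(w, x) <= 0:       # nonzero (sub)gradient: -y * x
            w = w + eta * y * x
        # else: the (sub)gradient is zero, leave w unchanged
    return w
```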

  30. Outline • Background: • Online learning + notation • Perceptron • Stochastic-gradient descent • Passive-aggressive • Second-Order Algorithms • Second order Perceptron • Confidence-Weighted and AROW • AdaGrad • Properties • Kernels • Analysis • Empirical Evaluation • Synthetic • Real Data

  31. Motivation • Perceptron: no guarantee on the margin after the update • PA: enforce a minimal non-zero margin after the update • In particular: • If the margin is large enough (at least 1), do nothing • If the margin is less than 1, update such that the margin after the update is forced to be 1

  32. Input Space

  33. Input Space vs. Version Space • Input Space: points are input data; one constraint is induced by a weight vector; primal space; half-space = all input examples that are classified correctly by a given predictor (weight vector) • Version Space: points are weight vectors; one constraint is induced by an input example; dual space; half-space = all predictors (weight vectors) that classify a given input example correctly

  34. Weight Vector (Version) Space The algorithm forces the weight vector to reside in this region

  35. Passive Step Nothing to do: the weight vector already resides on the desired side.

  36. Aggressive Step The algorithm projects the weight vector onto the desired half-space

  37. Aggressive Update Step • Set the new weight vector to be the solution of the following optimization problem: minimize ½ ‖w − w_t‖² subject to y (w · x) ≥ 1 • Solution: w ← w_t + τ y x, with τ = hinge loss / ‖x‖²
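
  A minimal Python sketch of this Passive-Aggressive update (illustrative; the function name is ours): a passive step when the margin is already at least 1, otherwise a projection onto the desired half-space.

```python
import numpy as np

# Passive-Aggressive update: project w onto { w : y * (w @ x) >= 1 },
# i.e. w <- w + tau * y * x with tau = hinge_loss / ||x||^2.
def pa_update(w, x, y):
    loss = max(0.0, 1.0 - y * np.dot(w, x))   # hinge loss at the current w
    if loss == 0.0:
        return w                               # passive step: margin already >= 1
    tau = loss / np.dot(x, x)                  # aggressive step size
    return w + tau * y * x
```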

  38. Perceptron vs. PA • Common update: w ← w + τ y x • Perceptron: τ = 1 on a mistake • Passive-Aggressive: τ = hinge loss / ‖x‖²

  39. Perceptron vs. PA — plot of the update step against the margin, with regions: Error; No-Error, Small Margin; No-Error, Large Margin

  40. Perceptron vs. PA

  41. Outline • Background: • Online learning + notation • Perceptron • Stochastic-gradient descent • Passive-aggressive • Second-Order Algorithms • Second order Perceptron • Confidence-Weighted and AROW • AdaGrad • Properties • Kernels • Analysis • Empirical Evaluation • Synthetic • Real Data

  42. Geometrical Assumption • All examples are bounded in a ball of radius R

  43. Separability • There exists a unit vector that classifies the data correctly

  44. Perceptron’s Mistake Bound • Simple case: positive points vs. negative points • Separating hyperplane • Bound: the number of mistakes the algorithm makes is bounded by (R/γ)²
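
  For reference, the standard form of this bound (Novikoff's theorem) under the assumptions of slides 42–43 (inputs bounded by R, a separating unit vector u with margin γ); the short derivation below is a sketch and is not taken from the slides.

```latex
% Perceptron mistake bound, assuming \|x_i\| \le R and a unit vector u
% with y_i (u \cdot x_i) \ge \gamma for all i:
\[
  \#\{\text{mistakes}\} \;\le\; \left(\frac{R}{\gamma}\right)^{2}.
\]
% Sketch: after M mistakes, u \cdot w \ge M\gamma while \|w\|^2 \le M R^2,
% so M\gamma \le u \cdot w \le \|w\| \le \sqrt{M}\,R, giving M \le (R/\gamma)^2.
```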

  45. Geometrical Motivation

  46. SGD on such data

  47. Outline • Background: • Online learning + notation • Perceptron • Stochastic-gradient descent • Passive-aggressive • Second-Order Algorithms • Second order Perceptron • Confidence-Weighted and AROW • AdaGrad • Properties • Kernels • Analysis • Empirical Evaluation • Synthetic • Real Data

  48. Nicolò Cesa-Bianchi, Alex Conconi, Claudio Gentile, 2005 Second Order Perceptron • Assume all inputs are given in advance • Compute a “whitening” matrix • Run the Perceptron on the “whitened” data • New “whitening” matrix
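
  A minimal Python sketch of the whitening idea on this slide (ours; the exact matrix used in the paper may differ): transform the inputs by the inverse square root of a regularized second-moment matrix, then run the ordinary Perceptron on the transformed data.

```python
import numpy as np

# Whiten the inputs with (a*I + X^T X)^(-1/2), assuming all inputs X are
# available in advance; a > 0 is a regularization parameter.
def whiten(X, a=1.0):
    d = X.shape[1]
    S = a * np.eye(d) + X.T @ X            # regularized second-moment matrix
    evals, evecs = np.linalg.eigh(S)
    S_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
    return X @ S_inv_sqrt                  # whitened inputs

# X_white = whiten(X); w = perceptron(X_white, Y)   # using the sketch above
```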
