Second Order Learning Koby Crammer Department of Electrical Engineering ECML PKDD 2013 Prague
Thanks • Mark Dredze • Alex Kulesza • Avihai Mejer • Edward Moroshko • Francesco Orabona • Fernando Pereira • Yoram Singer • Nina Vaitz
Tutorial Context [Diagram: this tutorial sits at the intersection of online learning, SVMs, optimization, theory, and real-world data.]
Outline • Background: • Online learning + notation • Perceptron • Stochastic-gradient descent • Passive-aggressive • Second-Order Algorithms • Second order Perceptron • Confidence-Weighted and AROW • AdaGrad • Properties • Kernels • Analysis • Empirical Evaluation • Synthetic • Real Data
Online Learning Tyrannosaurus rex
Online Learning Triceratops
Online Learning Velociraptor Tyrannosaurus rex
Formal Setting – Binary Classification • Instances • Images, sentences • Labels • Parse trees, names • Prediction rule • Linear prediction rules • Loss • No. of mistakes
Predictions • Discrete predictions: ŷ ∈ {−1, +1} • Hard to optimize • Continuous predictions: ŷ ∈ ℝ • sign(ŷ) gives the label • |ŷ| gives the confidence
Loss Functions • Natural loss: • Zero-one loss: ℓ(ŷ, y) = 1 if sign(ŷ) ≠ y, and 0 otherwise • Real-valued-prediction losses: • Hinge loss: max(0, 1 − y ŷ) • Exponential loss (Boosting): exp(−y ŷ) • Log loss (Max Entropy, Boosting): log(1 + exp(−y ŷ))
Loss Functions [Plot: hinge loss and zero-one loss as a function of the margin y ŷ; the hinge loss upper-bounds the zero-one loss and is zero once the margin reaches 1.]
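A minimal sketch (not from the slides) of the losses used throughout the tutorial, in Python; `y` is the true label in {−1, +1} and `y_hat` the real-valued prediction:

```python
import math

def zero_one_loss(y_hat: float, y: int) -> float:
    """1 if the sign of the prediction disagrees with the label, else 0."""
    return 1.0 if y * y_hat <= 0 else 0.0

def hinge_loss(y_hat: float, y: int) -> float:
    """Convex upper bound on the zero-one loss; zero once the margin y*y_hat >= 1."""
    return max(0.0, 1.0 - y * y_hat)

def exp_loss(y_hat: float, y: int) -> float:
    """Exponential loss used in boosting."""
    return math.exp(-y * y_hat)

def log_loss(y_hat: float, y: int) -> float:
    """Logistic (max-entropy) loss."""
    return math.log(1.0 + math.exp(-y * y_hat))
```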
Online Learning (protocol) For t = 1, 2, …: • Maintain model M • Get instance x • Predict label ŷ = M(x) • Get true label y • Suffer loss ℓ(ŷ, y) • Update model M
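As an illustration, this protocol can be written as a driver loop. The `Learner` interface below is my own assumption (it is not part of the tutorial); the algorithms that follow — Perceptron, PA, second-order methods — can all be plugged into it:

```python
from typing import Iterable, Protocol, Tuple
import numpy as np

class Learner(Protocol):
    def predict(self, x: np.ndarray) -> float: ...        # real-valued score
    def update(self, x: np.ndarray, y: int) -> None: ...  # y in {-1, +1}

def run_online(learner: Learner, stream: Iterable[Tuple[np.ndarray, int]]) -> int:
    """Run the online protocol and return the number of mistakes."""
    mistakes = 0
    for x, y in stream:
        y_hat = learner.predict(x)       # predict before seeing the label
        if y * y_hat <= 0:               # zero-one loss on the sign
            mistakes += 1
        learner.update(x, y)             # then reveal the label and update
    return mistakes
```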
Linear Classifiers (notation abuse) • Any features • W.l.o.g. no explicit bias term • Binary classifiers of the form f(x) = sign(w · x)
Linear Classifiers (cont.) • Prediction: ŷ = sign(w · x) • Confidence in prediction: |w · x|
Linear Classifiers [Diagram: the prediction sign(w · x) — x is the input instance to be classified, w is the weight vector of the classifier.]
Margin • Margin of an example (x, y) with respect to the classifier w: y (w · x) • Note: the margin is positive iff the example is classified correctly • The set is separable iff there exists a w such that y (w · x) > 0 for every example (x, y)
Geometrical Interpretation [Figure: points relative to the separating hyperplane, labeled margin >> 0, > 0, < 0, << 0 — the further a correctly classified point lies from the hyperplane, the larger its margin.]
Why Online Learning? • Fast • Memory efficient – processes one example at a time • Simple to implement • Formal guarantees – mistake bounds • Online-to-batch conversions • No statistical assumptions • Adaptive • Not as good as well-designed batch algorithms
Outline • Background: • Online learning + notation • Perceptron • Stochastic-gradient descent • Passive-aggressive • Second-Order Algorithms • Second order Perceptron • Confidence-Weighted and AROW • AdaGrad • Properties • Kernels • Analysis • Empirical Evaluation • Synthetic • Real Data
Rosenblatt 1958 The Perceptron Algorithm • If no mistake: do nothing • If mistake: update w ← w + y x • Margin after update: y (w + y x) · x = y (w · x) + ‖x‖², i.e. the margin increases by ‖x‖²
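A minimal Perceptron sketch under the hypothetical `Learner` interface assumed earlier; the weight vector and the mistake-driven update are the only state:

```python
import numpy as np

class Perceptron:
    def __init__(self, d: int):
        self.w = np.zeros(d)                 # weight vector

    def predict(self, x: np.ndarray) -> float:
        return float(self.w @ x)             # sign gives the label, |.| the confidence

    def update(self, x: np.ndarray, y: int) -> None:
        if y * (self.w @ x) <= 0:            # mistake (or zero margin)
            self.w += y * x                  # w <- w + y x; otherwise do nothing
```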
Outline • Background: • Online learning + notation • Perceptron • Stochastic-gradient descent • Passive-aggressive • Second-Order Algorithms • Second order Perceptron • Confidence-Weighted and AROW • AdaGrad • Properties • Kernels • Analysis • Empirical Evaluation • Synthetic • Real Data
Gradient Descent • Consider the batch problem: minimize L(w) = Σᵢ ℓ(w; xᵢ, yᵢ) over the whole training set • Simple algorithm: • Initialize w₁ = 0 • Iterate, for k = 1, 2, … • Compute the gradient ∇L(w_k) • Set w_{k+1} = w_k − η ∇L(w_k)
Stochastic Gradient Descent • Consider the same batch problem: minimize L(w) = Σᵢ ℓ(w; xᵢ, yᵢ) • Simple algorithm: • Initialize w₁ = 0 • Iterate, for k = 1, 2, … • Pick a random index i • Compute the gradient of the single-example loss ∇ℓ(w_k; xᵢ, yᵢ) • Set w_{k+1} = w_k − η ∇ℓ(w_k; xᵢ, yᵢ)
Stochastic Gradient Descent • “Hinge” loss: ℓ(w; x, y) = max(0, −y (w · x)) • The gradient: −y x if y (w · x) ≤ 0, and 0 otherwise • Simple algorithm: • Initialize w₁ = 0 • Iterate, for k = 1, 2, … • Pick a random index i • If yᵢ (w_k · xᵢ) ≤ 0 then w_{k+1} = w_k + η yᵢ xᵢ, else w_{k+1} = w_k • The Perceptron is stochastic gradient descent on a sum of such “hinge” losses, with a specific order of examples (and step size 1)
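A sketch of this SGD view (an illustration, not the slides' code): with the “hinge” loss max(0, −y(w·x)) and step size η = 1, each stochastic gradient step reduces to the Perceptron update on the sampled example:

```python
import numpy as np

def sgd_perceptron_loss(X: np.ndarray, Y: np.ndarray, epochs: int = 10,
                        eta: float = 1.0, seed: int = 0) -> np.ndarray:
    """SGD on L(w) = sum_i max(0, -y_i (w . x_i)); with eta = 1 each step
    is exactly the Perceptron update on the sampled example."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs * len(X)):
        i = rng.integers(len(X))             # pick a random index
        if Y[i] * (w @ X[i]) <= 0:           # sub-gradient is -y_i x_i here
            w += eta * Y[i] * X[i]           # gradient step == Perceptron update
        # else: the gradient is zero, nothing to do
    return w
```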
Outline • Background: • Online learning + notation • Perceptron • Stochastic-gradient descent • Passive-aggressive • Second-Order Algorithms • Second order Perceptron • Confidence-Weighted and AROW • AdaGrad • Properties • Kernels • Analysis • Empirical Evaluation • Synthetic • Real Data
Motivation • Perceptron: no guarantee on the margin after the update • PA: enforce a minimal non-zero margin after the update • In particular: • If the margin is large enough (≥ 1), do nothing • If the margin is less than 1, update so that the margin after the update is at least 1
Input Space vs. Version Space • Input space: points are input data; one constraint is induced by a weight vector; primal space; a half-space = all input examples that are classified correctly by a given predictor (weight vector) • Version space: points are weight vectors; one constraint is induced by an input example; dual space; a half-space = all predictors (weight vectors) that classify a given input example correctly
Weight Vector (Version) Space The algorithm forces the new weight vector to reside in this region (the half-space of weight vectors satisfying the margin constraint on the current example).
Passive Step Nothing to do: the current weight vector already resides on the desired side.
Aggressive Step The algorithm projects the current weight vector onto the desired half-space.
Aggressive Update Step • Set w_{t+1} to be the solution of the following optimization problem: min_w ½ ‖w − w_t‖² subject to y_t (w · x_t) ≥ 1 • Solution: w_{t+1} = w_t + τ_t y_t x_t, where τ_t = ℓ_t / ‖x_t‖² and ℓ_t = max(0, 1 − y_t (w_t · x_t)) is the hinge loss
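A sketch of the resulting PA update in the same style as the Perceptron sketch above (class name and interface are my assumptions):

```python
import numpy as np

class PassiveAggressive:
    def __init__(self, d: int):
        self.w = np.zeros(d)

    def predict(self, x: np.ndarray) -> float:
        return float(self.w @ x)

    def update(self, x: np.ndarray, y: int) -> None:
        loss = max(0.0, 1.0 - y * (self.w @ x))   # hinge loss on this example
        if loss > 0.0:                            # aggressive step: project onto
            tau = loss / float(x @ x)             #   { w : y (w . x) >= 1 }
            self.w += tau * y * x
        # loss == 0: passive step, the margin is already >= 1
```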
Perceptron vs. PA • Common update: w_{t+1} = w_t + α_t y_t x_t • Perceptron: α_t = 1 (on a mistake) • Passive-Aggressive: α_t = ℓ_t / ‖x_t‖²
Perceptron vs. PA [Plot: update magnitude α_t as a function of the margin, over three regimes — error, no error with small margin, no error with large margin.]
Outline • Background: • Online learning + notation • Perceptron • Stochastic-gradient descent • Passive-aggressive • Second-Order Algorithms • Second order Perceptron • Confidence-Weighted and AROW • AdaGrad • Properties • Kernels • Analysis • Empirical Evaluation • Synthetic • Real Data
Geometrical Assumption • All examples are bounded in a ball of radius R: ‖xᵢ‖ ≤ R
Separability • There exists a unit vector u (‖u‖ = 1) that classifies the data correctly with margin γ: yᵢ (u · xᵢ) ≥ γ for all i
Perceptron’s Mistake Bound • Simple case: positive points and negative points, separated by a hyperplane with margin γ • The number of mistakes the algorithm makes is bounded by (R/γ)²
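The standard argument behind this bound (not spelled out on the slide) fits in a few lines; M denotes the number of mistakes, u the separating unit vector with margin γ, and w₁ = 0:

```latex
% Each mistake update w <- w + y x (i) gains at least gamma of progress along u,
% and (ii) grows ||w||^2 by at most R^2, since 2 y (w . x) <= 0 on a mistake.
\begin{align*}
u \cdot w_{M+1} &\ge M\gamma
  && \text{(each mistake adds } y\,(u \cdot x) \ge \gamma \text{)} \\
\|w_{M+1}\|^2 &\le M R^2
  && \text{(each mistake adds } 2y\,(w \cdot x) + \|x\|^2 \le R^2 \text{)} \\
M\gamma &\le u \cdot w_{M+1} \le \|w_{M+1}\| \le \sqrt{M}\,R
  && \Longrightarrow \quad M \le (R/\gamma)^2
\end{align*}
```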
Outline • Background: • Online learning + notation • Perceptron • Stochastic-gradient descent • Passive-aggressive • Second-Order Algorithms • Second order Perceptron • Confidence-Weighted and AROW • AdaGrad • Properties • Kernels • Analysis • Empirical Evaluation • Synthetic • Real Data
Nicolò Cesa-Bianchi, Alex Conconi, Claudio Gentile, 2005 Second Order Perceptron • Assume all inputs are given in advance • Compute a “whitening” matrix from the inputs • Run the Perceptron on the “whitened” data • New “whitening” matrix
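A sketch of this idealized variant (all inputs known up front), assuming the whitening matrix is (Σₜ xₜ xₜᵀ)^{-1/2}; the small ridge term added for invertibility is my assumption, not the slide's:

```python
import numpy as np

def whitened_perceptron(X: np.ndarray, Y: np.ndarray, ridge: float = 1e-8) -> np.ndarray:
    """Run the Perceptron on whitened inputs z_t = S^{-1/2} x_t,
    where S = sum_t x_t x_t^T (plus a small ridge for invertibility)."""
    S = X.T @ X + ridge * np.eye(X.shape[1])       # correlation matrix of the inputs
    vals, vecs = np.linalg.eigh(S)                 # S^{-1/2} via eigendecomposition
    S_inv_sqrt = vecs @ np.diag(vals ** -0.5) @ vecs.T
    Z = X @ S_inv_sqrt                             # whitened data
    w = np.zeros(X.shape[1])
    for z, y in zip(Z, Y):
        if y * (w @ z) <= 0:                       # ordinary Perceptron on whitened data
            w += y * z
    return w                                       # classify a new x by sign(w @ (S_inv_sqrt @ x))
```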