Second Order Learning Koby Crammer Department of Electrical Engineering ECML PKDD 2013 Prague
Thanks • Mark Dredze • Alex Kulesza • Avihai Mejer • Edward Moroshko • Francesco Orabona • Fernando Pereira • Yoram Singer • Nina Vaitz
Tutorial Context [Diagram: this tutorial sits at the intersection of online learning, SVMs, optimization, theory, and real-world data.]
Outline • Background: • Online learning + notation • Perceptron • Stochastic-gradient descent • Passive-aggressive • Second-Order Algorithms • Second order Perceptron • Confidence-Weighted and AROW • AdaGrad • Properties • Kernels • Analysis • Empirical Evaluation • Synthetic • Real Data
Online Learning Tyrannosaurus rex
Online Learning Triceratops
Online Learning Velociraptor Tyrannosaurus rex
Formal Setting – Binary Classification • Instances • Images, sentences • Labels • Parse trees, names • Prediction rule • Linear prediction rules • Loss • No. of mistakes
Predictions • Discrete predictions: ŷ ∈ {−1, +1} • Hard to optimize • Continuous predictions: ŷ ∈ ℝ • sign(ŷ) gives the label • |ŷ| gives the confidence
Loss Functions • Natural loss: • Zero-one loss: ℓ(ŷ, y) = 1 if sign(ŷ) ≠ y, and 0 otherwise • Real-valued-prediction losses: • Hinge loss: max(0, 1 − y ŷ) • Exponential loss (Boosting): exp(−y ŷ) • Log loss (Max Entropy, Boosting): log(1 + exp(−y ŷ))
Loss Functions [Plot: hinge loss and zero-one loss as a function of the margin y ŷ; the hinge loss upper-bounds the zero-one loss and is zero once the margin reaches 1.]
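A minimal sketch (not from the slides) of the losses used throughout the tutorial, in Python; `y` is the true label in {−1, +1} and `y_hat` the real-valued prediction:

```python
import math

def zero_one_loss(y_hat: float, y: int) -> float:
    """1 if the sign of the prediction disagrees with the label, else 0."""
    return 1.0 if y * y_hat <= 0 else 0.0

def hinge_loss(y_hat: float, y: int) -> float:
    """Convex upper bound on the zero-one loss; zero once the margin y*y_hat >= 1."""
    return max(0.0, 1.0 - y * y_hat)

def exp_loss(y_hat: float, y: int) -> float:
    """Exponential loss used in boosting."""
    return math.exp(-y * y_hat)

def log_loss(y_hat: float, y: int) -> float:
    """Logistic (max-entropy) loss."""
    return math.log(1.0 + math.exp(-y * y_hat))
```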
Online Learning (protocol) For t = 1, 2, …: • Maintain model M • Get instance x • Predict label ŷ = M(x) • Get true label y • Suffer loss ℓ(ŷ, y) • Update model M
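As an illustration, this protocol can be written as a driver loop. The `Learner` interface below is my own assumption (it is not part of the tutorial); the algorithms that follow — Perceptron, PA, second-order methods — can all be plugged into it:

```python
from typing import Iterable, Protocol, Tuple
import numpy as np

class Learner(Protocol):
    def predict(self, x: np.ndarray) -> float: ...        # real-valued score
    def update(self, x: np.ndarray, y: int) -> None: ...  # y in {-1, +1}

def run_online(learner: Learner, stream: Iterable[Tuple[np.ndarray, int]]) -> int:
    """Run the online protocol and return the number of mistakes."""
    mistakes = 0
    for x, y in stream:
        y_hat = learner.predict(x)       # predict before seeing the label
        if y * y_hat <= 0:               # zero-one loss on the sign
            mistakes += 1
        learner.update(x, y)             # then reveal the label and update
    return mistakes
```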
Linear Classifiers (notation abuse) • Any features • W.l.o.g. no explicit bias term • Binary classifiers of the form f(x) = sign(w · x)
Linear Classifiers (cont.) • Prediction: ŷ = sign(w · x) • Confidence in prediction: |w · x|
Linear Classifiers [Diagram: the prediction sign(w · x) — x is the input instance to be classified, w is the weight vector of the classifier.]
Margin • Margin of an example (x, y) with respect to the classifier w: y (w · x) • Note: the margin is positive iff the example is classified correctly • The set is separable iff there exists a w such that y (w · x) > 0 for every example (x, y)
Geometrical Interpretation [Figure: points relative to the separating hyperplane, labeled margin >> 0, > 0, < 0, << 0 — the further a correctly classified point lies from the hyperplane, the larger its margin.]
Why Online Learning? • Fast • Memory efficient – processes one example at a time • Simple to implement • Formal guarantees – mistake bounds • Online-to-batch conversions • No statistical assumptions • Adaptive • Not as good as well-designed batch algorithms
Outline • Background: • Online learning + notation • Perceptron • Stochastic-gradient descent • Passive-aggressive • Second-Order Algorithms • Second order Perceptron • Confidence-Weighted and AROW • AdaGrad • Properties • Kernels • Analysis • Empirical Evaluation • Synthetic • Real Data
Rosenblatt 1958 The Perceptron Algorithm • If no mistake: do nothing • If mistake: update w ← w + y x • Margin after update: y (w + y x) · x = y (w · x) + ‖x‖², i.e. the margin increases by ‖x‖²
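A minimal Perceptron sketch under the hypothetical `Learner` interface assumed earlier; the weight vector and the mistake-driven update are the only state:

```python
import numpy as np

class Perceptron:
    def __init__(self, d: int):
        self.w = np.zeros(d)                 # weight vector

    def predict(self, x: np.ndarray) -> float:
        return float(self.w @ x)             # sign gives the label, |.| the confidence

    def update(self, x: np.ndarray, y: int) -> None:
        if y * (self.w @ x) <= 0:            # mistake (or zero margin)
            self.w += y * x                  # w <- w + y x; otherwise do nothing
```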
Outline • Background: • Online learning + notation • Perceptron • Stochastic-gradient descent • Passive-aggressive • Second-Order Algorithms • Second order Perceptron • Confidence-Weighted and AROW • AdaGrad • Properties • Kernels • Analysis • Empirical Evaluation • Synthetic • Real Data
Gradient Descent • Consider the batch problem: minimize L(w) = Σᵢ ℓ(w; xᵢ, yᵢ) over the whole training set • Simple algorithm: • Initialize w₁ = 0 • Iterate, for k = 1, 2, … • Compute the gradient ∇L(w_k) • Set w_{k+1} = w_k − η ∇L(w_k)
Stochastic Gradient Descent • Consider the same batch problem: minimize L(w) = Σᵢ ℓ(w; xᵢ, yᵢ) • Simple algorithm: • Initialize w₁ = 0 • Iterate, for k = 1, 2, … • Pick a random index i • Compute the gradient of the single-example loss ∇ℓ(w_k; xᵢ, yᵢ) • Set w_{k+1} = w_k − η ∇ℓ(w_k; xᵢ, yᵢ)
Stochastic Gradient Descent • “Hinge” loss: ℓ(w; x, y) = max(0, −y (w · x)) • The gradient: −y x if y (w · x) ≤ 0, and 0 otherwise • Simple algorithm: • Initialize w₁ = 0 • Iterate, for k = 1, 2, … • Pick a random index i • If yᵢ (w_k · xᵢ) ≤ 0 then w_{k+1} = w_k + η yᵢ xᵢ, else w_{k+1} = w_k • The Perceptron is stochastic gradient descent on a sum of such “hinge” losses, with a specific order of examples (and step size 1)
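A sketch of this SGD view (an illustration, not the slides' code): with the “hinge” loss max(0, −y(w·x)) and step size η = 1, each stochastic gradient step reduces to the Perceptron update on the sampled example:

```python
import numpy as np

def sgd_perceptron_loss(X: np.ndarray, Y: np.ndarray, epochs: int = 10,
                        eta: float = 1.0, seed: int = 0) -> np.ndarray:
    """SGD on L(w) = sum_i max(0, -y_i (w . x_i)); with eta = 1 each step
    is exactly the Perceptron update on the sampled example."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs * len(X)):
        i = rng.integers(len(X))             # pick a random index
        if Y[i] * (w @ X[i]) <= 0:           # sub-gradient is -y_i x_i here
            w += eta * Y[i] * X[i]           # gradient step == Perceptron update
        # else: the gradient is zero, nothing to do
    return w
```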
Outline • Background: • Online learning + notation • Perceptron • Stochastic-gradient descent • Passive-aggressive • Second-Order Algorithms • Second order Perceptron • Confidence-Weighted and AROW • AdaGrad • Properties • Kernels • Analysis • Empirical Evaluation • Synthetic • Real Data
Motivation • Perceptron: no guarantee on the margin after the update • PA: enforce a minimal non-zero margin after the update • In particular: • If the margin is large enough (≥ 1), do nothing • If the margin is less than 1, update so that the margin after the update is at least 1
Input Space vs. Version Space • Input space: points are input data; one constraint is induced by a weight vector; primal space; a half-space = all input examples that are classified correctly by a given predictor (weight vector) • Version space: points are weight vectors; one constraint is induced by an input example; dual space; a half-space = all predictors (weight vectors) that classify a given input example correctly
Weight Vector (Version) Space The algorithm forces the new weight vector to reside in this region (the half-space of weight vectors satisfying the margin constraint on the current example).
Passive Step Nothing to do: the current weight vector already resides on the desired side.
Aggressive Step The algorithm projects the current weight vector onto the desired half-space.
Aggressive Update Step • Set w_{t+1} to be the solution of the following optimization problem: min_w ½ ‖w − w_t‖² subject to y_t (w · x_t) ≥ 1 • Solution: w_{t+1} = w_t + τ_t y_t x_t, where τ_t = ℓ_t / ‖x_t‖² and ℓ_t = max(0, 1 − y_t (w_t · x_t)) is the hinge loss
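A sketch of the resulting PA update in the same style as the Perceptron sketch above (class name and interface are my assumptions):

```python
import numpy as np

class PassiveAggressive:
    def __init__(self, d: int):
        self.w = np.zeros(d)

    def predict(self, x: np.ndarray) -> float:
        return float(self.w @ x)

    def update(self, x: np.ndarray, y: int) -> None:
        loss = max(0.0, 1.0 - y * (self.w @ x))   # hinge loss on this example
        if loss > 0.0:                            # aggressive step: project onto
            tau = loss / float(x @ x)             #   { w : y (w . x) >= 1 }
            self.w += tau * y * x
        # loss == 0: passive step, the margin is already >= 1
```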
Perceptron vs. PA • Common update: w_{t+1} = w_t + α_t y_t x_t • Perceptron: α_t = 1 (on a mistake) • Passive-Aggressive: α_t = ℓ_t / ‖x_t‖²
Perceptron vs. PA [Plot: update magnitude α_t as a function of the margin, over three regimes — error, no error with small margin, no error with large margin.]
Outline • Background: • Online learning + notation • Perceptron • Stochastic-gradient descent • Passive-aggressive • Second-Order Algorithms • Second order Perceptron • Confidence-Weighted and AROW • AdaGrad • Properties • Kernels • Analysis • Empirical Evaluation • Synthetic • Real Data
Geometrical Assumption • All examples are bounded in a ball of radius R: ‖xᵢ‖ ≤ R
Separability • There exists a unit vector u (‖u‖ = 1) that classifies the data correctly with margin γ: yᵢ (u · xᵢ) ≥ γ for all i
Perceptron’s Mistake Bound • Simple case: positive points and negative points, separated by a hyperplane with margin γ • The number of mistakes the algorithm makes is bounded by (R/γ)²
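The standard argument behind this bound (not spelled out on the slide) fits in a few lines; M denotes the number of mistakes, u the separating unit vector with margin γ, and w₁ = 0:

```latex
% Each mistake update w <- w + y x (i) gains at least gamma of progress along u,
% and (ii) grows ||w||^2 by at most R^2, since 2 y (w . x) <= 0 on a mistake.
\begin{align*}
u \cdot w_{M+1} &\ge M\gamma
  && \text{(each mistake adds } y\,(u \cdot x) \ge \gamma \text{)} \\
\|w_{M+1}\|^2 &\le M R^2
  && \text{(each mistake adds } 2y\,(w \cdot x) + \|x\|^2 \le R^2 \text{)} \\
M\gamma &\le u \cdot w_{M+1} \le \|w_{M+1}\| \le \sqrt{M}\,R
  && \Longrightarrow \quad M \le (R/\gamma)^2
\end{align*}
```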
Outline • Background: • Online learning + notation • Perceptron • Stochastic-gradient descent • Passive-aggressive • Second-Order Algorithms • Second order Perceptron • Confidence-Weighted and AROW • AdaGrad • Properties • Kernels • Analysis • Empirical Evaluation • Synthetic • Real Data
Nicolò Cesa-Bianchi, Alex Conconi, Claudio Gentile, 2005 Second Order Perceptron • Assume all inputs are given in advance • Compute a “whitening” matrix from the inputs • Run the Perceptron on the “whitened” data • New “whitening” matrix
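A sketch of this idealized variant (all inputs known up front), assuming the whitening matrix is (Σₜ xₜ xₜᵀ)^{-1/2}; the small ridge term added for invertibility is my assumption, not the slide's:

```python
import numpy as np

def whitened_perceptron(X: np.ndarray, Y: np.ndarray, ridge: float = 1e-8) -> np.ndarray:
    """Run the Perceptron on whitened inputs z_t = S^{-1/2} x_t,
    where S = sum_t x_t x_t^T (plus a small ridge for invertibility)."""
    S = X.T @ X + ridge * np.eye(X.shape[1])       # correlation matrix of the inputs
    vals, vecs = np.linalg.eigh(S)                 # S^{-1/2} via eigendecomposition
    S_inv_sqrt = vecs @ np.diag(vals ** -0.5) @ vecs.T
    Z = X @ S_inv_sqrt                             # whitened data
    w = np.zeros(X.shape[1])
    for z, y in zip(Z, Y):
        if y * (w @ z) <= 0:                       # ordinary Perceptron on whitened data
            w += y * z
    return w                                       # classify a new x by sign(w @ (S_inv_sqrt @ x))
```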