Online Learning for Real-World Problems
Koby Crammer, University of Pennsylvania
Thanks: Ofer Dekel, Joseph Keshet, Shai Shalev-Shwartz, Yoram Singer, Axel Bernal, Steve Carroll, Mark Dredze, Kuzman Ganchev, Ryan McDonald, Artemis Hatzigeorgiou, Fernando Pereira, Fei Sha, Partha Pratim Talukdar
Tutorial Context
The tutorial builds on SVMs, optimization theory, multiclass and structured prediction, and real-world data.
Online Learning
[Running-example slides: images of a Tyrannosaurus rex, a Triceratops, and a Velociraptor, shown one at a time to illustrate learning from a sequence of examples.]
Formal Setting – Binary Classification
• Instances x (e.g. images, sentences)
• Labels y (e.g. parse trees, names); for binary classification y ∈ {-1, +1}
• Prediction rule h: instances → labels; here, linear prediction rules
• Loss ℓ(h(x), y); here, the number of mistakes (zero-one loss)
Predictions
• Discrete predictions: ŷ = sign(w · x) – hard to optimize directly
• Continuous predictions: ŷ = w · x
  • Label: sign(w · x)
  • Confidence: |w · x|
Loss Functions
• Natural loss – zero-one loss: ℓ(ŷ, y) = 1 if y ŷ ≤ 0, else 0
• Real-valued-prediction losses:
  • Hinge loss: ℓ(ŷ, y) = max(0, 1 - y ŷ)
  • Exponential loss (Boosting): ℓ(ŷ, y) = exp(-y ŷ)
  • Log loss (Max Entropy, Boosting): ℓ(ŷ, y) = log(1 + exp(-y ŷ))
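These losses are easy to compute directly; a minimal Python sketch (the function names are ours, not part of the tutorial):

import numpy as np

def zero_one_loss(y, y_hat):
    # 1 if the signed prediction disagrees with the label, else 0
    return float(y * y_hat <= 0)

def hinge_loss(y, y_hat):
    # upper-bounds the zero-one loss; zero only when the margin y * y_hat >= 1
    return max(0.0, 1.0 - y * y_hat)

def exp_loss(y, y_hat):
    # exponential loss used in Boosting
    return np.exp(-y * y_hat)

def log_loss(y, y_hat):
    # logistic / max-entropy loss
    return np.log1p(np.exp(-y * y_hat))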
Loss Functions
[Figure: the hinge loss and the zero-one loss plotted as functions of the margin; the hinge loss upper-bounds the zero-one loss.]
Online Framework
• Initialize the classifier w_1
• The algorithm works in rounds; on round t the online algorithm:
  • Receives an input instance x_t
  • Outputs a prediction ŷ_t = sign(w_t · x_t)
  • Receives a feedback label y_t
  • Computes the loss ℓ(w_t; (x_t, y_t))
  • Updates the prediction rule: w_t → w_{t+1}
• Goal: suffer small cumulative loss Σ_t ℓ(w_t; (x_t, y_t))
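The protocol above can be written as a generic loop. A minimal sketch, assuming placeholder predict / loss / update hooks (these names are ours, standing in for a concrete algorithm):

import numpy as np

def online_loop(stream, predict, loss, update, w_init):
    # stream: iterable of (x_t, y_t) pairs; w_init: initial parameters
    w = w_init
    cumulative_loss = 0.0
    for x_t, y_t in stream:
        y_hat = predict(w, x_t)              # output a prediction
        cumulative_loss += loss(y_hat, y_t)  # suffer a loss once the label is revealed
        w = update(w, x_t, y_t)              # update the prediction rule
    return w, cumulative_loss

A concrete algorithm is obtained by plugging in specific hooks, for example the Perceptron update shown later.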
Linear Classifiers (Notation Abuse)
• Any features Φ(x)
• W.l.o.g. we write x for the feature vector Φ(x)
• Binary classifiers of the form h_w(x) = sign(w · x)
Linear Classifiers (cont.)
• Prediction: ŷ = sign(w · x)
• Confidence in prediction: |w · x|
Margin
• Margin of an example (x, y) with respect to the classifier w: y(w · x)
• Note: the margin is positive iff the classifier predicts the label correctly
• The set {(x_i, y_i)} is separable iff there exists w such that y_i(w · x_i) > 0 for all i
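A short illustration of the margin and the separability condition (function names are ours):

import numpy as np

def margin(w, x, y):
    # signed margin of one example: positive iff the prediction is correct
    return y * np.dot(w, x)

def separates(w, X, Y):
    # True iff w attains a strictly positive margin on every example
    return all(margin(w, x, y) > 0 for x, y in zip(X, Y))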
Geometrical Interpretation
[Figure: points around a separating hyperplane, annotated by the sign and size of their margin: margin << 0, margin < 0, margin > 0, margin >> 0.]
Degree of Freedom - I
The same geometrical hyperplane can be represented by many parameter vectors: scaling w by any c > 0 leaves sign(w · x) unchanged.
Degree of Freedom - II
The difficulty of the problem does not change if we shrink or expand the input space: scaling all instances by c > 0 scales the margins and the radius R by the same factor.
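A tiny numeric illustration of both degrees of freedom, on made-up data:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w = np.array([1.0, -2.0, 0.5])
y = np.sign(X @ w)  # labels consistent with w, for illustration

# Degree of freedom I: rescaling w does not change any prediction.
assert np.array_equal(np.sign(X @ w), np.sign(X @ (10.0 * w)))

# Degree of freedom II: rescaling the input space by c scales both the radius R
# and every margin by c, so the ratio margin / R, which governs difficulty, is unchanged.
c = 3.0
R, margins = np.linalg.norm(X, axis=1).max(), y * (X @ w)
Rc, margins_c = np.linalg.norm(c * X, axis=1).max(), y * ((c * X) @ w)
assert np.allclose(margins / R, margins_c / Rc)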
Why Online Learning?
• Fast
• Memory efficient - processes one example at a time
• Simple to implement
• Formal guarantees - mistake bounds
• Online-to-batch conversions
• No statistical assumptions
• Adaptive
• But: not as accurate as a well-designed batch algorithm
Update Rules
• Online algorithms are based on an update rule which defines w_{t+1} from w_t (and possibly other information)
• Linear classifiers: find w_{t+1} from w_t based on the input (x_t, y_t)
• Some update rules:
  • Perceptron (Rosenblatt)
  • ALMA (Gentile)
  • ROMMA (Li & Long)
  • NORMA (Kivinen et al.)
  • MIRA (Crammer & Singer)
  • EG (Littlestone & Warmuth)
  • Bregman-based (Warmuth)
Three Update Rules
• The Perceptron algorithm:
  • Agmon 1954; Rosenblatt 1952-1962; Block 1962; Novikoff 1962; Minsky & Papert 1969; Freund & Schapire 1999; Blum & Dunagan 2002
• Hildreth's algorithm:
  • Hildreth 1957; Censor & Zenios 1997; Herbster 2002
• Loss-scaled:
  • Crammer & Singer 2001; Crammer & Singer 2002
The Perceptron Algorithm
• If no mistake (y_t(w_t · x_t) > 0):
  • Do nothing: w_{t+1} = w_t
• If mistake:
  • Update: w_{t+1} = w_t + y_t x_t
• Margin after update: y_t(w_{t+1} · x_t) = y_t(w_t · x_t) + ‖x_t‖² > y_t(w_t · x_t)
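A minimal, self-contained sketch of the Perceptron loop described above (our own code, not taken from the slides):

import numpy as np

def perceptron(stream, dim):
    # stream: iterable of (x_t, y_t) with y_t in {-1, +1}; dim: feature dimension
    w = np.zeros(dim)                    # w_1 = 0
    mistakes = 0
    for x_t, y_t in stream:
        if y_t * np.dot(w, x_t) <= 0:    # mistake (or zero margin)
            w = w + y_t * x_t            # additive update
            mistakes += 1
        # on a correct prediction the weights are left unchanged
    return w, mistakes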
Relative Loss Bound
• For any competitor prediction function u (the best prediction function in hindsight for the data sequence), we bound the loss suffered by the algorithm by the loss suffered by u:
  Σ_t ℓ(w_t; (x_t, y_t)) ≤ C · Σ_t ℓ(u; (x_t, y_t)) + (extra loss / regret)
• Left-hand side: the cumulative loss suffered by the algorithm's sequence of prediction functions w_1, w_2, ...
• Right-hand side: the cumulative loss of the competitor, multiplied by a competitiveness ratio C, plus a regret (extra loss) term
• Both cumulative losses grow with T, while the regret term is a constant; the gap in the inequality may therefore be large
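To make the statement concrete, one can track both sides of such a bound on a stream. The sketch below compares the Perceptron's mistake count with the cumulative hinge loss of one fixed competitor; the competitor vector and the synthetic data are purely illustrative choices of ours:

import numpy as np

rng = np.random.default_rng(1)
u = np.array([2.0, -1.0, 0.5])                     # a fixed competitor chosen in hindsight
X = rng.normal(size=(1000, 3))
Y = np.sign(X @ u + 0.1 * rng.normal(size=1000))   # noisy labels: not perfectly separable
Y[Y == 0] = 1

w = np.zeros(3)
mistakes = 0
competitor_hinge = 0.0
for x, y in zip(X, Y):
    if y * np.dot(w, x) <= 0:
        w = w + y * x
        mistakes += 1
    competitor_hinge += max(0.0, 1.0 - y * np.dot(u, x))

print(f"algorithm mistakes: {mistakes}, competitor cumulative hinge loss: {competitor_hinge:.1f}")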
Remarks
• If the input is inseparable, then the problem of finding a separating hyperplane which attains fewer than M errors is NP-hard (Open Hemisphere)
• Obtaining a zero-one loss bound with a unit competitiveness ratio is as hard as finding a constant-factor approximation for the Open Hemisphere problem
• Instead: bound the number of mistakes the Perceptron makes by the hinge loss of any competitor
Definitions
• Any competitor u
  • The parameter vector u can be chosen using the entire input data (in hindsight)
• The parameterized hinge loss of u on (x_t, y_t): ℓ_γ(u; (x_t, y_t)) = max(0, γ - y_t(u · x_t))
• True hinge loss: γ = 1
• 1-norm and 2-norm of the hinge loss:
  L_{1,γ}(u) = Σ_t ℓ_γ(u; (x_t, y_t)),   L_{2,γ}(u) = sqrt( Σ_t ℓ_γ(u; (x_t, y_t))² )
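These quantities translate directly into code; a short sketch in our own notation:

import numpy as np

def hinge(u, x, y, gamma=1.0):
    # parameterized hinge loss of competitor u on one example
    return max(0.0, gamma - y * np.dot(u, x))

def hinge_norms(u, X, Y, gamma=1.0):
    # L_{1,gamma}: sum of hinge losses; L_{2,gamma}: Euclidean norm of the loss vector
    losses = np.array([hinge(u, x, y, gamma) for x, y in zip(X, Y)])
    return losses.sum(), np.sqrt((losses ** 2).sum())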
Geometrical Assumption
• All examples are bounded in a ball of radius R: ‖x_t‖ ≤ R for all t
Perceptron's Mistake Bound
• Bounds: the number of mistakes M is bounded in terms of R, ‖u‖, and the cumulative hinge loss of any competitor u (explicit forms are derived below)
• If the sample is separable (y_t(u · x_t) ≥ 1 for all t), then M ≤ R²‖u‖²
[FS99, SS05] Proof - Intuition
• Two views:
  • The angle between w_t and the competitor u decreases with the number of mistakes
  • A certain sum involving w_t and u is fixed, so as we make more mistakes, our solution gets better
[C04] Proof
• Define the potential: Δ_t = ‖w_t - u‖² - ‖w_{t+1} - u‖²
• Bound its cumulative sum Σ_t Δ_t from above and from below
Proof
• Bound from above (telescopic sum):
  Σ_t Δ_t = ‖w_1 - u‖² - ‖w_{T+1} - u‖² ≤ ‖u‖²
  using w_1 = 0 (the zero vector) and ‖w_{T+1} - u‖² ≥ 0 (non-negative)
Proof
• Bound from below:
  • No error on the t-th round: w_{t+1} = w_t, so Δ_t = 0
  • Error on the t-th round: w_{t+1} = w_t + y_t x_t, so
    Δ_t = ‖w_t - u‖² - ‖w_t + y_t x_t - u‖² = -2 y_t(w_t · x_t) + 2 y_t(u · x_t) - ‖x_t‖²
Proof
• We bound each term (on an error round):
  • -2 y_t(w_t · x_t) ≥ 0, since an error means y_t(w_t · x_t) ≤ 0
  • 2 y_t(u · x_t) ≥ 2(1 - ℓ(u; (x_t, y_t))), by the definition of the hinge loss
  • -‖x_t‖² ≥ -R², by the radius assumption
Proof
• Bound from below, combining the three terms:
  • No error on the t-th round: Δ_t = 0
  • Error on the t-th round: Δ_t ≥ 2 - 2 ℓ(u; (x_t, y_t)) - R²
• Cumulative bound: Σ_t Δ_t ≥ M(2 - R²) - 2 Σ_{error rounds} ℓ(u; (x_t, y_t)) ≥ M(2 - R²) - 2 L_{1,1}(u)
  where M is the number of mistakes
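Both the telescoping upper bound and the cumulative lower bound can be checked numerically. A small sketch, with synthetic data and variable names of our own choosing:

import numpy as np

rng = np.random.default_rng(3)
u = np.array([1.5, -0.5, 1.0])
X = rng.normal(size=(200, 3))
Y = np.sign(X @ u)
Y[Y == 0] = 1
R = np.max(np.linalg.norm(X, axis=1))

w = np.zeros(3)
deltas = []
mistakes = 0
hinge_on_errors = 0.0
for x, y in zip(X, Y):
    error = y * np.dot(w, x) <= 0
    w_next = w + y * x if error else w
    # potential Delta_t = ||w_t - u||^2 - ||w_{t+1} - u||^2
    deltas.append(np.dot(w - u, w - u) - np.dot(w_next - u, w_next - u))
    if error:
        mistakes += 1
        hinge_on_errors += max(0.0, 1.0 - y * np.dot(u, x))
    w = w_next

# Upper bound: the telescoping sum is at most ||u||^2.
assert sum(deltas) <= np.dot(u, u) + 1e-9
# Lower bound: M(2 - R^2) - 2 * (hinge loss of u on the error rounds).
assert sum(deltas) >= mistakes * (2 - R**2) - 2 * hinge_on_errors - 1e-9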
Proof
• Putting both bounds together:
  M(2 - R²) - 2 L_{1,1}(u) ≤ ‖u‖²
• We use the first degree of freedom (and scale): apply the same argument to the scaled competitor αu (α > 0), for which y_t(αu · x_t) ≥ α(1 - ℓ(u; (x_t, y_t)))
• Bound: M(2α - R²) ≤ α²‖u‖² + 2α L_{1,1}(u)
Proof
• General bound: for any α > R²/2,
  M ≤ (α²‖u‖² + 2α L_{1,1}(u)) / (2α - R²)
• Choose α = R²:
• Simple bound: M ≤ R²‖u‖² + 2 L_{1,1}(u)
  (the right-hand side is proportional to the objective of SVM)
Proof
• Better bound: optimizing over the value of α gives
  M ≤ L_{1,1}(u) + R²‖u‖² + R‖u‖ sqrt(L_{1,1}(u))
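The simple and the optimized bounds can be compared against the actual mistake count; a sketch on noisy synthetic data (the data generation and the competitor u are our own illustrative choices):

import numpy as np

rng = np.random.default_rng(4)
u = np.array([2.0, -1.0])
X = rng.uniform(-1, 1, size=(5000, 2))
Y = np.sign(X @ u)
Y[Y == 0] = 1
flip = rng.random(5000) < 0.05          # 5% label noise: non-separable stream
Y[flip] = -Y[flip]
R = np.max(np.linalg.norm(X, axis=1))

w = np.zeros(2)
mistakes = 0
L = 0.0                                  # L_{1,1}(u): cumulative hinge loss of u
for x, y in zip(X, Y):
    if y * np.dot(w, x) <= 0:
        w = w + y * x
        mistakes += 1
    L += max(0.0, 1.0 - y * np.dot(u, x))

simple = R**2 * np.dot(u, u) + 2 * L
better = L + R**2 * np.dot(u, u) + R * np.linalg.norm(u) * np.sqrt(L)
print(f"mistakes = {mistakes}, simple bound = {simple:.0f}, better bound = {better:.0f}")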
Remarks
• The bound does not depend on the dimension of the feature vector
• The bound holds for all sequences; it is not tight for most real-world data
• But there exists a setting for which it is tight
Separable Case
• Assume there exists u* such that y_t(u* · x_t) ≥ 1 for all examples; then L_{1,1}(u*) = 0 and all the bounds coincide: M ≤ R²‖u*‖²
• The Perceptron makes a finite number of mistakes until convergence (not necessarily to u*)
Separable Case – Other Quantities
• Use the 1st (parameterization) degree of freedom
• Scale u* such that ‖u*‖ = 1
• Define the margin γ = min_t y_t(u* · x_t)
• The bound becomes M ≤ R² / γ²
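A quick check of the separable-case bound M ≤ R²/γ² on synthetic data (the data generation is our own; several passes over the data are used to show that the total number of mistakes stays bounded):

import numpy as np

rng = np.random.default_rng(5)
u_star = np.array([1.0, 1.0]) / np.sqrt(2)       # unit-norm reference hyperplane
X = rng.uniform(-1, 1, size=(3000, 2))
X = X[np.abs(X @ u_star) >= 0.1]                 # enforce a geometric margin of at least 0.1
Y = np.sign(X @ u_star)

gamma = np.min(Y * (X @ u_star))                 # realized margin
R = np.max(np.linalg.norm(X, axis=1))

w = np.zeros(2)
mistakes = 0
for _ in range(10):                              # multiple passes over the separable data
    for x, y in zip(X, Y):
        if y * np.dot(w, x) <= 0:
            w = w + y * x
            mistakes += 1

print(f"mistakes = {mistakes}, bound R^2 / gamma^2 = {R**2 / gamma**2:.1f}")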