On-line Learning and Boosting. Overview of “A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting,” by Freund and Schapire (1997). Tim Miller, University of Minnesota, Department of Computer Science and Engineering
Hedge - Motivation • Generalization of the Weighted Majority Algorithm • Given a set of expert predictions, minimize mistakes over time • Slight emphasis in the motivation on the possibility of treating the weight vector w as a prior.
Hedge Algorithm • Parameters: β ∈ [0, 1], initial weight vector w^1, number of trials T • For t = 1..T • Choose allocation p^t (probability distribution formed from the weights) • Receive loss vector l^t • Suffer loss p^t · l^t • Set new weight vector w^{t+1}_i = w^t_i · β^{l^t_i}
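The loop above is small enough to write out directly. Below is a minimal Python sketch of Hedge, assuming losses in [0, 1] and uniform initial weights; the function name and the `loss_vectors` input format are illustrative, not from the paper.

```python
import numpy as np

def hedge(loss_vectors, beta=0.9):
    """Minimal sketch of Hedge (Freund & Schapire, 1997).

    loss_vectors: iterable of length-N arrays with entries in [0, 1],
                  one per trial (an assumed input format).
    beta:         parameter in [0, 1) controlling how fast the weights
                  of poorly performing strategies decay.
    Returns the total loss suffered and the final weight vector.
    """
    loss_vectors = [np.asarray(l, dtype=float) for l in loss_vectors]
    n = len(loss_vectors[0])
    w = np.ones(n) / n           # uniform initial weights w^1
    total_loss = 0.0
    for l in loss_vectors:       # trials t = 1..T
        p = w / w.sum()          # allocation p^t (distribution over strategies)
        total_loss += p @ l      # suffer loss p^t . l^t
        w = w * beta ** l        # update w^{t+1}_i = w^t_i * beta^{l^t_i}
    return total_loss, w

# toy usage: two strategies, three trials
print(hedge([[0.0, 1.0], [0.2, 0.9], [0.1, 1.0]], beta=0.5))
```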
Hedge Analysis • Does not perform “too much worse” than the best strategy: for every strategy i, • L_Hedge(β) ≤ ( −ln w^1_i − L_i ln β ) · Z • where Z = 1 / (1 − β) and L_i is the cumulative loss of strategy i • Is it possible to do better?
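For concreteness, with uniform initial weights w^1_i = 1/N the bound can be specialized to the best strategy (the one with smallest cumulative loss L_min); this is only a restatement of the inequality above, not an additional result:

```latex
% Plugging w^1_i = 1/N into  L_Hedge(beta) <= (-ln w^1_i - L_i ln beta) / (1 - beta)
% and taking i to be the best strategy gives:
\[
L_{\mathrm{Hedge}}(\beta) \;\le\; \frac{\ln N + L_{\min}\,\ln(1/\beta)}{1-\beta}.
\]
```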
Boosting • If we have n classifiers, possibly looking at the problem from different perspectives, how can we optimally combine them? • Example: We have a collection of “rules of thumb” for predicting horse races; how should we weight them?
Definitions • Given labeled data <x, c(x)>, where c is the target concept, c: X → {0, 1} • c ∈ C, the concept class • Strong PAC-learning algorithm: for parameters ε, δ, the hypothesis has error less than ε with probability at least (1 − δ) • Weak learning algorithm: error at most (0.5 − γ), for some γ > 0
AdaBoost Algorithm • Input: • Sequence of N labeled examples • Distribution D over the N examples • Weak learning algorithm (called WeakLearn) • Number of iterations T
AdaBoost contd. • Initialize: w^1 = D • For t = 1..T • Form probability distribution p^t from w^t • Call WeakLearn with distribution p^t, get hypothesis h_t • Calculate error ε_t = Σ_{i=1..N} p^t_i |h_t(x_i) − y_i| • Set β_t = ε_t / (1 − ε_t) • Multiplicatively adjust weights: w^{t+1}_i = w^t_i · β_t^{1 − |h_t(x_i) − y_i|}
AdaBoost Output • Output 1 if: • Σ_{t=1..T} (log 1/β_t) h_t(x) ≥ ½ Σ_{t=1..T} log 1/β_t • Output 0 otherwise • Computes a weighted average of the weak hypotheses
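Putting the last three slides together, here is a compact Python sketch of the whole procedure, assuming binary labels in {0, 1} and weak hypotheses returning values in [0, 1]; `weak_learn` is a stand-in for whatever WeakLearn is plugged in, and edge cases such as ε_t = 0 are ignored.

```python
import numpy as np

def adaboost(X, y, D, weak_learn, T):
    """Sketch of AdaBoost (Freund & Schapire, 1997) for labels in {0, 1}.

    X, y:       N examples and their 0/1 labels.
    D:          initial distribution over the N examples.
    weak_learn: callable (X, y, p) -> hypothesis h, where h(x) is in [0, 1]
                (a stand-in for the paper's WeakLearn).
    T:          number of boosting rounds.
    """
    w = np.asarray(D, dtype=float).copy()      # w^1 = D
    hypotheses, betas = [], []
    for _ in range(T):
        p = w / w.sum()                        # distribution p^t
        h = weak_learn(X, y, p)                # call WeakLearn
        err = np.array([abs(h(x) - yi) for x, yi in zip(X, y)])
        eps = p @ err                          # eps_t = sum_i p_i |h_t(x_i) - y_i|
        beta = eps / (1.0 - eps)               # beta_t
        w = w * beta ** (1.0 - err)            # multiplicative weight update
        hypotheses.append(h)
        betas.append(beta)

    def final_hypothesis(x):
        # output 1 iff the weighted vote exceeds half the total vote weight
        vote = sum(np.log(1.0 / b) * h(x) for h, b in zip(hypotheses, betas))
        return 1 if vote >= 0.5 * sum(np.log(1.0 / b) for b in betas) else 0

    return final_hypothesis
```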
AdaBoost Analysis • Note the “dual” relationship with Hedge: • Strategies ↔ Examples • Trials ↔ Weak hypotheses • Hedge increases weight for successful strategies; AdaBoost increases weight for difficult examples • AdaBoost has a dynamic β (a new β_t is chosen each round)
AdaBoost Bounds • ε ≤ 2^T Π_{t=1..T} sqrt(ε_t(1 − ε_t)) • Previous bounds depended on the maximum error of the weakest hypothesis (“weak link” syndrome) • AdaBoost takes advantage of gains from the best hypotheses
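Writing ε_t = ½ − γ_t makes the role of each hypothesis’s edge over chance explicit; the product bound above can be relaxed to an exponential form, following the paper’s discussion:

```latex
% With epsilon_t = 1/2 - gamma_t, so that gamma_t is the edge over chance:
\[
\epsilon \;\le\; \prod_{t=1}^{T} 2\sqrt{\epsilon_t(1-\epsilon_t)}
        \;=\; \prod_{t=1}^{T} \sqrt{1-4\gamma_t^{2}}
        \;\le\; \exp\!\Big(-2\sum_{t=1}^{T}\gamma_t^{2}\Big).
\]
```

So the training error drops exponentially fast as long as every weak hypothesis beats chance by some margin γ_t > 0.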
Multi-class Setting • k > 2 output labels, i.e. Y = {1, 2, …, k} • Error: Probability of incorrect prediction • Two algorithms: • AdaBoost.M1 – More direct • AdaBoost.M2 – Somewhat complex constraints on weak learners • Could also just divide into “one vs. one” or “one vs. all” categories
AdaBoost.M1 • Requires each classifier to have error less than 50% (a stronger requirement than in the binary case) • Similar to the regular AdaBoost algorithm except: • Error is 1 if h_t(x_i) ≠ y_i • Can’t use weak algorithms with error > 0.5 • Algorithm outputs a vector of length k with values between 0 and 1, as sketched below
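As a concrete reading of the last bullet, here is a small Python sketch of how AdaBoost.M1’s per-class votes can be combined; normalizing the votes into a length-k vector in [0, 1] is one natural interpretation of the slide, not code taken from the paper.

```python
import numpy as np

def m1_final_hypothesis(hypotheses, betas, k):
    """Sketch of AdaBoost.M1's combined output for k classes.

    hypotheses: list of weak hypotheses h_t(x) -> label in {0, ..., k-1}
    betas:      the beta_t values computed during boosting (each < 1,
                since every weak hypothesis must have error below 0.5)
    """
    def predict(x):
        votes = np.zeros(k)
        for h, beta in zip(hypotheses, betas):
            votes[h(x)] += np.log(1.0 / beta)   # each hypothesis votes with weight log(1/beta_t)
        votes /= votes.sum()                    # length-k vector with values in [0, 1]
        return int(np.argmax(votes)), votes     # predicted label and the vote vector
    return predict
```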
AdaBoost.M1 Analysis • ε ≤ 2^T Π_{t=1..T} sqrt(ε_t(1 − ε_t)) • Same as the bound for regular AdaBoost • The proof converts the multi-class problem to a binary setup • Can we improve this algorithm?
AdaBoost.M2 • More expressive, more complex constraints on weak hypotheses • Defines idea of “Pseudo-Loss” • Pseudo-loss of each weak hypothesis must be better than chance • Benefit: Allows contributions from hypotheses with accuracy < 0.5
Pseudo-loss • Replaces the straightforward loss of AdaBoost.M1 • ploss_q(h, i) = ½ ( 1 − h(x_i, y_i) + Σ_{y ≠ y_i} q(i, y) h(x_i, y) ) • Intuition: for each incorrect label, pit it against the known label in a binary classification (second term), then take a weighted average • Makes use of the information in the entire hypothesis vector, not just the prediction
AdaBoost.M2 Details • Extra initialization: w^1_{i,y} = D(i) / (k − 1) • For each iteration t = 1 to T • W^t_i = Σ_{y ≠ y_i} w^t_{i,y} • q_t(i, y) = w^t_{i,y} / W^t_i • D_t(i) = W^t_i / Σ_{i=1..N} W^t_i • WeakLearn gets D_t as well as q_t • Calculate the pseudo-loss ε_t as shown above • β_t = ε_t / (1 − ε_t) • w^{t+1}_{i,y} = w^t_{i,y} · β_t^{(0.5)(1 + h_t(x_i, y_i) − h_t(x_i, y))}
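Below is a Python sketch of one round of this bookkeeping, assuming weak hypotheses h(x, y) that return values in [0, 1] and an N × k array of mislabel weights; the array layout and the function name are implementation choices, not from the paper.

```python
import numpy as np

def m2_round(X, y, w, h, k):
    """One round of AdaBoost.M2's weight maintenance (a sketch).

    X, y: N examples and their integer labels in {0, ..., k-1}.
    w:    N x k array of mislabel weights w_{i,y} (the y_i column is unused).
    h:    weak hypothesis h(x, label) -> value in [0, 1].
    Returns the pseudo-loss eps_t, beta_t, and the updated weights.
    """
    N = len(X)
    y = np.asarray(y)
    labels = np.arange(k)
    mask = labels[None, :] != y[:, None]          # True for incorrect labels
    W = (w * mask).sum(axis=1)                    # W_i = sum_{y != y_i} w_{i,y}
    q = (w * mask) / W[:, None]                   # q(i, y) = w_{i,y} / W_i
    D = W / W.sum()                               # D_t(i) = W_i / sum_i W_i
    h_vals = np.array([[h(X[i], lab) for lab in labels] for i in range(N)])
    # pseudo-loss: eps = sum_i D(i) * 0.5 * (1 - h(x_i,y_i) + sum_{y!=y_i} q(i,y) h(x_i,y))
    eps = 0.0
    for i in range(N):
        wrong = (q[i] * h_vals[i] * mask[i]).sum()
        eps += D[i] * 0.5 * (1.0 - h_vals[i, y[i]] + wrong)
    beta = eps / (1.0 - eps)
    # update: w_{i,y} *= beta ** (0.5 * (1 + h(x_i, y_i) - h(x_i, y)))
    exponent = 0.5 * (1.0 + h_vals[np.arange(N), y][:, None] - h_vals)
    return eps, beta, w * beta ** exponent
```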
Error Bounds • ε ≤ (k − 1) · 2^T Π_{t=1..T} sqrt(ε_t(1 − ε_t)) • where ε is the traditional error and the ε_t are pseudo-losses
Regression Setting • Instead of picking from a discrete set of output labels, choose a continuous value • More formally, Y = [0, 1] • Minimize the mean squared error: • E[(h(x) − y)^2] • Reduce to binary classification and use AdaBoost!
How it works (roughly) • For each example in the training set, create a continuum of associated instances x̃ = (x_i, y) where y ∈ [0, 1] • The label is 1 if y ≥ y_i • This maps to an infinite training set, so discrete distributions need to be converted to density functions
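Because the true reduction works over a continuum, any runnable illustration has to discretize it. The Python sketch below expands each regression example over a finite grid of thresholds purely to show the labeling rule; the grid, function name, and parameters are assumptions, and the real AdaBoost.R works with density functions as noted above.

```python
import numpy as np

def discretized_reduction(X, y, grid_size=10):
    """Illustrative (discretized) version of AdaBoost.R's reduction.

    Each regression example (x_i, y_i) with y_i in [0, 1] is expanded into
    pairs ((x_i, z), label) for thresholds z on a finite grid, where
    label = 1 exactly when z >= y_i.  The actual algorithm uses the full
    continuum [0, 1] and densities rather than a grid; this only sketches
    the labeling rule.
    """
    grid = np.linspace(0.0, 1.0, grid_size)
    expanded = []
    for xi, yi in zip(X, y):
        for z in grid:
            expanded.append(((xi, z), 1 if z >= yi else 0))
    return expanded

# toy usage: one example with target 0.3 becomes grid_size binary examples
print(discretized_reduction([[1.0, 2.0]], [0.3], grid_size=5))
```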
AdaBoost.R Bounds • ε ≤ 2^T Π_{t=1..T} sqrt(ε_t(1 − ε_t))
Conclusions • Starting from an on-line learning perspective, it is possible to generalize to boosting • Boosting can take weak learners and convert them into strong learners • This paper presented several boosting algorithms, with proofs of error bounds