Regret to the Best vs. Regret to the Average
Eyal Even-Dar, Michael Kearns, Yishay Mansour, Jennifer Wortman (UPenn + Tel Aviv Univ.)
Slides: Csaba
Motivation
• Expert algorithms attempt to control the regret to the return of the best expert
• Regret to the average return? The same bound applies! Weak???
• Example (EW): w_{i,1} = 1, w_{i,t} = w_{i,t-1} e^{η g_{i,t}}, p_{i,t} = w_{i,t}/W_t, W_t = Σ_i w_{i,t}
  E1: 1 0 1 0 1 0 1 0 1 0 …
  E2: 0 1 0 1 0 1 0 1 0 1 …
  G_{A,T} = T/2 − c T^{1/2}, while G^+_T = G^-_T = G^0_T = T/2, so R^+_T ≤ c T^{1/2} and R^0_T ≤ c T^{1/2}
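To make the motivating example concrete, here is a minimal Python sketch (my own illustration, not from the slides; the tuning η = sqrt(ln 2 / T) is an assumption) of EW on the alternating two-expert sequence. It exhibits the c·T^{1/2} loss relative to the common value T/2:

import math

T = 10_000
eta = math.sqrt(math.log(2) / T)          # assumed standard EW tuning
w = [1.0, 1.0]                            # w_{i,1} = 1
gain_A = 0.0
for t in range(T):
    g = [1.0, 0.0] if t % 2 == 0 else [0.0, 1.0]   # E1: 1 0 1 0 ...,  E2: 0 1 0 1 ...
    W = sum(w)
    p = [wi / W for wi in w]              # p_{i,t} = w_{i,t} / W_t
    gain_A += sum(pi * gi for pi, gi in zip(p, g))
    w = [wi * math.exp(eta * gi) for wi, gi in zip(w, g)]   # w_{i,t+1} = w_{i,t} e^{eta g_{i,t}}

print(T / 2 - gain_A, math.sqrt(T))       # regret to best = regret to average: roughly c * sqrt(T)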
Notation – gains
• g_{i,t} ∈ [0,1] – gains
• g = (g_{i,t}) – sequence of gains
• G_{i,T}(g) = Σ_{t=1}^T g_{i,t} – cumulated gain of expert i
• G^0_T(g) = (Σ_i G_{i,T}(g))/N – average gain
• G^-_T(g) = min_i G_{i,T}(g) – worst gain
• G^+_T(g) = max_i G_{i,T}(g) – best gain
• G^D_T(g) = Σ_i D_i G_{i,T}(g) – weighted avg. gain (w.r.t. distribution D)
Notation – algorithms
• w_{i,t} – unnormalized weights
• p_{i,t} = w_{i,t}/W_t, where W_t = Σ_i w_{i,t} – normalized weights
• g_{A,t} = Σ_i p_{i,t} g_{i,t} – gain of A at time t
• G_{A,T}(g) = Σ_t g_{A,t} – cumulated gain of A
Notation – regret to the…
• R^+_T(g) = (G^+_T(g) − G_{A,T}(g)) ∨ 1 – best
• R^-_T(g) = (G^-_T(g) − G_{A,T}(g)) ∨ 1 – worst
• R^0_T(g) = (G^0_T(g) − G_{A,T}(g)) ∨ 1 – average
• R^D_T(g) = (G^D_T(g) − G_{A,T}(g)) ∨ 1 – distribution D
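As a quick reference for this notation, a small helper (hypothetical, for illustration only) that computes these quantities from a gain matrix g[t][i], an algorithm's cumulated gain G_A, and a distribution D:

def regrets(g, G_A, D):
    T, N = len(g), len(g[0])
    G = [sum(g[t][i] for t in range(T)) for i in range(N)]    # G_{i,T}(g)
    G_plus, G_minus = max(G), min(G)                          # G^+_T, G^-_T
    G_avg = sum(G) / N                                        # G^0_T
    G_D = sum(D[i] * G[i] for i in range(N))                  # G^D_T
    clip = lambda x: max(x, 1.0)                              # the "... ∨ 1" in the regret definitions
    return {"R+": clip(G_plus - G_A), "R-": clip(G_minus - G_A),
            "R0": clip(G_avg - G_A), "RD": clip(G_D - G_A)}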
Goal
• Algorithm A is "nice" if
  R^+_{A,T} ≤ O(T^{1/2}) and R^0_{A,T} ≤ 1
• Program:
  • Examine existing algorithms ("difference algorithms") – lower bound
  • Show "nice" algorithms
  • Show that no substantial further improvement is possible
"Difference" algorithms
• Def: A is a difference algorithm if, for N = 2 and g_{i,t} ∈ {0,1},
  p_{1,t} = f(d_t), p_{2,t} = 1 − f(d_t), where d_t = G_{1,t} − G_{2,t}
• Examples:
  • EW: w_{i,t} = e^{η G_{i,t}}
  • FPL: choose argmax_i (G_{i,t} + Z_{i,t})
  • Prod: w_{i,t} = Π_s (1 + η g_{i,s}) = (1 + η)^{G_{i,t}}
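For N = 2 and binary gains, the weight on expert 1 under EW or Prod indeed depends only on the gain difference d_t = G_{1,t} − G_{2,t}. A small sketch (illustration only, with η assumed fixed):

import math

def ew_p1(d, eta):
    # EW: p_1 = e^{eta G_1} / (e^{eta G_1} + e^{eta G_2}) = 1 / (1 + e^{-eta d}) = f(d)
    return 1.0 / (1.0 + math.exp(-eta * d))

def prod_p1(d, eta):
    # Prod with binary gains: w_i = (1 + eta)^{G_i}, so p_1 is again a function of d only
    return (1 + eta) ** d / ((1 + eta) ** d + 1)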
A lower bound for difference algorithms
• Theorem: If A is a difference algorithm, then there exist gain sequences g, g' (tuned to A) such that
  R^+_{A,T}(g) · R^0_{A,T}(g') ≥ R^+_{A,T}(g) · R^-_{A,T}(g') = Ω(T)
• Hence, with R^+_{A,T} = max_g R^+_{A,T}(g), R^-_{A,T} = max_g R^-_{A,T}(g), R^0_{A,T} = max_g R^0_{A,T}(g),
  R^+_{A,T} · R^0_{A,T} ≥ R^+_{A,T} · R^-_{A,T} = Ω(T)
Proof
• (Figure: gain sequence g over time t – expert 1 gains 1 in every round, expert 2 gains 0 in every round.)
• Assume T is even and p_{1,1} ≤ 1/2.
• Let τ be the first time t at which p_{1,t} ≥ 2/3 ⇒ R^+_{A,T}(g) ≥ τ/3 (for the first τ rounds A puts weight < 2/3 on the all-ones expert, losing ≥ 1/3 per round).
• Hence ∃ σ ∈ {2, 3, …, τ} such that p_{1,σ} − p_{1,σ−1} ≥ 1/(6τ) (p_1 rises from ≤ 1/2 to ≥ 2/3 in at most τ steps).
Proof (2)
• (Figure: gain sequence g' over time t, in three segments of lengths σ, T − 2σ, σ – first only expert 1 gains, as in g; then the gains alternate between the experts; finally only expert 2 gains. Thus G^+_T = G^-_T = G^0_T = T/2.)
• Middle segment: the difference d_t oscillates between two adjacent values, so the difference algorithm's weight alternates between p_{1,σ} and p_{1,σ−1}; its gain per pair of rounds is ≤ 1 − (p_{1,σ} − p_{1,σ−1}) ≤ 1 − 1/(6τ).
• Last segment: d_t retraces the first segment in reverse, so the weights mirror (p_{1,t} = p_{1,T−t}); each mirrored pair of rounds contributes gain p_{1,t} + (1 − p_{1,t}) = 1, hence the first and last segments together contribute σ.
• Therefore G_{A,T}(g') ≤ σ + (T − 2σ)/2 · (1 − 1/(6τ)), so R^-_{A,T}(g') ≥ (T − 2σ)/(12τ)
⇒ R^+_{A,T}(g) · R^-_{A,T}(g') ≥ (τ/3) · (T − 2σ)/(12τ) = (T − 2σ)/36
  (in either case this product is Ω(T): if τ ≥ T/4 then R^+_{A,T}(g) ≥ T/12 alone suffices; otherwise T − 2σ ≥ T/2).
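To make the construction concrete, a small sketch (my own rendering; sigma stands for the jump time σ found above) that builds the two gain sequences used in the proof:

def build_g(T):
    # g: expert 1 gains 1 every round, expert 2 gains 0 every round
    return [(1, 0)] * T

def build_g_prime(T, sigma):
    # g': sigma rounds as in g, then T - 2*sigma alternating rounds (the difference d_t
    # oscillates between adjacent values), then sigma rounds where only expert 2 gains,
    # so G^+_T = G^-_T = G^0_T = T/2
    head = [(1, 0)] * sigma
    middle = [(0, 1) if s % 2 == 0 else (1, 0) for s in range(T - 2 * sigma)]
    tail = [(0, 1)] * sigma
    return head + middle + tail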
Tightness
• We know that for difference algorithms R^+_{A,T} · R^0_{A,T} ≥ R^+_{A,T} · R^-_{A,T} = Ω(T)
• Can a (difference) algorithm achieve this?
• Theorem: EW = EW(η) with appropriately tuned η = η(α), for any 0 ≤ α ≤ 1/2, has
  R^+_{EW,T} ≤ T^{1/2+α} (1 + ln N)
  R^0_{EW,T} ≤ T^{1/2−α}
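The slide leaves the tuning implicit; one plausible reconstruction (an assumption on my part, not stated on the slide) is η = T^{−(1/2+α)}, under which the standard EW analysis gives both rates:

\ln\frac{W_{T+1}}{W_1} \le \eta\, G_{EW,T} + \frac{\eta^2 T}{8},
\qquad
\ln\frac{W_{T+1}}{W_1} \ge \eta\, G^+_T - \ln N
\quad\text{and}\quad
\ln\frac{W_{T+1}}{W_1} \ge \eta\, G^0_T \ \text{(Jensen)},

\text{so}\quad
R^+_{EW,T} \le \frac{\ln N}{\eta} + \frac{\eta T}{8}
           = T^{1/2+\alpha}\ln N + \tfrac18\,T^{1/2-\alpha}
           \le T^{1/2+\alpha}(1+\ln N),
\qquad
R^0_{EW,T} \le \frac{\eta T}{8} \le T^{1/2-\alpha}.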
Breaking the frontier
• What is wrong with the difference algorithms?
• They are designed to find the best expert with low regret (fast)…
• …but they pay no attention to the average gain or to how it compares with the best gain
BestWorst(A)
• G^+_T − G^-_T: the spread of the cumulated gains
• Idea: stay with the average until the spread becomes large; then switch to learning (using algorithm A)
• When the spread is large enough, G^0_T = G_{BW(A),T} ≫ G^-_T ⇒ "nothing" to lose
• Spread threshold: NR, where R = R_{T,N} is a bound on the regret of A
BestWorst(A)
• Theorem: R^+_{BW(A),T} = O(NR) and G_{BW(A),T} ≥ G^-_T
• Proof: At the time of the switch, G_{BW(A)} ≥ (G^+ + (N − 1) G^-)/N. Since G^+ ≥ G^- + NR at that point, G_{BW(A)} ≥ G^- + R. This cushion of R over the worst expert covers A's regret (≤ R) on the remaining rounds, so BW(A) never falls below the worst expert; the regret to the best is at most the spread at the switch (≤ NR) plus A's regret R afterwards, i.e., O(NR).
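A minimal Python sketch of BestWorst(A) (illustration only; the A.weights()/A.update() interface is an assumption, not from the paper):

def best_worst(A, gains, N, R):
    # Play the uniform average until the spread G^+_t - G^-_t reaches N*R,
    # then hand over to the expert algorithm A for the remaining rounds.
    G = [0.0] * N
    total, switched = 0.0, False
    for g_t in gains:                                   # g_t = vector of expert gains at round t
        if not switched and max(G) - min(G) >= N * R:
            switched = True                             # spread threshold reached: start learning
        p = A.weights() if switched else [1.0 / N] * N
        total += sum(pi * gi for pi, gi in zip(p, g_t))
        if switched:
            A.update(g_t)
        G = [Gi + gi for Gi, gi in zip(G, g_t)]
    return total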
PhasedAggression(A, R, D)
for k = 1 : log2(R) do
    η := 2^(k−1) / R
    A.reset(); s := 0                  // local time: gains below are cumulated within the phase
    while (G^+_s − G^D_s < 2R) do
        q_s := A.getNormedWeights(g_{s−1})
        p_s := η q_s + (1 − η) D
        s := s + 1
    end
end
A.reset()
run A (unmixed) until time T
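A runnable Python rendering of the pseudocode above (a sketch under my own interface assumptions; EW is used as the subroutine A and its tuning is arbitrary):

import math

class EW:
    # Exponential-weights subroutine (assumed; any algorithm with regret bound R works as A)
    def __init__(self, N, eta):
        self.N, self.eta = N, eta
        self.reset()
    def reset(self):
        self.w = [1.0] * self.N
    def get_normed_weights(self):
        W = sum(self.w)
        return [wi / W for wi in self.w]
    def update(self, g_t):
        self.w = [wi * math.exp(self.eta * gi) for wi, gi in zip(self.w, g_t)]

def phased_aggression(A, D, R, gains):
    N = len(D)
    total, t, T = 0.0, 0, len(gains)
    for k in range(1, int(math.log2(R)) + 1):
        eta_k = 2 ** (k - 1) / R                 # mixing weight of A in phase k
        A.reset()
        G = [0.0] * N                            # expert gains within the phase (local time s)
        while t < T:
            G_best = max(G)
            G_D = sum(Di * Gi for Di, Gi in zip(D, G))
            if G_best - G_D >= 2 * R:
                break                            # spread reached 2R: end phase, double aggression
            q = A.get_normed_weights()
            p = [eta_k * qi + (1 - eta_k) * Di for qi, Di in zip(q, D)]
            total += sum(pi * gi for pi, gi in zip(p, gains[t]))
            A.update(gains[t])
            G = [Gi + gi for Gi, gi in zip(G, gains[t])]
            t += 1
    A.reset()
    while t < T:                                 # final phase: run A unmixed until time T
        total += sum(pi * gi for pi, gi in zip(A.get_normed_weights(), gains[t]))
        A.update(gains[t])
        t += 1
    return total

# Usage (example): phased_aggression(EW(N, eta), [1.0 / N] * N, R=math.sqrt(T * math.log(N)), gains=gains)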
PA(A, R, D) – Theorem
• Theorem: Let A be any algorithm with regret R = R_{T,N} to the best expert, and D any distribution. Then for PA = PA(A, R, D),
  R^+_{PA,T} ≤ 2R (log R + 1)
  R^D_{PA,T} ≤ 1
Proof
• Consider local time s during phase k, where η_k = 2^(k−1)/R.
• D and A share the gains, and hence the regret:
  G^+_s − G_{PA,s} < η_k · R + (1 − η_k) · 2R < 2R
  G^D_s − G_{PA,s} ≤ η_k · R = 2^(k−1)
• What happens at the end of the phase (when G^+_s − G^D_s ≥ 2R)?
  G_{PA,s} − G^D_s ≥ η_k (G^+_s − R − G^D_s) ≥ η_k (2R − R) = η_k · R = 2^(k−1)
• What if PA ends in phase k at time T?
  G^+_T − G_{PA,T} ≤ 2R · k ≤ 2R (log R + 1)
  G^D_T − G_{PA,T} ≤ 2^(k−1) − Σ_{j=1}^{k−1} 2^(j−1) = 2^(k−1) − (2^(k−1) − 1) = 1
  (each completed phase j ended with a surplus of 2^(j−1) over D, while the unfinished phase loses at most 2^(k−1) to D)
General lower bounds
• Theorem:
  R^+_{A,T} = O(T^{1/2}) ⇒ R^0_{A,T} = Ω(T^{1/2})
  R^+_{A,T} ≤ (T log T)^{1/2}/10 ⇒ R^0_{A,T} = Ω(T^δ) for some δ ≥ 0.02
• Compare this with R^+_{PA,T} ≤ 2R(log R + 1), R^D_{PA,T} ≤ 1, where R = (T log N)^{1/2}
Conclusions
• Achieving constant regret to the average is a reasonable goal.
• "Classical" (difference) algorithms do not have this property; they satisfy R^+_{A,T} · R^0_{A,T} ≥ Ω(T).
• Modification: learn only when it makes sense, i.e., when the best is much better than the average.
• PhasedAggression: optimal trade-off.
• Open question: can we remove the dependence on T?