Regret to the Best vs. Regret to the Average
Eyal Even-Dar, Computer and Information Science, University of Pennsylvania
Collaborators: Michael Kearns (Penn), Yishay Mansour (Tel Aviv), Jenn Wortman (Penn)
The No-Regret Setting
• Learner maintains a weighting over N "experts"
• On each of T trials, the learner observes the payoffs of all N experts
• Payoff to the learner = weighted payoff of the experts
• Learner then dynamically adjusts its weights
• Let R_{i,T} be the cumulative payoff of expert i on a sequence of T trials
• Let R_{A,T} be the cumulative payoff of learning algorithm A
• Classical no-regret results: we can produce a learning algorithm A such that on any sequence of trials, R_{A,T} > max_i {R_{i,T}} − sqrt(log(N)·T)
• "No regret": the per-trial regret sqrt(log(N)/T) approaches 0 as T grows
This Work
• We simultaneously examine:
  • Regret to the best expert in hindsight
  • Regret to the average return of all experts
  • Note that no learning is required to match the average!
• Why look at the average?
  • A "safety net" or "sanity check"
  • A benchmark that even a simple algorithm should outperform
  • Future direction: S&P 500
• We assume a fixed horizon T
  • But this can easily be relaxed…
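As a concrete reference point for these two baselines, here is a minimal Python sketch (not from the talk) of how the regret to the best expert and the regret to the average are computed from a payoff table; the function name and array layout are my own.

```python
import numpy as np

def regrets(payoffs, algo_payoffs):
    """Regret to the best expert and to the average -- illustrative sketch.

    payoffs:      T x N array, payoffs[t, i] = payoff of expert i on trial t.
    algo_payoffs: length-T array of the learner's weighted per-trial payoffs.
    """
    cum = payoffs.sum(axis=0)              # R_{i,T} for every expert i
    R_A = float(np.sum(algo_payoffs))      # R_{A,T}: the learner's total payoff
    regret_to_best = cum.max() - R_A       # best expert in hindsight vs. the learner
    regret_to_avg = cum.mean() - R_A       # average expert return vs. the learner
    return regret_to_best, regret_to_avg
```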
Our Results
• Every difference-based algorithm with O(T^α) regret to the best expert has Ω(T^{1−α}) regret to the average
• There exists a simple difference-based algorithm achieving this tradeoff
• Every algorithm with O(T^{1/2}) regret to the best expert must have Ω(T^{1/2}) regret to the average
• We can produce an algorithm with O(T^{1/2} log T) regret to the best and O(1) regret to the average
Oscillations: The Cost of an Update
• Consider 2 experts with instantaneous gains in {0,1}
• Let w be the weight on the first expert and initialize w = ½
• Suppose expert 1 gets a gain of 1 on the first time step, and expert 2 gets a gain of 1 on the second…
[Figure: on the (1,0) step the weight moves from w to w + Δ; on the (0,1) step it returns to w]
• Best, worst, and average all earn 1
• The algorithm earns w + (1 − w − Δ) = 1 − Δ
• Regret to Best = Regret to Worst = Regret to Average = Δ
A Bad Sequence
• Consider the following sequence
  • Expert 1: 1,0,1,0,1,0,1,0,…,1,0
  • Expert 2: 0,1,0,1,0,1,0,1,…,0,1
• We can examine w over time for existing algorithms…
  • Follow the Perturbed Leader: ½, ½ + 1/(T(1+ln 2))^{1/2} − 1/(2T), ½, ½ + 1/(T(1+ln 2))^{1/2} − 1/(2T), ½, …
  • Weighted Majority: ½, ½ + (ln(2)/2T)^{1/2}/(1 + (ln(2)/2T)^{1/2}), ½, ½ + (ln(2)/2T)^{1/2}/(1 + (ln(2)/2T)^{1/2}), ½, …
• Both will lose to the best, the worst, and the average
A Simple Trade-off: The Ω(T) Barrier
• Again, consider 2 experts with instantaneous gains in {0,1}
• Let w be the weight on the first expert and initialize w = ½
• We first examine algorithms that depend only on the cumulative difference in payoffs
• The insight holds more generally for aggressive updating
[Figure: L steps of (1,0) gains push w from ½ toward 2/3, so regret to the best is > L/3 and some single-step update Δ_t > 1/(6L); then T oscillating steps at w = ½ + Δ give regret to the average ~ (T/2)·(1/(6L)) = Ω(T/L)]
• Regret to Best × Regret to Average ~ Ω(T)!
Exponential Weights [F94]
• Unnormalized weight on expert i at time t: w_{i,t} = e^{η R_{i,t}}
• Define W_t = ∑_i w_{i,t}, so we have p_{i,t} = w_{i,t} / W_t
• Let N be the number of experts
• Setting η = O(1/T^{1/2}) achieves O(T^{1/2}) regret to the best
• Setting η = O(1/T^{1/2+α}) achieves O(T^{1/2+α}) regret to the best
• It can be shown that with η = O(1/T^{1/2+α}), the regret to the average is O(T^{1/2−α})
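A minimal Python sketch (not from the talk) of the exponential weights update just described; the function name, array layout, and the max-shift for numerical stability are my additions.

```python
import numpy as np

def exponential_weights(payoffs, eta):
    """Exponential weights (EW) over N experts -- illustrative sketch.

    payoffs: T x N array with payoffs[t, i] = gain of expert i on trial t.
    eta:     learning rate; eta ~ 1/sqrt(T) gives O(sqrt(T)) regret to the best.
    Returns the learner's cumulative payoff.
    """
    T, N = payoffs.shape
    cum = np.zeros(N)                          # R_{i,t}: cumulative payoff of each expert
    total = 0.0
    for t in range(T):
        w = np.exp(eta * (cum - cum.max()))    # w_{i,t} = e^{eta R_{i,t}} (shifted for stability)
        p = w / w.sum()                        # p_{i,t} = w_{i,t} / W_t
        total += p @ payoffs[t]                # learner earns the weighted payoff
        cum += payoffs[t]
    return total
```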
So far…
[Plot: regret to the best ~ T^x vs. regret to the average ~ T^y; cumulative-difference algorithms lie on the tradeoff frontier passing through (1/2, 1) and (1, 1/2)]
An Unrestricted Lower Bound
[Plot: the previous tradeoff plot of regret to best ~ T^x vs. regret to average ~ T^y, now with the lower bound region for all algorithms added to that for cumulative-difference algorithms]
• Any algorithm achieving O(T^{1/2}) regret to the best must suffer Ω(T^{1/2}) regret to the average
• Any algorithm achieving O((T log T)^{1/2}) regret to the best must suffer Ω(T^ε) regret to the average
• Not restricted to cumulative-difference algorithms!
A Simple Additive Algorithm
• Once again, 2 experts with instantaneous gains in {0,1}, w initialized to ½
• Let D_t be the difference in cumulative payoffs of the two experts at time t
• The algorithm makes the following updates (a sketch appears below):
  • If the expert gains are (0,0) or (1,1): no change to w
  • If the expert gains are (1,0): w ← w + Δ
  • If the expert gains are (0,1): w ← w − Δ
• Assume we never reach w = 1
• For any difference D_t = d we have w = ½ + dΔ
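A short Python sketch (not from the talk) of the additive update on this slide for the two-expert, {0,1}-gain case; the function name and clipping at the boundary are my own illustrative choices.

```python
def additive_two_experts(gains, delta):
    """Additive update for two experts with {0,1} gains -- illustrative sketch.

    gains: iterable of (g1, g2) pairs, one per trial.
    delta: the additive step size Delta from the slide.
    Returns (learner's total payoff, final weight on expert 1).
    """
    w = 0.5                                # weight on expert 1, initialized to 1/2
    total = 0.0
    for g1, g2 in gains:
        total += w * g1 + (1 - w) * g2     # earn with the current weight
        if (g1, g2) == (1, 0):
            w = min(1.0, w + delta)        # slide assumes w never actually reaches 1
        elif (g1, g2) == (0, 1):
            w = max(0.0, w - delta)
        # (0,0) and (1,1): no change to w
    return total, w
```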
Breaking the Ω(T) Barrier
• While |D_t| < H:
  • (0,0) or (1,1): no change to w
  • (1,0): w ← w + Δ
  • (0,1): w ← w − Δ
• Then play EW with η = T^{−1/3}
• We will analyze what happens:
  1. If we stay in the loop
  2. If we exit the loop
Staying in the Loop
• While |D_t| < H:
  • (0,0) or (1,1): no change to w
  • (1,0): w ← w + Δ
  • (0,1): w ← w − Δ
[Figure: the difference D_t oscillates between d and d+1; each oscillation costs the algorithm Δ against both the best and the average]
• Observe R_{best,t} − R_{avg,t} < H, so it is enough to compute the regret to the average
• Regret to the Average: at most TΔ
• Regret to the Best: at most TΔ + H
Exiting the Loop
• While |D_t| < H: additive updates as before; then play EW with η = T^{−1/3}
[Figure: on a (1,0) step the algorithm loses 1 − w to the best but gains w − ½ over the average]
• Upon exit from the loop:
  • Regret to the best: still at most H + TΔ
  • Gain over the average: (Δ + 2Δ + 3Δ + … + HΔ) − TΔ ~ H²Δ − TΔ
• So e.g. H = T^{2/3} and Δ = 1/T gives:
  • Regret to best: < T^{2/3} in the loop or upon exit
  • Regret to average: constant in the loop; a gain of T^{1/3} upon exit
• Now the EW regret to the best is T^{2/3} and to the average is T^{1/3}
[Plot: updated tradeoff of regret to best ~ T^x vs. regret to avg ~ T^y, with the lower bounds for all algorithms and for cumulative-difference algorithms, and the new point at x = 2/3 achieved by the algorithm above]
Obliterating the Ω(T) Barrier
• Instead of playing the additive algorithm inside the loop, we can play EW with η = Δ = 1/T
• Instead of having one phase, we can have many (see the sketch below):
  Set η = 1/T, k = log T
  For i = 1 to k:
    • Reset and run EW with the current value of η until R_{best,t} − R_{avg,t} > H = O(T^{1/2})
    • Set η = η · 2
  Reset and run EW with the final value of η
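A Python sketch of the multi-phase scheme on this slide; the function name, the phase-exit bookkeeping, and the exact reset/threshold details are my assumptions, not the authors' code.

```python
import numpy as np

def multi_phase_ew(payoffs, H):
    """Multi-phase exponential weights -- illustrative sketch.

    payoffs: T x N array of per-trial expert payoffs in [0, 1].
    H:       phase threshold, O(sqrt(T)) on the slide.
    eta starts at 1/T and doubles across k = log T phases.
    """
    T, N = payoffs.shape
    eta = 1.0 / T
    k = int(np.ceil(np.log2(T)))            # k = log T doubling phases
    total, t = 0.0, 0
    for phase in range(k + 1):              # k phases, then a final run to the end
        cum = np.zeros(N)                   # reset EW state at the start of each phase
        while t < T:
            w = np.exp(eta * (cum - cum.max()))
            p = w / w.sum()
            total += p @ payoffs[t]
            cum += payoffs[t]
            t += 1
            # leave the phase once the best expert is H ahead of the average
            if phase < k and cum.max() - cum.mean() > H:
                break
        if t >= T:
            break
        eta *= 2                            # double eta for the next phase
    return total
```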
Extensions and Open Problems
• Known extensions to our algorithm:
  • Instead of the average, we can use any static weighting inside the simplex
• Future goals:
  • Nicer dependence on the number of experts
    • Ours is O(log N); typically it is O(sqrt(log N))
  • Generalization to the returns setting and to other loss functions
Thanks! Questions?