Efficient Sequential Decision-Making in Structured Problems
Adam Tauman Kalai
Georgia Institute of Technology · Weizmann Institute · Toyota Technological Institute · National Institute of Corrections
BANDITS AND REGRET
[figure: slot machines with per-period payoffs over time; the best machine averages 8, the algorithm averages 5]
REGRET = (AVG REWARD OF BEST DECISION) − (AVG REWARD) = 8 − 5 = 3
TWO APPROACHES
Bayesian setting [Robbins52]
• Independent prior probability dist. over payoff sequences for each machine
• Thm: Maximize (discounted) expected reward by pulling the arm of largest "Gittins index"
Nonstochastic [Auer, Cesa-Bianchi, Freund, Schapire 95]
• Thm: For any sequence of [0,1] costs on N machines, their algorithm achieves expected average regret of O(√(N log N / T))
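For concreteness, here is a minimal sketch of an Exp3-style algorithm for the nonstochastic setting. The default parameter choices and the `get_cost` callback are illustrative assumptions, not part of the cited result, which tunes constants differently.

```python
import math
import random

def exp3(N, T, get_cost, gamma=None, eta=None):
    """Exp3-style algorithm for nonstochastic bandits with costs in [0,1].

    get_cost(t, arm) returns the cost of pulling `arm` at time t;
    the costs may be an arbitrary fixed sequence.
    """
    if gamma is None:
        gamma = min(1.0, math.sqrt(N * math.log(N) / T))
    if eta is None:
        eta = gamma / N
    log_w = [0.0] * N                      # log-weights, for numerical stability
    total = 0.0
    for t in range(T):
        m = max(log_w)
        w = [math.exp(lw - m) for lw in log_w]
        Z = sum(w)
        # Mix the exponential-weights distribution with uniform exploration.
        p = [(1 - gamma) * wi / Z + gamma / N for wi in w]
        arm = random.choices(range(N), weights=p)[0]
        cost = get_cost(t, arm)
        total += cost
        # Importance-weighted, unbiased cost estimate for the pulled arm only.
        log_w[arm] -= eta * cost / p[arm]
    return total / T
```

The importance weighting (dividing the observed cost by the probability of pulling that arm) is what makes the estimate unbiased despite seeing only one machine's cost per period.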
STRUCTURED COMB-OPT
Online examples: Routing, Compression, Binary search trees, PCFGs, Pruning decision trees, Poker, Auctions, Classification
Problems not included: Portfolio selection (nonlinear), Online Sudoku
[figure: e.g., candidate routes with times 25 min / 17 min / 44 min; candidate clusterings with 40 / 55 / 19 errors]
STRUCTURED COMB-OPT
Known decision set S. Known LINEAR cost function c: S × [0,1]^d → [0,1]. Unknown w_1, w_2, …, w_T ∈ [0,1]^d.
On period t = 1, 2, …, T:
• Alg. picks s_t ∈ S.
• Alg. pays and finds out c(s_t, w_t).
REGRET = (1/T) Σ_{t≤T} c(s_t, w_t) − min_{s∈S} (1/T) Σ_{t≤T} c(s, w_t)
MAIN POINTS
• Offline optimization M: [0,1]^d → S
  • M(w) = argmin_{s∈S} c(s,w), e.g. shortest path
  • Easier than sequential decision-making!?
• EXPLORATION
  • Automatically find an "exploration basis" using M
• LOW REGRET
  • Dimension matters more than # decisions
• EFFICIENCY
  • Online algorithm uses offline black-box optimizer M
MAIN RESULT [AK04, MB04, DH06]
An algorithm that achieves: for any set S, any linear c: S × [0,1]^d → [0,1], any T ≥ 1, and any sequence w_1, …, w_T ∈ [0,1]^d,
E[regret of alg] ≤ 15dT^{-1/3}.
Each update requires linear time and calls the offline optimizer M with probability O(dT^{-1/3}).
EXPLORE vs EXPLOIT [AK04, MB04]
Find a good "exploration basis" b_1, …, b_d using M.
On period t = 1, 2, …, T:
• Explore, with probability ε:
  • Play s_t := a random element of the exploration basis
  • Estimate v_t somehow
• Exploit, with probability 1−ε:
  • Play s_t := M(Σ_{i<t} v_i + p), where p is a random perturbation [Hannan57]
  • v_t := 0
Key property: E[v_t] = w_t. E[# calls to M] = εT = dT^{2/3} for ε = dT^{-1/3}.
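A schematic sketch of this explore/exploit loop, assuming a black-box oracle `M`, an exploration basis (rows of `B`), and a `play` callback; all three are hypothetical interfaces, and the papers' exact estimators and parameter schedules differ in details. The lazy caching of the perturbed leader is what keeps the expected number of calls to M near εT.

```python
import numpy as np

def bandit_fpl(M, basis, T, explore_prob, perturb_scale, play):
    """Explore/exploit loop in the style of the slide above (sketch).

    M(w)    -- offline oracle: returns a decision vector minimizing s_bar @ w
    basis   -- d exploration-basis decision vectors b_1..b_d
    play(s) -- plays decision s for one period, returns observed cost s_bar @ w_t
    """
    B = np.array(basis)                     # rows are the basis vectors
    Binv = np.linalg.inv(B)
    d = B.shape[0]
    rng = np.random.default_rng()
    p = rng.uniform(0, perturb_scale, size=d)   # one-time Hannan perturbation
    v_sum = np.zeros(d)                     # running sum of estimates v_1..v_{t-1}
    leader = None                           # cached value of M(v_sum + p)
    for t in range(T):
        if rng.random() < explore_prob:     # EXPLORE (probability epsilon)
            j = rng.integers(d)             # uniform basis element
            cost = play(B[j])
            # Importance-weighted estimate; unbiased: E[v_t] = w_t, since
            # Binv maps basis-frame coordinates back to the standard frame.
            v_sum += (d * cost / explore_prob) * Binv[:, j]
            leader = None                   # estimate changed; recompute lazily
        else:                               # EXPLOIT (probability 1 - epsilon)
            if leader is None:
                leader = M(v_sum + p)       # M is only called after explorations
            play(leader)
```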
REMAINDER OF TALK
• EXPLORATION
  • Good "exploration basis" definition
  • Finding one
• EXPLOITATION
  • Perturbation (randomized regularization)
  • Stability analysis
• OTHER DIRECTIONS
  • Approximation algorithms
  • Convex problems
GOING TO d-DIMENSIONS
• Linear cost function c: S × [0,1]^d → [0,1]
• Mapping S → [0,1]^d: s̄ = (c(s,(1,0,…,0)), c(s,(0,1,…,0)), …, c(s,(0,…,0,1)))
• c(s,w) = s̄ · w
S̄ = { s̄ | s ∈ S }, K = convex-hull(S̄); WLOG dim(S̄) = d.
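A toy illustration of this embedding for shortest paths, with made-up edges and costs: each path becomes a 0/1 edge-incidence vector s̄, and the offline optimizer M reduces to an argmin of dot products.

```python
import numpy as np

# Hypothetical instance: decisions are a-to-d paths in a 4-node graph.
# Edges are indexed 0..4, and s_bar is the 0/1 edge-incidence vector.
edges = [("a", "b"), ("b", "d"), ("a", "c"), ("c", "d"), ("a", "d")]
paths = {
    "a-b-d": [1, 1, 0, 0, 0],
    "a-c-d": [0, 0, 1, 1, 0],
    "a-d":   [0, 0, 0, 0, 1],
}

w = np.array([0.2, 0.2, 0.1, 0.4, 0.9])    # one period's edge costs w_t

# Linear cost c(s, w) = s_bar . w, so the offline optimizer M is just argmin:
costs = {name: np.dot(s_bar, w) for name, s_bar in paths.items()}
best = min(costs, key=costs.get)
print(costs, "->", best)                    # a-b-d costs 0.4, the minimum
```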
EXPLORATION BASIS [AK04]
Def: An exploration basis b_1, b_2, …, b_d ∈ S is a 2-barycentric spanner if, for every s ∈ S, s̄ = Σ_i α_i b̄_i for some α_1, α_2, …, α_d ∈ [−2, 2].
Possible to find an exploration basis efficiently using the offline optimizer M(w) = argmin_{s∈S} c(s,w).
[figure: K = convex-hull(S̄), contrasting a bad (nearly degenerate) basis with a good (well-spread) one]
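One plausible rendering of the [AK04]-style spanner construction, exploiting the fact that det(·) is linear in any single column, so each improvement step reduces to two calls to the minimization oracle. It assumes M accepts arbitrary real cost vectors, a liberty beyond the [0,1]^d domain stated earlier.

```python
import numpy as np

def linear_argmax_abs(M, coef):
    """argmax over decisions of |coef . s_bar|, via two calls to the minimizer M."""
    s_minus, s_plus = M(coef), M(-coef)     # minimize coef.s and maximize coef.s
    return max((s_minus, s_plus), key=lambda s: abs(np.dot(coef, s)))

def barycentric_spanner(M, d):
    """2-barycentric spanner of the decision set, in the style of [AK04]."""
    X = np.eye(d)                           # columns are the provisional basis

    def det_if(i, x):                       # det of X with column i replaced by x
        Y = X.copy()
        Y[:, i] = x
        return np.linalg.det(Y)

    for i in range(d):                      # phase 1: build a basis inside S
        # det_if(i, x) is linear in x; recover its coefficient vector.
        coef = np.array([det_if(i, e) for e in np.eye(d)])
        X[:, i] = linear_argmax_abs(M, coef)

    improved = True
    while improved:                         # phase 2: swap until no big gains
        improved = False
        for i in range(d):
            coef = np.array([det_if(i, e) for e in np.eye(d)])
            s = linear_argmax_abs(M, coef)
            if abs(np.dot(coef, s)) > 2 * abs(np.linalg.det(X)):
                X[:, i] = s                 # determinant more than doubles,
                improved = True             # so this loop terminates
    return [X[:, i].copy() for i in range(d)]
```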
INSTABILITY
Define z_t = M(Σ_{i≤t} w_i) = argmin_{s∈S} Σ_{i≤t} c(s,w_i).
Natural idea: use z_{t−1} on period t? REGRET = ½!
Two decisions, costs per period:
  t:  1    2    3    4    5   …
  A:  ½    0    1    0    1   …
  B:  0    1    0    1    0   …
The leader z_{t−1} is wrong every period: the algorithm pays 1 per period, while either fixed decision averages about ½.
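The instability is easy to reproduce; a short simulation of follow-the-leader on the cost table above:

```python
# Follow-the-leader on the alternating two-decision example:
# costs per period for (A, B): (1/2, 0), (0, 1), (1, 0), (0, 1), (1, 0), ...
T = 10000
costs = [(0.5, 0.0)] + [(0.0, 1.0) if t % 2 == 1 else (1.0, 0.0)
                        for t in range(1, T)]

tot = [0.0, 0.0]                 # cumulative cost of each decision so far
alg = 0.0
for t in range(T):
    leader = 0 if tot[0] <= tot[1] else 1   # z_{t-1}: best decision on the past
    alg += costs[t][leader]
    tot[0] += costs[t][0]
    tot[1] += costs[t][1]

print(alg / T)                   # -> about 1: the leader is wrong every period
print(min(tot) / T)              # -> about 1/2: either fixed decision is fine
```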
STABILITY ANALYSIS [KV03]
Define z_t = M(Σ_{i≤t} w_i) = argmin_{s∈S} Σ_{i≤t} c(s,w_i).
Lemma: The regret of using z_t on period t is ≤ 0.
Proof: min_{s∈S} c(s,w_1) + c(s,w_2) + … + c(s,w_T)
= c(z_T,w_1) + … + c(z_T,w_{T−1}) + c(z_T,w_T)
≥ c(z_{T−1},w_1) + … + c(z_{T−1},w_{T−1}) + c(z_T,w_T)
≥ … ≥ c(z_1,w_1) + c(z_2,w_2) + … + c(z_T,w_T). ∎
STABILITY ANALYSIS [KV03]
Define z_t = M(Σ_{i≤t} w_i) = argmin_{s∈S} Σ_{i≤t} c(s,w_i).
Lemma: regret of using z_t on period t is ≤ 0
⇒ Regret of using z_{t−1} on period t ≤ Σ_{t≤T} [c(z_{t−1},w_t) − c(z_t,w_t)]
Idea: regularize to achieve stability.
Let y_t = M(Σ_{i≤t} w_i + p), for random p ∈ [0, 1/ε]^d.
E[Regret of using y_{t−1} on t] ≤ Σ_{t≤T} E[c(y_{t−1},w_t) − c(y_t,w_t)] + (small perturbation penalty)
Strange: randomized regularization! y_t can be computed using M.
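The same simulation with a one-time random perturbation p, as in the lemma above; the scale 1/ε = √T is one standard choice, not necessarily the talk's:

```python
import random

# Follow the *perturbed* leader [Hannan57] on the same cost sequence:
# add a one-time random offset of scale sqrt(T) to each cumulative cost.
T = 10000
costs = [(0.5, 0.0)] + [(0.0, 1.0) if t % 2 == 1 else (1.0, 0.0)
                        for t in range(1, T)]

scale = T ** 0.5                 # 1/eps with eps = 1/sqrt(T)
p = [random.uniform(0, scale), random.uniform(0, scale)]
tot = [0.0, 0.0]
alg = 0.0
for t in range(T):
    leader = 0 if tot[0] + p[0] <= tot[1] + p[1] else 1
    alg += costs[t][leader]
    tot[0] += costs[t][0]
    tot[1] += costs[t][1]

print(alg / T - min(tot) / T)    # expected regret is now O(1/sqrt(T)), not 1/2
```

With the perturbation, the leader almost surely never flips on this sequence, so the algorithm locks onto one of the two (nearly optimal) fixed decisions.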
BANDIT CONVEX OPT.
• Convex feasible set S ⊆ R^d
• Unknown sequence of concave functions f_1, …, f_T: S → [0,1]
• On period t = 1, 2, …, T:
  • Algorithm chooses x_t ∈ S
  • Algorithm pays and finds out f_t(x_t)
• Thm [FKM05]: ∀ concave f_1, f_2, …: S → [0,1] and ∀ T ≥ 1, the bacterial ascent algorithm achieves expected average regret O(poly(d)·T^{-1/4}).
MOTIVATING EXAMPLE
• Company has to decide how much to advertise among d channels, within budget.
• Feedback is total profit, affected by external factors.
[figure: concave profit curves f_1, …, f_4 over $ADVERTISING; successive choices x_1, …, x_4 approach the optimum x*, but each period reveals only f_t(x_t)]
BACTERIAL ASCENT
[figure: three animation frames inside the feasible set S; from x_0, each round an EXPLORE step samples a nearby point and an EXPLOIT step moves the iterate, producing x_1, x_2, x_3, … climbing the payoff]
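A sketch of a single-point, gradient-free ascent step in the spirit of "bacterial ascent", using the one-point gradient estimate of [FKM05]; `f`, `project`, and the step sizes are assumed interfaces and choices, not the talk's exact schedule.

```python
import numpy as np

def bacterial_ascent(f, project, x0, d, T, delta=0.1, eta=0.01):
    """Gradient-free bandit ascent, in the style of [FKM05] (sketch).

    f(x)       -- bandit access to the current period's concave payoff
    project(x) -- Euclidean projection onto (a slightly shrunk copy of) S
    delta, eta -- illustrative exploration radius and step size
    """
    x = np.array(x0, dtype=float)
    rng = np.random.default_rng()
    for t in range(T):
        u = rng.normal(size=d)
        u /= np.linalg.norm(u)              # uniform direction on the sphere
        y = x + delta * u                   # EXPLORE: play a nearby point
        payoff = f(y)                       # the only feedback received
        g = (d / delta) * payoff * u        # one-point gradient estimate
        x = project(x + eta * g)            # EXPLOIT: ascend the estimate
    return x
```

The single evaluation f(x + δu), rescaled by (d/δ)u, is an unbiased gradient estimate of a smoothed version of f, which is what lets the algorithm climb with bandit feedback only.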
APPROXIMATION ALG'S
• What if offline optimization is NP-hard? Example: repeated traveling salesman problem.
• Suppose you have an α-approximation algorithm A: c(A(w),w) ≤ α·min_{s∈S} c(s,w) for all w ∈ [0,1]^d.
• Would like to achieve low α-regret = (our cost) − α·(min cost of best s ∈ S).
• Possible using the convex optimization approach above and transformations of approximation algorithms [KKL07].
CONCLUSIONS
• Can extend bandit algorithms to structured problems, guaranteeing worst-case low regret:
  • Linear combinatorial optimization problems
  • Convex optimization
• Remarks:
  • Works against adaptive adversaries as well
  • Online efficiency = offline efficiency
  • Can handle approximation algorithms
  • Can achieve cost ≤ (1+ε)·min cost + O(1/ε)