Efficient Sequential Decision-Making in Structured Problems
Adam Tauman Kalai
Georgia Institute of Technology · Weizmann Institute · Toyota Technological Institute · National Institute of Corrections
BANDITS AND REGRET
[figure: slot machines with per-period payoffs over time; the best machine averages 8, the algorithm averages 5]
REGRET = (AVG REWARD OF BEST DECISION) − (AVG REWARD) = 8 − 5 = 3
TWO APPROACHES
Bayesian setting [Robbins52]
• Independent prior probability dist. over payoff sequences for each machine
• Thm: Maximize (discounted) expected reward by pulling the arm of largest "Gittins index"
Nonstochastic [Auer, Cesa-Bianchi, Freund, Schapire 95]
• Thm: For any sequence of [0,1] costs on N machines, their algorithm achieves expected average regret of O(√(N log N / T))
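For concreteness, here is a minimal sketch of an Exp3-style algorithm for the nonstochastic setting. The default parameter choices and the `get_cost` callback are illustrative assumptions, not part of the cited result, which tunes constants differently.

```python
import math
import random

def exp3(N, T, get_cost, gamma=None, eta=None):
    """Exp3-style algorithm for nonstochastic bandits with costs in [0,1].

    get_cost(t, arm) returns the cost of pulling `arm` at time t;
    the costs may be an arbitrary fixed sequence.
    """
    if gamma is None:
        gamma = min(1.0, math.sqrt(N * math.log(N) / T))
    if eta is None:
        eta = gamma / N
    log_w = [0.0] * N                      # log-weights, for numerical stability
    total = 0.0
    for t in range(T):
        m = max(log_w)
        w = [math.exp(lw - m) for lw in log_w]
        Z = sum(w)
        # Mix the exponential-weights distribution with uniform exploration.
        p = [(1 - gamma) * wi / Z + gamma / N for wi in w]
        arm = random.choices(range(N), weights=p)[0]
        cost = get_cost(t, arm)
        total += cost
        # Importance-weighted, unbiased cost estimate for the pulled arm only.
        log_w[arm] -= eta * cost / p[arm]
    return total / T
```

The importance weighting (dividing the observed cost by the probability of pulling that arm) is what makes the estimate unbiased despite seeing only one machine's cost per period.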
STRUCTURED COMB-OPT
Online examples: Routing, Compression, Binary search trees, PCFGs, Pruning decision trees, Poker, Auctions, Classification
Problems not included: Portfolio selection (nonlinear), Online Sudoku
[figure: e.g., candidate routes with times 25 min / 17 min / 44 min; candidate clusterings with 40 / 55 / 19 errors]
STRUCTURED COMB-OPT
Known decision set S. Known LINEAR cost function c: S × [0,1]^d → [0,1]. Unknown w_1, w_2, …, w_T ∈ [0,1]^d.
On period t = 1, 2, …, T:
• Alg. picks s_t ∈ S.
• Alg. pays and finds out c(s_t, w_t).
REGRET = (1/T) Σ_{t≤T} c(s_t, w_t) − min_{s∈S} (1/T) Σ_{t≤T} c(s, w_t)
MAIN POINTS
• Offline optimization M: [0,1]^d → S
  • M(w) = argmin_{s∈S} c(s,w), e.g. shortest path
  • Easier than sequential decision-making!?
• EXPLORATION
  • Automatically find an "exploration basis" using M
• LOW REGRET
  • Dimension matters more than # decisions
• EFFICIENCY
  • Online algorithm uses offline black-box optimizer M
MAIN RESULT [AK04, MB04, DH06]
An algorithm that achieves: for any set S, any linear c: S × [0,1]^d → [0,1], any T ≥ 1, and any sequence w_1, …, w_T ∈ [0,1]^d,
E[regret of alg] ≤ 15dT^{-1/3}.
Each update requires linear time and calls the offline optimizer M with probability O(dT^{-1/3}).
EXPLORE vs EXPLOIT [AK04, MB04]
Find a good "exploration basis" b_1, …, b_d using M.
On period t = 1, 2, …, T:
• Explore, with probability ε:
  • Play s_t := a random element of the exploration basis
  • Estimate v_t somehow
• Exploit, with probability 1−ε:
  • Play s_t := M(Σ_{i<t} v_i + p), where p is a random perturbation [Hannan57]
  • v_t := 0
Key property: E[v_t] = w_t. E[# calls to M] = εT = dT^{2/3} for ε = dT^{-1/3}.
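A schematic sketch of this explore/exploit loop, assuming a black-box oracle `M`, an exploration basis (rows of `B`), and a `play` callback; all three are hypothetical interfaces, and the papers' exact estimators and parameter schedules differ in details. The lazy caching of the perturbed leader is what keeps the expected number of calls to M near εT.

```python
import numpy as np

def bandit_fpl(M, basis, T, explore_prob, perturb_scale, play):
    """Explore/exploit loop in the style of the slide above (sketch).

    M(w)    -- offline oracle: returns a decision vector minimizing s_bar @ w
    basis   -- d exploration-basis decision vectors b_1..b_d
    play(s) -- plays decision s for one period, returns observed cost s_bar @ w_t
    """
    B = np.array(basis)                     # rows are the basis vectors
    Binv = np.linalg.inv(B)
    d = B.shape[0]
    rng = np.random.default_rng()
    p = rng.uniform(0, perturb_scale, size=d)   # one-time Hannan perturbation
    v_sum = np.zeros(d)                     # running sum of estimates v_1..v_{t-1}
    leader = None                           # cached value of M(v_sum + p)
    for t in range(T):
        if rng.random() < explore_prob:     # EXPLORE (probability epsilon)
            j = rng.integers(d)             # uniform basis element
            cost = play(B[j])
            # Importance-weighted estimate; unbiased: E[v_t] = w_t, since
            # Binv maps basis-frame coordinates back to the standard frame.
            v_sum += (d * cost / explore_prob) * Binv[:, j]
            leader = None                   # estimate changed; recompute lazily
        else:                               # EXPLOIT (probability 1 - epsilon)
            if leader is None:
                leader = M(v_sum + p)       # M is only called after explorations
            play(leader)
```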
REMAINDER OF TALK
• EXPLORATION
  • Good "exploration basis" definition
  • Finding one
• EXPLOITATION
  • Perturbation (randomized regularization)
  • Stability analysis
• OTHER DIRECTIONS
  • Approximation algorithms
  • Convex problems
GOING TO d-DIMENSIONS
• Linear cost function c: S × [0,1]^d → [0,1]
• Mapping S → [0,1]^d: s̄ = (c(s,(1,0,…,0)), c(s,(0,1,…,0)), …, c(s,(0,…,0,1)))
• c(s,w) = s̄ · w
S̄ = { s̄ | s ∈ S }, K = convex-hull(S̄); WLOG dim(S̄) = d.
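A toy illustration of this embedding for shortest paths, with made-up edges and costs: each path becomes a 0/1 edge-incidence vector s̄, and the offline optimizer M reduces to an argmin of dot products.

```python
import numpy as np

# Hypothetical instance: decisions are a-to-d paths in a 4-node graph.
# Edges are indexed 0..4, and s_bar is the 0/1 edge-incidence vector.
edges = [("a", "b"), ("b", "d"), ("a", "c"), ("c", "d"), ("a", "d")]
paths = {
    "a-b-d": [1, 1, 0, 0, 0],
    "a-c-d": [0, 0, 1, 1, 0],
    "a-d":   [0, 0, 0, 0, 1],
}

w = np.array([0.2, 0.2, 0.1, 0.4, 0.9])    # one period's edge costs w_t

# Linear cost c(s, w) = s_bar . w, so the offline optimizer M is just argmin:
costs = {name: np.dot(s_bar, w) for name, s_bar in paths.items()}
best = min(costs, key=costs.get)
print(costs, "->", best)                    # a-b-d costs 0.4, the minimum
```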
EXPLORATION BASIS [AK04]
Def: An exploration basis b_1, b_2, …, b_d ∈ S is a 2-barycentric spanner if, for every s ∈ S, s̄ = Σ_i α_i b̄_i for some α_1, α_2, …, α_d ∈ [−2, 2].
Possible to find an exploration basis efficiently using the offline optimizer M(w) = argmin_{s∈S} c(s,w).
[figure: K = convex-hull(S̄), contrasting a bad (nearly degenerate) basis with a good (well-spread) one]
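One plausible rendering of the [AK04]-style spanner construction, exploiting the fact that det(·) is linear in any single column, so each improvement step reduces to two calls to the minimization oracle. It assumes M accepts arbitrary real cost vectors, a liberty beyond the [0,1]^d domain stated earlier.

```python
import numpy as np

def linear_argmax_abs(M, coef):
    """argmax over decisions of |coef . s_bar|, via two calls to the minimizer M."""
    s_minus, s_plus = M(coef), M(-coef)     # minimize coef.s and maximize coef.s
    return max((s_minus, s_plus), key=lambda s: abs(np.dot(coef, s)))

def barycentric_spanner(M, d):
    """2-barycentric spanner of the decision set, in the style of [AK04]."""
    X = np.eye(d)                           # columns are the provisional basis

    def det_if(i, x):                       # det of X with column i replaced by x
        Y = X.copy()
        Y[:, i] = x
        return np.linalg.det(Y)

    for i in range(d):                      # phase 1: build a basis inside S
        # det_if(i, x) is linear in x; recover its coefficient vector.
        coef = np.array([det_if(i, e) for e in np.eye(d)])
        X[:, i] = linear_argmax_abs(M, coef)

    improved = True
    while improved:                         # phase 2: swap until no big gains
        improved = False
        for i in range(d):
            coef = np.array([det_if(i, e) for e in np.eye(d)])
            s = linear_argmax_abs(M, coef)
            if abs(np.dot(coef, s)) > 2 * abs(np.linalg.det(X)):
                X[:, i] = s                 # determinant more than doubles,
                improved = True             # so this loop terminates
    return [X[:, i].copy() for i in range(d)]
```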
INSTABILITY
Define z_t = M(Σ_{i≤t} w_i) = argmin_{s∈S} Σ_{i≤t} c(s,w_i).
Natural idea: use z_{t−1} on period t? REGRET = ½!
Two decisions, costs per period:
  t:  1    2    3    4    5   …
  A:  ½    0    1    0    1   …
  B:  0    1    0    1    0   …
The leader z_{t−1} is wrong every period: the algorithm pays 1 per period, while either fixed decision averages about ½.
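The instability is easy to reproduce; a short simulation of follow-the-leader on the cost table above:

```python
# Follow-the-leader on the alternating two-decision example:
# costs per period for (A, B): (1/2, 0), (0, 1), (1, 0), (0, 1), (1, 0), ...
T = 10000
costs = [(0.5, 0.0)] + [(0.0, 1.0) if t % 2 == 1 else (1.0, 0.0)
                        for t in range(1, T)]

tot = [0.0, 0.0]                 # cumulative cost of each decision so far
alg = 0.0
for t in range(T):
    leader = 0 if tot[0] <= tot[1] else 1   # z_{t-1}: best decision on the past
    alg += costs[t][leader]
    tot[0] += costs[t][0]
    tot[1] += costs[t][1]

print(alg / T)                   # -> about 1: the leader is wrong every period
print(min(tot) / T)              # -> about 1/2: either fixed decision is fine
```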
STABILITY ANALYSIS [KV03]
Define z_t = M(Σ_{i≤t} w_i) = argmin_{s∈S} Σ_{i≤t} c(s,w_i).
Lemma: The regret of using z_t on period t is ≤ 0.
Proof: min_{s∈S} c(s,w_1) + c(s,w_2) + … + c(s,w_T)
= c(z_T,w_1) + … + c(z_T,w_{T−1}) + c(z_T,w_T)
≥ c(z_{T−1},w_1) + … + c(z_{T−1},w_{T−1}) + c(z_T,w_T)
≥ … ≥ c(z_1,w_1) + c(z_2,w_2) + … + c(z_T,w_T). ∎
STABILITY ANALYSIS [KV03]
Define z_t = M(Σ_{i≤t} w_i) = argmin_{s∈S} Σ_{i≤t} c(s,w_i).
Lemma: regret of using z_t on period t is ≤ 0
⇒ Regret of using z_{t−1} on period t ≤ Σ_{t≤T} [c(z_{t−1},w_t) − c(z_t,w_t)]
Idea: regularize to achieve stability.
Let y_t = M(Σ_{i≤t} w_i + p), for random p ∈ [0, 1/ε]^d.
E[Regret of using y_{t−1} on t] ≤ Σ_{t≤T} E[c(y_{t−1},w_t) − c(y_t,w_t)] + (small perturbation penalty)
Strange: randomized regularization! y_t can be computed using M.
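The same simulation with a one-time random perturbation p, as in the lemma above; the scale 1/ε = √T is one standard choice, not necessarily the talk's:

```python
import random

# Follow the *perturbed* leader [Hannan57] on the same cost sequence:
# add a one-time random offset of scale sqrt(T) to each cumulative cost.
T = 10000
costs = [(0.5, 0.0)] + [(0.0, 1.0) if t % 2 == 1 else (1.0, 0.0)
                        for t in range(1, T)]

scale = T ** 0.5                 # 1/eps with eps = 1/sqrt(T)
p = [random.uniform(0, scale), random.uniform(0, scale)]
tot = [0.0, 0.0]
alg = 0.0
for t in range(T):
    leader = 0 if tot[0] + p[0] <= tot[1] + p[1] else 1
    alg += costs[t][leader]
    tot[0] += costs[t][0]
    tot[1] += costs[t][1]

print(alg / T - min(tot) / T)    # expected regret is now O(1/sqrt(T)), not 1/2
```

With the perturbation, the leader almost surely never flips on this sequence, so the algorithm locks onto one of the two (nearly optimal) fixed decisions.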
BANDIT CONVEX OPT.
• Convex feasible set S ⊆ R^d
• Unknown sequence of concave functions f_1, …, f_T: S → [0,1]
• On period t = 1, 2, …, T:
  • Algorithm chooses x_t ∈ S
  • Algorithm pays and finds out f_t(x_t)
• Thm [FKM05]: ∀ concave f_1, f_2, …: S → [0,1] and ∀ T ≥ 1, the bacterial ascent algorithm achieves expected average regret O(poly(d)·T^{-1/4}).
MOTIVATING EXAMPLE
• Company has to decide how much to advertise among d channels, within budget.
• Feedback is total profit, affected by external factors.
[figure: concave profit curves f_1, …, f_4 over $ADVERTISING; successive choices x_1, …, x_4 approach the optimum x*, but each period reveals only f_t(x_t)]
BACTERIAL ASCENT
[figure: three animation frames inside the feasible set S; from x_0, each round an EXPLORE step samples a nearby point and an EXPLOIT step moves the iterate, producing x_1, x_2, x_3, … climbing the payoff]
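A sketch of a single-point, gradient-free ascent step in the spirit of "bacterial ascent", using the one-point gradient estimate of [FKM05]; `f`, `project`, and the step sizes are assumed interfaces and choices, not the talk's exact schedule.

```python
import numpy as np

def bacterial_ascent(f, project, x0, d, T, delta=0.1, eta=0.01):
    """Gradient-free bandit ascent, in the style of [FKM05] (sketch).

    f(x)       -- bandit access to the current period's concave payoff
    project(x) -- Euclidean projection onto (a slightly shrunk copy of) S
    delta, eta -- illustrative exploration radius and step size
    """
    x = np.array(x0, dtype=float)
    rng = np.random.default_rng()
    for t in range(T):
        u = rng.normal(size=d)
        u /= np.linalg.norm(u)              # uniform direction on the sphere
        y = x + delta * u                   # EXPLORE: play a nearby point
        payoff = f(y)                       # the only feedback received
        g = (d / delta) * payoff * u        # one-point gradient estimate
        x = project(x + eta * g)            # EXPLOIT: ascend the estimate
    return x
```

The single evaluation f(x + δu), rescaled by (d/δ)u, is an unbiased gradient estimate of a smoothed version of f, which is what lets the algorithm climb with bandit feedback only.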
APPROXIMATION ALG'S
• What if offline optimization is NP-hard? Example: repeated traveling salesman problem.
• Suppose you have an α-approximation algorithm A: c(A(w),w) ≤ α·min_{s∈S} c(s,w) for all w ∈ [0,1]^d.
• Would like to achieve low α-regret = (our cost) − α·(min cost of best s ∈ S).
• Possible using the convex optimization approach above and transformations of approximation algorithms [KKL07].
CONCLUSIONS
• Can extend bandit algorithms to structured problems, guaranteeing worst-case low regret:
  • Linear combinatorial optimization problems
  • Convex optimization
• Remarks:
  • Works against adaptive adversaries as well
  • Online efficiency = offline efficiency
  • Can handle approximation algorithms
  • Can achieve cost ≤ (1+ε)·min cost + O(1/ε)