Efficient Sequential Decision-Making in Structured Problems

Adam Tauman Kalai. Georgia Institute of Technology, Weizmann Institute, Toyota Technological Institute, National Institute of Corrections.

Presentation Transcript


  1. Efficient Sequential Decision-Making in Structured Problems. Adam Tauman Kalai. Georgia Institute of Technology, Weizmann Institute, Toyota Technological Institute, National Institute of Corrections

  2. BANDITS AND REGRET
     [Figure: rewards of several slot machines over time, with per-machine averages]
     REGRET = AVG REWARD OF BEST DECISION – AVG REWARD = 8 – 5 = 3

  3. TWO APPROACHES
     Bayesian setting [Robbins52]
     • Independent prior probability dist. over payoff sequences for each machine
     • Thm: Maximize (discounted) expected reward by pulling arm of largest “Gittins index”
     Nonstochastic [Auer, Cesa-Bianchi, Freund, Schapire 95]
     • Thm: For any sequence of [0,1] costs on N machines, their algorithm achieves expected regret of O(…)

  4. STRUCTURED COMB-OPT
     Online examples: Routing, Compression, Binary search trees, PCFGs, Pruning dec. trees, Poker, Auctions, Classification
     Problems not included: Portfolio selection (nonlinear), Online sudoku
     Example feedback: Route → Time (25 min, 17 min, 44 min); Clustering → Errors (40, 55, 19)

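  5. STRUCTURED COMB-OPT
     Known decision set $S$.
     Known LINEAR cost func. $c: S \times [0,1]^d \to [0,1]$.
     Unknown $w_1, w_2, \ldots, w_T \in [0,1]^d$.
     On period $t = 1, 2, \ldots, T$:
     • Alg. picks $s_t \in S$.
     • Alg. pays and finds out $c(s_t, w_t)$.
     REGRET = (avg. cost of alg.) – (avg. cost of best fixed decision) = $\frac{1}{T}\sum_{t=1}^{T} c(s_t, w_t) - \min_{s \in S} \frac{1}{T}\sum_{t=1}^{T} c(s, w_t)$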
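The protocol on this slide can be made concrete with a small numeric sketch. It assumes two things that the slide does not state at this point: that each decision is encoded as a vector so that $c(s, w) = s \cdot w$, and that regret is measured as average cost minus the average cost of the best fixed decision (the cost analogue of the reward-based definition on slide 2). The function name `average_regret` is hypothetical.

```python
import numpy as np

def average_regret(S, W, plays):
    """Average regret of a sequence of plays against the best fixed decision.

    S:     (n, d) array, one row per decision (assumed encoding: c(s, w) = s . w)
    W:     (T, d) array, one row per period's hidden weight vector w_t
    plays: length-T sequence of row indices into S chosen by the algorithm
    """
    S = np.asarray(S, dtype=float)
    W = np.asarray(W, dtype=float)
    costs = S @ W.T                           # costs[k, t] = c(s_k, w_t)
    T = W.shape[0]
    alg_cost = costs[np.asarray(plays), np.arange(T)].mean()
    best_fixed = costs.mean(axis=1).min()     # best single decision in hindsight
    return alg_cost - best_fixed

# Two decisions, two periods; the algorithm plays decision 1 both times.
S = [[1, 0], [0, 1]]
W = [[0.2, 0.8], [0.6, 0.4]]
print(average_regret(S, W, plays=[1, 1]))
```

In the bandit setting the algorithm only ever sees the single entry `costs[plays[t], t]` per period; the full matrix is used here only to score the run after the fact.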

  6. MAIN POINTS
     • Offline optimization $M: [0,1]^d \to S$
       • $M(w) = \arg\min_{s \in S} c(s, w)$, e.g. shortest path
       • Easier than sequential decision-making!?
     • EXPLORATION
       • Automatically find “exploration basis” using M
     • LOW REGRET
       • Dimension matters more than # decisions
     • EFFICIENCY
       • Online algorithm uses offline black-box opt. M

  7. MAIN RESULT [AK04, MB04, DH06]
     An algorithm that achieves: for any set $S$, any linear $c: S \times [0,1]^d \to [0,1]$, any $T \ge 1$, and any sequence $w_1, \ldots, w_T \in [0,1]^d$,
     E[regret of alg] $\le 15\, d\, T^{-1/3}$
     Each update requires linear time and calls offline optimizer $M$ with probability $O(d\, T^{-1/3})$

  8. EXPLORE vs EXPLOIT [AK04, MB04]
     Find good “exploration basis” using M
     On period $t = 1, 2, \ldots, T$:
     • Explore with probability $\varepsilon$,
       • Play $s_t :=$ a random element of exploration basis
       • Estimate $v_t$ somehow
     • Exploit with probability $1 - \varepsilon$,
       • Play $s_t := M(\sum_{i<t} v_i + p)$, where $p$ is a random perturbation [Hannan57]
       • $v_t := 0$
     Key property: $E[v_t] = w_t$
     E[calls to M] $= \varepsilon d T$.
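The slide leaves the estimate of $v_t$ open ("somehow"). One possible choice, shown here only as a sketch, is the importance-weighted estimator below. It assumes decisions are vectors with $c(s, w) = s \cdot w$, that the exploration basis is given as the rows of an invertible matrix `B`, and that the offline optimizer can be replaced by a brute-force minimum over a small finite `S`. The function names and the parameter `eps` (the exploration probability) are hypothetical.

```python
import numpy as np

def estimate_from_exploration(B, j, cost, eps):
    """Estimate v_t after exploring basis element j and paying `cost` = b_j . w_t.

    Since B w_t = (b_1 . w_t, ..., b_d . w_t), we have
    w_t = sum_j (b_j . w_t) * (column j of B^{-1}).
    Element j is explored with probability eps / d, so scaling by d / eps
    makes the estimate unbiased: E[v_t] = w_t.
    """
    d = B.shape[0]
    return (d / eps) * cost * np.linalg.inv(B)[:, j]

def explore_exploit_step(S, B, v_sum, p, w_t, eps, rng):
    """One period. w_t is hidden from the learner; it is only used to compute the cost paid."""
    d = B.shape[0]
    if rng.random() < eps:                       # explore
        j = int(rng.integers(d))
        s_t = B[j]
        cost = float(s_t @ w_t)
        v_t = estimate_from_exploration(B, j, cost, eps)
    else:                                        # exploit
        s_t = S[np.argmin(S @ (v_sum + p))]      # brute-force stand-in for M
        cost = float(s_t @ w_t)
        v_t = np.zeros(d)
    return s_t, cost, v_t

# Check the key property E[v_t] = w_t by averaging over the d exploration outcomes
# (each has probability eps / d; exploit rounds contribute v_t = 0).
B = np.array([[1.0, 0.0], [1.0, 1.0]])
w = np.array([0.3, 0.5])
eps = 0.2
expected_v = sum((eps / 2) * estimate_from_exploration(B, j, float(B[j] @ w), eps)
                 for j in range(2))
print(expected_v)
```

The estimates have magnitude on the order of `d / eps`, which is why the quality of the exploration basis matters: a badly conditioned `B` makes the columns of its inverse, and hence the variance, large.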

  9. REMAINDER OF TALK • EXPLORATION • Good “exploration basis” definition • Finding one • EXPLOITATION • Perturbation (randomized regularization) • Stability analysis • OTHER DIRECTIONS • Approximation algorithms • Convex problems

  10. EXPLORATION

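  11. GOING TO d-DIMENSIONS
      • Linear cost function $c: S \times [0,1]^d \to [0,1]$
      • Mapping $S \to [0,1]^d$: $\mathbf{s} = (c(s,(1,0,\ldots,0)),\ c(s,(0,1,\ldots,0)),\ \ldots,\ c(s,(0,\ldots,0,1)))$
      • $c(s, w) = \mathbf{s} \cdot w$
      $\mathbf{S} = \{ \mathbf{s} \mid s \in S \}$, $K = \text{convex-hull}(\mathbf{S})$, WLOG $\dim(\mathbf{S}) = d$
      [Figure: the convex hull $K$]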

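  12. EXPLORATION BASIS [AK04]
      Def: Exploration basis $b_1, b_2, \ldots, b_d \in S$ is a 2-barycentric spanner if, for every $s \in S$, $\mathbf{s} = \sum_i \lambda_i \mathbf{b}_i$ for some $\lambda_1, \lambda_2, \ldots, \lambda_d \in [-2, 2]$
      Possible to find an exploration basis efficiently using offline optimizer $M(w) = \arg\min_{s \in S} c(s, w)$
      $\mathbf{S} = \{ \mathbf{s} \mid s \in S \}$, $K = \text{convex-hull}(\mathbf{S})$, WLOG $\dim(\mathbf{S}) = d$
      [Figure: a bad and a good exploration basis inside $K$]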
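The definition can be checked numerically on a small instance. The sketch below uses the determinant-swapping procedure commonly associated with barycentric spanners, but with a brute-force scan over a finite list of vectors standing in for the offline optimizer M, so it illustrates the definition rather than the efficient method the slide refers to. It assumes the vectors span $\mathbb{R}^d$ (the slide's WLOG); `barycentric_spanner` is a hypothetical name.

```python
import numpy as np

def barycentric_spanner(S, C=2.0):
    """Return d rows of S forming a C-barycentric spanner (brute force over finite S)."""
    S = np.asarray(S, dtype=float)
    d = S.shape[1]
    X = np.eye(d)                       # columns hold the current candidate basis

    def best_swap(i):
        # Element of S that maximizes |det| when placed in column i of X.
        dets = []
        for x in S:
            Y = X.copy()
            Y[:, i] = x
            dets.append(abs(np.linalg.det(Y)))
        k = int(np.argmax(dets))
        return S[k], dets[k]

    # Phase 1: replace the identity columns one by one with elements of S.
    for i in range(d):
        x, _ = best_swap(i)
        X[:, i] = x

    # Phase 2: swap while some element grows |det| by more than a factor C.
    improved = True
    while improved:
        improved = False
        for i in range(d):
            x, new_det = best_swap(i)
            if new_det > C * abs(np.linalg.det(X)):
                X[:, i] = x
                improved = True
    return X.T                          # rows are b_1, ..., b_d

S = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.2], [0.9, 0.1]])
B = barycentric_spanner(S)
coeffs = np.linalg.solve(B.T, S.T)      # column k holds the lambdas for S[k]
print(np.abs(coeffs).max())
```

When the loop stops, no swap can grow the determinant by more than a factor of `C`, and by Cramer's rule this is exactly the statement that every coefficient lies in `[-C, C]`.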

  13. EXPLOITATION

  14. EXPLORE vs EXPLOIT [AK04, MB04]
      Find good “exploration basis” using M
      On period $t = 1, 2, \ldots, T$:
      • Explore with probability $\varepsilon$,
        • Play $s_t :=$ a random element of exploration basis
        • Estimate $v_t$ somehow
      • Exploit with probability $1 - \varepsilon$,
        • Play $s_t := M(\sum_{i<t} v_i + p)$, where $p$ is a random perturbation [Hannan57]
        • $v_t := 0$
      Key property: $E[v_t] = w_t$
      E[calls to M] $= \varepsilon d T$.

  15. INSTABILITY
      Define $z_t = M(\sum_{i \le t} w_i) = \arg\min_{s \in S} \sum_{i \le t} c(s, w_i)$
      Natural idea: use $z_{t-1}$ on period $t$?
      REGRET = 1!
      [Figure: example cost sequence ½ 0 0 1 1 0 0 1 1 0]
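The instability is easy to reproduce. The sketch below assumes two decisions encoded as unit vectors (so $c(s, w)$ is just one coordinate of $w$) and an alternating weight sequence that is one plausible reading of the numbers on the slide: $(½, 0), (0, 1), (1, 0), (0, 1), \ldots$ The function name is hypothetical, and ties are broken toward the first decision.

```python
import numpy as np

def follow_the_leader_costs(S, W):
    """Cost paid each period when playing z_{t-1}, the best decision on past weights."""
    S = np.asarray(S, dtype=float)
    W = np.asarray(W, dtype=float)
    total = np.zeros(W.shape[1])              # w_1 + ... + w_{t-1}
    paid = []
    for w in W:
        s = S[np.argmin(S @ total)]           # leader so far (first index on ties)
        paid.append(float(s @ w))
        total += w
    return np.array(paid)

S = np.eye(2)
W = np.array([[0.5, 0.0]] + [[0.0, 1.0], [1.0, 0.0]] * 4 + [[0.0, 1.0]])
paid = follow_the_leader_costs(S, W)
best_fixed = (S @ W.sum(axis=0)).min() / len(W)
print(paid)                                   # cost paid in each period
print(paid.mean() - best_fixed)               # average regret on this sequence
```

On this sequence the leader flips every period, and the new leader is always the decision that is about to be charged, so follow-the-leader pays the maximum cost in every period after the first.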

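  16. STABILITY ANALYSIS [KV03]
      Define $z_t = M(\sum_{i \le t} w_i) = \arg\min_{s \in S} \sum_{i \le t} c(s, w_i)$
      Lemma: Regret of using $z_t$ on period $t$ is $\le 0$
      Proof:
      $\min_{s \in S} [c(s,w_1) + c(s,w_2) + \ldots + c(s,w_T)]$
      $= c(z_T,w_1) + \ldots + c(z_T,w_{T-1}) + c(z_T,w_T)$
      $\ge c(z_{T-1},w_1) + \ldots + c(z_{T-1},w_{T-1}) + c(z_T,w_T)$
      $\ge \ldots \ge c(z_1,w_1) + c(z_2,w_2) + \ldots + c(z_T,w_T)$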
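The lemma can be sanity-checked numerically. The sketch below assumes decisions are vectors with $c(s, w) = s \cdot w$ and a finite decision set, and compares the total cost of playing $z_t$ on period $t$ (which needs knowledge of $w_t$, so it is not a usable algorithm) with the cost of the best fixed decision. The name `be_the_leader_gap` is hypothetical.

```python
import numpy as np

def be_the_leader_gap(S, W):
    """Total cost of playing z_t on period t, minus the cost of the best fixed decision."""
    S = np.asarray(S, dtype=float)
    W = np.asarray(W, dtype=float)
    cum = np.cumsum(W, axis=0)                    # row t holds w_1 + ... + w_{t+1} (0-based t)
    leaders = np.argmin(S @ cum.T, axis=0)        # leaders[t] indexes the leader after that period
    btl_cost = sum(float(S[k] @ w) for k, w in zip(leaders, W))
    best_fixed = float((S @ cum[-1]).min())
    return btl_cost - best_fixed

rng = np.random.default_rng(1)
S = rng.random((6, 3)) / 3                        # scaled so that costs stay in [0, 1]
W = rng.random((50, 3))
print(be_the_leader_gap(S, W))                    # the lemma says this is never positive
```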

  17. STABILITY ANALYSIS [KV03]
      Define $z_t = M(\sum_{i \le t} w_i) = \arg\min_{s \in S} \sum_{i \le t} c(s, w_i)$
      Lemma: Regret of using $z_t$ on period $t$ is $\le 0$
      $\Rightarrow$ Regret of $z_{t-1}$ on $t$ $\le \sum_{t \le T} [c(z_{t-1}, w_t) - c(z_t, w_t)]$
      Idea: regularize to achieve stability
      Let $y_t = M(\sum_{i \le t} w_i + p)$, for random $p \in [0,1]^d$.
      E[Regret of $y_{t-1}$ on $t$] $\le \sum_{t \le T} E[c(y_{t-1}, w_t) - c(y_t, w_t)] + \ldots$
      Strange: randomized regularization!
      $y_t$ can be computed using $M$
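A minimal sketch of the perturbed leader $y_{t-1}$ follows, again assuming vector-encoded decisions with $c(s, w) = s \cdot w$ and a brute-force minimum in place of M. The perturbation is drawn once, uniformly from $[0, \text{scale}]^d$; the `scale` parameter is a hypothetical addition for experimentation (the slide uses $[0,1]^d$, i.e. `scale=1`), as is the function name.

```python
import numpy as np

def follow_perturbed_leader(S, W, rng, scale=1.0):
    """Play y_{t-1} = argmin_s s . (w_1 + ... + w_{t-1} + p) on period t."""
    S = np.asarray(S, dtype=float)
    W = np.asarray(W, dtype=float)
    d = W.shape[1]
    p = scale * rng.random(d)                 # random perturbation, drawn once
    total = np.zeros(d)
    paid = []
    for w in W:
        s = S[np.argmin(S @ (total + p))]     # same optimizer call as before, shifted input
        paid.append(float(s @ w))
        total += w
    return np.array(paid), p

S = np.eye(2)
W = np.array([[0.5, 0.0]] + [[0.0, 1.0], [1.0, 0.0]] * 4 + [[0.0, 1.0]])
paid, p = follow_perturbed_leader(S, W, np.random.default_rng(0))
print(p, paid.mean())
```

A larger `scale` makes the played decision change less often, at the price of a larger bias from the perturbation; with a very large `scale` the perturbation alone decides and the same decision is played every period.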

  18. OTHER DIRECTIONS

  19. BANDIT CONVEX OPT.
      • Convex feasible set $S \subseteq \mathbb{R}^d$
      • Unknown sequence of concave functions $f_1, \ldots, f_T: S \to [0,1]$
      • On period $t = 1, 2, \ldots, T$:
        • Algorithm chooses $x_t \in S$
        • Algorithm pays and finds out $f_t(x_t)$
      • Thm. $\forall$ concave $f_1, f_2, \ldots: S \to [0,1]$, $\forall\, T_0, T \ge 1$, bacterial ascent algorithm achieves: …

  20. MOTIVATING EXAMPLE
      • Company has to decide how much to advertise among d channels, within budget.
      • Feedback is total profit, affected by external factors.
      [Figure: profit curves $f_1, \ldots, f_4$ with plays $x_1, \ldots, x_4$ and optimum $x^*$; horizontal axis $ADVERTISING, vertical axis $PROFIT]

  21. BACTERIAL ASCENT [Figure: EXPLORE and EXPLOIT steps; points x0, x1 in S]

  22. BACTERIAL ASCENT [Figure: EXPLORE and EXPLOIT steps; points x0, x1, x2 in S]

  23. BACTERIAL ASCENT [Figure: EXPLORE and EXPLOIT steps; points x0, x1, x2, x3 in S]

  24. APPROXIMATION ALG’s
      • What if offline optimization is NP-hard?
        • Example: repeated traveling salesman problem
      • Suppose you have $\alpha$-approximation algorithm $A$, $c(A(w), w) \le \alpha \min_{s \in S} c(s, w)$ for all $w \in [0,1]^d$
      • Would like to achieve low $\alpha$-regret = our cost – $\alpha$ (min cost of best $s \in S$)
      • Possible using convex optimization approach above and transformations of approximation algorithms [KKL07]

  25. CONCLUSIONS
      • Can extend bandit algorithms to structured problems
        • Guarantee worst-case low regret
        • Linear combinatorial optimization problems
        • Convex optimization
      • Remarks
        • Works against adaptive adversaries as well
        • Online efficiency = offline efficiency
        • Can handle approximation algorithms
        • Can achieve cost $\le (1+\varepsilon)$ min cost $+ O(1/\varepsilon)$