On Routing without Regret. Avrim Blum, CMU. Portions of this talk are joint work with Eyal Even-Dar, Katrina Ligett, Yishay Mansour, and Brendan McMahan.
Plan for this talk: • History and background on “no-regret” algorithms. • Some recent results / new directions, especially motivated by routing problems. • What can we say about global behavior if everyone is optimizing in this way, in the Wardrop traffic model?
Consider the following setting… • Each morning, you need to pick one of N possible routes to get to work. • But traffic is different each day, and it is not clear a priori which route will be best. • When you get there you find out how long your route took. (And maybe the others’ times too, or maybe not.) • Is there a strategy for picking routes so that in the long run, whatever the sequence of traffic patterns has been, you’ve done not much worse than the best fixed route in hindsight? (In expectation, over the internal randomness of the algorithm.) • Yes.
“No-regret” algorithms for repeated games Adversary – world - life Algorithm The setup: • Repeated play of matrix game with N rows. (Algorithm is row-player, rows represent different possible actions). • At each time step, algorithm picks row, life picks column. • Alg pays cost for action chosen. • Alg gets column as feedback (or just its own cost in the “bandit” model). • All entries scaled to be losses/costs between 0 & 1.
“No-regret” algorithms for repeated games • At each time step, algorithm picks row, life picks column. • Alg pays cost for action chosen. • Alg gets column as feedback (or just its own cost in the “bandit” model). • All entries scaled to be losses/costs between 0 & 1. Define total regret in T time steps as difference between (expected) total cost incurred and cost of best fixed row in hindsight. Average regret is (total regret)/T, which we want to go to 0 or better [= “no-regret” algorithm]. AKA “combining expert advice in the decision-theoretic setting”.
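To make the regret definition above concrete, here is a minimal sketch (the function name and toy data are illustrative, not from the talk): regret in T steps is the algorithm's total cost minus the cost of the best fixed row in hindsight, and average regret is that quantity divided by T.

```python
def average_regret(alg_costs, cost_matrix):
    """alg_costs[t] = cost the algorithm paid at step t.
    cost_matrix[t][i] = cost action (row) i would have paid at step t.
    All costs are assumed scaled into [0, 1]."""
    T = len(alg_costs)
    N = len(cost_matrix[0])
    # cost of the best fixed row in hindsight
    best_fixed = min(sum(cost_matrix[t][i] for t in range(T)) for i in range(N))
    total_regret = sum(alg_costs) - best_fixed
    return total_regret / T

# Toy example: 3 steps, 2 actions; the algorithm always plays action 0 (pays 1.5),
# while the best fixed action in hindsight is action 1 (pays 1.2).
costs = [[0.5, 0.2], [0.5, 0.9], [0.5, 0.1]]
alg = [row[0] for row in costs]
print(average_regret(alg, costs))
```

A “no-regret” algorithm is one for which this average goes to 0 as T grows, against any sequence of columns.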
Some intuition & properties of no-regret algs • Time-average performance is guaranteed to approach the minimax value V of the game (or better, if life isn’t adversarial). • In fact, the existence of no-regret algs yields a proof of the minimax theorem. • If one can implement an adversary (separation) oracle, can use it to get approximately minimax-optimal strategies. Also, two NR algorithms played against each other will have their empirical distribution approach minimax optimal. • Algorithms must be randomized, or else it’s hopeless.
History and development (abridged) • [Hannan’57]: algorithm with total regret O((TN)^{1/2}). • Can see T^{1/2} is necessary from a coin-flipping example. • Re-phrasing: need only T = O(N/ε²) steps to get time-average regret down to ε. (Will call this quantity T_ε.) • Game theorists viewed N as fixed, constant, and not so important as T, so this was pretty much done.
History and development (abridged) • [Hannan’57]: algorithm with total regret O((TN)^{1/2}). • T_ε = O(N/ε²) steps to get average regret down to ε. • Learning theory, ’80s–’90s: • Q: given a space of hypotheses (like conjunctions over n boolean features), can you do online prediction in a way that does nearly as well as the best of them in hindsight (ignoring computational issues: here N = 2^n)? • [LittlestoneWarmuth’89]: weighted-majority algorithm • E[#mistakes] ≤ OPT·(1+ε) + ε⁻¹ log N, • or OPT + O((T log N)^{1/2}) if ε is set based on T, • or T_ε = O((log N)/ε²). • For intuition, think of the case where OPT = 0. • Can replace “log N” with the number of bits to describe the optimum. • [FreundSchapire] realized this could apply to the game setup.
Weighted-majority & variants • Initialize all weights of all actions to 1. Pick action i with probability p_i = w_i/W, where W = Σ_i w_i. • Given cost vector c = (c_1, c_2, …, c_N), update w_i ← w_i·(1-ε)^{c_i}. • Won’t give the proof, since we will analyze a more general setting & algorithm instead.
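A minimal sketch of the update rule just stated, assuming costs in [0,1] and a fixed ε (the function name and toy cost stream are illustrative):

```python
import random

def weighted_majority(cost_stream, N, eps=0.1, rng=random.Random(0)):
    """Randomized weighted majority: play i with prob p_i = w_i / W,
    then update w_i <- w_i * (1 - eps)^{c_i}.  `cost_stream` yields
    cost vectors c = (c_1, ..., c_N) with entries in [0, 1].
    Returns the algorithm's total cost."""
    w = [1.0] * N
    total_cost = 0.0
    for c in cost_stream:
        # sample an action proportionally to the current weights
        i = rng.choices(range(N), weights=w)[0]
        total_cost += c[i]
        # multiplicative update on every action's weight
        w = [wi * (1 - eps) ** ci for wi, ci in zip(w, c)]
    return total_cost

# Toy run: action 0 is always cheaper, so weight concentrates on it
# and the total cost stays near the best fixed action's cost.
stream = [[0.1, 0.9] for _ in range(200)]
print(weighted_majority(stream, N=2))
```

The update is per-coordinate, so the algorithm learns from the full cost vector (the “full information” feedback of the setup above), not just its own cost.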
A generalization before continuing our story… • A natural generalization of our regret goal: what if we also want that on rainy days, we do nearly as well as the best route for rainy days? • And on Mondays, do nearly as well as the best route for Mondays. • More generally, have N “rules” (“on Mondays, use path P”). Goal: simultaneously, for each rule i, guarantee to do nearly as well as it on the time steps in which it fires. • For all i, want E[cost_i(alg)] ≤ (1+ε)·cost_i(i) + O(ε⁻¹ log N). This is the “specialists” or “sleeping experts” problem. Studied theoretically in [B95][FSSW97][BM05]; in practice [CS’96][CS’99] for document classification.
A generalization before continuing our story… • Simple alg (joint with Yishay Mansour): • Define “relaxed regret” with respect to rule i as: R_i = E[cost_i(alg)]/(1+ε) – cost_i(i). (Want R_i ≤ ε⁻¹ log N.) • Give rule i weight w_i = (1+ε)^{R_i}. Pick with prob p_i = w_i/W. • Initially, all weights are 1 and sum to N. • Prove the sum of weights never increases. • Conclude R_i ≤ log_{1+ε} N ≈ ε⁻¹ log N. • Can extend to rules that can be fractionally on too.
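A minimal sketch of this sleeping-experts algorithm, assuming costs in [0,1]; only the rules that fire (“awake” rules) participate in a step, and a sleeping rule’s relaxed regret is unchanged. The driver data below is illustrative, not from the talk.

```python
def sleeping_experts(rounds, N, eps=0.1):
    """R_i tracks the relaxed regret  E[cost_i(alg)]/(1+eps) - cost_i(i),
    and rule i gets weight w_i = (1+eps)^{R_i}.  `rounds` yields pairs
    (awake, cost): the list of rules firing this step, and the cost each
    awake rule would incur.  Returns the final relaxed regrets R_i."""
    R = [0.0] * N
    for awake, cost in rounds:
        w = {i: (1 + eps) ** R[i] for i in awake}
        W = sum(w.values())
        p = {i: w[i] / W for i in awake}
        # expected cost of the algorithm on this step
        exp_cost = sum(p[i] * cost[i] for i in awake)
        for i in awake:           # sleeping rules are not updated
            R[i] += exp_cost / (1 + eps) - cost[i]
    return R

# Two rules; rule 0 fires every step, rule 1 only on even steps (and is cheaper).
rounds = [(([0, 1] if t % 2 == 0 else [0]), {0: 0.5, 1: 0.2}) for t in range(100)]
print(sleeping_experts(rounds, N=2))
```

The claimed invariant (sum of weights never increases) is what caps each R_i at log_{1+ε} N ≈ ε⁻¹ log N.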
History and development, contd… • [Hannan’57]: T_ε = O(N/ε²). • Weighted-majority: T_ε = O((log N)/ε²). • So, conceivably can do well even when N is exponential in the natural problem size (like in online routing), if only we could implement it efficiently. • Learning theory, ’90s–’00s: series of results giving efficient implementations/alternatives in various settings: • [HelmboldSchapire97]: best pruning of a given decision tree. • [BChawlaKalai02]: (1+ε) static-optimal for list-update. • [TakimotoWarmuth02]: online shortest paths. • [KalaiVempala03]: elegant setting generalizing all of the above. • [Zinkevich03]: online convex programming. • [AwerbuchKleinberg04][McMahanB04]: [KV] in the bandit model. • [Kl,FlKaMc05]: bandit version of [Z03].
Kalai-Vempala setting • Set S of feasible points in R^m, of bounded diameter. (Think of them as indicator vectors for possible paths.) • For t = 1 to T: • Alg picks x_t ∈ S, adversary picks cost vector c_t. • Alg pays x_t · c_t. • Goal is to compete with the best fixed x in hindsight: the x ∈ S that minimizes x·(c_1 + c_2 + … + c_T). • Don’t store S explicitly. Instead, assume we have an oracle for the offline problem: given c, find the best x ∈ S. • E.g., S is convex, or S is the set of paths from v_s to v_t. • Goal is to use this oracle to solve the online problem.
Kalai-Vempala algorithm • Assume we have an oracle for the offline problem: given c, find the best x ∈ S. • Algorithm is very simple: • Just pick x_t ∈ S that minimizes x·(c_0 + c_1 + … + c_{t-1}), • where c_0 is picked from an appropriate distribution. • In fact, very similar to Hannan’s original alg. • Form of bounds: • T_ε = O(diam(S) · (L_1 bound on c’s) · log(m) / ε²). • For online shortest paths, T_ε = O(nm·log(n)/ε²).
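A toy sketch of this follow-the-perturbed-leader idea, with a brute-force minimum over a tiny explicit S standing in for the offline oracle, and a fresh uniform perturbation each step (one standard instantiation; the perturbation scale η and the toy paths are illustrative assumptions):

```python
import random

def follow_perturbed_leader(cost_vectors, paths, eta=1.0, rng=random.Random(0)):
    """At step t, pick the x in S minimizing x . (c_0 + c_1 + ... + c_{t-1}),
    where c_0 is a fresh random perturbation.  `paths` is the feasible set S
    as 0/1 indicator vectors over the m edges; the offline oracle here is
    just brute-force min over S.  Returns the algorithm's total cost."""
    m = len(cost_vectors[0])
    cum = [0.0] * m                                  # c_1 + ... + c_{t-1}
    total = 0.0
    for c in cost_vectors:
        c0 = [rng.uniform(0, eta) for _ in range(m)]  # perturbation c_0
        x = min(paths, key=lambda p: sum(pi * (ci + c0i)
                                         for pi, ci, c0i in zip(p, cum, c0)))
        total += sum(xi * ci for xi, ci in zip(x, c))
        cum = [a + b for a, b in zip(cum, c)]
    return total

# Two "paths" over 3 edges; the first is cheaper (0.2/step vs 1.0/step),
# and the perturbed leader locks onto it after a couple of steps.
paths = [(1, 1, 0), (0, 1, 1)]
print(follow_perturbed_leader([(0.1, 0.1, 0.9)] * 50, paths))
```

Replacing the brute-force min with a real shortest-path oracle is exactly what makes this efficient when S is exponentially large.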
Analysis sketch [KV] Two algorithms walk into a bar… • Alg A picks x_t minimizing x_t·C_{t-1}, where C_{t-1} = c_1+…+c_{t-1}. • Alg B picks x_t minimizing x_t·C_t, where C_t = c_1+…+c_t. (B has fairy godparents who add c_t into the history.) Step 1: prove B is at least as good as OPT: Σ_t (B’s x_t)·c_t ≤ min_{x∈S} x·(c_1 + … + c_T). Uses a cute telescoping argument. Now, A & B start drinking and their objectives get fuzzier… Step 2: at the appropriate point, prove A & B are similar and yet B has not been hurt too much.
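For completeness, here is the telescoping (“be-the-leader”) argument behind Step 1 written out, in the slide’s notation (C_t = c_1 + … + c_t, and x_t^B is B’s pick, minimizing x·C_t over S):

```latex
% Claim (Step 1): \sum_t x_t^B \cdot c_t \le \min_{x \in S} x \cdot C_T.
% Induction on T.  Base case T = 1:  x_1^B \cdot c_1 = \min_{x} x \cdot C_1.
\begin{align*}
  \sum_{t=1}^{T} x_t^B \cdot c_t
    &\le x_{T-1}^B \cdot C_{T-1} + x_T^B \cdot c_T
        && \text{(induction hypothesis)} \\
    &\le x_T^B \cdot C_{T-1} + x_T^B \cdot c_T
        && \text{($x_{T-1}^B$ minimizes $x \cdot C_{T-1}$ over $S$)} \\
    &= x_T^B \cdot C_T \;=\; \min_{x \in S} x \cdot (c_1 + \cdots + c_T).
\end{align*}
```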
Applications Efficient online algorithm to perform nearly as well as: • Best fixed path in hindsight (in routing) • Best fixed search tree in hindsight (in data-structures) • … Potential use for algorithmic problems that can be viewed as finding an apx-optimal solution for an exponential-size matrix game, if can fit both players into this framework.
Can combine KV with sleeping experts too • Say you are given N “conditions” or “features” to pay attention to (is it raining? is it a Monday? …). • Each day satisfies some conditions and not others. • What can we do? • For each condition i, run a copy of KV on just the days satisfying that condition. • Then view these N algorithms as “sleeping experts” and feed their suggestions as inputs into [BM]. • For each condition i, on the days satisfying that condition we do nearly as well as the best x ∈ S for those days.
Bandit setting [AK][MB] • What if the alg is only told the cost x_t·c_t, and not c_t itself? • E.g., you only find out the cost of your own path, not of all edges in the network. • Can you still perform comparably to the best path in hindsight (which you don’t even know!)? • Ans: yes, though the bounds are worse. • Basic idea is fairly straightforward: • All we need is an estimate of C_{t-1} = c_1 + … + c_{t-1}. • So, pick a basis B and occasionally sample a random x ∈ B. • Use dot-products with the basis vectors to reconstruct an estimate of C_{t-1}. (Helps for B to be as orthogonal as possible.) • Even if the adversary is adaptive, it still can’t bias your estimate too much.
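A toy sketch of the sampling idea, in the simplest case where B is the standard basis, so a dot-product with e_i reveals the single coordinate c_t[i]; dividing by the sampling probability keeps the estimate unbiased. (The exploration rate γ and the toy data are illustrative; the real algorithms interleave this with playing the perturbed leader on the estimated costs.)

```python
import random

def estimate_costs(cost_vectors, gamma=0.2, rng=random.Random(0)):
    """With probability gamma, 'explore' by playing a random standard basis
    vector e_i and observe the scalar e_i . c_t = c_t[i]; divide by the
    probability of that observation to keep the estimate unbiased.
    Returns the estimated cumulative cost vector C_T = c_1 + ... + c_T."""
    m = len(cost_vectors[0])
    est = [0.0] * m
    for c in cost_vectors:
        if rng.random() < gamma:
            i = rng.randrange(m)              # pick a random basis vector e_i
            est[i] += c[i] / (gamma / m)      # importance-weighted update
        # (on non-exploration steps this sketch learns nothing about c)
    return est

# Constant costs (0.2, 0.8, 0.5) for 300 steps: true cumulative is
# (60, 240, 150), and the estimate matches it in expectation.
print(estimate_costs([(0.2, 0.8, 0.5)] * 300))
```

The estimate is noisy (hence the worse bounds), but its expectation is exactly the true cumulative cost vector, which is all the [KV]-style analysis needs.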
What if everyone started using NR algs? • What if the changing cost function is due to other players in the system optimizing for themselves? • No-regret can be viewed as a nice definition of reasonable self-interested behavior. • What happens to the overall system if everyone uses one? • In zero-sum games, behavior quickly approaches minimax optimal. • In general-sum games, does behavior quickly (or at all) approach a Nash equilibrium? (After all, a Nash equilibrium is exactly a set of distributions that are no-regret with respect to each other.) • Well, unfortunately, no.
A bad example for general-sum games • Augmented Shapley game from [Z04]: • First 3 rows/cols are the Shapley game (rock/paper/scissors, but if both players do the same action then both lose). • The 4th action, “play foosball”, has a slight negative payoff if the other player is still doing r/p/s, but a positive payoff if the other player plays the 4th action too. • NR algs will cycle among the first 3 actions and have no regret, but do worse than the only Nash equilibrium, which is both playing foosball. • But how about routing, since it has more structure?
Consider the Wardrop / Roughgarden-Tardos traffic model • Given a graph G. Each edge e has a non-decreasing cost function c_e(f_e) giving the latency of that edge as a function of the amount of traffic using it. • Say 1 unit of traffic (infinitesimal users) wants to travel from v_s to v_t. E.g., simple case: two parallel edges from v_s to v_t, with c_e(f) = f and c_e(f) = 2f; the Nash flow is (2/3, 1/3). • Nash equilibrium is a flow f* such that all paths with positive flow have the same cost, and no path is cheaper. • Useful notions: • Cost(f) = Σ_e c_e(f_e)·f_e = cost of the average user under f. • Cost_f(P) = Σ_{e∈P} c_e(f_e) = cost of using path P given f. • So, Cost(f*) = min_P Cost_{f*}(P). • What happens if people use no-regret algorithms?
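The (2/3, 1/3) equilibrium of the two-edge example can be checked directly from the definition: both used paths must have equal cost, so f_top = 2·f_bottom with f_top + f_bottom = 1. A small sketch (function name illustrative):

```python
def wardrop_two_links(total=1.0):
    """Wardrop equilibrium for two parallel edges with c_top(f) = f and
    c_bottom(f) = 2f, carrying `total` units of traffic.  Both used paths
    must have equal latency: f_top = 2 f_bottom, f_top + f_bottom = total."""
    f_top = 2 * total / 3
    f_bottom = total / 3
    assert abs(f_top * 1 - f_bottom * 2) < 1e-12   # equal path costs
    return f_top, f_bottom

print(wardrop_two_links())   # (2/3, 1/3); both paths have latency 2/3
```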
Global behavior of NR algs [B-EvenDar-Ligett] • On day t, have flow f_t. • Average regret ≤ ε by some time T_ε. • So, avg_t[Cost(f_t)] ≤ ε + min_P avg_t[Cost_{f_t}(P)]. • What we’d like to say is that the time-average flow f_avg is ε-Nash: Cost(f_avg) ≤ ε + min_P Cost_{f_avg}(P). • Or even better, that most f_t are ε-Nash: Cost(f_t) ≤ ε + min_P Cost_{f_t}(P). • But there are problems if the cost functions are too sharp.
But can show if bounded slope… Proof sketch: • For any edge e, time-avg cost ≤ flow-avg cost. So, f_e^{avg} · avg_t[c_e(f_t)] ≤ avg_t[c_e(f_t) · f_e^t]. • Summing over all edges, and applying the regret bound: avg_t[Cost_{f_t}(f_avg)] ≤ avg_t[Cost(f_t)] ≤ ε + min_P avg_t[Cost_{f_t}(P)], which in turn is ≤ ε + avg_t[Cost_{f_t}(f_avg)]. • This means that actually, for each edge, the time-avg cost must be pretty close to the flow-avg cost, which (by the assumption of bounded slope) means the costs can’t vary too much over time. • This then lets you swap quantifiers (cost/avg) to get: Cost(f_avg) ≤ ε′ + min_P Cost_{f_avg}(P), where ε′ = O((ε · max-slope · n)^{1/2}). Can also get bounds for “most” f_t too.
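A toy numerical illustration of the convergence claim (this is a simulation, not the [BEL] proof): model the whole infinitesimal population as a single multiplicative-weights distribution over the two paths of the earlier example (c_top(f) = f, c_bottom(f) = 2f), with today's flow equal to the current distribution. The step size ε and horizon are illustrative choices.

```python
def no_regret_traffic(T=2000, eps=0.02):
    """Population-as-one-learner sketch: the flow f_t is the current
    weight distribution over the two paths; each path's weight is then
    penalized multiplicatively by its current latency."""
    w = [1.0, 1.0]
    for _ in range(T):
        W = w[0] + w[1]
        f = (w[0] / W, w[1] / W)         # today's flow (f_top, f_bottom)
        cost = (f[0], 2 * f[1])          # path latencies under f
        w = [wi * (1 - eps) ** c for wi, c in zip(w, cost)]
    W = w[0] + w[1]
    return (w[0] / W, w[1] / W)

print(no_regret_traffic())   # approaches the Wardrop flow (2/3, 1/3)
```

The fixed point of this dynamic is exactly the flow where both paths have equal latency, matching the equilibrium of the model.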
Summary / Open problems • Regret-minimizing algorithms, especially motivated by online routing-type problems. • Can perform comparably to the best fixed path in hindsight, even with very limited information. • The no-regret property is sufficient to converge to Nash in the Wardrop model. Open problems: • Algorithmic use of [KV] for fast approximate minimax. • [KV] with approximate oracles; internal regret. • Nash convergence bounds are pretty loose, especially for “most f_t close to ε-Nash”.