AWESOME: A General Multiagent Learning Algorithm that Converges in Self-Play and Learns a Best Response Against Stationary Opponents Vincent Conitzer and Tuomas Sandholm Computer Science Department Carnegie Mellon University
Learning in games
• Two aspects of learning in games:
  • Learning the game (or aspects of the game) itself
  • Learning how the opponent is behaving
• Many previous algorithms have interleaved these two aspects
• This paper focuses solely on learning with respect to the opponent
  • It assumes that the game is known
  • It assumes that an equilibrium can be computed
The setting
• There are N players, each with their own possible actions
• There is a known stage game (matrix game) which the players play repeatedly
  • Mapping from action vectors to payoff vectors
• Each round, the players decide on a distribution over their actions to play from (a mixed strategy)
• The players have a long-term learning strategy
  • Special case: a stationary strategy (play from the same distribution every time)
• [Figure: a 2-player, 3-action stage game]
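As a minimal sketch of this setting (the payoff matrices below are illustrative placeholders, not the game shown on the slide), one round of the repeated stage game might look like this:

```python
import numpy as np

# Placeholder 2-player, 3-action stage game (not the matrix from the slide).
payoff_row = np.array([[3, 0, 1],
                       [0, 3, 1],
                       [1, 1, 2]])   # row player's payoffs
payoff_col = payoff_row.T            # column player's payoffs (arbitrary illustrative choice)

def play_round(mixed_row, mixed_col, rng):
    """Each player samples an action from their mixed strategy; payoffs follow."""
    a_row = rng.choice(3, p=mixed_row)
    a_col = rng.choice(3, p=mixed_col)
    return payoff_row[a_row, a_col], payoff_col[a_row, a_col]

rng = np.random.default_rng(0)
# A stationary opponent plays from the same distribution every round.
print(play_round([0.5, 0.5, 0.0], [0.49, 0.51, 0.0], rng))
```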
How should a stage game be played?
• Nash equilibrium:
  • Every agent has a mixed strategy (distribution over actions)
  • Each agent's mixed strategy is a best response to the other's
• Makes sense for infinitely rational agents
• But: against a (less clever) opponent with a fixed mixed strategy, we could do better
• [Figure: in the example game, the unique Nash equilibrium has both players mixing (50%, 50%, 0%); against a suboptimal opponent playing (49%, 51%, 0%), the best response is the pure strategy (100%, 0%, 0%)]
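Computing such a best response is simple once the opponent's stationary mixed strategy is known. A minimal sketch, with a placeholder payoff matrix (not the game from the slide):

```python
import numpy as np

# Rows = our actions, columns = opponent actions; values are our payoffs.
# Illustrative placeholder matrix, chosen only to make the example concrete.
payoff = np.array([[-1.0,  1.0, 0.0],
                   [ 1.0, -1.0, 0.0],
                   [ 0.0,  0.0, 0.5]])

def best_response(opponent_mixed):
    expected = payoff @ np.asarray(opponent_mixed)  # expected payoff of each of our actions
    return int(np.argmax(expected))                 # a pure strategy suffices

print(best_response([0.49, 0.51, 0.0]))  # exploits a slightly-off-equilibrium opponent
```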
Objective: two properties
• Our algorithm is designed to achieve two properties:
  • Against opponents that (eventually) play from a stationary distribution, eventually play the best response
    • Maximum exploitation against a fixed strategy
    • Compare with opponent modeling
  • In self-play (playing against other agents using the same algorithm), the agents should converge to a Nash equilibrium
Closely related prior results
• Regret matching [Hart, Mas-Colell 2000]:
  • Regrets go to zero (which implies eventual best-responding to fixed strategies)
  • But: in self-play, convergence to correlated equilibrium only
    • Correlated equilibrium is a relaxed version of Nash equilibrium
• WoLF-IGA [Bowling, Veloso 2002]: gets both properties given that:
  • The game is two-player, two-action
  • Players can observe each other's mixed strategies (not just the played actions)
  • Players can use infinitesimally small step sizes
• AWESOME achieves both properties without any of these assumptions
Introducing AWESOME
• AWESOME stands for Adapt When Everybody is Stationary, Otherwise Move to Equilibrium
• The basic idea:
  • Detect if the other players are playing stationary strategies
  • If so, try to play the best response
  • Otherwise, restart completely and go back to the equilibrium strategy
AWESOME's null hypotheses
• AWESOME starts with a null hypothesis that everyone is playing the (precomputed) equilibrium
• If this is rejected, AWESOME switches to another null hypothesis that all players are playing stationary strategies
• If this is rejected, AWESOME restarts completely
• The current hypothesis is evaluated every epoch
  • Epoch = certain number of rounds
  • Reject the equilibrium hypothesis if the actual distribution of actions is too far from the equilibrium
  • Reject the stationarity hypothesis if the actual distribution changes too much between epochs
  • We will discuss how to reject hypotheses later
What does AWESOME play?
• While the equilibrium hypothesis is maintained, AWESOME plays its own equilibrium strategy
  • The point of the equilibrium hypothesis is to keep AWESOME from drifting away from the (possibly mixed-strategy) equilibrium by starting to play (pure-strategy) best responses
• When the equilibrium hypothesis is rejected, AWESOME picks a random action to play
  • Then, if another action appears to be (significantly) better against what the others were playing in the last epoch, AWESOME switches to that action
  • The significance requirement is needed to prevent AWESOME from jumping around between (nearly) equivalent actions, which could cause restarts
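The control flow described so far can be summarized in a short skeleton. This is a hedged schematic sketch, not the paper's pseudocode: the helper names (play_epoch, equilibrium_test, stationarity_test, pick_better_action), the schedule interface, and the way strategies are represented are all placeholder assumptions supplied by the caller.

```python
import random

def awesome(equilibrium_strategy, play_epoch, schedule, n_actions,
            equilibrium_test, stationarity_test, pick_better_action):
    """Schematic sketch of AWESOME's control flow (not the paper's exact pseudocode).
    Assumed interfaces:
      schedule(epoch) -> (rounds, eps_eq, eps_stat), a valid schedule
      play_epoch(strategy, rounds) -> observed opponent action distributions
      equilibrium_test / stationarity_test -> True if the hypothesis is rejected
    """
    while True:                                    # outer loop: one run per complete restart
        hypothesis = "equilibrium"                 # null hypothesis 1
        strategy = equilibrium_strategy            # play the precomputed equilibrium
        epoch = 0
        restart = False
        while not restart:
            rounds, eps_eq, eps_stat = schedule(epoch)
            observed = play_epoch(strategy, rounds)
            if hypothesis == "equilibrium":
                if equilibrium_test(observed, eps_eq):      # others too far from equilibrium
                    hypothesis = "stationary"               # switch to null hypothesis 2
                    strategy = random.randrange(n_actions)  # start from a random pure action
            else:
                if stationarity_test(observed, eps_stat):   # others' play changed too much
                    restart = True                          # restart completely
                else:
                    strategy = pick_better_action(observed, strategy)  # only if significantly better
            epoch += 1
```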
A naïve approach
• Say we were to apply the same test of the hypothesis every epoch (see the sketch below):
  • Same number of rounds every epoch
  • If the observed distribution of actions deviates more than epsilon from the hypothesized distribution, reject it
• [Figure: per epoch, the distribution of the fraction of times action 1 was played, the hypothesized distribution, the bounds for accepting the hypothesis, and the probability of acceptance (given the hypothesis is true)]
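A minimal sketch of this naïve test, with illustrative parameters (not from the paper): with a fixed epoch length and a fixed epsilon, the per-epoch probability of rejecting a true hypothesis stays roughly constant, so a rejection eventually happens with probability 1.

```python
import numpy as np

def naive_reject(actions, hypothesized, eps, n_actions):
    """Reject if any action's observed frequency is more than eps from the hypothesized probability."""
    empirical = np.bincount(actions, minlength=n_actions) / len(actions)
    return np.max(np.abs(empirical - hypothesized)) > eps

rng = np.random.default_rng(1)
hypothesized = np.array([0.5, 0.5, 0.0])  # the hypothesized (e.g. equilibrium) distribution
rounds, eps = 100, 0.05                   # fixed epoch length and fixed epsilon

# Even when the hypothesis is true, each epoch rejects with some constant probability.
rejections = [naive_reject(rng.choice(3, size=rounds, p=hypothesized), hypothesized, eps, 3)
              for _ in range(1000)]
print("per-epoch rejection rate:", np.mean(rejections))  # stays bounded away from 0
```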
Two problems with the naïve approach
• Even if the hypothesis is true, each epoch there is a constant probability of rejecting it
  • Possibly, by fluke, the actual distribution looks nothing like the hypothesized one
• How do we distinguish a distribution within epsilon of the hypothesized one from the hypothesized distribution itself?
  • E.g. if another player plays almost the equilibrium strategy, we want to best-respond (presumably with a pure strategy), not play the mixed equilibrium strategy
Solution
• Let the epoch length increase, while the test gets stronger (the observed distribution should get closer to the hypothesized distribution); see the sketch below
• [Figure: from one epoch to the next, the acceptable margin has decreased, but the distribution of the observed frequency is much narrower (more rounds), so the chance of acceptance has actually increased]
• If the chance of rejection is decreased fast enough, then with nonzero probability we will never reject!
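Repeating the experiment above with a growing epoch length and a shrinking epsilon illustrates the point: the per-epoch rejection rate now falls across epochs instead of staying constant. The schedule eps_t = 0.1/t, n_t = 100·t⁴ is an illustrative choice, not necessarily a valid schedule from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
hypothesized = np.array([0.5, 0.5, 0.0])
trials = 200

for t in range(1, 5):
    eps_t = 0.1 / t          # the test gets stronger...
    n_t = 100 * t ** 4       # ...but the epoch gets longer even faster
    rejects = 0
    for _ in range(trials):
        sample = rng.choice(3, size=n_t, p=hypothesized)
        empirical = np.bincount(sample, minlength=3) / n_t
        rejects += np.max(np.abs(empirical - hypothesized)) > eps_t
    print(f"epoch {t}: eps={eps_t:.3f}, rounds={n_t}, rejection rate ~ {rejects / trials:.3f}")
```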
Final proof details
• We define what constitutes a valid schedule for changing the epsilons (one for each hypothesis) and the number of rounds per epoch
  • The number of rounds should increase fast enough to get a nonzero probability of never restarting (if the hypothesis is true)
  • Chebyshev's inequality then allows us to bound the probability of a restart in a given epoch (a sketch of this bound follows below)
• This allows us to prove the paper's main results:
  • Theorem. AWESOME (with a valid schedule) converges to a best response against (eventually) stationary opponents.
  • Theorem. AWESOME (with a valid schedule) converges to a Nash equilibrium in self-play.
    • Interestingly, it is not always the precomputed equilibrium!
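A hedged sketch of the Chebyshev step, for a single action whose true probability is p; the concrete schedule at the end is an illustrative example of how the sum can be made finite, not necessarily the schedule used in the paper.

```latex
% Let \hat{p}_t be the action's empirical frequency in epoch t, which has n_t rounds,
% and let \epsilon_t be that epoch's test threshold. By Chebyshev's inequality,
\[
  \Pr\bigl(\lvert \hat{p}_t - p \rvert \ge \epsilon_t\bigr)
  \;\le\; \frac{p(1-p)}{n_t\,\epsilon_t^{2}}
  \;\le\; \frac{1}{4\,n_t\,\epsilon_t^{2}}.
\]
% A union bound over actions and players multiplies this by a constant c.
% If the schedule makes \sum_t 1/(n_t \epsilon_t^2) finite -- for instance
% \epsilon_t = 1/t and n_t = t^4, so the bound is c/(4t^2) -- and each factor
% below is positive, then a true hypothesis is never rejected with probability
\[
  \Pr(\text{never reject})
  \;\ge\; \prod_{t \ge 1}\Bigl(1 - \tfrac{c}{4\,n_t\,\epsilon_t^{2}}\Bigr) \;>\; 0 .
\]
```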
Summary
• AWESOME is (to our knowledge) the first algorithm for learning in general repeated games that:
  • Converges to a best response against (eventually) stationary opponents
  • Converges to a Nash equilibrium in self-play
• Basic idea: try to adapt (best-respond) when everybody appears to be playing stationary strategies, but otherwise go back to the equilibrium
• AWESOME achieves this by testing various hypotheses each epoch of rounds
• Convergence can be proved for carefully constructed schedules for simultaneously increasing
  • the strength of the test
  • the number of rounds per epoch
Future research
• Speed of convergence
  • Does AWESOME converge fast?
    • For which schedules of increasing the number of rounds per epoch & the strength of the test?
  • Can it be changed to converge faster?
• Does AWESOME have additional properties?
  • For example, the basic idea seems fairly "safe" in zero-sum games
  • Add code to the algorithm skeleton to obtain other properties?
• Fewer assumptions
  • Can we integrate learning the structure of the game?
• Can AWESOME be simplified?
  • E.g., not having to compute a Nash equilibrium
• Are the proof techniques we used useful elsewhere?