AWESOME: A General Multiagent Learning Algorithm that Converges in Self-Play and Learns a Best Response Against Stationary Opponents Vincent Conitzer and Tuomas Sandholm Computer Science Department Carnegie Mellon University
Learning in games
• Two aspects of learning in games:
  • Learning the game (or aspects of the game) itself
  • Learning how the opponent is behaving
• Many previous algorithms have interleaved these two aspects
• This paper focuses solely on learning with respect to the opponent
  • It assumes that the game is known
  • It assumes that an equilibrium can be computed
The setting
• There are N players, each with their own possible actions
• There is a known stage game (matrix game) which the players play repeatedly
  • Mapping from action vectors to payoff vectors
• Each round, the players decide on a distribution over their actions to play from (a mixed strategy)
• The players have a long-term learning strategy
  • Special case: a stationary strategy (play from the same distribution every time)
• [Figure: a 2-player, 3-action stage game]
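As a minimal sketch of this setting (the payoff matrices below are illustrative placeholders, not the game shown on the slide), one round of the repeated stage game might look like this:

```python
import numpy as np

# Placeholder 2-player, 3-action stage game (not the matrix from the slide).
payoff_row = np.array([[3, 0, 1],
                       [0, 3, 1],
                       [1, 1, 2]])   # row player's payoffs
payoff_col = payoff_row.T            # column player's payoffs (arbitrary illustrative choice)

def play_round(mixed_row, mixed_col, rng):
    """Each player samples an action from their mixed strategy; payoffs follow."""
    a_row = rng.choice(3, p=mixed_row)
    a_col = rng.choice(3, p=mixed_col)
    return payoff_row[a_row, a_col], payoff_col[a_row, a_col]

rng = np.random.default_rng(0)
# A stationary opponent plays from the same distribution every round.
print(play_round([0.5, 0.5, 0.0], [0.49, 0.51, 0.0], rng))
```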
How should a stage game be played?
• Nash equilibrium:
  • Every agent has a mixed strategy (distribution over actions)
  • Each agent's mixed strategy is a best response to the other's
• Makes sense for infinitely rational agents
• But: against a (less clever) opponent with a fixed mixed strategy, we could do better
• [Figure: in the example game, the unique Nash equilibrium has both players mixing (50%, 50%, 0%); against a suboptimal opponent playing (49%, 51%, 0%), the best response is the pure strategy (100%, 0%, 0%)]
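Computing such a best response is simple once the opponent's stationary mixed strategy is known. A minimal sketch, with a placeholder payoff matrix (not the game from the slide):

```python
import numpy as np

# Rows = our actions, columns = opponent actions; values are our payoffs.
# Illustrative placeholder matrix, chosen only to make the example concrete.
payoff = np.array([[-1.0,  1.0, 0.0],
                   [ 1.0, -1.0, 0.0],
                   [ 0.0,  0.0, 0.5]])

def best_response(opponent_mixed):
    expected = payoff @ np.asarray(opponent_mixed)  # expected payoff of each of our actions
    return int(np.argmax(expected))                 # a pure strategy suffices

print(best_response([0.49, 0.51, 0.0]))  # exploits a slightly-off-equilibrium opponent
```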
Objective: two properties
• Our algorithm is designed to achieve two properties:
  • Against opponents that (eventually) play from a stationary distribution, eventually play the best response
    • Maximum exploitation against a fixed strategy
    • Compare with opponent modeling
  • In self-play (playing against other agents using the same algorithm), the agents should converge to a Nash equilibrium
Closely related prior results
• Regret matching [Hart, Mas-Colell 2000]:
  • Regrets go to zero (which implies eventual best-responding to fixed strategies)
  • But: in self-play, convergence to correlated equilibrium only
    • Correlated equilibrium is a relaxed version of Nash equilibrium
• WoLF-IGA [Bowling, Veloso 2002]: gets both properties given that:
  • The game is two-player, two-action
  • Players can observe each other's mixed strategies (not just the played actions)
  • Players can use infinitesimally small step sizes
• AWESOME achieves both properties without any of these assumptions
Introducing AWESOME
• AWESOME stands for Adapt When Everybody is Stationary, Otherwise Move to Equilibrium
• The basic idea:
  • Detect if the other players are playing stationary strategies
  • If so, try to play the best response
  • Otherwise, restart completely and go back to the equilibrium strategy
AWESOME's null hypotheses
• AWESOME starts with a null hypothesis that everyone is playing the (precomputed) equilibrium
• If this is rejected, AWESOME switches to another null hypothesis that all players are playing stationary strategies
• If this is rejected, AWESOME restarts completely
• The current hypothesis is evaluated every epoch
  • Epoch = certain number of rounds
  • Reject the equilibrium hypothesis if the actual distribution of actions is too far from the equilibrium
  • Reject the stationarity hypothesis if the actual distribution changes too much between epochs
  • We will discuss how to reject hypotheses later
What does AWESOME play?
• While the equilibrium hypothesis is maintained, AWESOME plays its own equilibrium strategy
  • The point of the equilibrium hypothesis is to keep AWESOME from drifting away from the (possibly mixed-strategy) equilibrium by starting to play (pure-strategy) best responses
• When the equilibrium hypothesis is rejected, AWESOME picks a random action to play
  • Then, if another action appears to be (significantly) better against what the others were playing in the last epoch, AWESOME switches to that action
  • The significance requirement is needed to prevent AWESOME from jumping around between (nearly) equivalent actions, which could cause restarts
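The control flow described so far can be summarized in a short skeleton. This is a hedged schematic sketch, not the paper's pseudocode: the helper names (play_epoch, equilibrium_test, stationarity_test, pick_better_action), the schedule interface, and the way strategies are represented are all placeholder assumptions supplied by the caller.

```python
import random

def awesome(equilibrium_strategy, play_epoch, schedule, n_actions,
            equilibrium_test, stationarity_test, pick_better_action):
    """Schematic sketch of AWESOME's control flow (not the paper's exact pseudocode).
    Assumed interfaces:
      schedule(epoch) -> (rounds, eps_eq, eps_stat), a valid schedule
      play_epoch(strategy, rounds) -> observed opponent action distributions
      equilibrium_test / stationarity_test -> True if the hypothesis is rejected
    """
    while True:                                    # outer loop: one run per complete restart
        hypothesis = "equilibrium"                 # null hypothesis 1
        strategy = equilibrium_strategy            # play the precomputed equilibrium
        epoch = 0
        restart = False
        while not restart:
            rounds, eps_eq, eps_stat = schedule(epoch)
            observed = play_epoch(strategy, rounds)
            if hypothesis == "equilibrium":
                if equilibrium_test(observed, eps_eq):      # others too far from equilibrium
                    hypothesis = "stationary"               # switch to null hypothesis 2
                    strategy = random.randrange(n_actions)  # start from a random pure action
            else:
                if stationarity_test(observed, eps_stat):   # others' play changed too much
                    restart = True                          # restart completely
                else:
                    strategy = pick_better_action(observed, strategy)  # only if significantly better
            epoch += 1
```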
A naïve approach
• Say we were to apply the same test of the hypothesis every epoch (see the sketch below):
  • Same number of rounds every epoch
  • If the observed distribution of actions deviates more than epsilon from the hypothesized distribution, reject it
• [Figure: per epoch, the distribution of the fraction of times action 1 was played, the hypothesized distribution, the bounds for accepting the hypothesis, and the probability of acceptance (given the hypothesis is true)]
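A minimal sketch of this naïve test, with illustrative parameters (not from the paper): with a fixed epoch length and a fixed epsilon, the per-epoch probability of rejecting a true hypothesis stays roughly constant, so a rejection eventually happens with probability 1.

```python
import numpy as np

def naive_reject(actions, hypothesized, eps, n_actions):
    """Reject if any action's observed frequency is more than eps from the hypothesized probability."""
    empirical = np.bincount(actions, minlength=n_actions) / len(actions)
    return np.max(np.abs(empirical - hypothesized)) > eps

rng = np.random.default_rng(1)
hypothesized = np.array([0.5, 0.5, 0.0])  # the hypothesized (e.g. equilibrium) distribution
rounds, eps = 100, 0.05                   # fixed epoch length and fixed epsilon

# Even when the hypothesis is true, each epoch rejects with some constant probability.
rejections = [naive_reject(rng.choice(3, size=rounds, p=hypothesized), hypothesized, eps, 3)
              for _ in range(1000)]
print("per-epoch rejection rate:", np.mean(rejections))  # stays bounded away from 0
```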
Two problems with the naïve approach
• Even if the hypothesis is true, each epoch there is a constant probability of rejecting it
  • Possibly, by fluke, the actual distribution looks nothing like the hypothesized one
• How do we distinguish a distribution within epsilon of the hypothesized one from the hypothesized distribution itself?
  • E.g. if another player plays almost the equilibrium strategy, we want to best-respond (presumably with a pure strategy), not play the mixed equilibrium strategy
Solution
• Let the epoch length increase, while the test gets stronger (the observed distribution should get closer to the hypothesized distribution); see the sketch below
• [Figure: from one epoch to the next, the acceptable margin has decreased, but the distribution of the observed frequency is much narrower (more rounds), so the chance of acceptance has actually increased]
• If the chance of rejection is decreased fast enough, then with nonzero probability we will never reject!
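Repeating the experiment above with a growing epoch length and a shrinking epsilon illustrates the point: the per-epoch rejection rate now falls across epochs instead of staying constant. The schedule eps_t = 0.1/t, n_t = 100·t⁴ is an illustrative choice, not necessarily a valid schedule from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
hypothesized = np.array([0.5, 0.5, 0.0])
trials = 200

for t in range(1, 5):
    eps_t = 0.1 / t          # the test gets stronger...
    n_t = 100 * t ** 4       # ...but the epoch gets longer even faster
    rejects = 0
    for _ in range(trials):
        sample = rng.choice(3, size=n_t, p=hypothesized)
        empirical = np.bincount(sample, minlength=3) / n_t
        rejects += np.max(np.abs(empirical - hypothesized)) > eps_t
    print(f"epoch {t}: eps={eps_t:.3f}, rounds={n_t}, rejection rate ~ {rejects / trials:.3f}")
```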
Final proof details
• We define what constitutes a valid schedule for changing the epsilons (one for each hypothesis) and the number of rounds per epoch
  • The number of rounds should increase fast enough to get a nonzero probability of never restarting (if the hypothesis is true)
  • Chebyshev's inequality then allows us to bound the probability of a restart in a given epoch (a sketch of this bound follows below)
• This allows us to prove the paper's main results:
  • Theorem. AWESOME (with a valid schedule) converges to a best response against (eventually) stationary opponents.
  • Theorem. AWESOME (with a valid schedule) converges to a Nash equilibrium in self-play.
    • Interestingly, it is not always the precomputed equilibrium!
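A hedged sketch of the Chebyshev step, for a single action whose true probability is p; the concrete schedule at the end is an illustrative example of how the sum can be made finite, not necessarily the schedule used in the paper.

```latex
% Let \hat{p}_t be the action's empirical frequency in epoch t, which has n_t rounds,
% and let \epsilon_t be that epoch's test threshold. By Chebyshev's inequality,
\[
  \Pr\bigl(\lvert \hat{p}_t - p \rvert \ge \epsilon_t\bigr)
  \;\le\; \frac{p(1-p)}{n_t\,\epsilon_t^{2}}
  \;\le\; \frac{1}{4\,n_t\,\epsilon_t^{2}}.
\]
% A union bound over actions and players multiplies this by a constant c.
% If the schedule makes \sum_t 1/(n_t \epsilon_t^2) finite -- for instance
% \epsilon_t = 1/t and n_t = t^4, so the bound is c/(4t^2) -- and each factor
% below is positive, then a true hypothesis is never rejected with probability
\[
  \Pr(\text{never reject})
  \;\ge\; \prod_{t \ge 1}\Bigl(1 - \tfrac{c}{4\,n_t\,\epsilon_t^{2}}\Bigr) \;>\; 0 .
\]
```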
Summary
• AWESOME is (to our knowledge) the first algorithm for learning in general repeated games that:
  • Converges to a best response against (eventually) stationary opponents
  • Converges to a Nash equilibrium in self-play
• Basic idea: try to adapt (best-respond) when everybody appears to be playing stationary strategies, but otherwise go back to the equilibrium
• AWESOME achieves this by testing various hypotheses each epoch of rounds
• Convergence can be proved for carefully constructed schedules for simultaneously increasing
  • the strength of the test
  • the number of rounds per epoch
Future research
• Speed of convergence
  • Does AWESOME converge fast?
    • For which schedules of increasing the number of rounds per epoch & the strength of the test?
  • Can it be changed to converge faster?
• Does AWESOME have additional properties?
  • For example, the basic idea seems fairly "safe" in zero-sum games
  • Add code to the algorithm skeleton to obtain other properties?
• Fewer assumptions
  • Can we integrate learning the structure of the game?
• Can AWESOME be simplified?
  • E.g., not having to compute a Nash equilibrium
• Are the proof techniques we used useful elsewhere?