Reinforcement Learning to Play an Optimal Nash Equilibrium in Coordination Markov Games
XiaoFeng Wang and Tuomas Sandholm, Carnegie Mellon University
Outline
• Introduction
• Settings
• Coordination Difficulties
• Optimal Adaptive Learning
• Convergence Proof
• Extension: Beyond Self-play
• Extension: Beyond Team Games
• Conclusions and Future Work
Coordination Games
• Coordination games:
  • A coordination game typically has multiple Nash equilibria (NE), some of which may be Pareto dominated by others.
  • Assumption: players (self-interested agents) prefer Nash equilibria to any other steady state (for example, a best-response loop).
  • Objective: play a Nash equilibrium that is not Pareto dominated by any other Nash equilibrium.
• Why are coordination games important?
  • Whenever an individual agent cannot achieve its goal without interacting with others, coordination problems can arise.
  • Studying coordination games helps us understand how to achieve win-win outcomes in interactions and avoid getting stuck in undesirable equilibria.
• Examples: team games, Battle of the Sexes, and minimum-effort games.
Team Games
• Team games:
  • In a team game, all agents receive the same expected rewards.
  • Team games are the simplest form of coordination game.
• Why are team games important?
  • A team game can have multiple Nash equilibria, only some of which are optimal. This captures the key property of a general category of coordination games, so studying team games gives an easy starting point without losing important generality.
Coordination Markov Games
• Markov decision process (MDP):
  • Models the environment as a set of states S. A decision-maker (agent) drives the state transitions to maximize the sum of its discounted long-term payoffs.
• A coordination Markov game:
  • Combines an MDP with coordination games: a set of self-interested agents choose a joint action $a \in A$ that determines the state transition, each seeking to maximize its own payoff. Team Markov games are one example.
• Relation between Markov games and repeated stage games:
  • A joint Q-function maps a state and joint action pair (s, a) to the tuple of discounted long-term rewards that the individual agents receive by taking joint action a at state s and then following a joint strategy $\pi$.
  • $Q(s, \cdot)$ can be viewed as a stage game in which agent i receives payoff $Q_i(s, a)$ (one component of the tuple $Q(s, a)$) when the agents take joint action a at state s. We call such a game a state game.
  • A subgame perfect Nash equilibrium (SPNE) of a coordination Markov game is composed of Nash equilibria of a sequence of coordination state games.
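For reference, one standard way to write the joint Q-function described above is the Bellman form below, for a deterministic joint strategy. The reward function $r_i$, the discount factor $\gamma$, and the transition probability $P$ are notation assumed here for illustration; the slides do not name them.

```latex
Q_i^{\pi}(s, a) = r_i(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a)\, Q_i^{\pi}\bigl(s', \pi(s')\bigr)
```

Fixing s and letting a range over all joint actions gives the payoff table of the state game for agent i.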
Reinforcement Learning (RL)
• Objective of reinforcement learning:
  • Find a strategy $\pi: S \to A$ that maximizes an agent's discounted long-term payoffs without knowledge of the environment model (reward structure and transition probabilities).
• Model-based reinforcement learning:
  • Learn the reward structure and transition probabilities, then compute the Q-function from them.
• Model-free reinforcement learning:
  • Learn the Q-function directly.
• Learning policy:
  • Interleave learning with execution of the learnt policy.
  • A GLIE (greedy in the limit with infinite exploration) learning policy guarantees convergence to an optimal policy in a single-agent MDP.
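A minimal sketch of one GLIE-style learning policy, assuming an exploration probability of 1/visits for the current state. The function name `glie_action` and this particular schedule are illustrative assumptions; the slides only require that the policy be GLIE. The exploration probability here is unrelated to the ε-bound used later in the deck.

```python
import random

def glie_action(greedy_action, actions, visits, rng=random):
    """Return an action under a GLIE-style schedule (illustrative).

    With probability 1 / visits the agent explores uniformly at random;
    otherwise it exploits the current greedy action. The exploration
    probability decays to zero, while its sum over visits diverges.
    """
    explore_prob = 1.0 / max(1, visits)
    if rng.random() < explore_prob:
        return rng.choice(actions)
    return greedy_action

# Usage: on the 100th visit to a state the agent explores with probability 0.01.
rng = random.Random(0)
choices = [glie_action("a0", ["a0", "a1"], visits=100, rng=rng) for _ in range(1000)]
```

Any schedule that decays to zero slowly enough for every action to be tried infinitely often would serve the same purpose.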
RL in a Coordination Markov Game
• Objective:
  • Without knowing the game structure, each agent i tries to find an optimal individual strategy $\pi_i: S \to A_i$ that maximizes the sum of its discounted long-term payoffs.
• Difficulties:
  • In a general Markov game, the two layers of learning (learning the game structure and learning the strategy) are interdependent. On one hand, the strategy is determined from the Q-function. On the other hand, the Q-function is learnt with respect to the joint strategy the agents follow.
• RL in team Markov games:
  • Team Markov games simplify the learning problem: the game structure can be learnt off-policy, and coordination is learnt over the individual state games.
  • In a team Markov game, the combination of the individual agents' optimal policies is an optimal Nash equilibrium of the game.
  • Although simpler, the problem is trickier than it appears.
Research Issues
• How can agents play an optimal Nash equilibrium in an unknown team Markov game?
• How can the results be extended to a more general category of coordination stage games and Markov games?
Setting
• Agents make decisions independently and concurrently.
• There is no communication between agents.
• Agents independently receive reward signals with the same expected values.
• The environment model is unknown.
• Agents' actions are fully observable.
• Objective: find an optimal joint policy $\pi^*: S \to \times_i A_i$ that maximizes the sum of discounted long-term rewards.
Coordination over a known game
(Figure: Claus and Boutilier's stage game, with actions A0 to A2 for one agent and B0 to B2 for the other.)
• A team game may have multiple optimal NE. Without coordination, agents do not know which one to play.
• Solutions:
  • Lexicographic conventions (Boutilier)
    • Problem: sometimes the mechanism designer is unable or unwilling to impose an order.
  • Learning:
    • Each agent treats the others as nonstrategic players and best responds to the empirical distribution of their previous plays, e.g., fictitious play or adaptive play.
    • Problem: the learning process may converge to a sub-optimal NE, usually a risk-dominant NE.
Coordination over an unknown game
(Figure: the two agents' noisy estimates of the same stage game.)
One agent's estimate:
        A0      A1      A2
B0      9.9     0       -100
B1      0       5       0
B2      -100    0       10.1
The other agent's estimate:
        A0      A1      A2
B0      10.1    0       -100
B1      0       5       0
B2      -100    0       9.9
• An unknown game structure and noisy payoffs make coordination even more difficult.
• Because agents receive noisy rewards independently, they may hold different views of the game at a particular moment. In this case, even a lexicographic convention does not work.
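A small sketch of why the two noisy views above break coordination. The matrices copy the estimated payoffs from this slide; the helper name `greedy_joint_action` and the use of NumPy are illustrative assumptions.

```python
import numpy as np

# Estimated payoffs from the slide above; one axis per agent.
estimate_1 = np.array([[9.9, 0, -100], [0, 5, 0], [-100, 0, 10.1]])
estimate_2 = np.array([[10.1, 0, -100], [0, 5, 0], [-100, 0, 9.9]])

def greedy_joint_action(q):
    """Return the index pair of the highest estimated payoff."""
    idx = np.unravel_index(np.argmax(q), q.shape)
    return tuple(int(i) for i in idx)

# Each agent believes a different joint action is optimal.
target_1 = greedy_joint_action(estimate_1)
target_2 = greedy_joint_action(estimate_2)
print(target_1, target_2)
```

If each agent plays its own half of the joint action it believes is optimal, the resulting joint action is a corner cell with payoff -100, whichever agent controls which axis.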
Problems
• Against a known game:
  • By solving the game, agents can identify all the NE but still do not know which one to play.
  • By myopic play (learning), agents can learn to play a consistent NE, which may however not be optimal.
• Against an unknown game:
  • Agents might not identify the optimal NE before the estimate of the game structure fully converges.
Optimal Adaptive Learning
• Basic ideas:
  • Over a known game: eliminate the sub-optimal NE, then use myopic play (learning) to learn how to play.
  • Over an unknown game: estimate the NE of the game before the game structure converges. Interleave learning of coordination with learning of the game structure.
• Learning layers:
  • Learning of coordination: biased adaptive play against virtual games.
  • Learning of game structure: construct virtual games with an ε-bound on top of a model-based RL algorithm.
Virtual games
Team state game Q(s, ·):
        A0      A1      A2
B0      10      0       -100
B1      0       5       0
B2      -100    0       10
Its virtual game VG(s, ·):
        A0      A1      A2
B0      1       0       0
B1      0       0       0
B2      0       0       1
• A virtual game (VG) is derived from a team state game Q(s, ·) as follows:
  • If a is an optimal NE in Q(s, ·), VG(s, a) = 1. Otherwise, VG(s, a) = 0.
• Virtual games eliminate all the strict sub-optimal NE of the original games. This is nontrivial when there are more than two players.
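A minimal sketch of the virtual-game construction for a single known team state game. The function name `virtual_game` and the use of NumPy are illustrative assumptions; the payoff matrix is the stage game shown on this slide.

```python
import numpy as np

def virtual_game(q_state: np.ndarray) -> np.ndarray:
    """Return VG(s, .) for a team state game Q(s, .).

    In a team game every agent receives the same payoff, so the optimal
    NE are exactly the joint actions that attain the maximum payoff.
    The array has one axis per agent, so this also covers n > 2 agents.
    """
    return (q_state == q_state.max()).astype(int)

# Payoffs of the stage game above (one axis per agent).
q = np.array([[10, 0, -100],
              [0, 5, 0],
              [-100, 0, 10]])
vg = virtual_game(q)
print(vg)
```

The sub-optimal NE with payoff 5 gets value 0 in the virtual game, so myopic play over the virtual game can no longer settle on it.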
Adaptive Play
• Adaptive play (AP):
  • Each agent has a limited memory that holds the m most recent plays it has observed.
  • To choose an action, agent i randomly draws k samples (without replacement) from its memory to build an empirical model of the others' joint strategy.
  • For example, if a reduced joint action profile $a_{-i}$ (all individual actions except i's) appears $K(a_{-i})$ times in the samples, agent i treats the probability of that profile as $K(a_{-i})/k$.
  • Agent i chooses an action that best responds to this distribution.
• Previous work (Peyton Young) shows that AP converges to a strict NE in any weakly acyclic game.
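The action-selection step of adaptive play can be sketched as follows. The function name `adaptive_play_action`, its parameters, and the uniform tie-breaking among best responses are illustrative assumptions, not taken from the slides.

```python
import random
from collections import Counter

def adaptive_play_action(memory, k, payoff, my_actions, rng=random):
    """Pick agent i's action by adaptive play (illustrative sketch).

    memory     : the m most recent reduced joint actions a_{-i} (hashable)
    k          : number of samples drawn without replacement
    payoff     : payoff(a_i, a_minus_i) -> float, agent i's payoff
    my_actions : agent i's individual actions
    """
    samples = rng.sample(list(memory), k)
    counts = Counter(samples)  # K(a_{-i}) for each sampled profile

    def expected(a_i):
        # Expected payoff against the empirical distribution K(a_{-i}) / k.
        return sum(n / k * payoff(a_i, a_minus) for a_minus, n in counts.items())

    values = {a_i: expected(a_i) for a_i in my_actions}
    best = max(values.values())
    best_responses = [a for a, v in values.items() if v == best]
    return rng.choice(best_responses)

# Usage: agent B has only seen A play A0, over a virtual game with two optimal NE.
vg = {("B0", "A0"): 1, ("B2", "A2"): 1}
memory_of_a = ["A0", "A0", "A0", "A0"]
action = adaptive_play_action(memory_of_a, 2, lambda b, a: vg.get((b, a), 0),
                              ["B0", "B1", "B2"])
```

Here B0 is the unique best response to the sampled play, so it is chosen regardless of the random draw.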
Weakly Acyclic Games and Biased Set
(Figure: two example virtual games over actions A0 to A2 and B0 to B2.)
• Weakly acyclic games (WAG):
  • In a weakly acyclic game, there exists a best-response path from any strategy profile to a strict NE.
  • Many virtual games are WAGs.
  • However, not all VGs are WAGs.
  • Some VGs have only weak NE, which do not constitute absorbing states.
• Weakly acyclic game w.r.t. a biased set (WAGB):
  • A game in which there exists a best-response path from any profile to an NE in a set D (called the biased set).
Biased Adaptive Play
• Biased adaptive play (BAP):
  • Similar to AP, except that an agent biases its action selection when it detects that it is playing an NE in the biased set.
• Biased rules:
  • If all k samples of agent i contain the same $a_{-i}$, and this $a_{-i}$ is included in at least one NE in D, the agent chooses its most recent best response to that strategy profile. For example, if B's samples show that A keeps playing A0 and B's most recent best response is B0, B will stick to B0.
• Biased adaptive play guarantees convergence to an optimal NE for any VG constructed over a team game, with the biased set containing all the optimal NE.
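A sketch of one reading of the biased rule above, layered on ordinary adaptive play. The function name `bap_action`, the data layout, and the fallback to uniform random choice among best responses are illustrative assumptions.

```python
import random

def bap_action(memory, samples, biased_set, payoff, my_actions, rng=random):
    """Biased adaptive play for agent i (a sketch of one reading of the rule).

    memory     : list of (a_i, a_minus_i) joint plays, oldest first
    samples    : the k reduced joint actions a_{-i} drawn from memory
    biased_set : set D of (a_i, a_minus_i) joint actions
    payoff     : payoff(a_i, a_minus_i) -> float
    my_actions : agent i's individual actions
    """
    k = len(samples)

    def expected(a_i):
        return sum(payoff(a_i, a_minus) for a_minus in samples) / k

    values = {a_i: expected(a_i) for a_i in my_actions}
    best = max(values.values())
    best_responses = [a for a, v in values.items() if v == best]

    all_same = len(set(samples)) == 1
    a_minus = samples[0]
    in_biased_set = any(other == a_minus for _, other in biased_set)
    if all_same and in_biased_set:
        # Stick to the most recently played best response against a_minus.
        for mine, other in reversed(memory):
            if other == a_minus and mine in best_responses:
                return mine
    # Otherwise behave like plain adaptive play.
    return rng.choice(best_responses)

# Usage: both B0 and B1 are best responses to A0; B0 was played most recently.
vg = {("B0", "A0"): 1, ("B1", "A0"): 1}
d = {("B0", "A0"), ("B1", "A0")}
history = [("B1", "A0"), ("B0", "A0")]
choice = bap_action(history, ["A0", "A0"], d, lambda b, a: vg.get((b, a), 0),
                    ["B0", "B1", "B2"])
```

Without the bias, B would keep randomizing between B0 and B1, which is why a weak NE is not absorbing under plain adaptive play.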
Construct VG over an unknown game
• Basic ideas:
  • Use a slowly decreasing bound (called the ε-bound) to find all optimal NE. Specifically:
    • At state s and time t, a joint action a is ε-optimal for the state game if $Q_t(s,a) + \epsilon_t \ge \max_{a'} Q_t(s,a')$.
    • A virtual game $VG_t$ is constructed over these ε-optimal joint actions.
    • If $\lim_{t \to \infty} \epsilon_t = 0$ and $\epsilon_t$ decreases more slowly than the Q-function estimate converges, $VG_t$ converges to VG.
  • The construction of the ε-bound depends on the RL algorithm used to learn the game structure. For a model-based reinforcement learning algorithm, we prove that the following bound meets the condition: $\epsilon_t = N^{b-0.5}$ for any $0 < b < 0.5$, where N is the minimal number of samples made up to time t.
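A minimal sketch of building $VG_t$ from an estimated state game with the bound $N^{b-0.5}$. The function names, the default b = 0.25, and the example estimate are illustrative assumptions; only the bound and the ε-optimality test come from the slide.

```python
import numpy as np

def epsilon_bound(n_min: int, b: float = 0.25) -> float:
    """epsilon_t = N^(b - 0.5), where N is the minimal sample count so far."""
    if not 0 < b < 0.5:
        raise ValueError("b must satisfy 0 < b < 0.5")
    return float(n_min) ** (b - 0.5)

def virtual_game_t(q_t: np.ndarray, n_min: int, b: float = 0.25) -> np.ndarray:
    """Mark every epsilon-optimal joint action of the estimated state game."""
    eps = epsilon_bound(n_min, b)
    return (q_t + eps >= q_t.max()).astype(int)

# A noisy estimate of a team state game (hypothetical values).
q_t = np.array([[9.9, 0.0, -100.0],
                [0.0, 5.0, 0.0],
                [-100.0, 0.0, 10.1]])
vg_t = virtual_game_t(q_t, n_min=16)
print(epsilon_bound(16), vg_t)
```

With only 16 samples the bound is wide enough to keep both near-optimal joint actions, even though the noisy estimate ranks one slightly above the other.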
The Algorithm
• Learning of coordination:
  • For each state, construct $VG_t$ from the ε-optimal actions.
  • Following a GLIE learning policy, use BAP to choose best-response actions over $VG_t$ with the exploitation probability.
• Learning of game structure:
  • Use a model-based RL algorithm to update the Q-function.
  • Update the ε-bound using the minimal number of samples, and find the ε-optimal actions with the bound.
Flowchart of the Proof
• Theorem 1: BAP converges over WAGB
• Lemma 2: Nonstationary Markov chain
• Theorem 3: BAP with GLIE converges over WAGB
• Lemma 4: Any VG is a WAGB
• Lemma 5: Convergence rate of the model-based RL
• Lemma 6: VG can be learnt with the ε-bound w.p.1
• Main Theorem: OAL converges to an optimal NE w.p.1
Model BAP as a Markov chain
• Stationary Markov chain model:
  • State:
    • An initial state is composed of the m initial joint actions the agents observed: $h_0 = (a^1, a^2, \ldots, a^m)$.
    • The other states are defined inductively: the successor state h' of a state h is obtained by deleting the leftmost element of h and appending the newly observed joint action on the right.
    • Absorbing state: (a, a, …, a) is an individual absorbing state if $a \in D$ or a is a strict NE. All individual absorbing states are clustered into a unique absorbing state.
  • Transition:
    • The probability $p_{h,h'}$ that state h transits to h' is positive if and only if the rightmost joint action $a = (a_1, a_2, \ldots, a_n)$ in h' is composed of individual actions $a_i$, each of which is a best response to some k samples drawn from h.
    • Since the distribution an agent uses to sample its memory is independent of time, the transition probability between any two states does not change with time. Therefore, the Markov chain is stationary.
Convergence over a known game
• Theorem 1: Let L(a) be the length of the shortest best-response path from joint action a to an NE in D, and let $L_G = \max_a L(a)$. If $m \ge k(L_G + 2)$, BAP over a WAGB converges to either an NE in D or a strict NE w.p.1.
• Nonstationary Markov chain model:
  • With a GLIE learning policy, at any moment an agent experiments with some probability (exploring actions other than the estimated best response). This exploration probability diminishes over time. We can therefore model BAP with GLIE over a WAGB as a nonstationary Markov chain with transition matrix $P_t$. Let P be the transition matrix of the stationary Markov chain for BAP over the same WAGB. GLIE guarantees that $P_t \to P$ as $t \to \infty$.
  • In the stationary Markov chain model there is only one absorbing state (composed of several individual absorbing states). Theorem 1 says that such a Markov chain is ergodic, with only one stationary distribution, given $m \ge k(L_G + 2)$. Nonstationary Markov chain theory then gives the following theorem.
• Theorem 3: With $m \ge k(L_G + 2)$, BAP with GLIE converges to either an NE in D or a strict NE w.p.1.
Determine the length of the best-response path
(Figure: a joint action shown as a row of boxes, one per agent, with labels n' (length of NE prefix) and n (number of agents), and a legend distinguishing non-NE strategy from NE.)
• In a team game, $L_G$ is no more than n (the number of agents), as the figure illustrates. Each box represents an individual action of an agent; a box marked as NE represents an individual action contained in an NE. The remaining n-n' agents can move the joint action to an NE by switching their individual actions one after another. Each switch is a best response given that the others stick to their individual actions.
• Lemma 4: The VG of any team game is a WAGB w.r.t. the set of optimal NE, with $L_{VG} \le n$.
Learning the virtual games
• First, we assess the convergence rate of the model-based RL algorithm.
• Then, we derive a sufficient condition on the ε-bound from the convergence rate lemma.
Main Theorem
• Theorem 7: In any team Markov game among n agents, if 1) $m \ge k(n+1)$ and 2) the ε-bound satisfies Lemma 6, then the OAL algorithm converges to an optimal NE w.p.1.
• General ideas of the proof:
  • By Lemma 6, the probability of the event E that $VG_t = VG$ for the rest of play after time t converges to 1 as $t \to \infty$.
  • Starting from a time t', conditioning on the event E, agents play BAP with GLIE over a known game, which converges to an optimal NE w.p.1 according to Theorem 3.
  • Combining these two convergence processes gives the convergence result.
Example: 2-agent game
(Figure: results for a two-agent game over actions A0 to A2 and B0 to B2.)
Example: 3-agent game
        B1C1  B1C2  B1C3  B2C1  B2C2  B2C3  B3C1  B3C2  B3C3
A1      10    -20   -20   -20   -20   5     -20   5     -20
A2      -20   -20   5     -20   10    -20   5     -20   -20
A3      -20   5     -20   5     -20   -20   -20   -20   10
Extension: general ideas
• Classic game theory tells us how to solve a game, i.e., how to identify the fixed points of introspection. However, it is less clear about how to play a game.
• Standard ways to play a game:
  • Solve the game first and play an NE strategy (strategic play).
    • Problem: 1) with multiple NE, agents may not know which one to play; 2) solving the game may be computationally expensive.
  • Assume that the others follow a stationary strategy and best respond to that belief (myopic play).
    • Problem: myopic strategies may lead agents to play a sub-optimal (Pareto dominated) NE.
• The idea generalized from OAL: Partially Myopic and Partially Strategic (PMPS) play.
  • Biased action selection: strategically lead the others to play a stable strategy.
  • Virtual games: compute the NE first and then eliminate the sub-optimal NE.
  • Adaptive play: myopically adjust the best-response strategy w.r.t. the agent's observations.
Extension: Beyond self-play
• Problem:
  • OAL only guarantees convergence to an optimal NE in self-play, that is, when all players are OAL agents. Can agents find the optimal coordination when only some of them play OAL? Consider the simplest case: two agents, one a JAL (joint action learner) or IL (independent learner) player (Claus and Boutilier 98) and the other an OAL player.
• A straightforward way to enforce the optimal coordination:
  • Of the two players, one is an “opinionated” player who leads the play (the leader); the other is the learner.
  • If the other player is either a JAL or an IL player, convergence to an optimal NE is guaranteed.
  • What if the other player is also a leader agent? More seriously, how should the leader play if it does not know the type of the other player?
(Figure: stage game over actions A0 to A2 and B0 to B2.)
New Biased Rules
• Original biased rules:
  • If all k samples of agent i contain the same $a_{-i}$, and this $a_{-i}$ is included in at least one NE in D, the agent chooses its most recent best response to that strategy profile. For example, if B's samples show that A keeps playing A0 and B's most recent best response is B0, B will stick to B0.
• New biased rules:
  • If agent i has multiple best-response actions w.r.t. its k samples, it chooses the one included in an optimal NE in the VG. If there are several such choices, it chooses the one that has been played most recently.
• Difference between the old and the new rules:
  • The old rules bias the action selection only when the others' joint strategy is included in an optimal NE. Otherwise, the agent just randomizes among its best-response actions.
  • The new rules always bias the agent's action selection.
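The tie-breaking step of the new rules can be sketched as below. The function name `new_biased_choice`, its arguments, and the random fallbacks (used when no best response belongs to an optimal NE, or when none of the candidates has been played yet) are illustrative assumptions; the slides do not specify those cases.

```python
import random

def new_biased_choice(best_responses, optimal_ne, agent, my_history, rng=random):
    """Break ties among best responses using the new biased rule (sketch).

    best_responses : agent i's best-response actions w.r.t. its k samples
    optimal_ne     : joint actions (tuples) that are optimal NE in the VG
    agent          : position of agent i inside each joint-action tuple
    my_history     : agent i's own past actions, oldest first
    """
    in_ne = [a for a in best_responses
             if any(joint[agent] == a for joint in optimal_ne)]
    if not in_ne:
        # Assumption: fall back to plain adaptive play.
        return rng.choice(list(best_responses))
    if len(in_ne) == 1:
        return in_ne[0]
    # Several candidates: prefer the one played most recently.
    for past in reversed(my_history):
        if past in in_ne:
            return past
    # Assumption: no candidate played yet, so pick one at random.
    return rng.choice(in_ne)

# Usage: B0 and B2 both belong to an optimal NE; B2 was played more recently.
optimal = {("A0", "B0"), ("A2", "B2")}
pick = new_biased_choice(["B0", "B1", "B2"], optimal, agent=1,
                         my_history=["B2", "B1"])
```

Unlike the original rule, this choice does not depend on whether the others' sampled play is itself part of an optimal NE.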
Example
• The new rules preserve the convergence properties in n-agent team Markov games.
Extension: Beyond Team Games
• How can the ideas of PMPS play be extended to general coordination games?
• To simplify the setting, we consider a category of coordination stage games with the following properties:
  • These games have at least one pure-strategy NE.
  • Agents have compatible preferences for some of these NE over any other steady state (such as mixed-strategy NE or best-response loops).
• We consider two situations: perfect monitoring and imperfect monitoring.
  • Perfect monitoring: agents can observe the others' actions and payoffs.
  • Imperfect monitoring: agents observe only the others' actions.
• Agents may have no information about the game structure.
Perfect Monitoring
• Follows the same idea as OAL.
• Algorithm:
  • Learning of coordination:
    • Compute all the NE of the estimated game.
    • Find all the dominated NE. For example, a strategy profile (a, b) is dominated by (a', b') if $Q(a) < Q(a') - \epsilon$ and $Q(b) \le Q(b') + \epsilon$.
    • Construct a VG that contains all the non-dominated NE, setting all other values in the VG to zero (without loss of generality, suppose that agents normalize their payoffs to values between zero and one).
    • With GLIE exploration, run BAP over the VG.
  • Learning of game structure:
    • Observe the others' payoffs and update the sample means of the agents' expected payoffs in the game matrix.
    • Compute an ε-bound in the same way as in OAL.
• The learning over the coordination stage games discussed here is conjectured to converge w.p.1 to an NE that is not Pareto dominated.
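A sketch of the coordination-learning step for a two-agent game under perfect monitoring. It assumes the dominance test reads "one agent is worse by more than ε and the other is no better by more than ε", applied with either agent in the first role; the function names and the stag-hunt style payoffs are hypothetical.

```python
import numpy as np

def pure_nash(qa, qb):
    """Pure NE of a bimatrix game; rows are A's actions, columns are B's."""
    rows, cols = qa.shape
    return [(i, j) for i in range(rows) for j in range(cols)
            if qa[i, j] >= qa[:, j].max() and qb[i, j] >= qb[i, :].max()]

def worse_for(first, second, p, q, eps):
    """At NE p, `first` is clearly worse and `second` is no better than at NE q."""
    return first[p] < first[q] - eps and second[p] <= second[q] + eps

def virtual_game(qa, qb, eps=0.0):
    """Keep the payoffs of non-dominated pure NE and zero out everything else."""
    ne = pure_nash(qa, qb)
    keep = [p for p in ne
            if not any(worse_for(qa, qb, p, q, eps) or worse_for(qb, qa, p, q, eps)
                       for q in ne if q != p)]
    vg_a, vg_b = np.zeros_like(qa), np.zeros_like(qb)
    for p in keep:
        vg_a[p], vg_b[p] = qa[p], qb[p]
    return vg_a, vg_b

# Hypothetical normalized payoffs with two pure NE: (0, 0) and (1, 1).
qa = np.array([[1.0, 0.0], [0.75, 0.75]])
qb = np.array([[1.0, 0.75], [0.0, 0.75]])
vg_a, vg_b = virtual_game(qa, qb, eps=0.05)
```

In this example the NE at (1, 1) is worse for both agents than the NE at (0, 0), so only (0, 0) keeps a nonzero value in the virtual game.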
Imperfect Monitoring
• In general, it is difficult to eliminate sub-optimal NE without knowing the others' payoffs. Consider the simplest case: two learning agents have at least one common interest (a strategy profile that maximizes both agents' payoffs).
• For this game, agents can learn to play an optimal NE with a modified version of OAL (with new biased rules).
• Biased rules:
  • 1) Each agent randomizes its action selection whenever the payoff of its best-response actions over the virtual game is zero.
  • 2) Each agent biases its action toward its recent best response if all its k samples contain the same individual action of the other agent, more than $m-k$ recorded joint actions have this property, and the agent has multiple best responses that give it payoff 1 w.r.t. its k samples. Otherwise, it randomly chooses a best-response action.
• In this type of coordination stage game, the learning process is conjectured to converge to an optimal NE. The result can be extended to Markov games.
Example
(Figure: stage game over actions A0 to A2 and B0 to B2.)
Conclusions and Future Work
• In this research, we study RL techniques that let agents play an optimal NE (one not Pareto dominated by any other NE) in coordination games when the environment model is unknown beforehand.
• We start with team games and propose the OAL algorithm, the first algorithm that guarantees convergence to an optimal NE in any team Markov game.
• We further generalize the basic ideas of OAL and propose a new approach to learning in games, called partially myopic and partially strategic (PMPS) play.
• We extend PMPS play beyond self-play and beyond team games. Some of the results can be extended to Markov games.
• In future research, we will further explore the application of PMPS play in coordination games. In particular, we will study how to eliminate sub-optimal NE in imperfect-monitoring environments.