Reinforcement Learning on Markov Games
Machine Learning Seminar Series
Nilanjan Dasgupta
Department of Electrical and Computer Engineering
Duke University, Durham, NC 27708
Overview
• Markov Decision Processes (MDP) and Markov games.
• Optimal policy search: value iteration (VI), policy iteration (PI), reinforcement learning (RL).
• Minimax-Q learning for zero-sum (ZS) games.
• Quantitative analysis of the Minimax-Q and Q-learning algorithms for ZS games.
• Cons of Minimax-Q and development of the Nash-Q learning algorithm.
• Constraints of Nash-Q and development of Friend-or-Foe Q-learning.
• Brief discussion of Partially Observable Stochastic Games (POSG): the POMDP analogue for multi-agent stochastic games.
MDP and Markov Games
MDP
• A single agent operating in a fixed environment ("world").
• Represented by a tuple {S, A, T, R}, with T : S × A → Π(S) (the set of probability distributions over S), R : S × A → ℝ, N the length of the horizon, and γ the discount factor.
• The agent's objective is to find the optimal strategy, a mapping from states to actions, that maximizes the expected reward.
Markov Games
• Multiple agents operating in a shared environment.
• Represented by a tuple {S, A1,…, An, T, R1,…, Rn}, with T : S × A1 × … × An → Π(S) and Ri : S × A1 × … × An → ℝ.
• Agent i's objective is to maximize its own expected reward.
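To make the two tuples concrete, here is a minimal sketch (my own illustration in Python, not from the slides) of how an MDP and a two-player Markov game might be represented; the class and field names are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

# Illustrative containers for the tuples {S, A, T, R} and {S, A1, A2, T, R1, R2}.
# States and actions are indexed by integers; T maps to a distribution over next states.

@dataclass
class MDP:
    n_states: int
    n_actions: int
    # T[(s, a)][s'] = probability of moving to s' after taking a in s
    T: Dict[Tuple[int, int], Dict[int, float]]
    # R[(s, a)] = expected immediate reward
    R: Dict[Tuple[int, int], float]
    gamma: float = 0.9  # discount factor

@dataclass
class TwoPlayerMarkovGame:
    n_states: int
    n_actions_1: int
    n_actions_2: int
    # T[(s, a1, a2)][s'] = probability of moving to s' after the joint action (a1, a2)
    T: Dict[Tuple[int, int, int], Dict[int, float]]
    # R1/R2: immediate rewards of each agent; a zero-sum game has R2 = -R1
    R1: Dict[Tuple[int, int, int], float]
    R2: Dict[Tuple[int, int, int], float]
    gamma: float = 0.9
```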
Markov Games
MG = {S, A1,…, An, T, R1,…, Rn}
• When |S| = 1 (single state), a Markov game reduces to a matrix game.
• When n = 1 (single agent), a Markov game reduces to an MDP.
• For a two-player zero-sum (ZS) game there is a single reward function, with the agents having diametrically opposite goals.
• Example: a two-player zero-sum matrix game ("rock, paper, wood"), with the agent's actions as rows and the opponent's actions as columns.
[Payoff matrix: agent actions (rock, paper, wood) vs. opponent actions (rock, paper, wood)]
Optimal policy : Matrix Games
• R_{o,a} is the reward to the agent for taking action a while the opponent takes action o.
• The agent strives to maximize the expected reward while the opponent tries to minimize it.
• For a (possibly mixed) strategy π* to be optimal, it must satisfy
    π* = argmax_{π ∈ Π(A)} min_{o} Σ_{a} R_{o,a} π_a
  i.e., find the strategy for the agent that has the best "worst-case" scenario.
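As an illustration (mine, not from the slides), the best worst-case mixed strategy can be computed with a small linear program. The sketch below uses scipy.optimize.linprog and assumes the payoff matrix R is given with opponent actions as rows and agent actions as columns; the example payoffs are the standard rock-paper-scissors values, used as a stand-in for the slide's "rock, paper, wood" matrix.

```python
import numpy as np
from scipy.optimize import linprog

def minimax_strategy(R):
    """Best worst-case mixed strategy for the maximizing agent.

    R[o, a] = agent's payoff when the opponent plays o and the agent plays a.
    Returns (pi, v): the agent's mixed strategy and the game value.
    """
    n_opp, n_act = R.shape
    # Decision variables: x = [pi_1, ..., pi_{n_act}, v]; maximize v == minimize -v.
    c = np.zeros(n_act + 1)
    c[-1] = -1.0
    # For every opponent action o:  v - sum_a R[o, a] * pi_a <= 0
    A_ub = np.hstack([-R, np.ones((n_opp, 1))])
    b_ub = np.zeros(n_opp)
    # Probabilities sum to one.
    A_eq = np.append(np.ones(n_act), 0.0).reshape(1, -1)
    b_eq = np.array([1.0])
    bounds = [(0.0, 1.0)] * n_act + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:n_act], res.x[-1]

# Standard rock-paper-scissors payoffs (an assumption, standing in for the slide's matrix).
R = np.array([[ 0,  1, -1],
              [-1,  0,  1],
              [ 1, -1,  0]])
print(minimax_strategy(R))  # roughly ([1/3, 1/3, 1/3], 0.0)
```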
Optimal policy : MDP & Markov Games
• There exists a host of methods, such as value iteration and policy iteration (both of which assume complete knowledge of T), and reinforcement learning (RL).
• Value iteration (VI): dynamic programming is used to estimate the value functions, and convergence is guaranteed [Bertsekas, 1987].
• MDP:
    V(s) = max_a [ R(s,a) + γ Σ_{s'} T(s,a,s') V(s') ]
• Markov game (two-player zero-sum):
    Q(s,a,o) = R(s,a,o) + γ Σ_{s'} T(s,a,o,s') V(s')
    V(s) = max_{π ∈ Π(A)} min_o Σ_a π_a Q(s,a,o)
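A compact sketch of the two VI backups (my own illustration, assuming dense NumPy arrays for T and R as defined on the earlier slides; the zero-sum backup reuses the minimax_strategy LP helper from the matrix-game sketch above).

```python
import numpy as np

def value_iteration_mdp(T, R, gamma=0.9, n_iters=1000, tol=1e-8):
    """MDP value iteration. T[s, a, s'] = transition probs, R[s, a] = rewards."""
    n_states = T.shape[0]
    V = np.zeros(n_states)
    for _ in range(n_iters):
        # Q[s, a] = R[s, a] + gamma * sum_s' T[s, a, s'] V[s']
        Q = R + gamma * T @ V
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    return V

def value_iteration_zero_sum(T, R, gamma=0.9, n_iters=1000, tol=1e-8):
    """Zero-sum Markov game VI. T[s, a, o, s'], R[s, a, o] for the maximizing agent."""
    n_states = T.shape[0]
    V = np.zeros(n_states)
    for _ in range(n_iters):
        V_new = np.empty(n_states)
        for s in range(n_states):
            # Q[a, o] backup for state s
            Q = R[s] + gamma * T[s] @ V
            # Value of the induced matrix game: max over pi, min over o.
            # minimax_strategy() is the LP helper above (rows = opponent actions).
            _, V_new[s] = minimax_strategy(Q.T)
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    return V
```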
Optimal policy : Markov Games
Note
• Every MDP has at least one stationary, deterministic optimal policy.
• A Markov game may not have an optimal policy that is both stationary and deterministic.
• The reason is the agent's uncertainty about the opponent's exact move, especially when the agents move simultaneously (unlike alternating-move games such as tic-tac-toe).
Learning Optimal policy : Reinforcement Learning
• Q-learning was first developed by Watkins in 1989 for optimal policy learning in an MDP without explicit knowledge of T.
• The agent receives a reward r while making the transition from s to s' by taking action a; T(s,a,s') is only implicitly involved in the observed transition.
• Q-learning update:
    Q(s,a) ← (1-α) Q(s,a) + α [ r + γ max_{a'} Q(s',a') ]
• Minimax-Q utilizes the same principle in the two-player ZS game:
    Q(s,a,o) ← (1-α) Q(s,a,o) + α [ r + γ V(s') ]
    V(s') = max_{π ∈ Π(A)} min_o Σ_a π_a Q(s',a,o)   (computed via LP)
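Below is a minimal sketch of the two updates (my own illustration, not the author's code), again leaning on the minimax_strategy LP helper from the matrix-game sketch; the learning rate alpha and tabular NumPy Q-arrays are assumptions.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Standard Q-learning backup on a tabular Q[s, a] array."""
    target = r + gamma * Q[s_next].max()
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target

def minimax_q_update(Q, s, a, o, r, s_next, alpha=0.1, gamma=0.9):
    """Minimax-Q backup on a tabular Q[s, a, o] array (agent action a, opponent action o).

    The state value V(s') is the value of the matrix game Q[s'], obtained by LP.
    minimax_strategy() is the helper from the matrix-game sketch (rows = opponent actions).
    """
    _, v_next = minimax_strategy(Q[s_next].T)
    target = r + gamma * v_next
    Q[s, a, o] = (1 - alpha) * Q[s, a, o] + alpha * target
```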
Performance of Minimax-Q
• Simulated soccer game on a 4x5 grid.
• 20 grid cells and 5 actions {N, S, E, W, Stand}.
• Four different policies:
  1. Minimax-Q trained against a random opponent
  2. Minimax-Q trained against a Minimax-Q opponent
  3. Q trained against a random opponent
  4. Q trained against a Q opponent
Constraints of Minimax-Q : Nash-Q
• Convergence of Minimax-Q is guaranteed only for two-player zero-sum games.
• Nash-Q, proposed by Hu & Wellman '98, maintains a set of approximate Q-functions, one per player, and updates them as
    Q_k(s, a_1,…,a_n) ← (1-α) Q_k(s, a_1,…,a_n) + α [ r_k + γ π_1(s') … π_n(s') Q_k(s') ]
  where π_k(s') is the one-stage Nash-equilibrium policy of player k in the stage game defined by the current estimates of {Q_1,…,Q_n}.
• Note that the Minimax-Q learning scheme was
    Q(s,a,o) ← (1-α) Q(s,a,o) + α [ r + γ max_π min_o Σ_a π_a Q(s',a,o) ]
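A sketch of the Nash-Q update for two players (my illustration, not the authors' code). Computing the stage-game Nash equilibrium is the hard part; here it is delegated to a hypothetical one_stage_nash() helper that is assumed to return equilibrium mixed strategies for the bimatrix game (Q1[s'], Q2[s']).

```python
import numpy as np

def nash_q_update(Q1, Q2, s, a1, a2, r1, r2, s_next, alpha=0.1, gamma=0.9):
    """Two-player Nash-Q backup on tabular arrays Q1[s, a1, a2] and Q2[s, a1, a2].

    one_stage_nash(A, B) is a hypothetical helper returning mixed strategies
    (pi1, pi2) that form a Nash equilibrium of the bimatrix game with payoffs A, B.
    """
    pi1, pi2 = one_stage_nash(Q1[s_next], Q2[s_next])
    # Each player's Nash value: expected payoff under the joint equilibrium strategy.
    nash1 = pi1 @ Q1[s_next] @ pi2
    nash2 = pi1 @ Q2[s_next] @ pi2
    Q1[s, a1, a2] = (1 - alpha) * Q1[s, a1, a2] + alpha * (r1 + gamma * nash1)
    Q2[s, a1, a2] = (1 - alpha) * Q2[s, a1, a2] + alpha * (r2 + gamma * nash2)
```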
Single-stage Nash Equilibrium
Let's explain via a classic example, the Battle of the Sexes (Chris vs. Pat, each choosing Opera or Fight).
[Payoff matrix: Chris's actions (Opera, Fight) as rows vs. Pat's actions (Opera, Fight) as columns.]
• Check that there exist two outcomes in which no player can single-handedly change his/her action and increase his/her own payoff.
• NE: (Opera, Opera) and (Fight, Fight).
• For an n-player normal-form game, a strategy profile (σ_1*,…,σ_n*) represents a Nash equilibrium iff, for every player i and every alternative strategy σ_i,
    u_i(σ_1*,…,σ_i*,…,σ_n*) ≥ u_i(σ_1*,…,σ_i,…,σ_n*).
• Two types of Nash equilibria: coordinated and adversarial.
• All normal-form non-cooperative games have a Nash equilibrium, though it may be in mixed strategies.
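To make the definition concrete, here is a small illustration (mine, with commonly used Battle-of-the-Sexes payoff numbers assumed, not necessarily those on the slide) that brute-forces the pure-strategy Nash equilibria of a bimatrix game.

```python
import numpy as np
from itertools import product

def pure_nash_equilibria(A, B):
    """Pure-strategy Nash equilibria of a bimatrix game.

    A[i, j] = row player's payoff, B[i, j] = column player's payoff,
    when the row player picks i and the column player picks j.
    """
    equilibria = []
    for i, j in product(range(A.shape[0]), range(A.shape[1])):
        row_best = A[i, j] >= A[:, j].max()   # row player cannot improve unilaterally
        col_best = B[i, j] >= B[i, :].max()   # column player cannot improve unilaterally
        if row_best and col_best:
            equilibria.append((i, j))
    return equilibria

# Battle of the Sexes with commonly used payoffs (an assumption). Actions: 0 = Opera, 1 = Fight.
A = np.array([[2, 0], [0, 1]])   # Chris (row player)
B = np.array([[1, 0], [0, 2]])   # Pat (column player)
print(pure_nash_equilibria(A, B))  # [(0, 0), (1, 1)] -> (Opera, Opera) and (Fight, Fight)
```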
Analysis of Nash-Q Learning
• Why update in such a way?
• For a 1-player game (MDP), Nash-Q reduces to simple maximization, i.e., Q-learning.
• For zero-sum games, Nash-Q reduces to Minimax-Q, with guaranteed convergence.
• Cons: the Nash equilibrium is generally not unique (multiple equilibrium points may exist), hence convergence is not guaranteed.
• Nash-Q is guaranteed to work when either:
  – There exists a unique coordinated equilibrium for the entire game and for each stage game defined by the Q-functions throughout learning, or
  – There exists a unique adversarial equilibrium for the entire game and for each stage game defined by the Q-functions throughout learning.
Relaxing constraints of Nash-Q : Friend-or-Foe Q
• Uniqueness of the NE is relaxed in FFQ, but the algorithm needs to know the nature of each opponent: "friend" (coordinated equilibrium) or "foe" (adversarial equilibrium).
• A convergence guarantee exists for the Friend-or-Foe Q-learning algorithm.
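A sketch (my own, based on the friend/foe distinction described above) of how the state value might be computed in each case for a two-player FFQ learner, using the minimax_strategy LP helper from the matrix-game sketch for the "foe" case.

```python
import numpy as np

def ffq_state_value(Q_s, opponent_is_friend):
    """State value under Friend-or-Foe Q-learning for one learner.

    Q_s[a, o] = learner's Q-values at one state, for its action a and the opponent's action o.
    "Friend": assume the opponent helps, so take the max over the joint action.
    "Foe": assume the opponent is adversarial, so take the minimax (LP) value.
    """
    if opponent_is_friend:
        return Q_s.max()
    # minimax_strategy() is the LP helper from the matrix-game sketch (rows = opponent actions).
    _, v = minimax_strategy(Q_s.T)
    return v

def ffq_update(Q, s, a, o, r, s_next, opponent_is_friend, alpha=0.1, gamma=0.9):
    """FFQ backup on a tabular Q[s, a, o] array; same form as Q-learning with a different V(s')."""
    v_next = ffq_state_value(Q[s_next], opponent_is_friend)
    Q[s, a, o] = (1 - alpha) * Q[s, a, o] + alpha * (r + gamma * v_next)
```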
Analysis of FFQ Learning
• FFQ provides RL-based strategy learning in multi-player general-sum games.
• Like Nash-Q, it should not be expected to find a Nash equilibrium unless either a coordinated or an adversarial equilibrium exists.
• Unlike Nash-Q, FFQ does not require learning Q-estimates for all the players in the game, only its own.
• FFQ's restrictions are much weaker: it does not require the NE to be unique throughout learning.
• Both FFQ and Nash-Q fail for games that have no NE (infinite games).
Partial Observability of States : POSG
• The entire discussion has assumed that the states are known, although the transition probabilities and reward functions can be learned asynchronously (RL).
• Partially Observable Stochastic Games (POSG) assume that the underlying states are only partially observed via observations.
• Stochastic games are analogous to MDPs, and so is learning them via RL.
• A POMDP can be interpreted as an MDP over belief space, with increased complexity due to the continuous belief space. A POSG, however, cannot be solved by transforming it into a stochastic game over belief space, since each agent's belief is potentially different.
• E. A. Hansen et al. propose a policy-iteration approach for POSGs that alleviates the scaling issue in the finite-horizon case via iterative elimination of dominated strategies (policies).
Summary
• The theories of MDPs and Markov Games are strongly related.
• Minimax-Q is a Q-learning scheme proposed for two-player ZS games.
• Minimax-Q is very conservative in its actions, since it chooses a strategy that maximizes the agent's worst-case performance.
• Nash-Q was developed for multi-player, general-sum games, but it converges only under strict restrictions (existence and uniqueness of the NE).
• FFQ relaxes the restrictions (uniqueness) somewhat, but not by much.
• Most algorithms are reactive, i.e., each agent lets the others choose an equilibrium point and then learns its best response.
• Under partial observability of states, the problem is not yet scalable.