
Reinforcement Learning on Markov Games


Presentation Transcript


  1. Reinforcement Learning on Markov Games
  Machine Learning Seminar Series
  Nilanjan Dasgupta
  Department of Electrical and Computer Engineering, Duke University, Durham, NC 27708

  2. Overview
  • Markov Decision Processes (MDP) and Markov games.
  • Optimal policy search: value iteration (VI), policy iteration (PI), reinforcement learning (RL).
  • Minimax-Q learning for zero-sum (ZS) games.
  • Quantitative analysis of the Minimax-Q and Q-learning algorithms for ZS games.
  • Limitations of Minimax-Q and development of the Nash-Q learning algorithm.
  • Constraints of Nash-Q and development of Friend-or-Foe Q-learning.
  • Brief discussion of Partially Observable Stochastic Games (POSG): the multi-agent counterpart of POMDPs.

  3. MDP and Markov Games
  MDP:
  • A single agent operating in a fixed environment ("world").
  • Represented by a tuple {S, A, T, R}, with transition function T : S × A → Π(S), reward function R : S × A → ℝ, horizon length N, and discount factor γ.
  • The agent's objective is to find the optimal strategy (a mapping from states to actions) that maximizes the expected reward.
  Markov games:
  • Multiple agents operating in a shared environment.
  • Represented by a tuple {S, A1, …, An, T, R1, …, Rn}, with T : S × A1 × … × An → Π(S) and Ri : S × A1 × … × An → ℝ.
  • Agent i's objective is to maximize its own expected reward.
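To make the notation concrete, here is a minimal sketch (not from the slides; the representation is an assumption) of how the two tuples might be encoded, with T and R stored as NumPy arrays over discrete states and actions:

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class MDP:
    T: np.ndarray     # T[s, a, s'] = Pr(s' | s, a), shape (|S|, |A|, |S|)
    R: np.ndarray     # R[s, a] = expected reward, shape (|S|, |A|)
    gamma: float      # discount factor

@dataclass
class TwoPlayerMarkovGame:
    T: np.ndarray     # T[s, a1, a2, s'] = Pr(s' | s, a1, a2)
    R1: np.ndarray    # R1[s, a1, a2] = reward to player 1
    R2: np.ndarray    # R2[s, a1, a2] = reward to player 2 (= -R1 if zero-sum)
    gamma: float
```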

  4. Markov Games
  MG = {S, A1, …, An, T, R1, …, Rn}
  • When |S| = 1 (a single state), the Markov game reduces to a matrix game.
  • When n = 1 (a single agent), the Markov game reduces to an MDP.
  • For a two-player zero-sum (ZS) game, there is a single reward function, with the agents having diametrically opposite goals.
  • Example: a two-player zero-sum matrix game over the actions {rock, paper, wood} for both the agent and the opponent (the payoff table appears on the slide; an illustrative version is sketched below).
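As an illustration of the payoff table referred to above, here is the agent's payoff matrix for rock-paper-wood under the standard cyclic win/lose/draw convention (the slide's exact numbers are not legible in this transcript, so these values are an assumption); the opponent's payoff is the negative of the agent's:

```python
import numpy as np

actions = ["rock", "paper", "wood"]
# R[a, o]: agent plays row action a, opponent plays column action o.
R = np.array([
    [ 0, -1,  1],   # rock:  draws rock, loses to paper, beats wood
    [ 1,  0, -1],   # paper: beats rock, draws paper, loses to wood
    [-1,  1,  0],   # wood:  loses to rock, beats paper, draws wood
])
```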

  5. Optimal Policy: Matrix Games
  • R(o, a) is the reward to the agent for taking action a while the opponent takes action o.
  • The agent strives to maximize the expected reward while the opponent tries to minimize it.
  • For the mixed strategy p* to be optimal, it must satisfy p* = argmax_p min_o Σ_a p(a) R(o, a),
  • i.e., find the strategy for the agent that has the best "worst-case" performance (a linear-programming sketch follows below).
  • (The rock/paper/wood payoff table from the previous slide is reused here.)
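A minimal sketch of the maximin computation described above, posed as a linear program with scipy.optimize.linprog (an assumed implementation, not the slide's own code). The array R is indexed as R[agent_action, opponent_action], i.e., the transpose of the slide's R(o, a):

```python
import numpy as np
from scipy.optimize import linprog

def maximin_strategy(R):
    """Return (p, v): the agent's maximin mixed strategy and the game value."""
    n_a, n_o = R.shape
    # Decision variables x = [p_1, ..., p_{n_a}, v]; linprog minimizes, so use -v.
    c = np.zeros(n_a + 1)
    c[-1] = -1.0
    # For every opponent action o:  v - sum_a p(a) * R[a, o] <= 0
    A_ub = np.hstack([-R.T, np.ones((n_o, 1))])
    b_ub = np.zeros(n_o)
    # Probabilities sum to one; v is unbounded.
    A_eq = np.hstack([np.ones((1, n_a)), np.zeros((1, 1))])
    b_eq = np.array([1.0])
    bounds = [(0, 1)] * n_a + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:n_a], res.x[-1]
```

For the rock-paper-wood payoffs above, this returns approximately the uniform strategy (1/3, 1/3, 1/3) with game value 0.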

  6. Optimal Policy: MDP and Markov Games
  • A host of methods exists, such as value iteration, policy iteration (both assume complete knowledge of T), and reinforcement learning (RL).
  • Value iteration (VI): use of dynamic programming to estimate value functions; convergence is guaranteed [Bertsekas, 1987].
  • MDP:  V(s) = max_a [ R(s, a) + γ Σ_s' T(s, a, s') V(s') ]
  • Markov games (zero-sum):  Q(s, a, o) = R(s, a, o) + γ Σ_s' T(s, a, o, s') V(s'),  with  V(s) = max_p min_o Σ_a p(a) Q(s, a, o)
  • (A value-iteration sketch for the MDP case follows below.)
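A minimal value-iteration sketch for the MDP case (an assumed implementation of the Bellman backup above, reusing the MDP container from the earlier sketch):

```python
import numpy as np

def value_iteration(mdp, tol=1e-6):
    """Iterate the Bellman backup until the value function stops changing."""
    n_s, n_a, _ = mdp.T.shape
    V = np.zeros(n_s)
    while True:
        # Q[s, a] = R[s, a] + gamma * sum_s' T[s, a, s'] * V[s']
        Q = mdp.R + mdp.gamma * (mdp.T @ V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)   # values and a greedy policy
        V = V_new
```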

  7. Optimal Policy: Markov Games
  • Note:
  • Every MDP has at least one stationary, deterministic optimal policy.
  • An optimal stationary deterministic policy may not exist for a Markov game.
  • The reason is the agent's uncertainty about the opponent's exact move, especially when the agents move simultaneously (unlike turn-taking games such as tic-tac-toe); the optimal policy may therefore need to be stochastic, as with the uniform mixed strategy in rock-paper-wood.

  8. Learning the Optimal Policy: Reinforcement Learning
  • Q-learning was first developed by Watkins in 1989 for optimal policy learning in an MDP without explicit knowledge of T.
  • The agent receives a reward r while making the transition from s to s' by taking action a; T(s, a, s') is implicitly involved in this state transition.
  • Q-learning update: Q(s, a) ← Q(s, a) + α [ r + γ max_a' Q(s', a') − Q(s, a) ].
  • Minimax-Q applies the same principle to the two-player ZS game, replacing the max over actions with the maximin value of the stage game Q(s', ·, ·), computed via a linear program (LP); a sketch of both updates follows below.
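A sketch contrasting the two tabular updates (assumed code, reusing maximin_strategy from the matrix-game sketch above): plain Q-learning backs up max_a' Q(s', a'), while Minimax-Q backs up the maximin value of the stage game at s' obtained from the LP.

```python
def q_learning_update(Q, s, a, r, s_next, alpha, gamma):
    # Q is a NumPy array of shape (|S|, |A|).
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])

def minimax_q_update(Q, s, a, o, r, s_next, alpha, gamma):
    # Q is a NumPy array of shape (|S|, |A|, |O|); the backup value is the
    # maximin value of the matrix game Q[s_next], computed via the LP above.
    _, v_next = maximin_strategy(Q[s_next])
    Q[s, a, o] += alpha * (r + gamma * v_next - Q[s, a, o])
```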

  9. Performance of Minimax-Q
  • Simulated soccer game on a 4x5 grid (20 cells).
  • 5 actions per player: {N, S, E, W, and Stand}.
  • Four different policies were compared:
  1. Minimax-Q trained against a random opponent
  2. Minimax-Q trained against a Minimax-Q opponent
  3. Q trained against a random opponent
  4. Q trained against a Q opponent

  10. Constraints of Minimax-Q: Nash-Q
  • Minimax-Q convergence is guaranteed only for two-player zero-sum games.
  • Nash-Q, proposed by Hu & Wellman '98, maintains a set of approximate Q-functions {Q1, …, Qn} (one per player) and updates them as
  Qk(s, a1, …, an) ← (1 − α) Qk(s, a1, …, an) + α [ rk + γ NashQk(s') ],
  where NashQk(s') is player k's payoff under a one-stage Nash equilibrium policy computed from the current estimates of {Q1, …, Qn} at s'.
  • Note that the Minimax-Q learning scheme is the special case in which the one-stage Nash value is replaced by the maximin (worst-case) value.
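A structural sketch of the Nash-Q backup (assumed code; `stage_nash_value` is a hypothetical helper standing in for the nontrivial computation of a one-stage Nash equilibrium of a general-sum game):

```python
def nash_q_update(Q, k, s, joint_action, r_k, s_next, alpha, gamma,
                  stage_nash_value):
    """One Nash-Q backup for player k.

    Q[j][s] is player j's payoff table for the stage game at state s
    (e.g., an n-dimensional NumPy array indexed by the joint action tuple).
    stage_nash_value(tables, k) returns player k's payoff at a selected Nash
    equilibrium of the stage game defined by the given payoff tables.
    """
    nash_k = stage_nash_value([Q[j][s_next] for j in range(len(Q))], k)
    Q[k][s][joint_action] += alpha * (r_k + gamma * nash_k
                                      - Q[k][s][joint_action])
```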

  11. Single-Stage Nash Equilibrium
  Let's explain via a classic example, the Battle of the Sexes: Chris and Pat each choose Opera or Fight (the 2x2 payoff table appears on the slide; an illustrative version is used in the sketch below).
  • Check that there exist two situations in which no person can single-handedly change his/her action to increase his/her payoff.
  • NE: (opera, opera) and (fight, fight).
  • For an n-player normal-form game, a joint strategy (σ1*, …, σn*) is a Nash equilibrium iff, for every player i and every alternative strategy σi, ui(σ1*, …, σi*, …, σn*) ≥ ui(σ1*, …, σi, …, σn*).
  • Two types of Nash equilibrium: coordinated and adversarial.
  • All normal-form non-cooperative games have a Nash equilibrium, though it may be in mixed strategies.
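A brute-force check for pure-strategy Nash equilibria in a two-player normal-form game (assumed example code; the Battle of the Sexes payoffs below are standard illustrative values, not taken from the slide). It recovers the two coordinated equilibria (opera, opera) and (fight, fight):

```python
import numpy as np

def pure_nash_equilibria(R1, R2):
    """R1[a1, a2], R2[a1, a2]: payoffs to players 1 and 2."""
    eqs = []
    for a1 in range(R1.shape[0]):
        for a2 in range(R1.shape[1]):
            best_1 = R1[a1, a2] >= R1[:, a2].max()   # player 1 cannot improve
            best_2 = R2[a1, a2] >= R2[a1, :].max()   # player 2 cannot improve
            if best_1 and best_2:
                eqs.append((a1, a2))
    return eqs

# Battle of the Sexes, actions 0 = opera, 1 = fight (illustrative payoffs).
R_chris = np.array([[2, 0], [0, 1]])
R_pat   = np.array([[1, 0], [0, 2]])
print(pure_nash_equilibria(R_chris, R_pat))   # -> [(0, 0), (1, 1)]
```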

  12. Analysis of Nash-Q Learning
  • Why update in this way?
  • For a 1-player game (an MDP), Nash-Q reduces to simple maximization, i.e., Q-learning.
  • For zero-sum games, Nash-Q reduces to Minimax-Q, with guaranteed convergence.
  • Cons: the Nash equilibrium is not unique in general (multiple equilibrium points may exist), hence convergence is not guaranteed.
  • Convergence is guaranteed when either:
  • there exists a unique coordinated equilibrium for the entire game and for each stage game defined by the Q-functions throughout learning, or
  • there exists a unique adversarial equilibrium for the entire game and for each stage game defined by the Q-functions throughout learning.

  13. Relaxing the Constraints of Nash-Q: Friend-or-Foe Q
  • The uniqueness requirement on the NE is relaxed in FFQ, but the algorithm needs to know the nature of each opponent: "friend" (coordinated equilibrium) or "foe" (adversarial equilibrium).
  • A convergence guarantee exists for the Friend-or-Foe Q-learning algorithm; the two backup rules are sketched below.
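A sketch of the two Friend-or-Foe backup values for the two-player case (assumed code, reusing maximin_strategy from the matrix-game sketch above): a "friend" declaration uses the unconstrained max over joint actions (coordinated equilibrium), while a "foe" declaration uses the maximin value (adversarial equilibrium), exactly as in Minimax-Q.

```python
import numpy as np

def ffq_state_value(Q_s: np.ndarray, opponent_is_friend: bool) -> float:
    # Q_s[a, o]: the agent's Q-values at one state, over its own action a
    # and the other agent's action o.
    if opponent_is_friend:
        return float(Q_s.max())       # coordinated: best joint action
    _, v = maximin_strategy(Q_s)      # adversarial: best worst-case value
    return v
```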

  14. Analysis of FFQ Learning
  • FFQ provides an RL-based strategy-learning method for multi-player general-sum games.
  • Like Nash-Q, it should not be expected to find a Nash equilibrium unless either a coordinated or an adversarial equilibrium exists.
  • Unlike Nash-Q, FFQ does not require learning Q-estimates for all the players in the game, only its own.
  • FFQ's restrictions are much weaker: it does not require the NE to be unique throughout learning.
  • Both FFQ and Nash-Q fail for games having no NE (infinite games).

  15. Partial Observability of States: POSG
  • The discussion so far assumed the states to be fully observable, although the transition probabilities and reward functions can be learned asynchronously (RL).
  • Partially Observable Stochastic Games (POSG) assume the underlying states are only partially observed through observations.
  • Stochastic games are analogous to MDPs, and so is learning them via RL.
  • A POMDP can be interpreted as an MDP over belief space, with increased complexity due to the continuous belief space (a belief-update sketch follows below). But a POSG cannot be solved by transforming it into a stochastic game over belief space, since each agent's belief is potentially different.
  • E. A. Hansen et al. propose a policy iteration approach for POSG that alleviates the scaling issue in the finite-horizon case via iterative elimination of dominated strategies (policies).
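A minimal POMDP belief-update sketch (assumed code) illustrating the "MDP over belief space" interpretation mentioned above: after taking action a and observing z, the belief over hidden states is updated by Bayes' rule using the transition model T[s, a, s'] and an assumed observation model O[s', a, z].

```python
def belief_update(b, a, z, T, O):
    """Bayes update of belief b (a probability vector over states).

    T[s, a, s'] is the transition model; O[s', a, z] = Pr(z | s', a) is an
    assumed observation model (not defined on the slides). All arguments are
    NumPy arrays.
    """
    b_pred = b @ T[:, a, :]          # predict: sum_s b(s) T[s, a, s']
    b_new = b_pred * O[:, a, z]      # correct: weight by observation likelihood
    return b_new / b_new.sum()       # normalize (assumes Pr(z | b, a) > 0)
```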

  16. Summary
  • The theories of MDPs and Markov games are strongly related.
  • Minimax-Q learning is a Q-learning scheme proposed for two-player ZS games.
  • Minimax-Q is very conservative in its actions, since it chooses a strategy that maximizes the worst-case performance of the agent.
  • Nash-Q was developed for multi-player, general-sum games but converges only under strict restrictions (existence and uniqueness of NE).
  • FFQ relaxes the restrictions (uniqueness) somewhat, but not by much.
  • Most algorithms are reactive, i.e., each agent lets the others choose an equilibrium point and then learns its best response.
  • Under partial observability of states, the problem is not yet scalable.
