Game-Theoretic Multi-Agent Learning By: Mostafa Sahraei-Ardakani
Research Groups: • Stanford University (Yoav Shoham) • Rutgers University (Michael Littman) • University of Michigan (Michael Wellman) • University of Alberta (Michael Bowling) • University of British Columbia (Kevin Leyton-Brown) • McGill University (Shie Mannor) • Brown University (Amy Greenwald) • Carnegie Mellon University
Basic Definitions • Markov Decision Process (MDP) • Stage Games • Repeated Games: a repeated stage game • Stochastic Games (Markov Games): a generalization of repeated games and MDPs
Definitions from the SG point of view • Repeated Game: a stochastic game with only one stage (state) • MDP: a stochastic game with only one agent • So an SG generalizes both RGs and MDPs and combines the properties of each
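To make the relationship concrete, here is a minimal sketch (the container name and field layout are illustrative, not from the talk) of a finite stochastic game, with the MDP and repeated-game special cases noted in the comments:

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class StochasticGame:
    """Hypothetical container for a finite stochastic (Markov) game."""
    states: List[str]                 # each state defines a stage game
    n_players: int
    # actions[state][player] -> that player's action set in that state
    actions: Dict[str, List[List[str]]]
    # reward[(state, joint_action)] -> one payoff per player
    reward: Dict[Tuple[str, Tuple[str, ...]], Tuple[float, ...]]
    # transition[(state, joint_action)] -> distribution over next states
    transition: Dict[Tuple[str, Tuple[str, ...]], Dict[str, float]]

# Special cases:
#  - MDP: n_players == 1, so joint actions reduce to a single agent's actions.
#  - Repeated game: len(states) == 1 and every transition returns to that state.
```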
What is the question? • What exact question(s) is MAL addressing? • What is the yardstick? • Which information is available? • Game rules • Play observability • Rivals' actions • Rivals' strategies • Learning and/or teaching • Rock-Paper-Scissors • Repeated Prisoners' Dilemma
Engineering Application • Distributed Controllers • Simplifies design of independent controllers • Equilibrium or Global Optimum? • Problem of Exploitation of Learning
Model-Based Approaches • Mainly of interest to game theorists • 1. Start with some model of the opponent's strategy • 2. Compute and play the best response • 3. Observe the opponent's play and update the model of her strategy • 4. Go to step 2 • Example: Fictitious Play (1951) • Compute the rivals' mixed strategies from the empirical history of play • Play the best response
Fictitious Play (FP) • Assumes opponents play stationary strategies • Multiple best responses are chosen with positive probability • Convergence guarantees: • Games solvable by iterated dominance (strict Nash equilibrium) • Cooperative games • In zero-sum games the empirical distribution converges to the unique mixed-strategy Nash equilibrium • Note: Smooth FP can play mixed strategies
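A minimal fictitious-play sketch for a two-player zero-sum matrix game; the payoff matrix (rock-paper-scissors), the number of rounds, and the tie-breaking by argmax are illustrative assumptions:

```python
import numpy as np

# Row player's payoffs for rock-paper-scissors (zero-sum: column player gets -A).
A = np.array([[ 0, -1,  1],
              [ 1,  0, -1],
              [-1,  1,  0]])

counts = [np.ones(3), np.ones(3)]   # empirical action counts, initialised uniformly

for t in range(5000):
    # Each player models the rival's strategy as the empirical frequency of past
    # play and then plays a best response to that model.
    belief_about_col = counts[1] / counts[1].sum()
    belief_about_row = counts[0] / counts[0].sum()
    a_row = int(np.argmax(A @ belief_about_col))        # row player's best response
    a_col = int(np.argmax(-(A.T @ belief_about_row)))   # column player's best response
    counts[0][a_row] += 1
    counts[1][a_col] += 1

# In this zero-sum game the empirical distributions approach (1/3, 1/3, 1/3).
print(counts[0] / counts[0].sum(), counts[1] / counts[1].sum())
```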
Incremental Gradient Ascent Learners (IGA) • Incrementally climbs in the mixed-strategy space • For 2-player, 2-action general-sum games • Guarantees convergence to a Nash equilibrium, or convergence to an average payoff that is sustained by some Nash equilibrium
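A sketch of the gradient-ascent idea for a 2-player, 2-action game; the payoffs (a matching-pennies game), the step size, and the simple clipping used as projection onto [0, 1] are assumptions made for illustration:

```python
import numpy as np

R1 = np.array([[1., -1.], [-1., 1.]])   # player 1's payoffs (rows = its actions)
R2 = -R1                                # player 2's payoffs (zero-sum here)

def gradients(alpha, beta):
    """Partial derivative of each player's expected payoff with respect to its
    own probability of playing action 0 (alpha for player 1, beta for player 2)."""
    u1 = R1[0, 0] - R1[0, 1] - R1[1, 0] + R1[1, 1]
    u2 = R2[0, 0] - R2[0, 1] - R2[1, 0] + R2[1, 1]
    d_alpha = beta * u1 + (R1[0, 1] - R1[1, 1])
    d_beta = alpha * u2 + (R2[1, 0] - R2[1, 1])
    return d_alpha, d_beta

alpha, beta, eta = 0.2, 0.8, 0.01       # initial mixed strategies and step size
avg = np.zeros(2)
for t in range(1, 10001):
    d_alpha, d_beta = gradients(alpha, beta)
    # Each player climbs its own payoff gradient, projected back into [0, 1].
    alpha = float(np.clip(alpha + eta * d_alpha, 0.0, 1.0))
    beta = float(np.clip(beta + eta * d_beta, 0.0, 1.0))
    avg += (np.array([alpha, beta]) - avg) / t

# The strategies themselves may cycle (as in matching pennies); IGA's guarantee
# concerns convergence to an equilibrium or to the payoff sustained by one.
print("current:", alpha, beta, " time-average:", avg)
```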
AWESOME! • Adapt When Everybody is Stationary, Otherwise Move to Equilibrium • Is not reinforcement learning • Converges to a Nash equilibrium in self-play • Plays in epochs • Maintains two hypotheses: APPE (all players played the equilibrium) and APS (all players are stationary) • Adapts and finds the best response when the opponents look stationary but off-equilibrium
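A heavily simplified sketch of the decision rule implied by the name; the flag names APPE and APS follow the slide, while the arguments and the omission of epochs, statistical tests, and restarts are my own simplifications:

```python
def awesome_choice(appe, aps, equilibrium_strategy, best_response_to_estimates):
    """Adapt when everybody looks stationary, otherwise move to equilibrium.
    `appe`: hypothesis that all players played the precomputed equilibrium;
    `aps`: hypothesis that all players are playing stationary strategies.
    Both strategy arguments are hypothetical stand-ins."""
    if aps and not appe:
        # Opponents look stationary but are not at the equilibrium:
        # adapt by best-responding to their estimated strategies.
        return best_response_to_estimates
    # Otherwise retreat to (or keep playing) the precomputed equilibrium.
    return equilibrium_strategy
```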
Model-Free Approaches • Reinforcement learning • Avoid building an explicit model of the opponent • Learn directly how well one's own possible actions fare • Mostly studied in computer science / AI
Single-Agent Q-Learning • Applied in a multi-agent setting, the environment (which now contains the other adapting agents) is no longer stationary • Therefore, convergence is not guaranteed
Bellman's Heritage • Single-agent Q-learning converges to the optimal value function V* • Simple extension to the multi-agent SG setting: Q-values are updated without regard to the opponents' actions • Justified only if the opponents' choice of actions is stationary
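A minimal sketch of this "simple extension": the agent runs ordinary tabular Q-learning and updates without reference to the opponents' actions (the state/action-space sizes, learning rate, and discount factor are illustrative assumptions):

```python
import numpy as np

n_states, n_actions = 10, 4            # illustrative sizes
alpha, gamma = 0.1, 0.95               # learning rate and discount factor
Q = np.zeros((n_states, n_actions))    # this agent's own Q-table

def q_update(s, a, r, s_next):
    """Standard single-agent Q-learning update, applied naively in a multi-agent
    world: the opponents are simply folded into the (non-stationary) environment."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
```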
Bellman's Heritage (2) • Cure: define Q-values as a function of all agents' joint actions • Problem: how to update V? • Maximin Q-learning • Problem: motivated only for zero-sum SGs
Minimax Learning • For zero-sum games, or for conservative play in general-sum games
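A sketch of the value computation inside minimax-Q: the zero-sum stage game defined by Q(s, ·, ·) is solved with a small linear program; the use of scipy and the example payoff matrix are assumptions for illustration:

```python
import numpy as np
from scipy.optimize import linprog

def minimax_value(Q_s):
    """Maximin value of the zero-sum stage game Q_s (rows: our actions,
    columns: opponent actions). Returns (value, our maximin mixed strategy)."""
    n, m = Q_s.shape
    # Variables x = [pi_1, ..., pi_n, v]; linprog minimizes, so minimize -v.
    c = np.zeros(n + 1)
    c[-1] = -1.0
    # For every opponent action j:  v - sum_i pi_i * Q_s[i, j] <= 0
    A_ub = np.hstack([-Q_s.T, np.ones((m, 1))])
    b_ub = np.zeros(m)
    # The probabilities must sum to one.
    A_eq = np.hstack([np.ones((1, n)), np.zeros((1, 1))])
    b_eq = np.array([1.0])
    bounds = [(0.0, 1.0)] * n + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[-1], res.x[:n]

# Minimax-Q uses this value in place of the max operator: V(s) = minimax_value(Q[s])[0].
v, pi = minimax_value(np.array([[0., -1., 1.], [1., 0., -1.], [-1., 1., 0.]]))
print(v, pi)   # roughly 0 and a uniform strategy for rock-paper-scissors
```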
Nash Q-Learning for General-Sum Stochastic Games (GSSG) • The max operator of Q-learning is replaced by a Nash operator (Nash-Q): the stage game defined by the current Q-values is solved for a Nash equilibrium, whose value is used in the update
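A sketch of the Nash operator at a single state, here implemented with the third-party nashpy package; how the original work computes and selects among the stage game's equilibria is not stated in the slides, so the selection rule below (take the first equilibrium found) is only a placeholder:

```python
import numpy as np
import nashpy as nash   # third-party package: pip install nashpy

def nash_value(Q1_s, Q2_s):
    """Nash value of the two-player stage game given by the current Q-tables at
    one state (Q1_s, Q2_s: payoff matrices over joint actions)."""
    game = nash.Game(Q1_s, Q2_s)
    pi1, pi2 = next(game.support_enumeration())   # arbitrary selection rule
    v1 = pi1 @ Q1_s @ pi2
    v2 = pi1 @ Q2_s @ pi2
    return v1, v2

# Nash-Q then uses v_i in place of max_a Q_i(s', a) in each agent's update.
```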
Friend or Foe Q-Learning • Adversarial equilibrium: no player is hurt by a change in the opponents' play • Coordination equilibrium: every player attains its highest possible value
Friend or Foe Q-Learning (2) • Opponent considered as friend: V(s) is the maximum of Q(s, a, o) over joint actions • Opponent considered as foe: V(s) is the maximin value of the stage game Q(s, ·, ·)
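A minimal sketch of the two value operators; the foe case reuses the minimax_value linear-programming helper sketched above for minimax-Q, and the matrix shape (our actions by opponent actions) is an illustrative assumption:

```python
import numpy as np

def friend_value(Q_s):
    """Friend-Q: assume the opponent helps maximize our payoff, so the state
    value is the maximum of Q over *joint* actions."""
    return float(np.max(Q_s))

def foe_value(Q_s):
    """Foe-Q: assume the opponent minimizes our payoff, so the state value is
    the maximin value (same linear program as in the minimax-Q sketch above)."""
    return minimax_value(Q_s)[0]
```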
Friend or Foe Q-Learning (3) • The Opponent may act differently! • Results on two common grid games
Correlated Q-Learning • What is a correlated equilibrium? • Example • Benefits over mixed-strategy Nash • The set of correlated equilibria is a convex polytope, so one can be computed by linear programming • Better outcomes and denial • Independent action selection with a shared signal
Correlated Q-Learning (2) • The correlated value function, like the Nash value function, need not be uniquely defined, since a stage game may have many equilibria • Generalizes the previously mentioned value functions
Correlated Q-Learning (3): equilibrium selection objectives • Utilitarian: maximize the sum of all agents' values • Egalitarian: maximize the minimum of the agents' values • Republican: maximize the maximum of the agents' values • Libertarian: each agent maximizes its own value
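A sketch of the utilitarian variant at one state: a correlated equilibrium over joint actions is found by a linear program whose objective is the sum of both players' Q-values, subject to the incentive ("no profitable deviation") constraints. The scipy usage and matrix shapes are illustrative assumptions; the other three variants differ only in the objective:

```python
import numpy as np
from scipy.optimize import linprog

def utilitarian_ce(Q1_s, Q2_s):
    """Utilitarian correlated equilibrium of a two-player stage game
    (Q1_s, Q2_s: n x m payoff matrices). Returns the joint-action distribution."""
    n, m = Q1_s.shape
    c = -(Q1_s + Q2_s).flatten()          # maximize total payoff -> minimize its negative
    A_ub, b_ub = [], []
    # Row player: obeying recommendation i must be at least as good as deviating to i2.
    for i in range(n):
        for i2 in range(n):
            if i != i2:
                row = np.zeros((n, m))
                row[i, :] = Q1_s[i2, :] - Q1_s[i, :]
                A_ub.append(row.flatten())
                b_ub.append(0.0)
    # Column player: obeying recommendation j must be at least as good as deviating to j2.
    for j in range(m):
        for j2 in range(m):
            if j != j2:
                col = np.zeros((n, m))
                col[:, j] = Q2_s[:, j2] - Q2_s[:, j]
                A_ub.append(col.flatten())
                b_ub.append(0.0)
    A_eq = np.ones((1, n * m))            # the joint probabilities sum to one
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  A_eq=A_eq, b_eq=[1.0], bounds=[(0.0, 1.0)] * (n * m))
    return res.x.reshape(n, m)
```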
Platform for MARL • MALT: http://www.cs.ubc.ca/~kevinlb/malt • GAMUT (Stanford): a suite of game generators
New Approach: Time-Order Policy Update • Make the environment stationary • How to observe rivals' actions? • Keep the max operator! • No direct focus on equilibria