Mutually-guided Multi-agent Learning
Raghav Aras, Alain Dutech, François Charpillet (MAIA)
June 2004
Outline
• A review of some multiagent Q-learning approaches
• Our approach for multiagent learning in a stochastic game
• Some preliminary results
Multiagent Q-Learning (1)
Q-Learning (single-agent learning):
• $Q(s_t, a_t) \leftarrow (1 - \alpha)\, Q(s_t, a_t) + \alpha\, [R_t + \gamma \max_{a} Q(s_{t+1}, a)]$
• Known to converge to optimal values
minimax-Q (for zero-sum, 2-player games):
• $V_1(s) \leftarrow \max_{P_1 \in \Pi(A_1)} \min_{a_2 \in A_2} \sum_{a_1 \in A_1} P_1(a_1)\, Q_1(s, (a_1, a_2))$
• Known to converge to optimal values
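As a concrete illustration of the single-agent update above, a minimal tabular Q-learning sketch in Python; the state/action counts and the values of alpha and gamma are illustrative assumptions, not from the slides.

```python
import numpy as np

# Minimal tabular Q-learning sketch (illustrative sizes and parameters;
# alpha is the learning rate, gamma the discount factor).
alpha, gamma = 0.1, 0.9
n_states, n_actions = 10, 3
Q = np.zeros((n_states, n_actions))

def q_update(s, a, r, s_next):
    # Q(s_t, a_t) <- (1 - alpha) Q(s_t, a_t) + alpha [R_t + gamma max_a Q(s_{t+1}, a)]
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * (r + gamma * Q[s_next].max())
```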
Multiagent Q-Learning (2)
Nash-Q learning (for n-agent, general-sum SGs):
• $Q_i(s, a_1, \ldots, a_n) \leftarrow (1 - \alpha)\, Q_i(s, a_1, \ldots, a_n) + \alpha\, [R_i + \gamma\, \mathrm{NashQ}_i(s')]$
  where $\mathrm{NashQ}_i(s') = Q_i(s', \pi_1(s')\, \pi_2(s') \cdots \pi_n(s'))$
• Converges under strict conditions (existence, uniqueness of Nash equilibria)
Drawbacks of Nash-Q learning
• Coordination in the choice of Nash equilibrium
• Observability of all actions and all rewards
• Space complexity (each agent): $n\, |S|\, |A|^n$
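For concreteness, a worked example with illustrative numbers (not from the slides): with n = 3 agents, |S| = 125 states and |A| = 3 actions per agent, each agent stores 3 × 125 × 3³ = 10,125 Q-values; the $|A|^n$ factor makes this grow exponentially with the number of agents.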
The problem that we treat…
n-agent SG $\langle S, A_1, \ldots, A_n, R_1, \ldots, R_n, P \rangle$:
• $R_i : S \times (A_1 \times \cdots \times A_n) \to \mathbb{R}$
• $P : S \times (A_1 \times \cdots \times A_n) \times S \to \{0, 1\}$ (deterministic)
• $\Gamma_i \subseteq S$ (set of equally good goal states)
• $\Gamma = \Gamma_1 \cap \Gamma_2 \cap \cdots \cap \Gamma_n$
• $|\Gamma| \geq 1$ (at least one common goal state)
• An agent's payoff is the same in all its goal states
• Agents' payoffs may be different in the common goal state
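A minimal sketch of this deterministic SG as a data structure, in Python; all names and the dictionary encoding are assumptions for illustration, not from the slides.

```python
from typing import Dict, List, Set, Tuple

JointAction = Tuple[int, ...]  # (a_1, ..., a_n), one action per agent

class DeterministicSG:
    """Sketch of the n-agent SG <S, A_1..A_n, R_1..R_n, P> above;
    since P is deterministic, it is stored as a map S x A -> S'."""
    def __init__(self,
                 rewards: List[Dict[Tuple[int, JointAction], float]],
                 transitions: Dict[Tuple[int, JointAction], int],
                 goals: List[Set[int]]):
        self.rewards = rewards          # rewards[i][(s, a)] = R_i(s, a)
        self.transitions = transitions  # transitions[(s, a)] = s'
        self.goals = goals              # goals[i] = Gamma_i
        # Gamma = Gamma_1 ∩ ... ∩ Gamma_n, assumed non-empty (|Gamma| >= 1)
        self.common_goals = set.intersection(*goals)
```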
Our Interest
• A more realistic assumption for SGs: actions and rewards of other agents are hidden
• Investigating « independent » learning in SGs (which also helps scalability)
• Using communication to forge cooperation
Goal: a single-agent learning algorithm giving maximum payoff to a maximum number of agents
Communication in Our Approach
[Figure: Agent 1 sends a ping message; Agents 2 and 3 receive it]
• Agents send and receive ‘ping messages’
• Sending a message is an action
A ping message…
• …is an (n−1)-sized array of 0s and 1s
• …has no content
Communication-based Q-Values
• Agent state = ⟨game state, message received⟩
• Agent action = ⟨basic action, message to send⟩
• $2^{n-1}$ possible messages
• $M_i$: agent i's message set
• State set = $S \times M_i$
• Action set = $A_i \times M_i$
• Size of Q-value set: $|S| \times |M_i| \times |A_i| \times |M_i|$
• Agent policy $\pi_i : S \times M_i \to A_i \times M_i$
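A minimal sketch of the augmented state/action spaces in Python; encoding messages as integer indices and the sizes used are illustrative assumptions.

```python
import itertools
import numpy as np

n = 3                                   # number of agents
n_states, n_basic_actions = 125, 3      # illustrative sizes

# A ping message is an (n-1)-sized 0/1 array, hence 2^(n-1) possible messages
messages = list(itertools.product((0, 1), repeat=n - 1))
n_msgs = len(messages)                  # 2^(n-1)

# Q indexed by <game state, message received> x <basic action, message sent>
Q = np.zeros((n_states, n_msgs, n_basic_actions, n_msgs))
```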
What do we envisage the messages doing?
• Alert others of proximity to a goal state
• Discover the common goal state
• Enforce preference for the common goal state
Main principle of our algorithm:
• Play safe by inverting actual rewards
• Create artificial rewards based on messages
The Q-comm Learning algorithm
Agent i's initial state: $\sigma_i \leftarrow \langle S, \emptyset \rangle$
• Loop (each agent)
  • Select $\bar{a}_i = \langle a_i, \mathrm{mess}_{send} \rangle$ (Boltzmann, ε-greedy)
  • Execute $\bar{a}_i$, observe reward $R_i$
  • $\sigma'_i \leftarrow \langle S', \mathrm{mess}_{recd} \rangle$ (next state)
  • $RM_i \leftarrow (R_i \cdot \mathrm{mess}_{send}) + (R_i \cdot \mathrm{mess}_{recd})$
  • Invert reward: $R_i \leftarrow -1 \times R_i$
  • $Q_i(\sigma_i, \bar{a}_i) \leftarrow (1 - \alpha)\, Q_i(\sigma_i, \bar{a}_i) + \alpha\, [R_i + RM_i + \gamma \max_{\bar{a}} Q_i(\sigma'_i, \bar{a})]$
  • $\sigma_i \leftarrow \sigma'_i$, $S \leftarrow S'$
• Until $S \in \Gamma_i$ (a goal state)
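A runnable sketch of one agent's episode loop, using the ε-greedy variant. The environment object `env` and its methods are hypothetical stand-ins, and reading $R_i \cdot \mathrm{mess}$ as reward times the number of set bits in the message is an assumption for illustration.

```python
import numpy as np

alpha, gamma, epsilon = 0.1, 0.9, 0.1   # illustrative parameters

def q_comm_episode(env, i, Q, rng):
    """One Q-comm episode for agent i. Q has shape
    (n_states, n_msgs, n_actions, n_msgs), indexed by
    <game state, message received, basic action, message sent>."""
    s, msg_recd = env.reset(), 0                # sigma_i = <S, empty message>
    while True:
        # Select <a_i, mess_send> (epsilon-greedy variant)
        if rng.random() < epsilon:
            a = rng.integers(env.n_actions)
            msg_send = rng.integers(env.n_msgs)
        else:
            a, msg_send = np.unravel_index(Q[s, msg_recd].argmax(),
                                           Q[s, msg_recd].shape)
        # Execute, observe reward R_i, next game state S' and incoming message
        s_next, r, msg_next = env.step(i, a, msg_send)
        # Artificial reward RM_i = (R_i . mess_send) + (R_i . mess_recd);
        # env.bits(m) counts the 1-bits of message m (assumed interface)
        rm = r * env.bits(msg_send) + r * env.bits(msg_recd)
        r = -r                                  # invert the actual reward
        target = r + rm + gamma * Q[s_next, msg_next].max()
        Q[s, msg_recd, a, msg_send] = ((1 - alpha) * Q[s, msg_recd, a, msg_send]
                                       + alpha * target)
        s, msg_recd = s_next, msg_next
        if s in env.goals[i]:                   # until S is a goal state
            break
```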
Test Problem: Find the Winning Number (FWN)
[Figure: example 4-digit array: 9 3 5 8]
An n-digit array (number) controlled by n agents:
• Each agent controls one digit
• Actions: +1, −1, 0
• $\Gamma_i$: list of « winning » numbers for agent i (unknown)
• Each number in $\Gamma_i$ gives equal payoff to agent i
• $\Gamma = \Gamma_1 \cap \Gamma_2 \cap \cdots \cap \Gamma_n$
• $\Gamma$ contains a common « winning » number
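A minimal sketch of the FWN game dynamics in Python (without the message layer); the wrap-around digit arithmetic and the 0/1 payoff values are assumptions, since the slides do not specify them.

```python
import numpy as np

class FWN:
    """Sketch of Find the Winning Number for n agents.
    Digits wrap around modulo 10 (an assumption)."""
    def __init__(self, winning, rng):
        self.winning = winning                      # winning[i] = Gamma_i
        self.n = len(winning)
        self.digits = rng.integers(0, 10, size=self.n)

    def number(self):
        # read the digit array as one n-digit number, e.g. [1, 1, 9] -> 119
        return int("".join(map(str, self.digits)))

    def step(self, moves):
        # moves[i] in {+1, -1, 0}, one move per agent
        self.digits = (self.digits + np.asarray(moves)) % 10
        num = self.number()
        # equal payoff in any of an agent's winning numbers (values assumed)
        return [1.0 if num in g else 0.0 for g in self.winning]

# e.g. the 3-agent goal sets from the results slides:
# FWN([{2, 16, 119}, {68, 102, 119}, {37, 86, 119}], np.random.default_rng(0))
```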
Results (1): 3-agent FWN
$\Gamma_1 = \{2, 16, 119\}$, $\Gamma_2 = \{68, 102, 119\}$, $\Gamma_3 = \{37, 86, 119\}$
Results (2): 3-agent FWN
$\Gamma_1 = \{2, 16, 119\}$, $\Gamma_2 = \{68, 102, 119\}$, $\Gamma_3 = \{37, 86, 119\}$
Results (3): Multiple Common Goals
Agents select one common goal
Results (4): 4-agent FWN
Not all agents are satisfied!
Summary of Results
• Empirically, Q-comm learning finds the common goal
• Works with multiple common goals
• Agents coordinate their equilibrium choice
• Works with up to 3 agents
• Doesn't always work for 4 or more agents
Future work
• Increase scalability by localising communication
• Investigate how it can work for $n \geq 4$
• Analyse convergence
Thank you! Your questions…