Mutually-guided Multi-agent Learning
Raghav Aras, Alain Dutech, François Charpillet (MAIA)
June 2004
Outline
• A review of some multiagent Q-learning approaches
• Our approach for multiagent learning in a stochastic game
• Some preliminary results
Multiagent Q-Learning (1)
Q-Learning (single-agent learning):
• $Q(s_t, a_t) \leftarrow (1 - \alpha)\, Q(s_t, a_t) + \alpha\, [R_t + \gamma \max_{a} Q(s_{t+1}, a)]$
• Known to converge to optimal values
minimax-Q (for zero-sum, 2-player games):
• $V_1(s) \leftarrow \max_{P_1 \in \Pi(A_1)} \min_{a_2 \in A_2} \sum_{a_1 \in A_1} P_1(a_1)\, Q_1(s, (a_1, a_2))$
• Known to converge to optimal values
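As a concrete illustration of the single-agent update above, a minimal tabular Q-learning sketch in Python; the state/action counts and the values of alpha and gamma are illustrative assumptions, not from the slides.

```python
import numpy as np

# Minimal tabular Q-learning sketch (illustrative sizes and parameters;
# alpha is the learning rate, gamma the discount factor).
alpha, gamma = 0.1, 0.9
n_states, n_actions = 10, 3
Q = np.zeros((n_states, n_actions))

def q_update(s, a, r, s_next):
    # Q(s_t, a_t) <- (1 - alpha) Q(s_t, a_t) + alpha [R_t + gamma max_a Q(s_{t+1}, a)]
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * (r + gamma * Q[s_next].max())
```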
Multiagent Q-Learning (2)
Nash-Q learning (for n-agent, general-sum SGs):
• $Q_i(s, a_1, \ldots, a_n) \leftarrow (1 - \alpha)\, Q_i(s, a_1, \ldots, a_n) + \alpha\, [R_i + \gamma\, \mathrm{NashQ}_i(s')]$
  where $\mathrm{NashQ}_i(s') = Q_i(s', \pi_1(s')\, \pi_2(s') \cdots \pi_n(s'))$
• Converges under strict conditions (existence, uniqueness of Nash equilibria)
Drawbacks of Nash-Q learning
• Coordination in the choice of Nash equilibrium
• Observability of all actions and all rewards
• Space complexity (each agent): $n\, |S|\, |A|^n$
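For concreteness, a worked example with illustrative numbers (not from the slides): with n = 3 agents, |S| = 125 states and |A| = 3 actions per agent, each agent stores 3 × 125 × 3³ = 10,125 Q-values; the $|A|^n$ factor makes this grow exponentially with the number of agents.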
The problem that we treat…
n-agent SG $\langle S, A_1, \ldots, A_n, R_1, \ldots, R_n, P \rangle$:
• $R_i : S \times (A_1 \times \cdots \times A_n) \to \mathbb{R}$
• $P : S \times (A_1 \times \cdots \times A_n) \times S \to \{0, 1\}$ (deterministic)
• $\Gamma_i \subseteq S$ (set of equally good goal states)
• $\Gamma = \Gamma_1 \cap \Gamma_2 \cap \cdots \cap \Gamma_n$
• $|\Gamma| \geq 1$ (at least one common goal state)
• An agent's payoff is the same in all its goal states
• Agents' payoffs may be different in the common goal state
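A minimal sketch of this deterministic SG as a data structure, in Python; all names and the dictionary encoding are assumptions for illustration, not from the slides.

```python
from typing import Dict, List, Set, Tuple

JointAction = Tuple[int, ...]  # (a_1, ..., a_n), one action per agent

class DeterministicSG:
    """Sketch of the n-agent SG <S, A_1..A_n, R_1..R_n, P> above;
    since P is deterministic, it is stored as a map S x A -> S'."""
    def __init__(self,
                 rewards: List[Dict[Tuple[int, JointAction], float]],
                 transitions: Dict[Tuple[int, JointAction], int],
                 goals: List[Set[int]]):
        self.rewards = rewards          # rewards[i][(s, a)] = R_i(s, a)
        self.transitions = transitions  # transitions[(s, a)] = s'
        self.goals = goals              # goals[i] = Gamma_i
        # Gamma = Gamma_1 ∩ ... ∩ Gamma_n, assumed non-empty (|Gamma| >= 1)
        self.common_goals = set.intersection(*goals)
```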
Our Interest
• A more realistic assumption for SGs: actions and rewards of other agents are hidden
• Investigating « independent » learning in SGs (which also helps scalability)
• Using communication to forge cooperation
Goal: a single-agent learning algorithm giving maximum payoff to a maximum number of agents
Communication in Our Approach
[Figure: Agent 1 sends a ping message; Agents 2 and 3 receive it]
• Agents send and receive ‘ping messages’
• Sending a message is an action
A ping message…
• …is an (n−1)-sized array of 0s and 1s
• …has no content
Communication-based Q-Values
• Agent state = ⟨game state, message received⟩
• Agent action = ⟨basic action, message to send⟩
• $2^{n-1}$ possible messages
• $M_i$: agent i's message set
• State set = $S \times M_i$
• Action set = $A_i \times M_i$
• Size of Q-value set: $|S| \times |M_i| \times |A_i| \times |M_i|$
• Agent policy $\pi_i : S \times M_i \to A_i \times M_i$
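A minimal sketch of the augmented state/action spaces in Python; encoding messages as integer indices and the sizes used are illustrative assumptions.

```python
import itertools
import numpy as np

n = 3                                   # number of agents
n_states, n_basic_actions = 125, 3      # illustrative sizes

# A ping message is an (n-1)-sized 0/1 array, hence 2^(n-1) possible messages
messages = list(itertools.product((0, 1), repeat=n - 1))
n_msgs = len(messages)                  # 2^(n-1)

# Q indexed by <game state, message received> x <basic action, message sent>
Q = np.zeros((n_states, n_msgs, n_basic_actions, n_msgs))
```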
What do we envisage the messages doing?
• Alert others of proximity to a goal state
• Discover the common goal state
• Enforce preference for the common goal state
Main principle of our algorithm:
• Play safe by inverting actual rewards
• Create artificial rewards based on messages
The Q-comm Learning algorithm
Agent i's initial state: $\sigma_i \leftarrow \langle S, \emptyset \rangle$
• Loop (each agent)
  • Select $\bar{a}_i = \langle a_i, \mathrm{mess}_{send} \rangle$ (Boltzmann, ε-greedy)
  • Execute $\bar{a}_i$, observe reward $R_i$
  • $\sigma'_i \leftarrow \langle S', \mathrm{mess}_{recd} \rangle$ (next state)
  • $RM_i \leftarrow (R_i \cdot \mathrm{mess}_{send}) + (R_i \cdot \mathrm{mess}_{recd})$
  • Invert reward: $R_i \leftarrow -1 \times R_i$
  • $Q_i(\sigma_i, \bar{a}_i) \leftarrow (1 - \alpha)\, Q_i(\sigma_i, \bar{a}_i) + \alpha\, [R_i + RM_i + \gamma \max_{\bar{a}} Q_i(\sigma'_i, \bar{a})]$
  • $\sigma_i \leftarrow \sigma'_i$, $S \leftarrow S'$
• Until $S \in \Gamma_i$ (a goal state)
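A runnable sketch of one agent's episode loop, using the ε-greedy variant. The environment object `env` and its methods are hypothetical stand-ins, and reading $R_i \cdot \mathrm{mess}$ as reward times the number of set bits in the message is an assumption for illustration.

```python
import numpy as np

alpha, gamma, epsilon = 0.1, 0.9, 0.1   # illustrative parameters

def q_comm_episode(env, i, Q, rng):
    """One Q-comm episode for agent i. Q has shape
    (n_states, n_msgs, n_actions, n_msgs), indexed by
    <game state, message received, basic action, message sent>."""
    s, msg_recd = env.reset(), 0                # sigma_i = <S, empty message>
    while True:
        # Select <a_i, mess_send> (epsilon-greedy variant)
        if rng.random() < epsilon:
            a = rng.integers(env.n_actions)
            msg_send = rng.integers(env.n_msgs)
        else:
            a, msg_send = np.unravel_index(Q[s, msg_recd].argmax(),
                                           Q[s, msg_recd].shape)
        # Execute, observe reward R_i, next game state S' and incoming message
        s_next, r, msg_next = env.step(i, a, msg_send)
        # Artificial reward RM_i = (R_i . mess_send) + (R_i . mess_recd);
        # env.bits(m) counts the 1-bits of message m (assumed interface)
        rm = r * env.bits(msg_send) + r * env.bits(msg_recd)
        r = -r                                  # invert the actual reward
        target = r + rm + gamma * Q[s_next, msg_next].max()
        Q[s, msg_recd, a, msg_send] = ((1 - alpha) * Q[s, msg_recd, a, msg_send]
                                       + alpha * target)
        s, msg_recd = s_next, msg_next
        if s in env.goals[i]:                   # until S is a goal state
            break
```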
Test Problem: Find the Winning Number (FWN)
[Figure: example 4-digit array: 9 3 5 8]
An n-digit array (number) controlled by n agents:
• Each agent controls one digit
• Actions: +1, −1, 0
• $\Gamma_i$: list of « winning » numbers for agent i (unknown)
• Each number in $\Gamma_i$ gives equal payoff to agent i
• $\Gamma = \Gamma_1 \cap \Gamma_2 \cap \cdots \cap \Gamma_n$
• $\Gamma$ contains a common « winning » number
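A minimal sketch of the FWN game dynamics in Python (without the message layer); the wrap-around digit arithmetic and the 0/1 payoff values are assumptions, since the slides do not specify them.

```python
import numpy as np

class FWN:
    """Sketch of Find the Winning Number for n agents.
    Digits wrap around modulo 10 (an assumption)."""
    def __init__(self, winning, rng):
        self.winning = winning                      # winning[i] = Gamma_i
        self.n = len(winning)
        self.digits = rng.integers(0, 10, size=self.n)

    def number(self):
        # read the digit array as one n-digit number, e.g. [1, 1, 9] -> 119
        return int("".join(map(str, self.digits)))

    def step(self, moves):
        # moves[i] in {+1, -1, 0}, one move per agent
        self.digits = (self.digits + np.asarray(moves)) % 10
        num = self.number()
        # equal payoff in any of an agent's winning numbers (values assumed)
        return [1.0 if num in g else 0.0 for g in self.winning]

# e.g. the 3-agent goal sets from the results slides:
# FWN([{2, 16, 119}, {68, 102, 119}, {37, 86, 119}], np.random.default_rng(0))
```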
Results (1): 3-agent FWN
$\Gamma_1 = \{2, 16, 119\}$, $\Gamma_2 = \{68, 102, 119\}$, $\Gamma_3 = \{37, 86, 119\}$
Results (2): 3-agent FWN
$\Gamma_1 = \{2, 16, 119\}$, $\Gamma_2 = \{68, 102, 119\}$, $\Gamma_3 = \{37, 86, 119\}$
Results (3): Multiple Common Goals
Agents select one common goal
Results (4): 4-agent FWN
Not all agents are satisfied!
Summary of Results
• Empirically, Q-comm learning finds the common goal
• Works with multiple common goals
• Agents coordinate their equilibrium choice
• Works with up to 3 agents
• Doesn't always work for 4 or more agents
Future work
• Increase scalability by localising communication
• Investigate how it can work for $n \geq 4$
• Analyse convergence
Thank you! Your questions…