A reinforcement learning scheme for a multi-agent card game: learning a POMDP. Hajime Fujita, Yoichiro Matsuno, and Shin Ishii. 1. Nara Institute of Science and Technology 2. Ricoh Co. Ltd. 3. CREST, Japan Science and Technology Corporation. 2003 IEEE International Conference on SMC. With adaptations by L. Schomaker for KI2
Contents • Introduction • Preparation • Card game “Hearts” • Outline of our RL scheme • Proposed method • State transition on the observation state • Mean-field approximation • Action control • Action predictor • Computer simulation results • Summary
Background: completely observable problems • Games are well-defined test-beds for studying reinforcement learning (RL) schemes in various multi-agent environments • Blackjack (A. Perez-Uribe and A. Sanchez, 1998) • Othello (T. Yoshioka, S. Ishii and M. Ito, 1999) • Backgammon (G. Tesauro, 1994) • also: the game of Go, graduation project by Reindert-Jan Ekker
Background: completely observable problems • Games are well-defined test-beds for studying reinforcement learning (RL) schemes in various multi-agent environments • Blackjack (A. Perez-Uribe and A. Sanchez, 1998) • Othello (T. Yoshioka, S. Ishii and M. Ito, 1999) • Backgammon (G. Tesauro, 1994) • What about partially observable problems? • estimate missing information? • predict environmental behaviors?
Research field: reinforcement learning, a challenging study • An RL scheme applicable to a multi-agent environment that is partially observable • The card game “Hearts” (Dutch: hartenjagen) • Multi-agent (four-player) environment • Objective is well-defined • Partially Observable Markov Decision Process (POMDP) • Cards in opponents’ hands are unobservable • Realistic problem • Huge state space • Number of unobservable variables is large • Competitive game with four agents
Card game “Hearts” • Hearts is a 4-player game (multi-agent environment) • Each player has 13 cards at the beginning of the game (partially observable) • Each player plays a card in turn, clockwise • Particular cards carry penalty points: the queen of spades is worth 13 penalty points and each heart 1 penalty point • Objective: to score as few penalty points as possible • Players must contrive strategies to avoid these penalty cards (competitive situation); a scoring sketch follows below
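For readers new to the game, the following minimal Python sketch scores one finished game under the penalty rules just listed (1 point per heart, 13 for the queen of spades). The card encoding and function names are illustrative and do not come from the paper.

# Minimal Hearts scoring sketch (illustrative; not from the original paper).
# A card is a (rank, suit) tuple, e.g. ("Q", "S") for the queen of spades.

def penalty(card):
    """Penalty value of a single card: 1 per heart, 13 for the queen of spades."""
    rank, suit = card
    if suit == "H":
        return 1
    if (rank, suit) == ("Q", "S"):
        return 13
    return 0

def score_game(tricks_won):
    """tricks_won maps a player name to the list of cards that player collected."""
    return {player: sum(penalty(c) for c in cards)
            for player, cards in tricks_won.items()}

if __name__ == "__main__":
    example = {
        "South": [("Q", "S"), ("2", "H")],   # 13 + 1 = 14 penalty points
        "West":  [("A", "C"), ("7", "D")],   # 0 penalty points
    }
    print(score_game(example))  # {'South': 14, 'West': 0}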
Outline of learning scheme • Agent (player) predicts opponents’ actions using acquired environmental model • The next player will probably not discard a spade. So my best action is …
Outline of learning scheme • Agent (player) predicts opponents’ actions using acquired environmental model • The next player will probably not discard a spade. So my best action is … • Computable by brute force?
Outline of learning scheme • Agent (player) predicts opponents’ actions using acquired environmental model • The next player will probably not discard a spade. So my best action is … • Computable by brute force? No! • size of the search space • unknown utility of actions • unknown opponent strategies
Outline of Reinf. Learning scheme • Agent (player) predicts opponents’ actions using acquired environmental model • The next player will probably not discard a spade. So my best action is … • Predicted using acquired environmental model
Outline of our RL scheme • Agent (player) predicts opponents’ actions using acquired environmental model • The next player will probably not discard a spade. So my best action is … • Predicted using acquired environmental model. How? • estimate the unobservable part • reinforcement learning • training on simulated games
Proposed method • State transition on the observation state • Mean-field approximation • Action control • Action predictor
State transition on the observation state • The state transition on the observation state in the game can be calculated as shown on the following slides
State transition on the observation state • The state transition on the observation state in the game can be calculated from: • x: observation (cards in hand + cards on the table) • a: action (card to be played) • s: state (all observable and unobservable cards) • Φ: strategies of each of the opponents • H_t: history of all x and a until time t • K: knowledge of the game
Examples • a: “play the two of hearts” • s: • [unobservable part] • East holds cards u,v,w,…,z • West holds cards a,b,… • North holds cards r,s,… • [observable part = x] • I hold cards f,g,… • cards k,l,… lie on the table • H_t: {{s0,a0}west, {s1,a1}north, …, {st,at}east}
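To make the bookkeeping in these examples concrete, here is a minimal, hypothetical data-structure sketch: the full state s contains every hand, while the observation x exposes only the agent's own hand and the table. The class and field names are my own; the paper does not prescribe a representation.

# Illustrative data structures for the quantities x, s and H_t defined above
# (names are hypothetical; not taken from the paper).
from dataclasses import dataclass, field

Card = tuple[str, str]  # (rank, suit), e.g. ("2", "H")

@dataclass
class Observation:
    """x: the agent's own hand plus the cards on the table."""
    own_hand: list[Card]
    table: list[Card]

@dataclass
class FullState:
    """s: all cards, including the unobservable opponents' hands."""
    hands: dict[str, list[Card]]          # player name -> hand
    table: list[Card] = field(default_factory=list)

    def observation(self, me: str) -> Observation:
        """x: what player `me` can actually see."""
        return Observation(own_hand=list(self.hands[me]), table=list(self.table))

# H_t: the history of observations and actions up to time t,
# stored here as (player, observation, action) triples.
History = list[tuple[str, Observation, Card]]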
State transition on the observation state • The state transition on the observation state in the game can be calculated as follows: the probability of a particular hand and of the cards played at t+1 is the product of {the sum, over all possible card distributions, of their probability given the history up to t and the game knowledge K} and {the sum over the products of the probabilities of all possible actions of the three opponents, each given that opponent’s strategy and the history}; a reconstruction of the expression is sketched below
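The transition equation itself appears as a figure in the original slides and is lost in this transcript. The LaTeX sketch below reconstructs a likely form from the verbal description above and the definitions of x, a, s, Φ, H_t and K; the exact notation (agent superscripts, the term linking the state and actions to the next observation) is an assumption, not a verbatim copy.

% Hedged reconstruction of the observation-state transition (notation assumed):
% agent 1 is the learning agent, agents 2-4 are the opponents; the term
% P(x_{t+1} | s_t, a_t^1,...,a_t^4) encodes how the played cards produce the next observation.
P\bigl(x_{t+1} \mid a_t^1, H_t, K\bigr)
  \;\approx\; \sum_{s_t} P\bigl(s_t \mid H_t, K\bigr)
  \sum_{a_t^2, a_t^3, a_t^4}
     P\bigl(x_{t+1} \mid s_t, a_t^1, a_t^2, a_t^3, a_t^4\bigr)
     \prod_{i=2}^{4} P\bigl(a_t^i \mid s_t, \Phi^i, H_t\bigr)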
State transition on the observation state • Summation over all states? We need an approximation • The state transition on the observation state can in principle be calculated by the expression above • The calculation is intractable • Hearts has a very huge state space: about 5 × 10^28 states!
State transition on the observation state • Summation over all states? We need an approximation • The state transition on the observation state in the game of Hearts can in principle be calculated by the expression above • The calculation is intractable • Hearts has a very huge state space: about 5 × 10^28 states, the number of ways to deal 52 cards to 4 players so that each holds 13 cards, i.e. 52!/(13!)^4 (a quick check follows below)
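A quick check of that count; the figure in the original slide is missing from this transcript, so the value below is simply the deal count described above, computed directly.

# Number of ways to deal 52 cards into four 13-card hands: 52! / (13!)^4
from math import factorial

deals = factorial(52) // factorial(13) ** 4
print(deals)           # 53644737765488792839237440000
print(f"{deals:.2e}")  # ~5.36e+28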
Mean-field approximation • Calculate a mean estimated observation state for each opponent agent • The estimated observation state for an opponent i is a sum of the possible observations x_t, weighted by their probability given an action, the history (and the game knowledge K) • these partial probabilities become known during the game
Mean-field approximation • Calculate the mean estimated observation state for each opponent agent • The transition probability is then approximated using this mean observation state
Mean-field approximation • Calculate the mean estimated observation state for each opponent agent • The transition probability is approximated using this mean observation state • so that the distribution of the conditional probability of an action by opponent i can be determined, i.e., given that opponent’s estimated “unobservable state”; a sketch of the approximation is given below
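As with the transition equation, the mean-field formulas are figures in the original deck. The following LaTeX sketch reconstructs their likely form from the description above, with the exact conditioning and symbols assumed: the opponent's observation is replaced by its expectation, and the opponent's action probability is evaluated at that single mean point instead of being summed over all consistent states.

% Hedged sketch of the mean-field approximation (notation assumed).
% Mean observation state of opponent i at time t:
\bar{x}_t^{\,i} \;=\; \sum_{x_t^i} x_t^i \, P\bigl(x_t^i \mid a_t, H_t, K\bigr)
% Opponent i's action probability is then evaluated at this single mean point:
P\bigl(a_t^i \mid s_t, \Phi^i, H_t\bigr) \;\approx\; P\bigl(a_t^i \mid \bar{x}_t^{\,i}, \Phi^i\bigr)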
Action control: TD reinforcement learning • An action is selected based on the expected TD error • Using the expected TD error, an action selection probability is assigned to each legal card; a sketch of both quantities is given below
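The concrete formulas are again missing from this transcript. A common form, consistent with the slide's wording but written here as an assumption rather than the paper's exact definition, computes the expected TD error of a candidate action from the predicted next observation and turns it into a Boltzmann (softmax) selection probability with temperature T.

% Hedged sketch (assumed form): expected TD error of candidate action a_t in observation x_t
\delta(x_t, a_t) \;=\; \mathbb{E}\bigl[\, r_{t+1} + \gamma V(x_{t+1}) \mid x_t, a_t \,\bigr] \;-\; V(x_t)
% Boltzmann (softmax) action selection with temperature T over the legal actions a'
P(a_t \mid x_t) \;=\; \frac{\exp\bigl(\delta(x_t, a_t)/T\bigr)}{\sum_{a'} \exp\bigl(\delta(x_t, a')/T\bigr)}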
Action prediction • We use a function approximator (NGnet, a normalized Gaussian network) for the utility function, which is likely to be non-linear • Function approximators can be trained using past games; a forward-pass sketch follows below
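For readers unfamiliar with the NGnet, here is a minimal sketch of its forward pass: normalized Gaussian basis functions softly partition the input space, and each unit contributes a local linear model. The parameter values are random placeholders and the training procedure (the original work fits the NGnet on past games) is not reproduced; everything below is illustrative only.

# Minimal NGnet (normalized Gaussian network) forward pass, for illustration only.
# Parameters (centers, widths, linear models) would normally be fitted on past games.
import numpy as np

def ngnet_predict(x, centers, widths, W, b):
    """Predict a scalar output for input x.
    centers: (M, D) Gaussian centers, widths: (M,) isotropic std devs,
    W: (M, D) linear weights, b: (M,) biases of the local linear models."""
    d2 = ((x - centers) ** 2).sum(axis=1)   # squared distance to each center
    g = np.exp(-0.5 * d2 / widths ** 2)     # unnormalized Gaussian activations
    n = g / g.sum()                         # normalized activations (sum to 1)
    local = W @ x + b                       # output of each local linear model
    return float(n @ local)                 # softly blended prediction

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    D, M = 4, 8                             # input dimension, number of units
    centers = rng.normal(size=(M, D))
    widths = np.full(M, 1.0)
    W = rng.normal(size=(M, D))
    b = np.zeros(M)
    print(ngnet_predict(rng.normal(size=D), centers, widths, W, b))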
Summary of proposed method • An RL scheme based on • estimation of unobservable state variables • prediction of opponent agents’ actions • Unobservable state variables are estimated by mean-field approximation • The learning agent determines its action based on its prediction of the environmental behaviors
Computer simulations • Rule-based agent • Single agent learning in a stationary environment • Learning by multiple agents in a multi-agent environment
Computer simulations • Three experiments to evaluate the learning agent against a rule-based agent • Single-agent learning in a stationary environment • (A) learning agent vs. rule-based agent x3 • Learning by multiple agents in a multi-agent environment • (B) learning agent vs. actor-critic agent and rule-based agent x2 • (C) learning agent x2 vs. rule-based agent x2 • The rule-based agent has more than 50 rules and plays at the level of an experienced Hearts player
[Figure, experiment (A): average penalty ratio vs. number of games for the proposed RL agent and three rule-based agents (lower penalty ratio = better player).]
[Figure, experiment (B): average penalty ratio vs. number of games for the proposed RL agent, an actor-critic agent, and two rule-based agents (lower penalty ratio = better player).]
[Figure, experiment (C): average penalty ratio vs. number of games for two proposed RL agents and two rule-based agents (lower penalty ratio = better player).]
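The evaluation metric in these plots is an "average penalty ratio". Its exact definition is not given in this transcript, so the sketch below assumes one plausible reading: the learning agent's share of the total penalty points in each game, smoothed over a window of recent games. The function names and the toy data are illustrative only.

# Hypothetical "average penalty ratio" curve: the agent's share of the penalty
# points per game, smoothed with a moving average (definition assumed, not from the paper).
def penalty_ratio(agent_points, total_points):
    return agent_points / total_points if total_points else 0.0

def moving_average(values, window=100):
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out

if __name__ == "__main__":
    # toy data: (agent's penalty points, total penalty points handed out) per game
    games = [(7, 26), (3, 26), (13, 26), (0, 26)]
    ratios = [penalty_ratio(a, t) for a, t in games]
    print(moving_average(ratios, window=2))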
Summary • We proposed an RL scheme for building an autonomous learning agent that plays the multi-player card game “Hearts” • Our RL agent estimates unobservable state variables using a mean-field approximation, and learns and predicts the environmental behaviors • Computer simulations showed that our method is applicable to a realistic multi-agent problem
Nara Institute of Science and Technology (NAIST) Hajime FUJITA hajime-f@is.aist-nara.ac.jp http://hawaii.aist-nara.ac.jp/~hajime-f/