10 likes | 173 Views
Using Reinforcement Learning to Model True Team Behavior in Uncertain Multiagent Settings in Interactive DIDs Muthukumaran Chandrasekaran THINC Lab, CS Department The University of Georgia mkran@uga.edu. Department of Computer Science UGA. Introduction. Approach. Discussion.
E N D
Using Reinforcement Learning to Model True Team Behavior in Uncertain Multiagent Settings in Interactive DIDs Muthukumaran Chandrasekaran THINC Lab, CS Department The University of Georgia mkran@uga.edu Department of Computer Science UGA Introduction Approach Discussion • This work’s contribution is two-fold: • We investigate modeling true team behavior of agents using reinforcement learning (RL) in the context of the Interactive Dynamic Influence Diagram (I-DID) framework for solving multiagent decision making problems under uncertain settings. This presents us with a non-trivial problem that requires setting up learning agents in such a way that they could learn the existence of external factors such as the existence of other agents, thereby modeling teamwork in a non-traditional but systematic way. • So far, we have only experimented with I-DIDs on 2-agent settings. So, we seek to show the framework’s flexibility by designing & testing I-DIDs for 3 or more agent scenarios. Implementing Teamwork: A Paradox Are we modeling true teamwork correctly? - An issue common to all existing sequential decision making frameworks. Consider a 2 agent team setting where agents i and j are levels 1 and 0 in their thinking capacity respectively. Thus, according to j, it is acting alone in the environment. So, technically agent j doesn’t know that it is a part of a team. So agent i (which models j) is forced to follow j’s actions in order to maximize the joint reward even though it may not be the team’s optimal joint action. Algorithm When the environment emits a joint observation o ∈ O, indicative of how the environment has changed after actions have been performed by both agents in the previous step, the agents choose a joint action a from a finite set A(o) based on these observations. A deterministic reactive team policy, π, maps each o ∈ O to some a ∈ A(o). The goal of the agents is to compute a team policy, π : o → a based on the experience gathered. The algorithm begins by assuming a start state distribution for the agents and episodic tasks with one or more terminal states. First, agent j performs its actions according to the MCESP algorithm and agent i performs its actions based on the Dec-POMDP policy of the other agent, j’, that i assumes, is a part of its team. The Dec-POMDP policy computed using the JESP algorithm [3] using the MADP toolkit [4]. Then based on the agents’ joint actions, the next state is first sampled using the transition function, followed by the observations for both agents using the observation function. Agent j follows MCESP while Agent i follows agent j’’s DEC-POMDP policy to make their respective next decisions. This way, a trajectory τ={o0, a0, r0, o0, a0, r0, ..., oT} (where oTis an observation corresponding to a terminal state) in the MCESP algorithm is generated. The rest of the MCESP procedure is similar to the original one described in [2]. Fig. 4: RL in POMDPs (using MCESP) This work is significant as it bridges the Interactive POMDP and Decentralized POMDP frameworks together. We model true team behavior of agents using RL in the context of I-DIDs and can be extended to any multiagent decision making framework. In future, we also hope to show flexibility of the I-DID framework by conducting scalability tests and extending it to scenarios containing 3 or more agents. Background I-DIDs have nodes (decision (rectangle), chance (oval), utility (diamond), model (hexagon)), arcs (functional, conditional, informational), links (policy (dashed), model update (dotted)) (shown in Fig 1). I-DIDs are graphical counterparts of I-POMDPs [1]. Fig. 1 shows a two time slice level l, I-DID. In order to induce team behavior, our algorithm uses a variant of the RL algorithm called Monte-Carlo Exploring Starts for POMDPs (MCESP) [2] for learning the level 0 policies that uses the new definition of action value, that provides info about the value of policies in a local neighborhood of the current policy. Fig. 4 shows an overview of RL in POMDPs. References P. Doshi, Y. Zeng, and Q. Chen, Graphical models for interactive POMDPs: Representations and solutions, JAAMAS, 2009. T. J. Perkins, Reinforcement Learning for POMDPs based on Action Values and Stochastic Optimization, AAAI, 2002. R. Nair, M. Tambe, M. Yokoo, D. Pynadath, and S. Marsella, Taming Decentralized POMDPs: Towards Efficient Policy Computation for Multiagent Settings, IJCAI, 2003. Matthijs T. J. Spaan and Frans A. Oliehoek, The MultiAgent Decision Process toolbox, MSDM Workshop, AAMAS, 2008. Experiments Acknowledgments Test Domains – Variants of the Multiagent Box Pushing (fig. 3) and the Multiagent Meeting in a Grid (fig. 2) Problems Implementation – Exact-BE [1] and MCESP [2] using HUGIN C++ API for Dynamic Influence Diagrams (DIDs) I thank Dr. Prashant Doshi, Dr. Yifeng Zeng and his students for their valuable contributions in the implementation of this work. This research is partially supported by the Obel Family Foundation (Denmark) and the NSFC (#60974089 and #60975052) grant to Dr. Yifeng Zeng and NSF CAREER grant (#IIS-0845036) to Dr. Prashant Doshi. Fig. 1. A Generic Two Time Slice Level l I-DID