Between Collaboration and Competition: An Initial Formalization using Distributed POMDPs
Praveen Paruchuri, Milind Tambe (University of Southern California)
Spiros Kapetanakis (University of York, UK)
Sarit Kraus (Bar-Ilan University, Israel and University of Maryland, College Park)
July 2003
Motivation
• Many domains exist in which agents act as a team but need to maintain some self-interest.
• Electric Elves: agents make decisions on behalf of their users but act as a team, e.g., when arranging a meeting.
• SDR (Software for Distributed Robotics): 100+ robots must locate and protect objects.
• The robots must also ensure their own survival, e.g., by recharging their batteries.
The Problem
• A framework for teams of agents that maintain private goals in stochastic, complex, and dynamic environments.
• Agents need to maximize joint objectives and yet honor private preferences.
• Private versus team interest: the two might conflict.
• Build a framework based on distributed POMDPs for policy generation.
• Analyze the complexity of policy generation.
Previous Work
• Distributed POMDPs such as COM-MTDP:
  • have a single joint reward;
  • the optimal policy maximizes the joint value (see Ex1 on the next slide);
  • the solution is not stable.
• Stochastic games:
  • have individual rewards;
  • the policy finds an equilibrium solution, with stability as the key concept (see Ex2 on the next slide);
  • the solution may not be favorable both individually and as a team.
Motivation: Simple Examples
• One-shot games without stochastic elements.
• Ex1: Two people need to meet; one prefers 4pm, the other 5pm.
  • When should they meet?
  • Each needs to compromise to some extent, but not totally.
  • No meeting is bad for both, so they should agree on a mutually acceptable solution (a worked sketch of this example follows below).
• Ex2: A team of robots works on a task.
  • Each robot has a limited battery.
  • Each must reserve its last n% of battery to refuel itself; otherwise it dies.
  • The robots need to achieve the team goal while none of them dies.
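To make Ex1 concrete, here is a minimal sketch of the meeting game in Python. The payoff numbers are illustrative assumptions, not taken from the paper; they are chosen so that maximizing the joint reward alone picks an outcome one agent finds unacceptable, while a threshold-constrained selection (the E-MTDP idea introduced later) picks the compromise.

```python
# One-shot meeting game; (r1, r2) are the agents' individual rewards.
# All payoff values are illustrative assumptions.
payoffs = {
    ("4pm", "4pm"): (6, 1),  # agent 1's preferred time
    ("5pm", "5pm"): (2, 4),  # agent 2's preferred time
    ("4pm", "5pm"): (0, 0),  # no meeting: bad for both
    ("5pm", "4pm"): (0, 0),
}

# Pure joint-reward maximization (single joint reward, COM-MTDP style):
joint_best = max(payoffs, key=lambda a: sum(payoffs[a]))
print("joint-max outcome:", joint_best, payoffs[joint_best])  # ('4pm', '4pm'); agent 2 gets only 1

# Threshold-constrained selection: maximize the joint reward subject to
# each agent clearing a private threshold.
T1, T2 = 2, 2
acceptable = {a: r for a, r in payoffs.items() if r[0] >= T1 and r[1] >= T2}
compromise = max(acceptable, key=lambda a: sum(acceptable[a]))
print("constrained outcome:", compromise, acceptable[compromise])  # ('5pm', '5pm')
```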
MTDP: A Distributed POMDP Model
• An MTDP is a tuple <S, A(α), P, Ω(α), O(α), B(α), R> where:
• S is a set of world states.
• A(α) is the set of combined team actions: A(α) = ∏ᵢ A(i), where A(i) is the set of domain-level actions of agent i.
• P is the probability distribution governing the effect of domain-level actions: P(s, a, s′) = Pr(s′ | s, a).
• Ω(α) is the joint set of observations, and O(α) is the joint observation function over Ω(α).
• B(α) is the combination of all the agents' sets of possible belief states.
• R is the common reward of the team: R : S × A(α) → ℝ.
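A minimal data-structure sketch of this tuple (the Python field names are my own; belief states B(α) are left implicit here, since they are derived from observation histories):

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Tuple

State = str
JointAction = Tuple[str, ...]       # one domain-level action per agent
JointObservation = Tuple[str, ...]  # one observation per agent

@dataclass(frozen=True)
class MTDP:
    states: FrozenSet[State]                                  # S
    joint_actions: FrozenSet[JointAction]                     # A(alpha), product of the A(i)
    transition: Callable[[State, JointAction, State], float]  # P(s, a, s') = Pr(s' | s, a)
    observations: FrozenSet[JointObservation]                 # Omega(alpha)
    observation_fn: Callable[[State, JointAction, JointObservation], float]  # O(alpha)
    reward: Callable[[State, JointAction], float]             # R : S x A(alpha) -> reals
```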
E-MTDP: Formally Defined
• An E-MTDP is a tuple <S, A(α), P, Ω(α), O(α), B(α), R>, where S, A(α), P, Ω(α), O(α), and B(α) are as defined for the MTDP.
• R = <R1, R2, …, Rn, Rα>, where:
  • R1, R2, …, Rn are the individual rewards of agents 1, 2, …, n;
  • Rα is the joint reward of the n agents, with Rα = γ·R1 + δ·R2 + …
• Both individual and joint rewards can thus be expressed.
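A short sketch of this reward structure (the slide gives the two-agent weighted sum γ·R1 + δ·R2; extending it to n agents with one weight per agent is my assumption):

```python
from typing import Callable, Sequence, Tuple

RewardFn = Callable[[str, Tuple[str, ...]], float]  # (state, joint action) -> reward

def make_joint_reward(individual: Sequence[RewardFn],
                      weights: Sequence[float]) -> RewardFn:
    """Build R_alpha = gamma*R1 + delta*R2 + ... as a weighted sum of the
    agents' individual reward functions."""
    def r_alpha(state, joint_action):
        return sum(w * r(state, joint_action)
                   for w, r in zip(weights, individual))
    return r_alpha

# Toy usage (rewards and weights are illustrative):
r1 = lambda s, a: 1.0 if a[0] == "ask" else 0.0
r2 = lambda s, a: 1.0 if a[1] == "work" else 0.0
r_alpha = make_joint_reward([r1, r2], [0.5, 0.5])
print(r_alpha("s0", ("ask", "work")))  # 1.0
```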
E-MTDP Policy
• A policy maps belief states to actions: πᵢ : Bᵢ → Aᵢ.
• The policy generator is centralized.
• The policy π is chosen such that:
  • V1(π) > T1 and V2(π) > T2; and
  • for every π′ ≠ π with V1(π′) > T1 and V2(π′) > T2, V(π) > V(π′),
  where T1 and T2 are the thresholds of agents 1 and 2, V1 and V2 are the values the policy yields for agents 1 and 2, and V is the overall value of the policy without splitting.
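This selection rule can be written directly as a filter-then-maximize step. The sketch below assumes an enumerable set of candidate policies with precomputed value functions, which is my simplification rather than the paper's policy-generation algorithm:

```python
from typing import Callable, Iterable, Optional, TypeVar

Policy = TypeVar("Policy")

def select_policy(candidates: Iterable[Policy],
                  V1: Callable[[Policy], float],
                  V2: Callable[[Policy], float],
                  V: Callable[[Policy], float],
                  T1: float, T2: float) -> Optional[Policy]:
    """Among policies whose individual values clear both agents' thresholds,
    return the one maximizing the overall (unsplit) value V."""
    feasible = [p for p in candidates if V1(p) > T1 and V2(p) > T2]
    if not feasible:
        return None  # no policy honors both private thresholds
    return max(feasible, key=V)
```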
Novelties of E-MTDP
• Maintains individual rewards for each agent as well as a joint reward for the team.
• The solution concept is novel because the optimal policy both:
  • maximizes the joint reward, and
  • ensures a certain minimum expected value for each individual team member.
Experimental Validation
• Goal: show the utility of E-MTDP.
• Electric Elves (E-Elves) is a real system based on MDPs:
  • based on maximizing a single joint reward;
  • re-expressing it as an E-MTDP helped improve its performance.
• E-Elves is a published real-world multi-agent system:
  • used at USC/ISI for 6 months;
  • its agents, called proxies, reschedule meetings, decide whether to present talks on behalf of their users, order meals, track user locations, etc.
Electric Elves
• Focus on the task of rescheduling meetings.
• A single-agent MDP is used to model each agent.
• Actions include delaying or canceling a meeting, asking the user, etc.
• Asking the user for input is critical.
• Time constraints might prevent the agent from asking the user for input.
• The policy generator uses the notion of team reward to decide on actions.
• There is no notion of individual reward.
Perceived Problem and Improvement
• The original formulation had R(α) and R(user) terms [1].
• However, R(α) + R(user) is maximized during policy generation.
• As R(α) increased with R(user) held constant, the agent stopped asking the user.
• As R(α) increases, the cost of the uncertainty in getting a response from the user exceeds δ, the gain in decision quality from the user's feedback; hence the decision is taken without asking. (A toy numeric sketch of this tradeoff follows below.)
• The user might want a different decision.
• The user can set the importance of a meeting via R(user).
• If the user is important, the agent needs to make a correct decision regarding the user.
• The user's opinion becomes important, affecting the number of asks.
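Here is a toy model of that tradeoff. All parameter values are illustrative assumptions, not taken from the E-Elves model; the point is only that when asking risks delaying a meeting whose team value R(α) is large, the constant expected gain from the user's feedback is eventually outweighed:

```python
# Toy ask/decide tradeoff: asking risks losing a fraction of the meeting's
# team value R_alpha to delay, but may improve the decision by delta.
# All numbers below are illustrative assumptions.
def value_ask(r_alpha, p_timeout=0.2, loss_frac=0.5, p_respond=0.8, delta=5.0):
    return r_alpha - p_timeout * loss_frac * r_alpha + p_respond * delta

def value_decide_alone(r_alpha):
    return r_alpha

for r_alpha in (1.0, 10.0, 100.0):
    asks = value_ask(r_alpha) > value_decide_alone(r_alpha)
    print(f"R_alpha={r_alpha:5.1f}: asks the user -> {asks}")
# As R_alpha grows, the expected delay cost dominates and the agent stops asking.
```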
Original Elves Result
• x-axis: value of the meeting without the user.
• y-axis: number of times the agent asks the user.
• The number of asks decreases as R(α) increases.
• The agent sometimes cancels an important meeting without asking the user, at very high cost [1].
E-MTDP based E-Elves
• Solving using E-MTDP, with two agents:
  • priv1 = R(user), agent 1's private reward;
  • priv2 = R(α), agent 2's private reward.
• Require priv1 ≥ threshold.
• The number of asks now depends on the threshold.
• With the user's importance (priv1) set high, the agent asks the user for input before deciding, unlike earlier.
• Setting the threshold appropriately is important to obtain the required behavior (see the sketch below).
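A hedged sketch of how the threshold on priv1 changes the selected behavior. The candidate policies and their expected values below are invented for illustration, not E-Elves data:

```python
# Candidate policies with precomputed expected values (illustrative numbers):
# (name, V_user = priv1, V_team = priv2)
candidates = [
    ("never_ask",   1.0, 12.0),
    ("ask_if_time", 4.0,  8.0),
    ("always_ask",  6.0,  5.0),
]

def best_policy(user_threshold):
    """Maximize V_user + V_team among policies with V_user >= user_threshold."""
    feasible = [c for c in candidates if c[1] >= user_threshold]
    return max(feasible, key=lambda c: c[1] + c[2], default=None)

for t in (0.0, 3.0, 5.0):
    print(f"user threshold {t}: {best_policy(t)}")
# Raising the user's threshold shifts the chosen policy toward asking more often.
```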
E-MTDP Result
• From the result graph, giving the user the flexibility to set a threshold can result in the agent asking him more often.
• The user's opinion is taken into consideration.
• "Flexibility" is the key word: users like control over their agents.
Conclusions
• A framework for teams of self-interested agents.
• E-MTDP presented as a solution concept.
• E-MTDP applied to E-Elves:
  • improved system performance, measured in terms of the number of asks;
  • fine-tuning of agents according to user needs is now possible.
Future Work
• Fine-tune the existing E-MTDP framework.
• Analyze the complexity of E-MTDP policies.
• Analyze the stability of E-MTDP solutions.

References
1. Scerri, P., Pynadath, D. V., and Tambe, M. Towards Adjustable Autonomy for the Real World. JAIR, 2002.

Thank you. Any questions?
Stability of the Solution
• Designed a multistage game under which the E-MTDP policy is stable.