Software Multiagent Systems: Lecture 13 Milind Tambe University of Southern California tambe@usc.edu
Teamwork: when agents act together
Understanding Teamwork • Ordinary traffic • Driving in a convoy • Two friends A & B together drive in a convoy • B is secretly following A • Pass play in Soccer • Contracting with a software company • Orchestra
Understanding Teamwork • Together: a joint goal • "Collaborate" = co-labor, to work together • Not just a union of simultaneous coordinated actions • Different from contracting
Why Teams • Robust organizations • Responsibility to substitute • Mutual assistance • Information communicated to peers • Still capable of structure (not necessarily flat) • Subteams, subsubteams • Variations in capabilities and limitations
Approach: from theory to practical teamwork architectures
Key Approaches in Multiagent Systems • Distributed Constraint Optimization (DCOP) • Distributed POMDPs • Market mechanisms / auctions • Belief-Desire-Intention (BDI): logics and psychology • Hybrid DCOP / POMDP / auction / BDI approaches • Essential in large-scale multiagent teams • Synergistic interactions • Joint persistent goal from BDI joint-intentions theory: (JPG p) ≡ (MB ¬p) ∧ (MG p) ∧ (Until [(MB p) ∨ (MB ¬◊p)] (WMG p)), i.e., mutual belief that p does not yet hold, a mutual goal to achieve p, and a weak mutual goal to achieve p until it is mutually believed achieved or unachievable • [Figure: DCOP constraint graph over variables x1–x4]
Key Approaches for Multiagent Teams • DCOP: local interactions • Distributed POMDPs: uncertainty • BDI: human usability & plan structure • Markets: local utility • Hybrids (e.g., BDI-POMDP) combine these strengths
Distributed POMDPs • Three papers on the web pages • What to read: ignore all the proofs; ignore the complexity results • JAIR article: the model and the results at the end • Understand the fundamental principles
Multiagent Team Decision Problem (MTDP) • MTDP: ⟨S, A, P, Ω, O, R⟩ • S: world states s1, s2, s3, … • A single global world state, one per epoch • A: domain-level actions; A = {A1, A2, A3, …, An} • Ai is the set of actions for agent i • The agents act through a joint action ⟨a1, …, an⟩
MTDP • P: transition function, P(s' | s, a1, a2, …, an) • R_A: reward function, R_A(s, a1, a2, …, an) • One common team reward, not separate per-agent rewards • This shared reward is central to teamwork
MTDP (cont'd) • Ω: observations • Each agent i has its own finite set of possible observations Ω1, Ω2, … • O: observation function • O(destination state, joint action, joint observation) • O(s', a1, …, an, o1, …, on) = Pr(o1, o2, …, on | a1, a2, …, an, s') • (A representation sketch follows below)
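To make the tuple concrete, here is a minimal sketch of how an MTDP might be held in code. The container and field names are illustrative assumptions for this lecture, not part of the model definition.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

JointAction = Tuple[str, ...]   # one action per agent
JointObs = Tuple[str, ...]      # one observation per agent

@dataclass
class MTDP:
    """Illustrative container for <S, A, P, Omega, O, R>."""
    states: List[str]                                      # S: global world states
    actions: List[List[str]]                               # A: actions[i] = action set of agent i
    transition: Callable[[str, str, JointAction], float]   # transition(s_next, s, a) = P(s' | s, a)
    observations: List[List[str]]                          # Omega: observations[i] = obs set of agent i
    obs_fn: Callable[[JointObs, JointAction, str], float]  # obs_fn(o, a, s_next) = Pr(o1..on | a1..an, s')
    reward: Callable[[str, JointAction], float]            # reward(s, a): one shared team reward
```

The single shared reward function, taking the joint action of all agents, is what distinguishes this from a collection of independent POMDPs.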
Simple Scenario • Cost of each action: -0.2 • The agents must fight fires together • Each agent observes only its own location and the fire status • [Figure: scenario map with rewards of +20 and +40 for the two fires]
MTDP Policy • The problem: find optimal JOINT policies, one policy per agent • π_i: action policy for agent i • Maps the agent's belief state into domain actions: B_i → A_i for each agent • Belief state: the agent's sequence of observations so far
MTDP Domain Types • Collectively partially observable: the general case, no assumptions • Collectively observable: the team as a whole observes the state • For every joint observation there is a state s such that, for all other states s' ≠ s, Pr(o1, o2, …, on | s') = 0 • Then what is Pr(o1, o2, …, on | s)? What is Pr(s | o1, o2, …, on)? • Individually observable: each agent observes the state • For every individual observation oi there is a state s such that, for all other states s' ≠ s, Pr(oi | s') = 0 • (A small check of the collective condition is sketched below)
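The collective-observability condition can be checked mechanically. A small sketch, assuming the illustrative MTDP container from earlier; the function name and tolerance are assumptions.

```python
from itertools import product

def is_collectively_observable(mtdp, tol=1e-12):
    """True if every possible joint observation identifies a unique state s,
    i.e. for each joint observation o, Pr(o | s') = 0 for all but one state s."""
    for joint_action in product(*mtdp.actions):
        for joint_obs in product(*mtdp.observations):
            # states that could have produced this joint observation
            consistent = [s for s in mtdp.states
                          if mtdp.obs_fn(joint_obs, joint_action, s) > tol]
            if len(consistent) > 1:   # the same joint observation is possible in two states
                return False
    return True
```

When the check passes, each joint observation that can occur is consistent with exactly one state, so Pr(s | o1, …, on) = 1 for that state, which is the answer to the questions posed above.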
From MTDP to COM-MTDP • Two separate kinds of actions: communication vs. domain actions • Two separate reward types: communication rewards and domain rewards • Total reward: the sum of the two rewards • Explicit treatment of communication • Enables analysis of teamwork and communication
Communicative MTDPs (COM-MTDPs) • Σ: communication capabilities, the possible "speech acts" • e.g., "I am moving to fire1." • R_Σ: communication cost (over messages) • e.g., saying "I am moving to fire1" has a cost • R_Σ ≤ 0 • So why ever communicate?
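The "total reward is the sum of the two rewards" point from the previous slide can be written out explicitly. A standard way to present it, with σ denoting the joint message (the exact notation here is an assumption):

```latex
% Per-step COM-MTDP reward: domain reward plus (non-positive) communication reward
R(s, a_1,\dots,a_n, \sigma) \;=\; R_A(s, a_1,\dots,a_n) \;+\; R_\Sigma(s, \sigma),
\qquad R_\Sigma(s,\sigma) \le 0
```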
Two-Stage Decision Process • [Figure: agent architecture. The agent observes the world and exchanges communications with teammates; state estimator SE1 produces pre-communication belief state b1, which the communication policy P1 maps to messages; state estimator SE2 incorporates incoming communication into post-communication belief state b2, which the action policy P2 maps to domain actions.] • P1: communication policy, P2: action policy • Two state estimators, two belief-state updates
COM-MTDP Continued • B: belief states (each B_i is a history of observations and communications) • Two-stage belief update (sketched below) • Stage 1: pre-communication belief state for agent i (updated from observations only): ⟨⟨ω_i^0, Σ^0⟩, ⟨ω_i^1, Σ^1⟩, …, ⟨ω_i^{t-1}, Σ^{t-1}⟩, ⟨ω_i^t, ·⟩⟩ • Stage 2: post-communication belief state for i (updated from observations and communications): ⟨⟨ω_i^0, Σ^0⟩, ⟨ω_i^1, Σ^1⟩, …, ⟨ω_i^{t-1}, Σ^{t-1}⟩, ⟨ω_i^t, Σ^t⟩⟩ • In general an agent cannot form a probability distribution over world states from this history alone
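A minimal sketch of the two-stage update as pure history bookkeeping, matching the belief-state structure above; the data layout is an illustrative assumption.

```python
from typing import List, Optional, Tuple

# Each epoch stores (own observation, joint communication heard that epoch).
BeliefState = List[Tuple[str, Optional[Tuple[str, ...]]]]

def pre_communication_update(belief: BeliefState, own_obs: str) -> BeliefState:
    """Stage 1: append the new observation; this epoch's communication is not yet known."""
    return belief + [(own_obs, None)]

def post_communication_update(belief: BeliefState,
                              joint_messages: Tuple[str, ...]) -> BeliefState:
    """Stage 2: fill in the communications received this epoch."""
    own_obs, _ = belief[-1]
    return belief[:-1] + [(own_obs, joint_messages)]
```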
COM-MTDP Continued • The problem: find optimal JOINT policies, with one communication policy and one action policy per agent • π_Σ: communication policy • Maps the pre-communication belief state into messages: B_i → Σ for each agent • π_A: action policy • Maps the post-communication belief state into domain actions: B_i → A_i for each agent
More Domain Types • General communication: no assumptions on R_Σ • Free communication: R_Σ(s, σ) = 0 for all messages • No communication: R_Σ(s, σ) is negatively infinite (communication is prohibitively expensive)
True or False? • If agents communicated all their observations at every step, the distributed POMDP would essentially become a single-agent POMDP • In distributed POMDPs, each agent plans its own policy • Solving a distributed POMDP with two agents is of the same complexity as solving two separate individual POMDPs
NEXP-complete • No known efficient algorithms • Brute-force search (sketched below): 1. Generate the space of possible joint policies 2. For each joint policy in that space 3. Evaluate it over the finite horizon T • Complexity: (number of joint policies) × (cost of evaluating one policy)
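A sketch of this brute-force procedure for deterministic, observation-history-based policies, reusing the illustrative MTDP container from earlier. The enumeration scheme and helper names are assumptions, and the approach is only feasible for tiny horizons and observation sets.

```python
from itertools import product

def observation_histories(obs_set, horizon):
    """All observation histories of length 0 .. horizon-1 for one agent."""
    hists = [()]
    for t in range(1, horizon):
        hists += [h + (o,) for h in hists if len(h) == t - 1 for o in obs_set]
    return hists

def agent_policies(action_set, obs_set, horizon):
    """All deterministic policies: one action for every observation history."""
    hists = observation_histories(obs_set, horizon)
    for assignment in product(action_set, repeat=len(hists)):
        yield dict(zip(hists, assignment))

def evaluate(mtdp, joint_policy, start_state, horizon):
    """Expected team reward of a joint policy (exact expansion of the reachable tree)."""
    def recurse(state, histories, t):
        if t == horizon:
            return 0.0
        joint_action = tuple(pi[h] for pi, h in zip(joint_policy, histories))
        value = mtdp.reward(state, joint_action)
        for s_next in mtdp.states:
            p_s = mtdp.transition(s_next, state, joint_action)
            if p_s == 0.0:
                continue
            for joint_obs in product(*mtdp.observations):
                p_o = mtdp.obs_fn(joint_obs, joint_action, s_next)
                if p_o == 0.0:
                    continue
                new_hists = tuple(h + (o,) for h, o in zip(histories, joint_obs))
                value += p_s * p_o * recurse(s_next, new_hists, t + 1)
        return value
    return recurse(start_state, tuple(() for _ in mtdp.actions), 0)

def brute_force(mtdp, start_state, horizon):
    """Enumerate all joint policies and keep the best one (exponential in everything)."""
    best_value, best_joint = float("-inf"), None
    per_agent = [list(agent_policies(a, o, horizon))
                 for a, o in zip(mtdp.actions, mtdp.observations)]
    for joint_policy in product(*per_agent):
        v = evaluate(mtdp, joint_policy, start_state, horizon)
        if v > best_value:
            best_value, best_joint = v, joint_policy
    return best_joint, best_value
```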
Locally Optimal Search • JESP: Joint Equilibrium-based Search for Policies
Nash Equilibrium in Team Games • Nash equilibrium vs. the globally optimal reward for the team • [Figure: two payoff matrices for a team game, row player A with actions x, y, z and column player B with actions u, v, illustrating that a Nash equilibrium need not be the team's global optimum]
JESP: Locally Optimal Joint Policy • Iterate, keeping one agent's policy fixed while improving the other's • More complex policies are handled the same way • [Figure: example payoff matrix for players A and B]
Joint Equilibrium-based Search • Description of the algorithm (see the sketch below): 1. Repeat until convergence 2. For each agent i 3. Fix the policies of all agents apart from i 4. Find the policy for i that maximizes the joint reward • Exhaustive-JESP: the best response is found by brute-force search in agent i's policy space • Expensive
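A minimal sketch of Exhaustive-JESP as alternating best response, reusing the illustrative evaluate and agent_policies helpers from the brute-force sketch; the initialization, tolerance, and iteration cap are assumptions.

```python
def exhaustive_jesp(mtdp, start_state, horizon, initial_joint_policy, max_iters=100):
    """Alternate over agents, replacing one policy at a time with its best response
    against the others, until no agent can improve the joint reward (a local optimum)."""
    joint = list(initial_joint_policy)
    value = evaluate(mtdp, tuple(joint), start_state, horizon)
    for _ in range(max_iters):
        improved = False
        for i in range(len(joint)):
            # Brute-force best response for agent i with all other policies held fixed
            for candidate in agent_policies(mtdp.actions[i], mtdp.observations[i], horizon):
                trial = joint[:i] + [candidate] + joint[i + 1:]
                v = evaluate(mtdp, tuple(trial), start_state, horizon)
                if v > value + 1e-12:
                    joint, value, improved = trial, v, True
        if not improved:          # no agent can unilaterally improve: local equilibrium
            break
    return tuple(joint), value
```

The random restarts mentioned later in the lecture amount to running this from several different initial joint policies and keeping the best local equilibrium found.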
JESP: Joint Equilibrium Search (Nair et al., IJCAI 2003) • Repeat until convergence to a local equilibrium; for each agent K: • Fix the policies of all agents except K • Find the optimal response policy for agent K • The optimal response for K, given the fixed policies of the others in the MTDP, is obtained by: • Transforming the problem into a single-agent POMDP: • An "extended" state defined as the world state plus the other agents' observation histories, not just the world state s • Define a new transition function • Define a new observation function • Define a multiagent belief state • Dynamic programming over belief states • Fast computation of the optimal response (a sketch of the transformation follows below)
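For two agents, the best-response transformation can be sketched as follows, under the interfaces assumed earlier (agent i is taken to be agent 0, and pi_other is the other agent's fixed policy as a dict from its observation histories to actions). This follows the general shape of the construction rather than the paper's exact formulation.

```python
def extended_transition(e, a_i, e_next, mtdp, pi_other):
    """P'(e' | e, a_i) for extended states e = (world state, other agent's obs history)."""
    s, h_other = e
    s_next, h_other_next = e_next
    if h_other_next[:-1] != h_other:              # the history must grow by exactly one obs
        return 0.0
    o_other = h_other_next[-1]
    a_other = pi_other[h_other]                   # other agent's fixed action at this step
    joint_action = (a_i, a_other)                 # agent i is agent 0 in this sketch
    p_state = mtdp.transition(s_next, s, joint_action)
    # Marginal probability of the other agent's observation, summing out our own
    p_obs_other = sum(mtdp.obs_fn((o_i, o_other), joint_action, s_next)
                      for o_i in mtdp.observations[0])
    return p_state * p_obs_other

def extended_observation(o_i, e_next, a_i, mtdp, pi_other):
    """O'(o_i | e', a_i): agent i's observation, conditioned on the other's latest obs."""
    s_next, h_other_next = e_next
    o_other = h_other_next[-1]
    a_other = pi_other[h_other_next[:-1]]
    joint_action = (a_i, a_other)
    denom = sum(mtdp.obs_fn((o, o_other), joint_action, s_next)
                for o in mtdp.observations[0])
    return (mtdp.obs_fn((o_i, o_other), joint_action, s_next) / denom) if denom > 0 else 0.0
```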
Extended State, Belief State • [Figure: sample progression of multiagent belief states; HL and HR are observations, a2 is the Listen action]
Is JESP guaranteed to find the global optimum? • No: it converges only to a local equilibrium • Random restarts can help escape poor local optima
Not All Agents are Equal • Scaling up Distributed POMDPs for Agent Networks
POMDP vs. Distributed POMDP • Distributed POMDPs are more complex: joint transition and observation functions • Planning jointly yields a better policy than planning separately • With free communication, the problem reduces to a single-agent POMDP • Less dependency between agents means lower complexity