Networked Distributed POMDPs: DCOP-Inspired Distributed POMDPs


Presentation Transcript


  1. Networked Distributed POMDPs: DCOP-Inspired Distributed POMDPs. Ranjit Nair (Honeywell Labs), Pradeep Varakantham (USC), Milind Tambe (USC), Makoto Yokoo (Kyushu University)

  2. Background: DPOMDP • Distributed Partially Observable Markov Decision Problems (DPOMDP): a decision-theoretic approach • Performance is linked to the optimality of decision making • Explicitly reasons about positive and negative rewards and about uncertainty • Current methods use centralized planning and distributed execution • The complexity of finding the optimal joint policy is NEXP-complete • In many domains, not all agents can interact with or affect each other • Most current DPOMDP algorithms do not exploit this locality of interaction • Example domains: distributed sensor networks, disaster rescue simulations, battlefield simulations

  3. Background: DCOP • Distributed Constraint Optimization Problem (DCOP): • Constraint graph (V, E) • Vertices are the agents’ variables (x1, ..., x4), each with a domain (d1, ..., d4) • Edges represent rewards, given by functions f(di, dj) over pairs of variable values • DCOP algorithms exploit locality of interaction • DCOP algorithms do not reason about uncertainty • [Figure: two example assignments to the constraint graph over x1..x4 with a reward table f(di, dj), yielding Cost = 0 and Cost = 7]
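
As a rough illustration of the DCOP formulation above, the following Python sketch represents a constraint graph and evaluates the total reward of a joint assignment. It is not taken from the slides; the variables, domains, edges, and the reward table f are all made-up placeholders.

```python
# Minimal DCOP sketch: variables x1..x4, binary reward functions on edges.
# All names and numbers here are illustrative, not taken from the slides.

variables = ["x1", "x2", "x3", "x4"]
domains = {v: [0, 1] for v in variables}   # each variable's domain d_i (shown for completeness)

def f(di, dj):
    """Made-up edge reward table f(d_i, d_j)."""
    return 1 if di != dj else 0

edges = [("x1", "x2"), ("x1", "x3"), ("x2", "x4")]

def total_reward(assignment):
    """Sum the edge rewards for a complete joint assignment."""
    return sum(f(assignment[i], assignment[j]) for i, j in edges)

# A DCOP algorithm would search for the assignment maximizing this sum;
# here we only evaluate one candidate assignment.
print(total_reward({"x1": 0, "x2": 1, "x3": 1, "x4": 0}))
```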

  4. Key ideas and contributions • Key ideas: • Exploit locality of interaction to enable scale-up • A hybrid DCOP-DPOMDP approach to collaboratively find the joint policy • Distributed offline planning and distributed execution • Key contributions: • ND-POMDP: a distributed POMDP model that captures locality of interaction • Locally Interacting Distributed Joint Equilibrium-based Search for Policies (LID-JESP): a hill-climbing algorithm modeled on the Distributed Breakout Algorithm (DBA); a distributed, parallel algorithm for finding a locally optimal joint policy • Globally Optimal Algorithm (GOA), based on variable elimination

  5. Outline • Sensor net domain • Networked Distributed POMDPs (ND-POMDPs) • Locally interacting distributed joint equilibrium-based search for policies (LID-JESP) • Globally optimal algorithm • Experiments • Conclusions and Future Work

  6. Example Domain • [Figure: five sensing agents Ag1..Ag5 covering sectors Sec1..Sec5, with target1 and target2 moving among the sectors; compass directions N, S, E, W indicate scan directions] • Two independent targets • Each target changes position based on its own stochastic transition function • Sensing agents cannot affect each other or the targets’ positions • False positives and false negatives are possible when observing targets • A reward is obtained only if two agents track a target correctly together • There is a cost for leaving a sensor on
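
A hedged sketch of how this domain’s parameters might be encoded. Every number below (transition probabilities, observation error rates, rewards, costs) is an invented placeholder, not a value from the paper; only the structure mirrors the bullets above.

```python
# Illustrative encoding of the sensor-net domain; every number is a placeholder.
sensor_domain = {
    "agents": ["Ag1", "Ag2", "Ag3", "Ag4", "Ag5"],
    "actions": ["scan_east", "scan_west", "turn_off"],
    "sectors": ["Sec1", "Sec2", "Sec3", "Sec4", "Sec5"],
    # Each target moves according to its own stochastic transition function.
    "target_transition": {"stay": 0.6, "move_to_adjacent_sector": 0.4},
    # Observations are noisy: false positives and false negatives are possible.
    "false_positive_rate": 0.05,
    "false_negative_rate": 0.15,
    # Reward only when two agents track a target together; cost for scanning.
    "joint_tracking_reward": 50,
    "scanning_cost": -2,
}
```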

  7. Networked Distributed POMDP • ND-POMDP for a set of n agents Ag: <S, A, P, O, Ω, R, b> • World state s ∈ S, where S = S1 × … × Sn × Su • Each agent i ∈ Ag has a local state si ∈ Si • E.g., is the sensor on or off? • Su is the part of the state that no agent can affect • E.g., the locations of the two targets • b is the initial belief state, a probability distribution over S • b factors as b = b1 · … · bn · bu • A = A1 × … × An, where Ai is the set of actions for agent i • E.g., “Scan East”, “Scan West”, “Turn Off” • No communication during execution • Agents communicate only during planning
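
To make the tuple concrete, here is a minimal Python sketch of a container for <S, A, P, O, Ω, R, b>. The field names, container types, and the string encoding of states are assumptions for illustration; the model itself does not prescribe any representation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

@dataclass
class NDPOMDP:
    """Container for the tuple <S, A, P, O, Omega, R, b> of an n-agent ND-POMDP (sketch)."""
    local_states: List[List[str]]       # S_1, ..., S_n: each agent's local states
    unaffectable_states: List[str]      # S_u: state no agent can affect (e.g. target locations)
    actions: List[List[str]]            # A_1, ..., A_n, e.g. ["scan_east", "scan_west", "off"]
    observations: List[List[str]]       # Omega_1, ..., Omega_n
    P: Callable[[Tuple, Tuple, Tuple], float]   # P(s, a, s'): transition function
    O: Callable[[Tuple, Tuple, Tuple], float]   # O(s', a, omega): observation function
    R: Callable[[Tuple, Tuple], float]          # R(s, a): joint reward
    b: Dict[Tuple, float]                       # initial belief over S = S_1 x ... x S_n x S_u
    horizon: int                                # finite horizon T
```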

  8. ND-POMDP • Transition independence: agent i’s local state cannot be affected by other agents • Pi : Si × Su × Ai × Si → [0,1] • Pu : Su × Su → [0,1] • Ω = Ω1 × … × Ωn, where Ωi is the set of observations for agent i • E.g., target present in sector • Observation independence: agent i’s observations do not depend on other agents • Oi : Si × Su × Ai × Ωi → [0,1] • The reward function R is decomposable • R(s, a) = ∑l Rl(sl1, …, slk, su, al1, …, alk), where l ⊆ Ag and k = |l| • Goal: find a joint policy π = <π1, …, πn>, where πi is the local policy of agent i, such that π maximizes the expected joint reward over a finite horizon T
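
A sketch of the decomposable reward R(s, a) = ∑l Rl(sl1, …, slk, su, al1, …, alk), keying each component Rl by the hyperedge l (tuple of agent indices) it touches. The representation, the example components, and all numeric values are assumptions made for illustration.

```python
from typing import Callable, Dict, Sequence, Tuple

# Each hyperedge l (a tuple of agent indices) has its own reward component R_l.
# R_l sees only the local states and actions of the agents in l, plus s_u.
RewardComponent = Callable[[Sequence[str], str, Sequence[str]], float]

def joint_reward(components: Dict[Tuple[int, ...], RewardComponent],
                 local_states: Sequence[str], s_u: str,
                 actions: Sequence[str]) -> float:
    """R(s, a) = sum over hyperedges l of R_l(s_l, s_u, a_l)."""
    total = 0.0
    for l, R_l in components.items():
        s_l = [local_states[i] for i in l]
        a_l = [actions[i] for i in l]
        total += R_l(s_l, s_u, a_l)
    return total

# Illustrative components for the sensor example (numbers are made up):
components = {
    (0,): lambda s_l, s_u, a_l: -1.0 if a_l[0] != "off" else 0.0,   # R_1: scanning cost
    (0, 1): lambda s_l, s_u, a_l: 50.0 if a_l == ["scan_east", "scan_west"]
            and s_u == "target_between_1_2" else 0.0,               # R_12: joint tracking
}
print(joint_reward(components, ["on", "on"], "target_between_1_2",
                   ["scan_east", "scan_west"]))
```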

  9. ND-POMDP as a DCOP • [Figure: interaction hypergraph over Ag1..Ag5, with a unary link R1 (Ag1’s cost for scanning) and a binary link R12 (reward for Ag1 and Ag2 jointly tracking a target)] • Inter-agent interactions are captured by an interaction hypergraph (Ag, E) • Each agent is a node • Set of hyperedges E = {l | l ⊆ Ag and Rl is a component of R} • Neighborhood of agent i: the set of i’s neighbors, Ni = {j ∈ Ag | j ≠ i and ∃ l ∈ E with i ∈ l and j ∈ l} • The agents are solving a DCOP where: • the constraint graph is the interaction hypergraph • the variable at each node is the local policy of that agent • the objective is to optimize the expected joint reward
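
A small sketch of how the interaction hypergraph determines each agent’s neighborhood Ni. Hyperedges are represented as tuples of agent indices (as in the reward sketch above); the edge set E below mixes the R1 and R12 links from the figure with made-up links among the remaining agents.

```python
from typing import Dict, Iterable, Set, Tuple

def neighborhoods(agents: Iterable[int],
                  hyperedges: Set[Tuple[int, ...]]) -> Dict[int, Set[int]]:
    """N_i = {j != i : some hyperedge l contains both i and j}."""
    N = {i: set() for i in agents}
    for l in hyperedges:
        for i in l:
            for j in l:
                if i != j:
                    N[i].add(j)
    return N

# Unary link on Ag1 and binary Ag1-Ag2 link from the figure; the rest are made up.
E = {(0,), (0, 1), (1, 2), (3, 4)}
print(neighborhoods(range(5), E))   # e.g. N_0 = {1}, N_1 = {0, 2}, ...
```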

  10. ND-POMDP theorems • Theorem 1: For an ND-POMDP, the expected reward for a joint policy π is the sum of the expected rewards on each of the links under π • The global value function is decomposable into value functions for each link • Local neighborhood utility Vπ[Ni]: the expected reward obtained from all links involving agent i when executing policy π • Theorem 2 (locality of interaction): For policies π and π′, if πi = π′i and πNi = π′Ni, then Vπ[Ni] = Vπ′[Ni] • Given its neighbors’ policies, the local neighborhood utility of agent i does not depend on any non-neighbor’s policy
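
To make the decomposition concrete, here is a hedged sketch in which expected_link_value is an assumed primitive giving the expected value Vπ^l of one link under the joint policy (e.g. computed by enumeration or simulation). The global value is the sum over all links (Theorem 1), and Vπ[Ni] sums only the links that involve agent i.

```python
from typing import Callable, Sequence, Tuple

LinkValue = Callable[[Tuple[int, ...], Sequence], float]  # V_pi^l for one hyperedge l

def global_value(links, joint_policy, expected_link_value: LinkValue) -> float:
    """Theorem 1: V_pi = sum over links l of V_pi^l."""
    return sum(expected_link_value(l, joint_policy) for l in links)

def local_neighborhood_utility(i: int, links, joint_policy,
                               expected_link_value: LinkValue) -> float:
    """V_pi[N_i]: expected reward of all links that involve agent i.
    By Theorem 2, this depends only on pi_i and the neighbors' policies."""
    return sum(expected_link_value(l, joint_policy) for l in links if i in l)
```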

  11. LID-JESP • The LID-JESP algorithm (based on the Distributed Breakout Algorithm): • 1. Choose a local policy randomly • 2. Communicate the local policy to neighbors • 3. Compute the local neighborhood utility of the current policy with respect to the neighbors’ policies • 4. Compute the local neighborhood utility of the best-response policy with respect to the neighbors’ policies (GetValue) • 5. Communicate the gain (the value from step 4 minus the value from step 3) to neighbors • 6. If the gain is greater than the neighbors’ gains, change the local policy to the best-response policy and communicate the changed policy to neighbors • 7. If termination has not been detected, go to step 3 • Theorem 3: the global utility is strictly increasing with each cycle until a local optimum is reached
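
A schematic of one LID-JESP cycle from agent i’s point of view, not the authors’ implementation: get_value, best_response, send, and recv are assumed primitives standing in for the GetValue computation and the neighbor message passing described above.

```python
def lid_jesp_cycle(i, policy, neighbor_policies, neighbors,
                   get_value, best_response, send, recv):
    """One LID-JESP cycle for agent i (schematic; all primitives are assumed).

    get_value(i, pi_i, neighbor_policies) -> local neighborhood utility V_pi[N_i]
    best_response(i, neighbor_policies)   -> (best pi_i, its neighborhood utility)
    send / recv                           -> message passing with neighbors
    """
    current_value = get_value(i, policy, neighbor_policies)        # step 3
    best_policy, best_value = best_response(i, neighbor_policies)  # step 4 (GetValue)
    gain = best_value - current_value
    send(neighbors, ("gain", i, gain))                             # step 5
    neighbor_gains = [recv(j, "gain") for j in neighbors]
    if gain > 0 and gain > max(neighbor_gains, default=0):         # step 6: only the
        policy = best_policy                                       # locally largest gain
        send(neighbors, ("policy", i, policy))                     # changes its policy
    for j in neighbors:                                            # pick up neighbors'
        update = recv(j, "policy")                                 # policy changes, if any
        if update is not None:
            neighbor_policies[j] = update
    return policy, gain
```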

  12. Termination Detection • Each agent maintains a termination counter • Reset it to zero if the gain > 0, else increment it by 1 • Exchange the counter with neighbors • Set the counter to the minimum of its own counter and the neighbors’ counters • Termination is detected when the counter reaches d (the diameter of the interaction graph) • Theorem 4: LID-JESP terminates within d cycles of reaching a local optimum • Theorem 5: If LID-JESP terminates, the agents are in a local optimum • From Theorems 3-5, LID-JESP terminates in a local optimum within d cycles
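
A small sketch of the counter bookkeeping described above. The neighbors’ counter values are assumed to have already been exchanged (e.g. by the same message-passing primitives used in the cycle sketch).

```python
def update_termination_counter(counter, gain, neighbor_counters, diameter):
    """Reset the counter on positive gain, otherwise increment it; then take the
    minimum with the neighbors' counters. Termination is detected once the
    counter reaches d, the diameter of the interaction graph."""
    counter = 0 if gain > 0 else counter + 1
    counter = min([counter] + list(neighbor_counters))
    terminated = counter >= diameter
    return counter, terminated
```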

  13. Computing the best-response policy • Given its neighbors’ fixed policies, each agent faces a single-agent POMDP • The state of this POMDP combines Su, the agent’s own local state, its neighbors’ local states, and the neighbors’ observation histories (cf. the complexity analysis on slide 18) • Note: this state is not fully observable • The transition, observation, and reward functions of this single-agent POMDP are induced from Pi, Pu, Oi and the Rl components, given the neighbors’ fixed policies • The best response is computed using a Bellman-backup (dynamic programming) approach

  14. Global Optimal Algorithm (GOA) • Similar to variable elimination • Relies on a tree-structured interaction graph • A cycle-cutset algorithm eliminates cycles • Assumes only binary interactions • Phase 1: values are propagated upwards from the leaves to the root • For each of its own policies, an agent sums up the values of its children’s optimal responses • It then computes the value of its optimal response to each of its parent’s policies • These values are communicated to the parent • Phase 2: policies are propagated downwards from the root to the leaves • Each agent chooses the policy that is the optimal response to its parent’s chosen policy • It then communicates its chosen policy to its children
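
A hedged sketch of GOA’s two phases on a tree-structured interaction graph. Here expected_link_reward(pi_parent, parent, pi_child, child) is an assumed primitive giving the expected reward of the single link between a parent and child for that pair of policies, candidate_policies[i] is an assumed finite policy set per agent, and the recursive upward pass stands in for the bottom-up message passing of Phase 1 (it recomputes subtrees rather than caching eval values, for brevity).

```python
def goa(tree_children, root, candidate_policies, expected_link_reward):
    """Sketch of GOA on a tree interaction graph (assumed primitives; see lead-in)."""
    best_child_policy = {}   # (parent, pi_parent, child) -> child's best-response policy

    def upward(i, parent, pi_parent):
        """Best (value, policy) of agent i's subtree given the parent plays pi_parent."""
        best_value, best_policy = None, None
        for pi_i in candidate_policies[i]:
            value = sum(upward(c, i, pi_i)[0] for c in tree_children[i])
            if parent is not None:
                value += expected_link_reward(pi_parent, parent, pi_i, i)
            if best_value is None or value > best_value:
                best_value, best_policy = value, pi_i
        if parent is not None:
            best_child_policy[(parent, pi_parent, i)] = best_policy
        return best_value, best_policy

    # Phase 1: values are propagated upward from the leaves to the root.
    root_value, root_policy = upward(root, None, None)

    # Phase 2: policies are propagated downward from the root to the leaves.
    chosen, frontier = {root: root_policy}, [root]
    while frontier:
        i = frontier.pop()
        for c in tree_children[i]:
            chosen[c] = best_child_policy[(i, chosen[i], c)]
            frontier.append(c)
    return root_value, chosen
```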

  15. Experiments • Compared against: • LID-JESP-no-nw: LID-JESP without exploiting the interaction graph • JESP: a centralized solver (Nair et al., 2003) • 3-agent chain: LID-JESP is exponentially faster than GOA • 4-agent chain: LID-JESP is faster than JESP and LID-JESP-no-nw, and exponentially faster than GOA

  16. Experiments • 5-agent chain: LID-JESP is much faster than JESP and LID-JESP-no-nw • Solution values: LID-JESP’s values are comparable to GOA’s • Random restarts can be used to find the global optimum

  17. Experiments • Reasons for speedup: • C: No. of cycles • G: No. of GetValue calls • W: No. of agents that change their policies in a cycle • LID-JESP converges in fewer cycles (column C) • LID-JESP allows multiple agents to change their policies in a single cycle (column W) • JESP has fewer GetValue calls than LID-JESP • But each such call was slower

  18. Complexity • Complexity of the best response: • JESP: O(|S|^2 · |Ai| · ∏j |Ωj|^T) • depends on the entire world state • depends on the observation histories of all agents • LID-JESP: O(|Su × Si × SNi|^2 · |Ai| · ∏j∈Ni |Ωj|^T) • depends on the observation histories of neighbors only • depends only on Su, Si and SNi • Increasing the number of agents does not affect this complexity, for a fixed number of neighbors • Complexity of GOA: • Brute-force global optimal: O(∏j |πj| · |S|^2 · ∏j |Ωj|^T) • GOA: O(n · |πj| · |Su × Si × Sj|^2 · |Ai| · |Ωi|^T · |Ωj|^T) • Increasing the number of agents causes only a linear increase in run time

  19. Conclusions • DCOP algorithms are applied to solving Distributed POMDPs • Exploiting “locality of interaction” reduces run time • LID-JESP, based on DBA • Agents converge to a locally optimal joint policy • GOA, based on variable elimination • First distributed, parallel algorithms for Distributed POMDPs • Exploiting “locality of interaction” reduces run time • Complexity increases only linearly with the number of agents • For a fixed number of neighbors

  20. Future Work • How can communication be incorporated? • Will introducing communication cause agents to lose locality of interaction? • Remove the assumption of transition independence • This may cause all agents to become dependent on each other • Other globally optimal algorithms • Increased parallelism

  21. Backup slides

  22. Global Optimal • Considers only binary constraints; can be extended to n-ary constraints • Run a distributed cycle-cutset algorithm in case the graph is not a tree • Algorithm: • Convert the graph into trees plus a cycle cutset C • For each possible joint policy πC of the agents in C: • Val[πC] = 0 • For each tree of agents: Val[πC] += DP-Global(tree, πC) • Choose the joint policy with the highest value
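
A short sketch of the cutset wrapper above. Here dp_global is an assumed primitive corresponding to DP-Global(tree, πC) (for example, the tree routine sketched after slide 14), and candidate_policies maps each cutset agent to an assumed finite set of local policies.

```python
from itertools import product

def global_optimal_with_cutset(cutset_agents, candidate_policies, trees, dp_global):
    """Enumerate joint policies of the cycle cutset C; for each, sum the optimal
    values DP-Global returns for every tree, and keep the best combination."""
    best_value, best_joint = float("-inf"), None
    for joint in product(*(candidate_policies[c] for c in cutset_agents)):
        pi_C = dict(zip(cutset_agents, joint))
        value = sum(dp_global(tree, pi_C) for tree in trees)
        if value > best_value:
            best_value, best_joint = value, pi_C
    return best_value, best_joint
```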

  23. Global Optimal Algorithm (GOA) • Similar to variable elimination • Relies on a tree-structured interaction graph • A cycle-cutset algorithm eliminates cycles • Assumes only binary interactions • Phase 1: values are propagated upwards from the leaves to the root. From the deepest nodes in the tree to the root, each agent i does: 1. For each of agent i’s policies πi: eval(πi) ← ∑ci value[ci, πi], where value[ci, πi] is received from child ci. 2. For each of the parent’s policies πj: value[i, πj] ← 0; for each of agent i’s policies πi: current-eval ← expected-reward(πj, πi) + eval(πi); if value[i, πj] < current-eval then value[i, πj] ← current-eval. Finally, send value[i, πj] to parent j. • Phase 2: policies are propagated downwards from the root to the leaves.
