Networked Distributed POMDPs: DCOP-Inspired Distributed POMDPs
Ranjit Nair, Honeywell Labs; Pradeep Varakantham, USC; Milind Tambe, USC; Makoto Yokoo, Kyushu University
Background: DPOMDP
• Distributed Partially Observable Markov Decision Problems (DPOMDPs): a decision-theoretic approach
  • Performance is linked to the optimality of decision making
  • Explicitly reasons about positive and negative rewards and about uncertainty
• Current methods use centralized planning and distributed execution
• The complexity of finding an optimal policy is NEXP-complete
• In many domains, not all agents can interact with or affect each other
  • Most current DPOMDP algorithms do not exploit this locality of interaction
• Example domains: distributed sensors, disaster-rescue simulations, battlefield simulations
Background: DCOP
• Distributed Constraint Optimization Problem (DCOP):
  • Constraint graph (V, E)
  • Vertices are the agents' variables (x1, …, x4), each with a domain d1, …, d4
  • Edges represent rewards
• [Figure: two example assignments on a constraint graph over x1, …, x4 with a pairwise reward table f(di, dj); the two assignments shown have costs 0 and 7]
• DCOP algorithms exploit locality of interaction
• DCOP algorithms do not reason about uncertainty
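To make the DCOP objective concrete, here is a minimal Python sketch that scores a joint assignment by summing the pairwise rewards over the edges of a constraint graph. The variable names, edges, and reward table below are illustrative assumptions, not the slide's exact example.

```python
# Minimal DCOP sketch (illustrative; the graph and reward table are assumed,
# not taken from the slide's figure).
from itertools import product

variables = ["x1", "x2", "x3", "x4"]
domain = [0, 1]
edges = [("x1", "x2"), ("x1", "x3"), ("x2", "x4")]   # constraint graph
f = {(0, 0): 1, (0, 1): 2, (1, 0): 2, (1, 1): 0}     # reward f(di, dj) per value pair

def total_reward(assignment):
    """Sum the pairwise rewards over all edges of the constraint graph."""
    return sum(f[(assignment[i], assignment[j])] for i, j in edges)

# Exhaustive search over joint assignments (only feasible for tiny problems);
# DCOP algorithms instead exploit the locality of interaction in the graph.
best = max((dict(zip(variables, values)) for values in product(domain, repeat=len(variables))),
           key=total_reward)
print(best, total_reward(best))
```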
Key ideas and contributions
• Key ideas:
  • Exploit locality of interaction to enable scale-up
  • A hybrid DCOP–DPOMDP approach to collaboratively find the joint policy
  • Distributed offline planning and distributed execution
• Key contributions:
  • ND-POMDP: a distributed POMDP model that captures locality of interaction
  • Locally Interacting Distributed Joint Equilibrium-based Search for Policies (LID-JESP)
    • A hill-climbing algorithm in the style of the Distributed Breakout Algorithm (DBA)
    • A distributed, parallel algorithm for finding a locally optimal joint policy
  • Global Optimal Algorithm (GOA)
    • Based on variable elimination
Outline
• Sensor net domain
• Networked Distributed POMDPs (ND-POMDPs)
• Locally Interacting Distributed Joint Equilibrium-based Search for Policies (LID-JESP)
• Globally optimal algorithm
• Experiments
• Conclusions and future work
Example Domain
• [Figure: sensor network with agents Ag1–Ag5 covering sectors Sec1–Sec5, two targets, and a compass (N/S/E/W)]
• Two independent targets
  • Each changes position according to its own stochastic transition function
• Sensing agents cannot affect each other or the targets' positions
• False positives and false negatives are possible when observing targets
• A reward is obtained if two agents track a target correctly together
• There is a cost for leaving a sensor on
Networked Distributed POMDP
• ND-POMDP for a set of n agents Ag: ⟨S, A, P, O, Ω, R, b⟩
• World state s ∈ S, where S = S1 × … × Sn × Su
  • Each agent i ∈ Ag has a local state si ∈ Si
    • E.g., is the sensor on or off?
  • Su is the part of the state that no agent can affect
    • E.g., the locations of the two targets
• b is the initial belief state, a probability distribution over S
  • b = b1 · … · bn · bu (a product of independent distributions)
• A = A1 × … × An, where Ai is the set of actions for agent i
  • E.g., "Scan East", "Scan West", "Turn Off"
• No communication during execution
• Agents communicate during planning
ND-POMDP
• Transition independence: agent i's local state cannot be affected by the other agents
  • Pi : Si × Su × Ai × Si → [0,1]
  • Pu : Su × Su → [0,1]
• Ω = Ω1 × … × Ωn, where Ωi is the set of observations for agent i
  • E.g., target present in sector
• Observation independence: agent i's observations do not depend on the other agents
  • Oi : Si × Su × Ai × Ωi → [0,1]
• The reward function R is decomposable
  • R(s, a) = ∑l Rl(sl1, …, slk, su, al1, …, alk), where l ⊆ Ag and k = |l|
• Goal: find a joint policy π = ⟨π1, …, πn⟩, where πi is the local policy of agent i, such that π maximizes the expected joint reward over the finite horizon T
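To make the factored structure above concrete, the following Python sketch encodes an ND-POMDP's pieces as plain data structures, with the decomposable reward summed over the reward links Rl. The dictionary-based encoding and all names are assumptions for illustration, not the paper's implementation.

```python
# Illustrative sketch of an ND-POMDP's factored pieces (assumed encoding,
# not the paper's code). Each factor is a dictionary keyed by the relevant tuple.
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class NDPOMDP:
    n: int                                   # number of agents
    S_u: List[str]                           # unaffectable state values
    S_i: List[List[str]]                     # local state values, per agent
    A_i: List[List[str]]                     # local actions, per agent
    Omega_i: List[List[str]]                 # local observations, per agent
    P_u: Dict[Tuple[str, str], float]        # P_u(s_u' | s_u)
    P_i: List[Dict[Tuple, float]]            # P_i(s_i' | s_i, s_u, a_i), per agent
    O_i: List[Dict[Tuple, float]]            # O_i(omega_i | s_i, s_u, a_i), per agent
    links: List[Tuple[Tuple[int, ...], Dict]] = field(default_factory=list)
    # Each link is (tuple of agent indices l, reward table R_l).

    def reward(self, s_u, local_states, actions) -> float:
        """Decomposable reward: R(s, a) = sum over links l of R_l(...)."""
        total = 0.0
        for agents_l, R_l in self.links:
            key = (tuple(local_states[i] for i in agents_l), s_u,
                   tuple(actions[i] for i in agents_l))
            total += R_l.get(key, 0.0)
        return total
```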
ND-POMDP as a DCOP
• [Figure: interaction hypergraph over Ag1–Ag5; R1 is Ag1's cost for scanning, R12 is the reward for Ag1 and Ag2 tracking a target together]
• Inter-agent interactions are captured by an interaction hypergraph (Ag, E)
  • Each agent is a node
  • Set of hyperedges E = {l | l ⊆ Ag and Rl is a component of R}
• Neighborhood of agent i: the set of i's neighbors
  • Ni = {j ∈ Ag | j ≠ i and ∃ l ∈ E such that i ∈ l and j ∈ l}
• The agents are solving a DCOP where:
  • the constraint graph is the interaction hypergraph
  • the variable at each node is that agent's local policy
  • the objective is the expected joint reward
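The neighborhood definition Ni is easy to compute from the hyperedges; the short Python sketch below does exactly that. The example hyperedges are assumed (a 5-agent chain with unary and pairwise links), not the slide's figure.

```python
# Computing each agent's neighborhood N_i from the interaction hyperedges
# (a small sketch; the example hyperedges are assumed).
from collections import defaultdict

def neighborhoods(num_agents, hyperedges):
    """N_i = {j != i : some hyperedge l contains both i and j}."""
    N = defaultdict(set)
    for l in hyperedges:
        for i in l:
            N[i].update(j for j in l if j != i)
    return {i: N[i] for i in range(num_agents)}

# Example: unary cost links plus pairwise tracking links on a 5-agent chain.
edges = [(0,), (1,), (2,), (3,), (4,), (0, 1), (1, 2), (2, 3), (3, 4)]
print(neighborhoods(5, edges))   # e.g. agent 1's neighbors are {0, 2}
```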
ND-POMDP theorems
• Theorem 1: For an ND-POMDP, the expected reward of a joint policy π is the sum of the expected rewards of each of the links under π
  • The global value function is decomposable into value functions for each link
• Local neighborhood utility Vπ[Ni]: the expected reward obtained from all links involving agent i when executing policy π
• Theorem 2 (locality of interaction): For policies π and π′, if πi = π′i and πNi = π′Ni, then Vπ[Ni] = Vπ′[Ni]
  • Given its neighbors' policies, the local neighborhood utility of agent i does not depend on any non-neighbor's policy
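The decomposition behind Theorems 1 and 2 can be stated compactly; the LaTeX below follows the slide's definitions, writing V^π_l for the expected reward accrued on link l under joint policy π (this symbol is introduced here only for readability).

```latex
% Value decomposition over links (Theorem 1)
V^{\pi} \;=\; \sum_{l \in E} V^{\pi}_{l},
\qquad
V^{\pi}[N_i] \;=\; \sum_{l \in E \,:\, i \in l} V^{\pi}_{l}.

% Locality of interaction (Theorem 2)
\pi_i = \pi'_i \;\wedge\; \pi_{N_i} = \pi'_{N_i}
\;\Longrightarrow\;
V^{\pi}[N_i] = V^{\pi'}[N_i].
```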
LID-JESP
• LID-JESP algorithm (based on the Distributed Breakout Algorithm):
  1. Choose a local policy randomly
  2. Communicate the local policy to the neighbors
  3. Compute the local neighborhood utility of the current policy with respect to the neighbors' policies
  4. Compute the local neighborhood utility of the best-response policy with respect to the neighbors' policies (GetValue)
  5. Communicate the gain (step 4 minus step 3) to the neighbors
  6. If the gain is greater than the neighbors' gains:
     • Change the local policy to the best-response policy
     • Communicate the changed policy to the neighbors
  7. Else:
     • If termination has not been detected, go to step 3
• Theorem 3: Global utility is strictly increasing with each cycle until a local optimum is reached
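A minimal, single-process sketch of one LID-JESP cycle for agent i is given below. Message passing is abstracted into function arguments, and neighborhood_utility and best_response are assumed helpers standing in for the GetValue computation; ties between gains would also need an ID-based tie-break in a full implementation.

```python
# One LID-JESP cycle for agent i (illustrative sketch; communication is
# abstracted away and the helper functions are assumed, not the paper's code).

def lid_jesp_cycle(i, local_policy, neighbor_policies,
                   neighborhood_utility, best_response, neighbor_gains):
    """Return the (possibly updated) local policy and this agent's gain."""
    # Step 3: utility of the current policy given the neighbors' policies.
    current_value = neighborhood_utility(i, local_policy, neighbor_policies)

    # Step 4: best-response policy and its utility (the GetValue call).
    new_policy, best_value = best_response(i, neighbor_policies)

    # Step 5: the gain that would be broadcast to the neighbors.
    gain = best_value - current_value

    # Step 6: change the policy only if this agent's gain beats all neighbors'.
    if gain > 0 and all(gain > g for g in neighbor_gains.values()):
        return new_policy, gain
    return local_policy, gain
```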
Termination Detection
• Each agent maintains a termination counter
  • Reset to zero if its gain > 0; otherwise increment it by 1
  • Exchange the counter with the neighbors
  • Set the counter to the minimum of its own counter and its neighbors' counters
• Termination is detected when the counter reaches d (the diameter of the interaction graph)
• Theorem 4: LID-JESP will terminate within d cycles of reaching a local optimum
• Theorem 5: If LID-JESP terminates, the agents are at a local optimum
• From Theorems 3–5, LID-JESP terminates at a local optimum within d cycles
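The counter rule is small enough to show directly; the sketch below mirrors the bullets above (the function signature is an assumption for illustration).

```python
# Termination-counter update for one agent (sketch of the rule on the slide).

def update_termination_counter(counter, gain, neighbor_counters, diameter):
    """Return (new_counter, terminated?)."""
    counter = 0 if gain > 0 else counter + 1            # reset or increment
    counter = min([counter] + list(neighbor_counters))  # take the min with neighbors
    return counter, counter >= diameter                 # done after d quiet cycles
```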
Computing the best-response policy
• Given the neighbors' fixed policies, each agent faces a single-agent POMDP
• The state of this POMDP combines su, si, the neighbors' local states sNi, and the neighbors' observation histories (cf. the complexity analysis below)
  • Note: this state is not fully observable
• The transition, observation, and reward functions are defined over this extended state (formulas omitted on the slide)
• The best response is computed using a Bellman-backup approach
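As a hedged reconstruction (consistent with the transition and observation independence assumptions and with the complexity slide, but not copied from the paper), the extended-state dynamics of agent i's best-response POMDP can be sketched as follows, where a^j_t is the action prescribed by the fixed neighbor policy πj given observation history ω⃗^j_t:

```latex
% Extended state for agent i's best-response POMDP (sketch, assuming the
% neighbors' policies \pi_{N_i} are fixed):
e_t \;=\; \langle s^u_t,\; s^i_t,\; s^{N_i}_t,\; \vec{\omega}^{\,N_i}_t \rangle

% Transition of the extended state under a^i_t (using transition and
% observation independence):
P(e_{t+1} \mid e_t, a^i_t) \;=\;
   P_u(s^u_{t+1} \mid s^u_t)\,
   P_i(s^i_{t+1} \mid s^i_t, s^u_t, a^i_t)
   \prod_{j \in N_i}
   P_j(s^j_{t+1} \mid s^j_t, s^u_t, a^j_t)\,
   O_j(\omega^j_{t+1} \mid s^j_{t+1}, s^u_{t+1}, a^j_t)

% Reward of the extended POMDP: the links that involve agent i
R_i(e_t, a^i_t) \;=\; \sum_{l \in E :\, i \in l} R_l(s^{l}_t, s^u_t, a^{l}_t)
```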
Global Optimal Algorithm (GOA)
• Similar to variable elimination
• Relies on a tree-structured interaction graph
  • A cycle-cutset algorithm is used to eliminate cycles
  • Assumes only binary interactions
• Phase 1: values are propagated upwards, from the leaves to the root
  • For each of its policies, an agent sums up the values of its children's optimal responses
  • It computes the value of its optimal response to each of its parent's policies
  • It communicates these values to its parent
• Phase 2: policies are propagated downwards, from the root to the leaves
  • Each agent chooses the policy that is the optimal response to its parent's chosen policy
  • It communicates its policy to its children
Experiments
• Compared against:
  • LID-JESP-no-nw: LID-JESP ignoring the interaction graph
  • JESP: a centralized solver (Nair et al., 2003)
• 3-agent chain:
  • LID-JESP is exponentially faster than GOA
• 4-agent chain:
  • LID-JESP is faster than JESP and LID-JESP-no-nw
  • LID-JESP is exponentially faster than GOA
Experiments
• 5-agent chain:
  • LID-JESP is much faster than JESP and LID-JESP-no-nw
• Solution values:
  • LID-JESP's values are comparable to GOA's
  • Random restarts can be used to find the global optimum
Experiments
• Reasons for the speedup (C: number of cycles, G: number of GetValue calls, W: number of agents that change their policies in a cycle):
  • LID-JESP converges in fewer cycles (column C)
  • LID-JESP allows multiple agents to change their policies in a single cycle (column W)
  • JESP makes fewer GetValue calls than LID-JESP (column G)
    • But each such call is slower
Complexity
• Complexity of the best response:
  • JESP: O(|S|² · |Ai| · ∏j |Ωj|^T)
    • depends on the entire world state
    • depends on the observation histories of all agents
  • LID-JESP: O(|Su × Si × SNi|² · |Ai| · ∏j∈Ni |Ωj|^T)
    • depends only on Su, Si, and SNi
    • depends on the observation histories of the neighbors only
  • Increasing the number of agents does not affect this complexity, given a fixed number of neighbors
• Complexity of GOA:
  • Brute-force global optimum: O(∏j |πj| · |S|² · ∏j |Ωj|^T)
  • GOA: O(n · |πj| · |Su × Si × Sj|² · |Ai| · |Ωi|^T · |Ωj|^T)
  • Increasing the number of agents causes only a linear increase in run time
Conclusions
• DCOP algorithms were applied to solving Distributed POMDPs
  • Exploiting "locality of interaction" reduces run time
• LID-JESP, based on DBA
  • Agents converge to a locally optimal joint policy
• GOA, based on variable elimination
• These are the first distributed, parallel algorithms for Distributed POMDPs
• Complexity increases only linearly with the number of agents, given a fixed number of neighbors
Future Work
• How can communication be incorporated?
  • Will introducing communication cause agents to lose locality of interaction?
• Remove the assumption of transition independence
  • This may cause all agents to depend on each other
• Other globally optimal algorithms
• Increased parallelism
Global Optimal
• Only binary constraints are considered here; the approach can be extended to n-ary constraints
• A distributed cycle-cutset algorithm is run when the interaction graph is not a tree
• Algorithm:
  • Convert the graph into trees and a cycle cutset C
  • For each possible joint policy πC of the agents in C:
    • Val[πC] = 0
    • For each tree of agents:
      • Val[πC] += DP-Global(tree, πC)
  • Choose the joint policy with the highest value
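A compact Python sketch of the cycle-cutset wrapper is shown below; dp_global(tree, pi_C) stands for the Phase 1/Phase 2 tree computation (detailed on the next slide) conditioned on the cutset agents' joint policy, and the function names are assumptions for illustration.

```python
# Cycle-cutset wrapper around the tree solver (sketch; `dp_global(tree, pi_C)`
# and `cutset_policies` are assumed inputs, not the paper's code).
from itertools import product

def goa_with_cutset(cutset_policies, trees, dp_global):
    """cutset_policies: list of policy lists, one per cutset agent."""
    best_val, best_pc = float("-inf"), None
    for pi_C in product(*cutset_policies):       # every joint policy of the cutset agents
        val = sum(dp_global(tree, pi_C) for tree in trees)
        if val > best_val:
            best_val, best_pc = val, pi_C
    return best_pc, best_val
```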
Global Optimal Algorithm (GOA)
• Similar to variable elimination
• Relies on a tree-structured interaction graph
  • A cycle-cutset algorithm is used to eliminate cycles
  • Assumes only binary interactions
• Phase 1: values are propagated upwards from the leaves to the root. From the deepest nodes in the tree to the root, each agent i does:
  1. For each of agent i's policies πi: eval(πi) ← ∑ci value[πi, ci], where value[πi, ci] is received from child ci
  2. For each of the parent's policies πj:
     • value[πj, i] ← 0
     • For each of agent i's policies πi:
       • current-eval ← expected-reward(πj, πi) + eval(πi)
       • If value[πj, i] < current-eval then value[πj, i] ← current-eval
     • Send value[πj, i] to parent j
• Phase 2: policies are propagated downwards from the root to the leaves
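For readability, here is a centralized Python sketch of GOA's two phases on a tree-structured interaction graph. The real algorithm is distributed and message-based; here messages are replaced by shared tables, and the tree encoding (policies, children, parent) and edge_reward are assumed inputs.

```python
# Centralized sketch of GOA on a tree-structured interaction graph
# (illustrative; the actual algorithm runs distributed across the agents).

def goa_tree(policies, children, parent, root, edge_reward):
    """
    policies[i]: list of candidate policies for agent i
    children[i] / parent[i]: the interaction tree
    edge_reward(p, pi_p, i, pi_i): expected reward on edge (p, i)
    """
    eval_val = {}   # eval_val[(i, pi_i)]: best value of i's subtree given pi_i
    response = {}   # response[(i, pi_p)]: (value, best pi_i) w.r.t. parent's policy pi_p

    def phase1(i):                               # Phase 1: leaves -> root
        for c in children[i]:
            phase1(c)
        for pi_i in policies[i]:                 # step 1: sum the children's responses
            eval_val[(i, pi_i)] = sum(response[(c, pi_i)][0] for c in children[i])
        if i == root:
            return
        p = parent[i]
        for pi_p in policies[p]:                 # step 2: best response to each parent policy
            response[(i, pi_p)] = max(
                ((edge_reward(p, pi_p, i, pi_i) + eval_val[(i, pi_i)], pi_i)
                 for pi_i in policies[i]),
                key=lambda t: t[0])

    def phase2(i, chosen):                       # Phase 2: root -> leaves
        for c in children[i]:
            chosen[c] = response[(c, chosen[i])][1]
            phase2(c, chosen)

    phase1(root)
    chosen = {root: max(policies[root], key=lambda pi: eval_val[(root, pi)])}
    phase2(root, chosen)
    return chosen                                # joint policy: agent -> chosen local policy
```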