Concurrent Markov Decision Processes
Mausam, Daniel S. Weld
University of Washington, Seattle
What action next? • [Figure: agent-environment loop; Planning produces Actions for the Environment, which returns Percepts.]
Motivation • Two features of real-world planning domains: • Concurrency (widely studied in the classical planning literature) • Some instruments may warm up • Others may perform their tasks • Others may shut down to save power • Uncertainty (widely studied in the MDP literature) • All actions (pick up the rock, send data, etc.) have a probability of failure • Need both!
Probabilistic Planning • Probabilistic planning is typically modeled as a Markov decision process. • Traditional MDPs assume a single action per decision epoch. • Solving concurrent MDPs naïvely incurs an exponential blowup in running time.
Outline of the talk • MDPs • Concurrent MDPs • Present sound pruning rules to reduce the blowup. • Present sampling techniques to obtain orders of magnitude speedups. • Experiments • Conclusions and Future Work
Markov Decision Process • S: a set of states, factored into Boolean variables • A: a set of actions • Pr: S × A × S → [0,1]: the transition model • C: A → ℝ: the cost model • γ: discount factor (γ ∈ (0,1]) • s0: the start state • G: a set of absorbing goals
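As a data structure, the tuple might be sketched as follows (Python; the field names are illustrative, not from the paper):

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Set

State = FrozenSet[str]   # a factored state: the set of Boolean variables that are true
Action = str

@dataclass
class MDP:
    states: Set[State]                               # S
    actions: Set[Action]                             # A
    trans: Callable[[State, Action, State], float]   # Pr(s' | s, a)
    cost: Callable[[Action], float]                  # C(a)
    gamma: float                                     # discount factor in (0, 1]
    s0: State                                        # start state
    goals: Set[State]                                # absorbing goal states
```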
GOAL of an MDP • Find a policy π: S → A which: • minimises expected discounted cost of reaching a goal • for an infinite horizon • for a fully observable Markov decision process.
Bellman Backup • Define J*(s) (the optimal cost) as the minimum expected cost to reach a goal from state s. • Given an estimate Jn of the J* function, back up Jn at state s to calculate a new estimate Jn+1 (the update is shown below). • Value Iteration: perform Bellman backups at all states in each iteration; stop when costs have converged at all states.
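The update referred to above, in its standard form (the slide's own equation did not survive extraction):

```latex
Q_{n+1}(s,a) = C(a) + \gamma \sum_{s' \in S} \Pr(s' \mid s, a)\, J_n(s'),
\qquad
J_{n+1}(s) = \min_{a \in Ap(s)} Q_{n+1}(s,a)
```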
Bellman Backup • [Figure: backup at state s; each applicable action a1, a2, a3 in Ap(s) yields Qn+1(s,a) from the successor values Jn; Jn+1(s) is the Min over these Q-values.]
RTDP Trial • [Figure: the same backup at state s; the greedy action amin = a2 minimises Qn+1(s,a) over Ap(s) and is simulated, moving the trial toward the Goal.]
Real Time Dynamic Programming (Barto, Bradtke and Singh '95) • Trial: simulate the greedy policy; perform Bellman backups on the visited states • Repeat RTDP trials until the cost function converges • Anytime behaviour • Only expands the reachable state space • But complete convergence is slow • Labeled RTDP (Bonet & Geffner '03) • Admissible, if started with an admissible cost function • Monotonic; converges quickly
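A schematic of a single trial (a Python sketch, not the authors' implementation; it reuses the MDP structure and the backup equations above, and `applicable` is an assumed helper):

```python
import random

def q_value(mdp, J, s, a):
    """Qn+1(s, a) = C(a) + gamma * sum over s' of Pr(s' | s, a) * Jn(s')."""
    return mdp.cost(a) + mdp.gamma * sum(
        mdp.trans(s, a, t) * J.get(t, 0.0) for t in mdp.states)

def rtdp_trial(mdp, J, applicable, max_depth=1000):
    """One RTDP trial: simulate the greedy policy from s0, backing up each
    visited state. `applicable(s)` returns the actions usable in s; J is a
    dict from state to current cost estimate."""
    s = mdp.s0
    for _ in range(max_depth):
        if s in mdp.goals:
            break  # absorbing goal reached; end the trial
        # Bellman backup at s: the greedy action minimises the Q-value
        a = min(applicable(s), key=lambda act: q_value(mdp, J, s, act))
        J[s] = q_value(mdp, J, s, a)
        # simulate the greedy action: draw s' ~ Pr(. | s, a)
        succs = list(mdp.states)
        weights = [mdp.trans(s, a, t) for t in succs]
        s = random.choices(succs, weights=weights)[0]
```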
Concurrent MDPs • Redefining the applicability function: Ap: S → P(P(A)) • Inheriting mutex definitions from classical planning: • Conflicting preconditions: a1: if p1 set x1, a2: if ¬p1 set x1 • Conflicting effects: a1: set x1 (pr=0.5), a2: toggle x1 (pr=0.5) • Interfering preconditions and effects: a1: if p1 set x1, a2: toggle p1 (pr=0.5)
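A sketch of the pairwise mutex test under these three conditions (the Act representation is an illustrative assumption, not the paper's):

```python
from dataclasses import dataclass
from typing import FrozenSet

@dataclass(frozen=True)
class Act:
    pre_pos: FrozenSet[str]   # variables required to be true
    pre_neg: FrozenSet[str]   # variables required to be false
    effects: FrozenSet[str]   # variables the action may change

def mutex(a1: Act, a2: Act) -> bool:
    """True if a1 and a2 may not execute concurrently."""
    # conflicting preconditions: one requires p true, the other p false
    if a1.pre_pos & a2.pre_neg or a1.pre_neg & a2.pre_pos:
        return True
    # conflicting effects: both may write the same variable
    if a1.effects & a2.effects:
        return True
    # interfering preconditions and effects: one reads what the other writes
    pre1 = a1.pre_pos | a1.pre_neg
    pre2 = a2.pre_pos | a2.pre_neg
    return bool(pre1 & a2.effects) or bool(pre2 & a1.effects)
```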
Concurrent MDPs (contd) • Ap(s) = {Ac ⊆ A | all actions in Ac are individually applicable in s, and no two actions in Ac are mutex} • ⇒ The actions in Ac don't interact with each other. Hence:
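The formula after "Hence" did not survive extraction. Since the actions in Ac do not interact, they can be applied in any order; a hedged reconstruction for a two-action combination is the composition of the individual transition models:

```latex
\Pr(s' \mid s, \{a_1, a_2\}) = \sum_{s''} \Pr(s'' \mid s, a_1)\,\Pr(s' \mid s'', a_2)
```

(non-interaction guarantees the result is independent of the order chosen).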
Concurrent MDPs (contd) • Cost model: C: P(A) → ℝ • Typically, C(Ac) < Σa∈Ac C({a}) • Time component • Resource component • (If C(Ac) = … then the optimal sequential policy is optimal for the concurrent MDP.)
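One concrete instantiation, consistent with the 0.2/0.8 weights on the Experiments slide below (the α/β form itself is an assumption for illustration): concurrent actions overlap in elapsed time but still pay for every resource they consume:

```latex
C(A_c) = \alpha \cdot \max_{a \in A_c} \mathrm{time}(a)
       + \beta \cdot \sum_{a \in A_c} \mathrm{resource}(a),
\qquad \alpha = 0.2,\ \beta = 0.8
```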
Bellman Backup (Concurrent MDP) • [Figure: the backup at state s now branches over every non-mutex combination in Ap(s): {a1}, {a2}, {a3}, {a1,a2}, {a1,a3}, {a2,a3}, {a1,a2,a3}.] • Exponential blowup to calculate a Bellman backup!
Outline of the talk • MDPs • Concurrent MDPs • Present sound pruning rules to reduce the blowup. • Present sampling techniques to obtain orders of magnitude speedups. • Experiments • Conclusions and Future Work
Combo skipping (proven sound pruning rule) • If ⌈Jn(s)⌉ < γ^(1−k) Qn(s,{a1}) + func(Ac, γ), then prune Ac for state s in this backup. • Choose a1 as the action with maximum Qn(s,{a1}) to obtain maximum pruning. • Use Qn(s, Aprev) as an upper bound on Jn(s). • Skips a combination only for the current iteration.
Combo elimination (proven sound pruning rule) • If ⌊Q*(s,Ac)⌋ > ⌈J*(s)⌉, then eliminate Ac from the applicability set of state s. • Use J*sing(s) (the optimal cost for the single-action MDP) as an upper bound on J*(s). • Use Qn(s,Ac) as a lower bound on Q*(s,Ac). • Eliminates the combination Ac from the applicable list of s for all subsequent iterations.
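As a sketch, both rules slot into the backup loop like this (Python; the helper names and bookkeeping are assumptions, while the tests are the two inequalities above):

```python
import math

def pruned_backup(s, combos, q_value, skip_bound, jn_upper, jstar_upper, eliminated):
    """One Bellman backup over combos = Ap(s), with both pruning rules.
    Assumed helpers/arguments:
      q_value(s, Ac)    -- evaluates the expensive Qn+1(s, Ac)
      skip_bound(s, Ac) -- the combo-skipping lower bound on Qn(s, Ac)
      jn_upper          -- upper bound on Jn(s), e.g. Qn(s, Aprev)
      jstar_upper       -- upper bound on J*(s), e.g. J*sing(s)
      eliminated        -- set of combinations pruned for all future iterations
    """
    best_q, best_combo = float("inf"), None
    for Ac in combos:
        if Ac in eliminated:
            continue
        # Combo-skipping: the lower bound already beats the upper bound on
        # Jn(s), so Ac cannot be greedy in this backup (but may be later).
        if math.ceil(jn_upper) < skip_bound(s, Ac):
            continue
        q = q_value(s, Ac)
        # Combo-elimination: Qn is a lower bound on Q*, so if it exceeds an
        # upper bound on J*(s), Ac can never be optimal at s.
        if math.floor(q) > math.ceil(jstar_upper):
            eliminated.add(Ac)
            continue
        if q < best_q:
            best_q, best_combo = q, Ac
    return best_q, best_combo
```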
Pruned RTDP • RTDP with modified Bellman Backups. • Combo-skipping • Combo-elimination • Guarantees: • Convergence • Optimality
Experiments • Domains • NASA Rover domain • Factory domain • Switchboard domain • Cost function • Time component: 0.2 • Resource component: 0.8 • State variables: 20–30 • Avg |Ap(s)|: 170–12287
Stochastic Bellman Backups • Sample a subset of combinations for each Bellman backup. • Intuition: actions with low Q-values are likely to appear in the optimal combination. • Sampling distribution: (i) calculate all single-action Q-values; (ii) bias towards combinations containing actions with low Q-values. • Also keep the best combinations for this state from the previous iteration (memoization).
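A minimal sketch of such a biased sampler (Python; the softmax weighting and temperature are illustrative assumptions, not the paper's exact distribution):

```python
import math
import random

def sample_combos(combos, q_single, num_samples, temp=1.0, memo=()):
    """Sample action combinations for one stochastic backup, biased toward
    combinations whose actions have low single-action Q-values.
    combos: list of candidate combinations (frozensets of actions)
    q_single: dict mapping action a to Qn(s, {a})
    memo: best combinations from the previous iteration (always kept)."""
    def weight(Ac):
        # lower summed Q-value gives a larger weight (softmax of the negated sum)
        return math.exp(-sum(q_single[a] for a in Ac) / temp)
    weights = [weight(Ac) for Ac in combos]
    sampled = random.choices(combos, weights=weights, k=num_samples)
    return set(sampled) | set(memo)
```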
Sampled RTDP • Non-monotonic • Inadmissible • ⇒ Convergence and optimality not proven. • Heuristics: • Complete backup phase (labeling). • Run Pruned RTDP with the value function from Sampled RTDP (after scaling).
Varying num_samples • [Figure: plots of solution optimality and running-time efficiency as the number of sampled combinations varies.]
Contributions • Modeled Concurrent MDPs • Sound, optimal pruning methods • Combo-skipping • Combo-elimination • Fast sampling approaches • Close to optimal solution • Heuristics to improve optimality • Our techniques are general and can be applied to any algorithm – VI, LAO*, etc.
Related Work • Factorial MDPs (Meuleau et al. '98; Singh & Cohn '98) • Multiagent planning (Guestrin, Koller & Parr '01) • Concurrent Markov options (Rohanimanesh & Mahadevan '01) • Generate, test and debug paradigm (Younes & Simmons '04) • Parallelization of sequential plans (Edelkamp '03; Nigenda & Kambhampati '03)
Future Work • Find error bounds and prove convergence for Sampled RTDP • Concurrent reinforcement learning • Modeling durative actions (Concurrent Probabilistic Temporal Planning) • Initial results: Mausam & Weld '04 (AAAI Workshop on MDPs)
Concurrent Probabilistic Temporal Planning (CPTP) • Concurrent MDP • CPTP • Our solution (AAAI Workshop on MDPs) • Model CPTP as a Concurrent MDP in an augmented state space. • Present admissible heuristics to speed up the search and manage the state space blowup.