Knowledge Representation Meets Stochastic Planning
Bob Givan (joint work with Alan Fern and SungWook Yoon)
Electrical and Computer Engineering, Purdue University
Overview
• We present a form of approximate policy iteration specifically designed for large relational MDPs.
• We describe a novel application viewing entire planning domains as MDPs:
  • we automatically induce domain-specific planners.
• Induced planners are state-of-the-art on:
  • deterministic planning benchmarks
  • stochastic variants of planning benchmarks
Ideas from Two Communities
• Traditional planning: induction of control knowledge; planning heuristics.
• Decision-theoretic planning: approximate policy iteration (API); policy rollout.
• Two views of the new technique:
  • iterative improvement of control knowledge
  • API with a policy-space bias
Planning Problems
• States: first-order interpretations of a particular language.
• A planning problem gives:
  • a current state
  • a goal state (or goal region)
  • a list of actions and their semantics (may be stochastic)
[Figure: blocks-world current state and goal state/region; available actions: Pickup(x), PutDown(y)]
Planning Domains
• A planning domain is a distribution over problems sharing one set of actions (but with different object domains and sizes).
[Figure: blocks-world domain; available actions: Pickup(x), PutDown(y)]
Control Knowledge
• Traditional planners solve problems, not domains:
  • little or no generalization between problems in a domain.
• Planning domains are “solved” by control knowledge:
  • pruning some actions, typically eliminating search, e.g. “don’t pick up a solved block.”
[Figure: blocks-world states with pruned actions marked X]
Recent Control Knowledge Research
• Human-written control knowledge often eliminates search.
  • [Bacchus & Kabanza, 1996] TLPlan
• Helpful control knowledge can be learned from “small problems”:
  • [Khardon, 1996 & 1999] Learning Horn-clause action strategies
  • [Huang, Selman & Kautz, 2000] Learning action-selection and action-rejection rules
  • [Martin & Geffner, 2000] Learning generalized policies in concept languages
  • [Yoon, Fern & Givan, 2002] Inductive policy selection for stochastic planning domains
Unsolved Problems
• Finding control knowledge without immediate access to small problems:
  • can we learn directly in a large domain?
• Improving buggy control knowledge:
  • all previous techniques produce unreliable control knowledge, with occasional fatal flaws.
• Our approach: view control knowledge as an MDP policy and apply policy improvement.
  • A policy is a choice of action for each MDP state.
Planning Domains as MDPs
• View the domain as one big state space, with each state a planning problem.
• This view facilitates generalization between problems.
[Figure: blocks-world domain as an MDP; available actions Pickup(x), PutDown(y); transition shown for Pickup(Purple)]
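As a concrete, purely illustrative picture of “each state is a planning problem,” one might encode a single state of the large MDP as a set of ground atoms together with its goal; the dictionary layout below is an assumption of this write-up, not the talk's representation.

```python
# One relational MDP state in the blocks-world domain, represented (for
# illustration only) as a frozenset of ground atoms plus the goal region.
state = {
    "atoms": frozenset({("on", "A", "B"), ("on-table", "B"),
                        ("clear", "A"), ("handempty",)}),
    "goal":  frozenset({("on", "B", "A")}),
}
# Every such (current state, goal) pair is one state of the single large MDP;
# Pickup(x) and PutDown(y) are its actions.
```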
Ideas from Two Communities
• Traditional planning: induction of control knowledge; planning heuristics.
• Decision-theoretic planning: approximate policy iteration (API); policy rollout.
• Two views of the new technique:
  • iterative improvement of control knowledge
  • API with a policy-space bias
Policy Iteration
• Given a policy π and a state s, can we improve π(s)?
• Value of following π from s (current action π(s)):
  V^π(s) = Q^π(s, π(s)) = R_{π(s)} + γ E_{s' ∈ {s_1 … s_k}} [ V^π(s') ]
• Value of taking an alternative action b at s, then following π:
  Q^π(s, b) = R_b + γ E_{s' ∈ {t_1 … t_n}} [ V^π(s') ]
• If V^π(s) < Q^π(s, b), then π(s) can be improved to b.
• Such improvements can be made at all states at once (policy improvement: base policy → improved policy).
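For readers who want the mechanics spelled out, here is a minimal tabular sketch of one exact policy-iteration step (evaluation plus improvement). It assumes small, enumerable state and action sets, which is exactly what the relational setting in this talk does not have; the names P, R, and eval_iters are illustrative.

```python
import numpy as np

def policy_improvement(P, R, policy, gamma=0.9, eval_iters=200):
    """One exact policy-iteration step for a small tabular MDP.

    P[a][s, s'] : transition probabilities for action a,
    R[a][s]     : expected reward for action a in state s,
    policy[s]   : index of the current action choice at each state.
    """
    n_states = len(policy)
    n_actions = len(P)

    # Policy evaluation: approximate V^pi by repeated backups.
    V = np.zeros(n_states)
    for _ in range(eval_iters):
        V = np.array([R[policy[s]][s] + gamma * P[policy[s]][s] @ V
                      for s in range(n_states)])

    # Policy improvement: at each state, switch to the action whose
    # Q-value beats the current value V^pi(s).
    Q = np.array([[R[a][s] + gamma * P[a][s] @ V
                   for a in range(n_actions)] for s in range(n_states)])
    return Q.argmax(axis=1), V
```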
Flowchart View of Policy Iteration
• Current policy π → compute V^π at all states → compute Q^π for each action at all states → choose the best action at each state → improved policy π′.
• Problem: too many states.
Flowchart View of Policy Rollout
• For the current state s and each action a: sample a successor s′, then estimate V^π(s′) by drawing trajectories under π from s′.
• This yields an estimate of Q^π(s, a) for each action at s.
• Choose the best action at s: the improved policy π′(s) is computed only at the states actually encountered, not at all states.
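A minimal sketch of rollout in code, assuming a generative simulator simulate(s, a) that returns (next_state, reward) and a base policy callable policy(s); both interfaces are assumptions made for illustration.

```python
def rollout_q(simulate, state, action, policy, gamma=0.9,
              n_samples=20, horizon=50):
    """Monte Carlo estimate of Q^pi(state, action) by policy rollout.

    `simulate(s, a)` is assumed to sample (next_state, reward) from the MDP;
    `policy(s)` returns the base policy's action at s.
    """
    total = 0.0
    for _ in range(n_samples):
        s, r = simulate(state, action)       # one sampled successor of (state, action)
        ret, discount = r, gamma
        for _ in range(horizon):             # trajectory under the base policy
            s, r = simulate(s, policy(s))
            ret += discount * r
            discount *= gamma
        total += ret
    return total / n_samples

def rollout_policy(simulate, state, actions, policy, **kw):
    """Improved action at `state`: argmax over rollout Q estimates."""
    return max(actions, key=lambda a: rollout_q(simulate, state, a, policy, **kw))
```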
Approximate Policy Iteration
• Idea: use machine learning to control the number of samples needed.
  • Draw a training set of pairs (s, π′(s)): compute Q^π(s, ·) for each action at a sampled state s and choose the best action at s.
  • Learn a new policy from the training set, and repeat.
• Refinement: use pairs (s, Q^π(s, ·)) to define misclassification costs.
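A sketch of the API loop under the same assumptions, reusing rollout_policy from the rollout sketch above. The scikit-learn decision tree is only a stand-in for the policy-language learner described later in the talk, and featurize / sample_states are hypothetical helpers.

```python
import random
from sklearn.tree import DecisionTreeClassifier  # stand-in learner only

def approximate_policy_iteration(simulate, featurize, actions, sample_states,
                                 init_policy, n_iters=5, n_train=500):
    """Sketch of the API loop: rollout produces action labels at sampled
    states, and a classifier generalizes them into the next policy.
    simulate, featurize, and sample_states are illustrative assumptions."""
    policy = init_policy
    for _ in range(n_iters):
        X, y = [], []
        for s in random.sample(sample_states, min(n_train, len(sample_states))):
            # Rollout improvement at s supplies the training label pi'(s),
            # using rollout_policy from the previous sketch.
            best = rollout_policy(simulate, s, actions, policy)
            X.append(featurize(s))
            y.append(actions.index(best))
        clf = DecisionTreeClassifier().fit(X, y)
        # The learned classifier *is* the next policy.
        policy = lambda s, clf=clf: actions[int(clf.predict([featurize(s)])[0])]
    return policy
```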
A Challenge Problem
• Consider the following stochastic blocks-world problem:
  • Goal: Clear(A)
  • Assume: block color affects pickup() success.
• The optimal policy is compact, but the value function is not: a state's value depends on the set of colors above A.
[Figure: a stack of colored blocks above block A]
A Policy for the Example Problem
• A compact policy for this problem:
  1. If holding a block, put it down on the table; else
  2. pick up a clear block above A.
• How can we formalize this policy?
Action Selection Rules [Martin & Geffner, KR 2000]
• “Pick up a clear block above block A…”
• Action selection rules are based on classes of objects:
  • apply action a to an object in class C (if possible), abbreviated C:a.
• How can we describe the object classes?
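One way to picture evaluating an ordered list of C:a rules in code, under an assumed state interface (state.objects, a class-membership test, and a legality test). This is illustrative only and is not the paper's concept-language syntax.

```python
def apply_rule_list(rules, state, legal):
    """Evaluate an ordered list of C:a rules in a state.

    `rules` : list of (object_class, action_name) pairs, where
              object_class(state, obj) tests class membership;
    `legal` : legal(state, action_name, obj) tests applicability.
    All of these interfaces are illustrative assumptions.
    """
    for object_class, action_name in rules:
        for obj in state.objects:
            if object_class(state, obj) and legal(state, action_name, obj):
                return (action_name, obj)
    return None  # fall through: no rule fires

# Example rule list for the Clear(A) problem (informal class definitions):
#   [(is_held_block, "putdown"), (clear_and_above_A, "pickup")]
```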
Formal Policy for the Example Problem
[Figure: the two C:a rules for this problem, written in the concept language]
• We find this policy with a heuristic search guided by the training data.
Ideas from Two Communities
• Traditional planning: induction of control knowledge; planning heuristics.
• Decision-theoretic planning: approximate policy iteration (API); policy rollout.
• Two views of the new technique:
  • iterative improvement of control knowledge
  • API with a policy-space bias
API with a Policy Language Bias
• Same loop as before: at sampled states s, compute Q^π(s, ·) for each action and choose the best action π′(s).
• Train a new policy π′ in the policy language from these (s, π′(s)) pairs.
Incorporating Value Estimates
• What happens if the policy can't find reward along a trajectory?
• Use a value estimate at those states.
• For learning control knowledge, we use the plangraph heuristic of FF-plan.
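A hedged sketch of how such an estimate might plug into a rollout trajectory: if the trajectory is truncated before any reward is found, bootstrap with a heuristic value. The heuristic(s) interface and the positive-reward goal test are assumptions of this sketch, not the talk's exact procedure.

```python
def rollout_value(simulate, s, policy, heuristic, gamma=0.9, horizon=50):
    """Trajectory value estimate for V^pi(s) with a heuristic fallback.

    If the trajectory is cut off before reaching reward, the remaining value
    is approximated by `heuristic(s)` (e.g. a planning-heuristic estimate);
    the interface is illustrative.
    """
    ret, discount = 0.0, 1.0
    for _ in range(horizon):
        s, r = simulate(s, policy(s))
        ret += discount * r
        discount *= gamma
        if r > 0:                              # goal reward found; stop here
            return ret
    return ret + discount * heuristic(s)       # bootstrap with the heuristic
```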
Initial Policy Choice
• Policy iteration requires an initial base policy. Options include:
  • a random policy
  • the greedy policy with respect to a planning heuristic
  • a policy learned from small problems
Experimental Domains
• SBW(n): (Stochastic) Blocks World
• SPW(n): (Stochastic) Painted Blocks World
• SLW(t,p,c): (Stochastic) Logistics World
API Results
• Starting with flawed policies learned from small problems.
[Figure: success-rate plots across API iterations]
API Results
• Starting with a policy greedy with respect to a domain-independent heuristic.
• We used the heuristic of FF-plan (Hoffmann and Nebel, JAIR 2001).
How Good is the Induced Planner?
Conclusions
• Using a policy-space bias, we can learn good policies for extremely large structured MDPs.
• We can automatically learn domain-specific planners that compete favorably with state-of-the-art domain-independent planners.
Approximate Policy Iteration
• Sample states s and compute Q values at each: form a training set of tuples (s, b, Q^π(s, b)), then learn a new policy from this training set.
• Computing Q^π(s, b): estimate R_b + γ E_{s' ∈ {t_1 … t_n}} [ V^π(s') ] by
  • sampling states t_i from t_1 … t_n, and
  • drawing trajectories under π from t_i to estimate V^π.
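Written out (with m, the number of sampled successors, and the hat notation being additions of this write-up), the estimator is:

\[
\hat{Q}^{\pi}(s,b) \;=\; R_b \;+\; \frac{\gamma}{m}\sum_{j=1}^{m}\hat{V}^{\pi}(t_j),
\qquad t_j \sim \{t_1 \ldots t_n\},
\]

where each \(\hat{V}^{\pi}(t_j)\) is the discounted return of one trajectory that follows \(\pi\) from \(t_j\).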
Markov Decision Process (MDP)
• Ingredients:
  • system state x in state space X
  • control action a in A(x)
  • reward R(x, a)
  • state-transition probability P(x, y, a)
• Goal: find a control policy that maximizes an objective function.
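The slide leaves the objective unspecified; one common choice, consistent with the discount factor γ used on the policy-iteration slides, is expected discounted reward:

\[
\max_{\pi} \; \mathbb{E}\Bigl[\sum_{t \ge 0} \gamma^{t}\, R\bigl(x_t, \pi(x_t)\bigr)\Bigr],
\qquad x_{t+1} \sim P(x_t, \cdot\,, \pi(x_t)), \quad 0 \le \gamma < 1.
\]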
Control Knowledge vs. Policy
• Perhaps the biggest difference between the communities:
  • deterministic planning works with action sequences;
  • decision-theoretic planning works with policies.
• Policies are needed because uncertainty may carry you to any state.
  • Compare: control knowledge also handles every state.
• Good control knowledge eliminates search:
  • it defines a policy over the possible state/goal pairs.