GraphPlan, Satplan and Markov Decision Processes Sungwook Yoon* * Based in part on slides by Alan Fern
GraphPlan • Many planning systems use ideas from Graphplan: • IPP, STAN, SGP, Blackbox, Medic • Can run much faster than POP-style planners • History • Before GraphPlan came out, most planning researchers were working on POP-style planners • GraphPlan started them thinking about other, more efficient algorithms • Recent planning algorithms run much faster than GraphPlan • However, most of them have been influenced by GraphPlan
Big Picture • A big source of inefficiency in search algorithms is the large branching factor • GraphPlan reduces the branching factor by searching in a special data structure • Phase 1 - Create a Planning Graph • built from initial state • contains actions and propositions that are possibly reachable from initial state • does not include unreachable actions or propositions • Phase 2 - Solution Extraction • Search for a solution in the planning graph • backward from goal
Planning Graph • Sequence of levels that correspond to time-steps in the plan: • Each level contains a set of literals and a set of actions • Literals are those that could possibly be true at the time step • Actions are those whose preconditions could be satisfied at the time step • Idea: construct a superset of the literals that could possibly be achieved after an n-level layered plan • Gives a compact (but approximate) representation of states that are reachable by n level plans • A literal is just a positive or negative proposition
Planning Graph • state-level 0: propositions true in s0 • state-level n: literals that may possibly be true after some n level plan • action-level n: actions that may possibly be applicable after some n level plan [Figure: planning graph with alternating proposition and action levels s0, …, sn, an, sn+1]
Planning Graph • maintenance actions (persistence actions) • represent what happens if no action affects the literal • include an action with precondition c and effect c, for each literal c [Figure: planning graph with proposition and action levels]
Graph expansion • Initial proposition layer • Just the initial conditions • Action layer n • If all of an action’s preconditions are in proposition layer n, then add the action to layer n • Proposition layer n+1 • For each action at layer n (including persistence actions) • Add all its effects (both positive and negative) at layer n+1 (persistence actions also allow propositions at layer n to persist to n+1) • Propagate mutex information (we’ll talk about this in a moment)
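The expansion rule above can be sketched in a few lines of Python. This is a minimal illustration, not GraphPlan's actual implementation: the `Action` tuple, the `persistence` and `expand` helpers, and the `~p` spelling of negative literals are all assumptions made for the sketch, and mutex propagation is left out. The input is the blocks-world `stack(A,B)` action, with the standard effect `clear(A)` assumed.

```python
from typing import FrozenSet, List, NamedTuple, Set, Tuple


class Action(NamedTuple):
    # Hypothetical container for a ground action; "~p" denotes a negative literal.
    name: str
    preconds: FrozenSet[str]
    effects: FrozenSet[str]


def persistence(literal: str) -> Action:
    """Maintenance action: precondition c and effect c."""
    return Action(f"persist({literal})", frozenset({literal}), frozenset({literal}))


def expand(prop_layer: Set[str], actions: List[Action]) -> Tuple[List[Action], Set[str]]:
    """Build action layer n and proposition layer n+1 from proposition layer n.

    Mutex propagation is deliberately omitted from this sketch.
    """
    # An action enters layer n only if all of its preconditions are in the layer.
    action_layer = [a for a in actions if a.preconds <= prop_layer]
    # One persistence action per literal lets that literal carry over unchanged.
    action_layer += [persistence(lit) for lit in sorted(prop_layer)]
    # Proposition layer n+1 collects every effect, positive and negative.
    next_layer: Set[str] = set()
    for a in action_layer:
        next_layer |= a.effects
    return action_layer, next_layer


stack_ab = Action(
    "stack(A,B)",
    frozenset({"holding(A)", "clear(B)"}),
    frozenset({"~holding(A)", "~clear(B)", "on(A,B)", "clear(A)", "handempty"}),
)

s0 = {"holding(A)", "clear(B)"}
a0, s1 = expand(s0, [stack_ab])
print([a.name for a in a0])
print(sorted(s1))
```

Note that `s1` contains both `holding(A)` and `~holding(A)`: the layer is a superset of what is reachable, which is why mutex information is needed.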
Example • stack(A,B) • precondition: holding(A), clear(B) • effect: ~holding(A), ~clear(B), on(A,B), clear(A), handempty • s0: holding(A), clear(B) • a0: stack(A,B) • s1: holding(A), ~holding(A), handempty, ~clear(B), on(A,B), clear(A), clear(B) • Notice that not all literals in s1 can be made true simultaneously after 1 level: e.g. holding(A) and ~holding(A), or on(A,B) and clear(B)
Mutual Exclusion (Mutex) • Between pairs of actions • no valid plan could contain both at layer n • E.g., stack(a,b) and unstack(a,b) • Between pairs of literals • no valid plan could produce both at layer n • E.g., clear(a) and ~clear(a); on(a,b) and clear(b) • GraphPlan checks pairs only • mutex relationships can help rule out possibilities during search in phase 2 of Graphplan
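Because only pairs are checked, testing a candidate goal set against the mutex relation is a simple pairwise scan. The sketch below assumes a hypothetical representation in which mutex pairs are stored as two-element frozensets and negative literals are written `~p`; the function names are illustrative, not part of GraphPlan.

```python
from itertools import combinations


def negate(lit: str) -> str:
    """Return the complementary literal ("~p" for "p" and vice versa)."""
    return lit[1:] if lit.startswith("~") else "~" + lit


def is_mutex_free(literals, mutex_pairs) -> bool:
    """True if no pair of the given literals is mutex.

    A pair counts as mutex if it is recorded in mutex_pairs or if the two
    literals are complements of each other. Only pairs are examined.
    """
    for p, q in combinations(sorted(literals), 2):
        if frozenset({p, q}) in mutex_pairs or q == negate(p):
            return False
    return True


# Hypothetical mutex relation for one proposition layer.
mutex_pairs = {frozenset({"on(a,b)", "clear(b)"})}

print(is_mutex_free({"on(a,b)", "clear(b)"}, mutex_pairs))     # False
print(is_mutex_free({"on(a,b)", "handempty"}, mutex_pairs))    # True
print(is_mutex_free({"clear(a)", "~clear(a)"}, mutex_pairs))   # False
```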
Solution Extraction: Backward Search • Repeat until goal set is empty • If goals are present & non-mutex: • 1) Choose set of non-mutex actions to achieve each goal • 2) Add preconditions to next goal set
Searching for a solution plan • Backward chain on the planning graph • Achieve goals level by level • At level k, pick a subset of non-mutex actions to achieve current goals. Their preconditions become the goals for k-1 level. • Build goal subset by picking each goal and choosing an action to add. Use one already selected if possible (backtrack if can’t pick non-mutex action) • If we reach the initial proposition level and the current goals are in that level (i.e. they are true in the initial state) then we have found a successful layered plan
GraphPlan algorithm • Grow the planning graph (PG) to a level n such that all goals are reachable and not mutex • necessary but insufficient condition for the existence of an n level plan that achieves the goals • if PG levels off before non-mutex goals are achieved then fail • Search the PG for a valid plan • If none found, add a level to the PG and try again • If the PG levels off and still no valid plan found, then return failure Correctness follows from PG properties
Important Ideas • Plan graph construction is polynomial time • Though construction can be expensive when there are many “objects” and hence many propositions • The plan graph captures important properties of the planning problem • Necessarily unreachable literals and actions • Possibly reachable literals and actions • Mutually exclusive literals and actions • Significantly prunes search space compared to POP style planners • The plan graph provides a sound termination procedure • Knows when no plan exists • Plan graphs can also be used for deriving admissible (and good non-admissible) heuristics • See your book (we may come back to this idea later)
Encoding Planning as Satisfiability: Basic Idea • Bounded planning problem (P,n): • P is a planning problem; n is a positive integer • Find a solution for P of length n • Create a propositional formula that represents: • Initial state • Goal • Action Dynamics for n time steps • We will define the formula for (P,n) such that: 1) any model (i.e. satisfying truth assignment) of the formula represents a solution to (P,n) 2) if (P,n) has a solution then the formula is satisfiable
Example of Complete Formula for (P,1)
[ at(r1,l1,0) ∧ ¬at(r1,l2,0) ] ∧ at(r1,l2,1)
∧ [ move(r1,l1,l2,0) ⇒ at(r1,l1,0) ] ∧ [ move(r1,l1,l2,0) ⇒ at(r1,l2,1) ] ∧ [ move(r1,l1,l2,0) ⇒ ¬at(r1,l1,1) ]
∧ [ move(r1,l2,l1,0) ⇒ at(r1,l2,0) ] ∧ [ move(r1,l2,l1,0) ⇒ at(r1,l1,1) ] ∧ [ move(r1,l2,l1,0) ⇒ ¬at(r1,l2,1) ]
∧ [ ¬move(r1,l1,l2,0) ∨ ¬move(r1,l2,l1,0) ]
∧ [ ¬at(r1,l1,0) ∧ at(r1,l1,1) ⇒ move(r1,l2,l1,0) ] ∧ [ ¬at(r1,l2,0) ∧ at(r1,l2,1) ⇒ move(r1,l1,l2,0) ]
∧ [ at(r1,l1,0) ∧ ¬at(r1,l1,1) ⇒ move(r1,l1,l2,0) ] ∧ [ at(r1,l2,0) ∧ ¬at(r1,l2,1) ⇒ move(r1,l2,l1,0) ]
• Formula has propositions for actions and state variables at each possible timestep • We’ll now discuss how to construct such a formula
Overall Approach • Do iterative deepening like we did with Graphplan: • for n = 0, 1, 2, …, • encode (P,n) as a satisfiability problem Φ • if Φ is satisfiable, then • From the set of truth values that satisfies Φ, a solution plan can be constructed, so return it and exit • With a complete satisfiability tester, this approach will produce optimal layered plans for solvable problems • We can use a GraphPlan analysis to determine an upper bound on n, giving a way to detect unsolvability
Fluents (will be used as propositions) • If plan a0, a1, …, an–1 is a solution to (P,n), then it generates a sequence of states s0, s1, …, sn • A fluent is a proposition used to describe what’s true in each si • at(r1,loc1,i) is a fluent that’s true iff at(r1,loc1) is in si • We’ll use ei to denote the fluent for a fact e in state si • e.g. if e = at(r1,loc1) then ei = at(r1,loc1,i) • ai is a fluent saying that a is an action taken at step i • e.g., if a = move(r1,loc2,loc1) then ai = move(r1,loc2,loc1,i) • The set of all possible fluents for (P,n) forms the set of primitive propositions used to construct our formula for (P,n)
Encoding Planning Problems • We can encode (P,n) so that we consider either layered plans or totally ordered plans • an advantage of considering layered plans is that fewer time steps are necessary (i.e. smaller n translates into smaller formulas) • for simplicity we first consider totally-ordered plans • Encode (P,n) as a formula Φ such that a0, a1, …, an–1 is a solution for (P,n) if and only if Φ can be satisfied in a way that makes the fluents a0, …, an–1 true • Φ will be a conjunction of many other formulas …
Formulas in Φ • Formula describing the initial state: (let E be the set of possible facts in the planning problem) /\{e0 | e ∈ s0} ∧ /\{¬e0 | e ∈ E – s0} • Describes the complete initial state (both positive and negative facts) • E.g. on(A,B,0) ∧ ¬on(B,A,0) • Formula describing the goal: (G is the set of goal facts) /\{en | e ∈ G} • says that the goal facts must be true in the final state at timestep n • E.g. on(B,A,n) • Is this enough? • Of course not. The formulas say nothing about actions.
Formulas in Φ • For every action a and timestep i, formulas describing what fluents must be true if a were the i’th step of the plan: • ai ⇒ /\{ei | e ∈ Precond(a)}, a’s preconditions must be true • ai ⇒ /\{ei+1 | e ∈ ADD(a)}, a’s ADD effects must be true in i+1 • ai ⇒ /\{¬ei+1 | e ∈ DEL(a)}, a’s DEL effects must be false in i+1 • Complete exclusion axiom: • For all pairs of distinct actions a and b and timesteps i, formulas saying a and b can’t occur at the same time: ¬ai ∨ ¬bi • this guarantees there can be only one action at a time • Is this enough? • The formulas say nothing about what happens to facts if they are not affected by an action • This is known as the frame problem
Example • Planning domain: • one robot r1 • two adjacent locations l1, l2 • one operator (move the robot) • Encode (P,n) where n = 1 • Initial state: {at(r1,l1)} Encoding: at(r1,l1,0) ∧ ¬at(r1,l2,0) • Goal: {at(r1,l2)} Encoding: at(r1,l2,1) • Action Schema: see next slide
Extracting a Plan • Suppose we find an assignment of truth values that satisfies Φ. • This means P has a solution of length n • For i=0,…,n-1, there will be exactly one action a such that ai = true • This is the i’th action of the plan. • Example (from the previous slides): • Φ can be satisfied with move(r1,l1,l2,0) = true • Thus the one-step plan move(r1,l1,l2) is a solution for (P,1) • It’s the only solution - no other way to satisfy Φ
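To make the extraction step concrete, here is a small sketch that writes the robot example for (P,1) as CNF clauses and finds its models by brute-force enumeration. The clause list follows the example formula for (P,1), including its frame clauses; the brute-force `models` function stands in for a real SAT solver, and `extract_plan` and the variable names are illustrative assumptions.

```python
from itertools import product

# Fluents for the one-robot, two-location example with n = 1.
A1_0, A2_0 = "at(r1,l1,0)", "at(r1,l2,0)"
A1_1, A2_1 = "at(r1,l1,1)", "at(r1,l2,1)"
M12, M21 = "move(r1,l1,l2,0)", "move(r1,l2,l1,0)"
VARS = [A1_0, A2_0, A1_1, A2_1, M12, M21]

# Each clause is a list of (fluent, required truth value) literals.
# An implication "a => b" is written as the clause (not a) or b.
CLAUSES = [
    [(A1_0, True)], [(A2_0, False)],                      # initial state
    [(A2_1, True)],                                       # goal
    [(M12, False), (A1_0, True)],                         # move l1->l2: precondition
    [(M12, False), (A2_1, True)],                         # move l1->l2: add effect
    [(M12, False), (A1_1, False)],                        # move l1->l2: delete effect
    [(M21, False), (A2_0, True)],                         # move l2->l1: precondition
    [(M21, False), (A1_1, True)],                         # move l2->l1: add effect
    [(M21, False), (A2_1, False)],                        # move l2->l1: delete effect
    [(M12, False), (M21, False)],                         # complete exclusion
    [(A1_0, True), (A1_1, False), (M21, True)],           # frame: l1 became true
    [(A2_0, True), (A2_1, False), (M12, True)],           # frame: l2 became true
    [(A1_0, False), (A1_1, True), (M12, True)],           # frame: l1 became false
    [(A2_0, False), (A2_1, True), (M21, True)],           # frame: l2 became false
]


def models(variables, clauses):
    """Yield every truth assignment satisfying all clauses (brute force)."""
    for values in product([False, True], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        if all(any(assignment[v] == want for v, want in c) for c in clauses):
            yield assignment


def extract_plan(assignment, n):
    """Collect, for each step i, the action fluents that are true at step i."""
    plan = []
    for i in range(n):
        plan += [v for v, val in assignment.items()
                 if val and v.startswith("move(") and v.endswith(f",{i})")]
    return plan


solutions = list(models(VARS, CLAUSES))
print(len(solutions))                  # 1
print(extract_plan(solutions[0], 1))   # ['move(r1,l1,l2,0)']
```

Enumerating all 64 assignments is only feasible because the example is tiny; a practical system would hand the clauses to a SAT solver.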
What SATPLAN Shows • General propositional reasoning can compete with state of the art specialized planning systems • New, highly tuned variations of DP (the Davis-Putnam procedure) are surprisingly powerful • Radically new stochastic approaches to SAT can provide very low exponential scaling • Why does it work? • More flexible than forward or backward chaining • Randomized algorithms less likely to get trapped along bad paths
BlackBox (GraphPlan + SatPlan) • The BlackBox procedure combines planning-graph expansion and satisfiability checking • It is roughly as follows: • for n = 0, 1, 2, … • Graph expansion: • create a “planning graph” that contains n “levels” • Check whether the planning graph satisfies a necessary (but insufficient) condition for plan existence • If it does, then • Encode (P,n) as a satisfiability problem Φ but include only the actions in the planning graph • If Φ is satisfiable then return the solution
Blackbox • Can be thought of as an implementation of GraphPlan that replaces GraphPlan's backward-chaining plan extraction with an alternative technique (satisfiability checking) [Figure: Blackbox pipeline with components STRIPS, Plan Graph (mutex computation), Translator, CNF, Simplifier, CNF, general stochastic / systematic SAT engines, Solution]
Classical Planning Assumptions • sole source of change • perfect • deterministic • fully observable • instantaneous [Figure: agent, marked “????”, interacting with the World through Actions and Percepts]
Stochastic/Probabilistic Planning: Markov Decision Process (MDP) Model • sole source of change • perfect • stochastic • fully observable • instantaneous [Figure: agent, marked “????”, interacting with the World through Actions and Percepts]
Types of Uncertainty • Disjunctive (used by non-deterministic planning) Next state could be one of a set of states. • Stochastic/Probabilistic Next state is drawn from a probability distribution over the set of states. How are these models related?
Markov Decision Processes • An MDP has four components: S, A, R, T: • (finite) state set S (|S| = n) • (finite) action set A (|A| = m) • (Markov) transition function T(s,a,s’) = Pr(s’ | s,a) • Probability of going to state s’ after taking action a in state s • How many parameters does it take to represent? • bounded, real-valued reward function R(s) • Immediate reward we get for being in state s • For example in a goal-based domain R(s) may equal 1 for goal states and 0 for all others • Can be generalized to include action costs: R(s,a) • Can be generalized to be a stochastic function • Can easily generalize to countable or continuous state and action spaces (but algorithms will be different)
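As an illustration of the four components, the sketch below stores a tiny MDP in plain Python dictionaries and counts the entries of the transition function, which answers the parameter question for a tabular representation: T has |S| · |A| · |S| = n²m entries. The two-state MDP, its state and action names, and its probabilities are all invented for the example.

```python
# Hypothetical two-state, two-action MDP (all names and numbers are illustrative).
S = ["s0", "s1"]
A = ["stay", "go"]

# T[(s, a)][s2] = Pr(s2 | s, a); successors that are omitted have probability 0.
T = {
    ("s0", "stay"): {"s0": 1.0},
    ("s0", "go"):   {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "go"):   {"s0": 0.8, "s1": 0.2},
}

# R[s] = immediate reward for being in state s (goal-style: 1 in s1, 0 elsewhere).
R = {"s0": 0.0, "s1": 1.0}


def is_valid_transition_function(states, actions, trans) -> bool:
    """Check that every (s, a) pair defines a probability distribution over states."""
    for s in states:
        for a in actions:
            dist = trans[(s, a)]
            if any(s2 not in states or p < 0 for s2, p in dist.items()):
                return False
            if abs(sum(dist.values()) - 1.0) > 1e-9:
                return False
    return True


n, m = len(S), len(A)
print(is_valid_transition_function(S, A, T))
print("entries in a full transition table:", n * m * n)
```

Since each row must sum to 1, only n · m · (n − 1) of those entries are free parameters.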
Graphical View of MDP [Figure: graphical model over states St, St+1, St+2, actions At, At+1, and rewards Rt, Rt+1, Rt+2]
Assumptions • First-Order Markovian dynamics (history independence) • Pr(St+1|At,St,At-1,St-1,..., S0) = Pr(St+1|At,St) • Next state only depends on current state and current action • First-Order Markovian reward process • Pr(Rt|At,St,At-1,St-1,..., S0) = Pr(Rt|At,St) • Reward only depends on current state and action • As described earlier we will assume reward is specified by a deterministic function R(s) • i.e. Pr(Rt=R(St) | At,St) = 1 • Stationary dynamics and reward • Pr(St+1|At,St) = Pr(Sk+1|Ak,Sk) for all t, k • The world dynamics do not depend on the absolute time • Full observability • Though we can’t predict exactly which state we will reach when we execute an action, once it is realized, we know what it is
Policies (“plans” for MDPs) • Nonstationary policy • π:S x T → A, where T is the non-negative integers • π(s,t) is action to do at state s with t stages-to-go • What if we want to keep acting indefinitely? • Stationary policy • π:S → A • π(s) is action to do at state s (regardless of time) • specifies a continuously reactive controller • These policies assume or have the following properties: • full observability • history-independence • deterministic action choice • Why not just consider sequences of actions? • Why not just replan?
Value of a Policy • How good is a policy π? • How do we measure “accumulated” reward? • Value function V: S →ℝ associates value with each state (or each state and time for non-stationary π) • Vπ(s) denotes value of policy at state s • Depends on immediate reward, but also what you achieve subsequently by following π • An optimal policy is one that is no worse than any other policy at any state • The goal of MDP planning is to compute an optimal policy (method depends on how we define value)
Policy Evaluation • Value equation for fixed policy • How can we compute the value function for a policy? • we are given R and Pr • simple linear system with n variables (each variable is the value of a state) and n constraints (one value equation for each state) • Use linear algebra (e.g. matrix inverse)
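The linear-system view can be sketched as follows. The value equation itself is not reproduced in this text, so the sketch assumes the common discounted form Vπ(s) = R(s) + γ Σ T(s,π(s),s') Vπ(s'), where the discount factor γ is an assumption introduced here. Under that assumption the n value equations become (I − γTπ)V = R, solved below with a small Gauss-Jordan routine in place of a library matrix inverse. The MDP, the policy, and all function names are hypothetical.

```python
def solve_linear_system(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]   # augmented matrix
    for col in range(n):
        # Swap in the row with the largest pivot for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [x - factor * y for x, y in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def evaluate_policy(states, T, R, policy, gamma):
    """Return V_pi as a dict by solving (I - gamma * T_pi) V = R.

    Assumes the discounted value equation stated in the lead-in (gamma < 1).
    """
    idx = {s: i for i, s in enumerate(states)}
    n = len(states)
    A = [[0.0] * n for _ in range(n)]
    for s in states:
        i = idx[s]
        A[i][i] += 1.0
        for s2, p in T[(s, policy[s])].items():
            A[i][idx[s2]] -= gamma * p
    b = [R[s] for s in states]
    return dict(zip(states, solve_linear_system(A, b)))


# Hypothetical two-state MDP and stationary policy.
states = ["s0", "s1"]
T = {("s0", "go"): {"s0": 0.2, "s1": 0.8}, ("s1", "stay"): {"s1": 1.0}}
R = {"s0": 0.0, "s1": 1.0}
policy = {"s0": "go", "s1": "stay"}

V = evaluate_policy(states, T, R, policy, gamma=0.9)
print({s: round(v, 3) for s, v in V.items()})
```

With n states the system has n unknowns and n equations, matching the description above; elimination costs on the order of n³ operations.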
Value Iteration vs. Policy Iteration • Which is faster? VI or PI • It depends on the problem • VI takes more iterations than PI, but PI requires more time on each iteration • PI must perform policy evaluation on each step which involves solving a linear system • Complexity: • There are at most exp(n) policies, so PI is no worse than exponential time in number of states • Empirically O(n) iterations are required • Still no polynomial bound on the number of PI iterations (open problem)!