4/3.
(FO)MDPs: The plan
• The general model has no initial state, allows complex cost and reward functions, and covers finite, infinite, and indefinite horizons
  • Standard algorithms are Value Iteration and Policy Iteration
  • They have to look at the entire state space
• Can be made even more general with
  • Partial observability (POMDPs)
  • Continuous state spaces
  • Multiple agents (DEC-POMDPs/MDPs)
  • Durative actions
    • Concurrent MDPs
    • Semi-MDPs
• Directions
  • Efficient algorithms for special cases (TODAY & 4/10)
  • Combining “learning” of the model and “planning” with the model
    • Reinforcement Learning: 4/8
Markov Decision Process (MDP)
• S: a set of states
• A: a set of actions
• Pr(s’|s,a): transition model (aka M^a_{s,s’})
• C(s,a,s’): cost model
• G: set of goals
• s0: start state
• γ: discount factor
• R(s,a,s’): reward model
• Value function: expected long-term reward from the state
• Q values: expected long-term reward of doing a in s
• V(s) = max_a Q(s,a)
• Greedy policy w.r.t. a value function
• Value of a policy
• Optimal value function
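A minimal sketch of how Q values and a greedy policy follow from a value function, assuming the discounted reward model. The three-state MDP, the `transitions` layout, the value of `GAMMA`, and the function names are all hypothetical, chosen only for illustration.

```python
# Hypothetical MDP: transitions[s][a] is a list of (probability, next_state, reward).
GAMMA = 0.9  # discount factor (assumed value)
transitions = {
    "s0": {"a": [(1.0, "s1", 0.0)], "b": [(0.5, "s0", 0.0), (0.5, "s2", 1.0)]},
    "s1": {"a": [(1.0, "s2", 5.0)]},
    "s2": {"a": [(1.0, "s2", 0.0)]},
}

def q_value(V, s, a):
    """Expected long-term reward of doing a in s, then collecting V afterwards."""
    return sum(p * (r + GAMMA * V[s2]) for p, s2, r in transitions[s][a])

def greedy_policy(V):
    """In each state, pick the action with the highest Q value under V."""
    return {s: max(transitions[s], key=lambda a: q_value(V, s, a))
            for s in transitions}

V = {"s0": 0.0, "s1": 5.0, "s2": 0.0}  # some value function, not necessarily optimal
policy = greedy_policy(V)
```

Here the greedy policy prefers action "a" in s0, because moving to s1 gives access to the value stored there.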
Examples of MDPs
• Goal-directed, Indefinite Horizon, Cost Minimization MDP
  • <S, A, Pr, C, G, s0>
  • Most often studied in the planning community
• Infinite Horizon, Discounted Reward Maximization MDP
  • <S, A, Pr, R, γ>
  • Most often studied in reinforcement learning
• Goal-directed, Finite Horizon, Probability Maximization MDP
  • <S, A, Pr, G, s0, T>
  • Also studied in the planning community
• Oversubscription Planning: Non-absorbing goals, Reward Maximization MDP
  • <S, A, Pr, G, R, s0>
  • Relatively recent model
SSPP: Stochastic Shortest Path Problem, an MDP with Init and Goal states
• MDPs don’t have a notion of an “initial” and “goal” state (process orientation instead of “task” orientation)
  • Goals are sort of modeled by reward functions
    • Allows pretty expressive goals (in theory)
  • Normal MDP algorithms don’t use initial state information (since the policy is supposed to cover the entire search space anyway)
    • Could consider “envelope extension” methods
      • Compute a “deterministic” plan, which gives the policy for some of the states; extend the policy to other states that are likely to happen during execution
      • RTDP methods
• SSPPs are a special case of MDPs where
  • (a) the initial state is given
  • (b) there are absorbing goal states
  • (c) actions have costs; all states have zero rewards
• A proper policy for an SSPP is a policy that is guaranteed to ultimately put the agent in one of the absorbing states
• For an SSPP, it is worth finding a partial policy that only covers the “relevant” states (states that are reachable from the initial state on the way to a goal under any optimal policy)
  • Value/Policy Iteration don’t consider the notion of relevance
  • Consider “heuristic state search” algorithms
    • The heuristic can be seen as an “estimate” of the value of a state
Bellman Equations for Cost Minimization MDP (absorbing goals) [also called Stochastic Shortest Path]
<S, A, Pr, C, G, s0>
• Define J*(s) {optimal cost} as the minimum expected cost to reach a goal from this state.
• J* should satisfy the following equation, whose inner expectation over successor states is Q*(s,a):
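The equation itself did not survive on this slide. A reconstruction from the definitions above, assuming the standard stochastic shortest path form in which absorbing goal states cost nothing:

```latex
J^*(s) =
\begin{cases}
0 & \text{if } s \in G \\[4pt]
\min_{a \in A} Q^*(s,a) & \text{otherwise}
\end{cases}
\qquad
Q^*(s,a) = \sum_{s' \in S} \Pr(s' \mid s,a)\,\bigl[\, C(s,a,s') + J^*(s') \,\bigr]
```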
Bellman Equations for infinite horizon discounted reward maximization MDP
<S, A, Pr, R, s0, γ>
• Define V*(s) {optimal value} as the maximum expected discounted reward from this state.
• V* should satisfy the following equation:
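The equation is missing from this slide. A reconstruction from the definitions above, assuming the standard discounted form with discount factor γ:

```latex
V^*(s) = \max_{a \in A} \sum_{s' \in S} \Pr(s' \mid s,a)\,\bigl[\, R(s,a,s') + \gamma\, V^*(s') \,\bigr]
```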
Bellman Equations for probability maximization MDP
<S, A, Pr, G, s0, T>
• Define P*(s,t) {optimal prob.} as the maximum probability of reaching a goal from this state, starting at the t-th timestep.
• P* should satisfy the following equation:
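The equation is missing from this slide. A plausible reconstruction, assuming goals are absorbing, T is the horizon, and a non-goal state at the horizon has zero probability of success:

```latex
P^*(s,t) =
\begin{cases}
1 & \text{if } s \in G \\[4pt]
0 & \text{if } s \notin G \text{ and } t = T \\[4pt]
\max_{a \in A} \sum_{s' \in S} \Pr(s' \mid s,a)\, P^*(s', t+1) & \text{otherwise}
\end{cases}
```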
Modeling Softgoal problems as deterministic MDPs
• Consider the net-benefit problem, where actions have costs, goals have utilities, and we want a plan with the highest net benefit
• How do we model this as an MDP?
  • (wrong idea): Make every state in which any subset of goals holds into a sink state with reward equal to the cumulative sum of the utilities of those goals.
    • Problem: what if achieving g1 ∧ g2 necessarily leads you through a state where g1 is already true?
  • (correct version): Add a new fluent called “done” and a dummy action called Done-Deal. It is applicable in any state and asserts the fluent “done”. All “done” states are sink states. Their reward is equal to the sum of the rewards of the individual goals that hold in them.
Ideas for Efficient Algorithms
• Use heuristic search (and reachability information)
  • LAO*, RTDP
• Use execution and/or simulation
  • “Actual execution”: reinforcement learning (the main motivation for RL is to “learn” the model)
  • “Simulation”: simulate the given model to sample possible futures
    • Policy rollout, hindsight optimization, etc.
• Use “factored” representations
  • Factored representations for actions, reward functions, values, and policies
  • Directly manipulating factored representations during the Bellman update
Heuristic Search vs. Dynamic Programming (Value/Policy Iteration)
• VI and PI approaches use the Dynamic Programming update
  • Set the value of a state in terms of the maximum expected value achievable by doing actions from that state
• They do the update for every state in the state space
  • Wasteful if we know the initial state(s) that the agent is starting from
• Heuristic search (e.g. A*/AO*) explores only the part of the state space that is actually reachable from the initial state
• Even within the reachable space, heuristic search can avoid visiting many of the states
  • Depending on the quality of the heuristic used
• But what is the heuristic?
  • An admissible heuristic is a lower bound on the cost to reach a goal from any given state
  • It is a lower bound on V*!
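A minimal sketch of the dynamic programming update in its cost-minimization form, to make the "every state" point concrete. The tiny problem, the `mdp` layout, and the name `value_iteration` are assumptions made for illustration.

```python
# Hypothetical SSP: mdp[s][a] is a list of (probability, next_state, cost).
mdp = {
    "s0": {"safe": [(1.0, "s1", 2.0)],
           "risky": [(0.5, "g", 1.0), (0.5, "s0", 1.0)]},
    "s1": {"go": [(1.0, "g", 1.0)]},
}
GOALS = {"g"}  # absorbing goal states, cost-to-go 0

def value_iteration(mdp, goals, eps=1e-9):
    J = {s: 0.0 for s in mdp}
    J.update({g: 0.0 for g in goals})
    while True:
        residual = 0.0
        for s in mdp:  # sweeps every state, whether or not it is reachable from s0
            best = min(sum(p * (c + J[s2]) for p, s2, c in outcomes)
                       for outcomes in mdp[s].values())
            residual = max(residual, abs(best - J[s]))
            J[s] = best
        if residual < eps:  # stop once no value changed by more than eps
            return J

J = value_iteration(mdp, GOALS)
```

In this toy problem the risky action is optimal in s0: its expected cost solves J = 1 + 0.5·J, so J(s0) = 2, which beats the cost of 3 for the safe route.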
Connection with Heuristic Search
[Figure: three search graphs from s0 to G: a regular graph, an acyclic AND/OR graph, and a cyclic AND/OR graph]
• Regular graph: the solution is a (shortest) path; algorithm A*
• Acyclic AND/OR graph: the solution is an (expected shortest) acyclic graph; algorithm AO* [Nilsson’71]
• Cyclic AND/OR graph: the solution is an (expected shortest) cyclic graph; algorithm LAO* [Hansen&Zil.’98]
All algorithms are able to make effective use of reachability information!
Sanity check: why can’t we handle the cycles by duplicate elimination as in A* search?
LAO* [Hansen&Zilberstein’98]
add s0 to the fringe and to the greedy graph
repeat
    expand a state on the fringe (in the greedy graph)
    initialize all new states by their heuristic value
    perform value iteration for all expanded states
    recompute the greedy graph
until the greedy graph is free of fringe states
output the greedy graph as the final policy
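The loop above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the `mdp` layout, the helper names, the zero heuristic, and the choice to expand one arbitrary fringe state per iteration are all hypothetical.

```python
# Hypothetical SSP: mdp[s][a] is a list of (probability, next_state, cost).
mdp = {
    "s0": {"safe": [(1.0, "s1", 1.0)],
           "risky": [(0.5, "g", 1.0), (0.5, "s0", 1.0)],
           "detour": [(1.0, "s2", 5.0)]},
    "s1": {"go": [(1.0, "g", 3.0)]},
    "s2": {"go": [(1.0, "g", 1.0)]},
}
GOALS = {"g"}

def lao_star(mdp, goals, s0, h, eps=1e-9):
    J = {s0: h(s0)}
    expanded = set()

    def q(s, a):
        return sum(p * (c + J[s2]) for p, s2, c in mdp[s][a])

    def greedy_action(s):
        return min(mdp[s], key=lambda a: q(s, a))

    def greedy_graph():
        # States reachable from s0 by following greedy actions of expanded states.
        seen, stack = {s0}, [s0]
        while stack:
            s = stack.pop()
            if s in goals or s not in expanded:
                continue  # goals and fringe states have no outgoing greedy arc yet
            for _, s2, _ in mdp[s][greedy_action(s)]:
                if s2 not in seen:
                    seen.add(s2)
                    stack.append(s2)
        return seen

    while True:
        graph = greedy_graph()
        fringe = [s for s in graph if s not in expanded and s not in goals]
        if not fringe:  # greedy graph is free of fringe states
            return J, {s: greedy_action(s) for s in graph if s not in goals}
        s = fringe[0]
        expanded.add(s)
        for outcomes in mdp[s].values():  # initialise new states by heuristic value
            for _, s2, _ in outcomes:
                if s2 not in J:
                    J[s2] = 0.0 if s2 in goals else h(s2)
        while True:  # value iteration restricted to the expanded states
            residual = 0.0
            for e in expanded:
                best = min(q(e, a) for a in mdp[e])
                residual = max(residual, abs(best - J[e]))
                J[e] = best
            if residual < eps:
                break

J, policy = lao_star(mdp, GOALS, "s0", h=lambda s: 0.0)
```

On this toy problem the search first follows the cheap-looking safe action, expands s1, discovers its true cost, and switches to the risky action. State s2 is never expanded, so its value stays at the heuristic estimate.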
LAO* walkthrough [Figures: the greedy graph rooted at s0 with goal G; newly added states are labeled with heuristic values h, backed-up states with J1 to J4]
• Iteration 1: add s0 to the fringe and to the greedy graph; expand a state on the fringe in the greedy graph; initialise all new states by their heuristic values; perform VI on expanded states (J1); recompute the greedy graph
• Iteration 2: expand a state on the fringe; initialise new states; perform VI (J2); compute the greedy policy
• Iteration 3: expand a fringe state; perform VI (J3); recompute the greedy graph
• Iteration 4: values are updated to J4; stops when all nodes in the greedy graph have been expanded
Comments
• Dynamic Programming + Heuristic Search
• admissible heuristic ⇒ optimal policy
• expands only part of the reachable state space
• outputs a partial policy
  • one that is closed w.r.t. Pr and s0
Speedups
• expand all states in the fringe at once
• perform policy iteration instead of value iteration
• perform partial value/policy iteration
• weighted heuristic: f = (1-w)·g + w·h
• ADD-based symbolic techniques (symbolic LAO*)
How to derive heuristics?
• The deterministic shortest route is a heuristic on the expected cost J*(s)
• But how do you compute it?
  • Idea 1: [Most likely outcome determinization] Consider the most likely transition for each action
  • Idea 2: [All outcome determinization] For each stochastic action, make multiple deterministic actions that correspond to the various outcomes
  • Which is admissible? Which is “more” informed?
• How about Idea 3: [Sampling based determinization]
  • Construct a sample determinization by “simulating” each stochastic action to pick the outcome. Find the cost of the shortest path in that determinization
  • Take multiple samples, and take the average of the shortest path costs
Determinization involves converting “And” arcs in the And/Or graph to “Or” arcs
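A small sketch contrasting Idea 1 and Idea 2 on a toy problem. The `mdp` layout and the function names are hypothetical, and Dijkstra's algorithm stands in for whatever deterministic planner is actually used.

```python
import heapq

# Hypothetical SSP: mdp[s][a] is a list of (probability, next_state, cost).
mdp = {
    "s0": {"safe": [(1.0, "s1", 1.0)],
           "risky": [(0.4, "g", 1.0), (0.6, "s0", 1.0)]},
    "s1": {"go": [(1.0, "g", 3.0)]},
}
GOALS = {"g"}

def determinize(mdp, most_likely=False):
    """Turn every kept outcome into its own deterministic edge (an "Or" arc)."""
    edges = {}
    for s, actions in mdp.items():
        edges[s] = []
        for outcomes in actions.values():
            kept = [max(outcomes, key=lambda o: o[0])] if most_likely else outcomes
            edges[s] += [(s2, c) for p, s2, c in kept if p > 0]
    return edges

def shortest_cost(edges, goals, start):
    """Dijkstra: cost of the cheapest deterministic path from start to a goal."""
    dist, queue = {start: 0.0}, [(0.0, start)]
    while queue:
        d, u = heapq.heappop(queue)
        if u in goals:
            return d
        if d > dist[u]:
            continue  # stale queue entry
        for v, c in edges.get(u, []):
            if d + c < dist.get(v, float("inf")):
                dist[v] = d + c
                heapq.heappush(queue, (d + c, v))
    return float("inf")

h_all = shortest_cost(determinize(mdp), GOALS, "s0")
h_likely = shortest_cost(determinize(mdp, most_likely=True), GOALS, "s0")
```

In this example J*(s0) = 2.5 (the risky action satisfies J = 1 + 0.6·J). The all-outcome estimate is 1, a lower bound, because the planner may pick the luckiest outcome of every action. The most-likely-outcome estimate is 4, which overestimates J*(s0), so that determinization is not admissible in general.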
Real Time Dynamic Programming [Barto, Bradtke, Singh’95]
• Trial: simulate the greedy policy starting from the start state; perform a Bellman backup on visited states
• RTDP: repeat trials until the cost function converges
Notice that you can also do the “Trial” above by executing rather than “simulating”. In that case, we will be doing reinforcement learning. (In fact, RTDP was originally developed for reinforcement learning.)
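A minimal sketch of RTDP trials by simulation. As a simplification it runs a fixed number of trials instead of testing convergence; the `mdp` layout, the zero heuristic, and the parameter names are assumptions made for illustration.

```python
import random

# Hypothetical SSP: mdp[s][a] is a list of (probability, next_state, cost).
mdp = {
    "s0": {"safe": [(1.0, "s1", 1.0)],
           "risky": [(0.5, "g", 1.0), (0.5, "s0", 1.0)]},
    "s1": {"go": [(1.0, "g", 3.0)]},
}
GOALS = {"g"}

def rtdp(mdp, goals, s0, h, trials=200, max_steps=100, seed=0):
    rng = random.Random(seed)
    J = {}

    def value(s):
        # Goals cost nothing; unseen states start at their heuristic value.
        return 0.0 if s in goals else J.setdefault(s, h(s))

    def q(s, a):
        return sum(p * (c + value(s2)) for p, s2, c in mdp[s][a])

    for _ in range(trials):
        s = s0
        for _ in range(max_steps):
            if s in goals:
                break
            a = min(mdp[s], key=lambda act: q(s, act))  # greedy action
            J[s] = q(s, a)                              # Bellman backup on the visited state
            outcomes = mdp[s][a]
            s = rng.choices([s2 for _, s2, _ in outcomes],
                            weights=[p for p, _, _ in outcomes])[0]  # simulate the outcome
    return J

J = rtdp(mdp, GOALS, "s0", h=lambda s: 0.0)
```

Replacing the sampling line with an actual execution step in the environment turns the same loop into the learning setting mentioned above.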
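RTDP Trial [Figure: at s0, Qn+1(s0,a) is computed for actions a1, a2, a3 from the Jn values of their successor states; Jn+1(s0) is the Min over these Q values; the greedy action, a_greedy = a2, is followed toward the Goal]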
Greedy “On-Policy” RTDP without execution
Using the current utility values, select the action with the highest expected utility (the greedy action) at each state, until you reach a terminating state. Update the values along this path. Loop back until the values stabilize.
Labeled RTDP [Bonet&Geffner’03]
• Initialise J0 with an admissible heuristic ⇒ Jn monotonically increases
• Label a state as solved if the Jn for that state has converged
• Backpropagate the “solved” labeling
• Stop trials when they reach any solved state
• Terminate when s0 is solved
[Figure, first case: state s whose best action leads to G while its other actions have high Q costs: best action ⇒ J(s) won’t change!]
[Figure, second case: state s whose best action leads through t to G, other actions with high Q costs: both s and t get solved together]
Properties
• admissible J0 ⇒ optimal J*
• heuristic-guided: explores a subset of the reachable state space
• anytime: focusses attention on more probable states
• fast convergence: focusses attention on unconverged states
• terminates in finite time
Other Advances
• Ordering the Bellman backups to maximise information flow. [Wingate & Seppi’05] [Dai & Hansen’07]
• Partition the state space and combine value iterations from different partitions. [Wingate & Seppi’05] [Dai & Goldsmith’07]
• External memory version of value iteration. [Edelkamp, Jabbar & Bonet’07]
• …
Factored Representations: Actions
• Actions can be represented directly in terms of their effects on the individual state variables (fluents)
  • Write a Bayes network relating the values of fluents at the state before and after the action
    • Bayes networks representing fluents at different time points are called “Dynamic Bayes Networks”
    • We look at 2TBNs (2-time-slice dynamic Bayes nets)
    • The CPTs of the BNs can be represented compactly too!
• Go further by using the STRIPS assumption
  • Fluents not affected by the action are not represented explicitly in the model
  • Called the Probabilistic STRIPS Operator (PSO) model
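A rough sketch of the idea in 2TBN style, assuming each affected fluent has its own small conditional table and that fluents change independently given the previous state. The "wash" action, the fluent names, and `transition_prob` are hypothetical; this is not the PSO syntax itself.

```python
from itertools import product

FLUENTS = ["clean", "wet", "charged"]

# Hypothetical action "wash": for each fluent it affects, a function giving
# Pr(fluent is true after the action | state before the action).
# "charged" is not listed, so it is assumed to keep its previous value.
wash = {
    "clean": lambda s: 1.0 if s["clean"] else 0.9,
    "wet": lambda s: 1.0,
}

def transition_prob(action, s, s_next):
    """Pr(s_next | s, action) as a product of per-fluent probabilities."""
    prob = 1.0
    for f in FLUENTS:
        if f in action:
            p_true = action[f](s)
            prob *= p_true if s_next[f] else 1.0 - p_true
        elif s_next[f] != s[f]:
            return 0.0  # an unaffected fluent must persist
    return prob

s = {"clean": False, "wet": False, "charged": True}
target = {"clean": True, "wet": True, "charged": True}
p = transition_prob(wash, s, target)

# Sanity check: the probabilities over all 2^3 successor states sum to 1.
total = sum(transition_prob(wash, s, dict(zip(FLUENTS, values)))
            for values in product([False, True], repeat=len(FLUENTS)))
```

Two small per-fluent tables replace an explicit 8 × 8 transition matrix over the three-fluent state space, and the gap grows exponentially with the number of fluents.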
Factored Representations: Reward, Value and Policy Functions
• Reward functions can be represented in factored form too. Possible representations include
  • Decision trees (made up of fluents)
  • ADDs (Algebraic Decision Diagrams)
• Value functions are like reward functions (so they too can be represented similarly)
• The Bellman update can then be done directly using factored representations
Direct manipulation of ADDs in SPUDD
FF-Replan: A Baseline for Probabilistic Planning
Sungwook Yoon, Alan Fern, Robert Givan
FF-Replan : Sungwook Yoon
Replanning Approach
• Deterministic planner for probabilistic planning?
• Winner of IPPC-2004 and (unofficial) winner of IPPC-2006
• Why was it conceived?
• Why did it work?
  • Domain-by-domain analysis
• Any extensions?
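A rough sketch of a replanning loop of this kind: determinize the problem, plan deterministically, execute, and replan whenever the observed state is not the one the plan expected. This is an illustration under assumptions, not the FF-Replan code: Dijkstra's algorithm stands in for the FF planner, an all-outcome determinization is used, and the `mdp` layout and function names are hypothetical.

```python
import heapq
import random

# Hypothetical problem: mdp[s][a] is a list of (probability, next_state, cost).
mdp = {
    "s0": {"safe": [(1.0, "s1", 1.0)],
           "risky": [(0.5, "g", 1.0), (0.5, "s0", 1.0)]},
    "s1": {"go": [(1.0, "g", 3.0)]},
}
GOALS = {"g"}

def plan(mdp, goals, start):
    """Cheapest path in the all-outcome determinization, as (state, action, expected_next) steps."""
    dist, parent, queue = {start: 0.0}, {}, [(0.0, start)]
    while queue:
        d, u = heapq.heappop(queue)
        if u in goals:
            steps = []
            while u != start:
                prev, a = parent[u]
                steps.append((prev, a, u))
                u = prev
            return steps[::-1]
        if d > dist[u]:
            continue  # stale queue entry
        for a, outcomes in mdp.get(u, {}).items():
            for p, v, c in outcomes:
                if p > 0 and d + c < dist.get(v, float("inf")):
                    dist[v], parent[v] = d + c, (u, a)
                    heapq.heappush(queue, (d + c, v))
    return None  # no goal reachable even in the determinized problem

def replan_loop(mdp, goals, s0, seed=0, max_steps=1000):
    rng = random.Random(seed)
    s, steps, replans = s0, [], 0
    for _ in range(max_steps):
        if s in goals:
            return True, replans
        if not steps or steps[0][0] != s:  # plan exhausted or unexpected state
            steps = plan(mdp, goals, s)
            replans += 1
            if steps is None:
                return False, replans
        _, a, _expected = steps.pop(0)
        outcomes = mdp[s][a]
        s = rng.choices([o[1] for o in outcomes],
                        weights=[o[0] for o in outcomes])[0]  # simulated execution
    return False, replans

reached, replans = replan_loop(mdp, GOALS, "s0")
```

The loop never reasons about probabilities while planning; it relies on cheap replanning to recover from outcomes the deterministic plan did not anticipate.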
IPPC-2004 Pre-released Domains
• Blocksworld
• Boxworld
IPPC Performance Test
• Client-server interaction
• The problem definition is known a priori
• Performance is recorded in the server log
• For each problem, 30 repeated tests are conducted