This presentation provides an overview of decision-theoretic planning, covering Markov Decision Processes (MDPs), Partially Observable Markov Decision Processes (POMDPs), and the challenges of planning under uncertainty and partial observability.
5/6: Summary and Decision-Theoretic Planning
Last homework socket opened (two more problems to be added: Scheduling, MDPs)
Project 3 due today
Sapa homework points sent.
[Figure: the space of planning problems along five dimensions (Static vs. Dynamic, Deterministic vs. Stochastic, Observable vs. Partially Observable, Instantaneous vs. Durative, Propositional vs. Continuous), together with the corresponding solution concepts: contingent/conformant plans with interleaved execution, MDP policies, POMDP policies, Semi-MDP policies, temporal reasoning, numeric constraint reasoning (LP/ILP), and replanning/situated plans. "Classical Planning" occupies the deterministic, static, observable, instantaneous, propositional corner.]
All that water under the bridge…
Actions, proofs, planning strategies (Week 2; 1/28, 1/30)
More PO planning, dealing with partially instantiated actions, and start of deriving heuristics (Week 3; 2/4, 2/6)
Reachability heuristics contd. (2/11, 2/13)
Heuristics for partial-order planning; Graphplan search (2/18, 2/20)
EBL for Graphplan; solving the planning graph by compilation strategies (2/25, 2/27)
Compilation to SAT, ILP and naive encoding (3/4, 3/6)
Knowledge-based planners
Metric-temporal planning: issues and representation; search techniques; heuristics
Tracking multiple objective heuristics (cost propagation); partialization; LPG
Temporal constraint networks; scheduling
4/22, 4/24: Incompleteness and uncertainty; belief states; conformant planning
4/29, 5/1: Conditional planning
Decision-theoretic planning…
Problems, Solutions, Success Measures: 3 orthogonal dimensions
Sources of incompleteness and uncertainty: incompleteness in the initial state; partial (un)observability of states; non-deterministic actions; uncertainty in state or effects; complex reward functions (allowing degrees of satisfaction).
Solution forms:
• Conformant plans: don't look; just do (sequences)
• Contingent/conditional plans: look, and based on what you see, do; look again (directed acyclic graphs)
• Policies: if in (belief) state S, do action a ((belief) state-to-action tables): MDPs, POMDPs
Success measures:
• Deterministic success: must reach a goal state with probability 1
• Probabilistic success: must succeed with probability >= k (0 <= k <= 1)
• Maximal expected reward: maximize the expected reward (an optimization problem)
The Trouble with Probabilities…
Once we have probabilities associated with the action effects, as well as with the constituents of a belief state,
• The belief space size explodes… It is infinitely large: we may be able to find a plan if one exists, but exhaustively searching it to prove that no plan exists is out of the question.
• Conformant probabilistic planning is known to be semi-decidable, so solving POMDPs is semi-decidable too.
• Probabilities introduce the notion of "partial satisfaction" and the "expected value" of a plan (rather than a 0-1 valuation).
Useful as normative modeling tools in tons of places: planning, (reinforcement) learning, multi-agent interactions…
MDPs are generalizations of Markov chains where the transitions are under the control of an agent. HMMs are thus generalized to POMDPs.
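In tuple form (a standard textbook summary, not taken from the slide itself; M is the transition model and O the observation model):

```latex
\underbrace{(S, M)}_{\text{Markov chain}}
  \;\xrightarrow{\;+\,\text{actions, rewards}\;}\;
\underbrace{(S, A, M, R)}_{\text{MDP}}
\qquad
\underbrace{(S, M, O)}_{\text{HMM}}
  \;\xrightarrow{\;+\,\text{actions, rewards}\;}\;
\underbrace{(S, A, M, R, O)}_{\text{POMDP}}
```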
[aka the action cost C(a,s)] If the transition matrix M_ij is not known a priori, then we have a reinforcement learning scenario.
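For reference, the standard Bellman backup in this matrix notation (reward form with discount gamma; with action costs C(a,s) one would minimize instead of maximize) is:

```latex
U(i) \;\leftarrow\; R(i) \;+\; \gamma \,\max_{a} \sum_{j} M^{a}_{ij}\, U(j)
```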
Updates can be done synchronously OR asynchronously
--convergence is guaranteed as long as each state is updated infinitely often
Why are the values coming down first? Why do some states reach their optimal value faster?
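A minimal value-iteration sketch of this update loop (synchronous version; the transition model T, reward R, and discount gamma are illustrative names, not from the slides):

```python
# Tabular value iteration: repeat Bellman backups until the values stabilize.
# T[s][a] is a list of (next_state, probability) pairs, R[s] the immediate
# reward, gamma the discount factor -- all illustrative.

def value_iteration(states, actions, T, R, gamma=0.9, eps=1e-6):
    U = {s: 0.0 for s in states}     # initial value estimates
    while True:
        new_U, delta = {}, 0.0
        # One synchronous sweep; an asynchronous variant may instead update
        # states one at a time, in any order that visits each state infinitely often.
        for s in states:
            best = max(sum(p * U[s2] for s2, p in T[s][a]) for a in actions)
            new_U[s] = R[s] + gamma * best
            delta = max(delta, abs(new_U[s] - U[s]))
        U = new_U
        if delta < eps:              # values have (approximately) converged
            return U
```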
Policies converge earlier than values.
Given a utility vector U_i we can compute the greedy policy π_i (the policy that is greedy with respect to U_i).
The policy loss of π_i is ||U^π_i − U*|| (the max-norm difference of two vectors is the maximum amount by which they differ on any dimension).
So search in the space of policies.
We can either solve the linear equations exactly, or solve them approximately by running value iteration a few times (the update won't have the max factor).
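A compact sketch of this (modified) policy iteration, assuming the same illustrative T, R, and gamma as above; the evaluation step runs a few max-free sweeps instead of solving the linear system exactly:

```python
# Modified policy iteration: approximate policy evaluation (no max in the
# update), then greedy policy improvement, until the policy stops changing.

def policy_iteration(states, actions, T, R, gamma=0.9, eval_sweeps=20):
    pi = {s: actions[0] for s in states}      # arbitrary initial policy
    U = {s: 0.0 for s in states}
    while True:
        # Policy evaluation: a few simplified value-iteration sweeps
        # using only the action prescribed by the current policy.
        for _ in range(eval_sweeps):
            U = {s: R[s] + gamma * sum(p * U[s2] for s2, p in T[s][pi[s]])
                 for s in states}
        # Policy improvement: act greedily with respect to U.
        changed = False
        for s in states:
            best_a = max(actions,
                         key=lambda a: sum(p * U[s2] for s2, p in T[s][a]))
            if best_a != pi[s]:
                pi[s], changed = best_a, True
        if not changed:                       # policy is stable
            return pi, U
```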
The Big Computational Issues in MDPs
MDP models are quite easy to specify and understand conceptually. The big issues are "compactness" and "efficiency":
• Policy construction is polynomial in the size of the state space (which is bad news…!)
• For POMDPs, the state space is the belief space (infinite)
• Compact representations are needed for: actions, the reward function, the policy, the value function
• Efficient methods are needed for: policy/value updates
• Representations that have been tried include: decision trees; neural nets; Bayesian nets; ADDs (algebraic decision diagrams, a generalization of BDDs in which the leaf nodes carry real-valued valuations instead of T/F)
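A minimal illustrative sketch of an ADD node (leaf nodes carry real values rather than T/F; node sharing and the reduction rules used by systems like SPUDD are omitted):

```python
# Algebraic decision diagram (ADD) sketch: structurally like a BDD, but the
# terminals hold real numbers, so the diagram represents a real-valued
# function of boolean state variables (e.g., a reward or value function).

class ADD:
    def __init__(self, var=None, low=None, high=None, value=None):
        self.var = var                   # tested variable (None at a leaf)
        self.low, self.high = low, high  # children for var = False / var = True
        self.value = value               # real-valued payoff at a leaf

    def eval(self, assignment):
        """Evaluate under a dict mapping variable names to booleans."""
        node = self
        while node.var is not None:
            node = node.high if assignment[node.var] else node.low
        return node.value
```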
SPUDD: Using ADDs to Represent Actions, Rewards and Policies
MDPs and Planning Problems
• FOMDPs (fully observable MDPs) can be used to model planning problems with fully observable states but non-deterministic transitions
• POMDPs (partially observable MDPs), a generalization of the MDP framework in which the current state can only be partially observed, are needed to handle planning problems with partial observability
• POMDPs can be solved by converting them into FOMDPs, but the conversion takes us from world states to belief states (which form a continuous space)
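A sketch of the belief-state update behind that conversion (a standard Bayes filter; T and O are illustrative dictionaries of transition and observation probabilities):

```python
# After doing action a and observing o, the new belief over world states is
#   b'(s') ~ O(o | s', a) * sum_s T(s' | s, a) * b(s),  normalized to sum to 1.

def belief_update(b, a, o, states, T, O):
    new_b = {}
    for s2 in states:
        predicted = sum(T[(s, a, s2)] * b[s] for s in states)  # prediction step
        new_b[s2] = O[(s2, a, o)] * predicted                   # observation weighting
    norm = sum(new_b.values())                                  # assumes o is possible
    return {s2: p / norm for s2, p in new_b.items()}
```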
MDPs don’t have a notion of an “initial” and “goal” state. (Process orientation instead of “task” orientation) Goals are sort of modeled by reward functions Allows pretty expressive goals (in theory) Normal MDP algorithms don’t use initial state information (since policy is supposed to cover the entire search space anyway). Could consider “envelope extension” methods Compute a “deterministic” plan (which gives the policy for some of the states; Extend the policy to other states that are likely to happen during execution RTDP methods SSSP are a special case of MDPs where (a) initial state is given (b) there are absorbing goal states (c) Actions have costs. Goal states have zero costs. A proper policy for SSSP is a policy which is guaranteed to ultimately put the agent in one of the absorbing states For SSSP, it would be worth finding a partial policy that only covers the “relevant” states (states that are reachable from init and goal states on any optimal policy) Value/Policy Iteration don’t consider the notion of relevance Consider “heuristic state search” algorithms Heuristic can be seen as the “estimate” of the value of a state. (L)AO* or RTDP algorithms (or envelope extension methods) SSPP—Stochastic Shortest Path Problem An MDP with Init and Goal states
AO* search for solving SSSP problems
Main issues:
-- The cost of a node is the expected cost of its children
-- The AND tree can have LOOPS, so the cost backup is complicated
Intermediate nodes are given admissible heuristic estimates
--these can be just the shortest paths (or their estimates)
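For concreteness, the expected-cost backup at an internal node n can be written as below (a standard formulation; C(n,a) and P(n' | n, a) are assumed notation, not from the slide):

```latex
f(n) \;=\; \min_{a \in A(n)} \Big[\, C(n,a) \;+\; \sum_{n'} P(n' \mid n, a)\, f(n') \,\Big]
```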
RTDP Approach: Interleave Planning & Execution (Simulation)
Start from the current state S. Expand the tree (either uniformly to k levels, or non-uniformly, going deeper in some branches). Evaluate the leaf nodes; back up the values to S. Update the stored value of S. Pick the action that leads to the best value. Do it (or simulate it). Loop back.
Leaf nodes are evaluated by:
• using their "cached" values: if a node has been evaluated by RTDP in the past, use its remembered value, else use the heuristic value
• if not, using heuristics to estimate a. immediate reward values b. reachability heuristics
Sort of like depth-limited game playing (expectimax)
--Who is the game against?
Can also do "reinforcement learning" this way; the M_ij are not known exactly in RL
Greedy "On-Policy" RTDP without execution
Using the current utility values, select the action with the highest expected utility (the greedy action) at each state, until you reach a terminating state. Update the values along this path. Loop back, until the values stabilize.
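A minimal sketch of one such greedy trial (the utility table U, heuristic h, transition model T, and reward R are illustrative names; sample is a hypothetical helper for simulating an outcome):

```python
import random

# One greedy RTDP trial: follow the greedy action under the current utility
# estimates, backing up the value of each visited state, until a terminal
# state is reached. States seen for the first time are initialized with h.

def rtdp_trial(s, terminals, actions, T, R, U, h, gamma=0.9):
    while s not in terminals:
        def q(a):  # expected utility of doing a in s under the current U
            return R[s] + gamma * sum(p * U.setdefault(s2, h(s2))
                                      for s2, p in T[(s, a)])
        best_a = max(actions, key=q)
        U[s] = q(best_a)                 # back up the greedy value along the path
        s = sample(T[(s, best_a)])       # simulate the stochastic outcome

def sample(outcomes):
    """Draw a successor from a list of (state, probability) pairs."""
    r, acc = random.random(), 0.0
    for s2, p in outcomes:
        acc += p
        if r <= acc:
            return s2
    return outcomes[-1][0]
```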
Envelope Extension Methods
• For each action, take the most likely outcome and discard the rest.
• Find a plan (a deterministic path) from the init state to the goal state. This is a (very partial) policy, defined just for the states that fall on the maximum-probability state sequence.
• Consider the states that are most likely to be encountered while traveling this path.
• Find a policy for those states too.
• The tricky part is to show that this converges to the optimal policy.
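A sketch of the first (determinization) step, under the assumption that the transition model maps a state-action pair to a list of weighted outcomes (names illustrative):

```python
# Most-likely-outcome determinization: each stochastic action is replaced by a
# deterministic one that jumps to its single most probable successor, so a
# classical planner can find the initial deterministic path.

def determinize(T):
    """T maps (state, action) -> list of (next_state, probability)."""
    det = {}
    for (s, a), outcomes in T.items():
        most_likely_state, _ = max(outcomes, key=lambda sp: sp[1])
        det[(s, a)] = most_likely_state   # keep only the most likely successor
    return det
```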
Incomplete observability (the dreaded POMDPs)
• To model partial observability, all we need to do is look at the MDP in the space of belief states (belief states are fully observable even when world states are not)
• The policy maps belief states to actions
• In practice, this causes (humongous) problems
• The space of belief states is "continuous" (even if the underlying world is discrete and finite)
• Even approximate policies are hard to find (PSPACE-hard)
• Problems with a few dozen world states are currently hard to solve
• "Depth-limited" exploration (such as that done in adversarial games) is the only option…