MDP Exact Solutions II and Appl. • Jeffrey Chyan • Department of Computer Science • Rice University • Slides adapted from Mausam and Andrey Kolobov
Outline • Policy Iteration (3.3) • Value Iteration (3.4) • Prioritization in Value Iteration (3.5) / Partitioned Value Iteration (3.6) • Linear Programming Formulation (3.7) • Infinite-Horizon Discounted-Reward MDPs (3.8) • Finite-Horizon MDPs (3.9) • MDPs with Dead Ends (3.10)
Solving MDPs • Finding the best policy for MDPs • Policy Iteration • Value Iteration • Linear Programming
Recall SSP MDPs • Agent pays a cost to achieve the goal • There exists at least one proper policy • Every improper policy incurs an infinite cost from every state from which it does not reach the goal with P=1 • IHDR and FH ⊆ SSP • For this presentation, assume SSP unless stated otherwise
Recall Value and Evaluation • Value Function: maps the domain of a policy (excluding the action set) to a scalar value • Value of a policy: the expected utility of the reward sequence obtained by executing the policy • Policy Evaluation: given a policy, compute the value function at each state • Solving a system of equations • Iterative approach
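As a concrete illustration of the iterative approach, here is a minimal policy-evaluation sketch. The dictionary-based MDP representation (T for transition probabilities, C for costs, goals, and pi) is an assumption made for illustration, not something given on the slides.

```python
def evaluate_policy(states, T, C, goals, pi, n_iters=100):
    """Iterative policy evaluation: repeatedly apply
    V(s) <- sum_s' T(s, pi(s), s') * (C(s, pi(s), s') + V(s'))."""
    V = {s: 0.0 for s in states}          # goal values stay at 0
    for _ in range(n_iters):
        for s in states:
            if s in goals:
                continue
            a = pi[s]
            V[s] = sum(p * (C[(s, a, s2)] + V[s2])
                       for s2, p in T[(s, a)].items())
    return V
```

The alternative is to solve the same fixed-point equations directly as a linear system, as in the policy-iteration sketch later in these slides.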
Motivation • Find the best policy • Brute-force algorithm: assuming all policies are proper, enumerate all policies, evaluate them, and return the best one • Exponential number of policies, so computationally intractable • Need a more intelligent search for the best policy
The Q-Value Under a Value Function V • Q-value under a value function: the one-step lookahead computation of the value of taking an action a in state s • Under the belief that the value function V is the true expected cost to reach a goal • Denoted Q^V(s,a) • Q^V(s,a) = Σ_{s'∈S} T(s,a,s') [C(s,a,s') + V(s')]
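A minimal sketch of this one-step lookahead, assuming the MDP is stored in plain Python dictionaries (T for transition probabilities, C for costs; both names are illustrative):

```python
def q_value(s, a, T, C, V):
    """Q^V(s,a) = sum over successors s' of T(s,a,s') * (C(s,a,s') + V(s')).
    T[(s, a)] maps each successor s' to its probability; C[(s, a, s')] is the cost."""
    return sum(p * (C[(s, a, s2)] + V[s2]) for s2, p in T[(s, a)].items())
```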
Greedy Action/Policy • Action greedy w.r.t. a value function: an action that has the lowest Q-value • a = argmin_{a'} Q^V(s,a') • Greedy Policy: a policy with all greedy actions w.r.t. V for each state
Policy Iteration • Initialize π_0 as a random proper policy • Repeat • Policy Evaluation: Compute V^{π_{n-1}} • Policy Improvement: Construct π_n greedy w.r.t. V^{π_{n-1}} • Until π_n == π_{n-1} • Return π_n
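A sketch of the loop above, assuming goal states are absorbing with zero cost and the initial policy π_0 is proper and defined on the non-goal states. Policy evaluation here solves the linear system restricted to non-goal states; all of the data structures are illustrative assumptions.

```python
import numpy as np

def policy_iteration(states, actions, T, C, goals, pi0):
    """states: list; actions[s]: applicable actions; T[(s,a)]: {s': prob};
    C[(s,a,s')]: cost; pi0: proper policy on non-goal states."""
    non_goal = [s for s in states if s not in goals]
    idx = {s: i for i, s in enumerate(non_goal)}
    n = len(non_goal)
    pi = {s: pi0[s] for s in non_goal}
    while True:
        # Policy evaluation: solve (I - P_pi) v = c_pi over the non-goal states.
        P = np.zeros((n, n))
        c = np.zeros(n)
        for s in non_goal:
            a = pi[s]
            for s2, p in T[(s, a)].items():
                c[idx[s]] += p * C[(s, a, s2)]
                if s2 not in goals:
                    P[idx[s], idx[s2]] += p
        v = np.linalg.solve(np.eye(n) - P, c)
        V = {s: 0.0 for s in goals}
        V.update({s: v[idx[s]] for s in non_goal})
        # Policy improvement: pick the greedy action w.r.t. V^{pi_{n-1}}.
        new_pi = {s: min(actions[s],
                         key=lambda a: sum(p * (C[(s, a, s2)] + V[s2])
                                           for s2, p in T[(s, a)].items()))
                  for s in non_goal}
        if new_pi == pi:          # policy unchanged: done
            return pi, V
        pi = new_pi
```

The np.linalg.solve call corresponds to the "solving a system of equations" variant of policy evaluation; modified policy iteration (next slides) swaps in a few sweeps of iterative evaluation instead.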
Policy Improvement • Computes a greedy policy under V^{π_{n-1}} • First compute the Q-value of each action under V^{π_{n-1}} in a given state • Then assign a greedy action in that state as π_n
Properties of Policy Iteration • Policy Iteration for an SSP (initialized with a proper policy π_0) • Successively improves the policy in each iteration • V^{π_n}(s) ≤ V^{π_{n-1}}(s) • Converges to an optimal policy
Modified Policy Iteration • Use the iterative procedure for policy evaluation instead of solving the system of equations • Use the final value function from the previous iteration, V^{π_{n-1}}, instead of an arbitrary initialization V_0^{π_n}
Modified Policy Iteration • Initialize π_0 as a random proper policy • Repeat • Approximate Policy Evaluation: Compute V^{π_{n-1}} by running only a few iterations of iterative policy evaluation • Policy Improvement: Construct π_n greedy w.r.t. V^{π_{n-1}} • Until π_n == π_{n-1} • Return π_n
Limitations of Policy Iteration • Why do we need to start with a proper policy? • Otherwise the policy evaluation step can diverge • How do we get a proper policy? • No domain-independent algorithm • Policy iteration for SSPs is not generically applicable
From Policy Iteration To Value Iteration • Search space changes • Policy Iteration • Search over policies • Compute the resulting value • Value Iteration • Search over values • Compute the resulting policy
Bellman Equations • Value Iteration based on set of Bellman equations • Bellman equations mathematically express the optimal solution of an MDP • Recursive expansion to compute optimal value function
Bellman Equations • Optimal Q-value of a state-action pair: the minimum expected cost to reach a goal starting in state s if the agent's first action is a • Denoted Q*(s,a) • V*(s) = 0 if s ∈ G; V*(s) = min_{a∈A} Q*(s,a) if s ∉ G • Q*(s,a) = Σ_{s'∈S} T(s,a,s') [C(s,a,s') + V*(s')] • Restatement of the optimality principle for SSP MDPs
Bellman Equations • Q*(s,a) = Σ_{s'∈S} T(s,a,s') [C(s,a,s') + V*(s')]: the expected cost of first executing action a in state s and then following an optimal policy • V*(s) = 0 if s ∈ G: already at the goal, no action needed • V*(s) = min_{a∈A} Q*(s,a) if s ∉ G: pick the best action, minimizing expected cost • The minimization over all actions makes the equations non-linear
Bellman Backup • Iterative refinement: V_n(s) ← min_{a∈A} Σ_{s'∈S} T(s,a,s') [C(s,a,s') + V_{n-1}(s')] • Bellman Backup: computes a new value at state s by backing up the successor values V(s')
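A minimal sketch of a single backup at state s, under the same assumed dictionary representation used in the earlier sketches:

```python
def bellman_backup(s, actions, T, C, V, goals):
    """Return min_a sum_s' T(s,a,s') * (C(s,a,s') + V(s')), i.e. the backed-up value at s."""
    if s in goals:
        return 0.0
    return min(sum(p * (C[(s, a, s2)] + V[s2]) for s2, p in T[(s, a)].items())
               for a in actions[s])
```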
Value Iteration • No restriction on the initial value function • Termination condition: ϵ-consistency
Example • All costs are 1, except the costs of a_40 and a_41, which are 5 and 2 respectively • All V_0 values are initialized to the distance from the goal • The backup order is s_0, s_1, …, s_4
Example – First Iteration • V_1(s_0) = min{1 + V_0(s_1), 1 + V_0(s_2)} = 3; similar computations for s_1, s_2, s_3 • Q_1(s_4, a_41) = 2 + 0.6·0 + 0.4·2 = 2.8 and Q_1(s_4, a_40) = 5 • V_1(s_4) = min{2.8, 5} = 2.8
VI - Convergence and Optimality • Value Iteration converges to the optimal value function in the limit, without restrictions • For an SSP MDP, ∀s ∈ S: lim_{n→∞} V_n(s) = V*(s), irrespective of the initialization
VI - Termination • Residual at state s: the magnitude of the change in the value of state s if a Bellman backup is applied to V at s once • Denoted Res^V(s) • The residual Res^V is the maximum residual across all states • ϵ-consistency: a state s is ϵ-consistent w.r.t. a value function V if the residual at s w.r.t. V is less than ϵ • A value function V is ϵ-consistent if it is ϵ-consistent at all states • Terminate VI when all residuals are small
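A compact sketch of value iteration with this termination test, using in-place (Gauss-Seidel style) updates; the dictionary-based MDP representation (T, C, actions, goals) is the same assumption as in the earlier sketches:

```python
def value_iteration(states, actions, T, C, goals, eps=1e-4):
    V = {s: 0.0 for s in states}               # any initialization works for SSPs
    while True:
        max_residual = 0.0
        for s in states:
            if s in goals:
                continue
            new_v = min(sum(p * (C[(s, a, s2)] + V[s2])
                            for s2, p in T[(s, a)].items())
                        for a in actions[s])
            max_residual = max(max_residual, abs(new_v - V[s]))
            V[s] = new_v                       # in-place update
        if max_residual < eps:                 # V is eps-consistent at every state
            return V
```

A greedy policy w.r.t. the returned V can then be extracted with the one-step lookahead from the Q-value sketch.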
VI - Running Time • Each Bellman backup: • Go over all actions and their successors: O(|S||A|) • Each iteration: • Back up all states: O(|S|²|A|) • Number of iterations: • General SSPs: non-trivial bounds don't exist
Monotonicity • For all n > k: • V_k ≤_p V* ⇒ V_n ≤_p V* (V_n monotonic from below) • V_k ≥_p V* ⇒ V_n ≥_p V* (V_n monotonic from above) • If a value function V_1 is componentwise greater (or less) than another value function V_2, then the same inequality holds between T(V_1) and T(V_2) • The Bellman backup operator T used in VI is monotonic
Value Iteration to Asynchronous Value Iteration • Value iteration requires full sweeps of the state space • It is not essential to back up all states in an iteration • Asynchronous value iteration requires an additional restriction, that no state is starved (every state is backed up infinitely often), so that convergence holds • The termination condition checks whether the current value function is ϵ-consistent
Priority Backup • Value iteration performs wasteful backups • Need to choose an intelligent backup order by defining a priority • Higher-priority states are backed up earlier
What State to Prioritize? • Avoid backing up a state when: • None of the successors of the state have had a change in value since the last backup • This means backing up the state will not change its value
Prioritized Sweeping • If a state's value changes, prioritize its predecessors • The priority estimates the expected change in the value of a state if a backup were performed on it • Converges to the optimal value function in the limit if all initial priorities are non-zero
Generalized Prioritized Sweeping • Instead of estimating the residual, compute the exact change in value (the residual) as the priority • First back up the state, then push it into the queue
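A rough sketch of the backup-then-push pattern described on the last two slides. The heap-based queue, the predecessor map preds, and the use of a state's own value change as its predecessors' priority are simplifying assumptions (a full implementation would also update priorities of states already in the queue):

```python
import heapq

def prioritized_vi(states, actions, T, C, goals, preds, eps=1e-4):
    """preds[s]: set of states that can reach s in one step."""
    V = {s: 0.0 for s in states}
    # start with every non-goal state in the queue at a uniform priority
    heap = [(-1.0, s) for s in states if s not in goals]
    heapq.heapify(heap)
    queued = {s for _, s in heap}
    while heap:
        _, s = heapq.heappop(heap)
        queued.discard(s)
        new_v = min(sum(p * (C[(s, a, s2)] + V[s2])
                        for s2, p in T[(s, a)].items())
                    for a in actions[s])
        change = abs(new_v - V[s])
        V[s] = new_v                                   # back up first...
        if change > eps:
            for sp in preds[s]:                        # ...then push predecessors
                if sp not in goals and sp not in queued:
                    heapq.heappush(heap, (-change, sp))   # heapq is a min-heap
                    queued.add(sp)
    return V
```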
Improved Prioritized Sweeping • States with low V(s) (closer to the goal) get higher priority initially • As the residuals of states closer to the goal shrink, the priorities of other states increase
Backward Value Iteration • Prioritized value iteration without a priority queue • Back up states in reverse order, starting from the goal • No priority-queue overhead, and good information flow
Which priority algorithm to use? • Synchronous Value Iteration: when states are highly interconnected • Prioritized Sweeping/Generalized Prioritized Sweeping: sequential dependencies • Improved Prioritized Sweeping: a specific way to trade off proximity to the goal against information flow • Backward Value Iteration: better for domains with fewer predecessors
Partitioned Value Iteration • Partition the state space • Stabilize the mutual co-dependencies within a partition before focusing attention on states in other partitions
Benefits of Partitioning • External-memory algorithms • PEMVI • Cache-efficient algorithms • P-EVA algorithm • Parallelized algorithms • P3VI
Linear Programming for MDPs • α(s) are the state-relevance weights • For the exact solution they are unimportant and can be set to any positive number (e.g., 1)
Linear Programming for MDPs • |S| variables • |S|·|A| constraints • Computing the exact solution is slower than value iteration • Better suited for a specific kind of approximation, where the value of a state is approximated with a weighted sum of basis functions
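The LP itself is not reproduced on these slides; a standard formulation for SSPs maximizes Σ_s α(s) V(s) subject to V(s) ≤ Σ_{s'} T(s,a,s') [C(s,a,s') + V(s')] for every non-goal state s and action a, with V(g) = 0 at goals. A sketch of that formulation with scipy.optimize.linprog, under the same assumed dictionary representation:

```python
import numpy as np
from scipy.optimize import linprog

def lp_solve_ssp(states, actions, T, C, goals, alpha=None):
    """Exact LP solution of an SSP: maximize sum_s alpha(s)*V(s)
    subject to V(s) <= sum_s' T(s,a,s') * (C(s,a,s') + V(s')) for all non-goal s, a."""
    non_goal = [s for s in states if s not in goals]
    idx = {s: i for i, s in enumerate(non_goal)}
    n = len(non_goal)
    if alpha is None:
        alpha = {s: 1.0 for s in non_goal}       # any positive weights give the exact V*
    A_ub, b_ub = [], []
    for s in non_goal:
        for a in actions[s]:
            row = np.zeros(n)
            row[idx[s]] = 1.0
            rhs = 0.0
            for s2, p in T[(s, a)].items():
                rhs += p * C[(s, a, s2)]         # expected one-step cost
                if s2 not in goals:
                    row[idx[s2]] -= p            # move T(s,a,s')*V(s') to the left side
            A_ub.append(row)
            b_ub.append(rhs)
    # linprog minimizes, so negate the objective to maximize sum_s alpha(s)*V(s)
    obj = -np.array([alpha[s] for s in non_goal])
    res = linprog(obj, A_ub=np.array(A_ub), b_ub=b_ub, bounds=[(None, None)] * n)
    V = {s: 0.0 for s in goals}                  # goal values are fixed at zero
    V.update({s: res.x[idx[s]] for s in non_goal})
    return V
```

Note that there is one variable per non-goal state and one constraint per state-action pair, matching the |S| variables and |S|·|A| constraints above.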
Infinite-Horizon Discounted-Reward MDPs • V*(s) = max_{a∈A} Σ_{s'∈S} T(s,a,s') [R(s,a,s') + γ V*(s')] • Value Iteration and Policy Iteration work even better than for SSPs • Policy Iteration does not require a "proper" initial policy • Convergence is stronger and the bounds are tighter • The number of iterations can be bounded
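For reference, the discounted, reward-maximizing backup differs from the SSP backup only in the max over actions, the reward term, and the discount factor γ; a minimal sketch under the same assumed representation (R holds rewards instead of costs):

```python
def discounted_backup(s, actions, T, R, V, gamma=0.95):
    """V(s) <- max_a sum_s' T(s,a,s') * (R(s,a,s') + gamma * V(s'))."""
    return max(sum(p * (R[(s, a, s2)] + gamma * V[s2])
                   for s2, p in T[(s, a)].items())
               for a in actions[s])
```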
Finite-Horizon MDPs • V*(s,t) = 0 if t > T_max; otherwise V*(s,t) = max_{a∈A} Σ_{s'∈S} T(s,a,s') [R(s,a,s') + V*(s',t+1)] • Finite-horizon MDPs are acyclic • There exists an optimal backup order: t = T_max down to 0 • Returns optimal values (not just ϵ-consistent ones) • Performs one backup per augmented state (s,t)
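A sketch of that optimal backward backup order, computing V*(s,t) for t = T_max down to 0; the representation is the same assumed one, and T_max denotes the horizon:

```python
def finite_horizon_vi(states, actions, T, R, T_max):
    """Backward induction over augmented states (s, t)."""
    V = {(s, T_max + 1): 0.0 for s in states}      # no reward beyond the horizon
    pi = {}
    for t in range(T_max, -1, -1):                 # t = T_max down to 0
        for s in states:
            best_a, best_q = None, float("-inf")
            for a in actions[s]:
                q = sum(p * (R[(s, a, s2)] + V[(s2, t + 1)])
                        for s2, p in T[(s, a)].items())
                if q > best_q:
                    best_a, best_q = a, q
            V[(s, t)] = best_q                     # exactly one backup per (s, t)
            pi[(s, t)] = best_a
    return V, pi
```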
MDPs with Dead Ends • Dead-End State: a state s ∈ S from which no policy can reach the goal in any number of time steps • SSP MDPs cannot model domains with dead ends • If dead-end states are allowed, V*(s) is undefined for them and value iteration diverges
Finite-Penalty SSP MDPs with Dead-Ends • fSSPDE: a tuple <S, A, T, C, G, P> • S, A, T, C, G are the same as in an SSP MDP • P ∈ ℝ+ is the penalty incurred when the agent decides to abort the process in a non-goal state, under the following condition: • For every improper stationary deterministic Markovian policy π and every s ∈ S at which π is improper, the value of π at s, under expected linear additive utility and without the option of stopping the process by paying the penalty, is infinite
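One way to operationalize this, not spelled out on the slide, is to treat aborting as an extra option that costs P, which caps every state's value at the penalty: V(s) ← min(P, min_a Σ_{s'} T(s,a,s') [C(s,a,s') + V(s')]). A hedged sketch of that modified backup under the assumed dictionary representation:

```python
def finite_penalty_backup(s, actions, T, C, V, goals, penalty):
    """Backup for an fSSPDE: the agent may abort in any non-goal state for cost P,
    so values never exceed the penalty even at dead ends."""
    if s in goals:
        return 0.0
    best = min(sum(p * (C[(s, a, s2)] + V[s2]) for s2, p in T[(s, a)].items())
               for a in actions[s])
    return min(penalty, best)     # aborting (paying P) is always an option
```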
Comparison • Policy Iteration • Convergence depends on the initial policy being proper (unless IHDR) • Value Iteration • Doesn't require an initial proper policy • For IHDR, has stronger error bounds upon reaching ϵ-consistency • Linear Programming • Computing the exact solution is slower than value iteration
Summary • Policy Iteration • Value Iteration • Prioritizing and Partitioning Value Iteration • Linear Programming (alternative solution) • Special Cases: • IHDR MDP and FH MDP • Dead-End States