MDP Exact Solutions II and Appl. • Jeffrey Chyan • Department of Computer Science • Rice University • Slides adapted from Mausam and Andrey Kolobov
Outline • Policy Iteration (3.3) • Value Iteration (3.4) • Prioritization in Value Iteration (3.5) / Partitioned Value Iteration (3.6) • Linear Programming Formulation (3.7) • Infinite-Horizon Discounted-Reward MDPs (3.8) • Finite-Horizon MDPs (3.9) • MDPs with Dead Ends (3.10)
Solving MDPs • Finding the best policy for MDPs • Policy Iteration • Value Iteration • Linear Programming
Recall SSP MDPs • Agent pays a cost to achieve the goal • There exists at least one proper policy • Every improper policy incurs an infinite cost from every state from which it does not reach the goal with P=1 • IHDR and FH ⊆ SSP • For this presentation, assume SSP unless stated otherwise
Recall Value and Evaluation • Value Function: maps the domain of a policy (excluding the action set) to a scalar value • Value of a policy: the expected utility of the reward sequence obtained by executing the policy • Policy Evaluation: given a policy, compute the value function at each state • Solving a system of equations • Iterative approach
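As a concrete illustration of the iterative approach, here is a minimal policy-evaluation sketch. The dictionary-based MDP representation (T for transition probabilities, C for costs, goals, and pi) is an assumption made for illustration, not something given on the slides.

```python
def evaluate_policy(states, T, C, goals, pi, n_iters=100):
    """Iterative policy evaluation: repeatedly apply
    V(s) <- sum_s' T(s, pi(s), s') * (C(s, pi(s), s') + V(s'))."""
    V = {s: 0.0 for s in states}          # goal values stay at 0
    for _ in range(n_iters):
        for s in states:
            if s in goals:
                continue
            a = pi[s]
            V[s] = sum(p * (C[(s, a, s2)] + V[s2])
                       for s2, p in T[(s, a)].items())
    return V
```

The alternative is to solve the same fixed-point equations directly as a linear system, as in the policy-iteration sketch later in these slides.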
Motivation • Find the best policy • Brute-force algorithm: assuming all policies are proper, enumerate all policies, evaluate them, and return the best one • Exponential number of policies, so computationally intractable • Need a more intelligent search for the best policy
The Q-Value Under a Value Function V • Q-value under a value function: the one-step lookahead computation of the value of taking an action a in state s • Under the belief that the value function V is the true expected cost to reach a goal • Denoted Q^V(s,a) • Q^V(s,a) = Σ_{s'∈S} T(s,a,s') [C(s,a,s') + V(s')]
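A minimal sketch of this one-step lookahead, assuming the MDP is stored in plain Python dictionaries (T for transition probabilities, C for costs; both names are illustrative):

```python
def q_value(s, a, T, C, V):
    """Q^V(s,a) = sum over successors s' of T(s,a,s') * (C(s,a,s') + V(s')).
    T[(s, a)] maps each successor s' to its probability; C[(s, a, s')] is the cost."""
    return sum(p * (C[(s, a, s2)] + V[s2]) for s2, p in T[(s, a)].items())
```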
Greedy Action/Policy • Action greedy w.r.t. a value function: an action that has the lowest Q-value • a = argmin_{a'} Q^V(s,a') • Greedy Policy: a policy with all greedy actions w.r.t. V for each state
Policy Iteration • Initialize π_0 as a random proper policy • Repeat • Policy Evaluation: Compute V^{π_{n-1}} • Policy Improvement: Construct π_n greedy w.r.t. V^{π_{n-1}} • Until π_n == π_{n-1} • Return π_n
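A sketch of the loop above, assuming goal states are absorbing with zero cost and the initial policy π_0 is proper and defined on the non-goal states. Policy evaluation here solves the linear system restricted to non-goal states; all of the data structures are illustrative assumptions.

```python
import numpy as np

def policy_iteration(states, actions, T, C, goals, pi0):
    """states: list; actions[s]: applicable actions; T[(s,a)]: {s': prob};
    C[(s,a,s')]: cost; pi0: proper policy on non-goal states."""
    non_goal = [s for s in states if s not in goals]
    idx = {s: i for i, s in enumerate(non_goal)}
    n = len(non_goal)
    pi = {s: pi0[s] for s in non_goal}
    while True:
        # Policy evaluation: solve (I - P_pi) v = c_pi over the non-goal states.
        P = np.zeros((n, n))
        c = np.zeros(n)
        for s in non_goal:
            a = pi[s]
            for s2, p in T[(s, a)].items():
                c[idx[s]] += p * C[(s, a, s2)]
                if s2 not in goals:
                    P[idx[s], idx[s2]] += p
        v = np.linalg.solve(np.eye(n) - P, c)
        V = {s: 0.0 for s in goals}
        V.update({s: v[idx[s]] for s in non_goal})
        # Policy improvement: pick the greedy action w.r.t. V^{pi_{n-1}}.
        new_pi = {s: min(actions[s],
                         key=lambda a: sum(p * (C[(s, a, s2)] + V[s2])
                                           for s2, p in T[(s, a)].items()))
                  for s in non_goal}
        if new_pi == pi:          # policy unchanged: done
            return pi, V
        pi = new_pi
```

The np.linalg.solve call corresponds to the "solving a system of equations" variant of policy evaluation; modified policy iteration (next slides) swaps in a few sweeps of iterative evaluation instead.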
Policy Improvement • Computes a greedy policy under V^{π_{n-1}} • First compute the Q-value of each action under V^{π_{n-1}} in a given state • Then assign a greedy action in that state as π_n
Properties of Policy Iteration • Policy Iteration for an SSP (initialized with a proper policy π_0) • Successively improves the policy in each iteration • V^{π_n}(s) ≤ V^{π_{n-1}}(s) • Converges to an optimal policy
Modified Policy Iteration • Use the iterative procedure for policy evaluation instead of solving the system of equations • Use the final value function from the previous iteration, V^{π_{n-1}}, instead of an arbitrary initialization V_0^{π_n}
Modified Policy Iteration • Initialize π_0 as a random proper policy • Repeat • Approximate Policy Evaluation: Compute V^{π_{n-1}} by running only a few iterations of iterative policy evaluation • Policy Improvement: Construct π_n greedy w.r.t. V^{π_{n-1}} • Until π_n == π_{n-1} • Return π_n
Limitations of Policy Iteration • Why do we need to start with a proper policy? • Otherwise the policy evaluation step can diverge • How do we get a proper policy? • No domain-independent algorithm • Policy iteration for SSPs is not generically applicable
From Policy Iteration To Value Iteration • Search space changes • Policy Iteration • Search over policies • Compute the resulting value • Value Iteration • Search over values • Compute the resulting policy
Bellman Equations • Value Iteration based on set of Bellman equations • Bellman equations mathematically express the optimal solution of an MDP • Recursive expansion to compute optimal value function
Bellman Equations • Optimal Q-value of a state-action pair: the minimum expected cost to reach a goal starting in state s if the agent's first action is a • Denoted Q*(s,a) • V*(s) = 0 if s ∈ G; V*(s) = min_{a∈A} Q*(s,a) if s ∉ G • Q*(s,a) = Σ_{s'∈S} T(s,a,s') [C(s,a,s') + V*(s')] • Restatement of the optimality principle for SSP MDPs
Bellman Equations • Q*(s,a) = Σ_{s'∈S} T(s,a,s') [C(s,a,s') + V*(s')]: the expected cost of first executing action a in state s and then following an optimal policy • V*(s) = 0 if s ∈ G: already at the goal, no action needed • V*(s) = min_{a∈A} Q*(s,a) if s ∉ G: pick the best action, minimizing expected cost • The minimization over all actions makes the equations non-linear
Bellman Backup • Iterative refinement: V_n(s) ← min_{a∈A} Σ_{s'∈S} T(s,a,s') [C(s,a,s') + V_{n-1}(s')] • Bellman Backup: computes a new value at state s by backing up the successor values V(s')
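A minimal sketch of a single backup at state s, under the same assumed dictionary representation used in the earlier sketches:

```python
def bellman_backup(s, actions, T, C, V, goals):
    """Return min_a sum_s' T(s,a,s') * (C(s,a,s') + V(s')), i.e. the backed-up value at s."""
    if s in goals:
        return 0.0
    return min(sum(p * (C[(s, a, s2)] + V[s2]) for s2, p in T[(s, a)].items())
               for a in actions[s])
```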
Value Iteration • No restriction on the initial value function • Termination condition: ϵ-consistency
Example • All costs are 1, except the costs of a_40 and a_41, which are 5 and 2 respectively • All V_0 values are initialized to the distance from the goal • The backup order is s_0, s_1, …, s_4
Example – First Iteration • V_1(s_0) = min{1 + V_0(s_1), 1 + V_0(s_2)} = 3; similar computations for s_1, s_2, s_3 • Q_1(s_4, a_41) = 2 + 0.6·0 + 0.4·2 = 2.8 and Q_1(s_4, a_40) = 5 • V_1(s_4) = min{2.8, 5} = 2.8
VI - Convergence and Optimality • Value Iteration converges to the optimal value function in the limit, without restrictions • For an SSP MDP, ∀s ∈ S: lim_{n→∞} V_n(s) = V*(s), irrespective of the initialization
VI - Termination • Residual at state s: the magnitude of the change in the value of state s if a Bellman backup is applied to V at s once • Denoted Res^V(s) • The residual Res^V is the maximum residual across all states • ϵ-consistency: a state s is ϵ-consistent w.r.t. a value function V if the residual at s w.r.t. V is less than ϵ • A value function V is ϵ-consistent if it is ϵ-consistent at all states • Terminate VI when all residuals are small
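A compact sketch of value iteration with this termination test, using in-place (Gauss-Seidel style) updates; the dictionary-based MDP representation (T, C, actions, goals) is the same assumption as in the earlier sketches:

```python
def value_iteration(states, actions, T, C, goals, eps=1e-4):
    V = {s: 0.0 for s in states}               # any initialization works for SSPs
    while True:
        max_residual = 0.0
        for s in states:
            if s in goals:
                continue
            new_v = min(sum(p * (C[(s, a, s2)] + V[s2])
                            for s2, p in T[(s, a)].items())
                        for a in actions[s])
            max_residual = max(max_residual, abs(new_v - V[s]))
            V[s] = new_v                       # in-place update
        if max_residual < eps:                 # V is eps-consistent at every state
            return V
```

A greedy policy w.r.t. the returned V can then be extracted with the one-step lookahead from the Q-value sketch.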
VI - Running Time • Each Bellman backup: • Go over all actions and their successors: O(|S||A|) • Each iteration: • Back up all states: O(|S|²|A|) • Number of iterations: • General SSPs: non-trivial bounds don't exist
Monotonicity • For all n > k: • V_k ≤_p V* ⇒ V_n ≤_p V* (V_n monotonic from below) • V_k ≥_p V* ⇒ V_n ≥_p V* (V_n monotonic from above) • If a value function V_1 is componentwise greater (or less) than another value function V_2, then the same inequality holds between T(V_1) and T(V_2) • The Bellman backup operator T used in VI is monotonic
Value Iteration to Asynchronous Value Iteration • Value iteration requires full sweeps of the state space • It is not essential to back up all states in an iteration • Asynchronous value iteration requires an additional restriction, that no state is starved (every state is backed up infinitely often), so that convergence holds • The termination condition checks whether the current value function is ϵ-consistent
Priority Backup • Value iteration performs wasteful backups • Need to choose an intelligent backup order by defining a priority • Higher-priority states are backed up earlier
What State to Prioritize? • Avoid backing up a state when: • None of the successors of the state have had a change in value since the last backup • This means backing up the state will not change its value
Prioritized Sweeping • If a state's value changes, prioritize its predecessors • The priority estimates the expected change in the value of a state if a backup were performed on it • Converges to the optimal value function in the limit if all initial priorities are non-zero
Generalized Prioritized Sweeping • Instead of estimating the residual, compute the exact change in value (the residual) as the priority • First back up the state, then push it into the queue
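A rough sketch of the backup-then-push pattern described on the last two slides. The heap-based queue, the predecessor map preds, and the use of a state's own value change as its predecessors' priority are simplifying assumptions (a full implementation would also update priorities of states already in the queue):

```python
import heapq

def prioritized_vi(states, actions, T, C, goals, preds, eps=1e-4):
    """preds[s]: set of states that can reach s in one step."""
    V = {s: 0.0 for s in states}
    # start with every non-goal state in the queue at a uniform priority
    heap = [(-1.0, s) for s in states if s not in goals]
    heapq.heapify(heap)
    queued = {s for _, s in heap}
    while heap:
        _, s = heapq.heappop(heap)
        queued.discard(s)
        new_v = min(sum(p * (C[(s, a, s2)] + V[s2])
                        for s2, p in T[(s, a)].items())
                    for a in actions[s])
        change = abs(new_v - V[s])
        V[s] = new_v                                   # back up first...
        if change > eps:
            for sp in preds[s]:                        # ...then push predecessors
                if sp not in goals and sp not in queued:
                    heapq.heappush(heap, (-change, sp))   # heapq is a min-heap
                    queued.add(sp)
    return V
```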
Improved Prioritized Sweeping • States with low V(s) (closer to the goal) get higher priority initially • As the residuals of states closer to the goal shrink, the priorities of other states increase
Backward Value Iteration • Prioritized value iteration without a priority queue • Back up states in reverse order, starting from the goal • No priority-queue overhead, and good information flow
Which priority algorithm to use? • Synchronous Value Iteration: when states are highly interconnected • Prioritized Sweeping/Generalized Prioritized Sweeping: sequential dependencies • Improved Prioritized Sweeping: a specific way to trade off proximity to the goal against information flow • Backward Value Iteration: better for domains with fewer predecessors
Partitioned Value Iteration • Partition the state space • Stabilize the mutual co-dependencies within a partition before focusing attention on states in other partitions
Benefits of Partitioning • External-memory algorithms • PEMVI • Cache-efficient algorithms • P-EVA algorithm • Parallelized algorithms • P3VI
Linear Programming for MDPs • α(s) are the state-relevance weights • For the exact solution they are unimportant and can be set to any positive number (e.g., 1)
Linear Programming for MDPs • |S| variables • |S|·|A| constraints • Computing the exact solution is slower than value iteration • Better suited for a specific kind of approximation, where the value of a state is approximated with a weighted sum of basis functions
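The LP itself is not reproduced on these slides; a standard formulation for SSPs maximizes Σ_s α(s) V(s) subject to V(s) ≤ Σ_{s'} T(s,a,s') [C(s,a,s') + V(s')] for every non-goal state s and action a, with V(g) = 0 at goals. A sketch of that formulation with scipy.optimize.linprog, under the same assumed dictionary representation:

```python
import numpy as np
from scipy.optimize import linprog

def lp_solve_ssp(states, actions, T, C, goals, alpha=None):
    """Exact LP solution of an SSP: maximize sum_s alpha(s)*V(s)
    subject to V(s) <= sum_s' T(s,a,s') * (C(s,a,s') + V(s')) for all non-goal s, a."""
    non_goal = [s for s in states if s not in goals]
    idx = {s: i for i, s in enumerate(non_goal)}
    n = len(non_goal)
    if alpha is None:
        alpha = {s: 1.0 for s in non_goal}       # any positive weights give the exact V*
    A_ub, b_ub = [], []
    for s in non_goal:
        for a in actions[s]:
            row = np.zeros(n)
            row[idx[s]] = 1.0
            rhs = 0.0
            for s2, p in T[(s, a)].items():
                rhs += p * C[(s, a, s2)]         # expected one-step cost
                if s2 not in goals:
                    row[idx[s2]] -= p            # move T(s,a,s')*V(s') to the left side
            A_ub.append(row)
            b_ub.append(rhs)
    # linprog minimizes, so negate the objective to maximize sum_s alpha(s)*V(s)
    obj = -np.array([alpha[s] for s in non_goal])
    res = linprog(obj, A_ub=np.array(A_ub), b_ub=b_ub, bounds=[(None, None)] * n)
    V = {s: 0.0 for s in goals}                  # goal values are fixed at zero
    V.update({s: res.x[idx[s]] for s in non_goal})
    return V
```

Note that there is one variable per non-goal state and one constraint per state-action pair, matching the |S| variables and |S|·|A| constraints above.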
Infinite-Horizon Discounted-Reward MDPs • V*(s) = max_{a∈A} Σ_{s'∈S} T(s,a,s') [R(s,a,s') + γ V*(s')] • Value Iteration and Policy Iteration work even better than for SSPs • Policy Iteration does not require a "proper" initial policy • Convergence is stronger and the bounds are tighter • The number of iterations can be bounded
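For reference, the discounted, reward-maximizing backup differs from the SSP backup only in the max over actions, the reward term, and the discount factor γ; a minimal sketch under the same assumed representation (R holds rewards instead of costs):

```python
def discounted_backup(s, actions, T, R, V, gamma=0.95):
    """V(s) <- max_a sum_s' T(s,a,s') * (R(s,a,s') + gamma * V(s'))."""
    return max(sum(p * (R[(s, a, s2)] + gamma * V[s2])
                   for s2, p in T[(s, a)].items())
               for a in actions[s])
```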
Finite-Horizon MDPs • V*(s,t) = 0 if t > T_max; otherwise V*(s,t) = max_{a∈A} Σ_{s'∈S} T(s,a,s') [R(s,a,s') + V*(s',t+1)] • Finite-horizon MDPs are acyclic • There exists an optimal backup order: t = T_max down to 0 • Returns optimal values (not just ϵ-consistent ones) • Performs one backup per augmented state (s,t)
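A sketch of that optimal backward backup order, computing V*(s,t) for t = T_max down to 0; the representation is the same assumed one, and T_max denotes the horizon:

```python
def finite_horizon_vi(states, actions, T, R, T_max):
    """Backward induction over augmented states (s, t)."""
    V = {(s, T_max + 1): 0.0 for s in states}      # no reward beyond the horizon
    pi = {}
    for t in range(T_max, -1, -1):                 # t = T_max down to 0
        for s in states:
            best_a, best_q = None, float("-inf")
            for a in actions[s]:
                q = sum(p * (R[(s, a, s2)] + V[(s2, t + 1)])
                        for s2, p in T[(s, a)].items())
                if q > best_q:
                    best_a, best_q = a, q
            V[(s, t)] = best_q                     # exactly one backup per (s, t)
            pi[(s, t)] = best_a
    return V, pi
```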
MDPs with Dead Ends • Dead-End State: a state s ∈ S from which no policy can reach the goal in any number of time steps • SSP MDPs cannot model domains with dead ends • If dead-end states are allowed, V*(s) is undefined for them and value iteration diverges
Finite-Penalty SSP MDPs with Dead-Ends • fSSPDE: a tuple <S, A, T, C, G, P> • S, A, T, C, G are the same as in an SSP MDP • P ∈ ℝ+ is the penalty incurred when the agent decides to abort the process in a non-goal state, under the following condition: • For every improper stationary deterministic Markovian policy π and every s ∈ S at which π is improper, the value of π at s, under expected linear additive utility and without the option of stopping the process by paying the penalty, is infinite
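One way to operationalize this, not spelled out on the slide, is to treat aborting as an extra option that costs P, which caps every state's value at the penalty: V(s) ← min(P, min_a Σ_{s'} T(s,a,s') [C(s,a,s') + V(s')]). A hedged sketch of that modified backup under the assumed dictionary representation:

```python
def finite_penalty_backup(s, actions, T, C, V, goals, penalty):
    """Backup for an fSSPDE: the agent may abort in any non-goal state for cost P,
    so values never exceed the penalty even at dead ends."""
    if s in goals:
        return 0.0
    best = min(sum(p * (C[(s, a, s2)] + V[s2]) for s2, p in T[(s, a)].items())
               for a in actions[s])
    return min(penalty, best)     # aborting (paying P) is always an option
```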
Comparison • Policy Iteration • Convergence depends on the initial policy being proper (unless IHDR) • Value Iteration • Doesn't require an initial proper policy • For IHDR, has stronger error bounds upon reaching ϵ-consistency • Linear Programming • Computing the exact solution is slower than value iteration
Summary • Policy Iteration • Value Iteration • Prioritizing and Partitioning Value Iteration • Linear Programming (alternative solution) • Special Cases: • IHDR MDP and FH MDP • Dead-End States