Value and Planning in MDPs
Administrivia • Reading 3 assigned today • Mahadevan, S., “Representation Policy Iteration”. In Proc. of the 21st Conference on Uncertainty in Artificial Intelligence (UAI-2005). • http://www.cs.umass.edu/~mahadeva/papers/uai-final-paper.pdf • Due: Apr 20 • Groups assigned this time
Where we are • Last time: • Expected value of policies • Principle of maximum expected utility • The Bellman equation • Today: • A little intuition (pictures) • Finding π*: the policy iteration algorithm • The Q function • On to actual learning (maybe?)
The Bellman equation • The final recursive equation is known as the Bellman equation: Vπ(s) = R(s) + γ ∑s’ T(s, π(s), s’) Vπ(s’) • Unique soln to this eqn gives value of a fixed policy π when operating in a known MDP M=〈S,A,T,R〉 • When state/action spaces are discrete, can think of V and R as vectors and Tπ as a matrix, and get the matrix eqn: Vπ = R + γ Tπ Vπ
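Not from the slides, but as a concrete illustration of the fixed-policy Bellman equation: a minimal NumPy sketch that evaluates a given policy by repeatedly applying the backup Vπ(s) ← R(s) + γ ∑s’ T(s, π(s), s’) Vπ(s’) until it stops changing. The array shapes, function name, and tolerance are my own assumptions.

```python
import numpy as np

def evaluate_policy(T, R, policy, gamma=0.9, tol=1e-8):
    """Successive-approximation evaluation of a fixed policy.

    T      : transition model, shape (|S|, |A|, |S|), T[s, a, s'] = P(s' | s, a)
    R      : state-based rewards, shape (|S|,)
    policy : deterministic policy, shape (|S|,), policy[s] = index of chosen action
    """
    n_states = len(R)
    V = np.zeros(n_states)
    while True:
        # Bellman backup for the fixed policy pi
        V_new = R + gamma * np.array([T[s, policy[s]] @ V for s in range(n_states)])
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```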
Exercise • Solve the matrix Bellman equation (i.e., find V): • I formulated the Bellman equations for “state-based” rewards: R(s) • Formulate & solve the B.E. for: • “state-action” rewards (R(s,a)) • “state-action-state” rewards (R(s,a,s’))
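For the matrix form of this exercise, note that Vπ = R + γ Tπ Vπ rearranges to (I − γ Tπ) Vπ = R, which is a plain linear system. A minimal NumPy sketch (the numbers in T_pi and R below are placeholders, not the values from the slide):

```python
import numpy as np

gamma = 0.9

# Placeholder fixed-policy transition matrix T_pi[s, s'] and state rewards R[s];
# substitute the matrix and vector given on the slide.
T_pi = np.array([[0.8, 0.2],
                 [0.1, 0.9]])
R = np.array([0.0, 1.0])

# Solve (I - gamma * T_pi) V = R for the policy's value function V.
V = np.linalg.solve(np.eye(len(R)) - gamma * T_pi, R)
print(V)
```

The same solve handles the other reward models once R is replaced by the per-state reward under the policy: for R(s,a) use Rπ(s) = R(s, π(s)); for R(s,a,s’) use Rπ(s) = ∑s’ T(s, π(s), s’) R(s, π(s), s’).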
Policy values in practice “Robot” navigation in a grid maze (goal state marked)
The MDP formulation • State space: • Action space: • Reward function: • Transition function: ...
The MDP formulation • Transition function: • If desired direction is unblocked • Move in desired direction with probability 0.7 • Stay in same place w/ prob 0.1 • Move “forward right” w/ prob 0.1 • Move “forward left” w/ prob 0.1 • If desired direction is blocked (wall) • Stay in same place w/ prob 1.0
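A small Python sketch of this noisy grid-world transition model, purely as an illustration: the coordinate convention, the helper names, and the reading of “forward right/left” as the diagonal cells ahead are all assumptions, not taken from the slides.

```python
# Assumed encoding: states are (row, col) cells, row increasing downward;
# actions are unit offsets on the grid.
ACTIONS = {"NORTH": (-1, 0), "SOUTH": (1, 0), "EAST": (0, 1), "WEST": (0, -1)}

def transition_probs(state, action, blocked):
    """Return {next_state: probability} under the noisy grid dynamics.

    blocked(cell) -> True if the cell is a wall or off the grid.
    """
    dr, dc = ACTIONS[action]
    if blocked((state[0] + dr, state[1] + dc)):
        return {state: 1.0}                       # desired direction blocked: stay put

    def move(offset):
        nxt = (state[0] + offset[0], state[1] + offset[1])
        return state if blocked(nxt) else nxt     # bumping into a wall leaves you in place

    # "Forward right/left" read here as the diagonal cells ahead (an assumption).
    fwd_right = (dr + dc, dc - dr)
    fwd_left = (dr - dc, dc + dr)
    probs = {}
    for offset, p in [((dr, dc), 0.7), ((0, 0), 0.1), (fwd_right, 0.1), (fwd_left, 0.1)]:
        nxt = move(offset)
        probs[nxt] = probs.get(nxt, 0.0) + p
    return probs
```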
Policy values in practice Optimal policy, π* (actions: EAST, SOUTH, WEST, NORTH)
Policy values in practice Value function for optimal policy, V*. Why does it look like this?
A harder “maze”... (legend: walls, doors)
A harder “maze”... Optimal policy, π*
A harder “maze”... Value function for optimal policy, V*
Still more complex... Optimal policy, π*
Still more complex... Value function for optimal policy, V*
Planning: finding π* • So we know how to evaluate a single policy, π • How do you find the best policy? • Remember: still assuming that we know M=〈S,A,T,R〉 • Non-solution: iterate through all possible π, evaluating each one; keep best
Policy iteration & friends • Many different solution methods available • All exploit some characteristics of MDPs: • For infinite-horizon discounted reward in a discrete, finite MDP, there exists at least one optimal, stationary policy (there may be more than one equivalent policy) • The Bellman equation expresses the recursive structure of an optimal policy • These lead to a family of closely related solution algorithms: policy iteration, value iteration, generalized policy iteration, etc.
The policy iteration alg. • Function: policy_iteration • Input: MDP M=〈S,A,T,R〉, discount γ • Output: optimal policy π*; opt. value func. V* • Initialization: choose π0 arbitrarily • Repeat { • Vi = eval_policy(M, πi, γ) // from Bellman eqn • πi+1 = local_update_policy(πi, Vi) • } Until (πi+1 == πi) • Function: π’ = local_update_policy(π, V) • for i = 1..|S| { • π’(si) = argmaxa∈A( ∑j T(si,a,sj) · V(sj) ) • }
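The same algorithm as a compact, runnable NumPy sketch rather than the course's reference code; it assumes state-based rewards R(s), deterministic policies, and the array layout used in the earlier sketches.

```python
import numpy as np

def policy_iteration(T, R, gamma=0.9):
    """Policy iteration for a known MDP M = (S, A, T, R) with state-based rewards.

    T : transition model, shape (|S|, |A|, |S|), T[s, a, s'] = P(s' | s, a)
    R : rewards, shape (|S|,)
    Returns (optimal policy, its value function).
    """
    n_states, n_actions, _ = T.shape
    policy = np.zeros(n_states, dtype=int)       # choose pi_0 arbitrarily

    while True:
        # eval_policy: solve (I - gamma * T_pi) V = R exactly (Bellman eqn)
        T_pi = T[np.arange(n_states), policy]    # rows T[s, pi(s), :]
        V = np.linalg.solve(np.eye(n_states) - gamma * T_pi, R)

        # local_update_policy: argmax_a sum_j T(s_i, a, s_j) * V(s_j)
        # (R(s) and gamma are constant in a, so they drop out of the argmax)
        new_policy = np.argmax(T @ V, axis=1)

        if np.array_equal(new_policy, policy):   # until pi_{i+1} == pi_i
            return policy, V
        policy = new_policy
```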
Why does this work? • 2 explanations: • Theoretical: • The local update w.r.t. the policy value is a contractive mapping, ergo a fixed point exists and will be reached • See “contraction mapping”, “Banach fixed-point theorem”, etc. • http://math.arizona.edu/~restrepo/475A/Notes/sourcea/node22.html • http://planetmath.org/encyclopedia/BanachFixedPointTheorem.html • Contracts w.r.t. the Bellman error: ‖BV − BV’‖∞ ≤ γ ‖V − V’‖∞, where B is the Bellman backup operator
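For reference, a short standard derivation (not from the slides) of why the Bellman backup B, defined by (BV)(s) = R(s) + γ maxa ∑s’ T(s,a,s’) V(s’), is a γ-contraction in the max norm; the fixed-policy backup used in policy evaluation contracts the same way with the max over actions removed.

```latex
\begin{align*}
|(BV)(s) - (BV')(s)|
  &= \gamma \Big| \max_a \sum_{s'} T(s,a,s')\,V(s')
            - \max_a \sum_{s'} T(s,a,s')\,V'(s') \Big| \\
  &\le \gamma \max_a \sum_{s'} T(s,a,s')\,\big|V(s') - V'(s')\big|
   \;\le\; \gamma \,\lVert V - V' \rVert_\infty .
\end{align*}
% Taking the max over s:  \lVert BV - BV' \rVert_\infty \le \gamma \lVert V - V' \rVert_\infty
```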
Why does this work? • The intuitive explanation • It’s doing a dynamic-programming “backup” of reward from reward “sources” • At every step, the policy is locally updated to take advantage of new information about reward that is propagated back by the evaluation step • Value “propagates away” from sources and the policy is able to say “hey! there’s reward over there! I can get some of that action if I change a bit!”
P.I. in action Iterations 0 through 6, showing the policy and value function at each step (converged at iteration 6)
Properties • Policy iteration • Known to converge (provable) • Observed to converge exponentially quickly • # iterations is O(ln(|S|)) • Empirical observation; strongly believed but no proof (yet) • O(|S|³) time per iteration (policy evaluation)
Variants • Other methods possible • Linear programming (poly-time solution exists) • Value iteration (see sketch below) • Generalized policy iteration (often best in practice)
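Since value iteration is listed as a variant, a minimal sketch (my own, using the same array layout as the earlier examples):

```python
import numpy as np

def value_iteration(T, R, gamma=0.9, tol=1e-8):
    """Value iteration: repeatedly apply the optimal Bellman backup.

    T : shape (|S|, |A|, |S|);  R : shape (|S|,).
    Returns (greedy policy, value function).
    """
    V = np.zeros(len(R))
    while True:
        # V(s) <- R(s) + gamma * max_a sum_s' T(s, a, s') V(s')
        V_new = R + gamma * np.max(T @ V, axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return np.argmax(T @ V_new, axis=1), V_new
        V = V_new
```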