Policy Evaluation & Policy Iteration • S&B: Sec 4.1, 4.3; 6.5
The Bellman equation • The final recursive equation is known as the Bellman equation: Vπ(s) = R(s) + γ Σs′ T(s, π(s), s′) Vπ(s′) • The unique solution to this equation gives the value of a fixed policy π when operating in a known MDP M=〈S,A,T,R〉 • When the state/action spaces are discrete, can think of Vπ and R as vectors and Tπ as a matrix (with Tπ(s,s′) = T(s, π(s), s′)), and get the matrix equation: Vπ = R + γ Tπ Vπ
Exercise • Solve the matrix Bellman equation (i.e., find V): • I formulated the Bellman equations for “state-based” rewards: R(s) • Formulate & solve the B.E. for “state-action” rewards (R(s,a)) and “state-action-state” rewards (R(s,a,s’))
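A minimal numerical sketch of the matrix solve (a way to check your answer, not part of the original exercise): the matrix equation Vπ = R + γ Tπ Vπ rearranges to (I − γ Tπ) Vπ = R, which a linear solver handles directly. The 3-state chain, transition matrix, and rewards below are made up for illustration.

```python
import numpy as np

gamma = 0.9

# T_pi[i, j] = P(s_j | s_i, pi(s_i)); each row sums to 1.
T_pi = np.array([[0.9, 0.1, 0.0],
                 [0.0, 0.9, 0.1],
                 [0.0, 0.0, 1.0]])

# State-based rewards R(s): only the last (absorbing) state pays off.
R = np.array([0.0, 0.0, 1.0])

# V = (I - gamma * T_pi)^{-1} R, via a linear solve rather than an explicit inverse.
V = np.linalg.solve(np.eye(3) - gamma * T_pi, R)
print(V)   # last state: 1 / (1 - gamma) = 10; earlier states discount toward it
```

For R(s,a) rewards the same solve applies with R replaced by the vector Rπ(s) = R(s, π(s)); for R(s,a,s′) rewards, use Rπ(s) = Σs′ T(s, π(s), s′) R(s, π(s), s′).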
Policy values in practice “Robot” navigation in a grid maze
Policy values in practice Optimal policy, π*
Policy values in practice Value function for optimal policy, V*
A harder “maze”... Optimal policy, π*
A harder “maze”... Value function for optimal policy, V*
Still more complex... Optimal policy, π*
Still more complex... Value function for optimal policy, V*
Planning: finding π* • So we know how to evaluate a single policy, π • How do you find the best policy? • Remember: still assuming that we know M=〈S,A,T,R〉 • Non-solution: iterate through all possible π, evaluating each one, and keep the best; there are |A|^|S| deterministic policies, so this is hopeless for any non-trivial MDP
Policy iteration & friends • Many different solution methods are available • All exploit some characteristics of MDPs: • For infinite-horizon discounted reward in a discrete, finite MDP, there exists at least one optimal, stationary policy (there may be more than one policy with the same, optimal value) • The Bellman equation expresses the recursive structure of an optimal policy • This leads to a family of closely related solution methods: policy iteration, value iteration, generalized policy iteration, etc.
The policy iteration alg. • Function: policy_iteration • Input: MDP M=〈S,A,T,R〉, discount γ • Output: optimal policy π*; optimal value function V* • Initialization: choose π0 arbitrarily • Repeat { • Vi = eval_policy(M, πi, γ) // from Bellman eqn • πi+1 = local_update_policy(πi, Vi) • } Until (πi+1 == πi) • Function: π’ = local_update_policy(π, V) • for i = 1..|S| { • π’(si) = argmaxa∈A { Σj T(si, a, sj) · V(sj) } • }
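A compact Python sketch of the pseudocode above. The array conventions are mine, not the slides': T is a |S|×|A|×|S| NumPy array with T[s, a, s′] = T(s,a,s′), R is a length-|S| vector of state-based rewards, and the initial policy is all zeros. The greedy update omits R(s) and γ because they do not change the argmax when rewards are state-based.

```python
import numpy as np

def eval_policy(T, R, pi, gamma):
    """Exact policy evaluation: solve V = R + gamma * T_pi V."""
    n = len(R)
    T_pi = T[np.arange(n), pi]              # T_pi[s, s'] = T(s, pi(s), s')
    return np.linalg.solve(np.eye(n) - gamma * T_pi, R)

def local_update_policy(T, V):
    """Greedy one-step improvement: argmax_a sum_j T(s, a, s_j) V(s_j)."""
    return np.argmax(T @ V, axis=1)          # (|S|, |A|) -> (|S|,)

def policy_iteration(T, R, gamma):
    n_states, n_actions, _ = T.shape
    pi = np.zeros(n_states, dtype=int)       # arbitrary initial policy
    while True:
        V = eval_policy(T, R, pi, gamma)
        pi_new = local_update_policy(T, V)
        if np.array_equal(pi_new, pi):       # policy stable: done
            return pi, V
        pi = pi_new
```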
Why does this work? • Two explanations: • Theoretical: • The local update w.r.t. the policy value is a contraction mapping, so a fixed point exists and will be reached • See “contraction mapping”, “Banach fixed-point theorem”, etc. • http://math.arizona.edu/~restrepo/475A/Notes/sourcea/node22.html • http://planetmath.org/encyclopedia/BanachFixedPointTheorem.html • Contracts w.r.t. the Bellman error: ‖B(V) − B(V′)‖∞ ≤ γ ‖V − V′‖∞, where B is the Bellman backup operator
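A quick empirical check of the contraction claim, under the same array conventions as above: applying the Bellman (optimality) backup to two arbitrary value functions shrinks their sup-norm distance by at least a factor of γ. The random MDP here is purely illustrative; the policy-evaluation backup (fixed action per state instead of the max) contracts the same way.

```python
import numpy as np

rng = np.random.default_rng(0)
n_s, n_a, gamma = 5, 3, 0.9

T = rng.random((n_s, n_a, n_s))
T /= T.sum(axis=2, keepdims=True)            # normalize rows into valid transition probs
R = rng.random(n_s)

def backup(V):
    """Bellman optimality backup: B(V)(s) = R(s) + gamma * max_a sum_j T(s,a,s_j) V(s_j)."""
    return R + gamma * np.max(T @ V, axis=1)

V1, V2 = rng.random(n_s), rng.random(n_s)
lhs = np.max(np.abs(backup(V1) - backup(V2)))    # ||B V1 - B V2||_inf
rhs = gamma * np.max(np.abs(V1 - V2))            # gamma * ||V1 - V2||_inf
print(lhs <= rhs + 1e-12)                        # True: the backup is a gamma-contraction
```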
Why does this work? • The intuitive explanation • It’s doing a dynamic-programming “backup” of reward from reward “sources” • At every step, the policy is locally updated to take advantage of new information about reward that is propagated back by the evaluation step • Value “propagates away” from sources and the policy is able to say “hey! there’s reward over there! I can get some of that action if I change a bit!”
P.I. in action Iteration 0 Policy Value
P.I. in action Iteration 1 Policy Value
P.I. in action Iteration 2 Policy Value
P.I. in action Iteration 3 Policy Value
P.I. in action Iteration 4 Policy Value
P.I. in action Iteration 5 Policy Value
P.I. in action Iteration 6: done Policy Value
Properties & Variants • Policy iteration • Guaranteed to converge (provable) • Observed to converge exponentially quickly: # iterations is O(ln |S|) • Empirical observation; strongly believed but no proof (yet) • O(|S|³) time per iteration (exact policy evaluation) • Other methods possible • Linear programming (polynomial-time solution exists) • Value iteration (sketched below) • Generalized policy iteration (often best in practice)
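For comparison with policy iteration, here is a minimal sketch of value iteration under the same array conventions as before (my conventions, not the slides'): repeatedly apply the Bellman optimality backup until the Bellman error drops below a tolerance, then read off the greedy policy.

```python
import numpy as np

def value_iteration(T, R, gamma, tol=1e-8):
    """Repeat the optimality backup V <- R + gamma * max_a T V until (near) convergence."""
    V = np.zeros(len(R))
    while True:
        V_new = R + gamma * np.max(T @ V, axis=1)
        if np.max(np.abs(V_new - V)) < tol:      # Bellman error below tolerance
            break
        V = V_new
    pi = np.argmax(T @ V, axis=1)                # greedy policy w.r.t. the converged V
    return pi, V
```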
Q: A key operative • Critical step in policy iteration: • π’(si) = argmaxa∈A { Σj T(si, a, sj) · V(sj) } • Asks “What happens if I ignore π for just one step, do a instead, and then resume doing π thereafter?” • This frequently used operation gets a special name: • Definition: the Q function is: Qπ(s,a) = R(s) + γ Σs′ T(s, a, s′) Vπ(s′) • Policy iteration says: “Figure out Q, act greedily according to Q, then update Q and repeat, until you can’t do any better...”
What to do with Q • Can think of Q as a big table: one entry for each state/action pair • “If I’m in state s and take action a, this is my expected discounted return...” • A “one-step” exploration: “In state s, if I deviate from my policy π for one timestep, then keep doing π, is my life better or worse?” • Can get V and π from Q: V(s) = maxa Q(s,a) and π(s) = argmaxa∈A Q(s,a)
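A small sketch of the Q table under the same array conventions as before: Q is an |S|×|A| array computed from V by one backup, and V and a greedy π are read back off with a max/argmax over actions.

```python
import numpy as np

def q_from_v(T, R, V, gamma):
    """Q(s, a) = R(s) + gamma * sum_j T(s, a, s_j) V(s_j), for state-based rewards."""
    return R[:, None] + gamma * (T @ V)          # shape (|S|, |A|)

def greedy_from_q(Q):
    """Read V and a greedy policy back off the Q table."""
    return Q.max(axis=1), Q.argmax(axis=1)
```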