ECE-517: Reinforcement Learning in Artificial Intelligence. Lecture 6: Optimality Criterion in MDPs. September 8, 2011. Dr. Itamar Arel, College of Engineering, Department of Electrical Engineering and Computer Science, The University of Tennessee, Fall 2011
Outline • Optimal value functions (cont.) • Implementation considerations • Optimality and approximation
Recap on Value Functions • We define the state-value function for policy π as $V^\pi(s) = E_\pi\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \mid s_t = s\}$ • Similarly, we define the action-value function for policy π as $Q^\pi(s,a) = E_\pi\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \mid s_t = s, a_t = a\}$ • The Bellman equation: $V^\pi(s) = \sum_a \pi(s,a) \sum_{s'} P^a_{ss'}\left[R^a_{ss'} + \gamma V^\pi(s')\right]$ • The value function $V^\pi(s)$ is the unique solution to its Bellman equation
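Because $V^\pi$ is the unique fixed point of its Bellman equation, it can be computed by sweeping that equation as an update rule. Below is a minimal sketch (not from the original slides) of iterative policy evaluation; the function name and the dictionary-style model (P[s][a] as a list of (next state, probability) pairs, R[s][a][s'] as the expected reward, policy[s][a] as an action probability) are illustrative assumptions.

def policy_evaluation(states, actions, P, R, policy, gamma=0.9, tol=1e-8):
    # Iterative policy evaluation: repeatedly apply the Bellman backup for
    # the given policy until the values stop changing (converges for gamma < 1).
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_new = sum(
                policy[s][a] * sum(p * (R[s][a][s2] + gamma * V[s2])
                                   for s2, p in P[s][a])
                for a in actions(s)
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V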
Optimal Value Functions • A policy π is defined to be better than or equal to a policy π' if its expected return is greater than or equal to that of π' for all states, i.e. $\pi \ge \pi'$ iff $V^\pi(s) \ge V^{\pi'}(s)$ for all $s$ • There is always at least one policy (a.k.a. an optimal policy) that is better than or equal to all other policies • All optimal policies share the same optimal state-value function, $V^*(s) = \max_\pi V^\pi(s)$ • Optimal policies also share the same optimal action-value function, defined as $Q^*(s,a) = \max_\pi Q^\pi(s,a)$
Optimal Value Functions (cont.) • The latter gives the expected return for taking action a in state s and thereafter following an optimal policy • Thus, we can write $Q^*(s,a) = E\{r_{t+1} + \gamma V^*(s_{t+1}) \mid s_t = s, a_t = a\}$ • Since $V^*(s)$ is the value function for a policy, it must satisfy the Bellman equation; because it is optimal, it can be written without reference to any particular policy: $V^*(s) = \max_a \sum_{s'} P^a_{ss'}\left[R^a_{ss'} + \gamma V^*(s')\right]$ • This is called the Bellman optimality equation • Intuitively, the Bellman optimality equation expresses the fact that the value of a state under an optimal policy must equal the expected return for the best action from that state
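When the dynamics are known, one standard way to solve the Bellman optimality equation numerically is value iteration: turn the equation into an update rule and sweep until convergence. The sketch below is illustrative, not part of the original lecture, and reuses the assumed dictionary-style model from the earlier sketch.

def value_iteration(states, actions, P, R, gamma=0.9, tol=1e-8):
    # Repeatedly apply the Bellman optimality backup:
    #   V(s) <- max_a  sum_s' P(s'|s,a) [ R(s,a,s') + gamma * V(s') ]
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_new = max(
                sum(p * (R[s][a][s2] + gamma * V[s2]) for s2, p in P[s][a])
                for a in actions(s)
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V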
Optimal Value Functions (cont.) • The Bellman optimality equation for Q* is $Q^*(s,a) = \sum_{s'} P^a_{ss'}\left[R^a_{ss'} + \gamma \max_{a'} Q^*(s',a')\right]$ • In the corresponding backup diagrams, arcs are added at the agent's choice points to represent that the maximum over that choice is taken rather than the expected value (given some policy)
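The same fixed-point idea applies to Q*; note that the max over the next action a' sits inside the expectation over next states. Again a hedged sketch with illustrative names, mirroring the model format assumed above.

def q_value_iteration(states, actions, P, R, gamma=0.9, tol=1e-8):
    # Backup for Q*:  Q(s,a) <- E[ r + gamma * max_a' Q(s',a') ]
    Q = {s: {a: 0.0 for a in actions(s)} for s in states}
    while True:
        delta = 0.0
        for s in states:
            for a in actions(s):
                q_new = sum(p * (R[s][a][s2] + gamma * max(Q[s2].values()))
                            for s2, p in P[s][a])
                delta = max(delta, abs(q_new - Q[s][a]))
                Q[s][a] = q_new
        if delta < tol:
            return Q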
Optimal Value Functions (cont.) • For finite MDPs, the Bellman optimality equation has a unique solution independent of the policy • The Bellman optimality equation is actually a system of equations, one for each state • N equations (one for each state) • N variables – V*(s) • This assumes you know the dynamics of the environment • Once one has V*(s), it is relatively easy to determine an optimal policy … • For each state there will be one or more actions for which the maximum is obtained in the Bellman optimality equation • Any policy that assigns nonzero probability only to these actions is an optimal policy • This translates to a one-step search, i.e. greedy decisions will be optimal
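The one-step search described above can be made concrete: given V* and the model, any action achieving the max in the Bellman optimality equation is optimal. A minimal sketch, under the same assumed model format as the earlier sketches:

def greedy_policy_from_v(states, actions, P, R, V, gamma=0.9):
    # Greedy (hence optimal) policy: pick the action maximizing the
    # one-step lookahead through the known dynamics.
    policy = {}
    for s in states:
        policy[s] = max(actions(s),
                        key=lambda a: sum(p * (R[s][a][s2] + gamma * V[s2])
                                          for s2, p in P[s][a]))
    return policy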
Optimal Value Functions (cont.) • With Q*, the agent does not even have to do a one-step-ahead search • For any state s, the agent can simply find any action that maximizes Q*(s,a) • The action-value function effectively embeds the results of all one-step-ahead searches • It provides the optimal expected long-term return as a value that is locally and immediately available for each state-action pair • Agent does not need to know anything about the dynamics of the environment • Q: What are the implementation tradeoffs here?
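With Q*, the lookahead disappears entirely and action selection reduces to a table lookup. A sketch, assuming Q is stored as a dict of dicts (Q[s][a] -> value):

def greedy_action_from_q(Q, s):
    # No model needed: return the action with the largest Q*(s,a) in state s.
    return max(Q[s], key=Q[s].get)

The usual tradeoff: a Q* table has |S| x |A| entries rather than |S|, but it removes the need to store or query the transition model at decision time.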
Implementation Considerations • Computational Complexity • How complex is it to evaluate the state-value and action-value functions? • In software • In hardware • Data flow constraints • Which part of the data needs to be globally vs. locally available? • Impact of memory bandwidth limitations
Recycling Robot revisited • A transition graph is a useful way to summarize the dynamics of a finite MDP • State node for each possible state • Action node for each possible state-action pair
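As an illustration, the robot's transition graph could be encoded in the dictionary format assumed in the earlier sketches. Following Sutton & Barto's version of this example, alpha and beta are the probabilities that the battery stays high/low while searching; the numeric values below are placeholders, not course-specified parameters.

alpha, beta = 0.8, 0.6            # placeholder probabilities
r_search, r_wait = 2.0, 1.0       # placeholder rewards (r_search > r_wait)

STATES = ['high', 'low']

def robot_actions(s):
    # Recharging only makes sense when the battery is low
    return ['search', 'wait'] if s == 'high' else ['search', 'wait', 'recharge']

P = {
    'high': {'search':   [('high', alpha), ('low', 1 - alpha)],
             'wait':     [('high', 1.0)]},
    'low':  {'search':   [('low', beta), ('high', 1 - beta)],   # depleted w.p. 1 - beta
             'wait':     [('low', 1.0)],
             'recharge': [('high', 1.0)]},
}

R = {
    'high': {'search':   {'high': r_search, 'low': r_search},
             'wait':     {'high': r_wait}},
    'low':  {'search':   {'low': r_search, 'high': -3.0},       # -3 when the robot must be rescued
             'wait':     {'low': r_wait},
             'recharge': {'high': 0.0}},
}

With this model, the value_iteration sketch above would recover V*(high) and V*(low) directly, e.g. value_iteration(STATES, robot_actions, P, R, gamma=0.9).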
Bellman Optimality Equations for the Recycling Robot • To make things more compact, we abbreviate the states high and low, and the actions search, wait, and recharge respectively by h, l, s, w, and re
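The original slide's equations are not reproduced in this text; for reference, a reconstruction in Sutton & Barto's notation (treat it as a sketch), where $\mathcal{R}^{\mathtt{s}}$ and $\mathcal{R}^{\mathtt{w}}$ denote the expected rewards for searching and waiting:

\begin{align*}
V^{*}(\mathtt{h}) &= \max\Big\{\, \mathcal{R}^{\mathtt{s}} + \gamma\big[\alpha V^{*}(\mathtt{h}) + (1-\alpha)V^{*}(\mathtt{l})\big],\;\; \mathcal{R}^{\mathtt{w}} + \gamma V^{*}(\mathtt{h}) \,\Big\} \\
V^{*}(\mathtt{l}) &= \max\Big\{\, \beta\mathcal{R}^{\mathtt{s}} - 3(1-\beta) + \gamma\big[(1-\beta)V^{*}(\mathtt{h}) + \beta V^{*}(\mathtt{l})\big],\;\; \mathcal{R}^{\mathtt{w}} + \gamma V^{*}(\mathtt{l}),\;\; \gamma V^{*}(\mathtt{h}) \,\Big\}
\end{align*}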
Optimality and Approximation • Clearly, an agent that learns an optimal policy has done very well, but in practice this rarely happens • Computing an optimal policy usually involves a heavy computational load • Typically, agents settle for approximations to the optimal policy • A critical aspect of the problem facing the agent is always the computational resources available to it • In particular, the amount of computation it can perform in a single time step • Practical considerations are thus: • Computational complexity • Available memory (tabular methods only apply to small state sets) • Communication overhead (for distributed implementations) • Hardware vs. software
Are approximations good or bad? • RL typically relies on approximation mechanisms (see later) • This could be an opportunity • Efficient "feature-extraction" types of approximation may actually reduce "noise" • They make it practical for us to address large-scale problems • In general, making "bad" decisions in RL results in learning opportunities (online) • The online nature of RL encourages learning more effectively from events that occur frequently • This is supported in nature • Capturing regularities is a key property of RL