An overview of tabular solution methods in reinforcement learning, including multi-armed bandits, finite Markov decision processes, dynamic programming, and more. Presented by Nicholas Roy.
Reinforcement Learning | Part I: Tabular Solution Methods
Mini-Bootcamp
Richard S. Sutton & Andrew G. Barto, 1st ed. (1998), 2nd ed. (2018)
Presented by Nicholas Roy
Pillow Lab Meeting, June 27, 2019
RL of the tabular variety
• What is special about RL?
  • “The most important feature distinguishing reinforcement learning from other types of learning is that it uses training information that evaluates the actions taken rather than instructs by giving correct actions. This is what creates the need for active exploration, for an explicit search for good behavior. Purely evaluative feedback indicates how good the action taken was, but not whether it was the best or the worst action possible.”
• What is the point of Part I?
  • “We describe almost all the core ideas of reinforcement learning algorithms in their simplest forms: that in which the state and action spaces are small enough for the approximate value functions to be represented as arrays, or tables. In this case, the methods can often find exact solutions…”
Pillow Lab Meeting, 06/27/19 Nicholas Roy Reinforcement Learning Mini-Bootcamp
Part I: Tabular Solution Methods
• Ch2: Multi-armed Bandits
• Ch3: Finite Markov Decision Processes
• Ch4: Dynamic Programming
• Ch5: Monte Carlo Methods
• Ch6: Temporal-Difference Learning
• Ch7: n-step Bootstrapping
• Ch8: Planning and Learning with Tabular Methods
Let’s get through the basics…
• Agent/Environment
• States
• Actions
• Rewards
• Markov
• MDP
• Dynamics p(s’,r|s,a)
• Returns
• Discount factors
• Episodic/Continuing tasks
• Policies
• State-/Action-value function
• Bellman equation
• Optimal policies
Agent vs. Environment
• Agent: the learner and decision maker, which interacts with…
• Environment: everything else
• In a finite Markov Decision Process (MDP), the sets of states S, actions A, and rewards R each have a finite number of elements
MDP Dynamics
• Dynamics are defined completely by $p(s', r \mid s, a)$
• Dynamics have the Markov property: they depend only on the current (s,a)
• Can collapse this 4D table to get other functions of interest:
  • state-transitions: $p(s' \mid s, a) = \sum_{r} p(s', r \mid s, a)$
  • expected reward: $r(s, a) = \sum_{r} r \sum_{s'} p(s', r \mid s, a)$
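As a minimal sketch of collapsing the dynamics table, assume the dynamics are stored as a dictionary keyed by (s, a) whose values map (s', r) to a probability. The two-state MDP and the helper names below are hypothetical and only illustrate the marginalization.

```python
from collections import defaultdict

# Hypothetical two-state MDP: p[(s, a)] maps (s_next, r) -> probability.
p = {
    ("s0", "go"): {("s1", 1.0): 0.8, ("s0", 0.0): 0.2},
    ("s1", "go"): {("s0", 0.0): 1.0},
}


def state_transition(p, s, a):
    """p(s'|s,a): sum the joint dynamics over rewards r."""
    out = defaultdict(float)
    for (s_next, _r), prob in p[(s, a)].items():
        out[s_next] += prob
    return dict(out)


def expected_reward(p, s, a):
    """r(s,a): sum r * p(s',r|s,a) over all (s', r) pairs."""
    return sum(r * prob for (_s_next, r), prob in p[(s, a)].items())
```

Both helpers read only the joint table, which is the sense in which p defines the dynamics completely.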
Rewards and Returns
• Reward hypothesis: “…goals and purposes can be well thought of as the maximization of the expected value of the cumulative sum of a received scalar signal (called reward).”
  • Reward is a way of communicating what you want to achieve, not how
• Return: $G_t = R_{t+1} + R_{t+2} + \dots + R_T$
• Discounted Return: $G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$
• With a discount ($\gamma < 1$) and bounded rewards, even infinitely many time steps give a finite return
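A short sketch of the discounted return, assuming the rewards of one episode are given as a list with `rewards[k]` standing for R_{k+1}. The function name is illustrative, not from the slides.

```python
def discounted_return(rewards, gamma):
    """Compute G_0 = sum_k gamma**k * rewards[k] by working backwards."""
    g = 0.0
    for r in reversed(rewards):
        # Recursive form of the return: G_t = R_{t+1} + gamma * G_{t+1}
        g = r + gamma * g
    return g
```

With a constant reward of 1 and gamma = 0.9, the return approaches 1 / (1 - 0.9) = 10 as the episode grows, which illustrates why discounting keeps infinite-horizon returns finite.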
Policies & Value Functions
• Policy: a mapping from states to probabilities of selecting each possible action, $\pi(a \mid s)$
• State-value function: for a given policy $\pi$ and state s, $v_\pi(s)$ is defined as the expected return G when starting in s and following $\pi$ thereafter
• Action-value function: for a given policy $\pi$, state s, and action a, $q_\pi(s,a)$ is the expected return G when starting in s, doing a, then following $\pi$
• The existence and uniqueness of $v_\pi$ and $q_\pi$ are guaranteed as long as either $\gamma < 1$ or eventual termination is guaranteed from all states under the policy $\pi$.
Bellman Property
• Bellman Property: a recursive relationship, satisfied by the unique functions $v_\pi$ and $q_\pi$, between the value of a state and the value of its successor states
• Analogous for $q_\pi(s,a)$; can also easily convert between $v_\pi$ and $q_\pi$
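The equations for this slide are not present in the transcript. In Sutton & Barto's notation, the Bellman equations and the conversions between the two value functions are usually written as below; this is a reconstruction from the book's standard form, not a copy of the slide.

```latex
\begin{align*}
v_\pi(s)   &= \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma\, v_\pi(s')\bigr] \\
q_\pi(s,a) &= \sum_{s', r} p(s', r \mid s, a)\,\Bigl[r + \gamma \sum_{a'} \pi(a' \mid s')\, q_\pi(s', a')\Bigr] \\
v_\pi(s)   &= \sum_{a} \pi(a \mid s)\, q_\pi(s, a) \\
q_\pi(s,a) &= \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma\, v_\pi(s')\bigr]
\end{align*}
```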
Optimal Policies
• There is always at least one policy that is better than or equal to all other policies. This is an optimal policy. Although there may be more than one, we denote all the optimal policies by $\pi_*$
• They share the same state- and action-value functions, $v_*$ and $q_*$
• Bellman optimality equations: the recursive relations satisfied by $v_*$ and $q_*$
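For reference, the Bellman optimality equations in Sutton & Barto's notation take the following standard form (again a reconstruction, since the slide's equations are not in the transcript). The expectation over the policy is replaced by a maximum over actions.

```latex
\begin{align*}
v_*(s)   &= \max_{a} \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma\, v_*(s')\bigr] \\
q_*(s,a) &= \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma \max_{a'} q_*(s', a')\bigr]
\end{align*}
```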
Dynamic Programming
• DP algorithms can be used to compute optimal policies if given the complete dynamics, p(s’,r|s,a), of an MDP
• A strong assumption and computationally expensive, but provides the theoretical best case that other algorithms attempt to achieve
• Chapter introduces Policy Evaluation, then Policy Iteration:
  1. Initialize an arbitrary value function v and a random policy $\pi$
  2. Use the Bellman update to move v toward $v_\pi$ until convergence
  3. Update $\pi'$ to be greedy w.r.t. $v_\pi$
  4. Repeat from (2.) until $v_{\pi'} = v_\pi$ (convergence), implying that $v_\pi = v_*$
• $\pi_*$ is just the greedy policy w.r.t. $v_*$
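The four steps above can be sketched as follows. This is a minimal illustration on a hypothetical two-state MDP with deterministic transitions; the table `P`, the constants, and the function names are assumptions made for the example. It starts from an arbitrary deterministic policy rather than a random one.

```python
GAMMA = 0.9
STATES = [0, 1]
ACTIONS = ["stay", "move"]

# Hypothetical dynamics: P[(s, a)] is a list of (prob, next_state, reward).
P = {
    (0, "stay"): [(1.0, 0, 0.0)],
    (0, "move"): [(1.0, 1, 1.0)],
    (1, "stay"): [(1.0, 1, 2.0)],
    (1, "move"): [(1.0, 0, 0.0)],
}


def q_value(v, s, a):
    """One-step Bellman backup for taking action a in state s."""
    return sum(prob * (r + GAMMA * v[s2]) for prob, s2, r in P[(s, a)])


def evaluate(policy, v, tol=1e-10):
    """Step 2: sweep Bellman updates until v is close to v_pi."""
    while True:
        delta = 0.0
        for s in STATES:
            new = q_value(v, s, policy[s])
            delta = max(delta, abs(new - v[s]))
            v[s] = new
        if delta < tol:
            return v


def policy_iteration():
    v = {s: 0.0 for s in STATES}              # step 1: arbitrary v
    policy = {s: "stay" for s in STATES}      # step 1: arbitrary policy
    while True:
        v = evaluate(policy, v)               # step 2: policy evaluation
        improved = {s: max(ACTIONS, key=lambda a: q_value(v, s, a))
                    for s in STATES}          # step 3: greedy improvement
        if improved == policy:                # step 4: stop when stable
            return policy, v
        policy = improved
```

In this toy MDP the stable policy is to move out of state 0 and stay in state 1, with values close to 19 and 20.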
Generalized Policy Iteration (GPI)
We can actually skip the strict alternation between full policy evaluation and policy improvement, and just update the policy to be greedy in real time as the value estimates change…
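One concrete instance of this idea is value iteration, which folds the greedy improvement into every backup instead of waiting for evaluation to converge. The sketch below assumes a hypothetical two-state MDP; the table `P` and all names are illustrative, and value iteration is my choice of example rather than something the slide names.

```python
GAMMA = 0.9
ACTIONS = ["stay", "move"]

# Hypothetical dynamics: P[(s, a)] is a list of (prob, next_state, reward).
P = {
    (0, "stay"): [(1.0, 0, 0.0)],
    (0, "move"): [(1.0, 1, 1.0)],
    (1, "stay"): [(1.0, 1, 2.0)],
    (1, "move"): [(1.0, 0, 0.0)],
}


def backup(v, s, a):
    """Expected one-step return of action a in state s under estimate v."""
    return sum(prob * (r + GAMMA * v[s2]) for prob, s2, r in P[(s, a)])


def value_iteration(tol=1e-10):
    v = {0: 0.0, 1: 0.0}
    while True:
        delta = 0.0
        for s in v:
            # Evaluation and greedy improvement in a single max-backup.
            best = max(backup(v, s, a) for a in ACTIONS)
            delta = max(delta, abs(best - v[s]))
            v[s] = best
        if delta < tol:
            break
    greedy = {s: max(ACTIONS, key=lambda a: backup(v, s, a)) for s in v}
    return greedy, v
```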
A Quick Example…
DP Summary
• DP suffers from the “curse of dimensionality”
  • (Coined by Bellman, as is “dynamic programming”!)
  • But exponentially better than direct search
  • Modern computers can handle millions of states, and DP can run asynchronously
• DP is essentially just Bellman equations turned into updates
• Generalized policy iteration (GPI) methods are proven to converge for DP
• Bootstrapping: DP bootstraps, that is, it updates estimates of values using other estimated values
  • Unlike the next set of methods…
Motivation for Monte Carlo
• What is $v_\pi$? The expected return G from each state under $\pi$
• So, why not just learn $v_\pi$ by averaging returns G?
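A minimal sketch of "averaging returns", here in its first-visit form. It assumes each episode is a list of (state, reward) pairs, where the reward is the one received on leaving that state. The episode format and function name are assumptions for illustration.

```python
from collections import defaultdict


def first_visit_mc(episodes, gamma=1.0):
    """Estimate v_pi(s) as the mean return following the first visit to s."""
    returns = defaultdict(list)
    for episode in episodes:
        g = 0.0
        first_return = {}
        # Walk backwards so g is the return that follows each time step.
        for state, reward in reversed(episode):
            g = reward + gamma * g
            # Earlier visits are processed later, so they overwrite later ones.
            first_return[state] = g
        for state, g_first in first_return.items():
            returns[state].append(g_first)
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}
```

No dynamics model appears anywhere in the function: the estimate depends only on sampled episodes generated by following the policy.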
The difference between Monte Carlo and DP
• MC operates on sample experience, not on the full dynamics
• DP : computing $v_\pi$ :: MC : learning $v_\pi$
• MC does not bootstrap; it estimates $v_\pi$ directly from returns G
Advantages of MC over DP:
• Can be used to learn optimal behavior directly from interaction with the environment, with no model of the environment’s dynamics
• If there is a model, can learn from simulation (ex: Blackjack)
• Easy and efficient to focus Monte Carlo methods on a small subset of states
• No bootstrapping means MC is less harmed by violations of the Markov property
Problems of Exploration
• The problem is now non-stationary: the return after taking an action in one state depends on the actions taken in later states in the same episode
• If $\pi$ is a deterministic policy, then in following $\pi$ one will observe returns for only one of the actions from each state. With no returns, the Monte Carlo estimates of the other actions will not improve with experience.
• Must assure continual exploration for policy evaluation to work
• Solutions:
  • Exploring starts (every state-action pair has a nonzero chance of starting an episode)
  • On-policy: epsilon-greedy (choose a random action with probability epsilon)
  • Off-policy: importance sampling (use a distinct policy b to explore while improving $\pi$)
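The epsilon-greedy option can be sketched in a few lines. This assumes action values for the current state are held in a list indexed by action. The function name and the `rng` parameter are illustrative.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a uniformly random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

Because the random branch can also land on the greedy action, every action keeps a selection probability of at least epsilon divided by the number of actions, which is what guarantees continual exploration.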
On-policy vs. Off-policy for Exploration
• In on-policy methods, the agent commits to always exploring and tries to find the best policy that still explores.
• In off-policy methods, the agent also explores, but learns a deterministic optimal policy (usually greedy) that may be unrelated to the policy followed (b, the behavioral policy).
• Off-policy prediction learning methods are based on some form of importance sampling, that is, on weighting returns by the ratio of the probabilities of taking the observed actions under the two policies, thereby transforming their expectations from the behavior policy to the target policy.
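As a hedged sketch of that weighting, the code below computes the importance-sampling ratio for one trajectory and the ordinary (unweighted-average) importance-sampling estimate. Policies are assumed to be nested dictionaries of action probabilities, and all names are hypothetical.

```python
def importance_ratio(trajectory, target, behavior):
    """Product over the trajectory of pi(a|s) / b(a|s)."""
    rho = 1.0
    for state, action in trajectory:
        rho *= target[state][action] / behavior[state][action]
    return rho


def ordinary_is_estimate(samples):
    """Average of rho * G over episodes; samples is a list of (rho, G) pairs."""
    return sum(rho * g for rho, g in samples) / len(samples)
```

The behavior policy must give nonzero probability to every action the target policy might take; otherwise the ratio is undefined for trajectories the target policy could produce.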
A Comparison of Updates
Temporal-Difference Learning
• “If one had to identify one idea as central and novel to reinforcement learning, it would undoubtedly be temporal-difference (TD) learning.”
• Like MC methods, TD methods can learn directly from raw experience without a model of the environment’s dynamics.
• Like DP, TD methods update estimates based in part on other learned estimates, without waiting for a final outcome (they bootstrap).
Advantages of TD methods:
• can be applied online
• with a minimal amount of computation
• using experience generated from interaction with an environment
• expressed nearly completely by single equations, implemented with small computer programs.
TD Update & Error
• TD Update: $V(S_t) \leftarrow V(S_t) + \alpha \left[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right]$
• TD Error: $\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$
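A small sketch of the tabular TD(0) update, assuming state values live in a dictionary. The function name, the `terminal` flag, and the default step size are illustrative choices.

```python
def td0_update(v, s, r, s_next, alpha=0.1, gamma=1.0, terminal=False):
    """Apply one TD(0) update to v[s] in place and return the TD error."""
    # Bootstrapped target: reward plus discounted estimate of the next state.
    target = r + (0.0 if terminal else gamma * v[s_next])
    delta = target - v[s]
    v[s] += alpha * delta
    return delta
```

The update uses only one observed transition, so it can be applied online after every step rather than at the end of an episode.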
Example: TD(0) vs. MC
Sarsa: on-policy TD control
Q-learning: off-policy TD control
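The algorithm boxes for Sarsa and Q-learning are not in the transcript. As a sketch of how the two standard updates differ, assuming action values are stored as nested dictionaries `q[state][action]` (names are illustrative):

```python
def sarsa_update(q, s, a, r, s2, a2, alpha, gamma):
    """On-policy: bootstrap from the action a2 actually chosen in s2."""
    q[s][a] += alpha * (r + gamma * q[s2][a2] - q[s][a])


def q_learning_update(q, s, a, r, s2, alpha, gamma):
    """Off-policy: bootstrap from the greedy (maximal) action value in s2."""
    q[s][a] += alpha * (r + gamma * max(q[s2].values()) - q[s][a])
```

The only difference is the bootstrap term: Sarsa evaluates the policy it is following, exploration included, while Q-learning evaluates the greedy policy regardless of which action is taken next.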
Case Study: Cliff Walking
n-step Methods
• Specifically, n-step TD methods
• Bridge the gap between one-step TD(0) and $\infty$-step Monte Carlo
• With TD(0), the same time step determines both how often the action can be changed and the time interval for bootstrapping
  • want to update action values very fast to take into account any changes
  • but bootstrapping works best if it is over a length of time in which a significant and recognizable state change has occurred.
• Will be superseded by Ch12 Eligibility Traces, the continuous version
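To make the bridge concrete, here is a sketch of the n-step return for a recorded episode. It assumes `rewards[k]` holds R_{k+1} and `values[k]` holds the current estimate V(S_k). The indexing convention and function name are assumptions for this example.

```python
def n_step_return(rewards, values, t, n, gamma):
    """n-step return from time t: n discounted rewards, then bootstrap."""
    T = len(rewards)              # episode terminates at time T
    end = min(t + n, T)
    g = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
    if end < T:
        # Not yet at termination: bootstrap from the estimate of S_{t+n}.
        g += gamma ** n * values[end]
    return g
```

With n = 1 this is the TD(0) target, and with n at least as large as the remaining episode length it is the full Monte Carlo return.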
Model-free vs. Model-based
• Planning methods (model-based): DP
• Learning methods (model-free): MC, TD
• Both method types:
  • look ahead to future events,
  • compute a backed-up value,
  • and use it as an update target for an approximate value function
• DP: $v(s) \leftarrow \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,[r + \gamma v(s')]$
• MC: $V(S_t) \leftarrow V(S_t) + \alpha\,[G_t - V(S_t)]$
• TD: $V(S_t) \leftarrow V(S_t) + \alpha\,[R_{t+1} + \gamma V(S_{t+1}) - V(S_t)]$
• Now seek to unify model-free and model-based methods
Dyna
From experience you can:
(1) improve your value function & policy (direct RL)
(2) improve your model (model-learning, or indirect RL)
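The two uses of experience can be sketched with a small tabular Dyna-Q loop. This is an illustration under stated assumptions, not the slide's algorithm box: the environment is a hypothetical deterministic four-state corridor, and `dyna_q`, `corridor`, and every parameter value are made up for the example.

```python
import random
from collections import defaultdict


def dyna_q(env_step, start, actions, episodes=50, planning_steps=10,
           alpha=0.1, gamma=0.95, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q = defaultdict(float)   # q[(state, action)]
    model = {}               # model[(state, action)] = (reward, next_state, done)

    def greedy(s):
        best = max(q[(s, a)] for a in actions)
        return rng.choice([a for a in actions if q[(s, a)] == best])

    def update(s, a, r, s2, done):
        # One-step Q-learning update, used for real and simulated experience.
        target = r if done else r + gamma * max(q[(s2, b)] for b in actions)
        q[(s, a)] += alpha * (target - q[(s, a)])

    for _episode in range(episodes):
        s, done = start, False
        while not done:
            a = rng.choice(actions) if rng.random() < epsilon else greedy(s)
            r, s2, done = env_step(s, a)
            update(s, a, r, s2, done)        # (1) direct RL
            model[(s, a)] = (r, s2, done)    # (2) model learning
            for _ in range(planning_steps):  # planning with the learned model
                ps, pa = rng.choice(list(model))
                pr, ps2, pdone = model[(ps, pa)]
                update(ps, pa, pr, ps2, pdone)
            s = s2
    return q


def corridor(s, a):
    """Hypothetical 4-state corridor: reward 1 on reaching state 3."""
    s2 = max(0, min(3, s + (1 if a == "right" else -1)))
    return (1.0 if s2 == 3 else 0.0), s2, s2 == 3
```

Calling `dyna_q(corridor, 0, ["left", "right"])` returns action values in which moving right is preferred in every non-terminal state. Setting `planning_steps=0` reduces the loop to plain one-step Q-learning.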
A Final Overview
3 key ideas in common:
• They all seek to estimate value functions
• They all operate by backing up values along actual or possible state trajectories
• They all follow the general strategy of generalized policy iteration (GPI), meaning that they maintain an approximate value function and approximate policy, and continually try to improve each on the basis of the other.
Figure labels: Ch7: n-step Methods; Ch12: Eligibility Traces; + 3rd dimension: On- vs. Off-policy
Other method dimensions to consider…
The Rest of the Book
• Part I: Tabular Solution Methods
• Part II: Approximate Solution Methods
  • Ch9: On-policy Prediction with Approximation
  • Ch10: On-policy Control with Approximation
  • Ch11: Off-policy Methods with Approximation
  • Ch12: Eligibility Traces
  • Ch13: Policy Gradient Methods
• Part III: Looking Deeper
  • Neuroscience, Psychology, Applications and Case Studies, Frontiers