ECE 517: Reinforcement Learning in Artificial Intelligence Lecture 14: Planning and Learning October 27, 2015 Dr. Itamar Arel College of Engineering Department of Electrical Engineering and Computer Science The University of Tennessee Fall 2015
Final projects - logistics • Projects can be done in groups of up to 3 students • Details on projects will be posted soon • Students are encouraged to propose a topic • Please email me your top three choices for a project along with a preferred date for your presentation • Presentation dates: • Nov. 17, 19, 24 and Dec. 1 + additional time slot (TBD) • Format: 20 min presentation + 5 min Q&A • ~5 min for background and motivation • ~15 min for description of your work, results, and conclusions • Written report due: Monday, Dec. 7 • Format similar to project report
Final projects – sample topics DQN – Playing Atari games using RL Tetris player using RL (and NN) Curiosity-based TD learning* Reinforcement Learning of Local Shape in the Game of Go AIBO learning to walk Study of value function definitions for TD learning Imitation learning in RL
Outline • Introduction • Use of environment models • Integration of planning and learning methods
Introduction • Earlier we discussed Monte Carlo and temporal-difference methods as distinct alternatives • We then showed how they can be seamlessly integrated using eligibility traces, as in TD(λ) • Planning methods: e.g. Dynamic Programming and heuristic search • Rely on knowledge of a model • Model – any information that helps the agent predict the way the environment will behave • Learning methods: Monte Carlo and Temporal Difference Learning • Do not require a model • Our goal: explore the extent to which the two kinds of methods can be intermixed
Models • Model: anything the agent can use to predict how the environment will respond to its actions • Distribution models: provide a description of all possibilities (of next states and rewards) and their probabilities • e.g. Dynamic Programming • Example – the sum of a dozen dice: produce all possible sums and their probabilities of occurring • Sample models: produce just one sample experience • In our example – produce individual sums drawn according to this probability distribution • Both types of models can be used to mimic the environment and produce simulated experience • Often sample models are much easier to come by
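To make the distinction concrete, here is a minimal Python sketch of the dice example (not from the slides; the function names are illustrative): the distribution model enumerates every possible sum of a dozen dice with its probability, while the sample model just draws one sum.

    import random

    def distribution_model(n_dice=12):
        """Distribution model: {sum: probability} for every possible sum of n_dice dice."""
        dist = {0: 1.0}
        for _ in range(n_dice):                      # convolve in one die at a time
            new = {}
            for total, p in dist.items():
                for face in range(1, 7):
                    new[total + face] = new.get(total + face, 0.0) + p / 6.0
            dist = new
        return dist

    def sample_model(n_dice=12):
        """Sample model: a single simulated sum drawn from the same distribution."""
        return sum(random.randint(1, 6) for _ in range(n_dice))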
Planning • Planning: any computational process that uses a model to create or improve a policy • Planning in AI: • State-space planning (such as in RL) – a search through the state space for a policy • Plan-space planning – a search through the space of plans, e.g. evolutionary methods and partial-order planners • We take the following (unusual) view: • All state-space planning methods involve computing value functions, either explicitly or implicitly • They all apply backups to simulated experience
Planning (cont.) • Classical DP methods are state-space planning methods • Heuristic search methods are state-space planning methods • Learning methods require only experience as input, and in many cases they can be applied to simulated experience just as well as to real experience • Example: a planning method based on Q-learning: Random-Sample One-Step Tabular Q-Planning
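A rough Python sketch of random-sample one-step tabular Q-planning (the sample model model(s, a) -> (reward, next_state) and the finite states/actions lists are assumptions, not given in the slides):

    import random
    from collections import defaultdict

    def q_planning(model, states, actions, n_updates=10000, alpha=0.1, gamma=0.95):
        Q = defaultdict(float)                       # Q[(s, a)], implicitly 0.0
        for _ in range(n_updates):
            s = random.choice(states)                # 1. pick a state and action at random
            a = random.choice(actions)
            r, s_next = model(s, a)                  # 2. ask the sample model for r, s'
            best_next = max(Q[(s_next, b)] for b in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])   # 3. one-step Q-learning backup
        return Q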
Learning, Planning, and Acting • Two uses of real experience: • Model learning: to improve the model • Direct RL: to directly improve the value function and policy • Improving the value function and/or policy via a model is sometimes called indirect RL or model-based RL. Here, we call it planning. • Q: What are the advantages/disadvantages of each?
Direct vs. Indirect RL • Indirect methods: make fuller use of experience: get a better policy with fewer environment interactions • Direct methods: • simpler • not affected by bad models • But they are very closely related and can be usefully combined: planning, acting, model learning, and direct RL can occur simultaneously and in parallel • Q: Which scheme do you think applies to humans?
The Dyna-Q Algorithm (figure: pseudocode interleaving direct RL, model learning (update), and planning via the random-sample one-step tabular Q-planning method)
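A hedged Python sketch of tabular Dyna-Q for a deterministic environment (the env interface with reset() and step(a) -> (next_state, reward, done) and the finite actions list are assumptions, not part of the slides):

    import random
    from collections import defaultdict

    def dyna_q(env, actions, n_episodes=50, n_planning=5,
               alpha=0.1, gamma=0.95, epsilon=0.1):
        Q = defaultdict(float)                       # Q[(s, a)], implicitly 0.0
        model = {}                                   # model[(s, a)] = (r, s'), deterministic

        def epsilon_greedy(s):
            if random.random() < epsilon:
                return random.choice(actions)
            return max(actions, key=lambda a: Q[(s, a)])

        def backup(s, a, r, s2):
            best = max(Q[(s2, b)] for b in actions)
            Q[(s, a)] += alpha * (r + gamma * best - Q[(s, a)])

        for _ in range(n_episodes):
            s, done = env.reset(), False
            while not done:
                a = epsilon_greedy(s)                # (b) choose action in current state
                s2, r, done = env.step(a)            # (c) act in the real environment
                backup(s, a, r, s2)                  # (d) direct RL: one-step Q-learning
                model[(s, a)] = (r, s2)              # (e) model learning (update)
                for _ in range(n_planning):          # (f) planning: n simulated backups
                    ps, pa = random.choice(list(model))
                    pr, ps2 = model[(ps, pa)]
                    backup(ps, pa, pr, ps2)
                s = s2
        return Q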
Dyna-Q on a Simple Maze • Rewards are 0 until the goal is reached, when the reward is 1
Dyna-Q Snapshots: Midway in 2nd Episode • Recall that in a planning context … • Exploration – trying actions that improve the model • Exploitation – Behaving in the optimal way given the current model • Balance between the two is always a key challenge!
Variations on the Dyna-Q agent • (Regular) Dyna-Q • Soft exploration/exploitation with constant rewards • Dyna-Q+ • Encourages exploration of state-action pairs that have not been visited in a long time (in real interaction with the environment) • If n is the number of steps elapsed between two consecutive visits to (s,a), then the reward used in planning is increased by a bonus that grows with n • Dyna-AC • Actor-Critic learning rather than Q-learning
More on Dyna-Q+ • Uses an "exploration bonus": • Keeps track of the time since each state-action pair was last tried for real • An extra reward is added to simulated transitions involving those state-action pairs, growing with how long ago they were tried: the longer unvisited, the more reward for visiting • The agent (indirectly) "plans" how to visit long-unvisited states
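A small sketch of that bonus (the kappa * sqrt(tau) form from Sutton and Barto; the value of kappa and the bookkeeping names are illustrative choices, not from the slides):

    import math

    def dyna_q_plus_reward(r, t, last_visit, s, a, kappa=1e-3):
        """Modeled reward plus an exploration bonus that grows with the time
        since (s, a) was last tried for real."""
        tau = t - last_visit.get((s, a), 0)          # real time steps since last visit to (s, a)
        return r + kappa * math.sqrt(tau)

During real interaction the agent would record last_visit[(s, a)] = t; during planning it would back up dyna_q_plus_reward(...) in place of the modeled reward r.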
When the Model is Wrong: Blocking Maze (cont.) • The maze example was oversimplified • In reality many things could go wrong • The environment could be stochastic • The model can be imperfect (local minima, stochasticity, or lack of convergence) • Partial experience could be misleading • When the model is incorrect, the planning process will compute a suboptimal policy • This is actually a learning opportunity • Discovery and correction of the modeling error
When the Model is Wrong: Blocking Maze (cont.) The changed environment is harder
Shortcut Maze The changed environment is easier
Prioritized Sweeping • In the Dyna agents presented, simulated transitions are started in uniformly chosen state-action pairs • Probably not optimal • Which states or state-action pairs should be generated during planning? • Work backwards from states whose values have just changed: • Maintain a queue of state-action pairs whose values would change a lot if backed up, prioritized by the size of the change • When a new backup occurs, insert its predecessors according to their priorities • Always perform backups from the first pair in the queue • (Moore and Atkeson, 1993; Peng and Williams, 1993)
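A hedged sketch of the prioritized planning loop (the learned deterministic model, the predecessors table mapping a state to the state-action pairs predicted to lead to it, and the tabular Q are assumed to be maintained as in the Dyna-Q sketch above):

    import heapq

    def prioritized_planning(Q, model, predecessors, actions, s, a,
                             n_backups=5, theta=1e-4, alpha=0.1, gamma=0.95):
        """Run up to n_backups prioritized backups after a real transition from (s, a)."""
        def priority(ps, pa):
            r, s2 = model[(ps, pa)]
            best = max(Q[(s2, b)] for b in actions)
            return abs(r + gamma * best - Q[(ps, pa)])   # size of the potential value change

        pqueue = []                                      # max-heap via negated priorities
        if priority(s, a) > theta:
            heapq.heappush(pqueue, (-priority(s, a), (s, a)))

        for _ in range(n_backups):
            if not pqueue:
                break
            _, (ps, pa) = heapq.heappop(pqueue)          # back up the most urgent pair first
            r, s2 = model[(ps, pa)]
            best = max(Q[(s2, b)] for b in actions)
            Q[(ps, pa)] += alpha * (r + gamma * best - Q[(ps, pa)])
            for (qs, qa) in predecessors.get(ps, ()):    # pairs predicted to lead into ps
                if priority(qs, qa) > theta:
                    heapq.heappush(pqueue, (-priority(qs, qa), (qs, qa)))
        return Q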
Prioritized Sweeping vs. Dyna-Q Both use N = 5 backups per environmental interaction
Trajectory Sampling • Trajectory sampling: perform backups along simulated trajectories • This samples from the on-policy distribution • Distribution constructed from experience (visits) • Advantages when function approximation is used • Focusing of computation: can cause vast uninteresting parts of the state space to be (usefully) ignored (figure: initial states, irrelevant states, and the region reachable under optimal control)
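A rough sketch of the idea, in contrast to uniform sweeps: simulated backups follow on-policy (here epsilon-greedy) trajectories from the start states (the sample model model_step(s, a) -> (reward, next_state, done) and the start_states/actions lists are assumptions):

    import random

    def trajectory_sampling(Q, model_step, start_states, actions,
                            n_trajectories=100, max_len=50,
                            alpha=0.1, gamma=0.95, epsilon=0.1):
        for _ in range(n_trajectories):
            s = random.choice(start_states)          # start where real episodes start
            for _ in range(max_len):
                if random.random() < epsilon:        # follow the current epsilon-greedy policy
                    a = random.choice(actions)
                else:
                    a = max(actions, key=lambda b: Q[(s, b)])
                r, s2, done = model_step(s, a)       # simulate one step with the model
                best = max(Q[(s2, b)] for b in actions)
                Q[(s, a)] += alpha * (r + gamma * best - Q[(s, a)])
                if done:
                    break
                s = s2
        return Q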
Summary • Discussed close relationship between planning and learning • Important distinction between distribution models and sample models • Looked at some ways to integrate planning and learning • synergy among planning, acting, model learning • Distribution of backups: focus of the computation • prioritized sweeping • trajectory sampling: backup along trajectories