
Dynamic Programming and Reinforcement Learning in Artificial Intelligence

Explore the transition from Markov Decision Processes to cutting-edge Artificial Intelligence methods like machine learning, neural networks, and reinforcement learning. Discover the application of MDPs in diverse domains such as logistics, gaming, finance, and more, with an emphasis on dynamic programming and the challenges these methods face. Learn about the principle of optimality, dynamic programming, and reinforcement learning algorithms like TD(λ), with examples such as TD-Gammon, and the increasing role of RL in solving MDPs efficiently. Delve into the curse of dimensionality, scalability issues, and the implications for large stochastic problems. Experience the evolution of AI tools and techniques for handling complex decision-making tasks in real time. The transformation from handcrafted solutions to data-driven, computationally intensive methods is powering the next generation of Artificial Intelligence.



Presentation Transcript


  1. From Markov Decision Processes to Artificial Intelligence. Rich Sutton, with thanks to: Andy Barto, Satinder Singh, Doina Precup

  2. The steady march of computing science is changing artificial intelligence • More computation-based, approximate methods • Machine learning, neural networks, genetic algorithms • Machines are taking on more of the work • More data, more computation • Fewer handcrafted solutions, less reliance on human understandability • More search • Exponential methods are still exponential… but compute-intensive methods are increasingly winning • More general problems • Stochastic, non-linear, optimal • Real-time, large

  3. Agent–World: the problem is to predict and control a doubly branching interaction unfolding over time, with a long-term goal. [Diagram: alternating sequence of states and actions]

  4. Sequential, state-action-reward problems are ubiquitous • Walking • Flying a helicopter • Playing tennis • Logistics • Inventory control • Intruder detection • Repair or replace? • Visual search for objects • Playing chess, Go, Poker • Medical tests, treatment • Conversation • User interfaces • Marketing • Queue/server control • Portfolio management • Industrial process control • Pipeline failure prediction • Real-time load balancing

  5. Markov Decision Processes (MDPs) • Discrete time t = 1, 2, 3, … • States s_t ∈ S • Actions a_t ∈ A • Policy π(s, a) = Pr{ a_t = a | s_t = s } • Transition probabilities p(s' | s, a) = Pr{ s_{t+1} = s' | s_t = s, a_t = a } • Rewards r(s, a, s') = E{ r_{t+1} | s_t = s, a_t = a, s_{t+1} = s' }
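
To make the ingredients above concrete, here is a minimal sketch of how a finite MDP can be written down in code. The tiny two-state "low"/"high" example, the reward numbers, and all the names (STATES, ACTIONS, P, pi) are invented for illustration and are not from the slides:

```python
# A minimal sketch of the MDP ingredients listed above, using a made-up
# two-state MDP.  Transition probabilities are stored as
# P[s][a] = list of (probability, next_state, reward) triples.
STATES = ["low", "high"]
ACTIONS = ["wait", "search"]

P = {
    "low": {
        "wait":   [(1.0, "low",  1.0)],
        "search": [(0.6, "high", 4.0), (0.4, "low", -3.0)],
    },
    "high": {
        "wait":   [(1.0, "high", 1.0)],
        "search": [(0.8, "high", 4.0), (0.2, "low",  4.0)],
    },
}

# A (stochastic) policy: pi[s][a] = probability of taking action a in state s.
pi = {s: {a: 1.0 / len(ACTIONS) for a in ACTIONS} for s in STATES}
```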

  6. MDPs Part II: The Objective • “Maximize cumulative reward” • Define the value of being in a state under a policy as V^π(s) = E{ r_{t+1} + γ r_{t+2} + γ² r_{t+3} + … | s_t = s, π }, where delayed rewards are discounted by γ ∈ [0, 1] (there are other possibilities, e.g., undiscounted or average-reward formulations) • This defines a partial ordering over policies, with at least one optimal policy π* satisfying V^{π*}(s) ≥ V^π(s) for all s and all π (needs proving)
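
As a small worked illustration of the objective, the sketch below computes the discounted return G_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + … for a made-up reward sequence; the helper name and the numbers are hypothetical:

```python
# Illustration of the objective: the discounted return of an observed
# reward sequence, G_t = sum_k gamma**k * r_{t+1+k}.
def discounted_return(rewards, gamma=0.9):
    g = 0.0
    for r in reversed(rewards):   # work backwards: G_t = r_{t+1} + gamma * G_{t+1}
        g = r + gamma * g
    return g

print(discounted_return([1.0, 0.0, 0.0, 10.0], gamma=0.9))  # 1 + 0.9**3 * 10 = 8.29
```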

  7. Markov Decision Processes • Extensively studied since 1950s • In Optimal Control • Specializes to Riccati equations for linear systems • And to HJB equations for continuous time systems • Only general, nonlinear, optimal-control framework • In Operations Research • Planning, scheduling, logistics • Sequential design of experiments • Finance, marketing, inventory control, queuing, telecomm • In Artificial Intelligence (last 15 years) • Reinforcement learning, probabilistic planning • Dynamic Programming is the dominant solution method

  8. Outline • Markov decision processes (MDPs) • Dynamic Programming (DP) • The curse of dimensionality • Reinforcement Learning (RL) • TD(λ) algorithm • TD-Gammon example • Acrobot example • RL significantly extends DP methods for solving MDPs • RoboCup example • Conclusion, from the AI point of view • Spy plane example

  9. The Principle of Optimality • Dynamic Programming (DP) requires a decomposition into subproblems • In MDPs this comes from the independence-of-path (Markov) assumption • Values can be written in terms of successor values, e.g., the “Bellman equations”: V^π(s) = Σ_a π(s, a) Σ_{s'} p(s' | s, a) [ r(s, a, s') + γ V^π(s') ] for all s
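
The Bellman equation above translates directly into one sweep of iterative policy evaluation. This is a sketch assuming the illustrative P[s][a] = [(probability, next_state, reward), ...] and pi[s][a] layout from the earlier MDP sketch; none of these names come from the slides:

```python
# One Bellman-equation sweep of iterative policy evaluation for a fixed policy pi.
def evaluation_sweep(V, P, pi, gamma=0.9):
    newV = {}
    for s in P:
        # V(s) = sum_a pi(s,a) * sum_{s'} p(s'|s,a) [ r + gamma * V(s') ]
        newV[s] = sum(
            pi[s][a] * sum(p * (r + gamma * V[sp]) for p, sp, r in P[s][a])
            for a in P[s]
        )
    return newV
```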

  10. Dynamic Programming: sweeping through the states, updating an approximation to the optimal value function. For example, Value Iteration: • Initialize: V(s) ← 0 for all s • Do forever: for all s, V(s) ← max_a Σ_{s'} p(s' | s, a) [ r(s, a, s') + γ V(s') ] • Pick any of the maximizing actions to get π*
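
Below is a sketch of the value-iteration procedure the slide outlines, again assuming the illustrative P[s][a] transition layout used above; a stopping threshold theta stands in for the slide's "do forever":

```python
# A sketch of value iteration for a finite MDP.
def value_iteration(P, gamma=0.9, theta=1e-8):
    V = {s: 0.0 for s in P}                       # Initialize
    while True:                                   # "Do forever" (until convergence)
        delta = 0.0
        for s in P:                               # sweep through the states
            v = max(
                sum(p * (r + gamma * V[sp]) for p, sp, r in P[s][a])
                for a in P[s]
            )
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            break
    # Pick any of the maximizing actions to get an optimal policy pi*
    pi_star = {
        s: max(P[s], key=lambda a: sum(p * (r + gamma * V[sp]) for p, sp, r in P[s][a]))
        for s in P
    }
    return V, pi_star
```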

  11. DP is repeated backups: shallow lookahead searches. [Backup diagram: from state s, an action a leads to successor states s′ and s″, whose values V(s′) and V(s″) are combined to update V(s)]

  12. Dynamic Programming is the dominant solution method for MDPs • Routinely applied to problems with millions of states • Worst case scales polynomially in |S| and |A| • Linear Programming has better worst-case bounds but in practice scales 100s of times worse • On large stochastic problems, only DP is feasible

  13. Perennial Difficulties for DP 1. Large state spaces “The curse of dimensionality” 2. Difficulty calculating the dynamics, e.g., from a simulation 3. Unknown dynamics

  14. Bellman, 1961 The Curse of Dimensionality • The number of states grows exponentially with dimensionality -- the number of state variables • Thus, on large problems, • Can’t complete even one sweep of DP • Can’t enumerate states, need sampling! • Can’t store separate values for each state • Can’t store values in tables, need function approximation!

  15. Reinforcement Learning: using experience in place of dynamics • Let s_0, a_0, r_1, s_1, a_1, r_2, s_2, … be an observed sequence with actions selected by π • For every time step t, V^π(s_t) = E{ r_{t+1} + γ V^π(s_{t+1}) } (“Bellman Equation”), which suggests the DP-like update V(s_t) ← E{ r_{t+1} + γ V(s_{t+1}) } • We don’t know this expected value, but we know the actual r_{t+1} + γ V(s_{t+1}), an unbiased sample of it • In RL, we take a step toward this sample, e.g., half way: V(s_t) ← V(s_t) + α [ r_{t+1} + γ V(s_{t+1}) − V(s_t) ] (“Tabular TD(0)”)
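
A minimal sketch of tabular TD(0) as described above: the agent follows its policy, observes sampled transitions, and steps each V(s_t) toward the sampled target r_{t+1} + γ V(s_{t+1}). The env_step and policy callables are assumed stand-ins for an environment step function and a fixed policy π; they are not from the slides:

```python
from collections import defaultdict

# Tabular TD(0): learn V from experience alone, with no access to transition
# probabilities.  env_step(s, a) is assumed to return (reward, next_state, done).
def tabular_td0(env_step, policy, start_state, episodes=1000, alpha=0.5, gamma=0.9):
    V = defaultdict(float)
    for _ in range(episodes):
        s = start_state
        done = False
        while not done:
            a = policy(s)
            r, s_next, done = env_step(s, a)       # a sample, not an expectation
            target = r + (0.0 if done else gamma * V[s_next])
            V[s] += alpha * (target - V[s])        # step toward the sampled target
            s = s_next
    return V
```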

  16. Temporal-Difference Learning (Sutton, 1988): updating a prediction based on its change (temporal difference) from one moment to the next • Tabular TD(0): V(s_t) ← V(s_t) + α [ r_{t+1} + γ V(s_{t+1}) − V(s_t) ] • Or V is, e.g., a neural network with parameters θ; then use gradient-descent TD(0): θ ← θ + α [ r_{t+1} + γ V(s_{t+1}) − V(s_t) ] ∇_θ V(s_t) • TD(λ), λ > 0, uses differences from later predictions as well, moving each prediction toward the better, later prediction (the temporal difference)
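
The gradient-descent form extends to TD(λ) with eligibility traces. Here is a sketch for the linear case V(s) = θ·φ(s); the feature function phi and the trajectory format are assumptions for illustration, and setting λ = 0 recovers gradient-descent TD(0):

```python
import numpy as np

# Gradient-descent TD(lambda) with a linear value function V(s) = theta . phi(s)
# and accumulating eligibility traces.
def linear_td_lambda(phi, theta, trajectory, alpha=0.1, gamma=1.0, lam=0.9):
    """trajectory: iterable of (s, r, s_next, done) transitions generated under pi."""
    z = np.zeros_like(theta)                 # eligibility trace
    for s, r, s_next, done in trajectory:
        v = theta @ phi(s)
        v_next = 0.0 if done else theta @ phi(s_next)
        delta = r + gamma * v_next - v       # temporal-difference error
        z = gamma * lam * z + phi(s)         # grad of V(s) w.r.t. theta is phi(s)
        theta = theta + alpha * delta * z
        if done:
            z = np.zeros_like(theta)         # reset trace at episode boundaries
    return theta
```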

  17. TD-Gammon (Tesauro, 1992–1995) • Start with a random network • Play millions of games against itself • Learn a value function from this simulated experience • Action selection by 2–3 ply search • This produces arguably the best backgammon player in the world [Network diagram: board position in, output ≈ probability of winning, weights trained from the TD error]

  18. TD-Gammon vs. an Expert-Trained Net (Tesauro, 1992) [Plot: fraction of games won against Gammontool (roughly .45–.70) as a function of the number of hidden units (up to 80), comparing TD-Gammon, TD-Gammon + features, EP (a backprop net trained from expert moves), and EP + features (“Neurogammon”)]

  19. Examples of Reinforcement Learning • Elevator Control (Crites & Barto) • (Probably) world's best down-peak elevator controller • Walking Robot (Benbrahim & Franklin) • Learned critical parameters for bipedal walking • RoboCup Soccer Teams (e.g., Stone & Veloso, Riedmiller et al.) • RL is used in many of the top teams • Inventory Management (Van Roy, Bertsekas, Lee & Tsitsiklis) • 10–15% improvement over industry standard methods • Dynamic Channel Assignment (Singh & Bertsekas; Nie & Haykin) • World's best assigner of radio channels to mobile telephone calls • KnightCap and TDLeaf (Baxter, Tridgell & Weaver) • Improved chess play from intermediate to master in 300 games

  20. Does function approximation beat the curse of dimensionality? • Yes… probably • FA makes dimensionality per se largely irrelevant • With FA, computation seems to scale with the complexity of the solution (crinkliness of the value function) and how hard it is to find it • If you can get FA to work!

  21. FA in DP and RL (1st bit) • Conventional DP works poorly with FA • Empirically [Boyan and Moore, 1995] • Diverges with linear FA [Baird, 1995] • Even for prediction (evaluating a fixed policy) [Baird, 1995] • RL works much better • Empirically [many applications and Sutton, 1996] • TD(l) prediction converges with linear FA [Tsitsiklis & Van Roy, 1997] • TD(l) control converges with linear FA [Perkins & Precup, 2002] • Why? Following actual trajectories in RL ensures that every state is updated at least as often as it is the basis for updating

  22. DP+FA fails, RL+FA works • With DP+FA, more transitions can go into a state than go out of it • With RL+FA, real trajectories always leave a state after entering it

  23. Outline • Markov decision processes (MDPs) • Dynamic Programming (DP) • The curse of dimensionality • Reinforcement Learning (RL) • TD(λ) algorithm • TD-Gammon example • Acrobot example • RL significantly extends DP methods for solving MDPs • RoboCup example • Conclusion, from the AI point of view • Spy plane example

  24. The Mountain Car Problem (Moore, 1990) • SITUATIONS: car's position and velocity • ACTIONS: three thrusts: forward, reverse, none • REWARDS: always −1 until the car reaches the goal • No discounting • A minimum-time-to-goal problem: gravity is stronger than the engine (“gravity wins”), so the car must back up to gain momentum before driving to the goal
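
For reference, here is a sketch of the mountain-car dynamics in the formulation commonly used with this problem (as in Sutton & Barto); the specific constants follow that common formulation and are not stated on the slide:

```python
import math

# Mountain-car dynamics in the commonly used formulation.
# Actions: -1 (reverse thrust), 0 (none), +1 (forward thrust).
def mountain_car_step(position, velocity, action):
    velocity += 0.001 * action - 0.0025 * math.cos(3 * position)
    velocity = max(-0.07, min(0.07, velocity))
    position += velocity
    if position < -1.2:                 # inelastic wall on the left
        position, velocity = -1.2, 0.0
    done = position >= 0.5              # reached the goal at the top of the hill
    reward = -1.0                       # -1 per step until the goal (no discounting)
    return position, velocity, reward, done
```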

  25. Value functions learned while solving the Mountain Car problem (Sutton, 1996) [Figure: learned value functions over the position–velocity state space; value ≈ estimated time to goal (lower is better), with the goal region marked]

  26. Sparse, coarse tile coding (CMAC) (Albus, 1980) [Figure: overlapping tilings laid over the car-position × car-velocity state space]

  27. Tile Coding (CMAC) (Albus, 1980): an example of sparse coarse-coded networks • A fixed, expansive re-representation maps the state into many binary features, followed by a linear last layer • Coarse: large receptive fields • Sparse: few features present at one time
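
A minimal sketch of tile coding over two continuous variables (such as car position and velocity): several slightly offset grids ("tilings"), each contributing exactly one active feature, give the sparse, coarse binary representation described above. The function name and parameter values are illustrative, not from the slides:

```python
# Tile coding over two continuous variables: each of n_tilings offset grids
# contributes exactly one active (1-valued) feature.
def tile_features(x, y, x_range, y_range, n_tilings=8, tiles_per_dim=9):
    active = []
    x_scale = tiles_per_dim / (x_range[1] - x_range[0])
    y_scale = tiles_per_dim / (y_range[1] - y_range[0])
    for t in range(n_tilings):
        offset = t / n_tilings                       # each tiling is shifted slightly
        xi = int((x - x_range[0]) * x_scale + offset)
        yi = int((y - y_range[0]) * y_scale + offset)
        index = (t * (tiles_per_dim + 1) + xi) * (tiles_per_dim + 1) + yi
        active.append(index)                         # one active tile per tiling
    return active   # indices of the 1s in a large, sparse binary feature vector
```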

  28. The Acrobot Problem (Sutton, 1996; e.g., DeJong & Spong, 1994) • A two-link pendulum swinging from a fixed base, with torque applied only at the second joint • Goal: raise the tip above a line (minimum time to goal) • 4 state variables: 2 joint angles (θ1, θ2) and 2 angular velocities • Tile coding with 48 tilings • Reward = −1 per time step

  29. The RoboCup Soccer Competition

  30. 13 Continuous State Variables (for 3 vs 2) (Stone & Sutton, 2001) • 11 distances among the players, the ball, and the center of the field • 2 angles to takers along passing lanes

  31. RoboCup Feature Vectors (Stone & Sutton, 2001) [Diagram: full soccer state s → 13 continuous state variables → sparse, coarse tile coding → huge binary feature vector φ (about 400 1’s and 40,000 0’s) → linear map θ → action values]

  32. Learning Keepaway Results (Stone & Sutton, 2001): 3 vs 2 against handcrafted takers; multiple, independent runs of TD(λ)

  33. Hajime Kimura's RL Robots (no dynamics knowledge) [Photos: a robot before and after learning, backward locomotion, and a new robot trained with the same algorithm]

  34. Assessment re: DP • RL has added some new capabilities to DP methods • Much larger MDPs can be addressed (approximately) • Simulations can be used without explicit probabilities • Dynamics need not be known or modeled • Many new applications are now possible • Process control, logistics, manufacturing, telecomm, finance, scheduling, medicine, marketing… • Theoretical and practical questions remain open

  35. Outline • Markov decision processes (MDPs) • Dynamic Programming (DP) • The curse of dimensionality • Reinforcement Learning (RL) • TD(λ) algorithm • TD-Gammon example • Acrobot example • RL significantly extends DP methods for solving MDPs • RoboCup example • Conclusion, from the AI point of view • Spy plane example

  36. A lesson for AI: The Power of a “Visible” Goal • In MDPs, the goal (reward) is part of the data, part of the agent’s normal operation • The agent can tell for itself how well it is doing • This is very powerful… we should do more of it in AI • Can we give AI tasks visible goals? • Visual object recognition? Better would be active vision • Story understanding? Better would be dialog, e.g., call routing • User interfaces, personal assistants • Robotics… say mapping and navigation, or search • The usual trick is to make them into long-term prediction problems • There must be a way: if you can’t feel it, why care about it?

  37. Assessment re: AI • DP and RL are potentially powerful probabilistic planning methods • But they typically don’t use logic or structured representations • How good are they as an overall model of thought? • Good mix of deliberation and immediate judgments (values) • Good for causality and prediction, not for logic and language • The link to data is appealing… but incomplete • MDP-style knowledge may be learnable, tuneable, verifiable • But only if the “level” of the data is right • Sometimes seems too low-level, too flat

  38. Ongoing and Future Directions • Temporal abstraction [Sutton, Precup, Singh, Parr, others] • Generalize transitions to include macros, “options” • Multiple overlying MDP-like models at different levels • State representation [Littman, Sutton, Singh, Jaeger...] • Eliminate the nasty assumption of observable state • Get really real with data • Work up to higher-level, yet grounded, representations • Neuroscience of reward systems [Dayan, Schultz, Doya] • Dopamine reward system behaves remarkably like TD • Theory and practice of value function approximation [everybody]

  39. Spy Plane Example (Reconnaissance Mission Planning) (Sutton & Ravindran, 2001) • Mission: fly over (observe) the most valuable sites and return to base • Stochastic weather affects observability (cloudy or clear) of sites • Limited fuel • Intractable with classical optimal control methods • Temporal scales: Actions — which direction to fly now; Options — which site to head for • Options compress space and time • Reduce steps from ~600 to ~6 • Reduce states from ~10^11 to ~10^6 [Figure labels: any state (10^6); sites only (6)]

  40. Spy Plane Results (Sutton & Ravindran, 2001) • SMDP planner: assumes options are followed to completion; plans the optimal SMDP solution • SMDP planner with re-evaluation: plans as if options must be followed to completion, but actually takes them for only one step and re-picks a new option on every step • Static re-planner: assumes the weather will not change, plans the optimal tour among clear sites, and re-plans whenever the weather changes • Result: temporal abstraction finds a better approximation than the static planner, with little more computation than the SMDP planner [Bar chart: expected reward per mission under high and low fuel for the three planners]
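
The "SMDP planner with re-evaluation" can be read as a simple control loop: evaluate options as if they ran to completion, but commit to only one primitive step before re-choosing. Below is a sketch under that reading; option_value, first_action, and env_step are hypothetical helpers, not from the slides:

```python
# Sketch of the "re-evaluate on every step" control loop described above.
def replan_every_step(state, options, option_value, first_action, env_step):
    total_reward = 0.0
    done = False
    while not done:
        # Choose the option that looks best if followed to completion...
        best_option = max(options, key=lambda o: option_value(state, o))
        # ...but execute only its first primitive action, then re-plan.
        a = first_action(state, best_option)
        state, reward, done = env_step(state, a)
        total_reward += reward
    return total_reward
```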

  41. Didn’t have time for • Action Selection • Exploration/Exploitation • Action values vs. search • How learning values leads to policy improvements • Different returns, e.g., the undiscounted case • Exactly how FA works, backprop • Exactly how options work • How planning at a high level can affect primitive actions • How states can be abstracted to affordances • And how this directly builds on the option work
