
Dynamic Programming and Reinforcement Learning in Artificial Intelligence

Explore the transition from Markov Decision Processes to cutting-edge Artificial Intelligence methods like machine learning, neural networks, and reinforcement learning. Discover the application of MDPs in diverse domains such as logistics, gaming, finance, and more, with an emphasis on dynamic programming and the challenges these methods face. Learn about the principle of optimality, dynamic programming, and reinforcement learning algorithms like TD(λ), with examples such as TD-Gammon, and the increasing role of RL in solving MDPs efficiently. Delve into the curse of dimensionality, scalability issues, and the implications for large stochastic problems. Experience the evolution of AI tools and techniques for handling complex decision-making tasks in real time. The transformation from handcrafted solutions to data-driven, computationally intensive methods is powering the next generation of Artificial Intelligence.



Presentation Transcript


  1. From Markov Decision Processes to Artificial Intelligence. Rich Sutton, with thanks to: Andy Barto, Satinder Singh, Doina Precup

  2. The steady march of computing science is changing artificial intelligence • More computation-based, approximate methods • Machine learning, neural networks, genetic algorithms • Machines are taking on more of the work • More data, more computation • Fewer handcrafted solutions, less reliance on human understandability • More search • Exponential methods are still exponential… but compute-intensive methods are increasingly winning • More general problems • Stochastic, non-linear, optimal • Real-time, large

  3. Agent–World: the problem is to predict and control a doubly branching interaction unfolding over time, with a long-term goal. [Diagram: alternating sequence of states and actions]

  4. Sequential, state-action-reward problems are ubiquitous • Walking • Flying a helicopter • Playing tennis • Logistics • Inventory control • Intruder detection • Repair or replace? • Visual search for objects • Playing chess, Go, Poker • Medical tests, treatment • Conversation • User interfaces • Marketing • Queue/server control • Portfolio management • Industrial process control • Pipeline failure prediction • Real-time load balancing

  5. Markov Decision Processes (MDPs) • Discrete time t = 1, 2, 3, … • States s_t ∈ S • Actions a_t ∈ A • Policy π(s, a) = Pr{ a_t = a | s_t = s } • Transition probabilities p(s' | s, a) = Pr{ s_{t+1} = s' | s_t = s, a_t = a } • Rewards r(s, a, s') = E{ r_{t+1} | s_t = s, a_t = a, s_{t+1} = s' }
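
To make the ingredients above concrete, here is a minimal sketch of how a finite MDP can be written down in code. The tiny two-state "low"/"high" example, the reward numbers, and all the names (STATES, ACTIONS, P, pi) are invented for illustration and are not from the slides:

```python
# A minimal sketch of the MDP ingredients listed above, using a made-up
# two-state MDP.  Transition probabilities are stored as
# P[s][a] = list of (probability, next_state, reward) triples.
STATES = ["low", "high"]
ACTIONS = ["wait", "search"]

P = {
    "low": {
        "wait":   [(1.0, "low",  1.0)],
        "search": [(0.6, "high", 4.0), (0.4, "low", -3.0)],
    },
    "high": {
        "wait":   [(1.0, "high", 1.0)],
        "search": [(0.8, "high", 4.0), (0.2, "low",  4.0)],
    },
}

# A (stochastic) policy: pi[s][a] = probability of taking action a in state s.
pi = {s: {a: 1.0 / len(ACTIONS) for a in ACTIONS} for s in STATES}
```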

  6. MDPs Part II: The Objective • “Maximize cumulative reward” • Define the value of being in a state under a policy as V^π(s) = E{ r_{t+1} + γ r_{t+2} + γ² r_{t+3} + … | s_t = s, π }, where delayed rewards are discounted by γ ∈ [0, 1] (there are other possibilities, e.g., undiscounted or average-reward formulations) • This defines a partial ordering over policies, with at least one optimal policy π* satisfying V^{π*}(s) ≥ V^π(s) for all s and all π (needs proving)
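
As a small worked illustration of the objective, the sketch below computes the discounted return G_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + … for a made-up reward sequence; the helper name and the numbers are hypothetical:

```python
# Illustration of the objective: the discounted return of an observed
# reward sequence, G_t = sum_k gamma**k * r_{t+1+k}.
def discounted_return(rewards, gamma=0.9):
    g = 0.0
    for r in reversed(rewards):   # work backwards: G_t = r_{t+1} + gamma * G_{t+1}
        g = r + gamma * g
    return g

print(discounted_return([1.0, 0.0, 0.0, 10.0], gamma=0.9))  # 1 + 0.9**3 * 10 = 8.29
```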

  7. Markov Decision Processes • Extensively studied since 1950s • In Optimal Control • Specializes to Riccati equations for linear systems • And to HJB equations for continuous time systems • Only general, nonlinear, optimal-control framework • In Operations Research • Planning, scheduling, logistics • Sequential design of experiments • Finance, marketing, inventory control, queuing, telecomm • In Artificial Intelligence (last 15 years) • Reinforcement learning, probabilistic planning • Dynamic Programming is the dominant solution method

  8. Outline • Markov decision processes (MDPs) • Dynamic Programming (DP) • The curse of dimensionality • Reinforcement Learning (RL) • TD(λ) algorithm • TD-Gammon example • Acrobot example • RL significantly extends DP methods for solving MDPs • RoboCup example • Conclusion, from the AI point of view • Spy plane example

  9. The Principle of Optimality • Dynamic Programming (DP) requires a decomposition into subproblems • In MDPs this comes from the independence-of-path (Markov) assumption • Values can be written in terms of successor values, e.g., the “Bellman equations”: V^π(s) = Σ_a π(s, a) Σ_{s'} p(s' | s, a) [ r(s, a, s') + γ V^π(s') ] for all s
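
The Bellman equation above translates directly into one sweep of iterative policy evaluation. This is a sketch assuming the illustrative P[s][a] = [(probability, next_state, reward), ...] and pi[s][a] layout from the earlier MDP sketch; none of these names come from the slides:

```python
# One Bellman-equation sweep of iterative policy evaluation for a fixed policy pi.
def evaluation_sweep(V, P, pi, gamma=0.9):
    newV = {}
    for s in P:
        # V(s) = sum_a pi(s,a) * sum_{s'} p(s'|s,a) [ r + gamma * V(s') ]
        newV[s] = sum(
            pi[s][a] * sum(p * (r + gamma * V[sp]) for p, sp, r in P[s][a])
            for a in P[s]
        )
    return newV
```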

  10. Dynamic Programming: sweeping through the states, updating an approximation to the optimal value function. For example, Value Iteration: • Initialize: V(s) ← 0 for all s • Do forever: for all s, V(s) ← max_a Σ_{s'} p(s' | s, a) [ r(s, a, s') + γ V(s') ] • Pick any of the maximizing actions to get π*
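
Below is a sketch of the value-iteration procedure the slide outlines, again assuming the illustrative P[s][a] transition layout used above; a stopping threshold theta stands in for the slide's "do forever":

```python
# A sketch of value iteration for a finite MDP.
def value_iteration(P, gamma=0.9, theta=1e-8):
    V = {s: 0.0 for s in P}                       # Initialize
    while True:                                   # "Do forever" (until convergence)
        delta = 0.0
        for s in P:                               # sweep through the states
            v = max(
                sum(p * (r + gamma * V[sp]) for p, sp, r in P[s][a])
                for a in P[s]
            )
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            break
    # Pick any of the maximizing actions to get an optimal policy pi*
    pi_star = {
        s: max(P[s], key=lambda a: sum(p * (r + gamma * V[sp]) for p, sp, r in P[s][a]))
        for s in P
    }
    return V, pi_star
```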

  11. DP is repeated backups: shallow lookahead searches. [Backup diagram: from state s, an action a leads to successor states s′ and s″, whose values V(s′) and V(s″) are combined to update V(s)]

  12. Dynamic Programming is the dominant solution method for MDPs • Routinely applied to problems with millions of states • Worst case scales polynomially in |S| and |A| • Linear Programming has better worst-case bounds but in practice scales 100s of times worse • On large stochastic problems, only DP is feasible

  13. Perennial Difficulties for DP 1. Large state spaces “The curse of dimensionality” 2. Difficulty calculating the dynamics, e.g., from a simulation 3. Unknown dynamics

  14. Bellman, 1961 The Curse of Dimensionality • The number of states grows exponentially with dimensionality -- the number of state variables • Thus, on large problems, • Can’t complete even one sweep of DP • Can’t enumerate states, need sampling! • Can’t store separate values for each state • Can’t store values in tables, need function approximation!

  15. Reinforcement Learning: using experience in place of dynamics • Let s_0, a_0, r_1, s_1, a_1, r_2, s_2, … be an observed sequence with actions selected by π • For every time step t, V^π(s_t) = E{ r_{t+1} + γ V^π(s_{t+1}) } (“Bellman Equation”), which suggests the DP-like update V(s_t) ← E{ r_{t+1} + γ V(s_{t+1}) } • We don’t know this expected value, but we know the actual r_{t+1} + γ V(s_{t+1}), an unbiased sample of it • In RL, we take a step toward this sample, e.g., half way: V(s_t) ← V(s_t) + α [ r_{t+1} + γ V(s_{t+1}) − V(s_t) ] (“Tabular TD(0)”)
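
A minimal sketch of tabular TD(0) as described above: the agent follows its policy, observes sampled transitions, and steps each V(s_t) toward the sampled target r_{t+1} + γ V(s_{t+1}). The env_step and policy callables are assumed stand-ins for an environment step function and a fixed policy π; they are not from the slides:

```python
from collections import defaultdict

# Tabular TD(0): learn V from experience alone, with no access to transition
# probabilities.  env_step(s, a) is assumed to return (reward, next_state, done).
def tabular_td0(env_step, policy, start_state, episodes=1000, alpha=0.5, gamma=0.9):
    V = defaultdict(float)
    for _ in range(episodes):
        s = start_state
        done = False
        while not done:
            a = policy(s)
            r, s_next, done = env_step(s, a)       # a sample, not an expectation
            target = r + (0.0 if done else gamma * V[s_next])
            V[s] += alpha * (target - V[s])        # step toward the sampled target
            s = s_next
    return V
```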

  16. Temporal-Difference Learning (Sutton, 1988): updating a prediction based on its change (temporal difference) from one moment to the next • Tabular TD(0): V(s_t) ← V(s_t) + α [ r_{t+1} + γ V(s_{t+1}) − V(s_t) ] • Or V is, e.g., a neural network with parameters θ; then use gradient-descent TD(0): θ ← θ + α [ r_{t+1} + γ V(s_{t+1}) − V(s_t) ] ∇_θ V(s_t) • TD(λ), λ > 0, uses differences from later predictions as well, moving each prediction toward the better, later prediction (the temporal difference)
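
The gradient-descent form extends to TD(λ) with eligibility traces. Here is a sketch for the linear case V(s) = θ·φ(s); the feature function phi and the trajectory format are assumptions for illustration, and setting λ = 0 recovers gradient-descent TD(0):

```python
import numpy as np

# Gradient-descent TD(lambda) with a linear value function V(s) = theta . phi(s)
# and accumulating eligibility traces.
def linear_td_lambda(phi, theta, trajectory, alpha=0.1, gamma=1.0, lam=0.9):
    """trajectory: iterable of (s, r, s_next, done) transitions generated under pi."""
    z = np.zeros_like(theta)                 # eligibility trace
    for s, r, s_next, done in trajectory:
        v = theta @ phi(s)
        v_next = 0.0 if done else theta @ phi(s_next)
        delta = r + gamma * v_next - v       # temporal-difference error
        z = gamma * lam * z + phi(s)         # grad of V(s) w.r.t. theta is phi(s)
        theta = theta + alpha * delta * z
        if done:
            z = np.zeros_like(theta)         # reset trace at episode boundaries
    return theta
```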

  17. TD-Gammon (Tesauro, 1992–1995) • Start with a random network • Play millions of games against itself • Learn a value function from this simulated experience • Action selection by 2–3 ply search • This produces arguably the best backgammon player in the world [Network diagram: board position in, output ≈ probability of winning, weights trained from the TD error]

  18. TD-Gammon vs. an Expert-Trained Net (Tesauro, 1992) [Plot: fraction of games won against Gammontool (roughly .45–.70) as a function of the number of hidden units (up to 80), comparing TD-Gammon, TD-Gammon + features, EP (a backprop net trained from expert moves), and EP + features (“Neurogammon”)]

  19. Examples of Reinforcement Learning • Elevator Control (Crites & Barto) • (Probably) world's best down-peak elevator controller • Walking Robot (Benbrahim & Franklin) • Learned critical parameters for bipedal walking • RoboCup Soccer Teams (e.g., Stone & Veloso, Riedmiller et al.) • RL is used in many of the top teams • Inventory Management (Van Roy, Bertsekas, Lee & Tsitsiklis) • 10–15% improvement over industry standard methods • Dynamic Channel Assignment (Singh & Bertsekas; Nie & Haykin) • World's best assigner of radio channels to mobile telephone calls • KnightCap and TDLeaf (Baxter, Tridgell & Weaver) • Improved chess play from intermediate to master in 300 games

  20. Does function approximation beat the curse of dimensionality? • Yes… probably • FA makes dimensionality per se largely irrelevant • With FA, computation seems to scale with the complexity of the solution (crinkliness of the value function) and how hard it is to find it • If you can get FA to work!

  21. FA in DP and RL (1st bit) • Conventional DP works poorly with FA • Empirically [Boyan and Moore, 1995] • Diverges with linear FA [Baird, 1995] • Even for prediction (evaluating a fixed policy) [Baird, 1995] • RL works much better • Empirically [many applications and Sutton, 1996] • TD(l) prediction converges with linear FA [Tsitsiklis & Van Roy, 1997] • TD(l) control converges with linear FA [Perkins & Precup, 2002] • Why? Following actual trajectories in RL ensures that every state is updated at least as often as it is the basis for updating

  22. DP+FA fails, RL+FA works • With DP+FA, more transitions can go into a state than go out of it • With RL+FA, real trajectories always leave a state after entering it

  23. Outline • Markov decision processes (MDPs) • Dynamic Programming (DP) • The curse of dimensionality • Reinforcement Learning (RL) • TD(λ) algorithm • TD-Gammon example • Acrobot example • RL significantly extends DP methods for solving MDPs • RoboCup example • Conclusion, from the AI point of view • Spy plane example

  24. The Mountain Car Problem (Moore, 1990) • SITUATIONS: car's position and velocity • ACTIONS: three thrusts: forward, reverse, none • REWARDS: always −1 until the car reaches the goal • No discounting • A minimum-time-to-goal problem: gravity is stronger than the engine (“gravity wins”), so the car must back up to gain momentum before driving to the goal
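
For reference, here is a sketch of the mountain-car dynamics in the formulation commonly used with this problem (as in Sutton & Barto); the specific constants follow that common formulation and are not stated on the slide:

```python
import math

# Mountain-car dynamics in the commonly used formulation.
# Actions: -1 (reverse thrust), 0 (none), +1 (forward thrust).
def mountain_car_step(position, velocity, action):
    velocity += 0.001 * action - 0.0025 * math.cos(3 * position)
    velocity = max(-0.07, min(0.07, velocity))
    position += velocity
    if position < -1.2:                 # inelastic wall on the left
        position, velocity = -1.2, 0.0
    done = position >= 0.5              # reached the goal at the top of the hill
    reward = -1.0                       # -1 per step until the goal (no discounting)
    return position, velocity, reward, done
```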

  25. Value functions learned while solving the Mountain Car problem (Sutton, 1996) [Figure: learned value functions over the position–velocity state space; value ≈ estimated time to goal (lower is better), with the goal region marked]

  26. Sparse, coarse tile coding (CMAC) (Albus, 1980) [Figure: overlapping tilings laid over the car-position × car-velocity state space]

  27. Tile Coding (CMAC) (Albus, 1980): an example of sparse coarse-coded networks • A fixed, expansive re-representation maps the state into many binary features, followed by a linear last layer • Coarse: large receptive fields • Sparse: few features present at one time
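
A minimal sketch of tile coding over two continuous variables (such as car position and velocity): several slightly offset grids ("tilings"), each contributing exactly one active feature, give the sparse, coarse binary representation described above. The function name and parameter values are illustrative, not from the slides:

```python
# Tile coding over two continuous variables: each of n_tilings offset grids
# contributes exactly one active (1-valued) feature.
def tile_features(x, y, x_range, y_range, n_tilings=8, tiles_per_dim=9):
    active = []
    x_scale = tiles_per_dim / (x_range[1] - x_range[0])
    y_scale = tiles_per_dim / (y_range[1] - y_range[0])
    for t in range(n_tilings):
        offset = t / n_tilings                       # each tiling is shifted slightly
        xi = int((x - x_range[0]) * x_scale + offset)
        yi = int((y - y_range[0]) * y_scale + offset)
        index = (t * (tiles_per_dim + 1) + xi) * (tiles_per_dim + 1) + yi
        active.append(index)                         # one active tile per tiling
    return active   # indices of the 1s in a large, sparse binary feature vector
```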

  28. The Acrobot Problem (Sutton, 1996; e.g., DeJong & Spong, 1994) • A two-link pendulum swinging from a fixed base, with torque applied only at the second joint • Goal: raise the tip above a line (minimum time to goal) • 4 state variables: 2 joint angles (θ1, θ2) and 2 angular velocities • Tile coding with 48 tilings • Reward = −1 per time step

  29. The RoboCup Soccer Competition

  30. 13 Continuous State Variables (for 3 vs 2) (Stone & Sutton, 2001) • 11 distances among the players, the ball, and the center of the field • 2 angles to takers along passing lanes

  31. RoboCup Feature Vectors (Stone & Sutton, 2001) [Diagram: full soccer state s → 13 continuous state variables → sparse, coarse tile coding → huge binary feature vector φ (about 400 1’s and 40,000 0’s) → linear map θ → action values]

  32. Learning Keepaway Results (Stone & Sutton, 2001): 3 vs 2 against handcrafted takers; multiple, independent runs of TD(λ)

  33. Hajime Kimura's RL Robots (no dynamics knowledge) [Photos: a robot before and after learning, backward locomotion, and a new robot trained with the same algorithm]

  34. Assessment re: DP • RL has added some new capabilities to DP methods • Much larger MDPs can be addressed (approximately) • Simulations can be used without explicit probabilities • Dynamics need not be known or modeled • Many new applications are now possible • Process control, logistics, manufacturing, telecomm, finance, scheduling, medicine, marketing… • Theoretical and practical questions remain open

  35. Outline • Markov decision processes (MDPs) • Dynamic Programming (DP) • The curse of dimensionality • Reinforcement Learning (RL) • TD(λ) algorithm • TD-Gammon example • Acrobot example • RL significantly extends DP methods for solving MDPs • RoboCup example • Conclusion, from the AI point of view • Spy plane example

  36. A lesson for AI: The Power of a “Visible” Goal • In MDPs, the goal (reward) is part of the data, part of the agent’s normal operation • The agent can tell for itself how well it is doing • This is very powerful… we should do more of it in AI • Can we give AI tasks visible goals? • Visual object recognition? Better would be active vision • Story understanding? Better would be dialog, e.g., call routing • User interfaces, personal assistants • Robotics… say mapping and navigation, or search • The usual trick is to make them into long-term prediction problems • There must be a way: if you can’t feel it, why care about it?

  37. Assessment re: AI • DP and RL are potentially powerful probabilistic planning methods • But they typically don’t use logic or structured representations • How good are they as an overall model of thought? • Good mix of deliberation and immediate judgments (values) • Good for causality and prediction, not for logic and language • The link to data is appealing… but incomplete • MDP-style knowledge may be learnable, tuneable, verifiable • But only if the “level” of the data is right • Sometimes seems too low-level, too flat

  38. Ongoing and Future Directions • Temporal abstraction [Sutton, Precup, Singh, Parr, others] • Generalize transitions to include macros, “options” • Multiple overlying MDP-like models at different levels • State representation [Littman, Sutton, Singh, Jaeger...] • Eliminate the nasty assumption of observable state • Get really real with data • Work up to higher-level, yet grounded, representations • Neuroscience of reward systems [Dayan, Schultz, Doya] • Dopamine reward system behaves remarkably like TD • Theory and practice of value function approximation [everybody]

  39. Spy Plane Example (Reconnaissance Mission Planning) (Sutton & Ravindran, 2001) • Mission: fly over (observe) the most valuable sites and return to base • Stochastic weather affects observability (cloudy or clear) of sites • Limited fuel • Intractable with classical optimal control methods • Temporal scales: Actions — which direction to fly now; Options — which site to head for • Options compress space and time • Reduce steps from ~600 to ~6 • Reduce states from ~10^11 to ~10^6 [Figure labels: any state (10^6); sites only (6)]

  40. Spy Plane Results (Sutton & Ravindran, 2001) • SMDP planner: assumes options are followed to completion; plans the optimal SMDP solution • SMDP planner with re-evaluation: plans as if options must be followed to completion, but actually takes them for only one step and re-picks a new option on every step • Static re-planner: assumes the weather will not change, plans the optimal tour among clear sites, and re-plans whenever the weather changes • Result: temporal abstraction finds a better approximation than the static planner, with little more computation than the SMDP planner [Bar chart: expected reward per mission under high and low fuel for the three planners]
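
The "SMDP planner with re-evaluation" can be read as a simple control loop: evaluate options as if they ran to completion, but commit to only one primitive step before re-choosing. Below is a sketch under that reading; option_value, first_action, and env_step are hypothetical helpers, not from the slides:

```python
# Sketch of the "re-evaluate on every step" control loop described above.
def replan_every_step(state, options, option_value, first_action, env_step):
    total_reward = 0.0
    done = False
    while not done:
        # Choose the option that looks best if followed to completion...
        best_option = max(options, key=lambda o: option_value(state, o))
        # ...but execute only its first primitive action, then re-plan.
        a = first_action(state, best_option)
        state, reward, done = env_step(state, a)
        total_reward += reward
    return total_reward
```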

  41. Didn’t have time for • Action Selection • Exploration/Exploitation • Action values vs. search • How learning values leads to policy improvements • Different returns, e.g., the undiscounted case • Exactly how FA works, backprop • Exactly how options work • How planning at a high level can affect primitive actions • How states can be abstracted to affordances • And how this directly builds on the option work
