
Reinforcement Learning: How far can it Go?

This article explores reinforcement learning, an active and successful approach to AI that emphasizes learning from interaction and does not require complete knowledge of the world. It discusses world-class applications, strong theoretical foundations, and parallels in other fields such as operations research and psychology. The article also examines the past, present, and future of reinforcement learning, including trial and error learning, learning and planning values, and the potential for constructivism.


Presentation Transcript


  1. Reinforcement Learning: How far can it Go? Rich Sutton, University of Massachusetts / AT&T Research. With thanks to Doina Precup, Satinder Singh, Amy McGovern, B. Ravindran, Ron Parr

  2. Reinforcement Learning • An active, popular, successful approach to AI • 15 – 50 years old • Emphasizes learning from interaction • Does not assume complete knowledge of the world • World-class applications • Strong theoretical foundations • Parallels in other fields: operations research, control theory, psychology, neuroscience • Seeks simple general principles How Far Can It Go?

  3. World-Class Applications of RL • TD-Gammon and Jellyfish (Tesauro; Dahl) • World's best backgammon player • Elevator Control (Crites & Barto) • (Probably) world's best down-peak elevator controller • Job-Shop Scheduling (Zhang & Dietterich) • World's best scheduler of space-shuttle payload processing • Dynamic Channel Assignment (Singh & Bertsekas; Nie & Haykin) • World's best assigner of radio channels to mobile telephone calls

  4. Outline • RL Past: Trial and Error Learning • RL Present: Learning and Planning Values • RL Future: Constructivism (timeline on slide: 1950, 1985, 2000)

  5. RL began with dissatisfaction with previous learning problems • Such as: • Learning from examples • Unsupervised learning • Function optimization • None seemed to be purposive • Where is the learning of how to get something? • Where is the learning by trial and error? Need rewards and penalties, interaction with the world!

  6. Rooms Example: Early learning methods could not learn how to get reward

  7. The Reward Hypothesis: That purposes can be adequately represented as maximization of the cumulative sum of a scalar reward signal received from the environment • Is this reasonable? • Is it demeaning? • Is there no other choice? • It seems to be adequate
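
In equation form (added here for concreteness; the formula is standard and not spelled out on the slide), the quantity being maximized is the expected discounted return

  G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}, \qquad 0 \le \gamma \le 1,

where r_t is the scalar reward at time t and \gamma is a discount factor.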

  8. RL Past – Trial and Error Learning • Learned only a policy (a mapping from states to actions) • Maximized only • Short-term reward (e.g., learning automata) • Or delayed reward via simple action traces • Assumed good/bad rewards were immediately distinguishable • E.g., positive is good, negative is bad • An implicitly known reinforcement baseline • Next steps were to learn baselines and internal rewards. Taking these next steps quickly led to modern value functions and temporal-difference learning

  9. A Policy: Movement is in the wrong direction 1/3 of the time

  10. Problems with Value-less RL Methods

  11. Outline • RL Past: Trial and Error Learning • RL Present: Learning and Planning Values • RL Future: Constructivism (timeline on slide: 1950, 1985, 2000)

  12. The Value-Function Hypothesis • Value functions = measures of expected reward following states: V: States → Expected future reward, or following state-action pairs: Q: States × Actions → Expected future reward • All efficient methods for optimal sequential decision making estimate value functions • The hypothesis: That the dominant purpose of intelligence is to approximate these value functions
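
Written out (standard definitions, added for reference rather than taken from the slide), for a policy \pi and discount factor \gamma:

  V^{\pi}(s) = E_{\pi}\left[ \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\middle|\, s_t = s \right], \qquad Q^{\pi}(s, a) = E_{\pi}\left[ \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\middle|\, s_t = s,\ a_t = a \right].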

  13. State-Value Function

  14. RL Present – Learning and Planning Values • Accepts reward and value hypotheses • Many real-world applications, some impressive • Theory strong and active, yet still with more questions than answers • Strong links to Operations Research • A part of modern AI’s interest in uncertainty: • MDPs, POMDPs, Bayes nets, connectionism • Includes deliberative planning

  15. New Applications of RL – Real-world applications using on-line learning • CMUnited Robocup Soccer Team (Stone & Veloso) • World's best player of Robocup simulated soccer, 1998 • KnightCap and TDleaf (Baxter, Tridgell & Weaver) • Improved chess play from intermediate to master in 300 games • Inventory Management (Van Roy, Bertsekas, Lee & Tsitsiklis) • 10-15% improvement over industry standard methods • Walking Robot (Benbrahim & Franklin) • Learned critical parameters for bipedal walking

  16. RL Present, Part II: The Space of Methods (Diagram: a two-dimensional space of backup methods, full backups vs. sample backups on one axis and shallow backups (bootstrapping) vs. deep backups on the other, with Dynamic Programming, Exhaustive Search, Temporal-Difference learning, and Monte Carlo at the corners; λ spans shallow to deep backups.) Also: Function Approximation, Explore/Exploit, Planning/Learning, Action/state values, Actor-Critic, ...

  17. The TD Hypothesis: That all value learning is driven by TD errors • Even “Monte Carlo” methods can benefit • TD methods enable them to be done incrementally • Even planning can benefit • Trajectory following improves function approximation and state sampling • Sample backups reduce the effect of branching factor • Psychological support • TD models of reinforcement, classical conditioning • Physiological support • Reward neurons show TD behavior (Schultz et al.)
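
As a concrete illustration of value learning driven by TD errors, here is a minimal tabular TD(0) sketch; it is my own illustration rather than code from the talk, and the environment interface (env.reset(), env.step()), the step size alpha, and the discount gamma are assumptions.

```python
from collections import defaultdict

def td0(env, policy, episodes=1000, alpha=0.1, gamma=0.95):
    """Tabular TD(0): every update is driven by the TD error."""
    V = defaultdict(float)                      # value estimates, default 0
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)                       # behave according to the given policy
            s_next, r, done = env.step(a)       # one step of real experience
            target = r + (0.0 if done else gamma * V[s_next])
            td_error = target - V[s]            # the TD error
            V[s] += alpha * td_error            # move the estimate toward the target
            s = s_next
    return V
```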

  18. (Diagram: experience from interaction with the world feeds both direct RL and model learning; the learned model generates imagined interaction for planning; both paths update the value/policy used for acting.) • Modern RL includes planning • As in planning for MDPs • A form of state-space planning • Still controversial for some • Planning and learning are near identical in RL • The same algorithms on real or imagined experience • Same value functions, backups, function approximation
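
A hedged sketch of the “same algorithms on real or imagined experience” point, in the spirit of Dyna-style planning: the same Q backup is applied once to the real transition and then to transitions replayed from a learned model. The function, its arguments, and the action set are my assumptions (Q is assumed to be a dict-like table, e.g. collections.defaultdict(float), keyed by (state, action)).

```python
import random

def dyna_q_step(Q, model, s, a, r, s_next, done,
                actions=(0, 1, 2, 3), alpha=0.1, gamma=0.95, planning_steps=10):
    """One real experience followed by several imagined (planning) backups."""
    def backup(s, a, r, s_next, done):
        best_next = 0.0 if done else max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

    backup(s, a, r, s_next, done)               # direct RL on real experience
    model[(s, a)] = (r, s_next, done)           # learn a simple one-step model

    for _ in range(planning_steps):             # planning: the same backup on imagined experience
        s_i, a_i = random.choice(list(model.keys()))
        r_i, s2_i, done_i = model[(s_i, a_i)]
        backup(s_i, a_i, r_i, s2_i, done_i)
```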

  19. Planning with Imagined Experience (figure contrasting real experience with imagined experience)

  20. Outline • RL Past: Trial and Error Learning • RL Present: Learning and Planning Values • RL Future: Constructivism (timeline on slide: 1950, 1985, 2000)

  21. Constructivism (Piaget, Drescher): The active construction of representations and models of the world to facilitate the learning and planning of values (Diagram: Representations and Models feed Value functions, which feed the Policy; annotated “great flexibility here” at the representation level.)

  22. Constructivist Prophecy – The RL agent as active world modeler • Whereas RL present is about solving an MDP, • RL future will be about representing the • States • Actions • Transitions • Rewards • Features to construct an MDP • Constructing the world to be the way we want it: • Markov • Linear • Small • Reliable • Independent • Shallow • Deterministic • Additive • Low branching

  23. Representing State, Part I: Features and Function Approximation (Diagram: State → Features → Values) • Linear-in-the-features methods are state of the art • (cf. memory-based methods) • Two-stage architecture: • Compute feature values • Nonlinear, expansive, fixed or slowly changing mapping • Map the feature values linearly to the result • Linear, convergent, fast-changing mapping • Works great if features are appropriate • Fast, reliable, local learning; good generalization • Feature construction best done by hand ...or by methods yet to be found (constructive induction)
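
A minimal sketch of the two-stage architecture just described, assuming a hypothetical one-hot “grid cell” feature map standing in for whatever hand-built features are appropriate; all names here are mine, not from the talk.

```python
import numpy as np

def make_features(num_cells, low, high):
    """Stage 1: a fixed, expansive, nonlinear mapping from state to binary features."""
    def phi(state):
        x = np.zeros(num_cells)
        idx = int((state - low) / (high - low) * (num_cells - 1))
        x[min(max(idx, 0), num_cells - 1)] = 1.0   # one-hot grid-cell feature
        return x
    return phi

def linear_td0_update(w, phi, s, r, s_next, done, alpha=0.1, gamma=0.95):
    """Stage 2: a linear, fast-changing mapping from features to value, trained by TD(0)."""
    v = w @ phi(s)
    v_next = 0.0 if done else w @ phi(s_next)
    td_error = r + gamma * v_next - v
    w += alpha * td_error * phi(s)                 # local, fast update of the linear weights
    return w
```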

  24. Good Features vs. Bad Features: good features correspond to regions of similar value; bad features are unrelated to values

  25. Representing State, Part II: Partial Observability – when immediate observations do not uniquely identify the current state (non-Markov problems) • Not as big a deal as widely thought • A greater problem for theory than for practice • Need not use POMDP ideas • Can treat as a function approximation issue • Making do with imperfect observations/features • Finding the right memories to add as new features • The key is to construct state representations that make the world more Markov (McCallum’s thesis)
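
One simple reading of “finding the right memories to add as new features” is to augment the immediate observation with a short history, so that the constructed representation is more nearly Markov. A minimal sketch under that assumption (the window length k and the interface are mine):

```python
from collections import deque

class HistoryState:
    """Use the last k (action, observation) pairs as the agent's state."""
    def __init__(self, k=4):
        self.buffer = deque(maxlen=k)

    def reset(self, obs):
        self.buffer.clear()
        self.buffer.append((None, obs))
        return self.state()

    def update(self, action, obs):
        self.buffer.append((action, obs))
        return self.state()

    def state(self):
        # A hashable representation usable as a (more nearly Markov) state.
        return tuple(self.buffer)
```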

  26. Representations of Action • Nominally, actions in RL are low-level • The lowest level at which behavior can vary • But people work mostly with courses of action • We decide among these • We make predictions at this level • We plan at this level • Remarkably, all this can be incorporated in RL • Course of action = policy + termination condition • Almost all RL ideas, algorithms and theory extend • Wherever actions are used, courses of action can be substituted Parr, Bradtke & Duff, Precup, Singh, Dietterich, Kaelbling, Huber & Grupen, Szepesvari, Dayan, Ryan & Pendrith, Hauskrecht, Lin...
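
A minimal sketch of “course of action = policy + termination condition” (essentially what the options framework formalizes); the class, its fields, and the env.step() interface are my assumptions, not code from the talk.

```python
import random

class CourseOfAction:
    """A temporally extended action: an internal policy plus a termination condition."""
    def __init__(self, policy, terminate, can_start=lambda s: True):
        self.policy = policy          # state -> primitive action
        self.terminate = terminate    # state -> probability of stopping in this state
        self.can_start = can_start    # states in which this course may be chosen

    def run(self, env, s, gamma=0.95):
        """Execute primitive actions until the termination condition fires."""
        total_reward, discount, steps, done = 0.0, 1.0, 0, False
        while not done and random.random() >= self.terminate(s):
            s, r, done = env.step(self.policy(s))
            total_reward += discount * r          # discounted reward along the way
            discount *= gamma
            steps += 1
        return s, total_reward, steps, done
```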

  27. Room-to-Room Courses of Action: a course of action for each hallway from each room (2 of 8 shown)

  28. Representing Transitions • Models can also be learned for courses of action • What state will we be in at termination? • How much reward will we receive along the way? • Mathematical form of models follows from the theory of semi-Markov decision processes • Permits planning at a higher level
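
In equations (the standard SMDP-style form; the slide does not spell it out), if a course of action o is initiated in state s at time t and terminates after k steps in state s', its reward and transition models are roughly

  R(s, o) = E\left[ r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{k-1} r_{t+k} \mid o \text{ initiated in } s \right],
  P(s' \mid s, o) = \sum_{k=1}^{\infty} \gamma^k \Pr\{ o \text{ terminates in } s' \text{ after } k \text{ steps} \mid o \text{ initiated in } s \},

so that planning proceeds exactly as with one-step models, just at a coarser time scale.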

  29. Planning (Value Iteration) with Courses of Action

  30. Reconnaissance Example • Mission: Fly over (observe) most valuable sites and return to base • Stochastic weather affects observability (cloudy or clear) of sites • Limited fuel • Intractable with classical optimal control methods • Actions: • Primitives: which direction to fly • Courses: which site to head for • Courses compress space and time • Reduce steps from ~600 to ~6 • Reduce states from ~10^11 to ~10^6 • Enable finding of best solutions (Figure: a map of sites annotated with rewards and mean times between weather changes, a base, and roughly 100 decision steps. B. Ravindran, UMass)

  31. Courses of action permit enormous flexibility

  32. Subgoals • Courses of action are often goal-oriented • E.g., drive-to-work, open-the-door • A course can be learned to achieve its goal • Many can be learned at once, independently • Solves the classic problem of subgoal credit assignment • Solves the psychological puzzle of goal-oriented action • Goal-oriented courses of action create a better MDP • Fewer states, smaller branching factor • Compartmentalized dependencies • Their models are also goal-oriented recognizers...

  33. Perception (Figure: a robot, a charger, and the dockable region) • Real perception, like real action, is temporally extended • Features are ability-oriented rather than sensor-oriented • What is a chair? Something that can be sat upon • Consider a goal-oriented course of action, like dock-with-charger • Its model gives the probability of successfully docking as a function of state • I.e., a feature (detector) for states that afford docking • Such features can be learned without supervision
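
A small sketch of the “model as feature detector” idea: the learned success probability of a goal-oriented course of action is itself exposed as a feature over states. The names (affordance_feature, success_prob, dock_with_charger) are hypothetical illustrations, not from the talk.

```python
def affordance_feature(success_prob, threshold=0.9):
    """Turn a course-of-action model into a state feature: does this state afford the behavior?"""
    def feature(state):
        p = success_prob(state)   # learned P(reach the goal | start here, follow the course)
        return 1.0 if p >= threshold else 0.0
    return feature

# Hypothetical usage: a detector for "dockable" states, learned without supervision.
# dockable = affordance_feature(dock_with_charger.success_prob)
```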

  34. This is RL with a totally different feel • Still one primary policy and set of values • But many other policies, values, and models are learned not directly in service of reward • The dominant purpose is discovery, not reward • What possibilities does this world afford? • How can I control and predict it in a variety of ways? • In other words, constructing representations to make the world: • Markov • Linear • Small • Reliable • Independent • Shallow • Deterministic • Additive • Low branching

  35. Imagine • An agent driven primarily by biased curiosity • To discover how it can predict and control its interaction with the world • What courses of action have predictable effects? • What salient observables can be controlled? • What models are most useful in planning? • A human coach presenting a series of • Problems/Tasks • Courses of action • Highlighting key states, providing subpolicies, termination conditions…

  36. What is New? • Constructivism itself is not new. But actually doing it would be! • Does RL really change it, make it easier? That is, do values and policies help? • Yes! Because so much constructed knowledge is • well represented as values and policies • in service of approximating values and policies • RL’s goal-orientation is also critical to modeling goal-oriented action and perception

  37. Take Home Messages • RL Past • Let’s revisit, but not repeat past work • RL Present • Do you accept that value functions are critical? • And that TD methods are the way to find them? • RL Future • It’s time to address representation construction • Explore/understand the world rather than control it • RL/values provide new structure for this • May explain goal-oriented action and perception

  38. How far can RL go? • A simple and general formulation of AI • Yet there is enough structure to make progress • While this is true, we should complicate no further, but seek general principles of AI • They may take us all the way to human-level intelligence
