Reinforcement Learning and the Reward Engineering Principle

Reinforcement Learning and the Reward Engineering Principle. Daniel Dewey. daniel.dewey@philosophy.ox.ac.uk ; AAAI Spring Symposium Series 2014. A modest aim: What role goals in AI research? …through the lens of reinforcement learning.

Presentation Transcript


  1. Reinforcement Learning and the Reward Engineering Principle Daniel Dewey daniel.dewey@philosophy.ox.ac.uk; AAAI Spring Symposium Series 2014

  2. A modest aim: What role goals in AI research? …through the lens of reinforcement learning.

  3. Reinforcement learning and AI; Definitions: “control”, “dominance”; The reward engineering principle; Conclusions

  4. RL and AI “…one can define AI as the problem of designing systems that do the right thing. Now we just need a definition for ‘right.’” (Stuart Russell, “Rationality and Intelligence”) Reinforcement learning provides a definition: maximize total rewards.

  5. RL and AI [Diagram: agent-environment loop with labels “AI Agent”, “Environment”, “action”, “state”, “reward”]

  6. RL and AI Understand and Exploit Inference, Planning, Learning, Metareasoning, Concept formation, etc…

  7. RL and AI Advantages: simple and cheap; flexible and abstract; measurable. “Worse is better” …and used in natural neural nets (brains!)

  8. RL and AI Outside the frame: some behaviours cannot be elicited (by any rewards!). Key concepts: control and dominance. As RL AI becomes more general and autonomous, it becomes harder to get good results with RL.

  9. Reinforcement learning and AI; Definitions: “control”, “dominance”; The reward engineering principle; Conclusions

  10. Definitions: “control” A user has control when the agent’s received rewards equal the user’s chosen reward.

  11. Definitions: “control” [Diagram: agent-environment loop with labels “Agent”, “Environment”, “action”, “state”, “reward”]

  12. Definitions: “control” [Diagram with labels “Agent”, “User”, “Environment 1”, “Environment 2”, “action”, “state”, “reward”]

  13. Definitions: “control” [Diagram with labels “Agent”, “User”, “Environment 1”, “Environment 2”; caption: user chooses reward]

  14. Definitions: “control” [Diagram with labels “Agent”, “User”, “Environment 1”, “Environment 2”; caption: env. “chooses” reward]
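
The definition of control on slide 10 can be illustrated with a minimal sketch. Everything named here is a hypothetical stand-in, not something from the talk: `passthrough_env` models an environment that delivers the user's chosen reward unchanged, and `overriding_env` models one that "chooses" the reward itself.

```python
from typing import Callable, Sequence


def passthrough_env(chosen: float) -> float:
    # Hypothetical environment: the agent receives exactly what the user chose.
    return chosen


def overriding_env(chosen: float) -> float:
    # Hypothetical environment: ignores the user's choice and always delivers 1.
    return 1.0


def user_has_control(env: Callable[[float], float],
                     chosen_rewards: Sequence[float]) -> bool:
    # The user has control when every received reward equals the chosen reward.
    received = [env(c) for c in chosen_rewards]
    return received == list(chosen_rewards)


chosen = [1.0, 0.0, 1.0]
print(user_has_control(passthrough_env, chosen))
print(user_has_control(overriding_env, chosen))
```

Under these assumptions the first check passes and the second fails, because the overriding environment replaces the user's chosen 0 with 1.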

  15. Definitions: “dominance” Why does control matter? Loss of control can create situations where no possible sequence of rewards can elicit the desired behaviour. These behaviours are dominated by other behaviours.

  16. Definitions: “dominance” A “behaviour” (sequence of actions) is a policy. [Diagram: policy P1 as actions a1 to a8 with rewards 1 ? 0 ? ? ? 0 ?]

  17. Definitions: “dominance” User-chosen rewards. P1: 1 ? 0 ? ? ? 0 ?

  18. Definitions: “dominance” Env.-chosen rewards (loss of control). P1: 1 ? 0 ? ? ? 0 ?

  19. Definitions: “dominance” P1: 1 ? 0 ? ? ? 0 ?; P2: 1 0 ? 1 ? ? 1 1. Can rewards make either better?

  20. Definitions: “dominance” P1: 1 1 0 1 1 1 0 1 (choose all rewards 1: max. reward = 6); P2: 1 0 0 1 0 0 1 1 (choose all rewards 0: min. reward = 4)

  21. Definitions: “dominance” P1: 1 0 0 0 0 0 0 0 (choose all rewards 0: min. reward = 1); P2: 1 0 1 1 1 1 1 1 (choose all rewards 1: max. reward = 7)

  22. Definitions: “dominance” P1: 1 ? 0 ? ? ? 0 ?; P3: 1 1 1 1 1 ? 1 1

  23. Definitions: “dominance” P1: 1 1 0 1 1 1 0 1 (max. reward = 6); P3: 1 1 1 1 1 0 1 1 (min. reward = 7)

  24. Definitions: “dominance” P1: 1 ? 0 ? ? ? 0 ? (dominated by P3); P3: 1 1 1 1 1 ? 1 1 (dominates P1)

  25. Definitions: “dominance” A dominates B if no possible assignment of rewards causes R(B) > R(A). No series of rewards can prompt a dominated policy; such policies are unelicitable. (A less obvious result: every unelicitable policy is dominated.)
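
The arithmetic on slides 19 to 24 can be reproduced with a short sketch. It rests on assumptions not stated on the slides: each user-chosen reward ("?") is either 0 or 1, the rewards for the two policies can be chosen independently, and a tie counts as dominance. The names `reward_bounds` and `dominates` are hypothetical.

```python
from typing import List, Optional, Tuple

# One entry per time step: 0 or 1 is fixed by the environment (control lost),
# None is a reward the user still chooses (shown as "?" on the slides).
RewardSlots = List[Optional[int]]


def reward_bounds(slots: RewardSlots) -> Tuple[int, int]:
    # Assumption: every user-chosen reward is 0 or 1.
    fixed = sum(r for r in slots if r is not None)
    free = sum(1 for r in slots if r is None)
    return fixed, fixed + free  # (all "?" set to 0, all "?" set to 1)


def dominates(a: RewardSlots, b: RewardSlots) -> bool:
    # a dominates b if even b's best case cannot exceed a's worst case.
    min_a, _ = reward_bounds(a)
    _, max_b = reward_bounds(b)
    return min_a >= max_b


P1 = [1, None, 0, None, None, None, 0, None]
P2 = [1, 0, None, 1, None, None, 1, 1]
P3 = [1, 1, 1, 1, 1, None, 1, 1]

print(reward_bounds(P1))   # (1, 6)
print(reward_bounds(P2))   # (4, 7)
print(reward_bounds(P3))   # (7, 8)
print(dominates(P3, P1))   # True
print(dominates(P1, P2), dominates(P2, P1))  # False False
```

The bounds match the slides: P1 reaches at most 6 while P3 never falls below 7, so P3 dominates P1, whereas the ranges of P1 and P2 overlap and neither dominates the other.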

  26. Recap: control is sometimes lost; loss of control enables dominance; dominance makes some policies unelicitable. All of this is outside the “RL AI frame” …but is clearly part of the AI problem (do the right thing!)

  27. Additional factors. Generality: the range of policies an agent has reasonably efficient access to (= better chance of finding dominant policies). Autonomy: ability to function in environments with little interaction from users (= more frequent loss of control).

  28. Reinforcement learning and AI; Definitions: “control”, “dominance”; The reward engineering principle; Conclusions

  29. Reward Engineering Principle: As RL AI becomes more general and autonomous, it becomes both more difficult and more important to constrain the environment to avoid loss of control. …because general / autonomous RL AI has: better chance of dominant policies; more unelicitable policies; more significant effects.

  30. Reinforcement learning and AI; Definitions: “control”, “dominance”; The reward engineering principle; Conclusions

  31. RL AI users: Heed the Reward Engineering Principle. Consider the existence of dominant policies. Be as rigorous as possible in excluding them. Remember what’s outside the frame!

  32. AI Researchers: Expand the frame! Make goal design a first-class citizen. Consider alternatives: manually coded utility functions, preference learning, …? Watch out for dominance relations (e.g. in “dual” motivation systems, between intrinsic and extrinsic).

  33. Thank you! Toby Ord, Seán Ó hÉigeartaigh, and two anonymous judges, for comments. Work supported by the Alexander Tamas Research Fellowship.
