Welcome!


Presentation Transcript


  1. Welcome! NIPS 2007 Workshop: Hierarchical organization of behavior
     • Thank you for coming
     • Apologies to the skiers…
     • Why we will be strict about timing
     • Why we want the workshop to be interactive

  2. RL: Decision making
     • Goal: maximize reward (minimize punishment)
     • Rewards/punishments may be delayed
     • Outcomes may depend on a sequence of actions → the credit assignment problem

  3. RL in a nutshell: formalization
     [Figure: a two-step decision tree with states S1, S2, S3, actions L and R, and terminal rewards 4, 0, 2, 2]
     • Components of an RL task: states - actions - transitions - rewards - policy - long-term values
     • Policy: π(S,a)
     • State values: V(S)
     • State-action values: Q(S,a)
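To make these components concrete, here is a minimal Python sketch of the toy task in the figure. The state names, actions, and rewards are read off the slide (assuming S1 leads to S2 under L and to S3 under R, with rewards 4/0 at S2 and 2/2 at S3); the dictionary encoding itself is just one possible choice.

```python
# Toy task from the figure (assumed layout: S1 --L--> S2, S1 --R--> S3;
# choosing L/R at S2 pays 4/0, choosing L/R at S3 pays 2/2, then the episode ends)
STATES = ["S1", "S2", "S3"]
ACTIONS = ["L", "R"]

# T(ransitions): (state, action) -> next state; None means the episode ends
T = {
    ("S1", "L"): "S2", ("S1", "R"): "S3",
    ("S2", "L"): None, ("S2", "R"): None,
    ("S3", "L"): None, ("S3", "R"): None,
}

# R(ewards): (state, action) -> immediate reward
R = {
    ("S1", "L"): 0, ("S1", "R"): 0,
    ("S2", "L"): 4, ("S2", "R"): 0,
    ("S3", "L"): 2, ("S3", "R"): 2,
}

# A policy pi(S, a): probability of choosing action a in state S (uniform here)
def pi(state, action):
    return 1.0 / len(ACTIONS)
```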

  4. RL in a nutshell: forward search (model-based RL)
     [Figure: the same tree expanded as a forward search from S1: L leads to S2 (where L = 4, R = 0), R leads to S3 (where L = 2, R = 2)]
     • Model = T(ransitions) and R(ewards)
     • Learn the model through experience (cognitive map)
     • Choosing actions is hard
     • Goal-directed behavior; cortical
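A model-based agent that knows T and R can pick actions by searching forward through the tree at decision time. A minimal sketch under the same assumptions as the snippet above (it reuses those T and R dictionaries):

```python
def q_search(state, action, T, R, actions=("L", "R")):
    """Model-based value of (state, action): the immediate reward plus the
    best value reachable from the successor state, found by forward search."""
    value = R[(state, action)]
    next_state = T[(state, action)]
    if next_state is not None:
        value += max(q_search(next_state, a, T, R, actions) for a in actions)
    return value

def choose_action(state, T, R, actions=("L", "R")):
    """Act greedily with respect to the searched values (flexible, but the
    search grows with the depth and branching of the task)."""
    return max(actions, key=lambda a: q_search(state, a, T, R, actions))

# With the T and R defined above: q_search("S1", "L", T, R) == 4 and
# q_search("S1", "R", T, R) == 2, so choose_action("S1", T, R) == "L".
```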

  5. RL in a nutshell: cached values (model-free RL)
     [Figure: the same decision tree with rewards 4, 0, 2, 2]
     • Trick #1: long-term values are recursive
       Q(S,a) = r(S,a) + V(S_next)
     • Temporal difference (TD) learning: start with an initial (wrong) Q(S,a)
       Q(S,a) = r(S,a) + max_a' Q(S',a')
       PE = r(S,a) + max_a' Q(S',a') - Q(S,a)
       Q(S,a)_new = Q(S,a)_old + PE
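A sketch of the TD update written on the slide. The table Q starts out wrong (here, empty, i.e. all zeros); the learning rate alpha is an added assumption, and the slide's update corresponds to alpha = 1:

```python
def td_update(Q, state, action, reward, next_state, alpha=0.1, actions=("L", "R")):
    """One temporal-difference (Q-learning style) update of a cached value table.

    PE      = r(S,a) + max_a' Q(S',a') - Q(S,a)
    Q(S,a) <- Q(S,a) + alpha * PE
    """
    target = reward
    if next_state is not None:  # bootstrap from the cached successor values
        target += max(Q.get((next_state, a), 0.0) for a in actions)
    pe = target - Q.get((state, action), 0.0)              # prediction error
    Q[(state, action)] = Q.get((state, action), 0.0) + alpha * pe
    return pe
```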

  6. RL in a nutshell: cached values (model-free RL)
     [Figure: the same tree annotated with the cached values Q(S1,L) = 4, Q(S1,R) = 2, Q(S2,L) = 4, Q(S2,R) = 0, Q(S3,L) = 2, Q(S3,R) = 2]
     • Trick #2: can learn values without a model (temporal difference learning)
     • Choosing actions is easy (but lots of practice is needed to learn)
     • Habitual behavior; basal ganglia
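To illustrate the "lots of practice" point, here is a self-contained sketch that replays random episodes through the same toy tree and applies the update above; the cached table gradually approaches the values in the figure. The episode count and learning rate are arbitrary choices:

```python
import random

# Environment for the toy tree (the learner only ever sees sampled transitions)
T = {("S1", "L"): "S2", ("S1", "R"): "S3"}      # all other choices end the episode
R = {("S1", "L"): 0, ("S1", "R"): 0,
     ("S2", "L"): 4, ("S2", "R"): 0,
     ("S3", "L"): 2, ("S3", "R"): 2}
ACTIONS = ["L", "R"]

Q = {}                                   # cached values, initially all "wrong" (zero)
alpha = 0.1                              # learning rate (an assumed value)

for episode in range(2000):              # lots of practice
    state = "S1"
    while state is not None:
        action = random.choice(ACTIONS)  # explore at random
        reward = R[(state, action)]
        next_state = T.get((state, action))              # None = episode over
        target = reward
        if next_state is not None:       # bootstrap from cached successor values
            target += max(Q.get((next_state, a), 0.0) for a in ACTIONS)
        old = Q.get((state, action), 0.0)
        Q[(state, action)] = old + alpha * (target - old)
        state = next_state

print(Q)  # approaches Q(S1,L)=4, Q(S1,R)=2, Q(S2,L)=4, Q(S2,R)=0, Q(S3,L)=2, Q(S3,R)=2
```

Once the table has settled, acting is a single lookup per state; that is the sense in which model-free control is easy to execute but slow to acquire.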

  7. RL in real-world tasks…
     [Figure: the model-based search tree and the model-free cached Q values for the same toy task, side by side]
     • Scaling problem!
     • Model-based vs. model-free learning and control

  8. Real-world behavior is hierarchical (Hierarchical RL: What is it?)
     [Figure: two everyday task hierarchies. Making coffee: 1. pour coffee, 2. add sugar, 3. add milk, 4. stir. Taking a shower: 1. set water temperature (wait 5 sec; too hot → add cold, too cold → add hot, just right → success), 2. get wet, 3. shampoo, 4. soap, 5. turn off water, 6. dry off]
     • Simplified control, disambiguation, encapsulation

  9. HRL: (in)formal framework (Hierarchical RL: What is it?)
     [Table: an example option, "set water temperature", specified by its initiation set, termination conditions (per-state termination probabilities), and policy (per-state action probabilities)]
     • Options - skills - macros - temporally abstract actions (Sutton, McGovern, Dietterich, Barto, Precup, Singh, Parr…)
     • Termination condition = (sub)goal state
     • Option policy learning: via pseudo-reward (model-based or model-free)
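In code, an option is just the triple the slide lists: an initiation set, a termination condition, and an internal policy. A minimal sketch of that interface (the class and function names are ours, and the example option at the end is purely illustrative rather than the numbers from the slide's table):

```python
import random
from dataclasses import dataclass
from typing import Callable, Set

@dataclass
class Option:
    """A temporally abstract action: where it may start, when it stops,
    and what it does in between."""
    name: str
    initiation_set: Set[str]             # states in which the option may be invoked
    termination: Callable[[str], float]  # state -> probability of terminating there
    policy: Callable[[str], str]         # state -> primitive action to take

def run_option(option, state, step):
    """Execute an option to termination.  `step(state, action) -> (next_state,
    reward)` is the environment; returns the final state and accumulated reward."""
    assert state in option.initiation_set
    total_reward = 0.0
    while True:
        action = option.policy(state)
        state, reward = step(state, action)
        total_reward += reward
        if random.random() < option.termination(state):  # beta(s): stop here?
            return state, total_reward

# Purely illustrative: an option that always moves "L" and terminates only
# once a hypothetical subgoal state "door" is reached.
go_to_door = Option(
    name="go-to-door",
    initiation_set={"S1", "S2", "S3"},
    termination=lambda s: 1.0 if s == "door" else 0.0,
    policy=lambda s: "L",
)
```

Learning an option's internal policy "via pseudo-reward" then amounts to running the same model-based or TD machinery as above, with the pseudo-reward for reaching the subgoal substituted for the task reward.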

  10. HRL: a toy example (Hierarchical RL: What is it?)
      [Figure: a rooms gridworld. S: start, G: goal. Options: going to the doors. Actions: primitive moves + 2 door options]

  11. Advantages of HRL (Hierarchical RL: What is it?)
      1. Faster learning (mitigates the scaling problem)
      2. Transfer of knowledge from previous tasks (generalization, shaping)
      • RL: no longer 'tabula rasa'

  12. Disadvantages (or: the cost) of HRL (Hierarchical RL: What is it?)
      • Need the 'right' options - how to learn them?
      • Suboptimal behavior ("negative transfer"; habits)
      • More complex learning/control structure
      • No free lunches…
