Model-based RL (+ action sequences): maybe it can explain everything

  1. Model-based RL (+ action sequences): maybe it can explain everything. Niv lab meeting, 6/11/2012. Stephanie Chan

  2. goal-directed vs. habitual instrumental actions • Habitual: after extensive training; actions chosen based on previous actions/stimuli; sensorimotor cortices + DLS (putamen); not sensitive to reinforcer devaluation or to changes in action-outcome contingency; usually modeled as model-free RL • Goal-directed: after moderate training; actions chosen based on their expected outcomes; PFC & DMS (caudate); usually modeled as model-based RL (a minimal sketch of the two controllers follows)
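
A minimal sketch of the two controllers just mentioned, assuming a hypothetical one-step task and variable names (not taken from the slides): the model-free controller relies on cached values that change only through delivered rewards, while the model-based controller recomputes values from an outcome model, so devaluation shows up immediately for the latter but not the former.

```python
import numpy as np

# Model-free: cached action values, updated only by experienced reward.
def model_free_update(Q, a, r, alpha=0.1):
    Q[a] += alpha * (r - Q[a])          # delta rule / TD(0) for a one-step task
    return Q

# Model-based: action values recomputed on demand from a model of
# action -> outcome transitions and the current outcome values.
def model_based_values(P_outcome_given_action, outcome_values):
    return P_outcome_given_action @ outcome_values

P = np.eye(2)                           # two actions, each earning its own outcome
outcome_values = np.array([1.0, 1.0])

# Extensive training: cached values converge to the pre-devaluation outcome values.
Q = np.zeros(2)
for _ in range(500):
    for a in (0, 1):
        Q = model_free_update(Q, a, outcome_values[a])

outcome_values[0] = 0.0                 # devalue outcome 0 (e.g., satiety, illness)

print(Q)                                      # cached values unchanged: habit-like
print(model_based_values(P, outcome_values))  # recomputed values drop: goal-directed
```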

  3. goal-directed vs. habitual instrumental actions • What do real animals do?

  4. Model-free RL • Explains resistance to devaluation: devaluation occurs in “extinction”, so there is no feedback and no TD error • Does NOT explain resistance to changes in action-outcome contingency • In fact, habitual behavior should be MORE sensitive to changes in contingency • Maybe: update (learning) rates become small after extended training (toy illustration below)
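
The last bullet can be illustrated with a toy calculation (assumed delta-rule update, not from the slides): with a learning rate that has decayed over training, a cached value adapts much more slowly once the action stops paying off, mimicking habitual insensitivity to a changed contingency.

```python
# Toy illustration: trials needed for a cached value to adapt after the
# action-outcome contingency is removed, as a function of the learning rate.
def trials_to_adapt(alpha, q0=1.0, new_r=0.0, threshold=0.5):
    q, n = q0, 0
    while abs(q - new_r) > threshold:
        q += alpha * (new_r - q)        # TD error now driven by r = 0
        n += 1
    return n

print(trials_to_adapt(alpha=0.10))      # early in training: adapts within a few trials
print(trials_to_adapt(alpha=0.01))      # after extended training: roughly 10x slower
```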

  5. Alternative explanation • We don’t need model-free RL • Habit formation = association of individual actions into “action sequences” • This is more parsimonious • Requires a way of modeling action sequences

  6. Over the course of training • Exploration -> exploitation • Variability -> stereotypy • Errors and RT -> decrease • Individual actions -> “chunked” sequences • PFC + associative striatum -> sensorimotor striatum • “closed loop” -> “open loop”

  7. When should actions get chunked? • Q-learning with dwell time: Q(s,a) = R(s) + E[V(s')] - D(s)<R> • Chunk when costs (possible mistakes) are outweighed by benefits (decreased decision time) • Cost: C(s,a,a') = E[Q(s',a') - V(s')] = E[A(s',a')] • Efficient way to compute this: TD_t = [r_t - d_t<R> + V(s_{t+1})] - V(s_t) is a sample of A(s_t,a_t) • Benefit: (# timesteps saved) x <R> • (cost/benefit sketch below)
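
A rough sketch of the chunking test described on this slide, with hypothetical variable names: the cost of committing to a' open-loop is estimated from advantage samples (collected via the TD error above), and the chunk is formed when the benefit of the saved decision time, valued at the average reward rate <R>, outweighs that cost.

```python
# Sketch of the cost/benefit test for chunking a -> a' (names are assumptions).
def should_chunk(advantage_samples, time_saved, avg_reward_rate):
    """
    advantage_samples : samples of A(s', a') = Q(s', a') - V(s'), collected from
                        TD errors while a' is still being chosen closed-loop
    time_saved        : decision time (in timesteps) saved by running open-loop
    avg_reward_rate   : <R>, the average reward per timestep
    """
    expected_cost = -sum(advantage_samples) / len(advantage_samples)  # A <= 0
    expected_benefit = time_saved * avg_reward_rate
    return expected_benefit > expected_cost
```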

  8. When do they get unchunked? • C(s,a,a') is insensitive to changes in the environment • Primitive actions are no longer evaluated: no TD error, no samples for C • But <R> is sensitive to changes • Action sequences get unchunked when the environment changes so as to decrease <R> • No unchunking if the environment changes to present a better alternative that would increase <R> (sketch below) • Ostlund et al. 2009: rats are immediately sensitive to devaluation of the state that the macro action lands on, but not to devaluation of the intermediate states
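
The asymmetry on this slide can be stated as a small sketch (again with assumed names): once the sequence runs open-loop, the cost estimate C receives no new samples, but <R> keeps updating, so the chunk can only be dissolved by a fall in <R>, not by the appearance of a better alternative elsewhere.

```python
# Sketch of the asymmetric unchunking rule (names are assumptions).
def should_unchunk(stale_cost, time_saved, current_avg_reward_rate):
    # Benefit is re-priced with the current <R>; the cost estimate is frozen
    # because the primitive actions inside the chunk are no longer evaluated.
    benefit = time_saved * current_avg_reward_rate
    return benefit <= stale_cost

# If the environment gets worse, <R> falls, the benefit shrinks, and the
# sequence unchunks. If a better alternative appears instead, the true cost of
# staying committed rises, but stale_cost never learns this, so the chunk persists.
```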

  9. Simulations I: SRTT-like task (serial reaction time task)

  10. Simulations II: Instrumental conditioning • Reinforcer devaluation • Non-contingent reward delivery • Omission (conditions sketched below)
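
For concreteness, a hedged sketch (hypothetical encoding and reward values, not from the slides) of what these three test contingencies typically mean in instrumental-conditioning simulations: how the expected reward of a trial depends on whether the agent performs the trained action.

```python
# Expected reward per trial under each test condition (hypothetical values).
def expected_reward(press, condition):
    if condition == "training":        # reward only if the lever is pressed
        return 1.0 if press else 0.0
    if condition == "devaluation":     # outcome still earned, but now worth nothing
        return 0.0
    if condition == "non-contingent":  # reward delivered regardless of the action
        return 1.0
    if condition == "omission":        # reward delivered only when NOT pressing
        return 0.0 if press else 1.0
    raise ValueError(condition)
```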
