
Reinforcement Learning – Journal club 09.09.2008, Marian Tsanov


Presentation Transcript


  1. Reinforcement Learning Journal club 09. 09. 2008 Marian Tsanov

  2. Predicting Future Reward – Temporal Difference Learning. Actor-Critic Learning. SARSA learning. Q-learning. TD error: where V is the current value function implemented by the critic.
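The TD error formula itself appears on the slide only as an image; the standard form (in the undiscounted, within-trial notation of Dayan and Abbott, cited on slide 6) is

    \delta(t) = r(t) + V(t+1) - V(t)

i.e. the reward just delivered plus the change in the predicted future reward. With a discount factor γ and explicit states this reads δ_t = r_t + γ V(s_{t+1}) − V(s_t).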

  3. Actor-critic methods are TD methods that have a separate memory structure to explicitly represent the policy independently of the value function. The policy structure is known as the actor, because it is used to select actions, and the estimated value function is known as the critic, because it criticizes the actions made by the actor.
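As a concrete illustration of this division of labour (the tabular representation, softmax policy and parameters are assumptions for the example, not details from the talk), a minimal actor-critic step in Python might look like:

    import numpy as np

    # Minimal tabular actor-critic sketch (illustrative parameters).
    n_states, n_actions = 5, 2
    V = np.zeros(n_states)                   # critic: estimated value function
    prefs = np.zeros((n_states, n_actions))  # actor: action preferences (policy)
    alpha_v, alpha_p, gamma = 0.1, 0.1, 0.9
    rng = np.random.default_rng(0)

    def select_action(s):
        # Actor: softmax over its stored action preferences.
        p = np.exp(prefs[s] - prefs[s].max())
        p /= p.sum()
        return int(rng.choice(n_actions, p=p))

    def actor_critic_step(s, a, r, s_next):
        # Critic: TD error computed from its current value estimates.
        delta = r + gamma * V[s_next] - V[s]
        V[s] += alpha_v * delta
        # Actor is "criticized": the preference for the chosen action is
        # strengthened or weakened according to the sign of delta.
        prefs[s, a] += alpha_p * delta
        return delta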

  4. Why is trial-to-trial variability needed for reinforcement learning? • In reinforcement learning there is no "supervisor" that tells a neural circuit what to do with its input. • Instead, a neural circuit has to try out different ways of processing the input until it finds a successful (i.e., rewarded) one. This process is called exploration.
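In models this exploration is usually realized by making action selection stochastic; as a purely illustrative convention (not necessarily the mechanism proposed in the talk), an ε-greedy rule in Python:

    import numpy as np

    def epsilon_greedy(q_values, epsilon=0.1, rng=None):
        # With probability epsilon the circuit "tries out" a random action
        # (exploration); otherwise it exploits the currently best-valued one.
        rng = np.random.default_rng() if rng is None else rng
        if rng.random() < epsilon:
            return int(rng.integers(len(q_values)))
        return int(np.argmax(q_values))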

  5. The trial-by-trial learning rule known as the Rescorla-Wagner rule. Here ε is the learning rate, which can be interpreted in psychological terms as the associability of the stimulus with the reward.
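The rule's equation was shown on the slide as an image; in the standard formulation (e.g. as given by Dayan and Abbott, cited on the next slide) the expected reward is predicted from the stimulus vector u as v = w · u, and the weights are updated trial by trial according to

    w \rightarrow w + \epsilon\,\delta\,u, \qquad \delta = r - v

so learning is driven by the prediction error δ between the delivered reward r and the prediction v, scaled by the learning rate ε.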

  6. Dayan and Abbott, 2000

  7. Foster, Morris and Dayan, 2000

  8. The prediction error δ plays an essential role in both the Rescorla-Wagner and temporal difference learning rules; biologically, this error signal is thought to be carried by dopamine neurons of the ventral tegmental area (VTA).

  9. Actor-critic model and reinforcement learning circuits (Barnes et al. 2005).

  10. In search of a critic – the striato-nigral problem. Proposed actor: matrisomes of the dorsal striatum. Proposed critic: striosomes of the dorsal striatum.

  11. Diagram: critic circuit and actor circuit. Labels include: LC, AC, sensorimotor cortex (SMC), amygdala, hippocampus, prefrontal cortex; Hebbian weights w with target r; ventral striatum (nucleus accumbens, VS); dorsal striatum (coincidence detector, DS); dopamine; ventral pallidum (VP); substantia nigra pars compacta (SNc); dorsal pallidum (action, DP); lateral hypothalamus (sensory-driven reward, LH); pedunculopontine tegmental nucleus (PPTN).

  12. Evidence for interaction between learning systems across regions (SNc, striosome). Schultz et al. 1993; DeCoteau et al. 2007.

  13. Q-learning
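Since the slide gives only the title, here is the standard tabular Q-learning update as a hedged sketch, with the on-policy SARSA update mentioned earlier shown for contrast; the table sizes and parameters are illustrative assumptions:

    import numpy as np

    n_states, n_actions = 10, 4
    Q = np.zeros((n_states, n_actions))
    alpha, gamma = 0.1, 0.9   # illustrative learning rate and discount factor

    def q_learning_update(s, a, r, s_next):
        # Off-policy target: best available action value in the next state.
        target = r + gamma * Q[s_next].max()
        Q[s, a] += alpha * (target - Q[s, a])

    def sarsa_update(s, a, r, s_next, a_next):
        # On-policy (SARSA) target: value of the action actually taken next.
        target = r + gamma * Q[s_next, a_next]
        Q[s, a] += alpha * (target - Q[s, a])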

  14. The need for multiple critics/actors – adaptive state aggregation.
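One way to caricature such a multi-actor architecture (purely an illustrative assumption; the slides do not spell out the implementation) is a bank of independent actors plus a selector that favours whichever actor has recently performed best, echoing the performance tracking discussed on slide 18:

    import numpy as np

    class ActorBank:
        """Illustrative pool of independent actors whose performance is tracked."""
        def __init__(self, n_actors, n_states, n_actions, seed=0):
            self.Q = np.zeros((n_actors, n_states, n_actions))  # one table per actor
            self.recent_reward = np.zeros(n_actors)  # leaky estimate of each actor's success
            self.rng = np.random.default_rng(seed)

        def pick_actor(self, temperature=0.5):
            # Softmax over tracked performance: recently successful actors are
            # selected more often, giving a bias toward previously used actors.
            p = np.exp(self.recent_reward / temperature)
            p /= p.sum()
            return int(self.rng.choice(len(p), p=p))

        def update_performance(self, actor, r, decay=0.9):
            # Running (leaky) average of reward obtained while this actor was in control.
            self.recent_reward[actor] = decay * self.recent_reward[actor] + (1 - decay) * r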

  15. Neuronal activity when shifting the cue modality. Could current models explain these data? Plain Q-learning? SARSA? A single actor-critic?

  16. Switch between actors (mean ± se):
  First phase (tones): N = 2: 0.76 ± 0.12; N = 3: 0.64 ± 0.09; N = 4: 0.62 ± 0.09.
  Second phase (textures): N = 2: 1.06 ± 0.07; N = 3: 1.06 ± 0.10; N = 4: 1.05 ± 0.09.
  Actors used in the end: N = 3: 2.64 ± 0.05; N = 4: 2.96 ± 0.06.

  17. 4 ACTORS

  18. If the cortex/striatum can track the performance of the actors, then after the transfer there might be an initial bias toward the previously used actors (here this was implemented randomly). In that case the performance should be closer to the results with N = 2, even if more actors are available.

  19. Motivation: how is the knowledge transferred to the second cue? • What is a "state" in reinforcement learning? A point from which you are free to choose your next action, leading into other states. • The representation of the environment should change. • State aggregation. • The knowledge transfer problem. • State aggregation and sequence learning.

  20. Diagram: sequence learning via theta-dependent STDP plasticity. Labels (repeated across chained modules): SMC with Hebbian weights w and target; DS (coincidence detector) with STDP weights w and target; DP (action). DeCoteau et al. 2007.

  21. Unsupervised theta-dependent STDP (panels A and B).

  22. Actor aggregation in the dorsal striatum: LC/AC and SMC inputs, before learning and after learning (audio cue).

  23. Algorithm • Adaptive combination of states • Knowledge Transfer: Keep the learned states • Number of activated states.
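The transcript does not give the algorithm's details, so the following Python sketch is only an illustration of the idea: merge states whose learned values are similar, and initialise the merged table from what was already learned before the cue switch (the tolerance and averaging scheme are assumptions):

    import numpy as np

    def aggregate_states(V, tol=0.05):
        # Greedily merge states whose learned values lie within tol of a
        # group representative (an illustrative aggregation criterion).
        groups = []      # representative value per aggregate state
        mapping = {}     # original state -> aggregate index
        for s, v in enumerate(V):
            for g, gv in enumerate(groups):
                if abs(v - gv) < tol:
                    mapping[s] = g
                    break
            else:
                mapping[s] = len(groups)
                groups.append(v)
        return mapping

    def transfer_values(V_old, mapping):
        # Knowledge transfer: initialise the aggregated value table from the
        # values already learned in the first phase, instead of from zero.
        n_groups = max(mapping.values()) + 1
        V_new = np.zeros(n_groups)
        counts = np.zeros(n_groups)
        for s, g in mapping.items():
            V_new[g] += V_old[s]
            counts[g] += 1
        return V_new / np.maximum(counts, 1)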

  24. State aggregation: the knowledge transfer effect (plot of average running steps against trial number, comparing no aggregation with state aggregation).

  25. State Number Reduction

  26. Conclusion: state aggregation – link to the learned actor. • Multiple motor layers and higher-level decision making. • State aggregation: a change to abstract states of the motion selector. • Sequential learning. • Learning pattern generator.
