
Actor-Critic models: from ventral striatal reward-related activity to robotics simulations.


Presentation Transcript


  1. Actor-Critic models: from ventral striatal reward-related activity to robotics simulations. Dr. Mehdi Khamassi (1,2). (1) LPPA, UMR CNRS 7152, Collège de France, Paris. (2) AnimatLab-LIP6 / SIMA-ISIR, Université Pierre et Marie Curie, Paris 6.

  2. Intro - Objective. • Help to understand how mammals can adapt their behavior in order to maximize the reward obtained from the environment. • Help to understand the brain mechanisms underlying these cognitive processes.

  3. Intro - Objective. • A challenging goal: different levels of decision, different learning processes, different types of representation. • A multidisciplinary approach: behavioral neurophysiology, computational modelling, autonomous robotics.

  4. Intro - The Actor-Critic model. • Critic: learns to predict reward. • Actor: learns to select actions. • Developed in the AI community (reinforcement learning). • Explains some reward-seeking behaviors. • Resembles parts of the brain (dopaminergic neurons and striatum).

  5. Intro - Outline. 1. Introduction: how does an Actor-Critic model work? 2. Electrophysiology: reward predictions in the rat ventral striatum. 3. Computational modelling: an Actor-Critic model in a simulated robot. 4. Discussion.

  6. Intro - The Actor-Critic model: learning from reward. [Figure: maze with five numbered candidate actions (1-5) and a reward location.]

  7. Intro - The Actor-Critic model: learning from reward. [Figure: same maze; the obtained reward produces a reinforcement signal.]

  8. Intro - The Actor-Critic model: learning from reward. The Critic computes a reward prediction P(t-1); the difference between the obtained reward and this prediction serves as the reinforcement signal (Rescorla and Wagner, 1972). [Figure: maze with five numbered actions; reward, prediction, and reinforcement pathways.]
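
The slide only names the prediction P(t-1) and the reinforcement; as a rough illustration of the Rescorla-Wagner idea it refers to, here is a minimal sketch (not from the talk) in which the prediction is nudged toward the obtained reward by a fraction of the prediction error. The learning-rate value is arbitrary.

```python
def rescorla_wagner_update(P, reward, alpha=0.1):
    """Move the reward prediction P toward the obtained reward.

    alpha is the learning rate (illustrative value); the prediction
    error (reward - P) plays the role of the reinforcement signal.
    """
    return P + alpha * (reward - P)

# Repeated rewarded trials drive the prediction toward 1.
P = 0.0
for trial in range(30):
    P = rescorla_wagner_update(P, reward=1.0)
print(round(P, 3))  # close to 1
```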

  9. Intro - The Actor-Critic model: Temporal-Difference (TD) learning. The reinforcement signal r̂ compares the obtained reward plus the new prediction P(t) with the previous prediction P(t-1) (Sutton and Barto, 1998). [Figure: maze with five numbered actions; Critic predictions P(t-1) and P(t) feeding the reinforcement signal.]
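
A minimal sketch of the TD reinforcement signal named on the slide, r̂ = r + γ·P(t) - P(t-1), and of how the same signal can train both Critic and Actor. The linear Critic, the preference-based Actor update, the feature vectors, and the learning rates below are illustrative assumptions, not the specific model of the talk.

```python
import numpy as np

GAMMA, LR_CRITIC, LR_ACTOR = 0.95, 0.1, 0.1   # illustrative values

def td_reinforcement(r, P_prev, P_curr, gamma=GAMMA):
    """TD error: obtained reward plus new prediction minus previous prediction."""
    return r + gamma * P_curr - P_prev

def actor_critic_update(w_critic, w_actor, s_prev, s_curr, action, r):
    """One learning step for a linear Critic and a preference-based Actor.

    s_prev, s_curr: state feature vectors; w_actor has one weight vector per action.
    """
    P_prev = float(w_critic @ s_prev)
    P_curr = float(w_critic @ s_curr)
    r_hat = td_reinforcement(r, P_prev, P_curr)
    w_critic += LR_CRITIC * r_hat * s_prev          # improve the prediction
    w_actor[action] += LR_ACTOR * r_hat * s_prev    # reinforce the chosen action
    return r_hat
```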

  10. Intro - The Actor-Critic model: analogy with dopaminergic neurons. [Figure: stimulus S followed by reward R; the reinforcement signal at reward time equals +1 (unpredicted reward).] Romo and Schultz (1990); Houk et al. (1995); Schultz et al. (1997).

  11. Intro - The Actor-Critic model: analogy with dopaminergic neurons (continued). [Figure: same configuration; reinforcement signal +1.] Romo and Schultz (1990); Houk et al. (1995); Schultz et al. (1997).

  12. Intro - The Actor-Critic model: analogy with dopaminergic neurons (continued). [Figure: stimulus S followed by reward R; the reinforcement signal equals 0 once the reward is fully predicted.] Romo and Schultz (1990); Houk et al. (1995); Schultz et al. (1997).

  13. Intro - The Actor-Critic model: analogy with dopaminergic neurons (continued). [Figure: stimulus S, predicted reward omitted; the reinforcement signal equals -1.] Romo and Schultz (1990); Houk et al. (1995); Schultz et al. (1997).
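
A worked illustration (assumed numbers, not recordings) of why this TD error behaves like the dopamine responses sketched on slides 10-13: positive for an unpredicted reward, zero for a fully predicted one, and negative when a predicted reward is omitted.

```python
def td_reinforcement(r, P_prev, P_curr, gamma=1.0):
    return r + gamma * P_curr - P_prev

# Unpredicted reward, before learning: positive signal (burst), +1.
print(td_reinforcement(r=1.0, P_prev=0.0, P_curr=0.0))
# Fully predicted reward, after learning: no signal at reward time, 0.
print(td_reinforcement(r=1.0, P_prev=1.0, P_curr=0.0))
# Predicted reward omitted: negative signal (dip), -1.
print(td_reinforcement(r=0.0, P_prev=1.0, P_curr=0.0))
```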

  14. Intro - Actor-Critic models of the dopaminergic system. [Figure: Actor-Critic architecture with the reinforcement signal carried by a dopaminergic-neuron-like unit.] Barto (1995); Houk et al. (1995); Montague et al. (1996); Schultz et al. (1997); Berns and Sejnowski (1996); Suri and Schultz (1999); Doya (2000); Suri et al. (2001); Baldassarre (2002); see Joel et al. (2002) for a review.

  15. Intro - Actor-Critic models: a learning example. [Figure: states L and E, with reward r = 0 at L and r = 1 at E; every prediction is P = 0 before learning.]

  16. Intro - Actor-Critic models: a learning example (continued). [Figure: after learning begins, the prediction associated with E becomes P = 1; the other predictions are still P = 0.]

  17. Intro - Actor-Critic models: a learning example (continued). [Figure: with further learning an earlier prediction also becomes P = 1: the reward prediction propagates backward along the sequence.]
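
A hedged worked example of the progression shown on slides 15-17: with TD learning, the prediction first builds up at the rewarded state E and then propagates backward to the earlier state L. The learning rate and the number of trials are arbitrary.

```python
# Two successive states, L then E; reward r = 1 is delivered at E, r = 0 at L.
alpha, gamma = 0.5, 1.0
P = {"L": 0.0, "E": 0.0}

for trial in range(8):
    # Step L -> E, no reward yet: the error uses the prediction available at E.
    P["L"] += alpha * (0.0 + gamma * P["E"] - P["L"])
    # Reward delivered at E, end of the trial (no successor state).
    P["E"] += alpha * (1.0 - P["E"])
    print(trial + 1, round(P["L"], 2), round(P["E"], 2))
# P["E"] rises first; P["L"] follows, so the reinforcement moves to the earlier state.
```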

  18. Intro - The rat brain. [Figure: rat brain anatomy, adapted from Tierney (2006).]

  19. Intro - The striatum. [Figure: anatomy of the striatum, adapted from Voorn et al. (2004).]

  20. Intro - The striatum. [Figure: proposed mapping of the Actor-Critic onto the brain: Critic = ventral striatum, Actor = dorsal striatum (selecting actions), with the reinforcement signal carried by dopaminergic neurons of the VTA / SNc.] (Barto, 1995; Houk et al., 1995; Montague et al., 1996; Schultz et al., 1997; Doya et al., 2002; O'Doherty et al., 2004)

  21. Intro - The striatum. • Learning based on reward prediction in the ventral striatum: in the monkey (Hikosaka et al., 1989; Hollerman et al., 1998; Kawagoe et al., 1998; Hassani et al., 2001; Cromwell and Schultz, 2003) and in the rat (Carelli et al., 2000; Daw et al., 2002; Setlow et al., 2003; Nicola et al., 2004; Wilson and Bowman, 2005). • ...driven by dopaminergic reinforcement signals (Schultz et al., 1992; Satoh et al., 2003; Nakahara et al., 2004). • ...and modelled by Temporal-Difference (TD) learning (Barto, 1995; Houk et al., 1995; Schultz et al., 1997; Doya et al., 2002).

  22. Intro - The striatum. • ...using precisely timed reward predictions in TD-learning (Montague et al., 1996; Suri and Schultz, 2001; Perez-Uribe, 2001; Alexander and Sporns, 2002). [Figure: simulation of a TD-learning model compared with activity recorded in the monkey striatum; adapted from Suri and Schultz (2001).]
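
The timing models cited here (Montague et al., 1996; Suri and Schultz, 2001) typically rely on a "complete serial compound" or tapped-delay-line representation: each stimulus spawns one feature per elapsed time step, so a linear Critic can attach a prediction to a specific delay. A minimal sketch of that representation, with an arbitrary temporal resolution:

```python
import numpy as np

N_STEPS = 20  # temporal resolution of the representation (illustrative)

def serial_compound(stimulus_onsets, t, n_steps=N_STEPS):
    """Binary feature vector with one unit per (stimulus, elapsed delay) pair.

    stimulus_onsets: dict mapping stimulus name -> onset time step.
    """
    names = sorted(stimulus_onsets)
    x = np.zeros(len(names) * n_steps)
    for i, name in enumerate(names):
        delay = t - stimulus_onsets[name]
        if 0 <= delay < n_steps:
            x[i * n_steps + delay] = 1.0  # "this stimulus was seen `delay` steps ago"
    return x

# Example: with a cue at t = 0, the feature "cue seen 5 steps ago" is active at t = 5,
# so a reward that reliably arrives 5 steps after the cue can be predicted precisely.
x = serial_compound({"cue": 0}, t=5)
```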

  23. Electrophysiology - Methods. • Recordings in the rat ventral striatum (VS). • Single electrodes.

  24. Electrophysiology - Behavioral methods. The plus-maze task.

  25. Electrophysiology - Behavioral methods. The plus-maze task. [Figure: trial timeline from center departure to reward-box arrival, with running and immobile periods marked.]

  26. Electrophysiology - Results. • 170 neurons recorded. • 91 neurons with behavioral correlates. [Figure: example activity aligned on departure, center crossing, and arrival over time.]

  27. Electrophysiology - Results: reward anticipation. Ventral striatal neuron: activity anticipating each reward droplet, independent of locomotor behavior. Khamassi, Mulder et al. (in revision), J Neurophysiol.

  28. Electrophysiology - Results: reward anticipation (second example). Ventral striatal neuron: activity anticipating each reward droplet, independent of locomotor behavior. Khamassi, Mulder et al. (in revision), J Neurophysiol.

  29. Electrophysiology - Results: reward anticipation (continued). Ventral striatal neuron: activity anticipating each reward droplet, independent of locomotor behavior, including anticipation of an extra reward. Khamassi, Mulder et al. (in revision), J Neurophysiol.

  30. Electrophysiology - Modelling the anticipatory activity with TD-learning: results. Temporal representation of stimuli (Montague et al., 1996). Limitations: incomplete temporal representation; ambiguous visual input; no spatial information. [Figure: simulated anticipatory activity for trials with 1, 3, 5, and 7 reward droplets.]

  31. Electrophysiology - Modelling with TD-learning: results (continued). Temporal representation of stimuli (Montague et al., 1996). Limitations: incomplete temporal representation; ambiguous visual input (the context after the last droplet is the same as during droplet delivery); no spatial information. [Figure: simulated anticipatory activity for trials with 1, 3, 5, and 7 reward droplets.]

  32. Electrophysiology - Modelling with TD-learning: results (continued). Same limitations: incomplete temporal representation; ambiguous visual input; no spatial information.

  33. Electrophysiology - Modelling with TD-learning: results (continued). Same limitations: incomplete temporal representation; ambiguous visual input; no spatial information.
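
To make the point of slides 30-33 concrete, here is a hedged toy simulation: TD learning over a delay-line representation triggered by arrival at the reward box, with seven reward droplets at fixed delays. After training, the learned prediction is elevated throughout the droplet period, qualitatively like the anticipatory activity of slides 27-29. All timings and parameter values are invented for illustration and are not the model of the talk.

```python
import numpy as np

N_STEPS, ALPHA, GAMMA = 40, 0.2, 0.98
DROPLET_TIMES = {10, 14, 18, 22, 26, 30, 34}   # 7 droplets at assumed fixed delays
w = np.zeros(N_STEPS)                          # Critic weights over delay features

def features(t):
    """Delay-line state: 'the rat arrived at the box t steps ago'."""
    x = np.zeros(N_STEPS)
    x[t] = 1.0
    return x

for trial in range(300):
    for t in range(N_STEPS - 1):
        r = 1.0 if (t + 1) in DROPLET_TIMES else 0.0
        delta = r + GAMMA * (w @ features(t + 1)) - (w @ features(t))
        w += ALPHA * delta * features(t)

# w[t], the prediction at delay t, is high throughout the droplet period
# and falls off after the last droplet.
```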

  34. Electrophysiology. • TD-learning could reproduce the neural anticipatory activity. • Can it reproduce the rat's locomotor behavior in the same task? Khamassi, Mulder et al. (in revision), J Neurophysiol.

  35. Modelling - Autonomous robotics: methods. • Virtual plus-maze. [Figure: the simulated robot receives visual perceptions and reward signals and emits actions.]

  36. Modelling - Autonomous robotics: methods. • Virtual plus-maze. [Figure: visual perceptions and reward as inputs; five numbered actions (1-5) as outputs.]

  37. Modelling - Autonomous robotics: expected results. [Figure: maze with the five numbered actions and the expected reward location.]

  38. Modelling - Autonomous robotics: methods. Actor-Critic models in the literature: • Simplistic Actor. • Most often: discrete environments. [Figure: Actor-Critic architecture with a dopaminergic-neuron-like reinforcement signal.] Barto (1995); Houk et al. (1995); Montague et al. (1996); Schultz et al. (1997); Berns and Sejnowski (1996); Suri and Schultz (1999); Doya (2000); Suri et al. (2001); Baldassarre (2002); see Joel et al. (2002) for a review.

  39. Modelling - Autonomous robotics: methods (continued). • Continuous environments require coordinating several modules: a gating network (Baldassarre, 2002; Doya et al., 2002), or hand-tuned coordination that is independent of the modules' performance (Suri and Schultz, 2001).

  40. Modelling - Autonomous robotics: methods (continued). • Goal here: test these coordination principles within a common framework.

  41. Modelling - Autonomous robotics: methods. • Implemented framework.

  42. Modelling - Autonomous robotics: methods. Gurney, Prescott and Redgrave (2001), adapted by Girard et al. (2002; 2003).

  43. Modelling - Autonomous robotics: methods. Module coordination.

  44. Modelling - Autonomous robotics: methods. Module coordination, option 1: gating network (tests the modules' capacity for state prediction); a rough sketch follows below.
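
The slides do not give the gating equations; as a rough sketch of what a gating network of this kind usually does (in the spirit of Baldassarre, 2002, and Doya et al., 2002), each module earns responsibility in proportion to how well it currently predicts, and the responsibilities weight the mixed prediction. The softmax form and the temperature below are assumptions, not the model presented in the talk.

```python
import numpy as np

def responsibilities(prediction_errors, temperature=1.0):
    """Softmax over negative squared prediction errors: accurate modules dominate."""
    score = -np.square(prediction_errors) / temperature
    e = np.exp(score - score.max())         # numerically stable softmax
    return e / e.sum()

def gated_prediction(module_predictions, resp):
    """Global Critic output: responsibility-weighted mixture of module predictions."""
    return float(np.dot(resp, module_predictions))

# Example with three hypothetical modules: the most accurate one (smallest error)
# takes most of the responsibility for the current region of the environment.
resp = responsibilities(np.array([0.1, 0.8, 0.4]))
P = gated_prediction(np.array([0.9, 0.2, 0.5]), resp)
```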

  45. Modelling - Autonomous robotics: methods. Module coordination, option 2: hand-tuned (independent of the modules' performance). [Figure: visual perceptions are categorized before reaching the modules; reward input shown.]

  46. Modelling - Autonomous robotics: methods. Module coordination, option 3: unsupervised categorization (Self-Organizing Maps); a minimal sketch follows below.
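
The slides do not detail the Self-Organizing Map; a minimal sketch of the idea (a small Kohonen map that turns continuous visual-perception vectors into discrete categories the Actor-Critic modules can use as states). Map size, learning rate, and neighbourhood width are illustrative.

```python
import numpy as np

class TinySOM:
    """Minimal 1-D Kohonen map clustering inputs into discrete categories."""

    def __init__(self, n_units, dim, lr=0.1, sigma=1.0, seed=0):
        self.w = np.random.default_rng(seed).random((n_units, dim))
        self.lr, self.sigma = lr, sigma

    def winner(self, x):
        return int(np.argmin(np.linalg.norm(self.w - x, axis=1)))

    def train_step(self, x):
        b = self.winner(x)
        grid_dist = np.abs(np.arange(len(self.w)) - b)     # distance to winner on the map
        h = np.exp(-grid_dist**2 / (2 * self.sigma**2))    # neighbourhood function
        self.w += self.lr * h[:, None] * (x - self.w)      # pull units toward the input
        return b  # index of the winning unit = discrete category / state

# Example: categorize hypothetical 16-dimensional visual perceptions into 8 states.
som = TinySOM(n_units=8, dim=16)
state = som.train_step(np.random.default_rng(1).random(16))
```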

  47. Modelling - Autonomous robotics: methods. Module coordination, option 4: random robot.

  48. Modelling - Autonomous robotics: results. [Figure: average performance across the four coordination methods.]

  49. Modelling - Autonomous robotics: results. Number of iterations required (average performance measured during the second half of the experiment):

  Coordination method | Iterations required
  1. Gating network | 3,500
  2. Hand-tuned | 94
  3. Unsupervised categorization (SOM) | 404
  4. Random robot | 30,000

  50. Modelling - Autonomous robotics: results (continued): same comparison as on slide 49.
