Actor-Critic models: from ventral striatal reward-related activity to robotics simulations
Dr. Mehdi Khamassi 1,2
1 LPPA, UMR CNRS 7152, Collège de France, Paris
2 AnimatLab-LIP6 / SIMA-ISIR, Université Pierre et Marie Curie, Paris 6
Intro: Objective
• Help understand how mammals adapt their behavior in order to maximize reward obtained from the environment.
• Help understand the brain mechanisms underlying these cognitive processes.
Intro: Objective
• A challenging goal: different levels of decision, different learning processes, different types of representation.
• A multidisciplinary approach: behavioral neurophysiology, computational modelling, autonomous robotics.
Intro: The Actor-Critic model
CRITIC: learns to predict reward. ACTOR: learns to select actions.
• Developed in the AI community (Reinforcement Learning).
• Explains some reward-seeking behaviors.
• Resembles some parts of the brain (dopaminergic neurons & striatum).
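The division of labor on this slide can be sketched in code. This is a minimal illustration only, not the model used in this work: the class names (`Critic`, `Actor`), the learning rate `lr`, and the tabular state space are all simplifying assumptions.

```python
import random

class Critic:
    """Learns to predict reward from the current state (a value function)."""
    def __init__(self, n_states, lr=0.1):
        self.values = [0.0] * n_states
        self.lr = lr

    def predict(self, state):
        return self.values[state]

    def update(self, state, td_error):
        # Move the prediction toward the observed outcome.
        self.values[state] += self.lr * td_error

class Actor:
    """Learns to select actions, reinforced by the Critic's error signal."""
    def __init__(self, n_states, n_actions, lr=0.1):
        self.prefs = [[0.0] * n_actions for _ in range(n_states)]
        self.lr = lr

    def select(self, state):
        # Greedy selection with random tie-breaking (a softmax is also common).
        prefs = self.prefs[state]
        best = max(prefs)
        return random.choice([a for a, p in enumerate(prefs) if p == best])

    def update(self, state, action, td_error):
        # Actions followed by positive surprise become more likely.
        self.prefs[state][action] += self.lr * td_error
```

The same scalar error signal trains both components, which is the key property exploited by the dopamine analogy later in the talk.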
Intro: Outline
1. Introduction: How does an Actor-Critic model work?
2. Electrophysiology: Reward predictions in the rat ventral striatum
3. Computational modelling: An Actor-Critic model in a simulated robot
4. Discussion
Intro: The Actor-Critic model. Learning from reward
[Figure: plus-maze with five numbered actions (1-5); reward delivered at the goal arm]
• Reward prediction P(t-1), updated from the reward actually received (Rescorla and Wagner, 1972).
• Temporal-Difference (TD) learning: the reinforcement signal ȓ is computed from the reward and the successive predictions P(t-1) and P(t) (Sutton and Barto, 1998).
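The two learning rules can be written side by side. A minimal sketch, with a discount factor `GAMMA` chosen purely for illustration: the Rescorla-Wagner error compares the reward with the prediction made just before it, while the TD error additionally lets the next prediction stand in for future reward, so learning can propagate backward through time before the reward itself arrives.

```python
GAMMA = 0.98  # discount factor (illustrative value)

def rescorla_wagner_error(reward, prediction):
    """Rescorla-Wagner (1972): error between actual and predicted reward."""
    return reward - prediction

def td_error(reward, prediction_t, prediction_t_minus_1):
    """Sutton & Barto (1998): the next prediction P(t) stands in for
    future reward, on top of the reward received at this step."""
    return reward + GAMMA * prediction_t - prediction_t_minus_1
```

When the transition is fully anticipated (P(t-1) already equals the discounted P(t) plus the reward), the TD error is zero and learning stops.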
Intro: The Actor-Critic model. Analogy with dopaminergic neurons
[Figure: reinforcement signal following stimulus S and reward R]
• Unpredicted reward: reinforcement +1.
• Fully predicted reward: reinforcement 0.
• Predicted reward omitted: reinforcement -1.
Romo & Schultz (1990); Houk et al. (1995); Schultz et al. (1997).
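The three cases on this slide fall directly out of the TD reinforcement signal. A toy check, with the discount factor set to 1 for readability:

```python
def td_error(reward, next_prediction, prediction):
    # TD reinforcement signal, with discount factor 1 for simplicity.
    return reward + next_prediction - prediction

# Unpredicted reward: nothing was expected, yet reward arrives.
assert td_error(reward=1, next_prediction=0, prediction=0) == 1
# Fully predicted reward: the reward exactly cancels the prediction.
assert td_error(reward=1, next_prediction=0, prediction=1) == 0
# Predicted reward omitted: a negative signal at the expected time.
assert td_error(reward=0, next_prediction=0, prediction=1) == -1
```

These are the same three signatures reported for dopaminergic neurons in the cited studies.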
Intro: Actor-Critic models
[Figure: Actor-Critic architecture; the reinforcement signal plays the role of the dopaminergic neuron]
Barto (1995); Houk et al. (1995); Montague et al. (1996); Schultz et al. (1997); Berns and Sejnowski (1996); Suri and Schultz (1999); Doya (2000); Suri et al. (2001); Baldassarre (2002); see Joel et al. (2002) for a review.
Intro: Actor-Critic models
[Figure: two successive states L and E; r = 0 in L and r = 1 in E; all predictions start at P = 0. Learning first sets P = 1 in E, where the reward occurs, and the prediction then propagates backward so that P = 1 in L as well.]
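The backward propagation of the prediction shown in the figure can be reproduced in a few lines. A toy sketch, assuming two successive states visited on every trial and a learning rate chosen only for illustration:

```python
def train(n_trials=50, lr=0.5):
    # Two successive states: L then E; reward r = 1 is delivered in E.
    P = {"L": 0.0, "E": 0.0}
    for _ in range(n_trials):
        # In E the reward is received: P["E"] moves toward r = 1.
        P["E"] += lr * (1.0 - P["E"])
        # In L there is no reward (r = 0), but the prediction made in E
        # acts as the reinforcement, transferring value backward to L.
        P["L"] += lr * (P["E"] - P["L"])
    return P
```

After training, both predictions approach 1, even though the reward is only ever delivered in E.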
Intro: The rat brain
[Figure adapted from Tierney (2006)]
Intro: The striatum
[Figure adapted from Voorn et al. (2004)]
Intro: The striatum
CRITIC: ventral striatum. ACTOR: dorsal striatum, projecting to actions. Both are reinforced by dopaminergic neurons (VTA / SNc).
(Barto, 1995; Houk et al., 1995; Montague et al., 1996; Schultz et al., 1997; Doya et al., 2002; O'Doherty et al., 2004)
Intro: The striatum
• Learning based on reward prediction in the ventral striatum. In the monkey: (Hikosaka et al., 1989; Hollerman et al., 1998; Kawagoe et al., 1998; Hassani et al., 2001; Cromwell and Schultz, 2003). In the rat: (Carelli et al., 2000; Daw et al., 2002; Setlow et al., 2003; Nicola et al., 2004; Wilson and Bowman, 2005).
• ... driven by dopamine reinforcements (Schultz et al., 1992; Satoh et al., 2003; Nakahara et al., 2004).
• ... modelled by Temporal Difference (TD) learning (Barto, 1995; Houk et al., 1995; Schultz et al., 1997; Doya et al., 2002).
Intro: The striatum
• ... using precisely timed reward predictions in TD-learning (Montague et al., 1996; Suri and Schultz, 2001; Perez-Uribe, 2001; Alexander and Sporns, 2002).
[Figure: simulation of a TD-learning model compared with activity recorded from the monkey striatum. Adapted from Suri and Schultz (2001).]
Electrophysiology: Methods
• Recording in the rat ventral striatum (VS)
• Simple electrodes
Electrophysiology: Behavioral methods
The plus-maze task
[Figure: trial timeline from box arrival to center departure, distinguishing running and immobile periods]
Electrophysiology: Results
• 170 neurons recorded
• 91 neurons with behavioral correlates
[Figure: activity aligned on departure, center crossing, and arrival]
Electrophysiology: Results. Reward anticipation
Ventral striatal neuron:
• Activity anticipating each reward droplet.
• Independent of locomotor behavior.
• Anticipation of an extra reward.
Khamassi, Mulder et al. (in revision) J Neurophysiol.
Electrophysiology: Modelling with TD-learning. Results
[Figure: simulated anticipatory activity for 1, 3, 5 and 7 reward droplets]
Temporal representation of stimuli (Montague et al., 1996). Limitations of the simulation:
• Incomplete temporal representation.
• Ambiguous visual input: the context after the last droplet is the same as during droplet delivery.
• No spatial information.
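The temporal stimulus representation of Montague et al. (1996) is often implemented as a tapped delay line (a "complete serial compound"): the stimulus is recoded as a vector with one component per elapsed time step, so each delay gets its own learnable prediction weight. A minimal sketch; the function names and the 10-step horizon are illustrative assumptions.

```python
def serial_compound(t_since_stimulus, n_timesteps=10):
    """One-hot vector marking the time elapsed since stimulus onset,
    so each delay has its own prediction weight to learn."""
    x = [0.0] * n_timesteps
    if 0 <= t_since_stimulus < n_timesteps:
        x[t_since_stimulus] = 1.0
    return x

def predict(weights, x):
    # Linear prediction: one weight per delay component.
    return sum(w * xi for w, xi in zip(weights, x))
```

The limitations listed above follow directly from this scheme: the representation stops at `n_timesteps` (incomplete), and two moments with the same delay vector are indistinguishable, regardless of visual context or spatial position.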
Electrophysiology
• TD-learning could reproduce the neural anticipatory activity.
• Can it also reproduce the rat's locomotor behavior in the same task?
Khamassi, Mulder et al. (in revision) J Neurophysiol.
Modelling: Autonomous robotics. Methods
• Virtual plus-maze: the simulated robot receives visual perceptions as input, chooses among five actions (1-5), and is rewarded at the goal.
[Figure: virtual plus-maze with the five actions and the expected reward]
Modelling: Autonomous robotics. Methods
• Existing Actor-Critic models:
• Simplistic Actor.
• Most often: discrete environments.
• Continuous environments: coordination of modules, either by a gating network (Baldassarre, 2002; Doya et al., 2002) or hand-tuned, independent of the modules' performances (Suri and Schultz, 2001).
• Aim: test these principles within a common framework.
Barto (1995); Houk et al. (1995); Montague et al. (1996); Schultz et al. (1997); Berns and Sejnowski (1996); Suri and Schultz (1999); Doya (2000); Suri et al. (2001); Baldassarre (2002); see Joel et al. (2002) for a review.
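The gating-network idea can be sketched as a mixture of experts: each module makes its own prediction, and control is shared in proportion to how well each module predicts. This is an illustrative sketch in the spirit of Baldassarre (2002) and Doya et al. (2002), not their exact formulation; the temperature parameter and function names are assumptions.

```python
import math

def gate(prediction_errors, temperature=1.0):
    """Soft responsibilities: modules with a smaller prediction error
    get a larger share of control (softmax over negative error)."""
    scores = [math.exp(-abs(e) / temperature) for e in prediction_errors]
    total = sum(scores)
    return [s / total for s in scores]

def mixed_prediction(predictions, responsibilities):
    # Overall prediction: responsibility-weighted mix of the modules.
    return sum(p * g for p, g in zip(predictions, responsibilities))
```

The hand-tuned alternative replaces `gate` with fixed responsibilities that ignore the modules' actual performance, which is exactly the distinction tested in the experiments below.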
Modelling: Autonomous robotics. Methods
• Implemented framework
Modelling: Autonomous robotics. Methods
Gurney, Prescott & Redgrave (2001), adapted by Girard et al. (2002; 2003).
Modelling: Autonomous robotics. Methods. Module coordination
Four variants compared:
1. Gating network (tests the modules' capacity for state prediction).
2. Hand-tuned coordination (independent of the modules' performance), based on a categorization of the visual perceptions.
3. Unsupervised categorization (Self-Organizing Maps).
4. Random robot.
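Variant 3 replaces the hand-tuned categorization with a Self-Organizing Map. The sketch below keeps only the winner-take-all core: for brevity it updates only the best-matching unit and omits the neighborhood update a full SOM uses, which makes it closer to online k-means. All names and parameters are illustrative, and inputs are reduced to one dimension.

```python
import random

def train_som(data, n_units=4, lr=0.2, n_epochs=20, seed=0):
    """Minimal 1-D self-organizing categorization: each input is assigned
    to its nearest unit, and that unit is pulled toward the input."""
    rng = random.Random(seed)
    units = [rng.random() for _ in range(n_units)]  # random initial prototypes
    for _ in range(n_epochs):
        for x in data:
            # Best-matching unit: the prototype closest to the input.
            bmu = min(range(n_units), key=lambda i: abs(units[i] - x))
            units[bmu] += lr * (x - units[bmu])  # move the winner toward x
    return units
```

The learned prototypes then play the role of the hand-built perceptual categories: each raw perception is replaced by the index of its best-matching unit before reaching the Critic.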
Modelling: Autonomous robotics. Results
[Figure: average performance of the four variants over the course of the experiment]
Modelling: Autonomous robotics. Results
Number of iterations required (average performance during the second half of the experiment):
1. Gating network: 3,500
2. Hand-tuned: 94
3. Unsupervised categorization (SOM): 404
4. Random robot: 30,000