Actor-Critic models: from ventral striatal reward-related activity to robotics simulations
Dr. Mehdi Khamassi 1,2
1 LPPA, UMR CNRS 7152, Collège de France, Paris
2 AnimatLab-LIP6 / SIMA-ISIR, Université Pierre et Marie Curie, Paris 6
Intro: Objective
• Help understand how mammals adapt their behavior in order to maximize reward obtained from the environment.
• Help understand the brain mechanisms underlying these cognitive processes.
Intro: Objective
• A challenging goal: different levels of decision, different learning processes, different types of representation.
• A multidisciplinary approach: behavioral neurophysiology, computational modelling, autonomous robotics.
Intro: The Actor-Critic model
CRITIC: learns to predict reward. ACTOR: learns to select actions.
• Developed in the AI community (Reinforcement Learning).
• Explains some reward-seeking behaviors.
• Resembles some parts of the brain (dopaminergic neurons & striatum).
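The division of labor on this slide can be sketched in code. This is a minimal illustration only, not the model used in this work: the class names (`Critic`, `Actor`), the learning rate `lr`, and the tabular state space are all simplifying assumptions.

```python
import random

class Critic:
    """Learns to predict reward from the current state (a value function)."""
    def __init__(self, n_states, lr=0.1):
        self.values = [0.0] * n_states
        self.lr = lr

    def predict(self, state):
        return self.values[state]

    def update(self, state, td_error):
        # Move the prediction toward the observed outcome.
        self.values[state] += self.lr * td_error

class Actor:
    """Learns to select actions, reinforced by the Critic's error signal."""
    def __init__(self, n_states, n_actions, lr=0.1):
        self.prefs = [[0.0] * n_actions for _ in range(n_states)]
        self.lr = lr

    def select(self, state):
        # Greedy selection with random tie-breaking (a softmax is also common).
        prefs = self.prefs[state]
        best = max(prefs)
        return random.choice([a for a, p in enumerate(prefs) if p == best])

    def update(self, state, action, td_error):
        # Actions followed by positive surprise become more likely.
        self.prefs[state][action] += self.lr * td_error
```

The same scalar error signal trains both components, which is the key property exploited by the dopamine analogy later in the talk.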
Intro: Outline
1. Introduction: How does an Actor-Critic model work?
2. Electrophysiology: Reward predictions in the rat ventral striatum
3. Computational modelling: An Actor-Critic model in a simulated robot
4. Discussion
Intro: The Actor-Critic model. Learning from reward
[Figure: plus-maze with five numbered actions (1-5); reward delivered at the goal arm]
• Reward prediction P(t-1), updated from the reward actually received (Rescorla and Wagner, 1972).
• Temporal-Difference (TD) learning: the reinforcement signal ȓ is computed from the reward and the successive predictions P(t-1) and P(t) (Sutton and Barto, 1998).
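The two learning rules can be written side by side. A minimal sketch, with a discount factor `GAMMA` chosen purely for illustration: the Rescorla-Wagner error compares the reward with the prediction made just before it, while the TD error additionally lets the next prediction stand in for future reward, so learning can propagate backward through time before the reward itself arrives.

```python
GAMMA = 0.98  # discount factor (illustrative value)

def rescorla_wagner_error(reward, prediction):
    """Rescorla-Wagner (1972): error between actual and predicted reward."""
    return reward - prediction

def td_error(reward, prediction_t, prediction_t_minus_1):
    """Sutton & Barto (1998): the next prediction P(t) stands in for
    future reward, on top of the reward received at this step."""
    return reward + GAMMA * prediction_t - prediction_t_minus_1
```

When the transition is fully anticipated (P(t-1) already equals the discounted P(t) plus the reward), the TD error is zero and learning stops.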
Intro: The Actor-Critic model. Analogy with dopaminergic neurons
[Figure: reinforcement signal following stimulus S and reward R]
• Unpredicted reward: reinforcement +1.
• Fully predicted reward: reinforcement 0.
• Predicted reward omitted: reinforcement -1.
Romo & Schultz (1990); Houk et al. (1995); Schultz et al. (1997).
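The three cases on this slide fall directly out of the TD reinforcement signal. A toy check, with the discount factor set to 1 for readability:

```python
def td_error(reward, next_prediction, prediction):
    # TD reinforcement signal, with discount factor 1 for simplicity.
    return reward + next_prediction - prediction

# Unpredicted reward: nothing was expected, yet reward arrives.
assert td_error(reward=1, next_prediction=0, prediction=0) == 1
# Fully predicted reward: the reward exactly cancels the prediction.
assert td_error(reward=1, next_prediction=0, prediction=1) == 0
# Predicted reward omitted: a negative signal at the expected time.
assert td_error(reward=0, next_prediction=0, prediction=1) == -1
```

These are the same three signatures reported for dopaminergic neurons in the cited studies.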
Intro: Actor-Critic models
[Figure: Actor-Critic architecture; the reinforcement signal plays the role of the dopaminergic neuron]
Barto (1995); Houk et al. (1995); Montague et al. (1996); Schultz et al. (1997); Berns and Sejnowski (1996); Suri and Schultz (1999); Doya (2000); Suri et al. (2001); Baldassarre (2002); see Joel et al. (2002) for a review.
Intro: Actor-Critic models
[Figure: two successive states L and E; r = 0 in L and r = 1 in E; all predictions start at P = 0. Learning first sets P = 1 in E, where the reward occurs, and the prediction then propagates backward so that P = 1 in L as well.]
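The backward propagation of the prediction shown in the figure can be reproduced in a few lines. A toy sketch, assuming two successive states visited on every trial and a learning rate chosen only for illustration:

```python
def train(n_trials=50, lr=0.5):
    # Two successive states: L then E; reward r = 1 is delivered in E.
    P = {"L": 0.0, "E": 0.0}
    for _ in range(n_trials):
        # In E the reward is received: P["E"] moves toward r = 1.
        P["E"] += lr * (1.0 - P["E"])
        # In L there is no reward (r = 0), but the prediction made in E
        # acts as the reinforcement, transferring value backward to L.
        P["L"] += lr * (P["E"] - P["L"])
    return P
```

After training, both predictions approach 1, even though the reward is only ever delivered in E.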
Intro: The rat brain
[Figure adapted from Tierney (2006)]
Intro: The striatum
[Figure adapted from Voorn et al. (2004)]
Intro: The striatum
CRITIC: ventral striatum. ACTOR: dorsal striatum, projecting to actions. Both are reinforced by dopaminergic neurons (VTA / SNc).
(Barto, 1995; Houk et al., 1995; Montague et al., 1996; Schultz et al., 1997; Doya et al., 2002; O'Doherty et al., 2004)
Intro: The striatum
• Learning based on reward prediction in the ventral striatum. In the monkey: (Hikosaka et al., 1989; Hollerman et al., 1998; Kawagoe et al., 1998; Hassani et al., 2001; Cromwell and Schultz, 2003). In the rat: (Carelli et al., 2000; Daw et al., 2002; Setlow et al., 2003; Nicola et al., 2004; Wilson and Bowman, 2005).
• ... driven by dopamine reinforcements (Schultz et al., 1992; Satoh et al., 2003; Nakahara et al., 2004).
• ... modelled by Temporal Difference (TD) learning (Barto, 1995; Houk et al., 1995; Schultz et al., 1997; Doya et al., 2002).
Intro: The striatum
• ... using precisely timed reward predictions in TD-learning (Montague et al., 1996; Suri and Schultz, 2001; Perez-Uribe, 2001; Alexander and Sporns, 2002).
[Figure: simulation of a TD-learning model compared with activity recorded from the monkey striatum. Adapted from Suri and Schultz (2001).]
Electrophysiology: Methods
• Recording in the rat ventral striatum (VS)
• Simple electrodes
Electrophysiology: Behavioral methods
The plus-maze task
[Figure: trial timeline from box arrival to center departure, distinguishing running and immobile periods]
Electrophysiology: Results
• 170 neurons recorded
• 91 neurons with behavioral correlates
[Figure: activity aligned on departure, center crossing, and arrival]
Electrophysiology: Results. Reward anticipation
Ventral striatal neuron:
• Activity anticipating each reward droplet.
• Independent of locomotor behavior.
• Anticipation of an extra reward.
Khamassi, Mulder et al. (in revision) J Neurophysiol.
Electrophysiology: Modelling with TD-learning. Results
[Figure: simulated anticipatory activity for 1, 3, 5 and 7 reward droplets]
Temporal representation of stimuli (Montague et al., 1996). Limitations of the simulation:
• Incomplete temporal representation.
• Ambiguous visual input: the context after the last droplet is the same as during droplet delivery.
• No spatial information.
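The temporal stimulus representation of Montague et al. (1996) is often implemented as a tapped delay line (a "complete serial compound"): the stimulus is recoded as a vector with one component per elapsed time step, so each delay gets its own learnable prediction weight. A minimal sketch; the function names and the 10-step horizon are illustrative assumptions.

```python
def serial_compound(t_since_stimulus, n_timesteps=10):
    """One-hot vector marking the time elapsed since stimulus onset,
    so each delay has its own prediction weight to learn."""
    x = [0.0] * n_timesteps
    if 0 <= t_since_stimulus < n_timesteps:
        x[t_since_stimulus] = 1.0
    return x

def predict(weights, x):
    # Linear prediction: one weight per delay component.
    return sum(w * xi for w, xi in zip(weights, x))
```

The limitations listed above follow directly from this scheme: the representation stops at `n_timesteps` (incomplete), and two moments with the same delay vector are indistinguishable, regardless of visual context or spatial position.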
Electrophysiology
• TD-learning could reproduce the neural anticipatory activity.
• Can it also reproduce the rat's locomotor behavior in the same task?
Khamassi, Mulder et al. (in revision) J Neurophysiol.
Modelling: Autonomous robotics. Methods
• Virtual plus-maze: the simulated robot receives visual perceptions as input, chooses among five actions (1-5), and is rewarded at the goal.
[Figure: virtual plus-maze with the five actions and the expected reward]
Modelling: Autonomous robotics. Methods
• Existing Actor-Critic models:
• Simplistic Actor.
• Most often: discrete environments.
• Continuous environments: coordination of modules, either by a gating network (Baldassarre, 2002; Doya et al., 2002) or hand-tuned, independent of the modules' performances (Suri and Schultz, 2001).
• Aim: test these principles within a common framework.
Barto (1995); Houk et al. (1995); Montague et al. (1996); Schultz et al. (1997); Berns and Sejnowski (1996); Suri and Schultz (1999); Doya (2000); Suri et al. (2001); Baldassarre (2002); see Joel et al. (2002) for a review.
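The gating-network idea can be sketched as a mixture of experts: each module makes its own prediction, and control is shared in proportion to how well each module predicts. This is an illustrative sketch in the spirit of Baldassarre (2002) and Doya et al. (2002), not their exact formulation; the temperature parameter and function names are assumptions.

```python
import math

def gate(prediction_errors, temperature=1.0):
    """Soft responsibilities: modules with a smaller prediction error
    get a larger share of control (softmax over negative error)."""
    scores = [math.exp(-abs(e) / temperature) for e in prediction_errors]
    total = sum(scores)
    return [s / total for s in scores]

def mixed_prediction(predictions, responsibilities):
    # Overall prediction: responsibility-weighted mix of the modules.
    return sum(p * g for p, g in zip(predictions, responsibilities))
```

The hand-tuned alternative replaces `gate` with fixed responsibilities that ignore the modules' actual performance, which is exactly the distinction tested in the experiments below.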
Modelling: Autonomous robotics. Methods
• Implemented framework
Modelling: Autonomous robotics. Methods
Gurney, Prescott & Redgrave (2001), adapted by Girard et al. (2002; 2003).
Modelling: Autonomous robotics. Methods. Module coordination
Four variants compared:
1. Gating network (tests the modules' capacity for state prediction).
2. Hand-tuned coordination (independent of the modules' performance), based on a categorization of the visual perceptions.
3. Unsupervised categorization (Self-Organizing Maps).
4. Random robot.
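Variant 3 replaces the hand-tuned categorization with a Self-Organizing Map. The sketch below keeps only the winner-take-all core: for brevity it updates only the best-matching unit and omits the neighborhood update a full SOM uses, which makes it closer to online k-means. All names and parameters are illustrative, and inputs are reduced to one dimension.

```python
import random

def train_som(data, n_units=4, lr=0.2, n_epochs=20, seed=0):
    """Minimal 1-D self-organizing categorization: each input is assigned
    to its nearest unit, and that unit is pulled toward the input."""
    rng = random.Random(seed)
    units = [rng.random() for _ in range(n_units)]  # random initial prototypes
    for _ in range(n_epochs):
        for x in data:
            # Best-matching unit: the prototype closest to the input.
            bmu = min(range(n_units), key=lambda i: abs(units[i] - x))
            units[bmu] += lr * (x - units[bmu])  # move the winner toward x
    return units
```

The learned prototypes then play the role of the hand-built perceptual categories: each raw perception is replaced by the index of its best-matching unit before reaching the Critic.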
Modelling: Autonomous robotics. Results
[Figure: average performance of the four variants over the course of the experiment]
Modelling: Autonomous robotics. Results
Number of iterations required (average performance during the second half of the experiment):
1. Gating network: 3,500
2. Hand-tuned: 94
3. Unsupervised categorization (SOM): 404
4. Random robot: 30,000