Prediction, Control and Decisions. Kenji Doya (doya@irp.oist.jp). Initial Research Project, OIST • ATR Computational Neuroscience Laboratories • CREST, Japan Science and Technology Agency • Nara Institute of Science and Technology
Outline • Introduction • Cerebellum, basal ganglia, and cortex • Meta-learning and neuromodulators • Prediction time scale and serotonin
Learning to Walk (Doya & Nakano, 1985) • Action: cycle of 4 postures • Reward: speed sensor output • Multiple solutions: creeping, jumping,…
Learning to Stand Up (Morimoto & Doya, 2001) • [Figure: early trials vs. after learning] • Reward: height of the head • No desired trajectory
Reinforcement Learning (RL) • [Diagram: agent-environment loop; the critic sends TD error δ to the actor, the environment returns state s and reward r, the actor emits action a] • Framework for learning a state-action mapping (policy) by exploration and reward feedback • Critic: reward prediction • Actor: action selection • Learning: external reward r; internal reward δ = difference from prediction
Reinforcement Learning Methods • Model-free methods • Episode-based: parameterize policy P(a|s; θ) • Temporal difference: state value function V(s); (state-)action value function Q(s,a) • Model-based methods • Dynamic Programming: forward model P(s'|s,a)
Temporal Difference Learning • Predict reward: value function • V(s) = E[ r(t) + γr(t+1) + γ²r(t+2) + … | s(t)=s ] • Q(s,a) = E[ r(t) + γr(t+1) + γ²r(t+2) + … | s(t)=s, a(t)=a ] • Select action • greedy: a = argmax_a Q(s,a) • Boltzmann: P(a|s) ∝ exp[ β Q(s,a) ] • Update prediction: TD error • δ(t) = r(t) + γV(s(t+1)) - V(s(t)) • ΔV(s(t)) = αδ(t) • ΔQ(s(t),a(t)) = αδ(t)
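As a concrete reading of these update rules, here is a minimal tabular sketch in Python; the table layout, default α, β, γ values, and helper names are illustrative, not from the talk:

```python
import numpy as np

def boltzmann(q_row, beta):
    """P(a|s) proportional to exp(beta * Q(s,a)); softmax over one row of Q."""
    p = np.exp(beta * (q_row - q_row.max()))   # subtract max for numerical stability
    return p / p.sum()

def td_update(V, Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One temporal-difference update of the state- and action-value tables."""
    delta = r + gamma * V[s_next] - V[s]       # TD error delta(t)
    V[s] += alpha * delta                      # delta V(s(t))       = alpha * delta(t)
    Q[s, a] += alpha * delta                   # delta Q(s(t), a(t)) = alpha * delta(t)
    return delta

# usage sketch: V = np.zeros(n_states); Q = np.zeros((n_states, n_actions))
# a = np.random.choice(n_actions, p=boltzmann(Q[s], beta=2.0))
```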
Dynamic Programming and RL • Dynamic Programming • model-based, off-line • solve the Bellman equation • V(s) = max_a Σ_s' P(s'|s,a) { r(s,a,s') + γV(s') } • Reinforcement Learning • model-free, on-line • learn by TD error • δ(t) = r(t) + γV(s(t+1)) - V(s(t)) • ΔV(s(t)) = αδ(t) • ΔQ(s(t),a(t)) = αδ(t)
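For contrast with the model-free updates above, a hedged sketch of the model-based route: value iteration on a known transition model P(s'|s,a) and reward r(s,a,s'). The array layout is an assumption made for illustration:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-6):
    """Solve V(s) = max_a sum_s' P(s'|s,a) * (r(s,a,s') + gamma * V(s')).

    P: array [S, A, S'] of transition probabilities
    R: array [S, A, S'] of rewards
    """
    V = np.zeros(P.shape[0])
    while True:
        # Q[s,a] = sum_s' P[s,a,s'] * (R[s,a,s'] + gamma * V[s'])
        Q = (P * (R + gamma * V)).sum(axis=2)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)    # value function and greedy policy
        V = V_new
```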
Discrete vs. Continuous RL (Doya, 2000) • Discrete time • Continuous time
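The equations for this slide were lost in extraction; the following is a hedged reconstruction of the standard discrete-time and continuous-time formulations (continuous-time form as in Doya, 2000, with time constant τ playing the role of the discount):

```latex
% Discrete time
V(s(t)) = E\Big[\textstyle\sum_{k=0}^{\infty} \gamma^{k}\, r(t+k)\Big], \qquad
\delta(t) = r(t) + \gamma V(s(t+1)) - V(s(t))

% Continuous time (exponential discounting with time constant tau)
V(s(t)) = E\Big[\textstyle\int_{t}^{\infty} e^{-(u-t)/\tau}\, r(u)\, du\Big], \qquad
\delta(t) = r(t) - \tfrac{1}{\tau} V(s(t)) + \dot{V}(s(t))
```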
Questions • Computational Questions • How to learn: • direct policy P(a|s) • value functions V(s), Q(s,a) • forward models P(s’|s,a) • When to use which method? • Biological Questions • Where in the brain? • How are they represented/updated? • How are they selected/coordinated?
Brain Hierarchy • Forebrain • Cerebral cortex (a) • neocortex • paleocortex: olfactory cortex • archicortex: basal forebrain, hippocampus • Basal nuclei (b) • neostriatum: caudate, putamen • paleostriatum: globus pallidus • archistriatum: amygdala • Diencephalon • thalamus (c) • hypothalamus (d) • Brain stem & Cerebellum • Midbrain (e) • Hindbrain • pons (f) • cerebellum (g) • Medulla (h) • Spinal cord (i)
Just for Motor Control? (Middleton & Strick, 1994) • Basal ganglia (globus pallidus) • Prefrontal cortex (area 46) • Cerebellum (dentate nucleus)
Specialization by Learning Algorithms (Doya, 1999) • [Diagram: cortex-basal ganglia-cerebellum loops through thalamus, SN, and IO] • Cerebral cortex: unsupervised learning (input → output) • Basal ganglia: reinforcement learning (input + reward → output) • Cerebellum: supervised learning (input + target/error → output)
Cerebellum • Purkinje cells • ~10⁵ parallel fibers • single climbing fiber • long-term depression • Supervised learning • perceptron hypothesis • internal models
Internal Models in the Cerebellum (Imamizu et al., 2000) • Learning to use a 'rotated' mouse • [fMRI: early learning vs. after learning]
Motor Imagery (Luft et al., 1998) • [Figure: activation during finger movement vs. imagery of movement]
Basal Ganglia • Striatum • striosome & matrix • dopamine-dependent plasticity • Dopamine neurons • reward-predictive response • TD learning
Dopamine Neurons and TD Error: δ(t) = r(t) + γV(s(t+1)) - V(s(t)) • [Figure: traces of r, V, and δ before learning, after learning, and with reward omitted] (Schultz et al., 1997)
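A toy simulation reproduces the three panels: a cue is followed by a reward several steps later, and tabular TD learning yields δ at the reward before learning, δ at the cue after learning, and a negative δ when the reward is omitted. The trial layout and parameters below are invented for illustration:

```python
import numpy as np

T, reward_t = 10, 6            # steps per trial (after cue onset) and reward time
gamma, alpha = 0.98, 0.2
V = np.zeros(T + 1)            # V[k]: predicted future reward k steps after the cue

def run_trial(V, rewarded=True, learn=True):
    """One trial starting just before the cue; returns the TD error at each step."""
    deltas = np.zeros(T + 1)
    v_prev = 0.0                                  # pre-cue state carries no prediction
    for k in range(T + 1):                        # transition into state k
        r = 1.0 if (rewarded and k == reward_t) else 0.0
        deltas[k] = r + gamma * V[k] - v_prev     # delta = r + gamma*V(next) - V(current)
        if learn and k > 0:
            V[k - 1] += alpha * deltas[k]         # update the value of the state just left
        v_prev = V[k]
    return deltas

before = run_trial(V.copy(), learn=False)         # delta peaks at the reward step
for _ in range(500):
    run_trial(V)                                  # training on rewarded trials
after = run_trial(V.copy(), learn=False)          # delta now peaks at the cue (k = 0)
omitted = run_trial(V.copy(), rewarded=False, learn=False)  # negative dip at reward step
```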
Reward-predicting Activities of Striatal Neurons • Delayed saccade task (Kawagoe et al., 1998) • Not just actions, but resulting rewards • [Figure: activity for target directions Right/Up/Left/Down when reward is given for Right, Up, Left, Down, or All directions]
Cerebral Cortex • Recurrent connections • Hebbian plasticity • Unsupervised learning, e.g., PCA, ICA
Replicating V1 Receptive Fields (Olshausen & Field, 1996) • Infomax and sparseness • Hebbian plasticity and recurrent inhibition
Specialization by Learning? • Cerebellum: Supervised learning • error signal by climbing fibers • forward model s'=f(s,a) and policy a=g(s) • Basal ganglia: Reinforcement learning • reward signal by dopamine fibers • value functions V(s) and Q(s,a) • Cerebral cortex: Unsupervised learning • Hebbian plasticity and recurrent inhibition • representation of state s and action a • But how are they recruited and combined?
Multiple Action Selection Schemes • Model-free: a = argmax_a Q(s,a) • Model-based: a = argmax_a [ r + V(f(s,a)) ], using a forward model f(s,a) • Encapsulation: a = g(s)
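A hedged sketch of the three schemes side by side, assuming a learned Q table, a one-step forward model f(s,a) → s', and a cached policy g(s); all function names and interfaces are illustrative:

```python
import numpy as np

def model_free_action(Q, s):
    """Model-free: a = argmax_a Q(s, a), read directly from a learned Q table."""
    return int(np.argmax(Q[s]))

def model_based_action(f, r, V, s, actions):
    """Model-based: a = argmax_a [ r(s, a) + V(f(s, a)) ],
    one-step lookahead through a forward model f(s, a) -> s'."""
    return max(actions, key=lambda a: r(s, a) + V[f(s, a)])

def encapsulated_action(g, s):
    """Encapsulation: a cached policy a = g(s); no values consulted at choice time."""
    return g(s)
```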
Lectures at OCNC 2005 • Internal models / Cerebellum: Reza Shadmehr, Stefan Schaal, Mitsuo Kawato • Reward / Basal ganglia: Andrew G. Barto, Bernard Balleine, Peter Dayan, John O'Doherty, Minoru Kimura, Wolfram Schultz • State coding / Cortex: Nathaniel Daw, Leo Sugrue, Daeyeol Lee, Jun Tanji, Anitha Pasupathy, Masamichi Sakagami
Outline • Introduction • Cerebellum, basal ganglia, and cortex • Meta-learning and neuromodulators • Prediction time scale and serotonin
Reinforcement Learning (RL) • [Diagram: agent-environment loop; the critic sends TD error δ to the actor, the environment returns state s and reward r, the actor emits action a] • Framework for learning a state-action mapping (policy) by exploration and reward feedback • Critic: reward prediction • Actor: action selection • Learning: external reward r; internal reward δ = difference from prediction
Reinforcement Learning • Predict reward: value function • V(s) = E[ r(t) + γr(t+1) + γ²r(t+2) + … | s(t)=s ] • Q(s,a) = E[ r(t) + γr(t+1) + γ²r(t+2) + … | s(t)=s, a(t)=a ] • Select action • greedy: a = argmax_a Q(s,a) • Boltzmann: P(a|s) ∝ exp[ β Q(s,a) ] • Update prediction: TD error • δ(t) = r(t) + γV(s(t+1)) - V(s(t)) • ΔV(s(t)) = αδ(t) • ΔQ(s(t),a(t)) = αδ(t)
Cyber Rodent Project • Robots with the same constraints as biological agents • What is the origin of rewards? • What should be learned, and what should be evolved? • Self-preservation: capture batteries • Self-reproduction: exchange programs through IR ports
Cyber Rodent: Hardware • camera, range sensor, proximity sensors, gyro, battery latch, two wheels, IR port, speaker, microphones, R/G/B LED
Evolving Robot Colony • Survival: catch battery packs • Reproduction: copy 'genes' through IR ports
Discounting Future Reward • [Figure: value curves for large γ vs. small γ]
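A small numeric illustration of the contrast in the figure (the reward size and delays are made up): with a large γ a reward many steps away still carries value, with a small γ it is nearly invisible.

```python
# discounted value of a single reward of size 1 delivered n steps in the future
def discounted_value(gamma, n):
    return gamma ** n

for gamma in (0.9, 0.5):                 # "large" vs. "small" discount factor
    print([round(discounted_value(gamma, n), 3) for n in range(0, 10, 3)])
# gamma = 0.9 -> [1.0, 0.729, 0.531, 0.387]   (long horizon)
# gamma = 0.5 -> [1.0, 0.125, 0.016, 0.002]   (short horizon)
```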
Setting of Reward Function • Reward r = r_main + r_supp - r_cost • e.g., reward for vision of a battery
Reinforcement Learning of Reinforcement Learning (Schweighofer & Doya, 2003) • Fluctuations in the metaparameters correlate with average reward • [Figure: time courses of reward and the metaparameters γ, β, α]
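One way to read this result (hedged; the actual algorithm in Schweighofer & Doya, 2003, may differ in detail) is as stochastic search on the metaparameters themselves: perturb a metaparameter such as γ, compare the resulting average reward against a running baseline, and drift toward perturbations that pay off. A minimal sketch, with an assumed run_agent callback:

```python
import numpy as np

def adapt_metaparameter(value, run_agent, sigma=0.05, eta=0.01, n_episodes=200):
    """Stochastic search on one metaparameter (e.g. gamma in [0, 1]).

    run_agent(value) must run some episodes of RL with that metaparameter
    and return the average reward obtained.  Perturbations that yield more
    reward than the running baseline pull the metaparameter toward them.
    """
    baseline = run_agent(value)
    for _ in range(n_episodes):
        trial = float(np.clip(value + sigma * np.random.randn(), 0.0, 1.0))
        reward = run_agent(trial)
        value += eta * (reward - baseline) * (trial - value)   # correlate and drift
        baseline += 0.1 * (reward - baseline)                  # running average reward
    return value
```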
Randomness Control by Battery Level • [Figure: inverse temperature β (0-14) as a function of battery level (0-1)] • Greedier action at both extremes
Neuromodulators for Metalearning (Doya, 2002) • Metaparameter tuning is critical in RL • How does the brain tune them? • Dopamine: TD error δ • Acetylcholine: learning rate α • Noradrenaline: inverse temperature β • Serotonin: discount factor γ
Learning Rate α • ΔV(s(t-1)) = αδ(t) • ΔQ(s(t-1),a(t-1)) = αδ(t) • small α → slow learning • large α → unstable learning • Acetylcholine (basal forebrain) • Regulates memory update and retention (Hasselmo et al.) • LTP in cortex, hippocampus • top-down and bottom-up information flow
Inverse Temperature β • Greediness in action selection • P(a_i|s) ∝ exp[ β Q(s,a_i) ] • small β → exploration • large β → exploitation • Noradrenaline (locus coeruleus) • Correlation with performance accuracy (Aston-Jones et al.) • Modulation of cellular I/O gain (Cohen et al.)
Discount Factor γ • V(s(t)) = E[ r(t+1) + γr(t+2) + γ²r(t+3) + … ] • Balance between short- and long-term results • Serotonin (dorsal raphe) • Low activity associated with impulsivity • depression, bipolar disorders • aggression, eating disorders
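A worked example of how γ governs impulsive versus patient choice (the reward sizes and delay are made up): with a low γ a small immediate reward beats a larger delayed one, with a high γ the preference reverses.

```python
def prefer_delayed(gamma, small_now=2.0, large_later=4.0, delay=3):
    """True if the discounted large-later reward beats the small-now reward."""
    return large_later * gamma ** delay > small_now

print(prefer_delayed(0.60))   # False: "impulsive" choice (4 * 0.6**3  = 0.86 < 2)
print(prefer_delayed(0.95))   # True:  "patient"  choice (4 * 0.95**3 = 3.43 > 2)
```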
TD Error δ • δ(t) = r(t) + γV(s(t)) - V(s(t-1)) • Global learning signal • reward prediction: ΔV(s(t-1)) = αδ(t) • reinforcement: ΔQ(s(t-1),a(t-1)) = αδ(t) • Dopamine (substantia nigra, VTA) • Responds to errors in reward prediction • Reinforcement of actions • addiction
TD Model of Basal Ganglia (Houk et al., 1995; Montague et al., 1996; Schultz et al., 1997, ...) • [Diagram: cortico-basal ganglia circuit with state s, reward r, action a, TD error δ, and candidate sites for ACh, 5-HT, NA] • Striosome: state value V(s) • Matrix: action value Q(s,a) • DA neurons: TD error δ • SNr/GPi: action selection from Q(s,a)
Possible Control of Discount Factor • [Diagram: striatal modules with discount factors γ1, γ2, γ3 learning values V1, V2, V3; dopamine neurons compute δ(t) from V(s(t)) and V(s(t+1))] • Modulation of TD error • Selection/weighting of parallel networks
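A hedged sketch of the second option, parallel networks: several modules each learn a value with its own γ, and a weight vector (which a modulatory signal such as serotonin could set) mixes their outputs into one value and one TD error. The module count, γ values, and class interface are assumptions:

```python
import numpy as np

class ParallelDiscountModules:
    """Value modules V_1..V_n with different discount factors gamma_i,
    combined as V(s) = sum_i w_i * V_i(s)."""

    def __init__(self, n_states, gammas=(0.6, 0.9, 0.99), alpha=0.1):
        self.gammas = np.array(gammas)
        self.alpha = alpha
        self.V = np.zeros((len(gammas), n_states))
        self.w = np.ones(len(gammas)) / len(gammas)   # mixing weights over time scales

    def value(self, s):
        return float(self.w @ self.V[:, s])

    def update(self, s, r, s_next):
        """Each module computes its own TD error with its own gamma."""
        deltas = r + self.gammas * self.V[:, s_next] - self.V[:, s]
        self.V[:, s] += self.alpha * deltas
        return float(self.w @ deltas)                  # combined TD error
```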
Markov Decision Task (Tanaka et al., 2004) • State transition and reward functions • Stimulus and response
Behavior Results • All subjects successfully learned optimal behavior
Block-Design Analysis • Different brain areas are involved in immediate vs. future reward prediction • SHORT vs. NO (p < 0.001, uncorrected): OFC, insula, striatum, cerebellum • LONG vs. SHORT (p < 0.0001, uncorrected): DLPFC, VLPFC, IPC, PMd, striatum, cerebellum, dorsal raphe
Ventro-Dorsal Difference • [Figure: ventro-dorsal activation gradients in lateral PFC, insula, and striatum]
Model-based Regressor Analysis • Estimate V(t) and δ(t) from subjects' performance data • Regression analysis of fMRI data • [Diagram: agent (value function V(s), TD error δ(t), policy) interacting with the environment through state s(t), action a(t), and reward r(t) (20 yen); the model variables are regressed against the fMRI data]
Explanatory Variables (subject NS) • [Figure: time courses over 312 trials of reward prediction V(t) and reward prediction error δ(t), each computed with γ = 0, 0.3, 0.6, 0.8, 0.9, 0.99]
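A hedged sketch of how such explanatory variables can be generated: replay the subject's state/reward sequence once per candidate γ, recording V(t) and δ(t) at every step. The learning rate, the data layout, and the omitted convolution with a hemodynamic response function are assumptions of this sketch:

```python
import numpy as np

def build_regressors(states, rewards, n_states,
                     gammas=(0, 0.3, 0.6, 0.8, 0.9, 0.99), alpha=0.1):
    """Replay a trial sequence once per gamma, recording V(t) and delta(t).

    states, rewards: per-step sequences reconstructed from behaviour
    (states[t+1] is the state reached after rewards[t]).
    Returns dicts mapping gamma -> time series, to be regressed against
    fMRI data after convolution with a hemodynamic response function.
    """
    V_reg, d_reg = {}, {}
    for gamma in gammas:
        V = np.zeros(n_states)
        v_ts, d_ts = [], []
        for t in range(len(rewards) - 1):
            delta = rewards[t] + gamma * V[states[t + 1]] - V[states[t]]
            V[states[t]] += alpha * delta
            v_ts.append(V[states[t]])
            d_ts.append(delta)
        V_reg[gamma], d_reg[gamma] = np.array(v_ts), np.array(d_ts)
    return V_reg, d_reg
```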
Regression Analysis • Reward prediction V: mPFC (x = -2 mm), insula (x = -42 mm) • Reward prediction error δ: striatum (z = 2 mm)
Tryptophan Depletion/Loading • Tryptophan is the precursor of serotonin; depletion/loading affects central serotonin levels (e.g., Bjork et al., 2001; Luciana et al., 2001) • 100 g amino acid drink; experiments 6 hours later • Day 1: Tr- (Depletion: no tryptophan) • Day 2: Tr0 (Control: 2.3 g of tryptophan) • Day 3: Tr+ (Loading: 10.3 g of tryptophan)
Blood Tryptophan Levels • [Figure: blood tryptophan levels by condition; N.D. = not detectable (< 3.9 mg/ml)]