Prediction, Control and Decisions. Kenji Doya (doya@irp.oist.jp). Initial Research Project, OIST • ATR Computational Neuroscience Laboratories • CREST, Japan Science and Technology Agency • Nara Institute of Science and Technology
Outline • Introduction • Cerebellum, basal ganglia, and cortex • Meta-learning and neuromodulators • Prediction time scale and serotonin
Learning to Walk (Doya & Nakano, 1985) • Action: cycle of 4 postures • Reward: speed sensor output • Multiple solutions: creeping, jumping,…
Learning to Stand Up (Morimoto & Doya, 2001) • [Figure: early trials vs. after learning] • Reward: height of the head • No desired trajectory
Reinforcement Learning (RL) • [Diagram: agent-environment loop; the critic sends TD error δ to the actor, the environment returns state s and reward r, the actor emits action a] • Framework for learning a state-action mapping (policy) by exploration and reward feedback • Critic: reward prediction • Actor: action selection • Learning: external reward r; internal reward δ = difference from prediction
Reinforcement Learning Methods • Model-free methods • Episode-based: parameterize policy P(a|s; θ) • Temporal difference: state value function V(s); (state-)action value function Q(s,a) • Model-based methods • Dynamic Programming: forward model P(s'|s,a)
Temporal Difference Learning • Predict reward: value function • V(s) = E[ r(t) + γr(t+1) + γ²r(t+2) + … | s(t)=s ] • Q(s,a) = E[ r(t) + γr(t+1) + γ²r(t+2) + … | s(t)=s, a(t)=a ] • Select action • greedy: a = argmax_a Q(s,a) • Boltzmann: P(a|s) ∝ exp[ β Q(s,a) ] • Update prediction: TD error • δ(t) = r(t) + γV(s(t+1)) - V(s(t)) • ΔV(s(t)) = αδ(t) • ΔQ(s(t),a(t)) = αδ(t)
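As a concrete reading of these update rules, here is a minimal tabular sketch in Python; the table layout, default α, β, γ values, and helper names are illustrative, not from the talk:

```python
import numpy as np

def boltzmann(q_row, beta):
    """P(a|s) proportional to exp(beta * Q(s,a)); softmax over one row of Q."""
    p = np.exp(beta * (q_row - q_row.max()))   # subtract max for numerical stability
    return p / p.sum()

def td_update(V, Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One temporal-difference update of the state- and action-value tables."""
    delta = r + gamma * V[s_next] - V[s]       # TD error delta(t)
    V[s] += alpha * delta                      # delta V(s(t))       = alpha * delta(t)
    Q[s, a] += alpha * delta                   # delta Q(s(t), a(t)) = alpha * delta(t)
    return delta

# usage sketch: V = np.zeros(n_states); Q = np.zeros((n_states, n_actions))
# a = np.random.choice(n_actions, p=boltzmann(Q[s], beta=2.0))
```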
Dynamic Programming and RL • Dynamic Programming • model-based, off-line • solve the Bellman equation • V(s) = max_a Σ_s' P(s'|s,a) { r(s,a,s') + γV(s') } • Reinforcement Learning • model-free, on-line • learn by TD error • δ(t) = r(t) + γV(s(t+1)) - V(s(t)) • ΔV(s(t)) = αδ(t) • ΔQ(s(t),a(t)) = αδ(t)
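For contrast with the model-free updates above, a hedged sketch of the model-based route: value iteration on a known transition model P(s'|s,a) and reward r(s,a,s'). The array layout is an assumption made for illustration:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-6):
    """Solve V(s) = max_a sum_s' P(s'|s,a) * (r(s,a,s') + gamma * V(s')).

    P: array [S, A, S'] of transition probabilities
    R: array [S, A, S'] of rewards
    """
    V = np.zeros(P.shape[0])
    while True:
        # Q[s,a] = sum_s' P[s,a,s'] * (R[s,a,s'] + gamma * V[s'])
        Q = (P * (R + gamma * V)).sum(axis=2)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)    # value function and greedy policy
        V = V_new
```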
Discrete vs. Continuous RL (Doya, 2000) • Discrete time • Continuous time
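The equations for this slide were lost in extraction; the following is a hedged reconstruction of the standard discrete-time and continuous-time formulations (continuous-time form as in Doya, 2000, with time constant τ playing the role of the discount):

```latex
% Discrete time
V(s(t)) = E\Big[\textstyle\sum_{k=0}^{\infty} \gamma^{k}\, r(t+k)\Big], \qquad
\delta(t) = r(t) + \gamma V(s(t+1)) - V(s(t))

% Continuous time (exponential discounting with time constant tau)
V(s(t)) = E\Big[\textstyle\int_{t}^{\infty} e^{-(u-t)/\tau}\, r(u)\, du\Big], \qquad
\delta(t) = r(t) - \tfrac{1}{\tau} V(s(t)) + \dot{V}(s(t))
```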
Questions • Computational Questions • How to learn: • direct policy P(a|s) • value functions V(s), Q(s,a) • forward models P(s’|s,a) • When to use which method? • Biological Questions • Where in the brain? • How are they represented/updated? • How are they selected/coordinated?
Brain Hierarchy • Forebrain • Cerebral cortex (a) • neocortex • paleocortex: olfactory cortex • archicortex: basal forebrain, hippocampus • Basal nuclei (b) • neostriatum: caudate, putamen • paleostriatum: globus pallidus • archistriatum: amygdala • Diencephalon • thalamus (c) • hypothalamus (d) • Brain stem & Cerebellum • Midbrain (e) • Hindbrain • pons (f) • cerebellum (g) • Medulla (h) • Spinal cord (i)
Just for Motor Control? (Middleton & Strick, 1994) • Basal ganglia (globus pallidus) • Prefrontal cortex (area 46) • Cerebellum (dentate nucleus)
Specialization by Learning Algorithms (Doya, 1999) • [Diagram: cortex-basal ganglia-cerebellum loops through thalamus, SN, and IO] • Cerebral cortex: unsupervised learning (input → output) • Basal ganglia: reinforcement learning (input + reward → output) • Cerebellum: supervised learning (input + target/error → output)
Cerebellum • Purkinje cells • ~10⁵ parallel fibers • single climbing fiber • long-term depression • Supervised learning • perceptron hypothesis • internal models
Internal Models in the Cerebellum (Imamizu et al., 2000) • Learning to use a 'rotated' mouse • [fMRI: early learning vs. after learning]
Motor Imagery (Luft et al., 1998) • [Figure: activation during finger movement vs. imagery of movement]
Basal Ganglia • Striatum • striosome & matrix • dopamine-dependent plasticity • Dopamine neurons • reward-predictive response • TD learning
Dopamine Neurons and TD Error: δ(t) = r(t) + γV(s(t+1)) - V(s(t)) • [Figure: traces of r, V, and δ before learning, after learning, and with reward omitted] (Schultz et al., 1997)
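A toy simulation reproduces the three panels: a cue is followed by a reward several steps later, and tabular TD learning yields δ at the reward before learning, δ at the cue after learning, and a negative δ when the reward is omitted. The trial layout and parameters below are invented for illustration:

```python
import numpy as np

T, reward_t = 10, 6            # steps per trial (after cue onset) and reward time
gamma, alpha = 0.98, 0.2
V = np.zeros(T + 1)            # V[k]: predicted future reward k steps after the cue

def run_trial(V, rewarded=True, learn=True):
    """One trial starting just before the cue; returns the TD error at each step."""
    deltas = np.zeros(T + 1)
    v_prev = 0.0                                  # pre-cue state carries no prediction
    for k in range(T + 1):                        # transition into state k
        r = 1.0 if (rewarded and k == reward_t) else 0.0
        deltas[k] = r + gamma * V[k] - v_prev     # delta = r + gamma*V(next) - V(current)
        if learn and k > 0:
            V[k - 1] += alpha * deltas[k]         # update the value of the state just left
        v_prev = V[k]
    return deltas

before = run_trial(V.copy(), learn=False)         # delta peaks at the reward step
for _ in range(500):
    run_trial(V)                                  # training on rewarded trials
after = run_trial(V.copy(), learn=False)          # delta now peaks at the cue (k = 0)
omitted = run_trial(V.copy(), rewarded=False, learn=False)  # negative dip at reward step
```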
Reward-predicting Activities of Striatal Neurons • Delayed saccade task (Kawagoe et al., 1998) • Not just actions, but resulting rewards • [Figure: activity for target directions Right/Up/Left/Down when reward is given for Right, Up, Left, Down, or All directions]
Cerebral Cortex • Recurrent connections • Hebbian plasticity • Unsupervised learning, e.g., PCA, ICA
Replicating V1 Receptive Fields (Olshausen & Field, 1996) • Infomax and sparseness • Hebbian plasticity and recurrent inhibition
Specialization by Learning? • Cerebellum: Supervised learning • error signal by climbing fibers • forward model s'=f(s,a) and policy a=g(s) • Basal ganglia: Reinforcement learning • reward signal by dopamine fibers • value functions V(s) and Q(s,a) • Cerebral cortex: Unsupervised learning • Hebbian plasticity and recurrent inhibition • representation of state s and action a • But how are they recruited and combined?
Multiple Action Selection Schemes • Model-free: a = argmax_a Q(s,a) • Model-based: a = argmax_a [ r + V(f(s,a)) ], using a forward model f(s,a) • Encapsulation: a = g(s)
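A hedged sketch of the three schemes side by side, assuming a learned Q table, a one-step forward model f(s,a) → s', and a cached policy g(s); all function names and interfaces are illustrative:

```python
import numpy as np

def model_free_action(Q, s):
    """Model-free: a = argmax_a Q(s, a), read directly from a learned Q table."""
    return int(np.argmax(Q[s]))

def model_based_action(f, r, V, s, actions):
    """Model-based: a = argmax_a [ r(s, a) + V(f(s, a)) ],
    one-step lookahead through a forward model f(s, a) -> s'."""
    return max(actions, key=lambda a: r(s, a) + V[f(s, a)])

def encapsulated_action(g, s):
    """Encapsulation: a cached policy a = g(s); no values consulted at choice time."""
    return g(s)
```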
Lectures at OCNC 2005 • Internal models / Cerebellum: Reza Shadmehr, Stefan Schaal, Mitsuo Kawato • Reward / Basal ganglia: Andrew G. Barto, Bernard Balleine, Peter Dayan, John O'Doherty, Minoru Kimura, Wolfram Schultz • State coding / Cortex: Nathaniel Daw, Leo Sugrue, Daeyeol Lee, Jun Tanji, Anitha Pasupathy, Masamichi Sakagami
Outline • Introduction • Cerebellum, basal ganglia, and cortex • Meta-learning and neuromodulators • Prediction time scale and serotonin
Reinforcement Learning (RL) • [Diagram: agent-environment loop; the critic sends TD error δ to the actor, the environment returns state s and reward r, the actor emits action a] • Framework for learning a state-action mapping (policy) by exploration and reward feedback • Critic: reward prediction • Actor: action selection • Learning: external reward r; internal reward δ = difference from prediction
Reinforcement Learning • Predict reward: value function • V(s) = E[ r(t) + γr(t+1) + γ²r(t+2) + … | s(t)=s ] • Q(s,a) = E[ r(t) + γr(t+1) + γ²r(t+2) + … | s(t)=s, a(t)=a ] • Select action • greedy: a = argmax_a Q(s,a) • Boltzmann: P(a|s) ∝ exp[ β Q(s,a) ] • Update prediction: TD error • δ(t) = r(t) + γV(s(t+1)) - V(s(t)) • ΔV(s(t)) = αδ(t) • ΔQ(s(t),a(t)) = αδ(t)
Cyber Rodent Project • Robots with the same constraints as biological agents • What is the origin of rewards? • What should be learned, and what should be evolved? • Self-preservation: capture batteries • Self-reproduction: exchange programs through IR ports
Cyber Rodent: Hardware • camera, range sensor, proximity sensors, gyro, battery latch, two wheels, IR port, speaker, microphones, R/G/B LED
Evolving Robot Colony • Survival: catch battery packs • Reproduction: copy 'genes' through IR ports
Discounting Future Reward • [Figure: value curves for large γ vs. small γ]
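A small numeric illustration of the contrast in the figure (the reward size and delays are made up): with a large γ a reward many steps away still carries value, with a small γ it is nearly invisible.

```python
# discounted value of a single reward of size 1 delivered n steps in the future
def discounted_value(gamma, n):
    return gamma ** n

for gamma in (0.9, 0.5):                 # "large" vs. "small" discount factor
    print([round(discounted_value(gamma, n), 3) for n in range(0, 10, 3)])
# gamma = 0.9 -> [1.0, 0.729, 0.531, 0.387]   (long horizon)
# gamma = 0.5 -> [1.0, 0.125, 0.016, 0.002]   (short horizon)
```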
Setting of Reward Function • Reward r = r_main + r_supp - r_cost • e.g., reward for vision of a battery
Reinforcement Learning of Reinforcement Learning (Schweighofer & Doya, 2003) • Fluctuations in the metaparameters correlate with average reward • [Figure: time courses of reward and the metaparameters γ, β, α]
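One way to read this result (hedged; the actual algorithm in Schweighofer & Doya, 2003, may differ in detail) is as stochastic search on the metaparameters themselves: perturb a metaparameter such as γ, compare the resulting average reward against a running baseline, and drift toward perturbations that pay off. A minimal sketch, with an assumed run_agent callback:

```python
import numpy as np

def adapt_metaparameter(value, run_agent, sigma=0.05, eta=0.01, n_episodes=200):
    """Stochastic search on one metaparameter (e.g. gamma in [0, 1]).

    run_agent(value) must run some episodes of RL with that metaparameter
    and return the average reward obtained.  Perturbations that yield more
    reward than the running baseline pull the metaparameter toward them.
    """
    baseline = run_agent(value)
    for _ in range(n_episodes):
        trial = float(np.clip(value + sigma * np.random.randn(), 0.0, 1.0))
        reward = run_agent(trial)
        value += eta * (reward - baseline) * (trial - value)   # correlate and drift
        baseline += 0.1 * (reward - baseline)                  # running average reward
    return value
```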
Randomness Control by Battery Level • [Figure: inverse temperature β (0-14) as a function of battery level (0-1)] • Greedier action at both extremes
Neuromodulators for Metalearning (Doya, 2002) • Metaparameter tuning is critical in RL • How does the brain tune them? • Dopamine: TD error δ • Acetylcholine: learning rate α • Noradrenaline: inverse temperature β • Serotonin: discount factor γ
Learning Rate α • ΔV(s(t-1)) = αδ(t) • ΔQ(s(t-1),a(t-1)) = αδ(t) • small α → slow learning • large α → unstable learning • Acetylcholine (basal forebrain) • Regulates memory update and retention (Hasselmo et al.) • LTP in cortex, hippocampus • top-down and bottom-up information flow
Inverse Temperature β • Greediness in action selection • P(a_i|s) ∝ exp[ β Q(s,a_i) ] • small β → exploration • large β → exploitation • Noradrenaline (locus coeruleus) • Correlation with performance accuracy (Aston-Jones et al.) • Modulation of cellular I/O gain (Cohen et al.)
Discount Factor γ • V(s(t)) = E[ r(t+1) + γr(t+2) + γ²r(t+3) + … ] • Balance between short- and long-term results • Serotonin (dorsal raphe) • Low activity associated with impulsivity • depression, bipolar disorders • aggression, eating disorders
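A worked example of how γ governs impulsive versus patient choice (the reward sizes and delay are made up): with a low γ a small immediate reward beats a larger delayed one, with a high γ the preference reverses.

```python
def prefer_delayed(gamma, small_now=2.0, large_later=4.0, delay=3):
    """True if the discounted large-later reward beats the small-now reward."""
    return large_later * gamma ** delay > small_now

print(prefer_delayed(0.60))   # False: "impulsive" choice (4 * 0.6**3  = 0.86 < 2)
print(prefer_delayed(0.95))   # True:  "patient"  choice (4 * 0.95**3 = 3.43 > 2)
```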
TD Error δ • δ(t) = r(t) + γV(s(t)) - V(s(t-1)) • Global learning signal • reward prediction: ΔV(s(t-1)) = αδ(t) • reinforcement: ΔQ(s(t-1),a(t-1)) = αδ(t) • Dopamine (substantia nigra, VTA) • Responds to errors in reward prediction • Reinforcement of actions • addiction
TD Model of Basal Ganglia (Houk et al., 1995; Montague et al., 1996; Schultz et al., 1997, ...) • [Diagram: cortico-basal ganglia circuit with state s, reward r, action a, TD error δ, and candidate sites for ACh, 5-HT, NA] • Striosome: state value V(s) • Matrix: action value Q(s,a) • DA neurons: TD error δ • SNr/GPi: action selection from Q(s,a)
Possible Control of Discount Factor • [Diagram: striatal modules with discount factors γ1, γ2, γ3 learning values V1, V2, V3; dopamine neurons compute δ(t) from V(s(t)) and V(s(t+1))] • Modulation of TD error • Selection/weighting of parallel networks
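A hedged sketch of the second option, parallel networks: several modules each learn a value with its own γ, and a weight vector (which a modulatory signal such as serotonin could set) mixes their outputs into one value and one TD error. The module count, γ values, and class interface are assumptions:

```python
import numpy as np

class ParallelDiscountModules:
    """Value modules V_1..V_n with different discount factors gamma_i,
    combined as V(s) = sum_i w_i * V_i(s)."""

    def __init__(self, n_states, gammas=(0.6, 0.9, 0.99), alpha=0.1):
        self.gammas = np.array(gammas)
        self.alpha = alpha
        self.V = np.zeros((len(gammas), n_states))
        self.w = np.ones(len(gammas)) / len(gammas)   # mixing weights over time scales

    def value(self, s):
        return float(self.w @ self.V[:, s])

    def update(self, s, r, s_next):
        """Each module computes its own TD error with its own gamma."""
        deltas = r + self.gammas * self.V[:, s_next] - self.V[:, s]
        self.V[:, s] += self.alpha * deltas
        return float(self.w @ deltas)                  # combined TD error
```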
Markov Decision Task (Tanaka et al., 2004) • State transition and reward functions • Stimulus and response
Behavior Results • All subjects successfully learned optimal behavior
Block-Design Analysis • Different brain areas are involved in immediate vs. future reward prediction • SHORT vs. NO (p < 0.001, uncorrected): OFC, insula, striatum, cerebellum • LONG vs. SHORT (p < 0.0001, uncorrected): DLPFC, VLPFC, IPC, PMd, striatum, cerebellum, dorsal raphe
Ventro-Dorsal Difference • [Figure: ventro-dorsal activation gradients in lateral PFC, insula, and striatum]
Model-based Regressor Analysis • Estimate V(t) and δ(t) from subjects' performance data • Regression analysis of fMRI data • [Diagram: agent (value function V(s), TD error δ(t), policy) interacting with the environment through state s(t), action a(t), and reward r(t) (20 yen); the model variables are regressed against the fMRI data]
Explanatory Variables (subject NS) • [Figure: time courses over 312 trials of reward prediction V(t) and reward prediction error δ(t), each computed with γ = 0, 0.3, 0.6, 0.8, 0.9, 0.99]
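A hedged sketch of how such explanatory variables can be generated: replay the subject's state/reward sequence once per candidate γ, recording V(t) and δ(t) at every step. The learning rate, the data layout, and the omitted convolution with a hemodynamic response function are assumptions of this sketch:

```python
import numpy as np

def build_regressors(states, rewards, n_states,
                     gammas=(0, 0.3, 0.6, 0.8, 0.9, 0.99), alpha=0.1):
    """Replay a trial sequence once per gamma, recording V(t) and delta(t).

    states, rewards: per-step sequences reconstructed from behaviour
    (states[t+1] is the state reached after rewards[t]).
    Returns dicts mapping gamma -> time series, to be regressed against
    fMRI data after convolution with a hemodynamic response function.
    """
    V_reg, d_reg = {}, {}
    for gamma in gammas:
        V = np.zeros(n_states)
        v_ts, d_ts = [], []
        for t in range(len(rewards) - 1):
            delta = rewards[t] + gamma * V[states[t + 1]] - V[states[t]]
            V[states[t]] += alpha * delta
            v_ts.append(V[states[t]])
            d_ts.append(delta)
        V_reg[gamma], d_reg[gamma] = np.array(v_ts), np.array(d_ts)
    return V_reg, d_reg
```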
Regression Analysis • Reward prediction V: mPFC (x = -2 mm), insula (x = -42 mm) • Reward prediction error δ: striatum (z = 2 mm)
Tryptophan Depletion/Loading • Tryptophan is the precursor of serotonin; depletion/loading affects central serotonin levels (e.g., Bjork et al., 2001; Luciana et al., 2001) • 100 g amino acid drink; experiments 6 hours later • Day 1: Tr- (Depletion: no tryptophan) • Day 2: Tr0 (Control: 2.3 g of tryptophan) • Day 3: Tr+ (Loading: 10.3 g of tryptophan)
Blood Tryptophan Levels • [Figure: blood tryptophan levels by condition; N.D. = not detectable (< 3.9 mg/ml)]