ACQ and the Basal Ganglia
Jimmy Bonaiuto
USC Brain Project
6/26/2007
Actor-Critic Learning
• Actor – learns action policy
• Critic – learns value functions
• Different actor-critic architectures have been proposed for learning different value functions:
  • V(s) = State values (most common)
  • V(a) = Action values
  • Q(s,a) = State, action pair values
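
The actor/critic split above can be illustrated with a minimal tabular sketch of the most common variant, a critic that learns state values V(s). This is not code from the talk: the function names, the learning rates `alpha` and `beta`, the discount `gamma`, and the one-step toy task are all assumptions made for illustration.

```python
import math
import random


def softmax(prefs):
    """Turn action preferences into selection probabilities."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def actor_critic_step(V, prefs, s, a, r, s_next, alpha=0.1, beta=0.1, gamma=0.9):
    """One update: the critic computes the TD error, and both modules learn from it."""
    delta = r + gamma * V[s_next] - V[s]  # TD error from the critic
    V[s] += alpha * delta                 # critic: state-value update
    prefs[s][a] += beta * delta           # actor: policy (preference) update
    return delta


def run_demo(episodes=500, seed=0):
    """Hypothetical task: in state 0, action 1 is rewarded; state 1 is terminal."""
    rng = random.Random(seed)
    V = [0.0, 0.0]
    prefs = [[0.0, 0.0], [0.0, 0.0]]
    for _ in range(episodes):
        probs = softmax(prefs[0])
        a = 0 if rng.random() < probs[0] else 1
        r = 1.0 if a == 1 else 0.0
        actor_critic_step(V, prefs, 0, a, r, 1)
    return V, prefs


if __name__ == "__main__":
    V, prefs = run_demo()
    print("V(start) =", round(V[0], 3), "preferences =", prefs[0])
```

Replacing `V[s]` with a table indexed by action, or by state and action, would give the V(a) and Q(s,a) variants listed above.
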
Actor-Critic Architecture
• Core data – recordings of midbrain dopaminergic neurons in appetitive learning tasks (Schultz, 1992; Schultz, 1998)
• [Figure from Barto, 1995]
Critic – V(s), V(a), or Q(s,a)?
• How do dopamine cells know about reward value?
• The largest input to the striatum is from cortex (Haber and Gdowski, 2004)
• V(s) and Q(s,a) learning may require the ventral striatum, SNc, and/or VTA to receive a copy of the same cortical projections that the dorsal striatum receives (state information)
• V(a) may only require a projection from the dorsal striatum or globus pallidus (actor) to the ventral striatum, SNc, and/or VTA (critic)
• The largest forebrain input to dopamine neurons is from the striatum (Haber and Gdowski, 2004)
• V(a) may be more biologically plausible in terms of connectivity
Actor-Critic in the Basal Ganglia
• Dopamine targets (striatum) are the site of value and policy learning (Suri & Schultz, 2001)
• The striatum is split into dorsal and ventral divisions (some say dorsolateral and ventromedial) (Voorn et al., 2004)
• Ventral striatum – inputs from limbic structures (critic?)
• Dorsal striatum – connected with motor and associative cortices (actor?)
Role of Dopamine
• Dopamine neurons are found in the ventral tegmental area (VTA) and substantia nigra pars compacta (SNc) (Joel & Weiner, 2000)
• VTA projects to ventral striatum – learning state values
• SNc projects to dorsal striatum – policy learning
• Little difference in VTA and SNc firing (Schultz et al., 1993)
• This is predicted by the TD learning equation, since the policy and the values are both updated using the same TD error
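
The last point can be made concrete with the standard temporal difference formulation. The slides do not give the equations, so the symbols below (discount factor γ, learning rates α and β, action preference p) are assumed notation. A single error signal δ(t) drives both the value update (ventral striatum, via VTA) and the policy update (dorsal striatum, via SNc), which is why similar firing in the two dopamine populations would be expected:

```latex
\begin{align*}
\delta(t) &= r(t) + \gamma\, V(s_{t+1}) - V(s_t) \\
V(s_t) &\leftarrow V(s_t) + \alpha\, \delta(t) \\
p(s_t, a_t) &\leftarrow p(s_t, a_t) + \beta\, \delta(t)
\end{align*}
```
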
ACQ
• Reinforcement learning should maximize total utility, not necessarily total reward. Motivations map outcomes to utilities (Niv et al., 2006)
• Multiple critics – one for each dimension of interoception (hunger, thirst, etc.)
  • Q(s_i, a), s_i = internal state, a = action
• Actor
  • Composite policy
    • Desirability – based on internal state
    • Executability – based on environmental state
  • Eligibility trace from mirror and canonical motor signals
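
A minimal sketch of how a composite policy with multiple critics might select an action. The slide does not say how desirability and executability are combined or how the critics are pooled. The multiplicative priority, the drive-weighted sum over critics, and every name below (`q_tables`, `drives`, `executability`) are assumptions made only for illustration.

```python
def desirability(q_tables, drives, action):
    """Assumed pooling: each critic's value for the action, weighted by its drive level."""
    return sum(drives[dim] * q_tables[dim][action] for dim in drives)


def select_action(q_tables, drives, executability):
    """Assumed combination: priority = executability * desirability, highest wins."""
    priority = {
        a: executability[a] * desirability(q_tables, drives, a)
        for a in executability
    }
    best = max(priority, key=priority.get)
    return best, priority


if __name__ == "__main__":
    # One critic per interoceptive dimension (hypothetical learned values).
    q_tables = {
        "hunger": {"eat": 1.0, "drink": 0.0},
        "thirst": {"eat": 0.0, "drink": 1.0},
    }
    drives = {"hunger": 0.9, "thirst": 0.1}  # internal state
    # Environmental state: food is out of reach, water is available.
    executability = {"eat": 0.0, "drink": 1.0}
    print(select_action(q_tables, drives, executability))
```

In this toy case the hungry agent still drinks, because eating is highly desirable but not currently executable. Setting the executability of `eat` to 1.0 flips the choice.
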
ACQ – Actor/Multiple Critics
• [Figure: x = executed action, x̂ = recognized action]
ACQ - Eligibility Trace
• x = executed action (from efference copy)
• x̂ = recognized action (from mirror system)
• Idealized situations (perfect recognition)
• A realistic implementation would have confidence values between 0.0 and 1.0 for x and x̂, but the pattern of values for ε would be the same
ACQ - Weight Modification
• Desirability and Executability are updated using the same eligibility and reinforcement signals
• They require different weight change rules:
  • Desirability
  • Executability
• Don’t update the value of the last action unless some action is currently recognized
• Step function of the eligibility trace – makes the sign of the weight change depend on r(t)
• Tonic dopamine level, d, added to the TD error – makes the sign of the weight change depend on ε(t)
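
The two sign-control mechanisms named above can be sketched as follows. The actual weight change equations did not survive in the slide text, so the rule forms, the choice of a Heaviside step, the signed eligibility values, and the pairing of each mechanism with desirability or executability are all assumptions. The function and parameter names are hypothetical.

```python
def heaviside(x):
    """Step function: 1 for positive input, otherwise 0."""
    return 1.0 if x > 0 else 0.0


def desirability_dw(reinforcement, eligibility, alpha=0.1):
    """Assumed rule: gating by a step of the eligibility trace.

    The gate is 0 or 1, so the sign of the change follows the reinforcement
    signal, and nothing changes when there is no eligibility.
    """
    return alpha * reinforcement * heaviside(eligibility)


def executability_dw(reinforcement, eligibility, d=1.0, alpha=0.1):
    """Assumed rule: tonic dopamine level d added to the reinforcement signal.

    If d exceeds the magnitude of the reinforcement, the sum stays positive,
    so the sign of the change follows the eligibility trace.
    """
    return alpha * (reinforcement + d) * eligibility


if __name__ == "__main__":
    for r in (0.5, -0.5):
        for e in (1.0, -1.0):
            print(r, e, desirability_dw(r, e), executability_dw(r, e))
```
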
Multiple Critics – Q(s_i, a)
• Is there evidence for multiple critics gated by interoceptive information?
• The lateral hypothalamus does project to the SNc, VTA, and the ventral striatum (Saper et al., 1979; Fadel & Deutch, 2002; Brog et al., 1993)
• The accumbens shell of the ventral striatum is reciprocally connected with the lateral hypothalamus and has been called a “sensory sentinel” or “visceral striatum” (Kelley, 1999, 2004)
• Motivational state, such as food deprivation, can influence the magnitude of dopamine release in the ventral striatum (Wilson et al., 1995; Ahn & Phillips, 1999)
• Sexual satiety is signaled by serotonin from the lateral hypothalamus to the ventral striatum, which reduces dopamine levels (Lorrain et al., 1999)
Internal State-Dependent Policy
• Is there evidence for internal state-dependent policies? (Kelley et al., 2005)
• Information from the lateral hypothalamus reaches the dorsal striatum through the paraventricular nucleus
• Hypothalamic-midline thalamic-striatal projections carry internal state information to cholinergic interneurons of the dorsal striatum
• These are thought to modulate dorsal striatal output neurons
Eligibility Trace from the Mirror System
• What is the evidence for an eligibility signal from mirror neurons?
• People can implicitly learn sequences through action observation (Bird et al., 2005)
• The striatum is consistently implicated in implicit sequence learning, and the magnitude of its activation is correlated with reaction time improvement (Rauch et al., 1997, 1998)
• The basal ganglia are active during action observation (Frey & Gerry, 2006)
• There is a projection from ventral premotor cortex (including the arcuate sulcus) to dorsal and ventral striatum in the macaque (McFarland & Haber, 2000)
References
• Ahn S, Phillips AG (1999) Dopaminergic Correlates of Sensory-Specific Satiety in the Medial Prefrontal Cortex and Nucleus Accumbens of the Rat. The Journal of Neuroscience, 19:RC29:1-6.
• Bird G, Osman M, Saggerson A, Heyes C (2005) Sequence learning by action, observation and action observation. British Journal of Psychology, 96: 371–388.
• Brog JS, Salyapongse A, Deutch AY, Zahm DS (1993) The patterns of afferent innervation of the core and shell in the Accumbens part of the rat ventral striatum: Immunohistochemical detection of retrogradely transported fluoro-gold. The Journal of Comparative Neurology, 338(2): 255–278.
• Fadel J, Deutch AY (2002) Anatomical Substrates of Orexin-Dopamine Interactions: Lateral hypothalamic projections to the ventral tegmental area. Neuroscience, 111(2): 379–387.
• Frey SH, Gerry VE (2006) Modulation of Neural Activity during Observational Learning of Actions and Their Sequential Orders. The Journal of Neuroscience, 26(51): 13194–13201.
• Haber SN, Gdowski MJ (2004) The basal ganglia. In: The human nervous system (Paxinos G, Mai JK, eds), Ed 2, pp. 676–738. New York: Elsevier Academic.
• Joel D, Weiner I (2000) The connections of the dopaminergic system with the striatum in rats and primates: An analysis with respect to the functional and compartmental organization of the striatum. Neuroscience, 96: 451–474.
• Kelley AE (1999) Functional Specificity of Ventral Striatal Compartments in Appetitive Behaviors. Annals New York Academy of Sciences.
• Kelley AE (2004) Ventral striatal control of appetitive motivation: role in ingestive behavior and reward-related learning. Neurosci Biobehav Rev, 27: 765–776.
• Kelley AE, Baldo BA, Pratt WE (2005) A proposed hypothalamic-thalamic-striatal axis for the integration of energy balance, arousal, and food reward. J Comp Neurol, 493(1): 72–85.
• Lorrain DS, Riolo JV, Matuszewich L, Hull EM (1999) Lateral Hypothalamic Serotonin Inhibits Nucleus Accumbens Dopamine: Implications for Sexual Satiety. The Journal of Neuroscience, 19(17): 7648–7652.
• McFarland NR, Haber SN (2000) Convergent Inputs from Thalamic Motor Nuclei and Frontal Cortical Areas to the Dorsal Striatum in the Primate. The Journal of Neuroscience, 20(10): 3798–3813.
• Niv Y, Joel D, Dayan P (2006) A normative perspective on motivation. Trends in Cognitive Sciences, 10(8): 375–381.
• Rauch SL, Whalen PJ, Savage CR, Curran T, Kendrick A, Brown HD, Bush G, Breiter HC, Rosen BR (1997) Striatal Recruitment During an Implicit Sequence Learning Task as Measured by Functional Magnetic Resonance Imaging. Human Brain Mapping, 5: 124–132.
• Rauch SL, Whalen PJ, Curran T, McInerney S, Heckers S, Savage CR (1998) Thalamic deactivation during early implicit sequence learning: a functional MRI study. NeuroReport, 9: 865–870.
• Saper CB, Swanson LW, Cowan WM (1979) An autoradiographic study of the efferent connections of the lateral hypothalamic area in the rat. J Comp Neurol, 183(4): 689–706.
• Schultz W (1992) Activity of dopamine neurons in the behaving primate. Seminars in the Neurosciences, 4: 129–138.
• Schultz W (1998) Predictive reward signal of dopamine neurons. Journal of Neurophysiology, 80: 1–27.
• Schultz W, Apicella P, Ljungberg T (1993) Responses of monkey dopamine neurons to reward and conditioned stimuli during successive steps of learning a delayed response task. Journal of Neuroscience, 13: 900–913.
• Suri RE, Schultz W (2001) Temporal difference model reproduces predictive neural activity. Neural Computation, 13: 841–862.
• Voorn P, Vanderschuren LJ, Groenewegen HJ, Robbins TW, Pennartz CM (2004) Putting a spin on the dorsal-ventral divide of the striatum. Trends in Neurosciences, 27: 468–474.
• Wilson C, Nomikos GG, Collu M, Fibiger HC (1995) Dopaminergic correlates of motivated behavior: importance of drive. Journal of Neuroscience, 15: 5169–5178.