Asymmetric Coding of Temporal Difference Errors: Implications for Dopamine Firing Patterns
Y. Niv (1,2), M.O. Duff (2) and P. Dayan (2)
(1) Interdisciplinary Center for Neural Computation, Hebrew University, Jerusalem, yaelniv@alice.nc.huji.ac.il
(2) Gatsby Computational Neuroscience Unit, University College London

[Figure labels recovered from the poster: short visual stimulus; reward (probabilistic) = drops of juice; trace period; p = 50%, p = 75%; DA; Experiment vs. Model δ(t); 270%, 55% (Bayer and Glimcher; Schultz lab); x(1), x(2), …, V(1), V(30), r(t), δ(t)]

Overview
• Substantial evidence suggests that phasic dopaminergic firing represents a temporal difference (TD) error in the predictions of future reward.
• Recent experiments probe the way information about outcomes propagates back to the stimuli predicting them. These use stochastic rewards (e.g., Fiorillo et al., 2003), which allow systematic study of persistent prediction errors even in well-learned tasks.
• We use a novel theoretical analysis to show that across-trials ramping in DA activity may be a signature of this process. Importantly, we address the asymmetric coding in DA activity of positive and negative TD errors, and acknowledge the constant learning that results from ongoing prediction errors.

Introduction: What does phasic Dopamine encode?
DA single cell recordings from the lab of Wolfram Schultz
[Figure panels: Unpredicted reward (neutral/no stimulus); Predicted reward (learned task); Omitted reward (probe trial)]
-> DA encodes a temporally sophisticated reward signal
Computational hypothesis: DA encodes reward prediction error, the Temporal Difference error (Sutton & Barto, 1987; Montague, Dayan & Sejnowski, 1996)
• -> Phasic DA encodes reward prediction error
• Precise computational theory for generation of DA firing patterns
• Compelling account for role of DA in appetitive conditioning

Experimental results: measuring propagating errors (Fiorillo et al., 2003)
Classical conditioning paradigm (delay conditioning) using probabilistic outcomes -> generates ongoing prediction errors in a learned task
• 2 sec visual stimulus indicating reward probability: 100%, 75%, 50%, 25% or 0%
• Probabilistic reward (drops of juice)
• Single DA cell recordings in VTA/SNc:
  • At stimulus time: DA represents mean expected reward (compliant with TD hypothesis)
  • Surprising ramping of activity in the delay
• -> Fiorillo et al.'s hypothesis: Coding of uncertainty
• However:
  • No prediction error to 'justify' ramp
  • TD learning predicts away any predictable quantity
  • Uncertainty not available for control
• -> The uncertainty hypothesis seems contradictory to the TD hypothesis

A TD resolution: Ramps result from backpropagating prediction errors
• Note that according to TD, activity at time of reward should cancel out, but it doesn't.
• This is because:
  • Prediction errors can be positive or negative
  • However, firing rate is positive -> encoding of negative errors is relative to baseline activity
  • But: baseline activity in DA cells is low (2-5 Hz) -> asymmetric coding of errors

Simulating TD with asymmetric coding
• Negative δ(t) scaled by d = 1/6 prior to PSTH summation
• Learning proceeds normally (without scaling):
  • Necessary to produce the right predictions
  • Can be biologically plausible
-> Ongoing (intertwined) backpropagation of asymmetrically coded positive and negative errors causes ramps to appear in the summed PSTH
-> The ramp itself is a between-trial and not a within-trial phenomenon (results from summation over different reward histories)

Visualizing Temporal-Difference Learning:
[Figure panels: After first trial; After third trial; Learning continues (~10 trials); Task learned]

Trace conditioning: A puzzle solved
Morris et al. (2004) (see also Fiorillo et al. (2003))
• Same (if not more) uncertainty
• But: no DA ramping
Solution: Lower learning rate in trace conditioning eliminates ramp
Indeed: computed learning rate in Morris et al.'s data near zero (personal communication)

Conclusion: Uncertainty or Temporal Difference?
With asymmetric coding of errors, the mean TD error at the time of reward is proportional to p(1-p) -> Indeed maximal at p = 50%
• However:
  • No need to assume explicit coding of uncertainty: ramping in DA activity is explained by neural constraints.
  • Explanation for puzzling absence of ramp in trace conditioning results.
• Experimental tests:
  • Ramp as within or between trial phenomenon?
  • Relationship between ramp size and learning rate (within/between experiments)?
• Challenges to TD remain: TD and noise; Conditioned inhibition; additivity…

Selected References
[1] Fiorillo, Tobler & Schultz (2003) - Discrete coding of reward probability and uncertainty by dopamine neurons. Science, 299, 1898-1902.
[2] Morris, Arkadir, Nevet, Vaadia & Bergman (2004) - Coincident but distinct messages of midbrain dopamine and striatal tonically active neurons. Neuron, 43, 133-143.
[3] Montague, Dayan & Sejnowski (1996) - J Neurosci, 16, 1936-1947.
[4] Sutton & Barto (1998) - Reinforcement learning: An introduction. MIT Press.

Acknowledgements
This research was funded by an EC Thematic Network short-term fellowship to YN and The Gatsby Charitable Foundation.
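Illustrative sketch (not the authors' code): the simulation described on the poster, with negative δ(t) scaled by d = 1/6 before averaging over trials while learning uses the unscaled error, might look roughly like the following. Everything beyond d = 1/6 and the reward probability p is an assumption made for illustration: the function name mean_coded_td_errors, the one-value-per-time-step (tapped delay line) representation, no discounting, a reward of magnitude 1 on the last step, values initialised at p to stand for a learned task, and the parameters alpha, n_steps, n_trials, burn_in and seed. The error at stimulus onset is not modelled.

```python
import random


def mean_coded_td_errors(p=0.5, alpha=0.1, d=1/6, n_steps=10,
                         n_trials=20000, burn_in=2000, seed=0):
    """Across-trial mean of asymmetrically coded TD errors, one per time step.

    Hypothetical helper: index n_steps - 1 is the time of (possible) reward,
    earlier indices are the delay between stimulus and reward.
    """
    rng = random.Random(seed)
    # V[t]: predicted future reward at step t. Assumption: start from the
    # learned values (p everywhere); V[n_steps] = 0 is the post-trial state.
    V = [p] * n_steps + [0.0]
    sums = [0.0] * n_steps
    for trial in range(n_trials):
        reward = 1.0 if rng.random() < p else 0.0
        for t in range(n_steps):
            r = reward if t == n_steps - 1 else 0.0
            delta = r + V[t + 1] - V[t]   # TD error, no discounting (assumed)
            V[t] += alpha * delta         # learning uses the unscaled error
            if trial >= burn_in:
                # Asymmetric coding: negative errors are scaled by d
                # before they enter the across-trial average (the "PSTH").
                sums[t] += delta if delta >= 0 else d * delta
    n_counted = n_trials - burn_in
    return [s / n_counted for s in sums]


if __name__ == "__main__":
    for alpha in (0.1, 0.001):
        profile = mean_coded_td_errors(p=0.5, alpha=alpha)
        print(alpha, [round(x, 3) for x in profile])
```

Under these assumptions the averaged profile should rise toward the time of reward when alpha is sizeable and be much flatter during the delay when alpha is near zero, in line with the poster's trace-conditioning argument. At the time of reward the expected coded error is p(1-p) - d·p(1-p) = (1-d)·p(1-p), which is largest at p = 50%.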