
models: reinforcement learning & fMRI

models: reinforcement learning & fMRI. Nathaniel Daw 11/28/2007. overview. reinforcement learning model fitting: behavior model fitting: fMRI. overview. reinforcement learning simple example tracking choice model fitting: behavior model fitting: fMRI.


Presentation Transcript


  1. models: reinforcement learning & fMRI Nathaniel Daw 11/28/2007

  2. overview • reinforcement learning • model fitting: behavior • model fitting: fMRI

  3. overview • reinforcement learning • simple example • tracking • choice • model fitting: behavior • model fitting: fMRI

  4. Reinforcement learning: the problem Optimal choice learned by repeated trial-and-error • e.g. between slot machines that pay off with different probabilities But… • Payoff amounts & probabilities may be unknown • May additionally be changing • Decisions may be sequentially structured (chess, mazes: this we won't consider today) Very hard computational problem; computational shortcuts essential Interplay between what you can and should do Both have behavioral & neural consequences

  5. Simple example n-armed bandit, unknown but IID payoffs • surprisingly rich problem Vague strategy to maximize expected payoff: • Predict expected payoff for each option • Choose the best (?) • Learn from outcome to improve predictions

  6. Simple example • Predict expected payoff for each option • Take V_L = last reward received on option L • (more generally, some weighted average of past rewards) • This is an unbiased, albeit lousy, estimator • Choose the best • (more generally, choose stochastically such that the machine judged richer is more likely to be chosen) Say left machine pays 10 with prob 10%, 0 otherwise Say right machine pays 1 always What happens? (Niv et al. 2000; Bateson & Kacelnik)
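
A minimal sketch of what happens in this example, assuming a purely greedy chooser, an optimistic starting estimate of 10 for both machines, and ties broken toward the risky machine (none of these details are given on the slide). Both machines have expected payoff 1, yet the last-reward estimator drives the chooser to the safe one.

```python
import random

def simulate(n_trials=1000, seed=0):
    """Greedy choice where V for each machine is the last reward it paid."""
    rng = random.Random(seed)
    # Assumed optimistic start so that both machines get tried at least once.
    value = {"risky": 10.0, "safe": 10.0}
    counts = {"risky": 0, "safe": 0}
    for _ in range(n_trials):
        # Choose the machine currently judged richer (ties go to risky).
        choice = "risky" if value["risky"] >= value["safe"] else "safe"
        if choice == "risky":
            reward = 10.0 if rng.random() < 0.1 else 0.0  # pays 10 w.p. 10%
        else:
            reward = 1.0  # pays 1 always
        value[choice] = reward  # V = last reward received on that option
        counts[choice] += 1
    return counts

counts = simulate()
print(counts)
```

Once the risky machine pays 0, its estimate sits below the safe machine's estimate of 1 and is never revisited, so the agent looks risk averse even though it is only trying to maximize expected reward.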

  7. Behavioral anomalies • Apparent risk aversion arises due to learning, i.e. due to the way payoffs are estimated • Even though we are trying to optimize expected reward, risk neutral • Easy to construct other examples for risk proneness, “probability matching” • Behavioral anomalies can have computational roots • Sampling and choice interact in subtle ways

  8. what can we do?

  9. What can we do? Reward prediction: exponentially weighted running average of rewards on an option [figure: weight on each past reward vs. trials into past] Convenient form because it can be recursively maintained (‘exponential filter’) ‘error-driven learning’, ‘delta rule’, ‘Rescorla-Wagner’
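
A small sketch of why the exponential filter can be recursively maintained: the error-driven update V ← V + α(r − V) yields the same number as explicitly weighting each past reward by α(1 − α)^k, with k counting trials into the past. The names `alpha` and `v0` are illustrative, not taken from the slides.

```python
def delta_rule(rewards, alpha, v0=0.0):
    """Recursive form: nudge the prediction by a fraction of the prediction error."""
    v = v0
    for r in rewards:
        v += alpha * (r - v)
    return v

def exponential_filter(rewards, alpha, v0=0.0):
    """Explicit form: weight on a reward decays exponentially with its age."""
    n = len(rewards)
    total = (1.0 - alpha) ** n * v0  # residual weight on the initial value
    for k, r in enumerate(reversed(rewards)):  # k = trials into the past
        total += alpha * (1.0 - alpha) ** k * r
    return total

rewards = [1.0, 0.0, 0.0, 1.0, 1.0]
recursive = delta_rule(rewards, alpha=0.3)
explicit = exponential_filter(rewards, alpha=0.3)
print(recursive, explicit)
```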

  10. what should we do? [learning]

  11. Bayesian view Specify ‘generative model’ for payoffs • Assume payoff following choice of A is Gaussian with unknown mean μ_A; known variance σ²_PAYOFF • Assume mean μ_A changes via a Gaussian random walk with zero mean and variance σ²_WALK [figure: payoff for A and μ_A over trials]

  12. Bayesian view Describe prior beliefs about parameters as a probability distribution • Assume they are Gaussian (mean and variance symbols lost in extraction) Update beliefs in light of experience with Bayes’ rule: P(μ_A | payoff) ∝ P(payoff | μ_A) P(μ_A) [figure: distribution over mean of payoff for A]

  13. Bayesian belief updating mean of payoff for A

  14. Bayesian belief updating mean of payoff for A

  15. Bayesian belief updating mean of payoff for A

  16. Bayesian belief updating mean of payoff for A

  17. Bayesian belief updating mean of payoff for A

  18. Notes on Kalman filter • Looks like Rescorla/Wagner but • We track uncertainty as well as mean • Learning rate is function of uncertainty (asymptotically constant but nonzero) • Why do we exponentially weight past rewards?
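
A minimal sketch of the Kalman filter update for a single option under the generative model above (Gaussian payoff noise, Gaussian random walk on the mean). The function names, the order of the diffusion and observation steps, and the numerical values are assumptions for illustration.

```python
import math

def kalman_update(mean, var, payoff, var_payoff, var_walk):
    """One trial: diffuse the belief, then correct it with the observed payoff."""
    prior_var = var + var_walk                    # random walk adds uncertainty
    gain = prior_var / (prior_var + var_payoff)   # learning rate depends on uncertainty
    new_mean = mean + gain * (payoff - mean)      # same form as the delta rule
    new_var = (1.0 - gain) * prior_var            # observing shrinks uncertainty
    return new_mean, new_var, gain

def steady_state_gain(var_payoff, var_walk):
    """Asymptotic learning rate: fixed point of the variance recursion."""
    p = (var_walk + math.sqrt(var_walk ** 2 + 4.0 * var_walk * var_payoff)) / 2.0
    return p / (p + var_payoff)

mean, var = 0.0, 100.0  # vague initial belief (made-up values)
gains = []
for payoff in [3.0, 5.0, 4.0, 6.0] * 50:  # made-up payoffs
    mean, var, gain = kalman_update(mean, var, payoff, var_payoff=4.0, var_walk=1.0)
    gains.append(gain)
print(gains[0], gains[-1], steady_state_gain(4.0, 1.0))
```

The gain starts near 1 while the belief is vague and settles to a constant, nonzero value, which is why the filter ends up weighting past rewards exponentially.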

  19. what should we do? [choice]

  20. The n-armed bandit • n slot machines • binary payoffs, unknown fixed probabilities • you get some limited (technically: random, exponentially distributed) number of spins • want to maximize income • surprisingly rich problem

  21. The n-armed bandit 1. Track payoff probabilities Bayesian: learn a distribution over possible probs for each machine This is easy: Just requires counting wins and losses (Beta posterior)

  22. The n-armed bandit 2. Choose This is hard. Why?

  23. The explore-exploit dilemma 2. Choose Simply choosing apparently best machine might miss something better: must balance exploration and exploitation simple heuristics, eg choose at random once in a while
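
A sketch of the "choose at random once in a while" heuristic, often called epsilon-greedy. The function name and the parameter `epsilon` are illustrative and do not come from the slides.

```python
import random

def epsilon_greedy(values, epsilon, rng):
    """With probability epsilon explore at random; otherwise exploit the best estimate."""
    if rng.random() < epsilon:
        return rng.randrange(len(values))                     # explore
    return max(range(len(values)), key=lambda i: values[i])   # exploit

rng = random.Random(0)
estimates = [0.2, 0.7, 0.5]  # made-up payoff estimates for three machines
picks = [epsilon_greedy(estimates, epsilon=0.1, rng=rng) for _ in range(1000)]
print(picks.count(1) / len(picks))
```

This ignores how uncertain each estimate is, which is the part the following slides address.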

  24. Explore / exploit Which should you choose? left bandit: 4/8 spins rewarded right bandit: 1/2 spins rewarded mean of both distributions: 50%

  25. Explore / exploit Which should you choose? left bandit: 4/8 spins rewarded right bandit: 1/2 spins rewarded green (right) bandit more uncertain (distribution has larger variance)
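
A quick check of the two bandits above, assuming a uniform Beta(1, 1) prior on each payoff probability (the prior is not stated on the slides). Counting wins and losses gives the Beta posterior directly.

```python
def beta_posterior(wins, losses, prior_a=1.0, prior_b=1.0):
    """Posterior Beta(a, b) over the payoff probability; returns its mean and variance."""
    a = prior_a + wins
    b = prior_b + losses
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1.0))
    return mean, var

left_mean, left_var = beta_posterior(wins=4, losses=4)    # 4/8 spins rewarded
right_mean, right_var = beta_posterior(wins=1, losses=1)  # 1/2 spins rewarded
print(left_mean, left_var)
print(right_mean, right_var)
```

Both posteriors have mean 50%, but the bandit with only two spins has the larger variance.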

  26. Explore / exploit Which should you choose? Although green bandit has a larger chance of being worse… it also has a larger chance of being better… which would be useful to find out, if true. Trade off uncertainty, expected value, horizon ‘Value of information’: exploring improves future choices How to quantify?

  27. Optimal solution This is really a sequential choice problem; can be solved with dynamic programming Naïve approach: Each machine has k ‘states’ (number of wins/losses so far); state of total game is product over all machines; curse of dimensionality (k^n states) Clever approach: (Gittins 1972) Problem decouples to one with k states: consider continuing on a single bandit versus switching to a bandit that always pays some known amount. The amount for which you’d switch is the ‘Gittins index’. It properly balances mean, uncertainty & horizon

  28. overview • reinforcement learning • model fitting: behavior • pooling multiple subjects • example • model fitting: fMRI

  29. Model estimation What is a model? • parameterized stochastic data-generation process Model m predicts data D given parameters θ: P(D | θ, m) Estimate parameters: posterior distribution over θ by Bayes’ rule: P(θ | D, m) ∝ P(D | θ, m) P(θ | m) Typically use a maximum likelihood point estimate instead, i.e. the parameters for which data are most likely. Can still study uncertainty around peak: interactions, identifiability

  30. application to RL e.g. D for a subject is ordered list of choices c_t, rewards r_t [likelihood equations not preserved in transcript] where V might be learned by an exponential filter with decay θ
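
A sketch of how such a likelihood might be computed and maximized. The choice rule (a softmax with inverse temperature `beta`), the learning rate `alpha` standing in for the filter's decay parameter, the zero starting values, and the grid search are all assumptions for illustration, since the slide's own equations are not preserved.

```python
import math

def negative_log_likelihood(choices, rewards, alpha, beta, n_options=2):
    """-log P(choices | rewards, parameters) for a delta-rule learner with softmax choice."""
    values = [0.0] * n_options
    nll = 0.0
    for c, r in zip(choices, rewards):
        # Log of the softmax normalizer, computed stably.
        m = max(beta * v for v in values)
        log_z = m + math.log(sum(math.exp(beta * v - m) for v in values))
        nll -= beta * values[c] - log_z      # log-probability of the observed choice
        values[c] += alpha * (r - values[c])  # exponential filter on the chosen option
    return nll

def fit_grid(choices, rewards, alphas, betas):
    """Maximum likelihood point estimate by brute-force grid search."""
    return min((negative_log_likelihood(choices, rewards, a, b), a, b)
               for a in alphas for b in betas)

choices = [0, 1, 0, 0, 1, 0, 0, 0]  # made-up data for one subject
rewards = [1, 0, 1, 1, 0, 1, 0, 1]
grid = [i / 10 for i in range(11)]
best_nll, best_alpha, best_beta = fit_grid(choices, rewards, grid, [0.0, 1.0, 2.0, 4.0])
print(best_nll, best_alpha, best_beta)
```

In practice a gradient-based optimizer would replace the grid; the grid just keeps the sketch dependency-free.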

  31. Example behavioral task shock Reinforcement learning for reward & punishment: • participants (31) repeatedly choose between boxes • each box has (hidden, changing) chance of giving money (20p) • also, independent chance of giving electric shock (8 on 1-10 pain scale) money

  32. What is this good for? • parameters may measure something of interest • eg learning rate, monetary value of shock • allows us to quantify & study neural representations of subjective quantities • expected value, prediction error • compare models • compare groups

  33. Compare models In principle: ‘automatic Occam’s razor’ In practice: approximate integral as max likelihood + penalty: Laplace, BIC, AIC etc. Frequentist version: likelihood ratio test Or: holdout set; difficult in sequential case Good example refs: Ho & Camerer
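
For reference, the two most common penalized scores written out, in the usual convention where lower is better. The arguments are the minimized negative log-likelihood `nll`, the number of free parameters `k`, and the number of data points `n`; the example numbers are made up.

```python
import math

def aic(nll, k):
    """Akaike information criterion: 2k - 2 log L."""
    return 2.0 * k + 2.0 * nll

def bic(nll, k, n):
    """Bayesian information criterion: k log n - 2 log L."""
    return k * math.log(n) + 2.0 * nll

# Hypothetical comparison: a 2-parameter model vs. a 10-parameter model on 200 choices.
simple = bic(nll=120.0, k=2, n=200)
flexible = bic(nll=112.0, k=10, n=200)
print(simple, flexible)
```

Here the flexible model fits slightly better but pays a larger penalty, so BIC favours the simple one.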

  34. Compare groups • How to model data for a group of subjects? • Want to account for (potential) inter-subject variability in parameters θ • this is called treating the parameters as “random effects” • i.e. random variables instantiated once per subject • hierarchical model: • each subject’s parameters drawn from population distribution • her choices drawn from model given those parameters

  35. Random effects model Hierarchical model: • What is θ_s? e.g., a learning rate • What is P(θ_s | θ)? e.g. a Gaussian, or a mixture of Gaussians (MOG) • What is θ? e.g. the mean and variance, over the population, of the regression weights Interested in identifying population characteristics θ (all multisubject fMRI analyses work this way)

  36. Random effects model Interested in identifying population characteristics θ • method 1: summary statistics of individual ML fits (cheap & cheerful: used in fMRI) • method 2: estimate integral over parameters e.g. with Monte Carlo What good is this? • can make statistical statements about parameters in population • can compare groups • can regularize individual parameter estimates, i.e., P(θ | θ_s): “empirical Bayes”
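
A sketch of method 1, assuming the summary statistic of interest is a one-sample t-test on per-subject maximum likelihood estimates. The estimates below are made-up numbers, not results from the study.

```python
import math
import statistics

def one_sample_t(estimates, null_value=0.0):
    """t statistic and degrees of freedom for 'population mean differs from null_value'."""
    n = len(estimates)
    mean = statistics.mean(estimates)
    sem = statistics.stdev(estimates) / math.sqrt(n)  # standard error of the mean
    return (mean - null_value) / sem, n - 1

# Hypothetical per-subject ML estimates of one parameter (one number per subject).
subject_estimates = [1.0, 2.0, 3.0, 4.0, 5.0]
t_stat, dof = one_sample_t(subject_estimates)
print(t_stat, dof)
```

Each subject contributes a single number, so between-subject variability is what sets the error term; this is the sense in which the parameter is treated as a random effect.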

  37. Example behavioral task shock Reinforcement learning for reward & punishment: • participants (31) repeatedly choose between boxes • each box has (hidden, changing) chance of giving money (20p) • also, independent chance of giving electric shock (8 on 1-10 pain scale) money

  38. Behavioral analysis Fit trial-by-trial choices using “conditional logit” regression model • coefficients estimate effects on choice of past rewards, shocks, & choices (Lau & Glimcher; Corrado et al) • selective effect of acute tryptophan depletion? value(box 1) = [ 0 0 1 0 0 0 1 0… 0 1 1 0 0 0 0 0… 0 1 1 0 0 1 1 0… ] · [weights] value(box 2) = [ 1 0 0 0 1 0 0 1… 0 0 0 0 1 0 0 1… 1 0 0 0 1 0 0 1… ] · [weights] etc (indicator blocks labelled choice, shock, reward in the original figure) values → choice probabilities using logistic (‘softmax’) rule: prob(box 1) ∝ exp(value(box 1)) probabilities → choices stochastically estimate weights by maximizing joint likelihood of choices, conditional on rewards
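
A sketch of the value and softmax steps described above: each box's value is the dot product of lagged 0/1 indicators with a weight vector, and values map to choice probabilities through a softmax. The three-lag histories and the weights are made up for illustration.

```python
import math

def box_value(history, weights):
    """Dot product of lagged 0/1 indicators (most recent first) with their weights."""
    return sum(w * x
               for kind in weights
               for w, x in zip(weights[kind], history[kind]))

def softmax(values):
    """Map values to choice probabilities; subtract the max for numerical stability."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    z = sum(exps)
    return [e / z for e in exps]

# Hypothetical weights: rewards attract, shocks repel, past choices are 'sticky'.
weights = {"reward": [1.0, 0.5, 0.25], "shock": [-0.8, -0.4, -0.2], "choice": [0.3, 0.1, 0.0]}
box1 = {"reward": [1, 0, 1], "shock": [0, 0, 0], "choice": [1, 0, 1]}
box2 = {"reward": [0, 1, 0], "shock": [0, 1, 0], "choice": [0, 1, 0]}
probs = softmax([box_value(box1, weights), box_value(box2, weights)])
print(probs)
```

Fitting then amounts to adjusting the weights to maximize the summed log-probability of the choices actually made.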

  39. Summary statistics of individual ML fits • fairly noisy (unconstrained model, unregularized fits)

  40. models predict exponential decays in reward & shock weights • & typically neglect choice-choice autocorrelation

  41. Fit of TD model (w/ exponentially decaying choice sensitivity), visualized same way (5x fewer parameters, essentially as good fit to data; estimates better regularized)

  42. Quantify value of pain [figure values: £0.20, £0.04, -£0.12]

  43. Effect of acute tryptophan depletion?

  44. Depleted participants are: • equally shock-driven • more ‘sticky’ (driven to repeat choices) • less money-driven (this effect less reliable)

  45. linear effects of blood tryptophan levels: p > .5

  46. linear effects of blood tryptophan levels: p < .005

  47. linear effects of blood tryptophan levels: p < .01 p < .005

  48. overview • reinforcement learning • model fitting: behavior • model fitting: fMRI • random effects • RL regressors

  49. [figure: activation maps labelled rFP and LFP, thresholds p<0.01 and p<0.001] • What does this mean when there are multiple subjects? • regression coefficients as random effects • if we drew more subjects from this population, is the expected effect size > 0?

  50. History • 1990-1991 – SPM paper, software released, used for PET; low ratio of samples to subjects (within-subject variance not important) • 1992-1997 – Development of fMRI; more samples per subject • 1998 – Holmes & Friston introduce distinction between fixed and random effects analysis in conference presentation; reveal SPM had been fixed effects all along • 1999 – Series of papers semi-defending fixed effects; but software fixed
