models: reinforcement learning & fMRI Nathaniel Daw 11/28/2007
overview • reinforcement learning • model fitting: behavior • model fitting: fMRI
overview • reinforcement learning • simple example • tracking • choice • model fitting: behavior • model fitting: fMRI
Reinforcement learning: the problem Optimal choice learned by repeated trial-and-error • eg between slot machines that pay off with different probabilities But… • Payoff amounts & probabilities may be unknown • May additionally be changing • Decisions may be sequentially structured (chess, mazes: this we won't consider today) Very hard computational problem; computational shortcuts essential Interplay between what you can do and what you should do Both have behavioral & neural consequences
Simple example n-armed bandit, unknown but IID payoffs • surprisingly rich problem Vague strategy to maximize expected payoff: • Predict expected payoff for each option • Choose the best (?) • Learn from outcome to improve predictions
Simple example • Predict expected payoff for each option • Take V_L = last reward received on option L • (more generally, some weighted average of past rewards) • This is an unbiased, albeit lousy, estimator • Choose the best • (more generally, choose stochastically s.t. the machine judged richer is more likely to be chosen) Say left machine pays 10 with prob 10%, 0 otherwise Say right machine pays 1 always What happens? (Niv et al. 2000; Bateson & Kacelnik)
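As a concrete illustration of what happens, here is a minimal simulation sketch (not from the slides; the payoff schedule and the "last reward" estimator follow the setup above, while the occasional-random-choice rule and its epsilon are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(n_trials=100_000, eps=0.1):
    V = {"L": 0.0, "R": 0.0}   # "last reward received" estimate for each machine
    n_left = 0
    for _ in range(n_trials):
        # mostly pick the machine currently judged richer, occasionally pick at random
        if rng.random() < eps or V["L"] == V["R"]:
            c = "L" if rng.random() < 0.5 else "R"
        else:
            c = "L" if V["L"] > V["R"] else "R"
        # left: 10 with prob 10%, else 0; right: 1 always (equal expected value)
        r = (10.0 if rng.random() < 0.1 else 0.0) if c == "L" else 1.0
        V[c] = r                # estimator: last reward received on that machine
        n_left += (c == "L")
    return n_left / n_trials

# despite equal expected payoffs, the risky machine is chosen far less than half the time
print("P(choose risky left machine) ≈", simulate())
```

Because the left machine's estimate equals its last payoff, it reads as "worth 0" on roughly 90% of the trials that sampled it, so choice drifts toward the safe machine even though the means are equal.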
Behavioral anomalies • Apparent risk aversion arises from learning, ie from the way payoffs are estimated • Even though the learner is trying to optimize expected reward, ie is risk neutral by design • Easy to construct other examples producing risk proneness or “probability matching” • Behavioral anomalies can have computational roots • Sampling and choice interact in subtle ways
Reward prediction What can we do? Exponentially weighted running average of rewards on an option: V_t = α [r_t + (1−α) r_{t−1} + (1−α)² r_{t−2} + …] (weights decay exponentially with trials into the past) Convenient form because it can be recursively maintained: V ← V + α (r − V) (‘exponential filter’) ‘error-driven learning’, ‘delta rule’, ‘Rescorla-Wagner’
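A minimal sketch of the recursive form (delta rule), assuming a learning rate alpha chosen for illustration:

```python
def exponential_filter(rewards, alpha=0.3, v0=0.0):
    """Delta-rule / Rescorla-Wagner update: V <- V + alpha * (r - V).
    Equivalent to an exponentially weighted average of past rewards."""
    V = v0
    trace = []
    for r in rewards:
        delta = r - V          # prediction error
        V = V + alpha * delta  # error-driven update
        trace.append(V)
    return trace

print(exponential_filter([1, 0, 0, 1, 1]))
```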
Bayesian view Specify ‘generative model’ for payoffs • Assume payoff following choice of A is Gaussian with unknown mean m_A; known variance σ²_payoff • Assume mean m_A changes via a Gaussian random walk with zero mean and variance σ²_walk [figure: payoff for A and m_A drifting across trials]
Bayesian view Describe prior beliefs about parameters as a probability distribution • Assume they are Gaussian, with some mean and variance Update beliefs in light of experience with Bayes’ rule: P(m_A | payoff) ∝ P(payoff | m_A) P(m_A) [figure: belief distribution over the mean payoff for A]
Bayesian belief updating [figure sequence: belief distribution over the mean payoff for A, updated after each observed payoff]
Notes on Kalman filter • Looks like Rescorla-Wagner but • We track uncertainty as well as the mean • Learning rate is a function of uncertainty (asymptotically constant but nonzero) • Explains why we exponentially weight past rewards: the mean drifts, so older observations are less informative
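A minimal Kalman-filter sketch for this generative model (the variance values and prior are illustrative assumptions), showing that the learning rate is the Kalman gain and depends on current uncertainty:

```python
def kalman_update(mean, var, payoff, var_payoff=1.0, var_walk=0.1):
    """One trial of tracking m_A under the Gaussian random-walk model."""
    var = var + var_walk                  # prediction: uncertainty grows as m_A drifts
    gain = var / (var + var_payoff)       # learning rate = Kalman gain (uncertainty-dependent)
    mean = mean + gain * (payoff - mean)  # delta-rule-like update with adaptive rate
    var = (1 - gain) * var                # the observation reduces uncertainty
    return mean, var

mean, var = 0.0, 10.0                     # vague prior belief about m_A
for r in [1.0, 0.0, 0.0, 1.0]:
    mean, var = kalman_update(mean, var, r)
    print(round(mean, 3), round(var, 3))
```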
The n-armed bandit • n slot machines • binary payoffs, unknown fixed probabilities • you get some limited (technically: random, exponentially distributed) number of spins • want to maximize income • surprisingly rich problem
The n-armed bandit 1. Track payoff probabilities Bayesian: learn a distribution over possible payoff probabilities for each machine This is easy: just requires counting wins and losses (Beta posterior)
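A minimal sketch of this tracking step, assuming a Beta(1, 1) prior on each machine's payoff probability:

```python
from scipy.stats import beta

class BetaBandit:
    """Beta posterior over one machine's (fixed, unknown) payoff probability."""
    def __init__(self, prior_wins=1, prior_losses=1):
        self.wins, self.losses = prior_wins, prior_losses

    def update(self, payoff):             # payoff is 0 or 1
        self.wins += payoff
        self.losses += 1 - payoff

    def posterior(self):
        return beta(self.wins, self.losses)

b = BetaBandit()
for outcome in [1, 0, 0, 1, 1, 0, 0, 0]:
    b.update(outcome)
print("posterior mean payoff probability:", b.posterior().mean())
```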
The n-armed bandit 2. Choose This is hard. Why?
The explore-exploit dilemma 2. Choose Simply choosing apparently best machine might miss something better: must balance exploration and exploitation simple heuristics, eg choose at random once in a while
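Two common heuristics of this kind, sketched with assumed parameter values (epsilon and a softmax temperature):

```python
import numpy as np

rng = np.random.default_rng(1)

def epsilon_greedy(values, eps=0.1):
    """Exploit the best current estimate, but choose at random once in a while."""
    if rng.random() < eps:
        return int(rng.integers(len(values)))
    return int(np.argmax(values))

def softmax_choice(values, temperature=1.0):
    """Choose stochastically s.t. machines judged richer are more likely to be chosen."""
    p = np.exp(np.asarray(values) / temperature)
    p /= p.sum()
    return int(rng.choice(len(values), p=p))

print(epsilon_greedy([0.4, 0.6, 0.5]), softmax_choice([0.4, 0.6, 0.5]))
```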
Explore / exploit Which should you choose? left bandit: 4/8 spins rewarded right bandit: 1/2 spins rewarded mean of both distributions: 50%
Explore / exploit Which should you choose? left bandit: 4/8 spins rewarded right bandit: 1/2 spins rewarded the right bandit (fewer spins) is more uncertain (its distribution has larger variance)
Explore / exploit Which should you choose? Although the more uncertain bandit has a larger chance of being worse… … it also has a larger chance of being better … which would be useful to find out, if true ‘Value of information’: exploring improves future choices Trade off uncertainty, expected value, horizon How to quantify?
Optimal solution This is really a sequential choice problem; can be solved with dynamic programming Naïve approach: each machine has k ‘states’ (number of wins/losses so far); the state of the total game is the product over all machines; curse of dimensionality (k^n states) Clever approach: (Gittins 1972) the problem decouples into one with k states – consider continuing on a single bandit versus switching to a bandit that always pays some known amount. The amount for which you’d switch is the ‘Gittins index’. It properly balances mean, uncertainty & horizon
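A minimal dynamic-programming sketch of the decoupled problem, under simplifying assumptions not in the slides (a fixed finite horizon rather than the discounted/exponential-horizon setting of the true Gittins index, a Beta(1, 1) prior, and an arbitrary safe payoff): it values one uncertain bandit when a machine paying a known amount is always available.

```python
from functools import lru_cache

SAFE = 0.6   # known payoff of the alternative machine (assumed value)

@lru_cache(maxsize=None)
def value(wins, losses, spins_left):
    """Optimal expected income from one Beta-Bernoulli bandit vs. a safe arm."""
    if spins_left == 0:
        return 0.0
    p = (wins + 1) / (wins + losses + 2)   # posterior mean under a Beta(1,1) prior
    stay = (p * (1 + value(wins + 1, losses, spins_left - 1))
            + (1 - p) * value(wins, losses + 1, spins_left - 1))
    switch = SAFE + value(wins, losses, spins_left - 1)
    return max(stay, switch)

# with a long horizon, the uncertain bandit is worth sampling even though its
# posterior mean (0.5) is below the safe payoff: the value exceeds 0.6 * 50
print(value(0, 0, 50), 0.6 * 50)
```

The point of the sketch is the same trade-off the slide describes: exploration is valuable because a good-looking early run on the uncertain machine changes all subsequent choices.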
overview • reinforcement learning • model fitting: behavior • pooling multiple subjects • example • model fitting: fMRI
Model estimation What is a model? • a parameterized stochastic data-generation process Model m predicts data D given parameters θ Estimate parameters: posterior distribution over θ by Bayes’ rule, P(θ | D, m) ∝ P(D | θ, m) P(θ | m) Typically use a maximum likelihood point estimate instead, ie the parameters for which the data are most likely. Can still study uncertainty around the peak: interactions, identifiability
application to RL eg D for a subject is an ordered list of choices c_t and rewards r_t; the model gives the likelihood of each choice, for eg a softmax on learned values, P(c_t = L) ∝ exp(β V_L), where V might be learned by an exponential filter with decay θ
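A minimal sketch of maximum likelihood fitting for such a model (the parameter names alpha for the exponential-filter learning rate and beta for the softmax temperature, and the toy data, are illustrative assumptions):

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood(params, choices, rewards, n_options=2):
    """-log P(choices | rewards, alpha, beta) for softmax choice + exponential filter."""
    alpha, beta = params
    V = np.zeros(n_options)
    nll = 0.0
    for c, r in zip(choices, rewards):
        logp = beta * V - np.log(np.sum(np.exp(beta * V)))  # softmax log-probabilities
        nll -= logp[c]
        V[c] += alpha * (r - V[c])                          # delta-rule value update
    return nll

# choices: option index per trial; rewards: outcome received on that trial
choices = np.array([0, 0, 1, 0, 1, 1, 1, 0])
rewards = np.array([1, 0, 1, 0, 1, 1, 0, 0])
fit = minimize(neg_log_likelihood, x0=[0.3, 1.0],
               args=(choices, rewards),
               bounds=[(0.01, 1.0), (0.01, 20.0)])
print("ML estimates (alpha, beta):", fit.x)
```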
Example behavioral task Reinforcement learning for reward & punishment: • participants (31) repeatedly choose between boxes • each box has a (hidden, changing) chance of giving money (20p) • also, an independent chance of giving an electric shock (8 on a 1-10 pain scale) [figure: task schematic with money and shock outcomes]
What is this good for? • parameters may measure something of interest • eg learning rate, monetary value of shock • allows us to quantify & study neural representations of subjective quantities • expected value, prediction error • compare models • compare groups
Compare models In principle: compare model evidence P(D | m) = ∫ dθ P(D | θ, m) P(θ | m) – an ‘automatic Occam’s razor’ In practice: approximate the integral as max likelihood + penalty: Laplace, BIC, AIC etc. Frequentist version: likelihood ratio test Or: holdout set; difficult in the sequential case Good example refs: Ho & Camerer
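For instance, the standard penalized approximations (a sketch using the textbook BIC/AIC formulas; the numbers compared are made up for illustration):

```python
import numpy as np

def bic(max_log_likelihood, n_params, n_trials):
    """Bayesian information criterion: lower is better."""
    return -2 * max_log_likelihood + n_params * np.log(n_trials)

def aic(max_log_likelihood, n_params):
    """Akaike information criterion: lower is better."""
    return -2 * max_log_likelihood + 2 * n_params

# eg compare a constrained learning model against a many-parameter regression model
print(bic(-320.5, 3, 240), bic(-318.0, 15, 240))
print(aic(-320.5, 3), aic(-318.0, 15))
```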
Compare groups • How to model data for a group of subjects? • Want to account for (potential) inter-subject variability in parameters θ • this is called treating the parameters as “random effects” • ie random variables instantiated once per subject • hierarchical model: • each subject’s parameters drawn from a population distribution • her choices drawn from the model given those parameters
Random effects model Hierarchical model: • What is θ_s? eg a learning rate • What is P(θ_s | θ)? eg a Gaussian, or a mixture of Gaussians • What is θ? eg the mean and variance, over the population, of the regression weights Interested in identifying population characteristics θ (all multisubject fMRI analyses work this way)
Random effects model Interested in identifying population characteristics θ • method 1: summary statistics of individual ML fits (cheap & cheerful: used in fMRI) • method 2: estimate the integral over parameters, eg with Monte Carlo What good is this? • can make statistical statements about parameters in the population • can compare groups • can regularize individual parameter estimates, ie use the population distribution over θ_s as a prior: “empirical Bayes”
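A minimal sketch of method 1 (summary statistics of individual ML fits), assuming per-subject ML estimates of a parameter have already been obtained (the numbers below are made-up placeholders, eg learning rates fitted with the earlier sketch):

```python
import numpy as np
from scipy.stats import ttest_1samp

# hypothetical per-subject ML estimates of a learning rate (one value per subject)
alpha_hat = np.array([0.21, 0.35, 0.12, 0.44, 0.30, 0.27, 0.18, 0.39])

# summary-statistics random-effects test: is the population mean reliably nonzero?
res = ttest_1samp(alpha_hat, popmean=0.0)
print(f"population mean = {alpha_hat.mean():.3f}, "
      f"t = {res.statistic:.2f}, p = {res.pvalue:.4f}")

# comparing two groups (eg depleted vs. placebo) would use an independent-samples test
```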
Example behavioral task Reinforcement learning for reward & punishment: • participants (31) repeatedly choose between boxes • each box has a (hidden, changing) chance of giving money (20p) • also, an independent chance of giving an electric shock (8 on a 1-10 pain scale) [figure: task schematic with money and shock outcomes]
Behavioral analysis Fit trial-by-trial choices using a ‘conditional logit’ regression model • coefficients estimate the effects on choice of past rewards, shocks, & choices (Lau & Glimcher; Corrado et al) • selective effect of acute tryptophan depletion? value(box i) = [past reward, shock & choice indicators for box i] • [weights] values → choice probabilities using the logistic (‘softmax’) rule, prob(box 1) ∝ exp(value(box 1)) probabilities → choices stochastically estimate weights by maximizing the joint likelihood of the choices, conditional on rewards
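A minimal sketch of such a conditional-logit fit for two boxes (the number of lags, the regressor construction, and the toy data are illustrative assumptions, not the study's actual analysis):

```python
import numpy as np
from scipy.optimize import minimize

N_LAGS = 5  # how many past trials to include as regressors (assumed)

def make_design(events, n_lags=N_LAGS):
    """Lagged history matrix: column k holds the event from k+1 trials back."""
    T = len(events)
    X = np.zeros((T, n_lags))
    for k in range(n_lags):
        X[k + 1:, k] = events[:T - k - 1]
    return X

def neg_log_likelihood(w, histories, choices):
    """histories: per-box tuples of design matrices (rewards, shocks, past choices)."""
    # value of each box = its lagged regressors dotted with the shared weights
    values = np.stack([np.concatenate(h, axis=1) @ w for h in histories], axis=1)
    logp = values - np.log(np.exp(values).sum(axis=1, keepdims=True))  # softmax
    return -logp[np.arange(len(choices)), choices].sum()

# toy data: 200 choices between 2 boxes, with per-box reward/shock/choice histories
T, rng = 200, np.random.default_rng(0)
choices = rng.integers(0, 2, T)
rewards = [rng.integers(0, 2, T) * (choices == b) for b in range(2)]
shocks = [rng.integers(0, 2, T) * (choices == b) for b in range(2)]
histories = [(make_design(rewards[b]), make_design(shocks[b]),
              make_design((choices == b).astype(float))) for b in range(2)]

w0 = np.zeros(3 * N_LAGS)
fit = minimize(neg_log_likelihood, w0, args=(histories, choices))
print("estimated reward / shock / choice weights:", np.round(fit.x, 2))
```

With real data the fitted weights play the role of the slide's coefficients: one decay curve over lags for past rewards, one for shocks, and one for past choices.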
Summary statistics of individual ML fits • fairly noisy (unconstrained model, unregularized fits)
models predict exponential decays in reward & shock weights • & typically neglect choice-choice autocorrelation
Fit of TD model (w/ exponentially decaying choice sensitivity), visualized same way (5x fewer parameters, essentially as good fit to data; estimates better regularized)
Quantify the value of pain [figure: estimated monetary values, £0.20, £0.04, -£0.12]
Depleted participants are: • equally shock-driven • more ‘sticky’ (driven to repeat choices) • less money-driven (this effect less reliable)
linear effects of blood tryptophan levels: p < .01 p < .005
overview • reinforcement learning • model fitting: behavior • model fitting: fMRI • random effects • RL regressors
[figure: activation maps (rFP, lFP) thresholded at p<0.01 and p<0.001] • What does this mean when there are multiple subjects? • regression coefficients as random effects • if we drew more subjects from this population, is the expected effect size > 0?
History 1990-1991 – SPM paper, software released, used for PET; low ratio of samples to subjects (within-subject variance not important) 1992-1997 – development of fMRI; more samples per subject 1998 – Holmes & Friston introduce the distinction between fixed and random effects analysis in a conference presentation; reveal SPM had been fixed effects all along 1999 – series of papers semi-defending fixed effects; but the software was fixed