What is STAN doing?
Greg Francis
PSY 626: Bayesian Statistics for Psychological Science
Fall 2018, Purdue University
Bayes formula • P(a | x) = P(x | a) P(a) / P(x) • We provide the prior and the data, and we want to compute the posterior distribution P(a | x) • a is some parameter (e.g., intercept, slope, sigma) • x is the data (a set of scores) • In practice we work with probability densities (or likelihoods) rather than probabilities
Bayes formula • For any given value of a we can compute most of what we need • P(a) is the prior distribution • E.g., N(30, 1) • Easy to calculate • P(x | a) is the likelihood of the data, given a parameter value • E.g., prod(dnorm(x, mean = a, sd = 1)) • Easy to calculate • P(x) is not so easy to compute, but we can often ignore it because it does not depend on a (see the sketch below)
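A minimal R sketch of these pieces, using made-up data and an arbitrary candidate value of a:

# Hypothetical data and a single candidate parameter value
x <- c(28.1, 31.4, 29.7, 30.9, 27.5)            # made-up scores
a <- 30                                          # candidate value of the parameter
prior       <- dnorm(a, mean = 30, sd = 1)                 # P(a): prior density
likelihood  <- prod(dnorm(x, mean = a, sd = 1))            # P(x | a): likelihood of the data
unnorm_post <- likelihood * prior                          # proportional to P(a | x); P(x) is ignored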
Brute force • It might seem that to compute the posterior distribution, we just run through all possible values of a, and compute P(a|x) • Indeed, you can sometimes do exactly that!
Challenges • In practice you cannot try all values of a • Computers cannot represent irrational numbers • Cannot go from negative infinity to positive infinity • It would take infinite time to consider all the rational numbers • In practice, we consider a subset of possible values of a • A grid of possible values • Use an algorithm that has an adaptive grid size to focus on important/useful values (e.g., maximum likelihood, threshold) • How to do this is complicated
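A minimal grid-approximation sketch in R, continuing the made-up example above (the grid range and spacing are arbitrary choices):

a_grid <- seq(25, 35, by = 0.01)                            # a finite grid of candidate values
prior  <- dnorm(a_grid, mean = 30, sd = 1)                  # P(a) at each grid point
lik    <- sapply(a_grid, function(a) prod(dnorm(x, mean = a, sd = 1)))   # P(x | a) at each grid point
post   <- prior * lik
post   <- post / sum(post)                                  # normalize so the grid posterior sums to 1
a_grid[which.max(post)]                                     # grid value with the highest posterior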
Monte Carlo methods • A class of algorithms that use pseudorandom numbers to help estimate something • What is the area of a circle with a radius of r = 5 inches? • We know the answer: A = πr² = 25π ≈ 78.5 square inches
Monte Carlo methods • A class of algorithms that use pseudorandom numbers to help estimate something • What is the area of a circle with a radius of r = 5 inches? • Suppose we randomly assign 20 points to different positions in the surrounding square • We know the square has an area of 100 square inches • Here, 15 points are in the circle • That suggests that the area of the circle is close to (15/20) × 100 = 75 square inches
Monte Carlo methods • This method works for essentially any shape • Provided you can determine whether a point is inside or outside the shape (not always easy to do)
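A minimal R sketch of this estimate (using many more than 20 points, so the answer lands close to 25π ≈ 78.5):

set.seed(1)
n  <- 100000
px <- runif(n, min = -5, max = 5)               # random points in the 10 x 10 square
py <- runif(n, min = -5, max = 5)
inside <- (px^2 + py^2) <= 5^2                  # is each point inside the circle of radius 5?
area_estimate <- mean(inside) * 100             # proportion inside times the area of the square
area_estimate                                   # should be near pi * 5^2 = 78.54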
Estimating a distribution • Suppose there exists some probability density function (distribution) that you want to estimate • One approach is to take random draws from the pdf and make a histogram of the resulting values
Estimating a distribution • Just taking random draws from the pdf gets complicated, mainly because it is not clear how to generate such draws for an arbitrary distribution • Oftentimes, we can calculate the pdf for any given parameter value, but we do not know what parameter values to consider • Guessing or brute force is not likely to work because of the curse of dimensionality • The neighborhood around a given volume is larger than the volume itself, and it gets bigger as the dimensionality gets larger (e.g., a region spanning the middle third of each dimension contains 1/3, 1/9, and 1/27 of the space in 1, 2, and 3 dimensions)
It takes a process • We want a method of sampling from a distribution that reproduces the shape of the distribution (the posterior) • We need to sample cleverly, so that the rate at which we sample a given value reflects the probability density of that value • Markov Chain: what you sample for your next draw depends (only) on the current state of the system
Markov Chain • Here’s a (not so useful) Markov Chain • We have a starting value for the mean of a distribution. We set the “current” value to be the starting value. • 1) We draw a “new” sample from the distribution with the current mean value • 2) We set the current value to be the “new” value • 3) Repeat • Not useful for us because it does not produce an estimate of the underlying distribution
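A minimal R sketch of this chain (the starting value and standard deviation are arbitrary); the samples wander as a random walk rather than settling into any particular distribution:

set.seed(1)
n_steps <- 1000
samples <- numeric(n_steps)
current <- 0                                     # arbitrary starting value
for (i in 1:n_steps) {
  new_value  <- rnorm(1, mean = current, sd = 1) # draw from a distribution centered on the current value
  current    <- new_value                        # the new value becomes the current value
  samples[i] <- current
}
plot(samples, type = "l")                        # a random walk, not an estimate of any fixed distribution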
Metropolis algorithm • Pick some starting value, and set it to be the “current” value. Compute the posterior for the current value • Propose a “new” value by adding to the current value some deviation drawn from a normal distribution, and compute the posterior for the new value • Replace the current value with the new value if a random number between 0 and 1 is less than the ratio P(a_new | x) / P(a_current | x) • Note, if the new value has a higher posterior than the posterior of the current value, this ratio is > 1, so you will always replace the current value with the new value
Metropolis algorithm • What happens as the process unfolds? • If you are in a parameter region with low posterior values, you tend to transition to a parameter region with higher posterior values • If you are in a parameter region with high posterior values, you tend to stay put • But you will move to regions with lower posterior values some of the time • In general, the proportion of samples from different regions reflects the posterior values at those regions • Thus, your samples end up describing the posterior distribution
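A minimal Metropolis sampler in R for the running example (a normal likelihood with known sd = 1 and an N(30, 1) prior on the mean); the proposal sd of 0.5 is an arbitrary tuning choice:

set.seed(1)
x <- c(28.1, 31.4, 29.7, 30.9, 27.5)             # made-up data, as above
log_post <- function(a) {                        # unnormalized log posterior (logs avoid underflow)
  sum(dnorm(x, mean = a, sd = 1, log = TRUE)) + dnorm(a, mean = 30, sd = 1, log = TRUE)
}
n_iter  <- 5000
samples <- numeric(n_iter)
current <- 25                                    # arbitrary starting value
for (i in 1:n_iter) {
  proposal <- current + rnorm(1, mean = 0, sd = 0.5)       # "new" value near the current one
  accept_ratio <- exp(log_post(proposal) - log_post(current))
  if (runif(1) < accept_ratio) current <- proposal         # accept; otherwise stay put
  samples[i] <- current
}
hist(samples)                                    # the histogram approximates the posterior of the mean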
Metropolis algorithm • Here, a green dot means a new sample is “accepted” • You move to a new value • A red dot indicates a new sample is “rejected” • You stay put
Sampling • It should be clear that some posterior distributions will be easier to estimate than others (the figure shows examples labeled Easy, Harder, and Harderer) • How different should the “new” sample be from the “current” sample? • Too small, and you will miss regions separated by a “gap” of low posterior probability • Too big, and you waste time sampling from regions with low probability
Sampling • As dimensionality increases, sampling becomes progressively harder (the figure contrasts an Easy and a Harder case) • Hard to even visualize when the dimension is greater than 2 • You are always sampling a vector of parameter values • E.g., mu1, mu2, sigma, b1, b2, … • With a dependent subjects model, you have independent parameters for each subject • Easy to have hundreds or thousands of dimensions
Metropolis algorithm • The posterior density function defines an (inverted) energy surface • The Metropolis (or Metropolis-Hastings) algorithm moves around the energy surface in such a way that it spends more time at low-energy (high probability) regions than at high-energy (low probability) regions • Nice simulation at • http://arogozhnikov.github.io/2016/12/19/markov_chain_monte_carlo.html • Effect of step size • Rejected samples are a waste of time
Methods • There are a variety of methods for guiding the search process • Clever ways of adapting the size of the differences between current and new choices (many variations; Stan automates their selection) • Gibbs sampling: focus on one dimension at a time • Stan (Hamiltonian Monte Carlo): treat the parameter vector as a physical item that moves around the space of possible parameter vectors • It is given a velocity and momentum, which are used to pick a “new” choice for consideration • Often faster and more efficient
Hamiltonian MC • Each step is more complicated to calculate (generate a trajectory of the “item”): leapfrog steps compute position and momentum • But choices tend to be better overall (fewer rejected samples) • http://arogozhnikov.github.io/2016/12/19/markov_chain_monte_carlo.html • Effect of slide time • Rejected samples are a waste of time (and rare)! • NUTS stands for No U-Turn Sampler • It is how HMC plans (and stops) the trajectory
At the start of each chain we see something like:
Gradient evaluation took 0.00093 seconds
1000 transitions using 10 leapfrog steps per transition would take 9.3 seconds.
Adjust your expectations accordingly!
At the end of each model summary, we see:
Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence, Rhat = 1).
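To make the physics analogy concrete, here is a toy Hamiltonian Monte Carlo sketch in R (not Stan's actual implementation) for a single parameter whose posterior is a standard normal; the step size eps and the number of leapfrog steps L are arbitrary tuning choices:

set.seed(1)
log_post <- function(a) dnorm(a, mean = 0, sd = 1, log = TRUE)   # toy posterior: standard normal
grad_log <- function(a) -a                                       # its gradient
n_iter <- 2000; eps <- 0.1; L <- 20                              # step size and leapfrog steps
samples <- numeric(n_iter)
current <- 3                                                     # arbitrary starting value
for (i in 1:n_iter) {
  a  <- current
  p  <- rnorm(1)                                 # fresh random momentum
  p0 <- p
  # leapfrog integration of the trajectory
  p <- p + 0.5 * eps * grad_log(a)
  for (s in 1:L) {
    a <- a + eps * p
    if (s < L) p <- p + eps * grad_log(a)
  }
  p <- p + 0.5 * eps * grad_log(a)
  # accept or reject based on the change in total "energy"
  log_accept <- (log_post(a) - 0.5 * p^2) - (log_post(current) - 0.5 * p0^2)
  if (log(runif(1)) < log_accept) current <- a
  samples[i] <- current
}
hist(samples)                                    # approximates the standard normal posterior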
Hamiltonian MC • Step size is still an issue • Stan tries to adjust the step size to fit the situation • But it can fail to do so in some cases (often when the step size is too big and misses some features of the posterior) • This gets noticed at the very end (when it is too late to adjust)
model4 <- brm(RT_ms ~ Target*NumberDistractors, family = exgaussian(link = "identity"), data = VSdata2, iter = 8000, warmup = 2000, chains = 3)
…
Elapsed Time: 1.03518 seconds (Warm-up)
1.88005 seconds (Sampling)
2.91523 seconds (Total)
Warning messages:
1: In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL'
2: In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL'
3: In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL'
4: In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL'
5: In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL'
6: There were 11969 divergent transitions after warmup. Increasing adapt_delta above 0.8 may help. See http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
7: Examine the pairs() plot to diagnose sampling problems
Hamiltonian MC • Step size is still an issue • We can follow the warning’s advice and increase adapt_delta to 0.95 • A smaller number of divergent transitions • Takes quite a bit longer
model5 <- brm(RT_ms ~ Target*NumberDistractors, family = exgaussian(link = "identity"), data = VSdata2, iter = 8000, warmup = 2000, chains = 3, control = list(adapt_delta = 0.95))
…
Elapsed Time: 6.89872 seconds (Warm-up)
3.51857 seconds (Sampling)
10.4173 seconds (Total)
Warning messages:
1: In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL'
2: In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL'
3: In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL'
4: In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL'
5: In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL'
6: There were 8570 divergent transitions after warmup. Increasing adapt_delta above 0.95 may help. See http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
7: Examine the pairs() plot to diagnose sampling problems
Hamiltonian MC • We can follow the warning’s advice again and use an even larger adapt_delta (0.99) • It takes quite a bit longer, and here the number of divergent transitions does not actually decrease • New warning related to tree depth (you can set this variable too, but it often does not matter much; see the sketch below)
model5 <- brm(RT_ms ~ Target*NumberDistractors, family = exgaussian(link = "identity"), data = VSdata2, iter = 8000, warmup = 2000, chains = 3, control = list(adapt_delta = 0.99))
…
Elapsed Time: 8.90083 seconds (Warm-up)
33.6789 seconds (Sampling)
42.5797 seconds (Total)
Warning messages:
1: In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL'
2: In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL'
3: In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL'
4: In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL'
5: In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL'
6: There were 11180 divergent transitions after warmup. Increasing adapt_delta above 0.99 may help. See http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
7: There were 3 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 10. See http://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded
8: Examine the pairs() plot to diagnose sampling problems
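If you do want to raise the tree depth limit as well, both settings go in the same control list; a hedged sketch (the model name and the particular values are illustrative choices):

# Hypothetical call: raise both the acceptance target and the tree depth limit
model6 <- brm(RT_ms ~ Target*NumberDistractors, family = exgaussian(link = "identity"),
              data = VSdata2, iter = 8000, warmup = 2000, chains = 3,
              control = list(adapt_delta = 0.99, max_treedepth = 15))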
Divergence • Inconsistent estimates because the step sizes are too big
model3 <- brm(RT_ms ~ Target*NumberDistractors, data = VSdata2, iter = 2000, warmup = 200, chains = 3)
launch_shinystan(model3)
Launching ShinyStan interface... for large models this may take some time.
Loading required package: shiny
Listening on http://127.0.0.1:3422
Divergence • Model3 (based on a normal distribution) is good • Red dots indicate divergence (none here)
Divergence • Model5 (based on an exgaussian distribution) looks bad • Red dots indicate divergence (lots here)
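Besides inspecting the plots in ShinyStan, you can count divergent transitions directly; a sketch assuming the bayesplot package (brms exposes the sampler diagnostics through nuts_params()):

library(bayesplot)
np <- nuts_params(model5)                            # one row per iteration x chain x diagnostic
sum(subset(np, Parameter == "divergent__")$Value)    # total number of divergent transitions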
Divergence • In this particular case, fiddling with adapt_delta did not seem to help at all • Oftentimes, this is a warning that your model specification is not a good one • E.g., the exgaussian may be a poor choice to model this data • E.g., your priors are making a good fit nearly impossible • E.g., you are including independent variables that are correlated, which makes it difficult to converge on one solution • Parameterizing the model in a different way sometimes helps • Centering predictors (for example) • Setting priors can help • I was unable to get a good result for the exgaussian • Perhaps because we have such a small data set (one subject)
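As one example of reparameterizing, centering a predictor before fitting often makes the posterior easier to sample; a minimal sketch (the centered variable and model names are made up):

# Center the number of distractors so the intercept refers to the average display size
VSdata2$NumberDistractorsC <- VSdata2$NumberDistractors - mean(VSdata2$NumberDistractors)
model3c <- brm(RT_ms ~ Target*NumberDistractorsC, data = VSdata2,
               iter = 2000, warmup = 200, chains = 3)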
Sample size
VSdata2 <- subset(VSdata, VSdata$DistractorType=="Conjunction")
model7 <- brm(RT_ms ~ Target*NumberDistractors, family = exgaussian(link = "identity"), data = VSdata2, iter = 8000, warmup = 2000, chains = 3)
…
Elapsed Time: 78.118 seconds (Warm-up)
91.1596 seconds (Sampling)
169.278 seconds (Total)
Warning messages:
1: In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL'
2: In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL'
3: In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL'
4: In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL'
5: In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL'
> model7
Family: exgaussian
Links: mu = identity; sigma = identity; beta = identity
Formula: RT_ms ~ Target * NumberDistractors
Data: VSdata2 (Number of observations: 2720)
Samples: 3 chains, each with iter = 8000; warmup = 2000; thin = 1; total post-warmup samples = 18000
Population-Level Effects:
                                Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
Intercept                        1171.58     24.67  1123.30  1219.96      11870 1.00
TargetPresent                     -59.42     31.33  -121.29     1.84      12077 1.00
NumberDistractors                  16.99      0.71    15.59    18.37      11178 1.00
TargetPresent:NumberDistractors   -10.53      0.90   -12.31    -8.76      10916 1.00
Family Specific Parameters:
      Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
sigma   350.19     12.22   326.73   374.00      13159 1.00
beta    544.63     17.63   510.65   579.21      12275 1.00
• The problem goes away if we use the full data set
Priors help • Suppose you use a broad (non-informative) prior • Uniform distribution • Suppose your search process starts in a region with flat posterior values • So you always move to a new set of parameter values • But you move aimlessly • It takes a long time to sample from regions of higher probability
Priors help • Even a slightly non-uniform prior helps get the search process moving • If the prior moves you in the general direction of the posterior, it can help a lot • It could hurt the search in some cases, but then you probably have other problems! • Once you get close to where the data likelihood dominates, the prior will hardly matter
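In brms, weakly informative priors can be supplied through the prior argument; a minimal sketch (the model name and the particular distributions are illustrative choices, not recommendations):

# Hypothetical priors: mildly informative normals on the coefficients and the intercept
model_priors <- c(set_prior("normal(0, 50)", class = "b"),
                  set_prior("normal(1000, 500)", class = "Intercept"))
model8 <- brm(RT_ms ~ Target*NumberDistractors, data = VSdata2,
              prior = model_priors, iter = 2000, warmup = 200, chains = 3)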
Using brms • Standard brm call that we have used many times
model1 = brm(Leniency ~ SmileType, data = SLdata, iter = 2000, warmup = 200, chains = 3, thin = 2)
print(summary(model1))
plot(model1)
Family: gaussian
Links: mu = identity; sigma = identity
Formula: Leniency ~ SmileType
Data: SLdata (Number of observations: 136)
Samples: 3 chains, each with iter = 2000; warmup = 200; thin = 2; total post-warmup samples = 2700
Population-Level Effects:
                   Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
Intercept              5.36      0.29     4.80     5.93       2620 1.00
SmileTypeFelt         -0.46      0.41    -1.25     0.34       2556 1.00
SmileTypeMiserable    -0.45      0.40    -1.24     0.32       2672 1.00
SmileTypeNeutral      -1.25      0.40    -2.02    -0.44       2433 1.00
Family Specific Parameters:
      Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
sigma     1.65      0.10     1.46     1.87       2650 1.00
Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence, Rhat = 1).
Number of chains/iterations • Let’s break it down to smaller sets so we can understand what is happening
model2 = brm(Leniency ~ SmileType, data = SLdata, iter = 200, warmup = 20, chains = 1, thin = 2)
print(summary(model2))
plot(model2)
Family: gaussian
Links: mu = identity; sigma = identity
Formula: Leniency ~ SmileType
Data: SLdata (Number of observations: 136)
Samples: 1 chains, each with iter = 200; warmup = 20; thin = 2; total post-warmup samples = 90
Population-Level Effects:
                   Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
Intercept              5.35      0.31     4.78     5.92         59 0.99
SmileTypeFelt         -0.43      0.43    -1.24     0.35         66 0.99
SmileTypeMiserable    -0.44      0.42    -1.29     0.39         61 0.99
SmileTypeNeutral      -1.27      0.41    -2.18    -0.56         90 0.99
Family Specific Parameters:
      Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
sigma     1.62      0.10     1.43     1.81         90 0.99
Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence, Rhat = 1).
Chains • Independent sampling “chains” are combined at the end to make a better estimate • Why bother? • Parallel processing • If your computer has multiple processors, you can split up the chains across processors and get faster results • Convergence checking • Independent chains should converge to the same posterior distribution, if not then one or more chains may not be sampling well (and you should not trust the estimated posterior to be accurate) • Rhat is a check on whether the chains converge
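With brms, chains can be run in parallel by matching the cores argument to the number of chains; a minimal sketch (the model name is arbitrary, and it assumes the machine has at least four cores):

# Run four chains on four processor cores
model_par = brm(Leniency ~ SmileType, data = SLdata, iter = 2000, warmup = 200,
                chains = 4, cores = 4)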
Number of chains/iterations • Let’s break it down to smaller sets so we can understand what is happening
model3 = brm(Leniency ~ SmileType, data = SLdata, iter = 200, warmup = 20, chains = 2, thin = 2)
print(summary(model3))
plot(model3)
Family: gaussian
Links: mu = identity; sigma = identity
Formula: Leniency ~ SmileType
Data: SLdata (Number of observations: 136)
Samples: 2 chains, each with iter = 200; warmup = 20; thin = 2; total post-warmup samples = 180
Population-Level Effects:
                   Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
Intercept              5.35      0.43     4.70     5.93         87 1.05
SmileTypeFelt         -0.46      0.39    -1.23     0.26        123 1.02
SmileTypeMiserable    -0.47      0.43    -1.35     0.31        102 1.03
SmileTypeNeutral      -1.24      0.43    -2.18    -0.51        102 1.01
Family Specific Parameters:
      Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
sigma     1.67      0.21     1.46     1.90        123 1.00
Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence, Rhat = 1).
Rhat • Similar in spirit to the F value in an ANOVA: it compares between-chain variability to within-chain variability • B is the standard deviation between chains • W is the standard deviation within chains (pooled across all chains) • When Rhat is noticeably greater than 1 (i.e., B is large relative to W), this is evidence that the chains have not converged • Rhat = 1 does not necessarily mean the chains have converged • In practice, we have confidence in convergence if Rhat is close to 1 (a common rule of thumb is Rhat < 1.1) • Otherwise, increase the number of iterations (or change the model, or use more informative priors)
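A minimal sketch of the classic (non-split) computation in R, assuming a matrix of posterior draws with one column per chain; Stan's own Rhat is computed on split chains and is more robust:

# draws: an (iterations x chains) matrix for a single parameter
rhat_basic <- function(draws) {
  n <- nrow(draws)                               # iterations per chain
  chain_means <- colMeans(draws)
  chain_vars  <- apply(draws, 2, var)
  B <- n * var(chain_means)                      # between-chain variance
  W <- mean(chain_vars)                          # within-chain variance
  var_plus <- (n - 1) / n * W + B / n            # pooled estimate of the posterior variance
  sqrt(var_plus / W)                             # Rhat: near 1 when the chains agree
}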
Iterations • Think of it as similar to sample size for a confidence interval • A bigger sample size leads to a smaller standard error of the mean, and so a tighter confidence interval • How much you need depends on what questions you want to answer • Tails of a distribution are harder to estimate than the center of a distribution • Rule of thumb: 4000 iterations to have reasonable accuracy for the 2.5th and 97.5th percentiles • Might do better with Stan, though
Autocorrelation • The nature of MCMC methods is that one takes a new sample “close” to the current sample • Over the short run, this can lead to correlations between samples • Makes estimating the posterior imprecise • It sorts itself out over the long run • Lots of tools to look for autocorrelation
> launch_shinystan(model3)
Launching ShinyStan interface... for large models this may take some time.
Loading required package: shiny
Listening on http://127.0.0.1:3422
Autocorrelation • An autocorrelation plot shows the correlation between samples as a function of “lag” (number of iterations away) • The plot should drop dramatically as lag increases • If not, then the sampler might be caught in a “loop” • Autocorrelation plots look fine for model3
Autocorrelation • You want the bars to drop quickly as lag increases • This would indicate low autocorrelation • (The figure shows example plots labeled Good; Bad; and Better, but still bad)
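One way to look at autocorrelation directly, assuming the bayesplot package (the choice of parameters and number of lags here is arbitrary):

library(bayesplot)
# Autocorrelation plots for the intercept and sigma, separately for each chain
mcmc_acf(as.array(model3), pars = c("b_Intercept", "sigma"), lags = 20)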
Autocorrelation • A lot of your intuitions about sampling come into play here • Ideally, you want independent samples • If you were estimating the mean, the standard error of the estimate from the posterior distribution would be sigma_p / sqrt(T), where sigma_p is the standard deviation of the posterior distribution and T is the sample size (number of iterations times number of chains) • Correlated samples provide less information about your estimate
Effective Sample Size • If there is correlation in the samples, the formula above overstates how much information we have: the true standard error is larger than sigma_p / sqrt(T) • The loss of information from autocorrelation is expressed as an “effective sample size” T*, roughly T* = T / (1 + 2 × the sum of the autocorrelations across lags), which replaces T in the formula • You want to be sure your T* (Eff.Sample in the brms output) is big enough to properly estimate your parameters • This is a crude measure, so plan accordingly
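A rough illustration of that formula in R for one parameter's draws (this naive version just sums the first 50 estimated autocorrelations; Stan's Eff.Sample uses a more careful estimator):

draws <- as.vector(as.array(model3)[, , "b_Intercept"])   # pool the chains for one parameter
T_total <- length(draws)
rho <- acf(draws, lag.max = 50, plot = FALSE)$acf[-1]     # estimated autocorrelations at lags 1..50
T_effective <- T_total / (1 + 2 * sum(rho))               # crude effective sample size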
Thinning • Should you have an autocorrelation problem, you can try to mitigate it by “thinning” the samples • Just keep every nth sample • You can also just increase the number of iterations
model5 = brm(Leniency ~ SmileType, data = SLdata, iter = 200, warmup = 20, chains = 2, thin = 2)
print(summary(model5))
Family: gaussian
Links: mu = identity; sigma = identity
Formula: Leniency ~ SmileType
Data: SLdata (Number of observations: 136)
Samples: 2 chains, each with iter = 200; warmup = 20; thin = 2; total post-warmup samples = 180
Population-Level Effects:
                   Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
Intercept              5.30      0.33     4.66     5.81         58 1.01
SmileTypeFelt         -0.38      0.45    -1.19     0.49         30 1.06
SmileTypeMiserable    -0.38      0.43    -1.39     0.48         84 0.99
SmileTypeNeutral      -1.20      0.44    -2.10    -0.44         74 1.00
Family Specific Parameters:
      Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
sigma     1.65      0.10     1.50     1.85        145 1.01
model4 = brm(Leniency ~ SmileType, data = SLdata, iter = 200, warmup = 20, chains = 2, thin = 1)
print(summary(model4))
Family: gaussian
Links: mu = identity; sigma = identity
Formula: Leniency ~ SmileType
Data: SLdata (Number of observations: 136)
Samples: 2 chains, each with iter = 200; warmup = 20; thin = 1; total post-warmup samples = 360
Population-Level Effects:
                   Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
Intercept              5.40      0.29     4.78     5.92        129 1.00
SmileTypeFelt         -0.51      0.42    -1.34     0.29        144 1.00
SmileTypeMiserable    -0.49      0.41    -1.28     0.31        133 1.00
SmileTypeNeutral      -1.30      0.41    -2.10    -0.54        128 1.01
Family Specific Parameters:
      Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
sigma     1.65      0.10     1.45     1.84        360 1.00
Conclusions • STAN • Monte Carlo • Markov Chains • Hamiltonian MC • Convergence • Most of this happens in the background and you do not need to be an expert