About priors

About priors Greg Francis PSY 626: Bayesian Statistics for Psychological Science Fall 2018 Purdue University

Levels of priors • Flat prior: You know nothing, often improper • Super-vague but proper: N(0,1000000) • Weakly informative prior: N(0, 10) • Generic weakly informative prior: N(0,1) • Specific informative prior: N(0.4, 0.2)
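As a rough sketch of how far apart these levels are, the base R snippet below computes the central 95% interval implied by each proper normal prior on the slide. It assumes the second argument of N(., .) is a standard deviation (the Stan/brms convention); the names `priors` and `interval95` are illustrative and not from the lecture.

```r
# Priors from the slide, read as N(mean, sd); the sd reading is an assumption.
priors <- data.frame(
  label = c("super-vague", "weakly informative",
            "generic weakly informative", "specific informative"),
  mean  = c(0, 0, 0, 0.4),
  sd    = c(1e6, 10, 1, 0.2)
)

# Central 95% interval of each prior: mean +/- qnorm(0.975) * sd
z <- qnorm(0.975)
interval95 <- data.frame(
  label = priors$label,
  lower = priors$mean - z * priors$sd,
  upper = priors$mean + z * priors$sd
)
print(interval95)
```

Whether a given interval counts as "vague" or "informative" still depends on the scale of the parameter in the model at hand.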

How to pick a prior • Three main issues to consider • Technical/computational: Some priors cause problems with Stan or MCMC methods • Sometimes you pick a prior to avoid these problems • Model definition: You have to define a prior, but you have little knowledge to guide you • Default priors fill the gap of knowledge • Model definition: Your model specifies a narrow range of parameter values • Informative priors

Technical issues • Using a flat prior tends to give parameter estimates similar to frequentist approaches (e.g., linear regression) • Mean of posterior distribution matches regression estimates • But a flat prior can introduce instability in Stan/MCMC as it searches around parameter space • There are similar technical problems with hard limits to priors • Instead of Uniform(0,1) use N(0.5, 0.5)

Technical issues • If you replace a non-informative/flat prior for computational reasons, it is appropriate to justify the replacement as providing computational benefits • Something like: Although we felt parameter A could be anywhere between 0 and 1, to promote model convergence we set its prior to be N(0.5, 0.5).
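One way to see what the swap from Uniform(0,1) to N(0.5, 0.5) gives up is to check how much prior mass the normal prior leaves inside the original range. This is a minimal base R sketch, assuming N(0.5, 0.5) means mean 0.5 and standard deviation 0.5; the variable names are illustrative.

```r
# Prior mass that N(0.5, 0.5) places inside the old Uniform(0, 1) range.
# Assumes the N(mean, sd) parameterization.
inside  <- pnorm(1, mean = 0.5, sd = 0.5) - pnorm(0, mean = 0.5, sd = 0.5)
outside <- 1 - inside
cat(sprintf("inside [0,1]: %.3f, outside: %.3f\n", inside, outside))
```

Roughly a third of the prior mass falls outside the unit interval, so it is worth confirming that values outside the range are harmless for the model before making this substitution.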

Weakly informative prior • You may not know what range of values corresponds to a parameter • But you may want to provide constraints anyhow • Shrinkage (often called regularization) offers benefits even when you don’t know the best value to shrink to (mean of the prior) • More generally, you may not know what values are appropriate, but you probably know some values that are inappropriate • Ridiculous values for a mean or standard deviation in a reaction time experiment (too short or too long) • Family income for children in an inner-city school

Weakly informative prior • The overall goal is to rule out unreasonable parameter values • Less concern over having the prior specify the true parameter value • For example, reaction times are rarely shorter than 150 milliseconds, even for the simplest of tasks • (often shorter RTs are interpreted as “anticipatory” responses rather than reactions) • N(150, 1000) is about the same as N(0, 1000)

Weakly informative prior • Err on the side of broadness: N(150, 2000) • A broader prior allows for more robustness in model fitting • In contrast, if your reaction time task results in RT values around 1500 ms, the N(150, 1000) prior is going to have a hard time finding the posterior distribution because everything is in the tails • A broader prior has a cost: loss of precision if the data is “typical” • Broader posterior distribution
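The cost of a broader prior can be illustrated with a conjugate normal model, where the posterior is available in closed form. This is a sketch under assumptions that are not in the slides: the data standard deviation `sigma` is treated as known, and the sample size, sample mean, and `sigma` are made-up values chosen only for illustration. The function name `normal_posterior` is hypothetical.

```r
# Posterior for a normal mean with known data sd (conjugate normal-normal update).
# Prior: mu ~ N(m0, s0). Data: n observations with mean ybar and known sd sigma.
normal_posterior <- function(m0, s0, ybar, n, sigma) {
  prior_prec <- 1 / s0^2          # precision of the prior
  data_prec  <- n / sigma^2       # precision contributed by the data
  post_var   <- 1 / (prior_prec + data_prec)
  post_mean  <- post_var * (prior_prec * m0 + data_prec * ybar)
  c(mean = post_mean, sd = sqrt(post_var))
}

# Hypothetical reaction time data: 20 trials, mean 1500 ms, known sd 300 ms.
narrow <- normal_posterior(m0 = 150, s0 = 1000, ybar = 1500, n = 20, sigma = 300)
broad  <- normal_posterior(m0 = 150, s0 = 2000, ybar = 1500, n = 20, sigma = 300)
print(rbind(narrow, broad))
```

In this toy setting the broader prior yields a slightly wider posterior and less shrinkage toward 150 ms. The gap is small with 20 observations and grows as the sample size shrinks.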

Remember the posterior • On the other hand, the goal of a Bayesian analysis is to identify the posterior distribution of a parameter • Do not be too obsessed with just the mean of the posterior distribution • A narrower posterior distribution is surely better than a broad distribution

Remember the likelihood • The likelihood characterizes how your data is generated • Normal distribution • Ex-gaussian distribution • Logit model • The description of the data generation process is as important as specifying the priors for the likelihood parameters • The likelihood needs to be justified just as much as (perhaps more than) the priors

Complications • Depending on the model/likelihood, setting broad/uninformative priors can actually be very constraining • Especially for complex models an effort to use broad priors can lead to weird effects, where lots of prior “weight” is put on rare parameter values • These are subtle effects: my advice is to get expert advice (which is often reflected in default settings)

Informative priors • You get the most benefit from a Bayesian analysis by using informative priors • More precise models are easier to test because they make more precise predictions • Informative priors are not the norm in Bayesian analyses • You should constantly look for the opportunity to use them • They can come from many different sources

Subjective/Objective priors • If you start to read the history/literature of Bayesian methods, you will see discussions about what a prior is • Objective priors: expressions of models, principles, scientific consensus, or computational technicalities • Subjective priors: the scientist’s “belief” • Personally, I find the “belief” approach to priors to be problematic • As a scientist, I hardly care about another researcher’s beliefs; nor would I directly include my beliefs into my analysis

Subjective priors • Moreover, there seem to be fundamental problems with priors as beliefs • The basic idea is that a prior expresses your belief about some parameter value • After gathering and analyzing data, the posterior describes how your belief should be modified

Subjective priors • But this interpretation only makes sense if your prior is a reasonable characterization of what you (should) believe • How could that happen? • Maybe you have been doing Bayesian analyses since birth? (unlikely) • Maybe you use some non-Bayesian method for establishing beliefs? • If they work well, then why not continue doing that instead of Bayesian analysis? • If they do not work well, then this is a poor starting point for your Bayesian analysis.

Virtues • Gelman & Hennig (2016) argue that there is no one way to think about priors (and data analysis, more generally) • They recommend thinking about “virtues” for data analysis • Instead of objectivity, think about: • Transparency • Consensus • Impartiality • Correspondence to observable reality • Instead of subjectivity, think about: • Multiple perspectives • Context dependence

Transparency • Clear and unambiguous definitions of concepts • Challenging for many models in the social sciences • Some attempt is better than nothing: weakly informative prior is better than a non-informative prior • Open planning and following agreed protocols • Full communication of reasoning, procedures, spelling out of (potentially unverifiable) assumptions and potential limitations • Basically, be honest about what you have done and why

Consensus • Accounting for relevant knowledge and existing related work • For example, Zelano et al. (2016) reported that memory improved for items studied when inhaling compared to items studied while exhaling (d=0.86) • This is much larger than the best known mnemonic strategy (d=0.49) • Following generally accepted rules where possible and reasonable • Do not make up a new measure or analysis method for your investigation • If you have to make up something new, you need to validate it • Provision of rationales for consensus and unification • Debates need to have some means of reaching conclusions • Scientists should be able to agree on what kinds of results would make them change their mind • If not, then you are probably not having a scientific debate

Impartiality • Thorough consideration of relevant and potentially competing theories and points of view • Thorough consideration and, if possible, removal of potential biases: factors that may jeopardize consensus and the intended interpretation of results • Openness to criticism and exchange • In terms of priors, you need to be open to the possibility that other scientists could come up with reasonable but different priors that produce different conclusions

Correspondence to reality • Clear connection of concepts and models to observables • This is definitely about priors; they are not just beliefs! • Clear conditions for reproduction, testing, and falsification • This is more about experimental design

Multiple perspectives and Context Dependence • Recognition of dependence on specific contexts and aims • Reality and facts are only accessible through individual personal experiences • Different people have different skill sets and resources • Honest acknowledgment of the researcher’s position, goals, experiences, and subjective point of view • Different information and different viewpoints can be valuable • Know your own limitations

Investigation of Stability • Consequences of alternative decisions and assumptions that could have been made in the analysis • Other models that could have been considered, other comparisons that could be made • Different priors may change your conclusions • Variability and reproducibility of conclusions on new data • New data may change your conclusions • Respect the uncertainty in your data and in your model analysis

Using the virtues • Every prior should have some kind of justification (a sentence) explaining why it is being used • Gelman & Hennig (2016) give an example:

Informative prior? • Rule of thumb: • Compare the standard deviation of the posterior to the standard deviation of the prior • If the posterior standard deviation is more than 0.1 times the prior standard deviation, then the prior distribution is “informative” • You should double-check that the prior makes sense for your situation

Informative prior? • For the smiles and leniency data set, we ran a model with some priors on the slopes
    # Intercept is first Level read in (False SmileType)
    model2 = brm(Leniency ~ SmileType, data = SLdata,
                 iter = 2000, warmup = 200, chains = 3, thin = 2,
                 prior = c(prior(normal(0, 10), class = "Intercept"),
                           prior(normal(0, 10), class = "b"),
                           prior(cauchy(0, 5), class = "sigma")))
    Family: gaussian
    Links: mu = identity; sigma = identity
    Formula: Leniency ~ SmileType
    Data: SLdata (Number of observations: 136)
    Samples: 3 chains, each with iter = 2000; warmup = 200; thin = 2;
             total post-warmup samples = 2700
    Population-Level Effects:
                       Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
    Intercept              5.36      0.28     4.83     5.91       2298 1.00
    SmileTypeFelt         -0.45      0.40    -1.24     0.33       2173 1.00
    SmileTypeMiserable    -0.44      0.39    -1.19     0.32       2376 1.00
    SmileTypeNeutral      -1.24      0.39    -2.03    -0.48       2427 1.00
    Family Specific Parameters:
          Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
    sigma     1.64      0.10     1.46     1.86       2487 1.00
    Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample
    is a crude measure of effective sample size, and Rhat is the potential
    scale reduction factor on split chains (at convergence, Rhat = 1).

Informative prior? • We can pull out the posterior information • The prior had sd=10 • The ratio of posterior sd to prior sd is • 0.27/10 = 0.027 • We did not use an informative prior
    > post<-posterior_samples(model2)
    > FalseLeniency <- post$b_Intercept
    > sd(FalseLeniency)
    [1] 0.2765206
    > plot(density(FalseLeniency))

Informative prior? • In Lecture 10, we ran a model with a (“bad”) prior for a slope • ADHD data set: D60 condition
    model4 = brm(CorrectResponses ~ Dosage + (1 | SubjectID), data = ATdata,
                 iter = 2000, warmup = 200, chains = 3, thin = 2,
                 prior = c(prior(normal(20, 1), class = "b", coef = "DosageD60")))
    print(summary(model4))
    Family: gaussian
    Links: mu = identity; sigma = identity
    Formula: CorrectResponses ~ Dosage + (1 | SubjectID)
    Data: ATdata (Number of observations: 96)
    Samples: 3 chains, each with iter = 2000; warmup = 200; thin = 2;
             total post-warmup samples = 2700
    Group-Level Effects:
    ~SubjectID (Number of levels: 24)
                  Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
    sd(Intercept)     9.30      1.69     6.51    13.13       1369 1.00
    Population-Level Effects:
              Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
    Intercept    33.18      2.30    28.53    37.62       1612 1.00
    DosageD15     6.28      2.10     2.20    10.58       2591 1.00
    DosageD30    10.93      2.12     6.91    15.13       2429 1.00
    DosageD60    17.61      1.00    15.68    19.56       2318 1.00
    Family Specific Parameters:
          Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
    sigma     8.07      0.75     6.78     9.66       1929 1.00

Informative prior? • Looking at posteriors • Prior has sd=1 • The ratio of posterior sd to prior sd is • 0.99/1 = 0.99 • Informative prior! • Is it “bad”?
    > post<-posterior_samples(model4)
    > D60 <- post$b_DosageD60
    > sd(D60)
    [1] 0.9978903
    > plot(density(D60))
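The rule of thumb used in the two examples above can be wrapped in a small helper. This is a minimal base R sketch; the function name `prior_is_informative` and its `threshold` argument are hypothetical, and the posterior standard deviations are simply copied from the console output in the slides rather than recomputed from a fitted model.

```r
# Rule of thumb from the slides: a prior is "informative" when
# posterior sd > 0.1 * prior sd. The function name is hypothetical.
prior_is_informative <- function(posterior_sd, prior_sd, threshold = 0.1) {
  ratio <- posterior_sd / prior_sd
  list(ratio = ratio, informative = ratio > threshold)
}

# Values copied from the two examples in the slides.
leniency <- prior_is_informative(posterior_sd = 0.2765206, prior_sd = 10)
d60      <- prior_is_informative(posterior_sd = 0.9978903, prior_sd = 1)

str(leniency)  # intercept in the smiles and leniency model
str(d60)       # DosageD60 slope in the ADHD model
```

With a fitted model, `posterior_sd` would come from `sd()` applied to the relevant column of the posterior samples, as in the console sessions above. A flag from this check is a prompt to double-check the prior, not a verdict that the prior is wrong.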

Theory • Informative priors often come from theory, but especially from theories that have defined mechanisms • Mechanisms are fundamental to science • It is one thing to say that reaction times increase with set size in a visual search task • This theory allows you to identify terms of a model that can be estimated from data (e.g., look for a positive effect of set size) • It is something else to say that reaction times increase with set size because of a serial process that moves attention from element to element • This theory allows you to predict a roughly 2:1 slope ratio for target absent compared to target present trials
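The 2:1 prediction can be derived from a simple version of the serial account. The base R sketch below assumes a self-terminating search with a single target and a fixed inspection time per item; these assumptions, and the values of `base_rt` and `per_item`, are illustrative and not taken from the slides.

```r
# Serial self-terminating search sketch (assumptions: fixed time per item,
# one target, search stops as soon as the target is found).
base_rt  <- 400   # ms, hypothetical non-search time
per_item <- 50    # ms per inspected item, hypothetical
set_size <- c(4, 8, 12, 16)

# Target present: on average (n + 1) / 2 items are inspected before the target.
rt_present <- base_rt + per_item * (set_size + 1) / 2
# Target absent: all n items must be inspected.
rt_absent  <- base_rt + per_item * set_size

# Slopes of expected RT against set size, and their ratio.
slope_present <- unname(coef(lm(rt_present ~ set_size))[2])
slope_absent  <- unname(coef(lm(rt_absent ~ set_size))[2])
slope_ratio   <- slope_absent / slope_present
print(c(present = slope_present, absent = slope_absent, ratio = slope_ratio))
```

Because the ratio follows from the mechanism rather than from the particular timing values, it is the kind of quantity that could be encoded as an informative prior on the relation between the two slopes.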

Not just empirical results • Consider superconductivity • Discovered in 1911, plays an important role in fMRI • How do we know superconductivity works the same in Lausanne and West Lafayette, Indiana? • Mountains? • Lake versus river? • Brick buildings? • French versus English? • 7T versus 3T? • It’s not just that superconductivity worked before! • Every new environment is different

Mechanisms • There is a theory about superconductivity that describes mechanisms that produce it • Meissner effect (1930s) • Cooper pairs in quantum mechanics (1950s) • This theory predicts when superconductivity works and when it does not • That’s how engineering works • It’s not perfect • High-temperature superconductivity remains unexplained • That’s where science is being done • We know/believe fMRI will work in both Lausanne and West Lafayette because we understand the mechanisms that determine when superconductivity will (and will not) happen

Getting to mechanisms • If we want to have successful/robust science, our long term goal is identification of mechanisms • We might not get there in our lifetime • Exploratory work • Confirmatory work • Proposing theories • Testing theories • It is not just successful prediction from a statistical model • That can be valuable, but it is not enough

Plague • Paul-Louis Simond (1898) discovered that the plague was transmitted by fleas on rats • Once a mechanism is identified, it suggests what to do • To reduce occurrence of the plague, reduce the number of rats and contact with rats • Kill rats • Keep dogs and cats • Seal food containers • Set rat traps • Avoid rats • Don’t bother with quarantining the family of an infected person • Not precise predictions about the magnitude of the benefit (but they should all work to some extent)

Mechanisms in social sciences • Psychology faces challenges because there are very few proposed mechanisms • Even when something seems to be a strong effect, we cannot judge when it will apply and when it will not • Neuroscience and medicine have some hope because scientists naturally seek out mechanisms based on biology • But there are other problems with sample sizes and costs of investigations • Keep in mind that the long-term goal is to identify mechanisms, and plan studies and analyses accordingly

Conclusions • Various types of priors • If you are in the social sciences, you get priors from: • Defaults to help model convergence/estimation • Other literature (range of plausible values) • Theory • There are no simple procedures for producing good priors • Be transparent and be honest
