
Bayesian style t-test

Explore the comparison between a traditional t-test and its Bayesian counterpart for estimating how much time physicians expect to spend with patients, based on patient weight data. Learn how to implement and interpret the results of both approaches.




Presentation Transcript


  1. Bayesian style t-test Greg Francis PSY 626: Bayesian Statistics for Psychological Science Fall 2018 Purdue University

  2. Physicians and patient weight • Hebl & Xu (2001): sent medical charts to physicians. The patient complained of migraine headaches. The chart indicated whether the patient was average_weight or overweight • The question is how much time doctors estimate they would spend with the patient • Each physician contributes one score: a between-subjects design

  3. Standard analysis • # load data file • PWdata<-read.csv(file="PhysiciansWeight.csv",header=TRUE,stringsAsFactors=TRUE) • # Compare to traditional t-test • traditional <- t.test(Time ~ Weight, data=PWdata, var.equal=TRUE) • print(traditional) Two Sample t-test data: Time by Weight t = 2.856, df = 69, p-value = 0.005663 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: 1.997955 11.255633 sample estimates: mean in group average_weight mean in group overweight 31.36364 24.73684

  4. Standard (regression) analysis • # Compare to traditional linear regression • print(summary(lm(Time~Weight, data=PWdata))) • Prints out a table of estimates of Intercept (average_weight) and deviation for overweight (difference of means) Call: lm(formula = Time ~ Weight, data = PWdata) Residuals: Min 1Q Median 3Q Max -19.737 -5.550 -1.364 5.263 35.263 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 31.364 1.697 18.477 < 2e-16 *** Weightoverweight -6.627 2.320 -2.856 0.00566 ** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘’ 1 Residual standard error: 9.751 on 69 degrees of freedom Multiple R-squared: 0.1057, Adjusted R-squared: 0.09276 F-statistic: 8.157 on 1 and 69 DF, p-value: 0.005663

  5. Bayesian variation of t-test • We start with default priors (uniform, improper) • Follow along by downloading “PhysiciansWeight.csv” and “PhysiciansWeight1.R” from the class web site # load data file PWdata<-read.csv(file="PhysiciansWeight.csv",header=TRUE,stringsAsFactors=TRUE) # load the brms library library(brms) model1 = brm(Time ~ Weight, data = PWdata, iter = 2000, warmup = 200, chains = 3, thin = 2 ) # print out summary of model print(summary(model1))
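As a sanity check on the sampler settings above, the number of retained posterior draws follows directly from the chains, iter, warmup, and thin arguments. A small Python sketch of that arithmetic:

```python
# Sampler settings from the brm() call above.
chains = 3
iters = 2000   # iterations per chain (iter)
warmup = 200   # warmup iterations discarded per chain
thin = 2       # keep every 2nd post-warmup draw

# Each chain keeps (2000 - 200) / 2 = 900 draws, so 3 chains give 2700,
# matching "total post-warmup samples = 2700" in the brms output.
total_draws = chains * (iters - warmup) // thin
print(total_draws)  # 2700
```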

  6. Output Family: gaussian Links: mu = identity; sigma = identity Formula: Time ~ Weight Data: PWdata (Number of observations: 71) Samples: 3 chains, each with iter = 2000; warmup = 200; thin = 2; total post-warmup samples = 2700 Population-Level Effects: Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat Intercept 31.41 1.77 27.78 34.74 2342 1.00 Weightoverweight -6.66 2.37 -11.40 -2.03 2476 1.00 Family Specific Parameters: Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat sigma 9.89 0.87 8.38 11.70 2559 1.00 Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence, Rhat = 1). • Prints out a table of estimates of Intercept (average_weight) and deviation for overweight (difference of means)

  7. Output • # Check on model convergence and posteriors • plot(model1) • Looks good!

  8. Output • # Plot model fit to data • dev.new() • plot(marginal_effects(model1), points = TRUE) • (In newer versions of brms, marginal_effects() has been renamed conditional_effects())

  9. HPDI Two Sample t-test data: Time by Weight t = 2.856, df = 69, p-value = 0.005663 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: 1.997955 11.255633 sample estimates: mean in group average_weight mean in group overweight 31.36364 24.73684 • You might be interested in the highest posterior density interval (HPDI) for the difference of means # Compute HPDIs library(broom) library(coda) print( tidyMCMC(model1$fit, conf.int = TRUE, conf.method = "HPDinterval") ) term estimate std.error conf.low conf.high 1 b_Intercept 31.408944 1.7658964 27.759071 34.703057 2 b_Weightoverweight -6.662331 2.3698441 -11.032153 -1.768079 3 sigma 9.886702 0.8658198 8.420497 11.715446
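tidyMCMC computes the HPD interval for you via coda. As an illustration of what that interval means (not the coda algorithm itself), here is a minimal Python sketch that finds the shortest interval containing 95% of a set of posterior draws:

```python
import numpy as np

def hpd_interval(samples, prob=0.95):
    """Shortest interval containing `prob` of the posterior draws."""
    x = np.sort(np.asarray(samples))
    n = len(x)
    k = int(np.ceil(prob * n))           # number of draws the interval must cover
    widths = x[k - 1:] - x[:n - k + 1]   # width of every candidate interval
    i = np.argmin(widths)                # the shortest one wins
    return x[i], x[i + k - 1]

# Skewed example: the HPD interval sits closer to the mode than an
# equal-tailed interval would.
rng = np.random.default_rng(0)
draws = rng.gamma(shape=2.0, scale=1.0, size=10_000)
lo, hi = hpd_interval(draws, 0.95)
```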

  10. t-test versus Bayesian • Note, we get essentially the same pattern of results either way • It looks like there is a difference in the mean times dependent on the patient weight

  11. Posteriors • You can compute things that are not possible with the standard t-test • Probability that the mean time for overweight patients is longer than for average_weight patients post<-posterior_samples(model1) dev.new() plot(density(post$b_Weightoverweight)) # check probability that difference is greater than zero cat("Probability that group mean difference is greater than zero is: ", sum(post$b_Weightoverweight > 0) / length(post$b_Weightoverweight)) >Probability that group mean difference is greater than zero is: 0.002222222
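The computation on this slide is just the fraction of posterior draws above zero. A Python sketch of the same idea, using a normal approximation to the posterior summarized earlier (mean -6.66, sd 2.37 for b_Weightoverweight) as a hypothetical stand-in for the actual MCMC draws:

```python
import numpy as np

# Hypothetical stand-in for post$b_Weightoverweight: approximate the
# posterior of the group-mean difference (overweight - average_weight)
# with a normal distribution built from the brms summary statistics.
rng = np.random.default_rng(1)
diff_draws = rng.normal(loc=-6.66, scale=2.37, size=2700)

# P(difference > 0) is simply the fraction of draws above zero.
p_greater = np.mean(diff_draws > 0)
print(p_greater)  # a small value, on the order of the slide's 0.0022
```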

  12. Comparing models Compare to sample statistics > mean(PWdata$Time) [1] 27.8169 > sd(PWdata$Time) [1] 10.23762 • What happens for a model with just one mean? (One intercept) • model2 = brm(Time ~ 1, data = PWdata, iter = 2000, warmup = 200, chains = 3, thin = 2 ) print(model2) Family: gaussian Links: mu = identity; sigma = identity Formula: Time ~ 1 Data: PWdata (Number of observations: 71) Samples: 3 chains, each with iter = 2000; warmup = 200; thin = 2; total post-warmup samples = 2700 Population-Level Effects: Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat Intercept 27.88 1.24 25.38 30.29 2414 1.00 Family Specific Parameters: Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat sigma 10.39 0.91 8.75 12.28 2222 1.00

  13. Comparing models • Does a model with two means (model1) do better or worse at LOO than a model with one mean (model2)? • In terms of predicting data, the two-mean model does better than the one-mean model > LOO(model1, model2) LOOIC SE model1 529.91 16.43 model2 535.41 14.01 model1 – model2 -5.50 6.24
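LOOIC is on a lower-is-better scale (-2 times the expected log predictive density), and the paired difference between models has its own standard error. A small Python sketch, using the numbers above, of the common rule of thumb that a difference is only decisive when it exceeds roughly twice its SE:

```python
# LOOIC values and the SE of their difference, from the LOO() output above.
looic_model1 = 529.91   # two means (Time ~ Weight)
looic_model2 = 535.41   # one mean (Time ~ 1)
diff = looic_model1 - looic_model2   # -5.50: model1 predicts better
diff_se = 6.24

# Rule of thumb: treat the difference as decisive only when it is more
# than about twice its standard error.
decisive = abs(diff) > 2 * diff_se
print(round(diff, 2), decisive)  # model1 is favored, but not decisively
```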

  14. Welch’s t-test • We have unequal sample sizes in our dataset, so we might want to use Welch’s test in case we have unequal standard deviations (s=9.86 for average_weight and s=9.65 for overweight) > traditional <- t.test(Time ~ Weight, data=PWdata, var.equal=FALSE) > print(traditional) Welch Two Sample t-test data: Time by Weight t = 2.8516, df = 67.174, p-value = 0.005774 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: 1.988532 11.265056 sample estimates: mean in group average_weight mean in group overweight 31.36364 24.73684
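For reference, Welch's statistic and its degrees of freedom can be computed directly from the group summaries. A Python sketch, assuming group sizes of 33 (average_weight) and 38 (overweight), which is consistent with the pooled test's n = 71 and df = 69:

```python
import math

def welch_t(m1, s1, n1, m2, s2, n2):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    v1, v2 = s1 ** 2 / n1, s2 ** 2 / n2   # per-group variance of the mean
    t = (m1 - m2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

# Group means and sds from the slides; group sizes are assumed (see above).
t, df = welch_t(31.36364, 9.86, 33, 24.73684, 9.65, 38)
print(round(t, 3), round(df, 2))  # close to the R output: t = 2.8516, df = 67.174
```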

  15. Bayesian Welch’s t-test • We modify the model setup so that it treats sigma as differing across groups • # Define a formula that includes variations in sigma by group • uneq_var_frm <- brmsformula(Time ~ Weight, sigma ~ Weight) • model3 = brm(uneq_var_frm, data = PWdata, iter = 2000, warmup = 200, chains = 3, thin = 2 ) • # print out summary of model • print(summary(model3))

  16. Bayesian Welch’s t-test • Output splits up intercept and slope for sigma as well as for means • Careful of the link! Family: gaussian Links: mu = identity; sigma = log Formula: Time ~ Weight sigma ~ Weight Data: PWdata (Number of observations: 71) Samples: 3 chains, each with iter = 2000; warmup = 200; thin = 2; total post-warmup samples = 2700 Population-Level Effects: Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat Intercept 31.41 1.80 27.94 34.93 2344 1.00 sigma_Intercept 2.30 0.12 2.08 2.56 2421 1.00 Weightoverweight -6.60 2.44 -11.44 -1.76 2187 1.00 sigma_Weightoverweight -0.03 0.17 -0.36 0.30 2442 1.00 exp(2.30)=9.974 exp(2.30-0.03)= 9.679
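Because sigma uses a log link, the coefficients in the sigma rows combine additively on the log scale and must be exponentiated back to the data scale, as the slide's exp() lines show. A Python sketch of that conversion:

```python
import math

# sigma coefficients from the brms output above (log scale).
sigma_intercept = 2.30   # log-sigma for the average_weight group
sigma_slope = -0.03      # deviation in log-sigma for the overweight group

# Exponentiate to get each group's standard deviation on the data scale.
sd_average = math.exp(sigma_intercept)                    # ~ 9.974
sd_overweight = math.exp(sigma_intercept + sigma_slope)   # ~ 9.679
print(round(sd_average, 3), round(sd_overweight, 3))
```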

  17. Bayesian Welch’s t-test • Posteriors and chains look fine

  18. Bayesian Welch’s t-test • Does a model with two means and two standard deviations do better than a model with two means and one standard deviation? (No) > print(LOO(model1,model3)) LOOIC SE model1 529.91 16.43 model3 533.14 18.25 model1 - model3 -3.23 2.03 Warning messages: 1: In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL' 2: In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL' 3: In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL' 4: In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL' 5: In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL' 6: In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL' 7: Found 1 observations with a pareto_k > 0.7 in model 'model3'. It is recommended to set 'reloo = TRUE' in order to calculate the ELPD without the assumption that these observations are negligible. This will refit the model 1 times to compute the ELPDs for the problematic observations directly.

  19. Bayesian Welch’s t-test • Does a model with two means and two standard deviations do better than a model with two means and one standard deviation? (No) > print(LOO(model1, model3, reloo=TRUE)) No problematic observations found. Returning the original 'loo' object. 1 problematic observation(s) found. The model will be refit 1 times. Fitting model 1 out of 1 (leaving out observation 59) Start sampling …… LOOIC SE model1 529.91 16.43 model3 533.11 18.22 model1 - model3 -3.20 2.00 There were 23 warnings (use warnings() to see them)

  20. Bayesian Welch’s t-test • Which kind of model does best?: • Two means, one standard deviation (model1) • One mean, one standard deviation (model2) • Two means, two standard deviations (model3) > print(LOO(model1,model2, model3)) LOOIC SE model1 529.91 16.43 model2 533.05 18.10 model3 533.14 18.25 model1 - model2 -3.14 1.89 model1 - model3 -3.23 2.03 model2 - model3 -0.09 0.19

  21. Finish up these models • Define and fit model4 that uses • One mean, two standard deviations • Compare all the models with > print(LOO(model1,model2, model3, model4))

  22. Non-default priors • model5 = brm(Time ~ Weight, data = PWdata, iter = 2000, warmup = 200, chains = 3, thin = 2, prior = c(prior(normal(0, 10), class = "Intercept"), prior(normal(0, 10), class = "b")) ) • print(summary(model5)) Family: gaussian Links: mu = identity; sigma = identity Formula: Time ~ Weight Data: PWdata (Number of observations: 71) Samples: 3 chains, each with iter = 2000; warmup = 200; thin = 2; total post-warmup samples = 2700 Population-Level Effects: Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat Intercept 30.83 1.69 27.53 34.13 2342 1.00 Weightoverweight -6.35 2.29 -10.84 -1.98 2340 1.00 Family Specific Parameters: Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat sigma 9.86 0.85 8.36 11.76 2433 1.00

  23. Kind of crazy prior • model6 = brm(Time ~ Weight, data = PWdata, iter = 2000, warmup = 200, chains = 3, thin = 2, prior = c(prior(normal(10, 1), class = "Intercept"), prior(normal(0, 10), class = "b")) ) • print(summary(model6)) Family: gaussian Links: mu = identity; sigma = identity Formula: Time ~ Weight Data: PWdata (Number of observations: 71) Samples: 3 chains, each with iter = 2000; warmup = 200; thin = 2; total post-warmup samples = 2700 Population-Level Effects: Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat Intercept 16.37 2.41 11.56 21.06 2358 1.00 Weightoverweight -5.60 3.92 -13.04 2.13 2511 1.00 Family Specific Parameters: Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat sigma 17.60 1.74 14.40 21.34 2536 1.00

  24. Reasonable priors • Let’s discuss what might be reasonable priors here • Then define corresponding models

  25. Conclusions • Bayesian t-tests • Welch’s test • Model comparison • Setting priors
