36-463/663: Multilevel & Hierarchical Models
Bayesian Model Checking & Parameter Tests
Brian Junker
132E Baker Hall
brian@stat.cmu.edu
Outline
• Model Checking
  • Two strategies for checking model fit
    • DIC
    • Simulation checks / fake data / posterior predictive checks
  • Example: Rating professors
• Parameter Tests
  • Testing a difference by simulation
  • Example: Risky Behaviors
Two main strategies for checking model fit
• Overall index of fit based on the deviance (a small illustration follows below)
  • AIC = D(M) + 2 k
  • BIC = D(M) + k log(n)
  • DIC = mean_sim(D(M)) + 2 k_D
• Simulation-based checks
  • “Fake data” methods
  • Posterior predictive checks
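As a rough illustration of the penalized-deviance idea, the sketch below computes AIC and BIC by hand from a least-squares fit and compares them with R's built-in AIC() and BIC(). It reuses the eval ~ beauty formula from the example later in the deck, but treat it as an illustration, not code from the slides; note that R counts the residual standard deviation as an extra parameter, so its k is one more than the k reported by display().

# Hand-rolled penalized-deviance indices for a least-squares fit (illustration only)
fit <- lm(eval ~ beauty)                  # assumes eval and beauty are in the workspace
D   <- as.numeric(-2 * logLik(fit))       # deviance = -2 * log-likelihood
k   <- length(coef(fit)) + 1              # coefficients plus the residual sd
n   <- length(residuals(fit))
c(AIC = D + 2 * k, BIC = D + k * log(n))  # should agree with AIC(fit) and BIC(fit)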
Overall penalized-deviance indices
• Pros:
  • Quick and dirty (a drop of 2 is marginally interesting, a drop of 3 is interesting, a drop of 10 is a big deal)
  • Tells you whether one model “fits better” overall than another
• Cons:
  • Doesn’t tell you whether the “best” model really fits
  • If a model doesn’t fit, doesn’t tell you what went wrong
Simulation-based checks
• Basic idea:
  • Simulate data from the fitted model
  • Does the original data look like it could have come from this simulation?
• Implementations:
  • Eyeball a graph
  • Make a hypothesis test: use the simulated “fake data” as the “null hypothesis” distribution
  • If the model fits, you will not reject the null when you test the “real data”
Simulation-based checks
• Cons:
  • The method is not automatic
  • You have to choose a “feature” of the data or model that you want to check
  • You have to program the simulation
• Pros:
  • You can “see” what went wrong with the model, and sometimes fix it
• IDEALLY, DIC (or AIC or BIC) and simulation checks work together
  • Simulation checks illustrate what made DIC go down
A generic skeleton for such a check is sketched below.
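The recipe above can be written as a small generic function: pick a test statistic, compute it on the real data, compute it on each fake-data replicate, and see where the real value falls. This is a minimal sketch (not from the slides); y, y.rep, and T.stat are placeholders you would supply.

# Generic fake-data / posterior predictive check (sketch)
ppc <- function(y, y.rep, T.stat) {
  # y     : observed data vector
  # y.rep : (number of simulations) x (length of y) matrix of fake data sets
  # T.stat: function computing the chosen "feature" of a data set
  T.obs <- T.stat(y)
  T.rep <- apply(y.rep, 1, T.stat)
  hist(T.rep, xlim = range(c(T.rep, T.obs)),
       main = "fake-data distribution of T", xlab = "T")
  abline(v = T.obs, col = "red")
  mean(T.rep < T.obs)   # values near 0 or 1 flag a misfitting feature
}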
Example – Instructor Ratings
• 463 “rate the professor” evaluations of 94 professors in 31 classroom settings [data in week12 area of website]
• Several other covariates on the professors: tenured? female? minority? age … etc.
• Several other covariates on the classes: number of students, number who filled out the evaluation, “rate the course” evaluations … etc.
• Beauty rating of each instructor (average of 6 student ratings)
• Question: Does beauty predict professor evaluations, after controlling for other covariates?
Basic model: eval ~ beauty

> display(lm.1)
lm(formula = eval ~ beauty)
            coef.est coef.se
(Intercept) 4.01     0.03
beauty      0.13     0.03
---
n = 463, k = 2
residual sd = 0.55, R-Squared = 0.04

rube.lm.1 <- "model {
  for (i in 1:n) {
    eval[i] ~ dnorm(mu[i], sig2inv)
    mu[i] <- b0 + b1*beauty[i]
  }
  b0 ~ dnorm(0, .000001)
  b1 ~ dnorm(0, .000001)
  sig2inv <- pow(sig, -2)
  sig ~ dunif(0, 100)
}"

(rube.lm.1.fit <- rube(rube.lm.1, data.list, rube.lm.1.inits,
                       parameters.to.save = c("b0", "b1", "sig"), n.chains = 3))
           mean      sd
b0        4.010  0.0252
b1        0.133  0.0315
sig       0.547  0.0181
deviance 753.594 2.4783
Making a sample for simulation checks…

rube.lm.1.new <- "model {
  for (i in 1:n) {
    eval[i] ~ dnorm(mu[i], sig2inv)
    mu[i] <- b0 + b1*beauty[i]
  }
  b0 ~ dnorm(0, .000001)
  b1 ~ dnorm(0, .000001)
  sig2inv <- pow(sig, -2)
  sig ~ dunif(0, 1)
  # This simulates new evaluations
  for (i in 1:n) {
    neweval[i] ~ dnorm(mu[i], sig2inv)
  }
}"

# Starting at the last best estimates
attach(rube.lm.1.fit$sims.list)
b0.hat  <- mean(b0)
b1.hat  <- mean(b1)
sig.hat <- mean(sig)
detach()

rube.lm.1.new.inits <- function() {
  list(b1 = b1.hat, b0 = b0.hat, sig = sig.hat)
}

rube.lm.1.new.fit <- rube(rube.lm.1.new, data.list, rube.lm.1.new.inits,
                          parameters.to.save = c("neweval"),
                          n.iter = 400, n.chains = 1)

neweval <- rube.lm.1.new.fit$sims.list$neweval
dim(neweval)   # [1] 200 463
We think “female” may matter, so look at, e.g., residuals split by “female”.
[Figure: residuals for the original data and replications 1–3, with x = male and o = female. Are there too many / too few negative residuals in the original compared with the replications?]
Code for drawing such a comparison is sketched below.
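The plots themselves are not reproduced here; this is a minimal sketch of how one might draw them, assuming eval, beauty, female, b0.hat, b1.hat, and the neweval matrix from the previous slide are in the workspace.

# Residuals vs. beauty for the original data and three fake-data replications,
# using different symbols for male (x) and female (o) instructors
par(mfrow = c(2, 2))
resid.obs <- eval - (b0.hat + b1.hat * beauty)
plot(beauty, resid.obs, pch = ifelse(female == 1, 1, 4), main = "original data")
abline(h = 0, lty = 2)
for (r in 1:3) {
  resid.rep <- neweval[r, ] - (b0.hat + b1.hat * beauty)
  plot(beauty, resid.rep, pch = ifelse(female == 1, 1, 4),
       main = paste("replication", r))
  abline(h = 0, lty = 2)
}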
Check the proportion of negative residuals overall

(eval.resid.neg <- sum(eval < b0.hat + b1.hat*beauty)/n)
# [1] 0.4492441

neweval.resid.neg <- apply(neweval, 1,
                           function(x) sum(x < b0.hat + b1.hat*beauty)/n)

par(mfrow = c(1,1))
hist(neweval.resid.neg)
abline(v = eval.resid.neg, col = "Red")

sum(neweval.resid.neg < eval.resid.neg)/n.sims
# [1] 0.02
Check the proportion of negative residuals for females

f.eval    <- eval[female==1]
f.beauty  <- beauty[female==1]
f.n       <- sum(female==1)
f.neweval <- neweval[, female==1]

(f.eval.resid.neg <- sum(f.eval < b0.hat + b1.hat*f.beauty)/f.n)
# [1] 0.5282051

f.neweval.resid.neg <- apply(f.neweval, 1,
                             function(x) sum(x < b0.hat + b1.hat*f.beauty)/f.n)

par(mfrow = c(1,1))
hist(f.neweval.resid.neg)
abline(v = f.eval.resid.neg, col = "Red")

sum(f.neweval.resid.neg < f.eval.resid.neg)/n.sims
# [1] 0.69
Check the proportion of negative residuals for males

m.eval    <- eval[female==0]
m.beauty  <- beauty[female==0]
m.n       <- sum(female==0)
m.neweval <- neweval[, female==0]

(m.eval.resid.neg <- sum(m.eval < b0.hat + b1.hat*m.beauty)/m.n)
# [1] 0.3917910

m.neweval.resid.neg <- apply(m.neweval, 1,
                             function(x) sum(x < b0.hat + b1.hat*m.beauty)/m.n)

par(mfrow = c(1,1))
hist(m.neweval.resid.neg)
abline(v = m.eval.resid.neg, col = "Red")

sum(m.neweval.resid.neg < m.eval.resid.neg)/n.sims
# [1] 0
Conclusions so far…
• It is worth putting “female” in the model…
• In the R notes for this class I also checked whether “course.id” should be in the model, again using simulation checks.
  • “course.id” should also be in the model.
• For variable selection in lm(), stepAIC is faster (a sketch follows below)!
• But this illustrates the idea…
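For comparison, here is a minimal sketch of the stepAIC route mentioned above, assuming eval, beauty, female, and course.id are in the workspace as on the earlier slides; this is an illustration, not the code from the R notes.

library(MASS)
# Stepwise selection by AIC, starting from a model with all candidate terms
full <- lm(eval ~ beauty * female + as.factor(course.id))
sel  <- stepAIC(full, direction = "both", trace = FALSE)
summary(sel)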
Other models…
• I tried several other models using just these variables…
  • summary(lm.2  <- lm(eval ~ beauty + female))
  • summary(lm.8  <- lm(eval ~ beauty + as.factor(course.id)))
  • summary(lm.9  <- lm(eval ~ beauty + female + as.factor(course.id)))
  • summary(lm.10 <- lm(eval ~ beauty*female + as.factor(course.id)))
• lm.10 seemed to work the best…
summary(lm.10)…

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)
(Intercept)          4.13988    0.03799 108.966  < 2e-16 ***
beauty               0.21823    0.04595   4.749 2.79e-06 ***
female              -0.22802    0.05259  -4.336 1.81e-05 ***
beauty:female       -0.13716    0.06480  -2.117 0.034873 *
factor(course.id)1   0.47509    0.23538   2.018 0.044171 *
factor(course.id)2   0.51671    0.36603   1.412 0.158774
factor(course.id)3  -0.13889    0.18437  -0.753 0.451685
factor(course.id)4  -0.24897    0.12202  -2.040 0.041922 *
factor(course.id)5  -0.02639    0.25883  -0.102 0.918834
…

• Some course.id coefficients are significant, most are not
• Should we be treating course.id as a random effect instead?
To check whether course.id should be a random effect…
• Fit lm.10 in WinBUGS
• Re-run WinBUGS to simulate new response data “neweval”
• Look at the spread in the simulated data vs. the spread of the original data “eval”, especially between groups defined by “course.id”
• Details in R handout
• Example picture on next slide…
Examining between-groups spread with boxplots…
[Figure: side-by-side boxplots by course.id for replications 1–4; red = eval, green = neweval.]
General conclusions…
• The red boxplots (eval) seem less spread out than the green boxplots (replications of neweval)
• It is probably worth it to fit course.id as a random effect
• A more formal test might look at the between-group variance (variance of the group means) across the different course.id’s; a sketch follows below.
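This is a minimal sketch of that more formal check, reusing the fake-data idea from earlier. It assumes eval, course.id, and a neweval matrix simulated under the lm.10 model are in the workspace (names follow the earlier slides; the lm.10 version of neweval is built in the R handout, not shown here).

# Test statistic: variance of the course.id group means
between.var <- function(y, groups) var(tapply(y, groups, mean))

T.obs <- between.var(eval, course.id)
T.rep <- apply(neweval, 1, between.var, groups = course.id)

hist(T.rep, xlim = range(c(T.rep, T.obs)),
     main = "Between-group variance: fake data vs. real data")
abline(v = T.obs, col = "red")
mean(T.rep < T.obs)   # values near 0 or 1 suggest the course.id structure is missing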
Fitting the model with course.id as a random effect…

rube.lmer.10 <- "model {
  for (i in 1:n) {
    eval[i] ~ dnorm(mu[i], sig2inv)
    mu[i] <- a0[course.id[i]+1] + b1*beauty[i] + b2*female[i] + b3*beauty[i]*female[i]
  }
  for (j in 1:J) {
    a0[j] ~ dnorm(b0, tau2inv)
  }
  b0 ~ dnorm(0, .0001)
  b1 ~ dnorm(0, .0001)
  b2 ~ dnorm(0, .0001)
  b3 ~ dnorm(0, .0001)
  tau2inv <- pow(tau, -2)
  tau ~ dunif(0, 100)
  sig2inv <- pow(sig, -2)
  sig ~ dunif(0, 100)
}"

data.list <- list(eval = eval, beauty = beauty, female = female,
                  course.id = course.id, n = length(beauty),
                  J = length(unique(course.id)))

rube.lmer.10.inits <- function() {
  list(b0 = rnorm(1), b1 = rnorm(1), b2 = rnorm(1), b3 = rnorm(1),
       a0 = rnorm(J), tau = runif(1, 0, 10), sig = runif(1, 0, 10))
}

rube(rube.lmer.10, data.list, rube.lmer.10.inits)

rube.lmer.10.fit <- rube(rube.lmer.10, data.list, rube.lmer.10.inits,
                         parameters.to.save = c("b0", "b1", "b2", "b3", "a0", "sig", "tau"),
                         n.chains = 3)
Results of fit

        mean     sd
b0     4.090  0.0677
b1     0.213  0.0438
b2    -0.211  0.0515
b3    -0.131  0.0655
a0[1]  4.131  0.0378
a0[2]  4.364  0.1941
...
sig    0.517  0.0174
tau    0.260  0.0756
Ideally, DIC and simulation checks should work together

(DIC <- c(rube.lm.1.fit$DIC, rube.lm.10.fit$DIC, rube.lmer.10.fit$DIC))
# [1] 756.510 732.603 723.647

• lm.1:    eval ~ beauty
• lm.10:   eval ~ beauty * female + factor(course.id)
• lmer.10: eval ~ beauty * female + (1 | course.id)
• So in this case, they do work together…
What about AIC, BIC?
• Usual rules of thumb:
  • a difference of 2 is marginally interesting
  • a difference of 3 is interesting
  • a difference of 10 is a big deal
(A sketch of comparing the lm fits by AIC and BIC follows below.)
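For the non-Bayesian fits, this comparison can be read off R's built-in functions. A minimal sketch, assuming lm.1, lm.9, and lm.10 from the earlier slides are in the workspace:

# Penalized-deviance comparison of the least-squares fits
AIC(lm.1, lm.9, lm.10)   # lower is better; differences of ~10 are "a big deal"
BIC(lm.1, lm.9, lm.10)   # BIC penalizes extra parameters more heavily than AIC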
Using Posterior Simulation to Test Hypotheses about Parameters
• The “Risky Behaviors” data set (which you worked with on an earlier hw) compares two treatments (couples counseling and women-only counseling) with a null control group, for reducing HIV infection.
• The variables are:
  sex         : "woman", "man"
  couples     : 0/1: in couples counseling?
  women_alone : 0/1: in women-only counseling?
  bs_hiv      : HIV status at baseline: "negative", "positive"
  bupacts     : number of unprotected sex acts @ baseline
  fupacts     : number of unprotected sex acts @ end of study
  id          : respondent id (1:434)
• Recall that fupacts has to be rounded (always data cleaning…); a sketch is below.
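A minimal sketch of that cleaning step, assuming the data frame is already loaded as rb (the name used on the later slides):

# fupacts must be a count for the Poisson model, so round it before fitting
rb$fupacts <- round(rb$fupacts)
summary(rb$fupacts)   # quick sanity check: nonnegative integers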
One possible model…

> summary(glmer.1 <- glmer(fupacts ~ couples + women_alone + bupacts + bs_hiv + (1|id),
                           data = rb, family = poisson))

Random effects:
 Groups Name        Variance Std.Dev.
 id     (Intercept) 3.2374   1.7993
Number of obs: 434, groups: id, 434

Fixed effects:
                Estimate Std. Error z value Pr(>|z|)
(Intercept)     1.666481   0.186823   8.920  < 2e-16 ***
couples        -0.759627   0.231159  -3.286  0.00102 **
women_alone    -0.974322   0.235538  -4.137 3.53e-05 ***
bupacts         0.021172   0.002847   7.438 1.03e-13 ***
bs_hivpositive -1.080734   0.239413  -4.514 6.36e-06 ***

• Clearly both treatments are better than the null control
• Counseling only women looks better than counseling couples – is this real?
The lazy statistician’s answer: do the CIs overlap?

est <- coef(summary(glmer.1))[2:3, 1:2]
est[1,1] + c(-2,2)*est[1,2]
est[2,1] + c(-2,2)*est[2,2]

plot(c(est[1,1] + c(-2,2)*est[1,2],
       est[2,1] + c(-2,2)*est[2,2]), c(1,1,2,2),
     type = "n", ylim = c(0,3), xlab = "CI for coefficient",
     axes = FALSE, ylab = "condition")
lines(c(est[1,1] + c(-2,2)*est[1,2]), c(1,1))
lines(c(est[2,1] + c(-2,2)*est[2,2]), c(2,2))
box()
axis(1)
axis(2, at = c(1,2), labels = c("couples", "women only"))
A slightly less (?) lazy answer…

n.sim <- 1000
beta.c <- rnorm(n.sim, est[1,1], est[1,2])
beta.w <- rnorm(n.sim, est[2,1], est[2,2])

plot(density(beta.c - beta.w),
     main = "Simulated difference between coefficient estimates",
     xlab = "beta(couples) - beta(women only)")
abline(v = 0, col = "red")
pval <- sum(beta.c - beta.w < 0)/n.sim       # the slide reports P[diff<0] = 0.25
legend("topleft", legend = paste("P[diff<0] =", round(pval, 2)), lty = 1, col = "red")

cor(beta.c, beta.w)
# [1] 0.0455817
contour(kde2d(beta.c, beta.w))               # kde2d() is from library(MASS)

• The β’s have asymptotically normal distributions (under resampling of the data); simulate from that?
A substantially better answer…

(mu <- est[,1])
#     couples women_alone
#  -0.7596268  -0.9743223
(Sigma <- vcov(glmer.1)[2:3, 2:3])
# 2 x 2 Matrix of class "dsyMatrix"
#            [,1]       [,2]
# [1,] 0.05343457 0.02872367
# [2,] 0.02872367 0.05547830

n.sim <- 1000
sims.1 <- mvrnorm(n.sim, mu, Sigma)          # mvrnorm() is from library(MASS)
beta.c <- sims.1[,1]
beta.w <- sims.1[,2]

plot(density(beta.c - beta.w),
     main = "Simulated difference between coefficient estimates",
     xlab = "beta(couples) - beta(women only)")
abline(v = 0, col = "red")
pval <- sum(beta.c - beta.w < 0)/n.sim       # the slide reports P[diff<0] = 0.16
legend("topleft", legend = paste("P[diff<0] =", round(pval, 2)), lty = 1, col = "red")

cor(beta.c, beta.w)
# [1] 0.490836
contour(kde2d(beta.c, beta.w))               # kde2d() is from library(MASS)

• The asymptotic joint distribution of the β’s is correlated normal; simulate from that!
An Even Better Answer…
• The “normality” of the β’s is only asymptotic.
• In fact, the β’s have a distribution that is not quite normal (because, for example, the data are Poisson and not normal).
• We can get the exact distribution of the β’s from a Bayesian model.
• We next fit the Bayesian model, and look at a simulation of the difference from the posterior distribution of that model.
Setting up and fitting the Bayesian model…

rube.glmer.1 <- "model {
  # LEVEL 1
  for (i in 1:n) {
    fupacts[i] ~ dpois(lambda[i])
    log(lambda[i]) <- a0[i] + b1*couples[i] + b2*women_alone[i] +
                      b3*bupacts[i] + b4*bs_hiv[i]
  }
  # LEVEL 2
  for (i in 1:n) {
    a0[i] ~ dnorm(b0, tau2inv)
  }
  # LEVEL 3
  b0 ~ dnorm(0, .0001); b1 ~ dnorm(0, .0001)
  b2 ~ dnorm(0, .0001); b3 ~ dnorm(0, .0001)
  b4 ~ dnorm(0, .0001); tau2inv <- 1/tau2
  tau2 ~ dunif(0, 10)
}"

data.list <- as.list(rb)
data.list$bs_hiv <- ifelse(data.list$bs_hiv == "negative", 0, 1)
data.list$n <- dim(rb)[1]

inits <- function() {
  list(b0 = rnorm(1), b1 = rnorm(1), b2 = rnorm(1),
       b3 = rnorm(1), b4 = rnorm(1), tau2 = runif(1, 0, 10))
}

run.glmer.1 <- rube(rube.glmer.1, data.list, inits,
                    parameters.to.save = c("b1", "b2", "tau2"), n.chains = 3)
p3(run.glmer.1)

Yikes! The p3() traceplots show what unconverged MCMC samples look like…
Some possible fixes…
• Explore the full posterior…
  • just very diffuse?
  • actually has secondary modes? (a sketch of this check follows below)
• Try a set of initial values from the glmer fit…
  • initial values from a “simpler” model!

co <- coef(summary(glmer.1))
inits <- function() {
  list(b0 = rnorm(1, co[1,1], co[1,2]), b1 = rnorm(1, co[2,1], co[2,2]),
       b2 = rnorm(1, co[3,1], co[3,2]), b3 = rnorm(1, co[4,1], co[4,2]),
       b4 = rnorm(1, co[5,1], co[5,2]), tau2 = runif(1, 0, 5))
}

run.glmer.1 <- rube(rube.glmer.1, data.list, inits,
                    parameters.to.save = c("b1", "b2", "tau2"), n.chains = 3)
p3(run.glmer.1)

Still not great, but better… (other parameters look similar)
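One quick way to separate “very diffuse” from “secondary modes” is simply to plot the marginal posterior draws. A minimal sketch (not from the slides), using the same sims.list components that the next slide uses:

# Marginal posterior densities of the two treatment coefficients:
# a long flat density suggests a diffuse posterior; extra bumps suggest secondary modes
par(mfrow = c(1, 2))
plot(density(run.glmer.1$sims.list$b1), main = "posterior of b1 (couples)")
plot(density(run.glmer.1$sims.list$b2), main = "posterior of b2 (women alone)")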
Posterior simulation of the difference…

beta.c <- run.glmer.1$sims.list$b1
beta.w <- run.glmer.1$sims.list$b2
n.sim <- length(beta.c)

plot(density(beta.c - beta.w),
     main = "Simulated difference between coefficient estimates",
     xlab = "beta(couples) - beta(women only)")
abline(v = 0, col = "red")
pval <- sum(beta.c - beta.w < 0)/n.sim       # the slide reports P[diff<0] = 0.21
legend("topleft", legend = paste("P[diff<0] =", round(pval, 2)), lty = 1, col = "red")

cor(beta.c, beta.w)
# [1] 0.6314137
contour(kde2d(beta.c, beta.w))

• We now simulate from the joint posterior distribution of the β’s.
Simulation-based testing (summary)
• We can simulate “fake data” to check the fit of the model
  • Does a statistic from the real data “fit in” with the fake-data simulation of that statistic from the fitted model?
  • Plug in the estimated parameters and simulate data from that model (parametric bootstrap); a sketch follows below
  • Or simulate data from the “posterior predictive” distribution (posterior predictive checks)
• We can simulate from
  • the [likelihood-based] asymptotic distribution of the parameter estimates, or
  • the [Bayesian] posterior distribution,
  to construct tests directly on the parameters themselves.
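A minimal sketch of the parametric-bootstrap variant for the instructor-ratings example, reusing b0.hat, b1.hat, sig.hat, eval, and beauty from the earlier slides (so the plug-in estimates come from the simple eval ~ beauty fit); this is an illustration, not code from the slides.

# Parametric bootstrap: simulate fake data at the plug-in estimates and compare
# the observed proportion of negative residuals with its fake-data distribution
n      <- length(eval)
n.boot <- 1000
stat   <- function(y) mean(y < b0.hat + b1.hat * beauty)   # proportion of negative residuals

T.obs <- stat(eval)
T.rep <- replicate(n.boot, stat(rnorm(n, mean = b0.hat + b1.hat * beauty, sd = sig.hat)))

hist(T.rep, xlim = range(c(T.rep, T.obs)))
abline(v = T.obs, col = "red")
mean(T.rep < T.obs)   # like the posterior predictive check, but ignoring parameter uncertainty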
Summary
• Model Checking
  • Two strategies for checking model fit
    • DIC
    • Simulation checks / fake data / posterior predictive checks
  • Example: Rating professors
• Parameter Tests
  • Testing a difference by simulation
  • Example: Risky Behaviors