310 likes | 326 Views
Explore model comparison methods in Bayesian statistics using MCMC, DIC, and random slopes regression. Understand the complexities and challenges in choosing the best model, including Bayes Factors and issues with the DIC. Learn about Bayesian model averaging and reversible jump MCMC.
E N D
Lecture 9 Model Comparison using MCMC and further models
Lecture Contents • Model comparison • DIC diagnostic • Random slopes regression model • Priors for variance matrices • MLwiN RSR demonstration • Other predictor variables • DIC in WinBUGS
Bayesian Model Comparison • In Bayesian statistics model comparison is a thorny issue!! • In MLwiN we used to suggest running IGLS for model selection then MCMC on your chosen model. Why is it a thorny issue? • The posterior f(θ|Y) does not allow criticism of the model in light of the observed data nor comparison amonst models. • It is f(Y) that can be used to assess model performance. • Regardless of the model, f(Y) is a density over the space of observables which can be compared with what was actually observed.
Bayes Factors • If we observe YOBS and have 2 models M1 and M2 then the Bayes Factor is This provides the relative weight of evidence for model M1 compared to model M2. Rough calibration of the Bates factor has been proposed:
Problems with Bayes Factor 1. When prior is vague -> f(θ) is improper This implies that even though f(θ |Y) may be proper, f(Y) is improper so Bayes Factors cannot be used! 2. Computation of the Bayes factor itself requires high-dimensional integration. 3. Lindley’s paradox – data points to rejection but prior is diffuse so denominator of Bayes factor much smaller than numerator and too much weight given to parsimonious models.
Other related ideas • Prior predictive distributions f(Y). • Cross-validation predictive distributions F(yr|Y(r)). • Posterior predictive distributions f(Y|Yobs). • Model uncertainty – where the model is itself a parameter to be estimated. • Bayesian model averaging. • Reversible jump MCMC.
Model Comparison for random effect models • As we will typically use diffuse priors, Bayes factors are not an option here. • The methods listed previously are possibilities but not built into software packages. • The Deviance Information Criterion (DIC) is one possibility but is it a saviour for Bayesian model choice or a white elephant?
DIC – Spiegelhalter et al. (2002) Plus points: • Discussion paper proposing it written by leading figures in Bayesian modelling. • Available in both MLwiN and WinBUGS for standard models Minus points: The paper was given a very mixed reception at the RSS when it was discussed!
DIC A natural way to compare models is to use a criterion based on a trade-off between the fit of the data to the model and the corresponding complexity of the model. DIC does this in a Bayesian way. DIC = ‘goodness of fit’ + ‘complexity’. Fit is measured by deviance Complexity is measured by an estimate of the ‘effective number of parameters’ defined as i.e. Posterior mean deviance minus the deviance evaluated at the posterior mean of the parameters.
DIC (continued) • The DIC is then defined analagously to AIC as Models with smaller DIC are better supported by the data. • DIC can be monitored in WinBUGS from Inference/DIC menu. • DIC is available in MLwiN under the Model/MCMC menu.
Education dataset • We can fit a simple (Bayesian) linear regression in MLwiN • The DIC output is as follows: ← Note PD ~ 3 = the actual number of parameters
Variance components model Here we consider the random intercepts model from earlier practicals This is the parallel lines model
Change in DIC Here we see the clear improvement in fitting random effects for school. Note that the effective number of parameters is ~60 compared with 68 actual parameters in the dataset due to random rather than fixed school effects.
school 1 u1,1 0 + 1x1ij u0,2 u1,2 school 2 -3 0 1 3 Random slopes model (crossing lines) u0,1
Fitting an RSR in a Bayesian Framework The basic random slopes regression model is as follows: To this model we need to add priors for
Wishart priors For a (kxk) variance matrix parameter in a Normal likelihood the conjugate prior is the inverse Wishart distribution with parameters ν and S This distribution looks complex but is simply a multivariate generalisation of the inverse Gamma distribution.
Wishart prior for Ωu-1 In MLwiN we use an inverse Wishart prior for the precision matrix: Note this is a (weakly informative) prior as the first parameter represents the prior sample size and is set to the smallest feasible value. Browne and Draper have looked at alternative Wishart priors as well as a Uniform prior and performed simulations.
Gibbs Sampling algorithm for RSR model Repeat the following four steps • 1. Generate β from its (Multivariate) Normal conditional distribution. • 2. Generate each uj from its (Multivariate) Normal conditional distribution. • 3. Generate Ωu-1 from its Wishart conditional distribution. • 3. Generate 1/σe2 from its Gamma conditional distribution
Bayesian RSR Model for education dataset Note IGLS estimates used in prior. Variance (posterior mean) estimates bigger than IGLS estimates.
DIC for RSR model As with the frequentist approach the random slopes model is an improvement over the random intercepts model. The additional 65 random parameters only add 32 effective parameters
Predictions for the RSR model with highlighted data • Here the top and bottom school are highlighted:
Residuals for the RSR Individually: and pairwise:
Uniform Priors Here the level 2 variance estimates increase as in Browne and Draper (2000) Browne and Draper found that the Wishart priors were preferable although the use of the IGLS estimate is not strictly Bayesian as we are using the data twice!
Other predictors in the education dataset This dataset has other predictors such as gender and school gender that can be considered in the practical. In the next slide we see the equations window for a model with these added which has DIC 9189.26 a reduction of over 25 on the earlier RSR model
WinBUGS RSR & gender model { # Level 1 definition for(i in 1:N) { normexam[i] ~ dnorm(mu[i],tau) mu[i]<- beta[1] * cons[i] + beta[2] * standlrt[i] + beta[3] * girl[i] + beta[4] * boysch[i] + beta[5] * girlsch[i] + u2[school[i],1] * cons[i] + u2[school[i],2] * standlrt[i] } # Higher level definitions for (j in 1:n2) { u2[j,1:2] ~ dmnorm(zero2[1:2],tau.u2[1:2,1:2]) } # Priors for fixed effects for (k in 1:5) { beta[k] ~ dflat() } # Priors for random terms tau ~ dgamma(0.001000,0.001000) sigma2 <- 1/tau for (i in 1:2) {zero2[i] <- 0} tau.u2[1:2,1:2] ~ dwish(R2[1:2, 1:2],2) sigma2.u2[1:2,1:2] <- inverse(tau.u2[,]) } ← Here we see the WiNBUGS code for our last model. Notice how MVN and Wishart distributions are specified in WinBUGS
DIC in WinBUGS In WinBUGS DIC is available from the Inference menu: The DIC is set after the burnin and then the DIC button is pressed after running giving: Dbar = post.mean of -2logL; Dhat = -2LogL at post.mean of stochastic nodes Dbar Dhat pD DIC Normexam 9098.590 9007.960 90.631 9189.220 total 9098.590 9007.960 90.631 9189.220
Parameter estimates in WinBUGS • Note that here we see that WinBUGS gives similar estmates as MLwiN for the model. Note that for the fixed effects β that WinBUGS indexes from 1 while MLwiN indexes from 0. node mean sd (2.5%, 97.5%) beta[1] -0.19 0.05236 (-0.2996, -0.08937) beta[2] 0.5539 0.01971 (0.5145, 0.5911) beta[3] 0.1687 0.03399 (0.1019, 0.2349) beta[4] 0.1775 0.1008 (-0.02581, 0.3781) beta[5] 0.175 0.08212 (0.004677, 0.3318) sigma2 0.5511 0.0125 (0.5272,0.576) sigma2.u2[1,1] 0.08777 0.01885 (0.05745 , 0.1305) sigma2.u2[1,2] 0.02141 0.00719 (0.009262,0.0372) sigma2.u2[2,2] 0.01545 0.00460 (0.008271,0.02603)
Next Practical The next practical is free ranging: • You can follow the MLwiN chapter on RSR models that is given. • You can try out RSR models in WinBUGS. • You can try out fitting random effect models to the orthodont dataset using MCMC. • You can try out DIC on other models.