REALCOM

An ESRC research project at Bristol University
Multilevel models for realistically complex data
Methodology and examples for:
• Measurement errors
• Multilevel structural equations
• Multivariate responses at several levels and of different types

General Format
• MATLAB software
• Free-standing executable programs
• ASCII and worksheet input and output
• Graphical menu-based input specification
• Model equation display
• Monitoring of MCMC chains
• A training manual containing:
  • Outline of methodology
  • Worked-through examples

Markov Chain Monte Carlo – a quick introduction
• A Bayesian simulation-based method that, given starting values, samples a new set of parameters at each cycle of a ‘Markov chain’.
• This yields a final chain (after discarding a burn-in set) of, say, 5000 sets of values from the (joint) posterior distribution of the parameters.
• The posterior is formed by combining the likelihood based on the data with a prior distribution, which is typically diffuse.
• These chains are used for inference: e.g. the mean of a parameter's chain is analogous to the point estimate from a likelihood analysis, and intervals are obtained from the chain in the same way.

Consider the simple 2-level model:
The parameters in this model are the fixed coefficients, the two variances and the level 2 residuals. From suitable starting values the chain eventually ‘settles down’, so that sampling is from the true posterior distribution. We then need to draw enough samples to provide stable estimates, using suitable ‘convergence’ criteria. All the MATLAB routines use MCMC sampling.
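As a rough illustration of how such a chain is generated, the sketch below runs a Gibbs sampler for a 2-level variance components model written as y_ij = b0 + b1*x_ij + u_j + e_ij. That equation, the flat prior on the coefficients, the inverse-gamma(0.001, 0.001) priors on the variances, the simulated data and the name gibbs_two_level are all assumptions made for this example. It is written in Python rather than MATLAB and is not the REALCOM code.

```python
import numpy as np


def gibbs_two_level(y, x, group, n_iter=1500, burn_in=500, seed=0):
    """Gibbs sampler for y_ij = b0 + b1*x_ij + u_j + e_ij (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    X = np.column_stack([np.ones(len(x)), x])
    n, n_groups = len(y), int(group.max()) + 1
    n_j = np.bincount(group, minlength=n_groups)
    xtx_inv = np.linalg.inv(X.T @ X)
    a = b = 0.001                      # assumed diffuse inverse-gamma prior
    u = np.zeros(n_groups)             # starting values
    sigma2_u, sigma2_e = 1.0, 1.0
    kept = []
    for it in range(n_iter):
        # Step 1: fixed coefficients given everything else (flat prior).
        beta_hat = xtx_inv @ X.T @ (y - u[group])
        beta = rng.multivariate_normal(beta_hat, sigma2_e * xtx_inv)
        # Step 2: level 2 residuals given everything else.
        resid = y - X @ beta
        precision = n_j / sigma2_e + 1.0 / sigma2_u
        sums = np.bincount(group, weights=resid, minlength=n_groups)
        u = rng.normal(sums / sigma2_e / precision, np.sqrt(1.0 / precision))
        # Step 3: the two variances (inverse-gamma draws via 1 / gamma).
        sigma2_u = 1.0 / rng.gamma(a + n_groups / 2, 1.0 / (b + 0.5 * np.sum(u ** 2)))
        e = resid - u[group]
        sigma2_e = 1.0 / rng.gamma(a + n / 2, 1.0 / (b + 0.5 * np.sum(e ** 2)))
        if it >= burn_in:              # discard the burn-in set
            kept.append([beta[0], beta[1], sigma2_u, sigma2_e])
    return np.array(kept)


# Simulated data: 50 level 2 units with 20 level 1 units each (made-up values).
data_rng = np.random.default_rng(42)
group = np.repeat(np.arange(50), 20)
x = data_rng.normal(size=group.size)
u_true = data_rng.normal(0.0, np.sqrt(0.5), 50)
y = 1.0 + 0.5 * x + u_true[group] + data_rng.normal(0.0, 1.0, group.size)

chain = gibbs_two_level(y, x, group)
posterior_mean = chain.mean(axis=0)                       # point estimates
lower, upper = np.percentile(chain, [2.5, 97.5], axis=0)  # 95% intervals
```

Each row of chain is one retained draw of (b0, b1, level 2 variance, level 1 variance), so the column means play the role of point estimates and the percentiles give interval estimates.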

Measurement errors
• Continuous variables: a simple example
• Basic model is:
• With a model of interest, e.g.:

Some assumptions we need to make
• Measurement error variance assumed known, or alternatively
• Reliability assumed known:
• We also need a distribution for the true value:
• An important issue is the value chosen for the measurement error variance (or reliability); a sensitivity analysis is useful, and we can also give it a prior.
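To see why the assumed reliability matters, the simulation below shows the familiar attenuation effect in a single-level regression. It assumes the classical additive error model (observed = true + error) and the usual definition of reliability as true-score variance divided by observed variance; the slide's own formulas are not reproduced in this transcript, so that notation, the variable names and the simulated values are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
reliability = 0.7                      # assumed known, as on the slide
sigma2_x = 1.0                         # variance of the true values
# Reliability R = sigma2_x / (sigma2_x + sigma2_m), solved for sigma2_m.
sigma2_m = sigma2_x * (1 - reliability) / reliability

x_true = rng.normal(0.0, np.sqrt(sigma2_x), n)
x_obs = x_true + rng.normal(0.0, np.sqrt(sigma2_m), n)   # observed = true + error
y = 2.0 + 1.5 * x_true + rng.normal(0.0, 1.0, n)         # model of interest

slope_true = np.polyfit(x_true, y, 1)[0]   # regression on the true values
slope_obs = np.polyfit(x_obs, y, 1)[0]     # naive regression on observed values
slope_corrected = slope_obs / reliability  # simple moment correction
```

The naive slope is pulled towards zero by roughly the factor R, which is why a sensitivity analysis over plausible values of R is worthwhile.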

2. Misclassification errors
• Assume a binary (0,1) variable, for example whether or not a school pupil is eligible for free school meals (yes=1).
• For a true value of zero there are two probabilities: that of observing a zero (no eligibility) and that of observing a one. Likewise there are two probabilities for a true value of one.
• We now assume that we know these misclassification probabilities. The target model is similar to before, but with a binary predictor.
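One way to see what known misclassification probabilities buy us is to apply Bayes' rule to a single observed value. The sketch below does only that: the function name, the argument names and the example numbers are hypothetical, and a full analysis would also use the information in the response when sampling the true value.

```python
def posterior_true_one(observed, prevalence, p_obs1_given_true1, p_obs0_given_true0):
    """P(true value = 1 | observed value), from assumed-known error rates."""
    p_obs1_given_true0 = 1.0 - p_obs0_given_true0
    p_obs0_given_true1 = 1.0 - p_obs1_given_true1
    if observed == 1:
        numerator = p_obs1_given_true1 * prevalence
        denominator = numerator + p_obs1_given_true0 * (1.0 - prevalence)
    else:
        numerator = p_obs0_given_true1 * prevalence
        denominator = numerator + p_obs0_given_true0 * (1.0 - prevalence)
    return numerator / denominator


# Made-up illustration: 20% truly eligible, 90% of true ones and 95% of
# true zeros recorded correctly.
p_if_recorded_one = posterior_true_one(1, 0.2, 0.9, 0.95)
p_if_recorded_zero = posterior_true_one(0, 0.2, 0.9, 0.95)
```

With these illustrative numbers a recorded "yes" still leaves a noticeable chance that the pupil is not truly eligible, which is the uncertainty the target model has to carry.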

Modelling considerations
• We can model multivariate continuous measurement errors, but only independent binary misclassifications.
• We can allow different measurement error variances and covariances for different groups, e.g. by gender.
• In the multivariate case we typically need non-zero correlations between measurement errors:
• Thus, say, if R = 0.7 and the observed correlation is 0.8, then we require a measurement error correlation > 0.33.
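The 0.33 bound can be checked with a short calculation. The working below assumes two observed variables standardised to unit variance, both with reliability R, and true values independent of the errors; these assumptions and the symbols are supplied here for illustration.

```latex
% Observed = true + error, with var(observed) = 1, var(true) = R, var(error) = 1 - R.
\rho_{\text{obs}} = R\,\rho_{\text{true}} + (1 - R)\,\rho_{m}
% The true correlation cannot exceed 1, so
\rho_{m} \;\ge\; \frac{\rho_{\text{obs}} - R}{1 - R}
        \;=\; \frac{0.8 - 0.7}{1 - 0.7} \;=\; \tfrac{1}{3} \approx 0.33 .
```

With independent errors the observed correlation could be at most R = 0.7, so an observed value of 0.8 is only possible if the errors themselves are positively correlated.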

An educational example
• Maths test score related to prior test scores and FSM (free school meals) eligibility.
• We will look at continuous, correlated and binary measurement errors.
Open measurement-error.exe and read the file ‘classsize’.

Summary table for analyses:

Factor analysis and structural equation models
Consider a single-level factor model where we have several responses on each member of a sample:
where r indexes the response variable and i the person. This is a special kind of multivariate model in which we assume the residuals are independent, so the covariance between two responses is given by:
A constraint is needed for identifiability, and the default is to choose:
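For reference, a single-factor model consistent with this description can be written as follows. The slide's own equations are not reproduced in this transcript, so the symbols and the choice of a unit factor variance as the constraint are assumptions based on common practice.

```latex
% Single-level, single-factor model (assumed notation)
y_{ri} = \beta_r + \lambda_r \nu_i + e_{ri}, \qquad
\nu_i \sim N(0, \sigma^2_{\nu}), \quad e_{ri} \sim N(0, \sigma^2_{e_r})
% Independent residuals imply, for r \neq s,
\operatorname{cov}(y_{ri}, y_{si}) = \lambda_r \lambda_s \sigma^2_{\nu}
% A common identifiability constraint
\sigma^2_{\nu} = 1
```

Without some such constraint the loadings and the factor variance are not separately identified, since scaling every loading by c and dividing the factor variance by c squared leaves the covariances unchanged.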

Extensions – further factors
We can add explanatory variables in addition to the (see later), or we can add further factors:
As the number of factors increases we require further constraints, typically on loading values. A popular choice is ‘simple structure’, with each response loading on only one factor and non-zero correlations between factors.

Extensions – structural variables
We can allow the factors themselves to depend on further variables, e.g.:
Or alternatively, but less commonly:

Two-level factor models
Standard formulation:
Alternatively:
But we shall not consider this case.

Example – PISA data
A survey by the OECD in 2000 of the reading performance of 15-year-olds in 32 countries. We use one subscale of 35 items, ‘retrieving information’, and look at France and England. First we shall fit one- and two-level models assuming the responses are Normal; in fact they are binary and ordered, but we come to that later.
Open structural-equation.exe and load pisadata.

Binary and ordered responses
Assume a binary response z. We will use the idea of a latent Normal distribution. Consider the (factor) model for a single response:
where we observe a positive (=1) response for our binary variable z if y is positive, that is:
so that we obtain the probit model:
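The link between the latent Normal variable and the probit model can be checked numerically. The sketch below assumes a latent y = eta + e with e standard Normal, where eta stands for whatever linear predictor the model supplies; the names probit_probability and simulate_binary are hypothetical.

```python
import math
import random


def probit_probability(eta):
    """P(z = 1) = P(eta + e > 0) = Phi(eta) for standard Normal e."""
    return 0.5 * (1.0 + math.erf(eta / math.sqrt(2.0)))


def simulate_binary(eta, n, seed=0):
    """Proportion of positive responses when thresholding the latent y at 0."""
    rng = random.Random(seed)
    positives = sum(1 for _ in range(n) if eta + rng.gauss(0.0, 1.0) > 0.0)
    return positives / n


analytic = probit_probability(0.5)
simulated = simulate_binary(0.5, 100_000)
```

The simulated proportion agrees with the Normal distribution function up to sampling error, which is all the probit model asserts.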

Ordered data
Consider the cumulative probability of being in one of the lowest s+1 categories of a p-category variable, with categories numbered from 0 upwards: s = 0, …, p-2.
We extend the binary response model as:
where the extra parameters define a set of ‘thresholds’ for the categories. So suppose we have a 3-category variable; then for observed responses:
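A small sketch of the threshold idea follows. It assumes a latent y = eta + e with standard Normal e and increasing thresholds, with the observed category equal to the number of thresholds lying below y; the sign convention, the function names and the example thresholds are assumptions, since the slide's formula is not reproduced here.

```python
import math
from bisect import bisect_left


def normal_cdf(value):
    return 0.5 * (1.0 + math.erf(value / math.sqrt(2.0)))


def category_from_latent(y, thresholds):
    """Observed category = number of thresholds strictly below the latent y."""
    return bisect_left(thresholds, y)


def category_probabilities(eta, thresholds):
    """P(category = s) from cumulative P(category <= s) = Phi(threshold_s - eta)."""
    cumulative = [normal_cdf(t - eta) for t in thresholds] + [1.0]
    probabilities = [cumulative[0]]
    for s in range(1, len(cumulative)):
        probabilities.append(cumulative[s] - cumulative[s - 1])
    return probabilities


# A 3-category variable needs p - 1 = 2 thresholds (illustrative values).
thresholds = [-0.5, 0.5]
probs = category_probabilities(0.0, thresholds)
```

For a 3-category variable this gives category 0 when y is at or below the first threshold, category 1 between the two thresholds, and category 2 above the second.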

PISA data with binary/ordered responses
• In fact all the responses are binary except for 4 with 3 ordered categories: C9, C14, C20 and C26.
• Change these responses and rerun the models.
• Finally, fit the explanatory variables Country and Gender in the structural part of the model.

Multivariate models with responses at 2 levels
• Consider first 2 Normal responses: (superscript indicates level)
• Models are linked via the level 2 covariance matrix.
• The MCMC algorithm handles missing response data, and categorical (binary, ordered and unordered) as well as Normal data.
• The first example is a repeated measures growth curve model.

Child heights + adult height
Child height is modelled as a cubic polynomial with intercept and slope random at level 2.

Load growthdata.txt and fit the model.
Results:

Adult height prediction
Suppose we have 2 growth measures; we want a regression prediction of the form:
This leads to:
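If the growth measures and adult height are jointly Normal, the regression coefficients for such a prediction follow directly from their covariance matrix. The sketch below shows that standard calculation; the function name and the means and covariances are made-up illustrative numbers, not results from the growth data, and the slide's own derivation is not reproduced here.

```python
import numpy as np


def conditional_regression(mean, cov, predictor_idx, target_idx):
    """Intercept, slopes and residual variance of E[target | predictors]."""
    mean = np.asarray(mean, dtype=float)
    cov = np.asarray(cov, dtype=float)
    s_pp = cov[np.ix_(predictor_idx, predictor_idx)]   # predictor covariances
    s_pt = cov[predictor_idx, target_idx]              # predictor-target covariances
    slopes = np.linalg.solve(s_pp, s_pt)
    intercept = mean[target_idx] - slopes @ mean[predictor_idx]
    residual_var = cov[target_idx, target_idx] - slopes @ s_pt
    return intercept, slopes, residual_var


# Hypothetical joint distribution: two growth measures, then adult height.
mean = [0.0, 0.0, 170.0]
cov = [[4.0, 2.0, 3.0],
       [2.0, 3.0, 2.0],
       [3.0, 2.0, 6.0]]
intercept, slopes, residual_var = conditional_regression(mean, cov, [0, 1], 2)
```

In an MCMC setting the same calculation can be applied to each draw of the covariance matrix, giving a chain for each prediction coefficient.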

Mixed response types and missing data
• Normal and ordered data have already been considered in structural equation models
• We now introduce unordered categorical responses
• We can also have general Normalising transformations
• Missing data via imputation is an important application for these models

Unordered categorical responses
Assume p categories, where an individual responds to just one, and let h index the response category. For each category we assume that an underlying latent variable exists and that we have the following model:
For identifiability we model p-1 categories and assume:
The maximum indicant model: we observe category h for individual i iff:
so that:
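The maximum indicant rule itself is simple to state in code. The sketch below assumes that the unmodelled reference category is category 0 and that its latent value is fixed at zero, which is a common identifiability choice but is an assumption here; the function name is hypothetical.

```python
def observed_category(latent_values):
    """Return the category whose latent value is largest.

    latent_values holds the p-1 modelled latent variables for categories
    1..p-1; category 0 is the reference with latent value fixed at 0
    (an assumption of this sketch).
    """
    full = [0.0] + list(latent_values)
    return max(range(len(full)), key=full.__getitem__)


# Three categories, so two modelled latent variables per individual.
first = observed_category([-0.3, -1.2])   # both below zero: reference category
second = observed_category([0.4, 1.1])    # last latent value is largest
third = observed_category([0.4, -1.0])    # first modelled category is largest
```

Going the other way, an MCMC step for an observed category h samples the latent values subject to the constraint that the latent value for h is the largest.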

Multiple imputation – briefly and simply
Consider the model of interest (MOI):
We turn this into a multivariate response model and obtain residual estimates (from an MCMC chain) for the values which are missing. Use these to ‘fill in’ the missing values and produce a complete data set. Do this (independently) n times (e.g. n = 20). Fit the MOI to each data set and combine the results according to the standard combining rules to get estimates and standard errors.
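The combining step can be sketched as below, assuming that the rules referred to are the usual Rubin's rules for a single scalar parameter. The function name and the example estimates are hypothetical.

```python
import math


def combine_rubin(estimates, std_errors):
    """Pool one parameter across imputed data sets (Rubin's rules, assumed)."""
    m = len(estimates)
    pooled = sum(estimates) / m
    within = sum(se ** 2 for se in std_errors) / m                 # mean sampling variance
    between = sum((q - pooled) ** 2 for q in estimates) / (m - 1)  # between-imputation variance
    total = within + (1.0 + 1.0 / m) * between
    return pooled, math.sqrt(total)


# Made-up estimates of one coefficient from three completed data sets.
pooled_estimate, pooled_se = combine_rubin([1.0, 1.2, 0.8], [0.1, 0.1, 0.1])
```

The pooled standard error is larger than any single-data-set standard error because it adds the variation between imputations, which reflects the uncertainty due to the missing values.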

Class size example
Load classsize_impute.
The MOI is Normalised exam score as response, regressed on pretest score, gender, FSM and class size. 50% of level 1 units have missing data.
Multivariate model:

MI estimates vs listwise deletion
Fixed effects in the multivariate model, with 50% of records missing completely at random (MCAR):

Further extensions
• Box-Cox normalising transformations:
• Application to survival data, treated as an ordered response when divided into discrete time intervals
• Combination of measurement errors, structural models and responses at >1 level into a single program
• Incorporation into MLwiN
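For readers unfamiliar with the first item, the standard one-parameter Box-Cox transformation is sketched below. The slide's own formula is not reproduced in this transcript, so this is the textbook form rather than a statement of how REALCOM implements it, and the function name is hypothetical.

```python
import math


def box_cox(y, lam):
    """Standard Box-Cox transform of a positive value y with parameter lam."""
    if y <= 0:
        raise ValueError("Box-Cox requires a positive value")
    if lam == 0:
        return math.log(y)            # limiting case as lam tends to 0
    return (y ** lam - 1.0) / lam


square_root_scale = box_cox(4.0, 0.5)   # (sqrt(4) - 1) / 0.5
log_scale = box_cox(math.e, 0.0)        # log(e)
```

In a model the parameter lam would be estimated alongside the others so that the transformed response is as close to Normal as possible.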

General remarks
• Feedback welcome (h.goldstein@bristol.ac.uk)
• A REALCOM discussion group is under consideration
Use with care!
