This presentation covers multiple imputation of missing data in multilevel models, with extensions to discrete variables, sampling weights and partially observed data. It starts from a model whose responses at 2 levels have a joint MVN distribution, illustrated with repeated measures of children's heights together with adult height. Missing values at either level are imputed within the MCMC algorithm. Ordered and unordered categorical responses are handled through latent normal distributions, and non-normal continuous data through normalising transformations, so that each MCMC iteration samples an MVN set of variables including imputed values. Partially observed data are handled by incorporating a prior probability distribution for the missing value, so that all available data are used; applications include record matching and rating scales. Sampling weights can be incorporated in a 2-level model, and work is ongoing to integrate these techniques into MLwiN-REALCOM.
Missing data – issues and extensions
• For multilevel data we need to impute missing data for variables defined at higher levels.
• We need a valid procedure for discrete variables.
• It is useful to include sampling weights.
• Can we deal with partially missing data?
Consider the imputation stage with a set of multivariate responses
• We illustrate first with a simple model where the joint distribution of the responses is MVN and there are responses at 2 levels.
• To illustrate how such a model is specified, consider repeated measures of children's heights (level 1), with the child's adult height as the level 2 response.
Child heights + adult height
Child height is modelled as a cubic polynomial with intercept and slope random at level 2; both are correlated with the adult height random effect, giving a 3-variate normal distribution at level 2. This allows us to model level 1 and level 2 variables with missing data jointly (see Goldstein and Kounali, JRSSA, 2009).
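
One way to write this model down, as a sketch only. The notation is assumed rather than taken from the slides: $t_{ij}$ is the measurement time (for example age) at occasion $i$ for child $j$, $y^{(1)}_{ij}$ is the child height, $y^{(2)}_{j}$ is the adult height, and $\beta_h$, $\alpha$, $\sigma^2_e$ and $\Omega_u$ are hypothetical parameter names.

```latex
\begin{aligned}
y^{(1)}_{ij} &= \sum_{h=0}^{3} \beta_h t_{ij}^{h} + u_{0j} + u_{1j} t_{ij} + e_{ij},
  \qquad e_{ij} \sim N(0, \sigma^2_e), \\
y^{(2)}_{j} &= \alpha + u_{2j}, \\
(u_{0j},\, u_{1j},\, u_{2j})^{\top} &\sim N_3(\mathbf{0}, \Omega_u).
\end{aligned}
```

The off-diagonal terms of $\Omega_u$ carry the correlation between the growth intercept and slope and the adult height effect, which is what lets a missing value at one level borrow information from the other.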
Results: Thus, if data are missing at either level 1 or level 2, they are imputed via the MCMC algorithm.
Mixed response types
• For ordered or unordered categorical data we can specify corresponding ‘latent normal’ distributions.
• For an ordered response we can consider a ‘probit’ threshold model, in which the cumulative probability of being in one of the categories 1,…,s is the standard normal distribution function evaluated at the threshold for category s minus the linear predictor; the associated latent normal model has the same linear predictor and a standard normal residual.
• For a p-category unordered response we can define a latent (p-1)-variate normal.
• We can define MCMC steps to sample, from the observed categorical responses, an underlying normal or MVN. Note that these are further conditioned on the remaining set of (correlated) normal variables.
For details see Goldstein, H., Carpenter, J., Kenward, M. and Levin, K. (2009), Multilevel models with multivariate mixed response types, Statistical Modelling (to appear).
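
The latent normal step for an ordered response can be sketched as follows. This is a minimal illustration, not the REALCOM implementation: it assumes the usual probit threshold form in which category s corresponds to the latent value lying between threshold s-1 and threshold s, with a unit-variance residual, and it ignores the conditioning on the other correlated normal variables. The function name, argument names and the inverse-CDF sampling method are all hypothetical choices.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def sample_latent_ordered(category, thresholds, linear_predictor, rng):
    """Draw a latent normal value consistent with an observed ordered category.

    Assumption (not from the slides): categories are numbered
    1..len(thresholds)+1 and category s means
    thresholds[s-2] < y* <= thresholds[s-1], with -inf / +inf at the ends,
    where y* = linear_predictor + e and e ~ N(0, 1).
    """
    lower = -math.inf if category == 1 else thresholds[category - 2]
    upper = math.inf if category == len(thresholds) + 1 else thresholds[category - 1]

    # Probabilities of the residual falling below each end of the interval.
    p_lower = STD_NORMAL.cdf(lower - linear_predictor)
    p_upper = STD_NORMAL.cdf(upper - linear_predictor)

    # Inverse-CDF sampling from the normal truncated to (lower, upper].
    u = p_lower + (p_upper - p_lower) * rng.random()
    # Keep u strictly inside (0, 1) so inv_cdf is defined; in extreme tails
    # this clamp is an approximation.
    u = min(max(u, 1e-12), 1.0 - 1e-12)
    return linear_predictor + STD_NORMAL.inv_cdf(u)


if __name__ == "__main__":
    rng = random.Random(2009)
    # Three ordered categories defined by two hypothetical thresholds.
    print(sample_latent_ordered(2, [-0.5, 0.7], 0.2, rng))
```

In a full sampler this draw would be made from the conditional distribution given the other responses, so the mean and variance would come from the conditional MVN rather than from a fixed linear predictor.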
Imputation
• So now, with any mixture of categorical and normal variables at any level, we sample, at each MCMC iteration, an MVN set of variables including imputed values.
• Thus imputation is standard, and the reverse transformation is used to obtain imputed variables on the categorical scales.
• For non-normal continuous data we can use, e.g., a Box-Cox normalising transformation to sample a latent normal. Further extensions for Poisson and other discrete distributions are also available.
• Release 2.10 of MLwiN has a link to REALCOM that allows these extensions.
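
The reverse transformations mentioned above can be sketched as below. This is an illustration under stated assumptions: the ordered categories use the threshold convention "category s when threshold s-1 < latent value <= threshold s", and the Box-Cox parameter `lam` is treated as known. All function names are hypothetical and none of this is MLwiN or REALCOM code.

```python
import math
from bisect import bisect_left


def latent_to_category(latent_value, thresholds):
    """Map an imputed latent normal value back to an ordered category.

    Assumption: thresholds are sorted ascending and categories are numbered
    1..len(thresholds)+1, with category s when
    thresholds[s-2] < latent_value <= thresholds[s-1].
    """
    return bisect_left(thresholds, latent_value) + 1


def box_cox(y, lam):
    """Box-Cox transform of a positive value y (towards normality)."""
    if y <= 0:
        raise ValueError("Box-Cox requires y > 0")
    return math.log(y) if lam == 0 else (y ** lam - 1.0) / lam


def inverse_box_cox(z, lam):
    """Return an imputed value on the transformed scale to the original scale."""
    if lam == 0:
        return math.exp(z)
    base = lam * z + 1.0
    if base <= 0:
        raise ValueError("value is outside the range of the Box-Cox transform")
    return base ** (1.0 / lam)


if __name__ == "__main__":
    # A latent draw of 0.1 with hypothetical thresholds falls in category 2.
    print(latent_to_category(0.1, [-0.5, 0.7]))
    # Round trip on the continuous scale.
    print(inverse_box_cox(box_cox(3.2, 0.5), 0.5))
```

Imputed values are sampled on the latent (normal) scale and only mapped back at the end, so the sampler itself never has to work with the categorical or skewed scale.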
Partially observed (coarsened) data
• Where we have a prior (estimated) probability distribution (PD) for a missing discrete (or continuous) variable value, we simply insert an extra MCMC step that accepts the ‘standard’ MI value with a probability equal to the probability the PD assigns to that value. A corresponding step is used for normal data.
• This uses all of the data efficiently: no data are discarded so long as it is possible to assign a PD.
• Applications include record matching and rating scales with uncertain responses.
• Several completed data sets are produced and combined as in standard MI.
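
The extra acceptance step for a discrete variable can be sketched as follows. The slides do not say what happens when a candidate is rejected; this sketch assumes a fresh 'standard' value is drawn until one is accepted, which is an assumption of the example. The names `impute_with_prior`, `draw_standard_mi` and `prior_pd` are hypothetical.

```python
import random


def impute_with_prior(draw_standard_mi, prior_pd, rng, max_tries=10_000):
    """Return an imputed category that respects a prior probability distribution.

    draw_standard_mi: callable taking rng and returning a candidate category,
        standing in for the usual MI draw from the model.
    prior_pd: dict mapping category -> prior probability for this missing value.
    Assumption: on rejection we simply redraw (not stated in the source).
    """
    for _ in range(max_tries):
        candidate = draw_standard_mi(rng)
        # Accept the standard MI value with probability given by the PD.
        if rng.random() < prior_pd.get(candidate, 0.0):
            return candidate
    raise RuntimeError("no candidate accepted; check that the PD overlaps the model")


if __name__ == "__main__":
    rng = random.Random(42)

    def model_draw(r):
        # Hypothetical model draw: categories 0 and 1 equally likely.
        return 1 if r.random() < 0.5 else 0

    # Prior information says category 1 is four times as plausible as 0.
    draws = [impute_with_prior(model_draw, {0: 0.2, 1: 0.8}, rng) for _ in range(10)]
    print(draws)
```

Under this redraw assumption the accepted values follow a distribution proportional to the model probability times the prior probability, so a category given zero prior probability is never imputed.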
Sampling weights, briefly
• Consider a 2-level model, with a weight for each level 2 unit and a weight for each level 1 unit within the j-th level 2 unit; these are combined into final level 1 weights.
• We use a quantity derived from the final level 1 weights as the level 1 random part explanatory variable instead of the constant (= 1).
• This will be used for imputation and for the model of interest (MOI).
• Ongoing work aims to incorporate this into MLwiN-REALCOM.
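
A minimal sketch of one common way to build such weights, assumed here rather than taken from the slides: level 2 weights are scaled to sum to the number of level 2 units, level 1 weights are scaled within each level 2 unit to sum to its size, their product is rescaled to sum to the total number of level 1 units, and the inverse square root of the final weight is used as the level 1 random part explanatory variable (so the level 1 variance is divided by the weight). The function name and the scaling choices are hypothetical.

```python
def composite_weights(level2_weights, level1_weights):
    """Combine raw level 2 and level 1 weights into final level 1 weights.

    level2_weights: list of J raw weights, one per level 2 unit.
    level1_weights: list of J lists; inner list j holds the raw weights of the
        level 1 units in level 2 unit j.
    Returns (scaled level 2, scaled level 1, final level 1, z), where z is the
    assumed level 1 random part explanatory variable, final ** -0.5.
    """
    n_level2 = len(level2_weights)
    n_level1 = sum(len(ws) for ws in level1_weights)

    # Scale level 2 weights to sum to the number of level 2 units.
    total2 = sum(level2_weights)
    w2 = [w * n_level2 / total2 for w in level2_weights]

    # Scale level 1 weights within each level 2 unit to sum to its size.
    w1 = [[w * len(ws) / sum(ws) for w in ws] for ws in level1_weights]

    # Product of the two, rescaled to sum to the number of level 1 units.
    raw = [[w2[j] * w for w in w1[j]] for j in range(n_level2)]
    total = sum(sum(row) for row in raw)
    final = [[w * n_level1 / total for w in row] for row in raw]

    # Assumed replacement for the constant 1 in the level 1 random part.
    z = [[w ** -0.5 for w in row] for row in final]
    return w2, w1, final, z


if __name__ == "__main__":
    w2, w1, final, z = composite_weights([2.0, 1.0], [[1.0, 3.0], [1.0, 1.0, 2.0]])
    print(final)
    print(z)
```

With equal raw weights everywhere, every final weight and every value of z equals 1, which recovers the unweighted model.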