MICE for multiple imputation of missing values
Patrick Royston
MRC Clinical Trials Unit, London
11th London Stata Users’ Meeting, 17-18 May 2005
Outline
• What is multiple imputation?
• Types of missing data
• Multiple imputation with the MICE method
• Example: Fetal growth study
• Passive imputation
• Coping with categorical variables
• Notes and conclusions
What is multiple imputation (MI)?
• Context: Multiple regression (in general)
• Replace missing values with “plausible” substitutes
• Based on distribution of given data
• Inject the right amount of randomness to reflect uncertainty
• Do this several times, create m > 1 datasets
• Analyse datasets individually, but identically
• Combine the estimates, get confidence intervals using Rubin’s rules (micombine)
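The combining step can be sketched in a few lines. This is a minimal illustration of Rubin's rules for a single scalar estimate, not the micombine code; the function name `rubin_combine` and the three example estimates are made up for the illustration.

```python
import math

def rubin_combine(estimates, variances):
    """Pool m estimates and their squared standard errors with Rubin's rules."""
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need m > 1 estimates with matching variances")
    q_bar = sum(estimates) / m                               # pooled point estimate
    w = sum(variances) / m                                   # within-imputation variance
    b = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)   # between-imputation variance
    t = w + (1 + 1 / m) * b                                  # total variance
    # Degrees of freedom for the t reference distribution (infinite if b is 0)
    df = math.inf if b == 0 else (m - 1) * (1 + w / ((1 + 1 / m) * b)) ** 2
    return q_bar, math.sqrt(t), df

# Hypothetical results from m = 3 imputed datasets: estimates and squared SEs
est, se, df = rubin_combine([1.0, 1.2, 0.8], [0.04, 0.05, 0.03])
print(f"pooled estimate {est:.3f}, SE {se:.3f}, df {df:.2f}")
```

A confidence interval would then be the pooled estimate plus or minus a t quantile on `df` degrees of freedom times the pooled SE.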
Types of missing data: The Holy Triad
• MCAR (missing completely at random: probability of missingness does not depend on any data, observed or unobserved)
• MAR (missing at random: probability of missingness may depend on observed data, but not on unobserved information)
• MNAR (missing not at random: probability of missingness does depend on unobserved information)
MNAR data will not be considered here; data will be assumed MAR at worst
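A small simulation can make the MCAR/MAR distinction concrete. The sketch below is illustrative only: the variables `x` and `y`, the missingness probabilities and the seed are all invented, and nothing here comes from the fetal growth data.

```python
import random

rng = random.Random(2005)                       # fixed seed for reproducibility
n = 20000
x = [rng.gauss(0, 1) for _ in range(n)]         # fully observed covariate
y = [xi + rng.gauss(0, 1) for xi in x]          # variable that will lose values

def mean(values):
    return sum(values) / len(values)

# MCAR: every y has the same 45% chance of being missing
y_mcar = [yi for yi in y if rng.random() >= 0.45]

# MAR: missingness depends only on the observed x (80% if x > 0, else 10%)
y_mar = [yi for xi, yi in zip(x, y)
         if rng.random() >= (0.8 if xi > 0 else 0.1)]

full_mean, mcar_mean, mar_mean = mean(y), mean(y_mcar), mean(y_mar)
print(f"all data {full_mean:.2f}, "
      f"MCAR complete cases {mcar_mean:.2f}, "
      f"MAR complete cases {mar_mean:.2f}")
```

Under MCAR the complete-case mean stays close to the full-data mean. Under MAR it is biased, but the bias is explained by the observed `x`, which is the information an imputation model can exploit.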
Multiple imputation with MICE
• MICE = “multiple imputation by chained equations” (van Buuren et al., Stat Med 1999)
• The MICE approach has three components:
  – Univariate – implemented in uvis
  – Multivariate – implemented in ice
  – Multiple – implemented in ice
• ice = imputation by chained equations
Univariate imputation with uvis
• Suppose we have variables x1, x2, …, xk on n cases
• Suppose the variable to be imputed is x1
  – x1 has some observations “missing at random”
  – x2, …, xk are complete (no missing data)
• Regress x1 on x2, …, xk
• Draw β* from the posterior distribution of the regression coefficients (or use the bootstrap – boot option)
• Use prediction-matching to estimate missing x1
  – Predict all x1 values using β*(x2, …, xk)ᵀ
  – Find the non-missing prediction nearest to each missing-value prediction and impute using the corresponding observed value of x1
• Or, predict missing values of x1 from the posterior predictive distribution of x1 (draw option)
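The prediction-matching step can be sketched as follows. This is a simplified illustration, not the uvis code: it assumes a single complete predictor x2, uses a bootstrap resample in place of a posterior draw of β*, and the names `fit_line` and `impute_by_matching` and the toy data are hypothetical.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x with one predictor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b

def impute_by_matching(x1, x2, rng):
    """Return a copy of x1 with None entries filled by prediction matching."""
    obs = [i for i, v in enumerate(x1) if v is not None]
    # Bootstrap the observed cases to get perturbed coefficients
    # (a stand-in for drawing beta* from its posterior).
    sample = [rng.choice(obs) for _ in obs]
    while len({x2[i] for i in sample}) < 2:     # redraw degenerate resamples
        sample = [rng.choice(obs) for _ in obs]
    a, b = fit_line([x2[i] for i in sample], [x1[i] for i in sample])
    pred = [a + b * v for v in x2]              # predictions for every case
    filled = list(x1)
    for i, v in enumerate(x1):
        if v is None:
            # Donor = observed case whose prediction is nearest to this one
            donor = min(obs, key=lambda j: abs(pred[j] - pred[i]))
            filled[i] = x1[donor]               # impute an actually observed value
    return filled

x2 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
x1 = [2.1, 3.9, None, 8.2, 9.8, None, 14.1, 16.0, 18.2, None]
filled = impute_by_matching(x1, x2, random.Random(1))
print(filled)
```

Because the imputed value is always borrowed from an observed case, imputations stay within the range and granularity of the real data.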
Univariate imputation with uvis
uvis regression_cmd yvar xvarlist [if exp] [in range] [weight], gen(newvarname) [ boot draw seed(#) ]
• Quite general – regression_cmd may be regress, logit, ologit or mlogit for different types of yvar
Multiple imputation with ice
• Variables x1, …, xk may have missing data
• Eliminate cases with all variables missing
• Initialise – fill in all missing values at random
• Apply uvis to x1 regressing on x2, …, xk
• Replace missing values in x1
• Repeat for x2, …, xk on other x’s (cycle 1)
• Repeat for about 10 cycles
• Repeat whole process m times
  – gives m imputed datasets with complete observations
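The cycling scheme above can be sketched for the simplest case. This is a toy version under stated assumptions, not the ice implementation: it handles exactly two variables, replaces the uvis step with a regression prediction plus normal residual noise (no posterior draw of the coefficients), and assumes no case is missing both variables. All names and data are hypothetical.

```python
import random

def fit_line(xs, ys):
    """Least squares for y = a + b*x; also returns the residual SD."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    rss = sum((y - a - b * x) ** 2 for x, y in zip(xs, ys))
    return a, b, (rss / max(n - 2, 1)) ** 0.5

def impute_once(data, cycles, rng):
    """data: dict of two lists with None for missing; returns one completed copy."""
    names = list(data)
    n = len(data[names[0]])
    missing = {v: [i for i in range(n) if data[v][i] is None] for v in names}
    filled = {v: list(data[v]) for v in names}
    # Initialise: fill each gap with a randomly chosen observed value
    for v in names:
        observed = [x for x in data[v] if x is not None]
        for i in missing[v]:
            filled[v][i] = rng.choice(observed)
    for _ in range(cycles):
        for target in names:                    # one pass over the variables = one cycle
            if not missing[target]:
                continue
            (other,) = [v for v in names if v != target]
            obs = [i for i in range(n) if data[target][i] is not None]
            a, b, sd = fit_line([filled[other][i] for i in obs],
                                [filled[target][i] for i in obs])
            for i in missing[target]:           # replace the current imputations
                filled[target][i] = a + b * filled[other][i] + rng.gauss(0, sd)
    return filled

data = {"x1": [10.0, 12.1, None, 16.2, 18.0, 19.9, None, 24.1],
        "x2": [9.0, None, 13.2, 15.1, 17.0, None, 21.1, 23.0]}
rng = random.Random(42)
imputations = [impute_once(data, cycles=10, rng=rng) for _ in range(5)]  # m = 5
print(imputations[0])
```

Each of the m completed datasets would then be analysed identically and the results pooled with Rubin's rules.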
Multiple imputation with ice
ice varlist using filename[.dta] [if exp] [in range] [weight], [m(#) cmd(cmdlist) cycles(#) boot draw seed(#) dryrun eq(eqlist) passive(passivelist) noshoweq substitute(sublist) other_options]
Red options are new with ice (cf. mvis) – I will illustrate some aspects of these today
Example: Fetal size data
• Ultrasound study of fetal growth (Lyn Chitty)
• n = 649 singleton pregnancies
• Many measurements – will concentrate on ac (abdominal circumference), hc (head circumference), ml (mandible length) and ga (gestational age)
• Gestational age range 12-42 weeks
• Rank correlations: all 0.95
• Missing: ac 6%, hc 8%, ml 75%, ga 0%
  – ml ‘unreliable’ after 28 weeks
  – Wish to see what ml might look like beyond 28 weeks
• Heteroscedasticity – log transformations used
Multiple imputation
• Prediction equations for lnac, lnhc, lnml
• MFP (multivariable fractional polynomial) modelling on ga, otherwise linear
Creating one imputation with ice
eq(lnac:lnhc ga_1 ga_2, lnhc:lnac ga_3 ga_4, lnml:lnac lnhc
Result for log ML – random draws from posterior distribution (draw option)
Suppose ga had had missing values – introducing the passive() option
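The idea behind passive imputation, as I understand it, is that a variable which is a deterministic function of an imputed variable is not imputed itself but recomputed whenever its parent changes. The sketch below illustrates this for transformed versions of ga; the two transformations are hypothetical placeholders, not the fractional polynomial powers used in the talk, and `update_passive` is an invented helper.

```python
import math

def update_passive(row):
    """Recompute derived (passive) columns from the current value of ga."""
    ga = row["ga"]
    row["ga_1"] = ga ** 2                   # hypothetical transformation of ga
    row["ga_2"] = ga ** 2 * math.log(ga)    # hypothetical transformation of ga
    return row

row = {"ga": None, "ga_1": None, "ga_2": None}
row["ga"] = 20.0        # pretend this value was just imputed in the current cycle
update_passive(row)     # keep the derived terms consistent with the new ga
print(row)
```

Recomputing the derived terms after every update of ga keeps the prediction equations for the other variables consistent with the current imputation.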
Coping with categorical variables – using passive() with substitute()
No good – all the prediction equations are illogical!
Notes and Conclusions
• MICE method is very flexible – but demands thought when creating the imputation model
• Strongly recommend mastering the eq(), passive() and substitute() options
• Can deal with interactions using passive()
• Choice of m is important
  – may need to be (much) larger than 5
  – See Royston (2004, SJ 4:227-41) for discussion
• ice software is available (?on CD)
  – or send email to pr@ctu.mrc.ac.uk