Multiple Imputation

Multiple Imputation in Stata (ice): How and when to use it

How ice() works
• Each variable with missing data is the subject of a regression.
• Typically, all other variables are used as predictors.
• Estimate β and σ via the regression.
• Draw σ* from its posterior distribution (non-informative prior).
• Draw β* from its posterior distribution (non-informative prior).
• Find predicted values Ŷ = Xβ*, then either:
  • Keep Ŷ for the missing values (default option), or
  • Use predictive mean matching.
• Move on to the next variable, using the newly predicted values.
• Cycle through the variables a number of times (10 is the default).
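The steps above can be sketched for a single variable as follows. This is a minimal illustration on simulated data, not ice's own code: the variables x, y and miss are hypothetical, and the σ* and β* draws use one common form of the Bayesian regression draw under a non-informative prior, which is an assumption here.

```stata
* One regression-imputation step on simulated data (illustrative only)
clear
set seed 12345
set obs 200
generate x = rnormal()
generate y = 1 + 2*x + rnormal()
generate byte miss = runiform() < 0.3
replace y = . if miss

* Estimate beta and sigma from the rows where y is observed
quietly regress y x
matrix b = e(b)'                  // column vector: coefficient on x, then _cons
matrix V = e(V)
scalar df = e(df_r)
scalar sigma = e(rmse)

* Draw sigma*: sigma scaled by sqrt(df / chi-square(df) draw)
scalar sigma_star = sigma * sqrt(df / rchi2(df))

* Draw beta* = b + (sigma*/sigma) * cholesky(V) * z, with z ~ N(0, I)
matrix z = J(2, 1, 0)
matrix z[1, 1] = rnormal()
matrix z[2, 1] = rnormal()
local ratio = sigma_star / sigma
matrix bstar = b + `ratio' * cholesky(V) * z

* Predicted values from the drawn coefficients; keep them where y is missing
generate yhat = bstar[1, 1] * x + bstar[2, 1]
generate y_imp = cond(miss, yhat, y)
```

In the full procedure this step would be repeated for each variable with missing data, and the whole cycle repeated several times.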

Assumptions
• Missing at Random (MAR)
  • No getting around this one. MCAR is fine, of course.
• Distinct Parameters
  • Does the missing-data mechanism govern which data-generating parameters you can see? Example: limits of detection.
• Adequate Sample Size
  • Hard to quantify. Regression on continuous variables doesn’t take much, but other methods certainly can.
• Convergence to a Posterior Distribution
  • Standard MI (such as PROC MI) is known to converge to a posterior distribution given enough iterations. ice() does not have this guarantee. This is typically ignored when ice() is used.

Predictive Mean Matching
• We have ŷmis for the variable with missing information
• Previously
  • Find the ŷobs that is closest to ŷmis; fill in the missing value with the observed value belonging to that ŷobs
  • Was the default behavior in previous versions of ice()
  • Could be a problem: not enough variability
• Currently
  • Find a set of ŷobs that are close to ŷmis, choose one at random, and fill in the missing value with the observed value belonging to that ŷobs
  • Invoked by using the “match” argument
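A rough sketch of the "currently" variant, under stated assumptions: the data are simulated, the donor pool size k = 5 is a hypothetical choice, and for brevity the predictions come from the fitted coefficients rather than from a posterior draw. It is meant to show the matching idea only, not how ice implements it.

```stata
* Predictive mean matching sketch on simulated data (not ice's own code)
clear
set seed 2024
set obs 100
generate id = _n
generate x = rnormal()
generate y = 1 + 2*x + rnormal()
generate byte miss = runiform() < 0.25
replace y = . if miss

* Predicted values for every row, observed or missing
quietly regress y x
predict double yhat, xb
generate y_imp = y

local k = 5                                    // donor pool size (hypothetical)
quietly levelsof id if miss, local(missing_ids)
foreach i of local missing_ids {
    quietly summarize yhat if id == `i', meanonly
    scalar target = r(mean)
    * Distance from this row's prediction to each observed row's prediction
    quietly generate double dist = abs(yhat - target) if !miss
    sort dist id                               // missing distances sort last
    local pick = 1 + floor(`k' * runiform())   // random row among the k nearest
    scalar donor_y = y[`pick']
    quietly replace y_imp = donor_y if id == `i'
    drop dist
}
sort id
```

Setting k to 1 reproduces the "previously" behavior, where the single closest donor is always used.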

Other Regression Methods
• Multinomial Logistic Regression
  • For categorical variables, ordered or unordered
  • Finds a probability for each category value, then imputes a value using those probabilities
  • My advice: try to avoid using it, as I’ve found its results to be incorrect (biased)
• Ordinal Logistic Regression
  • For ordered categorical variables
  • My advice: it seems to work well, but it needs a large sample size (n > 1000) to work
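To make "imputes a value using those probabilities" concrete, here is a small sketch on simulated data. The variable brk and its coding (4, 5, 6) are hypothetical stand-ins, and the probabilities come from the fitted model rather than from a posterior draw of the coefficients, so this only illustrates the sampling step.

```stata
* Drawing a category from predicted multinomial probabilities (illustrative)
clear
set seed 99
set obs 300
generate x = rnormal()
generate u0 = runiform()
generate byte brk = 4 + (u0 > invlogit(-x)) + (u0 > invlogit(1 - x))
generate byte miss = runiform() < 0.2
replace brk = . if miss

* Fit on the observed rows; predict a probability for each category in every row
quietly mlogit brk x
predict double p4 p5 p6, pr

* Compare a uniform draw with the cumulative probabilities to pick a category
generate double u = runiform()
generate byte brk_imp = brk
replace brk_imp = cond(u < p4, 4, cond(u < p4 + p5, 5, 6)) if miss
```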

Useful Material: How to run ice()
• Getting the program
  • Help -> Search -> [Search all] “ice imputation”
  • Click on st_0067_2 (www.stata-journal.com)
  • Click “click here to install”
  • This gets you ice and micombine, as well as a few other commands

Running ice
• Have the dataset open
  • insheet using "C:\path\example.csv", clear
• Four variables with missing information
  • npnitm: binary variable
  • npceradm, npneurm: continuous variables
  • npbrkm: 3-category ordered variable
• Four variables with complete data
• We need to make dummy variables for categorical variables:
  • recode npbrkm (4=0) (5=1) (6=0) (.=.), generate(brk5)
  • recode npbrkm (4=0) (5=0) (6=1) (.=.), generate(brk6)

Running ice, continued (1)
• Call ice()
  • ice educ mmselast npdage npgender npnitm npceradm npbrkm brk5 brk6 npneurm using "C:\path\outfile", m(5) passive(brk5:npbrkm==5 \ brk6:npbrkm==6) substitute(npbrkm:brk5 brk6) cmd(npbrkm:mlogit, npnitm:logit)
• Here’s what the code pieces do:
  • educ … npneurm: variables to be used for imputation
  • using "C:\path\outfile": the result; outfile.dta
  • m(5): 5 imputed datasets
  • passive(brk5:npbrkm==5 \ brk6:npbrkm==6)
    • Stata will not impute brk5 and brk6 directly: they will be updated from the new values in npbrkm

Running ice, continued (2)
• Here’s what the code pieces do:
  • substitute(npbrkm:brk5 brk6)
    • npbrkm won’t be used to impute other variables; brk5 and brk6 will be used in its place
  • cmd(npbrkm:mlogit, npnitm:logit)
    • npbrkm will be imputed with multinomial logistic regression
    • npnitm will be imputed with logistic regression
    • all other variables with missing data use the default methods:
      • continuous: OLS
      • n=2 categories: Logistic Regression
      • n>2 categories: Multinomial Logistic Regression

Results
• A dataset, outfile.dta
  • use "C:\path\outfile.dta", clear
• New variables
  • _i: row number within each dataset (not generally used)
  • _j: imputed dataset number (same as _Imputation_ from PROC MI)
• Analyzing the results using micombine, an example
  • xi: micombine regress mmselast npgender npnitm npceradm i.npbrkm
  • xi: expand interactions. Used to break npbrkm into dummy variables for the analysis
  • micombine: automatically does the MI analysis, using _j to distinguish between the imputed datasets
    • See its help file for a list of supported regression commands
    • For some methods, SAS’s MIANALYZE may be needed
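For intuition about what an MI analysis combines, the sketch below pools one regression coefficient across imputed datasets using Rubin's rules, the standard combining rules for multiple imputation. It runs on simulated stacked data in which _j plays the same role as in outfile.dta; the variables x and y are hypothetical, and this is not micombine's own code.

```stata
* Pooling one coefficient across m datasets with Rubin's rules (illustrative)
clear
set seed 7
local m = 5
set obs 500                        // 5 stacked datasets of 100 rows (simulated)
generate _j = ceil(_n / 100)
generate x = rnormal()
generate y = 1 + 2*x + rnormal()

scalar qsum = 0
scalar q2sum = 0
scalar wsum = 0
forvalues j = 1/`m' {
    quietly regress y x if _j == `j'
    scalar qsum = qsum + _b[x]         // running sum of estimates
    scalar q2sum = q2sum + _b[x]^2     // running sum of squared estimates
    scalar wsum = wsum + _se[x]^2      // running sum of squared standard errors
}
scalar qbar = qsum / `m'                           // pooled estimate
scalar w_var = wsum / `m'                          // within-imputation variance
scalar b_var = (q2sum - `m' * qbar^2) / (`m' - 1)  // between-imputation variance
scalar t_var = w_var + (1 + 1/`m') * b_var         // total variance
display "pooled b = " qbar "   pooled se = " sqrt(t_var)
```

The pooled standard error is larger than the average within-dataset standard error whenever the estimates differ across imputations, which is how the uncertainty due to missing data enters the final result.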

The end. • Questions?
