Multiple Imputation

hallie

  1. Multiple Imputation in Stata (ice): how and when to use it.

  2. How ice() works
     • Each variable with missing data becomes, in turn, the outcome of a regression.
     • Typically all other variables are used as predictors.
     • Estimate β and σ from the regression.
     • Draw σ* from its posterior distribution (non-informative prior).
     • Draw β* from its posterior distribution (non-informative prior).
     • Find predicted values, Ŷ = Xβ*, then either:
        • Keep Ŷ for the missing values (default option), or
        • Use predictive mean matching.
     • Move on to the next variable, using the newly imputed values.
     • Cycle through the variables a number of times (10 is the default).
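The regression step above can be sketched for a single variable as follows. This is a minimal illustration in plain Python, not ice's actual code: it assumes one continuous outcome `y` with one predictor `x`, uses the usual non-informative-prior draws (σ*² = RSS/χ², then normal draws for the coefficients), and the function name, the `add_noise` switch, and the data are all hypothetical. With `add_noise=False` it keeps Ŷ = Xβ* as the slide describes; the optional residual draw is included only because many descriptions of regression imputation add one.

```python
import math
import random

def impute_one_variable(x, y, rng, add_noise=False):
    """Fill None entries of y using one posterior draw from a regression of y on x."""
    obs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    n = len(obs)
    x_bar = sum(xi for xi, _ in obs) / n
    y_bar = sum(yi for _, yi in obs) / n
    sxx = sum((xi - x_bar) ** 2 for xi, _ in obs)

    # Least-squares fit on the observed cases (x centred at x_bar).
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in obs) / sxx
    rss = sum((yi - y_bar - slope * (xi - x_bar)) ** 2 for xi, yi in obs)
    df = n - 2

    # sigma*: RSS / sigma^2 follows a chi-square(df) distribution,
    # and chi-square(df) is Gamma(shape=df/2, scale=2).
    chi2 = rng.gammavariate(df / 2.0, 2.0)
    sigma_star = math.sqrt(rss / chi2)

    # beta*: with x centred, intercept and slope are independent normals.
    a_star = rng.gauss(y_bar, sigma_star / math.sqrt(n))
    b_star = rng.gauss(slope, sigma_star / math.sqrt(sxx))

    filled = []
    for xi, yi in zip(x, y):
        if yi is None:
            pred = a_star + b_star * (xi - x_bar)   # Y-hat = X beta*
            if add_noise:                            # optional residual draw (assumption)
                pred += rng.gauss(0.0, sigma_star)
            filled.append(pred)
        else:
            filled.append(yi)                        # observed values are never changed
    return filled

# Hypothetical data: None marks a missing value.
rng = random.Random(2024)
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.1, 3.9, None, 8.2, 9.8, None, 14.1, 16.0]
completed = impute_one_variable(x, y, rng)
```

In the full procedure this step would be repeated for every variable with missing data, and the whole pass repeated for the chosen number of cycles.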

  3. Assumptions
     • Missing at Random (MAR)
        • No getting around this one. Data that are missing completely at random (MCAR) are fine, of course.
     • Distinct Parameters
        • Does the missing-data mechanism govern which data-generating parameters you can see? Example: limits of detection.
     • Adequate Sample Size
        • Hard to quantify. Regression on continuous variables doesn’t take much, but other methods certainly can.
     • Convergence to a Posterior Distribution
        • Standard MI (such as SAS PROC MI) is known to converge to a posterior distribution with enough iterations. ice() does not have this guarantee. This is typically ignored when ice() is used.

  4. Predictive Mean Matching
     • We have ŷmis, the predicted value for a case whose value is missing
     • Previously
        • Find the ŷobs that is closest to ŷmis, and fill in the missing value with the observed value from that case
        • Was the default behavior in previous versions of ice()
        • Could be a problem: not enough variability
     • Currently
        • Find a set of ŷobs that are close to ŷmis, choose one at random, and fill in the missing value with the observed value from that case
        • Invoked with the “match” option
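The two matching rules can be sketched in a few lines of plain Python. This is an illustration only, not ice's code: the function name, the donor-pool size `k`, and the numbers are hypothetical, and the slide does not say how large the set of close donors is. Setting `k=1` reproduces the older nearest-match rule; a larger `k` gives the random choice among close donors.

```python
import random

def pmm_impute(yhat_mis, yhat_obs, y_obs, k, rng):
    """Return the observed y of one donor chosen from the k cases
    whose predicted value is closest to yhat_mis."""
    ranked = sorted(range(len(yhat_obs)), key=lambda i: abs(yhat_obs[i] - yhat_mis))
    donors = ranked[:k]                 # indices of the k closest predictions
    return y_obs[rng.choice(donors)]    # impute an actually observed value

# Hypothetical predictions and observed values for the complete cases.
rng = random.Random(7)
yhat_obs = [1.0, 2.0, 3.0, 4.0, 5.0]
y_obs = [1.3, 1.8, 3.4, 3.9, 5.2]

nearest = pmm_impute(3.1, yhat_obs, y_obs, k=1, rng=rng)       # older rule: single closest donor
random_donor = pmm_impute(3.1, yhat_obs, y_obs, k=3, rng=rng)  # current rule: random pick among close donors
```

Because the imputed value is always copied from an observed case, it stays within the range of real data; drawing from several donors restores some of the variability that the single-nearest rule loses.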

  5. Other Regression Methods
     • Multinomial Logistic Regression
        • For categorical variables, ordered or unordered
        • Estimates a probability for each category, then imputes a value by drawing from those probabilities
        • My advice: try to avoid it, as I’ve found its results to be biased
     • Ordinal Logistic Regression
        • For ordered categorical variables
        • My advice: it seems to work well, but it needs a large sample (n > 1000)
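The "impute a value using those probabilities" step amounts to a weighted random draw. A minimal sketch in plain Python, assuming the fitted model has already produced one probability per category for a given case; the probabilities below are made up, and the category codes 4, 5, 6 are borrowed from the npbrkm example later in the deck.

```python
import random

def draw_category(categories, probabilities, rng):
    """Impute one category by sampling with the predicted probabilities as weights."""
    return rng.choices(categories, weights=probabilities, k=1)[0]

rng = random.Random(11)
categories = [4, 5, 6]
probs = [0.2, 0.5, 0.3]   # hypothetical predicted probabilities for one case

# Repeating the draw many times shows that categories appear
# roughly in proportion to their predicted probabilities.
draws = [draw_category(categories, probs, rng) for _ in range(1000)]
share_5 = draws.count(5) / len(draws)
```

Drawing at random, rather than always assigning the most probable category, is what keeps the imputed values from understating the variability in the data.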

  6. Useful Material: How to run ice()
     • Getting the program
        • Help -> Search -> [Search all] “ice imputation”
        • Click on st0067_2 (www.stata-journal.com)
        • Click “click here to install”
        • This gets you ice and micombine, as well as a few other commands

  7. Running ice
     • Have the dataset open
        • insheet using "C:\path\example.csv", clear
     • Four variables with missing information
        • npnitm: binary variable
        • npceradm, npneurm: continuous variables
        • npbrkm: 3-category ordered variable
     • Four variables with complete data (educ, mmselast, npdage, npgender)
     • We need to make dummy variables for categorical variables:
        • recode npbrkm (4=0) (5=1) (6=0) (.=.), generate(brk5)
        • recode npbrkm (4=0) (5=0) (6=1) (.=.), generate(brk6)

  8. Running ice, continued (1)
     • Call ice()
        • ice educ mmselast npdage npgender npnitm npceradm npbrkm brk5 brk6 npneurm using "C:\path\outfile", m(5) passive(brk5:npbrkm==5 \ brk6:npbrkm==6) substitute(npbrkm:brk5 brk6) cmd(npbrkm:mlogit, npnitm:logit)
     • Here’s what the code pieces do:
        • educ … npneurm: variables to be used for imputation
        • using "C:\path\outfile": the result, outfile.dta
        • m(5): 5 imputed datasets
        • passive(brk5:npbrkm==5 \ brk6:npbrkm==6)
           • Stata will not impute brk5 and brk6 directly; they are updated from the newly imputed values of npbrkm
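The idea behind passive() can be illustrated in plain Python: the dummies are never modelled themselves, only recomputed from npbrkm each time it is imputed. The function name and the sample values are hypothetical; the two expressions mirror `npbrkm==5` and `npbrkm==6` from the command above.

```python
def update_passive(npbrkm):
    """Recompute brk5 and brk6 from a (possibly just-imputed) npbrkm value."""
    if npbrkm is None:            # still missing: the dummies stay missing too
        return None, None
    return int(npbrkm == 5), int(npbrkm == 6)

# Hypothetical npbrkm values after an imputation pass.
imputed_npbrkm = [4, 5, 6, 5]
dummies = [update_passive(v) for v in imputed_npbrkm]   # (brk5, brk6) pairs
```

Deriving the dummies this way keeps them consistent with npbrkm, which separate imputation of each dummy could not guarantee.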

  9. Running ice, continued (2)
     • Here’s what the code pieces do (continued):
        • substitute(npbrkm:brk5 brk6)
           • npbrkm won’t be used to impute other variables; brk5 and brk6 will be used in its place
        • cmd(npbrkm:mlogit, npnitm:logit)
           • npbrkm is imputed with multinomial logistic regression
           • npnitm is imputed with logistic regression
        • All other variables with missing data use the default methods:
           • continuous: OLS
           • 2 categories: logistic regression
           • more than 2 categories: multinomial logistic regression

  10. Results
     • A dataset, outfile.dta
        • use "C:\path\outfile.dta", clear
     • New variables
        • _i: row number within each dataset (not generally used)
        • _j: imputed dataset number (same as _Imputation_ from PROC MI)
     • Analyzing the results using micombine, an example
        • xi: micombine regress mmselast npgender npnitm npceradm i.npbrkm
        • xi: expands interactions; used here to break npbrkm into dummy variables for the analysis
        • micombine: automatically does the MI analysis, using _j to distinguish between the imputed datasets
        • See its help file for a list of supported regression commands
        • For some methods, SAS’s MIANALYZE may be needed
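The "MI analysis" step is conventionally done with Rubin's rules: fit the model on each imputed dataset, then pool the estimates. The sketch below shows that pooling for a single coefficient in plain Python. It is an illustration of the standard rules rather than micombine's own code, and the five estimates and variances are invented numbers standing in for the results from m = 5 imputed datasets.

```python
import math

def pool_rubin(estimates, variances):
    """Combine per-imputation estimates and their squared standard errors."""
    m = len(estimates)
    q_bar = sum(estimates) / m                                   # pooled point estimate
    within = sum(variances) / m                                  # average within-imputation variance
    between = sum((q - q_bar) ** 2 for q in estimates) / (m - 1) # between-imputation variance
    total = within + (1 + 1 / m) * between                       # total variance
    return q_bar, math.sqrt(total)

# Hypothetical coefficient estimates and variances from 5 imputed datasets.
estimates = [1.0, 1.2, 0.8, 1.1, 0.9]
variances = [0.04, 0.05, 0.04, 0.06, 0.06]
pooled_est, pooled_se = pool_rubin(estimates, variances)
```

The between-imputation term is what carries the extra uncertainty due to the missing data; analyzing a single imputed dataset would leave it out and understate the standard error.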

  11. The end.
     • Questions?
