Learn about missing data types, solutions like Listwise Deletion and Single Imputation, and the technique of Multiple Imputation (MI) by Donald Rubin. Discover the pros and cons, steps involved, and importance of proper imputation in statistical analysis.
Multiple Imputations: Introduction and Application in Stata Dylan Conger Trachtenberg School GWIPP Brown Bag April 28, 2011
Types of Missing Data (Rubin 1976) • Missing Completely at Random (MCAR) • P(missing) is unrelated to all observed and unobserved variables • e.g., surveys randomly sorted on your desk and dog ate pieces of some • Missing at Random (MAR) • P(missing) is not related to the score on that variable after controlling for other variables in the study • e.g., poor students less likely to take the exam, but conditional on poverty, P(missing) unrelated to the exam score • Missing NOT at Random (MNAR) • P(missing) depends on unobserved values
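To make the three mechanisms concrete, here is a small simulation sketch built on the exam example above. The population (40% poor students, mean scores of 60 vs. 75, SD of 10) and the missingness probabilities are invented purely for illustration and do not come from any real dataset.

```python
import random
import statistics

random.seed(42)
n = 20000

# Hypothetical population: 40% of students are poor; poor students
# score lower on average (60 vs 75), with SD 10 in both groups.
poor = [random.random() < 0.4 for _ in range(n)]
score = [random.gauss(60 if p else 75, 10) for p in poor]

def keep_flags(p_missing):
    """Draw True (observed) or False (missing) for each student."""
    return [random.random() >= pm for pm in p_missing]

def mean_where(values, flags):
    """Mean of the values whose flag is True."""
    return statistics.mean(v for v, f in zip(values, flags) if f)

# MCAR: everyone has the same 30% chance of a missing score.
mcar_keep = keep_flags([0.3] * n)
# MAR: missingness depends only on poverty, which is observed.
mar_keep = keep_flags([0.5 if p else 0.1 for p in poor])
# MNAR: missingness depends on the score itself.
mnar_keep = keep_flags([0.6 if s < 65 else 0.1 for s in score])

true_mean = statistics.mean(score)
mcar_mean = mean_where(score, mcar_keep)
mar_mean = mean_where(score, mar_keep)
mnar_mean = mean_where(score, mnar_keep)

# Under MAR the raw mean is off, but the mean among poor students is not.
true_poor_mean = mean_where(score, poor)
mar_poor_mean = mean_where(score, [k and p for k, p in zip(mar_keep, poor)])

print(f"true {true_mean:.1f} | MCAR {mcar_mean:.1f} | MAR {mar_mean:.1f} | MNAR {mnar_mean:.1f}")
print(f"poor students: true {true_poor_mean:.1f} | MAR-observed {mar_poor_mean:.1f}")
```

In this sketch the complete-case mean stays close to the truth only under MCAR. Under MAR it is off overall but recoverable once you condition on poverty, which is what an imputation model relies on.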
Solution 1: Listwise Deletion • Also called casewise deletion and complete case analysis • Easy. Most software does it automatically • If missing data are MCAR, the sub-sample will be a random draw from the larger sample and inferences will be unbiased • If the complete cases are not a random draw from the larger sample, estimates could be biased • Even if estimates are unbiased, dropping cases lowers statistical power
Solution 2: Single Imputation • Replace each missing value with a single value estimated from the observed data • Mean imputation (with or without a dummy variable adjustment) • Conditional mean imputation (e.g., predicted value from a regression on all observables) • Easy to implement and retains the sample size • But... • Estimates from mean imputation tend to be more biased than those from listwise deletion • Even with conditional mean imputation, the variance of the Xs is reduced because the same value is imputed repeatedly • Underestimates standard errors and overestimates test statistics
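The variance problem can be seen in a few lines. This is a minimal sketch on made-up data (1,000 draws from a normal distribution with 30% of values deleted at random), not an analysis from the talk.

```python
import random
import statistics

random.seed(0)

# Hypothetical complete variable: 1,000 draws from N(50, 10^2).
full = [random.gauss(50, 10) for _ in range(1000)]

# Delete about 30% of the values completely at random (MCAR).
observed = [v if random.random() > 0.3 else None for v in full]

# Mean imputation: every missing value gets the observed mean.
obs_values = [v for v in observed if v is not None]
obs_mean = statistics.mean(obs_values)
imputed = [obs_mean if v is None else v for v in observed]

sd_observed = statistics.stdev(obs_values)
sd_imputed = statistics.stdev(imputed)
print(f"SD of observed values:    {sd_observed:.2f}")
print(f"SD after mean imputation: {sd_imputed:.2f}")
```

The imputed values add no spread while the sample size goes back up, so the standard deviation shrinks and any standard error computed from it is too small.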
What is Multiple Imputation (MI)? • Similar to regression-based imputations, but conducted multiple times and with simulations to account for the uncertainty in the prediction • MI substitutes missing values with multiple versions (m) of simulated values using the complete data • It then applies standard analyses (e.g., regression) to each dataset and adjusts the resulting parameter estimates and standard errors to incorporate missing-data uncertainty • Proposed by Donald Rubin (1987); has become popular in the last few years
MI Pros and Cons • Pros • Several studies have found that MI tends to be less biased and more efficient than the other methods (e.g. Rubin, 1987; Little, 1992; Schafer, 1997; Allison, 2002; Puma et al., 2009) • Cons • Computationally-intensive & time consuming, especially with large datasets • Can be done wrong without the right software. Most importantly, the standard errors need to be properly adjusted or they will be downwardly biased
The 3 Steps to MI • Impute the data • Replace the missing values with m sets of plausible values according to the “imputation model” • Analyze the data • Perform same analyses on each imputed dataset to obtain estimates and standard errors • Pool the results • Consolidate the results from the analyses into one MI inference using Donald Rubin’s combination rules
Step 1: Impute the Data • Suppose I have two vars, y and x • Regress y on x using complete data and impute values for cases with missing data • yi = 0.534 + 1.89xi + e • ŷi = 0.534 + 1.89xi • Take random draws from the residual distribution (e) and add the random number to the prediction • ÿi = ŷi + residual • Repeat the random-draw step more than once, producing multiple datasets and multiple predictions
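The recipe on this slide can be sketched as follows. The data are simulated (the 0.534 and 1.89 simply echo the slide's example), and drawing residuals by resampling the observed ones is one possible choice, assumed here for simplicity. Note that the regression coefficients are held fixed across imputations, so this is the simplified version and not yet "proper" imputation.

```python
import random
import statistics

random.seed(1)

# Hypothetical data: y depends on x; coefficients echo the slide's example.
n = 200
x = [random.gauss(0, 1) for _ in range(n)]
y = [0.534 + 1.89 * xi + random.gauss(0, 1) for xi in x]

# Make about 25% of y missing completely at random.
y_obs = [yi if random.random() > 0.25 else None for yi in y]

# Fit y = a + b*x by ordinary least squares on the complete cases.
cc = [(xi, yi) for xi, yi in zip(x, y_obs) if yi is not None]
mx = statistics.mean(xi for xi, _ in cc)
my = statistics.mean(yi for _, yi in cc)
b = sum((xi - mx) * (yi - my) for xi, yi in cc) / sum((xi - mx) ** 2 for xi, _ in cc)
a = my - b * mx
residuals = [yi - (a + b * xi) for xi, yi in cc]

def impute_once(rng):
    """Return one completed copy of y: prediction plus a randomly drawn residual."""
    return [
        yi if yi is not None else a + b * xi + rng.choice(residuals)
        for xi, yi in zip(x, y_obs)
    ]

# Repeat the random-draw step m times to get m completed datasets.
m = 5
rng = random.Random(2)
completed = [impute_once(rng) for _ in range(m)]
print(f"fitted line: y = {a:.3f} + {b:.3f}x; built {len(completed)} completed datasets")
```

Observed values are identical in every completed dataset; only the imputed values differ from one copy to the next.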
What is proper imputation and how is it conducted? • Proper imputation means that we pull a different set of estimates for each imputation from the full distribution of possible parameters • This requires us to have a distribution of the parameters, which are unknown • Rubin recommends using Bayesian arguments to generate this distribution
My attempt at explaining Bayesian inference • In classical (frequentist) statistics, we conduct inference as follows: • State a theory or hypothesis (θ = 0.0) • Collect data and observe X • Generate sampling distribution -- P(X|θ) • Calculate the p-value for X: the probability of getting an X at least this size if the hypothesis were true • In Bayesian statistics, we conduct inference as follows: • State several theories (θA = 0.00, θB = 1,000, θC = 5,000) • Use prior information to determine how likely each theory is to be true: the “prior distribution” • P(θA) = 0.5, P(θB) = 0.3, P(θC) = 0.2 • Collect data and observe X • Generate the “posterior distribution” based on the prior distribution, observed X, and Bayes' theorem -- P(θ|X) • Compare the probabilities from this posterior distribution to determine how likely each theory is
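As a rough numerical illustration of the Bayesian steps, the sketch below updates the three prior probabilities from the slide. The observed value (X = 1,200) and the likelihood (normal around θ with a known standard deviation of 1,500) are assumptions made up for this example; the slide does not specify them.

```python
import math

# Theories and prior probabilities from the slide.
thetas = {"A": 0.0, "B": 1000.0, "C": 5000.0}
prior = {"A": 0.5, "B": 0.3, "C": 0.2}

# Assumptions for illustration only: one observation X, normally
# distributed around theta with a known standard deviation.
x_observed = 1200.0
sd = 1500.0

def normal_pdf(x, mean, sd):
    """Density of a normal distribution, used here as P(X | theta)."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

# Bayes' theorem: P(theta | X) is proportional to P(X | theta) * P(theta).
unnormalized = {k: normal_pdf(x_observed, thetas[k], sd) * prior[k] for k in thetas}
total = sum(unnormalized.values())
posterior = {k: v / total for k, v in unnormalized.items()}

for k in thetas:
    print(f"theta_{k}: prior {prior[k]:.2f} -> posterior {posterior[k]:.3f}")
```

With these made-up inputs, the theory farthest from the observed X loses most of its probability and the closest one gains.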
Take home from Bayes • Use Bayesian inference to generate the posterior distribution of possible parameter values from which we can randomly draw to compute imputations • A few ways to do this: one popular approach is Markov Chain Monte Carlo (MCMC), which is an iterative algorithm that simultaneously estimates parameters and missing values • In the context of MI, the MCMC converges to a distribution of multiple sets of parameter values and imputed values, from which we draw our m imputed values
MCMC in the context of MI • Two iterative steps over and over again: • I step (imputation step): use complete data and regressions to simulate missing values • P step (posterior step): simulate the posterior population distribution of parameters • These 2 steps are repeatedly run to produce a single data augmentation chain (e.g., 500 cycles). • From this chain, you draw m datasets from the I-step
Sort of what this looks like • Data Augmentation Chain: • 1st I step: ŷ = 0.534 + 1.89x + e • 1st P step: y = (0.534 + residual) + (1.89 + residual)x • 2nd I step: ŷ = 0.548 + 1.91x + e • 2nd P step: y = (0.548 + residual) + (1.91 + residual)x • ... • 500th I step: ŷ = 0.539 + 1.82x + e • 500th P step: y = (0.539 + residual) + (1.82 + residual)x • After some number of iterations (e.g., 100), pull the first imputed dataset. This “burn-in” phase eliminates dependency on the starting values • Then pull the remaining m − 1 datasets, one at every kth iteration after that (e.g., k = 100). This “between-imputation” phase eliminates dependency between imputed datasets.
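The burn-in and between-imputation spacing amounts to a simple schedule of which iterations of the chain get saved. The helper below is hypothetical (its name and defaults are mine, chosen to match the slide's example numbers) and only computes the iteration numbers; it does not run the chain itself.

```python
def imputation_iterations(m, burn_in=100, between=100):
    """Iterations of the data augmentation chain at which imputed datasets are saved."""
    return [burn_in + j * between for j in range(m)]

# With m = 5, a 100-iteration burn-in, and 100 iterations between draws,
# the chain needs 500 cycles in total.
schedule = imputation_iterations(m=5)
print(schedule)  # [100, 200, 300, 400, 500]
```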
Still on Step 1: What variables should go in the imputation model? • All variables with missing data • All other variables in the model to be estimated • Other variables that are correlated with the variables that have missing data but that might not be used in the final model • This includes the dependent variable
Still on Step 1: How many imputations? • Rubin offers this formula for determining the efficiency of an estimate based on m imputations given the amount of missing data you have (γ): efficiency = 1 / (1 + γ/m) • More missing data requires more imputations for the same efficiency • 5-10 imputations are often suggested as a rule of thumb • If 20% of observations are missing data: • efficiency from 5 imputations is about 96% • efficiency from 10 imputations is about 98%
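The efficiency figures above follow from Rubin's (1987) relative-efficiency expression, 1 / (1 + γ/m). A short sketch (the function name is mine) reproduces them and shows how the number changes with other values of γ and m:

```python
def relative_efficiency(gamma, m):
    """Rubin's relative efficiency of m imputations versus infinitely many."""
    return 1.0 / (1.0 + gamma / m)

# Tabulate efficiency for a few missing-data fractions and imputation counts.
for gamma in (0.1, 0.2, 0.5):
    row = ", ".join(f"m={m}: {relative_efficiency(gamma, m):.1%}" for m in (3, 5, 10, 20))
    print(f"gamma={gamma}: {row}")
```

The gain from going beyond 10 imputations is small unless the missing-data fraction is large.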
Step 2: Analyze the imputed data • Easy. If you have 5 datasets, run 5 regressions. • Nothing fancy required
Step 3: Parameter estimates • The combined parameter is simply the mean of the estimates generated from the separate analyses
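Step 3: Standard errors • The combined standard error contains two parts: • within-imputation variance (the average of the squared standard errors from each regression): W = (1/m) Σ SEj² • between-imputation variance (the variance of the estimates across regressions): B = (1/(m − 1)) Σ (bj − b̄)², where bj is the estimate from imputation j and b̄ is their mean • Combined standard error = sqrt(W + (1 + 1/m) B)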
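Rubin's combination rules for a single coefficient can be written out in a few lines. The five estimates and standard errors below are invented for illustration; in practice they would come from the m regressions in Step 2.

```python
import math
import statistics

def pool(estimates, std_errors):
    """Combine m estimates and standard errors using Rubin's rules."""
    m = len(estimates)
    q_bar = statistics.mean(estimates)                      # pooled estimate
    within = statistics.mean(se ** 2 for se in std_errors)  # average squared SE
    between = statistics.variance(estimates)                # divides by m - 1
    total = within + (1 + 1 / m) * between
    return q_bar, math.sqrt(total)

# Hypothetical coefficient and SE from m = 5 regressions.
estimates = [1.89, 1.91, 1.82, 1.95, 1.86]
std_errors = [0.20, 0.21, 0.19, 0.22, 0.20]

pooled_estimate, pooled_se = pool(estimates, std_errors)
naive_se = math.sqrt(statistics.mean(se ** 2 for se in std_errors))
print(f"pooled estimate {pooled_estimate:.3f}, pooled SE {pooled_se:.3f}, naive SE {naive_se:.3f}")
```

The pooled standard error is larger than the average within-imputation standard error because it also carries the between-imputation variance. That extra term is the missing-data uncertainty that single imputation ignores.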
MI in Stata 11 • mi impute mvn performs the imputation step (step 1) using the Markov Chain Monte Carlo algorithm • mi estimate performs individual analyses and combines them (steps 2 and 3) • Several other mi commands can be found in Stata documentation
Set data and register variables that require imputation and those that don’t
Other questions • What if the imputation model is wrong? • Schafer (1997) shows that MI works well even if some of the underlying assumptions are not met • Rules of thumb regarding how much missing data? • Have read that if fewer than 5% of the observations are missing data and MCAR isn’t way off, listwise deletion or single regression-based imputation work pretty well • Also, studies show that MI works well with up to 40% missing data on some variables
Conclusions • All methods for handling missing data that are not MCAR can lead to false inferences • Simulations suggest that methods that take into account the missing data uncertainty can produce less biased and more efficient estimates than other methods. Multiple imputation is one such method. • My approach: Estimate using listwise deletion and multiple imputation. If they are the same, report one in the appendix. If they are different, examine how. For instance, are the parameter estimates the same but the standard errors bigger?
Some References • Allison, Paul. 2002. Missing Data. Sage University Paper. • Little, Roderick J.A. 1992. Regression with Missing Xs: A Review. Journal of the American Statistical Association. Vol. 87, No. 420. • Little, Roderick J.A. & Donald B. Rubin. 1987. Statistical Analysis with Missing Data. John Wiley & Sons. • Puma, Michael J., et al. 2009. What to Do When Data Are Missing in Randomized Trials. NCEE IES. U.S. Department of Education. • Rubin, Donald B. 1987. Multiple Imputation for Nonresponse in Surveys. John Wiley & Sons. • Stata Multiple Imputation Reference Manual. Release 11. • http://www.ats.ucla.edu/stat/stata/seminars/missing_data/mi_in_stata_pt2.htm