Multiple Imputations: Introduction and Application in Stata

Dylan Conger, Trachtenberg School. GWIPP Brown Bag, April 28, 2011

Types of Missing Data (Rubin 1976) • Missing Completely at Random (MCAR) • P(missing) is unrelated to all observed and unobserved variables • e.g., surveys are randomly sorted on your desk and the dog eats pieces of some • Missing at Random (MAR) • P(missing) is unrelated to the score on that variable after controlling for other variables in the study • e.g., poor students are less likely to take the exam, but conditional on poverty, P(missing) is unrelated to the exam score • Missing NOT at Random (MNAR) • P(missing) depends on the unobserved values themselves, even after controlling for observed variables

Solution 1: Listwise Deletion • Also called casewise deletion and complete case analysis • Easy. Most software does it automatically • If missing data are MCAR, the sub-sample will be a random draw of the larger sample and inferences will be unbiased • If the missing data are not MCAR, the complete cases are not a random draw of the larger sample, and estimates could be biased • Even if estimates are unbiased, dropping cases lowers statistical power
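The contrast between MCAR and MAR under listwise deletion can be illustrated with a small simulation. This is only a sketch on made-up data that borrows the poverty and exam-score example from the previous slide: the sample size, score model, and missingness probabilities are all assumptions chosen for illustration.

```python
import random
import statistics

random.seed(42)

# Hypothetical population: 40% of students are poor, and poor students
# score 10 points lower on average (all numbers are assumptions).
n = 20000
poor = [random.random() < 0.4 for _ in range(n)]
score = [70 - 10 * p + random.gauss(0, 8) for p in poor]
full_mean = statistics.mean(score)

# MCAR: every score has the same 30% chance of being missing.
mcar_cases = [s for s in score if random.random() > 0.3]

# MAR: poor students are more likely to miss the exam (60% vs. 10%),
# but within each poverty group missingness is unrelated to the score.
mar_cases = [s for s, p in zip(score, poor)
             if random.random() > (0.6 if p else 0.1)]

mcar_mean = statistics.mean(mcar_cases)
mar_mean = statistics.mean(mar_cases)

print(f"Full-sample mean:           {full_mean:.2f}")
print(f"Complete-case mean (MCAR):  {mcar_mean:.2f}")
print(f"Complete-case mean (MAR):   {mar_mean:.2f}")
```

Under these assumptions the MCAR complete-case mean stays close to the full-sample mean, while the MAR complete-case mean drifts upward because poor students are under-represented among the complete cases.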

Solution 2: Single Imputation • Replace each missing value with a single value computed from the observed data • Mean imputation (with or without a dummy variable adjustment) • Conditional mean imputation (e.g., predicted value from a regression with all observables) • Easy to implement and retains the sample size • But: • Estimates from mean imputation tend to be more biased than those from listwise deletion • Even with conditional mean imputation, the variance of the Xs is reduced because the same value is repeatedly imputed • Standard errors are underestimated and test statistics are overstated
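The variance-shrinking effect of mean imputation is easy to see numerically. The following is a minimal sketch on simulated data; the distribution, sample size, and 30% missingness rate are assumptions made for the example.

```python
import random
import statistics

random.seed(0)

# Hypothetical variable: 200 normal draws, with about 30% set to missing.
full = [random.gauss(50, 10) for _ in range(200)]
observed = [v if random.random() > 0.3 else None for v in full]

# Mean imputation: every missing value receives the observed mean.
obs_values = [v for v in observed if v is not None]
obs_mean = statistics.mean(obs_values)
imputed = [obs_mean if v is None else v for v in observed]

sd_observed = statistics.stdev(obs_values)
sd_imputed = statistics.stdev(imputed)

print(f"SD of observed values:     {sd_observed:.2f}")
print(f"SD after mean imputation:  {sd_imputed:.2f}")
```

The mean is unchanged, but the imputed cases all sit exactly at the mean, so the standard deviation of the filled-in variable is smaller than that of the observed values. Standard errors computed from it inherit that understatement.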

What is Multiple Imputation (MI)? • Similar to regression-based imputation, but conducted multiple times and with simulations to account for the uncertainty in the prediction • MI replaces each missing value with multiple (m) simulated values based on the observed data, producing m completed datasets • It then applies standard analyses (e.g., regression) to each dataset and combines the resulting parameter estimates and standard errors so that they incorporate missing-data uncertainty • Proposed by Donald Rubin (1987). Became popular in the last few years

MI Pros and Cons • Pros • Several studies have found that MI tends to be less biased and more efficient than the other methods (e.g., Rubin, 1987; Little, 1992; Schafer, 1997; Allison, 2002; Puma et al., 2009) • Cons • Computationally intensive and time consuming, especially with large datasets • Can be done wrong without the right software. Most importantly, the standard errors need to be properly adjusted or they will be downwardly biased

The 3 Steps to MI • Impute the data • Replace the missing values with m sets of plausible values according to the “imputation model” • Analyze the data • Perform same analyses on each imputed dataset to obtain estimates and standard errors • Pool the results • Consolidate the results from the analyses into one MI inference using Donald Rubin’s combination rules

Step 1: Impute the Data Suppose I have two variables, y and x • (1) Regress y on x using the complete cases, then predict y for cases with missing data • yi = 0.534 + 1.89xi + ei • ŷi = 0.534 + 1.89xi • (2) Take a random draw from the residual distribution (e) and add it to each prediction • ÿi = ŷi + residual • (3) Repeat step 2 m times, producing multiple datasets with different imputed values
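The regress, draw, and repeat recipe on this slide can be sketched in a few lines. The data below are simulated to resemble the slide's fitted line, and the missingness pattern, the normal residual draw, and the names `impute_once` and `datasets` are assumptions for illustration. The sketch keeps the regression coefficients fixed across imputations, so it is not yet the "proper" imputation discussed on the next slide.

```python
import random
import statistics

random.seed(1)

# Hypothetical data generated around y = 0.534 + 1.89x + e.
x = [random.uniform(0, 10) for _ in range(100)]
y = [0.534 + 1.89 * xi + random.gauss(0, 1) for xi in x]
# Make every fifth y missing (an arbitrary pattern for illustration).
y_obs = [None if i % 5 == 0 else yi for i, yi in enumerate(y)]

# (1) Fit y on x by ordinary least squares using complete cases only.
cx = [xi for xi, yi in zip(x, y_obs) if yi is not None]
cy = [yi for yi in y_obs if yi is not None]
mx, my = statistics.mean(cx), statistics.mean(cy)
slope = (sum((a - mx) * (b - my) for a, b in zip(cx, cy))
         / sum((a - mx) ** 2 for a in cx))
intercept = my - slope * mx
residuals = [b - (intercept + slope * a) for a, b in zip(cx, cy)]
resid_sd = (sum(r * r for r in residuals) / (len(cx) - 2)) ** 0.5

def impute_once():
    """(2) Fill each missing y with its prediction plus a random residual.

    Assumption: residuals are drawn from a normal distribution with the
    estimated residual standard deviation.
    """
    return [intercept + slope * xi + random.gauss(0, resid_sd)
            if yi is None else yi
            for xi, yi in zip(x, y_obs)]

# (3) Repeat the draw m times to get m completed datasets.
m = 5
datasets = [impute_once() for _ in range(m)]

print(f"Fitted line: y = {intercept:.3f} + {slope:.3f}x")
```

Each completed dataset keeps the observed values untouched and differs from the others only in the imputed cases.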

What is proper imputation and how is it conducted? • Proper imputation means that we pull a different set of estimates for each imputation from the full distribution of possible parameters • This requires us to have a distribution of the parameters, which are unknown • Rubin recommends using Bayesian arguments to generate this distribution

My attempt at explaining Bayesian inference • In classical (frequentist) statistics, we conduct inference as follows: • State a theory or hypothesis (θ = 0.0) • Collect data and observe X • Generate the sampling distribution, P(X|θ) • Calculate the p-value for X: the probability of getting an X at least this extreme if the hypothesis were true • In Bayesian statistics, we conduct inference as follows: • State several theories (θA = 0, θB = 1,000, θC = 5,000) • Use prior information to determine how likely each theory is to be true: the “prior distribution” • P(θA) = 0.5, P(θB) = 0.3, P(θC) = 0.2 • Collect data and observe X • Generate the “posterior distribution”, P(θ|X), from the prior distribution, the observed X, and Bayes’ Theorem • Compare the probabilities from this posterior distribution to determine how likely each theory is
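The Bayesian recipe above can be worked through with the three theories and prior probabilities from the slide. The slide gives neither a likelihood nor an observed X, so the sketch assumes X is normally distributed around θ with a standard deviation of 2,000 and that the observed X is 1,200. Both are made-up values.

```python
import math

# Theories and priors from the slide: name -> (theta, prior probability).
theories = {"A": (0.0, 0.5), "B": (1000.0, 0.3), "C": (5000.0, 0.2)}

# Assumptions for illustration only (not from the slides).
observed_x = 1200.0
sd = 2000.0

def normal_pdf(value, mean, sd):
    """Density of a normal distribution, used here as the likelihood P(X | theta)."""
    z = (value - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))

# Bayes' theorem: posterior is proportional to prior times likelihood.
unnormalized = {name: prior * normal_pdf(observed_x, theta, sd)
                for name, (theta, prior) in theories.items()}
evidence = sum(unnormalized.values())
posterior = {name: value / evidence for name, value in unnormalized.items()}

for name, (theta, prior) in theories.items():
    print(f"theta_{name} = {theta:>6.0f}  prior = {prior:.2f}  "
          f"posterior = {posterior[name]:.3f}")
```

With these assumed inputs, the data pull probability away from θC, which is far from the observed X, and toward θA and θB.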

Take home from Bayes • Use Bayesian inference to generate the posterior distribution of possible parameter values from which we can randomly draw to compute imputations • A few ways to do this: one popular approach is Markov Chain Monte Carlo (MCMC), which is an iterative algorithm that simultaneously estimates parameters and missing values • In the context of MI, the MCMC converges to a distribution of multiple sets of parameter values and imputed values, from which we draw our m imputed values

MCMC in the context of MI • Two steps iterated over and over: • I step (imputation step): use the observed data and the current parameter values to simulate the missing values • P step (posterior step): use the completed data to simulate new parameter values from their posterior distribution • These 2 steps are run repeatedly to produce a single data augmentation chain (e.g., 500 cycles) • From this chain, you draw the m datasets produced by I steps

Sort of what this looks like
• Data Augmentation Chain:
1st I step: ŷ = 0.534 + 1.89x + e
1st P step: y = (0.534 + residual) + (1.89 + residual)x
2nd I step: ŷ = 0.548 + 1.91x + e
2nd P step: y = (0.548 + residual) + (1.91 + residual)x
. .
500th I step: ŷ = 0.539 + 1.82x + e
500th P step: y = (0.539 + residual) + (1.82 + residual)x
• After some number of iterations (e.g., 100), pull the first imputed dataset. This “burn-in” phase eliminates dependency on the starting values
• Then pull the remaining m − 1 datasets at every kth iteration (e.g., every 100). This “between-imputation” phase eliminates dependency between imputed datasets
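The alternation of I and P steps, with a burn-in and a gap between saved datasets, can be sketched for the simplest possible case. This is a deliberately reduced illustration, not the multivariate normal algorithm used by statistical packages: it assumes a single normally distributed variable with some values missing and a noninformative prior, and the function names and data are hypothetical.

```python
import random
import statistics

random.seed(2)

# Hypothetical data: 60 normal draws with every fourth value missing.
y = [random.gauss(10, 2) for _ in range(60)]
y_obs = [None if i % 4 == 0 else v for i, v in enumerate(y)]
observed = [v for v in y_obs if v is not None]

def i_step(mu, sigma2):
    """I step: simulate the missing values given the current parameters."""
    sd = sigma2 ** 0.5
    return [random.gauss(mu, sd) if v is None else v for v in y_obs]

def p_step(completed):
    """P step: draw new parameters from their posterior given completed data.

    Assumes a noninformative prior, so sigma^2 is a scaled inverse
    chi-square draw and mu is normal around the completed-data mean.
    """
    n = len(completed)
    ybar = statistics.mean(completed)
    ss = sum((v - ybar) ** 2 for v in completed)
    chi2 = random.gammavariate((n - 1) / 2, 2)  # chi-square with n - 1 df
    sigma2 = ss / chi2
    mu = random.gauss(ybar, (sigma2 / n) ** 0.5)
    return mu, sigma2

def data_augmentation(m=5, burn_in=100, between=100):
    """Run one chain; save a dataset after burn-in and then every `between` cycles."""
    mu, sigma2 = statistics.mean(observed), statistics.variance(observed)
    saved = []
    total = burn_in + (m - 1) * between
    for iteration in range(1, total + 1):
        completed = i_step(mu, sigma2)
        mu, sigma2 = p_step(completed)
        if iteration >= burn_in and (iteration - burn_in) % between == 0:
            saved.append(completed)
    return saved

imputations = data_augmentation()
print(f"Saved {len(imputations)} imputed datasets from a 500-cycle chain")
```

With m = 5, a burn-in of 100, and 100 cycles between draws, the chain runs 500 cycles and saves the datasets from cycles 100, 200, 300, 400, and 500.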

Still on Step 1: What variables should go in the imputation model? • All variables with missing data • All other variables in the model to be estimated, including the dependent variable • Other variables that are correlated with the variables that have missing data, even if they will not be used in the final model

Still on Step 1: How many imputations? • Rubin (1987) gives the efficiency of an estimate based on m imputations, relative to infinitely many, as $(1 + \gamma/m)^{-1}$, where γ is the fraction of missing information • More missing data requires more imputations for the same efficiency • 5-10 imputations are often given as a rule of thumb • If γ = 0.20 (roughly, 20% of observations are missing data): • efficiency from 5 imputations is about 96% • efficiency from 10 imputations is about 98%
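The efficiency figures on this slide follow from Rubin's (1987) relative-efficiency expression, commonly written as (1 + γ/m)^(-1). A short check, with `relative_efficiency` as a hypothetical helper name:

```python
def relative_efficiency(gamma, m):
    """Efficiency of m imputations relative to infinitely many.

    gamma is the fraction of missing information (0.20 on the slide).
    """
    return 1.0 / (1.0 + gamma / m)

for m in (3, 5, 10, 20):
    print(f"m = {m:>2}: efficiency = {relative_efficiency(0.20, m):.1%}")
```

The gain from each additional imputation shrinks quickly, which is why small values of m are often treated as adequate when the fraction of missing information is modest.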

Step 2: Analyze the imputed data • Easy. If you have 5 datasets, run 5 regressions. • Nothing fancy required

Step 3: Parameter estimates • The combined parameter estimate is simply the mean of the m estimates from the separate analyses: $\bar{\theta} = \frac{1}{m}\sum_{j=1}^{m}\hat{\theta}_j$

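Step 3: Standard errors • The combined standard error contains two parts, where $\hat{\theta}_j$ is the estimate from imputed dataset j, $SE_j$ is its standard error, and $\bar{\theta}$ is the mean of the m estimates: • within-imputation variance (the average squared standard error across regressions): $W = \frac{1}{m}\sum_{j=1}^{m} SE_j^2$ • between-imputation variance (the variance of the estimates across regressions): $B = \frac{1}{m-1}\sum_{j=1}^{m}(\hat{\theta}_j - \bar{\theta})^2$ • Rubin's rules combine them as $SE_{MI} = \sqrt{W + (1 + 1/m)B}$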
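Rubin's combination rules for a single coefficient fit in a short function. The sketch below pools a point estimate and its standard error from m analyses. The function name `pool` and the five estimates and standard errors are hypothetical values for illustration, not results from the presentation.

```python
import math
import statistics

def pool(estimates, std_errors):
    """Combine m estimates and standard errors using Rubin's rules."""
    m = len(estimates)
    pooled = statistics.mean(estimates)
    # Within-imputation variance: average of the squared standard errors.
    within = statistics.mean([se ** 2 for se in std_errors])
    # Between-imputation variance: sample variance of the m estimates.
    between = statistics.variance(estimates)
    # Total variance adds a (1 + 1/m) correction to the between part.
    total = within + (1 + 1 / m) * between
    return pooled, math.sqrt(total)

# Hypothetical slope estimates and standard errors from 5 imputed datasets.
estimates = [1.89, 1.91, 1.82, 1.87, 1.93]
std_errors = [0.10, 0.11, 0.10, 0.12, 0.10]

pooled_estimate, pooled_se = pool(estimates, std_errors)
print(f"Pooled estimate: {pooled_estimate:.3f}")
print(f"Pooled SE:       {pooled_se:.3f}")
```

The pooled standard error is larger than the typical per-dataset standard error because it also carries the disagreement between imputations. Averaging the per-dataset standard errors alone would understate the uncertainty.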

MI in Stata 11 • mi impute mvn performs the imputation step (step 1) using the Markov Chain Monte Carlo algorithm • mi estimate performs the individual analyses and combines them (steps 2 and 3) • Several other mi commands can be found in the Stata documentation

An Example: Data on applicants to UT Austin in 2003

How much missing data and for which variables?

Missing data mechanism?

Set data and register variables that require imputation and those that don’t

Step 1: Imputation

Take a look at the resulting dataset

Take a look at the resulting distributions

Steps 2 and 3: Analyze the data

Steps 2 and 3: Analyze the data -- Regression

Other questions • What if the imputation model is wrong? • Schafer (1997) shows that MI works well even if some of the underlying assumptions are not met • Rules of thumb regarding how much missing data? • One rule of thumb: if fewer than 5% of the observations are missing data and MCAR isn’t way off, listwise deletion or single regression-based imputation works pretty well • Also, studies show that MI works well with up to 40% missing data on some variables

Conclusions • All methods for handling missing data that are not MCAR can lead to false inferences • Simulations suggest that methods that take into account the missing data uncertainty can produce less biased and more efficient estimates than other methods. Multiple imputation is one such method. • My approach: Estimate using listwise deletion and multiple imputation. If they are the same, report one in the appendix. If they are different, examine how. For instance, are the parameter estimates the same but the standard errors bigger?

Some References • Allison, Paul. 2002. Missing Data. Sage University Paper. • Little, Roderick J.A. 1992. Regression with Missing Xs: A Review. Journal of the American Statistical Association, Vol. 87, No. 420. • Little, Roderick J.A. & Donald B. Rubin. 1987. Statistical Analysis with Missing Data. John Wiley & Sons. • Puma, Michael J., et al. 2009. What to Do When Data Are Missing in Randomized Trials. NCEE, IES, U.S. Department of Education. • Rubin, Donald B. 1987. Multiple Imputation for Nonresponse in Surveys. John Wiley & Sons. • Stata Multiple Imputation Reference Manual. Release 11. • http://www.ats.ucla.edu/stat/stata/seminars/missing_data/mi_in_stata_pt2.htm
