Missing Data Analysis Multiple Imputation

1. Missing Data Analysis: Multiple Imputation
   Ming-Yu Fan, PhD
   April 30, 2008

2. Outline
   - Missing mechanism
   - Multiple imputation
   - Propensity scores
   - Applications
     - SOLAS
     - Stata
     - SAS

3. How to deal with missing data
   - Do nothing
     - Exclude subjects with missing values -> expand the results from the sub-sample to the whole sample
   - Make a guess, replace with the guessed values
     - Fill in with a simple guess, e.g. the sample mean -> expand the results from the sub-sample to the whole sample -> similar to "do nothing"
     - Fill in with better guessed values -> imputation

4. Missing pattern (1) (Figures 1.1 and 1.2)

5. Missing pattern (1), cont.
   - Do nothing
     - Figure 1.1 has the same mixed color as Figure 1.2
   - Fill in with sample mean
     - The two mixed colors are still identical
   - Fill in with better guessed values
     - Nice but not necessary


7. Missing pattern (2), cont.
   - Do nothing
     - Figure 1.1 and Figure 1.2 have different mixed colors
   - Fill in with sample mean
     - The two figures will still have different mixed colors
   - Fill in with better guessed values
     - Necessary
     - If we can correctly identify the "slices", we can better guess a missing value from the observed values in the same slice
     - The final mixed colors might be similar


9. Missing pattern (3), cont.
   - Do nothing
     - Figures 3.1 and 3.2 have different mixed colors
   - Fill in with sample mean
     - The two figures will still have different mixed colors
   - Fill in with better guessed values
     - Even if we can identify the slices, we won't be able to guess the missing value correctly
     - Example: we can't guess the missing brown piece from the observed grey piece

10. Missing Mechanism
    - Missing Completely At Random (MCAR)
      - The best scenario
      - Simple approaches can yield unbiased results
    - Missing At Random (MAR)
      - The less ideal scenario
      - More advanced approaches are necessary; they can yield unbiased results
    - Not Missing At Random (NMAR)
      - The worst scenario
      - No approach can correct the biased results
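
A small simulation can make the three mechanisms concrete. The sketch below uses made-up data (a fully observed covariate `x` and an outcome `y`), and the missingness probabilities 0.3, 0.6 and 0.1 are arbitrary assumptions, not values from the slides. It compares the "do nothing" (complete-case) mean under each mechanism, and for MAR it also computes a simple estimate that averages within "slices" of `x`.

```python
import random
from statistics import mean

rng = random.Random(2008)  # fixed seed so the sketch is reproducible

# Hypothetical data: x is always observed, y depends on x and may be missing.
n = 20000
x = [rng.gauss(0, 1) for _ in range(n)]
y = [xi + rng.gauss(0, 1) for xi in x]
true_mean = mean(y)

# MCAR: every y has the same 30% chance of being missing.
mcar = [yi for yi in y if rng.random() > 0.3]

# MAR: the chance of being missing depends only on the observed covariate x.
mar_pairs = [(xi, yi) for xi, yi in zip(x, y)
             if rng.random() > (0.6 if xi > 0 else 0.1)]
mar = [yi for _, yi in mar_pairs]

# NMAR: the chance of being missing depends on the unobserved y itself.
nmar = [yi for yi in y if rng.random() > (0.6 if yi > 0 else 0.1)]

# Under MAR, averaging within slices of x and re-weighting by the slice
# sizes in the full sample removes the bias of the complete-case mean.
high = [yi for xi, yi in mar_pairs if xi > 0]
low = [yi for xi, yi in mar_pairs if xi <= 0]
share_high = sum(xi > 0 for xi in x) / n
mar_adjusted = share_high * mean(high) + (1 - share_high) * mean(low)

print(f"full sample mean      : {true_mean:.3f}")
print(f"MCAR, complete cases  : {mean(mcar):.3f}")
print(f"MAR, complete cases   : {mean(mar):.3f}")
print(f"MAR, slice-adjusted   : {mar_adjusted:.3f}")
print(f"NMAR, complete cases  : {mean(nmar):.3f}")
```

In this setup the NMAR case cannot be repaired the same way, because the variable that drives the missingness is exactly the one that is not observed.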

11. Missing Mechanism, cont.
    - How do we determine the missing mechanism?
      - Since the missing information is not observed, we don't know what the complete sample looks like, so we can't say for sure whether the missing data are MCAR, MAR, or NMAR
    - Can we guess?
      - E.g. (1): income missing because the patient's income is extremely high
      - E.g. (2): gender missing because the reviewer forgets to fill in the information
      - E.g. (3): SCL-20 items missing because older men don't like to answer some of the questions

13. Missing Mechanism, cont.
    - Respondents and non-respondents differ in some baseline characteristics
      - Probably not MCAR
      - MAR or NMAR? The truth is, most of the time we can't really determine that
    - A common approach: unless a missing value is clearly NMAR (e.g. income), we assume the missing data are MAR and apply methods based on this assumption (e.g. propensity score weighting, multiple imputation)
    - In reality, MCAR is uncommon, so the "do nothing" and "fill in with sample mean" approaches are likely to introduce bias

14. Imputation
    - Assumption: MAR
    - Challenges:
      - How to identify the "slices"
      - How to guess the missing values

15. Imputation: example
    - How to impute the missing SCL for patient #5?
      - Sample mean: (3.8 + 0.6 + 1.1 + 1.3)/4 = 1.7
      - By age: (3.8 + 0.6)/2 = 2.2
      - By sex: 1.1
      - By education: 1.3
      - By race: (3.8 + 0.6 + 1.3)/3 = 1.9
      - By ADL: (1.1 + 1.3)/2 = 1.2
    - Who is in the same "slice" as #5?
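
The candidate values above can be reproduced with a few lines of Python. This is only a sketch: the patient table is not part of the transcript, so the `matches` groupings (which observed patients share each characteristic with patient #5) are inferred from the sums on the slide.

```python
from statistics import mean

# Observed SCL scores for patients 1-4, taken from the slide's arithmetic.
scl = {1: 3.8, 2: 0.6, 3: 1.1, 4: 1.3}

# Observed patients that match patient #5 on each characteristic.
# Inferred from the slide's sums; treat these groupings as an assumption.
matches = {
    "sample mean": [1, 2, 3, 4],
    "age": [1, 2],
    "sex": [3],
    "education": [4],
    "race": [1, 2, 4],
    "ADL": [3, 4],
}

# Each "slice" definition gives a different imputed value for patient #5.
imputed = {by: round(mean(scl[p] for p in patients), 1)
           for by, patients in matches.items()}

for by, value in imputed.items():
    print(f"{by:12s} -> {value}")
```

The spread of answers (from 1.1 to 2.2) is the point of the slide: the imputed value depends heavily on how the slice is defined.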

16. Propensity score
    - Measure similarity by "the likelihood of being observed/missing"
    - Use logistic regression models to estimate this likelihood
    - Dependent variable Z
      - Z = 1 if a subject's outcome is observed
      - Z = 0 if a subject's outcome is missing
    - Independent variables = anything that might be associated with the outcome being missing (Z = 0)
      - Demographic information
      - Baseline characteristics

17. Propensity score, cont.
    - Model: $p = \mathrm{prob}(Z = 1)$
      - $\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k$
    - Z has no missing values
    - $X_1, \ldots, X_k$ must all have non-missing values
    - Statistical significance of the $\beta$s is not important
    - The predicted $p$'s derived from the model are the propensity scores

18. Propensity score: example
    - Dependent variable:
      - Y = 12-month SCL score
      - Z = 1 if Y is observed, Z = 0 if Y is missing
    - Independent variables:
      - $X_1$ = age (Age)
      - $X_2$ = sex, 1 if male and 0 if female (Sex)
      - $X_3$ = number of chronic conditions (NumC)
      - $X_4$ = baseline SCL score (SCL00)
    - Model: $\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4$
    - Result: $\beta_0 = 0.31$; $\beta_1 = 0.003$; $\beta_2 = -0.58$; $\beta_3 = -0.25$; $\beta_4 = 0.25$

19. Propensity score: example
    - $\log\left(\frac{p}{1-p}\right) = (0.31) + (0.003) \times \text{Age} + (-0.58) \times \text{Sex} + (-0.25) \times \text{NumC} + (0.25) \times \text{SCL00}$
    - Derive the propensity scores for subjects A and B:
      - Subject A: 70-year-old male, 3 chronic conditions, SCL00 = 1.7
        - (0.31) + (0.003)*70 + (-0.58)*1 + (-0.25)*3 + (0.25)*1.7 = -0.385
        - log(p/(1-p)) = -0.385 -> p = 0.405
      - Subject B: 85-year-old female, 4 chronic conditions, SCL00 = 0.7
        - (0.31) + (0.003)*85 + (-0.58)*0 + (-0.25)*4 + (0.25)*0.7 = -0.26
        - log(p/(1-p)) = -0.26 -> p = 0.435
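
The two hand calculations can be checked with a short script. It plugs the coefficients reported on the slide into the inverse logit, $p = 1/(1 + e^{-\text{logit}})$; the function and dictionary names are illustrative only.

```python
import math

# Coefficients reported for the example model on the slide.
coef = {"intercept": 0.31, "age": 0.003, "sex": -0.58,
        "numc": -0.25, "scl00": 0.25}

def propensity(age, sex, numc, scl00):
    """Return the estimated probability that the 12-month SCL is observed."""
    logit = (coef["intercept"] + coef["age"] * age + coef["sex"] * sex
             + coef["numc"] * numc + coef["scl00"] * scl00)
    return 1 / (1 + math.exp(-logit))  # inverse of log(p / (1 - p))

p_a = propensity(age=70, sex=1, numc=3, scl00=1.7)  # subject A
p_b = propensity(age=85, sex=0, numc=4, scl00=0.7)  # subject B
print(round(p_a, 3), round(p_b, 3))
```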

20. Propensity score, cont.
    - We can compute the propensity score for every subject, including those with a missing outcome
    - We already know whether a subject's outcome is observed or missing
      - Propensity scores do not "predict" the probability of a missing outcome in the sample
      - They estimate the likelihood/probability of "having the outcome observed" for ANY subject with a similar background, as measured by the independent variables
    - Subjects with close propensity scores are considered "similar" (in the same "slice")

21. Imputation: hot-deck
    - How to impute the missing SCL for patient #5?
      - 4 strata -> closest to #2 -> impute with 0.6
      - 2 strata -> closest to both #2 and #3 -> impute with a randomly selected value from (0.6, 1.1)
    - The method is called "hot-deck"; #2 and #3 are called "donors"
    - Common approach:
      - Stratify the sample by the propensity scores (e.g. 5 strata)
      - Randomly select a donor from the same stratum and impute the missing value with the donor's observed value
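
A minimal sketch of the stratify-then-draw procedure follows. The propensity scores in `patients` are hypothetical (the slide does not list them); they were chosen so that patient #5 sits next to #2 and #3, as in the example. Strata are formed by rank so that each holds roughly the same number of subjects, which is one reasonable choice among several.

```python
import random

def hot_deck_impute(records, n_strata=5, seed=0):
    """Impute missing 'scl' values from a random donor in the same stratum.

    records: list of dicts with keys 'id', 'propensity', 'scl' (None = missing).
    Returns {id: imputed value} for the records that had a missing value.
    """
    rng = random.Random(seed)
    ordered = sorted(records, key=lambda r: r["propensity"])
    n = len(ordered)

    # Assign each subject to a stratum by the rank of its propensity score.
    strata = {}
    for rank, rec in enumerate(ordered):
        strata.setdefault(rank * n_strata // n, []).append(rec)

    imputed = {}
    for members in strata.values():
        donors = [r["scl"] for r in members if r["scl"] is not None]
        for r in members:
            # A stratum with no donors leaves the value un-imputed.
            if r["scl"] is None and donors:
                imputed[r["id"]] = rng.choice(donors)
    return imputed

# Hypothetical propensity scores; SCL values are those used in the slides.
patients = [
    {"id": 1, "propensity": 0.80, "scl": 3.8},
    {"id": 2, "propensity": 0.45, "scl": 0.6},
    {"id": 3, "propensity": 0.50, "scl": 1.1},
    {"id": 4, "propensity": 0.70, "scl": 1.3},
    {"id": 5, "propensity": 0.42, "scl": None},
]

print(hot_deck_impute(patients, n_strata=4))  # only #2 shares a stratum with #5
print(hot_deck_impute(patients, n_strata=2))  # donor drawn from #2 and #3
```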

22. Imputation: regression
    - Model: $\text{SCL} = b_0 + b_1\,\text{Age} + b_2\,\text{Sex} + b_3\,\text{Edu} + b_4\,\text{Race} + b_5\,\text{ADL} + b_6\,\text{Pain} + b_7\,\text{Comorb}$
    - Fit the model to the observed data:
      - $b_0 = -0.8$, $b_1 = 0.02$, $b_2 = -0.5$, $b_3 = 0.05$, $b_4 = -0.6$, $b_5 = 0.1$, $b_6 = 0.1$, $b_7 = 0.05$
    - Plug in the information for #5 to derive the predicted value:
      - (-0.8) + (0.02)*70 + (-0.5)*0 + (0.05)*21 + (-0.6)*1 + (0.1)*2 + (0.1)*4 + (0.05)*3 = 1.8 = predicted SCL
    - Notes:
      - Predicted values might fall outside the natural range of the outcome (e.g. SCL > 4 or SCL < 0)
      - For ordinal outcomes, the predicted values might not be plausible (e.g. number of people living in the house = 2.7)
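
The prediction for patient #5 can be reproduced as below. The coefficients and covariate values are the ones on the slide. Clipping the result to the 0 to 4 range is not something the slides prescribe; it is shown only as one possible way to handle the out-of-range problem mentioned in the notes.

```python
# Fitted coefficients and patient #5's covariates, as given on the slide.
coef = {"intercept": -0.8, "age": 0.02, "sex": -0.5, "edu": 0.05,
        "race": -0.6, "adl": 0.1, "pain": 0.1, "comorb": 0.05}
patient5 = {"age": 70, "sex": 0, "edu": 21, "race": 1,
            "adl": 2, "pain": 4, "comorb": 3}

def predict_scl(x, lower=0.0, upper=4.0):
    """Return (raw prediction, prediction clipped to [lower, upper])."""
    raw = coef["intercept"] + sum(coef[name] * value for name, value in x.items())
    # Assumption: clip to the natural range of the SCL score.
    clipped = min(max(raw, lower), upper)
    return raw, clipped

raw, clipped = predict_scl(patient5)
print(round(raw, 2), round(clipped, 2))
```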

23. Multiple imputation
    - For each missing value, impute m data points
      - m > 1, usually m = 5
      - For single imputation, m = 1
    - What's wrong with single imputation?
      - Imputed values are derived from the observed sample, so the imputed sample is more homogeneous
      - Variances are under-estimated
        - (10, 20, 30) -> mean = 20, variance = 100
        - (10, 20, 30, 20, 20, 20) -> mean = 20, variance = 40
      - More likely to yield biased results
    - Advantage of multiple imputation
      - Adds the variation across the m data sets "back" into the estimate of the variance
      - Results are less likely to be biased
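
The variance shrinkage in the example is easy to verify. The sketch below fills three missing values with the observed mean and compares sample variances (divisor n - 1), which is the definition that reproduces the slide's numbers.

```python
from statistics import mean, variance

observed = [10, 20, 30]

# Single imputation: three missing values are all replaced by the observed mean.
single_imputed = observed + [mean(observed)] * 3

var_observed = variance(observed)        # sample variance of the observed data
var_imputed = variance(single_imputed)   # sample variance after mean imputation

print(mean(observed), var_observed)
print(mean(single_imputed), var_imputed)
```

The mean is unchanged, but the variance drops sharply because the filled-in values add observations without adding any spread.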

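24. Multiple imputation, cont.
    - To impute is easy: repeat m times
    - To analyze is more complicated
      - Suppose m = 5
      - $\theta$ = the estimate of interest (mean, median, proportion, etc.)
      - $s$ = squared standard error $= se^2$ (for a mean, $sd^2/N$)
      - Derive $(\theta_1, \theta_2, \theta_3, \theta_4, \theta_5)$ and $(s_1, s_2, s_3, s_4, s_5)$ from the 5 imputed data sets
    - Rubin (1987)
      - The combined estimate: $\bar{\theta} = (\theta_1 + \theta_2 + \theta_3 + \theta_4 + \theta_5)/5$
      - The combined $s = v_1 + [1 + (1/m)] \times v_2$
        - $v_1 = (s_1 + s_2 + s_3 + s_4 + s_5)/5$
        - $v_2$ = variance across $(\theta_1, \theta_2, \theta_3, \theta_4, \theta_5)$ $= \{(\theta_1 - \bar{\theta})^2 + (\theta_2 - \bar{\theta})^2 + (\theta_3 - \bar{\theta})^2 + (\theta_4 - \bar{\theta})^2 + (\theta_5 - \bar{\theta})^2\}/(5 - 1)$
    - For more complicated analyses we need statistical software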
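
The combining rules translate directly into code. This is a sketch for a single scalar estimate; the five estimates and squared standard errors in the usage example are made-up numbers, and `rubin_combine` is an illustrative name rather than a function from any package.

```python
import math
from statistics import mean, variance

def rubin_combine(estimates, squared_ses):
    """Combine m estimates and their squared standard errors.

    Returns (combined estimate, combined squared standard error).
    """
    m = len(estimates)
    combined = mean(estimates)
    v1 = mean(squared_ses)     # average within-imputation variance
    v2 = variance(estimates)   # between-imputation variance, divisor m - 1
    total = v1 + (1 + 1 / m) * v2
    return combined, total

# Hypothetical results from m = 5 imputed data sets.
thetas = [1.9, 2.1, 2.0, 2.2, 1.8]
ses2 = [0.04, 0.05, 0.04, 0.05, 0.04]

theta_bar, s_total = rubin_combine(thetas, ses2)
print(f"combined estimate = {theta_bar:.3f}")
print(f"combined standard error = {math.sqrt(s_total):.3f}")
```

The between-imputation term `v2` is what single imputation leaves out, which is why its standard errors come out too small.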

25. Multiple imputation: SOLAS
    - SOLAS 3.2 (Statistical Solutions Ltd.)
    - ~$1000, no need for renewal
    - Recommended by Dr. Rubin
    - Can impute longitudinal data with both item missing and wave missing
    - Can impute many variables with missing data simultaneously (internal algorithm to form a monotone missing pattern)

30. Multiple imputation: Stata
    - Stata version 7.0 and above (Stata Corporation, College Station, TX)
    - ~$100 for UW faculty/students, no need for renewal
    - Free download: "ice" (or "mice" for version 7+) to impute missing values and "micombine" to analyze multiply imputed data sets (macros written by Dr. Patrick Royston)
    - "Help" -> "Search" -> choose "Search all" and type the keywords "multiple imputation" -> click the links to download the macros

34. Multiple imputation: SAS
    - SAS version 9.1+ (SAS Institute Inc., Cary, NC)
    - ~$100 for UW faculty/students, need to renew every year
    - "PROC MI" to impute missing values
    - "PROC MIANALYZE" to analyze multiply imputed data sets

41. Summary
    - Missing data: results might be biased
    - Multiple imputation: requires additional work but generally yields better results
    - Many statistical software packages provide programs for imputing missing values and analyzing imputed data
