1. Missing Data Analysis – Multiple Imputation
Ming-Yu Fan, PhD
April 30, 2008
2. Outline
Missing Mechanism
Multiple imputation
Propensity scores
Applications
SOLAS
Stata
SAS
3. How to deal with missing data
Do nothing
Exclude subjects with missing values
→ Expand the results from the sub-sample to the whole sample
Make a guess, replace with the guessed values
Fill in with a simple guess, e.g. sample mean
→ Expand the results from the sub-sample to the whole sample
→ Similar to “do nothing”
Fill in with better guessed values
→ Imputation
4. Missing pattern (1)
5. Missing pattern (1) – cont.
Do nothing
Figure 1.1 has the same mixed color as Figure 1.2
Fill in with sample mean
The two mixed colors are still identical
Fill in with better guessed values
Nice but not necessary
6. Missing pattern (2)
7. Missing pattern (2) – cont.
Do nothing
Figures 2.1 and 2.2 have different mixed colors
Fill in with sample mean
The two figures will still have different mixed colors
Fill in with better guessed values
Necessary
If we can correctly identify the “slices”, we can
better guess a missing value according to the
observed value in the same slice
The final mixed colors might be similar
8. Missing pattern (3)
9. Missing pattern (3) – cont.
Do nothing
Figures 3.1 and 3.2 have different mixed colors
Fill in with sample mean
Two figures will have different mixed colors
Fill in with better guessed values
Even if we can identify the slices, we won’t be
able to correctly guess the missing value
Ex: we won’t be able to guess the missing
brown piece based on the grey observed piece
10. Missing Mechanism
Missing Completely At Random (MCAR)
The best scenario
Simple approaches can yield unbiased results
Missing At Random (MAR)
The less ideal scenario
More advanced approaches are necessary, but they can yield unbiased results
Not Missing At Random (NMAR)
The worst scenario
No approach can remove the bias from the results
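The three mechanisms can be illustrated with a small simulation. This is only a sketch on simulated data: the covariate x, the outcome y, and the missingness probabilities (0.3, 0.6, 0.1) are arbitrary assumptions, not values from this talk. It compares the complete-case ("do nothing") mean with the true mean, and for MAR it also averages within slices of the observed covariate.

```python
import random
import statistics


def simulate(n=20000, seed=0):
    """Compare the complete-case mean of y under MCAR, MAR and NMAR."""
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n)]   # covariate, always observed
    y = [xi + rng.gauss(0, 1) for xi in x]    # outcome, correlated with x

    def observed_mask(p_missing):
        # True where the outcome stays observed
        return [rng.random() >= p_missing(xi, yi) for xi, yi in zip(x, y)]

    def complete_case_mean(mask):
        return statistics.mean(yi for yi, keep in zip(y, mask) if keep)

    mcar = observed_mask(lambda xi, yi: 0.3)                     # unrelated to the data
    mar = observed_mask(lambda xi, yi: 0.6 if xi > 0 else 0.1)   # depends on observed x
    nmar = observed_mask(lambda xi, yi: 0.6 if yi > 0 else 0.1)  # depends on y itself

    # Under MAR: average y within each slice of x, then weight each slice
    # by its share of the full sample.
    by_slice = 0.0
    for in_slice in (lambda xi: xi > 0, lambda xi: xi <= 0):
        members = [i for i in range(n) if in_slice(x[i])]
        seen = [y[i] for i in members if mar[i]]
        by_slice += statistics.mean(seen) * len(members) / n

    return {
        "true": statistics.mean(y),
        "MCAR": complete_case_mean(mcar),
        "MAR": complete_case_mean(mar),
        "MAR_by_slice": by_slice,
        "NMAR": complete_case_mean(nmar),
    }


means = simulate()
for name, value in means.items():
    print(f"{name:>12}: {value:6.3f}")
```

In this setup the complete-case mean should stay close to the truth only under MCAR. Under MAR the slice-weighted mean should recover it, while under NMAR the slices cannot be formed because they depend on the unobserved outcome.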
11. Missing Mechanism - cont.
How do we determine the missing mechanism?
Since missing information is not observed, we don’t know what the complete sample looks like, and thus we can’t say for sure whether the missing data are MCAR, MAR, or NMAR
Can we guess?
E.g. (1): income missing because the patient’s income is
extremely high
E.g. (2): gender missing because the reviewer forgets to
fill in the information
E.g. (3): SCL-20 items missing because older men don’t
like to answer some of the questions
13. Missing Mechanism - cont.
Respondents and non-respondents are different in some baseline characteristics
Probably not MCAR
MAR or NMAR? The truth is, most of the time we can’t really
determine that
A common approach: unless a missing value is clearly NMAR
(e.g. income), we would assume MAR on the missing data
and apply methods that are based on this assumption (e.g.
propensity score weighting, multiple imputation)
In reality, it is not common to have MCAR, hence “do
nothing” and “fill in with sample mean” approaches are likely
to introduce bias
14. Imputation
Assumption: MAR
Challenges:
How to identify the “slices”
How to guess the missing values
15. Imputation – example
How to impute the missing SCL for patient # 5?
Sample mean: (3.8 + 0.6 + 1.1 + 1.3)/4 = 1.7
By age: (3.8+0.6)/2 = 2.2
By sex: 1.1
By education: 1.3
By race: (3.8 + 0.6 + 1.3)/3 = 1.9
By ADL: (1.1 + 1.3)/2 = 1.2
Who is/are in the same “slice” with #5?
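The slice means above can be reproduced with a short sketch. The four observed SCL scores are the ones used in the arithmetic on this slide. The covariate codes ("g1", "g2") are hypothetical placeholders, chosen only so that each grouping matches that arithmetic; they are not the original patient table.

```python
from statistics import mean

# SCL scores come from the slide; covariate codes are HYPOTHETICAL placeholders.
patients = {
    1: {"scl": 3.8, "age": "g1", "sex": "g2", "edu": "g2", "race": "g1", "adl": "g2"},
    2: {"scl": 0.6, "age": "g1", "sex": "g2", "edu": "g2", "race": "g1", "adl": "g2"},
    3: {"scl": 1.1, "age": "g2", "sex": "g1", "edu": "g2", "race": "g2", "adl": "g1"},
    4: {"scl": 1.3, "age": "g2", "sex": "g2", "edu": "g1", "race": "g1", "adl": "g1"},
    5: {"scl": None, "age": "g1", "sex": "g1", "edu": "g1", "race": "g1", "adl": "g1"},
}


def slice_mean(target_id, variable=None):
    """Mean observed SCL among patients sharing the target's value of `variable`.

    With variable=None, every observed patient is used (the sample mean).
    """
    target = patients[target_id]
    donors = [
        p["scl"]
        for pid, p in patients.items()
        if pid != target_id
        and p["scl"] is not None
        and (variable is None or p[variable] == target[variable])
    ]
    return round(mean(donors), 1)


guesses = {v or "sample": slice_mean(5, v)
           for v in (None, "age", "sex", "edu", "race", "adl")}
print(guesses)
```

Each choice of slicing variable gives a different guess for patient #5, which is why a single summary of similarity is needed.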
16. Propensity score
Measure the similarity by “the likelihood of being observed/missing”
Use logistic regression models to estimate this likelihood
Dependent variable Z =
1 if a subject’s outcome is observed
0 if a subject’s outcome is missing
Independent variables = anything that might be associated with the outcome being missing (Z=0)
Demographic information
Baseline characteristics
17. Propensity score – cont.
Model:
p = prob(Z=1)
log(p/(1-p)) = β0 + β1X1 + β2X2 + … + βkXk
Z has no missing values
X1, …, Xk must all have non-missing values
Statistical significance of the β’s is not important
The predicted p’s derived from the model are the propensity scores
18. Propensity score – example
Dependent variable:
Y = 12-month SCL score
Z = 1 if Y is observed, Z = 0 if Y is missing
Independent variables:
X1 = age = Age
X2 = sex ( = 1 if male, = 0 if female) = Sex
X3 = number of chronic conditions = NumC
X4 = baseline SCL score = SCL00
Model:
log(p/(1-p)) = β0 + β1X1 + β2X2 + β3X3 + β4X4
Result:
β0 = 0.31; β1 = 0.003; β2 = -0.58; β3 = -0.25; β4 = 0.25
19. Propensity score – example
log(p/(1-p)) = (0.31) + (0.003)·Age + (-0.58)·Sex + (-0.25)·NumC + (0.25)·SCL00
Derive the propensity scores for subjects A and B:
Subject A: 70-year-old male, 3 chronic conditions, SCL00 = 1.7
(0.31) + (0.003)·70 + (-0.58)·1 + (-0.25)·3 + (0.25)·1.7 = -0.385
log(p/(1-p)) = -0.385 → p = 0.405
Subject B: 85-year-old female, 4 chronic conditions, SCL00 = 0.7
(0.31) + (0.003)·85 + (-0.58)·0 + (-0.25)·4 + (0.25)·0.7 = -0.26
log(p/(1-p)) = -0.26 → p = 0.435
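The two calculations can be checked with a few lines of code. This sketch uses the coefficients reported in the example; the function and variable names are illustrative only.

```python
import math

# Coefficients reported in the example (intercept, Age, Sex, NumC, SCL00)
COEF = {"intercept": 0.31, "Age": 0.003, "Sex": -0.58, "NumC": -0.25, "SCL00": 0.25}


def propensity(age, sex, numc, scl00):
    """Return (log-odds, probability) that the 12-month SCL score is observed."""
    logit = (COEF["intercept"] + COEF["Age"] * age + COEF["Sex"] * sex
             + COEF["NumC"] * numc + COEF["SCL00"] * scl00)
    p = 1 / (1 + math.exp(-logit))   # inverse of log(p/(1-p))
    return logit, p


logit_a, p_a = propensity(age=70, sex=1, numc=3, scl00=1.7)   # subject A
logit_b, p_b = propensity(age=85, sex=0, numc=4, scl00=0.7)   # subject B
print(round(logit_a, 3), round(p_a, 3))
print(round(logit_b, 3), round(p_b, 3))
```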
20. Propensity score – cont.
We can compute the propensity score for every subject, including those with missing outcome
We already know whether a subject’s outcome is
observed or missing
Propensity scores do not “predict” the probability of
missing outcome in the sample
They estimate the “likelihood/probability” of “having
the outcome observed” for ANY subject with a similar
background measured by the independent variables
Subjects with close propensity scores are considered
“similar” (in the same “slice”)
21. Imputation – hot-deck
How to impute the missing SCL for patient # 5?
4 strata → closest to #2 → impute with 0.6
2 strata → closest to both #2 and #3 → impute with a randomly selected value from (0.6, 1.1)
The method is called “Hot-Deck”; #2, #3 are called “donors”
Common approach:
Stratify the sample by the propensity scores (e.g. 5 strata)
Randomly select a donor from the same stratum and impute the
missing value with the donor’s observed value
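A minimal sketch of the common approach, assuming equal-sized strata formed by sorting on the propensity score. The propensity scores in the example data are hypothetical (only the SCL values for #2 and #3 follow the slide), and the helper name hot_deck is illustrative.

```python
import random


def hot_deck(records, n_strata=5, seed=0):
    """Impute missing 'y' with a random donor value from the same propensity stratum.

    records: list of dicts with keys 'id', 'ps' (propensity score) and
    'y' (None when the outcome is missing). Returns new dicts; a stratum
    without any donor leaves its missing values untouched.
    """
    rng = random.Random(seed)
    ordered = sorted(records, key=lambda r: r["ps"])
    size = -(-len(ordered) // n_strata)           # ceiling division
    result = []
    for start in range(0, len(ordered), size):
        stratum = ordered[start:start + size]
        donors = [r["y"] for r in stratum if r["y"] is not None]
        for r in stratum:
            y = r["y"]
            if y is None and donors:
                y = rng.choice(donors)            # random donor from the same stratum
            result.append({**r, "y": y})
    return result


# Propensity scores below are HYPOTHETICAL; patient 5 has the missing outcome.
records = [
    {"id": 1, "ps": 0.80, "y": 3.8},
    {"id": 2, "ps": 0.42, "y": 0.6},
    {"id": 3, "ps": 0.35, "y": 1.1},
    {"id": 4, "ps": 0.70, "y": 1.3},
    {"id": 5, "ps": 0.40, "y": None},
]
imputed = {r["id"]: r["y"] for r in hot_deck(records, n_strata=2)}
print(imputed[5])
```

With two strata, patient #5 shares a stratum with #2 and #3, so the imputed value is drawn from their observed scores.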
22. Imputation – regression
Model:
SCL = b0 + b1·Age + b2·Sex + b3·Edu + b4·Race + b5·ADL + b6·Pain + b7·Comorb
Fit the model to the observed data:
b0=-0.8, b1=0.02, b2=-0.5, b3=0.05, b4=-0.6, b5=0.1, b6=0.1, b7=0.05
Plug in the information of #5 to derive the predicted value:
(-0.8) + (0.02)·70 + (-0.5)·0 + (0.05)·21 + (-0.6)·1 + (0.1)·2
+ (0.1)·4 + (0.05)·3
= 1.8 = predicted SCL
Notes:
Predicted values might be out of the natural range of the outcome
(e.g. SCL > 4 or SCL < 0)
For ordinal or count outcomes, the predicted values might not be plausible
(e.g. number of people living in the house = 2.7)
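The prediction for patient #5 can be reproduced as follows. The coefficients and covariate values are those shown on the slide. Truncating the prediction to the 0 to 4 range is an added assumption, shown as one possible way to handle out-of-range values; it is not part of the original method.

```python
# Fitted coefficients from the slide
COEF = {"intercept": -0.8, "Age": 0.02, "Sex": -0.5, "Edu": 0.05,
        "Race": -0.6, "ADL": 0.1, "Pain": 0.1, "Comorb": 0.05}

# Covariate values for patient #5 as plugged in on the slide
patient_5 = {"Age": 70, "Sex": 0, "Edu": 21, "Race": 1, "ADL": 2, "Pain": 4, "Comorb": 3}


def predict_scl(x, low=0.0, high=4.0):
    """Return (raw prediction, prediction truncated to [low, high])."""
    raw = COEF["intercept"] + sum(COEF[name] * value for name, value in x.items())
    # ASSUMPTION: truncation to the natural SCL range, for illustration only
    return raw, min(max(raw, low), high)


raw, truncated = predict_scl(patient_5)
print(round(raw, 2), round(truncated, 2))
```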
23. Multiple imputation
For each missing value, impute m data points
m > 1, usually m = 5
For single imputation m = 1
What’s wrong with single imputation?
Imputed values are derived from the observed sample,
and thus the imputed sample is more homogeneous
Variances are under-estimated
(10, 20, 30) → mean = 20, variance = 100
(10, 20, 30, 20, 20, 20), i.e. three missing values filled in with the mean → mean = 20, variance = 40
More likely to yield biased results
Advantage of multiple imputation
Add the variation across the m data sets “back” to the
estimation of variance
Result is less likely to be biased
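The shrinking variance in the single-imputation example is easy to verify. The sketch below uses the sample variance (divisor n-1) and assumes the three extra values are missing observations filled in with the observed mean.

```python
from statistics import mean, variance

observed = [10, 20, 30]
# Fill three missing values with the observed mean (single imputation)
filled = observed + [mean(observed)] * 3

var_observed = variance(observed)   # sample variance, divisor n - 1
var_filled = variance(filled)
print(mean(observed), var_observed)
print(mean(filled), var_filled)
```

The mean is unchanged, but the variance drops because the imputed values add no spread of their own.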
24. Multiple imputation – cont.
To impute is easy – repeat m times
To analyze is more complicated
Suppose m = 5
θ = the estimate of interest (mean, median, proportion, etc.)
s = squared standard error = se^2 = sd^2/N
Derive (θ1, θ2, θ3, θ4, θ5) and (s1, s2, s3, s4, s5) from the 5 imputed data sets
Rubin 1987
The combined θ = (θ1 + θ2 + θ3 + θ4 + θ5)/5
The combined s = v1 + [1+(1/m)]×v2
v1 = (s1 + s2 + s3 + s4 + s5)/5
v2 = variance across (θ1, θ2, θ3, θ4, θ5)
= {(θ1-θ)^2 + (θ2-θ)^2 + (θ3-θ)^2 + (θ4-θ)^2 + (θ5-θ)^2}/(5-1)
For more complicated analyses we need statistical
software
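For a simple estimate, the combining rules above can be applied by hand or with a short function. In this sketch the five estimates and squared standard errors are made-up numbers for illustration, and the function name pool is an assumption.

```python
from statistics import mean, variance


def pool(estimates, squared_ses):
    """Combine m estimates and their squared standard errors (Rubin 1987)."""
    m = len(estimates)
    combined_estimate = mean(estimates)
    v1 = mean(squared_ses)        # average within-imputation variance
    v2 = variance(estimates)      # between-imputation variance, divisor m - 1
    combined_s = v1 + (1 + 1 / m) * v2
    return combined_estimate, combined_s


# HYPOTHETICAL results from m = 5 imputed data sets
estimates = [1.0, 1.2, 0.9, 1.1, 0.8]
squared_ses = [0.04, 0.04, 0.04, 0.04, 0.04]

combined_estimate, combined_s = pool(estimates, squared_ses)
print(round(combined_estimate, 3), round(combined_s, 3), round(combined_s ** 0.5, 3))
```

The combined squared standard error is larger than any single s because the between-imputation variation is added back in.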
25. Multiple imputation - SOLAS
SOLAS 3.2 (Statistical Solutions Ltd.)
~ $1000, no need for renewal
Recommended by Dr. Rubin
Can impute longitudinal data with both item
missing and wave missing
Can impute many variables with missing data
simultaneously (internal algorithm to form
monotone missing pattern)
30. Multiple imputation - Stata
Stata version 7.0 and above
(Stata Corporation, College Station TX)
~ $100 for UW faculty/students, no need for renewal
Free download: “ice” (or “mice” for version 7+) to
impute missing values and “micombine” to analyze
multiple imputed data sets
(macros written by Dr. Patrick Royston)
“Help” → “Search” → choose “Search all” and type keywords “multiple imputation” → click the links to download the macros
34. Multiple imputation - SAS
SAS version 9.1+
(SAS Institute Inc., Cary, NC)
~ $100 for UW faculty/students, need to
renew every year
“PROC MI” to impute missing values
“PROC MIANALYZE” to analyze multiple
imputed data sets
41. Summary
Missing data – results might be biased
Multiple imputation – requires additional work but generally yields better results
Many statistical software packages have programs available for imputing missing values and analyzing imputed data