1. Data analysis with missing values http://www.sociology.ohio-state.edu/people/ptv/faq/missing/missing.ppt Ohio State University
Department of Sociology brownbag
Paul T. von Hippel
May 2, 2003
2. Missing values Common in social research
nonresponse, loss to follow-up
lack of overlap between linked data sets
social processes
dropping out of school, graduation, etc.
survey design
skip patterns between respondents
3. Methods Always bad methods
Mean (median, mode) imputation
Pairwise deletion a.k.a. available case analysis
Dummy variable adjustment
Often good methods
Listwise deletion (LD) a.k.a. complete case analysis
Multiple imputation (MI)
(Full information) maximum likelihood (ML)
4. Simulated data Population
maleness X1=1 if male, 0 if female
age X2 ~ N(50, 10²)
weight Y = b0 + X1 b1 + (X2 - 20) b2 + e
e ~ N(0, σ²)
b0 = 125, b1 = 40, b2 = 1, σ = 15
Samples
small n=20 to illustrate procedures
large N=10,000 to check bias and efficiency
Various patterns of missingness
X1 is completely observed
X2 and/or Y may have missing values
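The slides do not show code, but the population above is easy to reproduce. Here is a minimal sketch, assuming Python/numpy; the function name simulate and the seed are illustrative choices, not part of the original.

```python
# Illustrative sketch of the simulated population (assumed Python/numpy;
# the original slides do not specify software).
import numpy as np

rng = np.random.default_rng(0)

def simulate(n, b0=125.0, b1=40.0, b2=1.0, sigma=15.0):
    """Draw n cases: maleness X1, age X2 ~ N(50, 10^2), weight Y."""
    x1 = rng.integers(0, 2, size=n)              # 1 if male, 0 if female
    x2 = rng.normal(50.0, 10.0, size=n)          # age
    e = rng.normal(0.0, sigma, size=n)           # residual, e ~ N(0, sigma^2)
    y = b0 + b1 * x1 + b2 * (x2 - 20.0) + e      # weight
    return x1, x2, y

x1, x2, y = simulate(20)        # small sample to illustrate procedures
X1, X2, Y = simulate(10_000)    # large sample to check bias and efficiency
```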
5. Simulated sample n=20
[Scatterplot of weight (Y) vs. age (X2), n = 20; women (gray) are less likely to disclose.]
6. Method 1: Listwise deletion Delete all cases with missing values for Y, X1, or X2
analyze remaining (complete) cases
common software default
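For concreteness, a minimal listwise-deletion sketch, assuming the simulated data sit in a pandas DataFrame df with columns X1, X2, Y and that statsmodels fits the regression; both are assumptions, since the slides rely on their software's defaults.

```python
# Listwise deletion sketch: drop any case with a missing Y, X1, or X2,
# then fit the weight regression on the remaining complete cases.
import pandas as pd
import statsmodels.api as sm

def fit_listwise(df: pd.DataFrame):
    complete = df.dropna(subset=["Y", "X1", "X2"])   # delete incomplete cases
    X = sm.add_constant(complete[["X1", "X2"]])      # intercept + predictors
    return sm.OLS(complete["Y"], X).fit()            # complete-case analysis
```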
7. Myths about listwise deletion Myth
LD always inefficient
Fact
LD efficient if only Y is missing
Myth
LD biased unless cases are deleted at random
Fact
LD unbiased unless deletion depends on e
8. Assumption: listwise deletion LD assumes deletion does not depend on e
Otherwise the e's for the complete cases won't have a mean of 0
Y = b0 + X1 b1 + (X2 - 20) b2 + e, where e ~ N(0, σ²)
Assumption satisfied
women (X1=0) less likely to disclose weight Y or age X2
deletion depends on X1
Assumption violated
overweight (e>0) less likely to disclose weight Y or age X2
deletion removes mostly positive e's, leaving negative e's
complete cases are mostly underweight
Results are biased
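A hedged sketch of the two scenarios above, again assuming numpy/pandas; the 0.5 and 0.1 nondisclosure probabilities are arbitrary illustrative values, not figures from the slides.

```python
# Two ways weight Y can go missing in the simulated data: depending on sex
# X1 (LD assumption satisfied) vs. depending on the residual e (violated).
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000
x1 = rng.integers(0, 2, size=n)                   # 1 = male, 0 = female
x2 = rng.normal(50.0, 10.0, size=n)               # age
e = rng.normal(0.0, 15.0, size=n)                 # residual
y = 125.0 + 40.0 * x1 + 1.0 * (x2 - 20.0) + e     # weight
df = pd.DataFrame({"X1": x1, "X2": x2, "Y": y})

# Satisfied: women (X1 = 0) withhold weight more often, so deletion depends on X1.
miss_sex = rng.random(n) < np.where(x1 == 0, 0.5, 0.1)

# Violated: the overweight (e > 0) withhold weight more often, so deletion depends on e.
miss_e = rng.random(n) < np.where(e > 0, 0.5, 0.1)

df_sex = df.assign(Y=df["Y"].mask(miss_sex))      # LD still approximately unbiased here
df_e = df.assign(Y=df["Y"].mask(miss_e))          # LD biased here
```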
9. N.B.: Assumptions relate to model If model neglects sex and age
Y = μ + e, where e ~ N(0, σ²)
then sex X1 is absorbed into e, and women's nondisclosure causes bias
More simply
Complete cases are mostly men (e > 0)
μ will be overestimated
10. Method 2: Multiple imputation
11. a. Mean imputation Technique
Calculate mean over cases that have values for Y
Impute this mean where Y is missing
Ditto for X1, X2, etc.
Implicit models
Y = μY
X1 = μ1
X2 = μ2
Problems
ignores relationships among X and Y
underestimates covariances
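In pandas terms, the whole technique is one line; the helper name below is ours, not from the slides.

```python
# Mean imputation sketch: fill each column's missing entries with that
# column's observed mean. Simple, but it ignores relationships among
# variables and shrinks covariances toward zero.
import pandas as pd

def mean_impute(df: pd.DataFrame) -> pd.DataFrame:
    return df.fillna(df.mean(numeric_only=True))
```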
12. b. Conditional mean imputation Technique & implicit models
If Y is missing
impute mean of cases with similar values for X1, X2
Y = b0 + X1 b1 + X2 b2
Likewise, if X2 is missing
impute mean of cases with similar values for X1, Y
X2 = g0 + X1 g1 + Y g2
If both Y and X2 are missing
impute means of cases with similar values for X1
Y = d0 + X1 d1
X2= f0 + X1 f1
Problem
Ignores random components (no e)
⇒ Underestimates variances and SEs
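A sketch of the Y branch of this technique under the same assumed pandas/statsmodels setup; the helper and column names are illustrative.

```python
# Conditional mean imputation for Y: regress Y on X1 and X2 over cases where
# all three are observed, then impute the predicted (conditional) mean where
# only Y is missing. No residual is added, which is why variances and SEs
# come out too small.
import pandas as pd
import statsmodels.api as sm

def impute_y_conditional_mean(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    obs = out[["Y", "X1", "X2"]].notna().all(axis=1)
    fit = sm.OLS(out.loc[obs, "Y"],
                 sm.add_constant(out.loc[obs, ["X1", "X2"]])).fit()
    miss = out["Y"].isna() & out[["X1", "X2"]].notna().all(axis=1)
    X_miss = sm.add_constant(out.loc[miss, ["X1", "X2"]], has_constant="add")
    out.loc[miss, "Y"] = fit.predict(X_miss)     # conditional mean, no scatter
    return out
```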
13. c. Single random imputation Implemented in SPSS MVA module
available in SRL
http://www.spss.com/PDFs/SMV115SPClr.pdf
Like conditional mean imputation
but imputed value includes a random residual
Implicit models
If Y is missing
Y = b0 + X1 b1 + X2 b2 + eY.12
Likewise, if X2 is missing
X2 = g0 + X1 g1 + Y g2 + e2.1Y
If both Y and X2 are missing
Y = d0 + X1 d1 + eY.1
X2= f0 + X1 f1 + e2.1
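The sketch below mirrors (rather than reproduces) this idea: the conditional-mean regression for Y plus a random residual drawn from the fitted error distribution. The helper name is ours, not SPSS syntax.

```python
# Single random imputation for Y: conditional mean plus a random residual,
# so imputed values carry realistic scatter instead of sitting exactly on
# the regression plane.
import numpy as np
import pandas as pd
import statsmodels.api as sm

def impute_y_single_random(df: pd.DataFrame, rng=None) -> pd.DataFrame:
    if rng is None:
        rng = np.random.default_rng()
    out = df.copy()
    obs = out[["Y", "X1", "X2"]].notna().all(axis=1)
    fit = sm.OLS(out.loc[obs, "Y"],
                 sm.add_constant(out.loc[obs, ["X1", "X2"]])).fit()
    sigma = np.sqrt(fit.scale)                    # residual standard deviation
    miss = out["Y"].isna() & out[["X1", "X2"]].notna().all(axis=1)
    X_miss = sm.add_constant(out.loc[miss, ["X1", "X2"]], has_constant="add")
    out.loc[miss, "Y"] = fit.predict(X_miss) + rng.normal(0.0, sigma, miss.sum())
    return out
```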
14. Problem with single imputation Still underestimates SEs!
treats imputed values like observed values
when they are actually less certain
ignores imputation variation
15. Imputation variation Sampling variation
If you take a different sample
you get different parameter estimates
Standard errors reflect this
One way to estimate sampling variation
measure variation across multiple samples
called bootstrapping
Imputation variation
If you impute different values
you get different parameter estimates
Standard errors should reflect this, too
One way to estimate imputation variation
measure variation across multiple imputed data sets
called multiple imputation
16. d. Multiple imputation Case 1 is missing weight
Given case 1's sex and age
and relationships in other cases
generate a plausible distribution for case 1's weight
17. d. Multiple imputation We impute these plausible values, creating 5 versions of the data set (multiple imputations)
18. d. Multiple imputation For each imputed data set, estimate
parameters (white)
sampling variances and covariances (gray)
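A sketch of this loop, assuming the Python helpers from the earlier slides rather than the SAS/AMOS runs shown in the original deck; a proper MI routine would also redraw the imputation-model parameters for each imputation, which this simplified version omits.

```python
# MI loop sketch: impute M = 5 times, refit the regression on each imputed
# data set, and collect each b0 with its sampling variance. Assumes only Y
# has missing values and reuses the single-random-imputation sketch above.
import numpy as np
import statsmodels.api as sm

def mi_estimates(df, impute, m=5, seed=0):
    rng = np.random.default_rng(seed)
    b0s, s2_b0s = [], []
    for _ in range(m):
        imp = impute(df, rng)                      # one imputed data set
        X = sm.add_constant(imp[["X1", "X2"]])
        fit = sm.OLS(imp["Y"], X).fit()
        b0s.append(fit.params["const"])            # parameter estimate (white)
        s2_b0s.append(fit.bse["const"] ** 2)       # sampling variance (gray)
    return np.array(b0s), np.array(s2_b0s)
```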
19. Sampling variation vs. imputation variation Over the 5 analyses,
Mean(b0) estimates b0
Mean(s²b0) estimates the variance in b0 due to sampling
Var(b0) estimates the variance in b0 due to imputation
20. MI standard errors Total variance in b0
Variation due to sampling + variation due to imputation
Mean(s²b0) + Var(b0)
Actually, there's a correction factor of (1 + 1/M)
for the number of imputations M. (Here M=5.)
So total variance in estimating b0 is
Mean(s²b0) + (1 + 1/M) Var(b0) = 179.53 + (1.2)(511.59) = 793.44
Standard error is √793.44 = 28.17
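The combining rule is easy to check numerically; below is a small assumed-Python version (function name ours), with the slide's own figures plugged in at the end.

```python
# Combine M point estimates and their sampling variances with the rule above
# (Rubin's rules). Plugging in the slide's numbers reproduces 793.44 and 28.17.
import numpy as np

def combine(b, s2):
    b, s2 = np.asarray(b, float), np.asarray(s2, float)
    m = len(b)
    within = s2.mean()                    # Mean(s^2_b0): sampling variation
    between = b.var(ddof=1)               # Var(b0): imputation variation
    total = within + (1 + 1 / m) * between
    return b.mean(), total, np.sqrt(total)

total = 179.53 + (1 + 1 / 5) * 511.59     # 793.44
se = total ** 0.5                         # about 28.17
```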
21. MI estimates in SAS
22. (Full information) maximum likelihood (ML) Suppose Y has a missing value
We estimate the distribution of possible Y values
23. ML in AMOS
24. ML vs. MI: Example
25. Assumption: MI and ML Remember: LD assumes deletion independent of e
MI and ML have a less restrictive assumption:
Values are missing at random (MAR)
The probability that a value is missing
depends only on values that are not missing
e.g., women are more likely to withhold weight Y and age X2, and sex X1 is completely observed
26. MAR with deletion independent of e Women (X1=0) less likely to disclose
weight Y and age X2
Data MAR
Deletion independent of e
All methods approximately unbiased
LD slightly less efficient
27. MAR with deletion dependent on e Overweight (e>0) less likely to disclose age X2
LD biased because deletion depends on e
bias evident in b2, σ, and SEs
MI & ML approximately unbiased because values are MAR
28. Summary MI & ML
more efficient than LD
unless only Y is missing
unbiased under less restrictive assumptions
MI & ML require MAR
LD requires deletion independent of e
But there's a fly in the ointment
29. Values missing not at random Probability that values are missing depends on the missing values themselves
e.g., the probability that weight Y is missing
is higher for the overweight (depends on Y)
is higher for women (depends on X1)
and sometimes X1 is missing, too.
30. NMAR If values are NMAR,
e.g., overweight less likely to disclose weight
all of today's methods are biased
31. Software Both AMOS ML and SAS PROC MI assume
missing values are multivariate normal
But your data may be
nonnormal
categorical
clustered or nested
Consider ad hoc adjustments (Allison 2002)
Or use different software
MI
www.stat.psu.edu/~jls/misoftwa.html
www.multiple-imputation.com
review in Horton & Lipsitz (2001)
ML for categorical data (links from Allison 2002)
http://www.kub.nl/faculteiten/fsw/organisatie/departementen/mto/software2.html (Lem)
www2.qimr.edu.au/davidD (LOGLIN)
32. Concise reference works Allison, P. (2002). Missing data. Thousand Oaks, CA: Sage [greenback].
Horton, NJ & Lipsitz, SR. (2001) Multiple imputation in practice: Comparison of software packages for regression models with missing variables. The American Statistician 55(3): 244-254.
Little, R.J.A. (1992) Regression with missing Xs: A review. Journal of the American Statistical Association 87(420):1227-1237.
33. Further references Mostly ML
Anderson, T.W. (1957). Maximum likelihood estimates for a multivariate normal distribution when some observations are missing. Journal of the American Statistical Association 52: 200-203.
Little, R.J.A. & Rubin, D.B. (1st ed. 1987, 2nd ed. 2002). Statistical Analysis with Missing Data. New York: Wiley.
Mostly MI
Schafer, J.L. (1997). Analysis of Incomplete Multivariate Data. London: Chapman & Hall.
Rubin, D.B. (1987). Multiple Imputation for Nonresponse in Surveys. New York: Wiley.