1. Data analysis with missing values http://www.sociology.ohio-state.edu/people/ptv/faq/missing/missing.ppt Ohio State University
Department of Sociology brownbag
Paul T. von Hippel
May 2, 2003
2. Missing values Common in social research
nonresponse, loss to follow-up
lack of overlap between linked data sets
social processes
dropping out of school, graduation, etc.
survey design
skip patterns between respondents
3. Methods Always bad methods
Mean (median, mode) imputation
Pairwise deletion a.k.a. available case analysis
Dummy variable adjustment
Often good methods
Listwise deletion (LD) a.k.a. complete case analysis
Multiple imputation (MI)
(Full information) maximum likelihood (ML)
4. Simulated data Population
maleness X1=1 if male, 0 if female
age X2 ~ N(50, 10²)
weight Y = b0 + X1 b1 + (X2 - 20) b2 + e
e ~ N(0, s²)
b0=125, b1=40, b2=1, s=15
Samples
small n=20 to illustrate procedures
large N=10,000 to check bias and efficiency
Various patterns of missingness
X1 is completely observed
X2 and/or Y may have missing values
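The population and samples above can be generated with a short script. A minimal Python sketch (not from the original slides; the numpy/pandas setup, variable names, and seeds are assumptions):

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(2003)             # arbitrary seed (assumption)
    N = 10_000                                    # large sample for bias/efficiency checks
    b0, b1, b2, s = 125, 40, 1, 15                # population parameters from this slide

    x1 = rng.integers(0, 2, size=N)               # maleness: 1 = male, 0 = female
    x2 = rng.normal(50, 10, size=N)               # age ~ N(50, 10^2)
    e = rng.normal(0, s, size=N)                  # error ~ N(0, s^2)
    y = b0 + x1 * b1 + (x2 - 20) * b2 + e         # weight

    pop = pd.DataFrame({"male": x1, "age": x2, "weight": y})
    sample = pop.sample(n=20, random_state=1)     # small sample to illustrate procedures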
5. Simulated sample n=20
Scatter plot of weight (Y) against age (X2); women (gray) are less likely to disclose.
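One assumed way to mimic this nondisclosure in the running Python sketch, simplified so that only weight goes missing (the disclosure probability is an assumption):

    # Simplified version of the slide's pattern: some women do not disclose weight.
    # (The slide also hides age; that is omitted to keep the later sketches short.)
    hide = (sample["male"] == 0) & (rng.random(len(sample)) < 0.6)
    sample.loc[hide, "weight"] = np.nan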
6. Method 1:Listwise deletion delete all cases with missing values for Y, X1, or X2
analyze remaining (complete) cases
common software default
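A hypothetical sketch of listwise deletion on the running example, using plain numpy least squares rather than any of the packages named in the talk:

    def listwise_ols(df, y_col, x_cols):
        """Drop cases with any missing value, then fit OLS on the complete cases."""
        complete = df.dropna(subset=[y_col] + x_cols)
        X = np.column_stack([np.ones(len(complete))] +
                            [complete[c].to_numpy() for c in x_cols])
        y = complete[y_col].to_numpy()
        coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
        return coefs                              # [intercept, coefficient on each x]

    print(listwise_ols(sample, "weight", ["male", "age"]))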
7. Myths about listwise deletion Myth: LD is always inefficient.
Fact: LD is efficient if only Y is missing.
Myth: LD is biased unless cases are deleted at random.
Fact: LD is unbiased unless deletion depends on e.
8. Assumption: listwise deletion LD assumes deletion does not depend on e
Otherwise the e's for the complete cases won't have a mean of 0
Y = b0 + X1 b1 + (X2 - 20) b2 + e, where e ~ N(0, s²)
Assumption satisfied
women (X1=0) less likely to disclose weight Y or age X2
deletion depends on X1
Assumption violated
overweight (e>0) less likely to disclose weight Y or age X2
deletion removes mostly positive e's, leaving negative e's
complete cases are mostly underweight
Results are biased
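A hypothetical check of the two cases above, reusing pop and e from the simulation sketch (the disclosure probabilities are assumptions):

    rng2 = np.random.default_rng(0)
    # Assumption satisfied: deletion depends only on X1 (women less likely to disclose)
    drop_x1 = rng2.random(len(pop)) < np.where(pop["male"] == 0, 0.5, 0.1)
    # Assumption violated: deletion depends on e (overweight less likely to disclose)
    drop_e = rng2.random(len(pop)) < np.where(e > 0, 0.5, 0.1)

    print(e[~drop_x1].mean())   # near 0: complete-case e's keep their mean
    print(e[~drop_e].mean())    # clearly negative: complete cases are mostly underweight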
9. N.B.: Assumptions relate to the model If the model neglects sex and age
Y = m + e, where e ~ N(0, s²)
then sex X1 is absorbed into e, and women's nondisclosure causes bias
More simply
Complete cases mostly men (e>0)
m will be overestimated
10. Method 2: Multiple imputation
11. a. Mean imputation Technique
Calculate mean over cases that have values for Y
Impute this mean where Y is missing
Ditto for X1, X2, etc.
Implicit models
Y=mY
X1=m1
X2=m2
Problems
ignores relationships among X and Y
underestimates covariances
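On the running example, mean imputation is a one-liner in pandas (a sketch, assuming the sample DataFrame from the earlier sketches):

    mean_filled = sample.fillna(sample.mean())    # replace each missing value with its column mean
    # Every imputed weight equals the observed mean, so the imputed column has
    # too little spread and a flattened relationship with male and age.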
12. b. Conditional mean imputation Technique & implicit models
If Y is missing
impute mean of cases with similar values for X1, X2
Y = b0 + X1 b1 + X2 b2
Likewise, if X2 is missing
impute mean of cases with similar values for X1, Y
X2 = g0 + X1 g1 + Y g2
If both Y and X2 are missing
impute means of cases with similar values for X1
Y = d0 + X1 d1
X2 = f0 + X1 f1
Problem
Ignores random components (no e)
→ Underestimates variances and standard errors (SEs)
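A hypothetical sketch of the first case above (only Y missing, X1 and X2 complete), continuing the running example:

    obs = sample["weight"].notna()
    mis = ~obs

    # regression of observed Y on X1, X2
    X_obs = np.column_stack([np.ones(obs.sum()),
                             sample.loc[obs, "male"], sample.loc[obs, "age"]])
    beta, *_ = np.linalg.lstsq(X_obs, sample.loc[obs, "weight"], rcond=None)

    # fill each missing Y with its predicted (conditional) mean -- no residual
    X_mis = np.column_stack([np.ones(mis.sum()),
                             sample.loc[mis, "male"], sample.loc[mis, "age"]])
    cond_filled = sample.copy()
    cond_filled.loc[mis, "weight"] = X_mis @ beta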
13. c. Single random imputation Implemented in SPSS MVA module
available in SRL
http://www.spss.com/PDFs/SMV115SPClr.pdf
Like conditional mean imputation
but imputed value includes a random residual
Implicit models
If Y is missing
Y = b0 + X1 b1 + X2 b2 + eY.12
Likewise, if X2 is missing
X2 = g0 + X1 g1 + Y g2 + e2.1Y
If both Y and X2 are missing
Y = d0 + X1 d1 + eY.1
X2 = f0 + X1 f1 + e2.1
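Continuing the conditional-mean sketch above, single random imputation adds a draw from the estimated residual distribution to each conditional mean (the residual-SD estimate and seed are assumptions):

    resid = sample.loc[obs, "weight"].to_numpy() - X_obs @ beta
    sigma = resid.std(ddof=X_obs.shape[1])        # residual SD from the observed cases
    rng3 = np.random.default_rng(3)
    single_filled = sample.copy()
    single_filled.loc[mis, "weight"] = X_mis @ beta + rng3.normal(0, sigma, mis.sum())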
14. Problem with single imputation Still underestimates SEs!
treats imputed values like observed values
when they are actually less certain
ignores imputation variation
15. Imputation variation Sampling variation
If you take a different sample
you get different parameter estimates
Standard errors reflect this
One way to estimate sampling variation
measure variation across multiple samples
called bootstrapping
Imputation variation
If you impute different values
you get different parameter estimates
Standard errors should reflect this, too
One way to estimate imputation variation
measure variation across multiple imputed data sets
called multiple imputation
16. d. Multiple imputation Case 1 is missing weight
Given case 1's sex and age
and relationships in other cases
generate a plausible distribution for case 1's weight
17. d. Multiple imputation We impute these plausible values, creating 5 versions of the data set: the multiple imputations
18. d. Multiple imputation For each imputed data set, estimate
parameters (white)
sampling variances and covariances (gray)
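A simplified, hypothetical sketch of the loop pictured in these slides, continuing the running example. It redraws only the residuals; proper MI software (e.g. SAS PROC MI) also draws the imputation-model parameters so the imputations reflect their uncertainty:

    M = 5
    estimates, samp_vars = [], []
    for m in range(M):
        filled = sample.copy()
        filled.loc[mis, "weight"] = X_mis @ beta + rng3.normal(0, sigma, mis.sum())
        X = np.column_stack([np.ones(len(filled)), filled["male"], filled["age"]])
        yv = filled["weight"].to_numpy()
        bhat, *_ = np.linalg.lstsq(X, yv, rcond=None)
        res = yv - X @ bhat
        s2 = res @ res / (len(yv) - X.shape[1])
        cov = s2 * np.linalg.inv(X.T @ X)         # sampling covariance of the estimates
        estimates.append(bhat)                    # parameters ("white" cells)
        samp_vars.append(np.diag(cov))            # squared SEs ("gray" cells)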
19. Sampling variation vs. imputation variation Over the 5 analyses,
Mean(b̂0) estimates the parameter b0
Mean(s²b̂0) estimates the variance in b̂0 due to sampling
Var(b̂0) estimates the variance in b̂0 due to imputation
20. MI standard errors Total variance in b̂0
Variation due to sampling + variation due to imputation
Mean(s²b̂0) + Var(b̂0)
Actually, there's a correction factor of (1 + 1/M)
for the number of imputations M. (Here M = 5.)
So the total variance in estimating b0 is
Mean(s²b̂0) + (1 + 1/M) Var(b̂0) = 179.53 + (1.2)(511.59) = 793.44
The standard error is √793.44 = 28.17
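The same combining rule as a short check with the slide's numbers (W = mean within-imputation variance, B = between-imputation variance):

    W, B, M = 179.53, 511.59, 5
    total_var = W + (1 + 1/M) * B                 # 179.53 + 1.2 * 511.59 = 793.44
    se = total_var ** 0.5                         # about 28.17
    # From the MI sketch above, for coefficient j these would be
    #   W = np.mean([v[j] for v in samp_vars]); B = np.var([b[j] for b in estimates], ddof=1)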
21. MI estimates in SAS
22. (Full information) maximum likelihood (ML) Suppose Y has a missing value
We estimate the distribution of possible Y values
23. ML in AMOS
24. ML vs. MI: Example
25. Assumption: MI and ML Remember: LD assumes deletion independent of e
MI and ML have a less restrictive assumption:
Values are missing at random (MAR)
The probability that a value is missing
depends only on values that are not missing
e.g., sex X1 is complete, and women are more likely to withhold weight Y and age X2
26. MAR with deletion independent of e Women (X1=0) less likely to disclose
weight Y and age X2
Data MAR
Deletion independent of e
All methods approximately unbiased
LD slightly less efficient
27. MAR with deletion dependent on e Overweight (e>0) less likely to disclose age X2
LD biased because deletion depends on e
bias evident in b2, s, and the SEs
MI & ML approximately unbiased because values are MAR
28. Summary MI & ML
more efficient than LD
unless only Y is missing
unbiased under less restrictive assumptions
MI & ML require MAR
LD requires deletion independent of e
But there's a fly in the ointment
29. Values not missing at random (NMAR) The probability that values are missing depends on the missing values themselves
e.g., the probability that weight Y is missing
is higher for the overweight (depends on Y)
is higher for women (depends on X1)
and sometimes X1 is missing, too.
30. NMAR If values are NMAR,
e.g., the overweight are less likely to disclose weight
all of today's methods are biased
31. Software Both AMOS ML and SAS PROC MI assume
missing values are multivariate normal
But your data may be
nonnormal
categorical
clustered or nested
Consider ad hoc adjustments (Allison 2002)
Or use different software
MI
www.stat.psu.edu/~jls/misoftwa.html
www.multiple-imputation.com
review in Horton & Lipsitz (2001)
ML for categorical data (links from Allison 2002)
http://www.kub.nl/faculteiten/fsw/organisatie/departementen/mto/software2.html (Lem)
www2.qimr.edu.au/davidD (LOGLIN)
32. Concise reference works Allison, P. (2002). Missing data. Thousand Oaks, CA: Sage [greenback].
Horton, NJ & Lipsitz, SR. (2001) Multiple imputation in practice: Comparison of software packages for regression models with missing variables. The American Statistician 55(3): 244-254.
Little, R.J.A. (1992) Regression with missing X's: A review. Journal of the American Statistical Association 87(420): 1227-1237.
33. Further references Mostly ML
Anderson, T.W. (1956) Maximum likelihood estimates for a multivariate normal distribution when some observations are missing.
Little, R.J.A. & Rubin, D.B. (1st ed. 1987, 2nd ed. 2002). Statistical Analysis with Missing Data. New York: Wiley.
Mostly MI
Schafer, JL. (1997a). Analysis of Incomplete Multivariate Data. London: Chapman & Hall.
Rubin, DB. (1987). Multiple imputation for survey nonresponse. New York: Wiley.