

    1. Data analysis with missing values. http://www.sociology.ohio-state.edu/people/ptv/faq/missing/missing.ppt Ohio State University, Department of Sociology brown bag. Paul T. von Hippel. May 2, 2003

    2. Missing values. Common in social research: nonresponse, loss to follow-up; lack of overlap between linked data sets; social processes (dropping out of school, graduation, etc.); survey design (skip patterns between respondents).

    3. Methods. Always-bad methods: mean (median, mode) imputation; pairwise deletion, a.k.a. available-case analysis; dummy-variable adjustment. Often-good methods: listwise deletion (LD), a.k.a. complete-case analysis; multiple imputation (MI); (full information) maximum likelihood (ML).

    4. Simulated data. Population: maleness X1 = 1 if male, 0 if female; age X2 ~ N(50, 10²); weight Y = b0 + X1 b1 + (X2 - 20) b2 + e, where e ~ N(0, s²), with b0 = 125, b1 = 40, b2 = 1, s = 15. Samples: small (n = 20) to illustrate procedures; large (N = 10,000) to check bias and efficiency. Various patterns of missingness: X1 is completely observed; X2 and/or Y may have missing values.
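
The slides give no code for this setup; here is a minimal sketch in Python/NumPy. The function and variable names (simulate, x1, x2, y, small, large) are hypothetical, chosen only to mirror the slide.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

def simulate(n):
    """Draw one sample from the population described on slide 4."""
    x1 = rng.integers(0, 2, size=n)           # maleness: 1 = male, 0 = female
    x2 = rng.normal(50, 10, size=n)           # age ~ N(50, 10^2)
    e = rng.normal(0, 15, size=n)             # error ~ N(0, 15^2)
    y = 125 + 40 * x1 + 1 * (x2 - 20) + e     # weight: b0=125, b1=40, b2=1
    return pd.DataFrame({"x1": x1, "x2": x2, "y": y})

small = simulate(20)       # small sample to illustrate procedures
large = simulate(10_000)   # large sample to check bias and efficiency
```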

    5. Simulated sample, n = 20 (scatterplot). Women (gray) are less likely to disclose weight (Y) and age (X2).

    6. Method 1: Listwise deletion. Delete all cases with missing values for Y, X1, or X2; analyze the remaining (complete) cases. This is a common software default.
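
Continuing the sketch above, the block below first induces slide 5's missingness pattern (the 50% nondisclosure rates are assumptions made purely for illustration) and then applies listwise deletion, which in pandas is a single dropna call.

```python
df = large.copy()

# Slide 5's mechanism: women (x1 == 0) are less likely to disclose weight and age.
# The 50% nondisclosure chance for each variable is an assumed rate, for illustration.
women = df["x1"] == 0
df.loc[women & (rng.random(len(df)) < 0.5), "y"] = np.nan
df.loc[women & (rng.random(len(df)) < 0.5), "x2"] = np.nan

# Method 1: listwise (complete-case) deletion.
complete = df.dropna(subset=["x1", "x2", "y"])
print(f"{len(df)} cases, {len(complete)} complete cases")
```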

    7. Myths about listwise deletion. Myth: LD is always inefficient. Fact: LD is efficient if only Y is missing. Myth: LD is biased unless cases are deleted at random. Fact: LD is unbiased unless deletion depends on e.

    8. Assumption: listwise deletion. LD assumes deletion does not depend on e; otherwise the e's for the complete cases won't have a mean of 0. Y = b0 + X1 b1 + (X2 - 20) b2 + e, where e ~ N(0, s²). Assumption satisfied: women (X1 = 0) are less likely to disclose weight Y or age X2, so deletion depends on X1. Assumption violated: the overweight (e > 0) are less likely to disclose weight Y or age X2, so deletion removes mostly positive e's, leaving negative e's; the complete cases are mostly underweight, and results are biased.

    9. N.B.: Assumptions relate to the model. If the model neglects sex and age, Y = m + e, where e ~ N(0, s²), then sex X1 is in e, and women's nondisclosure causes bias. More simply: the complete cases are mostly men (e > 0), so m will be overestimated.

    10. Method 2: Multiple imputation

    11. a. Mean imputation. Technique: calculate the mean over the cases that have values for Y; impute this mean where Y is missing; ditto for X1, X2, etc. Implicit models: Y = mY, X1 = m1, X2 = m2. Problems: ignores relationships among the X's and Y; underestimates covariances.
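
A sketch of mean imputation on the hypothetical DataFrame df from the listwise-deletion example; comparing covariances before and after shows the attenuation the slide warns about.

```python
# Unconditional mean imputation: fill each missing value with that column's
# observed mean.  Shown only to illustrate why this is an "always bad" method.
mean_imp = df.fillna(df.mean())

# Imputed values sit exactly at the mean, so they add nothing to cross-products
# while inflating n: the covariance between age and weight shrinks toward zero.
print(df[["x2", "y"]].cov())        # covariance among observed pairs
print(mean_imp[["x2", "y"]].cov())  # covariance after mean imputation
```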

    12. b. Conditional mean imputation. Technique & implicit models: if Y is missing, impute the mean of cases with similar values for X1 and X2: Y = b0 + X1 b1 + X2 b2. Likewise, if X2 is missing, impute the mean of cases with similar values for X1 and Y: X2 = g0 + X1 g1 + Y g2. If both Y and X2 are missing, impute the means of cases with similar values for X1: Y = d0 + X1 d1 and X2 = f0 + X1 f1. Problem: ignores the random components (no e), so it underestimates variances and standard errors.
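
Continuing the same sketch, here is the first case on the slide (Y missing, X1 and X2 observed): fit the regression on the complete cases, then impute the fitted conditional mean with no residual.

```python
# Fit y = b0 + b1*x1 + b2*x2 on the complete cases.
obs = df.dropna(subset=["x1", "x2", "y"])
X_obs = np.column_stack([np.ones(len(obs)), obs["x1"], obs["x2"]])
beta, *_ = np.linalg.lstsq(X_obs, obs["y"], rcond=None)

# Impute the conditional mean where y is missing but x1 and x2 are observed.
need_y = df["y"].isna() & df["x1"].notna() & df["x2"].notna()
X_mis = np.column_stack([np.ones(need_y.sum()),
                         df.loc[need_y, "x1"], df.loc[need_y, "x2"]])
cond_imp = df.copy()
cond_imp.loc[need_y, "y"] = X_mis @ beta   # no residual, so variances shrink
```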

    13. c. Single random imputation. Implemented in the SPSS MVA module, available in SRL (http://www.spss.com/PDFs/SMV115SPClr.pdf). Like conditional mean imputation, but the imputed value includes a random residual. Implicit models: if Y is missing, Y = b0 + X1 b1 + X2 b2 + eY.12. Likewise, if X2 is missing, X2 = g0 + X1 g1 + Y g2 + e2.1Y. If both Y and X2 are missing, Y = d0 + X1 d1 + eY.1 and X2 = f0 + X1 f1 + e2.1.
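
The same sketch with a random residual added to each imputed value, which is the essence of single random imputation; this illustrates the idea and is not the SPSS MVA procedure itself.

```python
# Residual SD estimated from the complete cases (3 coefficients estimated).
resid = obs["y"] - X_obs @ beta
s_hat = resid.std(ddof=3)

# Impute the conditional mean plus a random draw from N(0, s_hat^2).
rand_imp = df.copy()
rand_imp.loc[need_y, "y"] = X_mis @ beta + rng.normal(0, s_hat, size=need_y.sum())
```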

    14. Problem with single imputation: it still underestimates SEs! It treats imputed values like observed values, when they are actually less certain; it ignores imputation variation.

    15. Imputation variation. Sampling variation: if you take a different sample, you get different parameter estimates; standard errors reflect this. One way to estimate sampling variation is to measure variation across multiple samples, which is called bootstrapping. Imputation variation: if you impute different values, you get different parameter estimates; standard errors should reflect this, too. One way to estimate imputation variation is to measure variation across multiple imputed data sets, which is called multiple imputation.

    16. d. Multiple imputation. Case 1 is missing weight. Given case 1's sex and age, and the relationships in the other cases, generate a plausible distribution for case 1's weight.

    17. d. Multiple imputation. We impute these plausible values, creating 5 versions of the data set: multiple imputations.

    18. d. Multiple imputation. For each imputed data set, estimate the parameters (white) and their sampling variances and covariances (gray).
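
A deliberately simplified sketch of slides 16-18, continuing the code above: build M = 5 imputed data sets and fit the regression in each, keeping the estimate of b0 and its sampling variance. Proper MI, as implemented in SAS PROC MI, also redraws the imputation-model parameters for each data set; redrawing only the residuals, as here, understates imputation variation.

```python
M = 5
b0_hat, b0_var = [], []
for m in range(M):
    # One imputed data set: conditional mean plus a fresh random residual.
    imp = df.copy()
    imp.loc[need_y, "y"] = X_mis @ beta + rng.normal(0, s_hat, size=need_y.sum())
    imp = imp.dropna(subset=["x1", "x2", "y"])   # patterns with x2 missing are set aside here

    # Fit the analysis model y = b0 + b1*x1 + b2*(x2 - 20) in this data set.
    X = np.column_stack([np.ones(len(imp)), imp["x1"], imp["x2"] - 20])
    coef, *_ = np.linalg.lstsq(X, imp["y"], rcond=None)
    res = imp["y"] - X @ coef
    sigma2 = (res @ res) / (len(imp) - 3)
    cov = sigma2 * np.linalg.inv(X.T @ X)        # sampling variances and covariances

    b0_hat.append(coef[0])                       # estimate of b0
    b0_var.append(cov[0, 0])                     # its sampling variance
```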

    19. Sampling variation vs. imputation variation. Over the 5 analyses: the mean of the 5 estimates of b0 estimates b0; the mean of the 5 sampling variances s²b0 estimates the variance in the estimate of b0 due to sampling; and the variance of the 5 estimates of b0 estimates the variance due to imputation.

    20. MI standard errors. Total variance in b0 = variation due to sampling + variation due to imputation = Mean(s²b0) + Var(b0). Actually, there's a correction factor of (1 + 1/M) for the number of imputations M (here M = 5), so the total variance in estimating b0 is Mean(s²b0) + (1 + 1/M) Var(b0) = 179.53 + (1.2)(511.59) = 793.44. The standard error is √793.44 = 28.17.
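
Combining the M analyses from the previous sketch by this rule (Rubin's rules); the last two lines check the slide's arithmetic with its own numbers.

```python
within = np.mean(b0_var)                        # Mean(s²b0): sampling variation
between = np.var(b0_hat, ddof=1)                # Var(b0): imputation variation
total = within + (1 + 1 / M) * between          # total variance of the MI estimate
se_b0 = np.sqrt(total)

# The slide's numbers: within = 179.53, between = 511.59, M = 5.
print(179.53 + (1 + 1 / 5) * 511.59)            # 793.44 (total variance)
print(np.sqrt(179.53 + (1 + 1 / 5) * 511.59))   # 28.17 (standard error)
```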

    21. MI estimates in SAS

    22. (Full information) maximum likelihood (ML). Suppose Y has a missing value; we estimate the distribution of possible Y values.

    23. ML in AMOS

    24. ML vs. MI: Example

    25. Assumption: MI and ML. Remember: LD assumes deletion is independent of e. MI and ML have a less restrictive assumption: values are missing at random (MAR), i.e., the probability that a value is missing depends only on values that are not missing. E.g., women (X1, which is complete) are more likely to withhold weight Y and age X2.

    26. MAR with deletion independent of e. Women (X1 = 0) are less likely to disclose weight Y and age X2, so the data are MAR and deletion is independent of e. All methods are approximately unbiased; LD is slightly less efficient.

    27. MAR with deletion dependent on e. The overweight (e > 0) are less likely to disclose age X2. LD is biased because deletion depends on e (the bias is evident in b2, s, and the SEs); MI and ML are approximately unbiased because the values are MAR.

    28. Summary. MI and ML are more efficient than LD (unless only Y is missing) and are unbiased under less restrictive assumptions: MI and ML require MAR, while LD requires deletion independent of e. But there's a fly in the ointment...

    29. Values missing not at random (NMAR). The probability that values are missing depends on the missing values themselves. E.g., the probability that weight Y is missing is higher for the overweight (it depends on Y) and higher for women (it depends on X1), and sometimes X1 is missing, too.

    30. NMAR. If values are NMAR, e.g., the overweight are less likely to disclose their weight, then all of today's methods are biased.

    31. Software. Both AMOS ML and SAS PROC MI assume the missing values are multivariate normal, but your data may be nonnormal: categorical, clustered, or nested. Consider ad hoc adjustments (Allison 2002), or use different software. MI: www.stat.psu.edu/~jls/misoftwa.html; www.multiple-imputation.com; review in Horton & Lipsitz (2001). ML for categorical data (links from Allison 2002): http://www.kub.nl/faculteiten/fsw/organisatie/departementen/mto/software2.html (Lem); www2.qimr.edu.au/davidD (LOGLIN).

    32. Concise reference works. Allison, P. (2002). Missing Data. Thousand Oaks, CA: Sage [greenback]. Horton, N.J., & Lipsitz, S.R. (2001). Multiple imputation in practice: Comparison of software packages for regression models with missing variables. The American Statistician, 55(3), 244-254. Little, R.J.A. (1992). Regression with missing X's: A review. Journal of the American Statistical Association, 87(420), 1227-1237.

    33. Further references. Mostly ML: Anderson, T.W. (1956). Maximum likelihood estimates for a multivariate normal distribution when some observations are missing. Little, R.J.A., & Rubin, D.B. (1st ed. 1990; 2nd ed. 2002). Statistical Analysis with Missing Data. New York: Wiley. Mostly MI: Schafer, J.L. (1997). Analysis of Incomplete Multivariate Data. London: Chapman & Hall. Rubin, D.B. (1987). Multiple Imputation for Survey Nonresponse. New York: Wiley.
