
Multiple Imputation for Household Surveys: A Comparison of Methods

This article compares different methods for multiple imputation (MI) in Chilean household surveys, specifically focusing on the application of MI to the EFH survey. The article discusses the assumptions of MI and explores conditional and multivariate methods for imputation. It also highlights the challenges and limitations of imputing missing values in household surveys. The article is based on the Stata Users Group Meeting presentation by Rodrigo Alfaro and Marcelo Fuenzalida.



Presentation Transcript


  1. Multiple Imputation for Household Surveys: A Comparison of Methods. Stata Users Group Meeting. Rodrigo Alfaro and Marcelo Fuenzalida

  2. Outline: Multiple Imputation (MI); Methods for MI; Chilean Household Surveys; Application of MI to EFH; Comments and Conclusions; Appendix.

  3. Multiple Imputation: In empirical applications, researchers must often work with incomplete data sets. One “solution” to this problem is the Multiple Imputation (MI) procedure. MI relies on the Missing At Random (MAR) assumption: “…the probability of missing data on Y is unrelated to the value of Y, after controlling for other variables in the analysis.” (Allison, 2002). In our empirical application we assume MAR; the validity of this assumption is beyond the scope of this analysis.

  4. Multiple Imputation: MI is based on the assumption that we have good proxies for the distributions of the missing observations in the sample. Given this, we can “fill in the blanks” by taking random draws from those distributions. We create m versions of the completed data set to reflect the randomness of the procedure, so we exchange our incomplete data set for a set of complete ones. We do not “solve” the missing-data problem; we just measure it. Analyses of the multiple data sets can be combined using Rubin’s rules (Rubin, 1987).
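A minimal sketch of how Rubin's rules combine results from the m completed data sets, written in Python for illustration only (the slides contain no code). The function name `rubin_combine` and the toy numbers are assumptions. The total variance is T = W + (1 + 1/m) B, where W is the average within-imputation variance and B is the between-imputation variance of the point estimates.

```python
import math
from statistics import mean, variance


def rubin_combine(estimates, variances):
    """Combine m point estimates and their squared standard errors."""
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations with matching variances")
    q_bar = mean(estimates)             # pooled point estimate
    within = mean(variances)            # W: average within-imputation variance
    between = variance(estimates)       # B: sample variance across imputations
    total = within + (1 + 1 / m) * between
    return q_bar, total, math.sqrt(total)


# Hypothetical results from m = 3 completed data sets.
pooled, total_var, pooled_se = rubin_combine([10.0, 12.0, 14.0], [1.0, 1.0, 1.0])
```

The pooled standard error is larger than any single-data-set standard error whenever the estimates differ across imputations, which is how the procedure "measures" the missing-data problem.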

  5. Methods for MI: We divide the methods into 2 categories. Conditional: Hot-Deck and Univariate. Multivariate: Normal and Chained Equations. Hot-Deck (hotdeck.ado): replace each missing observation with an observed one drawn at random within a specific group, for example males with college education. Informal conversations suggest that its std. devs. are still small. Univariate (uvis.ado): regress the variable with missing observations on exogenous variables that have no missing values, draw the estimators (beta and sigma) from their posterior distribution, and “predict” the missing values.
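The hot-deck idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the hotdeck.ado implementation: the function name `hot_deck`, the record layout, and the toy data are hypothetical. Each missing value is replaced by a randomly chosen observed value from the same group, and repeating the draw with different seeds gives the m versions of the data.

```python
import random


def hot_deck(records, target, group_keys, rng):
    """Return copies of `records` with missing `target` values (None)
    replaced by a random observed value from the same group."""
    donors = {}
    for rec in records:
        if rec[target] is not None:
            key = tuple(rec[k] for k in group_keys)
            donors.setdefault(key, []).append(rec[target])
    completed = []
    for rec in records:
        new_rec = dict(rec)
        if new_rec[target] is None:
            pool = donors.get(tuple(new_rec[k] for k in group_keys))
            if pool:  # the value stays missing when the group has no donor
                new_rec[target] = rng.choice(pool)
        completed.append(new_rec)
    return completed


# Hypothetical toy data: the group is defined by gender and college education.
people = [
    {"gender": "male", "college": True, "income": 900.0},
    {"gender": "male", "college": True, "income": 1100.0},
    {"gender": "male", "college": True, "income": None},
    {"gender": "female", "college": False, "income": 400.0},
    {"gender": "female", "college": True, "income": None},
]
# m = 5 completed versions, one per seed.
versions = [
    hot_deck(people, "income", ("gender", "college"), random.Random(seed))
    for seed in range(5)
]
```

Because imputed values are only ever copies of observed ones within the group, the spread of the completed data cannot exceed that of the donors.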

  6. Methods for MI: Chained Equations (ice.ado): based on the Univariate method, but allows missing values in the exogenous variables. Missing values of the exogenous variables are replaced using the reverse equation, and the procedure loops over the previous steps. Normal: assume a Multivariate Normal distribution. Estimate the parameters using the EM algorithm (or another initial value), and draw imputations using the Data Augmentation (DA) procedure. The theory relies on the convergence of EM. Not implemented in Stata; available in Schafer’s stand-alone package.
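A rough Python sketch of the chained-equations loop for two variables that both have gaps. It is not the ice.ado algorithm; the function names, the single-regressor model, the mean starting values, and the flat prior behind the parameter draws are all assumptions made for illustration. The helper is the univariate step (regress, draw beta and sigma, predict), and the loop alternates between the equation and its reverse.

```python
import math
import random


def impute_univariate(x_obs, y_obs, x_mis, rng):
    """Regress y on x, draw (alpha, beta, sigma), and predict y for x_mis."""
    n = len(y_obs)
    if n < 3:
        raise ValueError("need at least three observed cases")
    x_bar = sum(x_obs) / n
    y_bar = sum(y_obs) / n
    sxx = sum((x - x_bar) ** 2 for x in x_obs)
    beta_hat = sum((x - x_bar) * (y - y_bar) for x, y in zip(x_obs, y_obs)) / sxx
    sse = sum((y - y_bar - beta_hat * (x - x_bar)) ** 2 for x, y in zip(x_obs, y_obs))
    # sigma^2 is drawn as SSE / chi2(n - 2); chi2(k) is Gamma(shape=k/2, scale=2).
    sigma = math.sqrt(sse / rng.gammavariate((n - 2) / 2, 2))
    # With x centered, intercept and slope are independent given sigma.
    alpha = rng.gauss(y_bar, sigma / math.sqrt(n))
    beta = rng.gauss(beta_hat, sigma / math.sqrt(sxx))
    return [alpha + beta * (x - x_bar) + rng.gauss(0, sigma) for x in x_mis]


def chained_imputation(a, b, cycles, rng):
    """Impute None entries of a and b by cycling a-on-b and b-on-a regressions."""
    obs_a = [i for i, v in enumerate(a) if v is not None]
    obs_b = [i for i, v in enumerate(b) if v is not None]
    mis_a = [i for i, v in enumerate(a) if v is None]
    mis_b = [i for i, v in enumerate(b) if v is None]
    mean_a = sum(a[i] for i in obs_a) / len(obs_a)
    mean_b = sum(b[i] for i in obs_b) / len(obs_b)
    a_fill = [mean_a if v is None else v for v in a]  # starting values
    b_fill = [mean_b if v is None else v for v in b]
    for _ in range(cycles):
        draws = impute_univariate([b_fill[i] for i in obs_a], [a[i] for i in obs_a],
                                  [b_fill[i] for i in mis_a], rng)
        for i, v in zip(mis_a, draws):
            a_fill[i] = v
        # Reverse equation: b on the current (partly imputed) a.
        draws = impute_univariate([a_fill[i] for i in obs_b], [b[i] for i in obs_b],
                                  [a_fill[i] for i in mis_b], rng)
        for i, v in zip(mis_b, draws):
            b_fill[i] = v
    return a_fill, b_fill


# Hypothetical toy data with one gap in each variable.
a_raw = [1.0, 2.0, None, 4.0, 5.0, 6.0]
b_raw = [2.1, None, 6.2, 8.1, 9.8, 12.2]
a_done, b_done = chained_imputation(a_raw, b_raw, cycles=10, rng=random.Random(0))
```

Running the function with different seeds would give the m completed data sets; observed values are never altered, only the gaps are redrawn on each cycle.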

  7. Methods for MI Schafer’s stand-alone package: Data.

  8. Methods for MI Schafer’s stand-alone package: EM.

  9. Methods for MI Schafer’s stand-alone package: DA.

  10. Methods for MI Schafer’s stand-alone package: DA.

  11. Chilean Household Surveys: We have household surveys with only a few waves: CASEN and EPS. CASEN was created to measure poverty. EPS was created to evaluate the pension system. At the Central Bank we have been using these surveys to analyze the financial fragility of households. However, CASEN and EPS were not created for this purpose, so we need new sample designs. In 2007, we started a new survey designed for our purposes: EFH.

  12. Chilean Household Surveys: Our surveys have different levels of information. Personal information on each member of the household, for example: age, years of education, labor income, etc. Aggregate information on the household, for example: value of assets (cars, house, financial instruments, etc.) and debts (mortgage, consumer loans, educational loans, etc.). Our variables of interest could be irrelevant for some households in the sample. Many households have loans with retail companies instead of borrowing directly from banks. Few households have personal savings invested in financial instruments such as stocks, bonds, etc.

  13. Application of MI to EFH: Using conditional methods, we can attach constraints to the imputation procedure. We are able to impute labor income for each member of the household, considering only individual-level variables. At the household level, we can impute “bank loans” in the sub-sample of households that declared having that kind of debt, using as exogenous variables the age, years of education, and gender of the interviewee. We impute “debt with retail companies” in a different sub-sample but with the same exogenous variables. However, we cannot impute “debt with retail companies” using “bank loans”, because the sub-samples may differ.

  14. Application of MI to EFH: Multivariate methods imply groups of households. Suppose a household does not own a house: we could pretend that the value of its house is “zero”. However, that would affect the correlation between the value of the house and the total amount of debt. Our first round includes 3 groups defined by credit access. The first group includes households without debts with financial institutions and without any kind of assets. The second group includes households with real assets: cars and primary house. The third group adds households with debts with financial institutions.

  15. Missing Information: Results for EFH • Low missing rate of information. • However, combining variables reduces sample size. • We concentrate our analysis on the second group. • We use a logit transformation to avoid unbounded results. Source: EFH 2007.
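One way to read the logit transformation is sketched below in Python. The slide does not state the bounds or the exact form used, so the `lower` and `upper` arguments and both function names are assumptions. The variable is mapped from a bounded interval to the whole real line before imputation, and imputed values are mapped back afterwards so that they cannot fall outside the interval.

```python
import math


def to_logit(y, lower, upper):
    """Map y in the open interval (lower, upper) to the real line."""
    if not lower < y < upper:
        raise ValueError("y must lie strictly between the bounds")
    p = (y - lower) / (upper - lower)
    return math.log(p / (1 - p))


def from_logit(z, lower, upper):
    """Map any real z back into the interval between lower and upper."""
    if z >= 0:
        p = 1 / (1 + math.exp(-z))
    else:
        e = math.exp(z)  # avoids overflow for very negative z
        p = e / (1 + e)
    return lower + (upper - lower) * p


# Hypothetical example: a value of 25 on a variable bounded by 0 and 100.
z = to_logit(25.0, 0.0, 100.0)
back = from_logit(z, 0.0, 100.0)
```

Imputation models would then be fitted to `z`, and every draw, however extreme, returns to the original scale inside the chosen bounds.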

  16. Group 2: Conditional Source: EFH 2007.

  17. Group 2: Multivariate. Households with debt with retail companies. UVIS and HD can have lower variances than the raw data. We note that ICE and NORM are consistent in their sd’s. Source: EFH.

  18. Comments and Conclusions: In our empirical application, Hot-Deck as well as Univariate imputation have smaller variances than the multivariate methods. Under multivariate imputation we obtain a reasonable standard deviation that reflects the uncertainty of the completed data sets. Among the multivariate methods we have ICE and NORM; both have advantages and disadvantages. ICE is implemented in Stata with many features available to accommodate several models. In that respect it is more general than NORM.

  19. Comments and Conclusions: In theoretical terms NORM relies on the EM algorithm, and it uses DA as its imputation method. For this we observed that we need convergence and a “reasonable” positive definite matrix. In practical terms, we observed that a high rate of missing data is associated with non-convergence of the EM algorithm and/or problems with DA. In the case of ICE, we observed that the algorithm does not have a “convergence problem”: we were able to impute data with a high rate of incomplete information.

  20. Comments and Conclusions: We think that NORM provides useful information about the stability of the model. For that reason, its implementation in Stata would be a good complement to ICE. Don’t you think? We found 2 versions of the NORM code: miss.sas by Paul Allison and norm.R by Alvaro Novo. We translated the miss.sas routine into a Stata ado program. Allison used the SAS package IML (Interactive Matrix Language); we observed that IML is similar to Mata.

  21. Comments and Conclusions: However, the original code was not “optimized” as Schafer suggested in his book. A month ago we moved to the R code. We found that the original Fortran code was included in the R routine, so norm.R allows using the Fortran code directly in R. To speed things up for this meeting, we translated 800 lines of Fortran into Mata in a week. However, our translation from R to Mata is not good enough… yet.

  22. Problem: Besides the technical issues of MI, we have an unsolved topic for which your opinion is crucial: Aggregation. We want to work at the household level, so individual information must be aggregated somehow. To deal with missing observations at the individual level we could apply “improper imputations”: (1) replacing with zeros, or (2) replacing with a predicted value. Alternatively, we could code the household as “missing” if any member has a missing observation; because this loses information, we discarded it. Is there any feasible mixture? Is it possible to add variables in order to account for incomplete information at the individual level?
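The three aggregation options can be compared side by side with a small Python sketch. The function name `aggregate_household`, the rule labels, and the toy incomes are hypothetical, and summation is assumed as the aggregation; the slides do not prescribe an implementation.

```python
def aggregate_household(values, rule, predictions=None):
    """Sum member-level values (None = missing) into one household total.

    rule = "zero":        treat missing members as zero
    rule = "predicted":   replace missing members with a predicted value
    rule = "any_missing": the household total is missing if any member is
    """
    if rule == "any_missing":
        if any(v is None for v in values):
            return None
        return sum(values)
    if rule == "zero":
        return sum(0.0 if v is None else v for v in values)
    if rule == "predicted":
        if predictions is None or len(predictions) != len(values):
            raise ValueError("one prediction per household member is required")
        return sum(p if v is None else v for v, p in zip(values, predictions))
    raise ValueError("unknown rule: " + repr(rule))


# Hypothetical household: the second member did not report labor income.
incomes = [500.0, None, 200.0]
predicted = [0.0, 300.0, 0.0]  # only the entry for the missing member is used
total_zero = aggregate_household(incomes, "zero")
total_pred = aggregate_household(incomes, "predicted", predicted)
total_strict = aggregate_household(incomes, "any_missing")
```

The first two rules always return a number but understate uncertainty, while the third keeps the household total missing so that it can be imputed at the household level, at the cost of discarding the members who did report.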

  23. Multiple Imputation for Household Surveys: A Comparison of Methods. Stata Users Group Meeting. Rodrigo Alfaro and Marcelo Fuenzalida

  24. All Households: Conditional Source: EFH 2007.

  25. All Households: Conditional Source: EFH 2007.
