
Testing New Imputation Methods for Earnings Collected by the Survey of Income and Program Participation (SIPP)



Presentation Transcript


    1. Testing New Imputation Methods for Earnings Collected by the Survey of Income and Program Participation (SIPP). Presentation to the ASA/SRM SIPP Working Group, November 17, 2009. Gary Benedetto and Martha Stinson.

    2. Census Imputation Research Plan. Few changes have been made to actual production imputation methods in many years. With the redesign of the SIPP, this is an opportunity to consider what changes might be made. Goal of this paper: test a new method on an important income variable, job-level earnings.

    3. Proposed Improvements. A model-based approach; use of administrative data to mitigate problems caused when survey data are not “missing at random”; multiple imputation.

    4. Model-based Approach: Advantages. The current method (hot-deck) depends on a donor matrix with reasonable cell sizes. Problem: a large number of stratification variables produces cells with no donors. Current solution: cold-deck values are used. Result: imputations take account of fewer job or person characteristics.
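The empty-cell problem on this slide can be illustrated with a minimal hot-deck sketch. This is not the Census production procedure: the function name `hot_deck`, the field names, and the single fixed `cold_deck_value` are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def hot_deck(records, strat_vars, target, cold_deck_value, rng):
    """Fill missing `target` values from a random donor in the same cell.

    Falls back to `cold_deck_value` (an assumed fixed value) when the
    cell defined by `strat_vars` contains no donors.
    """
    # Build the donor matrix: one list of reported values per cell.
    donors = defaultdict(list)
    for rec in records:
        if rec[target] is not None:
            donors[tuple(rec[v] for v in strat_vars)].append(rec[target])

    completed = []
    for rec in records:
        rec = dict(rec)  # copy so the input records are left untouched
        if rec[target] is not None:
            rec["source"] = "reported"
        else:
            cell = donors.get(tuple(rec[v] for v in strat_vars))
            if cell:
                rec[target] = rng.choice(cell)
                rec["source"] = "hot-deck"
            else:
                rec[target] = cold_deck_value
                rec["source"] = "cold-deck"
        completed.append(rec)
    return completed


# Hypothetical records; the second missing case falls in a cell with no donor.
records = [
    {"age": "25-34", "industry": "retail", "earnings": 2100},
    {"age": "25-34", "industry": "retail", "earnings": None},
    {"age": "65+", "industry": "mining", "earnings": None},
]
result = hot_deck(records, ["age", "industry"], "earnings", 1500, random.Random(0))
```

Every stratification variable added to `strat_vars` splits the cells further, so more missing cases end up on the cold-deck branch, where the imputed value ignores the respondent's characteristics.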

    5. Model-based Approach: Implementation. We employed an imputation method that uses linear regressions to impute missing values. We stratified the sample by a set of characteristics and ran regressions for each sub-group that was large enough. Sub-groups that were too small were combined. Variables that were dropped from the stratification list were added as explanatory variables in the regression.
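One way the combining of small sub-groups could work is sketched below. The slide does not say how groups were merged, so the rule used here is an assumption: drop the last stratification variable first, and pool only the records that are still unassigned. The name `collapse_strata` and the `min_size` threshold are hypothetical.

```python
from collections import defaultdict


def collapse_strata(records, byvars, min_size):
    """Assign each record to a stratum, coarsening strata that are too small.

    Returns {record index: (stratum key, dropped variables)}. The dropped
    variables are the ones that would move into the regression as
    explanatory variables for that stratum.
    """
    assignment = {}
    pending = list(range(len(records)))
    # Try the full stratification first, then drop variables from the end.
    for depth in range(len(byvars), -1, -1):
        kept, dropped = byvars[:depth], byvars[depth:]
        cells = defaultdict(list)
        for i in pending:
            cells[tuple(records[i][v] for v in kept)].append(i)
        pending = []
        for key, members in cells.items():
            # At depth 0 everything left over is pooled into one stratum.
            if len(members) >= min_size or depth == 0:
                for i in members:
                    assignment[i] = (key, tuple(dropped))
            else:
                pending.extend(members)
    return assignment


# Hypothetical sample: one large cell and two singleton cells.
sample = [{"age": "young", "sex": "m"}] * 5 + [
    {"age": "young", "sex": "f"},
    {"age": "old", "sex": "f"},
]
strata = collapse_strata(sample, ["age", "sex"], min_size=2)
```

In this sample the two singleton cells cannot be rescued by dropping `sex` alone, so they are pooled with both `age` and `sex` listed as dropped variables, which is the point at which those variables would enter the regression as controls.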

    6. Data not “Missing at Random”: The Problem. All imputation methods that use survey data exclusively are built on the assumption that the relationships between survey variables are the same for everyone, regardless of missing data. Assume the relationship between X1, X2, X3 and Y can be estimated. Assume that if Y is missing, X1, X2, and X3 are good predictors. However, if the relationship between Y and X1, X2, X3 is different when Y is missing, the imputation will be flawed.
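A small simulation can make the problem concrete. Everything in it is invented for illustration (the coefficients, the nonresponse rule, the seed): Y depends on an observed X and on an unobserved factor `u`, and nonresponse depends on `u`. A regression fitted on respondents then predicts poorly for nonrespondents.

```python
import random

rng = random.Random(2009)
n = 2000
x = [rng.gauss(0, 1) for _ in range(n)]
u = [rng.gauss(0, 1) for _ in range(n)]       # unobserved in the survey
y = [1 + 2 * xi + ui for xi, ui in zip(x, u)]
missing = [ui > 0.5 for ui in u]              # nonresponse depends on u, not on x

# Ordinary least squares of y on x, using respondents only.
obs = [(xi, yi) for xi, yi, m in zip(x, y, missing) if not m]
x_bar = sum(xi for xi, _ in obs) / len(obs)
y_bar = sum(yi for _, yi in obs) / len(obs)
slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in obs) / sum(
    (xi - x_bar) ** 2 for xi, _ in obs
)
intercept = y_bar - slope * x_bar

# Compare regression predictions with the (normally unknown) true values.
miss_x = [xi for xi, m in zip(x, missing) if m]
miss_y = [yi for yi, m in zip(y, missing) if m]
imputed_mean = sum(intercept + slope * xi for xi in miss_x) / len(miss_x)
true_mean = sum(miss_y) / len(miss_y)
print(round(true_mean - imputed_mean, 2))
```

The slope is recovered well, but the imputed mean for nonrespondents falls well below their true mean, because the survey variables carry no information about `u`. An outside source correlated with `u` is what would close that gap.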

    7. Data not “Missing at Random”: Lessening the Impact. Information from an outside source can help account for differences between people that are unobservable in the survey. We used administrative earnings data in the imputation model for this purpose.

    8. Multiple Imputation: Motivation. Since the 1970s, Donald Rubin has argued that imputation adds variability to user-calculated statistics. Traditional methods impute only once, so the user has no way to account for this variability. Multiple imputation allows the user to calculate a variance that includes a component due to imputation.

    9. Multiple Imputation: Our Approach. We impute earnings 4 times by estimating the posterior predictive distribution and taking draws from it. This creates 4 separate data sets, or implicates. For people with non-imputed values, earnings are identical across implicates. For people with imputed values, earnings vary across implicates. We use these 4 data sets in the analysis that follows and combine results from the 4 implicates using simple formulae published by Rubin.
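The combining formulae are not reproduced on the slide. The sketch below assumes they are Rubin's standard rules for a scalar estimate: average the point estimates, then add the average within-implicate variance to the between-implicate variance inflated by (1 + 1/m). The function name and the example numbers are illustrative only.

```python
from statistics import mean, variance


def combine_implicates(estimates, variances):
    """Combine per-implicate estimates and their variances.

    Returns (combined estimate, total variance).
    """
    m = len(estimates)
    q_bar = mean(estimates)           # combined point estimate
    within = mean(variances)          # average within-implicate variance
    between = variance(estimates)     # sample variance across implicates (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return q_bar, total


# Made-up estimates of a mean from m = 4 implicates, each with variance 1.0.
q_bar, total_var = combine_implicates([10.0, 12.0, 11.0, 13.0], [1.0, 1.0, 1.0, 1.0])
```

The between-implicate term is the piece of variance due to imputation. It is zero only when every implicate yields the same estimate, as it would for a statistic computed entirely from non-imputed values.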

    10. Project Specifics: SIPP data. SIPP collects information on 2 jobs per wave. Earnings in the public-use data are given for each job in each month. Imputation flags indicate when a hot-deck imputation was performed. We create a person-job level data set with person characteristics, job characteristics, and reported earnings but no original imputations. We merge on data from the Detailed Earnings Record (DER) extract from the Social Security Administration’s Master Earnings File. DER data are earnings reported on W-2 forms: gross, uncapped, and employer-specific.

    11. Project Specifics: Sample. We impute earnings for people who matched to the administrative data and were 15+ years old at the time of the job. We impute earnings for jobs that were not unpaid family jobs and were not originally type Z imputations. We impute earnings for months when the respondent was actually interviewed (i.e., we don’t do missing-wave imputation) and the job was ongoing. Summary: we impute missing earnings reports during the time period of a reported job.
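The sample rules above amount to a conjunction of person, job, and month conditions. The predicate below is a sketch of that logic only. Every field name is hypothetical and does not correspond to an actual SIPP or DER variable.

```python
def eligible_for_imputation(person, job, month):
    """Return True when a person-job-month falls in the imputation sample.

    All dictionary keys are hypothetical stand-ins for the slide's criteria.
    """
    person_ok = person["matched_to_admin"] and person["age_at_job"] >= 15
    job_ok = not job["unpaid_family_job"] and not job["type_z_imputed"]
    month_ok = month["interviewed"] and month["job_ongoing"]
    return bool(person_ok and job_ok and month_ok and month["earnings_missing"])


# Hypothetical example: an eligible case and one failing the age rule.
person = {"matched_to_admin": True, "age_at_job": 32}
job = {"unpaid_family_job": False, "type_z_imputed": False}
month = {"interviewed": True, "job_ongoing": True, "earnings_missing": True}
is_eligible = eligible_for_imputation(person, job, month)
too_young = eligible_for_imputation({**person, "age_at_job": 14}, job, month)
```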

    12. Project Specifics: Process. Step 1: Use a Bayesian bootstrap to impute whether a missing month had positive or zero earnings. Find a donor based on the stratification variables, but take account of sample uncertainty. We chose this because this is a relatively rare event and the sample size wasn’t big enough to do model-based imputation.
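The slide does not say which variant of the Bayesian bootstrap was used. The sketch below assumes the classic form within a single stratification cell: draw Dirichlet(1, ..., 1) weights over the donors, then sample donors with those weights. Redrawing the weights for each implicate is how sample uncertainty enters. The function name and donor values are illustrative.

```python
import random


def bayesian_bootstrap_impute(donor_values, n_missing, rng):
    """Impute `n_missing` values from `donor_values` within one cell."""
    # Dirichlet(1, ..., 1) weights: normalized independent Exponential(1) draws.
    gaps = [rng.expovariate(1.0) for _ in donor_values]
    total = sum(gaps)
    weights = [g / total for g in gaps]
    # Sample donors with replacement using the drawn weights.
    return rng.choices(donor_values, weights=weights, k=n_missing)


# Hypothetical cell: 1 = positive earnings in the month, 0 = zero earnings.
donors = [1, 1, 1, 0, 1, 1, 1, 1]
rng = random.Random(12)
implicates = [bayesian_bootstrap_impute(donors, 3, rng) for _ in range(4)]
```

A plain hot-deck would treat the donors' observed share of positive months as known. Here that share is itself redrawn for every implicate, so the imputations reflect that it was estimated from a small cell.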

    13. Project Specifics: Process (cont). If the respondent was imputed to have positive earnings, use a linear regression model to impute earnings. Imputed monthly earnings is a random variable. Its distribution has two sources of variation: variation in the error term of the regression model, and variation in the estimated parameters, the β’s and σ². Take draws from the distributions of the β’s and σ² and of the error term. Use the draws to calculate a predicted value based on the observed X variables. The predicted value is the new impute. Take four separate draws to create four implicates.
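The drawing steps can be sketched for a regression with a single predictor. The exact distributions are not given on the slide, so this assumes the standard normal linear model with a noninformative prior: σ² is drawn as the residual sum of squares divided by a chi-square variate, and the β’s are drawn from a normal centered on the least-squares estimates. Centering x makes the intercept and slope draws independent, which keeps the code free of matrix algebra. All names and data are illustrative.

```python
import math
import random


def draw_implicates(x_obs, y_obs, x_mis, n_implicates, rng):
    """Draw imputed y values for `x_mis`, one list per implicate."""
    n = len(x_obs)
    x_bar = sum(x_obs) / n
    y_bar = sum(y_obs) / n
    sxx = sum((x - x_bar) ** 2 for x in x_obs)
    b1_hat = sum((x - x_bar) * (y - y_bar) for x, y in zip(x_obs, y_obs)) / sxx
    b0_hat = y_bar  # intercept of the centered model y = b0 + b1 * (x - x_bar) + e
    sse = sum((y - b0_hat - b1_hat * (x - x_bar)) ** 2 for x, y in zip(x_obs, y_obs))
    df = n - 2

    implicates = []
    for _ in range(n_implicates):
        # sigma^2 draw: SSE divided by a chi-square(df) variate
        # (a Gamma with shape df / 2 and scale 2 is chi-square with df degrees of freedom).
        sigma = math.sqrt(sse / rng.gammavariate(df / 2, 2))
        # beta draws, conditional on the sigma just drawn.
        b0 = rng.gauss(b0_hat, sigma / math.sqrt(n))
        b1 = rng.gauss(b1_hat, sigma / math.sqrt(sxx))
        # Predicted value plus a fresh error-term draw for each missing case.
        implicates.append([b0 + b1 * (x - x_bar) + rng.gauss(0, sigma) for x in x_mis])
    return implicates


# Made-up data: six observed cases, two cases with missing y.
x_obs = [1, 2, 3, 4, 5, 6]
y_obs = [5.1, 6.8, 9.2, 11.1, 12.7, 15.3]
implicates = draw_implicates(x_obs, y_obs, [7, 8], 4, random.Random(17))
```

Using only the fitted line would reproduce the first source of variation at most. Redrawing σ² and the β’s for each implicate adds the second, so the spread across the four implicates reflects parameter uncertainty as well as residual noise.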

    14. Project Specifics: Modeling. We use the following variables to stratify the sample (byvars): age categories, number of jobs in SIPP by number of jobs in DER, positive earnings from DER, month in SIPP sample, and positive earnings in SIPP in the prior and post months. We use the following variables as control variables (xvars) in the linear regressions: age, male, race, education, leads and lags of positive earnings indicators from DER and SIPP, and leads and lags of earnings from DER and SIPP.

    15. Results: Distributions. Job-level earnings for January 2004.

    16. Results: Distributions. Job-level earnings for January 2004, by imputation group.

    17. Results: Distributions. Person-level earnings for 2004.

    18. Results: Sub-Sample Means.

    19. Results: Earnings Volatility.

    20. Results: Correlations.

    21. Conclusion. The research phase of new imputation methods takes time. Next steps: try imputing for those without administrative data; iterate several times, which may smooth out volatility; try with the new EHC instrument.
