1. Testing New Imputation Methods for Earnings Collected by the Survey of Income and Program Participation (SIPP)
Presentation to the ASA/SRM SIPP Working Group
November 17, 2009
Gary Benedetto and Martha Stinson
2. Census Imputation Research Plan
Few changes have been made to the production imputation methods in many years
The redesign of the SIPP is an opportunity to consider what changes might be made
Goal of this paper: test a new method on an important income variable, job-level earnings
3. Proposed Improvements
Model-based approach
Use administrative data to mitigate problems caused when survey data are not “missing at random”
Multiple imputation
4. Model-based Approach: Advantages
Current method (hot-deck) depends on a donor matrix with reasonable cell sizes
Problem: a large number of stratification variables produces cells with no donors
Solution: cold-deck values are used
Result: imputations take account of fewer job or person characteristics
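As a rough illustration of the donor-cell problem described above, the sketch below draws a hot-deck donor from the matching cell and falls back to a cold-deck value when the cell is empty. The function name `hot_deck_impute`, the record fields, and the cold-deck value are all hypothetical; this is not the production SIPP procedure.

```python
import random

def hot_deck_impute(recipient, donors, strat_vars, cold_deck_value, rng):
    """Return (imputed earnings, source) for one record with missing earnings."""
    key = tuple(recipient[v] for v in strat_vars)
    # Donors must match the recipient on every stratification variable.
    cell = [d["earnings"] for d in donors
            if tuple(d[v] for v in strat_vars) == key]
    if cell:
        return rng.choice(cell), "hot-deck"
    # Empty cell: fall back to a preset value that ignores the recipient's traits.
    return cold_deck_value, "cold-deck"

rng = random.Random(0)
# Hypothetical donor records (illustrative values only).
donors = [
    {"age_cat": "25-34", "educ": "college", "earnings": 3200.0},
    {"age_cat": "25-34", "educ": "college", "earnings": 2900.0},
    {"age_cat": "35-44", "educ": "hs", "earnings": 2400.0},
]
matched = hot_deck_impute({"age_cat": "25-34", "educ": "college"},
                          donors, ("age_cat", "educ"), 2000.0, rng)
unmatched = hot_deck_impute({"age_cat": "35-44", "educ": "college"},
                            donors, ("age_cat", "educ"), 2000.0, rng)
```

The second recipient has no donor in its cell, so it receives the cold-deck value regardless of its own characteristics. Each added stratification variable makes such empty cells more likely.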
5. Model-based Approach: Implementation
We used linear regressions to impute missing values
We stratified the sample by a set of characteristics and ran a regression for each sub-group that was large enough
Sub-groups that were too small were combined
Variables dropped from the stratification list were added as explanatory variables in the regression
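One way the stratify-then-combine logic above could be organized is sketched below. The slides do not say how small sub-groups were combined, so the rule used here (drop the last stratifier and pool the undersized cells) is an assumption, as are the name `plan_regressions`, the `min_size` threshold, and the record fields.

```python
from collections import defaultdict

def plan_regressions(records, byvars, min_size, dropped=()):
    """Return regression groups as (byvars used, cell key, extra xvars, records)."""
    cells = defaultdict(list)
    for rec in records:
        cells[tuple(rec[v] for v in byvars)].append(rec)
    plans, leftovers = [], []
    for key, recs in cells.items():
        if len(recs) >= min_size or not byvars:
            plans.append((tuple(byvars), key, tuple(dropped), recs))
        else:
            leftovers.extend(recs)
    if leftovers:
        # Assumed pooling rule: drop the last stratifier and carry it along
        # as an extra explanatory variable for the pooled regression.
        plans += plan_regressions(leftovers, tuple(byvars[:-1]), min_size,
                                  (byvars[-1],) + tuple(dropped))
    return plans

# Hypothetical records: one large cell and two cells that are too small.
records = (
    [{"age_cat": "25-34", "der_pos": 1} for _ in range(3)]
    + [{"age_cat": "25-34", "der_pos": 0}]
    + [{"age_cat": "35-44", "der_pos": 1}]
)
plans = plan_regressions(records, ("age_cat", "der_pos"), min_size=2)
```

Here the large cell keeps both stratifiers, while the two undersized cells end up pooled into one group whose regression would include `age_cat` and `der_pos` as explanatory variables.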
6. Data not “Missing at Random”: The Problem
All imputation methods that use survey data exclusively are built on the assumption that the relationships between survey variables are the same for everyone, regardless of whether data are missing
Assume relationship between X1, X2, X3 and Y can be estimated
Assume if Y is missing, X1, X2, and X3 are good predictors
However, if the relationship between Y and X1, X2, X3 is different when Y is missing, the imputation will be flawed
7. Data not “Missing at Random”: Lessening the Impact
Information from an outside source can help account for differences between people that are unobservable in the survey
We used administrative earnings data in the imputation model for this purpose
8. Multiple Imputation: Motivation
Since the 1970s, Donald Rubin has argued that imputation adds variability to user-calculated statistics
Traditional methods impute only once
User has no way to account for variability
Multiple imputation allows the user to calculate variance that includes a piece due to imputation
9. Multiple Imputation: Our Approach
We impute earnings 4 times by estimating the posterior predictive distribution and taking draws from it
This creates 4 separate data sets, or implicates
For people with non-imputed values, earnings are identical across implicates
For people with imputed values, earnings vary across implicates
We use these 4 data sets in the analysis that follows
We combine results from the 4 implicates using simple formulae published by Rubin
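For reference, the standard form of Rubin's combining rules can be sketched as follows: the combined estimate is the average across implicates, and the total variance is the average within-implicate variance plus (1 + 1/m) times the between-implicate variance. The function name `combine_rubin` and the input numbers are hypothetical and are not results from this project.

```python
from statistics import mean, variance

def combine_rubin(estimates, variances):
    """Combine per-implicate estimates and their sampling variances."""
    m = len(estimates)
    q_bar = mean(estimates)         # combined point estimate
    within = mean(variances)        # average within-implicate variance
    between = variance(estimates)   # between-implicate variance (divides by m - 1)
    total = within + (1 + 1 / m) * between
    return q_bar, total

# Hypothetical mean-earnings estimates from 4 implicates and their variances.
est, total_var = combine_rubin([2500.0, 2520.0, 2480.0, 2500.0],
                               [100.0, 110.0, 90.0, 100.0])
```

The between-implicate term is the part of the variance due to imputation. It is zero when no values are imputed, because the estimate is then identical across implicates.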
10. Project Specifics: SIPP data
SIPP collects information on 2 jobs per wave
Earnings in the public-use data are given for each job in each month
Imputation flags indicate when a hot-deck imputation was performed
We create a person-job level data set with person characteristics, job characteristics, and reported earnings but no original imputations
We merge in data from the Detailed Earnings Record (DER) extract from the Social Security Administration’s Master Earnings File
DER data are earnings reported on W-2 forms: gross, uncapped, and employer specific
11. Project Specifics: Sample
We impute earnings for people who:
matched to the administrative data
were 15+ years old at the time of the job
We impute earnings for jobs that:
were not unpaid family jobs
were not originally type Z imputations
We impute earnings for months when:
the respondent was actually interviewed (i.e., we do not impute missing waves)
the job was on-going
Summary: We impute missing earnings reports during the time period of a reported job
12. Project Specifics: Process
Step 1: Use Bayesian Bootstrap to impute whether a missing month had positive or zero earnings
Find a donor based on stratification variables but take account of sample uncertainty
We chose this method because this is a relatively rare event and the sample size was not big enough for model-based imputation
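A minimal sketch of a Bayesian bootstrap draw within one donor cell is shown below. The slides do not give implementation details, so this follows the textbook form: donor weights are drawn from a flat Dirichlet distribution, then donors are sampled with those weights instead of with equal probability, which is how sample uncertainty enters. The function name and the donor flags are hypothetical.

```python
import random

def bayesian_bootstrap_impute(donor_values, n_missing, rng):
    """Impute n_missing values from one donor cell with the Bayesian bootstrap."""
    # Dirichlet(1, ..., 1) weights over donors via normalized exponential draws.
    raw = [rng.expovariate(1.0) for _ in donor_values]
    total = sum(raw)
    weights = [w / total for w in raw]
    # Sample donors with the random weights rather than uniformly.
    return rng.choices(donor_values, weights=weights, k=n_missing)

rng = random.Random(2009)
# Hypothetical positive-earnings indicators (1 = positive, 0 = zero) in one cell.
donors = [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]
imputed_flags = bayesian_bootstrap_impute(donors, n_missing=5, rng=rng)
```

Repeating the call once per implicate would redraw the weights each time, so the imputed indicators can differ across implicates.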
13. Project Specifics: Process (cont)
If the respondent was imputed to have positive earnings, use a linear regression model to impute earnings
Imputed monthly earnings is a random variable
Distribution has two sources of variation:
variation in error term in regression model
variation in estimated parameters: β’s and σ²
Take draws from distributions of β’s and σ² and error term
Use draws to calculate predicted value based on observed X variables
Predicted value is new impute
Take four separate draws to create four implicates
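The draw steps above can be sketched for a regression with a single predictor. This assumes the textbook noninformative-prior result (σ² drawn as the residual sum of squares divided by a chi-square draw, then the β's drawn from a normal centered on the OLS estimates); the slides do not state the prior or the exact model, and all names and data values here are hypothetical.

```python
import math
import random

def fit_ols(x, y):
    """OLS of y on [1, x]; returns (b0, b1), SSE, (X'X)^-1, and n."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(v * v for v in x)
    sxy = sum(a * b for a, b in zip(x, y))
    det = n * sxx - sx * sx
    b1 = (n * sxy - sx * sy) / det
    b0 = (sy - b1 * sx) / n
    sse = sum((b - b0 - b1 * a) ** 2 for a, b in zip(x, y))
    xtx_inv = ((sxx / det, -sx / det), (-sx / det, n / det))
    return (b0, b1), sse, xtx_inv, n

def draw_imputes(x_missing, fit, rng):
    """One implicate: draw sigma^2, then the betas, then an error per record."""
    (b0, b1), sse, inv, n = fit
    # sigma^2 draw: SSE divided by a chi-square draw with n - 2 degrees of freedom.
    s2 = sse / rng.gammavariate((n - 2) / 2.0, 2.0)
    # Beta draw from N(beta_hat, s2 * (X'X)^-1) using a 2x2 Cholesky factor.
    v11, v12, v22 = s2 * inv[0][0], s2 * inv[0][1], s2 * inv[1][1]
    l11 = math.sqrt(v11)
    l21 = v12 / l11
    l22 = math.sqrt(max(v22 - l21 * l21, 0.0))
    z0, z1 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    d0 = b0 + l11 * z0
    d1 = b1 + l21 * z0 + l22 * z1
    # Predicted value from the drawn parameters plus a fresh error-term draw.
    return [d0 + d1 * xm + rng.gauss(0.0, math.sqrt(s2)) for xm in x_missing]

rng = random.Random(17)
x_obs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]   # hypothetical predictor values
y_obs = [1.1, 1.9, 3.2, 3.9, 5.1, 5.8]   # hypothetical reported earnings
fit = fit_ols(x_obs, y_obs)
# Four independent draws give four implicates for the two missing records.
implicates = [draw_imputes([2.5, 4.5], fit, rng) for _ in range(4)]
```

Because the parameters are redrawn for each implicate, the imputed values differ across implicates by more than the error term alone would imply.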
14. Project Specifics: Modeling
We use the following variables to stratify the sample (byvars):
Age categories, number of jobs in SIPP by number of jobs in DER, positive earnings from DER, month in SIPP sample, positive earnings in SIPP in prior and post month
We use the following variables as control variables (xvars) in the linear regressions:
age, male, race, education, leads and lags of positive earnings indicators from DER and SIPP, leads and lags of earnings from DER and SIPP
15. Results: Distributions
Job-level earnings for January 2004
16. Results: Distributions
Job-level earnings for January 2004, by imputation group
17. Results: Distributions
Person-level earnings for 2004
18. Results: Sub-Sample Means
19. Results: Earnings Volatility
20. Results: Correlations
21. Conclusion
Research phase of new imputation methods takes time
Next steps:
try imputing for those without administrative data
iterate several times – may smooth out volatility
try with new EHC instrument