Research on Improvements to Current SIPP Imputation Methods. ASA-SRM SIPP Working Group, September 16, 2008. Martha Stinson.
Census Imputation Research Plan
• Few changes made to actual production imputation methods in many years
• With redesign of the SIPP, this is an opportunity to consider what changes might be made
• New committee formed with members from content, data processing, sampling, and statistical methodology divisions
• Incremental approach: test new methods and consider short list of variables that might be substantially improved
Proposed Improvements
• Model-based approach
• Use administrative data to mitigate problems caused when survey data are not “missing at random”
• Multiple imputation
Model-based Approach
• Hot-deck depends on a donor matrix with reasonable cell sizes
• Small cells must sometimes be collapsed
• Collapsing cells creates a more heterogeneous group of donors
• Hot-deck can’t take account of variables that are dropped in order to combine cells
Model-based Approach: Research
• Consider an imputation method that uses a linear regression to impute missing values
• Stratify sample by set of characteristics, run regressions for each sub-group that is large enough
• Sub-groups that are too small are combined
• Variables that are dropped from stratification list are added as explanatory variables in the regression
Example
• Earnings imputation
• Stratify by age, gender, race, education, industry, and disability
• Including disability may cause some small cells
• Perhaps combine sub-groups of disabled and not-disabled white women in their fifties
• For this sub-group, include disability status as explanatory variable in regression of earnings on SIPP characteristics
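The collapse-and-regress idea in this example can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Census production method: the function names (`fit_predict`, `impute_earnings`), the minimum donor count of 30, the synthetic data, and the choice to impute the regression's predicted value without a residual draw are all hypothetical.

```python
import numpy as np

def fit_predict(X_obs, y_obs, X_mis):
    """Ordinary least squares with an intercept; returns predictions for X_mis."""
    A = np.column_stack([np.ones(len(X_obs)), X_obs])
    beta, *_ = np.linalg.lstsq(A, y_obs, rcond=None)
    return np.column_stack([np.ones(len(X_mis)), X_mis]) @ beta

def impute_earnings(cell, disabled, x, y, min_size=30):
    """cell: stratum id built from every stratifier except disability.
    disabled: 0/1 indicator; x: (n, p) covariates; y: earnings, NaN if missing.
    min_size is a hypothetical threshold for a 'large enough' sub-group."""
    y_imp = y.copy()
    missing = np.isnan(y)
    for c in np.unique(cell):
        in_cell = cell == c
        donors = [np.sum(in_cell & ~missing & (disabled == d)) for d in (0, 1)]
        if min(donors) >= min_size:
            # Both disability sub-groups are large enough: one regression each.
            for d in (0, 1):
                sub = in_cell & (disabled == d)
                obs, mis = sub & ~missing, sub & missing
                if mis.any():
                    y_imp[mis] = fit_predict(x[obs], y[obs], x[mis])
        else:
            # Sub-groups too small: combine them and move disability from the
            # stratification list into the regression as an explanatory variable.
            obs, mis = in_cell & ~missing, in_cell & missing
            if mis.any() and obs.any():
                X_o = np.column_stack([x[obs], disabled[obs]])
                X_m = np.column_stack([x[mis], disabled[mis]])
                y_imp[mis] = fit_predict(X_o, y[obs], X_m)
    return y_imp

# Synthetic illustration: cell 0 has many disabled respondents, cell 1 has few.
rng = np.random.default_rng(0)
n = 600
cell = rng.integers(0, 2, n)
disabled = (rng.random(n) < np.where(cell == 0, 0.5, 0.05)).astype(int)
x = rng.normal(size=(n, 1))
y_true = 10 + 2 * x[:, 0] - 3 * disabled + rng.normal(size=n)
missing = rng.random(n) < 0.2
y = np.where(missing, np.nan, y_true)
y_imp = impute_earnings(cell, disabled, x, y)
```

In the synthetic data, cell 1 has too few disabled donors, so its two sub-groups are pooled and disability enters as a regressor, while cell 0 gets a separate regression for each disability status.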
Data Not “Missing At Random”
• All imputation methods that use survey data exclusively are built on the assumption that the relationships between survey variables are the same for everyone, regardless of missing data
• Assume relationship between X1, X2, X3 and Y can be estimated
• Assume if Y is missing, X1, X2, and X3 are good predictors
• However if the relationship between Y and X1, X2, X3 is different when Y is missing, the imputation will be flawed
Data Not “Missing At Random”: Research
• We can evaluate the magnitude of this problem and mitigate the impact on imputation using administrative data
• Information from an outside source can help account for unobservable (in the survey) differences between people
Example: 2004 SIPP panel
• 2004 annual earnings at two main jobs
• Earnings at each job are imputed on a monthly basis
• Sum across jobs and then across months to get annual earnings
• Create count of the number of imputed months in the year (ranges from 0 to 12)
• If either job has imputed earnings, count the full month as imputed
Example: 2004 SIPP panel (cont.)
• Split SIPP respondents into groups
  1. No months of imputed or missing data
  2. 1-4 months of imputed data (no missing)
  3. 5-8 months of imputed data (no missing)
  4. 9-12 months of imputed data (no missing)
• Match earnings report from W-2 records summed for all employers
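The counting and grouping rules from these two slides can be written out as follows. This is a sketch on toy data: the array layout (one row per person, one column per month, one array per job) and the function names are assumptions, and respondents with missing months, whom the slides exclude, are not handled.

```python
import numpy as np

def annual_earnings(job1_earn, job2_earn):
    """Sum across the two jobs, then across the 12 months."""
    return (np.asarray(job1_earn) + np.asarray(job2_earn)).sum(axis=1)

def imputed_month_count(job1_flag, job2_flag):
    """A month counts as imputed if either job's earnings were imputed."""
    return (np.asarray(job1_flag) | np.asarray(job2_flag)).sum(axis=1)

def imputation_group(n_imputed):
    """Group 1: 0 months; 2: 1-4 months; 3: 5-8 months; 4: 9-12 months."""
    return np.digitize(n_imputed, [1, 5, 9]) + 1

# Toy data for four hypothetical respondents.
job1_flag = np.zeros((4, 12), dtype=bool)
job2_flag = np.zeros((4, 12), dtype=bool)
job1_flag[1, [0, 1]] = True   # job 1 imputed in months 1-2
job2_flag[1, [1, 2]] = True   # job 2 imputed in months 2-3 (overlap counted once)
job2_flag[2, :] = True        # job 2 imputed all year
job1_flag[3, :5] = True       # job 1 imputed in the first five months

job1_earn = np.full((4, 12), 1000.0)
job2_earn = np.full((4, 12), 250.0)

counts = imputed_month_count(job1_flag, job2_flag)
groups = imputation_group(counts)
annual = annual_earnings(job1_earn, job2_earn)
```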
Example: 2004 SIPP panel (cont.)
• If earnings are missing at random, relationship between admin. earnings and other SIPP variables should be the same for all four groups
• Test
  • regress admin. earnings on SIPP demographic variables separately for each group
  • predict earnings for each group using each set of coefficients (four predicted values per group)
  • compare each prediction to actual admin. earnings
  • if coefficients are good predictors, difference should be zero on average
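One way to read the test above is as a matrix of average prediction gaps, with one row per group being predicted and one column per group whose coefficients are used. The sketch below follows that reading on synthetic data. The function names, the two stand-in covariates, and the built-in shift for group 4 are illustrative assumptions, not results from the SIPP or W-2 data.

```python
import numpy as np

def ols(X, y):
    """Least-squares coefficients, intercept first."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta

def cross_prediction_gaps(group, X, admin):
    """gaps[i, j] = mean(predicted - actual admin earnings) for group i,
    where predictions use the coefficients estimated on group j."""
    labels = np.unique(group)
    betas = [ols(X[group == g], admin[group == g]) for g in labels]
    gaps = np.empty((len(labels), len(labels)))
    for i, target in enumerate(labels):
        rows = group == target
        A = np.column_stack([np.ones(rows.sum()), X[rows]])
        for j, beta in enumerate(betas):
            gaps[i, j] = np.mean(A @ beta - admin[rows])
    return labels, gaps

# Synthetic data: groups 1-3 share one earnings relationship; group 4
# (most imputed months) is given a lower intercept, i.e. not missing at random.
rng = np.random.default_rng(1)
n = 2000
group = rng.integers(1, 5, n)
X = rng.normal(size=(n, 2))
admin = 30 + 5 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=5, size=n)
admin[group == 4] -= 8

labels, gaps = cross_prediction_gaps(group, X, admin)
```

The diagonal is zero by construction, because an OLS fit with an intercept has mean-zero residuals on its own estimation sample. The off-diagonal entries carry the information: under missing at random they should also be near zero, while here the row for group 4 shows a large positive gap when other groups' coefficients are used.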
Multiple Imputation
• Since the 1970s, Donald Rubin has argued that imputation adds variability to user-calculated statistics
• Traditional methods impute only once
• User has no way to account for variability
• Multiple imputation allows the user to calculate variance that includes a piece due to imputation
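For reference, the "piece due to imputation" is usually expressed through Rubin's combining rules. The notation below is not on the slides and is introduced here only for illustration: m is the number of implicates, \hat{Q}_k is the estimate computed from implicate k, and U_k is its usual complete-data variance estimate.

```latex
\bar{Q} = \frac{1}{m}\sum_{k=1}^{m} \hat{Q}_k, \qquad
W = \frac{1}{m}\sum_{k=1}^{m} U_k, \qquad
B = \frac{1}{m-1}\sum_{k=1}^{m} \bigl(\hat{Q}_k - \bar{Q}\bigr)^2, \qquad
T = W + \Bigl(1 + \frac{1}{m}\Bigr) B
```

W is the within-implicate variance a user would compute from any single completed file, and B is the between-implicate variance, which single imputation cannot estimate because only one implicate exists.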
Multiple Imputation: Example
• How might variance estimates change when switching from single to multiple imputation?
• Consider random variable X with mean of .5
• Generate 1000 random samples by taking draws for 80 people
• 20 people have missing data for X
Multiple Imputation: Example (cont.)
• Impute missing data using 2 methods:
  • single implicate/hot deck – every observed value has equal prob. of being donor
  • multiple imputation/Bayesian Bootstrap – prob. of being donor changes across implicates but is centered around 1/n; create 32 implicates
• Calculate mean and 95% confidence interval for all 1000 random samples
Multiple Imputation: Example (cont.)
• Case of 1 implicate
  • 95% confidence interval contains the true value 88% of the time
• Case of multiple implicates
  • Calculate variance of mean using Rubin formula
  • 95% confidence interval contains the true value 96.5% of the time
• What does this mean?
  • Statistical hypotheses will be rejected too often using single imputation methods because variance estimates are too small
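A simulation in the spirit of this example can be sketched as below. Several details are assumptions because the slides do not state them: X is drawn from Uniform(0, 1), the 80 people are read as 60 observed plus 20 missing, data are missing completely at random, and intervals use the normal quantile 1.96 rather than a t reference distribution. The sketch is not expected to reproduce the 88% and 96.5% figures exactly.

```python
import numpy as np

rng = np.random.default_rng(2008)
N_SAMPLES, N_PEOPLE, N_MISS, M = 1000, 80, 20, 32
N_OBS = N_PEOPLE - N_MISS          # assumption: 20 of the 80 people are missing
TRUE_MEAN, Z = 0.5, 1.96

def covers(estimate, variance):
    """True if the nominal 95% interval contains the true mean."""
    half_width = Z * np.sqrt(variance)
    return estimate - half_width <= TRUE_MEAN <= estimate + half_width

hits_single = 0
hits_multi = 0
for _ in range(N_SAMPLES):
    obs = rng.uniform(0.0, 1.0, N_OBS)   # assumption: X ~ Uniform(0, 1)

    # Single implicate / hot deck: every observed value is an equally likely
    # donor, and the completed file is analysed as if all values were real.
    completed = np.concatenate([obs, rng.choice(obs, N_MISS)])
    hits_single += covers(completed.mean(), completed.var(ddof=1) / N_PEOPLE)

    # Multiple imputation / Bayesian bootstrap: donor probabilities are drawn
    # from a flat Dirichlet, so they vary by implicate around 1 / N_OBS.
    means = np.empty(M)
    variances = np.empty(M)
    for k in range(M):
        w = rng.dirichlet(np.ones(N_OBS))
        completed = np.concatenate([obs, rng.choice(obs, N_MISS, p=w)])
        means[k] = completed.mean()
        variances[k] = completed.var(ddof=1) / N_PEOPLE

    # Rubin's formula: within variance plus inflated between variance.
    total_var = variances.mean() + (1 + 1 / M) * means.var(ddof=1)
    hits_multi += covers(means.mean(), total_var)

coverage_single = hits_single / N_SAMPLES
coverage_multi = hits_multi / N_SAMPLES
print(f"single imputation coverage:   {coverage_single:.3f}")
print(f"multiple imputation coverage: {coverage_multi:.3f}")
```

Under these assumptions the single-imputation interval should fall short of its nominal 95% coverage, since it treats the 20 imputed values as observed, while the multiple-imputation interval should land close to nominal.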
Examples of Census Research on Imputation Methods
• Generalized Additive Model (GAM)
• Predictive Mean Matching
• Bayesian Bootstrap
• Sequential Regression Multiple Imputation (SRMI)
Questions for Panel Discussion
• General thoughts and suggestions on model-based imputation?
• Suggest specific models?
• Which variables should we prioritize?
• Would SIPP user community be willing/able to handle multiple implicates?