Evaluating Different Approaches for Multiple Imputation Under Linear Constraints. Jörg Drechsler (Institute for Employment Research, Germany) & Trivellore Raghunathan (University of Michigan). UNECE workshop on data editing and imputation, Vienna, 22 April 2008.
Overview • The Problem • The Data • A Little Background on Multiple Imputation • The Methodology • The Simulation Design • The Results • Conclusions/Future Work
The Problem • Some variables Y1, Y2, …, Yk have to sum up to a given total Yt • Examples • - turnover in different regions • - number of employees with different qualification levels • - investment in different subcategories
The Data • The IAB Establishment Panel • The number of employees, with • - Yt: total number of employees • - Ywork: number of blue collar + white collar workers • - Ytrain: number of trainees • - Yexec: number of executives • - Yown: number of owners + working family members • - Ymarg: number of “marginal” workers not covered by social security • - Yother: number of other employees
The Data • Summary Statistics • - data are heavily skewed • - most variables are semi-continuous • - low variation in the number of owners • - additional constraint: all variables >= 0
A Little Background on Multiple Imputation • Generate random draws from f(Ymis | Yobs), the distribution of the missing values given the observed values • Imputation in two steps • 1. Generate random draws for θ from its posterior distribution given the observed values • 2. Generate random draws for the missing values from the conditional predictive distribution given the drawn parameters • Drawing in step 1 can be difficult • Solution: MCMC techniques
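The two steps can be sketched for the simplest case, a normal linear regression with a noninformative prior. This is a minimal illustration, not the model used in the talk; the helper name `impute_normal_linear` and the choice of prior are assumptions.

```python
import numpy as np

def impute_normal_linear(X_obs, y_obs, X_mis, rng):
    """One imputation draw for a normal linear model (hypothetical helper)."""
    n, p = X_obs.shape
    # Least-squares fit on the units with observed y.
    XtX_inv = np.linalg.inv(X_obs.T @ X_obs)
    beta_hat = XtX_inv @ X_obs.T @ y_obs
    resid = y_obs - X_obs @ beta_hat
    # Step 1: draw theta = (sigma^2, beta) from its posterior
    # (assumed noninformative prior: scaled inverse chi-square, then normal).
    sigma2 = resid @ resid / rng.chisquare(n - p)
    beta = rng.multivariate_normal(beta_hat, sigma2 * XtX_inv)
    # Step 2: draw the missing values given the drawn parameters.
    noise = rng.normal(0.0, np.sqrt(sigma2), size=X_mis.shape[0])
    return X_mis @ beta + noise
```

Repeating the call m times gives m imputed data sets; drawing the parameters anew each time is what carries the parameter uncertainty into the imputations.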
Gibbs Sampling • Generate random draws from conditional univariate distributions • P(Y1 | Y−1, θ1), …, P(Yk | Y−k, θk), where Y−j denotes all variables except Yj • Iteration provides draws from the joint distribution • Imputation in two steps for every univariate distribution • Imputation model can vary for different variable types
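A minimal sketch of this cycling scheme, assuming every variable is continuous and imputed with a normal linear model on all other variables. The function names, the mean-based starting values, and the single model type are illustrative assumptions; in practice the model would differ by variable type.

```python
import numpy as np

def draw_linear(X_obs, y_obs, X_mis, rng):
    """Two-step draw for one univariate conditional (assumed normal linear model)."""
    n, p = X_obs.shape
    XtX_inv = np.linalg.inv(X_obs.T @ X_obs)
    beta_hat = XtX_inv @ X_obs.T @ y_obs
    resid = y_obs - X_obs @ beta_hat
    sigma2 = resid @ resid / rng.chisquare(n - p)               # draw variance
    beta = rng.multivariate_normal(beta_hat, sigma2 * XtX_inv)  # draw coefficients
    return X_mis @ beta + rng.normal(0.0, np.sqrt(sigma2), size=X_mis.shape[0])

def chained_imputation(Y, n_iter=20, seed=0):
    """Cycle through the columns of Y, re-imputing each one given all others."""
    rng = np.random.default_rng(seed)
    Y = np.asarray(Y, dtype=float)
    miss = np.isnan(Y)
    filled = Y.copy()
    # Starting values: observed column means.
    col_means = np.nanmean(Y, axis=0)
    filled[miss] = np.take(col_means, np.where(miss)[1])
    for _ in range(n_iter):
        for j in range(Y.shape[1]):
            mis_j = miss[:, j]
            if not mis_j.any():
                continue
            # Predictors: intercept plus the current values of all other columns.
            X = np.column_stack([np.ones(len(filled)), np.delete(filled, j, axis=1)])
            filled[mis_j, j] = draw_linear(X[~mis_j], filled[~mis_j, j], X[mis_j], rng)
    return filled
```

Running the function with m different seeds would give m imputed data sets; observed values are never overwritten.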
The Methodology • Five imputation methods • - simple imputation of all variables • - independent imputation considering semi-continuity • - nested imputation of the proportions • - non-Bayesian Dirichlet imputation • - Bayesian Dirichlet/Multinomial imputation
Simple Imputation • Impute all variables independently • Transform all continuous variables by taking the cube root • Ignore semi-continuity • Use simple linear models • Use the same models as for independent imputation under semi-continuity • Fulfill the constraints by: • setting [formula lost in extraction] if [condition lost in extraction] • down-weighting all imputed subcategories if Yt is observed or [condition lost in extraction]
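The exact correction formulas did not survive on the slide above, so the following is only one plausible reading, stated as an assumption: a missing total is set to the sum of the subcategories, and imputed subcategories are scaled down proportionally whenever they would push the sum above the total. The function name `reconcile_with_total` is hypothetical.

```python
import numpy as np

def reconcile_with_total(parts, imputed, total):
    """Post-imputation correction (assumed rule, not taken from the slides).

    parts   : (n, k) subcategory values after imputation
    imputed : (n, k) boolean, True where the value was imputed
    total   : (n,) totals, NaN where the total itself is missing
    """
    parts = np.clip(np.asarray(parts, dtype=float), 0.0, None)  # all variables >= 0
    imputed = np.asarray(imputed, dtype=bool)
    total = np.asarray(total, dtype=float)
    # Assumption: a missing total is replaced by the sum of its parts.
    total = np.where(np.isnan(total), parts.sum(axis=1), total)
    observed_sum = np.where(imputed, 0.0, parts).sum(axis=1)
    imputed_sum = np.where(imputed, parts, 0.0).sum(axis=1)
    # Room left for the imputed parts once the observed parts are accounted for.
    room = np.clip(total - observed_sum, 0.0, None)
    factor = np.ones_like(room)
    too_big = imputed_sum > room
    factor[too_big] = room[too_big] / imputed_sum[too_big]
    # Only imputed values are down-weighted; observed values stay untouched.
    return np.where(imputed, parts * factor[:, None], parts), total
```

For a unit with observed total 10, an observed part of 2 and two imputed parts of 6 each, the imputed parts are scaled by 8/12 to 4 each, so the parts again sum to 10.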
Independent Imputation • Impute all variables independently • Run a logit regression for all variables to address semi-continuity • Outcome: 1 if Yij > 0, 0 otherwise • Run a linear regression only for the units with Yij > 0 and impute only for missing units with a positive outcome in the logit regression • Set all other values to 0 • Depending on the number of units with Yij > 0, stratify by Western/Eastern Germany and two quantiles of establishment size • Use only 20 explanatory variables for the number of executives and other workers, ≈ 100 variables for all other dependent variables • Use the same correction methods afterwards
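A minimal sketch of such a two-part model for one semi-continuous variable. It is simplified on purpose: the regression parameters are plugged in at their estimates instead of being drawn from their posterior, there is no stratification and no transformation, and the helper names are hypothetical.

```python
import numpy as np

def fit_logit(X, z, n_iter=25):
    """Logit regression via Newton-Raphson (assumes the data are not separable)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        w = p * (1.0 - p)
        # Hessian with a tiny ridge term for numerical stability.
        H = X.T @ (X * w[:, None]) + 1e-8 * np.eye(X.shape[1])
        beta = beta + np.linalg.solve(H, X.T @ (z - p))
    return beta

def impute_semicontinuous(X_obs, y_obs, X_mis, rng):
    """Part 1 decides zero vs. positive, part 2 imputes the positive amounts."""
    # Part 1: logit model for the indicator 1{y > 0}.
    gamma = fit_logit(X_obs, (y_obs > 0).astype(float))
    p_pos = 1.0 / (1.0 + np.exp(-X_mis @ gamma))
    positive = rng.random(len(X_mis)) < p_pos
    # Part 2: linear model fitted only on units with y > 0.
    pos = y_obs > 0
    beta, *_ = np.linalg.lstsq(X_obs[pos], y_obs[pos], rcond=None)
    resid = y_obs[pos] - X_obs[pos] @ beta
    sigma = np.sqrt(resid @ resid / (pos.sum() - X_obs.shape[1]))
    y_mis = np.zeros(len(X_mis))  # all other values stay 0
    y_mis[positive] = X_mis[positive] @ beta + rng.normal(0.0, sigma, positive.sum())
    return np.clip(y_mis, 0.0, None)  # respect the constraint y >= 0
```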
Nested Imputation of Proportions • Address semi-continuity with a logit model • Calculate proportions of the total for all subcategories with a positive outcome • Use a logit transformation on the proportions • Transformed variables range over (−∞, ∞) • Impute variables with linear models • Use almost the same models as for independent imputation under semi-continuity • Nested imputation: after imputing the number of workers, define proportions as [formula lost in extraction] • After imputation, transform variables back and multiply by the totals • Use the same correction methods afterwards
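The transformation and its inverse can be sketched as follows. The nested definition is missing from the slide, so the version used here (each further subcategory is expressed as a share of what remains after the previously imputed ones) is an assumption, as are the function names.

```python
import numpy as np

def to_logit(y_part, y_base):
    """Logit of the proportion y_part / y_base (requires 0 < y_part < y_base)."""
    p = y_part / y_base
    return np.log(p / (1.0 - p))

def from_logit(v, y_base):
    """Back-transform a logit value to a count by multiplying with the base."""
    return y_base / (1.0 + np.exp(-v))

total = np.array([100.0, 50.0])
workers = np.array([60.0, 20.0])
trainees = np.array([10.0, 15.0])

# First level: workers as a share of the total.
v_work = to_logit(workers, total)
# Assumed nesting: trainees as a share of the remainder after workers.
v_train = to_logit(trainees, total - workers)
```

Because the back-transformed proportion always lies strictly between 0 and 1, an imputed subcategory can never exceed its base, which is the appeal of working on the logit scale.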
Non-Bayesian Dirichlet Distribution • Following an idea by Tempelman (2007) • Ignore semi-continuity • Calculate nested proportions again • Assume a Dirichlet distribution for the proportions • Generate starting values using the EM algorithm for the Dirichlet distribution
Non-Bayesian Dirichlet Distribution II • Imputation algorithm (data augmentation): • - draw new values for [symbol lost in extraction] from [distribution lost in extraction], obtained by maximum likelihood estimation • - draw new values for [symbol lost in extraction], with mi the number of observations to impute for unit i • - calculate [formula lost in extraction] • Not fully Bayesian since the distribution of [symbol lost in extraction] is only approximated • Use the same correction methods afterwards
Bayesian Dirichlet/Multinomial Imputation • Generate starting values using the simple imputation approach • For each unit, generate a random draw from the Dirichlet distribution with [parameters lost in extraction] • For each unit, generate a random draw from a multinomial distribution with [parameters lost in extraction] and • weighted vector p for missing observations, [formula lost in extraction] • Use the same correction methods afterwards
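One iteration for a single unit might look like the sketch below. The distribution parameters are missing from the slide, so the choices here are assumptions: the Dirichlet parameters are the current counts plus a flat prior, and the multinomial distributes the not-yet-allocated employees over the missing categories using p restricted to those categories and renormalised.

```python
import numpy as np

def dirichlet_multinomial_step(counts, missing, total, rng, prior=1.0):
    """One update for one unit (assumed parameterisation, see lead-in).

    counts  : current values of the k subcategories (observed or imputed)
    missing : boolean vector, True for categories that must be imputed
              (at least one entry must be True)
    total   : observed total the categories have to add up to
    """
    counts = np.asarray(counts, dtype=float)
    missing = np.asarray(missing, dtype=bool)
    # Draw category probabilities from a Dirichlet distribution.
    p = rng.dirichlet(counts + prior)
    # Weight p so that only the missing categories receive probability mass.
    p_mis = np.where(missing, p, 0.0)
    p_mis = p_mis / p_mis.sum()
    # Employees still to be allocated after the observed categories.
    n_mis = int(total - counts[~missing].sum())
    new = counts.copy()
    new[missing] = rng.multinomial(n_mis, p_mis)[missing]
    return new
```

With this construction the imputed counts are non-negative integers and the subcategories add up to the total by design, so the correction step has nothing left to fix when the total is observed.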
The Simulation Design • Use fully observed survey data (n = 11,536) • Generate a random sample with replacement of size n • Generate ≈30% missing values for each variable (MAR) • Impute the missing values with the different approaches (m = 10, iterations = 20) • Calculate different quantities of interest • Repeat the whole process of sampling and imputation 100 times
Generating Missing Values • X1: expected development of the number of employees in the next five years (6 categories) • X2: number of unskilled workers • X3: industry-wide wage agreement (1 = Yes) • An increase in any X leads to a decrease of pmis, the probability of being missing
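A sketch of the sampling and missingness steps, assuming a logistic model for pmis with non-positive slopes so that an increase in any X lowers the probability of being missing. The functional form, the intercept, and the slope values are illustrative assumptions; in practice the intercept would be tuned until roughly 30% of the values are missing.

```python
import numpy as np

def bootstrap_sample(data, rng):
    """Random sample with replacement of the same size as the data."""
    idx = rng.integers(0, len(data), size=len(data))
    return data[idx]

def mar_missing_mask(X, intercept, slopes, rng):
    """Missingness indicators whose probability decreases in every column of X."""
    # Forcing the slopes to be non-positive makes p_mis decreasing in each X.
    slopes = -np.abs(np.asarray(slopes, dtype=float))
    p_mis = 1.0 / (1.0 + np.exp(-(intercept + X @ slopes)))
    return rng.random(len(X)) < p_mis, p_mis
```

Since missingness depends only on the fully observed X variables, the mechanism is missing at random (MAR).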
Quality Measures • For all estimates of interest: • Compute the estimate from the original survey • Compute the average estimate across the 100 samples • Compute the average estimate across the 100 imputed samples • Compute the 95% coverage rate for the fully observed samples and the imputed samples • Compute [quantity lost in extraction] • Compute [quantity lost in extraction] • Compute the average confidence interval overlap for the fully observed sample and the imputed sample
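The coverage rate is the share of replications whose 95% confidence interval contains the reference value, here taken to be the estimate from the original survey. A minimal sketch with a hypothetical function name:

```python
import numpy as np

def coverage_rate(lower, upper, truth):
    """Share of intervals [lower, upper] that contain the reference value."""
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    return float(np.mean((lower <= truth) & (truth <= upper)))
```

In the simulation, `lower` and `upper` would each hold 100 entries, one per replication, computed once from the fully observed samples and once from the imputed samples.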
Confidence Interval Overlap • Suggested by Karr et al. (2006) • Measure the overlap of CIs from the original data and CIs from the imputed data • The higher the overlap, the higher the data utility • Compute the average relative CI overlap for any [symbol lost in extraction] • [Figure: CI for the imputed data and CI for the original data]
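A sketch of the overlap measure in the form commonly attributed to Karr et al. (2006): the length of the intersection of the two intervals, expressed relative to each interval's own length, averaged over the two. The argument names are illustrative, and the slide's own notation is not recoverable.

```python
import numpy as np

def ci_overlap(l_orig, u_orig, l_imp, u_imp):
    """Average relative overlap of two confidence intervals."""
    l_over = np.maximum(l_orig, l_imp)  # lower bound of the intersection
    u_over = np.minimum(u_orig, u_imp)  # upper bound of the intersection
    width = u_over - l_over
    return 0.5 * (width / (u_orig - l_orig) + width / (u_imp - l_imp))
```

Identical intervals give an overlap of 1. Written this way, disjoint intervals give a negative value; whether such values should be truncated at 0 is a design choice the slides do not settle.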
Estimates of Interest • Mean of Yi in each of the 16 German Länder • Logit regression to explain collective wage agreements by establishment size • Use the number of employees covered by social security in 6 categories (employees covered by social security = workers + trainees): • Y ~ emp<10 + emp<50 + emp<100 + emp<250 + emp<750 + emp>750 + industry.dummies • Compare the estimates for the establishment size from the different imputation methods
Conclusions • All methods provide good repeated sampling properties • Differences between the approaches are relatively small • The Dirichlet and proportions approaches tend to introduce more variability • The Dirichlet and proportions approaches don’t work very well for owners and others • The simple approach seems to work best, with high coverage and low additional variability
Future Work • Compare the same approaches for more evenly distributed subcategories