Calibrated imputation of numerical data under linear edit restrictions Jeroen Pannekoek Natalie Shlomo Ton de Waal
Missing data • Data may be missing from collected data sets • Unit non-response • Data from entire units are missing • Often dealt with by means of weighting • Item non-response • Some items from units are missing • Usually dealt with by means of imputation
Linear edit restrictions • Data often have to satisfy edit restrictions • For numerical data most edits are linear • Balance equations: a1x1 + a2x2 + … + anxn + b = 0 • Inequalities: a1x1 + a2x2 + … + anxn + b ≥ 0
Totals • Sometimes the totals of the variables to be imputed are also known • Imputed values should then add up to these known totals
Eliminating balance equations • We can “eliminate balance equations” • Example: set of edits • net + tax – gross = 0 • net ≥ tax • net ≥ 0 • Eliminating the balance equations • net = gross – tax • gross – tax ≥ tax • gross – tax ≥ 0
Eliminating balance equations • By eliminating all balance equations we only have to deal with inequality edits • If we sequentially impute variables, we only have to ensure that imputed values lie in an interval • Li ≤ xi ≤ Ui • We can now focus on satisfying totals
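To make the interval computation concrete, here is a minimal sketch in Python (our choice of language for illustration; the slides do not prescribe one). It derives the feasible interval [Li, Ui] for one variable from inequality edits of the form a1x1 + … + anxn + b ≥ 0, with all other values in the record held fixed:

```python
import numpy as np

def feasible_interval(A, b, x, t):
    """Interval [L, U] for variable t implied by the inequality
    edits A @ x + b >= 0, with the other components of x fixed."""
    L, U = -np.inf, np.inf
    for a_i, b_i in zip(A, b):
        if a_i[t] == 0:
            continue                            # edit does not involve x_t
        rest = a_i @ x - a_i[t] * x[t] + b_i    # edit value with x_t removed
        bound = -rest / a_i[t]
        if a_i[t] > 0:
            L = max(L, bound)                   # a_it > 0 gives a lower bound
        else:
            U = min(U, bound)                   # a_it < 0 gives an upper bound
    return L, U

# After eliminating net = gross - tax, the edits net >= tax and
# net >= 0 become gross - 2*tax >= 0 and gross - tax >= 0.
# With gross = 5000 observed and tax (index 1) missing:
A = np.array([[1.0, -2.0],   # gross - 2*tax >= 0
              [1.0, -1.0],   # gross -   tax >= 0
              [0.0,  1.0]])  # tax >= 0
b = np.zeros(3)
x = np.array([5000.0, 0.0])  # [gross, tax]; the tax entry is a placeholder
print(feasible_interval(A, b, x, t=1))  # (0.0, 2500.0)
```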
Imputation methods • Adjusted predicted mean imputation • Adjusted predicted mean imputation with random residuals • MCMC approach
Adjusted predicted mean imputation • We use sequential imputation • All missing values for a variable (the target variable) are imputed simultaneously • We impute target column xt • We use the model xt = β0 + βxp + e • We impute xt = β0 + βxp using the estimated parameters • Such imputed values generally satisfy neither the edits nor the totals
Satisfying totals • The totals of the missing data for the target variable (Xt,mis) and for the predictor (Xp,mis) are known • We construct the following model for the observed data • xt,obs = β0 + βxp,obs + e • Xt,mis = β1m + βXp,mis, where m is the number of missing values • We apply OLS to estimate the model parameters; the total equation then gives β1 = (Xt,mis – βXp,mis)/m • We impute xt,mis = β1 + βxp,mis • The sum of the imputed values then equals the known total by construction (see the sketch below)
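One way to read the two equations above (our interpretation, consistent with the sum constraint holding exactly): estimate the slope β by OLS on the observed pairs and solve the total equation for the intercept β1. A minimal sketch with made-up data:

```python
import numpy as np

def benchmarked_pm_impute(xp_obs, xt_obs, xp_mis, Xt_mis):
    """Predictive mean imputation calibrated so that the imputations
    sum to the known total Xt_mis of the missing target values."""
    beta = np.polyfit(xp_obs, xt_obs, 1)[0]       # OLS slope on observed pairs
    m = len(xp_mis)
    beta1 = (Xt_mis - beta * xp_mis.sum()) / m    # intercept from the total equation
    return beta1 + beta * xp_mis

rng = np.random.default_rng(1)
xp_obs = rng.normal(8.0, 1.0, 200)
xt_obs = 0.5 + 0.9 * xp_obs + rng.normal(0.0, 0.1, 200)
xp_mis = rng.normal(8.0, 1.0, 50)
Xt_mis = 380.0                                    # known total of the missing part
imputed = benchmarked_pm_impute(xp_obs, xt_obs, xp_mis, Xt_mis)
print(imputed.sum())                              # 380.0 up to rounding
```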
Satisfying totals and intervals (edits) • We impute xt,mis = β1 + βxp,mis + at • The at,i are chosen in such a way that • imputed values lie in their feasible intervals • Σi at,i = 0, so the total is preserved • Appropriate values for at,i can be found by means of an operations research technique (one possible formulation is sketched below) • For a simple alternative technique, see the paper
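The slides do not specify the operations research technique; one natural choice (our assumption, not necessarily the paper's method) is a linear program that minimises the total absolute adjustment subject to the interval and zero-sum constraints, e.g. with scipy.optimize.linprog:

```python
import numpy as np
from scipy.optimize import linprog

def zero_sum_adjustments(xhat, L, U):
    """Adjustments a with sum(a) = 0 and L <= xhat + a <= U,
    minimising sum(|a|).  Splitting a = p - n with p, n >= 0
    turns the problem into a linear program."""
    m = len(xhat)
    I = np.eye(m)
    c = np.ones(2 * m)                       # minimise sum(p) + sum(n)
    A_ub = np.block([[I, -I], [-I, I]])      # p - n <= U - xhat;  n - p <= xhat - L
    b_ub = np.concatenate([U - xhat, xhat - L])
    A_eq = np.concatenate([np.ones(m), -np.ones(m)])[None, :]  # sum(p - n) = 0
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[0.0],
                  bounds=(0, None), method="highs")
    p, n = res.x[:m], res.x[m:]
    return p - n

xhat = np.array([10.0, 20.0, 35.0])          # initial imputations, sum 65
L = np.array([0.0, 18.0, 25.0])
U = np.array([14.0, 22.0, 30.0])             # xhat[2] violates its upper bound
a = zero_sum_adjustments(xhat, L, U)
print(xhat + a, (xhat + a).sum())            # all inside [L, U], sum still 65
```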
Satisfying totals and intervals (edits) • Alternatively, draw m residuals by acceptance/rejection sampling from a normal distribution (zero mean, the residual variance of the regression model) such that they satisfy the interval constraints • Then adjust the random residuals to meet the sum constraint, as was done for the at,i (sketched below)
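A sketch of the acceptance/rejection step, with illustrative data; the residual standard deviation sigma is assumed to come from the fitted regression model:

```python
import numpy as np

def ar_residuals(xhat, L, U, sigma, rng, max_tries=10_000):
    """Acceptance/rejection sampling: for each unit draw
    e ~ N(0, sigma^2) until xhat + e lies in [L, U]."""
    e = np.empty_like(xhat)
    for i in range(len(xhat)):
        for _ in range(max_tries):
            draw = rng.normal(0.0, sigma)
            if L[i] <= xhat[i] + draw <= U[i]:
                e[i] = draw
                break
        else:
            e[i] = 0.0            # give up: keep the predicted mean
    return e

rng = np.random.default_rng(7)
xhat = np.array([10.0, 20.0, 28.0])   # initial imputations
L = np.array([0.0, 18.0, 25.0])
U = np.array([14.0, 22.0, 30.0])
e = ar_residuals(xhat, L, U, sigma=1.5, rng=rng)
print(xhat + e)                        # each value inside its interval
# The residuals are then adjusted to sum to zero (as for the a_t,i),
# e.g. with the linear-programming step sketched earlier.
```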
MCMC approach • Start with a pre-imputed consistent data set • Randomly select two records • Select a variable that was imputed in both records; note that the sum of its two values in these records is known, since it must be preserved to keep the totals satisfied
MCMC approach • We then apply the following two steps • Step 1: determine the intervals for the two values • Step 2: draw a value for one of the two missing values; the other value then follows immediately from the known sum • Repeat Steps 1 and 2 until “convergence” • In Step 2 we draw the value from the posterior predictive distribution implied by a linear regression model under an uninformative prior, conditional on it lying inside the corresponding interval (a sketch of one pair update follows)
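A sketch of one pair update, assuming the posterior predictive distribution reduces to a normal with given mean mu_i and standard deviation sigma (in practice these would come from the fitted regression model); the interval constraint is handled with scipy's truncated normal:

```python
import numpy as np
from scipy.stats import truncnorm

def mcmc_pair_step(xi, xj, Li, Ui, Lj, Uj, mu_i, sigma, rng):
    """One update for a pair of imputed values whose sum is fixed.
    Step 1: intersect x_i's interval with the interval implied by
    x_j's bounds and the fixed sum.  Step 2: draw x_i from a
    normal(mu_i, sigma^2) truncated to that interval; x_j follows."""
    s = xi + xj                          # known (preserved) pair total
    lo = max(Li, s - Uj)
    hi = min(Ui, s - Lj)
    a, b = (lo - mu_i) / sigma, (hi - mu_i) / sigma
    xi_new = truncnorm.rvs(a, b, loc=mu_i, scale=sigma, random_state=rng)
    return xi_new, s - xi_new            # the sum is unchanged by construction

rng = np.random.default_rng(11)
print(mcmc_pair_step(xi=40.0, xj=60.0, Li=0.0, Ui=80.0,
                     Lj=10.0, Uj=90.0, mu_i=45.0, sigma=5.0, rng=rng))
```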
Evaluation study: methods • Evaluated imputation methods: • UPMA: unbenchmarked simple predictive mean imputation, with adjustments so that the imputations satisfy the interval constraints • BPMA: benchmarked predictive mean imputation, with adjustments so that the imputations satisfy both the interval constraints and the totals • MCMC: the BPMA imputations (with adjustments) were used as the pre-imputed data set for the MCMC approach
Evaluation study: data set • 11,907 individuals aged 15 and over who responded to all questions in the 2005 Israel Income Survey and earned more than 1,000 Israeli Shekels in monthly gross income • Item non-response was introduced randomly into the income variables • 20% of the records were selected at random and their net income deleted • 20% of the records were selected at random and their tax deleted; 10% of these records overlapped with those with missing net income • The totals of each of the income variables are known
Evaluation study: data set • We focus on three variables from the Income Survey: • gross: gross income from earnings • net: net income from earnings • tax: tax paid • Edits: • net + tax = gross • net ≥ tax • gross ≥ 3 × tax • gross ≥ 0, net ≥ 0, tax ≥ 0 • A log transform was applied to the variables to achieve approximate normality
Evaluation criteria • dL1: average distance between imputed and true values • Z: number of imputed records on the boundary of the feasible region defined by the edits • K-S (Kolmogorov-Smirnov): compares the empirical distribution of the original values with that of the imputed values • Sign: sign test on the differences between original and imputed values • Kappa: kappa statistic for a two-dimensional contingency table; measures agreement beyond what would be expected by chance • (a sketch of some of these criteria follows)
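For illustration, a sketch of three of these criteria in Python with fabricated example data (Z needs the edit intervals and kappa a categorisation of the values, so both are omitted here):

```python
import numpy as np
from scipy.stats import ks_2samp, binomtest

def evaluate(true, imputed):
    """Compute dL1, the two-sample K-S statistic, and the
    sign-test p-value for true vs. imputed values."""
    dL1 = np.abs(imputed - true).mean()             # average absolute distance
    ks = ks_2samp(true, imputed).statistic          # distributional distance
    d = imputed - true
    nonzero = d[d != 0]
    sign_p = binomtest(int((nonzero > 0).sum()),    # sign test on the differences
                       len(nonzero), 0.5).pvalue
    return dL1, ks, sign_p

rng = np.random.default_rng(3)
true = rng.lognormal(8.0, 0.5, 500)                 # fake "original" incomes
imputed = true * rng.lognormal(0.0, 0.05, 500)      # fake imputations
print(evaluate(true, imputed))
```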
Conclusions • The MCMC approach performs worse than the other methods on all criteria except the number of records that lie on the boundary • However, MCMC allows multiple imputation, so that imputation uncertainty can be taken into account in variance estimation • BPMA appears to be slightly better than UPMA, except for the K-S statistic • The number of records that lie on the boundary for UPMA is a cause for concern • The MCMC approach does slightly better than the BPMA approach in this respect
Future research • Improving the MCMC approach • Carrying out multiple imputation with the MCMC approach to obtain proper variance estimation • In our study a log transformation was applied to the variables to achieve approximate normality • A correction factor was introduced into the constant term of the regression model to correct for this log transformation • A better approach to this problem will be investigated • Extending the problem to situations with unequal sampling weights