Prediction and Imputation in ISEE - Tools for more efficient use of combined data sources
Li-Chun Zhang, Statistics Norway
Svein Nordbotton, University of Bergen
Use of a statistical register
• Combining administrative and survey data
  • Model-based prediction or weighting
  • Construction of statistical registers
• Uses of a statistical register
  • Prediction of (sub-)population totals
  • Multiple uses & general database quality => inferential concerns associated with imputation
• How to balance between the two types of inferential concerns?
A triple-goal criterion for statistical registers
• Efficient estimation of population totals of interest
• Correct co-variances among survey variables, as well as between survey and auxiliary variables
• Non-stochastic & constant tabulation
A simultaneous prediction method
• Nearest neighbor imputation (NNI) as the only feasible approach in terms of preserving co-variances among all the variables. To improve efficiency, introduce restrictions on the imputed totals; these may be obtained separately from the imputation, say, through regression prediction. This is referred to as NNI with restrictions (NNI-WR).
• A simultaneous prediction method
  • Values are generated outside of the sample
  • Efficient for prediction of population totals
  • Not the optimal (or best) prediction for each specific unit, but for the ensemble of units, now that attention is given to the co-variances among the variables.
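To make the idea of a restriction concrete, the following minimal sketch computes a regression-predicted total for the units outside the sample and compares it with the total that plain NNI would impute. It is an illustration under simple assumptions (one covariate, a straight-line model, toy data); the helper names `fit_line` and `nearest_donor_values` are hypothetical and do not come from the slides.

```python
from statistics import mean

def fit_line(x, y):
    """Ordinary least squares for y = a + b*x with a single covariate."""
    xbar, ybar = mean(x), mean(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b = sxy / sxx
    return ybar - b * xbar, b

def nearest_donor_values(x_obs, y_obs, x_mis):
    """Plain NNI: each unit receives the y-value of the nearest observed unit."""
    return [
        y_obs[min(range(len(x_obs)), key=lambda i: abs(x_obs[i] - x))]
        for x in x_mis
    ]

# Toy data (assumed for illustration only).
x_obs = [1.0, 2.0, 3.0, 4.0]
y_obs = [2.0, 4.0, 6.0, 8.0]
x_mis = [2.4, 3.4, 4.5]          # covariates of units outside the sample

a, b = fit_line(x_obs, y_obs)
restriction = sum(a + b * x for x in x_mis)                # regression-predicted total
nni_total = sum(nearest_donor_values(x_obs, y_obs, x_mis)) # total imputed by plain NNI
gap = restriction - nni_total                              # what a restriction would ask NNI to close
```

In this toy case plain NNI imputes a lower total than the regression predictor, which is the kind of discrepancy the restrictions are meant to control.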
About NNI-WR
• Separation of prediction of totals from general imputation concerns, allowing full freedom in the search for efficient methods
• Solves the variance estimation problem at the same time
• Genuine multivariate imputation with realistic imputed values
• Non-parametric nature and mild regularity conditions suggest robustness, compared to standard regression-based approaches
• NNI can be made non-stochastic, yielding constant tabulations on repetition
An algorithm and current research
• An algorithm
  • Jump-start phase: to speed up the imputation procedure if desirable
  • Fine-tune phase: relaxation to k-nearest neighbor imputation for better agreement with restrictions; consistency remains
  • Adjustment between the two phases
• Current research
  • How well does the algorithm perform in real statistical production?
  • What is an effective way of setting up the restrictions, i.e. maximum control with a minimum number of explicit restrictions for imputation?
  • Evaluation of micro-data quality
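The slides do not spell out the algorithm, so the following is only one possible reading of the fine-tune idea: start from plain nearest-neighbor donors, then let each unit switch to another of its k nearest donors when that brings the imputed total closer to a given restriction. The greedy single pass, the function names and the toy data are all assumptions made for illustration, not the authors' procedure.

```python
def knn_donors(x_obs, x, k):
    """Indices of the k observed units nearest to x (nearest first, ties by index)."""
    return sorted(range(len(x_obs)), key=lambda i: (abs(x_obs[i] - x), i))[:k]

def impute_with_restriction(x_obs, y_obs, x_mis, target_total, k=2):
    # Start: plain nearest-neighbor donors (hypothetical starting point).
    choice = [knn_donors(x_obs, x, 1)[0] for x in x_mis]
    # Relaxation: one greedy pass over the units; each may switch to any of
    # its k nearest donors if that moves the imputed total toward the target.
    for j, x in enumerate(x_mis):
        rest = sum(y_obs[i] for i in choice) - y_obs[choice[j]]
        # min() keeps the nearer donor when two candidates are equally good.
        choice[j] = min(
            knn_donors(x_obs, x, k),
            key=lambda i: abs(rest + y_obs[i] - target_total),
        )
    return [y_obs[i] for i in choice]

# Toy data (assumed for illustration only).
x_obs = [1.0, 2.0, 3.0, 4.0]
y_obs = [2.0, 4.0, 6.0, 8.0]
x_mis = [2.4, 3.4, 4.5]
target_total = 20.6   # e.g. a total obtained separately by regression prediction

plain = impute_with_restriction(x_obs, y_obs, x_mis, target_total, k=1)
tuned = impute_with_restriction(x_obs, y_obs, x_mis, target_total, k=2)
```

With k=1 the function reduces to plain NNI. With k=2 the first unit switches to its second-nearest donor, and the imputed total moves closer to the restriction while every imputed value remains an actually observed value.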
Background information: Some standard methods of prediction and imputation
Basic prediction approach
• Under the general linear model:
  • Target parameter T = linear combination of y-values in the population
  • Estimation of T amounts to prediction of the part of T outside of the selected sample
  • Prediction of individuals: a special case
• Main problems for a statistical register
  • Lack of natural variation in data, especially if many units have the same x-values
  • Infeasible to apply simultaneously to a large number of variables; impractical as a production mode; leads to inconsistency of cross-tabulations
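As a small illustration of the prediction approach, the sketch below estimates a population total as the observed sample sum plus the predicted values for the units outside the sample. It assumes a single covariate and a straight-line model; `fit_line` and `predict_total` are hypothetical names and the data are made up.

```python
from statistics import mean

def fit_line(x, y):
    """Ordinary least squares for y = a + b*x with a single covariate."""
    xbar, ybar = mean(x), mean(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b = sxy / sxx
    return ybar - b * xbar, b

def predict_total(x_sample, y_sample, x_nonsample):
    """Observed sum in the sample plus predicted values outside the sample."""
    a, b = fit_line(x_sample, y_sample)
    predictions = [a + b * xi for xi in x_nonsample]
    return sum(y_sample) + sum(predictions), predictions

# Toy data (assumed for illustration only).
x_s = [1.0, 2.0, 3.0, 4.0]
y_s = [2.0, 4.0, 6.0, 8.0]
x_r = [5.0, 5.0, 6.0]            # two non-sample units share the same x-value

total, predictions = predict_total(x_s, y_s, x_r)
```

The two units with the same x-value receive identical predicted values, which is the lack of natural variation noted above.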
Random regression imputation (RRI)
• To emulate the natural variation in data: add a random residual to the best predicted y-value
• Hot-deck as a special case
• Main problems:
  • Extra variance of the imputed estimator due to random imputation => never fully efficient
  • Random imputation is not the only means of creating natural variation in data
  • Different tabulations on repetition => lack of acceptability and face value in official statistics
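A minimal sketch of RRI, assuming a straight-line model with one covariate and residuals drawn at random from the observed residuals (one common choice; the slide does not say how residuals are generated). The names `fit_line` and `rri` are hypothetical.

```python
import random
from statistics import mean

def fit_line(x, y):
    """Ordinary least squares for y = a + b*x with a single covariate."""
    xbar, ybar = mean(x), mean(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b = sxy / sxx
    return ybar - b * xbar, b

def rri(x_obs, y_obs, x_mis, rng):
    """Impute predicted value + a residual drawn from the observed residuals."""
    a, b = fit_line(x_obs, y_obs)
    residuals = [yi - (a + b * xi) for xi, yi in zip(x_obs, y_obs)]
    return [a + b * xi + rng.choice(residuals) for xi in x_mis]

# Toy data (assumed for illustration only).
x_obs = [1.0, 2.0, 3.0, 4.0, 5.0]
y_obs = [1.2, 1.9, 3.3, 3.8, 5.1]
x_mis = [2.0, 2.0, 4.0]

run_a = rri(x_obs, y_obs, x_mis, random.Random(1))  # one repetition
run_b = rri(x_obs, y_obs, x_mis, random.Random(2))  # another repetition
```

Two repetitions with different random streams will generally give different imputed values and hence different tabulations. If the model contains only an intercept, the predicted value is the observed mean, so adding a randomly drawn observed residual returns a randomly chosen observed y-value, which is the hot-deck special case.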
Multiple imputation (MI)
• Independent random imputations + formulae for combining results
• Bayesian or frequentist approach
• Main problems:
  • Removes all the extra imputation variance only with an infinite number of repetitions. Otherwise, still not fully efficient & non-constant tabulations
  • A common misunderstanding: only MI can yield acceptable measures of accuracy.
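The slide does not name the combining formulae. Assuming the standard ones (Rubin's rules), a sketch looks as follows: the point estimate is the average over the m imputed data sets, and the total variance is the average within-imputation variance plus (1 + 1/m) times the between-imputation variance. The function name and the numbers are illustrative only.

```python
from statistics import mean, variance

def rubin_combine(estimates, within_variances):
    """Combine m completed-data estimates using Rubin's rules (assumed here)."""
    m = len(estimates)
    q_bar = mean(estimates)            # combined point estimate
    w = mean(within_variances)         # average within-imputation variance
    b = variance(estimates)            # between-imputation variance (divisor m - 1)
    total_var = w + (1 + 1 / m) * b    # the b / m part is due to finite m
    return q_bar, total_var

# Toy results from m = 3 imputed data sets (assumed for illustration only).
estimates = [10.0, 12.0, 11.0]
within_variances = [1.0, 1.0, 1.0]
q_bar, total_var = rubin_combine(estimates, within_variances)
```

The term b / m vanishes only as m grows without bound, which matches the point that a finite number of repetitions leaves some extra imputation variance.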
Predictive mean matching (PMM)
• Find the donor among the observed units that has the same predicted y-value & impute the donor's observed y-value
• Noticeable difference from RRI as the chance of multiple donors decreases; PMM is more efficient due to the removal of imputation variance.
• Essentially a marginal, variable-by-variable approach
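A minimal sketch of PMM for a single y-variable, assuming a straight-line model and matching on the closest predicted value when no donor has exactly the same one. The names `fit_line` and `pmm` are hypothetical and the data are made up.

```python
from statistics import mean

def fit_line(x, y):
    """Ordinary least squares for y = a + b*x with a single covariate."""
    xbar, ybar = mean(x), mean(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b = sxy / sxx
    return ybar - b * xbar, b

def pmm(x_obs, y_obs, x_mis):
    """Impute the observed y of the donor whose predicted value is closest."""
    a, b = fit_line(x_obs, y_obs)
    pred_obs = [a + b * xi for xi in x_obs]
    imputed = []
    for x in x_mis:
        pred = a + b * x
        donor = min(range(len(x_obs)), key=lambda i: abs(pred_obs[i] - pred))
        imputed.append(y_obs[donor])   # an actually observed value, not a prediction
    return imputed

# Toy data (assumed for illustration only).
x_obs = [1.0, 2.0, 3.0, 4.0]
y_obs = [1.1, 2.3, 2.9, 4.2]
x_mis = [2.1, 3.9]
imputed = pmm(x_obs, y_obs, x_mis)
```

The matching here is done for one y-variable at a time, which reflects the variable-by-variable nature of the method.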
Nearest neighbor imputation (NNI)
• Given a set of covariates and a distance metric, the donor is the ‘nearest’ observed unit.
• A non-parametric generalization of PMM; hot-deck as a special case. More flexible and practical for multivariate imputation than regression models.
• Chen and Shao (2000): consistent estimator of totals as well as finite population distributions, provided the absolute difference in conditional means of y is bounded by the ‘distance’ between two units. Linear models as special cases.
• Can be made non-stochastic by introducing extra, seemingly uncorrelated covariates, such as zip code.
• Main drawback: usually not efficient (i.e. local smoothing instead of a global regression predictor)
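A minimal sketch of NNI, assuming Euclidean distance on the covariates and a zip-code difference as the extra tie-breaking covariate. The record layout (`x`, `zip`, `y`) and the function names are hypothetical. The donor's whole y-vector is copied, so the imputation is multivariate.

```python
import math

def nearest_donor(recipient, donors):
    """Donor closest in the covariates x; ties are broken by zip-code difference."""
    return min(
        donors,
        key=lambda d: (
            math.dist(d["x"], recipient["x"]),      # main distance metric
            abs(d["zip"] - recipient["zip"]),       # extra covariate removes randomness
        ),
    )

def impute(recipient, donors):
    """Copy the donor's complete y-vector to the recipient."""
    donor = nearest_donor(recipient, donors)
    return dict(recipient, y=donor["y"])

# Toy data (assumed for illustration only).
donors = [
    {"x": (1.0, 0.0), "zip": 5003, "y": (10.0, 1.0)},
    {"x": (1.0, 0.0), "zip": 5020, "y": (20.0, 0.0)},
    {"x": (3.0, 1.0), "zip": 5007, "y": (30.0, 1.0)},
]
recipient = {"x": (1.0, 0.0), "zip": 5018}
completed = impute(recipient, donors)
```

Two donors are equally near in the covariates here; the zip code settles the choice, so repeating the imputation returns the same donor every time.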
Artificial neural network (ANN)
• Class of functional imputation
• ANN as generalized regression functions (Bishop, 1995)
• No analytic predictor
• Unrealistic imputed values for categorical variables of interest
• Usually not fully efficient