Prediction and Imputation in ISEE

Prediction and Imputation in ISEE - Tools for more efficient use of combined data sources. Li-Chun Zhang, Statistics Norway; Svein Nordbotton, University of Bergen.

Presentation Transcript


  1. Prediction and Imputation in ISEE - Tools for more efficient use of combined data sources • Li-Chun Zhang, Statistics Norway • Svein Nordbotton, University of Bergen

  2. ISEE model: A sketch

  3. Use of a statistical register • Combining administrative and survey data • Model-based prediction or weighting • Construction of statistical registers • Uses of a statistical register • Prediction of (sub-)population totals • Multiple uses & general database quality => inferential concerns associated with imputation • How to balance the two types of inferential concerns?

  4. A triple-goal criterion for statistical registers • Efficient estimation of the population totals of interest • Correct co-variances among survey variables, as well as between survey and auxiliary variables • Non-stochastic & constant tabulation

  5. A simultaneous prediction method • Nearest neighbor imputation (NNI) as the only feasible approach in terms of preserving co-variances among all the variables. To improve efficiency: introduce restrictions on the imputed totals, which may be obtained separately from imputation, say, through regression prediction. To be referred to as NNI with restrictions (NNI-WR). • A simultaneous prediction method • Values are generated outside of the sample • Efficient for prediction of population totals • Not the optimal (or best) prediction of each specific unit, but of the ensemble of units, now that attention is given to the co-variances among the variables.

  6. About NNI-WR • Separation of prediction of totals from general imputation concerns, allowing full freedom in the search for efficient methods • Solves the variance estimation problem at the same time • Genuine multivariate imputation with realistic imputed values • Non-parametric nature and mild regularity conditions suggest robustness, compared to standard regression-based approaches • NNI can be made non-stochastic, yielding constant tabulations on repetition

  7. An algorithm and current research • An algorithm • Jump-start phase: to speed up the imputation procedure if desirable • Fine-tune phase: relaxation to k-nearest neighbor imputation for better agreement with restrictions; consistency remains • Adjustment between the two phases • Current research • How well does the algorithm perform in real statistical production? • Effective ways of setting up the restrictions, i.e. maximum control with a minimum number of explicit restrictions for imputation? • Evaluation of micro-data quality
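
The slides do not spell out the algorithm, so the following is only a rough sketch of the general idea behind NNI with restrictions, not the authors' procedure. It assumes a single covariate x, a single survey variable y, and one restriction: a target for the imputed total (`target_total`, a hypothetical name, imagined to come from a separate regression prediction). It starts from plain nearest neighbor imputation and then greedily swaps donors within each recipient's k nearest neighbors whenever that brings the imputed total closer to the target. The function names and the greedy rule are assumptions made for illustration.

```python
def k_nearest(donor_x, x, k):
    """Indices of the k donors closest to x; ties broken by donor index."""
    order = sorted(range(len(donor_x)), key=lambda i: (abs(donor_x[i] - x), i))
    return order[:k]


def nni_with_restriction(donor_x, donor_y, recipient_x, target_total, k=2):
    """Illustrative greedy sketch (an assumption, not the authors' algorithm)."""
    # Start from plain nearest neighbor imputation (k = 1).
    choice = [k_nearest(donor_x, x, 1)[0] for x in recipient_x]
    total = sum(donor_y[i] for i in choice)

    improved = True
    while improved:
        improved = False
        best_gap = abs(total - target_total)
        best_move = None
        # Look for the single donor swap that most reduces the gap to the target.
        for r, x in enumerate(recipient_x):
            for d in k_nearest(donor_x, x, k):
                new_total = total - donor_y[choice[r]] + donor_y[d]
                gap = abs(new_total - target_total)
                if gap < best_gap - 1e-12:
                    best_gap, best_move = gap, (r, d, new_total)
        if best_move is not None:
            r, d, total = best_move
            choice[r] = d
            improved = True

    return [donor_y[i] for i in choice], total


# Made-up data for illustration only.
donor_x = [1, 2, 3, 4, 5]
donor_y = [10, 20, 30, 40, 50]
recipient_x = [1.4, 2.6, 4.4]

imputed, imputed_total = nni_with_restriction(
    donor_x, donor_y, recipient_x, target_total=100, k=2
)
print(imputed, imputed_total)
```

Every imputed value is still an observed donor value, and the procedure is deterministic, so repeating it gives the same tabulation. A multivariate version would copy the donor's whole record rather than a single y.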

  8. Background information: Some standard methods of prediction and imputation

  9. Basic prediction approach • Under the general linear model: • Target parameter T = linear combination of y-values in the population • Estimation of T is equivalent to prediction of T outside of the selected sample • Prediction of individuals: A special case • Main problems for a statistical register • Lack of natural variation in data; especially if many units have the same x-values • Infeasible to apply simultaneously to a large number of variables; impractical as production mode; leading to inconsistency of cross-tabulation
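
As a minimal sketch of the prediction approach, assume a simple linear model y = a + b*x fitted by ordinary least squares, with T taken to be the population total. The estimate of T is then the observed sample sum plus the predicted y-values for the units outside the sample. The data and function names below are made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def predict_total(sample_x, sample_y, nonsample_x):
    """Observed sample sum plus model predictions for the non-sampled units."""
    a, b = fit_line(sample_x, sample_y)
    predicted = [a + b * x for x in nonsample_x]
    return sum(sample_y) + sum(predicted), predicted


# Made-up data: four sampled units and three units outside the sample.
sample_x = [1.0, 2.0, 3.0, 4.0]
sample_y = [2.1, 3.9, 6.2, 7.8]
nonsample_x = [2.0, 2.0, 5.0]

total, predicted = predict_total(sample_x, sample_y, nonsample_x)
print(total, predicted)
```

The two non-sampled units with x = 2.0 receive exactly the same predicted value, which illustrates the lack of natural variation noted on the slide.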

  10. Random regression imputation (RRI) • To emulate the natural variation in data: Add a random residual to the best predicted y-value • Hot-deck as a special case • Main problems: • Extra variance of imputed estimator due to random imputation => never fully efficient • Random imputation not the only means for creating natural variation in data • Different tabulations on repetition => lack of acceptability and face value in official statistics
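
A minimal sketch of random regression imputation, assuming a simple linear model and assuming that the random residual is drawn with replacement from the observed sample residuals (one common choice; the slides do not say how the residual is generated). The data and function names are illustrative.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def random_regression_impute(sample_x, sample_y, nonsample_x, seed):
    """Best predicted value plus a randomly drawn observed residual."""
    rng = random.Random(seed)
    a, b = fit_line(sample_x, sample_y)
    residuals = [y - (a + b * x) for x, y in zip(sample_x, sample_y)]
    return [a + b * x + rng.choice(residuals) for x in nonsample_x]


# Made-up data for illustration only.
sample_x = [1.0, 2.0, 3.0, 4.0]
sample_y = [2.1, 3.9, 6.2, 7.8]
nonsample_x = [2.0, 2.0, 5.0]

run_a = random_regression_impute(sample_x, sample_y, nonsample_x, seed=1)
run_b = random_regression_impute(sample_x, sample_y, nonsample_x, seed=1)
print(run_a)
```

Fixing the seed makes a run reproducible, but a different seed will in general give different imputed values and hence different tabulations, which is the repetition problem listed above.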

  11. Multiple imputation (MI) • Independent random imputations + formulae for combining results • Bayesian or frequentist approach • Main problems: • Removes all the extra imputation variance only with an infinite number of repetitions. Otherwise, still not fully efficient & non-constant tabulations • A common misunderstanding: only MI can yield acceptable measures of accuracy.
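
The slides do not say which combining formulae are meant. As an assumption, the sketch below uses the standard rules usually attributed to Rubin: the point estimate is the mean over the m imputed data sets, and the total variance is the average within-imputation variance W plus (1 + 1/m) times the between-imputation variance B. The input numbers are made up.

```python
def combine_multiple_imputations(estimates, variances):
    """Combine m completed-data estimates and their variances (Rubin's rules)."""
    m = len(estimates)
    q_bar = sum(estimates) / m                               # combined point estimate
    w = sum(variances) / m                                   # within-imputation variance
    b = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)   # between-imputation variance
    t = w + (1 + 1 / m) * b                                  # total variance
    return q_bar, w, b, t


# Made-up results from m = 3 imputed data sets.
estimates = [10.0, 12.0, 11.0]
variances = [1.0, 1.2, 0.8]

q_bar, w, b, t = combine_multiple_imputations(estimates, variances)
print(q_bar, w, b, t)
```

The B/m part of the total variance is the extra variance caused by using a finite number of random imputations. It shrinks as m grows but vanishes only in the limit, which matches the first problem listed on the slide.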

  12. Predictive mean matching (PMM) • Find the donor among the observed units who has the same predicted y-value & impute the observed y-value • Noticeable difference from RRI as the chance of multiple donors decreases; PMM is more efficient due to the removal of imputation variance. • Essentially a marginal, variable-by-variable approach
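
A minimal sketch of predictive mean matching for one variable, assuming a simple linear model for the predicted values. Because an exactly equal predicted value may not exist among the donors, this sketch takes the donor with the closest predicted value and breaks ties by donor order; both choices are assumptions. The data are made up.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def pmm_impute(donor_x, donor_y, recipient_x):
    """Impute the observed y of the donor whose predicted y is closest."""
    a, b = fit_line(donor_x, donor_y)
    donor_pred = [a + b * x for x in donor_x]
    imputed = []
    for x in recipient_x:
        target = a + b * x
        # Closest predicted value wins; ties go to the lowest donor index.
        best = min(range(len(donor_x)),
                   key=lambda i: (abs(donor_pred[i] - target), i))
        imputed.append(donor_y[best])
    return imputed


# Made-up data for illustration only.
donor_x = [1.0, 2.0, 3.0, 4.0]
donor_y = [2.1, 3.9, 6.2, 7.8]
recipient_x = [2.0, 2.6, 5.0]

imputed = pmm_impute(donor_x, donor_y, recipient_x)
print(imputed)
```

Each imputed value is a real observed value, and no random residual is drawn, so the result does not change on repetition. The matching is done for one y-variable at a time, which is the marginal character noted on the slide.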

  13. Nearest neighbor imputation (NNI) • Given a set of covariates and a distance metric, the donor is the ‘nearest’ observed unit. • A non-parametric generalization of PMM; hot-deck is a special case. More flexible and practical for multivariate imputation than regression models. • Chen and Shao (2000): consistent estimator of totals as well as finite population distributions, provided the absolute difference in conditional means of y is bounded by the ‘distance’ between two units. Linear models as special cases. • Can be made non-stochastic by introducing extra seemingly uncorrelated covariates, such as Zip code. • Main drawback: Usually not efficient (i.e. local smoothing instead of global regression predictor)
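
A minimal sketch of nearest neighbor imputation, assuming Euclidean distance on two numeric covariates and using closeness in Zip code only to break ties in distance, so that the donor choice is deterministic. The record layout, the tie-breaking rule and the data are assumptions for illustration. The donor's whole y-record is copied, so the relationships among the imputed variables are those of a real unit.

```python
import math


def distance(u, v):
    """Euclidean distance between two covariate tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def nearest_neighbor_impute(donors, recipients):
    """donors: (covariates, zip_code, y_record); recipients: (covariates, zip_code)."""
    imputed = []
    for cov, zip_code in recipients:
        # Nearest on the covariates; ties resolved by closest Zip code,
        # then by donor position, so no random choice is ever needed.
        best = min(
            range(len(donors)),
            key=lambda i: (distance(donors[i][0], cov),
                           abs(donors[i][1] - zip_code),
                           i),
        )
        imputed.append(donors[best][2])
    return imputed


# Made-up data: two donors share the same covariates but differ in Zip code.
donors = [
    ((1.0, 10.0), 5003, (100, "A")),
    ((1.0, 10.0), 5020, (120, "B")),
    ((4.0, 2.0), 5007, (300, "C")),
]
recipients = [
    ((1.0, 10.0), 5018),
    ((3.5, 2.5), 5001),
]

imputed = nearest_neighbor_impute(donors, recipients)
print(imputed)
```

The first recipient is equally close to the first two donors on the covariates, and the Zip code decides between them without any random draw.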

  14. Artificial neural network (ANN) • Class of functional imputation • ANN as generalized regression functions (Bishop, 1995) • No analytic predictor • Unrealistic imputed values for categorical variables of interest • Usually not fully efficient
