Introduction to Emerging Methods for Imputation in Official Statistics

Introduction to Emerging Methods for Imputation in Official Statistics Ventspils 08/2006 Pasi Piela

Overview 1. Imputation in the quality framework of official statistics 2. Classes of imputation methods 3. Requirements for imputation in official statistics 4. Past research work 5. Statistical clustering 6. Best imputation methods 7. Processing imputation (and editing) 8. New research plans 9. Multiple imputation 10. Fractional imputation 11. Multilevel model based imputation Pasi Piela

Imputation is defined as the process of statistical replacement of missing values • Editing and imputation are undertaken as part of a quality improvement strategy to improve accuracy, consistency and completeness. Pasi Piela

Classes of imputation methods • A1. Deterministic imputation, or • A2. Stochastic imputation • B1. Logical imputation, • B2. Real donor imputation, or • B3. Model donor imputation • C1. Single imputation, or • C2. Multiple imputation • D1. Hot-deck • D2. Cold-deck Pasi Piela

Five requirements for imputation in official statistics (Chambers, 2001) • Predictive Accuracy: The imputation procedure should maximise the preservation of true values. • Distributional Accuracy: The preservation of the distribution of the true values is also important. • Estimation Accuracy: The imputation procedure should reproduce the lower order moments of the distributions of the true values. • Imputation Plausibility: The imputation procedure should lead to imputed values that are plausible. • Ranking Accuracy: The imputation procedure should maximise the preservation of order in the imputed values. Pasi Piela

Past research work • traditional methods: cell-mean imputation, regression imputation, random donor, nearest neighbour, etc. • advanced imputation techniques • based on statistical clustering (e.g. K-means) • homogenous imputation classes • hierarchical clustering (e.g. classification and regression trees) • ”modern” statistical pattern recognition methods Pasi Piela

Statistical clustering for imputation • imputation classes, cells, clusters, groups average point locating the cluster Computational methods? ...or searching appropriate imputation cells by using categorical sorting variables? Pasi Piela

K-means Clustering • The basic variance minimization clustering algorithm. • The ”Voronoi region” (as part of the tesselation) for cluster i is given by where || refers to the Euclidean norm (distance). At the iteration time t + 1 the weight wi is updated by where #Vi refers to the number of units xk in Vi. Pasi Piela

Classification and Regression Trees • WAID - Weighted Automatic Interaction Detection software • EU FP4 Project AUTIMP Target variable: categorical or continuous Node Y Predictor variable The original data Binary splits X1<5 X1>= 5 X2<7 X2>=7 Original data splitted into two separate parts Only one variable in turn defines the split. Pasi Piela

Neural networks for clustering: Self-Organizing Maps, SOM • The basic SOM defines a mapping from the input data space Rnonto a latent space consisted typically of a two-dimensional array of nodes or neurons.

Best methods • Nearest neighbour (hot-deck) by Euclidean distance metrics • The bestmethod is actually a system that includes several competitive imputation methods • The development and evaluation of new imputation methods is closely connected to the software development. Pasi Piela

Processing imputation (and editing) • The Banff system of Statistics Canada is a good example about computerized editing and imputation process. It is a collection of specialized SAS® procedures "each of which can be used independently or put together in order to satisfy the edit and imputation requirements of a survey" as stated in the Banff manual (Statistics Canada, 2006). • successor to more well-known GEIS • currently Statistics Finland is evaluating Banff Pasi Piela

Pasi Piela

Verifying edits, generating implied edits and extremal points Pasi Piela

(Pre-)view of failure rates, fine-tuning the edits Pasi Piela

Three basic outlier detection methods Pasi Piela

Identifying which fields require imputation and how to satisfy edits. Pasi Piela

Here : logical imputation (one possible value allowing to pass the edits). Pasi Piela

Nearest neighbour via constructing a k-dimensional tree. Pasi Piela

Using user-defined or some of the 20 hard-coded algorithms. Pasi Piela

Adjusting and rounding the data so that they add to a specific totals. Pasi Piela

Mass imputation procedure (not handled in this presentation) Pasi Piela

Pasi Piela

Plans for future research - intro • Multiple imputation, MI, was not handled here because of the context of the research. But also detailed research in imputation variance and careful analysis of the datasets with hierarchical, multilevel nature (note the difference to the previously mentioned hierarchical clustering methods) containing cross-classifications and missingness were also excluded. This will lead us to the forthcoming research that will be next outlined. Pasi Piela

Multiple imputation • Donald Rubin, 1987 • Bayesian framework • very famous and popular family of the imputation methods • rarely used in official statistics • several imputed datasets are combined to reach an estimate for the imputation variance • assumptions are strict and there also exist some difficulties with MI variance estimation as discussed by Rao (2005) and Kim et al. (2004) Pasi Piela

Fractional imputation • a sort of mixed real donor and model donor “hot-deck” imputation method for a population divided into imputation cells • involves using more than one donor for a recipient • simply: three imputed values might be assigned to each missing value, with each entry allocated a weight of 1/3 of the nonrespondent's original weight • Kim and Fuller (2004): superior to MI • designed to reduce the imputation variance while MI gives only a simple way to estimate it Pasi Piela

Fractional imputation • a sort of mixed real donor and model donor “hot-deck” imputation method for a population divided into imputation cells • involves using more than one donor for a recipient • simply: three imputed values might be assigned to each missing value, with each entry allocated a weight of 1/3 of the nonrespondent's original weight • Kim and Fuller (2004): superior to MI • designed to reduce the imputation variance while MI gives only a simple way to estimate it Assume the within-cell uniform response model in which the responses in a cell g are equivalent to a Bernoulli sample sample selected from the elements ng. Pasi Piela

A Fractional imputation 2 AR sample non-respondents AM • Formulating the fractionally imputed estimator: sample respondents imputation fraction population quantity of interest Pasi Piela

Fractional imputation 3 • Kim and Fuller (2004) showed also how fractional imputation combined with the proposed replication variance estimator gives a set of replication weights that can be used to construct unbiased variance estimators for estimators based on imputed data (and for estimators based on the completely responding variables). • fully efficient fractional imputation, FEFI • produces zero imputation variance Pasi Piela

Fractional imputation 3 Approximations to FEFI by fixed number of donors. -- e.g. systematic sampling with probability proportional to the weights to select donors for each recipient. After the donors are assigned, the initial fractions are adjusted so that the sum of the weights gives the fully efficient fractionally imputed estimator. • Kim and Fuller (2004) showed also how fractional imputation combined with the proposed replication variance estimator gives a set of replication weights that can be used to construct unbiased variance estimators for estimators based on imputed data (and for estimators based on the completely responding variables). • fully efficient fractional imputation, FEFI • produces zero imputation variance Pasi Piela

Multilevel modeling for imputation • Many kinds of data have a hierarchical or clustered nature. • We refer to a hierarchy as consisting of units grouped at different levels. Thus children may be the level 1 units in a 2-level structure where the level 2 units are the families and students may be the level 1 units clustered within schools that are the level 2 units. • More recently there has been a growing awareness that many data structures are not purely hierarchical but contain cross-classifications of higher level units and multiple membership patterns. Pasi Piela

Multilevel modeling for imputation 2 • Multilevel models (Goldstein, 2003) take advantage of the correlation structure between different levels of hierarchy. • Correlation structures and connections between the study variables can be a challenge in imputation tasks as well. • Currently in literature, there are some papers about the use of multiple imputation with multilevel models. • Carpenter and Goldstein (2004): if a dataset is multilevel then the imputation model should be multilevel too. • MLwiN Pasi Piela

Plans for future research • new imputation methods and the analysis of imputed and missing data when data structures are complex • use of multilevel imputation models in which clustersin thecluster sampling design can be incorporated in an imputation model as random effects • information on the complex sampling design can be incorporated, for example, by using strata indicators as fixed covariates • modified fractional and single imputation will be considered Pasi Piela

Plans for future research 2 • Another new avenue of research is the use of multilevel models and imputation in the context of small area estimation (Rao, 2003). Pasi Piela

Paldies, pasi.piela@stat.fi Pasi Piela

References 1/2 • Aitkin, M., Anderson, D., and Hinde, J. (1981): Statistical Modelling of Data on Teaching Styles (with discussion). Journal of the Royal Statistical Society, A 144, 148-1461. • Breiman, L., Friedman, J.H., Olsen, R.A., and Stone, C.J. (1984): Classification and Regression Trees. Wadsworth, Belmont, CA. • Carpenter, J.R., and Goldstein, H. (2004): Multiple Imputation Using MLwiN, Multilevel Modelling Newsletter, 16, 2. ** • Charlton, J. (2003): Editing and Imputation Issues. Towards Effective Statistical Editing and Imputation Strategies - Findings of the Euredit project. * • Chambers, R. (2001): Evaluation Criteria for Statistical Editing and Imputation. National Statistics Methodological Series, United Kingdom, 28, 1-41. * • Eurostat (2002): Quality in the European Statistical System - the Way Forward. European Comission. • Goldstein, H. (2003). Multilevel Statistical Models (Third Edition). Edward Arnold, London. • Hill, P.W., and Goldstein, H. (1998): Multilevel Modelling of Educational Data with Cross Classification and Missing Indentification of Units, Journal of Educational and Behavioural Statistics, 23, 117-128. • Kalton, G., and Kish, L. (1984): Some Efficient Random Imputation Methods. Communications in Statistics, A13, 1919-1939. • Kim, J.K., and Fuller, W.A. (2004): Fractional hot deck imputation, Biometrika, 91, 559-578. • Fuller, W.A., and Kim, J.K. (2005): Hot Deck Imputation for the Response Model, Survey Methodology, 31, 2, 139-149. • Kim, J.K., Brick, J.M., Fuller, W.A. and Kalton, G. (2004): On the Bias of the Multiple Imputation Variance Estimator in Survey Sampling. Technical Report. • Kohonen, T. (1997): Self-Organizing Maps. Springer Verlag, New York. • Koikkalainen, P., Horppu, I., and Piela P. (2003): Evaluation of SOM based Editing and Imputation. Towards Effective Statistical Editing and Imputation Strategies - Findings of the Euredit project. * • Lehtonen, R., Särndal, C.-E., and Veijanen, A. (2003): The Effect of Model Choice in Estimation for Domains, Including Small Domains, Survey Methodology, 29, 33-44. • Lehtonen, R., Särndal, C.-E., and Veijanen, A. (2005): Does the Model Matter? Comparing Model-assisted and Model-dependent Estimators of Class Frequencies for Domains, Statistics in Transition, 7, 649-673. Pasi Piela

References 2/2 • Piela, P., and Laaksonen, S. (2001): Automatic Interaction Detection for Imputation – Tests with the WAID Software Package. In: Proceedings of Federal Committee on Statistical Methodology Research Conference,Statistical Policy Working paper, 34, 2, 49-59, Washington, DC. • Piela, P. (2002): Introduction to Self-Organizing Maps Modelling for Imputation - Techniques and Technology. Research in Official Statistics, 2, 5-19. • Piela, P. (2004): Neuroverkot ja virallinen tilastotoimi, Suomen Tilastoseuran Vuosikirja 2004. Helsinki. • Piela, P. (2005): On Emerging Methods for Imputation in Official Statistics. Licentiate Thesis. University of Jyväskylä. • Rasbash, J., Steele, F., Browne, W., and Prosser, B. (2004): A User’s Guide to MLwiN (version 2.0). London. ** • Rao, J.N.K. (2003): Small Area Estimation. Wiley, New York. • Rao, J.N.K. (2005): Interplay Between Sample Survey Theory and Practice: An Appraisal, Survey Methodology,Statistics Canada, 31, 2, 117-138. • Ritter, H., Martinez, T., and Schulten, K. (1992): Neural Computation and Self-Organizing Maps: An Introduction. Addison-Wesley, Reading, MA. • Schafer, J.L., and Graham, J.W. (2002): Missing Data: Our View of the State of the Art, Psychological Methods7, 2, 147-177. • Statistics Canada (2003): Statistics Canada Quality Guidelines. Catalogue no. 12-539-XIE. • Statistics Canada (2006): Functional Description of the Banff System for Edit and Imputation. Statistics Canada. • Rubin, D. (1987): Multiple Imputation in Surveys. John Wiley & Sons, New York. • *) The Euredit project: http://www.cs.york.ac.uk/euredit/ • **) Centre for Multilevel Modelling: http://www.mlwin.com/ Pasi Piela

Introduction to Emerging Methods for Imputation in Official Statistics

Introduction to Emerging Methods for Imputation in Official Statistics

Presentation Transcript

PSYM021 Introduction to Methods Statistics

Official Statistics

Editing And Imputation For Manufacturing Statistics At Statistics Canada

Modernising Official Statistics

Responding to users’ needs in official statistics

Official Statistics – an introduction and overview

Official International Statistics

master in official statistics

Statistics 101 Course Notes Introduction to Quantitative Methods for

Opportunities for research related to Official Statistics

Imputation in External Trade Statistics in Statistics Denmark

An Introduction to producing Official Statistics

Understanding Official Statistics

S-012 Empirical Methods: Introduction to Statistics for Research

Introduction to Multiple Imputation

S-012 Empirical Methods: Introduction to Statistics for Research

Data Imputation Methods and Technologies

Quality criteria for official statistics

Innovation in Official Statistics