Implementation of the Bayesian approach to imputation at SORS
Zvone Klun and Rudi Seljak, Statistical Office of the Republic of Slovenia
Oslo, September 2012
EU-SILC data
• Data on income and living conditions
• Data on household members and selected individuals
• From the large number of variables we selected:
  • Variable to be imputed:
    • PY010G - Gross annual income
    • About 11% of the values were deleted completely at random
  • Explanatory variables:
    • PE040 - Level of education attained
    • PL060 - Number of hours usually worked per week
    • AGE - Age of person
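The deletion step can be illustrated with a minimal Python sketch. It assumes that "completely at random" means every value is deleted independently with the same probability. The function name, the seed and the synthetic income values are hypothetical stand-ins, not the EU-SILC data or the authors' SAS code.

```python
import numpy as np

def delete_completely_at_random(values, rate=0.11, seed=0):
    """Return (copy with deleted entries set to NaN, boolean deletion mask).

    Every entry is deleted independently with probability `rate`, so the
    deletion depends neither on the value itself nor on any other variable.
    """
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    deleted = rng.random(values.shape) < rate
    masked = values.copy()
    masked[deleted] = np.nan
    return masked, deleted

# Hypothetical stand-in for PY010G: skewed synthetic incomes.
income = np.random.default_rng(1).lognormal(mean=9.5, sigma=0.6, size=10_000)
income_with_gaps, deleted = delete_completely_at_random(income)
print(f"share deleted: {deleted.mean():.3f}")
```

Keeping the deletion mask makes it possible to compare imputed values with the initial data afterwards.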
Analysis of PY010G
• The distribution of PY010G is very asymmetrical
• Analysis according to PE040
• Because PE040 is categorical: 5 models of the same form
Further analysis of PY010G
• For each level of education attained
• Analysis according to AGE and PL060
• Example: 5th education level
Model for PY010G
• Estimates: [formulas not preserved]
• Example for PE040=5, AGE=40, PL060=40
• Graphs of the normal distribution with respect to the data (red) and the regression model (green)
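The model formulas of this slide are not preserved. As a hedged sketch only: assuming a linear regression with normal errors, fitted separately within each education level, the model would have the form below. The symbols are assumptions, and the slides do not say whether PY010G was transformed before modelling.

```latex
\mathrm{PY010G}_i = \beta_0 + \beta_1\,\mathrm{AGE}_i + \beta_2\,\mathrm{PL060}_i + \varepsilon_i,
\qquad \varepsilon_i \sim N(0, \sigma^2)
```

Under this assumption, the example PE040=5, AGE=40, PL060=40 corresponds to a normal distribution with mean $\beta_0 + 40\,\beta_1 + 40\,\beta_2$ and variance $\sigma^2$, using the coefficients of the model for the 5th education level.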
Bayesian approach
• Equal treatment of
  • the data (PY010G) and
  • the parameters [symbols not preserved]
• Parameters
  • are not fixed values,
  • have their own probability distribution.
Simulations and multiple imputation
• Simulation of the parameters:
  • first draw the variance,
  • then draw the coefficients.
• Simulation of the missing values (multiple imputation):
  • draw each missing value,
  • independently for each missing value.
• With 5 imputations the efficiency is almost 98% (Rubin's formula for a rate of missing information of about 11%).
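The two simulation steps can be sketched in Python as follows. The distributions on the slide are not preserved, so this sketch assumes the standard noninformative prior for normal linear regression, under which the variance is drawn from a scaled inverse chi-square distribution and the coefficients from a normal distribution given that variance. All function names and the synthetic data are hypothetical, and this is not the authors' SAS implementation.

```python
import numpy as np

def impute_bayes_linear(X_obs, y_obs, X_mis, m=5, seed=0):
    """Return an (m, n_missing) array of imputed values.

    Assumes y = X beta + eps, eps ~ N(0, sigma2), with prior
    p(beta, sigma2) proportional to 1 / sigma2.
    """
    rng = np.random.default_rng(seed)
    n, k = X_obs.shape
    xtx_inv = np.linalg.inv(X_obs.T @ X_obs)
    beta_hat = xtx_inv @ X_obs.T @ y_obs        # least-squares estimate
    resid = y_obs - X_obs @ beta_hat
    s2 = resid @ resid / (n - k)                # residual variance
    chol = np.linalg.cholesky(xtx_inv)          # factor for coefficient draws
    imputations = np.empty((m, X_mis.shape[0]))
    for j in range(m):
        # Step 1: draw the variance, sigma2 = (n - k) * s2 / chi2_{n-k}.
        sigma2 = (n - k) * s2 / rng.chisquare(n - k)
        # Step 2: draw the coefficients, beta ~ N(beta_hat, sigma2 * (X'X)^-1).
        beta = beta_hat + np.sqrt(sigma2) * (chol @ rng.standard_normal(k))
        # Step 3: draw every missing value independently around X_mis @ beta.
        noise = rng.standard_normal(X_mis.shape[0])
        imputations[j] = X_mis @ beta + np.sqrt(sigma2) * noise
    return imputations

def mi_efficiency(gamma, m):
    """Rubin's relative efficiency of m imputations at missing-information rate gamma."""
    return 1.0 / (1.0 + gamma / m)

# Synthetic stand-in data: intercept, AGE, PL060 (not EU-SILC values).
rng = np.random.default_rng(42)
n = 500
age = rng.uniform(25, 60, n)
hours = rng.uniform(20, 50, n)
X = np.column_stack([np.ones(n), age, hours])
y = 5000 + 300 * age + 200 * hours + rng.normal(0, 3000, n)

X_new = np.array([[1.0, 40.0, 40.0]])           # AGE=40, PL060=40
draws = impute_bayes_linear(X, y, X_new, m=5)   # five imputations for one unit
print(draws.ravel())
print(round(mi_efficiency(0.11, 5), 3))         # about 0.978
```

Drawing new parameters for every imputation, rather than reusing the point estimates, is what carries the parameter uncertainty into the imputed values.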
Imputed values
• Example of 5 imputations for PE040=5, AGE=40, PL060=40
Evaluation
• Comparison of the average gross annual income (initial data: the data before deletion)
• Small relative errors
• Relatively narrow 95% confidence intervals
• Poorer results for model 6, because of:
  • only 58 units
  • high variance from the linear regression (252862689)
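The slides do not state how the averages and confidence intervals were computed from the imputed data sets. One common choice is Rubin's combining rules, sketched below in Python. Two simplifications are assumptions of this sketch: the variance of each mean is computed as for simple random sampling, and the interval uses a normal quantile instead of Rubin's t reference distribution (the degrees of freedom are returned so that a t quantile can be substituted). The function names are hypothetical.

```python
import numpy as np
from statistics import NormalDist

def pool_means(imputed_datasets, level=0.95):
    """Combine the means of m completed data sets with Rubin's rules.

    Returns (pooled mean, (lower, upper), degrees of freedom).
    """
    m = len(imputed_datasets)
    estimates = np.array([np.mean(d) for d in imputed_datasets])
    # Within-imputation variance of each mean (simple random sampling assumed).
    variances = np.array([np.var(d, ddof=1) / len(d) for d in imputed_datasets])
    q_bar = estimates.mean()              # pooled estimate
    w = variances.mean()                  # average within-imputation variance
    b = estimates.var(ddof=1)             # between-imputation variance
    total = w + (1 + 1 / m) * b           # total variance
    df = (m - 1) * (1 + w / ((1 + 1 / m) * b)) ** 2 if b > 0 else float("inf")
    z = NormalDist().inv_cdf(0.5 + level / 2)   # normal approximation
    half_width = z * np.sqrt(total)
    return q_bar, (q_bar - half_width, q_bar + half_width), df

def relative_error(estimate, initial_value):
    """Relative error of an estimate with respect to the value from the initial data."""
    return (estimate - initial_value) / initial_value

# Tiny hypothetical example with two completed data sets.
completed = [np.array([1.0, 2.0, 3.0]), np.array([2.0, 3.0, 4.0])]
pooled, interval, df = pool_means(completed)
print(pooled, interval, df)
```

The pooled mean can then be compared with the mean of the initial data through `relative_error`.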
Discussion
• The method is effective if the data are successfully described by the selected model.
• The missing-data mechanism is ignorable if
  • the missing data are MAR and
  • the parameters of the model and the parameters of the missing-data mechanism are distinct (the parameters are independent).
• Imputed and explanatory variables have to be numerical.
• We tested the method progressively using SAS.
• The method is already included in the MCMC procedure in newer versions (9.2 and 9.3) of SAS.
Thank you for your attention!