Multivariate Selective Editing via Mixture Models: First Applications to Italian Business Surveys

  1. UNECE Worksession on Statistical Data Editing, Oslo, 22-24 September 2012. Multivariate selective editing via mixture models: first applications to Italian structural business surveys. Orietta Luzi, Guarnera U., Silvestri F., Buglielli T., Nurra A., Siesto G., Italian National Institute of Statistics

  2. Outline
     • Objective of the work
     • The SeleMix approach to selective editing
     • The software SeleMix
     • The applications
     • Final remarks and future work

  3. Objective of the work
     • Assess the advantages (in terms of quality improvements and cost reductions) of using a multivariate, model-based, robust selective editing approach to detect influential errors in business surveys.
     • Explore the potential benefits of using administrative data when detecting influential errors in economic business surveys.
     The idea is to improve the effectiveness of selective editing by directly incorporating into the selective editing strategy the auxiliary information available in external sources (both administrative and statistical).

  4. Selective Editing
     • Key elements:
       • score function
       • cut-off value (threshold) determining the units to be manually reviewed
     • The components of a score function are:
       • risk ~ probability of error occurrence
       • influence ~ (expected) impact on estimates

  5. Score Function
     • A local score is often defined for each record and each variable through a comparison of current values and “estimated” true values, e.g.
       • historical values on the same units (when available)
       • estimates (predictions) obtained using auxiliary information (e.g. admin data) or covariates from the same survey
     • Different local scores are combined into a single global score. The cut-off value of the global score determines which units are to be manually reviewed.
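
A minimal Python sketch of the local/global score idea on this slide. The slide does not give a formula, so everything here is an assumption made for illustration: the local score is taken as the weighted absolute deviation from the prediction relative to an estimated total, the global score is the maximum of the local scores, and the names `local_score`, `global_score`, `units_to_review` and the data are hypothetical.

```python
from typing import Dict, List


def local_score(observed: float, predicted: float, weight: float, total: float) -> float:
    """Weighted absolute deviation from the prediction, relative to the estimated total."""
    return weight * abs(observed - predicted) / abs(total)


def global_score(local_scores: List[float]) -> float:
    """Combine local scores by taking the largest one (an assumed, common choice)."""
    return max(local_scores)


def units_to_review(global_scores: Dict[str, float], cutoff: float) -> List[str]:
    """Return the units whose global score exceeds the cut-off, highest score first."""
    flagged = [unit for unit, score in global_scores.items() if score > cutoff]
    return sorted(flagged, key=lambda unit: global_scores[unit], reverse=True)


# Hypothetical data: (observed, predicted, sampling weight, estimated total) per variable.
records = {
    "A": {"turnover": (1200.0, 1000.0, 10.0, 100000.0), "costs": (500.0, 490.0, 10.0, 50000.0)},
    "B": {"turnover": (1010.0, 1000.0, 10.0, 100000.0), "costs": (495.0, 500.0, 10.0, 50000.0)},
}
global_scores = {
    unit: global_score([local_score(*values) for values in variables.values()])
    for unit, variables in records.items()
}
flagged = units_to_review(global_scores, cutoff=0.01)
```

With these made-up figures only unit "A" exceeds the cut-off, because its turnover deviates strongly from the prediction.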

  6. Selective Editing
     • The difference between observed and predicted values is due to
       • the potential error
       • the natural variability of the analyzed quantity.
     • In the usual setting these two elements cannot be distinguished, and the score of an observation is not directly related to the expected error of that unit.
     • As a consequence, the selective editing threshold cannot be related to the desired degree of accuracy in the final estimates.
     • Problem:
       • Relate the threshold value of the score function to the desired accuracy of the estimates (i.e. the residual error left in the data)

  7. Model-based Selective Editing
     • Proposed solution: use an approach based on
       1) explicit modeling of both the data and the error mechanism (via mixture models). In particular, a latent variable model makes it possible, under certain assumptions, to estimate the expected error associated with each unit. The method uses contaminated normal models, in which the distribution of the erroneous data is assumed to be obtained from the distribution of the error-free data by inflating the variance.
       2) definition of the score function in terms of the conditional distribution of “true” data given observed data

  8. The model
     • Y*: true data
     • Y: observed data
     • X: covariates (no error)
     • B: regression coefficients
     • U: residuals
     • I: Bernoullian variable
     • True data model, error model (error term e) and distribution of observed data: [equations not recovered]
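
The formulas on this slide can be sketched as follows. This is a reconstruction under stated assumptions, not the slide's own notation: it uses the standard contaminated normal form suggested elsewhere in the deck (Gaussian true data, additive Gaussian error, proportional covariance matrices). The symbols \Sigma (residual covariance), \lambda (variance inflation factor) and \pi (error probability) are introduced here for illustration; the model is written for a single unit i with covariate vector x_i.

```latex
\begin{align*}
y_i^{*} &= B^{\top} x_i + u_i, & u_i &\sim N(0, \Sigma) \\
y_i &= y_i^{*} + I_i\, e_i, & e_i &\sim N\bigl(0, (\lambda - 1)\Sigma\bigr), \quad \lambda > 1 \\
I_i &\sim \mathrm{Bernoulli}(\pi) \\
f(y_i \mid x_i) &= (1 - \pi)\, N\bigl(y_i;\, B^{\top} x_i,\, \Sigma\bigr)
                 + \pi\, N\bigl(y_i;\, B^{\top} x_i,\, \lambda \Sigma\bigr)
\end{align*}
```

Under these assumptions an erroneous observation has the same mean as an error-free one but covariance \lambda\Sigma, which is the "inflated variance" the deck refers to.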

  9. The method
     • Model parameters can be estimated from the observed data via EM.
     • These estimates can be used to estimate the conditional distribution of true data given observed data, which involves a posterior probability for unit i: [equation not recovered]
     • We obtain a prediction for unit i as: [equation not recovered]
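
A worked version of the two quantities named on this slide, under assumptions that the transcript does not confirm: a contaminated normal model in which an error-free observation y_i has mean B^{\top}x_i and covariance \Sigma, an erroneous one has covariance \lambda\Sigma with \lambda > 1, and errors occur with probability \pi. The symbols \tau_i, \tilde{y}_i and \hat{y}_i are introduced here for illustration.

```latex
\begin{align*}
\tau_i &= P(I_i = 1 \mid y_i, x_i)
        = \frac{\pi\, N(y_i;\, B^{\top} x_i,\, \lambda\Sigma)}
               {(1 - \pi)\, N(y_i;\, B^{\top} x_i,\, \Sigma) + \pi\, N(y_i;\, B^{\top} x_i,\, \lambda\Sigma)} \\
\tilde{y}_i &= E(y_i^{*} \mid y_i, x_i, I_i = 1)
        = \frac{y_i + (\lambda - 1)\, B^{\top} x_i}{\lambda} \\
\hat{y}_i &= E(y_i^{*} \mid y_i, x_i)
        = (1 - \tau_i)\, y_i + \tau_i\, \tilde{y}_i
\end{align*}
```

The second line follows from the usual Gaussian conditioning formula: given an error, the covariance between true and observed values is \Sigma and the covariance of the observed value is \lambda\Sigma, so the regression coefficient of y_i^{*} on y_i is 1/\lambda.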

  10. Risk and Influence
     • The expected error is: [equation not recovered; its two factors are labelled "risk component" and "influence component"]
     • The expected error is the product of the two components.
     • It is natural to define the score function in terms of the expected error.
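
A small Python sketch of "expected error = risk × influence" for a single variable. It assumes a univariate contaminated normal model with known parameters (mean `mean`, clean variance `var`, variance inflation factor `lam` > 1, error probability `pi`); these names and the numbers in the tests are hypothetical, and this is not the SeleMix implementation.

```python
import math


def normal_pdf(x: float, mean: float, var: float) -> float:
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def error_probability(y: float, mean: float, var: float, lam: float, pi: float) -> float:
    """Risk: posterior probability that the observed value y is affected by an error."""
    clean = (1.0 - pi) * normal_pdf(y, mean, var)
    contaminated = pi * normal_pdf(y, mean, lam * var)
    return contaminated / (clean + contaminated)


def expected_error(y: float, mean: float, var: float, lam: float, pi: float) -> float:
    """Expected error of y: risk times influence."""
    risk = error_probability(y, mean, var, lam, pi)
    # Influence: observed value minus the expected true value given that an error occurred.
    influence = (1.0 - 1.0 / lam) * (y - mean)
    return risk * influence


def predicted_true_value(y: float, mean: float, var: float, lam: float, pi: float) -> float:
    """Prediction of the true value: observed value minus its expected error."""
    return y - expected_error(y, mean, var, lam, pi)
```

A value close to the model mean gets a small risk and therefore a small expected error, while a value far in the tail gets a risk near 1 and is pulled almost all the way back towards the mean.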

  11. The score function
     • If a total Y in a finite population is to be estimated on a sample S via the robust estimator: [equation not recovered]
     • we define a (local) score function as: [equation not recovered] (weighted expected error for variable Y in unit i)
     • Ordering the records by that score function (in descending order), correcting the first k units, and summing the r_iY scores over all the unedited units, we obtain an estimate of the relative expected residual error R_kY in the data: [equation not recovered]
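
The ordering and stopping rule described on this slide can be sketched in Python as below. The exact formulas are not in the transcript, so the sketch makes explicit assumptions: the robust total is the weighted sum of the predictions, the score is the weighted absolute expected error divided by that total, and the residual error is the sum of absolute scores over unedited units (a conservative choice in which errors of opposite sign cannot cancel). The function name `select_units` is hypothetical.

```python
from typing import List, Sequence, Tuple


def select_units(
    observed: Sequence[float],
    predicted: Sequence[float],
    weights: Sequence[float],
    threshold: float,
) -> Tuple[List[int], float]:
    """Return indices of units to edit (most influential first) and the residual error."""
    # Assumed robust estimate of the total: weighted sum of the predictions.
    total = sum(w * p for w, p in zip(weights, predicted))
    # Score of each unit: weighted absolute expected error relative to the total.
    scores = [w * abs(o - p) / abs(total) for o, p, w in zip(observed, predicted, weights)]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Edit units in descending score order until the residual error is below the threshold.
    k = 0
    while k < len(order) and sum(scores[i] for i in order[k:]) >= threshold:
        k += 1
    residual = sum(scores[i] for i in order[k:])
    return order[:k], residual


selected, residual = select_units(
    observed=[100.0, 150.0, 101.0, 400.0],
    predicted=[100.0, 100.0, 100.0, 100.0],
    weights=[1.0, 1.0, 1.0, 1.0],
    threshold=0.01,
)
```

In this toy example two units are selected for editing, and the scores left on the unedited units sum to less than the 1% threshold.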

  12. Warnings
     1) Model assumptions
        - true data are assumed to be normal/log-normal
        - error is modeled as additive and Gaussian (on a suitable scale)
        - covariance matrices of the true data and error distributions are assumed to be proportional
     2) Population estimates
        The score function and the stopping criterion have a straightforward interpretation only for linear estimates such as means or totals.

  13. The software SeleMix
     • SeleMix is an R package for selective editing based on a contamination model. Its main functionalities are:
       • parameter estimation via the ECM algorithm
       • prediction of “true” values conditional on observed values according to the estimated model
       • computation of score functions, ordering of the units, and identification of influential errors according to the user-specified threshold
     • SeleMix also provides anticipated values (predictions) for units where some (or all) of the Y variables are not observed. Missing values in the X covariates are not allowed.
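
To illustrate what "parameter estimation via the ECM algorithm" involves, here is a self-contained Python sketch for the simplest possible case: one variable, no covariates, contaminated normal model. It is not SeleMix code (SeleMix is an R package and handles multivariate data with covariates); the function names, starting values and simulated data are all assumptions made for this example.

```python
import math
import random
from typing import List, Tuple


def normal_pdf(x: float, mean: float, var: float) -> float:
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def log_likelihood(y: List[float], mu: float, var: float, lam: float, pi: float) -> float:
    return sum(
        math.log((1.0 - pi) * normal_pdf(v, mu, var) + pi * normal_pdf(v, mu, lam * var))
        for v in y
    )


def fit_contaminated_normal(
    y: List[float], lam: float = 10.0, pi: float = 0.1, max_iter: int = 300, tol: float = 1e-8
) -> Tuple[float, float, float, float, List[float]]:
    """ECM for a univariate contaminated normal; returns mu, var, lam, pi, log-lik history."""
    n = len(y)
    mu = sum(y) / n
    var = sum((v - mu) ** 2 for v in y) / n
    history: List[float] = []
    for _ in range(max_iter):
        # E-step: posterior probability that each observation is contaminated.
        tau = []
        for v in y:
            clean = (1.0 - pi) * normal_pdf(v, mu, var)
            dirty = pi * normal_pdf(v, mu, lam * var)
            tau.append(dirty / (clean + dirty))
        # CM-steps: update one parameter at a time, holding the others fixed.
        pi = sum(tau) / n
        omega = [(1.0 - t) + t / lam for t in tau]  # precision weights
        mu = sum(o * v for o, v in zip(omega, y)) / sum(omega)
        var = sum(o * (v - mu) ** 2 for o, v in zip(omega, y)) / n
        lam = sum(t * (v - mu) ** 2 for t, v in zip(tau, y)) / (var * sum(tau))
        lam = max(lam, 1.0 + 1e-6)  # the error component must have the larger variance
        history.append(log_likelihood(y, mu, var, lam, pi))
        if len(history) > 1 and history[-1] - history[-2] < tol:
            break
    return mu, var, lam, pi, history


# Simulated data: 10% of values receive an additive error with variance 24.
rng = random.Random(0)
data = [
    rng.gauss(10.0, 1.0) + (rng.gauss(0.0, math.sqrt(24.0)) if rng.random() < 0.1 else 0.0)
    for _ in range(1000)
]
mu_hat, var_hat, lam_hat, pi_hat, loglik = fit_contaminated_normal(data)
```

Each conditional maximization step increases the expected complete-data log-likelihood, so the observed log-likelihood stored in `loglik` should not decrease from one iteration to the next.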

  14. The Applications: the surveys
     • The economic surveys
       • the annual sampling survey on Information and Communication Technology usage and e-commerce in industry (ICT)
       • the annual sampling survey on Small and Medium Enterprises (SME)
     • The target variables: Turnover, Costs
     • The target parameters: totals of the variables (by domain)

  15. The Applications: the auxiliary sources
     • Administrative archives
       • Financial Statements (FS)
         • Corporate companies (~ 15.000 enterprises)
         • Best harmonized source w.r.t. SBS Regulation definitions
       • Sector Studies Survey (SS)
         • Fiscal survey (~ 4 million enterprises)
         • Detailed costs and income
         • Similar to financial statements
     • Statistical sources
       • Annual total Survey on the Economic Accounts of Enterprises (SEA) (100 employees; ~12,000 enterprises)

  16. ICT - Experiment 1
     • Objective: evaluate the effectiveness of the proposed selective editing in terms of correct identification of influential errors and correct treatment of both influential errors and item non-responses in the ICT context
     • Experimental approach
       • Simulation of contaminated values and item non-responses on edited values of Turnover and Costs in the sub-sample of corporate enterprises of the 2009 ICT sample
       • Monte Carlo evaluation of selective editing & imputation w.r.t. FS (different thresholds, h); “corrections” based on either 2009 FS (true) data or model-based predictions
       • Auxiliary variables: Turnover and Costs from 2008 FS data
     • Results
       Editing a small number of units is sufficient to remove the most influential errors: the bias of the estimates based on edited data is always below 0.3%, while the RRMSE is quite close to the threshold value (0.5%) for almost all domains

  17. ICT - Results of experiment 1
     Relative bias and relative root mean square error (RRMSE) for the estimates based on raw data (RAW), edited data (EDITED) and SeleMix predictions (ROB.EST) (h=0.005)
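
For reference, the two evaluation measures named on this slide can be computed from Monte Carlo replicates as in the Python sketch below. It uses the usual textbook definitions, which the transcript does not spell out; the function names and the example values are hypothetical.

```python
import math
from typing import Sequence


def relative_bias(estimates: Sequence[float], true_value: float) -> float:
    """Mean of the replicated estimates minus the true value, relative to the true value."""
    mean_estimate = sum(estimates) / len(estimates)
    return (mean_estimate - true_value) / true_value


def rrmse(estimates: Sequence[float], true_value: float) -> float:
    """Root mean square error of the replicated estimates, relative to the true value."""
    mse = sum((e - true_value) ** 2 for e in estimates) / len(estimates)
    return math.sqrt(mse) / abs(true_value)


# Hypothetical estimates of a total from four Monte Carlo replicates; the true total is 100.
replicates = [101.0, 99.0, 102.0, 98.0]
bias = relative_bias(replicates, 100.0)
error = rrmse(replicates, 100.0)
```

In this toy example the replicates are centred on the true value, so the relative bias is zero while the RRMSE still reflects their spread.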

  18. ICT - Experiment 2
     • Objective: assess the advantages, in terms of potential reduction of follow-up and interactive editing costs, of integrating selective editing into the current E&I procedure
     • Experimental approach
       • Application of selective editing to raw Turnover and Costs of all the 2008 ICT responding units (different thresholds, h)
       • Comparative evaluation of parameter estimates obtained after selective editing against estimates obtained by the current procedure
       • Auxiliary variables: Turnover and Costs available in at least one external source (SEA, FS, SME, SS, with priority), year 2008
       • Correction using either ICT edited data or model-based predictions
     • Results
       • Large reduction in the number of units selected as suspect compared with the number of units manually revised under the current approach
       • Small distances between the estimates of totals based on selective editing and the corresponding final ICT estimates for most domains

  19. ICT - Results of experiment 2
     Relative distances of SeleMix estimates (Sel) from estimates based on raw data (Raw) and on ICT edited data (ICT) (h=0.01)

  20. SME - Experiment 1
     • Objective
       • Assess the advantages, in terms of potential reduction of follow-up and interactive editing, of integrating selective editing into the current E&I procedure
     • Experimental approach
       • Application of selective editing and imputation to raw Turnover and Costs of all the 2008 SME responding units (different thresholds and imputation approaches)
       • Comparative evaluation of parameter estimates obtained after selective editing & imputation against the “true” estimates obtained from administrative archives
       • Auxiliary variables: Turnover and Costs available in at least one external source (FS, SS, with priority), year 2007

  21. SME - Results of experiment 1
     • As expected, higher values of the threshold h imply a substantial reduction in expected revisions, offset by less accurate estimates
     • In SME this loss of accuracy seems to affect too many domains
     • h=0.01
       • 869 units selected as influential (~2.9% of the experimental sub-sample)
       • Diff(True.Sel) ≤ 1.5 in 89% of domains (the median of the distribution of Diff(True.Sel) over the domains is 0.65)
     • h=0.02
       • 382 units selected as influential (~1.3% of the experimental sub-sample)
       • Diff(True.Sel) ≤ 1.5 in 75% of domains (the median of the distribution of Diff(True.Sel) over the considered domains is 0.9)

  22. SME - Results of experiment 1
     Turnover: relative differences between Diff(True.Sel) when h=0.01 and when h=0.02

  23. Conclusions
     • Application to ICT data
       • Fully satisfactory results. The integration of the method into the current E&I procedure is already in progress
     • Application to SME data
       • Further analyses are needed:
         • Different thresholds for different domains?
         • Additional covariates?

  24. Thank you for your attention

