Migration of a large survey onto a micro-economic platform
Val Cox
April 2014
Micro-economic Platform (MEP)
• Standardises and automates processes
  - Provides more efficient processing, more analysis
• Enables Statistics NZ to gain more from available data
  - Basic principle: use administrative data wherever possible, with surveys filling the gaps
  - Objective: bring core information about every business in the economy into the Longitudinal Business DB to allow Statistics NZ to respond quickly to changing needs for economic statistics
Aim of paper
• To discuss the challenges of building a non-response imputation package for a large survey on the MEP
  - Rationalises the use of
    • Banff for outlier detection and imputation
    • SEVANI (System for Estimation of Variance due to Nonresponse and Imputation) to estimate sampling and non-sampling errors
Annual Enterprise Survey (AES)
• Provides statistics on the financial performance and position of New Zealand businesses
  - Captures about 90% of New Zealand's GDP
• Uses four different major data sources
  • Three administrative sources (covering 72% of the population)
  • One postal survey
Editing strategy of AES on MEP
• Guided by the Methodological Standard for E&I
• Key objective of standard
  - Editing is fit-for-purpose and enables continuous improvement of processes and data quality
• Key principles used
  • Automate editing processes where possible
  • Use Statistics NZ standard editing tools, wherever possible, to achieve standardisation
Editing system of AES in MEP
• Uses Banff to automate and standardise editing and imputation processes
• Uses analytical views to assess the quality of the edited data
Challenges and solutions
A. Sheer volume of data
  - 28 questionnaires, 113 industries and 180 variables
• Solution: Use of a “thin slice” approach
  • Restrict dataset to one questionnaire and one industry to show all stages of E&I are working
  • Once successful, expand dataset to include more industries until all 28 questionnaires are replicated
  • Successful in determining optimal level of automation for correcting failed edits
Challenges and solutions
B. Determining which variable is erroneous when groups of variables must add or subtract to a total
  - Banff “errorloc” procedure always recommends changing one variable by a large amount
  - Change is done by “deterministic” procedure
• Solution: Assign weights to variables
  • Assign lower weights to more reliable variables so Banff doesn’t change their values
  • Examples: totals and gross profit, since respondents use these to determine the tax they pay
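The weighting idea above can be illustrated with a minimal Python sketch. It is not Banff's errorloc or deterministic procedure; it only shows, for a single balance edit (components that must sum to a total), how a per-variable cost of change steers which field is selected and how that field is then recomputed from the others. The field names, the `change_cost` values and both function names are hypothetical, and how such costs map onto Banff's actual weight values is not shown here.

```python
def locate_error(record, components, total, change_cost):
    """Pick the field to change so that sum(components) == total.

    Illustrative only: with a single balance edit, changing any one field
    can restore the balance, so the cheapest field to change is selected.
    """
    residual = sum(record[c] for c in components) - record[total]
    if residual == 0:
        return []  # the edit passes, nothing needs changing
    fields = list(components) + [total]
    return [min(fields, key=lambda f: change_cost[f])]


def deterministic_fix(record, components, total, field):
    """Recompute the selected field from the remaining fields of the edit."""
    fixed = dict(record)
    if field == total:
        fixed[total] = sum(record[c] for c in components)
    else:
        others = sum(record[c] for c in components if c != field)
        fixed[field] = record[total] - others
    return fixed


# Hypothetical record that fails the edit sales + other_income == total_income.
record = {"sales": 1000, "other_income": 50, "total_income": 1500}
components, total = ["sales", "other_income"], "total_income"

# Assumed costs: the total is treated as reliable, so changing it is expensive.
change_cost = {"sales": 2.0, "other_income": 1.0, "total_income": 5.0}

to_change = locate_error(record, components, total, change_cost)
fixed = deterministic_fix(record, components, total, to_change[0])
```

In this example the whole discrepancy of 450 lands on `other_income`, which mirrors the "one variable changed by a large amount" behaviour noted on the slide, while the expensive-to-change `total_income` is left untouched.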
Challenges and solutions
C. Outlier detection
  - Old system detects outlier in 3 key variables but unlinks whole unit (all variables)
  - Banff does univariate outlier detection
• Solution: Compared 2 E&I runs of data
  • 1st run had only the 3 key variables set as outliers and 2nd had all variables included in outlier steps
• Decision: Choose variables to be set as outliers based on the effect on the totals
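To make the difference between unit-level and univariate outlier treatment concrete, here is a small Python sketch. It uses a generic quartile-fence rule as a stand-in, not Banff's outlier method, and the variable names, the multiplier `k` and the function names are assumptions for illustration. Each variable is screened on its own, and a helper reports how much of a variable's total comes from flagged units, which is the kind of comparison the decision above relies on.

```python
from statistics import quantiles


def flag_outliers(values, k=3.0):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (generic fence rule)."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v < low or v > high for v in values]


def outliers_by_variable(units, variables, k=3.0):
    """Screen each variable separately; return {variable: set of unit indices}."""
    flagged = {}
    for var in variables:
        flags = flag_outliers([u[var] for u in units], k)
        flagged[var] = {i for i, is_out in enumerate(flags) if is_out}
    return flagged


def effect_on_total(units, var, flagged_idx):
    """Amount of the variable's total contributed by the flagged units."""
    return sum(u[var] for i, u in enumerate(units) if i in flagged_idx)


# Hypothetical units: the last one is extreme on sales but ordinary on wages.
units = [
    {"sales": 10, "wages": 5}, {"sales": 12, "wages": 6},
    {"sales": 11, "wages": 5}, {"sales": 13, "wages": 6},
    {"sales": 12, "wages": 5}, {"sales": 11, "wages": 6},
    {"sales": 500, "wages": 5},
]
flagged = outliers_by_variable(units, ["sales", "wages"])
sales_effect = effect_on_total(units, "sales", flagged["sales"])
```

Under univariate screening the extreme unit is flagged for `sales` only, so its `wages` value stays available, whereas unlinking the whole unit would remove both.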
Challenges and solutions
D. Running imputation one variable at a time would have been very time-consuming
• Solution: Group variables
  • By imputation method (4 methods)
  • By industry (some industries have different characteristics)
  • By type of variable (e.g. some variables can be negative)
Challenges and solutions
E. Imputation failed for some variables
  - Some imputation cells were too small
• Solution: Merged small imputation cells
  • Each imputation stage was run twice, the first without cell merging and the second with cell merging, resulting in 8 imputation stages
  • Use of a “catch-all” stage at the end (9th stage) to carry out mean imputation by industry
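The staged fallback described above can be sketched in Python as follows. This is a simplified illustration, not the MEP or Banff configuration: it uses mean imputation as the only method (the real system has four), assumes cells are defined by industry and a hypothetical size band, assumes "merging" means dropping the size band, and uses an arbitrary minimum of 3 donors per cell. The final stage stands in for the catch-all mean imputation by industry.

```python
from statistics import mean


def impute_stage(units, var, cell_keys, min_donors):
    """Fill missing `var` with the cell mean when the cell has enough donors.

    Only reported (non-imputed) values act as donors. Returns the number of
    values filled in this stage.
    """
    donors = {}
    for u in units:
        if u[var] is not None and var not in u["imputed"]:
            donors.setdefault(tuple(u[k] for k in cell_keys), []).append(u[var])
    filled = 0
    for u in units:
        if u[var] is None:
            cell = donors.get(tuple(u[k] for k in cell_keys), [])
            if len(cell) >= min_donors:
                u[var] = mean(cell)
                u["imputed"].add(var)
                filled += 1
    return filled


# Assumed stage order: fine cells, merged cells, then a catch-all by industry.
STAGES = [
    (("industry", "size"), 3),  # without cell merging
    (("industry",), 3),         # with cell merging (size band dropped)
    (("industry",), 1),         # catch-all: industry mean from any donor
]


def run_stages(units, var):
    """Run every stage in order; return how many values each stage filled."""
    return [impute_stage(units, var, keys, n) for keys, n in STAGES]


def unit(industry, size, sales):
    return {"industry": industry, "size": size, "sales": sales, "imputed": set()}


# Hypothetical data: one gap per cell, with cells of different sizes.
units = [
    unit("A", "small", 10), unit("A", "small", 20), unit("A", "small", None),
    unit("A", "large", 100), unit("A", "large", 110), unit("A", "large", 120),
    unit("A", "large", None),
    unit("B", "small", 7), unit("B", "small", None),
]
filled_per_stage = run_stages(units, "sales")
```

Each stage only touches values that earlier stages could not fill, so the catch-all stage handles whatever remains after the fine and merged cells have been tried.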
Challenges and solutions
F. Challenges with no solutions
  - Analysis of improvements in the E&I was slow as it took several hours to run E&I and write back to the main data storage area to view data in a cube
  - Attempt to replicate published results as closely as possible created a dilemma: When to stop trying?
  - What was the “right” answer?
SEVANI
• Provided a standardised and automated method to report on estimates of variances due to sampling as well as non-response and imputation
• Challenges:
  - Can only produce output for one variable at a time
  - SEVANI requires many parameters to set up
  - MEP is unit-based so can’t easily output SEVANI results
• Solution:
  - Use of a macro to identify variable names
  - Created SAS code to set up parameters
  - Output SEVANI results outside MEP
Next steps
• Educate the users of the new system on MEP
• Identify potential areas to make improvements in the editing and imputation system
• Create a new MEP collection for Charities data to include its own editing and imputation system