This article delves into methodological hurdles in merging different data sources for business statistics, covering topics like data quality, administrative data considerations, questionnaire design, output data quality, and models for combining data.
Methodological challenges in integrating data collections in business statistics
Paul Smith, Office for National Statistics
Outline • Data quality for different sources • quality measures for survey and administrative inputs • quality measures for outputs • Combinations of sources • familiar and more advanced situations • Mode effects • Models • Discussion
Statistical data collections - quality • Relevance • generally questions conform to desired concepts • may be tailored for • practicality • consistency across collections even if concepts differ • Accuracy • affected by sampling • impacts from non-response, measurement error • Timeliness • generally relatively timely
Administrative data - quality • Relevance • questions conform to administrative (not statistical) concepts • few concessions to statistical needs • Accuracy • unaffected by sampling • processes to discourage non-response • treatment of measurement error differs by variable • Timeliness • generally slow
Differences between types of source • Sampling accuracy is measurable for surveys, not relevant for administrative data sources • confidence in quality reduced for admin data • balance of accuracy measures different • Building statistical requirements into administrative series • requires negotiation and agreement • VAT classification information in the UK • INSEE has statistical and accounting information well integrated
Questionnaire design • Questionnaire design principles mostly used in designing statistical collections • Administrative data seen as “forms” not “questionnaires” • less attention to question phrasing to obtain required answer • more on statutory requirements
Output data quality • Quality of outputs from combined sources can be challenging to measure • function of the qualities of the input sources, and the methods used to combine them • some well-known general approaches • development of measures needed for particular cases (e.g. from models)
Combinations of sources - 1 • Frame and sample information • Sampling frames typically derived from administrative sources • Multiple uses of frame information • sample design • sample selection • validation and editing • estimation and variance estimation • Quality easily derived – standard situation
Combinations of sources - 2 • Dual-frame surveys • More than one administrative source • Pension funds survey in the UK • Units • Business register • Challenges of population inflation if matching not perfect • Estimate probability that unit appears in sample from either source • use in appropriate weighting procedure • adjustment for P(in both surveys) depends on survey type
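The weighting idea on this slide can be sketched in a few lines. This is a minimal illustration, not the method used for any ONS survey: it assumes the two samples are selected independently, that units have already been matched across the two sources, and that each distinct sampled unit carries its selection probability from each frame. The function names and the fields `y`, `p_a` and `p_b` are hypothetical.

```python
def inclusion_prob_either(p_a: float, p_b: float) -> float:
    """P(unit is selected in at least one sample), assuming the two
    samples are drawn independently. p_a or p_b is 0 when the unit
    is absent from that frame."""
    return p_a + p_b - p_a * p_b


def dual_frame_total(units: list[dict]) -> float:
    """Estimate a population total from the distinct sampled units,
    weighting each by the inverse of its combined inclusion probability.
    Each unit must appear once only (duplicates removed by matching)."""
    return sum(u["y"] / inclusion_prob_either(u["p_a"], u["p_b"]) for u in units)


sampled = [
    {"y": 30.0, "p_a": 0.5, "p_b": 0.5},   # on both frames
    {"y": 10.0, "p_a": 0.25, "p_b": 0.0},  # on frame A only
]
estimate = dual_frame_total(sampled)
```

If matching is imperfect, a unit that is really on both frames is treated as two units with smaller combined probabilities, which is one way the population inflation mentioned above arises.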
Combinations of sources - 3 • Multiple surveys • different periodicity • summary information monthly, detail annually • for example capital expenditure – quarterly breakdown, annual summary • Benchmarking • where short-period surveys are small (and variable) and annual surveys are larger (and less variable) • Quality measures • account for sampling error in both sources • account for non-response and measurement errors in larger survey
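As a rough illustration of benchmarking, the simplest approach is pro-rata adjustment: scale the short-period estimates within a year so that they sum to the annual figure. The sketch below assumes the annual estimate is taken as the more reliable benchmark; the function name is hypothetical. Pro-rata scaling is used here only because it is short; it can introduce a step between the last period of one year and the first of the next, which movement-preserving methods are designed to avoid.

```python
def prorata_benchmark(sub_annual: list[float], annual_total: float) -> list[float]:
    """Scale sub-annual estimates so they sum to the annual benchmark."""
    current_sum = sum(sub_annual)
    if current_sum == 0:
        raise ValueError("cannot scale a series that sums to zero")
    factor = annual_total / current_sum
    return [value * factor for value in sub_annual]


quarterly = [10.0, 20.0, 30.0, 40.0]   # small, variable quarterly survey
benchmarked = prorata_benchmark(quarterly, annual_total=200.0)
```

The benchmarked series keeps the quarterly pattern (the ratios between quarters are unchanged) while taking its level from the annual survey.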
Combinations of sources - 4 • Auxiliary information • If administrative concept not close to statistical concept, data may still be useful • Auxiliary information in estimation • not required to be correct, only correlated with outcome • the better the correlation, the better the accuracy • Auxiliary information in validation • use tax data to improve validation follow-up activity • Data confrontation • Use multiple sources to identify discrepancies • Balancing
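The point that auxiliary information need only be correlated with the outcome can be illustrated with a ratio estimator. This is a minimal sketch under simple assumptions (a single stratum, equal-probability sampling, auxiliary value known for every unit on the frame); the function name and the example figures are hypothetical.

```python
def ratio_estimate_total(sample_y: list[float],
                         sample_x: list[float],
                         population_x_total: float) -> float:
    """Ratio estimate of the population total of y.

    sample_y: survey variable for sampled units
    sample_x: auxiliary variable (e.g. an administrative value) for the
              same units
    population_x_total: auxiliary total over the whole frame
    """
    ratio = sum(sample_y) / sum(sample_x)
    return ratio * population_x_total


# The auxiliary values are about 10% below the survey values throughout:
# not "correct" for the statistical concept, but strongly correlated.
y = [11.0, 22.0, 33.0]
x = [10.0, 20.0, 30.0]
total_hat = ratio_estimate_total(y, x, population_x_total=600.0)
```

The estimator corrects for the systematic difference between the two concepts through the ratio, so the auxiliary variable does not need to measure the statistical concept itself.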
Mode effects • Mode effects manifest in several ways • differences in contact rate • differences in response rate given contact • differences in question replies given response • Test differences through a designed experiment (van den Brakel & Renssen 1998, 2005) • evaluates whole-process differences (not individual steps) • non-response adjustment if good predictors for response amongst auxiliary data (variance increases) • model-based adjustments for other changes
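One common form of non-response adjustment is a weighting-class adjustment, sketched below. It is an illustration only, not the procedure of the cited experiments: it assumes the auxiliary data define classes within which response is roughly unrelated to the survey variable, and that every class has at least one respondent. The field names `cls`, `weight` and `responded` are hypothetical.

```python
from collections import defaultdict


def nonresponse_adjusted_weights(units: list[dict]) -> list[float]:
    """Return adjusted weights for responding units, in input order.

    Within each class, respondents' design weights are inflated so that
    they sum to the design-weight total of all sampled units in the class.
    """
    class_total = defaultdict(float)      # design weight of all sampled units
    class_responding = defaultdict(float)  # design weight of respondents
    for u in units:
        class_total[u["cls"]] += u["weight"]
        if u["responded"]:
            class_responding[u["cls"]] += u["weight"]

    return [
        u["weight"] * class_total[u["cls"]] / class_responding[u["cls"]]
        for u in units
        if u["responded"]
    ]


sample = [
    {"cls": "A", "weight": 2.0, "responded": True},
    {"cls": "A", "weight": 2.0, "responded": False},
    {"cls": "B", "weight": 3.0, "responded": True},
]
adjusted = nonresponse_adjusted_weights(sample)
```

The adjusted weights are more variable than the design weights, which is the source of the variance increase noted on the slide.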
Temporal differences • Administrative data often have longer reference period than statistical requirement • Implies temporal disaggregation (model-based) – Dagum & Cholette 2006 • Quality implications • estimated data as inputs • sensitivity of model to interesting changes
Models for combining data • Full flexibility in combining data available through modelling approach • Models at boundary between statistical producer and user • Ideally statistical results insensitive to model assumptions • small area estimates • useful for social surveys • challenges for business surveys not yet resolved • modelling for unit structures - BRES
Discussion • Aim: more from existing sources • often imperfect matches • modelling is the only appropriate approach • subjective • robust to assumptions • sensitivity analysis • Mixed mode collections • usability and low cost • data combination • quality components harder to measure
For more details see the paper, or contact paul.smith@ons.gov.uk