This study delves into improving the methodology for matching the 2021 Census to the Census Coverage Survey to estimate non-response, focusing on minimizing false positives and negatives. Automated and clerical matching methods are compared to enhance accuracy.
Improvements in methodology for matching the 2021 Census to the Census Coverage Survey
Sarah Cummins, Shelley Gammon, Peter Jones
What is the Census Coverage Survey?
• the census aims to count everybody but some people will be missed
• the Census Coverage Survey (CCS) is used to facilitate the estimation of non-response
• 1% sample of postcodes
• run 6 weeks after census
• match to the census to estimate non-response
Census to CCS matching
• quality requirements for matching very high since errors in matching will impact the coverage adjustment:
  • false positive (FP) or incorrect link
  • false negative (FN) or missed match
2011 Census to CCS matching
• automatic match rate: 70% of person matches
• clerical resource: equivalent of 30 FT staff over 30 weeks
• methods:
  • exact and probabilistic matching (high threshold)
  • hierarchical approach matching households first and then individuals within households
  • batch processed by geography
• quality: FP rate <0.01%, FN rate <0.25%
Purpose of research
• Can automated matching be increased in 2021 without incurring unacceptable numbers of false positives?
• trade-off between clerical review and FP error
Automated matching
• Progress made in automated matching methods to deal with large pseudonymised datasets
  • match-keys (deterministic)
  • automated probabilistic matching (Fellegi-Sunter)
  • other experimental methods, e.g. associative matching
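As a rough illustration of the first two approaches listed above, the Python sketch below builds a deterministic match key and computes a Fellegi-Sunter agreement weight for one record pair. The function names, field names, and the m- and u-probabilities are hypothetical values chosen for illustration; they are not taken from the ONS implementation.

```python
import math


def match_key(record, fields):
    """Build a deterministic match key by joining normalised field values."""
    return "|".join(str(record[f]).strip().lower() for f in fields)


def fellegi_sunter_weight(agreements, m, u):
    """Sum the Fellegi-Sunter log2 weights over the compared fields.

    m[field] = P(field agrees | records are a true match)
    u[field] = P(field agrees | records are not a match)
    """
    total = 0.0
    for field, agrees in agreements.items():
        if agrees:
            total += math.log2(m[field] / u[field])
        else:
            total += math.log2((1 - m[field]) / (1 - u[field]))
    return total


# Hypothetical probabilities, for illustration only.
m_probs = {"forename": 0.9, "surname": 0.9}
u_probs = {"forename": 0.1, "surname": 0.1}

record = {"forename": " Ann ", "surname": "LEE", "dob": "1980-01-02"}
key = match_key(record, ["forename", "surname", "dob"])
weight = fellegi_sunter_weight({"forename": True, "surname": False}, m_probs, u_probs)
```

A pair is linked automatically when its total weight exceeds a chosen threshold; raising the threshold trades fewer false positives for more pairs left to clerical review.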
Methods
• 2011 Census and CCS matching used as ‘gold standard’ dataset
• links made in 2011 were treated as true matches due to high quality standards
• Census and CCS were re-matched using new methods
• quality of new matching determined by comparing to links made in 2011 to estimate:
  • % false positive rates
  • % false negative rates
• Research conducted in secure environment
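The comparison against the gold standard can be sketched as a set comparison of link pairs. The slides do not state the denominators used, so this sketch assumes the FP rate is taken over all links made by the new method and the FN rate over all gold-standard links; the function name and the toy identifiers are hypothetical.

```python
def linkage_error_rates(new_links, gold_links):
    """Return (FP %, FN %) for a set of new links against gold-standard links.

    Each link is a (census_id, ccs_id) pair.
    Assumed denominators: FP over links made, FN over gold-standard links.
    """
    new_links = set(new_links)
    gold_links = set(gold_links)
    false_positives = new_links - gold_links   # links made that are not in the gold standard
    false_negatives = gold_links - new_links   # gold-standard links that were missed
    fp_rate = 100 * len(false_positives) / len(new_links) if new_links else 0.0
    fn_rate = 100 * len(false_negatives) / len(gold_links) if gold_links else 0.0
    return fp_rate, fn_rate


# Toy data: four gold-standard links, three links from the new method.
gold = {("c1", "s1"), ("c2", "s2"), ("c3", "s3"), ("c4", "s4")}
new = {("c1", "s1"), ("c2", "s2"), ("c3", "s9")}
fp_rate, fn_rate = linkage_error_rates(new, gold)
```

In the toy data one of the three new links is wrong and two of the four true links are missed.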
Methods
• hierarchical approach first looking at record pairs with agreement on postcode only
  • (1) match keys within postcode
  • + (2) probabilistic matching within postcode
  • + (3) match keys outside postcode
  • + (4) probabilistic matching outside postcode
• aim for FP <0.25%
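The staged design above can be sketched as a pipeline in which each stage only sees the records left unlinked by earlier stages. This minimal sketch covers the two match-key stages only (within and outside postcode) and links a pair only when the key is unique on both sides; the field names, record layout, and uniqueness rule are assumptions for illustration, not the method used in the research.

```python
def key_stage(fields):
    """Return a stage that links records whose key on `fields` is unique on both sides."""
    def stage(census, ccs):
        def index(records):
            idx = {}
            for r in records:
                idx.setdefault(tuple(r[f] for f in fields), []).append(r["id"])
            return idx

        census_idx, ccs_idx = index(census), index(ccs)
        return [
            (ids[0], ccs_idx[k][0])
            for k, ids in census_idx.items()
            if len(ids) == 1 and len(ccs_idx.get(k, [])) == 1
        ]
    return stage


def run_stages(census, ccs, stages):
    """Apply stages in order, passing only unlinked records to the next stage."""
    links = []
    for stage in stages:
        new = stage(census, ccs)
        links.extend(new)
        linked_census = {c for c, _ in new}
        linked_ccs = {s for _, s in new}
        census = [r for r in census if r["id"] not in linked_census]
        ccs = [r for r in ccs if r["id"] not in linked_ccs]
    return links, census, ccs  # residual records would go to later stages or clerical review


census = [
    {"id": "c1", "postcode": "AB1", "forename": "ann", "surname": "lee", "dob": "1980"},
    {"id": "c2", "postcode": "AB1", "forename": "bob", "surname": "ray", "dob": "1975"},
    {"id": "c3", "postcode": "CD2", "forename": "cat", "surname": "fox", "dob": "1990"},
]
ccs = [
    {"id": "s1", "postcode": "AB1", "forename": "ann", "surname": "lee", "dob": "1980"},
    {"id": "s2", "postcode": "XY9", "forename": "bob", "surname": "ray", "dob": "1975"},
    {"id": "s3", "postcode": "CD2", "forename": "dan", "surname": "fox", "dob": "1991"},
]
stages = [
    key_stage(["postcode", "forename", "surname", "dob"]),  # match key within postcode
    key_stage(["forename", "surname", "dob"]),              # match key outside postcode
]
links, census_left, ccs_left = run_stages(census, ccs, stages)
```

Running the strictest stage first means looser stages can only link records that the stricter ones could not, which helps contain the false positive rate.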
Results (1) - match keys within postcode
• each match key required agreement on postcode
• overall FP rate = 0.07%
• overall FN rate = 16.5%
[Table of match-key results not reproduced; its footnote read "*cumulative"]
Implications – clerical resolution
• 2011 Census to CCS matching:
  • matches left ≈ 195,000
• (1) match keys at 0.07% FP error:
  • matches left ≈ 100,000
• (2) probabilistic at 0.1% FP error:
  • matches left ≈ 60,000
• (2) probabilistic at 0.25% FP error:
  • matches left ≈ 25,000
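To put the figures above in proportion, the short sketch below computes the percentage reduction in residual matches relative to the 2011 baseline. It assumes "matches left" means matches remaining for clerical resolution; the variable and function names are illustrative.

```python
def clerical_reduction(baseline, remaining):
    """Percentage reduction in matches left, relative to the baseline."""
    return round(100 * (baseline - remaining) / baseline, 1)


baseline_2011 = 195_000  # matches left in 2011 Census to CCS matching
scenarios = {
    "match keys, 0.07% FP": 100_000,
    "probabilistic, 0.1% FP": 60_000,
    "probabilistic, 0.25% FP": 25_000,
}
reductions = {
    name: clerical_reduction(baseline_2011, left) for name, left in scenarios.items()
}
# reductions == {"match keys, 0.07% FP": 48.7,
#                "probabilistic, 0.1% FP": 69.2,
#                "probabilistic, 0.25% FP": 87.2}
```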
Implications – bias in linkage
• by sex and quinary (five-year) age group
[Chart not reproduced. Key: Z = sex missing; 999 = age missing]
Implications – bias in linkage
• by local authority
Further work
• Error:
  • What is the tolerance for error?
  • Can we focus on adjusting for error rather than minimising it?
• Clerical resource:
  • How can we accurately estimate potential reductions in clerical resource?
  • How can we minimise clerical searching?
• Design changes in Census/CCS:
  • e.g. response channel
Future work – response channel
• Can we generalise these results if the 2021 Census will be predominantly online?
  • comparing FP and FN rates of online forms and paper forms
  • comparing responses from people who submitted both a paper and an online 2011 Census form
• Early results indicate that forename and surname in particular are of better quality when submitted online
Thank you for listening
Any questions? Feel free to contact us:
sarah.cummins@ons.gov.uk
data.linkage@ons.gov.uk