This study explores disclosure risk assessment for microdata by analyzing record-level and file-level risk measures to ensure data privacy. The research delves into model sensitivity, bias criteria, model choice, and practical implications in data analysis.
WP 9 Assessing Disclosure Risk in Microdata using Record Level Measures • Chris Skinner, University of Southampton, C.J.Skinner@soton.ac.uk • Natalie Shlomo, University of Southampton and Office for National Statistics, n.shlomo@soton.ac.uk
Disclosure Risk Assessment for Microdata • Assume: • sample • categorical key variables • no measurement error • Seek: • record level risk measures • aggregated to file level measures
Record Level Measures • Record with combination $k$ of key variable values • Sample count with same combination = $f_k$ • Population count with same combination = $F_k$ • Only consider sample unique records, i.e. $f_k = 1$ • $r_{1k}$ = Pr(population unique) = $\Pr(F_k = 1 \mid f_k = 1)$ • $r_{2k}$ = Pr(correct match) = $E(1/F_k \mid f_k = 1)$
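Aggregated File-level Measures • $\tau_1 = \sum_{k \in SU} \Pr(F_k = 1 \mid f_k = 1)$: expected number of population uniques in sample • $\tau_2 = \sum_{k \in SU} E(1/F_k \mid f_k = 1)$: expected number of correct matches among sample uniques to the population • Note: $SU = \{k : f_k = 1\}$ is the set of sample uniques, with $f_k$ and $F_k$ the sample and population counts for key combination $k$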
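When the population is fully known (as in a census-based evaluation), the realised counterparts of the two file-level measures can be counted directly. The sketch below is only an illustration: the function name `file_level_measures` and the toy records are hypothetical, and it assumes every sample record also appears in the population list.

```python
from collections import Counter

def file_level_measures(sample_keys, population_keys):
    """Return (tau1, tau2) from known sample and population key combinations.

    tau1: number of sample uniques that are also population uniques.
    tau2: sum of 1/F_k over sample uniques (F_k = population count).
    Assumes each sample key also occurs in the population (F_k >= 1).
    """
    f = Counter(sample_keys)      # sample count per key combination
    F = Counter(population_keys)  # population count per key combination
    sample_uniques = [k for k, count in f.items() if count == 1]
    tau1 = sum(1 for k in sample_uniques if F[k] == 1)
    tau2 = sum(1.0 / F[k] for k in sample_uniques)
    return tau1, tau2

# Hypothetical toy data: each record is a (sex, age band) combination.
population = ([("F", "20-29")] * 3 + [("M", "20-29")]
              + [("M", "30-39")] * 2 + [("F", "60+")])
sample = [("F", "20-29"), ("M", "20-29"), ("M", "30-39"), ("F", "60+")]

tau1, tau2 = file_level_measures(sample, population)
print(tau1, round(tau2, 4))
```

In practice the population counts are unknown to the data releaser, which is why the slides turn to model-based estimation of these quantities.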
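Estimation Problem • To make inference about: • Record level measures $r_{1k}$ = Pr(population unique) and $r_{2k}$ = Pr(correct match) for each sample unique $k$ • File level measures $\tau_1$ (expected number of population uniques in sample) and $\tau_2$ (expected number of correct matches)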
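Log-linear Model • $f_k \sim \text{Poisson}(\pi\lambda_k)$, $F_k - f_k \sim \text{Poisson}((1-\pi)\lambda_k)$, and independent given $\lambda_k$ • $\log \lambda_k = x_k'\beta$, where $\pi$ is the sampling fraction • Estimate $\beta$ by maximum likelihood: $\hat\beta$, $\hat\lambda_k = \exp(x_k'\hat\beta)$, $\hat r_{1k} = \exp\{-(1-\pi)\hat\lambda_k\}$, $\hat r_{2k} = [1 - \exp\{-(1-\pi)\hat\lambda_k\}] / \{(1-\pi)\hat\lambda_k\}$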
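As a rough illustration of how fitted cell means translate into record-level risks, the sketch below assumes the standard Poisson formulation in which, for a sample unique, the number of unsampled population units in the same cell is Poisson with mean $(1-\pi)\lambda_k$. Under that assumption Pr(population unique) is $\exp\{-(1-\pi)\lambda_k\}$ and Pr(correct match) is $[1-\exp\{-(1-\pi)\lambda_k\}]/\{(1-\pi)\lambda_k\}$. The function name and the fitted values are hypothetical.

```python
import math

def record_level_risks(lam, pi):
    """Risk measures for one sample-unique record under the assumed Poisson model.

    lam: fitted population cell mean (lambda_k), pi: sampling fraction.
    Returns (r1, r2) = (Pr(population unique), Pr(correct match)).
    """
    mu = (1.0 - pi) * lam          # mean number of unsampled units in the cell
    r1 = math.exp(-mu)             # probability that no unsampled unit shares the cell
    # E[1 / (1 + X)] for X ~ Poisson(mu); tends to 1 as mu -> 0
    r2 = -math.expm1(-mu) / mu if mu > 0 else 1.0
    return r1, r2

# Hypothetical fitted means for three sample-unique records, 2% sample.
pi = 0.02
fitted = [0.3, 2.0, 15.0]
risks = [record_level_risks(lam, pi) for lam in fitted]

# File-level estimates: sum the record-level estimates over sample uniques.
tau1_hat = sum(r1 for r1, _ in risks)
tau2_hat = sum(r2 for _, r2 in risks)
print(risks, tau1_hat, tau2_hat)
```

Because a correct match can happen even when the record is not population unique, `r2` is never smaller than `r1` for the same cell.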
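Some Literature • Skinner and Holmes (1998, JOS): good properties of $\hat r_{1k}$ (estimated probability of population uniqueness) under all two-way interactions log-linear model, where: $\log \lambda_k = x_k'\beta + \varepsilon_k$, $\varepsilon_k \sim N(0, \sigma^2)$ • Elamir and Skinner (2006, JOS): good properties of $\hat r_{1k}$ and $\hat r_{2k}$ (estimated probability of correct match) under all two-way interactions model, but no need for the $\varepsilon_k$ term.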
Model Sensitivity • All two-way interactions model performs well, but there is still evidence of some model-dependence of the estimated risk measures in the neighborhood of this model. • Tendency for estimated risk to decrease as model complexity increases.
Model Choice • Goodness of fit tests? • Pearson? • Likelihood ratio? • AIC, BIC? • Problems with very large and sparse tables
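For reference, the classical fit statistics listed above can be computed as in the sketch below for a Poisson model of cell counts. The function name, the counts, and the fitted means are hypothetical. In very large, sparse tables most fitted means are far below 1, so the usual chi-square reference distributions for the Pearson and likelihood ratio statistics become unreliable.

```python
import math

def fit_statistics(f, mu, n_params):
    """Pearson X^2, likelihood ratio G^2 and AIC for Poisson cell counts.

    f: observed counts, mu: fitted means (all > 0), n_params: model parameters.
    """
    pearson = sum((fk - mk) ** 2 / mk for fk, mk in zip(f, mu))
    # Deviance; the term f*log(f/mu) is taken as 0 when f == 0.
    lr = 2.0 * sum((fk * math.log(fk / mk) if fk > 0 else 0.0) - (fk - mk)
                   for fk, mk in zip(f, mu))
    loglik = sum(fk * math.log(mk) - mk - math.lgamma(fk + 1)
                 for fk, mk in zip(f, mu))
    aic = -2.0 * loglik + 2.0 * n_params
    return {"pearson": pearson, "lr": lr, "aic": aic}

# Hypothetical three-cell example with a one-parameter (constant mean) fit.
stats = fit_statistics(f=[2, 0, 1], mu=[1.0, 1.0, 1.0], n_params=1)
print(stats)
```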
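Bias Criterion • Allow for small departures from the log-linear model • Estimate the bias of the file-level risk estimates under such departures • Choose the model to minimise the estimated bias • Similar to choosing the model to minimise over- or under-dispersion of the sample counts relative to the Poisson model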
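Minimising Over- (Under-) Dispersion • A variance model measures dispersion relative to the Poisson model • Its parameter estimate gives the degree of over- or under-dispersion • An associated statistic tests the hypothesis of equal dispersion (Cameron and Trivedi, 1998)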
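One common form of a regression-based dispersion check in the spirit of Cameron and Trivedi can be sketched as follows. It assumes the alternative $\mathrm{Var}(f_k) = (1+\alpha)\mu_k$, so that $\alpha = 0$ corresponds to Poisson equidispersion, and estimates $\alpha$ as the mean of $\{(f_k - \hat\mu_k)^2 - f_k\}/\hat\mu_k$. This is not necessarily the exact statistic used in the slides, and the function name and data are hypothetical.

```python
import math

def dispersion_test(f, mu):
    """Estimate alpha in Var(f) = (1 + alpha) * mu and a t-type statistic.

    f: observed counts, mu: fitted Poisson means (all > 0).
    alpha > 0 suggests over-dispersion, alpha < 0 under-dispersion.
    """
    if len(f) < 2:
        raise ValueError("need at least two cells")
    z = [((fk - mk) ** 2 - fk) / mk for fk, mk in zip(f, mu)]
    n = len(z)
    alpha_hat = sum(z) / n                      # regression of z on a constant
    var_z = sum((zk - alpha_hat) ** 2 for zk in z) / (n - 1)
    t_stat = alpha_hat / math.sqrt(var_z / n)   # compare with N(0, 1) in large samples
    return alpha_hat, t_stat

# Hypothetical counts with a constant fitted mean equal to the sample mean.
counts = [0, 1, 0, 3, 0, 0, 2, 0]
fitted = [sum(counts) / len(counts)] * len(counts)
alpha_hat, t_stat = dispersion_test(counts, fitted)
print(alpha_hat, t_stat)
```

A model-selection rule along the lines of the slide would prefer the candidate model whose estimated dispersion is closest to zero.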
Samples from 2001 UK Census • Two areas with population of 944,793 • ‘Large’ Key: Area (2), Sex (2), Age (101), Marital Status (6), Ethnicity (17), Economic Activity (10): 412,080 cells • ‘Small’ Key: same except Age (18): 73,440 cells
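The cell counts follow from multiplying the numbers of categories of the key variables. The short check below reproduces them and also shows how sparse the resulting tables are relative to the population; the variable and function names are illustrative only.

```python
import math

population = 944_793

# Number of categories per key variable, as listed on the slide.
large_key = {"Area": 2, "Sex": 2, "Age": 101, "Marital Status": 6,
             "Ethnicity": 17, "Economic Activity": 10}
small_key = {**large_key, "Age": 18}   # same key with age banded into 18 groups

def n_cells(key):
    """Number of cells in the cross-classification of all key variables."""
    return math.prod(key.values())

for name, key in [("Large", large_key), ("Small", small_key)]:
    cells = n_cells(key)
    print(f"{name} key: {cells:,} cells, "
          f"{population / cells:.2f} people per cell on average")
```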
Small key, simple random sample of size 18,896 • True values: number of population uniques in sample: • sum of $1/F_k$ over sample uniques ($F_k$ = population count):
Model Search Algorithm • Starting solution: all 2-way interactions log-linear model • Search by: • Removing terms • Adding terms • Swapping terms • TABU method of Drezner, Marcoulides and Salhi (1999)
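The search moves listed above (remove, add, swap) can be sketched with a generic tabu search. This is a simplified illustration, not the specific procedure of Drezner, Marcoulides and Salhi (1999): the tabu list here simply stores recently visited models, and the `score` function, term labels, and target model are hypothetical stand-ins for a real criterion such as estimated dispersion.

```python
import itertools

def neighbours(model, candidates):
    """Yield all models one move away: remove, add, or swap a term."""
    model = frozenset(model)
    outside = [t for t in candidates if t not in model]
    for t in model:
        yield model - {t}                      # remove a term
    for t in outside:
        yield model | {t}                      # add a term
    for t_out, t_in in itertools.product(model, outside):
        yield (model - {t_out}) | {t_in}       # swap one term for another

def tabu_search(start, candidates, score, n_iter=50, tabu_size=10):
    """Minimise score over sets of terms, forbidding recently visited models."""
    current = frozenset(start)
    best, best_score = current, score(current)
    tabu = [current]
    for _ in range(n_iter):
        options = [m for m in neighbours(current, candidates) if m not in tabu]
        if not options:
            break
        current = min(options, key=score)      # best non-tabu move, even if worse
        tabu.append(current)
        if len(tabu) > tabu_size:
            tabu.pop(0)                        # forget the oldest visited model
        current_score = score(current)
        if current_score < best_score:
            best, best_score = current, current_score
    return best, best_score

# Hypothetical example: candidate two-way interaction terms among five variables.
variables = ["s", "a", "m", "et", "ec"]
candidates = [f"{u}*{v}" for u, v in itertools.combinations(variables, 2)]
target = frozenset(["s*a", "a*m", "m*et"])     # pretend this model scores best

def score(model):
    """Toy criterion: number of terms by which the model differs from target."""
    return len(frozenset(model) ^ target)

best, best_score = tabu_search(candidates, candidates, score)
print(sorted(best), best_score)
```

Accepting the best non-tabu neighbour even when it is worse than the current model is what lets the search move away from local optima, at the cost of needing to track the best model seen so far.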
Record Level Risk Measures • Preferred Model: {ea}{s*a}{s*m}{s*et}{s*ec}{a*m}{a*et}{a*ec}{m*et}{m*ec} • True Global Risk: • Estimated Global Risk:
Conclusions • Model selection by assessing over-, under-dispersion • Similar risk estimates for models with nearly Poisson dispersion • Further work: • - stratification of files • - complex survey designs