Assessing Disclosure Risk in Sample Microdata Under Misclassification

Natalie Shlomo, Southampton Statistical Sciences Research Institute, University of Southampton, N.Shlomo@soton.ac.uk. This is joint work with Prof Chris Skinner.

Presentation Transcript

  2. Topics Covered
     • Introduction and motivation
     • Disclosure risk assessment for sample microdata:
       - Probabilistic modelling extended for misclassification
       - Probabilistic record linkage – linking the frameworks
     • Risk-utility framework for microdata subject to misclassification
     • Discussion

  3. Introduction
     • Disclosure risk scenario: 'intruder' attack on microdata through linking to available public data sources
     • Linkage via identifying key variables common to both sources, e.g. sex, age, region, ethnicity
     • Agencies limit risk of identification through statistical disclosure limitation (SDL) methods:
       - Non-perturbative: sub-sampling, recoding and collapsing categories of key variables, deleting variables
       - Perturbative: data swapping, additive noise, misclassification (PRAM) and synthetic data

  4. Introduction
     • Need to quantify the risk of identification to inform microdata release
     • Probabilistic models for quantifying risk of identification based on population uniqueness on identifying key variables:
       - Population counts in contingency tables spanned by key variables are unknown
       - Distribution assumptions to draw inference from the sample for estimating population parameters
       - Assumes that the microdata have not been altered, either through misclassification arising from data processes or purposely introduced for SDL

  5. Introduction
     • Expand probabilistic modelling for quantifying risk of identification under misclassification/perturbation
     • For perturbative methods of SDL, risk assessment is typically based on probabilistic record linkage:
       - Conservative assessment of risk of identification
       - Assumes that the intruder has access to the original dataset and does not take into account the protection afforded by sampling
     • Fit probabilistic record linkage into the probabilistic modelling framework for categorical matching variables

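  6. Disclosure Risk Assessment
     • Probabilistic modelling: no misclassification
     • Let $f = \{f_k\}$ denote a q-way frequency table which is a sample from a population table $F = \{F_k\}$, where $F_k$ indicates the population count and $f_k$ the sample count in cell $k$
     • Disclosure risk measures: $\tau_1 = \sum_k I(f_k = 1, F_k = 1)$ and $\tau_2 = \sum_k I(f_k = 1)\, \frac{1}{F_k}$
     • For unknown population counts, estimate from the conditional distribution of $F_k \mid f_k$: $\hat{\tau}_1 = \sum_k I(f_k = 1)\, \hat{P}(F_k = 1 \mid f_k = 1)$ and $\hat{\tau}_2 = \sum_k I(f_k = 1)\, \hat{E}(1/F_k \mid f_k = 1)$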

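  7. Disclosure Risk Assessment
     • Natural assumption: $F_k \sim \text{Poisson}(\lambda_k)$, independently across cells
     • Bernoulli sampling: $f_k \mid F_k \sim \text{Bin}(F_k, \pi_k)$
     • It follows that: $f_k \sim \text{Poisson}(\pi_k \lambda_k)$ and $F_k \mid f_k \sim f_k + \text{Poisson}(\lambda_k (1 - \pi_k))$
     • where the $F_k \mid f_k$ are conditionally independent and $\pi_k$ is the sampling fraction in cell $k$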
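The "it follows that" step on this slide is the standard Poisson thinning argument. As a sketch, assuming the usual reading of the slide (population counts $F_k \sim \text{Poisson}(\lambda_k)$ and Bernoulli sampling with inclusion probability $\pi_k$; the symbols are reconstructed rather than taken from the transcript), write $g$ for the number of unsampled units in cell $k$:

```latex
\begin{aligned}
P(f_k = f,\; F_k - f_k = g)
  &= \frac{e^{-\lambda_k}\lambda_k^{f+g}}{(f+g)!}\,
     \binom{f+g}{f}\,\pi_k^{f}(1-\pi_k)^{g} \\
  &= \frac{e^{-\pi_k\lambda_k}(\pi_k\lambda_k)^{f}}{f!}\cdot
     \frac{e^{-(1-\pi_k)\lambda_k}\{(1-\pi_k)\lambda_k\}^{g}}{g!}
\end{aligned}
```

The joint probability factorises, so $f_k$ and $F_k - f_k$ are independent Poisson counts with means $\pi_k \lambda_k$ and $(1-\pi_k)\lambda_k$. In particular, a sample unique is a population unique with probability $P(F_k = 1 \mid f_k = 1) = \exp\{-\lambda_k(1-\pi_k)\}$.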

  8. Disclosure Risk Assessment
     • Skinner and Holmes (1998) and Elamir and Skinner (2006) use log-linear models to estimate parameters
     • Sample frequencies $f_k$ are independent Poisson distributed with a mean of $\mu_k = \pi_k \lambda_k$
     • Log-linear model for estimating $\mu_k$ expressed as: $\log \mu_k = x_k' \beta$
     • where $x_k$ is the row of the design matrix for cell $k$ (key variables and their interactions)
     • MLEs calculated by solving the score function: $\sum_k [f_k - \exp(x_k' \beta)]\, x_k = 0$
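The score equation on this slide can be solved numerically. The following is a minimal sketch, assuming the Poisson log-linear model $\log \mu_k = x_k'\beta$ and the score function $\sum_k [f_k - \exp(x_k'\beta)] x_k = 0$; the function name `fit_loglinear`, its arguments and the Newton-Raphson solver are illustrative choices, not the authors' implementation.

```python
import numpy as np

def fit_loglinear(X, f, pi, n_iter=50, tol=1e-10):
    """Fit log(mu_k) = x_k' beta to sample counts f by Newton-Raphson.

    X  : (K, p) design matrix (hypothetical layout: one row per cell)
    f  : (K,) sample cell counts
    pi : sampling fraction (scalar or (K,) array)
    Returns beta_hat, mu_hat (fitted sample means) and lam_hat = mu_hat / pi.
    """
    X = np.asarray(X, dtype=float)
    f = np.asarray(f, dtype=float)
    # Starting value: least squares on the log scale (0.5 avoids log(0))
    beta = np.linalg.lstsq(X, np.log(f + 0.5), rcond=None)[0]
    for _ in range(n_iter):
        mu = np.exp(X @ beta)
        score = X.T @ (f - mu)              # sum_k (f_k - mu_k) x_k
        info = X.T @ (X * mu[:, None])      # sum_k mu_k x_k x_k'
        step = np.linalg.solve(info, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    mu_hat = np.exp(X @ beta)
    return beta, mu_hat, mu_hat / pi

# Toy 2x2 table with a main-effects (independence) model
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]])
f = np.array([10, 5, 4, 1])
beta_hat, mu_hat, lam_hat = fit_loglinear(X, f, pi=0.1)
```

For the main-effects model on a two-way table the fitted values equal row total times column total divided by the sample size, which gives a quick check of the solver.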

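  9. Disclosure Risk Assessment
     • Fitted values calculated by: $\hat{\mu}_k = \exp(x_k' \hat{\beta})$ and $\hat{\lambda}_k = \hat{\mu}_k / \pi_k$
     • Individual risk measures estimated by: $\hat{P}(F_k = 1 \mid f_k = 1) = \exp(-\hat{\lambda}_k (1 - \pi_k))$ and $\hat{E}(1/F_k \mid f_k = 1) = [1 - \exp(-\hat{\lambda}_k (1 - \pi_k))] / [\hat{\lambda}_k (1 - \pi_k)]$
     • Skinner and Shlomo (2009) develop goodness-of-fit criteria which minimize the bias of disclosure risk estimates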
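As a sketch of how the per-record and global risk estimates could be computed from fitted values, assume the Poisson model in which the unsampled count in cell $k$ is Poisson with mean $\nu_k = \hat{\lambda}_k(1-\pi_k)$, so that $P(F_k=1 \mid f_k=1) = e^{-\nu_k}$ and $E(1/F_k \mid f_k=1) = (1-e^{-\nu_k})/\nu_k$. The function names are hypothetical.

```python
import numpy as np

def per_record_risk(lam_hat, pi):
    """Return (P(F=1 | f=1), E(1/F | f=1)) for each cell."""
    nu = np.asarray(lam_hat, dtype=float) * (1.0 - pi)
    p_unique = np.exp(-nu)
    # (1 - exp(-nu)) / nu, using expm1 for accuracy; the limit is 1 as nu -> 0
    safe_nu = np.where(nu > 0, nu, 1.0)
    exp_inv = np.where(nu > 0, -np.expm1(-nu) / safe_nu, 1.0)
    return p_unique, exp_inv

def global_risk(f, lam_hat, pi):
    """Sum the per-record measures over sample uniques (cells with f_k = 1)."""
    sample_unique = np.asarray(f) == 1
    p_unique, exp_inv = per_record_risk(lam_hat, pi)
    return p_unique[sample_unique].sum(), exp_inv[sample_unique].sum()

# Toy usage: four cells, two of which are sample uniques
tau1_hat, tau2_hat = global_risk(f=[1, 2, 1, 0], lam_hat=[2.0, 2.0, 4.0, 1.0], pi=0.5)
```

The first returned value estimates the expected number of sample uniques that are population unique; the second estimates the expected number of correct matches when each sample unique is matched to the population.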

  10. Disclosure Risk Assessment
      • Criteria related to tests for over- and under-dispersion:
        - Over-fitting: sample marginal counts produce too many random zeros, leading to expected cell counts that are too high for non-zero cells and under-estimation of risk
        - Under-fitting: sample marginal counts don't take into account structural zeros, leading to expected cell counts that are too low for non-zero cells and over-estimation of risk
      • The criteria select the model using a forward search algorithm which minimizes $\hat{B}_i / \sqrt{\hat{v}_i}$ for $i = 1, 2$, where $\hat{B}_i$ is the estimated bias of the $i$-th risk estimate and $\hat{v}_i$ is the variance of $\hat{B}_i$

  11. Disclosure Risk Assessment
      • Example: population of 944,793 from the UK 2001 Census, SRS sample size 9,448
      • Key: Area (2), Sex (2), Age (101), Marital Status (6), Ethnicity (17), Economic Activity (10), giving 412,080 cells
      • Model selection:
        - Starting solution: main-effects log-linear model, which indicates under-fitting (minimum error statistics too large)
        - Add in higher interaction terms until the minimum error statistics indicate fit
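The size of the key in this example follows directly from the category counts. A short sketch of the arithmetic (the dictionary keys are illustrative labels, the numbers are those on the slide):

```python
from math import prod

# Number of categories per key variable, as listed on the slide
key_sizes = {"area": 2, "sex": 2, "age": 101, "marital_status": 6,
             "ethnicity": 17, "economic_activity": 10}

n_cells = prod(key_sizes.values())        # cells in the cross-classified key
n_population = 944_793
n_sample = 9_448

sampling_fraction = n_sample / n_population
# Each sampled record occupies one cell, so this bounds the share of non-empty cells
max_nonempty_share = n_sample / n_cells
```

With a 1% sample, at most about 2.3% of the cells can be non-empty, which is why the choice of log-linear model matters so much for smoothing over the empty and sparse cells.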

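  12. Model Search Example (SRS n=9,448)
      • True values and model search results: [table not preserved in the transcript]
      • Key variable abbreviations: Area – ar, Sex – s, Age – a, Marital Status – m, Ethnicity – et, Economic Activity – ec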

  13. Model Search Example
      • Preferred model: {a*ec}{a*et}{a*m}{s*ec}{ar*a}
      • True global risk versus estimated global risk: [values not preserved in the transcript]
      • [Figure: estimated per-record risk measure against true risk measure, log scale]

  14. Disclosure Risk Assessment
      • Skinner and Shlomo (2009) address complex survey designs:
        - Sampling clusters introduce dependencies, but key variables cut across clusters and the independence assumption holds in practice
        - Stratification: include strata id in key variables to account for differential inclusion probabilities
        - Survey weights: use pseudo maximum likelihood estimation, where the score function is modified to $\sum_k [\hat{F}_k - \exp(x_k' \beta)]\, x_k = 0$ and $\pi_k$ is changed to $\hat{\pi}_k = f_k / \hat{F}_k$, where $\hat{F}_k = \sum_{i \in k} w_i$ is the sum of the survey weights $w_i$ of the sample units in cell $k$
      • Partition large tables and assess partitions separately (assumes that the partitioning variable has an interaction with other key variables)

  15. Disclosure Risk Assessment Under Misclassification
      • The model assumes no misclassification errors, either arising from data processes or purposely introduced for SDL
      • Shlomo and Skinner (2010) address misclassification errors
      • Let: $M_{kj} = P(\tilde{X} = k \mid X = j)$
      • where the cross-classified key variables are:
        - $X$: in the population, fixed
        - $\tilde{X}$: in the microdata, subject to misclassification

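  16. Disclosure Risk Assessment Under Misclassification
      • Notation: $M_{kj} = P(\tilde{X} = k \mid X = j)$, $\pi$ is the sampling fraction, $F_j$ is the population count in cell $j$, and $\tilde{f}_k$, $\tilde{F}_k$ are the sample and population counts in cell $k$ of the misclassified key
      • The per-record disclosure risk measure of a match of external unit B to a unique record in microdata A that has undergone misclassification:
        $\frac{M_{kk} / (1 - \pi M_{kk})}{\sum_j F_j M_{kj} / (1 - \pi M_{kj})}$   (1)
      • For small misclassification and small sampling fractions: $M_{kk} / \sum_j F_j M_{kj}$ or $M_{kk} / \tilde{F}_k$   (2)
      • Global measure: $\tau = \sum_k I(\tilde{f}_k = 1)\, M_{kk} / \tilde{F}_k$, estimated by:
        $\hat{\tau} = \sum_k I(\tilde{f}_k = 1)\, M_{kk}\, \hat{E}(1/\tilde{F}_k \mid \tilde{f}_k = 1)$   (3)
      • where the per-record risk is: $M_{kk}\, \hat{E}(1/\tilde{F}_k \mid \tilde{f}_k = 1)$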
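A minimal numerical sketch of measures (1) and (2), assuming they take the forms $[M_{kk}/(1-\pi M_{kk})] / \sum_j [F_j M_{kj}/(1-\pi M_{kj})]$ and $M_{kk}/\sum_j F_j M_{kj}$, with `M[k, j]` holding $P(\tilde{X}=k \mid X=j)$ and a single sampling fraction `pi`. The function names and the toy inputs are hypothetical.

```python
import numpy as np

def risk_exact(F, M, pi, k):
    """Measure (1) for cell k: probability of a correct match to a sample unique."""
    F = np.asarray(F, dtype=float)
    M = np.asarray(M, dtype=float)
    numerator = M[k, k] / (1.0 - pi * M[k, k])
    denominator = np.sum(F * M[k, :] / (1.0 - pi * M[k, :]))
    return numerator / denominator

def risk_approx(F, M, k):
    """Measure (2): approximation for small misclassification and sampling fraction."""
    F = np.asarray(F, dtype=float)
    M = np.asarray(M, dtype=float)
    return M[k, k] / np.sum(F * M[k, :])

# Toy usage: two cells, 20% misclassification between them
F = [5, 3]
M = np.array([[0.8, 0.2],
              [0.2, 0.8]])
r1 = risk_exact(F, M, pi=0.01, k=0)
r2 = risk_approx(F, M, k=0)
```

With no misclassification (`M` equal to the identity matrix) both functions reduce to $1/F_k$, the usual risk for a sample unique.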

  17. Perturbation Methods of SDL
      • PRAM (post-randomisation method)
      • Probability transition matrix $P$ containing conditional probabilities $p_{ij} = P(\text{perturbed category is } j \mid \text{original category is } i)$ for a categorical variable $c$
      • Let $T$ be the vector of frequencies of the categories of $c$
      • On each record, the category of $c$ is changed or not changed according to $P$ and the result of a draw of a random variate $u$
      • $T^{*}$: vector of perturbed frequencies
      • Unbiased moment estimator of the original data: $\hat{T} = T^{*} P^{-1}$, assuming $P$ has an inverse (dominant on the diagonal)
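The PRAM steps on this slide can be sketched as follows, assuming a row-stochastic transition matrix `P` with `P[i, j]` the probability that original category `i` becomes `j`, so that $E(T^{*} \mid T) = T P$ and the moment estimator is $T^{*} P^{-1}$. The function names are hypothetical and categories are coded as integers starting at 0.

```python
import numpy as np

def pram(categories, P, rng):
    """Perturb each record's category with one uniform draw u per record."""
    categories = np.asarray(categories)
    cum = np.cumsum(np.asarray(P, dtype=float), axis=1)
    u = rng.random(len(categories))
    # New category = number of cumulative thresholds in the record's row that u has passed
    new = (u[:, None] >= cum[categories]).sum(axis=1)
    return np.minimum(new, cum.shape[1] - 1)   # guard against rounding in the last column

def moment_estimate(T_star, P):
    """Solve T_hat P = T_star for T_hat, i.e. T_hat = T_star P^{-1}."""
    return np.linalg.solve(np.asarray(P, dtype=float).T, np.asarray(T_star, dtype=float))

# Toy usage
rng = np.random.default_rng(0)
P = np.array([[0.8, 0.2],
              [0.1, 0.9]])
original = rng.integers(0, 2, size=1000)
perturbed = pram(original, P, rng)
T_star = np.bincount(perturbed, minlength=2)
T_hat = moment_estimate(T_star, P)
```

The estimator is unbiased for the original frequencies but can return non-integer or even negative values in small samples, which is one motivation for the invariant variant of PRAM.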

  18. Perturbation Methods of SDL
      • PRAM (post-randomisation method), continued
      • Invariant PRAM: define a transition matrix $R$ such that $T R = T$ (the vector of original frequencies $T$ is an eigenvector of $R$)
      • To ensure the correct non-perturbation probability on the diagonal, define $R^{*} = \alpha R + (1 - \alpha) I$
      • Expected values of the marginal distribution are preserved
      • Exact marginal distribution preserved by using a without-replacement selection strategy to select records for perturbation
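One known way to obtain an invariant matrix is the two-stage construction $R = PQ$, where $Q$ is the Bayes reversal of $P$ given the original frequencies. The slide does not say which construction was used, so the following is a sketch under that assumption; the function name and the parameter `alpha` (the mixing weight toward the identity) are illustrative.

```python
import numpy as np

def invariant_pram(P, T, alpha=1.0):
    """Build R* = alpha * (P Q) + (1 - alpha) * I with T R* = T.

    P : (c, c) row-stochastic transition matrix
    T : (c,) original frequencies (all expected perturbed counts must be > 0)
    """
    P = np.asarray(P, dtype=float)
    T = np.asarray(T, dtype=float)
    expected = T @ P                               # expected perturbed frequencies
    # Q[j, k] = P(original = k | perturbed = j) = T_k p_kj / sum_l T_l p_lj
    Q = (P * T[:, None]).T / expected[:, None]
    R = P @ Q
    return alpha * R + (1.0 - alpha) * np.eye(len(T))

# Toy usage
P = np.array([[0.8, 0.1, 0.1],
              [0.2, 0.7, 0.1],
              [0.1, 0.1, 0.8]])
T = np.array([50.0, 30.0, 20.0])
R_star = invariant_pram(P, T, alpha=0.5)
```

Because both $PQ$ and the identity leave $T$ unchanged, any mixture of the two does as well, so `alpha` can be tuned to set the diagonal without losing invariance.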

  19. Perturbation Methods of SDL
      Random Record Swapping
      • Exchange values of a key variable between pairs of records
      • Pairs selected within control strata to minimize bias
      • Typically a geographical variable is swapped within a large area:
        - Geography is highly identifiable
        - Conditional independence assumption usually met (sensitive variables relatively independent of geography)
        - Does not produce inconsistent records
        - Marginal distributions preserved at higher geographies
      • Can be targeted to high-risk records
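A minimal sketch of random record swapping within control strata, assuming that `rate` is the share of records in each stratum selected for swapping and that the selected records are paired at random. The function name and array layout are hypothetical.

```python
import numpy as np

def swap_within_strata(geo, strata, rate, rng):
    """Swap the geography code between randomly paired records in each stratum."""
    geo = np.array(geo, copy=True)
    strata = np.asarray(strata)
    for s in np.unique(strata):
        idx = np.flatnonzero(strata == s)
        n_pairs = int(rate * len(idx) / 2)          # pairs to swap in this stratum
        chosen = rng.choice(idx, size=2 * n_pairs, replace=False)
        a, b = chosen[:n_pairs], chosen[n_pairs:]   # disjoint halves form the pairs
        geo[a], geo[b] = geo[b].copy(), geo[a].copy()
    return geo

# Toy usage: eight records in two control strata, 50% swap rate
rng = np.random.default_rng(0)
geo = np.array([0, 0, 1, 1, 2, 2, 3, 3])
strata = np.array([0, 0, 0, 0, 1, 1, 1, 1])
swapped = swap_within_strata(geo, strata, rate=0.5, rng=rng)
```

Swapping only permutes values within a stratum, so the distribution of geography within each control stratum is unchanged. Paired records may share the same geography, in which case the swap has no effect, so the share of records whose value actually changes can be below `rate`.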

  20. Misclassification Example
      • Population of individuals from the 2001 United Kingdom (UK) Census: N=1,468,255
      • 1% SRS sample: n=14,683
      • Six key variables: Local Authority (LAD) (11), sex (2), age groups (24), marital status (6), ethnicity (17), economic activity (10), giving K=538,560 cells

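  21. Misclassification Example
      • Record swapping: LAD swapped randomly, e.g. for a 20% swap, with diagonal and off-diagonal entries of the misclassification matrix defined in terms of $n_k$, the number of records in the sample from LAD $k$ [formulas not preserved in the transcript]
      • PRAM: LAD misclassified, e.g. for a 20% misclassification, with diagonal entries, off-diagonal entries and a parameter [formulas not preserved in the transcript]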

  22. Misclassification Example
      • Random 20% perturbation on LAD
      • Global risk measures: expected correct matches from sample uniques (SUs) [table not preserved in the transcript]
      • Expected correct match per sample unique: PRAM 10.8%, record swapping 10.6%

  23. Misclassification Example
      • Estimating individual per-record risk measures for a 20% random swap based on log-linear modelling (log scale)
      • [Figure: risk measure (1) against estimated risk measure (3)]
      • From the perspective of the intruder, it is difficult to identify high-risk (population unique) records

  24. Information Loss Measures
      • Utility measured by whether inference can be carried out on perturbed data similarly to the original data
      • Use proxy information loss measures on distributions calculated from microdata
      • Distance metrics: $AAD(D_{orig}, D_{pert}) = \sum_c |D_{pert}(c) - D_{orig}(c)| / n_c$, where $n_c$ is the number of cells in the distribution
      • Let $\bar{D}_{orig} = \sum_c D_{orig}(c) / n_c$ be the average cell size; then $AAD / \bar{D}_{orig}$ is a measure of average absolute perturbation compared to average cell size
      • Also, can consider the Kolmogorov-Smirnov statistic, Hellinger's distance and relative differences in means or variances
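A sketch of these proxy measures for two frequency distributions over the same cells. It assumes the average absolute distance is the mean of $|D_{pert}(c) - D_{orig}(c)|$ over cells and that the relative version divides by the average original cell size; the Hellinger distance is computed on the normalised distributions. Function names are hypothetical.

```python
import numpy as np

def aad(d_orig, d_pert):
    """Average absolute distance per cell."""
    d_orig = np.asarray(d_orig, dtype=float)
    d_pert = np.asarray(d_pert, dtype=float)
    return np.abs(d_pert - d_orig).mean()

def relative_aad(d_orig, d_pert):
    """Average absolute perturbation relative to the average original cell size."""
    return aad(d_orig, d_pert) / np.asarray(d_orig, dtype=float).mean()

def hellinger(d_orig, d_pert):
    """Hellinger distance between the two distributions after normalising to sum 1."""
    p = np.asarray(d_orig, dtype=float)
    q = np.asarray(d_pert, dtype=float)
    p, q = p / p.sum(), q / q.sum()
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

# Toy usage: three cells, two of them perturbed by 2
d_orig = np.array([10, 20, 30])
d_pert = np.array([12, 18, 30])
loss = (aad(d_orig, d_pert), relative_aad(d_orig, d_pert), hellinger(d_orig, d_pert))
```

These measures can be plotted against a global risk measure to place each SDL method on a risk-utility map.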

  25. Risk-Utility Map
      • [Figure: risk-utility map for PRAM and record swapping, random perturbation versus targeted perturbation on non-white ethnicities]

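  26. Probabilistic Record Linkage
      • $\tilde{X}_a$: value of the vector of cross-classified identifying key variables for unit $a$ in the microdata ($a \in s_1$)
      • $X_b$: corresponding value for unit $b$ in the external database ($b \in s_2$)
      • Misclassification mechanism via probability matrix: $P(\tilde{X}_a = k \mid X_a = j) = M_{kj}$
      • Comparison vector for pairs of units: $\gamma(\tilde{X}_a, X_b)$, $(a, b) \in s_1 \times s_2$
      • For a subset $\tilde{s} \subset s_1 \times s_2$, partition the set of pairs into Matches (M) and Non-matches (U) through the likelihood ratio $m(\gamma) / u(\gamma)$, where $m(\gamma) = P(\gamma \mid (a, b) \in M)$ and $u(\gamma) = P(\gamma \mid (a, b) \in U)$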

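  27. Probabilistic Record Linkage
      • $p = P((a, b) \in M)$: probability that a pair is in M
      • Probability of a correct match: $p_{M \mid \gamma} = P((a, b) \in M \mid \gamma) = \frac{m(\gamma)\, p}{m(\gamma)\, p + u(\gamma)(1 - p)}$, where $m(\gamma)$ and $u(\gamma)$ are the probabilities of the comparison vector $\gamma$ among matches and non-matches
      • Estimate parameters using previous test data or the EM algorithm, assuming conditional independence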
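The probability of a correct match follows from Bayes' rule. A minimal sketch, assuming binary agreement indicators for each matching variable and conditional independence across variables, so that $m(\gamma)$ and $u(\gamma)$ are products of per-variable agreement probabilities; the function names and the numbers are illustrative only.

```python
import numpy as np

def likelihoods(gamma, m_probs, u_probs):
    """m(gamma) and u(gamma) under conditional independence.

    gamma   : 0/1 agreement indicator per matching variable
    m_probs : P(agree | pair is a match) per variable
    u_probs : P(agree | pair is a non-match) per variable
    """
    gamma = np.asarray(gamma, dtype=float)
    m_probs = np.asarray(m_probs, dtype=float)
    u_probs = np.asarray(u_probs, dtype=float)
    m = np.prod(m_probs ** gamma * (1 - m_probs) ** (1 - gamma))
    u = np.prod(u_probs ** gamma * (1 - u_probs) ** (1 - gamma))
    return m, u

def match_probability(m, u, p):
    """P(pair in M | gamma) = m p / (m p + u (1 - p))."""
    return m * p / (m * p + u * (1 - p))

# Toy usage: agreement on the first variable, disagreement on the second
m, u = likelihoods(gamma=[1, 0], m_probs=[0.9, 0.8], u_probs=[0.1, 0.2])
posterior = match_probability(m, u, p=0.5)
```

In practice the per-variable probabilities and the prior `p` would be estimated from test data or by the EM algorithm, as the slide notes.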

  28. Linking the Two Frameworks
      • No misclassification [formulas not preserved in the transcript]

  29. Linking the Two Frameworks
      • With misclassification [notation and formulas not preserved in the transcript]

  30. Linking the Two Frameworks
      • Matching 2,853 sample uniques to the population and blocking on all key variables except LAD results in 1,534,293 possible pairs
      • On average across blocks, probability of a correct match given an agreement on LAD: [value not preserved in the transcript]

  31. Linking the Two Frameworks
      • Probability of a correct match given an agreement, calculated for each sample unique
      • Compare to the per-record risk measure
      • Summing over the sample uniques gives the global disclosure risk measure of 289.5

  32. Discussion
      • Global disclosure risk measures accurately estimated for a risk-utility assessment, assuming known non-misclassification probability
      • Empirical evidence of connection between Fellegi and Sunter (F&S) record linkage and probabilistic modelling for estimating identification risk
      • Estimation carried out through log-linear modelling for the probabilistic modelling or the EM algorithm for the F&S record linkage
      • Individual disclosure risk measures more difficult to estimate without knowing true population parameters in both frameworks
      • From the perspective of the intruder, it is difficult to identify sample uniques that are population uniques

  33. Discussion
      • Statistical agencies (MRP) need to:
        - Assess disclosure risk objectively
        - Set tolerable risk thresholds according to different access modes
        - Optimize and combine SDL techniques
        - Provide guidelines on how to analyze disclosure-controlled datasets
      • Future dissemination strategies present new challenges:
        - Synthetic data for web access prior to accessing real data
        - Online SDL techniques for flexible table-generating software and remote access
        - Auditing query systems
      • Bridge the statistical and computer science literature on privacy-preserving algorithms

