Measuring Disclosure Risk and Data Utility for Flexible Table Generators. Natalie Shlomo, Laszlo Antal, Mark Elliot University of Manchester Natalie.Shlomo@manchester.ac.uk.
The research leading to these results has received funding from the European Union's Seventh Framework Programme (FP7/2007-2013) under grant agreement n° 262608 (DwB - Data without Boundaries).
Topics Covered
• Introduction
• Design of Flexible Table Generating Servers
• Information Based Risk-Utility Measures
• Example Application and Results
• Discussion
Introduction
• Large demand for specialized and tailored tables from policy makers and researchers
• NSIs are considering the internet to disseminate outputs through flexible table generators, e.g. US Census Bureau, Australia ABS, Israel CBS
• Key questions:
  (1) What data should be used to produce the tables? Original microdata with or without SDC methods, often aggregated to hypercubes
  (2) At what stage should the SDC be applied?
    • Apply to underlying data, with all tables then considered safe: compounds SDC and reduces utility
    • Apply to final output tables: consistency and additivity are hard to ensure
Introduction
• Types of disclosure risk:
  • Identity disclosure, where small cells may lead to an identification
  • Attribute disclosure, where rows/columns have structural zeros and only one or two cells populated (small cells on margins)
  • Differencing of tables, leading to higher risks of the above disclosures
• Output-based query systems, e.g. a flexible table generator, need perturbative methods of SDC (see the CS literature on differential privacy)
• Flexible table generation requires ‘on the fly’ disclosure risk assessment, application of SDC methods and data utility measures
Designing a Flexible Table Generating Server
• SDC rules are easily programmed, some examples:
  • Limit the number of dimensions
  • Avoid disclosure by differencing by ensuring consistent and nested categories
  • Minimum population thresholds, average cell size, etc.
• Algorithm:
  (1) Determine by SDC rules if the table can be produced
  (2) Assess disclosure risk
  (3) Apply SDC method if needed
  (4) Recalculate disclosure risk
  (5) If the table is safe then output it with a utility measure, else go to (3)
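The five-step algorithm on this slide can be sketched as a small loop. This is a minimal illustration only, not the authors' server: the function name `generate_table`, the callables it receives (`passes_rules`, `risk`, `protect`, `utility`), the risk `threshold` and the `max_rounds` cap are all hypothetical names introduced here.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def generate_table(
    counts: Sequence[int],
    passes_rules: Callable[[Sequence[int]], bool],
    risk: Callable[[Sequence[int]], float],
    protect: Callable[[Sequence[int]], Sequence[int]],
    utility: Callable[[Sequence[int], Sequence[int]], float],
    threshold: float,
    max_rounds: int = 10,  # assumption: cap the loop so it always terminates
) -> Optional[Tuple[List[int], float]]:
    """Return (table, utility) if a safe table can be released, else None."""
    # Step 1: the SDC rules decide whether the table may be produced at all.
    if not passes_rules(counts):
        return None
    output = list(counts)
    for _ in range(max_rounds):
        # Steps 2 and 4: assess (or recalculate) the disclosure risk.
        if risk(output) <= threshold:
            # Step 5: release the safe table together with its utility measure.
            return output, utility(counts, output)
        # Step 3: apply the SDC method, then loop back to the risk check.
        output = list(protect(output))
    return None  # no safe table was found within max_rounds
```

Passing the rule check, risk measure, SDC method and utility measure in as functions keeps the loop independent of any particular choice of method.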
Designing a Flexible Table Generating Server
• Types of data:
  • Census data: whole population counts
    • European Census Hub with all member states providing common hypercubes
    • Different SDC methods across member states reduce the utility of the hub
  • Business data: different type of tables (magnitude), not considered further
  • Survey data from social surveys, which typically have non-perturbative SDC methods (coarsening)
    • Weighted counts are generally safe due to large and varying weights, with low sample counts deleted for low quality
    • Unweighted counts are not differentially private due to sample uniques that are population uniques (Shlomo and Skinner 2012) and must be avoided
Information Based Disclosure Risk and Data Utility Measures
• To assess attribute disclosure in tables, mainly caused by structural zeros, use the entropy $H(\mathbf{F}/N) = -\sum_{i=1}^{K} \frac{F_i}{N} \log \frac{F_i}{N}$, where $\mathbf{F} = (F_1, \ldots, F_K)$ is the vector of frequency counts and $N = \sum_{i=1}^{K} F_i$
• The entropy is bounded by 0, attained if all cells are zero except one cell, and by $\log(K)$, attained if all cell values are equal, i.e. cell proportions are $1/K$
• Risk measure: combine the entropy with other measures (proportion of zeros and size of the population) in a weighted average
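As a minimal sketch of the entropy term and its two bounds, the function below computes the entropy of the cell proportions of a table given as a flat list of counts. The helper `normalised_entropy_term` is an assumption added here for illustration (it rescales the entropy by log(K) so that 1 means "all mass in one cell"); it is not taken from the slide, and the slide's full weighted risk measure is not reproduced.

```python
import math


def entropy(counts):
    """Entropy of the cell proportions F_i / N, taking 0 * log(0) as 0."""
    n = sum(counts)
    return -sum((f / n) * math.log(f / n) for f in counts if f > 0)


def normalised_entropy_term(counts):
    """Hypothetical rescaling: 1 - H / log(K), between 0 and 1.

    Equals 1 when a single cell holds every unit (highest attribute
    disclosure risk) and 0 when all K cells are equal.
    """
    return 1.0 - entropy(counts) / math.log(len(counts))


# One populated cell gives entropy 0; equal cells give entropy log(K).
one_cell = [0, 0, 7, 0]
equal_cells = [5, 5, 5, 5]
print(entropy(one_cell), entropy(equal_cells), math.log(4))
```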
Information Based Disclosure Risk and Data Utility Measures
• Take into account perturbation that introduces random zeros:
  • Adjust the first term by comparing the number of zeros before and after perturbation
  • Smooth out perturbed cell counts based on their expectation under the transition matrix (lowers the second term)
  • Example: for random rounding, replace perturbed zeros with their expectation, computed from the frequencies of cell values and the frequencies of perturbed cell values
• For sampling, smooth out sample counts using the probabilistic log-linear Poisson model approach (Skinner and Shlomo 2008):
  • Replace population counts in the entropy term by their estimates under the model
  • Estimate the number of zeros under the same model
Information Based Disclosure Risk and Data Utility Measures
• Utility measure: Hellinger's distance between the original counts and the perturbed counts
• Hellinger's distance is bounded, with a lower bound of 0 when the perturbed counts equal the original counts, and can be used to compare SDC methods
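A minimal sketch of a Hellinger distance between an original and a perturbed table, both given as flat lists of counts. The form used here, with the factor 1/2 under the square root, is the common textbook definition and is an assumption; the slide's exact formula and upper bound are not preserved. The helper `to_proportions` is a hypothetical convenience: on proportions this form lies between 0 and 1.

```python
import math


def hellinger_distance(original, perturbed):
    """sqrt(1/2 * sum_i (sqrt(f_i) - sqrt(g_i))^2) over matching cells."""
    if len(original) != len(perturbed):
        raise ValueError("tables must have the same number of cells")
    squared = sum(
        (math.sqrt(f) - math.sqrt(g)) ** 2 for f, g in zip(original, perturbed)
    )
    return math.sqrt(0.5 * squared)


def to_proportions(counts):
    """Hypothetical helper: rescale counts so that they sum to 1."""
    total = sum(counts)
    return [c / total for c in counts]


original = [3, 0, 6, 12]
rounded = [3, 0, 6, 12]   # identical table: distance is 0
swapped = [0, 3, 12, 6]   # counts moved between cells: distance is positive
print(hellinger_distance(original, rounded))
print(hellinger_distance(to_proportions(original), to_proportions(swapped)))
```

A smaller distance means the protected table stays closer to the original, which is how the measure can rank SDC methods by utility.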
Example: Simulation
• Hypercube with population N=1,500,000:
  • NUTS2 region: 2 regions
  • Gender: 2 categories
  • Banded age groups: 21 categories
  • Current activity status: 5 categories
  • Occupation: 13 categories
  • Educational attainment: 9 categories
  • Country of citizenship: 5 categories
• Cell proportions calculated from the 2001 UK Census via iterative proportional fitting
• All proportions multiplied by the population size and rounded
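Iterative proportional fitting, followed by scaling to a population size, can be sketched in two dimensions as below. This is an illustration under stated assumptions, not the authors' procedure: the real hypercube has seven variables, and the seed table, target margins and function names (`ipf_2d`, `to_counts`) are invented for the example.

```python
def ipf_2d(seed, row_targets, col_targets, iterations=100):
    """Rescale a positive seed table until its margins match the targets."""
    table = [[float(v) for v in row] for row in seed]
    for _ in range(iterations):
        # Scale each row so that it sums to its target margin.
        for i, target in enumerate(row_targets):
            total = sum(table[i])
            table[i] = [v * target / total for v in table[i]]
        # Scale each column so that it sums to its target margin.
        for j, target in enumerate(col_targets):
            total = sum(row[j] for row in table)
            for row in table:
                row[j] *= target / total
    return table


def to_counts(proportions, population):
    """Multiply cell proportions by the population size and round."""
    return [[round(p * population) for p in row] for row in proportions]


# Hypothetical 2 x 2 seed with target margins expressed as proportions.
proportions = ipf_2d([[1, 2], [3, 4]], row_targets=[0.4, 0.6], col_targets=[0.5, 0.5])
counts = to_counts(proportions, population=1_500_000)
print(counts)
```

Because each cell is rounded separately, the rounded counts need not sum exactly to the population size.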
Flexible Table Generating Servers
• Define a 3-dimensional table with one variable to define the population: banded age group, education group and occupation group, defined for NUTS2=1
• Table has 2,457 cells, 854,539 individuals, average cell size of 347.8
• For comparison, we carry out a semi-controlled random rounding to base 3 on the output table calculated from the original data
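The table dimensions quoted on this slide follow from the category counts of the simulated hypercube (21 age bands, 9 education groups, 13 occupation groups). A short check, with variable names chosen here for illustration:

```python
import math

# Category counts for the three table variables, from the hypercube description.
categories = {"banded age group": 21, "educational attainment": 9, "occupation": 13}

n_cells = math.prod(categories.values())   # 21 * 9 * 13
n_individuals = 854_539                    # individuals in NUTS2=1, from the slide
average_cell_size = n_individuals / n_cells

print(n_cells, round(average_cell_size, 1))  # prints: 2457 347.8
```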
SDC Methods for Hypercube
• Random record swapping by selecting 5% of the individuals in NUTS2 region and swapping LAU2, thus a total of 10% of individuals swapped
• Semi-controlled random rounding to base 3, controlled for the two NUTS2 totals
• Invariant PRAM with control of totals for the two NUTS2 regions
  • Perturbation of cell values 1 to 10; cell values of 11 and above are not perturbed
  • Low entropy, i.e. cells perturbed to neighbouring cells only
• Risk measure:
  • Weights: .1, .7 (small cells), .1, .1
  • Adjust the measure for perturbations by the transition matrix
  • Sample based measure: all 2-way interaction log-linear model (entropy term: population 0.318, sample 0.323, estimate 0.319)
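Random rounding to base 3 can be sketched as follows. This shows plain unbiased random rounding only: each count is rounded up with probability equal to its residual divided by the base, so its expectation is unchanged. The semi-controlled variant on the slide additionally controls the NUTS2 totals, which this sketch does not attempt. The function names and the `rng` parameter are assumptions made for the example.

```python
import random


def random_round(count, base=3, rng=random):
    """Round a count to a multiple of `base` without bias.

    A count with residual r = count mod base goes up with probability
    r / base and down otherwise, so the expected value equals `count`.
    """
    residual = count % base
    if residual == 0:
        return count  # multiples of the base are left unchanged
    lower = count - residual
    return lower + base if rng.random() < residual / base else lower


def random_round_table(counts, base=3, rng=random):
    """Apply random rounding independently to every cell of a flat table."""
    return [random_round(c, base, rng) for c in counts]


rng = random.Random(2024)  # seeded only to make the example repeatable
print(random_round_table([0, 1, 2, 3, 7, 348], rng=rng))
```

Because every cell value of 1 or 2 becomes 0 or 3, no small cell survives rounding, but independent rounding of cells and margins breaks additivity.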
Results
• Record swapping applied to the hypercube did little to reduce disclosure risk since small cells remain, and utility is high
• Stochastic perturbation has lower disclosure risk but low utility
• Semi-controlled random rounding also reduces disclosure risk and has good utility, but consistency and additivity need to be ensured, which could lower utility
• Comparing the rounding before and after shows that SDC ‘on the fly’ has lower disclosure risk and the highest utility out of all the methods since perturbation is not confounded
• Sample based risk measure resulted in a higher risk measure (future work) with very low utility
Discussion
• While agencies can claim there is uncertainty in the tables from record swapping, there is little actual reduction in disclosure risk, which is problematic when disseminating tables freely over the internet
• Record swapping and the proposed stochastic perturbation have little impact on disclosure by differencing since they leave original counts in the table
• Perturbative methods where all cells are perturbed can provide more protection and can be made differentially private
• To avoid confounding SDC methods, apply the perturbative method ‘on the fly’ within the table generating server on the final output table
• Using stochastic perturbative methods allows users to account for the perturbation in their analysis
• Future research: improve SDC methods for additivity and consistency; consider conditional entropy to account for perturbation and sampling