Assessing the Impact of SDC Methods on Census Frequency Tables

Natalie Shlomo
Southampton Statistical Sciences Research Institute, University of Southampton

Topics:
• Introduction
• Disclosure risk
• SDC methods for protecting Census frequency tables
• Disclosure risk and data utility measures
• Description of table
• Risk-utility analysis
• Summary of analysis
• Discussion and future work

Introduction
Identification; individual attribute disclosure
• Disclosure risk in Census tables:
- Need to protect many tables from one dataset containing population counts, which can be linked and differenced
- Need to consider output strategies for standard tables and web-based table-generating applications
- Need to interact with users and develop an SDC framework with a focus on both disclosure risk and data utility

Disclosure Risk
For Census tables:
• 1’s and 2’s in cells are disclosive since these cells lead to identification
• 0’s may be disclosive if there are only a few non-zero cells in a row or column (attribute disclosure)
Consideration of disclosure risk:
• Threshold rules (minimum average cell size, ratio of small cells to zeros, etc.)
• Proportion of high-risk cells (1 or 2)
• Entropy (minimum of 0 if the distribution has one non-zero cell and all others zero, maximum of log K if all K cells are equal)
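The entropy bounds quoted above can be checked with a small sketch. It assumes Shannon entropy with the natural logarithm applied to the cell proportions of one row; the function name `entropy` is illustrative only.

```python
import math

def entropy(counts):
    """Shannon entropy (natural log) of the distribution implied by cell counts."""
    total = sum(counts)
    if total == 0:
        raise ValueError("row has no counts")
    # Zero cells contribute nothing to the entropy.
    probs = [c / total for c in counts if c > 0]
    return sum(p * math.log(1 / p) for p in probs)

K = 4
print(entropy([10, 0, 0, 0]))              # one non-zero cell: minimum
print(entropy([5, 5, 5, 5]), math.log(K))  # equal cells: maximum, log K
```

A row with low entropy concentrates its population in few cells, which is the situation in which zeros become informative.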

SDC Methods for Protecting Frequency Tables
• Pre-tabular methods (special case of PRAM)
- Random record swapping
- Targeted record swapping
In a Census context, geographical variables are typically swapped to avoid edit failures and minimize bias.
Implementation:
- Randomly select p% of the households
- Draw a household matching on a set of key variables (e.g. household size and broad sex-age distribution) and swap all geographical variables
- Can target records for swapping that are in high-risk cells of size 1 or 2
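A minimal sketch of random record swapping along the lines of the implementation steps above. The record layout (dicts with `area` and `hh_size` fields), the function name `swap_geography` and the rule that a household is swapped at most once are assumptions made for illustration; a single geographical variable stands in for "all geographical variables".

```python
import random

def swap_geography(households, rate, key_vars, geo_var="area", seed=0):
    """Swap the geography of about `rate` of households with a matching household."""
    rng = random.Random(seed)
    records = [dict(h) for h in households]  # work on copies
    n_select = round(rate * len(records))
    selected = rng.sample(range(len(records)), n_select)
    used = set()  # households already involved in a swap
    for i in selected:
        if i in used:
            continue
        key = tuple(records[i][v] for v in key_vars)
        # Candidate partners: same key variables, different geography, not yet swapped.
        candidates = [
            j for j in range(len(records))
            if j != i and j not in used
            and tuple(records[j][v] for v in key_vars) == key
            and records[j][geo_var] != records[i][geo_var]
        ]
        if not candidates:
            continue  # no matching partner, leave the household unswapped
        j = rng.choice(candidates)
        records[i][geo_var], records[j][geo_var] = records[j][geo_var], records[i][geo_var]
        used.update((i, j))
    return records

households = [
    {"id": 1, "area": "A", "hh_size": 2},
    {"id": 2, "area": "B", "hh_size": 2},
    {"id": 3, "area": "A", "hh_size": 4},
    {"id": 4, "area": "C", "hh_size": 4},
    {"id": 5, "area": "B", "hh_size": 1},
]
swapped = swap_geography(households, rate=0.4, key_vars=["hh_size"])
print([(h["id"], h["area"]) for h in swapped])
```

Because partners match on the key variables, the counts by key variable within each area are unchanged. Targeted swapping would restrict the selection step to households falling in cells of size 1 or 2.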

SDC Methods for Protecting Frequency Tables
• Rounding
Unbiased random rounding: entries are rounded up or down to a multiple of the rounding base depending on pre-defined probabilities and a stochastic draw.
Example: for unbiased random rounding to base 3:
- 1 → 0 w.p. 2/3, 1 → 3 w.p. 1/3
- 2 → 0 w.p. 1/3, 2 → 3 w.p. 2/3
The expected value of the rounding perturbation is 0.
Margins and internal cells are rounded separately.
Small cell rounding: internal cells are aggregated to obtain margins.
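The base-3 example generalizes as follows: an entry with residual r is rounded up with probability r/base, which keeps the expected rounded value equal to the original. This is a sketch only; the function name `random_round` and the `rng` parameter are illustrative.

```python
import random

def random_round(value, base=3, rng=random):
    """Unbiased random rounding of a non-negative integer count to a multiple of `base`."""
    residual = value % base
    if residual == 0:
        return value  # already a multiple of the base
    lower = value - residual
    # Round up with probability residual/base so the expected value equals `value`.
    if rng.random() < residual / base:
        return lower + base
    return lower

rng = random.Random(1)
draws = [random_round(1, base=3, rng=rng) for _ in range(20000)]
print(sum(draws) / len(draws))  # close to 1, the original count
```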

SDC Methods for Protecting Frequency Tables
• Rounding (cont.)
Semi-controlled unbiased random rounding: control the selection strategy for entries to round, i.e. use a “without replacement” strategy.
Implementation:
- Calculate the expected number of entries to round up
- Draw a simple random sample without replacement (srswor) from among the entries and round these up; round the rest down
Can be carried out per row/column to ensure consistent totals on one dimension (key statistics).
Eliminates the extra variance introduced by the rounding.
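The two implementation steps can be sketched as below. Two choices here are assumptions rather than details given on the slide: the sample is drawn separately within each residual class, and a fractional expected count is itself rounded at random. The name `semi_controlled_round` is illustrative.

```python
import math
import random

def semi_controlled_round(values, base=3, seed=0):
    """Round counts to `base`, fixing how many entries are rounded up per residual class."""
    rng = random.Random(seed)
    rounded = [v - (v % base) for v in values]  # start by rounding everything down
    for residual in range(1, base):
        idx = [i for i, v in enumerate(values) if v % base == residual]
        expected_up = len(idx) * residual / base  # expected number rounded up
        n_up = math.floor(expected_up)
        if rng.random() < expected_up - n_up:
            n_up += 1  # randomise the fractional part of the expected count
        # Sample without replacement the entries that are rounded up.
        for i in rng.sample(idx, n_up):
            rounded[i] += base
    return rounded

row = [2, 2, 2, 1, 1, 1]
print(semi_controlled_round(row), sum(row))
```

In this row exactly two of the 2's and one of the 1's are rounded up, so the rounded entries sum to the original total of 9, whichever entries are drawn.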

SDC Methods for Protecting Frequency Tables
• Rounding (cont.)
Controlled rounding: feature in Tau-Argus (Salazar-González, Bycroft and Staggemeier, 2005)
- Uses linear programming techniques to round entries up or down; results are similar to deterministic rounding
- All rounded entries add up to the rounded margins
- Method is not unbiased and entries can jump a base

SDC Methods for Protecting Frequency Tables
• Cell suppression
Hypercube method (Giessing, 2004):
- Feature in Tau-Argus and suited for large tables
- Uses a heuristic based on suppressing the corners of a hypercube formed by the primary suppressed cell, with optimality conditions
Imputing suppressed cells for utility evaluation:
- Replace each suppressed cell by the average information loss in its row/column
- Example: two cells are suppressed in a row with a known margin of 500. The non-suppressed cells total 400, so each suppressed cell is replaced with (500 - 400) / 2 = 50
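The imputation example can be reproduced with a short sketch. Suppressed cells are represented by `None`, which is a convention chosen for this illustration, and only the row direction is handled.

```python
def impute_suppressed(row, margin):
    """Replace suppressed cells (None) by an equal share of the margin not yet accounted for."""
    known = sum(v for v in row if v is not None)
    n_suppressed = sum(v is None for v in row)
    if n_suppressed == 0:
        return list(row)
    fill = (margin - known) / n_suppressed
    return [fill if v is None else v for v in row]

row = [150, None, 250, None]             # non-suppressed cells total 400
print(impute_suppressed(row, margin=500))  # [150, 50.0, 250, 50.0]
```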

Disclosure Risk Measures
Need to determine output strategies and SDC together:
• Hard-copy tables, non-flexible categories and geographies: can control SDC methods to suit the tables
• Web-based tables and flexible categories and geographies: need to add noise or round for every query
Disclosure risk measures:
• Proportion of high-risk cells (C1 and C2) not protected
• Percent true zeros out of total zeros
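One way to compute the two risk measures for a pair of flattened tables is sketched below. The reading used here is an assumption: a small cell counts as "not protected" when its protected value equals its original value, and "total zeros" means the zeros appearing in the protected table.

```python
def risk_measures(original, protected):
    """Risk measures for two equally long lists of cell counts."""
    pairs = list(zip(original, protected))
    small = [(o, p) for o, p in pairs if o in (1, 2)]
    unprotected = sum(1 for o, p in small if p == o)
    # Original values behind every zero that appears in the protected table.
    zeros = [o for o, p in pairs if p == 0]
    true_zeros = sum(1 for o in zeros if o == 0)
    return {
        "prop_small_unprotected": unprotected / len(small) if small else 0.0,
        "pct_true_zeros": 100 * true_zeros / len(zeros) if zeros else 0.0,
    }

original = [0, 1, 2, 5, 0, 1, 7, 2]
protected = [0, 0, 3, 6, 0, 1, 6, 0]
print(risk_measures(original, protected))
# {'prop_small_unprotected': 0.25, 'pct_true_zeros': 50.0}
```

A lower percentage of true zeros means more ambiguity about whether a published zero is a real zero.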

Utility Measures
• Distance metric - distortion to distributions (Gomatam and Karr, 2003):
- Internal cells: let $D^k$ be a table for row $k$, $n_r$ the number of rows, and $D^k(c)$ the cell frequency for cell $c$
- Margins: let $M$ be the margin, $n_M$ the number of categories, and $M(c)$ the number of persons in category $c$
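As an illustration of a distance metric on internal cells, the sketch below assumes an average absolute distance: the mean absolute difference per cell within each row, averaged over rows. This form is an assumption and may differ from the metric intended here; the function name is illustrative.

```python
def average_absolute_distance(original, perturbed):
    """Mean over rows of the average absolute per-cell difference."""
    per_row = [
        sum(abs(p - o) for o, p in zip(row_o, row_p)) / len(row_o)
        for row_o, row_p in zip(original, perturbed)
    ]
    return sum(per_row) / len(per_row)

original = [[1, 4, 7], [0, 2, 10]]
perturbed = [[0, 3, 6], [0, 3, 9]]
print(average_absolute_distance(original, perturbed))
```

The same function applies to a margin by passing it as a table with a single row.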

Utility Measures
• Impact on tests for independence: Cramér’s V measure of association:
$V = \sqrt{\dfrac{\chi^2}{n \, \min(r - 1, c - 1)}}$
where $\chi^2$ is the Pearson chi-square statistic, $n$ is the table total, and $r$ and $c$ are the numbers of rows and columns
• Same utility measure for entropy and the Pearson chi-square statistic
• Impact on log-linear analysis for multi-dimensional tables, i.e. deviance
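Cramér’s V can be computed directly from a two-way table using its standard definition, V = sqrt(chi-square / (n min(r - 1, c - 1))). The sketch below uses plain Python; the helper names are illustrative.

```python
import math

def pearson_chi2(table):
    """Pearson chi-square statistic and total count of a two-way table."""
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_tot[i] * col_tot[j] / n  # expected count under independence
            if exp > 0:
                chi2 += (obs - exp) ** 2 / exp
    return chi2, n

def cramers_v(table):
    """Cramér's V: 0 under independence, 1 under perfect association."""
    chi2, n = pearson_chi2(table)
    r, c = len(table), len(table[0])
    return math.sqrt(chi2 / (n * min(r - 1, c - 1)))

print(cramers_v([[10, 20], [20, 40]]))  # 0.0, rows are proportional
print(cramers_v([[10, 0], [0, 10]]))    # 1.0, perfect association
```

Comparing V on the original and the protected table shows whether an SDC method weakens or strengthens the apparent association.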

Utility Measures
• “Between” variance:
- Let $P^k(c)$ be a target proportion for a cell $c$ in row $k$, and let $P(c)$ be the overall proportion across all rows of the table
- The “between” variance is defined as:
- and the utility measure is:
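A hedged sketch of a “between” variance for one target column: the spread of the row proportions around the overall proportion. The divisor (number of rows minus one) and the function name are assumptions, not taken from the slide.

```python
def between_variance(table, col):
    """Variance of the row proportions for column `col` around the overall proportion."""
    rows = [row for row in table if sum(row) > 0]
    row_props = [row[col] / sum(row) for row in rows]
    overall = sum(row[col] for row in rows) / sum(sum(row) for row in rows)
    return sum((p - overall) ** 2 for p in row_props) / (len(row_props) - 1)

original = [[8, 2], [2, 8]]
swapped = [[6, 4], [4, 6]]  # proportions pulled towards the overall proportion
print(between_variance(original, col=0), between_variance(swapped, col=0))
```

In this example the second table has the smaller value, the pattern expected when row proportions move towards the overall proportion.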

Utility Measures
• Variance of cell counts:
- The variance of the cell count for row $k$, where $n_c$ is the number of columns:
- The average variance across all rows:
- The utility measure is:
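The average row variance can be sketched with the standard library. The sample variance (divisor: number of columns minus one) is an assumption here, since the exact divisor is not stated.

```python
from statistics import mean, variance

def average_row_variance(table):
    """Average over rows of the sample variance of the cell counts in each row."""
    return mean(variance(row) for row in table)

original = [[0, 3, 9], [1, 2, 3]]
rounded = [[0, 3, 9], [0, 3, 3]]  # second row after rounding to base 3
print(average_row_variance(original), average_row_variance(rounded))
```

Here the rounded table has the larger average variance (12 against 11), in line with rounding making cell counts "sharper".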

Description of Table
• 2001 UK Census table:
- Rows: Output Areas (1,487)
- Columns: Economic Activity (9) × Sex (2) × Long-Term Illness (2)
- Table includes 317,064 persons aged 16-74 in 53,532 internal cells
- Average cell size: 5.92, although the table is skewed
- Number of zeros: 17,915 (33.5%)
- Number of small cells: 14,726 (27.5%)

Summary of Analysis
• Rounding eliminates small cells, but with random rounding there is still a need to protect against disclosure by differencing and linking
• Rounding adds more ambiguity into the zero counts
• Random rounding to base 5 has the greatest impact on distortions to distributions
• Semi-controlled rounding has almost no effect on distortions to internal cells but has less distortion on marginal cells
• Fully controlled rounding has less distortion to internal cells since it is similar to deterministic rounding
• Cell suppression with the simple imputation method has the highest utility (no perturbation of large cells) but is difficult to implement in a Census

Summary of Analysis
• Record swapping leaves a high percentage of true small cells and less ambiguity in the zero cells
• Record swapping has less distortion to internal cells than rounding; the distortion increases with higher swapping rates
• Targeted swapping has more distortion on internal cells than random swapping but has less impact on marginal cells
• Column margins of the table have no distortion because of the controls in swapping
• Combining record swapping with rounding results in more distortion but provides added protection

Summary of Analysis
• Record swapping across geographies causes attenuation:
- loss of association (moving towards independence)
- counts “flattening” out
- proportions moving to the overall proportion
• Attenuation increases with higher swapping rates
• Targeted record swapping has less attenuation than random swapping
• Rounding introduces more zeros:
- levels of association are higher
- cell counts are “sharper”
• Effects are less severe for controlled rounding
• Combining record swapping and rounding can cancel out the opposing effects, depending on the direction and magnitude of each procedure separately

Discussion
• Choice of SDC method depends on tolerable risk thresholds and demands for “fit for purpose” data
• Modifying and combining SDC methods (non-perturbative and perturbative methods) can produce higher utility, e.g. ABS developed microdata keys for consistency in rounding
• Dissemination of quality measures and guidance for carrying out statistical analysis on protected tables
• Future output strategies will be based on flexible table-generating software; more research is needed into disclosure risk by differencing and linking (collaboration with the CS community)
• Safe settings, remote access and license agreements for highly disclosive Census outputs (sample microdata and origin-destination tables)

Natalie Shlomo
N.Shlomo@Soton.ac.uk
