Estimating Identification Risks for Microdata
Jerome P. Reiter
Institute of Statistics and Decision Sciences, Duke University, Durham NC, USA
Measures of identification disclosure risk
• Number of population uniques: Does not incorporate intruders’ knowledge. May not be useful for continuous data. Hard to gauge the effects of statistical disclosure limitation (SDL) procedures. Hard to estimate accurately.
• Probability-based methods (direct matching using external databases; indirect matching using the existing data set): Require assumptions about intruder behavior. External databases may be costly to obtain.
Notation for methods • Actual record j : • Released record j : • Available data: • Unavailable + perturbed data combined:
Probability of identification
• Let J = j when record j in Z matches the target record, t, for j = 1, ..., r.
• Let J = r + 1 when the target is not in Z.
Calculating probabilities. Case 1: target assumed to be in Z
• Units whose values do not match the target’s values have zero probability.
• For matches, the probability equals $1/n_t$, where $n_t$ is the number of matches in Z.
• The probability equals zero for j = r + 1.
Calculating probabilities. Case 2: target not assumed to be in Z
• Units whose values do not match the target’s values have zero probability.
• For matches, the probability is $1/N_t$, where $N_t$ is the number of matches in the population.
• For j = r + 1, the probability is $(N_t - n_t)/N_t$.
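The two cases above can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation: the function name `match_probabilities`, the representation of records as tuples of key values, and the argument `pop_matches` (standing in for $N_t$, which in practice would have to be known or estimated) are all assumptions made for the example.

```python
def match_probabilities(released, target, pop_matches=None):
    """Return [Pr(J = 1), ..., Pr(J = r), Pr(J = r + 1)] for one target.

    released    : list of tuples, the matching values of each record in Z
    target      : tuple, the target's values on the same variables
    pop_matches : hypothetical N_t (matches in the population) for Case 2;
                  None means Case 1 (target assumed to be in Z)
    """
    is_match = [rec == target for rec in released]
    n_t = sum(is_match)  # number of matches in Z

    if pop_matches is None:
        # Case 1: matches share probability equally; "not in Z" gets zero.
        if n_t == 0:
            raise ValueError("Case 1 needs at least one match in Z")
        denom, p_not_in_z = n_t, 0.0
    else:
        # Case 2: each match gets 1/N_t; the rest goes to "not in Z".
        if pop_matches < max(n_t, 1):
            raise ValueError("N_t must be positive and at least n_t")
        denom = pop_matches
        p_not_in_z = (pop_matches - n_t) / pop_matches

    probs = [1.0 / denom if m else 0.0 for m in is_match]
    return probs + [p_not_in_z]


z = [("30-34", "F"), ("30-34", "F"), ("40-44", "M")]
t = ("30-34", "F")
case1 = match_probabilities(z, t)
case2 = match_probabilities(z, t, pop_matches=10)
```

With two matches in Z, Case 1 gives each match probability 1/2. With a hypothetical $N_t = 10$, Case 2 gives each match 1/10 and leaves 8/10 for the target not being in Z.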
Calculating
• Data swapping: Repeatedly simulate the swapping mechanism using Z. Estimate probabilities for combinations of original and swapped values.
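A rough sketch of the simulation idea for data swapping follows. The slide does not specify the swapping mechanism, so the one used here is an assumption: pick a fixed fraction of records at random and shuffle their values among themselves. The names `swap_values` and `estimate_swap_probs` are hypothetical.

```python
import random
from collections import Counter, defaultdict


def swap_values(values, rate, rng):
    """Assumed mechanism: shuffle the values of a random `rate` fraction of records."""
    out = list(values)
    k = round(rate * len(out))
    idx = rng.sample(range(len(out)), k)   # records selected for swapping
    pool = [out[i] for i in idx]
    rng.shuffle(pool)                      # reassign values within the selection
    for i, v in zip(idx, pool):
        out[i] = v
    return out


def estimate_swap_probs(z_values, rate, n_sims=200, seed=0):
    """Estimate Pr(swapped value = b | starting value = a) by repeated simulation on Z."""
    rng = random.Random(seed)
    counts = defaultdict(Counter)
    for _ in range(n_sims):
        swapped = swap_values(z_values, rate, rng)
        for before, after in zip(z_values, swapped):
            counts[before][after] += 1
    return {
        a: {b: c / sum(cnt.values()) for b, c in cnt.items()}
        for a, cnt in counts.items()
    }


race = ["W"] * 70 + ["B"] * 30
probs = estimate_swap_probs(race, rate=0.30)
```

Each row of `probs` is an estimated conditional distribution over swapped values for one starting value. A selected record can receive its own category back, so the chance of a value being unchanged is higher than one minus the swap rate.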
Calculating
• Noise addition: Assume variable k is perturbed using Gaussian noise with mean zero and known variance $\sigma^2$.
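Under the stated assumption (released value equals true value plus $N(0, \sigma^2)$ noise), the density of a released value given the target's true value is Gaussian and centered at that true value. The sketch below evaluates that density for each released record and normalizes across records. The normalization step and the names `gaussian_density` and `noise_match_weights` are illustrative assumptions, not taken from the slides.

```python
import math


def gaussian_density(x, mean, sigma):
    """Density of N(mean, sigma^2) at x."""
    zscore = (x - mean) / sigma
    return math.exp(-0.5 * zscore * zscore) / (sigma * math.sqrt(2.0 * math.pi))


def noise_match_weights(released_values, target_value, sigma):
    """Relative plausibility that each released value is the target's value plus noise."""
    dens = [gaussian_density(z, target_value, sigma) for z in released_values]
    total = sum(dens)
    if total == 0.0:
        raise ValueError("all densities underflowed to zero")
    return [d / total for d in dens]


# Hypothetical released property-tax values and a target known to pay 1050.
weights = noise_match_weights([1000.0, 1300.0, 5000.0], 1050.0, sigma=290.0)
```

Records whose released value lies many standard deviations from the target's known value receive negligible weight, which is why noise alone may protect extreme values poorly.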
Calculating
• The first distribution is for the SDL methods.
• The second distribution is the best model for predicting the unavailable variables given what is known.
Calculating when values in U are not perturbed. Intruders may act this way to avoid computations. It is prudent to evaluate risk assuming they do.
Calculating • Assume independence to obtain: where
Simulations
• 51,016 heads of household from the 2000 Current Population Survey (CPS).
• Potentially available variables: Age, Sex, Race, Marital Status, Property Tax.
• Unavailable variables: Education, Income, Social Security, Child Support Payments.
Simulations: SDL Procedures
• Age: Group in five-year intervals.
• Race and Marital Status: Randomly swap 30% of values for each variable.
• Property taxes: For positive taxes, add noise from $N(0, 290^2)$. Constrain values to be positive. Do not alter 0s.
• Other variables: Leave at original values.
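The age recode and the property-tax perturbation listed above might look like the following sketch. Two details are assumptions, since the slide does not give them: the five-year intervals are taken to start at multiples of five, and the positivity constraint is enforced by redrawing the noise until the result is positive. The function names are hypothetical, and swapping is omitted here.

```python
import random


def group_age(age):
    """Recode age into a five-year interval label (assumed to start at multiples of 5)."""
    lo = (age // 5) * 5
    return f"{lo}-{lo + 4}"


def perturb_tax(tax, rng, sigma=290.0):
    """Add N(0, sigma^2) noise to positive taxes; leave zeros unaltered.

    Positivity is enforced by redrawing, which is an assumption of this sketch.
    """
    if tax == 0:
        return 0.0
    while True:
        noisy = tax + rng.gauss(0.0, sigma)
        if noisy > 0:
            return noisy


rng = random.Random(1)
ages = [group_age(a) for a in (37, 40, 64)]
taxes = [perturb_tax(t, rng) for t in (0.0, 100.0, 2500.0)]
```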
Simulations: Targets
• Everyman: Has values near the median for all variables.
• Unique: Sample unique on the combination of age, sex, race, and marital status.
• Big I: Highest income in the data set.
• Big P: Highest property tax in the data set.
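A target like Unique can be found by counting how often each combination of key variables occurs in the sample. The sketch below is illustrative only; the record layout (dictionaries with `age`, `sex`, `race`, `marital` fields) and the function name `sample_uniques` are assumptions.

```python
from collections import Counter


def sample_uniques(records, keys):
    """Return indices of records whose combination of `keys` occurs exactly once."""
    combos = [tuple(rec[k] for k in keys) for rec in records]
    counts = Counter(combos)
    return [i for i, combo in enumerate(combos) if counts[combo] == 1]


# Hypothetical toy sample.
sample = [
    {"age": "35-39", "sex": "F", "race": "W", "marital": "M"},
    {"age": "35-39", "sex": "F", "race": "W", "marital": "M"},
    {"age": "80-84", "sex": "M", "race": "B", "marital": "W"},
]
unique_idx = sample_uniques(sample, ["age", "sex", "race", "marital"])
```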
Simulations: Summary of results
• Swaps are needed to protect Unique.
• Age recoding plus swaps gives good protection.
• Knowing property taxes greatly increases the probabilities of identification.
• Adding noise to positive tax values is not sufficient. (Top-coding helps.)