Protecting the Confidentiality of Tables by Adding Noise to the Underlying Microdata. Paul Massell and Jeremy Funk, Statistical Research Division, U.S. Census Bureau, Washington, DC 20233. Paul.B.Massell@census.gov
Talk Outline • Overview of EZS Noise • Measuring Effectiveness of Perturbative Protection • Noise Applied to Weighted Data • Noise Applied to Unweighted Data: Random vs. Balanced Noise • Conclusions and Future Research
The EZS Noise Method (Evans, Zayatz, Slanta) • Developed by Tim Evans, Laura Zayatz, and John Slanta in the 1990s • Multiplicative noise is applied to the underlying microdata, before table creation • A noise factor or multiplier is randomly generated for each record
The EZS Noise Method (Evans, Zayatz, Slanta) • The distribution of the multipliers should produce unbiased estimates, and ensure that no multipliers are too close to 1 • Weights both known and unknown to users are combined with the noise factors to obtain ‘noisy’ values for all records • When tabulated, in general, sensitive cells are changed quite a bit and non-sensitive cells are changed only by a small amount
Attractive Features of EZS • Tables with noisy data are created in the same way as the original tables: simply replace var X with var X-noisy • Tables are automatically additive • An approximate value could be released for every cell (depends on agency policy) • No complementary suppressions
Attractive Features of EZS • Linked tables and special tabs are automatically protected consistently • EZS allows for protection at the company level (Census requirement) • Ease of implementation compared to methods such as cell suppression
Measuring Effectiveness of the EZS Method • Step 1: Determine which cells in a table are sensitive – e.g., using p% Sensitivity Rule • Step 2: Measure level of protection to sensitive cells (using protection multipliers) • Step 3: Measure amount of perturbation to non-sensitive cells (via % change graph)
The p% Sensitivity Rule • Unweighted Data: • Let T = cell total; x1, x2 = top 2 contributions • Let ‘rem’ denote the remainder: rem = T – (x1 + x2) • Let ‘prot’ denote the suggested protection: prot = (p/100) * x1 – rem • If prot > 0, then when Contributor 2 tries to estimate x1, rem does NOT provide enough uncertainty; additional protection is needed, and noise may provide this uncertainty
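The rule above can be sketched in a few lines of Python. This is a minimal illustration, not the Census Bureau's implementation; the function names, the sample cell, and the value p = 10 are assumptions chosen for the example.

```python
def p_percent_protection(contributions, p):
    """Return (rem, prot) for one cell under the unweighted p% rule."""
    values = sorted(contributions, reverse=True)
    x1 = values[0]                                # largest contribution
    x2 = values[1] if len(values) > 1 else 0.0    # second largest, if any
    total = sum(values)                           # T, the cell total
    rem = total - (x1 + x2)                       # remainder
    prot = (p / 100.0) * x1 - rem                 # suggested protection
    return rem, prot


def is_sensitive(contributions, p):
    """A cell is sensitive when the suggested protection is positive."""
    return p_percent_protection(contributions, p)[1] > 0


# Hypothetical cell with four contributors and p = 10.
cell = [800.0, 150.0, 30.0, 20.0]
rem, prot = p_percent_protection(cell, p=10)
print(rem, prot, is_sensitive(cell, p=10))  # rem = 50.0, prot = 30.0, sensitive
```

In this example the remainder (50) is smaller than 10% of the largest contribution (80), so Contributor 2 could estimate x1 to within 10% and the cell needs 30 units of additional protection.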
p% Sensitivity Rule • Weighted Data: • TA = Fully Weighted Cell Estimate • X1 = Largest Cell Respondent Contribution • X2 = 2nd Largest Cell Contribution • wkn = Known Weights • wun = Unknown Weights
Extended p% Rule with Weights and Rounding • rem = TA – (X1 * wkn1 + X2 * wkn2 ) • prot = ( (p/100) * X1 * wkn1 ) – rem
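A minimal Python sketch of the extended rule follows. It rests on two assumptions that the slides do not state: the fully weighted estimate TA is taken as the sum of X * wkn * wun over all contributors, and the top two contributors are ranked by X * wkn. The function name and sample data are hypothetical.

```python
def weighted_p_percent_protection(units, p):
    """units: list of (x, w_known, w_unknown) tuples for one cell.

    Assumes TA = sum(x * w_known * w_unknown) and ranks contributors
    by their known-weighted value x * w_known.
    """
    ta = sum(x * wkn * wun for x, wkn, wun in units)
    known = sorted((x * wkn for x, wkn, _ in units), reverse=True)
    top1 = known[0]
    top2 = known[1] if len(known) > 1 else 0.0
    rem = ta - (top1 + top2)
    prot = (p / 100.0) * top1 - rem
    return rem, prot


# Same three respondents, with and without unknown (e.g., sample) weights.
weighted = [(500.0, 1.0, 1.0), (100.0, 1.0, 4.0), (40.0, 1.0, 5.0)]
unweighted = [(x, wkn, 1.0) for x, wkn, _ in weighted]

print(weighted_p_percent_protection(weighted, p=10))    # prot < 0: not sensitive
print(weighted_p_percent_protection(unweighted, p=10))  # prot > 0: sensitive
```

The comparison shows how unknown weights enlarge the remainder and can turn an otherwise sensitive cell into a non-sensitive one.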
Measuring the Effectiveness of a Perturbative Protection Method • Protection of Sensitive Cells: • Define Protection Multiplier (PM) • PM = abs (perturbation) / prot • Find how many (or what %) have PM < 1 • Data Quality: • Important: % change for non-sensitive cells • Less important: % over-perturbation for sensitive cells
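Both measures are simple to compute once original and noisy cell totals are available. The sketch below is illustrative only; the function names and the two sample cells are assumptions, and 'perturbation' is taken to mean noisy total minus original total.

```python
def protection_multiplier(original, noisy, prot):
    """PM = abs(perturbation) / prot; PM >= 1 means full protection."""
    return abs(noisy - original) / prot


def percent_change(original, noisy):
    """Absolute % change of a cell total, used as the data quality measure."""
    return 100.0 * abs(noisy - original) / original


def share_underprotected(sensitive_cells):
    """sensitive_cells: list of (original, noisy, prot) with prot > 0.

    Returns the fraction of sensitive cells whose PM is below 1.
    """
    pms = [protection_multiplier(o, n, pr) for o, n, pr in sensitive_cells]
    return sum(1 for pm in pms if pm < 1.0) / len(pms)


# Two hypothetical sensitive cells: (original total, noisy total, prot).
cells = [(1000.0, 1090.0, 30.0), (500.0, 510.0, 25.0)]
print([protection_multiplier(*c) for c in cells])  # PMs of 3.0 and 0.4
print(share_underprotected(cells))                 # 0.5
```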
EZS Noise Factors for Unweighted Data • Let X = original microdata value • Let Y = perturbed value • Let M = noise multiplier; i.e. a draw from a specified noise distribution of EZS type • Y = X * M
Noise Distribution used for all examples: • (a=1.05, b=1.15) 5% to 15% noise
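One way to draw such multipliers is sketched below. The slides give only the parameters a and b; the sketch assumes the magnitude is uniform on [a, b] and that the direction (up or down) is chosen with equal probability, which keeps the multipliers symmetric about 1 and at least 5% away from it. The function name and the sample value are hypothetical.

```python
import random


def draw_multiplier(rng, a=1.05, b=1.15):
    """Draw one noise multiplier M at least (a - 1) away from 1.

    Assumed shape: uniform on [a, b] for upward noise, mirrored about 1
    onto [2 - b, 2 - a] for downward noise, each with probability 1/2.
    """
    magnitude = rng.uniform(a, b)
    if rng.random() < 0.5:
        return magnitude          # perturb upward by 5% to 15%
    return 2.0 - magnitude        # perturb downward by 5% to 15%


rng = random.Random(2024)         # fixed seed so the run is repeatable
x = 250.0                         # original microdata value X
m = draw_multiplier(rng)          # noise multiplier M
y = x * m                         # perturbed value Y = X * M
print(m, y)
```

Because the two intervals mirror each other about 1, the expected multiplier is 1 and noisy totals are unbiased in expectation.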
Noise Applied to Weighted Data • Key idea: weights (e.g., sample weights) provide protection to microdata since users typically “know” weights only roughly (except when close to 1) • Not necessary to apply full M factor to X unless w = 1
EZS Noise Factor for Weighted Data • Weighted Data: • For a simple weight w with an associated uncertainty interval at least as wide as 2*b*w, the noise factor S can be combined with w to form the Joint Noise-Weight Factor
Noise Formula for Known and Unknown Weights • Calculation of Perturbed Values: • wkn is the known weight • wun is the unknown weight.
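The slide text does not spell out the formula, so the sketch below assumes the form found in published descriptions of EZS noise: the joint noise-weight factor is M + (w − 1), and with known and unknown weights the perturbed value is Y = X * wkn * (M + (wun − 1)). Treat that formula, the function name, and the sample numbers as assumptions.

```python
def noisy_weighted_value(x, m, w_known=1.0, w_unknown=1.0):
    """Perturbed weighted value under the ASSUMED formula
    Y = X * w_known * (M + (w_unknown - 1)).

    With w_unknown == 1 the full multiplier M is applied; as w_unknown
    grows, the same M changes the weighted value by a smaller percentage.
    """
    return x * w_known * (m + (w_unknown - 1.0))


x, m = 100.0, 1.10                     # hypothetical value and multiplier
for w_un in (1.0, 2.0, 10.0):
    original = x * w_un                # weighted value without noise
    noisy = noisy_weighted_value(x, m, w_unknown=w_un)
    print(w_un, noisy, 100.0 * (noisy - original) / original)
```

Under this assumed form, a 10% multiplier moves the weighted value by 10% when the unknown weight is 1 but by only about 1% when it is 10, which matches the idea that the full M factor is needed only when w = 1.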
Noise for Weighted Data: Commodity Flow Survey (CFS) • Measures flow of goods via transport system in U.S. • Estimates volume and value of each commodity shipped: by origin, destination, modes of transport • Used for transport modeling, planning, ... • Some users have objected to disclosure suppressions
Effect of Noise on High Level Aggregate Cells • CFS Table: National 2-Digit Commodity • Data Quality Measure: 43 cells; 0 are sensitive • 41 cells change by [0 - 1] % • 2 cells change by [1 - 2] %
CFS Test Table • (Origin State by Destination State by 2-digit Commodity) • 61,174 cells, of which 230 are sensitive • Data Quality and Protection Assessments (following slides)
CFS Noise Results: Data Quality Assessment • While some cells may receive large doses of noise, the vast majority get less than 1% or 2%
CFS Random Noise: Protection Assessment • Most sensitive cells receive significant noise, i.e. 5% to 11% • Only 2 out of 230 sensitive cells do not receive full protection from noise, as measured by Protection Multipliers (PM)
Noise for Unweighted Data: Non-Employers Statistics • Special Features of Microdata • Unweighted administrative data • Only 1 variable to protect: receipts • Many small integers (after rounding to $1000) • Special Features of Key Table • Many cells have a small number of contributors; these include many safe cells • Many sensitive cells with only 1 or 2 contributors
NE Noise Results: Data Quality Assessment • Lack of weights results in much more distortion to non-sensitive cells than occurs for CFS
NE Noise Results: Protection Assessment • Resembles noise factor distribution, due to prevalence of 1-respondent cells in NE test table and no weights
Noise Balancing • Is there a way to improve data quality in this situation? • Yes, if one can focus on one key table T • Idea: balance noise at each cell in a ‘balancing sub-table B of T’ (definition: every micro value is in at most one cell of B) • Choose noise directions to maximize noise cancellation for each cell of B
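The balancing idea can be illustrated with a simple greedy pass over the records of one cell of B. The slides do not describe the actual balancing algorithm, so everything here is an assumption made for illustration: the greedy rule, the uniform noise magnitude on [a − 1, b − 1], the random direction for the first record, and the function name.

```python
import random


def balanced_multipliers(values, rng, a=1.05, b=1.15):
    """Assign a multiplier to each record of one cell of B.

    Greedy sketch: visit records from largest to smallest value and
    point each record's noise against the running net noise, so that
    perturbations within the cell tend to cancel.
    """
    order = sorted(range(len(values)), key=lambda i: -values[i])
    multipliers = [1.0] * len(values)
    net = 0.0                                   # running sum of noise in the cell
    for i in order:
        magnitude = rng.uniform(a, b) - 1.0     # 5% to 15% of the value
        if net == 0.0:
            direction = rng.choice([-1.0, 1.0]) # nothing to cancel yet
        else:
            direction = -1.0 if net > 0 else 1.0
        multipliers[i] = 1.0 + direction * magnitude
        net += direction * magnitude * values[i]
    return multipliers


rng = random.Random(7)
cell_values = [120.0, 80.0, 75.0, 40.0, 10.0]   # hypothetical receipts
ms = balanced_multipliers(cell_values, rng)
noisy_total = sum(v * m for v, m in zip(cell_values, ms))
print(sum(cell_values), noisy_total)
```

Every record still receives 5% to 15% noise, so a single-contributor cell is perturbed exactly as under random noise, while the net noise in a multi-contributor cell never exceeds the noise on its largest record.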
Noise Balancing: Supportive NE Characteristics • Balancing works especially well for NE because a high % of microdata is single unit • After balancing interior cells, need to check noise effect on aggregate cells in same table • Also need to check noise effect in higher and lower tables; these we call “trickle up” and “trickle down” effects • For NE, there are few of these other tables; this makes the balancing decision easier
NE – Balanced Noise: Data Quality Assessment • Vast improvement in data quality • Resembles that of weighted data in CFS
NE – Balanced Noise: Protection Assessment • Very similar to Random Noise application • 91.7% of sensitive cells fully protected
Random Noise vs. Balanced Noise: Non-Employer Test Data • Data Quality is greatly improved • Protection Level is not significantly reduced • Thus Balanced Noise is a Good Choice Here • PM density curves on [0,1] are nearly identical for the 2 methods
Conclusions • EZS Noise is a useful method for protecting tables from a variety of economic programs • There are now several variations of the basic EZS method; which is best for a survey depends on both microdata and table characteristics
Future Research • 1. Should some sensitive cells be suppressed; high noise cells flagged? • 2. How to handle multiple variables? • 3. What is the most that users can be told about the noise process without compromising data protection? • 4. How to handle company dynamics (births, deaths, mergers, ...)? • 5. How to coordinate survey protection?