This working paper explores the use of fixed intervals instead of cell suppression to protect sensitive cells in statistical data confidentiality. The proposed approach aims to minimize disclosure risks while minimizing information loss.
UNECE/Work Session on Statistical Data Confidentiality, Working Paper 38
Using Fixed Intervals to Protect Sensitive Cells Instead of Cell Suppression
By Steve Cohen and Bogong Li, U.S. Bureau of Labor Statistics
BLS Quarterly Census of Employment and Wages (QCEW)
• Monthly employment and wages
• All 6-digit industries by county, by ownership, and by size group
• Used as a benchmark source for other important surveys, such as the Current Employment Statistics survey and Occupational Employment Statistics
• An important input for other Federal and State programs
• Based on unemployment insurance (UI) administrative records
• BLS protects the identity of cooperating employers; disclosure restrictions apply
Every quarter, the QCEW program publishes a census of employment and wages covering 98 percent of employment at the county, MSA, state, and national levels, by 6-digit North American Industry Classification System (NAICS) industry.
Current Publication Format and Suppression rules … an example from total employment
Research Goal
• Replace primary and secondary suppression with intervals containing the suppressed value
• Fixed set of intervals
• Previous efforts
1. J. J. Salazar, UNECE/Eurostat Work Session on Statistical Data Confidentiality, 2001
2. Fischetti and Salazar, 1999
Proposed Change to Current Publication Format Using Fixed Intervals … pre-defined, fixed intervals replacing nondisclosable cells
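As a rough illustration of what publishing in pre-defined, fixed intervals means, the sketch below maps a cell value to the interval that contains it. The interval edges in `FI_EDGES` and the function name `fixed_interval` are hypothetical; the slides do not list the actual interval set.

```python
from bisect import bisect_right

# Hypothetical lower edges of the fixed intervals (assumption, not from the slides).
# The last edge opens an unbounded top interval.
FI_EDGES = [0, 20, 100, 250, 500, 1000, 2500]


def fixed_interval(value):
    """Return the (low, high) fixed interval containing a non-negative cell value.

    `high` is None for the open-ended top interval.
    """
    if value < 0:
        raise ValueError("cell values are assumed to be non-negative")
    i = bisect_right(FI_EDGES, value) - 1          # index of the interval's lower edge
    low = FI_EDGES[i]
    high = FI_EDGES[i + 1] - 1 if i + 1 < len(FI_EDGES) else None
    return (low, high)
```

With these assumed edges, a nondisclosable cell whose true employment is 37 would be published as the interval 20 to 99 rather than being suppressed.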
Disclosure Risks Associated with the Fixed Interval (FI) Publication Format
• By obtaining ranges or bounds for previously suppressed cells and incorporating them into the additive relationships in the table, outside attackers could estimate more precisely the primary cells that current CSP methods are intended to protect
• A contributor to a cell, or a knowledgeable insider, may subtract its own value from the FI bounds to obtain a narrower estimate of the other contributors in the same cell
• For cells with few contributors, small contributors can significantly improve their estimate of the dominant contributor by knowing which end of the FI bound to use
• For single-contributor cells, one end of the FI bound may be closer to the actual value than the single respondent is comfortable with
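The insider risk described above is simple arithmetic. The sketch below shows the bounds a contributor could derive for the combined value of the other contributors in its cell; the function name `insider_bounds` and the example numbers are illustrative assumptions, not figures from the paper.

```python
def insider_bounds(fi_low, fi_high, own_value):
    """Bounds a contributor can infer for the other contributors in the same cell.

    The cell is published as the fixed interval [fi_low, fi_high], and the
    insider knows its own contribution `own_value`. Values are assumed to be
    non-negative, so the lower bound is floored at zero.
    """
    return (max(0, fi_low - own_value), fi_high - own_value)
```

For example, if a cell is published as 100 to 249 and the insider contributes 90, the remaining contributors must total between 10 and 159. In a two-contributor cell, that range applies directly to the other respondent.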
Our Proposed Selection-Improvement Solution to the Fixed Interval Publication Problem (FIPP)
Step 1. Identify primary and secondary cells via a CSP method and publish them in pre-defined FIs
Step 2. Apply linear constrained optimization to identify those primary cells with disclosure risks (audit)
Step 3. Select additional protecting cells for those primary cells at risk while minimizing information loss
Step 4. Audit the table one more time; exit if all primary cells are protected, otherwise repeat steps 3-4.
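To make the audit in Step 2 concrete, here is a minimal sketch of how an attacker can tighten published intervals using the table's additive relationships. It substitutes simple bound propagation for the linear constrained optimization named in the slides; propagation gives valid bounds but, in general, may be looser than a full linear-programming audit. The data layout, the function names, and the rule that a cell is "at risk" whenever its inferred bounds are tighter than its published FI are all assumptions made for illustration.

```python
def audit(published, relations, max_passes=100):
    """Tighten attacker bounds by propagating additive relations.

    published: dict cell -> (low, high); exactly published cells have low == high.
    relations: list of (total_cell, [component_cells]), total = sum(components).
    Returns a dict of inferred (low, high) bounds for every cell.
    """
    bounds = dict(published)
    for _ in range(max_passes):
        changed = False
        for total, parts in relations:
            # The total cannot lie outside the sum of its components' bounds.
            lo_sum = sum(bounds[p][0] for p in parts)
            hi_sum = sum(bounds[p][1] for p in parts)
            t_lo, t_hi = bounds[total]
            tightened = (max(t_lo, lo_sum), min(t_hi, hi_sum))
            if tightened != bounds[total]:
                bounds[total] = tightened
                changed = True
            # Each component equals the total minus the other components.
            for p in parts:
                others_lo = sum(bounds[q][0] for q in parts if q != p)
                others_hi = sum(bounds[q][1] for q in parts if q != p)
                t_lo, t_hi = bounds[total]
                p_lo, p_hi = bounds[p]
                tightened = (max(p_lo, t_lo - others_hi), min(p_hi, t_hi - others_lo))
                if tightened != bounds[p]:
                    bounds[p] = tightened
                    changed = True
        if not changed:
            break
    return bounds


def cells_at_risk(published, inferred):
    """Interval cells whose inferred bounds are tighter than the published FI
    (a hypothetical risk rule, not the paper's criterion)."""
    return sorted(c for c, (lo, hi) in published.items()
                  if lo != hi and inferred[c] != (lo, hi))


# Toy row: total T = A + B + C, with B and C published as fixed intervals.
published = {"T": (300, 300), "A": (120, 120), "B": (100, 249), "C": (20, 99)}
inferred = audit(published, [("T", ["A", "B", "C"])])
```

In this toy row the attacker narrows B from 100-249 to 100-160 and C from 20-99 to 20-80, so both cells would be flagged for additional protection under the assumed rule.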
The “Selection-Improvement” algorithm iterates until the table is fully safe, minimizing information loss in each iteration.
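The control flow of the iteration can be sketched as a short loop. The audit and selection steps are passed in as functions, so this shows only the loop structure; `selection_improvement`, its parameters, and the iteration cap are hypothetical names and choices, not part of the paper.

```python
def selection_improvement(in_intervals, audit, select, max_iter=50):
    """Repeat audit and selection until no primary cell is at risk.

    in_intervals: set of cells currently published as fixed intervals.
    audit(in_intervals) -> set of primary cells still at risk.
    select(in_intervals, at_risk) -> one additional protecting cell, or None.
    Returns the final set of interval cells and the number of audits run.
    """
    for iteration in range(1, max_iter + 1):
        at_risk = audit(in_intervals)
        if not at_risk:
            return in_intervals, iteration        # table is safe: exit
        extra = select(in_intervals, at_risk)
        if extra is None:
            raise RuntimeError("no further protecting cell is available")
        in_intervals = in_intervals | {extra}     # publish one more cell as an FI
    raise RuntimeError("table still at risk after max_iter iterations")


# Toy stand-ins: cell "A" is safe only once both "C" and "D" are also intervals.
def toy_audit(cells):
    return set() if {"C", "D"} <= cells else {"A"}


def toy_select(cells, at_risk):
    remaining = [c for c in ["B", "C", "D"] if c not in cells]
    return remaining[0] if remaining else None


final_cells, n_audits = selection_improvement({"A"}, toy_audit, toy_select)
```

Each pass adds a single protecting cell, which keeps the added information loss per iteration small at the cost of more audits.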
Methods to Select Additional Protecting Cells (PCs)
• Systematic method: selects the cell with the smallest value among all cells that form additive relationships with two primary cells at risk. This cell is published in a pre-defined FI.
• Single Source Shortest Path (SSSP) method: selects the cells on the “shortest path” connecting all primary cells at risk on the table network, fixing the order of the vertices.
• Random Selection method: randomly selects a cell that forms an additive relationship with the primary cells at risk. It makes no attempt to minimize information loss and is a last resort when the two methods above fail.
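One possible reading of the systematic method is sketched below: among cells not yet published as intervals, keep those that share an additive relation with at least two distinct primary cells at risk, then take the one with the smallest value. Treating only interior (component) cells as candidates, the tie-break by cell name, and all identifiers are assumptions made for the example.

```python
def systematic_select(values, relations, at_risk, in_intervals):
    """Pick the smallest-valued cell linked to at least two at-risk primary cells.

    values: dict cell -> true cell value.
    relations: list of (total_cell, [component_cells]).
    at_risk: set of primary cells flagged by the audit.
    in_intervals: set of cells already published as fixed intervals.
    Returns the chosen cell, or None if no cell qualifies.
    """
    linked = {}  # candidate cell -> at-risk cells it shares a relation with
    for _total, parts in relations:
        risky_here = at_risk.intersection(parts)
        if not risky_here:
            continue
        for cell in parts:
            if cell in at_risk or cell in in_intervals:
                continue
            linked.setdefault(cell, set()).update(risky_here)
    candidates = [c for c, risky in linked.items() if len(risky) >= 2]
    if not candidates:
        return None
    return min(candidates, key=lambda c: (values[c], c))


# Toy 3x3 table: rows R1..R3 and columns C1..C3 are the additive relations.
values = {"a11": 8, "a12": 40, "a13": 5,
          "a21": 15, "a22": 9, "a23": 60,
          "a31": 70, "a32": 30, "a33": 25}
relations = [("R1", ["a11", "a12", "a13"]), ("R2", ["a21", "a22", "a23"]),
             ("R3", ["a31", "a32", "a33"]), ("C1", ["a11", "a21", "a31"]),
             ("C2", ["a12", "a22", "a32"]), ("C3", ["a13", "a23", "a33"])]
chosen = systematic_select(values, relations, {"a11", "a22"}, {"a11", "a22"})
```

Here `a12` and `a21` each sit in a row with one at-risk cell and a column with the other, so they are the only candidates; `a21` is chosen because its value (15) is smaller, even though `a13` (5) is the smallest cell linked to just one at-risk cell.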
Advantages of Our Selection-Improvement Method
• Easy implementation
• Zero disclosure risk
• Applicable to n-dimensional tables
• Order of complexity is that of the auditing program used
The Publication Table Used to Evaluate the Selection-Improvement Algorithm
• Employment in eight 2-digit NAICS super-sector industries of a U.S. state (48,250 cells): Manufacturing, Retail Trade, Transportation, Information, Finance and Insurance, Real Estate and Rental, Professional Services, and Healthcare
• 60,845 establishments
• Total employment of 1,166,388
• Under the current disclosure protection rule, 60% of publication cells, 14% of total employment, and 7.6% of establishments are completely suppressed (primary and secondary).
Tabular Output Comparisons Prior to using Selection-Improvement algorithm (current)
Tabular Output Comparisons (cont’d) … After applying Selection-Improvement Algorithm (proposed)
Additional Protecting Cells Selected for Alternative Selecting Schemes
“Selection-Improvement” Method Produced Safe Publication Tables with a Gain of Information for Users
• The entire table is safely protected
• The number of selection-improvement iterations is between 2 and 5
• The increase in employment level in protecting cells is approx. 1%
• The increase in the number of establishments in protecting cells is approx. 2.5%
• The increase in the number of protecting cells is between 1% and 10% (with the Random Selection method being the least efficient)
• All cells are published in pre-defined, fixed intervals!
Conclusion: Data users gain industrial employment information at the cost of only a small number of additional selection cycles.
Limitations and Considerations
• The cell selection process is not repeatable.
• The random selection method produces a different set of protecting cells each time.
• The method applies to multi-dimensional and hierarchical tables, but modeling their additive relationships could be complex and cumbersome.
• No production computer software exists.
Contact Information
Bogong T. Li, li_t@bls.gov, 202-691-7415
Steve Cohen, cohen_steve@bls.gov, 202-691-7400
Bureau of Labor Statistics / OSMR, 2 Massachusetts Ave. N.E., Washington, DC 20212-0001
BLS QCEW program: http://www.bls.gov/cew/home.htm