This talk discusses the background of a project involving the confidentiality protection and disclosure analysis of a longitudinal linked file. It explores the use of synthetic data and probabilistic record linkage to ensure confidentiality while maintaining data usefulness.
Assessing Disclosure for a Longitudinal Linked File Sam Hawala – US Census Bureau Sam.hawala@census.gov November 9th, 2005
Outline of the talk • Background of the project • Confidentiality protection • Disclosure analysis • Conclusions
Linked SIPP-SSA-IRS Data • The Longitudinal Employer-Household Dynamics (LEHD) Program created a confidential data set that integrates five SIPP panels (1990, 1991, 1992, 1993, 1996) with earnings records and SSA benefits data • Data very useful to disability and retirement research communities • LEHD will provide a public-use file (PUF) of the integrated microdata using the synthetic data approach
Synthetic Data • Fully-synthetic micro data • Uses the population or record linkage structure of the gold standard micro data • Generates synthetic entities and data elements from appropriate probability models • Partially-synthetic micro data • Preserves the record structure or sampling frame of the gold standard micro data • Replaces the data elements with synthetic values sampled from an appropriate probability model
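A minimal sketch of the partially-synthetic idea, for illustration only: the record structure and a set of unsynthesized variables are kept, and one data element is replaced with draws from a probability model. The within-cell normal model, the function name `partially_synthesize`, and the variable names are assumptions made for this example; the slides do not describe the models actually used.

```python
import random
import statistics


def partially_synthesize(records, keep_keys, synth_key, seed=0):
    """Return copies of `records` with `synth_key` replaced by synthetic draws.

    Hypothetical model: within each cell defined by the unsynthesized
    variables `keep_keys`, draw from a normal distribution fitted to the
    confidential values in that cell.
    """
    rng = random.Random(seed)

    # Group the confidential values by cell of unsynthesized variables.
    cells = {}
    for rec in records:
        cell = tuple(rec[k] for k in keep_keys)
        cells.setdefault(cell, []).append(rec[synth_key])

    synthetic = []
    for rec in records:
        values = cells[tuple(rec[k] for k in keep_keys)]
        mu = statistics.mean(values)
        sigma = statistics.pstdev(values)
        new_rec = dict(rec)  # same record structure as the gold standard
        new_rec[synth_key] = rng.gauss(mu, sigma)
        synthetic.append(new_rec)
    return synthetic


gold = [
    {"sex": "F", "panel": 1990, "earnings": 30000.0},
    {"sex": "F", "panel": 1990, "earnings": 52000.0},
    {"sex": "M", "panel": 1996, "earnings": 41000.0},
]
implicate = partially_synthesize(gold, ["sex", "panel"], "earnings", seed=1)
```

Running the function with several different seeds would give several synthetic implicates of the same gold standard file.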
Data Confidentiality • Public product should prevent individuals from being re-identified in the current public use SIPP products • Limit number of SIPP variables included • Protect survey data, administrative data, and the links between the files
Confidentiality Protection • Protection is based on the inability of PUF users to re-identify the SIPP record upon which the PUF record is based • This prevents wholesale addition of SIPP data to the IRS and SSA data in the PUF • Goal: re-identification of SIPP records from the PUF should result in true matches and false matches with equal probability
Disclosure Analysis • Uses probabilistic record linking • Each synthetic implicate is matched back to the original file • All unsynthesized variables are used as blocking variables
Matching the Files • Two files A (original confidential data file) and B (synthetic data file)… over 200,000 records in each • Blocking criterion (unsynthesized variables) • Matching set of variables • Agreement criterion (M and U probabilities)
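The matching steps listed above can be sketched as a standard Fellegi-Sunter style linkage: candidate pairs are restricted to records that share the blocking values, and each pair is scored by summing log-likelihood ratios built from the M and U probabilities. This is an illustrative sketch, not the software used in the project; the function names, the tolerance-based agreement rule, and all numeric values are assumptions.

```python
import math
from collections import defaultdict


def match_weight(a, b, match_vars, m, u, agree):
    """Sum of agreement/disagreement weights over the matching variables.

    m[v] = P(agree on v | pair is a true match)
    u[v] = P(agree on v | pair is a non-match)
    """
    weight = 0.0
    for v in match_vars:
        if agree(v, a[v], b[v]):
            weight += math.log2(m[v] / u[v])
        else:
            weight += math.log2((1 - m[v]) / (1 - u[v]))
    return weight


def link(file_a, file_b, block_vars, match_vars, m, u, agree):
    """For each record in B, return the index of the best-scoring record
    in A within the same block, or None if the block is empty in A."""
    blocks = defaultdict(list)
    for i, a in enumerate(file_a):
        blocks[tuple(a[k] for k in block_vars)].append(i)

    links = []
    for b in file_b:
        candidates = blocks.get(tuple(b[k] for k in block_vars), [])
        best = max(
            candidates,
            key=lambda i: match_weight(file_a[i], b, match_vars, m, u, agree),
            default=None,
        )
        links.append(best)
    return links


# Hypothetical inputs: A is the original file, B is a synthetic implicate.
file_a = [
    {"sex": "F", "panel": 1990, "earn": 30000, "age": 40},
    {"sex": "F", "panel": 1990, "earn": 52000, "age": 61},
    {"sex": "M", "panel": 1996, "earn": 41000, "age": 35},
]
file_b = [
    {"sex": "F", "panel": 1990, "earn": 30500, "age": 41},
    {"sex": "F", "panel": 1990, "earn": 51000, "age": 60},
    {"sex": "M", "panel": 1991, "earn": 41000, "age": 35},
]
tolerance = {"earn": 2000, "age": 2}  # assumed agreement criterion
m_prob = {"earn": 0.9, "age": 0.9}    # assumed M probabilities
u_prob = {"earn": 0.1, "age": 0.2}    # assumed U probabilities


def within_tolerance(var, x, y):
    return abs(x - y) <= tolerance[var]


links = link(file_a, file_b, ["sex", "panel"], ["earn", "age"],
             m_prob, u_prob, within_tolerance)
```

A production linkage would normally also apply upper and lower weight thresholds to separate links, possible links, and non-links; the sketch simply takes the highest-weight candidate in each block.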
Refinements Suggested by the Disclosure Review Board • The ratios of true matches to false matches should be close to 1. • The overall count of matches should be reduced. • Investigate a method to optimally choose the probabilities for the conditional matching and non-matching agreements
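The first two refinements can be checked with a simple tally once a linkage has been run. The sketch below assumes, purely for illustration, that synthetic record i was generated from original record i, so a link is a true match exactly when the linked index equals i; the function name `match_ratio` and the input format are hypothetical.

```python
def match_ratio(links):
    """Count true and false matches and return their ratio.

    links[i] is the index of the original record linked to synthetic
    record i, or None if no link was declared. The ratio is None when
    there are no false matches to divide by.
    """
    true_matches = sum(1 for i, j in enumerate(links) if j == i)
    false_matches = sum(
        1 for i, j in enumerate(links) if j is not None and j != i
    )
    ratio = true_matches / false_matches if false_matches else None
    return true_matches, false_matches, ratio


# Hypothetical linkage result for five synthetic records.
result = match_ratio([0, 2, None, 3, 1])
```

Under the stated goal, a ratio near 1 means a declared link is about as likely to be wrong as right, and a lower total of true plus false matches means fewer records are linked at all.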
Conclusion • Confidentiality is an increasing problem for agencies releasing public use data • Linked longitudinal worker-employer data is difficult to protect through usual methods • Probabilistic record linkage technology can be a powerful way to assess when data may be at risk.