Learn about creating synthetic versions of confidential social science data to preserve privacy while maintaining data validity. Explore examples like Origin-Destination Commuter Data and resources for this innovative process.
Building Synthetic Versions of Confidential Social Science Data
John M. Abowd
Edmund Ezra Day Professor, Cornell
Distinguished Senior Research Fellow, US Census Bureau
Outline • What are synthetic data? • The feedback cycle • Example: Origin-Destination Commuter Data • Example: Demographic Data • Example: Employer-Employee Matched Data • How do you do it? • Resources
What Are Synthetic Data? • Micro data constructed by sampling from appropriate distributions to simulate a target population • Preserve the multivariate distribution of the underlying confidential data (along all dimensions) by appropriate modeling • Protect the confidentiality of the underlying data by releasing only the synthetic version
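The three bullets above can be illustrated with a minimal sketch. It assumes a toy two-variable "confidential" file and a bivariate normal model; the variable names, the simulated data, and the choice of model are all hypothetical and are not taken from the slides. The point is only that the released file is drawn from a fitted distribution rather than copied from the confidential records.

```python
import math
import random
import statistics


def fit_bivariate_normal(rows):
    """Estimate means, standard deviations and correlation from (x, y) pairs."""
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)
    cov = sum((x - mx) * (y - my) for x, y in rows) / (len(rows) - 1)
    return mx, my, sx, sy, cov / (sx * sy)


def synthesize(params, n, rng):
    """Draw n synthetic (x, y) records from the fitted bivariate normal."""
    mx, my, sx, sy, rho = params
    scale = math.sqrt(max(0.0, 1.0 - rho * rho))
    out = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        out.append((mx + sx * z1, my + sy * (rho * z1 + scale * z2)))
    return out


# Hypothetical stand-in for a confidential file: (age, annual earnings).
source_rng = random.Random(0)
confidential = []
for _ in range(2000):
    age = source_rng.gauss(40, 10)
    confidential.append((age, 1000 * age + source_rng.gauss(0, 5000)))

params = fit_bivariate_normal(confidential)
synthetic = synthesize(params, 5000, random.Random(1))
synthetic_params = fit_bivariate_normal(synthetic)

print("confidential correlation:", round(params[4], 3))
print("synthetic correlation:   ", round(synthetic_params[4], 3))
```

Only `synthetic` would be released in this sketch. A bivariate normal preserves means, variances and one correlation; real synthesizers need much richer models to preserve the distribution "along all dimensions".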
Conceptual Framework • The mission of data providers • Disseminate information • Protect confidentiality • Efficient production is on the production possibility frontier (PPF), not at Z • X is optimal when preferences favor dissemination • Y is optimal when preferences favor protection • X and Y are both feasible
[Figure: PPF diagram with labels D, P, X, Y, Z]
The Research – Synthetic Data Feedback Cycle • Confidentiality Protection • Scientific Modeling • Data Synthesis • Analytic Validity
Example: Origin-Destination Data • Block-level transportation analysis of commuting patterns • Origin household characteristics (including address) protected by synthetic data methods • Destination establishment characteristics protected by dynamic noise infusion (see example 3) • Released in prototype by the LEHD Program at the Census Bureau • http://lehd.dsd.census.gov/led/datatools/onthemap.html
Example: Demographic Data • Survey of Income and Program Participation (SIPP) linked to all available Social Security Administration earnings and benefit data • Joint project of LEHD at Census and SSA ORES • Protected by partial synthetic data • Scheduled for release within a year
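"Partial synthetic data" generally means that only some variables are replaced by model draws while the rest are released as collected. The sketch below is a toy illustration under stated assumptions: the record layout (`age`, `earnings`), the simulated data, and the simple linear model are hypothetical and do not describe the actual Census/SSA procedure.

```python
import math
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x, plus residual SD."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    rss = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return slope, intercept, math.sqrt(rss / (len(xs) - 2))


def partially_synthesize(records, rng):
    """Keep 'age' as collected; replace 'earnings' with draws from the model."""
    ages = [r["age"] for r in records]
    earnings = [r["earnings"] for r in records]
    slope, intercept, sigma = fit_line(ages, earnings)
    return [
        {"age": r["age"],
         "earnings": intercept + slope * r["age"] + rng.gauss(0, sigma)}
        for r in records
    ]


# Hypothetical confidential records.
source_rng = random.Random(7)
records = []
for _ in range(500):
    age = source_rng.randint(20, 64)
    records.append({"age": age, "earnings": 1000 * age + source_rng.gauss(0, 5000)})

released = partially_synthesize(records, random.Random(8))
print(records[0], released[0])
```

In this sketch the sensitive variable is conditioned on the unsynthesized one, so the relation between them survives in the released file while no reported earnings value is published.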
Example: Longitudinally Linked Employer-Employee Data • LEHD Program Quarterly Workforce Indicators • Current releases protected by dynamic noise infusion • Future releases will be protected by synthetic data methods • http://lehd.dsd.census.gov/led/datatools/qwi-online.html
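The slides do not spell out how dynamic noise infusion works. As a rough illustration of the general idea of multiplicative noise, the sketch below gives each establishment one distortion factor that is kept away from 1 and applied to every quarter. The bounds `a` and `b`, the uniform draw, the establishment names, and the counts are all hypothetical assumptions, not the LEHD specification.

```python
import random


def draw_fuzz_factor(rng, a=0.05, b=0.15):
    """Draw a multiplicative factor in [1-b, 1-a] or [1+a, 1+b] (assumed bounds)."""
    magnitude = rng.uniform(a, b)
    sign = rng.choice((-1, 1))
    return 1 + sign * magnitude


def infuse_noise(employment_by_quarter, factor):
    """Apply the same factor to every quarter of one establishment's series."""
    return [count * factor for count in employment_by_quarter]


rng = random.Random(42)

# Hypothetical establishment-level employment counts for four quarters.
establishments = {
    "estab_A": [100, 104, 98, 110],
    "estab_B": [12, 15, 14, 13],
}

factors = {name: draw_fuzz_factor(rng) for name in establishments}
protected = {
    name: infuse_noise(series, factors[name])
    for name, series in establishments.items()
}

for name in establishments:
    print(name, [round(v, 1) for v in protected[name]])
```

Because the factor is constant within an establishment in this sketch, quarter-to-quarter growth rates are unchanged while the published levels never equal the true levels.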
How Do You Do It? • Assemble all data in a protected, confidential enclave (e.g., US Census Bureau) • Identify important relations to preserve • Model • Synthesize • Test • Repeat until the results are satisfactory • User feedback improves future versions
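The model, synthesize, test, repeat loop above can be sketched as follows. This is a toy version under stated assumptions: the "relation to preserve" is a single correlation, the two candidate models and the tolerance are hypothetical, and all data are simulated. A real validity test would cover many more analyses.

```python
import math
import random
import statistics


def correlation(rows):
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    num = sum((x - mx) * (y - my) for x, y in rows)
    den = math.sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return num / den


def moments(rows):
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    return (statistics.fmean(xs), statistics.stdev(xs),
            statistics.fmean(ys), statistics.stdev(ys))


def model_independent(rows):
    """Candidate 1: fit each marginal separately and ignore the relation."""
    mx, sx, my, sy = moments(rows)
    return lambda rng: (rng.gauss(mx, sx), rng.gauss(my, sy))


def model_joint(rows):
    """Candidate 2: bivariate normal that also models the correlation."""
    mx, sx, my, sy = moments(rows)
    rho = correlation(rows)
    scale = math.sqrt(max(0.0, 1.0 - rho * rho))

    def draw(rng):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        return (mx + sx * z1, my + sy * (rho * z1 + scale * z2))

    return draw


def synthesis_cycle(confidential, candidates, tolerance, n, rng):
    """Try candidate models in order until the validity test passes."""
    target = correlation(confidential)           # relation to preserve
    for name, fit in candidates:
        draw = fit(confidential)                 # model
        synthetic = [draw(rng) for _ in range(n)]  # synthesize
        gap = abs(correlation(synthetic) - target)  # test
        print(f"{name}: correlation gap = {gap:.3f}")
        if gap <= tolerance:
            return name, synthetic
    return None, []                              # no candidate passed: revise models


# Simulated stand-in for data held inside the confidential enclave.
source_rng = random.Random(0)
confidential = []
for _ in range(500):
    x = source_rng.gauss(0, 1)
    confidential.append((x, 2 * x + source_rng.gauss(0, 1)))

candidates = [("independent", model_independent), ("joint", model_joint)]
chosen, synthetic = synthesis_cycle(confidential, candidates, 0.05, 2000, random.Random(1))
print("accepted model:", chosen)
```

The first candidate fails the test because it discards the relation between the two variables, which is the kind of finding that sends the modeler back to the "Model" step.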
Resources • John Abowd’s Social and Economic Data Course • LEHD Program Technical Papers • NSF-ITR Virtual Research Data Center