Building Synthetic Versions of Confidential Social Science Data. John M. Abowd Edmund Ezra Day Professor, Cornell Distinguished Senior Research Fellow, US Census Bureau. Outline. What are synthetic data? The feedback cycle Example: Origin-Destination Commuter Data Example: Demographic Data
Building Synthetic Versions of Confidential Social Science Data • John M. Abowd, Edmund Ezra Day Professor, Cornell; Distinguished Senior Research Fellow, US Census Bureau
Outline • What are synthetic data? • The feedback cycle • Example: Origin-Destination Commuter Data • Example: Demographic Data • Example: Employer-Employee Matched Data • How do you do it? • Resources
What Are Synthetic Data? • Micro data constructed by sampling from appropriate distributions to simulate a target population • Preserve the multivariate distribution of the underlying confidential data (along all dimensions) by appropriate modeling • Protect the confidentiality of the underlying data by releasing only the synthetic version
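The definition above can be illustrated with a minimal sketch, assuming a toy data set of two categorical variables. The function names (`fit_cell_frequencies`, `draw_synthetic`) and the records are hypothetical and are not taken from the talk. The model here is simply the observed cell frequencies. A production system would use a richer model, and raw cell frequencies alone may not give adequate confidentiality protection.

```python
import random
from collections import Counter

def fit_cell_frequencies(records):
    """Estimate the joint distribution of categorical records as cell shares."""
    counts = Counter(records)
    total = sum(counts.values())
    return {cell: n / total for cell, n in counts.items()}

def draw_synthetic(model, n, seed=0):
    """Sample n synthetic records from the fitted joint distribution."""
    rng = random.Random(seed)
    cells = list(model)
    weights = [model[cell] for cell in cells]
    return rng.choices(cells, weights=weights, k=n)

# Hypothetical "confidential" micro data: (location, employment status).
confidential = (
    [("urban", "employed")] * 60
    + [("urban", "unemployed")] * 10
    + [("rural", "employed")] * 25
    + [("rural", "unemployed")] * 5
)

model = fit_cell_frequencies(confidential)
# Only these sampled records would be released, never the originals.
synthetic = draw_synthetic(model, n=1000)
```

Because the synthetic records are drawn from the joint distribution rather than from each variable separately, cross-tabulations of the synthetic file approximate those of the confidential file.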
Conceptual Framework • The mission of data providers • Disseminate information • Protect confidentiality • Efficient production is on the production possibility frontier (PPF), not at Z • X is optimal when preferences favor dissemination • Y is optimal when preferences favor protection • X and Y are both feasible • [Figure: PPF diagram with axes labeled D and P and points X, Y, and Z]
The Research – Synthetic Data Feedback Cycle • Confidentiality Protection • Scientific Modeling • Data Synthesis • Analytic Validity
Example: Origin-Destination Data • Block-level transportation analysis of commuting patterns • Origin household characteristics (including address) protected by synthetic data methods • Destination establishment characteristics protected by dynamic noise infusion (see example 3) • Released in prototype by the LEHD Program at the Census Bureau • http://lehd.dsd.census.gov/led/datatools/onthemap.html
Example: Demographic Data • Survey of Income and Program Participation (SIPP) linked to all available Social Security Administration earnings and benefit data • Joint project of LEHD at Census and SSA ORES • Protected by partially synthetic data • Scheduled for release within a year
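In partial synthesis, only some variables are replaced with synthetic values and the rest are released as collected. The sketch below illustrates the idea under strong simplifying assumptions: the field names (`age_group`, `program`, `earnings`) and the helper `partially_synthesize` are hypothetical, and the sensitive value is redrawn from the values observed among records that share the same kept characteristics. This is not the method used for the linked survey file. A real system would draw from a fitted statistical model instead.

```python
import random
from collections import defaultdict

def partially_synthesize(records, keep, replace, seed=0):
    """Replace one sensitive field with a draw conditional on the kept fields."""
    rng = random.Random(seed)
    # Pool the observed sensitive values within each cell of the kept variables.
    pools = defaultdict(list)
    for rec in records:
        pools[tuple(rec[k] for k in keep)].append(rec[replace])
    output = []
    for rec in records:
        new_rec = dict(rec)  # kept variables are copied unchanged
        cell = tuple(rec[k] for k in keep)
        new_rec[replace] = rng.choice(pools[cell])
        output.append(new_rec)
    return output

# Hypothetical records; earnings is treated as the sensitive variable.
survey = [
    {"age_group": "25-34", "program": "yes", "earnings": 21000},
    {"age_group": "25-34", "program": "yes", "earnings": 18500},
    {"age_group": "25-34", "program": "no", "earnings": 42000},
    {"age_group": "35-44", "program": "no", "earnings": 51000},
    {"age_group": "35-44", "program": "no", "earnings": 47500},
]

released = partially_synthesize(survey, keep=["age_group", "program"], replace="earnings")
```

A cell with a single record, such as the third record above, receives its own value back, which is one reason a fitted model is preferable to resampling observed values.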
Example: Longitudinally Linked Employer-Employee Data • LEHD Program Quarterly Workforce Indicators • Current releases protected by dynamic noise infusion • Future releases will be protected by synthetic data methods • http://lehd.dsd.census.gov/led/datatools/qwi-online.html
How Do You Do It? • Assemble all data in a protected, confidential enclave (e.g., US Census Bureau) • Identify important relations to preserve • Model • Synthesize • Test • Repeat until the results are satisfactory • User feedback improves future versions
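The model, synthesize, test, and repeat steps can be sketched as a loop over candidate models. Everything below is an illustrative assumption: the simulated "confidential" data, the two candidate models, the choice of the Pearson correlation as the relation to preserve, and the tolerance of 0.1. It is a sketch of the workflow, not the Census Bureau's procedure.

```python
import random
import statistics

def pearson(xs, ys):
    """Pearson correlation, used here as the relation to preserve."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def synth_independent(xs, ys, rng):
    """Candidate 1: independent normals, which ignores the x-y relation."""
    mx, sx = statistics.fmean(xs), statistics.stdev(xs)
    my, sy = statistics.fmean(ys), statistics.stdev(ys)
    return [rng.gauss(mx, sx) for _ in xs], [rng.gauss(my, sy) for _ in ys]

def synth_regression(xs, ys, rng):
    """Candidate 2: normal x, then y drawn from a linear regression on x."""
    mx, sx = statistics.fmean(xs), statistics.stdev(xs)
    my = statistics.fmean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    resid_sd = statistics.pstdev(
        [y - (intercept + slope * x) for x, y in zip(xs, ys)])
    new_x = [rng.gauss(mx, sx) for _ in xs]
    new_y = [intercept + slope * x + rng.gauss(0, resid_sd) for x in new_x]
    return new_x, new_y

def build_synthetic(xs, ys, candidates, tol=0.1, seed=0):
    """Try each model in turn until the synthetic data pass the test."""
    rng = random.Random(seed)
    target = pearson(xs, ys)
    for name, model in candidates:
        sx, sy = model(xs, ys, rng)             # model and synthesize
        gap = abs(pearson(sx, sy) - target)     # test
        if gap <= tol:
            return name, sx, sy, gap
    return None                                 # nothing satisfactory yet

# Simulated stand-in for data held inside the enclave.
data_rng = random.Random(42)
conf_x = [data_rng.gauss(40, 10) for _ in range(2000)]
conf_y = [2 * x + data_rng.gauss(0, 5) for x in conf_x]

candidates = [("independent", synth_independent),
              ("regression", synth_regression)]
chosen, syn_x, syn_y, gap = build_synthetic(conf_x, conf_y, candidates)
```

The independent model fails the test because it discards the relation between the two variables, so the loop moves on to the regression model. In practice the test step would cover many relations, and user feedback would add new ones over time.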
Resources • John Abowd’s Social and Economic Data Course • LEHD Program Technical Papers • NSF-ITR Virtual Research Data Center