
Building Synthetic Versions of Confidential Social Science Data

John M. Abowd, Edmund Ezra Day Professor, Cornell; Distinguished Senior Research Fellow, US Census Bureau


Presentation Transcript


  1. Building Synthetic Versions of Confidential Social Science Data • John M. Abowd, Edmund Ezra Day Professor, Cornell; Distinguished Senior Research Fellow, US Census Bureau

  2. Outline • What are synthetic data? • The feedback cycle • Example: Origin-Destination Commuter Data • Example: Demographic Data • Example: Employer-Employee Matched Data • How do you do it? • Resources

  3. What Are Synthetic Data? • Micro data constructed by sampling from appropriate distributions to simulate a target population • Preserve the multivariate distribution of the underlying confidential data (along all dimensions) by appropriate modeling • Protect the confidentiality of the underlying data by releasing only the synthetic version
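The definition on this slide can be illustrated with a minimal sketch: fit a model to the confidential records, then release only draws from the fitted model. Everything specific here is an assumption made for illustration and is not from the talk: the two variables (age and log earnings), the simulated stand-in for the confidential file, and the choice of a bivariate normal model. Production synthesizers use far richer models.

```python
import math
import random
import statistics


def fit_model(records):
    """Estimate a bivariate normal as x ~ N(mean_x, sd_x) plus a linear model for y given x."""
    xs = [x for x, _ in records]
    ys = [y for _, y in records]
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sd_x = statistics.stdev(xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in records) / (len(records) - 1)
    slope = cov_xy / sd_x ** 2
    residuals = [y - (mean_y + slope * (x - mean_x)) for x, y in records]
    sd_resid = math.sqrt(sum(e * e for e in residuals) / (len(records) - 2))
    return {"mean_x": mean_x, "mean_y": mean_y, "sd_x": sd_x,
            "slope": slope, "sd_resid": sd_resid}


def synthesize(model, n, rng):
    """Draw n synthetic records from the fitted model; no confidential record is copied."""
    synthetic = []
    for _ in range(n):
        x = rng.gauss(model["mean_x"], model["sd_x"])
        y = rng.gauss(model["mean_y"] + model["slope"] * (x - model["mean_x"]),
                      model["sd_resid"])
        synthetic.append((x, y))
    return synthetic


def correlation(records):
    """Pearson correlation between the two variables of a list of (x, y) records."""
    xs = [x for x, _ in records]
    ys = [y for _, y in records]
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in records)
    return cov / math.sqrt(sum((x - mean_x) ** 2 for x in xs) *
                           sum((y - mean_y) ** 2 for y in ys))


# Hypothetical stand-in for the confidential file: (age, log earnings) pairs.
rng = random.Random(12345)
confidential = []
for _ in range(2000):
    age = rng.gauss(40, 10)
    confidential.append((age, 9 + 0.03 * age + rng.gauss(0, 0.4)))

model = fit_model(confidential)
synthetic = synthesize(model, len(confidential), rng)
print("confidential correlation:", round(correlation(confidential), 3))
print("synthetic correlation:   ", round(correlation(synthetic), 3))
```

Comparing the two correlations is a crude check that the relationship between the variables survives synthesis. It says nothing by itself about whether the release is adequately protected.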

  4. Conceptual Framework • The mission of data providers • Disseminate information • Protect confidentiality • Efficient production is on the production possibility frontier (PPF), not at Z • X is optimal when preferences favor dissemination • Y is optimal when preferences favor protection • X and Y are both feasible • [Figure: frontier diagram with labels D and P and points X, Y, Z]

  5. The Research – Synthetic Data Feedback Cycle • Confidentiality Protection • Scientific Modeling • Data Synthesis • Analytic Validity

  6. Example: Origin-Destination Data • Block-level transportation analysis of commuting patterns • Origin household characteristics (including address) protected by synthetic data methods • Destination establishment characteristics protected by dynamic noise infusion (see example 3) • Released in prototype by the LEHD Program at the Census Bureau • http://lehd.dsd.census.gov/led/datatools/onthemap.html

  7. Where do workers employed between Minneapolis and St. Paul live?

  8. Example: Demographic Data • Survey of Income and Program Participation (SIPP) linked to all available Social Security Administration (SSA) earnings and benefit data • Joint project of LEHD at Census and SSA's Office of Research, Evaluation, and Statistics (ORES) • Protected by partial synthetic data • Scheduled for release within a year
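Partial synthesis, as mentioned on this slide, replaces only some variables with model draws and leaves the rest as collected. The sketch below is one simple way to show the idea and is not the method used for the linked survey file. The record layout (`age_group`, `earnings`), the toy values, and the within-group normal model are all hypothetical.

```python
import random
import statistics
from collections import defaultdict


def partially_synthesize(records, keep_key, sensitive_key, seed=0):
    """Keep `keep_key` as collected; replace `sensitive_key` with draws from a
    normal model fitted separately within each value of `keep_key`."""
    rng = random.Random(seed)
    by_group = defaultdict(list)
    for record in records:
        by_group[record[keep_key]].append(record[sensitive_key])
    # Each group needs at least two records so a standard deviation exists.
    params = {group: (statistics.mean(values), statistics.stdev(values))
              for group, values in by_group.items()}
    released = []
    for record in records:
        mean, sd = params[record[keep_key]]
        new_record = dict(record)              # copy; the input is left untouched
        new_record[sensitive_key] = rng.gauss(mean, sd)
        released.append(new_record)
    return released


# Hypothetical confidential records.
confidential_records = [
    {"age_group": "25-34", "earnings": 31000.0},
    {"age_group": "25-34", "earnings": 36500.0},
    {"age_group": "25-34", "earnings": 28750.0},
    {"age_group": "45-54", "earnings": 52000.0},
    {"age_group": "45-54", "earnings": 47250.0},
    {"age_group": "45-54", "earnings": 61000.0},
]

released_records = partially_synthesize(confidential_records, "age_group", "earnings")
for original, released in zip(confidential_records, released_records):
    print(original["age_group"], original["earnings"], "->", round(released["earnings"], 2))
```

The design choice to condition on the retained variable is what keeps the relationship between the kept and synthesized variables roughly intact. A real application would condition on many more variables and assess disclosure risk separately.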

  9. Example: Longitudinally Linked Employer-Employee Data • LEHD Program Quarterly Workforce Indicators • Current releases protected by dynamic noise infusion • Future releases will be protected by synthetic data methods • http://lehd.dsd.census.gov/led/datatools/qwi-online.html
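Noise infusion can be sketched as a multiplicative distortion: each establishment receives one factor, bounded away from 1, that is then applied to all of its periods so that its time-series pattern is unchanged. The sketch below rests on assumptions: the bounds `a` and `b` are hypothetical, the uniform draw is a simplification chosen for brevity, and the establishment identifiers and counts are invented.

```python
import random


def draw_distortion_factor(rng, a=0.05, b=0.15):
    """Draw a factor in [1-b, 1-a] or [1+a, 1+b], so it is never too close to 1.
    The bounds a and b are hypothetical illustration values."""
    magnitude = rng.uniform(a, b)
    sign = rng.choice((-1, 1))
    return 1.0 + sign * magnitude


def infuse_noise(counts_by_establishment, seed=0, a=0.05, b=0.15):
    """Apply one permanent factor per establishment to every period in its series."""
    rng = random.Random(seed)
    protected = {}
    for estab_id, series in counts_by_establishment.items():
        factor = draw_distortion_factor(rng, a, b)
        protected[estab_id] = [count * factor for count in series]
    return protected


# Hypothetical quarterly employment counts for two establishments.
true_counts = {
    "estab_A": [120, 125, 130, 128],
    "estab_B": [15, 14, 16, 18],
}

protected_counts = infuse_noise(true_counts, seed=2024)
for estab_id, series in protected_counts.items():
    print(estab_id, [round(value, 1) for value in series])
```

Because the factor is fixed within an establishment, quarter-to-quarter growth rates computed from the protected series equal those from the true series, while each individual level is distorted by at least the lower bound.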

  10. How Do You Do It? • Assemble all data in a protected, confidential enclave (e.g., US Census Bureau) • Identify important relations to preserve • Model • Synthesize • Test • Repeat until the results are satisfactory • User feedback improves future versions
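The model, synthesize, test, repeat steps above can be sketched as a loop over candidate models that stops at the first one whose synthetic output preserves the chosen relation. Everything specific here is a hypothetical illustration: the two categorical variables, the toy confidential table, the two candidate models, the total variation distance as the validity test, and the tolerance of 0.08.

```python
import random
from collections import Counter


def fit_independent(records):
    """Candidate 1: treat the two variables as independent (product of marginals)."""
    first = Counter(r[0] for r in records)
    second = Counter(r[1] for r in records)

    def sample(rng):
        a = rng.choices(list(first), weights=list(first.values()))[0]
        b = rng.choices(list(second), weights=list(second.values()))[0]
        return (a, b)
    return sample


def fit_joint(records):
    """Candidate 2: model the full cross-tabulation of the two variables."""
    joint = Counter(records)
    cells, weights = list(joint), list(joint.values())

    def sample(rng):
        return rng.choices(cells, weights=weights)[0]
    return sample


def total_variation(records_a, records_b):
    """Distance between the two empirical joint distributions (0 means identical)."""
    pa, pb = Counter(records_a), Counter(records_b)
    na, nb = len(records_a), len(records_b)
    return 0.5 * sum(abs(pa[c] / na - pb[c] / nb) for c in set(pa) | set(pb))


def synthesize_until_valid(confidential, candidates, tolerance, seed=0):
    """Model, synthesize, test; move to the next candidate until the test passes."""
    rng = random.Random(seed)
    for name, fit in candidates:
        sample = fit(confidential)                             # model
        synthetic = [sample(rng) for _ in confidential]        # synthesize
        distance = total_variation(confidential, synthetic)    # test
        print(f"{name}: distance {distance:.3f}")
        if distance <= tolerance:
            return name, synthetic, distance
    raise ValueError("no candidate model met the tolerance")


# Hypothetical confidential table: (education, employment status).
confidential_table = (
    [("college", "employed")] * 400 + [("college", "not employed")] * 100 +
    [("no college", "employed")] * 150 + [("no college", "not employed")] * 350
)

chosen, synthetic_table, final_distance = synthesize_until_valid(
    confidential_table,
    [("independent", fit_independent), ("joint", fit_joint)],
    tolerance=0.08,
)
print("accepted model:", chosen)
```

The loop only tests analytic validity. In practice each pass would also include a confidentiality assessment, and user feedback on released versions would add new relations to the list of things to preserve.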

  11. Resources • John Abowd’s Social and Economic Data Course • LEHD Program Technical Papers • NSF-ITR Virtual Research Data Center
