
New Implementations of Noise for Tabular Magnitude Data, Synthetic Tabular Frequency and Microdata, and a Remote Microdata Analysis System

New Implementations of Noise for Tabular Magnitude Data, Synthetic Tabular Frequency and Microdata, and a Remote Microdata Analysis System. Laura Zayatz U.S. Census Bureau 4600 Silver Hill Road Washington, DC 20233-9100 301-763-4955 Fax 301-763-8399 laura.zayatz@census.gov.

Presentation Transcript


  1. New Implementations of Noise for Tabular Magnitude Data, Synthetic Tabular Frequency and Microdata, and a Remote Microdata Analysis System • Laura Zayatz • U.S. Census Bureau • 4600 Silver Hill Road • Washington, DC 20233-9100 • 301-763-4955 • Fax 301-763-8399 • laura.zayatz@census.gov

  2. Legal Requirements and the Balancing Act • Title 13, U.S. Code and the Confidential Information Protection and Statistical Efficiency Act (CIPSEA) of 2002 • Publish as much valuable statistical information as possible without violating the confidentiality of respondents • Preserve data utility while avoiding disclosure

  3. This Presentation • Noise for Tabular Magnitude Data • Synthetic Tabular Frequency and Microdata • Remote Microdata Analysis System

  4. This Presentation • Introduction to the method • What happened with real data • How we altered the method • Current uses of the method on real data products

  5. Noise for Tabular Magnitude Data: Introduction to the Method • Perturb each establishment’s underlying microdata by a small amount (e.g., 10%), randomly up or down, prior to table creation • Sensitive cells needing protection end up being changed by a large amount • Non-sensitive cells end up being changed by a small amount • Simple procedure, values can be shown for all cells, guarantees additivity, no coordination problems for related (overlapping) tables

  6. Noise for Tabular Magnitude Data: Introduction to the Method • To perturb an establishment’s value by about 10%, multiply that value by a random number close to 1.1 or 0.9 • Distribution must be symmetric about 1 for no bias • All establishments within the same company are perturbed in the same direction • Can incorporate the increase in variance into published coefficients of variation
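The perturbation described on slides 5 and 6 can be sketched in Python as follows. This is a minimal illustration, not the Census Bureau's production code: the uniform distribution, the `spread` parameter, and the record layout (`company`, `value`) are assumptions chosen only to satisfy the stated properties (multipliers near 1.1 or 0.9, symmetric about 1, one direction per company).

```python
import random

def draw_multiplier(direction, rng, center=0.10, spread=0.02):
    """Return a multiplier near 1 + center (direction=+1) or 1 - center (direction=-1).

    The uniform offset is an illustrative assumption; any distribution
    that is symmetric about 1 overall would avoid bias.
    """
    offset = rng.uniform(center - spread, center + spread)
    return 1.0 + direction * offset

def perturb(establishments, seed=0):
    """Add a noisy value to each establishment record.

    establishments: list of dicts with hypothetical keys 'company' and 'value'.
    All establishments of one company share the same noise direction.
    """
    rng = random.Random(seed)
    direction_by_company = {}
    noisy = []
    for est in establishments:
        company = est["company"]
        if company not in direction_by_company:
            # Up or down with equal probability keeps the scheme symmetric about 1.
            direction_by_company[company] = rng.choice((-1, 1))
        m = draw_multiplier(direction_by_company[company], rng)
        noisy.append({**est, "multiplier": m, "noisy_value": est["value"] * m})
    return noisy

example = perturb([
    {"company": "A", "value": 100.0},
    {"company": "A", "value": 50.0},
    {"company": "B", "value": 200.0},
])
```

A cell dominated by one establishment inherits roughly that establishment's 10% change, while a cell with many contributors from different companies tends to see the ups and downs partly cancel.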

  7. Noise for Tabular Magnitude Data: What Happened with Real Data • Because of randomness, the method can occasionally add excessive amounts of noise to some non-sensitive cells (a problem that cell suppression avoids for non-suppressed cells and controlled tabular adjustment can minimize) • Is there anything we can do to avoid this problem or at least improve results?

  8. Noise for Tabular Magnitude Data: What Happened with Real Data • US Census Bureau magnitude data is almost always published in rounded form (integer form representing thousands or millions) • Noise changes individual response values by a small percentage • Rounding can remove the effect of noise on small response values • Is that acceptable? Does rounding provide enough protection, or should additional steps be taken to protect such small values?
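A small arithmetic sketch shows why rounding can cancel the noise on small values. The helper names and the 10% multiplier are illustrative assumptions; the point is only that a 10% change is smaller than the rounding unit when the value itself is small.

```python
def published(value, unit=1000):
    """Round a raw value to the nearest whole publication unit (e.g. thousands)."""
    return round(value / unit)

def noise_survives_rounding(value, multiplier, unit=1000):
    """True if the rounded noisy value differs from the rounded original."""
    return published(value * multiplier, unit) != published(value, unit)

# A 2,000 response perturbed to about 2,200 still publishes as 2 (thousand),
# so the noise is invisible after rounding.
small = noise_survives_rounding(2_000, 1.1)

# A 2,000,000 response perturbed to about 2,200,000 publishes as 2200
# instead of 2000, so the noise remains visible.
large = noise_survives_rounding(2_000_000, 1.1)
```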

  9. Noise for Tabular Magnitude Data: How we Altered the Method • Balanced noise (see Massell and Funk) • Experiment and choose one or more tables to balance; quite often a lower-level table in the hierarchy is a good choice and has a trickle-up effect • Use random noise for establishments in sensitive cells and in companies represented in more than one cell • For all other establishments, use a sort to choose noise directions that minimize the change to non-sensitive cells
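One plausible reading of "use a sort to choose noise directions" is sketched below. It is not the Massell and Funk algorithm itself, whose details the slide does not give: the greedy rule, the function names, and the fixed 10% rate are assumptions. Within a non-sensitive cell, values are processed from largest to smallest and each one receives the direction that pulls the running signed total back toward zero.

```python
def balanced_directions(values):
    """Assign +1 or -1 to each value so the signed sum stays close to zero.

    Greedy rule (an illustrative assumption): visit values in descending
    order and push each one against the current running signed total.
    """
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    directions = [0] * len(values)
    running = 0.0
    for i in order:
        d = -1 if running > 0 else 1
        directions[i] = d
        running += d * values[i]
    return directions

def net_cell_change(values, directions, rate=0.10):
    """Approximate change in the cell total when each value moves by +/- rate."""
    return rate * sum(d * v for d, v in zip(directions, values))

cell = [100.0, 60.0, 50.0, 10.0]
dirs = balanced_directions(cell)
change = net_cell_change(cell, dirs)
```

Because the values are visited in descending order, the running signed total never exceeds the largest value in magnitude, so the net change to the cell is bounded by the noise on its largest contributor.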

  10. Noise for Tabular Magnitude Data: How we Altered the Method • Currently testing various modifications to standard rounding techniques • Options include rounding underlying microdata values and rounding tabulated cell values • Want to ensure standard rounding does not undo the protection provided by the noise • Ceiling/floor techniques seem to work well, but results differ for different data products

  11. Noise for Tabular Magnitude Data: Current Uses on Real Data Products • Done: Quarterly Workforce Indicators; Non-Employer Data Products • Near Future: Commodity Flow Survey; Census of Island Areas; Survey of Business Owners • Under Study: County Business Patterns

  12. Synthetic Tabular Frequency and Microdata: Introduction to the Method • Posterior predictive models generate synthetic data with many of the same statistical properties as the original data • Sequential regression imputation, one variable in one record at a time (blank and impute variables causing a disclosure risk for a given record) • Full or partial synthesis, demographic or economic, tables or microdata, one or more implicates
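The "blank and impute" idea can be illustrated with a deliberately simplified Python sketch of partial synthesis for a single variable. It fits an ordinary least-squares line and draws replacements from a normal distribution around the fitted value. A full posterior predictive model would also draw the regression parameters, and a sequential scheme would repeat the step for each variable in turn; the function names and the single-predictor setup are assumptions for illustration.

```python
import random
import statistics

def fit_line(xs, ys):
    """Least-squares fit of y on x; returns (intercept, slope, residual_sd).

    Needs at least three records and some variation in xs.
    """
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    sd = (sum(r * r for r in residuals) / (len(xs) - 2)) ** 0.5
    return intercept, slope, sd

def partially_synthesize(xs, ys, at_risk, seed=0):
    """Replace ys[i] for each index in at_risk with a draw from the fitted model.

    Records not flagged as a disclosure risk keep their original value.
    """
    rng = random.Random(seed)
    intercept, slope, sd = fit_line(xs, ys)
    return [
        rng.gauss(intercept + slope * x, sd) if i in at_risk else y
        for i, (x, y) in enumerate(zip(xs, ys))
    ]
```

Calling `partially_synthesize` several times with different seeds would give several implicates of the same file.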

  13. Synthetic Tabular Frequency and Microdata: What Happened with Real Data • Problems with relationships between variables within a data set • Records of households linked to records of all people within the household (father, mother, son, daughter, etc.) • Structurally missing (blank) values because of skip patterns in survey instrument • Examples: people under age 15 cannot have income; a mother cannot be only 6 years older than her child

  14. Synthetic Tabular Frequency and Microdata: How we Altered the Method • Impute some of the structurally missing values, but then restore them to missing for standard imputation and edits • For one product, this required an additional layer of programming that became a nine-level collection of parent-child relationships to enforce all constraints
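The two example constraints from slide 13 can be expressed as simple edit checks on a synthetic household. This sketch is far smaller than the nine-level structure mentioned above. The record layout (`role`, `age`, `income`) and the `min_mother_gap` threshold of 12 years are hypothetical; the slides say only that a 6-year gap is impossible.

```python
def find_violations(household, min_income_age=15, min_mother_gap=12):
    """Return a list of constraint violations for one synthetic household.

    household: list of person dicts with hypothetical keys
    'role', 'age', and 'income' (None means structurally missing).
    """
    problems = []
    for person in household:
        # Income is structurally missing for people under the minimum age.
        if person["age"] < min_income_age and person.get("income") is not None:
            problems.append(f"{person['role']} aged {person['age']} has income")
    mothers = [p for p in household if p["role"] == "mother"]
    children = [p for p in household if p["role"] in ("son", "daughter")]
    for mother in mothers:
        for child in children:
            gap = mother["age"] - child["age"]
            if gap < min_mother_gap:
                problems.append(f"mother is only {gap} years older than {child['role']}")
    return problems

bad_household = [
    {"role": "mother", "age": 30, "income": 40000},
    {"role": "son", "age": 24, "income": 15000},
    {"role": "daughter", "age": 10, "income": 500},
]
issues = find_violations(bad_household)
```

Records that fail such checks would need to be re-imputed or edited before release.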

  15. Synthetic Tabular Frequency and Microdata: Current Uses on Real Data Products • Done: SSA Earnings and CB SIPP Data; “On The Map” ACS Group Quarters Data • Under Study: ACS Household Data; Special Tabs for Veterans

  16. Remote Microdata Analysis System: Introduction to the Method • Advanced Query System allows users to generate tables from Census 2000 data • Request passes through 2 firewalls to previously swapped, recoded, and topcoded files; tables are generated and electronically reviewed for disclosure problems; if none are found the results are sent to the user • Can we extend this to data from demographic surveys and other types of statistical analyses?

  17. Remote Microdata Analysis System: What Happened with Real Data • Enabled or disabled system? We chose enabled • Disabled is more flexible for the user but may require “babysitting” • Enabled is more restricted in types of analyses but can be available to more people without strict monitoring • Users choose from lists of data sets, geographic areas, universes, analyses, and variables (system writes the code)
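The "enabled" design, in which users pick from lists and the system writes the code, might look roughly like the following Python sketch. Everything here is hypothetical: the allowed lists, the variable names, and the generated statement syntax are placeholders, not the actual system's menus or output.

```python
# Hypothetical menus; a real system would also list geographic areas and universes.
ALLOWED = {
    "dataset": {"ACS", "CPS"},
    "analysis": {"crosstab", "regression"},
    "variable": {"age_group", "sex", "income_band"},
}

def build_statement(dataset, analysis, variables):
    """Validate a user's menu choices and generate the analysis statement.

    The user never submits free-form code; anything outside the
    approved lists is rejected before any data is touched.
    """
    if dataset not in ALLOWED["dataset"]:
        raise ValueError(f"data set not offered: {dataset}")
    if analysis not in ALLOWED["analysis"]:
        raise ValueError(f"analysis not offered: {analysis}")
    if not variables:
        raise ValueError("at least one variable is required")
    for var in variables:
        if var not in ALLOWED["variable"]:
            raise ValueError(f"variable not offered: {var}")
    return f"{analysis}(data={dataset}, vars=[{', '.join(variables)}])"

statement = build_statement("ACS", "crosstab", ["age_group", "sex"])
```

Restricting input to approved choices is what lets such a system be opened to more users without case-by-case monitoring.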

  18. Remote Microdata Analysis System: How we Altered the Method • In looking for disclosure problems, we first focused on the model statements, but later realized the need to look at the underlying data tables (marginal totals of size 1 in particular) in various types of analyses • Working on methods to best identify “cut points” in the detail of short, medium, and long lists of continuous variables that need to be categorized
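A check on the underlying table, rather than on the model statement alone, could be sketched as below. The function names, record layout, and `min_margin` threshold are assumptions; the slide singles out marginal totals of size 1, so the default rejects any margin below 2.

```python
from collections import Counter

def marginal_totals(records, row_var, col_var):
    """Count records by each category of the row and column variables."""
    rows = Counter(r[row_var] for r in records)
    cols = Counter(r[col_var] for r in records)
    return rows, cols

def passes_review(records, row_var, col_var, min_margin=2):
    """False if any marginal total falls below min_margin (e.g. a margin of size 1)."""
    rows, cols = marginal_totals(records, row_var, col_var)
    return all(n >= min_margin for n in list(rows.values()) + list(cols.values()))

records = [
    {"sex": "F", "age_group": "15-24"},
    {"sex": "F", "age_group": "25-34"},
    {"sex": "M", "age_group": "15-24"},
    {"sex": "M", "age_group": "25-34"},
    {"sex": "M", "age_group": "85+"},   # only one person in this age group
]
release_ok = passes_review(records, "sex", "age_group")
```

Coarser cut points, such as merging the sparse top age group into a broader category, are one way to make a rejected request pass.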

  19. Remote Microdata Analysis System: Current Uses on Real Data Products • Done: Advanced Query System available to Census Bureau State Data Centers and Census Information Centers and researchers who request an account • Under Study: Extended Microdata Analysis System being tested with American Community Survey and Current Population Survey

  20. Conclusion • Many recent developments in disclosure avoidance at the US Census Bureau • Using the noise technique for several tabular magnitude data products • Releasing several products based on partially synthetic data • The Advanced Query System (AQS) is being used widely and work continues on the Microdata Analysis System (MAS) • It takes time, but it is worth the effort
