New Implementations of Noise for Tabular Magnitude Data, Synthetic Tabular Frequency and Microdata, and a Remote Microdata Analysis System. Laura Zayatz U.S. Census Bureau 4600 Silver Hill Road Washington, DC 20233-9100 301-763-4955 Fax 301-763-8399 laura.zayatz@census.gov.
Legal Requirements and the Balancing Act • Title 13, U.S. Code and the Confidential Information Protection and Statistical Efficiency Act (CIPSEA) of 2002 • Publish as much valuable statistical information as possible without violating the confidentiality of respondents • Preserve data utility while avoiding disclosure
This Presentation • Noise for Tabular Magnitude Data • Synthetic Tabular Frequency and Microdata • Remote Microdata Analysis System
This Presentation • Introduction to the method • What happened with real data • How we altered the method • Current uses of the method on real data products
Noise for Tabular Magnitude Data: Introduction to the Method • Perturb each establishment’s underlying microdata by a small amount, e.g., 10%, randomly up or down prior to table creation • Sensitive cells needing protection end up being changed by a large amount • Non-sensitive cells end up being changed by a small amount • Simple procedure, values can be shown for all cells, guarantees additivity, no coordination problems for related (overlapping) tables
Noise for Tabular Magnitude Data: Introduction to the Method • To perturb an establishment’s value by about 10%, multiply that value by a random number close to 1.1 or 0.9 • Distribution must be symmetric about 1 for no bias • All establishments within the same company are perturbed in the same direction • Can incorporate the increase in variance into published coefficients of variation
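The multiplier scheme described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the Census Bureau's production code: the uniform spread around 10%, the function names, and the (company_id, value) record layout are all hypothetical.

```python
import random

def noise_multiplier(direction, rng, center=0.10, spread=0.02):
    """Return a multiplier near 1 + center (direction=+1) or 1 - center (direction=-1).

    Assumption: the offset is uniform on [center - spread, center + spread].
    Because up and down are equally likely, the distribution is symmetric about 1.
    """
    offset = center + rng.uniform(-spread, spread)
    return 1.0 + direction * offset

def perturb(establishments, seed=0):
    """Perturb a list of (company_id, value) pairs; returns the noisy values.

    Every establishment of a given company shares one randomly chosen direction.
    """
    rng = random.Random(seed)
    directions = {}
    noisy = []
    for company, value in establishments:
        if company not in directions:
            directions[company] = rng.choice((-1, 1))
        noisy.append(value * noise_multiplier(directions[company], rng))
    return noisy
```

Tables are then built from the noisy values, so every cell can be published and additivity holds automatically.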
Noise for Tabular Magnitude Data: What Happened with Real Data • Because of randomness, the method can occasionally add excessive amounts of noise to some non-sensitive cells (a problem that cell suppression avoids for non-suppressed cells and controlled tabular adjustment can minimize) • Is there anything we can do to avoid this problem or at least improve results?
Noise for Tabular Magnitude Data: What Happened with Real Data • US Census Bureau magnitude data is almost always published in rounded form (integer form representing thousands or millions) • Noise changes individual response values by a small percentage • Rounding can remove the effect of noise on small response values • Does rounding alone provide enough protection, or should additional steps be taken to protect such small values?
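A small arithmetic illustration of the rounding issue, using made-up dollar amounts and a hypothetical helper named published_value that rounds to thousands:

```python
def published_value(value, unit=1000):
    """Round a dollar amount to the nearest whole publication unit (thousands here)."""
    return round(value / unit)

small_true = 3_000
small_noisy = small_true * 1.1      # about 3,300 after 10% upward noise
large_true = 480_000
large_noisy = large_true * 1.1      # about 528,000 after 10% upward noise

# The small value publishes as 3 with or without noise, so rounding hides the noise.
small_unchanged = published_value(small_true) == published_value(small_noisy)
# The large value publishes as 480 without noise and 528 with it.
large_changed = published_value(large_true) != published_value(large_noisy)
```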
Noise for Tabular Magnitude Data: How we Altered the Method • Balanced Noise (See Massell and Funk) • Experiment and choose one or more tables; quite often a lower-level table (in the hierarchy) is a good choice and has a trickle-up effect • Random noise for establishments in sensitive cells and in companies represented in more than one cell • For others, use a sort to choose noise directions to minimize change to non-sensitive cells
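The slide does not spell out the sort. One way the idea could work, shown here purely as an assumption and not as Massell and Funk's actual algorithm, is to sort a non-sensitive cell's establishments by size and give each one the direction that pulls the running net change back toward zero. The function name and the fixed 10% rate are hypothetical, and the random assignment for sensitive cells and multi-cell companies is left out.

```python
def balanced_directions(values, rate=0.10):
    """Choose +1/-1 noise directions for one cell so the net change stays small.

    Hypothetical greedy rule: visit establishments from largest to smallest and
    push each one against the sign of the net change accumulated so far.
    """
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    directions = [0] * len(values)
    net = 0.0
    for i in order:
        d = -1 if net > 0 else 1
        directions[i] = d
        net += d * rate * values[i]
    return directions
```

With this rule the absolute net change to the cell never exceeds the noise applied to its largest establishment.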
Noise for Tabular Magnitude Data: How we Altered the Method • Currently testing various modifications to standard rounding techniques • Options include rounding underlying microdata values and rounding tabulated cell values • Want to ensure standard rounding does not undo the protection provided by the noise • Ceiling/Floor techniques seem to work well, but results differ for different data products
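The slide does not define the ceiling/floor techniques. One hypothetical reading, offered only as an assumption, is to round up when the noise was applied upward and down when it was applied downward, so that rounding cannot cancel the direction of the perturbation. The function name and the thousands unit are illustrative.

```python
import math

def directional_round(noisy_value, direction, unit=1000):
    """Round to whole units, up for upward noise and down for downward noise.

    Hypothetical rule: direction is +1 or -1, matching the noise direction.
    """
    scaled = noisy_value / unit
    return math.ceil(scaled) if direction > 0 else math.floor(scaled)
```

For a true value of 3,000, standard rounding of the noisy values 3,300 and 2,700 would publish 3 in both cases, while this rule publishes 4 and 2.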
Noise for Tabular Magnitude Data: Current Uses on Real Data Products • Done: Quarterly Workforce Indicators • Non-Employer Data Products • Near Future: Commodity Flow Survey • Census of Island Areas • Survey of Business Owners • Under Study: County Business Patterns
Synthetic Tabular Frequency and Microdata: Introduction to the Method • Posterior predictive models generate synthetic data with many of the same statistical properties as the original data • Sequential regression imputation, one variable in one record at a time (blank and impute variables causing a disclosure risk for a given record) • Full or partial synthesis, demographic or economic, tables or microdata, one or more implicates
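As a rough sketch of what a posterior predictive draw looks like, the following replaces one variable with synthetic values from a normal linear regression on a single predictor, under a flat prior. It is a textbook simplification, not the Census Bureau's models: the function name, the single predictor, and the normality assumption are all illustrative. A sequential scheme would repeat such a step variable by variable, and each call yields one implicate.

```python
import random
from statistics import fmean

def synthesize(x, y, rng):
    """Draw one synthetic copy of y from the posterior predictive of y given x.

    Model (assumption): y = a + b * (x - mean(x)) + normal error, flat prior.
    """
    n = len(x)
    x_bar, y_bar = fmean(x), fmean(y)
    xc = [xi - x_bar for xi in x]                      # centered predictor
    sxx = sum(c * c for c in xc)
    slope_hat = sum(c * yi for c, yi in zip(xc, y)) / sxx
    sse = sum((yi - y_bar - slope_hat * c) ** 2 for c, yi in zip(xc, y))
    # Error variance: SSE divided by a chi-square draw with n - 2 degrees of freedom.
    sigma2 = sse / rng.gammavariate((n - 2) / 2, 2.0)
    # With a centered predictor, intercept and slope are independent given sigma2.
    intercept = rng.gauss(y_bar, (sigma2 / n) ** 0.5)
    slope = rng.gauss(slope_hat, (sigma2 / sxx) ** 0.5)
    sd = sigma2 ** 0.5
    return [intercept + slope * c + rng.gauss(0.0, sd) for c in xc]
```

Drawing the parameters as well as the errors is what distinguishes a posterior predictive draw from simply adding residual noise to fitted values.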
Synthetic Tabular Frequency and Microdata: What Happened with Real Data • Problems with relationships between variables within a data set • Records of households linked to records of all people within the household (father, mother, son, daughter, etc.) • Structurally missing (blank) values because of skip patterns in survey instrument • Examples: people under age 15 cannot have income, a mother cannot be only 6 years older than her child
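Constraints like the two examples above can be expressed as simple record checks that synthetic data must pass. The sketch below is illustrative only: the dictionary layout, the function name, and the minimum mother-child age gap of 12 years are assumptions (the slide says only that a 6-year gap is impossible).

```python
def constraint_violations(person, mother=None, min_income_age=15, min_mother_gap=12):
    """Return a list of constraint violations for one synthetic person record.

    person and mother are dicts with an "age" key; person may have "income",
    where None means structurally missing. min_mother_gap is a hypothetical value.
    """
    problems = []
    if person["age"] < min_income_age and person.get("income") is not None:
        problems.append("income reported below minimum age")
    if mother is not None and mother["age"] - person["age"] < min_mother_gap:
        problems.append("mother too close in age to child")
    return problems
```

A synthesizer could redraw any record for which this list is non-empty, or build the constraints into the models so that violations cannot be generated.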
Synthetic Tabular Frequency and Microdata: How we Altered the Method • Impute some of the structurally missing values, but then restore them to missing for standard imputation and edits • For one product, an additional layer of programming that became a nine-level collection of parent-child relationships was needed to enforce all constraints
Synthetic Tabular Frequency and Microdata: Current Uses on Real Data Products • Done: SSA Earnings and CB SIPP Data • “On The Map” ACS Group Quarters Data • Under Study: ACS Household Data • Special Tabs for Veterans
Remote Microdata Analysis System: Introduction to the Method • Advanced Query System allows users to generate tables from Census 2000 data • Request passes through two firewalls to previously swapped, recoded, and topcoded files; tables are generated and electronically reviewed for disclosure problems; if none are found the results are sent to the user • Can we extend this to data from demographic surveys and other types of statistical analyses?
Remote Microdata Analysis System: What Happened with Real Data • Enabled or disabled system? We chose enabled • Disabled is more flexible for the user but may require “babysitting” • Enabled is more restricted in types of analyses but can be available to more people without strict monitoring • Users choose from lists of data sets, geographic areas, universes, analyses, and variables (system writes the code)
Remote Microdata Analysis System: How we Altered the Method • In looking for disclosure problems, we first focused on the model statements, but later realized the need to look at the underlying data tables (marginal totals of size 1 in particular) in various types of analyses • Working on methods to best identify “cut points” in the detail of short, medium, and long lists of continuous variables that need to be categorized
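A check for marginal totals of size 1 in the table underlying an analysis might look like the sketch below. The function name and the list-of-rows table layout are assumptions for illustration; the actual system's review rules are not described on the slide.

```python
def unit_marginals(table):
    """Flag row and column totals equal to 1 in a two-way table of counts.

    table is a list of equal-length rows; returns (axis, index) pairs.
    """
    flagged = [("row", i) for i, row in enumerate(table) if sum(row) == 1]
    flagged += [("column", j) for j, col in enumerate(zip(*table)) if sum(col) == 1]
    return flagged
```

A request whose underlying table returns any flagged margin could be rejected or recoded before results are sent to the user.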
Remote Microdata Analysis System: Current Uses on Real Data Products • Done: Advanced Query System available to Census Bureau State Data Centers and Census Information Centers and researchers who request an account • Under Study: Extended Microdata Analysis System being tested with American Community Survey and Current Population Survey
Conclusion • Many recent developments in disclosure avoidance at the US Census Bureau • Using the noise technique for several tabular magnitude data products • Releasing several products based on partially synthetic data • AQS is being used widely and work continues on the MAS • It takes time, but it is worth the effort