Official Statistics and Confidentiality Maura Bardos
Outline • Overview of the Federal Statistical System • Agencies • Types of survey data collected • Challenges • Statistical Disclosure and confidentiality • Implications
Federal Statistical System • Headed by a Chief Statistician • Decentralized System in the United States • 13 Agencies with a statistics oriented mission • Statistical Agencies are located throughout various agencies in the Federal Government • Examples: Census (Commerce Department), Energy Information Administration (Department of Energy), Bureau of Labor Statistics (Department of Labor)
Data • Where do the numbers come from? • Survey data • Regulations by OMB • Response rates • Legal obligations • Confidentiality
Confidentiality • Confidential Information Protection and Statistical Efficiency Act of 2002 (CIPSEA): places the onus on federal employees to limit disclosure • Took over 4 years to implement (Anderson and Seltzer) • 3 ways to reduce disclosure risk within agencies: • 1) Limiting the identifiability of survey materials within the organization • 2) Restricting access to data • 3) Restricting the contents that may be released
Statistical Disclosure and Confidentiality • Statistical disclosure: “the identification of an individual (or of an attribute) through the matching of survey data with information available outside of the survey” (Groves et al.) • Disclosure relates to the inappropriate attribution of information to a data subject, whether an individual or an organization • The federal government identifies three different types of disclosure (FCSM): • Identity: a data subject is identified from a released file • Attribute: sensitive information about a data subject is revealed through the released file • Inferential: the released data make it possible to determine the value of some characteristic of an individual more accurately than otherwise would have been possible
Challenges • Need to provide information • FOIA requests, Subpoenas • Satisfy requests for multiple clients. Must keep track of all withheld information • Maintain utility of data while preserving confidentiality • “Programming nightmare” to keep track of the relationship between variables, tables, and hierarchy
How To Prevent • Specific Strategies • Data Swapping • Noise • Combining Cells • Rounding • Cell Suppression
Strategy: Data Swapping • Exchange of reported data values across data records (Fienberg, Steele, Makov, 1996)
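As a rough illustration of data swapping, the sketch below exchanges the values of one field between randomly chosen pairs of records. The function name `swap_values`, the record layout, and the pair-selection scheme are assumptions made for this example; they are not taken from Fienberg, Steele, and Makov, and real swapping schemes typically constrain which records may be paired.

```python
import random

def swap_values(records, field, n_swaps, seed=0):
    """Return a copy of `records` with `field` exchanged between random record pairs.

    Hypothetical helper: pairs are chosen uniformly at random, which is an
    assumption of this sketch rather than a published swapping scheme.
    """
    rng = random.Random(seed)
    swapped = [dict(r) for r in records]  # copy so the input is left untouched
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(swapped)), 2)  # two distinct record indices
        swapped[i][field], swapped[j][field] = swapped[j][field], swapped[i][field]
    return swapped

if __name__ == "__main__":
    people = [{"id": i, "income": 10_000 * (i + 1)} for i in range(5)]
    print(swap_values(people, "income", n_swaps=2))
```

The set of values for the swapped field is unchanged, so its overall distribution is preserved while the link between a value and a particular record is weakened.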
Strategy: Noise • Assign a multiplying factor, or noise factor to all data • For example: the value of a randomly generated variable might be added to each value in a dataset • “protect individual establishments without compromising the quality of our estimates” • Pro: More data can be published, less complicated, less time consuming • Problem: perturbing ALL data, non-sensitive and sensitive alike
Strategy: Noise • How is this done: use multipliers • The standard is to perturb data by about 10% • Use multipliers ranging from 0.9 to 1.1 • Must preserve trends in the data; otherwise it is useless for the client’s analysis • Use distributions to control variance (examples)
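A minimal sketch of multiplicative noise, assuming each establishment receives a single multiplier close to either 0.9 or 1.1 that is applied to all of its values. The names `draw_multiplier` and `add_noise`, the `spread` parameter, and the uniform distribution are illustrative assumptions; the slides do not specify which distribution is used.

```python
import random

def draw_multiplier(rng, spread=0.02):
    """Draw a multiplier in [0.9, 0.9 + spread] or [1.1 - spread, 1.1].

    Assumption: direction is a fair coin flip and magnitude is uniform.
    """
    direction = rng.choice([-1, 1])
    magnitude = rng.uniform(0.10 - spread, 0.10)
    return 1 + direction * magnitude

def add_noise(values_by_establishment, seed=0, spread=0.02):
    """Apply one multiplier per establishment to every value it reported."""
    rng = random.Random(seed)
    noisy = {}
    for establishment, values in values_by_establishment.items():
        m = draw_multiplier(rng, spread)
        noisy[establishment] = [v * m for v in values]
    return noisy

if __name__ == "__main__":
    reported = {"A": [100.0, 120.0, 140.0], "B": [40.0, 38.0, 45.0]}
    print(add_noise(reported))
```

Because every value from one establishment is scaled by the same factor, ratios between that establishment's own values are unchanged, which is one simple way to keep its trend intact.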
Tables • Before Tabulation Strategies: Data Swapping; Data Perturbation (Noise) • Tables of Frequencies • Percent of population with certain characteristics • With outside knowledge- respondents with unique characteristics can be identified • Sensitive information: identified by threshold • Tables of magnitude data • Aggregate data, such as income of individuals, revenues of companies • Extreme values • Sensitive information: identified by linear sensitivity measure
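For frequency tables, the threshold idea can be sketched as follows. The threshold of 3 and the name `flag_small_cells` are hypothetical choices for this example; the slides do not state which threshold is used.

```python
def flag_small_cells(counts, threshold=3):
    """Return the non-empty cells whose count falls below `threshold`.

    `threshold=3` is an assumed value for illustration only.
    """
    return {cell: n for cell, n in counts.items() if 0 < n < threshold}

if __name__ == "__main__":
    table = {("County A", "Age 85+"): 2, ("County A", "Age 65-84"): 40}
    print(flag_small_cells(table))
```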
Strategy: Recoding Methods • Changing the values of outlier cases, since outliers are more likely to be sample or population uniques • Top coding: taking the largest values on a variable and giving them the same code value in the dataset • For example: place all companies producing more than 100,000 barrels of oil per day in one category • Non-uniques are unperturbed
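Top coding can be sketched in a few lines. Here values above the cutoff are replaced by the cutoff itself, which is one common convention and an assumption of this example (a category label such as "100,000+" would work equally well); `top_code` is a hypothetical name.

```python
def top_code(values, cutoff):
    """Replace every value above `cutoff` with `cutoff`; leave the rest unchanged."""
    return [min(v, cutoff) for v in values]

if __name__ == "__main__":
    barrels_per_day = [12_000, 95_000, 130_000, 410_000]
    print(top_code(barrels_per_day, 100_000))
```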
Strategy: Rounding • Similar to noise: each cell is rounded, and a random decision is made whether to round up or down • Example: round values to a multiple of 5 • Write the cell value as $x = 5q + r$, where $q$ is a non-negative integer and $r$ is the remainder ($0 \le r < 5$) • Round up to $5(q+1)$ with probability $r/5$ • Round down to $5q$ with probability $1 - r/5$
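The random rounding rule on this slide can be sketched as below. The function name `random_round` and the optional `rng` argument are assumptions added for the example; the probabilities follow the slide.

```python
import random

def random_round(x, base=5, rng=None):
    """Randomly round a non-negative integer `x` to a multiple of `base`.

    Rounds up with probability r/base and down with probability 1 - r/base,
    where r is the remainder of x divided by base.
    """
    rng = rng or random
    q, r = divmod(x, base)
    if r == 0:
        return x  # already a multiple of base
    if rng.random() < r / base:
        return base * (q + 1)
    return base * q

if __name__ == "__main__":
    print([random_round(x) for x in (3, 7, 10, 14)])
```

With these probabilities the expected value of the rounded cell equals the original value, since $5q(1 - r/5) + 5(q+1)(r/5) = 5q + r$.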
How to identify cells with disclosure risks for magnitude data • n-k rule • p% rule
P-Percent Rule • If upper or lower estimates for the respondent’s value are closer to the reported value than some prespecified percentage (p) of the total cell value, the cell is sensitive (Groves, 372). • Assumptions: Any respondent can estimate the contribution of another respondent within 100% of its value • The second-largest respondent can use their own reported value to attempt to estimate the largest reported value, $x_1$
P-Percent Rule • For a given cell with $N$ respondents, arrange the contributions from largest to smallest: $x_1 \ge x_2 \ge \dots \ge x_N \ge 0$ • Let $T$ be the total cell value • The cell is sensitive if $S > 0$, where $S = x_1 - \frac{100}{p}\,(T - x_2 - x_1)$
Example • Consider a cell with total value $T = 18{,}177$ • $N = 3$; $x_1 = 17{,}000$; $x_2 = 1{,}000$; $x_3 = 177$; $p = 15$
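Working this example through the p-percent formula (a sketch using only the values given on the slide):

$$
S = x_1 - \frac{100}{p}\,(T - x_2 - x_1) = 17{,}000 - \frac{100}{15}\,(18{,}177 - 1{,}000 - 17{,}000) = 17{,}000 - 1{,}180 = 15{,}820 > 0
$$

Since $S > 0$, the cell is sensitive. Intuitively, the second-largest respondent can subtract its own value from the total to get $17{,}177$, which overestimates $x_1$ by only $177$, far less than 15% of $17{,}000$.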
(n, k) Rule • If a small number (n) of the respondents contribute a large percentage (k) to the total cell value then the cell is sensitive (Groves 372)
Example • We are publishing production data on how many barrels of crude oil per day each refinery produces. This is secret information: if our competitors found out, it could be detrimental to our business. • There are 4 collectors in the state, with collections of 100, 50, 25, and 5 respectively • Should this cell be released? Check using the (n, k) rule with (2, 85) and the p-percent rule with p = 35% • By the p-percent rule, this cell is sensitive. However, it is not sensitive by the (n, k) rule
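Both rules can be checked with a short sketch. The function names `p_percent_sensitive` and `nk_sensitive` are hypothetical; the formulas follow the definitions on the earlier slides.

```python
def p_percent_sensitive(contributions, p):
    """Return (is_sensitive, S) with S = x1 - (100/p) * (T - x1 - x2)."""
    xs = sorted(contributions, reverse=True)
    remainder = sum(xs[2:])  # T - x1 - x2: everything except the two largest
    s = xs[0] - (100 / p) * remainder
    return s > 0, s

def nk_sensitive(contributions, n, k):
    """Return (is_sensitive, share), where share is the top-n percentage of the total."""
    xs = sorted(contributions, reverse=True)
    share = 100 * sum(xs[:n]) / sum(xs)
    return share > k, share

if __name__ == "__main__":
    cell = [100, 50, 25, 5]
    print(p_percent_sensitive(cell, 35))
    print(nk_sensitive(cell, 2, 85))
```

For the cell above, the two largest contributors hold 150 of 180 (about 83.3%), which is below 85%, so the (n, k) rule does not flag it. The p-percent measure is $100 - \frac{100}{35}\cdot 30 \approx 14.3 > 0$, so that rule does.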
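System of inequalities • Let $Z_1$ and $Z_2$ be the percentage shares of the cell total held by the largest and second-largest respondents • p% rule ($p = 35$): sensitive if $Z_2 > 100 - 1.35\,Z_1$ • (n, k) rule with (2, 85): sensitive if $Z_2 > 85 - Z_1$ • Variable constraints: $Z_2 \le Z_1$ and $Z_1 + Z_2 \le 100$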
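The p% inequality follows from the sensitivity measure when contributions are expressed as percentage shares, so that the cell total is 100. A short derivation, assuming $p = 35$ as in the example:

$$
\begin{aligned}
S > 0 &\iff Z_1 > \frac{100}{35}\,(100 - Z_1 - Z_2) \\
&\iff 0.35\,Z_1 > 100 - Z_1 - Z_2 \\
&\iff Z_2 > 100 - 1.35\,Z_1
\end{aligned}
$$

For the example cell, $Z_1 = 100/180 \approx 55.6$ and $Z_2 = 50/180 \approx 27.8$. Then $100 - 1.35\,Z_1 = 25 < Z_2$, so the p% rule flags the cell, while $85 - Z_1 \approx 29.4 > Z_2$, so the (n, k) rule does not.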
Strategy: Sensitive Cell Suppression • Primary Suppressions: The sensitive Cell • Complementary/Secondary Suppressions: Additional withheld data to ensure that the primary suppressions cannot be derived by linear combination • Goal: Minimize information lost. This is accomplished by selecting smallest possible cell values for complementary cell suppression • Problem: Often requires a substantial amount of data to be withheld. Potential for errors may lead to the release of confidential data
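To see why complementary suppressions are needed, consider a two-way table whose row and column totals are published. The sketch below is a simplified audit that only looks for cells recoverable by repeated subtraction from a published total. The name `recoverable_cells` is hypothetical, and production audits generally go further (for example, bounding each suppressed value rather than only checking exact recovery).

```python
def recoverable_cells(suppressed):
    """Return suppressed interior cells that can be derived by subtraction.

    `suppressed` is a 2D list of booleans for interior cells. Row and column
    totals are assumed to be published. A cell is recoverable when it is the
    only unknown left in its row or column; recovery is repeated until no
    further cells fall out.
    """
    hidden = {
        (i, j)
        for i, row in enumerate(suppressed)
        for j, flag in enumerate(row)
        if flag
    }
    recovered = set()
    changed = True
    while changed:
        changed = False
        for i, j in sorted(hidden):
            alone_in_row = sum(1 for c in hidden if c[0] == i) == 1
            alone_in_col = sum(1 for c in hidden if c[1] == j) == 1
            if alone_in_row or alone_in_col:
                hidden.remove((i, j))
                recovered.add((i, j))
                changed = True
    return recovered

if __name__ == "__main__":
    only_primary = [[True, False, False], [False, False, False], [False, False, False]]
    print(recoverable_cells(only_primary))
```

A lone primary suppression is always recoverable this way, which is why at least one additional cell in the same row and the same column must also be withheld.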
Strategy: Sensitive Cell Suppression • Small Tables: • Manual suppression • Computerized audit procedures • Large Tables: • Much more complex, especially with related tables and hierarchical data • Consistency
Cell Suppression Example • Let’s return to a previous example: Sales Revenue • We determined that the cell must be suppressed. How do we accomplish this?
Conclusion: Data is secure • High levels of security and suppression are necessary to protect data, because the data guides real-life policy decisions. • Quality of this data depends not only on a high response rate, but also on accurate responses • Producing data is a function of “public trust” • However, the point of data collection is its use and analysis. The tradeoff between confidentiality and utilization must be examined
…Or is it? • Patriot Act 2001 (Anderson & Seltzer) • Section 508: Disclosure of information from National Center for Education Statistics surveys • The Justice Department is able to obtain and use reports, records, and information (including individually identifiable information) for investigation and prosecution • The Patriot Act overrides the 1994 National Center for Education Statistics Act that protected confidentiality
Other examples from history • Second War Powers Act (1942-1947) • Repealed the confidentiality protections of Title 13 governing the US Census Bureau (Anderson & Seltzer) • Japanese Americans and internment camps (USA Today)
2004 data on Arab-Americans (NYT) • Released number of Arab-Americans per zip code • Categorized by country of origin: Egyptian, Iraqi, Jordanian, Lebanese, Moroccan, Palestinian, Syrian and two general categories, "Arab/Arabic" and "Other Arab." • Data obtained from a sample (the long form of the census)
In conclusion… …the next time you fill out a survey, think about where your information may (or may not) be used.
Sources • Clemetson, Lynette. “Homeland Security given data on Arab-Americans.” New York Times. July 30, 2004. http://www.nytimes.com/2004/07/30/politics/30census.html • El Nasser, Haya. “Papers show Census role in WWII Camps.” USA Today. March 30, 2007. http://www.usatoday.com/news/nation/2007-03-30-census-role_N.htm • “DoD releases FY 2010 Budget Proposal.” US Department of Defense. May 7, 2009. http://www.defenselink.mil/releases/release.aspx?releaseid=12652 • Seltzer, William and Margo Anderson. “NCES and the Patriot Act.” Paper prepared for the Joint Statistical Meetings. 2002. http://www.uwm.edu/~margo/govstat/jsm.pdf • Evans, Timothy, Laura Zayatz, and John Slanta. “Using Noise for Disclosure Limitation of Establishment Tabular Data.” US Census Bureau. 1996. http://www.census.gov/prod/2/gen/96arc/iiaevans.pdf • “Statistical Programs of the US Government.” Office of Management and Budget. 2009. http://www.whitehouse.gov/omb/assets/information_and_regulatory_affairs/09statprog.pdf
Sources of examples • Sullivan, Colleen. “An Overview of Disclosure Principles.” US Census Bureau. 1992. http://www.2010census.biz/srd/papers/pdf/rr92-09.pdf • “Statistical Policy Working Paper: Report on Statistical Disclosure Methodology.” Federal Committee on Statistical Methodology. 2005. http://www.fcsm.gov/working-papers/SPWP22_rev.pdf • Groves, Robert et al. Survey Methodology. Hoboken, NJ: John Wiley & Sons. 2004.
Additional Resources • http://jpc.cylab.cmu.edu/journal/2009/vol01/issue01/issue01.pdf • http://www.census.gov/srd/sdc/papers.html • http://www.census.gov/srd/sdc/abowd-woodcock2001-appendix-only.pdf