Releasing microdata via public use files while protecting respondents’ confidentiality
Presented by Neeraja Sathe, Feng Yu, and Lanting Dai
RTI International, 3040 Cornwallis Road, Research Triangle Park, North Carolina 27709, USA
Acknowledgments
• This presentation is sponsored by RTI International’s Division for Statistical & Data Sciences.
• The National Survey on Drug Use and Health (NSDUH) is used as an example in this presentation. NSDUH is funded by the Substance Abuse and Mental Health Services Administration (SAMHSA), Center for Behavioral Health Statistics and Quality (CBHSQ), under Contract No. HHSS283201700002C.
• The views expressed in this presentation do not necessarily reflect the official position or policies of SAMHSA or the U.S. Department of Health and Human Services, nor does mention of trade names, commercial practices, or organizations imply endorsement by the U.S. Government.
Outline
1. What are microdata, and why do we need them?
2. What is disclosure risk?
   a. Disclosure scenarios
   b. Intrusion scenarios
   c. Disclosure avoidance treatment and data quality
   d. Common statistical disclosure limitation (SDL) methods
3. NSDUH public use files (PUFs): An example
   a. NSDUH background
   b. MASSC
   c. NSDUH PUF data quality
4. Other examples of microdata PUFs
5. References
1. What are microdata, and why do we need them?
• Microdata are typically “unit-level” data that have not yet been summarized into tables. The unit can be defined broadly: a person, a household, an institution, etc.
• Microdata help us make informed decisions.
• Microdata give researchers flexibility in performing analyses.
• Demand for ready access to microdata is increasing among researchers at large and among policymakers.
2. What is disclosure risk?
• Disclosure occurs when private information is revealed to the public; specifically, when an unauthorized person (an “intruder”) links a record in the microdata file to an identifiable respondent.
• Disclosure risk is typically considered a measure of the risk to respondent confidentiality that the data releaser (typically, a statistical agency) would incur as a consequence of releasing the data (Franconi & Polettini, 2004).
2a. Disclosure scenarios
• Identity disclosure occurs when a direct identifying variable (IV) reveals the identity of a particular individual (e.g., name, phone number, date of birth).
• Attribute disclosure occurs when confidential information (typically, a sensitive value [SV]) about a data subject is revealed and can be attributed to the subject on a PUF (e.g., a respondent’s institution may reveal a diagnosis).
• Inferential disclosure occurs when it becomes possible to determine the value of some characteristic of the data subject more accurately than would otherwise have been possible (e.g., ZIP Code or type of car may reflect financial status).
2b. Intrusion scenarios
Intrusion scenarios fall into two main categories:
• Inside intrusion occurs when intruders know that their targets are in the survey and try to discover sensitive information by combining the identifying information in the data file with personal knowledge (e.g., a father knows that his son is in the survey and wants to discover his son’s substance use behavior).
• Outside intrusion occurs when intruders do not know whether their targets are in the sample and try to identify a record (any record) by matching it to an external database (e.g., an outside intruder targets a population unique in the sample to find more information).
2c. Disclosure avoidance treatment and data quality
• To protect respondents’ confidentiality and to comply with U.S. federal law, such as the Confidential Information Protection and Statistical Efficiency Act of 2002 (CIPSEA), statistical disclosure avoidance treatment is often applied to large survey data to minimize disclosure risk.
• There is a trade-off between disclosure risk and data quality: as the level of disclosure treatment increases, data utility decreases. A statistical disclosure avoidance method must therefore control both disclosure risk and data quality.
2d. Common SDL methods
• Deterministic methods (a toy sketch follows this slide):
  • Variable deletion (removing variables that pose a high disclosure risk, particularly direct IVs, from a released dataset)
  • Global recoding (recoding continuous variables into intervals and collapsing categories with few observations into broader categories)
  • Local suppression (suppressing the SVs of certain records)
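To make these three operations concrete, here is a minimal pandas sketch. The data frame, its columns (name, phone, age, race, income), the age intervals, and the suppression threshold are all hypothetical, not NSDUH rules.

```python
# Toy illustration of the three deterministic SDL methods (hypothetical data).
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name":   ["Ann", "Bob", "Cal", "Dee"],
    "phone":  ["555-0101", "555-0102", "555-0103", "555-0104"],
    "age":    [23, 57, 34, 81],
    "race":   ["A", "B", "A", "C"],
    "income": [42_000, 98_000, 57_000, 61_000],
})

# 1. Variable deletion: drop direct identifying variables (IVs).
df = df.drop(columns=["name", "phone"])

# 2. Global recoding: collapse a continuous variable into broad intervals.
df["age_group"] = pd.cut(df["age"], bins=[0, 25, 64, 120],
                         labels=["<=25", "26-64", "65+"])
df = df.drop(columns=["age"])

# 3. Local suppression: blank out the sensitive value (SV) of records whose
#    category is too rare to release safely (threshold of 2 is illustrative).
rare = df["race"].map(df["race"].value_counts()) < 2
df.loc[rare, "income"] = np.nan

print(df)
```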
2d. Common SDL methods (continued)
• Perturbative methods:
  • Random swapping/substitution (a subset of data attributes containing SVs or IVs is exchanged between selected pairs of records so that it is difficult for an intruder to definitively identify an individual)
  • Random subsampling (sampling out a few observations from the survey data, so intruders do not know whether their targets are on the PUF)
  • Post RAndomisation Method (PRAM) (for each record in a microdata file, the responses for selected variables are perturbed according to a specified probability mechanism; a toy sketch follows this slide)
  • Micro Agglomeration, Random Substitution, Random Subsampling, and Weight Calibration (MASSC) (an RTI-invented method discussed in detail later)
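A toy sketch of PRAM under an assumed 3x3 transition matrix: each record’s released category is drawn from the row of the matrix that corresponds to its true category. The variable, its categories, and the probabilities are illustrative only, not an operational setting.

```python
# Toy PRAM illustration: P[i][j] = Pr(release category j | true category i).
import numpy as np

rng = np.random.default_rng(42)
categories = ["urban", "suburban", "rural"]        # hypothetical variable
P = np.array([[0.90, 0.05, 0.05],                  # each row sums to 1
              [0.05, 0.90, 0.05],
              [0.05, 0.05, 0.90]])

true_values = rng.choice(len(categories), size=10)  # fake survey column
released = np.array([rng.choice(len(categories), p=P[v]) for v in true_values])

print("true:    ", [categories[v] for v in true_values])
print("released:", [categories[v] for v in released])
```

Keeping large probabilities on the diagonal preserves most responses, so aggregate statistics change little while any single released value remains deniable.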
2d. Common SDL methods (continued)
• Synthetic data generation:
  • Using multiple imputation, values that are identifiable or highly sensitive in the originally surveyed data are replaced with imputed values.
  • Fully synthetic data contain no original data (very low disclosure risk, since all values are synthetic).
  • Partially synthetic data retain the survey units but replace some collected values (typically SVs) with imputed values (disclosure risk is higher than for fully synthetic data, especially for inside intrusion). A minimal sketch follows this slide.
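A minimal sketch of the partially synthetic idea, assuming a single predictor and a single imputation for brevity; real systems fit richer models and release multiple imputed copies so that variance can be estimated properly. All variables and values are simulated.

```python
# Toy partially synthetic data: keep the survey units, but replace a sensitive
# value (income) with draws from a model fit to the original data.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
age = rng.uniform(18, 80, size=200)                      # simulated predictor
income = 20_000 + 800 * age + rng.normal(0, 5_000, 200)  # simulated true SV

X = age.reshape(-1, 1)
model = LinearRegression().fit(X, income)
resid_sd = np.std(income - model.predict(X))

# Every observed income is replaced by a draw from the fitted model, so no
# released income is an actual collected value.
synthetic_income = model.predict(X) + rng.normal(0, resid_sd, size=200)
```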
2d. Common SDL methods (continued)
• Modern SDL method:
  • Differential privacy was proposed by Dwork (2006) and is also referred to as a “formally private” SDL method.
  • Differential privacy guarantees that the presence or absence of any individual case in a database will not significantly affect the result of any database query.
  • Differential privacy can be achieved by adding randomized “noise” to an aggregate query result to protect individual entries without significantly changing the result (a sketch follows this slide).
  • A set of algorithms provides mathematically provable protection to respondents and allows policymakers to manage the trade-off between data accuracy and privacy protection (Dajani et al., 2017).
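For instance, the Laplace mechanism, one standard way to achieve differential privacy (not necessarily the specific algorithm any given agency deploys), adds Laplace noise scaled to the query’s sensitivity. A count query has sensitivity 1, since adding or removing one person changes the count by at most 1; the count and epsilon values below are made up.

```python
# Laplace mechanism for an epsilon-differentially private count query.
import numpy as np

rng = np.random.default_rng(7)

def dp_count(true_count: int, epsilon: float) -> float:
    """Add Laplace(0, 1/epsilon) noise; sensitivity of a count is 1."""
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

true_count = 1_234                         # hypothetical cell count
print(dp_count(true_count, epsilon=0.1))   # more noise, stronger privacy
print(dp_count(true_count, epsilon=1.0))   # less noise, weaker privacy
```

Smaller epsilon means stronger privacy at the cost of noisier results, which is exactly the accuracy-versus-privacy trade-off policymakers manage.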
3. NSDUH PUFs: An example
3a. NSDUH background
• NSDUH (National Survey on Drug Use and Health) provides national, state, and substate data on substance use and mental health in the civilian, noninstitutionalized population aged 12 or older.
• Data are collected on a quarterly basis each year.
• Approximately 700 field interviewers are on staff each year.
• Approximately 140,000 household screenings and 67,500 interviews are completed annually.
• The survey is conducted by RTI under contract with SAMHSA.
3b. MASSC
MASSC is used to treat NSDUH data to minimize disclosure risk (Singh, 2002; Singh, Yu, & Dunteman, 2003; Singh, Yu, & Wilson, 2004). It has four major steps (a toy sketch of the final step follows this slide):
• Micro Agglomeration partitions the data into risk strata based on a set of selected IVs.
• Optimal probabilistic Substitution replaces the IVs of randomly selected records with those of “substitution donors,” subject to a set of bias constraints.
• Optimal probabilistic Subsampling randomly deletes some records from the data, subject to a set of variance constraints.
• Optimal sampling weight Calibration adjusts the weights in the subsample so that, for selected variables, they sum to the totals of the original weights on the full analytic data file.
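As rough intuition for the Calibration step only, here is a toy ratio adjustment that makes retained weights reproduce the full-file weighted totals within cells. This is generic post-stratification-style calibration on made-up data, not RTI’s patented optimization or its bias and variance constraints.

```python
# Toy weight calibration: after subsampling, scale each retained record's
# weight so that weights again sum to the full-file total within its cell.
import numpy as np
import pandas as pd

full = pd.DataFrame({"cell":   ["A", "A", "A", "B", "B"],
                     "weight": [100.0, 120.0, 80.0, 200.0, 150.0]})
kept = full.drop(index=[1])           # pretend record 1 was subsampled out

full_totals = full.groupby("cell")["weight"].sum()
kept_totals = kept.groupby("cell")["weight"].sum()

# Multiply each retained weight by (full total / retained total) in its cell.
kept["cal_weight"] = kept["weight"] * kept["cell"].map(full_totals / kept_totals)
assert np.allclose(kept.groupby("cell")["cal_weight"].sum(), full_totals)
```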
3b. MASSC (continued)
• Using MASSC allows disclosure risk and information loss to be controlled simultaneously in the treated database.
• Using MASSC also introduces uncertainty into the data.
• The variables substituted in MASSC are kept confidential.
• In addition to the MASSC steps, other treatments are routinely applied:
  • Directly identifying information, such as name, phone number, and address, is not included in any NSDUH restricted-use file (RUF) or PUF.
  • In a PUF, almost all geographic identifiers (including census region, state, and county) are removed.
  • The household link between respondents from the same household is deliberately excluded from the PUF to reduce the inside intrusion risk, because more than one person in a household could have been selected to participate in the survey.
  • Local suppression, recoding, etc., are also used.
3c. NSDUH PUF data quality
• The overall data quality of the NSDUH PUF is assessed after disclosure treatment by MASSC and other perturbation methods.
• This assessment is done by comparing estimates from the PUF with estimates from the RUF for substance use, mental health measures, health outcomes, and prescription drugs in various sociodemographic domains.
3c. NSDUH PUF data quality (continued)
• Comparison of estimates:
  • Point estimates, contrasts, and regression coefficients after treatment are calculated from the PUF.
  • Point estimates, contrasts, and regression coefficients before treatment are calculated from the RUF.
  • The following ratios are calculated (a sketch follows this slide):
    Ratios_EST = PUF estimate / RUF estimate
    Ratios_SE = PUF SE / RUF SE, where SE = standard error
    Ratios close to 1.00 mean that the PUF estimates are close to the RUF estimates, which indicates less bias.
• Comparison of statistical inferences from contrasts and regressions:
  • t-tests are performed to check for changes of significance at the 5% level for both the contrasts and the regression coefficients.
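A minimal sketch of the ratio calculation, assuming the point estimates and standard errors have already been computed from each file; the numbers are placeholders, not actual NSDUH results.

```python
# Ratio comparison of PUF vs. RUF estimates (placeholder values).
puf_est, puf_se = 0.104, 0.0031   # hypothetical PUF estimate and SE
ruf_est, ruf_se = 0.101, 0.0028   # hypothetical RUF estimate and SE

ratio_est = puf_est / ruf_est     # Ratios_EST: near 1.00 suggests little bias
ratio_se = puf_se / ruf_se        # Ratios_SE: near 1.00 suggests little SE inflation

print(f"Ratios_EST = {ratio_est:.3f}, Ratios_SE = {ratio_se:.3f}")
```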
3c. NSDUH PUF data quality (continued)
Table 1. Ratios of Estimates and SEs from PUF and RUF Data, 2015 NSDUH
3c. NSDUH PUF data quality (continued)
• Because the PUF sample size is about 20% smaller than the RUF sample size, we expected to see an average increase of about 11% in the SEs of the point estimates on the PUF: SEs scale roughly as 1/√n, so the expected inflation factor is √(1/0.8) ≈ 1.12 (a quick check follows this slide).
• Extreme ratios in either the estimates or the SEs of the estimates may come from small domains or low prevalence.
• Precision-based rules should be implemented for small domains. NSDUH has its own set of suppression rules for RUF-based estimates.
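A back-of-the-envelope check of that expectation, assuming simple random subsampling so that SE is proportional to 1/√n:

```python
# Expected SE inflation when the PUF retains about 80% of the RUF sample.
import math

retention = 0.80
inflation = math.sqrt(1 / retention) - 1
print(f"Expected SE inflation: {inflation:.1%}")   # about 11.8%
```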
3c. NSDUH PUF data quality (continued)
Table 2. Ratios of Contrasts and SEs of Contrasts from PUF and RUF Data, 2015 NSDUH
3c. NSDUH PUF data quality (continued)
Table 3. Ratios of Regression Coefficients and SEs of Regression Coefficients from PUF and RUF Data, 2015 NSDUH
3c. NSDUH PUF data quality (continued)
Table 4. Change of Significance for Contrasts at 5% Level of Significance (LOS), 2015 NSDUH
Sig to NS = significant to nonsignificant changes; NS to Sig = nonsignificant to significant changes.
3c. NSDUH PUF data quality (continued)
Table 5. Change of Significance for Regression Coefficients at 5% Level of Significance (LOS), 2015 NSDUH
Sig to NS = significant to nonsignificant changes; NS to Sig = nonsignificant to significant changes.
4. Other examples of microdata PUFs
• European Union Labour Force Survey (EU LFS) – uses traditional statistical disclosure control (SDC) methods, such as global recoding, local suppression, and post-randomization.
• Medicare Current Beneficiary Survey (MCBS) Cost and Use files / National Health Interview Survey (NHIS) – use traditional methods, such as suppression, exclusion, deletion, and recoding of stratum and PSU variables.
• American Community Survey (ACS) Public Use Microdata Sample (PUMS) – identifying information is removed, and geographic detail is limited to region, division, state, and PUMAs (Public Use Microdata Areas).
• Longitudinal Business Database (LBD) – an annual economic census of establishments in the United States that uses a synthetic data approach to produce PUFs.
• Census Bureau’s OnTheMap tool – a web-based mapping and reporting application showing where workers are employed and where they live; it was the first production deployment of formally private SDL methods.
5. References
• Dajani, A. N., Lauger, A. D., Singer, P. E., Kifer, D., Reiter, J. P., Machanavajjhala, A., Garfinkel, S. L., Dahl, S. A., Graham, M., Karwa, V., Kim, H., Leclerc, P., Schmutte, I. M., Sexton, W. N., Vilhuber, L., & Abowd, J. M. (2017). The modernization of statistical disclosure limitation at the U.S. Census Bureau. Presented at the Census Scientific Advisory Committee Meeting, September 14–15, 2017, U.S. Census Bureau, Suitland, MD. Retrieved from https://www.census.gov/about/cac/sac/meetings/2017-09-meeting.html
• Dwork, C. (2006). Differential privacy. In M. Bugliesi, B. Preneel, V. Sassone, & I. Wegener (Eds.), 33rd International Colloquium on Automata, Languages and Programming, Part II (ICALP 2006) (Lecture Notes in Computer Science, Vol. 4052, pp. 1–12). Berlin, Heidelberg: Springer. https://doi.org/10.1007/11787006_1
• Franconi, L., & Polettini, S. (2004). Individual risk estimation in μ-Argus: A review. In J. Domingo-Ferrer & V. Torra (Eds.), Proceedings of the International Conference on Privacy in Statistical Databases (PSD 2004), Barcelona, Spain (Lecture Notes in Computer Science, Vol. 3050, pp. 262–272). Berlin, Heidelberg: Springer. https://doi.org/10.1007/978-3-540-25955-8_20
• Singh, A. C. (2002). Method for statistical disclosure limitation (U.S. Patent Application Pub. No. US 2004/0049517A1). Research Triangle Park, NC: RTI International. The patent was granted in June 2006 (Patent No. US7058638B2).
• Singh, A. C., Yu, F., & Dunteman, G. H. (2003). MASSC: A new data mask for limiting statistical information loss and disclosure. In Proceedings of the Joint UNECE/EUROSTAT Work Session on Statistical Data Confidentiality, Luxembourg (Working Paper No. 23, pp. 373–394). Geneva, Switzerland: United Nations Statistical Commission and Economic Commission for Europe, Conference of European Statisticians, and European Commission Statistical Office of the European Communities (EUROSTAT).
• Singh, A. C., Yu, F., & Wilson, D. H. (2004). Measures of information loss and disclosure risk under MASSC treatment of micro-data for statistical disclosure limitation. In Proceedings of the 2004 Joint Statistical Meetings, American Statistical Association, Section on Survey Research Methods, Toronto, Canada (pp. 4374–4381). Alexandria, VA: American Statistical Association.