This article discusses the importance of disseminating microdata while protecting individuals' privacy rights. It explores the benefits of using microdata for analysis, policy development, and building trust. The article also addresses the challenges of confidentializing integrated microdata and presents various anonymization methods to ensure confidentiality while maximizing analysis potential.
Statistical confidentiality and privacy: 1. General considerations * * * Robert McCaa, Minnesota Population Center, rmccaa@umn.edu “Inadequate use of microdata has high costs” --Len Cook (2003, registrar general, ONS)
UNSD Principles and Recommendations (Rev. 1, 1997) endorse dissemination of census microdata • §1.218: “There are a range of methods…that can be used to make such microdata available while still protecting individuals’ rights to privacy.” (Rev. 2 has a stronger statement.) • In four decades of distributing microdata there is not a single allegation of a breach of confidentiality or privacy (includes 100% microdata stored at CELADE in Santiago, Chile).
Why disseminate microdata? Julia Lane, European Statisticians Conference (2003) • 1. Analyze more realistic questions • 2. Develop reality-based policy • 3. Acquire new constituencies and stakeholders • 4. Build trust; reduce suspicions of data cooking • 5. Replicate findings • a. use standards of UNSD, Eurostat, ISCO, ISCED, etc. • b. facilitate comparative research in time and space • 6. Calculate marginal effects • 7. Assess data quality • …and much, much more….
Imagine!!! What’s the problem? Confidentializing an integrated microdata base with: • 200+ samples of households (70+ countries) • Containing ½ billion person records with thousands of variables • Available to tens of thousands of licensed users regardless of country of birth, citizenship, residence or place of work • Without a single allegation of violation of privacy or statistical confidentiality-- Ever!!
Usage: off-site vs. on-site use (secure microdata laboratory)? Germany RDC, 2005-8 (Jan-Sept): ten-to-one. RDCs are expensive and attract few users.
ONS-UK gold standard: “Statistical disclosure control methods may modify the data or the design of the statistic, or a combination of both. They will be judged sufficient when the guarantee of confidentiality can be maintained, taking account of information likely to be available to third parties, either from other sources or as previously released National Statistics outputs, against the following standard: “It would take a disproportionate amount of time, effort and expertise for an intruder to identify a statistical unit to others, or to reveal information about that unit not already in the public domain.” Protocols on Data Access and Confidentiality, pp. 7-8 --ONS-UK (2004) www.statistics.gov.uk/about_ns/cop/downloads/prot_data_access_confidentiality.pdf
Risk assessment of household samples of UK 1991 census: attempts at matching are “fruitless”: few matches; many false positives • After taking into account errors in the data, coding variability and changing of personal characteristics in time • Dale and Elliott, JRSS-A (2003): “For a user of an outside database, attempting this sort of match with no opportunity for verification would prove fruitless. In the first place, the small degree of expected overlap would be a considerable deterrent to an intruder. However, if a match between the two files was attempted the large number of apparent matches would be highly confusing as an intruder would have no way of checking correct identification.”
Level of Anonymization (FSO-Germany) [Figure: degree of confidentiality vs. degree of analysis potential. Data levels: complete microdata, confidential microdata, de-facto anonymised microdata, fully anonymised microdata. Steps: delete direct identifier, anonymisation method, stronger anonymisation method.] Trade-off between confidentiality and analysis potential: is it monotonic (as portrayed)?
Level of Anonymization: not monotonic [Figure: degree of confidentiality vs. degree of analysis potential. Data levels: complete microdata, confidential microdata, de-facto anonymised microdata, fully anonymised microdata. Steps: delete direct identifier, anonymisation method & construct sample, stronger anonymisation method. Percentages shown: 95%, 99%, 99.9% (listed with degree of confidentiality) and 50%, 45%, 25% (listed with degree of analysis potential).] Trade-off is not monotonic
Resources • UN-ECE (2007), Managing Statistical Confidentiality & Microdata Access: http://www.unece.org/stats/documents/tfcm.htm • IHSN Tools & Guidelines, anonymization: www.surveynetwork.org • Eurostat (1999)
IHSN www.surveynetwork.org • Remove variables • Identifiers: name, address, low-level administrative geography • Sensitive: tribe, disability • Global recoding • Aggregate classes: age (5 yr groups), country of birth (continent), administrative geography, occupation (4 digits to 3), etc. • Top and bottom coding (continuous variables: income, size of residence, number of rooms, etc.) • Local suppression: sparse categories (population n < 250…2,500) • Data swapping (household geography) • Complex perturbations
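A few of the methods listed above can be sketched in code. The following is a minimal illustration only, not IHSN's tooling: the field names, the toy records, the income ceiling and the suppression threshold of 2 are all assumptions made for the example (the slide's thresholds of 250 to 2,500 refer to population counts, not to a three-record toy file).

```python
from collections import Counter

# Hypothetical set of direct identifiers to drop ("Remove variables").
DIRECT_IDENTIFIERS = {"name", "address"}

def recode_age(age, width=5):
    """Global recoding: collapse single years of age into fixed-width groups."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def recode_occupation(code, digits=3):
    """Global recoding: keep only the first 3 digits of a 4-digit code."""
    return str(code)[:digits]

def top_bottom_code(value, bottom, top):
    """Top and bottom coding: clamp a continuous value to [bottom, top]."""
    return max(bottom, min(top, value))

def suppress_sparse(records, key, min_count):
    """Local suppression: blank a category held by fewer than min_count records."""
    counts = Counter(r[key] for r in records)
    return [
        {**r, key: r[key] if counts[r[key]] >= min_count else None}
        for r in records
    ]

def anonymize(records, income_top=100_000, min_count=2):
    """Apply the steps in sequence; thresholds are illustrative assumptions."""
    out = []
    for r in records:
        r = {k: v for k, v in r.items() if k not in DIRECT_IDENTIFIERS}
        r["age"] = recode_age(r["age"])
        r["occupation"] = recode_occupation(r["occupation"])
        r["income"] = top_bottom_code(r["income"], 0, income_top)
        out.append(r)
    return suppress_sparse(out, "district", min_count)

# Toy records, invented for the example.
sample = [
    {"name": "A", "address": "1 Main St", "age": 37, "occupation": "2411",
     "income": 250_000, "district": "North"},
    {"name": "B", "address": "2 Main St", "age": 62, "occupation": "5223",
     "income": 18_000, "district": "North"},
    {"name": "C", "address": "3 Hill Rd", "age": 8, "occupation": "0000",
     "income": -50, "district": "East"},
]
result = anonymize(sample)
```

Data swapping and complex perturbations are left out of the sketch because they depend on the structure of the file and on choices the slide does not specify.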
EUROSTAT statistical confidentiality standards (Thorogood, 1999) --all endorsed by IPUMS-International • 1. Restrict access to samples • 2. Limit geographical detail • 3. Re-code unique categories--top and bottom • 4. Sign non-disclosure agreement • 5. Prohibit redistribution to third parties • 6. Prohibit attempts to identify individuals or making any claim to that effect • 7. Require users to provide copies of publications
EUROSTAT statistical confidentiality standards (continued) • 8. Construct age from birthdate, if necessary • 9. Do not identify date of birth • 10. Do not identify precise place of birth • 11. Migration: timing/place not identified in detail • 12. Identify place of residence by major civil division (pop>20k, 60k, 100k, 1 million, i.e., national convention) • 13. Do sensitivity analysis • 14. Do confidentiality assessment (not yet)
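Items 8, 9 and 12 lend themselves to a short sketch. The code below is one possible reading, under stated assumptions: the record layout, the function names, the "Other" label and the default threshold of 20,000 (one of the conventions the slide lists) are hypothetical, and the population figures are invented.

```python
from datetime import date

def age_at_census(birthdate, census_date):
    """Item 8: derive completed years of age so the birthdate can be dropped."""
    years = census_date.year - birthdate.year
    # Subtract one year if the birthday had not yet occurred on census day.
    if (census_date.month, census_date.day) < (birthdate.month, birthdate.day):
        years -= 1
    return years

def residence_code(division, population, threshold=20_000):
    """Item 12: name a civil division only if its population exceeds the threshold."""
    return division if population > threshold else "Other"

def release_record(person, census_date, populations, threshold=20_000):
    """Build the released record; date of birth is not carried over (item 9)."""
    division = person["division"]
    return {
        "age": age_at_census(person["birthdate"], census_date),
        "residence": residence_code(division, populations[division], threshold),
    }

# Invented example values.
populations = {"Capital": 850_000, "Smallville": 5_000}
census_day = date(1991, 4, 21)
person_a = {"birthdate": date(1950, 6, 15), "division": "Capital"}
person_b = {"birthdate": date(1950, 3, 1), "division": "Smallville"}

released_a = release_record(person_a, census_day, populations)
released_b = release_record(person_b, census_day, populations)
```

Which threshold applies is a national convention, so it is passed as a parameter rather than fixed in the code.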
Countering Fear, Hysteria and Paranoia…with reason “There has been no known attempt at identification with the 1991 SARs [microdata samples of the UK]-nor in any other countries that disseminate samples of microdata” --Elliott and Dale, Journal of the Royal Statistical Society, 1999
No official statistical microdata!! Why not? Companies want linkable data with names, addresses, ID #s, etc. * * * Probabilistic linking with 90% of the population missing is not good enough. [Figure: ChoicePoint Data Sources and Clients. Source: Washington Post] http://www.choicepoint.com/
No statistical microdata!! To play “pizza” video: http://www.aclu.org/pizza/
Statistical samples are innocuous. Nothing to be gained from matching.
Please allow me to invite you to think about producing (or permitting IPUMS to produce) anonymized, integrated samples for all the censuses of your country for which microdata survive… Thank you * * * Contact: rmccaa@umn.edu • This ppt is available at: www.hist.umn.edu/~rmccaa/ipums-global (see “Port of Spain workshop”)