Learn how IPUMS-International maintains privacy for integrated census microdata while ensuring data quality through restricted access. Explore its usage, extraction system, and future plans.
IPUMS-International: High precision Population Census Samples: Balancing the Privacy-Quality Tradeoff by Means of Restricted Access Microdata Extracts https://www.ipums.org/international * * * Robert McCaa, Steven Ruggles, Michael Davern, Tami Swenson, and Krishna Mohan Palipudi, Minnesota Population Center, rmccaa@umn.edu = information not in proceedings or on CD
Outline of paper (in proceedings, except “0.”) 0. What’s a historian doing at PSD2006? 1. Introduction: The Trusted User Approach 2. The Case for High Precision Samples: The USA Experience 3. High Precision Samples with Implicit Stratification 4. Access Disclosure Controls 5. Technical Disclosure Controls 6. Fear, Hysteria and Paranoia 7. Conclusions and Future Work
Help!!! Why am I (a historian) here? • To learn from you how to enhance IPUMS-International privacy and confidentiality techniques • To inform you of our existence and the challenges we face • To invite your contributions, as producers, users, and creators of statistical confidentiality methods • To advertise opportunities for post-docs and staff • To invite statistical agencies to entrust census microdata to the project
Imagine!!! What’s the problem? Confidentializing IPUMS-International, an integrated microdatabase with: • 150 census samples of households (50 countries) • Containing 300 million person records with hundreds of variables • Available to tens of thousands of licensed users regardless of country of birth, citizenship, residence or place of work • Not a single allegation of violation of privacy or statistical confidentiality-- Ever!!
IPUMS-International: a restricted-access, web-based census microdata extraction system • Password protected: to make and retrieve extracts • Licensed researcher selects: • Countries, • Censuses, • Cases/sub-populations, • Variables, and • Sample densities • Extract engine queues request, generates extract • Researcher retrieves extract via web with SSL 128-bit encryption and analyzes using own wares (soft/hard/wet) • NO: CDs, original codes, or complete datasets
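The selection step on the slide above can be pictured with a minimal Python sketch. It is not the actual extract engine: the `ExtractRequest` class, its field names, the record keys, and the reading of "sample density" as "keep every k-th matching record" are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ExtractRequest:
    """Hypothetical extract request; every field name here is illustrative."""
    samples: set                                # (country, census year) pairs
    variables: list                             # variable names to keep
    case_filter: Callable = lambda rec: True    # sub-population selector
    density: int = 1                            # keep every density-th matching record


def build_extract(records, request):
    """Return only the requested variables for the records that match."""
    extract = []
    matched = 0
    for rec in records:
        if (rec["country"], rec["year"]) not in request.samples:
            continue                            # wrong country or census
        if not request.case_filter(rec):
            continue                            # outside the requested sub-population
        if matched % request.density == 0:
            extract.append({v: rec[v] for v in request.variables})
        matched += 1
    return extract
```

A request for one census and one variable would then be `build_extract(records, ExtractRequest(samples={("MX", 2000)}, variables=["age"]))`, where the country code and variable name are again placeholders.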
6 steps using https://www.ipums.org/international: 1. Log on with password 2a. Study documentation 2b. Design extract 3. Receive email; log on with password 4. Download extract (SSL encrypted) 5. Unzip data 6. Analyze (also SAS, STATA)
IPUMS-International, December 2006 • dark green = disseminating (20 countries, 63 censuses, 185mpr [million person records]) • green = harmonizing (37 countries, 100 censuses, 200mpr) • lightest green = negotiating 69 countries, 58% of world's population
What has happened since Geneva (xi/05)? • NSF-USA renewed funding for 5 years • Database grew: 12 countries, 35 censuses, 65mpr • More agreements signed, census data acquired • New, dynamic metadata system implemented • Number of users doubled • Publications are taking off • Paris Workshop (INED/CEPED): delegates from 14 European countries and 10 non-European, plus academic researchers
IPUMS-Europe, December 2006 • Dark green = Disseminating (5 countries, 15 censuses, 27mpr) • In Lisbon: Portugal and Hungary will become “dark green” with the launch of samples for 4 censuses each for Argentina and Hungary, 3 for Portugal and Israel, 2 for Egypt and Rwanda, and 1 for Gaza and the West Bank
What will happen by Lisbon (ISI, viii/07)? • Confidentiality methods will be enhanced • Database will grow: 7 countries, 19 censuses, 25mpr • Dynamic metadata system will be expanded • Number of users will increase!!! • Publications!!! • IPUMS Workshop (Sat Aug 25 at INE-Pt) for producers and users (registration required; please email rmccaa@umn.edu) • Microdata Session (Fri Aug 24) • Free mug!* *Special conditions apply
1. Introduction: The “trusted-user” approach to disseminating integrated, anonymized census microdata samples
MBNA: world’s largest independent credit card issuer, specialist in affinity marketing • 1982: MBNA founded by Charles Cawley; instead of competing on price, compete on affinity • 1983: Georgetown Univ Alumni Association (Cawley’s alma mater) supplied MBNA with names and addresses of its members in exchange for a percentage of revenues on card usage • Big hit! Large number of new accounts, low risk, high spenders • 1985: new groups: American Dental Association, Aircraft Owners and Pilots Association, National Education Assoc. • 1994: Sierra Club, 45,000 members signed with MBNA, generating $400,000 annually for Sierra Club • The rest is history! • 2005: MBNA, with 25,000 employees, acquired by Bank of America, US$35 billion • How many credit cards do you have? • How many affinity credit cards do you have?
IPUMS-International: world’s largest provider of integrated census microdata to trusted users • 1999: Founded by Steven Ruggles and Bob McCaa; seeks neither profits nor popularity! Restrict access to trusted users, and apply corresponding confidentiality techniques • 2002: 1st release of integrated samples for 7 countries; >200 users in first year • Big hit! 69 countries signed; 57 entrusted data to IPUMS, datasets for more than 230 censuses, >150 entire datasets • 2006, 3rd release: data for 20 countries, samples for 63 censuses, 185 million person records, >1,000 users • 2009, 8th release: data for 50 countries, samples for ~150 censuses, >300 million person records, thousands of users • Note: data extracts are provided only to licensed users.
2. High Precision Samples: The Case of the USA • Beginning with the 1980 census, US Census Bureau released 5% samples of households • Not a single allegation of misuse • 1988: first articles using high precision samples published in Demography Language use and fertility in the Mexican origin population Household size and regional outmigration • 1996: IPUMS-USA samples available via internet • Available at no cost to researchers worldwide • 81% of articles in Demography, since 1990, use high precision samples • In 2000 & 2001, high precision census microdata used twice as often as next most common data source • Analyze information for small population subgroups • very large census microdata samples are among the most powerful tools available for economic and demographic analysis
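Why sample density matters for small subgroups can be illustrated with a little arithmetic. The sketch below assumes simple random sampling, which ignores both the clustering of persons within households and the finite population correction; the subgroup size and the proportion are made-up numbers, not figures from the paper.

```python
import math


def se_proportion(p, n):
    """Standard error of a sample proportion under simple random sampling."""
    return math.sqrt(p * (1 - p) / n)


subgroup_size = 20_000   # hypothetical number of people in a small subgroup
p = 0.30                 # hypothetical proportion with some characteristic

for fraction in (0.01, 0.05, 0.10):
    n = round(subgroup_size * fraction)      # subgroup cases expected in the sample
    print(f"{fraction:.0%} sample: n={n}, SE={se_proportion(p, n):.4f}")
```

Because the standard error scales with one over the square root of n, moving from a 1% to a 5% sample shrinks it by a factor of about 2.2 for the same subgroup.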
3. High Precision Samples with Implicit Stratification. Note: almost all NSIs are supplying household samples drawn to IPUMS specifications (every nth household from 100% fine-grained geographically stratified microdata); see table 1
IPUMS-International: High precision samples with implicit stratification • Suppress all identifying information: names, id numbers, street addresses, low-level administrative geography (NUTS-5, NUTS-4?, NUTS-3?, NUTS-2?) • Sample is stratified by lowest level geography (census tract) • Lower standard errors than a classic random sample—to the extent that variables of interest are correlated with geography • Implicit geographical stratification is equivalent to extremely fine geographic stratification with proportional weighting • Many of our NSI partners have adopted the IPUMS sample design (see table 1). • 26 countries provided 100% microdata for the MPC to draw the sample • Europe: almost all NSIs have drawn samples to IPUMS specs. for all censuses • High precision samples for 57 countries entrusting microdata (12/12/2006) • 10% samples: 43 countries • 5% 10 countries • <5% 4 countries
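Drawing every nth household from a geographically ordered file can be sketched in a few lines of Python. This is only an illustration of systematic sampling with implicit stratification, not the procedure used by the MPC or any NSI; the function name and the `"geo"` key are hypothetical.

```python
import random


def systematic_household_sample(households, n, seed=None):
    """Take every n-th household from a geographically sorted list.

    `households` is a list of dicts carrying a hypothetical "geo" key with the
    lowest-level geographic code. Sorting on that code before stepping through
    the file is what spreads the sample proportionally across areas.
    """
    ordered = sorted(households, key=lambda h: h["geo"])
    start = random.Random(seed).randrange(n)   # random start in [0, n)
    return ordered[start::n]
```

With `n=10` this yields a 10% sample of households, and each small area contributes close to one tenth of its households without any explicit strata being defined.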
4. Access Disclosure Controls: a. Memorandum with NSI b. License with researchers
IPUMSi LICENSE b. License with researchers: Restricted-access web-based system. Legally binding license agreement • forces would-be snoopers to violate the law, for which they can be fined and jailed • protects privacy and confidentiality • assures proper use. Access limited to: • Bona fide researchers (credentials) • With a demonstrated scientific need • Who agree to abide by license restrictions: • Confidentiality • No redistribution, no commercial use • Safely secured • Alleging that a person can be or has been identified is illegal
License is for 1 year, renewable. End of application
IPUMSi CONFIDENTIALIZES: technical measures are also applied, in addition to the legal & administrative protections » Suppress geographical detail » Blur/aggregate sensitive codes » Convert dates to ages (blur key vars.) » Swap cases between districts » Scramble order of records
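Two of the measures listed above, swapping cases between districts and scrambling record order, can be sketched as follows. This is a deliberately naive illustration, not the IPUMS procedure: the function names and the `"district"` key are hypothetical, and pairs are chosen purely at random, whereas swapping methods used in practice often pair records with similar characteristics.

```python
import random


def swap_between_districts(records, n_swaps, rng):
    """Exchange the district codes of randomly chosen pairs of records."""
    out = [dict(r) for r in records]            # copy so the input is untouched
    for _ in range(n_swaps):
        a, b = rng.sample(range(len(out)), 2)   # two distinct record positions
        out[a]["district"], out[b]["district"] = out[b]["district"], out[a]["district"]
    return out


def scramble_order(records, rng):
    """Return the records in random order so file position reveals nothing."""
    out = list(records)
    rng.shuffle(out)
    return out
```

Passing an explicit `random.Random` instance keeps the sketch reproducible in tests. Swapping leaves district totals unchanged while breaking the link between an individual record and its true location.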
EUROSTAT statistical confidentiality standards (Thorogood, 1999) --all endorsed by IPUMS-International • 1. Restrict access to samples • 2. Limit geographical detail • 3. Re-code unique categories--top and bottom • 4. Sign non-disclosure agreement • 5. Prohibit redistribution to third parties • 6. Prohibit attempts to identify individuals or making any claim to that effect • 7. Require users to provide copies of publications
EUROSTAT statistical confidentiality standards (Thorogood, 1999) --all endorsed by IPUMS-International • 8. Construct age from birthdate, if necessary • 9. Do not identify date of birth • 10. Do not identify precise place of birth • 11. Migration: timing/place not identified in detail • 12. Identify place of residence by major civil division (pop>20k, 60k, 100k, 1 million—i.e., national convention) • 13. Do sensitivity analysis (not yet) • 14. Do confidentiality assessment (not yet)
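Three of the standards above (3, 8 and 12) lend themselves to a short Python sketch. The function names and the example cut-offs are assumptions for illustration; the only figure taken from the slide is the 20,000 population threshold, which is one of several national conventions it lists.

```python
from datetime import date


def age_at_census(birthdate, census_day):
    """Completed years of age on census day (standard 8).

    Once the age is derived, the birthdate itself need not be released.
    """
    years = census_day.year - birthdate.year
    if (census_day.month, census_day.day) < (birthdate.month, birthdate.day):
        years -= 1                      # birthday not yet reached in the census year
    return years


def top_bottom_code(value, low, high):
    """Fold rare extreme values into the end categories (standard 3)."""
    return max(low, min(value, high))


def residence_code(unit_code, unit_population, min_population=20_000):
    """Release a civil division's code only above a population threshold (standard 12)."""
    return unit_code if unit_population > min_population else None
```

For example, `top_bottom_code(age_at_census(birthdate, census_day), 0, 90)` would report everyone aged 90 or over in a single top category; the cut-off of 90 is a placeholder, not a value from the paper.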
6. Countering Fear, Hysteria and Paranoia…with reason “There has been no known attempt at identification with the 1991 SARs [microdata samples of the UK]-nor in any other countries that disseminate samples of microdata” --Elliott and Dale, Journal of the Royal Statistical Society, 1999
No census microdata!! Why not? Companies want linkable data with names, addresses, ID #s, etc. * * * Probabilistic linking with 90% of the population missing is not good enough. ChoicePoint Data Sources and Clients. Source: Washington Post http://www.choicepoint.com/
No census microdata!! http://www.aclu.org/pizza/
“…there are no known incidents of researchers using their access to microdata to deliberately identify individuals...” --Managing Statistical Confidentiality and Microdata Access: Principles and Guidelines of Good Practice, UNECE, Conference of European Statisticians, Task Force on Census Microdata (October 2006), p. 19 http://www.unece.org/stats/documents/tfcm/1.e.pdf
“Statistical disclosure control methods may modify the data or the design of the statistic, or a combination of both. They will be judged sufficient when the guarantee of confidentiality can be maintained, taking account of information likely to be available to third parties, either from other sources or as previously released National Statistics outputs, against the following standard: ‘It would take a disproportionate amount of time, effort and expertise for an intruder to identify a statistical unit to others, or to reveal information about that unit not already in the public domain.’” --Protocols on Data Access and Confidentiality, pp. 7-8, ONS-UK (2004) www.statistics.gov.uk/about_ns/cop/downloads/prot_data_access_confidentiality.pdf
IPUMS-International strengths • Uniform legal authorization with national statistical authorities • Access restricted to academics with need who agree to abide by stringent confidentiality protections • Experienced integration teams • Proven web-based distribution system • High user satisfaction • Sustainable: NSF, NIH, FP-6 (7?) funded (Europe only)
Significant weakness: statistical disclosure controls…as a result of PSD2006, we will: • Re-consider our portfolio of statistical disclosure controls • Implement a uniform set of controls across all samples and countries • Do sensitivity analysis • Do confidentiality assessment • Revise our documentation on the confidentializing of datasets for each country, describing principles, but not the “keys” • Cite bibliography for users to confidentialize tables and graphs
IPUMS-International, August 2009??? • dark green = disseminating (50 countries, 150 censuses, 300mpr) • green = harmonizing (?? countries, ?? censuses, ???mpr) • lightest green = negotiating 2009? --ISI
Thank you! https://www.ipums.org/international Additional information at: www.hist.umn.edu/~rmccaa/ * * * Contact: rmccaa@umn.edu