Secure Data Laboratories: The U.S. Census Bureau Model
Steven Ruggles, University of Minnesota
Why are secure data laboratories needed?
• Greater geographic detail is needed for multi-level modeling, spatial analysis, and studies of spatial segregation
• Very large samples (over 10% coverage) and complete-count microdata offer new research opportunities
• Adding geographic detail and raising sample sizes raise new confidentiality concerns
Existing Models:
• German Research Data Centres
• Statistics Canada Research Data Centers
• Census Bureau Research Data Centers
Key limitation: each holds data for only one country, making comparative research impossible
Emerging standards:
• Data Sharing for Demographic Research Project, Inter-university Consortium for Political and Social Research
• Eurostat initiative: all statistical agencies are mandated to develop secure data laboratories
Census Bureau Research Data Centers
• U.S. Census Bureau made census microdata available to researchers in 1964 through the anonymized Public Use Samples
• It was impossible to anonymize the census of business
• Original RDC established in 1982 by the Census Bureau Center for Economic Studies to provide access to microdata on firms
The RDC Concept
• An office with multiple computers
• Staffed by a Census Bureau employee
• Computer-driven remote data access
• Meets physical and computer security requirements for restricted access
• Researchers must undergo a background check and obtain Special Sworn Status to use restricted data
• Researchers may not remove anything from the RDC until it has passed a disclosure avoidance review
Census RDC Remote Branches
• Boston (NBER) 1994
• Carnegie Mellon 1996-2004
• UC Berkeley 1999
• UCLA 1999
• Research Triangle (Duke, North Carolina) 2000
• Michigan 2002
• Chicago 2002
• New York Cornell 2004
• New York Baruch 2006
• Minnesota 2009
Census RDCs
Coming soon: Minneapolis
Census Bureau and RDC partners:
• Establish physically secure offices and secure computer systems
• Choose projects that use the data appropriately, benefit Census Bureau programs, and present low disclosure risks
• Impart to researchers at the RDC the Census Bureau “culture of confidentiality”
• Establish policies and procedures that protect confidentiality in the RDC office
• Release only research output that does not reveal confidential information
Each RDC has a security plan.
• Locked office with badges, key cards, keypads, etc.
• Access limited to researchers with Special Sworn Status (SSS) carrying out active, approved projects at the RDC:
  • Sign written active project agreements
  • Obtain security clearance
  • Sign Census Bureau’s standard sworn agreement to preserve the confidentiality of the data
  • Receive awareness training
A Census Bureau employee (the RDC administrator) is stationed at each RDC.
• Instills the Census Bureau's “culture of confidentiality” in the researchers
• Trains the researchers in the security and confidentiality restrictions
• Carries out disclosure analysis on any research output a researcher wishes to remove from the secure facilities
Thin client computing environment
• Data stored on secure Unix servers at Census Bureau headquarters (Bowie, MD). No confidential data stored at the RDCs.
• RDCs connected to servers via dedicated T-1 lines.
• Researchers use X-terminals (“thin clients” with no local data storage) to access the data authorized for their projects.
• Researchers are held accountable for their computer use through passwords and system logs.
The rules for researchers:
• May not upload anything to, or download anything from, the thin-client servers (there is no physical way to do so).
• Have no access to any non-Census Bureau network (including the Internet) from within the RDC facility.
• May not bring laptop computers or portable mass storage devices into the RDC facility.
Demographic and Health Data in the RDCs
• Historical focus on “economic” data
• Requests for “demographic” data
  • Higher geographical resolution
  • Denser samples and complete-count microdata
• Permission obtained in 1997 to provide access to demographic data in RDCs
• IPUMS is working with Census to reconstruct complete (100%) census microdata from 1960-2000+ for RDCs
• RDCs will soon include major collections of U.S. health data as well
The importance of high-density census microdata with fine geographic detail
• This is a completely new source with the potential to provide unprecedented insight into residential segregation and the influence of local conditions on behavior.
• Analysts of small areas have never had access to microdata, and have been forced to use crude aggregate tabulations that are often incompatible across time and across national boundaries.
• As a new kind of data, complete-count microdata will stimulate entirely new methods of analysis.
Limitations of the Data Laboratory Model
• Access is highly restricted, cumbersome, and expensive
• The U.S. experience: just a dozen research projects have used censuses in RDCs, while over 10,000 projects have used public-use census microdata, the most widely used data source in the social sciences
• Analysis across national boundaries is essential, and the RDCs currently operated by the Census Bureau and the statistical agencies of Germany and Canada cannot meet this need
• The Data Sharing for Demographic Research (DSDR) program at the ICPSR has been charged with developing a set of standards for data enclaves
Conclusion
• Restricted data enclaves cannot replace public use data, since they prevent access for most researchers.
• This strategy, however, does provide the possibility for researchers with compelling needs to gain access to highly confidential data with virtually no risk of disclosure.
• To allow analyses that cross national boundaries, we must develop secure data laboratories that are not tied to specific national statistical agencies, but which allow access to data from many countries.
• Existing RDCs provide a valuable model.