Timothy M. Mulcahy Fritz Scheuren. Eurostat’s NTTS: Conference on New Techniques & Technologies for Statistics: 21st Century Data Dissemination: Practice & Innovations Brussels, 23 February 2011. Overview. Introduction Context Data access modalities
Timothy M. Mulcahy Fritz Scheuren Eurostat’s NTTS: Conference on New Techniques & Technologies for Statistics: 21st Century Data Dissemination: Practice & Innovations Brussels, 23 February 2011
Overview • Introduction • Context • Data access modalities • Confidentiality, data utility & convenience • Data security • Researcher collaboration • Innovations
Introduction • NORC at the University of Chicago is a nonprofit public interest research organization • Established in 1941 • Closely affiliated with the University of Chicago • Divisions: • Economics, Labor and Population Studies • Education and Child Development • Field Operations Center • Health Care Research • International Projects • Public Health Research • Security, Energy & Environment • Statistics and Methodology • Substance Abuse, Mental Health & Criminal Justice • Telephone Survey and Support Operations
Introduction (cont.) NORC - University of Chicago Academic Research Centers • Center for Advancing Research and Communication in Science, Technology, Engineering, and Mathematics • Center for Excellence in Survey Research • Center for the Study of Politics and Society • Center on Demography and Economics of Aging • Cultural Policy Center • Data Research and Development Center • Joint Center for Education Research • Population Research Center • Ogburn Stouffer Center for the Study of Social Organizations • Alfred P. Sloan Center on Parents, Children and Work
Context • Challenges to responsible data sharing • Selecting the most appropriate access modality • Confidentiality v. analytic utility tradeoff • Convenience v. analytic utility tradeoff • Data security • Researcher Collaboration
2nd Annual Conference on Microdata Access (2/10/11) “Responsible Data Sharing in the 21st Century” KEYNOTE ADDRESS: ROBERT GROVES, Director, U.S. Census Bureau PANEL DISCUSSIONS: • “Motivations, Challenges, and Implications for Responsible Data Sharing” • “Responsible Data Sharing Among University Researchers”
The Government’s Role in Statistics • Statistical information is key to an informed citizenry; an informed citizenry is key to a functioning democracy • To be useful, the statistical information must be credible • It must be viewed as nonpartisan • It must be viewed as relevant to key questions about the welfare of the society (Groves)
Conflicting Principles for Data Producers? U.S. Government Statistical Agencies must simultaneously: • Maximize the richness of statistical information and insights based on the data provided • Widely, freely distribute statistics • Spur secondary analysis of data • Preserve their pledge of respondent confidentiality • Not just a principle, a law (Groves)
Challenges to Maintaining Confidentiality • Many policy and research questions require estimates that cannot be generated from publicly available data • Statistical disclosure control inherently involves modifying the inferences that can be drawn from the data • Research conducted on perturbed data puts the scientific discovery process at risk
Public Use Files (PUFs) • Advantage: • PUFs may be made widely available for public consumption • Disadvantages: • No training is provided; limited metadata • Some useful information must be at least partially suppressed to protect confidentiality • Other micro datasets are widely available and can be matched to public use microdata files, or even tabulations, to reidentify respondents
The Role of Microdata Access • Increases the number of researchers working with the data • Allows authorized researchers to pose and answer their own questions (curiosity-driven research) • Facilitates secure exploratory analyses and testing & confirming models • Provides a means for interdisciplinary research • Allows for advanced queries of confidential data that cannot be pursued with public use files • “In essence, research that extracts all important information from the data while respecting individual rights, as part of our obligation to the society.” (Groves)
Selecting the Most Appropriate Data Access Modality Questions for data producers & providers: • What are my goals & objectives? • Who is my audience? • What is my risk tolerance? • Goal: design a set of customized data dissemination strategies that balance the level of risk tolerance and the need for data analytic utility
Available Data Access Modalities • Licensing: very high disclosure risk • Remote batch processing: time-consuming and costly • Online tabulation engines: data suppression, perturbation • Synthetic microdata: costly, low analytic utility, highly dependent on model accuracy • Research data centers (RDCs) and data enclaves: high data analytic utility while maintaining a high standard of data confidentiality Strategizing Data Dissemination and Secure Microdata Access
Risk-Utility Tradeoff • The primary risk factor of data access is disclosure • Individual or firm level information must be handled very carefully • In the context of data access, there is a tradeoff between disclosure risk and data analytic utility • As additional measures are introduced to protect data confidentiality, data analytic utility will be reduced • In other words, the lower the risk, the lower the utility
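The tradeoff can be illustrated with a minimal sketch. It assumes one specific protection measure (adding zero-mean Gaussian noise to each record) and one crude utility measure (mean absolute error of the released values); the slides prescribe neither, and the income figures, function names and noise levels are made up for illustration.

```python
import random

def perturb(values, noise_sd, seed=0):
    """Add zero-mean Gaussian noise to every value (one simple perturbation method)."""
    rng = random.Random(seed)  # fixed seed so the illustration is repeatable
    return [v + rng.gauss(0.0, noise_sd) for v in values]

def mean_abs_error(original, released):
    """Average distance between true and released values: a crude utility-loss measure."""
    return sum(abs(o - r) for o, r in zip(original, released)) / len(original)

# Hypothetical microdata, invented for this example.
incomes = [31_000, 45_500, 52_000, 38_250, 120_000]

for sd in (0, 1_000, 10_000):
    loss = mean_abs_error(incomes, perturb(incomes, sd))
    print(f"noise sd={sd:>6}: mean absolute error = {loss:,.0f}")
```

More noise makes any single record harder to recover, which lowers disclosure risk, but the released values drift further from the truth, which lowers analytic utility.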
Confidentiality-Utility Curve [Figure: data access modalities plotted on two axes, Confidentiality and Analytic Utility. Labeled modalities: Physical and/or Remote Access Data Enclaves; Remote Batch Processing; Synthetic Microdata; Statistical Tables and Data Cubes; Public Use Data Files; Licensing]
The Third Factor • Confidentiality and data utility are not the only factors that influence the choice of data access modality • The third factor: Convenience • Producers’ perspective: how easy is it to: • Implement an RDC or enclave? • Update and document the data? • Monitor researchers’ work and output requests? • Researchers’ perspective: • How far do they need to travel to the nearest RDC? • How easy is it for them to conduct follow-up work? • How quickly does the RDC review and approve output requests? • How easy is it for them to seek assistance? • Is there any peer-to-peer researcher interaction or peer review?
Given the Same Utility… [Figure: enclave types plotted on two axes, Confidentiality and Convenience. Labeled types: Physical Data Enclaves; Remote Access Data Enclaves; Physical Data Enclaves with Remote Access. Annotations: value provided with a secure physical enclave; value added with remote access to diverse sensitive datasets; value added with flexible deployment of terminals]
Data Security • Data security: the ability to control disclosure risk, ensure privacy, and thus maintain data confidentiality • Both RDCs and data enclaves allow secure microdata access with a similar level of data analytic utility • RDC: researchers physically access data stored at a secure physical facility • Data Enclave: researchers remotely access data stored on a file server through a secure system in a virtualized environment • Both modalities provide high confidentiality protection: information inflow & outflow are monitored and controlled.
Portfolio Protection Approach • Legal • Educational / Training • Statistical • Technical • Operational (*Customized per data producer & dataset)
Technical Protection • Encrypted connection with the data enclave using virtual private network (VPN) technology. VPN technology prevents outsiders from reading the data transmitted between the researcher’s computer and NORC’s network. • Users access the data enclave from a static or pre-defined narrow range of IP addresses. • Citrix Web-based security interface. • All applications and data run on the server at the data enclave. • Data enclave can prevent the user from transferring any data from data enclave to a local computer. • Data files cannot be downloaded from the remote server to the user’s local PC. • User cannot use the “cut and paste” feature in Windows to move data from the Citrix session. • User is prevented from printing the data on a local computer. • Audit logs and audit trails
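The restriction to a static or narrow range of IP addresses can be sketched as an allowlist check. This is an illustrative sketch only, not NORC's implementation: the networks below are reserved documentation ranges, and `ALLOWED_NETWORKS` and `is_allowed` are hypothetical names.

```python
import ipaddress

# Hypothetical allowlist: one static address and one narrow pre-defined range.
# Both come from address blocks reserved for documentation.
ALLOWED_NETWORKS = [
    ipaddress.ip_network("192.0.2.10/32"),    # a single static address
    ipaddress.ip_network("198.51.100.0/28"),  # a 16-address range
]

def is_allowed(client_ip: str) -> bool:
    """Return True only if the client address falls inside an allowed network."""
    try:
        addr = ipaddress.ip_address(client_ip)
    except ValueError:
        return False  # malformed input is rejected, never guessed at
    return any(addr in net for net in ALLOWED_NETWORKS)

for ip in ("192.0.2.10", "192.0.2.11", "198.51.100.15", "198.51.100.16"):
    print(ip, "allowed" if is_allowed(ip) else "denied")
```

In practice a check like this would sit in a firewall or VPN gateway rather than in application code; the sketch only shows the rule being applied.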
Enclave Security Features
• Locked-down thin client with minimal software, hardware authentication and self-monitoring mechanisms
• Webcam for audit trails, user/room monitoring, face recognition
• Network connection control (fixed IP, no DNS resolution, etc.)
• 2-factor authentication (biometric, smartcard, token, etc.)
• Security & Support Center monitors activity and provides remote assistance and system maintenance
[Diagram labels: Internet; Enclave; DE Security/Support Team]
Enclave Security Features (cont.) • Ability to push out updates • Machines communicate with central server • Pings security configurations • Experimenting with GPS, biometrics, finger swipe/iris scan • 2-factor authentication
Operational Control • Internal Review: NORC performs extensive disclosure analysis on all output and makes a recommendation to the producer • Primary disclosure • Secondary disclosure • Residual disclosure • External Review: Data producers perform additional review of all output and make the final decision on all output releases • Internal + external statistical review = safe output
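As a rough illustration of what a disclosure check on tabular output can involve, the sketch below applies a minimum cell-size rule (a common guard against primary disclosure) and flags one simple secondary-disclosure pattern. The threshold of 5, the function names and the table are assumptions made for this example; the slides do not state which rules reviewers actually apply.

```python
MIN_CELL_COUNT = 5  # hypothetical threshold; real rules are set per producer and dataset

def suppress_small_cells(table, threshold=MIN_CELL_COUNT):
    """Primary disclosure check: blank out any cell based on too few respondents."""
    return {cell: (count if count >= threshold else None)
            for cell, count in table.items()}

def needs_secondary_review(released):
    """Flag one simple secondary-disclosure risk: a lone suppressed cell.

    If a row or column total is also published, a single suppressed cell
    can be recovered by subtraction, so a reviewer should look again.
    """
    suppressed = [cell for cell, value in released.items() if value is None]
    return len(suppressed) == 1

# Hypothetical frequency table submitted for release.
counts = {"Region A": 120, "Region B": 3, "Region C": 47}
released = suppress_small_cells(counts)
print(released)
print("secondary review needed:", needs_secondary_review(released))
```

Residual disclosure, where several individually safe outputs combine to reveal something, cannot be caught by a single-table check like this one and requires comparing requests over time.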
The USDA Experience • Problem: USDA needed to disseminate survey data to the agricultural economics research community • The Agricultural Resource Management Survey (ARMS) • They already operated a network of RDCs in all 50 states • Implementing a remote access solution allowed them to engage a larger number of more dispersed researchers and centralize their researcher outreach
The USDA Experience (cont.) • Geographically dispersed researchers travel to secure RDCs • Instead of creating additional costly brick-and-mortar RDCs, USDA can now roll out a virtual RDC to any university in the country • Thin client terminals are installed in secure locations at researchers’ universities
Collaboration • Collaboration increases researcher productivity • Traditional data access modalities do not accommodate the need for research collaboration • Data Enclave facilitates collaboration • Provides platforms and tools for collaboration • Environment for interaction (e.g., instant messaging) • Allows encrypted file sharing • Develops group identity within research communities
Enclave Collaboration Tools
Collaboration Tools (cont.) PRODUCER PORTAL
• GENERAL INFORMATION: Background info • Announcements • Calendar of events • About
• KNOWLEDGE SHARING: Topic of the week • Discussion groups • Wiki • Shared libraries • Metadata / Report • Scripts • Research papers
• SUPPORT: Frequently Asked Questions • Technical Support • DE usage • Data usage • Quality
Content is fully editable by producers and researchers using a simple web-based interface. Private research group portals with similar functionalities are configured for each research project.
Collaboration Tools (cont.)
• Home: Welcome, background information, contact, simple access to public data and documentation
• Researcher Services:
• My Datasets: Create custom view of the data for use in project or sharing with community
• My Projects: Bring together researchers in a virtual environment to share research ideas, data, documentation, and scripts
• My Publications: Package research outputs (papers, documents, scripts/programs, secondary data) for preservation, dissemination and sharing
• My Profile: Provide individual background information, research interests, set privacy options and configure notification services
• Collaborative Space:
• Wiki: Capture knowledge surrounding the data. Initial content will be seeded with survey metadata
• Library: Searchable libraries of papers/references/documentation, scripts/programs, primary and secondary data. Most of the content is extracted automatically from the research space
• Communication: Events and news, community-driven discussion groups, FAQ/Answers, chat
• Services: Researcher Directory, Project Directory, Call for collaboration, Notification, Support, Training
• Infrastructure: Primary and researcher data and metadata storage, databases, security (access, backups), web services
• Admin Services: System and data usage reports, data/metadata management, user administration, etc.
Kauffman Foundation Experience • Collaboration increases research productivity • The Kauffman Foundation sought a remote microdata access solution with the express intent of creating a collaborative research community around the Kauffman Firm Survey (KFS) • Output since mid-2007: • Two books • Five book chapters • 10 peer-reviewed articles • Two dissertations • 57 conference presentations • 23 research reports • Four best paper awards
External Collaboration • External web-based collaboration tools allow researchers to share knowledge leveraging public data and metadata as well as online information • These tools provide prospective researchers with the opportunity to familiarize themselves with confidential datasets prior to being granted access.
The NADA Data Catalog • The International Household Survey Network (IHSN) National Data Archive web-based tool (NADA) • Catalogs data using the DDI (Data Documentation Initiative) metadata standard • Allows prospective researchers to browse metadata • Users can compare variables across surveys within the system • Multi-tiered access system gives each researcher a personalized account so that producers can • Disseminate public use files directly • Review data access requests from researchers
The NADA Data Catalog (cont.)
Disclosure Review Innovations • Disclosure Control Process • Optimize management of archiving and disclosure review processes • Facilitate workflow for review and export of output requests • Phase 1: • Develop simple request packaging tool for user • Support workflow (submit, enclave review, producer review, delivery) • Institutional archiving and audit • Phase 2: • Link with DDI metadata to identify all data sources and variables • Facilitate review process, comparison of logged requests over time to protect against residual disclosure, understanding of variable usage/utility • Phase 3: Automation
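The Phase 1 workflow (submit, enclave review, producer review, delivery) with an audit record can be sketched as a small state machine. This is a hypothetical illustration, not the tool described in the slides: the class, field and stage names are invented, and the "rejected" state is an added assumption.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Stage names follow the workflow listed above; "rejected" is an added assumption.
STAGES = ["submitted", "enclave_review", "producer_review", "delivered"]

@dataclass
class OutputRequest:
    request_id: str
    researcher: str
    files: list
    stage: str = "submitted"
    audit_log: list = field(default_factory=list)

    def _record(self, actor, action):
        # Every transition is appended to the log so the history can be audited later.
        timestamp = datetime.now(timezone.utc).isoformat()
        self.audit_log.append((timestamp, actor, action, self.stage))

    def advance(self, actor):
        """Move the request to the next stage, or raise if it is already closed."""
        if self.stage not in STAGES or self.stage == STAGES[-1]:
            raise ValueError(f"request is already closed ({self.stage})")
        self.stage = STAGES[STAGES.index(self.stage) + 1]
        self._record(actor, "advanced")

    def reject(self, actor, reason):
        """Stop the request at the current review stage."""
        if self.stage not in ("enclave_review", "producer_review"):
            raise ValueError("only requests under review can be rejected")
        self.stage = "rejected"
        self._record(actor, f"rejected: {reason}")

request = OutputRequest("REQ-001", "researcher_a", ["table1.csv"])
request.advance("enclave_intake")     # submitted -> enclave_review
request.advance("enclave_reviewer")   # enclave_review -> producer_review
request.advance("producer_reviewer")  # producer_review -> delivered
print(request.stage, len(request.audit_log))
```

Keeping the log inside the request object means the archive stores the output together with its review history, which is what later comparison of requests over time would rely on.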
Disclosure Review Innovations: DocuStat and Coodle
• Script Import: SAS, Stata, and SPSS scripts are imported into Coodle and indexed using Apache Lucene. The files are also tagged to maintain attributes such as author, project, filename, and date.
• Coodle Variable Analysis: Lucene is a free text indexing engine. Similar to Google, it provides search functionality over vast amounts of text. DDI metadata is used to retrieve variable names, and Coodle performs name-based searches in the repository. This returns the scripts and code snippets where a particular variable is in use. The results are stored in an XML document.
• Coodle Source Code Browser: Users can also query the repository to retrieve source code examples or snippets for reuse.
• Coodle Report Generator: XML transformations are then used to generate various reports for the archive or producer.
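The name-based variable search can be approximated with a short sketch. It swaps in plain regular-expression matching for the Lucene index and takes the variable names as a list rather than reading them from DDI metadata; the function name, the script contents and the variable names are all hypothetical.

```python
import re

def find_variable_usage(scripts, variable_names):
    """Map each variable name to (filename, line number, line) for whole-word matches.

    Matching is case-insensitive, a simplifying assumption since SAS and SPSS
    ignore case in variable names while Stata does not.
    """
    hits = {name: [] for name in variable_names}
    for filename, source in scripts.items():
        for lineno, line in enumerate(source.splitlines(), start=1):
            for name in variable_names:
                if re.search(rf"\b{re.escape(name)}\b", line, flags=re.IGNORECASE):
                    hits[name].append((filename, lineno, line.strip()))
    return hits

# Hypothetical repository contents: one Stata script and one SAS script.
scripts = {
    "model.do": "use firms, clear\nregress revenue employees\n",
    "tables.sas": "proc means data=firms;\n  var Revenue;\nrun;\n",
}

usage = find_variable_usage(scripts, ["revenue", "owner_age"])
for name, found in usage.items():
    print(name, found)
```

A line-by-line scan is fine for a handful of scripts; an inverted index such as Lucene's becomes worthwhile when the repository holds many files and is queried repeatedly.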
Thank you! QUESTIONS?