1 / 64

norc/dataenclave

vansickle
Download Presentation

norc/dataenclave

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. NORC Data Enclave:Providing Secure Remote Access to Sensitive Microdata Julia Lane, Principal Investigator, National Science Foundation Chet Bowie, Senior Vice President, Director, Economics, NORC Fritz Scheuren, Vice President, Senior Fellow, Statistics & Methodology, NORCTimothy Mulcahy, Data Enclave Program Director, NORC http://www.norc.org/dataenclave

  2. Overview • Background • Mission • Portfolio Approach • Security • Enclave Walkthrough • Data and Metadata Best Practices • Outreach / Dissemination • Next Steps

  3. Background: the Data Continuum • Conceptualization and operationalization • Data collection • Data maintenance and management • Data processing and analysis • Data quality and security • Data archiving and access • Production and dissemination of new knowledge to inform policy and practice

  4. Background: Data in the 21st Century • Remarkable growth in the amount of data available to us • Growing demand for better management of the quality and value of data • Dramatic advances in the technology available to best use those data to inform policy and practice

  5. Background: the Challenge • To be responsive to the dramatic and fast paced technological, social, and cultural changes taking place in the data continuum • To be resourceful enough to take advantage of them to best inform policymaking, program development, and performance management

  6. Mission • Promote access to sensitive microdata • Protect confidentiality (portfolio approach) • Archive, index and curate micro data • Encourage researcher collaboration / virtual collaboratory

  7. Emergence of the Remote Data Enclave • Enclave began in July 2006 • Sponsors • US Department of Commerce (NIST-TIP) • US Department of Agriculture (ERS/NASS) • Ewing Marion Kauffman Foundation • US Department of Energy (EIA) (pilot project) • National Science Foundation (2009)

  8. What is the Enclave? • Secure environment for accessing sensitive data • Access: remote desktop, encryption, audit logs • Security: controlled information flow, group isolation • Facility: software, tools, collaborative space, technical support

  9. Ideal System • Secure • Flexible • Low Cost • Meet Replication standard • The only way to understand and evaluate an empirical analysis fully is to know the exact process by which the data were generated • Replication dataset include all information necessary to replicate empirical results • Metadata crucial to meet the standard • Composed of documentation and structured metadata • Undocumented data are useless • Create foundation for metadata documentation and extend data lifecycle

  10. Principles of Providing Access • Valid research purpose • statistical purpose • need “sponsor agency /data producer benefit” • Trusted researchers • Limits on data use • Remote access to secure results • Disclosure control of results / output safe projects + safe people + safe setting + safe outputs  safe use

  11. Safe People and Disciplinary Actions • All new researchers undergo background checks and extensive training (as per producer) • Once competence has been demonstrated, researchers are permitted to access the enclave • Appropriate conduct also allows researchers to have longer term contracts with data producers/sponsors

  12. Safe People and Disciplinary Actions • Users may not attempt to identify individuals or organizations from the data • Any attempt to remove data (successful or not) is regarded as a deliberate action • Any breach of trust dealt with severely Possible actions include: • Immediate removal of access for researcher (and colleagues?) • Formal complaint by Data Producer to institution • Potential legal action

  13. Safe People are Vital • Not possible to remove data electronically from enclave except by NORC data custodian. • All users trained in access to enclave • However, risk of users abusing data can’t be reduced to zero (portfolio approach) • Researchers: • May only access the dataset they applied for and were authorized to use • use must relate to the originally proposed project

  14. Data Enclave Today • Approximately 50 active researchers • More than 40 research projects • Approximately 500 people on enclave listserv • Increasingly accepted as a model for providing secure remote access to sensitive data: • UK Data Archive, Council on European Social Science Archives, University of Pennsylvania, Chapin Hall Center for Children • Emphasis on building and sustaining virtual organizations or collaboratories

  15. Available Datasets • Department of Commerce, National Institute of Standards and Technology • ATP Survey of Joint Ventures • ATP Survey of Applicants • Business Reporting Survey Series • Department of Agriculture, Economic Research Service, National Agricultural Statistical Service • Agricultural Resource Management Survey (ARMS) • Kauffman Foundation • Kauffman Firm Survey • Department of Energy, Energy Information Agency (pilot) • National Science Foundation (2009)

  16. Examples of Researcher Topic Areas • Entrepreneurship • Knowledge & innovation • Joint ventures • New businesses/startups • Strategic alliances • Agricultural economics

  17. Researcher Institutions

  18. Portfolio Approach to Secure Data Access Portfolio Approach* • Legal • Educational / Training • Statistical • Data Protection / Operational / Technological (*Customized per data producer & dataset)

  19. Options for Agency X (and Study Y) Sample Modalities Legal Options (1,2,3,4) Statistical (1,2,3,4,5) Operational (1,2,3,4,5) Educational (1,2,3,4) Remote Access 3 1 4 2 None 2 5 2 Onsite Access 3with customization 3,5 1 None Licensing (different levels of anonymization) 2 1 2,3 1,4 Portfolio Approach

  20. Legal Protection: Data Enclave Specific On an annual basis: • Approved researchers sign Data User Agreements (legally binding the individual and institution) • Researchers and NORC staff sign Non-disclosure Agreements specific to each dataset • Researchers and NORC staff complete confidentiality training

  21. Educational / Researcher Training Locations • Onsite • Remote / Web-based / SFTP • Researcher locations (academic institutions, conferences AAEA, JSM, AOM, ASA, ASSA, NBER summer institute Note: The training is designed to go above and beyond current practice in terms of both frequency and coverage

  22. Educational/Training Example Agenda Day 1 • Data enclave navigation (NORC) • Metadata documentation (NORC) • Confidentiality and data disclosure (NORC) • Survey overview (Data Producer) • Confidentiality agreement signing (NORC/ (Data Producer) Day 2 • Data files and documentation (Data Producer) • Sampling and weights (Data Producer) • Item quality control and treatments for non-response (Data Producer) • Statistical testing (Data Producer)

  23. Statistical Protection • Remove obvious identifiers and replace with unique identifiers • Statistical techniques chosen by agency (recognizing data quality issues)* • Noise added? • Full disclosure review of all data exported coordinated between NORC and Data Producer * Note: At discretion of producer and can go above and beyond the minimum level of protection

  24. Disclosure Review Process • Getting disclosure proofed results • Drop table/file and related files in special folder in shared area • Submit request for disclosure review through tech support • Complete Checklist (researcher “to do” list) • Include details about number of observations in each cell; previous releases • Summary, disclosure proofed output made available for download on a public server

  25. Secure lab NORC & Data Producer Review Data read only Researcher Logs in to server work area read-write Output Input automatic transfer archiving of all transfers Input Output Disclosure Review Process Disclosure Review

  26. Disclosure Review Guidelines • Provide a brief description of your work • Identify which table(s) / file(s) you are requesting to be reviewed and their location • Specify the dataset(s) and variables from which the output derives • Identify the underlying cell sizes for each variable, including regression coefficients based on discrete variables

  27. Helpful Hints on Disclosure Review • Threshold Rule • No cells with less than 10 units (individuals/enterprises). Local unit analysis must show enterprise count (even when there is no information associated with each cell) • Avoid / be careful when / remember to: • Tabulating raw data (threshold rule) • Using “lumpy variables, such as “investment” • Researching small geographical areas (dominance rule) • Graphs are simply tables in another form (always display frequencies) • Treat quantiles as tables (always display frequencies) • Avoid minimum and maximum values • Regressions generally only present disclosure issues when: • Only on dummies = a table • On public explanatory variables • Potentially disclosive situations when differencing hiding coefficients makes linear and non linear estimation completely non-disclosive (note: panel models are inherently safe)

  28. Data Protection / Operational • Encrypted connection with the data enclave using virtual private network (VPN) technology. VPN technology enables the data enclave to prevent an outsider from reading the data transmitted between the researcher’s computer and NORC’s network. • Users access the data enclave from a static or pre-defined narrow range of IP addresses. • Citrix’s Web-based technology. • All applications and data run on the server at the data enclave. • Data enclave can prevent the user from transferring any data from data enclave to a local computer. • Data files cannot be downloaded from the remote server to the user’s local PC. • User cannot use the “cut and paste” feature in Windows to move data from the Citrix session. • User is prevented from printing the data on a local computer. • Audit logs and audit trails

  29. Data Protection / Operational • NORC already collects data for multiple statistical agencies (BLS, Federal Reserve (IRS data), EIA, NSF/SRS etc.) => & has existing safeguards in place • The Data Enclave is fully compliant with DOC IT Security Program Policy, Section 6.5.2, the Federal Information Security Management Act, provisions of mandatory Federal Information Processing Standards (FIPS) and all other applicable NIST Data IT system and physical security requirements, e.g., - Employee security - Rules of behavior - Nondisclosure agreements - NIST approved IT security / certification & accreditation - Applicable laws and regulations - Network connectivity - Remote access - Physical access

  30. High-level Access Perspective

  31. Restrictions in the Enclave Environment • Access only to authorized applications • Most system menus have been disabled • Some control key combinations or right click functions are also disabled on keyboard • Closed environment: no open ports, no access to Internet or email • No output (tables, files) may be exported and no datasets imported without being reviewed • File explorer is on default settings

  32. Accessing the Enclave https://enclave.norc.org *** don’t’ forget it’s https://, not http:// *** The message center will inform you on browser related technical issues. Note that you will first need to install the Citrix Client on you system (a download link will be provided) Enter your user name and password. The first time you access, you will need to change your password. Need at least one number and a mix upper/lower case characters. Password must be changed every 3-months

  33. Tools Available in the Enclave • Stata/SE 10.0 • StatTransfer 9 (selected users) • SAS v9.2 • “R” Project for Statistical Computing • MATLAB • LimDep / NLogit • Microsoft office 2007 • Adobe PDF Reader • IHSN Microdata Management Toolkit / Nesstar Publisher (upon request, selected users only)

  34. Collaboration Tools PRODUCER PORTAL GENERAL INFORMATION KNOWLEDGE SHARING SUPPORT • Background info • Announcements • Calendar or events • About • Topic of the week • Discussion groups • Wiki • Shared libraries • Metadata / Report • Scripts • Research papers • Frequently Asked Questions • Technical Support • DE usage • Data usage • Quality Content fully editable by producers and researchers using a simple web based interface Private research group portals with similar functionalities are configured for each research project

  35. Contributing to the Wiki Wiki pages can be changed by clicking the Edit button All changes are tracked and can be seen using the history page. Links to other wiki pages are shown as hyperlinks Pages that have not yet been created are underlined

  36. Data Documentation & Shared Code Libraries Click on documents and folders to open or navigate in the structure Use the menu to create folders or upload documents

  37. Labeled stuff Unlabeled stuff The bean example is taken from: A Manager’s Introduction to Adobe eXtensible Metadata Platform, http://www.adobe.com/products/xmp/pdfs/whitepaper.pdf What are Metadata? • Common definition: Data about Data

  38. Metadata You Use Everyday… • The internet is build on metadata and XML technologies

  39. Academic Producers Users Librarians Sponsors Business Media/Press Policy Makers General Public Government Managing Social Science Metadata is Challenging! We are in charge of the data. We support our users but also need to protect our respondents! We have an information management problem We want easy access to high quality and well documented data! We need to collect the information from the producers, preserve it, and provide access to our users!

  40. Metadata Needs in Social Science • The data food chain • Data takes a very long path from the respondent to the policy maker • Process should be properly documented at each step • Different needs and perspectives across life cycle • But it’s all about the data/knowledge being transformed and repurposed • Linkages / information should be maintained • Drill from results to source • Needed to understand how to use the data and interpret the results • Information should be reusable • Common metadata • Shared across projects • Dynamic integration of knowledge

  41. Importance of Metadata • Data Quality: • Usefulness = accessibility + coherence + completeness + relevance + timeliness + … • Undocumented data is useless • Partially documented data is risky (misuse) • Data discovery and access • Preservation • Replication standard (Gary King) • Information / knowledge exchange • Reduce need to access sensitive data • Maintain coherence / linkages across the complete life cycle (from respondent to policy maker) • Reuse

  42. Metadata Issues • Without producer / archive metadata • We do not know what the data is about • We loose information about the production processes • Information can’t be properly preserved • Researchers can’t work discover data or perform efficient analysis • Without researcher metadata • Research process is not documented and cannot be reproduced (Gary King  replication standard!) • Other researchers are not aware of what has been done (duplication / lack of visibility) • Producer don’t know about data usage and quality issues • Without standards • Such information can’t be properly managed and exchanged between actors or with the public • Without tools: • We can’t capture, preserve or share knowledge

  43. What is a Survey? • More than just data…. • A complex process to produce data for the purpose of statistical analysis • Beyond this, a tool to support evidence based policy making and results monitoring in order to improve living conditions • Represents a single point in time and space • Need to be aggregated to produce meaningful results • It is the beginning of the story • The microdata is surrounded by a large body of knowledge • But survey data often come with limited documentation • Survey documentation can be broken down into structured metadata and documents • Structured metadata (can be captured using XML) • Documents (can be described in structured metadata)

  44. Microdata Metadata Examples • Survey level • Data dictionary (variable labels, names, formats,…) • Questionnaires: questions, instructions, flow, universe • Dataset structure: files, structure/relationships, • Survey and processes: concepts, description, sampling, stakeholders, access conditions, time and spatial coverage, data collection & processing,… • Documentation: reports, manuals, guides, methodologies, administration, multimedia, maps, … • Across surveys • Groups: series, longitudinal, panel,… • Comparability: by design, after the fact • Common metadata: concepts, classifications, universes, geography, universe

  45. Questionnaire Example Universe Instruction Module/Concepts Questions Classifications (some reusable) Value level Instruction (skip) Instruction

  46. Information Technology and Metadata The eXtensible Markup Language (XML) and related technologies are use to manage metadata or data Document Type Definition (DTD) and XSchema are use to validate an XML document by defining namespaces, elements, rules Specialized software and database systems can be used to create and edit XML documents. In the future the XForm standard will be used XML separates the metadata storage from its presentation. XML documents can be transformed into something else, like HTML, PDF, XML, other) through the use of the eXtensible Stylesheet Language, XSL Transformations (XSLT) and XSL Formatting Objects (XSL-FO) Very much like a database system, XML documents can be searched and queried through the use of XPath oe XQuery. There is no need to create tables, indexes or define relationships XML metadata or data can be publishedin “smart” catalogs often referred to as registries than can be used for discovery of information. XML Documents can be sent like regular files but are typically exchanged between applications through Web Services using the SOAP and other protocols

  47. Metadata specifications for social sciences • We need generic structures to capture the metadata • A single specification is not enough • We need a set of metadata structures • That can map to each other to (maintain linkages) • Will be around for a long time (global adoption, strong community support) • Based on technology standards (XML) • Suggested set • Data Documentation Initiative (DDI) – survey / administrative microdata • Statistical Data and Metadata Exchange standard (SDMX) – aggregated data / time series • ISO/IEC 11179 – concept management and semantic modeling • ISO 19115 – Geographical metadata • METS – packaging/archiving of digital objects • PREMIS – Archival lifecycle metadata • XBRL – business reporting • Dublin Core – citation metadata

  48. The Data Documentation Initiative • The Data Documentation Initiative is an XML specification to capture structured metadata about “microdata” (broad sense) • First generation DDI 1.0…2.1 (2000-2008) • Focus on single archived instance • Second generation DDI 3.0 (2008) • Focus on life cycle • Go beyond the single survey concept • Multi-purpose • Governance: DDI Alliance • Membership based organizations (35 members) • Data archives, producers, research data centers, academic • http://www.ddialliance.org/org/index.html

  49. Pre-DDI 1.0 70’s / 80’s OSIRIS Codebook 1993: IASSIST Codebook Action Group 1996 SGML DTD 1997 DDI XML 1999 Draft DDI DTD 2000 – DDI 1.0 Simple survey Archival data formats Microdata only 2003 – DDI 2.0 Aggregate data (based on matrix structure) Added geographic material to aid geographic search systems and GIS users 2003 – Establishment of DDI Alliance 2004 – Acceptance of a new DDI paradigm Lifecycle model Shift from the codebook centric / variable centric model to capturing the lifecycle of data Agreement on expanded areas of coverage 2005 Presentation of schema structure Focus on points of metadata creation and reuse 2006 Presentation of first complete 3.0 model Internal and public review 2007 Vote to move to Candidate Version (CR) Establishment of a set of use cases to test application and implementation October 3.0 CR2 2008 February 3.0 CR3 March 3.0 CR3 update April 3.0 CR3 final April 28th 3.0 Approved by DDI Alliance May 21st DDI 3.0 Officially announced Initial presentations at IASSIST 2008 2009 DDI 3.1 and beyond DDI Timeline / Status

  50. Academic Producers Users Archivists Sponsors Business Media/Press Policy Makers General Public Government DDI 2.0 Perspective DDI 2 Survey DDI 2 Survey DDI 2 Survey DDI 2 Survey DDI 2 Survey DDI 2 Survey DDI 2 Survey

More Related