220 likes | 372 Views
Data Management Best Practices. Data Management Best Practices. Ryan Womack and Aletia Morgan Rutgers University Libraries Research Data Services. Ryan Womack and Aletia Morgan Rutgers University Libraries Research Data Services. October 29, 2013. Why Data Management?.
E N D
Data Management Best Practices Data Management Best Practices Ryan Womack and Aletia Morgan Rutgers University Libraries Research Data Services Ryan Womack and Aletia Morgan Rutgers University Libraries Research Data Services October 29, 2013
Why Data Management? • Worst case scenarios • Lost data • Stolen data • Incomprehensible data • Unverifiable data
Why Data Management? • Best Case Scenarios, Data is… • Robust • Recoverable • Reliable • Reusable • Reproducible • Reputable • Renowned
Understand your data • Quantity (MB, GB, TB) • will affect your decisions about how to store, package, transport, and backup your data • Large numbers of discrete files, regardless of size, may be harder to handle • Multiple versions? Frequent updates? • Format • Seek out open formats or at least commonly used formats (.xlsx is ok, consider .csv) • Both for preservation and reuse, dependency on a single option can be a problem, even if it is the most convenient in the short-term. • Rights and permissions • Reuse of other data sources (licensed or otherwise) may require investigation and restrict data sharing options • Confidentiality of human subjects or guarding of information for patent protection may be reasons to restrict data • Give appropriate credit to others’ contributions
Preserving your Data What types of data need to be managed or and stored over the course of your project? • Raw Data • Working Data • Processed or Final Data • Preserved Data for Possible Re-Use
Data Preservation Considerations • Access Control • Versioning • Backup • Local • Shared • Campus • Offsite
Security Security includes both • Physical Security • Logical/Network/Password Security
Organizing your Data • Who controls the data for which parts of the project? • A logical structure and plan will help during and after data creation, for directories and filenames (e.g. project directory has standard locations for raw data, analysis, code, graphs, etc.) • Project folders with well-documented naming conventions (e.g., date of data creation as part of file name in a structured format (ISO is yyyy-mm-dd). See ARM example. • Versioning scheme should be agreed on and documented. • Goal is to have unique and understandable identifiers so that different parts of a project can merge and be handled easily over time, even if participants and conditions change.
Documenting your Data • Documentation helps • with your own reuse of the data at a later date, • others in your research group working with the data • with the long-term reuse potential of your data • A Codebook is a structured way of explaining the contents of your data file • Readme files are a well-understood way of communicating information about the contents and setup of your data • Documentation can also take the form of a simple text or Word document • Include any explanations of experimental methods, software code run on the data, and any other tools needed to work with the data • Your previous work on creating and describing a consistent organization for your data will help here.
Reproducible Research • Ideally, someone else can grab your data project as a complete bundle of data, documentation, and software code, and recreate the analysis to get exactly the same results • Reports and data can be integrated so that live analysis run on actual data can be placed in reports (some R packages do this). • Many initiatives are advancing the concept of reproducible research. • This high standard of evidence and validation is an assurance that data and conclusions are not flawed (or faked). • Good data management practices lay the groundwork for success in reproducible research
Data Management Plans • Many sponsored research agencies now require a "Data Management Plan" (DMP) as a component of any proposal • The DMP is a formal document that outlines what the PI will do with data during and after completion of a funded research project • The specific requirements for the DMP vary by funder, and by research subject
Why The DMP Requirement? • Ensure the preservation of important research data • Support the potential re-use of grant funded data by other researchers to validate and potentially extend the value of the data • Improve accountability for use of public revenue to support research • Provide for the reasonable access to research data to be consistent with the Freedom of Information Act
Benefits of a Good DMP • Improved competitiveness for grant programs; a clear and complete Data Management Plan will support the project’s evaluation plan • Enhanced PI efficiency – writing a DMP encourages the creation of a structured plan for managing research data throughout the life of the project and beyond • Long-term protection and preservation of RU research data
Who’s Asking for a DMP? • National Science Foundation http://www.nsf.gov/pubs/policydocs/pappguide/nsf11001/aag_6.jsp#VID4 • Centers for Disease Control and Prevention http://www.cdc.gov/od/foia/policies/sharing.htm • Department of Energy http://www.cio.energy.gov/policy-guidance/federal_regulations.htm • Department of Defense http://www.dtic.mil/whs/directives/corres/pdf/320014p.pdf • Environmental Protection Agency http://www.epa.gov/quality/informationguidelines/documents/EPA_InfoQualityGuidelines.pdf • Institute of Museum and Library Services http://www.imls.gov/applicants/forms/DigitalProducts.pdf • NASA http://nasascience.nasa.gov/earth-science/earth-science-data-centers/data-and-information-policy • National Endowment for the Humanities http://www.neh.gov/grants/guidelines/pdf/DataManagementPlans.pdf • National Institute of Justice http://www.ojp.usdoj.gov/nij/funding/data-resources-program/welcome.htm • National Institute of Standards and Technology http://www.nist.gov/director/quality_standards.htm • United States Department of Agriculture http://www.csrees.usda.gov/ • United State Department of Education http://ies.ed.gov/funding/datasharing_policy.asp • Many academic journals have also begun to require data sharing as part of the submission process. For a partial list of journals with data sharing mandates for their published articles, see the following: • http://oad.simmons.edu/oadwiki/Journal_open-data_policies http://gking.harvard.edu/pages/data-sharing-and-replication As described by Portland State University Library http://library.pdx.edu/digital-scholarship/data-mgmt-plans/who-requires-dmps.html
The NSF DMP Should Include • [data attributes] The types of data, samples, physical collections, software, curriculum materials, and other materials to be produced in the course of the project; • [metadata] The standards to be used for data and metadata format and content • [security policies] Policies for access and sharing including provisions for appropriate protection of privacy, confidentiality, security, intellectual property, or other rights or requirements, including the right to embargo data for a specified time period to allow first publication and thorough use of the data; • [use policies] Policies and provisions for fair re-use, re-distribution, and the production of derivatives; and • [preservation] Plans for archiving data, samples, and other research products, and for preservation of access to them.
DMP Web Support Operated by the California Digital Library, the DMPTool is a site supporting general DMP development with some school-specific guidance RUL Data Resources – offers information about services and experts to support your data management efforts
Discoverable Data • Publicly available archives such as Dryad, Dataverse, ICPSR, and more allow other researchers to easily discover and reuse data • Metadata provide standardized terminology for searching and discovering data matching defined characteristics. Unlike a data upload to a website, metadata provides a well-structured way for computer indexing of the data. • Your discipline may have well-defined metadata standards (or not). Check with your librarian if you are in doubt. • Databib and re3data are directories of research data repositories that can be used to discover possible locations for you to deposit and share your data.
Citing Data • By publicly sharing your data via an established repository, you will typically get a DOI (Digital Object Identifier) or other persistent URL. This will serve as a permanent pointer to the data • You can cite your data, and others’ data, with DOI’s. This is more precise and easier for others to work with than a vague reference to the data by title, author, or even citing a paper that uses the data. • Data citation increases the impact of your research! This completes the data lifecycle.
Further References • Australian National Data Service: Data Management for Researchers • Australian National University: Data Management • CIESIN: Geospatial Electronic Records • ICPSR Guide to Social Science Data Preparation and Archiving (pdf): • Oak Ridge National Laboratory: Best Practices for Preparing Environmental Data Sets to Share and Archive • UK Data Archive: Create & Manage Data and Managing and Sharing Data: a Best Practice Guide for Researchers (pdf).