Data Management Workshop Minglu Wang (minglu@rutgers.edu) Bonnie L. Fong (bonnie.fong@rutgers.edu) Ann Watkins (ann.watkins@rutgers.edu)
How Libraries Manage Data • We’re “in the business” of information, including data… • @ the Rutgers University Libraries, we: • Collect & manage millions of books, thousands of periodicals, hundreds of databases + other online resources • Ensure easy access through good organization, through controlled vocabulary (≈ metadata) • Preserve materials
New Trends of Doing Research • Data intensive and broad collaboration • E-Science • Sharing research data, procedures, and results • Networked/Open Science
Research Funders’ DM & DS Requirement • NSF Data Management & Data Sharing requirement “Proposals must include a supplementary document of no more than two pages labeled ‘Data Management Plan’. This supplement should describe how the proposal will conform to NSF policy on the dissemination and sharing of research results.” “Investigators are expected to share with other researchers, at no more than incremental cost and within a reasonable time, the primary data, samples, physical collections and other supporting materials created or gathered in the course of work under NSF grants.”
Pay Attention to Requirements by Directorate, Office, Division, Program, or other NSF Unit • Biological Sciences Directorate (BIO) • Directorate-wide Guidance • Computer & Information Sciences & Engineering (CISE) • Directorate-wide Guidance • Education & Human Resources Directorate (EHR) • Directorate-wide Guidance • Engineering Directorate (ENG) • Directorate-wide Guidance • Geosciences Directorate (GEO) • Directorate-wide Guidance • Mathematical and Physical Sciences Directorate (MPS) • Division of Astronomical Sciences • Division of Chemistry • Division of Materials Research • Division of Mathematical Sciences • Division of Physics • Social, Behavioral and Economic Sciences Directorate (SBE) • Directorate-wide Guidance
Many Benefits of Well-Planned DM Practice • For your own project • Efficient and cost-effective • Avoiding disasters/loss • Habits of being accountable as a researcher • For the research community • Enable data re-use, meta-analysis, and new discoveries • Facilitate collaboration • Increase research impact
What to plan? • The NSF DMP requires planning information on: • Data under the context of your research • the types of data, samples, physical collections, software, curriculum materials, and other materials to be produced in the course of the project; • Data and metadata formats and documentation • the standards to be used for data and metadata format and content (where existing standards are absent or deemed inadequate, this should be documented along with any proposed solutions or remedies); • Data storage, backup, and access • policies for access and sharing including provisions for appropriate protection of privacy, confidentiality, security, intellectual property, or other rights or requirements; • Data sharing and long-term preservation • policies and provisions for re-use, re-distribution, and the production of derivatives; and • plans for archiving data, samples, and other research products, and for preservation of access to them.
What to Document? • Project • Data Collection • Questionnaire and Variables Construction • Data Integrity • Datasets (naming, structure, version, changes) • Analysis Process and workflow • ……
Data Life Cycle and Data Management Tasks - I • Proposal Development and Data Management Plans • Contact archive for advice • Create data management plan to ensure long-term availability of data resources • Project Start-up • Make decisions about documentation form and content • Conduct pretests and pilot tests of materials and methods
Data Life Cycle and Data Management Tasks - II • Data Collection and File Creation • Follow best practice • For data, address dataset integrity, variable names, labels, and groups; coding; missing data • For documentation, explore use of metadata standard; include all relevant documentation elements; document constructed variables • Data Analysis • Manage master datasets and work files • Set up appropriate file structures • Back up data and documentation
Data Life Cycle and Data Management Tasks - III • Preparing Data for Sharing • Address disclosure risk limitation • Determine file formats to deposit • Contact archive for advice • Depositing Data • Complete relevant forms • Comply with dissemination standards and formats • Source: Guide to Social Science Data Preparation and Archiving, developed by ICPSR
General Principles • Start considering data management early • Always include a Data Management Plan (DMP) in your research proposal • Data management is an integral part of the complete research cycle • Your DMP needs to be implemented and reviewed continuously • Every stage of research involves data management issues • Adopt standards for data coding, file formatting, and metadata construction
Managerial Strategies • Develop policies, training, manuals, templates, consistent oversight, and regular communication, to prevent unreliable research data collection and analysis • Study institutional and funders’ data policies • Establish data collection and recording procedures BEFORE data collection begins • Plan ahead for possible data loss / restoration situations • Source: Research Data Management Online Workshop, developed by UMN
Ensuring Reproducibility through Documentation • Every stage of data life cycle: raw data, processed data, data for analysis, visualizations, data for archive… • Data analysis (software, programming code, models) • Process/workflow (process metadata) • Including Data provenance; Analyses and parameters used; Connections between analyses via inputs and outputs • Informal: using flowcharts, commented scripts • Formal: using software e.g. Kepler, VisTrails
Check and Document Data Reliability and Data Integrity • Double-check variable names, labels, and coding for accuracy and consistency across time and files • Missing values are important • Check data completeness • Detect errors by running frequencies, means, ranges, cross tabulations, time series plots… • Peer review
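The checks on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the variable names (`age`, `sex`), the valid codes, and the plausible age range are all hypothetical and would come from your own codebook.

```python
from collections import Counter
from statistics import mean

# Hypothetical survey records; None marks a missing value.
records = [
    {"id": 1, "age": 34, "sex": "F"},
    {"id": 2, "age": None, "sex": "M"},
    {"id": 3, "age": 29, "sex": "F"},
    {"id": 4, "age": 210, "sex": "f"},  # out-of-range age, inconsistent code
]

VALID_SEX_CODES = {"F", "M"}  # assumed codebook values
AGE_RANGE = (0, 120)          # assumed plausible range


def summarize(rows):
    """Run frequencies, a mean, a range, and a missing-value count."""
    ages = [r["age"] for r in rows if r["age"] is not None]
    return {
        "n": len(rows),
        "missing_age": sum(1 for r in rows if r["age"] is None),
        "sex_frequencies": dict(Counter(r["sex"] for r in rows)),
        "age_mean": mean(ages),
        "age_range": (min(ages), max(ages)),
    }


def find_errors(rows):
    """List (record id, variable, value) for values that break the codebook."""
    problems = []
    for r in rows:
        if r["sex"] not in VALID_SEX_CODES:
            problems.append((r["id"], "sex", r["sex"]))
        age = r["age"]
        if age is not None and not (AGE_RANGE[0] <= age <= AGE_RANGE[1]):
            problems.append((r["id"], "age", age))
    return problems


report = summarize(records)
errors = find_errors(records)
print(report)
print(errors)
```

The frequency table exposes the stray lowercase code, and the range exposes the impossible age; both would be easy to miss by reading the file by eye.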
Documentation as a foundation for: • access • retrieval • sharing • archiving • preservation
Documentation includes information about all aspects of data • Context for the study • Data collection methods, tools and instruments • Coding information including variable names and any missing data • Quality control measures used to validate, check, clean and proof data • Changes made to the data after collection • Conditions for access and use of the data • Data file structure and relationships between the data files
File name • Primary identifier for data files • Each file name is unique • Means to classify or sort the files • Facilitates locating by researcher and others • Structure makes different versions easily identifiable
Elements in a file name • Project acronyms or project number • Researcher’s initials or team name • Data file type • Data file creation date • Version number • File status information
Best practices for file names • Make the name meaningful but brief – usually no more than 25 characters. • Sufficient content description for a researcher to decide about file’s usefulness without opening the file. • Use a generic data file name to eliminate problems if moved from one location to another. • Use underscores instead of spaces. • Avoid punctuation or special characters. • Apply the naming system whenever a data file is created. • Use the naming system consistently during the project. • Plan for scalability (e.g., use enough digits in sequence or version numbers to cover the expected number of files).
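A naming system is easier to apply consistently when it is written down as a small function. The sketch below is one possible convention, not a standard: the element order, the `YYYYMMDD` date form, the two-digit version number, and the choice to apply the 25-character guideline to the name without its extension are all assumptions.

```python
import re
from datetime import date

MAX_STEM_LENGTH = 25  # the "usually no more than 25 characters" guideline


def build_name(project, initials, created, version, status, extension):
    """Join the file-name elements with underscores (hypothetical convention)."""
    stem = "_".join(
        [project, initials, created.strftime("%Y%m%d"), f"V{version:02d}", status]
    )
    return f"{stem}.{extension}"


def check_name(name):
    """Return a list of best-practice problems found in a file name."""
    stem = name.rsplit(".", 1)[0]  # ignore the extension
    problems = []
    if len(stem) > MAX_STEM_LENGTH:
        problems.append("too long")
    if " " in stem:
        problems.append("contains spaces")
    if re.search(r"[^A-Za-z0-9_ -]", stem):
        problems.append("special characters")
    return problems


name = build_name("LIP", "BF", date(2014, 2, 3), 2, "draft", "csv")
print(name)
print(check_name(name))
print(check_name("my interview notes (final!).docx"))
```

Zero-padding the version number (`V02` rather than `V2`) keeps files sorting correctly once a project passes nine versions.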
Versioning • When files contain closely related content • When files are used by one or more researchers in different locations • When files require synchronization to prevent parallel development of different versions
Decisions about versioning • How many versions to keep? When to remove? • How to organize versions? • How to record a version and its status—in numbers or text? • How to record changes made in the new version? • How to track locations of files? • When to synchronize and what process will be used? • What is the best single location for master version storage?
Examples of file names • Pattern: [document name] [version #] [status: draft/final] • Fong_interview_February_2014_V1.2_draft • Lipid-analysis-rate-V2_definitive • 2011_01_28_1LB_CS3_V6_AB_edited
Maintaining control of file versions • Include version number or description in file • Maintain a file history with versions, dates, authors, and any changes to the file • Take advantage of version control capabilities within the software you’re using • Use versioning software such as Subversion (SVN) • Use file sharing services such as Dropbox, Google Docs, Amazon S3 • Determine who can edit files • Merge entries or edits manually to minimize problems
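A file history with versions, dates, authors, and changes can be as simple as a CSV log kept next to the data. The following is a minimal sketch; the `HistoryEntry` fields and the sample entries are hypothetical, and a version control system such as Subversion records the same information automatically.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class HistoryEntry:
    version: str
    date: str    # ISO 8601 date, e.g. 2014-02-03
    author: str  # researcher's initials
    change: str  # what changed in this version


# Hypothetical history for one data file.
history = [
    HistoryEntry("V1", "2014-02-03", "BF", "Initial transcription"),
    HistoryEntry("V2", "2014-02-10", "MW", "Corrected speaker labels"),
]


def to_csv(entries):
    """Serialize the history as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=[f.name for f in fields(HistoryEntry)],
        lineterminator="\n",
    )
    writer.writeheader()
    for entry in entries:
        writer.writerow(asdict(entry))
    return buffer.getvalue()


log_text = to_csv(history)
print(log_text)
```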
Organization Methods • Hierarchical: moving from broad to specific • Facilitates retrieval by location • Folder > Subfolder > File
Organization methods • Tagging • Indicate many relationships, placing data file in multiple locations • Requires additional software—Tabbles, TaggedFrog, TaggTool, Gmail (tagging email messages rather than using folders) • Facilitates retrieval by searching • Used to supplement hierarchical method
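To illustrate how tagging supplements a folder hierarchy, the sketch below keeps a small index from tags to file paths, so one file can be found under several tags while still living in a single folder. The file paths and tags are hypothetical, and dedicated tagging tools store this index for you.

```python
from collections import defaultdict

# Maps each tag to the set of file paths that carry it.
tag_index = defaultdict(set)


def add_tags(path, *tags):
    """Attach one or more tags to a file path."""
    for tag in tags:
        tag_index[tag.lower()].add(path)


def find_files(*tags):
    """Return the files that carry every one of the given tags."""
    if not tags:
        return []
    matches = [tag_index.get(tag.lower(), set()) for tag in tags]
    return sorted(set.intersection(*matches))


# Hypothetical files, each stored in one folder but reachable by many tags.
add_tags("2014/interviews/Fong_interview_V1.txt", "interview", "2014", "draft")
add_tags("2014/lab/lipid_rates_V2.csv", "lipid", "2014", "final")
add_tags("2013/lab/lipid_rates_V1.csv", "lipid", "2013", "draft")

lipid_2014 = find_files("lipid", "2014")
drafts = find_files("draft")
print(lipid_2014)
print(drafts)
```

The `drafts` search cuts across both years and both subfolders, which is exactly the kind of retrieval a hierarchy alone makes awkward.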
Storage locations for documentation • Separate files • Lab notebooks • Data collection user’s guides • Code books • Data dictionaries
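As an illustration of a data dictionary kept as a separate file, the sketch below describes each variable in a plain structure, saves it as JSON, and checks that every column in a dataset is documented. The variable names, labels, and missing-data codes are hypothetical.

```python
import json

# Hypothetical data dictionary: one entry per variable in the dataset.
data_dictionary = {
    "age": {
        "label": "Age in years",
        "type": "integer",
        "missing_code": -9,
    },
    "sex": {
        "label": "Sex of respondent",
        "type": "category",
        "codes": {"F": "Female", "M": "Male"},
        "missing_code": "NA",
    },
}


def undocumented(columns, dictionary):
    """Return the dataset columns that have no data dictionary entry."""
    return [column for column in columns if column not in dictionary]


# Text that could be saved alongside the data file.
dictionary_json = json.dumps(data_dictionary, indent=2, sort_keys=True)

missing_docs = undocumented(["id", "age", "sex"], data_dictionary)
print(dictionary_json)
print(missing_docs)
```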
NISO’s definition of metadata • “Metadata is structured information that describes, explains, locates, or otherwise makes it easier to retrieve, use, or manage an information resource.”
Purpose of metadata • Provides organization • Enhances discovery through searching • Serves as a unique identifier • Supports archiving and preservation • Provides interoperability so searching can take place across different organizational structures
Creating metadata schemas to provide structure • Experts in classification and description identify unique characteristics of item format • They find mutually acceptable terms for field names • They define the content of the fields • Schemas shared with others to foster common use
Metadata for books [figure: metadata record]
Metadata for journal articles [figure: metadata record]
Metadata fields for digital items • Abstract • Classification • Genre • Language • Location • Name • Physical Description • Rights Holder • Subject • Title Information
Selected metadata schemas and standards • MODS – Metadata Object Description Schema • METS – Metadata Encoding and Transmission Standard • DDI – Data Documentation Initiative • Dublin Core
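To show what a schema-based record looks like in practice, here is a minimal sketch that writes a few Dublin Core elements as XML with Python's standard library. The dataset described is hypothetical, and the plain `metadata` wrapper element is an assumption; repositories usually prescribe their own container format.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def dublin_core_record(elements):
    """Build an XML record from a mapping of Dublin Core element names to values."""
    root = ET.Element("metadata")  # wrapper name is an assumption
    for name, value in elements.items():
        child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        child.text = value
    return root


# Hypothetical dataset description.
record = dublin_core_record(
    {
        "title": "Lipid analysis rates, 2011",
        "creator": "Example Lab",
        "date": "2011-01-28",
        "format": "text/csv",
        "language": "en",
        "rights": "Contact the data producer for terms of use",
    }
)

xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

Because the field names come from a shared schema, another system can read `dc:title` or `dc:creator` without knowing anything else about the project.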
MODS Record from RUCore • RUCore, the Rutgers University Libraries’ repository, includes digital dissertations and theses. • For an example of a dissertation record using the MODS schema, see http://mss3.libraries.rutgers.edu/dlr/output.php?ds=Full&type=FULL&demono=rutgers-lib:26496
Metadata for data file [figure: metadata record, with metadata field names in bold type]
Storage • Consider long-term access when it comes to file formats: • Think about migrating your data to one of the preferred formats, in addition to keeping a copy in the original format
Backup, backup, backup! • Keep at least 3 copies of your data • LOCKSS (Lots Of Copies Keeps Stuff Safe) • Locally (e.g., hard drive) • Portably (e.g., USB flash drive, external hard drive, CD/DVD) • Remotely (e.g., network drive, remote server, in the “cloud”) • Consider: • “full” vs. “incremental” backup • manual vs. automatic backup
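The "at least 3 copies" rule can be automated. The sketch below copies a master file to two further locations and confirms each copy with a SHA-256 checksum. It runs inside a temporary directory so it is safe to try; the folder names `local`, `portable`, and `remote` are stand-ins for a hard drive, an external drive, and a network or cloud location.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256(path):
    """Return the SHA-256 checksum of a file as a hex string."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def back_up(source, destinations):
    """Copy source into each destination folder and verify every copy."""
    expected = sha256(source)
    copies = []
    for folder in destinations:
        folder = Path(folder)
        folder.mkdir(parents=True, exist_ok=True)
        target = folder / Path(source).name
        shutil.copy2(source, target)  # copy2 also keeps file timestamps
        if sha256(target) != expected:
            raise OSError(f"Checksum mismatch for {target}")
        copies.append(target)
    return copies


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    master = root / "local" / "survey_V01.csv"  # hypothetical data file
    master.parent.mkdir()
    master.write_text("id,age\n1,34\n")

    copies = back_up(master, [root / "portable", root / "remote"])
    total_copies = 1 + len(copies)
    all_match = all(sha256(copy) == sha256(master) for copy in copies)

print(total_copies, all_match)
```

This is a manual, full backup in the slide's terms; scheduling the script would make it automatic, and copying only changed files would make it incremental.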
Security • Have sensitive (private/confidential) data? • Password-protection • Encryption • Physical security • Get up-to-date anti-virus software • Storage + backup? • Store on computer not connected to network • Use external hard drive, then store it in a locked safe overnight • How will you dispose of sensitive data?
Preservation • Regularly check storage media • make sure it still works & is not failing • Periodically “refresh” data • copy it to a new disk/USB flash drive • Copy & migrate data to newer media every 2-5 years
Archive • Store & preserve your data in a repository • Disciplinary data center advantages • Specific subject data needs are considered • Easier for researchers in your field to discover your data • Institutional repository advantages • Guaranteed institutional support • Local support is available • All your data in one place
RUresearch Data Portal • Part of Rutgers University’s RUcore • Uses “industrial best practices” for digital file preservation: • Standardized metadata to fill in • Multiple backups • Continuous file integrity checks • Persistent identifiers • Support for multiple file formats • Non-exclusive accessibility + visibility! • Support from local librarians • RUresearch Team Dana Data Team
Additional Tips • Include costs for data storage, backup, security, & preservation in grant proposals • Develop policies & procedures for each • Who, what, when, where, how • Regularly restore data files from backups & check that you can read them • Keep the original master copy • Uncompressed • Original file format
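One way to check restored files is to record a checksum for every file when the backup is made and compare against it after a restore. The sketch below is a minimal version of that idea; the file names are hypothetical, and the "damage" is simulated inside a temporary directory.

```python
import hashlib
import tempfile
from pathlib import Path


def checksum(path):
    """Return the SHA-256 checksum of a file as a hex string."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def make_manifest(folder):
    """Record a checksum for every file in a folder."""
    return {
        p.name: checksum(p) for p in sorted(Path(folder).iterdir()) if p.is_file()
    }


def verify(folder, manifest):
    """Compare a folder against a manifest; return (file name, problem) pairs."""
    failures = []
    for name, expected in manifest.items():
        path = Path(folder) / name
        if not path.exists():
            failures.append((name, "missing"))
        elif checksum(path) != expected:
            failures.append((name, "changed"))
    return failures


with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "a.csv").write_text("id\n1\n")
    (folder / "b.csv").write_text("id\n2\n")

    manifest = make_manifest(folder)      # taken when the backup is made
    clean = verify(folder, manifest)      # nothing has changed yet

    (folder / "a.csv").write_text("id\n9\n")  # simulate silent corruption
    (folder / "b.csv").unlink()               # simulate a lost file
    damaged = verify(folder, manifest)

print(clean)
print(damaged)
```

A checksum match shows the bytes are intact; opening the file in its original software is still needed to confirm that the format remains readable.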
Manage Research Data 101 • http://libraries.mit.edu/guides/subjects/data-management • Put together by the MIT Libraries • “Data Management and Publishing” subject guide • Slides from a workshop designed for researchers: “Managing Research Data 101”