
Data Management Workshop

Data Management Workshop. Minglu Wang ( minglu@rutgers.edu ) Bonnie L. Fong ( bonnie.fong@rutgers.edu ) Ann Watkins ( ann.watkins@rutgers.edu ) . Data Management & Libraries. How Libraries Manage Data. We’re “in the business” of information, including data…



Presentation Transcript


  1. Data Management Workshop Minglu Wang (minglu@rutgers.edu) Bonnie L. Fong (bonnie.fong@rutgers.edu) Ann Watkins (ann.watkins@rutgers.edu)

  2. Data Management & Libraries

  3. How Libraries Manage Data • We’re “in the business” of information, including data… • @ the Rutgers University Libraries, we: • Collect & manage millions of books, thousands of periodicals, hundreds of databases + other online resources • Ensure easy access through good organization, through controlled vocabulary (≈ metadata) • Preserve materials

  4. Data Management: Why and How?

  5. New Trends of Doing Research • Data intensive and broad collaboration • E-Science • Sharing research data, procedures, and results • Networked/Open Science

  6. Research Funders’ DM & DS Requirement • NSF Data Management & Data Sharing requirement “Proposals must include a supplementary document of no more than two pages labeled ‘Data Management Plan’. This supplement should describe how the proposal will conform to NSF policy on the dissemination and sharing of research results.” “Investigators are expected to share with other researchers, at no more than incremental cost and within a reasonable time, the primary data, samples, physical collections and other supporting materials created or gathered in the course of work under NSF grants.”

  7. Pay Attention to Requirements by Directorate, Office, Division, Program, or other NSF Unit • Biological Sciences Directorate (BIO) • Directorate-wide Guidance • Computer & Information Sciences & Engineering (CISE) • Directorate-wide Guidance • Education & Human Resources Directorate (EHR) • Directorate-wide Guidance • Engineering Directorate (ENG) • Directorate-wide Guidance • Geosciences Directorate (GEO) • Directorate-wide Guidance • Mathematical and Physical Sciences Directorate (MPS) • Division of Astronomical Sciences • Division of Chemistry • Division of Materials Research • Division of Mathematical Sciences • Division of Physics • Social, Behavioral and Economic Sciences Directorate (SBE) • Directorate-wide Guidance

  8. Many Benefits of Well-Planned DM Practice • For your own project • Efficient and cost-effective • Avoiding disasters/loss • Habits of being accountable as a researcher • For the research community • Enable data re-use, meta-analysis, and new discoveries • Facilitate collaboration • Increase research impact

  9. What to plan? • The NSF DMP requires planning information on: • Data under the context of your research • the types of data, samples, physical collections, software, curriculum materials, and other materials to be produced in the course of the project; • Data and metadata formats and documentation • the standards to be used for data and metadata format and content (where existing standards are absent or deemed inadequate, this should be documented along with any proposed solutions or remedies); • Data storage, backup, and access • policies for access and sharing including provisions for appropriate protection of privacy, confidentiality, security, intellectual property, or other rights or requirements; • Data sharing and long-term preservation • policies and provisions for re-use, re-distribution, and the production of derivatives; and • plans for archiving data, samples, and other research products, and for preservation of access to them.

  10. What to Document? • Project • Data Collection • Questionnaire and Variables Construction • Data Integrity • Datasets (naming, structure, version, changes) • Analysis Process and workflow • ……

  11. Data Life Cycle and Data Management Tasks - I • Proposal Development and Data Management Plans • Contact archive for advice • Create data management plan to ensure long-term availability of data resources • Project Start-up • Make decisions about documentation form and content • Conduct pretests and pilot tests of materials and methods

  12. Data Life Cycle and Data Management Tasks - II • Data Collection and File Creation • Follow best practice • For data, address dataset integrity, variable names, labels, and groups; coding; missing data • For documentation, explore use of metadata standard; include all relevant documentation elements; document constructed variables • Data Analysis • Manage master datasets and work files • Set up appropriate file structures • Back up data and documentation

  13. Data Life Cycle and Data Management Tasks - III • Preparing Data for Sharing • Address disclosure risk limitation • Determine file formats to deposit • Contact archive for advice • Depositing Data • Complete relevant forms • Comply with dissemination standards and formats (Source: Guide to Social Science Data Preparation and Archiving, developed by ICPSR)

  14. General Principles • Start considering data management early • Always include a Data Management Plan (DMP) in the Research Proposal • Data management is an integral part of the complete research cycle • Your DMP needs to be implemented and reviewed continuously • Every stage of research involves data management issues • Adopt standards for data coding, file formatting, and metadata construction

  15. Managerial Strategies • Develop policies, training, manuals, templates, consistent oversight, and regular communication, to prevent unreliable research data collection and analysis • Study institutional and funders’ data policies • Establish data collection and recording procedures BEFORE data collection begins • Plan ahead for possible data loss / restoration situations (Source: Research Data Management Online Workshop, developed by UMN)

  16. Ensuring Reproducibility through Documentation • Every stage of data life cycle: raw data, processed data, data for analysis, visualizations, data for archive… • Data analysis (software, programming codes, models) • Process/workflow (process metadata) • Including Data provenance; Analyses and parameters used; Connections between analyses via inputs and outputs • Informal: using flowcharts, commented scripts • Formal: using software e.g. Kepler, VisTrails
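As an informal illustration of process metadata, the sketch below logs each analysis step with its inputs, outputs, and parameters, then traces which steps feed a given result. The function names, file names, and parameters are hypothetical; this is not how Kepler or VisTrails record workflows.

```python
import json
from datetime import datetime, timezone


def record_step(log, name, inputs, outputs, parameters):
    """Append one analysis step (process metadata) to an in-memory log."""
    log.append({
        "step": name,
        "inputs": list(inputs),
        "outputs": list(outputs),
        "parameters": dict(parameters),
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    })
    return log


def upstream_steps(log, target_file):
    """Return the names of the steps that lead to target_file, in run order."""
    needed = {target_file}
    chain = []
    # Walk backwards so each step's inputs become the next files to explain.
    for entry in reversed(log):
        if needed & set(entry["outputs"]):
            chain.append(entry["step"])
            needed |= set(entry["inputs"])
    return list(reversed(chain))


# Hypothetical two-step workflow: clean the raw data, then fit a model.
log = []
record_step(log, "clean", ["raw_survey.csv"], ["clean_survey.csv"],
            {"drop_incomplete": True})
record_step(log, "model", ["clean_survey.csv"], ["model_results.json"],
            {"method": "ols"})

print(json.dumps(log, indent=2))
print(upstream_steps(log, "model_results.json"))
```

Saving such a log next to the results gives a later reader the connections between analyses without having to reverse-engineer the scripts.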

  17. Check and Document Data Reliability and Data Integrity • Double-checking variable names, labels, and coding accuracy and consistency across time and files • Missing values are important • Checking data completeness • Detect errors by running frequencies, means, ranges, cross tabulations, time series plots… • Peer review
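The checks listed above (frequencies, means, ranges, missing values) can be sketched with the Python standard library alone. The missing-value codes, column names, and sample rows here are assumptions made up for illustration; a real project would take them from its own codebook.

```python
from collections import Counter

# Hypothetical missing-value codes; take the real ones from your codebook.
MISSING_CODES = {"", "NA", "-9"}


def summarize_column(rows, column):
    """Count missing values and frequencies; add min/max/mean if numeric."""
    values = [row.get(column, "") for row in rows]
    present = [v for v in values if v not in MISSING_CODES]
    summary = {
        "n": len(values),
        "missing": len(values) - len(present),
        "frequencies": Counter(present),
    }
    try:
        numbers = [float(v) for v in present]
    except ValueError:
        numbers = []  # not a numeric column, so skip range statistics
    if numbers:
        summary["min"] = min(numbers)
        summary["max"] = max(numbers)
        summary["mean"] = sum(numbers) / len(numbers)
    return summary


def out_of_range(rows, column, low, high):
    """Return the indexes of rows whose value is non-numeric or outside [low, high]."""
    flagged = []
    for index, row in enumerate(rows):
        value = row.get(column, "")
        if value in MISSING_CODES:
            continue
        try:
            number = float(value)
        except ValueError:
            flagged.append(index)
            continue
        if not low <= number <= high:
            flagged.append(index)
    return flagged


# Made-up rows: one coded-missing age, one implausible age, one blank sex.
rows = [
    {"id": "1", "age": "34", "sex": "F"},
    {"id": "2", "age": "-9", "sex": "M"},
    {"id": "3", "age": "340", "sex": "F"},
    {"id": "4", "age": "29", "sex": ""},
]

print(summarize_column(rows, "age"))
print(summarize_column(rows, "sex"))
print(out_of_range(rows, "age", 0, 120))
```

Rows loaded with `csv.DictReader` have the same shape as the `rows` list, so the same functions apply to a real file.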

  18. Data Documentation and Metadata

  19. Documentation as a foundation for: • access • retrieval • sharing • archiving • preservation

  20. Documentation includes information about all aspects of data • Context for the study • Data collection methods, tools and instruments • Coding information including variable names and any missing data • Quality control measures used to validate, check, clean and proof data • Changes made to the data after collection • Conditions for access and use of the data • Data file structure and relationships between the data files

  21. File name • Primary identifier for data files • Each file name is unique • Means to classify or sort the files • Facilitates locating by researcher and others • Structure makes different versions easily identifiable

  22. Elements in a file name • Project acronyms or project number • Researcher’s initials or team name • Data file type • Data file creation date • Version number • File status information

  23. Best practices for file names • Make the name meaningful but brief – usually no more than 25 characters. • Sufficient content description for a researcher to decide about file’s usefulness without opening the file. • Use a generic data file name to eliminate problems if moved from one location to another. • Use underscores instead of spaces. • Avoid punctuation or special characters. • Apply the naming system whenever a data file is created. • Use the naming system consistently during the project. • Use appropriate scalability.
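A naming system is easier to apply consistently when it is generated and checked by a small script. The sketch below assumes one possible convention (project, initials, creation date, version, optional status, joined by underscores) and treats letters, digits, underscores, and hyphens as the only allowed characters. Both choices are assumptions for illustration, not rules from the workshop.

```python
import re
from datetime import date

MAX_LENGTH = 25  # "usually no more than 25 characters"
ALLOWED = re.compile(r"^[A-Za-z0-9_-]+$")  # assumed set of safe characters


def build_name(project, initials, created, version, status=None):
    """Join the file-name elements with underscores, without an extension."""
    parts = [project, initials, created.strftime("%Y%m%d"), f"v{version:02d}"]
    if status:
        parts.append(status)
    return "_".join(parts)


def check_name(stem):
    """Return a list of problems found in a file name (empty list if none)."""
    problems = []
    if len(stem) > MAX_LENGTH:
        problems.append(f"longer than {MAX_LENGTH} characters")
    if " " in stem:
        problems.append("contains spaces")
    if not ALLOWED.match(stem):
        problems.append("contains punctuation or special characters")
    return problems


# Hypothetical project "LIP", researcher "AB", second version, still a draft.
name = build_name("LIP", "AB", date(2014, 2, 1), 2, "draft")
print(name, check_name(name))
print(check_name("my data.final"))
```

Writing the date as year, month, day keeps files in chronological order when sorted by name.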

  24. Versioning • When files contain closely related content • When files are used by one or more researchers in different locations • When files require synchronization to prevent parallel development of different versions

  25. Decisions about versioning • How many versions to keep? When to remove? • How to organize versions? • How to record a version and its status—in numbers or text? • How to record changes made in the new version? • How to track locations of files? • When to synchronize and what process will be used? • What is the best single location for master version storage?

  26. Examples of file names • Pattern: [document name] [version #] [status: draft/final] • Fong_interview_February_2014_V1.2_draft • Lipid-analysis-rate-V2_definitive • 2011_01_28_1LB_CS3_V6_AB_edited

  27. Maintaining control of file versions • Include version number or description in file • Maintain a file history with versions, dates, authors, and any changes to the file • Take advantage of version control capabilities within the software you’re using • Use versioning software such as Subversion (SVN) • Use file sharing services such as Dropbox, Google Docs, Amazon S3 • Determine who can edit files • Merge entries or edits manually to minimize problems
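Where no version control system is in use, a file history can be as simple as a CSV log kept beside the data. The sketch below is one possible layout: the column names, file names, and author initials are hypothetical, and the demonstration writes to a temporary folder so that it leaves nothing behind.

```python
import csv
import os
import tempfile
from datetime import date

FIELDS = ["file", "version", "date", "author", "change"]


def append_history(history_path, file_name, version, author, change, on=None):
    """Add one row to the history log, writing the header if the log is new."""
    is_new = (not os.path.exists(history_path)
              or os.path.getsize(history_path) == 0)
    with open(history_path, "a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow({
            "file": file_name,
            "version": version,
            "date": (on or date.today()).isoformat(),
            "author": author,
            "change": change,
        })


def read_history(history_path):
    """Return every history row as a dictionary, oldest first."""
    with open(history_path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def latest_version(history_path, file_name):
    """Return the most recently logged version of file_name, or None."""
    versions = [row["version"] for row in read_history(history_path)
                if row["file"] == file_name]
    return versions[-1] if versions else None


with tempfile.TemporaryDirectory() as folder:
    history_path = os.path.join(folder, "file_history.csv")
    append_history(history_path, "lipid_rates.csv", "1.0", "AB",
                   "Initial export", on=date(2014, 2, 1))
    append_history(history_path, "lipid_rates.csv", "1.1", "CD",
                   "Recoded missing values", on=date(2014, 2, 10))
    history = read_history(history_path)
    latest = latest_version(history_path, "lipid_rates.csv")

print(latest)
for row in history:
    print(row)
```

A tool such as Subversion records the same facts automatically; the manual log is only a fallback for files kept outside such a system.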

  28. Organization Methods • Hierarchical—moving from broad to specific • Facilitates retrieval by location • Example: Folder > Subfolder > File

  29. Organization methods • Tagging • Indicate many relationships, placing data file in multiple locations • Requires additional software—Tabbles, TaggedFrog, TaggTool, Gmail (tagging email messages rather than using folders) • Facilitates retrieval by searching • Used to supplement hierarchical method
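To show how tagging supplements a folder hierarchy, the sketch below keeps a small in-memory index from tags to file paths and retrieves files by searching on one or more tags. The class, tags, and paths are hypothetical and unrelated to how Tabbles, TaggedFrog, or TaggTool work internally.

```python
from collections import defaultdict


class TagIndex:
    """Map tags to file paths so that one file can sit under many tags."""

    def __init__(self):
        self._files_by_tag = defaultdict(set)

    def tag(self, path, *tags):
        """Attach one or more tags (case-insensitive) to a file path."""
        for label in tags:
            self._files_by_tag[label.lower()].add(path)

    def find(self, *tags):
        """Return the sorted paths that carry every one of the given tags."""
        if not tags:
            return []
        matches = [self._files_by_tag.get(label.lower(), set())
                   for label in tags]
        return sorted(set.intersection(*matches))


# Files stay in their folders; the tags add cross-cutting relationships.
index = TagIndex()
index.tag("interviews/2014/fong_01.txt", "interview", "draft")
index.tag("analysis/fong_codes.csv", "interview", "final")
index.tag("analysis/lipid_v2.csv", "lipid", "final")

print(index.find("interview"))
print(index.find("interview", "final"))
```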

  30. Storage locations for documentation • Separate files • Lab notebooks • Data collection user’s guides • Code books • Data dictionaries

  31. NISO’s definition of metadata • “metadata is the structured information that describes, explains, locates or otherwise makes it easier to retrieve, use or manage an information resource.”

  32. Purpose of metadata • Provides organization • Enhances discovery through searching • Serves as a unique identifier • Supports archiving and preservation • Provides interoperability so searching can take place across different organizational structures

  33. Creating metadata schemas to provide structure • Experts in classification and description identify unique characteristics of item format • They find mutually acceptable terms for field names • They define the content of the fields • Schemas shared with others to foster common use

  34. Metadata for an individual research file

  35. Metadata for books

  36. Metadata for journal articles

  37. Metadata fields for digital items • Abstract • Classification • Genre • Language • Location • Name • Physical Description • Rights Holder • Subject • Title Information

  38. Selected metadata schemas and standards • MODS – Metadata Object Description Schema • METS – Metadata Encoding and Transmission Standard • DDI – Data Documentation Initiative • Dublin Core
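As a concrete example of one of these schemas, the sketch below builds a simple Dublin Core description using Python's standard XML library. The element names (title, creator, subject, and so on) and the namespace come from the Dublin Core element set. The wrapping `record` element and every field value are assumptions for illustration, since a real repository defines its own container and required fields.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def dublin_core_record(fields):
    """Build an XML element holding one dc:* child per field value."""
    root = ET.Element("record")  # hypothetical wrapper element
    for name, values in fields.items():
        if isinstance(values, str):
            values = [values]  # allow a single string or a list of strings
        for value in values:
            child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
            child.text = value
    return root


# Made-up description of a data file.
record = dublin_core_record({
    "title": "Example interview transcripts",
    "creator": "Example Researcher",
    "subject": ["Data management", "Interviews"],
    "date": "2014-02-01",
    "language": "en",
    "format": "text/plain",
    "rights": "Contact the creator for conditions of use",
})

xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

Repeating an element, as with `subject` here, is how simple Dublin Core records several values for one field.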

  39. MODS Record from RUCore RUCore, the Rutgers University Libraries’ repository, includes digital dissertations and theses. For an example of a dissertation record using the MODS schema, see http://mss3.libraries.rutgers.edu/dlr/output.php?ds=Full&type=FULL&demono=rutgers-lib:26496

  40. Metadata for data file (metadata field names in bold type)

  41. Data Storage and Preservation

  42. Storage • Consider long-term access when it comes to file formats: • Think about migrating your data to one of the preferred formats, in addition to keeping a copy in the original format

  43. Backup, backup, backup! • Keep at least 3 copies of your data • LOCKSS (Lots Of Copies Keeps Stuff Safe) • Locally (e.g., hard drive) • Portably (e.g., USB flash drive, external hard drive, CD/DVD) • Remotely (e.g., network drive, remote server, in the “cloud”) • Consider: • “full” vs. “incremental” backup • manual vs. automatic backup
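A manual, full backup to several locations can be sketched as below: the script copies a master file into each destination folder and compares SHA-256 checksums to confirm every copy matches. The folder names `local`, `portable`, and `remote` are stand-ins created inside a temporary directory; in practice they would be paths on a hard drive, an external drive, and a network or cloud mount.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 checksum of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def back_up(source, destinations):
    """Copy source into each destination folder and verify every copy."""
    source = Path(source)
    expected = sha256_of(source)
    copies = []
    for folder in destinations:
        folder = Path(folder)
        folder.mkdir(parents=True, exist_ok=True)
        target = folder / source.name
        shutil.copy2(source, target)  # copy2 also keeps file timestamps
        if sha256_of(target) != expected:
            raise OSError(f"checksum mismatch for {target}")
        copies.append(target)
    return copies


with tempfile.TemporaryDirectory() as tmp:
    workspace = Path(tmp)
    master = workspace / "survey_v01.csv"
    master.write_text("id,age\n1,34\n", encoding="utf-8")
    targets = [workspace / "local", workspace / "portable", workspace / "remote"]
    copies = back_up(master, targets)
    verified = [sha256_of(copy) == sha256_of(master) for copy in copies]
    copy_locations = [copy.parent.name for copy in copies]

print(copy_locations, verified)
```

This copies the whole file each time; an incremental backup would copy only what changed since the previous run and is better left to dedicated backup software.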

  44. Security • Have sensitive (private/confidential) data? • Password-protection • Encryption • Physical security • Get up-to-date anti-virus software • Storage + backup? • Store on computer not connected to network • Use external hard drive, then store it in a locked safe overnight • How will you dispose of sensitive data?

  45. Preservation • Regularly check storage media • make sure it still works & is not failing • Periodically “refresh” data • copy it to a new disk/USB flash drive • Copy & migrate data to newer media every 2-5 years
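One way to check that stored files are still intact is to record a checksum for each file once, then recompute and compare on a schedule. The sketch below builds such a manifest for a folder and reports files that have gone missing or changed. The file names are made up, and the demonstration deliberately alters one file and deletes another to show what the report looks like.

```python
import hashlib
import tempfile
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 checksum of a (small) file."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def build_manifest(folder):
    """Map each file's path, relative to folder, to its checksum."""
    folder = Path(folder)
    return {
        path.relative_to(folder).as_posix(): file_digest(path)
        for path in sorted(folder.rglob("*"))
        if path.is_file()
    }


def verify(folder, manifest):
    """Compare the folder against a manifest recorded earlier."""
    folder = Path(folder)
    report = {"missing": [], "changed": []}
    for relative, expected in manifest.items():
        target = folder / relative
        if not target.is_file():
            report["missing"].append(relative)
        elif file_digest(target) != expected:
            report["changed"].append(relative)
    return report


with tempfile.TemporaryDirectory() as tmp:
    archive = Path(tmp)
    (archive / "a.csv").write_text("id,age\n1,34\n", encoding="utf-8")
    (archive / "b.csv").write_text("id,sex\n1,F\n", encoding="utf-8")
    manifest = build_manifest(archive)

    # Simulate damage: one file silently altered, one file lost.
    (archive / "a.csv").write_text("id,age\n1,35\n", encoding="utf-8")
    (archive / "b.csv").unlink()
    report = verify(archive, manifest)

print(report)
```

Storing the manifest apart from the data it describes means a failing disk cannot damage both at once. The same check also confirms that files restored from a backup are readable and unchanged.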

  46. Archive • Store & preserve your data in a repository • Disciplinary data centers advantages • Specific subject data needs are considered • Easier for researchers in your field to discover your data • Institutional repository advantages • Guaranteed institutional support • Local support is available • All your data in one place

  47. RUresearch Data Portal • Part of Rutgers University’s RUcore • Uses “industrial best practices” for digital file preservation: • Standardized metadata to fill in • Multiple backups • Continuous file integrity checks • Persistent identifiers • Support for multiple file formats • Non-exclusive  accessibility + visibility! • Support from local librarians • RUresearch Team  Dana Data Team

  48. Additional Tips • Include costs for data storage, backup, security, & preservation in grant proposals • Develop policies & procedures for each • Who, what, when, where, how • Regularly restore data files from backups & check that you can read them • Keep the original master copy • Uncompressed • Original file format

  49. Additional Data Management Training Opportunities

  50. Manage Research Data 101 • http://libraries.mit.edu/guides/subjects/data-management • Put together by the MIT Libraries • “Data Management and Publishing” subject guide • Slides from a workshop designed for researchers: “Managing Research Data 101”
