
GRAD 521, Research Data Management Winter 2014 – Lecture 15 Amanda L. Whitmire, Asst. Professor



Presentation Transcript


  1. Plan for Archiving & Preservation of Data GRAD 521, Research Data Management Winter 2014 – Lecture 15 Amanda L. Whitmire, Asst. Professor

  2. Logistics Heads up/reminder on the final: data management plan Survey responses: thank you!

  3. Today’s lesson • Basic archival processes: data selection, format migration, checksums, auditing, etc. • Address the need for conversion to standard formats needed for re-use • Options for a long-term sustainable preservation strategy/policy for your data • Costs & timelines for data storage, management tools and services

  4. vs.

  5. Archive-stage actions • Data selection or appraisal • Format selection • Perform checksums • Select archive location • Periodic file- and bit-level audits

  6. 1. Data appraisal • “… the process of distinguishing records of continuing value from those of no further value so that the latter may be eliminated.” • The National Archives (UK)

  7. Appraisal roles & responsibilities

  8. Appraisal criteria • Relevance to mission • Historical value • Uniqueness • Potential for redistribution • Non-replicability • Economic case • Full documentation For a full discussion of the appraisal process, see this guide: Whyte, A. & Wilson, A. (2010). "How to Appraise and Select Research Data for Curation". DCC How-to Guides. Edinburgh: Digital Curation Centre. http://www.dcc.ac.uk/resources/how-guides

  9. 2. Format selection • Ideal: non-proprietary or open formats • For more info on data formats: http://guides.library.oregonstate.edu/data-management-types-formats

  10. Archive-stage actions • Data selection or appraisal • Format selection • Perform checksums • Select archive location • Periodic file- and bit-level audits

  11. 3. Checksums Checksums provide a way to: • ensure the integrity of your data • create a comprehensive list of your files

  12. Data integrity What is an MD5 checksum? • It is like a fingerprint of a file • It is used to verify whether two files are identical Each time you run a checksum: • a fixed-length string (32 hexadecimal characters for MD5) is created for each file • if even 1 byte of data has been altered or corrupted, that string will change • if the checksums match, the data has not been altered
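The fingerprint idea can be tried without any special tool. The following is a minimal sketch using Python's standard `hashlib` module; the function names and the sample text are illustrative assumptions, not part of the lecture. It shows that removing a single character yields a different MD5 string.

```python
import hashlib


def md5_of_bytes(data: bytes) -> str:
    """Return the MD5 digest of a byte string as uppercase hex."""
    return hashlib.md5(data).hexdigest().upper()


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the MD5 digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest().upper()


# Hypothetical sample content: the second version has the final period deleted.
original = b"Research data management."
altered = b"Research data management"

print(md5_of_bytes(original))
print(md5_of_bytes(altered))
print(md5_of_bytes(original) == md5_of_bytes(altered))  # False: the fingerprints differ
```

MD5 is adequate for spotting accidental corruption, which is the use described here; it is not considered safe against deliberate tampering, where a hash such as SHA-256 (`hashlib.sha256`) is the usual substitute.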

  13. Checksums Here is an example data collection: Folder: C:\ … \datamanagementstuff

  14. Checksums Here is a MS Word document in that folder:

  15. FastSum • FastSum is a free MD5 checksum tool for Windows available at http://www.fastsum.com/ • 1. Download and install the trial version • 2. Run the program

  16. Creating a checksum The wizard has created a list of ‘Checksum\State’ in FastSum. It has also created a text file in the \datamanagementstuff folder

  17. Creating a checksum Open up the text file and this is what you find: *a checksum string and a list of file names*
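For those not on Windows, a similar checksum-plus-file-list text file can be produced with a short script. This is a sketch only: the `build_manifest` and `write_manifest` names and the `CHECKSUM *filename` line layout are assumptions chosen for illustration, not FastSum's actual file format.

```python
import hashlib
from pathlib import Path
from typing import Dict


def build_manifest(folder: str) -> Dict[str, str]:
    """Map each file's path (relative to folder) to its uppercase MD5 checksum."""
    root = Path(folder)
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            # read_bytes() loads the whole file; use chunked reads for very large files.
            digest = hashlib.md5(path.read_bytes()).hexdigest().upper()
            manifest[path.relative_to(root).as_posix()] = digest
    return manifest


def write_manifest(manifest: Dict[str, str], out_path: str) -> None:
    """Write one 'CHECKSUM *relative/path' line per file (an assumed layout)."""
    lines = [f"{digest} *{name}" for name, digest in sorted(manifest.items())]
    Path(out_path).write_text("\n".join(lines) + "\n", encoding="utf-8")
```

A typical call would be `write_manifest(build_manifest("datamanagementstuff"), "checksums.txt")`, with both paths hypothetical. Saving the text file outside the data folder keeps the manifest itself from being listed on the next run.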

  18. Using a checksum • In this example: • Reopened the Word document from earlier • Deleted a period, saved, and closed the document • When you run the checksum wizard again, the value for the ‘Datamanagement.doc’ file should change

  19. Comparing checksums • Before … … After

  20. Comparing checksums • Notice how the values for Datamanagement.doc have changed: • 0CA9E83E612447E793D4758BF7A5244D • 91BAE7EC0C642D967585D01DD6AA4096 • Values for the other files stay the same • Values stay the same across machines unless a file has changed
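The before/after comparison is also the core of a periodic audit, and it can be automated. The sketch below assumes each checksum run has been loaded into a dictionary of file name to checksum; the `compare_manifests` function and the `notes.txt` entry are hypothetical, while the two Datamanagement.doc values are the ones from the slide, taken in the order listed.

```python
from typing import Dict, List


def compare_manifests(before: Dict[str, str], after: Dict[str, str]) -> Dict[str, List[str]]:
    """Report files whose checksum changed, plus files removed or added between runs."""
    changed = sorted(n for n in before if n in after and before[n] != after[n])
    removed = sorted(n for n in before if n not in after)
    added = sorted(n for n in after if n not in before)
    return {"changed": changed, "removed": removed, "added": added}


before = {
    "Datamanagement.doc": "0CA9E83E612447E793D4758BF7A5244D",
    "notes.txt": "D41D8CD98F00B204E9800998ECF8427E",  # hypothetical unchanged file
}
after = {
    "Datamanagement.doc": "91BAE7EC0C642D967585D01DD6AA4096",
    "notes.txt": "D41D8CD98F00B204E9800998ECF8427E",
}

print(compare_manifests(before, after))
# {'changed': ['Datamanagement.doc'], 'removed': [], 'added': []}
```

In an archive that is not supposed to change, any entry under `changed` or `removed` is a signal to restore that file from another copy.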

  21. Creating a file list FastSum has created a list of all the files in the folder it was pointed toward:

  22. Archive-stage actions • Data selection or appraisal • Format selection • Perform checksums • Select archive location • Periodic file- and bit-level audits

  23. 4. Select archive location Considerations: • Costs • Size of dataset • Public vs. private access • Length of preservation • Hands-on vs. hands-off • Security of platform Locations: • Individual • Department/College • University-wide • Discipline-specific • 3rd-party Archive vs. sharing mechanism

  24. Archive-stage actions • Data selection or appraisal • Format selection • Perform checksums • Select archive location • Periodic file- and bit-level audits

  25. Data in Real Life Images courtesy of Heather Henkel • A design firm was handling their own backups. The system was working fine and the backup software was reporting that the data was successfully backed up.

  26. Data in Real Life CC Image courtesy of angielauw on Flickr • The administrator checked the backups immediately after they were done and confirmed they were good.

  27. Data in Real Life After a computer virus erased most of their files, they went back to their backups. Unfortunately they found that the backups were all blank and all of the data was gone. Only after some investigation did they discover that the computer tapes (which contained the backups) were placed against a wall that had an elevator on the other side of it. When the elevator went past, the magnets inside erased all of the tapes. Take home message: had they checked their backups again, they probably would have noticed this issue before there was an emergency & complete loss of files.

  28. Preservation strategy • Create an archive backup policy that clearly identifies: • roles • responsibilities • where the data is backed up • how often the files are backed up • how to access the files • recommended file formats to be used & • policies for migrating data to assure data are not lost due to media degradation or changing formats or programs • Review your backup policy & plan periodically to ensure it is still valid and applicable • Update contacts, if appropriate

  29. Best Practices • Minimize or remove reliance on users to perform manual backups (if possible) • Implement standardized and automatic backups • If possible, put experts in charge of this task (computer staff) as they are more likely to keep up-to-date regarding software updates, hardware issues, best practices, etc. • Don’t assume backups are being performed for you • You don’t want to find out after the fact that no backups have been performed • If you are using third-party software (like Yahoo or Google Mail), what happens if they lose your files?

  30. Example options for preservation

  31. A typical OSU researcher > 55% produce 100 GB or less per project

  32. Archive on your own • You buy & manage hardware, replication, backups and networking (if applicable, for offsite access) • OK for unrestricted, sensitive (FERPA), and protected data Costs (100 GB dataset): ranges, but generally cheap

  33. Archive w/ department IT • 30-day backup/recovery window for files on personal or departmental storage • RAID protected, backed up online storage • Accessible (to you) remotely (via VPN) Costs (100 GB dataset in COSINe): ($0/year * 4 GB) + ($60/100 GB/year) = $60/year (ongoing); $300 for 5 years

  34. Archive @ OSU w/ CN • Storage is in 2 separate data centers & backups retained for 3 months • Accessible (to you) remotely (via VPN) • OK for unrestricted, sensitive (FERPA), and protected data Costs (100 GB dataset): ($0/year * 5 GB) + ($4/GB/year * 95 GB) = $380/year (ongoing); $1,900 for 5 years
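The CN arithmetic above generalizes to other dataset sizes. The sketch below assumes the pricing works exactly as the slide's formula implies (a free allowance, then a flat per-GB yearly rate); the `annual_storage_cost` function is hypothetical, and the 2014 rates should be checked against current pricing before budgeting.

```python
def annual_storage_cost(size_gb: float, free_gb: float, rate_per_gb_year: float) -> float:
    """Yearly cost when the first free_gb are free and the rest is billed per GB."""
    billable_gb = max(size_gb - free_gb, 0)
    return billable_gb * rate_per_gb_year


# Figures from the slide: 100 GB dataset, 5 GB free, $4 per GB per year.
yearly = annual_storage_cost(100, 5, 4)
print(yearly)      # 380
print(yearly * 5)  # 1900 over a 5-year preservation period
```

Because the cost is ongoing, the number to put in a data management plan budget is the yearly figure multiplied by the intended length of preservation.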

  35. Archive in discipline-specific repository • Replicated, archive-quality storage • Data curation throughout ingest & archive period • Data in context with other datasets Costs: range

  36. 3rd party storage platforms Costs: range

  37. Bottom line • No “one-size-fits-all” approach • Balance costs, storage quality, access, degree of involvement, security, longevity, etc. • Plan ahead so you can budget appropriately
