UC3 Summer Webinar Series An Introduction to the Merritt Curation Repository University of California Curation Center Team California Digital Library June 9, 2011
First, a word about the webinar series… • A forum for timely topics of interest to the UC community • Highlighting projects, services, and developments in the areas of digital preservation, web archiving, and data curation • Intended to raise awareness of issues, and provide information on useful resources and services available to the UC community • 2nd and 4th Thursday of the month, and as scheduled, featuring UC3 staff and UC librarians, content managers, and technologists • Teleconference: +1 (866) 740-1260, access code 9879016# • Web conference: http://bit.ly/jdjMAP
First, a word about the webinar series… • Some logistics… • Participant phones will be muted during the formal presentation, but we will be monitoring the online chat • Slides, Q & A, and web and voice recordings will be posted after each presentation • Schedule available at http://www.cdlib.org/uc3/uc3webinars.html • Please suggest additional topics! uc3@ucop.edu • Take the short survey http://www.surveymonkey.com/s/XSGWP8R
Now on with the show… • Today’s topic is an introduction to the Merritt curation repository • Who is it for? • What can it do? • Why use it? • What does it cost? • Next steps? • Q & A
What keeps you up at night? • How much will it cost? • What’s the best strategy to ensure permanent availability? • How do I know my content is safe? • Are there standards or best practices I should be aware of? • How can I transfer my content to an appropriate curation environment? • I have a good discovery platform; how can I add preservation services? • Do I need to create new derivatives just for preservation purposes? • Can I control who can see my content? • How can I get a persistent reference to my content? • What if my content needs to evolve over time?
“There’s an app for that” • How much will it cost? → Storage at $1.04/GB/year • What’s the best strategy to ensure permanent availability? → Automatic replication and high-availability redundancy • How do I know my content is safe? → Periodic fixity audit • Are there standards or best practices I should be aware of? → UC3 consultation • How can I transfer my content to an appropriate curation environment? → Simple submission UI/API; METS “feeder” duplicates existing DPR workflow • I have a good discovery platform; how can I add preservation services? → Modular micro-services “toolkit” • Do I need to create new derivatives just for preservation purposes? → Model free: no packaging, format, or metadata requirements • Can I control who can see my content? → Curator-defined access control rules • How can I get a persistent reference to my content? → Integration with EZID and DataCite • What if my content needs to evolve over time? → Strongly versioned
Merritt repository • Merritt is available for use by all members of the UC community • Libraries/archives/museums • ORU/MRUs • Faculty/staff • Centrally hosted by UC3/CDL on behalf of the UC community • Economies of scale • Shared experience and expertise • Mediated through campus libraries
Modes of use: dark archive • Pro-active preservation, but no expectation of direct end user access • Legacy DPR content contributed by campus libraries • Cultural heritage texts, master images, sound, moving image, data sets • All DPR content will be automatically migrated to Merritt
Modes of use: bright archive • Provide preservation and end user access • NIH Healthy Pathways project on bio-demographics • Multi-institutional: UC Davis, University of Colorado, University of Virginia, Syddansk University (Denmark) • Need to restrict access to project partners initially, with eventual public access
Modes of use: bright archive • Content discovery: search
Modes of use: bright archive • Content discovery: browse
Modes of use: preservation “back end” • Preservation only; content discovery/delivery provided by well-known external systems • Using direct hooks into Merritt to retrieve content • eScholarship: open access publishing • Open Context: archaeological data publishing • Investigating integration with Islandora/Drupal and Alfresco
Modes of use: distributed data grids • DataONE “Enable new science and knowledge creation through universal access to data about life on earth and the environment that sustains it”
More information • Online help http://merritt.cdlib.org/help • FAQ http://merritt.cdlib.org/docs/merritt_handout.pdf • User’s guide http://merritt.cdlib.org/docs/merritt_user_guide.pdf • UC3 contact http://www.cdlib.org/uc3/contact.html uc3@ucop.edu
Merritt cost model • UC3 provides technical infrastructure, data center hosting, staff, monitoring, maintenance, enhancements, help, outreach, consultation, etc. • Contributors are charged only for storage used, at the UC3 recovery rate of $1.04/GB/year • Developing an “endowment” model: pay once, preserve forever • Will soon extend model for non-UC contributors • How does this compare? • Cost of a physical book in RLF: $4.62/year † • Cost of a digital book in HathiTrust: $0.15/year ‡ • Cost of a digital book in Merritt: $0.06/year • † Gary Lawrence (2007), Internal analysis, CDL • ‡ Paul Courant and Matthew Nielsen (2010), On the cost of keeping a book, HathiTrust
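The storage-only charge above reduces to simple arithmetic, sketched here for illustration. Only the $1.04/GB/year rate and the $0.06/year book figure come from the slide; the function names are hypothetical, and the per-book size is merely what those two numbers imply, not a figure stated by UC3.

```python
# Recovery rate quoted on the slide, in US dollars per gigabyte per year.
RATE_USD_PER_GB_YEAR = 1.04


def annual_cost(size_gb: float, rate: float = RATE_USD_PER_GB_YEAR) -> float:
    """Yearly storage charge for a collection of the given size."""
    return size_gb * rate


def implied_size_gb(cost_per_year: float, rate: float = RATE_USD_PER_GB_YEAR) -> float:
    """Size that a yearly charge corresponds to at the given rate."""
    return cost_per_year / rate


if __name__ == "__main__":
    # A hypothetical 100 GB collection.
    print(f"100 GB costs about ${annual_cost(100):.2f} per year")
    # Size implied by the slide's $0.06/year digital book (assuming 1 GB = 1000 MB).
    print(f"$0.06/year implies about {implied_size_gb(0.06) * 1000:.0f} MB per book")
```

For real planning, the cost calculator spreadsheet referenced in this presentation remains the authoritative tool; this sketch only shows the underlying multiplication.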
Average collection sizes and costs A “cost calculator” spreadsheet is available at http://www.cdlib.org/uc3/docs/Merritt-cost-calculator-v3.xlsx
Average ETD size and cost Based on 2009 holdings in ProQuest * UCSF based on total ETD holdings in Merritt
Average research data size and cost • Almost 50% of all research data is less than 1 GB Source: Science 331:6018 (February 11, 2011): 692-693 <DOI: 10.1126/science.331.6018.692>
Next steps • UC3 is working with campus partners to determine ongoing development and collection priorities New content acquisition
Next steps • In production: model-free objects; submission via UI and API; persistent identifiers; format identification; version provenance; automated replication; automated fixity audit; role-based access control; collections; semantic index and search; object/version/file download • In progress: simplified update; enhanced characterization (JHOVE2); faceted search and browse (XTF); CMS/DAMS-like function (Islandora) • In planning: simplified batch; UCTrust integration; linked data; transformation; notification; annotation; support for NGTS/DLSTF recommendations • We welcome your feedback on needs and priorities! • http://www.cdlib.org/uc3/contact.html • uc3@ucop.edu
Simplified update • Variant form of object update requiring the submission of only the changed components • Client-side tools to simplify the creation of batch manifests • Sample manifest (lines are cut off at the slide edge):
    #%checkm_0.7
    #%profile | http://uc3.cdlib.org/registry/ingest/mani
    #%prefix | mrt: | http://merritt.cdlib.org/terms#
    #%prefix | nfo: | http://www.semanticdesktop.org/onto
    #%fields | nfo:fileUrl | nfo:hashAlgorithm | nfo:hash
    http://merritt.cdlib.org/samples/goldenDragon.jpg | m
    http://merritt.cdlib.org/samples/tumbleBug.jpg | md5
    http://merritt.cdlib.org/samples/generalDrapery.jpg |
    http://merritt.cdlib.org/samples/generalDrapery.jpg |
    #%eof
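To make the idea of submitting only changed components concrete, here is a minimal client-side sketch. It is not a Merritt tool: the function names and the example.org URLs are hypothetical, and the profile and prefix directives are passed in by the caller because the slide shows them only in truncated form. It follows the field order shown in the sample manifest (file URL, hash algorithm, hash).

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Hex MD5 digest of a file's bytes."""
    return hashlib.md5(data).hexdigest()


def changed_components(previous: dict[str, str], current: dict[str, str]) -> list[str]:
    """URLs in `current` that are new or whose digest differs from `previous`."""
    return [url for url, digest in current.items() if previous.get(url) != digest]


def manifest_rows(digests: dict[str, str], urls: list[str]) -> list[str]:
    """One 'url | md5 | digest' row per selected URL."""
    return [f"{url} | md5 | {digests[url]}" for url in urls]


def build_manifest(rows: list[str], directives: list[str]) -> str:
    """Wrap rows in the header and footer lines seen in the sample manifest.

    `directives` holds the #%profile and #%prefix lines, supplied by the
    caller (assumption: the full values are known to the submitter).
    """
    fields = "#%fields | nfo:fileUrl | nfo:hashAlgorithm | nfo:hash"
    return "\n".join(["#%checkm_0.7", *directives, fields, *rows, "#%eof"])


if __name__ == "__main__":
    # Hypothetical object: one file unchanged, one edited, one added.
    old = {"http://example.org/a.jpg": md5_hex(b"a"),
           "http://example.org/b.jpg": md5_hex(b"b")}
    new = {"http://example.org/a.jpg": md5_hex(b"a"),
           "http://example.org/b.jpg": md5_hex(b"b, edited"),
           "http://example.org/c.jpg": md5_hex(b"c")}
    changed = changed_components(old, new)
    print(build_manifest(manifest_rows(new, changed), directives=[]))
```

Comparing digests rather than timestamps means a file counts as changed only when its content differs, which matches the fixity-oriented approach described elsewhere in the presentation.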
Enhanced characterization • JHOVE2 next-generation framework for format-aware characterization http://jhove2.org/ • Automated extraction and inference of extensive technical metadata significant for preservation analysis and planning • Sample output (excerpt):
    "Module": {
      "scope": "ICCModule",
      "Header": {
        "scope": "ICCHeader",
        "ProfileSize": { "unit": "byte", "value": 60960 }
        ,"ProfileVersionNumber": "4.2.0.0"
        ,"ProfileDeviceClass_raw": "spac"
        ,"ProfileDeviceClass_descriptive": "ColorSpace Conversion profile"
        ,"ColourSpace_raw": "RGB "
        ,"ColourSpace_descriptive": "rgbData"
        ,"ProfileConnectionSpace_raw": "Lab "
        ,"ProfileConnectionSpace_descriptive": "labData"
Enhanced discovery via XTF • eXtensible Text Framework http://xtf.cdlib.org/ • CDL developed/supported open source discovery platform • Robust, scalable faceted search and browse
CMS/DAMS-like function • Many campuses are looking for CMS/DAMS solutions • Investigating integration with Islandora to provide a Drupal CMS/DAMS front-end to Merritt http://islandora.ca/ http://drupal.org/
Upcoming webinars http://www.cdlib.org/uc3/uc3webinars.html • Please take the webinar survey http://www.surveymonkey.com/s/XSGWP8R
For more information • UC Curation Center http://www.cdlib.org/uc3 http://www.cdlib.org/uc3/contact.html uc3@ucop.edu • UC3 team: Stephen Abrams, Lisa Colvin, Patricia Cruse, Scott Fisher, Erik Hetzner, Greg Janée, John Kunze, Margaret Low, David Loy, Mark Reyes, Tracy Seneca, Joan Starr, Marisa Strong, Perry Willett • UC3 webinar series http://www.cdlib.org/uc3/uc3webinars.html • Merritt repository http://merritt.cdlib.org/ http://merritt.cdlib.org/help http://merritt.cdlib.org/docs/merritt_handout.pdf http://merritt.cdlib.org/docs/merritt_user_guide.pdf