Breakout Session One, Panel Five: Content Transfer
NDIIPP Partner’s Meeting, Arlington, 8-10 July 2008
Stephen Abrams, California Digital Library
Stephen.Abrams@ucop.edu
Topics
• Submission
• CDL Digital Preservation Repository (DPR): www.cdlib.org/inside/projects/preservation/dpr
• Dissemination
• Chronopolis: chronopolis.sdsc.edu
• Mass Transit: masstransit.sdsc.edu
• Library of Congress
• BagIt: www.cdlib.org/inside/diglib/bagit/bagitspec.html
Digital Preservation Repository (DPR)
• The unit of submission is the object, composed of a descriptive METS file and multiple content files
• Submission workflow
  • Initiated by client-side SOAP or REST client
  • Server-side validation
    • Package completeness
    • File-level data integrity and validation
    • Object-level conformance
  • Notification
• Averaging 600 KB/sec (per process)
Digital Preservation Repository (DPR)
• Direct submission/ingest
  • Cultural heritage and scientific content from 5 campuses: 2.5 TB via web services
• Indirect submission to staging area with internally-triggered ingest
  • Local history content from 48 academic and public libraries: 720 GB via HD, CD, DVD
  • Web harvested content: 40 TB (est.) via HTTP, 50 KB/sec with 4-second “politeness” policy
  • Google and OCA mass digitization content: 150 – 200 TB (est.) via HTTP, 3.8 MB/sec with 0.5% failure rate
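To put the volumes and rates above in perspective, here is a rough back-of-the-envelope sketch of how long each transfer would take at the quoted sustained rate. It assumes decimal units (1 TB = 10^12 bytes) and a single stream running at that rate around the clock; neither assumption is stated on the slide, and the function name is illustrative only.

```python
# Rough transfer-time estimates from the figures quoted on the slide.
# Assumptions (not from the source): decimal units, one continuous stream.
KB, MB, TB = 10**3, 10**6, 10**12
SECONDS_PER_DAY = 86_400


def transfer_days(size_bytes: float, rate_bytes_per_sec: float) -> float:
    """Days needed to move size_bytes at a sustained rate."""
    return size_bytes / rate_bytes_per_sec / SECONDS_PER_DAY


# Web harvested content: 40 TB (est.) at 50 KB/sec
web_harvest_days = transfer_days(40 * TB, 50 * KB)

# Mass digitization content: 150 - 200 TB (est.) at 3.8 MB/sec
mass_digitization_days_low = transfer_days(150 * TB, 3.8 * MB)
mass_digitization_days_high = transfer_days(200 * TB, 3.8 * MB)

print(f"web harvest: {web_harvest_days:,.0f} days")
print(f"mass digitization: {mass_digitization_days_low:,.0f}"
      f" to {mass_digitization_days_high:,.0f} days")
```

Under these assumptions a single 50 KB/sec stream would need decades for 40 TB, and even 3.8 MB/sec needs well over a year for 150 TB, which is why parallelism matters later in the talk.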
Chronopolis
• Cross-domain collection sharing for long-term preservation
• Data replication via SRB over a three-node federated data grid
• Project partners: UCSD/SDSC, NCAR, UMIACS
• Data providers: CDL, ICPSR
CDL web content
• Stanford WebBase – 5 collections: 14,108 GB
  • Federal government, 2004 – 2008: 9,123 GB
  • State government, 2005 – 2007: 1,742 GB
  • County government, 2005 – 2007: 743 GB
  • City government, 2005 – 2007: 1,531 GB
  • Hurricane Rita / Katrina, 2005: 969 GB
CDL web content
• Web-at-Risk – 20 collections: 1,452 GB
  • Myanmar cyclone, 2008: 3 GB
  • Santa Cruz wildfires, 2008: 4 GB
  • Southern California wildfires, 2007: 78 GB
  • Grand jury reports, 2008: 1 GB
  • California political parties, 2007: 3 GB
  • AFL-CIO, 2007: 1 GB
  • Progressive politics, 2007 – 2008: 192 GB
  • Middle Eastern politics, 2007 – 2008: 58 GB
  • University of California, 2007 – 2008: 91 GB
  • … …
CDL web content
• Transfer of ARC files and manifest to CDL via HTTP
• Transfer of Bags to Library of Congress via HTTP: 28.7 MB/sec (16 parallel threads)
• Transfer of Bags to UCSD/SDSC via HTTP: 5.6 MB/sec (15 parallel threads)
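The aggregate rates above come from running many transfers at once: 28.7 MB/sec over 16 threads is about 1.8 MB/sec per thread, and 5.6 MB/sec over 15 threads is about 0.4 MB/sec per thread. The following is a minimal sketch of that pattern, not the tooling actually used by CDL. A local file copy stands in for one HTTP transfer, and the names `copy_one` and `transfer_parallel` are hypothetical.

```python
import shutil
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def copy_one(src: Path, dest_dir: Path) -> int:
    """Stand-in for one HTTP transfer: copy a file, return its size in bytes."""
    target = dest_dir / src.name
    shutil.copyfile(src, target)
    return target.stat().st_size


def transfer_parallel(files, dest_dir, threads: int = 16) -> int:
    """Transfer all files using a fixed-size thread pool; return total bytes moved."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # pool.map re-raises the first worker exception, so failures are not silent
        sizes = list(pool.map(lambda f: copy_one(Path(f), dest_dir), files))
    return sum(sizes)
```

In a real deployment `copy_one` would be replaced by an HTTP download or upload, and the thread count tuned by measurement rather than fixed in advance.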
Mass Transit
• CDL/SDSC investigation of critical issues in the large-scale transfer and replication of digital data for preservation
• Initial focus on measuring and tuning network performance
BagIt
• Common need for low-overhead transfer of content between preservation partners
• Minimally self-identifying and self-describing packages
• Support for error detection and transfer optimization
• Content agnostic
• Informed by
  • NDIIPP Archive and Ingest Handling Test (AIHT), D-Lib Magazine, December 2005
  • Tabata et al., “Enclose-and-Deposit Method,” IWAW ’05
• Documented at
  • www.ietf.org/internet-drafts/draft-kunze-bagit-01.txt
  • www.cdlib.org/inside/diglib/bagit/bagitspec.html
BagIt
• “Bag it and tag it”
• Minimal metadata, file system structuring, and packaging rules
    abcd/
      bagit.txt
      fetch.txt
      manifest-md5.txt
      package-info.txt
      data/
        ...
• bagit.txt – Bag signature and metadata
• package-info.txt – Bag contents metadata
• manifest-md5.txt – Bag contents manifest and checksums
• fetch.txt – Bag contents included by reference, not value; i.e. “a bag of holes”
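The manifest is what makes error detection possible on the receiving side. As a minimal sketch, the check below assumes each line of manifest-md5.txt holds an MD5 checksum, whitespace, and a file path relative to the bag root; that layout and the function names are assumptions for illustration, so consult the BagIt specification for the authoritative format. Files listed in fetch.txt would be reported as missing until they have been fetched.

```python
import hashlib
from pathlib import Path


def md5_of(path, chunk_size: int = 1 << 20) -> str:
    """MD5 hex digest of a file, read in chunks to keep memory use flat."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_bag(bag_dir) -> list:
    """Return (path, problem) pairs; an empty list means every entry checked out."""
    bag = Path(bag_dir)
    problems = []
    manifest = (bag / "manifest-md5.txt").read_text(encoding="utf-8")
    for line in manifest.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Assumed line layout: "<checksum> <relative path>"
        expected, rel_path = line.split(None, 1)
        rel_path = rel_path.strip()
        target = bag / rel_path
        if not target.is_file():
            problems.append((rel_path, "missing"))
        elif md5_of(target) != expected.lower():
            problems.append((rel_path, "checksum mismatch"))
    return problems
```

A receiver could call `verify_bag("abcd")` after transfer and send a notification listing whatever problems are returned.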
GrabIt
[Diagram: Publication, Transfer, Validation, Notification]
• “Curb it and grab it”
• Protocol for lightweight transfer without reliance on tedious, error-prone email-based conventions
• Support for publication, transfer, validation, and notification
• No dependence on BagIt, but capable of operating in an enhanced, bag-aware mode
Summary
• Transfer is still hard
• Automate and reduce overhead
  • Transfer fewer big files, rather than many small files
  • Exploit parallelism
• Robust transfer requires explicit verification and notification
• Instrument and measure all phases of transfer to identify bottlenecks
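One lightweight way to instrument the phases of a transfer is to wrap each one in a timer and compare the totals. This is a sketch only; the phase names in the usage note are placeholders, and `phase`, `timings`, and `slowest_phase` are hypothetical names, not part of any tool mentioned in the talk.

```python
import time
from contextlib import contextmanager

# Accumulated wall-clock seconds per phase name
timings = {}


@contextmanager
def phase(name: str):
    """Time the enclosed block and add the elapsed seconds to timings[name]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        timings[name] = timings.get(name, 0.0) + elapsed


def slowest_phase() -> str:
    """Name of the phase with the largest accumulated time so far."""
    return max(timings, key=timings.get)
```

Usage would look like `with phase("packaging"): ...`, `with phase("transfer"): ...`, `with phase("validation"): ...`, after which `slowest_phase()` points at the bottleneck to tune first.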
Questions?
[Image: sign on a Berkeley Ecology Center recycling truck]