Chronopolis: Preserving Our Digital Heritage
David Minor
UC San Diego, San Diego Supercomputer Center
What is Chronopolis?
• A digital preservation network developed by a national consortium, with initial funding from the Library of Congress National Digital Information Infrastructure and Preservation Program (NDIIPP).
• Chronopolis partners are:
  • San Diego Supercomputer Center (SDSC) and the UC San Diego (UCSD) Libraries
  • University of Maryland Institute for Advanced Computer Studies (UMIACS)
  • National Center for Atmospheric Research (NCAR) in Boulder, Colorado
http://chronopolis.sdsc.edu
Chronopolis Fast Facts
• Digital preservation environment using a data grid framework
• Designed to leverage capabilities at multiple institutions
• Emphasizes heterogeneous and redundant data storage systems
• Has a current storage capacity of 150 TB (50 TB at each of 3 nodes)
• Has geographically distributed copies of all data
• Includes detailed monitoring and monthly auditing of all data
Institutional Roles
• All partners provide:
  • Storage, network support
  • Complete copy of all data
  • SRB support
• UCSD Libraries:
  • Metadata expertise
• SDSC:
  • Project management
  • Finances, contracts, etc.
• UMIACS:
  • Preservation tool development
  • Storage technology testing
• NCAR:
  • Data portal development
Data Providers
• California Digital Library
  • 12 TB of data
  • Crawls of political and government web sites
  • ARC files, uniform size
  • BagIt protocol for data transfer
• Inter-university Consortium for Political and Social Research (ICPSR)
  • 10 TB of data
  • 40+ years of social science research
  • Millions of files
  • Already using SRB
Data Providers
• North Carolina State University Libraries
  • 6 TB of data
  • State and local geospatial data
  • BagIt protocol for data transfer
• Scripps Institution of Oceanography (SIO)
  • 1 TB of data
  • 50 years of data from SIO research cruises
  • Already using SRB
Core Chronopolis Tools
• Storage Resource Broker (SRB)
• BagIt
• SRB Replication Monitor
• Auditing Control Environment (ACE)
• Chronopolis Web Portal
Storage Resource Broker
• The underlying infrastructure of Chronopolis
• Each site is a separate zone with its own MCAT and management
• Data is replicated at each zone
• Will be moving to iRODS in the next few months
BagIt
BagIt is a hierarchical file packaging format for the exchange of generalized digital content.
• There is no software to install
• A bag consists of a base directory with a manifest file and a subdirectory with the content
• The manifest file has a row for each content file with:
  • The full path of the file in the content directory
  • A checksum for the file
Holey Bags
• Have an additional ‘fetch.txt’ file in the base directory and an empty content directory
• URLs for each content file are listed in the fetch.txt file
• Can reduce transfer time by fetching content in parallel
http://www.digitalpreservation.gov/library/resources/tools/docs/bagitspec.pdf
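The manifest idea above can be illustrated with a minimal Python sketch. It is not Chronopolis code: the function names are hypothetical, and the choices of SHA-256, a manifest named `manifest-sha256.txt`, and a content directory named `data` are assumptions made for this example (the slide only says that each row pairs a path with a checksum).

```python
import hashlib
from pathlib import Path


def checksum_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(bag_dir: Path) -> Path:
    """Write one '<checksum>  <path>' row for every file under bag_dir/data."""
    rows = []
    for item in sorted((bag_dir / "data").rglob("*")):
        if item.is_file():
            # Paths are stored relative to the bag's base directory.
            rel = item.relative_to(bag_dir).as_posix()
            rows.append(f"{checksum_of(item)}  {rel}")
    manifest = bag_dir / "manifest-sha256.txt"  # assumed file name
    manifest.write_text("\n".join(rows) + "\n", encoding="utf-8")
    return manifest


def verify_bag(bag_dir: Path) -> list[str]:
    """Return the listed paths that are missing or whose checksum changed."""
    failures = []
    text = (bag_dir / "manifest-sha256.txt").read_text(encoding="utf-8")
    for row in text.splitlines():
        if not row.strip():
            continue
        expected, rel = row.split(maxsplit=1)
        target = bag_dir / rel
        if not target.is_file() or checksum_of(target) != expected:
            failures.append(rel)
    return failures
```

A receiving site could call `verify_bag` after a transfer and treat an empty result as a successful validation. Because the bag is only files and directories, nothing beyond a checksum routine is needed, which matches the "no software to install" point.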
SRB Replication Monitor
• Product of UMIACS
• A webapp that watches registered directories and ensures that copies exist at designated mirrors
• The monitor stores enough information to know whether files have been added to or removed from the master site, and when each file was last seen
• Any action that the webapp takes on files is logged
• The monitor does NOT do any type of integrity checking; that is the responsibility of other components (e.g., ACE)
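The bookkeeping described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the UMIACS webapp: the class name, method names, and log format are all hypothetical, and plain sets of path strings stand in for SRB directory listings.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class ReplicationMonitor:
    """Toy monitor: tracks a master listing and checks mirrors for copies."""
    present: set[str] = field(default_factory=set)              # files now at the master
    last_seen: dict[str, datetime] = field(default_factory=dict)
    log: list[str] = field(default_factory=list)

    def scan_master(self, listing: set[str], now: datetime) -> tuple[set[str], set[str]]:
        """Record the master listing; return (added, removed) since the last scan."""
        added = listing - self.present
        removed = self.present - listing
        for path in listing:
            self.last_seen[path] = now      # removed files keep their old timestamp
        for path in sorted(added):
            self.log.append(f"{now.isoformat()} ADDED {path}")
        for path in sorted(removed):
            self.log.append(f"{now.isoformat()} REMOVED {path}")
        self.present = set(listing)
        return added, removed

    def replicate(self, mirror: str, mirror_files: set[str], now: datetime) -> list[str]:
        """Add every file the mirror lacks to its listing and log each action.

        No checksums are compared here: integrity checking is left to a
        separate component, as the slide states.
        """
        copied = sorted(self.present - mirror_files)
        for path in copied:
            mirror_files.add(path)          # stand-in for a real copy operation
            self.log.append(f"{now.isoformat()} COPIED {path} -> {mirror}")
        return copied
```

Keeping `last_seen` separate from `present` is what lets the monitor report when a since-removed file was last observed.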
Replication Process
[diagram: Replication Monitor]
Auditing Control Environment (ACE)
• Product of UMIACS
• Software to protect the integrity of digital assets in the long term
• Underpinnings are based on rigorous cryptographic techniques
• Scalable, cost-effective, can interoperate with any archiving architecture
ACE Overview
[diagram; labels: object, Hash(obj), Integrity Token, Client, ACE-AM (Audit Manager), ACE-IMS (Integrity Management Service), 3rd Party Auditor]
ACE Audit
• Can audit millions of files and TBs of data
• Two types of audit:
  • File audit: checks files in registered directories against stored hashes to ensure files have not been corrupted
  • Token audit: checks the stored hashes against a remote Integrity Management Service (ACE-IMS) to ensure nobody has tampered with the stored hashes
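The difference between the two audits can be shown with a small Python sketch. It is an illustration only and does not use ACE's interfaces: the function names are hypothetical, SHA-256 is an assumed hash, and plain dictionaries stand in for the registered directories, the local hash store, and the remote service's record.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the hex SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def file_audit(files: dict[str, bytes], stored_hashes: dict[str, str]) -> list[str]:
    """Return names of files that are missing or no longer match their stored hash."""
    return sorted(
        name for name, expected in stored_hashes.items()
        if name not in files or sha256_hex(files[name]) != expected
    )


def token_audit(stored_hashes: dict[str, str], remote_record: dict[str, str]) -> list[str]:
    """Return names whose locally stored hash disagrees with the remote record."""
    return sorted(
        name for name, local in stored_hashes.items()
        if remote_record.get(name) != local
    )


# Usage: corruption is caught by the file audit; rewriting the stored hash
# to hide it is caught by the token audit.
files = {"a": b"alpha", "b": b"beta"}
stored = {name: sha256_hex(data) for name, data in files.items()}
remote = dict(stored)

files["a"] = b"corrupt"
print(file_audit(files, stored))      # ['a']
stored["a"] = sha256_hex(b"corrupt")  # someone tampers with the stored hash
print(file_audit(files, stored))      # []
print(token_audit(stored, remote))    # ['a']
```

The usage lines show why the second audit exists: once the stored hash is altered to match a corrupted file, only a comparison against an independent record reveals the change.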
ACE Audit
1. Each digital object is audited locally using the integrity token, according to the policy set by the local manager.
2. The integrity management system periodically audits the integrity tokens according to its policies.
3. Cryptographic summaries are audited as necessary using the published witness values.
[diagram; labels: Object, Integrity Token, Cryptographic Summary Information, Witness]
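The slides do not say how integrity tokens or cryptographic summaries are built. One common construction for this kind of layered audit is a Merkle hash tree, and the Python sketch below assumes that construction purely for illustration: the function names are hypothetical, SHA-256 is an assumed hash, the "summary" is the tree root, and each "token" is the list of sibling hashes linking one object to that root.

```python
import hashlib

Proof = list[tuple[str, bytes]]  # (side of sibling: "L" or "R", sibling hash)


def h(data: bytes) -> bytes:
    """Return the raw SHA-256 digest of a byte string."""
    return hashlib.sha256(data).digest()


def build_tokens(object_hashes: list[bytes]) -> tuple[bytes, list[Proof]]:
    """Aggregate object hashes into one summary; return (summary, proof per object)."""
    if not object_hashes:
        raise ValueError("at least one object hash is required")
    level = list(object_hashes)
    positions = list(range(len(level)))  # each object's ancestor index in `level`
    proofs: list[Proof] = [[] for _ in level]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])      # pad odd levels by repeating the last node
        for obj, pos in enumerate(positions):
            sibling = pos ^ 1
            side = "L" if sibling < pos else "R"
            proofs[obj].append((side, level[sibling]))
            positions[obj] = pos // 2
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0], proofs


def verify_token(object_hash: bytes, proof: Proof, summary: bytes) -> bool:
    """Recompute the path from an object hash to the root and compare with the summary."""
    node = object_hash
    for side, sibling in proof:
        node = h(sibling + node) if side == "L" else h(node + sibling)
    return node == summary
```

Under this assumption, step 1 corresponds to hashing the object and calling `verify_token`, and steps 2 and 3 correspond to checking the summaries themselves against values published elsewhere, which the sketch does not model.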
Web Portal
• Designed to give data providers an in-depth look at their holdings
• Shows where data is in all locations
• Unifies information from SRB, ACE and the Replication Monitor
Chronopolis Metadata
• Working with a team from UCSD Libraries
• What technical metadata is the system tracking?
• What descriptive metadata is present?
• What are the significant events?
[diagram of event types; other labels: DP, Node 1, Node 2, Node 3, ACE, Replication Monitor, Manifest, Data]
• ET-1 Service Level Agreement
• ET-2 Acquisition Transfer
• ET-3 Acquisition Validation
• ET-4 Acquisition Registration to SRB MCAT
• ET-5 Acquisition Registration into ACE
• ET-6 Inter-Node Inventory Check
• ET-7 Acquisition Replication
• ET-8 File Integrity Check
Future directions
• Update auditing procedures
• Updated portal
• Automation of collection ingest
• New collections and storage nodes
• Fully-fledged business model
• TRAC certification
http://chronopolis.sdsc.edu minor@sdsc.edu