Collection-based Persistent Archives

Collection-based Persistent Archives Reagan W. Moore Associate Director, Data Intensive Computing San Diego Supercomputer Center moore@sdsc.edu http://www.npaci.edu/DICE

Topics • Experiences learned building a prototype Persistent Archive • Information model • Hierarchical levels of information • Interoperability mechanisms • Application to workshop topics • Ingestion methodology • Data set identification • Certification of archives

Persistent Archive Goals • Provide collection based archive • Data set relevance is organized by the collection • Provide information model to describe the context for the data collection • Enough information is needed to be able to dynamically create the collection from archived information • Decouple collection creation from digital object archiving • Provide accessioning system to turn data sets into digital objects • Accessioning is independent of the final collection

NARA Persistent Archive Prototype • Demonstrate ability to ingest, archive, recreate, query, and present a digital object from a 1 million record E-mail collection (RFC1036) • 2.5 GB of data • 6 required fields • 13 optional fields • User defined fields (over 1000) • Determine information model needed for persistent archive

Key Concepts Learned • Information model • Semi-structured representation of information - XML • Infrastructure independent representation of information context - XML DTD • Differentiation between information context for digital objects,collection and presentation • DTD for objects • DTD for collection • XSL style sheets for presentation • Instantiation software for creating the collection from the information model • XML databases now appearing

Hierarchy of Information Contexts • Digital object context • Meta-data to define the structure of the object • When publishing a digital object, must also publish the context of the object • Use collections to organize objects • Meta-data to define the structure of the collection • When publishing a collection, must also publish the information needed to organize the collection. • Use presentation context to control access • Meta-data to define structure of presentation

XML DTD for E-mail

Formatted Message Using XML DTD

Key Concepts Learned • Digital object encapsulation • Minimize the number of times a digital object must be touched • Once archived, a digital object should only be retrieved when requested by a user • Implies meta-data stored with digital objects should only describe the objects • Collection and presentation meta-data should be stored separately

Persistent Archive Requirements • Distributed environment to ensure separable components • Accession workbench • Archive • Presentation platform • Data handling mechanisms for interoperability as basis for system evolution • No tightly coupled systems • Unique names are only used by the data handling system • Use of containers to aggregate digital objects for storage • Implies a hierarchical naming scheme • Collection / container / digital object

FTP FTP Electronic Records Archive (ERA) TRANSFER ACCESSION ARCHIVES REFERENCE Accessioning Work Bench (snapin) Reference Workbench (snapin) Retrieve Records Media Handlers Catalog METADATA REPOSITORY RECORDS REPOSITORY Internet Intranet Text Image Photo Video Audio Geographical Information System Compound Records WEB Database Arrangement A R C Query & Reference Tools TAPE TAPE CD U N W R A P P E R CD W R A P P E R DISK DISK record Presentation Metadata wrapper Order Fulfillment

CEED / ESA NASA Catalog DPOSS Sky Survey REINAS 2MASS Sky Survey U Md Archive ADL NS Dig Lib Elib - Flora Wash. Brain Image UCLA Brain Image MSU Brain Image UCSD Neuroscience ESS Dig Lib MS Dig Lib Wash U Genome U H Mol Trajectory Protein Data Bank Federation of Data Collections into Digital Libraries UC Calif Finding Aids NARA Persistent Archive AMICO Image Library UMDL Social Science U Wisc. Video Lib. Pacific Rim DL

Conclusions • Ingestion • Infrastructure independent representation for digital objects • Infrastructure independent representation for information model • Turn data sets into digital objects by adding attribute tags • Aggregate digital objects in containers for storage

Conclusions • Data set identification • Unique names only required by data handling system • Attribute based access through collection • Hierarchical naming • Collection / Container / Digital object • Finding Aid for collection / Data handling system ID for container / Unique ID for object

Conclusion • Certification of persistent archive • Demonstrate that can provide infrastructure independent representation for • Finding aids for locating collections • Information model for building collection • Data handling system container Ids for storage access • Digital object attribute tags • Demonstrate that can use information models to create finding aids, collections, and access interfaces on new technology • Demonstrate that can independently migrate any component of architecture

Further Information http://www.npaci.edu/DICE

NARA Persistent Archive

Context Based Objects • For data to be useful, the context must be defined • Data format - binary/integer representation • Physical meaning - units • Structure - geometry • Relevance - feature annotation • Semantics - data dictionary for attributes • Context is preserved as meta-data attributes

Information Models for Organization of Data Digital Object Attributes Collection Attributes Presentation Attributes

Information Models for Access to Data Presentation of data from multiple digital libraries Collections from federated databases Digital object Attributes

Common Information Model • eXtensible Markup Language (XML) • Use tags to define semantic context for components of the data set • Document Type Definition (DTD) • Provides semi-structured representation for organizing tags that can be applied to groups of digital objects • Development of standards for tags • Digital sky, Protein Data Bank, Neuroscience brain images • California Digital Library - Art Museum Image Consortium

Information Management Hierarchy • Presentation / Information Discovery / Analysis • Visualization - Shastra, 3D visualization tools • Presentation information model - XSL style sheet • Collection organization • Meta-data catalog - MCAT • Collection information model - XML DTD • Data handling • Storage Resource Broker - SRB • Storage • Archival storage system - HPSS • Digital object model - XML DTD

Open Grid Architecture to Encourage Interoperability Application Data Model Management Remote Procedure Execution Information Discovery Data Handling Systems Dynamic Info Discovery Storage System Description Storage Resources

Technology Sources • Archive Community • IEEE Mass Storage Systems Technical Committee • Scalable storage systems • Digital Library Community • NSF Digital Library Initiative, Phase II • Information management mediation - XML • Supercomputer Community • Scalable analysis platforms • Grid Forum • Data handling systems for interoperability • Archivist Community / Library Community • Management policies and standards

Technology Sources Application Data Model Management Digital Library Remote Procedure Execution Information Discovery Data Handling Systems Dynamic Info Discovery Storage System Description Storage Resources Computational Grid

Information Management Architecture • Digital library community technologies • Distributed information resources • Digital library interoperability protocols - SDLIP • Mediation of information using XML - MIX • Grid Forum technologies • Support for distributed services / procedures • Inter-realm authentication • GSI Grid Security Infrastructure • Data handling system • Storage Resource Broker, Meta-data Catalog

Grid Forum Data Access Architecture API that provides “glue” to underlying data handling systems (security, scheduling, QoS, access protocol, data format/model, adaptivity, info discovery, location control) Application + authentication + authorization Data Model Management Remote Procedure Execution Armada D’agents, FEL, ADR GRAM, SRB, Java, CORBA Information Discovery Data Handling Systems LDAP, Database, Flat file, Object database Condor, GASS, NILE, SRB, I-2 caching, ADR Dynamic Info Discovery Storage System Description API that provides “glue” to underlying storage, QoS, etc. [GASS, IBP, SRB] Storage Resources DPSS, DFS, NFS HPSS, ADSM, DMF, Unitree, NASstore, DB2, Oracle, Informix, Sybase, O2, ObjectStore, Objectivity GloPerf, Netlogger, NWS DTD, ADR, object class

Data Handling System Capabilities • SDSC Storage Resource Broker • Protocol transparency • Common API for access to remote data resources • Explicit drivers for each type of storage system • Name transparency • Attribute based access to data • Location transparency • Distribution of collection across multiple physical resources • Time transparency • Minimization of latency for data access

File SID DBLobj SID Obj SID SRB Unix DB2 Oracle ADSM HPSS SDSC Storage Resource Broker & Meta-data Catalog Application Resource User MCAT Dublin Core Application Meta-data

Time Transparency • How to minimize latency • Prefetch data to local high performance disk, so that all accesses can be done at high speed from local resources • How to maximize access rate • Composite or aggregate data into a single data set to avoid multiple accesses • Stream data at high rates using parallel I/O, amortizing the access latency by the volume of data that is delivered. • How to avoid congestion • Replicate data across multiple servers

SRB Containers - Managing Archive Latency SRB client • Create container in a logical storage resource containing at least one “cacheable” resource • Create objects in containers • “Cache” daemon will move filled containers to archive • synch and purge API’s SRB Server UNIX HPSS HPSS container cached containers Distributed Storage Resources

Generality of Information Infrastructure • Same information model needed to manage • Federation in space • Metacomputing environment • Interoperable services for digital libraries • Migration over time • Collection creation and update • Persistent archive • Same storage systems needed to support • Supercomputer center data • Discipline specific data collections • Digital library collections

Art Museum Image Consortium • Demonstrated • Support for heterogeneous digital objects • Automated conversion of meta-data to XML DTD • Validation of meta-data • XSL style sheet for presenting information

AMICO Meta-data Conversion to XML

AMICO Presentation Interface

National Partnership for Advanced Computational Infrastructure • Facilitate the conduct of science through development of knowledge resources • Publish - Data collection infrastructure • Info discovery - Digital Library infrastructure • Data access - Data handling infrastructure • Apply to federal, state, and university projects • NSF / DOE / NASA / USPTO / NARA / Census Bureau • California Digital Library • UCSD - Pacific Rim Digital Library Alliance

Data Storage Archival Storage Applications Collection Building Information Management Digital Library Digital Sky Neuroscience Protein Data Bank Molecular Structures Earth Systems Science CDL UCB - Elib UCSB - ADL Stanford - SDLIP U Michigan - UMDL Publishing Scientific Data Applications Libraries

NPACI is a National Partnership of Partnerships 46 institutions 20 states 4 countries 5 national labs Many projects (new and old) Vendors and industry Government agencies

National Partnership for Advanced Computational Infrastructure • Provide Teraflops / Petabyte capable systems for use by national academic community • Current systems at the San Diego Supercomputer Center • 250 Gflops peak computation rate • IBM SP, CRAY T3E • 250 Terabyte archive capacity, 100 TB in archive • High Performance Storage System • By end of year • 1 TFlop peak computation rate • IBM SP • 500 Terabyte archive capacity

Challenges • Facilitate access to high-end resources • Support data intensive computing • Facilitate access to distributed data resources • Support information discovery • Minimize complexity of user interfaces • Provide unifying data access system • Requires information management infrastructure

Bio-Informatics

Art Museum Image Consortium - AMICO

National Virtual Observatory

California Digital Library

Collection-based Persistent Archives