Explore the key components of digital preservation theory including authenticity, trustworthiness, and enforcement of policies. Learn about infrastructure independence and the role of rule-oriented data systems.
Towards a Theory of Digital Preservation • 13 December 2007 • Digital Curation Conference, Washington D.C. • Reagan W. Moore, Director of Data Intensive Computing Environments, San Diego Supercomputer Center, University of California, San Diego • moore@sdsc.edu, http://irods.sdsc.edu
Components of a Theory • Preservation assertions • Authenticity • Integrity • Trustworthiness • Preservation management • Enforcement of policies that preserve assertions • Preservation processes • Execution of preservation services • Preservation validation • Verification that the assertions have been met
1st Motivating Concept • Preservation is communication with the future • Information generated in the past is sent into the future • Representation information for each record provides the context needed to understand each record • Challenge - future technology will be more sophisticated and more cost effective • The future preservation environment will incorporate new types of storage systems, new protocols for accessing data, new data encoding formats, and new standards for characterizing provenance • Infrastructure independence ensures the preservation environment can incorporate the new technology
Preservation Implication • Preservation is an active process • Extract record from the environment in which it was created • Ingest record into the preservation environment • Migrate preservation environment into the future • For each new storage technology, create the appropriate drivers to enable standard operations on the new system • For each new access technology, port onto the standard actions provided by the preservation environment
Infrastructure Independence • [Diagram labels: External World (hardware systems, software systems, access protocols); Preservation Environment; Records]
Data Grids Implement Infrastructure Independence • [Diagram labels: Ask for record; Data Grid; Data is returned] • Data grid provides • Persistent name space • Standard operations • State information
Preservation Environment - SRB • [Diagram labels: DB; Metadata Catalog; SRB Server; SRB Server] • Insert data grid server in front of each storage resource • Manage state information in a metadata catalog • Preservation environment consists of the data grid servers and metadata catalog
Name Space Virtualization • [Diagram labels: Data Access Methods (C library, Unix, Web Browser); Data Collection] • Storage Repository • Storage location • User name • File name • File context (creation date,…) • Access controls • Data Grid • Logical resource name space • Logical user name space • Logical file name space • Logical context (metadata) • Access constraints • Data is organized as a shared collection
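The contrast between storage-repository names and data-grid logical names can be sketched in a few lines. This is a minimal illustration only, not SRB or iRODS code: the `LogicalCatalog` class, its methods, and the resource names and paths are all hypothetical. It shows a logical file name staying fixed while the physical replica behind it moves to new storage.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    resource: str        # logical resource name, not a host name
    physical_path: str   # where the bytes live on that resource


@dataclass
class LogicalCatalog:
    """Toy metadata catalog (hypothetical): logical file name -> replicas and context."""
    replicas: dict = field(default_factory=dict)
    metadata: dict = field(default_factory=dict)

    def register(self, logical_name, resource, physical_path, **context):
        self.replicas.setdefault(logical_name, []).append(
            Replica(resource, physical_path)
        )
        self.metadata.setdefault(logical_name, {}).update(context)

    def migrate(self, logical_name, old_resource, new_resource, new_path):
        """Point a replica at new storage; the logical name does not change."""
        self.replicas[logical_name] = [
            Replica(new_resource, new_path) if r.resource == old_resource else r
            for r in self.replicas[logical_name]
        ]

    def resolve(self, logical_name):
        return [(r.resource, r.physical_path) for r in self.replicas[logical_name]]


catalog = LogicalCatalog()
catalog.register("/archive/reports/r1.pdf", "tape-2007", "/vol3/a9f1", creator="agency-x")
catalog.migrate("/archive/reports/r1.pdf", "tape-2007", "disk-2020", "/pool/77/a9f1")
print(catalog.resolve("/archive/reports/r1.pdf"))
```

Users and access methods only ever see `/archive/reports/r1.pdf`; the storage location is an internal detail the catalog is free to change.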
Operation Virtualization • Map from the actions requested by the access method to a standard set of micro-services • Interact with remote storage system through standard operations • [Diagram labels: Access Interface; Standard Micro-services; Data Grid; Standard Operations; Storage Protocol; Storage System]
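One way to picture this layering is a driver interface that every storage system implements, with micro-services written only against that interface. The sketch below is an assumption-laden illustration: `StorageDriver`, `InMemoryDriver`, and `replicate` are hypothetical names, and the two-method interface is a deliberately tiny subset of what a real data grid would standardize.

```python
from abc import ABC, abstractmethod


class StorageDriver(ABC):
    """Hypothetical subset of the standard operations a storage system must support."""

    @abstractmethod
    def read(self, path: str) -> bytes: ...

    @abstractmethod
    def write(self, path: str, data: bytes) -> None: ...


class InMemoryDriver(StorageDriver):
    """Stand-in for a real storage protocol; a new technology adds a new driver."""

    def __init__(self):
        self._blobs = {}

    def read(self, path):
        return self._blobs[path]

    def write(self, path, data):
        self._blobs[path] = bytes(data)


def replicate(src: StorageDriver, dst: StorageDriver, path: str) -> int:
    """Micro-service written only against the standard operations."""
    data = src.read(path)
    dst.write(path, data)
    return len(data)


old_store, new_store = InMemoryDriver(), InMemoryDriver()
old_store.write("rec/001", b"minutes of meeting")
copied = replicate(old_store, new_store, "rec/001")
print(copied)
```

Because `replicate` never touches a storage protocol directly, supporting a new storage system means writing one new driver class rather than revisiting every micro-service.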
Standard Operations • File manipulation • Posix I/O calls - open, close, read, write, seek, … • Register, replicate, checksum, synchronize • Bulk operations • Bulk data transport, metadata load • Parallel I/O streams • Remote procedures • Data filtering, subsetting, metadata extraction • Remote library execution (HDFv5, DataCutter)
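As a rough illustration of the checksum and synchronize operations listed above, the following sketch computes a streaming checksum and compares replicas. The slides do not name a checksum algorithm; SHA-256 is an assumption here, and `checksum` and `replicas_in_sync` are hypothetical helper names.

```python
import hashlib
import io


def checksum(stream, chunk_size=65536):
    """Stream a file-like object through SHA-256 (algorithm choice is an assumption)."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def replicas_in_sync(streams):
    """True when every replica yields the same checksum."""
    return len({checksum(s) for s in streams}) == 1


original = b"record payload"
good = replicas_in_sync([io.BytesIO(original), io.BytesIO(original)])
bad = replicas_in_sync([io.BytesIO(original), io.BytesIO(b"record pay1oad")])
print(good, bad)
```

Reading in chunks keeps memory use flat for large files, which matters when the same operation runs over an entire collection.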
2nd Motivating Concept • Preservation is the validation of communication from the past • Claims about the current state of authenticity and integrity require a complete description of prior preservation policies and processes • Challenge - can we characterize preservation policies and preservation processes? • We need representation information about the preservation environment
Representation Information • Records - the information context needed to: • Understand the provenance and meaning of the data • Interpret and manipulate the record • Preservation environment - the information context needed to validate assertions about: • Authenticity • Integrity • Chain of custody • Trustworthiness
Preservation Environment Representation Information • Explicitly define management policies and processes and migrate them onto new choices of technology
Rule-Oriented Data System • Management policies • Expressed as sets of rules that control execution of remote operations • Management processes • Expressed as sets of micro-services • Assertions • Expressed as queries on persistent state information generated by micro-services • iRODS - integrated Rule Oriented Data System
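The idea that an assertion is a query over state recorded by micro-services can be sketched with an in-memory SQLite table. Everything here is illustrative: the `state` table layout, the attribute names, and the record names are hypothetical and are not the iRODS metadata catalog schema.

```python
import sqlite3

# Toy stand-in for persistent state information (hypothetical schema).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE state (logical_name TEXT, attribute TEXT, value TEXT)")


def record_state(logical_name, attribute, value):
    """What a micro-service would do after it runs: persist what happened."""
    db.execute("INSERT INTO state VALUES (?, ?, ?)", (logical_name, attribute, value))


record_state("/archive/a", "checksum_verified", "2007-12-01")
record_state("/archive/b", "checksum_verified", "2007-12-02")
record_state("/archive/c", "ingested", "2007-11-30")


def unverified_records():
    """Assertion 'every record has a verified checksum', phrased as a query
    that returns the records violating it."""
    rows = db.execute(
        """SELECT DISTINCT logical_name FROM state
           WHERE logical_name NOT IN
               (SELECT logical_name FROM state
                WHERE attribute = 'checksum_verified')
           ORDER BY logical_name"""
    )
    return [name for (name,) in rows]


print(unverified_records())
```

An empty result means the assertion holds; a non-empty result names exactly which records need attention.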
Rule-Oriented Data Management • [Diagram labels: DB; Metadata Catalog; Rule base; two iRODS Servers, each with a Rule engine] • Data grid enforces management policies through a distributed rule engine installed at each storage location • Actions requested by any access mechanism are executed under the control of the distributed rule engine
iRODS Rules • Each rule defines • Event • Condition • Action set (micro-services and rules) • Recovery set • Rule types • Atomic, applied immediately • Deferred, support deferred consistency constraints • Periodic, typically used to validate assertions
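The four parts of a rule can be illustrated with a small interpreter. This is a sketch of one plausible reading of event, condition, action set, and recovery set, not the iRODS rule engine or its rule syntax; `Rule`, `apply_rule`, and the pairing of each action with a recovery step by position are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Rule:
    event: str
    condition: Callable[[dict], bool]
    actions: List[Callable[[dict], None]]
    recoveries: List[Callable[[dict], None]]  # assumption: recoveries[i] undoes actions[i]


def apply_rule(rule, event, ctx):
    """Return True if the rule fired and every action succeeded."""
    if event != rule.event or not rule.condition(ctx):
        return False
    done = 0
    try:
        for action in rule.actions:
            action(ctx)
            done += 1
    except Exception:
        # Undo only the actions that completed, most recent first.
        for recovery in reversed(rule.recoveries[:done]):
            recovery(ctx)
        return False
    return True


def make_replica(ctx):
    ctx["replicas"] += 1


def drop_replica(ctx):
    ctx["replicas"] -= 1


def failing_checksum(ctx):
    raise IOError("storage unavailable")


def no_op(ctx):
    pass


rule = Rule(
    event="ingest",
    condition=lambda c: c["size"] > 0,
    actions=[make_replica, failing_checksum],
    recoveries=[drop_replica, no_op],
)
ctx = {"size": 10, "replicas": 1}
fired = apply_rule(rule, "ingest", ctx)
print(fired, ctx["replicas"])
```

The recovery set is what lets a partially executed rule leave the state information consistent instead of recording a replica that was never validated.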
Preservation Policy Examples • Integrity • Data distribution and replication • Periodic checksum validation • Synchronization of replicas • Data retention and disposition • Time dependent access controls • Authenticity • Provenance metadata / representation information creation • Chain of custody - audit trail analysis • Archival Information Package generation • Trustworthiness • RLG/NARA - Trustworthy Repositories Audit & Certification: Criteria and Checklist.
TRAC Assessment Criteria • Trustworthy Repositories Audit & Certification: Criteria and Checklist http://wiki.digitalrepositoryauditandcertification.org/pub/Main/ReferenceInputDocuments/trac.pdf • Organizational infrastructure • Governance and organizational viability • Organizational structure and staffing • Procedural accountability and policy framework • Financial sustainability • Contracts, licenses, and liabilities
TRAC Assessment Criteria • Digital Object Management • Ingest - acquisition • Ingest - creation of Archival Information Package • Preservation planning • Archival storage & preservation maintenance of AIPs • Information management • Access management • Technologies, technical infrastructure & security • System infrastructure • Appropriate technologies • Security
Mapping TRAC Criteria to Rules • Defined micro-services • Implement the functions needed to enforce TRAC criteria • Specified 105 separate micro-services • Identified persistent state information • Required to validate TRAC assertions • Specified 141 metadata attributes associated with multiple name spaces • Record metadata (provenance, events) • Template metadata (structured information) • User metadata (roles) • Resource metadata (storage properties, errors) • Rule metadata (version, type) • Micro-service metadata (version, audit trails)
Theory of Digital Preservation • Characterization • Persistent name spaces • Operations that are performed upon the persistent name spaces • Changes to the persistent state information that occur for each operation • Transformations that are made to the records on each operation • Completeness • Set of micro-services is complete, enabling the decomposition of every preservation process onto the micro-service set • Preservation management policies are complete, enabling the control of all preservation processes • Persistent state information is complete, enabling the validation of authenticity and integrity
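The first completeness condition, that every preservation process decomposes onto the micro-service set, can be checked mechanically once both are written down. The micro-service and process names below are invented for illustration; they are not drawn from the 105 micro-services mentioned earlier.

```python
# Hypothetical micro-service set and process decompositions.
MICRO_SERVICES = {"checksum", "replicate", "register", "extract_metadata"}

PROCESSES = {
    "ingest": ["extract_metadata", "checksum", "register"],
    "disaster_copy": ["replicate", "checksum"],
    "format_migration": ["transform_format", "checksum"],
}


def missing_services(processes, available):
    """Map each process to the steps it needs that no micro-service provides."""
    gaps = {}
    for name, steps in processes.items():
        missing = sorted(set(steps) - available)
        if missing:
            gaps[name] = missing
    return gaps


gaps = missing_services(PROCESSES, MICRO_SERVICES)
print(gaps)
```

An empty result would mean the set is complete for the listed processes; here the check reports that `format_migration` needs a micro-service that has not been defined.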
Theory of Digital Preservation • Closure • Micro-services generate the required persistent state information • Persistent state information is preserved by required micro-services • Assertion • If the operations are reversible, then a future preservation environment can recreate a record in its original form, maintain authenticity and integrity, support access, and display the record. Such a system would allow records to be migrated between independent implementations of preservation environments, while maintaining authenticity and integrity.
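The reversibility assertion can be made concrete with a toy example: if each operation applied to a record has a known inverse and the order of operations is recorded, the original bytes can be recovered by undoing the history in reverse. The operation table and helper names below are hypothetical, and zlib compression stands in for whatever transformations a real environment would apply.

```python
import hashlib
import zlib

# Hypothetical operation table: name -> (forward, inverse).
OPERATIONS = {
    "compress": (zlib.compress, zlib.decompress),
    "reverse_bytes": (lambda b: b[::-1], lambda b: b[::-1]),
}


def apply_ops(data, names):
    for name in names:
        data = OPERATIONS[name][0](data)
    return data


def undo_ops(data, names):
    """Apply the inverses in the opposite order to the recorded history."""
    for name in reversed(names):
        data = OPERATIONS[name][1](data)
    return data


original = b"record as first ingested"
history = ["compress", "reverse_bytes"]   # persistent state: what was done, in order
stored = apply_ops(original, history)
recovered = undo_ops(stored, history)
print(hashlib.sha256(recovered).hexdigest() == hashlib.sha256(original).hexdigest())
```

The recorded `history` plays the role of persistent state information: without it, a future environment would hold the stored bytes but not know how to get back to the original form.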
Preservation Approaches • Is preservation driven by the management policies that enforce authenticity and integrity? • From the management policies derive the required preservation metadata • Is preservation driven by an assessment of required provenance metadata? • From the metadata derive the required management policies
For More Information • Reagan W. Moore, San Diego Supercomputer Center, moore@sdsc.edu • SRB: http://www.sdsc.edu/srb/ • iRODS: http://irods.sdsc.edu/ • Rajasekar, A., M. Wan, R. Moore, W. Schroeder, “A Prototype Rule-based Distributed Data Management System”, HPDC workshop on “Next Generation Distributed Data Management”, May 2006, Paris, France. • Moore, R., M. Smith, “Automated Validation of Trusted Digital Repository Assessment Criteria”, Journal of Digital Information, Vol 8, No 2 (2007).