Importance of Infrastructure Independence. http://salt.unc.edu richard_marciano@unc.edu Richard Marciano, Professor @ SILS; Chief Scientist for Persistent Archives and Digital Preservation @ RENCI; Director of the Sustainable Archives & Leveraging Technologies (SALT) group.
Sustainable Archives & Leveraging Technologies (SALT): a metaphor for “Data Curation”? “We define this discipline of ‘data curation’ as the practice of collection, annotation, conditioning and preservation of data for both current and future use.” Helen Tibbo & Bryan Heidorn [Diagram labels: collection, annotation, conditioning, preservation, current & future use]
Curation Topics Motivation for curation Access today Preservation - access tomorrow Preservation community concepts Representation information for records Representation information for policies Representation information for processes Rule-oriented data systems Automate curation processes Enforce curation policies Verify assertions about curation results
Curation Processes Extract record from the creation environment and import into the digital library or preservation environment Context Properties about the record creation Description of the record content Description of the record type Description of the record structure Assert that the record can be viewed and manipulated in the future
Curation (Preservation) Is an Active Process Preservation is communication with the future Continually migrate the record from the current data management environment into the next management environment At the point in time when the migration occurs, both the old and new technologies are present Use data grids to support interoperability across technologies Manage the name spaces for identifying records, archivists, storage systems Decouple access mechanisms from storage systems
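The migration step above can be sketched as a copy followed by a fixity check, performed while both the old and the new environment are still available. This is a minimal illustration, not an iRODS interface: the function names are hypothetical and plain dictionaries stand in for the two storage environments.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Return a SHA-256 hex digest used as the fixity value."""
    return hashlib.sha256(data).hexdigest()


def migrate(record_id: str, old_store: dict, new_store: dict) -> str:
    """Copy a record from old_store to new_store and verify the copy.

    Both stores are plain dicts standing in for storage systems
    (an assumption made for illustration only).
    """
    data = old_store[record_id]
    expected = checksum(data)
    new_store[record_id] = data
    # Re-read from the new environment and compare before trusting it.
    if checksum(new_store[record_id]) != expected:
        raise ValueError(f"fixity check failed for {record_id}")
    return expected


old_env = {"rec-001": b"minutes of the 1999 board meeting"}
new_env = {}
digest = migrate("rec-001", old_env, new_env)
```

Only after the check succeeds would the old copy be a candidate for retirement; the returned digest can be kept as part of the record's preservation metadata.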
Maintain Control of the Curation Environment Insert data management infrastructure between the records and the current technology Distributed server architecture Protect the records from changes in the environment Ensure that the curation properties are maintained Ensure that the curation policies are enforced Verify assessment criteria
Use Cases (1) • DCAPE: Distributed Custodial Archival Preservation Environments • Build a distributed production preservation environment that meets the needs of archival repositories for trusted archival preservation services • Develop preservation policies for state archives, university archives and cultural institutions • Use iRODS to implement and deliver the resulting services
DCAPE: Distributed Custodial Archival Preservation Purpose: Build a distributed production preservation environment that meets the needs of archival repositories for trusted archival preservation services Distributed partnership of 11 institutions: 33 people * STATES: - California - Kansas - Michigan - Kentucky - North Carolina - New York * UNIVERSITIES: - Tufts University - West Virginia University - UNC (SILS/RENCI) * CULTURAL ENTITIES: - Getty Research Institute * INTERNATIONAL PARTNERS: - Carleton University (Geomatics and Cartographic Research Centre) Richard Marciano, Professor SILS; Reagan Moore, Professor SILS; Chien-yi Hou, Research Associate SILS; John Gallagher, Dir. of Research Mgt. and Admin RENCI; Kelly Eubank, Electronic Records Archivist; Druscie Simpson, IT Administrator; David Minor, Programmer; Ed Southern, State Archivist; Jennifer Ricker, Digital Collections Manager; Amy Rudersdorf, Director of Digital Information Mgt.
Overview of iRODS Architecture: Delivery of Preservation Services [Diagram: Archivist A requests an automatic replication service; Archivist B requests a validation service for a collection; iRODS Data System and iRODS Metadata Catalog; sites: NC State Library, Getty Research Inst., NC State Archives] Services can be invoked for automatic replication, generation of audit trails, e-mail notification of activity, ingestion of multiple files, format obsolescence, etc.
What are Data Grids? Data Grids are “middleware services” • Software that sits between applications and data sources So, What?
What are Data Grids Good For? Data Grids allow you to access data: • In any format • Files, databases, streams, web, programs,… • Documents, images, data, sensor packets, tables,… • Stored in any type of storage system • File Systems, tape silos, object ring buffers, sensor streams,… • Stored anywhere over a wide area network • Across organizational, administrative and security boundaries • Without having to know the system addresses, paths, protocols, commands, etc. needed to retrieve it!
What are Data Grids Good For? Scalability Millions of files Petabytes of Data Evolvability Infrastructure Independence Across Generations of Software Extensibility Deal with Technologies not yet Dreamed of
What are Data Grids Good For? • Collections Managed by the DICE Center: • 1+PetaBytes, 170+ Million files • Multi-disciplinary Scientific Data • Astronomy, Cosmology • Neuro Science, Cell-Signalling & other Bio-medical Informatics • Environmental & Ecological Data • Educational (web) & Research Data (Chem, Phys,…) • Earthquake Data, Seismic Simulations • Real-time Sensor Data • Growing at 1TB a day • Supporting large projects: TPAP, TeraGrid, NVO, SCEC, SEEK/Kepler, GEON, ROADNet, JCSG, AfCS, SIO Explorer, SALK, PAT …
Data Grids • Storage Resource Broker (SRB) • Initially funded by DARPA in 1996 • Current version is 3.5.0, released Dec 3, 2007 • Production system used internationally • Integrated Rule-Oriented Data System (iRODS) • Funded by NSF SDCI and NARA • Current version is 1.1, released June 2008
Design Implications • Heterogeneous storage systems • Data stored in file systems, archives, databases • Global name spaces • Files • Users • Resources • Persistent access controls • Constraints between name spaces • Consistent state information • Properties of files, collections, resources, users
Data Grids • Data virtualization • Provide the persistent, global identifiers needed to manage distributed data • Provide standard operations for interacting with heterogeneous storage systems • Provide standard actions for interacting with clients • Trust virtualization • Manage authentication and authorization • Enable access controls on data, metadata, storage • Federation • Controlled sharing of name spaces, files, and metadata between independent data grids • Data grid chaining / Central archives / Master-slave data grids / Peer-to-Peer data grids
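Data virtualization can be pictured as a catalog that maps a persistent logical name to one or more physical replicas, so that clients never handle storage-specific paths. The sketch below is an assumption-laden illustration: the class names, resource names and paths are all hypothetical and do not reflect the SRB or iRODS catalog schema.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    resource: str       # logical resource name (hypothetical), e.g. "unc-disk"
    physical_path: str  # location that is meaningful only to that resource


@dataclass
class Catalog:
    """Maps persistent logical names to physical replicas."""
    entries: dict = field(default_factory=dict)

    def register(self, logical_name: str, replica: Replica) -> None:
        # A logical name may accumulate replicas on several resources.
        self.entries.setdefault(logical_name, []).append(replica)

    def resolve(self, logical_name: str) -> list:
        # Return a copy so callers cannot alter catalog state.
        return list(self.entries.get(logical_name, []))


catalog = Catalog()
catalog.register("/archive/records/report.pdf", Replica("unc-disk", "/vol3/a9/0001"))
catalog.register("/archive/records/report.pdf", Replica("ucsd-tape", "T0042:178"))
replicas = catalog.resolve("/archive/records/report.pdf")
```

Because clients only ever use the logical name, a replica can be moved to a new storage system by updating the catalog entry, without changing how the record is cited or retrieved.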
What are Data Grids Good For? • TPAP - NARA Transcontinental Persistent Archive Prototype • Federation of Seven Independent Data Grids, each with its own MCAT catalog: NARA I, NARA II, Rocket Center, Georgia Tech, U Md, U NC, UCSD • Extensible environment; can federate with additional research and education sites • Each data grid uses different vendor products
Federation Across Spatial Scales International collaborations Australian Research Collaboration Service (ARCS) Sustaining Heritage Access through Multivalent ArchiviNg (SHAMAN) Cinegrid National collaborations Temporal Dynamics of Learning Center (TDLC) Ocean Observatories Initiative (OOI) NARA Transcontinental Persistent Archive Prototype (TPAP) Regional collaborations LSU data grid HASTAC humanities data grid Distributed Custodial Archive Preservation Environment (DCAPE) State collaborations RENCI data grid North Carolina State Library Institutional repositories Carolina Digital Repository SIO Repository
Ten Years of Data Grid 1.0 - What’s Missing • Automatic Policy Execution • Increasing Scale • Managing System Administration • Visualization • Virtualization • Customization
Data Grids 2.0 – Policies in Action! • Specify Policies: “Make X Copies of Accessioned Records” • Break Policies Down into Rules: “Put one copy at Rocket Center”, “Put one copy at UCSD”, “Verify Copies are Identical” • Break Rules Down into Micro-Services: “Put one copy at Rocket Center” becomes Read File, Copy File, Create Checksum, Copy Checksum, etc. • Micro-Services Can Be Combined into Complex Workflows • Execute them: Periodically, On-demand, Delayed Start, Anywhere on the network
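The policy, rule and micro-service layering can be sketched in a few lines. This is an illustration only, written in Python rather than the iRODS rule language; every function name is hypothetical, and dictionaries stand in for the storage sites.

```python
import hashlib


# Micro-services: small steps that read and update a shared context dict.
def read_file(ctx):
    ctx["data"] = ctx["source"][ctx["name"]]


def copy_file(ctx):
    ctx["target"][ctx["name"]] = ctx["data"]


def create_checksum(ctx):
    ctx["checksum"] = hashlib.sha256(ctx["data"]).hexdigest()


def verify_copy(ctx):
    copied = ctx["target"][ctx["name"]]
    if hashlib.sha256(copied).hexdigest() != ctx["checksum"]:
        raise ValueError("copy differs from original")


def make_rule(*micro_services):
    """Chain micro-services into one rule that runs them in order."""
    def rule(ctx):
        for step in micro_services:
            step(ctx)
        return ctx
    return rule


# Rule: "put one copy at a site", built from the micro-services above.
put_one_copy = make_rule(read_file, copy_file, create_checksum, verify_copy)


def make_copies_policy(name, source, sites):
    """Policy 'make X copies': apply the rule once per target site."""
    for site in sites:
        put_one_copy({"name": name, "source": source, "target": site})


accession = {"record.txt": b"accessioned record"}
rocket_center, ucsd = {}, {}
make_copies_policy("record.txt", accession, [rocket_center, ucsd])
```

Keeping each micro-service tiny is what makes recombination cheap: a different policy can reuse the same steps in a different order or on a different schedule.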
Rule-based Data Management • Associate Rules with Combinations of: • Data Objects • Collections • User Groups • Storage Systems • For Example: • Particular User Groups when Accessing a Particular Collection
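One way to picture rules attached to combinations of collections, user groups and storage systems is a binding that matches a request only when every field it specifies agrees. The sketch below is hypothetical throughout (class, field and collection names are invented for illustration) and is not how iRODS stores its rule base.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class RuleBinding:
    action: Callable[[dict], str]
    collection: Optional[str] = None   # None means "any collection"
    user_group: Optional[str] = None   # None means "any user group"
    resource: Optional[str] = None     # None means "any storage system"

    def matches(self, request: dict) -> bool:
        # Every field that is set must equal the request's value.
        return all(
            wanted is None or request.get(key) == wanted
            for key, wanted in (
                ("collection", self.collection),
                ("user_group", self.user_group),
                ("resource", self.resource),
            )
        )


def applicable_actions(bindings, request):
    """Run the action of every binding that matches the request."""
    return [b.action(request) for b in bindings if b.matches(request)]


bindings = [
    RuleBinding(lambda r: "log access", collection="/archive/state-records"),
    RuleBinding(lambda r: "redact personal data",
                collection="/archive/state-records", user_group="public"),
    RuleBinding(lambda r: "replicate", resource="tape"),
]
request = {"collection": "/archive/state-records",
           "user_group": "public", "resource": "disk"}
actions = applicable_actions(bindings, request)
```

Here the second binding fires only for the "public" group on that one collection, which is the "particular user groups when accessing a particular collection" case from the slide.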
Evolution of Data Grid Technology • Shared collections • Enable researchers at multiple institutions to collaborate on research by sharing data • Focus was on performance, scalability • Digital libraries • Support provenance information and discovery • Integrated with digital library front end services • Preservation environments • Support preservation policies • Build rule-based data management system
Infrastructure Independence • Use data grids to preserve records independently of the choice of technology • Management of archives properties • Map technology components to preservation principles • Capabilities that support preservation requirements • Construct preservation environment from components • Archival engineering perspective • Use infrastructure independence to enable use of new technology • View that new technology is an opportunity instead of a challenge
Overview of iRODS Architecture: Overview of iRODS Data System • User can search, access, add and manage data & metadata • iRODS Data System • iRODS Metadata Catalog: keeps track of data • iRODS Data Server: disk, tape, etc. • Access data with Web-based browser, iRODS GUI, or command-line clients.
Building a Shared Collection [Diagram: UNC @ Chapel Hill, NCSU, Duke, DB] Have collaborators at multiple sites, each with different administration policies, different types of storage systems, and different naming conventions. Assemble a self-consistent, persistent, distributed shared collection.
Using a Data Grid - Details [Diagram: iRODS Server #1 with Rule Engine; iRODS Server #2 with Rule Engine; Metadata Catalog (DB); Rule Base] • User asks for data • Data request goes to iRODS Server #1 • Server looks up information in catalog • Catalog tells which iRODS server has data • 1st server asks 2nd for data • The 2nd iRODS server applies rules
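The request flow on this slide can be traced with a toy model: the first server consults the catalog, forwards the request to the server that holds the data, and that second server applies its rules before answering. All class and method names below are hypothetical; real iRODS servers communicate over the network and use a database-backed catalog.

```python
class Catalog:
    """Records which server holds each logical name."""
    def __init__(self):
        self.location = {}

    def locate(self, name):
        return self.location[name]


class Server:
    def __init__(self, name, catalog, rules=()):
        self.name = name
        self.catalog = catalog
        self.rules = list(rules)  # callables run before data is released
        self.storage = {}
        self.peers = {}           # server name -> Server

    def put(self, logical_name, data):
        self.storage[logical_name] = data
        self.catalog.location[logical_name] = self.name

    def serve_local(self, logical_name, user):
        # The server that holds the data applies its own rules.
        for rule in self.rules:
            rule(user, logical_name)
        return self.storage[logical_name]

    def get(self, logical_name, user):
        owner = self.catalog.locate(logical_name)
        if owner == self.name:
            return self.serve_local(logical_name, user)
        # Forward to the owning server on the user's behalf.
        return self.peers[owner].serve_local(logical_name, user)


audit = []


def audit_rule(user, name):
    audit.append((user, name))


catalog = Catalog()
server1 = Server("server1", catalog)
server2 = Server("server2", catalog, rules=[audit_rule])
server1.peers["server2"] = server2
server2.put("/grid/data.csv", b"a,b\n1,2\n")
result = server1.get("/grid/data.csv", user="alice")
```

The user only ever talks to the first server; the audit entry is nevertheless written at the site that stores the data, which is where the slide places rule enforcement.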
iRODS - Integrated Rule Oriented Data System • Shared collection assembled from data distributed across remote storage locations • Server-side workflow environment in which procedures are executed at remote storage locations • Policy enforcement engine, with computer actionable rules applied at the remote storage locations • Validation environment for assessment criteria • Consensus building system for establishing a collaboration (policies, data formats, semantics, shared collection)
Overview of iRODS Architecture: iRODS Shows Unified “Virtual Collection” • User, with client, views & manages data • User sees single “Virtual Collection” • Partner’s data: disk, tape, database, filesystem, etc. • My data: disk, tape, database, filesystem, etc. • The iRODS Data Grid installs in a “layer” over existing or new data, letting you view, manage, and share part or all of diverse data in a unified Collection.
Engineering Approach • Preservation Principles - enumerate assertions • Authenticity, integrity, chain of custody, respect des fonds • Preservation Standards - select relevant set • Architecture, metadata, submission, format, assessment • Preservation Engineering - define capabilities • Infrastructure independence, scalability, federation • Preservation Technology - integrate components • Data grids, digital libraries, workflows • Preservation Management - automate policies • Interoperability, policies, capabilities, verification
Preservation Standards • Architectural Model • OAIS, Reference Model for an Open Archival Information System • Representation information for each record • Submission / Archival / Dissemination Information Package (SIP / AIP / DIP) • Data grid - Storage Resource Broker (SRB), integrated Rule Oriented Data System (iRODS) • Digital Library - DSpace services, Fedora digital library middleware • Metadata • Dublin Core • LCDRG, NARA Life Cycle Data Requirements Guide • PREMIS, Preservation Metadata Implementation Strategies • Metadata organization • MPEG-21, ISO/IEC TR 21000-1: MPEG-21 Multimedia Framework • METS, Metadata Encoding and Transmission Standard • OAIS, Reference Model for an Open Archival Information System • Submission / Harvesting • Producer Archive Interface (NASA) • OAI-PMH, Open Archives Initiative - Protocol for Metadata Harvesting • Data format • PDF, XML (4000 formats retrievable on web crawls) • Assessment criteria • RLG/NARA TRAC - Trustworthy Repositories Audit & Certification: Criteria and Checklist. http://wiki.digitalrepositoryauditandcertification.org/pub/Main/ReferenceInputDocuments/trac.pdf
Policy-Virtualization: Automate Operations System-centric Policies & Obligations: Manage retention, disposition, distribution, replication, integrity, authenticity, chain of custody, access controls, representation information, descriptive information requirement, logical arrangement, audit trails, authorization, authentication Domain-specific Policies: Identification & Extraction of Metadata Ingestion Control for Provenance Attribution Processing of Data on Ingestion Creation of multi-resolution images, type-identification, anonymization,… Processing of Data on Access IRB Approval for data access, Data sub-setting, Merging of multiple images, conversion, redaction, …
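As one concrete case of a system-centric policy, a retention and disposition check can be written as a small, machine-actionable function. The sketch below is an assumption: the function name, the whole-year retention period and the February 29 handling are illustrative choices, not rules taken from any archive's policy.

```python
from datetime import date


def disposition_due(accessioned: date, retention_years: int, today: date) -> bool:
    """Return True once the retention period for a record has elapsed."""
    try:
        end = accessioned.replace(year=accessioned.year + retention_years)
    except ValueError:
        # Accessioned on Feb 29 and the end year is not a leap year:
        # fall back to Feb 28 (an illustrative convention).
        end = accessioned.replace(year=accessioned.year + retention_years, day=28)
    return today >= end


due = disposition_due(date(2000, 1, 1), 7, date(2007, 1, 1))
not_yet = disposition_due(date(2000, 1, 1), 7, date(2006, 12, 31))
```

Run periodically over a collection, a check like this turns a written retention schedule into something the system can enforce and an auditor can verify.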
Evolution of Data Grid Technology • Shared collections • Enable researchers at multiple institutions to collaborate on research by sharing data • Focus was on performance, scalability • Digital libraries • Support provenance information and discovery • Integrated with digital library front end services • Preservation environments • Support preservation policies • Build rule-based data management system • Differ in choice of management policies