PAWN: A Novel Ingestion Workflow Technology for Scientific Data Mike Smorul, Joseph JaJa, Yang Wang, Mike McGann, and Fritz McCall
Overall Principles • Distributed, secure ingestion • Use of web/grid technologies – platform independent • Minimal client-side requirements • Ease of integration with data grid systems. • Designed to satisfy data integrity requirements of scientific collections and digital preservation
Producer • Provides data to a data grid based on a prior agreement. • Consists of a management/metadata server and an ingestion client. • Provides initial arrangement, context, and metadata.
Data Grid – Receiving • Receives data from a Producer. • Validates bitstreams and metadata, and sends an acknowledgement to the Producer. • Arranges data into collections and specifies optional publishing and preservation policies. • Publishes bitstreams into the data grid.
Data Grid – Long-Term Stewardship • Implemented using grid technologies. • Uses the existing prototype NARA/UMD/SDSC site. • Automated replication and integrity checking. • Enforces access control and preservation policy.
Ingestion Workflow • Negotiate Submission Agreement. • Workflow Initialization and Submission Information Packet (SIP) creation. • Transfer of SIPs to Data Grid site. • Validation of SIP transfer • Organization of data into collections and transfer into Data Grid.
Submission Agreement • Create a machine-actionable set of rules describing items. • The final Submission Agreement is composed of a METS document for application defaults and a METS Constraint document that limits the METS form to the submission parameters.
METS Overview • Provides a framework for linking the structural organization of objects with metadata. • Using XML namespaces, metadata from various XML schemas can be attached to objects, e.g., Dublin Core, FGDC, etc. • Extensible for more complex metadata. • http://www.loc.gov/standards/mets/
Why METS Constraints? • METS doesn’t provide a way to create machine-interpretable rules describing a collection, e.g., allow only TIFF files in certain structural areas. • METS profiles allow for developer-interpretable rules, not machine-interpretable ones.
METS Constraints • Allow structural, metadata, and file constraints. • Structural constraints: restrict child divs and restrict pointers to divs, files, and other METS documents. • File constraints: restrict files by MIME type or validation tests. • Metadata constraints: restrict the allowed metadata schemas.
METS Constraints - Template
<?xml version="1.0" encoding="UTF-8"?>
<mets …. >
  <!-- validation test section, referenced in the constraints document -->
  <amdSec>
    <techMD ID="xmltest">
      <mdWrap MDTYPE="OTHER">
        <xmlData>
          <val:validation NAME="xmltext" DESCRIPTION="Test for valid xml documents" MIMETYPE="text/xml">
            <val:valgrp required="true">
              <val:valtest name="gif" required="true">
                <val:description>generic gif test for any file</val:description>
              </val:valtest>
            </val:valgrp>
          </val:validation>
        </xmlData>
      </mdWrap>
    </techMD>
  </amdSec>
  <!-- base div structure to use for all clients -->
  <structMap>
    <div ID="ID1" LABEL="Research &amp; Development Records">
      <div ID="ID1.1" LABEL="Research &amp; Development Project Records">
        <div ID="ID1.1.1" LABEL="R&amp;D Project Case Files"/>
        <div ID="ID1.1.2" LABEL="R&amp;D Record Series"/>
      </div>
    </div>
  </structMap>
</mets>
METS Constraints - Rules
<?xml version="1.0" encoding="UTF-8"?>
<metsconstraint …>
  <filegrp ID="FILE1" NAME="Text Document">
    <!-- Files can be identified either by MIMETYPE, or TESTID in skeleton METS document or both -->
    <file NAME="html document" MIMETYPE="text/html"/>
    <file TESTID="xmltext" NAME="xml document" MIMETYPE="text/xml"/>
  </filegrp>
  <!-- Apply rules to predefined div's and link to required file/metadata tests above -->
  <divrule DIVID="ID1" RESTRICTDIV="true" RESTRICTFTPR="true" RESTRICTMPTR="true"/>
  <divrule DIVID="ID1.1" RESTRICTDIV="true" RESTRICTFTPR="true" RESTRICTMPTR="true"/>
  <divrule DIVID="ID1.1.1" RESTRICTMPTR="true">
    <filetype FILEGROUPID="FILE1"/>
  </divrule>
  <divrule DIVID="ID1.1.2" RESTRICTMPTR="true"/>
</metsconstraint>
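To make the idea of machine-interpretable rules concrete, here is a minimal Python sketch of how a receiver might look up which MIME types a div accepts. It is not PAWN's implementation. The XML is a trimmed copy of the rules slide with the elided root attributes (and any namespaces) left out, the helper names are hypothetical, and the reading that a div with no linked file group accepts no files is an assumption.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the rules slide. The elided root attributes and any
# namespaces they declare are omitted here, which is an assumption.
CONSTRAINTS = """
<metsconstraint>
  <filegrp ID="FILE1" NAME="Text Document">
    <file NAME="html document" MIMETYPE="text/html"/>
    <file TESTID="xmltext" NAME="xml document" MIMETYPE="text/xml"/>
  </filegrp>
  <divrule DIVID="ID1" RESTRICTDIV="true" RESTRICTFTPR="true" RESTRICTMPTR="true"/>
  <divrule DIVID="ID1.1.1" RESTRICTMPTR="true">
    <filetype FILEGROUPID="FILE1"/>
  </divrule>
</metsconstraint>
"""


def allowed_mimetypes(root, div_id):
    """Collect the MIME types linked to div_id through its divrule (hypothetical helper)."""
    groups = {group.get("ID"): group for group in root.findall("filegrp")}
    allowed = set()
    for rule in root.findall("divrule"):
        if rule.get("DIVID") != div_id:
            continue
        for filetype in rule.findall("filetype"):
            group = groups.get(filetype.get("FILEGROUPID"))
            if group is None:
                continue  # rule points at an undefined file group
            for entry in group.findall("file"):
                if entry.get("MIMETYPE"):
                    allowed.add(entry.get("MIMETYPE"))
    return allowed


def file_permitted(root, div_id, mimetype):
    """Assumed semantics: a file is allowed only if its MIME type is linked to the div."""
    return mimetype in allowed_mimetypes(root, div_id)


root = ET.fromstring(CONSTRAINTS)
html_ok = file_permitted(root, "ID1.1.1", "text/html")
tiff_ok = file_permitted(root, "ID1.1.1", "image/tiff")
top_level_ok = file_permitted(root, "ID1", "text/html")
```

In this sketch an HTML file is accepted under ID1.1.1, while a TIFF file there, or any file directly under ID1, is refused. Entries that carry a TESTID would additionally need the matching validation test from the template document, which the sketch does not run.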
Initialize Ingestion Workflow • Instantiate the Producer management server to track registered objects. • Establish a working trust relationship with the Data Grid. • Issue ingestion clients, each with a security certificate issued by the producer.
Create SIP • Each client registers locally stored objects with the producer management server, including file types, validation tests, etc. • The client follows the rules in the Submission Agreement. • Producer-wide agents can arrange registered objects to give a broader context.
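The registration step can be pictured with a short Python sketch. The slides do not define the record a client sends, so the field names, the use of SHA-256, and the `register_object` helper are all assumptions made for illustration. The div identifier comes from the template slide.

```python
import hashlib
import mimetypes
import tempfile
from pathlib import Path


def register_object(path, div_id):
    """Build the record a client might send to the management server (hypothetical fields)."""
    path = Path(path)
    data = path.read_bytes()  # acceptable for a sketch; large files should be streamed
    mimetype, _encoding = mimetypes.guess_type(path.name)
    return {
        "name": path.name,
        "div_id": div_id,  # target div in the agreed structMap
        "size": len(data),
        "mimetype": mimetype or "application/octet-stream",
        "sha256": hashlib.sha256(data).hexdigest(),  # digest choice is an assumption
    }


# Demonstration with a throwaway file standing in for a locally stored object.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "report.html"
    sample.write_bytes(b"<p>hi</p>")
    record = register_object(sample, "ID1.1.1")
```

Recording the digest at registration time gives the receiving side a value to check against after transfer.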
SIP Example • The submission packet is designed to contain a self-describing, self-validating set of metadata.
Transfer SIP to Data Grid • Retrieve previously registered SIP from producer management server • Authenticate to data grid • Update tracking information with new location of files in data grid • Data Grid acknowledges transfer completion to producer management server
Validation of SIP Transfer • Check the incoming SIP against the constraint documents. • Ensure object integrity by verifying checksums/cryptographic digests. • Validate bitstreams against the necessary tests. • Record validation results.
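The integrity check can be sketched in a few lines of Python. The slides do not say which digest algorithm PAWN uses, so SHA-256 is an assumption here, and the function names are hypothetical.

```python
import hashlib
import hmac
import io


def digest_stream(stream, algorithm="sha256", chunk_size=1 << 20):
    """Hash a binary stream in chunks so large bitstreams never sit fully in memory."""
    hasher = hashlib.new(algorithm)
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        hasher.update(chunk)
    return hasher.hexdigest()


def verify_object(stream, expected_hex, algorithm="sha256"):
    """Return True when the received bitstream matches the digest recorded in the SIP."""
    actual = digest_stream(stream, algorithm)
    return hmac.compare_digest(actual, expected_hex.lower())


# Demonstration: the digest recorded by the producer versus what the receiver sees.
payload = b"sample bitstream"
recorded = hashlib.sha256(payload).hexdigest()
intact = verify_object(io.BytesIO(payload), recorded)
tampered = verify_object(io.BytesIO(payload + b"!"), recorded)
```

A mismatch would be written to the validation results and lead to a reject message for that object.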
Final transfer to Data Grid • Transfer objects to Data Grid • Update tracking information with new location in Data Grid • Transfer log of data activity into data grid • Return accept/reject messages to producer metadata server
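One way to picture the tracking information is a small record whose location and status change as an object moves through the workflow. This Python sketch is illustrative only: the status names, the `TrackedObject` class, and the example locations are hypothetical and do not reflect PAWN's actual database schema.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    REGISTERED = "registered"    # known to the producer management server
    TRANSFERRED = "transferred"  # SIP delivered to the receiving server
    ACCEPTED = "accepted"        # validated and stored in the data grid
    REJECTED = "rejected"        # failed validation


@dataclass
class TrackedObject:
    object_id: str
    location: str
    status: Status = Status.REGISTERED
    log: list = field(default_factory=list)

    def move(self, new_location, status, note=""):
        """Record a change of location and status, keeping a log of the activity."""
        self.location = new_location
        self.status = status
        self.log.append((status.value, new_location, note))


# Hypothetical locations for one object as it moves through the workflow.
obj = TrackedObject("obj-1", "client://lab-a/run1.xml")
obj.move("staging://receiver/run1.xml", Status.TRANSFERRED)
obj.move("grid://collection/run1.xml", Status.ACCEPTED, "validation passed")
```

The accumulated log corresponds to the record of data activity that the slide says is transferred into the data grid alongside the objects.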
Producer Components • Database to track registered objects • Certificate Authority management • Web service for receiving side security callback • Management server supplies web service interfaces to ingestion clients and management operations. • Clients are designed to be standalone, with security certificates issued by producer
Receiving Components • Receiving servers validate connecting clients and validate SIPs. • Validation services are simple web service calls. • Abstract I/O layer into the data grid.
Recap • Implemented using web technologies • Architecture independent • XML based metadata • METS based SIPs • Add-on constraints describing Submission Agreement • Target release dates: • Beta: April • Release: June/July
More Information • ADAPT website • http://www.umiacs.umd.edu/research/adapt • Papers • Scalable, Reliable Marshalling and Organization of Distributed Large Scale Data Onto Enterprise Storage Environments • PAWN: Producer - Archive Workflow Network in Support of Digital Preservation