SOAPI: a flexible toolkit for implementing ingest and preservation workflows
Mark Hedges
Centre for e-Research, King’s College London
Arts and Humanities Data Service
Background
• Arts & Humanities Data Service
  • Activities included management and preservation of research outputs from UK researchers in arts and humanities
• Centre for e-Research, King’s College London (CeRch)
  • Activities will include management and preservation of research outputs from KCL researchers in all disciplines
  • Among other things …
Context
• Ingestion and preservation of complex material into a digital repository (Fedora-based)
• Unpredictable structures
• Many formats
• Formalised but manual procedures
• Not scalable
• Functional limitations (e.g. preservation metadata, provenance)
Requirements
• Handles complex/compound objects
• Distributed architecture
• Scalable
• Automated processing and user input
• Able to integrate specialised third-party tools (e.g. format conversion)
• Preservation metadata management
• Audit trail/provenance metadata
Approach
• Workflow management tool to create and execute workflows (jBPM)
• Generic interfaces defining common preservation and ingest actions
• Implementations of these interfaces encapsulating units of functionality
• Generic interfaces to wrap third-party tools
• Web service (SOAP & REST) and local implementations
jBPM
• Chain together automated actions and user tasks to form a workflow or “Business Process”
• Open source, flexible, extensible workflow management system
• Bridges the gap between users and developers by giving them a common language
• Packaged as a J2EE application; can run on any J2EE application server, such as JBoss
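The idea of chaining automated actions and user tasks can be illustrated with a minimal sketch in plain Java. It does not use the jBPM API; `WorkflowSketch`, `Workflow`, `Step` and the step names are hypothetical and chosen only for illustration. The sketch shows a process that runs automated steps until it reaches a user task, then waits for a decision before continuing.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration only: not jBPM, and all names are hypothetical.
class WorkflowSketch {

    /** One node in the process: automated, or waiting on a person. */
    static final class Step {
        final String name;
        final boolean userTask;

        Step(String name, boolean userTask) {
            this.name = name;
            this.userTask = userTask;
        }
    }

    static final class Workflow {
        private final List<Step> steps = new ArrayList<>();
        private final List<String> log = new ArrayList<>();
        private int position = 0;

        Workflow automated(String name) {
            steps.add(new Step(name, false));
            return this;
        }

        Workflow userTask(String name) {
            steps.add(new Step(name, true));
            return this;
        }

        /** Runs automated steps until a user task or the end is reached. */
        void run() {
            while (position < steps.size() && !steps.get(position).userTask) {
                log.add("auto:" + steps.get(position).name);
                position++;
            }
        }

        boolean waitingForUser() {
            return position < steps.size() && steps.get(position).userTask;
        }

        /** Records the user's decision for the pending task, then resumes. */
        void completeUserTask(String decision) {
            if (!waitingForUser()) {
                throw new IllegalStateException("no user task pending");
            }
            log.add("user:" + steps.get(position).name + "=" + decision);
            position++;
            run();
        }

        boolean finished() {
            return position == steps.size();
        }

        List<String> log() {
            return log;
        }
    }

    public static void main(String[] args) {
        Workflow wf = new Workflow()
                .automated("identify-format")
                .userTask("approve-ingest")
                .automated("store");
        wf.run();                        // pauses at the user task
        wf.completeUserTask("approved"); // resumes and runs to the end
        wf.log().forEach(System.out::println);
    }
}
```

In a real workflow engine the process definition, the waiting state and the task screens are managed by the engine; the sketch only mirrors the control flow.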
jBPM (XML view)
[Figure: a jPDL (XML) fragment defining part of a workflow]
Interfaces
• Local (Java), SOAP and REST options
• Coarse-grained, e.g.:
  • Create file characterisation
  • Identify file format
  • Migrate file format
  • Normalise file format
  • Check file integrity
  • …
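As an illustration of one coarse-grained action with a local implementation, the following is a minimal sketch of a "check file integrity" interface in Java. The names `IntegritySketch`, `FileIntegrityChecker` and `LocalSha256Checker` are hypothetical and are not taken from SOAPI, and the use of a SHA-256 digest is an assumption made for the example; a SOAP or REST implementation could sit behind the same interface.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical names throughout; this is not the SOAPI code.
class IntegritySketch {

    /** Coarse-grained action: does the file still match its recorded digest? */
    interface FileIntegrityChecker {
        boolean verify(Path file, String expectedHexDigest);
    }

    /** Local implementation; SHA-256 is an assumption made for this sketch. */
    static final class LocalSha256Checker implements FileIntegrityChecker {
        @Override
        public boolean verify(Path file, String expectedHexDigest) {
            try {
                byte[] content = Files.readAllBytes(file);
                return sha256Hex(content).equalsIgnoreCase(expectedHexDigest);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }

    /** Returns the SHA-256 digest of the data as lowercase hex. */
    static String sha256Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(data)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 must be supported by every Java platform implementation.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("integrity-sketch", ".txt");
        byte[] content = "abc".getBytes(StandardCharsets.UTF_8);
        Files.write(tmp, content);
        FileIntegrityChecker checker = new LocalSha256Checker();
        System.out.println(checker.verify(tmp, sha256Hex(content)));
        Files.delete(tmp);
    }
}
```

Keeping the interface this coarse means a workflow step depends only on the action, not on whether it runs locally or as a web service.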
Service implementations
• Configure use of particular implementations, e.g.
  • Format validation: JHOVE and others
  • Format identification: JHOVE, DROID, XENA
  • Format conversion: various
  • Metadata capture: PREMIS
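One way to picture configurable implementations is a lookup from a configuration key to an implementation of a common interface. The sketch below is an assumption about how such a selection could look, not the SOAPI mechanism. The two identifiers are simple stand-ins written for this example and do not call JHOVE, DROID or XENA; the property name `format.identification` and all class names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Properties;

// Hypothetical names; the identifiers are stand-ins, not wrappers for real tools.
class ServiceConfigSketch {

    /** Common interface that every format-identification implementation satisfies. */
    interface FormatIdentifier {
        String identify(String fileName, byte[] leadingBytes);
    }

    /** Stand-in implementation that looks only at the file extension. */
    static final class ExtensionIdentifier implements FormatIdentifier {
        @Override
        public String identify(String fileName, byte[] leadingBytes) {
            String lower = fileName.toLowerCase(Locale.ROOT);
            if (lower.endsWith(".pdf")) return "application/pdf";
            if (lower.endsWith(".png")) return "image/png";
            return "application/octet-stream";
        }
    }

    /** Stand-in implementation that looks only at the leading bytes. */
    static final class SignatureIdentifier implements FormatIdentifier {
        @Override
        public String identify(String fileName, byte[] leadingBytes) {
            // ISO-8859-1 maps each byte to the code point of the same value.
            String head = new String(leadingBytes, StandardCharsets.ISO_8859_1);
            if (head.startsWith("%PDF-")) return "application/pdf";
            if (head.startsWith("\u0089PNG")) return "image/png";
            return "application/octet-stream";
        }
    }

    /** Picks an implementation by the (hypothetical) key "format.identification". */
    static FormatIdentifier configured(Properties config) {
        Map<String, FormatIdentifier> available = new LinkedHashMap<>();
        available.put("extension", new ExtensionIdentifier());
        available.put("signature", new SignatureIdentifier());

        String key = config.getProperty("format.identification", "signature");
        FormatIdentifier chosen = available.get(key);
        if (chosen == null) {
            throw new IllegalArgumentException("unknown implementation: " + key);
        }
        return chosen;
    }

    public static void main(String[] args) {
        Properties config = new Properties();
        config.setProperty("format.identification", "extension");
        System.out.println(configured(config).identify("report.pdf", new byte[0]));
    }
}
```

The calling workflow step sees only `FormatIdentifier`, so swapping one tool for another is a configuration change rather than a code change.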
Re-use example – SHERPA DP 2
Project objectives:
• Investigate methods for the provision of distributed preservation services and alternative methods of content-service provider interaction
• Provide archiving for varied software repositories and web resources
• Perform curatorial activities for diverse types of content, ranging from simple objects to highly structured research data
Website: http://www.sherpadp.org.uk
Contact: stephen.grace@kcl.ac.uk; gareth.knight@kcl.ac.uk
Re-use example – SHERPA DP 2
Content providers supported:
• Repositories: Fedora, CDS Invenio, DSpace, EPrints, DigiTool
• Websites: large dynamic sites, static sites
Automated ingest methods:
• OAI-PMH: METS, MPEG-21 DIDL, MARCXML, Dublin Core and other metadata formats supported
• SWORD: an Atom application profile
Content types supported:
• A wide variety of content types: image collections, static and dynamic websites, datasets and other types of research data
Issues
• Lack of suitable tools in some areas – expensive, outputs unreliable
• Preserving content – what do we actually want to preserve?
• Significant properties – soft concept, hard to quantify (InSPECT)
• Problems with jBPM
Further work
• Make code more robust and fill in gaps
• Integrate task screens with other identity management systems (e.g. Shibboleth federation)
• Incorporate content model-specific processing
• Incorporate disseminators
• Integrate service registry for selecting services to invoke
• Resource discovery metadata generation
Questions
Contact: mark.hedges@kcl.ac.uk