SCAPE Project
• EU project aimed at building a scalable platform for planning and executing computation-intensive processes for the ingestion or migration of large data sets, in order to help automate digital preservation
• Digital preservation: standards + policies + technologies to ensure access to digital objects over time
• “Preservation workflows”, “Digital objects 4 ever”
• 42 months, in the period 2011-2014
• 16 project partners, 22 WPs, 55 deliverables, 88 milestones, zillion mailing lists
The Problem
• Scale of data sets involved in digital preservation:
  • data sets contain a large number of objects
  • the objects can be large in size
  • or complex in structure
  • the data collections can contain heterogeneous objects (objects of different types)
• Data formats change over time and become obsolete
• Migrating digital objects – must ensure success
• Reproducibility of preservation processes and collection of provenance data over the digital object’s entire lifecycle
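The last two points above (checking that a migration succeeded and collecting provenance for it) can be illustrated with a minimal Python sketch. This is not SCAPE code: the function `migrate_with_provenance`, the record fields, and the toy converter and validator are all hypothetical names chosen for illustration.

```python
import hashlib
from datetime import datetime, timezone


def sha256(data: bytes) -> str:
    """Return the hex SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def migrate_with_provenance(name, data, converter, validator, tool_name):
    """Run one migration step and return (output, provenance record).

    converter: callable bytes -> bytes (the migration itself; hypothetical)
    validator: callable (source, target) -> bool (the success check; hypothetical)
    """
    output = converter(data)
    record = {
        "object": name,
        "tool": tool_name,
        "source_sha256": sha256(data),
        "target_sha256": sha256(output),
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "success": bool(validator(data, output)),
    }
    return output, record


# Toy stand-ins for a real format converter and quality-assurance check.
def to_upper(data: bytes) -> bytes:
    return data.upper()


def same_length(source: bytes, target: bytes) -> bool:
    return len(source) == len(target)


output, record = migrate_with_provenance(
    "note.txt", b"hello", to_upper, same_length, "to_upper"
)
```

Keeping the checksums of both the source and the migrated object in the record is one way to make a later re-run of the same step comparable with the original run.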
The Solution – From Project Proposal
• Preservation processes are realised as data pipelines and described formally as Taverna workflows
• Workflows will invoke various services for planning and execution of institutional preservation and quality assurance strategies
• Workflows will be deployed on a large scale (using clouds) and executed over large, distributed and heterogeneous collections of complex digital objects
• The execution of workflows will be controlled by a policy-based system, which will ensure the workflows are in line with the state of the art in digital object representation, file formats, rendering tools, etc., and will detect and report any errors in a preservation process
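As a rough illustration of the policy-based control described above, the sketch below checks the result of a workflow step against a simple policy and reports violations. The `Policy` class, its fields, and `check_result` are assumptions made for this example; the actual SCAPE policy model is not described on the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    """Hypothetical institutional policy: accepted formats and a size limit."""
    accepted_formats: frozenset
    max_size_bytes: int


def check_result(policy, name, fmt, size):
    """Return a list of human-readable policy violations (empty if none)."""
    errors = []
    if fmt not in policy.accepted_formats:
        errors.append(f"{name}: format {fmt!r} is not accepted")
    if size > policy.max_size_bytes:
        errors.append(f"{name}: size {size} exceeds limit {policy.max_size_bytes}")
    return errors


policy = Policy(accepted_formats=frozenset({"pdf/a", "tiff"}), max_size_bytes=1000)
ok_errors = check_result(policy, "scan-001", "tiff", 10)
bad_errors = check_result(policy, "scan-002", "doc", 2000)
```

Here `ok_errors` is empty, while `bad_errors` contains one message per violated rule, which a controlling system could log or use to halt the workflow.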
The Solution – In Practice
• Preservation services are written in various languages
• Use Taverna’s External Tools or Beanshells to invoke them from inside Taverna workflows
• Preservation services need to run locally so that they can be deployed to a cluster and avoid the bottleneck of invoking a Web service
• Convert Taverna workflows into workflows that can be executed and parallelized on Hadoop MapReduce
• Compile Taverna workflows to the intermediate language Jaql, which can be optimized and executed on MapReduce
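The idea of running a local tool over many objects in a map step and aggregating the results in a reduce step can be sketched in plain Python. This is only an analogy for the pattern, not Hadoop, Jaql, or Taverna code: the "tool" is a stand-in one-line Python script run as a subprocess, a thread pool takes the place of the cluster, and all function names are hypothetical.

```python
import subprocess
import sys
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Stand-in for a locally installed preservation tool (assumption): a tiny
# script that reads an object on stdin and prints a guessed format name.
TOOL = [
    sys.executable,
    "-c",
    "import sys; d = sys.stdin.buffer.read(); "
    "print('pdf' if d.startswith(b'%PDF') else 'unknown')",
]


def map_identify(item):
    """Map step: invoke the local tool on one object, return (name, format)."""
    name, data = item
    result = subprocess.run(TOOL, input=data, capture_output=True, check=True)
    return name, result.stdout.decode().strip()


def reduce_counts(pairs):
    """Reduce step: count how many objects were identified per format."""
    return Counter(fmt for _, fmt in pairs)


def run(items, workers=4):
    """Run the map step in parallel, then reduce the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_identify, items))
    return reduce_counts(mapped)


objects = [("a.pdf", b"%PDF-1.4 ..."), ("b.bin", b"\x00\x01"), ("c.pdf", b"%PDF-1.7")]
counts = run(objects)
```

Because the tool is started as a local process for each object, no remote Web service sits in the path of every call, which is the bottleneck the slide refers to.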
Benefits to Us
• Strengthened External Tools plugin and improved support for running external services
• Taverna workflow (potentially containing only local services) -> parallelizable Jaql workflow executable on a MapReduce cloud
• App4Andy-style applications that process large data, use local scripts and need parallelization/optimization
• Some extensions to myExperiment (“run wf on a cloud”) / BioCatalogue – not sure how reusable
Other Projects Affecting SCAPE
• External Tools plugin for Taverna
• Provenance in Taverna
  • Browsing, exporting
  • We design a Taverna wf, but actually run a Jaql wf – so provenance is not being captured by Taverna?
• Next Generation Workbench – could do with a more advanced UI
• SCUFL2 – for conversion to Jaql workflows
  • Easier to manipulate than the current t2flow?
Summary
• Contributions
  • Taverna Workbench for workflow design
  • myExperiment VRE for sharing workflows
  • BioCatalogue catalogue for curating preservation services
  • Ontology development
• Expectations
  • Scalability in workflow execution
  • Experience with a new domain – digital libraries