The SCAPE Project: Overview, Objectives, and Approaches
Dr. Rainer Schmidt, AIT Austrian Institute of Technology GmbH
APA 2011 Conference, London, 8-9 November 2011
SCAPE – what is it about?
• Planning and managing resource-intensive (digital) preservation processes
  • such as large-scale ingestion, analysis, or modification of digital data sets
• Focus on scalability, robustness, and automation
SCAPE is a follow-up to the highly successful FP6 IP Planets.
SCAPE Project Data
• Project instrument: FP7 Integrated Project
  • 6th Call
  • Objective ICT-2009.4.1: Digital Libraries and Digital Preservation
  • Target outcome (a): Scalable systems and services for preserving digital content
• Duration: 42 months
  • February 2011 – July 2014
• Budget: 11.3 million euro
  • Funded: 8.6 million euro
SCAPE Project Overview
SCAPE will enhance the state of the art in digital preservation in three ways:
• A scalable infrastructure and tools for preservation actions
• Automated, quality-assured preservation workflows
• Integration of these components with policy-based automated preservation planning and watch
SCAPE results will be validated in three large-scale testbeds:
• Digital Repositories
• Web Content
• Research Data Sets
The SCAPE Consortium brings together a broad spectrum of expertise from
• Memory institutions
• Data centres
• Research labs
• Universities
• Industrial firms
Selected SCAPE Data Collections
• Data collections provided by 6 institutions
• Complete Web archives and snapshots of public domains (.dk, .it, .eu, gov.uk, …)
• Millions of digitised newspapers, posters, law gazettes, and 16th-19th century broadsheets
• Collections of multi-file objects such as books, papyri, and incunabula (up to 230 MB/object)
• 100,000 images of East Asian manuscripts in different quality levels
• TBs of voluntary deposit in a wide variety of formats
• 500 TB of broadcast radio and TV output (up to 73 GB/object)
• Many hundreds of thousands of data sets from synchrotron, neutron, and muon instruments
• 30,000 items from a selection of open access journal articles
[Image from digitalbevaring.dk]
Selected SCAPE Testbed Scenarios
• Characterise large video files
  • The master MPEG2 files are so large that it is difficult to apply JHOVE, and the detail it provides is insufficient. A detailed characterisation of the MPEG2 streams is needed in order to identify technical dependencies for extracting from or rendering the MPEG2 stream. This would enable preservation risks related to current access services to be monitored, and action to be taken as necessary to ensure continued access and preservation.
• Carry out large-scale migrations
  • Migrating from one format to another introduces the possibility of damaging the content or failing to capture significant properties of the original in the resulting destination format.
  • Specific requirements include:
    • Solution tools that operate reliably at scale (80 TB, 2 million pages)
    • Automated QA, ideally with no manual intervention on a file-by-file basis
    • QA performed by a process independent of the migration
    • Demonstrating strong evidence of significant properties being captured in the destination format
• Quality assurance in web harvesting
  • For large-scale crawls, automation of the quality control processes is a necessary requirement. Currently, this process relies on random sampling and very basic quantitative checks.
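The migration requirements above (automated QA, run independently of the migration, with evidence that significant properties survive) can be illustrated with a toy example. This is only a sketch under stated assumptions: the "migration" is a stand-in that re-encodes Latin-1 text as UTF-8, and the functions `migrate`, `significant_properties`, and `qa_check`, as well as the chosen properties, are hypothetical and not SCAPE tools.

```python
import hashlib


def migrate(data: bytes) -> bytes:
    """Toy migration (assumption): re-encode Latin-1 text as UTF-8."""
    return data.decode("latin-1").encode("utf-8")


def significant_properties(data: bytes, encoding: str) -> dict:
    """Extract properties that must survive migration (illustrative choice)."""
    text = data.decode(encoding)
    return {
        "characters": len(text),
        "lines": text.count("\n") + 1,
        "content_sha256": hashlib.sha256(text.encode("utf-8")).hexdigest(),
    }


def qa_check(source: bytes, migrated: bytes) -> list:
    """Independent QA: re-read both files and list the properties that differ.

    The check does not reuse any state from migrate(); it only looks at the
    source and destination bytes, so it can run as a separate process.
    """
    before = significant_properties(source, "latin-1")
    after = significant_properties(migrated, "utf-8")
    return [name for name in before if before[name] != after[name]]


source = "Müller\nStraße".encode("latin-1")
print(qa_check(source, migrate(source)))       # no differences expected
print(qa_check(source, migrate(source)[:-1]))  # a truncated result is flagged
```

An empty list means every tracked property was preserved; a non-empty list names the properties that changed and needs no manual inspection of the file itself.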
Selected SCAPE Challenges
• Bridging the gap between experimental workflows and production scenarios
  • e.g. coping with amount and size of payload data
• Employing data-intensive technologies
  • for processing binary content
  • generation and evaluation of workflow results
• Exploiting data locality
  • Avoiding data transfer by placing processors next to the data
• Repository Integration
  • Horizontal scalability
  • Scalable ingest/access
• Preservation Planning
  • Automation of monitoring and decision processes
• Automated Quality Assurance
  • Advanced Image Processing
• Scientific data
  • How to preserve contextual information?
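The data-locality challenge listed above (placing processors next to the data) can be sketched as a small scheduling routine. This is an illustrative assumption, not the SCAPE scheduler: the function `schedule`, the block names, and the node names are hypothetical.

```python
def schedule(block_locations: dict, nodes: list) -> dict:
    """Assign each data block to a worker node, preferring nodes that
    already store a replica of the block (no network transfer needed).

    Returns {block: (node, is_local)}.
    """
    load = {node: 0 for node in nodes}
    assignment = {}
    for block in sorted(block_locations):
        # Nodes that hold the block and are available as workers.
        replicas = [n for n in block_locations[block] if n in load]
        # Fall back to any node (remote read) when no replica holder exists.
        candidates = replicas or nodes
        # Pick the least-loaded candidate; ties are broken by node name.
        chosen = min(candidates, key=lambda n: (load[n], n))
        load[chosen] += 1
        assignment[block] = (chosen, chosen in replicas)
    return assignment


plan = schedule(
    {"b1": ["n1", "n2"], "b2": ["n1"], "b3": ["n3"]},
    nodes=["n1", "n2"],
)
for block, (node, is_local) in plan.items():
    print(block, node, "local" if is_local else "remote")
```

Blocks `b1` and `b2` are processed where they are stored; `b3` has no replica on an available node, so it is the only one that requires a transfer.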
SCAPE Solutions
• SCAPE Platform
  • Environment for carrying out preservation workflows at scale
  • Software package and shared deployment (the Central Instance)
  • Dynamic deployment of environments
    • virtualisation and cloud-based technologies
    • support for native tools and environments
  • Builds upon data-centric execution platform (Hadoop/Stratosphere)
  • Simple and natural tool support and automated mapping of graphical (Taverna-based) workflows to parallel programming model
  • Three levels of parallelization
    • Distribution of files
    • Splitting content
    • Parallel query execution
  • Repository integration based on two open reference implementations
[Figure labels: compile; PPL dataflow program; multi-stage M/R flow]
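The first two levels of parallelization named above (distribution of files, splitting content) can be sketched on a single machine. This is a minimal illustration, assuming a thread pool as a stand-in for a Hadoop cluster; `characterise`, `distribute_files`, and `count_lines_split` are hypothetical names, and parallel query execution is not covered.

```python
from concurrent.futures import ThreadPoolExecutor


def characterise(item):
    """Per-file task: compute simple properties of one payload."""
    name, payload = item
    return name, {"size": len(payload), "lines": payload.count(b"\n")}


def distribute_files(files: dict, workers: int = 4) -> dict:
    """Level 1: each file is handled by an independent task."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(characterise, files.items()))


def count_lines_split(payload: bytes, chunk_size: int = 4, workers: int = 4) -> int:
    """Level 2: one large object is split into chunks that are processed
    in parallel (map) and then combined (reduce)."""
    chunks = [payload[i:i + chunk_size] for i in range(0, len(payload), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_counts = pool.map(lambda chunk: chunk.count(b"\n"), chunks)
        return sum(partial_counts)


print(distribute_files({"a.txt": b"x\ny\n", "b.txt": b"z"}))
print(count_lines_split(b"a\nbb\nccc\n", chunk_size=2))
```

Counting single bytes is safe to split at arbitrary offsets; real formats usually need format-aware split points, which is part of what makes content splitting harder than file distribution.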
SCAPE Solutions
• OPF Results Evaluation Framework (REF)
  • Large RDF quadstore for storing SCAPE workflow results
    • developed in cooperation with University of Southampton
  • Shared database to publish and query these results
  • Supports progress tracking and monitoring over time
  • Input for Preservation Planning and Watch
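To make the idea of a quadstore for workflow results concrete, here is a minimal in-memory sketch. It is not the REF implementation or its API: the `QuadStore` class, the `ex:` and `run:` identifiers, and the stored values are all made up for illustration. The fourth element of each quad names the graph, used here to identify the workflow run that produced a statement.

```python
from typing import NamedTuple


class Quad(NamedTuple):
    subject: str
    predicate: str
    obj: str
    graph: str  # named graph, here one per workflow run (assumption)


class QuadStore:
    """Tiny in-memory quad store with wildcard pattern matching."""

    def __init__(self):
        self._quads = set()

    def add(self, subject, predicate, obj, graph):
        self._quads.add(Quad(subject, predicate, obj, graph))

    def match(self, subject=None, predicate=None, obj=None, graph=None):
        """Return all quads matching the pattern; None acts as a wildcard."""
        pattern = (subject, predicate, obj, graph)
        return sorted(
            quad for quad in self._quads
            if all(want is None or want == have for want, have in zip(pattern, quad))
        )


store = QuadStore()
store.add("file:page1.tif", "ex:detectedFormat", "image/tiff", "run:A")
store.add("file:page1.tif", "ex:detectedFormat", "image/jp2", "run:B")
store.add("file:page2.tif", "ex:detectedFormat", "image/tiff", "run:A")

# Track one object across runs, or inspect everything a single run produced.
print(store.match(subject="file:page1.tif"))
print(store.match(graph="run:A"))
```

Keeping the run as the graph component is what allows results to be compared over time without overwriting earlier statements.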
SCAPE Solutions
• Context-aware Planning and Watch
  • Automated watch monitoring
    • trends in web harvests and repositories
    • linked with Results Evaluation Framework (REF) database
  • Formalized policy model and representation
    • using semantic technologies
  • Automated Planning
    • Building on the Planets PLATO tool
    • Key factors and decision criteria
    • Automated policy-driven planning
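The interplay of a formalized policy and automated watch can be sketched as follows. This is a simplified assumption of how such a check might look, not the SCAPE policy model: the `Policy` fields, the `watch` function, and the example format profile are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    """Machine-readable policy (illustrative fields only)."""
    accepted_formats: frozenset
    max_at_risk_share: float  # tolerated fraction of objects in other formats


def watch(profile: dict, policy: Policy) -> dict:
    """Compare a collection profile {format: object count} with the policy
    and report whether preservation planning should be triggered."""
    total = sum(profile.values())
    at_risk = {fmt: n for fmt, n in profile.items()
               if fmt not in policy.accepted_formats}
    share = sum(at_risk.values()) / total if total else 0.0
    return {
        "at_risk_share": share,
        "at_risk_formats": sorted(at_risk),
        "trigger_planning": share > policy.max_at_risk_share,
    }


policy = Policy(frozenset({"image/tiff", "image/jp2"}), max_at_risk_share=0.05)
print(watch({"image/tiff": 900, "image/x-pcx": 100}, policy))
```

Running the same check on successive profiles gives a simple form of trend monitoring: planning is triggered only when the collection drifts past the threshold stated in the policy.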
SCAPE Solutions
• Automated Quality Assurance
  • QA in web harvesting and digitisation through automated comparison of rendered pages
  • Characterization – feature extraction
    • Level 1 – Metadata information: using characterization components
    • Level 2 – Global content description: discriminant global features for individual media types
    • Level 3 – Structural content description: detect structural similarities in images
  • Comparison
    • Discrete solution and smart metrics (level 2+3)
    • Development of metrics and measures of similarity, quality, relationship to user perception
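A toy version of the first two characterization levels and a comparison step might look like this. It is a sketch under assumptions: images are plain nested lists of grayscale values, the global feature is an intensity histogram, and histogram intersection is used as a simple similarity measure chosen for illustration; none of this is the SCAPE QA toolset, and level 3 (structural description) is not covered.

```python
def metadata(image):
    """Level 1: basic metadata (here only the pixel dimensions)."""
    return {"height": len(image), "width": len(image[0]) if image else 0}


def histogram(image, bins=4, max_value=255):
    """Level 2: a global feature, the normalised intensity histogram."""
    counts = [0] * bins
    for row in image:
        for pixel in row:
            counts[min(pixel * bins // (max_value + 1), bins - 1)] += 1
    total = sum(counts) or 1
    return [count / total for count in counts]


def similarity(a, b, bins=4):
    """Compare two images: 0.0 if the metadata differ, otherwise the
    histogram intersection (1.0 means identical histograms)."""
    if metadata(a) != metadata(b):
        return 0.0
    return sum(min(x, y) for x, y in zip(histogram(a, bins), histogram(b, bins)))


original = [[0, 0], [255, 255]]
rendered = [[0, 0], [0, 255]]
print(similarity(original, original))
print(similarity(original, rendered))
```

A global histogram ignores where pixels are located, so two pages with rearranged content can still score highly; that limitation is the motivation for a structural level on top of the global one.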
Selected Achievements
• Public Website: http://www.scape-project.eu/
• Development Infrastructure hosted by the Open Planets Foundation and GitHub: http://wiki.opf-labs.org/display/SP/Home
• First Deliverables available for download
• Publications
  • 13 in the first nine months, including 6 at iPres last week
  • Report: Comparative analysis of identification tools
  • Report: Analysis of scalability challenges for Digital Object Repositories - Classification and design of approaches
• Platform Infrastructure
  • 10 nodes (dual-core), 20 TB experimental cluster hosted by AIT
  • Virtualization based on Xen + Eucalyptus
  • Hardware for the Platform's Central Instance currently being set up within the data centre at IMF