D4Science: An e-Infrastructure for Facilitating Data Management, Process, Sharing, and Access. Digital Repositories – Linked Open Data: the possible Role of D4Science 16-17 December 2010 FAO (Rome). Pasquale Pagano National Research Council of Italy pasquale.pagano@isti.cnr.it.
Assumptions • Consolidated facts: • Very rich applications and data collections are currently maintained by a multitude of authoritative providers • Different problems require different execution paradigms: batch, map-reduce, synchronous call, message-queue, … • Key distributed computation technologies exist: grid (gLite and Globus), distributed resource management (Condor), clusters (Hadoop), … • Several standards are adopted in the same domain • Societal observations: a rich variety of protocols, models, and formats • Creates barriers to the usage of resources • Dramatically delays new exploitation patterns • Technical observations: • Heterogeneity of protocols, models, and formats increases load • Increased load leads to more failures
D4Science Vision • D4Science objectives: • hide heterogeneity, i.e. abstract over differences in location, protocol, and model; • embrace heterogeneity, i.e. allow for multiple locations, protocols, and models; • Technical goals • no bottlenecks: scale no less than the interfaced resources • no outages: keep failures partial and temporary • autonomicity: system reacts and recovers
From a testbed to a production ecosystem [Timeline figure: Oct. ’04, Nov. ’07, Jan. ’08, Oct. ’09, Dec. ’09, Sept. ’11; vertical axis: functionality; layers: gCube, gLite]
Infrastructure Exploitation (Production) • 30 Nodes: CNR, NKUA, ESA, FAO, UNIBASEL • Collections: 25 Data (EEA, MERIS, AATSR), 69 Metadata (es, ISO19115, eiDB) • 29 Nodes: CNR, NKUA, FAO, UNIBASEL • Collections: 15 Data (AquaMaps, Fact Sheets, Country Maps), 28 Metadata (FARM_dc, aquamaps) • Functionality: • Integration with gPod • Geographical and text search • Search by metadata • Personal workspace • Objects annotation • Report generation • Maps generation • Time Series management • More than 500 autonomic Web Services
gCube as a Digital Library System • A Digital Library System is a possibly distributed system that collects, manages and preserves for the long term rich digital content, and offers to its user communities specialised functionality on that content, of measurable quality and according to codified policies [The Digital Library Reference Model] • The gCube data infrastructure enabling framework provides DL functionality by: • Federating existing digital content • maintained in a variety of tailored repository systems • Supporting the generation of new digital content • by exploiting heterogeneous computational platforms • Providing discovery and access capabilities • on diversely described and modeled digital content
gCube as an e-Infrastructure ecosystem enabling framework • By bridging a number of well-established systems and standards from various domains, including high-energy physics, biodiversity, and fishery and aquaculture resources management, gCube realises an e-Infrastructure ecosystem
How does it work? • A Virtual Organisation (VO) specifies what is shared, who is allowed to share, and the conditions under which sharing can occur • A Virtual Research Environment (VRE) identifies a subset of resources assigned to a subset of users via interfaces for a limited timeframe and at little or no cost for the providers of the infrastructure
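The VO/VRE relationship described on this slide can be sketched as a small data model. This is a minimal illustration, not gCube code: the class names, field names, and sample resources and users are all hypothetical. It only encodes the two constraints the slide states: a VRE takes a subset of the VO's resources and users, and it is valid for a limited timeframe.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class VirtualOrganisation:
    """What is shared and who may share it (hypothetical model)."""
    name: str
    resources: frozenset
    members: frozenset


@dataclass(frozen=True)
class VirtualResearchEnvironment:
    """A subset of VO resources assigned to a subset of VO users for a limited time."""
    name: str
    vo: VirtualOrganisation
    resources: frozenset
    users: frozenset
    valid_from: date
    valid_until: date

    def __post_init__(self):
        # A VRE may only draw on what its VO already shares.
        if not self.resources <= self.vo.resources:
            raise ValueError("VRE resources must be a subset of the VO resources")
        if not self.users <= self.vo.members:
            raise ValueError("VRE users must be a subset of the VO members")

    def can_access(self, user, resource, on):
        """True if `user` may use `resource` on day `on`."""
        return (
            user in self.users
            and resource in self.resources
            and self.valid_from <= on <= self.valid_until
        )


# Hypothetical sample data, for illustration only.
vo = VirtualOrganisation(
    name="example-vo",
    resources=frozenset({"species-db", "map-service", "time-series-store"}),
    members=frozenset({"alice", "bob", "carol"}),
)
vre = VirtualResearchEnvironment(
    name="example-vre",
    vo=vo,
    resources=frozenset({"species-db", "map-service"}),
    users=frozenset({"alice", "bob"}),
    valid_from=date(2010, 1, 1),
    valid_until=date(2010, 12, 31),
)

print(vre.can_access("alice", "species-db", date(2010, 6, 1)))  # True
print(vre.can_access("carol", "species-db", date(2010, 6, 1)))  # False: not a VRE user
```

Checking the subset constraints at construction time means a VRE can never expose a resource that its VO has not agreed to share.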
Why is sharing through VREs key? • Through a VRE, groups of users have controlled access to distributed data and services integrated under a personalised interface.
Why is sharing through VREs key? • A Virtual Research Environment (VRE) supports cooperative activities: • Metadata cleaning, enrichment, and transformation by exploiting mapping schemas, controlled vocabularies, thesauri, and ontologies • Process refinement and showcase implementation (restricted to a set of users) • Data assessment (required to make data publicly exploitable by VO members) • Validation by expert users of products generated through data elaboration or simulation
Why is sharing through VREs key? • The integrated environment of a VRE provides a set of functionality to support and perform research activities: • integrate heterogeneous data and services • process information on demand and ingest the results • share data and processes with other users • customize collections of information • store user actions and exploit them for further use • aggregate relevant information into ad-hoc information sources and keep them updated
VRE Facilities • Workspace: a virtual desktop to organize the working environment • Species Maps Generation and Time Series Management: tools supporting specific tasks • Report Management: a virtual live document to describe research results • Underlying services: Search, Annotation, Visualisation, Storage, Transformation, …
Workspace • A collaboration-oriented suite providing • seamless access and organisation facilities on a rich array of objects (e.g. Information Objects, Queries, Files, Templates) • mediation between external world objects, systems and infrastructures (import/export/publishing) • common file-manager interactions (drag & drop, contextual menu) • an effective facility for sharing rich objects
Species Distribution Maps Generation • AquaMaps is an application* • tailored to predict global distributions of marine species, initially designed for marine mammals and subsequently generalised to marine species • that generates color-coded species range maps using half-degree latitude and longitude blocks • by interfacing several databases and repository providers * Algorithm by Kaschner et al. 2006
Species Distribution Maps Generation • AquaMaps execution is based on the gCube Ecological Niche Modelling Suite, which allows the extrapolation of known species occurrences • to determine environmental envelopes (species tolerances) • to predict future distributions by matching species tolerances against local environmental conditions (e.g. climate change and sea pollution) • Very large volume of input and output data: HSPEC native range 56,468,301 - HSPEC suitable range 114,989,360 • Very large number of computations: one multispecies map computed on 6,188 half-degree cells (over 170k) and 2,540 species requires 125 million computations (Eli E. Agbayani, FishBase Project/INCOFISH WP1, WorldFish Center)
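Matching a species' tolerances against the conditions in one cell can be illustrated with a small sketch. The slides do not give the matching rule, so the code assumes a trapezoidal response per environmental parameter (0 outside the absolute range, 1 inside the preferred range, linear in between) with the per-parameter scores multiplied together. The function names, parameter names, and envelope values are hypothetical.

```python
def envelope_score(value, minimum, pref_min, pref_max, maximum):
    """Trapezoidal suitability of one environmental value, in [0, 1] (assumed rule)."""
    if value < minimum or value > maximum:
        return 0.0
    if pref_min <= value <= pref_max:
        return 1.0
    if value < pref_min:
        # Rising edge between the absolute and the preferred minimum.
        return (value - minimum) / (pref_min - minimum)
    # Falling edge between the preferred and the absolute maximum.
    return (maximum - value) / (maximum - pref_max)


def cell_suitability(cell, envelopes):
    """Multiply the per-parameter scores for one half-degree cell."""
    score = 1.0
    for parameter, bounds in envelopes.items():
        score *= envelope_score(cell[parameter], *bounds)
    return score


# Hypothetical envelope: (minimum, preferred minimum, preferred maximum, maximum).
envelopes = {
    "temperature": (2.0, 10.0, 20.0, 28.0),
    "salinity": (30.0, 33.0, 36.0, 38.0),
}

print(cell_suitability({"temperature": 15.0, "salinity": 34.0}, envelopes))  # 1.0
print(cell_suitability({"temperature": 6.0, "salinity": 34.0}, envelopes))   # 0.5
print(cell_suitability({"temperature": 30.0, "salinity": 34.0}, envelopes))  # 0.0
```

Because every cell and species pair is scored independently, the work splits naturally across many nodes, which is consistent with the computation volumes quoted on the slide.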
Time Series Management • Offers a set of tools to manage capture statistics • Supports the complete TS lifecycle • Supports validation, curation, and analysis • Provides support for data reallocation • Produces uniform data sets
Time Series • Offers a set of tools to operate on capture statistics • Support for multiple key families • Filtering, grouping, and aggregation • Union • Mining • Automatically produces provenance information
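The operations listed above (filtering, grouping and aggregation, union, automatic provenance) can be sketched in a few lines. This is an illustrative model only, not the gCube Time Series API: the `TimeSeries` class, its method names, and the sample capture rows are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class TimeSeries:
    """Rows of capture statistics plus the list of steps that produced them."""
    rows: list
    provenance: list = field(default_factory=list)

    def _derive(self, rows, step):
        # Every operation returns a new series and appends one provenance entry.
        return TimeSeries(rows, self.provenance + [step])

    def filter(self, description, predicate):
        kept = [row for row in self.rows if predicate(row)]
        return self._derive(kept, f"filter({description})")

    def aggregate(self, keys, value):
        """Group by `keys` and sum the `value` column."""
        totals = defaultdict(float)
        for row in self.rows:
            totals[tuple(row[k] for k in keys)] += row[value]
        rows = [
            dict(zip(keys, group), **{value: total})
            for group, total in sorted(totals.items())
        ]
        return self._derive(rows, f"aggregate(by={list(keys)}, sum={value})")

    def union(self, other):
        return self._derive(self.rows + other.rows, f"union(+{len(other.rows)} rows)")


# Hypothetical sample data, for illustration only.
a = TimeSeries([
    {"year": 2007, "area": "A1", "species": "cod", "tonnes": 10.0},
    {"year": 2007, "area": "A2", "species": "cod", "tonnes": 5.0},
    {"year": 2008, "area": "A1", "species": "cod", "tonnes": 7.0},
])
b = TimeSeries([{"year": 2008, "area": "A2", "species": "hake", "tonnes": 3.0}])

result = (
    a.union(b)
    .filter("species == cod", lambda row: row["species"] == "cod")
    .aggregate(("year",), "tonnes")
)
print(result.rows)        # [{'year': 2007, 'tonnes': 15.0}, {'year': 2008, 'tonnes': 7.0}]
print(result.provenance)  # ['union(+1 rows)', 'filter(species == cod)', "aggregate(by=['year'], sum=tonnes)"]
```

Recording provenance inside each operation, rather than asking the user to document it, is one simple way to obtain the automatic provenance the slide mentions.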
Report Management • A collaboration-oriented suite providing for • template-oriented, feature-rich and flexible document format definition • effective and infrastructure-integrated report compilation (drag & drop workspace items) • collaborative and distributed editing (workspace based) • standard-based report materialisation (HTML, OpenXML)
PE2ng Definition Process Execution Engine (PE2ng, pronounced as ‘peng’) is a system to manage the execution of software elements in a distributed infrastructure under the coordination of a composite plan that defines the data dependencies among its actors. • Close relatives: • Job Management Systems (Condor) • Distributed Computing Frameworks (MPI, MapReduce)
More Info • The motivation for PE2ng is the instantiation of a liberal computational infrastructure that: • Builds on existing infrastructures • Integrates existing technologies • Supports several software paradigms without performance compromises • Provides a powerful, flow-oriented processing model • PE2ng’s dual nature: • Coordinator of external computational infrastructures • Native computational infrastructure provider and manager
PE2ng and the Cloud • Exploits all modern cloud paradigms (PaaS, SaaS, IaaS) • Provides a PaaS: • Based on Streams (gCubeResultset – gRS2) • Support for dynamic infrastructure reorganisation • Decision making offloaded to Cloud Management • Direct interaction with cloud management: under implementation • Supports SaaS via a combination of gCube services • Fits several Infrastructures: • No built-in dependencies for computation or storage
Binding together infrastructures • Single Infrastructure • Utilise capacities to the fullest • Bound “for better or for worse” • Bend business logic to fit • One size fits all? • Infrastructure ecosystem • Don’t hide Infrastructures • Not yet another layer • Choose infrastructure to fit needs • Turn Infrastructure into a utility • Unrestrictive Meta-Infrastructure • Single submission, monitoring, access • Single language for “Programming in the Large” and “Small” • PE2ng? …
Terms used in PE2ng • Workflow: a high-level plan that binds together conceptual operations for the implementation of a task. • Execution Plan: a plan for the invocation of code components (aka invocables, i.e. services, binary executables, scripts, …) that ensures that prerequisite data are prepared and delivered to their consumers by defining the flow of data and/or control. • Resource: software, data, network, systems, … • Registry: a directory service where resources are listed for discovery.
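The idea of an execution plan, in which invocables run only once their prerequisite data are available, can be sketched with a dependency graph and a topological ordering. This is a minimal illustration using Python's standard `graphlib` module (Python 3.9+), not the PE2ng plan language: the invocable names and their dependencies are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical plan: each invocable maps to the invocables whose output it consumes.
plan = {
    "fetch_occurrences": set(),
    "fetch_environment": set(),
    "compute_envelopes": {"fetch_occurrences", "fetch_environment"},
    "predict_distribution": {"compute_envelopes", "fetch_environment"},
    "render_map": {"predict_distribution"},
}


def execution_waves(plan):
    """Group invocables into waves; everything in one wave could run in parallel."""
    sorter = TopologicalSorter(plan)
    sorter.prepare()  # raises graphlib.CycleError if the dependencies are cyclic
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # all invocables whose inputs are available
        waves.append(ready)
        sorter.done(*ready)                 # mark them finished, unlocking consumers
    return waves


waves = execution_waves(plan)
for number, wave in enumerate(waves, start=1):
    print(number, wave)
# 1 ['fetch_environment', 'fetch_occurrences']
# 2 ['compute_envelopes']
# 3 ['predict_distribution']
# 4 ['render_map']
```

A real engine would dispatch each wave to remote nodes and stream data between invocables rather than wait for whole waves to finish; the sketch only shows how the data dependencies constrain the order of invocation.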
gCube Data Transformation Service (gDTS) • A service that tackles the transformation of data among various manifestations • Features: • Distributed (PE2ng based) • Manifestation and transformation agnostic • “Intelligent”, objective-driven operation • Why so important? • Plays a vital role in several data staging steps within the infrastructure • Seems to cover out of the box several needs of “interoperability” as conceived by the communities
gDTS case [Figure: transformation paths from Input A to Output B found through a Transformers Registry (transformers T1–T4) via intermediate manifestations C, D, and E; one path takes 3 hops, another 2 hops; labels: Conf A, Conf B]
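Finding a chain of transformers that converts one manifestation into another can be sketched as a shortest-path search over a registry. This is an illustration under assumptions, not the gDTS algorithm: the registry contents below are hypothetical and are not a reading of the figure, and breadth-first search is simply one straightforward way to prefer the chain with the fewest hops.

```python
from collections import deque

# Hypothetical registry: transformer name -> (source manifestation, target manifestation).
REGISTRY = {
    "T1": ("A", "C"),
    "T2": ("C", "D"),
    "T3": ("D", "B"),
    "T4": ("C", "B"),
}


def shortest_chain(registry, source, target):
    """Return the shortest list of transformer names turning source into target, or None."""
    queue = deque([(source, [])])
    seen = {source}
    while queue:
        manifestation, chain = queue.popleft()
        if manifestation == target:
            return chain
        for name, (src, dst) in sorted(registry.items()):
            # Follow each transformer that accepts the current manifestation.
            if src == manifestation and dst not in seen:
                seen.add(dst)
                queue.append((dst, chain + [name]))
    return None


print(shortest_chain(REGISTRY, "A", "B"))  # ['T1', 'T4'] (2 hops)

# Without T4 the only remaining chain takes 3 hops.
without_t4 = {name: edge for name, edge in REGISTRY.items() if name != "T4"}
print(shortest_chain(without_t4, "A", "B"))  # ['T1', 'T2', 'T3']
```

An objective-driven service could weigh chains by cost or quality rather than by hop count alone; the search structure stays the same, with the queue replaced by a priority queue.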
VRE Summary • D4Science approach: • Heterogeneous resources are accessible in a common ecosystem of resources • regardless of their locations, technologies, and protocols • Different communities have access to different views • according to the conditions under which the sharing can occur • Each community can define its own virtual research environment to satisfy specific needs • for a limited timeframe and at no cost for the providers of the resource • Several virtual research environments can coexist • without interfering with each other, even when competing for the same resources
Conclusions • Facts • Very rich services and data collections are currently maintained by a multitude of authoritative providers • Several standards are adopted in the same domain • Interoperability approaches are key to exploiting such richness • D4Science offers a variety of patterns, tools, and solutions • to interconnect • Heterogeneous digital content • Heterogeneous repository systems • Heterogeneous computation platforms with a rich set of free-to-use tailored services • to decrease the cost of adoption • to reduce the time to market of new ideas • to deal with the plethora of standards
Supported Standards • WS-* • WSRF • WS-BPEL • JDL • JSDL • Glue Schema (part) • X-* • DC, TEI, ISO etc • JSR (several) • GSI-Security • XACML • SAML • OpenSearch • OGC related • Comply with: • OAI-PMH • OAI-ORE
Supported Standards • WSRF Specifications • WS-ResourceProperties (WSRF-RP) • WS-ResourceLifetime (WSRF-RL) • WS-ServiceGroup (WSRF-SG) • WS-BaseFaults (WSRF-BF) • JSR • 168: Simple Portlets • 286: update of 168 • 160: JMX • WSN Specifications: • WS-BaseNotification • WS-Topics • (WS-BrokeredNotification) • … • WS-* Standards • SOAP • WSDL • WS-Addressing • … • ISO: • ISO3166 countries • ISO4217 currencies • ISO19115 geo-location • … • X-* • XML • XSD • XSL • XSLT • XPath • XQuery • OGC • Web Coverage Processing Service • Web Coverage Service • Web Feature Service • Web Map Context • Web Map Service • Web Map Tile Service • Web Processing Service • Web Service Common • OGF Standard: • Glue Schema (2) • … • Comply with: • OAI-PMH • OAI-ORE
Find us • www.gcube-system.org www.d4science.eu Donatella Castelli D4Science-II Project Director donatella.castelli@isti.cnr.it Pasquale Pagano D4Science-II Technical Director pasquale.pagano@isti.cnr.it Thank You For Your Attention