Introduction to gCube: promoting an ecosystem approach to controlled resource sharing. gCube - FAO's Information Systems Architecture Forum, FAO, Rome, 25 January 2011. Pasquale Pagano, pasquale.pagano@isti.cnr.it. www.d4science.eu
D4Science-II Challenges: D4Science Ecosystem Challenges • Heterogeneous resources • Heterogeneous computational platforms • Rich set of legacy applications • Multiple administrative domains • Evolving communities [Diagram labels: D4Science Infrastructure, Portal, Group A, Group B, Group C, DRIVER, GENESI-DR, FAO Geonetwork, FAO FIGIS, AquaMaps, INSPIRE, Hadoop, EGEE/EGI]
D4Science II Current Status • Data Resources • File-based data: • 64 large data collections gathering 164,087 objects • Tabular data: • Time series (catch statistics) • Environmental authority (properties of ~250K marine areas) • Species environmental envelope (environmental description of 11k species) • Species assignment (assignment of a species to cell areas, ~2.75 billion records) • Geospatial data: • Environmental (SST, salinity, sea ice concentration, distance to land, etc.) • Layers and several thousand species distribution layers (~25k layers) • Other data resources: • Metadata collections in multiple schemas (163) • Full text, forward, and geospatial indexes (165) • Transformation programs (41) • HW and SW Resources
From a testbed to a production ecosystem [Timeline: Oct. ’04, Nov. ’07, Jan. ’08, Oct. ’09, Dec. ’09, Sept. ’11]
gCube identity: the starting point • gCube Open Platform • gCube is physically distributed across • Services: import, collect, store, index, transform, search, describe, manage, and annotate data • Libraries: server-side libraries, client-side libraries, plugins • Portlets: interactive components, mediators • Designed for working at large scale • over wide-area links and across administrative domains • to cope with the computational demands • can be easily deployed in a single site
gCube Release Cycle [Diagram labels: Procedure, Bug Fixing, Patching the Production] • Release 2.2.2: • 23 subsystems • 307 software packages • 22 full-time developers • 4 testers
gCube Software License: EUPL
gCube e-Infrastructure • A gCube e-Infrastructure promotes effective consumption of shared resources: • hardware resources • data resources • software resources to facilitate research collaborations that span institutions, disciplines, and countries within a coherent model, regardless of the location of their research facilities • It extends the e-Infrastructure concept by promoting sharing and collaboration and enforcing policies • It increases flexibility in the organization of community resources with Virtual Research Environments • gCube e-Infrastructure enabler: the VRE innovation
Virtual Organization • A Virtual Organization (VO) specifies how a set of users can access a set of resources • what is shared • who is allowed to share • the conditions under which sharing can occur • Is the VO adequate to represent a growing aggregation of resources tailored to satisfy the evolving needs of the user community? • No, it is not! • Common scenarios • Data needs to be assessed before it is made publicly available to the VO members. • A restricted set of users has to collaborate to refine processes and implement showcases. • Products generated through elaboration of data or simulation have to be validated by expert users.
Virtual Research Environment • A Virtual Research Environment (VRE) is • a distributed and dynamically created environment • where a subset of resources can be assigned to a subset of users via interfaces • for a limited timeframe • at little or no cost for the providers of the infrastructure • gCube is a first example of a VRE management system [Diagram labels: VO, VRE 1, VRE 2]
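The VRE definition above can be read as a small data model: a VRE selects a subset of a VO's users and resources and is valid only for a limited period. The sketch below is illustrative only. The class names `VirtualOrganization` and `VRE`, the `can_access` method, and the sample users and resources are assumptions made for this example and are not gCube APIs.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class VirtualOrganization:
    """Hypothetical VO: the full pool of users and shared resources."""
    users: frozenset
    resources: frozenset


@dataclass
class VRE:
    """Hypothetical VRE: a time-limited subset of a VO's users and resources."""
    vo: VirtualOrganization
    users: frozenset
    resources: frozenset
    start: date
    end: date

    def __post_init__(self):
        # A VRE may only draw on what its VO already shares.
        if not self.users <= self.vo.users:
            raise ValueError("VRE users must belong to the VO")
        if not self.resources <= self.vo.resources:
            raise ValueError("VRE resources must belong to the VO")
        if self.end < self.start:
            raise ValueError("VRE end date precedes its start date")

    def can_access(self, user, resource, on):
        """True if `user` may use `resource` on the date `on`."""
        return (
            self.start <= on <= self.end
            and user in self.users
            and resource in self.resources
        )


# Sample data, invented for illustration.
vo = VirtualOrganization(
    users=frozenset({"alice", "bob", "carol"}),
    resources=frozenset({"catch-statistics", "species-layers", "index-service"}),
)
vre = VRE(
    vo=vo,
    users=frozenset({"alice", "bob"}),
    resources=frozenset({"catch-statistics"}),
    start=date(2011, 1, 1),
    end=date(2011, 6, 30),
)
```

In this reading, several VREs can be created over the same VO without copying resources; each one only records which users and resources it exposes and for how long.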
How does it work?
Why is sharing through VREs key? • A Virtual Research Environment (VRE) supports cooperative activities • Metadata cleaning, enrichment, and transformation by exploiting mapping schemas, controlled vocabularies, thesauri, and ontologies • Process refinement and showcase implementation (restricted to a set of users) • Data assessment (required to make data publicly available to VO members) • Validation by expert users of products generated through data elaboration or simulation
Why is sharing through VREs key? • The VRE integrated environment provides a set of functions to support and perform activities: • the ability to integrate heterogeneous data and services • the ability to process information on demand and ingest the results • the ability to share data and processes with other users • the ability to customize collections of information • the ability to store user actions and exploit them for further use • the ability to aggregate relevant information into ad-hoc information sources and keep them updated
Why is sharing through VREs key? • Through the VRE, groups of users have controlled access to distributed data and services integrated under a personalised interface.
VRE Facilities • Workspace: a virtual desktop to organize the working environment • Species Maps Generation and Time Series Management: tools supporting specific tasks • Report Management: a virtual live document to describe research results • Search, Annotation, Visualisation, Storage, Transformation, …
Workspace • A collaboration-oriented suite providing for • seamless access and organisation facilities on a rich array of objects (e.g. Information Objects, Queries, Files, Templates) • mediation between external world objects, systems and infrastructures (import/export/publishing) • support for common file-manager operations (drag & drop, contextual menu) • support for effective sharing of rich objects
Species Distribution Maps Generation • AquaMaps is an application* • tailored to predict global distributions of marine species, initially designed for marine mammals and subsequently generalised to marine species, • that generates color-coded species range maps using half-degree latitude and longitude blocks • by interfacing several databases and repository providers * Algorithm by Kaschner et al. 2006
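To make the half-degree blocks concrete: a global grid at 0.5° resolution has 720 columns and 360 rows, i.e. 259,200 cells in total, land and sea together. The sketch below maps a coordinate to its cell and back to the cell centre. The row and column numbering (row 0 at the North Pole, column 0 at longitude -180) and the function names are assumptions chosen for illustration; this is not the cell identifier scheme used by AquaMaps.

```python
import math

CELL_SIZE = 0.5                  # degrees, as stated on the slide
N_COLS = int(360 / CELL_SIZE)    # 720 columns of longitude
N_ROWS = int(180 / CELL_SIZE)    # 360 rows of latitude


def cell_of(lat, lon):
    """Return (row, col) of the half-degree cell containing (lat, lon).

    Row 0 starts at latitude 90, column 0 at longitude -180
    (an illustrative numbering, not an official one).
    """
    if not (-90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0):
        raise ValueError("coordinates out of range")
    # Clamp so that lat = -90 and lon = 180 fall into the last row/column.
    row = min(int(math.floor((90.0 - lat) / CELL_SIZE)), N_ROWS - 1)
    col = min(int(math.floor((lon + 180.0) / CELL_SIZE)), N_COLS - 1)
    return row, col


def cell_centre(row, col):
    """Return the (lat, lon) of the centre of cell (row, col)."""
    lat = 90.0 - (row + 0.5) * CELL_SIZE
    lon = -180.0 + (col + 0.5) * CELL_SIZE
    return lat, lon
```

A species range map is then a value per cell, which is why the number of cells multiplied by the number of species drives the computation volume.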
Species Distribution Maps Generation • AquaMaps execution is based on the gCube Ecological Niche Modelling Suite, which allows the extrapolation of known species occurrences • to determine environmental envelopes (species tolerances) • to predict future distributions by matching species tolerances against local environmental conditions (e.g. climate change and sea pollution) • Very large volume of input and output data: HSPEC native range 56,468,301 - HSPEC suitable range 114,989,360 • Very large number of computations: one multispecies map computed on 6,188 half-degree cells (of over 170k) and 2,540 species requires 125 million computations (Eli E. Agbayani, FishBase Project/INCOFISH WP1, WorldFish Center)
Time Series Management • Offers a set of tools to manage capture statistics • Supports the complete time series (TS) lifecycle • Supports validation, curation, and analysis • Provides support for data reallocation • Produces uniform data sets
Time Series • Offers a set of tools to operate on capture statistics • Support for multiple key families • Filtering, grouping, and aggregation • Union • Mining • Automatically produces provenance information
Report Management • A collaboration-oriented suite providing for • template-oriented, feature-rich and flexible document format definition • effective and infrastructure-integrated report compilation (drag & drop workspace items) • collaborative and distributed editing (workspace based) • standard-based report materialisation (HTML, OpenXML)
gCube model • Wide-area computing based on shared computing, data, and service resources • provision is federated, but resources can also be acquired by the infrastructure • added value for consumers and providers • ownership is decentralised but control is autonomic • resources are heterogeneous • security is pervasive but mostly hidden by the gCube middleware • Application model is dominantly resource-oriented • VREs are profiled as aggregations of resources that are dynamically deployed, executed, and terminated • are interactive • are built on shareable resources (including workflows) in their own right • are published and discoverable • may integrate storage elements located at community sites • may host applications that can also be executed by interfacing with classic grid and cloud infrastructures • Deployment model • dynamic and autonomic • Development platform • complete service programming abstraction
Interoperability: Assumptions • Consolidated facts: • Very rich applications and data collections are currently maintained by a multitude of authoritative providers • Different problems require different execution paradigms: batch, map-reduce, synchronous call, message-queue, … • Key distributed computation technologies exist: grid (gLite and Globus), distributed resource management (Condor), clusters (Hadoop), … • Several standards are adopted in the same domain • Societal observations • A rich variety of protocols, models, and formats • creates barriers to the usage of resources • dramatically delays new exploitation patterns • Technical observations • Heterogeneity of protocols, models, and formats increases load • Load increases failures gCube interoperability framework: the challenge
Interoperability: Landscape • Unstructured Data: blob (binary) and textual files • Structured Data: tabular, statistical, geospatial, temporal, and textual data • Compound Data: data composed of unstructured and structured data entities [Diagram label: security]
Interoperability: gCube Vision • gCube objectives: • hide heterogeneity, i.e. abstract over differences in location, protocol, and model; • embrace heterogeneity, i.e. allow for multiple locations, protocols, and models; • Technical goals • no bottlenecks: scale no less than the interfaced resources • no outages: keep failures partial and temporary • autonomicity: system reacts and recovers gCube interoperability framework: the vision
Hiding Heterogeneity • Heterogeneous resources are virtually accessible in a common ecosystem of resources • regardless of their locations, technologies, and protocols • Different communities have access to different views • according to the conditions under which the sharing can occur • Each community can define its own VRE • for a limited timeframe and at no cost for the providers of the resource • Several VREs can coexist • without interfering with each other, even when competing for the same resources
Embracing Heterogeneity • Approaches and solutions to achieve interoperability: • Blackboard-based • asynchronous communication between components in a system • one protocol to read/write and one language to specify messages • Wrapper/Mediator-based • translates one interface for a component into a compatible interface • Proxy-based • exposes the same interface but allows additional operations over received calls • Adaptor-based • provides a unified interface to a set of other components' interfaces and encapsulates how this set of objects interacts • Broker-based • specialises an Adaptor by coordinating communication gCube interoperability framework: the approach
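Three of the approaches listed above (wrapper, proxy, adaptor) can be contrasted in a few lines of code. The sketch below follows the slide's wording for each pattern. All class and method names (`LegacyCatalogue`, `fetch_record`, `get`, and so on) are hypothetical and chosen only for illustration; none of them is a gCube interface.

```python
class LegacyCatalogue:
    """Hypothetical external component with its own interface."""

    def fetch_record(self, record_id):
        return {"id": record_id, "source": "legacy"}


class CatalogueWrapper:
    """Wrapper/mediator: translates the legacy interface into `get(key)`."""

    def __init__(self, legacy):
        self._legacy = legacy

    def get(self, key):
        return self._legacy.fetch_record(key)


class CountingProxy:
    """Proxy: exposes the same `get` interface and adds an operation per call."""

    def __init__(self, target):
        self._target = target
        self.calls = 0

    def get(self, key):
        self.calls += 1          # the additional operation over the received call
        return self._target.get(key)


class InMemoryStore:
    """A second hypothetical component that already speaks `get(key)`."""

    def __init__(self, items):
        self._items = dict(items)

    def get(self, key):
        return self._items.get(key)


class UnifiedAdaptor:
    """Adaptor (in the slide's sense): one interface over several components,
    hiding the order in which they are consulted."""

    def __init__(self, *sources):
        self._sources = sources

    def get(self, key):
        for source in self._sources:
            result = source.get(key)
            if result is not None:
                return result
        return None


store = InMemoryStore({"a": {"id": "a", "source": "memory"}})
proxy = CountingProxy(CatalogueWrapper(LegacyCatalogue()))
adaptor = UnifiedAdaptor(store, proxy)
```

Callers only ever see `adaptor.get(key)`; whether the answer came from memory or from the wrapped legacy component is hidden, which is the property the slide attributes to the adaptor-based approach.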
Interoperability Approaches: Resource Discovery • Each resource is represented by a profile (metadata) characterising: • the interface • the state • the list of dependencies • the run-time status • the policies • the configuration • the pending tasks to execute • A resource profile • is published by the resource owner • is discovered by the resource consumers asynchronously through a common resource-independent protocol • gCube offers a distributed and scalable Information System (blackboard) to store, discover, and access resource profiles gCube interoperability framework: the solution
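The publish/discover cycle described above can be sketched with an in-memory registry. This is a minimal single-process stand-in, assuming a profile with only a handful of the listed attributes; the names `ResourceProfile`, `InformationSystem`, `publish`, and `discover` are hypothetical and do not reproduce the API of the gCube Information System, which is distributed.

```python
from dataclasses import dataclass, field


@dataclass
class ResourceProfile:
    """Hypothetical profile carrying a few of the attributes listed on the slide."""
    resource_id: str
    interface: str
    status: str = "ready"
    dependencies: list = field(default_factory=list)
    policies: dict = field(default_factory=dict)


class InformationSystem:
    """In-memory stand-in for a blackboard: owners write, consumers query."""

    def __init__(self):
        self._profiles = {}

    def publish(self, profile):
        """Called by the resource owner; re-publishing replaces the old profile."""
        self._profiles[profile.resource_id] = profile

    def discover(self, **criteria):
        """Return the profiles whose attributes equal all the given criteria."""
        return [
            p for p in self._profiles.values()
            if all(getattr(p, name, None) == value for name, value in criteria.items())
        ]


registry = InformationSystem()
registry.publish(ResourceProfile("idx-1", interface="index"))
registry.publish(ResourceProfile("store-1", interface="storage", dependencies=["idx-1"]))
```

Owner and consumer never call each other directly: the owner writes a profile, the consumer later queries by attribute, which is the decoupling the blackboard approach relies on.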
Interoperability Approaches: Content Interoperability • gCube Open Content Management Architecture (OCMA) • Assumption • data stored in different storage back-ends • diverse locations, models, access types • few common primitives: documents, collections, repositories • gCube makes it possible to • reach content that lies outside the system • expose content (reachable from) inside the system • perform coarse-grained as well as fine-grained retrieval, update, and addressing • Runtime scalability • autonomic read-only state replication • maximize throughput, minimize response time: discovery-time load balancing • reduce latencies • Software • plugin-based architecture to reduce development costs
Interoperability Approaches: Data Discovery and Access • gCube offers • Several index types • Forward indexing, which supports ultra-fast lookups on tabular typed metadata; • XML indexing, which supports semistructured lookups on content metadata; • Textual field indexing, which supports full text and qualified lookups on (mainly) textual metadata; • Metadata full text indexing, which enables full text lookups on metadata; • Content full text indexing, which enables full text lookups on text extracted from content; • Geospatial/temporal indexing, which enables geospatial proximity and coverage queries to be executed over geospatial/temporal metadata; • Feature indexing, which enables high-dimension vector indexing for feature lookup (currently the feature is inactive); • Runtime scalability - WORM (Write Once – Read Many) behavior pattern • multiple readers (Lookups in gCube lingo) • single updater for each index • Autonomic sync under a dynamically expanding/shrinking
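The single-updater, multiple-reader pattern above can be illustrated with a copy-on-write index inside one process. This is a sketch under stated assumptions: the class `WormIndex` and its methods are invented for this example, a lock stands in for the "single updater" rule, and readers see immutable snapshots. It says nothing about how gCube distributes or synchronises its lookups.

```python
import threading
from types import MappingProxyType


class WormIndex:
    """Toy inverted index: one updater at a time, any number of lock-free readers."""

    def __init__(self):
        self._snapshot = MappingProxyType({})    # read-only view of the current state
        self._update_lock = threading.Lock()     # enforces a single updater at a time

    def lookup(self, key):
        """Reader path: no locking, reads whichever snapshot is current."""
        return self._snapshot.get(key, ())

    def update(self, entries):
        """Updater path: build a new snapshot, then publish it in one assignment."""
        with self._update_lock:
            new = dict(self._snapshot)           # shallow copy; values are immutable tuples
            for key, doc_id in entries:
                new[key] = new.get(key, ()) + (doc_id,)
            self._snapshot = MappingProxyType(new)


index = WormIndex()
index.update([("tuna", "doc-1"), ("tuna", "doc-2"), ("cod", "doc-3")])
```

Because a published snapshot is never modified, a reader that started on an older snapshot keeps a consistent view while the updater prepares the next one.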
Interoperability Approaches: Data Representation and Manipulation • gCube offers • Open transformation service framework • Extensible with specific source-target mediators • To use for metadata and data crosswalk transformations • Tailored for statistical, geospatial, temporal, and textual data • Rich set of reference data • Extensible with domain-specific reference data • To reuse in services for data curation and harmonization • Support for geospatial services • To capture, manage, analyze, and display all forms of data that can be geographically referenced • Integrated resources registry • Format agnostic • To support discovery and access
Interoperability Approaches: Process Execution [1/2] • gCube offers solutions to: • Decouple the business domain and infrastructure-specific logic from the core “execution” functionality • Invoke a wide range of logic components: SOAP and REST Web Services, shell scripts, executable binaries, POJOs, … • Support most of the execution paradigms: batch, map-reduce, synchronous call • Bridge key distributed computation technologies: grid (gLite and Globus), Condor, Hadoop • Control and monitor the execution of a processing flow • Stage data among different storage providers • Stream data among computation elements
Interoperability Approaches: Process Execution [2/2] • By using adaptors that • operate on a specific third-party language and translate its constructs into native constructs, • allow for the creation of complex workflows that exploit several diverse technologies deployed on different infrastructures
Conclusions • gCube System: • Stable software being improved over the last 5 years • Powerful ecosystem management system equipped with advanced infrastructure management functionality • gCube offers a variety of patterns, tools, and solutions • to deliver interoperability solutions and interconnect • Heterogeneous digital content • Heterogeneous repository systems • Heterogeneous computation platforms • to decrease the cost of adoption • to reduce the time to market of new ideas • to deal with a plethora of standards
Supported Standards • WSRF Specifications: • WS-ResourceProperties (WSRF-RP) • WS-ResourceLifetime (WSRF-RL) • WS-ServiceGroup (WSRF-SG) • WS-BaseFaults (WSRF-BF) • JSR: • 168: Simple Portlets • 286: 168 update • 160: JMX • WSN Specifications: • WS-BaseNotification • WS-Topics • (WS-BrokeredNotification) • WS-* Standards: • SOAP • WSDL • WS-Addressing • ISO: • ISO 3166 countries • ISO 4217 currencies • ISO 19115 geo-location • X-*: • XML • XSD • XSL • XSLT • XPath • XQuery • OGC: • Web Coverage Processing Service • Web Coverage Service • Web Feature Service • Web Map Context • Web Map Service • Web Map Tile Service • Web Processing Service • Web Services Common • OGF Standard: • GLUE Schema (2) • … • Comply with: • OAI-PMH • OAI-ORE
Find us • www.gcube-system.org • www.d4science.eu • Pasquale Pagano, D4Science-II Technical Director • pasquale.pagano@isti.cnr.it • Thank you for your attention