D4Science is a powerful e-infrastructure connecting scientists worldwide, integrating diverse data sets, and providing access to advanced analysis tools. Discover its capabilities and services for seamless data access and analysis.
Introduction to D4Science and its enabling platform gCube
Pasquale Pagano, ISTI - CNR, pasquale.pagano@cnr.it
Distinguishing capabilities of the e-infrastructure D4Science
D4Science Identity
D4Science is a Hybrid Data Infrastructure
• connecting more than 2,000 scientists in 44 countries
• integrating data from more than 50 heterogeneous providers
• executing more than 20,000 models and algorithms per month
• providing access to over a billion quality records in repositories worldwide
• operating with 99.7% service availability
D4Science hosts more than 40 Virtual Research Environments to serve the biological, ecological, environmental, and statistical communities worldwide.
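To put the 99.7% availability figure in perspective, the short sketch below converts an availability percentage into the downtime it implies. The 365-day year and 30-day month are assumptions made for illustration, and the function name is hypothetical; the slides do not state the measurement window.

```python
HOURS_PER_DAY = 24


def implied_downtime_hours(availability_percent: float, days: int) -> float:
    """Return the hours of downtime implied by an availability percentage.

    The observation window (`days`) is an assumption; the slides only
    quote the percentage itself.
    """
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * days * HOURS_PER_DAY


# 99.7% is the figure quoted on the slide; the windows are illustrative.
yearly_hours = implied_downtime_hours(99.7, 365)
monthly_hours = implied_downtime_hours(99.7, 30)
print(f"per year: {yearly_hours:.2f} h, per 30-day month: {monthly_hours:.2f} h")
```

Under these assumptions, 99.7% availability corresponds to roughly 26 hours of downtime per year, or a little over 2 hours per 30-day month.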
D4Science Services
D4Science
• hosts databases, NoSQL databases (Cassandra, MongoDB, CouchDB), and services with defined QoS
• offers services for seamless access to and analysis of a wide spectrum of data, including
  • biological and ecological data,
  • geospatial data,
  • statistical data, and
  • semi-structured data
  from multiple authoritative data providers and information systems
These services can be exploited both via web-based graphical user interfaces and via web-based protocols for programmatic access, e.g. OAI-PMH, CSW, WPS, WFS, SDMX.
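As an illustration of what programmatic access over one of these protocols looks like, here is a minimal sketch that builds an OAI-PMH `ListRecords` request URL. The request parameters (`verb`, `metadataPrefix`, `set`, `from`) are defined by the OAI-PMH protocol; the base URL and the helper name are placeholders and assumptions, not actual D4Science endpoints.

```python
from typing import Optional
from urllib.parse import urlencode

# Hypothetical endpoint: replace with the OAI-PMH base URL of the
# repository you want to harvest. This is NOT a real D4Science address.
BASE_URL = "https://example.org/oai"


def oai_list_records_url(base_url: str,
                         metadata_prefix: str = "oai_dc",
                         set_spec: Optional[str] = None,
                         from_date: Optional[str] = None) -> str:
    """Build an OAI-PMH ListRecords request URL (sketch only)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec is not None:
        params["set"] = set_spec        # optional selective harvesting by set
    if from_date is not None:
        params["from"] = from_date      # optional lower bound, e.g. "2015-01-01"
    return f"{base_url}?{urlencode(params)}"


oai_url = oai_list_records_url(BASE_URL, from_date="2015-01-01")
print(oai_url)
```

The resulting URL can be fetched with any HTTP client; the response is an XML document that a harvester would page through using the protocol's resumption tokens.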
Born from the user needs
Capacities, Applications, Data
• to access authoritative datasets
• to mash up data
• to analyse big datasets
• to validate datasets and provide standard access to them
• to analyze datasets
• to manage the full data life-cycle, from import to validation, curation, harmonization, and publication
• to reduce the costs of data maintenance of my dept.
• to host applications in a secure and scalable environment
• to maintain and preserve data
• to securely deliver data to known users
e-Infrastructure - Just an overview
D4Science for Geeks
• uses cloud-computing technologies to manage 245 servers (more than 1,300 GB RAM, 1,400 CPUs, 400 TB storage)
• is monitored via Nagios and Ganglia
• is maintained via Ansible (Application Deployment + Configuration Management + Continuous Delivery)
• is governed by deployment and operation policies
• has an established SLA
• is operated according to defined Terms of Use
Distinguishing capabilities of the enabling software gCube
gCube
gCube promotes Hybrid Data Infrastructures by combining over 500 software components into a coherent and centrally managed system of hardware, software, and data resources.
gCube: just an overview
gCube Identity: One stable open-source platform
Statistics from openhub.net/p/gCube
gCube enables the D4Science HDI
Introduction to Existing Technologies and Enabling Platform - P. Pagano
gCube Application Bundles
https://www.gcube-system.org/catalogue-of-applications
• AppsCube: to develop applications interfacing gCube facilities
• BiolCube: to aid modelling and analysing of distribution data, comparing checklists, and producing maps
• ConnectCube: to facilitate data publication with appropriate tools, including semantic technologies
• GeosCube: to properly access, consume, and produce geospatial information
• StatsCube: to assist tabular data validation and data enrichment, and to provide efficient analytical tools
• IceCube: to support deployment, operation, and management of a gCube-based infrastructure
gCube Vision: From several tools to One Platform
gCube Services
• SPD (BiolCube): ecological and biological data
• GeoExplorer (GeosCube): geospatial data
• Tabular Data (StatsCube): statistical and reference data
• Cotrix (ConnectCube): reference data
• Statistical Manager (StatsCube): data analytics for interdisciplinary research
Species Product Discovery (SPD)
Access observations and taxon data from OBIS, GBIF, WoRMS, …
• Assisted query preparation
• Assisted filtering
• … and many more
• Pluggable to interact with additional data sources
• Export in multiple formats, including DwC and DwCA
• Flexible query language
• Integration with workspace facilities
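To make the DwC export option more concrete, the following sketch writes occurrence records to CSV using standard Darwin Core term names as column headers. It is not SPD's exporter: the function name, the record values, and the choice of columns are assumptions for illustration, and packaging the result as a Darwin Core Archive (DwCA) is not covered.

```python
import csv
import io

# Column names are standard Darwin Core terms; the selection is illustrative.
DWC_FIELDS = [
    "occurrenceID",
    "scientificName",
    "decimalLatitude",
    "decimalLongitude",
    "eventDate",
    "basisOfRecord",
]


def to_dwc_csv(records):
    """Serialise occurrence records to CSV with Darwin Core column names.

    Keys that are not listed in DWC_FIELDS are dropped; missing keys
    become empty cells.
    """
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=DWC_FIELDS, lineterminator="\n")
    writer.writeheader()
    for record in records:
        writer.writerow({name: record.get(name, "") for name in DWC_FIELDS})
    return buffer.getvalue()


# Made-up placeholder record, not real observation data.
sample_records = [
    {
        "occurrenceID": "example-1",
        "scientificName": "Architeuthis dux",
        "decimalLatitude": 0.0,
        "decimalLongitude": 0.0,
        "eventDate": "2015-01-01",
        "basisOfRecord": "HumanObservation",
        "internalNote": "not a Darwin Core term, dropped on export",
    }
]
dwc_csv = to_dwc_csv(sample_records)
print(dwc_csv)
```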
Tabular Data Manager
• Manage tabular datasets
• Rule-based harmonization
• Assisted data preparation
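The idea of rule-based harmonization can be sketched as an ordered list of small transformations applied to every row. This is only an illustration of the general technique: the rule names, the column names, and the code mapping are hypothetical and do not reflect the Tabular Data Manager's actual rule language or API.

```python
from typing import Callable, Dict, List

Row = Dict[str, str]
Rule = Callable[[Row], Row]


def trim_whitespace(row: Row) -> Row:
    """Rule: strip leading and trailing whitespace from every cell."""
    return {column: value.strip() for column, value in row.items()}


def map_codes(column: str, mapping: Dict[str, str]) -> Rule:
    """Rule factory: replace known variants in `column` with a reference code."""
    def rule(row: Row) -> Row:
        updated = dict(row)
        updated[column] = mapping.get(row[column], row[column])
        return updated
    return rule


def harmonize(rows: List[Row], rules: List[Rule]) -> List[Row]:
    """Apply each rule, in order, to each row and return the new rows."""
    result = []
    for row in rows:
        for rule in rules:
            row = rule(row)
        result.append(row)
    return result


# Hypothetical input: two spellings of the same country, untidy spacing.
raw_rows = [
    {"country": " IT ", "catch_t": "12"},
    {"country": "Italy", "catch_t": " 7"},
]
rules = [trim_whitespace, map_codes("country", {"IT": "ITA", "Italy": "ITA"})]
harmonized = harmonize(raw_rows, rules)
print(harmonized)
```

Rule order matters here: whitespace is trimmed first so that the code mapping sees clean values.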
Statistical Manager
External Computing Facility
OGC WPS Interface
2013 2014
People can
• use
  • R scripts
  • Java programs
  • Linux programs
  • OGC-WPS
• data
  • Desktop
  • Infrastructure
  • OGC-W*S
  • Several formats
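Since the Statistical Manager is exposed through an OGC WPS interface, a client typically starts by discovering the available processes. The sketch below builds the two discovery requests of WPS 1.0.0 in key-value-pair form. The endpoint URL and the process identifier are placeholders and assumptions; an `Execute` request is left out because its inputs are specific to each process.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: NOT an actual D4Science address.
WPS_ENDPOINT = "https://example.org/wps"


def wps_get_capabilities_url(endpoint: str) -> str:
    """URL asking a WPS server which processes it offers."""
    params = {"service": "WPS", "request": "GetCapabilities"}
    return f"{endpoint}?{urlencode(params)}"


def wps_describe_process_url(endpoint: str, identifier: str) -> str:
    """URL asking a WPS 1.0.0 server for the inputs and outputs of one process."""
    params = {
        "service": "WPS",
        "version": "1.0.0",
        "request": "DescribeProcess",
        "identifier": identifier,
    }
    return f"{endpoint}?{urlencode(params)}"


capabilities_url = wps_get_capabilities_url(WPS_ENDPOINT)
# "example.process" is a made-up identifier used only for illustration.
describe_url = wps_describe_process_url(WPS_ENDPOINT, "example.process")
print(capabilities_url)
print(describe_url)
```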
Statistical Manager [cont.]
• Not another cloud computing platform, but a platform where executions can be repeated, compared, discussed, and logged
• Not another computational engine, but a platform where interdisciplinary tools and services can be easily contributed and integrated by the communities
Statistical Manager [cont.]: Two exploitation models
gCube Policy Management: Virtual Research Environment
• Share: Database, Tables, Workflow, Files
• Communicate: Post, Favourite, Connection
• Organize: Dynamic VRE Creation, Secure Policy Control
to access, share and collaborate
Virtual Research Environment
L. Candela, D. Castelli, P. Pagano (2013). Virtual Research Environments: An Overview and a Research Agenda. Data Science Journal, Vol. 12.
A distributed and dynamically created environment where a subset of resources (data, services, computational, and storage resources), regulated by tailored policies, is assigned to a subset of users via interfaces for a limited timeframe, at little or no cost for the providers of the participatory data e-infrastructures.
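The elements of this definition (a subset of resources, a subset of users, tailored policies, a limited timeframe) can be modelled as a small data structure. The sketch below is a simplified illustration only: the class, its fields, and the policy model are assumptions and do not correspond to gCube's actual implementation.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, Set


@dataclass
class VirtualResearchEnvironment:
    """Toy model of a VRE: resources assigned to users for a limited time."""
    name: str
    resources: Set[str]
    users: Set[str]
    valid_from: date
    valid_until: date
    # Simplified, hypothetical policy model: resource name -> users allowed
    # to use it. A resource without an entry is open to every VRE member.
    policies: Dict[str, Set[str]] = field(default_factory=dict)

    def is_active(self, day: date) -> bool:
        """True while the VRE's limited timeframe includes `day`."""
        return self.valid_from <= day <= self.valid_until

    def can_access(self, user: str, resource: str, day: date) -> bool:
        """Check membership, timeframe, and the per-resource policy."""
        if not self.is_active(day):
            return False
        if user not in self.users or resource not in self.resources:
            return False
        allowed = self.policies.get(resource)
        return allowed is None or user in allowed


# All names below are made up for illustration.
vre = VirtualResearchEnvironment(
    name="demo-vre",
    resources={"species-db", "wps-service"},
    users={"alice", "bob"},
    valid_from=date(2015, 1, 1),
    valid_until=date(2015, 12, 31),
    policies={"wps-service": {"alice"}},
)
print(vre.can_access("alice", "wps-service", date(2015, 6, 1)))
print(vre.can_access("bob", "wps-service", date(2015, 6, 1)))
```

In this sketch "alice" may use the restricted service while "bob" may not, and every check fails once the timeframe has expired.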
VRE Definition
Simple and effective process to define a new environment
• Metadata
• Applications
• Data
• Configuration
gCube PID support: Enabling Data Sharing
• Citable
• Shareable
• Reference-able
Managing PIDs in a data infrastructure with elastic management of resources
References / Links
• D4Science Web Site: http://www.d4science.org
• gCube Web Site: http://www.gcube-system.org
• Catalogue of Applications: https://www.gcube-system.org/catalogue-of-applications
• Software Key Features: https://wiki.gcube-system.org/GCube_Features
• Developer Guide: https://wiki.gcube-system.org/Developer%27s_Guide
• FeatherWeightStack: https://wiki.gcube-system.org/Featherweight_Stack
• SmartGears: https://wiki.gcube-system.org/SmartGears
• gCube APIs: https://wiki.gcube-system.org/GCube_Application_Programming_Interface
• Administration Guide: https://wiki.gcube-system.org/Administrator%27s_Guide