Support for the Full e-Experimentation Cycle in the Virtual Laboratory Infrastructure

Piotr Nowakowski (1), Eryk Ciepiela (1), Tomasz Gubała (1), Maciej Malawski (1, 2), Marian Bubak (1, 2)
(1) ACC Cyfronet AGH, ul. Nawojki 11, 30-950 Kraków, Poland
(2) Institute of Computer Science AGH, Mickiewicza 30, 30-059 Kraków, Poland
KUKDM’10, Zakopane, 18-19 March 2010

Outline • Motivation • Problem definition • Scientific challenges • Iterative experimentation support • Experiment pipelines and traces • Sharing experiment data through Data Nets

Motivation: e-Science Experiments, Data and Publications • Reproducible experiments, provenance in e-Science • Need to link publications with primary data (experimental data, algorithms, software, workflows, scripts) • Plenitude of scientific software: jobs, workflows, services, components, scripts, experiment plans • Huge amount of scientific data consumed and produced by e-Science • Earth and life sciences, HEP, etc. • Large number of publications makes research difficult: • Computer Science: DBLP contains more than 2^20 = 1,048,576 publications, • PubMed stores ~17 million articles to date, • ACM Digital Library, ISI Web of Knowledge, Scopus, CiteSeer, arXiv, Google Scholar • Emergence of the Web 2.0-based Scientific Social Community (SSC) model

Open Science & Science 2.0 • New means of scientific communication: • Wikis, blogs, • Collaborative Web 2.0 technologies • New methods of conducting science: • e-Science, • In-silico experiments, • Exploratory applications • Democratization of science • Increasing role of openness

Problem Definition • To construct a theoretical model facilitating open, collaborative e-experimentation, from experiment inception to publication of results, including primary scientific data • To develop a framework implementing the above model • To exploit the emerging solution in the context of existing HPC infrastructures and scientific collaboration

Scientific Challenges • Theoretical: A common method for referencing primary data (experimental data, algorithms, software, workflows, scripts) as part of publications should be developed and integrated with modern e-Science infrastructures • Technological: An integrated architecture for storing, annotating, publishing, referencing and reusing primary data sources. This architecture should span existing virtual laboratory and grid computing systems

Description of the Solution • Phase 1: Iterative experiment preparation • Phase 2: Experiment execution involving semantic storage of results and ensuring repeatability

Experimentation Pipeline • The process of developing an experiment begins with drafting its specification • This is followed by iteratively constructing an experiment plan • Each prototype is tested by a specific research community, using tools provided by the PL-Grid virtual laboratory • Upon completion of tests the experiment can be executed in production mode • Obtained results can be published along with the experiment plan (i.e. a set of operations which enable reenactment and validation of a given experiment)
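The pipeline stages above can be read as a small state machine. The following is a minimal sketch under stated assumptions, not the PL-Grid virtual laboratory API: the names `Stage`, `Experiment`, `add_prototype`, `run_production` and `publish` are hypothetical, and an experiment plan is assumed to be an ordered list of Python callables.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class Stage(Enum):
    SPECIFIED = auto()   # specification drafted
    PROTOTYPE = auto()   # plan under iterative construction and testing
    PRODUCTION = auto()  # tests complete, plan executed
    PUBLISHED = auto()   # results released together with the plan


@dataclass
class Experiment:
    """Hypothetical model of one experiment moving through the pipeline."""
    specification: str
    plan_versions: list = field(default_factory=list)  # successive prototypes
    stage: Stage = Stage.SPECIFIED
    results: object = None

    def add_prototype(self, plan):
        # Each iteration appends a new version of the experiment plan.
        if self.stage not in (Stage.SPECIFIED, Stage.PROTOTYPE):
            raise ValueError("the plan is frozen once tests are complete")
        self.plan_versions.append(plan)
        self.stage = Stage.PROTOTYPE

    def run_production(self, tests_passed):
        # Production runs require a prototype that passed community tests.
        if self.stage is not Stage.PROTOTYPE or not tests_passed:
            raise ValueError("a tested prototype is required")
        self.results = run_plan(self.plan_versions[-1])
        self.stage = Stage.PRODUCTION

    def publish(self):
        # Results are published along with the plan that produced them.
        if self.stage is not Stage.PRODUCTION:
            raise ValueError("only production results can be published")
        self.stage = Stage.PUBLISHED
        return {
            "specification": self.specification,
            "plan": self.plan_versions[-1],
            "results": self.results,
        }


def run_plan(plan):
    """Apply the operations of a plan in order (assumed plan format)."""
    value = None
    for operation in plan:
        value = operation(value)
    return value


def load(_):
    return [1, 2, 3]


def square(values):
    return [v * v for v in values]


experiment = Experiment("sum of squares of the sample")
experiment.add_prototype([load, square])        # first draft of the plan
experiment.add_prototype([load, square, sum])   # refined after testing
experiment.run_production(tests_passed=True)
publication = experiment.publish()
```

Because the published record carries the plan itself, a reader can call `run_plan(publication["plan"])` to reenact the experiment and compare the outcome with `publication["results"]`.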

Experiment Traces • An experiment trace consists of the following: • any input data provided by the experiment enactor; • all steps performed in order to transform this data into publishable scientific results (chronologically arranged); • the documentation of the experiment plan, prepared by a domain scientist (in the form of annotations and comments). • The outcome of this process will be easily manageable and readable, similarly to weblog entries • Our VL system will enable enrichment of individual data elements with provenance information, linking them to appropriate stages of the experiment
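One way to picture a trace is as a record holding the inputs, an ordered list of steps, and the annotations attached to each step. The sketch below is illustrative only: `Step`, `ExperimentTrace`, `record`, `provenance` and `render` are hypothetical names, and the provenance lookup simply matches a value against the recorded step outputs.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Step:
    name: str
    output: object
    timestamp: datetime
    annotation: str = ""  # documentation written by the domain scientist


@dataclass
class ExperimentTrace:
    """Hypothetical trace: inputs plus chronologically ordered steps."""
    inputs: dict
    steps: list = field(default_factory=list)

    def record(self, name, func, data, annotation=""):
        # Run one step and append it, so list order is chronological order.
        output = func(data)
        now = datetime.now(timezone.utc)
        self.steps.append(Step(name, output, now, annotation))
        return output

    def provenance(self, value):
        # Link a data element back to the step(s) that produced it.
        return [step.name for step in self.steps if step.output == value]

    def render(self):
        # Readable, weblog-like listing of the whole trace.
        lines = [f"inputs: {self.inputs!r}"]
        for number, step in enumerate(self.steps, start=1):
            lines.append(
                f"{number}. {step.name} -> {step.output!r} ({step.annotation})"
            )
        return "\n".join(lines)


trace = ExperimentTrace(inputs={"samples": [3, 1, 2]})
ordered = trace.record("sort", sorted, trace.inputs["samples"],
                       "order the samples")
largest = trace.record("max", max, ordered, "pick the largest sample")
print(trace.render())
```

Here `trace.provenance(largest)` names the step that produced the final value, which is the kind of link between data elements and experiment stages the slide describes.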

Sharing Primary Data: Data Nets • Data Net: unifies modern data storage mechanisms (relational databases, Grid-based file systems, Wiki pages, etc.) • A Data Net is a group of data entities linked by named relationships • Such relationships impose a structure upon the dataset and facilitate querying for entities
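The definition of a Data Net as entities linked by named relationships maps naturally onto a labelled graph. The following is a minimal in-memory sketch, assuming entities are identified by string ids; the `DataNet` class and its methods are hypothetical and do not reflect the actual implementation or its storage back ends.

```python
from collections import defaultdict


class DataNet:
    """Hypothetical Data Net: entities joined by named relationships."""

    def __init__(self):
        self.entities = {}              # entity id -> attribute dict
        self.links = defaultdict(list)  # (source id, relationship) -> targets

    def add_entity(self, entity_id, **attributes):
        self.entities[entity_id] = attributes

    def link(self, source, relationship, target):
        # A relationship may only connect entities already in the net.
        if source not in self.entities or target not in self.entities:
            raise KeyError("both entities must exist before linking")
        self.links[(source, relationship)].append(target)

    def related(self, source, relationship):
        # Query: follow one named relationship from a given entity.
        return list(self.links.get((source, relationship), []))


net = DataNet()
# The 'store' attribute stands in for the different storage mechanisms.
net.add_entity("exp-1", kind="experiment", store="relational database")
net.add_entity("result-1", kind="result file", store="Grid file system")
net.add_entity("note-1", kind="documentation", store="Wiki page")
net.link("exp-1", "produced", "result-1")
net.link("exp-1", "documented_by", "note-1")

produced = net.related("exp-1", "produced")
```

The relationship name acts as the query key, so the structure imposed on the dataset is exactly what makes lookups such as "everything produced by exp-1" possible.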

