HPC and GRID challenges in Bioinformatics. Milanesi Luciano, National Research Council, Institute of Biomedical Technologies, Milan, Italy. luciano.milanesi@itb.cnr.it
Introduction • New biological and biomedical technology platforms, combined with HPC and GRID technology, will be particularly useful for dealing with the increasing amount, complexity, and heterogeneity of biological and biomedical data. • Bioinformatics applications for eHealth have become an ideal research area in which computer scientists can apply and further develop new intelligent computation methods, in both experimental and theoretical settings. • The European bioinformatics initiative based on the infrastructure created by EGEE, BioinfoGRID, and related projects will be illustrated.
Introduction: Post-genomic • “Post-genomic” research focuses on the new tools and methodologies emerging from knowledge of genome sequences. • The production and use of DNA microarrays and the analysis of the transcriptome, proteome, and metabolome are among the topics developed in this area.
The human organism: • ~3 billion nucleotides • ~30,000 genes coding for • ~100,000-300,000 transcripts • ~1-2 million proteins • ~60 trillion cells of • ~300 cell types in • ~14,000 distinguishable morphological structures
ICT and Genomics • A key development in the computational world has been the arrival of de novo design algorithms that use all the spatial information available within the target to design novel drugs. • Coupling these algorithms with the rapidly growing body of information from structural genomics and with new ICT technologies (e.g. HPC, GRID, Web Services) provides a powerful new way to explore design against a broad spectrum of genomics targets, including more challenging techniques such as protein–protein interactions, docking, molecular dynamics, systems biology, and gene networks.
EU GRID: EGEE • Related EU projects: EUIndia, ISSeG, BEinGRID
BioinfoGRID Project • The BIOINFOGRID project proposes to combine bioinformatics services and applications for molecular biology users with the Grid infrastructure created by the EGEE and EGEE-II projects. • The BIOINFOGRID initiative plans to carry out research in genomics, transcriptomics, proteomics, and molecular dynamics applications based on GRID technology.
Genomics applications in GRID • GRID analysis of genomic databases: integration of precomputed data, gene identification, differentiation of pseudogenes, comparative genome analysis, etc. • Functional protein analysis in GRID: apply functional protein domain annotations to large protein families, using GRID resources and related databases.
Bioinformatics Applications • CSTminer • Goal: compare the entire human genome against the entire genomes of other animals (mouse, dog, etc.) • First test: human against mouse • Challenge dimension: • 850 million BLAST comparisons (~2 s of CPU time per comparison) • More than 50 CPU years needed • More than 65,000 jobs submitted • Up to 2 million comparisons per hour • 22 different farms used • More than 900 different hosts used • 2 months of running on the INFN-Grid infrastructure • Second test: some human genes against many animals • Challenge dimension: • 1.7 million comparisons • More than 900 CPU hours needed • Less than 1 day on the INFN-Grid infrastructure
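The CPU figures on this slide can be cross-checked with simple arithmetic. The sketch below assumes the ~2 s per comparison stated for the first test also applies to the second test (the slide does not say so), and it treats "up to 2 million comparisons per hour" as a peak rate; the function and variable names are illustrative only.

```python
SECONDS_PER_HOUR = 3600
SECONDS_PER_YEAR = 365 * 24 * SECONDS_PER_HOUR


def cpu_seconds(comparisons, seconds_per_comparison=2.0):
    """Total CPU time, assuming a fixed cost per BLAST comparison."""
    return comparisons * seconds_per_comparison


# First test: 850 million comparisons, human against mouse.
first_test_years = cpu_seconds(850_000_000) / SECONDS_PER_YEAR

# Second test: 1.7 million comparisons (2 s per comparison is an assumption here).
second_test_hours = cpu_seconds(1_700_000) / SECONDS_PER_HOUR

# CPUs that must be busy at once to sustain the peak rate of 2 million comparisons/hour.
peak_busy_cpus = cpu_seconds(2_000_000) / SECONDS_PER_HOUR

print(f"first test:  {first_test_years:.1f} CPU years")
print(f"second test: {second_test_hours:.0f} CPU hours")
print(f"peak load:   about {peak_busy_cpus:.0f} CPUs busy at once")
```

Under these assumptions the first test comes to roughly 54 CPU years and the second to roughly 944 CPU hours, in line with the "more than 50 CPU years" and "more than 900 CPU hours" quoted on the slide.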
Proteomics Applications in GRID • Protein surface calculation: the grid will be used to compute a volumetric description of the protein, yielding a precise representation of the corresponding surface.
Transcriptomics applications • Computational GRIDs to analyse transcriptomics data • Description: develop algorithmic tools for gene expression data analysis in GRID, and evaluate computational tools for extracting biologically significant information from gene expression data. • Algorithms will focus on clustering steady-state and time-series gene expression data, multiple testing and meta-analysis of microarray experiments from different groups, and identification of transcription sites.
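As an illustration of the multiple-testing step mentioned above, here is a minimal sketch of the Benjamini-Hochberg procedure, one common way to control the false discovery rate across many genes. The slides do not say which correction the project uses, so the choice of method and the example p-values are assumptions.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return one boolean per p-value: True if the hypothesis is rejected at FDR level alpha."""
    m = len(p_values)
    # Indices of the p-values, from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest rank k whose p-value satisfies p_(k) <= (k / m) * alpha.
    max_rank = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= rank / m * alpha:
            max_rank = rank

    # Reject every hypothesis ranked at or below that k.
    rejected = [False] * m
    for rank, index in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[index] = True
    return rejected


# Hypothetical per-gene p-values from a differential expression test.
gene_p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
significant = benjamini_hochberg(gene_p_values, alpha=0.05)
print(significant)
```

With these example values only the first two genes pass, even though five p-values are below 0.05 individually, which is the point of correcting for multiple tests.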
Transcriptomics applications • Bioinformatics-specific data analysis allows GRID users to store and search genetic data, with direct access to the data files stored on the Data Storage Elements of GRID servers. • Researchers can carry out their activities regardless of geographical location, interact with colleagues, and share and access data. • Scientific instruments and experiments produce huge amounts of microarray data.
Influenza A Neuraminidase • Grid-enabled high-throughput in silico screening against Influenza A neuraminidase • Encouraged by the success of the first EGEE biomedical data challenge against malaria (WISDOM), the second data challenge, targeting avian flu, was launched in April 2006 to identify new drugs for potential variants of the Influenza A virus. • Mobilizing thousands of CPUs on the Grid, the 6-week high-throughput screening activity consumed over 100 CPU years of computing power. • This project demonstrated the impact of a world-wide Grid infrastructure in efficiently deploying large-scale virtual screening to speed up the drug design process.
Identification of Applications in EELA (E-infrastructure shared between Europe and Latin America) • EELA biomedical applications fall into three categories: • Bioinformatics applications: BLAST in Grids; Phylogeny • Computational biochemical processes: Wide In Silico Docking on Malaria (WISDOM) • Biomedical models: GEANT4 Application for Tomographic Emission (GATE)
EuChinaGRID • Facility for the prediction of the three dimensional structure of “never born proteins”
Grid added value for international collaboration on neglected diseases • Grids offer unprecedented opportunities for sharing information and resources worldwide • Grids are unique tools for: • Collecting and sharing information (epidemiology, genomics) • Networking experts • Mobilizing resources routinely or in emergencies (vaccine and drug discovery)
Molecular applications in GRID • Aim: to run docking and molecular dynamics simulations, which usually take a very long time to complete. • Description: Wide In Silico Docking On Malaria initiative (WISDOM-II): this project performs docking and molecular dynamics simulations on the GRID platform to discover new targets for neglected diseases. The analysis can notably be performed using the data generated by the WISDOM application on the EGEE infrastructure.
Grid impact on drug discovery workflow down to drug delivery (1/2) • Grids provide the necessary tools and data to identify new biological targets • Bioinformatics services (database replication, workflow…) • Resources for CPU intensive tasks such as genomics comparative analysis, inverse docking… • Grids provide the resources to speed up lead discovery • Large scale in silico docking to identify potentially promising compounds • Molecular dynamics computations to refine virtual screening and further assess selected compounds
Grid impact on drug discovery workflow down to drug delivery (2/2) • Grids provide environments for epidemiology • Federation of databases to collect data in endemic areas, to study a disease and to evaluate the impact of vaccines and vector control measures • Resources for data analysis and mathematical modelling • Grids provide the services needed for clinical trials • Federation of databases to collect data in the centres participating in the clinical trials • Grids provide the tools to monitor drug delivery • Federation of databases to monitor drug delivery
Virtual screening process by docking • Docking: predict how small molecules bind to a receptor of known 3D structure • There are successful examples: rapid, cost effective… • But there are limitations: CPU and storage needed
Grid-enabled high throughput virtual screening by docking • Inputs: millions of chemical compounds, a few target structures, docking software • 1 to 30 min per docking • A few MB per output • 100 CPU years, 1 TB • Challenges: speed up the process; manage the data • Large scale deployment on grid infrastructure
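To give a feel for these orders of magnitude, the sketch below converts the 100 CPU-year budget into a number of docking runs at the two extremes of the quoted per-docking cost (1 and 30 minutes). It is back-of-the-envelope arithmetic only; the function name is hypothetical and the slide does not state the actual number of dockings performed.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def dockings_within_budget(cpu_years, minutes_per_docking):
    """Number of docking runs that fit in a CPU budget at a fixed cost per run."""
    return int(cpu_years * MINUTES_PER_YEAR / minutes_per_docking)


fastest_case = dockings_within_budget(100, minutes_per_docking=1)
slowest_case = dockings_within_budget(100, minutes_per_docking=30)

print(f"at  1 min per docking: {fastest_case:,} dockings")
print(f"at 30 min per docking: {slowest_case:,} dockings")
```

Either way the budget covers millions of docking runs, consistent with screening millions of compounds against a few target structures.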
WISDOM-II, second large scale docking deployment against malaria
Malaria target | Involved in | Biology partners
GST from Plasmodium falciparum | Parasite detoxification | U. of Pretoria, South Africa
DHFR from Plasmodium vivax | Parasite DNA synthesis | U. of Los Andes, Venezuela; U. of Modena, Italy
DHFR from Plasmodium falciparum | Parasite DNA synthesis | U. of Modena, Italy
Tubulin from Plasmodium/plant/mammal | Parasite cell replication | CEA, Acamba project, France
Grid infrastructures and projects contributing to WISDOM-II • EMBRACE, BioinfoGrid, SHARE, EGEE, Auvergrid, EUMedGrid, EUChinaGrid, TWGrid, EELA • Figure legend categories: European grid project; European grid infrastructure; Regional/national grid infrastructure
Filtering process • 1,000,000 chemical compounds • Sorting based on scoring in different parameter sets; consensus scoring: 10,000 compounds selected • Based on key interactions: 1,000 compounds • Key interactions, binding modes, descriptors, knowledge of active site: 100 compounds • MD: 50 compounds to be tested in experimental lab • Credit: V. Kasam, Fraunhofer Institute
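The first filtering stage combines scores obtained with different parameter sets. One simple form of consensus scoring is to rank the compounds under each parameter set and keep those with the best summed rank. The sketch below shows that idea; it is an assumed stand-in and not necessarily the consensus scheme used in WISDOM. Compound names and scores are invented, and lower scores are taken to mean better binding.

```python
def consensus_select(scores_by_parameter_set, keep):
    """Keep the `keep` compounds with the lowest summed rank across parameter sets.

    scores_by_parameter_set maps a parameter-set name to {compound: score},
    where a lower (more negative) score is assumed to be better.
    """
    compounds = list(next(iter(scores_by_parameter_set.values())))
    rank_sum = {compound: 0 for compound in compounds}

    for scores in scores_by_parameter_set.values():
        ordered = sorted(compounds, key=lambda compound: scores[compound])
        for rank, compound in enumerate(ordered, start=1):
            rank_sum[compound] += rank

    # Ties are broken by compound name so the result is deterministic.
    return sorted(compounds, key=lambda compound: (rank_sum[compound], compound))[:keep]


# Hypothetical docking scores under two parameter sets.
scores = {
    "parameter_set_a": {"c1": -9.1, "c2": -7.4, "c3": -8.0, "c4": -5.2},
    "parameter_set_b": {"c1": -8.5, "c2": -8.9, "c3": -6.1, "c4": -5.0},
}
selected = consensus_select(scores, keep=2)
print(selected)
```

A compound that ranks well under only one parameter set (such as c3 here) loses out to compounds that rank well under both, which is the behaviour a consensus filter is meant to provide.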
A grid for neglected diseases • Goal: use grid technology to foster research and development on malaria and other neglected diseases • SCAI Fraunhofer: knowledge extraction, chemoinformatics • LPC Clermont-Ferrand: biomedical grid • Univ. Modena: biological targets, molecular dynamics • CEA, Acamba project: biological targets, chemogenomics • BioinfoGRID: bioinformatics grid • ITB CNR: bioinformatics, molecular modelling • HealthGrid: biomedical grid, dissemination • Academia Sinica: grid user interface • Univ. Los Andes: biological targets, malaria biology • Univ. Pretoria: bioinformatics, malaria biology • Contacts also established with WHO, Microsoft, TATRC, Argonne, SDSC, SERONO, NOVARTIS, Sanofi-Aventis, and hospitals in sub-Saharan Africa
The Cell Cycle • Cell cycle: a repeated sequence of events that leads to the division of a mother cell into daughter cells • A biological process frequently studied in connection with tumour diseases • It is considered a valuable target for drug discovery in the context of cancer and neurodegenerative diseases
Systems Biology Approach • Systems biology studies how biological functions emerge from protein-protein interactions in living systems. • The complexity of this biological process lies in the large number of genes and protein interaction networks involved. • Quantifying the behavior of each cell cycle component plays a crucial role in understanding the complex mechanism of cell cycle regulation.
Simulation Section • Simulation of a single ODE system describing a cell cycle model • 2D plot: image exported as PNG using gnuplot
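The slide does not give the equations of the cell cycle model, so the sketch below only illustrates what "simulating a single ODE system" involves: a fixed-step fourth-order Runge-Kutta integrator applied to a toy two-variable system (cyclin synthesis and degradation driving a kinase activity). The model, its rate constants, and all names are assumptions made for illustration, not the model used in the presentation.

```python
def toy_cell_cycle(state, k_s=1.0, k_d=0.5, k_a=0.3, k_i=0.6):
    """Toy model: cyclin C is synthesised and degraded; kinase M is activated by C."""
    cyclin, kinase = state
    d_cyclin = k_s - k_d * cyclin
    d_kinase = k_a * cyclin - k_i * kinase
    return [d_cyclin, d_kinase]


def rk4_step(f, state, dt):
    """Advance the state by one classical fourth-order Runge-Kutta step."""
    k1 = f(state)
    k2 = f([s + 0.5 * dt * k for s, k in zip(state, k1)])
    k3 = f([s + 0.5 * dt * k for s, k in zip(state, k2)])
    k4 = f([s + dt * k for s, k in zip(state, k3)])
    return [
        s + dt / 6.0 * (a + 2 * b + 2 * c + d)
        for s, a, b, c, d in zip(state, k1, k2, k3, k4)
    ]


def simulate(f, initial_state, dt, steps):
    """Return the list of states visited, including the initial one."""
    trajectory = [initial_state]
    state = initial_state
    for _ in range(steps):
        state = rk4_step(f, state, dt)
        trajectory.append(state)
    return trajectory


trajectory = simulate(toy_cell_cycle, [0.0, 0.0], dt=0.01, steps=5000)
final_cyclin, final_kinase = trajectory[-1]
print(f"t = 50: cyclin = {final_cyclin:.4f}, kinase = {final_kinase:.4f}")
```

For this toy system the trajectory settles at the steady state C = k_s / k_d and M = k_a C / k_i, which gives a simple sanity check on the integrator. The resulting time series could then be written to a text file and plotted, for example with gnuplot as on the slide.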
Tissue Microarray in GRID • Genetic diseases • High throughput techniques (e.g. DNA microarray) to screen the whole genome • Low reliability • Validation through Tissue Microarray
Tissue Microarray in GRID Genes and proteins detection
Tissue Microarray in GRID • Example request: edge detection on every TMA on the GRID having “age” > 80 AND “gender” = F AND “disease” = colon cancer • Diagram components: UI, AMGA server, and GRID nodes, each with a CE (elaboration) and an SE
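The selection step in this example can be pictured as a metadata filter that returns the logical file names of the matching TMA images, which are then handed to the processing jobs. The sketch below uses an in-memory list of records as a stand-in for the AMGA metadata catalogue; the record layout, field names, and file names are hypothetical, and no real AMGA query syntax is shown.

```python
def select_tma_images(records, min_age=80, gender="F", disease="colon cancer"):
    """Return the logical file names of TMA images whose metadata match the query."""
    return [
        record["lfn"]
        for record in records
        if record["age"] > min_age
        and record["gender"] == gender
        and record["disease"] == disease
    ]


# Hypothetical metadata entries, standing in for rows of the metadata catalogue.
metadata = [
    {"lfn": "tma_001", "age": 84, "gender": "F", "disease": "colon cancer"},
    {"lfn": "tma_002", "age": 79, "gender": "F", "disease": "colon cancer"},
    {"lfn": "tma_003", "age": 88, "gender": "M", "disease": "colon cancer"},
    {"lfn": "tma_004", "age": 91, "gender": "F", "disease": "breast cancer"},
]

selected_images = select_tma_images(metadata)
print(selected_images)
```

Only the selected images would need to be fetched from the storage elements and processed, so the edge detection jobs run on the matching subset rather than on every stored TMA.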
Deployment of BLAST in Grid • A large fraction of the biological data produced is publicly available on web or FTP sites • Data can be downloaded as “flat files” • A procedure has been set up to: • Check the remote site for an updated version of the DBs • Automatically download the data • Register the file in a grid catalogue (LFC) • Create a DB index for use with BLAST (using the Grid) • Register the index file(s) in the grid catalogue (LFC)
Biological Database handling • The Automatic Updater (AU) constantly monitors FTP sites, looking for the newest version of each database • When a new timestamp is detected on an FTP site, the newest version is automatically downloaded and replaces the older version on the grid • Before the older version is cleared, an xdelta patch is computed so that the old version can be regenerated from the new one.
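The update procedure described on the last two slides (check the timestamp, download, build the BLAST index, register both in the catalogue) can be sketched as a single function. Everything here is a simplified assumption: the catalogue is a plain dictionary standing in for the LFC, the download and indexing steps are passed in as hypothetical callables, timestamps are plain integers, and the xdelta patch step is left out.

```python
def update_database(name, remote_timestamp, catalogue, download, build_index):
    """Refresh one database if the remote copy is newer than the registered one.

    Returns True if the catalogue entry was replaced, False if it was already current.
    """
    entry = catalogue.get(name)
    if entry is not None and entry["timestamp"] >= remote_timestamp:
        return False  # registered copy is up to date, nothing to do

    data_file = download(name)            # fetch the flat file
    index_files = build_index(data_file)  # create the BLAST index files
    catalogue[name] = {                   # register data and indexes together
        "timestamp": remote_timestamp,
        "data": data_file,
        "indexes": index_files,
    }
    return True


# Hypothetical stand-ins for the real download and indexing steps.
def fake_download(name):
    return f"{name}.fasta"


def fake_build_index(data_file):
    return [f"{data_file}.idx"]


catalogue = {}
first = update_database("swissprot", 100, catalogue, fake_download, fake_build_index)
second = update_database("swissprot", 100, catalogue, fake_download, fake_build_index)
third = update_database("swissprot", 200, catalogue, fake_download, fake_build_index)
print(first, second, third, catalogue["swissprot"]["timestamp"])
```

In a real deployment the monitor would call such a function periodically for each database, and the indexing step would itself run as a grid job.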
Biological Database handling • This data management software dynamically replicates each database according to its usage, balancing the number of replicas (and therefore performance) against the disk space occupied. • It relies on statistical analysis of database usage by grid jobs, working on data acquired after each job execution: grid queue times, database set-up times, and overall job computation time. • We face complex data challenges, performing both the parsing of the output results and the storage of the data in the database directly from the GRID.
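The trade-off between replica count and disk space can be illustrated with a much simpler rule than the statistical analysis described above: start with one replica per database, then repeatedly give an extra replica to the database with the most jobs per existing replica, as long as it fits in the disk budget. This greedy heuristic, the usage counts, and the sizes are all assumptions made for illustration, not the project's algorithm.

```python
def plan_replicas(jobs_per_database, size_gb, disk_budget_gb):
    """Greedy replica plan: favour heavily used databases within a disk budget."""
    plan = {name: 1 for name in jobs_per_database}  # one replica each to start
    used_gb = sum(size_gb[name] for name in plan)

    while True:
        # Databases that are actually used and whose extra replica still fits.
        candidates = [
            name
            for name in plan
            if jobs_per_database[name] > 0
            and used_gb + size_gb[name] <= disk_budget_gb
        ]
        if not candidates:
            return plan
        # Pick the database with the highest load per existing replica.
        busiest = max(
            candidates, key=lambda name: (jobs_per_database[name] / plan[name], name)
        )
        plan[busiest] += 1
        used_gb += size_gb[busiest]


# Hypothetical usage statistics and database sizes.
jobs = {"nr": 900, "swissprot": 100}
sizes = {"nr": 40, "swissprot": 5}
replica_plan = plan_replicas(jobs, sizes, disk_budget_gb=130)
print(replica_plan)
```

A production system would also weigh queue times and database set-up times, as the slide indicates; the sketch only shows how usage and disk space pull in opposite directions.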
Results • In order to make this software rapidly accessible a user interface has been developed. • It is used to submit jobs in the grid infrastructure, to visualize in a clear form the obtained results and to hide the complexity of the distributed platform.
Results • The main feature of the portal is that it can completely hide the JDL script layer used for grid job submission. • While it is still possible to submit a simple job to the grid by writing one's own JDL script, the idea is to hide this process to make grid use more user-friendly for the bioinformatics community.
Results • The interfaces to application jobs are automatically generated by the conversion of XML files that describe both the end user parameters and the structure of the JDL scripts that have to be automatically generated to submit the jobs.
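A minimal sketch of this XML-to-JDL conversion is shown below. The XML layout (an `application` element with `parameter`, `input`, and `output` children), the script name, and the way arguments are formatted are all hypothetical; the portal's actual schema is not given in the slides. The generated text uses standard JDL attributes (`Executable`, `Arguments`, `StdOutput`, `StdError`, `InputSandbox`, `OutputSandbox`).

```python
import xml.etree.ElementTree as ET

# Hypothetical application description; the real portal schema is not shown in the slides.
APP_XML = """
<application name="blast" executable="run_blast.sh">
  <parameter name="database" default="swissprot"/>
  <parameter name="evalue" default="1e-5"/>
  <input file="query.fasta"/>
  <output file="result.txt"/>
</application>
"""


def jdl_list(items):
    """Format a Python list as a JDL list of quoted strings."""
    return "{" + ", ".join(f'"{item}"' for item in items) + "}"


def build_jdl(xml_text, user_values):
    """Generate JDL text from the XML description and the values the user entered."""
    app = ET.fromstring(xml_text)
    executable = app.get("executable")

    # Each parameter takes the user's value if given, otherwise its default.
    arguments = " ".join(
        f"--{p.get('name')} {user_values.get(p.get('name'), p.get('default'))}"
        for p in app.findall("parameter")
    )
    inputs = [executable] + [i.get("file") for i in app.findall("input")]
    outputs = ["std.out", "std.err"] + [o.get("file") for o in app.findall("output")]

    return "\n".join([
        f'Executable = "{executable}";',
        f'Arguments = "{arguments}";',
        'StdOutput = "std.out";',
        'StdError = "std.err";',
        f"InputSandbox = {jdl_list(inputs)};",
        f"OutputSandbox = {jdl_list(outputs)};",
    ])


jdl_text = build_jdl(APP_XML, {"evalue": "1e-10"})
print(jdl_text)
```

The same XML description could also drive the generation of the web form itself, since it lists every end-user parameter with its default value.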
Results • A selection can be made among the different databases against which to perform the analysis; all these databases are updated automatically. • The figure shows a summary of the submitted application jobs, with information about the analysis software, the global computation status, and the user interface used for submission.
Italian Bioinformatics Networks 30 Research Nodes
CNR-BIOINFORMATICS Networks 24 CNR Research Nodes National Research Council CNR-Bioinformatics project
Virtual Physiological Human • Concept basis: the International Physiome Project, www.physiome.org • Computational frameworks and ICT-based tools for multiscale models of human anatomy, physiology, and pathology • Libraries of data and toolboxes for simulation and visualisation • Patient-specific models from biosignals and images, including molecular images • Loukianos Gatzouli, ICT for Health
Acknowledgments • BioinfoGRID http://www.bioinfogrid.eu • EGEE: Enabling Grids for E-sciencE project http://www.eu.egee.org • EELA: e-Infrastructure between Europe and Latin America project http://www.eu-eela.org/index.htm • EUChinaGRID: Interconnection & Interoperability of Grids between Europe & China project http://www.euchinagrid.org/ • FIRB-MIUR LITBIO: Laboratory for Interdisciplinary Technologies in Bioinformatics http://www.litbio.org