260 likes | 373 Views
Deployment and Preparation of Metagenomic Analysis on the EELA Grid. Gabriel Aparício, et al. Topic summary. Global Process Topics. Introduction Cases of study Deployment Automation System Results and Performances Analysis Future Plans. Introduction. What is a Metagenome?
E N D
Deployment and Preparation of Metagenomic Analysis on the EELA Grid Gabriel Aparício, et al.
Topic summary Global Process Topics Introduction Cases of study Deployment Automation System Results and Performances Analysis Future Plans
Introduction • What is a Metagenome? • What is a Metagenome Analysis? • Why Grid is a Good Solution? • Which is the Proposed Structure? • Which are the Future Plans?
What is a Metagenome? Introduction Term first used by Jo Handelsman and others in the University of Wisconsin in 1998. A metagenome is a collection of genes. It can be studied as a single gene. Analysis can be done without isolating genes and lab-cultivating them.
What is a Metagenome Analysis? Introduction • A Metagenome Analysis is the group of necessary steps to transform a file of a coded metagenome into another file with some interest information. • This can include: • Database filtering. • BLAST alignments. • BLAST output filtering. • Creation of Phylogenetic Trees.
Why Grid is a Good Solution? Introduction • A Metagenome can be coded into several hundred of thousand sequences. • Sequential time can take more than a year. • Public databases are continuously changing. • Several coarse steps can be done in parallel. • In a Grid, the global job can be divided into subjobs. • A Metagenome Analysis can be processed in a few days with a Grid Infrastructure.
Farm Soil Metagenome Cases of Study This is a sample from a nutrient rich, moderately contaminated soil environment. This community is very diverse and complex. Many yet unknown enzymes are probably present there. Its study is very interesting from the biotechnological point of view.
Whale fall Metagenome Cases of Study Whale carcasses are known to be a nutrient-rich environment in the bottom of the ocean. A heterogeneous mixture of bacteria flourish there. It is one of the best examples of marine bacterial communities.
Sargasso Sea Metagenome Cases of Study These oceanic samples are taken from surface waters. They represent the diversity of bacteria that live planktonically.
Gut Metagenomes Cases of Study • Several metagenomes of the human intestinal microbiota. • This consortia of bacteria helps its host to metabolize many nutrients that would be indigestible otherwise. • It is involved in other functions • Maturation and modulation of the immune response of the host. • Prevention of infection by bacterial pathogens.
Sequential or Parallel jobs? (I) Deployment There are around 150 CEs in BIOMED and EELA VOs. There are only around 30 CEs able to run MPICH jobs. The number of CEs decreases when the number of required nodes increases. Full efficiency in MPICH jobs is achieved occasionally.
Sequential or Parallel jobs? (II) Deployment
Selecting CEs (I) Deployment • Several jobs are needed • A single job can take more than a year. • It is needed to split the Analysis into several subjobs (often more than 100 subjobs). • Several CEs are needed • To decentralize processing, storing and network bandwidth. • A Metagenome Analysis job has requirements • On software, hardware and configuration.
Selecting CEs (II) Deployment Not all available CEs are able to produce results. Not all available CEs have the same performance. It is needed to select CEs and to distribute jobs according to their performance.
Selecting CEs (III) Deployment
Selecting SEs and Replicating Files Deployment All jobs need certain common files. These files have to be replicated to increase performance and to distribute network bandwidth. SEs selected will be located according to their geographical and administrative nearness to selected CEs, their performance and their configuration.
Splitting global job Deployment • The global job has to be broken down into subjobs. • The subjob lifetime will decrease • Increase interactivity. • Improve monitoring capabilities.
Submitting Jobs Automation System Subjobs are assigned to a list of CEs These CEs have been tested. Assignation is done according to obtained performances in previous experiments.
Monitoring Jobs Automation System Periodically, jobs status are monitored. In case of errors (aborted job, bad results, etc.), the job is automatically resubmitted. In case the job is running too long, the job is cancelled and resubmitted. In case the job has finished successfully, its CEs is annotated for later submissions.
Resubmitting Jobs Automation System Each correctly finished job annotates its CEs and puts it into a list. The jobs are resubmitted to a random CE of this list. If the list does not exist, the job is submitted to a random CE.
Retrieving Results Automation System Once results are available, they are downloaded and the standard outputs are explored to find any error. A retrieved job is no longer monitored.
First conclusions Results and Performances • Jobs are too long to run sequentially • Sargasso Sea Metagenome takes 512 days. • The same job in Grid takes 13 days to be fully finished. • Speedup is around 40. • High speed for most jobs (90% in 7 days) • Speedup is around 80. • No need to finish all jobs to begin with new stages.
Correctly finished jobs percentage Results and Performances
Sequences processed per hour Results and Performances
Future plans Future plans To create several shell-scripts with different stages depending on the desired results. To increase cases of study. To improve automation performances. To make a report with the issues and lessons learnt in EGEE and EELA infrastructures.
Contact Contact Gabriel Aparício i Pla Ignacio Blanquer Espert Vicente Hernández García Universitat Politècnica de València Camí de Vera, s/n 46022 València, Spain Emails: gaparicio@itaca.upv.es iblanque@itaca.upv.es vhernand@itaca.upv.es