Accelerating bioinformatics with GRID computing. Angel Merino, Centro Nacional de Biotecnología, Unidad de Biocomputación. What to cover ... Electron microscopy. What EM is. What the workflow is.
Accelerating bioinformatics with GRID computing
Angel Merino
Centro Nacional de Biotecnología, Unidad de Biocomputación
What to cover ...
• Electron microscopy (EM)
  - What EM is.
  - What the workflow is.
• What is being solved with the GRID: processes/applications that have been "gridified"
  - Maximum Likelihood
  - CTF estimation
• Overcoming the potential barrier
  - Web portal
  - Web/Grid Services & Workflows
• Other applications in the field
What is EM (I)
• EM is a structural analysis technique.
• It lets us explore the molecular environment of the particles under study.
What is the workflow
1. Sample preparation.
2. Image acquisition.
3. Image processing and computation of 3D volumes.
What is EM (II)
Biological material
  - High H2O content
  - Elevated radiation damage
• Negative staining
  - Dehydration
  - Structural changes / flattening
  - Image comes from the metal cast
• Cryomicroscopy
  - Hydrated / biologically friendly
  - Fewer distortions
  - Image comes from the biological specimen
What is EM (III)
[Figure panels: negative staining; cryomicroscopy]
CTF estimation (I)
Aberrations in the microscope optics blur the experimental images. These effects can be described by the contrast transfer function (CTF), so estimating the CTF makes it possible to correct the blurred images. CTF estimation in Xmipp may take up to half a day per micrograph, and a user processes about 100 micrographs per experiment. Grid computing is therefore necessary.
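The figures on this slide can be turned into a rough estimate of wall-clock time. The sketch below is a back-of-the-envelope calculation, not a measurement: it assumes the worst case of half a day for every micrograph, treats each micrograph as an independent job, and ignores queueing and data-transfer overhead. The function name `ctf_campaign_days` is hypothetical.

```python
import math


def ctf_campaign_days(n_micrographs: int, days_per_micrograph: float, n_workers: int) -> float:
    """Estimate wall-clock days when each micrograph is an independent job.

    Assumption: jobs run in "rounds" of n_workers at a time and every job
    takes the same (worst-case) time.
    """
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    rounds = math.ceil(n_micrographs / n_workers)
    return rounds * days_per_micrograph


# One workstation versus one grid worker per micrograph.
serial_days = ctf_campaign_days(100, 0.5, 1)
grid_days = ctf_campaign_days(100, 0.5, 100)
print(f"sequential: {serial_days} days, one worker per micrograph: {grid_days} days")
```

Under these assumptions a single machine needs about 50 days per experiment, while one worker per micrograph brings this down to roughly the time of a single estimation.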
CTF estimation (III)
Per micrograph
1000x Maximum-Likelihood
Maximum-Likelihood (I)
"Slow" execution, 1 iteration
Maximum-Likelihood (II)
"Fast" execution (MPI)
Developing Maximum-Likelihood on the EGEE GRID vs. the local cluster
Using the EGEE GRID
• Last November, 17160 CPU hours were consumed (almost 2 years!), equivalent to 23 CPUs running full time.
• Real usage time was 50% of the total time, because of the development work under way at the time, so the Grid was effectively providing 46 CPUs.
Using our local cluster (jumilla.cnb.uam.es) at 50% usage, for the same activity
• 20 CPUs
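The numbers on this slide can be checked with a few lines of arithmetic. This is only a sketch of how the figures appear to have been derived: it assumes a 30-day month and a 365-day year, and the variable names are illustrative.

```python
HOURS_PER_DAY = 24
DAYS_IN_NOVEMBER = 30   # assumption: the month quoted on the slide
DAYS_PER_YEAR = 365

cpu_hours = 17160       # CPU hours consumed on the EGEE grid (from the slide)
utilisation = 0.5       # fraction of the month the grid was actually in use

# CPU time expressed in years of a single CPU.
cpu_years = cpu_hours / (HOURS_PER_DAY * DAYS_PER_YEAR)

# Number of CPUs that would be busy around the clock for the whole month.
full_time_cpus = cpu_hours / (HOURS_PER_DAY * DAYS_IN_NOVEMBER)

# If the work was only running half of the time, twice as many CPUs
# must have been in use while it was running.
concurrent_cpus = full_time_cpus / utilisation

print(f"{cpu_years:.2f} CPU-years, {full_time_cpus:.1f} full-time CPUs, "
      f"{concurrent_cpus:.1f} CPUs while active")
```

The slide's figures of 23 and 46 CPUs correspond to truncating the full-time figure to a whole number before doubling it.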
Overcoming the potential barrier
4 simple steps to run all the jobs you need for your experiment:
1. Select your application.
2. Log in to the UI.
3. Upload the necessary files.
4. Submit your experiment, giving a notification e-mail address and your certificate password.
Overcoming the potential barrier (I)
The portal engine
• Input from the Grid portal: JDLs
• For each JDL:
  - C++ object
  - Required scripts (3)
  - Required input tars
• First script: submit the job and publish the data (first time).
• Second script: run the job and publish the output data when the job finishes.
• Third script: get the output and retrieve the output data.
• Checking status:
  - Done (success)
  - Aborted or not submitted
• Send an e-mail to the notification e-mail address.
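The job lifecycle described on this slide can be sketched as a small polling loop. This is not the portal's actual engine (which the slide describes as a C++ object driving three scripts); it is a minimal Python illustration in which `submit`, `check_status`, `fetch_output` and `notify` are hypothetical callables standing in for those scripts and for the e-mail step. It also assumes that a notification is sent on both success and failure, which the slide does not state explicitly.

```python
import time
from typing import Callable, Optional

DONE = "Done (success)"
FAILED = {"Aborted", "Not submitted"}


def run_job(
    jdl: str,
    submit: Callable[[str], Optional[str]],   # hypothetical: returns a job id, or None on failure
    check_status: Callable[[str], str],       # hypothetical: returns the current job status
    fetch_output: Callable[[str], None],      # hypothetical: retrieves the output data
    notify: Callable[[str], None],            # hypothetical: e-mails the user
    poll_seconds: float = 0.0,
    max_polls: int = 100,
) -> str:
    """Submit one JDL, poll until a final state, fetch output, notify the user."""
    job_id = submit(jdl)
    if job_id is None:
        notify(f"{jdl}: not submitted")
        return "Not submitted"

    for _ in range(max_polls):
        status = check_status(job_id)
        if status == DONE:
            fetch_output(job_id)
            notify(f"{jdl}: finished, output retrieved")
            return status
        if status in FAILED:
            notify(f"{jdl}: {status.lower()}")
            return status
        time.sleep(poll_seconds)

    notify(f"{jdl}: still running after {max_polls} status checks")
    return "Timed out"


# Usage with in-memory stand-ins instead of real grid commands.
fake_statuses = iter(["Scheduled", "Running", DONE])
sent_mail = []
final = run_job(
    "experiment.jdl",
    submit=lambda jdl: "job-001",
    check_status=lambda job_id: next(fake_statuses),
    fetch_output=lambda job_id: None,
    notify=sent_mail.append,
)
print(final, sent_mail)
```

Passing the grid operations in as callables keeps the loop testable without a grid: in a real deployment each callable would wrap the corresponding script.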
Overcoming the potential barrier (II)
Workflows & Grid Services
Other applications
Grid Protein Sequence Analysis
Scientific objectives
Bioinformatic analysis of the data produced by complete genome sequencing projects is one of the major challenges of the coming years. Such an analysis clearly requires integrating up-to-date databanks with the relevant algorithms. Grid computing, such as the infrastructure provided by the EGEE European project, would be a viable way to distribute data, algorithms, computing and storage resources for genomics. Providing bioinformaticians with a good interface to the grid infrastructure is a further challenge that must be met. The GPS@ web portal (Grid Protein Sequence Analysis) aims to be such a user-friendly interface to these genomic resources on the EGEE grid.
Method
A well-known web interface eases access to the algorithms offered. Protein databases are stored on grid storage as flat files. Most protein sequence analysis tools are reference legacy code that is run unchanged. These tools are wrapped in grid jobs to be executed on grid resources. The algorithm outputs are analysed and displayed in graphic format through the web interface.
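The idea of running legacy code unchanged inside a wrapper can be illustrated with a short sketch. This is not GPS@ code: the function `run_legacy_tool` is hypothetical, and a one-line Python command stands in for a real sequence analysis program so that the example is runnable anywhere.

```python
import subprocess
import sys
from typing import List


def run_legacy_tool(command: List[str], sequence_text: str) -> str:
    """Run an external program unchanged, feeding it input and capturing its output.

    The wrapper knows nothing about the tool's internals: it only passes the
    sequence on standard input and returns whatever the tool prints.
    """
    completed = subprocess.run(
        command,
        input=sequence_text,
        capture_output=True,
        text=True,
        check=True,  # raise if the tool exits with an error
    )
    return completed.stdout


# Stand-in for a legacy analysis tool: it just reports the sequence length.
stand_in_tool = [sys.executable, "-c", "import sys; print(len(sys.stdin.read().strip()))"]
print(run_legacy_tool(stand_in_tool, "MKTAYIAK\n"))
```

In a grid setting, a wrapper of this kind would be the executable of the job, and its captured output is what a portal would later parse and display.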
Other applications (I)
In silico drug discovery
• Scientific objectives
  - Provide docking information that helps in the search for new drugs.
  - Biological goal: propose new inhibitors (drug candidates) against neglected diseases.
  - Bioinformatics goal: in silico virtual screening of drug candidate databases.
  - Grid goal: demonstrate the relevance of grid infrastructures to the research communities active in drug discovery by deploying a compute-intensive application.
• Method
  - Large-scale molecular docking on malaria, evaluating millions of potential drugs with the chosen software and parameter settings.
  - Docking is about computing the binding energy of a protein target to a library of potential drugs using a scoring algorithm.
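The screening loop described under "Method" can be sketched in a few lines. The scoring function here is a toy lookup table with made-up values, not a real docking program, and all names (`screen_library`, `toy_score`, the ligand and target labels) are hypothetical. The sketch also assumes the usual convention that a lower (more negative) binding energy means a better candidate.

```python
from typing import Callable, Iterable, List, Tuple


def screen_library(
    target: str,
    ligands: Iterable[str],
    score: Callable[[str, str], float],
    top_n: int,
) -> List[Tuple[str, float]]:
    """Score every ligand against the target and return the top_n best binders.

    Each (target, ligand) evaluation is independent, which is what makes
    the workload easy to spread over many grid jobs.
    """
    results = [(ligand, score(target, ligand)) for ligand in ligands]
    results.sort(key=lambda pair: pair[1])  # lowest energy first
    return results[:top_n]


def toy_score(target: str, ligand: str) -> float:
    """Placeholder for a docking run: returns made-up energies for illustration."""
    made_up_energies = {"ligand_a": -7.2, "ligand_b": -9.1, "ligand_c": -5.4}
    return made_up_energies[ligand]


best = screen_library("target_x", ["ligand_a", "ligand_b", "ligand_c"], toy_score, top_n=2)
print(best)
```

In a real campaign, `score` would launch a docking program for one ligand, and the ligand library would be split into chunks, one per grid job.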
Other applications (II)
Genome evolution modeling
Scientific objectives
Study human evolutionary genetics and answer questions such as the geographic origin of modern human populations, the genetic signature of expanding populations, the genetic contacts between modern humans and Neanderthals, and the expected null distributions of genetic statistics applied to genome-wide data sets.
Method
Simulate the past demography (growth and migrations) of human populations in a geographically realistic landscape, taking into account the spatial and temporal heterogeneity of the environment. Generate the molecular diversity of several samples of genes drawn at any location within the current human range, and compare it with the observed contemporary molecular diversity. SPLATCHE uses a region-sampling Bayesian framework that requires 10^5 independent demographic and genetic simulations.
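Because the simulations are independent, they can be divided among grid jobs without any communication between them. The sketch below shows one simple way to partition the work. It is not SPLATCHE code: the function name, the job size of 500 and the total of 100,000 simulations are assumptions chosen for illustration, as is the idea of using the simulation index as a random seed.

```python
from typing import Dict, List


def split_simulations(n_simulations: int, sims_per_job: int) -> List[Dict[str, int]]:
    """Partition independent simulations into grid jobs.

    Each job receives a contiguous range of simulation indices; using the
    index as the random seed keeps every simulation distinct and reproducible.
    """
    if sims_per_job < 1:
        raise ValueError("sims_per_job must be at least 1")
    jobs = []
    for job_index, first in enumerate(range(0, n_simulations, sims_per_job)):
        count = min(sims_per_job, n_simulations - first)
        jobs.append({"job": job_index, "first_seed": first, "count": count})
    return jobs


jobs = split_simulations(100_000, 500)
print(len(jobs), "jobs; last job:", jobs[-1])
```

Smaller jobs finish sooner and are cheaper to resubmit after a failure, while larger jobs reduce scheduling overhead, so the job size is a tuning choice.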
More information
Xmipp web page: www.cnb.uam.es/~bioinfo
Unit web page: http://biocomp.cnb.uam.es
NA4 EGEE biomed applications home: http://egee-na4.ct.infn.it/biomed/index.php
Contact: aj.merino@cnb.uam.es