Large Scale Virtual Screening of Drug Design on the Grid: Fighting against Avian Flu
Yun-Ta Wu and Hurng-Chun Lee, ISGC 2006, Taiwan
Docking workflow preparation (contact point: Y.T. Wu): E. Rovida, P. D'Ursi, N. Jacq
Grid resource management (contact point: J. Salzemann): TWGrid: H.C. Lee, H.Y. Chen; AuverGrid: E. Medernach; EGEE: Y. Legré
Platform deployment on the Grid (contact points: H.C. Lee, J. Salzemann): M. Reichstadt, N. Jacq
Users (deputy): J. Salzemann (N. Jacq), M. Reichstadt (E. Medernach), L.Y. Ho (H.C. Lee), I. Merelli, C. Arlandini (L. Milanesi), J. Montagnat (T. Glatard), R. Mollon (C. Blanchet), I. Blanque (D. Segrelles), D. Garcia
Credit: Grid operational support from sites and operation centers; DIANE technical support from the CERN-ARDA group
Outline
• The avian flu
• EGEE biomed data challenge II
• Conclusion
[Figure: timeline of Influenza A pandemics and outbreaks (H1N1, H2N2, H3N2, H5N1, H7N7, H9N2), with hemagglutinin (HA) and neuraminidase (NA) marked on the virus surface. As of Apr 21, 2006: 113 deaths / 204 cases. Source: http://www.who.int/csr/disease/avian_influenza]
A closer look at bird flu
• The bird flu virus is named H5N1; H5 and N1 refer to the proteins (hemagglutinin and neuraminidase) on the virus surface
• Neuraminidase plays a major role in virus multiplication
• Current drugs such as Tamiflu inhibit the action of neuraminidase and stop virus proliferation
• The N1 protein is known to evolve into variants when it comes under drug stress
• To free up medicinal chemists' time and allow a faster response to sudden, large-scale threats, a large-scale in-silico screening was set up as the initial step in the design of new drugs
In-silico (virtual) screening for drug design
• Computer-based in-silico screening can help identify the most promising leads for biological tests
  • systematic and productive
  • reduces the cost of the trial-and-error approach
• The required CPU power and storage space grow in proportion to the number of compounds and target receptors involved in the screening
  • massive virtual screening is time consuming
The computing challenge of large-scale in-silico screening
• Molecular docking engines: Autodock, FlexX
• Problem size
  • 8 predicted possible variants of Influenza A neuraminidase N1 as targets
  • around 300 K compounds from the ZINC database and a chemical combinatorial library
• Computing challenge (a rough estimate based on a Xeon 2.8 GHz)
  • each docking requires ~30 min of CPU time
  • required computing power in total: ~137 CPU years
• Storage requirement
  • each docking produces results of about 130 KByte
  • required storage space in total: ~600 GByte (with 1 back-up)
• To speed up and reduce the cost of developing new drugs, high-throughput screening is needed
• That is where the Grid can help! (a back-of-the-envelope check of these figures is sketched below)
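As a sanity check, the totals above follow directly from the problem size; a minimal Python calculation reproducing only the numbers quoted on this slide:

# Back-of-the-envelope estimate of the DC II problem size (illustrative only)
targets          = 8          # predicted N1 variants
compounds        = 300000     # ZINC database + combinatorial library
cpu_min_per_dock = 30         # measured on a Xeon 2.8 GHz
kb_per_dock      = 130        # size of one docking result

dockings   = targets * compounds                               # 2,400,000 dockings
cpu_years  = dockings * cpu_min_per_dock / 60.0 / 24 / 365     # ~137 CPU years
storage_gb = dockings * kb_per_dock * 2 / 1024.0 / 1024        # ~600 GB with 1 back-up

print(dockings, round(cpu_years), round(storage_gb))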
EGEE Biomed DC II – objectives
• Biological goal: finding potential compounds that can inhibit the activity of Influenza A neuraminidase N1-subtype variants
• Biomedical goal: accelerating the discovery of novel potent inhibitors by minimizing non-productive trial-and-error approaches
• Grid goal
  • massive-throughput aspect: reproducing a grid-enabled in-silico process (exercised in DC I) with a shorter preparation time
  • interactive-feedback aspect: evaluating an alternative lightweight grid application framework (DIANE) in terms of stability, scalability and efficiency
EGEE Biomed DC II – grid resources
• AuverGrid
• BioinfoGrid
• EGEE-II
• Embrace
• TWGrid
• a world-wide infrastructure providing more than 5,000 CPUs
EGEE Biomed DC II – current status
• The first DC job was submitted on 10 Apr 2006
• The challenge is scheduled to finish by mid-May
• As of today, we have completed 1,500 K dockings
  • ~60 % of the whole challenge (i.e. 82 CPU years)
  • Grid efficiency ~80 %
EGEE Biomed DC II – the Grid tools
• WISDOM
  • successfully handled the first EGEE biomed DC
  • a workflow for Grid job handling: automated job submission, status check and report, error recovery
  • push-model job scheduling
  • batch-mode job handling
• DIANE
  • a framework for applications with a master-worker model
  • pull-model job scheduling
  • interactive-mode job handling with a flexible failure-recovery feature
  • we will focus on this framework in the following discussions
The WISDOM workflow in DC2
• Developed for the 1st data challenge, fighting against Malaria
  • 40 million dockings (80 CPU years) were done in 6 weeks
  • 1,700 CPUs in 15 countries were used simultaneously
• Reproducing a grid-enabled in-silico process with a shorter preparation time (< 1 month of preparation has been achieved)
• Testing a new submission strategy to improve the Grid efficiency (a push-model sketch follows below)
• AutoDock is used in DC2
http://wisdom.eu-egee.fr
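To make the push-model, batch-style handling concrete, here is a minimal sketch, assuming hypothetical submit/status helper functions in place of the actual LCG submission commands; it is not the WISDOM implementation itself:

# Push-model, batch-style job handling in the spirit of WISDOM (illustrative sketch).
# The compound list is split into chunks, one Grid job is submitted per chunk,
# and failed jobs are resubmitted.  `submit(chunk) -> job_id` and
# `status(job_id) -> state` are hypothetical helpers, not real middleware calls.
import time

def chunks(items, size):
    # yield successive fixed-size slices of the ligand list
    for i in range(0, len(items), size):
        yield items[i:i + size]

def run_challenge(ligands, submit, status, chunk_size=100, poll=600):
    jobs = {submit(chunk): chunk for chunk in chunks(ligands, chunk_size)}
    while jobs:
        time.sleep(poll)                         # periodic status check
        for job_id, chunk in list(jobs.items()):
            state = status(job_id)
            if state == "Done":
                del jobs[job_id]                 # results are retrieved separately
            elif state in ("Aborted", "Failed"):
                del jobs[job_id]
                jobs[submit(chunk)] = chunk      # error recovery: resubmit the chunk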
The DIANE framework
• DIANE = DIstributed ANalysis Environment
• A lightweight framework for parallel scientific applications in the master-worker model
  • ideal for applications without communication between parallel tasks (e.g. most Bioinformatics applications analysing huge amounts of independent data)
• The framework takes care of all synchronization, communication and workflow-management details on behalf of the application (an illustrative sketch of the pull model follows below)
http://cern.ch/diane
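For readers unfamiliar with the pull model, here is a minimal single-machine analogue in Python; this is illustrative only, not the DIANE API (the real master and workers communicate over the network), and run_docking is a hypothetical stand-in for an Autodock run:

# Pull-model master-worker sketch: the master keeps a queue of independent tasks,
# each worker repeatedly pulls the next task, and a task whose worker fails is
# simply put back on the queue for a healthy worker.
import queue, random, threading

def run_docking(ligand):
    # stand-in for one Autodock run; occasionally fails like a flaky Grid worker
    if random.random() < 0.1:
        raise RuntimeError("worker lost")
    return -7.5                                  # pretend binding energy

tasks = queue.Queue()
results = []

def worker():
    while True:
        ligand = tasks.get()
        try:
            results.append((ligand, run_docking(ligand)))
        except Exception:
            tasks.put(ligand)                    # failed task is rescheduled
        finally:
            tasks.task_done()

if __name__ == "__main__":
    for i in range(1000):
        tasks.put("ZINC%08d" % i)
    for _ in range(8):                           # eight local "workers" pulling tasks
        threading.Thread(target=worker, daemon=True).start()
    tasks.join()                                 # wait until every task has succeeded
    print(len(results), "dockings completed")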
The DIANE exercise in DC2
• Taking care of the dockings for one variant
  • the mission is to complete 300 K dockings
• Taking a small subset of the resources
  • the mission is to have one DIANE master handle several hundred concurrent DIANE workers for a long period
• Testing the stability of the framework
• Evaluating the deployment effort and the usability of the framework
• Demonstrating efficient computing resource integration and usage
Statistics of one of the DIANE runs
• Submitted Grid jobs: 300
• "Healthy" jobs: 261 (87 %)
• Total number of dockings: 40,210
• Total CPU time: 55,684,848 sec (1.76 CPU years)
• Job duration: 249,746 sec (2.9 days)
[Slide annotation: 9.24 CPU years ≈ 250 CPUs × two weeks]
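A quick consistency check of the quoted figures (illustrative arithmetic only, not part of the original slide):

total_cpu_sec = 55684848        # total CPU time of the run
dockings      = 40210
wall_sec      = 249746          # job duration of the run

print(total_cpu_sec / dockings / 60.0)       # ~23 min average per docking
print(total_cpu_sec / 3600.0 / 24 / 365)     # ~1.77 CPU years consumed
print(total_cpu_sec / float(wall_sec))       # ~223 CPUs kept busy on average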
Development and deployment efforts of DIANE
• Development efforts
  • the Autodock adapter for DC2 is around 500 lines of Python code (an illustrative docking-task sketch follows below)
• Deployment efforts
  • the DIANE framework and the Autodock adapter are installed on-the-fly on the Grid nodes
  • targets and compound databases can be prepared on the UI or pre-stored on the Grid storage
  • output is returned to the UI interactively
[Figure: master-worker deployment diagram]
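As a much-condensed, hypothetical sketch of what a single docking task inside such an adapter could look like (the real DC2 adapter is ~500 lines; this assumes the usual autodock3 invocation `autodock3 -p <dpf> -l <dlg>`, and all paths and names are placeholders, not the actual adapter's interface):

import subprocess
from pathlib import Path

def run_docking_task(ligand_dpf: Path, output_dir: Path) -> Path:
    # Run one docking and return the path of the produced .dlg result file.
    output_dir.mkdir(parents=True, exist_ok=True)
    dlg = output_dir / ligand_dpf.with_suffix(".dlg").name
    subprocess.run(["autodock3", "-p", str(ligand_dpf), "-l", str(dlg)],
                   check=True)                 # fail loudly so the master can reschedule
    return dlg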
Intuitive user interface of DIANE
• Start the DIANE job and allocate 64 workers from LCG and the local cluster
• Allocate more workers from LCG if resources are available

autodock.job:
# -*- python -*-
Application = 'Autodock'
JobInitData = {'macro_repos'  : 'file:///home/hclee/diane_demo/autodock/macro',
               'ligand_repos' : 'file:///home/hclee/diane_demo/autodock/ligand',
               'ligand_list'  : '/home/hclee/diane_demo/biomed_dc2/ligand/ligands.list',
               'dpf_parafile' : '/home/hclee/diane_demo/biomed_dc2/parameters/dpf3gen.awk',
               'output_prefix': 'autodock_test'}
## The input files will be staged in to the workers
InputFiles = [JobInitData['dpf_parafile']]

% diane.startjob --job autodock.job --ganga --w 32@lcg,32@pbs
% diane.ganga.submitworkers --job autodock.job --nw=100 --bk=lcg
The profile of a DIANE job
[Figure: task-execution profile of a simple test on a local cluster, showing good load balance; one DIANE/Autodock task = 1 docking]
The profile of a "realistic" DIANE job
• Each horizontal line segment = one task = one docking
• DIANE removes the "bad" workers: unhealthy workers are removed from the worker list
• Failed tasks are rescheduled to healthy workers
Efficiency and throughput of DIANE
• 280 DIANE worker agents were submitted as LCG jobs
  • 200 jobs (~71 %) were healthy
  • ~16 % of failures were related to middleware errors
  • ~12 % of failures were related to application errors
• Stable throughput: DIANE utilizes ~95 % of the healthy resources
Logging and bookkeeping … thanks to GANGA
A plain table of GANGA job-logging info:

In [1]: print jobs['DIANE_6']
Statistics: 325 jobs
slice("DIANE_6")
--------------
#  id    status     name     subjobs  application  backend  backend.actualCE
#  1610  running    DIANE_6           Executable   LCG      melon.ngpp.ngp.org.sg:2119/jobmanager-lcgpbs-
#  1611  running    DIANE_6           Executable   LCG      node001.grid.auth.gr:2119/jobmanager-lcgpbs-b
#  1612  running    DIANE_6           Executable   LCG      polgrid1.in2p3.fr:2119/jobmanager-lcgpbs-biom
#  1613  failed     DIANE_6           Executable   LCG      polgrid1.in2p3.fr:2119/jobmanager-lcgpbs-sdj
#  1614  submitted  DIANE_6           Executable   LCG      ce01.ariagni.hellasgrid.gr:2119/jobmanager-pb
#  1615  running    DIANE_6           Executable   LCG      ce01.pic.es:2119/jobmanager-lcgpbs-biomed
#  1616  running    DIANE_6           Executable   LCG      ce01.tier2.hep.manchester.ac.uk:2119/jobmanag
#  1617  running    DIANE_6           Executable   LCG      clrlcgce03.in2p3.fr:2119/jobmanager-lcgpbs-bi

• Helpful for tracing the execution progress and Grid job errors
• Fairly easy to visualize the job statistics
• The in-silico screening provides not only the docking poses of a compound against the target but also the docking energy
• By ranking this information, chemists can select the most promising compounds to take forward into structure-based design of potential drugs
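A minimal sketch of this ranking step: given (compound, docking energy) pairs collected from the Grid outputs, sort by energy so the most negative (best-binding) compounds come first. The input format here is an assumption for illustration, not the actual output format of the challenge:

def rank_compounds(results):
    # results: iterable of (compound_id, docking_energy_in_kcal_per_mol)
    return sorted(results, key=lambda pair: pair[1])

top_hits = rank_compounds([("ZINC00000001", -9.2),
                           ("ZINC00000002", -6.4),
                           ("ZINC00000003", -10.1)])[:2]
print(top_hits)     # the two most promising compounds for further design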
[Figure: enrichment plot for wild-type NA, enrichment ~13 %; 2qw = Relenza]
• Re-scored by considering both binding energy and subsite preference
• Two known strong inhibitors of wild-type NA were ranked 1st and 2nd …
• The subsite preference will guide the future screening for variant targets in the lab
Conclusion
• From the biological point of view
  • We managed to shorten the molecular docking step of structure-based drug design from an equivalent of 137 CPU years to about 4 weeks of wall-clock time
  • A large set of complexes has been produced on the Grid for further analysis
• From the Grid point of view
  • The DC has demonstrated that a large-scale scientific challenge can be tackled on the Grid with little effort
  • The WISDOM system has successfully reproduced the massive throughput of in-silico screening with minimal deployment effort
  • The DIANE framework, which can take control of Grid failures and isolate Grid system latency, benefits Grid applications in terms of efficiency, stability and usability
• Moving toward a service
  • The stability and reliability of the Grid have been tested through the DC activity, and the results encourage the move from prototype to real service
  • A friendly graphical user interface is needed for the upcoming analysis of the large set of outputs