Efficient handling of Large Scale in-silico Screening Using DIANE
The 2nd EGEE Biomed Data Challenge against Avian Flu
Hurng-Chun Lee, HealthGrid 2006, Valencia, Spain
• Docking workflow preparation (contact point: Y.T. Wu): E. Rovida, P. D'Ursi, N. Jacq
• Grid resource management (contact point: J. Salzemann): TWGrid: H.C. Lee, H.Y. Chen; AuverGrid: E. Medernach; EGEE: Y. Legré
• Platform deployment on the Grid (contact point: H.C. Lee, J. Salzemann): M. Reichstadt, N. Jacq
• Users (deputy): J. Salzemann (N. Jacq), M. Reichstadt (E. Medernach), L.Y. Ho (H.C. Lee), I. Merelli, C. Arlandini (L. Milanesi), J. Montagnat (T. Glatard), R. Mollon (C. Blanchet), I. Blanque (D. Segrelles), D. Garcia
• Credit: Grid operational support from sites and operation centers, and DIANE technical support from the CERN-ARDA group
Outline • The avian flu • EGEE biomed data challenge II • The activity of DIANE in DC2 • Conclusion and future work
Influenza A pandemic
[Figure: timeline of Influenza A HA/NA pandemic subtypes (H1N1, H2N2, H3N2, H1N1) and recent avian outbreaks (H5N1, H7N7, H9N2) up to 2005-2006]
As of Apr 21, 2006: 113 deaths / 204 cases
http://www.who.int/csr/disease/avian_influenza
A closer look at bird flu • The bird flu virus is named H5N1. H5 and N1 refer to the proteins (hemagglutinin and neuraminidase) on the virus surface • Neuraminidases play a major role in virus multiplication • Current drugs such as Tamiflu inhibit the action of neuraminidases and stop virus proliferation • The N1 protein is known to evolve into variants when it comes under drug stress • To free up medicinal chemists' time and allow a faster response to sudden, large-scale threats, a large-scale in-silico screening was set up as an initial investment in the design of new drugs
In-silico (virtual) screening for drug design • Computer-based in-silico screening can help identify the most promising leads for biological tests • systematic and productive • reduces the cost of the trial-and-error approach • The required CPU power and storage space grow in proportion to the number of compounds and target receptors involved in the screening • massive virtual screening is time consuming
The computing challenge of large scale in-silico screening • Molecular docking engines • AutoDock • FlexX • Problem size • 8 predicted possible variants of Influenza A neuraminidase N1 as targets • around 300 K compounds from the ZINC database and a chemical combinatorial library • Computing challenge (a rough estimate based on a Xeon 2.8 GHz) • Each docking requires ~30 mins of CPU time • Required computing power in total is more than 100 CPU years • Storage requirement • Each docking produces results of about 130 KByte • Required storage space in total is ~600 GByte (with 1 back-up) • To reduce the cost of developing new drugs, high-throughput screening is demanded • That's where the Grid can help! (a back-of-envelope estimate follows below)
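A minimal back-of-envelope estimate in Python, reproducing the figures quoted on this slide (the 30 min per docking, 130 KByte per result, 8 targets and 300 K compounds are the numbers given above; the script itself is only an illustration):

# Rough estimate of the DC2 compute and storage requirements
# (numbers taken from the slide above; illustration only, not project code)
n_targets = 8            # predicted N1 variants
n_compounds = 300_000    # ZINC database + combinatorial library
cpu_per_docking_h = 0.5  # ~30 min CPU time per docking on a Xeon 2.8 GHz
result_kb = 130          # output size per docking

n_dockings = n_targets * n_compounds
cpu_years = n_dockings * cpu_per_docking_h / (24 * 365)
storage_gb = n_dockings * result_kb * 2 / 1024 / 1024   # factor 2 for one back-up

print(f"{n_dockings:,} dockings, ~{cpu_years:.0f} CPU years, ~{storage_gb:.0f} GB")
# -> 2,400,000 dockings, ~137 CPU years, ~595 GB

The result agrees with the "more than 100 CPU years" and "~600 GByte" figures on the slide.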
EGEE Biomed DC II – objectives • Biological goal • finding potential compounds that can inhibit the activities of Influenza A neuraminidase N1 subtype variants • Biomedical goal • accelerating the discovery of novel potent inhibitors by minimizing non-productive trial-and-error approaches • Grid goal • high throughput screening (HTS): reproducing a grid-enabled in-silico process (exercised in DC I) with a short preparation time • interactive and efficient control of distributed dockings: evaluating an alternative lightweight grid application framework (DIANE) in terms of stability, scalability and efficiency
EGEE Biomed DC II – grid resources • AuverGrid • BioinfoGrid • EGEE-II • Embrace • TWGrid • a world-wide Grid infrastructure providing over 5,000 CPUs
EGEE Biomed DC II – the Grid tools • The WISDOM platform • successfully handled the first EGEE biomed DC • a workflow of Grid job handling: automated job submission, status check and report, error recovery • push-model job scheduling • batch-mode job handling • The DIANE framework • a framework for applications following the master-worker model • pull-model job scheduling • interactive-mode job handling with a flexible failure-recovery feature • we will focus on this framework in the following discussions (a sketch of the pull-model idea follows below)
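A minimal sketch of the pull-model idea (not the actual WISDOM or DIANE code): workers ask for work whenever they are idle, so fast or reliable workers naturally process more tasks, whereas a push model fixes the assignment of tasks to workers up front.

import queue
import threading

# Illustrative pull-model scheduling: idle workers pull the next task
# from a shared queue instead of tasks being pushed to fixed workers.
tasks = queue.Queue()
for compound_id in range(12):          # pretend each task is one docking
    tasks.put(compound_id)

def worker(name):
    while True:
        try:
            task = tasks.get_nowait()  # pull work only when idle
        except queue.Empty:
            return                     # no work left, worker terminates
        print(f"{name} docks compound {task}")
        tasks.task_done()

threads = [threading.Thread(target=worker, args=(f"worker-{i}",)) for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()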
The WISDOM workflow in DC2 • Developed for the 1st data challenge, fighting against Malaria • 40 million dockings (80 CPU years) were done in 6 weeks • 1700 CPUs in 15 countries were used simultaneously • Reproducing a grid-enabled in-silico process with a shorter preparation time (< 1 month of preparation has been achieved) • Testing a new submission strategy to improve the Grid efficiency • AutoDock is used in DC2 • http://wisdom.eu-egee.fr/avianflu/
The DIANE framework • DIANE = Distributed Analysis Environment • a lightweight framework for parallel scientific applications following the master-worker model • the framework takes care of all synchronization, communication and workflow-management details on behalf of the application • ideal for applications without communication between parallel tasks (e.g. most bioinformatics applications analyzing a large number of independent datasets) • [Figure: master-worker architecture] • http://cern.ch/diane (an application-side skeleton follows below)
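To make the master-worker split concrete, here is a hypothetical application skeleton in the spirit of a DIANE adapter (the class and method names are illustrative assumptions, not the real DIANE API): the application only defines how to split the workload into independent tasks, how a worker executes one task, and how the master integrates the results; scheduling, messaging and failure handling belong to the framework.

# Hypothetical master-worker adapter skeleton (illustrative names, not the DIANE API)
class DockingPlanner:
    """Master side: split the screening into independent tasks."""
    def __init__(self, ligands, target):
        self.ligands = ligands
        self.target = target

    def split(self):
        # one task per ligand; tasks are independent, so no inter-task communication
        return [{"target": self.target, "ligand": lig} for lig in self.ligands]

class DockingWorker:
    """Worker side: execute a single task pulled from the master."""
    def execute(self, task):
        # the real adapter would run a docking here; we just echo a dummy score
        return {"ligand": task["ligand"], "score": 0.0}

class DockingIntegrator:
    """Master side: collect and merge results as workers report back."""
    def __init__(self):
        self.results = []

    def integrate(self, result):
        self.results.append(result)

if __name__ == "__main__":
    # sequential wiring for illustration; DIANE would distribute the tasks
    planner = DockingPlanner(ligands=["ZINC0001", "ZINC0002"], target="N1_variant_3")  # made-up ids
    integrator = DockingIntegrator()
    worker = DockingWorker()
    for task in planner.split():
        integrator.integrate(worker.execute(task))
    print(len(integrator.results), "results collected")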
The DIANE exercise in DC2 • Taking care of the dockings of 1 variant • the mission is to complete 300 K dockings • 1/8 of the whole data challenge • Taking a small subset of the resources • the mission is to handle several hundred concurrent DIANE workers by one DIANE master for a long period • Testing the stability of the framework • Evaluating the deployment efforts and the usability of the framework • Demonstrating efficient computing resource integration and usage
Development and deployment efforts of DIANE • Development efforts • the AutoDock adapter for DC2 is around 500 lines of Python code (see the sketch below for the kind of work it does per docking) • Deployment efforts • the DIANE framework and the AutoDock adapter are installed on-the-fly on the Grid nodes • targets and compound databases can be prepared on the Grid UI or pre-stored on Grid storage • [Figure: running AutoDock through DIANE]
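As a rough illustration of what such an adapter has to do for each task (a sketch only, with hypothetical helper and file names, not the actual 500-line adapter): generate the docking parameter file for one ligand/target pair and invoke the AutoDock binary, returning the output to the master.

import subprocess
from pathlib import Path

def prepare_dpf(ligand, target_dir, dpf_path):
    # Placeholder: in DC2 the dpf was generated from a template (cf. dpf3gen.awk
    # in autodock.job); here we only write a stub so the sketch is self-contained.
    Path(dpf_path).write_text(f"# dpf for {ligand} against {target_dir}\n")

def run_one_docking(ligand_pdbqt, target_dir, workdir):
    """Sketch of a single docking as an adapter might perform it.
    The 'autodock3' binary name and file layout are assumptions."""
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    dpf = workdir / "docking.dpf"     # docking parameter file
    dlg = workdir / "docking.dlg"     # docking log/result file

    prepare_dpf(ligand_pdbqt, target_dir, dpf)
    subprocess.run(["autodock3", "-p", str(dpf), "-l", str(dlg)], check=True)
    return dlg.read_bytes()           # shipped back to the master (~130 KB per docking)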
Intuitive command line interface of DIANE
autodock.job (the job description file):

# -*- python -*-
Application = 'Autodock'
JobInitData = {'macro_repos'  : 'file:///home/hclee/diane_demo/autodock/macro',
               'ligand_repos' : 'file:///home/hclee/diane_demo/autodock/ligand',
               'ligand_list'  : '/home/hclee/diane_demo/biomed_dc2/ligand/ligands.list',
               'dpf_parafile' : '/home/hclee/diane_demo/biomed_dc2/parameters/dpf3gen.awk',
               'output_prefix': 'autodock_test'
              }
## The input files will be staged in to the workers
InputFiles = [JobInitData['dpf_parafile']]

• Start the data challenge
% diane.startjob --job autodock.job --ganga --w 32@lcg,32@pbs
• Request more CPUs (DIANE workers)
% diane.ganga.submitworkers --job autodock.job --nw=100 --bk=lcg
Graphic user interface for job submission (under development!) • DIANE CLI binding • target and compound selector and filter
Statistics of the DIANE activity in DC2 • Submitted Grid jobs: 2580 • Total number of dockings: 308585 • Peak number of concurrently running CPUs (DIANE workers): 240 • Total CPU time: 526892665.2 sec (~16.7 years) • activity duration: 2645558.4 sec (~30 days) • overall speedup: ~200 (checked below)
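The overall speedup quoted above is simply the ratio of consumed CPU time to wall-clock duration; a one-line check using the figures from this slide:

cpu_time_s = 526892665.2       # total CPU time consumed by all workers
wall_time_s = 2645558.4        # wall-clock duration of the DIANE activity
print(cpu_time_s / wall_time_s)  # ~199.2, i.e. an overall speedup of ~200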
The profile of a DIANE job • a simple test on a local cluster • one DIANE/AutoDock task = 1 docking • [Figure: per-worker task timeline, showing good load balance]
The profile of a "realistic" DIANE job • Each horizontal segment = one task = one docking • DIANE removes the "bad" workers: unhealthy workers are removed from the worker list • Failed tasks are rescheduled to healthy workers (a sketch of this recovery logic follows below)
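A minimal sketch of this failure-recovery pattern (illustrative only, not the framework's internal code): the master blacklists workers whose tasks fail repeatedly and puts failed tasks back into the pending queue so that healthy workers pick them up.

from collections import defaultdict, deque

# Illustrative master-side bookkeeping for failure recovery
pending = deque(range(10))          # tasks (dockings) still to be done
failures = defaultdict(int)         # failure count per worker
blacklist = set()                   # workers considered unhealthy
MAX_FAILURES = 3                    # threshold chosen arbitrarily for this sketch

def on_task_failed(worker_id, task):
    """Called by the master when a worker reports a failed task."""
    failures[worker_id] += 1
    if failures[worker_id] >= MAX_FAILURES:
        blacklist.add(worker_id)    # stop giving work to this worker
    pending.append(task)            # reschedule the task for a healthy worker

def next_task_for(worker_id):
    """Hand out work only to workers that are not blacklisted."""
    if worker_id in blacklist or not pending:
        return None
    return pending.popleft()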
Efficiency and throughput of DIANE • 280 DIANE worker agents were submitted as LCG jobs • 200 jobs (~71%) were healthy • ~16% of the jobs failed due to middleware errors • ~12% due to application errors • [Figure: docking throughput over time; annotations: steady throughput, DIANE utilizes ~95% of the healthy resources]
A DIANE instance running for two weeks • [Figure: continuous activity over ~2 weeks]
Logging and bookkeeping … thanks to GANGA
A plain table of GANGA job logging information:

In [1]: print jobs['DIANE_6']
Statistics: 325 jobs
slice("DIANE_6")
--------------
#  id    status     name     subjobs  application  backend  backend.actualCE
# 1610   running    DIANE_6           Executable   LCG      melon.ngpp.ngp.org.sg:2119/jobmanager-lcgpbs-
# 1611   running    DIANE_6           Executable   LCG      node001.grid.auth.gr:2119/jobmanager-lcgpbs-b
# 1612   running    DIANE_6           Executable   LCG      polgrid1.in2p3.fr:2119/jobmanager-lcgpbs-biom
# 1613   failed     DIANE_6           Executable   LCG      polgrid1.in2p3.fr:2119/jobmanager-lcgpbs-sdj
# 1614   submitted  DIANE_6           Executable   LCG      ce01.ariagni.hellasgrid.gr:2119/jobmanager-pb
# 1615   running    DIANE_6           Executable   LCG      ce01.pic.es:2119/jobmanager-lcgpbs-biomed
# 1616   running    DIANE_6           Executable   LCG      ce01.tier2.hep.manchester.ac.uk:2119/jobmanag
# 1617   running    DIANE_6           Executable   LCG      clrlcgce03.in2p3.fr:2119/jobmanager-lcgpbs-bi

• Helpful for tracing the execution progress and Grid job errors
• Fairly easy to visualize the job statistics
Conclusion and future work • From the biological point of view • A large-scale molecular docking against predicted H5N1 variants has been done as an initial investment in finding potential compounds that can inhibit the activities of Influenza A neuraminidase N1 subtype variants • By using the Grid, the cost of the large-scale molecular docking process is reduced (from over 100 CPU years to a few weeks) • A large set of docking complexes has been produced on the Grid for further analysis • From the Grid point of view • The DC has demonstrated that large-scale scientific challenges can be tackled on the Grid with modest effort • The WISDOM system has successfully reproduced the high-throughput screening with minimized deployment effort • The DIANE framework, which takes control of Grid failures and isolates Grid system latency, benefits Grid applications in terms of efficiency, stability and usability • Moving toward a service • The stability and reliability of the Grid have been tested through the DC activity, and the results encourage the move from prototype to a real service • A portal-based graphic user interface allowing biologists to distribute docking tasks and to ease access to the docking results on the Grid is under development