From Sequence Analysis to Simulations: Applications of HPC in Modern Biology. R. Sankararamakrishnan, Department of Biological Sciences & Bioengineering, IIT-Kanpur. IIT-K REACH Symposium 2010, Oct 9th 2010.
Computers and Computing in Biology • Mathematical Biology • Biostatistics • Biomathematics • Quantitative Biology • Biophysics • Bioinformatics • Computational Biology
Definitions What is Bioinformatics? - Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data. What is Computational Biology? - The development and application of data-analytical and theoretical methods, mathematical modeling and computational simulation techniques to the study of biological, behavioral, and social systems. - NIH Definition http://www.bisti.nih.gov/
HPC Applications: Three examples • Evolutionary relationship among a given set of protein or DNA sequences • Drug Discovery and Design • Structure-function relationship of large biomolecular assemblies
Phylogeny and Phylogenetic tree • Study of evolutionary relationships (sequences/species) • Relationships between organisms with common ancestor • Phylogenetic tree is a graph representing evolutionary history of sequences/species
Phylogenetic trees can be represented in two different ways • Rooted tree: has a unique node (the root); shows the direction of evolution • Unrooted tree: no assumption about common ancestry [Figure: rooted and unrooted trees for orangutan, gorilla, chimpanzee and human]
Maximum Likelihood Method – An Introduction David Mount (2002)
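As a rough illustration of the maximum likelihood idea (not the example from Mount), the sketch below estimates the evolutionary distance between two aligned DNA sequences. It assumes the Jukes-Cantor (JC69) substitution model; the two toy sequences and all function names are hypothetical. The distance that maximizes the likelihood on a grid is compared with the known closed-form JC69 estimate.

```python
import math

def jc69_log_likelihood(d, n_sites, n_diff):
    """Log-likelihood of distance d (expected substitutions per site) under JC69."""
    e = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * e   # probability a site shows the same base
    p_diff = 0.25 - 0.25 * e   # probability of one specific different base
    # The factor 0.25 is the equilibrium frequency of the base in the first sequence.
    return ((n_sites - n_diff) * math.log(0.25 * p_same)
            + n_diff * math.log(0.25 * p_diff))

def count_differences(seq_a, seq_b):
    return sum(x != y for x, y in zip(seq_a, seq_b))

def ml_distance_grid(seq_a, seq_b, step=1e-4, d_max=2.0):
    """Brute-force ML estimate: try every distance on a grid and keep the best."""
    n_sites = len(seq_a)
    n_diff = count_differences(seq_a, seq_b)
    grid = [step * i for i in range(1, int(d_max / step) + 1)]
    return max(grid, key=lambda d: jc69_log_likelihood(d, n_sites, n_diff))

def ml_distance_closed_form(seq_a, seq_b):
    """Closed-form JC69 estimate: d = -3/4 * ln(1 - 4p/3), p = fraction of differing sites."""
    p = count_differences(seq_a, seq_b) / len(seq_a)
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

# Hypothetical toy alignment: 20 sites, 2 differences.
SEQ_A = "ACGTACGTACGTACGTACGT"
SEQ_B = "ACGTACGAACGTACCTACGT"

if __name__ == "__main__":
    print("grid ML estimate  :", ml_distance_grid(SEQ_A, SEQ_B))
    print("closed-form value :", ml_distance_closed_form(SEQ_A, SEQ_B))
```

For a full tree the same principle applies, but the likelihood must be maximized over every branch length and over the tree topology itself, which is where the computational cost comes from.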
For each unrooted tree, there will be many possible rooted trees
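One way to see this: a root can be placed on any branch of an unrooted tree, and an unrooted bifurcating tree with n leaves has 2n - 3 branches. The sketch below enumerates the rooted versions of a four-species tree. The topology, the internal node labels `u` and `w`, and the function names are illustrative assumptions, not taken from the slide.

```python
# Unrooted tree as an adjacency list; "u" and "w" are hypothetical internal nodes.
UNROOTED = {
    "Human": ["u"],
    "Chimpanzee": ["u"],
    "Gorilla": ["w"],
    "Orangutan": ["w"],
    "u": ["Human", "Chimpanzee", "w"],
    "w": ["Gorilla", "Orangutan", "u"],
}

def subtree(adj, node, parent):
    """Newick-style string for the part of the tree seen from `node`, looking away from `parent`."""
    children = [n for n in adj[node] if n != parent]
    if not children:                      # a leaf
        return node
    parts = sorted(subtree(adj, child, node) for child in children)
    return "(" + ",".join(parts) + ")"

def rooted_versions(adj):
    """Place a root on every branch in turn and return one rooted tree per branch."""
    edges = sorted({tuple(sorted((a, b))) for a in adj for b in adj[a]})
    trees = []
    for a, b in edges:
        halves = sorted([subtree(adj, a, b), subtree(adj, b, a)])
        trees.append("(" + ",".join(halves) + ");")
    return trees

if __name__ == "__main__":
    for tree in rooted_versions(UNROOTED):
        print(tree)
```

With four leaves there are 2 * 4 - 3 = 5 branches, so the loop prints five distinct rooted trees.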
Computing phylogenetic trees using ML method • Maximum likelihood phylogeny problem is NP-hard • Very CPU intensive • For trees containing more than 20 to 25 sequences, the problem can no longer be solved exactly • Efficient heuristic tree search algorithms are required to reduce the size of the search space • Recently developed algorithms: IQPNNI, PHYML, GARLI, RAxML • None of these algorithms is guaranteed to find the ML tree; they only yield the best known ML tree
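The size of the search space can be made concrete with a short calculation. For n labelled sequences, the number of distinct unrooted bifurcating trees is (2n - 5)!! and the number of rooted ones is (2n - 3)!!, where !! is the double factorial. The sketch below (function names are illustrative) evaluates these standard formulas.

```python
def double_factorial(k):
    """k!! = k * (k - 2) * (k - 4) * ... down to 1 or 2; returns 1 for k <= 1."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result

def num_unrooted_trees(n_sequences):
    """Unrooted bifurcating topologies for n >= 3 labelled leaves."""
    if n_sequences < 3:
        raise ValueError("need at least 3 sequences")
    return double_factorial(2 * n_sequences - 5)

def num_rooted_trees(n_sequences):
    """Rooted bifurcating topologies for n >= 2 labelled leaves."""
    if n_sequences < 2:
        raise ValueError("need at least 2 sequences")
    return double_factorial(2 * n_sequences - 3)

if __name__ == "__main__":
    for n in (4, 10, 20, 25):
        print(n, num_unrooted_trees(n), num_rooted_trees(n))
```

At 20 sequences the count of unrooted topologies is already on the order of 10^20, which is why scoring every tree is not an option and heuristic searches are used instead.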
Parallelization strategy Ott et al. (2008)
RAxML performance in some HPC platforms • 212 sequences, 566,470 base pairs • One of the largest datasets analyzed under ML • IBM BlueGene/L; 1024 CPUs • 7 distinct tree searches in 14 hours Ott et al. (2008)
Phylogenetic analysis of plant channel proteins identified new subfamily Bansal and Sankararamakrishnan, BMC Struct. Biol. (2007) Gupta and Sankararamakrishnan, BMC Plant Biol. (2009)
Roles of Computation in Drug Discovery “Is there really a case where a drug that is on the market was designed by a computer?” “The reality is that the use of computers and computer methods permeates all aspects of drug discovery today” Jorgensen (2004)
Computation in Drug Discovery “Drug discovery is complex: Successful teams and companies need to be congratulated, whereas search for one individual or computer program is counterproductive. There is not going to be a voila moment at the computer terminal. Instead, there is systematic use of wide-ranging computational tools to facilitate and enhance the drug discovery process” Jorgensen (2004)
Structure-based Drug Design – An Introduction http://csb.stanford.edu/levitt/demo_lectures/lec7/Lecture7/Discovering_Drugs/pages/Structure_Based_Drug_Design.html http://www.biocryst.com/our_science
Drug targets and Drug discovery: Issues • Lead generation • Lead optimization • De novo design • Virtual screening • All drugs presently on the market are estimated to target fewer than 500 biomolecules • Docking & scoring issues: scoring function, solvent effect and protein flexibility Bleicher et al. (2003)
Four proteins: trypsin, HIV PR, CDK2 and AChE • Test set for each protein: 10,000 randomly selected compounds • 6000 docking poses were selected for the top 1000 compounds • They served as initial conformations for MD simulations • The combination of docking and MD showed higher and more stable enrichment performance than docking alone
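The slide does not say how enrichment was measured. One common definition in virtual screening is the enrichment factor: the hit rate among the top-ranked fraction of compounds divided by the hit rate in the whole library. A minimal sketch under that assumption, with a hypothetical function name and made-up ranking:

```python
def enrichment_factor(ranked_is_active, fraction):
    """Enrichment factor for the top `fraction` of a ranked compound list.

    ranked_is_active: list of booleans, best-scored compound first,
                      True where the compound is a known active.
    fraction:         share of the list to inspect, e.g. 0.1 for the top 10%.
    """
    n_total = len(ranked_is_active)
    hits_total = sum(ranked_is_active)
    if n_total == 0 or hits_total == 0:
        raise ValueError("need a non-empty list with at least one active")
    n_top = max(1, int(round(n_total * fraction)))
    hits_top = sum(ranked_is_active[:n_top])
    return (hits_top / n_top) / (hits_total / n_total)

if __name__ == "__main__":
    # Made-up ranking: 100 compounds, 10 actives, 5 of them in the top 10.
    ranking = [True] * 5 + [False] * 5 + [True] * 5 + [False] * 85
    print(enrichment_factor(ranking, 0.10))
```

A value of 1 means the ranking is no better than random selection; larger values mean the actives are concentrated near the top of the list.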
A special-purpose computer, MDGRAPE-3, was used for the MD simulations • It is a cluster of personal computers, each equipped with 24 MDGRAPE-3 chips and with a peak speed of approximately 2 Tflops • 50 such computers were used • Average computational time for a single protein-ligand complex was 2.5 h • For 6,000 protein-ligand conformations, calculations were completed in a week
Steered MD in Drug Discovery Jorgensen, 2010 • Steered molecular dynamics is used to compute the force required to extract inhibitors from enzymes • A small spring is connected to the ligand in the complex • This spring is pulled at constant velocity into the surrounding water • Force is determined from the extension of the spring and recorded as a function of time • Strongly bound inhibitors show higher peak forces • Weaker inhibitors show flatter profiles
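The pulling protocol can be mimicked with a one-dimensional toy model. This is only a sketch, not a molecular simulation: it assumes a single particle in a Gaussian binding well, overdamped motion without thermal noise, arbitrary units, and hypothetical parameter names. The spring anchor moves at constant velocity `v`, and the recorded force is the spring constant times the spring extension, `k * (v * t - x)`.

```python
import math

def smd_force_profile(well_depth, k=10.0, v=0.01, gamma=1.0,
                      sigma=1.0, dt=0.01, steps=60000):
    """Return the spring force at each time step of a constant-velocity pull.

    well_depth: depth of the Gaussian well standing in for binding strength
    k:          spring constant
    v:          pulling velocity of the spring anchor
    gamma:      friction coefficient of the overdamped particle
    sigma:      width of the well
    """
    x = 0.0                       # ligand starts at the bottom of the well
    forces = []
    for i in range(steps):
        t = i * dt
        spring_force = k * (v * t - x)        # from the spring extension
        # Force from the well U(x) = -well_depth * exp(-x^2 / (2 sigma^2))
        well_force = (-well_depth * x / sigma ** 2
                      * math.exp(-x * x / (2.0 * sigma ** 2)))
        x += dt * (spring_force + well_force) / gamma   # explicit Euler step
        forces.append(spring_force)
    return forces

if __name__ == "__main__":
    strong = smd_force_profile(well_depth=5.0)
    weak = smd_force_profile(well_depth=1.0)
    print("peak force, deep well   :", max(strong))
    print("peak force, shallow well:", max(weak))
```

In this toy model the deeper well gives the higher peak force and the shallow well gives a flatter profile, which is the qualitative behaviour described in the bullets.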
Protein-protein interactions in programmed cell death • Bcl-2 family complex structures • Total number of atoms: ~50,000 to ~75,000 • Simulation period: 50 ns Lama and Sankararamakrishnan, Proteins (2008) Lama and Sankararamakrishnan, Biochemistry (2010)
MD simulations of channel proteins in bilayers • AQP1: 75,057 atoms • GlpF: 81,006 atoms • PfAQP: 81,503 atoms • A 30 ns production run was performed for each of the three systems • Each simulation takes ~40 days of CPU time (total CPU time ~120 days) Alok Jain, Ravi Verma and R. Sankararamakrishnan, Manuscript in preparation
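The throughput implied by these figures follows from simple arithmetic. The sketch below assumes the ~40 days refers to one 30 ns run per system; the helper names and the 50 ns target are hypothetical, used only to show how such an estimate scales.

```python
def ns_per_day(simulated_ns, days):
    """Simulation throughput in nanoseconds per day."""
    return simulated_ns / days

def days_needed(target_ns, rate_ns_per_day):
    """Days required to reach a target simulation length at a given rate."""
    return target_ns / rate_ns_per_day

if __name__ == "__main__":
    rate = ns_per_day(30.0, 40.0)          # one system: 30 ns in ~40 days
    total_days = 3 * 40.0                  # three systems, run one after another
    print("throughput (ns/day):", rate)
    print("total for three systems (days):", total_days)
    # Hypothetical extension of a single run to 50 ns at the same rate.
    print("days for 50 ns:", days_needed(50.0, rate))
```

At roughly 0.75 ns/day per system, lengthening the runs or adding systems quickly becomes a question of available processors rather than of method.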
Simulations reaching the million-atom mark • Complete virus: 1 million atoms (Freddolino et al., 2006) • Arrays of light-harvesting proteins: 1 million atoms (Chandler et al., 2008) • BAR domain proteins: 2.3 million atoms (Yin et al., 2009) • The flagellum: 2.4 million atoms (Kitao et al., 2006)
Complete virus: 1 million atoms • Minimization and equilibration: cluster of 48 AMD Athlon 2600+ processors • Simulation: 256 Altix nodes at NCSA, UIUC; 1.1 ns/day (Freddolino et al., 2006)
Functions of large molecular machines • Fungal fatty acid synthase • 30S ribosome
MD of protein-conducting channel bound to ribosome • Bacterial ribosomes are important targets for antibiotics • 2.7 million atoms • 50 ns simulation • Largest system simulated to date Gumbart et al. (2009)
HPC • Drug design & discovery • Large biomolecular systems • Phylogenetic analysis
HPC Platforms for Biology Applications • FPGA boards: field-programmable gate arrays are ICs that can be programmed. FPGA boards with commonly used bioinformatics algorithms are available • Graphics Processing Unit (GPU): all bioinformatics applications • Grid Computing: many applications • Distributed Computing: protein folding, drug docking • Cloud Computing:
Acknowledgements • Anjali Bansal • Dilraj Lama • Alok Jain • Tuhin Kumar Pal • Priyanka Srivastava • Vivek Modi • Ravi Kumar Verma • Krishna Deepak • Phani Deep DST, DBT, CSIR, MHRD