From Sequence Analysis to Simulations: Applications of HPC in Modern Biology

R. Sankararamakrishnan, Department of Biological Sciences & Bioengineering, IIT-Kanpur. IIT-K REACH Symposium 2010, Oct 9th 2010

Computers and Computing in Biology Mathematical Biology Biostatistics Biomathematics Quantitative Biology Biophysics Bioinformatics Computational Biology

Definitions What is Bioinformatics? - Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data. What is Computational Biology? - The development and application of data-analytical and theoretical methods, mathematical modeling and computational simulation techniques to the study of biological, behavioral, and social systems. - NIH Definition http://www.bisti.nih.gov/

Explosive growth of biological data

HPC Applications: Three examples • Evolutionary relationship among a given set of protein or DNA sequences • Drug Discovery and Design • Structure-function relationship of large biomolecular assemblies

I. HPC in Phylogenetics

Phylogeny and Phylogenetic tree • Study of evolutionary relationships (sequences/species) • Relationships between organisms with common ancestor • Phylogenetic tree is a graph representing evolutionary history of sequences/species

Phylogenetic trees can be represented in two different ways • Rooted tree: has a unique node (the root) and shows the direction of evolution • Unrooted tree: makes no assumption about common ancestry [Figure: rooted and unrooted trees for Orangutan, Human, Chimpanzee and Gorilla]
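
As a small illustration of how such trees are commonly stored, the sketch below writes a rooted and an unrooted tree of the four species as nested tuples and serializes them in Newick notation. The topology is an assumption chosen for illustration (the slide's figure is not reproduced here), and the helper names `to_newick` and `leaf_names` are hypothetical.

```python
# Illustrative topology only: an assumption, not read from the slide's figure.
# A leaf is a string; an internal node is a tuple of child nodes.
rooted_tree = ((("Human", "Chimpanzee"), "Gorilla"), "Orangutan")

# By Newick convention an unrooted tree is written with a three-way split
# at the top level, so no single node is singled out as the ancestor.
unrooted_tree = (("Human", "Chimpanzee"), "Gorilla", "Orangutan")


def to_newick(node):
    """Serialize a nested-tuple tree as a Newick string (without the final ';')."""
    if isinstance(node, str):
        return node
    return "(" + ",".join(to_newick(child) for child in node) + ")"


def leaf_names(node):
    """Collect leaf labels in left-to-right order."""
    if isinstance(node, str):
        return [node]
    names = []
    for child in node:
        names.extend(leaf_names(child))
    return names


rooted_newick = to_newick(rooted_tree) + ";"
unrooted_newick = to_newick(unrooted_tree) + ";"
print(rooted_newick)
print(unrooted_newick)
```

Both strings describe the same four leaves; only the rooted form commits to a direction of evolution.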

Molecular phylogeny in a criminal investigation

Maximum Likelihood Method – An Introduction David Mount (2002)

For each unrooted tree, there will be many possible rooted trees

Number of possible unrooted and rooted trees
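
The growth in the number of trees can be reproduced with the standard counting formulas for strictly bifurcating trees: (2n-5)!! unrooted trees and (2n-3)!! rooted trees for n sequences. The sketch below is a minimal illustration; the function names are hypothetical and the slide's own table is not reproduced here.

```python
def double_factorial(k):
    """Return k!! = k * (k-2) * (k-4) * ... down to 1 or 2; 1 for k <= 1."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def num_unrooted_trees(n):
    """Number of unrooted bifurcating trees for n >= 3 labelled sequences."""
    if n < 3:
        raise ValueError("need at least 3 sequences")
    return double_factorial(2 * n - 5)


def num_rooted_trees(n):
    """Number of rooted bifurcating trees for n >= 2 labelled sequences."""
    if n < 2:
        raise ValueError("need at least 2 sequences")
    return double_factorial(2 * n - 3)


for n in (3, 4, 5, 10, 20):
    print(n, num_unrooted_trees(n), num_rooted_trees(n))
```

An unrooted tree with n leaves has 2n-3 branches, and placing the root on any one of them gives a distinct rooted tree, which is why each unrooted tree corresponds to 2n-3 rooted trees.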

Computing phylogenetic trees using ML method • The maximum likelihood phylogeny problem is NP-hard and very CPU intensive • For trees containing more than 20 to 25 sequences, an exhaustive search is no longer feasible • Efficient heuristic tree search algorithms are required to reduce the size of the search space • Recently developed algorithms: IQPNNI, PHYML, GARLI, RAxML • None of these algorithms is guaranteed to find the ML tree; they yield only the best-known ML tree
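
To make the cost concrete, the sketch below scores a single fixed tree: it computes the log-likelihood of an alignment with Felsenstein's pruning recursion. A heuristic search has to repeat this kind of calculation for every candidate tree it visits. The Jukes-Cantor (JC69) substitution model, the tree encoding, the branch lengths and the toy sequences are all assumptions made for illustration; the programs named above use richer models and optimized code.

```python
import math

BASES = "ACGT"


def jc69_prob(same, t):
    """JC69 probability that a base stays the same (or becomes one given
    other base) over a branch of length t, in expected substitutions per site."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if same else 0.25 - 0.25 * e


def partial_likelihoods(node, site):
    """Return L[x] = P(leaves observed below node | node has base x).

    A leaf is a sequence name (str); an internal node is a tuple of
    (child, branch_length) pairs.
    """
    if isinstance(node, str):
        return [1.0 if base == site[node] else 0.0 for base in BASES]
    result = [1.0] * 4
    for child, t in node:
        child_l = partial_likelihoods(child, site)
        for i in range(4):
            result[i] *= sum(jc69_prob(i == j, t) * child_l[j] for j in range(4))
    return result


def log_likelihood(tree, alignment):
    """Sum per-site log-likelihoods, with uniform base frequencies at the root."""
    n_sites = len(next(iter(alignment.values())))
    total = 0.0
    for k in range(n_sites):
        site = {name: seq[k] for name, seq in alignment.items()}
        root = partial_likelihoods(tree, site)
        total += math.log(sum(0.25 * value for value in root))
    return total


# Hypothetical three-sequence example (made-up data and branch lengths).
inner = (("Human", 0.1), ("Chimpanzee", 0.1))
tree = ((inner, 0.05), ("Gorilla", 0.15))
alignment = {"Human": "ACGTAC", "Chimpanzee": "ACGTAT", "Gorilla": "ACGAAC"}
print(log_likelihood(tree, alignment))
```

Each site is scored independently of the others, so the total is a plain sum over alignment columns.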

Parallelization strategy Ott et al. (2008)
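
The figure for this slide is not reproduced in the transcript. As a hedged illustration of one kind of parallelism available to ML codes, the sketch below relies on the fact that alignment columns contribute independently to the log-likelihood: the columns are split into chunks, each worker sums its chunk, and the partial sums are added. This is a sketch of the general idea only, not the scheme of Ott et al.; `site_log_likelihood` is a hypothetical stand-in for a real per-site likelihood, and standard-library threads are used here in place of a message-passing library.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def site_log_likelihood(column):
    """Hypothetical stand-in for a per-site likelihood: log frequency of
    the most common base in the column."""
    best = max(column.count(base) for base in "ACGT")
    return math.log(best / len(column))


def chunked(items, n_chunks):
    """Split a list into at most n_chunks contiguous pieces."""
    size = max(1, math.ceil(len(items) / n_chunks))
    return [items[i:i + size] for i in range(0, len(items), size)]


def partial_sum(columns):
    """Work done by one worker: sum the log-likelihoods of its columns."""
    return sum(site_log_likelihood(column) for column in columns)


def parallel_log_likelihood(columns, n_workers=4):
    """Distribute column chunks over workers and combine the partial sums."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return sum(pool.map(partial_sum, chunked(columns, n_workers)))


# Made-up alignment; each column is one string with one base per sequence.
sequences = ["ACGTACGTAC", "ACGTACGTAT", "ACGAACGTAC", "ACGTTCGTAC"]
columns = ["".join(col) for col in zip(*sequences)]
print(parallel_log_likelihood(columns))
```

In CPython, threads will not speed up this pure-Python arithmetic; the example shows only how the work decomposes. Independent tree searches started from different trees offer a second, coarser level of parallelism.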

RAxML performance in some HPC platforms • 212 sequences, 566,470 base pairs • One of the largest datasets analyzed under ML • IBM BlueGene/L; 1024 CPUs • 7 distinct tree searches in 14 hours Ott et al. (2008)

Phylogenetic analysis of plant channel proteins identified new subfamily Bansal and Sankararamakrishnan, BMC Struct. Biol. (2007) Gupta and Sankararamakrishnan, BMC Plant Biol. (2009)

II. HPC in Drug Discovery & Drug Design

Roles of Computation in Drug Discovery “Is there really a case where a drug that is on the market was designed by a computer?” “The reality is that the use of computers and computer methods permeates all aspects of drug discovery today” Jorgensen (2004)

Computation in Drug Discovery “Drug discovery is complex: Successful teams and companies need to be congratulated, whereas search for one individual or computer program is counterproductive. There is not going to be a voila moment at the computer terminal. Instead, there is systematic use of wide-ranging computational tools to facilitate and enhance the drug discovery process” Jorgensen (2004)

Structure-based Drug Design – An Introduction http://csb.stanford.edu/levitt/demo_lectures/lec7/Lecture7/Discovering_Drugs/pages/Structure_Based_Drug_Design.html http://www.biocryst.com/our_science

www.bmsc.washington.edu/WimHol/sbdd3.JPG Wim Hol

Drug targets and Drug discovery: Issues • Lead generation • Lead optimization • De novo design • Virtual screening • All drugs presently on the market are estimated to target fewer than 500 biomolecules • Docking & Scoring issues: scoring function, solvent effect and protein flexibility Bleicher et al. (2003)

Four proteins: trypsin, HIV PR, CDK2 and AChE • Test set for each protein: 10,000 randomly selected compounds • 6000 docking poses were selected for the top 1000 compounds • They served as initial conformations for MD simulations • The combination of docking and MD showed a higher and more stable enrichment performance than the docking method used alone
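
Enrichment in virtual screening is often summarized by an enrichment factor: the hit rate among the top-ranked compounds divided by the hit rate in the whole library. The sketch below uses that common definition, which may differ from the exact measure used in the study summarized above. The library size of 10,000 matches the slide; the number of actives and their positions in the ranking are made up for illustration.

```python
def enrichment_factor(ranked_is_active, top_fraction):
    """Hit rate in the top-ranked fraction divided by the overall hit rate.

    ranked_is_active: list of booleans, best-scored compound first.
    """
    n_total = len(ranked_is_active)
    n_actives = sum(ranked_is_active)
    if n_total == 0 or n_actives == 0:
        raise ValueError("need a non-empty ranking with at least one active")
    n_top = max(1, round(n_total * top_fraction))
    hits_top = sum(ranked_is_active[:n_top])
    return (hits_top / n_top) / (n_actives / n_total)


# Hypothetical ranking: 10,000 compounds, 50 actives, 20 of them in the top 1000.
ranking = [True] * 20 + [False] * 980 + [True] * 30 + [False] * 8970
ef_top10 = enrichment_factor(ranking, 0.10)
print(ef_top10)
```

A value of 1 means the ranking does no better than random selection; larger values mean actives are concentrated near the top of the list.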

A special purpose computer, MDGRAPE-3, was used for MD simulations • It is a cluster of personal computers • Each equipped with 24 MDGRAPE-3 chips and has a peak speed of approximately 2 Tflops • 50 such computers were used • Average computational time for a single protein-ligand complex is 2.5 h • For 6,000 protein-ligand conformations, calculations were completed in a week

Steered MD in Drug Discovery Jorgensen, 2010 • Steered Molecular Dynamics is used to compute the force required to extract the inhibitors from enzymes • A small spring is connected to the ligand in the complex • This spring is pulled at constant velocity into the surrounding water • Force is determined from the extension of the spring and recorded as a function of time • Strongly-bound inhibitors → higher peak forces • Weaker inhibitors → flatter profiles
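
The pulling protocol can be illustrated with a one-dimensional toy model, which is in no way a molecular simulation. A particle sits in a Gaussian well standing in for the binding site, a spring with stiffness k is attached to it, and the far end of the spring moves at constant velocity v, so the recorded force is k(vt - x). Everything here is an assumption made for illustration: reduced units, overdamped dynamics without thermal noise, and the parameter values.

```python
import math


def pull_profile(well_depth, k=5.0, v=1.0, gamma=1.0, width=1.0,
                 dt=1e-3, t_end=10.0):
    """Return the spring force recorded at each time step of the pull."""
    x = 0.0                      # particle starts at the bottom of the well
    forces = []
    for step in range(int(t_end / dt)):
        t = step * dt
        f_spring = k * (v * t - x)          # force from the spring extension
        # Restoring force of the well U(x) = -depth * exp(-x^2 / (2 width^2))
        f_well = -well_depth * x / width**2 * math.exp(-x**2 / (2 * width**2))
        forces.append(f_spring)
        x += dt * (f_spring + f_well) / gamma   # overdamped Euler step
    return forces


strong = pull_profile(well_depth=10.0)   # stands in for a tightly bound ligand
weak = pull_profile(well_depth=2.0)      # stands in for a weakly bound ligand
print(max(strong), max(weak))
```

In this toy model the deeper well holds the particle until the spring is stretched further, which gives a higher peak force, while the shallow well gives a flatter profile.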

Protein-protein interactions in programmed cell death Bcl-2 family complex structures Total number of atoms: ~50,000 to ~75,000 Simulation period: 50 ns Lama and Sankararamakrishnan, Proteins (2008) Lama and Sankararamakrishnan, Biochemistry (2010)

III. Large Biomolecular Assemblies

First Biomolecular simulation was performed in 1977

MD simulations of channel proteins in bilayers AQP1: 75057 Atoms GlpF: 81006 Atoms PfAQP: 81503 Atoms • A 30 ns production run was performed for each of the three systems. • Each simulation takes ~40 days of CPU time (Total CPU time ~ 120 days). Alok Jain, Ravi Verma and R. Sankararamakrishnan, Manuscript in preparation
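
The figures on this slide translate into a throughput with a little arithmetic. The sketch below assumes that the ~40 days quoted for each run can be read as elapsed days for one 30 ns simulation; the function name is hypothetical.

```python
def ns_per_day(simulated_ns, days):
    """Simulated nanoseconds obtained per day of computing."""
    return simulated_ns / days


n_systems = 3            # AQP1, GlpF, PfAQP
days_per_system = 40.0   # approximate value quoted on the slide
production_ns = 30.0

rate = ns_per_day(production_ns, days_per_system)
total_days = n_systems * days_per_system
print(rate, total_days)
```

Under that assumption each system advances by about 0.75 ns per day, and the three runs together account for the ~120 days quoted.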

Simulations reaching the million-atom mark Complete virus: 1 million atoms (Freddolino et al., 2006) Arrays of light-harvesting proteins – 1 million atoms (Chandler et al., 2008) BAR domain proteins – 2.3 million atoms (Yin et al., 2009) The flagellum – 2.4 million atoms (Kitao et al., 2006)

Complete virus: 1 million atoms • Minimization and equilibration: cluster of 48 AMD Athlon 2600+ processors • Simulation: 256 Altix nodes at NCSA, UIUC; 1.1 ns/day (Freddolino et al., 2006)

Functions of large molecular machines Fungal fatty acid synthase 30S ribosome

MD of protein-conducting channel bound to ribosome Bacterial ribosomes are important targets for antibiotics 2.7 million atoms 50 ns simulation Largest system simulated to date Gumbart et al. (2009)

HPC in biology: Phylogenetic analysis • Drug Design & Discovery • Large Biomolecular systems

HPC Platforms for Biology Applications • FPGA boards: Field-programmable gate arrays are ICs which can be programmed. FPGA boards with commonly used bioinformatics algorithms are available • Graphics Processing Unit (GPU): All bioinformatics applications • Grid Computing: Many applications • Distributed Computing: Protein folding, Drug docking • Cloud Computing:

Acknowledgements • Anjali Bansal • Dilraj Lama • Alok Jain • Tuhin Kumar Pal • Priyanka Srivastava • Vivek Modi • Ravi Kumar Verma • Krishna Deepak • Phani Deep DST, DBT, CSIR, MHRD
