Computational Biology and Bioinformatics Environment ComBinE

National Facility Projects
1. Professor Mark Ragan (Institute for Molecular Bioscience)
2. Dr Thomas Huber (Department of Mathematics)

Comparison of protein families among completely sequenced microbial genomes
The scientific problem:
• Handcrafted analyses suggest that gene transfer in nature may occur not only from parents to offspring (“vertical”) but also from one lineage to another (“lateral” or “horizontal”)
• Microbial genomics provides complete inventories of genes and proteins in ~80 genomes
• Comparative analysis should identify all cases of vertical and lateral gene transfer

The approach:
• Find all interestingly large protein families in all microbial genomes
• Generate structure-sensitive multiple alignments
• Infer phylogenetic trees with appropriate statistics
• Compare trees, look for topological incongruence
Computational requirement for 80 genomes:
• 10^12 BLAST comparisons
• 5000 T-Coffee alignments
• 5000 Bayesian inference trees
• 10^7 topological comparisons
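The last step of the approach, comparing trees for topological incongruence, can be illustrated with a Robinson-Foulds style distance. This is a minimal sketch, not the project's actual comparison method (the slides do not name one): trees are written as nested tuples of leaf names, and the function names `leaves`, `splits` and `rf_distance` are hypothetical.

```python
import math


def leaves(tree):
    """Return the set of leaf names below a node (a leaf is a plain string)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def splits(tree):
    """Collect the non-trivial bipartitions induced by the internal edges.

    Each split is stored as the side that does NOT contain a fixed anchor
    leaf, so the result does not depend on where the tree is rooted.
    """
    all_leaves = leaves(tree)
    anchor = min(all_leaves)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        side = all_leaves - clade if anchor in clade else clade
        # Keep only splits with at least two leaves on each side.
        if 1 < len(side) < len(all_leaves) - 1:
            found.add(side)
        return clade

    visit(tree)
    return found


def rf_distance(tree_a, tree_b):
    """Number of splits present in exactly one of the two trees."""
    return len(splits(tree_a) ^ splits(tree_b))


if __name__ == "__main__":
    congruent = (("A", "B"), ("C", "D"))
    rerooted = ("A", ("B", ("C", "D")))      # same unrooted topology
    incongruent = (("A", "C"), ("B", "D"))   # conflicting grouping
    print(rf_distance(congruent, rerooted))
    print(rf_distance(congruent, incongruent))
    # Pairwise comparisons among 5000 trees, if every pair were compared.
    print(math.comb(5000, 2))
```

A distance of zero means the two trees agree on every grouping; a non-zero distance flags a family whose history may differ from the reference. If every pair of the 5000 trees were compared (an assumption; the slides do not say how the 10^7 figure was derived), the count would be 5000 × 4999 / 2, about 1.2 × 10^7.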

Computations on APAC National Facility
Usage of NF:
• Code not yet parallelised
• With each run costing a few tens of hours and thousands of analyses needed, it is more efficient to use many processors simultaneously
Motif-based multiple alignment:
• 30-50 sequences = 2-5 hours per run
• Will need ~5000 runs @ 4-60 seqs
Bayesian inference:
• Parameterisation of (MC)^3 search
• NF used for trials of up to 10^6 Markov chain generations (~200 hours / run)
• 1.5-2.0 GB RAM per run
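The case for running many serial jobs side by side can be checked with back-of-envelope arithmetic. The sketch below uses the alignment figures from the slide (~5000 runs at 2-5 hours each); the processor counts and the function name `wall_clock_hours` are hypothetical, and the 2-5 hour range is quoted for 30-50 sequences, so the totals are only a rough bound.

```python
import math


def wall_clock_hours(n_runs, hours_per_run, n_processors):
    """Wall-clock time when independent serial runs are farmed out.

    Runs are assumed to be equally long and fully independent, so they are
    processed in ceil(n_runs / n_processors) successive batches.
    """
    batches = math.ceil(n_runs / n_processors)
    return batches * hours_per_run


if __name__ == "__main__":
    n_runs = 5000
    for hours_per_run in (2, 5):
        total_cpu = n_runs * hours_per_run
        print(f"{hours_per_run} h/run: {total_cpu} CPU hours in total")
        for n_processors in (1, 10, 100):   # hypothetical processor counts
            hours = wall_clock_hours(n_runs, hours_per_run, n_processors)
            print(f"  {n_processors:>3} processors: {hours} wall-clock hours")
```

Because the runs do not communicate, the speed-up is close to linear in the number of processors without any change to the serial code.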

Parameterisation of Metropolis-coupled Markov chain Monte Carlo optimisation through protein tree space
• Bayesian inference (MrBayes 2.0) applied to 34-sequence Elongation Factor 1 dataset
• Eight simultaneous Markov chains, discrete approximation of gamma distribution (α = 0.29), chain temperature 0.1000
Figure: log-likelihood as a function of number of Markov chain generations; approach to stationarity under the Jones et al. (1992) and general time-reversible models of protein sequence change
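The mechanics of Metropolis-coupled MCMC can be sketched on a toy problem. The code below is not MrBayes and does not search tree space: it samples a one-dimensional bimodal density, using eight chains and a temperature of 0.1 as on the slide. It assumes the incremental heating scheme described in the MrBayes documentation, heat_i = 1 / (1 + T·i) with i = 0 for the cold chain; the target density, step size and function names are hypothetical.

```python
import math
import random


def log_target(x):
    """Toy bimodal log-density: two unit-variance Gaussians at -3 and +3."""
    a = -0.5 * (x + 3.0) ** 2
    b = -0.5 * (x - 3.0) ** 2
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def heats(n_chains, temperature):
    """Heat of each chain; chain 0 is the cold chain with heat 1."""
    return [1.0 / (1.0 + temperature * i) for i in range(n_chains)]


def mc3(n_generations, n_chains=8, temperature=0.1, step=1.0, seed=1):
    """Run Metropolis-coupled MCMC and return the cold-chain samples."""
    rng = random.Random(seed)
    beta = heats(n_chains, temperature)
    state = [0.0] * n_chains
    logp = [log_target(x) for x in state]
    cold_samples = []
    for _ in range(n_generations):
        # Within-chain Metropolis update; chain i targets p(x) ** beta[i].
        for i in range(n_chains):
            proposal = state[i] + rng.gauss(0.0, step)
            logp_new = log_target(proposal)
            log_u = math.log(1.0 - rng.random())   # uniform on (0, 1]
            if log_u < beta[i] * (logp_new - logp[i]):
                state[i], logp[i] = proposal, logp_new
        # Propose swapping the states of two randomly chosen chains.
        i, j = rng.sample(range(n_chains), 2)
        log_ratio = (beta[i] - beta[j]) * (logp[j] - logp[i])
        if math.log(1.0 - rng.random()) < log_ratio:
            state[i], state[j] = state[j], state[i]
            logp[i], logp[j] = logp[j], logp[i]
        cold_samples.append(state[0])
    return cold_samples


if __name__ == "__main__":
    samples = mc3(5000)
    print(len(samples), min(samples), max(samples))
```

Heated chains see a flattened version of the target and cross between modes more easily; swaps then pass those states to the cold chain. Tracking `log_target` of the cold chain against generation number gives the kind of stationarity trace the slide describes.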

With thanks to collaborators
• Mark Borodovsky, Georgia Tech
• Robert Charlebois, NGI Inc. (Ottawa)
• Tim Harlow, University of Queensland
• Jeffrey Lawrence, University of Pittsburgh
• Thomas Rand, St Mary’s University


Protein Structure Prediction
• Two Lineages
  • The bioinformatics approach
    • Compare sequence to other sequences
      • huge datasets (0.5 × 10^6 sequences)
    • Match sequence with known structure
      • (Low resolution force field development)
  • The biophysics approach
    • Simulations that mimic natural behaviour
Hardware requirements (in the order of the three tasks above):
• Sequence comparison: CPU minutes/seq, memory 1 GB
• Matching sequence with known structure: CPU hours/seq, memory 100s of MB
• Biophysical simulation: CPU 100s of hours, memory 10s of MB
Parallelism (same order):
• Sequence comparison: trivially parallel
• Matching sequence with known structure: trivially parallel
• Biophysical simulation: hard to parallelise; requires high bandwidth and low latency

Force splitting and multiple time step integration (Ian Lenane)
• Time step required: 10^-15 s
• Time scale wanted: >10^-3 s
• System is split into different domains
• Fast-varying forces (cheap to calculate) are integrated more frequently
• Slow-varying forces (expensive to calculate) are integrated less frequently
• More efficient integration
• Easy to extend to parallel simulations
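The force-splitting idea can be sketched with a two-level multiple time step scheme of the kind usually called r-RESPA. This is an illustration only, not the project's integrator: the system is a single particle of unit mass on two springs, a stiff "fast" one and a soft "slow" one, and all constants and function names are hypothetical.

```python
K_FAST = 100.0   # stiff spring: fast-varying force, assumed cheap
K_SLOW = 1.0     # soft spring: slow-varying force, assumed expensive


def fast_force(x):
    return -K_FAST * x


def slow_force(x):
    return -K_SLOW * x


def total_energy(x, v):
    """Kinetic plus potential energy for unit mass."""
    return 0.5 * v * v + 0.5 * (K_FAST + K_SLOW) * x * x


def respa_step(x, v, outer_dt, n_inner):
    """One outer step: slow force applied twice, fast force in the inner loop."""
    inner_dt = outer_dt / n_inner
    v += 0.5 * outer_dt * slow_force(x)       # half kick from the slow force
    for _ in range(n_inner):                  # velocity Verlet on the fast force
        v += 0.5 * inner_dt * fast_force(x)
        x += inner_dt * v
        v += 0.5 * inner_dt * fast_force(x)
    v += 0.5 * outer_dt * slow_force(x)       # closing half kick
    return x, v


def max_energy_error(n_steps=1000, outer_dt=0.05, n_inner=5):
    """Largest relative deviation of the total energy over a trajectory."""
    x, v = 1.0, 0.0
    e0 = total_energy(x, v)
    worst = 0.0
    for _ in range(n_steps):
        x, v = respa_step(x, v, outer_dt, n_inner)
        worst = max(worst, abs(total_energy(x, v) - e0) / e0)
    return worst


if __name__ == "__main__":
    print(max_energy_error())
```

The slow force is applied only at the ends of each outer step, while the fast force is updated `n_inner` times in between, so the expensive part of the calculation is called far less often than a single small time step would require. A production code would also reuse the closing slow-force evaluation as the opening one of the next step.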

Path simulations (Ben Gladwin)
• What if start and end points are given?
  • Proteins: unfolded → folded
  • Molecular machines: 1 cycle
• Shortest path calculations
  • Floyd, Dijkstra
• Hamilton’s principle of least action
• Computationally very attractive
  • Extremely long time steps
  • Very well suited for parallel architectures (Floyd algorithm parallelised, but performance problems with >4 PE on -GS NUMA architecture)
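The Floyd algorithm named above (commonly called Floyd-Warshall) computes shortest paths between all pairs of nodes. A minimal serial sketch follows; how the project maps molecular states to graph nodes and edge costs is not described in the slides, so the small weighted graph here is purely hypothetical.

```python
import math

INF = math.inf


def floyd_warshall(weights):
    """All-pairs shortest path lengths for a dense weight matrix.

    weights[i][j] is the cost of the direct edge i -> j, INF if absent,
    and 0 on the diagonal. The input matrix is not modified.
    """
    n = len(weights)
    dist = [row[:] for row in weights]
    for k in range(n):              # allow node k as an intermediate stop
        for i in range(n):
            for j in range(n):
                via_k = dist[i][k] + dist[k][j]
                if via_k < dist[i][j]:
                    dist[i][j] = via_k
    return dist


if __name__ == "__main__":
    graph = [
        [0, 4, 1, INF],
        [INF, 0, INF, 1],
        [INF, 2, 0, 5],
        [INF, INF, INF, 0],
    ]
    for row in floyd_warshall(graph):
        print(row)
```

For a fixed `k`, the updates over `i` and `j` do not depend on one another, which is what makes the algorithm a natural candidate for parallel architectures; every processor still needs row `k` and column `k` at each stage, so the communication cost grows with the number of processors.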

National Facility supercomputer use
• 2001 CPU quota: 2*5250 + 8000 service units
  • Total use: 12000 units (3000 units in parallel)
• 2002 CPU quota: 4 * 6000 service units
  • First quarter: 2000 units
  • Second quarter: 85 units
• Collaborators
  • Dr A. Torda (ANU): Low resolution force fields / protein structure prediction
  • Prof. D. Hume, A/Prof. B. Kobe and Dr J. Martin (UQ): Structural genomics project
  • Prof. K. Burrage, I. Lenane and B. Gladwin (UQ): Numerical integration and path simulations
• Special Thanks
  • Mrs J. Jenkinson and Dr D. Singleton (NF/ANUSF)
