Blue Gene for Protein Structure Prediction (Predicting CASP Targets in Record Time) Ross C. Walker
The CASP Competition • What is CASP? • Critical Assessment of Techniques for Protein Structure Prediction (CASP) • Biennial competition in protein structure prediction • “world cup” of protein structure prediction • CASP v7 ran 10th May 2006 to 29th Aug 2006 • ca. 100 sequences over 100 days
Protein Structure Prediction (Rosetta) • Homology Modeling • (Large sequence alignment) • Template Based Modeling • (Some sequence alignment) • Ab Initio • (No appreciable sequence alignment) The Rosetta Code of Prof. David Baker (HHMI) Supports all 3 Approaches
Template Based Predictions • Used for the majority of CASP targets • Align sequence with proteins of known structure • Generate initial “decoy” structures • Do a Monte Carlo refinement of the structures • The structure with the lowest energy “should” be the native structure.
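The refinement step above can be illustrated with a minimal Metropolis Monte Carlo sketch. This is not Rosetta's energy function or move set: `toy_energy`, the step size, and the temperature are illustrative assumptions, and a "decoy" is reduced to a plain list of coordinates.

```python
import math
import random


def toy_energy(coords):
    """Hypothetical stand-in for a scoring function: minimum at all zeros."""
    return sum(x * x for x in coords)


def refine(decoy, n_steps=2000, step=0.1, temperature=0.05, seed=0):
    """Metropolis Monte Carlo refinement of one decoy.

    Returns the lowest-energy structure seen and its energy.
    """
    rng = random.Random(seed)
    current = list(decoy)
    e_current = toy_energy(current)
    best, e_best = list(current), e_current
    for _ in range(n_steps):
        # Propose a small random perturbation of one coordinate.
        trial = list(current)
        i = rng.randrange(len(trial))
        trial[i] += rng.uniform(-step, step)
        e_trial = toy_energy(trial)
        # Always accept downhill moves; accept uphill moves with
        # Boltzmann probability exp(-dE / T).
        delta = e_trial - e_current
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, e_current = trial, e_trial
            if e_current < e_best:
                best, e_best = list(current), e_current
    return best, e_best


if __name__ == "__main__":
    # Refine several decoys independently and keep the lowest-energy result.
    decoys = [[1.0, -2.0, 0.5], [0.3, 0.3, -1.5], [2.0, 2.0, 2.0]]
    results = [refine(d, seed=k) for k, d in enumerate(decoys)]
    best_structure, best_energy = min(results, key=lambda r: r[1])
    print(f"lowest energy found: {best_energy:.4f}")
```

Each call to `refine` is independent of the others, which is why many thousands of refinements can be spread across processors with almost no communication.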
The Problem • Many thousands of refinements need to be completed in order to adequately sample phase space. • CASP competition is time sensitive • Sequences released continuously • Predictions must be submitted within 3 weeks of sequence release • Requires access to large computing resources.
SDSC and Rosetta • A collaboration between SDSC’s Scientific Applications Computing (SAC) group and David Baker • Scientists from SDSC parallelized the Rosetta code to run on many thousands of processors • Provided tailored resource allocation on SDSC Blue Gene and DataStar machines • Provided the Baker team with access to 2 orders of magnitude more computing power than they had for CASP 6 (2004).
Modifications Specific to Blue Gene • Aggressively account for all memory used. • Variable Chunk Size Distribution by Master Thread. • No Global Communications - All point to point. • Distributed I/O - All tasks read directly from disk and write directly to disk. (No distribution of work packets over the interconnect, which would overload the master thread. Only Job ID info is sent.) • Master generation of random seed for each slave thread - ensures no two threads have the same random seed.
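Two of the points above, variable chunk sizes and master-generated seeds, can be sketched as follows. The slides do not say how the chunk size varies, so the decreasing-chunk policy here (guided self-scheduling) is an assumption, as are all function names and the `master_seed` value. Only job ID ranges are produced, mirroring the point that no work packets travel over the interconnect.

```python
import random


def chunk_sizes(n_jobs, n_workers, min_chunk=1):
    """Yield decreasing chunk sizes (guided self-scheduling, an assumed policy).

    Each chunk is a fraction of the work still unassigned, so early chunks are
    large (few master round-trips) and late chunks are small (good balance).
    """
    remaining = n_jobs
    while remaining > 0:
        size = max(min_chunk, remaining // (2 * n_workers))
        size = min(size, remaining)
        yield size
        remaining -= size


def assign_chunks(n_jobs, n_workers):
    """Return (first_job_id, count) pairs; only these IDs go to workers."""
    chunks, next_id = [], 0
    for size in chunk_sizes(n_jobs, n_workers):
        chunks.append((next_id, size))
        next_id += size
    return chunks


def unique_seeds(n_workers, master_seed=12345):
    """Master draws one distinct random seed per worker."""
    rng = random.Random(master_seed)
    # sample() draws without replacement, so no two seeds collide.
    return rng.sample(range(1, 2**31 - 1), n_workers)


if __name__ == "__main__":
    chunks = assign_chunks(n_jobs=1000, n_workers=8)
    print("first chunks:", chunks[:3])
    print("last chunks: ", chunks[-3:])
    print("seeds:", unique_seeds(4))
```

In a real run each `(first_job_id, count)` pair would be sent to a worker as a point-to-point message when that worker asks for more work; the worker then reads its own inputs from disk.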
Rosetta Usage on SDSC Blue Gene • CASP 2006 • 1,080,000 SUs used (Average run size = 2048 CPUs) • 2007 (Estimated) • Protein Structure Prediction 2,500,000 SUs (4096 CPUs) • Protein Design 1,800,000 SUs (2048 CPUs)
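To give a feel for these allocations, the short calculation below converts SUs into wall-clock time at the stated run sizes. It assumes that one SU corresponds to one processor-hour, which the slides do not state; the function name is illustrative.

```python
def wall_clock_hours(service_units, processors, su_per_processor_hour=1.0):
    """Hours of continuous running, assuming 1 SU = 1 processor-hour."""
    return service_units / (processors * su_per_processor_hour)


if __name__ == "__main__":
    usage = [
        ("CASP 2006", 1_080_000, 2048),
        ("2007 structure prediction (estimated)", 2_500_000, 4096),
        ("2007 protein design (estimated)", 1_800_000, 2048),
    ]
    for label, sus, cpus in usage:
        hours = wall_clock_hours(sus, cpus)
        print(f"{label}: {hours:.0f} h ({hours / 24:.1f} days) on {cpus} CPUs")
```

Under that assumption the CASP 2006 usage works out to roughly 527 hours, or about 22 days of continuous running at the average run size.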
A Demonstration • Successful scaling to >40,000 processors allowed a demonstration to be run at IBM Watson Research Labs • Ross Walker (SDSC) and Srivatsan Raman (UW) took a CASP target released earlier in the day • Generated Initial Guesses • Submitted Job to all 20 racks of IBM Watson Blue Gene • Ran for 3 hours • Generated 120,000 Decoys • Best candidate was selected and submitted as CASP prediction the same day.
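The demonstration figures imply a rough per-processor throughput, worked out below. The value of 2048 processors per rack is an assumption based on the Blue Gene/L rack layout (1024 dual-processor nodes); it is consistent with the ">40,000 processors" quoted above but is not stated on the slide. The estimate also ignores the master task and any idle time.

```python
PROCESSORS_PER_RACK = 2048  # assumption: Blue Gene/L, 1024 dual-processor nodes


def demo_throughput(racks=20, hours=3, decoys=120_000):
    """Back-of-the-envelope throughput for the demonstration run."""
    processors = racks * PROCESSORS_PER_RACK
    decoys_per_hour = decoys / hours
    decoys_per_processor = decoys / processors
    # Average wall-clock minutes one processor spends on one decoy.
    minutes_per_decoy = hours * 60 / decoys_per_processor
    return processors, decoys_per_hour, minutes_per_decoy


if __name__ == "__main__":
    processors, per_hour, minutes = demo_throughput()
    print(f"processors:        {processors}")
    print(f"decoys per hour:   {per_hour:.0f}")
    print(f"minutes per decoy: {minutes:.2f}")
```

With these assumptions each processor produces about three decoys in the three-hour window, or roughly one decoy per hour.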
Results: CASP 2006 Target T0380 • Green = Prediction • Blue = X-Ray • Pink = Initial Template
Results: CASP 2006 Target T0380 • Baker team results shown in black.
The Future (1 million <gulp!> CPUs and beyond) • Hierarchical Job Distribution System (1 master thread approach will be overloaded). • On the fly detection of failed nodes and error correction. • Manual Buffering of I/O? [Requires more memory per node] • Parallelization of individual refinements. (SMP or MPI options)
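One possible shape for the hierarchical job distribution mentioned above is a two-level tree: a root master hands a job ID range to each sub-master, and each sub-master subdivides its range among its own workers. The sketch below simulates that bookkeeping in a single process. The names, the two-level depth, and the static equal split are all assumptions made for illustration, not a description of what was implemented.

```python
def split_range(start, count, parts):
    """Split [start, start + count) into `parts` near-equal contiguous ranges."""
    base, extra = divmod(count, parts)
    ranges, cursor = [], start
    for p in range(parts):
        size = base + (1 if p < extra else 0)
        ranges.append((cursor, size))
        cursor += size
    return ranges


def hierarchical_assign(n_jobs, n_groups, workers_per_group):
    """Two-level distribution: root -> sub-masters -> workers.

    Returns {(group, worker): (first_job_id, count)} and the number of
    messages the root master has to send.
    """
    assignment = {}
    # The root only talks to the sub-masters: n_groups messages in total.
    group_ranges = split_range(0, n_jobs, n_groups)
    for g, (g_start, g_count) in enumerate(group_ranges):
        # Each sub-master subdivides its own range locally.
        worker_ranges = split_range(g_start, g_count, workers_per_group)
        for w, job_range in enumerate(worker_ranges):
            assignment[(g, w)] = job_range
    return assignment, len(group_ranges)


if __name__ == "__main__":
    assignment, root_messages = hierarchical_assign(
        n_jobs=100_000, n_groups=32, workers_per_group=64
    )
    print(f"workers served:   {len(assignment)}")
    print(f"root messages:    {root_messages}")
    print(f"worker (0, 0):    {assignment[(0, 0)]}")
```

The root's message count grows with the number of groups instead of the number of workers, which is the property that relieves the single master thread.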
Acknowledgements • David Baker (UW) • Srivatsan Raman (UW) • John Karanicolas (UW) • IBM T.J. Watson Research • SDSC NSF Funded SAC Program