Explore the use of swarm and bundle techniques on the Biowulf cluster for bioinformatics and biostatistics applications. Learn about GWAS, sequence analysis, statistical modeling, protein folding, molecular docking, tomographic reconstruction, and more. Discover how to run multiple independent processes in parallel and maximize computational efficiency.
Swarms and Bundles: Bioinformatics and Biostatistics on Biowulf
David Hoover
Scientific Computing Branch, Division of Computer System Services, CIT, NIH
Embarrassingly Parallel Problems
• GWAS, with huge numbers of SNPs
• Sequence analysis, assembly, and mapping
• Testing and validating statistical models
• Protein folding and threading
• Molecular docking and compound screening
• Tomographic reconstruction
Characterization of Surface Protein 3 from the Malaria Parasite P. falciparum
• Protein folding calculations with Rosetta++
• 100,000 CPU hours
• Tsai et al., Mol. Biochem. Parasitology, online preprint 2008
How to run multiple independent processes in parallel
[Diagram: 16 independent processes, each running its own command on its own input and writing its own output]
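As a rough sketch of the idea (not the Biowulf workflow itself, and with a hypothetical command name mycommand and hypothetical input/output file names), 16 independent processes on a single machine could be launched from the shell like this:

  for i in $(seq 1 16); do
      mycommand input$i > output$i &   # launch each independent process in the background
  done
  wait                                 # block until all 16 processes have finished

Each process reads its own input and writes its own output, with no coordination between them; the cluster approaches on the following slides distribute the same kind of independent commands across many nodes.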
Biowulf Cluster Batch System
[Diagram: each of job1 ... job16 is wrapped in its own batch script, submitted to the batch system, and writes its own output file (job1.out ... job16.out)]
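A minimal sketch of one such batch job, assuming a PBS-style batch system of the kind Biowulf used at the time; the script name, command, and submission line are illustrative rather than the exact Biowulf syntax:

  #!/bin/bash
  # job1.sh -- wraps a single independent command
  mycommand input1 > job1.out

  biowulf% qsub job1.sh    # one submission per job: job1.sh, job2.sh, ... job16.sh

Submitting 16 jobs this way means writing and submitting 16 nearly identical scripts, which is the bookkeeping that swarm removes.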
Swarm
biowulf% swarm -f file
[Diagram: the four commands in the file (job1 ... job4) are dispatched to Node 1 ... Node 4 and produce job1.out ... job4.out]
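A minimal sketch of what the swarm command file might contain; the file contents and command names are hypothetical, and each line holds one independent command:

  biowulf% cat file
  mycommand input1 > job1.out
  mycommand input2 > job2.out
  mycommand input3 > job3.out
  mycommand input4 > job4.out
  biowulf% swarm -f file

Per the diagram above, swarm turns each line of the file into its own batch job, so the four commands land on separate nodes and run simultaneously.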
Bundled Swarm
biowulf% swarm -f file -b 4
[Diagram: with -b 4, the commands are bundled into a single job (job1) that runs on one node (Node 1) and produces job1.out]
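A sketch of the effect of bundling, using the same hypothetical command file as above and assuming -b groups commands per CPU (as the "commands per CPU" record below suggests):

  biowulf% swarm -f file -b 4    # pack 4 commands into one bundle; here all 4 run sequentially on a single node

With a longer file, say eight commands, -b 4 would presumably produce two such bundled jobs. Bundling trades parallelism for lower scheduling overhead, which pays off when the individual commands are short.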
Swarm Facts
• Written and maintained by Helix Systems staff
• swarm introduced in late 2000
• 82% of all batch jobs run on the cluster since 2002 are swarm jobs
• ~60% of all wall time is spent on swarm jobs
• swarm has been shared with clusters around the world
Swarm World Records
• Largest swarm: 683,445 commands
• Largest bundle: 24,000 commands per CPU
Future Challenges
• How to deal with larger multicore nodes?
[Diagram: Node 1, Node 2, Node 3]