HPC and the ROMS BENCHMARK Program Kate Hedstrom August 2003
Outline • New ARSC systems • Experience with ROMS benchmark problem • Other computer news
New ARSC Systems • Cray X1 • 128 MSP (1.5 TFLOPS) • 4 GB/MSP • Water cooled • IBM p690+ and p655+ • 5 TFLOPS total • At least 2 GB/cpu • Air cooled • Arriving in September, switch later
Cray • Cray X1 Node • Node is a 4-way SMP • 16 GB/node • Each MSP has four vector/scalar processors • Processors in MSP share cache • Node usable as 4 MSPs or 16 SSPs • IEEE floating point hardware
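The node layout above, combined with the system totals quoted on the previous slide (128 MSPs, 4 GB/MSP, 1.5 TFLOPS), implies a few derived figures. The following is a minimal sketch of that arithmetic; the function name `x1_summary` is hypothetical, and because the 1.5 TFLOPS total is a rounded slide figure, the per-MSP peak it yields is only approximate.

```python
# Figures taken from the slides; nothing here is measured.
TOTAL_MSPS = 128      # "128 MSP"
MSPS_PER_NODE = 4     # "Node is a 4-way SMP"
SSPS_PER_MSP = 4      # "Each MSP has four vector/scalar processors"
GB_PER_MSP = 4        # "4 GB/MSP"
PEAK_TFLOPS = 1.5     # rounded system peak quoted on the slide


def x1_summary(total_msps=TOTAL_MSPS):
    """Derive node, SSP, and memory totals from the per-MSP figures."""
    return {
        "nodes": total_msps // MSPS_PER_NODE,
        "ssps": total_msps * SSPS_PER_MSP,
        "gb_per_node": MSPS_PER_NODE * GB_PER_MSP,
        "total_gb": total_msps * GB_PER_MSP,
        # Approximate, since PEAK_TFLOPS is a rounded value.
        "peak_gflops_per_msp": PEAK_TFLOPS * 1000 / total_msps,
    }


if __name__ == "__main__":
    for key, value in x1_summary().items():
        print(f"{key}: {value}")
```

The derived 16 GB per node agrees with the "16 GB/node" bullet, which is a quick consistency check on the slide's numbers.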
Cray • Programming Environment • Fortran, C, C++ • Support for • MPI • SHMEM • Co-Array Fortran • UPC • OpenMP (Fall 2003) • Compilation runs on the CPES (a Sun V480), invisibly to the user
IBM • Two p690+ • Like our Regatta, but faster, more memory (8 GB/cpu) • Memory shared among 32 cpus • For big OpenMP jobs • Six p655+ towers • Like our SP, but faster, more memory (2 GB/cpu) • Shared memory on each 8-cpu node, 92 nodes in all • For big MPI jobs and small OpenMP jobs
Benchmark Problem • No external files to read • Three different resolutions • Periodic channel representing the Antarctic Circumpolar Current (ACC) • Steep bathymetry • Idealized winds, clouds, etc., but full computation of atmospheric boundary layer • KPP vertical mixing
IBM and SX6 Notes • Peak per processor: SX6 is 8 GFLOPS, Power4 is 5.2 GFLOPS • Both sustain less than 10% of peak • IBM scales better; a Cray person says the SX6 is even worse on more than one node • SX6 is best with 1xN tiling; IBM does better closer to MxM, even though this problem is 512x64
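To make the tiling comparison concrete, the sketch below enumerates the ways a given process count can be split into tiles over a 512x64 grid and reports each tile's dimensions and a rough halo size. It makes several assumptions: the 512 direction is treated as the first (inner-loop) index, the halo width of 2 is illustrative, and the halo count ignores corners and boundary conditions. The helper names `tilings` and `tile_stats` are hypothetical and are not part of ROMS.

```python
def tilings(nprocs):
    """All (ntile_i, ntile_j) pairs whose product is nprocs."""
    return [(m, nprocs // m) for m in range(1, nprocs + 1) if nprocs % m == 0]


def tile_stats(lm, mm, ntile_i, ntile_j, halo=2):
    """Return (points in i, points in j, rough halo points) for one tile.

    lm, mm: grid size in i and j (assumed 512 and 64 here).
    halo:   assumed halo width; an illustrative value only.
    """
    ni = -(-lm // ntile_i)  # ceiling division
    nj = -(-mm // ntile_j)
    # Four edges, each `halo` points deep; corners are ignored.
    halo_points = 2 * halo * (ni + nj)
    return ni, nj, halo_points


if __name__ == "__main__":
    for ntile_i, ntile_j in tilings(8):
        ni, nj, halo_points = tile_stats(512, 64, ntile_i, ntile_j)
        print(f"{ntile_i}x{ntile_j}: tile {ni}x{nj}, ~{halo_points} halo points")
```

A 1xN tiling keeps the full 512-point extent in the first index, which suits long vector loops, while other tilings shrink the halo. This table only describes the geometry. It does not by itself predict the observation that the IBM ran best near MxM; that remains an empirical result from the benchmark.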
Cray X1 Notes • Have choice of MSP or SSP mode • Four SSPs faster than one MSP • Sixteen MSPs much faster than 64 SSPs • On one MSP, vanilla ROMS spends: • 66% in bulk_flux • 28% in LMD • 2% in 2-D engine • Slower than either Power4 or SX6 • Can inline lmd_wscale and vastly speed up LMD with a compiler option • John Levesque has offered to rewrite bulk_flux • Aim is 6-8 times faster than Power4 for CCSM
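The profile above suggests how much the whole run could gain from fixing bulk_flux and LMD. The following is a minimal Amdahl's-law sketch using the slide's percentages. The speedup factors passed in are hypothetical inputs, not measurements, and the function name `overall_speedup` is an assumption made for illustration.

```python
# Fractions of runtime on one MSP, from the slide.
PROFILE = {"bulk_flux": 0.66, "LMD": 0.28, "2-D engine": 0.02}


def overall_speedup(fractions, speedups):
    """Amdahl-style estimate of whole-program speedup.

    fractions: routine name -> fraction of original runtime.
    speedups:  routine name -> hypothetical speedup factor for that routine;
               routines not listed are assumed unchanged.
    """
    # Time spent outside the profiled routines is assumed unchanged.
    remaining = 1.0 - sum(fractions.values())
    new_time = remaining + sum(
        frac / speedups.get(name, 1.0) for name, frac in fractions.items()
    )
    return 1.0 / new_time


if __name__ == "__main__":
    # Hypothetical factors, chosen only to show the shape of the estimate.
    print(overall_speedup(PROFILE, {"bulk_flux": 10.0}))
    print(overall_speedup(PROFILE, {"bulk_flux": 10.0, "LMD": 10.0}))
```

Because bulk_flux and LMD together account for 94% of the runtime, even very large speedups in those two routines leave the remaining 6% as a ceiling on the overall gain.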
Clusters • Can buy rack mounted turnkey systems running Linux • Need to spend money on: • Memory • Processors - single cpu nodes may be best • Switch - low latency, high bandwidth • Disk storage
Don Morton’s Experience • No such thing as turnkey Beowulf • Need someone to take care of it: • Configure queuing system to make it useful for more than one user • Security updates • Backups
DARPA Petaflops award • Sun, IBM, Cray each awarded ~$50 million for phase-two development • Two will be awarded phase 3 in 2006 • Goal is to achieve petaflops by about 2010, with systems that are also easier to program and have a more robust operating environment • Sun - new switch between cpus, memory • IBM - huge cache on chip • Cray - heavyweight, lightweight cpus
Conclusions • Things are still exciting in the computer industry • The only thing you can count on is change