
HPC and the ROMS BENCHMARK Program




  1. HPC and the ROMS BENCHMARK Program Kate Hedstrom August 2003

  2. Outline • New ARSC systems • Experience with ROMS benchmark problem • Other computer news

  3. New ARSC Systems • Cray X1 • 128 MSP (1.5 TFLOPS) • 4 GB/MSP • Water cooled • IBM p690+ and p655+ • 5 TFLOPS total • At least 2 GB/cpu • Air cooled • Arriving in September, switch later

  4. Cray X1 (klondike)

  5. Cray • Cray X1 Node • Node is a 4-way SMP • 16 GB/node • Each MSP has four vector/scalar processors • Processors in MSP share cache • Node usable as 4 MSPs or 16 SSPs • IEEE floating point hardware

  6. Cray • Programming Environment • Fortran, C, C++ • Support for • MPI • SHMEM • Co-Array Fortran • UPC • OpenMP (Fall 2003) • Compilation runs on the CPES (a Sun V480) and is invisible to the user
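
For a flavor of the Co-Array Fortran model listed above, here is a minimal hello-style sketch. It uses the now-standard Fortran 2008 coarray syntax; the X1-era Cray dialect differed slightly (for example in the synchronization call), and the program name is made up for illustration.

      program caf_hello
        implicit none
        integer :: me, n
        me = this_image()     ! index of this image (1-based)
        n  = num_images()     ! total number of images
        print *, 'image', me, 'of', n
        sync all              ! barrier across all images
      end program caf_hello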

  7. IBM • Two p690+ • Like our Regatta, but faster, with more memory (8 GB/cpu) • Shared memory across 32 cpus • For big OpenMP jobs • Six p655+ towers • Like our SP, but faster, with more memory (2 GB/cpu) • Shared memory on each 8-cpu node, 92 nodes in all • For big MPI jobs and small OpenMP jobs
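
The split between big OpenMP jobs on the p690+ and MPI jobs on the p655+ nodes suggests the usual hybrid style: MPI between nodes, OpenMP threads within a node's shared memory. The sketch below is only illustrative; the program name is made up, and a production code would normally call MPI_Init_thread to request a specific thread-support level.

      program hybrid_sketch
        use omp_lib
        implicit none
        include 'mpif.h'
        integer :: ierr, rank, nprocs

        call MPI_Init(ierr)                              ! one MPI process per node (or per cpu)
        call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
        call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

        !$omp parallel                                   ! OpenMP threads inside the shared-memory node
        print *, 'rank', rank, 'of', nprocs, 'thread', &
                 omp_get_thread_num(), 'of', omp_get_num_threads()
        !$omp end parallel

        call MPI_Finalize(ierr)
      end program hybrid_sketch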

  8. Benchmark Problem • No external files to read • Three different resolutions • Periodic channel representing the Antarctic Circumpolar Current (ACC) • Steep bathymetry • Idealized winds, clouds, etc., but full computation of atmospheric boundary layer • KPP vertical mixing
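
"No external files" means the grid, initial conditions, and forcing are all generated analytically at run time. As a purely illustrative sketch (this is not the actual ROMS benchmark formula, and the routine name is invented), an idealized zonal wind stress over the channel might be set like this:

      subroutine ana_wind_sketch(Mm, y, ylen, taux)
        implicit none
        integer, intent(in)  :: Mm         ! points across the channel
        real,    intent(in)  :: y(Mm)      ! cross-channel coordinate (m)
        real,    intent(in)  :: ylen       ! channel width (m)
        real,    intent(out) :: taux(Mm)   ! zonal surface wind stress (N/m2)
        real, parameter :: tau0 = 0.1      ! hypothetical stress amplitude (N/m2)
        real, parameter :: pi   = 3.14159265
        integer :: j

        do j = 1, Mm
          taux(j) = tau0*sin(pi*y(j)/ylen) ! eastward jet centered in the channel
        end do
      end subroutine ana_wind_sketch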

  9. IBM and SX6 Notes • SX6 peak is 8 GFLOPS, Power4 peak is 5.2 GFLOPS • Both achieve less than 10% of peak on this problem • IBM scales better; a Cray person says the SX6 gets even worse beyond one node • SX6 is best with 1xN tiling, while the IBM does better closer to MxM, even though this problem is 512x64
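
The tiling trade-off is easier to see from the tile shapes themselves. Assuming ROMS-style NtileI x NtileJ partitioning of the 512x64 interior grid into 16 tiles, a 1x16 tiling keeps the full 512-point inner dimension for long vector loops (good for the SX6), while a 4x4 tiling gives squarer tiles that fit cache better (good for the Power4):

      program tile_sizes
        implicit none
        integer, parameter :: Lm = 512, Mm = 64           ! interior grid from this slide
        integer, parameter :: NtileI(3) = (/ 1,  4, 16 /) ! tiles in x
        integer, parameter :: NtileJ(3) = (/ 16, 4,  1 /) ! tiles in y
        integer :: k

        do k = 1, 3
          print *, NtileI(k), 'x', NtileJ(k), 'tiling: each tile is', &
                   Lm/NtileI(k), 'by', Mm/NtileJ(k), 'interior points'
        end do
      end program tile_sizes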

  10. Cray X1 Notes • Have a choice of MSP or SSP mode • Four SSPs are faster than one MSP • Sixteen MSPs are much faster than 64 SSPs • On one MSP, vanilla ROMS spends: • 66% in bulk_flux • 28% in LMD • 2% in the 2-D engine • Slower than either the Power4 or the SX6 • Inlining lmd_wscale via a compiler option vastly speeds up LMD; John Levesque has offered to rewrite bulk_flux - the aim is 6-8 times faster than the Power4 for CCSM
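
Those profile fractions also bound what a rewrite can buy, by Amdahl's law. The sketch below assumes a purely hypothetical 10x per-routine speedup: with that assumption, speeding up bulk_flux alone gives roughly 2.5x overall, while speeding up bulk_flux plus LMD gives roughly 6.5x, which is in the range quoted for the CCSM goal (although that goal is measured against the Power4 rather than against the unoptimized X1 run).

      program amdahl_sketch
        implicit none
        real :: f_bulk = 0.66, f_lmd = 0.28  ! profile fractions from this slide
        real :: s = 10.0                     ! assumed (hypothetical) per-routine speedup

        ! Amdahl's law: overall speedup = 1 / (untouched fraction + sped-up fraction / s)
        print *, 'speed up bulk_flux only  :', 1.0/((1.0 - f_bulk) + f_bulk/s)
        print *, 'speed up bulk_flux + LMD :', 1.0/((1.0 - f_bulk - f_lmd) + (f_bulk + f_lmd)/s)
      end program amdahl_sketch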

  11. Clusters • Can buy rack mounted turnkey systems running Linux • Need to spend money on: • Memory • Processors - single cpu nodes may be best • Switch - low latency, high bandwidth • Disk storage

  12. Don Morton’s Experience • No such thing as turnkey Beowulf • Need someone to take care of it: • Configure queuing system to make it useful for more than one user • Security updates • Backups

  13. DARPA Petaflops award • Sun, IBM, and Cray each awarded ~$50 million for phase 2 development • Two will be awarded phase 3 in 2006 • Goal is to reach petaflops performance by about 2010, along with easier programming and a more robust operating environment • Sun - new switch between cpus and memory • IBM - huge on-chip cache • Cray - a mix of heavyweight and lightweight cpus

  14. Conclusions • Things are still exciting in the computer industry • The only thing you can count on is change
