HPC and the ROMS BENCHMARK Program

HPC and the ROMS BENCHMARK Program Kate Hedstrom August 2003

Outline
• New ARSC systems
• Experience with ROMS benchmark problem
• Other computer news

New ARSC Systems
• Cray X1
  • 128 MSP (1.5 TFLOPS)
  • 4 GB/MSP
  • Water cooled
• IBM p690+ and p655+
  • 5 TFLOPS total
  • At least 2 GB/cpu
  • Air cooled
  • Arriving in September, switch later

Cray X1 (klondike)

Cray
• Cray X1 Node
  • Node is a 4-way SMP
  • 16 GB/node
  • Each MSP has four vector/scalar processors
  • Processors in MSP share cache
  • Node usable as 4 MSPs or 16 SSPs
  • IEEE floating point hardware
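The node figures above can be cross-checked with a little arithmetic. This is a minimal sketch that takes the 128 MSP and 4 GB/MSP figures from the New ARSC Systems slide and the 4 MSPs per node and 4 processors (SSPs) per MSP from this slide; the function name `x1_layout` is hypothetical and used only for illustration.

```python
# Per-node layout as stated on the slides.
MSPS_PER_NODE = 4   # "Node is a 4-way SMP"
SSPS_PER_MSP = 4    # "Each MSP has four vector/scalar processors"
GB_PER_MSP = 4      # "4 GB/MSP"


def x1_layout(total_msps):
    """Return node count, SSP count and memory totals for a given MSP count."""
    return {
        "nodes": total_msps // MSPS_PER_NODE,
        "ssps": total_msps * SSPS_PER_MSP,
        "gb_per_node": MSPS_PER_NODE * GB_PER_MSP,
        "gb_total": total_msps * GB_PER_MSP,
    }


layout = x1_layout(128)
for key, value in layout.items():
    print(f"{key}: {value}")
```

The 16 GB/node figure on the slide is consistent with 4 MSPs at 4 GB each.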

Cray
• Programming Environment
  • Fortran, C, C++
  • Support for
    • MPI
    • SHMEM
    • Co-Array Fortran
    • UPC
    • OpenMP (Fall 2003)
  • Compilation runs on the CPES (a Sun V480), invisibly to the user

IBM
• Two p690+
  • Like our Regatta, but faster, more memory (8 GB/cpu)
  • Shared memory between 32 cpus
  • For big OpenMP jobs
• Six p655+ towers
  • Like our SP, but faster, more memory (2 GB/cpu)
  • Shared memory on each 8 cpu node, 92 nodes in all
  • For big MPI jobs and small OpenMP jobs

Benchmark Problem
• No external files to read
• Three different resolutions
• Periodic channel representing the Antarctic Circumpolar Current (ACC)
• Steep bathymetry
• Idealized winds, clouds, etc., but full computation of atmospheric boundary layer
• KPP vertical mixing

IBM and SX6 Notes
• Peak per processor: SX6 is 8 GFLOPS, Power4 is 5.2 GFLOPS
• Both achieve less than 10% of peak on this benchmark
• IBM scales better; a Cray person says the SX6 is even worse on more than one node
• SX6 does best with 1xN tiling; IBM does better closer to MxM, even though this problem is 512x64
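To make the tiling trade-off concrete, the sketch below enumerates the ways a 512x64 grid can be split evenly over a given number of processes and reports each tile's size and its halo-to-interior ratio. It is an illustration only: the function name `tilings`, the one-cell halo width, and the choice of 16 processes are assumptions, not details from the talk. A plausible reading of the slide is that a vector machine such as the SX6 favors the tiling that keeps the inner dimension at full length, while a cache-based machine favors squarer tiles with less halo per interior point.

```python
def tilings(nx, ny, nprocs):
    """Yield (ntile_i, ntile_j, tile_nx, tile_ny, halo_ratio) for each
    factorization of nprocs that divides the grid evenly.

    halo_ratio is the number of cells in a one-cell-wide ring around the
    tile (corners ignored) divided by the number of interior cells.
    """
    for ntile_i in range(1, nprocs + 1):
        if nprocs % ntile_i:
            continue
        ntile_j = nprocs // ntile_i
        if nx % ntile_i or ny % ntile_j:
            continue  # skip tilings that do not divide the grid evenly
        tile_nx = nx // ntile_i
        tile_ny = ny // ntile_j
        halo = 2 * (tile_nx + tile_ny)
        yield ntile_i, ntile_j, tile_nx, tile_ny, halo / (tile_nx * tile_ny)


# Hypothetical process count, chosen only for illustration.
results = list(tilings(512, 64, 16))
for ntile_i, ntile_j, tile_nx, tile_ny, ratio in results:
    print(f"{ntile_i:2d}x{ntile_j:<2d} tiles of {tile_nx:3d}x{tile_ny:<2d} "
          f"halo/interior = {ratio:.3f}")
```

In this sketch the 1x16 tiling keeps the full 512-point rows but has the largest halo ratio, while the squarer tilings trade shorter rows for a smaller ratio.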

Cray X1 Notes
• Have choice of MSP or SSP mode
• Four SSPs faster than one MSP
• Sixteen MSPs much faster than 64 SSPs
• On one MSP, vanilla ROMS spends:
  • 66% in bulk_flux
  • 28% in LMD
  • 2% in 2-D engine
• Slower than either Power4 or SX6
• Can inline lmd_wscale and vastly speed up LMD with a compiler option
• John Levesque has offered to rewrite bulk_flux; the aim is 6-8 times faster than Power4 for CCSM
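The profile above shows why bulk_flux and LMD are the targets: together they account for 94% of the run time on one MSP. The sketch below applies Amdahl's law to the three listed fractions (the remaining 4% is not itemized on the slide). The speedup factors passed in are hypothetical values chosen for illustration, not measurements, and `overall_speedup` is a helper name invented here.

```python
def overall_speedup(fractions, speedups):
    """Amdahl's law: overall speedup when some routines run faster.

    fractions: routine name -> share of the original run time
    speedups:  routine name -> factor by which that routine speeds up
               (routines not listed are left unchanged)
    """
    unlisted = 1.0 - sum(fractions.values())
    new_time = unlisted + sum(
        share / speedups.get(name, 1.0) for name, share in fractions.items()
    )
    return 1.0 / new_time


# Time shares from the slide (one MSP, vanilla ROMS).
profile = {"bulk_flux": 0.66, "LMD": 0.28, "2-D engine": 0.02}

# Hypothetical: both hot routines become 10x faster.
both = overall_speedup(profile, {"bulk_flux": 10.0, "LMD": 10.0})

# Hypothetical: only LMD improves, bulk_flux is left alone.
lmd_only = overall_speedup(profile, {"LMD": 10.0})

print(f"both routines 10x faster: {both:.2f}x overall")
print(f"only LMD 10x faster:      {lmd_only:.2f}x overall")
```

Under these assumed factors, speeding up LMD alone gives a small overall gain because bulk_flux still dominates, which fits the emphasis on rewriting bulk_flux.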

Clusters
• Can buy rack mounted turnkey systems running Linux
• Need to spend money on:
  • Memory
  • Processors - single cpu nodes may be best
  • Switch - low latency, high bandwidth
  • Disk storage

Don Morton’s Experience
• No such thing as turnkey Beowulf
• Need someone to take care of it:
  • Configure queuing system to make it useful for more than one user
  • Security updates
  • Backups

DARPA Petaflops award
• Sun, IBM, Cray each awarded ~$50 million for phase 2 development
• Two will be awarded phase 3 in 2006
• Goal is to achieve petaflops by about 2010, with systems that are also easier to program and have a more robust operating environment
• Sun - new switch between cpus, memory
• IBM - huge cache on chip
• Cray - heavyweight and lightweight cpus

Conclusions
• Things are still exciting in the computer industry
• The only thing you can count on is change
