Benchmark performance on Bassi
Jonathan Carter, User Services Group Lead
jtcarter@lbl.gov
NERSC User Group Meeting, June 12, 2006
NERSC 5 Application Benchmarks
• CAM3: climate model, NCAR
• GAMESS: computational chemistry, Iowa State, Ames Lab
• GTC: fusion, PPPL
• MADbench: astrophysics (CMB analysis), LBL
• MILC: QCD, multi-site collaboration
• PARATEC: materials science, developed at LBL and UC Berkeley
• PMEMD: computational chemistry, University of North Carolina at Chapel Hill
CAM3
• Community Atmospheric Model version 3
• Developed at NCAR with substantial DOE input, both scientific and software
• The atmosphere model for CCSM, the coupled climate system model; also the most time-consuming part of CCSM
• Widely used by both American and foreign scientists for climate research
  • For example, carbon and bio-geochemistry models are built upon (integrated with) CAM3
  • IPCC predictions use CAM3 (in part)
• About 230,000 lines of Fortran 90 code
• 1D decomposition runs on up to 128 processors at T85 resolution (150 km)
• 2D decomposition runs on up to 1,680 processors at 0.5 degree (60 km) resolution (see the sketch below)
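A minimal sketch of why a 2D decomposition scales further than a 1D one: splitting only latitude bands caps the task count at the number of latitudes, while splitting both latitude and longitude multiplies the available parallelism. This is not CAM3 source; the grid and block sizes are hypothetical, chosen only to illustrate the idea.

```c
/* Illustrative sketch (not CAM3 code): each task's sub-block in a 2D
 * latitude/longitude decomposition.  Grid and block counts are made up. */
#include <stdio.h>

int main(void)
{
    const int nlat = 128, nlon = 256;   /* roughly a T85-sized grid        */
    const int plat = 16,  plon = 8;     /* 16 x 8 = 128 tasks, for example */

    for (int rank = 0; rank < plat * plon; rank++) {
        int pi = rank / plon, pj = rank % plon;                    /* task coords */
        int lat0 = pi * (nlat / plat), lat1 = lat0 + nlat / plat;  /* lat range   */
        int lon0 = pj * (nlon / plon), lon1 = lon0 + nlon / plon;  /* lon range   */
        if (rank < 2)                    /* print a couple of examples */
            printf("rank %d owns lat [%d,%d) lon [%d,%d)\n",
                   rank, lat0, lat1, lon0, lon1);
    }
    return 0;
}
```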
GAMESS
• Computational chemistry application
• Variety of electronic structure algorithms available
• About 550,000 lines of Fortran 90
• Communication layer makes use of highly optimized vendor libraries
• Many methods available within the code
• Benchmarks are DFT energy and gradient calculations, and MP2 energy and gradient calculations
  • Many computational chemistry studies rely on these techniques
• Exactly the same as the DOD HPCMP TI-06 GAMESS benchmark
  • Vendors will only have to do the work once
GAMESS: Performance
• Small case: large, messy, low computational-intensity kernels are problematic for compilers
• Large case: performance depends on asynchronous messaging
GTC
• Gyrokinetic Toroidal Code
• Important code for the Fusion SciDAC project and for the international fusion collaboration ITER
• Models transport of thermal energy via plasma microturbulence using the particle-in-cell (PIC) approach (see the sketch below)
[Figure: 3D visualization of the electrostatic potential in a magnetic fusion device]
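A minimal 1D particle-in-cell sketch, purely to show the structure of the PIC approach named above (GTC itself is a 3D gyrokinetic Fortran code; all sizes and parameters here are hypothetical). The scatter phase's indirect grid updates are exactly the kind of irregular access the performance slide below comments on.

```c
/* Minimal 1D PIC sketch (illustrative only): scatter (charge deposition),
 * then gather + push using the interpolated field. */
#include <stdio.h>

#define NG 64          /* grid cells  */
#define NP 1024        /* particles   */

int main(void)
{
    double rho[NG] = {0}, efield[NG] = {0};
    double x[NP], v[NP];
    double dx = 1.0 / NG, dt = 0.01, qm = -1.0;   /* hypothetical parameters */

    for (int p = 0; p < NP; p++) { x[p] = (p + 0.5) / NP; v[p] = 0.0; }

    /* Scatter: deposit each particle's charge onto its two nearest grid
     * points with linear weights -- irregular, indirect writes. */
    for (int p = 0; p < NP; p++) {
        int    i = (int)(x[p] / dx) % NG;
        double w = x[p] / dx - i;
        rho[i]            += 1.0 - w;
        rho[(i + 1) % NG] += w;
    }

    /* (Field solve omitted; assume efield has been filled from rho.) */

    /* Gather + push: interpolate the field back to each particle and advance. */
    for (int p = 0; p < NP; p++) {
        int    i = (int)(x[p] / dx) % NG;
        double w = x[p] / dx - i;
        double e = (1.0 - w) * efield[i] + w * efield[(i + 1) % NG];
        v[p] += qm * e * dt;
        x[p] += v[p] * dt;
        if (x[p] <  0.0) x[p] += 1.0;             /* periodic boundary */
        if (x[p] >= 1.0) x[p] -= 1.0;
    }

    printf("rho[0] = %g, x[0] = %g\n", rho[0], x[0]);
    return 0;
}
```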
GTC: Performance
• SX8 delivers the highest raw performance (ever) but lower efficiency than the ES
• Scalar architectures suffer from low computational intensity, irregular data access, and register spilling
• Opteron/InfiniBand is 50% faster than Itanium2/Quadrics and only half the speed of the X1
  • Opteron benefits from its on-chip memory controller and caching of floating-point data in L1
• X1 suffers from the overhead of scalar code portions
MADbench
• Cosmic microwave background radiation analysis tool (MADCAP)
• Used a large amount of time in FY04; one of the highest-scaling codes at NERSC
• MADbench is a benchmark version of the original code
  • Designed to be easily run with synthetic data for portability
• Used in a recent study in conjunction with the Berkeley Institute for Performance Studies (BIPS)
• Written in C, making extensive use of the ScaLAPACK libraries
• Has extensive I/O requirements
MADbench: Performance
• Runtime dominated by
  • BLAS3
  • I/O (see the sketch below)
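To make the two dominant phases concrete, here is an illustrative (non-MADbench) sketch pairing a dense BLAS3 call with a bulk binary write. The matrix size and file name are made up; the real code uses distributed ScaLAPACK routines rather than the serial CBLAS call shown here. Link against a CBLAS implementation.

```c
/* Illustrative sketch: the two operations said to dominate MADbench,
 * dense BLAS3 (dgemm) and large binary I/O.  Not MADbench source. */
#include <stdio.h>
#include <stdlib.h>
#include <cblas.h>

int main(void)
{
    const int n = 512;                          /* hypothetical matrix size */
    double *A = malloc(sizeof *A * n * n);
    double *B = malloc(sizeof *B * n * n);
    double *C = malloc(sizeof *C * n * n);
    if (!A || !B || !C) return 1;
    for (int i = 0; i < n * n; i++) { A[i] = 1.0; B[i] = 2.0; C[i] = 0.0; }

    /* BLAS3 phase: C = A * B */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, A, n, B, n, 0.0, C, n);

    /* I/O phase: write the result out in one large binary transfer */
    FILE *f = fopen("matrix.dat", "wb");        /* hypothetical file name */
    if (f) { fwrite(C, sizeof *C, (size_t)n * n, f); fclose(f); }

    printf("C[0] = %g\n", C[0]);
    free(A); free(B); free(C);
    return 0;
}
```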
MILC
• Quantum chromodynamics (QCD) application
• Widespread community use, large allocation
• Easy to build: no dependencies, standards-conforming
• Can be set up to run at a wide range of concurrencies
• Conjugate gradient algorithm
• Physics on a 4D lattice
• Local computations are 3x3 complex matrix multiplies with a sparse (indirect) access pattern (see the sketch below)
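A minimal sketch of the local kernel described above: a 3x3 complex (SU(3)-style) matrix applied to a 3-vector fetched through a neighbor index table, i.e. a gather-style indirect access. This is not MILC source; the type names, the lattice size, and the stand-in neighbor table are all made up for illustration.

```c
/* Illustrative sketch (not MILC code): 3x3 complex matrix-vector multiply
 * with an indirect neighbor index, as in a staggered CG sweep. */
#include <complex.h>
#include <stdio.h>

#define NSITES 16                       /* hypothetical number of lattice sites */

typedef struct { double complex e[3][3]; } su3_matrix;
typedef struct { double complex c[3];    } su3_vector;

/* y = U * x : 3x3 complex matrix times 3-vector */
static void mult_su3_mat_vec(const su3_matrix *U, const su3_vector *x,
                             su3_vector *y)
{
    for (int i = 0; i < 3; i++) {
        y->c[i] = 0.0;
        for (int j = 0; j < 3; j++)
            y->c[i] += U->e[i][j] * x->c[j];
    }
}

int main(void)
{
    su3_matrix U[NSITES];
    su3_vector src[NSITES], dst[NSITES];
    int neighbor[NSITES];               /* indirect (sparse) access pattern */

    for (int s = 0; s < NSITES; s++) {
        neighbor[s] = (s + 1) % NSITES; /* stand-in for a 4D neighbor table */
        for (int i = 0; i < 3; i++) {
            src[s].c[i] = 1.0 + I * s;
            for (int j = 0; j < 3; j++)
                U[s].e[i][j] = (i == j) ? 1.0 : 0.0;
        }
    }

    /* Gather-style application of the link matrices */
    for (int s = 0; s < NSITES; s++)
        mult_su3_mat_vec(&U[s], &src[neighbor[s]], &dst[s]);

    printf("dst[0].c[0] = %g + %gi\n", creal(dst[0].c[0]), cimag(dst[0].c[0]));
    return 0;
}
```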
PARATEC
• Parallel Total Energy Code
• Plane-wave DFT using a custom 3D FFT (illustrated below)
• 70% of materials science computation at NERSC is done with plane-wave DFT codes; PARATEC captures the performance of a wide range of such codes (VASP, CPMD, PETOT)
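A sketch of the kind of 3D FFT at the heart of plane-wave DFT codes. PARATEC uses its own custom parallel 3D FFT; this example uses serial FFTW3 purely to show the operation, and the grid size is hypothetical.

```c
/* Illustrative sketch: transform a wavefunction-sized 3D grid between
 * reciprocal and real space with FFTW3 (not PARATEC's custom FFT). */
#include <fftw3.h>
#include <stdio.h>

int main(void)
{
    const int n = 32;                                   /* hypothetical grid */
    fftw_complex *grid = fftw_alloc_complex((size_t)n * n * n);

    for (int i = 0; i < n * n * n; i++) { grid[i][0] = 1.0; grid[i][1] = 0.0; }

    /* In-place forward 3D transform */
    fftw_plan fwd = fftw_plan_dft_3d(n, n, n, grid, grid,
                                     FFTW_FORWARD, FFTW_ESTIMATE);
    fftw_execute(fwd);

    printf("grid[0] = %g + %gi\n", grid[0][0], grid[0][1]);

    fftw_destroy_plan(fwd);
    fftw_free(grid);
    return 0;
}
```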
PARATEC: Performance
• All architectures generally perform well due to the computational intensity of the code (BLAS3, FFT)
• SX8 achieves the highest per-processor performance
• X1/X1E shows the lowest % of peak
  • Non-vectorizable code is much more expensive on X1/X1E (32:1)
  • Lower bisection bandwidth to computation ratio (4D hypercube)
  • X1 performance is comparable to Itanium2
• Itanium2 outperforms Opteron because
  • PARATEC is less sensitive to memory access issues (BLAS3)
  • Opteron lacks an FMA unit
• Quadrics shows better scaling of all-to-all at large concurrencies
PMEMD
• Particle Mesh Ewald Molecular Dynamics
• An F90 code with advanced MPI coding; should test the compiler and stress asynchronous point-to-point messaging (see the sketch below)
• PMEMD is very similar to the MD engine in AMBER 8.0, used in both chemistry and the biosciences
• Test system is a 91K-atom blood coagulation protein
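A minimal sketch of the asynchronous point-to-point pattern the slide refers to: post non-blocking receives and sends, overlap them with local work, then wait on all requests. PMEMD itself is Fortran 90; this is written in C for consistency with the other sketches, and the buffer size and ring-exchange pattern are hypothetical.

```c
/* Illustrative sketch: non-blocking MPI point-to-point exchange overlapped
 * with local work (not PMEMD source). */
#include <mpi.h>
#include <stdio.h>

#define NBUF 1024

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int left  = (rank - 1 + size) % size;
    int right = (rank + 1) % size;

    double sendbuf[NBUF], recvbuf[NBUF];
    for (int i = 0; i < NBUF; i++) sendbuf[i] = rank;

    /* Post the asynchronous receive and send */
    MPI_Request req[2];
    MPI_Irecv(recvbuf, NBUF, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &req[0]);
    MPI_Isend(sendbuf, NBUF, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &req[1]);

    /* ... local force computation would overlap with communication here ... */

    MPI_Waitall(2, req, MPI_STATUSES_IGNORE);

    if (rank == 0)
        printf("rank 0 received data from rank %d: %g\n", left, recvbuf[0]);

    MPI_Finalize();
    return 0;
}
```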
Summary
• Average Bassi to Seaborg performance ratio is 6.0 for the N5 application benchmarks