100 TF Sustained on Cray X Series SOS 8 April 13, 2004 James B. White III (Trey) trey@ornl.gov
Disclaimer • The opinions expressed here do not necessarily represent those of the CCS, ORNL, DOE, the Executive Branch of the Federal Government of the United States of America, or even UT-Battelle.
Disclaimer (cont.) • Graph-free, chart-free environment • For graphs and charts: http://www.csm.ornl.gov/evaluation/PHOENIX/
100 Real TF on Cray Xn • Who needs capability computing? • Application requirements • Why Xn? • Laundry, Clean and Otherwise • Rants • Custom vs. Commodity • MPI • CAF • Cray
Who needs capability computing? • OMB? • Politicians? • Vendors? • Center directors? • Computer scientists?
Who needs capability computing? • Application scientists • According to scientists themselves
Personal Communications • Fusion • General Atomics, Iowa, ORNL, PPPL, Wisconsin • Climate • LANL, NCAR, ORNL, PNNL • Materials • Cincinnati, Florida, NC State, ORNL, Sandia, Wisconsin • Biology • NCI, ORNL, PNNL • Chemistry • Auburn, LANL, ORNL, PNNL • Astrophysics • Arizona, Chicago, NC State, ORNL, Tennessee
Scientists Need Capability • Climate scientists need simulation fidelity to support policy decisions • All we can say now is that humans cause warming • Fusion scientists need to simulate fusion devices • All we can do now is model decoupled subprocesses at disparate time scales • Materials scientists need to design new materials • Just starting to reproduce known materials
Scientists Need Capability • Biologists need to simulate proteins and protein pathways • Baby steps with smaller molecules • Chemists need similar increases in complexity • Astrophysicists need to simulate nucleogenesis (high-res, 3D CFD, 6D neutrinos, long times) • Low-res, 3D CFD, approximate 3D neutrinos, short times
Why Scientists Might Resist • Capacity also needed • Software isn’t ready • Coerced to run capability-sized jobs on inappropriate systems
Capability Requirements • Sample DOE SC applications • Climate: POP, CAM • Fusion: AORSA, Gyro • Materials: LSMS, DCA-QMC
Parallel Ocean Program (POP) • Baroclinic • 3D, nearest neighbor, scalable • Memory-bandwidth limited • Barotropic • 2D implicit system, latency bound • Ocean-only simulation • Higher resolution • Faster time steps • As ocean component for CCSM • Atmosphere dominates
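To make the latency point concrete, here is a minimal sketch, not POP source and with invented sizes, of why the barotropic solve is latency bound: each solver iteration needs a global dot product, so a one-word MPI_Allreduce sits on the critical path while the local work is tiny.

  ! Minimal sketch of a latency-bound barotropic-style iteration (not POP code).
  program barotropic_sketch
    use mpi
    implicit none
    integer :: ierr, iter
    real(8) :: local_dot, global_dot
    real(8), allocatable :: p(:), r(:)

    call MPI_Init(ierr)
    allocate(p(1000), r(1000))
    p = 1.0d0
    r = 1.0d0
    do iter = 1, 100                       ! solver iterations
      local_dot = dot_product(p, r)        ! cheap local work on a small 2D subdomain
      call MPI_Allreduce(local_dot, global_dot, 1, MPI_DOUBLE_PRECISION, &
                         MPI_SUM, MPI_COMM_WORLD, ierr)   ! one word: pure latency
      ! ... update p and r using global_dot ...
    end do
    call MPI_Finalize(ierr)
  end program barotropic_sketch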
Community Atmosphere Model (CAM) • Atmosphere component for CCSM • Higher resolution? • Physics changes, parameterization must be retuned, model must be revalidated • Major effort, rare event • Spectral transform not dominant • Dramatic increases in computation per grid point • Dynamic vegetation, carbon cycle, atmospheric chemistry, … • Faster time steps
All-Orders Spectral Algorithm (AORSA) • Radio-frequency fusion-plasma simulation • Highly scalable • Dominated by ScaLAPACK • Still in weak-scaling regime • But… • Expanded physics reducing ScaLAPACK dominance • Developing sparse formulation
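For scale, the kind of distributed dense solve that dominates AORSA looks roughly like the call below. This is an illustrative sketch, not AORSA's driver: it assumes a BLACS process grid (ictxt), block size (nb), and local leading dimension (lld) have already been set up.

  ! Illustrative ScaLAPACK LU solve of a distributed complex dense system.
  subroutine dense_solve_sketch(n, nb, ictxt, lld, a, b, ipiv)
    implicit none
    integer, intent(in) :: n, nb, ictxt, lld
    complex(8), intent(inout) :: a(*), b(*)     ! local pieces of A and the right-hand side
    integer, intent(out) :: ipiv(*)
    integer :: desca(9), descb(9), info

    call descinit(desca, n, n, nb, nb, 0, 0, ictxt, lld, info)     ! descriptor for A
    call descinit(descb, n, 1, nb, nb, 0, 0, ictxt, lld, info)     ! descriptor for b
    call pzgesv(n, 1, a, 1, 1, desca, ipiv, b, 1, 1, descb, info)  ! O(n**3) flops, highly scalable
  end subroutine dense_solve_sketch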
Gyro • Continuum gyrokinetic simulation of fusion-plasma microturbulence • 1D data decomposition • Spectral method - high communication volume • Some need for increased resolution • More iterations
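A hedged sketch of what a 1D data decomposition implies for a spectral method: each step needs a global transpose, shown here as an MPI_Alltoall (Gyro's actual communication structure may differ), so the volume moved scales with the whole field and interconnect bandwidth sits on the critical path.

  ! Illustrative global transpose for a 1D-decomposed spectral step.
  subroutine spectral_transpose(nlocal, nprocs, work_in, work_out, comm)
    use mpi
    implicit none
    integer, intent(in) :: nlocal, nprocs, comm
    complex(8), intent(in)  :: work_in(nlocal*nprocs)
    complex(8), intent(out) :: work_out(nlocal*nprocs)
    integer :: ierr

    ! Every process exchanges a block with every other process each step.
    call MPI_Alltoall(work_in, nlocal, MPI_DOUBLE_COMPLEX, &
                      work_out, nlocal, MPI_DOUBLE_COMPLEX, comm, ierr)
  end subroutine spectral_transpose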
Locally Self-Consistent Multiple Scattering (LSMS) • Calculates electronic structure of large systems • One atom per processor • Dominated by local DGEMM • First real application to sustain a TF • But… moving to sparse formulation with a distributed solve for each atom
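A minimal sketch of the per-atom local kernel, with invented sizes: a single dense DGEMM, which does O(n**3) flops on O(n**2) data and therefore runs near peak on a strong processor.

  ! Per-atom local dense multiply of the kind that dominates LSMS (sizes invented).
  subroutine local_multiply(n, a, b, c)
    implicit none
    integer, intent(in) :: n
    real(8), intent(in)    :: a(n,n), b(n,n)
    real(8), intent(inout) :: c(n,n)

    call dgemm('N', 'N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n)   ! BLAS3: compute bound
  end subroutine local_multiply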
Dynamic Cluster Approximation (DCA-QMC) • Simulates high-temp superconductors • Dominated by DGER (BLAS2) • Memory-bandwidth limited • Quantum Monte Carlo, but… • Fixed start-up per process • Favors fewer, faster processors • Needs powerful processors to avoid parallelizing each Monte-Carlo stream
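By contrast, a sketch of the rank-1 update: DGER does only 2*m*n flops while the whole matrix streams through memory, so the kernel is limited by memory bandwidth rather than peak flops (the wrapper and sizes here are illustrative).

  ! Rank-1 update of the kind that dominates DCA-QMC (illustrative wrapper).
  subroutine rank1_update(m, n, alpha, x, y, g)
    implicit none
    integer, intent(in) :: m, n
    real(8), intent(in) :: alpha, x(m), y(n)
    real(8), intent(inout) :: g(m,n)

    call dger(m, n, alpha, x, 1, y, 1, g, m)   ! G = G + alpha * x * y**T, BLAS2: bandwidth bound
  end subroutine rank1_update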
Few DOE SC Applications • Weak-ish scaling • Dense linear algebra • But moving to sparse
Many DOE SC Applications • “Strong-ish” scaling • Limited increase in gridpoints • Major increase in expense per gridpoint • Major increase in time steps • Fewer, more-powerful processors • High memory bandwidth • High-bandwidth, low-latency communication
Why X1? • “Strong-ish” scaling • Limited increase in gridpoints • Major increase in expense per gridpoint • Major increase in time steps • Fewer, more-powerful processors • High memory bandwidth • High-bandwidth, low-latency communication
Tangent: Strongish* Scaling * Greg Lindahl, Vendor Scum • Firm • Semistrong • Unweak • Strongoidal • MSTW (More Strong Than Weak) • JTSoS (Just This Side of Strong) • WNS (Well-Nigh Strong) • Seak, Steak, Streak, Stroak, Stronk • Weag, Weng, Wong, Wrong, Twong
X1 for 100 TF Sustained? • Uh, no • OS not scalable, fault-resilient enough for 10^4 processors • That “price/performance” thing • That “power & cooling” thing
Xn for 100 TF Sustained • For DOE SC applications, YES • Most-promising candidate -or- • Least-implausible candidate
Why X, again? • Most-powerful processors • Reduce need for scalability • Obey Amdahl’s Law • High memory bandwidth • See above • Globally addressable memory • Lowest, most hide-able latency • Scale latency-bound applications • High interconnect bandwidth • Scale bandwidth-bound applications
The Bad News • Scalar performance • “Some tuning required” • Ho-hum MPI latency • See Rants
Scalar Performance • Compilation is slow • Amdahl’s Law for single processes • Parallelization -> Vectorization • Hard to port GNU tools • GCC? Are you kidding? • GCC compatibility, on the other hand… • Black Widow will be better
“Some Tuning Required” • Vectorization requires: • Independent operations • Dependence information • Mapping to vector instructions • Applications take a wide spectrum of steps to inhibit this • May need a couple of compiler directives • May need extensive rewriting
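An example of the “couple of compiler directives” case; the loop is illustrative, not from a real application, and the directive spelling varies by compiler. The compiler cannot prove the indexed update below is dependence-free, but if ix never repeats an index the programmer can assert it and the loop vectorizes.

  ! Indirect addressing: vectorizable only once the programmer waives the dependence.
  subroutine scatter_add(n, ix, a, b)
    implicit none
    integer, intent(in) :: n, ix(n)
    real(8), intent(inout) :: a(*)
    real(8), intent(in) :: b(n)
    integer :: i

  !dir$ ivdep
    do i = 1, n
      a(ix(i)) = a(ix(i)) + b(i)
    end do
  end subroutine scatter_add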
Application Results • Awesome • Indifferent • Recalcitrant • Hopeless
Awesome Results • 256-MSP X1 already showing unique capability • Apps bound by memory bandwidth, interconnect bandwidth, interconnect latency • POP, Gyro, DCA-QMC, AGILE-BOLTZTRAN, VH1, Amber, … • Many examples from DoD
Indifferent Results • Cray X1 is brute-force fast, but not cost effective • Dense linear algebra • Linpack, AORSA, LSMS
Recalcitrant Results • Inherent algorithms are fine • Source code or ongoing code mods don’t vectorize • Significant code rewriting done, ongoing, or needed • CLM, CAM, Nimrod, M3D
Aside: How to Avoid Vectorization • Use pointers to add false dependencies • Put deep call stacks inside loops • Put debug I/O operations inside compute loops • Did I mention using pointers?
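What that looks like in code, as a cautionary illustration (not from any real application): the possible pointer alias and the debug write each force the loop to run scalar.

  ! Two easy ways to keep a loop from vectorizing.
  subroutine how_not_to_vectorize(n, a, p, b)
    implicit none
    integer, intent(in) :: n
    real(8), intent(inout), target :: a(n)
    real(8), pointer :: p(:)           ! may alias a; compiler must assume a dependence
    real(8), intent(in) :: b(n)
    integer :: i

    do i = 2, n
      a(i) = p(i-1) + b(i)             ! looks like a recurrence through the alias
      write(*,*) 'debug', i            ! I/O in the loop body blocks vectorization too
    end do
  end subroutine how_not_to_vectorize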
Aside: Software Design • In general, we don’t know how to systematically design efficient, maintainable HPC software • Vectorization imposes constraints on software design • Bad: Existing software must be rewritten • Good: Resulting software often faster on modern superscalar systems • “Some tuning required” for X series • Bad: You must tune • Good: Tuning is systematic, not a Black Art • Vectorization “constraints” may help us develop effective design patterns for HPC software
Hopeless Results • Dominated by unvectorizable algorithms • Some benchmark kernels of questionable relevance • No known DOE SC applications
Summary • DOE SC scientists do need 100 TF and beyond of sustained application performance • Cray X series is the least-implausible option for scaling DOE SC applications to 100 TF of sustained performance and beyond
“Custom” Rant • “Custom vs. Commodity” is a red herring • CMOS is commodity • Memory is commodity • Wires are commodity • Cooling is independent of vector vs. scalar • PNNL liquid-cools its clusters • Vector systems may move to air-cooling • All vendors do custom packaging • Real issue: Software
MPI Rant • Latency-bound apps often limited by “MPI_Allreduce(…, MPI_SUM, …)” • Not ping pong! • An excellent abstraction that is eminently optimizable • Some apps are limited by point-to-point • Remote load/store implementations (CAF, UPC) have performance advantages over MPI • But MPI could be implemented using load/store, inlined, and optimized • On the other hand, easier to avoid pack/unpack with load/store model
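A hedged sketch of the pack/unpack point, not taken from any application above: the MPI version copies a strided edge into a buffer before sending (matching receive omitted), while the coarray version stores directly into the neighbor's ghost cells, which is exactly what globally addressable memory makes cheap.

  ! MPI version: pack a non-contiguous boundary, then send it.
  subroutine halo_mpi(nx, ny, u, right, comm)
    use mpi
    implicit none
    integer, intent(in) :: nx, ny, right, comm
    real(8), intent(inout) :: u(0:nx+1, ny)
    real(8) :: buf(ny)
    integer :: ierr

    buf = u(nx, :)                                           ! pack the strided edge
    call MPI_Send(buf, ny, MPI_DOUBLE_PRECISION, right, 0, comm, ierr)
  end subroutine halo_mpi

  ! Coarray version: direct remote store, no packing, but still two-sided in spirit.
  subroutine halo_caf(nx, ny, u, right)
    implicit none
    integer, intent(in) :: nx, ny, right
    real(8), intent(inout) :: u(0:nx+1, ny)[*]               ! globally addressable
    sync all                                                 ! neighbor is ready (the hidden "receive")
    u(0, :)[right] = u(nx, :)                                ! put straight into ghost cells
    sync all                                                 ! make the put visible
  end subroutine halo_caf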
Co-Array-Fortran Rant • No such thing as one-sided communication • It's all two-sided: send+receive, sync+put+sync, sync+get+sync • Same parallel algorithms • CAF mods can be highly nonlocal • Adding CAF in a subroutine can have implications for the argument types, and thus for the callers, the callers' callers, etc. • Rarely the case for MPI • We use CAF to avoid MPI-implementation performance inadequacies • Avoiding nonlocality by cheating with Cray pointers
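The nonlocality point in code form, with made-up names: once update_edge wants to do the remote store itself, its dummy argument must be a coarray, so every caller, and every caller of those callers, must declare its actual argument as a coarray too. An MPI version of the same routine would take a plain array and a communicator and leave its callers untouched.

  module edge_mod
    implicit none
  contains
    subroutine update_edge(n, u, right)
      integer, intent(in) :: n, right
      real(8), intent(inout) :: u(n, n)[*]   ! coarray dummy forces coarray actual arguments
      sync all                               ! the "one-sided" put still needs two-sided synchronization
      u(:, 1)[right] = u(:, n)
      sync all
    end subroutine update_edge
  end module edge_mod

  program caller
    use edge_mod
    implicit none
    real(8), allocatable :: u(:,:)[:]        ! the caller's declaration must change as well
    integer :: right

    allocate(u(64, 64)[*])
    u = 0.0d0
    right = merge(this_image()+1, 1, this_image() < num_images())
    call update_edge(64, u, right)
  end program caller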
Cray Rant • Cray XD1 (OctigaBay) follows in tradition of T3E • Very promising architecture • Dumb name • Interesting competitor with Red Storm
Questions? James B. White III (Trey) trey@ornl.gov http://www.csm.ornl.gov/evaluation/PHOENIX/