100 TF Sustained on Cray X Series SOS 8 April 13, 2004 James B. White III (Trey) trey@ornl.gov
Disclaimer • The opinions expressed here do not necessarily represent those of the CCS, ORNL, DOE, the Executive Branch of the Federal Government of the United States of America, or even UT-Battelle.
Disclaimer (cont.) • Graph-free, chart-free environment • For graphs and charts: http://www.csm.ornl.gov/evaluation/PHOENIX/
100 Real TF on Cray Xn • Who needs capability computing? • Application requirements • Why Xn? • Laundry, Clean and Otherwise • Rants • Custom vs. Commodity • MPI • CAF • Cray
Who needs capability computing? • OMB? • Politicians? • Vendors? • Center directors? • Computer scientists?
Who needs capability computing? • Application scientists • According to scientists themselves
Personal Communications • Fusion • General Atomics, Iowa, ORNL, PPPL, Wisconsin • Climate • LANL, NCAR, ORNL, PNNL • Materials • Cincinnati, Florida, NC State, ORNL, Sandia, Wisconsin • Biology • NCI, ORNL, PNNL • Chemistry • Auburn, LANL, ORNL, PNNL • Astrophysics • Arizona, Chicago, NC State, ORNL, Tennessee
Scientists Need Capability • Climate scientists need simulation fidelity to support policy decisions • All we can say now is that humans cause warming • Fusion scientists need to simulate fusion devices • All we can do now is model decoupled subprocesses at disparate time scales • Materials scientists need to design new materials • Just starting to reproduce known materials
Scientists Need Capability • Biologists need to simulate proteins and protein pathways • Baby steps with smaller molecules • Chemists need similar increases in complexity • Astrophysicists need to simulate nucleogenesis (high-res, 3D CFD, 6D neutrinos, long times) • Low-res, 3D CFD, approximate 3D neutrinos, short times
Why Scientists Might Resist • Capacity also needed • Software isn’t ready • Coerced to run capability-sized jobs on inappropriate systems
Capability Requirements • Sample DOE SC applications • Climate: POP, CAM • Fusion: AORSA, Gyro • Materials: LSMS, DCA-QMC
Parallel Ocean Program (POP) • Baroclinic • 3D, nearest neighbor, scalable • Memory-bandwidth limited • Barotropic • 2D implicit system, latency bound • Ocean-only simulation • Higher resolution • Faster time steps • As ocean component for CCSM • Atmosphere dominates
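To make the latency point concrete, here is a minimal sketch, not POP source and with invented sizes, of why the barotropic solve is latency bound: each solver iteration needs a global dot product, so a one-word MPI_Allreduce sits on the critical path while the local work is tiny.

  ! Minimal sketch of a latency-bound barotropic-style iteration (not POP code).
  program barotropic_sketch
    use mpi
    implicit none
    integer :: ierr, iter
    real(8) :: local_dot, global_dot
    real(8), allocatable :: p(:), r(:)

    call MPI_Init(ierr)
    allocate(p(1000), r(1000))
    p = 1.0d0
    r = 1.0d0
    do iter = 1, 100                       ! solver iterations
      local_dot = dot_product(p, r)        ! cheap local work on a small 2D subdomain
      call MPI_Allreduce(local_dot, global_dot, 1, MPI_DOUBLE_PRECISION, &
                         MPI_SUM, MPI_COMM_WORLD, ierr)   ! one word: pure latency
      ! ... update p and r using global_dot ...
    end do
    call MPI_Finalize(ierr)
  end program barotropic_sketch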
Community Atmosphere Model (CAM) • Atmosphere component for CCSM • Higher resolution? • Physics changes, parameterization must be retuned, model must be revalidated • Major effort, rare event • Spectral transform not dominant • Dramatic increases in computation per grid point • Dynamic vegetation, carbon cycle, atmospheric chemistry, … • Faster time steps
All-Orders Spectral Algorithm (AORSA) • Radio-frequency fusion-plasma simulation • Highly scalable • Dominated by ScaLAPACK • Still in weak-scaling regime • But… • Expanded physics reducing ScaLAPACK dominance • Developing sparse formulation
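For scale, the kind of distributed dense solve that dominates AORSA looks roughly like the call below. This is an illustrative sketch, not AORSA's driver: it assumes a BLACS process grid (ictxt), block size (nb), and local leading dimension (lld) have already been set up.

  ! Illustrative ScaLAPACK LU solve of a distributed complex dense system.
  subroutine dense_solve_sketch(n, nb, ictxt, lld, a, b, ipiv)
    implicit none
    integer, intent(in) :: n, nb, ictxt, lld
    complex(8), intent(inout) :: a(*), b(*)     ! local pieces of A and the right-hand side
    integer, intent(out) :: ipiv(*)
    integer :: desca(9), descb(9), info

    call descinit(desca, n, n, nb, nb, 0, 0, ictxt, lld, info)     ! descriptor for A
    call descinit(descb, n, 1, nb, nb, 0, 0, ictxt, lld, info)     ! descriptor for b
    call pzgesv(n, 1, a, 1, 1, desca, ipiv, b, 1, 1, descb, info)  ! O(n**3) flops, highly scalable
  end subroutine dense_solve_sketch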
Gyro • Continuum gyrokinetic simulation of fusion-plasma microturbulence • 1D data decomposition • Spectral method - high communication volume • Some need for increased resolution • More iterations
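A hedged sketch of what a 1D data decomposition implies for a spectral method: each step needs a global transpose, shown here as an MPI_Alltoall (Gyro's actual communication structure may differ), so the volume moved scales with the whole field and interconnect bandwidth sits on the critical path.

  ! Illustrative global transpose for a 1D-decomposed spectral step.
  subroutine spectral_transpose(nlocal, nprocs, work_in, work_out, comm)
    use mpi
    implicit none
    integer, intent(in) :: nlocal, nprocs, comm
    complex(8), intent(in)  :: work_in(nlocal*nprocs)
    complex(8), intent(out) :: work_out(nlocal*nprocs)
    integer :: ierr

    ! Every process exchanges a block with every other process each step.
    call MPI_Alltoall(work_in, nlocal, MPI_DOUBLE_COMPLEX, &
                      work_out, nlocal, MPI_DOUBLE_COMPLEX, comm, ierr)
  end subroutine spectral_transpose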
Locally Self-Consistent Multiple Scattering (LSMS) • Calculates electronic structure of large systems • One atom per processor • Dominated by local DGEMM • First real application to sustain a TF • But… moving to sparse formulation with a distributed solve for each atom
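A minimal sketch of the per-atom local kernel, with invented sizes: a single dense DGEMM, which does O(n**3) flops on O(n**2) data and therefore runs near peak on a strong processor.

  ! Per-atom local dense multiply of the kind that dominates LSMS (sizes invented).
  subroutine local_multiply(n, a, b, c)
    implicit none
    integer, intent(in) :: n
    real(8), intent(in)    :: a(n,n), b(n,n)
    real(8), intent(inout) :: c(n,n)

    call dgemm('N', 'N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n)   ! BLAS3: compute bound
  end subroutine local_multiply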
Dynamic Cluster Approximation (DCA-QMC) • Simulates high-temp superconductors • Dominated by DGER (BLAS2) • Memory-bandwidth limited • Quantum Monte Carlo, but… • Fixed start-up per process • Favors fewer, faster processors • Needs powerful processors to avoid parallelizing each Monte-Carlo stream
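By contrast, a sketch of the rank-1 update: DGER does only 2*m*n flops while the whole matrix streams through memory, so the kernel is limited by memory bandwidth rather than peak flops (the wrapper and sizes here are illustrative).

  ! Rank-1 update of the kind that dominates DCA-QMC (illustrative wrapper).
  subroutine rank1_update(m, n, alpha, x, y, g)
    implicit none
    integer, intent(in) :: m, n
    real(8), intent(in) :: alpha, x(m), y(n)
    real(8), intent(inout) :: g(m,n)

    call dger(m, n, alpha, x, 1, y, 1, g, m)   ! G = G + alpha * x * y**T, BLAS2: bandwidth bound
  end subroutine rank1_update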
Few DOE SC Applications • Weak-ish scaling • Dense linear algebra • But moving to sparse
Many DOE SC Applications • “Strong-ish” scaling • Limited increase in gridpoints • Major increase in expense per gridpoint • Major increase in time steps • Fewer, more-powerful processors • High memory bandwidth • High-bandwidth, low-latency communication
Why X1? • “Strong-ish” scaling • Limited increase in gridpoints • Major increase in expense per gridpoint • Major increase in time steps • Fewer, more-powerful processors • High memory bandwidth • High-bandwidth, low-latency communication
Tangent: Strongish* Scaling * Greg Lindahl, Vendor Scum • Firm • Semistrong • Unweak • Strongoidal • MSTW (More Strong Than Weak) • JTSoS (Just This Side of Strong) • WNS (Well-Nigh Strong) • Seak, Steak, Streak, Stroak, Stronk • Weag, Weng, Wong, Wrong, Twong
X1 for 100 TF Sustained? • Uh, no • OS not scalable, fault-resilient enough for 10^4 processors • That “price/performance” thing • That “power & cooling” thing
Xn for 100 TF Sustained • For DOE SC applications, YES • Most-promising candidate -or- • Least-implausible candidate
Why X, again? • Most-powerful processors • Reduce need for scalability • Obey Amdahl’s Law • High memory bandwidth • See above • Globally addressable memory • Lowest, most hide-able latency • Scale latency-bound applications • High interconnect bandwidth • Scale bandwidth-bound applications
The Bad News • Scalar performance • “Some tuning required” • Ho-hum MPI latency • See Rants
Scalar Performance • Compilation is slow • Amdahl’s Law for single processes • Parallelization -> Vectorization • Hard to port GNU tools • GCC? Are you kidding? • GCC compatibility, on the other hand… • Black Widow will be better
“Some Tuning Required” • Vectorization requires: • Independent operations • Dependence information • Mapping to vector instructions • Applications take a wide spectrum of steps to inhibit this • May need a couple of compiler directives • May need extensive rewriting
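An example of the “couple of compiler directives” case; the loop is illustrative, not from a real application, and the directive spelling varies by compiler. The compiler cannot prove the indexed update below is dependence-free, but if ix never repeats an index the programmer can assert it and the loop vectorizes.

  ! Indirect addressing: vectorizable only once the programmer waives the dependence.
  subroutine scatter_add(n, ix, a, b)
    implicit none
    integer, intent(in) :: n, ix(n)
    real(8), intent(inout) :: a(*)
    real(8), intent(in) :: b(n)
    integer :: i

  !dir$ ivdep
    do i = 1, n
      a(ix(i)) = a(ix(i)) + b(i)
    end do
  end subroutine scatter_add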
Application Results • Awesome • Indifferent • Recalcitrant • Hopeless
Awesome Results • 256-MSP X1 already showing unique capability • Apps bound by memory bandwidth, interconnect bandwidth, interconnect latency • POP, Gyro, DCA-QMC, AGILE-BOLTZTRAN, VH1, Amber, … • Many examples from DoD
Indifferent Results • Cray X1 is brute-force fast, but not cost effective • Dense linear algebra • Linpack, AORSA, LSMS
Recalcitrant Results • Inherent algorithms are fine • Source code or ongoing code mods don’t vectorize • Significant code rewriting done, ongoing, or needed • CLM, CAM, Nimrod, M3D
Aside: How to Avoid Vectorization • Use pointers to add false dependencies • Put deep call stacks inside loops • Put debug I/O operations inside compute loops • Did I mention using pointers?
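What that looks like in code, as a cautionary illustration (not from any real application): the possible pointer alias and the debug write each force the loop to run scalar.

  ! Two easy ways to keep a loop from vectorizing.
  subroutine how_not_to_vectorize(n, a, p, b)
    implicit none
    integer, intent(in) :: n
    real(8), intent(inout), target :: a(n)
    real(8), pointer :: p(:)           ! may alias a; compiler must assume a dependence
    real(8), intent(in) :: b(n)
    integer :: i

    do i = 2, n
      a(i) = p(i-1) + b(i)             ! looks like a recurrence through the alias
      write(*,*) 'debug', i            ! I/O in the loop body blocks vectorization too
    end do
  end subroutine how_not_to_vectorize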
Aside: Software Design • In general, we don’t know how to systematically design efficient, maintainable HPC software • Vectorization imposes constraints on software design • Bad: Existing software must be rewritten • Good: Resulting software often faster on modern superscalar systems • “Some tuning required” for X series • Bad: You must tune • Good: Tuning is systematic, not a Black Art • Vectorization “constraints” may help us develop effective design patterns for HPC software
Hopeless Results • Dominated by unvectorizable algorithms • Some benchmark kernels of questionable relevance • No known DOE SC applications
Summary • DOE SC scientists do need 100 TF and beyond of sustained application performance • Cray X series is the least-implausible option for scaling DOE SC applications to 100 TF of sustained performance and beyond
“Custom” Rant • “Custom vs. Commodity” is a red herring • CMOS is commodity • Memory is commodity • Wires are commodity • Cooling is independent of vector vs. scalar • PNNL liquid-cools its clusters • Vector systems may move to air-cooling • All vendors do custom packaging • Real issue: Software
MPI Rant • Latency-bound apps often limited by “MPI_Allreduce(…, MPI_SUM, …)” • Not ping pong! • An excellent abstraction that is eminently optimizable • Some apps are limited by point-to-point • Remote load/store implementations (CAF, UPC) have performance advantages over MPI • But MPI could be implemented using load/store, inlined, and optimized • On the other hand, easier to avoid pack/unpack with load/store model
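A hedged sketch of the pack/unpack point, not taken from any application above: the MPI version copies a strided edge into a buffer before sending (matching receive omitted), while the coarray version stores directly into the neighbor's ghost cells, which is exactly what globally addressable memory makes cheap.

  ! MPI version: pack a non-contiguous boundary, then send it.
  subroutine halo_mpi(nx, ny, u, right, comm)
    use mpi
    implicit none
    integer, intent(in) :: nx, ny, right, comm
    real(8), intent(inout) :: u(0:nx+1, ny)
    real(8) :: buf(ny)
    integer :: ierr

    buf = u(nx, :)                                           ! pack the strided edge
    call MPI_Send(buf, ny, MPI_DOUBLE_PRECISION, right, 0, comm, ierr)
  end subroutine halo_mpi

  ! Coarray version: direct remote store, no packing, but still two-sided in spirit.
  subroutine halo_caf(nx, ny, u, right)
    implicit none
    integer, intent(in) :: nx, ny, right
    real(8), intent(inout) :: u(0:nx+1, ny)[*]               ! globally addressable
    sync all                                                 ! neighbor is ready (the hidden "receive")
    u(0, :)[right] = u(nx, :)                                ! put straight into ghost cells
    sync all                                                 ! make the put visible
  end subroutine halo_caf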
Co-Array-Fortran Rant • No such thing as one-sided communication • It's all two-sided: send+receive, sync+put+sync, sync+get+sync • Same parallel algorithms • CAF mods can be highly nonlocal • Adding CAF in a subroutine can have implications for the argument types, and thus for the callers, the callers' callers, etc. • Rarely the case for MPI • We use CAF to avoid MPI-implementation performance inadequacies • Avoiding nonlocality by cheating with Cray pointers
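The nonlocality point in code form, with made-up names: once update_edge wants to do the remote store itself, its dummy argument must be a coarray, so every caller, and every caller of those callers, must declare its actual argument as a coarray too. An MPI version of the same routine would take a plain array and a communicator and leave its callers untouched.

  module edge_mod
    implicit none
  contains
    subroutine update_edge(n, u, right)
      integer, intent(in) :: n, right
      real(8), intent(inout) :: u(n, n)[*]   ! coarray dummy forces coarray actual arguments
      sync all                               ! the "one-sided" put still needs two-sided synchronization
      u(:, 1)[right] = u(:, n)
      sync all
    end subroutine update_edge
  end module edge_mod

  program caller
    use edge_mod
    implicit none
    real(8), allocatable :: u(:,:)[:]        ! the caller's declaration must change as well
    integer :: right

    allocate(u(64, 64)[*])
    u = 0.0d0
    right = merge(this_image()+1, 1, this_image() < num_images())
    call update_edge(64, u, right)
  end program caller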
Cray Rant • Cray XD1 (OctigaBay) follows in tradition of T3E • Very promising architecture • Dumb name • Interesting competitor with Red Storm
Questions? James B. White III (Trey) trey@ornl.gov http://www.csm.ornl.gov/evaluation/PHOENIX/