Overview: the capability-computing requirements of DOE Office of Science applications in climate, fusion, materials, biology, chemistry, and astrophysics; why the application scientists themselves are asking for capability, and why they might still resist; sample applications including the Parallel Ocean Program (POP), the Community Atmosphere Model (CAM), the All-Orders Spectral Algorithm (AORSA), Gyro, Locally Self-Consistent Multiple Scattering (LSMS), and Dynamic Cluster Approximation-Quantum Monte Carlo (DCA-QMC); and why the X1 itself is not the path to 100 TF sustained, while a follow-on Xn is the least-implausible option.
100 TF Sustained on Cray X Series • SOS 8, April 13, 2004 • James B. White III (Trey) • trey@ornl.gov
Disclaimer • The opinions expressed here do not necessarily represent those of the CCS, ORNL, DOE, the Executive Branch of the Federal Government of the United States of America, or even UT-Battelle.
Disclaimer (cont.) • Graph-free, chart-free environment • For graphs and charts: http://www.csm.ornl.gov/evaluation/PHOENIX/
100 Real TF on Cray Xn • Who needs capability computing? • Application requirements • Why Xn? • Laundry, Clean and Otherwise • Rants • Custom vs. Commodity • MPI • CAF • Cray
Who needs capability computing? • OMB? • Politicians? • Vendors? • Center directors? • Computer scientists?
Who needs capability computing? • Application scientists • According to scientists themselves
Personal Communications • Fusion • General Atomics, Iowa, ORNL, PPPL, Wisconsin • Climate • LANL, NCAR, ORNL, PNNL • Materials • Cincinnati, Florida, NC State, ORNL, Sandia, Wisconsin • Biology • NCI, ORNL, PNNL • Chemistry • Auburn, LANL, ORNL, PNNL • Astrophysics • Arizona, Chicago, NC State, ORNL, Tennessee
Scientists Need Capability • Climate scientists need simulation fidelity to support policy decisions • All we can say now is that humans cause warming • Fusion scientists need to simulate fusion devices • All we can do now is model decoupled subprocesses at disparate time scales • Materials scientists need to design new materials • Just starting to reproduce known materials
Scientists Need Capability • Biologists need to simulate proteins and protein pathways • Baby steps with smaller molecules • Chemists need similar increases in complexity • Astrophysicists need to simulate nucleosynthesis (high-res, 3D CFD, 6D neutrinos, long times) • Low-res, 3D CFD, approximate 3D neutrinos, short times
Why Scientists Might Resist • Capacity also needed • Software isn’t ready • Coerced to run capability-sized jobs on inappropriate systems
Capability Requirements • Sample DOE SC applications • Climate: POP, CAM • Fusion: AORSA, Gyro • Materials: LSMS, DCA-QMC
Parallel Ocean Program (POP) • Baroclinic • 3D, nearest neighbor, scalable • Memory-bandwidth limited • Barotropic • 2D implicit system, latency bound • Ocean-only simulation • Higher resolution • Faster time steps • As ocean component for CCSM • Atmosphere dominates
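A minimal sketch of why the barotropic solve is latency bound (hypothetical code, not taken from POP): each iteration of the 2D implicit solve needs global dot products, so every step waits on a tiny MPI_Allreduce no matter how little local work each processor has.

    function global_dot(x, y, n, comm) result(s)
      use mpi
      implicit none
      integer, intent(in) :: n, comm
      real(8), intent(in) :: x(n), y(n)
      real(8) :: s, local
      integer :: ierr
      local = dot_product(x, y)                  ! cheap local work
      call mpi_allreduce(local, s, 1, mpi_double_precision, &
                         mpi_sum, comm, ierr)    ! 8-byte global sum: pure latency
    end function global_dot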
Community Atmosphere Model (CAM) • Atmosphere component for CCSM • Higher resolution? • Physics changes, parameterization must be retuned, model must be revalidated • Major effort, rare event • Spectral transform not dominant • Dramatic increases in computation per grid point • Dynamic vegetation, carbon cycle, atmospheric chemistry, … • Faster time steps
All-Orders Spectral Algorithm (AORSA) • Radio-frequency fusion-plasma simulation • Highly scalable • Dominated by ScaLAPACK • Still in weak-scaling regime • But… • Expanded physics reducing ScaLAPACK dominance • Developing sparse formulation
Gyro • Continuum gyrokinetic simulation of fusion-plasma microturbulence • 1D data decomposition • Spectral method - high communication volume • Some need for increased resolution • More iterations
Locally Self-Consistent Multiple Scattering (LSMS) • Calculates electronic structure of large systems • One atom per processor • Dominated by local DGEMM • First real application to sustain a TF • But… moving to sparse formulation with a distributed solve for each atom
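As a hedged illustration of "dominated by local DGEMM" (hypothetical shapes, not LSMS source): each processor spends most of its time in a dense matrix-matrix multiply for its own atom, which does O(n^3) flops on O(n^2) data and so runs near peak on vector and cache-based processors alike.

    subroutine atom_multiply(a, b, c, n)
      implicit none
      integer, intent(in) :: n
      real(8), intent(in) :: a(n,n), b(n,n)
      real(8), intent(inout) :: c(n,n)
      external dgemm
      ! C = A*B + C via BLAS3: high flops-per-byte, compute bound
      call dgemm('N', 'N', n, n, n, 1.0d0, a, n, b, n, 1.0d0, c, n)
    end subroutine atom_multiply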
Dynamic Cluster Approximation (DCA-QMC) • Simulates high-temp superconductors • Dominated by DGER (BLAS2) • Memory-bandwidth limited • Quantum Monte Carlo, but… • Fixed start-up per process • Favors fewer, faster processors • Needs powerful processors to avoid parallelizing each Monte-Carlo stream
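A corresponding sketch of the DGER bottleneck (hypothetical shapes, not DCA-QMC source): a rank-1 update is BLAS2, doing only about 2mn flops while streaming roughly mn matrix elements, so memory bandwidth rather than peak flops sets the pace, exactly the regime where the X1's bandwidth pays off.

    subroutine rank1_update(g, x, y, m, n)
      implicit none
      integer, intent(in) :: m, n
      real(8), intent(in) :: x(m), y(n)
      real(8), intent(inout) :: g(m,n)
      external dger
      ! G = G + x*y^T via BLAS2: ~2 flops per matrix element touched
      call dger(m, n, 1.0d0, x, 1, y, 1, g, m)
    end subroutine rank1_update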
Few DOE SC Applications • Weak-ish scaling • Dense linear algebra • But moving to sparse
Many DOE SC Applications • “Strong-ish” scaling • Limited increase in gridpoints • Major increase in expense per gridpoint • Major increase in time steps • Fewer, more-powerful processors • High memory bandwidth • High-bandwidth, low-latency communication
Why X1? • “Strong-ish” scaling • Limited increase in gridpoints • Major increase in expense per gridpoint • Major increase in time steps • Fewer, more-powerful processors • High memory bandwidth • High-bandwidth, low-latency communication
Tangent: Strongish* Scaling * Greg Lindahl, Vendor Scum • Firm • Semistrong • Unweak • Strongoidal • MSTW (More Strong Than Weak) • JTSoS (Just This Side of Strong) • WNS (Well-Nigh Strong) • Seak, Steak, Streak, Stroak, Stronk • Weag, Weng, Wong, Wrong, Twong
X1 for 100 TF Sustained? • Uh, no • OS not scalable or fault-resilient enough for 10^4 processors • That “price/performance” thing • That “power & cooling” thing
Xn for 100 TF Sustained • For DOE SC applications, YES • Most-promising candidate -or- • Least-implausible candidate
Why X, again? • Most-powerful processors • Reduce need for scalability • Obey Amdahl’s Law • High memory bandwidth • See above • Globally addressable memory • Lowest, most hide-able latency • Scale latency-bound applications • High interconnect bandwidth • Scale bandwidth-bound applications
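A back-of-envelope illustration of the Amdahl's Law bullet (hypothetical numbers, not from the talk): with parallelizable fraction f, the speedup on N processors is

    S(N) = 1 / ((1 - f) + f/N)

For f = 0.99 the speedup can never exceed 100, and even N = 10^4 yields only S ≈ 99, so piling on weaker processors quickly stops helping; making each processor faster lifts the whole curve, which is the case for fewer, more-powerful processors.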
The Bad News • Scalar performance • “Some tuning required” • Ho-hum MPI latency • See Rants
Scalar Performance • Compilation is slow • Amdahl’s Law for single processes • Parallelization -> Vectorization • Hard to port GNU tools • GCC? Are you kidding? • GCC compatibility, on the other hand… • Black Widow will be better
“Some Tuning Required” • Vectorization requires: • Independent operations • Dependence information • Mapping to vector instructions • Applications manage to inhibit these in a wide variety of ways • May need only a couple of compiler directives (see the sketch below) • May need extensive rewriting
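A minimal sketch of the "couple of compiler directives" case (hypothetical loop, not from any application named here): the compiler cannot prove the index values ix(i) are distinct, so it assumes a dependence; a Cray-style IVDEP directive supplies the missing dependence information and lets the loop vectorize.

    subroutine scatter_add(a, b, ix, n)
      implicit none
      integer, intent(in) :: n, ix(n)
      real(8), intent(inout) :: a(*)
      real(8), intent(in) :: b(n)
      integer :: i
      !dir$ ivdep                      ! asserts the ix(i) are all distinct
      do i = 1, n
         a(ix(i)) = a(ix(i)) + b(i)
      end do
    end subroutine scatter_add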
Application Results • Awesome • Indifferent • Recalcitrant • Hopeless
Awesome Results • 256-MSP X1 already showing unique capability • Apps bound by memory bandwidth, interconnect bandwidth, interconnect latency • POP, Gyro, DCA-QMC, AGILE-BOLTZTRAN, VH1, Amber, … • Many examples from DoD
Indifferent Results • Cray X1 is brute-force fast, but not cost effective • Dense linear algebra • Linpack, AORSA, LSMS
Recalcitrant Results • Inherent algorithms are fine • Source code or ongoing code mods don’t vectorize • Significant code rewriting done, ongoing, or needed • CLM, CAM, Nimrod, M3D
Aside: How to Avoid Vectorization • Use pointers to add false dependencies • Put deep call stacks inside loops • Put debug I/O operations inside compute loops • Did I mention using pointers?
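A hedged sketch of the anti-pattern (hypothetical code, not from any real application): pointer targets force the compiler to assume aliasing, and the debug write inside the loop blocks vectorization outright.

    subroutine bad_axpy(x, y, n)
      implicit none
      integer, intent(in) :: n
      real(8), target, intent(inout) :: x(n)
      real(8), target, intent(in) :: y(n)
      real(8), pointer :: p(:), q(:)
      integer :: i
      p => x
      q => y
      do i = 1, n
         if (mod(i, 10000) == 0) write (*,*) 'debug: i =', i   ! I/O in the compute loop
         p(i) = p(i) + 2.0d0 * q(i)    ! compiler must assume p and q may overlap
      end do
    end subroutine bad_axpy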
Aside: Software Design • In general, we don’t know how to systematically design efficient, maintainable HPC software • Vectorization imposes constraints on software design • Bad: Existing software must be rewritten • Good: Resulting software often faster on modern superscalar systems • “Some tuning required” for X series • Bad: You must tune • Good: Tuning is systematic, not a Black Art • Vectorization “constraints” may help us develop effective design patterns for HPC software
Hopeless Results • Dominated by unvectorizable algorithms • Some benchmark kernels of questionable relevance • No known DOE SC applications
Summary • DOE SC scientists do need 100 TF and beyond of sustained application performance • Cray X series is the least-implausible option for scaling DOE SC applications to 100 TF of sustained performance and beyond
“Custom” Rant • “Custom vs. Commodity” is a red herring • CMOS is commodity • Memory is commodity • Wires are commodity • Cooling is independent of vector vs. scalar • PNNL liquid-cools its clusters • Vector systems may move to air cooling • All vendors do custom packaging • Real issue: Software
MPI Rant • Latency-bound apps often limited by “MPI_Allreduce(…, MPI_SUM, …)” • Not ping pong! • An excellent abstraction that is eminently optimizable • Some apps are limited by point-to-point • Remote load/store implementations (CAF, UPC) have performance advantages over MPI • But MPI could be implemented using load/store, inlined, and optimized • On the other hand, easier to avoid pack/unpack with a load/store model (see the sketch below)
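A hedged sketch of the pack/unpack point (hypothetical halo exchange; Fortran 2008 coarray syntax rather than the 2004 Cray dialect): with MPI the non-contiguous column is typically copied into a contiguous buffer on both ends, while with a load/store model the strided assignment itself is the data motion. Synchronization is deliberately omitted here; it is the subject of the next slide.

    subroutine send_column_mpi(u, nx, ny, dest, comm)
      use mpi
      implicit none
      integer, intent(in) :: nx, ny, dest, comm
      real(8), intent(in) :: u(nx, ny)
      real(8) :: buf(ny)
      integer :: ierr
      buf = u(1, 1:ny)                     ! pack the strided column
      call mpi_send(buf, ny, mpi_double_precision, dest, 0, comm, ierr)
      ! ...and the receiver does the mirror-image recv + unpack
    end subroutine send_column_mpi

    subroutine put_column_caf(u, halo, nx, ny, dest_image)
      implicit none
      integer, intent(in) :: nx, ny, dest_image
      real(8), intent(in) :: u(nx, ny)
      real(8), intent(inout) :: halo(ny)[*]
      halo(1:ny)[dest_image] = u(1, 1:ny)  ! the strided copy *is* the communication
    end subroutine put_column_caf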
Co-Array Fortran Rant • No such thing as one-sided communication • It’s all two sided: send+receive, sync+put+sync, sync+get+sync (see the sketch below) • Same parallel algorithms • CAF mods can be highly nonlocal • Adding CAF in a subroutine can have implications for the argument types, and thus for the callers, the callers’ callers, etc. • Rarely the case for MPI • We use CAF to avoid MPI-implementation performance inadequacies • Avoiding nonlocality by cheating with Cray pointers
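A minimal sketch of the "it's all two sided" point (hypothetical, Fortran 2008 coarray syntax): the put is one statement, but correctness still needs a synchronization before it (the target must be done with its old halo) and after it (the target must know the new halo has arrived), which is the same algorithm as a send plus a receive.

    subroutine exchange_halo(edge, halo, ny, neighbor)
      implicit none
      integer, intent(in) :: ny, neighbor
      real(8), intent(in) :: edge(ny)
      real(8), intent(inout) :: halo(ny)[*]
      sync all                        ! neighbor has finished with its previous halo
      halo(1:ny)[neighbor] = edge     ! the "one-sided" put
      sync all                        ! neighbor now knows the new data has arrived
    end subroutine exchange_halo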
Cray Rant • Cray XD1 (OctigaBay) follows in tradition of T3E • Very promising architecture • Dumb name • Interesting competitor with Red Storm
Questions? James B. White III (Trey) trey@ornl.gov http://www.csm.ornl.gov/evaluation/PHOENIX/