
Presentation Transcript


  1. The Next Four Orders of Magnitude in Parallel PDE Simulation Performance http://www.math.odu.edu/~keyes/talks.html David E. Keyes Department of Mathematics & Statistics, Old Dominion University Institute for Scientific Computing Research, Lawrence Livermore National Laboratory Institute for Computer Applications in Science & Engineering, NASA Langley Research Center

  2. Background of this Presentation • Originally prepared for Petaflops II Conference • History of the Petaflops Initiative in the USA • Enabling Technologies for Petaflops Computing, Feb 1994 • book by Sterling, Messina, and Smith, MIT Press, 1995 • Applications Workshop, Aug 1994 • Architectures Workshop, Apr 1995 • Systems Software Workshop, Jun 1996 • Algorithms Workshop, Apr 1997 • Systems Operations Workshop Review, Jun 1998 • Enabling Technologies for Petaflops II, Feb 1999 • Topics in Ultra-scale Computing, book by Sterling et al., MIT Press, 2001 (to appear)

  3. Weighing in at the Bottom Line • Characterization of a 1 Teraflop/s computer of today • about 1,000 processors of 1 Gflop/s (peak) each • due to inefficiencies within the processors, more practically characterized as about 4,000 processors of 250 Mflop/s each • How do we want to get to 1 Petaflop/s by 2007 (original goal)? • 1,000,000 processors of 1 Gflop/s each (only wider)? • 10,000 processors of 100 Gflop/s each (mainly deeper)? • From the point of view of PDE simulations on quasi-static Eulerian grids • Either! • Caveat: dynamic grid simulations are not covered in this talk • but see work at Bonn, Erlangen, Heidelberg, LLNL, and ODU presented elsewhere

  4. Perspective • Many “Grand Challenges” in computational science are formulated as PDEs (possibly among alternative formulations) • However, PDE simulations historically have not performed as well as other scientific simulations • PDE simulations require a balance among architectural components that is not necessarily met in a machine designed to “max out” on the standard LINPACK benchmark • The justification for building petaflop/s architectures undoubtedly will (and should) include PDE applications • However, cost-effective use of petaflop/s on PDEs requires further attention to architectural and algorithmic matters • Memory-centric view of computation needs further promotion

  5. Application Performance History: “3 orders of magnitude in 10 years” – better than Moore’s Law

  6. Bell Prize Performance History

  7. Plan of Presentation • General characterization of PDE requirements • Four sources of performance improvement to get from current 100's of Gflop/s (for PDEs) to 1 Pflop/s • Each illustrated with examples from computational aerodynamics • offered as typical of real workloads (nonlinear, unstructured, multicomponent, multiscale, etc.) • Performance presented • on up to thousands of processors of T3E and ASCI Red (for parallel aspects) • on numerous uniprocessors (for memory hierarchy aspects)

  8. Purpose of Presentation • Not to argue for specific algorithms/programming models/codes in any detail • but see talks on Newton-Krylov-Schwarz (NKS) methods under homepage • Provide a requirements target for designers of today's systems • typical of several contemporary successfully parallelized PDE applications • not comprehensive of all important large-scale applications • Speculate on requirements target for designers of tomorrow's systems • Promote attention to current architectural weaknesses, relative to requirements of PDEs

  9. Four Sources of Performance Improvement • Expanded number of processors • arbitrarily large factor, through extremely careful attention to load balancing and synchronization • More efficient use of processor cycles, and faster processor/memory elements • one to two orders of magnitude, through memory-assist language features, processors-in-memory, and multithreading • Algorithmic variants that are more architecture-friendly • approximately an order of magnitude, through improved locality and relaxed synchronization • Algorithms that deliver more “science per flop” • possibly large problem-dependent factor, through adaptivity • This last does not contribute to raw flop/s!

  10. PDE Varieties and Complexities • Varieties of PDEs • evolution (time hyperbolic, time parabolic) • equilibrium (elliptic, spatially hyperbolic or parabolic) • mixed, varying by region • mixed, of multiple type (e.g., parabolic with elliptic constraint) • Complexity parameterized by: • spatial grid points, Nx • temporal grid points, Nt • components per point, Nc • auxiliary storage per point, Na • grid points in stencil, Ns • Memory: M ∝ Nx (Nc + Na + Nc^2 Ns) • Work: W ∝ Nx Nt (Nc + Na + Nc^2 Ns)
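As an editorial illustration of these complexity formulas, here is a minimal back-of-envelope estimator; the parameter values in the usage example are hypothetical, not from the talk:

```python
# Back-of-envelope estimator for M ~ Nx*(Nc + Na + Nc^2*Ns) and
# W ~ Nx*Nt*(Nc + Na + Nc^2*Ns), following slide 10.

def pde_memory(nx, nc, na, ns):
    """Words of storage: state + auxiliaries + Jacobian stencil blocks per point."""
    return nx * (nc + na + nc * nc * ns)

def pde_work(nx, nt, nc, na, ns):
    """Flops up to a constant: memory-sized work done once per step."""
    return nt * pde_memory(nx, nc, na, ns)

# Hypothetical example: 100^3 grid, 5 unknowns/point (3D compressible flow),
# 2 auxiliary words/point, 7-point stencil, 1000 steps.
nx, nt, nc, na, ns = 100**3, 1000, 5, 2, 7
print(f"memory ~ {pde_memory(nx, nc, na, ns):.2e} words")
print(f"work   ~ {pde_work(nx, nt, nc, na, ns):.2e} flops (up to a constant)")
```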

  11. Resource Scaling for PDEs • For 3D problems, (Memory) ∝ (Work)^(3/4) • for equilibrium problems, work scales with problem size × no. of iteration steps – for “reasonable” implicit methods, proportional to resolution in a single spatial dimension • for evolutionary problems, work scales with problem size × no. of time steps – CFL-type arguments place the latter on the order of resolution in a single spatial dimension • Proportionality constant can be adjusted over a very wide range both by discretization and by algorithmic tuning • If frequent time frames are to be captured, disk capacity and I/O rates must both scale linearly with work
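The 3/4 power follows in one line from the bullets above (a reconstruction of the implied argument, with the step count Nt on the order of Nx^(1/3)):

```latex
M \propto N_x, \qquad
W \propto N_x N_t \propto N_x \cdot N_x^{1/3} = N_x^{4/3}
\quad\Longrightarrow\quad M \propto W^{3/4}
```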

  12. Typical PDE Tasks • Vertex-based loops • state vector and auxiliary vector updates • Edge-based “stencil op” loops • residual evaluation, approximate Jacobian evaluation • Jacobian-vector product (often replaced with matrix-free form, involving residual evaluation) • intergrid transfer (coarse/fine) • Sparse, narrow-band recurrences • approximate factorization and back substitution • smoothing • Vector inner products and norms • orthogonalization/conjugation • convergence progress and stability checks

  13. Edge-based Loop • Vertex-centered tetrahedral grid • Traverse by edges • load vertex values • compute intensively • store contributions to flux at vertices • Each vertex appears in approximately 15 flux computations
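A minimal sketch of such an edge-based traversal (illustrative only; the array names and the toy flux function are invented, not the code behind these results):

```python
import numpy as np

# Edge-based "stencil op" loop on an unstructured grid: edges[k] = (i, j) are
# the two endpoint vertices of edge k; u holds the state value at each vertex.
def edge_based_residual(edges, u, flux):
    res = np.zeros_like(u)
    for i, j in edges:        # traverse by edges
        f = flux(u[i], u[j])  # load vertex values, compute intensively
        res[i] += f           # store contributions to flux at vertices
        res[j] -= f
    return res

# Hypothetical 4-vertex chain with a simple difference "flux":
u = np.array([0.0, 1.0, 3.0, 6.0])
print(edge_based_residual([(0, 1), (1, 2), (2, 3)], u, lambda a, b: b - a))
```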

  14. Explicit PDE Solvers • Concurrency is pointwise, O(N) • Comm.-to-Comp. ratio is surface-to-volume, O((N/P)^(-1/3)) • Communication range is nearest-neighbor, except for time-step computation • Synchronization frequency is once per step, O((N/P)^(-1)) • Storage per point is low • Load balance is straightforward for static quasi-uniform grids • Grid adaptivity (together with temporal stability limitation) makes load balance nontrivial

  15. Domain-decomposed Implicit PDE Solvers • Concurrency is pointwise, O(N), or subdomainwise, O(P) • Comm.-to-Comp. ratio still mainly surface-to-volume, O((N/P)^(-1/3)) • Communication still mainly nearest-neighbor, but nonlocal communication arises from conjugation, norms, coarse grid problems • Synchronization frequency often more than once per grid-sweep, up to Krylov dimension, O(K(N/P)^(-1)) • Storage per point is higher, by a factor of O(K) • Load balance issues the same as for explicit

  16. Source #1: Expanded Number of Processors • Amdahl's law can be defeated if serial sections make up a nonincreasing fraction of total work as problem size and processor count scale up together – true for most explicit or iterative implicit PDE solvers • popularized in 1986 Karp Prize paper by Gustafson, et al. • Simple, back-of-envelope parallel complexity analyses show that processors can be increased as fast, or almost as fast, as problem size, assuming load is perfectly balanced • Caveat: the processor network must also be scalable (applies to protocols as well as to hardware) • Remaining four orders of magnitude could be met by hardware expansion (but this does not mean that fixed-size applications of today would run 104 times faster)
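A back-of-envelope sketch of the scaled-speedup argument (Gustafson's law; the decay rate of the serial fraction below is hypothetical):

```python
# Gustafson: if s is the serial fraction of the *scaled* workload, then
# speedup on P processors is S(P) = s + (1 - s) * P.
def scaled_speedup(P, s):
    return s + (1.0 - s) * P

# If the serial share shrinks as problem size and P grow together (as for
# most explicit or iterative implicit PDE solvers), efficiency holds up:
for P in (10, 100, 1000, 10000):
    s = 0.01 / P ** 0.5                           # hypothetical decaying serial fraction
    print(P, round(scaled_speedup(P, s) / P, 4))  # parallel efficiency stays near 1
```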

  17. Back-of-Envelope Scalability Demonstration for Bulk-synchronized PDE Computations • Given complexity estimates of the leading terms of: • the concurrent computation (per iteration phase) • the concurrent communication • the synchronization frequency • And a model of the architecture including: • internode communication (network topology and protocol reflecting horizontal memory structure) • on-node computation (effective performance parameters including vertical memory structure) • One can estimate optimal concurrency and execution time • on per-iteration basis, or overall (by taking into account any granularity-dependent convergence rate) • simply differentiate time estimate in terms of (N,P) with respect to P, equate to zero and solve for P in terms of N

  18. 3D Stencil Costs (per Iteration) • grid points in each direction n, total work N = O(n^3) • processors in each direction p, total procs P = O(p^3) • memory per node requirements O(N/P) • execution time per iteration A n^3/p^3 • grid points on side of each processor subdomain n/p • neighbor communication per iteration B n^2/p^2 • cost of global reductions in each iteration C log p or C p^(1/d) • C includes synchronization frequency • same dimensionless units for measuring A, B, C • e.g., cost of scalar floating-point multiply-add

  19. 3D Stencil Computation Illustration: Rich local network, tree-based global reductions • total wall-clock time per iteration: T(n, p) = A n^3/p^3 + B n^2/p^2 + C log p • for optimal p, ∂T/∂p = 0: -3A n^3/p^4 - 2B n^2/p^3 + C/p = 0, or 3A n^3/p^3 + 2B n^2/p^2 = C, or (with ρ ≡ p/n) 3A/ρ^3 + 2B/ρ^2 = C • without “speeddown,” p can grow with n • in the limit as n → ∞, the optimal p grows linearly with n

  20. 3D Stencil Computation Illustration: Rich local network, tree-based global reductions • optimal running time: T(n, p_opt) = A/ρ^3 + B/ρ^2 + C log(ρn), where ρ = p_opt/n satisfies 3A/ρ^3 + 2B/ρ^2 = C • limit of infinite neighbor bandwidth, zero neighbor latency (B → 0): ρ = (3A/C)^(1/3) and T = C (1/3 + log(ρn)) (This analysis is on a per-iteration basis; a fuller analysis would multiply this cost by an iteration count estimate that generally depends on n and p.)
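A quick numeric check of the model on slides 18-20 (a sketch; the constants A, B, C below are made up, not measured values):

```python
import math

# T(n, p) = A n^3/p^3 (compute) + B n^2/p^2 (neighbor comm) + C log p (reductions)
def T(n, p, A=1.0, B=10.0, C=50.0):
    return A * n**3 / p**3 + B * n**2 / p**2 + C * math.log(p)

def p_opt(n):
    return min(range(1, 8 * n), key=lambda p: T(n, p))  # brute-force minimizer

# The ratio p_opt/n flattens toward the constant rho solving
# 3A/rho^3 + 2B/rho^2 = C, i.e., optimal p grows linearly with n:
for n in (64, 128, 256, 512):
    p = p_opt(n)
    print(n, p, round(p / n, 3))
```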

  21. Summary for Various Networks • With tree-based (logarithmic) global reductions and scalable nearest neighbor hardware: • optimal number of processors scales linearly with problem size • With 3D torus-based global reductions and scalable nearest neighbor hardware: • optimal number of processors scales as three-fourths power of problem size (almost “scalable”) • With common network bus (heavy contention): • optimal number of processors scales as one-fourth power of problem size (not “scalable”) • bad news for conventional Beowulf clusters, but see 2000 Bell Prize “price-performance awards”

  22. 1999 Bell Prize Parallel Scaling Results on ASCI Red: ONERA M6 Wing Test Case, tetrahedral grid of 2.8 million vertices (about 11 million unknowns) on up to 3072 ASCI Red nodes (Pentium Pro 333 MHz processors)

  23. Surface Visualization of Test Domain for Computing Flow over an ONERA M6 Wing

  24. Transonic “Lambda” Shock Solution

  25. Fixed-size Parallel Scaling Results (Flop/s)

  26. Fixed-size Parallel Scaling Results (Time in seconds)

  27. Algorithm: Newton-Krylov-Schwarz • Newton – nonlinear solver, asymptotically quadratic • Krylov – accelerator, spectrally adaptive • Schwarz – preconditioner, parallelizable
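A self-contained sketch of the NKS composition on a toy 1D nonlinear problem (schematic only: dense NumPy algebra and SciPy's GMRES stand in for the production unstructured-grid solver, and globalization such as line search is elided):

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, gmres

# Toy problem: F(u) = A u + u^3 - b, with A a diagonally shifted 1D Laplacian
# (kept well-conditioned so the toy converges in a few steps).
n = 64
A = 3 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.ones(n)
F = lambda u: A @ u + u**3 - b
Jac = lambda u: A + np.diag(3 * u**2)          # Jacobian of F

# Schwarz preconditioner: sum of overlapping block (subdomain) solves.
def schwarz(J, nblocks=4, overlap=2):
    size = n // nblocks
    blocks = []
    for k in range(nblocks):
        lo = max(0, k * size - overlap)
        hi = min(n, (k + 1) * size + overlap)
        blocks.append((slice(lo, hi), np.linalg.inv(J[lo:hi, lo:hi])))
    def apply(v):
        z = np.zeros_like(v)
        for sl, Jinv in blocks:                # local solves: parallelizable
            z[sl] += Jinv @ v[sl]
        return z
    return LinearOperator((n, n), matvec=apply)

u = np.zeros(n)
for it in range(20):                           # Newton: asymptotically quadratic
    r = F(u)
    if np.linalg.norm(r) < 1e-10:
        break
    J = Jac(u)
    du, _ = gmres(J, -r, M=schwarz(J))         # Krylov accelerator, Schwarz precond.
    u = u + du
print(f"{it} Newton steps, ||F(u)|| = {np.linalg.norm(F(u)):.2e}")
```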

  28. Fixed-size Scaling Results for W-cycle (Time in seconds, courtesy of D. Mavriplis) ASCI runs: for grid of 3.1M vertices; T3E runs: for grid of 24.7M vertices

  29. Source #2: More Efficient Use of Faster Processors • Current low efficiencies of sparse codes can be improved if regularity of reference is exploited with memory-assist features • PDEs have periodic workingset structure that permits effective use of prefetch/dispatch directives, and lots of slackness • Combined with “processors-in-memory” (PIM) technology for gather/scatter into densely used block transfers and multithreading, PDEs can approach full utilization of processor cycles • Caveat: high bandwidth is critical, since PDE algorithms do only O(N) work for O(N) gridpoints worth of loads and stores • One to two orders of magnitude can be gained by catching up to the clock, and by following the clock into the few-GHz range

  30. Following the Clock • 1999 Predictions from the Semiconductor Industry Association: http://public.itrs.net/files/1999_SIA_Roadmap/Home.htm • A factor of 2-3 can be expected by 2007 by following the clock alone

  31. Example of Multithreading • Same ONERA M6 wing Euler code simulation on ASCI Red • ASCI Red contains two processors per node, sharing memory • Can use second processor in either message-passing mode with its own subdomain, or in multithreaded shared memory mode, which does not require the number of subdomain partitions to double • Latter is much more effective in flux evaluation phase, as shown by cumulative execution time (here, memory bandwidth is not an issue)

  32. PDE Workingsets • Smallest: data for single stencil: Ns (Nc^2 + Nc + Na) (sharp) • Largest: data for entire subdomain: (Nx/P) × Ns (Nc^2 + Nc + Na) (sharp) • Intermediate: data for a neighborhood collection of stencils, reused as possible

  33. Cache Traffic for PDEs • As successive workingsets “drop” into a level of memory, capacity (and, with effort, conflict) misses disappear, leaving only compulsory misses and reducing demand on main memory bandwidth

  34. Strategies Based on Workingset Structure • No performance value in memory levels larger than subdomain • Little performance value in memory levels smaller than subdomain but larger than required to permit full reuse of most data within each subdomain subtraversal • After providing L1 large enough for the smallest workingset (and multiple independent copies to accommodate the desired level of multithreading), all additional resources should be invested in a large L2 • Tables describing grid connectivity are built (after each grid rebalancing) and stored in PIM – used to pack/unpack dense-use cache lines during subdomain traversal
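A sketch of what such a connectivity/gather table might look like in software (hypothetical; actual PIM support would do the packing inside the memory system):

```python
import numpy as np

# After each grid rebalancing, record the vertex indices a subdomain traversal
# touches, in visit order; np.take then packs them into one dense stream,
# standing in for PIM-side gather into densely used cache lines.
def build_gather_table(edges):
    order, seen = [], set()
    for edge in edges:
        for v in edge:
            if v not in seen:
                seen.add(v)
                order.append(v)
    return np.array(order)

u = np.random.rand(10)                      # vertex data, scattered in memory
table = build_gather_table([(0, 7), (7, 3), (3, 9)])
packed = np.take(u, table)                  # dense-use block for the traversal
```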

  35. Costs of Greater Per-processor Efficiency • Programming complexity of managing subdomain traversal • Space to store gather/scatter tables in PIM • Time to (re)build gather/scatter tables in PIM • Memory bandwidth commensurate with peak rates of all processors

  36. Source #3: More “Architecture Friendly” Algorithms • Algorithmic practice needs to catch up to architectural demands • several “one-time” gains remain to be contributed that could improve data locality or reduce synchronization frequency, while maintaining required concurrency and slackness • “One-time” refers to improvements by small constant factors, nothing that scales in N or P – complexities are already near information-theoretic lower bounds, and we reject increases in flop rates that derive from less efficient algorithms • Caveat: remaining algorithmic performance improvements may cost extra space or may bank on stability shortcuts that occasionally backfire, making performance modeling less predictable • Perhaps an order of magnitude of performance remains here

  37. Raw Performance Improvement from Algorithms • Spatial reorderings that improve locality • interlacing of all related grid-based data structures • ordering gridpoints and grid edges for L1/L2 reuse • Discretizations that improve locality • higher-order methods (lead to larger denser blocks at each point than lower-order methods) • vertex-centering (for same tetrahedral grid, leads to denser blockrows than cell-centering) • Temporal reorderings that improve locality • block vector algorithms (reuse cached matrix blocks; vectors in block are independent) • multi-step vector algorithms (reuse cached vector blocks; vectors have sequential dependence)
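To make the interlacing bullet concrete (an invented micro-example, not the paper's data structures):

```python
import numpy as np

nx, nc = 100_000, 5   # gridpoints, components per point (hypothetical sizes)

# Separate component arrays: a stencil visit to point i gathers from nc
# distinct memory streams, each occupying its own set of cache lines.
rho, rhou, rhov, rhow, energy = (np.zeros(nx) for _ in range(nc))

# Interlaced layout: all nc components of point i are contiguous, so one
# cache-line fetch serves the whole state of the point.
q = np.zeros((nx, nc))
state_of_point_i = q[42]   # a single contiguous block of nc values
```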

  38. Raw Performance Improvement from Algorithms • Temporal reorderings that reduce synchronization penalty • less stable algorithmic choices that reduce synchronization frequency (deferred orthogonalization, speculative step selection) • less global methods that reduce synchronization range by replacing a tightly coupled global process (e.g., Newton) with loosely coupled sets of tightly coupled local processes (e.g., Schwarz) • Precision reductions that make bandwidth seem larger • lower precision representation of preconditioner matrix coefficients or poorly known coefficients (arithmetic is still performed on full precision extensions)

  39. Improvements Resulting from Locality Reordering

| Processor | Clock (MHz) | Peak (Mflop/s) | Opt. % of Peak | Opt. (Mflop/s) | Reord. only (Mflop/s) | Interl. only (Mflop/s) | Orig. (Mflop/s) | Orig. % of Peak |
|---|---|---|---|---|---|---|---|---|
| R10000 | 250 | 500 | 25.4 | 127 | 74 | 59 | 26 | 5.2 |
| P3 | 200 | 800 | 20.3 | 163 | 87 | 68 | 32 | 4.0 |
| P2SC (2 card) | 120 | 480 | 21.4 | 101 | 51 | 35 | 13 | 2.7 |
| P2SC (4 card) | 120 | 480 | 24.3 | 117 | 59 | 40 | 15 | 3.1 |
| 604e | 332 | 664 | 9.9 | 66 | 43 | 31 | 15 | 2.3 |
| Alpha 21164 | 450 | 900 | 8.3 | 75 | 39 | 32 | 14 | 1.6 |
| Alpha 21164 | 600 | 1200 | 7.6 | 91 | 47 | 37 | 16 | 1.3 |
| Ultra II | 300 | 600 | 12.5 | 75 | 42 | 35 | 18 | 3.0 |
| Ultra II | 360 | 720 | 13.0 | 94 | 54 | 47 | 25 | 3.5 |
| Ultra II/HPC | 400 | 800 | 8.9 | 71 | 47 | 36 | 20 | 2.5 |
| Pent. II/LIN | 400 | 400 | 20.8 | 83 | 52 | 47 | 33 | 8.3 |
| Pent. II/NT | 400 | 400 | 19.5 | 78 | 49 | 49 | 31 | 7.8 |
| Pent. Pro | 200 | 200 | 21.0 | 42 | 27 | 26 | 16 | 8.0 |
| Pent. Pro | 333 | 333 | 18.8 | 60 | 40 | 36 | 21 | 6.3 |

  40. Improvements from Blocking Vectors • Same ONERA M6 Euler simulation, on SGI Origin • One vector represents standard GMRES acceleration • Four vectors is a blocked Krylov method, not yet in “production” version • Savings arise from not reloading matrix elements of the Jacobian for each new vector (four-fold increase in matrix element use per load) • Flop/s rate is effectively tripled – however, can the extra vectors be used efficiently from a numerical viewpoint?
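The reuse effect shows up even in a toy sparse-matrix-times-block product (illustrative; not the production kernel):

```python
import numpy as np
from scipy.sparse import random as sparse_random

A = sparse_random(10_000, 10_000, density=1e-3, format="csr")
V = np.random.rand(10_000, 4)   # a block of four Krylov vectors

# One vector at a time: A's entries stream from memory once per vector.
Y1 = np.column_stack([A @ V[:, k] for k in range(4)])

# Blocked: A's entries stream once and are reused against all four vectors,
# a four-fold increase in matrix element use per load.
Y2 = A @ V
assert np.allclose(Y1, Y2)
```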

  41. Improvements from Reduced Precision • Same ONERA M6 Euler simulation, on SGI Origin • Standard (middle column) is double precision in all floating quantities • Optimization (right column) is to store preconditioner for Jacobian matrix in single precision only, promoting to double before use in the processor • Bandwidth and matrix cache capacities are effectively doubled, with no deterioration in numerical properties
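A minimal sketch of the precision trick (in a real kernel the promotion happens register-by-register so the bandwidth saving survives; this toy only mirrors the numerical effect):

```python
import numpy as np

n = 1000
M = np.random.rand(n, n)        # stand-in for stored preconditioner coefficients
v = np.random.rand(n)

M32 = M.astype(np.float32)      # single-precision storage: half the memory
                                # footprint, double the effective cache capacity
y = M32.astype(np.float64) @ v  # promote to double before use in the processor

# A preconditioner only approximates an inverse action anyway, so the ~1e-7
# relative storage perturbation leaves Krylov convergence essentially unchanged.
print(np.linalg.norm(y - M @ v) / np.linalg.norm(M @ v))
```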

  42. Source #4: Algorithms Packing More Science Per Flop • Some algorithmic improvements do not improve flop rate, but lead to the same scientific end in the same time at lower hardware cost (less memory, lower operation complexity) • Caveat: such adaptive programs are more complicated and less thread-uniform than those they improve upon in quality/cost ratio • Desirable that petaflop/s machines be general purpose enough to run the “best” algorithms • Not daunting, conceptually, but puts an enormous premium on dynamic load balancing • An order of magnitude or more can be gained here for many problems

  43. Example of Adaptive Opportunities • Spatial Discretization-based adaptivity • change discretization type and order to attain required approximation to the continuum everywhere without over-resolving in smooth, easily approximated regions • Fidelity-based adaptivity • change continuous formulation to accommodate required phenomena everywhere without enriching in regions where nothing happens • Stiffness-based adaptivity • change solution algorithm to provide more powerful, robust techniques in regions of space-time where discrete problem is linearly or nonlinearly stiff without extra work in nonstiff, locally well-conditioned regions

  44. Experimental Example of Opportunity for Advanced Adaptivity • Driven cavity: Newton’s method (left) versus new Additive Schwarz Preconditioned Inexact Newton (ASPIN) nonlinear preconditioning (right)

  45. Status and Prospects for Advanced Adaptivity • Metrics and procedures well developed in only a few areas • method-of-lines ODEs for stiff IBVPs and DAEs, FEA for elliptic BVPs • Multi-model methods used in ad hoc ways in production • Boeing TRANAIR code • Poly-algorithmic solvers demonstrated in principle but rarely in the “hostile” environment of high-performance computing • Requirements for progress • management of hierarchical levels of synchronization • user specification of hierarchical priorities of different threads

  46. Summary of Suggestions for PDE Petaflops • Algorithms that deliver more “science per flop” • possibly large problem-dependent factor, through adaptivity (but we won't count this towards rate improvement) • Algorithmic variants that are more architecture-friendly • expect half an order of magnitude, through improved locality and relaxed synchronization • More efficient use of processor cycles, and faster processor/memory • expect one-and-a-half orders of magnitude, through memory-assist language features, PIM, and multithreading • Expanded number of processors • expect two orders of magnitude, through dynamic balancing and extreme care in implementation

  47. Reminder about the Source of PDEs • Computational engineering is not about individual large-scale analyses, done fast and “thrown over the wall” • Both “results” and their sensitivities are desired; often multiple operating points to be simulated are known a priori, rather than sequentially • Sensitivities may be fed back into an optimization process • Full PDE analyses may also be inner iterations in a multidisciplinary computation • In such contexts, “petaflop/s” may mean 1,000 analyses running somewhat asynchronously with respect to each other, each at 1 Tflop/s – clearly a less daunting challenge, and one with better synchronization properties for exploiting “The Grid,” than 1 analysis running at 1 Pflop/s

  48. Summary Recommendations for Architects • Support rich (mesh-like) interprocessor connectivity and fast global reductions • Allow disabling of expensive interprocessor cache coherence protocols for user-tagged data • Support fast message-passing protocols between processors that physically share memory, for legacy MP applications • Supply sufficient memory system bandwidth per processor (at least one word per clock per scalar unit) • Give user optional control of L2 cache traffic through directives • Develop at least gather/scatter processor-in-memory capability • Support variety of precisions in blocked transfers and fast precision conversions

  49. Recommendations for New Benchmarks • Recently introduced sPPM benchmark fills a void for memory-system-realistic full-application PDE performance, but is explicit, structured, and relatively high-order • Similar full-application benchmark is needed for implicit, unstructured, low-order PDE solvers • Reflecting the hierarchical, distributed memory layout of high end computers, this benchmark would have two aspects • uniprocessor (“vertical”) memory system performance – suite of problems of various grid sizes and multicomponent sizes with different interlacings and edge orderings • parallel (“horizontal”) network performance – problems of various subdomain sizes and synchronization frequencies

  50. Bibliography • High Performance Parallel CFD, Gropp, Kaushik, Keyes & Smith, Parallel Computing (to appear, 2001) • Toward Realistic Performance Bounds for Implicit CFD Codes, Gropp, Kaushik, Keyes & Smith, 1999, in “Proceedings of Parallel CFD'99,” Elsevier • Prospects for CFD on Petaflops Systems, Keyes, Kaushik & Smith, 1999, in “Parallel Solution of Partial Differential Equations,” Springer, pp. 247-278 • Newton-Krylov-Schwarz Methods for Aerodynamics Problems: Compressible and Incompressible Flows on Unstructured Grids, Kaushik, Keyes & Smith, 1998, in “Proceedings of the 11th Intl. Conf. on Domain Decomposition Methods,” Domain Decomposition Press, pp. 513-520 • How Scalable is Domain Decomposition in Practice?, Keyes, 1998, in “Proceedings of the 11th Intl. Conf. on Domain Decomposition Methods,” Domain Decomposition Press, pp. 286-297 • On the Interaction of Architecture and Algorithm in the Domain-Based Parallelization of an Unstructured Grid Incompressible Flow Code, Kaushik, Keyes & Smith, 1998, in “Proceedings of the 10th Intl. Conf. on Domain Decomposition Methods,” AMS, pp. 311-319
