Evaluation of Leading Parallel Architectures for Scientific Computing Leonid Oliker Future Technologies Group NERSC/LBNL www.nersc.gov/~oliker
Overview • Success of parallel computing in solving large-scale practical applications relies on efficient mapping and execution on available architectures • Most real-life applications are complex, irregular, and dynamic • Generally believed that unstructured methods will constitute significant fraction of future high-end computing • Evaluate existing and emerging architectural platforms in the context of irregular and dynamic applications • Examine the complex interactions between high-level algorithms, programming paradigms, and architectural platforms.
Algorithms, Architectures and Programming Paradigms • Several parallel architectures with distinct programming methodologies and performance characteristics have emerged • Examined three irregular algorithms • N-Body Simulation, Dynamic Remeshing, Conjugate Gradient • Parallel Architectures • Cray T3E, SGI Origin2000, IBM SP, Cray (Tera) MTA • Programming Paradigms • Message-Passing, Shared Memory, Hybrid, Multithreading • Partitioning and/or ordering strategies to decompose domain • multilevel (MeTiS), linearization (RCM, SAW), combination (MeTiS+SAW)
Sparse Conjugate Gradient • CG is the oldest and best-known Krylov subspace method to solve a sparse linear system (Ax = b) • starts from an initial guess of x • successively generates approximate solutions in the Krylov subspace & search directions to update the solution and residual • slow convergence for ill-conditioned matrices (use preconditioner) • Sparse matrix vector multiply (SPMV) usually accounts for most of the flops within a CG iteration, and is one of the most heavily-used kernels in large-scale numerical simulations • if A is O(n) with nnz nonzeros, SPMV is O(nnz) but DOT is O(n) flops • To perform SPMV (y ← Ax) • assume A stored in compressed row storage (CRS) format • dense vector x stored sequentially in memory with unit stride • Various numberings of mesh elements result in different nonzero patterns of A, causing different access patterns for x
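To make the kernel concrete, here is a minimal CRS-format SPMV sketch in C (not from the original slides; the array names rowptr, colind, and val are conventional but hypothetical):

    /* Sparse matrix-vector multiply y = A*x with A in compressed row storage:
       rowptr[i] .. rowptr[i+1]-1 index the nonzeros (val, colind) of row i. */
    void spmv_crs(int n, const int *rowptr, const int *colind,
                  const double *val, const double *x, double *y)
    {
        for (int i = 0; i < n; i++) {
            double sum = 0.0;
            for (int k = rowptr[i]; k < rowptr[i + 1]; k++)
                sum += val[k] * x[colind[k]];   /* indirect access into x */
            y[i] = sum;
        }
    }

The indirect access x[colind[k]] is where the element numbering shows up: orderings that cluster the column indices of each row improve cache reuse of x.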
Preconditioned Conjugate Gradient • PCG Algorithm:
    Compute r_0 = b − A x_0, p_0 = z_0 = M^{-1} r_0, for some initial guess x_0
    for j = 0, 1, …, until convergence
        α_j = (r_j, z_j) / (A p_j, p_j)
        x_{j+1} = x_j + α_j p_j
        r_{j+1} = r_j − α_j A p_j
        z_{j+1} = M^{-1} r_{j+1}
        β_j = (r_{j+1}, z_{j+1}) / (r_j, z_j)
        p_{j+1} = z_{j+1} + β_j p_j
    end for
• Each PCG iteration involves • 1 SPMV for A p_j • 1 solve with the preconditioner M (we consider ILU(0)) • 3 vector updates (AXPY) for x_{j+1}, r_{j+1}, p_{j+1} • 3 inner products (DOT) for the update scalars α_j, β_j • For symmetric positive definite linear systems, these choices minimize the distance between the approximate and true solutions • For most practical matrices, SPMV and triangular solves dominate
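As a sketch of how one PCG iteration maps onto these kernels (it reuses the spmv_crs routine sketched after the previous slide; apply_precond is a hypothetical stand-in for the ILU(0) triangular solves, not the BlockSolve95 interface):

    /* One PCG iteration expressed with the kernels named above. */
    void spmv_crs(int n, const int *rowptr, const int *colind,
                  const double *val, const double *x, double *y);  /* earlier sketch */

    static double dot(int n, const double *a, const double *b)
    {
        double s = 0.0;
        for (int i = 0; i < n; i++) s += a[i] * b[i];
        return s;
    }

    static void axpy(int n, double alpha, const double *x, double *y)
    {
        for (int i = 0; i < n; i++) y[i] += alpha * x[i];   /* y += alpha*x */
    }

    /* rz holds (r_j, z_j) on entry; the updated value is returned for reuse. */
    double pcg_iteration(int n, const int *rowptr, const int *colind,
                         const double *val, double *x, double *r, double *z,
                         double *p, double *Ap, double rz,
                         void (*apply_precond)(int n, const double *r, double *z))
    {
        spmv_crs(n, rowptr, colind, val, p, Ap);    /* 1 SPMV:  Ap = A p        */
        double alpha = rz / dot(n, Ap, p);          /* DOT                      */
        axpy(n,  alpha, p,  x);                     /* AXPY: x += alpha p       */
        axpy(n, -alpha, Ap, r);                     /* AXPY: r -= alpha Ap      */
        apply_precond(n, r, z);                     /* TriSolve: z = M^{-1} r   */
        double rz_new = dot(n, r, z);               /* DOT                      */
        double beta = rz_new / rz;
        for (int i = 0; i < n; i++)                 /* p = z + beta p           */
            p[i] = z[i] + beta * p[i];
        return rz_new;
    }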
Graph Partitioning Strategy: MeTiS • Most popular class of multilevel partitioners • Objectives • balance computational workload • minimize edge cut (interprocessor communication) • Algorithm • collapses vertices & edges using heavy-edge matching scheme • applies greedy partitioning algorithm to coarsest graph • uncoarsens it back using greedy graph growing + Kernighan-Lin • [Diagram: coarsening phase → initial partitioning → refinement phase]
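For illustration, a partitioning call against the METIS 5 C API might look like the sketch below; the original study used the MeTiS of its era, so the exact interface may differ:

    #include <metis.h>

    /* Partition the (dual) graph into nparts parts, minimizing edge cut while
       balancing vertex count; the graph is in CSR form (xadj, adjncy), the same
       layout used for the sparse matrix.  On return, part[i] is the owner of
       vertex i and objval is the resulting edge cut. */
    void partition_with_metis(idx_t nvtxs, idx_t *xadj, idx_t *adjncy,
                              idx_t nparts, idx_t *part)
    {
        idx_t ncon = 1;      /* one balance constraint (vertex count) */
        idx_t objval;
        METIS_PartGraphKway(&nvtxs, &ncon, xadj, adjncy,
                            NULL, NULL, NULL,            /* unit vertex/edge weights  */
                            &nparts, NULL, NULL, NULL,   /* default targets & options */
                            &objval, part);
    }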
Linearization Strategy: Reverse Cuthill-McKee (RCM) • Matrix bandwidth (profile) has significant impact on efficiency of linear systems and eigensolvers • Graph-based algorithm that generates a permutation so that non-zero entries are close to diagonal • Good preordering for LU or Cholesky factorization (reduces fill) • Improves cache performance (but does not explicitly reduce edge cut)
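A simplified RCM sketch in C (assumptions: connected graph in CSR form, start vertex chosen naively instead of a pseudo-peripheral vertex, and the degree-sort of each BFS level omitted for brevity):

    #include <stdlib.h>

    /* Reverse Cuthill-McKee: breadth-first numbering, then reverse the ordering. */
    void rcm_order(int n, const int *xadj, const int *adjncy, int *perm)
    {
        int *visited = calloc(n, sizeof(int));
        int head = 0, tail = 0;
        visited[0] = 1;
        perm[tail++] = 0;                                   /* naive start vertex */
        while (head < tail) {
            int v = perm[head++];
            for (int k = xadj[v]; k < xadj[v + 1]; k++) {   /* enqueue neighbors */
                int u = adjncy[k];
                if (!visited[u]) { visited[u] = 1; perm[tail++] = u; }
            }
        }
        for (int i = 0; i < n / 2; i++) {                   /* reverse the ordering */
            int t = perm[i]; perm[i] = perm[n - 1 - i]; perm[n - 1 - i] = t;
        }
        free(visited);
    }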
Linearization Strategy: Self-Avoiding Walks (SAW) • Mesh-based technique similar to space-filling curves • Two consecutive triangles in walk share edge or vertex (no jumps) • Visits each triangle exactly once, entering/exiting over edge/vertex • Improves parallel efficiency related to locality (cache reuse) and load balancing, but does not explicitly reduce edge cuts • Amenable to hierarchical coarsening & refinement • Heber, Biswas, Gao: Concurrency: Practice & Experience, 12 (2000) 85-109
MPI Distributed-Memory Implementation • Each processor has local memory that only it can directly access; message passing required to access memory of another processor • User decides data distribution and organizes comm structure • Allows efficient code design at the cost of higher complexity • CG uses the Aztec sparse linear library; PCG uses BlockSolve95 (Aztec does not have an ILU(0) routine) • Matrix A partitioned into blocks of rows; each block assigned to a processor • Associated component vectors (x, b) distributed accordingly • Communication needed to transfer some components of x • AXPY (local computation); DOT (local sum, global reduction) • T3E (450 MHz Alpha processor, 900 Mflops peak, 245 MB main memory, 96 KB secondary cache, 3D torus interconnect)
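As an illustrative sketch (not the Aztec code itself), the distributed DOT is a local partial sum followed by a global reduction, while AXPY is purely local:

    #include <mpi.h>

    /* Distributed inner product: each processor sums over its local rows,
       then a global reduction combines the partial results. */
    double dot_dist(int n_local, const double *a, const double *b, MPI_Comm comm)
    {
        double local = 0.0, global = 0.0;
        for (int i = 0; i < n_local; i++) local += a[i] * b[i];
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, comm);
        return global;
    }

The SPMV additionally needs point-to-point messages to gather the off-processor components of x before the local multiply; Aztec manages that exchange internally.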
SPMV Locality & Communication Statistics • Performance results using T3E hardware performance monitor • ORIG ordering has large edge cut (interprocessor communication) and poor data locality (high number of cache misses) • MeTiS minimizes edge cut; SAW minimizes cache misses
SPMV and CG Runtimes: Performance on T3E • Smart ordering / partitioning required to achieve good performance and high scalability • For this combination of apps & archs, improving cache reuse is more important than reducing interprocessor communication • Adaptivity will require repartitioning (reordering) and remapping
TriSolve and PCG Runtimes: Performance on T3E • Initial ordering / partitioning significant, even though matrix further reordered by BlockSolve95 • TriSolve dominates, and is sensitive to ordering • SAW has a slight advantage over RCM & MeTiS; an order of magnitude faster than ORIG
(OpenMP) Shared-Memory Implementation • Origin2000 (SMP of nodes, each with dual 250 MHz R10000 processor & 512 MB local memory) • hardware makes all memory equally accessible from software perspective • non-uniform memory access time (depends on # hops) • each processor has 4 MB secondary data cache • when processor modifies word, all other copies of cache line invalidated • OpenMP-style directives (requires significantly less effort than MPI) • Two implementation approaches taken (identical kernels) • FLATMEM: assume Origin2000 has uniform shared-memory (arrays not explicitly distributed, non-local data handled by cache coherence) • CC-NUMA: consider underlying architecture by explicit data distribution • Each processor assigned equal # rows in matrix (block) • No explicit synchronization required since no concurrent writes • Global reduction for DOT operation
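A sketch of the shared-memory kernels using OpenMP-style directives (the actual code used the Origin2000's native directives; this is illustrative only):

    #include <omp.h>

    /* Rows are divided statically among threads; each y[i] has a single
       writer, so no explicit synchronization is needed. */
    void spmv_omp(int n, const int *rowptr, const int *colind,
                  const double *val, const double *x, double *y)
    {
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < n; i++) {
            double sum = 0.0;
            for (int k = rowptr[i]; k < rowptr[i + 1]; k++)
                sum += val[k] * x[colind[k]];
            y[i] = sum;
        }
    }

    /* DOT becomes a reduction instead of an MPI global sum. */
    double dot_omp(int n, const double *a, const double *b)
    {
        double s = 0.0;
        #pragma omp parallel for reduction(+:s)
        for (int i = 0; i < n; i++) s += a[i] * b[i];
        return s;
    }

In the CC-NUMA version the arrays would additionally be placed so that each processor's rows reside in its local memory; in FLATMEM they are left wherever the system puts them.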
Origin2000 (Hardware Cache Coherency) • [Diagram: node architecture with two R12K processors, each with an L2 cache, connected through a Hub to local memory with its directory and to the router; the communication architecture uses an extended directory (Dir) for systems larger than 32 processors]
CG Runtimes:Performance on Origin2000 • CC-NUMA performs significantly better than FLATMEM • RCM & SAW reduce runtimes compared to ORIG • Little difference between RCM & SAW, probably due to large cache • CC-NUMA (with ordering) and MPI runtimes comparable, even though programming methodologies quite different • Adaptivity will require reordering and remapping
Hybrid (MPI+OpenMP) Implementation • Latest teraflop-scale system designs contain a large number of SMP nodes • Mixed programming paradigm combines two layers of parallelism • OpenMP within each SMP • MPI among SMP nodes • Allows codes to benefit from loop-level parallelism & shared-memory algorithms in addition to coarse-grained parallelism • Natural mapping to underlying architecture • Currently unclear if hybrid performance gains compensate for increased programming complexity and potential loss of portability • Incrementally add OpenMP directives to Aztec, plus some code reorganization (including temp variables for correctness) • IBM SP (222 MHz Power3, 8-way SMP, current switch limits each node to 4 MPI tasks)
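A hybrid sketch of the SPMV: MPI handles the coarse-grained exchange of remote x components between SMP nodes, while OpenMP threads split the local row loop (exchange_halo is a hypothetical placeholder for the Aztec communication step):

    #include <mpi.h>
    #include <omp.h>

    void exchange_halo(double *x, MPI_Comm comm);   /* hypothetical: gathers the
                                                       remote components of x */

    /* MPI ranks own blocks of rows (coarse-grained layer); OpenMP threads
       split the local row loop (loop-level layer). */
    void spmv_hybrid(int n_local, const int *rowptr, const int *colind,
                     const double *val, double *x, double *y, MPI_Comm comm)
    {
        exchange_halo(x, comm);                    /* MPI among SMP nodes    */

        #pragma omp parallel for schedule(static)  /* OpenMP within each SMP */
        for (int i = 0; i < n_local; i++) {
            double sum = 0.0;
            for (int k = rowptr[i]; k < rowptr[i + 1]; k++)
                sum += val[k] * x[colind[k]];
            y[i] = sum;
        }
    }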
CG Runtimes: Performance on SP • Intelligent orderings important for good hybrid performance • MeTiS+SAW is the best strategy, but not dramatically better • For a given processor count, varying MPI tasks & OpenMP threads has little effect • Hybrid implementation does not offer a noticeable advantage
Cray (Tera) MTA Multithreaded Architecture • 255 MHz MTA uses multithreading to hide latency (100-150 cycles per word) & keep processors saturated with work • no data cache • hashed memory mapping (explicit data placement impossible) • near-uniform data access from any processor to any memory location • Each processor has 128 hardware streams (each with 32 registers & a program counter) • context switch on each cycle, choosing the next instruction from the ready streams • a stream can execute an instruction only once every 21 cycles, even if no instructions reference memory • Synchronization between threads accomplished using full / empty bits in memory, allowing fine-grained threads • No explicit load balancing required since dynamic scheduling of work to threads can keep processor saturated • No difference between uni- and multiprocessor parallelism
MTA Multithreaded Implementation • Straightforward implementation (only requires compiler directives) • Special assertions used to indicate no loop-carried dependencies • Compiler then able to parallelize loop segments • Load balancing by OS (dynamically assigns matrix rows to threads) • Other than reduction for DOT, no special synchronization constructs required for CG • Synchronization required however for PCG • No special ordering required to achieve good parallel performance
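An MTA-style sketch: a compiler assertion states that the row loop has no loop-carried dependences so iterations can be spread across the hardware streams (the pragma spelling follows our recollection of the Cray MTA/XMT compilers and should be treated as illustrative, not exact):

    /* No data distribution or ordering is needed; the runtime dynamically
       assigns iterations (matrix rows) to threads. */
    void spmv_mta(int n, const int *rowptr, const int *colind,
                  const double *val, const double *x, double *y)
    {
    #pragma mta assert parallel
        for (int i = 0; i < n; i++) {
            double sum = 0.0;
            for (int k = rowptr[i]; k < rowptr[i + 1]; k++)
                sum += val[k] * x[colind[k]];
            y[i] = sum;
        }
    }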
SPMV and CG Runtimes: Performance on MTA • Both SPMV and CG show high scalability (over 90%) with 60 streams per processor • sufficient TLP to tolerate high overhead of memory access • 8-proc MTA faster than 32-proc Origin2000 & 16-proc T3E with no partitioning / ordering overhead; but will scaling continue beyond 8 procs? • Adaptivity does not require extra work to maintain performance
PCG Runtimes: Performance on MTA • Developed multithreaded version of TriSolve • matrix factorization times not included • use low-level locks to perform on-the-fly dependency analysis • TriSolve responsible for most of the computational overhead • Limited scalability due to insufficient TLP in our dynamic dependency scheme
CG Summary • Examined four different parallel implementations of CG and PCG using four leading programming paradigms and architectures • MPI most complicated • compared graph partitioning & linearization strategies • improving cache reuse more important than reducing communication • Smart ordering algorithms significantly improve overall performance • possible to achieve message passing performance using shared memory constructs through careful data ordering & distribution • Hybrid paradigm increases programming complexity with little performance gains • MTA easiest to program • no partitioning / ordering required to obtain high efficiency & scalability • no additional complexity for dynamic adaptation • limited scalability for PCG due to lack of thread level parallelism
2D Unstructured Mesh Adaptation • Powerful tool for efficiently solving computational problems with evolving physical features (shocks, vortices, shear layers, crack propagation) • Complicated logic and data structures • Difficult to parallelize efficiently • Irregular data access patterns (pointer chasing) • Workload grows/shrinks at runtime (dynamic load balancing) • Three types of element subdivision
Parallel Code Development • Programming paradigms • Message passing (MPI) • Shared memory (OpenMP-style pragma compiler directives) • Multithreading (Tera compiler directives) • Architectures • Cray T3E, SGI Origin2000, Cray (Tera) MTA • Critical factors • Runtime • Scalability • Programmability • Portability • Memory overhead
Test Problem • Computational mesh to simulate flow over airfoil • Mesh geometrically refined 5 levels in specific regions to better capture fine-scale phenomena • Serial code: 6.4 secs on 250 MHz R10K • Initial mesh: 14,605 vertices, 28,404 triangles • Adapted mesh: 488,574 vertices, 1,291,834 triangles
Distributed-Memory Implementation • 512-node T3E (450 MHz DEC Alpha procs) • 32-node Origin2000 (250 MHz dual MIPS R10K procs) • Code implemented in MPI within PLUM framework • Initial dual graph used for load balancing adapted meshes • Parallel repartitioning of adapted meshes (ParMeTiS) • Remapping algorithm assigns new partitions to processors • Efficient data movement scheme (predictive & asynchronous) • Three major steps (refinement, repartitioning, remapping) • Overhead • Programming (to maintain consistent data structures for shared objects) • Memory (mostly for bulk communication buffers)
Overview of PLUM • [Flowchart: INITIALIZATION (initial mesh, partitioning, mapping) → FLOW SOLVER → MESH ADAPTOR (edge marking, coarsening, refinement) → LOAD BALANCER (balanced? if not, repartitioning; reassignment expensive? if not, remapping) → back to FLOW SOLVER]
Performance of MPI Code • More than 32 procs required to outperform serial case • Reasonable scalability for refinement & remapping • Scalable repartitioner would improve performance • Data volume different due to different word sizes
Shared-Memory Implementation • 32-node Origin2000 (250 MHz dual MIPS R10K procs) • Complexities of partitioning & remapping absent • Parallel dynamic loop scheduling for load balance • GRAPH_COLOR strategy (significant overhead) • Use SGI’s native pragma directives to create IRIX threads • Color triangles (new ones on the fly) to form independent sets • All threads process each set to completion, then synchronize • NO_COLOR strategy (too fine grained) • Use low-level locks instead of graph coloring • When thread processes triangle, lock its edges & vertices • Processors idle while waiting for blocked objects
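A sketch of the GRAPH_COLOR strategy, with OpenMP-style directives standing in for the SGI-native pragmas and with set_start, set_tris, and refine_triangle as hypothetical names: each independent set is processed to completion before the next one begins.

    void refine_triangle(int t);   /* hypothetical per-triangle adaptation kernel */

    /* Within a color (independent set), no two triangles share an edge or
       vertex, so the set can be processed without locks; the barrier at the
       end of the parallel loop synchronizes between sets. */
    void process_colored_sets(int num_sets, const int *set_start, const int *set_tris)
    {
        for (int c = 0; c < num_sets; c++) {
            #pragma omp parallel for schedule(dynamic)
            for (int k = set_start[c]; k < set_start[c + 1]; k++)
                refine_triangle(set_tris[k]);
            /* implicit barrier here: all threads finish set c before set c+1 */
        }
    }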
Performance of Shared-Memory Code • Poor performance due to flat memory assumption • System overloaded by false sharing • Page migration unable to remedy problem • Need to consider data locality and cache effects to improve performance (require partitioning & reordering) • For GRAPH_COLOR • Cache misses: 15 M (serial) to 85 M (P=1) • TLB misses: 7.3 M (serial) to 53 M (P=1)
Multithreaded Implementation • 8-processor 250 MHz Tera MTA • 128 streams/proc, flat hashed memory, full-empty bit for sync • Executes pipelined instruction from different stream at each clock tick • Dynamically assigns triangles to threads • Implicit load balancing • Low-level synchronization variables ensure adjacent triangles do not update shared edges or vertices simultaneously • No partitioning, remapping, graph coloring required • Basically, the NO_COLOR strategy • Minimal programming to create multithreaded version
Performance of Multithreading Code • Sufficient instruction level parallelism exists to tolerate memory access overhead and lightweight synchronization • Number of streams changed via compiler directive
Schematic of Different Paradigms • [Figure: distributed-memory, shared-memory, and multithreading views of the mesh before and after adaptation (P=2 for distributed memory)]
Mesh Adaptation: Comparison and Conclusions • Different programming paradigms require varying numbers of operations and overheads • Multithreaded systems offer tremendous potential for solving some of the most challenging real-life problems on parallel computers
Comparison of Programming Models on Origin2000 • [Figure: MPI moves data between P0 and P1 with a matching Send/Receive pair through the communication library; SHMEM uses a one-sided Put or Get (not both) through the communication library; CC-SAS uses plain loads/stores on shared data (A1 = A0)] • We focus on adaptive applications (D-mesh, N-body)
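The contrast between the three models can be sketched as follows (illustrative only; A, A0, and A1 stand for the same logical data in each model, the SHMEM array is assumed to be symmetric, and mixing the three models in one routine is purely for side-by-side comparison):

    #include <mpi.h>
    #include <shmem.h>

    /* The same transfer of n doubles from P0 to P1 in each model. */
    void move_data(double *A, double *A0, double *A1, int n, int rank)
    {
        /* MPI: two-sided, a matching Send/Receive pair through the library */
        if (rank == 0) MPI_Send(A, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        if (rank == 1) MPI_Recv(A, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* SHMEM: one-sided, P0 puts directly into PE 1's copy of A (put or get, not both) */
        if (rank == 0) shmem_double_put(A, A, n, 1);

        /* CC-SAS: ordinary loads and stores on shared data (A1 = A0) */
        if (rank == 0)
            for (int i = 0; i < n; i++) A1[i] = A0[i];
    }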
Characteristics of the Models • [Figure: the programming models arranged along an axis of increasing transparency (for implementation)]
D-Mesh Algorithm • [Flowchart: INITIALIZATION (initial mesh, partitioning) → FLOW SOLVER (matrix transform, iterative solver) → MESH ADAPTOR (edge marking, refinement) → LOAD BALANCER (balanced? if not, partitioning and re-mapping) → back to FLOW SOLVER]
Implementation of Flow Solver • [Figure: MPI uses a physical partition of the matrix across P0 and P1; SAS uses a logical partition] • Matrix Transform: easier in CC-SAS • Iterative Solver: Conjugate Gradient, SPMV
Performance of Solver • Most of the time is spent in the iterative solver
Implementation of Load Balancer • Data re-mapping in D-Mesh: [Figure: CC-SAS logically partitions the data; MPI/SHMEM physically partition it across P0 and P1] • SAS provides substantial ease of programming at the conceptual and orchestration levels, far beyond implicit load/store vs. explicit messages
Performance of Adaptor • CC-SAS suffers from the poor spatial locality of shared data
Performance of D-Mesh • CC-SAS suffers from poor spatial locality for smaller data sets • CC-SAS benefits from the ease of programming for larger data sets
N-Body Simulation: Evolution of Two Plummer Bodies • Barnes-Hut N-body simulation arises in many areas of science and engineering, such as astrophysics, molecular dynamics, and graphics
N-Body Simulation (Barnes-Hut) • Each time step: build the oct-tree, compute forces on all bodies based on the tree, then update body positions and velocities
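A serial sketch of one Barnes-Hut time step (Body, Node, and the tree helpers are hypothetical stand-ins for the code's own data structures):

    typedef struct Node Node;   /* oct-tree node (hypothetical) */
    typedef struct { double pos[3], vel[3], acc[3], mass; } Body;

    /* Hypothetical helpers standing in for the real tree code. */
    Node *build_octree(Body *bodies, int n);
    void  compute_force(const Node *root, Body *b, double theta);
    void  free_octree(Node *root);

    void barnes_hut_step(Body *bodies, int n, double dt, double theta)
    {
        Node *root = build_octree(bodies, n);         /* 1. build the oct-tree   */
        for (int i = 0; i < n; i++)                   /* 2. forces from the tree */
            compute_force(root, &bodies[i], theta);   /*    (theta = opening angle) */
        for (int i = 0; i < n; i++)                   /* 3. update positions & velocities */
            for (int d = 0; d < 3; d++) {
                bodies[i].vel[d] += bodies[i].acc[d] * dt;
                bodies[i].pos[d] += bodies[i].vel[d] * dt;
            }
        free_octree(root);
    }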
N-Body: Tree Building Method • [Figure: CC-SAS builds a single shared tree; MPI/SHMEM build a “locally essential” tree on each processor by distributing/collecting cells and bodies]
Performance of N-Body • For 16 processors, performance is similar • For 64 processors: CC-SAS is better for smaller data sets but worse for larger data sets
N-Body: Time Breakdown for (64P, 16K) • [Chart: per-processor time breakdown] • Less BUSY time for CC-SAS due to ease of programming
N-Body: Time Breakdown for (64P, 1024K) • [Chart: per-processor time breakdown] • High MEM time for CC-SAS
N-Body: Improved Implementation • SAS: shared tree with duplicated high-level cells • MPI/SHMEM: locally essential tree