The MAP3S Static-and-Regular Mesh Simulation and Wavefront Parallel-Programming Patterns
By Robert Niewiadomski, José Nelson Amaral, and Duane Szafron
Department of Computing Science, University of Alberta, Edmonton, Alberta, Canada
Pattern-based parallel-programming
• Observation:
  • Many seemingly different parallel programs share a common parallel computation-communication-synchronization pattern.
• A parallel-programming pattern instance:
  • Is a parallel program that adheres to a certain parallel computation-communication-synchronization pattern.
  • Consists of engine-side code and user-side code:
    • Engine-side code is complete and handles all communication and synchronization.
    • User-side code is incomplete and handles all computation; the user completes the incomplete portions.
• MAP3S targets distributed-memory systems.
MAP3S
• MAP3S = MPI/C Advanced Pattern-based Parallel Programming System
• [Roles diagram: the engine designer and pattern designer contribute technical expertise; the application developer contributes domain knowledge.]
Pattern-based parallel-programming
• The MAP3S usage scheme:
  1. Select a pattern.
  2. Create a specification file (e.g., dimensions of the mesh, data dependences, etc.).
  3. Generate the pattern instance (automatic, by the pattern-instance generator).
  4. Write the user-side code (domain-specific computation code).
The Simulation and Wavefront computations
• The computations operate on a k-dimensional mesh of elements.
• Simulation:
  • Multiple mesh instances M0, M1, … are computed.
  • In iteration i = 0, the elements of M0 are initialized.
  • In iteration i > 0, certain elements of Mi are computed using elements of Mi-1 that were initialized/computed in the previous iteration.
  • Execution proceeds until a terminating condition is met.
  • Example: cellular-automata computations.
• Wavefront:
  • A single mesh instance M is computed.
  • In iteration i = 0, certain elements of M are initialized.
  • In iteration i > 0, the elements of M whose data dependences are satisfied are computed.
  • Execution proceeds until there are no elements left to compute.
  • Example: dynamic-programming computations.
Mesh-blocks
• A k-dimensional mesh is logically partitioned into k-dimensional sub-meshes called mesh-blocks.
• Computation proceeds at the granularity of mesh-blocks.
• [Diagram: a 6×6 mesh of elements 0–35 partitioned into nine 2×2 mesh-blocks labeled A–I.]
User-side code: Simulation
• Prelude: process command-line arguments.
• Prologue: initialize the first mesh, possibly at the granularity of mesh-blocks.
• BodyLocal: compute the next mesh at the granularity of mesh-blocks.
• BodyGlobal: decide whether to compute another mesh or to terminate.
• Epilogue: process the last computed mesh, possibly at the granularity of mesh-blocks.
User-side code: Wavefront
• Prelude: process command-line arguments.
• Prologue: initialize the mesh, possibly at the granularity of mesh-blocks.
• Body: continue computing the mesh at the granularity of mesh-blocks.
• Epilogue: process the mesh, possibly at the granularity of mesh-blocks.
Data-dependency specification
• The computation of an element depends on the values of certain other elements.
• In MAP3S, the user specifies these data-dependencies using conditional shape-lists at pattern-instance generation time.
• Syntax: given an element p(c0, c1, …, ck-1), if a certain condition is met, then the computation of p requires the values of all the elements falling into the specified k-dimensional volumes of the k-dimensional mesh, each of which is specified relative to position (c0, c1, …, ck-1).
• A simple example, expressing an element's dependences on its three neighbours above and to its left:

  {"x > 0 && y > 0", {(["x-1","x-1"], ["y-1","y-1"]), (["x-1","x-1"], ["y","y"]), (["x","x"], ["y-1","y-1"])}};
Data-dependency specification
• The strengths of conditional shape-lists:
  • The user is not limited to pre-defined data-dependency specifications.
  • The user is able to express irregular data-dependency specifications.
• In this example, conditional shape-lists specify the data-dependencies of the Lower/Upper Matrix-Decomposition Wavefront computation:

  {"y>x", {(["0","x-1"],["y","y"]), (["x","x"],["0","x-1"]), (["x","x"],["x","x"])}};
  {"y<=x", {(["0","y-1"],["y","y"]), (["x","x"],["0","y-1"])}};
Direct mesh-access
• In the user-code, all the mesh elements can be accessed directly:

  void computeMeshBlock(mesh, xMin, xMax, yMin, yMax) {
    for (x = xMin; x <= xMax; x++) {
      for (y = yMin; y <= yMax; y++) {
        mesh[x][y] = f(mesh[x-1][y-1], mesh[x][y-1], mesh[x-1][y]);
      }
    }
  }

• With direct mesh-access, the user does not need to refactor their sequential code with respect to mesh access. In contrast, with indirect mesh-access a refactoring is necessary, since input elements are accessed through auxiliary data-structures.
Engine-side code
• Engine-side code in the Wavefront pattern.
• Element-level data-dependencies --- specified by the user --- are automatically extended to mesh-block-level data-dependencies.
• [Diagram: on the 6×6 mesh of 2×2 blocks A–I, each element's dependences on its north, west, and north-west neighbours induce, for each mesh-block, dependences on its north, west, and north-west neighbouring blocks.]
Engine-side code
• The mesh-block-level data-dependencies are utilized to establish a parallel-computation schedule.
• [Diagram: the mesh-blocks A–I arranged into wavefront levels that can execute in parallel: A; then B and D; then C, E, and G; then F and H; then I.]
Engine-side code
• The parallel computation-schedule is refined, with mesh-blocks being assigned among the processors in a round-robin fashion (shown).
• The parallel computation-schedule is then complemented with a parallel communication-schedule (not shown).
• The engine-side code executes user-side code in accordance with the parallel computation and communication schedules.
• [Diagram: the mesh-blocks A–I distributed between CPU 0 and CPU 1 in round-robin order.]
Engine-side code
• Execution of user-side code by the engine-side code when using a sequential prologue and epilogue.
• [Timeline for CPU 0 and CPU 1: Prelude; Prologue(A); Body(A); then Body(B) and Body(D) in parallel; then Body(C) and Body(E) in parallel; Body(G); …; Epilogue(A, B, C, D, E, G).]
Mesh representation
• The mesh can be represented using either the dense mesh-representation or the sparse mesh-representation.
• The sparse representation can have better locality and can distribute the memory footprint of the mesh among nodes.
• [Figures: a 2D mesh in dense mesh-representation (elements numbered contiguously in row-major order) and the same 2D mesh in sparse mesh-representation (elements numbered contiguously block by block).]
Mesh representation
• A mesh's memory-footprint can be as much of a problem as performance.
• The combination of a parallel prologue and epilogue with the sparse mesh-representation both minimizes and distributes the mesh-storage memory-footprint:
  • Only store non-owned mesh-blocks that are used by owned mesh-blocks.
  • Do not store dead mesh-blocks.
• Memory-footprint reduction varies. It is most effective for large Simulation computations.
• [Diagram: per-CPU views of the mesh, showing the original block assignment, then only the needed non-owned blocks retained, then dead blocks dropped as well.]
Experimental evaluation
• Problems:
  • 2D problems:
    • GoL: game-of-life (Simulation)
    • LUMD: lower/upper matrix-decomposition (Wavefront)
  • 3D problems:
    • RTA: room-temperature annealing (Simulation)
    • MSA: multiple-sequence alignment (Wavefront)
• Hardware:
  • GigE: a 16-node cluster with Gigabit Ethernet
  • IB: a 128-node cluster with InfiniBand (limited to 64 nodes)
Experimental evaluation
• Speedup on GigE (x-axis is # of nodes; y-axis is speedup).
• Performance gains on LUMD and MSA are worse than on GoL and RTA:
  • LUMD has non-uniform computation intensity, which limits parallelism.
  • MSA has limited computation granularity, which increases the relative overhead of communication and synchronization.
• [Plots: GoL (2D Simulation), RTA (3D Simulation), LUMD (2D Wavefront), MSA (3D Wavefront).]
Experimental evaluation
• Speedup on IB (x-axis is # of nodes; y-axis is speedup).
• Performance gains on LUMD and MSA are worse than on GoL and RTA, for the same reasons as on GigE.
• [Plots: GoL (2D Simulation), RTA (3D Simulation), LUMD (2D Wavefront), MSA (3D Wavefront).]
Experimental evaluation
• Capability:
  • Use of the sparse mesh-representation distributes the mesh memory-footprint across multiple nodes.
  • This allows for the handling of meshes whose memory-footprint exceeds the memory capacity of a single node.
• Using 16 nodes on GigE:
  • The LUMD memory-footprint reduction is limited, because the computation of each element depends on a larger number of other elements.
Experimental evaluation
• What we learned:
  • Dense meshes + large computation granularity:
    • MAP3S delivers speedups in the range of 10 to 12 on 16 nodes,
    • and in the range of 10 to 43 on 64 nodes.
  • Sparse meshes:
    • smaller speedups,
    • memory consumption is reduced by 20% to 50% (per node).