The MAP3S Static-and-Regular Mesh Simulation and Wavefront Parallel-Programming Patterns
By Robert Niewiadomski, José Nelson Amaral, and Duane Szafron
Department of Computing Science, University of Alberta, Edmonton, Alberta, Canada
Pattern-based parallel-programming
• Observation:
  • Many seemingly different parallel programs have a common parallel computation-communication-synchronization pattern.
• A parallel-programming pattern instance:
  • Is a parallel program that adheres to a certain parallel computation-communication-synchronization pattern.
  • Consists of engine-side code and user-side code:
    • Engine-side code is complete and handles all communication and synchronization.
    • User-side code is incomplete and handles all computation. The user completes the incomplete portions.
• MAP3S targets distributed-memory systems.
MAP3S
• MAP3S = MPI/C Advanced Pattern-based Parallel Programming System
[Figure labels: technical expertise, domain knowledge; engine designer, pattern designer, application developer.]
Pattern-based parallel-programming
• The MAP3S usage scheme:
  1. Select a pattern.
  2. Create a specification file (e.g., dimensions of the mesh, data dependences).
  3. Generate the pattern instance (done automatically by the pattern-instance generator).
  4. Write the user-side code (domain-specific computation code).
The Simulation and Wavefront computations
• The computations operate on a k-dimensional mesh of elements.
• Simulation:
  • Multiple mesh instances M_0, M_1, … are computed.
  • In iteration i = 0, the elements of M_0 are initialized.
  • In iteration i > 0, certain elements of M_i are computed using elements of M_{i-1} that were initialized or computed in the previous iteration.
  • Execution proceeds until a terminating condition is met.
  • Example: cellular-automata computations.
• Wavefront:
  • A single mesh instance M is computed.
  • In iteration i = 0, certain elements of M are initialized.
  • In iteration i > 0, the elements of M whose data dependences are satisfied are computed.
  • Execution proceeds until there are no elements left to compute.
  • Example: dynamic-programming computations.
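As an illustration of a single Simulation iteration, here is a minimal sequential sketch in C of one game-of-life step that computes mesh M_i from mesh M_{i-1}. The mesh size `N` and the function names are assumptions made for this example; it is not MAP3S code and contains no parallelism.

```c
#define N 6  /* hypothetical mesh side length */

/* Count the live neighbours of (x, y) in prev; cells outside the mesh count as dead. */
static int liveNeighbours(int prev[N][N], int x, int y) {
    int count = 0;
    for (int dx = -1; dx <= 1; dx++) {
        for (int dy = -1; dy <= 1; dy++) {
            int nx = x + dx, ny = y + dy;
            if ((dx != 0 || dy != 0) && nx >= 0 && nx < N && ny >= 0 && ny < N)
                count += prev[nx][ny];
        }
    }
    return count;
}

/* One Simulation iteration: every element of M_i (next) is computed
   only from elements of M_{i-1} (prev). */
void simulationStep(int prev[N][N], int next[N][N]) {
    for (int x = 0; x < N; x++) {
        for (int y = 0; y < N; y++) {
            int n = liveNeighbours(prev, x, y);
            next[x][y] = (n == 3) || (prev[x][y] && n == 2);
        }
    }
}
```

Because `next` never reads from itself, all elements of M_i are independent of one another, which is what lets a Simulation computation proceed on many mesh-blocks at once.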
Mesh-blocks
• A k-dimensional mesh is logically partitioned into k-dimensional sub-meshes called mesh-blocks.
• Computation proceeds at the granularity of mesh-blocks.
[Figure: a 6×6 mesh with elements numbered 0 to 35, shown unpartitioned and partitioned into nine 2×2 mesh-blocks labeled A to I.]
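The partitioning can be sketched as a small index computation in C. The function and parameter names below are hypothetical, and the sketch assumes a 2D mesh whose width is an exact multiple of the block width.

```c
/* Map element (row, col) to the index of the mesh-block that contains it.
   meshCols, blockRows, and blockCols are hypothetical parameter names. */
int blockOf(int row, int col, int meshCols, int blockRows, int blockCols) {
    int blocksPerRow = meshCols / blockCols;  /* assumes blockCols divides meshCols */
    return (row / blockRows) * blocksPerRow + (col / blockCols);
}

/* Convert a block index to a letter, following the A..I labels of the figure. */
char blockLabel(int blockIndex) {
    return (char)('A' + blockIndex);
}
```

For the 6×6 mesh with 2×2 mesh-blocks, element 14 sits at row 2, column 2, so `blockLabel(blockOf(2, 2, 6, 2, 2))` yields `'E'`.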
User-side code: Simulation
• Prelude: process command-line arguments.
• Prologue: initialize the first mesh, possibly at the granularity of mesh-blocks.
• BodyLocal: compute the next mesh at the granularity of mesh-blocks.
• BodyGlobal: decide whether to compute another mesh or to terminate.
• Epilogue: process the last computed mesh, possibly at the granularity of mesh-blocks.
[Figure: two side-by-side columns of the stages Prelude, Prologue, BodyLocal, BodyGlobal, Epilogue.]
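The way these five pieces of user-side code fit together can be sketched with a sequential stand-in for the engine. Everything here is an assumption made for illustration: the function signatures, the 1D mesh, the block size, and the trivial "add one" computation are not taken from MAP3S, whose generated engine-side code also handles communication and synchronization.

```c
#include <stdlib.h>

#define MESH_SIZE 8    /* hypothetical 1D mesh */
#define BLOCK_SIZE 4   /* hypothetical mesh-block size */

static int meshes[2][MESH_SIZE];  /* M_{i-1} and M_i, double-buffered */
static int iterationLimit = 1;

/* Prelude: process command-line arguments (here, an optional iteration limit). */
void prelude(int argc, char **argv) {
    if (argc > 1) iterationLimit = atoi(argv[1]);
}

/* Prologue: initialize one mesh-block of the first mesh. */
void prologue(int *first, int lo, int hi) {
    for (int x = lo; x <= hi; x++) first[x] = x;
}

/* BodyLocal: compute one mesh-block of the next mesh from the previous mesh. */
void bodyLocal(const int *prev, int *next, int lo, int hi) {
    for (int x = lo; x <= hi; x++) next[x] = prev[x] + 1;
}

/* BodyGlobal: return nonzero to compute another mesh, zero to terminate. */
int bodyGlobal(int iteration) { return iteration < iterationLimit; }

/* Epilogue: process one mesh-block of the last computed mesh (here, sum it). */
long epilogue(const int *last, int lo, int hi) {
    long sum = 0;
    for (int x = lo; x <= hi; x++) sum += last[x];
    return sum;
}

/* Sequential stand-in for the engine: calls the user-side code in slide order. */
long runSimulation(int argc, char **argv) {
    int cur = 0, iteration = 0;
    long result = 0;
    prelude(argc, argv);
    for (int lo = 0; lo < MESH_SIZE; lo += BLOCK_SIZE)
        prologue(meshes[cur], lo, lo + BLOCK_SIZE - 1);
    do {
        for (int lo = 0; lo < MESH_SIZE; lo += BLOCK_SIZE)
            bodyLocal(meshes[cur], meshes[1 - cur], lo, lo + BLOCK_SIZE - 1);
        cur = 1 - cur;
        iteration++;
    } while (bodyGlobal(iteration));
    for (int lo = 0; lo < MESH_SIZE; lo += BLOCK_SIZE)
        result += epilogue(meshes[cur], lo, lo + BLOCK_SIZE - 1);
    return result;
}
```

In this sketch only `bodyLocal` touches mesh data inside the loop, so it is the part an engine could run on different mesh-blocks in parallel, while `bodyGlobal` is the single decision point between meshes.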
User-side code: Wavefront
• Prelude: process command-line arguments.
• Prologue: initialize the mesh, possibly at the granularity of mesh-blocks.
• Body: continue computing the mesh at the granularity of mesh-blocks.
• Epilogue: process the mesh, possibly at the granularity of mesh-blocks.
[Figure: two side-by-side columns of the stages Prelude, Prologue, Body, Epilogue.]
Data-dependency specification
• The computation of an element depends on the values of certain other elements.
• In MAP3S, the user specifies these data-dependencies using conditional shape-lists at pattern-instance generation time.
• Syntax: given an element p(c0, c1, …, ck-1), if a certain condition is met, then the computation of p requires the values of all the elements that fall into the specified k-dimensional volumes of the k-dimensional mesh, each of which is specified relative to position (c0, c1, …, ck-1).
• Here is a simple example (expressing the dependences of the green element):
    {"x > 0 && y > 0", {(["x - 1", "x - 1"], ["y - 1", "y - 1"]), (["x - 1", "x - 1"], ["y", "y"]), (["x", "x"], ["y - 1", "y - 1"])}};
[Figure: a 4×4 mesh with both axes labeled 0 to 3, highlighting the green element and the elements it depends on.]
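To make the semantics of the simple example concrete, the following C sketch evaluates that one conditional shape-list by hand. The `Shape` struct and both function names are hypothetical; MAP3S itself consumes the textual specification when the pattern instance is generated, and its internal representation is not shown on the slides.

```c
/* One 2D volume of a shape-list, as inclusive coordinate ranges. */
typedef struct { int xLo, xHi, yLo, yHi; } Shape;

/* Evaluate the example conditional shape-list for element (x, y).
   Writes the shapes to out (capacity of at least 3) and returns their count;
   returns 0 when the condition "x > 0 && y > 0" does not hold. */
int exampleDependences(int x, int y, Shape out[3]) {
    if (!(x > 0 && y > 0)) return 0;
    out[0] = (Shape){x - 1, x - 1, y - 1, y - 1};
    out[1] = (Shape){x - 1, x - 1, y, y};
    out[2] = (Shape){x, x, y - 1, y - 1};
    return 3;
}

/* Does element (qx, qy) fall inside any of the n shapes? */
int isDependence(const Shape *shapes, int n, int qx, int qy) {
    for (int i = 0; i < n; i++) {
        if (qx >= shapes[i].xLo && qx <= shapes[i].xHi &&
            qy >= shapes[i].yLo && qy <= shapes[i].yHi)
            return 1;
    }
    return 0;
}
```

For element (2, 2) the three shapes cover exactly (1, 1), (1, 2), and (2, 1); for any element on the x = 0 or y = 0 border the condition fails and the element has no dependences.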
Data-dependency specification
• The strengths of conditional shape-lists:
  • The user is not limited to pre-defined data-dependency specifications.
  • The user is able to express irregular data-dependency specifications.
• In this example, conditional shape-lists specify the data-dependencies of the Lower/Upper Matrix-Decomposition Wavefront computation:
    {"y>x", {(["0","x-1"],["y","y"]), (["x","x"],["0","x-1"]), (["x","x"],["x","x"])}};
    {"y<=x", {(["0","y-1"],["y","y"]), (["x","x"],["0","y-1"])}};
[Figure: a 10×10 mesh with both axes labeled 0 to 9, illustrating the dependences expressed by the two shape-lists.]
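The irregularity of these dependences shows up when the elements covered by the two shape-lists are counted. The C sketch below does that count; the function names are hypothetical, and it reads each shape as an (x-range, y-range) pair with inclusive bounds, which is how the simpler three-neighbour example appears to use the notation.

```c
/* Number of integers in the inclusive range [lo, hi]; zero when the range is empty. */
static int rangeLen(int lo, int hi) { return hi >= lo ? hi - lo + 1 : 0; }

/* Count the elements that (x, y) depends on under the two LUMD shape-lists. */
int lumdDependenceCount(int x, int y) {
    if (y > x) {
        return rangeLen(0, x - 1) * rangeLen(y, y)   /* (["0","x-1"],["y","y"]) */
             + rangeLen(x, x) * rangeLen(0, x - 1)   /* (["x","x"],["0","x-1"]) */
             + rangeLen(x, x) * rangeLen(x, x);      /* (["x","x"],["x","x"])   */
    }
    /* Otherwise y <= x. */
    return rangeLen(0, y - 1) * rangeLen(y, y)       /* (["0","y-1"],["y","y"]) */
         + rangeLen(x, x) * rangeLen(0, y - 1);      /* (["x","x"],["0","y-1"]) */
}
```

Under this reading the count is 2x + 1 when y > x and 2y otherwise, so elements far from the origin depend on many more elements than elements near it, unlike the fixed three-element stencil.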
Direct mesh-access
• In the user-side code, all mesh elements can be accessed directly.
    /* Compute one mesh-block in place; neighbours are read straight from the mesh.
       The int element type is an assumption, since the slide omits types;
       f is the user's element function. */
    void computeMeshBlock(int **mesh, int xMin, int xMax, int yMin, int yMax) {
        for (int x = xMin; x <= xMax; x++) {
            for (int y = yMin; y <= yMax; y++) {
                mesh[x][y] = f(mesh[x-1][y-1], mesh[x][y-1], mesh[x-1][y]);
            }
        }
    }
• With direct mesh-access, the user does not need to refactor their sequential code with respect to mesh access. In contrast, indirect mesh-access requires refactoring, because input elements are accessed through auxiliary data-structures.
[Figure: a 6×6 mesh with both axes labeled 0 to 5.]
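A runnable variant of this idea is sketched below in C. It is an illustration only: the mesh size, the element function `f` (a plain sum of the three neighbours), the boundary handling, and the 2×2 block traversal are all assumptions, and the blocks are visited sequentially in an order that respects the dependences rather than by a parallel engine.

```c
#define N 6  /* hypothetical mesh side length */

/* Hypothetical element function: the sum of the three neighbours. */
static long f(long a, long b, long c) { return a + b + c; }

/* Compute one mesh-block in place, reading neighbours straight from the mesh. */
void computeMeshBlock(long mesh[N][N], int xMin, int xMax, int yMin, int yMax) {
    for (int x = xMin; x <= xMax; x++) {
        for (int y = yMin; y <= yMax; y++) {
            if (x > 0 && y > 0)  /* border elements keep their initial value */
                mesh[x][y] = f(mesh[x-1][y-1], mesh[x][y-1], mesh[x-1][y]);
        }
    }
}

/* Initialize every element to 1, then compute the 2x2 mesh-blocks in
   row-major block order, which respects the three-neighbour dependences. */
void computeWholeMesh(long mesh[N][N]) {
    for (int x = 0; x < N; x++)
        for (int y = 0; y < N; y++)
            mesh[x][y] = 1;
    for (int bx = 0; bx < N; bx += 2)
        for (int by = 0; by < N; by += 2)
            computeMeshBlock(mesh, bx, bx + 1, by, by + 1);
}
```

The loop body is the same expression a sequential program would use; only the loop bounds change to the limits of the mesh-block, which is the point of direct mesh-access.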
Engine-side code
• Engine-side code in the Wavefront pattern.
• Element-level data-dependencies, specified by the user, are automatically extended to mesh-block-level data-dependencies.
[Figure: the 6×6 mesh (elements 0 to 35) and its nine 2×2 mesh-blocks A to I, illustrating element-level and mesh-block-level data-dependencies.]
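One way to picture this extension step is the C sketch below, which derives block-level dependences for a 6×6 mesh with 2×2 mesh-blocks. It assumes the three-neighbour stencil as the element-level dependences and uses hypothetical names throughout; how the MAP3S generator actually performs the extension is not described on the slide.

```c
#define MESH 6    /* 6x6 mesh, as in the figure */
#define BLOCK 2   /* 2x2 mesh-blocks */
#define NBLOCKS ((MESH / BLOCK) * (MESH / BLOCK))

/* Index (0..8, i.e. A..I) of the mesh-block containing element (x, y). */
static int blockIndex(int x, int y) {
    return (x / BLOCK) * (MESH / BLOCK) + (y / BLOCK);
}

/* Set dep[b][c] = 1 when some element of block b depends on an element of a
   different block c. Element-level dependences are the three-neighbour
   stencil with condition x > 0 && y > 0 (an assumption for this sketch). */
void liftToBlockDependences(int dep[NBLOCKS][NBLOCKS]) {
    static const int offsets[3][2] = {{-1, -1}, {-1, 0}, {0, -1}};
    for (int b = 0; b < NBLOCKS; b++)
        for (int c = 0; c < NBLOCKS; c++)
            dep[b][c] = 0;
    for (int x = 0; x < MESH; x++) {
        for (int y = 0; y < MESH; y++) {
            if (!(x > 0 && y > 0)) continue;  /* shape-list condition not met */
            for (int k = 0; k < 3; k++) {
                int b = blockIndex(x, y);
                int c = blockIndex(x + offsets[k][0], y + offsets[k][1]);
                if (b != c) dep[b][c] = 1;
            }
        }
    }
}
```

With these assumptions block E (index 4) ends up depending on blocks A, B, and D, while block A depends on no other block and can therefore be computed first.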
Engine-side code
• The mesh-block-level data-dependencies are used to establish a parallel-computation schedule.
[Figure: mesh-blocks A to I arranged according to the parallel-computation schedule.]
Engine-side code
• The parallel-computation schedule is refined by assigning mesh-blocks to the processors in a round-robin fashion (shown).
• The parallel-computation schedule is then complemented with a parallel-communication schedule (not shown).
• The engine-side code executes user-side code in accordance with the parallel computation and communication schedules.
[Figure: mesh-blocks A to I divided between CPU 0 and CPU 1.]
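The two scheduling steps can be sketched in C as follows. This is a simplified illustration under stated assumptions: each mesh-block depends on its left, upper, and upper-left neighbour blocks, a block runs one step after its latest dependence, and blocks are dealt to CPUs round-robin in step order. The names are hypothetical and the resulting assignment is not necessarily the one drawn on the slide.

```c
#define BPS 3                 /* mesh-blocks per side: 3x3 blocks A..I */
#define NBLOCKS (BPS * BPS)

static int maxInt(int a, int b) { return a > b ? a : b; }

/* Earliest step at which each block can run, given that a block depends on
   its left, upper, and upper-left neighbour blocks (assumed dependences). */
void scheduleSteps(int step[NBLOCKS]) {
    for (int r = 0; r < BPS; r++) {
        for (int c = 0; c < BPS; c++) {
            int s = 0;  /* a block without dependences runs at step 0 */
            if (r > 0) s = maxInt(s, step[(r - 1) * BPS + c] + 1);
            if (c > 0) s = maxInt(s, step[r * BPS + (c - 1)] + 1);
            if (r > 0 && c > 0) s = maxInt(s, step[(r - 1) * BPS + (c - 1)] + 1);
            step[r * BPS + c] = s;
        }
    }
}

/* Assign blocks to CPUs round-robin, visiting them in schedule-step order. */
void assignRoundRobin(const int step[NBLOCKS], int numCpus, int owner[NBLOCKS]) {
    int next = 0;  /* next CPU to receive a block */
    for (int s = 0; s <= 2 * (BPS - 1); s++) {
        for (int b = 0; b < NBLOCKS; b++) {
            if (step[b] == s) {
                owner[b] = next;
                next = (next + 1) % numCpus;
            }
        }
    }
}
```

Blocks that share a step have no dependences on one another in this model, so spreading them across CPUs is what exposes the parallelism of the wavefront.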
Engine-side code
• Execution of user-side code by the engine-side code when using a sequential prologue and epilogue.
[Figure: timeline across CPU 0 and CPU 1, in order of time: Prelude; Prologue(A); Body(A); Body(D), Body(B); Body(E), Body(C), Body(G); Epilogue(A,B,C,D,E,G).]
Mesh representation
• The mesh can be represented using either the dense mesh-representation or the sparse mesh-representation.
• The sparse representation can have better locality and can distribute the memory footprint of the mesh among nodes.
[Figure: a 2D mesh of 64 elements (0 to 63) shown twice: a 2D mesh in dense mesh-representation and a 2D mesh in sparse mesh-representation.]
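The difference between the two layouts can be sketched as two offset computations in C. This is one plausible reading rather than the documented MAP3S layout: it assumes that the dense representation stores the whole mesh row by row and that the sparse representation stores each mesh-block contiguously, which would allow each block to be allocated separately. The function names are hypothetical.

```c
/* Offset of element (row, col) when the whole mesh is stored row by row. */
int denseOffset(int row, int col, int meshCols) {
    return row * meshCols + col;
}

/* Offset of element (row, col) when each mesh-block is stored contiguously,
   one block after another. Assumes block sizes divide the mesh sizes evenly. */
int blockedOffset(int row, int col, int meshCols, int blockRows, int blockCols) {
    int blocksPerRow = meshCols / blockCols;
    int block = (row / blockRows) * blocksPerRow + (col / blockCols);
    int within = (row % blockRows) * blockCols + (col % blockCols);
    return block * (blockRows * blockCols) + within;
}
```

For an 8×8 mesh cut into blocks of 8 rows by 2 columns, the element at row 0, column 2 has dense offset 2 but blocked offset 16, because it is the first element of the second block. With per-block storage, a node only needs memory for the blocks it actually uses.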
Mesh representation
• The memory footprint of a mesh can be as much of a problem as performance. The combination of a parallel prologue and epilogue with the sparse mesh-representation both minimizes and distributes the mesh-storage memory-footprint.
• Memory-footprint reduction varies. It is most effective for large Simulation computations.
[Figure: the mesh-blocks stored on CPU 0 and on CPU 1 in three panels: "Original", "Do not store dead mesh-blocks", and "Only store non-owned mesh-blocks that are used by owned mesh-blocks".]
Experimental evaluation
• Problems:
  • 2D problems:
    • GoL: game-of-life (Simulation)
    • LUMD: lower/upper matrix-decomposition (Wavefront)
  • 3D problems:
    • RTA: room-temperature annealing (Simulation)
    • MSA: multiple-sequence alignment (Wavefront)
• Hardware:
  • GigE: a 16-node cluster with Gigabit Ethernet
  • IB: a 128-node cluster with InfiniBand (limited to 64)
Experimental evaluation
• Speedup on GigE:
  • The x-axis is the number of nodes.
  • The y-axis is the speedup.
• Performance gains on LUMD and MSA are worse than on GoL and RTA:
  • LUMD has non-uniform computation intensity, which limits parallelism.
  • MSA has limited computation granularity, which increases the relative overhead of communication and synchronization.
[Figure: four speedup plots: GoL (2D Simulation), RTA (3D Simulation), LUMD (2D Wavefront), MSA (3D Wavefront).]
Experimental evaluation
• Speedup on IB:
  • The x-axis is the number of nodes.
  • The y-axis is the speedup.
• Performance gains on LUMD and MSA are worse than on GoL and RTA, for the same reasons as on GigE.
[Figure: four speedup plots: GoL (2D Simulation), RTA (3D Simulation), LUMD (2D Wavefront), MSA (3D Wavefront).]
Experimental evaluation
• Capability:
  • Use of the sparse mesh-representation distributes the mesh memory-footprint across multiple nodes.
  • This allows handling of meshes whose memory footprint exceeds the memory capacity of a single node.
• Using 16 nodes on GigE:
  • For LUMD, the memory-footprint reduction is limited because the computation of each element depends on a larger number of elements.
Experimental evaluation
• What we learned:
  • Dense meshes + large computation granularity:
    • MAP3S delivers speedups in the range of 10 to 12 on 16 nodes,
    • and in the range of 10 to 43 on 64 nodes.
  • Sparse meshes:
    • Smaller speedups.
    • Memory consumption is reduced by 20% to 50% (per node).
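To put the dense-mesh speedups in perspective, they can be converted to parallel efficiency using the standard definition E = S / p (speedup divided by node count). This definition is not stated on the slides; the ranges below are simple arithmetic on the reported figures.

$$
E_{16} = \frac{S}{16} \in \left[\frac{10}{16},\ \frac{12}{16}\right] = [0.625,\ 0.75],
\qquad
E_{64} = \frac{S}{64} \in \left[\frac{10}{64},\ \frac{43}{64}\right] \approx [0.16,\ 0.67].
$$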