MSc in High Performance Computing, Computational Chemistry Module
Lecture 8: Introducing One-sided Communications
Martyn F Guest, Huub van Dam and Paul Sherwood
CCLRC Daresbury Laboratory
p.sherwood@daresbury.ac.uk
Outline of the Lecture • One-sided vs. two-sided communication strategies • Implementation in the Global Arrays toolkit and ARMCI • An example: a 1-D data transpose • Practical Session • Programming a matrix multiply using one-sided communication primitives
Review of Message Passing Concepts • Messages are the only form of communication • all communication is therefore explicit • Most systems use the SPMD model • all processes run exactly the same code • each has a unique ID • processes can take different branches in the same code • Basic form is point-to-point • collective communications implement the more complicated patterns that occur in many codes
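The SPMD model above can be made concrete in a few lines of MPI. The following is a minimal sketch (standard MPI C, not taken from the lecture material): every process runs the same executable, obtains its unique ID, and branches on it.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
  int rank, size;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* unique ID of this process */
  MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes */

  /* same code everywhere, but different branches depending on the ID */
  if (rank == 0)
    printf("process 0 of %d taking the 'master' branch\n", size);
  else
    printf("process %d of %d taking the 'worker' branch\n", rank, size);

  MPI_Finalize();
  return 0;
}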
Two-sided communication • Communication is broken down into messages, each with a sending and a receiving node; the consequences are: • The receiving node must enter a receive call (e.g. MPI_Recv) from the user's own code • This must be designed into the parallelisation algorithm • e.g. alternating compute and communication phases, cf. the systolic loop algorithm in MD • If task sizes are unpredictable, this may lead to inefficiency due to load-balancing issues • Program complexity is increased by the need to ensure both sending and receiving nodes are available
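To illustrate why both sides must participate, here is a minimal two-sided sketch (plain MPI C, not from the lecture): the transfer only completes because rank 0 posts MPI_Send and rank 1 explicitly enters the matching MPI_Recv in its own code.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
  int rank, data = 0;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  if (rank == 0) {
    data = 42;
    /* the sending node must call MPI_Send ...                      */
    MPI_Send(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
  } else if (rank == 1) {
    /* ... and the receiving node must enter the matching MPI_Recv,
       which has to be designed into the parallelisation algorithm  */
    MPI_Recv(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    printf("rank 1 received %d\n", data);
  }

  MPI_Finalize();
  return 0;
}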
One-Sided Communication • A one-sided communication is initiated by the node wishing to read or write memory on a remote node. • Some typical one-sided operations: • put – write to remote memory • get – read from remote memory • accumulate – atomically add local data into remote memory (a remote read-modify-write) • The user code running on the node owning the memory does not explicitly handle the request for data • One-sided operations are naturally supported by shared memory systems (they look the same as local memory accesses) • On distributed memory systems, messages must be generated to transfer data.
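The put/get/accumulate operations above map directly onto one-sided libraries such as ARMCI, the communication layer used underneath the Global Arrays (and mentioned in the lecture outline). The fragment below is only a hedged sketch assuming the ARMCI C bindings (ARMCI_Malloc, ARMCI_Put, ARMCI_Acc, ARMCI_Get, ARMCI_AllFence); the exact prototypes should be checked against the ARMCI documentation for your platform.

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#include "armci.h"

int main(int argc, char **argv)
{
  int me, nproc, next;
  double **buf;                    /* one remotely accessible address per process */
  double local, got = 0.0, scale = 1.0;

  MPI_Init(&argc, &argv);          /* ARMCI uses MPI for process management */
  ARMCI_Init();
  MPI_Comm_rank(MPI_COMM_WORLD, &me);
  MPI_Comm_size(MPI_COMM_WORLD, &nproc);
  next = (me + 1) % nproc;

  /* collectively allocate remotely accessible memory: buf[p] is the segment on process p */
  buf = (double **) malloc(nproc * sizeof(double *));
  ARMCI_Malloc((void **) buf, sizeof(double));
  *buf[me] = 0.0;
  MPI_Barrier(MPI_COMM_WORLD);

  /* put: write our rank into the neighbour's memory; 'next' takes no action at all */
  local = (double) me;
  ARMCI_Put(&local, buf[next], sizeof(double), next);

  /* accumulate: remote value += scale * local, again with no action by 'next' */
  ARMCI_Acc(ARMCI_ACC_DBL, &scale, &local, buf[next], sizeof(double), next);

  ARMCI_AllFence();                /* complete all outstanding one-sided operations */
  MPI_Barrier(MPI_COMM_WORLD);

  /* get: read the neighbour's segment back into a local variable */
  ARMCI_Get(buf[next], &got, sizeof(double), next);
  printf("process %d sees %f on process %d\n", me, got, next);

  ARMCI_Free(buf[me]);
  ARMCI_Finalize();
  MPI_Finalize();
  return 0;
}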
One-Sided Communication • Implementations of one-sided communications • SHMEM • Offered by Cray to provide a virtual shared memory environment • Active Messages • A framework for providing handler functions to be called when messages of particular types arrive • e.g. LAPI from IBM, as used on HPCx • MPI-2 • Standardised implementation • Memory is exposed through windows, and computation is broken into synchronisation-delimited phases (access epochs) • It is not possible to overlap one- and two-sided message passing phases • Global Arrays • A toolkit to support one-sided programming • Implemented via calls to SHMEM, LAPI, etc.
One-Sided Approaches • Vendor provided tools • Cray T3D and T3E systems provided the SHMEM library • Subsequently implemented by other vendors (e.g. Quadrics)

#include <stdio.h>
#include <stdlib.h>
#include "shmem.h"

int main(int argc, char **argv)
{
  int my_pe, num_pe, target, source;
  int *value, *marker;

  /* initialise, then identify the target and source processes */
  shmem_init();
  my_pe  = _my_pe();
  num_pe = _num_pes();

  /* remotely written objects must be symmetric, so allocate with shmalloc */
  value  = (int *) shmalloc(sizeof(int));
  marker = (int *) shmalloc(sizeof(int));
  *value  = 0;
  *marker = 0;

  target = (my_pe == (num_pe - 1)) ? 0 : (my_pe + 1);
  source = (my_pe == 0) ? (num_pe - 1) : (my_pe - 1);
  shmem_barrier_all();

  /* write the data to the target, then (after a fence) set its marker flag */
  shmem_int_p(value, my_pe, target);
  shmem_fence();
  shmem_int_p(marker, 1, target);

  /* wait until our own marker has been set by the source PE */
  if (*marker == 0) shmem_int_wait(marker, 0);
  printf("%d got value %d from %d\n", my_pe, *value, source);

  shmem_barrier_all();
  shfree(value);
  shfree(marker);
  return 0;
}
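Note the pattern in this example: the data value is put first, shmem_fence() guarantees that it reaches the target before the subsequent put that sets the marker flag, and the target learns that the data has arrived simply by spinning on its own flag with shmem_int_wait. At no point does the target issue a receive call, which is precisely what distinguishes this from the two-sided model of the previous slides.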
Global Arrays • Distributed dense arrays that can be accessed through a shared-memory-like style • Physically distributed data • Single, shared data structure with global indexing • e.g., access A(4,3) rather than buf(7) on task 2 • (figure: the global address space view of the physically distributed data)
Global Arrays (cont.) • Shared memory model in context of distributed dense arrays • Much simpler than message-passing for many applications • Complete environment for parallel code development • Compatible with MPI • Data locality control similar to distributed memory/message passing model • Extensible • Scalable
Remote Data Access in GA
• Message Passing:
     identify size and location of data blocks
     loop over processors:
        if (me = P_N) then
           pack data in local message buffer
           send block of data to message buffer on P0
        else if (me = P0) then
           receive block of data from P_N in message buffer
           unpack data from message buffer to local buffer
        endif
     end loop
     copy local data on P0 to local buffer
• Global Arrays:
     NGA_Get(g_a, lo, hi, buffer, ld);
     where g_a is the global array handle, lo and hi are the global lower and upper indices of the data patch, and buffer and ld are the local buffer and its array of strides
(figure: the global array distributed as patches over processors P0–P3)
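The same single-call access is available from C. The fragment below is a hedged sketch using the Global Arrays C interface (GA_Initialize, NGA_Create, NGA_Get, etc.); note that the C interface uses zero-based indices, and the array sizes, type constants and MA settings here are illustrative only.

#include <stdio.h>
#include <mpi.h>
#include "ga.h"
#include "macdecls.h"

int main(int argc, char **argv)
{
  int g_a, me;
  int dims[2]  = {100, 100};          /* a global 100x100 integer array             */
  int chunk[2] = {-1, -1};            /* let GA choose the distribution             */
  int lo[2] = {3, 2}, hi[2] = {3, 2}; /* the single element A(4,3) in 1-based terms */
  int ld[1] = {1};
  int value;

  MPI_Init(&argc, &argv);
  GA_Initialize();
  if (!MA_init(C_INT, 1000000, 1000000)) GA_Error("MA_init failed", 0);
  me = GA_Nodeid();

  g_a = NGA_Create(C_INT, 2, dims, "array A", chunk);
  GA_Zero(g_a);
  GA_Sync();

  /* one-sided read of a patch: no matching call is needed on the owning process */
  NGA_Get(g_a, lo, hi, &value, ld);
  if (me == 0) printf("A(4,3) = %d\n", value);

  GA_Destroy(g_a);
  GA_Terminate();
  MPI_Finalize();
  return 0;
}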
GA Example: 1-D Transpose • Take a 1-D array A, store it in a distributed fashion (g_a) • Perform the transpose operation B(i) = A(n-i+1) for all i • Assume that each processor only needs to work with one patch to complete the operation. • (figure: A = a1 a2 a3 ... an and its reverse B = an ... a3 a2 a1)
Example: 1-D Transpose (cont.)

#define NDIM 1
#define TOTALELEMS 197
#define MAXPROC 128

      program main
      implicit none
#include "mafdecls.fh"
#include "global.fh"
      integer dims(3), chunk(3), nprocs, me, i, lo(3), hi(3), lo1(3)
      integer hi1(3), lo2(3), hi2(3), ld(3), nelem
      integer g_a, g_b, a(MAXPROC*TOTALELEMS), b(MAXPROC*TOTALELEMS)
      integer heap, stack, ichk, ierr
      logical status

      heap = 300000
      stack = 300000
Example: 1-D Transpose (cont.)

c     initialize communication library
      call mpi_init(ierr)
c     initialize ga library
      call ga_initialize()

      me = ga_nodeid()
      nprocs = ga_nnodes()
      dims(1) = nprocs*TOTALELEMS + nprocs/2  ! Unequal data distribution
      ld(1) = MAXPROC*TOTALELEMS
      chunk(1) = TOTALELEMS    ! Minimum amount of data on each processor
      status = ma_init(MT_F_DBL, stack/nprocs, heap/nprocs)

c     create a global array
      status = nga_create(MT_F_INT, NDIM, dims, "array A", chunk, g_a)
      status = ga_duplicate(g_a, g_b, "array B")

c     initialize data in GA
      do i = 1, dims(1)
         a(i) = i
      end do
      lo1(1) = 1
      hi1(1) = dims(1)
      if (me.eq.0) call nga_put(g_a, lo1, hi1, a, ld)
      call ga_sync()   ! Make sure data is distributed before continuing
Example: 1-D Transpose (cont.)

c     transpose data locally
      call nga_distribution(g_a, me, lo, hi)
      call nga_get(g_a, lo, hi, a, ld)   ! Use locality
      nelem = hi(1) - lo(1) + 1
      do i = 1, nelem
         b(i) = a(nelem - i + 1)
      end do

c     transpose data globally
      lo2(1) = dims(1) - hi(1) + 1
      hi2(1) = dims(1) - lo(1) + 1
      call nga_put(g_b, lo2, hi2, b, ld)
      call ga_sync()   ! Make sure transposition is complete
Example: 1-D Transpose (cont.)

c     check transpose
      call nga_get(g_a, lo1, hi1, a, ld)
      call nga_get(g_b, lo1, hi1, b, ld)
      ichk = 0
      do i = 1, dims(1)
         if (a(i).ne.b(dims(1)-i+1).and.me.eq.0) then
            write(6,*) "Mismatch at ", i
            ichk = ichk + 1
         endif
      end do
      if (ichk.eq.0.and.me.eq.0) write(6,*) "Transpose OK"

      status = ga_destroy(g_a)   ! Deallocate memory for arrays
      status = ga_destroy(g_b)
      call ga_terminate()
      call mpi_finalize(ierr)
      stop
      end
Instrumenting single-sided memory access • Approach 1: Instrument the puts, gets and the data server • Advantage: robust and accurate • Disadvantage: one does not always have access to the source of the data server • Approach 2: Instrument the puts and gets only, cheating on the source and destination of the messages • Advantage: no instrumentation of the data server required • Disadvantage: timings of the messages are inaccurate in the case of non-blocking communications, and trace displays can show spurious ("flashing") message lines due to the synchronisation corrections applied to the timers of different processors • In our work with the Global Arrays we have taken approach 2
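As an illustration of approach 2, a wrapper of the following kind can be placed around each get (and similarly each put): only the calling side is timed, and the owner of the patch is recorded as the other endpoint of the "message" even though its data server is never touched. This is a hypothetical sketch, not the instrumentation actually used in our study; traced_NGA_Get and trace_event are invented names, and the NGA_Locate call is assumed to return the process owning a given element.

#include <mpi.h>
#include "ga.h"

/* hypothetical trace emitter: operation, this rank, partner rank, start/end times */
extern void trace_event(const char *op, int me, int partner, double t0, double t1);

/* instrumented replacement for NGA_Get: time the call on the calling side only */
void traced_NGA_Get(int g_a, int lo[], int hi[], void *buf, int ld[])
{
  int owner;
  double t0, t1;

  owner = NGA_Locate(g_a, lo);     /* "cheat": attribute the message to the owner
                                      of the first element of the patch          */
  t0 = MPI_Wtime();
  NGA_Get(g_a, lo, hi, buf, ld);   /* the actual one-sided transfer              */
  t1 = MPI_Wtime();

  trace_event("get", GA_Nodeid(), owner, t0, t1);
}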
GA vs. MPI-2 • MPI-2 now provides a portable mechanism for one-sided communications • Memory is associated with one-sided communications by defining windows • One-sided (put/get) operations occur in well-defined regions of the code separated by fence calls • There are restrictions on what a code can do between synchronisation points, e.g. on mixing (point-to-point) messages, local computation, etc. • Standard – vendors will implement it • Global Arrays • Designed to make operations as lightweight as possible • Minimal synchronisation required • Works to exploit the overlap of communication and computation • Not standard, so portability problems on new platforms (will "the OpenFabrics Alliance" [www.openfabrics.org] cure this?)
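For comparison with the GA code shown earlier, here is a hedged sketch of the MPI-2 style described above (generic MPI-2 usage, not code from the lecture): local memory is exposed through a window, and the one-sided puts are confined to an access epoch delimited by MPI_Win_fence calls.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
  int rank, size, next, recvbuf = -1, sendval;
  MPI_Win win;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  next = (rank + 1) % size;

  /* expose one integer of local memory through an RMA window */
  MPI_Win_create(&recvbuf, sizeof(int), sizeof(int), MPI_INFO_NULL,
                 MPI_COMM_WORLD, &win);

  MPI_Win_fence(0, win);           /* fence opens the access epoch                */

  sendval = rank;
  MPI_Put(&sendval, 1, MPI_INT, next, 0, 1, MPI_INT, win);
                                   /* one-sided put; 'next' makes no call         */

  MPI_Win_fence(0, win);           /* second fence closes the epoch; only now is
                                      recvbuf guaranteed to contain the new value */

  printf("rank %d received %d\n", rank, recvbuf);

  MPI_Win_free(&win);
  MPI_Finalize();
  return 0;
}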