Learn about point-to-point communication methods and collective communication techniques in MPI, covering synchronization, buffering, asynchronous operations, and collective communication functions like broadcast and reduce.
MPI Communications • Point to Point • Collective Communication • Data Packaging
Point-to-Point Communication: Send and Receive • MPI_Send/MPI_Recv provide point-to-point communication • the synchronization protocol is not fully specified • what are the possibilities?
Send and Receive Synchronization • Fully Synchronized (Rendezvous) • Send and Receive complete simultaneously • whichever process reaches the Send/Receive first waits • provides a synchronization point (up to network delays) • Buffered • Receive must wait until the message is received • Send completes when the message is moved to a buffer, freeing the message memory for reuse
Send and Receive Synchronization • Asynchronous (different API call) • Sending process may proceed immediately • does not need to wait until the message is copied to a buffer • must check for completion before reusing the message memory • Receiving process may proceed immediately • will not have the message to use until it is received • must check for completion before using the message
MPI Send and Receive • MPI_Send/MPI_Recv are blocking, but buffering is unspecified • MPI_Recv suspends until the message is received • MPI_Send may be fully synchronous or may be buffered • implementation dependent • variants (MPI_Ssend, MPI_Bsend) allow synchronous or buffered behavior to be specified explicitly
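A minimal sketch (not from the original slides) of the blocking pair in C; the tag value 0 and the payload are arbitrary choices, and the program assumes at least two ranks:

  /* Rank 0 sends one int to rank 1 with the blocking MPI_Send/MPI_Recv pair. */
  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv) {
      int rank, value;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      if (rank == 0) {
          value = 42;
          /* May return once the message is buffered, or block until the
             matching receive starts -- implementation dependent. */
          MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
      } else if (rank == 1) {
          /* Blocks until the message has arrived. */
          MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
          printf("rank 1 received %d\n", value);
      }
      MPI_Finalize();
      return 0;
  }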
Asynchronous Send and Receive • MPI_Isend() / MPI_Irecv() are non-blocking. Control returns to the program as soon as the call is made. • Syntax is the same as for Send and Recv, except that an MPI_Request* parameter is appended for MPI_Isend and replaces the MPI_Status* parameter for MPI_Irecv.
Detecting Completion • MPI_Wait(&request, &status) • request matches the request from Isend or Irecv • status returns a status equivalent to the Recv status when complete • For a send, blocks until the message is buffered or sent, so the message variable is free for reuse • For a receive, blocks until the message has arrived and is ready to use
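A minimal sketch of the non-blocking pattern with MPI_Wait, assuming at least two ranks; the work placeholders are hypothetical:

  /* Overlap communication with work, then MPI_Wait before touching buffers. */
  #include <mpi.h>

  int main(int argc, char **argv) {
      int rank, out = 7, in = 0;
      MPI_Request req;
      MPI_Status status;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      if (rank == 0) {
          MPI_Isend(&out, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
          /* ... do work that does not modify 'out' ... */
          MPI_Wait(&req, &status);   /* now 'out' may be reused */
      } else if (rank == 1) {
          MPI_Irecv(&in, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &req);
          /* ... do work that does not read 'in' ... */
          MPI_Wait(&req, &status);   /* now 'in' holds the message */
      }
      MPI_Finalize();
      return 0;
  }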
Detecting Completion • MPI_Test(&request, &flag, &status) • request, status as for MPI_Wait • does not block • flag indicates whether the message has been sent/received • enables code that can repeatedly check for communication completion, as in the sketch below
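That polling pattern, sketched as a fragment continuing the previous example; it assumes req came from an earlier MPI_Irecv, and note that flag is passed by address:

  /* Poll for completion with MPI_Test while making progress on other work. */
  int done = 0;
  MPI_Status status;
  while (!done) {
      MPI_Test(&req, &done, &status);  /* non-blocking completion check */
      if (!done) {
          /* ... make progress on other work ... */
      }
  }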
Collective Communications • One to Many (Broadcast, Scatter) • Many to One (Reduce, Gather) • Many to Many (Allreduce, Allgather)
Collective Communications • In general, all processes (senders and receivers) call the same MPI function for collective communication • The senders and receivers are distinguished by the parameters of the call, such as the root rank
Broadcast • A selected processor sends to all other processors in the communicator • Any type of message can be sent • The size of the message should be known by all (it could be broadcast first) • Can be optimized within the system for any given architecture
MPI_Bcast() Syntax
MPI_Bcast(mess, count, MPI_INT, root, MPI_COMM_WORLD);
  mess: pointer to the message buffer
  count: number of items sent
  MPI_INT: type of items sent (note: count and type should be the same on all processors)
  root: rank of the sending processor
  MPI_COMM_WORLD: communicator within which the broadcast takes place
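A minimal sketch; the message contents and the choice of rank 0 as root are illustrative, not from the slides:

  /* Root (rank 0) broadcasts an array of 4 ints to every rank. */
  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv) {
      int rank, mess[4] = {0};
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      if (rank == 0) { mess[0] = 1; mess[1] = 2; mess[2] = 3; mess[3] = 4; }
      /* Every rank, root included, makes the same call. */
      MPI_Bcast(mess, 4, MPI_INT, 0, MPI_COMM_WORLD);
      printf("rank %d has mess[0] = %d\n", rank, mess[0]);
      MPI_Finalize();
      return 0;
  }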
MPI_Barrier()
MPI_Barrier(MPI_COMM_WORLD);
  MPI_COMM_WORLD: communicator within which the barrier takes place
Provides barrier synchronization without the message of a broadcast. A barrier is a point in the code where all processes must stop and wait until every process has reached it. Once all processes have executed the MPI_Barrier call, all of them can continue.
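One common use, sketched below as a fragment (it assumes the surrounding code is inside an initialized MPI program): barriers bracket a timed phase so the measurement covers the same interval on every process.

  /* Synchronize before and after a timed phase. */
  MPI_Barrier(MPI_COMM_WORLD);
  double t0 = MPI_Wtime();
  /* ... phase being timed ... */
  MPI_Barrier(MPI_COMM_WORLD);
  double elapsed = MPI_Wtime() - t0;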
Reduce • All processors send to a single processor: the reverse of broadcast • Information must be combined at the receiver • Several combining operations are available • MAX, MIN, SUM, PROD (product), LAND (logical and), BAND (bitwise and), LOR (logical or), BOR (bitwise or), LXOR (logical xor), BXOR (bitwise xor), MAXLOC (maximum value and the rank of the process holding it), MINLOC (minimum value and the rank of the process holding it)
MPI_Reduce() syntax
MPI_Reduce(&dataIn, &result, count, MPI_DOUBLE, MPI_SUM, root, MPI_COMM_WORLD);
  dataIn: data sent from each processor
  result: stores the result of the combining operation
  count: number of items in each of dataIn, result
  MPI_DOUBLE: data type of dataIn, result
  MPI_SUM: combining operation
  root: rank of the processor receiving the data
  MPI_COMM_WORLD: communicator
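A minimal sketch: each rank contributes its own rank as a double, and rank 0 (an arbitrary choice of root) receives the sum:

  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv) {
      int rank;
      double dataIn, result = 0.0;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      dataIn = (double)rank;                /* each rank's contribution */
      MPI_Reduce(&dataIn, &result, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
      if (rank == 0) printf("sum of ranks = %f\n", result);
      MPI_Finalize();
      return 0;
  }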
MPI_Reduce() • Data and result may be arrays; the combining operation is applied element-by-element • It is illegal to alias dataIn and result (i.e., to pass the same buffer for both) • this avoids the overhead of the function checking for overlap
MPI_Scatter() • Spreads an array across all processors • The source is an array on the sending processor • Each receiver, including the sender, gets the piece of the array corresponding to its rank in the communicator
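A sketch of scatter, assuming one element per rank; the values in the source array are illustrative:

  /* Rank 0 scatters one int to each of 'size' ranks. */
  #include <mpi.h>
  #include <stdio.h>
  #include <stdlib.h>

  int main(int argc, char **argv) {
      int rank, size, piece;
      int *all = NULL;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);
      if (rank == 0) {
          all = malloc(size * sizeof(int));
          for (int i = 0; i < size; i++) all[i] = i * 10;
      }
      /* Each rank (root included) receives the element matching its rank. */
      MPI_Scatter(all, 1, MPI_INT, &piece, 1, MPI_INT, 0, MPI_COMM_WORLD);
      printf("rank %d got %d\n", rank, piece);
      free(all);
      MPI_Finalize();
      return 0;
  }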
MPI_Gather() • The opposite of Scatter • Values from all processors (in the communicator) are collected into an array on the receiver • Array locations correspond to the ranks of the processors
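A matching gather sketch; each rank's contribution (its rank squared) is an arbitrary example value:

  /* Each rank sends one int; rank 0 collects them in rank order. */
  #include <mpi.h>
  #include <stdio.h>
  #include <stdlib.h>

  int main(int argc, char **argv) {
      int rank, size, mine;
      int *all = NULL;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);
      mine = rank * rank;
      if (rank == 0) all = malloc(size * sizeof(int));
      MPI_Gather(&mine, 1, MPI_INT, all, 1, MPI_INT, 0, MPI_COMM_WORLD);
      if (rank == 0)
          for (int i = 0; i < size; i++) printf("all[%d] = %d\n", i, all[i]);
      free(all);
      MPI_Finalize();
      return 0;
  }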
Many to Many Communications • MPI_Allreduce • Syntax like Reduce, except there is no root parameter • All processes get the result • MPI_Allgather • Syntax like Gather, except there is no root parameter • All processes get the resulting array • Underneath: a virtual butterfly network
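A fragment illustrating the missing-root syntax; compute_local_part() is a hypothetical helper standing in for each rank's local computation:

  /* MPI_Allreduce: like MPI_Reduce but with no root -- every rank
     ends up with the combined result. */
  double local = compute_local_part();   /* hypothetical helper */
  double total;
  MPI_Allreduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
  /* 'total' is now identical on every rank. */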
Data packaging • Needed to combine irregular, non-contiguous data into a single message • Pack/Unpack: explicitly pack data into a buffer, send, then unpack the data from the buffer • Derived datatypes: MPI datatypes, possibly heterogeneous, that can be sent as a single message
MPI_Pack() syntax
MPI_Pack(Aptr, count, MPI_DOUBLE, buffer, size, &pos, MPI_COMM_WORLD);
  Aptr: pointer to the data to pack
  count: number of items to pack
  MPI_DOUBLE: type of the items
  buffer: buffer being packed
  size: size of buffer (in bytes)
  pos: position in buffer (in bytes), updated by the call
  MPI_COMM_WORLD: communicator
MPI_Unpack() • reverses the operation of MPI_Pack() MPI_Unpack(buffer, size, &pos, Aptr, count, MPI_DOUBLE, MPI_COMM_WORLD);
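A round-trip sketch combining Pack, a broadcast of the packed buffer (as MPI_PACKED), and Unpack; the 100-byte buffer size is an arbitrary assumption:

  /* Pack an int and a double into one buffer, broadcast, unpack everywhere. */
  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv) {
      int rank, n = 0, pos;
      double x = 0.0;
      char buffer[100];
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      pos = 0;
      if (rank == 0) {
          n = 5; x = 3.14;
          MPI_Pack(&n, 1, MPI_INT, buffer, 100, &pos, MPI_COMM_WORLD);
          MPI_Pack(&x, 1, MPI_DOUBLE, buffer, 100, &pos, MPI_COMM_WORLD);
      }
      MPI_Bcast(buffer, 100, MPI_PACKED, 0, MPI_COMM_WORLD);
      pos = 0;   /* reset position before unpacking */
      MPI_Unpack(buffer, 100, &pos, &n, 1, MPI_INT, MPI_COMM_WORLD);
      MPI_Unpack(buffer, 100, &pos, &x, 1, MPI_DOUBLE, MPI_COMM_WORLD);
      printf("rank %d: n = %d, x = %f\n", rank, n, x);
      MPI_Finalize();
      return 0;
  }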