
Presentation Transcript


  1. Introduction to MPI Programming (Part II) Michael Griffiths, Deniz Savas & Alan Real January 2006

  2. Overview • Review point-to-point communications • Data types • Data packing • Collective Communication • Broadcast, Scatter & Gather of data • Reduction Operations • Barrier Synchronisation

  3. Blocking operations • Relate to when the operation has completed • Only return from the subroutine call when the operation has completed

  4. Non-blocking operations • Return straight away and allow the sub-program to return to perform other work. • At some time later the sub-program should test or wait for the completion of the non-blocking operation. • A non-blocking operation immediately followed by a matching wait is equivalent to a blocking operation. • Non-blocking operations are not the same as sequential subroutine calls as the operation continues after the call has returned.

  5. Sender mode, MPI call (F/C) and completion status: • Synchronous send, MPI_Ssend: only completes when the receive has completed. • Non-blocking send, MPI_Isend: returns immediately; completion must be checked later (e.g. with MPI_Wait or MPI_Test). • Buffered send, MPI_Bsend: always completes (unless an error occurs), irrespective of receiver. • Standard send, MPI_Send: can be synchronous or buffered (often implementation dependent). • Ready send, MPI_Rsend: always completes (unless an error occurs), irrespective of whether the receive has completed. • Blocking send and receive, MPI_Sendrecv: completes when both the send and the matching receive have completed.

  6. Non-blocking communication • Separate communication into three phases: • Initiate non-blocking communication • Do some work: • Perhaps involving other communications • Wait for non-blocking communication to complete.
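As an illustration of these three phases, here is a minimal C sketch; the routine name exchange, the partner rank and the buffer sizes are assumptions for illustration, not part of the course material:

    #include <mpi.h>

    /* Exchange n doubles with a neighbouring rank, overlapping the
     * communication with other work. */
    void exchange(double *sendbuf, double *recvbuf, int n, int partner)
    {
        MPI_Request reqs[2];
        MPI_Status  stats[2];

        /* Phase 1: initiate the non-blocking communication. */
        MPI_Irecv(recvbuf, n, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Isend(sendbuf, n, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, &reqs[1]);

        /* Phase 2: do some work that does not touch either buffer. */
        /* ... other computation or communication ... */

        /* Phase 3: wait for both operations to complete before reusing the buffers. */
        MPI_Waitall(2, reqs, stats);
    }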

  7. Non-blocking send • Send is initiated and returns straight away. • Sending process can do other things. • Can test later whether operation has completed. [Diagram: a process in MPI_COMM_WORLD issues a send request, carries on working and later calls Wait; the destination process posts the matching Receive.]

  8. Non-blocking receive • Receive is initiated and returns straight away. • Receiving process can do other things. • Can test later whether operation has completed. [Diagram: a process in MPI_COMM_WORLD issues a receive request, carries on working and later calls Wait; the source process performs the matching Send.]

  9. The Request Handle • Same arguments as non-blocking call • Additional request handle • In C/C++ is of type MPI_Request/MPI::Request • In Fortran is an INTEGER • Request handle is allocated when a communication is initiated • Can query to test whether non-blocking operation has completed

  10. Non-blocking synchronous send • Fortran: CALL MPI_ISSEND(buf, count, datatype, dest, tag, comm, request, error) CALL MPI_WAIT(request, status, error) • C: MPI_Issend(&buf, count, datatype, dest, tag, comm, &request); MPI_Wait(&request, &status); • C++: request = comm.Issend(&buf, count, datatype, dest, tag); request.Wait();

  11. Non-blocking synchronous receive • Fortran: CALL MPI_IRECV(buf, count, datatype, src, tag, comm, request, error) CALL MPI_WAIT(request, status, error) • C: MPI_Irecv(&buf, count, datatype, src, tag, comm, &request); MPI_Wait(&request, &status); • C++: request = comm.Irecv(&buf, count, datatype, src, tag); request.Wait(status);

  12. Blocking v Non-blocking • Send and receive can be blocking or non-blocking. • A blocking send can be used with a non-blocking receive, and vice versa. • Non-blocking sends can use any mode: • Standard send: MPI_Send(…) / Comm.Send(…) • Synchronous send: MPI_Ssend(…) / Comm.Ssend(…) • Buffered send: MPI_Bsend(…) / Comm.Bsend(…) • Ready send: MPI_Rsend(…) / Comm.Rsend(…) • Receive: MPI_Recv(…) / Comm.Recv(…) (MPI call shown as Fortran/C, then C++) • Synchronous mode affects completion, not initiation. • A non-blocking call followed by an explicit wait is identical to the corresponding blocking communication.

  13. Completion • Can either wait or test for completion: • Fortran (LOGICAL flag): CALL MPI_WAIT(request, status, ierror) CALL MPI_TEST(request, flag, status, ierror) • C (int flag): MPI_Wait(&request, &status); MPI_Test(&request, &flag, &status); • C++ (bool flag): request.Wait(); flag = request.Test(); (for sends) request.Wait(status); flag = request.Test(status); (for receives)
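A small C sketch of testing for completion while doing other work in between; buf, count, dest, tag and do_other_work() are placeholders assumed to be defined elsewhere:

    MPI_Request request;
    MPI_Status  status;
    int flag = 0;

    MPI_Issend(buf, count, MPI_INT, dest, tag, MPI_COMM_WORLD, &request);

    while (!flag) {
        do_other_work();                    /* assumed user routine */
        MPI_Test(&request, &flag, &status); /* flag becomes true once the send has completed */
    }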

  14. Other related wait and test routines • If multiple non-blocking calls are issued … • MPI_TESTANY: tests whether any one of a list of requests (they could be send or receive requests) has completed. • MPI_WAITANY: waits until any one of the list of requests has completed. • MPI_TESTALL: tests whether all the requests in a list have completed. • MPI_WAITALL: waits until all the requests in a list have completed. • MPI_PROBE, MPI_IPROBE: allow incoming messages to be checked for without actually receiving them. Note that MPI_PROBE is blocking: it waits until there is something to probe for. • MPI_CANCEL: cancels pending communication. A last-resort, clean-up operation! • All routines take an array of requests and can return an array of statuses. • The ‘any’ routines return the index of the completed operation.
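For example, a C sketch assuming nreq non-blocking receives have already been started with their handles stored in reqs[] and a statuses[] array of the same length; only one of the two alternatives below would be used:

    /* Alternative 1: block until every request has completed. */
    MPI_Waitall(nreq, reqs, statuses);

    /* Alternative 2: service each message as soon as it arrives. */
    for (int i = 0; i < nreq; i++) {
        int index;
        MPI_Status status;
        MPI_Waitany(nreq, reqs, &index, &status);  /* index of a completed request */
        /* ... process the data belonging to request 'index' ... */
    }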

  15. Merging send and receive operations into a single unit • The following is the syntax of the MPI_Sendrecv command: • IN C: • int MPI_Sendrecv(void *sendbuf, int sendcount, MPI_Datatype sendtype, int dest, int sendtag, void *recvbuf, int recvcount, MPI_Datatype recvtype, int source, int recvtag, MPI_Comm comm, MPI_Status *status) • IN FORTRAN: • <sendtype> sendbuf(:) • <recvtype> recvbuf(:) • INTEGER sendcount, sendtype, dest, sendtag, recvcount, recvtype • INTEGER source, recvtag, comm, status(MPI_STATUS_SIZE), ierror • MPI_SENDRECV(sendbuf, sendcount, sendtype, dest, sendtag, recvbuf, recvcount, recvtype, source, recvtag, comm, status, ierror)

  16. Important Notes about MPI_Sendrecv • Beware! A message sent by MPI_Sendrecv is receivable by a regular receive operation if the destination and tag match. • MPI_PROC_NULL can be specified as the destination or source to allow one-directional working (useful for the end nodes in non-circular communication). • Any communication with MPI_PROC_NULL returns immediately with no effect, but as if the operation had been successful. This can make programming easier. • The send and receive buffers must not overlap; they must be separate memory locations. This restriction can be avoided by using the MPI_Sendrecv_replace routine.
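To illustrate the MPI_PROC_NULL point, a C sketch of a one-directional shift along a line of processes (an illustrative fragment, not one of the course exercises; MPI is assumed to be initialised):

    int rank, size, left, right;
    double sendval, recvval = 0.0;
    MPI_Status status;

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* End processes talk to MPI_PROC_NULL, so no special-case code is needed. */
    left  = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
    right = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

    sendval = (double) rank;
    /* Send to the right neighbour and receive from the left one in a single call. */
    MPI_Sendrecv(&sendval, 1, MPI_DOUBLE, right, 0,
                 &recvval, 1, MPI_DOUBLE, left,  0,
                 MPI_COMM_WORLD, &status);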

  17. Data Packing • Up until now we have only seen contiguous data of pre-defined data types being communicated by MPI calls. This can be rather restricting if what we intend to transfer involves structures of data made up of mixtures of primitive data types, such as an integer count followed by a sequence of real numbers. • One solution to this problem is to use the MPI_PACK and MPI_UNPACK routines. The philosophy is similar to Fortran write/read to/from internal buffers and the sprintf/sscanf functions in C. • MPI_PACK can be called consecutively to compress the data into a send buffer; the resulting buffer of data can then be sent by using MPI_SEND (or equivalent) with the datatype set to MPI_PACKED. • At the receiving end it can be received by using MPI_RECV with the datatype MPI_PACKED. The received data can then be unpacked using MPI_UNPACK to recover the originally packed data. • This way of working can also improve communications efficiency by reducing the number of send/receive calls: there are usually fixed overheads associated with setting up a communication that cause inefficiencies if the messages sent and received are too small.

  18. MPI_Pack • Fortran: • <type> INBUF(:), OUTBUF(:) • INTEGER INCOUNT, DATATYPE, OUTSIZE, POSITION, COMM, IERROR • MPI_PACK(INBUF, INCOUNT, DATATYPE, OUTBUF, OUTSIZE, POSITION, COMM, IERROR) • C: • int MPI_Pack(void *inbuf, int incount, MPI_Datatype datatype, void *outbuf, int outsize, int *position, MPI_Comm comm) • Packs the message in inbuf, of type datatype and length incount, and stores it in outbuf. outsize is the maximum length of outbuf in bytes (its capacity), rather than the size actually used. • On entry, position indicates the starting location in outbuf where data will be written. On exit, position points to the first free position in outbuf following the location occupied by the packed message. This can then be used directly as the position parameter for the next MPI_Pack call.

  19. MPI_Unpack • Fortran: • <type> INBUF(:), OUTBUF(:) • INTEGER INSIZE, POSITION, OUTCOUNT, DATATYPE, COMM, IERROR • MPI_UNPACK(INBUF, INSIZE, POSITION, OUTBUF, OUTCOUNT, DATATYPE, COMM, IERROR) • C: • int MPI_Unpack(void *inbuf, int insize, int *position, void *outbuf, int outcount, MPI_Datatype datatype, MPI_Comm comm) • Unpacks the message in inbuf as data of type datatype and length outcount, and stores it in outbuf. • On entry, position indicates the starting location in inbuf where data will be read from. On exit, position points to the first position of the next set of data in inbuf. This can then be used directly as the position parameter for the next MPI_Unpack call.
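Putting the two routines together, a C sketch that packs an integer count followed by that many doubles, sends the packed buffer from rank 0 to rank 1, and unpacks it; the 1000-byte buffer size is an arbitrary assumption and rank is assumed to have been obtained already:

    char   buffer[1000];
    int    position, n = 5;
    double values[5] = {1.0, 2.0, 3.0, 4.0, 5.0};
    MPI_Status status;

    if (rank == 0) {
        position = 0;
        MPI_Pack(&n, 1, MPI_INT, buffer, 1000, &position, MPI_COMM_WORLD);
        MPI_Pack(values, n, MPI_DOUBLE, buffer, 1000, &position, MPI_COMM_WORLD);
        /* 'position' now holds the number of bytes actually packed. */
        MPI_Send(buffer, position, MPI_PACKED, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(buffer, 1000, MPI_PACKED, 0, 0, MPI_COMM_WORLD, &status);
        position = 0;
        MPI_Unpack(buffer, 1000, &position, &n, 1, MPI_INT, MPI_COMM_WORLD);
        MPI_Unpack(buffer, 1000, &position, values, n, MPI_DOUBLE, MPI_COMM_WORLD);
    }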

  20. Collective Communication

  21. Overview • Introduction & characteristics • Barrier Synchronisation • Global reduction operations • Predefined operations • Broadcast • Scatter • Gather • Partial sums • Exercise:

  22. Collective communications • Are higher-level routines involving several processes at a time. • Can be built out of point-to-point communications. • Examples are: • Barriers • Broadcast • Reduction operations

  23. Collective Communication • Communications involving a group of processes. • Called by all processes in a communicator. • Examples: • Broadcast, scatter, gather (Data Distribution) • Global sum, global maximum, etc. (Reduction Operations) • Barrier synchronisation • Characteristics • Collective communication will not interfere with point-to-point communication and vice-versa. • All processes must call the collective routine. • Synchronization not guaranteed (except for barrier) • No non-blocking collective communication • No tags • Receive buffers must be exactly the right size

  24. Collective Communications (one for all, all for one!!!) • Collective communication is defined as that which involves all the processes in a group. Collective communication routines can be divided into the following broad categories: • Barrier synchronisation • Broadcast from one to all • Scatter from one to all • Gather from all to one • Scatter/gather from all to all • Global reduction (distribute elementary operations) • IMPORTANT NOTE: the collective communication operations and the point-to-point operations we have seen earlier are invisible to each other and hence do not interfere with each other. • This is important to avoid deadlocks due to interference.

  25. Barrier Synchronization [Diagram: processes progressing in time towards a BARRIER statement] Here, there are seven processes running and three of them are waiting idle at the barrier statement for the other four to catch up.

  26. Graphic Representations of Collective Communication Types [Diagram: for each of BROADCAST, SCATTER, GATHER, ALLGATHER and ALLTOALL, the data items held by each process before and after the operation]

  27. Barrier Synchronisation • Each process in the communicator waits at the barrier until all processes have encountered the barrier. • Fortran: INTEGER comm, error CALL MPI_BARRIER(comm, error) • C: MPI_Barrier(MPI_Comm comm); • C++: Comm.Barrier(); • E.g.: MPI::COMM_WORLD.Barrier();
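One common (if expensive) use is lining processes up before timing a region. A C sketch, assuming stdio.h is included, MPI is initialised and rank has been obtained:

    double t0, t1;

    MPI_Barrier(MPI_COMM_WORLD);   /* all processes start the timed region together */
    t0 = MPI_Wtime();
    /* ... work being timed ... */
    MPI_Barrier(MPI_COMM_WORLD);   /* all processes have finished before stopping the clock */
    t1 = MPI_Wtime();
    if (rank == 0)
        printf("elapsed time: %f seconds\n", t1 - t0);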

  28. Global reduction operations • Used to compute a result involving data distributed over a group of processes: • Global sum or product • Global maximum or minimum • Global user-defined operation

  29. Predefined operations (MPI name F/C, MPI name C++, function): • MPI_MAX, MPI::MAX: maximum • MPI_MIN, MPI::MIN: minimum • MPI_SUM, MPI::SUM: sum • MPI_PROD, MPI::PROD: product • MPI_LAND, MPI::LAND: logical AND • MPI_BAND, MPI::BAND: bitwise AND • MPI_LOR, MPI::LOR: logical OR • MPI_BOR, MPI::BOR: bitwise OR • MPI_LXOR, MPI::LXOR: logical exclusive OR • MPI_BXOR, MPI::BXOR: bitwise exclusive OR • MPI_MAXLOC, MPI::MAXLOC: maximum and location • MPI_MINLOC, MPI::MINLOC: minimum and location

  30. MPI_Reduce • Performs count operations (o) on individual elements of sendbuf between processes. [Diagram: ranks 0, 1 and 2 hold elements (A B C), (D E F) and (G H I); MPI_REDUCE combines corresponding elements on the root, e.g. AoDoG and BoEoH]

  31. MPI_Reduce syntax • Fortran: INTEGER count, rtype, op, root, comm, error CALL MPI_REDUCE(sbuf, rbuf, count, rtype, op, root, comm, error) • C: MPI_Reduce(void *sbuf, void *rbuf, int count, MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm comm); • C++: Comm.Reduce(const void* sbuf, void* recvbuf, int count, const MPI::Datatype& datatype, const MPI::Op& op, int root);

  32. MPI_Reduce example • Integer global sum: • Fortran: INTEGER x, result, error CALL MPI_REDUCE(x, result, 1, MPI_INTEGER, MPI_SUM, 0, MPI_COMM_WORLD, error) • C: int x, result; MPI_Reduce(&x, &result, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD); • C++: int x, result; MPI::COMM_WORLD.Reduce(&x, &result, 1, MPI::INT, MPI::SUM, 0);

  33. MPI_Allreduce • No root process • All processes get results of reduction operation [Diagram: ranks 0, 1 and 2 hold elements (A B C), (D E F) and (G H I); after MPI_ALLREDUCE every rank holds the combined results, e.g. AoDoG]

  34. MPI_Allreduce syntax • Fortran: INTEGER count, rtype, op, comm, error CALL MPI_ALLREDUCE(sbuf, rbuf, count, rtype, op, comm, error) • C: MPI_Allreduce(void *sbuf, void *rbuf, int count, MPI_Datatype datatype, MPI_Op op, MPI_Comm comm); • C++: Comm.Allreduce(const void* sbuf, void* recvbuf, int count, const MPI::Datatype& datatype, const MPI::Op& op);
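For example, a global dot product where every rank needs the answer. A C sketch assuming x, y and local_n describe this rank's slice of the two vectors:

    double local_dot = 0.0, global_dot;
    int i;

    for (i = 0; i < local_n; i++)
        local_dot += x[i] * y[i];

    /* Every rank receives the full sum; no root argument is needed. */
    MPI_Allreduce(&local_dot, &global_dot, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);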

  35. Practice Session 3 • Using reduction operations • This example shows the use of the continued fraction method of calculating pi and makes each processor calculate a different portion of the expansion series.
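The exercise uses its own series for pi. Purely as an illustration of combining per-process partial sums with MPI_Reduce, here is a C sketch based on the Gregory-Leibniz series (an assumed substitute, not the course's exact method; rank and size are assumed to have been obtained and stdio.h included):

    long   k, nterms = 1000000;
    double partial = 0.0, pi = 0.0;

    /* Each rank sums every size-th term of pi/4 = 1 - 1/3 + 1/5 - 1/7 + ... */
    for (k = rank; k < nterms; k += size)
        partial += (k % 2 == 0 ? 1.0 : -1.0) / (2.0 * k + 1.0);

    MPI_Reduce(&partial, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("pi is approximately %.10f\n", 4.0 * pi);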

  36. Broadcast • Duplicates data from root process to other processes in communicator [Diagram: the root's data item A is copied to ranks 0 to 3, so every rank holds A after the broadcast]

  37. Broadcast syntax • Fortran: INTEGER count, datatype, root, comm, error CALL MPI_BCAST(buffer, count, datatype, root, comm, error) • C: MPI_Bcast(void *buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm); • C++: Comm.Bcast(void* buffer, int count, const MPI::Datatype& datatype, int root); • E.g. broadcasting 10 integers from rank 0: int tenints[10]; MPI::COMM_WORLD.Bcast(tenints, 10, MPI::INT, 0);

  38. Scatter • Distributes data from root process amongst processors within communicator. [Diagram: the root's data items A B C D are distributed so that ranks 0 to 3 receive A, B, C and D respectively]

  39. Scatter syntax • scount (and rcount) is the number of elements each process is sent (i.e. = number received) • Fortran: INTEGER scount, stype, rcount, rtype, root, comm, error CALL MPI_SCATTER(sbuf, scount, stype, rbuf, rcount, rtype, root, comm, error) • C: MPI_Scatter(void *sbuf, int scount, MPI_Datatype stype, void *rbuf, int rcount, MPI_Datatype rtype, int root, MPI_Comm comm); • C++: Comm.Scatter(const void* sbuf, int scount, const MPI::Datatype& stype, void* rbuf, int rcount, const MPI::Datatype& rtype, int root);
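A C sketch of scattering a root-owned array so that each rank receives CHUNK elements; CHUNK, the even division of the data, the included stdlib.h and the already-obtained rank and size are all assumptions:

    #define CHUNK 4
    double *full = NULL;
    double  part[CHUNK];

    if (rank == 0) {
        full = malloc(size * CHUNK * sizeof(double));
        /* ... fill full[] on the root only ... */
    }

    MPI_Scatter(full, CHUNK, MPI_DOUBLE,   /* what the root hands out per rank */
                part, CHUNK, MPI_DOUBLE,   /* what each rank receives */
                0, MPI_COMM_WORLD);
    /* every rank, including the root, now holds its own CHUNK values in part[] */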

  40. Gather • Collects data distributed amongst processes in communicator onto root process (collection done in rank order). [Diagram: ranks 0 to 3 hold A, B, C and D respectively; after the gather the root holds A B C D]

  41. Gather syntax • Takes same arguments as Scatter operation • Fortran: INTEGER scount, stype, rcount, rtype, root, comm, error CALL MPI_GATHER(sbuf, scount, stype, rbuf, rcount, rtype, root, comm, error) • C: MPI_Gather(void *sbuf, int scount, MPI_Datatype stype, void *rbuf, int rcount, MPI_Datatype rtype, int root, MPI_Comm comm); • C++: Comm.Gather(const void* sbuf, int scount, const MPI::Datatype& stype, void* rbuf, int rcount, const MPI::Datatype& rtype, int root);
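Continuing the scatter sketch shown earlier, each rank could compute one result from its portion and the root could collect the results in rank order; local_result is an assumed per-rank value and the same headers are assumed:

    double  local_result = 0.0;   /* assumed to be computed from this rank's part[] */
    double *all_results = NULL;

    if (rank == 0)
        all_results = malloc(size * sizeof(double));

    MPI_Gather(&local_result, 1, MPI_DOUBLE,   /* one value contributed per rank */
               all_results,   1, MPI_DOUBLE,   /* received only on the root */
               0, MPI_COMM_WORLD);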

  42. All Gather • Collects all data on all processes in communicator [Diagram: ranks 0 to 3 hold A, B, C and D respectively; after the allgather every rank holds A B C D]

  43. All Gather syntax • As Gather but no root defined. • Fortran: INTEGER scount, stype, rcount, rtype, comm, error CALL MPI_ALLGATHER(sbuf, scount, stype, rbuf, rcount, rtype, comm, error) • C: MPI_Allgather(void *sbuf, int scount, MPI_Datatype stype, void *rbuf, int rcount, MPI_Datatype rtype, MPI_Comm comm); • C++: Comm.Allgather(const void* sbuf, int scount, const MPI::Datatype& stype, void* rbuf, int rcount, const MPI::Datatype& rtype);
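A C sketch in which every rank contributes one integer and every rank ends up with the complete set; the per-rank value is arbitrary, and stdlib.h plus the already-obtained rank and size are assumed:

    int  myval = rank * rank;                 /* arbitrary per-rank value */
    int *allvals = malloc(size * sizeof(int));

    MPI_Allgather(&myval, 1, MPI_INT,
                  allvals, 1, MPI_INT,
                  MPI_COMM_WORLD);
    /* allvals[i] now holds rank i's contribution on every process */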
