Basic examples with MPI
M. Garbey
Reference: http://www.mcs.anl.gov/mpi/

Plan:
I. Introduction: Programming Model
II. Basic MPI Commands
III. Examples
IV. Collective Communications
V. More on Communication Modes
VI. References on MPI
I. Introduction: Definition
Model for a sequential program:
• The program is executed by one and only one processor.
• All variables and constants of the program are allocated in central memory.
[Diagram: one program, one memory, one processing element (PE)]
I. Introduction: Message Passing Programming Model
• The program is written in a classical language (Fortran, C, C++, ...).
• The computer is an ensemble of processors with an arbitrary interconnection topology.
• Each processor has its own medium-size local memory.
• Each processor executes its own program.
• Processors communicate by message passing.
• Any processor can send a message to any other processor.
• There are no shared resources (CPU, memory, ...).
I. Introduction: Message Passing Programming Model
[Diagram: four processes (0-3), each with its own program and memory, connected by a network]
[Diagram: the same four processes all running a single program on multiple data (SPMD)]
I. Introduction: Execution model: SPMD (Single Program Multiple Data)
• The same program is executed by all the processors.
• Most computers can run this model.
• It is a particular case of MPMD, but SPMD can emulate MPMD:
  If the processor is in set A, then do piece of code A
  If the processor is in set B, then do piece of code B
  ...
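As a small hedged sketch (the program name and the choice of sets A and B are illustrative, not from the slides), the rank returned by MPI_COMM_RANK is what drives this branching:

      program spmd_branch
      implicit none
      include 'mpif.h'
      integer rank, code
      call MPI_INIT(code)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, code)
c --  illustrative choice: set A = even ranks, set B = odd ranks
      if (mod(rank, 2) .eq. 0) then
         print *, 'rank', rank, ': executing piece of code A'
      else
         print *, 'rank', rank, ': executing piece of code B'
      end if
      call MPI_FINALIZE(code)
      end program spmd_branch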
I. Introduction: Process = Basic Unit of Computation
• A program written in a "standard" sequential language with library calls to implement message passing.
• A process executes on a node; other processes may execute simultaneously on other nodes.
• A process communicates and synchronizes with other processes via messages.
• A process is uniquely identified by its label.
• A process does not migrate.
I. Introduction
• Processes communicate and synchronize with each other by sending and receiving messages (no global variables or shared memory).
• Processes execute independently and asynchronously (no global synchronizing clock).
• Processes may be unique and work on their own data set.
• Any process may communicate with any other process (a priori no limitation on message passing).
I Introduction • Common Communication Patterns • One processor to one processor • One processor to many processors • Input data • Many processors to one processor • Printing results • Global operations • Many processors to many processors • Algorithm step (FFT …..)
II. The 6 basic functions of MPI
MPI_Init       initializes MPI
MPI_Comm_size  gives the number of processes
MPI_Comm_rank  gives the rank of the process
MPI_Send       sends a message
MPI_Recv       receives a message
MPI_Finalize   ends the MPI environment

      integer code
c --  start MPI
      call MPI_INIT(code)
      call MPI_FINALIZE(code)
c --  end MPI
II. The 6 basic functions of MPI
MPI_Comm_size  gives the number of processes
MPI_Comm_rank  gives the rank of the process

      integer nb_procs, rank, code
c --  gives the number of processes running in the code:
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nb_procs, code)
c --  gives the rank of the process running this function:
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, code)

NOTE: 0 <= rank <= nb_procs - 1
NOTE: MPI_COMM_WORLD is the communicator for the set of all processes running in the code.
II. The 6 basic functions of MPI

      program who_i_am
      implicit none
      include 'mpif.h'
      integer nb_procs, rank, code
      call MPI_INIT(code)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nb_procs, code)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, code)
      print *, 'I am the process ', rank, ' among ', nb_procs
      call MPI_FINALIZE(code)
      end program who_i_am

> mpirun -np 4 who_i_am
 I am the process 3 among 4
 I am the process 0 among 4
 I am the process 2 among 4
 I am the process 1 among 4
II. The 6 basic functions of MPI
MPI_Send  sends a message
MPI_Recv  receives a message
[Diagram: the value 1000 is sent from process 1 to process 5]
II. The 6 basic functions of MPI

      program node_to_node
      implicit none
      include 'mpif.h'
      integer status(MPI_STATUS_SIZE)
      integer code, rank, value, tag
      parameter(tag=100)
      call MPI_INIT(code)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, code)
      if (rank .eq. 1) then
         value = 1000
         call MPI_SEND(value, 1, MPI_INTEGER, 5, tag, MPI_COMM_WORLD, code)
      elseif (rank .eq. 5) then
         call MPI_RECV(value, 1, MPI_INTEGER, 1, tag, MPI_COMM_WORLD, status, code)
      end if
      call MPI_FINALIZE(code)
      end program node_to_node
II. The 6 basic functions of MPI
MPI_Send  sends a message
MPI_Recv  receives a message
• The call sends one element of type MPI_INTEGER stored in value; each message should have a tag.
• This protocol of communication is a synchronous send and a synchronous receive.
• MPI_SEND(value, 1, MPI_INTEGER, 5, tag, MPI_COMM_WORLD, code) blocks the execution of the code until the send is completed; value can then be reused, but there is no guarantee that the message has been received.
• MPI_RECV(value, 1, MPI_INTEGER, 1, tag, MPI_COMM_WORLD, status, code) blocks the execution of the code until the receive is completed.
NOTE: at the beginning, use the print command to check that things are OK!
II. 6 basic functions of MPI For a communication to succeed: • sender must specify a valid destination rank • receiver must specify a valid source rank • may use wildcard: MPI_ANY_SOURCE • the communicator must be the same • Tags must match • may use wildcard: MPI_ANY_TAG • Message types must match • Receiver’s buffer must be large enough
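A hedged fragment illustrating the wildcards (the variable names are invented for this sketch; 'mpif.h' is assumed included and MPI already initialized): the receiver accepts a message from any sender with any tag, then reads the actual source and tag from the status object.

      integer status(MPI_STATUS_SIZE), value, code
c --  accept one integer from any sender, with any tag
      call MPI_RECV(value, 1, MPI_INTEGER, MPI_ANY_SOURCE, MPI_ANY_TAG,
     &              MPI_COMM_WORLD, status, code)
c --  the actual source and tag are recorded in the status object
      print *, 'received', value, 'from', status(MPI_SOURCE),
     &         'with tag', status(MPI_TAG)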
II. The 6 basic functions of MPI
MPI basic datatypes in Fortran:

MPI datatype             Fortran datatype
MPI_INTEGER              INTEGER
MPI_REAL                 REAL
MPI_DOUBLE_PRECISION     DOUBLE PRECISION
MPI_COMPLEX              COMPLEX
MPI_LOGICAL              LOGICAL
MPI_CHARACTER            CHARACTER(1)
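For instance (a small sketch not taken from the slides; the array name and destination rank are arbitrary), the Fortran declaration and the MPI datatype passed to the call must agree:

      double precision x(100)
      integer code
c --  100 elements declared DOUBLE PRECISION are described by MPI_DOUBLE_PRECISION
      call MPI_SEND(x, 100, MPI_DOUBLE_PRECISION, 2, 100, MPI_COMM_WORLD, code)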
III. The matrix multiply example: Preliminary: TIMER
• In Fortran: double precision MPI_WTIME()
• Time is measured in seconds.
• Time to perform a task is measured by consulting the timer before and after.
• Modify your program to measure its execution time and print it out.

Example:
      tstart = MPI_WTIME()
c --  ... work to be timed ...
      tend = MPI_WTIME()
      print *, ' node ', myid, ', time = ', tend - tstart, ' seconds'
III. The matrix multiply example: Simple matrix multiply algorithm
• Matrix A is copied to every processor j = 1..np.
• Matrix B is divided into blocks of columns, and block j is distributed to processor j.
• Each processor performs the matrix multiply between A and its block of B simultaneously.
• Output the solution.
[Diagram: C = A * B, with B and C split into column blocks 1..4, one block per processor]
III. The matrix multiply example:
• Master: distribute the work to workers, collect results, and output the solution.
• Master sends a copy of A to every worker:

      do dest = 1, numworkers
         call MPI_SEND(a, nra*nca, mpi_double_precision, dest, mtype, mpi_comm_world, ierr)
      end do

• Worker: receive a copy of A from the master:

      call mpi_recv(a, nra*nca, mpi_double_precision, master, mtype, mpi_comm_world, status, ierr)
III. The matrix multiply example:
• Master: distribute blocks of columns of B to the workers.
• Master sends the number of columns (cols) and the column identifier (offset):

      do dest = 1, numworkers
         call MPI_SEND(offset, 1, mpi_integer, dest, mtype, mpi_comm_world, ierr)
         call MPI_SEND(cols, 1, mpi_integer, dest, mtype, mpi_comm_world, ierr)
      end do

• Master sends the corresponding values of B to the workers:

      do dest = 1, numworkers
         call MPI_SEND(b(1,offset), cols*nca, mpi_double_precision, dest, mtype, mpi_comm_world, ierr)
      end do
III. The matrix multiply example:
• Workers receive the data:

      call MPI_RECV(offset, 1, mpi_integer, master, mtype, mpi_comm_world, status, ierr)
      call MPI_RECV(cols, 1, mpi_integer, master, mtype, mpi_comm_world, status, ierr)
      call MPI_RECV(b, cols*nca, mpi_double_precision, master, mtype, mpi_comm_world, status, ierr)

• Workers do the matrix multiply:

      do k = 1, cols
         do i = 1, nra
            c(i,k) = 0.0d0
            do j = 1, nca
               c(i,k) = c(i,k) + a(i,j) * b(j,k)
            end do
         end do
      end do
III. The matrix multiply example:
• Workers send the results for their block back to the master:

      call MPI_SEND(c, cols*nra, mpi_double_precision, master, mtype, mpi_comm_world, ierr)

• Master receives the results from the workers:

      do i = 1, numworkers
         call MPI_RECV(c(1,offset), cols*nra, mpi_double_precision, i, mtype, mpi_comm_world, status, ierr)
      end do

• Remark: Fortran is not case sensitive.
IV. Collective Communications: • Substitute for a more complex sequence of calls • Involve all the processes in a process group • Called by all processes in a communicator • all routines block until they are locally complete • Receive buffers must be exactly the right size • No message tags are needed • Collective calls are divided into three subsets: • synchronization • data movement • global computation
IV. Collective Communications: Barrier Synchronization Routines • To synchronize all processes within a communicator • A communicator is a group of processes and a context of • communication • The base group is the group that contains all processes, • which is associated with the MPI_COMM_WORLD • communicator. • A node calling it will be blocked until all nodes within the group • have called it. • Call MPI_BARRIER(comm,ierr)
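A typical hedged use (sketch only; 'mpif.h' is assumed included) is to synchronize all processes before and after a timed section, so that every process measures the same interval:

      double precision tstart, tend
      integer code
c --  make every process start the measured section together
      call MPI_BARRIER(MPI_COMM_WORLD, code)
      tstart = MPI_WTIME()
c --  ... work to be timed ...
      call MPI_BARRIER(MPI_COMM_WORLD, code)
      tend = MPI_WTIME()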
IV. Collective Communications:
• Broadcast
• One processor sends some data to all processors in a group.
• call MPI_BCAST(buffer, count, datatype, root, comm, ierr)
• MPI_BCAST must be called by each node in a group, specifying the same communicator and root. The message is sent from the root process to all processes in the group, including the root process.
• Scatter
• Data are distributed into n equal segments, where the ith segment is sent to the ith process in the group, which has n processes.
• Call MPI_SCATTER(sbuf, scount, sdatatype, rbuf, rcount, rdatatype, root, comm, ierr)
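A small hedged sketch of a broadcast (array name and size are illustrative; 'mpif.h' is assumed included): every process makes the same call, and afterwards every process holds the values that the root had.

      real a(10)
      integer code
c --  process 0 is the root; after the call, a(1:10) on every process
c --  equals a(1:10) as it was on process 0
      call MPI_BCAST(a, 10, MPI_REAL, 0, MPI_COMM_WORLD, code)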
IV. Collective Communications:
• Gather
• Data are collected into a specified process in the order of process rank; gather is the reverse of scatter.
• Call MPI_GATHER(sbuf, scount, sdatatype, rbuf, rcount, rdatatype, root, comm, ierr)
• Example: if the data in Proc. 0 are {1,2}, in Proc. 1: {3,4}, in Proc. 2: {5,6}, ..., and in Proc. 5: {11,12} (6 processes), then

      real sbuf(2), rbuf(12)
      call MPI_GATHER(sbuf, 2, MPI_REAL, rbuf, 2, MPI_REAL, 3, MPI_COMM_WORLD, ierr)

  will bring {1,2,3,4,5,6,...,11,12} into rbuf on Proc. 3.
• Similarly, the inverse transfer (2 elements from the 12-element buffer on Proc. 3 to each process) is:

      call MPI_SCATTER(rbuf, 2, MPI_REAL, sbuf, 2, MPI_REAL, 3, MPI_COMM_WORLD, ierr)
IV. Collective Communications:
• Two more MPI functions: MPI_Allgather and MPI_Alltoall:
• MPI_ALLGATHER(sbuf, scount, stype, rbuf, rcount, rtype, comm, ierr)
• MPI_ALLTOALL(sbuf, scount, stype, rbuf, rcount, rtype, comm, ierr)
• sbuf: starting address of send buffer
• scount: number of elements sent to each process
• stype: data type of send buffer elements
• rbuf: starting address of receive buffer
• rcount: number of elements received from any process
• rtype: data type of receive buffer elements
• comm: communicator
[Diagram: data movement for Broadcast, Scatter, Gather, Allgather and All-to-All between two processes p0 and p1]
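As a hedged sketch of MPI_Allgather (the array size and names are illustrative; 'mpif.h' is assumed included), each process contributes its own rank and every process ends up with the full list of ranks:

      integer rank, nb_procs, code
      integer all_ranks(64)
c --  assumes nb_procs <= 64 for this sketch
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nb_procs, code)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, code)
c --  each process sends 1 integer; each receives 1 integer from every process
      call MPI_ALLGATHER(rank, 1, MPI_INTEGER, all_ranks, 1, MPI_INTEGER,
     &                   MPI_COMM_WORLD, code)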
IV. Collective Communications: • Global Reduction Routines • The partial result in each process in the group is combined together using some desired function. • The operation function passed to a global computation routine • is either a predefined MPI function or a user supplied function. • Examples: • Global sum or product. • Global maximum or minimum. • Global user-defined operation. • MPI_Reduce(sbuf,rbuf,count,stype,op,root, comm,ierr) • MPI_Allreduce(sbuf,rbuf,count,stype,op, comm,ierr)
IV. Collective Communications:
• Global Reduction Routines
• sbuf: address of send buffer
• rbuf: address of receive buffer
• count: the number of elements in the send buffer
• stype: the data type of elements of the send buffer
• op: the reduce operation function, predefined or user-defined
• root: the rank of the root process
• comm: communicator
• mpi_reduce returns results to a single process.
• mpi_allreduce returns results to all processes in the group.
IV. Collective Communications: Global Reduction Routines
[Diagram: MPI_Reduce(sendbuf, recvbuf, 4, MPI_INT, MPI_MAX, 0, comm) combines the 4-element buffers of p0, p1, p2 element-wise with MAX and leaves the result on the root p0]
[Diagram: MPI_Allreduce(sendbuf, recvbuf, 4, MPI_INT, MPI_SUM, comm) combines the buffers element-wise with SUM and leaves the result on every process p0, p1, p2]
IV. Collective Communications:
• Global Reduction Routines
• Examples

c  A subroutine that computes the dot product of two vectors that are distributed across
c  a group of processes and returns the answer at node zero:

      subroutine PAR_BLAS1(N, a, b, scalar_product, comm)
      include 'mpif.h'
      integer N, comm, I, ierr
      real a(N), b(N), sum, scalar_product
      sum = 0.0
      do I = 1, N
         sum = sum + a(I) * b(I)
      end do
      call MPI_REDUCE(sum, scalar_product, 1, MPI_REAL, MPI_SUM, 0, comm, ierr)
      return
      end
IV. Collective Communications:
• Global Reduction Routines
• Predefined reduce operations:

MPI name    Function        MPI name    Function
MPI_MAX     Maximum         MPI_LOR     Logical OR
MPI_MIN     Minimum         MPI_LAND    Logical AND
MPI_SUM     Sum             MPI_PROD    Product
V. More on Communication Modes:
• So far, we have seen the standard SEND and RECEIVE functions; however, we need to know more in order to overlap communications with computations, and more generally to optimize the code.
• Blocking Calls
• A blocking send or receive call suspends execution of the user's program until the message buffer being sent/received is safe to use.
• In the case of a blocking send, this means the data to be sent have been copied out of the send buffer, but they have not necessarily been received by the receiving task. The contents of the send buffer can be modified without affecting the message that was sent.
• A blocking receive implies that the data in the receive buffer are valid.
V. More on Communication Modes:
• Blocking Communication Modes:
• Synchronous Send: MPI_SSEND: returns when the message buffer can be safely reused. The sending task tells the receiver that a message is ready for it and waits for the receiver to acknowledge.
  • System overhead: buffer to network and vice versa.
  • Synchronization overhead: handshake + waiting.
  • Safe and portable.
• Buffered Send: MPI_BSEND: returns when the message has been copied to the buffer.
• Standard Send: MPI_SEND: either synchronous or buffered, implemented by the vendor to give good performance for most programs.
• In MPICH we do have a buffered send.
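MPI_BSEND requires the user to attach buffer space first; a minimal hedged sketch (the buffer size, destination rank and tag are arbitrary choices; 'mpif.h' is assumed included):

      integer, parameter :: BUFSIZE = 4000 + MPI_BSEND_OVERHEAD
      character buffer(BUFSIZE)
      integer value, code
c --  attach user buffer space; the buffered send then returns as soon as
c --  the message has been copied into this buffer
      call MPI_BUFFER_ATTACH(buffer, BUFSIZE, code)
      value = 1000
      call MPI_BSEND(value, 1, MPI_INTEGER, 1, 100, MPI_COMM_WORLD, code)
      call MPI_BUFFER_DETACH(buffer, BUFSIZE, code)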
V. More on Communication Modes:
• Non-Blocking Calls
• Non-blocking calls return immediately after initiating the communication.
• In order to reuse the send message buffer, the programmer must check its status.
• The programmer can choose to block before the message buffer is used or to test the status of the message buffer.
• A blocking or non-blocking send can be paired with a blocking or non-blocking receive.
• Syntax:
      call MPI_ISEND(buf, count, datatype, dest, tag, comm, handle, ierr)
      call MPI_IRECV(buf, count, datatype, src, tag, comm, handle, ierr)
V. More on Communication Modes:
• Non-Blocking Calls
• The programmer can block or check the status of the message buffer:
• MPI_Wait(request, status)
  This routine blocks until the communication identified by the handle request has completed. It is useful when the data in the communication buffer are about to be re-used.
• MPI_Test(request, flag, status)
  This routine returns immediately. The request handle will have been returned by an earlier call to a non-blocking communication routine. The routine queries whether the communication has completed and returns the result (true or false) in flag.
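A hedged sketch of the test-then-compute pattern (source rank, tag and buffer are illustrative; 'mpif.h' is assumed included):

      integer handle, code
      integer status(MPI_STATUS_SIZE)
      logical flag
      real value
c --  post the receive, then keep computing until it has completed
      call MPI_IRECV(value, 1, MPI_REAL, 0, 100, MPI_COMM_WORLD, handle, code)
      flag = .false.
      do while (.not. flag)
c --     do some useful local work here, then poll for completion
         call MPI_TEST(handle, flag, status, code)
      end do
c --  the receive has completed: value can now be used safely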
V. More on Communication Modes:
• Deadlock
• All tasks are waiting for events that haven't been initiated.
• Common in SPMD programs with blocking communication, e.g. every task sends, but none receives.
• Insufficient system buffer space is available.
• Remedies:
  • Arrange for one task to receive first
  • Use MPI_Sendrecv
  • Use non-blocking communication
V. More on Communication Modes: Examples: Deadlock

c  Improper use of blocking calls results in deadlock; run on two nodes.
c  author: Roslyn Leibensperger (CTC)
      program deadlock
      implicit none
      include 'mpif.h'
      integer MSGLEN, ITAG_A, ITAG_B
      parameter (MSGLEN = 2048, ITAG_A = 100, ITAG_B = 200)
      real rmsg1(MSGLEN), rmsg2(MSGLEN)
      integer irank, idest, isrc, istag, iretag, istatus(MPI_STATUS_SIZE), ierr, I
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, irank, ierr)
      do I = 1, MSGLEN
         rmsg1(I) = 100
         rmsg2(I) = -100
      end do
V. More on Communication Modes: Example: Deadlock (cont'd)

      if (irank .eq. 0) then
         idest  = 1
         isrc   = 1
         istag  = ITAG_A
         iretag = ITAG_B
      else if (irank .eq. 1) then
         idest  = 0
         isrc   = 0
         istag  = ITAG_B
         iretag = ITAG_A
      end if
      print *, ' Task ', irank, ' has sent the message '
      call MPI_SSEND(rmsg1, MSGLEN, MPI_REAL, idest, istag, MPI_COMM_WORLD, ierr)
      call MPI_RECV(rmsg2, MSGLEN, MPI_REAL, isrc, iretag, MPI_COMM_WORLD, istatus, ierr)
      print *, ' Task ', irank, ' has received the message '
      call MPI_FINALIZE(ierr)
      end
V. More on Communication Modes: Examples: Deadlock (fixed)

c  Solution program showing the use of a non-blocking send to eliminate deadlock.
c  author: Roslyn Leibensperger (CTC)
      program fixed
      implicit none
      include 'mpif.h'
      -----------------------
      -----------------------
      print *, ' Task ', irank, ' has started the send '
      call MPI_ISEND(rmsg1, MSGLEN, MPI_REAL, idest, istag, MPI_COMM_WORLD, irequest, ierr)
      call MPI_RECV(rmsg2, MSGLEN, MPI_REAL, isrc, iretag, MPI_COMM_WORLD, irstatus, ierr)
      call MPI_WAIT(irequest, istatus, ierr)
      print *, ' Task ', irank, ' has completed the send '
      call MPI_FINALIZE(ierr)
      end
V. More on Communication Modes:
• Sendrecv
• Useful for executing a shift operation across a chain of processes.
• The system takes care of the possible deadlock due to blocking calls.
• MPI_SENDRECV(sbuf, scount, stype, dest, stag, rbuf, rcount, rtype, source, rtag, comm, status, ierr)
• sbuf (rbuf): initial address of the send (receive) buffer
• scount (rcount): number of elements in the send (receive) buffer
• stype (rtype): type of elements in the send (receive) buffer
• stag (rtag): send (receive) tag
• dest: rank of destination
• source: rank of source
• comm: communicator
• status: status object
      program sendrecv
      implicit none
      include 'mpif.h'
      integer, dimension(MPI_STATUS_SIZE) :: status
      integer, parameter :: tag = 100
      integer :: rank, value, num_proc, code

      call MPI_INIT(code)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, code)

      ! one supposes that we have only two processes
      num_proc = mod(rank+1, 2)

      call MPI_SENDRECV(rank+1000, 1, MPI_INTEGER, num_proc, tag,  &
                        value, 1, MPI_INTEGER, num_proc, tag,      &
                        MPI_COMM_WORLD, status, code)

      print *, 'me, process', rank, ', i have received', value, ' from process', num_proc
      call MPI_FINALIZE(code)
      end program sendrecv

> mpirun -np 2 sendrecv
 me, process 1, i have received 1000 from process 0
 me, process 0, i have received 1001 from process 1

Remark: if blocking MPI_SEND calls were used in this code, we would have a deadlock because each process would wait for a reception order that never comes!
V. More on Communication Modes:
• Optimizations
• Optimization must be a main concern when communication time becomes a significant part of the total compared to computation time.
• Optimization of communications may be accomplished at different levels; the main ones are:
  • Overlap communication with computation (see the sketch after this list).
  • Avoid, if possible, copying the message into temporary memory (buffering).
  • Minimize the additional costs induced by calling communication subroutines too often.
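A minimal hedged sketch of the first point (the partner rank, tags, buffer sizes and the local work are placeholders; 'mpif.h' is assumed included): start the exchange with non-blocking calls, compute on data that does not involve the message buffers, then wait before reusing them.

      integer sreq, rreq, code, i
      integer status(MPI_STATUS_SIZE)
      real sbuf(1000), rbuf(1000), work(1000)
c --  start the exchange with a neighbouring process (rank 1 here)
      call MPI_ISEND(sbuf, 1000, MPI_REAL, 1, 100, MPI_COMM_WORLD, sreq, code)
      call MPI_IRECV(rbuf, 1000, MPI_REAL, 1, 200, MPI_COMM_WORLD, rreq, code)
c --  overlap: computation that touches neither sbuf nor rbuf
      do i = 1, 1000
         work(i) = 2.0 * work(i)
      end do
c --  wait before reusing sbuf or reading rbuf
      call MPI_WAIT(sreq, status, code)
      call MPI_WAIT(rreq, status, code)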