MPI Part III NPACI Parallel Computing Institute August 19 - 23, 2002

MPI Part III NPACI Parallel Computing Institute August 19 - 23, 2002 San Diego Supercomputer Center

Point to Point Communications in MPI
• Basic operations of point-to-point (PtoP) communication and issues of deadlock
• Several steps are involved in PtoP communication
• Sending process
  • User copies the data into the user buffer
  • User calls one of the MPI send routines
  • System copies the data from the user buffer to the system buffer
  • System sends the data from the system buffer to the destination process

Point to Point Communications in MPI
• Receiving process
  • User calls one of the MPI receive routines
  • System receives the data from the source process and copies it to the system buffer
  • System copies the data from the system buffer to the user buffer
  • User uses the data in the user buffer

[Figure: data flow of a point-to-point message through user mode and kernel mode]
Process 0 (sender):
  1. Call send routine (data is in sendbuf)
  2. Copy data from sendbuf to sysbuf
  3. Now sendbuf can be reused
  4. Send data from sysbuf to dest
Process 1 (receiver):
  1. Call receive routine
  2. Receive data from src into sysbuf
  3. Copy data from sysbuf to recvbuf
  4. Now recvbuf contains valid data

Unidirectional communication
• Blocking send and blocking receive
    if (myrank == 0) then
       call MPI_Send(…)
    elseif (myrank == 1) then
       call MPI_Recv(…)
    endif
• Non-blocking send and blocking receive
    if (myrank == 0) then
       call MPI_Isend(…)
       call MPI_Wait(…)
    elseif (myrank == 1) then
       call MPI_Recv(…)
    endif

Unidirectional communication
• Blocking send and non-blocking recv
    if (myrank == 0) then
       call MPI_Send(…)
    elseif (myrank == 1) then
       call MPI_Irecv(…)
       call MPI_Wait(…)
    endif
• Non-blocking send and non-blocking recv
    if (myrank == 0) then
       call MPI_Isend(…)
       call MPI_Wait(…)
    elseif (myrank == 1) then
       call MPI_Irecv(…)
       call MPI_Wait(…)
    endif

Bidirectional communication
• Need to be careful about deadlock when two processes exchange data with each other
• Deadlock can occur due to an incorrect order of send and recv, or due to the limited size of the system buffer
[Figure: Rank 0 and Rank 1, each with a sendbuf and a recvbuf, exchanging data]

Bidirectional communication
• Case 1: both processes call send first, then recv
    if (myrank == 0) then
       call MPI_Send(…)
       call MPI_Recv(…)
    elseif (myrank == 1) then
       call MPI_Send(…)
       call MPI_Recv(…)
    endif
• No deadlock as long as the system buffer is larger than the send buffer
• Deadlock if the system buffer is smaller than the send buffer
• Replacing MPI_Send with MPI_Isend immediately followed by MPI_Wait does not change this
• Moral: a coding error may show up only at larger problem sizes

Bidirectional communication
• The following is free from deadlock
    if (myrank == 0) then
       call MPI_Isend(…)
       call MPI_Recv(…)
       call MPI_Wait(…)
    elseif (myrank == 1) then
       call MPI_Isend(…)
       call MPI_Recv(…)
       call MPI_Wait(…)
    endif

Bidirectional communication
• Case 2: both processes call recv first, then send
    if (myrank == 0) then
       call MPI_Recv(…)
       call MPI_Send(…)
    elseif (myrank == 1) then
       call MPI_Recv(…)
       call MPI_Send(…)
    endif
• The above always leads to deadlock (even if MPI_Send is replaced with MPI_Isend and MPI_Wait)

Bidirectional communication
• The following code can be safely executed
    if (myrank == 0) then
       call MPI_Irecv(…)
       call MPI_Send(…)
       call MPI_Wait(…)
    elseif (myrank == 1) then
       call MPI_Irecv(…)
       call MPI_Send(…)
       call MPI_Wait(…)
    endif

Bidirectional communication
• Case 3: one process calls send and recv in this order, and the other calls them in the opposite order
    if (myrank == 0) then
       call MPI_Send(…)
       call MPI_Recv(…)
    elseif (myrank == 1) then
       call MPI_Recv(…)
       call MPI_Send(…)
    endif
• The above is always safe
• The send and recv on both processes can be replaced with MPI_Isend and MPI_Irecv (each completed by MPI_Wait)
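The three cases above can be summarized as a small decision function. This is only a plain C model of the rules stated on the slides for blocking MPI_Send and MPI_Recv; it makes no MPI calls. The names call_order and exchange_can_deadlock are hypothetical, and treating a message that exactly fills the system buffer as "fits" is an assumption, since the slides do not cover that boundary.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical type: which blocking call a rank issues first. */
typedef enum { SEND_FIRST, RECV_FIRST } call_order;

/* Returns 1 if a two-rank exchange with blocking MPI_Send/MPI_Recv can
 * deadlock according to the rules on the slides, 0 otherwise.
 * msg_bytes is the size of each message, sysbuf_bytes the system buffer. */
int exchange_can_deadlock(call_order rank0, call_order rank1,
                          size_t msg_bytes, size_t sysbuf_bytes)
{
    assert(rank0 == SEND_FIRST || rank0 == RECV_FIRST);
    assert(rank1 == SEND_FIRST || rank1 == RECV_FIRST);

    if (rank0 == RECV_FIRST && rank1 == RECV_FIRST)
        return 1;                         /* Case 2: both wait for data forever */
    if (rank0 == SEND_FIRST && rank1 == SEND_FIRST)
        return msg_bytes > sysbuf_bytes;  /* Case 1: depends on buffering */
    return 0;                             /* Case 3: opposite order is safe */
}
```

For example, exchange_can_deadlock(SEND_FIRST, SEND_FIRST, 1000, 100) returns 1, which is the "only shows up for larger problem size" situation from Case 1.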

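Scatter and Gather
[Figure: data movement among processes p0-p3 for four collective operations]
• broadcast: p0 holds A; afterwards p0, p1, p2 and p3 each hold A
• scatter: p0 holds A B C D; afterwards p0 holds A, p1 holds B, p2 holds C, p3 holds D
• gather: the reverse of scatter; A, B, C, D held by p0-p3 are collected as A B C D on p0
• all gather: p0-p3 hold A, B, C, D respectively; afterwards every process holds A B C D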

Scatter Operation using MPI_Scatter
• Similar to broadcast, but sends a different section of an array to each processor
  Data in an array on root node:  A(0)  A(1)  A(2)  ...  A(N-1)
  Goes to processors:             P0    P1    P2    ...  Pn-1

MPI_Scatter
• C
    int MPI_Scatter(&sendbuf, sendcnts, sendtype, &recvbuf, recvcnts, recvtype, root, comm);
• Fortran
    MPI_Scatter(sendbuf, sendcnts, sendtype, recvbuf, recvcnts, recvtype, root, comm, ierror)
• Parameters
  • sendbuf is an array of size (number of processors * sendcnts)
  • sendcnts is the number of elements sent to each processor
  • recvcnts is the number of elements obtained from the root processor
  • recvbuf contains the element(s) obtained from the root processor; it may be an array

Scatter Operation using MPI_Scatter
• Scatter with sendcnts = 2
  Data in an array on root node:  A(0) A(1)   A(2) A(3)   A(4) A(5)   ...   A(2N-2) A(2N-1)
  Goes to processors:             P0          P1          P2          ...   Pn-1
  Received on each processor as:  B(0) B(1)   B(0) B(1)   B(0) B(1)   ...   B(0) B(1)
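The block layout shown above can be sketched in plain C. This is a model of where MPI_Scatter takes each rank's data from, not MPI code; the function name model_scatter is hypothetical and the element type is assumed to be int.

```c
#include <assert.h>
#include <string.h>

/* Plain C model of MPI_Scatter's data layout (no MPI calls).
 * Copies into recvbuf the block that rank `rank` would receive from
 * the root's sendbuf when each rank gets sendcnt elements. */
void model_scatter(const int *sendbuf, int sendcnt, int rank, int *recvbuf)
{
    assert(sendcnt >= 0 && rank >= 0);
    /* Rank r receives elements r*sendcnt .. r*sendcnt + sendcnt - 1. */
    memcpy(recvbuf, sendbuf + (size_t)rank * (size_t)sendcnt,
           (size_t)sendcnt * sizeof(int));
}
```

With sendcnt = 2 and a root array holding 0 through 7, rank 2 receives the two elements at indices 4 and 5, matching the A(4), A(5) column of the slide.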

Gather Operation using MPI_Gather
• Used to collect data from all processors to the root; the inverse of scatter
• Data is collected into an array on the root processor
  Data from various processors:   P0: A0   P1: A1   P2: A2   ...   Pn-1: An-1
  Goes to an array on root node:  A(0)  A(1)  A(2)  ...  A(N-1)

MPI_Gather
• C
    int MPI_Gather(&sendbuf, sendcnts, sendtype, &recvbuf, recvcnts, recvtype, root, comm);
• Fortran
    MPI_Gather(sendbuf, sendcnts, sendtype, recvbuf, recvcnts, recvtype, root, comm, ierror)
• Parameters
  • sendcnts is the number of elements sent from each processor
  • sendbuf is an array of size sendcnts
  • recvcnts is the number of elements obtained from each processor
  • recvbuf is an array of size (recvcnts * number of processors)

Code for Scatter and Gather
• A parallel program to scatter data using MPI_Scatter
• Each processor sums its data
• Use MPI_Gather to get the data back to the root processor
• Root processor prints the global sum
• See attached Fortran and C code

module mpi
!DEC$ NOFREEFORM
      include "mpif.h"
!DEC$ FREEFORM
end module

! This program shows how to use MPI_Scatter and MPI_Gather.
! Each processor gets different data from the root processor
! by way of MPI_Scatter. The data is summed and then sent back
! to the root processor using MPI_Gather. The root processor
! then prints the global sum.

module global
    integer numnodes, myid, mpi_err
    integer, parameter :: mpi_root = 0
end module

subroutine init
    use mpi
    use global
    implicit none
    ! do the mpi init stuff
    call MPI_INIT(mpi_err)
    call MPI_COMM_SIZE(MPI_COMM_WORLD, numnodes, mpi_err)
    call MPI_COMM_RANK(MPI_COMM_WORLD, myid, mpi_err)
end subroutine init

program test1
    use mpi
    use global
    implicit none
    integer, allocatable :: myray(:), send_ray(:), back_ray(:)
    integer count
    integer size, i, total
    call init
    ! each processor will get count elements from the root
    count = 4
    allocate(myray(count))
    ! create the data to be sent on the root
    if (myid == mpi_root) then
        size = count*numnodes
        allocate(send_ray(0:size-1))
        allocate(back_ray(0:numnodes-1))
        do i = 0, size-1
            send_ray(i) = i
        enddo
    else
        ! these buffers are significant only on the root; allocate
        ! placeholders so every rank passes an allocated array
        allocate(send_ray(0:0))
        allocate(back_ray(0:0))
    endif
    ! send different data to each processor
    call MPI_Scatter(send_ray, count, MPI_INTEGER, &
                     myray, count, MPI_INTEGER, &
                     mpi_root, MPI_COMM_WORLD, mpi_err)
    ! each processor does a local sum
    total = sum(myray)
    write(*,*) "myid= ", myid, " total= ", total
    ! send the local sums back to the root
    call MPI_Gather(total, 1, MPI_INTEGER, &
                    back_ray, 1, MPI_INTEGER, &
                    mpi_root, MPI_COMM_WORLD, mpi_err)
    ! the root prints the global sum
    if (myid == mpi_root) then
        write(*,*) "results from all processors= ", sum(back_ray)
    endif
    call MPI_FINALIZE(mpi_err)
end program

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * This program shows how to use MPI_Scatter and MPI_Gather.
 * Each processor gets different data from the root processor
 * by way of MPI_Scatter. The data is summed and then sent back
 * to the root processor using MPI_Gather. The root processor
 * then prints the global sum.
 */

/* globals */
int numnodes, myid, mpi_err;
#define mpi_root 0
/* end globals */

void init_it(int *argc, char ***argv);

void init_it(int *argc, char ***argv) {
    mpi_err = MPI_Init(argc, argv);
    mpi_err = MPI_Comm_size(MPI_COMM_WORLD, &numnodes);
    mpi_err = MPI_Comm_rank(MPI_COMM_WORLD, &myid);
}

int main(int argc, char *argv[]) {
    /* send_ray and back_ray are used only on the root */
    int *myray, *send_ray = NULL, *back_ray = NULL;
    int count;
    int size, i, total;

    init_it(&argc, &argv);
    /* each processor will get count elements from the root */
    count = 4;
    myray = (int*)malloc(count*sizeof(int));
    /* create the data to be sent on the root */
    if (myid == mpi_root) {
        size = count*numnodes;
        send_ray = (int*)malloc(size*sizeof(int));
        back_ray = (int*)malloc(numnodes*sizeof(int));
        for (i = 0; i < size; i++)
            send_ray[i] = i;
    }
    /* send different data to each processor */
    mpi_err = MPI_Scatter(send_ray, count, MPI_INT,
                          myray, count, MPI_INT,
                          mpi_root, MPI_COMM_WORLD);
    /* each processor does a local sum */
    total = 0;
    for (i = 0; i < count; i++)
        total = total + myray[i];
    printf("myid= %d total= %d\n", myid, total);
    /* send the local sums back to the root */
    mpi_err = MPI_Gather(&total, 1, MPI_INT,
                         back_ray, 1, MPI_INT,
                         mpi_root, MPI_COMM_WORLD);
    /* the root prints the global sum */
    if (myid == mpi_root) {
        total = 0;
        for (i = 0; i < numnodes; i++)
            total = total + back_ray[i];
        printf("results from all processors= %d\n", total);
    }
    free(myray);
    free(send_ray);
    free(back_ray);
    mpi_err = MPI_Finalize();
    return 0;
}

Output of previous Fortran code on 4 procs
ultra:/work/majumdar/examples/mpi % bsub -q hpc -m ultra -I -n 4 ./a.out
Job <48051> is submitted to queue <hpc>.
<<Waiting for dispatch ...>>
<<Starting on ultra>>
 myid= 1 total= 22
 myid= 2 total= 38
 myid= 3 total= 54
 myid= 0 total= 6
 results from all processors= 120
(0 through 15 added up = 15 * (15 + 1) / 2 = 120)
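The totals in this output can be checked without running MPI. The sketch below recomputes them in plain C, assuming the same setup as the program (the root fills send_ray(i) = i and scatters blocks of count elements); block_sum and global_sum are hypothetical helper names.

```c
#include <assert.h>

/* Sum of the block that rank `rank` receives when the root scatters
 * send_ray[i] = i in blocks of `count` elements. */
int block_sum(int count, int rank)
{
    assert(count >= 0 && rank >= 0);
    int total = 0;
    for (int i = rank * count; i < (rank + 1) * count; i++)
        total += i;
    return total;
}

/* Global sum the root obtains after gathering every rank's block sum. */
int global_sum(int count, int numnodes)
{
    int total = 0;
    for (int r = 0; r < numnodes; r++)
        total += block_sum(count, r);
    return total;
}
```

With count = 4 on 4 ranks, the block sums are 6, 22, 38 and 54, and the global sum is 120, in agreement with the run above.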

Global Sum with MPI_Reduce
[Figure: 2d array spread across processors]

All Gather and All Reduce
• Gather and Reduce come in an "ALL" variation
• Results are returned to all processors
• The root parameter is missing from the call
• Similar to a gather or reduce followed by a broadcast
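The "reduce followed by a broadcast" view can be written out in plain C. This is a model of the result only, with no MPI calls and no claim about how an MPI library implements MPI_Allreduce; model_allreduce_sum is a hypothetical name, and a sum over int values is assumed.

```c
#include <assert.h>

/* Plain C model (no MPI calls): contrib[r] is rank r's local value and
 * result[r] is the value rank r holds after the operation. */
void model_allreduce_sum(const int *contrib, int nranks, int *result)
{
    assert(nranks > 0);
    /* Step 1: reduce to a single value, as MPI_Reduce would on the root. */
    int sum = 0;
    for (int r = 0; r < nranks; r++)
        sum += contrib[r];
    /* Step 2: broadcast, so every rank ends up with the same value. */
    for (int r = 0; r < nranks; r++)
        result[r] = sum;
}
```

Unlike the rooted reduce, every entry of result is filled in, which is why the "ALL" calls take no root argument.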

Global Sum with MPI_Allreduce
[Figure: 2d array spread across processors]

All to All communication with MPI_Alltoall
• Each processor sends data to and receives data from all others
• C
    int MPI_Alltoall(&sendbuf, sendcnts, sendtype, &recvbuf, recvcnts, recvtype, comm);
• Fortran
    call MPI_Alltoall(sendbuf, sendcnts, sendtype, recvbuf, recvcnts, recvtype, comm, ierror)

All to All
[Figure: send buffers (left) and receive buffers (right); each row is one processor]
  a0 a1 a2 a3        a0 b0 c0 d0
  b0 b1 b2 b3   ->   a1 b1 c1 d1
  c0 c1 c2 c3        a2 b2 c2 d2
  d0 d1 d2 d3        a3 b3 c3 d3

All to All with MPI_Alltoall
• Parameters
  • sendcnts: number of elements sent to each processor
  • sendbuf is an array of size (sendcnts * number of processors)
  • recvcnts: number of elements obtained from each processor
  • recvbuf is an array of size (recvcnts * number of processors)
• Note that both the send buffer and the receive buffer must hold one block for every processor
• See attached Fortran and C codes

module mpi
!DEC$ NOFREEFORM
      include "mpif.h"
!DEC$ FREEFORM
end module

! This program shows how to use MPI_Alltoall. Each processor
! sends/receives a different random number to/from other processors.

module global
    integer numnodes, myid, mpi_err
    integer, parameter :: mpi_root = 0
end module

subroutine init
    use mpi
    use global
    implicit none
    ! do the mpi init stuff
    call MPI_INIT(mpi_err)
    call MPI_COMM_SIZE(MPI_COMM_WORLD, numnodes, mpi_err)
    call MPI_COMM_RANK(MPI_COMM_WORLD, myid, mpi_err)
end subroutine init

program test1
    use mpi
    use global
    implicit none
    integer, allocatable :: scounts(:), rcounts(:)
    integer i
    real z
    call init
    ! send and receive arrays, one element per processor
    allocate(scounts(0:numnodes-1))
    allocate(rcounts(0:numnodes-1))
    call seed_random
    ! find data to send
    do i = 0, numnodes-1
        call random_number(z)
        scounts(i) = nint(10.0*z) + 1
    enddo
    write(*,*) "myid= ", myid, " scounts= ", scounts
    ! send the data
    call MPI_Alltoall(scounts, 1, MPI_INTEGER, &
                      rcounts, 1, MPI_INTEGER, &
                      MPI_COMM_WORLD, mpi_err)
    write(*,*) "myid= ", myid, " rcounts= ", rcounts
    call MPI_FINALIZE(mpi_err)
end program

subroutine seed_random
    use global
    implicit none
    integer the_size, j
    integer, allocatable :: seed(:)
    call random_seed(size=the_size)   ! how big is the intrinsic seed?
    allocate(seed(the_size))          ! allocate space for seed
    do j = 1, the_size                ! create the seed
        seed(j) = abs(myid*10) + (j*myid*myid) + 100   ! abs is generic
    enddo
    call random_seed(put=seed)        ! assign the seed
    deallocate(seed)
end subroutine

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * This program shows how to use MPI_Alltoall. Each processor
 * sends/receives a different random number to/from other processors.
 */

/* globals */
int numnodes, myid, mpi_err;
#define mpi_root 0
/* end globals */

void init_it(int *argc, char ***argv);
void seed_random(int id);
void random_number(float *z);

void init_it(int *argc, char ***argv) {
    mpi_err = MPI_Init(argc, argv);
    mpi_err = MPI_Comm_size(MPI_COMM_WORLD, &numnodes);
    mpi_err = MPI_Comm_rank(MPI_COMM_WORLD, &myid);
}

int main(int argc, char *argv[]) {
    int *scounts, *rcounts;
    int i;
    float z;

    init_it(&argc, &argv);
    scounts = (int*)malloc(sizeof(int)*numnodes);
    rcounts = (int*)malloc(sizeof(int)*numnodes);
    /* seed the random number generator with a
       different number on each processor */
    seed_random(myid);
    /* find data to send */
    for (i = 0; i < numnodes; i++) {
        random_number(&z);
        scounts[i] = (int)(10.0*z) + 1;
    }
    printf("myid= %d scounts=", myid);
    for (i = 0; i < numnodes; i++)
        printf("%d ", scounts[i]);
    printf("\n");
    /* send the data */
    mpi_err = MPI_Alltoall(scounts, 1, MPI_INT,
                           rcounts, 1, MPI_INT,
                           MPI_COMM_WORLD);
    printf("myid= %d rcounts=", myid);
    for (i = 0; i < numnodes; i++)
        printf("%d ", rcounts[i]);
    printf("\n");
    free(scounts);
    free(rcounts);
    mpi_err = MPI_Finalize();
    return 0;
}

void seed_random(int id) {
    srand((unsigned int)id);
}

void random_number(float *z) {
    /* scale rand() into [0, 1]; RAND_MAX is implementation-defined,
       so it must not be hard-coded as 32767 */
    *z = (float)rand() / (float)RAND_MAX;
}

Output of previous Fortran code on 4 procs
ultra:/work/majumdar/examples/mpi % bsub -q hpc -m ultra -I -n 4 a.out
Job <48059> is submitted to queue <hpc>.
<<Waiting for dispatch ...>>
<<Starting on ultra>>
 myid= 1 scounts= 5 2 4 4
 myid= 1 rcounts= 1 2 2 3
 myid= 2 scounts= 9 2 2 6
 myid= 2 rcounts= 7 4 2 11
 myid= 3 scounts= 3 3 11 8
 myid= 3 rcounts= 2 4 6 8
 myid= 0 scounts= 11 1 7 2
 myid= 0 rcounts= 11 5 9 3
--------------------------------------------
Rearranged by rank (row i is processor i):
  scounts           rcounts
  11  1  7  2       11  5  9  3
   5  2  4  4        1  2  2  3
   9  2  2  6        7  4  2 11
   3  3 11  8        2  4  6  8
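The scounts and rcounts values in this output are consistent with reading MPI_Alltoall, with one element per destination, as a matrix transpose. The plain C sketch below models that reading; it makes no MPI calls, the function name model_alltoall is hypothetical, and the element type is assumed to be int.

```c
#include <assert.h>

/* Plain C model of MPI_Alltoall with one int per destination (no MPI
 * calls). send and recv are n*n row-major arrays in which row r is rank
 * r's buffer. Element j of rank i's send buffer lands in slot i of rank
 * j's receive buffer, which is a transpose. */
void model_alltoall(const int *send, int n, int *recv)
{
    assert(n > 0);
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            recv[j * n + i] = send[i * n + j];
}
```

Feeding in the four scounts rows from the run above yields the four rcounts rows: for instance, rank 0 receives 11, 5, 9, 3, the first column of the scounts matrix.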

The variable or "V" operators
• A collection of very powerful but difficult-to-set-up global communication routines
• MPI_Gatherv: Gather different amounts of data from each processor to the root processor
• MPI_Alltoallv: Send and receive different amounts of data from all processors
• MPI_Allgatherv: Gather different amounts of data from each processor and send all data to each
• MPI_Scatterv: Send different amounts of data to each processor from the root processor
• We discuss MPI_Gatherv and MPI_Alltoallv

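MPI_Gatherv
• C
    int MPI_Gatherv(&sendbuf, sendcnts, sendtype, &recvbuf, &recvcnts, &rdispls, recvtype, root, comm);
• Fortran
    MPI_Gatherv(sendbuf, sendcnts, sendtype, recvbuf, recvcnts, rdispls, recvtype, root, comm, ierror)
• Parameters:
  • recvcnts is now an array: recvcnts(i) is the number of elements received from processor i
  • rdispls is an array of displacements: rdispls(i) is the offset in recvbuf at which the data from processor i is placed
• See attached codes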

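MPI_Gatherv
[Figure: send buffers on ranks 0-2 and the receive buffer on the root (rank 0)]
  sendbuf on rank 0 (root):  1
  sendbuf on rank 1:         2 2
  sendbuf on rank 2:         3 3 3
  recvbuf on the root:
    index   value
    0       1       displs(0) = 0, recvcounts(0) = 1
    1       2       displs(1) = 1, recvcounts(1) = 2
    2       2
    3       3       displs(2) = 3, recvcounts(2) = 3
    4       3
    5       3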

MPI_Gatherv code
Sample program:
    include 'mpif.h'
    ! irecv must hold 1 + 2 + 3 = 6 elements on the root
    integer isend(3), irecv(6)
    integer ircnt(0:2), idisp(0:2)
    data ircnt/1,2,3/ idisp/0,1,3/
    call MPI_INIT(ierr)
    call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
    call MPI_COMM_RANK(MPI_COMM_WORLD, myrank, ierr)
    ! rank r sends r+1 copies of the value r+1
    do i = 1, myrank+1
       isend(i) = myrank + 1
    enddo
    iscnt = myrank + 1
    call MPI_GATHERV(isend, iscnt, MPI_INTEGER, irecv, ircnt, idisp, MPI_INTEGER, &
                     0, MPI_COMM_WORLD, ierr)
    if (myrank .eq. 0) then
       print *, 'irecv =', irecv
    endif
    call MPI_FINALIZE(ierr)
    end
Sample execution:
    % bsub -q hpc -m ultra -I -n 3 ./a.out
    0: irecv = 1 2 2 3 3 3
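In the sample program the displacements 0, 1, 3 are the running totals of the counts 1, 2, 3, which packs the blocks back to back in the receive buffer. MPI does not require that choice; the plain C sketch below simply assumes it. It makes no MPI calls, and the names packed_displs and model_gatherv_place are hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Fill displs with the exclusive prefix sum of counts, so that blocks
 * are packed back to back. Returns the total number of elements. */
int packed_displs(const int *counts, int n, int *displs)
{
    assert(n >= 0);
    int offset = 0;
    for (int r = 0; r < n; r++) {
        displs[r] = offset;
        offset += counts[r];
    }
    return offset;
}

/* Plain C model of the root's side of MPI_Gatherv (no MPI calls):
 * place rank `rank`'s sendbuf of counts[rank] ints at recvbuf + displs[rank]. */
void model_gatherv_place(const int *sendbuf, int rank,
                         const int *counts, const int *displs, int *recvbuf)
{
    assert(rank >= 0 && counts[rank] >= 0 && displs[rank] >= 0);
    memcpy(recvbuf + displs[rank], sendbuf,
           (size_t)counts[rank] * sizeof(int));
}
```

For counts 1, 2, 3 the helper returns a total of 6, which is the size the root's receive buffer needs, and placing the three send buffers gives 1 2 2 3 3 3 as in the sample execution.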

MPI_Alltoallv
• Send and receive different amounts of data from all processors
• C
    int MPI_Alltoallv(&sendbuf, &sendcnts, &sdispls, sendtype, &recvbuf, &recvcnts, &rdispls, recvtype, comm);
• Fortran
    call MPI_Alltoallv(sendbuf, sendcnts, sdispls, sendtype, recvbuf, recvcnts, rdispls, recvtype, comm, ierror)
• See attached code

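MPI_Alltoallv
[Figure: send buffers on ranks 0-2 and the resulting receive buffers]
sendbuf (sendcounts and sdispls are the same on every rank):
  index   rank 0   rank 1   rank 2
  0       1        4        7        sendcounts(0) = 1, sdispls(0) = 0
  1       2        5        8        sendcounts(1) = 2, sdispls(1) = 1
  2       2        5        8
  3       3        6        9        sendcounts(2) = 3, sdispls(2) = 3
  4       3        6        9
  5       3        6        9
recvbuf (recvcounts and rdispls differ per rank; the rdispls labels are those of rank 2):
  index   rank 0   rank 1   rank 2
  0       1        2        3        rdispls(0) = 0
  1       4        2        3
  2       7        5        3
  3                5        6        rdispls(1) = 3
  4                8        6
  5                8        6
  6                         9        rdispls(2) = 6
  7                         9
  8                         9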

MPI_Alltoallv
Values of ircnt(i) on each proc:
  i \ proc   0   1   2
  0          1   2   3
  1          1   2   3
  2          1   2   3
Values of irdsp(i) on each proc:
  i \ proc   0   1   2
  0          0   0   0
  1          1   2   3
  2          2   4   6

MPI_Alltoallv
    program alltoallv
    include 'mpif.h'
    integer isend(6), irecv(9)
    integer iscnt(0:2), isdsp(0:2), ircnt(0:2), irdsp(0:2)
    data isend/1,2,2,3,3,3/
    data iscnt/1,2,3/ isdsp/0,1,3/
    call MPI_INIT(ierr)
    call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
    call MPI_COMM_RANK(MPI_COMM_WORLD, myrank, ierr)
    ! entries of irecv that receive no data stay 0, as in the sample output
    irecv = 0
    do i = 1, 6
       isend(i) = isend(i) + nprocs*myrank
    enddo
    ! every rank receives myrank+1 elements from each processor
    do i = 0, nprocs - 1
       ircnt(i) = myrank + 1
       irdsp(i) = i*(myrank + 1)
    enddo
    print *, 'isend=', isend
    ! MP_FLUSH is a vendor-specific output flush, not part of the MPI standard
    call MP_FLUSH(1)
    call MPI_ALLTOALLV(isend, iscnt, isdsp, MPI_INTEGER, irecv, ircnt, irdsp, MPI_INTEGER, &
                       MPI_COMM_WORLD, ierr)
    print *, 'irecv=', irecv
    call MPI_FINALIZE(ierr)
    end

MPI_Alltoallv
Sample execution of mpialltoallv program:
    % bsub -q hpc -m ultra -I -n 3
    0: isend = 1 2 2 3 3 3
    1: isend = 4 5 5 6 6 6
    2: isend = 7 8 8 9 9 9
    0: irecv = 1 4 7 0 0 0 0 0 0
    1: irecv = 2 2 5 5 8 8 0 0 0
    2: irecv = 3 3 3 6 6 6 9 9 9
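The irecv lines above can be reproduced with a plain C model of a single sender-to-receiver transfer inside MPI_Alltoallv. The sketch makes no MPI calls; model_alltoallv_pair is a hypothetical name, the parameter names follow the Fortran sample loosely, and the element type is assumed to be int.

```c
#include <assert.h>
#include <string.h>

/* Plain C model of one pairwise transfer inside MPI_Alltoallv (no MPI
 * calls). Rank `src` sends the block scnt[dst] / sdsp[dst] of its
 * sendbuf; rank `dst` stores it at rdsp[src] of its recvbuf.
 * scnt and sdsp belong to the sender, rcnt and rdsp to the receiver. */
void model_alltoallv_pair(const int *sendbuf, const int *scnt,
                          const int *sdsp, int dst,
                          int *recvbuf, const int *rcnt,
                          const int *rdsp, int src)
{
    /* The amount sent must match the amount the receiver expects. */
    assert(scnt[dst] == rcnt[src]);
    memcpy(recvbuf + rdsp[src], sendbuf + sdsp[dst],
           (size_t)scnt[dst] * sizeof(int));
}
```

Calling it for dst = 1 with src = 0, 1, 2, using iscnt = 1, 2, 3 and isdsp = 0, 1, 3 on the senders and ircnt = 2, 2, 2 and irdsp = 0, 2, 4 on rank 1, fills rank 1's receive buffer with 2 2 5 5 8 8 and leaves the last three entries untouched.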

Derived types
• C and Fortran 90 have the ability to define arbitrary data types that encapsulate reals, integers, and characters
• MPI allows you to define message data types corresponding to your data types
• These data types can be used just like the default types

Derived types: three main classifications
• Contiguous vectors: enable you to send contiguous blocks of the same type of data lumped together
• Noncontiguous vectors: enable you to send noncontiguous blocks of the same type of data lumped together
• Abstract types: enable you to (carefully) send C or Fortran 90 structures; don't send pointers
