Message Passing Interface (MPI) 3

Amit Majumdar
Scientific Computing Applications Group
San Diego Supercomputer Center

Tim Kaiser (now at Colorado School of Mines)
MPI 3 Lecture Overview
• Collective Communications
• Advanced Topics
• “V” Operations
• Derived Data Types
• Communicators
Broadcast Operation: MPI_Bcast
• All nodes call MPI_Bcast
• One node (the root) sends the message; all the others receive it
• C:
  MPI_Bcast(&buffer, count, datatype, root, COMM);
• Fortran:
  call MPI_Bcast(buffer, count, datatype, root, COMM, ierr)
• The root is the node that sends the message
Broadcast Example
• Write a parallel program to broadcast data using MPI_Bcast
• Initialize MPI
• Have processor 0 broadcast an array of integers
• Have all processors print the data
• Quit MPI
#include <mpi.h>
#include <stdio.h>

/************************************************************
 This is a simple broadcast program in MPI
************************************************************/

int main(int argc, char *argv[])
{
    int i, myid, numprocs;
    int source, count;
    int buffer[8];

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);

    source = 0;          /* rank that broadcasts */
    count  = 8;          /* must not exceed the size of buffer */
    if (myid == source) {
        for (i = 0; i < count; i++)
            buffer[i] = i;
    }
    MPI_Bcast(buffer, count, MPI_INT, source, MPI_COMM_WORLD);
    for (i = 0; i < count; i++)
        printf("%d ", buffer[i]);
    printf("\n");
    MPI_Finalize();
    return 0;
}
Output of broadcast program

ds100 % more LL_out.478780
0:0 1 2 3 4 5 6 7
1:0 1 2 3 4 5 6 7
2:0 1 2 3 4 5 6 7
3:0 1 2 3 4 5 6 7
4:0 1 2 3 4 5 6 7
5:0 1 2 3 4 5 6 7
6:0 1 2 3 4 5 6 7
7:0 1 2 3 4 5 6 7
Reduction Operations
• Used to combine partial results from all processors
• Result is returned to the root processor
• Several types of operations are available
• Works on single elements and arrays
MPI_Reduce
• C:
  int MPI_Reduce(&sendbuf, &recvbuf, count, datatype, operation, root, communicator)
• Fortran:
  call MPI_Reduce(sendbuf, recvbuf, count, datatype, operation, root, communicator, ierr)
• Parameters:
  • Like MPI_Bcast, a root is specified
  • operation selects the reduction to apply (see the table of predefined operations on the next slide)
Operations for MPI_Reduce

MPI_MAX     Maximum
MPI_MIN     Minimum
MPI_PROD    Product
MPI_SUM     Sum
MPI_LAND    Logical and
MPI_LOR     Logical or
MPI_LXOR    Logical exclusive or
MPI_BAND    Bitwise and
MPI_BOR     Bitwise or
MPI_BXOR    Bitwise exclusive or
MPI_MAXLOC  Maximum value and location
MPI_MINLOC  Minimum value and location
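MPI_MAXLOC and MPI_MINLOC reduce (value, index) pairs rather than plain values, so they are used with one of MPI's pair datatypes such as MPI_DOUBLE_INT or MPI_2INT. The following minimal sketch (not from the lecture; the variable names are illustrative) finds the smallest value over all ranks and the rank that owns it:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int myid;
    /* layout must match MPI_DOUBLE_INT: a double followed by an int */
    struct { double value; int rank; } local, global;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);

    local.value = 100.0 - myid;   /* stand-in for a real partial result */
    local.rank  = myid;           /* index carried along with the value */

    /* MPI_MINLOC keeps the smallest value and the index paired with it */
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE_INT, MPI_MINLOC,
               0, MPI_COMM_WORLD);

    if (myid == 0)
        printf("min value %f on rank %d\n", global.value, global.rank);

    MPI_Finalize();
    return 0;
}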
Global Sum with MPI_Reduce

C:
  double sum_partial, sum_global;
  sum_partial = ...;
  ierr = MPI_Reduce(&sum_partial, &sum_global, 1,
                    MPI_DOUBLE, MPI_SUM,
                    root, MPI_COMM_WORLD);

Fortran:
  double precision sum_partial, sum_global
  sum_partial = ...
  call MPI_Reduce(sum_partial, sum_global, 1, &
                  MPI_DOUBLE_PRECISION, MPI_SUM, &
                  root, MPI_COMM_WORLD, ierr)
Broadcast, Scatter and Gather (data movement among processors p0 through p3)
• broadcast: A on p0 ends up on p0, p1, p2 and p3
• scatter: A B C D on p0 is split up, so p0 gets A, p1 gets B, p2 gets C, p3 gets D
• gather: the inverse of scatter; A, B, C, D are collected into one array on p0
• allgather: every processor ends up with the full A B C D
Scatter Operation using MPI_Scatter
• Similar to broadcast, but sends a different section of an array to each processor

Data in an array on the root node:  A(0) A(1) A(2) ... A(N-1)
Goes to processors:                 P0   P1   P2  ...  Pn-1
MPI_Scatter
• C:
  int MPI_Scatter(&sendbuf, sendcnts, sendtype, &recvbuf, recvcnts, recvtype, root, comm);
• Fortran:
  call MPI_Scatter(sendbuf, sendcnts, sendtype, recvbuf, recvcnts, recvtype, root, comm, ierror)
• Parameters:
  • sendbuf is an array of size (number of processors * sendcnts)
  • sendcnts: number of elements sent to each processor
  • recvcnts: number of element(s) received from the root processor
  • recvbuf holds the element(s) received from the root processor; it may be an array
Scatter Operation using MPI_Scatter
• Scatter with sendcnts = 2: each processor receives two consecutive elements of the root's array into its local array B

Data in an array on the root node:  A(0) A(1) A(2) A(3) ... A(2N-2) A(2N-1)
  P0 gets A(0), A(1) into B(0), B(1)
  P1 gets A(2), A(3) into B(0), B(1)
  P2 gets A(4), A(5) into B(0), B(1)
  ...
  Pn-1 gets A(2N-2), A(2N-1) into B(0), B(1)
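A minimal sketch of this case (not from the lecture; the array names follow the slide above), assuming the root's array holds 2 * numprocs integers:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    int myid, numprocs, i;
    int B[2];              /* each rank receives two elements */
    int *A = NULL;         /* only the root needs the full array */

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);

    if (myid == 0) {
        A = (int*)malloc(2 * numprocs * sizeof(int));
        for (i = 0; i < 2 * numprocs; i++)
            A[i] = i;      /* A(0) ... A(2N-1) */
    }

    /* rank k receives A(2k) and A(2k+1) */
    MPI_Scatter(A, 2, MPI_INT, B, 2, MPI_INT, 0, MPI_COMM_WORLD);
    printf("myid= %d got %d %d\n", myid, B[0], B[1]);

    if (myid == 0) free(A);
    MPI_Finalize();
    return 0;
}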
Global Sum Example with MPI_Reduce
• Example program to sum data from all processors
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
/*
! This program shows how to use MPI_Scatter and MPI_Reduce.
! Each processor gets different data from the root processor
! by way of MPI_Scatter. The data is summed and then sent back
! to the root processor using MPI_Reduce. The root processor
! then prints the global sum.
*/
/* globals */
int numnodes, myid, mpi_err;
#define mpi_root 0
/* end globals */

void init_it(int *argc, char ***argv);

void init_it(int *argc, char ***argv) {
    mpi_err = MPI_Init(argc, argv);
    mpi_err = MPI_Comm_size(MPI_COMM_WORLD, &numnodes);
    mpi_err = MPI_Comm_rank(MPI_COMM_WORLD, &myid);
}

int main(int argc, char *argv[]) {
    int *myray, *send_ray, *back_ray;
    int count;
    int size, mysize, i, k, j, total, gtotal;

    init_it(&argc, &argv);
    /* each processor will get count elements from the root */
    count = 4;
    myray = (int*)malloc(count * sizeof(int));
    /* create the data to be sent on the root */
    if (myid == mpi_root) {
        size = count * numnodes;
        send_ray = (int*)malloc(size * sizeof(int));
        for (i = 0; i < size; i++)
            send_ray[i] = i;
    }
    /* send different data to each processor */
    mpi_err = MPI_Scatter(send_ray, count, MPI_INT,
                          myray,    count, MPI_INT,
                          mpi_root, MPI_COMM_WORLD);
    /* each processor does a local sum */
    total = 0;
    for (i = 0; i < count; i++)
        total = total + myray[i];
    printf("myid= %d total= %d\n", myid, total);
    /* send the local sums back to the root */
    mpi_err = MPI_Reduce(&total, &gtotal, 1, MPI_INT,
                         MPI_SUM, mpi_root, MPI_COMM_WORLD);
    /* the root prints the global sum */
    if (myid == mpi_root) {
        printf("results from all processors= %d \n", gtotal);
    }
    mpi_err = MPI_Finalize();
}
Gather Operation using MPI_Gather
• Used to collect data from all processors onto the root; the inverse of scatter
• Data is collected into an array on the root processor

Data from the processors:           P0  P1  P2  ...  Pn-1
                                    A0  A1  A2  ...  An-1
Goes to an array on the root node:  A(0) A(1) A(2) ... A(N-1)
MPI_Gather
• C:
  int MPI_Gather(&sendbuf, sendcnts, sendtype, &recvbuf, recvcnts, recvtype, root, comm);
• Fortran:
  call MPI_Gather(sendbuf, sendcnts, sendtype, recvbuf, recvcnts, recvtype, root, comm, ierror)
• Parameters:
  • sendcnts: number of elements sent from each processor
  • sendbuf is an array of size sendcnts
  • recvcnts: number of elements received from each processor
  • recvbuf is an array of size recvcnts * number of processors
Code for Scatter and Gather
• A parallel program to scatter data using MPI_Scatter
• Each processor sums its data
• Use MPI_Gather to get the sums back to the root processor
• The root processor prints the global sum
• See the attached Fortran and C code
module mpi
  include "mpif.h"
end module
! This program shows how to use MPI_Scatter and MPI_Gather.
! Each processor gets different data from the root processor
! by way of mpi_scatter. The data is summed and then sent back
! to the root processor using MPI_Gather. The root processor
! then prints the global sum.
module global
  integer numnodes, myid, mpi_err
  integer, parameter :: mpi_root = 0
end module

subroutine init
  use mpi
  use global
  implicit none
! do the mpi init stuff
  call MPI_INIT(mpi_err)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, numnodes, mpi_err)
  call MPI_Comm_rank(MPI_COMM_WORLD, myid, mpi_err)
end subroutine init

program test1
  use mpi
  use global
  implicit none
  integer, allocatable :: myray(:), send_ray(:), back_ray(:)
  integer count
  integer size, mysize, i, k, j, total

  call init
! each processor will get count elements from the root
  count = 4
  allocate(myray(count))
! create the data to be sent on the root
  if (myid == mpi_root) then
      size = count * numnodes
      allocate(send_ray(0:size-1))
      allocate(back_ray(0:numnodes-1))
      do i = 0, size-1
          send_ray(i) = i
      enddo
  endif
  call MPI_Scatter(send_ray, count, MPI_INTEGER, &
                   myray,    count, MPI_INTEGER, &
                   mpi_root, MPI_COMM_WORLD, mpi_err)
! each processor does a local sum
  total = sum(myray)
  write(*,*) "myid= ", myid, " total= ", total
! send the local sums back to the root
  call MPI_Gather(total, 1, MPI_INTEGER, &
                  back_ray, 1, MPI_INTEGER, &
                  mpi_root, MPI_COMM_WORLD, mpi_err)
! the root prints the global sum
  if (myid == mpi_root) then
      write(*,*) "results from all processors= ", sum(back_ray)
  endif
  call mpi_finalize(mpi_err)
end program
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
/*
! This program shows how to use MPI_Scatter and MPI_Gather.
! Each processor gets different data from the root processor
! by way of mpi_scatter. The data is summed and then sent back
! to the root processor using MPI_Gather. The root processor
! then prints the global sum.
*/
/* globals */
int numnodes, myid, mpi_err;
#define mpi_root 0
/* end globals */

void init_it(int *argc, char ***argv);

void init_it(int *argc, char ***argv) {
    mpi_err = MPI_Init(argc, argv);
    mpi_err = MPI_Comm_size(MPI_COMM_WORLD, &numnodes);
    mpi_err = MPI_Comm_rank(MPI_COMM_WORLD, &myid);
}

int main(int argc, char *argv[]) {
    int *myray, *send_ray, *back_ray;
    int count;
    int size, mysize, i, k, j, total;

    init_it(&argc, &argv);
    /* each processor will get count elements from the root */
    count = 4;
    myray = (int*)malloc(count * sizeof(int));
    /* create the data to be sent on the root */
    if (myid == mpi_root) {
        size = count * numnodes;
        send_ray = (int*)malloc(size * sizeof(int));
        back_ray = (int*)malloc(numnodes * sizeof(int));
        for (i = 0; i < size; i++)
            send_ray[i] = i;
    }
    /* send different data to each processor */
    mpi_err = MPI_Scatter(send_ray, count, MPI_INT,
                          myray,    count, MPI_INT,
                          mpi_root, MPI_COMM_WORLD);
    /* each processor does a local sum */
    total = 0;
    for (i = 0; i < count; i++)
        total = total + myray[i];
    printf("myid= %d total= %d\n", myid, total);
    /* send the local sums back to the root */
    mpi_err = MPI_Gather(&total, 1, MPI_INT,
                         back_ray, 1, MPI_INT,
                         mpi_root, MPI_COMM_WORLD);
    /* the root prints the global sum */
    if (myid == mpi_root) {
        total = 0;
        for (i = 0; i < numnodes; i++)
            total = total + back_ray[i];
        printf("results from all processors= %d \n", total);
    }
    mpi_err = MPI_Finalize();
}
Output of previous code on 4 procs

myid= 1 total= 22
myid= 2 total= 38
myid= 3 total= 54
myid= 0 total= 6
results from all processors= 120

(0 through 15 added up: 15 * (15 + 1) / 2 = 120)
MPI_Allgather and MPI_Allreduce
• Gather and Reduce come in an "ALL" variation
• Results are returned to all processors
• The root parameter is missing from the call
• Similar to a gather or reduce followed by a broadcast (see the sketch below)
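A minimal MPI_Allreduce sketch (not from the lecture; the variable names are illustrative): every rank contributes a partial sum and every rank receives the global sum, with no root argument.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int myid;
    double partial, global;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);

    partial = (double)myid;    /* stand-in for a real partial result */

    /* like MPI_Reduce followed by MPI_Bcast, but in one call */
    MPI_Allreduce(&partial, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    printf("myid= %d global sum= %f\n", myid, global);
    MPI_Finalize();
    return 0;
}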
All to All Communication with MPI_Alltoall
• Each processor sends data to, and receives data from, all other processors
• C:
  int MPI_Alltoall(&sendbuf, sendcnts, sendtype, &recvbuf, recvcnts, recvtype, comm);
• Fortran:
  call MPI_Alltoall(sendbuf, sendcnts, sendtype, recvbuf, recvcnts, recvtype, comm, ierror)
MPI_Alltoall (4 processors, before and after)

Before:              After:
P0: a0 a1 a2 a3      P0: a0 b0 c0 d0
P1: b0 b1 b2 b3      P1: a1 b1 c1 d1
P2: c0 c1 c2 c3      P2: a2 b2 c2 d2
P3: d0 d1 d2 d3      P3: a3 b3 c3 d3
All to All with MPI_Alltoall
• Parameters:
  • sendcnts: number of elements sent to each processor
  • sendbuf: the data to send, sendcnts element(s) per destination
  • recvcnts: number of elements received from each processor
  • recvbuf: the received data, recvcnts element(s) per source
• Note that with a count of 1, both the send buffer and the receive buffer are arrays whose size equals the number of processors
• See the attached Fortran and C codes
module mpi
  include "mpif.h"
end module
! This program shows how to use MPI_Alltoall. Each processor
! sends/receives a different random number to/from the other processors.
module global
  integer numnodes, myid, mpi_err
  integer, parameter :: mpi_root = 0
end module

subroutine init
  use mpi
  use global
  implicit none
! do the mpi init stuff
  call MPI_INIT(mpi_err)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, numnodes, mpi_err)
  call MPI_Comm_rank(MPI_COMM_WORLD, myid, mpi_err)
end subroutine init

program test1
  use mpi
  use global
  implicit none
  integer, allocatable :: scounts(:), rcounts(:)
  integer ssize, rsize, i, k, j
  real z

  call init
! counts and displacement arrays
  allocate(scounts(0:numnodes-1))
  allocate(rcounts(0:numnodes-1))
  call seed_random
! find data to send
  do i = 0, numnodes-1
      call random_number(z)
      scounts(i) = nint(10.0*z) + 1
  enddo
  write(*,*) "myid= ", myid, " scounts= ", scounts
! send the data
  call MPI_alltoall(scounts, 1, MPI_INTEGER, &
                    rcounts, 1, MPI_INTEGER, &
                    MPI_COMM_WORLD, mpi_err)
  write(*,*) "myid= ", myid, " rcounts= ", rcounts
  call mpi_finalize(mpi_err)
end program

subroutine seed_random
  use global
  implicit none
  integer the_size, j
  integer, allocatable :: seed(:)
  real z
  call random_seed(size=the_size)              ! how big is the intrinsic seed?
  allocate(seed(the_size))                     ! allocate space for the seed
  do j = 1, the_size                           ! create the seed
      seed(j) = abs(myid*10) + (j*myid*myid) + 100   ! abs is generic
  enddo
  call random_seed(put=seed)                   ! assign the seed
  deallocate(seed)
end subroutine
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
/*
! This program shows how to use MPI_Alltoall. Each processor
! sends/receives a different random number to/from the other processors.
*/
/* globals */
int numnodes, myid, mpi_err;
#define mpi_root 0
/* end globals */

void init_it(int *argc, char ***argv);
void seed_random(int id);
void random_number(float *z);

void init_it(int *argc, char ***argv) {
    mpi_err = MPI_Init(argc, argv);
    mpi_err = MPI_Comm_size(MPI_COMM_WORLD, &numnodes);
    mpi_err = MPI_Comm_rank(MPI_COMM_WORLD, &myid);
}

int main(int argc, char *argv[]) {
    int *sray, *rray;
    int *scounts, *rcounts;
    int ssize, rsize, i, k, j;
    float z;

    init_it(&argc, &argv);
    scounts = (int*)malloc(sizeof(int) * numnodes);
    rcounts = (int*)malloc(sizeof(int) * numnodes);
    /*
    ! seed the random number generator with a
    ! different number on each processor
    */
    seed_random(myid);
    /* find data to send */
    for (i = 0; i < numnodes; i++) {
        random_number(&z);
        scounts[i] = (int)(10.0 * z) + 1;
    }
    printf("myid= %d scounts=", myid);
    for (i = 0; i < numnodes; i++)
        printf("%d ", scounts[i]);
    printf("\n");

    /* send the data */
    mpi_err = MPI_Alltoall(scounts, 1, MPI_INT,
                           rcounts, 1, MPI_INT,
                           MPI_COMM_WORLD);
    printf("myid= %d rcounts=", myid);
    for (i = 0; i < numnodes; i++)
        printf("%d ", rcounts[i]);
    printf("\n");
    mpi_err = MPI_Finalize();
}

void seed_random(int id) {
    srand((unsigned int)id);
}

void random_number(float *z) {
    int i;
    i = rand();
    *z = (float)i / (float)RAND_MAX;
}
Output of previous code on 4 procs

myid= 1 scounts= 6 2 4 6
myid= 1 rcounts= 7 2 7 3
myid= 2 scounts= 1 7 4 4
myid= 2 rcounts= 4 4 4 4
myid= 3 scounts= 6 3 4 3
myid= 3 rcounts= 7 6 4 3
myid= 0 scounts= 1 7 4 7
myid= 0 rcounts= 1 6 1 6
--------------------------------------------
scounts (one row per rank)   rcounts (one row per rank)
1 7 4 7                      1 6 1 6
6 2 4 6                      7 2 7 3
1 7 4 4                      4 4 4 4
6 3 4 3                      7 6 4 3
The variable or “V” operators
• A collection of very powerful but more difficult to set up global communication routines
• MPI_Gatherv: gather different amounts of data from each processor to the root processor
• MPI_Alltoallv: send and receive different amounts of data from all processors
• MPI_Allgatherv: gather different amounts of data from each processor and send all the data to every processor
• MPI_Scatterv: send different amounts of data to each processor from the root processor
• We discuss MPI_Gatherv and MPI_Alltoallv (a minimal MPI_Scatterv sketch follows below)
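Since the lecture focuses on MPI_Gatherv and MPI_Alltoallv, here is a minimal MPI_Scatterv sketch (not from the lecture; the names and counts are illustrative) in which the root sends k+1 integers to rank k:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    int myid, numnodes, i;
    int *sendbuf = NULL, *sendcounts = NULL, *displs = NULL;
    int *recvbuf;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numnodes);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);

    if (myid == 0) {
        int total = numnodes * (numnodes + 1) / 2;     /* 1+2+...+numnodes */
        sendbuf    = (int*)malloc(total * sizeof(int));
        sendcounts = (int*)malloc(numnodes * sizeof(int));
        displs     = (int*)malloc(numnodes * sizeof(int));
        for (i = 0; i < total; i++)
            sendbuf[i] = i;
        for (i = 0; i < numnodes; i++) {
            sendcounts[i] = i + 1;                     /* rank i gets i+1 items */
            displs[i] = (i == 0) ? 0 : displs[i-1] + sendcounts[i-1];
        }
    }

    recvbuf = (int*)malloc((myid + 1) * sizeof(int));
    MPI_Scatterv(sendbuf, sendcounts, displs, MPI_INT,
                 recvbuf, myid + 1, MPI_INT, 0, MPI_COMM_WORLD);

    printf("myid= %d received %d elements, first= %d\n",
           myid, myid + 1, recvbuf[0]);
    MPI_Finalize();
    return 0;
}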
MPI_Gatherv
• C:
  int MPI_Gatherv(&sendbuf, sendcnts, sendtype, &recvbuf, &recvcnts, &rdispls, recvtype, root, comm);
• Fortran:
  call MPI_Gatherv(sendbuf, sendcnts, sendtype, recvbuf, recvcnts, rdispls, recvtype, root, comm, ierror)
• Parameters:
  • recvcnts is now an array, with one entry per sending processor
  • rdispls is an array of displacements into recvbuf
• See the attached codes
MPI_Gatherv (3 processes, rank 0 = root)

rank 0 sendbuf: 1
rank 1 sendbuf: 2 2
rank 2 sendbuf: 3 3 3
On the root: recvcnts = {1, 2, 3}, rdispls = {0, 1, 3}
recvbuf after MPI_Gatherv: 1 2 2 3 3 3
(A regular gather collects equal-sized pieces A, B, C, D one after another.)
MPI_Gatherv code

Sample program:

      include 'mpif.h'
      integer isend(3), irecv(6)
      integer ircnt(0:2), idisp(0:2)
      data ircnt/1,2,3/ idisp/0,1,3/
      call mpi_init(ierr)
      call mpi_comm_size(MPI_COMM_WORLD, nprocs, ierr)
      call mpi_comm_rank(MPI_COMM_WORLD, myrank, ierr)
      do i = 1, myrank+1
          isend(i) = myrank + 1
      enddo
      iscnt = myrank + 1
      call MPI_GATHERV(isend, iscnt, MPI_INTEGER,        &
                       irecv, ircnt, idisp, MPI_INTEGER, &
                       0, MPI_COMM_WORLD, ierr)
      if (myrank .eq. 0) then
          print *, 'irecv =', irecv
      endif
      call MPI_FINALIZE(ierr)
      end

Sample execution:
% bsub -q hpc -m ultra -I -n 3 ./a.out
% 0: irecv = 1 2 2 3 3 3
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
/*
! This program shows how to use MPI_Gatherv. Each processor sends a
! different amount of data to the root processor. We use MPI_Gather
! first to tell the root how much data is going to be sent.
*/
/* globals */
int numnodes, myid, mpi_err;
#define mpi_root 0
/* end of globals */

void init_it(int *argc, char ***argv);

void init_it(int *argc, char ***argv) {
    mpi_err = MPI_Init(argc, argv);
    mpi_err = MPI_Comm_size(MPI_COMM_WORLD, &numnodes);
    mpi_err = MPI_Comm_rank(MPI_COMM_WORLD, &myid);
}

int main(int argc, char *argv[]) {
    int *will_use;
    int *myray, *displacements, *counts, *allray;
    int size, mysize, i;

    init_it(&argc, &argv);
    mysize = myid + 1;
    myray = (int*)malloc(mysize * sizeof(int));
    for (i = 0; i < mysize; i++)
        myray[i] = myid + 1;
    /* counts and displacement arrays are only required on the root */
    if (myid == mpi_root) {
        counts = (int*)malloc(numnodes * sizeof(int));
        displacements = (int*)malloc(numnodes * sizeof(int));
    }
    /* we gather the counts to the root */
    mpi_err = MPI_Gather((void*)myray, 1, MPI_INT,
                         (void*)counts, 1, MPI_INT,
                         mpi_root, MPI_COMM_WORLD);
    /* calculate displacements and the size of the recv array */
    if (myid == mpi_root) {
        displacements[0] = 0;
        for (i = 1; i < numnodes; i++) {
            displacements[i] = counts[i-1] + displacements[i-1];
        }
        size = 0;
        for (i = 0; i < numnodes; i++)
            size = size + counts[i];
        allray = (int*)malloc(size * sizeof(int));
    }
    /* different amounts of data from each processor */
    /* are gathered to the root */
    mpi_err = MPI_Gatherv(myray, mysize, MPI_INT,
                          allray, counts, displacements, MPI_INT,
                          mpi_root, MPI_COMM_WORLD);
    if (myid == mpi_root) {
        for (i = 0; i < size; i++)
            printf("%d ", allray[i]);
        printf("\n");
    }
    mpi_err = MPI_Finalize();
}

ultra% bsub -q hpc -m ultra -I -n 3 ./a.out
1 2 2 3 3 3