Introduction to Message Passing Interface (MPI), Part II
SoCal Annual AAP workshop, October 30, 2006, San Diego Supercomputer Center
Overview
• Review of Basic Routines
• Global Communications in MPI
• Parallel Monte Carlo simulation example
• Point to Point Communications
• MPI tracing and performance tools
• Parallel Poisson solver using semi implicit method
• Overview of some advanced MPI calls
Review of Basic MPI routines
• MPI is used to create parallel programs based on message passing
• Usually the same program is run on multiple processors
• The 6 basic calls in MPI are:
  • MPI_INIT( ierr )
  • MPI_COMM_RANK( MPI_COMM_WORLD, myid, ierr )
  • MPI_COMM_SIZE( MPI_COMM_WORLD, numprocs, ierr )
  • MPI_Send(buffer, count, MPI_INTEGER, destination, tag, MPI_COMM_WORLD, ierr)
  • MPI_Recv(buffer, count, MPI_INTEGER, source, tag, MPI_COMM_WORLD, status, ierr)
  • MPI_FINALIZE(ierr)
Global Communications in MPI: Broadcast
• All nodes call MPI_Bcast
• One node (the root) sends a message; all other nodes receive it
• C
  • MPI_Bcast(&buffer, count, datatype, root, communicator);
• Fortran
  • call MPI_Bcast(buffer, count, datatype, root, communicator, ierr)
• root is the rank of the node that sends the message
Global Communications in MPI: Broadcast
• broadcast.c is a parallel program to broadcast data using MPI_Bcast
  • Initialize MPI
  • Have processor 0 broadcast an integer array
  • Have all processors print the data
  • Quit MPI
Global Communications in MPI: Broadcast

    /************************************************************
     This is a simple broadcast program in MPI
    ************************************************************/
    #include <stdio.h>
    #include "mpi.h"

    int main(int argc, char *argv[])
    {
        int i, myid, numprocs;
        int source, count;
        int buffer[4];

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
        MPI_Comm_rank(MPI_COMM_WORLD, &myid);

        source = 0;
        count = 4;
        /* Only the root fills the buffer before the broadcast. */
        if (myid == source) {
            for (i = 0; i < count; i++)
                buffer[i] = i;
        }
        /* Every rank makes the same call; afterwards all ranks hold the root's data. */
        MPI_Bcast(buffer, count, MPI_INT, source, MPI_COMM_WORLD);
        for (i = 0; i < count; i++)
            printf("%d ", buffer[i]);
        printf("\n");

        MPI_Finalize();
        return 0;
    }
Global Communications in MPI: Reduction
• Used to combine partial results from all processors
• Result returned to root processor
• Several types of operations available, for example summation, maximum, etc.
• Works on single elements and arrays
Global Communications in MPI: MPI_Reduce
• C
  • int MPI_Reduce(&sendbuf, &recvbuf, count, datatype, operation, root, communicator)
• Fortran
  • call MPI_Reduce(sendbuf, recvbuf, count, datatype, operation, root, communicator, ierr)
• Parameters
  • Like MPI_Bcast, a root MPI process is specified; only the root receives the combined result in recvbuf.
  • operation is the reduction operation to apply, for example MPI_SUM or MPI_MAX.
Global Communications in MPI: MPI_Reduce

    Operation     Meaning
    MPI_MAX       Maximum
    MPI_MIN       Minimum
    MPI_PROD      Product
    MPI_SUM       Sum
    MPI_LAND      Logical and
    MPI_LOR       Logical or
    MPI_LXOR      Logical exclusive or
    MPI_BAND      Bitwise and
    MPI_BOR       Bitwise or
    MPI_BXOR      Bitwise exclusive or
    MPI_MAXLOC    Maximum value and location
    MPI_MINLOC    Minimum value and location
Monte Carlo simulation to calculate Pi
• Monte Carlo methods are statistical simulation methods that use sequences of random numbers to perform simulations. Here the method estimates the ratio of the area of a circle to the area of its enclosing square (the ratio = Pi/4): the fraction of uniformly distributed random points in the square that fall inside the circle approaches this ratio.
• In the workshop directory the serial version of the program is pi-serial.f and the parallel version is pi-mpi.f. We use the ESSL random number generator (use -lessl while compiling).
pi-serial.f

!cccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc!
! This code illustrates a serial Monte Carlo method of         !
! calculating pi.                                              !
!cccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc!
      PROGRAM pical
      IMPLICIT NONE
      INTEGER I, N, SUM
      PARAMETER (N=2560000)
      REAL*8 SEED, X(2*N), R
      REAL*8 PI1, PI2
      PI1 = 4D0*DATAN(1D0)
      SEED = 573.0
      SUM = 0
! Fill X with 2*N uniform random numbers (ESSL routine)
      CALL DURAND(SEED,2*N,X)
! Each consecutive pair (X(I), X(I+1)) is one random point
      DO I = 1, 2*N-1, 2
        R = DSQRT(X(I)*X(I)+X(I+1)*X(I+1))
        IF (R.LE.1D0) THEN
          SUM = SUM+1
        ENDIF
        SEED = R*477.0
      ENDDO
      PI2 = DFLOAT(SUM)/DFLOAT(N)*4D0
      WRITE(*,*)"SUM=",SUM
      WRITE(*,*)"PI=",PI1
      WRITE(*,*)"Computed=",PI2
      END PROGRAM pical
pi-mpi.f

!cccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc!
! This code illustrates a parallel Monte Carlo method of       !
! calculating pi                                               !
!cccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc!
      PROGRAM pical
      IMPLICIT NONE
      INCLUDE 'mpif.h'
      INTEGER:: my_id, ntasks, ierr
      INTEGER:: I, N, SUM, NLOCAL, LOCALSUM
      PARAMETER (N=2560000)
      REAL*8 SEED, X(2*N), R
      REAL*8 PI1, PI2
      CALL MPI_INIT( ierr )
      CALL MPI_COMM_RANK( MPI_COMM_WORLD, my_id, ierr )
      CALL MPI_COMM_SIZE( MPI_COMM_WORLD, ntasks, ierr )
      WRITE(*,*) "I am task number", my_id,
     +  ". The total number of tasks is",ntasks
      PI1 = 4D0*DATAN(1D0)
      SEED = 573.0
      SUM = 0
      LOCALSUM = 0
      NLOCAL = 2*N/ntasks
      CALL DURAND(SEED,2*N,X)
      DO I = my_id*NLOCAL+1, my_id*NLOCAL+NLOCAL-1, 2
        R = DSQRT(X(I)*X(I)+X(I+1)*X(I+1))
        IF (R.LE.1D0) THEN
          LOCALSUM = LOCALSUM+1
        ENDIF
        SEED = R*477.0
      ENDDO
      WRITE(*,*)"MY ID is", my_id, "; LOCALSUM=", LOCALSUM
      CALL MPI_REDUCE(LOCALSUM, SUM, 1, MPI_INTEGER,
     +  MPI_SUM, 0, MPI_COMM_WORLD, ierr)
      IF (my_id.eq.0) THEN
        PI2 = DFLOAT(SUM)/DFLOAT(N)*4D0
        WRITE(*,*)"SUM=",SUM
        WRITE(*,*)"PI=",PI1
        WRITE(*,*)"Computed=",PI2
      ENDIF
      CALL MPI_FINALIZE(ierr)
      END PROGRAM pical
Monte Carlo simulation to calculate Pi

    0: I am task number 0 . The total number of tasks is 4
    1: I am task number 1 . The total number of tasks is 4
    2: I am task number 2 . The total number of tasks is 4
    3: I am task number 3 . The total number of tasks is 4
    3: MY ID is 3 ; LOCALSUM= 503186
    2: MY ID is 2 ; LOCALSUM= 502378
    1: MY ID is 1 ; LOCALSUM= 502711
    0: MY ID is 0 ; LOCALSUM= 502842
    0: SUM= 2011117
    0: PI= 3.14159265358979312
    0: Computed= 3.14237031250000021
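The computed value in this output can be checked by hand from the four LOCALSUM values and N = 2560000. The worked equation below only restates what the program does: MPI_REDUCE adds the local counts, and the hit fraction is multiplied by 4.

```latex
\[
\frac{\mathrm{SUM}}{N} \approx \frac{\pi}{4}
\quad\Longrightarrow\quad
\pi \approx 4 \cdot \frac{503186 + 502378 + 502711 + 502842}{2560000}
= 4 \cdot \frac{2011117}{2560000}
= 3.1423703125
\]
```

The estimate differs from Pi by about 7.8e-4, which is consistent with the statistical error of a Monte Carlo estimate, of order 1/sqrt(N) = 1/1600 for this N.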
Point to Point Communications in MPI
• Basic operations of Point to Point (PtoP) communication and issues of deadlock
• Several steps are involved in the PtoP communication
• Sending process
  • Data is copied to the user buffer by the user
  • User calls one of the MPI send routines
  • System copies the data from the user buffer to the system buffer
  • System sends the data from the system buffer to the destination processor
Point to Point Communications in MPI
• Receiving process
  • User calls one of the MPI receive subroutines
  • System receives the data from the source process, and copies it to the system buffer
  • System copies the data from the system buffer to the user buffer
  • User uses the data in the user buffer
Point to Point Communications in MPI
[Figure: data path of one message, drawn in two columns (user mode, kernel mode) for each process. Process 0: call send routine; data is copied from sendbuf to sysbuf; sendbuf can now be reused; data is sent from sysbuf to the destination. Process 1: data is received from the source into sysbuf; call receive routine; data is copied from sysbuf to recvbuf; recvbuf now contains valid data.]
Unidirectional Communication
• Blocking send and blocking receive
    if (myrank == 0) then
      call MPI_Send(…)
    elseif (myrank == 1) then
      call MPI_Recv(…)
    endif
• Non-blocking send and blocking receive
    if (myrank == 0) then
      call MPI_Isend(…)
      call MPI_Wait(…)
    elseif (myrank == 1) then
      call MPI_Recv(…)
    endif
Unidirectional Communication
• Blocking send and non-blocking recv
    if (myrank == 0) then
      call MPI_Send(…)
    elseif (myrank == 1) then
      call MPI_Irecv(…)
      call MPI_Wait(…)
    endif
• Non-blocking send and non-blocking recv
    if (myrank == 0) then
      call MPI_Isend(…)
      call MPI_Wait(…)
    elseif (myrank == 1) then
      call MPI_Irecv(…)
      call MPI_Wait(…)
    endif
Bidirectional Communication
• Need to be careful about deadlock when two processes exchange data with each other
• Deadlock can occur due to incorrect order of send and recv or due to limited size of the system buffer
[Figure: Rank 0 and Rank 1, each with a sendbuf and a recvbuf]
Bidirectional Communication
• Case 1: both processes call send first, then recv
    if (myrank == 0) then
      call MPI_Send(…)
      call MPI_Recv(…)
    elseif (myrank == 1) then
      call MPI_Send(…)
      call MPI_Recv(…)
    endif
• No deadlock as long as the system buffer is larger than the send buffer
• Deadlock if the system buffer is smaller than the send buffer
• Replacing MPI_Send with MPI_Isend followed immediately by MPI_Wait does not change this behavior
• Moral: a coding error of this kind may only show up for larger problem sizes, when the messages no longer fit in the system buffer
Bidirectional Communication
• Case 2: both processes call recv first, then send
    if (myrank == 0) then
      call MPI_Recv(…)
      call MPI_Send(…)
    elseif (myrank == 1) then
      call MPI_Recv(…)
      call MPI_Send(…)
    endif
• The above will always lead to deadlock (even if you replace MPI_Send with MPI_Isend and MPI_Wait), because both processes block in MPI_Recv waiting for a message that the other has not yet sent
Bidirectional Communication
• The following code can be safely executed
    if (myrank == 0) then
      call MPI_Irecv(…)
      call MPI_Send(…)
      call MPI_Wait(…)
    elseif (myrank == 1) then
      call MPI_Irecv(…)
      call MPI_Send(…)
      call MPI_Wait(…)
    endif
Bidirectional Communication
• Case 3: one process calls send and recv in this order, and the other calls them in the opposite order
    if (myrank == 0) then
      call MPI_Send(…)
      call MPI_Recv(…)
    elseif (myrank == 1) then
      call MPI_Recv(…)
      call MPI_Send(…)
    endif
• The above is always safe
• You can replace the send and recv on both processors with MPI_Isend and MPI_Irecv, each completed by MPI_Wait
MPI_Barrier
• Blocks the caller until all members in the communicator have called it.
• Used as a synchronization tool.
• C
  • MPI_Barrier(comm)
• Fortran
  • call MPI_BARRIER(COMM, IERROR)
• Parameter
  • comm: communicator (often MPI_COMM_WORLD)
Parallel Poisson Solver (Semi Implicit)
• $\partial^2 u/\partial x^2 + \partial^2 u/\partial y^2 = 0$, with boundary conditions $u(x=0) = u_{x0}(y)$, $u(x=1) = u_{x1}(y)$, $u(y=0) = u_{y0}(x)$, $u(y=1) = u_{y1}(x)$. $u(x,y)$ is known as an initial condition.
• For a steady state solution we add a time derivative term and advance the solution to convergence.
• Discretizing for numerical solution we get an equation of the form: $b_m\,\Delta u^{(n+1)}_{i,j-1} + b\,\Delta u^{(n+1)}_{i,j} + b_p\,\Delta u^{(n+1)}_{i,j+1} = \mathrm{RHS}^{(n)}_{i,j}$ ($n$ is the index in time and $i$, $j$ are the indices in space)
• The example code is called yblock.f. We use basic MPI calls in the code.
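For a fixed i, the discretized equation couples only the unknowns at j-1, j and j+1, so each grid line in j gives a tridiagonal linear system. The slides do not say how yblock.f solves it. The sketch below uses the standard Thomas algorithm as one possible approach, assuming constant coefficients (named bm, b, bp here), a zero correction outside the line, and a diagonally dominant system so that no pivoting is needed. The function name solve_tridiagonal is hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Solve  bm*x[j-1] + b*x[j] + bp*x[j+1] = rhs[j]  for j = 0..n-1,
 * taking x[-1] = x[n] = 0. Thomas algorithm without pivoting; assumes the
 * system is diagonally dominant (for example |b| > |bm| + |bp|).
 * Returns 0 on success, -1 if the scratch array cannot be allocated. */
int solve_tridiagonal(int n, double bm, double b, double bp,
                      const double *rhs, double *x)
{
    assert(n >= 1);
    double *c = malloc((size_t)n * sizeof *c);  /* modified upper diagonal */
    if (c == NULL)
        return -1;

    /* Forward elimination: remove the sub-diagonal row by row. */
    c[0] = bp / b;
    x[0] = rhs[0] / b;
    for (int j = 1; j < n; j++) {
        double denom = b - bm * c[j - 1];
        c[j] = bp / denom;
        x[j] = (rhs[j] - bm * x[j - 1]) / denom;
    }

    /* Back substitution: x[n-1] is already final. */
    for (int j = n - 2; j >= 0; j--)
        x[j] -= c[j] * x[j + 1];

    free(c);
    return 0;
}
```

Because the system for one value of i does not involve other values of i, each process can solve the lines it owns without communication, which is why an implicit direction and a parallel direction can be chosen independently.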
Parallel Poisson Solver (Semi Implicit)
[Figure: computational grid with indices 1 to 11 in one direction and 1 to 4 in the other, divided among Processor 0, Processor 1 and Processor 2]
• The implicit part of the solver is in the j-direction
• The parallelization is done in the i-direction as shown in the figure above
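Parallelizing in the i-direction means that each process owns a contiguous range of i indices. The slides do not show how yblock.f assigns these ranges, so the following is only a sketch of one common block decomposition. The helper name block_range is hypothetical, and the rule of giving leftover grid lines to the lowest ranks is an assumption.

```c
#include <assert.h>

/* Hypothetical helper: split indices 1..ni (1-based, as in Fortran) into
 * contiguous blocks, one per rank. The first (ni % nprocs) ranks receive
 * one extra index. Outputs the inclusive range [*i_start, *i_end]. */
void block_range(int ni, int nprocs, int rank, int *i_start, int *i_end)
{
    assert(nprocs > 0 && rank >= 0 && rank < nprocs);
    int base = ni / nprocs;      /* minimum number of indices per rank */
    int extra = ni % nprocs;     /* leftover indices to distribute */
    int count = base + (rank < extra ? 1 : 0);
    int offset = rank * base + (rank < extra ? rank : extra);
    *i_start = offset + 1;
    *i_end = offset + count;
}
```

For example, with 11 grid lines in i and 3 processes, this rule assigns 1 to 4, 5 to 8 and 9 to 11 to ranks 0, 1 and 2. The split actually used in the workshop figure may differ.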
Poisson Solver: Compiling and Running
Compiling with MPI trace libraries and ESSL:

    mpxlf yblock.f -L/usr/local/apps/mpitrace -lmpiprof -lessl

Output:

    0: Process 0 of 4 is GO!
    1: Process 1 of 4 is GO!
    2: Process 2 of 4 is GO!
    3: Process 3 of 4 is GO!
    0: Done Step # 1
    1: Done Step # 1
    2: Done Step # 1
    3: Done Step # 1
    0: Done Step # 2
MPI Trace Output

    -----------------------------------------------------------------
    MPI Routine            #calls     avg. bytes      time(sec)
    -----------------------------------------------------------------
    MPI_Comm_size               1            0.0          0.000
    MPI_Comm_rank               1            0.0          0.000
    MPI_Send                  500         1024.0          0.001
    MPI_Recv                  500         1024.0          0.008
    MPI_Barrier               500            0.0          0.013
    -----------------------------------------------------------------
    total communication time = 0.022 seconds.
    total elapsed time       = 3.510 seconds.
    user cpu time            = 3.500 seconds.
    system time              = 0.010 seconds.
    maximum memory size      = 15856 KBytes.
    -----------------------------------------------------------------
    Message size distributions:

    MPI_Send               #calls     avg. bytes      time(sec)
                              500         1024.0          0.001

    MPI_Recv               #calls     avg. bytes      time(sec)
                              500         1024.0          0.008
    -----------------------------------------------------------------
    Call Graph Section:

    communication time = 0.022 sec, parent = poisson
    MPI Routine            #calls      time(sec)
    MPI_Send                  500          0.001
    MPI_Recv                  500          0.008
    MPI_Barrier               500          0.013

    communication time = 0.000 sec, parent = dot
    MPI Routine            #calls      time(sec)
    MPI_Comm_size               1          0.000
    MPI_Comm_rank               1          0.000
Overview of Some Advanced MPI Routines
• Can split MPI communicators (MPI_Comm_split)
• Probe incoming messages (MPI_Probe)
• Asynchronous communication (MPI_Isend, MPI_Irecv, MPI_Wait, MPI_Test, etc.)
• Scatter different data to different processors (MPI_Scatter), Gather (MPI_Gather)
• MPI_Allreduce, MPI_Alltoall
• MPI_Gatherv, MPI_Alltoallv, etc.
• Derived data types (MPI_TYPE_STRUCT, etc.)
• MPI I/O