
Message-Passing Computing



Presentation Transcript


  1. Message-Passing Computing Cluster Computing, UNC B. Wilkinson, 2007.

  2. Message-Passing Programming using User-level Message-Passing Libraries Programming by using a normal high-level language such as C, augmented with message-passing library calls that perform direct process-to-process message passing. This requires: • A method of creating separate processes for execution on different computers • A method of sending and receiving messages

  3. Multiple program, multiple data (MPMD) model [Diagram: separate source files, each compiled to suit its processor, producing one executable per processor, Processor 0 through Processor p - 1.]

  4. Multiple Program Multiple Data (MPMD) model with dynamic process creation Processes started from within master process - dynamic process creation. [Diagram: Process 1 calls spawn() to start execution of Process 2 at a later point in time.] PVM used this form. MPI-2 has dynamic process creation, although we will not use it. Potentially very flexible but incurs the overhead of dynamically starting processes.
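
For reference only (since the course does not use it), a minimal sketch of MPI-2 dynamic process creation with MPI_Comm_spawn(); the executable name "worker" and the process count 4 are illustrative assumptions, not taken from the slides:

#include <mpi.h>

int main(int argc, char *argv[])
{
    MPI_Comm workers;           /* intercommunicator to the spawned processes */
    int errcodes[4];            /* one error code per spawned process */

    MPI_Init(&argc, &argv);

    /* Start 4 copies of a hypothetical executable "worker" at run time */
    MPI_Comm_spawn("worker", MPI_ARGV_NULL, 4, MPI_INFO_NULL,
                   0, MPI_COMM_WORLD, &workers, errcodes);

    /* ... communicate with the workers through the intercommunicator ... */

    MPI_Finalize();
    return 0;
}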

  5. Single Program Multiple Data (SPMD) model. Different processes merged into one program. Control (IF) statements select different parts for each processor to execute. All executables started together - static process creation. [Diagram: one source file compiled to suit each processor, producing executables for Processor 0 through Processor p - 1.]

  6. Single Program Multiple Data (SPMD) model. Typically coded for one master and a set of identical slave processes rather than each process being different.
if (processID == 0) {
    …   // do this code
} else if (processID == 1) {
    …   // do this code
} else …
[Diagram: one source file compiled to suit each processor, producing executables for Processor 0 through Processor p - 1.] MPI uses this form.

  7. Advantages/disadvantages of MPMD and SPMD • MPMD with dynamic process creation • Flexible – can start up processes on demand during execution, for example when searching through a search space. • Has process start-up overhead • SPMD (with static process creation) • Easier to code • Just one program to write • Collective message passing routines in each process the same (see later) • Efficient process execution

  8. Basic "point-to-point" Send and Receive Routines Passing a message between processes using send() and recv() library calls: [Diagram: Process 1 executes send(&x, 2); Process 2 executes recv(&y, 1); the data moves from x in process 1 to y in process 2.] Generic syntax (actual formats later).

  9. Synchronous Message Passing Routines that actually return when message transfer completed. Synchronous send routine • Waits until complete message has been accepted by the receiving process before returning. Synchronous receive routine • Waits until the message it is expecting arrives. Synchronous routines intrinsically perform two actions: They transfer data and they synchronize processes. Neither can proceed until the message has been passed from the source to the destination. So no message buffer storage is needed.

  10. Synchronous send() and recv() using 3-way protocol [Diagram (a), when send() occurs before recv(): process 1 calls send(), issues a request to send and suspends; when process 2 reaches recv() it returns an acknowledgment, the message is transferred, and both processes continue. Diagram (b), when recv() occurs before send(): process 2 calls recv() and suspends; when process 1 reaches send() it issues a request to send, the message is transferred followed by an acknowledgment, and both processes continue.]

  11. Asynchronous Message Passing • Blocking - has been used to describe routines that do not return until the transfer is completed. • The routines are “blocked” from continuing. • Non-blocking - has been used to describe routines that return whether or not the message had been received. Usually require local storage for messages. • In general, they do not synchronize processes but allow processes to move forward sooner. Must be used with care. • In that sense, the terms synchronous and blocking were synonymous.

  12. MPI Definitions of Blocking and Non-Blocking • Locally Blocking - return after their local actions complete, though the message transfer may not have been completed. • Non-blocking - return immediately. Assumes that data storage used for transfer not modified by subsequent statements prior to being used for transfer, and it is left to the programmer to ensure this. These terms may have different interpretations in other systems.

  13. How message-passing routines return before message transfer completed Message buffer needed between source and destination to hold message: [Diagram: process 1 calls send() and continues; the message is held in a message buffer between source and destination; process 2 later calls recv() and reads the message from the buffer.]

  14. Message Buffer • For a receive routine, the message has to have been received if we want the message. • If recv() is reached before send(), the message buffer will be empty and recv() waits for the message. • For a send routine, once the local actions have been completed and the message is safely on its way, the process can continue with subsequent work. • In this way, using such send routines can decrease the overall execution time. • In practice, buffers can only be of finite length and a point could be reached when the send routine is held up because all the available buffer space has been exhausted. • It may be necessary to know at some point if the message has actually been received, which will require additional message passing.

  15. Message Tag • Used to differentiate between different types of messages being sent. • Message tag is carried within message. • If special type matching is not required, a wild card message tag is used, so that the recv() will match with any send().

  16. Message Tag Example To send a message, x, with message tag 5 from a source process, 1, to a destination process, 2, and assign to y: [Diagram: process 1 executes send(&x, 2, 5); process 2 executes recv(&y, 1, 5), which waits for a message from process 1 with a tag of 5.]

  17. “Group” message passing routines Have routines that send message(s) to a group of processes or receive message(s) from a group of processes. Higher efficiency than separate point-to-point routines, although not absolutely necessary.

  18. Broadcast Sending same message to all processes concerned with problem. Multicast - sending same message to defined group of processes. [Diagram, SPMD (MPI) form: every process executes bcast(); the data in the root's buffer buf is copied to the corresponding buffer in Process 0 through Process p - 1.] Broadcast action does not occur until all the processes have executed their broadcast routine, and the broadcast operation will have the effect of synchronizing the processes.

  19. Scatter Sending each element of an array in root process to a separate process. Contents of ith location of array sent to ith process. Can send more than one element. [Diagram, SPMD (MPI) form: every process executes scatter(); the root's buffer buf is divided among Process 0 through Process p - 1.]

  20. Gather Having one process collect individual values from set of processes. [Diagram, SPMD (MPI) form: every process executes gather(); the data items from Process 0 through Process p - 1 are collected into the root's buffer buf.]

  21. Reduce Gather operation combined with specified arithmetic/logical operation to a single value. Example: Values could be gathered and then added together by root. [Diagram, SPMD (MPI) form: every process executes reduce(); the data items from Process 0 through Process p - 1 are combined (here with +) into the root's buffer buf.]

  22. Software Tools for Clusters Late 1980s: Parallel Virtual Machine (PVM) developed; became very popular. Mid 1990s: Message-Passing Interface (MPI) standard defined. Both are based upon the message-passing parallel programming model, and both provide a set of user-level libraries for message passing, used with sequential programming languages (C, C++, ...).

  23. PVM(Parallel Virtual Machine) Perhaps first widely adopted attempt at using a workstation cluster as a multicomputer platform, developed by Oak Ridge National Laboratories. Available at no charge. Programmer decomposes problem into separate programs (usually master and group of identical slave programs). Programs compiled to execute on specific types of computers. Set of computers used on a problem first must be defined prior to executing the programs (in a hostfile).

  24. Message routing between computers done by PVM daemon processes installed by PVM on computers that form the virtual machine. Can have more than one process running on each computer. [Diagram: each workstation runs a PVM daemon plus one or more application programs (executables); messages between workstations are sent through the network via the daemons.] The MPI implementation we use is similar.

  25. MPI (Message Passing Interface) • Message passing library standard developed by group of academics and industrial partners to foster more widespread use and portability. • Defines routines, not implementation. • Several free implementations exist.

  26. MPI Process Creation and Execution • Purposely not defined - will depend upon implementation. • Only static process creation supported in MPI version 1. All processes must be defined prior to execution and started together. • Originally SPMD model of computation. • MPMD also possible with static creation.

  27. Using SPMD Computational Model
int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);
    .
    .
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);   /* find process rank */
    if (myrank == 0)
        master();
    else
        slave();
    .
    .
    MPI_Finalize();
}
where master() and slave() are to be executed by master process and slave process, respectively.

  28. Communicators • Defines scope of a communication operation. • Processes have ranks associated with communicator. • Initially, all processes enrolled in a “universe” called MPI_COMM_WORLD, and each process is given a unique rank, a number from 0 to p - 1, with p processes. • Other communicators can be established for groups of processes. • Two types of communicator: • Intracommunicators for communication within a defined group • Intercommunicators for communication between defined groups

  29. Reasoning for Communicators • Provides a solution to unsafe message passing. • Message tags alone are not sufficient. • Enables basic error checking of message-passing code by allowing the programmer to define communication domains. • Messages cannot be sent to destinations outside the defined communication domain.

  30. Unsafe message passing - Example [Diagram (a), intended behavior: process 0's application send(…,1,…) matches process 1's application recv(…,0,…), and the send inside lib() in process 0 matches the recv inside lib() in process 1. Diagram (b), possible behavior: process 0's application send(…,1,…) is instead matched by the recv inside lib() in process 1, so the library intercepts a message intended for the application code.]
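
One way a library can avoid the behavior in (b), sketched here as an assumption rather than anything prescribed by the slides: it duplicates the caller's communicator with MPI_Comm_dup(), so its internal messages can never match sends or receives issued by the application. The names my_lib_init() and my_lib_exchange() are illustrative.

#include <mpi.h>

static MPI_Comm lib_comm;   /* private communication domain for the library */

/* Duplicate the caller's communicator once, at library start-up */
void my_lib_init(MPI_Comm app_comm)
{
    MPI_Comm_dup(app_comm, &lib_comm);
}

/* All library-internal traffic stays inside lib_comm, so it cannot be
   intercepted by (or intercept) application-level sends and receives */
void my_lib_exchange(int *buf, int count, int partner, int tag)
{
    MPI_Sendrecv_replace(buf, count, MPI_INT, partner, tag,
                         partner, tag, lib_comm, MPI_STATUS_IGNORE);
}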

  31. MPI Blocking Routines • Return when “locally complete” - when location used to hold message can be used again or altered without affecting message being sent. • Blocking send will send message and return - does not mean that message has been received, just that process free to move on without adversely affecting message.

  32. Parameters of blocking send
MPI_Send(buf, count, datatype, dest, tag, comm)
Parameters:
buf        address of send buffer
count      number of items to send
datatype   datatype of each item
dest       rank of destination process
tag        message tag
comm       communicator

  33. Parameters of blocking receive
MPI_Recv(buf, count, datatype, src, tag, comm, status)
Parameters:
buf        address of receive buffer
count      maximum number of items to receive
datatype   datatype of each item
src        rank of source process
tag        message tag
comm       communicator
status     status after operation

  34. Example To send an integer x from process 0 to process 1:
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);   /* find rank */
if (myrank == 0) {
    int x;
    MPI_Send(&x, 1, MPI_INT, 1, msgtag, MPI_COMM_WORLD);
} else if (myrank == 1) {
    int x;
    MPI_Recv(&x, 1, MPI_INT, 0, msgtag, MPI_COMM_WORLD, status);
}

  35. MPI Nonblocking Routines • Nonblocking send - MPI_Isend() - will return “immediately” even before source location is safe to be altered. • Nonblocking receive - MPI_Irecv() - will return even if no message to accept.

  36. Nonblocking Routine Formats MPI_Isend(buf,count,datatype,dest,tag,comm,request) MPI_Irecv(buf,count,datatype,source,tag,comm, request) Completion detected by MPI_Wait() and MPI_Test(). MPI_Wait() waits until operation completed and returns then. MPI_Test() returns with flag set indicating whether operation completed at that time. Need to know whether particular operation completed. Determined by accessing request parameter.

  37. Example To send an integer x from process 0 to process 1 and allow process 0 to continue:
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);   /* find rank */
if (myrank == 0) {
    int x;
    MPI_Isend(&x, 1, MPI_INT, 1, msgtag, MPI_COMM_WORLD, req1);
    compute();
    MPI_Wait(req1, status);
} else if (myrank == 1) {
    int x;
    MPI_Recv(&x, 1, MPI_INT, 0, msgtag, MPI_COMM_WORLD, status);
}
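
A variation of the same fragment, using MPI_Test() rather than MPI_Wait() so that process 0 keeps computing until the transfer is found to be complete (a sketch; compute_a_little() is a hypothetical placeholder for useful work):

if (myrank == 0) {
    int x = 0, flag = 0;
    MPI_Request req1;
    MPI_Status status;

    MPI_Isend(&x, 1, MPI_INT, 1, msgtag, MPI_COMM_WORLD, &req1);
    do {
        compute_a_little();                 /* hypothetical useful work */
        MPI_Test(&req1, &flag, &status);    /* poll: has the send completed? */
    } while (!flag);
    /* x may now be safely reused */
}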

  38. Send Communication Modes • Standard Mode Send - Not assumed that corresponding receive routine has started. Amount of buffering not defined by MPI. If buffering provided, send could complete before receive reached. • Buffered Mode - Send may start and return before a matching receive. Necessary to specify buffer space via routine MPI_Buffer_attach(). • Synchronous Mode - Send and receive can start before each other but can only complete together. • Ready Mode - Send can only start if matching receive already reached, otherwise error. Use with care.
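
A sketch of buffered mode (the details are assumptions, not shown on the slide): the programmer attaches buffer space with MPI_Buffer_attach() before calling MPI_Bsend(); the size is computed with MPI_Pack_size() plus the constant MPI_BSEND_OVERHEAD. Fragment assumes <stdlib.h> and an initialized MPI environment:

int x = 42, bufsize, dest = 1, tag = 0;
char *buffer;

MPI_Pack_size(1, MPI_INT, MPI_COMM_WORLD, &bufsize);   /* space needed for the data */
bufsize += MPI_BSEND_OVERHEAD;                         /* plus per-message overhead */
buffer = (char *)malloc(bufsize);

MPI_Buffer_attach(buffer, bufsize);
MPI_Bsend(&x, 1, MPI_INT, dest, tag, MPI_COMM_WORLD);  /* may return before the matching recv */
MPI_Buffer_detach(&buffer, &bufsize);                  /* blocks until buffered messages are sent */
free(buffer);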

  39. Parameters of synchronous send (same as blocking send)
MPI_Ssend(buf, count, datatype, dest, tag, comm)
Parameters:
buf        address of send buffer
count      number of items to send
datatype   datatype of each item
dest       rank of destination process
tag        message tag
comm       communicator

  40. Collective Communication Involves set of processes, defined by an intra-communicator. Message tags not present. Principal collective operations: • MPI_Bcast() - Broadcast from root to all other processes • MPI_Gather() - Gather values for group of processes • MPI_Scatter() - Scatters buffer in parts to group of processes • MPI_Alltoall() - Sends data from all processes to all processes • MPI_Reduce() - Combine values on all processes to single value • MPI_Reduce_scatter() - Combine values and scatter results • MPI_Scan() - Compute prefix reductions of data on processes • MPI_Barrier() - A means of synchronizing processes by stopping each one until they all have reached a specific “barrier” call.

  41. Barrier: Block process until all processes have called it
MPI_Barrier(comm)
Parameters:
comm       communicator
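
A typical use, as a sketch: synchronizing all processes before and after a timed section (MPI_Wtime() is the standard MPI wall-clock timer; do_work() is a hypothetical placeholder):

MPI_Barrier(MPI_COMM_WORLD);             /* wait until every process arrives */
double t0 = MPI_Wtime();
do_work();                               /* hypothetical section being timed */
MPI_Barrier(MPI_COMM_WORLD);
double t1 = MPI_Wtime();
if (myrank == 0) printf("Elapsed time: %f s\n", t1 - t0);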

  42. Broadcast message from root process to all processes in comm, including itself.
MPI_Bcast(*buf, count, datatype, root, comm)
Parameters:
*buf       message buffer (loaded)
count      number of entries in buffer
datatype   data type of buffer
root       rank of root
comm       communicator
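
A short usage fragment, in the same style as the gather example on slide 44 (init_table() is a hypothetical routine that fills the array on the root):

int table[100];
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);             /* find rank */
if (myrank == 0)
    init_table(table, 100);                         /* hypothetical: root fills the data */
MPI_Bcast(table, 100, MPI_INT, 0, MPI_COMM_WORLD);  /* every rank now holds the same 100 ints */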

  43. Gather values for group of processes
MPI_Gather(*sendbuf, sendcount, sendtype, *recvbuf, recvcount, recvtype, root, comm)
Parameters:
*sendbuf    send buffer
sendcount   number of send buffer elements
sendtype    data type of send elements
*recvbuf    receive buffer (loaded)
recvcount   number of elements received from each process
recvtype    data type of receive elements
root        rank of receiving process
comm        communicator

  44. Example To gather items from a group of processes into process 0, using dynamically allocated memory in the root process:
int data[10];                                         /* data to be gathered from processes */
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);               /* find rank */
if (myrank == 0) {
    MPI_Comm_size(MPI_COMM_WORLD, &grp_size);         /* find group size */
    buf = (int *)malloc(grp_size*10*sizeof(int));     /* allocate memory */
}
MPI_Gather(data, 10, MPI_INT, buf, 10, MPI_INT, 0, MPI_COMM_WORLD);
/* recvcount is the number of items received from each process (10), not the total */
MPI_Gather() gathers from all processes, including root.

  45. Scatter a buffer from root in parts to group of processes
MPI_Scatter(*sendbuf, sendcount, sendtype, *recvbuf, recvcount, recvtype, root, comm)
Parameters:
*sendbuf    send buffer
sendcount   number of elements sent (to each process)
sendtype    data type of elements
*recvbuf    receive buffer (loaded)
recvcount   number of recv buffer elements
recvtype    type of recv elements
root        root process rank
comm        communicator
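
A usage fragment mirroring the gather example on slide 44, assuming the root distributes 10 ints to each of grp_size processes; sendbuf is significant only at the root:

int recvbuf[10];                      /* each process receives 10 ints */
int *sendbuf = NULL;
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
MPI_Comm_size(MPI_COMM_WORLD, &grp_size);
if (myrank == 0)
    sendbuf = (int *)malloc(grp_size * 10 * sizeof(int));   /* root fills this with data */
MPI_Scatter(sendbuf, 10, MPI_INT, recvbuf, 10, MPI_INT, 0, MPI_COMM_WORLD);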

  46. Combine values on all processes to single value
MPI_Reduce(*sendbuf, *recvbuf, count, datatype, op, root, comm)
Parameters:
*sendbuf   send buffer address
*recvbuf   receive buffer address
count      number of send buffer elements
datatype   data type of send elements
op         reduce operation; several operations, including:
           MPI_MAX    maximum
           MPI_MIN    minimum
           MPI_SUM    sum
           MPI_PROD   product
root       root process rank for result
comm       communicator
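
A small fragment using one of the listed operations, MPI_MAX, to find the largest value held by any process (compute_local_value() is a hypothetical placeholder):

int myval = compute_local_value();    /* hypothetical per-process value */
int maxval;
MPI_Reduce(&myval, &maxval, 1, MPI_INT, MPI_MAX, 0, MPI_COMM_WORLD);
if (myrank == 0) printf("Maximum over all processes: %d\n", maxval);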

  47. Hello World
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int size, rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello MPI! Process %d of %d\n", rank, size);
    MPI_Finalize();
}

  48. Sample MPI program
#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>
#define MAXSIZE 1000

int main(int argc, char *argv[])
{
    int myid, numprocs;
    int data[MAXSIZE], i, x, low, high, myresult = 0, result;
    char fn[255];
    FILE *fp;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);

    if (myid == 0) {                        /* Open input file and initialize data */
        strcpy(fn, getenv("HOME"));
        strcat(fn, "/MPI/rand_data.txt");
        if ((fp = fopen(fn, "r")) == NULL) {
            printf("Can't open the input file: %s\n\n", fn);
            exit(1);
        }
        for (i = 0; i < MAXSIZE; i++)
            fscanf(fp, "%d", &data[i]);
    }

    MPI_Bcast(data, MAXSIZE, MPI_INT, 0, MPI_COMM_WORLD);   /* broadcast data */

    x = MAXSIZE/numprocs;                   /* Add my portion of data */
    low = myid * x;
    high = low + x;
    for (i = low; i < high; i++)
        myresult += data[i];
    printf("I got %d from %d\n", myresult, myid);

    /* Compute global sum */
    MPI_Reduce(&myresult, &result, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    if (myid == 0) printf("The sum is %d.\n", result);

    MPI_Finalize();
    return 0;
}
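
With a typical MPI implementation such a program is compiled with the mpicc wrapper and launched with mpirun (or mpiexec), for example mpicc -o sum sum.c followed by mpirun -np 4 ./sum; the file name sum.c and the process count are illustrative.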

  49. MPI Groups, Communicators • Collective communication may need to be performed by subsets of the processes in the computation • MPI provides routines for • defining new process groups from subsets of existing process groups like MPI_COMM_WORLD • creating a new communicator for a new process group • performing collective communication within that process group (see the sketch after the next slide)

  50. Groups and communicators
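
A minimal sketch tying slides 49 and 50 together (the routine choice and variable names are assumptions, not taken from the slides): MPI_Comm_split() partitions an existing communicator by a "color" value, here putting even- and odd-ranked processes of MPI_COMM_WORLD into separate new communicators, within which collective operations can then be performed.

int world_rank, sub_rank;
MPI_Comm sub_comm;

MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

/* Processes supplying the same color end up in the same new communicator;
   the key (world_rank) orders ranks within each new group */
MPI_Comm_split(MPI_COMM_WORLD, world_rank % 2, world_rank, &sub_comm);

MPI_Comm_rank(sub_comm, &sub_rank);                /* rank within the new group */

/* Collective communication restricted to the subgroup, e.g. a sum over it */
int value = world_rank, sum;
MPI_Reduce(&value, &sum, 1, MPI_INT, MPI_SUM, 0, sub_comm);

MPI_Comm_free(&sub_comm);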
