Message-Passing Computing

This chapter discusses the use of message passing in distributed computing, focusing on two popular tools: PVM and MPI. It explains the concepts of process creation and message passing using user-level libraries. The program structure and methods of sending and receiving messages are also covered.



  1. Message-Passing Computing Chapter 2

  2. Software Tools for Clusters Late 1980's: Parallel Virtual Machine (PVM), a software package that makes a collection of computers appear as a single virtual machine, was developed and became very popular. It used dynamic process creation from the start and provided message passing between homogeneous or heterogeneous computers. 1994: the Message-Passing Interface (MPI) standard was defined after two years of discussion. Both are based on the message-passing parallel programming model, and both provide a set of user-level libraries for message passing, used with sequential programming languages (C, C++, Fortran, ...).

  3. MPI (Message Passing Interface) • A message-passing library standard developed by a group of academics and industrial partners (the MPI Forum) to foster more widespread use and portability. • Defines the routines, not the implementation. • Several free, open-source implementations exist, e.g. MPICH and Open MPI.

  4. Message passing concept using library routines

  5. Message-Passing Programming using User-level Message-Passing Libraries Two primary mechanisms needed: 1. A method for creating processes for execution on different computers 2. A method for sending and receiving messages

  6. Creating processes on different computers

  7. Multiple Program, Multiple Data (MPMD) model • Different programs executed by each processor (Figure: separate source files are compiled to suit each processor type, giving a different executable for each of Processor 0 ... Processor p-1.)

  8. Single Program, Multiple Data (SPMD) model • Same program executed by each processor • Control statements select different parts for each processor to execute • Basic MPI way (Figure: one source file is compiled to suit each processor type, producing executables for Processor 0 ... Processor p-1.)

  9. Static process creation • All executables start together • Done when one starts running the compiled program (e.g. with mpiexec) • Normal MPI way • It is also possible to start processes dynamically from within an executing process in MPI-2 with MPI_Comm_spawn(), which can be useful if the number of processes needed is not known in advance (a sketch follows below)
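
A minimal sketch of dynamic process creation with MPI_Comm_spawn(), assuming a separately compiled worker executable named "worker" exists in the working directory (the executable name and worker count are illustrative, not part of the original slide):

#include <mpi.h>

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);

    MPI_Comm intercomm;   /* intercommunicator connecting parent and spawned processes */
    int nworkers = 4;     /* illustrative: decided at run time rather than on the command line */

    /* Spawn nworkers copies of the "worker" executable (MPI-2 dynamic process creation). */
    MPI_Comm_spawn("worker", MPI_ARGV_NULL, nworkers, MPI_INFO_NULL,
                   0, MPI_COMM_SELF, &intercomm, MPI_ERRCODES_IGNORE);

    MPI_Finalize();
    return 0;
}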

  10. MPI program structure

#include <mpi.h>      /* or could be #include "mpi.h" */
#include <stdlib.h>   /* for EXIT_SUCCESS */

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);   /* takes the command-line arguments */
    /* Code executed by all processes */
    MPI_Finalize();
    return EXIT_SUCCESS;
}

  11. How is the number of processes determined? • When you run your MPI program, you can specify the number of processes you want on the command line: mpiexec -n 8 program_name • The -n option tells mpiexec to run your parallel program using the specified number of processes

  12. In MPI, processes within a defined "communicating group" are given a number/ID called a rank, starting from zero onwards. The program uses control constructs, typically IF statements, to direct processes to perform specific actions. Example: if (rank == 0) ... // do this if (rank == 1) ... // do this . . .

  13. Master-Slave approach Usually the computation is constructed as a master-slave model: • One process (the master) performs one set of actions, and • all the other processes (the slaves) perform identical actions, although on different data, i.e. if (rank == 0) ... // master does this else ... // all slaves do this
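
A minimal, self-contained sketch of this rank-based master/slave branching (the work done in each branch is illustrative only):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* rank (ID) of this process */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes */

    if (rank == 0)
        printf("Master: coordinating %d processes\n", size);
    else
        printf("Slave %d: working on its own portion of the data\n", rank);

    MPI_Finalize();
    return 0;
}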

  14. Methods of sending and receiving messages

  15. Basic "point-to-point" Send and Receive Routines Passing a message between processes using send() and recv() library calls (generic syntax; the actual MPI formats come later): Process 1 executes send(&x, 2); Process 2 executes recv(&y, 1); the data in x on process 1 moves into y on process 2.

  16. Incorrect Message Passing - Example (a) Intended behavior: Process 0 calls send(...,1,...) in user code and again inside a library routine lib(); Process 1 calls recv(...,0,...) inside lib() and again in user code, each send intended for a particular recv. (b) Possible behavior: because the receives cannot tell the messages apart, a recv() may pick up the message intended for the other one, so user data and library data are delivered to the wrong receives.

  17. Message Tag • Used to differentiate between different types of messages being sent. • The message tag is carried within the message. • If special type matching is not required, a wild card message tag is used; then recv() will match with any send().

  18. Message Tag Example To send a message, x, with message tag 5 from a source process, 1, to a destination process, 2, and assign it to y: Process 1 executes send(&x, 2, 5); Process 2 executes recv(&y, 1, 5), which waits for a message from process 1 with a tag of 5; the data in x moves into y.

  19. MPI "Communicators" • Defines a communication domain - a set of processes that are allowed to communicate among themselves. • The communication domains of libraries can be separated from that of a user program. • Used in all point-to-point and collective MPI message-passing communications. Note: Intracommunicator - for communicating within a single group of processes. Intercommunicator - for communicating between two or more groups of processes.

  20. Default Communicator: MPI_COMM_WORLD • Exists as the first communicator for all processes initiated in the application. • A set of MPI routines exists for forming new communicators (advanced usage). • Processes have a rank (ID) within a communicator.
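
As a brief illustration of forming a new communicator (the grouping criterion here, even versus odd ranks, is just an example and not from the slides), MPI_Comm_split() can divide MPI_COMM_WORLD into subgroups:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int world_rank, sub_rank;
    MPI_Comm subcomm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* Processes with the same "color" end up in the same new communicator;
       here even and odd ranks form two separate groups. */
    int color = world_rank % 2;
    MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, &subcomm);

    MPI_Comm_rank(subcomm, &sub_rank);   /* rank within the new communicator */
    printf("World rank %d has rank %d in subcommunicator %d\n",
           world_rank, sub_rank, color);

    MPI_Comm_free(&subcomm);
    MPI_Finalize();
    return 0;
}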

  21. MPI Hello World • Write an MPI "Hello World" program that prints the following sentence from each processor: Hello world from processor xyz, rank 1 out of 4 processes • To compile: mpicc -o hello mpi_hello_world.c • To run: mpiexec -n 4 hello or mpirun -np 4 hello Note: mpiexec replaces the earlier mpirun command, although mpirun still exists
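
A possible solution sketch for the "Hello World" exercise; the exact output wording may differ slightly from the sample shown above:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, size, namelen;
    char procname[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);          /* this process's rank */
    MPI_Comm_size(MPI_COMM_WORLD, &size);          /* total number of processes */
    MPI_Get_processor_name(procname, &namelen);    /* name of the host running this process */

    printf("Hello world from processor %s, rank %d out of %d processes\n",
           procname, rank, size);

    MPI_Finalize();
    return 0;
}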

  22. In MPI, standard output is automatically redirected from remote computers to the user's console, so the possible final output could be:
Hello world from processor compute-0-0.local, rank 1 out of 4 processes
Hello world from processor compute-0-1.local, rank 0 out of 4 processes
Hello world from processor compute-0-3.local, rank 3 out of 4 processes
Hello world from processor compute-0-2.local, rank 2 out of 4 processes
• The order of the messages might be different but is unlikely to be in ascending order of process ID; it will depend upon how the processes are scheduled.

  23. Using the SPMD Computational Model

#include <mpi.h>

int main(int argc, char *argv[]) {
    int myrank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    if (myrank == 0)
        master();
    else
        slave();
    MPI_Finalize();
    return 0;
}

where master() and slave() are to be executed by the master process and the slave processes, respectively.

  24. Parameters of blocking send MPI_Send(buffer, count, datatype, dest, tag, comm) • buffer - address of send buffer • count - number of items to send • datatype - datatype of each item • dest - rank of destination process • tag - message tag • comm - communicator

  25. Parameters of blocking receive MPI_Recv(buffer, count, datatype, src, tag, comm, &status) • buffer - address of receive buffer • count - maximum number of items to receive • datatype - datatype of each item • src - rank of source process • tag - message tag • comm - communicator • status - status after the operation

  26. MPI Datatypes (defined in mpi.h), e.g. MPI_CHAR, MPI_INT, MPI_LONG, MPI_FLOAT and MPI_DOUBLE, corresponding to the C types char, int, long, float and double.

  27. Any source or tag • In MPI_Recv(), source can be MPI_ANY_SOURCE • Tag can be MPI_ANY_TAG • These cause MPI_Recv() to take any message destined for the current process regardless of source and/or tag. Example: MPI_Recv(message, 256, MPI_CHAR, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status); // 256 = maximum number of characters to receive

  28. Program Examples To send an integer x from process 0 to process 1:

int myrank;
MPI_Status status;
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);                           /* find rank */
int msgtag = 10;
if (myrank == 0) {
    int x = 123;                                                  /* illustrative value to send */
    MPI_Send(&x, 1, MPI_INT, 1, msgtag, MPI_COMM_WORLD);          /* send 1 integer to process rank 1 */
} else if (myrank == 1) {
    int y;                                                        /* the receive variable need not be called x */
    MPI_Recv(&y, 1, MPI_INT, 0, msgtag, MPI_COMM_WORLD, &status); /* at most 1 integer from process rank 0 */
}

  29. MPI Hello World: Send() - Recv() • Write a basic MPI program that sends the message "Hello World" from a master process (rank == 0) to each of the other processes (rank != 0). Then, all of the processes print a message as follows:
Message "hello world" printed from process 1
Message "hello world" printed from process 2
Message "hello world" printed from process 4
Message "hello world" printed from process 0
Message "hello world" printed from process 3
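
A possible solution sketch for this exercise, assuming the message fits in a fixed 32-character buffer (buffer size and tag value are illustrative):

#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char *argv[]) {
    int rank, size;
    char msg[32];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        strcpy(msg, "hello world");
        for (int dest = 1; dest < size; dest++)   /* master sends to every other process */
            MPI_Send(msg, strlen(msg) + 1, MPI_CHAR, dest, 0, MPI_COMM_WORLD);
    } else {
        MPI_Recv(msg, 32, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    printf("Message \"%s\" printed from process %d\n", msg, rank);

    MPI_Finalize();
    return 0;
}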

  30. Semantics of MPI_Send() and MPI_Recv() Both are called blocking, which in MPI means the routine waits until all its local actions have taken place before returning. After returning, any local variables used can be altered without affecting the message transfer. MPI_Send() - returning does not mean the message has been received by the destination, but the process can continue in the knowledge that the message is safely on its way. MPI_Recv() - returns when the message has been received and the data collected; it will cause the process to stall until the message is received. Other versions of MPI_Send() and MPI_Recv() have different semantics.

  31. Exercise: Ring Write an MPI program that forms a ring of processes where each process sends, in order, a token to the following process. The token is an array of 2 equal integers: • Each process (except the master) receives the array (initialized by the master) from the process with rank one less than its own (rank - 1). • Then each process increments each element of the array and sends it to the following process (rank + 1). • The last process sends the token back to the master. A possible solution sketch is given after the sample output on the next slide.

  32. Sample output with 8 processes:
Process 1 received token {101, 101} from Process 0
Process 2 received token {102, 102} from Process 1
Process 3 received token {103, 103} from Process 2
Process 4 received token {104, 104} from Process 3
Process 5 received token {105, 105} from Process 4
Process 6 received token {106, 106} from Process 5
Process 7 received token {107, 107} from Process 6
Process 0 received token {108, 108} from Process 7
(Figure: processes 0 through 7 arranged in a ring.)
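
A possible solution sketch for the ring exercise, assuming the master initializes the token to {100, 100} and every process (master included) increments it before forwarding, which matches the sample output above; it also assumes at least 2 processes:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, size, token[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        token[0] = token[1] = 100;   /* master initializes the token */
    } else {
        MPI_Recv(token, 2, MPI_INT, rank - 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Process %d received token {%d, %d} from Process %d\n",
               rank, token[0], token[1], rank - 1);
    }

    token[0]++;                      /* increment both elements before forwarding */
    token[1]++;
    MPI_Send(token, 2, MPI_INT, (rank + 1) % size, 0, MPI_COMM_WORLD);

    if (rank == 0) {                 /* last process sends the token back to the master */
        MPI_Recv(token, 2, MPI_INT, size - 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Process 0 received token {%d, %d} from Process %d\n",
               token[0], token[1], size - 1);
    }

    MPI_Finalize();
    return 0;
}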

  33. MPI_Status • MPI_Recv() takes the address of an MPI_Status structure as an argument • MPI_Status is populated with additional information about the receive operation: • Rank of sender: stored in MPI_SOURCE • Tag of message: stored in MPI_TAG • Length of message: no predefined element in MPI_Status; it can be retrieved using MPI_Get_count: MPI_Get_count(&status, datatype, &count); • MPI_Status is helpful when used with MPI_ANY_SOURCE and/or MPI_ANY_TAG

  34. MPI_Probe • Instead of simply providing a really large buffer to handle all possible sizes in MPI_Recv() • Use MPI_Probe() to query the message size before actually receiving it: MPI_Probe(source, tag, MPI_COMM_WORLD, &status) • Similar to MPI_Recv(), MPI_Probe() blocks for a message with a matching tag and sender • Think of MPI_Probe() as an MPI_Recv() that does everything except actually receive the message • Used in conjunction with MPI_Get_count() (see the sketch below)
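
A minimal sketch combining MPI_Probe(), MPI_Get_count() and a dynamically sized receive buffer, along the lines of the exercise on the next slide (the message size of 85 integers and tag 0 are illustrative):

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        int n = 85;                                   /* illustrative message size */
        int *data = malloc(n * sizeof(int));
        for (int i = 0; i < n; i++) data[i] = i;
        MPI_Send(data, n, MPI_INT, 1, 0, MPI_COMM_WORLD);
        free(data);
    } else if (rank == 1) {
        MPI_Status status;
        int count;

        MPI_Probe(0, 0, MPI_COMM_WORLD, &status);     /* block until a message is available */
        MPI_Get_count(&status, MPI_INT, &count);      /* how many integers does it contain? */

        int *buf = malloc(count * sizeof(int));       /* allocate a buffer of exactly the right size */
        MPI_Recv(buf, count, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Process 1 dynamically received %d integers from Process 0\n", count);
        free(buf);
    }

    MPI_Finalize();
    return 0;
}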

  35. Exercise: Send Random Numbers Write an MPI program that sends a random number of integers: • The sender should abort the program if more than 2 processes are used. • The sender should create a random number of integers and send them to the receiver. • The receiver should dynamically create a buffer of exactly the size of the message sent, and populate it with the received numbers. • Sample output should be as follows: Process 0 sent 85 integers to Process 1 Process 1 dynamically received 85 integers from Process 0

  36. Measuring Execution Time MPI provides the routine MPI_Wtime() for returning the elapsed wall-clock time (in seconds) on the calling processor. It is typically used to measure the execution time between two points (L1 and L2) in the code, using something similar to:

double start_time, end_time, exe_time;
start_time = MPI_Wtime();      /* point L1 */
. . .
end_time = MPI_Wtime();        /* point L2 */
exe_time = end_time - start_time;

  37. Using C time routines To measure execution time between point L1 and point L2 in code, use something similar to:

time_t t1, t2;
double elapsed_time;
time(&t1);                     /* start timer (L1) */
. . .
time(&t2);                     /* stop timer (L2) */
elapsed_time = difftime(t2, t1);
printf("Elapsed time = %5.2f secs\n", elapsed_time);

Using the time() and gettimeofday() routines may be useful if you want to compare with a sequential C version of the program using the same libraries.

  38. Barrier Blocks each process until all processes in the communicator have called it. Synchronous operation. MPI_Barrier(comm) • comm - communicator

  39. MPI_Barrier use with time stamps A common example of using a barrier is to synchronize the processors before taking a time stamp:

MPI_Barrier(MPI_COMM_WORLD);
start_time = MPI_Wtime();
. . .                          /* do work */
MPI_Barrier(MPI_COMM_WORLD);
end_time = MPI_Wtime();

The 2nd barrier is not always needed if there is a gather (or another synchronous collective call): once the master has the correct data, it does not matter what the other processes are doing; we have the answer.

  40. MPI_Barrier Example

double local_start, local_finish;
double local_elapsed, elapsed;

MPI_Barrier(MPI_COMM_WORLD);
local_start = MPI_Wtime();
. . .
local_finish = MPI_Wtime();
local_elapsed = local_finish - local_start;

/* take the maximum elapsed time over all processes */
MPI_Reduce(&local_elapsed, &elapsed, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);

if (my_rank == 0)   /* my_rank obtained earlier with MPI_Comm_rank() */
    printf("elapsed time is %f seconds\n", elapsed);