High Performance Computing Course Notes 2007-2008 Message Passing Programming I

Message Passing Programming • Message Passing is the most widely used parallel programming model • Message passing works by creating a number of tasks, uniquely named, that interact by sending and receiving messages to and from one another (hence the message passing) • Generally, processes communicate through sending the data from the address space of one process to that of another • Communication of processes (via files, pipe, socket) • Communication of threads within a process (via global data area) • Programs based on message passing can be based on standard sequential language programs (C/C++, Fortran), augmented with calls to library functions for sending and receiving messages

Message Passing Interface (MPI) • MPI is a specification, not a particular implementation • Does not specify process startup, error codes, amount of system buffer, etc • MPI is a library, not a language • The goals of MPI: functionality, portability and efficiency • Message passing model > MPI specification > MPI implementation

OpenMP vs MPI • In a nutshell: • MPI is used on distributed-memory systems • OpenMP is used for code parallelisation on shared-memory systems • Both express parallelism explicitly • High-level control (OpenMP), lower-level control (MPI)

A little history • Message-passing libraries developed for a number of early distributed memory computers • By 1993 there were loads of vendor specific implementations • By 1994 MPI-1 came into being • By 1996 MPI-2 was finalized

The MPI programming model • MPI standards - • MPI-1 (1.1, 1.2), MPI-2 (2.0) • Forwards compatibility preserved between versions • Standard bindings - for C, C++ and Fortran. Bindings for Python, Java etc. also exist (all non-standard) • We will stick to the C binding for the lectures and coursework. More info on MPI: www.mpi-forum.org • Implementations - for your laptop, pick up MPICH, a free portable implementation of MPI (http://www-unix.mcs.anl.gov/mpi/mpich/index.htm) • Coursework will use MPICH

MPI • MPI is a complex system comprising 129 functions with numerous parameters and variants • Six of them are indispensable, and those six alone are enough to write a large number of useful programs • Other functions add flexibility (datatype), robustness (non-blocking send/receive), efficiency (ready-mode communication), modularity (communicators, groups) or convenience (collective operations, topology) • In the lectures, we are going to cover the most commonly encountered functions

The MPI programming model • Computation comprises one or more processes that communicate by calling library routines to send messages to and receive messages from other processes • (Generally) a fixed set of processes created at outset, one process per processor • Different from PVM

Intuitive interfaces for sending and receiving messages • Send(data, destination), Receive(data, source) • Minimal interface • Not enough in some situations; we also need • Message matching – add message_id at both send and receive interfaces • They become Send(data, destination, msg_id), Receive(data, source, msg_id) • Message_id • Is expressed using an integer, termed the message tag • Allows the programmer to deal with the arrival of messages in an orderly fashion (queue and then deal with

How to express the data in the send/receive interfaces • Early stages: • (address, length) for the send interface • (address, max_length) for the receive interface • They are not always good • The data to be sent may not be in contiguous memory locations • The storage format for data may not be the same, or known in advance, on a heterogeneous platform • Eventually, a triple (address, count, datatype) is used to express the data to be sent and (address, max_count, datatype) for the data to be received • This reflects the fact that a message contains much more structure than just a string of bits, for example (vector_A, 300, MPI_REAL) • Programmers can construct their own datatypes • Now, the interfaces become send(address, count, datatype, destination, msg_id) and receive(address, max_count, datatype, source, msg_id)

How to distinguish messages • Message tag is necessary, but not sufficient • So, communicator is introduced …

Communicators • Messages are put into contexts • Contexts are allocated at run time by the system in response to programmer requests • The system can guarantee that each generated context is unique • The processes belong to groups • The notions of context and group are combined in a single object, which is called a communicator • A communicator identifies a group of processes and a communication context • The MPI library defines an initial communicator, MPI_COMM_WORLD, which contains all the processes running in the system • The messages from different process groups can have the same tag • So the send interface becomes send(address, count, datatype, destination, tag, comm)

Status of the received messages • The structure of the message status is added to the receive interface • Status holds the information about source, tag and actual message size • In the C language, source can be retrieved by accessing status.MPI_SOURCE, • tag can be retrieved by status.MPI_TAG and • actual message size can be retrieved by calling the function MPI_Get_count(&status, datatype, &count) • The receive interface becomes receive(address, maxcount, datatype, source, tag, communicator, status)

How to express source and destination • The processes in a communicator (group) are identified by ranks • If a communicator contains n processes, process ranks are integers from 0 to n-1 • Source and destination processes in the send/receive interface are the ranks

Some other issues • In the receive interface, tag can be a wildcard (MPI_ANY_TAG), which means a message with any tag will be received • In the receive interface, source can also be a wildcard (MPI_ANY_SOURCE), which matches any source

MPI basics • First six functions (C bindings) • MPI_Send (buf, count, datatype, dest, tag, comm) • Send a message • buf address of send buffer • count no. of elements to send (>=0) • datatype of elements • dest process id of destination • tag message tag • comm communicator (handle)

MPI basics • First six functions (C bindings) • MPI_Send (buf, count, datatype, dest, tag, comm) • Calculating the size of the data to be sent … • buf address of send buffer • count * sizeof (datatype) bytes of data

MPI basics • First six functions (C bindings) • MPI_Recv (buf, count, datatype, source, tag, comm, status) • Receive a message • buf address of receive buffer (var param) • count max no. of elements in receive buffer (>=0) • datatype of receive buffer elements • source process id of source process, or MPI_ANY_SOURCE • tag message tag, or MPI_ANY_TAG • comm communicator • status status object

MPI basics • First six functions (C bindings) • MPI_Init (int *argc, char ***argv) • Initiate a computation • argc (number of arguments) and argv (argument vector) are main program’s arguments • Must be called first, and once per process • MPI_Finalize ( ) • Shut down a computation • The last thing that happens

MPI basics • First six functions (C bindings) • MPI_Comm_size (MPI_Comm comm, int *size) • Determine number of processes in comm • comm is communicator handle, MPI_COMM_WORLD is the default (including all MPI processes) • size holds number of processes in group • MPI_Comm_rank (MPI_Comm comm, int *pid) • Determine id of current (or calling) process • pid holds id of current process

MPI basics – a basic example

#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, nprocs;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);   /* number of processes */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);     /* id of this process */
    printf("Hello, world. I am %d of %d\n", rank, nprocs);
    MPI_Finalize();
    return 0;
}

Running with four processes (the order of the output lines may differ from run to run):

mpirun -np 4 myprog
Hello, world. I am 1 of 4
Hello, world. I am 3 of 4
Hello, world. I am 0 of 4
Hello, world. I am 2 of 4

MPI basics – send and recv example (1)

#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size, i;
    int buffer[10];
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (size < 2) {
        printf("Please run with two processes.\n");
        MPI_Finalize();
        return 0;
    }

    if (rank == 0) {
        for (i=0; i<10; i++)
            buffer[i] = i;
        MPI_Send(buffer, 10, MPI_INT, 1, 123, MPI_COMM_WORLD);
    }

MPI basics – send and recv example (2)

    if (rank == 1) {
        for (i=0; i<10; i++)
            buffer[i] = -1;
        MPI_Recv(buffer, 10, MPI_INT, 0, 123, MPI_COMM_WORLD, &status);
        for (i=0; i<10; i++) {
            if (buffer[i] != i)
                printf("Error: buffer[%d] = %d but is expected to be %d\n", i, buffer[i], i);
        }
    }

    MPI_Finalize();
}

MPI language bindings • Standard (accepted) bindings for Fortran, C and C++ • Java bindings are work in progress • JavaMPI Java wrapper to native calls • mpiJava JNI wrappers • jmpi pure Java implementation of MPI library • MPIJ same idea • Java Grande Forum trying to sort it all out • We will use the C bindings

High Performance Computing Course Notes 2007-2008 Message Passing Programming II

Modularity • MPI supports modular programming via communicators • Provides information hiding by encapsulating local communications and having local namespaces for processes • All MPI communication operations specify a communicator (process group that is engaged in the communication)

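Forming new communicators – one approach

MPI_Comm world, workers;
MPI_Group world_group, worker_group;
int ranks[1];
int numprocs, myid, server;

MPI_Init(&argc, &argv);
world = MPI_COMM_WORLD;
MPI_Comm_size(world, &numprocs);
MPI_Comm_rank(world, &myid);
server = numprocs - 1;                                  /* last rank acts as the server */
MPI_Comm_group(world, &world_group);                    /* group behind MPI_COMM_WORLD */
ranks[0] = server;
MPI_Group_excl(world_group, 1, ranks, &worker_group);   /* everyone except the server */
MPI_Comm_create(world, worker_group, &workers);
MPI_Group_free(&world_group);
MPI_Group_free(&worker_group);
/* workers is MPI_COMM_NULL on the server, which is not in worker_group */
if (workers != MPI_COMM_NULL)
    MPI_Comm_free(&workers);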

Forming new communicators - functions • int MPI_Comm_group(MPI_Comm comm, MPI_Group *group) • int MPI_Group_excl(MPI_Group group, int n, int *ranks, MPI_Group *newgroup) • int MPI_Group_incl(MPI_Group group, int n, int *ranks, MPI_Group *newgroup) • int MPI_Comm_create(MPI_Comm comm, MPI_Group group, MPI_Comm *newcomm) • int MPI_Group_free(MPI_Group *group) • int MPI_Comm_free(MPI_Comm *comm)

Forming new communicators – another approach (1) • MPI_Comm_split (comm, colour, key, newcomm) • Creates one or more new communicators from the original comm • comm communicator (handle) • colour control of subset assignment (processes with same colour are in same new communicator) • key control of rank assignment • newcomm new communicator • Is a collective communication operation (must be executed by all processes in the process group comm) • Is used to (re-) allocate processes to communicator (groups)

Forming new communicators – another approach (2) • MPI_Comm_split (comm, colour, key, newcomm)

MPI_Comm comm, newcomm;
int myid, colour;
MPI_Comm_rank(comm, &myid);   // id of current process
colour = myid % 3;
MPI_Comm_split(comm, colour, myid, &newcomm);

[Figure: processes with ranks 0 to 7 in comm are split by colour (0, 1, 2) into three new communicators, each with ranks numbered from 0]

Forming new communicators – another approach (3) • MPI_Comm_split (comm, colour, key, newcomm) • New communicator created for each new value of colour • Each new communicator (sub-group) comprises those processes that specify its value in colour • These processes are assigned new identifiers (ranks, starting at zero) with the order determined by the value of key (or by their ranks in the old communicator in event of ties)

Communications • Point-to-point communications: involving exactly two processes, one sender and one receiver • For example, MPI_Send() and MPI_Recv() • Collective communications: involving a group of processes

Collective operations • i.e. coordinated communication operations involving multiple processes • The programmer could do this by hand (tedious); MPI provides specialized collective communication operations • barrier – synchronizes all processes • broadcast – sends data from one to all processes • gather – gathers data from all processes to one process • scatter – scatters data from one process to all processes • reduction operations – sum, multiply etc. distributed data • all executed collectively (on all processes in the group, at the same time, with the same parameters)

Collective operations • MPI_Barrier (comm) • Global synchronization • comm is the communicator handle • No processes return from function until all processes have called it • Good way of separating one phase from another

Barrier synchronizations • You are only as quick as your slowest process

[Figure: process timelines between two barrier synchronizations]

Collective operations • MPI_Bcast (buf, count, type, root, comm) • Broadcast data from root to all processes • buf address of buffer (input buffer at root, output buffer at all other processes) • count no. of entries in buffer (>=0) • type datatype of buffer elements • root process id of root process • comm communicator

[Figure: one-to-all broadcast; the data A0 held by the root ends up on every process after MPI_BCAST]

Example of MPI_Bcast • Broadcast 100 ints from process 0 to every process in the group

MPI_Comm comm;
int array[100];
int root = 0;
…
MPI_Bcast(array, 100, MPI_INT, root, comm);

Collective operations • MPI_Gather (inbuf, incount, intype, outbuf, outcount, outtype, root, comm) • Collective data movement function • Input to gather: • inbuf address of input buffer • incount no. of elements sent from each (>=0) • intype datatype of input buffer elements • Output of gather: • outbuf address of output buffer (var param) • outcount no. of elements received from each • outtype datatype of output buffer elements • root process id of root (receiving) process • comm communicator

[Figure: all-to-one gather; the blocks A0, A1, A2, A3 held by four processes are collected in rank order on the root by MPI_GATHER]

MPI_Gather example • Gather 100 ints from every process in group to root

MPI_Comm comm;
int gsize, sendarray[100];
int root, myrank, *rbuf;
...
MPI_Comm_rank(comm, &myrank);                       // find proc. id
if (myrank == root) {
    MPI_Comm_size(comm, &gsize);                    // find group size
    rbuf = (int *) malloc(gsize*100*sizeof(int));   // allocate receive buffer (needs <stdlib.h>)
}
MPI_Gather(sendarray, 100, MPI_INT, rbuf, 100, MPI_INT, root, comm);

Collective operations • MPI_Scatter (inbuf, incount, intype, outbuf, outcount, outtype, root, comm) • Collective data movement function • inbuf address of input buffer • incount no. of elements sent to each (>=0) • intype datatype of input buffer elements • outbuf address of output buffer • outcount no. of elements received by each • outtype datatype of output buffer elements • root process id of root process • comm communicator

[Figure: one-to-all scatter; the blocks A0, A1, A2, A3 held by the root are distributed by MPI_SCATTER, one block to each process]

Example of MPI_Scatter • MPI_Scatter is the reverse of MPI_Gather • It is as if the root sends to each process i using MPI_Send(inbuf + i*incount*sizeof(intype), incount, intype, i, …), where the offset is counted in bytes

MPI_Comm comm;
int gsize, *sendbuf;
int root, rbuf[100];
…
MPI_Comm_size(comm, &gsize);
sendbuf = (int *) malloc(gsize*100*sizeof(int));
…
MPI_Scatter(sendbuf, 100, MPI_INT, rbuf, 100, MPI_INT, root, comm);

Collective operations • MPI_Reduce (inbuf, outbuf, count, type, op, root, comm) • Collective reduction function • inbuf address of input buffer • outbuf address of output buffer • count no. of elements in input buffer (>=0) • type datatype of input buffer elements • op operation • root process id of root process • comm communicator

[Figure: four processes hold the pairs (2, 4), (5, 7), (0, 3) and (6, 2); MPI_REDUCE using MPI_MIN with root = 0 leaves (0, 2) on process 0]

Collective operations • MPI_Reduce (inbuf, outbuf, count, type, op, root, comm) • Collective reduction function • inbuf address of input buffer • outbuf address of output buffer • count no. of elements in input buffer (>=0) • type datatype of input buffer elements • op operation • root process id of root process • comm communicator

[Figure: four processes hold the pairs (2, 4), (5, 7), (0, 3) and (6, 2); MPI_REDUCE using MPI_SUM with root = 1 leaves (13, 16) on process 1]
