This article explores the basics of Message Passing Interface (MPI) in multi-processor systems, covering concepts such as point-to-point and collective communication operations, communicators, and the use of MPI functions. C code examples are provided.
Message Passing • Based on a multi-processor architecture • A set of independent processors • Connected via some communication network • All communication between processes is done via a message sent from one process to another
MPI • Message Passing Interface • A computation is made up of: • One or more processes • That communicate by calling library routines • MIMD programming model • SPMD is the most common style.
MPI • Processes use point-to-point communication operations • Collective communication operations are also available. • Communication can be modularized by the use of communicators. • MPI_COMM_WORLD is the base communicator. • Communicators are used to identify subsets of processes
MPI • Complex, but most problems can be solved using the 6 basic functions. • MPI_Init • MPI_Finalize • MPI_Comm_size • MPI_Comm_rank • MPI_Send • MPI_Recv
MPI Basics • Almost all calls require a communicator handle as an argument. • MPI_COMM_WORLD • MPI_Init and MPI_Finalize • don't require a communicator handle • used to begin and end an MPI program • MUST be called to begin and end
MPI Basics • MPI_Comm_size • determines the number of processors in the communicator group • MPI_Comm_rank • determines the integer identifier assigned to the current process • zero based
MPI Basics

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int iproc, nproc;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);
    MPI_Comm_rank(MPI_COMM_WORLD, &iproc);
    printf("I am processor %d of %d\n", iproc, nproc);
    MPI_Finalize();
    return 0;
}
MPI Communication • MPI_Send • Sends an array of a given type • Requires a destination node, size, and type • MPI_Recv • Receives an array of a given type • Same requirements as MPI_Send • Extra parameter • MPI_Status variable.
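To make the calling convention concrete, here is a minimal point-to-point sketch (not from the original slides), assuming the program is launched with at least two processes; the tag value 0 and the data array are arbitrary choices for illustration.

/* Minimal point-to-point sketch: rank 0 sends an array of ints to rank 1.
   The destination/source rank, tag, and communicator form the "envelope". */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, data[4] = {1, 2, 3, 4};
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        MPI_Send(data, 4, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(data, 4, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        printf("Rank 1 received %d %d %d %d\n",
               data[0], data[1], data[2], data[3]);
    }
    MPI_Finalize();
    return 0;
}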
MPI Basics • Made for both FORTRAN and C • Standards for C • MPI_ prefix to all calls • First letter of function name is capitalized • Returns MPI_SUCCESS or error code • MPI_Status structure • MPI data types for each C type • OUT parameters passed using & operator
Using MPI • Based on rsh or ssh • requires a .rhosts file or ssh key setup • hostname login • Path to compiler (CS open labs) • MPI_HOME /users/faculty/snell/mpich • MPI_CC MPI_HOME/bin/mpicc • Marylou5 • Use mpicc • mpicc hello.c -o hello
Using MPI • Write program • Compile using mpicc • Write process file (linux cluster) • host nprocs full_path_to_prog • 0 for nprocs on first line, 1 for all others • Run program (linux cluster) • prog -p4pg process_file args • mpirun -np #procs -machinefile machines prog • Run program (scheduled on marylou5 using pbs) • mpirun -np #procs -machinefile $PBS_NODEFILE prog • mpiexec prog
#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>
#define MAXSIZE 1000

int main(int argc, char *argv[])
{
    int myid, numprocs;
    int data[MAXSIZE], i, x, low, high, myresult = 0, result;
    char fn[255];
    FILE *fp;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);

    if (myid == 0) {  /* Open input file and initialize data */
        strcpy(fn, getenv("HOME"));
        strcat(fn, "/MPI/rand_data.txt");
        if ((fp = fopen(fn, "r")) == NULL) {
            printf("Can't open the input file: %s\n\n", fn);
            exit(1);
        }
        for (i = 0; i < MAXSIZE; i++)
            fscanf(fp, "%d", &data[i]);
    }

    /* Broadcast data */
    MPI_Bcast(data, MAXSIZE, MPI_INT, 0, MPI_COMM_WORLD);

    /* Add my portion of the data */
    x = MAXSIZE / numprocs;
    low = myid * x;
    high = low + x;
    for (i = low; i < high; i++)
        myresult += data[i];
    printf("I got %d from %d\n", myresult, myid);

    /* Compute global sum */
    MPI_Reduce(&myresult, &result, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    if (myid == 0) printf("The sum is %d.\n", result);

    MPI_Finalize();
    return 0;
}
MPI • Message Passing programs are non-deterministic because of concurrency • Consider 2 processes sending messages to a third • MPI only guarantees that 2 messages sent from a single process to another will arrive in order. • It is the programmer's responsibility to ensure deterministic computation
MPI & Determinism • MPI • A Process may specify the source of the message • A Process may specify the type of message • Non-Determinism • MPI_ANY_SOURCE or MPI_ANY_TAG
Example

for (n = 0; n < nproc/2; n++) {
    MPI_Send(buff, BSIZE, MPI_FLOAT, rnbor, 1, MPI_COMM_WORLD);
    MPI_Recv(buff, BSIZE, MPI_FLOAT, MPI_ANY_SOURCE, 1,
             MPI_COMM_WORLD, &status);
    /* Process the data */
}
Global Operations • Coordinated communication involving multiple processes. • Can be implemented by the programmer using sends and receives • For convenience, MPI provides a suite of collective communication functions. • All participating processes must call the same function.
Collective Communication • Barrier • Synchronize all processes • Broadcast • Send data from one process to all processes • Gather • Gather data from all processes to one process • Scatter • Distribute distinct pieces of data from one process to all processes • Reduction • Global sums, products, etc. (a scatter/gather sketch follows below)
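As a concrete illustration of scatter and gather (a sketch added here, not part of the original slides), the root can hand each process one piece of an array and collect the modified pieces back; the array contents and the increment are arbitrary.

/* Illustrative sketch: root scatters one int to each process, each process
   increments its piece, and root gathers the results back. */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size, mine, *all = NULL;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                       /* root builds the full array */
        all = malloc(size * sizeof(int));
        for (int i = 0; i < size; i++) all[i] = i * 10;
    }
    MPI_Scatter(all, 1, MPI_INT, &mine, 1, MPI_INT, 0, MPI_COMM_WORLD);
    mine += 1;                             /* each process works on its piece */
    MPI_Gather(&mine, 1, MPI_INT, all, 1, MPI_INT, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        for (int i = 0; i < size; i++) printf("%d ", all[i]);
        printf("\n");
        free(all);
    }
    MPI_Finalize();
    return 0;
}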
[Diagram: typical collective-communication flow in a parallel program: distribute problem size, distribute input data, exchange boundary values, find max error, collect results]
MPI_Reduce MPI_Reduce(inbuf, outbuf, count, type, op, root, comm)
MPI_Allreduce MPI_Allreduce(inbuf, outbuf, count, type, op, comm)
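One common use of MPI_Allreduce is the "find max error" step from the earlier diagram: every process needs the same global maximum to decide whether to keep iterating. A small illustrative sketch (not from the slides):

/* Illustrative sketch: every process computes a local error, and all of them
   need the global maximum, so MPI_Allreduce is used instead of MPI_Reduce
   (whose result lives only on the root). */
#include <mpi.h>

int converged(double local_err, double tolerance)
{
    double global_err;

    /* every rank contributes its value and every rank gets the maximum back */
    MPI_Allreduce(&local_err, &global_err, 1, MPI_DOUBLE, MPI_MAX,
                  MPI_COMM_WORLD);

    return global_err < tolerance;   /* same answer on every process */
}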
MPI Collective Routines • Several routines: MPI_ALLGATHER MPI_ALLGATHERV MPI_BCAST MPI_ALLTOALL MPI_ALLTOALLV MPI_REDUCE MPI_GATHER MPI_GATHERV MPI_SCATTER MPI_REDUCE_SCATTER MPI_SCAN MPI_SCATTERV MPI_ALLREDUCE • "All" versions deliver results to all participating processes • "V" versions allow the chunks to have different sizes • MPI_ALLREDUCE, MPI_REDUCE, MPI_REDUCE_SCATTER, and MPI_SCAN take both built-in and user-defined combination functions (see the sketch below)
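A hedged sketch of a user-defined combination function, assuming a reduction (sum of absolute values) that is not covered by the built-in operations; the function and wrapper names are illustrative.

/* Illustrative sketch: a user-defined reduction that sums absolute values,
   registered with MPI_Op_create and passed to MPI_Reduce like a built-in op. */
#include <math.h>
#include <mpi.h>

void abs_sum(void *in, void *inout, int *len, MPI_Datatype *type)
{
    double *a = (double *)in, *b = (double *)inout;
    for (int i = 0; i < *len; i++)
        b[i] = fabs(a[i]) + fabs(b[i]);     /* combine element-wise */
}

void reduce_abs(double *local, double *global, int n, MPI_Comm comm)
{
    MPI_Op op;
    MPI_Op_create(abs_sum, 1 /* commutative */, &op);
    MPI_Reduce(local, global, n, MPI_DOUBLE, op, 0, comm);
    MPI_Op_free(&op);
}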
Example: PI in C - 1

#include "mpi.h"
#include <stdio.h>
#include <math.h>

int main(int argc, char *argv[])
{
    int done = 0, n, myid, numprocs, i;
    double PI25DT = 3.141592653589793238462643;
    double mypi, pi, h, sum, x;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    while (!done) {
        if (myid == 0) {
            printf("Enter the number of intervals: (0 quits) ");
            scanf("%d", &n);
        }
        MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
        if (n == 0) break;
Example: PI in C - 2

        h = 1.0 / (double) n;
        sum = 0.0;
        for (i = myid + 1; i <= n; i += numprocs) {
            x = h * ((double)i - 0.5);
            sum += 4.0 / (1.0 + x*x);
        }
        mypi = h * sum;
        MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (myid == 0)
            printf("pi is approximately %.16f, Error is %.16f\n",
                   pi, fabs(pi - PI25DT));
    }
    MPI_Finalize();
    return 0;
}
MPI Datatypes • Data in messages are described by: • Address, Count, Datatype • MPI predefines many datatypes • MPI_INT, MPI_FLOAT, MPI_DOUBLE, etc. • There is an analog for each primitive type • Can also construct custom data types for structured data
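A possible sketch of a custom datatype for a C struct, using MPI_Type_create_struct; the Particle struct and its field layout are purely illustrative.

/* Illustrative sketch: building an MPI datatype for a C struct so it can be
   sent, received, or broadcast as a single unit. */
#include <mpi.h>

typedef struct {
    int    id;
    double value[3];
} Particle;

MPI_Datatype make_particle_type(void)
{
    Particle     p;
    int          blocklens[2] = {1, 3};
    MPI_Datatype types[2]     = {MPI_INT, MPI_DOUBLE};
    MPI_Aint     disps[2], base;
    MPI_Datatype particle_type;

    MPI_Get_address(&p, &base);
    MPI_Get_address(&p.id, &disps[0]);
    MPI_Get_address(&p.value, &disps[1]);
    disps[0] -= base;                    /* displacements relative to start */
    disps[1] -= base;

    MPI_Type_create_struct(2, blocklens, disps, types, &particle_type);
    MPI_Type_commit(&particle_type);
    return particle_type;                /* use with MPI_Send, MPI_Bcast, etc. */
}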
MPI_Recv • Blocks until message is received • Message is matched based on source & tag • The MPI_Status argument gets filled with information about the message • Source & Tag • Receiving fewer elements than specified is OK • Receiving more elements is an error • Use MPI_Get_count to get number of elements received
MPI_Recv

int recvd_tag, recvd_from, recvd_count;
MPI_Status status;

MPI_Recv(..., MPI_ANY_SOURCE, MPI_ANY_TAG, ..., &status);

recvd_tag  = status.MPI_TAG;
recvd_from = status.MPI_SOURCE;
MPI_Get_count(&status, datatype, &recvd_count);
Non-blocking communication • MPI_Send and MPI_Recv are blocking • MPI_Send does not complete until the buffer is available to be modified • MPI_Recv does not complete until the buffer is filled • Blocking communication can lead to deadlocks, e.g. when every process does:

for (int p = 0; p < nproc; p++) {
    MPI_Send(... p ...);
    MPI_Recv(... p ...);
}
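One way to avoid this pattern (see also "Common problems" later) is MPI_Sendrecv, which lets the library pair the send and the receive; a sketch assuming a ring exchange with hypothetical sendbuf/recvbuf buffers and rank/nproc already set:

/* Illustrative sketch: exchange a buffer with both neighbours in a ring
   without deadlock by letting MPI schedule the send and receive together. */
int left  = (rank - 1 + nproc) % nproc;
int right = (rank + 1) % nproc;
MPI_Status status;

MPI_Sendrecv(sendbuf, BSIZE, MPI_FLOAT, right, 0,
             recvbuf, BSIZE, MPI_FLOAT, left,  0,
             MPI_COMM_WORLD, &status);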
Non-blocking communication • MPI_Isend & MPI_Irecv return immediately (non-blocking)

MPI_Request request;
MPI_Status status;

MPI_Isend(start, count, datatype, dest, tag, comm, &request);
MPI_Irecv(start, count, datatype, src, tag, comm, &request);
MPI_Wait(&request, &status);

• Used to overlap communication with computation • Anywhere you use MPI_Send or MPI_Recv, you can use the pair MPI_Isend/MPI_Wait or MPI_Irecv/MPI_Wait • Also can use MPI_Waitall, MPI_Waitany, MPI_Waitsome • Can also check whether you have any messages without actually receiving them – MPI_Probe & MPI_Iprobe • MPI_Probe blocks until there is a message – MPI_Iprobe sets a flag
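A sketch of the overlap idea, assuming the same hypothetical ring buffers and neighbour ranks as above plus illustrative compute_interior()/compute_boundary() helpers:

/* Illustrative sketch: post the receive and send first, do independent work
   while the messages are in flight, then wait for both to complete. */
MPI_Request reqs[2];
MPI_Status  stats[2];

MPI_Irecv(recvbuf, BSIZE, MPI_FLOAT, left,  0, MPI_COMM_WORLD, &reqs[0]);
MPI_Isend(sendbuf, BSIZE, MPI_FLOAT, right, 0, MPI_COMM_WORLD, &reqs[1]);

compute_interior();                 /* work that needs no incoming message */

MPI_Waitall(2, reqs, stats);        /* now it is safe to touch both buffers */
compute_boundary(recvbuf);          /* work that needed the received data */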
Communicators • All MPI communication is based on a communicator which contains a context and a group • Contexts define a safe communication space for message-passing • Contexts can be viewed as system-managed tags • Contexts allow different libraries to co-exist • The group is just a set of processes • Processes are always referred to by unique rank in group
Uses of MPI_COMM_WORLD • Contains all processes available at the time the program was started • Provides initial safe communication space • Simple programs communicate with MPI_COMM_WORLD • Even complex programs will use MPI_COMM_WORLD for most communications • Complex programs duplicate and subdivide copies of MPI_COMM_WORLD • Provides a global communicator for forming smaller groups or subsets of processors for specific tasks [Diagram: ranks 0-7 grouped in MPI_COMM_WORLD]
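Duplicating the world communicator is a single call; the sketch below (an illustration, not from the slides) shows the typical library-isolation use.

/* Illustrative sketch: give a library its own communication context so its
   messages can never be matched against the application's. */
MPI_Comm lib_comm;
MPI_Comm_dup(MPI_COMM_WORLD, &lib_comm);   /* same processes, new context */

/* ... library uses lib_comm for all of its sends and receives ... */

MPI_Comm_free(&lib_comm);                  /* release when finished */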
Subdividing a Communicator with MPI_COMM_SPLIT • MPI_COMM_SPLIT partitions the group associated with the given communicator into disjoint subgroups • Each subgroup contains all processes having the same value for the argument color • Within each subgroup, processes are ranked in the order defined by the value of the argument key, with ties broken according to their rank in the old communicator

C:       int MPI_Comm_split(MPI_Comm comm, int color, int key, MPI_Comm *newcomm)

Fortran: MPI_COMM_SPLIT(COMM, COLOR, KEY, NEWCOMM, IERR)
         INTEGER COMM, COLOR, KEY, NEWCOMM, IERR
Subdividing a Communicator • To divide a communicator into two non-overlapping groups:

color = (rank < size/2) ? 0 : 1;
MPI_Comm_split(comm, color, 0, &newcomm);

[Diagram: comm with ranks 0-7 split into two newcomm groups of four processes each, ranked 0-3 within their group]
Subdividing a Communicator • To divide a communicator such that • all processes with even ranks are in one group • all processes with odd ranks are in the other group • maintain the reverse order by rank

color = (rank % 2 == 0) ? 0 : 1;
key = size - rank;
MPI_Comm_split(comm, color, key, &newcomm);

[Diagram: comm with ranks 0-7 split into an even-rank group and an odd-rank group of four processes each, ordered in reverse of their original ranks]
      program main
      include 'mpif.h'
      integer ierr, row_comm, col_comm
      integer myrank, size, P, Q, myrow, mycol

      P = 4
      Q = 3
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, myrank, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierr)

C     Determine row and column position
      myrow = myrank/Q
      mycol = mod(myrank,Q)

C     Split comm into row and column comms
      call MPI_Comm_split(MPI_COMM_WORLD, myrow, mycol, row_comm, ierr)
      call MPI_Comm_split(MPI_COMM_WORLD, mycol, myrow, col_comm, ierr)

      print*, "My coordinates are [", myrank, "] ", myrow, mycol

      call MPI_Finalize(ierr)
      stop
      end
[Diagram: the 12 processes of MPI_COMM_WORLD arranged in a 4 x 3 grid of (row, column) coordinates; each row forms a row_comm and each column forms a col_comm]
An ounce of prevention… • Defensive programming • Check function return codes • Verify send and receive sizes • Incremental programming • Modular programming • Test modules – keep test code in place • Identify all shared data and think carefully about how it is accessed • Correctness first – then speed
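A hedged sketch of checking function return codes, assuming the error handler is switched to MPI_ERRORS_RETURN so that failures come back to the caller instead of aborting (the default is MPI_ERRORS_ARE_FATAL); the wrapper name is illustrative.

/* Illustrative sketch: wrap a call, check its return code, and report the
   error text on stderr rather than letting the job die silently. */
#include <stdio.h>
#include <mpi.h>

int checked_bcast(void *buf, int count, MPI_Datatype type, int root)
{
    char msg[MPI_MAX_ERROR_STRING];
    int  rc, len;

    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
    rc = MPI_Bcast(buf, count, type, root, MPI_COMM_WORLD);
    if (rc != MPI_SUCCESS) {
        MPI_Error_string(rc, msg, &len);
        fprintf(stderr, "MPI_Bcast failed: %s\n", msg);
    }
    return rc;
}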
Debugging • Characterize the bug • Run code serially • Run in parallel on one core (2-4 processes) • Run in parallel (2-4 processes on 2-4 cores) • Play around with inputs, data, and data sizes • Find the smallest data size that exposes the bug • Remove as much non-determinism as you can • Print statements – use stderr (unbuffered) • Before and after communication or shared variable access • Print all information – source, sizes, data, tag, etc. • Identify the process number – first thing in the print (helps sorting) • Leave the prints in your code - #ifdef
Debugging • Learn about the C constructs __FILE__, __LINE__, and __FUNCTION__ • Make one logical change at a time and then test • Learn how to attach debuggers • You will probably need some sort of stall code – e.g. wait for input on the master and then do a barrier, while all other processes just do the barrier (a sketch of both ideas follows below)
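A sketch of both ideas, assuming a DEBUG compile-time flag and illustrative names: the macro tags every message with the rank, file, line, and function, and the stall lets you attach debuggers before the run proceeds.

/* Illustrative sketch: a rank-tagged debug print that can be compiled out,
   plus a stall on rank 0 so a debugger can be attached to each process. */
#include <stdio.h>
#include <mpi.h>

#ifdef DEBUG
#define DBG(rank, ...) \
    do { fprintf(stderr, "[%d] %s:%d %s: ", (rank), __FILE__, __LINE__, \
                 __FUNCTION__); \
         fprintf(stderr, __VA_ARGS__); fputc('\n', stderr); } while (0)
#else
#define DBG(rank, ...) do { } while (0)   /* compiled away without DEBUG */
#endif

void stall_for_debugger(int rank)
{
    if (rank == 0) {
        fprintf(stderr, "Attach debuggers, then press Enter...\n");
        getchar();                       /* master waits for input */
    }
    MPI_Barrier(MPI_COMM_WORLD);         /* everyone else waits here */
}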
Common problems • Not all processes call collective call • Be very careful about putting collective calls inside conditionals • Be sure the communicator is correct • Deadlock (everybody on recv) • Use non-blocking calls • Use MPI_Sendrecv • Process waiting for data that is never sent • Use collective calls where you can • Use simple communication patterns
Best Advice • Program incrementally and modularly • Characterize the bug and leave yourself time to walk away from it and think about it • Never underestimate the value of a second set of eyes • Sometimes just explaining your code to someone else helps you help yourself