MPI: Message Passing Interface
Mehmet Balman, Cmpe 587, Dec 2001
Parallel Computing
• Separate workers or processes
• Interact by exchanging information
Types of parallel computing:
• SIMD (single instruction, multiple data)
• SPMD (single program, multiple data)
• MPMD (multiple program, multiple data)
Hardware models:
• Distributed memory (Paragon, IBM SPx, workstation networks)
• Shared memory (SGI Power Challenge, Cray T3D)
Communication with other processes
• One-sided: one worker performs the transfer of data
• Cooperative: all parties agree to transfer the data (see the sketch below)
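For instance, a cooperative transfer needs a matching send and receive on the two sides. A minimal sketch, assuming exactly two processes (illustrative, not from the original slides):

#include "mpi.h"
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value = 42;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        /* sender side of the cooperative pair */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* receiver side: the transfer completes only when both calls match */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        printf("received %d\n", value);
    }
    MPI_Finalize();
    return 0;
}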
What is MPI?
• A message-passing library specification
• Programs multiple processors by message passing
• A library of functions and macros that can be used in C and FORTRAN programs
• For parallel computers, clusters, and heterogeneous networks
Who designed MPI?
• Vendors: IBM, Intel, TMC, Meiko, Cray, Convex, Ncube
• Library writers: PVM, p4, Zipcode, TCGMSG, Chameleon, Express, Linda
• Broad participation
Development history (1992-1994)
• Began at the Williamsburg Workshop in April 1992
• Organized at Supercomputing '92 (November)
• Met every six weeks for two days
• Pre-final draft distributed at Supercomputing '93
• Final version of the draft in May 1994
• Public and vendor implementations available
Features of MPI
• Point-to-point communication
  • blocking, nonblocking
  • synchronous, asynchronous
  • ready, buffered
• Collective routines
  • built-in, user defined
• Large number of data movement routines
• Built-in support for grids and graphs
• 125 functions (MPI is large)
• 6 basic functions (MPI is small)
• Communicators combine context and groups for message security
Example

#include "mpi.h"
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello world! I'm %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
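To build and launch such a program, most MPI implementations provide a compiler wrapper and a launcher. The exact names vary by implementation; mpicc and mpirun are common conventions (illustrative commands, not from the original slides):

mpicc hello.c -o hello
mpirun -np 4 ./hello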
What happens when an MPI job is run
• The user issues a directive to the operating system that places a copy of the executable program on each processor
• Each processor begins execution of its copy of the executable
• Different processes can execute different statements by branching within the program; typically the branching is based on process ranks
Envelope of a message (control block):
• the rank of the receiver
• the rank of the sender
• a tag
• a communicator
Two mechanisms for partitioning the message space:
• Tags (0-32767)
• Communicators (e.g., MPI_COMM_WORLD)
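A sketch of rank-based branching with the envelope fields (destination/source rank, tag, communicator) visible in the calls. Assumes two processes and a rank variable obtained from MPI_Comm_rank as in the earlier example:

int a = 1, b = 2;
MPI_Status status;
if (rank == 0) {
    /* two messages, distinguished by tag */
    MPI_Send(&a, 1, MPI_INT, 1, 100, MPI_COMM_WORLD);
    MPI_Send(&b, 1, MPI_INT, 1, 200, MPI_COMM_WORLD);
} else if (rank == 1) {
    /* the tag in the envelope ensures each receive matches
       the intended message, not merely the next arrival */
    MPI_Recv(&a, 1, MPI_INT, 0, 100, MPI_COMM_WORLD, &status);
    MPI_Recv(&b, 1, MPI_INT, 0, 200, MPI_COMM_WORLD, &status);
}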
The six basic functions:
MPI_Init        MPI_Finalize
MPI_Comm_size   MPI_Comm_rank
MPI_Send        MPI_Recv

MPI_Send(start, count, datatype, dest, tag, comm)
MPI_Recv(start, count, datatype, source, tag, comm, status)
MPI_Bcast(start, count, datatype, root, comm)
MPI_Reduce(start, result, count, datatype, operation, root, comm)
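A short sketch of the two collective calls, assuming the usual rank setup (the value 100 and the partial computation are illustrative):

int n = 0, root = 0;
int partial, total;
if (rank == root) n = 100;   /* value known only at the root */
MPI_Bcast(&n, 1, MPI_INT, root, MPI_COMM_WORLD);
partial = rank * n;          /* each rank's contribution */
MPI_Reduce(&partial, &total, 1, MPI_INT, MPI_SUM, root, MPI_COMM_WORLD);
if (rank == root) printf("total = %d\n", total);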
Collective Computation Operations

Operation Name   Meaning
MPI_MAX          Maximum
MPI_MIN          Minimum
MPI_SUM          Sum
MPI_PROD         Product
MPI_LAND         Logical And
MPI_BAND         Bitwise And
MPI_LOR          Logical Or
MPI_BOR          Bitwise Or
MPI_LXOR         Logical Exclusive Or
MPI_BXOR         Bitwise Exclusive Or
MPI_MAXLOC       Maximum and Location of Maximum
MPI_MINLOC       Minimum and Location of Minimum

MPI_Op_create(user_function, commute /* true if commutative */, op)
MPI_Op_free(op)
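A minimal sketch of a user-defined reduction: a hypothetical elementwise integer product (the function name and variables are illustrative):

/* user function with the signature MPI_Op_create requires */
void int_prod(void *in, void *inout, int *len, MPI_Datatype *dtype) {
    int i;
    int *a = (int *)in, *b = (int *)inout;
    for (i = 0; i < *len; i++)
        b[i] = a[i] * b[i];   /* combine elementwise into inout */
}
...
MPI_Op op;
MPI_Op_create(int_prod, 1 /* commutative */, &op);
MPI_Reduce(&local, &result, 1, MPI_INT, op, 0, MPI_COMM_WORLD);
MPI_Op_free(&op);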
User-defined communication groups
• Communicator: contains a context and a group
• Group: just a set of processes

MPI_Comm_create(oldcomm, group, &newcomm)
MPI_Comm_group(oldcomm, &group)
MPI_Group_free(&group)
MPI_Group_incl         MPI_Group_excl
MPI_Group_range_incl   MPI_Group_range_excl
MPI_Group_union        MPI_Group_intersection
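A sketch of carving a new communicator out of the first half of MPI_COMM_WORLD, assuming rank and size from the usual setup (the half split is illustrative):

MPI_Group world_group, half_group;
MPI_Comm half_comm;
int i, half = size / 2;
int *ranks = (int *)malloc(half * sizeof(int));
for (i = 0; i < half; i++) ranks[i] = i;   /* ranks 0..size/2-1 */

MPI_Comm_group(MPI_COMM_WORLD, &world_group);
MPI_Group_incl(world_group, half, ranks, &half_group);
MPI_Comm_create(MPI_COMM_WORLD, half_group, &half_comm);
/* half_comm is MPI_COMM_NULL on ranks outside the group */
MPI_Group_free(&half_group);
free(ranks);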
Non-blocking operations
Non-blocking operations return immediately.
• MPI_Isend(start, count, datatype, dest, tag, comm, request)
• MPI_Irecv(start, count, datatype, source, tag, comm, request)
• MPI_Wait(request, status)
• MPI_Waitall, MPI_Waitany, MPI_Waitsome
• MPI_Test(request, flag, status)
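A sketch of overlapping communication with computation: post the receive and send, work while the messages are in flight, then wait. Assumes exactly two processes exchanging with each other (illustrative):

MPI_Request reqs[2];
MPI_Status stats[2];
int sendbuf = rank, recvbuf;
int peer = (rank == 0) ? 1 : 0;

MPI_Irecv(&recvbuf, 1, MPI_INT, peer, 0, MPI_COMM_WORLD, &reqs[0]);
MPI_Isend(&sendbuf, 1, MPI_INT, peer, 0, MPI_COMM_WORLD, &reqs[1]);
/* ... do useful work here while the exchange progresses ... */
MPI_Waitall(2, reqs, stats);   /* buffers are safe to reuse after this */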
Communication Modes
• Synchronous mode (MPI_Ssend): the send does not complete until a matching receive has begun.
• Buffered mode (MPI_Bsend): the user supplies a buffer to the system for its use.
• Ready mode (MPI_Rsend): the user guarantees that a matching receive has already been posted.
Non-blocking versions: MPI_Issend, MPI_Irsend, MPI_Ibsend

/* Buffered-mode example.  The attached buffer must be large enough
   for the outstanding messages plus MPI_BSEND_OVERHEAD; the size
   here is chosen for illustration. */
int bufsize = 10000 + MPI_BSEND_OVERHEAD;
char *buf = malloc(bufsize);
MPI_Buffer_attach(buf, bufsize);
...
MPI_Bsend( ... same arguments as MPI_Send ... );
...
MPI_Buffer_detach(&buf, &bufsize);
Datatypes
Two main purposes:
• Heterogeneity --- parallel programs across different processors
• Noncontiguous data --- structures, vectors with non-unit stride

MPI datatype         C datatype
MPI_CHAR             signed char
MPI_SHORT            signed short int
MPI_INT              signed int
MPI_LONG             signed long int
MPI_UNSIGNED_CHAR    unsigned char
MPI_UNSIGNED_SHORT   unsigned short int
MPI_UNSIGNED         unsigned int
MPI_UNSIGNED_LONG    unsigned long int
MPI_FLOAT            float
MPI_DOUBLE           double
MPI_LONG_DOUBLE      long double
MPI_BYTE             (untyped bytes)
MPI_PACKED           (packed data; see Pack/unpack)
Build derived type

/* INDATA_TYPE is assumed to be a structure such as: */
typedef struct {
    float a;
    float b;
    int   n;
} INDATA_TYPE;

void Build_derived_type(INDATA_TYPE* indata,
                        MPI_Datatype* message_type_ptr) {
    int          block_lengths[3];
    MPI_Aint     displacements[3];
    MPI_Aint     addresses[4];
    MPI_Datatype typelist[3];

    typelist[0] = MPI_FLOAT;
    typelist[1] = MPI_FLOAT;
    typelist[2] = MPI_INT;
    block_lengths[0] = block_lengths[1] = block_lengths[2] = 1;

    MPI_Address(indata, &addresses[0]);
    MPI_Address(&(indata->a), &addresses[1]);
    MPI_Address(&(indata->b), &addresses[2]);
    MPI_Address(&(indata->n), &addresses[3]);

    /* displacements are relative to the start of the structure */
    displacements[0] = addresses[1] - addresses[0];
    displacements[1] = addresses[2] - addresses[0];
    displacements[2] = addresses[3] - addresses[0];

    MPI_Type_struct(3, block_lengths, displacements, typelist,
                    message_type_ptr);
    MPI_Type_commit(message_type_ptr);
}
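Usage sketch, assuming the typedef above: build the type once, then move the whole structure in a single call:

INDATA_TYPE indata;
MPI_Datatype message_type;
Build_derived_type(&indata, &message_type);
MPI_Bcast(&indata, 1, message_type, 0, MPI_COMM_WORLD);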
Other derived data types
• int MPI_Type_contiguous(int count, MPI_Datatype oldtype, MPI_Datatype *newtype)
  • elements are contiguous entries in an array
• int MPI_Type_vector(int count, int block_length, int stride, MPI_Datatype element_type, MPI_Datatype *new_type)
  • elements are equally spaced entries of an array
• int MPI_Type_indexed(int count, int array_of_blocklengths[], int array_of_displacements[], MPI_Datatype element_type, MPI_Datatype *new_type)
  • elements are arbitrary entries of an array
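For example, a strided vector type can describe one column of a row-major matrix (a common illustration; the 10x10 dimensions, dest, and tag are assumptions):

/* One column of a 10x10 row-major float matrix: 10 blocks of
   1 element each, with a stride of 10 elements between them. */
float A[10][10];
MPI_Datatype column_type;
int dest = 1, tag = 0;
MPI_Type_vector(10, 1, 10, MPI_FLOAT, &column_type);
MPI_Type_commit(&column_type);
/* send column 3 of A in one call */
MPI_Send(&A[0][3], 1, column_type, dest, tag, MPI_COMM_WORLD);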
Pack/unpack

void Get_data4(int my_rank, float* a_ptr, float* b_ptr, int* n_ptr) {
    int  root = 0;
    char buffer[100];   /* packing buffer; must hold everything packed below */
    int  position;

    if (my_rank == 0) {
        printf("Enter a, b, and n\n");
        scanf("%f %f %d", a_ptr, b_ptr, n_ptr);
        position = 0;
        MPI_Pack(a_ptr, 1, MPI_FLOAT, buffer, 100, &position, MPI_COMM_WORLD);
        MPI_Pack(b_ptr, 1, MPI_FLOAT, buffer, 100, &position, MPI_COMM_WORLD);
        MPI_Pack(n_ptr, 1, MPI_INT, buffer, 100, &position, MPI_COMM_WORLD);
        MPI_Bcast(buffer, 100, MPI_PACKED, root, MPI_COMM_WORLD);
    } else {
        MPI_Bcast(buffer, 100, MPI_PACKED, root, MPI_COMM_WORLD);
        position = 0;
        MPI_Unpack(buffer, 100, &position, a_ptr, 1, MPI_FLOAT, MPI_COMM_WORLD);
        MPI_Unpack(buffer, 100, &position, b_ptr, 1, MPI_FLOAT, MPI_COMM_WORLD);
        MPI_Unpack(buffer, 100, &position, n_ptr, 1, MPI_INT, MPI_COMM_WORLD);
    }
}
Profiling
Every MPI function is also available under the name-shifted PMPI_ prefix, so a user can interpose a wrapper without modifying the library:

static int nsend = 0;

int MPI_Send(void *start, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm) {
    nsend++;   /* count the calls */
    return PMPI_Send(start, count, datatype, dest, tag, comm);
}
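A companion wrapper could report the count at shutdown; a sketch under the same interposition scheme (not from the original slides):

int MPI_Finalize(void) {
    int rank;
    PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("rank %d called MPI_Send %d times\n", rank, nsend);
    return PMPI_Finalize();
}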
Architecture of MPI
• Complex communication operations can be expressed portably in terms of lower-level ones
• All MPI functions are implemented in terms of the macros and functions that make up the ADI (Abstract Device Interface)
• The ADI is responsible for:
  • specifying a message to be sent or received
  • moving data between the API and the message-passing hardware
  • managing lists of pending messages (both sent and received)
  • providing basic information about the execution environment (e.g., how many tasks there are)
Channel Interface
Routines for sending and receiving envelope (control) information:
• MPID_SendControl (MPID_SendControlBlock)
• MPID_RecvAnyControl
• MPID_ControlMsgAvail
Routines for sending and receiving data:
• MPID_SendChannel
• MPID_RecvFromChannel
Channel Interface: three different data exchange mechanisms
• Eager (default): data is sent to the destination immediately and buffered on the receiver side.
• Rendezvous (e.g., MPI_Ssend): data is sent to the destination only when requested.
• Get (shared memory): data is read directly by the receiver.
Summary
• Point-to-point and collective operations
  • Blocking, nonblocking
  • Asynchronous, synchronous
  • Buffered, ready
• Abstraction for processes
  • Rank within a group
  • Virtual topologies
• Data types
  • Predefined, user defined
  • Pack/unpack
• Architecture of MPI
  • ADI (Abstract Device Interface)
  • Channel Interface