MPI: Message Passing Interface Prabhaker Mateti Wright State University
Overview
• MPI Hello World!
• Introduction to programming with MPI
• MPI library calls
Mateti, MPI
MPI Overview
• Similar to PVM
• Network of heterogeneous machines
• Multiple implementations
  • Open source: MPICH, LAM
  • Vendor specific
MPI Features
• Rigorously specified standard
• Portable source code
• Enables third-party libraries
• Derived data types to minimize overhead
• Process topologies for efficiency on MPPs
• Can fully overlap communication
• Extensive group communication
MPI-2
• Dynamic process management
• One-sided communication
• Extended collective operations
• External interfaces
• Parallel I/O
• Language bindings (C++ and Fortran 90)
• http://www.mpi-forum.org/
MPI Overview
• 125+ functions
• Typical applications need only about six of them
MPI: manager + workers

#include <mpi.h>

int main(int argc, char *argv[])
{
    int myrank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    if (myrank == 0)
        manager();
    else
        worker();
    MPI_Finalize();
}

• MPI_Init initializes the MPI system
• MPI_Finalize is called last, by all processes
• MPI_Comm_rank identifies a process by its rank
• MPI_COMM_WORLD is the group this process belongs to
MPI: manager()

manager()
{
    MPI_Status status;
    MPI_Comm_size(MPI_COMM_WORLD, &ntasks);
    for (i = 1; i < ntasks; ++i) {
        work = nextWork();
        MPI_Send(&work, 1, MPI_INT, i, WORKTAG, MPI_COMM_WORLD);
    }
    …
    MPI_Reduce(&sub, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
}

• MPI_Comm_size
• MPI_Send
MPI: worker()

worker()
{
    MPI_Status status;
    for (;;) {
        MPI_Recv(&work, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
        result = doWork();
        MPI_Send(&result, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    }
}

• MPI_Recv
MPI computes pi

#include "mpi.h"

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &np);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    n = ...;  /* intervals */
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
    sub = series_sum(n, np);
    MPI_Reduce(&sub, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (myid == 0)
        printf("pi is %.16f\n", pi);
    MPI_Finalize();
    return 0;
}
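The slides leave series_sum() unwritten. One plausible sketch (the midpoint-rule choice and the explicit myid parameter are assumptions, not taken from the slides, which presumably keep the rank in a global) has rank myid of np processes sum every np-th term of the midpoint-rule approximation of the integral of 4/(1+x*x) over [0,1], which converges to pi:

```c
#include <math.h>

/* Hypothetical series_sum(): each rank sums a strided subset of the
 * midpoint-rule terms, so the MPI_SUM reduction of all np partial
 * results approximates pi. */
double series_sum(int n, int np, int myid)
{
    double h = 1.0 / n;     /* width of each interval */
    double sum = 0.0;
    for (int i = myid; i < n; i += np) {
        double x = h * (i + 0.5);        /* midpoint of interval i */
        sum += 4.0 / (1.0 + x * x);
    }
    return h * sum;         /* this rank's partial contribution */
}
```

The MPI_Reduce call in the slide then adds the np partial sums into pi on rank 0.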
Process Groups
• Group membership is static.
• There are no race conditions caused by processes independently entering and leaving a group.
• New group formation is collective, and group membership information is distributed, not centralized.
MPI_Send: blocking send

MPI_Send(
    &sendbuffer,    /* message buffer */
    n,              /* n items of */
    MPI_type,       /* datatype in message */
    destination,    /* process rank */
    WORKTAG,        /* user-chosen tag */
    MPI_COMM        /* group (communicator) */
);
MPI_Recv: blocking receive

MPI_Recv(
    &recvbuffer,     /* message buffer */
    n,               /* n data items */
    MPI_type,        /* of type */
    MPI_ANY_SOURCE,  /* from any sender */
    MPI_ANY_TAG,     /* any type of message */
    MPI_COMM,        /* group (communicator) */
    &status
);
Send-receive succeeds when …
• the sender's destination is a valid process rank
• the receiver specified a valid source process
• the communicator is the same for both
• the tags match
• the message data types match
• the receiver's buffer is large enough
Message Order
• If P sends message m1 first and then m2 to Q, then Q will receive m1 before m2.
• If P sends m1 to Q and then m2 to R, nothing can be concluded, in terms of a global wall clock, about whether R receives m2 before or after Q receives m1.
Blocking and Non-blocking
• Sends and receives can each be blocking or non-blocking
• A blocking send can be coupled with a non-blocking receive, and vice versa
• Non-blocking sends can use
  • Standard mode: MPI_Isend
  • Synchronous mode: MPI_Issend
  • Buffered mode: MPI_Ibsend
  • Ready mode: MPI_Irsend
MPI_Isend: non-blocking send

MPI_Isend(
    &buffer,        /* message buffer */
    n,              /* n items of */
    MPI_type,       /* datatype in message */
    destination,    /* process rank */
    WORKTAG,        /* user-chosen tag */
    MPI_COMM,       /* group (communicator) */
    &handle         /* request handle */
);
MPI_Irecv: non-blocking receive

MPI_Irecv(
    &result,         /* message buffer */
    n,               /* n data items */
    MPI_type,        /* of type */
    MPI_ANY_SOURCE,  /* from any sender */
    MPI_ANY_TAG,     /* any type of message */
    MPI_COMM_WORLD,  /* group */
    &handle          /* request handle */
);
MPI_Wait, MPI_Test

MPI_Wait( &handle, &status );
MPI_Test( &handle, &flag, &status );

• MPI_Wait blocks until the request completes
• MPI_Test returns immediately; flag is set if the request has completed
Collective Communication
MPI_Bcast

MPI_Bcast(
    buffer,
    count,
    MPI_Datatype,
    root,
    MPI_Comm
);

All processes use the same count, data type, root, and communicator. Before the operation, the root's buffer contains the message. After the operation, every process's buffer contains the message from the root.
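The before/after buffer contents described above can be illustrated without an MPI runtime. In this plain-C sketch (the function name and the row-per-process layout are illustrative assumptions), the N "processes" are just rows of an array, and only the data movement is modeled, not the communication:

```c
#include <string.h>

#define N 4        /* simulated number of processes */
#define COUNT 3    /* items per buffer */

/* Models MPI_Bcast's effect: after the call, every process's buffer
 * holds a copy of the root's buffer. */
void bcast_effect(int buf[N][COUNT], int root)
{
    for (int p = 0; p < N; ++p)
        if (p != root)
            memcpy(buf[p], buf[root], sizeof buf[root]);
}
```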
MPI_Scatter

MPI_Scatter(
    sendbuffer,
    sendcount,
    MPI_Datatype,
    recvbuffer,
    recvcount,
    MPI_Datatype,
    root,
    MPI_Comm
);

All processes use the same send and receive counts, data types, root, and communicator. Before the operation, the root's send buffer contains a message of length sendcount * N, where N is the number of processes. After the operation, the message is divided equally and dispersed to all processes (including the root) in rank order.
MPI_Gather

MPI_Gather(
    sendbuffer,
    sendcount,
    MPI_Datatype,
    recvbuffer,
    recvcount,
    MPI_Datatype,
    root,
    MPI_Comm
);

This is the "reverse" of MPI_Scatter(). After the operation, the root process has in its receive buffer the concatenation of the send buffers of all processes (including its own), for a total message length of recvcount * N, where N is the number of processes. The messages are gathered in rank order.
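The sendcount * N layout of Scatter and the rank-order concatenation of Gather can likewise be shown without an MPI runtime. A plain-C sketch (all names are illustrative; only the data movement is modeled), in which scattering and then gathering reproduces the root's original buffer:

```c
#include <string.h>

/* What rank p receives from MPI_Scatter: the p-th chunk, in rank
 * order, of the root's send buffer of length count * nprocs. */
void scatter_chunk(const int *sendbuf, int count, int p, int *recvbuf)
{
    memcpy(recvbuf, sendbuf + p * count, count * sizeof *recvbuf);
}

/* What the root holds after MPI_Gather: the concatenation, in rank
 * order, of every process's send buffer. */
void gather_chunk(int *recvbuf, int count, int p, const int *sendbuf)
{
    memcpy(recvbuf + p * count, sendbuf, count * sizeof *sendbuf);
}
```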
MPI_Reduce

MPI_Reduce(
    sndbuf,
    rcvbuf,
    count,
    MPI_Datatype,
    MPI_Op,
    root,
    MPI_Comm
);

After the operation, the root process has in its receive buffer the result of the pair-wise reduction of the send buffers of all processes, including its own.
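The element-wise nature of the reduction (count results, each combining one element from every process) can be shown in plain C. A sketch of the MPI_SUM case (names illustrative; the send buffers of the simulated processes are packed into one flat array):

```c
/* Models MPI_Reduce with MPI_SUM: recvbuf[i] is the sum, over all
 * nprocs processes, of element i of each process's send buffer.
 * sendbufs holds nprocs buffers of 'count' ints, back to back. */
void reduce_sum_effect(const int *sendbufs, int nprocs, int count,
                       int *recvbuf)
{
    for (int i = 0; i < count; ++i) {
        recvbuf[i] = 0;
        for (int p = 0; p < nprocs; ++p)
            recvbuf[i] += sendbufs[p * count + i];
    }
}
```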
Predefined Reduction Ops
• MPI_MAX, MPI_MIN
• MPI_SUM, MPI_PROD
• MPI_LAND, MPI_BAND
• MPI_LOR, MPI_BOR
• MPI_LXOR, MPI_BXOR
• MPI_MAXLOC, MPI_MINLOC
(L = logical, B = bit-wise)
User-Defined Reduction Ops

void myOperator(
    void *invector,
    void *inoutvector,
    int *length,
    MPI_Datatype *datatype)
{
    …
}
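The body is elided on the slide. A sketch of one possible operator, element-wise maximum over doubles, follows the MPI user-function contract: inoutvector[i] = op(invector[i], inoutvector[i]) for the first *length elements. The MPI_Datatype parameter of the real signature is omitted here, as an assumption, so the sketch compiles without mpi.h:

```c
/* Hypothetical user-defined reduction op: element-wise maximum.
 * Combines the incoming vector into inoutvector in place, as MPI
 * requires of user functions. */
void myMax(void *invector, void *inoutvector, int *length)
{
    double *in = invector, *inout = inoutvector;
    for (int i = 0; i < *length; ++i)
        if (in[i] > inout[i])
            inout[i] = in[i];
}
```

In real MPI code the function takes the extra MPI_Datatype * argument and is registered with MPI_Op_create(function, commute, &op) before being passed to MPI_Reduce.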
Ten Reasons to Prefer MPI over PVM
• MPI has more than one free, high-quality implementation.
• MPI can efficiently program MPPs and clusters.
• MPI is rigorously specified.
• MPI efficiently manages message buffers.
• MPI has fully asynchronous communication.
• MPI groups are solid, efficient, and deterministic.
• MPI defines a third-party profiling mechanism.
• MPI synchronization protects third-party software.
• MPI is portable.
• MPI is a standard.
Summary
• Introduction to MPI
• Reinforced the manager-workers paradigm
• Send, receive: blocking and non-blocking
• Process groups
MPI Resources
• Open-source implementations
  • MPICH
  • LAM
• Books
  • Using MPI, William Gropp, Ewing Lusk, Anthony Skjellum
  • Using MPI-2, William Gropp, Ewing Lusk, Rajeev Thakur
• On-line tutorials
  • www.tc.cornell.edu/Edu/Tutor/MPI/