Parallel Computing 4 – Message Passing Interface I – Ondřej Jakl, Institute of Geonics, Academy of Sci. of the CR
Outline of the lecture • PVM – forerunner of MPI • Development of the MPI standard • MPI minimum • Point-to-point communication • Send modes • Preventing deadlock • Collective communication • Tips for users
Message passing systems – revision • The message passing model: general, widely acceptable • MPS = computer realizations of the message passing model • MPS act in analogy with a post service • “Shared nothing” system
Historical remarks • Message passing until early 1990's: experimental, proprietary, incompatible • Linda, Express, NX, etc. • Lack of portability, need of a standard
Parallel Virtual Machine • Software package that emulates a general-purpose parallel computing framework on interconnected computers of varied (heterogeneous) architecture • full implementation of the message passing model • since 1989 at Oak Ridge National Laboratory (Vaidy Sunderam, Al Geist) • http://www.epm.ornl.gov/pvm/pvm_home.html • freely available (including source code), compiled on many platforms • typically on networks of workstations (Unix, Windows) • last significant changes in 1999 (version 3.4) • Great impact on the parallel scene: made parallel processing available for everyone • educational tool to teach par. programming (“Pascal of message passing”) • powerful tool to solve important practical problems • “de facto” (not “de jure”) standard for many years • advantageous in certain contexts even today • heterogeneous environments, fault tolerant features, etc. • In general overshadowed by MPI
The MPI standard • Strong need for the creation of a message passing library standard • Chronological milestones: • 4/1992: a standardization working group established • 11/1992: MPI Forum; eventually comprised of about 60 individuals from 40 organizations including parallel computer vendors, software writers, academia and application scientists • http://www.mpi-forum.org • 11/1993: draft Message Passing Interface (MPI) standard presented • 4/1994: MPI-1 released • 6/1995: MPI version 1.1 released • 7/1996: MPI-2 draft made available; includes MPI version 1.2 • 4/1997: MPI-2 released • adds parallel I/O, dynamic process management, remote memory operations • mostly a superset of MPI-1
Latest development • EuroPVM/MPI'06 in Bonn: restarting regular work of the MPI Forum • 8/2008: MPI-2.1, 9/2009: MPI-2.2: modest changes to the standard • clarify and correct errors • MPI-3.0 expected by the end of 2010 • remote memory access • fault tolerance • nonblocking collectives • Both free (e.g. MPICH, Open MPI) and commercial implementations (many vendors) available [further slide]
Goals of the MPI project • To provide source-code portability • To allow efficient implementation • To offer other features, including • support for heterogeneous parallel architectures • semantics independent of the programming language • language bindings for C/C++, Fortran • propinquity to existing tools
MPI version of the HELLO program
[Figure: one process calls MPI_Send, the message travels to the other process, which calls MPI_Recv]
#include <stdio.h>
#include <string.h>
#include "mpi.h"

int main (int argc, char *argv[])
{
  char msg[20];
  int myrank;
  MPI_Status status;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
  if (myrank == 0) {        /* code for process zero */
    strcpy(msg, "Hello, there");
    /* +1 so that the terminating '\0' is sent as well */
    MPI_Send(msg, (int)strlen(msg) + 1, MPI_CHAR, 1, 99, MPI_COMM_WORLD);
  } else if (myrank == 1) { /* code for process one */
    MPI_Recv(msg, 20, MPI_CHAR, 0, 99, MPI_COMM_WORLD, &status);
    printf("received :%s:\n", msg);
  }
  MPI_Finalize();
  return 0;
}
MPI(-1) minimum • MPI is quite a complex system: about 130 functions specified • about 500 (!) in MPI-2 • But: just 5 functions are essential for MPI programs • MPI_Init: Initializes MPI • MPI_Comm_rank: Returns the rank of the local task • MPI_Send: Basic (blocking) send operation • MPI_Recv: Performs a (blocking) receive operation • MPI_Finalize: Terminates the MPI execution environment • Sufficient to create simple codes
Specification of routines • The MPI standard uses a language independent description of routines • Names prefixed by MPI_ to avoid conflicts with program variables • MPI constants are all upper case • Arguments: • IN: the call uses, but does not update the argument • OUT: the call may update the argument • INOUT: the call both uses and updates the argument • Return MPI_SUCCESS on success • failure codes are implementation dependent • General MPI function call (in C): error = MPI_Xxxxx (parameter, ...); • mpi.h header file • Ex.: MPI_WAITALL (count, requests, statuses): IN count (integer), INOUT requests (array of handles), OUT statuses (array of statuses) • In C: int MPI_Waitall (int count, MPI_Request *requests, MPI_Status *statuses);
Calling MPI-routines #include <mpi.h> ... int err; ... err = MPI_Init(&argc, &argv); if (err == MPI_SUCCESS) { ...routine ran correctly... } ...
More technical points • Handles • references to internal MPI (opaque) objects • in C, different handle types are defined (typedef in mpi.h ) • returned by MPI calls, used as arguments in MPI calls • MPI_COMM_WORLD in C, an object of type MPI_Comm (a communicator) • Error handling • MPI provides the user with reliable message transmission • MPI implementor must insulate the user from a possibly unreliable underlying communication system • no mechanisms for handling communication or processor failures • implementations may go beyond the standard • error detected during MPI execution by default aborts the application • can be changed to handle recoverable errors using returned error codes
MPI startup and cleanup int MPI_Init (int *argc, char ***argv) Initializes MPI. All MPI programs must call this routine (only once) before any other MPI routine. This routine establishes the MPI environment. int MPI_Finalize (void) Terminates the MPI execution environment by cleaning up all MPI data structures, cancelling operations that never completed, and so on. This function should be the last MPI routine called in every MPI process – later no other MPI routines (including MPI_Init) may be called. ... int err; err = MPI_Init(&argc, &argv); ... err = MPI_Finalize(); ...
Communicator (1) • A handle that defines a group of processes that are permitted to communicate with one another • i.e. a communication context for a communication operation; the context provides a separate “communication universe” • messages are always received within the context they were sent • messages in different contexts do not interfere • specified via a name that is included as a parameter within the argument list of the MPI call • This group of processes is enumerated and ordered: • processes are identified by their unique ranks within this group • valid ranks: 0, 1, ..., (n - 1), where n is the group size • ranks are used to control program execution: if (rank == 0) /* master */ do_this(); else /* workers */ do_that();
Communicator (2) • MPI_COMM_WORLD: a basic predefined communicator comprising all processes • after MPI initialization, all MPI processes are identified by their rank in the group of MPI_COMM_WORLD • if a process becomes associated with other communicators, it will get a unique rank within each of these as well
Getting communicator info int MPI_Comm_rank (MPI_Comm comm, int *rank) Returns the rank of the local process in the group associated with a communicator int MPI_Comm_size (MPI_Comm comm, int *size) Determines the number of processes in the group associated with a communicator • for MPI_COMM_WORLD, this number is given by the number of processes launched at the start of the MPI application [next slide]
MPI application startup • The MPI standard provides no specification about how an MPI program is started or about its run-time environment • Various implementations may use different commands, environment variables, etc. • mpirun -np 4 parallel.exe (MPICH) • poe parallel.exe -procs 4 (IBM POE) • MP_PROCS=4 (IBM POE) • mpiexec -n 4 parallel.exe (MPICH2) • Implementation may need some setup before MPI routines can be called • e.g. hostfile specification, starting daemons, etc. • .mpd.conf file, mpd daemon (MPICH2) • Similarly, some clean-up may be needed to finish an MPI program
Rank example (after [CI-Tutor])
// Rank – rank.c
#include <stdio.h>
#include <mpi.h>
int main (int argc, char *argv[])
{
  int myrank, size;
  MPI_Init(&argc, &argv);                   // Initialize MPI
  MPI_Comm_rank(MPI_COMM_WORLD, &myrank);   // Get my rank
  MPI_Comm_size(MPI_COMM_WORLD, &size);     // Get the processor count
  printf("Processor %d of %d: Hello World!\n", myrank, size);
  MPI_Finalize();                           // Terminate MPI
}
Compilation (MPICH):
$ mpicc rank.c -o rank
Execution on 4 processors:
$ mpirun -np 4 rank
Output (the order of the lines may vary from run to run):
Processor 1 of 4: Hello World!
Processor 2 of 4: Hello World!
Processor 3 of 4: Hello World!
Processor 0 of 4: Hello World!
MPI basic data types (1) • MPI ensures automatic translation between data representations in messages in a heterogeneous environment • rules for representation conversion (e.g. XDR) are not specified • MPI provides its own reference (basic) datatypes that correspond to the elementary datatypes of the host language
MPI datatype          C type
MPI_CHAR              signed char
MPI_SHORT             signed short int
MPI_INT               signed int
MPI_LONG              signed long int
MPI_UNSIGNED_CHAR     unsigned char
MPI_UNSIGNED_SHORT    unsigned short int
MPI_UNSIGNED          unsigned int
MPI_UNSIGNED_LONG     unsigned long int
MPI_FLOAT             float
MPI_DOUBLE            double
MPI_LONG_DOUBLE       long double
MPI basic data types (2) • Variables are normally declared as C/Fortran types; MPI type names are used as arguments in MPI routines when a type is needed • char msg[] = "a message"; MPI_Send(msg,strlen(msg),MPI_CHAR,1,99,MPI_COMM_WORLD); • Value types must all match in the send call and receive call • Special MPI datatypes: • MPI_BYTE: eight binary digits; typeless transfer, no conversions • MPI_PACKED: type indicator to support sending noncontiguous data by packing it into a contiguous buffer (MPI_Pack) • For more complex data structures, MPI allows for the definition of so-called derived datatypes, which are built from the basic types [later] • Note the MPI system types (in C: MPI_Comm, MPI_Status, MPI_Datatype, etc.)
P2P communication – revision • Exactly two processes are involved • One process (sender / source) sends a message and another process (receiver / destination) receives it • active participation of processes on both sides usually required • SEND, RECV operations • two-sided communication • In general, the source and destination processes operate asynchronously • the source may complete sending a message long before the destination gets around to receiving it • the destination may initiate receiving a message that has not yet been sent • The order of messages between a given pair of processes is guaranteed: two messages from the same sender that match the same receive do not overtake each other
Buffers • Send and receive buffers in the user’s address space are required to perform the message transfer: • the message data is pulled out of the send buffer and a message is assembled • the message is transferred from sender to receiver • data is pulled from the incoming message and disassembled into the receive buffer • Buffer location, size and type specified in the send/receive calls • Auxiliary system buffers may be used to make the message transfer more efficient (implementation dependent) • opaque to the programmer and managed entirely by the MPI library • a finite resource that is easily exhausted • often mysterious and not well documented • can exist on the sending side, the receiving side, or both • PVM buffers all messages automatically
Components of an MPI message • Envelope • source – sending process • destination – receiving process • tag – non-negative integer used to classify messages • communicator – specifies a group of processes to which both source and destination belong • Message body • buffer – message data [cf. next slide] • count – number of items of type datatype in buffer • datatype – type of the message data • Components are specified as the arguments of the send operation • except the first one (source), which is given implicitly by the identity of the sender • Ex.: MPI_Send(&Myint,1,MPI_INT,1,99,MPI_COMM_WORLD)
Sending messages int MPI_Send (void* buffer, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm) • Standard blocking send • buffer reference to the data to be sent • a sequence of count values of MPI type datatype • count number of data elements of a particular type to be sent • datatype MPI data type of the sent data • dest rank of the process where the message should be delivered • tag arbitrary non-negative integer assigned by the programmer to uniquely identify a message • the MPI standard guarantees that integers 0-32767 can be used as tags • comm communicator Ex.: double a[100]; ... MPI_Send(a,100,MPI_DOUBLE,1,999,MPI_COMM_WORLD);
Receiving messages int MPI_Recv (void* buffer, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status) • Blocking receive • buffer reference to the receive buffer where the incoming data is stored • count maximum number of data elements to be received • datatype MPI data type of the received data elements • source rank of the sending process • MPI_ANY_SOURCE to receive from any task • tag message tag • MPI_ANY_TAG to receive regardless of tags • comm communicator • status structure containing the source and the tag of the message Ex.: double b[100]; ... MPI_Recv(b,100,MPI_DOUBLE,0,999,MPI_COMM_WORLD,&status);
Ping-pong example (1)
// Ping-pong – pp.c
#include <mpi.h>
#include <stdio.h>
int main(int argc, char *argv[])
{
  int numtasks, rank, dst, src, rc, count, tag = 1;
  char inmsg, outmsg = 'x';
  MPI_Status Stat;
  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  if (rank == 0) {
    /* timings with MPI_Wtime() can be placed around this pair to get the communication latency */
    dst = 1; src = 1;
    rc = MPI_Send(&outmsg, 1, MPI_CHAR, dst, tag, MPI_COMM_WORLD);
    rc = MPI_Recv(&inmsg, 1, MPI_CHAR, src, tag, MPI_COMM_WORLD, &Stat);
  }
Ping-pong example (2)
  else if (rank == 1) {
    dst = 0; src = 0;
    rc = MPI_Recv(&inmsg, 1, MPI_CHAR, src, tag, MPI_COMM_WORLD, &Stat);
    rc = MPI_Send(&outmsg, 1, MPI_CHAR, dst, tag, MPI_COMM_WORLD);
  }
  /* Stat is only set on ranks 0 and 1: run with exactly 2 processes */
  rc = MPI_Get_count(&Stat, MPI_CHAR, &count);
  printf("Task %d: Received %d char(s) from task %d with tag %d\n",
         rank, count, Stat.MPI_SOURCE, Stat.MPI_TAG);
  MPI_Finalize();
  return 0;
}
Compilation:
$ mpicc pp.c -o pp
Execution:
$ mpirun -np 2 pp
Output:
Task 1: Received 1 char(s) from task 0 with tag 1
Task 0: Received 1 char(s) from task 1 with tag 1
Send – global overview • Things are more complicated: SEND comes in a blocking and a nonblocking form, each in four modes: buffered, synchronous, ready, standard • The modes differ in: • dependence of completion of the send operation on the receipt of the message • usage of buffers
Blocking send • Does not return until the message has been safely stored away so that the sender is free to overwrite (reuse) the send buffer • message might be copied directly to the matching receive buffer, or into a temporary system buffer • Blocking SEND modes: buffered (MPI_Bsend), synchronous (MPI_Ssend), ready (MPI_Rsend), standard (MPI_Send)
Buffered send – blocking (1) [KTH98]
Buffered send – blocking (2) int MPI_Bsend (void* buffer, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm) • Guarantees buffering if no matching receive is posted • Allows immediate completion of the send call • Buffer management may be expensive (memory allocation, copying of data) • amount of the buffer space is controlled by the user: MPI_BUFFER_ATTACH, MPI_BUFFER_DETACH • Local operation: its completion does not depend on the occurrence of a matching receive
Synchronous send – blocking (1) [KTH98]
Synchronous send – blocking (2) int MPI_Ssend (void* buffer, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm) • Completes successfully only if a matching receive is posted and the receive operation has started • The completion indicates (in addition to the reusability of the send buffer) that the receiver has reached a certain point in its execution (start of a receive) – synchronizing effect (both processes rendezvous at the communication)
Ready send – blocking (1) [KTH98]
Ready send – blocking (2) int MPI_Rsend (void* buffer, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm) • May be started only if a matching receive is posted • otherwise the operation is erroneous and the result undefined • Allows the handshake operation to be omitted, which may result in better performance • Difficult to debug • Non-local operation: completion depends on the behaviour of another process
Standard send – blocking (1) [KTH98]
Standard send – blocking (2) int MPI_Send (void* buffer, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm) • MPI implementation decides whether outgoing messages will be buffered (using system buffers) • In practice usually depends on the message size • buffering messages up to some threshold • When buffered, the send call can complete before a matching receive is invoked as soon as the message is copied to the buffer • cf. buffered send • Without buffering, the send call will not complete until a matching receive is posted and the data has been moved to the receiver • cf. synchronous send • Portable programs should not be dependent on this buffering • Non-local operation
Nonblocking send (1) • Nonblocking send calls initiate the send operation, but do not wait for completion • the send start call may return before the message has been copied out of the send buffer • May improve performance by overlapping communication and computation • with suitable hardware, e.g. an intelligent communication controller • A separate send complete call is needed to verify the completion of the communication • MPI_TEST, MPI_WAIT • verify if the send buffer can be reused • Request objects, accessed via a handle, are used to identify the posted nonblocking operation • an extra argument in the nonblocking routines
Nonblocking send (2) • The same four communication modes as with blocking send, with the same semantics: • Nonblocking SEND modes: buffered (MPI_Ibsend), synchronous (MPI_Issend), ready (MPI_Irsend), standard (MPI_Isend) • Always a local operation: an immediate return, independent of the state of other processes • Buffered, ready sends: limited practical difference between blocking and nonblocking form • Standard, synchronous sends: blocking is suppressed
Standard send – nonblocking int MPI_Isend (void* buffer, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request) [KTH98]
Completion: waiting and testing int MPI_Wait (MPI_Request *request, MPI_Status *status) • Blocks until a specified nonblocking (send or receive) operation has completed int MPI_Test (MPI_Request *request, int *flag, MPI_Status *status) • Checks the status of a specified nonblocking operation • The flag parameter returns logical true if the operation has completed, and logical false if not MPI_Waitany, MPI_Waitall, MPI_Waitsome MPI_Testany, MPI_Testall, MPI_Testsome • For multiple nonblocking operations, one can specify any (one), all or some (one or more) completions at once int MPI_Cancel (MPI_Request *request)
Receive – blocking & nonblocking (1) • The selection of a message by a RECV operation is governed by the envelope • source, tag and communicator values must match • The received message is stored in the receive buffer specified in the receive call • length of the buffer must be greater than or equal to the length of the message • MPI_ANY_SOURCE and MPI_ANY_TAG, respectively, are wildcards for source and tag values • Data matching rules apply • Completion: a message has arrived and is stored in the receive buffer • Blocking vs. nonblocking receive: • blocking receive: returns after the receive completion • nonblocking receive: a receive start call initiates the operation; a receive complete call is needed to complete it • Nonblocking sends (of any mode) can be matched with blocking receives, and vice versa
Receive – blocking & nonblocking (2) int MPI_Recv (void* buffer, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status) • Blocking receive: Receives a message and blocks until the requested data is available in the receive buffer in the receiving task int MPI_Irecv (void* buffer, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Request *request) • Nonblocking receive: Processing continues immediately without waiting for the message to be received. • Takes no status argument; a request handle is returned for handling the pending message status • program must use calls to MPI_WAIT or MPI_TEST to determine when the nonblocking receive operation completes
Probing a message int MPI_Probe(int source,int tag,MPI_Comm comm,MPI_Status *status) • Blocking test for a message • Updates status without transferring data • MPI_ANY_SOURCE and MPI_ANY_TAG may be used to test for a message from any source or with any tag • actual source and tag will be returned in the status structure • size of the message can be discovered through MPI_Get_count int MPI_Iprobe(int source,int tag,MPI_Comm comm,int *flag, MPI_Status *status) • Nonblocking test for a message • The flag parameter returns logical true if a message has arrived, and logical false if not
Nonblocking example: data exchange
...
double a[100], b[100];
MPI_Init(&argc, &argv);                   // Initialize MPI
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);   // Get rank
if (myrank == 0) {
  // Post a receive, send a message, then wait
  MPI_Irecv(b, 100, MPI_DOUBLE, 1, 19, MPI_COMM_WORLD, &request);
  MPI_Send(a, 100, MPI_DOUBLE, 1, 17, MPI_COMM_WORLD);
  MPI_Wait(&request, &status);
}
else if (myrank == 1) {
  // Post a receive, send a message, then wait
  MPI_Irecv(b, 100, MPI_DOUBLE, 0, 17, MPI_COMM_WORLD, &request);
  MPI_Send(a, 100, MPI_DOUBLE, 0, 19, MPI_COMM_WORLD);
  MPI_Wait(&request, &status);
}
Deadlock (1) • Bidirectional communication: 2 processes exchange data with each other • Danger of deadlocks • two (or more) processes are blocked and each is waiting for the other to make progress • Deadlocks can take place either due to the incorrect order of (blocking) send and receive (A), or due to the limited size of the system buffer (B)
(A) both processes receive first, so neither ever reaches its send:
if (myrank == 0) { MPI_RECV(...,1,...); MPI_SEND(...,1,...); } else if (myrank == 1) { MPI_RECV(...,0,...); MPI_SEND(...,0,...); }
(B) both processes send first, which works only while the messages fit into the system buffer:
if (myrank == 0) { MPI_SEND(...,1,...); MPI_RECV(...,1,...); } else if (myrank == 1) { MPI_SEND(...,0,...); MPI_RECV(...,0,...); }
Deadlock (2) • Deadlocks can be avoided using nonblocking communication (A) [cf. nonblocking example] or reordering to match sends and receives (B)
(A) nonblocking send:
if (myrank == 0) { MPI_ISEND(...,1,...,req); MPI_RECV(...,1,...); MPI_WAIT(req,...); } else if (myrank == 1) { MPI_ISEND(...,0,...,req); MPI_RECV(...,0,...); MPI_WAIT(req,...); }
(B) reordering, so that each send meets a matching receive:
if (myrank == 0) { MPI_RECV(...,1,...); MPI_SEND(...,1,...); } else if (myrank == 1) { MPI_SEND(...,0,...); MPI_RECV(...,0,...); }