Example

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
  int my_rank, ncpus;
  int left_neighbor, right_neighbor;
  int data_received = -1;
  int tag = 101;
  MPI_Status statSend, statRecv;
  MPI_Request reqSend, reqRecv;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
  MPI_Comm_size(MPI_COMM_WORLD, &ncpus);

  left_neighbor = (my_rank-1 + ncpus)%ncpus;
  right_neighbor = (my_rank+1)%ncpus;

  MPI_Isend(&my_rank, 1, MPI_INT, left_neighbor, tag, MPI_COMM_WORLD, &reqSend);        // comm start
  MPI_Irecv(&data_received, 1, MPI_INT, right_neighbor, tag, MPI_COMM_WORLD, &reqRecv);

  // maybe do something useful here

  MPI_Wait(&reqSend, &statSend);   // complete comm
  MPI_Wait(&reqRecv, &statRecv);

  printf("Among %d processes, process %d received from right neighbor: %d\n",
         ncpus, my_rank, data_received);

  // clean up
  MPI_Finalize();
  return 0;
}

mpirun -np 4 test_shift
Among 4 processes, process 3 received from right neighbor: 0
Among 4 processes, process 2 received from right neighbor: 3
Among 4 processes, process 0 received from right neighbor: 1
Among 4 processes, process 1 received from right neighbor: 2
Semantics etc
• Purpose:
  • Mechanism for overlapping communication with useful computation: communication and computation may proceed concurrently (latency hiding).
  • Deadlock avoidance.
  • May avoid system buffering and memory-to-memory copying, improving performance.
• Structure of non-blocking calls:

  Post communication request    non-blocking call, MPI_Isend, ...
  ...                           // do some useful work
  Complete communication        call MPI_Wait, MPI_Test, ...
Semantics etc
• Non-blocking calls: MPI_Isend, MPI_Irecv, etc.
  • Return immediately; they merely post a request to the MPI system to initiate the communication.
  • The communication is not yet complete when the call returns.
  • Do not touch the memory provided in these calls until the communication has been completed with MPI_Wait, MPI_Test, etc. (a minimal sketch of the safe pattern follows).

[Diagrams: non-blocking send, non-blocking receive]
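A minimal sketch of the rule above, assuming an illustrative destination rank peer (not from the slides): the send buffer x may only be modified after MPI_Wait reports completion.

double x = 1.0;
int peer = 1, tag = 0;          // illustrative values
MPI_Request req;
MPI_Status stat;

MPI_Isend(&x, 1, MPI_DOUBLE, peer, tag, MPI_COMM_WORLD, &req);
/* x = 2.0; */                  // NOT allowed here: the send may still be reading x
/* ... do work that does not touch x ... */
MPI_Wait(&req, &stat);          // communication completed
x = 2.0;                        // now safe to modify the send buffer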
Non-blocking Send/Recv

int MPI_Isend(void *buf, int count, MPI_Datatype datatype, int dest,
              int tag, MPI_Comm comm, MPI_Request *request)

MPI_ISEND(BUF, COUNT, DATATYPE, DEST, TAG, COMM, REQUEST, IERROR)
  <type> BUF(*)
  INTEGER COUNT, DATATYPE, DEST, TAG, COMM, REQUEST, IERROR

int MPI_Irecv(void *buf, int count, MPI_Datatype datatype, int source,
              int tag, MPI_Comm comm, MPI_Request *request)

MPI_IRECV(BUF, COUNT, DATATYPE, SOURCE, TAG, COMM, REQUEST, IERROR)
  <type> BUF(*)
  INTEGER COUNT, DATATYPE, SOURCE, TAG, COMM, REQUEST, IERROR

• Post send/recv requests to the MPI system. The calls return immediately; do not access the memory pointed to by buf until the communication is completed.
• MPI_Request: request is a handle to an internal MPI object. Everything about that non-blocking communication goes through this handle. MPI_REQUEST_NULL is the null request.

MPI_Request req1, req2;
double A[10], B[5];
...
MPI_Isend(A, 10, MPI_DOUBLE, rank, tag, MPI_COMM_WORLD, &req1);
MPI_Irecv(B, 5, MPI_DOUBLE, rank, tag, MPI_COMM_WORLD, &req2);
Other Non-blocking Sends
• Four communication modes, with the same semantics as the blocking sends:
  • MPI_ISEND – standard mode
  • MPI_IBSEND – buffered mode
  • MPI_ISSEND – synchronous mode
  • MPI_IRSEND – ready mode
• Arguments are identical to MPI_Isend:

int MPI_Ibsend(void *buf, int count, MPI_Datatype datatype, int dest,
               int tag, MPI_Comm comm, MPI_Request *request)
int MPI_Issend(void *buf, int count, MPI_Datatype datatype, int dest,
               int tag, MPI_Comm comm, MPI_Request *request)
int MPI_Irsend(void *buf, int count, MPI_Datatype datatype, int dest,
               int tag, MPI_Comm comm, MPI_Request *request)
Completion
• Use MPI_Wait or MPI_Test to complete a non-blocking communication.
• Semantics after MPI_Wait returns:
  • For a standard-mode send, the message data has been safely stored away; it is safe to access the send buffer.
  • For a receive, the data has been received into the receive buffer.
MPI_Wait

int MPI_Wait(MPI_Request *request, MPI_Status *status)

MPI_WAIT(REQUEST, STATUS, IERROR)
  INTEGER REQUEST, STATUS(MPI_STATUS_SIZE), IERROR

• request is a handle returned from MPI_Isend, MPI_Irecv, etc.
• Blocks until the communication completes (or fails).
• If the request came from MPI_Isend, MPI_Irecv, etc.:
  • deallocates the request object and sets request to MPI_REQUEST_NULL;
  • returns the status information in status:
    • for MPI_Irecv it holds additional information (source, tag);
    • for MPI_Isend there is not much to be used.

MPI_Request req;
MPI_Status stat;
...
MPI_Irecv(..., &req);
MPI_Wait(&req, &stat);
MPI_Test

int MPI_Test(MPI_Request *request, int *flag, MPI_Status *status)

MPI_TEST(REQUEST, FLAG, STATUS, IERROR)
  LOGICAL FLAG
  INTEGER REQUEST, STATUS(MPI_STATUS_SIZE), IERROR

• request – MPI_Request object from MPI_Isend, etc.
• flag – true if the communication has completed; false if not yet.
  • If true, the request object is deallocated and set to MPI_REQUEST_NULL.
• status – contains the status information if the communication is complete.
• Does not block; returns immediately.
• Provides a mechanism for overlapping communication and computation: do useful computation, periodically check the communication status, and go back to computation if it is not yet complete (see the sketch below).
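A minimal sketch of this polling pattern, assuming illustrative values and a hypothetical routine do_some_work() (neither is from the slides):

double buf[100];
int flag = 0, src = 0, tag = 7;   // illustrative values
MPI_Request req;
MPI_Status stat;

MPI_Irecv(buf, 100, MPI_DOUBLE, src, tag, MPI_COMM_WORLD, &req);
do {
  do_some_work();                 // hypothetical routine: useful computation
  MPI_Test(&req, &flag, &stat);   // non-blocking completion check
} while (!flag);
// flag is true: the receive has completed and buf is safe to read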
Properties
• Order: non-overtaking; message order is preserved according to the execution order of the non-blocking calls that initiate the communications.
• Progress: progress is guaranteed.
  • A receive completed by MPI_Wait will eventually return if there is a matching send.
  • A send completed by MPI_Wait will eventually return if there is a matching receive.

MPI_Comm_rank(comm, &rank);
if(rank == 0) {
  MPI_Isend(A, 1, MPI_DOUBLE, 1, 99, comm, &req1);
  MPI_Isend(B, 1, MPI_DOUBLE, 1, 99, comm, &req2);
}
else if(rank == 1) {
  MPI_Irecv(A, 1, MPI_DOUBLE, 0, MPI_ANY_TAG, comm, &req1);
  MPI_Irecv(B, 1, MPI_DOUBLE, 0, 99, comm, &req2);
}
MPI_Wait(&req1, &stat1);
MPI_Wait(&req2, &stat2);
MPI_Wait Variants
• Deal with arrays of MPI_Requests: MPI_Request req[4];
• MPI_Waitall:
  int MPI_Waitall(int count, MPI_Request *req, MPI_Status *stat)
  • Blocks until all active requests in the array complete; returns the statuses of all communications.
  • Deallocates the request objects and sets them to MPI_REQUEST_NULL.
• MPI_Waitany:
  int MPI_Waitany(int count, MPI_Request *req, int *index, MPI_Status *stat)
  • Blocks until one of the active requests in the array completes; returns its index in the array and the status of the completing communication; deallocates that request object.
  • If there are no active requests in the array, returns index = MPI_UNDEFINED.
• MPI_Waitsome (a usage sketch follows this slide):
  int MPI_Waitsome(int incount, MPI_Request *req, int *outcount, int *array_indices, MPI_Status *array_status)
  • Blocks until at least one of the active communications completes; returns the indices and statuses of the completed communications; deallocates those request objects.
  • If there are no active requests, returns outcount = MPI_UNDEFINED.

MPI_Request req[2];
MPI_Status stat;
int index;
...
MPI_Isend(..., &req[0]);
MPI_Isend(..., &req[1]);
MPI_Waitany(2, req, &index, &stat);
...

MPI_Request req[2];
MPI_Status stat[2];
...
MPI_Isend(..., &req[0]);
MPI_Isend(..., &req[1]);
MPI_Waitall(2, req, stat);
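The slide shows usage of MPI_Waitany and MPI_Waitall; a comparable sketch for MPI_Waitsome (illustrative only, not from the slides):

MPI_Request req[4];
MPI_Status stat[4];
int indices[4], outcount, i;
...
/* assume four non-blocking operations were posted into req[0..3] */
MPI_Waitsome(4, req, &outcount, indices, stat);
for(i = 0; i < outcount; i++) {
  /* req[indices[i]] has completed; its buffer can be used here */
}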
MPI_Test Variants
• MPI_Testall:
  int MPI_Testall(int count, MPI_Request *array_req, int *flag, MPI_Status *array_stat)
  • Returns flag = true if all active requests have completed; returns flag = false otherwise.
  • If true, deallocates the request objects and sets them to MPI_REQUEST_NULL.
• MPI_Testany:
  int MPI_Testany(int count, MPI_Request *array_req, int *index, int *flag, MPI_Status *stat)
  • If one of the active communications has completed, returns flag = true with the index and status of the completing communication, and deallocates that request object.
  • Returns flag = false, index = MPI_UNDEFINED if none has completed.
  • Returns flag = true, index = MPI_UNDEFINED if there are no active requests.
• MPI_Testsome:
  int MPI_Testsome(int incount, MPI_Request *array_req, int *outcount, int *array_indices, MPI_Status *array_stat)
  • Returns in outcount the number of completed active communications, with their indices and statuses.
  • If none has completed, returns outcount = 0.
  • If there are no active communications, returns outcount = MPI_UNDEFINED.
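A sketch of overlapping computation with MPI_Testsome, again using a hypothetical do_some_work() routine (not from the slides):

MPI_Request req[4];
MPI_Status stat[4];
int indices[4], outcount, done = 0, i;
...
/* assume four non-blocking receives were posted into req[0..3] */
while(done < 4) {
  do_some_work();                             // hypothetical routine: useful computation
  MPI_Testsome(4, req, &outcount, indices, stat);
  for(i = 0; i < outcount; i++) {             // communications completed since last check
    /* consume the buffer of req[indices[i]] */
    done++;
  }
}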
Persistent Communication
• Structure of non-blocking calls:
  • MPI_Ixxxx allocates an MPI_Request;
  • MPI_Wait or MPI_Test completes the communication and deallocates the request object.
• Often a communication with the same arguments is executed repeatedly, e.g. every time step or every iteration.
• A persistent request can be created that is not deallocated by MPI_Wait; this reduces the overhead:

  Create persistent request    MPI_Send_init, MPI_Recv_init
  Repeat:
    Start communication        MPI_Start
    ...
    Complete communication     MPI_Wait, MPI_Test
  Free persistent request      MPI_Request_free
Creation

int MPI_Send_init(void *buf, int count, MPI_Datatype datatype, int dest,
                  int tag, MPI_Comm comm, MPI_Request *req)
int MPI_Recv_init(void *buf, int count, MPI_Datatype datatype, int source,
                  int tag, MPI_Comm comm, MPI_Request *req)

• MPI_Send_init creates a persistent request object for a standard-mode send; MPI_Recv_init creates a persistent receive request.
• The request is bound to the arguments buf, count, datatype, dest (or source), tag, comm; these arguments do not change in the following communications.
• On creation the request is inactive – not associated with any active communication. Communication is initiated by MPI_Start.

MPI_Request req_send, req_recv;
double A[100], B[100];
int left_neighbor, right_neighbor, tag = 999;
MPI_Status stat_send, stat_recv;
...
MPI_Send_init(A, 100, MPI_DOUBLE, left_neighbor, tag, MPI_COMM_WORLD, &req_send);
MPI_Recv_init(B, 100, MPI_DOUBLE, right_neighbor, tag, MPI_COMM_WORLD, &req_recv);

MPI_Start(&req_send);
MPI_Start(&req_recv);
... // do something else useful
MPI_Wait(&req_send, &stat_send);
MPI_Wait(&req_recv, &stat_recv);

MPI_Request_free(&req_send);
MPI_Request_free(&req_recv);
Start Communication, Free Request

int MPI_Start(MPI_Request *request)

MPI_START(REQUEST, IERROR)
  INTEGER REQUEST, IERROR

• request is a persistent request created by MPI_Send_init, etc.
• Starts the communication on the request object. The call returns immediately: it starts a non-blocking communication, so do not access the buffer after this call until the communication completes.
• Complete the communication with MPI_Wait, MPI_Test, etc. Upon completion they do not deallocate a persistent request.
• Deallocate the persistent request with MPI_Request_free at the end.

int MPI_Request_free(MPI_Request *request)

MPI_REQUEST_FREE(REQUEST, IERROR)
  INTEGER REQUEST, IERROR
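The persistent-communication example later in these slides also uses MPI_Startall, which starts every persistent request in an array with one call. A minimal sketch (the surrounding declarations are illustrative):

MPI_Request req[2];
MPI_Status stat[2];
...
/* req[0], req[1] created earlier with MPI_Recv_init / MPI_Send_init */
MPI_Startall(2, req);        // start both persistent communications at once
/* ... useful computation ... */
MPI_Waitall(2, req, stat);   // complete both; the requests remain allocated for reuse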
Example: Matrix-Vector Multiplication

AX = Y
A – NxN matrix; X, Y – vectors of dimension N

[Figure: A, X, Y partitioned into row blocks across cpu 0, cpu 1, cpu 2; the X blocks are shifted between cpus from one step to the next.]

Y1 = A11*X1 + A12*X2 + A13*X3
Y2 = A21*X1 + A22*X2 + A23*X3
Y3 = A31*X1 + A32*X2 + A33*X3
Example: Matrix-Vector

Data on cpu 0: [A11 A12 A13] – N/3 x N matrix; X1 – vector of length N/3; Y1 – vector of length N/3
Data on cpu 1: [A21 A22 A23] – N/3 x N matrix; X2 – vector of length N/3; Y2 – vector of length N/3
Data on cpu 2: [A31 A32 A33] – N/3 x N matrix; X3 – vector of length N/3; Y3 – vector of length N/3

Need to communicate: X1, X2, X3 – upward shift; number of shifts = ncpus-1
Assume: A[i][j] = i+j, X[i] = i
Example (non-blocking comm)

#include <stdio.h>
#include <string.h>
#include <mpi.h>
#include "dmath.h"   // ignore this for now

#define DIM 1000     // logical A[DIM][DIM], X[DIM], Y[DIM]

int main(int argc, char **argv)
{
  int ncpus, my_rank, left_neighbor, right_neighbor, tag = 1001;
  int Nx, Ny;   // Ny=DIM, Nx=DIM/ncpus; on each cpu: A[Nx][Ny], X[Nx], Y[Nx]
  MPI_Request req_sr[2];
  MPI_Status stat_sr[2];
  double **A, *X, *Y, *Xt;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
  MPI_Comm_size(MPI_COMM_WORLD, &ncpus);

  if(DIM%ncpus != 0) {   // assume DIM divisible by ncpus
    if(my_rank==0) printf("ERROR: grid size cannot be divided by ncpus!\n");
    MPI_Finalize();
    return -1;
  }

  Nx = DIM/ncpus;   // again, on each cpu: A[Nx][Ny] etc
  Ny = DIM;
  left_neighbor = (my_rank-1 + ncpus)%ncpus;   // top neighbor
  right_neighbor = (my_rank+1)%ncpus;          // bottom neighbor

  A = DMath::newD(Nx, Ny);   // allocate memory; ignore DMath – my own routine
  X = DMath::newD(Nx);
  Xt = DMath::newD(Nx);      // Xt – temporary space for receiving from neighbor
  Y = DMath::newD(Nx);
  int i, j;
  for(i=0;i<Nx;i++) {   // initialize A, X
    for(j=0;j<Ny;j++) A[i][j] = (my_rank*Nx+i) + j;   // *** important ***
    X[i] = my_rank*Nx+i;
  }

  int count;   // loop counter
  int sindex, curr_block;
  memset(Y, '\0', sizeof(double)*Nx);   // zero out result vector Y first

  for(count=0;count<ncpus;count++){
    if(count < ncpus-1) {
      MPI_Irecv(Xt, Nx, MPI_DOUBLE, right_neighbor, tag, MPI_COMM_WORLD, &req_sr[0]);   // receive from bottom neighbor
      MPI_Isend(X, Nx, MPI_DOUBLE, left_neighbor, tag, MPI_COMM_WORLD, &req_sr[1]);     // send to top neighbor
    }

    // compute on current data
    curr_block = (my_rank+count)%ncpus;   // *** important ***
    sindex = curr_block*Nx;               // starting index of A[i][sindex+0:sindex+Nx-1]
    for(i=0;i<Nx;i++)
      for(j=0;j<Nx;j++)
        Y[i] += A[i][sindex+j]*X[j];      // *** important ***

    // complete comm
    if(count<ncpus-1) {
      MPI_Waitall(2, req_sr, stat_sr);    // data now in Xt
      memcpy(X, Xt, sizeof(double)*Nx);   // copy data from Xt to X   *** important ***
    }
  }
Example

  // clean up, free memory
  DMath::del(A);   // ignore DMath for now
  DMath::del(X);
  DMath::del(Xt);
  DMath::del(Y);

  MPI_Finalize();
  return 0;
}
Example: Persistent Communication

...
MPI_Recv_init(Xt, Nx, MPI_DOUBLE, right_neighbor, tag, MPI_COMM_WORLD, &req_sr[0]);
MPI_Send_init(X, Nx, MPI_DOUBLE, left_neighbor, tag, MPI_COMM_WORLD, &req_sr[1]);

for(count=0;count<ncpus;count++){
  if(count < ncpus-1)
    MPI_Startall(2, req_sr);

  // compute on current data
  curr_block = (my_rank+count)%ncpus;
  sindex = curr_block*Nx;
  for(i=0;i<Nx;i++)
    for(j=0;j<Nx;j++)
      Y[i] += A[i][sindex+j]*X[j];

  // complete comm
  if(count<ncpus-1) {
    MPI_Waitall(2, req_sr, stat_sr);    // data now in Xt
    memcpy(X, Xt, sizeof(double)*Nx);   // copy data to X
  }
}

MPI_Request_free(&req_sr[0]);
MPI_Request_free(&req_sr[1]);
...
Example: Send-Recv

...
for(count=0;count<ncpus;count++){
  // compute on current data
  curr_block = (my_rank+count)%ncpus;
  sindex = curr_block*Nx;
  for(i=0;i<Nx;i++)
    for(j=0;j<Nx;j++)
      Y[i] += A[i][sindex+j]*X[j];

  // send-recv
  if(count<ncpus-1)
    MPI_Sendrecv_replace(X, Nx, MPI_DOUBLE, left_neighbor, tag,
                         right_neighbor, tag, MPI_COMM_WORLD, &stat_sr[0]);   // stat_sr was declared earlier as an array
}
...
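MPI_Sendrecv_replace is not introduced elsewhere in this section; for reference, its C binding (from the MPI standard) is shown below. It performs a blocking send and receive through a single buffer, so the received block overwrites the sent one and no temporary Xt buffer or memcpy is needed.

int MPI_Sendrecv_replace(void *buf, int count, MPI_Datatype datatype,
                         int dest, int sendtag, int source, int recvtag,
                         MPI_Comm comm, MPI_Status *status)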
HWK#2: Matrix Multiplication

C = A*B   (column-wise decomposition)
A, B, C – NxN matrices
P – number of processors
A1, A2, A3 – Nx(N/P) matrices; C1, C2, C3 – likewise
Bij – (N/P)x(N/P) matrices

C1 = A1*B11 + A2*B21 + A3*B31   (cpu 0)
C2 = A1*B12 + A2*B22 + A3*B32   (cpu 1)
C3 = A1*B13 + A2*B23 + A3*B33   (cpu 2)

Input:
A[i][j] = 2*i + j
B[i][j] = 2*i - j
HWK #2
• Implement the above parallel matrix multiplication (column-wise data decomposition) in C, C++, or Fortran.
• Use non-blocking communication or persistent communication in MPI.
• Test your parallel implementation and make sure the result is correct:
  • the result for matrix C on p CPUs must be identical to that on 1 CPU.
• Use a matrix size of 2048x2048 (double).
• Time the "multiplication section" of your code using the MPI_Wtime() routine for wall-clock time (a timing sketch follows).
• Run your code on 1, 2, 4, 8, 16 CPUs and obtain the wall-clock times T1, T2, ..., T16.
• Compute the parallel speedup factors Sp = T1/Tp, e.g. S8 = T1/T8 for 8 CPUs.
• Plot Sp vs. number of CPUs.
• Turn in:
  • source code + compiled binary code on either hamlet or radon;
  • table of wall-clock time vs. number of CPUs;
  • plot of parallel speedup factors;
  • write-up of what you have learned from the implementation and timing results.
• Due date: Oct. 11
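One possible way to time the multiplication section with MPI_Wtime() (a sketch, not a required approach; taking the slowest rank's time as Tp is an assumption):

double t0, t1, elapsed, tmax;

MPI_Barrier(MPI_COMM_WORLD);    // start timing together on all ranks
t0 = MPI_Wtime();
/* ... multiplication section of the code ... */
t1 = MPI_Wtime();
elapsed = t1 - t0;

// report the slowest rank's time as the wall-clock time Tp
MPI_Reduce(&elapsed, &tmax, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
if(my_rank == 0) printf("wall-clock time: %f s\n", tmax);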
Overview
• All processes in a group participate in the communication by calling the same function with matching arguments.
• Types of collective operations:
  • Synchronization: MPI_Barrier
  • Data movement: MPI_Bcast, MPI_Scatter, MPI_Gather, MPI_Allgather, MPI_Alltoall
  • Collective computation: MPI_Reduce, MPI_Allreduce, MPI_Scan
• Collective routines are blocking:
  • completion of the call means the communication buffer can be accessed;
  • it gives no indication of the completion status of other processes;
  • it may or may not have the effect of synchronizing the processes.
Overview
• The same communicators as for point-to-point communication can be used.
  • MPI guarantees that messages from collective communications will not be confused with point-to-point messages.
• The key is the group of processes taking part in the communication.
  • If only a sub-group of processes should be involved in a collective communication, create a sub-group/sub-communicator from MPI_COMM_WORLD (see the sketch below).
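The slides do not show how such a sub-communicator is built; one common way (an assumption here, not covered above) is MPI_Comm_split:

// Split MPI_COMM_WORLD into two sub-communicators: even ranks and odd ranks.
int world_rank;
MPI_Comm sub_comm;

MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
MPI_Comm_split(MPI_COMM_WORLD,
               world_rank % 2,   // color: processes with the same color share a communicator
               world_rank,       // key: rank ordering within the new communicator
               &sub_comm);

// Collective calls on sub_comm now involve only the processes with the same color.
MPI_Barrier(sub_comm);
MPI_Comm_free(&sub_comm);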
Barrier

int MPI_Barrier(MPI_Comm comm)

MPI_BARRIER(COMM, IERROR)
  INTEGER COMM, IERROR

• Blocks the calling process until all group members have called it.
• Decreases performance; refrain from using it explicitly.

...
MPI_Barrier(MPI_COMM_WORLD);   // synchronization point
...
Broadcast

int MPI_Bcast(void *buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm)

MPI_BCAST(BUFFER, COUNT, DATATYPE, ROOT, COMM, IERROR)
  <type> BUFFER(*)
  INTEGER COUNT, DATATYPE, ROOT, COMM, IERROR

• Broadcasts a message from the process with rank root to all processes in the group, including itself.
• comm and root must be the same on all processes.
• The amount of data sent must equal the amount of data received, pairwise between each process and the root.
  • For now this means count and datatype must be the same for all processes; they may differ when generalized datatypes are involved.
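A minimal usage sketch (illustrative values, not from the slides): the root reads or computes a parameter and broadcasts it to every process.

int nsteps;                // parameter known only to the root initially
int root = 0, my_rank;

MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
if(my_rank == root)
  nsteps = 1000;           // e.g. read from an input file on the root
MPI_Bcast(&nsteps, 1, MPI_INT, root, MPI_COMM_WORLD);
// After the call every process in MPI_COMM_WORLD has nsteps == 1000.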