Performance Oriented MPI

Jeffrey M. Squyres • Andrew Lumsdaine • NERSC/LBNL and U. Notre Dame

Overview • Overview and History of MPI • Performance Oriented Point to Point • Collectives, Data Types • Diagnostics and Tuning • Rules of Thumb and Gotchas

Scope of This Talk • Beginning to intermediate user • General principles and rules of thumb • When and where performance might be available • Omit (advanced) low-level issues

Overview and History of MPI • Library (not language) specification • Goals: portability; efficiency; functionality (small and large); safety (communicators); conservative (current best practices)

Performance in MPI • MPI includes many performance-oriented features • These features are only potentially high-performance • The standard seeks not to preclude performance, but it does not mandate it • Progress might only be made during MPI function calls

(Potential) Performance Features • Non-blocking operations • Persistent operations • Collective operations • MPI Datatypes

Basic Point to Point • “Six function MPI” includes • MPI_Send() • MPI_Recv() • These are useful, but there is more

Basic Point to Point
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
if (rank == 0) {
    MPI_Send(&work, 1, MPI_INT, dest, TAG, MPI_COMM_WORLD);
} else {
    MPI_Recv(&result, 1, MPI_INT, src, TAG, MPI_COMM_WORLD, &status);
}

Non-Blocking Operations • MPI_Isend() • MPI_Irecv() • “I” is for immediate • Paired with MPI_Test()/MPI_Wait()

Non-Blocking Operations
/* Assumes exactly two processes (ranks 0 and 1) */
MPI_Comm_rank(comm, &rank);
if (rank == 0) {
    MPI_Isend(sendbuf, count, MPI_REAL, 1, tag, comm, &request);
    /* Do some computation (must not modify sendbuf) */
    MPI_Wait(&request, &status);
} else {
    MPI_Irecv(recvbuf, count, MPI_REAL, 0, tag, comm, &request);
    /* Do some computation (must not read recvbuf) */
    MPI_Wait(&request, &status);
}

Persistent Operations • MPI_Send_init() • MPI_Recv_init() • Each creates a request but does not start it • MPI_Start() begins the communication • A single request can be re-used with multiple calls to MPI_Start()

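Persistent Operations
/* Assumes exactly two processes (ranks 0 and 1) */
MPI_Comm_rank(comm, &rank);
if (rank == 0)
    MPI_Send_init(sndbuf, count, MPI_REAL, 1, tag, comm, &request);
else
    MPI_Recv_init(rcvbuf, count, MPI_REAL, 0, tag, comm, &request);
/* … */
for (i = 0; i < n; i++) {
    MPI_Start(&request);          /* Begin this iteration's transfer */
    /* Do some work */
    MPI_Wait(&request, &status);  /* Request becomes inactive, not freed */
}
MPI_Request_free(&request);       /* Release the persistent request */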

Collective Operations • May be layered on point to point • May use tree communication patterns for efficiency • Synchronization! (No non-blocking collectives before MPI-3)

Collective Operations
MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, comm);
[Figure: linear reduction, O(P) steps, versus tree reduction, O(log P) steps]
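The O(P) versus O(log P) contrast can be checked with a small stand-alone model. This is only a sketch: it counts communication rounds and simulates a binomial-tree sum on an ordinary array, with one array element standing in for each process. The function names are hypothetical, and a real MPI_Reduce() implementation is free to use a different algorithm.

```c
#include <assert.h>
#include <stddef.h>

/* Rounds needed when the root receives from each of the other
 * P - 1 processes one after another (linear reduction). */
int linear_reduce_rounds(int nprocs)
{
    assert(nprocs >= 1);
    return nprocs - 1;
}

/* Rounds needed when partial results are combined pairwise in a
 * binomial tree: the distance between partners doubles each round,
 * giving ceil(log2(P)) rounds. */
int tree_reduce_rounds(int nprocs)
{
    int rounds = 0;
    assert(nprocs >= 1);
    for (int span = 1; span < nprocs; span *= 2)
        rounds++;
    return rounds;
}

/* Simulate a tree reduction (sum) over one value per "process".
 * In each round, every rank i that is a multiple of 2 * span adds
 * the value held by rank i + span (if that rank exists).
 * Rank 0 ends up with the total. The values array is overwritten. */
double tree_reduce_sum(double *values, int nprocs)
{
    assert(values != NULL && nprocs >= 1);
    for (int span = 1; span < nprocs; span *= 2)
        for (int i = 0; i + span < nprocs; i += 2 * span)
            values[i] += values[i + span];
    return values[0];
}
```

For 8 processes the model gives 7 rounds for the linear scheme and 3 for the tree; both produce the same sum.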

MPI Datatypes • May allow MPI to send a message directly from memory • May avoid copying/packing • (General) high performance implementations not widely available • [Figure labels: network, copy]

Quiz: MPI_Send() • After I call MPI_Send(): • The recipient has received the message? • I have sent the message? • I can write to the message buffer without corrupting the message? • Answer: I can write to the message buffer without corrupting the message

Sidenote: MPI_Ssend() • MPI_Ssend() has the (perhaps) expected semantics • When MPI_Ssend() returns, the recipient has posted a matching receive and has started to receive the message • Useful for debugging (replace MPI_Send() with MPI_Ssend())

Quiz: MPI_Isend() • After I call MPI_Isend() • The recipient has started to receive the message • I have started to send the message • I can write to the message buffer without corrupting the message • None of the above (I must call MPI_Test() or MPI_Wait())

Quiz: MPI_Isend() • True or False • I can overlap communication and computation by putting some computation between MPI_Isend() and MPI_Test()/MPI_Wait() • False (in many/most cases)

Communication is Still Computation • A CPU, usually the main one, must do the communication work • Part of your process (inside MPI calls) • Another process on main CPU • Another thread on main CPU • Another processor

No Free Lunch • Part of your process (most common): fast but no overlap • Another process (daemons): overlap, but slow (extra copies) • Another thread (rare): overlap and fast, but difficult • Another processor (emerging): overlap and fast, but more hardware (e.g., Myrinet/GM, VIA)

How Do I Get Performance? • Minimize time spent communicating • Minimize data copies • Minimize synchronization (i.e., time waiting for communication)

Minimizing Communication Time • Bandwidth • Latency

Minimizing Latency • Collect small messages together (if you can), e.g., one 1024-byte message instead of 1024 one-byte messages • Minimize other overhead (e.g., copying) • Overlap with computation (if you can)
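A back-of-the-envelope cost model makes the small-message rule concrete. The sketch below assumes each message costs a fixed latency plus its size divided by the bandwidth. The function name transfer_time and the sample numbers in the note that follows are illustrative assumptions, not measurements of any particular network.

```c
#include <assert.h>

/* Simple cost model: every message pays a fixed latency (seconds)
 * plus its size divided by the bandwidth (bytes per second).
 * Returns the total time in seconds for n_messages messages of
 * bytes_per_message bytes each, sent one after another. */
double transfer_time(long n_messages, long bytes_per_message,
                     double latency_s, double bandwidth_Bps)
{
    assert(n_messages >= 0 && bytes_per_message >= 0);
    assert(latency_s >= 0.0 && bandwidth_Bps > 0.0);
    return (double)n_messages *
           (latency_s + (double)bytes_per_message / bandwidth_Bps);
}
```

With an assumed latency of 10 microseconds and an assumed bandwidth of 100 MB/s, transfer_time(1024, 1, 1e-5, 1e8) is roughly 10 milliseconds, while transfer_time(1, 1024, 1e-5, 1e8) is roughly 20 microseconds: the per-message latency dominates when messages are tiny.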

Example: Domain Decomposition

Naïve Approach
while (!done) {
    exchange(D, neighbors, myrank);
    dored(D);
    exchange(D, neighbors, myrank);
    doblack(D);
}
void exchange(Array D, int *neighbors, int myrank) {
    int i;
    for (i = 0; i < 4; i++)
        MPI_Send(…);    /* Blocking send to each neighbor */
    for (i = 0; i < 4; i++)
        MPI_Recv(…);    /* Blocking receive from each neighbor */
}

Naïve Approach • Deadlock! (Maybe) • Can fix with careful coordination of receiving versus sending on alternate processes • But this can still serialize

MPI_Sendrecv()
while (!done) {
    exchange(D, neighbors, myrank);
    dored(D);
    exchange(D, neighbors, myrank);
    doblack(D);
}
void exchange(Array D, int *neighbors, int myrank) {
    int i;
    for (i = 0; i < 4; i++) {
        MPI_Sendrecv(…);
    }
}

Immediate Operations
while (!done) {
    exchange(D, neighbors, myrank);
    dored(D);
    exchange(D, neighbors, myrank);
    doblack(D);
}
void exchange(Array D, int *neighbors, int myrank) {
    int i;
    for (i = 0; i < 4; i++) {
        MPI_Isend(…);
        MPI_Irecv(…);
    }
    MPI_Waitall(…);
}

Receive Before Sending
while (!done) {
    exchange(D, neighbors, myrank);
    dored(D);
    exchange(D, neighbors, myrank);
    doblack(D);
}
void exchange(Array D, int *neighbors, int myrank) {
    int i;
    for (i = 0; i < 4; i++)
        MPI_Irecv(…);   /* Post all receives first */
    for (i = 0; i < 4; i++)
        MPI_Isend(…);
    MPI_Waitall(…);
}

Persistent Operations
for (i = 0; i < 4; i++) {
    MPI_Recv_init(…);
    MPI_Send_init(…);
}
while (!done) {
    exchange(D, neighbors, myrank);
    dored(D);
    exchange(D, neighbors, myrank);
    doblack(D);
}
void exchange(Array D, int *neighbors, int myrank) {
    MPI_Startall(…);
    MPI_Waitall(…);
}

Overlapping
while (!done) {
    MPI_Startall(…);            /* Start exchanges */
    do_inner_red(D);            /* Internal computation */
    for (i = 0; i < 4; i++) {
        MPI_Waitany(…);         /* As information arrives */
        do_received_red(D);     /* Process */
    }
    MPI_Startall(…);
    do_inner_black(D);
    for (i = 0; i < 4; i++) {
        MPI_Waitany(…);
        do_received_black(D);
    }
}

Advanced Overlap
MPI_Startall(…);                /* Start all receives */
/* … */
while (!done) {
    MPI_Startall(…);            /* Start sends */
    do_inner_red(D);            /* Internal computation */
    for (i = 0; i < 4; i++) {
        MPI_Waitany(…);         /* Wait on receives */
        if (received) {
            do_received_red(D); /* Process */
            MPI_Start(…);       /* Restart receive */
        }
    }
    /* Repeat for black */
}
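The overlap loops split each color's update into an "inner" part, which needs no neighbor data, and a "received" part, which must wait for messages. How much work falls on each side can be estimated with a small stand-alone sketch. The assumptions here are not from the slides: an n x n local block, a 5-point stencil, "red" meaning (i + j) is even, and hypothetical function names.

```c
#include <assert.h>
#include <stddef.h>

/* Red-black coloring: a cell is "red" when (i + j) is even. */
int is_red(int i, int j)
{
    return (i + j) % 2 == 0;
}

/* In an n x n local block with a 5-point stencil, only cells on the
 * block's edge read values owned by a neighboring process. */
int needs_halo(int i, int j, int n)
{
    return i == 0 || j == 0 || i == n - 1 || j == n - 1;
}

/* Count red cells that can be updated while messages are in flight
 * (interior) and red cells that must wait for received data (edge). */
void count_red_cells(int n, int *interior, int *edge)
{
    assert(n >= 1 && interior != NULL && edge != NULL);
    *interior = 0;
    *edge = 0;
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            if (!is_red(i, j))
                continue;
            if (needs_halo(i, j, n))
                (*edge)++;
            else
                (*interior)++;
        }
    }
}
```

For an 8 x 8 block this gives 18 interior red cells and 14 edge red cells. The interior share grows with n, so larger local blocks leave more computation available to hide communication behind.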

MPI Data Types • MPI_Type_vector() • MPI_Type_struct() • Etc. • MPI_Pack() might be better • [Figure labels: network, copy]
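To see what "copying/packing" means for strided data, here is a minimal sketch of the explicit copy that a user would otherwise write by hand. The (count, blocklength, stride) layout mirrors the arguments of MPI_Type_vector(), but the function itself is plain C with a hypothetical name. It assumes non-overlapping blocks of doubles, which is narrower than what MPI_Type_vector() allows.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy `count` blocks of `blocklength` doubles, whose starts are
 * `stride` doubles apart in `src`, into the contiguous buffer `dst`.
 * This is the extra memory copy that a derived datatype may let the
 * MPI library avoid. Assumes stride >= blocklength (no overlap).
 * Returns the number of doubles written to dst. */
size_t pack_strided(const double *src, size_t count, size_t blocklength,
                    size_t stride, double *dst)
{
    assert(src != NULL && dst != NULL);
    assert(stride >= blocklength);
    for (size_t b = 0; b < count; b++)
        memcpy(dst + b * blocklength, src + b * stride,
               blocklength * sizeof(double));
    return count * blocklength;
}
```

For example, column j of a row-major matrix with nrows rows and ncols columns stored in a flat array a is pack_strided(a + j, nrows, 1, ncols, column). Whether a derived datatype or an explicit pack is faster depends on the MPI implementation, as the slide notes.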

Minimizing Synchronization • At a synchronization point (e.g., a collective communication), all processes must arrive at the collective call • Processes can spend lots of time waiting • This is often an algorithmic issue • E.g., check for convergence every 5 iterations instead of every iteration
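The convergence-check trade-off can be illustrated with a toy model. This sketch assumes a solver whose residual shrinks by a fixed factor per iteration. The struct, the function name, and the shrink rate are all hypothetical. In a real MPI code each check would be a collective call such as MPI_Allreduce().

```c
#include <assert.h>

/* Result of a simulated run: how many iterations executed and how
 * many global convergence checks (synchronization points) occurred. */
struct run_stats {
    int iterations;
    int checks;
};

/* Simulate an iterative solver whose residual starts at 1.0 and is
 * multiplied by `rate` (0 < rate < 1) each iteration. Convergence
 * (residual < tol) is only tested every `interval` iterations. */
struct run_stats simulate_solver(double rate, double tol, int interval)
{
    struct run_stats s = {0, 0};
    double residual = 1.0;
    assert(rate > 0.0 && rate < 1.0 && tol > 0.0 && interval >= 1);
    for (;;) {
        residual *= rate;              /* one local iteration */
        s.iterations++;
        if (s.iterations % interval == 0) {
            s.checks++;                /* synchronization point */
            if (residual < tol)
                break;
        }
    }
    return s;
}
```

With rate 0.5 and tolerance 1e-3, checking every iteration takes 10 iterations and 10 checks; checking every 5 iterations takes the same 10 iterations with only 2 checks; checking every 4 takes 12 iterations and 3 checks. A few extra iterations are traded for far fewer synchronization points.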

Gotchas • MPI_Probe(): guarantees an extra memory copy • MPI_ANY_SOURCE: can cause additional (internal) looping • MPI_Alltoall(): all pairs must communicate • Synchronization (avoid in general)

Diagnostic Tools • TotalView • Prism • Upshot • XMPI

Summary • Receive before sending • Collect small messages together • Overlap (if possible) • Use immediate operations • Use persistent operations • Use diagnostic tools
