Porting MPI Programs to the IBM Cluster 1600

Peter Towers
March 2004

Topics
• The current hardware switch
• Parallel Environment (PE)
• Issues with Standard Sends/Receives
• Use of non blocking communications
• Debugging MPI programs
• MPI tracing
• Profiling MPI programs
• Tasks per Node
• Communications Optimisation
• The new hardware switch
• Third Practical

The current hardware switch
• Designed for a previous generation of IBM hardware
• Referred to as the Colony switch
• 2 switch adaptors per logical node
  • 8 processors share 2 adaptors
  • called a dual plane switch
• Adaptors are multiplexed
  • software stripes large messages across both adaptors
• Minimum latency 21 microseconds
• Maximum bandwidth approx 350 MBytes/s
  • about 45 MB/s per task when all going off node together

Parallel Environment (PE)
• MPI programs are managed by the IBM PE
• IBM documentation refers to PE and POE
  • POE stands for Parallel Operating Environment
  • many environment variables to tune the parallel environment
  • talks about launching parallel jobs interactively
• ECMWF uses LoadLeveler for batch jobs
  • PE usage becomes almost transparent

Issues with Standard Sends/Receives
• The MPI standard can be implemented in different ways
• Programs may not be fully portable across platforms
• Standard Sends and Receives can cause problems
  • Potential for deadlocks
  • need to understand Blocking v Non Blocking communications
  • need to understand Eager versus Rendezvous protocols
• IFS had to be modified to run on IBM

Blocking Communications
• MPI_Send is a blocking routine
• It returns when it is safe to re-use the buffer being sent
  • the send buffer can then be overwritten
• The MPI layer may have copied the data elsewhere
  • using internal buffer/mailbox space
  • the message is then in transit but not yet received
  • this is called an “eager” protocol
  • good for short messages
• The MPI layer may have waited for the receiver
  • the data is copied from send to receive buffer directly
  • lower overhead transfer
  • this is called a “rendezvous” protocol
  • good for large messages

MPI_Send on the IBM
• Uses the “Eager” protocol for short messages
  • By default short means up to 4096 bytes
  • the higher the task count, the lower the value
• Uses the “Rendezvous” protocol for long messages
• Potential for send/send deadlocks
  • tasks block in mpi_send

    if (me .eq. 0) then
      him=1
    else
      him=0
    endif
    call mpi_send(sbuff,n,MPI_REAL8,him,tag,MPI_COMM_WORLD,ierror)
    call mpi_recv(rbuff,n,MPI_REAL8,him,tag,MPI_COMM_WORLD,stat,ierror)
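The send/send exchange above only hangs once the message is too long for the eager protocol. The following Python sketch is a toy model of that rule, not an MPI program. The function names are hypothetical, the 4096 byte default is the figure quoted on the slide, and treating the limit as inclusive and ignoring any message header are assumptions.

```python
# Toy model of the send/send exchange on the slide; no real MPI calls are made.
REAL8_BYTES = 8              # size of one MPI_REAL8 element
DEFAULT_EAGER_LIMIT = 4096   # bytes; default quoted on the slide


def uses_eager(n_elements, eager_limit=DEFAULT_EAGER_LIMIT):
    """True if n_elements REAL8 values fit within the eager limit (assumed inclusive)."""
    return n_elements * REAL8_BYTES <= eager_limit


def send_send_deadlocks(n_elements, eager_limit=DEFAULT_EAGER_LIMIT):
    """Both tasks call mpi_send before mpi_recv.

    Eager: each send is buffered and returns, so both receives are reached.
    Rendezvous: each send waits for a receive that is never posted.
    """
    return not uses_eager(n_elements, eager_limit)


for n in (512, 513, 8192):
    outcome = "deadlock" if send_send_deadlocks(n) else "completes"
    print(n, "elements:", outcome)
```

Under these assumptions a code that runs with small test arrays can still hang at production sizes, or at higher task counts where the default limit is lower.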

Solutions to Send/Send deadlocks
• Pair up sends and receives
• use MPI_SENDRECV
• use a buffered send
  • MPI_BSEND
• use asynchronous sends/receives
  • MPI_ISEND/MPI_IRECV

Paired Sends and Receives
• More complex code
• Requires close synchronisation

    if (me .eq. 0) then
      him=1
      call mpi_send(sbuff,n,MPI_REAL8,him,tag,MPI_COMM_WORLD,ierror)
      call mpi_recv(rbuff,n,MPI_REAL8,him,tag,MPI_COMM_WORLD,stat,ierror)
    else
      him=0
      call mpi_recv(rbuff,n,MPI_REAL8,him,tag,MPI_COMM_WORLD,stat,ierror)
      call mpi_send(sbuff,n,MPI_REAL8,him,tag,MPI_COMM_WORLD,ierror)
    endif

MPI_SENDRECV
• Easier to code
• Still implies close synchronisation

    call mpi_sendrecv(sbuff,n,MPI_REAL8,him,1, &
                      rbuff,n,MPI_REAL8,him,1, &
                      MPI_COMM_WORLD,stat,ierror)

MPI_BSEND
• This performs a send using an additional buffer
  • the buffer is allocated by the program via MPI_BUFFER_ATTACH
  • done once as part of the program initialisation
• Typically quick to implement
  • add the mpi_buffer_attach call
  • how big to make the buffer?
  • change MPI_SEND to MPI_BSEND everywhere
• But introduces additional memory copy
  • extra overhead
  • not recommended for production codes
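On the question of buffer size, the MPI standard asks for room for every buffered message that can be outstanding at once, plus MPI_BSEND_OVERHEAD bytes per message. The Python sketch below only does that arithmetic. The function name is hypothetical, the overhead value is a placeholder because the real constant is implementation-defined, and the payload is taken as the raw REAL8 size rather than the result of MPI_PACK_SIZE.

```python
REAL8_BYTES = 8
# Placeholder only: the real MPI_BSEND_OVERHEAD is defined by the MPI library.
ASSUMED_BSEND_OVERHEAD = 64


def bsend_buffer_bytes(message_lengths, overhead=ASSUMED_BSEND_OVERHEAD):
    """Bytes to attach so that every listed message can be buffered at once.

    message_lengths: element counts (REAL8) of the buffered sends that may be
    outstanding at the same time.
    """
    return sum(n * REAL8_BYTES + overhead for n in message_lengths)


# Example: up to four 10000-element sends in flight at the same time.
print(bsend_buffer_bytes([10000] * 4))
```

The hard part in a real code is knowing how many sends can be outstanding together, which is one reason the buffer tends to be oversized.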

MPI_IRECV / MPI_ISEND
• Uses Non Blocking Communications
• Routines return without completing the operation
  • the operations run asynchronously
  • Must NOT reuse the buffer until safe to do so
• Later test that the operation completed
  • via an integer identification handle passed to MPI_WAIT
• I stands for immediate
  • the call returns immediately

    call mpi_irecv(rbuff,n,MPI_REAL8,him,1,MPI_COMM_WORLD,request,ierror)
    call mpi_send (sbuff,n,MPI_REAL8,him,1,MPI_COMM_WORLD,ierror)
    call mpi_wait(request,stat,ierror)

Alternatively could have used MPI_ISEND and MPI_RECV
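To see why the ordering of calls matters, the exchange patterns from the preceding slides can be replayed in a toy two-task simulator. This is a Python sketch and not MPI. It assumes every send uses the rendezvous protocol, so a send completes only once the partner has posted a receive, and it handles one message in each direction. The function name and the operation names are hypothetical.

```python
def exchange_completes(ops_task0, ops_task1):
    """Replay two tasks' call sequences under rendezvous-only sends.

    Each sequence is a list of "send", "recv", "irecv" and "wait".
    Returns True if both tasks finish, False if they block forever.
    """
    ops = (ops_task0, ops_task1)
    pc = [0, 0]                # index of the next call for each task
    posted = [False, False]    # a receive is posted and still empty
    filled = [False, False]    # a message has arrived for this task
    progress = True
    while progress:
        progress = False
        for me in (0, 1):
            him = 1 - me
            if pc[me] == len(ops[me]):
                continue       # this task has finished
            op = ops[me][pc[me]]
            if op == "irecv":
                posted[me] = True          # returns immediately
                done = True
            elif op == "send":
                done = posted[him]         # needs a posted receive
                if done:
                    posted[him], filled[him] = False, True
            elif op in ("recv", "wait"):
                done = filled[me]
                if done:
                    filled[me] = False
                elif op == "recv" and not posted[me]:
                    posted[me] = True      # blocking receive is now posted
                    progress = True
            else:
                raise ValueError(op)
            if done:
                pc[me] += 1
                progress = True
    return pc[0] == len(ops[0]) and pc[1] == len(ops[1])


print(exchange_completes(["send", "recv"], ["send", "recv"]))   # send/send
print(exchange_completes(["send", "recv"], ["recv", "send"]))   # paired
print(exchange_completes(["irecv", "send", "wait"],
                         ["irecv", "send", "wait"]))            # irecv first
```

In this model only the send/send ordering fails to complete. Posting the receive first lets both tasks use identical code, which the paired version cannot.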

Non Blocking Communications
• Routines include
  • MPI_ISEND
  • MPI_IRECV
  • MPI_WAIT
  • MPI_WAITALL

Debugging MPI Programs
• The Universal Debug Tool, and
• Totalview

The Universal Debug Tool
• The Print/Write Statement
• Recommend the use of call flush(unit_number)
  • ensures output is not left in runtime buffers
• Recommend the use of separate output files, e.g.:

    unit_number=100+mytask
    write(unit_number,*) ......
    call flush(unit_number)

• Or set the environment variable MP_LABELIO=yes
• Do not output too much
• Use as few processors as possible
• Think carefully.....
• Discuss the problem with a colleague

Totalview
• Assumes you can launch X-Windows remotely
• Run totalview as part of a LoadLeveler job

    export DISPLAY=.......
    poe totalview -a a.out <arguments>

• But you have to wait for the job to run.....
• Use only a few processors
  • minimises the queuing time
  • minimises the waste of resource while thinking....

MPI Trace Tools
• Identify message passing hot spots
• Just link with /usr/local/lib/trace/libmpiprof.a
  • low overhead timer for all mpi routine calls
• Produces output files named mpi_profile.N
  • where N is the task number
• Examples of the output follow


Profiling MPI programs
• The same as for serial codes
• Use the -pg flag at compile and/or link time
• Produces multiple gmon.out.N files
  • N is the task number
  • gprof a.out gmon.out.*
• The routine .kickpipes often appears high up the profile
  • an internal mpi library routine
  • where the mpi library spins waiting for something
  • e.g. a message to be sent
  • or in a barrier

Tasks Per Node (1 of 2)
• Try both 7 and 8 tasks per node for multi node jobs
• 7 tasks may run faster than 8
  • depends on the frequency of communications
• 7 tasks leaves a processor spare
  • used by the OS and background daemons such as for GPFS
  • mpi tasks run with minimal scheduling interference
• 8 tasks are subject to scheduling interference
  • by default mpi tasks cpu spin in kickpipes
  • they may spin waiting for a task that has been scheduled out
  • the OS has to schedule cpu time for background processes
  • random interference across nodes is cumulative

Tasks Per Node (2 of 2)
• Also try 8 tasks per node and MP_WAIT_MODE=sleep
  • export MP_WAIT_MODE=sleep
  • tasks give up the cpu instead of spinning
  • increases latency but reduces interference
  • effect varies from application to application
• Mixed mode MPI/OpenMP works well
  • master OpenMP thread does the message passing
  • while slave OpenMP threads go to sleep
  • cpu cycles are freed up for background processes
  • used by the IFS to good effect
  • 2 tasks each of 4 threads per node
  • suspect success depends on the parallel granularity

Communications Optimisation
• Communications costs often impact parallel speedup
• Concatenate messages
  • fewer larger messages are better
  • reduces the effect of latency
• Increase MP_EAGER_LIMIT
  • export MP_EAGER_LIMIT=65536
  • maximum size for messages sent with the “eager” protocol
• Use collective routines
• Use ISEND/IRECV
• Remove barriers
• Experiment with tasks per node
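The gain from concatenating messages can be estimated with the usual latency plus size over bandwidth model. The Python sketch below uses the Colony figures from the earlier slide (21 microseconds, approx 350 MBytes/s). The function name is hypothetical, and the linear cost model and the reading of 1 MByte as 10^6 bytes are assumptions. A real task sharing the switch would see less bandwidth.

```python
LATENCY_S = 21e-6        # Colony minimum latency quoted in the slides
BANDWIDTH_BPS = 350e6    # Colony maximum bandwidth, assuming 1 MByte = 1e6 bytes


def transfer_time(total_bytes, n_messages,
                  latency=LATENCY_S, bandwidth=BANDWIDTH_BPS):
    """Estimated seconds to move total_bytes split into n_messages messages."""
    return n_messages * latency + total_bytes / bandwidth


total = 100 * 800                      # one hundred 800-byte messages
separate = transfer_time(total, 100)   # pays the latency 100 times
combined = transfer_time(total, 1)     # pays the latency once
print(f"100 messages: {separate * 1e6:.0f} us")
print(f"1 message:    {combined * 1e6:.0f} us")
```

With these inputs the latency term dominates the separate sends, so the single concatenated message is roughly nine times cheaper in this model.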

The new hardware switch
• Designed for the Cluster 1600
• Referred to as the Federation switch
• 2 switch adaptors per physical node
  • 2 links each 2GB/s per adaptor
  • 32 processors share 4 links
• Adaptors/links are NOT multiplexed
• Minimum latency 10 microseconds
• Maximum bandwidth approx 2000 MByte/s
  • about 250 MB/s per task when all going off node together
• Up to 5 times better performance
• 32 processor nodes
  • will affect how we schedule and run jobs
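The per-task figures on the two switch slides are consistent with dividing the node's off-node bandwidth evenly among its tasks. The Python sketch below reproduces that arithmetic. The function name is hypothetical, and treating the Colony's striped adaptor pair as a single 350 MB/s path and the Federation as four links of approx 2000 MB/s is one reading of the slides, not a stated fact.

```python
def per_task_bandwidth(link_mb_s, n_links, tasks_per_node):
    """Off-node bandwidth per task (MB/s) if all tasks share the links evenly."""
    return link_mb_s * n_links / tasks_per_node


# Colony: striped pair of adaptors treated as one 350 MB/s path, 8 tasks.
colony = per_task_bandwidth(350, 1, 8)
# Federation: 4 links of approx 2000 MB/s shared by 32 processors.
federation = per_task_bandwidth(2000, 4, 32)

print(f"Colony:     {colony:.2f} MB/s per task")
print(f"Federation: {federation:.2f} MB/s per task")
print(f"Ratio:      {federation / colony:.2f}")
```

Under this reading the results round to the slides' "about 45" and "about 250" MB/s per task, and the ratio is in line with the quoted improvement of around 5 times.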

Third Practical
• Contained in the directory /home/ectrain/trx/mpi/exercise3 on hpca
• Parallelising the computation of PI
• See the README for details
