
Porting MPI Programs to the IBM Cluster 1600




  1. Porting MPI Programs to the IBM Cluster 1600
     Peter Towers
     March 2004

  2. Topics
  • The current hardware switch
  • Parallel Environment (PE)
  • Issues with Standard Sends/Receives
  • Use of non blocking communications
  • Debugging MPI programs
  • MPI tracing
  • Profiling MPI programs
  • Tasks per Node
  • Communications Optimisation
  • The new hardware switch
  • Third Practical

  3. The current hardware switch
  • Designed for a previous generation of IBM hardware
      • referred to as the Colony switch
  • 2 switch adaptors per logical node
      • 8 processors share 2 adaptors
      • called a dual plane switch
  • Adaptors are multiplexed
      • software stripes large messages across both adaptors
  • Minimum latency 21 microseconds
  • Maximum bandwidth approx 350 MBytes/s
      • about 45 MB/s per task when all going off node together

  4. Parallel Environment (PE)
  • MPI programs are managed by the IBM PE
  • IBM documentation refers to PE and POE
      • POE stands for Parallel Operating Environment
      • many environment variables to tune the parallel environment
      • talks about launching parallel jobs interactively
  • ECMWF uses LoadLeveler for batch jobs
      • PE usage becomes almost transparent
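
Since ECMWF runs MPI programs through LoadLeveler rather than interactive POE, the batch script below is a minimal sketch of how such a job might look; the class name, geometry and limits are illustrative placeholders and the keywords required at a given site may differ.

      #!/bin/ksh
      # Minimal LoadLeveler sketch for an MPI executable run under POE.
      # Class name, node counts and limits are placeholders.
      #@ job_type         = parallel
      #@ class            = np
      #@ node             = 2
      #@ tasks_per_node   = 8
      #@ wall_clock_limit = 00:10:00
      #@ output           = myjob.$(jobid).out
      #@ error            = myjob.$(jobid).err
      #@ queue

      export MP_LABELIO=yes        # prefix each line of output with its task number
      poe ./a.out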

  5. Issues with Standard Sends/Receives
  • The MPI standard can be implemented in different ways
      • programs may not be fully portable across platforms
  • Standard Sends and Receives can cause problems
      • potential for deadlocks
      • need to understand Blocking versus Non-Blocking communications
      • need to understand Eager versus Rendezvous protocols
  • IFS had to be modified to run on the IBM

  6. Blocking Communications
  • MPI_Send is a blocking routine
  • It returns when it is safe to re-use the buffer being sent
      • the send buffer can then be overwritten
  • The MPI layer may have copied the data elsewhere
      • using internal buffer/mailbox space
      • the message is then in transit but not yet received
      • this is called an “eager” protocol
      • good for short messages
  • The MPI layer may have waited for the receiver
      • the data is copied from send to receive buffer directly
      • lower overhead transfer
      • this is called a “rendezvous” protocol
      • good for large messages

  7. MPI_Send on the IBM
  • Uses the “Eager” protocol for short messages
      • by default short means up to 4096 bytes
      • the higher the task count, the lower the value
  • Uses the “Rendezvous” protocol for long messages
  • Potential for send/send deadlocks
      • tasks block in mpi_send

      if (me .eq. 0) then
         him = 1
      else
         him = 0
      endif
!     Both tasks call mpi_send first: under the rendezvous protocol neither
!     send returns until the matching receive is posted, so both tasks block.
      call mpi_send(sbuff,n,MPI_REAL8,him,tag,MPI_COMM_WORLD,ierror)
      call mpi_recv(rbuff,n,MPI_REAL8,him,tag,MPI_COMM_WORLD,stat,ierror)
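
For completeness, a self-contained sketch of the same pattern is shown below. It is illustrative only (the program name and the message length n are assumptions), and with n large enough to force the rendezvous protocol the program will hang when run on 2 tasks.

      program sendsend
      implicit none
      include 'mpif.h'
      integer, parameter :: n = 100000     ! large enough to force the rendezvous protocol
      real(kind=8) :: sbuff(n), rbuff(n)
      integer :: me, him, tag, ierror
      integer :: stat(MPI_STATUS_SIZE)

      call mpi_init(ierror)
      call mpi_comm_rank(MPI_COMM_WORLD, me, ierror)
      tag = 1
      if (me .eq. 0) then
         him = 1
      else
         him = 0
      endif
      sbuff = dble(me)

!     Both tasks enter mpi_send first: neither send can complete until the
!     matching receive is posted, so the program deadlocks here.
      call mpi_send(sbuff, n, MPI_REAL8, him, tag, MPI_COMM_WORLD, ierror)
      call mpi_recv(rbuff, n, MPI_REAL8, him, tag, MPI_COMM_WORLD, stat, ierror)

      call mpi_finalize(ierror)
      end program sendsend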

  8. Solutions to Send/Send deadlocks
  • Pair up sends and receives
  • use MPI_SENDRECV
  • use a buffered send
      • MPI_BSEND
  • use asynchronous sends/receives
      • MPI_ISEND/MPI_IRECV

  9. Paired Sends and Receives
  • More complex code
  • Requires close synchronisation

      if (me .eq. 0) then
         him = 1
!        task 0 sends first, then receives
         call mpi_send(sbuff,n,MPI_REAL8,him,tag,MPI_COMM_WORLD,ierror)
         call mpi_recv(rbuff,n,MPI_REAL8,him,tag,MPI_COMM_WORLD,stat,ierror)
      else
         him = 0
!        task 1 receives first, then sends, so the calls match up
         call mpi_recv(rbuff,n,MPI_REAL8,him,tag,MPI_COMM_WORLD,stat,ierror)
         call mpi_send(sbuff,n,MPI_REAL8,him,tag,MPI_COMM_WORLD,ierror)
      endif

  10. MPI_SENDRECV
  • Easier to code
  • Still implies close synchronisation

      call mpi_sendrecv(sbuff,n,MPI_REAL8,him,1, &
                        rbuff,n,MPI_REAL8,him,1, &
                        MPI_COMM_WORLD,stat,ierror)

  11. MPI_BSEND
  • This performs a send using an additional buffer
      • the buffer is allocated by the program via MPI_BUFFER_ATTACH
      • done once as part of the program initialisation
  • Typically quick to implement
      • add the mpi_buffer_attach call
      • how big to make the buffer?
      • change MPI_SEND to MPI_BSEND everywhere
  • But introduces an additional memory copy
      • extra overhead
      • not recommended for production codes
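
A minimal sketch of the buffered-send approach is shown below, assuming exactly two tasks exchanging n 8-byte reals; the program and variable names are illustrative.

      program bsend_demo
      implicit none
      include 'mpif.h'
      integer, parameter :: n = 10000
!     The attach buffer is sized in bytes: room for one message plus the
!     bookkeeping overhead that each buffered send requires.
      integer, parameter :: bufsize = 8*n + MPI_BSEND_OVERHEAD
      character    :: attach_buf(bufsize)
      real(kind=8) :: sbuff(n), rbuff(n)
      integer      :: me, him, ierror, detach_size
      integer      :: stat(MPI_STATUS_SIZE)

      call mpi_init(ierror)
      call mpi_comm_rank(MPI_COMM_WORLD, me, ierror)
      him = 1 - me                 ! partner task, assuming exactly 2 tasks
      sbuff = dble(me)

!     Done once as part of the program initialisation
      call mpi_buffer_attach(attach_buf, bufsize, ierror)

!     mpi_bsend copies the message into the attached buffer and returns,
!     so the send/send ordering can no longer deadlock.
      call mpi_bsend(sbuff, n, MPI_REAL8, him, 1, MPI_COMM_WORLD, ierror)
      call mpi_recv (rbuff, n, MPI_REAL8, him, 1, MPI_COMM_WORLD, stat, ierror)

      call mpi_buffer_detach(attach_buf, detach_size, ierror)
      call mpi_finalize(ierror)
      end program bsend_demo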

  12. MPI_IRECV / MPI_ISEND
  • Uses Non Blocking Communications
  • Routines return without completing the operation
      • the operations run asynchronously
  • Must NOT reuse the buffer until safe to do so
  • Later test that the operation completed
      • via an integer identification handle passed to MPI_WAIT
  • I stands for immediate
      • the call returns immediately

!     Post the receive first, then send; mpi_wait blocks until the
!     receive has completed and rbuff is safe to use.
      call mpi_irecv(rbuff,n,MPI_REAL8,him,1,MPI_COMM_WORLD,request,ierror)
      call mpi_send (sbuff,n,MPI_REAL8,him,1,MPI_COMM_WORLD,ierror)
      call mpi_wait (request,stat,ierror)

  Alternatively MPI_ISEND and MPI_RECV could have been used.

  13. Non Blocking Communications
  • Routines include
      • MPI_ISEND
      • MPI_IRECV
      • MPI_WAIT
      • MPI_WAITALL
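
As an illustration of MPI_WAITALL, the fragment below (buffer declarations assumed, as in the earlier examples) posts both sides of an exchange and then waits for the pair of operations to complete before the buffers are reused.

      integer :: requests(2), statuses(MPI_STATUS_SIZE,2)

!     Post the receive and the send without blocking
      call mpi_irecv(rbuff, n, MPI_REAL8, him, 1, MPI_COMM_WORLD, requests(1), ierror)
      call mpi_isend(sbuff, n, MPI_REAL8, him, 1, MPI_COMM_WORLD, requests(2), ierror)

!     ... useful work that touches neither sbuff nor rbuff ...

!     Wait for both operations before reusing either buffer
      call mpi_waitall(2, requests, statuses, ierror)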

  14. Debugging MPI Programs
  • The Universal Debug Tool
  • Totalview

  15. The Universal Debug Tool
  • The Print/Write Statement
  • Recommend the use of call flush(unit_number)
      • ensures output is not left in runtime buffers
  • Recommend the use of separate output files, eg:

      unit_number = 100 + mytask
      write(unit_number,*) ......
      call flush(unit_number)

  • Or set the environment variable MP_LABELIO=yes
  • Do not output too much
  • Use as few processors as possible
  • Think carefully.....
  • Discuss the problem with a colleague
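
A minimal sketch of the separate-output-files idea, assuming the task number is obtained from mpi_comm_rank; the file name format and unit numbering are illustrative.

      integer :: mytask, unit_number, ierror
      character(len=16) :: fname

      call mpi_comm_rank(MPI_COMM_WORLD, mytask, ierror)
      unit_number = 100 + mytask
      write(fname,'(a,i3.3)') 'debug.', mytask      ! debug.000, debug.001, ...
      open(unit_number, file=fname)

      write(unit_number,*) 'task', mytask, 'reached checkpoint A'
      call flush(unit_number)                       ! flush extension, as recommended above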

  16. Totalview
  • Assumes you can launch X-Windows remotely
  • Run Totalview as part of a LoadLeveler job

      export DISPLAY=.......
      poe totalview -a a.out <arguments>

  • But you have to wait for the job to run.....
  • Use only a few processors
      • minimises the queuing time
      • minimises the waste of resource while thinking....

  17. MPI Trace Tools
  • Identify message passing hot spots
  • Just link with /usr/local/lib/trace/libmpiprof.a
      • low overhead timer for all mpi routine calls
  • Produces output files named mpi_profile.N
      • where N is the task number
  • Examples of the output follow
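
As an example, linking the trace library only needs the archive added to the link line; the compiler wrapper and file names below are placeholders.

      mpxlf90 -o myprog myprog.f /usr/local/lib/trace/libmpiprof.a
      # after the run, one mpi_profile.N file per task is written
      ls mpi_profile.*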

  18. (Screenshot in the original slides: example mpi_profile output for a task)

  19. (Screenshot in the original slides: example mpi_profile output, continued)

  20. Profiling MPI programs
  • The same as for serial codes
  • Use the -pg flag at compile and/or link time
  • Produces multiple gmon.out.N files
      • N is the task number
  • gprof a.out gmon.out.*
  • The routine .kickpipes often appears high up the profile
      • an internal mpi library routine
      • where the mpi library spins waiting for something
          • eg a message to be sent
          • or in a barrier
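
A sketch of the workflow, with placeholder compiler wrapper and file names:

      mpxlf90 -pg -o a.out myprog.f      # compile and link with profiling enabled
      # ... run the parallel job as usual; each task writes a gmon.out.N file ...
      gprof a.out gmon.out.*             # produce a combined profile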

  21. Tasks Per Node ( 1 of 2 )
  • Try both 7 and 8 tasks per node for multi node jobs
  • 7 tasks may run faster than 8
      • depends on the frequency of communications
  • 7 tasks leaves a processor spare
      • used by the OS and background daemons such as those for GPFS
      • mpi tasks run with minimal scheduling interference
  • 8 tasks are subject to scheduling interference
      • by default mpi tasks cpu spin in kickpipes
      • they may spin waiting for a task that has been scheduled out
      • the OS has to schedule cpu time for background processes
      • random interference across nodes is cumulative
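
In a LoadLeveler script the comparison is just a change of task geometry, e.g. (keywords as in the earlier job sketch, values illustrative):

      #@ node            = 4
      #@ tasks_per_node  = 7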

  22. Tasks Per Node ( 2 of 2 )
  • Also try 8 tasks per node and MP_WAIT_MODE=sleep
      • export MP_WAIT_MODE=sleep
      • tasks give up the cpu instead of spinning
      • increases latency but reduces interference
      • effect varies from application to application
  • Mixed mode MPI/OpenMP works well
      • master OpenMP thread does the message passing
      • while slave OpenMP threads go to sleep
      • cpu cycles are freed up for background processes
      • used by the IFS to good effect
          • 2 tasks each of 4 threads per node
      • suspect success depends on the parallel granularity
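
The mixed-mode pattern can be sketched as below, assuming only the master thread of each task calls MPI; the declarations and the loop body are illustrative.

!$OMP PARALLEL
!$OMP MASTER
!     only the master thread does the message passing
      call mpi_sendrecv(sbuff, n, MPI_REAL8, him, 1, &
                        rbuff, n, MPI_REAL8, him, 1, &
                        MPI_COMM_WORLD, stat, ierror)
!$OMP END MASTER
!$OMP BARRIER
!$OMP DO
      do i = 1, n
         work(i) = work(i) + rbuff(i)    ! all threads then share the computation
      end do
!$OMP END DO
!$OMP END PARALLEL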

  23. Communications Optimisation
  • Communications costs often impact parallel speedup
  • Concatenate messages
      • fewer, larger messages are better
      • reduces the effect of latency
  • Increase MP_EAGER_LIMIT
      • export MP_EAGER_LIMIT=65536
      • maximum size for messages sent with the “eager” protocol
  • Use collective routines
  • Use ISEND/IRECV
  • Remove barriers
  • Experiment with tasks per node
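
As an example of concatenation, two logically separate arrays can be copied into one buffer and sent as a single message, paying the latency once; the names and sizes below are illustrative and the declarations of n, na, nb and him are assumed.

      real(kind=8) :: a(na), b(nb), packed(na+nb)

      packed(1:na)       = a
      packed(na+1:na+nb) = b
!     one message instead of two halves the latency cost
      call mpi_send(packed, na+nb, MPI_REAL8, him, 1, MPI_COMM_WORLD, ierror)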

  24. The new hardware switch
  • Designed for the Cluster 1600
      • referred to as the Federation switch
  • 2 switch adaptors per physical node
      • 2 links, each 2 GB/s, per adaptor
      • 32 processors share 4 links
  • Adaptors/links are NOT multiplexed
  • Minimum latency 10 microseconds
  • Maximum bandwidth approx 2000 MBytes/s
      • about 250 MB/s per task when all going off node together
  • Up to 5 times better performance
  • 32 processor nodes
      • will affect how we schedule and run jobs

  25. Third Practical
  • Contained in the directory /home/ectrain/trx/mpi/exercise3 on hpca
  • Parallelising the computation of PI
  • See the README for details
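
For orientation only (not necessarily the template used in the exercise), the classic approach integrates 4/(1+x*x) over [0,1]: each task sums a strided subset of the intervals and the partial sums are combined with a reduction.

      program pi_mpi
      implicit none
      include 'mpif.h'
      integer, parameter :: intervals = 1000000
      integer :: me, ntasks, i, ierror
      real(kind=8) :: h, x, mysum, pi

      call mpi_init(ierror)
      call mpi_comm_rank(MPI_COMM_WORLD, me, ierror)
      call mpi_comm_size(MPI_COMM_WORLD, ntasks, ierror)

      h = 1.0d0 / intervals
      mysum = 0.0d0
      do i = me + 1, intervals, ntasks        ! round-robin distribution of intervals
         x = h * (dble(i) - 0.5d0)            ! midpoint of interval i
         mysum = mysum + 4.0d0 / (1.0d0 + x*x)
      end do
      mysum = mysum * h

      call mpi_allreduce(mysum, pi, 1, MPI_REAL8, MPI_SUM, MPI_COMM_WORLD, ierror)
      if (me .eq. 0) write(*,*) 'pi is approximately', pi

      call mpi_finalize(ierror)
      end program pi_mpi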
