Non-blocking communications in RK dynamics. Current status and future work.
Stefano Zampini, CASPUR/CNMCA
WG-6 PP POMPA @ COSMO GM, Rome, September 6 2011
Halo exchange in COSMO
• 3 types of point-to-point communication: 2 partially non-blocking and 1 fully blocking (with MPI_SENDRECV)
• Halo swapping needs completion of the East-West exchange before starting the South-North one (implicit corner exchange; see the sketch below this list)
• Also: choice between explicit buffering and derived MPI datatypes
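The following is a minimal sketch, not the actual COSMO source, of the fully blocking variant based on MPI_SENDRECV. It illustrates why the East-West exchange must complete first: the South-North buffers include the halo columns just received from East/West, which is how the corner points are exchanged implicitly. Neighbour ranks (neigh_e, ...), buffers and counts are illustrative placeholders; istatus is an INTEGER array of size MPI_STATUS_SIZE.

! Phase 1: swap halo columns East <-> West
CALL MPI_SENDRECV (sbuf_e, ncount_ew, MPI_DOUBLE_PRECISION, neigh_e, 1, &
                   rbuf_w, ncount_ew, MPI_DOUBLE_PRECISION, neigh_w, 1, &
                   icomm, istatus, ierror)
CALL MPI_SENDRECV (sbuf_w, ncount_ew, MPI_DOUBLE_PRECISION, neigh_w, 2, &
                   rbuf_e, ncount_ew, MPI_DOUBLE_PRECISION, neigh_e, 2, &
                   icomm, istatus, ierror)
! Unpack the E/W halos into the field, then pack the S/N buffers
! INCLUDING the halo columns received above.
! Phase 2: swap halo rows South <-> North (corners travel with these messages)
CALL MPI_SENDRECV (sbuf_n, ncount_ns, MPI_DOUBLE_PRECISION, neigh_n, 3, &
                   rbuf_s, ncount_ns, MPI_DOUBLE_PRECISION, neigh_s, 3, &
                   icomm, istatus, ierror)
CALL MPI_SENDRECV (sbuf_s, ncount_ns, MPI_DOUBLE_PRECISION, neigh_s, 4, &
                   rbuf_n, ncount_ns, MPI_DOUBLE_PRECISION, neigh_n, 4, &
                   icomm, istatus, ierror)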
Details on non-blocking exchange
• Full halo exchange including corners: twice as many messages, same amount of data on the network.
• 3 different stages: send, receive and wait.
• Minimizing overhead: at the first time step, persistent requests are created with calls to MPI_SEND_INIT and MPI_RECV_INIT.
• During the model run: MPI_STARTALL is used to start the requests, MPI_TESTANY/MPI_WAITANY for completion (see the sketch after this list).
• Current implementation supports explicit send and receive buffering only; it needs to be extended to derived MPI datatypes.
• Strategy used in RK dynamics (manual implementation):
  - Sends are posted as soon as the required data has been computed locally.
  - Receives are posted as soon as the receive buffer is ready to be reused.
  - Waits are posted just before the data is needed by the next local computation.
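As a hedged illustration of the persistent-request machinery described above (the real exchg_boundaries/iexchg_boundaries code is more general), the sketch below creates a pair of persistent requests once and restarts them at every time step. All names (init_ew_requests, sendbuf_e, ...) are placeholders introduced here, not COSMO identifiers.

SUBROUTINE init_ew_requests (icomm, neigh_e, neigh_w, nbuf, &
                             sendbuf_e, recvbuf_w, ireq, ierror)
  ! Called once, e.g. at the first time step; the two request handles
  ! in ireq are then reused for every subsequent exchange.
  USE mpi
  IMPLICIT NONE
  INTEGER,      INTENT(IN)    :: icomm, neigh_e, neigh_w, nbuf
  REAL(KIND=8), INTENT(INOUT) :: sendbuf_e(nbuf), recvbuf_w(nbuf)
  INTEGER,      INTENT(OUT)   :: ireq(2), ierror
  CALL MPI_SEND_INIT (sendbuf_e, nbuf, MPI_DOUBLE_PRECISION, neigh_e, &
                      1, icomm, ireq(1), ierror)
  CALL MPI_RECV_INIT (recvbuf_w, nbuf, MPI_DOUBLE_PRECISION, neigh_w, &
                      1, icomm, ireq(2), ierror)
END SUBROUTINE init_ew_requests

! Inside the time loop (idx declared as INTEGER, istatus as
! INTEGER(MPI_STATUS_SIZE)):
!   CALL MPI_STARTALL (2, ireq, ierror)      ! post send and receive
!   ... local computation that does not touch the halo ...
!   DO
!     CALL MPI_WAITANY (2, ireq, idx, istatus, ierror)
!     IF (idx == MPI_UNDEFINED) EXIT         ! all requests completed
!   END DO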
New synopsis for swap subroutine
• Current call to subroutine exchg_boundaries
• 4 more arguments in the call to subroutine iexchg_boundaries (a usage sketch follows below):
  - ilocalreq(16): array of requests (integers declared as module variables, one for each swap scenario inside the module)
  - operation(3): array of logicals indicating the stages to perform (send, recv, wait)
  - istartpar, iendpar: needed for the definition of the corners
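A possible usage pattern of the split interface, sketched under the assumption that operation(1:3) selects the send, receive and wait stages in that order (the order given on the slide); the remaining arguments of iexchg_boundaries (fields, dimensions, etc.) are not listed here and are left as "...".

LOGICAL :: operation(3)   ! (1) = post sends, (2) = post receives, (3) = wait

! as soon as the boundary data have been computed locally: post the sends
operation = (/ .TRUE., .FALSE., .FALSE. /)
CALL iexchg_boundaries ( ..., ilocalreq, operation, istartpar, iendpar, ... )

! as soon as the receive buffers may be reused: post the receives
operation = (/ .FALSE., .TRUE., .FALSE. /)
CALL iexchg_boundaries ( ..., ilocalreq, operation, istartpar, iendpar, ... )

! just before the halo is needed by the next local computation: wait
operation = (/ .FALSE., .FALSE., .TRUE. /)
CALL iexchg_boundaries ( ..., ilocalreq, operation, istartpar, iendpar, ... )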
Benchmark details
• COSMO RAPS 5.0 with MeteoSwiss namelist (25 hours of forecast)
• COSMO-2 (520x350x60, dt = 20 s) and COSMO-7 (393x338x60, dt = 60 s)
• Decompositions: tiny (10x12+4), small (20x24+4) and usual (28x35+4); see the note after this list
• Code compiled with Intel ifort 11.1.072 and HPMPI
  COMFLG1 = -xssse3 -O3 -fp-model precise -free -fpp -override-limits -convert big_endian
  COMFLG2 = -xssse3 -O3 -fp-model precise -free -fpp -override-limits -convert big_endian
  COMFLG3 = -xssse3 -O2 -fp-model precise -free -fpp -override-limits -convert big_endian
  COMFLG4 = -xssse3 -O2 -fp-model precise -free -fpp -override-limits -convert big_endian
  LDFLG = -finline-functions -O3
• Runs on the PORDOI Linux cluster at CNMCA: 128 dual-socket quad-core nodes (1024 cores in total)
• Each socket: quad-core Intel Xeon E5450 @ 3.00 GHz with 1 GB of RAM per core
• Profiling with Scalasca 1.3.3 (very small overhead)
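Assuming the usual COSMO convention that a decomposition written as N x M + K means an N x M grid of compute PEs plus K dedicated I/O PEs (an assumption, since the slide does not spell it out), the three setups use 10*12+4 = 124, 20*24+4 = 484 and 28*35+4 = 984 cores, so even the largest run fits within the 1024 cores of PORDOI.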
Early results: COSMO-7
• Charts (not reproduced): total time (s) for the model runs and mean total time for RK dynamics
Early results: COSMO-2
• Charts (not reproduced): total time (s) for the model runs and mean total time for RK dynamics
Comments and future work
• Almost the same computational times for the test cases considered with the Intel-HPMPI configuration
• Not shown: 5% improvement in computational times with PGI-MVAPICH2 (but with worse absolute times)
• CFL check performed only locally when izdebug < 2.
• Still a lot of synchronization in collective calls during multiplicative filling in the semi-Lagrangian scheme: Allreduce and Allgather operations in multiple calls to the sum_DDI subroutine (a bottleneck for more than 1000 cores)
• Bad performance in w_bbc_rk_up5 during the RK loop over small time steps. Rewrite the loop code?
• What about automatic detection/insertion of swapping calls in microphysics and other parts of the code?
• Is TESTANY/WAITANY the most efficient way to ensure completion? (see the sketch below)
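Regarding the last question, a hedged comparison sketch; nreq, ireq, idx, istatus and istatuses are placeholders with the usual MPI Fortran declarations, and which option wins depends on how much unpacking can be overlapped.

! (a) the approach described on the previous slides: complete one request
!     at a time, so unpacking can start as soon as any message has arrived
DO
  CALL MPI_WAITANY (nreq, ireq, idx, istatus, ierror)
  IF (idx == MPI_UNDEFINED) EXIT     ! no active requests left
  ! ... unpack the buffer associated with request idx ...
END DO

! (b) alternative: complete all requests in a single call; simpler, and
!     often at least as fast when unpacking cannot be overlapped anyway
CALL MPI_WAITALL (nreq, ireq, istatuses, ierror)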