
IFS Benchmark with Federation Switch


Presentation Transcript


  1. IFS Benchmark with Federation Switch John Hague, IBM

  2. Introduction • Federation has dramatically improved pwr4 p690 communication, so: • Measure Federation performance with Small Pages and Large Pages using a simulation program • Compare Federation and pre-Federation (Colony) performance of IFS • Compare Federation performance of IFS with and without Large Pages and Memory Affinity • Examine IFS communication using MPI profiling

  3. Colony v Federation • Colony (hpca) • 1.3GHz 32-processor p690s • Four 8-processor Affinity LPARs per p690 • Needed to get communication performance • Two 180MB/s adapters per LPAR • Federation (hpcu) • 1.7GHz p690s • One 32-processor LPAR per p690 • Memory and MPI MCM Affinity • MPI Task and Memory from same MCM • Slightly better than binding task to specific processor • Two 2-link 1.2GB/s Federation adapters per p690 • Four 1.2GB/s links per node

  4. IFS Communication: transpositions [diagram: MPI tasks 0..31 laid out in a grid by node] • MPI Alltoall in all rows simultaneously • Mostly shared memory • MPI Alltoall in all columns simultaneously
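
     The row/column pattern above can be expressed with sub-communicators. The sketch below is not the IFS code, just a minimal MPI illustration of the idea: MPI_COMM_WORLD is split into row communicators (tasks sharing a node) and column communicators (one task per node), and MPI_Alltoall is called on each in turn. The 8-tasks-per-node layout and message size are assumptions for illustration.

     #include <mpi.h>
     #include <stdlib.h>

     int main(int argc, char **argv)
     {
         MPI_Init(&argc, &argv);

         int rank, size;
         MPI_Comm_rank(MPI_COMM_WORLD, &rank);
         MPI_Comm_size(MPI_COMM_WORLD, &size);

         int tasks_per_node = 8;              /* assumed row length; size is a multiple of this */
         int row = rank / tasks_per_node;     /* tasks sharing a node          */
         int col = rank % tasks_per_node;     /* same slot across all nodes    */

         MPI_Comm row_comm, col_comm;
         MPI_Comm_split(MPI_COMM_WORLD, row, col, &row_comm);
         MPI_Comm_split(MPI_COMM_WORLD, col, row, &col_comm);

         int row_size, col_size;
         MPI_Comm_size(row_comm, &row_size);
         MPI_Comm_size(col_comm, &col_size);

         int chunk = 4096;                    /* doubles per partner (assumed) */
         int nmax  = row_size > col_size ? row_size : col_size;
         double *sbuf = malloc((size_t)chunk * nmax * sizeof(double));
         double *rbuf = malloc((size_t)chunk * nmax * sizeof(double));
         for (int i = 0; i < chunk * nmax; i++)
             sbuf[i] = rank;                  /* placeholder field data        */

         /* 1st transposition: all rows simultaneously, mostly shared memory  */
         MPI_Alltoall(sbuf, chunk, MPI_DOUBLE, rbuf, chunk, MPI_DOUBLE, row_comm);

         /* 2nd transposition: all columns simultaneously, over the switch    */
         MPI_Alltoall(rbuf, chunk, MPI_DOUBLE, sbuf, chunk, MPI_DOUBLE, col_comm);

         free(sbuf); free(rbuf);
         MPI_Comm_free(&row_comm);
         MPI_Comm_free(&col_comm);
         MPI_Finalize();
         return 0;
     }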

  5. Simulation of transpositions • All transpositions in “row” use shared memory • All transpositions in “column” use switch • Number of MPI tasks per node varied • But all processors used by using OpenMP threads • Bandwidth measured for MPI Sendrecv calls • Buffers allocated and filled by threads between each call • Large Pages give best switch performance • With current switch software
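
     A rough sketch of this kind of measurement (not the actual simulation program) is given below: each task times repeated MPI_Sendrecv exchanges with a partner on another node, with OpenMP threads refilling the send buffer between calls so that all processors stay busy. Partner choice, message size and repetition count are illustrative assumptions.

     #include <mpi.h>
     #include <omp.h>
     #include <stdio.h>
     #include <stdlib.h>

     #define NBYTES (4 * 1024 * 1024)    /* message size per exchange (assumed) */
     #define NREPS  50                   /* timed repetitions (assumed)         */

     int main(int argc, char **argv)
     {
         int provided, rank, size;
         MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
         MPI_Comm_rank(MPI_COMM_WORLD, &rank);
         MPI_Comm_size(MPI_COMM_WORLD, &size);

         /* pair each task with the matching task in the other half of the
            machine (size assumed even) so the exchange crosses the switch */
         int partner = (rank + size / 2) % size;

         int n = NBYTES / sizeof(double);
         double *sbuf = malloc(NBYTES);
         double *rbuf = malloc(NBYTES);

         double t = 0.0;
         for (int rep = 0; rep < NREPS; rep++) {
             /* buffers refilled by OpenMP threads between each call */
             #pragma omp parallel for
             for (int i = 0; i < n; i++)
                 sbuf[i] = (double)(rep + i);

             double t0 = MPI_Wtime();
             MPI_Sendrecv(sbuf, n, MPI_DOUBLE, partner, 0,
                          rbuf, n, MPI_DOUBLE, partner, 0,
                          MPI_COMM_WORLD, MPI_STATUS_IGNORE);
             t += MPI_Wtime() - t0;
         }

         if (rank == 0)    /* send + recv bytes counted */
             printf("approx. bandwidth per task: %.1f MB/s\n",
                    2.0 * NBYTES * NREPS / t / 1.0e6);

         free(sbuf); free(rbuf);
         MPI_Finalize();
         return 0;
     }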

  6. “Transposition” Bandwidth per link (8 nodes, 4 links/node, 8 tasks/node, 4 threads/task, 2 tasks/link) SP = Small Pages; LP = Large Pages

  7. “Transposition” Bandwidth per link (8 nodes, 4 links/node) Multiple threads ensure all processors are used

  8. hpcu v hpca with IFS • Benchmark jobs (provided 3 years ago) • Same executable used on hpcu and hpca • 256 processors used • All jobs run with mpi_profiling (and barriers before data exchange)

  9. IFS Speedups: hpcu v hpca LP = Large Pages; SP = Small Pages MA = Memory Affinity

  10. LP/SP & MA/noMA CPU comparison

  11. LP/SP & MA/noMA Comms comparison

  12. Percentage Communication (hpca and hpcu)

  13. Extra Memory needed by Large Pages • Large Pages are allocated in Real Memory in segments of 256 MB • MPI_INIT: 80 MB which may not be used • MP_BUFFER_MEM (default 64 MB) can be reduced • MPI_BUFFER_ALLOCATE needs memory which may not be used • OpenMP threads: stack allocated with XLSMPOPTS="stack=…" may not be used • Fragmentation: memory is "wasted" • Last 256 MB segment: only a small part of it may be used
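
     As a small worked illustration of the "last segment" point (the figures here are my own, not from the talk): because Large Page data lives in 256 MB real-memory segments, every pool is rounded up to a whole number of segments and the unused tail of the final segment is lost.

     #include <stdio.h>

     #define SEG_MB 256   /* Large Page segment size on POWER4/AIX */

     /* number of 256 MB segments needed to hold used_mb of data */
     static long lp_segments(long used_mb)
     {
         return (used_mb + SEG_MB - 1) / SEG_MB;   /* round up */
     }

     int main(void)
     {
         long heap_mb = 900;                       /* illustrative figure only */
         long segs    = lp_segments(heap_mb);
         long real_mb = segs * SEG_MB;

         printf("%ld MB used -> %ld segments -> %ld MB real memory (%ld MB wasted)\n",
                heap_mb, segs, real_mb, real_mb - heap_mb);
         return 0;
     }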

  14. mpi_profile • Examine IFS communication using mpi profiling • Use libmpiprof.a • Calls and MB/s rate for each type of call • Overall • For each higher level subroutine • Histogram of blocksize for each type of call
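
     Profiling libraries of this kind normally work through MPI's standard PMPI interface: each MPI routine is intercepted, statistics are accumulated, and the real routine is called via its PMPI_ name. The fragment below is a minimal sketch of that mechanism for MPI_Send only; it is not the libmpiprof.a source, which also records per-caller totals and blocksize histograms.

     #include <mpi.h>
     #include <stdio.h>

     static long   send_calls = 0;
     static double send_bytes = 0.0;
     static double send_time  = 0.0;

     /* intercept MPI_Send, accumulate statistics, forward to PMPI_Send */
     int MPI_Send(const void *buf, int count, MPI_Datatype type,
                  int dest, int tag, MPI_Comm comm)
     {
         int    sz;
         double t0 = PMPI_Wtime();
         int    rc = PMPI_Send(buf, count, type, dest, tag, comm);

         send_time += PMPI_Wtime() - t0;
         PMPI_Type_size(type, &sz);
         send_bytes += (double)count * sz;
         send_calls++;
         return rc;
     }

     /* print a one-line summary in the spirit of the tables on the next
        two slides (#calls, avg. bytes, Mbytes, time) before finalizing */
     int MPI_Finalize(void)
     {
         int rank;
         PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
         if (rank == 0 && send_calls > 0)
             printf("MPI_Send %10ld %12.1f %10.1f %10.3f\n",
                    send_calls, send_bytes / send_calls,
                    send_bytes / 1.0e6, send_time);
         return PMPI_Finalize();
     }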

  15. mpi_profile for T799 (128 MPI tasks, 2 threads)
      WALL time = 5495 sec
      --------------------------------------------------------------
      MPI Routine        #calls   avg. bytes     Mbytes   time(sec)
      --------------------------------------------------------------
      MPI_Send            49784      52733.2     2625.3       7.873
      MPI_Bsend             6171     454107.3     2802.3       1.331
      MPI_Isend            84524    1469867.4   124239.1       1.202
      MPI_Recv             91940    1332252.1   122487.3     359.547
      MPI_Waitall          75884          0.0        0.0      59.772
      MPI_Bcast              362         26.6        0.0       0.028
      MPI_Barrier           9451          0.0        0.0     436.818
      --------------------------------------------------------------
      TOTAL                                                  866.574
      --------------------------------------------------------------
      Barrier indicates load imbalance

  16. mpi_profile for 4D_Var min0 (128 MPI tasks, 2 threads)
      WALL time = 1218 sec
      --------------------------------------------------------------
      MPI Routine        #calls   avg. bytes     Mbytes   time(sec)
      --------------------------------------------------------------
      MPI_Send             43995       7222.9      317.8       1.033
      MPI_Bsend            38473      13898.4      534.7       0.843
      MPI_Isend           326703     168598.3    55081.6       6.368
      MPI_Recv            432364     127061.8    54936.9     220.877
      MPI_Waitall         276222          0.0        0.0      23.166
      MPI_Bcast              288     374491.7      107.9       0.490
      MPI_Barrier          27062          0.0        0.0      94.168
      MPI_Allgatherv         466     285958.8      133.3      26.250
      MPI_Allreduce         1325         73.2        0.1       1.027
      --------------------------------------------------------------
      TOTAL                                                  374.223
      --------------------------------------------------------------
      Barrier indicates load imbalance

  17. MPI Profiles for send/recv

  18. mpi_profiles for recv/send 799

  19. Conclusions
      • Speedups of hpcu over hpca:
            Large Pages   Memory Affinity   Speedup
            N             N                 1.32 – 1.60
            Y             N                 1.43 – 1.62
            N             Y                 1.47 – 1.78
            Y             Y                 1.52 – 1.85
      • Best Environment Variables:
            • MPI.network=css0 (instead of csss)
            • MEMORY_AFFINITY=yes
            • MP_AFFINITY=MCM              ! with new pmd
            • MP_BULK_MIN_MSG_SIZE=50000
            • LDR_CNTRL="LARGE_PAGE_DATA=Y" – don't use, otherwise system calls are very slow with Large Pages
            • MP_EAGER_LIMIT=64K

  20. hpca v hpcu
                                 --------- Time ---------   ------- Speedup -------      %
                I/O*   LP  Aff   Total    CPU    Comms      Total   CPU    Comms      Comms
      min0:
        hpca    ***    N   N      2499   1408     1091                                43.6
        hpcu    H+/22  N   N      1502   1119      383      1.66    1.26    2.85      25.5
                H+/21  N   Y      1321    951      370      1.89    1.48    2.95      28.0
                H+/20  Y   N      1444   1165      279      1.73    1.21    3.91      19.3
                H+/19  Y   Y      1229    962      267      2.03    1.46    4.08      21.7
      min1:
        hpca    ***    N   N      1649   1065      584                                43.6
        hpcu    H+/22  N   N      1033    825      208      1.60    1.29    2.81      20.1
                H+/21  N   Y       948    734      214      1.74    1.45    2.73      22.5
                H+/15  Y   N      1019    856      163      1.62    1.24    3.58      16.0
                H+/19  Y   Y       914    765      149      1.80    1.39    3.91      16.3


  22. mpi_profiles for recv/send 799

  23. Conclusions • Memory Affinity with binding • Program binds to MOD(task_id*nthrds+thrd_id,32) (see the sketch below), or • Use new /usr/lpp/ppe.poe/bin/pmdv4 • How to bind if whole node not used • Try VSRAC code from Montpellier • Bind adapter link to MCM? • Large Pages • Advantages • Need LP for best communication bandwidth with current switch software • Disadvantages • Uses extra memory (4 GB more per node in 4D-Var min1) • LoadLeveler scheduling • Prototype switch software indicates Large Pages not necessary • Collective Communication • To be investigated
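
     A sketch of the explicit binding option, assuming AIX's bindprocessor() interface (BINDTHREAD plus thread_self() for the calling kernel thread); the 32 in the formula is the number of logical CPUs in an hpcu node, and error handling is omitted.

     #include <mpi.h>
     #include <omp.h>
     #include <sys/processor.h>   /* bindprocessor(), BINDTHREAD (AIX) */
     #include <sys/thread.h>      /* thread_self() (AIX)               */

     /* bind every OpenMP thread of this MPI task to
        CPU = MOD(task_id*nthrds + thrd_id, 32)                        */
     void bind_my_threads(void)
     {
         int task_id;
         MPI_Comm_rank(MPI_COMM_WORLD, &task_id);

         #pragma omp parallel
         {
             int thrd_id = omp_get_thread_num();
             int nthrds  = omp_get_num_threads();
             int cpu     = (task_id * nthrds + thrd_id) % 32;  /* 32 CPUs per p690 */

             /* bind the calling kernel thread to that logical CPU */
             bindprocessor(BINDTHREAD, thread_self(), cpu);
         }
     }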

  24. Linux compared to PWR4 for IFS • Linux (run by Peter Mayes) • Opteron, 2 GHz, 2 CPUs/node, 6 GB/node, Myrinet switch • Portland Group compiler • Compiler flags: -O3 -Mvect=sse • No code optimisation or OpenMP • Linux 1: 1 CPU/node, Myrinet IP • Linux 1A: 1 CPU/node, Myrinet GM • Linux 2: using 2 CPUs/node • IBM Power4 • MPI (intra-node shared memory) and OpenMP • Compiler flags: -O3 -qstrict • hpca: 1.3 GHz p690, 8 CPUs/node, 8 GB/node, Colony switch • hpcu: 1.7 GHz p690, 32 CPUs/node, 32 GB/node, Federation switch

  25. Linux compared to Pwr4
