IFS Benchmark with Federation Switch
John Hague, IBM
Introduction
• Federation has dramatically improved pwr4 p690 communication, so:
• Measure Federation performance with Small Pages and Large Pages using a simulation program
• Compare Federation and pre-Federation (Colony) performance of IFS
• Compare Federation performance of IFS with and without Large Pages and Memory Affinity
• Examine IFS communication using MPI profiling
Colony v Federation
• Colony (hpca)
  • 1.3GHz 32-processor p690s
  • Four 8-processor Affinity LPARs per p690 (needed to get communication performance)
  • Two 180MB/s adapters per LPAR
• Federation (hpcu)
  • 1.7GHz p690s
  • One 32-processor LPAR per p690
  • Memory and MPI MCM Affinity: MPI task and memory from the same MCM (slightly better than binding the task to a specific processor)
  • Two 2-link 1.2GB/s Federation adapters per p690 (four 1.2GB/s links per node)
IFS Communication: transpositions
[Figure: MPI tasks 0–31 arranged in a grid, with each node holding one row of tasks]
• MPI Alltoall in all rows simultaneously (mostly shared memory)
• MPI Alltoall in all columns simultaneously
(see the sub-communicator sketch below)
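As an illustration of the two transposition phases above, the sketch below (not the IFS code; the grid shape, message size and rank-to-grid mapping are invented) builds row and column sub-communicators with MPI_Comm_split and calls MPI_Alltoall on each:

```c
/*
 * A minimal sketch (not the IFS code) of the two transposition phases:
 * an MPI_Alltoall inside each row, then inside each column, using
 * sub-communicators from MPI_Comm_split.  Grid shape, message size and
 * rank-to-grid mapping are assumptions for illustration.
 */
#include <mpi.h>
#include <stdlib.h>

#define PROWS  8      /* rows of the task grid (assumed: one row per node) */
#define PCOLS  4      /* columns of the task grid                          */
#define NWORDS 1024   /* doubles sent to each partner                      */

int main(int argc, char **argv)
{
    int rank;
    MPI_Comm row_comm, col_comm;

    MPI_Init(&argc, &argv);                /* run with PROWS*PCOLS tasks */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int myrow = rank / PCOLS;   /* tasks in the same row sit on one node */
    int mycol = rank % PCOLS;   /* tasks in the same column span nodes   */

    /* Tasks with the same colour end up in the same sub-communicator. */
    MPI_Comm_split(MPI_COMM_WORLD, myrow, mycol, &row_comm);
    MPI_Comm_split(MPI_COMM_WORLD, mycol, myrow, &col_comm);

    double *rsend = calloc((size_t)NWORDS * PCOLS, sizeof(double));
    double *rrecv = calloc((size_t)NWORDS * PCOLS, sizeof(double));
    double *csend = calloc((size_t)NWORDS * PROWS, sizeof(double));
    double *crecv = calloc((size_t)NWORDS * PROWS, sizeof(double));

    /* Phase 1: all rows transpose simultaneously (mostly shared memory). */
    MPI_Alltoall(rsend, NWORDS, MPI_DOUBLE, rrecv, NWORDS, MPI_DOUBLE, row_comm);

    /* Phase 2: all columns transpose simultaneously (over the switch). */
    MPI_Alltoall(csend, NWORDS, MPI_DOUBLE, crecv, NWORDS, MPI_DOUBLE, col_comm);

    free(rsend); free(rrecv); free(csend); free(crecv);
    MPI_Comm_free(&row_comm);
    MPI_Comm_free(&col_comm);
    MPI_Finalize();
    return 0;
}
```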
Simulation of transpositions
• All transpositions in a “row” use shared memory
• All transpositions in a “column” use the switch
• Number of MPI tasks per node varied, but all processors used by means of OpenMP threads
• Bandwidth measured for MPI Sendrecv calls; buffers allocated and filled by threads between each call (see the measurement sketch below)
• Large Pages give the best switch performance with the current switch software
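A minimal sketch of this kind of measurement follows; the buffer size, repetition count and partner mapping are assumptions, not the original simulation program. Each task exchanges a buffer with a partner on another node via MPI_Sendrecv, OpenMP threads fill the buffer between calls, and only the Sendrecv time is accumulated for the bandwidth figure:

```c
#include <mpi.h>
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define NBYTES (8 * 1024 * 1024)   /* message size per exchange (assumed) */
#define NREPS  100                 /* number of exchanges timed           */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Pair each task with one half the job away so, with whole nodes
       allocated consecutively, the exchange crosses the switch.
       Run with an even number of tasks. */
    int partner = (rank + size / 2) % size;

    char *sbuf = malloc(NBYTES);
    char *rbuf = malloc(NBYTES);
    double tcomm = 0.0;

    MPI_Barrier(MPI_COMM_WORLD);
    for (int it = 0; it < NREPS; it++) {
        /* Threads touch the send buffer between calls, as in the slides,
           so all processors are used and the pages stay resident. */
        #pragma omp parallel for
        for (long i = 0; i < NBYTES; i++)
            sbuf[i] = (char)(it + i);

        double t0 = MPI_Wtime();
        MPI_Sendrecv(sbuf, NBYTES, MPI_BYTE, partner, 0,
                     rbuf, NBYTES, MPI_BYTE, partner, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        tcomm += MPI_Wtime() - t0;
    }

    if (rank == 0)   /* send + receive bytes, per task */
        printf("per-task Sendrecv bandwidth ~ %.1f MB/s\n",
               2.0 * NBYTES * NREPS / tcomm / 1e6);

    free(sbuf);
    free(rbuf);
    MPI_Finalize();
    return 0;
}
```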
[Chart] “Transposition” Bandwidth per link (8 nodes, 4 links/node, 8 tasks/node, 4 threads/task, 2 tasks/link). SP = Small Pages; LP = Large Pages
[Chart] “Transposition” Bandwidth per link (8 nodes, 4 links/node). Multiple threads ensure all processors are used
hpcu v hpca with IFS
• Benchmark jobs (provided 3 years ago)
• Same executable used on hpcu and hpca
• 256 processors used
• All jobs run with MPI profiling (and barriers before data exchange)
[Chart] IFS Speedups: hpcu v hpca. LP = Large Pages; SP = Small Pages; MA = Memory Affinity
[Chart] Percentage Communication: hpca v hpcu
Extra Memory needed by Large Pages
Large Pages are allocated in Real Memory in segments of 256 MB
• MPI_INIT: 80MB which may not be used
• MP_BUFFER_MEM (default 64MB) can be reduced
• MPI_BUFFER_ALLOCATE needs memory which may not be used
• OpenMP threads: stack allocated with XLSMPOPTS="stack=…" may not be used
• Fragmentation: memory is "wasted"
• Last 256 MB segment: only a small part of it may be used
(a rounding example follows below)
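The "wasted" memory in the last segment follows from the 256 MB granularity. A small illustrative calculation (the 300 MB request is an arbitrary example, not a figure from the benchmark):

```c
/* A minimal sketch showing how a Large Page allocation rounds up to whole
 * 256 MB segments, which is where the "wasted" memory in the last segment
 * comes from.  The request size below is an invented example. */
#include <stdio.h>

#define SEG (256UL * 1024 * 1024)   /* Large Page segment size: 256 MB */

/* Bytes actually reserved when 'request' bytes come from Large Pages. */
static unsigned long lp_reserved(unsigned long request)
{
    return ((request + SEG - 1) / SEG) * SEG;   /* round up to whole segments */
}

int main(void)
{
    unsigned long req = 300UL * 1024 * 1024;    /* e.g. a 300 MB heap request */
    unsigned long res = lp_reserved(req);
    printf("requested %lu MB, reserved %lu MB, wasted %lu MB\n",
           req >> 20, res >> 20, (res - req) >> 20);
    return 0;
}
```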
mpi_profile
• Examine IFS communication using MPI profiling
• Use libmpiprof.a
• Calls and MB/s rate for each type of call
  • Overall
  • For each higher-level subroutine
• Histogram of blocksize for each type of call
(a PMPI-style sketch follows below)
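The internals of IBM's libmpiprof.a are not shown in the slides; as an illustration of the mechanism such a profiler is built on, here is a hedged sketch of a standard PMPI wrapper that counts calls, bytes and time for MPI_Send (only one routine is wrapped, and the prototype follows the modern MPI C binding):

```c
#include <mpi.h>
#include <stdio.h>

static long   send_calls = 0;
static double send_bytes = 0.0;
static double send_time  = 0.0;

/* Intercept MPI_Send: record count, bytes and time, then call the real one. */
int MPI_Send(const void *buf, int count, MPI_Datatype type,
             int dest, int tag, MPI_Comm comm)
{
    int    tsize, rc;
    double t0 = MPI_Wtime();

    rc = PMPI_Send(buf, count, type, dest, tag, comm);

    PMPI_Type_size(type, &tsize);
    send_calls++;
    send_bytes += (double)count * tsize;
    send_time  += MPI_Wtime() - t0;
    return rc;
}

/* Print the per-task summary when the application shuts MPI down. */
int MPI_Finalize(void)
{
    int rank;
    PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        printf("MPI_Send: %ld calls, %.1f Mbytes, %.3f sec\n",
               send_calls, send_bytes / 1e6, send_time);
    return PMPI_Finalize();
}
```

Linking such wrappers ahead of the MPI library yields per-routine tables like those on the next two slides.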
mpi_profile for T799 (128 MPI tasks, 2 threads)
WALL time = 5495 sec
--------------------------------------------------------------
MPI Routine        #calls   avg. bytes     Mbytes   time(sec)
--------------------------------------------------------------
MPI_Send            49784      52733.2     2625.3       7.873
MPI_Bsend            6171     454107.3     2802.3       1.331
MPI_Isend           84524    1469867.4   124239.1       1.202
MPI_Recv            91940    1332252.1   122487.3     359.547
MPI_Waitall         75884          0.0        0.0      59.772
MPI_Bcast             362         26.6        0.0       0.028
MPI_Barrier          9451          0.0        0.0     436.818
--------------------------------------------------------------
TOTAL                                                 866.574
--------------------------------------------------------------
Barrier indicates load imbalance
mpi_profile for 4D_Var min0 (128 MPI tasks, 2 threads)
WALL time = 1218 sec
--------------------------------------------------------------
MPI Routine        #calls   avg. bytes     Mbytes   time(sec)
--------------------------------------------------------------
MPI_Send            43995       7222.9      317.8       1.033
MPI_Bsend           38473      13898.4      534.7       0.843
MPI_Isend          326703     168598.3    55081.6       6.368
MPI_Recv           432364     127061.8    54936.9     220.877
MPI_Waitall        276222          0.0        0.0      23.166
MPI_Bcast             288     374491.7      107.9       0.490
MPI_Barrier         27062          0.0        0.0      94.168
MPI_Allgatherv        466     285958.8      133.3      26.250
MPI_Allreduce        1325         73.2        0.1       1.027
--------------------------------------------------------------
TOTAL                                                 374.223
--------------------------------------------------------------
Barrier indicates load imbalance
Conclusions
• Speedups of hpcu over hpca

  Large Pages   Memory Affinity   Speedup
  N             N                 1.32 – 1.60
  Y             N                 1.43 – 1.62
  N             Y                 1.47 – 1.78
  Y             Y                 1.52 – 1.85

• Best Environment Variables
  • MPI.network=ccc0 (instead of cccs)
  • MEMORY_AFFINITY=yes
  • MP_AFFINITY=MCM ! with new pvmd
  • MP_BULK_MIN_MSG_SIZE=50000
  • LDR_CNTRL="LARGE_PAGE_DATA=Y": do not use, otherwise system calls with Large Pages are very slow
  • MP_EAGER_LIMIT=64K
hpca v hpcu

                               ------Time------    ----Speedup----      %
             LP  Aff  I/O*    Total   CPU  Comms   Total  CPU Comms  Comms
min0:
 hpca ***     N   N            2499  1408   1091                      43.6
 hpcu H+/22   N   N            1502  1119    383   1.66  1.26  2.85   25.5
      H+/21   N   Y            1321   951    370   1.89  1.48  2.95   28.0
      H+/20   Y   N            1444  1165    279   1.73  1.21  3.91   19.3
      H+/19   Y   Y            1229   962    267   2.03  1.46  4.08   21.7
min1:
 hpca ***     N   N            1649  1065    584                      43.6
 hpcu H+/22   N   N            1033   825    208   1.60  1.29  2.81   20.1
      H+/21   N   Y             948   734    214   1.74  1.45  2.73   22.5
      H+/15   Y   N            1019   856    163   1.62  1.24  3.58   16.0
      H+/19   Y   Y             914   765    149   1.80  1.39  3.91   16.3
Conclusions
• Memory Affinity with binding
  • Program binds to: MOD(task_id*nthrds+thrd_id,32) (see the binding sketch after this slide), or
  • Use the new /usr/lpp/ppe.poe/bin/pmdv4
  • How to bind if the whole node is not used: try the VSRAC code from Montpellier
  • Bind adapter link to MCM?
• Large Pages
  • Advantages: needed for best communication B/W with current software
  • Disadvantages: uses extra memory (4GB more per node in 4D-Var min1); Load Leveler scheduling
  • Prototype switch software indicates Large Pages not necessary
• Collective Communication
  • To be investigated
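For reference, a hedged sketch of the binding rule quoted above, using the AIX bindprocessor() interface; the helper name, CPUS_PER_NODE value and error handling are assumptions, not part of the original slides:

```c
/* AIX-specific sketch: each OpenMP thread binds itself to CPU
 * MOD(task_id*nthrds + thrd_id, 32) as quoted in the slide. */
#include <sys/processor.h>   /* bindprocessor(), BINDTHREAD */
#include <sys/thread.h>      /* thread_self()               */
#include <omp.h>
#include <stdio.h>

#define CPUS_PER_NODE 32     /* 32-processor p690 LPAR (assumed) */

/* task_id: MPI rank of this task; nthrds: OpenMP threads per task. */
void bind_threads(int task_id, int nthrds)
{
    #pragma omp parallel num_threads(nthrds)
    {
        int thrd_id = omp_get_thread_num();
        int cpu = (task_id * nthrds + thrd_id) % CPUS_PER_NODE;

        /* BINDTHREAD binds only this kernel thread; failure is non-fatal. */
        if (bindprocessor(BINDTHREAD, thread_self(), cpu) != 0)
            perror("bindprocessor");
    }
}
```

The alternative mentioned above, the new pmdv4, performs the MCM-level placement without explicit binding in the application.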
Linux compared to PWR4 for IFS
• Linux (run by Peter Mayes)
  • Opteron, 2GHz, 2 CPUs/node, 6GB/node, Myrinet switch
  • Portland Group compiler; compiler flags: -O3 -Mvect=sse
  • No code optimisation or OpenMP
  • Linux 1: 1 CPU/node, Myrinet IP
  • Linux 1A: 1 CPU/node, Myrinet GM
  • Linux 2: using 2 CPUs/node
• IBM Power4
  • MPI (intra-node shared memory) and OpenMP
  • Compiler flags: -O3 -qstrict
  • hpca: 1.3GHz p690, 8 CPUs/node, 8GB/node, Colony switch
  • hpcu: 1.7GHz p690, 32 CPUs/node, 32GB/node, Federation switch