IFS Benchmark with Federation Switch
John Hague, IBM
Introduction
• Federation has dramatically improved pwr4 p690 communication, so:
• Measure Federation performance with Small Pages and Large Pages using a simulation program
• Compare Federation and pre-Federation (Colony) performance of IFS
• Compare Federation performance of IFS with and without Large Pages and Memory Affinity
• Examine IFS communication using MPI profiling
Colony v Federation
• Colony (hpca)
  • 1.3GHz 32-processor p690s
  • Four 8-processor Affinity LPARs per p690 (needed to get communication performance)
  • Two 180MB/s adapters per LPAR
• Federation (hpcu)
  • 1.7GHz p690s
  • One 32-processor LPAR per p690
  • Memory and MPI MCM Affinity: MPI task and memory from the same MCM (slightly better than binding the task to a specific processor)
  • Two 2-link 1.2GB/s Federation adapters per p690 (four 1.2GB/s links per node)
IFS Communication: transpositions
[Figure: MPI tasks 0–31 arranged in a grid, with each node holding one row of tasks]
• MPI Alltoall in all rows simultaneously (mostly shared memory)
• MPI Alltoall in all columns simultaneously
(see the sub-communicator sketch below)
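As an illustration of the two transposition phases above, the sketch below (not the IFS code; the grid shape, message size and rank-to-grid mapping are invented) builds row and column sub-communicators with MPI_Comm_split and calls MPI_Alltoall on each:

```c
/*
 * A minimal sketch (not the IFS code) of the two transposition phases:
 * an MPI_Alltoall inside each row, then inside each column, using
 * sub-communicators from MPI_Comm_split.  Grid shape, message size and
 * rank-to-grid mapping are assumptions for illustration.
 */
#include <mpi.h>
#include <stdlib.h>

#define PROWS  8      /* rows of the task grid (assumed: one row per node) */
#define PCOLS  4      /* columns of the task grid                          */
#define NWORDS 1024   /* doubles sent to each partner                      */

int main(int argc, char **argv)
{
    int rank;
    MPI_Comm row_comm, col_comm;

    MPI_Init(&argc, &argv);                /* run with PROWS*PCOLS tasks */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int myrow = rank / PCOLS;   /* tasks in the same row sit on one node */
    int mycol = rank % PCOLS;   /* tasks in the same column span nodes   */

    /* Tasks with the same colour end up in the same sub-communicator. */
    MPI_Comm_split(MPI_COMM_WORLD, myrow, mycol, &row_comm);
    MPI_Comm_split(MPI_COMM_WORLD, mycol, myrow, &col_comm);

    double *rsend = calloc((size_t)NWORDS * PCOLS, sizeof(double));
    double *rrecv = calloc((size_t)NWORDS * PCOLS, sizeof(double));
    double *csend = calloc((size_t)NWORDS * PROWS, sizeof(double));
    double *crecv = calloc((size_t)NWORDS * PROWS, sizeof(double));

    /* Phase 1: all rows transpose simultaneously (mostly shared memory). */
    MPI_Alltoall(rsend, NWORDS, MPI_DOUBLE, rrecv, NWORDS, MPI_DOUBLE, row_comm);

    /* Phase 2: all columns transpose simultaneously (over the switch). */
    MPI_Alltoall(csend, NWORDS, MPI_DOUBLE, crecv, NWORDS, MPI_DOUBLE, col_comm);

    free(rsend); free(rrecv); free(csend); free(crecv);
    MPI_Comm_free(&row_comm);
    MPI_Comm_free(&col_comm);
    MPI_Finalize();
    return 0;
}
```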
Simulation of transpositions
• All transpositions in a “row” use shared memory
• All transpositions in a “column” use the switch
• Number of MPI tasks per node varied, but all processors used by means of OpenMP threads
• Bandwidth measured for MPI Sendrecv calls; buffers allocated and filled by threads between each call (see the measurement sketch below)
• Large Pages give the best switch performance with the current switch software
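A minimal sketch of this kind of measurement follows; the buffer size, repetition count and partner mapping are assumptions, not the original simulation program. Each task exchanges a buffer with a partner on another node via MPI_Sendrecv, OpenMP threads fill the buffer between calls, and only the Sendrecv time is accumulated for the bandwidth figure:

```c
#include <mpi.h>
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define NBYTES (8 * 1024 * 1024)   /* message size per exchange (assumed) */
#define NREPS  100                 /* number of exchanges timed           */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Pair each task with one half the job away so, with whole nodes
       allocated consecutively, the exchange crosses the switch.
       Run with an even number of tasks. */
    int partner = (rank + size / 2) % size;

    char *sbuf = malloc(NBYTES);
    char *rbuf = malloc(NBYTES);
    double tcomm = 0.0;

    MPI_Barrier(MPI_COMM_WORLD);
    for (int it = 0; it < NREPS; it++) {
        /* Threads touch the send buffer between calls, as in the slides,
           so all processors are used and the pages stay resident. */
        #pragma omp parallel for
        for (long i = 0; i < NBYTES; i++)
            sbuf[i] = (char)(it + i);

        double t0 = MPI_Wtime();
        MPI_Sendrecv(sbuf, NBYTES, MPI_BYTE, partner, 0,
                     rbuf, NBYTES, MPI_BYTE, partner, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        tcomm += MPI_Wtime() - t0;
    }

    if (rank == 0)   /* send + receive bytes, per task */
        printf("per-task Sendrecv bandwidth ~ %.1f MB/s\n",
               2.0 * NBYTES * NREPS / tcomm / 1e6);

    free(sbuf);
    free(rbuf);
    MPI_Finalize();
    return 0;
}
```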
[Chart] “Transposition” Bandwidth per link (8 nodes, 4 links/node, 8 tasks/node, 4 threads/task, 2 tasks/link). SP = Small Pages; LP = Large Pages
[Chart] “Transposition” Bandwidth per link (8 nodes, 4 links/node). Multiple threads ensure all processors are used
hpcu v hpca with IFS
• Benchmark jobs (provided 3 years ago)
• Same executable used on hpcu and hpca
• 256 processors used
• All jobs run with MPI profiling (and barriers before data exchange)
[Chart] IFS Speedups: hpcu v hpca. LP = Large Pages; SP = Small Pages; MA = Memory Affinity
[Chart] Percentage Communication: hpca v hpcu
Extra Memory needed by Large Pages
Large Pages are allocated in Real Memory in segments of 256 MB
• MPI_INIT: 80MB which may not be used
• MP_BUFFER_MEM (default 64MB) can be reduced
• MPI_BUFFER_ALLOCATE needs memory which may not be used
• OpenMP threads: stack allocated with XLSMPOPTS="stack=…" may not be used
• Fragmentation: memory is "wasted"
• Last 256 MB segment: only a small part of it may be used
(a rounding example follows below)
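The "wasted" memory in the last segment follows from the 256 MB granularity. A small illustrative calculation (the 300 MB request is an arbitrary example, not a figure from the benchmark):

```c
/* A minimal sketch showing how a Large Page allocation rounds up to whole
 * 256 MB segments, which is where the "wasted" memory in the last segment
 * comes from.  The request size below is an invented example. */
#include <stdio.h>

#define SEG (256UL * 1024 * 1024)   /* Large Page segment size: 256 MB */

/* Bytes actually reserved when 'request' bytes come from Large Pages. */
static unsigned long lp_reserved(unsigned long request)
{
    return ((request + SEG - 1) / SEG) * SEG;   /* round up to whole segments */
}

int main(void)
{
    unsigned long req = 300UL * 1024 * 1024;    /* e.g. a 300 MB heap request */
    unsigned long res = lp_reserved(req);
    printf("requested %lu MB, reserved %lu MB, wasted %lu MB\n",
           req >> 20, res >> 20, (res - req) >> 20);
    return 0;
}
```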
mpi_profile
• Examine IFS communication using MPI profiling
• Use libmpiprof.a
• Calls and MB/s rate for each type of call
  • Overall
  • For each higher-level subroutine
• Histogram of blocksize for each type of call
(a PMPI-style sketch follows below)
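The internals of IBM's libmpiprof.a are not shown in the slides; as an illustration of the mechanism such a profiler is built on, here is a hedged sketch of a standard PMPI wrapper that counts calls, bytes and time for MPI_Send (only one routine is wrapped, and the prototype follows the modern MPI C binding):

```c
#include <mpi.h>
#include <stdio.h>

static long   send_calls = 0;
static double send_bytes = 0.0;
static double send_time  = 0.0;

/* Intercept MPI_Send: record count, bytes and time, then call the real one. */
int MPI_Send(const void *buf, int count, MPI_Datatype type,
             int dest, int tag, MPI_Comm comm)
{
    int    tsize, rc;
    double t0 = MPI_Wtime();

    rc = PMPI_Send(buf, count, type, dest, tag, comm);

    PMPI_Type_size(type, &tsize);
    send_calls++;
    send_bytes += (double)count * tsize;
    send_time  += MPI_Wtime() - t0;
    return rc;
}

/* Print the per-task summary when the application shuts MPI down. */
int MPI_Finalize(void)
{
    int rank;
    PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        printf("MPI_Send: %ld calls, %.1f Mbytes, %.3f sec\n",
               send_calls, send_bytes / 1e6, send_time);
    return PMPI_Finalize();
}
```

Linking such wrappers ahead of the MPI library yields per-routine tables like those on the next two slides.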
mpi_profile for T799 (128 MPI tasks, 2 threads)
WALL time = 5495 sec
--------------------------------------------------------------
MPI Routine        #calls   avg. bytes     Mbytes   time(sec)
--------------------------------------------------------------
MPI_Send            49784      52733.2     2625.3       7.873
MPI_Bsend            6171     454107.3     2802.3       1.331
MPI_Isend           84524    1469867.4   124239.1       1.202
MPI_Recv            91940    1332252.1   122487.3     359.547
MPI_Waitall         75884          0.0        0.0      59.772
MPI_Bcast             362         26.6        0.0       0.028
MPI_Barrier          9451          0.0        0.0     436.818
--------------------------------------------------------------
TOTAL                                                 866.574
--------------------------------------------------------------
Barrier indicates load imbalance
mpi_profile for 4D_Var min0 (128 MPI tasks, 2 threads)
WALL time = 1218 sec
--------------------------------------------------------------
MPI Routine        #calls   avg. bytes     Mbytes   time(sec)
--------------------------------------------------------------
MPI_Send            43995       7222.9      317.8       1.033
MPI_Bsend           38473      13898.4      534.7       0.843
MPI_Isend          326703     168598.3    55081.6       6.368
MPI_Recv           432364     127061.8    54936.9     220.877
MPI_Waitall        276222          0.0        0.0      23.166
MPI_Bcast             288     374491.7      107.9       0.490
MPI_Barrier         27062          0.0        0.0      94.168
MPI_Allgatherv        466     285958.8      133.3      26.250
MPI_Allreduce        1325         73.2        0.1       1.027
--------------------------------------------------------------
TOTAL                                                 374.223
--------------------------------------------------------------
Barrier indicates load imbalance
Conclusions
• Speedups of hpcu over hpca

  Large Pages   Memory Affinity   Speedup
  N             N                 1.32 – 1.60
  Y             N                 1.43 – 1.62
  N             Y                 1.47 – 1.78
  Y             Y                 1.52 – 1.85

• Best Environment Variables
  • MPI.network=ccc0 (instead of cccs)
  • MEMORY_AFFINITY=yes
  • MP_AFFINITY=MCM ! with new pvmd
  • MP_BULK_MIN_MSG_SIZE=50000
  • LDR_CNTRL="LARGE_PAGE_DATA=Y": do not use, otherwise system calls with Large Pages are very slow
  • MP_EAGER_LIMIT=64K
hpca v hpcu

                               ------Time------    ----Speedup----      %
             LP  Aff  I/O*    Total   CPU  Comms   Total  CPU Comms  Comms
min0:
 hpca ***     N   N            2499  1408   1091                      43.6
 hpcu H+/22   N   N            1502  1119    383   1.66  1.26  2.85   25.5
      H+/21   N   Y            1321   951    370   1.89  1.48  2.95   28.0
      H+/20   Y   N            1444  1165    279   1.73  1.21  3.91   19.3
      H+/19   Y   Y            1229   962    267   2.03  1.46  4.08   21.7
min1:
 hpca ***     N   N            1649  1065    584                      43.6
 hpcu H+/22   N   N            1033   825    208   1.60  1.29  2.81   20.1
      H+/21   N   Y             948   734    214   1.74  1.45  2.73   22.5
      H+/15   Y   N            1019   856    163   1.62  1.24  3.58   16.0
      H+/19   Y   Y             914   765    149   1.80  1.39  3.91   16.3
Conclusions
• Memory Affinity with binding
  • Program binds to: MOD(task_id*nthrds+thrd_id,32) (see the binding sketch after this slide), or
  • Use the new /usr/lpp/ppe.poe/bin/pmdv4
  • How to bind if the whole node is not used: try the VSRAC code from Montpellier
  • Bind adapter link to MCM?
• Large Pages
  • Advantages: needed for best communication B/W with current software
  • Disadvantages: uses extra memory (4GB more per node in 4D-Var min1); Load Leveler scheduling
  • Prototype switch software indicates Large Pages not necessary
• Collective Communication
  • To be investigated
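For reference, a hedged sketch of the binding rule quoted above, using the AIX bindprocessor() interface; the helper name, CPUS_PER_NODE value and error handling are assumptions, not part of the original slides:

```c
/* AIX-specific sketch: each OpenMP thread binds itself to CPU
 * MOD(task_id*nthrds + thrd_id, 32) as quoted in the slide. */
#include <sys/processor.h>   /* bindprocessor(), BINDTHREAD */
#include <sys/thread.h>      /* thread_self()               */
#include <omp.h>
#include <stdio.h>

#define CPUS_PER_NODE 32     /* 32-processor p690 LPAR (assumed) */

/* task_id: MPI rank of this task; nthrds: OpenMP threads per task. */
void bind_threads(int task_id, int nthrds)
{
    #pragma omp parallel num_threads(nthrds)
    {
        int thrd_id = omp_get_thread_num();
        int cpu = (task_id * nthrds + thrd_id) % CPUS_PER_NODE;

        /* BINDTHREAD binds only this kernel thread; failure is non-fatal. */
        if (bindprocessor(BINDTHREAD, thread_self(), cpu) != 0)
            perror("bindprocessor");
    }
}
```

The alternative mentioned above, the new pmdv4, performs the MCM-level placement without explicit binding in the application.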
Linux compared to PWR4 for IFS
• Linux (run by Peter Mayes)
  • Opteron, 2GHz, 2 CPUs/node, 6GB/node, Myrinet switch
  • Portland Group compiler; compiler flags: -O3 -Mvect=sse
  • No code optimisation or OpenMP
  • Linux 1: 1 CPU/node, Myrinet IP
  • Linux 1A: 1 CPU/node, Myrinet GM
  • Linux 2: using 2 CPUs/node
• IBM Power4
  • MPI (intra-node shared memory) and OpenMP
  • Compiler flags: -O3 -qstrict
  • hpca: 1.3GHz p690, 8 CPUs/node, 8GB/node, Colony switch
  • hpcu: 1.7GHz p690, 32 CPUs/node, 32GB/node, Federation switch