Controlling Process/Thread Placement on Zeus. Raghu Reddy (Raghu.Reddy@noaa.gov). Placement example: GFS, T574 case, with 241 MPI ranks; Nodes=121, PPN=2, NUMTHD=5; 2 cores left idle. An artificial example to illustrate the problem. Default or “basic” omplace.
Controlling Process/Thread Placement on Zeus
Raghu Reddy (Raghu.Reddy@noaa.gov)
NCEP MTT Meeting
Placement example
• GFS, T574 case, with 241 MPI ranks
• Nodes=121, PPN=2, NUMTHD=5
• 2 cores left idle on each node (2 ranks × 5 threads = 10 of 12 cores in use)
• Artificial example to illustrate the problem
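The idle-core count follows from simple arithmetic. The sketch below is only an illustration, assuming 12 cores per node (two six-core sockets, as drawn on the node diagram in this deck); the function name `node_usage` is hypothetical and not part of any Zeus tool.

```python
import math

def node_usage(ranks, ppn, numthd, cores_per_node=12):
    """Return (nodes needed, cores used per node, cores idle per node).

    cores_per_node=12 is an assumption taken from the node diagram.
    """
    nodes = math.ceil(ranks / ppn)   # whole nodes needed to hold all ranks
    used = ppn * numthd              # one core per OpenMP thread
    if used > cores_per_node:
        raise ValueError("ppn * numthd exceeds the cores on a node")
    return nodes, used, cores_per_node - used

nodes, used, idle = node_usage(ranks=241, ppn=2, numthd=5)
print(nodes, used, idle)
```

For the case on this slide the function returns 121 nodes, 10 cores in use, and 2 idle cores per node.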
Default or “basic” omplace
    qsub -l nodes=121:ppn=2 jobfile
    mpiexec_mpt -np 241 omplace -nt 5 ./a.out
One Zeus node
[Figure: one node with two sockets, six cores (C) per socket]
• Two sockets
• Each with its own memory (12 GB each, 24 GB total)
• QPI makes it possible for one socket to access the other's memory
• Non-Uniform Memory Access (NUMA)
  • Different bandwidth
  • Different latency
• Memory bandwidth is shared by the 6 cores on a socket
Omplace with bs and st set
Controlling placement
• bs = (ppn*numthds)/2
• mpiexec_mpt -np 241 omplace -c 0-:bs=$bs+st=6 -nt 5 ./a.out
• -c: CPU list
• bs: block size
• st: stride
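To see which cores a CPU list such as `0-:bs=5+st=6` would select, the following sketch expands it for a single 12-core node. It is not omplace itself: the helper names are hypothetical, and the semantics (take `bs` consecutive CPUs, then start the next block `st` CPUs after the start of the previous one) are an assumption about how the block size and stride are interpreted.

```python
def block_size(ppn, numthds, sockets=2):
    """bs = (ppn * numthds) / 2 from the slide: cores used per socket."""
    return (ppn * numthds) // sockets

def expand_cpulist(bs, st, ncpus=12, start=0):
    """CPUs selected by 'start-:bs=<bs>+st=<st>' on a node with ncpus cores.

    Assumed semantics: blocks of bs consecutive CPUs, with block starts
    spaced st CPUs apart. ncpus=12 is an assumption about the node.
    """
    cpus = []
    for base in range(start, ncpus, st):
        cpus.extend(c for c in range(base, base + bs) if c < ncpus)
    return cpus

bs = block_size(ppn=2, numthds=5)
print(bs, expand_cpulist(bs, st=6))
```

With ppn=2 and numthds=5 this gives bs=5 and CPUs 0-4 and 6-10, so under these assumptions each rank and its five threads stay within one six-core block and one core per block is left idle.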
Impact of Proper Placement?
• Depends on the application
• If not memory-bandwidth limited, minimal impact
  • DGEMM
• Memory-bandwidth-limited applications will see good benefit
  • STREAM
• Most applications fall in between
Impact of Placement: Kernels
• HPCC Benchmark
• MPI benchmark to characterize performance
• “Single” (we will ignore this for today)
• “Star” – Independent copy per core
Controlling placement: GFS
• bs = (ppn*numthds)/2
• mpiexec_mpt -np 241 omplace -c 0-:bs=$bs+st=6 -nt 5 ./a.out
    fe2% dd 20:13:36 20:17:39 nt-5-ppn-2-noomplace
    243 nt-5-ppn-2-noomplace
    fe2% dd 16:26:01 16:28:47 nt-5-ppn-2
    166 nt-5-ppn-2
    fe2% dd 16:30:32 16:33:00 nt-5-ppn-2-bs-5
    148 nt-5-ppn-2-bs-5
    fe2%
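The `dd` command in the transcript appears to be a local helper that prints the elapsed time between two wall-clock stamps in seconds. That reading is an assumption, but the arithmetic matches every number shown. A small sketch that reproduces it (the function name is illustrative):

```python
from datetime import datetime

def elapsed_seconds(start, end, fmt="%H:%M:%S"):
    """Seconds between two same-day wall-clock times given as HH:MM:SS."""
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())

# Start/end stamps copied from the transcript on the slide.
runs = {
    "nt-5-ppn-2-noomplace": ("20:13:36", "20:17:39"),
    "nt-5-ppn-2": ("16:26:01", "16:28:47"),
    "nt-5-ppn-2-bs-5": ("16:30:32", "16:33:00"),
}
for name, (start, end) in runs.items():
    print(elapsed_seconds(start, end), name)
```

The three runs come out at 243 s without omplace, 166 s with basic omplace, and 148 s with bs=5.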
A More Practical Example
• Your MPI (non-threaded) application needs more memory than is available per core
• So you have to use ppn=6 instead of ppn=12
• If you run it without omplace, all ranks are placed on one socket
  • 6 ranks on one socket, 0 ranks on the second!
• Use bs = 3 (ppn*numthd/2)
  • This puts 3 ranks on each socket
  • Improves memory bandwidth
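The effect on memory can be sketched with the node figures given earlier (two sockets, 12 GB each). This is a rough illustration only: it assumes CPUs 0-5 sit on the first socket and 6-11 on the second, ignores memory used by the operating system, and uses hypothetical helper names.

```python
from collections import Counter

def ranks_per_socket(cpus, cores_per_socket=6):
    """Count ranks per socket, assuming CPU n lives on socket n // 6."""
    return dict(Counter(cpu // cores_per_socket for cpu in cpus))

def local_gb_per_rank(ranks_on_socket, socket_gb=12.0):
    """Socket-local memory available to each rank sharing one socket."""
    return socket_gb / ranks_on_socket

placements = {
    "no omplace (assumed: CPUs filled in order)": [0, 1, 2, 3, 4, 5],
    "bs=3, st=6 (assumed expansion)": [0, 1, 2, 6, 7, 8],
}
for label, cpus in placements.items():
    counts = ranks_per_socket(cpus)
    busiest = max(counts.values())
    print(label, counts, local_gb_per_rank(busiest), "GB local per rank")
```

Under these assumptions, six ranks packed on one socket have 2 GB of local memory each and must reach across QPI for anything more, while three ranks per socket have 4 GB of local memory each.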
Test run of GFS (no OpenMP)
• GFS, T574 case, with 241 MPI ranks
• Nodes=41, PPN=6, NUMTHD=1
    noomp runs
    ------------
    fe2% dd 13:10:50 13:16:35 nt-1-ppn-6
    345 nt-1-ppn-6
    fe2% dd 13:26:55 13:32:13 nt-1-ppn-6-nt-2
    318 nt-1-ppn-6-nt-2
    fe2% dd 13:57:15 14:02:27 nt-1-ppn-6-bs-3
    312 nt-1-ppn-6-bs-3
    fe2%
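To put the run times on a common scale, the reduction relative to the first run can be computed directly from the numbers in the transcript. A minimal sketch, with a hypothetical function name and the run labels copied as they appear:

```python
def time_saved_percent(baseline_s, run_s):
    """Reduction in run time relative to the baseline, in percent."""
    return 100.0 * (baseline_s - run_s) / baseline_s

baseline = 345  # seconds, nt-1-ppn-6
for name, secs in [("nt-1-ppn-6-nt-2", 318), ("nt-1-ppn-6-bs-3", 312)]:
    print(f"{name}: {time_saved_percent(baseline, secs):.1f}% less time")
```

For this non-threaded case the savings are roughly 8% and 10%, noticeably smaller than in the threaded example with idle cores.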
GFS: Nodes=41, PPN=6, THDS=2
    fe2% dd 14:51:07 14:54:52 nt-2-ppn-6      (no TAU)
    225 nt-2-ppn-6
    fe2% dd 14:27:57 14:32:17 tau-3files-nt-2-ppn-6
    260 tau-3files-nt-2-ppn-6
    fe2%
Summary
• Under certain circumstances, proper placement can be beneficial
• In general, if you are using all the available cores, placement may not be important
• It can be significant if you are leaving some cores idle
• Especially so if you idle cores and end up using “remote” memory
Questions?
• Thanks!