
BG/L Application Tuning and Lessons Learned Bob Walkup IBM Watson Research Center


Presentation Transcript


  1. BG/L Application Tuning and Lessons Learned
Bob Walkup
IBM Watson Research Center

  2. Performance Decision Tree

  3. Timing Summary from MPI Wrappers

Data for MPI rank 0 of 32768:
Times and statistics from MPI_Init() to MPI_Finalize().
---------------------------------------------------------
MPI Routine          #calls    avg. bytes    time(sec)
---------------------------------------------------------
MPI_Comm_size             3           0.0        0.000
MPI_Comm_rank             3           0.0        0.000
MPI_Sendrecv           2816      112084.3       23.197
MPI_Bcast                 3          85.3        0.000
MPI_Gather             1626         104.2        0.579
MPI_Reduce               36         207.2        0.028
MPI_Allreduce           679       76586.3       19.810
---------------------------------------------------------
total communication time = 43.614 seconds.
total elapsed time = 302.099 seconds.
top of the heap address = 84.832 MBytes.
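A wrapper library obtains numbers like these through the standard PMPI profiling interface: each MPI routine is intercepted, timed, and forwarded to its PMPI_ equivalent. A minimal sketch of the mechanism follows; the counter names and the single wrapped routine are illustrative, and a real trace library wraps every routine and records message sizes as well. (The const qualifiers follow modern MPI headers; MPI-1-era prototypes used plain void*.)

    /* Minimal MPI wrapper using the standard PMPI profiling interface.
       Illustrative sketch: wraps one routine and reports at finalize. */
    #include <mpi.h>
    #include <stdio.h>

    static long   allreduce_calls = 0;
    static double allreduce_time  = 0.0;

    int MPI_Allreduce(const void *sendbuf, void *recvbuf, int count,
                      MPI_Datatype datatype, MPI_Op op, MPI_Comm comm)
    {
        double t0 = PMPI_Wtime();
        int rc = PMPI_Allreduce(sendbuf, recvbuf, count, datatype, op, comm);
        allreduce_time += PMPI_Wtime() - t0;
        allreduce_calls++;
        return rc;
    }

    int MPI_Finalize(void)
    {
        int rank;
        PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0)   /* report for rank 0, matching the summary above */
            printf("MPI_Allreduce %8ld calls %10.3f sec\n",
                   allreduce_calls, allreduce_time);
        return PMPI_Finalize();
    }

Because each rank keeps its own counters, the parallel job itself can compute min/max/median statistics across ranks before writing a single summary file.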

  4. Compute-Bound => Use gprof/Xprofiler

Compile and link with -g -pg. Optionally link with libmpitrace.a to limit the profiler output: profile data is kept only for rank 0 and the ranks with minimum, maximum, and median communication time. The analysis is the same for serial and parallel codes.
gprof => subroutine-level profile.
Xprofiler => statement-level profile: clock ticks tied to source lines.

  5. Gprof Example: GTC Flat Profile

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds   calls   s/call   s/call  name
37.43    144.07    144.07     201     0.72     0.72  chargei
25.44    241.97     97.90     200     0.49     0.49  pushi
 6.12    265.53     23.56                            _xldintv
 4.85    284.19     18.66                            cos
 4.49    301.47     17.28                            sinl
 4.19    317.61     16.14     200     0.08     0.08  poisson
 3.79    332.18     14.57                            _pxldmod
 3.55    345.86     13.68                            _ieee754_exp

Time is concentrated in two routines and in intrinsic functions: good prospects for tuning.

  6. Statement-Level Profile: GTC pushi

Line  ticks  source
 115         do m=1,mi
 116    657    r=sqrt(2.0*zion(1,m))
 117    136    rinv=1.0/r
 118     34    ii=max(0,min(mpsi-1,int((r-a0)*delr)))
 119     55    ip=max(1,min(mflux,1+int((r-a0)*d_inv)))
 120    194    wp0=real(ii+1)-(r-a0)*delr
 121     52    wp1=1.0-wp0
 122    104    tem=wp0*temp(ii)+wp1*temp(ii+1)
 123     86    q=q0+q1*r*ainv+q2*r*r*ainv*ainv
 124    166    qinv=1.0/q
 125     68    cost=cos(zion(2,m))
 126     18    sint=sin(zion(2,m))
 129    104    b=1.0/(1.0+r*cost)

Expensive operations like sqrt, reciprocal, cos, sin, ... can be pipelined. This requires either a compiler option (-qhot=vector) or hand-tuning.
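One common hand-tuning pattern is to strip-mine the loop and batch the expensive math calls so they pipeline. A minimal sketch in C, assuming vector routines with the MASSV calling convention vcos(y, x, &n); the function and array names here are illustrative, not taken from GTC:

    /* Strip-mine a particle loop so cos/sin are computed in blocks,
       letting a vector math library (e.g. IBM MASSV) pipeline them.
       The vcos/vsin signatures below are an assumption to verify. */
    extern void vcos(double *y, double *x, int *n);
    extern void vsin(double *y, double *x, int *n);

    #define BLK 64

    void push_block(int mi, double *theta, double *cost, double *sint)
    {
        for (int m0 = 0; m0 < mi; m0 += BLK) {
            int n = (mi - m0 < BLK) ? (mi - m0) : BLK;
            vcos(&cost[m0], &theta[m0], &n);  /* batched cos */
            vsin(&sint[m0], &theta[m0], &n);  /* batched sin */
            /* remaining per-particle arithmetic then uses cost[], sint[] */
        }
    }

The same pattern applies to the reciprocal and sqrt operations via vrec/vsqrt-style routines; this is effectively what -qhot=vector automates.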

  7. Compiler Listing Example: GTC chargei

Source section:
  55 |! inner flux surface
  56 | im=ii
  57 | tdum=pi2_inv*(tflr-zetatmp*qtinv(im))+10.0
  58 | tdum=(tdum-aint(tdum))*delt(im)
  59 | j00=max(0,min(mtheta(im)-1,int(tdum)))
  60 | jtion0(larmor,m)=igrid(im)+j00
  61 | wtion0(larmor,m)=tdum-real(j00)

Register section:
  GPR's set/used: ssss ssss ssss s-ss ssss ssss ssss ssss
  FPR's set/used: ssss ssss ssss ssss ssss ssss ssss ssss
                  ssss ssss ssss ss-- ---- ---- ---s s--s

Assembler section:
  50| 000E04 stw    0  ST4A  #SPILL52(gr31,520)=gr4
  58| 000E08 bl     0  CALLN fp1=_xldintv,0,fp1,...
  59| 000E0C mullw  2  M     gr3=gr19,gr15
  58| 000E10 rlwinm 1  SLL4  gr5=gr15,2

Issues: a function call for aint(), register spills, pipelining, ...
Get a listing with source code: -qlist -qsource
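The aint() truncation on line 58 is the function call flagged above (_xldintv). When the operand is known to fit in integer range, as the +10.0 shift on line 57 arranges, the call can be replaced by a cast. A small illustration of the idea; the helper is hypothetical, not the actual fix applied to GTC:

    /* Fractional part without a function call: replaces
       tdum - aint(tdum). Valid only when tdum fits in an int,
       which the +10.0 shift in the source guarantees here. */
    double frac_part(double tdum)
    {
        return tdum - (double)(int)tdum;
    }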

  8. GTC - Performance on Blue Gene/L

Original code: main loop time = 384 sec (512 nodes, coprocessor mode)
Tuned code:    main loop time = 244 sec (512 nodes, coprocessor mode)
Tuning gives a factor of ~1.6 performance improvement.

Weak scaling, relative performance per processor:
#nodes   coprocessor   virtual-node
   512      1.000         0.974
  1024      1.002         0.961
  2048      0.985         0.963
  4096      1.002         0.956
  8192      1.009         0.935
 16384      0.968         NAN

  9. Daxpy: y(:) = a*x(:) + y(:), with compiler-generated code

  10. Getting Double-FPU Code Generation

Use library routines (BLAS, vector intrinsics, FFTs, ...).
Try compiler options: -O3 -qarch=440d (-qlist -qsource), or -O3 -qarch=440d -qhot=simd.
Add alignment assertions:
Fortran: call alignx(16, array(1))
C: __alignx(16, array);
Try double-FPU intrinsics:
Fortran: loadfp(A), fpadd(a,b), fpmadd(c,b,a), ...
C: __lfpd(address), __fpadd(a,b), __fpmadd(c,b,a)
Or write assembler code.
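As a concrete illustration of the intrinsics, here is a hedged daxpy kernel (y = a*x + y, as on slide 9) written with the XL C double-FPU built-ins. The __lfpd/__stfpd/__fpmadd/__cmplx names and the double _Complex pair type follow IBM's XL documentation for -qarch=440d, but treat the exact signatures and operand order as assumptions to check against the compiler manual:

    /* daxpy with BG/L double-FPU (parallel pair) intrinsics: y = a*x + y.
       Assumes 16-byte aligned arrays and even n. XL C maps a register
       pair onto double _Complex; signatures and operand order per IBM
       XL docs for -qarch=440d (assumption - verify with your compiler). */
    void daxpy_440d(int n, double a, double *x, double *y)
    {
        __alignx(16, x);                           /* alignment assertions */
        __alignx(16, y);
        double _Complex pa = __cmplx(a, a);        /* replicate scalar a   */
        for (int i = 0; i < n; i += 2) {
            double _Complex px = __lfpd(&x[i]);    /* load pair x(i:i+1)   */
            double _Complex py = __lfpd(&y[i]);    /* load pair y(i:i+1)   */
            __stfpd(&y[i], __fpmadd(py, px, pa));  /* y += a*x, pairwise   */
        }
    }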

  11. Optimizing Communication Performance

3D torus network => bandwidth is degraded when traffic travels many hops and shares links => stay local if possible.
Example: a 2D domain on 1024 nodes (8x8x16 torus): try a 16x64 layout with BGLMPI_MAPPING=ZXYT.
Example: QCD codes with a logical 4D decomposition: try Px = torus x dimension (same for y and z), and Pt = 2 (for virtual-node mode).
Layout optimizer: record the communication matrix, then minimize the cost function to obtain a good mapping (sketched below). Currently limited to modest torus dimensions.
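The cost function in question is essentially total traffic weighted by torus distance. A minimal sketch, assuming a P x P communication matrix comm[i][j] (bytes sent from rank i to rank j) and a candidate assignment of ranks to torus coordinates; all names here are illustrative, not the actual layout-optimizer interface:

    /* Cost of a candidate mapping: sum over rank pairs of
       bytes * torus hop count. Minimizing this places
       heavily-communicating ranks on nearby torus nodes. */
    #include <stdlib.h>

    typedef struct { int x, y, z; } Coord;

    static int hops1(int a, int b, int size)   /* distance with wraparound */
    {
        int d = abs(a - b);
        return (d < size - d) ? d : size - d;
    }

    double mapping_cost(int P, double **comm, Coord *map,
                        int NX, int NY, int NZ)
    {
        double cost = 0.0;
        for (int i = 0; i < P; i++)
            for (int j = 0; j < P; j++) {
                int h = hops1(map[i].x, map[j].x, NX)
                      + hops1(map[i].y, map[j].y, NY)
                      + hops1(map[i].z, map[j].z, NZ);
                cost += comm[i][j] * h;
            }
        return cost;
    }

A greedy or annealing-style search over map[] is then feasible for modest torus dimensions, which matches the limitation noted above.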

  12. Finding Communication Problems

  13. POP Communication Time: 1920 Processors (40x48 decomposition)

  14. Some Experience with Performance Tools

Follow the decision tree; don't get buried in data.
Get details for just a subset of MPI ranks.
Use the parallel job itself for data analysis (min, max, median, etc.).
For applications that repeat the same work, sample or trace just a few repeat units.
Save cumulative performance data for all ranks in one file.

  15. Some Lessons Learned

Text core files are great, as long as you get the call stack (compile and link with -g).
Use addr2line to go from an instruction address to a source file and line number; it is a standard GNU binutils tool.
Use the backtrace() library routine, a standard GNU libc facility.
Wrappers for exit() and abort() can make normal exits report the call stack as well (a sketch follows below).
What do you do when >10**4 processes are hung? Halt the cores, dump the stacks, write separate text core files, and use grep: grep -L reports which of the >10**4 core files was NOT stopped in an MPI routine, and grep piped to wc (word count) gives quick tallies.
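A minimal sketch of the call-stack-on-exit idea, using glibc's backtrace() and backtrace_symbols_fd() (both standard GNU libc). Here the hook is a simple atexit() handler rather than true interposition on exit()/abort():

    /* Print the call stack on normal exit (glibc-specific).
       Compile and link with -g so addr2line can resolve the
       printed addresses back to source files and line numbers. */
    #include <execinfo.h>
    #include <stdlib.h>
    #include <unistd.h>

    static void dump_stack(void)
    {
        void *addrs[64];
        int depth = backtrace(addrs, 64);
        backtrace_symbols_fd(addrs, depth, STDERR_FILENO);
    }

    int main(void)
    {
        atexit(dump_stack);   /* every normal exit reports the stack */
        /* ... application ... */
        return 0;
    }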

  16. Lesson: Flash Flood

if (task == 0)
    for (t = 1, ..., P-1) recv data from task t
else
    send data to task 0
=> results in a flood of messages at task 0.

Add flow control ... slow but safe (an MPI sketch follows below):

if (task == 0)
    for (t = 1, ..., P-1) { send a flag to task t; then recv data from task t }
else
    { recv a flag from task 0; then send data to task 0 }
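In MPI terms, the safe version looks roughly like this; a sketch, with illustrative tags and an MPI_BYTE payload:

    /* Gather-to-rank-0 with explicit flow control: rank 0 grants one
       sender at a time a token before posting its receive, so it is
       never hit by P-1 simultaneous incoming messages. buf on rank 0
       must hold (P-1+1) * nbytes bytes. */
    #include <mpi.h>

    void gather_with_flow_control(void *buf, int nbytes,
                                  void *mine, MPI_Comm comm)
    {
        int rank, P, flag = 1;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &P);
        if (rank == 0) {
            for (int t = 1; t < P; t++) {
                MPI_Send(&flag, 1, MPI_INT, t, 0, comm);     /* token */
                MPI_Recv((char *)buf + (size_t)t * nbytes, nbytes,
                         MPI_BYTE, t, 1, comm, MPI_STATUS_IGNORE);
            }
        } else {
            MPI_Recv(&flag, 1, MPI_INT, 0, 0, comm, MPI_STATUS_IGNORE);
            MPI_Send(mine, nbytes, MPI_BYTE, 0, 1, comm);
        }
    }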

  17. Lesson: P*P => Can't Scale

integer table(P,P) requires 1 GB for P = 16K (16384**2 entries x 4 bytes = 1 GiB).
The memory requirement limits scalability: example, Metis.
One can sometimes replace table(P,P) with local(P) and remote(P) plus communication to fetch values stored elsewhere (a sketch follows below).
Some computational algorithms also do O(P*P) work, which can limit scaling: example, certain methods for automatic mesh refinement.
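A hedged sketch of the replacement: each rank keeps only its own row of the table, and an entry owned by another rank is fetched on demand. One-sided MPI (MPI_Get) is used here as a standard mechanism; the original applications may have used ordinary sends and receives instead:

    /* Replace the replicated P x P table with one row per rank.
       Each rank owns local[P] (its row); the entry table(i,j) held
       by rank i is fetched on demand. Memory per rank: O(P), not
       O(P*P). */
    #include <mpi.h>

    /* Setup: each rank exposes its row local[P] in an RMA window. */
    MPI_Win make_table_window(int *local, int P)
    {
        MPI_Win win;
        MPI_Win_create(local, (MPI_Aint)P * sizeof(int), sizeof(int),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);
        return win;
    }

    /* Fetch table(i,j), i.e. element j of the row owned by rank i. */
    int get_entry(int i, int j, MPI_Win win)
    {
        int value;
        MPI_Win_lock(MPI_LOCK_SHARED, i, 0, win);
        MPI_Get(&value, 1, MPI_INT, i, j, 1, MPI_INT, win);
        MPI_Win_unlock(i, win);
        return value;
    }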

  18. More Processors = More Fun

Looking forward to the petaflop scale …
