Code Tuning and Parallelization on Boston University's Scientific Computing Facility
Doug Sondak
sondak@bu.edu
Boston University Scientific Computing and Visualization
Outline • Introduction • Timing • Profiling • Cache • Tuning • Timing/profiling exercise • Parallelization
Introduction • Tuning • Where is most time being used? • How to speed it up • Often as much art as science • Parallelization • After serial tuning, try parallel processing • MPI • OpenMP
Timing • When tuning/parallelizing a code, need to assess effectiveness of your efforts • Can time whole code and/or specific sections • Some types of timers • unix time command • function/subroutine calls • profiler
CPU or Wall-Clock Time? • both are useful • for parallel runs, really want wall-clock time, since CPU time will be about the same or even increase as number of procs. is increased • CPU time doesn’t account for wait time • wall-clock time may not be accurate if sharing processors • wall-clock timings should always be performed in batch mode
Unix Time Command • easiest way to time code • simply type time before your run command • output differs between C-type shells (csh, tcsh) and Bourne-type shells (bsh, bash, ksh)
Unix time Command (cont'd) • tcsh results:

  twister:~ % time mycode
  1.570u 0.010s 0:01.77 89.2% 75+1450k 0+0io 64pf+0w

• 1.570u = user CPU time (s)
• 0.010s = system CPU time (s)
• 0:01.77 = wall-clock time (s)
• 89.2% = (user + system CPU)/wall-clock
• 75+1450k = avg. shared + unshared text space
• 0+0io = input + output operations
• 64pf+0w = page faults + no. of times proc. was swapped
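A quick check of the percentage field using the numbers above: (1.570 + 0.010) / 1.77 ≈ 0.89, i.e. the 89.2% figure is (user + system CPU time) divided by wall-clock time; the slight difference is just rounding in the printed values.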
Unix Time Command (3) • bsh results:

  $ time mycode

  Real   1.62
  User   1.57
  System 0.03

• Real = wall-clock time (s) • User = user CPU time (s) • System = system CPU time (s)
Function/Subroutine Calls • often need to time part of code • timers can be inserted in source code • language-dependent
cpu_time • intrinsic subroutine in Fortran • returns user CPU time (in seconds) • no system time is included • 0.01 sec. resolution on p-series

  real :: t1, t2
  call cpu_time(t1)
  ... do stuff to be timed ...
  call cpu_time(t2)
  print*, 'CPU time = ', t2-t1, ' sec.'
system_clock • intrinsic subroutine in Fortran • good for measuring wall-clock time • on p-series: • resolution is 0.01 sec. • max. time is 24 hr.
system_clock (cont'd) • t1 and t2 are tic counts • count_rate is an optional argument containing tics/sec.

  integer :: t1, t2, count_rate
  call system_clock(t1, count_rate)
  ... do stuff to be timed ...
  call system_clock(t2)
  print*, 'wall-clock time = ', &
          real(t2-t1)/real(count_rate), ' sec'
times • can be called from C to obtain CPU time • 0.01 sec. resolution on p-series • can also get system time with tms_stime

  #include <stdio.h>
  #include <sys/times.h>
  #include <unistd.h>

  int main(){
      int tics_per_sec;
      float tic1, tic2;
      struct tms timedat;

      tics_per_sec = sysconf(_SC_CLK_TCK);
      times(&timedat);
      tic1 = timedat.tms_utime;
      /* ... do stuff to be timed ... */
      times(&timedat);
      tic2 = timedat.tms_utime;
      printf("CPU time = %5.2f\n",
             (float)(tic2-tic1)/(float)tics_per_sec);
      return 0;
  }
gettimeofday • can be called from C to obtain wall-clock time • msec resolution on p-series

  #include <stdio.h>
  #include <sys/time.h>

  int main(){
      struct timeval t;
      double t1, t2;

      gettimeofday(&t, NULL);
      t1 = t.tv_sec + 1.0e-6*t.tv_usec;
      /* ... do stuff to be timed ... */
      gettimeofday(&t, NULL);
      t2 = t.tv_sec + 1.0e-6*t.tv_usec;
      printf("wall-clock time = %5.3f\n", t2-t1);
      return 0;
  }
MPI_Wtime • convenient wall-clock timer for MPI codes • msec resolution on p-series
MPI_Wtime (cont'd) • Fortran

  double precision t1, t2
  t1 = mpi_wtime()
  ... do stuff to be timed ...
  t2 = mpi_wtime()
  print*, 'wall-clock time = ', t2-t1

• C

  double t1, t2;
  t1 = MPI_Wtime();
  /* ... do stuff to be timed ... */
  t2 = MPI_Wtime();
  printf("wall-clock time = %5.3f\n", t2-t1);
omp_get_wtime • convenient wall-clock timer for OpenMP codes • resolution available by calling omp_get_wtick() • 0.01 sec. resolution on p-series
omp_get_wtime (cont'd) • Fortran

  double precision t1, t2, omp_get_wtime
  t1 = omp_get_wtime()
  ... do stuff to be timed ...
  t2 = omp_get_wtime()
  print*, 'wall-clock time = ', t2-t1

• C

  double t1, t2;
  t1 = omp_get_wtime();
  /* ... do stuff to be timed ... */
  t2 = omp_get_wtime();
  printf("wall-clock time = %5.3f\n", t2-t1);
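A minimal C sketch (not from the slides) of how the resolution mentioned above can be queried with omp_get_wtick; compile with your compiler's OpenMP flag:

  #include <stdio.h>
  #include <omp.h>

  int main(){
      /* omp_get_wtick returns the number of seconds between
         successive ticks of the omp_get_wtime clock */
      printf("omp_get_wtime resolution = %g sec.\n", omp_get_wtick());
      return 0;
  }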
Profilers • profile tells you how much time is spent in each routine • various profilers available, e.g. • gprof (GNU) • pgprof (Portland Group) • Xprofiler (AIX)
gprof • compile with -pg • file gmon.out will be created when you run • gprof executable > myprof • for multiple procs. (MPI), copy or link gmon.out.n to gmon.out, then run gprof
gprof (cont'd)

  ngranularity: Each sample hit covers 4 bytes. Time: 435.04 seconds

                                      called/total      parents
  index  %time    self  descendents  called+self    name         index
                                      called/total      children

                  0.00      340.50       1/1          .__start [2]
  [1]     78.3    0.00      340.50       1            .main [1]
                  2.12      319.50      10/10         .contrl [3]
                  0.04        7.30      10/10         .force [34]
                  0.00        5.27       1/1          .initia [40]
                  0.56        3.43       1/1          .plot3da [49]
                  0.00        1.27       1/1          .data [73]
gprof (3)

  ngranularity: Each sample hit covers 4 bytes. Time: 435.04 seconds

    %    cumulative    self                self     total
   time    seconds    seconds      calls  ms/call  ms/call  name
   20.5      89.17      89.17         10  8917.00 10918.00  .conduct [5]
    7.6     122.34      33.17        323   102.69   102.69  .getxyz [8]
    7.5     154.77      32.43                               .__mcount [9]
    7.2     186.16      31.39     189880     0.17     0.17  .btri [10]
    7.2     217.33      31.17                               .kickpipes [12]
    5.1     239.58      22.25  309895200     0.00     0.00  .rmnmod [16]
    2.3     249.67      10.09        269    37.51    37.51  .getq [24]
pgprof • compile with Portland Group compiler • pgf95 (pgf90, etc.) • pgcc • -Mprof=func • similar to -pg • run code • pgprof -exe executable • pops up window with flat profile
pgprof (3) • line-level profiling • -Mprof=line • optimizer will re-order lines • profiler will lump lines in some loops or other constructs • you may or may not want to compile without optimization • in flat profile, double-click on function
xprofiler • AIX (twister) has a graphical interface to gprof • compile with -g -pg -Ox • Ox represents whatever level of optimization you're using (e.g., O5) • run code • produces gmon.out file • type xprofiler mycode • mycode is your code run command
xprofiler (3) • filled boxes represent functions or subroutines • "fences" represent libraries • left-click a box to get function name and timing information • right-click on box to get source code or other information
xprofiler (4) • can also get same profiles as from gprof by using menus • report flat profile • report call graph profile
Cache • Cache is a small chunk of fast memory between the main memory and the registers • [diagram: memory hierarchy — registers, primary cache, secondary cache, main memory]
Cache (cont’d) • Variables are moved from main memory to cache in lines • L1 cache line sizes on our machines • Opteron (katana cluster) 64 bytes • Power4 (p-series) 128 bytes • PPC440 (Blue Gene) 32 bytes • Pentium III (linux cluster) 32 bytes • If variables are used repeatedly, code will run faster since cache memory is much faster than main memory
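For a concrete sense of scale (not on the original slide; assuming 8-byte double-precision values): a 64-byte Opteron line holds 8 doubles and a 128-byte Power4 line holds 16, so a contiguous sweep through a double array misses at most once per 8 (or 16) elements, with the remaining accesses served from cache.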
Cache (cont’d) • Why not just make the main memory out of the same stuff as cache? • Expensive • Runs hot • This was actually done in Cray computers • Liquid cooling system
Cache (cont’d) • Cache hit • Required variable is in cache • Cache miss • Required variable not in cache • If cache is full, something else must be thrown out (sent back to main memory) to make room • Want to minimize number of cache misses
Cache example • "mini" cache holds 2 lines, 4 words each

  for(i=0; i<10; i++)
      x[i] = i

• [diagram: main memory holds x[0] through x[9], a, b; the cache is initially empty]
Cache example (cont'd) • We will ignore i for simplicity • need x[0], not in cache → cache miss • load line from memory into cache • next 3 loop indices result in cache hits • [diagram: cache now holds the line x[0] x[1] x[2] x[3]]
Cache example (cont'd) • need x[4], not in cache → cache miss • load line from memory into cache • next 3 loop indices result in cache hits • [diagram: cache now holds the lines x[0]..x[3] and x[4]..x[7]]
Cache example (cont'd) • need x[8], not in cache → cache miss • load line from memory into cache • no room in cache! • replace an old line • [diagram: the x[0]..x[3] line is evicted; cache now holds x[8] x[9] a b and x[4]..x[7]]
Cache (cont’d) • Contiguous access is important • In C, multidimensional array is stored in memory as a[0][0] a[0][1] a[0][2] …
Cache (cont’d) • In Fortran and Matlab, multidimensional array is stored the opposite way: a(1,1) a(2,1) a(3,1) …
Cache (cont'd) • Rule: Always order your loops appropriately • will usually be taken care of by the optimizer • suggestion: don't rely on the optimizer!

C:
  for(i=0; i<N; i++){
      for(j=0; j<N; j++){
          a[i][j] = 1.0;
      }
  }

Fortran:
  do j = 1, n
      do i = 1, n
          a(i,j) = 1.0
      enddo
  enddo
Tuning Tips • Some of these tips will be taken care of by compiler optimization • It’s best to do them yourself, since compilers vary
Tuning Tips (cont'd) • Access arrays in contiguous order • For multi-dimensional arrays, rightmost index varies fastest for C and C++, leftmost for Fortran and Matlab

Bad:
  for(j=0; j<N; j++){
      for(i=0; i<N; i++){
          a[i][j] = 1.0;
      }
  }

Good:
  for(i=0; i<N; i++){
      for(j=0; j<N; j++){
          a[i][j] = 1.0;
      }
  }
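A minimal, self-contained C sketch (not from the slides; the array size and use of gettimeofday are illustrative) that times the Bad and Good orderings above; compile without aggressive optimization so the compiler doesn't reorder the loops for you:

  #include <stdio.h>
  #include <sys/time.h>

  #define N 2000

  static double a[N][N];                 /* 2000 x 2000 doubles = 32 MB */

  static double wall_time(void){         /* wall-clock seconds */
      struct timeval t;
      gettimeofday(&t, NULL);
      return t.tv_sec + 1.0e-6*t.tv_usec;
  }

  int main(){
      int i, j;
      double t1, t2;

      t1 = wall_time();
      for(j=0; j<N; j++)                 /* Bad: leftmost index varies fastest */
          for(i=0; i<N; i++)
              a[i][j] = 1.0;
      t2 = wall_time();
      printf("column order (Bad):  %6.3f s\n", t2-t1);

      t1 = wall_time();
      for(i=0; i<N; i++)                 /* Good: rightmost index varies fastest */
          for(j=0; j<N; j++)
              a[i][j] = 1.0;
      t2 = wall_time();
      printf("row order    (Good): %6.3f s\n", t2-t1);

      return 0;
  }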
Tuning Tips (3) • Eliminate redundant operations in loops

Bad:
  for(i=0; i<N; i++){
      x = 10;
      ...
  }

Good:
  x = 10;
  for(i=0; i<N; i++){
      ...
  }
Tuning Tips (4) • Eliminate if statements within loops • They may inhibit pipelining

  for(i=0; i<N; i++){
      if(i==0)
          perform i=0 calculations
      else
          perform i>0 calculations
  }
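One common fix is to peel the special iteration out of the loop so the remaining iterations run branch-free. A minimal C sketch (not from the slides; the recurrence in the loop body is a made-up illustration):

  #include <stdio.h>

  #define N 10

  int main(){
      double x[N];
      int i;

      /* i=0 case handled once, outside the loop */
      x[0] = 1.0;

      /* i>0 cases: the loop now contains no if statement */
      for(i=1; i<N; i++)
          x[i] = 2.0*x[i-1];

      printf("x[N-1] = %g\n", x[N-1]);
      return 0;
  }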