Code Tuning and Optimization Doug Sondak sondak@bu.edu Boston University Scientific Computing and Visualization
Information Services & Technology Outline Introduction Example code Timing Profiling Cache Tuning
Introduction
• Timing
  • Where is most time being used?
• Tuning
  • How to speed it up
  • Often as much art as science
• Parallel Performance
  • How to assess how well parallelization is working
Example Code
Example Code
• Simulation of response of eye to stimuli
• Response is affected by adjacent inputs
  • A dark area next to a bright area makes the bright area look brighter
• Based on Grossberg & Todorovic paper
  • Appendix in paper contains all equations
  • Errors in eqns. (A4) and (A5): cross out “log2”
• Paper contains 6 levels of response
• Our code only contains levels 1 through 5
  • Level 6 takes a long time to compute, and would skew our timings!
Example Code (cont’d)
• All calculations done on a square array
• Array size and other constants are defined in gt.h (C) or in the “mods” module at the top of the code (Fortran)
• Due to nature of algorithm, array is padded on all sides
  • npad is size of padding
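A minimal sketch of what "padded on all sides" can look like in C. The sizes N and NPAD and all function names are hypothetical, chosen only for illustration; the real constants live in gt.h, and the actual code may store the array differently.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical sizes for illustration only; the real values are set
   in gt.h (C) or the "mods" module (Fortran). */
enum { N = 8, NPAD = 2 };

/* Side length of the square array once npad cells are added on
   every side. */
int padded_dim(int n, int npad)
{
    assert(n > 0 && npad >= 0);
    return n + 2 * npad;
}

/* Flat row-major offset of interior cell (i, j), 0 <= i, j < n.
   The interior is shifted by npad in both directions. */
size_t interior_index(int i, int j, int n, int npad)
{
    assert(i >= 0 && i < n && j >= 0 && j < n);
    return (size_t)(i + npad) * (size_t)padded_dim(n, npad)
         + (size_t)(j + npad);
}

/* Allocate a zero-filled padded array; the caller frees it. */
double *alloc_padded(int n, int npad)
{
    size_t dim = (size_t)padded_dim(n, npad);
    return calloc(dim * dim, sizeof(double));
}
```

With this layout a stencil centered on any interior cell can reach up to npad cells in every direction without bounds checks, which is one common reason for padding.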
Example Code – Level 1
• Luminance (input) distribution
• Paper (and code) use “yin-yang square” with “bright” and “dark” regions (Fig. 4 in paper)
• Array I
  • magnitude of “bright” is ihigh
  • magnitude of “dark” is ilow
Example Code – Level 2
• Circular Concentric On and Off Units (Fig. 5 in paper)
• Excitation and inhibition vary with distance
Level 2 Equations
• Ipq = initial input (yin-yang)
Example Code – Level 3
• Oriented Direction-of-Contrast-Sensitive Units (Fig. 6(d) in paper)
  • Respond to angle
    • 12 discrete angles
  • Respond to direction of contrast, i.e., light-to-dark or dark-to-light
Level 3 Equations
Example Code – Level 4
• Oriented Direction-of-Contrast-Insensitive Units (Fig. 8(a) in paper)
  • Respond to angle
  • Do not respond to direction of contrast, i.e., light-to-dark or dark-to-light
Level 4 Equations
Example Code – Level 5
• Boundary Contour Units (Fig. 8(d) in paper)
• Pool nearby excitations
Level 5 Equation
Timing
• When tuning/parallelizing a code, need to assess effectiveness of your efforts
• Can time whole code and/or specific sections
• Some types of timers
  • unix time command
  • function/subroutine calls
  • profiler
CPU Time or Wall-Clock Time?
• CPU time
  • How much time the CPU is actually crunching away
  • User CPU time
    • Time spent executing your source code
  • System CPU time
    • Time spent in system calls such as i/o
• Wall-clock time
  • What you would measure with a stopwatch
CPU Time or Wall-Clock Time? (cont’d)
• Both are useful
• For serial runs without interaction from keyboard, CPU and wall-clock times are usually close
  • If you prompt for keyboard input, wall-clock time will accumulate if you get a cup of coffee, but CPU time will not
CPU Time or Wall-Clock Time? (3)
• Parallel runs
  • Want wall-clock time, since CPU time will be about the same or even increase as number of procs. is increased
  • Wall-clock time may not be accurate if sharing processors
    • Wall-clock timings should always be performed in batch mode
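The difference between the two kinds of time can be seen with a short C sketch. It uses the standard clock() for CPU time and POSIX clock_gettime() for wall-clock time, which are stand-ins chosen here rather than the timers covered on the later slides; the helper names are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <time.h>

/* CPU time consumed by this process so far, in seconds. */
double cpu_seconds(void)
{
    return (double)clock() / (double)CLOCKS_PER_SEC;
}

/* Wall-clock reading in seconds, taken from a monotonic clock. */
double wall_seconds(void)
{
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC, &ts);
    assert(rc == 0);
    (void)rc;
    return (double)ts.tv_sec + 1.0e-9 * (double)ts.tv_nsec;
}

/* Sleep for ms milliseconds. The process is idle here, much as it
   would be while waiting for keyboard input. */
void idle_ms(long ms)
{
    struct timespec req;
    assert(ms >= 0);
    req.tv_sec = ms / 1000;
    req.tv_nsec = (ms % 1000) * 1000000L;
    nanosleep(&req, NULL);
}
```

Reading both clocks before and after idle_ms(200) should show roughly 0.2 s of wall-clock time and almost no CPU time, which is the "cup of coffee" effect in miniature.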
Unix Time Command
• easiest way to time code
• simply type time before your run command
• output differs between C-type shells (csh, tcsh) and Bourne-type shells (sh, bash, ksh)
Unix Time Command (cont’d)
C-shell results:
    twister:~ % time mycode
    1.570u 0.010s 0:01.77 89.2% 75+1450k 0+0io 64pf+0w
Fields, left to right:
• 1.570u: user CPU time (s)
• 0.010s: system CPU time (s)
• 0:01.77: wall-clock time (min:sec)
• 89.2%: (u+s)/wc
• 75+1450k: avg. shared + unshared text space
• 0+0io: input + output operations
• 64pf+0w: page faults + no. times proc. was swapped
Unix Time Command (3)
Bourne shell results:
    $ time mycode
    Real    1.62
    User    1.57
    System  0.03
• Real: wall-clock time (s)
• User: user CPU time (s)
• System: system CPU time (s)
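The percentage in the C-shell output is just (u+s)/wc. A tiny C sketch of that arithmetic, with a hypothetical function name:

```c
#include <assert.h>

/* Share of wall-clock time during which the CPU was working on the
   job, in percent: 100 * (user + system) / wall. */
double cpu_utilization_pct(double user_s, double system_s, double wall_s)
{
    assert(wall_s > 0.0);
    return 100.0 * (user_s + system_s) / wall_s;
}
```

Calling cpu_utilization_pct(1.570, 0.010, 1.77) gives a little over 89%, in line with the figure the shell printed; small differences can come from rounding in the displayed times.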
Exercise 1
• Copy files from /scratch/sondak/gt
    cp /scratch/sondak/gt/* .
• Choose C (gt.c) or Fortran (gt.f90)
• Compile with no optimization (-O0 is capital oh, zero; -o is small oh):
    pgcc -O0 -o gt gt.c
    pgf90 -O0 -o gt gt.f90
• Submit rungt script to batch queue
    qsub rungt
Exercise 1 (cont’d)
• Check status
    qstat -u username
• After run has completed a file will appear named rungt.o??????, where ?????? represents the job number
• File contains result of time command
• Write down wall-clock time
• Re-compile using -O3
• Re-run and check time
Function/Subroutine Calls
• often need to time part of code
• timers can be inserted in source code
• language-dependent
cpu_time
• intrinsic subroutine in Fortran
• returns user CPU time (in seconds)
  • no system time is included
• 0.01 sec. resolution on p-series

    real :: t1, t2
    call cpu_time(t1)
    ... do stuff to be timed ...
    call cpu_time(t2)
    print*, 'CPU time = ', t2-t1, ' sec.'
system_clock
• intrinsic subroutine in Fortran
• good for measuring wall-clock time
• on p-series:
  • resolution is 0.01 sec.
  • max. time is 24 hr.
system_clock (cont’d)
• t1 and t2 are tic counts
• count_rate is optional argument containing tics/sec.

    integer :: t1, t2, count_rate
    call system_clock(t1, count_rate)
    ... do stuff to be timed ...
    call system_clock(t2)
    print*, 'wall-clock time = ', &
        real(t2-t1)/real(count_rate), ' sec'
times
• can be called from C to obtain CPU time
• 0.01 sec. resolution on p-series
• can also get system time with tms_stime

    #include <stdio.h>
    #include <sys/times.h>
    #include <unistd.h>
    int main(void){
        long tics_per_sec;
        clock_t tic1, tic2;
        struct tms timedat;
        tics_per_sec = sysconf(_SC_CLK_TCK);   /* clock tics per second */
        times(&timedat);
        tic1 = timedat.tms_utime;              /* user CPU tics so far */
        ... do stuff to be timed ...
        times(&timedat);
        tic2 = timedat.tms_utime;
        printf("CPU time = %5.2f\n",
               (double)(tic2-tic1)/(double)tics_per_sec);
        return 0;
    }
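Since times fills in both tms_utime and tms_stime, one call can report user and system CPU time together. A minimal sketch; the struct and function names are hypothetical, and only times, sysconf and the tms fields come from the slide.

```c
#include <assert.h>
#include <sys/times.h>
#include <unistd.h>

/* User and system CPU time of the calling process, in seconds. */
struct cpu_split {
    double user_s;
    double system_s;
};

/* Read both counters with a single call to times(). */
struct cpu_split cpu_split_now(void)
{
    struct tms t;
    struct cpu_split out;
    long tics_per_sec = sysconf(_SC_CLK_TCK);

    assert(tics_per_sec > 0);
    times(&t);
    out.user_s = (double)t.tms_utime / (double)tics_per_sec;
    out.system_s = (double)t.tms_stime / (double)tics_per_sec;
    return out;
}
```

Take one reading before and one after the section of interest and subtract field by field. A section dominated by i/o would be expected to show most of its growth in system_s.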
gettimeofday
• can be called from C to obtain wall-clock time
• msec resolution on p-series

    #include <stdio.h>
    #include <sys/time.h>
    int main(void){
        struct timeval t;
        double t1, t2;
        gettimeofday(&t, NULL);
        t1 = t.tv_sec + 1.0e-6*t.tv_usec;   /* seconds + microseconds */
        ... do stuff to be timed ...
        gettimeofday(&t, NULL);
        t2 = t.tv_sec + 1.0e-6*t.tv_usec;
        printf("wall-clock time = %5.3f\n", t2-t1);
        return 0;
    }
MPI_Wtime
• convenient wall-clock timer for MPI codes
• msec resolution on p-series
MPI_Wtime (cont’d)
Fortran:
    double precision t1, t2
    t1 = mpi_wtime()
    ... do stuff to be timed ...
    t2 = mpi_wtime()
    print*, 'wall-clock time = ', t2-t1
C:
    double t1, t2;
    t1 = MPI_Wtime();
    ... do stuff to be timed ...
    t2 = MPI_Wtime();
    printf("wall-clock time = %5.3f\n", t2-t1);
omp_get_wtime
• convenient wall-clock timer for OpenMP codes
• resolution available by calling omp_get_wtick()
• 0.01 sec. resolution on p-series
omp_get_wtime (cont’d)
Fortran:
    double precision t1, t2, omp_get_wtime
    t1 = omp_get_wtime()
    ... do stuff to be timed ...
    t2 = omp_get_wtime()
    print*, 'wall-clock time = ', t2-t1
C:
    double t1, t2;
    t1 = omp_get_wtime();
    ... do stuff to be timed ...
    t2 = omp_get_wtime();
    printf("wall-clock time = %5.3f\n", t2-t1);
Timer Summary
Exercise 2
• Put wall-clock timer around each “level” in the example code
• Print time for each level
• Compile and run
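One possible shape for the exercise in C, using gettimeofday as the wall-clock timer. This is only a sketch: dummy_level is a hypothetical placeholder, and the real level routines in gt.c will have their own names and arguments, so the timer calls may need to be placed directly around each call instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/time.h>

/* Wall-clock time in seconds, built from gettimeofday. */
double wall_time(void)
{
    struct timeval t;
    gettimeofday(&t, NULL);
    return (double)t.tv_sec + 1.0e-6 * (double)t.tv_usec;
}

/* Run one section of work and return its elapsed wall-clock time.
   The function pointer stands in for a "level" routine. */
double time_section(void (*section)(void))
{
    double t0, t1;
    assert(section != NULL);
    t0 = wall_time();
    section();
    t1 = wall_time();
    return t1 - t0;
}

/* Hypothetical placeholder for the work done in one level. */
void dummy_level(void)
{
    volatile double s = 0.0;
    long i;
    for (i = 0; i < 1000000L; i++)
        s += (double)i;
}

/* Time each level in turn and print one line per level. */
void time_levels(void (*levels[])(void), int nlevels, double elapsed[])
{
    int k;
    for (k = 0; k < nlevels; k++) {
        elapsed[k] = time_section(levels[k]);
        printf("level %d wall-clock time = %8.3f s\n", k + 1, elapsed[k]);
    }
}
```

Keeping the timer in one helper means the resolution and units are the same for every level, so the per-level times can be compared directly.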
Profiling
Profilers
• profile tells you how much time is spent in each routine
• gives a level of granularity not available with previous timers
  • e.g., function may be called from many places
• various profilers available, e.g.
  • gprof (GNU)
  • pgprof (Portland Group)
  • Xprofiler (AIX)
gprof
• compile with -pg
• file gmon.out will be created when you run
• gprof executable > myprof
• for multiple procs. (MPI), copy or link gmon.out.n to gmon.out, then run gprof
gprof (cont’d)
granularity: Each sample hit covers 4 bytes. Time: 435.04 seconds

  %   cumulative    self                 self      total
 time   seconds    seconds      calls   ms/call   ms/call  name
 20.5     89.17      89.17         10   8917.00  10918.00  .conduct [5]
  7.6    122.34      33.17        323    102.69    102.69  .getxyz [8]
  7.5    154.77      32.43                                 .__mcount [9]
  7.2    186.16      31.39     189880      0.17      0.17  .btri [10]
  7.2    217.33      31.17                                 .kickpipes [12]
  5.1    239.58      22.25  309895200      0.00      0.00  .rmnmod [16]
  2.3    249.67      10.09        269     37.51     37.51  .getq [24]
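Two columns of the flat profile can be checked by hand from the others. The C sketch below reproduces that arithmetic; the function names are hypothetical, and the formulas are inferred from the numbers in the listing.

```c
#include <assert.h>

/* "% time" column: a routine's self seconds as a share of the
   total run time. */
double percent_time(double self_s, double total_s)
{
    assert(total_s > 0.0);
    return 100.0 * self_s / total_s;
}

/* "self ms/call" column: self seconds spread over the number of
   calls, converted to milliseconds. */
double self_ms_per_call(double self_s, long calls)
{
    assert(calls > 0);
    return 1000.0 * self_s / (double)calls;
}
```

For .conduct, percent_time(89.17, 435.04) comes out near 20.5 and self_ms_per_call(89.17, 10) is 8917, matching the first row. The cumulative column is simply the running sum of self seconds, e.g. 89.17 + 33.17 = 122.34.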
gprof (3)
granularity: Each sample hit covers 4 bytes. Time: 435.04 seconds

                                   called/total       parents
index  %time    self  descendents  called+self    name           index
                                   called/total       children

                0.00       340.50      1/1            .__start [2]
[1]     78.3    0.00       340.50      1          .main [1]
                2.12       319.50     10/10           .contrl [3]
                0.04         7.30     10/10           .force [34]
                0.00         5.27      1/1            .initia [40]
                0.56         3.43      1/1            .plot3da [49]
                0.00         1.27      1/1            .data [73]
pgprof
• compile with Portland Group compiler
  • pgf90 (pgf95, etc.)
  • pgcc
  • -Mprof=func
    • similar to -pg
• run code
• pgprof -exe executable
• pops up window with flat profile
pgprof (cont’d)
pgprof (3)
• To save profile data to a file:
  • re-run pgprof using the -text flag
  • at command prompt type p > filename
    • filename is the name you want to give the profile file
  • type quit to get out of profiler
Exercise 3
• Use pgprof to profile code
  • compile using -Mprof=func
  • run code
  • create profile using pgprof -exe gt
• Note which routines use most time
• Please close pgprof when you’re through
  • Leaving window open ties up a license
Line-Level Profiling
• Times individual lines
• For pgprof, compile with the flag -Mprof=line
• Optimizer will re-order lines
  • profiler will lump lines in some loops or other constructs
  • may want to compile without optimization, may not
• In flat profile, double-click on function to get line-level data
Line-Level Profiling (cont’d)
Exercise 4
• Compile code with -Mprof=line and -O0 and run
  • will take about 5 minutes to run due to overhead from line-level profiling and lack of optimization
• Examine line-level profile for most time-consuming routine
  • Note lines with longest time consumption
• Save your profile data to a file (we will need it later)
  • re-run pgprof using the -text flag
  • at command prompt type p > prof