
Code Tuning and Optimization

Code Tuning and Optimization. Doug Sondak sondak@bu.edu Boston University Scientific Computing and Visualization. Outline. Introduction Example code Timing Profiling Cache Tuning. Introduction. Timing Where is most time being used? Tuning How to speed it up


Presentation Transcript


  1. Code Tuning and Optimization Doug Sondak sondak@bu.edu Boston University Scientific Computing and Visualization

  2. Information Services & Technology Outline • Introduction • Example code • Timing • Profiling • Cache • Tuning

  3. Introduction • Timing: where is most time being used? • Tuning: how to speed it up; often as much art as science • Parallel performance: how to assess how well parallelization is working

  4. Example Code

  5. Example Code • Simulation of response of eye to stimuli • Response is affected by adjacent inputs: a dark area next to a bright area makes the bright area look brighter • Based on Grossberg & Todorovic paper • Appendix in paper contains all equations; there are errors in eqns. (A4) and (A5): cross out “log2” • Paper contains 6 levels of response • Our code only contains levels 1 through 5 • Level 6 takes a long time to compute, and would skew our timings!

  6. Example Code (cont’d) • All calculations are done on a square array • Array size and other constants are defined in gt.h (C) or in the “mods” module at the top of the code (Fortran) • Due to the nature of the algorithm, the array is padded on all sides • npad is the size of the padding

  7. Example Code – Level 1 • Luminance (input) distribution • Paper (and code) use a “yin-yang square” with bright and dark regions (Fig. 4 in paper) • Array I • Magnitude of “bright” is ihigh • Magnitude of “dark” is ilow

  8. Example Code – Level 2 • Circular Concentric On and Off Units • Excitation and inhibition vary with distance • Fig. 5 in paper

  9. Level 2 Equations • Ipq = initial input (yin-yang)

  10. Example Code – Level 3 • Oriented Direction-of-Contrast-Sensitive Units • Respond to angle • 12 discrete angles • Respond to direction of contrast, i.e., light-to-dark or dark-to-light • Fig. 6(d) in paper

  11. Level 3 Equations

  12. Example Code – Level 4 • Oriented Direction-of-Contrast-Insensitive Units • Respond to angle • Do not respond to direction of contrast, i.e., light-to-dark or dark-to-light • Fig. 8(a) in paper

  13. Level 4 Equations

  14. Example Code – Level 5 • Boundary Contour Units • Pool nearby excitations • Fig. 8(d) in paper

  15. Level 5 Equation

  16. Timing

  17. Timing • When tuning or parallelizing a code, you need to assess the effectiveness of your efforts • Can time the whole code and/or specific sections • Some types of timers: • unix time command • function/subroutine calls • profiler

  18. CPU Time or Wall-Clock Time? • CPU time: how much time the CPU is actually crunching away • User CPU time: time spent executing your source code • System CPU time: time spent in system calls such as i/o • Wall-clock time: what you would measure with a stopwatch

  19. CPU Time or Wall-Clock Time? (cont’d) • Both are useful • For serial runs without interaction from the keyboard, CPU and wall-clock times are usually close • If you prompt for keyboard input, wall-clock time will accumulate if you get a cup of coffee, but CPU time will not

  20. CPU Time or Wall-Clock Time? (3) • Parallel runs • Want wall-clock time, since CPU time will be about the same or even increase as the number of procs. is increased • Wall-clock time may not be accurate if sharing processors • Wall-clock timings should always be performed in batch mode

  21. Unix Time Command • Easiest way to time code • Simply type time before your run command • Output differs between C-type shells (csh, tcsh) and Bourne-type shells (bsh, bash, ksh)

  22. Unix Time Command (cont’d) • C-shell example:
      twister:~ % time mycode
      1.570u 0.010s 0:01.77 89.2% 75+1450k 0+0io 64pf+0w
      Fields, in order:
      1.570u     user CPU time (s)
      0.010s     system CPU time (s)
      0:01.77    wall-clock time (s)
      89.2%      (u+s)/wc
      75+1450k   avg. shared + unshared text space
      0+0io      input + output operations
      64pf+0w    page faults + no. times proc. was swapped
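As a quick check of the percentage field in the C-shell output above, using only the numbers printed in that example:

```latex
\frac{u + s}{wc} = \frac{1.570 + 0.010}{1.77} \approx 0.89
```

This agrees with the reported 89.2% to within the precision of the printed times.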

  23. Unix Time Command (3) • Bourne shell results:
      $ time mycode
      Real    1.62    wall-clock time (s)
      User    1.57    user CPU time (s)
      System  0.03    system CPU time (s)

  24. Exercise 1 • Copy files from /scratch/sondak/gt: cp /scratch/sondak/gt/* . • Choose C (gt.c) or Fortran (gt.f90) • Compile with no optimization (the flags are -O0, capital oh followed by zero, and -o, small oh): pgcc -O0 -o gt gt.c or pgf90 -O0 -o gt gt.f90 • Submit the rungt script to the batch queue: qsub rungt

  25. Exercise 1 (cont’d) • Check status: qstat -u username • After the run has completed, a file will appear named rungt.o??????, where ?????? represents the process number • File contains the result of the time command • Write down the wall-clock time • Re-compile using -O3 • Re-run and check the time

  26. Function/Subroutine Calls • Often need to time part of a code • Timers can be inserted in the source code • Language-dependent

  27. cpu_time • Intrinsic subroutine in Fortran • Returns user CPU time (in seconds) • No system time is included • 0.01 sec. resolution on p-series
      real :: t1, t2
      call cpu_time(t1)
      ! ... do stuff to be timed ...
      call cpu_time(t2)
      print*, 'CPU time = ', t2-t1, ' sec.'

  28. system_clock • Intrinsic subroutine in Fortran • Good for measuring wall-clock time • On p-series: • resolution is 0.01 sec. • max. time is 24 hr.

  29. system_clock (cont’d) • t1 and t2 are tic counts • count_rate is an optional argument containing tics/sec.
      integer :: t1, t2, count_rate
      call system_clock(t1, count_rate)
      ! ... do stuff to be timed ...
      call system_clock(t2)
      print*, 'wall-clock time = ', &
          real(t2-t1)/real(count_rate), ' sec'

  30. times • Can be called from C to obtain CPU time • 0.01 sec. resolution on p-series • Can also get system time with tms_stime
      #include <stdio.h>
      #include <sys/times.h>
      #include <unistd.h>
      int main(void){
          int tics_per_sec;
          float tic1, tic2;
          struct tms timedat;
          tics_per_sec = sysconf(_SC_CLK_TCK);  /* clock tics per second */
          times(&timedat);
          tic1 = timedat.tms_utime;             /* user CPU time, in tics */
          /* ... do stuff to be timed ... */
          times(&timedat);
          tic2 = timedat.tms_utime;
          printf("CPU time = %5.2f\n", (float)(tic2-tic1)/(float)tics_per_sec);
          return 0;
      }

  31. gettimeofday • Can be called from C to obtain wall-clock time • msec resolution on p-series
      #include <stdio.h>
      #include <sys/time.h>
      int main(void){
          struct timeval t;
          double t1, t2;
          gettimeofday(&t, NULL);
          t1 = t.tv_sec + 1.0e-6*t.tv_usec;     /* seconds + microseconds */
          /* ... do stuff to be timed ... */
          gettimeofday(&t, NULL);
          t2 = t.tv_sec + 1.0e-6*t.tv_usec;
          printf("wall-clock time = %5.3f\n", t2-t1);
          return 0;
      }

  32. MPI_Wtime • Convenient wall-clock timer for MPI codes • msec resolution on p-series

  33. MPI_Wtime (cont’d)
      Fortran:
      double precision t1, t2
      t1 = mpi_wtime()
      ! ... do stuff to be timed ...
      t2 = mpi_wtime()
      print*, 'wall-clock time = ', t2-t1
      C:
      double t1, t2;
      t1 = MPI_Wtime();
      /* ... do stuff to be timed ... */
      t2 = MPI_Wtime();
      printf("wall-clock time = %5.3f\n", t2-t1);

  34. omp_get_wtime • Convenient wall-clock timer for OpenMP codes • Resolution available by calling omp_get_wtick() • 0.01 sec. resolution on p-series

  35. omp_get_wtime (cont’d)
      Fortran:
      double precision t1, t2, omp_get_wtime
      t1 = omp_get_wtime()
      ! ... do stuff to be timed ...
      t2 = omp_get_wtime()
      print*, 'wall-clock time = ', t2-t1
      C:
      double t1, t2;
      t1 = omp_get_wtime();
      /* ... do stuff to be timed ... */
      t2 = omp_get_wtime();
      printf("wall-clock time = %5.3f\n", t2-t1);

  36. Timer Summary

  37. Exercise 2 • Put a wall-clock timer around each “level” in the example code • Print the time for each level • Compile and run

  38. Profiling

  39. Profilers • A profile tells you how much time is spent in each routine • Gives a level of granularity not available with the previous timers • e.g., a function may be called from many places • Various profilers are available, e.g. • gprof (GNU) • pgprof (Portland Group) • Xprofiler (AIX)

  40. gprof • Compile with -pg • File gmon.out will be created when you run • gprof executable > myprof • For multiple procs. (MPI), copy or link gmon.out.n to gmon.out, then run gprof

  41. gprof (cont’d) • Flat profile
      granularity: Each sample hit covers 4 bytes. Time: 435.04 seconds
        %   cumulative     self               self     total
       time   seconds    seconds      calls  ms/call  ms/call  name
       20.5     89.17      89.17         10  8917.00 10918.00  .conduct [5]
        7.6    122.34      33.17        323   102.69   102.69  .getxyz [8]
        7.5    154.77      32.43                               .__mcount [9]
        7.2    186.16      31.39     189880     0.17     0.17  .btri [10]
        7.2    217.33      31.17                               .kickpipes [12]
        5.1    239.58      22.25  309895200     0.00     0.00  .rmnmod [16]
        2.3    249.67      10.09        269    37.51    37.51  .getq [24]

  42. gprof (3) • Call graph
      granularity: Each sample hit covers 4 bytes. Time: 435.04 seconds
                                         called/total     parents
      index  %time    self  descendents  called+self   name          index
                                         called/total     children
                      0.00       340.50       1/1         .__start [2]
      [1]     78.3    0.00       340.50       1        .main [1]
                      2.12       319.50      10/10        .contrl [3]
                      0.04         7.30      10/10        .force [34]
                      0.00         5.27       1/1         .initia [40]
                      0.56         3.43       1/1         .plot3da [49]
                      0.00         1.27       1/1         .data [73]

  43. pgprof • Compile with a Portland Group compiler: pgf90 (pgf95, etc.) or pgcc • Use the flag -Mprof=func (similar to -pg) • Run code • pgprof -exe executable • Pops up a window with the flat profile

  44. pgprof (cont’d)

  45. pgprof (3) • To save profile data to a file: • re-run pgprof using the -text flag • at the command prompt type p > filename • filename is the name you want to give the profile file • type quit to get out of the profiler

  46. Exercise 3 • Use pgprof to profile the code • compile using -Mprof=func • run code • create profile using pgprof -exe gt • Note which routines use the most time • Please close pgprof when you’re through • Leaving the window open ties up a license

  47. Line-Level Profiling • Times individual lines • For pgprof, compile with the flag -Mprof=line • Optimizer will re-order lines • profiler will lump lines in some loops or other constructs • may want to compile without optimization, may not • In the flat profile, double-click on a function to get line-level data

  48. Line-Level Profiling (cont’d)

  49. Exercise 4 • Compile the code with -Mprof=line and -O0 and run • will take about 5 minutes to run due to overhead from line-level profiling and lack of optimization • Examine the line-level profile for the most time-consuming routine • Note the lines with the longest time consumption • Save your profile data to a file (we will need it later) • re-run pgprof using the -text flag • at the command prompt type p > prof

  50. Cache
