Single Processor Code Optimization for the Datastar (Power 4) Architecture by Ross Walker
Why optimize your code for single cpu performance?
Advantages
• More calculation in a given time.
• Fewer SUs per simulation.
• Faster time to solution.
• Better chance of successful MRAC/LRAC proposals.
REMEMBER: WALLCLOCK TIME IS EVERYTHING!
• The only metric that ultimately matters is wallclock time to solution.
• Wallclock is how long it takes to run your simulation.
• Wallclock is what you get charged for.
• Wallclock is how long your code blocks other users from using the machine.
• Lies, Damn Lies and Statistics...
• TFlops/PFlops numbers are ‘worse’ than statistics.
• You can easily write a code that achieves very high Flops numbers but has a longer time to solution.
Why optimize your code for single cpu performance?
Disadvantages
• Time consuming – do not waste weeks optimizing a one-off code that will only run for an hour.
• Can make the code harder to read and debug.
• Can adversely affect parallel scaling.
• Different architectures can respond in different ways.
When to optimize?
• Code optimization is an iterative process requiring time, energy and thought.
• Performance tuning is recommended for:
1) production codes that are widely distributed and often used in the research community
2) projects with limited allocation (to maximize available compute hours)
Optimization Strategy
• This talk will concentrate on how to carry out the code-optimization step of the overall strategy.
Optimization Strategy
• Aim: to minimize the amount of work the compiler is required to do. DO NOT rely on the compiler to optimize bad code.
Use a two-pronged approach:
1) Write easily optimizable code to begin with.
2) Once the code is debugged and working, try more aggressive optimizations.
Example Code
• This talk will go through a number of simple examples of bad coding, show what compilers can and can't do, and show how you can work ‘with’ the compiler to produce faster code.
• Example code is in: /gpfs/projects/summer_inst/code_opt/
• After this talk you will have an opportunity to try to optimize this code. There will be prizes for the most efficient codes.
The Power 4 Architecture • Remember: Clock speed is NOT everything.
Why is the Power 4 chip faster at a lower clock speed?
• The speed comes from the inherent parallelism designed into the chip.
• 2 floating point units per core.
• 2 fixed point units per core.
• Each FP unit can perform 1 fused multiply-add (FMA) per clock tick, so the two floating point units can perform a theoretical maximum of 4 floating point operations per clock cycle.
• You should design your code with this in mind.
What compilers will not do. • They will not reorder large memory arrays. • They will not factor/simplify large equations. • They will not replace inefficient logic. • They will not vectorize large complex loops. • They will not fix inefficient memory access. • They will not detect and remove excess work. • They will not automatically invert divisions. • ...
How can we help the compiler?
• We should aim to help the compiler as much as possible by designing our code to match the machine architecture.
• Try to always move linearly in memory (maximizes cache hits).
• Factor / simplify equations as much as possible.
• Think carefully about how we do things. Is there a more efficient approach?
• Remove excess work.
• Avoid branching inside loops.
I will take you through a series of simple examples that you can then use to try to optimize the example code.
Example ‘bad’ code
• In /gpfs/projects/summer_inst/code_opt/ is an example piece of code that we will attempt to optimize.
• This code solves the following ‘fictional’ equation:

E_total = Σ (over pairs i<j with rij ≤ cut) [ exp(rij·qi) · exp(rij·qj) / rij + 1/a ]
Example ‘bad’ Code
The main loop of the code is as follows:

  total_e = 0.0d0
  cut_count = 0
  do i = 1, natom
    do j = 1, natom
      if ( j < i ) then !Avoid double counting.
        vec2 = (coords(1,i)-coords(1,j))**2 &
             + (coords(2,i)-coords(2,j))**2 &
             + (coords(3,i)-coords(3,j))**2 !X^2 + Y^2 + Z^2
        rij = sqrt(vec2)
        !Check if this is below the cut off
        if ( rij <= cut ) then
          cut_count = cut_count+1 !Increment the counter of pairs below cutoff
          current_e = (exp(rij*q(i))*exp(rij*q(j)))/rij
          total_e = total_e + current_e + 1.0d0/a
        end if
      end if
    end do
  end do

While only short, this code offers a lot of potential for optimization.
Step 1 – Select best compiler / library options
• We have square roots and exponentials.
• We also have multiple loops.
Hence it is essential that we compile with optimization and link against IBM's accelerated math library (MASS):

  xlf90 -o original.x -O3 -qarch=pwr4 -qtune=pwr4 -L/usr/lib -lmass original.f
Step 2 – Code Optimization • We can modify the code to make the compiler’s job easier. • I will take you through a series of examples and then you can use these to optimize the example code.
Optimization Tip 1
• Avoid expensive mathematical operations as much as possible.
• Addition, Subtraction, Multiplication = Very Fast
• Division and Exponentiation = Slow
• Square Root = Very Slow (inverse square root can sometimes be faster)
Optimization Tip 1 Contd...
• E.g. in our example code we need to test if rij is <= a cut off.

  rij = sqrt(x**2+y**2+z**2)
  if (rij <= cut) then
    ...
  end if

• We can avoid doing excessive square roots by taking into account that cut is a constant, so we can compare the squared distance against cut**2 instead:

  cut2 = cut*cut !Evaluated once, outside the loops
  ...
  vec2 = x**2+y**2+z**2
  if (vec2 <= cut2) then
    rij = sqrt(vec2)
    ...
  end if
Optimization Tip 2
• Do not rely on the compiler to simplify / factorize equations for you. You should do this manually yourself. E.g.:

  exp(rij·qi) · exp(rij·qj) = exp(rij·(qi+qj))

• This replaces 2 expensive exponentials with just 1.
• The same is true for excessive divisions, square roots etc.
• Try expanding the equation that your code represents and then refactorizing it with the intention of avoiding as many square roots, exponentials and divisions as possible.
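Applied to the inner loop of the example code, the factorization becomes (a minimal sketch using the example's own variable names):

  !Before: two exp() calls per pair
  current_e = (exp(rij*q(i))*exp(rij*q(j)))/rij
  !After: exp(a)*exp(b) = exp(a+b), so one exp() call suffices
  current_e = exp(rij*(q(i)+q(j)))/rij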
Optimization Tip 3
• Invert divisions wherever possible.

  do i = 1, N
    do j = 1, N
      y(j,i) = x(j,i) / r(i)
    end do
  end do

Becomes

  do i = 1, N
    oner = 1.0d0/r(i)
    do j = 1, N
      y(j,i) = x(j,i) * oner
    end do
  end do

This replaces N^2 divisions with N divisions and N^2 (much cheaper) multiplications.
Optimization Tip 3 (Extended Example)
• One can extend this idea to situations where something requires both a multiplication by a distance and a division by a distance. E.g.

  do j = 1, iminus
    vec(1:3) = qmx(1:3)-qmcoords(1:3,j)
    r2 = vec(1)**2+vec(2)**2+vec(3)**2
    onerij = 1.0d0/sqrt(r2)
    rij = r2*onerij !r2*onerij = sqrt(r2) = rij
    ...
    ... = ... + param*rij
    sum = sum + q(j)*onerij
  end do

This takes advantage of the fact that on some systems an inverse square root followed by a multiplication can be faster than a square root followed by a division.
Optimization Tip 4
• Avoid branching in inner loops – this leads to cache misses and also prevents the compiler from vectorizing loops and making use of the multiple floating point units.

Before (branch inside the loop):

  do i=1,n
    if (r < 1.0e-16) then
      a(i)=0.0; b(i)=0.0; c(i)=0.0
    else
      a(i)=x(i)/r
      b(i)=y(i)/r
      c(i)=z(i)/r
    end if
  end do

After (branch hoisted out of the loop):

  if (r < 1.0e-16) then
    do i=1,n
      a(i)=0.0; b(i)=0.0; c(i)=0.0
    end do
  else
    do i=1,n
      a(i)=x(i)/r
      b(i)=y(i)/r
      c(i)=z(i)/r
    end do
  end if
Optimization Tip 4 Contd...
• How can we improve this code even more? Invert the division:

  if (r < 1.0e-16) then
    do i=1,n
      a(i)=0.0; b(i)=0.0; c(i)=0.0
    end do
  else
    oner=1.0d0/r
    do i=1,n
      a(i)=x(i)*oner
      b(i)=y(i)*oner
      c(i)=z(i)*oner
    end do
  end if
Optimization Tip 4 Contd.
• How can we remove the following if statement from the example code?
• Hint: consider changing the loop indexes.

  do i = 1, natom
    do j = 1, natom
      if ( j < i ) then !Avoid double counting.
        vec2 = (coords(1,i)-coords(1,j))**2 &
             + (coords(2,i)-coords(2,j))**2 &
             + (coords(3,i)-coords(3,j))**2 !X^2 + Y^2 + Z^2
        rij = sqrt(vec2)
        !Check if this is below the cut off
        if ( rij <= cut ) then
          cut_count = cut_count+1 !Increment the counter of pairs below cutoff
          current_e = (exp(rij*q(i))*exp(rij*q(j)))/rij
          total_e = total_e + current_e + 1.0d0/a
        end if
      end if
    end do
  end do
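One possible answer, using the triangular loop structure that also appears in the vectorization example later in this talk (a sketch; only the loop bounds change):

  do i = 2, natom
    do j = 1, i-1 !j < i is now guaranteed by the loop bounds
      vec2 = (coords(1,i)-coords(1,j))**2 &
           + (coords(2,i)-coords(2,j))**2 &
           + (coords(3,i)-coords(3,j))**2
      ...
    end do
  end do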
Optimization Tip 5
• Ensure that loop indexes are constant while a loop is executing. E.g.:

  do i = 2, N
    ...
    do j = 1, i-1
      ...
    end do
  end do

Some compilers will not recognize that i-1 is constant during the j loop and so will re-evaluate i-1 on each loop iteration. We can help the compiler out here as follows:

  do i = 2, N
    ...
    iminus = i-1
    do j = 1, iminus
      ...
    end do
  end do
Optimization Tip 6
• Avoid excessive array lookups – factor array lookups that are not dependent on the inner loop out of the loop.
• Some compilers will do this for you. Others will not.

  do i = 1, N
    do j = 1,N
      ...
      sum = sum+x(j)*x(i)
    end do
  end do

Becomes

  do i = 1, N
    xi = x(i) !x(i) becomes a scalar in the inner loop.
    do j = 1,N
      ...
      sum = sum+x(j)*xi
    end do
  end do
Optimization Tip 7
• Always try to use stride 1 memory access.
• Traverse linearly in memory.
• Maximizes cache hits.
• In Fortran this means you should always loop over the first array index in the inner loop. E.g. this is the wrong way round:

  do i=1, Ni
    do j=1,Nj
      do k=1,Nk
        v=x(i,j,k)
        ...
      end do
    end do
  end do

Wrong! This way round, each iteration of loop k jumps Ni*Nj*8 bytes through memory, meaning it almost always misses the cache.
Optimization Tip 7 Contd.
• By changing the order of the loops we can ensure that we always stride linearly through the x array:

  do k=1, Nk
    do j=1,Nj
      do i=1,Ni
        v=x(i,j,k)
        ...
      end do
    end do
  end do

• Note: in C memory is laid out in the reverse fashion (row-major), so the innermost loop should run over the last array index. Ideally you should use a linear array and do the necessary offset arithmetic yourself.
Optimization Tip 7 Contd.
• If you have multiple nested loops that together run through an entire array, it can often be more beneficial to replace a multi-dimensional array with a 1-D array and do the arithmetic yourself. E.g.:

  Nitr = Nk*Nj*Ni
  do k=1, Nitr
    v=x(k)
    ...
  end do

• This avoids unnecessary pointer lookups and can make it easier to vectorize the code.
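When the loops do not collapse completely, you can still index a 1-D array by hand. A sketch of the column-major offset arithmetic for an array declared x(Ni,Nj,Nk), stored in a hypothetical 1-D array x1d of size Ni*Nj*Nk:

  !x(i,j,k) maps to x1d(idx) in Fortran's column-major ordering
  idx = i + (j-1)*Ni + (k-1)*Ni*Nj
  v = x1d(idx)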
Optimization Tip 7 Contd.
• Sometimes (you need to test), especially on the Power 4 architecture, it can help to split arrays of the form (1:x,N), where x is a small (ca. 3) constant, into a set of x 1-D arrays. E.g.:

  do i = 1, N
    ...
    do j = 1, N
      vec2=(xi-crd(1,j))**2+(yi-crd(2,j))**2+(zi-crd(3,j))**2
      ...
    end do
  end do

BECOMES

  do i = 1, N
    ...
    do j = 1, N
      vec2=(xi-x(j))**2+(yi-y(j))**2+(zi-z(j))**2
      ...
    end do
  end do

The reasoning behind this is that the Power 4 chip has multiple cache lines, so by splitting the array into 3 separate arrays the compiler can utilize 3 pipelines simultaneously. (You need to test whether this works for your code!)
Vectorization
• Sometimes, if you have a large amount of computation occurring inside a loop, it can be beneficial to explicitly vectorize the code and make use of specific vector math libraries.
• Often the compiler can do this vectorization for you (xlf90 will explicitly try if -qhot is specified).
• The compiler can typically only vectorize small inner loops. Complex loop structures will not get vectorized; here your code can benefit from explicit vectorization.
Why Does Vectorization Help on a Scalar Machine?
• The machine is actually superscalar...
• It has a pipeline for doing computation.
• It has multiple floating point units.
• Vectorization makes it easy for the compiler to use multiple floating point units, as we have a large number of operations that are all independent of each other.
• Vectorization makes it easy to fill the pipeline.
Why Does Vectorization Help on a Scalar Machine?
• On Power 4 a single FMA has to go through 6 stages in the pipeline.
• After 1 clock cycle the first FMA operation has completed stage 1 and moves on to stage 2. The processor can now start processing stage 1 of the second FMA operation in parallel with stage 2 of the first operation, and so on...
Why Does Vectorization Help on a Scalar Machine? • After 6 operations pipelining of subsequent FMA operations gives one result every clock cycle. • Pipeline latency is thus hidden beyond 6 operations. • Power 4 chip has two floating point units so needs a minimum of 12 independent FMA operations to be fully pipelined. • Thus if we can split our calculation up into a series of long ‘independent’ vector calculations we can maximize the floating point performance of the chip.
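One way to see this in code: splitting a reduction into several independent partial sums removes the dependency between successive FMAs, so they can fill the pipelines. This is an illustrative sketch only (it assumes N is a multiple of 4, and an optimizing compiler will often do this unrolling itself at -O3):

  program unroll_demo
    implicit none
    integer, parameter :: N = 1000 !Assumed to be a multiple of 4
    real*8 :: a(N), b(N), sum1, sum2, sum3, sum4, total
    integer :: i
    a = 1.0d0; b = 2.0d0 !Example data
    sum1 = 0.0d0; sum2 = 0.0d0; sum3 = 0.0d0; sum4 = 0.0d0
    do i = 1, N, 4 !Four independent accumulation chains per iteration
      sum1 = sum1 + a(i)  *b(i)
      sum2 = sum2 + a(i+1)*b(i+1)
      sum3 = sum3 + a(i+2)*b(i+2)
      sum4 = sum4 + a(i+3)*b(i+3)
    end do
    total = sum1 + sum2 + sum3 + sum4
    print *, total
  end program unroll_demo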
Vectorization Contd.
Scalar Code:

  do i = 2, N
    do j = 1, i-1
      vec2 = ...
      onerij = 1.0d0/sqrt(vec2)
      rij = vec2*onerij
      exp1 = exp(rij)
      esum = esum + exp1*onerij
    end do
  end do
Vectorization Contd.
Vector Code:

  loopcount = 0
  do i = 2, N
    do j = 1, i-1
      loopcount = loopcount+1
      vec2 = ...
      vectmp1(loopcount) = vec2
    end do
  end do
  !Begin vector operations
  call vrsqrt(vectmp2,vectmp1,loopcount) !vectmp2 now contains onerij
  vectmp1(1:loopcount) = vectmp1(1:loopcount) * vectmp2(1:loopcount) !vectmp1 now contains rij
  call vexp(vectmp1,vectmp1,loopcount) !vectmp1 now contains exp(rij)
  do i = 1, loopcount
    esum = esum + vectmp1(i)*vectmp2(i)
  end do
IBM's vector math library
• IBM has a specially tuned vector math library for use on Datastar. This library is called libmassvp4 and can be linked into your code with:

  xlf90 -o foo.x -O3 -L/usr/lib -lmassvp4 foo.f

• Support is provided for a number of vector functions including:
vrec (vectored inverse), vexp (vectored exp), vsqrt (vectored sqrt), vrsqrt (vectored invsqrt), vlog (vectored ln), vcos (vectored cosine), vtanh (vectored tanh)
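Each of these routines follows the same (output array, input array, element count) calling convention used in the vector code above. A short usage sketch (array names here are illustrative, not part of the library):

  real*8 :: xin(1000), yout(1000)
  integer :: n
  n = 1000
  xin = 2.0d0 !Example input values
  call vsqrt(yout, xin, n) !yout(i) = sqrt(xin(i)) for i = 1..n
  call vexp(yout, xin, n)  !yout(i) = exp(xin(i)) for i = 1..n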
The Example Code (A competition)
• As mentioned previously, an example piece of code is available in /gpfs/projects/summer_inst/code_opt/ on Datastar. (COPY THIS TO YOUR OWN DIRECTORY ON GPFS BEFORE EDITING.)
• Provided in this directory are the unoptimized code (original.f), a script to compile it (compile.x) and a LoadLeveler script to submit it on 1 cpu (llsubmit submit_orig.ll).
• There is also a set of optimized codes that I have created (both scalar and vector). These are read-protected at present and will be made readable to all at the end of today.
The Example Code(A competition) • Your mission (should you choose to accept it) is to optimize this example code so that it gives the same answer (to 12 s.f.) on a single P655+ (1.5GHz) Datastar node in the shortest time possible. • There is also a parallel version of the original and optimized code which I have run on 16 cpus (2 * 1.5GHz node) of Datastar.
The Example Code (A competition)
• My timings (for 1 cpu of a 1.5GHz Datastar node) are as follows:
PRIZES!!! – There will be a prize for each person who gets better than 31.0 seconds on 1 cpu x 1.5GHz with the answer correct to within 12 s.f.
The Example Code (Parallel) (A competition)
• My timings (for 16 cpus of Datastar, 1.5GHz nodes) are as follows:
PRIZES!!! – There will be a prize for each person who gets better than 3.1 seconds on 16 cpu x 1.5GHz with the answer correct to within 12 s.f.
Directory Contents
• Directory = /gpfs/projects/summer_inst/code_opt/

  -rwxr-xr-x ... compile.x*                   ./compile.x to compile codes
  -rw-r--r-- ... coords.dat                   coordinate file read by code
  drwxr-xr-x ... opt/                         directory containing optimized solution
  -rw-r--r-- ... original.f                   original single cpu code (start here)
  -rw-r--r-- ... original.out                 output from run
  -rwxr-xr-x ... original.x*                  executable built by compile.x
  -rw-r--r-- ... original_16cpu.out           output from 16cpu run
  -rw-r--r-- ... original_mpi.f               original multi cpu code (start here 2)
  -rwxr-xr-x ... original_mpi.x*              executable built by compile.x
  -rw-r--r-- ... run_original.llscript        LoadLeveler script to run 1 cpu job
  -rw-r--r-- ... run_original_16cpu.llscript  LoadLeveler script to run 16 cpu job

Everything in the opt directory is readable except the source code ;-) I will make this available at the end of today.
Running Code
• We will use the express queue, so you must log in to "dspoe.sdsc.edu".
• Copy the original code and scripts to a directory on gpfs that is owned by you. I recommend you do:

  cd /gpfs
  mkdir username     (substitute your guest username here)
  cd username
  cp /gpfs/projects/summer_inst/code_opt/* ./

• Edit the original code with the editor of your choice and apply any optimizations you think it needs.
Running Code 2
• Once you have optimized some of the code you can compile it with the ./compile.x script (which will compile both the serial and parallel versions) or, for single cpu, with:

  xlf90 -o original.x -L/usr/lib -lmass -O3 -qtune=pwr4 -qarch=pwr4 -bmaxdata:0x80000000 original.f

• You then need to submit your job to the express queue of Datastar by editing the run_original.llscript LoadLeveler script.
Running Code 3

  #!/usr/bin/ksh
  ...
  #@class = express
  #@node = 1
  #@tasks_per_node = 1
  #@wall_clock_limit = 00:04:00
  #@node_usage = not_shared
  #@network.MPI = sn_all, shared, US
  #@job_type = parallel
  #@job_name = job.$(jobid)
  #@output = LL_out.$(jobid)
  #@error = LL_err.$(jobid)
  #@notification = always
  #@account_no = accountnumber
  #@notify_user = your@email.com
  #@initialdir = /dsgpfs/projects/summer_inst/code_opt/
  #@queue
  ./original.x coords.dat 10.0 > original.out

You need to change 3 lines: account_no, notify_user and initialdir.
Running Code 4
• We have a reservation for 4 nodes on the express queue, so you need to set the following environment variable (you can add this to your .cshrc file if you want):
  setenv LL_RES_ID ds100.2.r
• Submit the job with:
  llsubmit script_name.llscript
• Check on queue status with:
  llq
• Look at the output file for the timing.
• Anybody who beats 31 seconds in serial or 3.1 seconds in parallel (16 cpu), please see me for your prize.
Code Results
• The original code should produce the following answers:
• Num Pairs = 18652660 (your code should get this exactly)
• Total E = 54364.4699664615910 (your code should get this to within 12 s.f.)
Prize Winners • Serial • Phil Blood = 30.08 • Kyle Auguston = 30.79 • Shikui Tang = 29.79 • Roy Culver = 29.97 • Cheol Hwan Park = 29.01 • Parallel
My Code

  implicit none
  integer, parameter :: vec_length = 20000000 !Estimate of max cut_count
  integer*4 :: iarg !Intrinsic getarg requires 4 byte integers so this provides
                    !a guard against 64 bit integer compilation.
  integer :: natom, ier, ios, i, j, cut_count, iminus
  !Timer variables
  integer :: int_time, int_count_rate
  REAL*8 :: time0, time1, time2, count_rate
  character(len=80) :: arg, filename
  REAL*8 :: cut !Cut off for Rij in distance units
  REAL*8, allocatable, dimension(:) :: x, y, z, q, vec_cache1, vec_cache2
  REAL*8 :: total_e, current_e, vec2, rij, onerij, qi, xi, yi, zi, exparg
  REAL*8, parameter :: onea = 1.0d0/3.2d0 !Compiler will evaluate this at compile time.