Learn about parallelization methods, load balancing, false sharing, and cache line issues in computational astrophysics. Understand efficient code optimization for parallel execution.
Computational Methods in Astrophysics Dr Rob Thacker (AT319E) thacker@ap
Column vs Row major storage • In FORTRAN the storage is such that A(3,N) is laid out as A(1,1),A(2,1),A(3,1),A(1,2),A(2,2),A(3,2),… in memory • The first index is contiguous in memory (column major) • In C/C++ A[N][3] is laid out as A[0][0],A[0][1],A[0][2],A[1][0],A[1][1],A[1][2],… • The last index is contiguous (row major)
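As an illustrative aside (not from the original slides), the storage order determines which loop index should be innermost when sweeping through a 2-D array; a minimal FORTRAN sketch:
      real a(1000,1000),total
      integer i,j
      total=0.0
c     cache-friendly in FORTRAN: the inner loop runs over the FIRST
c     index, so successive iterations touch contiguous memory
      do j=1,1000
         do i=1,1000
            total=total+a(i,j)
         end do
      end do
c     swapping the loop order (j innermost) strides through memory
c     1000 words at a time and is typically much slower
In C/C++ the opposite holds: the last index should vary fastest in the innermost loop.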
Loop Level Parallelism • Consider the single precision vector add-multiply operation Y=aX+Y (“SAXPY”)
FORTRAN:
      do i=1,n
         Y(i)=a*X(i)+Y(i)
      end do

C$OMP PARALLEL DO
C$OMP& DEFAULT(NONE)
C$OMP& PRIVATE(i),SHARED(X,Y,n,a)
      do i=1,n
         Y(i)=a*X(i)+Y(i)
      end do

C/C++:
      for (i=1;i<=n;++i) {
         Y[i]+=a*X[i];
      }

      #pragma omp parallel for \
         private(i) shared(X,Y,n,a)
      for (i=1;i<=n;++i) {
         Y[i]+=a*X[i];
      }
In more detail
C$OMP PARALLEL DO
C$OMP& DEFAULT(NONE)
C$OMP& PRIVATE(i),SHARED(X,Y,n,a)
      do i=1,n
         Y(i)=a*X(i)+Y(i)
      end do
• C$OMP marks comment pragmas for FORTRAN – the ampersand is necessary for continuation lines • PARALLEL DO denotes that this is a region of code for parallel execution • DEFAULT(NONE) is good programming practice: you must then declare the nature of all variables • Thread SHARED variables: all threads can access these variables, but must not update individual memory locations simultaneously • Thread PRIVATE variables: each thread has its own copy of the variable (in this case i is the only private variable)
Reminder • Recall private variables are uninitialized on entry to the parallel region • Motivation: no need to copy in from the serial section • Private variables also do not carry a value back into the serial parts of the code • Motivation: no need to copy from parallel to serial • The API provides two mechanisms to circumvent this issue • FIRSTPRIVATE • LASTPRIVATE
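A minimal sketch of the two clauses (the names scale and ylast are illustrative, not from the slides):
C$OMP PARALLEL DO
C$OMP& DEFAULT(NONE)
C$OMP& PRIVATE(i)
C$OMP& FIRSTPRIVATE(scale)
C$OMP& LASTPRIVATE(ylast)
C$OMP& SHARED(X,Y,n)
      do i=1,n
c        FIRSTPRIVATE: each thread's copy of scale starts with the
c        value it had in the preceding serial code
         Y(i)=scale*X(i)
c        LASTPRIVATE: ylast keeps the value from the final iteration
c        and is copied back to the serial code after the loop
         ylast=Y(i)
      end do
Plain PRIVATE would leave scale undefined inside the loop and ylast undefined afterwards.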
Load balancing • When running in parallel you are only as fast as your slowest thread • In the example, the total work is 40 seconds and we have 4 CPUs • The best possible run time is therefore 40/4=10 secs • To achieve that, every thread has to take exactly 10 secs
(Figure: example of poor load balance – the slowest thread takes 16 secs, so the speed-up is only 40/16=2.5 despite using 4 processors)
STATIC scheduling • Simplest of the four scheduling options • If SCHEDULE is unspecified, STATIC scheduling will result • Default behaviour is simply to divide the iterations among the threads, ~n/(# threads) contiguous iterations each • SCHEDULE(STATIC,chunksize) instead deals chunks of chunksize iterations to the threads in a cyclic (round-robin) fashion
Comparison (16 iterations, 4 threads)
STATIC, no chunksize:
THREAD 1: 1 2 3 4 | THREAD 2: 5 6 7 8 | THREAD 3: 9 10 11 12 | THREAD 4: 13 14 15 16
STATIC, chunksize=1:
THREAD 1: 1 5 9 13 | THREAD 2: 2 6 10 14 | THREAD 3: 3 7 11 15 | THREAD 4: 4 8 12 16
Chunksize & Cache line issues • If you are accessing arrays using the loop index, e.g.
C$OMP PARALLEL DO
C$OMP& PRIVATE(i,..),SHARED(a,..)
C$OMP& SCHEDULE(STATIC,1)
      do i=1,n
         **work**
         a(i)=…
      end do
• Ensure chunksize > words in a cache line • False sharing otherwise
Drawbacks to STATIC scheduling • A naïve method of achieving good load balance • Works well if the work in each iteration is constant • If the work varies by a factor of ~10, cyclic load balancing can usually help • If the work varies by larger factors, dynamic balancing is usually better
DYNAMIC scheduling • DYNAMIC scheduling is a personal favourite • Specify using SCHEDULE(DYNAMIC,chunksize) • Simple implementation of a master-worker style distribution of iterations • The master thread hands out blocks of chunksize iterations to the workers as they finish their previous block • Not a silver bullet: if the load imbalance is too severe (i.e. one thread takes longer than all the rest combined) an algorithm rewrite is necessary • Also not good if you need a regular access pattern for data locality
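A hedged sketch of the syntax (the routine variable_cost_work is purely illustrative):
C$OMP PARALLEL DO
C$OMP& DEFAULT(NONE)
C$OMP& SCHEDULE(DYNAMIC,10)
C$OMP& PRIVATE(i),SHARED(n)
      do i=1,n
c        blocks of 10 iterations are handed out as threads become free,
c        so a thread that draws cheap iterations simply asks for more
         call variable_cost_work(i)
      end do
Smaller chunks give better balance but more scheduling overhead; larger chunks the reverse.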
Master-Worker Model (diagram): worker threads 1-3 each send a REQUEST to the master thread, which hands back the next chunk of iterations.
Performance issues - False Sharing • You may parallelize your algorithm and find performance is less than stellar (Figure: speed-up plotted against Ncpu)
Example • A possible cause of poor performance is something called 'false sharing':
      integer m,n,i,j
      real a(m,n),s(m)
C$OMP PARALLEL DO
C$OMP& PRIVATE(i,j)
C$OMP& SHARED(s,a)
      do i=1,m
         s(i)=0.0
         do j=1,n
            s(i)=s(i)+a(i,j)
         end do
      end do
Simple code which sums the rows of a matrix
Execution time line • Set m=4: what happens in each thread?
t=0: s(1)=0.0          s(2)=0.0          s(3)=0.0          s(4)=0.0
t=1: s(1)=s(1)+a(1,1)  s(2)=s(2)+a(2,1)  s(3)=s(3)+a(3,1)  s(4)=s(4)+a(4,1)
t=2: s(1)=s(1)+a(1,2)  s(2)=s(2)+a(2,2)  s(3)=s(3)+a(3,2)  s(4)=s(4)+a(4,2)
• Since memory is laid out in four-word cache lines, at each stage all four threads are fighting for the same cache line
Cache line invalidation • Prior to the next operation on s(), each thread must retrieve a new version of the s(1:4) cache line from main memory (its own copy of s(1:4) has been invalidated by the other threads' writes) • Fetches from anywhere other than the local cache are much slower • The result is a significantly increased run time
Simple solution • Just need to spread out the elements of s() so that each of them is in its own cache line:
      integer m,n,i,j
      real a(m,n),s(32,m)
C$OMP PARALLEL DO
C$OMP& PRIVATE(i,j)
C$OMP& SHARED(s,a)
      do i=1,m
         s(1,i)=0.0
         do j=1,n
            s(1,i)=s(1,i)+a(i,j)
         end do
      end do
Layout of s(32,m) (diagram): each element of interest s(1,i) now sits at the start of its own 32-word column, so s(1,i-1), s(1,i), s(1,i+1), s(1,i+2) are separated by multiple 8-word cache lines.
Tips to avoid false sharing • Minimize the number of variables that are shared • Segregate rarely updated variables from those that are updated frequently (“volatile”) • Isolate the volatile items into separate cache lines • You have to accept the waste of memory to improve performance
OpenMP subroutines and functions • OMP_SET_NUM_THREADS (s) • OMP_GET_NUM_THREADS (f) • OMP_GET_MAX_THREADS (f) • OMP_GET_THREAD_NUM (f) • OMP_GET_NUM_PROCS (f) • OMP_IN_PARALLEL (f) • OMP_SET_DYNAMIC (s) • OMP_GET_DYNAMIC (f) • OMP_SET_NESTED (s) • OMP_GET_NESTED (f) • (s) = subroutine, (f) = function • C/C++ versions are all lower case
Useful functions/subroutines • OMP_SET_NUM_THREADS is useful to change the number of threads of execution:
      call OMP_SET_NUM_THREADS(num_threads)
or
      void omp_set_num_threads(int num_threads)
• However, it will accept values higher than the number of CPUs – which will result in poor execution times
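A small illustrative guard against that oversubscription (not from the slides), using OMP_GET_NUM_PROCS:
      integer OMP_GET_NUM_PROCS,nreq
      nreq=8
c     never request more threads than there are processors
      if (nreq.gt.OMP_GET_NUM_PROCS()) nreq=OMP_GET_NUM_PROCS()
      call OMP_SET_NUM_THREADS(nreq)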
OMP_GET_NUM_THREADS • Returns the number of threads currently being used for execution • Will obviously produce a different result when executed in a parallel loop, versus a serial part of the code – be careful!
OMP_GET_MAX_THREADS • While OMP_GET_NUM_THREADS returns the number of threads currently being used, OMP_GET_MAX_THREADS returns the maximum possible number • This value will be the same whether the code is executing in a serial or parallel section • Remember: that is not true for OMP_GET_NUM_THREADS!
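A minimal sketch of the difference (the printed values are illustrative, assuming 4 threads are available):
      integer OMP_GET_NUM_THREADS,OMP_GET_MAX_THREADS
c     serial section: prints e.g. 1 and 4
      write(*,*) OMP_GET_NUM_THREADS(),OMP_GET_MAX_THREADS()
C$OMP PARALLEL
c     parallel region: every thread prints e.g. 4 and 4
      write(*,*) OMP_GET_NUM_THREADS(),OMP_GET_MAX_THREADS()
C$OMP END PARALLEL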
OMP_GET_THREAD_NUM • Very useful function – returns your thread number, from 0 to (number of threads) – 1 • Can be used to control access to data, by using the thread number to index the start and end points of an array section • Can also be useful when debugging race conditions
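A hedged sketch of that indexing idea (the array a and block size iblk are illustrative; assume a has at least iblk × number-of-threads elements):
      integer OMP_GET_THREAD_NUM,mytid,js,je,i
C$OMP PARALLEL
C$OMP& DEFAULT(NONE)
C$OMP& PRIVATE(mytid,js,je,i),SHARED(a,iblk)
      mytid=OMP_GET_THREAD_NUM()
c     thread 0 works on a(1:iblk), thread 1 on a(iblk+1:2*iblk), etc.
      js=mytid*iblk+1
      je=(mytid+1)*iblk
      do i=js,je
         a(i)=0.0
      end do
C$OMP END PARALLEL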
Environment variables • The OpenMP standard defines a number of environment variables (some of which we have already met): • OMP_NUM_THREADS • OMP_SCHEDULE • OMP_DYNAMIC • OMP_NESTED
Environment variables • All of these variables can be overridden by using the subroutines we discussed (with the exception of OMP_SCHEDULE) • OMP_DYNAMIC is set to false by default • OMP_NESTED is set to false by default • If you are using nesting, it is probably safer to turn nesting on explicitly within the code
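A minimal sketch of doing that from inside the code rather than relying on the environment (implementations may still refuse the request, hence the check):
      logical OMP_GET_NESTED
c     overrides whatever OMP_NESTED was set to in the environment
      call OMP_SET_NESTED(.true.)
      if (.not.OMP_GET_NESTED()) write(*,*) 'nested parallelism unavailable'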
Thread Stack size • One of the most significant variables is not defined within the OpenMP standard • Each parallel thread may require a large stack on which to declare its private variables • Typical sizes are 1-8 MB, but certain codes may require more • I often run problems where I need over 100 MB of thread stack (for example)
Quick example • Consider the following code:
      real r(3,2097152)
C$OMP PARALLEL DO
C$OMP& PRIVATE(r,i)
      do i=1,10
         call work(r)
      end do
• You are placing a private copy of r – 3×128³ four-byte words – on each thread's stack = 24 MB
Conditional Parallelism • To save code replication, OpenMP lets you decide at run time whether a loop should execute in parallel or serial: 'conditional parallelism' • A PARALLEL DO directive can carry an IF clause:
C$OMP PARALLEL DO
C$OMP& IF(L.gt.Lrmax)
C$OMP& PRIVATE(…)
      code….
• Only if the IF expression is true is the loop executed in parallel
How this avoids replication • Suppose you have a routine that is applied to a certain data set, but you may have one large single data set, or alternatively many small data sets • Option 1 (diagram): a single instance of 'call foo', parallel across the machine • Option 2 (diagram): multiple serial calls to foo, themselves executed in parallel • With conditional parallelism the same routine handles both cases
Conditional Parallelism • To determine whether the routine is already executing in parallel, the OMP_IN_PARALLEL function can be used:
      subroutine foo(bar)
      real bar
      logical inpar,OMP_IN_PARALLEL
      inpar = OMP_IN_PARALLEL()
C$OMP PARALLEL DO
C$OMP& IF(.NOT. inpar)
C$OMP& PRIVATE(…)
C$OMP& SHARED(…)
      do i=1,N
         work….
      end do
      return
      end
Comments • Applications that have a recursive nature can often be significantly reduced in size by using conditional parallelism • Most significant benefit is probably in code correctness – any bugs need only be fixed once!
Conditional Compilation • It’s good practice to keep your OpenMP code runnable in serial • However, if you call OpenMP functions, a compiler without OpenMP support will not have them defined • Conditional compilation provides a simple way around this
Conditional Compilation • Two methods can be used to hide calls to OpenMP routines from a non-OpenMP compiler • The C$ sentinel: such lines are only compiled when OpenMP is enabled
C$    i=OMP_GET_THREAD_NUM()
• Or C preprocessor statements can be used:
#ifdef _OPENMP
      i=OMP_GET_THREAD_NUM()
#endif
Simple Example • Suppose you use the thread number function from the previous example • OpenMP-only version:
C$OMP PARALLEL DO
C$OMP& PRIVATE(j)
      do i=1,n
         j=OMP_GET_THREAD_NUM()
         js=j*iblk+1
         je=(j+1)*iblk
         do k=js,je
            ! use j to point to parts of an array
         end do
      end do
C$OMP END PARALLEL DO
• Serial-safe version – add a default value and ensure all constants are adjusted accordingly:
C$OMP PARALLEL DO
C$OMP& PRIVATE(j)
      do i=1,n
         j=1
C$       j=OMP_GET_THREAD_NUM()
         js=j*iblk+1
         je=(j+1)*iblk
         do k=js,je
            ! use j to point to parts of an array
         end do
      end do
C$OMP END PARALLEL DO
Parallel I/O • Provided you are prepared to read and write multiple files, OpenMP can give a moderate I/O speed-up • Even if you are writing only one array, different sections of it can be written to the different files, e.g.
      write(13) (r(i),i=1,block)
      write(14) (r(i),i=block+1,2*block)
I/O continued • However, writing to one file system implies a bandwidth limit • Multiple file systems on one machine can show great benefits • Note: some machines require that applications be compiled with special flags • You may need a parallel I/O flag at compile time
Code Example • Writes the array r() out in nthread files simultaneously:
      integer OMP_GET_MAX_THREADS
      nthread=OMP_GET_MAX_THREADS()
      blck_len=n/nthread
C$OMP PARALLEL DO
C$OMP& PRIVATE(i,j,filename,js,je)
C$OMP& SHARED(blck_len,nthread,r)
      do i=1,nthread
         write(filename,'("r",i4.4,".out")') i
         js=(i-1)*blck_len+1
         je=i*blck_len
         open(i+10,file=filename,status='new',form='unformatted')
         write(i+10) (r(j),j=js,je)
         close(i+10)
      end do
Performance examples • Results ultimately depend on the system you are on
Alternative programming methodology • OpenMP also allows you to program in a way more akin to MPI (the SPMD model) • Using the PARALLEL construct you don’t need a do loop, or sections, to execute in parallel:
C$OMP PARALLEL
      call foo(i,a,b,c)
C$OMP END PARALLEL
Execution (diagram): each of the four threads makes its own call to foo.
PARALLEL environment • To do useful work it is necessary for each subroutine call to access different data • Can use the thread id to select different pieces of an array – similar in principle to using process ids in MPI • Good scaling requires that the replicated calls really do perform parallel work, rather than just replicating the same execution
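A hedged sketch of the SPMD pattern (foo, a and nlocal are illustrative; assume a holds nlocal elements per thread):
      integer OMP_GET_THREAD_NUM,mytid
C$OMP PARALLEL
C$OMP& DEFAULT(NONE)
C$OMP& PRIVATE(mytid),SHARED(a,nlocal)
      mytid=OMP_GET_THREAD_NUM()
c     every thread makes the same call, but on its own slice of a
      call foo(a(mytid*nlocal+1),nlocal)
C$OMP END PARALLEL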
Pros/cons of this approach • Cons – it may be difficult to take an already written code and convert it to this style • It can also lead to code that is tough to debug • Pros – codes written in this style often scale very well • Potentially very useful if you are starting from scratch
BARRIERS & SINGLE SECTIONS • Within a parallel region it may be necessary to synchronize, and/or to have only one thread execute a piece of code • This can be achieved by using the SINGLE construct and BARRIER:
C$OMP PARALLEL DEFAULT(SHARED)
      CALL WORK(X)
C$OMP BARRIER
c     the barrier blocks until all threads arrive here
C$OMP SINGLE
c     executed only by the first thread to arrive at this part of the code
      CALL OUTPUT(X)
      CALL INPUT(Y)
C$OMP END SINGLE
      CALL WORK(Y)
C$OMP END PARALLEL
SINGLE DIRECTIVE • The point of using the SINGLE directive is to avoid the overhead of stopping and then restarting the parallel section • There is an implied barrier at the END SINGLE statement • This can be avoided by adding NOWAIT:
C$OMP END SINGLE NOWAIT
MASTER DIRECTIVE • Instead of using SINGLE, one can also use the MASTER directive to select code that only the master thread should execute • Unlike the SINGLE directive there is no implied barrier at the END MASTER statement • MASTER regions can be used to keep tabs on the values of control variables • SINGLE regions are better for I/O (for example)
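A minimal sketch (WORK, X and Y echo the earlier SINGLE example; icount is illustrative; the explicit BARRIER is only needed if the other threads must wait for the master):
C$OMP PARALLEL DEFAULT(SHARED)
      CALL WORK(X)
C$OMP MASTER
c     only the master thread (thread 0) executes this block
      WRITE(*,*) 'control variable = ',icount
C$OMP END MASTER
c     no implied barrier at END MASTER, so synchronize explicitly if required
C$OMP BARRIER
      CALL WORK(Y)
C$OMP END PARALLEL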
DO • Within parallel regions, DO loops can be specified as parallel DOs • Exactly the same idea as loop-level parallelism: you have to specify the scope of the variables, scheduling, etc.
Example for 4 threads • PARALLEL only – every thread executes the whole loop, so 'Hello' is printed 16 times (4 per thread):
C$OMP PARALLEL
C$OMP& PRIVATE(i)
      do i=1,4
         write(*,*) 'Hello'
      end do
C$OMP END PARALLEL
• PARALLEL plus DO – the iterations are shared out, so 'Hello' is printed 4 times (1 per thread):
C$OMP PARALLEL
C$OMP DO
      do i=1,4
         write(*,*) 'Hello'
      end do
C$OMP END DO
C$OMP END PARALLEL
Task Queue in the SPMD model
      program tasks
      integer my_task,mytid,index
      index=1
C$OMP PARALLEL
C$OMP& DEFAULT(NONE)
C$OMP& PRIVATE(mytid),SHARED(index)
C$OMP CRITICAL
      mytid=my_task(index)
C$OMP END CRITICAL
      do while (mytid.ne.-1)
         call work(mytid)
C$OMP CRITICAL
         mytid=my_task(index)
C$OMP END CRITICAL
      end do
C$OMP END PARALLEL
      end

      integer function my_task(index)
      integer index,MAX
      MAX=some value
      if (index.lt.MAX) then
         index=index+1
         my_task=index
      else
         my_task=-1
      end if
      return
      end
WORKSHARE • This is a new feature in the OpenMP 2.0 standard • Allows parallelization of f90 array expressions and FORALL statements • Example, if A & B are arrays:
C$OMP WORKSHARE
      A=B
C$OMP END WORKSHARE
• The directive was brought in to get around problems with parallelizing f90 array syntax
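A hedged sketch combining array syntax and FORALL (the arrays A, B, C and the size n are illustrative; note that the WORKSHARE block sits inside a parallel region):
C$OMP PARALLEL
C$OMP WORKSHARE
c     the array assignment and the FORALL are each divided among the threads
      A=B
      FORALL (i=1:n) C(i)=2.0*A(i)+B(i)
C$OMP END WORKSHARE
C$OMP END PARALLEL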