Measuring Synchronisation and Scheduling Overheads in OpenMP
J. Mark Bull
EPCC, University of Edinburgh, UK
email: m.bull@epcc.ed.ac.uk
Overview
• Motivation
• Experimental method
• Results and analysis
  • Synchronisation
  • Loop scheduling
• Conclusions and future work
Motivation
• Compare OpenMP implementations on different systems.
• Highlight inefficiencies.
• Investigate performance implications of semantically equivalent directives.
• Allow estimation of synchronisation/scheduling overheads in applications.
Experimental method
Basic idea is to compare the same code executed with and without directives. Overhead is computed as the (mean) difference in execution time. E.g. for the DO directive, compare:

!$OMP PARALLEL
      do j=1,innerreps
!$OMP DO
         do i=1,numthreads
            call delay(dlength)
         end do
      end do
!$OMP END PARALLEL

to

      do j=1,innerreps
         call delay(dlength)
      end do
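As a concrete illustration of how the two timings combine into an overhead figure, here is a minimal, self-contained sketch of the method (shown for the BARRIER directive; the program name, the delay routine and the constants are illustrative, not taken from the benchmark source):

program overhead_sketch
  use omp_lib
  implicit none
  integer, parameter :: innerreps = 1000
  double precision :: t0, t1, reftime, testtime, overhead
  integer :: j

  ! Reference: the delay loop with no directives.
  t0 = omp_get_wtime()
  do j = 1, innerreps
     call delay(500)
  end do
  t1 = omp_get_wtime()
  reftime = (t1 - t0) / innerreps

  ! Test: the same work executed with the directive being measured.
  t0 = omp_get_wtime()
!$OMP PARALLEL PRIVATE(j)
  do j = 1, innerreps
!$OMP BARRIER
     call delay(500)
  end do
!$OMP END PARALLEL
  t1 = omp_get_wtime()
  testtime = (t1 - t0) / innerreps

  ! Overhead per directive = mean difference of the two times.
  overhead = testtime - reftime
  print *, 'BARRIER overhead (s): ', overhead

contains
  ! Illustrative delay loop: burns roughly n units of work.
  subroutine delay(n)
    integer, intent(in) :: n
    integer :: i
    double precision :: a
    a = 0.0d0
    do i = 1, n
       a = a + i
    end do
    if (a < 0.0d0) print *, a   ! keeps the loop from being optimised away
  end subroutine delay
end program overhead_sketch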
Experimental method (cont.)
• Similar technique can be used for PARALLEL (with and without REDUCTION clause), PARALLEL DO, BARRIER and SINGLE directives.
• For mutual exclusion (CRITICAL, ATOMIC, lock/unlock) use a similar method, comparing

!$OMP PARALLEL
      do j=1,innerreps/nthreads
!$OMP CRITICAL
         call delay(dlength)
!$OMP END CRITICAL
      end do
!$OMP END PARALLEL

to the same reference time.
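The lock/unlock variant would be measured in the same way. A minimal sketch of the measured region, reusing innerreps, nthreads, dlength and delay from the fragments above and the standard OpenMP lock routines (this is an assumed shape, not the benchmark's exact source):

      use omp_lib
      integer(kind=omp_lock_kind) :: lock
      call omp_init_lock(lock)
!$OMP PARALLEL PRIVATE(j)
      do j=1,innerreps/nthreads
         call omp_set_lock(lock)
         call delay(dlength)
         call omp_unset_lock(lock)
      end do
!$OMP END PARALLEL
      call omp_destroy_lock(lock)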
Experimental method (cont.)
• Can use the same method as for the DO directive to investigate loop scheduling overheads.
• For loop scheduling options, overhead depends on:
  • number of threads
  • number of iterations per thread
  • execution time of loop body
  • chunk size
• Large parameter space - fix the first three and look at varying chunk size:
  • 4 threads
  • 1024 iterations per thread
  • 100 clock cycles to execute loop body
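A sketch of the scheduled-loop variant, with the same delay routine and parameter names as above; only the SCHEDULE clause and chunksize would change between measurements (itersperthread is an illustrative name for the 1024 iterations per thread, not a name from the benchmark):

!$OMP PARALLEL PRIVATE(j)
      do j=1,innerreps
!$OMP DO SCHEDULE(DYNAMIC,chunksize)
         do i=1,itersperthread*nthreads
            call delay(dlength)
         end do
!$OMP END DO
      end do
!$OMP END PARALLEL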
Timing
• Need to take care with timing routines:
  • differences of 32-bit floating point values in seconds (e.g. etime) lose too much precision.
  • need microsecond accuracy (Fortran 90 system_clock isn't good enough on some systems).
• For statistical stability, repeat each measurement 50 times per run, and for 20 runs of the executable:
  • observe significant variation between runs which is absent within a given run.
• Reject runs with large standard deviations, or with large numbers of outliers.
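A sketch of the kind of timer check and repetition loop this implies, using the portable omp_get_wtime/omp_get_wtick routines; the kernel and the summary statistics are illustrative, not the benchmark's exact rejection test:

program timing_sketch
  use omp_lib
  implicit none
  integer, parameter :: nreps = 50
  double precision :: times(nreps), t0, mean, sd
  integer :: k

  ! Check the clock resolution first: roughly microsecond accuracy is needed.
  print *, 'timer resolution (s): ', omp_get_wtick()

  ! Repeat each measurement many times within one run ...
  do k = 1, nreps
     t0 = omp_get_wtime()
     call measured_kernel()
     times(k) = omp_get_wtime() - t0
  end do

  ! ... and summarise with mean and standard deviation, so that
  ! noisy runs (large sd or many outliers) can be rejected.
  mean = sum(times) / nreps
  sd   = sqrt(sum((times - mean)**2) / (nreps - 1))
  print *, 'mean = ', mean, '  sd = ', sd

contains
  ! Placeholder for the directive-plus-delay loop being timed.
  subroutine measured_kernel()
    double precision :: a
    integer :: i
    a = 0.0d0
    do i = 1, 100000
       a = a + i
    end do
    if (a < 0.0d0) print *, a
  end subroutine measured_kernel
end program timing_sketch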
Systems tested
Benchmark codes have been run on:
• Sun HPC 3500, eight 400 MHz UltraSPARC II processors, KAI guidef90 preprocessor, Solaris f90 compiler.
• SGI Origin 2000, forty 195 MHz MIPS R10000 processors, MIPSpro f90 compiler (access to 8 processors only).
• Compaq AlphaServer, four 525 MHz EV5/6 processors, Digital f90 compiler.
Observations
• PARALLEL directive uses 2 barriers
  • is this strictly necessary?
• PARALLEL DO costs twice as much as DO
• REDUCTION clause scales badly
  • should use a fan-in method?
• SINGLE should not cost more than BARRIER
• Mutual exclusion scales badly on Origin 2000
• CRITICAL directive very expensive on Compaq
Observations (cont.)
• Small chunk sizes very expensive
  • compiler should generate code statically for a block cyclic schedule.
• DYNAMIC much more expensive than STATIC, especially on Origin 2000
• On Origin 2000 and Compaq, block cyclic is more expensive than block, even with one chunk per thread.
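For reference, "block" and "block cyclic" here correspond to the two forms of the STATIC schedule clause; a minimal illustration (n, chunksize and work are placeholders, not names from the benchmark):

! Block: SCHEDULE(STATIC) gives each thread one contiguous chunk.
!$OMP DO SCHEDULE(STATIC)
      do i=1,n
         call work(i)
      end do
!$OMP END DO

! Block cyclic: SCHEDULE(STATIC,chunksize) deals out chunks of
! chunksize iterations to the threads in round-robin order.
!$OMP DO SCHEDULE(STATIC,chunksize)
      do i=1,n
         call work(i)
      end do
!$OMP END DO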
Conclusions and future work
• Set of benchmarks to measure synchronisation and scheduling costs in OpenMP.
• Show significant differences between systems.
• Show some potential areas for optimisation.
• Would like to run on more (and larger) systems.