
COMP60621 Concurrent Programming for Numerical Applications



  1. COMP60621 Concurrent Programming for Numerical Applications. Lecture 3: Parallel Performance and Execution Time Overheads. Len Freeman, Graham Riley. Centre for Novel Computing, School of Computer Science, University of Manchester

  2. Overview • A Theory of Parallel Performance • Execution time and performance • The ideal model of parallel performance • Execution time overheads • The Overhead of Non-Parallel Code • Amdahl's Law revisited • Amdahl's Law as an execution overhead • The Overhead of Scheduling • Additional code => additional overhead • Experimental Technique • Methodology, measurement, derivation • Tabulating and plotting results • Summary

  3. Parallel Performance • In establishing whether or not a parallel code is a success, performance on a range of parallel configurations is the main concern. • We define (temporal) performance using P processors (simultaneously active processes or threads) as $R_P = 1/T_P$, where $T_P$ is the time to execute the parallel code on P processors. $T_1$ is the time to execute the parallel code with only one processor, and $T_{ref}$ is the time to execute some reference code for the application. It is convenient, but not essential, to use the best known serial code as the reference code. It is essential that the same reference code is always used to compare different parallel versions of the same original serial code. • The quantities $T_{ref}$, $T_1$ and $T_P$ must all be measurable (or at least derivable from measurements that can be taken directly).
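A quick worked illustration of the definition, using a value taken from the overhead analysis table later in the lecture: a 10-processor run that takes $T_{10} = 2.10$ ms corresponds to a temporal performance of

```latex
R_{10} = \frac{1}{T_{10}} = \frac{1}{2.10 \times 10^{-3}\ \mathrm{s}} \approx 476\ \mathrm{s}^{-1}
```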

  4. ‘Ideal’ Parallel Execution • In an ideal world, performance would increase at a rate commensurate with the number of processors being applied, so that: $R_P = P \times R_1$. • Moreover, on a single processor, we would ideally expect there to be no extra time involved in executing the parallel code, as compared to the reference (serial) code: i.e., $T_1 = T_{ref}$. • Put another way, in this ideal world, the time for P processors to execute the parallel code would be the time for the best known serial code divided by P: $T_P = T_{ref}/P$.

  5. ‘Ideal’ Parallel Execution • Of course, the world is not ideal, and neither of these ideals can be met in practice. We can explain this lack of ideal behaviour by introducing the notion of execution time overheads which are incurred when a parallel code is run on P processors.

  6. Execution Time Overheads • Compared to a serial code, a parallel code almost always requires extra execution time on a single processor: $T_1 > T_{ref}$. • But, even when this is not so, as P increases, the eventual lack of (P-fold) software parallelism must lead to $T_P$ exceeding $T_{ref}/P$. The resulting excess time is the execution time overhead associated with P-processor execution, $O_P = T_P - T_{ref}/P$.
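For instance, using measurements that appear in the overhead analysis table later in the lecture ($T_{ref} = 11.00$ ms), the overhead of a 5-processor run taking 3.10 ms is

```latex
O_5 = T_5 - \frac{T_{ref}}{5} = 3.10 - \frac{11.00}{5} = 3.10 - 2.20 = 0.90\ \mathrm{ms}
```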

  7. Execution Time Overheads • Take, for example, the case of Amdahl's Law, where we have a code which takes time $T_S$ to be executed serially, and which is perfectly parallelisable except for a completely serial, combined start-up and tidy-up time of $T_{ser}$. The time to execute this hypothetical code, in parallel, on a P-processor system is: $T_P = \frac{T_S - T_{ser}}{P} + T_{ser}$.

  8. Amdahl's Law as Overhead • Let $\alpha$ be the fraction of the serial code that is perfectly parallelisable; then $T_P = \frac{\alpha T_S}{P} + (1-\alpha)T_S$. • This is the classic equation for Amdahl's Law. Re-arrangement yields the following: $T_P = \frac{T_S}{P} + \frac{P-1}{P}(1-\alpha)T_S$.
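A step-by-step version of the re-arrangement (nothing new, just algebra that splits off the ideal term $T_S/P$):

```latex
T_P = \frac{\alpha T_S}{P} + (1-\alpha)T_S
    = \frac{T_S}{P} - \frac{(1-\alpha)T_S}{P} + (1-\alpha)T_S
    = \underbrace{\frac{T_S}{P}}_{\text{ideal term}}
      + \underbrace{\frac{P-1}{P}(1-\alpha)T_S}_{\text{overhead term}}
```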

  9. Amdahl's Law as Overhead • The ideal term, $T_S/P$, is the time we would expect if all the code could be executed perfectly in parallel. The overhead term, $\frac{P-1}{P}(1-\alpha)T_S$, is entirely due to the existence of serial code that has not been (indeed, cannot be) parallelised.

  10. Amdahl's Law • [Figure: execution profile contrasting sequential execution with parallel execution under Amdahl's Law.]

  11. Amdahl's Law as Overhead • Note that the P-processor overhead, $O_P$, is the mean idle time over the P processors. In this case, we have (P - 1) processors, all of which are idle for time $(1-\alpha)T_S$, and one processor which is never idle. Hence, the mean idle time across the P processors is: $O_P = \frac{(P-1)(1-\alpha)T_S}{P}$.

  12. Sequential Code Overhead • To take a concrete example, consider a code with a total serial execution time $T_S = 11$ ms, of which a combined start-up and tidy-up time $T_{ser} = 1$ ms cannot be parallelised, so that $\alpha = (T_S - T_{ser})/T_S = 10/11 \approx 0.91$. (These are the values used in the overhead analysis tables later in the lecture.)


  14. Sequential Code Overhead • Calculate the Amdahl's Law (serial code) overhead for this example; a worked calculation follows below.
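A worked sketch, assuming the example values above ($T_S = 11$ ms, $T_{ser} = 1$ ms, so $(1-\alpha)T_S = 1$ ms), which are inferred from the overhead analysis tables later in the lecture:

```latex
O_P = \frac{P-1}{P}(1-\alpha)T_S = \frac{P-1}{P}\times 1\ \mathrm{ms}
\quad\Rightarrow\quad
O_2 = 0.50\ \mathrm{ms},\;\; O_{10} = 0.90\ \mathrm{ms},\;\; O_{100} = 0.99\ \mathrm{ms}
```

These values reappear as the $O_{ser}$ row of the final overhead analysis table.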

  15. Performance Curves • It is often more illuminating to visualise such results in the form of a performance curve, that is, a plot of temporal performance, $R_P$, against P, and to compare the achieved $R_P$ against the ideal $P \times R_S$ ($= P \times R_1$). (Units are ms for $T_P$ and s⁻¹ for $R_P$ and $P \times R_S$.) [Figure: performance curve for the example code, achieved $R_P$ versus the ideal $P \times R_S$.]

  16. Generalised Overhead • In general, we express the time for any parallel code to execute on a P-processor parallel computer as follows: $T_P = \frac{T_{ref}}{P} + O_P$, where $O_P$ is the (generalised) overhead (time) associated with execution of the code using the P processors. • Note that some researchers report the ratio of serial execution time to parallel execution time – this is known as the (parallel) speedup, $S_P = T_1/T_P$ (or $T_{ref}/T_P$). • We prefer to avoid this metric because it is not always clear which serial time ($T_{ref}$ or $T_1$) has been used, and so it may not be possible to decide whether performance has actually improved.
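To illustrate the ambiguity, here is a small example using values from the overhead analysis tables later in the lecture ($T_{ref} = 11.00$ ms, $T_1 = 11.10$ ms, $T_{10} = 2.10$ ms); the two conventions give different speedup figures for the same run:

```latex
\frac{T_1}{T_{10}} = \frac{11.10}{2.10} \approx 5.29,
\qquad
\frac{T_{ref}}{T_{10}} = \frac{11.00}{2.10} \approx 5.24
```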

  17. Performance Loss • Overhead time corresponds to a shortfall in performance, compared to the ideal. Hence, we could seek a general expression for parallel performance of the form: $R_P = P \times R_S - L_P$, where $L_P$ is the performance shortfall (loss) associated with P-processor execution. In practice, such expressions are appreciably more complicated than those for execution time, and we shall mostly avoid them.

  18. Efficiency • One general expression for performance is: $R_P = \frac{1}{T_P} = \frac{P}{T_{ref} + P\,O_P}$. • A useful related quantity is the efficiency of a P-processor parallel execution, given by: $E_P = \frac{R_P}{P \times R_S} = \frac{T_{ref}}{P\,T_P}$.
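As a check against the overhead analysis table later in the lecture ($T_{ref} = 11.00$ ms):

```latex
E_2 = \frac{T_{ref}}{2\,T_2} = \frac{11.00}{2 \times 6.10} \approx 0.90 = 90\%,
\qquad
E_{10} = \frac{11.00}{10 \times 2.10} \approx 0.52 = 52\%
```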

  19. Scheduling Overhead • Once a team of threads has been established to execute a parallel loop (by means of an OpenMP PARALLEL DO directive), each constituent thread needs to determine its own particular set of responsibilities vis-à-vis the iterations of the loop. As explained earlier, this is achieved by scheduling code which is executed by each thread at the start of each parallelised loop; this code constitutes a further source of execution time overhead. A sketch of such a loop follows below.
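A minimal sketch of the kind of loop being discussed, assuming OpenMP Fortran; the subroutine and array names are purely illustrative. The SCHEDULE clause selects how iterations are assigned to threads, and the per-thread work of claiming iterations at loop entry is the scheduling overhead:

```fortran
! Hypothetical example: scale a vector in parallel with an OpenMP
! PARALLEL DO. Each thread executes scheduling code at loop entry
! to claim its share of the iterations; that is the overhead T_sch.
subroutine scale_vector(n, a, factor)
  implicit none
  integer, intent(in)             :: n
  double precision, intent(inout) :: a(n)
  double precision, intent(in)    :: factor
  integer :: i

!$omp parallel do schedule(static) shared(a, factor, n) private(i)
  do i = 1, n
     a(i) = factor * a(i)
  end do
!$omp end parallel do
end subroutine scale_vector
```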

  20. Scheduling Overhead • Suppose the time to execute the scheduling code is a constant, $T_{sch}$, for all threads. The execution profile will look something like: [Figure: per-thread execution profile, with the scheduling time $T_{sch}$ appearing at the start of each parallelised loop.]

  21. Scheduling Overhead • The scheduling time, $T_{sch}$ (shown as the fat vertical lines in the figure), is additional overhead. This is not idle time, as it was for Amdahl's Law; it is extra processing time that was unnecessary for serial execution. The additional overhead is the mean extra time, which in this case is simply $T_{sch}$ (assumed to be the same for all P processors). Hence, overall: $O_P = T_{sch} + \frac{P-1}{P}(1-\alpha)T_S$. • Notice how the overhead times due to these two different causes simply add together. It is this property that makes us prefer to work with overhead time, rather than performance loss.

  22. Experimental Method • High Performance Computing is essentially experimental; it involves the design and execution of scientific experiments, including observation and analysis of quantitative data. It is important that effective experimental methods are used. Thus • have a clear purpose in mind before commencing an experiment; design the experiment to fit the purpose; • use a log-book to record all details of experiments; include the motivation and the design, as well as the results; and • when writing up a formal report, be precise about what experiments were run and exactly what was measured and how; a reader must be able to fully understand (and repeat) what you have done.

  23. Measuring Execution Time • It should be clear from the theory presented earlier that the experimental observation will require measurement of execution times on single- and multi-processor computers. This is not as straightforward as it might seem, because two measurements of an execution time rarely deliver the same value, even in seemingly identical circumstances. [Exercise: discuss possible causes for this phenomenon, and estimate the variation in measurement that might be expected to result from each cause.] Good experimental method requires us to repeat such measurements several times and compute a suitable 'consensus' value. • In the laboratory, we ask you to determine sources and magnitudes of various execution time overheads. Typically, you will need to measure execution times by inserting timer statements at appropriate points in your FORTRAN source code. Overhead times can then be derived using an appropriate formula from the theory.
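A minimal sketch of the kind of timing harness being described, assuming OpenMP Fortran and its portable wall-clock timer omp_get_wtime(); the region being timed and the number of repetitions are illustrative assumptions. Repeating the measurement and keeping the minimum gives one possible 'consensus' value:

```fortran
! Hypothetical timing harness: measure a code region several times
! and keep the smallest wall-clock time as the consensus value.
program time_region
  use omp_lib            ! provides omp_get_wtime()
  implicit none
  integer, parameter :: nrep = 5
  integer            :: rep
  double precision   :: t0, t1, tbest

  tbest = huge(tbest)
  do rep = 1, nrep
     t0 = omp_get_wtime()
     call work_to_be_timed()        ! the region under measurement
     t1 = omp_get_wtime()
     tbest = min(tbest, t1 - t0)    ! keep the least perturbed run
  end do
  print '(a, f10.6, a)', 'Best time: ', tbest, ' s'

contains

  subroutine work_to_be_timed()
    ! placeholder for the parallel loop (or whole code) being measured
  end subroutine work_to_be_timed

end program time_region
```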

  24. Calculating Overheads • Often you will need to time the execution of parts of your code. On most systems it is easy to measure TP, the overall time for a P-processor execution. It is also relatively easy to measure the time around, say, an outer DO loop. However, it is frequently quite difficult to measure time inside a parallel loop. Also, as the part of the code that is being measured gets smaller, so the relative variation of the timing measurements becomes more severe. • The laboratory exercises encourage you to collect execution time data for varying numbers of processors, and to plot graphs such as, for example, performance curves. • A good way of recording the data is to put it in a table which expands as the sources and magnitudes of overheads become clearer. For example, the table overleaf anticipates the data that might be collected for a simple code exhibiting just Amdahl's Law and Scheduling overheads.

  25. Overhead Analysis Table • All times are in ms; rates are in s⁻¹.

  26. Overhead Analysis Table
      P           1       2       5      10      20      50     100
      TP      11.10    6.10    3.10    2.10    1.60    1.30    1.20

  27. Overhead Analysis Table
      P           1       2       5      10      20      50     100
      TP      11.10    6.10    3.10    2.10    1.60    1.30    1.20
      Tref/P  11.00    5.50    2.20    1.10    0.55    0.22    0.11
      OP       0.10    0.60    0.90    1.00    1.05    1.08    1.09

  28. Overhead Analysis Table
      P           1       2       5      10      20      50     100
      TP      11.10    6.10    3.10    2.10    1.60    1.30    1.20
      Tref/P  11.00    5.50    2.20    1.10    0.55    0.22    0.11
      OP       0.10    0.60    0.90    1.00    1.05    1.08    1.09
      EP        99%     90%     71%     52%     34%     17%      9%

  29. Overhead Analysis Table • In the following table, the observed overhead $O_P$ is broken down into a scheduling component, $O_{sch} = T_{sch}$, and a serial-code component, $O_{ser} = \frac{P-1}{P}(1-\alpha)T_S$.

  30. Overhead Analysis Table
      P           1       2       5      10      20      50     100
      TP      11.10    6.10    3.10    2.10    1.60    1.30    1.20
      Tref/P  11.00    5.50    2.20    1.10    0.55    0.22    0.11
      OP       0.10    0.60    0.90    1.00    1.05    1.08    1.09
      Osch     0.10    0.10    0.10    0.10    0.10    0.10    0.10
      Oser     0.00    0.50    0.80    0.90    0.95    0.98    0.99
      Note that in this case $O_P = O_{sch} + O_{ser}$, so that the observed overhead is fully covered by the scheduling overhead and the serial (insufficient parallelism) overhead.

  31. Interpolation • It is time-consuming to execute a code several times each for a large number of different values of P, so it is common for only a few representative experiments to be performed. In such a case, some method is needed for interpolating results between the values of P for which measurements have actually been taken. The simplest technique, and the one we recommend, is linear interpolation: • For P = P1, P2, P3, where P1 < P2 < P3, and we have measured values for $R_{P_1}$ and $R_{P_3}$, but for no intermediate value of P, we calculate an interpolated value for $R_{P_2}$ as follows: $R_{P_2} \approx R_{P_1} + \frac{P_2 - P_1}{P_3 - P_1}\,(R_{P_3} - R_{P_1})$.
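A worked sketch, assuming we have measured only the P = 5 and P = 10 runs from the tables above ($T_5 = 3.10$ ms, $T_{10} = 2.10$ ms) and want an estimate of the performance at P = 7:

```latex
R_5 = \frac{1}{3.10\ \mathrm{ms}} \approx 323\ \mathrm{s}^{-1},
\qquad
R_{10} = \frac{1}{2.10\ \mathrm{ms}} \approx 476\ \mathrm{s}^{-1}
\\[4pt]
R_7 \approx R_5 + \frac{7-5}{10-5}\,(R_{10}-R_5)
    \approx 323 + 0.4 \times 153 \approx 384\ \mathrm{s}^{-1}
```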

  32. Summary • We have introduced much of the necessary background for the first laboratory exercises. • Programming for these exercises will follow the thread-based approach, but the OpenMP PARALLEL DO compiler directive abstracts away from threads almost entirely. • Performance is the inverse of execution time. Execution time overheads associated with parallel threads translate into performance shortfall, compared with the expected ideal. In practice, it is important to be able to identify the source and magnitude of each overhead contributing to poor performance. We have introduced two specific sources of overhead: non-parallel code and scheduling.
