Performance and Scalability of Parallel Systems
• Evaluation
  • sequential: run time (execution time) Ts = T(input size)
  • parallel: run time (from start until the last PE finishes) Tp = T(input size, p, architecture)
  • a parallel algorithm cannot be evaluated in isolation from the parallel architecture
  • parallel system: parallel algorithm × parallel architecture
• Metrics: evaluate the performance of the parallel system
• Scalability: the ability of a parallel algorithm to achieve performance gains proportional to the number of PEs
Performance Metrics
• Run time: Ts, Tp
• Speedup: how much performance is gained by running the application on p identical processors
  • S = Ts / Tp
  • Ts: run time of the fastest sequential algorithm for solving the same problem
  • If the fastest sequential algorithm is not known yet (only a lower bound is known), or is known but its constants make it impractical at run time, then take the fastest known sequential algorithm that can be practically implemented
  • Speedup is therefore a relative metric
Adding n numbers on n processors (hypercube, n = p = 2^k)
• Ts = Θ(n)
• Tp = Θ(log n)
• S = Θ(n / log n)
• Efficiency: measure of how effectively the problem is solved on p processors
  • E ∈ (0, 1], E = S / p
  • if p = n, E = Θ(1 / log n)
• Cost
  • Cseq(fast) = Ts,  Cpar = p·Tp
  • cost-optimal: Cpar ~ Cseq(fast)
  • granularity: p = n is fine-grained; p < n is coarse-grained
  • here Cseq(fast) = Θ(n), Cpar = Θ(n log n) => not cost-optimal, E = Θ(1 / log n)
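As a quick numeric illustration of these Θ-expressions, here is a minimal sketch (assuming unit-cost additions and hops, and ignoring the hidden constants):

```python
import math

def add_n_on_n(n):
    """Adding n numbers on p = n hypercube PEs (unit-time add and hop assumed)."""
    Ts = n                   # ~ Theta(n) sequential additions
    Tp = math.log2(n)        # ~ Theta(log n) combine steps
    S = Ts / Tp              # speedup,    Theta(n / log n)
    E = S / n                # efficiency, Theta(1 / log n)
    C = n * Tp               # parallel cost, Theta(n log n) -> not cost-optimal vs Theta(n)
    return S, E, C

print(add_n_on_n(1024))      # roughly (102.4, 0.1, 10240.0)
```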
[Figure: adding 16 numbers on 16 processors. (a) initial data distribution and the first communication step; (b) second communication step; (c) third communication step; (d) fourth communication step; (e) accumulation of the sum at processor 0 after the final communication]
Effects of Granularity on Cost-Optimality
• Assume the algorithm is designed for n (virtual) PEs
• If only p physical PEs are available, each physical PE simulates n/p virtual PEs
• The computation at each physical PE then increases by a factor of n/p
• Note: even if p < n, this scaling down does not necessarily yield a cost-optimal algorithm
[Figure: four processors simulating the communication steps of 16 processors. (a) first communication step, simulated in four substeps; (c) simulation of the third step in two substeps; (d) simulation of the fourth step; (e) final result]
Adding n numbers on p processors (hypercube, p < n)
• n = 2^k (e.g., n = 16)
• p = 2^m (e.g., p = 4)
• computation + communication (first log p simulated steps): Θ((n/p) log p)
• remaining local computation: Θ(n/p)
• parallel execution time: Tpar = Θ((n/p) log p)
• Cpar = p · Θ((n/p) log p) = Θ(n log p)
• Cseq(fast) = Θ(n) => not cost-optimal
A Cost-Optimal Algorithm
• local computation (each PE adds its n/p numbers): Θ(n/p)
• computation + communication (combining the p partial sums): Θ(log p)
• Tpar = Θ(n/p + log p)
• Cpar = p · Tpar = Θ(n + p log p) = Θ(n) for n = Ω(p log p)
• Cseq(fast) = Θ(n) => cost-optimal
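A small sketch contrasting the naive simulation with the cost-optimal version (unit-time operations assumed; the sizes n = 1024, p = 32 are only illustrative choices):

```python
import math

def naive_simulation(n, p):
    """Simulate the n-PE algorithm on p PEs: Theta((n/p) log p) time, Theta(n log p) cost."""
    Tp = (n / p) * math.log2(p)
    return Tp, p * Tp

def cost_optimal(n, p):
    """Add n/p numbers locally, then combine the p partial sums in log p steps."""
    Tp = n / p + math.log2(p)
    return Tp, p * Tp

n, p = 1024, 32
print(naive_simulation(n, p))   # (160.0, 5120.0) -- cost >> n, not cost-optimal
print(cost_optimal(n, p))       # (37.0, 1184.0)  -- cost ~ Theta(n), cost-optimal
```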
(a) If the algorithm is cost-optimal for p = n:
• p physical PEs, each PE simulates n/p virtual PEs
• if the overall communication does not grow by more than a factor of n/p,
• => the total parallel run time grows by at most a factor of n/p: Tcomp + Tcomm = Ttotal = T'p = Θ((n/p) · Tp)
• => new cost: p · T'p = p · (n/p) · Tp = n · Tp = Θ(cost on n PEs)
• => the new algorithm using p processors is cost-optimal (p < n)
(b) If the algorithm is not cost-optimal for p = n:
• increasing the granularity (p < n) does not necessarily help
• => the new algorithm using p < n PEs may still not be cost-optimal
• e.g., adding n numbers on p processors (hypercube, p < n)
  • n = 2^k (n = 16), p = 2^m (p = 4)
  • each physical PE simulates n/p virtual PEs
  • the first log p (here 2) of the log n (here 4) steps of the original algorithm are simulated in Θ((n/p) log p) steps
  • the remaining steps require no communication (the virtual PEs that continue to communicate in the original algorithm are simulated by the same physical PE here)
The Role of Mapping Computations Onto Processors in Parallel Algorithm Design
• For a cost-optimal parallel algorithm, E = Θ(1)
• If a parallel algorithm on p = n processors is not cost-optimal, it does not follow that a cost-optimal algorithm can be found for p < n
• Even if a cost-optimal algorithm is found for p < n, it is not necessarily the algorithm with the best parallel run time
• Performance (parallel run time) depends on (1) the number of processors and (2) the data mapping
• The parallel run time for the same problem (same problem size) depends on the mapping of virtual PEs onto physical PEs
• Performance therefore depends on the data mapping onto the coarse-grained parallel computer
  • e.g., multiplying an n×n matrix by a vector on a p-processor hypercube
  • e.g., parallel FFT on a hypercube with cut-through routing
• W computation steps => Pmax = W
  • with Pmax PEs, each PE executes one step of the algorithm
  • with p < W PEs, each PE executes a larger number of steps
• The choice of the best algorithm for the local computations depends on the number of PEs
• The optimal algorithm for solving a problem on an arbitrary number of PEs cannot be obtained simply from the most fine-grained parallel algorithm
• Analysis of the fine-grained parallel algorithm may not reveal important facts such as:
  • if the message is short (one word only), the transfer time between two PEs is of the same order for store-and-forward and cut-through routing
  • if the message is long, cut-through routing is faster than store-and-forward
  • with cut-through routing, performance on a hypercube and on a mesh is identical
  • on a mesh with store-and-forward routing, performance is worse
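For reference, a sketch of the usual cost models behind these statements (ts = start-up time, tw = per-word time, th = per-hop time; the values ts = tw = th = 1 are arbitrary illustrative choices):

```python
def store_and_forward(m, l, ts=1.0, tw=1.0, th=1.0):
    """m-word message over l links: the whole message is received and forwarded at every hop."""
    return ts + (m * tw + th) * l

def cut_through(m, l, ts=1.0, tw=1.0, th=1.0):
    """m-word message over l links: the header sets up the path, the body is pipelined through."""
    return ts + l * th + m * tw

print(store_and_forward(1, 4), cut_through(1, 4))        # 9.0 6.0       -- one word: same order
print(store_and_forward(1000, 4), cut_through(1000, 4))  # 4005.0 1005.0 -- long message: cut-through wins
```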
Design
• devise the parallel algorithm for the finest granularity
• map the data onto the PEs
• describe how the algorithm is implemented on an arbitrary number of PEs
• Variables: problem size, number of PEs
Scalability
• Adding n numbers on a p-processor hypercube
• assume 1 unit of time for: adding 2 numbers; communicating with a directly connected PE
• adding the n/p local numbers: n/p − 1
• the p partial sums are added in log p steps
• each step: 1 addition and 1 communication => 2 log p
• Tp = n/p − 1 + 2 log p ≈ n/p + 2 log p
• Ts = n − 1 ≈ n
• S = Ts / Tp = n / (n/p + 2 log p) => S = S(n, p)
• E = S / p = n / (n + 2p log p) => E(n, p) can be computed for any pair n and p
• As p increases (n fixed), S and E drop; to recover the speedup we must increase n, which raises E again
• For a larger problem size, S and E are higher, but they still drop as p increases
• Scalability of a parallel system: a measure of its capability to increase speedup in proportion to the number of processors
Efficiency of adding n numbers on a p-processor hypercube (cost-optimal algorithm)
• S = n / (n/p + 2 log p),  E = S / p = n / (n + 2p log p) = E(n, p)

| n   | p = 1 | p = 4 | p = 8 | p = 16 | p = 32 |
|-----|-------|-------|-------|--------|--------|
| 64  | 1.0   | .80   | .57   | .33    | .17    |
| 192 | 1.0   | .92   | .80   | .60    | .38    |
| 320 | 1.0   | .95   | .87   | .71    | .50    |
| 512 | 1.0   | .97   | .91   | .80    | .62    |
• Keeping E = 0.80 constant requires n = Θ(p log p); here n = 8 p log p:
  • for n = 64 and p = 4: 64 = 8 · 4 · log 4
  • for n = 192 and p = 8: 192 = 8 · 8 · log 8
  • for n = 512 and p = 16: 512 = 8 · 16 · log 16
• Conclusion, for adding n numbers on p processors with the cost-optimal algorithm:
  • the algorithm is cost-optimal if n = Ω(p log p)
  • the algorithm is scalable if n is increased in proportion to Θ(p log p) as p is increased
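A short sketch that reproduces the efficiency table above from E = n / (n + 2p log p), and checks that E stays at 0.80 along the isoefficiency curve n = 8 p log p (unit-time operations assumed):

```python
import math

def efficiency(n, p):
    """E = S/p for Ts = n and Tp = n/p + 2 log p (unit-time operations assumed)."""
    return n / (n + 2 * p * math.log2(p))

for n in (64, 192, 320, 512):                        # rows of the table above
    print(n, [round(efficiency(n, p), 2) for p in (4, 8, 16, 32)])

for p in (4, 8, 16, 32):                             # along n = 8 p log p, efficiency is constant
    print(p, efficiency(8 * p * math.log2(p), p))    # always 0.8
```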
Problem size
• for matrix multiplication: input size n => O(n³); n' = 2n => O((2n)³) = O(8n³)
• for matrix addition: input size n => O(n²); n' = 2n => O((2n)²) = O(4n²)
• Problem size W: the number of basic computation steps of the fastest sequential algorithm on a sequential computer => W = Ts
• With this definition, doubling the problem size means performing twice the amount of computation
• A computation step is assumed to take 1 time unit; the message start-up time, per-word transfer time, and per-hop time can be normalized with respect to this unit computation time
Overhead Function
• ideal: E = 1 and S = p
• in reality: E < 1 and S < p
• overhead function To(W, p) = p·Tp − W: the time collectively spent by all processors over and above that required by the fastest known sequential algorithm to solve the same problem on a single PE
• for the cost-optimal algorithm for adding n numbers on a p-processor hypercube: To = p(n/p + 2 log p) − n = 2p log p
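The same example in a couple of lines, checking that To = p·Tp − W reduces to 2p log p (unit-time operations assumed):

```python
import math

def overhead(n, p):
    """To = p*Tp - W for the cost-optimal addition, with Tp = n/p + 2 log p and W = n."""
    return p * (n / p + 2 * math.log2(p)) - n

print(overhead(1024, 32), 2 * 32 * math.log2(32))   # 320.0 320.0 -- matches 2 p log p
```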
Isoefficiency Function
• If W = constant and p increases => E decreases
• If p = constant and W increases => E increases
• For scalable parallel systems we need to keep E constant as p grows, by increasing W
• E.g. 1: as p increases, W must grow exponentially with p => poorly scalable system, since the problem size must become very large to obtain good speedups
• E.g. 2: as p increases, W needs to grow only linearly with p => highly scalable system, since speedups proportional to the number of processors are obtained for moderately growing problem sizes
• The isoefficiency function dictates the growth rate of W required to keep E constant as p increases
• The isoefficiency function does not exist for unscalable parallel systems, because in such systems E cannot be kept constant as p increases, no matter how fast W increases.
Overhead Function
• Adding n numbers on a p-processor hypercube (cost-optimal algorithm): Tp = n/p + 2 log p, W = n => To = p·Tp − W = 2p log p
Isoefficiency Function
• For this example the asymptotic isoefficiency function is Θ(p log p), meaning:
  (1) if the number of PEs grows from p to p' > p, the problem size has to be increased by a factor of (p' log p') / (p log p) to obtain the same efficiency as on p processors
  (2) increasing the number of PEs by a factor of p'/p requires the problem size to grow by a factor of (p' log p') / (p log p) to increase the speedup by p'/p
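The derivation behind the Θ(p log p) figure, written out; K = 2E/(1 − E) is the constant fixed by the target efficiency:

```latex
E = \frac{W}{W + T_o(W,p)} = \frac{W}{W + 2p\log p}
\quad\Rightarrow\quad
W = \frac{2E}{1-E}\, p\log p = K\, p\log p = \Theta(p\log p)
```

With E = 0.80 this gives K = 8, i.e., W = 8 p log p, matching the constant-efficiency curve noted earlier.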
• Here the overhead To = 2p log p is an exclusive function of p; in general To = To(W, p)
• E = W / (W + To); to keep E constant, the ratio To / W must stay fixed as p and W increase
• To obtain nondecreasing efficiency (E' >= E), To should not grow faster than W
• If To has multiple terms, we balance W against each term of To and compute the respective isoefficiency function for each individual term
• The component of To that requires the problem size to grow at the highest rate with respect to p determines the overall asymptotic isoefficiency function of the parallel system
• Take the higher of the rates: to ensure that E does not decrease, the problem size needs to grow at least at that rate (this is the overall asymptotic isoefficiency function)
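As an illustration only (the slide's own multi-term overhead function is not reproduced here), suppose a hypothetical To = p^(3/2) + p^(3/4)·W^(3/4). Balancing W against each term separately:

```latex
W = \Theta\!\left(p^{3/2}\right)
\qquad\text{and}\qquad
W = p^{3/4} W^{3/4} \;\Rightarrow\; W^{1/4} = p^{3/4} \;\Rightarrow\; W = \Theta\!\left(p^{3}\right)
```

Taking the higher of the two rates, the overall asymptotic isoefficiency function of this hypothetical system is Θ(p³).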
Isoefficiency Functions
• Capture characteristics of the parallel algorithm and of the parallel architecture
• Predict the impact on performance as the number of PEs increases
• Characterize the amount of parallelism inherent in a parallel algorithm
• Allow studying the behavior of a parallel algorithm under hardware changes: speed of the PEs, speed of the communication channels
Cost-Optimality and Isoefficiency
• A parallel system is cost-optimal if and only if its overhead function does not grow asymptotically faster than the problem size.
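In symbols (this is just the definition of To rearranged):

```latex
p\,T_p = W + T_o(W,p)
\quad\Rightarrow\quad
p\,T_p = \Theta(W) \iff T_o(W,p) = O(W)
```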
Relationship Between Cost-Optimality and the Isoefficiency Function
• E.g., adding n numbers on a p-processor hypercube
• (a) non-cost-optimal algorithm (naive simulation): Tp = Θ((n/p) log p), p·Tp = Θ(n log p), W = Θ(n) => To grows faster than W; E = Θ(1/log p) independently of n, so efficiency cannot be maintained by increasing the problem size
• (b) cost-optimal algorithm: To = 2p log p => the problem size should grow at least as Θ(p log p) for the parallel algorithm to be scalable
Isoefficiency Function
• Determines the ease with which a parallel system can maintain a constant efficiency and thus achieve speedups that increase in proportion to the number of processors
• A small isoefficiency function means that small increments in the problem size are sufficient to use an increasing number of processors efficiently => the parallel system is highly scalable
• A large isoefficiency function indicates a poorly scalable parallel system
• The isoefficiency function does not exist for unscalable parallel systems, because in such systems the efficiency cannot be kept at any constant value as p increases, no matter how fast the problem size is increased.
Lower Bound on the Isoefficiency Function
• A small isoefficiency function => high scalability
• For a problem of size W, at most Pmax <= W processors can be used cost-optimally (if p > W, some PEs are idle)
• If W grows slower than Θ(p), then as p increases we eventually reach #PEs > W => E drops
• Asymptotically, the problem size must grow at least proportionally to p to maintain a fixed efficiency: W = Ω(p)
• Ω(p) is the asymptotic lower bound on the isoefficiency function; since Pmax = Θ(W), the isoefficiency function of an ideally scalable parallel system is W = Θ(p)
Degree of Concurrency and the Isoefficiency Function
• Degree of concurrency C(W): the maximum number of tasks that can be executed simultaneously at any time
• It is independent of the parallel architecture
• No more than C(W) processors can be employed efficiently
Effect of Concurrency on the Isoefficiency Function
• E.g., solving a system of n equations by Gaussian elimination: W = Θ(n³) and C(W) = Θ(n²) = Θ(W^(2/3)) => at most Θ(W^(2/3)) processors can be used efficiently
• Given p, the problem size should be at least Ω(p^(3/2)) to use them all
• The isoefficiency function due to concurrency is optimal, i.e., Θ(p), only if the degree of concurrency of the parallel algorithm is Θ(W)
• If the degree of concurrency of an algorithm is less than Θ(W), then the isoefficiency function due to concurrency is worse, i.e., greater than Θ(p)
• Overall isoefficiency function of a parallel system: the maximum of the isoefficiency function due to concurrency and the isoefficiency function due to the parallel overheads
Sources of Parallel Overhead
• The overhead function characterizes a parallel system
• Given the overhead function we can express Tp, S, E, and p·Tp as functions of W and p
• The overhead function encapsulates all causes of inefficiency of a parallel system, due to:
  • the algorithm
  • the architecture
  • the interaction between algorithm and architecture
Sources of Overhead
• Interprocessor communication
  • each PE spends tcomm; overall interprocessor communication time: p·tcomm
• Load imbalance
  • idle vs. busy PEs; the idle time of PEs contributes to overhead
• Extra computation
  • redundant computation: W for the best sequential algorithm, W' for a sequential algorithm that is easier to parallelize; W' − W contributes to overhead
  • serial sections: W = Ws + Wp, where Ws is executed by only 1 PE => (p − 1)·Ws contributes to overhead
Minimum Execution Time (assume p is not a constraint)
• For a given W, Tp first decreases and then increases with p; setting dTp/dp = 0 gives the p0 for which Tp is minimum
• Adding n numbers on a hypercube: Tp = n/p + 2 log p => p0 = n/2 and Tp(min) = 2 log n
• Cost (sequential): Θ(n)
• Cost (parallel, at p0 = n/2): p0 · Tp(min) = (n/2) · 2 log n = Θ(n log n)
• not cost-optimal, since Θ(n log n) > Θ(n)
• running this algorithm at p0 = n/2 is not cost-optimal, but the algorithm itself is cost-optimal when run on fewer processors (p = O(n / log n))
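A quick numeric check of this minimum (a sketch; unit-time operations and n = 1024 assumed):

```python
import math

def Tp(n, p):
    """Tp = n/p + 2 log p for the cost-optimal addition (unit-time operations assumed)."""
    return n / p + 2 * math.log2(p)

n = 1024
print(Tp(n, n // 2))               # 20.0 = 2 log2(1024)      -> Tp at p0 = n/2 equals 2 log n
print((n // 2) * Tp(n, n // 2))    # 10240.0 = Theta(n log n) -> running at p0 is not cost-optimal
```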
• Tp(cost-opt): the smallest parallel run time such that the cost is still optimal, for a fixed W
• Derive a lower bound for Tp such that the parallel cost is optimal:
  • if the isoefficiency function is Θ(f(p)), then a problem of size W can be executed cost-optimally only if W = Ω(f(p))
  • equivalently, p = O(f⁻¹(W)) is required for a cost-optimal solution
  • Tp for a cost-optimal algorithm is Θ(W/p), since p·Tp = Θ(W)
  • => Tp(cost-opt) = Θ(W / f⁻¹(W)), obtained with p = Θ(f⁻¹(W)) processors
Minimum Cost-Optimal Time for Adding n Numbers on a Hypercube
• Isoefficiency function: f(p) = Θ(p log p)
• If W = n = f(p) = p log p, then log n = log p + log log p ≈ log p for large p
• => f⁻¹(n) ≈ n / log n, i.e., at most Θ(n / log n) processors can be used cost-optimally
• For a cost-optimal solution: p = O(n / log n)
• For p = Θ(n / log n): Tp = n/p + 2 log p = Θ(log n) => Tp(cost-opt) = Θ(log n)
• Note: here the cost-optimal solution is also the best asymptotic solution in terms of execution time, since Tp(min) = Θ(log n) as well
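And the corresponding check at the cost-optimal operating point p = Θ(n / log n) (same assumptions as above):

```python
import math

def Tp(n, p):
    return n / p + 2 * math.log2(p)

n = 1024
p = int(n / math.log2(n))            # p = n / log n = 102
print(p, Tp(n, p), p * Tp(n, p))     # Tp ~ 23.4 = Theta(log n); cost ~ 2385 = Theta(n)
```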
• The isoefficiency function thus also determines, asymptotically, the maximum number of PEs for which the algorithm remains cost-optimal: p = O(f⁻¹(W)).