SCALABILITY ANALYSIS
PERFORMANCE AND SCALABILITY OF PARALLEL SYSTEMS
• Evaluation
  – Sequential: runtime (execution time) Ts = T(input size)
  – Parallel: runtime (from start until the last PE finishes) Tp = T(input size, p, architecture)
  Note: a parallel algorithm cannot be evaluated in isolation from the parallel architecture.
  Parallel system = parallel algorithm + parallel architecture.
• Metrics: evaluate the performance of the parallel system.
• SCALABILITY: the ability of a parallel algorithm to achieve performance gains proportional to the number of PEs.
PERFORMANCE METRICS
• Run-time: Ts, Tp
• Speedup: how much performance is gained by running the application on p (identical) processors
  S = Ts / Tp, where Ts is the run-time of the fastest sequential algorithm for solving the same problem.
  If that algorithm is
  – not known yet (only a lower bound is known), or
  – known, but with constants so large that it is impractical to implement,
  then take the fastest known sequential algorithm that can be practically implemented.
• Speedup is a relative metric.
Example: adding n numbers on n processors (hypercube)
• Normally S ≤ p; S > p is superlinear speedup.
• Ts = Θ(n), Tp = Θ(log n) => S = Θ(n / log n)   (n = p = 2^k)
• Efficiency E: a measure of how effectively the problem is solved on p processors
  E = S / p, E ∈ (0, 1]
  E measures the fraction of time for which a processor is usefully employed.
  If p = n: E = Θ(1 / log n)
• Cost
  Cost(sequential, fastest) = Ts
  Cost(parallel) = p·Tp
  If Cost(parallel) = Θ(Cost(sequential, fastest)), the algorithm is cost-optimal.
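As a rough check of these formulas, a minimal Python sketch (assuming, as the slides do later, unit time per addition and per nearest-neighbor communication step; the constants are only illustrative) that tabulates S, E and cost for n = p = 2^k:

```python
def metrics_add_n_on_n(k):
    """Adding n = 2**k numbers on p = n hypercube PEs.

    Minimal sketch: unit time per addition and per nearest-neighbor
    communication step, as assumed later in these slides.
    """
    n = p = 2 ** k
    Ts = n - 1                 # sequential: n - 1 additions
    Tp = 2 * k                 # log n steps, each = 1 addition + 1 communication
    S = Ts / Tp
    E = S / p
    return dict(n=n, Ts=Ts, Tp=Tp, S=round(S, 2), E=round(E, 4),
                cost_seq=Ts, cost_par=p * Tp)

if __name__ == "__main__":
    for k in (4, 8, 12):
        print(metrics_add_n_on_n(k))
```

The output shows E = Θ(1/log n) shrinking and the parallel cost p·Tp = Θ(n log n) outgrowing the sequential cost Θ(n), i.e. the n = p version is not cost-optimal.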
Example: adding n numbers on n processors (hypercube)
Cost(sequential, fastest) = Θ(n)
Cost(parallel) = Θ(n log n)
• p = n: fine granularity
• E = Θ(1 / log n)
• Not cost-optimal
Effects of Granularity on Cost-Optimality
• Scaling down (p < n)
  [Figure: a √n × √n grid of n virtual PEs mapped onto p physical PEs, n/p virtual PEs per physical PE]
• Assume n virtual PEs. If p is the number of physical PEs, each physical PE simulates n/p virtual PEs, so the computation at each PE increases by a factor of n/p.
• Note: even if p < n, this does not necessarily yield a cost-optimal algorithm.
Adding n numbers on p < n processors (hypercube), naive scaling down
n = 2^k, p = 2^m; e.g. n = 16, p = 4
• Computation + communication (first (n/p)·log p steps, here 8): Θ((n/p) log p)
• Computation only (last steps, here 4): Θ(n/p)
Parallel execution time = Θ((n/p) log p)
Cost(parallel) = p·Θ((n/p) log p) = Θ(n log p)
Cost(sequential, fastest) = Θ(n)
As p grows, the parallel cost grows asymptotically faster than the sequential cost – not cost-optimal.
Adding n numbers on p < n processors (hypercube): a cost-optimal algorithm
• Local computation: each PE adds its n/p numbers in Θ(n/p).
• Computation + communication: the p partial sums are combined in Θ(log p).
Parallel execution time = Θ(n/p + log p) = Θ(n/p) for n = Ω(p log p)
Cost(parallel) = p·Θ(n/p) = Θ(n)
Cost(sequential, fastest) = Θ(n)
Cost(parallel) = Cost(sequential, fastest) = Θ(n) – cost-optimal.
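To see the difference concretely, a small sketch (illustrative constants only; exact step counts depend on the machine model) comparing the cost p·Tp of the naive and the cost-optimal scaled-down variants:

```python
import math

def naive_cost(n, p):
    """Naive scaling down: Tp ~ (n/p) * log p, so cost ~ n log p."""
    return p * ((n // p) * math.log2(p))

def cost_optimal_cost(n, p):
    """Local sums then tree combine: Tp ~ n/p + log p, so cost ~ n + p log p."""
    return p * (n // p + math.log2(p))

if __name__ == "__main__":
    n = 1 << 20
    for p in (4, 16, 64, 256):
        print(f"p={p:4d}  naive cost={int(naive_cost(n, p)):10d}  "
              f"cost-optimal cost={int(cost_optimal_cost(n, p)):10d}   (n={n})")
```

The naive cost grows as Θ(n log p) while the cost-optimal version stays at Θ(n) as long as n >> p log p.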
If the algorithm (with p = n) is cost-optimal:
• p physical PEs, each simulating n/p virtual PEs.
Then, if the overall communication grows by no more than a factor of n/p (proper mapping), the total parallel run-time grows by at most a factor of n/p:
  (n/p)·Tcomp + (n/p)·Tcomm = (n/p)·Tp = T_(n/p)
Cost(parallel, p = n) = n·Tp
Cost(parallel, p < n) = p·T_(n/p) = p·(n/p)·Tp = n·Tp = Cost(parallel, p = n)
=> The scaled-down algorithm using p < n processors is also cost-optimal.
If the algorithm is not cost-optimal for p = n:
• Increasing the granularity (using p < n processors, each simulating n/p virtual PEs) may still not give a cost-optimal algorithm.
Example: adding n numbers on p processors (hypercube), n = 2^k, p = 2^m; e.g. n = 16, p = 4
• Each virtual PE i is simulated by physical PE (i mod p).
• The first log p steps (here 2) of the log n steps (here 4) of the original algorithm are simulated in (n/p)·log p steps (16/4 × 2 = 8 steps on p = 4 processors).
• The remaining steps require no communication (the virtual PEs that continue to communicate in the original algorithm are simulated by the same physical PE here).
The Role of Mapping Computations onto Processors in Parallel Algorithm Design
• For a cost-optimal parallel algorithm, E = Θ(1).
• If a parallel algorithm on p = n processors is not cost-optimal, it does not follow that a cost-optimal algorithm can be obtained for p < n.
• Even if a cost-optimal algorithm is found for p < n, it does not follow that it has the best possible parallel run-time.
• Performance (parallel run-time) depends on:
  1) the number of processors
  2) the data mapping (assignment)
The Role of Mapping Computations onto Processors in Parallel Algorithm Design
• The parallel run-time for the same problem (problem size) depends on the mapping of the virtual PEs onto the physical PEs.
• Performance critically depends on the data mapping onto a coarse-grained parallel computer.
  Examples: multiplying an n×n matrix by a vector on a p-processor hypercube (p square blocks vs. p slices of n/p rows); parallel FFT on a hypercube with cut-through routing.
• W = number of computation steps => Pmax = W
  – For p = Pmax, each PE executes one step of the algorithm.
  – For p < W, each PE executes a larger number of steps.
• The choice of the best algorithm for the local computations depends on the number of PEs (how finely the problem is fragmented).
• The optimal algorithm for solving a problem on an arbitrary number of PEs cannot, in general, be obtained from the most fine-grained parallel algorithm: the analysis of the fine-grained algorithm may not reveal facts that appear only in the analysis of the coarse-grained algorithm.
Notes:
  1) If the message is short (one word only), the transfer time between two PEs is the same for store-and-forward and cut-through routing.
  2) If the message is long, cut-through routing is faster than store-and-forward.
  3) With cut-through routing, performance on a hypercube and on a mesh is identical.
  4) Performance on a mesh with store-and-forward routing is worse.
Design steps:
  1) Devise the parallel algorithm for the finest grain.
  2) Map the data onto the PEs.
  3) Describe the algorithm's implementation on an arbitrary number of PEs.
Variables: problem size, number of PEs.
Scalability
S ≤ p; both S and E are functions of p.
Example: adding n numbers on a p-processor hypercube.
Assume 1 unit of time for adding two numbers or for communicating with a directly connected PE.
1) Adding n/p numbers locally takes n/p − 1.
2) The p partial sums are added in log p steps (each step: 1 addition + 1 communication) => 2 log p.
Tp = n/p − 1 + 2 log p ≈ n/p + 2 log p (for large n and p)
Ts = n − 1 ≈ n
S = n / (n/p + 2 log p) = np / (n + 2p log p) => S = S(n, p)
E = S/p = n / (n + 2p log p) => E = E(n, p)
These can be computed for any pair of n and p.
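A minimal sketch of these two formulas (same unit-time model as above; the constants are those of the slides, not of any real machine):

```python
import math

def speedup_efficiency(n, p):
    """S and E for adding n numbers on a p-processor hypercube,
    using the slide's model Tp = n/p + 2*log2(p), Ts = n."""
    Tp = n / p + 2 * math.log2(p)
    S = n / Tp                      # = n*p / (n + 2*p*log2(p))
    E = S / p
    return S, E

if __name__ == "__main__":
    for n in (64, 192, 512):
        for p in (1, 4, 8, 16, 32):
            S, E = speedup_efficiency(n, p)
            print(f"n={n:4d} p={p:3d}  S={S:7.2f}  E={E:5.2f}")
```

Running it shows the behaviour described next: for fixed n, E drops as p grows; for fixed p, E rises with n.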
• As p increases, n must also be increased to keep increasing S (otherwise the speedup saturates and E drops).
• For larger problem sizes, S and E are higher, but both still drop as p increases for fixed n.
• The goal is to keep E constant as p grows.
• Scalability of a parallel system is a measure of its capacity to increase speedup in proportion to the number of processors.
Efficiency of adding n numbers on a p-processor hypercube
For the cost-optimal algorithm:
S = np / (n + 2p log p)
E = n / (n + 2p log p), i.e. E = E(n, p); E stays constant when n = Ω(p log p).
E = 0.80 (constant) whenever n = 8p log p:
  n = 64,  p = 4
  n = 192, p = 8
  n = 512, p = 16
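A quick check, under the same unit-time model, that E stays at 0.80 along the curve n = 8p log p:

```python
import math

def efficiency(n, p):
    # E = n / (n + 2*p*log2(p)) from the slide's Tp model
    return n / (n + 2 * p * math.log2(p))

for p in (4, 8, 16, 32, 64):
    n = int(8 * p * math.log2(p))          # n = 8 p log p
    print(f"p={p:3d}  n={n:5d}  E={efficiency(n, p):.2f}")   # 0.80 every time
```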
Efficiency of adding n numbers on a p-processor hypercube – conclusions:
• For adding n numbers on p processors with the cost-optimal algorithm:
  – The algorithm is cost-optimal if n = Ω(p log p).
  – The algorithm is scalable if n is increased in proportion to Θ(p log p) as p is increased.
• Problem size should be measured by the amount of computation W, not by the input size:
  – Matrix multiplication: input size n => O(n^3) work; n' = 2n => O(n'^3) = O(8n^3).
  – Matrix addition: input size n => O(n^2) work; n' = 2n => O(n'^2) = O(4n^2).
  With W as the measure, doubling the problem size means performing twice the amount of computation.
Computation step:
• Assume a computation step takes 1 time unit.
• Message start-up time, per-word transfer time and per-hop time can be normalized with respect to this unit computation time.
• W = Ts (problem size = run-time of the fastest sequential algorithm on a sequential computer).
Overhead function:
• Ideally E = 1 and S = p; in reality E < 1 and S < p.
• The reason is overhead (interprocessor communication, etc.) => captured by the overhead function.
Overhead Function
• The time collectively spent by all processors beyond that required by the fastest sequential algorithm to solve the same problem on a single PE:
  To = To(W, p),  To = p·Tp − W
• For the cost-optimal algorithm for adding n numbers on a p-processor hypercube:
  Ts = W = n
  Tp = n/p + 2 log p
  To = p(n/p + 2 log p) − n = 2p log p
Isoefficiency Function
Tp = (W + To(W, p)) / p   (since To = p·Tp − W)
S = Ts / Tp = W / Tp = W·p / (W + To(W, p))
E = S / p = W / (W + To(W, p)) = 1 / (1 + To(W, p)/W)
• If W is constant and p increases, then E decreases.
• If p is constant and W increases, then E increases for scalable parallel systems.
• For a scalable, efficient system we need to keep E constant as p grows.
Isoefficiency Function
• Eg. 1: If W must grow exponentially with p, the parallel system is poorly scalable: the problem size has to be increased enormously to obtain good speedups on more processors.
• Eg. 2: If W needs to grow only linearly with p, the parallel system is highly scalable: speedup proportional to the number of processors can be maintained.
• E = const => To(W, p)/W = (1 − E)/E = const. With K = E / (1 − E):
  W = K·To(W, p)
  This relation dictates the growth rate of W required to keep E constant as p increases: it is the isoefficiency function.
• The isoefficiency function does not exist for unscalable parallel systems, because E cannot be kept constant as p increases, no matter how much or how fast W grows.
Overhead function (adding n numbers on a p-processor hypercube):
Ts = n
Tp = n/p + 2 log p
To = p·Tp − Ts = p(n/p + 2 log p) − n = 2p log p
Isoefficiency function: W = K·To(W, p); here To = 2p log p (note: To depends only on p)
W = 2Kp log p => the asymptotic isoefficiency function is Θ(p log p).
Meaning:
1) Increasing the number of PEs from p to p' > p => the problem size has to be increased by a factor of (p' log p') / (p log p) to obtain the same efficiency as on p processors.
2) Increasing the number of PEs by a factor p'/p requires the problem size to grow by a factor (p' log p') / (p log p) in order to increase the speedup by p'/p.
Here the communication overhead is a function of p alone: To = To(p). In general To = To(W, p), and the equation W = K·To(W, p) may involve many terms and can be hard to solve for W in terms of p.
Keeping E constant requires keeping the ratio To/W fixed. As p increases, W must increase to obtain a non-decreasing efficiency (E' ≥ E) => To should not grow faster than W; in fact, none of the terms of To should grow faster than W.
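A small sketch of how the Θ(p log p) isoefficiency function translates into required problem-size growth (illustrative only; any constant factor works the same way):

```python
import math

def required_growth(p, p_new):
    """Factor by which W must grow when scaling from p to p_new PEs,
    for an isoefficiency function W = Theta(p log p)."""
    return (p_new * math.log2(p_new)) / (p * math.log2(p))

if __name__ == "__main__":
    p = 4
    for p_new in (8, 16, 64, 256):
        print(f"{p} -> {p_new} PEs: W must grow by {required_growth(p, p_new):5.1f}x "
              f"(vs. {p_new // p}x more PEs)")
```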
• If To has multiple terms, we balance W against each term of To separately and compute the isoefficiency function for each individual term.
• The component of To that requires the problem size to grow at the highest rate with respect to p determines the overall asymptotic isoefficiency of the parallel system.
Example: To = p^(3/2) + p^(3/4)·W^(3/4)
  – W = K·p^(3/2) => Θ(p^(3/2))
  – W = K·p^(3/4)·W^(3/4) => W^(1/4) = K·p^(3/4) => W = K^4·p^3 => Θ(p^3)
  Take the higher of the two rates: to ensure E does not decrease, the problem size needs to grow as Θ(p^3) (overall asymptotic isoefficiency).
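A numeric sketch of why the larger term wins, using this hypothetical overhead function: growing W as Θ(p^3) keeps E roughly constant, while growing it only as Θ(p^(3/2)) (enough for the first term alone) lets E decay:

```python
def efficiency(W, To):
    return W / (W + To)

def overhead(W, p):
    # hypothetical overhead from the example: To = p^(3/2) + p^(3/4) * W^(3/4)
    return p ** 1.5 + p ** 0.75 * W ** 0.75

for p in (16, 64, 256, 1024):
    W_fast = p ** 3        # grow W as Theta(p^3): E stays near a constant
    W_slow = p ** 1.5      # grow W only as Theta(p^(3/2)): E decays
    print(p,
          round(efficiency(W_fast, overhead(W_fast, p)), 3),
          round(efficiency(W_slow, overhead(W_slow, p)), 3))
```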
The isoefficiency function:
• Captures characteristics of the parallel algorithm and the architecture.
• Predicts the impact on performance as the number of PEs increases.
• Characterizes the amount of parallelism in a parallel algorithm.
• Allows studying the behaviour of a parallel system under hardware changes (PE speed, number of PEs, communication channels).
Cost-Optimality and Isoefficiency
Cost-optimality: Ts / (p·Tp) = const
p·Tp = Θ(W)
W + To(W, p) = Θ(W)   (since To = p·Tp − W)
To(W, p) = O(W), i.e. W = Ω(To(W, p))
• A parallel system is cost-optimal iff its overhead function does not grow asymptotically faster than the problem size.
Relationship between Cost-Optimality and the Isoefficiency Function
Example: adding n numbers on a p-processor hypercube.
a) Non-cost-optimal version:
  W = Θ(n), Tp = Θ((n/p) log p)
  To = p·Tp − W = Θ(n log p)
  W = K·Θ(n log p) cannot hold for any constant K as p grows
  => the algorithm is not cost-optimal, not scalable, and its isoefficiency function does not exist.
b) Cost-optimal version:
  W = Θ(n), Tp = Θ(n/p + log p)
  To = Θ(n + p log p) − Θ(n) = Θ(p log p)
  W = K·Θ(p log p)
  The problem size must grow at least as fast as p log p for the parallel system to remain scalable:
  W = Ω(p log p)   (n = Ω(p log p) for cost-optimality)
Isoefficiency Function
• Determines the ease with which a parallel system can maintain a constant efficiency and thus achieve speedups increasing in proportion to the number of processors.
• A small isoefficiency function means that small increments in the problem size are sufficient for the efficient utilization of an increasing number of processors => the parallel system is highly scalable.
• A large isoefficiency function indicates a poorly scalable parallel system.
• The isoefficiency function does not exist for unscalable parallel systems, because in such systems the efficiency cannot be kept at any constant value as p increases, no matter how fast the problem size is increased.
Lower Bound on the Isoefficiency Function
• A small isoefficiency function => high scalability.
• For a problem of size W, at most Pmax ≤ W processors can be used cost-optimally (if Pmax > W, some PEs are idle).
• If W grows slower than Θ(p), then as p increases, at some point the number of PEs exceeds W and E drops.
• Asymptotically, the problem size must increase at least in proportion to Θ(p) to maintain a fixed efficiency:
  W = Ω(p)   (W must grow at least as fast as p)
  Ω(p) is the asymptotic lower bound on the isoefficiency function.
• Since Pmax = Θ(W) (p can grow at most as fast as W), the isoefficiency function of an ideal parallel system is W = Θ(p).
Degree of Concurrency and the Isoefficiency Function
• The degree of concurrency C(W) is the maximum number of tasks that can be executed simultaneously at any time.
• It is independent of the parallel architecture: no more than C(W) processors can be employed effectively.
Effect of concurrency on the isoefficiency function
Example: Gaussian elimination: W = Θ(n^3), at most P = Θ(n^2) tasks can run concurrently
  C(W) = Θ(W^(2/3)) => at most Θ(W^(2/3)) processors can be used efficiently.
  Given p processors, W = Ω(p^(3/2)) => the problem size must be at least Ω(p^(3/2)) to use all of them.
  => The isoefficiency due to concurrency is Θ(p^(3/2)).
• The isoefficiency function due to concurrency is optimal, i.e. Θ(p), only if the degree of concurrency of the parallel algorithm is Θ(W).
Sources of Overhead
• Interprocessor communication: each PE spends t_comm; the overall interprocessor communication is p·t_comm (architecture impact).
• Load imbalance: idle vs. busy PEs contribute to overhead. E.g., in a sequential part of size Ws (W = Ws + Wp), only one PE does useful work, so the other p − 1 PEs contribute (p − 1)·Ws to the overhead function.
• Extra computation:
  1) redundant computation (e.g., the fast Fourier transform);
  2) W for the best sequential algorithm vs. W' for a sequential algorithm that is easier to parallelize: W' − W contributes to overhead.
• Overhead of scheduling.
• If the degree of concurrency of an algorithm is less than Θ(W), then the isoefficiency function due to concurrency is worse, i.e. greater than Θ(p).
• Overall isoefficiency function of a parallel system:
  Isoeff_system = max(Isoeff_concurrency, Isoeff_communication, Isoeff_other-overheads)
Sources of Parallel Overhead
• The overhead function characterizes a parallel system.
• Given the overhead function To = To(W, p), we can express Tp, S, E and p·Tp (cost) as functions of W and p.
• The overhead function encapsulates all causes of inefficiency of a parallel system, due to:
  – the algorithm
  – the architecture
  – the algorithm–architecture interaction
Minimum Execution Time (adding n numbers on a p-processor hypercube)
Assume p is not a constraint.
• In general Tp = Tp(W, p). For a given W, the minimum of Tp is found from dTp/dp = 0 => p0 for which Tp = Tp(min).
Example: Tp = n/p + 2 log p
dTp/dp = 0 => p0 = n/2, Tp(min) = 2 log n
Cost (sequential): Θ(n)
Cost (parallel): Θ(n log n), since p0·Tp(min) = (n/2) × 2 log n
=> Running this algorithm with p0 processors to achieve Tp(min) is not cost-optimal, even though the algorithm itself (with fewer processors) is cost-optimal.
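A brute-force sketch of the same minimization. Note the exact p0 found numerically differs from n/2 by a constant factor that depends on the logarithm base used in the derivative step above, but it is Θ(n) either way, and Tp(min) tracks 2 log n:

```python
import math

def Tp(n, p):
    # slide model: Tp = n/p + 2*log2(p)
    return n / p + 2 * math.log2(p)

def minimize_Tp(n):
    """Brute-force scan over p = 1..n for the p that minimizes Tp."""
    best_p = min(range(1, n + 1), key=lambda p: Tp(n, p))
    return best_p, Tp(n, best_p)

if __name__ == "__main__":
    for n in (256, 4096, 65536):
        p0, tmin = minimize_Tp(n)
        print(f"n={n:6d}  p0={p0:6d}  Tp(min)={tmin:6.2f}  2*log2(n)={2 * math.log2(n):5.1f}")
```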
Derivation: lower bound on Tp such that the parallel cost is optimal, Tp(cost-optimal)
• Tp(cost-optimal) = the parallel run-time such that the cost is optimal, for fixed W.
• If the isoefficiency function is Θ(f(p)), then a problem of size W can be executed cost-optimally only if
  W = Ω(f(p)), i.e. p = O(f^(-1)(W))   {required for a cost-optimal solution}
• For a cost-optimal solution p·Tp = Θ(W), so Tp = Θ(W/p).
• Using the maximum cost-optimal number of processors, p = Θ(f^(-1)(W)):
  Tp(cost-optimal) = Ω(W / f^(-1)(W))
Minimum cost-optimal time for adding n numbers on a hypercube
A) Isoefficiency function:
To = p·Tp − W; Tp = n/p + 2 log p => To = p(n/p + 2 log p) − n = 2p log p
W = K·To = 2Kp log p => W = Θ(p log p)   {isoefficiency function}
• If W = n = f(p) = p log p, then log n = log p + log log p ≈ log p.
• If n = f(p) = p log p, then p = f^(-1)(n): n = p log p => p = n / log p ≈ n / log n
  => f^(-1)(W) = Θ(n / log n)
B) The cost-optimal solution:
p = O(f^(-1)(W)) => for a cost-optimal solution p = Θ(n / log n)   {the maximum p for a cost-optimal solution}
For p = n / log n, Tp = Tp(cost-optimal):
Tp = n/p + 2 log p
Tp(cost-optimal) = log n + 2 log(n / log n) = 3 log n − 2 log log n
Tp(cost-optimal) = Θ(log n)
Note: Tp(min) = Θ(log n) as well, so Tp(cost-optimal) = Θ(Tp(min))
{here the cost-optimal solution is also the best asymptotic solution in terms of execution time}
The minimum-time solution uses p0 = n/2 processors, more than the p0 = n / log n used by the cost-optimal solution.
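A quick numeric check, under the same model, that running with p = n / log n stays within a constant factor of the minimum execution time:

```python
import math

def Tp(n, p):
    return n / p + 2 * math.log2(p)      # slide model

if __name__ == "__main__":
    for n in (2 ** 10, 2 ** 16, 2 ** 20):
        p_cost_opt = max(1, int(n / math.log2(n)))   # p = n / log n
        p_min_time = n // 2                           # p0 = n / 2 from the slides
        t_co, t_min = Tp(n, p_cost_opt), Tp(n, p_min_time)
        print(f"n=2^{int(math.log2(n)):2d}  Tp(cost-opt)={t_co:7.1f}  "
              f"Tp(min)~{t_min:7.1f}  ratio={t_co / t_min:.2f}")
```

The ratio stays bounded (it approaches 3/2), confirming Tp(cost-optimal) = Θ(Tp(min)) for this parallel system.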
A parallel system where Tp(cost-optimal) > Tp(min)
To = p^(3/2) + p^(3/4)·W^(3/4)
Tp = (W + To)/p = W/p + p^(1/2) + W^(3/4)/p^(1/4)
dTp/dp = 0 => p^(3/4) = (1/4)·W^(3/4) + ((1/16)·W^(3/2) + 2W)^(1/2) = Θ(W^(3/4))
=> p0 = Θ(W), Tp(min) = Θ(W^(1/2))
• Isoefficiency function: W = K·To => W = K^4·p^3 = Θ(p^3)
  Pmax = Θ(W^(1/3))   {maximum number of PEs for which the algorithm is cost-optimal}
Tp = W/p + p^(1/2) + W^(3/4)/p^(1/4) with p = Θ(W^(1/3)) => Tp(cost-optimal) = Θ(W^(2/3))
Tp(cost-optimal) > Tp(min) asymptotically.
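A numeric sketch of this gap, using the hypothetical overhead function of the example: it compares Tp at the largest cost-optimal processor count Pmax = W^(1/3) against the fastest achievable Tp over all p, found by a coarse scan.

```python
def Tp(W, p):
    # Tp = W/p + p^(1/2) + W^(3/4)/p^(1/4), from the hypothetical
    # overhead function To = p^(3/2) + p^(3/4) * W^(3/4)
    return W / p + p ** 0.5 + W ** 0.75 / p ** 0.25

def min_Tp(W):
    """Coarse geometric scan for the fastest (not necessarily cost-optimal) p."""
    best, p = float("inf"), 1.0
    while p <= 2 * W:
        best = min(best, Tp(W, p))
        p *= 1.05
    return best

if __name__ == "__main__":
    for W in (10 ** 3, 10 ** 5, 10 ** 7):
        p_cost_opt = int(W ** (1 / 3))          # Pmax = Theta(W^(1/3))
        print(f"W={W:9d}  Tp(cost-opt)={Tp(W, p_cost_opt):10.1f}  Tp(min)={min_Tp(W):9.1f}")
```

Tp(cost-optimal) grows as Θ(W^(2/3)) while Tp(min) grows only as Θ(W^(1/2)), so the gap widens with W, as the slide states.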