
SCALABILITY ANALYSIS

Presentation Transcript


  1. SCALABILITY ANALYSIS

  2. PERFORMANCE AND SCALABILITY OF PARALLEL SYSTEMS
  • Evaluation
    - Sequential: runtime (execution time) Ts = T(InputSize)
    - Parallel: runtime (from the start of the first PE to the end of the last PE) Tp = T(InputSize, p, architecture)
    Note: a parallel algorithm cannot be evaluated in isolation from the parallel architecture
    -- Parallel System: Parallel Algorithm + Parallel Architecture
  • Metrics: evaluate the performance of a parallel system
  • SCALABILITY: the ability of a parallel algorithm to achieve performance gains proportional to the number of PEs

  3. PERFORMANCE METRICS
  • Run time: Ts, Tp
  • Speedup: how much performance is gained by running the application on "p" (identical) processors
    S = Ts/Tp, where Ts is the runtime of the fastest sequential algorithm for solving the same problem
    IF the fastest sequential algorithm is
      - not known yet (only a lower bound on its runtime is known), or
      - known, but with constants so large at runtime that it is impossible to implement in practice,
    THEN take the fastest known sequential algorithm that can be practically implemented
    => Speedup is a relative metric

  4. Algorithm for adding "n" numbers on "n" processors (hypercube)
  • Normally S <= p; S > p is called superlinear speedup
  • Ts = Θ(n), Tp = Θ(log n)  =>  S = Θ(n/log n)   (n = p = 2^k)
  • Efficiency (E): a measure of how effectively the problem is solved on p processors
    E = S/p, E ∈ (0, 1]
  • E measures the fraction of time for which a processor is usefully employed
    If p = n, E = Θ(1/log n)
  • Cost
    Cost(Sequential_fast) = Ts
    Cost(Parallel) = p·Tp
    Cost(Parallel) = Cost(Sequential_fast)  =>  Cost-Optimal
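
A small numeric sketch of these quantities; it simply evaluates the formulas under the unit-cost model used later in these slides (1 time unit per addition and per neighbour-to-neighbour communication), so the exact numbers are illustrative:

```python
import math

def fine_grained_add(n):
    """Adding n numbers on p = n processors of a hypercube (n a power of 2).
    Each of the log2(n) combining steps costs 1 communication + 1 addition."""
    p = n
    ts = n - 1                      # fastest sequential algorithm: n - 1 additions
    tp = 2 * int(math.log2(n))      # log n steps, 2 time units each
    s = ts / tp                     # S = Ts/Tp   -> Theta(n / log n)
    e = s / p                       # E = S/p     -> Theta(1 / log n)
    cost = p * tp                   # p * Tp      -> Theta(n log n)
    return ts, tp, s, e, cost

for n in (16, 256, 4096):
    ts, tp, s, e, cost = fine_grained_add(n)
    print(f"n=p={n:5d}  Ts={ts:5d}  Tp={tp:3d}  S={s:8.1f}  E={e:.3f}  pTp={cost}")
```

The printed efficiency falls roughly as 1/log n, which is exactly why this fine-grained version turns out not to be cost-optimal (next slide).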

  5. Algorithm for adding "n" numbers on "n" processors (hypercube)
  • Cost(Sequential_fast) = Θ(n)
  • Cost(Parallel) = p·Tp = Θ(n log n)
  • p = n => fine granularity
  • E = Θ(1/log n)
  => Not cost-optimal

  6. Effects of Granularity on Cost-Optimality
  • Scaling down (p < n)
    [figure: a √n × √n grid of n virtual PEs mapped onto p physical PEs, n/p virtual PEs per physical PE]
  • Assume n virtual PEs; if p is the number of physical PEs, then each physical PE simulates n/p virtual PEs and the computation at each PE increases by a factor of n/p
  • Note: even if p < n, this does not necessarily yield a cost-optimal algorithm

  7. Adding "n" numbers on "p" processors (hypercube, p < n) - naive scaling down
    n = 2^k, p = 2^m; e.g. n = 16, p = 4
  • Computation + communication (the first log p steps of the original algorithm, simulated): Θ((n/p)·log p)  (first 8 steps in the example)
  • Computation only (remaining steps): Θ(n/p)  (last 4 steps in the example)
  • Parallel execution time = Θ((n/p)·log p)
    Cost(Parallel) = p·Θ((n/p)·log p) = Θ(n log p)
    Cost(Sequential_fast) = Θ(n)
    As p grows, the parallel cost diverges asymptotically from the sequential cost => not cost-optimal

  8. Adding "n" numbers on "p" processors (hypercube, p < n) - a cost-optimal algorithm
  • Step 1, local computation: each PE adds its n/p numbers: Θ(n/p)
  • Step 2, computation + communication: the p partial sums are added in log p steps: Θ(log p)
  • Parallel execution time = Θ(n/p + log p) = Θ(n/p) when n > p log p
    Cost(Parallel) = p·Θ(n/p) = Θ(n)
    Cost(Sequential_fast) = Θ(n)
    => Cost(Parallel) = Cost(Sequential_fast) = Θ(n) - cost-optimal
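
A minimal sketch of this cost-optimal variant under the same unit-cost assumptions: each PE first adds its n/p numbers locally, then the p partial sums are combined in log p communicate-and-add steps.

```python
import math

def cost_optimal_add_time(n, p):
    """Tp for adding n numbers on p hypercube processors (p <= n, both powers of 2),
    with 1 unit per addition and 1 unit per neighbour communication."""
    local = n // p - 1                  # Theta(n/p): local additions on each PE
    combine = 2 * int(math.log2(p))     # Theta(log p): 1 addition + 1 communication per step
    return local + combine

n = 1024
for p in (2, 8, 32):
    tp = cost_optimal_add_time(n, p)
    print(f"p={p:3d}  Tp={tp:4d}  pTp={p * tp:5d}  (Ts = {n - 1})")
# pTp stays Theta(n) as long as n grows at least as fast as p log p
```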

  9. If the algorithm is cost-optimal (for p = n)
  • p physical PEs; each PE simulates n/p virtual PEs
  • If the overall communication does not grow by more than a factor of n/p (proper mapping),
    then the total parallel run time grows by at most a factor of n/p:
    (n/p)·Tcomp + (n/p)·Tcomm = (n/p)·Ttotal = (n/p)·Tp = T_n/p
  • Cost(Parallel, p = n) = n·Tp
    Cost(Parallel, p < n) = p·T_n/p = p·(n/p)·Tp = n·Tp = Cost(Parallel, p = n)
  => the new algorithm using p processors (p < n) is cost-optimal

  10. If the algorithm is not cost-optimal for p = n
  • If we increase the granularity, the new algorithm using p processors (p < n) may still not be cost-optimal
    Eg: adding "n" numbers on "p" processors of a hypercube, n = 2^k, p = 2^m; e.g. n = 16, p = 4
  • Each virtual PE i is simulated by physical PE (i mod p)
  • The first log p (2) of the log n (4) steps of the original algorithm are simulated in (n/p)·log p steps (16/4 × 2 = 8 steps on p = 4 processors)
  • The remaining steps do not require communication (the PEs that continue to communicate in the original algorithm are simulated by the same physical PE here)

  11. The Role of Mapping Computations onto Processors in Parallel Algorithm Design
  • For a cost-optimal parallel algorithm, E = Θ(1)
  • If a parallel algorithm on p = n processors is not cost-optimal, this does not imply that a cost-optimal algorithm for p < n can be found
  • Even if a cost-optimal algorithm for p < n is found, this does not imply that it has the best possible parallel run time
  • Performance (parallel run time) depends on:
    1) the number of processors
    2) the data mapping (assignment)

  12. The Role of Mapping Computations onto Processors in Parallel Algorithm Design
  • The parallel run time for the same problem (problem size) depends on the mapping of the virtual PEs onto the physical PEs
  • Performance critically depends on the data mapping onto a coarse-grained parallel computer
    Eg: multiplying an n x n matrix by a vector on a p-processor hypercube [p square blocks vs p slices of n/p rows]; parallel FFT on a hypercube with cut-through routing
  • W computation steps => Pmax = W
  • For Pmax, each PE executes one step of the algorithm
  • For p < W, each PE executes a larger number of steps
  • The choice of the best algorithm for the local computations depends on the number of PEs (how much granularity is available)

  13. An optimal algorithm for solving a problem on an arbitrary number of PEs cannot be obtained from the most fine-grained parallel algorithm alone
  • The analysis of the fine-grained parallel algorithm may not reveal important facts that the analysis of the coarse-grained parallel algorithm does. Notes:
    1) if the message is short (one word only), the transfer time between two PEs is the same for store-and-forward and cut-through routing
    2) if the message is long, cut-through routing is faster than store-and-forward
    3) performance on a hypercube and on a mesh is identical with cut-through routing
    4) performance on a mesh with store-and-forward routing is worse
  • Design:
    1) devise the parallel algorithm for the finest grain
    2) map the data onto the PEs
    3) describe the algorithm's implementation on an arbitrary number of PEs
  • Variables: problem size, number of PEs

  14. Scalability
    S <= p, S(p), E(p)
    Eg: adding n numbers on a p-processor hypercube
  • Assume 1 unit of time for adding 2 numbers or for communicating with a connected PE
    1) adding n/p numbers locally takes: n/p - 1
    2) the p partial sums are added in log p steps (each step: 1 addition + 1 communication) => 2 log p
    Tp = n/p - 1 + 2 log p ≈ n/p + 2 log p   (for large n and p)
    Ts = n - 1 ≈ n
    S = n / (n/p + 2 log p) = np / (n + 2p log p)  =>  S(n, p)
    E = S/p = n / (n + 2p log p)  =>  E(n, p)
  • S and E can be computed for any pair of n and p
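
The closed forms S(n, p) and E(n, p) above translate directly into a small sketch (values are illustrative):

```python
import math

def speedup(n, p):
    # S = n*p / (n + 2*p*log2(p)), from Ts = n and Tp = n/p + 2*log2(p)
    return n * p / (n + 2 * p * math.log2(p))

def efficiency(n, p):
    # E = S/p = n / (n + 2*p*log2(p))
    return n / (n + 2 * p * math.log2(p))

n = 512
for p in (2, 8, 32, 128):
    print(f"n={n}  p={p:4d}  S={speedup(n, p):6.1f}  E={efficiency(n, p):.2f}")
```

For a fixed n, E falls as p grows, which is the saturation effect discussed on the next slide.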

  15. As p increases, increasing S requires increasing n (otherwise S saturates)
  • Otherwise E decreases
  • For larger problem sizes, S and E are higher, but they still drop as p increases
  • Goal: keep E constant
  • Scalability of a parallel system is a measure of its capacity to increase speedup in proportion to the number of processors

  16. Efficiency of adding n numbers on a p-processor hypercube
    For the cost-optimal algorithm:
    S = np / (n + 2p log p)
    E = n / (n + 2p log p) = E(n, p)
    Cost-optimal for n = Ω(p log p)
    E = 0.80, constant, whenever n = 8p log p:
      n = 64,  p = 4
      n = 192, p = 8
      n = 512, p = 16
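
A quick check that the table above is consistent: whenever n = 8p log p, E(n, p) stays at 0.80 (a sketch reusing the E(n, p) formula from slide 14):

```python
import math

def efficiency(n, p):
    return n / (n + 2 * p * math.log2(p))    # E(n, p) for the cost-optimal adder

for p in (4, 8, 16):
    n = int(8 * p * math.log2(p))            # n = 8 p log p  ->  64, 192, 512
    print(f"p={p:2d}  n={n:3d}  E={efficiency(n, p):.2f}")   # prints 0.80 each time
```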

  17. Efficiency of adding n numbers on a p-processor hypercube
  • Conclusions, for adding n numbers on p processors with the cost-optimal algorithm:
    - The algorithm is cost-optimal if n = Ω(p log p)
    - The algorithm is scalable if n increases proportionally to Θ(p log p) as p is increased
  • How should "problem size" be measured?
    For matrix multiplication: input size n => O(n^3) work; n' = 2n => O(n'^3) ≡ O(8n^3)
    For matrix addition: input size n => O(n^2) work; n' = 2n => O(n'^2) ≡ O(4n^2)
  • Problem size is therefore measured as the total amount of computation, so that doubling the problem size means performing twice the amount of computation

  18. Computation step
  • Assume one computation step takes 1 time unit
  • Message start-up time, per-word transfer time and per-hop time can be normalized with respect to this unit computation time
  • W = Ts (runtime of the fastest sequential algorithm on a sequential computer)
  • Overhead function
    E = 1, S = p (ideal)
    E < 1, S < p (in reality)
    Reason: overheads (interprocessor communication, etc.) => overhead function

  19. Overhead Function
  • The time collectively spent by all processors in addition to that required by the fastest sequential algorithm to solve the same problem on a single PE
    To = To(W, p)
    To = p·Tp - W
  • For the cost-optimal algorithm of adding n numbers on a p-processor hypercube:
    Ts = W = n
    Tp = n/p + 2 log p
    To = p(n/p + 2 log p) - n = 2p log p
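
A short sketch that computes To = p·Tp - W for this example and confirms it equals 2p log p:

```python
import math

def overhead(n, p):
    """To = p*Tp - W for adding n numbers on a p-processor hypercube."""
    w = n                               # W = Ts = n
    tp = n / p + 2 * math.log2(p)       # Tp = n/p + 2 log p
    return p * tp - w

for n, p in ((1024, 8), (1024, 64)):
    print(f"n={n}  p={p:2d}  To={overhead(n, p):7.1f}  2*p*log2(p)={2 * p * math.log2(p):7.1f}")
```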

  20. Isoefficiency Function
    Tp = (W + To(W, p)) / p
    S = Ts/Tp = W/Tp = Wp / (W + To(W, p))
    E = S/p = W / (W + To(W, p))
  • If W = const and p increases, then E decreases
  • If p = const and W increases, then E increases for scalable parallel systems
  • We need E = const for scalable, efficient systems

  21. Isoefficiency Function
  • Eg.1: if, as p increases, W must increase exponentially with p to keep E constant, the parallel system is poorly scalable: the problem size must grow enormously to obtain good speedups
  • Eg.2: if W only needs to increase linearly with p, the parallel system is highly scalable: the speedup grows in proportion to the number of processors
  • E = const  =>  E/(1 - E) = const; let K = E/(1 - E)
    W = K·To(W, p)
    This function dictates the growth rate of W required to keep E constant as p increases (the isoefficiency function)
  • The isoefficiency function does not exist for unscalable parallel systems, because E cannot be kept constant as p increases, no matter how much or how fast W increases

  22. Overhead function (adding n numbers on a p-processor hypercube)
    Ts = n
    Tp = n/p + 2 log p
    To = p·Tp - Ts = p(n/p + 2 log p) - n = 2p log p
  • Isoefficiency function: W = K·To(W, p)
    To = 2p log p   (note: here To = To(p) only)
    W = 2Kp log p  =>  the asymptotic isoefficiency function is Θ(p log p)
  • Meaning:
    1) if the number of PEs is increased from p to p' > p, the problem size has to be increased by a factor (p' log p')/(p log p) to obtain the same efficiency as on p processors
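
A sketch of how the isoefficiency relation is used in practice: fix a target efficiency E, set K = E/(1 - E), and compute the problem size W = K·To = 2Kp log p needed at each machine size (values are illustrative):

```python
import math

def required_problem_size(p, target_e):
    """Isoefficiency for the adder: W = K * To(p), with To = 2 p log p and K = E/(1-E)."""
    k = target_e / (1 - target_e)
    return 2 * k * p * math.log2(p)

for p in (4, 8, 16, 32):
    print(f"p={p:2d}  W needed for E=0.80: {required_problem_size(p, 0.80):7.1f}")
# W grows as Theta(p log p); for E = 0.80 this is exactly the n = 8 p log p of slide 16
```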

  23. 2) increasing the number of PEs from p to p' (by a factor p'/p) requires the problem size to grow by a factor (p' log p')/(p log p) to increase the speedup by p'/p
  • Here the communication overhead is an exclusive function of p: To = To(p)
  • In general To = To(W, p), and W = K·To(W, p) may involve many terms; it is sometimes hard to solve for W in terms of p
  • E = const requires the ratio To/W to stay fixed
  • As p increases, W must increase to obtain a nondecreasing efficiency (E' >= E) => To should not grow faster than W
  • None of To's terms should grow faster than W

  24. If To has multiple terms, we balance W against each term of To and compute the respective isoefficiency function for each individual term
  • The component of To that requires the problem size to grow at the highest rate with respect to p determines the overall asymptotic isoefficiency function of the parallel system
    Eg: To = p^(3/2) + p^(3/4)·W^(3/4)
    W = K·p^(3/2)  =>  Θ(p^(3/2))
    W = K·p^(3/4)·W^(3/4)  =>  W^(1/4) = K·p^(3/4)  =>  W = K^4·p^3  =>  Θ(p^3)
    Take the higher of the two rates: to ensure E does not decrease, the problem size needs to grow as Θ(p^3) (the overall asymptotic isoefficiency function)
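
A numeric illustration of the balancing argument for To = p^(3/2) + p^(3/4)·W^(3/4): if W only grows as the first-term isoefficiency Θ(p^(3/2)), the second term eventually dominates W and the ratio To/W blows up (so E drops); growing W as Θ(p^3) keeps To/W bounded.

```python
def to_over_w(w, p):
    """Ratio To/W for To = p^(3/2) + p^(3/4) * W^(3/4); E stays constant iff this stays fixed."""
    return (p ** 1.5 + (p ** 0.75) * (w ** 0.75)) / w

for p in (16, 256, 4096):
    print(f"p={p:5d}  W~p^1.5: To/W={to_over_w(p ** 1.5, p):9.2f}   "
          f"W~p^3: To/W={to_over_w(p ** 3, p):6.3f}")
```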

  25. Isoefficiency functions:
  • Capture the characteristics of the parallel algorithm and the parallel architecture
  • Predict the impact on performance as the number of PEs increases
  • Characterize the amount of parallelism in a parallel algorithm
  • Allow studying the behaviour of a parallel system under hardware changes (PE speed, communication channels)
  Cost-optimality and isoefficiency:
    E = Ts/(p·Tp) = const
    p·Tp = Θ(W)
    W + To(W, p) = Θ(W)   (since p·Tp = W + To)
    To(W, p) = O(W), i.e. W = Ω(To(W, p))
  • A parallel system is cost-optimal iff its overhead function does not grow asymptotically faster than the problem size

  26. Relationship between cost-optimality and the isoefficiency function
    Eg: adding "n" numbers on a "p"-processor hypercube
  a) The non-cost-optimal version:
    W = Θ(n), Tp = Θ((n/p)·log p)
    To = p·Tp - W = Θ(n log p)
    W = K·Θ(W log p) cannot hold for any fixed K (and E) as p grows
    => the algorithm is not cost-optimal, not scalable, and its isoefficiency function does not exist
  b) The cost-optimal version:
    W = Θ(n), Tp = Θ(n/p + log p)
    To = Θ(n + p log p) - Θ(n) = Θ(p log p)
    W = K·Θ(p log p): the problem size should grow at least as p log p for the parallel system to be scalable
    W = Ω(p log p)   (and n >> p for cost-optimality)

  27. Isoefficiency Function
  • Determines the ease with which a parallel system can maintain a constant efficiency, and thus achieve speedups increasing in proportion to the number of processors
  • A small isoefficiency function means that small increments in the problem size are sufficient for the efficient utilization of an increasing number of processors => the parallel system is highly scalable
  • A large isoefficiency function indicates a poorly scalable parallel system
  • The isoefficiency function does not exist for unscalable parallel systems, because in such systems the efficiency cannot be kept at any constant value as p increases, no matter how fast the problem size is increased

  28. Lower Bound on the Isoefficiency Function
  • A small isoefficiency function => high scalability
  • For a problem of size W, Pmax <= W for a cost-optimal system (if Pmax > W, some PEs are idle)
  • If W grows slower than Θ(p) as p increases, then at some point the number of PEs exceeds W => E drops
    => the problem size must grow at least proportionally to p to maintain a fixed efficiency:
    W = Ω(p)   (W should grow at least as fast as p)
    Ω(p) is the asymptotic lower bound on the isoefficiency function
  • But Pmax = Θ(W) (p can grow at most as fast as W)
    => the isoefficiency function of an ideal parallel system is W = Θ(p)

  29. Degree of Concurrency and the Isoefficiency Function
  • The degree of concurrency C(W) is the maximum number of tasks that can be executed simultaneously at any time
  • It is independent of the parallel architecture
  • No more than C(W) processors can be employed effectively
  Effect of concurrency on the isoefficiency function
    Eg: Gaussian elimination: W = Θ(n^3), P = Θ(n^2)
    C(W) = Θ(W^(2/3)) => at most Θ(W^(2/3)) processors can be used efficiently
    Given p: W = Ω(p^(3/2)) => the problem size should be at least Ω(p^(3/2)) to use all p processors
    => the isoefficiency due to concurrency is Θ(p^(3/2))
  • The isoefficiency function due to concurrency is optimal, i.e. Θ(p), only if the degree of concurrency of the parallel algorithm is Θ(W)
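
A small sketch of the concurrency bound for the Gaussian-elimination example: with W = n^3 and at most C(W) = n^2 processors usable at once, the smallest problem that can keep p processors busy has W = Θ(p^(3/2)).

```python
import math

def min_work_for_p_processors(p):
    """Smallest W = n^3 whose degree of concurrency C(W) = n^2 reaches p."""
    n = math.ceil(math.sqrt(p))      # need n^2 >= p
    return n ** 3                    # = Theta(p^(3/2))

for p in (4, 64, 1024):
    print(f"p={p:5d}  min W={min_work_for_p_processors(p):6d}  p^1.5={p ** 1.5:8.1f}")
```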

  30. Sources of Overhead
  • Interprocessor communication: each PE spends tcomm; the overall interprocessor communication is p·tcomm (architecture impact)
  • Load imbalance: idle vs busy PEs (idle time contributes to overhead)
    Eg: during the sequential part Ws, 1 PE does useful work while the other p - 1 PEs are idle => (p - 1)·Ws contributes to the overhead function
  • Extra computation:
    1) redundant computation (eg: Fast Fourier Transform)
    2) W for the best sequential algorithm vs W' for a sequential algorithm that is easily parallelizable: W' - W contributes to overhead
    With W = Ws + Wp, Ws is executed by 1 PE only => (p - 1)·Ws contributes to overhead
  • Overhead of scheduling

  31. If the degree of concurrency of an algorithm is less than Θ(W), then the isoefficiency function due to concurrency is worse, i.e. greater than Θ(p)
  • Overall isoefficiency function of a parallel system:
    Isoeff_system = max(Isoeff_concurrency, Isoeff_communication, Isoeff_overhead)
  Sources of parallel overhead
  • The overhead function characterizes a parallel system
  • Given the overhead function To = To(W, p), we can express Tp, S, E and p·Tp (cost) as functions fi(W, p)
  • The overhead function encapsulates all causes of inefficiency of a parallel system, due to:
    - the algorithm
    - the architecture
    - the algorithm-architecture interaction

  32. Minimum Execution Time (adding n numbers on p processors of a hypercube)
  • Assume p is not a constraint
  • In general, Tp = Tp(W, p); for a given W, the minimum Tp is found from dTp/dp = 0 => p0 for which Tp = Tp(min)
    Eg: Tp = n/p + 2 log p
    dTp/dp = 0 => p0 = n/2
    Tp(min) = 2 log n
  • Cost (sequential): Θ(n)
    Cost (parallel): Θ(n log n), since p0·Tp(min) = (n/2)·2 log n
    => running this algorithm at p0, i.e. for Tp(min), is not cost-optimal, even though the algorithm itself is cost-optimal (for smaller p)
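
A sketch that locates the minimizing p by brute force. The closed-form p0 = n/2 above treats the derivative of 2 log p as 2/p; the scan below uses log2, so the constant in p0 comes out slightly different, but p0 = Θ(n) and Tp(min) = Θ(log n) either way.

```python
import math

def tp(n, p):
    return n / p + 2 * math.log2(p)     # Tp = n/p + 2 log p

def minimize_tp(n):
    """Brute-force scan over p = 1..n for the p minimizing Tp (illustrative only)."""
    p0 = min(range(1, n + 1), key=lambda p: tp(n, p))
    return p0, tp(n, p0)

for n in (64, 1024):
    p0, tmin = minimize_tp(n)
    print(f"n={n:5d}  p0={p0:4d}  Tp(min)={tmin:6.2f}  2*log2(n)={2 * math.log2(n):5.1f}")
```

At p0 the cost p0·Tp(min) is Θ(n log n), so operating at the minimum-time point is not cost-optimal, matching the slide.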

  33. Deriving a lower bound for Tp such that the parallel cost is optimal: Tp(cost-optimal)
  • The parallel run time such that the cost is optimal, for fixed W
  • If the isoefficiency function is Θ(f(p)), then a problem of size W can be executed cost-optimally only if W = Ω(f(p)), i.e. p = O(f^-1(W))   (required for a cost-optimal solution)
  • Tp for a cost-optimal solution is Θ(W/p), since p·Tp = Θ(W)
  • With p = Θ(f^-1(W)):
    Tp(cost-optimal) = Ω(W / f^-1(W))

  34. Minimum cost-optimal time for adding n numbers on a hypercube
  A) Isoefficiency function:
    To = p·Tp - W; Tp = n/p + 2 log p => To = p(n/p + 2 log p) - n = 2p log p
    W = K·To = 2Kp log p
    W = Θ(p log p)   (isoefficiency function)
  • If W = n = f(p) = p log p, then log n = log p + log log p ≈ log p
  • Inverting: n = p log p => p = n/log p ≈ n/log n
    f^-1(W) = Θ(n/log n)

  35. B) The cost-optimal solution:
    p = O(f^-1(W)) => for a cost-optimal solution p = Θ(n/log n)   (the maximum p for a cost-optimal solution)
  • For p = n/log n, Tp = Tp(cost-optimal):
    Tp = n/p + 2 log p
    Tp(cost-optimal) = log n + 2 log(n/log n) = 3 log n - 2 log log n
    Tp(cost-optimal) = Θ(log n)
  • Note: Tp(min) = Θ(log n) and Tp(cost-optimal) = Θ(log n)
    => here the cost-optimal solution is also the best asymptotic solution in terms of execution time
  • For Tp(min), p0 = n/2, which is larger than the p0 = n/log n used for Tp(cost-optimal)
    => Tp(cost-optimal) = Θ(Tp(min))

  36. A parallel system where Tp(cost-optimal) > Tp(min)
    To = p^(3/2) + p^(3/4)·W^(3/4)
    Tp = (W + To)/p = W/p + p^(1/2) + W^(3/4)/p^(1/4)
    dTp/dp = 0 => p^(3/4) = (1/4)·W^(3/4) + ((1/16)·W^(3/2) + 2W)^(1/2) = Θ(W^(3/4))
    => p0 = Θ(W) and Tp(min) = Θ(W^(1/2))
  • Isoefficiency function: W = K·To = K^4·p^3 = Θ(p^3)
    Pmax = Θ(W^(1/3))   (maximum number of PEs for which the algorithm is cost-optimal)
    Tp = W/p + p^(1/2) + W^(3/4)/p^(1/4), with p = Θ(W^(1/3)) => Tp(cost-optimal) = Θ(W^(2/3))
    => Tp(cost-optimal) > Tp(min) asymptotically
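
A numeric sketch contrasting the two operating points for this system, using p = W as a stand-in for p0 = Θ(W) and p = W^(1/3) for the largest cost-optimal machine; the run-time ratio grows roughly as W^(1/6):

```python
def tp(w, p):
    # Tp = (W + To)/p with To = p^(3/2) + p^(3/4) * W^(3/4)
    return w / p + p ** 0.5 + (w ** 0.75) / (p ** 0.25)

for w in (10 ** 3, 10 ** 6, 10 ** 9):
    t_min = tp(w, w)                        # p = Theta(W)        -> Tp(min)          = Theta(W^(1/2))
    t_copt = tp(w, round(w ** (1 / 3)))     # p = Theta(W^(1/3))  -> Tp(cost-optimal) = Theta(W^(2/3))
    print(f"W={w:>10}  Tp(min)~{t_min:12.1f}  Tp(cost-optimal)~{t_copt:12.1f}  ratio={t_copt / t_min:6.2f}")
```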
