Lecture 4 Analytical Modeling of Parallel Programs Parallel Computing Fall 2008
Performance Metrics for Parallel Systems
• Number of processing elements p
• Execution Time
  • Parallel runtime: the time that elapses from the moment a parallel computation starts to the moment the last processing element finishes execution.
  • Ts: serial runtime
  • Tp: parallel runtime
• Total Parallel Overhead T0
  • Total time collectively spent by all the processing elements, minus the running time required by the fastest known sequential algorithm for solving the same problem on a single processing element.
  • T0 = pTp − Ts (see the sketch after this slide)
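As a quick illustration of the overhead formula, here is a minimal Python sketch that plugs hypothetical runtimes (the numbers are made up, not from the lecture) into T0 = pTp − Ts:

```python
def total_overhead(Ts, Tp, p):
    """Total parallel overhead T0 = p*Tp - Ts: the time the p processing
    elements collectively spend beyond the best serial runtime."""
    return p * Tp - Ts

# Hypothetical runtimes in arbitrary time units: Ts = 100, Tp = 30 on p = 4 PEs.
print(total_overhead(Ts=100, Tp=30, p=4))  # 4*30 - 100 = 20 units of overhead
```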
Performance Metrics for Parallel Systems
• Speedup S
  • The ratio of the serial runtime of the best sequential algorithm for solving a problem to the time taken by the parallel algorithm to solve the same problem on p processing elements.
  • S = Ts(best)/Tp
  • Example: adding n numbers: Tp = Θ(log n), Ts = Θ(n), S = Θ(n/log n) (see the sketch after this slide).
  • Theoretically, speedup can never exceed the number of processing elements p (S ≤ p).
  • Proof: Assume the speedup is greater than p; then Tp < Ts/p, i.e., each processing element spends less than Ts/p time solving the problem. A single processing element could then emulate the p processing elements and solve the problem in time pTp < Ts, that is, in fewer than Ts units of time. This is a contradiction, because speedup is by definition computed with respect to the best sequential algorithm.
  • Superlinear speedup: in practice, a speedup greater than p is sometimes observed. This usually happens when the work performed by the serial algorithm is greater than that performed by its parallel formulation, or when hardware features put the serial implementation at a disadvantage.
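To make the adding-n-numbers example concrete, the sketch below counts the steps of a serial sum (n − 1 additions) against a binary-tree reduction on n processing elements (about log2 n parallel steps). This is only an operation-count model that ignores communication constants, and the function name is ours, not from the lecture.

```python
import math

def addition_speedup(n):
    """Step-count model for adding n numbers: Ts ~ n - 1 serial additions,
    Tp ~ ceil(log2 n) parallel steps on n PEs, so S = Theta(n / log n)."""
    Ts = n - 1
    Tp = math.ceil(math.log2(n))
    return Ts / Tp

for n in (2**4, 2**10, 2**20):
    print(n, round(addition_speedup(n), 1))  # speedup grows like n / log n
```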
Example for Superlinear speedup
• Example 1: superlinear effects from caches.
  • With a problem instance of size A and a 64 KB cache, the cache hit rate is 80%. Assuming a cache latency of 2 ns and a local DRAM latency of 100 ns, the average memory access time is 2*0.8 + 100*0.2 = 21.6 ns. If the computation is memory bound and performs one FLOP per memory access, this corresponds to a processing rate of 46.3 MFLOPS.
  • Now split the problem across two processing elements, so each works on an instance of size A/2 with its own 64 KB cache. The cache hit rate rises to 90%; of the remaining accesses, 8% are served from local DRAM and the other 2% from remote DRAM with a latency of 400 ns. The average memory access time is then 2*0.9 + 100*0.08 + 400*0.02 = 17.8 ns. The corresponding execution rate at each processor is 56.18 MFLOPS, and for two processors the total processing rate is 112.36 MFLOPS. The speedup is therefore 112.36/46.3 = 2.43!
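The arithmetic in this example can be checked directly; the short sketch below only re-derives the slide's numbers (the latencies and hit rates are taken from the slide):

```python
# One FLOP per memory access, latencies in ns; MFLOPS = 1000 / access_time_ns,
# since one FLOP every t ns is 1e9/t FLOP/s.
t1 = 0.80 * 2 + 0.20 * 100                  # 21.6 ns: 80% cache hits, 20% local DRAM
rate1 = 1000 / t1                           # ~46.3 MFLOPS on one PE (problem size A)
t2 = 0.90 * 2 + 0.08 * 100 + 0.02 * 400     # 17.8 ns: 90% cache, 8% local, 2% remote DRAM
rate2 = 1000 / t2                           # ~56.18 MFLOPS per PE (problem size A/2 each)
print(round(rate1, 1), round(rate2, 2), round(2 * rate2 / rate1, 2))  # 46.3 56.18 2.43
```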
Example for Superlinear speedup
• Example 2: superlinear effects due to exploratory decomposition.
  • Explore the leaf nodes of an unstructured tree. Each leaf has a label associated with it, and the objective is to find a node with a specified label, say 'S'. The solution node is the rightmost leaf in the tree.
  • A serial formulation of this problem based on depth-first tree traversal explores the entire tree, i.e., all 14 nodes, taking 14 units of time.
  • Now consider a parallel formulation in which the left subtree is explored by processing element 0 and the right subtree is explored by processing element 1. The total work done by the parallel algorithm is only 9 nodes and the corresponding parallel time is 5 units, so the speedup is 14/5 = 2.8.
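A small simulation shows the same effect. The tree below is a hypothetical 7-node example, smaller than the one on the slide, so the node counts differ, but the mechanism is identical: two processing elements searching disjoint subtrees can find the solution after visiting far fewer nodes than a single depth-first search.

```python
# Hypothetical tree: each node is (label, left, right); the target 'S' is the rightmost leaf.
TREE = ("A",
        ("B", ("D", None, None), ("E", None, None)),
        ("C", ("F", None, None), ("S", None, None)))

def dfs_visits(node, target):
    """Number of nodes a depth-first search visits up to and including the target
    (or the whole subtree size if the target is absent), plus a found flag."""
    if node is None:
        return 0, False
    label, left, right = node
    visits = 1
    if label == target:
        return visits, True
    for child in (left, right):
        v, found = dfs_visits(child, target)
        visits += v
        if found:
            return visits, True
    return visits, False

serial_time, _ = dfs_visits(TREE, "S")       # one PE searches the whole tree: 7 visits
t_right, _ = dfs_visits(TREE[2], "S")        # PE 1 searches the right subtree: 3 visits
print(serial_time, t_right, serial_time / t_right)  # 7 3 2.33... -> superlinear on 2 PEs
```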
Performance Metrics for Parallel Systems (cont.)
• Efficiency E
  • Ratio of speedup to the number of processing elements: E = S/p.
  • A measure of the fraction of time for which a processing element is usefully employed.
  • Example: adding n numbers on n processing elements: Tp = Θ(log n), Ts = Θ(n), S = Θ(n/log n), E = Θ(1/log n).
• Cost (also called work or processor-time product) W
  • Product of parallel runtime and the number of processing elements used: W = Tp*p.
  • Example: adding n numbers on n processing elements: W = Θ(n log n).
  • Cost-optimal: a parallel system is cost-optimal if the cost of solving a problem on the parallel computer has the same asymptotic growth (in Θ terms), as a function of the input size, as the fastest known sequential algorithm on a single processing element (see the sketch after this slide).
• Problem Size W2
  • The number of basic computation steps in the best sequential algorithm to solve the problem on a single processing element.
  • W2 = Ts of the fastest known algorithm to solve the problem on a sequential computer.
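The sketch below (our own helper, not from the lecture) computes speedup, efficiency, and cost for the adding-n-numbers example and shows why that formulation is not cost-optimal: the cost grows as n log n while Ts grows only as n.

```python
import math

def metrics(Ts, Tp, p):
    """Speedup S = Ts/Tp, efficiency E = S/p, and cost W = p*Tp."""
    S = Ts / Tp
    return {"speedup": S, "efficiency": S / p, "cost": p * Tp}

# Adding n numbers on p = n PEs: Ts ~ n, Tp ~ log2(n) (unit-time step model).
for n in (2**10, 2**20):
    print(n, metrics(Ts=n, Tp=math.log2(n), p=n))
# Efficiency ~ 1/log n shrinks and cost ~ n*log n outgrows Ts ~ n,
# so this formulation is not cost-optimal.
```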
Parallel vs Sequential Computing: Amdahl's Law
• Theorem 0.1 (Amdahl's Law). Let f, 0 ≤ f ≤ 1, be the fraction of a computation that is inherently sequential. Then the maximum obtainable speedup S on p processors is S ≤ 1/(f + (1 − f)/p).
• Proof. Let T be the sequential running time for the computation. Then fT is the time spent on the inherently sequential part of the program. On p processors the remaining computation, if fully parallelizable, would achieve a running time of at most (1 − f)T/p. The running time of the parallel program on p processors is thus at least the sum of the sequential and parallel components, fT + (1 − f)T/p. The maximum obtainable speedup is therefore S ≤ T/(fT + (1 − f)T/p) = 1/(f + (1 − f)/p), which proves the result.
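A one-line Python version of the bound makes the limiting behavior easy to see; for example, with f = 10% the speedup approaches but never reaches 10 as p grows (the function name is ours):

```python
def amdahl_speedup_bound(f, p):
    """Maximum speedup on p processors when a fraction f of the work is inherently sequential."""
    return 1.0 / (f + (1.0 - f) / p)

for p in (10, 100, 1024, 10**6):
    print(p, round(amdahl_speedup_bound(0.10, p), 2))  # 5.26, 9.17, 9.91, ~10.0
```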
Amdahl’s Law • Amdahl used this observation to advocate the building of even more powerful sequential machines as one cannot gain much by using parallel machines. For example if f = 10%, then S ≤ 10 as p → ∞. The underlying assumption in Amdahl’s Law is that the sequential component of a program is a constant fraction of the whole program. In many instances as problem size increases the fraction of computation that is inherently sequential decreases with time. In many cases even a speedup of 10 is quite significant by itself. • In addition Amdahl’s law is based on the concept that parallel computing always tries to minimize parallel time. In some cases a parallel computer is used to increase the problem size that can be solved in a fixed amount of time. For example in weather prediction this would increase the accuracy of say a three-day forecast or would allow a more accurate five-day forecast.
Parallel vs Sequential Computing: Gustafson's Law
• Theorem 0.2 (Gustafson's Law). Let the execution time of a parallel algorithm on p processors consist of a sequential segment fT and a parallel segment (1 − f)T, where the sequential segment is constant. The scaled speedup of the algorithm is then S = (fT + (1 − f)pT)/(fT + (1 − f)T) = f + (1 − f)p.
• For f = 0.05 and p = 20, we get S = 19.05, whereas Amdahl's Law gives S ≤ 10.26.
• [Diagram: the scaled run takes time fT + (1 − f)T = T on p processors, while one processor would need fT + (1 − f)pT = T(f + (1 − f)p).]
• Amdahl's Law assumes that the problem size is fixed when it deals with scalability; Gustafson's Law assumes that the running time is fixed.
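The comparison on this slide is easy to reproduce; the sketch below (helper names are ours) evaluates both laws at f = 0.05 and p = 20:

```python
def gustafson_scaled_speedup(f, p):
    """Scaled speedup when the sequential segment stays a fixed fraction f of the parallel time."""
    return f + (1.0 - f) * p

def amdahl_speedup_bound(f, p):
    """Fixed-problem-size bound, for comparison."""
    return 1.0 / (f + (1.0 - f) / p)

f, p = 0.05, 20
print(round(gustafson_scaled_speedup(f, p), 2))  # 19.05
print(round(amdahl_speedup_bound(f, p), 2))      # 10.26
```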
Brent’s Scheduling Principle(Emulations) • Suppose we have an unlimited parallelism efficient parallel algorithm, i.e. an algorithm that runs on zillions of processors. In practice zillions of processors may not available. Suppose we have only p processors. A question that arises is what can we do to “run” the efficient zillion processor algorithm on our limited machine. • One answer is emulation: simulate the zillion processor algorithm on the p processor machine. • Theorem 0.3 (Brent’s Principle) Let the execution time of a parallel algorithm requires m operations and runs in parallel time t. Then running this algorithm on a limited processor machine with only p processors would require time m/p + t. • Proof: Let mi be the number of computational operations at the i-th step, i.e. .If we assign the p processors on the i-th step to work on these mi operations they can conclude in time . Thus the total running time on p processors would be
End Thank you!