Performance of parallel and distributed systems • What is the purpose of measurement? • To evaluate a system (or an architecture) • To compare two or more systems • To compare different algorithms • Which metric should be used? • Speedup
Workload-Driven Evaluation • Approach • Run a workload (trace) and measure performance of the system • Traces • Real trace • Synthetic trace • Other issues • How representative is the workload?
Types of systems • For existing systems • Run a workload and evaluate the performance of the system • Problem: Is the workload representative? • For future systems (an architectural idea): • Develop a simulator of the system • Run a workload and evaluate the system • Problem: • Developing a simulator is difficult and expensive • How do we define system parameters, such as memory access time and communication cost?
Measuring Performance • Performance metric most important to end user • Performance = Work / Time unit • Performance improvement due to parallelism: Speedup = Time(1) / Time(p)
Performance evaluation of a parallel computer • Speedup(p) = Time(1) / Time(p) • What is Time(1)? 1. Parallel program on one processor of parallel machine? 2. A sequential algorithm on one processor of the parallel machine? 3. “Best” sequential program on one processor of the parallel machine? 4. “Best” sequential program on agreed-upon standard machine? • Which one is reasonable?
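The choice of Time(1) directly changes the reported speedup. A minimal Python sketch of this, assuming made-up timings (every number and baseline label below is hypothetical and chosen only for illustration):

```python
def speedup(time_1: float, time_p: float) -> float:
    """Speedup(p) = Time(1) / Time(p)."""
    if time_1 <= 0 or time_p <= 0:
        raise ValueError("times must be positive")
    return time_1 / time_p


# Hypothetical measurements in seconds (not taken from the slides).
time_p = 10.0  # parallel program on p processors
baselines = {
    "parallel program on 1 processor": 120.0,
    "best sequential program on 1 processor": 80.0,
}

for label, time_1 in baselines.items():
    # The same parallel run looks better or worse depending on the baseline.
    print(f"{label}: speedup = {speedup(time_1, time_p):.1f}")
```

With these assumed numbers, the same parallel run reports a speedup of 12.0 against its own one-processor run and 8.0 against the faster sequential program, which is why the baseline must be stated alongside any speedup figure.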
Speedup • What is Time(p)? • The time needed by the parallel machine to run the same workload? • Is it fair? • How does the problem size affect our measurement?
Example 1: Our experience • Parallel simulation of a Multistage Interconnection Network (MIN) • d: number of stages • n: number of nodes • n = (d+1) * 2^d
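To get a feel for how quickly the simulated network grows, here is a small Python sketch. It assumes the node-count formula on this slide reads n = (d+1) * 2^d (the exponent appears to have been flattened in extraction); the function name `min_nodes` is hypothetical.

```python
def min_nodes(d: int) -> int:
    """Number of nodes n = (d + 1) * 2**d for a network with d stages.

    Assumes the slide's formula is n = (d+1) * 2^d.
    """
    if d < 0:
        raise ValueError("d must be non-negative")
    return (d + 1) * 2 ** d


for d in (3, 6, 10):
    print(f"d = {d:2d} stages -> n = {min_nodes(d)} nodes")
```

Under that assumption the problem size grows from 32 nodes at d = 3 to 11264 nodes at d = 10, so a small d may leave most of a large parallel machine idle.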
Speedup of MIN on CM-2 • Speedup = T(1)/T(p), where • T(1) = execution time of the sequential simulator on a Sun SPARC • T(p) = execution time of the parallel simulator on a CM-2 with 8k processors
Why is problem size important? • The problem size is too small: • May be appropriate for a small machine, but not for the parallel machine (PM) • Not enough work for the PM • Parallelism overheads begin to dominate benefits for the PM • Load imbalance • Communication-to-computation ratio • May even achieve slowdowns • Doesn’t reflect real usage, and inappropriate for large machines • Can exaggerate benefits of architectural improvements, especially when measured as percentage improvement in performance
The problem size is too large: • May not “fit” in a small machine • Can’t run • Thrashing to disk • Working set doesn’t fit in cache • May lead to superlinear speedup • What is the right size? • How do we find the right size?
Scaling: Example 2 • Small and big equation solvers on SGI Origin2000 (from Parallel Computer Architecture, Culler & Singh)
Scaling issues • Important issues • Reasonable problem size • Scaling problem size • Scaling machine size • Example • Consider a dispatcher based cluster and compare three load balancing algorithms • Round Robin (RR) • Least connection (LC) first • Least loaded first (LL)
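The first two policies can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the class and function names are hypothetical, no request finishes while the others are being dispatched, and the least loaded (LL) policy is left out because it needs a load measure that the slides do not define.

```python
import itertools


class RoundRobin:
    """Send each request to the next server in a fixed cyclic order."""

    def __init__(self, n_servers: int):
        self._cycle = itertools.cycle(range(n_servers))

    def pick(self, active: list[int]) -> int:
        # Ignores the current load entirely.
        return next(self._cycle)


class LeastConnection:
    """Send each request to the server with the fewest open connections."""

    def pick(self, active: list[int]) -> int:
        # Ties go to the lowest server index.
        return min(range(len(active)), key=active.__getitem__)


def dispatch(policy, active: list[int], n_requests: int) -> list[int]:
    """Assign n_requests one by one and return the chosen server indices.

    `active` holds the open-connection count per server and is updated
    in place; no request completes during the run (a simplification).
    """
    chosen = []
    for _ in range(n_requests):
        server = policy.pick(active)
        active[server] += 1
        chosen.append(server)
    return chosen


# Server 0 starts with 3 open connections; servers 1 and 2 are idle.
print(dispatch(RoundRobin(3), [3, 0, 0], 4))       # [0, 1, 2, 0]
print(dispatch(LeastConnection(), [3, 0, 0], 4))   # [1, 2, 1, 2]
```

Round robin keeps sending work to the already busy server 0, while least connection fills the idle servers first.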
Scale problem size, but keep machine size fixed

Table 3: Performance of a 4-Server Cluster

Arrival rate   | Average waiting time (ms)  | Average response time (ms) | Average utilization
(requests/sec) | Baseline       RR       LC | Baseline       RR       LC | Baseline     RR           LC
250            |      0.2     10.5      0.9 |      3.8     14.1      4.5 | 0.226 0.001  0.226 0.002  0.23 0.001
500            |      1.8    32.99      3.9 |      5.4     36.6      7.5 | 0.453 0.002  0.453 0.003  0.453 0.002
750            |     49.5    127.5     53.1 |     53.2    131.1     56.7 | 0.679 0.001  0.679 0.004  0.680 0.001
1000           |    849.5   1112.3    853.0 |    853.1   1115.9    856.7 | 0.906 0.00   0.905 0.006  0.905 0.001
1250           |    70084    70118    70085 |    70087    70121    70088 | 0.998 0.001  0.991 0.006  0.997 0.001
Scaling problem size (cont’d) • Conclusion: for low arrival rates LC is much better than RR, but for high arrival rates both converge to the baseline (BL) algorithm • Is it a fair conclusion?
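One way to probe this conclusion is to look at both the absolute and the relative gap between RR and LC. The Python sketch below uses the average response times from Table 3; the helper name `gaps` is hypothetical.

```python
# Average response time (ms) from Table 3, keyed by arrival rate
# (requests/sec). Tuple order: (baseline, RR, LC).
response_ms = {
    250: (3.8, 14.1, 4.5),
    500: (5.4, 36.6, 7.5),
    750: (53.2, 131.1, 56.7),
    1000: (853.1, 1115.9, 856.7),
    1250: (70087, 70121, 70088),
}


def gaps(baseline: float, rr: float, lc: float) -> tuple[float, float]:
    """Return (RR - LC in ms, RR / LC); the baseline is not used here."""
    return rr - lc, rr / lc


for rate, row in response_ms.items():
    diff, ratio = gaps(*row)
    print(f"{rate:>5} req/s: RR - LC = {diff:6.1f} ms, RR / LC = {ratio:.3f}")
```

At 1250 requests/sec the ratio is close to 1, yet RR is still 33 ms slower than LC. The algorithms look alike in relative terms mainly because, at a utilization near 1, queueing delay dominates every response time.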
Scaling problem and machine size

Servers | Arrival rate | Average response time (ms) | Average waiting time (ms)  | Average utilization
        |              | Baseline       RR       LC | Baseline       RR       LC | Baseline     RR           LC
1       | 250          |     3467        -        - |     3467        -        - | 0.906        -            -
2       | 500          |   1722.0   1913.2   1724.6 |   1718.4   1909.5   1721.0 | 0.906 0.000  0.905 0.002  0.905 0.000
4       | 1000         |    853.1   1115.9    856.7 |    849.5   1112.3    853.0 | 0.906 0.00   0.905 0.006  0.905 0.001
8       | 2000         |    419.7    741.4    421.6 |    416.0    737.8    418.0 | 0.906 0.001  0.905 0.007  0.906 0.001
16      | 4000         |    213.0    608.0    215.6 |    209.4    604.4    212.0 | 0.906 0.001  0.903 0.017  0.905 0.002
Scaling problem and machine size (cont’d) • Conclusion: LC is much better than RR • Is it a fair conclusion?
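The strength of this conclusion depends on the machine size at which it is measured. A short Python sketch, using the average response times from the table for scaled problem and machine size (the helper name `rr_over_lc` is hypothetical):

```python
# Average response time (ms) when arrival rate and number of servers
# are scaled together. Keyed by number of servers; tuple order: (RR, LC).
response_ms = {
    2: (1913.2, 1724.6),
    4: (1115.9, 856.7),
    8: (741.4, 421.6),
    16: (608.0, 215.6),
}


def rr_over_lc(servers: int) -> float:
    """Ratio of RR to LC average response time for a given cluster size."""
    rr, lc = response_ms[servers]
    return rr / lc


for servers in response_ms:
    print(f"{servers:>2} servers: RR / LC = {rr_over_lc(servers):.2f}")
```

The ratio grows steadily with the number of servers while utilization stays near 0.905, so a measurement on 2 servers and one on 16 servers support very different statements about how much better LC is.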
Questions in Scaling • How should the application be scaled? • Look at the web server • Scaling machine size e.g., by adding identical nodes, each bringing memory • Memory size is increased • Locality may be changed • Extra work (e.g., overhead for task scheduling) will be increased • Problem size: scaling problem size may change • locality • working set size • Communication cost
Why Scaling? • Two main reasons for scaling: • to increase performance, e.g. increase number of transactions per second • Of interest to users • to utilize resources (processor and memory) more efficiently • more interesting for managers • More difficult • scaling models: • Problem constrained (PC) • Memory constrained (MC) • Time constrained (TC)
Problem Constrained Scaling • Problem size is kept fixed, but the machine is scaled • Motivation: User wants to solve the same problem, only faster. • Some examples: • Video compression • Computer graphics • Message routing in a router (or switch) • Speedup(p) = Time(1) / Time(p)
Machine Constrained Scaling • Scale problem size, but the machine (memory) remains fixed • Motivation: Find the limits of a given machine, e.g., what is the maximum problem size that avoids memory thrashing? • Performance measurement: • The previous definition of speedup, Time(1) / Time(p), is NOT valid, because the work is no longer the same • New definition: • Performance improvement = increase in work / increase in time • How to measure work? • Work can be defined as the number of instructions, operations, or transactions
Time Constrained Scaling • Time is kept fixed as the machine is scaled • Motivation: User has a fixed time to use the machine (or to wait for a result, as in real-time systems), but wishes to do more work during this time • Performance = Work/Time as usual, and time is fixed, so • SpeedupTC(p) = Work(p) / Work(1) • How does Work(1) affect the result? • Work(1) must be reasonable to avoid thrashing
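The work-based definition (increase in work divided by increase in time) covers the other two cases as well. The Python sketch below illustrates this with hypothetical measurements; the function name `scaled_speedup` and all numbers are assumptions made for the example.

```python
def scaled_speedup(work_1: float, time_1: float,
                   work_p: float, time_p: float) -> float:
    """Performance improvement = (increase in work) / (increase in time)."""
    if min(work_1, time_1, work_p, time_p) <= 0:
        raise ValueError("work and time must be positive")
    return (work_p / work_1) / (time_p / time_1)


# Hypothetical measurements: work in transactions, time in seconds.

# Problem constrained: same work, so this reduces to Time(1) / Time(p).
pc = scaled_speedup(work_1=1000, time_1=40.0, work_p=1000, time_p=5.0)

# Time constrained: same time, so this reduces to Work(p) / Work(1).
tc = scaled_speedup(work_1=1000, time_1=40.0, work_p=6000, time_p=40.0)

# Both work and time change: neither simple ratio is enough on its own.
both = scaled_speedup(work_1=1000, time_1=40.0, work_p=16000, time_p=80.0)

print(pc, tc, both)  # 8.0 6.0 8.0
```

In the last case 16 times the work is done in twice the time, giving an improvement of 8.0 even though the parallel run takes longer than the one-processor run.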
Evaluation using Workload • Must consider three major factors: • Workload characteristics • Problem Size • machine size
Impact of Workload • Should adequately represent domains of interest • Easy to mislead with workloads • Choose those with features for which machine is good, avoid others • Some features of interest: • Working set size and spatial locality • Fine-grained or coarse-grained tasks • Synchronization patterns • Contention, and Communication patterns • Should have enough to utilize the processors • If load imbalance dominates, may not be much machine can do
Problem size • Many critical characteristics depend on problem size • Communication pattern (IPC) • Synchronization pattern • Load imbalance • Need to choose problem sizes appropriately • Insufficient to use a single problem size
Steps in Choosing Problem Sizes • 1. Expert view • May know that users care only about a few problem sizes • 2. Determine range of useful sizes • Below which: bad performance or unrealistic time distribution in phases • Above which: execution time or memory usage too large • 3. Use understanding of inherent characteristics • Communication-to-computation ratio, load balance...
Summary • Performance improvement due to parallelism is often measured by speedup • Problem size is important • Scaling is often needed • Scaling models are fundamental to proper evaluation • Time constrained scaling is a realistic method for many applications • Scaling only the problem (data) size can yield misleading results • Proper scaling requires understanding the workload