1 / 36

Lecture 5: Part 1 Performance Laws: Speedup and Scalability

Lecture 5: Part 1 Performance Laws: Speedup and Scalability. Sequential Execution Time. Execution time = Seconds = Instructions x Cycles x Seconds (T e ) Program Program Instruction Cycle. 500 MHz CPU: 500 x 10 6 clock cycles/sec

martha-horn
Download Presentation

Lecture 5: Part 1 Performance Laws: Speedup and Scalability

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Lecture 5: Part 1Performance Laws:Speedup and Scalability

  2. Sequential Execution Time Execution time = Seconds = Instructions x Cycles x Seconds (Te) Program Program Instruction Cycle • 500 MHz CPU: 500 x 106 clock cycles/sec • My program consists of operations: • addition: 500 x 106 (1 cycle per inst) • mul: 180 x 106 (3 cycles/inst) • div: 120 x 106 (2 cycles/inst) • data move: 300 x 106 (1 cycle/inst) • Expected execution time: • 2 ns x (500x1+180x3+120x2+300x1)x 106

  3. Basic Performance Metrics • MIPS = (instructions/second)x 10-6 • MFLOPS = (floating point ops/second)x 10-6 • CPI = Average cycles per instruction • Throughput: number of results per second • Workload: W, number of Ops. required to complete the program • Speed: W/TE • Speedup (S)= Te / Timprove • Efficiency (using P processors) = Speedup / P Life is so easy !!

  4. Part I: Speedup

  5. Speedup • General concepts of Speedup in parallel computing: • How much faster an application runs on parallel computer ? • What benefits derive from the use of parallelism? • General agreement : • speedup = serial time/parallel time

  6. Speedup • Fixed-Workload Speedup (Amdahl’s Law) • Fixed-Time Speedup (Gustafson’s Law) • Fixed-Memory Speedup (Sun and Ni)

  7. Amdahl's Law Speedup due to enhancement E: ExTime w/o E Speedup(E) = ------------- ExTime w/ E Suppose that enhancement E accelerates a fraction F of the task by a factor S, and the remainder of the task is unaffected F/S F 1/S ExTime: execution time

  8. Amdahl’s Law ExTimenew = ExTimeold x (1 - Fractionenhanced) + Fractionenhanced Speedupenhanced 1 ExTimeold ExTimenew Speedupoverall = = (1 - Fractionenhanced) + Fractionenhanced Speedupenhanced

  9. Amdahl’s Law • Floating point instructions improved to run 2X; but only 10% of actual instructions are FP ExTimenew= Speedupoverall =

  10. Amdahl’s Law • Floating point instructions improved to run 2X; but only 10% of actual instructions are FP ExTimenew= ExTimeold x (0.9 + .1/2) = 0.95 x ExTimeold 1 Speedupoverall = = 1.053 0.95

  11. Amdahl’s Law • Some applications need real-time response • Amdahl’s law shows the upper bound of the achievable speedup for a given problem size. • Limit on the achievable speedup: • W: total workload •  of W must be executed sequentially • Sp = W / ( W+(1-  )(W/P)) = P / (1+(P-1) )  1 /  as P 

  12. Fixed-Time Speedup • Gustafson’s Law (1988) • Scaling the problem size along with the increase of machine size within the same execution time • Scaling for higher accuracy • Parameters: • W: workload done by a single node • W’ =  W+(1-  )WP = work done by P nodes • Sp = ( W+(1-  )WP) / W =  +(1-)P  Speedup is linear function of P

  13. Fixed-Memory Speedup • Sun and Ni’s Law: Memory bounding (1993) • To solve the largest possible problem, limited only by the available memory space. • As P increases, use up all the increased memory by scaling the problem size also • See Kai’s book (Section 3.6.3)

  14. More on Speedup • Can you measure the sequential time? • What if the memory is too small ? • Is the sequential machine using the same processor as the one in parallel machine ? • Can the sequential time be shorter than the parallel time ? • Is speedup really important ? Or only the execution time is important ?

  15. More on Speedup • Speedup larger than “P” (Superlinear) ? • What if the disk swapping effects happen in sequential execution? • More caches/memory used in the parallel machines (PxC, PxM) • Both sequential and parallel machines use the same OS/compiler ? • Is the data distribution/collection time included in the parallel time? • Some parallel machines provide parallel I/O !!

  16. More on Speedup • What if we use different algorithms for sequential and parallel solutions • Take a look at the parallel sorting assignment. What is the maximum speedup for parallel sorting?

  17. Definitions of Speedup • Diverse definitions of serial and parallel execution times [Sahni:1996] • Relative • Real • Absolute • Asymptotic • Asymptotic relative

  18. Parameters Used in Speedup Definitions • I= problem instance • P = number of processors • Q= parallel program • n = size of I

  19. Relative Speedup • serial time : execution time of the parallel program on a single node of the parallel computer. (mpirun -np 1) • Relative speedup (I,P) = time to solveI using program Q and 1 processor ÷ time to solveI using Qprogram and P processors

  20. Relative Speedup • Depends on the characteristics of the instance I being solved as well as the size P of the parallel computer • Same OS, same node architecture, using same compiler • Extra overheads in serial time (distribution/collection).

  21. Real Speedup • Parallel time vs. the fastest serial algorithm or program running on a single node of the parallel computer • Real speedup (I,P) = time to solve Iusing best serial program and 1 processor ÷ time to solveIusing Qprogram and P processors

  22. Problems on Real Speedup • The fastest algorithm might not be known. • No single algorithm might be fastest in all instances for some applications • In practice, we use the runtime of the most frequently used sequential algorithm.

  23. Absolute Speedup • Parallel time vs. the fastest sequential algorithm run on the fastest serial computer • Absolute speedup (I,P) = time to solve Iusing best serial program and 1 fastest processor ÷ time to solve I using Qprogram and P processors

  24. Absolute Speedup • Can also use the sequential algorithm most often used in practice. • Time-variant: researchers keep designing new algorithms • Speedup could be less than 1

  25. Asymptotic Real Speedup • Compares the execution time of the best serial algorithm for a particular problem with the asymptotic complexity of the parallel algorithm • Assuming the the parallel computer has all the processors it can use

  26. Asymptotic Real Speedup • Asymptotic real speedup (n) = asymptotic complexity of bestserial algorithm ÷ asymptotic complexity of Q using as many processors as needed

  27. Asymptotic Real Speedup • For problem such as sorting where the asymptotic complexity is not uniquely characterized by the instance size n, the worst-case complexityis used -- O(n2), O(n log n) • Unbounded number of processors.

  28. Asymptotic Relative Speedup • Uses the asymptotic time complexity of the parallel algorithm when run on a single processor. • Asymptotic relative speedup= asymptotic complexity of Q using 1 processor ÷ asymptotic complexity of Q using as many processors as needed

  29. Asymptotic Relative Speedup • Matrix Multiplication (n X n) • Serial time: O(n3) • Parallel time (on n3/log n processors hypercube): O(log n) time • Asymptotic Relative Speedup: O(n3/log n) • Others are measured Speedup • Relative • Real • Absolute

  30. Part II: Scalability

  31. Scalability • Algorithmic Scalability: • the available parallelism increases at least linearly with problem size. • Architectural (Size) Scalability: • the architecture continues to yield the same performance per processor, as the number of processors is increased and as the problem size is increased.

  32. Architectural Scalability Processor Speed No. of issues, depth of pipelining Interconnection Network: latency/bandwidth Architectural scalability Memory Subsystem: cache/memory speed OS support (lock, processes synchronization overheads) I/O performance Performance grows in all aspects

  33. Discussion of Architectural Scalability: bus-based SMPs • Bus (single set of wires) : fixed bandwidth shared by all processors (P) • Bandwidth accessible by each processor decreases as P increases • Physical constraints: fixed number of slots • Electrical constraints: • bus loading, wire length determine frequency • power • Cost scaling: X (bus is the core) • OS scaling : scheduling, locking, I/O

  34. Scalable Interconnection Network • Bandwidth (P)? • Latency (P)? • Cost (P)? • More bandwidth, smaller L  greater cost • SGI XL: 1.5 M HK$ (1.2 GB/s) • ALR SMP: 0.1 M HK$ (533 MB/s) Network (bus ? multistage, mesh, torus) PE PE PE PE

  35. Bottom Line • There is NO “truly scalable” machine • Goal is to design for a given “range of scale” • one (instruction level) • two to tens (SMP): • Top-end: Sun Enterprise 1000 : 64 processors, COMPACQ 64 Alpha processors • tens to a few thousands (Distributed-Memory Multicomputers: T3E, SP2, Paragon, ...) • tens of thousands (ASCI machines: cluster of SMPs) • Techniques at one scale may not be cost effect at another

  36. Bottom Line • Even though we can meet the engineering requirements to physically scale over the range of interest, we must ensure that the communication and synchronization operations required to support target programming models also scale and are cost effective.

More Related