
  1. Chapter 3: Reasoning about Performance. Principles of Parallel Programming, First Edition, by Calvin Lin and Lawrence Snyder

  2. Parallelism versus Performance • Pipelining increases performance • Work that takes time T sequentially ideally takes T/P on P processors • This ideal is harder to achieve as P becomes large

  3. Threads vs Processes • Threads: share memory; low creation cost • Processes: separate local memory; expensive to create; communicate via message passing

  4. The meaning of performance • Latency: the amount of time it takes to do some work • Throughput: the amount of work done per unit of time

  5. Pipelining • Increases throughput, even though a single instruction may execute more slowly • A five-stage pipeline can give up to a five-fold speedup • Other hardware speedups: caches, memory prefetching

  6. Figure 3.1 Simplified processor pipeline. By dividing instruction execution into five equal-size parts, five instructions can ideally be executing simultaneously, giving (ideally) a five-fold improvement over executing each individual instruction to completion.

  7. Why the expected speedup may not be achieved • Overhead • Non-parallelizable computation • Idle processors • Contention for resources

  8. Figure 3.2 A schematic diagram of setup and tear down overhead for threads and processes.

  9. Sources of overhead • Communication • Synchronization • Use of locks • Waiting for other threads/processes • Computation • Work not present in the sequential version, e.g., deciding which portion of the data each thread should process • Memory

  10. Table 3.1 Sources of communication overhead by communication mechanism.

  11. Limits on speedup • Amdahl's law describes the potential benefit of parallelism • If a fraction 1/S of the execution is inherently sequential, then the maximum speedup is limited to a factor of S • TP = (1/S) * TS + (1 - 1/S) * TS / P (worked through in the sketch below) • Reality is probably worse than this, since the time for the parallel portion is unlikely to vanish entirely
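
A minimal sketch of this bound in C (the function name and sample numbers are illustrative, not from the book): it evaluates TP/TS = 1/S + (1 - 1/S)/P and reports the implied speedup, which saturates at S no matter how large P grows.

    #include <stdio.h>

    /* Amdahl's law: with sequential fraction f = 1/S,
       Tp = f*Ts + (1 - f)*Ts/P, so speedup = Ts/Tp <= 1/f = S. */
    double amdahl_speedup(double f, int P) {
        double tp_over_ts = f + (1.0 - f) / P;   /* Tp / Ts */
        return 1.0 / tp_over_ts;                 /* Ts / Tp */
    }

    int main(void) {
        /* 10% sequential: speedup approaches 10 regardless of P. */
        for (int p = 2; p <= 1024; p *= 4)
            printf("P = %4d  speedup = %.2f\n", p, amdahl_speedup(0.10, p));
        return 0;
    }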

  12. Contention • Spin locks increase bus traffic, affecting other threads (see the sketch below) • False sharing • Mutexes
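
To make the bus-traffic point concrete, here is a hedged C11 sketch of a test-and-test-and-set spin lock (the type and function names are invented for illustration). A naive lock that retries the atomic exchange in a tight loop forces the lock's cache line to bounce between cores on every attempt; spinning on a plain load first keeps the line shared until the lock is released.

    #include <stdatomic.h>
    #include <stdbool.h>

    /* Hypothetical test-and-test-and-set spin lock. */
    typedef struct { atomic_bool held; } spinlock_t;

    void spin_lock(spinlock_t *l) {
        while (atomic_exchange_explicit(&l->held, true, memory_order_acquire)) {
            /* Spin on a plain load so the lock's cache line stays shared
               instead of bouncing between cores on every retry. */
            while (atomic_load_explicit(&l->held, memory_order_relaxed))
                ; /* busy-wait */
        }
    }

    void spin_unlock(spinlock_t *l) {
        atomic_store_explicit(&l->held, false, memory_order_release);
    }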

  13. Idle time • Load imbalance • Memory-bound computations • DRAM is slower than the CPU, and DRAM latency has not improved as fast as CPU speed • Keep locality of reference in mind • Hardware multithreading helps hide memory latency

  14. Parallel structure • Dependences • Granularity • Locality

  15. Dependences • An ordering relationship between two computations • Caused by read and write operations, or by mutexes • Data dependences must be preserved to maintain correctness • Flow dependence: read after write • Anti dependence: write after read • Output dependence: write after write • Input dependence: read after read (signals memory reuse, not an ordering constraint) • All four appear in the sketch below
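
The four dependence types can all be seen in one small fragment; a sketch in C, with variable names mirroring the sum/scale example on the following slides:

    /* Each statement is annotated with the dependences it creates. */
    void dependences(double a, double b, double scale) {
        double sum, first_term, second_term;
        sum = a + 1;                 /* S1: writes sum */
        first_term = sum * scale;    /* S2: reads sum -> flow dependence on S1
                                        (read after write) */
        sum = b + 1;                 /* S3: writes sum -> anti dependence on S2
                                        (write after read) and output dependence
                                        on S1 (write after write) */
        second_term = sum * scale;   /* S4: reads sum -> flow on S3; S2 and S4
                                        both read scale -> input dependence
                                        (read after read), which does not
                                        constrain their order */
        (void)first_term; (void)second_term;
    }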

  16. Figure 3.3 A dependence between a memory write (store) operation of one thread and the memory read (load) of another thread.

  17. True versus false dependences
  sum = a+1;
  first_term = sum*scale;
  sum = b+1;
  second_term = sum*scale2;
  Reusing sum creates false dependences (anti and output), which reduce parallelism.

  18. True versus false dependences
  first_sum = a+1;
  first_term = first_sum*scale;
  second_sum = b+1;
  second_term = second_sum*scale2;
  Renaming removes the false dependences, so the two pairs of statements can execute in parallel.

  19. Dependences
  first_term = (a+1)*scale;
  second_term = (b+1)*scale2;
  The flow dependence within each statement cannot be removed by renaming: the add must occur before the multiply. But the two statements remain independent of each other, so the same parallelism as on the previous slide is still available.

  20. Figure 3.4 Schematic diagram of sequential and tree-based addition algorithms; edges not connected to a leaf represent flow dependences.
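
A compact sketch of the tree-based algorithm the figure depicts, written here as a sequential C recursion (assumes n >= 1); in a parallel setting the two independent recursive calls could run on different threads:

    /* Pairwise (tree) summation: recursing on each half turns a chain of
       n-1 flow-dependent additions into a balanced tree of depth ~log2(n). */
    double tree_sum(const double *a, int n) {
        if (n == 1)
            return a[0];
        int half = n / 2;
        /* The two recursive calls are independent of each other, so with
           P processors they could proceed in parallel. */
        return tree_sum(a, half) + tree_sum(a + half, n - half);
    }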

  21. Granularity • Coarse grain: large chunks of work between interactions • Fine grain: small chunks of work between interactions • Defined by the degree to which threads/processes interact, and the degree to which dependences cross thread/process boundaries

  22. Locality • Temporal locality: memory references clustered in time; recently referenced locations are likely to be referenced again soon • Spatial locality: memory references clustered by address; locations near a recent reference are likely to be referenced next • Illustrated in the sketch below
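
A standard way to see spatial locality in C (the array size N is an arbitrary choice): row-order traversal of a row-major array touches consecutive addresses and uses every fetched cache line fully, while column-order traversal strides across lines.

    #define N 1024
    double m[N][N];

    /* Good spatial locality: the inner loop walks consecutive addresses,
       so every byte of each fetched cache line is used. */
    double sum_row_order(void) {
        double s = 0.0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                s += m[i][j];
        return s;
    }

    /* Poor spatial locality: the inner loop strides N*sizeof(double)
       bytes, touching a new cache line on almost every access. */
    double sum_column_order(void) {
        double s = 0.0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                s += m[i][j];
        return s;
    }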

  23. Performance trade-offs • 90/10 rule: avoid premature optimization, since 90% of the time is spent in 10% of the code • Amdahl's law says we should pay attention to that 10%: if we improve only the code where 90% of the time is spent, speedup is limited to a factor of 10

  24. Communication costs • Communication may be reduced by additional computation • Overlap communication and computation (see the sketch below) • Perform redundant computation: recompute a value rather than send it • This may also remove dependences, reducing synchronization
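
One common way to overlap communication with computation is nonblocking message passing. A hedged MPI sketch in C (compute_on is a hypothetical application kernel; the neighbor rank and tag are placeholders): start the send early, compute while the message is (ideally) in flight, and block only when the buffer must be reused.

    #include <mpi.h>

    void compute_on(double *local);   /* hypothetical kernel; does not
                                         touch the halo buffer */

    void exchange_and_compute(double *halo, double *local, int n,
                              int neighbor, MPI_Comm comm) {
        MPI_Request req;
        MPI_Isend(halo, n, MPI_DOUBLE, neighbor, 0 /* tag */, comm, &req);
        compute_on(local);                 /* overlapped computation */
        MPI_Wait(&req, MPI_STATUS_IGNORE); /* synchronize before reusing halo */
    }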

  25. Memory vs parallelism • Using more memory can enhance parallelism • Privatization (e.g., a per-thread local sum) • Padding (to force variables onto their own cache lines, avoiding false sharing) • Both appear in the sketch below
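
A hedged C sketch combining both ideas (the thread count and the 64-byte line size are assumptions): each thread accumulates into a private local variable and performs a single write into its own padded slot, so no two threads share a cache line.

    #define NTHREADS 4
    #define LINE 64                  /* assumed cache-line size */

    /* Padding: each slot occupies its own cache line, so threads updating
       different slots never invalidate one another (no false sharing).
       A complete version would also align the array to LINE bytes. */
    struct padded_sum {
        double sum;
        char pad[LINE - sizeof(double)];
    };
    static struct padded_sum partial[NTHREADS];

    /* Privatization: thread t accumulates into a local variable and
       performs one shared write at the end. */
    void accumulate(int t, const double *a, int lo, int hi) {
        double local = 0.0;
        for (int i = lo; i < hi; i++)
            local += a[i];
        partial[t].sum = local;
    }

    /* Combine the partial sums once all threads have finished. */
    double combine(void) {
        double s = 0.0;
        for (int t = 0; t < NTHREADS; t++)
            s += partial[t].sum;
        return s;
    }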

  26. Overhead vs parallelism • Often adding more processors does not help • Parallelize the overhead itself: in Count 3s, accumulate the partial sums in parallel • Load balance vs overhead • Granularity trade-offs: increased granularity reduces overhead but also reduces parallelism

  27. Measuring performance • Execution time (latency) • Speedup = TS / TP • Superlinear speedup (as shown next) can occur when the parallel program • Does less total work • Keeps all its data in cache, whereas the sequential program must access DRAM • Another example: search, where one parallel thread may find the target after examining far less data than a sequential search would

  28. Figure 3.5 A typical speedup graph showing performance for two programs; the dashed line represents linear speedup.

  29. Measuring performance (cont.) • Efficiency = Speedup / P • Ideally Efficiency = 1, meaning all processors stay busy • Communication speed has not kept up with CPU improvements, so efficiency has suffered • Relative speedup: compare against the parallel program run with P = 1 • Beware of inflating TS to make speedup look better • Cold starts: failing to warm the cache • Peripheral charges: I/O costs can mask parallel improvement

  30. Measuring performance (cont.) • Scaled speedup vs fixed-size speedup • A problem size may be right for 1 processor but too small for many processors • A problem size often lends itself to a specific number of processors • Choosing an appropriate problem size is difficult

  31. Scalable performance • Consider alphabetizing, and assume 20% of the work is not parallelizable • By Amdahl's law (slide 11), TP = 0.2 TS + 0.8 TS / P • For 2 processors: T2 = 0.2 TS + 0.4 TS = 0.6 TS, so E2 = (TS / T2) / 2 = (5/3)/2 = 5/6 ≈ 0.83 • T10 = 0.2 TS + 0.08 TS = 0.28 TS, so E10 = (TS / T10) / 10 ≈ 0.36 • T100 = 0.208 TS, so E100 ≈ 0.048: hardly worth it

  32. Implications for hardware The previous slide indicates that as the number of processors grows, the individual computing power of each processor matters less.

  33. Implications for software Software must weigh computation against communication; as P grows, keeping communication costs down becomes increasingly important.

  34. Scaling the problem size • Typically the problem size increases as more processors are used • Until now we kept the problem size constant • Consider how parallelism affects the problem size we can handle • The next slide works through the computation, assuming perfect speedup

  35. Processors needed for scaling • Consider a sequential algorithm whose execution time is O(n^x), so T = c n^x • Assume P processors and an m-times-larger problem solved in the same time: T = c (mn)^x / P = c n^x • Solving for m gives m = P^(1/x), i.e., P = m^x processors are needed • For an O(n^4) algorithm, increasing the problem size by a factor of 100 requires 100^4 = 100,000,000 processors
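
Written out as one derivation (a restatement of the slide's algebra in LaTeX):

    T_{\text{seq}} = c\,n^x, \qquad
    T_{\text{par}} = \frac{c\,(mn)^x}{P}

    \text{Setting } T_{\text{par}} = T_{\text{seq}}: \quad
    \frac{c\,(mn)^x}{P} = c\,n^x
    \;\Longrightarrow\; m^x = P
    \;\Longrightarrow\; m = P^{1/x}

    \text{Example: } x = 4,\ m = 100 \;\Rightarrow\; P = 100^4 = 10^8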

  36. Processors needed for scaling (cont.) • For an O(n^2) algorithm, increasing the problem size by a factor of 100 needs 100^2 = 10,000 processors • Lesson: the lower the algorithm's complexity, the better it scales; since P grows as m^x, adding processors alone buys only modest growth in problem size • This idea is known as "the corollary of modest potential"

  37. Summary • Increased parallelism may not yield increased performance • Many interacting issues: • Memory • Processors • Synchronization • Communication

  38. Summary (cont.) • Make trade-offs to achieve performance goals • Increase locality • Reduce cross-thread dependences • Consider granularity
