Chapter 3: Reasoning about Performance. Principles of Parallel Programming, First Edition, by Calvin Lin and Lawrence Snyder.
Parallelism versus Performance • Pipelining increases performance • A computation that takes time T sequentially ideally takes time T/P on P processors • This ideal is harder to achieve as P becomes large
Threads vs Processes • Threads: share memory; low cost • Processes: local memory; expensive; communicate by message passing
What performance means • Latency: the amount of time it takes to do some work • Throughput: the amount of work done per unit time
Pipelining • Increases throughput • A single instruction may execute more slowly • Up to a five-fold speedup is possible with five stages • Other hardware speedups: caches, memory prefetching
Figure 3.1 Simplified processor pipeline. By dividing instruction execution into five equal-size parts, five instructions can ideally be executing simultaneously, giving (ideally) a five-fold improvement over executing each individual instruction to completion.
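The five-fold figure is an asymptotic ideal. A minimal sketch of the arithmetic, assuming every stage takes exactly one cycle and the pipeline never stalls (the function names are illustrative, not from the book):

```python
def sequential_cycles(n_instructions, n_stages=5):
    """Cycles when each instruction runs to completion before the next starts."""
    return n_instructions * n_stages


def pipelined_cycles(n_instructions, n_stages=5):
    """Cycles in an ideal pipeline: fill it once, then retire one instruction per cycle."""
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1)


def pipeline_speedup(n_instructions, n_stages=5):
    """Ratio of unpipelined to pipelined cycle counts."""
    return sequential_cycles(n_instructions, n_stages) / pipelined_cycles(n_instructions, n_stages)


# Speedup grows toward the stage count as the instruction stream gets longer.
for n in (1, 10, 1000):
    print(n, round(pipeline_speedup(n), 2))
```

Under these assumptions the latency of one instruction is still five cycles; only the throughput improves, and the speedup approaches but never reaches 5.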
Why expected speedup may not be achieved • Overhead • Non-parallelizable computation • Idle processors • Contention for resources
Figure 3.2 A schematic diagram of setup and tear down overhead for threads and processes.
Sources of overhead • Communication • Synchronization: use of locks; waiting for other threads/processes • Computation: extra work beyond what the sequential program does, such as each thread working out which portion of the data it should process • Memory
Table 3.1 Sources of communication overhead by communication mechanism.
Limits on speedup • Amdahl’s law describes the potential benefit of parallelism • If a fraction 1/S of the computation is sequential, speedup is limited to a factor of S • Tp = (1/S) * Ts + (1 - 1/S) * Ts / P • Reality is probably worse than this, since the time for the parallel portion is unlikely to vanish
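The formula on this slide can be evaluated directly. The sketch below is a plain transcription of it; the function and parameter names (`amdahl_time`, `seq_fraction`) are assumptions made for illustration, with `seq_fraction` standing for 1/S:

```python
def amdahl_time(t_seq, seq_fraction, p):
    """Tp = (1/S) * Ts + (1 - 1/S) * Ts / P, with seq_fraction = 1/S."""
    return seq_fraction * t_seq + (1.0 - seq_fraction) * t_seq / p


def amdahl_speedup(seq_fraction, p):
    """Speedup Ts / Tp under Amdahl's law; independent of Ts."""
    return 1.0 / (seq_fraction + (1.0 - seq_fraction) / p)


# With 1/S = 0.1 the speedup creeps toward, but never reaches, S = 10.
for p in (2, 10, 100, 1000):
    print(p, round(amdahl_speedup(0.1, p), 2))
```

Even with 1000 processors the speedup stays below 10 when a tenth of the work is sequential, which is the bound the slide states.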
Contention • Spin locks: increase bus traffic, affecting other threads • False sharing • Mutexes
Idle time • Load imbalance • Memory-bound computations: DRAM is slower than the CPU, and DRAM latency has not improved as fast as CPU speed • Keep locality of reference in mind • Hardware threading helps
Parallel structure • Dependences • Granularity • Locality
Dependence • An ordering relationship between two computations • Caused by read and write operations, or by mutual exclusion • Data dependences must be preserved to maintain correctness • Flow dependence: read after write • Anti dependence: write after read • Output dependence: write after write • Input dependence: read after read (imposes no ordering constraint; matters for memory reuse)
Figure 3.3 A dependence between a memory write (store) operation of one thread and the memory read (load) of another thread.
True versus False dependency • sum = a + 1; first_term = sum * scale; sum = b + 1; second_term = sum * scale2; • Reusing sum creates a false dependency, which reduces parallelism
True versus False dependency • first_sum = a + 1; first_term = first_sum * scale; second_sum = b + 1; second_term = second_sum * scale2; • With the false dependency removed, the two pairs of statements can execute in parallel
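Dependency • first_term = (a + 1) * scale; second_term = (b + 1) * scale; • The dependency inside each statement cannot be removed by renaming: the add must occur before the multiply • The two statements are still independent of each other, the same improvement as on the previous slide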
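The effect of renaming can be sketched with two tasks that share no writable variable. This is only an illustration: the input values are made up, and the use of `ThreadPoolExecutor` is an assumption of this sketch, not something the slides prescribe.

```python
from concurrent.futures import ThreadPoolExecutor

# Assumed example inputs; the slides do not give concrete values.
a, b = 3, 7
scale, scale2 = 10, 100


def compute_first_term():
    first_sum = a + 1           # true (flow) dependence: the add precedes the multiply
    return first_sum * scale


def compute_second_term():
    second_sum = b + 1          # a separate name, so no false dependence on first_sum
    return second_sum * scale2


# Neither task writes anything the other reads, so they may run in either order or at once.
with ThreadPoolExecutor(max_workers=2) as pool:
    future_first = pool.submit(compute_first_term)
    future_second = pool.submit(compute_second_term)
    first_term = future_first.result()
    second_term = future_second.result()

print(first_term, second_term)
```

Had both tasks written to a single shared `sum`, the second assignment could overwrite the value before the first multiply read it, which is exactly the ordering constraint the renaming removes.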
Figure 3.4 Schematic diagram of sequential and tree-based addition algorithms; edges not connected to a leaf represent flow dependences.
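The difference between the two shapes in Figure 3.4 can be counted. The sketch below runs sequentially and only tallies how many dependent steps each ordering needs; the function names are illustrative, and both functions assume a non-empty list.

```python
def sequential_sum(values):
    """Left-to-right sum; every add waits on the previous one."""
    total = values[0]
    dependent_steps = 0
    for v in values[1:]:
        total += v
        dependent_steps += 1
    return total, dependent_steps


def tree_sum(values):
    """Pairwise sum; all adds within one round are independent of each other."""
    level = list(values)
    rounds = 0
    while len(level) > 1:
        paired = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:              # odd element out is carried to the next round
            paired.append(level[-1])
        level = paired
        rounds += 1
    return level[0], rounds


data = list(range(1, 9))                # 1..8
print(sequential_sum(data))             # same total, chain of 7 dependent adds
print(tree_sum(data))                   # same total, 3 rounds of independent adds
```

Both orderings perform the same number of additions; the tree simply arranges them so that the longest chain of flow dependences is logarithmic rather than linear in the input size.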
Granularity • Coarse or fine (large or small) • Defined by the degree to which threads/processes interact and the degree to which dependences cross thread/process boundaries
Locality • Temporal: memory references clustered in time • Spatial: memory references clustered by address
Performance Trade-Offs • 90/10 rule: avoid premature optimization • 90% of the time is spent in 10% of the code • Amdahl’s law warns against ignoring the remaining 10% of the execution time: if it stays sequential, speedup is limited to a factor of 10
Communication Costs • Communication cost can be reduced through additional computation • Overlap communication and computation • Perform redundant computation: recompute a value rather than send it • Redundant computation may also remove dependences, reducing synchronization
Memory vs Parallelism • Using more memory can enhance parallelism • Privatization (e.g. a thread-local sum) • Padding (to force variables to lie in their own cache lines)
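Privatization can be sketched as follows, using counting the 3s in an array as an assumed workload. Each thread accumulates into its own local counter and touches the shared total once. The function name and chunking scheme are illustrative; in CPython the global interpreter lock prevents real speedup for this loop, so the sketch shows only the structure. Padding has no direct Python equivalent and is not shown.

```python
import threading


def count_threes(data, n_threads=4):
    """Count occurrences of 3 using one private counter per thread."""
    total = 0
    lock = threading.Lock()
    chunk = (len(data) + n_threads - 1) // n_threads   # ceiling division

    def worker(start):
        nonlocal total
        local_count = 0                  # private: no sharing while scanning
        for x in data[start:start + chunk]:
            if x == 3:
                local_count += 1
        with lock:                       # one synchronized update per thread
            total += local_count

    threads = [threading.Thread(target=worker, args=(i * chunk,))
               for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total


print(count_threes([3, 1, 3, 3, 2, 3] * 100))
```

The extra memory is one counter per thread; in exchange, the lock is acquired `n_threads` times rather than once per matching element.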
Overhead vs Parallelism • Adding more processors often does not help • Parallelize the overhead: in the Count 3s example, accumulate the partial sums in parallel • Load balance vs overhead • Granularity trade-offs: increasing granularity reduces parallelism
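Measuring Performance • Execution time (latency) • Speedup = Ts / Tp • Superlinear speedup (as shown next) can occur when the parallel program does less work, e.g. all of its data fits in cache whereas the sequential program must access DRAM • Another example: search, where the parallel version may find the item after examining fewer elements than a sequential scan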
Figure 3.5 A typical speedup graph showing performance for two programs; the dashed line represents linear speedup.
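One way a search can show superlinear speedup is sketched below as a deterministic step count, with no real threads. It assumes P searchers each scan a contiguous chunk in lockstep and all stop as soon as one finds the target; the function names and the data are made up for illustration.

```python
def sequential_steps(data, target):
    """Elements examined by a single left-to-right scan."""
    for i, x in enumerate(data):
        if x == target:
            return i + 1
    return len(data)


def parallel_steps(data, target, p):
    """Lockstep rounds until any of p searchers (one per chunk) finds the target."""
    chunk = (len(data) + p - 1) // p
    for step in range(chunk):
        for k in range(p):
            i = k * chunk + step
            if i < len(data) and data[i] == target:
                return step + 1
    return chunk


data = list(range(1000))
target = 500                              # sits at the start of the second chunk
seq = sequential_steps(data, target)
par = parallel_steps(data, target, p=2)
speedup = seq / par
print(seq, par, speedup)
```

With two searchers the speedup here far exceeds 2 because the parallel version skips the work of scanning the first half. The effect depends entirely on where the target lies: for `target = 499` both versions take 500 steps and the speedup is 1.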
Measuring Performance (cont) • Efficiency = Speedup / P • Ideally Efficiency = 1, with all processors busy • Communication has not kept up with CPU enhancements, so efficiency has suffered • Relative speedup: Ts is measured by running the parallel program with P = 1; an inflated Ts makes speedup look better • Cold starts: measurements taken without warming the cache • Peripheral charges: I/O time can mask parallel improvement
Measuring Performance (cont) • Scaled speedup vs fixed-size speedup • A problem size may be fine for 1 processor but too small for many processors • A problem size often lends itself to a specific number of processors • Choosing a problem size is difficult
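Scalable Performance • Consider alphabetizing • Assume a non-parallelizable cost of 0.2 Ts that is paid regardless of P, so Tp = Ts / P + 0.2 Ts • For 2 processors: T2 = Ts / 2 + 0.2 Ts = 0.7 Ts, E2 = (Ts / T2) / 2 = (10/7) / 2 = 5/7 ≈ 0.71 • For 10 processors: T10 = Ts / 10 + 0.2 Ts = 0.3 Ts, E10 = (10/3) / 10 ≈ 0.33 • For 100 processors: T100 = Ts / 100 + 0.2 Ts = 0.21 Ts, E100 = (100/21) / 100 ≈ 0.048, hardly worth it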
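The efficiency figures on this slide follow from the timing model it uses, Tp = Ts / P + 0.2 Ts. A small sketch reproduces them; the function names and the `fixed_fraction` parameter are illustrative assumptions.

```python
def parallel_time(t_seq, p, fixed_fraction=0.2):
    """Tp = Ts / P + fixed_fraction * Ts, the model used on the slide."""
    return t_seq / p + fixed_fraction * t_seq


def efficiency(p, fixed_fraction=0.2):
    """Efficiency = (Ts / Tp) / P; Ts cancels out of the ratio."""
    speedup = 1.0 / (1.0 / p + fixed_fraction)
    return speedup / p


for p in (2, 10, 100):
    print(p, round(efficiency(p), 3))
```

Efficiency falls from about 0.71 at two processors to under 0.05 at a hundred, because the fixed 0.2 Ts term dominates once Ts / P becomes small.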
Implications for Hardware • The previous slide indicates that as the number of processors grows, the computing power of each individual processor matters less
Implications for Software Considerations of computation versus communication.
Scaling the Problem Size • Typically the problem size increases as more processors are used • So far the problem size was held constant • Consider how parallelism affects the problem size that can be solved • The next slide works through the computation, assuming perfect speedup
Processors needed for Scaling • Consider a sequential algorithm whose execution time is O(n^x), so T = c n^x • Assume P processors and an m-times-larger problem solved in the same execution time: T = c (mn)^x / P = c n^x • Solving for m gives m = P^(1/x), equivalently P = m^x • For O(n^4), increasing the problem size by a factor of 100 requires 100^4 = 100,000,000 processors
Processors needed for Scaling (cont) • For O(n^2), increasing the problem size by a factor of 100 needs 100^2 = 10,000 processors • Lesson: the algorithm needs to be as scalable as possible; adding processors alone buys only a modest increase in problem size • This concept is “the corollary of modest potential”
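The processor counts on these two slides come from solving c (mn)^x / P = c n^x, which gives P = m^x and m = P^(1/x). A short sketch of that arithmetic, assuming perfect speedup as the slides do (the function names are illustrative):

```python
def processors_needed(m, x):
    """Processors required to solve an m-times-larger O(n^x) problem in the same time."""
    return m ** x


def problem_growth(p, x):
    """Factor by which the problem size can grow given p processors."""
    return p ** (1.0 / x)


print(processors_needed(100, 4))      # the O(n^4) case from the slide
print(processors_needed(100, 2))      # the O(n^2) case from the slide
print(problem_growth(10_000, 4))      # 10,000 processors on an O(n^4) algorithm
```

Read in the other direction, 10,000 processors let an O(n^4) problem grow only about ten-fold, which is why the exponent of the algorithm matters more than the processor count.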
Summary • Increased parallelism may not yield increased performance • Many interconnected issues: memory, processors, synchronization, communication
Summary (cont) • Make trade-offs to achieve goals • Increase locality • Reduce cross-thread dependences • Consider granularity