Chapter 3: Reasoning about Performance Principles of Parallel Programming, First Edition by Calvin Lin and Lawrence Snyder
Parallelism versus Performance • Parallelism increases performance: work that takes time T sequentially ideally takes T/P on P processors • This ideal becomes harder to achieve as P becomes large
Threads vs Processes • Threads • Share memory • Low cost • Processes • Local memory • Expensive • Message passing
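A minimal sketch of the difference, using POSIX threads and fork (assumed available; compile with -pthread): a thread's write to a global is visible to its creator, while a forked child's write is not.

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int shared = 0;  /* one copy, visible to every thread */

void *increment(void *arg) {
    (void)arg;
    shared++;  /* thread writes the shared global */
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, increment, NULL);
    pthread_join(t, NULL);
    printf("after thread:  shared = %d\n", shared);  /* prints 1 */

    if (fork() == 0) {  /* child gets a private copy of memory */
        shared++;       /* invisible to the parent */
        exit(0);
    }
    wait(NULL);
    printf("after process: shared = %d\n", shared);  /* still 1 */
    return 0;
}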
Performance meaning • Latency • Amount of time it takes to complete some work • Throughput • Amount of work done per unit time
Pipelining • Increases throughput • A single instruction may execute more slowly • Ideally up to a five-fold speedup for a five-stage pipeline • Other hardware speedups • Caches • Memory prefetching
Figure 3.1 Simplified processor pipeline. By dividing instruction execution into five equal-size parts, five instructions can ideally be executing simultaneously, giving (ideally) a five-fold improvement over executing each individual instruction to completion.
Why expected speedup may not be achieved • Overhead • Non-parallelizable computation • Idle processors • Contention for resources
Figure 3.2 A schematic diagram of setup and teardown overhead for threads and processes.
Sources of overhead • Communication • Synchronization • Use of locks • Waiting for other threads/processes • Computation • Work not present in the sequential version, e.g., each thread deciding which portion of the data it should process • Memory
Table 3.1 Sources of communication overhead by communication mechanism.
Limits on speedup • Amdahl's law • Describes the potential benefit of parallelism • If a 1/S fraction of the computation is sequential, then the maximum speedup is limited to a factor of S • Tp = Ts/S + (1 − 1/S) × Ts/P • Reality is probably worse than this, since the time for the parallel portion is unlikely to vanish (see the sketch below)
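A minimal sketch of this formula in C (the names amdahl_time, Ts, and S are illustrative, not from the book):

#include <stdio.h>

/* Predicted parallel time under Amdahl's law: a 1/S fraction of the
   sequential time Ts stays serial; the rest is split across P processors. */
double amdahl_time(double Ts, double S, int P) {
    return Ts / S + (1.0 - 1.0 / S) * Ts / P;
}

int main(void) {
    double Ts = 1.0;   /* normalized sequential time */
    double S  = 10.0;  /* a 1/10 fraction of the work is serial */
    for (int P = 1; P <= 1024; P *= 4)
        printf("P = %4d  speedup = %.2f\n", P, Ts / amdahl_time(Ts, S, P));
    /* The printed speedup approaches S = 10 but never reaches it. */
    return 0;
}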
Contention • Spin locks • Spinning increases bus traffic, affecting other threads (see the sketch below) • False sharing • Mutexes
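A minimal spin lock using C11 atomics, as a sketch (a production lock would add backoff to reduce the bus traffic noted above):

#include <stdatomic.h>

static atomic_flag lock = ATOMIC_FLAG_INIT;

void spin_lock(void) {
    /* Each failed test-and-set attempts a write to the lock's cache line,
       generating the coherence traffic that slows other threads. */
    while (atomic_flag_test_and_set_explicit(&lock, memory_order_acquire))
        ;  /* spin */
}

void spin_unlock(void) {
    atomic_flag_clear_explicit(&lock, memory_order_release);
}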
Idle time • Load imbalance • Memory-bound computations • DRAM is slower than the CPU • DRAM latency has not improved as fast as CPU speed has • Keep locality of reference in mind • Hardware multithreading helps hide the latency
Parallel structure • Dependences • Granularity • Locality
Dependences • An ordering relationship between two computations • Caused by read and write operations, or by synchronization such as a mutex • Data dependences must be preserved to maintain correctness • Flow dependence: read after write • Anti dependence: write after read • Output dependence: write after write • Input dependence: read after read (matters only for memory reuse) • All four kinds are sketched below
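A sketch of the four kinds on a single variable x (illustrative code, not from the book):

void dependences(void) {
    int x = 1, a, b, c;

    a = x + 1;  /* read of x */
    b = x + 2;  /* input dependence: read after read; safe to reorder */
    x = a;      /* anti dependence: write after the reads above */
    x = b;      /* output dependence: write after the write above */
    c = x;      /* flow dependence: read after write; c must see b's value */
    (void)c;    /* silence unused-variable warnings */
}

Only the flow dependence is "true": the anti and output dependences can be removed by renaming, as the next slides show.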
Figure 3.3 A dependence between a memory write (store) operation of one thread and the memory read (load) of another thread.
True versus False dependences
sum = a + 1;
first_term = sum * scale;
sum = b + 1;
second_term = sum * scale2;
Reusing sum creates a false dependence between the two pairs of statements, which reduces parallelism.
True versus False dependences (cont)
first_sum = a + 1;
first_term = first_sum * scale;
second_sum = b + 1;
second_term = second_sum * scale2;
By removing the false dependence through renaming, the two pairs of statements can execute in parallel.
Dependences (cont)
first_term = (a + 1) * scale;
second_term = (b + 1) * scale2;
These flow dependences cannot be removed by renaming: each addition must occur before its multiplication. But the same improvement as on the previous slide still holds, since the two statements are independent of each other.
Figure 3.4 Schematic diagram of sequential and tree-based addition algorithms; edges not connected to a leaf represent flow dependences.
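A recursive sketch of the tree-based sum in Figure 3.4 (illustrative code; assumes n >= 1):

#include <stddef.h>

/* Tree-based addition: the two recursive calls are independent, so each
   level of the tree could run in parallel, for a depth of about log2(n)
   instead of the n - 1 dependent additions of the sequential sum. */
long tree_sum(const int *a, size_t n) {
    if (n == 1)
        return a[0];
    size_t half = n / 2;
    long left  = tree_sum(a, half);             /* independent of ... */
    long right = tree_sum(a + half, n - half);  /* ... this call */
    return left + right;
}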
Granularity • Ranges from coarse (large) to fine (small) • Defined by • The degree to which threads/processes interact • The degree to which dependences cross thread/process boundaries
Locality • Temporal locality • Repeated references to the same memory locations over time • Spatial locality • References to nearby memory addresses • Both are illustrated below
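A standard C illustration (C arrays are row-major, so the first loop nest touches consecutive addresses; the second strides by N and has far worse spatial locality):

#define N 1024
double m[N][N];

double sum_row_major(void) {      /* good spatial locality */
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += m[i][j];         /* consecutive addresses */
    return s;
}

double sum_column_major(void) {   /* poor spatial locality */
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += m[i][j];         /* stride of N doubles */
    return s;
}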
Performance Trade-Offs • 90/10 rule: avoid premature optimization • 90% of the time is spent in 10% of the code • Amdahl's law adds a caveat: even with that hot 10% of the code perfectly parallelized, the remaining 10% of execution time limits speedup to a factor of 10
Communication Costs • Communication can be reduced by performing additional computation • Overlap communication with computation • Perform redundant computation • Recompute a value rather than send it (see the sketch below) • This may also remove dependences, reducing synchronization
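A hypothetical sketch of recomputing rather than sending: every thread rebuilds a cheap lookup table privately instead of receiving it from a producer, eliminating both the communication and the dependence (build_table and TABLE_SIZE are invented for illustration):

#include <math.h>

#define TABLE_SIZE 256

/* Each thread calls this to rebuild the table locally. The redundant
   computation replaces a message and removes a synchronization point. */
void build_table(double table[TABLE_SIZE]) {
    for (int i = 0; i < TABLE_SIZE; i++)
        table[i] = sin(i * (3.141592653589793 / TABLE_SIZE));
}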
Memory vs Parallelism • Using more memory can enhance parallelism • Privatization (e.g., a per-thread local sum) • Padding (to force variables onto their own cache lines, avoiding false sharing) • Both are sketched below
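A sketch combining both techniques, assuming 64-byte cache lines (the struct and names are illustrative):

#include <pthread.h>
#include <stddef.h>

#define NTHREADS 4
#define CACHE_LINE 64  /* assumed line size */

/* Padding: each counter occupies its own cache line, so no false sharing. */
struct padded_sum {
    long value;
    char pad[CACHE_LINE - sizeof(long)];
};

/* Privatization: one slot per thread, so no lock is needed while summing. */
static struct padded_sum partial[NTHREADS];

struct args { const int *data; size_t lo, hi; int id; };

void *local_sum(void *p) {
    struct args *a = p;
    long s = 0;  /* accumulate in a local first */
    for (size_t i = a->lo; i < a->hi; i++)
        s += a->data[i];
    partial[a->id].value = s;  /* one write to the private slot */
    return NULL;
}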
Overhead vs Parallelism • Often adding more processors does not help • Parallelize the overhead itself, e.g., in the count 3s example, accumulate the partial sums in parallel • Load balance vs overhead • Granularity trade-offs • Increased granularity can reduce parallelism
Measuring Performance • Execution time (latency) • Speedup = Ts / Tp • Superlinear speedup (as shown next) can occur when the parallel version • Does less work • Keeps all data in cache, whereas the sequential version must access DRAM • Another example: search, where one parallel searcher may find the target after examining far less data than the sequential search
Figure 3.5 A typical speedup graph showing performance for two programs; the dashed line represents linear speedup.
Measuring Performance (cont) • Efficiency = Speedup / P • Ideally Efficiency = 1, with all processors busy doing useful work • Communication has not kept up with CPU improvements, so efficiency has suffered • Beware of ways to make speedup look better than it is: • Relative speedup: using the parallel program run with P = 1 as the baseline • An inflated Ts makes speedup look better • Cold starts: not warming the cache • Peripheral charges: I/O costs can mask parallel improvement • A small helper for these definitions is sketched below
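A small C helper for these definitions (names illustrative); to avoid the relative-speedup pitfall, Ts should be the best sequential time, not the parallel code run with P = 1:

#include <stdio.h>

double speedup(double Ts, double Tp) { return Ts / Tp; }
double efficiency(double Ts, double Tp, int P) { return speedup(Ts, Tp) / P; }

int main(void) {
    double Ts = 10.0, Tp = 1.6;  /* measured seconds, illustrative */
    int P = 8;
    printf("speedup = %.2f, efficiency = %.2f\n",
           speedup(Ts, Tp), efficiency(Ts, Tp, P));
    return 0;
}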
Measuring Performance (cont) • Scaled speedup vs fixed-size speedup • A problem size that is right for 1 processor may be too small for many processors • A problem size often lends itself to a specific number of processors • Choosing a fair problem size is difficult
Scalable Performance
Consider alphabetizing, and assume 20% of the work is not parallelizable.
For 2 processors:
T2 = Ts/2 + 0.2 Ts = 0.7 Ts
E2 = (Ts/T2)/2 = (10/7)/2 = 5/7 ≈ 0.71
For 10 processors:
T10 = Ts/10 + 0.2 Ts = 0.3 Ts
E10 = (10/3)/10 = 1/3 ≈ 0.33
For 100 processors:
E100 ≈ 0.047, hardly worth it.
Implications for Hardware The previous slide indicates that as the number of processors grows, each processor's individual computing power matters less.
Implications for Software Software must weigh computation against communication.
Scaling the Problem Size Typically the problem size increases as more processors are used. Until now we kept the problem size constant. Consider how parallelism affects problem size. The next slide does the computation, assuming perfect speedup.
Processors needed for Scaling • Consider a sequential algorithm whose execution time is O(n^x), so T = c·n^x • Assume P processors and a problem m times larger, with the same execution time: T = c·(mn)^x / P = c·n^x • Solving for m gives m = P^(1/x) • For O(n^4), increasing the problem size by a factor of 100 requires 100^4 = 100,000,000 processors
Processors needed for Scaling (cont) • For O(n^2), increasing the problem size by a factor of 100 needs 100^2 = 10,000 processors (computed below) • Lesson: the algorithm needs to be as scalable as possible, since adding more processors makes scaling harder • This idea is known as "the corollary of modest potential"
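A one-line check of P = m^x for an m = 100 times larger problem:

#include <math.h>
#include <stdio.h>

int main(void) {
    double m = 100.0;             /* problem grows 100x */
    for (int x = 1; x <= 4; x++)  /* O(n^x) algorithms */
        printf("O(n^%d): %.0f processors\n", x, pow(m, x));
    /* O(n^2) -> 10,000 and O(n^4) -> 100,000,000, as on the slides */
    return 0;
}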
Summary • Increased parallelism may not yield increased performance • Many interconnected issues • Memory • Processors • Synchronization • Communication
Summary (cont) • Make trade-offs to achieve goals • Increase locality • Reduce cross-thread dependences • Consider granularity