Improving Parallel Performance

Introduction to Parallel Programming – Part 7 Improving Parallel Performance Intel Software College

Objectives • At the end of this module, you should be able to • Give two reasons why one sequential algorithm may more suitable than another for parallelization • Use loop fusion, loop fission, and loop inversion to create or improve opportunities for parallel execution • Explain the pros and cons of static versus dynamic loop scheduling • Explain why it can be difficult both to optimize load balancing and maximize locality Improving Parallel Performance

General Rules of Thumb • Start with best sequential algorithm • Maximize locality Improving Parallel Performance

Start with Best Sequential Algorithm • Don’t confuse “speedup” with “speed” • Speedup: ratio of program’s execution time on 1 processor to its execution time on p processors • What if start with inferior sequential algorithm? • Naïve, higher complexity algorithms • Easier to make parallel • Usually don’t lead to fastest parallel algorithm Improving Parallel Performance

Example: Search for Chess Move • Naïve minimax algorithm • Exhaustive search of game tree • Branching factor around 35 • Nodes evaluated in search of depth d: 35d • Alpha-beta search algorithm • Prunes useless subtrees • Branching factor around 6 • Nodes evaluated in search of depth d: 6d Improving Parallel Performance

Minimax Search My move— choose max 3 0 3 His move— choose min 3 4 0 6 My move— choose max 3 1 4 1 0 -5 -2 6 His move— choose min 7 1 4 1 8 3 -2 7 3 5 6 2 0 -5 4 6 Improving Parallel Performance

Alpha-Beta Pruning My move— choose max 3 0 3 His move— choose min 3 4 0 My move— choose max 3 1 4 0 -5 His move— choose min 7 1 4 8 3 3 6 0 -5 Improving Parallel Performance

How Deep the Search? Improving Parallel Performance

Maximize Locality • Temporal locality: If a processor accesses a memory location, there is a good chance it will revisit that memory location soon • Data locality: If a processor accesses a memory location, there is a good chance it will visit a nearby location soon • Programs tend to exhibit locality because they tend to have loops indexing through arrays • Principle of locality makes cache memory worthwhile Improving Parallel Performance

Parallel Processing and Locality • Multiple processors  multiple caches • When a processor writes a value, the system must ensure no processor tries to reference an obsolete value (cache coherence problem) • A write by one processor can cause the invalidation of another processor’s cache line, leading to a cache miss • Rule of thumb: Better to have different processors manipulating totally different chunks of arrays • We say a parallel program has good locality if processors’ memory writes tend not to interfere with the work being done by other processors Improving Parallel Performance

Example: Array Initialization for (i = 0; i < N; i++) a[i] = 0; Terrible allocation of work to processors 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 Better allocation of work to processors... 0 0 0 0 0 0 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3 3 unless sub-arrays map to same cache lines! Improving Parallel Performance

Loop Transformations • Loop fission • Loop fusion • Loop inversion Improving Parallel Performance

Loop Fission • Begin with single loop having loop-carried dependence • Split loop into two or more loops • New loops can be executed in parallel Improving Parallel Performance

Perfectly parallel Before Loop Fission • float *a, *b; • int i; • for (i = 1; i < N; i++) { • if (b[i] > 0.0) a[i] = 2.0 * b[i]; • else a[i] = 2.0 * fabs(b[i]); • b[i] = a[i-1]; • } Loop-carried dependence Improving Parallel Performance

After Loop Fission • #pragma omp parallel • { • #pragma omp for • for (i = 1; i < N; i++) { • if (b[i] > 0.0) a[i] = 2.0 * b[i]; • else a[i] = 2.0 * fabs(b[i]); • } • #pragma omp for • for (i = 1; i < N; i++) { • b[i] = a[i-1]; • } • } This works because there is a barrier synchronization after a parallel for loop Improving Parallel Performance

Loop Fission and Locality • Another use of loop fission is to increase data locality • Before fission, nested loops reference too many data values, leading to poor cache hit rate • Break nested loops into multiple nested loops • New nested loops have higher cache hit rate Improving Parallel Performance

Before Fission • for (i = 0; i < list_len; i++) • for (j = prime[i]; j < N; j += prime[i]) • marked[j] = 1; marked Improving Parallel Performance

After Fission • for (k = 0; k < N; k += CHUNK_SIZE) • for (i = 0; i < list_len; i++) { • start = f(prime[i], k); • end = g(prime[i], k); • for (j = start; j < end; j += prime[i]) • marked[j] = 1; • } marked etc. Improving Parallel Performance

Loop Fusion • The opposite of loop fission • Combine loops increase grain size Improving Parallel Performance

Before Loop Fusion • float *a, *b, x, y; • int i; • ... • for (i = 0; i < N; i++) a[i] = foo(i); • x = a[N-1] – a[0]; • for (i = 0; i < N; i++) b[i] = bar(a[i]); • y = x * b[0] / b[N-1]; • Functions foo and bar are side-effect free. Improving Parallel Performance

After Loop Fusion • #pragma omp parallel for • for (i = 0; i < N; i++) { • a[i] = foo(i); • b[i] = bar(a[i]); • } • x = a[N-1] – a[0]; • y = x * b[0] / b[N-1]; • Now one barrier instead of two Improving Parallel Performance

Loop Inversion • Nested for loops may have data dependences that prevent parallelization • Inverting the nesting of for loops may • Expose a parallelizable loop • Increase grain size • Improve parallel program’s locality Improving Parallel Performance

Before Loop Inversion • for (j = 1; j < n; j++) • #pragma omp parallel for • for (i = 0; i < m; i++) • a[i][j] = 2 * a[i][j-1]; Can execute inner loop in parallel, but grain size small Improving Parallel Performance

After Loop Inversion • #pragma omp parallel for • for (i = 0; i < m; i++) • for (j = 1; j < n; j++) • a[i][j] = 2 * a[i][j-1]; Can execute outer loop in parallel Improving Parallel Performance

Reducing Parallel Overhead • Loop scheduling • Conditionally executing in parallel • Replicating work Improving Parallel Performance

Loop Scheduling • Loop schedule: how loop iterations are assigned to threads • Static schedule: iterations assigned to threads before execution of loop • Dynamic schedule: iterations assigned to threads during execution of loop Improving Parallel Performance

Loop Scheduling in OpenMP From Parallel Programming in OpenMP by Chandra et al. Improving Parallel Performance

Loop Scheduling Example • #pragma omp parallel for • for (i = 0; i < 12; i++) • for (j = 0; j <= i; j++) • a[i][j] = ...; Improving Parallel Performance

A B C D Improving Parallel Performance

Smaller Data Sets Locality v.Load Balance Locality Load Balance Larger Data Sets Improving Parallel Performance

Conditionally Enable Parallelism • Suppose sequential loop has execution time jn • Suppose barrier synchronization time is kp • We should make loop parallel only if • OpenMP’s if clause lets us conditionally enable parallelism Improving Parallel Performance

Example of if Clause • Suppose benchmarking shows a loop executes faster in parallel only when n > 1250 • #pragma omp parallel for if (n > 1250) • for (i = 0; i < n; i++) { • ... • } Improving Parallel Performance

Replicate Work • Every thread interaction has a cost • Example: Barrier synchronization • Sometimes it’s faster for threads to replicate work than to go through a barrier synchronization Improving Parallel Performance

Before Work Replication • for (i = 0; i < N; i++) a[i] = foo(i); • x = a[0] / a[N-1]; • for (i = 0; i < N; i++) b[i] = x * a[i]; • Both for loops are amenable to parallelization • Synchronization among threads required if x is shared and one thread performs assignment Improving Parallel Performance

After Work Replication • #pragma omp parallel private (x) • { • x = foo(0) / foo(N-1); • #pragma omp for • for (i = 0; i < N; i++) { • a[i] = foo(i); • b[i] = x * a[i]; • } • } Improving Parallel Performance

References • Rohit Chandra, Leonardo Dagum, Dave Kohr, Dror Maydan, Jeff McDonald, and Ramesh Menon, Parallel Programming in OpenMP, Morgan Kaufmann (2001). • Peter Denning, “The Locality Principle,” Naval Postgraduate School (2005). • Michael J. Quinn, Parallel Programming in C with MPI and OpenMP, McGraw-Hill (2004). Improving Parallel Performance

Improving Parallel Performance

Improving Parallel Performance

Improving Parallel Performance

Presentation Transcript

Improving Parallel I/O Performance with Data Layout Awareness

Parallel Performance

Improving the Performance of MPI Parallel I/O

Predicting Parallel Performance

Parallel Performance

Improving Parallel Performance

Improving Performance

Collective Buffering: Improving Parallel I/O Performance

Improving ACE Performance

Improving Performance

Improving performance

Performance: Parallel Each

Improving Student Performance

CMAQ Parallel Performance

Parallel Performance Diagnosis

Parallel Performance

Improving Ado.Net Performance

Parallel Performance Diagnosis

Parallel Performance