1 / 37

Improving Parallel Performance

Introduction to Parallel Programming – Part 7. Improving Parallel Performance. Intel Software College. Objectives. At the end of this module, you should be able to Give two reasons why one sequential algorithm may more suitable than another for parallelization

minna
Download Presentation

Improving Parallel Performance

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Introduction to Parallel Programming – Part 7 Improving Parallel Performance Intel Software College

  2. Objectives • At the end of this module, you should be able to • Give two reasons why one sequential algorithm may more suitable than another for parallelization • Use loop fusion, loop fission, and loop inversion to create or improve opportunities for parallel execution • Explain the pros and cons of static versus dynamic loop scheduling • Explain why it can be difficult both to optimize load balancing and maximize locality Improving Parallel Performance

  3. General Rules of Thumb • Start with best sequential algorithm • Maximize locality Improving Parallel Performance

  4. Start with Best Sequential Algorithm • Don’t confuse “speedup” with “speed” • Speedup: ratio of program’s execution time on 1 processor to its execution time on p processors • What if start with inferior sequential algorithm? • Naïve, higher complexity algorithms • Easier to make parallel • Usually don’t lead to fastest parallel algorithm Improving Parallel Performance

  5. Example: Search for Chess Move • Naïve minimax algorithm • Exhaustive search of game tree • Branching factor around 35 • Nodes evaluated in search of depth d: 35d • Alpha-beta search algorithm • Prunes useless subtrees • Branching factor around 6 • Nodes evaluated in search of depth d: 6d Improving Parallel Performance

  6. Minimax Search My move— choose max 3 0 3 His move— choose min 3 4 0 6 My move— choose max 3 1 4 1 0 -5 -2 6 His move— choose min 7 1 4 1 8 3 -2 7 3 5 6 2 0 -5 4 6 Improving Parallel Performance

  7. Alpha-Beta Pruning My move— choose max 3 0 3 His move— choose min 3 4 0 My move— choose max 3 1 4 0 -5 His move— choose min 7 1 4 8 3 3 6 0 -5 Improving Parallel Performance

  8. How Deep the Search? Improving Parallel Performance

  9. Maximize Locality • Temporal locality: If a processor accesses a memory location, there is a good chance it will revisit that memory location soon • Data locality: If a processor accesses a memory location, there is a good chance it will visit a nearby location soon • Programs tend to exhibit locality because they tend to have loops indexing through arrays • Principle of locality makes cache memory worthwhile Improving Parallel Performance

  10. Parallel Processing and Locality • Multiple processors  multiple caches • When a processor writes a value, the system must ensure no processor tries to reference an obsolete value (cache coherence problem) • A write by one processor can cause the invalidation of another processor’s cache line, leading to a cache miss • Rule of thumb: Better to have different processors manipulating totally different chunks of arrays • We say a parallel program has good locality if processors’ memory writes tend not to interfere with the work being done by other processors Improving Parallel Performance

  11. Example: Array Initialization for (i = 0; i < N; i++) a[i] = 0; Terrible allocation of work to processors 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 Better allocation of work to processors... 0 0 0 0 0 0 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3 3 unless sub-arrays map to same cache lines! Improving Parallel Performance

  12. Loop Transformations • Loop fission • Loop fusion • Loop inversion Improving Parallel Performance

  13. Loop Fission • Begin with single loop having loop-carried dependence • Split loop into two or more loops • New loops can be executed in parallel Improving Parallel Performance

  14. Perfectly parallel Before Loop Fission • float *a, *b; • int i; • for (i = 1; i < N; i++) { • if (b[i] > 0.0) a[i] = 2.0 * b[i]; • else a[i] = 2.0 * fabs(b[i]); • b[i] = a[i-1]; • } Loop-carried dependence Improving Parallel Performance

  15. After Loop Fission • #pragma omp parallel • { • #pragma omp for • for (i = 1; i < N; i++) { • if (b[i] > 0.0) a[i] = 2.0 * b[i]; • else a[i] = 2.0 * fabs(b[i]); • } • #pragma omp for • for (i = 1; i < N; i++) { • b[i] = a[i-1]; • } • } This works because there is a barrier synchronization after a parallel for loop Improving Parallel Performance

  16. Loop Fission and Locality • Another use of loop fission is to increase data locality • Before fission, nested loops reference too many data values, leading to poor cache hit rate • Break nested loops into multiple nested loops • New nested loops have higher cache hit rate Improving Parallel Performance

  17. Before Fission • for (i = 0; i < list_len; i++) • for (j = prime[i]; j < N; j += prime[i]) • marked[j] = 1; marked Improving Parallel Performance

  18. After Fission • for (k = 0; k < N; k += CHUNK_SIZE) • for (i = 0; i < list_len; i++) { • start = f(prime[i], k); • end = g(prime[i], k); • for (j = start; j < end; j += prime[i]) • marked[j] = 1; • } marked etc. Improving Parallel Performance

  19. Loop Fusion • The opposite of loop fission • Combine loops increase grain size Improving Parallel Performance

  20. Before Loop Fusion • float *a, *b, x, y; • int i; • ... • for (i = 0; i < N; i++) a[i] = foo(i); • x = a[N-1] – a[0]; • for (i = 0; i < N; i++) b[i] = bar(a[i]); • y = x * b[0] / b[N-1]; • Functions foo and bar are side-effect free. Improving Parallel Performance

  21. After Loop Fusion • #pragma omp parallel for • for (i = 0; i < N; i++) { • a[i] = foo(i); • b[i] = bar(a[i]); • } • x = a[N-1] – a[0]; • y = x * b[0] / b[N-1]; • Now one barrier instead of two Improving Parallel Performance

  22. Loop Inversion • Nested for loops may have data dependences that prevent parallelization • Inverting the nesting of for loops may • Expose a parallelizable loop • Increase grain size • Improve parallel program’s locality Improving Parallel Performance

  23. Before Loop Inversion • for (j = 1; j < n; j++) • #pragma omp parallel for • for (i = 0; i < m; i++) • a[i][j] = 2 * a[i][j-1]; Can execute inner loop in parallel, but grain size small Improving Parallel Performance

  24. After Loop Inversion • #pragma omp parallel for • for (i = 0; i < m; i++) • for (j = 1; j < n; j++) • a[i][j] = 2 * a[i][j-1]; Can execute outer loop in parallel Improving Parallel Performance

  25. Reducing Parallel Overhead • Loop scheduling • Conditionally executing in parallel • Replicating work Improving Parallel Performance

  26. Loop Scheduling • Loop schedule: how loop iterations are assigned to threads • Static schedule: iterations assigned to threads before execution of loop • Dynamic schedule: iterations assigned to threads during execution of loop Improving Parallel Performance

  27. Loop Scheduling in OpenMP From Parallel Programming in OpenMP by Chandra et al. Improving Parallel Performance

  28. Loop Scheduling Example • #pragma omp parallel for • for (i = 0; i < 12; i++) • for (j = 0; j <= i; j++) • a[i][j] = ...; Improving Parallel Performance

  29. A B C D Improving Parallel Performance

  30. Smaller Data Sets Locality v.Load Balance Locality Load Balance Larger Data Sets Improving Parallel Performance

  31. Conditionally Enable Parallelism • Suppose sequential loop has execution time jn • Suppose barrier synchronization time is kp • We should make loop parallel only if • OpenMP’s if clause lets us conditionally enable parallelism Improving Parallel Performance

  32. Example of if Clause • Suppose benchmarking shows a loop executes faster in parallel only when n > 1250 • #pragma omp parallel for if (n > 1250) • for (i = 0; i < n; i++) { • ... • } Improving Parallel Performance

  33. Replicate Work • Every thread interaction has a cost • Example: Barrier synchronization • Sometimes it’s faster for threads to replicate work than to go through a barrier synchronization Improving Parallel Performance

  34. Before Work Replication • for (i = 0; i < N; i++) a[i] = foo(i); • x = a[0] / a[N-1]; • for (i = 0; i < N; i++) b[i] = x * a[i]; • Both for loops are amenable to parallelization • Synchronization among threads required if x is shared and one thread performs assignment Improving Parallel Performance

  35. After Work Replication • #pragma omp parallel private (x) • { • x = foo(0) / foo(N-1); • #pragma omp for • for (i = 0; i < N; i++) { • a[i] = foo(i); • b[i] = x * a[i]; • } • } Improving Parallel Performance

  36. References • Rohit Chandra, Leonardo Dagum, Dave Kohr, Dror Maydan, Jeff McDonald, and Ramesh Menon, Parallel Programming in OpenMP, Morgan Kaufmann (2001). • Peter Denning, “The Locality Principle,” Naval Postgraduate School (2005). • Michael J. Quinn, Parallel Programming in C with MPI and OpenMP, McGraw-Hill (2004). Improving Parallel Performance

  37. Improving Parallel Performance

More Related