
Exploiting Parallelism

Presentation Transcript


1. Exploiting Parallelism
• We can exploit parallelism in two ways:
  • At the instruction level
    • Single Instruction Multiple Data (SIMD)
    • Make use of extensions to the ISA
  • At the core level
    • Rewrite the code to parallelize operations across many cores
    • Make use of extensions to the programming language

2. Exploiting Parallelism at the Instruction level (SIMD)
• Consider adding together two arrays:

    void array_add(int A[], int B[], int C[], int length) {
      int i;
      for (i = 0; i < length; ++i) {
        C[i] = A[i] + B[i];
      }
    }

• Operating on one element at a time

3. Exploiting Parallelism at the Instruction level (SIMD)
• Consider adding together two arrays:

    void array_add(int A[], int B[], int C[], int length) {
      int i;
      for (i = 0; i < length; ++i) {
        C[i] = A[i] + B[i];
      }
    }

• Operating on one element at a time

4. Exploiting Parallelism at the Instruction level (SIMD)
• Consider adding together two arrays:

    void array_add(int A[], int B[], int C[], int length) {
      int i;
      for (i = 0; i < length; ++i) {
        C[i] = A[i] + B[i];
      }
    }

• Operate on MULTIPLE elements at a time
• Single Instruction, Multiple Data (SIMD)

5. Intel SSE/SSE2 as an example of SIMD
• A packed add operates on four 32-bit lanes at once:

      [ 4.0 | 4.0 | 3.5 | -2.0 ]
    + [-1.5 | 2.0 | 1.7 |  2.3 ]
    = [ 2.5 | 6.0 | 5.2 |  0.3 ]

• Added new 128-bit registers (XMM0 – XMM7), each can store:
  • 4 single-precision FP values (SSE)    4 * 32b
  • 2 double-precision FP values (SSE2)   2 * 64b
  • 16 byte values (SSE2)                 16 * 8b
  • 8 word values (SSE2)                  8 * 16b
  • 4 double-word values (SSE2)           4 * 32b
  • 1 128-bit integer value (SSE2)        1 * 128b
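
As a concrete illustration (not in the original slides), here is roughly how array_add can be written with the SSE2 integer intrinsics from <emmintrin.h>; the intrinsics are standard, but the function name and structure here are ours:

    #include <emmintrin.h>   /* SSE2 intrinsics */

    void array_add_sse2(int A[], int B[], int C[], int length) {
      int i;
      /* Process four 32-bit ints (one 128-bit XMM register) per iteration. */
      for (i = 0; i + 4 <= length; i += 4) {
        __m128i a = _mm_loadu_si128((__m128i const *)&A[i]);
        __m128i b = _mm_loadu_si128((__m128i const *)&B[i]);
        _mm_storeu_si128((__m128i *)&C[i], _mm_add_epi32(a, b));
      }
      /* Scalar cleanup for the last (length mod 4) elements. */
      for (; i < length; ++i)
        C[i] = A[i] + B[i];
    }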

6. SIMD Extensions
• More than 70 instructions
• Arithmetic operations supported: addition, subtraction, multiplication, division, square root, maximum, minimum
• Can operate on floating-point or integer data

7. Is it always that easy?
• No, not always. Let's look at a slightly more challenging one:

    unsigned sum_array(unsigned *array, int length) {
      unsigned total = 0;
      for (int i = 0; i < length; ++i) {
        total += array[i];
      }
      return total;
    }

• Can we exploit SIMD-style parallelism?

8. We first need to restructure the code

    unsigned sum_array2(unsigned *array, int length) {
      unsigned total, i;
      unsigned temp[4] = {0, 0, 0, 0};
      for (i = 0; i < (length & ~0x3); i += 4) {   /* length & ~0x3 rounds length down to a multiple of 4 */
        temp[0] += array[i];
        temp[1] += array[i+1];
        temp[2] += array[i+2];
        temp[3] += array[i+3];
      }
      total = temp[0] + temp[1] + temp[2] + temp[3];
      for ( ; i < length; ++i) {                   /* handle the leftover elements */
        total += array[i];
      }
      return total;
    }

9. Then we can write SIMD code for the hot part

    unsigned sum_array2(unsigned *array, int length) {
      unsigned total, i;
      unsigned temp[4] = {0, 0, 0, 0};
      for (i = 0; i < (length & ~0x3); i += 4) {   /* the hot loop: four independent accumulators */
        temp[0] += array[i];
        temp[1] += array[i+1];
        temp[2] += array[i+2];
        temp[3] += array[i+3];
      }
      total = temp[0] + temp[1] + temp[2] + temp[3];
      for ( ; i < length; ++i) {
        total += array[i];
      }
      return total;
    }
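
For reference, here is a sketch of the hot loop written directly with SSE2 intrinsics; this is our own illustration of the idea, not the code shown in lecture:

    #include <emmintrin.h>   /* SSE2 intrinsics */

    unsigned sum_array_sse2(unsigned *array, int length) {
      __m128i vsum = _mm_setzero_si128();            /* four packed partial sums    */
      int i;
      for (i = 0; i + 4 <= length; i += 4) {
        __m128i v = _mm_loadu_si128((__m128i const *)&array[i]);
        vsum = _mm_add_epi32(vsum, v);               /* temp[0..3] += array[i..i+3] */
      }
      unsigned temp[4];
      _mm_storeu_si128((__m128i *)temp, vsum);
      unsigned total = temp[0] + temp[1] + temp[2] + temp[3];
      for (; i < length; ++i)                        /* leftover elements           */
        total += array[i];
      return total;
    }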

10. Exploiting Parallelism at the core level
• Consider the code from Midterm 3 (the valgrind problem):

    for (int i = 0; i < N; ++i) {
      c[i] = 0;
      for (int j = 0; j < N; ++j)
        c[i] += a[i][j] * b[j];
    }

• A parallel_for would split the iterations across multiple cores (fork/join), as in the sketch below
• If there is a choice, what should we parallelize? The inner loop? The outer? Both?
• How are N iterations split across c cores?
  [Figure: entry c[i] is computed from row i of a]
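
To make the fork/join idea concrete, here is an illustrative pthreads sketch (not from the lecture, which uses OpenMP on the next slide) that splits the N rows into contiguous blocks, one per thread; the sizes and names below are assumptions:

    #include <pthread.h>

    enum { N = 1024, NTHREADS = 8 };        /* illustrative sizes            */
    static int a[N][N], b[N], c[N];         /* same names as in the slide    */

    static void *row_worker(void *arg) {
      int t = (int)(long)arg;               /* thread index 0..NTHREADS-1    */
      int begin = t * N / NTHREADS;         /* this thread's block of rows   */
      int end   = (t + 1) * N / NTHREADS;
      for (int i = begin; i < end; ++i) {
        c[i] = 0;
        for (int j = 0; j < N; ++j)
          c[i] += a[i][j] * b[j];
      }
      return NULL;
    }

    void parallel_matvec(void) {
      pthread_t tid[NTHREADS];
      for (int t = 0; t < NTHREADS; ++t)    /* fork: one thread per block    */
        pthread_create(&tid[t], NULL, row_worker, (void *)(long)t);
      for (int t = 0; t < NTHREADS; ++t)    /* join: wait for every block    */
        pthread_join(tid[t], NULL);
    }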

11. OpenMP
• Many APIs exist or are being developed to exploit multiple cores
• OpenMP is easy to use, but has limitations

    #include <omp.h>

    #pragma omp parallel for
    for (int i = 0; i < N; ++i) {
      c[i] = 0;
      for (int j = 0; j < N; ++j)
        c[i] += a[i][j] * b[j];
    }

• Compile your code with: g++ -fopenmp mp6.cxx
• Google "OpenMP LLNL" for a handy reference page

12. Race conditions
• If we parallelize the inner loop, the code produces wrong answers:

    for (int i = 0; i < N; ++i) {
      c[i] = 0;
      #pragma omp parallel for
      for (int j = 0; j < N; ++j)
        c[i] += a[i][j] * b[j];
    }

• In each iteration, each thread does: load c[i], compute new c[i], update c[i]
• What if this happens?

      thread 1          thread 2
      load c[i]
                        load c[i]
      compute c[i]
                        compute c[i]
      update c[i]
                        update c[i]

  This is a data race: thread 2's update overwrites thread 1's, so one contribution to c[i] is lost.
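
For completeness (not shown in the slides), one way to keep the inner loop parallel and still get the right answer is OpenMP's atomic directive, which serializes each update of c[i]; it removes the race but usually costs more than it saves:

    for (int i = 0; i < N; ++i) {
      c[i] = 0;
      #pragma omp parallel for
      for (int j = 0; j < N; ++j) {
        int prod = a[i][j] * b[j];          /* the multiply can stay parallel */
        #pragma omp atomic                  /* protect the read-modify-write  */
        c[i] += prod;
      }
    }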

13. Another example
• Assume a[N][N] and b[N] are integer arrays as before:

    int total = 0;
    #pragma omp parallel for    /* plus a bit more! */
    for (int i = 0; i < N; ++i) {
      for (int j = 0; j < N; ++j)
        total += a[i][j] * b[j];
    }

• Which loop should we parallelize?
• Answer: the outer loop, if we can avoid the race condition on total
  • Each thread maintains a private copy of total (initialized to zero)
  • When the threads are done, the private copies are reduced into one
  • Works because + (like many other operations) is associative
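
The "plus a bit more" is presumably OpenMP's reduction clause, which implements exactly the private-copy-and-combine scheme described in the bullets above; a minimal sketch:

    int total = 0;
    #pragma omp parallel for reduction(+:total)
    for (int i = 0; i < N; ++i)             /* each thread accumulates into a  */
      for (int j = 0; j < N; ++j)           /* private total; the private      */
        total += a[i][j] * b[j];            /* copies are summed at the join   */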

14. Let's try the version with race conditions
• We're expecting a significant speedup on the 8-core EWS machines
  • Not quite an 8-times speedup, because of Amdahl's Law plus the overhead of forking and joining multiple threads
• But it's actually slower!! Why??
• Here's the mental picture that we have:
  [Figure: two processors sharing one memory; total is a shared variable in that memory]

15. This mental picture is wrong!
• We've forgotten about caches!
• The memory may be shared, but each processor has its own L1 cache
• As each processor updates total, the cache line holding it bounces back and forth between the L1 caches
• All this bouncing slows performance
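
If you instead avoid the race by giving each thread its own copy of total in an ordinary array, the copies sit next to each other on one cache line and that line still bounces (the false sharing mentioned in the summary). A common remedy, sketched here assuming 64-byte cache lines, is to pad each per-thread counter onto its own line; this sketch is ours, not from the lecture:

    #define NTHREADS   8
    #define CACHE_LINE 64                          /* typical x86 line size, in bytes   */

    struct padded_total {
      long value;                                  /* this thread's partial sum         */
      char pad[CACHE_LINE - sizeof(long)];         /* keep each counter on its own line */
    };

    static struct padded_total partial[NTHREADS];  /* thread t touches only partial[t]; */
                                                   /* the main thread sums the values   */
                                                   /* after the join                    */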

16. Summary
• Performance is of primary concern in some applications
  • Games, servers, mobile devices, supercomputers
• Many important applications have parallelism
  • Exploiting it is a good way to speed up programs
• Single Instruction Multiple Data (SIMD) does this at the ISA level
  • Registers hold multiple data items; instructions operate on all of them
  • Can achieve factor-of-2, 4, or 8 speedups on kernels
  • May require some restructuring of the code to expose the parallelism
• Exploiting core-level parallelism
  • Watch out for data races! (They make the code incorrect)
  • Watch out for false sharing! (The code may be correct, but slower!!)

17. Cachegrind comparison (N = 10,000)

                    Version 1        Version 2
    I refs:         9,803,358,171    9,803,328,171
    I1 misses:      1,146            1,148
    L2i misses:     1,140            1,142
    I1 miss rate:   0.00%            0.00%
    L2i miss rate:  0.00%            0.00%

    D refs:         4,702,690,591    4,602,670,591
    D1 misses:      18,956,794       15,844,477
    L2d misses:     12,528,911       12,528,910
    D1 miss rate:   0.403%           0.344%   (14.6% better)
    L2d miss rate:  0.2%             0.2%

    L2 refs:        18,957,940       15,845,625
    L2 misses:      12,530,051       12,530,052
    L2 miss rate:   0.0%             0.0%
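
Counts like these come from Valgrind's cachegrind tool; a typical invocation (the binary name mp6 is an assumption) is:

    valgrind --tool=cachegrind ./mp6
    cg_annotate cachegrind.out.<pid>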
