Exploiting Parallelism
• We can exploit parallelism in two ways:
  • At the instruction level
    • Single Instruction Multiple Data (SIMD)
    • Make use of extensions to the ISA
  • At the core level
    • Rewrite the code to parallelize operations across many cores
    • Make use of extensions to the programming language
Exploiting Parallelism at the Instruction level (SIMD)
• Consider adding together two arrays:

    void array_add(int A[], int B[], int C[], int length) {
      int i;
      for (i = 0; i < length; ++i) {
        C[i] = A[i] + B[i];
      }
    }

• This version operates on one element at a time.
Exploiting Parallelism at the Instruction level (SIMD)
• The same array_add loop, but now operating on MULTIPLE elements at a time:
  one instruction performs several of the C[i] = A[i] + B[i] additions at once.
• This is Single Instruction, Multiple Data (SIMD).
Intel SSE/SSE2 as an example of SIMD
• Added new 128-bit registers (XMM0 – XMM7); each can store:
  • 4 single-precision FP values (SSE)    4 * 32b
  • 2 double-precision FP values (SSE2)   2 * 64b
  • 16 byte values (SSE2)                 16 * 8b
  • 8 word values (SSE2)                  8 * 16b
  • 4 double-word values (SSE2)           4 * 32b
  • 1 128-bit integer value (SSE2)        1 * 128b
• [Figure: one packed single-precision add, lane by lane:
   (4.0, 4.0, 3.5, -2.0) + (-1.5, 2.0, 1.7, 2.3) = (2.5, 6.0, 5.2, 0.3)]
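A minimal sketch of that packed add written with Intel's SSE intrinsics. The __m128 type and the _mm_loadu_ps, _mm_add_ps and _mm_storeu_ps intrinsics are the standard ones from <xmmintrin.h>; the program itself is an illustration, not part of the original slides.

    #include <stdio.h>
    #include <xmmintrin.h>   /* SSE intrinsics: __m128, _mm_add_ps, ... */

    int main(void) {
        /* The four packed single-precision values from the figure above. */
        float a[4] = {  4.0f, 4.0f, 3.5f, -2.0f };
        float b[4] = { -1.5f, 2.0f, 1.7f,  2.3f };
        float c[4];

        __m128 va = _mm_loadu_ps(a);       /* load 4 floats into one XMM register */
        __m128 vb = _mm_loadu_ps(b);
        __m128 vc = _mm_add_ps(va, vb);    /* one instruction adds all 4 lanes */
        _mm_storeu_ps(c, vc);

        printf("%.1f %.1f %.1f %.1f\n", c[0], c[1], c[2], c[3]);  /* 2.5 6.0 5.2 0.3 */
        return 0;
    }

On an x86-64 machine this compiles as-is (SSE is part of the base ISA) and prints the four sums shown in the figure.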
SIMD Extensions
• More than 70 instructions.
• Arithmetic operations supported: addition, subtraction, multiplication,
  division, square root, maximum, minimum.
• Can operate on floating-point or integer data.
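As a concrete example of using those instructions on integer data, here is one way array_add could be vectorized with SSE2 intrinsics. The intrinsic names (_mm_loadu_si128, _mm_add_epi32, _mm_storeu_si128) are standard, but this particular function is only a sketch, not the course's official solution.

    #include <emmintrin.h>   /* SSE2 intrinsics */

    /* Sketch: each _mm_add_epi32 adds four 32-bit integers at once. */
    void array_add_simd(int A[], int B[], int C[], int length) {
        int i;
        for (i = 0; i + 4 <= length; i += 4) {
            __m128i va = _mm_loadu_si128((__m128i *)&A[i]);   /* 4 ints from A */
            __m128i vb = _mm_loadu_si128((__m128i *)&B[i]);   /* 4 ints from B */
            _mm_storeu_si128((__m128i *)&C[i], _mm_add_epi32(va, vb));
        }
        for (; i < length; ++i)      /* handle the last (length mod 4) elements */
            C[i] = A[i] + B[i];
    }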
Is it always that easy?
• No, not always. Let's look at a slightly more challenging example:

    unsigned sum_array(unsigned *array, int length) {
      int total = 0;
      for (int i = 0; i < length; ++i) {
        total += array[i];
      }
      return total;
    }

• Can we exploit SIMD-style parallelism?
We first need to restructure the code

    unsigned sum_array2(unsigned *array, int length) {
      unsigned total, i;
      unsigned temp[4] = {0, 0, 0, 0};

      /* Process four elements per iteration, keeping four partial sums. */
      for (i = 0; i < (length & ~0x3); i += 4) {
        temp[0] += array[i];
        temp[1] += array[i+1];
        temp[2] += array[i+2];
        temp[3] += array[i+3];
      }

      /* Combine the partial sums, then handle any leftover elements. */
      total = temp[0] + temp[1] + temp[2] + temp[3];
      for ( ; i < length; ++i) {
        total += array[i];
      }
      return total;
    }
Then we can write SIMD code for the hot part

    unsigned sum_array2(unsigned *array, int length) {
      unsigned total, i;
      unsigned temp[4] = {0, 0, 0, 0};

      /* Hot loop: four independent running sums, exactly what one
         128-bit SSE register can hold, so this is the part to vectorize. */
      for (i = 0; i < (length & ~0x3); i += 4) {
        temp[0] += array[i];
        temp[1] += array[i+1];
        temp[2] += array[i+2];
        temp[3] += array[i+3];
      }

      total = temp[0] + temp[1] + temp[2] + temp[3];
      for ( ; i < length; ++i) {
        total += array[i];
      }
      return total;
    }
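For reference, a hedged sketch of what that SIMD version might look like with SSE2 intrinsics (an illustration, not the official solution): the four partial sums in temp[] become the four 32-bit lanes of a single XMM register.

    #include <emmintrin.h>   /* SSE2 intrinsics */

    unsigned sum_array_simd(unsigned *array, int length) {
        unsigned temp[4];
        unsigned total;
        int i;
        __m128i vsum = _mm_setzero_si128();                     /* {0, 0, 0, 0} */

        for (i = 0; i < (length & ~0x3); i += 4) {
            __m128i v = _mm_loadu_si128((__m128i *)&array[i]);  /* next 4 values */
            vsum = _mm_add_epi32(vsum, v);                      /* 4 running sums */
        }

        _mm_storeu_si128((__m128i *)temp, vsum);                /* spill the lanes */
        total = temp[0] + temp[1] + temp[2] + temp[3];

        for (; i < length; ++i)                                 /* leftover elements */
            total += array[i];
        return total;
    }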
Exploiting Parallelism at the core level
• Consider the code from Midterm 3 (valgrind):

    for (int i = 0; i < N; ++i) {
      c[i] = 0;
      for (int j = 0; j < N; ++j)
        c[i] += a[i][j] * b[j];
    }

• Idea: a parallel_for that splits the iterations across multiple cores (fork/join).
• If there is a choice, what should we parallelize? The inner loop? The outer? Both?
• How are N iterations split across c cores?
• [Figure: each c[i] is computed from row i of a and all of b]
OpenMP
• Many APIs exist or are being developed to exploit multiple cores.
• OpenMP is easy to use, but has limitations.

    #include <omp.h>

    #pragma omp parallel for
    for (int i = 0; i < N; ++i) {
      c[i] = 0;
      for (int j = 0; j < N; ++j)
        c[i] += a[i][j] * b[j];
    }

• Compile your code with: g++ -fopenmp mp6.cxx
• Google for "OpenMP LLNL" for a handy reference page.
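Before comparing timings, it can help to check how many threads OpenMP actually launches. The following is a minimal sketch, not part of the slides; omp_get_max_threads, omp_get_thread_num and omp_get_num_threads are standard OpenMP library calls, and the thread count can be overridden at run time with the OMP_NUM_THREADS environment variable.

    #include <omp.h>
    #include <stdio.h>

    int main(void) {
        /* How many threads will a parallel region use by default?
           (Override at run time with, e.g., OMP_NUM_THREADS=4 ./a.out) */
        printf("max threads: %d\n", omp_get_max_threads());

        #pragma omp parallel
        {
            /* Each thread runs this block; the print order is nondeterministic. */
            printf("hello from thread %d of %d\n",
                   omp_get_thread_num(), omp_get_num_threads());
        }
        return 0;
    }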
Race conditions
• If we parallelize the inner loop, the code produces wrong answers:

    for (int i = 0; i < N; ++i) {
      c[i] = 0;
      #pragma omp parallel for
      for (int j = 0; j < N; ++j)
        c[i] += a[i][j] * b[j];
    }

• In each iteration, each thread does: load c[i], compute the new c[i], update c[i].
• What if this happens?

    thread 1          thread 2
    load c[i]
                      load c[i]
    compute c[i]
                      compute c[i]
    update c[i]
                      update c[i]

• Thread 2's update overwrites thread 1's contribution: a data race.
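For completeness, one way (not shown on the slide) to make this inner-loop version correct is to protect the shared update with OpenMP's atomic directive. The function below is a hypothetical sketch whose pointer arguments stand in for the slide's a[N][N], b[N] and c[N]; it gives the right answer, but every thread now serializes on c[i], which is part of why the following slides look for a better approach.

    #include <omp.h>

    void matvec_inner_atomic(int **a, int *b, int *c, int N) {
        for (int i = 0; i < N; ++i) {
            c[i] = 0;
            #pragma omp parallel for
            for (int j = 0; j < N; ++j) {
                int prod = a[i][j] * b[j];
                #pragma omp atomic          /* make the shared update safe */
                c[i] += prod;
            }
        }
    }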
Another example
• Assume a[N][N] and b[N] are integer arrays as before:

    int total = 0;
    #pragma omp parallel for     // plus a bit more!
    for (int i = 0; i < N; ++i) {
      for (int j = 0; j < N; ++j)
        total += a[i][j] * b[j];
    }

• Which loop should we parallelize?
• Answer: the outer loop, if we can avoid the race condition on total (see the sketch below):
  • Each thread maintains a private copy of total (initialized to zero).
  • When the threads are done, the private copies are reduced into one.
  • This works because + (and many other operations) is associative.
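The "plus a bit more" on the pragma is OpenMP's reduction clause, which implements exactly the private-copy-then-combine scheme described above. A sketch, with pointer arguments standing in for the slide's arrays:

    #include <omp.h>

    /* Each thread gets a private copy of total; the copies are combined
       with + when the threads join at the end of the parallel region. */
    int sum_matvec(int **a, int *b, int N) {
        int total = 0;
        #pragma omp parallel for reduction(+: total)
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; ++j)
                total += a[i][j] * b[j];
        return total;
    }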
Let's try the version with race conditions
• We're expecting a significant speedup on the 8-core EWS machines:
  • Not quite an 8-times speedup, because of Amdahl's Law plus the overhead of
    forking and joining multiple threads.
• But it's actually slower!! Why??
• Here's the mental picture that we have: two processors sharing one memory,
  with total as a shared variable in that memory.
This mental picture is wrong!
• We've forgotten about caches!
  • The memory may be shared, but each processor has its own L1 cache.
• As each processor updates total, the cache line holding it bounces between
  the L1 caches, and all that bouncing slows performance (see the sketch below).
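A hedged sketch (not from the slides) of the difference, using a simple array sum: in sum_shared every thread repeatedly updates the one shared total, so the cache line holding it bounces between the cores' L1 caches; in sum_local each thread accumulates into a private variable and touches the shared line only once per thread. A similar slowdown, false sharing, occurs even when each thread has its own counter if those counters happen to sit on the same cache line.

    #include <omp.h>

    /* Correct but slow: every add hits the shared cache line holding total. */
    long sum_shared(const int *array, int length) {
        long total = 0;
        #pragma omp parallel for
        for (int i = 0; i < length; ++i) {
            #pragma omp atomic
            total += array[i];
        }
        return total;
    }

    /* Correct and faster: each thread sums into a private local, then the
       shared total is updated exactly once per thread. */
    long sum_local(const int *array, int length) {
        long total = 0;
        #pragma omp parallel
        {
            long local = 0;                  /* private to this thread */
            #pragma omp for
            for (int i = 0; i < length; ++i)
                local += array[i];
            #pragma omp atomic
            total += local;
        }
        return total;
    }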
Summary
• Performance is of primary concern in some applications:
  • Games, servers, mobile devices, supercomputers.
• Many important applications have parallelism; exploiting it is a good way to
  speed up programs.
• Single Instruction, Multiple Data (SIMD) does this at the ISA level:
  • Registers hold multiple data items; instructions operate on all of them at once.
  • Can achieve factor-of-2, 4, or 8 speedups on kernels.
  • May require some restructuring of the code to expose the parallelism.
• Exploiting core-level parallelism:
  • Watch out for data races! (They make code incorrect.)
  • Watch out for false sharing! (The code may be correct, but slower!!)
Cachegrind comparison (N = 10,000)

                        Version 1        Version 2
    I   refs:       9,803,358,171    9,803,328,171
    I1  misses:             1,146            1,148
    L2i misses:             1,140            1,142
    I1  miss rate:          0.00%            0.00%
    L2i miss rate:          0.00%            0.00%

    D   refs:       4,702,690,591    4,602,670,591
    D1  misses:        18,956,794       15,844,477
    L2d misses:        12,528,911       12,528,910
    D1  miss rate:         0.403%           0.344%   <-- 14.6% better
    L2d miss rate:           0.2%             0.2%

    L2 refs:           18,957,940       15,845,625
    L2 misses:         12,530,051       12,530,052
    L2 miss rate:            0.0%             0.0%