Introduction to OpenMP: Directive-Based Programming for Multi-threaded Parallelism

Introduction to OpenMP For a more detailed tutorial see: http://www.openmp.org Look at the presentations also see: https://computing.llnl.gov/tutorials/openMP/

Concepts • An Application Program Interface (API) that may be used to explicitly direct multi-threaded, shared memory parallelism • Directive based programming • declare properties of language structures • sections of code, loops • scope variables • A few service routines • get information • Compiler options • Environment variables

Open MP Programming Model • Directive • #pragma omp directive [clause list] • C$OMP construct [clause [clause] … ] • Program executes serially until it encounters a parallel directive • #pragma omp parallel [clause list] • /* structured block of code */ • Clause list is used to specify conditions • Conditional parallelism - if (cond) • Degree of concurrency - num_threads(int) • Data Handling - • private(vlist), firstprivate(vlist), shared(vlist)

OpenMP Programming Model • fork-join parallelism • Master thread spawns a team of threads as needed.

Typical OpenMP Use • Generally used to parallelize loops • Find most time consuming loops • Split iterations up between threads void main() { double Res[1000]; #pragma omp parallel for for(int i=0;i<1000;i++) { do_huge_comp(Res[i]); } } void main() { double Res[1000]; for(int i=0;i<1000;i++) { do_huge_comp(Res[i]); } }

Thread Interaction • OpenMP operates using shared memory • Threads communicate via shared variables • Unintended sharing can lead to race conditions • output changes due to thread scheduling • Control race conditions using synchronization • synchronization is expensive • change the way data is stored to minimize the need for synchronization

Syntax format • Compiler directives • C/C++ • #pragma omp construct [clause [clause] …] • Fortran • C$OMP construct [clause [clause] … ] • !$OMP construct [clause [clause] … ] • *$OMP construct [clause [clause] … ] • Since we use directives, no changes need to be made to a program for a compiler that doesn’t support OpenMP

Using OpenMP • Some compilers can automatically place directives with option • -qsmp=auto (IBM xlc) • some loops may speed up, some may slow down • Compiler option required when you use directives • icc -openmp (Linux – Intel compiler) • gcc -fopenmp (gcc versions >= 4.2) • Scoping variables is the hard part! • shared variables, thread private variables

OpenMP Example #include <stdio.h> #include <omp.h> main () { int nthreads, tid; /* Fork a team of threads giving them their own copies of variables */ #pragma omp parallel private(tid) { /* Obtain and print thread id */ tid = omp_get_thread_num(); printf("Hello World from thread = %d\n", tid); /* Only master thread does this */ if (tid == 0) { nthreads = omp_get_num_threads(); printf("Number of threads = %d\n", nthreads); } } /* All threads join master thread and terminate */ } [snell@atropos openmp]$ gcc-4.2 -fopenmp hello.c -o hello [snell@atropos openmp]$ ./hello Hello World from thread = 0 Hello World from thread = 1 Number of threads = 2 [snell@atropos openmp]$

OpenMP Directives • 5 categories • Parallel Regions • Worksharing • Data Environment • Synchronization • Runtime functions / environment variables • Basically the same between C/C++ and Fortran

Parallel Regions • Create threads with omp parallel • Threads share A (default behavior) • Master thread creates the threads • Threads all start at same time then synchronize at a barrier at the end to continue with code. double A[1000] omp_set_num_threads(4); #pragma omp parallel { int ID = omp_get_thread_num(); dosomething(ID, A); }

Parallel Regions #pragma omp parallel [clause ...] newline structured_block Clauses if (scalar_expression) private (list) shared (list) default (shared | none) firstprivate (list) reduction (operator: list) copyin (list) num_threads (integer-expression)

Parallel Regions • The number of threads in a parallel region is determined by the following factors, in order of precedence: • 1. Evaluation of the IF clause • 2. Setting of the NUM_THREADS clause • 3. Use of the omp_set_num_threads() library function • 4. Setting of the OMP_NUM_THREADS environment variable • 5. Implementation default - usually the number of CPUs on a node, though it could be dynamic. • Threads are numbered from 0 (master thread) to N-1

Sections construct • The sections construct gives a different structured block to each thread • By default there is a barrier at the end. Use the nowait clause to turn off. #pragma omp parallel #pragma omp sections { X_calculation(); #pragma omp section y_calculation(); #pragma omp section z_calculation(); }

Work-sharing constructs • the for construct splits up loop iterations • By default, there is a barrier at the end of the “omp for”. Use the “nowait” clause to turn off the barrier. #pragma omp parallel #pragma omp for for (I=0;I<N;I++) { NEAT_STUFF(I); }

Short-hand notation • Can combine parallel and work sharing constructs • There is also a “parallel sections” construct #pragma omp parallel for for (I=0;I<N;I++){ NEAT_STUFF(I); }

A Rule • In order to be made parallel, a loop must have canonical “shape” index++; ++index; index--; --index; index += inc; index -= inc; index = index + inc; index = inc + index; index = index – inc; < <= >= > for (index=start; index end; )

An example #pragma omp parallel for private(j) for (i = 0; i < BLOCK_SIZE(id,p,n); i++) for (j = 0; j < n; j++) a[i][j] = MIN(a[i][j], a[i][k] + tmp[j]) By definition, private variable values are undefined at loop entry and exit To change this behavior, you can use the firstprivate(var) and lastprivate(var) clauses x[0] = complex_function(); #pragma omp parallel for private(j) firstprivate(x) for (i = 0; i < n; i++) for (j = 0; j < m; j++) x[j] = g(i, x[j-1]); answer[i] = x[j] – x[i];

Scheduling Iterations • The schedule clause effects how loop iterations are mapped onto threads • schedule(static [,chunk]) • Deal-out blocks of iterations of size “chunk” to each thread. • schedule(dynamic[,chunk]) • Each thread grabs “chunk” iterations off a queue until all iterations have been handled. • schedule(guided[,chunk]) • Threads dynamically grab blocks of iterations. The size of he block starts large and shrinks down to size “chunk” as the calculation proceeds. • schedule(runtime) • Schedule and chunk size taken from the OMP_SCHEDULE environment variable.

An example #pragma omp parallel for private(j) schedule(static, 2) for (i = 0; i < n; i++) for (j = 0; j < m; j++) x[j][j] = g(i, x[j-1]); You can play with the chunk size to meet load balancing issues, etc.

Synchronization Directives • BARRIER • inside PARALLEL, all threads synchronize • CRITICAL (lock) / END CRITICAL (lock) • section that can be executed by one thread only • lock is optional name to distinguish several critical constructs from each other

An example double area, pi, x; int i, n; area = 0.0; #pragma omp parallel for private(x) for (i = 0; i < n; i++) { x = (i + 0.5)/n; #pragma omp critical area += 4.0/(1.0 + x*x); } pi = area / n;

Reductions • Sometimes you want each thread to calculate part of a value then collapse all that into a single value • Done with reduction clause area = 0.0; #pragma omp parallel for private(x) reduction (+:area) for (i = 0; i < n; i++) { x = (i + 0.5)/n; area += 4.0/(1.0 + x*x); } pi = area / n;

OpenMP Issues Each thread needs different random number seeds count is shared we need the aggregate Another Example /* A Monte Carlo algorithm for calculating pi */ int count; /* points inside the unit quarter circle */ unsigned short xi[3]; /* random number seed */ int i; /* loop index */ int samples; /* Number of points to generate */ double x,y; /* Coordinates of points */ double pi; /* Estimate of pi */ xi[0] = 1; /* These statements set up the random seed */ xi[1] = 1; xi[2] = 0; count = 0; for (i = 0; i < samples; i++) { x = erand48(xi); y = erand48(xi); if (x*x + y*y <= 1.0) count++; } pi = 4.0 * count / samples; printf(“Estimate of pi: %7.5f\n”, pi);

OpenMP Version /* A Monte Carlo algorithm for calculating pi */ int count; /* points inside the unit quarter circle */ unsigned short xi[3]; /* random number seed */ int i; /* loop index */ int samples; /* Number of points to generate */ double x,y; /* Coordinates of points */ double pi; /* Estimate of pi */ omp_set_num_threads(omp_get_num_procs()); xi[0] = 1; xi[1] = 1; xi[2] = omp_get_thread_num(); count = 0; #pragma omp parallel for firstprivate(xi) private(x,y) reduction(+:count) for (i = 0; i < samples; i++) { x = erand48(xi); y = erand48(xi); if (x*x + y*y <= 1.0) count++; } pi = 4.0 * count / samples; printf(“Estimate of pi: %7.5f\n”, pi); What is wrong with this?

#include <stdio.h> #include <stdlib.h> #include <omp.h> main(int argc, char *argv[]) { /* A Monte Carlo algorithm for calculating pi */ int count; /* points inside the unit quarter circle */ int i; /* loop index */ int samples; /* Number of points to generate */ double x,y; /* Coordinates of points */ double pi; /* Estimate of pi */ samples = atoi(argv[1]); #pragma omp parallel { unsigned short xi[3]; /* random number seed */ xi[0] = 1; /* These statements set up the random seed */ xi[1] = 1; xi[2] = omp_get_thread_num(); count = 0; printf("I am thread %d\n", xi[2]); #pragma omp for firstprivate(xi) private(x,y) reduction(+:count) for (i = 0; i < samples; i++) { x = erand48(xi); y = erand48(xi); if (x*x + y*y <= 1.0) count++; } } pi = 4.0 * (double)count / (double)samples; printf("Count = %d, Samples = %d, Estimate of pi: %7.5f\n", count, samples, pi); } Corrected Version [snell@atropos openmp]$ time ./montecarlopi.single 50000000 I am thread 0 Count = 39268604, Samples = 50000000, Estimate of pi: 3.14149 real 0m4.628s user 0m4.568s sys 0m0.030s [snell@atropos openmp]$ time ./montecarlopi 50000000 I am thread 1 I am thread 0 Count = 39272420, Samples = 50000000, Estimate of pi: 3.14179 real 0m2.480s user 0m4.620s sys 0m0.030s [snell@atropos openmp]$

An alternate version … #pragma omp parallel private(xi, t, x, y, local_count) { xi[0] = 1; xi[1] = 1; xi[2] = tid = omp_get_thread_num(); t = omp_get_num_threads(); local_count = 0; for (i = tid; i < samples; i += t) { x = erand48(xi); y = erand48(xi); if (x*x + y*y <= 1.0) local_count++; } #pragma omp critical count += local_count; } pi = 4.0 * count / samples; printf(“Estimate of pi: %7.5f\n”, pi); } [snell@atropos openmp]$ time ./montecarlopi2 50000000 I am thread 0 I am thread 1 1: local_count is 16461003 1: local_count is 16392925 Count = 32853927, Samples = 50000000, Estimate of pi: 2.62831 real 0m5.053s user 0m9.697s sys 0m0.053s [snell@atropos openmp]$ Problems!

Corrected Version /* A Monte Carlo algorithm for calculating pi */ int count; /* points inside the unit quarter circle */ int local_count; /* points inside the unit quarter circle */ int t, tid; int i; /* loop index */ int samples; /* Number of points to generate */ double x,y; /* Coordinates of points */ double pi; /* Estimate of pi */ samples = atoi(argv[1]); #pragma omp parallel private(i,t,x,y,local_count) reduction(+:count) { unsigned short xi[3]; /* random number seed */ xi[0] = 1; /* These statements set up the random seed */ xi[1] = 1; xi[2] = tid = omp_get_thread_num(); t = omp_get_num_threads(); count = 0; //printf("I am thread %d\n", xi[2]); for (i = tid; i < samples; i+=t) { x = erand48(xi); y = erand48(xi); if (x*x + y*y <= 1.0) count++; } //printf("%d: count is %d\n", tid, count); } pi = 4.0 * (double)count / (double)samples; printf("Count = %d, Samples = %d, Estimate of pi: %7.5f\n", count, samples, pi); • Corrections: • i should be private! • random number seed array(xi) should also • be private • Use reduction [snell@atropos openmp]$ time ./montecarlopi3 50000000 Count = 39269476, Samples = 50000000, Estimate of pi: 3.14156 real 0m2.321s user 0m4.559s sys 0m0.014s

Serial Directives • MASTER / END MASTER • executed by master thread only • DO SERIAL / END DO SERIAL • loop immediately following should not be parallelized • useful with -qsmp=omp:auto • SINGLE • only one thread executes the block

Example Serial Execution /* A Monte Carlo algorithm for calculating pi */ … omp_set_num_threads(omp_get_num_procs()); xi[0] = 1; xi[1] = 1; xi[2] = omp_get_thread_num(); count = 0; #pragma omp parallel for firstprivate(xi) private(x,y) reduction(+:count) for (i = 0; i < samples; i++) { x = erand48(xi); y = erand48(xi); if (x*x + y*y <= 1.0) count++; #pragma omp single { printf(“Loop Iteration: %d\n”, i); } } pi = 4.0 * count / samples; printf(“Estimate of pi: %7.5f\n”, pi);

Conditional Execution • Overhead of fork/join is high • If a loop is small, you don’t want to parallellize • But, you may not know how big until runtime • Conditional clause for parallel execution • if ( expression ) area = 0.0; #pragma omp parallel for private(x) reduction (+:area) if (n > 5000) for (i = 0; i < n; i++) { x = (i + 0.5)/n; area += 4.0/(1.0 + x*x); } pi = area / n;

Scope Rules • Shared memory programming model • most variables are shared by default • Global variables are shared • But not everything is shared • loop index variables • stack variables in called functions from parallel region • variable set and then used in for-loop is PRIVATE • array whose subscript is constant w.r.t. PARALLEL for-loop and is set and then used within the for-loop is PRIVATE

Scope Clauses • for/DO directive has extra clauses, the most important • PRIVATE (variable list) • REDUCTION (op: variable list) • op is sum, min, max • variable is scalar, XLF allows array

Scope Clauses (2) • PARALLEL and PARALELL for/DO and PARALLEL SECTIONS have also • DEFAULT (variable list) • scope determined by rules • SHARED (variable list) • IF (scalar logical expression) • directives are like programming language extension, not compiler option

integer i,j,n real*8 a(n,n), b(n) read (1) b !$OMP PARALLEL DO !$OMP PRIVATE (i,j) SHARED (a,b,n) do j=1,n do i=1,n a(i,j) = sqrt(1.d0 + b(j)*i) end do end do !$OMP END PARALLEL DO

Matrix Multiply !$OMP PARALLEL DO PRIVATE(i,j,k) do j=1,n do i=1,n do k=1,n c(i,j) = c(i,j) + a(i,k) * b(k,j) end do end do end do

Analysis • Outer loop is parallel: columns of c • Not optimal for cache use • Can put more directives for each loop • Then granularity might be too fine

OMP Functions • int omp_get_num_procs() • int omp_get_num_threads() • int omp_get_thread_num() • void omp_set_num_threads(int)

OpenMP Environment Variables • OpenMP parallelism may be controlled via environment variables • OMP_NUM_THREADS • Sets number of threads in parallel sections • OMP_DYNAMIC • When = TRUE, allows number of threads to be set at runtime • OMP_NESTED • When = TRUE, enables nested parallelism • OMP_SCHEDULE • Controls the scheduling assignment • Example - export OMP_SCHEDULE=“static,4”

Fortran Parallel Directives • PARALLEL / END PARALLEL • PARALLEL SECTIONS / SECTION / SECTION / END PARALLEL SECTIONS • DO / END DO • work sharing directive for DO loop immediately following • PARALLEL DO / END PARALLEL DO • combined section and work sharing

Pthread Translation

Introduction to OpenMP: Directive-Based Programming for Multi-threaded Parallelism