Introduction to OpenMP

For a more detailed tutorial, see http://www.openmp.org and look at the presentations.

Concepts
• Directive-based programming
  • declare properties of language structures (sections, loops)
  • scope variables
• A few service routines
  • get information
• Compiler options
• Environment variables

OpenMP Programming Model
• Fork-join parallelism
• The master thread spawns a team of threads as needed.

Typical OpenMP Use
• Generally used to parallelize loops
  • Find the most time-consuming loops
  • Split iterations up between threads

Sequential version:

    int main(void) {
        double Res[1000];
        for (int i = 0; i < 1000; i++) {
            do_huge_comp(Res[i]);
        }
    }

Parallel version:

    int main(void) {
        double Res[1000];
        #pragma omp parallel for
        for (int i = 0; i < 1000; i++) {
            do_huge_comp(Res[i]);
        }
    }

Thread Interaction
• OpenMP operates using shared memory
  • Threads communicate via shared variables
• Unintended sharing can lead to race conditions
  • output changes due to thread scheduling
• Control race conditions using synchronization
  • synchronization is expensive
  • change the way data is stored to minimize the need for synchronization
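The last bullet can be illustrated with a minimal sketch in which each thread accumulates into its own local variable and writes one private slot at the end, so the loop itself needs no synchronization. The function name sum_array, the MAX_THREADS bound, and the fixed team size of 4 are assumptions made for this illustration, not part of the original slides.

```c
#include <stddef.h>
#ifdef _OPENMP
#include <omp.h>
#endif

#define MAX_THREADS 64   /* assumed upper bound on team size for this sketch */

/* Sum n doubles. Every thread works on its own local accumulator and
   stores it in its own slot of partial[], so no critical section is needed. */
double sum_array(const double *data, int n)
{
    double partial[MAX_THREADS] = {0.0};
    double total = 0.0;
    int t;

    #pragma omp parallel num_threads(4)
    {
        int id = 0, nthreads = 1;   /* serial fallback values */
        double local = 0.0;
        int k;
#ifdef _OPENMP
        id = omp_get_thread_num();
        nthreads = omp_get_num_threads();
#endif
        /* cyclic distribution: thread id handles k = id, id + nthreads, ... */
        for (k = id; k < n; k += nthreads)
            local += data[k];
        partial[id] = local;        /* one write per thread, distinct slots */
    }

    /* combine the per-thread results after the parallel region has ended */
    for (t = 0; t < MAX_THREADS; t++)
        total += partial[t];
    return total;
}
```

Calling sum_array(d, 100) on an array holding 1 through 100 returns 5050 whether or not the program is compiled with OpenMP enabled.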

Syntax format
• Compiler directives
  • C/C++
    • #pragma omp construct [clause [clause] …]
  • Fortran
    • C$OMP construct [clause [clause] …]
    • !$OMP construct [clause [clause] …]
    • *$OMP construct [clause [clause] …]
• Since we use directives, no changes need to be made to a program for a compiler that doesn’t support OpenMP
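As a hedged sketch of that portability point: the pragma below is simply ignored by a compiler without OpenMP, and the standard _OPENMP macro can guard the few places that call the runtime library. The function names available_threads and scale are hypothetical helpers chosen for this example.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Upper bound on the threads a parallel region may use;
   1 when the program is compiled without OpenMP support. */
int available_threads(void)
{
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 1;
#endif
}

/* Multiply every element of v by factor. Without OpenMP the pragma
   is ignored and the loop runs serially with the same result. */
void scale(double *v, int n, double factor)
{
    int i;
    #pragma omp parallel for
    for (i = 0; i < n; i++)
        v[i] *= factor;
}
```

The same source file therefore builds both ways; only the guarded runtime call changes behavior.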

Using OpenMP
• Compilers can automatically place directives with an option
  • -qsmp=auto (IBM)
  • xlf_r and xlc do a good job (IBM)
  • some loops may speed up, some may slow down
• A compiler option is required when you write the directives yourself
  • -qsmp=omp (IBM)
  • -mp (SGI)
• Can mix directives with automatic parallelization
  • -qsmp=auto:omp (IBM)
• Scoping variables is the hard part!
  • shared variables, thread-private variables

OpenMP Directives
• 5 categories
  • Parallel Regions
  • Worksharing
  • Data Environment
  • Synchronization
  • Runtime functions / environment variables
• Basically the same between C/C++ and Fortran

Parallel Regions
• Create threads with omp parallel
• Threads share A (default behavior)
• All threads start at the same time, then synchronize at a barrier at the end of the region before the code continues.

    double A[1000];
    omp_set_num_threads(4);
    #pragma omp parallel
    {
        int ID = omp_get_thread_num();
        dosomething(ID, A);
    }

Sections construct
• The sections construct gives a different structured block to each thread
• By default there is a barrier at the end. Use the nowait clause to turn it off.

    #pragma omp parallel
    #pragma omp sections
    {
        #pragma omp section
        x_calculation();
        #pragma omp section
        y_calculation();
        #pragma omp section
        z_calculation();
    }
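A self-contained sketch of the same pattern, assuming three independent computations over one read-only array. The function compute_stats and its parameters are hypothetical; each section writes to a different output, so the sections do not interfere with each other.

```c
/* Compute sum, minimum and maximum of v[0..n-1] (n >= 1 assumed).
   Each section is an independent block that may run on its own thread. */
void compute_stats(const int *v, int n, long *sum, int *lo, int *hi)
{
    #pragma omp parallel sections
    {
        #pragma omp section
        {
            long s = 0;
            int i;
            for (i = 0; i < n; i++)
                s += v[i];
            *sum = s;
        }
        #pragma omp section
        {
            int m = v[0];
            int i;
            for (i = 1; i < n; i++)
                if (v[i] < m) m = v[i];
            *lo = m;
        }
        #pragma omp section
        {
            int m = v[0];
            int i;
            for (i = 1; i < n; i++)
                if (v[i] > m) m = v[i];
            *hi = m;
        }
    }   /* implicit barrier: all three results are ready here */
}
```

Variables declared inside a section block are private to the thread executing it, which is why each block declares its own loop index.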

Work-sharing constructs
• The for construct splits up loop iterations
• By default, there is a barrier at the end of the “omp for”. Use the “nowait” clause to turn off the barrier.

    #pragma omp parallel
    #pragma omp for
    for (I = 0; I < N; I++) {
        NEAT_STUFF(I);
    }

Short-hand notation
• Can combine parallel and work-sharing constructs
• There is also a “parallel sections” construct

    #pragma omp parallel for
    for (I = 0; I < N; I++) {
        NEAT_STUFF(I);
    }

A Rule
• In order to be made parallel, a loop must have canonical “shape”:

    for (index = start; index OP end; STEP)

  where OP is one of

    <    <=    >=    >

  and STEP is one of

    index++    ++index    index--    --index
    index += inc    index -= inc
    index = index + inc    index = inc + index    index = index - inc
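For illustration, here is a small sketch of a loop in canonical shape that counts downward with a stride, using the `>=` test and the `index -= inc` step from the lists above. The function double_strided_down is a hypothetical example; it assumes inc > 0 and that inc does not change during the loop.

```c
/* Visit indices n-1, n-1-inc, n-1-2*inc, ... and store 2*a[index] in out[index].
   The loop header has canonical shape: a single index, a simple relational
   test against a loop-invariant bound, and a loop-invariant step. */
void double_strided_down(const int *a, int *out, int n, int inc)
{
    int index;
    #pragma omp parallel for
    for (index = n - 1; index >= 0; index -= inc)
        out[index] = 2 * a[index];
}
```

Because the trip count can be computed before the loop starts, the runtime can divide the iterations among threads up front.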

An example

    #pragma omp parallel for private(j)
    for (i = 0; i < BLOCK_SIZE(id,p,n); i++)
        for (j = 0; j < n; j++)
            a[i][j] = MIN(a[i][j], a[i][k] + tmp[j]);

• By definition, private variable values are undefined at loop entry and exit
• To change this behavior, you can use the firstprivate(var) and lastprivate(var) clauses

    x[0] = complex_function();
    #pragma omp parallel for private(j) firstprivate(x)
    for (i = 0; i < n; i++) {
        for (j = 1; j < m; j++)      /* start at 1 so x[j-1] is valid */
            x[j] = g(i, x[j-1]);
        answer[i] = x[j] - x[i];     /* here j == m */
    }
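A minimal sketch of the two clauses in isolation, using hypothetical helper functions add_offset and last_square: firstprivate gives each thread a private copy initialized from the value before the loop, and lastprivate copies back the value from the sequentially last iteration.

```c
/* firstprivate: every thread starts with its own copy of offset,
   initialized to the value offset had before the loop. */
void add_offset(int *v, int n, int offset)
{
    int i;
    #pragma omp parallel for firstprivate(offset)
    for (i = 0; i < n; i++)
        v[i] += offset;
}

/* lastprivate: after the loop, sq holds the value assigned in the
   sequentially last iteration (i == n - 1). Assumes n >= 1. */
int last_square(int n)
{
    int i, sq = -1;
    #pragma omp parallel for lastprivate(sq)
    for (i = 0; i < n; i++)
        sq = i * i;
    return sq;
}
```

With a plain private(sq) clause the value of sq after the loop would be undefined, which is the behavior the bullets above describe.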

Scheduling Iterations
• The schedule clause affects how loop iterations are mapped onto threads
  • schedule(static[,chunk])
    • Deal out blocks of iterations of size “chunk” to each thread.
  • schedule(dynamic[,chunk])
    • Each thread grabs “chunk” iterations off a queue until all iterations have been handled.
  • schedule(guided[,chunk])
    • Threads dynamically grab blocks of iterations. The size of the block starts large and shrinks down to size “chunk” as the calculation proceeds.
  • schedule(runtime)
    • Schedule and chunk size are taken from the OMP_SCHEDULE environment variable.
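To make the static case concrete, the sketch below models how schedule(static, chunk) deals chunks out round-robin in thread order, and then checks that model against the thread that actually ran each iteration. The helper names static_owner and count_mismatches are hypothetical; without OpenMP the check degenerates to a single thread.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Thread expected to run iteration i under schedule(static, chunk):
   chunk number i / chunk is assigned round-robin to threads 0..nthreads-1. */
int static_owner(int i, int chunk, int nthreads)
{
    return (i / chunk) % nthreads;
}

/* Run a loop of n iterations with schedule(static, chunk) and count the
   iterations whose executing thread differs from the model above. */
int count_mismatches(int n, int chunk)
{
    int mismatches = 0;
    #pragma omp parallel reduction(+:mismatches)
    {
        int id = 0, nthreads = 1;   /* serial fallback values */
        int i;
#ifdef _OPENMP
        id = omp_get_thread_num();
        nthreads = omp_get_num_threads();
#endif
        #pragma omp for schedule(static, chunk)
        for (i = 0; i < n; i++)
            if (static_owner(i, chunk, nthreads) != id)
                mismatches++;
    }
    return mismatches;
}
```

With chunk = 2 and three threads, for example, iterations 0 and 1 go to thread 0, iterations 2 and 3 to thread 1, iterations 4 and 5 to thread 2, and iteration 6 wraps around to thread 0 again. The dynamic and guided schedules have no such fixed mapping, since threads grab work as they become free.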

An example

    #pragma omp parallel for private(j) schedule(static, 2)
    for (i = 0; i < n; i++)
        for (j = 1; j < m; j++)
            x[i][j] = g(i, x[i][j-1]);

• You can play with the chunk size to address load-balancing issues, etc.

Scheduling considerations
• Dynamic is the most general and provides load balancing
• If the choice of scheduling has a (big) impact on performance, something is wrong:
  • overhead too big => work in loop too small
• The chunk size n can be a specification expression, not just a constant

Synchronization Directives
• BARRIER
  • inside PARALLEL, all threads synchronize
• CRITICAL (lock) / END CRITICAL (lock)
  • section that can be executed by only one thread at a time
  • lock is an optional name to distinguish several critical constructs from each other

An example

    double area, pi, x;
    int i, n;
    area = 0.0;
    #pragma omp parallel for private(x)
    for (i = 0; i < n; i++) {
        x = (i + 0.5)/n;
        #pragma omp critical
        area += 4.0/(1.0 + x*x);
    }
    pi = area / n;

Reductions
• Sometimes you want each thread to calculate part of a value, then collapse all of that into a single value
• Done with the reduction clause

    area = 0.0;
    #pragma omp parallel for private(x) reduction(+:area)
    for (i = 0; i < n; i++) {
        x = (i + 0.5)/n;
        area += 4.0/(1.0 + x*x);
    }
    pi = area / n;

Another Example

    /* A Monte Carlo algorithm for calculating pi */
    int count;               /* points inside the unit quarter circle */
    unsigned short xi[3];    /* random number seed */
    int i;                   /* loop index */
    int samples;             /* number of points to generate */
    double x, y;             /* coordinates of points */
    double pi;               /* estimate of pi */

    xi[0] = 1;               /* these statements set up the random seed */
    xi[1] = 1;
    xi[2] = 0;
    count = 0;
    for (i = 0; i < samples; i++) {
        x = erand48(xi);
        y = erand48(xi);
        if (x*x + y*y <= 1.0) count++;
    }
    pi = 4.0 * count / samples;
    printf("Estimate of pi: %7.5f\n", pi);

OpenMP Issues
• Each thread needs a different random number seed
• count is shared, but we need the aggregate

OpenMP Version

    /* A Monte Carlo algorithm for calculating pi */
    int count;               /* points inside the unit quarter circle */
    unsigned short xi[3];    /* random number seed */
    int i;                   /* loop index */
    int samples;             /* number of points to generate */
    double x, y;             /* coordinates of points */
    double pi;               /* estimate of pi */

    omp_set_num_threads(omp_get_num_procs());
    xi[0] = 1;
    xi[1] = 1;
    count = 0;
    #pragma omp parallel firstprivate(xi) private(x,y)
    {
        /* Set inside the parallel region: outside it,
           omp_get_thread_num() always returns 0. */
        xi[2] = omp_get_thread_num();
        #pragma omp for reduction(+:count)
        for (i = 0; i < samples; i++) {
            x = erand48(xi);
            y = erand48(xi);
            if (x*x + y*y <= 1.0) count++;
        }
    }
    pi = 4.0 * count / samples;
    printf("Estimate of pi: %7.5f\n", pi);

An alternate version

    …
    #pragma omp parallel private(xi, t, tid, i, x, y, local_count)
    {
        xi[0] = 1;
        xi[1] = 1;
        xi[2] = tid = omp_get_thread_num();
        t = omp_get_num_threads();
        local_count = 0;
        for (i = tid; i < samples; i += t) {
            x = erand48(xi);
            y = erand48(xi);
            if (x*x + y*y <= 1.0) local_count++;
        }
        #pragma omp critical
        count += local_count;
    }
    pi = 4.0 * count / samples;
    printf("Estimate of pi: %7.5f\n", pi);
    }

Conditional Execution
• Overhead of fork/join is high
• If a loop is small, you don’t want to parallelize it
• But you may not know how big it is until runtime
• Conditional clause for parallel execution
  • if ( expression )

    area = 0.0;
    #pragma omp parallel for private(x) reduction(+:area) if (n > 5000)
    for (i = 0; i < n; i++) {
        x = (i + 0.5)/n;
        area += 4.0/(1.0 + x*x);
    }
    pi = area / n;

Scope Rules
• Shared memory programming model
  • most variables are shared by default
• Global variables are shared
• But not everything is shared
  • stack variables in functions are private
  • a variable that is set and then used in a DO is PRIVATE
  • an array whose subscript is constant w.r.t. PARALLEL DO and is set and then used within the DO is PRIVATE

Scope Clauses
• The DO and for directives have extra clauses; the most important are
  • PRIVATE (variable list)
  • REDUCTION (op: variable list)
    • op is sum, min, max
    • variable is scalar; XLF allows arrays
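The earlier slides show a sum reduction; as a hedged sketch of the max operator in C, the hypothetical function array_max below assumes a compiler implementing OpenMP 3.1 or later, which is where min and max reductions became available for C/C++. It also assumes n >= 1.

```c
/* Largest element of v[0..n-1]. Each thread tracks its own private
   maximum; the private results and the original value of best are
   combined with max at the end of the loop. */
double array_max(const double *v, int n)
{
    double best = v[0];
    int i;
    #pragma omp parallel for reduction(max:best)
    for (i = 1; i < n; i++)
        if (v[i] > best)
            best = v[i];
    return best;
}
```

Without the reduction clause, concurrent updates of the shared variable best would be a race condition.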

Scope Clauses (2)
• PARALLEL, PARALLEL DO and PARALLEL SECTIONS also have
  • DEFAULT (PRIVATE | SHARED | NONE)
    • without it, scope is determined by the rules
  • SHARED (variable list)
  • IF (scalar logical expression)
• Directives are like a programming language extension, not a compiler option

    integer i,j,n
    real*8 a(n,n), b(n)
    read (1) b
    !$OMP PARALLEL DO PRIVATE (i,j) SHARED (a,b,n)
    do j=1,n
       do i=1,n
          a(i,j) = sqrt(1.d0 + b(j)*i)
       end do
    end do
    !$OMP END PARALLEL DO

Matrix Multiply

    !$OMP PARALLEL DO PRIVATE(i,j,k)
    do j=1,n
       do i=1,n
          do k=1,n
             c(i,j) = c(i,j) + a(i,k) * b(k,j)
          end do
       end do
    end do

Analysis
• Outer loop is parallel: columns of c
• Not optimal for cache use
• Can put more directives on each loop
  • then the granularity might be too fine
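As one possible way to address the cache remark, here is a sketch of the same computation in C. Because C stores arrays row by row (the opposite of Fortran), the sketch parallelizes over rows of c and puts the j loop innermost so the inner loop walks b and c with unit stride. The function name matmul and the flat row-major layout are assumptions made for this illustration.

```c
/* c += a * b for n-by-n matrices stored row-major in flat arrays.
   Only the outer loop is parallel, so each thread owns whole rows of c
   and the work per thread stays coarse-grained. */
void matmul(int n, const double *a, const double *b, double *c)
{
    int i, j, k;
    #pragma omp parallel for private(j, k)
    for (i = 0; i < n; i++) {
        for (k = 0; k < n; k++) {
            double aik = a[i * n + k];          /* reused across the j loop */
            for (j = 0; j < n; j++)             /* unit stride in b and c */
                c[i * n + j] += aik * b[k * n + j];
        }
    }
}
```

Adding directives to the inner loops as well would create far more, far smaller units of work, which is the granularity concern in the last bullet.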

OMP Functions
• int omp_get_num_procs()
• int omp_get_num_threads()
• int omp_get_thread_num()
• void omp_set_num_threads(int)

Serial Directives
• MASTER / END MASTER
  • executed by the master thread only
• DO SERIAL / END DO SERIAL
  • the loop immediately following should not be parallelized
  • useful with -qsmp=omp:auto
• SINGLE
  • only one thread executes the block
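A short sketch of SINGLE and MASTER in their C form, using a hypothetical function fill_indices. The single block runs on exactly one thread and is followed by an implicit barrier, so every thread sees the value it sets; the master block runs only on thread 0 and has no implied barrier.

```c
/* Fill data[i] = base + i, where base is set once by a single thread.
   Returns how many times the master block ran (expected: 1). */
int fill_indices(int *data, int n)
{
    int base = 0;      /* shared; written inside the single block */
    int reports = 0;   /* shared; written only by the master thread */

    #pragma omp parallel
    {
        int i;

        #pragma omp single
        base = 100;    /* one thread executes this; the others wait at
                          the implicit barrier that ends the single */

        #pragma omp for
        for (i = 0; i < n; i++)
            data[i] = base + i;

        #pragma omp master
        reports++;     /* thread 0 only; other threads skip it without waiting */
    }
    return reports;
}
```

Note that both constructs sit directly inside the parallel region, not inside the work-shared loop.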

Example Serial Execution

    /* A Monte Carlo algorithm for calculating pi */
    …
    omp_set_num_threads(omp_get_num_procs());
    xi[0] = 1;
    xi[1] = 1;
    count = 0;
    #pragma omp parallel firstprivate(xi) private(x,y)
    {
        xi[2] = omp_get_thread_num();   /* a different seed per thread */
        /* single must sit outside the work-shared loop */
        #pragma omp single
        {
            printf("Threads in team: %d\n", omp_get_num_threads());
        }
        #pragma omp for reduction(+:count)
        for (i = 0; i < samples; i++) {
            x = erand48(xi);
            y = erand48(xi);
            if (x*x + y*y <= 1.0) count++;
        }
    }
    pi = 4.0 * count / samples;
    printf("Estimate of pi: %7.5f\n", pi);

Fortran Parallel Directives
• PARALLEL / END PARALLEL
• PARALLEL SECTIONS / SECTION / SECTION / END PARALLEL SECTIONS
• DO / END DO
  • work-sharing directive for the DO loop immediately following
• PARALLEL DO / END PARALLEL DO
  • combined parallel region and work sharing
