Introduction to OpenMP

Introduction to OpenMP For a more detailed tutorial see: http://www.openmp.org Look at the presentations

Concepts • Directive based programming • declare properties of language structures (sections, loops) • scope variables • A few service routines • get information • Compiler options • Environment variables

OpenMP Programming Model • fork-join parallelism • Master thread spawns a team of threads as needed.

Typical OpenMP Use • Generally used to parallelize loops • Find most time consuming loops • Split iterations up between threads void main() { double Res[1000]; #pragma omp parallel for for(int i=0;i<1000;i++) { do_huge_comp(Res[i]); } } void main() { double Res[1000]; for(int i=0;i<1000;i++) { do_huge_comp(Res[i]); } }

Thread Interaction • OpenMP operates using shared memory • Threads communicate via shared variables • Unintended sharing can lead to race conditions • output changes due to thread scheduling • Control race conditions using synchronization • synchronization is expensive • change the way data is stored to minimize the need for synchronization

Syntax format • Compiler directives • C/C++ • #pragma omp construct [clause [clause] …] • Fortran • C$OMP construct [clause [clause] … ] • !$OMP construct [clause [clause] … ] • *$OMP construct [clause [clause] … ] • Since we use directives, no changes need to be made to a program for a compiler that doesn’t support OpenMP

Using OpenMP • Compilers can automatically place directives with option • -qsmp=auto • xlf_r and xlc do a good job • some loops may speed up, some may slow down • Compiler option required when you write in directives • -qsmp=omp (ibm) • -mp (sgi) • Can mix directives with automatic parallelization • -qsmp=auto:omp • Scoping variables is the hard part! • shared variables, thread private variables

OpenMP Directives • 5 categories • Parallel Regions • Worksharing • Data Environment • Synchronization • Runtime functions / environment variables • Basically the same between C/C++ and Fortran

Parallel Regions • Create threads with omp parallel • Threads share A (default behavior) • Threads all start at same time then synchronize at a barrier at the end to continue with code. double A[1000] omp_set_num_threads(4); #pragma omp parallel { int ID = omp_get_thread_num(); dosomething(ID, A); }

Sections construct • The sections construct gives a different structured block to each thread • By default there is a barrier at the end. Use the nowait clause to turn off. #pragma omp parallel #pragma omp sections { X_calculation(); #pragma omp section y_calculation(); #pragma omp section z_calculation(); }

Work-sharing constructs • the for construct splits up loop iterations • By default, there is a barrier at the end of the “omp for”. Use the “nowait” clause to turn off the barrier. #pragma omp parallel #pragma omp for for (I=0;I<N;I++) { NEAT_STUFF(I); }

Short-hand notation • Can combine parallel and work sharing constructs • There is also a “parallel sections” construct #pragma omp parallel for for (I=0;I<N;I++){ NEAT_STUFF(I); }

A Rule • In order to be made parallel, a loop must have canonical “shape” index++; ++index; index--; --index; index += inc; index -= inc; index = index + inc; index = inc + index; index = index – inc; < <= >= > for (index=start; index end; )

An example #pragma omp parallel for private(j) for (i = 0; i < BLOCK_SIZE(id,p,n); i++) for (j = 0; j < n; j++) a[i][j] = MIN(a[i][j], a[i][k] + tmp[j]) By definition, private variable values are undefined at loop entry and exit To change this behavior, you can use the firstprivate(var) and lastprivate(var) clauses x[0] = complex_function(); #pragma omp parallel for private(j) firstprivate(x) for (i = 0; i < n; i++) for (j = 0; j < m; j++) x[j] = g(i, x[j-1]); answer[i] = x[j] – x[i];

Scheduling Iterations • The schedule clause effects how loop iterations are mapped onto threads • schedule(static [,chunk]) • Deal-out blocks of iterations of size “chunk” to each thread. • schedule(dynamic[,chunk]) • Each thread grabs “chunk” iterations off a queue until all iterations have been handled. • schedule(guided[,chunk]) • Threads dynamically grab blocks of iterations. The size of he block starts large and shrinks down to size “chunk” as the calculation proceeds. • schedule(runtime) • Schedule and chunk size taken from the OMP_SCHEDULE environment variable.

An example #pragma omp parallel for private(j) schedule(static, 2) for (i = 0; i < n; i++) for (j = 0; j < m; j++) x[j][j] = g(i, x[j-1]); You can play with the chunk size to meet load balancing issues, etc.

Scheduling considerations • Dynamic is most general and provides load balancing • If choice of scheduling has (big) impact on performance, something is wrong: • overhead too big => work in loop too small • n can be specification expression, not just constant

Reductions • Sometimes you want each thread to calculate part of a value then collapse all that into a single value • Done with reduction clause area = 0.0; #pragma omp parallel for private(x) reduction (+:area) for (i = 0; i < n; i++) { x = (i + 0.5)/n; area += 4.0/(1.0 + x*x); } pi = area / n;

Fortran Parallel Directives • PARALLEL / END PARALLEL • PARALLEL SECTIONS / SECTION / SECTION / END PARALLEL SECTIONS • DO / END DO • work sharing directive for DO loop immediately following • PARALLEL DO / END PARALLEL DO • combined section and work sharing

Serial Directives • MASTER / END MASTER • executed by master thread only • DO SERIAL / END DO SERIAL • loop immediately following should not be parallelized • useful with -qsmp=omp:auto

Synchronization Directives • BARRIER • inside PARALLEL, all threads synchronize • CRITICAL (lock) / END CRITICAL (lock) • section that can be executed by one thread only • lock is optional name to distinguish several critical constructs from each other

An example double area, pi, x; int i, n; area = 0.0; #pragma omp parallel for private(x) for (i = 0; i < n; i++) { x = (i + 0.5)/n; #pragma omp critical area += 4.0/(1.0 + x*x); } pi = area / n;

Scope Rules • Shared memory programming model • most variables are shared by default • Global variables are shared • But not everything is shared • stack variables in functions are private • variable set and then used in DO is PRIVATE • array whose subscript is constant w.r.t. PARALLEL DO and is set and then used within the DO is PRIVATE

Scope Clauses • DO and for directive has extra clauses, the most important • PRIVATE (variable list) • REDUCTION (op: variable list) • op is sum, min, max • variable is scalar, XLF allows array

Scope Clauses (2) • PARALLEL and PARALELL DO and PARALLEL SECTIONS have also • DEFAULT (variable list) • scope determined by rules • SHARED (variable list) • IF (scalar logical expression) • directives are like programming language extension, not compiler option

integer i,j,n real*8 a(n,n), b(n) read (1) b !$OMP PARALLEL DO !$OMP PRIVATE (i,j) SHARED (a,b,n) do j=1,n do i=1,n a(i,j) = sqrt(1.d0 + b(j)*i) end do end do !$OMP END PARALLEL DO

Matrix Multiply !$OMP PARALLEL DO PRIVATE(i,j,k) do j=1,n do i=1,n do k=1,n c(i,j) = c(i,j) + a(i,k) * b(k,j) end do end do end do

Analysis • Outer loop is parallel: columns of c • Not optimal for cache use • Can put more directives for each loop • Then granularity might be too fine

OMP Functions • int omp_get_num_procs() • int omp_get_num_threads() • int omp_get_thread_num() • void omp_set_num_threads(int)

Introduction to OpenMP