1 / 40

Introduction to OpenMP

Learn about OpenMP Programming Model, Directive-based Programming, Application Program Interface, Thread Interaction, Syntax format, Compiler Directives, and Thread Scoping in a structured tutorial. For a detailed guide, visit the provided links.

kates
Download Presentation

Introduction to OpenMP

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Introduction to OpenMP For a more detailed tutorial see: http://www.openmp.org Look at the presentations https://computing.llnl.gov/tutorials/openMP/

  2. Concepts • An Application Program Interface (API) that may be used to explicitly direct multi-threaded, shared memory parallelism • Directive based programming • declare properties of language structures (sections, loops) • scope variables • A few service routines • get information • Compiler options • Environment variables

  3. Open MP Programming Model • Directive • #pragma omp directive [clause list] • Program executes serially until it encounters a parallel directive • #pragma omp parallel [clause list] • /* structured block of code */ • Clause list is used to specify conditions • Conditional parallelism - if (cond) • Degree of concurrency - num_threads(int) • Data Handling - • private(vlist), firstprivate(vlist), shared(vlist)

  4. OpenMP Programming Model • fork-join parallelism • Master thread spawns a team of threads as needed.

  5. Typical OpenMP Use • Generally used to parallelize loops • Find most time consuming loops • Split iterations up between threads void main() { double Res[1000]; #pragma omp parallel for for(int i=0;i<1000;i++) { do_huge_comp(Res[i]); } } void main() { double Res[1000]; for(int i=0;i<1000;i++) { do_huge_comp(Res[i]); } }

  6. Thread Interaction • OpenMP operates using shared memory • Threads communicate via shared variables • Unintended sharing can lead to race conditions • output changes due to thread scheduling • Control race conditions using synchronization • synchronization is expensive • change the way data is stored to minimize the need for synchronization

  7. Syntax format • Compiler directives • C/C++ • #pragma omp construct [clause [clause] …] • Fortran • C$OMP construct [clause [clause] … ] • !$OMP construct [clause [clause] … ] • *$OMP construct [clause [clause] … ] • Since we use directives, no changes need to be made to a program for a compiler that doesn’t support OpenMP

  8. Using OpenMP • Some compilers can automatically place directives with option • -qsmp=auto (IBM xlc) • some loops may speed up, some may slow down • Compiler option required when you use directives • -qsmp=omp (IBM) • -mp (sgi) • -openmp (Linux – Intel compiler) • -fopenmp (gcc versions >= 4.2) • Scoping variables is the hard part! • shared variables, thread private variables

  9. OpenMP Example #include <stdio.h> #include <omp.h> main () { int nthreads, tid; /* Fork a team of threads giving them their own copies of variables */ #pragma omp parallel private(tid) { /* Obtain and print thread id */ tid = omp_get_thread_num(); printf("Hello World from thread = %d\n", tid); /* Only master thread does this */ if (tid == 0) { nthreads = omp_get_num_threads(); printf("Number of threads = %d\n", nthreads); } } /* All threads join master thread and terminate */ } [snell@atropos openmp]$ gcc-4.2 -fopenmp hello.c -o hello [snell@atropos openmp]$ ./hello Hello World from thread = 0 Hello World from thread = 1 Number of threads = 2 [snell@atropos openmp]$

  10. OpenMP Directives • 5 categories • Parallel Regions • Worksharing • Data Environment • Synchronization • Runtime functions / environment variables • Basically the same between C/C++ and Fortran

  11. Parallel Regions • Create threads with omp parallel • Threads share A (default behavior) • Master thread creates the threads • Threads all start at same time then synchronize at a barrier at the end to continue with code. double A[1000] omp_set_num_threads(4); #pragma omp parallel { int ID = omp_get_thread_num(); dosomething(ID, A); }

  12. Parallel Regions #pragma omp parallel [clause ...] newline structured_block Clauses if (scalar_expression) private (list) shared (list) default (shared | none) firstprivate (list) reduction (operator: list) copyin (list) num_threads (integer-expression)

  13. Parallel Regions • The number of threads in a parallel region is determined by the following factors, in order of precedence: • 1. Evaluation of the IF clause • 2. Setting of the NUM_THREADS clause • 3. Use of the omp_set_num_threads() library function • 4. Setting of the OMP_NUM_THREADS environment variable • 5. Implementation default - usually the number of CPUs on a node, though it could be dynamic (see next bullet). • Threads are numbered from 0 (master thread) to N-1

  14. Sections construct • The sections construct gives a different structured block to each thread • By default there is a barrier at the end. Use the nowait clause to turn off. #pragma omp parallel #pragma omp sections { X_calculation(); #pragma omp section y_calculation(); #pragma omp section z_calculation(); }

  15. Work-sharing constructs • the for construct splits up loop iterations • By default, there is a barrier at the end of the “omp for”. Use the “nowait” clause to turn off the barrier. #pragma omp parallel #pragma omp for for (I=0;I<N;I++) { NEAT_STUFF(I); }

  16. Short-hand notation • Can combine parallel and work sharing constructs • There is also a “parallel sections” construct #pragma omp parallel for for (I=0;I<N;I++){ NEAT_STUFF(I); }

  17. A Rule • In order to be made parallel, a loop must have canonical “shape” index++; ++index; index--; --index; index += inc; index -= inc; index = index + inc; index = inc + index; index = index – inc; < <= >= > for (index=start; index end; )

  18. An example #pragma omp parallel for private(j) for (i = 0; i < BLOCK_SIZE(id,p,n); i++) for (j = 0; j < n; j++) a[i][j] = MIN(a[i][j], a[i][k] + tmp[j]) By definition, private variable values are undefined at loop entry and exit To change this behavior, you can use the firstprivate(var) and lastprivate(var) clauses x[0] = complex_function(); #pragma omp parallel for private(j) firstprivate(x) for (i = 0; i < n; i++) for (j = 0; j < m; j++) x[j] = g(i, x[j-1]); answer[i] = x[j] – x[i];

  19. Scheduling Iterations • The schedule clause effects how loop iterations are mapped onto threads • schedule(static [,chunk]) • Deal-out blocks of iterations of size “chunk” to each thread. • schedule(dynamic[,chunk]) • Each thread grabs “chunk” iterations off a queue until all iterations have been handled. • schedule(guided[,chunk]) • Threads dynamically grab blocks of iterations. The size of he block starts large and shrinks down to size “chunk” as the calculation proceeds. • schedule(runtime) • Schedule and chunk size taken from the OMP_SCHEDULE environment variable.

  20. An example #pragma omp parallel for private(j) schedule(static, 2) for (i = 0; i < n; i++) for (j = 0; j < m; j++) x[j][j] = g(i, x[j-1]); You can play with the chunk size to meet load balancing issues, etc.

  21. Synchronization Directives • BARRIER • inside PARALLEL, all threads synchronize • CRITICAL (lock) / END CRITICAL (lock) • section that can be executed by one thread only • lock is optional name to distinguish several critical constructs from each other

  22. An example double area, pi, x; int i, n; area = 0.0; #pragma omp parallel for private(x) for (i = 0; i < n; i++) { x = (i + 0.5)/n; #pragma omp critical area += 4.0/(1.0 + x*x); } pi = area / n;

  23. Reductions • Sometimes you want each thread to calculate part of a value then collapse all that into a single value • Done with reduction clause area = 0.0; #pragma omp parallel for private(x) reduction (+:area) for (i = 0; i < n; i++) { x = (i + 0.5)/n; area += 4.0/(1.0 + x*x); } pi = area / n;

  24. OpenMP Issues Each thread needs different random number seeds count is shared we need the aggregate Another Example /* A Monte Carlo algorithm for calculating pi */ int count; /* points inside the unit quarter circle */ unsigned short xi[3]; /* random number seed */ int i; /* loop index */ int samples; /* Number of points to generate */ double x,y; /* Coordinates of points */ double pi; /* Estimate of pi */ xi[0] = 1; /* These statements set up the random seed */ xi[1] = 1; xi[2] = 0; count = 0; for (i = 0; i < samples; i++) { x = erand48(xi); y = erand48(xi); if (x*x + y*y <= 1.0) count++; } pi = 4.0 * count / samples; printf(“Estimate of pi: %7.5f\n”, pi);

  25. OpenMP Version /* A Monte Carlo algorithm for calculating pi */ int count; /* points inside the unit quarter circle */ unsigned short xi[3]; /* random number seed */ int i; /* loop index */ int samples; /* Number of points to generate */ double x,y; /* Coordinates of points */ double pi; /* Estimate of pi */ omp_set_num_threads(omp_get_num_procs()); xi[0] = 1; xi[1] = 1; xi[2] = omp_get_thread_num(); count = 0; #pragma omp parallel for firstprivate(xi) private(x,y) reduction(+:count) for (i = 0; i < samples; i++) { x = erand48(xi); y = erand48(xi); if (x*x + y*y <= 1.0) count++; } pi = 4.0 * count / samples; printf(“Estimate of pi: %7.5f\n”, pi); What is wrong with this?

  26. #include <stdio.h> #include <stdlib.h> #include <omp.h> main(int argc, char *argv[]) { /* A Monte Carlo algorithm for calculating pi */ int count; /* points inside the unit quarter circle */ unsigned short xi[3]; /* random number seed */ int i; /* loop index */ int samples; /* Number of points to generate */ double x,y; /* Coordinates of points */ double pi; /* Estimate of pi */ samples = atoi(argv[1]); #pragma omp parallel { xi[0] = 1; /* These statements set up the random seed */ xi[1] = 1; xi[2] = omp_get_thread_num(); count = 0; printf("I am thread %d\n", xi[2]); #pragma omp for firstprivate(xi) private(x,y) reduction(+:count) for (i = 0; i < samples; i++) { x = erand48(xi); y = erand48(xi); if (x*x + y*y <= 1.0) count++; } } pi = 4.0 * (double)count / (double)samples; printf("Count = %d, Samples = %d, Estimate of pi: %7.5f\n", count, samples, pi); } Corrected Version [snell@atropos openmp]$ time ./montecarlopi.single 50000000 I am thread 0 Count = 39268604, Samples = 50000000, Estimate of pi: 3.14149 real 0m4.628s user 0m4.568s sys 0m0.030s [snell@atropos openmp]$ time ./montecarlopi 50000000 I am thread 1 I am thread 0 Count = 39272420, Samples = 50000000, Estimate of pi: 3.14179 real 0m2.480s user 0m4.620s sys 0m0.030s [snell@atropos openmp]$

  27. An alternate version … #pragma omp parallel private(xi, t, x, y, local_count) { xi[0] = 1; xi[1] = 1; xi[2] = tid = omp_get_thread_num(); t = omp_get_num_threads(); local_count = 0; for (i = tid; i < samples; i += t) { x = erand48(xi); y = erand48(xi); if (x*x + y*y <= 1.0) local_count++; } #pragma omp critical count += local_count; } pi = 4.0 * count / samples; printf(“Estimate of pi: %7.5f\n”, pi); } [snell@atropos openmp]$ time ./montecarlopi2 50000000 I am thread 0 I am thread 1 1: local_count is 16461003 1: local_count is 16392925 Count = 32853927, Samples = 50000000, Estimate of pi: 2.62831 real 0m5.053s user 0m9.697s sys 0m0.053s [snell@atropos openmp]$ Problems!

  28. Corrected Version /* A Monte Carlo algorithm for calculating pi */ int count; /* points inside the unit quarter circle */ int local_count; /* points inside the unit quarter circle */ int t, tid; int i; /* loop index */ int samples; /* Number of points to generate */ double x,y; /* Coordinates of points */ double pi; /* Estimate of pi */ samples = atoi(argv[1]); #pragma omp parallel private(i,t,x,y,local_count) reduction(+:count) { unsigned short xi[3]; /* random number seed */ xi[0] = 1; /* These statements set up the random seed */ xi[1] = 1; xi[2] = tid = omp_get_thread_num(); t = omp_get_num_threads(); count = 0; //printf("I am thread %d\n", xi[2]); for (i = tid; i < samples; i+=t) { x = erand48(xi); y = erand48(xi); if (x*x + y*y <= 1.0) count++; } //printf("%d: count is %d\n", tid, count); } pi = 4.0 * (double)count / (double)samples; printf("Count = %d, Samples = %d, Estimate of pi: %7.5f\n", count, samples, pi); • Corrections: • i should be private! • random number seed array(xi) should also • be private [snell@atropos openmp]$ time ./montecarlopi3 50000000 Count = 39269476, Samples = 50000000, Estimate of pi: 3.14156 real 0m2.321s user 0m4.559s sys 0m0.014s

  29. Serial Directives • MASTER / END MASTER • executed by master thread only • DO SERIAL / END DO SERIAL • loop immediately following should not be parallelized • useful with -qsmp=omp:auto • SINGLE • only one thread executes the block

  30. Example Serial Execution /* A Monte Carlo algorithm for calculating pi */ … omp_set_num_threads(omp_get_num_procs()); xi[0] = 1; xi[1] = 1; xi[2] = omp_get_thread_num(); count = 0; #pragma omp parallel for firstprivate(xi) private(x,y) reduction(+:count) for (i = 0; i < samples; i++) { x = erand48(xi); y = erand48(xi); if (x*x + y*y <= 1.0) count++; #pragma omp single { printf(“Loop Iteration: %d\n”, i); } } pi = 4.0 * count / samples; printf(“Estimate of pi: %7.5f\n”, pi);

  31. Conditional Execution • Overhead of fork/join is high • If a loop is small, you don’t want to parallellize • But, you may not know how big until runtime • Conditional clause for parallel execution • if ( expression ) area = 0.0; #pragma omp parallel for private(x) reduction (+:area) if (n > 5000) for (i = 0; i < n; i++) { x = (i + 0.5)/n; area += 4.0/(1.0 + x*x); } pi = area / n;

  32. Scope Rules • Shared memory programming model • most variables are shared by default • Global variables are shared • But not everything is shared • loop index variables • stack variables in called functions from parallel region • variable set and then used in for-loop is PRIVATE • array whose subscript is constant w.r.t. PARALLEL for-loop and is set and then used within the for-loop is PRIVATE

  33. Scope Clauses • for/DO directive has extra clauses, the most important • PRIVATE (variable list) • REDUCTION (op: variable list) • op is sum, min, max • variable is scalar, XLF allows array

  34. Scope Clauses (2) • PARALLEL and PARALELL for/DO and PARALLEL SECTIONS have also • DEFAULT (variable list) • scope determined by rules • SHARED (variable list) • IF (scalar logical expression) • directives are like programming language extension, not compiler option

  35. integer i,j,n real*8 a(n,n), b(n) read (1) b !$OMP PARALLEL DO !$OMP PRIVATE (i,j) SHARED (a,b,n) do j=1,n do i=1,n a(i,j) = sqrt(1.d0 + b(j)*i) end do end do !$OMP END PARALLEL DO

  36. Matrix Multiply !$OMP PARALLEL DO PRIVATE(i,j,k) do j=1,n do i=1,n do k=1,n c(i,j) = c(i,j) + a(i,k) * b(k,j) end do end do end do

  37. Analysis • Outer loop is parallel: columns of c • Not optimal for cache use • Can put more directives for each loop • Then granularity might be too fine

  38. OMP Functions • int omp_get_num_procs() • int omp_get_num_threads() • int omp_get_thread_num() • void omp_set_num_threads(int)

  39. OpenMP Environment Variables • OpenMP parallelism may be controlled via environment variables • OMP_NUM_THREADS • Sets number of threads in parallel sections • OMP_DYNAMIC • When = TRUE, allows number of threads to be set at runtime • OMP_NESTED • When = TRUE, enables nested parallelism • OMP_SCHEDULE • Controls the scheduling assignment • Example - export OMP_SCHEDULE=“static,4”

  40. Fortran Parallel Directives • PARALLEL / END PARALLEL • PARALLEL SECTIONS / SECTION / SECTION / END PARALLEL SECTIONS • DO / END DO • work sharing directive for DO loop immediately following • PARALLEL DO / END PARALLEL DO • combined section and work sharing

More Related