
OpenMP

OpenMP. Martin Kruliš, Jiří Dokulil. OpenMP Architecture Review Board: Compaq, HP, Intel, IBM, KAI, SGI, SUN, U.S. Department of Energy,… http://www.openmp.org specifications (freely available) 1.0 – C/C++ and FORTRAN versions 2.0 – C/C++ and FORTRAN versions


Presentation Transcript


  1. OpenMP Martin Kruliš Jiří Dokulil

  2. OpenMP • OpenMP Architecture Review Board • Compaq, HP, Intel, IBM, KAI, SGI, SUN, U.S. Department of Energy,… • http://www.openmp.org • specifications (freely available) • 1.0 – C/C++ and FORTRAN versions • 2.0 – C/C++ and FORTRAN versions • 2.5 – combined C/C++ and FORTRAN • 3.0 – combined C/C++ and FORTRAN • 4.0 – combined C/C++ and FORTRAN (July 2013)

  3. Basics • fork – join model • tailored mostly for large array operations • pragmas • #pragma omp … • only a few constructs • programs should run without OpenMP • possible but not enforced • #ifdef _OPENMP
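
A minimal sketch of the "programs should run without OpenMP" point above. The helper names `available_threads` and `make_iota` are made up for illustration; only `_OPENMP`, `omp.h` and `omp_get_max_threads` come from OpenMP itself.

```cpp
#include <vector>
#ifdef _OPENMP
#include <omp.h>   // only pulled in when the compiler has OpenMP enabled
#endif

// Hypothetical helper: number of threads a parallel region may use,
// or 1 when the program is compiled without OpenMP.
int available_threads() {
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 1;
#endif
}

// Hypothetical helper: fills a vector with 0, 1, 2, ...
// A compiler without OpenMP ignores the pragma and runs the loop serially.
std::vector<int> make_iota(int n) {
    std::vector<int> data(n);
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        data[i] = i;
    }
    return data;
}
```

Calling `make_iota(1000)` should give the same vector whether or not OpenMP is enabled at compile time (e.g. with `-fopenmp` in GCC).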

  4. Simple example
     #define N (1024*1024)
     int* data = new int[N];
     for (int i = 0; i < N; ++i) { data[i] = i; }

  5. Simple example – cont.
     #define N (1024*1024)
     int* data = new int[N];
     #pragma omp parallel for
     for (int i = 0; i < N; ++i) { data[i] = i; }

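  6. Another example
     int sum;
     #pragma omp parallel for
     for (int i = 0; i < N; ++i) { sum += data[i]; }
     WRONG (sum is shared by default, so the unsynchronized updates race; it is also never initialized)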

  7. Variable scope • shared • one instance for all threads • private • one instance for each thread • reduction • special variant for reduction operations • valid within lexical extent • no effect in called functions

  8. Variable scope – private • default for loop control variable • only for the parallelized loop • should (probably always) be made private • all loops in Fortran • all variables declared within the parallelized block • all non-static variables in called functions • allocated on stack – private for each thread • uninitialized values • at start of the block and after the block • except for classes • default constructor (must be accessible) • may not be shared among the threads

  9. Variable scope – private
     int j;
     #pragma omp parallel for private(j)
     for (int i = 0; i < N/2; ++i) { j = i*2; data[j] = i; data[j+1] = i; }

  10. Variable scope – reduction • performing e.g. sum of an array • cannot use only private variable • shared requires explicit synchronization • combination is possible and (relatively) efficient but unnecessarily complex • each thread works on a private copy • initialized to a default value (0 for +, 1 for *,…) • final results are joined and available to the master thread

  11. Variable scope – reduction
      long long sum = 0;
      #pragma omp parallel for reduction(+:sum)
      for (int i = 0; i < N; ++i) { sum += data[i]; }
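
To illustrate the remark that combining private and shared variables is "possible but unnecessarily complex", here is a sketch that puts the reduction clause next to a hand-written equivalent. The function names `sum_reduction` and `sum_manual` are illustrative, not part of OpenMP.

```cpp
#include <vector>

// Sum using the reduction clause: each thread gets a private copy of sum
// initialized to 0, and the copies are added together at the end.
long long sum_reduction(const std::vector<int>& data) {
    long long sum = 0;
    const int n = static_cast<int>(data.size());
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; ++i) {
        sum += data[i];
    }
    return sum;
}

// Hand-written equivalent: private partial sums combined into the shared
// variable under explicit synchronization.
long long sum_manual(const std::vector<int>& data) {
    long long sum = 0;                 // shared
    const int n = static_cast<int>(data.size());
    #pragma omp parallel
    {
        long long partial = 0;         // declared inside the block, so private
        #pragma omp for nowait
        for (int i = 0; i < n; ++i) {
            partial += data[i];
        }
        #pragma omp critical
        sum += partial;                // one thread at a time
    }
    return sum;
}
```

Both functions should return the same value; the second only spells out what the clause does for you.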

  12. Variable scope – firstprivate and lastprivate • values of private variables at the start of the block and after end of the block are undefined • firstprivate • all values are initialized to the value of the master thread • lastprivate • variable after the parallelized block is set to the value of the last iteration (last in the serial version)
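
A small sketch of both clauses together, assuming the loop runs at least once (`n > 0`). The function name `last_value` is made up for this example.

```cpp
// Returns the value assigned in the sequentially last iteration.
// offset is firstprivate: every thread's copy starts with the caller's value.
// last is lastprivate: after the loop it holds the value from iteration n-1.
// Assumes n > 0; with no iterations nothing is ever assigned to last.
int last_value(int n, int offset) {
    int last = -1;
    #pragma omp parallel for firstprivate(offset) lastprivate(last)
    for (int i = 0; i < n; ++i) {
        last = i + offset;
    }
    return last;
}
```

With a plain `private(offset)` the copies would start uninitialized, and with `private(last)` the value after the loop would be undefined.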

  13. parallel • #pragma omp parallel • launches threads and executes block in parallel • modifiers • if (scalar expression) • variable scope modifiers (including reduction) • num_threads • especially useful in conjunction with omp_get_thread_num
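
One way to combine `num_threads` with `omp_get_thread_num` is to split the work by hand. The sketch below assumes `requested_threads` is positive; `fill_by_thread` is an illustrative name.

```cpp
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Each thread of the team processes one contiguous slice of the vector,
// chosen from its thread number. Without OpenMP there is a single "thread".
void fill_by_thread(std::vector<int>& data, int requested_threads) {
    const long long n = static_cast<long long>(data.size());
    #pragma omp parallel num_threads(requested_threads)
    {
#ifdef _OPENMP
        const int tid  = omp_get_thread_num();   // 0 = master
        const int team = omp_get_num_threads();  // may be fewer than requested
#else
        const int tid  = 0;
        const int team = 1;
#endif
        const long long begin = n * tid / team;
        const long long end   = n * (tid + 1) / team;
        for (long long i = begin; i < end; ++i) {
            data[i] = static_cast<int>(i) * 2;
        }
    }
}
```

The slice bounds are computed from the actual team size rather than from `requested_threads`, because the runtime is free to provide fewer threads.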

  14. Loop-level parallelism • #pragma omp parallel for • launch threads and execute loop in parallel • can be nested • #pragma omp for • parallel loop within another parallel block • no (direct) nesting • “simple” for expression • implicit barrier at the end

  15. Loop-level parallelism – modifiers 1 • variable scope modifiers • nowait – removes barrier • cannot be used with #pragma omp parallel for • ordered • loop (or called function) may contain block marked #pragma omp ordered • such block is executed in the same order as in serial execution of the loop • at most one such block may exist
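
A sketch of the ordered clause: the computation may run in parallel, while the block marked ordered runs in serial iteration order. The function name `squares_in_order` is illustrative.

```cpp
#include <vector>

// Computes i*i for each i in parallel, but appends the results in the same
// order as the serial loop would.
std::vector<int> squares_in_order(int n) {
    std::vector<int> out;
    #pragma omp parallel for ordered schedule(dynamic)
    for (int i = 0; i < n; ++i) {
        const int sq = i * i;       // may be computed by any thread, any time
        #pragma omp ordered
        out.push_back(sq);          // executed one iteration at a time, in order
    }
    return out;
}
```

Since the ordered block is executed by one iteration at a time, the shared vector needs no further protection here.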

  16. Loop-level parallelism – modifiers 2 • schedule • schedule(static[, chunk_size]) • round robin • no chunk size → equal size to all threads • schedule(dynamic[, chunk_size]) • threads request chunks • default chunk size is 1 • schedule(guided[, chunk_size]) • like dynamic with size of chunks proportional to the amount of remaining work, but at least chunk_size • default chunk size is 1 • schedule(auto) • selected by implementation • schedule(runtime) • schedule and chunk size are taken at run time from the internal control variable run-sched-var (set by OMP_SCHEDULE or omp_set_schedule)
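
Dynamic scheduling tends to pay off when iterations have uneven cost. The sketch below uses a loop whose work grows with `i`; the function name `triangular_work` is made up, and `chunk` is assumed to be positive.

```cpp
// Iteration i does i+1 additions, so later iterations are more expensive.
// With schedule(dynamic, chunk) idle threads keep requesting new chunks
// instead of waiting for a thread that received the expensive end of the range.
long long triangular_work(int n, int chunk) {
    long long total = 0;
    #pragma omp parallel for schedule(dynamic, chunk) reduction(+:total)
    for (int i = 0; i < n; ++i) {
        long long local = 0;
        for (int j = 0; j <= i; ++j) {
            local += j;
        }
        total += local;
    }
    return total;
}
```

The result does not depend on the schedule; only the distribution of iterations among threads changes.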

  17. Parallel sections • #pragma omp sections • #pragma omp section • #pragma omp section • … • several blocks of code that should be evaluated in parallel • modifiers • private, firstprivate, lastprivate, reduction • nowait
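
A sketch with two independent blocks, using the combined `parallel sections` form for brevity. The function name `min_and_max` is illustrative and the input is assumed to be non-empty.

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Finds the minimum and the maximum in two sections that may run on
// different threads. Assumes data is not empty.
std::pair<int, int> min_and_max(const std::vector<int>& data) {
    int lo = 0;
    int hi = 0;
    #pragma omp parallel sections
    {
        #pragma omp section
        lo = *std::min_element(data.begin(), data.end());

        #pragma omp section
        hi = *std::max_element(data.begin(), data.end());
    }
    return std::make_pair(lo, hi);
}
```

Each section writes a different shared variable, so no synchronization beyond the implicit barrier at the end is needed.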

  18. Single • #pragma omp single • code is executed by only one thread of the team • modifiers • private, firstprivate • nowait • when not used, there is a barrier at the end of the block • copyprivate • final value of the variable is distributed to all threads in the team after the block is executed • incompatible with nowait
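
A sketch of single with copyprivate: one thread computes a value and every thread's private copy receives it. `fill_with_seed` is an illustrative name, and the arithmetic merely stands in for something like reading a configuration value.

```cpp
#include <vector>

// One thread computes seed; copyprivate then broadcasts it to the private
// seed variable of every other thread before the work-sharing loop starts.
std::vector<int> fill_with_seed(int n, int seed_value) {
    std::vector<int> out(n);
    #pragma omp parallel
    {
        int seed = 0;                        // private: declared in the block
        #pragma omp single copyprivate(seed)
        {
            seed = seed_value * 2 + 1;       // placeholder for real work
        }
        #pragma omp for
        for (int i = 0; i < n; ++i) {
            out[i] = seed + i;
        }
    }
    return out;
}
```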

  19. Workshare • Fortran only…
      SUBROUTINE A11_1(AA, BB, CC, DD, EE, FF, N)
        INTEGER N
        REAL AA(N,N), BB(N,N), CC(N,N), DD(N,N), EE(N,N), FF(N,N)
      !$OMP PARALLEL
      !$OMP WORKSHARE
        AA = BB
        CC = DD
        EE = FF
      !$OMP END WORKSHARE
      !$OMP END PARALLEL
      END SUBROUTINE A11_1

  20. Master • #pragma omp master • executed only by the master thread

  21. Critical section • #pragma omp critical [name] • the well-known critical section • at most one thread can execute critical section with certain name • multiple pragmas with same name form one section • names have external linkage • all unnamed pragmas form one section
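
A sketch of a named critical section guarding a shared container. The section name `found_update` and the function name `collect_multiples` are arbitrary, and `k` is assumed to be positive.

```cpp
#include <algorithm>
#include <vector>

// Collects all multiples of k in [0, n). The push_back on the shared vector
// is protected by a named critical section; the insertion order depends on
// thread timing, so the result is sorted before returning.
std::vector<int> collect_multiples(int n, int k) {
    std::vector<int> found;
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        if (i % k == 0) {
            #pragma omp critical(found_update)
            found.push_back(i);
        }
    }
    std::sort(found.begin(), found.end());
    return found;
}
```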

  22. Barrier • #pragma omp barrier • no associated block of code • some restrictions on placement, e.g. it may not be the immediate substatement of an if, so the following is not allowed:
      if (a<10)
        #pragma omp barrier
      { do_something(); }

  23. Atomic • #pragma omp atomic • followed by expression in the form • x op= expr • op is one of +, *, -, /, &, ^, |, <<, or >> • expr must not reference x • x++ • ++x • x-- • --x
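
A typical use of atomic is a shared histogram, where each update has the `x op= expr` form. The sketch assumes non-negative input values and `buckets > 0`; the function name `histogram` is illustrative.

```cpp
#include <vector>

// Counts how many values fall into each bucket (value modulo buckets).
// Several threads may hit the same bucket, so the increment is atomic.
std::vector<int> histogram(const std::vector<int>& values, int buckets) {
    std::vector<int> counts(buckets, 0);
    int* c = counts.data();          // plain array access for the atomic update
    const int n = static_cast<int>(values.size());
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        const int b = values[i] % buckets;   // assumes values[i] >= 0
        #pragma omp atomic
        c[b] += 1;
    }
    return counts;
}
```

A critical section would also work here, but atomic protects only the single memory update and is usually cheaper.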

  24. Flush • #pragma omp flush (variable list) • make thread’s view of variables consistent with the main memory • variable list may be omitted, flushes all • similar to volatile in C/C++ • influences memory operation reordering that can be performed by the compiler • cannot move read/write of the flushed variable to the other “side” of the flush operation • all values of flushed variables are saved to the memory before flush finishes • first read of flushed variable after flush is performed from the main memory • same placement restrictions as barrier

  25. threadprivate • #pragma omp threadprivate(list) • makes global variable private for each thread • complex restrictions

  26. copyin, copyprivate • copyin(list) • copy value of threadprivate variable from master thread to other members of the team • used as modifier in #pragma omp parallel • values copied at the start of the block • copyprivate(list) • copy value from one thread’s private or threadprivate variable to all other members of the team • used as modifier in #pragma omp single • values copied at the end of the block
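
A sketch of threadprivate combined with copyin. The global `scale` and the function `scaled` are made up for this example.

```cpp
#include <vector>

// A global with one instance per thread.
static int scale = 1;
#pragma omp threadprivate(scale)

// The master thread sets its own copy of scale; copyin then copies that
// value into every other thread's copy at the start of the parallel region.
std::vector<int> scaled(int n, int factor) {
    std::vector<int> out(n);
    scale = factor;                          // master thread's copy only
    #pragma omp parallel for copyin(scale)
    for (int i = 0; i < n; ++i) {
        out[i] = i * scale;
    }
    return out;
}
```

Without `copyin(scale)` the other threads would still see their own copies, which in this sketch would hold the initial value 1 or whatever an earlier region left there.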

  27. Task • new in OpenMP 3.0 • #pragma omp task • piece of code to be executed in parallel • immediately or later • if clause forces immediate execution when false • tied or untied (to a thread) • can be suspended, e.g. by launching nested task • modifiers • default, private, firstprivate, shared • untied • if

  28. Task scheduling points • after explicit generation of a task • after the last instruction of a task region • taskwait region • in implicit and explicit barriers • (almost) anywhere in untied tasks

  29. Taskwait • #pragma omp taskwait • wait for completion of all child tasks generated since the start of the current task
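
The task and taskwait constructs are often shown on a recursive computation. Below is a sketch using Fibonacci numbers; the cutoff of 20 in the if clause is an arbitrary assumption to limit task overhead, and the function names are illustrative.

```cpp
// Recursive step: the two subproblems become child tasks, and taskwait
// blocks until both have finished. For small n the if clause is false,
// so the task body is executed immediately instead of being deferred.
long long fib_task(int n) {
    if (n < 2) {
        return n;
    }
    long long a = 0;
    long long b = 0;
    #pragma omp task shared(a) if(n > 20)
    a = fib_task(n - 1);
    #pragma omp task shared(b) if(n > 20)
    b = fib_task(n - 2);
    #pragma omp taskwait
    return a + b;
}

// Entry point: a team is created, one thread generates the root of the
// task tree, and the remaining threads pick up tasks as they appear.
long long fib(int n) {
    long long result = 0;
    #pragma omp parallel
    {
        #pragma omp single
        result = fib_task(n);
    }
    return result;
}
```

The `shared(a)` and `shared(b)` clauses matter: local variables are firstprivate in a task by default, so without them the results would be written to copies and lost.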

  30. Functions • omp_set_num_threads, omp_get_max_threads • number of threads used for parallel regions without num_threads clause • omp_get_num_threads • number of threads in the team • omp_get_thread_num • number of calling thread within the team • 0 = master • omp_get_num_procs • number of processors available to the program

  31. Functions – cont. • omp_in_parallel • checks if the caller is in active parallel region • active region is region without if or if the condition was true • omp_set_dynamic, omp_get_dynamic • dynamic adjustment of thread number • on/off • omp_set_nested, omp_get_nested • nested parallelism • on/off

  32. Locks • simple and nestable • omp_lock_t, omp_nest_lock_t • omp_init_lock, omp_init_nest_lock • initializes the lock • omp_destroy_lock, omp_destroy_nest_lock • uninitializes • must be unlocked • omp_set_lock, omp_set_nest_lock • must be initialized • locks the lock • blocks until the lock is acquired • omp_unset_lock, omp_unset_nest_lock • must be locked and owned by the calling thread • unlocks • omp_test_lock, omp_test_nest_lock • like set but does not block
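
A sketch of the simple lock routines wrapped in a small class, so that the lock is always initialized before use and destroyed afterwards. `SimpleLock` and `guarded_sum` are hypothetical names; the fallback class exists only so the code also builds without OpenMP.

```cpp
#ifdef _OPENMP
#include <omp.h>

// Thin wrapper around omp_lock_t (not part of OpenMP itself).
class SimpleLock {
public:
    SimpleLock()  { omp_init_lock(&lock_); }
    ~SimpleLock() { omp_destroy_lock(&lock_); }   // must be unlocked by now
    void acquire() { omp_set_lock(&lock_); }      // blocks until acquired
    void release() { omp_unset_lock(&lock_); }    // caller must own the lock
private:
    omp_lock_t lock_;
    SimpleLock(const SimpleLock&);                // non-copyable
    SimpleLock& operator=(const SimpleLock&);
};
#else
// Serial build: nothing to protect.
class SimpleLock {
public:
    void acquire() {}
    void release() {}
};
#endif

// Adds 1..n into a shared variable, one thread at a time.
long long guarded_sum(int n) {
    long long total = 0;
    SimpleLock lock;
    #pragma omp parallel for
    for (int i = 1; i <= n; ++i) {
        lock.acquire();
        total += i;
        lock.release();
    }
    return total;
}
```

For a plain sum a reduction is the better tool; the loop is kept trivial only to keep the lock calls visible.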

  33. Timing routines • double omp_get_wtime() • wall clock time in seconds • since “time in the past” • may not be consistent between threads • double omp_get_wtick() • number of seconds between successive clock ticks of the timer used by omp_get_wtime
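
A sketch of timing a parallel loop with `omp_get_wtime`. Both readings are taken by the same thread, outside the parallel region, because the clock may not be consistent between threads. The helper names are illustrative, and the `std::chrono` fallback is an assumption for builds without OpenMP.

```cpp
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#else
#include <chrono>
#endif

// Wall clock time in seconds since some fixed point in the past.
double wall_seconds() {
#ifdef _OPENMP
    return omp_get_wtime();
#else
    typedef std::chrono::steady_clock clock_type;
    return std::chrono::duration<double>(
               clock_type::now().time_since_epoch()).count();
#endif
}

// Fills the vector in parallel and returns the elapsed wall clock time.
double timed_fill(std::vector<int>& data) {
    const int n = static_cast<int>(data.size());
    const double start = wall_seconds();
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        data[i] = i;
    }
    return wall_seconds() - start;
}
```

Only differences between two readings are meaningful; the absolute value depends on the implementation.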

  34. Environment variables • OMP_NUM_THREADS • number of threads launched in parallel regions • omp_set_num_threads, omp_get_max_threads • OMP_SCHEDULE • used in loops with schedule(runtime) • "guided,4", "dynamic" • OMP_DYNAMIC • set if implementation may change number of threads • omp_set_dynamic, omp_get_dynamic • true or false • OMP_NESTED • controls nested parallelism • true or false • default is false

  35. Nesting of regions • some limitations • “close nesting” • no #pragma omp parallel nested between the two regions • “work-sharing region” • for, sections, single, (workshare) • work-sharing region may not be closely nested inside a work-sharing, critical, ordered, or master region • barrier region may not be closely nested inside a work-sharing, critical, ordered, or master region • master region may not be closely nested inside a work-sharing region • ordered region may not be closely nested inside a critical region • ordered region must be closely nested inside a loop region (or parallel loop region) with an ordered clause • critical region may not be nested (closely or otherwise) inside a critical region with the same name • note that this restriction is not sufficient to prevent deadlock

  36. OpenMP 4.0 • The newest version (July 2013) • No implementations yet • Thread affinity • proc_bind(master | close | spread) • SIMD support • Explicit loop vectorization (by SSE, AVX, …) • User defined reduction • #pragma omp declare reduction (identifier : typelist : combiner-expr) [initializer-clause] • Atomic operations with sequential consistency (seq_cst)
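
A sketch of a user-defined reduction following the `declare reduction` syntax above; it needs a compiler with OpenMP 4.0 support. The type `MinMax`, the helper `combine`, and the reduction identifier `minmax` are all made up for this example.

```cpp
#include <climits>
#include <vector>

struct MinMax {
    int lo;
    int hi;
};

// Merges two partial results into one.
inline MinMax combine(MinMax a, const MinMax& b) {
    if (b.lo < a.lo) a.lo = b.lo;
    if (b.hi > a.hi) a.hi = b.hi;
    return a;
}

// omp_out and omp_in are the two partial results being combined; each
// private copy (omp_priv) starts from the original variable (omp_orig).
#pragma omp declare reduction(minmax : MinMax : \
        omp_out = combine(omp_out, omp_in)) \
        initializer(omp_priv = omp_orig)

// Finds the smallest and largest element in a single parallel pass.
MinMax find_min_max(const std::vector<int>& data) {
    MinMax r = { INT_MAX, INT_MIN };
    const int n = static_cast<int>(data.size());
    #pragma omp parallel for reduction(minmax : r)
    for (int i = 0; i < n; ++i) {
        MinMax one = { data[i], data[i] };
        r = combine(r, one);
    }
    return r;
}
```

For an empty vector the function returns the initial pair `INT_MAX`, `INT_MIN`.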

  37. OpenMP 4.0 • Accelerator support • Xeon Phi cards, GPUs, … • #pragma omp target – offloads computation • device(idx) • map(variable map) • #pragma omp target update
