OpenMP Martin Kruliš Jiří Dokulil
OpenMP • OpenMP Architecture Review Board • Compaq, HP, Intel, IBM, KAI, SGI, SUN, U.S. Department of Energy,… • http://www.openmp.org • specifications (freely available) • 1.0 – C/C++ and FORTRAN versions • 2.0 – C/C++ and FORTRAN versions • 2.5 – combined C/C++ and FORTRAN • 3.0 – combined C/C++ and FORTRAN • 4.0 – combined C/C++ and FORTRAN (July 2013)
Basics • fork – join model • tailored mostly for large array operations • pragmas • #pragma omp … • only a few constructs • programs should run without OpenMP • possible but not enforced • #ifdef _OPENMP
Simple example
#define N 1024*1024
int* data = new int[N];
for (int i = 0; i < N; ++i) {
    data[i] = i;
}
Simple example – cont.
#define N 1024*1024
int* data = new int[N];
#pragma omp parallel for
for (int i = 0; i < N; ++i) {
    data[i] = i;
}
Another example
int sum;
#pragma omp parallel for
for (int i = 0; i < N; ++i) {
    sum += data[i];
}
WRONG – the threads update the shared variable sum without synchronization
Variable scope • shared • one instance for all threads • private • one instance for each thread • reduction • special variant for reduction operations • valid within lexical extent • no effect in called functions
Variable scope – private • default for loop control variable • only for the parallelized loop • should (probably always) be made private • all loops in Fortran • all variables declared within the parallelized block • all non-static variables in called functions • allocated on stack – private for each thread • uninitialized values • at start of the block and after the block • except for classes • default constructor (must be accessible) • may not be shared among the threads
Variable scope – private
int j;
#pragma omp parallel for private(j)
for (int i = 0; i < N/2; ++i) {
    j = i * 2;
    data[j] = i;
    data[j+1] = i;
}
Variable scope – reduction • performing e.g. a sum of an array • cannot use only a private variable • shared requires explicit synchronization • a combination of the two is possible and (relatively) efficient, but unnecessarily complex • each thread works on a private copy • initialized to a default value (0 for +, 1 for *, …) • final results are joined and available to the master thread
Variable scope – reduction
long long sum = 0;
#pragma omp parallel for reduction(+:sum)
for (int i = 0; i < N; ++i) {
    sum += data[i];
}
Variable scope – firstprivate and lastprivate • values of private variables at the start of the block and after end of the block are undefined • firstprivate • all values are initialized to the value of the master thread • lastprivate • variable after the parallelized block is set to the value of the last iteration (last in the serial version)
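The lastprivate behavior described above can be sketched as follows; a minimal example with a hypothetical helper name (fill_squares), assuming a compiler with OpenMP support (the pragma is simply ignored otherwise):

```cpp
// Hypothetical sketch: lastprivate(last) makes the value from the
// serially-last iteration (i == n-1) visible after the loop.
int fill_squares(int* data, int n) {
    int last = -1;
    #pragma omp parallel for lastprivate(last)
    for (int i = 0; i < n; ++i) {
        data[i] = i * i;
        last = data[i];      // each thread writes its private copy
    }
    return last;             // equals (n-1)*(n-1), as in serial execution
}
```

With firstprivate instead, each thread's copy of last would start at -1 rather than being undefined.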
parallel • #pragma omp parallel • launches threads and executes the block in parallel • modifiers • if (scalar expression) • variable scope modifiers (including reduction) • num_threads • especially useful in conjunction with omp_get_thread_num
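A minimal sketch of the parallel construct with num_threads (hypothetical helper name; when compiled without OpenMP the pragma is ignored and the block runs once):

```cpp
// Hypothetical sketch: counts how many threads execute the parallel block.
// With OpenMP enabled this returns 4; compiled serially it returns 1.
int count_workers() {
    int n = 0;
    #pragma omp parallel num_threads(4)
    {
        #pragma omp atomic
        n += 1;              // one increment per team member
    }
    return n;
}
```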
Loop-level parallelism • #pragma omp parallel for • launches threads and executes the loop in parallel • can be nested • #pragma omp for • parallel loop within another parallel block • no (direct) nesting • “simple” for expression • implicit barrier at the end
Loop-level parallelism – modifiers 1 • variable scope modifiers • nowait – removes barrier • cannot be used with #pragma omp parallel for • ordered • loop (or called function) may contain block marked #pragma omp ordered • such block is executed in the same order as in serial execution of the loop • at most one such block may exist
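The ordered modifier described above can be sketched like this (hypothetical helper name, assuming OpenMP support; serially the pragmas are ignored and the order is trivially preserved):

```cpp
#include <vector>

// Hypothetical sketch: the ordered block runs in serial iteration order
// even though the rest of the loop body executes in parallel.
std::vector<int> ordered_collect(int n) {
    std::vector<int> out;
    #pragma omp parallel for ordered
    for (int i = 0; i < n; ++i) {
        int value = i * i;       // may be computed out of order
        #pragma omp ordered
        out.push_back(value);    // appended in order 0, 1, 2, ...
    }
    return out;
}
```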
Loop-level parallelism – modifiers 2 • schedule • schedule(static[, chunk_size]) • round robin • no chunk size → iterations divided into chunks of (roughly) equal size, one per thread • schedule(dynamic[, chunk_size]) • threads request chunks • default chunk size is 1 • schedule(guided[, chunk_size]) • like dynamic, with chunk sizes proportional to the amount of remaining work, but at least chunk_size • default chunk size is 1 • auto • selected by the implementation • runtime • uses the schedule stored in the internal variable run-sched-var (set e.g. via OMP_SCHEDULE)
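A minimal sketch of the schedule clause (hypothetical helper name); dynamic scheduling with chunk size 2 suits loops where iterations have uneven cost, and the result is the same for any schedule:

```cpp
// Hypothetical sketch: idle threads grab the next chunk of 2 iterations.
long weighted_sum(const int* data, int n) {
    long sum = 0;
    #pragma omp parallel for schedule(dynamic, 2) reduction(+:sum)
    for (int i = 0; i < n; ++i) {
        sum += data[i];      // reduction keeps the update race-free
    }
    return sum;
}
```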
Parallel sections • #pragma omp sections • #pragma omp section • #pragma omp section • … • several blocks of code that should be evaluated in parallel • modifiers • private, firstprivate, lastprivate, reduction • nowait
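The sections construct above can be sketched as two independent computations running concurrently (hypothetical helper name; compiled without OpenMP the sections simply run one after another):

```cpp
// Hypothetical sketch: minimum and maximum of an array are found by
// different threads, one per section.
void min_max(const int* data, int n, int* lo, int* hi) {
    #pragma omp parallel sections
    {
        #pragma omp section
        {
            int m = data[0];
            for (int i = 1; i < n; ++i) if (data[i] < m) m = data[i];
            *lo = m;         // each section writes a distinct location
        }
        #pragma omp section
        {
            int m = data[0];
            for (int i = 1; i < n; ++i) if (data[i] > m) m = data[i];
            *hi = m;
        }
    }
}
```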
Single • #pragma omp single • code is executed by only one thread of the team • modifiers • private, firstprivate • nowait • when not used, there is a barrier at the end of the block • copyprivate • final value of the variable is distributed to all threads in the team after the block is executed • incompatible with nowait
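A minimal sketch of single with copyprivate (hypothetical helper name): one thread produces a value and the implicit barrier plus copyprivate broadcast it to every thread's private copy.

```cpp
// Hypothetical sketch: returns 1 if every thread in the team observed
// the value produced by the single thread. Serially this is trivially 1.
int broadcast_ok(int input) {
    int ok = 1;
    #pragma omp parallel
    {
        int seed = 0;                    // private to each thread
        #pragma omp single copyprivate(seed)
        seed = input * 2;                // executed by exactly one thread
        if (seed != input * 2) {         // every thread checks its own copy
            #pragma omp critical
            ok = 0;
        }
    }
    return ok;
}
```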
Workshare • Fortran only…
SUBROUTINE A11_1(AA, BB, CC, DD, EE, FF, N)
INTEGER N
REAL AA(N,N), BB(N,N), CC(N,N), DD(N,N), EE(N,N), FF(N,N)
!$OMP PARALLEL
!$OMP WORKSHARE
AA = BB
CC = DD
EE = FF
!$OMP END WORKSHARE
!$OMP END PARALLEL
END SUBROUTINE A11_1
Master • #pragma omp master • executed only by the master thread
Critical section • #pragma omp critical [name] • the well-known critical section • at most one thread can execute a critical section with a certain name • multiple pragmas with the same name form one section • names have external linkage • all unnamed pragmas form one section
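A minimal sketch of a named critical section (hypothetical helper and section names): the common check-then-recheck pattern peeks at the shared maximum without locking and only enters the critical section when an update looks necessary.

```cpp
// Hypothetical sketch: a named critical section serializes updates to a
// shared maximum; `best` only grows, so a stale read never loses an update.
int parallel_max(const int* data, int n) {
    int best = data[0];
    #pragma omp parallel for
    for (int i = 1; i < n; ++i) {
        if (data[i] > best) {            // unsynchronized peek (may be stale)
            #pragma omp critical(maxlock)
            if (data[i] > best)          // re-check inside the section
                best = data[i];
        }
    }
    return best;
}
```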
Barrier • #pragma omp barrier • no associated block of code • some restrictions on placement – e.g. the following is invalid, because the barrier is the immediate substatement of the if:
if (a < 10)
    #pragma omp barrier
{
    do_something();
}
Atomic • #pragma omp atomic • followed by an expression in one of the forms • x op= expr • op is one of +, *, -, /, &, ^, |, <<, or >> • expr must not reference x • x++ • ++x • x-- • --x
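A minimal sketch of the atomic construct (hypothetical helper name): the single update is in the x op= expr form listed above, which is cheaper than a full critical section.

```cpp
// Hypothetical sketch: atomic protects each increment of a shared
// histogram bin; different bins can be updated concurrently.
void histogram(const int* data, int n, int* bins, int nbins) {
    for (int b = 0; b < nbins; ++b) bins[b] = 0;
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        #pragma omp atomic
        bins[data[i] % nbins] += 1;  // x op= expr, expr does not reference x
    }
}
```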
Flush • #pragma omp flush (variable list) • make thread’s view of variables consistent with the main memory • variable list may be omitted, flushes all • similar to volatile in C/C++ • influences memory operation reordering that can be performed by the compiler • cannot move read/write of the flushed variable to the other “side” of the flush operation • all values of flushed variables are saved to the memory before flush finishes • first read of flushed variable after flush is performed from the main memory • same placement restrictions as barrier
threadprivate • #pragma omp threadprivate(list) • makes global variable private for each thread • complex restrictions
copyin, copyprivate • copyin(list) • copies the value of a threadprivate variable from the master thread to the other members of the team • used as a modifier of #pragma omp parallel • values are copied at the start of the block • copyprivate(list) • copies the value from one thread’s threadprivate variable to all other members of the team • used as a modifier of #pragma omp single • values are copied at the end of the block
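The threadprivate and copyin mechanics above can be sketched as follows (hypothetical variable and helper names): the global becomes per-thread, and copyin seeds every copy from the master's value when the region starts.

```cpp
// Hypothetical sketch: `counter` is a global, made per-thread by
// threadprivate; copyin(counter) copies the master's value (5) to all
// team members at region entry, so every thread's copy becomes 6.
int counter = 100;
#pragma omp threadprivate(counter)

int run_with_copyin() {
    counter = 5;                     // the master thread's copy
    int seen = 1;
    #pragma omp parallel copyin(counter)
    {
        counter += 1;                // each thread bumps its own copy
        if (counter != 6) {          // all copies started from 5
            #pragma omp critical
            seen = 0;
        }
    }
    return seen;                     // 1 if every thread saw the seeded value
}
```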
Task • new in OpenMP 3.0 • #pragma omp task • piece of code to be executed in parallel • immediately or later • if clause forces immediate execution when false • tied or untied (to a thread) • can be suspended, e.g. by launching nested task • modifiers • default, private, firstprivate, shared • untied • if
Task scheduling points • after explicit generation of a task • after the last instruction of a task region • taskwait region • in implicit and explicit barriers • (almost) anywhere in untied tasks
Taskwait • #pragma omp taskwait • wait for completion of all child tasks generated since the start of the current task
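The task and taskwait constructs can be sketched with the classic naive Fibonacci (hypothetical helper names; without OpenMP the pragmas are ignored and the recursion runs serially with the same result):

```cpp
// Hypothetical sketch: each call spawns two child tasks and waits for
// them with taskwait before combining the results. shared(a)/shared(b)
// is needed because task variables default to firstprivate.
int fib(int n) {
    if (n < 2) return n;
    int a, b;
    #pragma omp task shared(a)
    a = fib(n - 1);                  // child task 1
    #pragma omp task shared(b)
    b = fib(n - 2);                  // child task 2
    #pragma omp taskwait             // wait for both children
    return a + b;
}

int fib_parallel(int n) {
    int result = 0;
    #pragma omp parallel
    #pragma omp single               // one thread creates the root task tree
    result = fib(n);
    return result;
}
```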
Functions • omp_set_num_threads, omp_get_max_threads • number of threads used for parallel regions without num_threads clause • omp_get_num_threads • number of threads in the team • omp_get_thread_num • number of calling thread within the team • 0 = master • omp_get_num_procs • number of processors available to the program
Functions – cont. • omp_in_parallel • checks if the caller is in active parallel region • active region is region without if or if the condition was true • omp_set_dynamic, omp_get_dynamic • dynamic adjustment of thread number • on/off • omp_set_nested, omp_get_nested • nested parallelism • on/off
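A minimal sketch of the query functions from the two slides above (hypothetical helper name); the #ifdef _OPENMP guard keeps the code valid when compiled without OpenMP, as recommended earlier:

```cpp
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical sketch: reports the team size observed inside a parallel
// region; omp_in_parallel() distinguishes an active region from a
// serialized one.
int threads_in_region() {
#ifdef _OPENMP
    int n = 0;
    #pragma omp parallel
    #pragma omp single
    {
        if (omp_in_parallel())
            n = omp_get_num_threads();   // size of the current team
        else
            n = 1;                       // region was serialized
    }
    return n;
#else
    return 1;                            // compiled without OpenMP
#endif
}
```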
Locks • plain and nested • omp_lock_t, omp_nest_lock_t • omp_init_lock, omp_init_nest_lock • initializes the lock • omp_destroy_lock, omp_destroy_nest_lock • uninitializes • must be unlocked • omp_set_lock, omp_set_nest_lock • must be initialized • locks the lock • blocks until the lock is acquired • omp_unset_lock, omp_unset_nest_lock • must be locked and owned by the calling thread • unlocks • omp_test_lock, omp_test_nest_lock • like set but does not block
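The lock lifecycle above (init → set/unset → destroy) can be sketched like this (hypothetical helper name; the serial branch keeps the function correct when OpenMP is absent):

```cpp
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical sketch: an explicit plain lock protecting a shared
// accumulator. A reduction would be faster; this only illustrates the API.
long locked_sum(const int* data, int n) {
    long sum = 0;
#ifdef _OPENMP
    omp_lock_t lock;
    omp_init_lock(&lock);            // must be initialized before first use
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        omp_set_lock(&lock);         // blocks until the lock is acquired
        sum += data[i];
        omp_unset_lock(&lock);       // only the owning thread may unset
    }
    omp_destroy_lock(&lock);         // must be unlocked when destroyed
#else
    for (int i = 0; i < n; ++i) sum += data[i];
#endif
    return sum;
}
```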
Timing routines • double omp_get_wtime() • wall clock time in seconds • since some “time in the past” • may not be consistent between threads • double omp_get_wtick() • number of seconds between successive clock ticks of the timer used by omp_get_wtime
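A minimal timing sketch (hypothetical helper name); since omp_get_wtime measures from an arbitrary "time in the past", only differences are meaningful. The serial branch substitutes std::clock as an assumption for builds without OpenMP:

```cpp
#ifdef _OPENMP
#include <omp.h>
#endif
#include <ctime>

// Hypothetical sketch: returns the elapsed wall time of a busy loop in
// seconds. The volatile parameter keeps the loop from being optimized away.
double time_work(volatile long iterations) {
#ifdef _OPENMP
    double t0 = omp_get_wtime();
    for (long i = 0; i < iterations; ++i) { }
    return omp_get_wtime() - t0;     // difference of two wall-clock samples
#else
    std::clock_t c0 = std::clock();  // assumption: CPU time as a fallback
    for (long i = 0; i < iterations; ++i) { }
    return double(std::clock() - c0) / CLOCKS_PER_SEC;
#endif
}
```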
Environment variables • OMP_NUM_THREADS • number of threads launched in parallel regions • omp_set_num_threads, omp_get_max_threads • OMP_SCHEDULE • used in loops with schedule(runtime) • "guided,4", "dynamic" • OMP_DYNAMIC • set if the implementation may change the number of threads • omp_set_dynamic, omp_get_dynamic • true or false • OMP_NESTED • controls nested parallelism • true or false • default is false
Nesting of regions • some limitations • “close nesting” • no #pragma omp parallel nested between the two regions • “work-sharing region” • for, sections, single, (workshare) • work-sharing region may not be closely nested inside a work-sharing, critical, ordered, or master region • barrier region may not be closely nested inside a work-sharing, critical, ordered, or master region • master region may not be closely nested inside a work-sharing region • ordered region may not be closely nested inside a critical region • ordered region must be closely nested inside a loop region (or parallel loop region) with an ordered clause • critical region may not be nested (closely or otherwise) inside a critical region with the same name • note that this restriction is not sufficient to prevent deadlock
OpenMP 4.0 • The newest version (July 2013) • No implementations yet • Thread affinity • proc_bind(master | close | spread) • SIMD support • Explicit loop vectorization (by SSE, AVX, …) • User-defined reductions • #pragma omp declare reduction (identifier : typelist : combiner-expr) [initializer-clause] • Atomic operations with sequential consistency (seq_cst)
OpenMP 4.0 • Accelerator support • Xeon Phi cards, GPUs, … • #pragma omp target – offloads computation • device(idx) • map(variable map) • #pragma omp target update