1. Parallel Programming with OpenMP Edward Chrzanowski
December 2003
2. What is Parallel Computing? Parallel computing is when a program uses concurrency to either:
Increase the size of the problem that can be solved or
Decrease the runtime for the solution to a problem
3. Introduction History
What is OpenMP
4. What is OpenMP? OpenMP is:
An Application Program Interface (API) that may be used to explicitly direct multi-threaded, shared memory parallelism
Comprised of 3 primary components
Compiler directives
Runtime library routines
Environmental variables
Portable
Specified for C/C++, F77, F90, F95
Implemented on most Unix platforms and Windows NT
Standardized
Jointly defined and endorsed by major computer vendors
Expected to be an ANSI standard
Definition
Open specifications for Multi Processing via collaborative work between interested parties from the hardware and software industry, government and academia
5. What is OpenMP? OpenMP is not:
Meant for distributed memory parallel systems (by itself)
Necessarily implemented identically by all vendors
Guaranteed to make the most efficient use of shared memory
6. History Ancient History
Early 1990s: vendors supplied directive-based Fortran programming extensions
Implementations were all functionally similar, but were diverging
First attempt at a standard was ANSI X3H5 in 1994
Recent History
OpenMP standard specification started again in 1997
Official web site http://www.openmp.org/
Release History
October 1997: Fortran version 1.0
Late 1998: C/C++ version 1.0
June 2000: Fortran version 2.0
April 2002: C/C++ version 2.0
7. Why OpenMP?
8. OpenMP Programming Model Thread based Parallelism
A shared memory process can consist of multiple threads
Explicit Parallelism
OpenMP is an explicit (not automatic) programming model, offering the programmer full control over parallelization
Fork-join model
OpenMP uses the fork-join model of parallel execution
Compiler Directive Based
OpenMP parallelism is specified through the use of compiler directives which are embedded in the source code
Nested Parallelism support
Supports parallel constructs inside other parallel constructs
Dynamic Threads
Provision for dynamically altering the number of threads which may be used to execute different parallel regions
9. Cont
Fork-Join model
All OpenMP programs begin as a single process, the master thread, which executes sequentially until the first parallel region construct is encountered
FORK: the master thread creates a team of parallel threads
JOIN: when the team of threads completes the statements in a parallel region construct, they synchronize and terminate, leaving only the master thread
10. OpenMP Compiler directives or Pragmas General syntax of directives (Fortran) and pragmas (C, C++)
11. Fortran Directives Source may be either fixed form or free form
In fixed form, a line that begins with one of the following prefix keywords (sentinels):
!$omp
C$omp
*$omp
and contains either a space or a zero in the sixth column is treated as an OpenMP directive by the compiler
A line that begins with one of the above sentinels and contains any other character in the sixth column is treated as a continuation directive line by an OpenMP compiler
12. Fortran Directives Cont
In free form Fortran source, a line that begins with the sentinel
!$omp
is treated as an OpenMP directive. The sentinel may appear in any column so long as it appears as a single word and is preceded by white space
A directive that needs to be continued on the next line is expressed as
!$omp <directive> &
(with the ampersand as the last token on that line)
13. C and C++ Pragmas Pragmas in C and C++ use the following syntax:
#pragma omp
The omp keyword distinguishes the pragma as an OpenMP pragma: it is processed by OpenMP compilers and ignored by others. Application developers can therefore use the same source code base for building both parallel and sequential (serial) versions of an application using just a compile-time flag.
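As a minimal sketch of this single-source approach (not taken from the slides), the _OPENMP macro that conforming OpenMP compilers define can guard calls into the runtime library, so the same file also builds serially without change:

#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

int main(void)
{
    #pragma omp parallel              /* ignored by non-OpenMP compilers */
    {
        int id = 0;                   /* serial build: a single "thread" 0 */
#ifdef _OPENMP
        id = omp_get_thread_num();    /* parallel build: the real thread id */
#endif
        printf("hello from thread %d\n", id);
    }
    return 0;
}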
14. A Simple Loop Saxpy (single-precision a*x plus y)
subroutine saxpy(z, a, x, y, n)
integer i, n
real z(n), a, x(n), y
!$omp parallel do
do i = 1, n
z(i) = a * x(i) + y
enddo
return
end
15. Simple program cont
Notice that the only change we make to the original program is the addition of the parallel do directive
The directive must be followed by a do loop construct
An OpenMP compiler will create a set of threads and distribute the iterations of the do loop across those threads for parallel execution
16. OpenMP constructs 5 main categories:
Parallel regions
Worksharing
Data environment
Synchronization
Runtime functions/environment variables
17. Parallel regions You create threads in OpenMP with the omp parallel pragma/directive
Example
double x[1000];
omp_set_num_threads(4);
#pragma omp parallel
{
int ID = omp_get_thread_num();
blah(ID, x);
}
printf("finished\n");
A single copy of x is shared among all threads
18. Parallelism with Parallel Regions Loop-level parallelism is generally considered as fine-grained parallelism and refers to the unit of work executed in parallel
In a loop the typical unit of work is relatively small compared to the program as a whole
For coarser-grained parallelism, the directive pair
!$omp parallel
!$omp end parallel
will define the region to be parallelized
The parallel/end parallel directive pair is a control structure that forks a team of parallel threads with individual data environments to execute the enclosed code concurrently
19. Some details Dynamic mode (default mode)
Number of threads used in a parallel region can vary from one parallel region to another
Setting the number of threads only sets the maximum number of threads; you may get fewer
Static mode
The number of threads is fixed and controlled by the programmer
Nested parallel regions
A compiler can choose to serialize the nested parallel region (i.e. use a team with only one thread)
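A small sketch of controlling these modes through the runtime library (the exact thread count printed is implementation-dependent when dynamic mode is on):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    omp_set_dynamic(0);        /* static mode: use exactly the requested count */
    omp_set_num_threads(4);    /* request 4 threads for later parallel regions */

    #pragma omp parallel
    {
        #pragma omp master
        printf("team size = %d\n", omp_get_num_threads());
    }

    omp_set_dynamic(1);        /* dynamic mode: the runtime may hand out fewer */
    return 0;
}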
20. Work Sharing Constructs #pragma omp for
The for construct splits up loop iterations among the threads in a team
#pragma omp parallel
#pragma omp for
for (I=0;I<N;I++){
SOME_STUFF(I);
}
Note that by default there is a barrier at the end of the omp for; the nowait clause turns off that barrier (see the sketch below)
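A short sketch of two work-shared loops, the first with nowait (the array names and loop bodies are illustrative only); dropping the barrier is safe here only because the second loop does not read what the first one writes:

#include <math.h>

#define N 1000

void process(double *a, double *b)
{
    #pragma omp parallel
    {
        /* Iterations are split among the team; nowait removes the implied
           barrier, so threads move on as soon as their share is done. */
        #pragma omp for nowait
        for (int i = 0; i < N; i++)
            a[i] = sqrt((double) i);

        #pragma omp for
        for (int i = 0; i < N; i++)
            b[i] = 2.0 * i;
    }
}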
21. Schedule clause The schedule clause affects how loop iterations are mapped onto threads (see the sketch after this list)
Schedule(static [,chunk])
Deal out blocks of iterations of size chunk to each thread
Schedule(dynamic[,chunk])
Each thread grabs chunk iterations off a queue until all iterations have been handled
Schedule(guided[,chunk])
Threads dynamically grab blocks of iterations. The size of the block starts large and shrinks down to size chunk as the calculation proceeds
Schedule(runtime)
Schedule and chunk size taken from the OMP_SCHEDULE environment variable
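The schedule variants above can be sketched as follows (function and array names are illustrative):

#define N 1600

void scale(double *a)
{
    /* static: blocks of 100 iterations are dealt out to the threads round-robin */
    #pragma omp parallel for schedule(static, 100)
    for (int i = 0; i < N; i++)
        a[i] *= 2.0;

    /* dynamic: each thread grabs 16 iterations at a time from a shared queue,
       which helps when iteration costs are uneven */
    #pragma omp parallel for schedule(dynamic, 16)
    for (int i = 0; i < N; i++)
        a[i] += 1.0;
}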
22. Parallel Sections If the serial version of an application performs a sequence of tasks in which none of the later tasks depends on the results of the earlier ones, it may be more beneficial to assign different tasks to different threads
!$omp sections [clause[[,] clause] ...]
#pragma omp sections [clause [clause] ...]
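A minimal sketch of two independent tasks run as sections (the task bodies here are just placeholders):

#include <stdio.h>
#include <omp.h>

void independent_tasks(void)
{
    #pragma omp parallel sections
    {
        #pragma omp section
        printf("task A on thread %d\n", omp_get_thread_num());

        #pragma omp section
        printf("task B on thread %d\n", omp_get_thread_num());
    }   /* implied barrier at the end of the sections construct */
}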
23. Combined work sharing constructs !$omp parallel do
#pragma omp parallel for
#pragma omp parallel sections
24. Data Environment Shared memory programming model
Most variables shared by default
Global variables are shared among threads
Fortran: common blocks, SAVE variables, MODULE variables
C: File scope variables, static
Not everything is shared
Stack variables in sub-programs called from parallel regions are PRIVATE
Automatic variables within statement blocks are PRIVATE
25. Changing Storage Attributes One can selectively change storage attributes using the following clauses, which apply to the lexical extent of the OpenMP construct:
Shared
Private
Firstprivate
Threadprivate
The value of a private inside a parallel loop can be transmitted to a global value outside the loop with a lastprivate
The default status can be modified with:
DEFAULT (PRIVATE|SHARED|NONE)
26. Cont
PRIVATE (var) creates a local copy of var for each thread
The value is uninitialized
Private copy is not storage associated with the original
I=0
C$OMP PARALLEL DO PRIVATE (I)
DO 1000 J=1,100
I=I+1
1000 CONTINUE
PRINT *,I
The private copy of I is not initialized inside the DO loop
Regardless of initialization, I is undefined at the PRINT statement
27. Firstprivate clause Special case of private
Initializes each private copy with the corresponding value from the master thread
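A small sketch of firstprivate (the variable name is illustrative):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    int offset = 100;                  /* master thread's value before the region */

    /* Each thread gets a private copy of offset initialized to 100;
       with plain private() the copies would start out uninitialized. */
    #pragma omp parallel firstprivate(offset)
    {
        offset += omp_get_thread_num();
        printf("thread %d sees offset %d\n", omp_get_thread_num(), offset);
    }

    printf("master still sees %d\n", offset);   /* unchanged: 100 */
    return 0;
}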
28. Threadprivate clause Makes global data private to a thread
COMMON blocks in Fortran
File scope and static variables in C
Threadprivate variables can be initialized using COPYIN or by using DATA statements
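A C sketch of threadprivate with copyin (the variable name is illustrative):

#include <stdio.h>
#include <omp.h>

int counter = 0;                       /* file-scope (global) variable...   */
#pragma omp threadprivate(counter)     /* ...made private to each thread    */

int main(void)
{
    counter = 10;                      /* master thread's copy */

    /* copyin initializes every thread's copy from the master's value */
    #pragma omp parallel copyin(counter)
    {
        counter += omp_get_thread_num();
        printf("thread %d: counter = %d\n", omp_get_thread_num(), counter);
    }
    return 0;
}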
29. Reduction clause Reduction (op: list)
The variables in list must be shared in the enclosing parallel region
Inside a parallel or a worksharing construct:
A local copy of each list variable is made and initialized depending on the op (e.g. +, *, -)
Updates inside the construct apply the op pairwise to the local copy
Local copies are reduced into a single global copy at the end of the construct
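A sketch of a sum reduction (array and variable names are illustrative):

#include <stdio.h>

#define N 1000

int main(void)
{
    double a[N], sum = 0.0;

    for (int i = 0; i < N; i++)
        a[i] = (double) i;

    /* Each thread accumulates into a private copy of sum (initialized to 0
       for the + operator); the copies are combined when the loop ends. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += a[i];

    printf("sum = %f\n", sum);
    return 0;
}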
30. Synchronization OpenMP has the following constructs to support synchronization:
Atomic
Barrier
Critical section
Flush
Master
Ordered
Single
31. Critical section !$omp critical
!$omp end critical
Only one critical section is allowed to execute at one time anywhere in the program. It is equivalent to a global lock on the program
It is illegal to branch into or jump out of a critical section
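The C form of the same idea, sketched with an illustrative shared counter:

#include <stdio.h>
#include <omp.h>

int main(void)
{
    int hits = 0;

    #pragma omp parallel
    {
        /* Only one thread at a time may execute the critical block, so the
           read-modify-write of the shared counter is safe. */
        #pragma omp critical
        {
            hits += 1;
            printf("thread %d raised hits to %d\n", omp_get_thread_num(), hits);
        }
    }
    printf("total hits = %d\n", hits);
    return 0;
}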
32. Atomic Is a special case of a critical section that can be used for certain simple statements
It applies only to the update of a memory location
!$omp atomic
Can be applied only if the critical section consists of a single assignment statement that updates a scalar variable
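A sketch of atomic protecting a single shared update (the histogram arrays are illustrative, and bucket_of[i] is assumed to be a valid index):

#define N 100000

void histogram(const int *bucket_of, int *count)
{
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        /* atomic guards just this one memory update, which is usually
           cheaper than a full critical section */
        #pragma omp atomic
        count[bucket_of[i]] += 1;
    }
}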
33. Barrier Each thread waits until all threads arrive
#pragma omp barrier
Simple directive that can be used to ensure that a piece of work has been completed before moving on to the next phase
34. Ordered Enforces the sequential order for a block
35. Master Denotes a structured block that is only executed by the master thread. The other threads just skip it (no implied barriers or flushes).
Used in parallel regions
36. Single Denotes a block of code that is executed by only one thread
A barrier and a flush are implied at the end of the single block
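A short sketch contrasting master and single inside one parallel region:

#include <stdio.h>
#include <omp.h>

int main(void)
{
    #pragma omp parallel
    {
        #pragma omp master
        printf("master thread only, no barrier afterwards\n");

        /* Exactly one thread (not necessarily the master) runs this block;
           the others wait at its implied barrier. */
        #pragma omp single
        printf("single: thread %d did the I/O\n", omp_get_thread_num());

        printf("thread %d continues\n", omp_get_thread_num());
    }
    return 0;
}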
37. Flush Denotes a sequence point where a thread tries to create a consistent view of memory
All memory operations (both reads and writes) defined prior to the sequence must complete
All memory operations defined after the sequence point must follow the flush
Variables in registers or write buffers must be updated in memory
Arguments to flush specify which variables are flushed. No arguments specifies that all thread visible variables are flushed
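A sketch of the classic producer/consumer handshake built from flush (variable names are illustrative; later OpenMP versions offer higher-level ways to do this):

#include <stdio.h>

int main(void)
{
    int data = 0, flag = 0;

    #pragma omp parallel sections shared(data, flag) num_threads(2)
    {
        #pragma omp section           /* producer */
        {
            data = 42;
            #pragma omp flush(data)   /* publish data before raising the flag */
            flag = 1;
            #pragma omp flush(flag)
        }
        #pragma omp section           /* consumer */
        {
            int seen = 0;
            while (!seen) {           /* spin until the flag becomes visible */
                #pragma omp flush(flag)
                seen = flag;
            }
            #pragma omp flush(data)   /* make sure the published data is read */
            printf("received %d\n", data);
        }
    }
    return 0;
}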
38. Runtime functions and library routines Lock routines
omp_init_lock(), omp_set_lock(), omp_unset_lock(), omp_test_lock()
Runtime environment routines:
Modify/check the number of threads
omp_set_num_threads(), omp_get_num_threads(), omp_get_thread_num(), omp_get_max_threads()
Turn on/off nesting and dynamic mode
omp_set_nested(), omp_set_dynamic(), omp_get_nested(), omp_get_dynamic()
Are we in a parallel region?
omp_in_parallel()
How many processors are in the system?
omp_get_num_procs()
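A sketch of the lock routines protecting a shared counter (the variable names are illustrative); omp_destroy_lock() releases the lock's resources when it is no longer needed:

#include <stdio.h>
#include <omp.h>

int main(void)
{
    omp_lock_t lock;
    int total = 0;

    omp_init_lock(&lock);

    #pragma omp parallel
    {
        /* The lock serializes access to the shared counter, much like a
           critical section, but its lifetime is under program control. */
        omp_set_lock(&lock);
        total += omp_get_thread_num();
        omp_unset_lock(&lock);
    }

    omp_destroy_lock(&lock);
    printf("total = %d\n", total);
    return 0;
}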
39. Performance improvements The compiler listing gives many useful clues for improving the performance
Loop optimization tables
Reports about data dependencies
Explanations about applied transformations
The annotated, transformed code
Calling tree
Performance statistics
The type of reports to be included in the listing can be set through compiler options
40. Tuning Automatically Parallelized code Task is similar to explicit parallel programming
Two important differences:
The compiler gives hints in its listing, which may tell you where to focus attention (e.g. which variables have data dependencies)
You do not need to perform all transformations by hand. If you expose the right information to the compiler, it will do the transformation for you (e.g. C$assert independent)
41. Cont
Hand improvements can pay off because:
Compiler techniques are limited (e.g. array reductions are parallelized by only a few compilers)
Compilers may have insufficient information (e.g. the loop iteration range may be input data, and variables may be defined in other subroutines)
42. Performance Tuning Use the following methodology:
Use compiler-parallelized code as a starting point
Get loop profile and compiler listing
Inspect time-consuming loops (biggest potential for improvement)
43. SMP Programming Errors Shared memory parallel programming
Saves the programmer from having to map data onto multiple processors
It opens up a range of new errors coming from unanticipated shared resource conflicts
44. Two Major Errors Race conditions
The outcome of a program depends on the detailed timing of the threads in the team
Deadlock
Threads lock up waiting on a locked resource that will never become free
45. OpenMP traps Are you using threadsafe libraries?
I/O inside a parallel region can interleave unpredictably
Make sure you understand what your constructors are doing with private objects
Private variables can mask globals
Understand when shared memory is coherent
When in doubt, use FLUSH
NOWAIT removes implied barriers
46. How to avoid the Traps Analyze your code to make sure every semantically permitted interleaving of the threads yields the correct results
Can be prohibitively difficult due to the explosion of possible interleavings
Write SMP code that is portable and equivalent to the sequential form
Use a safe subset of OpenMP
Follow a set of rules for sequential equivalence
47. Strong Sequential Equivalence Rules Control data scope with the base language
Avoid data scope clauses
Only use private for scratch variables local to a block whose global initializations do not matter
Locate all cases where a shared variable can be written by multiple threads
The access to the variable must be protected
If multiple threads combine results into a single value, enforce sequential order
Do not use the reduction clause
48. Conclusion OpenMP is:
A great way to write fast executing code
But it can also expose you to special, painful errors (e.g. race conditions)
Tools and/or a discipline of writing portable sequentially equivalent programs can help
49. Some assignments A couple of simple assignments
1.) write a multi-threaded Hello World program where:
Each thread prints a simple message (e.g. hello world)
What do the results tell you about I/O with multiple threads?
2.) write a multi-threaded pi program using the following serial/sequential program
Do it as an SPMD program using a parallel region only
Do it with a work sharing construct
Make sure multiple threads do not overwrite each other's variables
50. Pi program static long num_steps = 100000;
double step;
int main(void)
{ int i; double x, pi, sum=0.0;
step = 1.0/(double) num_steps;
for (i=1;i<=num_steps; i++){
x= (i-0.5)*step;
sum = sum + 4.0/(1.0+x*x);
}
pi = step*sum;
}
51. OpenMP - future In the hands of the Architectural Review Board (the ARB)
HP, Intel, Sun, SGI, DOE ASCI
ARB resolves interpretation issues and manages the evolution of new OpenMP APIs
Membership in the ARB is open to any organization with a stake in OpenMP
52. References http://www.openmp.org/
Parallel Programming in OpenMP, Morgan Kaufmann Publishers
Parallel Programming in C with MPI and OpenMP, McGraw-Hill Publishers