1. Parallel Programming with OpenMP Edward Chrzanowski
December 2003
2. What is Parallel Computing? Parallel computing is when a program uses concurrency to either:
Increase the size of the problem that can be solved or
Decrease the runtime for the solution to a problem
3. Introduction History
What is OpenMP
4. What is OpenMP? OpenMP is:
An Application Program Interface (API) that may be used to explicitly direct multi-threaded, shared memory parallelism
Comprised of 3 primary components
Compiler directives
Runtime library routines
Environmental variables
Portable
Specified for C/C++, F77, F90, F95
Implemented on most Unix platforms and Windows NT
Standardized
Jointly defined and endorsed by major computer vendors
Expected to be an ANSI standard
Definition
Open specifications for Multi Processing via collaborative work between interested parties from the hardware and software industry, government and academia
5. What is OpenMP? OpenMP is not:
Meant for distributed memory parallel systems (by itself)
Necessarily implemented identically by all vendors
Guaranteed to make the most efficient use of shared memory
6. History Ancient History
Early 1990s: vendors supplied directive-based Fortran programming extensions
Implementations were all functionally similar, but were diverging
First attempt at a standard was ANSI X3H5 in 1994
Recent History
OpenMP standard specification started again in 1997
Official web site http://www.openmp.org/
Release History
October 1997: Fortran version 1.0
Late 1998: C/C++ version 1.0
June 2000: Fortran version 2.0
April 2002: C/C++ version 2.0
7. Why OpenMP?
8. OpenMP Programming Model Thread based Parallelism
A shared memory process can consist of multiple threads
Explicit Parallelism
OpenMP is an explicit (not automatic) programming model, offering the programmer full control over parallelization
Fork-join model
OpenMP uses the fork-join model of parallel execution
Compiler Directive Based
OpenMP parallelism is specified through the use of compiler directives which are embedded in the source code
Nested Parallelism support
Supports parallel constructs inside other parallel constructs
Dynamic Threads
Provision for dynamically altering the number of threads which may be used to execute different parallel regions
9. Cont
Fork-Join model
All OpenMP programs begin as a single process, the master thread, which executes sequentially until the first parallel region construct is encountered
FORK: the master thread creates a team of parallel threads
JOIN: when the team of threads completes the statements in a parallel region construct, they synchronize and terminate, leaving only the master thread
10. OpenMP Compiler directives or Pragmas General syntax of directives (Fortran) and pragmas (C, C++)
11. Fortran Directives Source may be either fixed form or free form
In fixed form, a line that begins with one of the following prefix keywords (sentinels):
!$omp
C$omp
*$omp
and contains either a space or a zero in the sixth column is treated as an OpenMP directive by the compiler
A line that begins with one of the above sentinels and contains any other character in the sixth column is treated as a continuation directive line by an OpenMP compiler
12. Fortran Directives Cont
In free form Fortran source, a line that begins with the sentinel
!$omp
is treated as an OpenMP directive. The sentinel may appear in any column so long as it appears as a single word and is preceded by white space
A directive that needs to be continued on the next line is expressed as
!$omp <directive> &
(with the ampersand as the last token on that line)
13. C and C++ Pragmas Pragmas in C and C++ use the following syntax:
#pragma omp
The omp keyword distinguishes the pragma as an OpenMP pragma: it is processed by OpenMP compilers and ignored by others. Application developers can therefore use the same source code base for building both parallel and sequential (serial) versions of an application using just a compile-time flag.
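As a minimal sketch of this single-source approach (not taken from the slides), the _OPENMP macro that conforming OpenMP compilers define can guard calls into the runtime library, so the same file also builds serially without change:

#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

int main(void)
{
    #pragma omp parallel              /* ignored by non-OpenMP compilers */
    {
        int id = 0;                   /* serial build: a single "thread" 0 */
#ifdef _OPENMP
        id = omp_get_thread_num();    /* parallel build: the real thread id */
#endif
        printf("hello from thread %d\n", id);
    }
    return 0;
}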
14. A Simple Loop Saxpy (single-precision a*x plus y)
subroutine saxpy(z, a, x, y, n)
integer i, n
real z(n), a, x(n), y
!$omp parallel do
do i = 1, n
z(i) = a * x(i) + y
enddo
return
end
15. Simple program cont
Notice that the only change we make to the original program is the addition of the parallel do directive
The directive must be followed by a do loop construct
An OpenMP compiler will create a set of threads and distribute the iterations of the do loop across those threads for parallel execution
16. OpenMP constructs 5 main categories:
Parallel regions
Worksharing
Data environment
Synchronization
Runtime functions/environment variables
17. Parallel regions You create threads in OpenMP with the omp parallel pragma/directive
Example
double x[1000];
omp_set_num_threads(4);
#pragma omp parallel
{
int ID = omp_get_thread_num();
blah(ID, x);
}
printf("finished\n");
A single copy of x is shared among all threads
18. Parallelism with Parallel Regions Loop-level parallelism is generally considered as fine-grained parallelism and refers to the unit of work executed in parallel
In a loop the typical unit of work is relatively small compared to the program as a whole
For coarser-grained parallelism, the directive pair
!$omp parallel
!$omp end parallel
will define the region to be parallelized
The parallel/end parallel directive pair is a control structure that forks a team of parallel threads with individual data environments to execute the enclosed code concurrently
19. Some details Dynamic mode (default mode)
Number of threads used in a parallel region can vary from one parallel region to another
Setting the number of threads only sets the maximum number of threads; you may get fewer
Static mode
The number of threads is fixed and controlled by the programmer
Nested parallel regions
A compiler can choose to serialize the nested parallel region (i.e. use a team with only one thread)
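A small sketch of controlling these modes through the runtime library (the exact thread count printed is implementation-dependent when dynamic mode is on):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    omp_set_dynamic(0);        /* static mode: use exactly the requested count */
    omp_set_num_threads(4);    /* request 4 threads for later parallel regions */

    #pragma omp parallel
    {
        #pragma omp master
        printf("team size = %d\n", omp_get_num_threads());
    }

    omp_set_dynamic(1);        /* dynamic mode: the runtime may hand out fewer */
    return 0;
}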
20. Work Sharing Constructs #pragma omp for
The for construct splits up loop iterations among the threads in a team
#pragma omp parallel
#pragma omp for
for (I=0;I<N;I++){
SOME_STUFF(I);
}
Note that by default there is a barrier at the end of the omp for; the nowait clause turns off that barrier (see the sketch below)
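A short sketch of two work-shared loops, the first with nowait (the array names and loop bodies are illustrative only); dropping the barrier is safe here only because the second loop does not read what the first one writes:

#include <math.h>

#define N 1000

void process(double *a, double *b)
{
    #pragma omp parallel
    {
        /* Iterations are split among the team; nowait removes the implied
           barrier, so threads move on as soon as their share is done. */
        #pragma omp for nowait
        for (int i = 0; i < N; i++)
            a[i] = sqrt((double) i);

        #pragma omp for
        for (int i = 0; i < N; i++)
            b[i] = 2.0 * i;
    }
}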
21. Schedule clause The schedule clause affects how loop iterations are mapped onto threads (see the sketch after this list)
Schedule(static [,chunk])
Deal out blocks of iterations of size chunk to each thread
Schedule(dynamic[,chunk])
Each thread grabs chunk iterations off a queue until all iterations have been handled
Schedule(guided[,chunk])
Threads dynamically grab blocks of iterations. The size of the block starts large and shrinks down to size chunk as the calculation proceeds
Schedule(runtime)
Schedule and chunk size taken from the OMP_SCHEDULE environment variable
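The schedule variants above can be sketched as follows (function and array names are illustrative):

#define N 1600

void scale(double *a)
{
    /* static: blocks of 100 iterations are dealt out to the threads round-robin */
    #pragma omp parallel for schedule(static, 100)
    for (int i = 0; i < N; i++)
        a[i] *= 2.0;

    /* dynamic: each thread grabs 16 iterations at a time from a shared queue,
       which helps when iteration costs are uneven */
    #pragma omp parallel for schedule(dynamic, 16)
    for (int i = 0; i < N; i++)
        a[i] += 1.0;
}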
22. Parallel Sections If the serial version of an application performs a sequence of tasks in which none of the later tasks depends on the results of the earlier ones, it may be more beneficial to assign different tasks to different threads
!$omp sections [clause[[,] clause] ...]
#pragma omp sections [clause [clause] ...]
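A minimal sketch of two independent tasks run as sections (the task bodies here are just placeholders):

#include <stdio.h>
#include <omp.h>

void independent_tasks(void)
{
    #pragma omp parallel sections
    {
        #pragma omp section
        printf("task A on thread %d\n", omp_get_thread_num());

        #pragma omp section
        printf("task B on thread %d\n", omp_get_thread_num());
    }   /* implied barrier at the end of the sections construct */
}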
23. Combined work sharing constructs !$omp parallel do
#pragma omp parallel for
#pragma omp parallel sections
24. Data Environment Shared memory programming model
Most variables shared by default
Global variables are shared among threads
Fortran: common blocks, SAVE variables, MODULE variables
C: File scope variables, static
Not everything is shared
Stack variables in sub-programs called from parallel regions are PRIVATE
Automatic variables within statement blocks are PRIVATE
25. Changing Storage Attributes One can selectively change storage attributes using the following clauses, which apply to the lexical extent of the OpenMP construct:
Shared
Private
Firstprivate
Threadprivate
The value of a private inside a parallel loop can be transmitted to a global value outside the loop with a lastprivate
The default status can be modified with:
DEFAULT (PRIVATE|SHARED|NONE)
26. Cont
PRIVATE (var) creates a local copy of var for each thread
The value is uninitialized
Private copy is not storage associated with the original
I=0
C$OMP PARALLEL DO PRIVATE (I)
DO 1000 J=1,100
I=I+1
1000 CONTINUE
PRINT *,I
The private copy of I is not initialized inside the DO loop
Regardless of initialization, I is undefined at the PRINT statement
27. Firstprivate clause Special case of private
Initializes each private copy with the corresponding value from the master thread
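A small sketch of firstprivate (the variable name is illustrative):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    int offset = 100;                  /* master thread's value before the region */

    /* Each thread gets a private copy of offset initialized to 100;
       with plain private() the copies would start out uninitialized. */
    #pragma omp parallel firstprivate(offset)
    {
        offset += omp_get_thread_num();
        printf("thread %d sees offset %d\n", omp_get_thread_num(), offset);
    }

    printf("master still sees %d\n", offset);   /* unchanged: 100 */
    return 0;
}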
28. Threadprivate clause Makes global data private to a thread
COMMON blocks in Fortran
File scope and static variables in C
Threadprivate variables can be initialized using COPYIN or by using DATA statements
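A C sketch of threadprivate with copyin (the variable name is illustrative):

#include <stdio.h>
#include <omp.h>

int counter = 0;                       /* file-scope (global) variable...   */
#pragma omp threadprivate(counter)     /* ...made private to each thread    */

int main(void)
{
    counter = 10;                      /* master thread's copy */

    /* copyin initializes every thread's copy from the master's value */
    #pragma omp parallel copyin(counter)
    {
        counter += omp_get_thread_num();
        printf("thread %d: counter = %d\n", omp_get_thread_num(), counter);
    }
    return 0;
}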
29. Reduction clause Reduction (op: list)
The variables in list must be shared in the enclosing parallel region
Inside a parallel or a worksharing construct:
A local copy of each list variable is made and initialized depending on the op (e.g. +, *, -)
Updates inside the construct apply the op pairwise to the local copy
Local copies are reduced into a single global copy at the end of the construct
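A sketch of a sum reduction (array and variable names are illustrative):

#include <stdio.h>

#define N 1000

int main(void)
{
    double a[N], sum = 0.0;

    for (int i = 0; i < N; i++)
        a[i] = (double) i;

    /* Each thread accumulates into a private copy of sum (initialized to 0
       for the + operator); the copies are combined when the loop ends. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += a[i];

    printf("sum = %f\n", sum);
    return 0;
}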
30. Synchronization OpenMP has the following constructs to support synchronization:
Atomic
Barrier
Critical section
Flush
Master
Ordered
Single
31. Critical section !$omp critical
!$omp end critical
Only one critical section is allowed to execute at one time anywhere in the program. It is equivalent to a global lock on the program
It is illegal to branch into or jump out of a critical section
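The C form of the same idea, sketched with an illustrative shared counter:

#include <stdio.h>
#include <omp.h>

int main(void)
{
    int hits = 0;

    #pragma omp parallel
    {
        /* Only one thread at a time may execute the critical block, so the
           read-modify-write of the shared counter is safe. */
        #pragma omp critical
        {
            hits += 1;
            printf("thread %d raised hits to %d\n", omp_get_thread_num(), hits);
        }
    }
    printf("total hits = %d\n", hits);
    return 0;
}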
32. Atomic Is a special case of a critical section that can be used for certain simple statements
It applies only to the update of a memory location
!$omp atomic
Can be applied only if the critical section consists of a single assignment statement that updates a scalar variable
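A sketch of atomic protecting a single shared update (the histogram arrays are illustrative, and bucket_of[i] is assumed to be a valid index):

#define N 100000

void histogram(const int *bucket_of, int *count)
{
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        /* atomic guards just this one memory update, which is usually
           cheaper than a full critical section */
        #pragma omp atomic
        count[bucket_of[i]] += 1;
    }
}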
33. Barrier Each thread waits until all threads arrive
#pragma omp barrier
Simple directive that can be used to ensure that a piece of work has been completed before moving on to the next phase
34. Ordered Enforces the sequential order for a block
35. Master Denotes a structured block that is only executed by the master thread. The other threads just skip it (no implied barriers or flushes).
Used in parallel regions
36. Single Denotes a block of code that is executed by only one thread
A barrier and a flush are implied at the end of the single block
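A short sketch contrasting master and single inside one parallel region:

#include <stdio.h>
#include <omp.h>

int main(void)
{
    #pragma omp parallel
    {
        #pragma omp master
        printf("master thread only, no barrier afterwards\n");

        /* Exactly one thread (not necessarily the master) runs this block;
           the others wait at its implied barrier. */
        #pragma omp single
        printf("single: thread %d did the I/O\n", omp_get_thread_num());

        printf("thread %d continues\n", omp_get_thread_num());
    }
    return 0;
}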
37. Flush Denotes a sequence point where a thread tries to create a consistent view of memory
All memory operations (both reads and writes) defined prior to the sequence must complete
All memory operations defined after the sequence point must follow the flush
Variables in registers or write buffers must be updated in memory
Arguments to flush specify which variables are flushed. No arguments specifies that all thread visible variables are flushed
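A sketch of the classic producer/consumer handshake built from flush (variable names are illustrative; later OpenMP versions offer higher-level ways to do this):

#include <stdio.h>

int main(void)
{
    int data = 0, flag = 0;

    #pragma omp parallel sections shared(data, flag) num_threads(2)
    {
        #pragma omp section           /* producer */
        {
            data = 42;
            #pragma omp flush(data)   /* publish data before raising the flag */
            flag = 1;
            #pragma omp flush(flag)
        }
        #pragma omp section           /* consumer */
        {
            int seen = 0;
            while (!seen) {           /* spin until the flag becomes visible */
                #pragma omp flush(flag)
                seen = flag;
            }
            #pragma omp flush(data)   /* make sure the published data is read */
            printf("received %d\n", data);
        }
    }
    return 0;
}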
38. Runtime functions and library routines Lock routines
omp_init_lock(), omp_set_lock(), omp_unset_lock(), omp_test_lock()
Runtime environment routines:
Modify/check the number of threads
omp_set_num_threads(), omp_get_num_threads(), omp_get_thread_num(), omp_get_max_threads()
Turn on/off nesting and dynamic mode
omp_set_nested(), omp_set_dynamic(), omp_get_nested(), omp_get_dynamic()
Are we in a parallel region?
omp_in_parallel()
How many processors are in the system?
omp_get_num_procs()
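A sketch of the lock routines protecting a shared counter (the variable names are illustrative); omp_destroy_lock() releases the lock's resources when it is no longer needed:

#include <stdio.h>
#include <omp.h>

int main(void)
{
    omp_lock_t lock;
    int total = 0;

    omp_init_lock(&lock);

    #pragma omp parallel
    {
        /* The lock serializes access to the shared counter, much like a
           critical section, but its lifetime is under program control. */
        omp_set_lock(&lock);
        total += omp_get_thread_num();
        omp_unset_lock(&lock);
    }

    omp_destroy_lock(&lock);
    printf("total = %d\n", total);
    return 0;
}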
39. Performance improvements The compiler listing gives many useful clues for improving the performance
Loop optimization tables
Reports about data dependencies
Explanations about applied transformations
The annotated, transformed code
Calling tree
Performance statistics
The type of reports to be included in the listing can be set through compiler options
40. Tuning Automatically Parallelized code Task is similar to explicit parallel programming
Two important differences:
The compiler gives hints in its listing, which may tell you where to focus attention (e.g. which variables have data dependencies)
You do not need to perform all transformations by hand. If you expose the right information to the compiler, it will do the transformation for you (e.g. C$assert independent)
41. Cont
Hand improvements can pay off because:
Compiler techniques are limited (e.g. array reductions are parallelized by only a few compilers)
Compilers may have insufficient information (e.g. the loop iteration range may be input data, and variables may be defined in other subroutines)
42. Performance Tuning Use the following methodology:
Use compiler-parallelized code as a starting point
Get loop profile and compiler listing
Inspect time-consuming loops (biggest potential for improvement)
43. SMP Programming Errors Shared memory parallel programming
Saves the programmer from having to map data onto multiple processors
It opens up a range of new errors coming from unanticipated shared resource conflicts
44. Two Major Errors Race conditions
The outcome of a program depends on the detailed timing of the threads in the team
Deadlock
Threads lock up waiting on a locked resource that will never become free
45. OpenMP traps Are you using threadsafe libraries?
I/O inside a parallel region can interleave unpredictably
Make sure you understand what your constructors are doing with private objects
Private variables can mask globals
Understand when shared memory is coherent
When in doubt, use FLUSH
NOWAIT removes implied barriers
46. How to avoid the Traps Analyze your code to make sure every semantically permitted interleaving of the threads yields the correct results
Can be prohibitively difficult due to the explosion of possible interleavings
Write SMP code that is portable and equivalent to the sequential form
Use a safe subset of OpenMP
Follow a set of rules for sequential equivalence
47. Strong Sequential Equivalence Rules Control data scope with the base language
Avoid data scope clauses
Only use private for scratch variables local to a block whose global initializations do not matter
Locate all cases where a shared variable can be written by multiple threads
The access to the variable must be protected
If multiple threads combine results into a single value, enforce sequential order
Do not use the reduction clause
48. Conclusion OpenMP is:
A great way to write fast executing code
But it can also expose you to special, painful errors (e.g. race conditions)
Tools and/or a discipline of writing portable sequentially equivalent programs can help
49. Some assignments A couple of simple assignments
1.) write a multi-threaded Hello World program where:
Each thread prints a simple message (e.g. hello world)
What do the results tell you about I/O with multiple threads?
2.) write a multi-threaded pi program using the following serial/sequential program
Do it as an SPMD program using a parallel region only
Do it with a work sharing construct
Make sure multiple threads do not overwrite each other's variables
50. Pi program static long num_steps = 100000;
double step;
int main(void)
{ int i; double x, pi, sum=0.0;
step = 1.0/(double) num_steps;
for (i=1;i<=num_steps; i++){
x= (i-0.5)*step;
sum = sum + 4.0/(1.0+x*x);
}
pi = step*sum;
}
51. OpenMP - future In the hands of the Architectural Review Board (the ARB)
HP, Intel, Sun, SGI, DOE ASCI
ARB resolves interpretation issues and manages the evolution of new OpenMP APIs
Membership in the ARB is open to any organization with a stake in OpenMP
52. References http://www.openmp.org/
Parallel Programming in OpenMP, Morgan Kaufmann Publishers
Parallel Programming in C with MPI and OpenMP, McGraw-Hill Publishers