DISTRIBUTED AND HIGH-PERFORMANCE COMPUTING
CHAPTER 7: SHARED MEMORY PARALLEL PROGRAMMING
Shared Memory Programming Model
• threads of control – work is partitioned among multiple “threads”, each being a single task with a sequential flow of control (just like familiar threads used on sequential machines)
• global shared address space – global data is accessed from a single global address space shared among the threads
• local private address space – each thread can also have its own local private data that is not shared
• parallel operations – different threads can operate on different processors to provide parallel speedup
Shared Memory Programming Model
• asynchronous execution – threads can execute asynchronously, but need to synchronize at certain points, especially in accessing global shared memory locations
• compiler directives – OpenMP standardizes compiler directives for compiling Fortran, C and C++ codes on shared memory machines (parallel and vector)
• distributed shared memory – emulates shared memory programming model on a distributed memory machine
Threads
• A thread is a single sequential flow of control within a program: it has a beginning, an execution sequence, and an end.
• Used to split a program into separate tasks, one per thread, that can execute concurrently.
• On a single processor, threads execute concurrently using time-slicing.
• Can allocate different priorities to threads, to give them more CPU time.
• A fork is used to create a new thread of control that runs a specified procedure with specified arguments.
Threads
• A join waits for a thread to finish and can be used to obtain the result of its procedure.
• However, threads usually operate (and communicate) by processing shared global data values.
• Need a mechanism to avoid problems with multiple threads accessing shared data at the same time. For example, a thread may be in the process of sequentially updating all elements of an array. If another thread accesses some array elements, it may end up with a mix of new (updated) values and old (not yet updated) values.
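The fork and join operations above can be sketched in C. The slides do not name a thread library, so POSIX threads is an assumption here, and `sum_task`, `sum_worker` and `fork_join_sum` are hypothetical names chosen for the illustration.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical argument/result record passed to the forked procedure. */
typedef struct {
    const int *data;  /* input: array to sum (only read, so no locking needed) */
    size_t     n;     /* input: number of elements */
    long       sum;   /* output: filled in by the thread */
} sum_task;

/* The procedure that the new thread runs. */
static void *sum_worker(void *arg)
{
    sum_task *t = (sum_task *)arg;
    t->sum = 0;
    for (size_t i = 0; i < t->n; i++)
        t->sum += t->data[i];
    return t;  /* handed back to whoever joins this thread */
}

/* Fork one thread to sum the array, then join it to collect the result.
   Returns 0 on success, -1 if the thread could not be created or joined. */
int fork_join_sum(const int *data, size_t n, long *out)
{
    sum_task  task = { data, n, 0 };
    pthread_t tid;
    void     *ret = NULL;

    if (pthread_create(&tid, NULL, sum_worker, &task) != 0)  /* fork */
        return -1;
    if (pthread_join(tid, &ret) != 0)                        /* join */
        return -1;
    *out = ((sum_task *)ret)->sum;
    return 0;
}
```

Calling `fork_join_sum(v, 5, &s)` on the array {1, 2, 3, 4, 5} leaves 15 in `s`. The worker only reads the shared array, which is why no synchronization is needed in this particular case.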
Synchronization of Threads
• A mutex is used to specify mutual exclusion of a section of code (called a critical section) that accesses shared data values and can therefore only be executed by one thread at a time.
• A mutex has two states: locked and unlocked. It is initially unlocked. A LOCK statement locks the mutex, executes the code in the critical section, and then unlocks the mutex.
• All shared (mutable) data in a program, i.e. data used in critical sections, must be associated with an appropriate mutex (by the programmer), so locking the mutex locks access to the data. In an O-O language, locking a method automatically locks the object containing the method.
• If another thread attempts to lock a mutex that is already locked, it will block until the mutex is unlocked.
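A minimal sketch of a mutex guarding shared data, again assuming POSIX threads (the slides describe a generic LOCK statement, not this API). The thread count, the increment count and all function names are hypothetical.

```c
#include <pthread.h>

#define N_THREADS    4      /* hypothetical number of threads */
#define N_INCREMENTS 10000  /* hypothetical work per thread */

static long counter = 0;  /* shared data */
/* The mutex associated with counter; it starts out unlocked. */
static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;

static void *add_to_counter(void *unused)
{
    (void)unused;
    for (int i = 0; i < N_INCREMENTS; i++) {
        pthread_mutex_lock(&counter_lock);    /* blocks while another thread holds the lock */
        counter++;                            /* critical section */
        pthread_mutex_unlock(&counter_lock);
    }
    return NULL;
}

/* Runs N_THREADS incrementing threads; returns the final counter value,
   or -1 if not all threads could be created. */
long run_locked_counter(void)
{
    pthread_t tid[N_THREADS];
    int created = 0;

    counter = 0;
    for (; created < N_THREADS; created++)
        if (pthread_create(&tid[created], NULL, add_to_counter, NULL) != 0)
            break;
    for (int t = 0; t < created; t++)
        pthread_join(tid[t], NULL);
    return created == N_THREADS ? counter : -1;
}
```

With the lock in place every increment is applied, so the result is `N_THREADS * N_INCREMENTS`. Removing the lock and unlock calls would allow concurrent updates to be lost.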
Threads in Java
• Java has a special Thread class, i.e. built-in support for threads.
• Programmer must implement a run method that specifies what the thread should do.
• The start method starts the thread, which then executes its run method.
• Independent threads can run asynchronously if they only use local (private) data. To share data they must synchronize.
• Java uses the synchronized keyword to implement mutex locks and synchronization on critical sections.
Threads in Java
• If a method is defined to be synchronized, the object is locked whenever a thread calls that method, and only unlocked (made accessible to other threads) when the method completes.
• Some JDK implementations use native threads, which allow different threads to be run on different processors on multiprocessor SMPs.
Parallel Programming Using Threads
• Can assign different threads to different processors to provide parallel speedup.
• Threads must be “lightweight” for parallel programming, i.e. the operations for creating, running, synchronizing and destroying threads must have low overhead.
• There is also a synchronization overhead for accessing shared data.
• As with message passing, expect good performance if this overhead is small compared with computation time.
• Reduce the amount of synchronization required by using appropriate data structures and programming with parallelism in mind.
• Still have to deal with other parallel computing issues such as load balancing, deadlock, etc.
OpenMP Features
• Standard, portable API (compiler directives, library routines and environment variables) rather than a new language
• Used with Fortran 77, Fortran 90, C or C++
• Like HPF, most of the constructs in OpenMP are compiler directives or pragmas (implementation-specific actions).
    For C and C++:  #pragma omp construct
    For Fortran:    !$OMP construct
• These are ignored by compilers that don’t support OpenMP, so codes can also be run on sequential machines.
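The point that directives are ignored by non-OpenMP compilers can be illustrated with a short C sketch. The function names are hypothetical; `_OPENMP` is the macro that OpenMP-enabled compilers define.

```c
/* Doubles every element of x. The pragma requests a parallel loop; a compiler
   without OpenMP support ignores it and the loop runs sequentially,
   producing the same values. */
void scale_by_two(double *x, int n)
{
    int i;
    #pragma omp parallel for
    for (i = 0; i < n; i++)
        x[i] = 2.0 * x[i];
}

/* Returns 1 if this file was compiled with OpenMP enabled, 0 otherwise. */
int built_with_openmp(void)
{
#ifdef _OPENMP
    return 1;
#else
    return 0;
#endif
}
```

The same source file therefore serves as both the sequential and the parallel version of the code. Only the compiler options differ.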
OpenMP Features
• Compiler directives are used to specify sections of code that can be executed in parallel.
• Directives are also used for specifying mutexes on critical sections, and to specify shared and private variables.
• Mainly used to parallelize loops, e.g. separate threads to handle separate iterations of the loop.
• There is also a run-time library (like MPI) that has several useful routines for checking the number of threads and number of processors, changing the number of threads, setting and unsetting locks, etc.
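As a sketch of a directive-based mutex, the loop below uses the OpenMP `critical` construct to protect a shared counter. The function name and the counting task are hypothetical. A reduction would normally be the more efficient choice for a simple count, so this only illustrates the mechanism.

```c
/* Counts the negative entries of x. */
int count_negative(const double *x, int n)
{
    int count = 0;  /* shared among the threads */
    int i;          /* loop index, private to each thread */

    #pragma omp parallel for shared(count)
    for (i = 0; i < n; i++) {
        if (x[i] < 0.0) {
            #pragma omp critical
            {
                count++;  /* executed by only one thread at a time */
            }
        }
    }
    return count;
}
```

For the array {-1, 2, -3, 4, -5, 6} the function returns 3, whether or not the code is compiled with OpenMP enabled.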
Threads in OpenMP
• Threads are created using the parallel construct; every thread in the team executes the enclosed block. (Critical sections are marked separately, with the critical construct.)

    Fortran:
        !$OMP PARALLEL
        ... do stuff
        !$OMP END PARALLEL

    C and C++:
        #pragma omp parallel
        {
            ... do stuff
        }
Threads in OpenMP
• As in MPI, query routines allow you to get the number of threads and the ID of a specific thread

        id = omp_get_thread_num();
        Nthreads = omp_get_num_threads();

• Can specify number of threads at runtime, or in an environment variable (OMP_NUM_THREADS), or in the program

        omp_set_num_threads(Nthreads);
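A small sketch of the query routines in use. The function `record_thread_ids` is hypothetical, and the three stand-in definitions in the `#else` branch are an assumption of this example that lets it build without OpenMP too.

```c
#ifdef _OPENMP
#include <omp.h>
#else
/* Sequential stand-ins so the sketch also builds without OpenMP
   (an assumption of this example, not something OpenMP provides). */
static void omp_set_num_threads(int n) { (void)n; }
static int  omp_get_thread_num(void)   { return 0; }
static int  omp_get_num_threads(void)  { return 1; }
#endif

/* Requests `requested` threads; each thread in the team writes its own ID
   into ids[] (which must have at least `requested` entries).
   Returns the actual team size, which may be smaller than requested. */
int record_thread_ids(int *ids, int requested)
{
    int nthreads = 0;

    omp_set_num_threads(requested);
    #pragma omp parallel shared(ids, nthreads)
    {
        int id = omp_get_thread_num();  /* private: declared inside the region */
        if (id < requested)
            ids[id] = id;
        if (id == 0)
            nthreads = omp_get_num_threads();
    }
    return nthreads;
}
```

Thread IDs run from 0 to the team size minus one, so after the call `ids[k] == k` for every thread that took part.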
Parallelising Loops in OpenMP
• Compiler directive specifies that loop can be done in parallel

    Fortran:
        !$OMP PARALLEL DO
        DO i = 1, N
           value(i) = compute(i)
        END DO
        !$OMP END PARALLEL DO

    C and C++:
        #pragma omp parallel for
        for (i = 0; i < N; i++) {
            value[i] = compute(i);
        }
Parallelising Loops in OpenMP
• Can use loop scheduling to specify how loop iterations are allocated to threads (like data distribution in HPF)

        #pragma omp parallel for schedule(static,4)

• schedule(static [,chunk])
    – Deal out blocks of iterations of size chunk to each thread, in round-robin order
• schedule(dynamic [,chunk])
    – Each thread grabs chunk iterations off a queue until all are done
• schedule(runtime)
    – Find schedule from an environment variable (OMP_SCHEDULE)
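The static schedule can be modelled with plain arithmetic, since chunks are dealt to threads round-robin in thread order. The helper names below are hypothetical, and the code only computes the expected assignment rather than running a parallel loop.

```c
/* Thread that executes iteration i (0-based) under schedule(static, chunk)
   with nthreads threads: chunk number i / chunk, dealt out round-robin. */
int static_owner(int i, int chunk, int nthreads)
{
    return (i / chunk) % nthreads;
}

/* Fills owner[0..n-1] with the thread assigned to each iteration. */
void fill_owners(int *owner, int n, int chunk, int nthreads)
{
    for (int i = 0; i < n; i++)
        owner[i] = static_owner(i, chunk, nthreads);
}
```

For 12 iterations with `schedule(static,4)` and 3 threads, iterations 0-3 go to thread 0, 4-7 to thread 1 and 8-11 to thread 2. With only 2 threads, the third chunk (8-11) wraps around to thread 0. A dynamic schedule has no such fixed mapping, because each chunk goes to whichever thread asks for work next.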
Shared and Private Data in OpenMP
• OpenMP lets programmer specify a data storage attribute, which says whether data is shared or private.
• shared(var) states that var is a global variable to be shared among threads
• private(var) states that each thread gets its own local copy of var
• Default data storage attribute is shared
Shared and Private Data in OpenMP
• There are mechanisms for converting between private and shared (for example, the firstprivate and lastprivate clauses copy values into and out of private copies)

        !$OMP PARALLEL DO &
        !$OMP& PRIVATE(xx,yy) SHARED(u,f)
        DO j = 1,m
           DO i = 1,n
              xx = -1.0 + dx * (i-1)
              yy = -1.0 + dy * (j-1)
              u(i,j) = 0.0
              f(i,j) = -alpha * (1.0-xx*xx) * &
                       (1.0-yy*yy)
           END DO
        END DO
        !$OMP END PARALLEL DO
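For comparison, here is a C rendering of the same kind of loop nest. The grid dimensions `NROWS` and `NCOLS` and the function name are hypothetical. One difference from Fortran is worth noting: in C an inner loop index declared outside the parallel region is shared by default, so it is listed as private here.

```c
#define NROWS 4  /* hypothetical grid size (m in the Fortran version) */
#define NCOLS 3  /* hypothetical grid size (n in the Fortran version) */

/* Zeroes u and fills f on a grid starting at (-1, -1) with spacing dx, dy. */
void init_grid(double u[NROWS][NCOLS], double f[NROWS][NCOLS],
               double dx, double dy, double alpha)
{
    int i, j;
    double xx, yy;

    /* j is private automatically as the parallel loop index;
       i, xx and yy must be made private explicitly. */
    #pragma omp parallel for private(i, xx, yy) shared(u, f)
    for (j = 0; j < NROWS; j++) {
        for (i = 0; i < NCOLS; i++) {
            xx = -1.0 + dx * i;  /* 0-based index, so i replaces (i-1) */
            yy = -1.0 + dy * j;
            u[j][i] = 0.0;
            f[j][i] = -alpha * (1.0 - xx * xx) * (1.0 - yy * yy);
        }
    }
}
```

If `xx` and `yy` were left shared, different threads would overwrite each other's temporaries between the assignment and the use, which is exactly the kind of error the private clause prevents.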
Some Other OpenMP Constructs
• Reduction

        reduction (op : var list)

  Standard reduction operations, e.g. add, logical OR. A local copy of each list variable is made and initialized for each thread. The reduction operation is done for each thread, then the local values are combined to create the global value.

        int i;
        double ZZ, func(int), res = 0.0;
        #pragma omp parallel for reduction(+:res) private(ZZ)
        for (i = 0; i < N; i++) {
            ZZ = func(i);
            res = res + ZZ;
        }
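What the reduction clause does can be spelled out by hand. The sketch below places a hand-written version next to the clause version; both function names and the sum-of-squares task are hypothetical, and the manual version only approximates the idea (an implementation is free to combine the partial values differently).

```c
/* Sum 0^2 + 1^2 + ... + (n-1)^2 using the reduction clause. */
double sum_squares_reduction(int n)
{
    double res = 0.0;
    int i;
    #pragma omp parallel for reduction(+:res)
    for (i = 0; i < n; i++)
        res += (double)i * i;
    return res;
}

/* The same sum with the per-thread copies and the final combination written out. */
double sum_squares_manual(int n)
{
    double res = 0.0;  /* shared global value */
    #pragma omp parallel
    {
        double local = 0.0;  /* private copy, initialized to the identity for + */
        int i;
        #pragma omp for
        for (i = 0; i < n; i++)
            local += (double)i * i;
        #pragma omp critical
        res += local;        /* combine the local values, one thread at a time */
    }
    return res;
}
```

Both functions return 285 for `n = 10`. The manual version synchronizes once per thread rather than once per iteration, which is why reductions are much cheaper than putting the update itself in a critical section.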
Some Other OpenMP Constructs
• Barrier synchronization
  There is an implicit barrier synchronization among threads at the end of each parallel block. Can also explicitly do a barrier synchronization using the barrier construct.
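An explicit barrier is typically needed when one phase of a parallel block reads what other threads wrote in the previous phase. The sketch below assumes a team of at most four threads. `TEAM_MAX`, `neighbour_exchange` and the sequential stand-ins in the `#else` branch are hypothetical.

```c
#ifdef _OPENMP
#include <omp.h>
#else
/* Sequential stand-ins so the sketch also builds without OpenMP
   (an assumption of this example). */
static int omp_get_thread_num(void)  { return 0; }
static int omp_get_num_threads(void) { return 1; }
#endif

#define TEAM_MAX 4  /* hypothetical upper bound on the team size */

/* Phase 1: each thread writes 10 * id into its own slot.
   Phase 2: each thread reads the slot of its right-hand neighbour.
   received[] must have TEAM_MAX entries. Returns the team size. */
int neighbour_exchange(int received[TEAM_MAX])
{
    int slot[TEAM_MAX];
    int team = 0;

    #pragma omp parallel num_threads(TEAM_MAX) shared(slot, received, team)
    {
        int id = omp_get_thread_num();
        int n  = omp_get_num_threads();

        slot[id] = 10 * id;                 /* phase 1: write own slot */
        #pragma omp barrier                 /* wait until every slot is written */
        received[id] = slot[(id + 1) % n];  /* phase 2: read neighbour's slot */
        if (id == 0)
            team = n;
    }   /* implicit barrier at the end of the parallel block */
    return team;
}
```

Without the explicit barrier, a fast thread could read its neighbour's slot before the neighbour had written it. The implicit barrier at the end of the block is what makes it safe to read `received` and `team` after the block.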
OpenMP Compilers
• It is possible for automated parallelizing compilers to analyze loops in sequential programs and automatically generate OpenMP directives.
• For example, compilers can identify parallel loops and reduction functions.
• This may or may not produce more efficient code, but allows the user to look at the OpenMP version of the code and do manual optimization.
• Compiler can also give clues about data dependencies that may allow the programmer to restructure the code or data structures for better parallelism.
OpenMP Compilers
• Since parallelism is mostly achieved by parallelising loops using shared memory, OpenMP compilers work well for multiprocessor SMPs and vector machines (vectorizing compilers work on the same principles).
• OpenMP could work for distributed memory machines, but would need a good distributed shared memory (DSM) implementation; DSM performance and scalability are lacking at the moment.
• Currently no OpenMP standard for Java. Maybe not really needed, since Java already offers good support for threads?