1. Parallel Computing 2: Parallel Programming Styles and Hybrids
2. Objectives Discuss major classes of parallel programming models
Hybrid programming
Examples
3. Introduction Parallel computer architectures have evolved, and so have the programming styles needed to use these architectures effectively
Two styles have become de facto standards:
the MPI library for message passing, and
OpenMP compiler directives for multithreading
Both are widely used and are available on virtually every parallel system.
On current parallel systems it often makes sense to mix multithreading and message passing to maximize performance
4. HPC Architectures Based on memory distribution
Shared: all processors have equal access to one or more banks of memory
Cray Y-MP, SGI Challenge, dual- and quad-processor workstations
Distributed: each processor has its own memory, which may or may not be visible to other processors
IBM SP2 and clusters of uniprocessor machines
5. Distributed shared memory
NUMA (non-uniform memory access)
SGI Origin 3000, HP Superdome
Clusters of SMPs (shared-memory systems)
IBM SP, Beowulf clusters
6. Parallel Programming Styles Explicit threading
Not commonly used on distributed systems
Uses locks, semaphores, and mutexes
Synchronization and parallelization handled by programmer
POSIX threads (pthreads library)
7. Message Passing Interface (MPI)
An application consists of several processes
Processes communicate by passing data to one another (send/receive, broadcast/gather)
Synchronization is still required of the programmer; however, locking is not, since nothing is shared
A common approach is domain decomposition, where each task is assigned a subdomain and communicates its edge values to neighbouring subdomains (a minimal sketch follows below)
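A minimal sketch of such an edge (halo) exchange in C, assuming a 1-D row decomposition with one ghost row on each side; the function and buffer names are illustrative, not from the course code:
#include <mpi.h>

/* Hypothetical halo exchange for a 1-D row decomposition: each process
   sends its first and last interior rows to its neighbours and receives
   their edge rows into its ghost rows (row 0 and row local_rows+1). */
void exchange_edges(double *u, int local_rows, int ncols, int rank, int size)
{
    int up   = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int down = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    /* send top interior row up, receive bottom ghost row from below */
    MPI_Sendrecv(&u[1 * ncols], ncols, MPI_DOUBLE, up, 0,
                 &u[(local_rows + 1) * ncols], ncols, MPI_DOUBLE, down, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    /* send bottom interior row down, receive top ghost row from above */
    MPI_Sendrecv(&u[local_rows * ncols], ncols, MPI_DOUBLE, down, 1,
                 &u[0], ncols, MPI_DOUBLE, up, 1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}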
8. Compiler directives (OpenMP)
Special comments (directives) are added to parallelizable regions of serial programs (a minimal example follows this list)
Requires a compiler that understands the special directives
Locking and synchronization are handled by the compiler unless overridden by directives (implicit and explicit)
Decomposition is done primarily by the programmer
Scalability is more limited than that of MPI applications because the programmer has less control over how the code is parallelized
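A minimal, hypothetical illustration of the directive style (not the course's example): a single directive turns a serial loop into a work-shared parallel loop.
#include <omp.h>   /* only needed if OpenMP runtime routines are called */

/* One directive parallelizes the loop; thread creation and scheduling
   are handled by the compiler and runtime. */
void scale(double *a, const double *b, double s, int n)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        a[i] = s * b[i];
}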
9. Hybrid
Mixture of MPI and OpenMP
Used on distributed shared memory systems
Applications usually consist of computationally expensive loops punctuated by calls to MPI. In many cases these loops can be further parallelized by adding OpenMP directives
Not a solution for all parallel programs, but quite suitable for certain algorithms
10. Why Hybrid? Performance considerations
Scalability. For a fixed problem size, hybrid code will scale to higher processor counts before being overwhelmed by communication overhead
A good example is the Laplace equation
May not be effective where performance is limited by the speed of interconnect rather than the processor
11. Computer architecture
Some architectural limitations force the use of hybrid computing (e.g., limits on the number of MPI processes per node or cluster block)
Some algorithms, notably FFTs, run better on machines where the local memory bandwidth is much greater than that of the network, due to the O(N) growth of the bandwidth required. With a hybrid approach the number of MPI processes can be lowered while the same number of processors is still used
12. Algorithms
Some applications, such as computational fluid dynamics codes, benefit greatly from a hybrid approach. The solution space is separated into interconnected zones; the interaction between zones is handled by MPI, while the fine-grained computation inside a zone is handled by OpenMP
13. Considerations on MPI, OpenMP, and Hybrid Styles General considerations
Amdahl's law
Amdahl's law states that the speedup from parallelization is limited by the portion of the code that cannot be parallelized. In a hybrid program, if the fraction of each MPI process's work that is parallelized by OpenMP is not high, the overall speedup is limited
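In its usual form, the maximum speedup on N processors when a fraction P of the work can be parallelized is S = 1 / ((1 - P) + P/N); for example, P = 0.9 and N = 16 gives S = 1/0.15625 ≈ 6.4, far short of the ideal 16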
14. Communication patterns
How do the program's communication needs match the underlying hardware? A hybrid approach may increase performance where pure MPI code leads to rapid growth in communication traffic
15. Machine balance
How do memory, CPU, and interconnect affect the performance of a program? If the processors are fast, communication may become the bottleneck; cache behaviour may also differ between nodes (e.g., older machines in a Beowulf cluster)
16. Memory access patterns
Cache memory (primary, secondary, tertiary) has to be used effectively in order to achieve good performance on clusters
17. Advantages and Disadvantages of OpenMP Advantages
Comparatively easy to implement. In particular, it is easy to refit an existing serial code for parallel execution
Same source code can be used for both parallel and serial versions
More natural for shared-memory architectures
Dynamic scheduling (load balancing is easier than with MPI)
Useful for both fine- and coarse-grained problems
18. Disadvantages
Can only run on shared memory systems
Limits the number of processors that can be used
Data placement and locality may become serious issues
Especially true for SGI NUMA architectures where the cost of remote memory access may be high
Thread creation overhead can be significant unless enough work is performed in each parallel loop
Implementing coarse-grained solutions in OpenMP is usually about as involved as constructing the analogous MPI application
Explicit synchronization is required
19. General characteristics
Most effective for problems with fine-grained parallelism (i.e. loop-level)
Can also be used for coarse-grained parallelism
Overall intra-node memory bandwidth may limit the number of processors that can effectively be used
Each thread sees the same global memory, but has its own private memory
Implicit messaging
High level of abstraction (higher than MPI)
20. Advantages and Disadvantages of MPI Advantages
Any parallel algorithm can be expressed in terms of the MPI paradigm
Runs on both distributed and shared-memory systems. Performance is generally good in either environment
Allows explicit control over communication, leading to high efficiency due to overlapping communication and computation
Allows for static task handling
Data placement problems are rarely observed
For suitable problems MPI scales well to very large numbers of processors
MPI is portable
Current implementations are efficient and optimized
21. Disadvantages
Application development is difficult. Re-fitting existing serial code using MPI is often a major undertaking, requiring extensive restructuring of the serial code
It is less useful with fine-grained problems where communication costs may dominate
For all-to-all type operations, the effective number of point-to-point interactions increases as the square of the number of processors resulting in rapidly increasing communication costs
Dynamic load balancing is difficult to implement
Variations exist among different vendors' implementations of the MPI library: some may not implement all the calls, while others offer extensions
22. General characteristics
MPI is most effective for problems with coarse-grained parallelism, for which
The problem decomposes into quasi-independent pieces and
Communication needs are minimized
23. The Best of Both Worlds Use hybrid programming when
The code exhibits limited scaling with MPI
The code could make use of dynamic load balancing
The code exhibits fine-grained parallelism or a combination of fine-grained and coarse-grained parallelism
The application makes use of replicated data
24. Problems When Mixing Modes Environment variables may not be passed correctly to the remote MPI processes. This has negative implications for hybrid jobs because each MPI process needs to read the MP_SET_NUMTHREADS environment variable in order to start up the proper number of OpenMP threads. It can be solved by always setting the number of OpenMP threads within the code (see the sketch below)
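A minimal sketch of that workaround, with a placeholder thread count (a real code might take the count from a command-line argument or input file):
#include <mpi.h>
#include <omp.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    /* Set the OpenMP thread count explicitly in the code rather than
       relying on an environment variable reaching every MPI process. */
    omp_set_num_threads(4);   /* placeholder value */

    /* ... hybrid work ... */

    MPI_Finalize();
    return 0;
}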
25. Calling MPI communication functions within OpenMP parallel regions
Hybrid programming works by having OpenMP threads spawned from MPI processes; it does not work the other way around. Calling MPI communication functions from inside an OpenMP parallel region will result in a runtime error (a minimal sketch of the safe pattern follows)
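The safe pattern this implies, sketched with hypothetical buffer and neighbour names: MPI communication is done by the MPI process outside the parallel region, and only the computation is threaded.
#include <mpi.h>
#include <omp.h>

/* Hypothetical iteration of a hybrid solver: communication is performed
   by the MPI process itself, outside any OpenMP parallel region. */
void step(double *halo, int n, int neighbour)
{
    /* MPI communication: outside the parallel region */
    MPI_Sendrecv_replace(halo, n, MPI_DOUBLE, neighbour, 0,
                         neighbour, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* Computation: inside the parallel region, with no MPI calls */
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        halo[i] *= 0.5;   /* placeholder computation */
}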
26. Laplace Example Outline
Serial
MPI
OpenMP
Hybrid
27. Outline
The Laplace equation in two dimensions is also known as the potential equation and is usually one of the first PDEs (partial differential equations) encountered: ∂²u/∂x² + ∂²u/∂y² = 0
It is the governing equation for electrostatics, heat diffusion, and fluid flow. Adding a source term f(x,y) gives Poisson's equation, adding a first derivative in time gives the diffusion equation, and adding a second derivative in time gives the wave equation
A numerical solution to this PDE can be computed by using a finite difference based approach
28. Using an iterative method to solve the equation we get the following:
du^{n+1}_{i,j} = (u^n_{i-1,j} + u^n_{i+1,j} + u^n_{i,j-1} + u^n_{i,j+1}) / 4 - u^n_{i,j}
u^{n+1}_{i,j} = u^n_{i,j} + du^{n+1}_{i,j}
*note* n is the iteration number, not an exponent
29. Serial A cache-friendly approach incrementally computes the du values, comparing each with the current maximum, and then updates all u values. This can usually be done without any additional memory operations, which is good for clusters
See code
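The course listing is not reproduced here; the following is a minimal sketch of such a serial Jacobi-style sweep, with an assumed grid size and a hypothetical function name:
#include <math.h>

#define N 256          /* assumed interior grid size */

/* One sweep: compute du for each interior point, track the largest |du|,
   then apply the updates to u. Returns the maximum |du| so the caller
   can test for convergence. */
double sweep(double u[N + 2][N + 2], double du[N + 2][N + 2])
{
    double dumax = 0.0;

    for (int i = 1; i <= N; i++)
        for (int j = 1; j <= N; j++) {
            du[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j] +
                               u[i][j - 1] + u[i][j + 1]) - u[i][j];
            if (fabs(du[i][j]) > dumax)
                dumax = fabs(du[i][j]);
        }

    for (int i = 1; i <= N; i++)
        for (int j = 1; j <= N; j++)
            u[i][j] += du[i][j];

    return dumax;
}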
30. MPI
Currently the most widely used approach for distributed-memory systems
Note processes are not the same as processors. An MPI process can be thought of as a thread and multiple threads can run on a single processor. The system is responsible for mapping the MPI processes to physical processors
Each process is an exact copy of the program with the exception that each copy has its own unique id
31. Hello World PROGRAM hello
INCLUDE 'mpif.h'
INTEGER ierror, rank, size
CALL MPI_INIT(ierror)
CALL MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)
CALL MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierror)
if (rank .EQ. 2) print *, 'P:', rank, ' Hello World'
print *, 'I have rank ', rank, ' out of ', size
CALL MPI_FINALIZE(ierror)
END
#include <mpi.h>
#include <stdio.h>
int main(int argc, char *argv[])
{
int rank, size;
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
if(rank==2) printf ("P:%d Hello World\n",rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
printf("I am %d out of %d.\n", rank, size);
MPI_Finalize();
return 0;
}
32. OpenMP
OpenMP is a tool for writing multi-threaded applications in a shared memory environment. It consists of a set of compiler directives and library routines. The compiler generates multi-threaded code based on the specified directives. OpenMP is essentially a standardization of the last 18 years or so of SMP (Symmetric Multi-Processor) development and practice
See code
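Again, the course listing is not reproduced here; a minimal OpenMP version of the sweep sketched earlier might look like the following (the max reduction requires an OpenMP 3.1 or newer compiler; older codes would use a critical section instead):
#include <math.h>

#define N 256          /* assumed interior grid size */

/* OpenMP version of the sweep: the two loop nests are work-shared
   across threads; the maximum |du| is combined with a max reduction. */
double sweep_omp(double u[N + 2][N + 2], double du[N + 2][N + 2])
{
    double dumax = 0.0;

    #pragma omp parallel for reduction(max:dumax)
    for (int i = 1; i <= N; i++)
        for (int j = 1; j <= N; j++) {
            du[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j] +
                               u[i][j - 1] + u[i][j + 1]) - u[i][j];
            if (fabs(du[i][j]) > dumax)
                dumax = fabs(du[i][j]);
        }

    #pragma omp parallel for
    for (int i = 1; i <= N; i++)
        for (int j = 1; j <= N; j++)
            u[i][j] += du[i][j];

    return dumax;
}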
33. Hybrid
Remember that you are compiling for both MPI and OpenMP, so the compile line needs both, e.g. f90 -O3 -mp file.f90 -lmpi
See code
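As before, the actual listing is not shown; the following is a minimal sketch of the hybrid structure, reusing the hypothetical exchange_edges routine from the earlier MPI sketch and assuming the same 1-D row decomposition:
#include <mpi.h>
#include <omp.h>
#include <math.h>

/* from the earlier MPI sketch */
void exchange_edges(double *u, int local_rows, int ncols, int rank, int size);

/* Hypothetical hybrid Laplace iteration: MPI exchanges edge rows between
   neighbouring processes, OpenMP threads share the local update, and
   MPI_Allreduce combines the per-process maximum |du| for the global
   convergence test. */
double hybrid_sweep(double *u, double *du, int local_rows, int ncols,
                    int rank, int size)
{
    double dumax = 0.0, global_dumax;

    exchange_edges(u, local_rows, ncols, rank, size);  /* MPI, outside OpenMP */

    #pragma omp parallel for reduction(max:dumax)
    for (int i = 1; i <= local_rows; i++)
        for (int j = 1; j < ncols - 1; j++) {
            du[i * ncols + j] = 0.25 * (u[(i - 1) * ncols + j] +
                                        u[(i + 1) * ncols + j] +
                                        u[i * ncols + j - 1] +
                                        u[i * ncols + j + 1]) - u[i * ncols + j];
            if (fabs(du[i * ncols + j]) > dumax)
                dumax = fabs(du[i * ncols + j]);
        }

    #pragma omp parallel for
    for (int i = 1; i <= local_rows; i++)
        for (int j = 1; j < ncols - 1; j++)
            u[i * ncols + j] += du[i * ncols + j];

    MPI_Allreduce(&dumax, &global_dumax, 1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD);
    return global_dumax;
}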