1. HPC Parallel Programming: From Concept to Compile. A One-day Introductory Workshop
October 12th, 2004
2. Schedule Concept (a frame of mind)
Compile (application)
3. Introduction Programming parallel computers
Compiler extension
Sequential programming language extension
Parallel programming layer
New parallel languages
4. Concept Parallel Algorithm Design
Programming paradigms
Parallel Random Access Machine (the PRAM model)
Result, Agenda, Specialist (the RAS model)
Task/Channel (the PCAM model)
Bulk Synchronous Parallel (the BSP model)
Pattern Language
5. Compile Serial
Introduction to OpenMP
Introduction to MPI
Profilers
Libraries
Debugging
Performance Analysis Formulas
6. By the end of this workshop you will be exposed to:
Different parallel programming models and paradigms
Serial ("I'm not bad, I was built this way") programming and how you can optimize it
OpenMP and MPI
libraries
debugging
7. Eu est velociter perfectus
Well Done is Quickly Done
Caesar Augustus
8. Introduction What is Parallel Computing?
It is the ability to program in a language that allows you to explicitly indicate how different portions of the computation may be executed concurrently by different processors
9. Why do it?
The need for speed
How much speedup can be obtained is estimated by:
Amdahl's Law: S(p) = p / (1 + (p - 1)f), where:
f - the fraction of the computation that cannot be divided into concurrent tasks, 0 <= f <= 1, and
p - the number of processors
So if we have 20 processors and a serial fraction of 5% we will get a speedup of 20/(1 + (20 - 1)(0.05)) ≈ 10.26 (a short code sketch follows this list)
Also Gustafson-Barsis's Law, which takes scalability into account, and
the Karp-Flatt Metric, which takes the parallel overhead into account, and
the Isoefficiency Relation, which is used to determine the range of processors for which a particular level of efficiency can be maintained. Parallel overhead increases as the number of processors increases, so to maintain efficiency, increase the problem size
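As a quick illustration of the speedup formula above, here is a minimal C sketch (not part of the original workshop material) that evaluates Amdahl's Law for a few processor counts. The 5% serial fraction matches the example; the other processor counts are illustrative only.

```c
/* Minimal sketch: evaluating Amdahl's Law S(p) = p / (1 + (p - 1)f). */
#include <stdio.h>

static double amdahl_speedup(int p, double f)
{
    return (double)p / (1.0 + (p - 1) * f);
}

int main(void)
{
    const double f = 0.05;                  /* serial fraction from the example */
    const int procs[] = { 1, 4, 20, 100 };  /* illustrative processor counts    */

    for (int i = 0; i < 4; i++)
        printf("p = %3d  S(p) = %6.2f\n", procs[i], amdahl_speedup(procs[i], f));
    return 0;   /* p = 20 prints S(p) = 10.26, as in the slide */
}
```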
10. Why Do Parallel Computing? Some other reasons:
Time: Reduce the turnaround time of applications
Performance: Parallel computing is the only way to extend performance toward the TFLOP realm
Cost/Performance: Traditional vector computers become too expensive as one pushes the performance barrier
Memory: Applications often require memory that goes beyond that addressable by a single processor
Whole classes of important algorithms are ideal for parallel execution. Most algorithms can benefit from parallel processing, such as the Laplace equation, Monte Carlo methods, FFT (signal processing), and image processing
Life itself is a set of concurrent processes
Scientists use modeling so why not model systems in a way closer to nature
11. Many complex scientific problems require large computing resources. Problems such as:
Quantum chemistry, statistical mechanics, and relativistic physics
Cosmology and astrophysics
Computational fluid dynamics and turbulence
Biology, genome sequencing, genetic engineering
Medicine
Global weather and environmental modeling
One such place is http://www-fp.mcs.anl.gov/grand-challenges/
12. Programming Parallel Computers In 1988 four distinct paths for application software development on parallel computers were identified by McGraw and Axelrod:
Extend an existing compiler to translate sequential programs into parallel programs
Extend an existing language with new operations that allow users to express parallelism
Add a new language layer on top of an existing sequential language
Define a totally new parallel language
13. Compiler extension Design parallelizing compilers that exploit parallelism in existing programs written in a sequential language
Advantages:
billions of dollars and thousands of years of programmer effort have already gone into legacy programs.
Automatic parallelization can save money and labour.
It has been an active area of research for over twenty years
Companies such as Parallel Software Products http://www.parallelsp.com/ offer compilers that translate F77 code into parallel programs for MPI and OpenMP
Disadvantages:
Pits the programmer and compiler in a game of hide and seek. The programmer hides parallelism in DO loops and control structures, and the compiler might irretrievably lose some of that parallelism
14. Sequential Programming Language Design Extend a sequential language with functions that allow programmers to create, terminate, synchronize and communicate with parallel processes
Advantages:
Easiest, quickest, and least expensive since it only requires the development of a subroutine library
Libraries meeting the MPI standard exist for almost every parallel computer
Gives programmers flexibility with respect to program development
Disadvantages:
Compiler is not involved in generation of parallel code therefore it cannot flag errors
It is very easy to write parallel programs that are difficult to debug
15. Parallel Programming Layers Think of a parallel program as consisting of 2 layers. The bottom layer contains the core of the computation, which manipulates its portion of the data to get its result. The upper layer controls the creation and synchronization of processes. A compiler would then translate these two layers into code for execution on parallel machines
Advantages:
Allows users to depict parallel programs as directed graphs with nodes depicting sequential procedures and arcs representing data dependences among procedures
Disadvantages:
Requires programmer to learn and use a new parallel programming system
16. New Parallel Languages Develop a parallel language from scratch. Let the programmer express parallel operations explicitly. The programming language Occam is one such famous example http://wotug.ukc.ac.uk/parallel/occam/
Advantages:
Explicit parallelism means programmer and compiler are now allies instead of adversaries
Disadvantages:
Requires development of new compilers. It typically takes years for vendors to develop high-quality compilers for their parallel architectures
Some parallel languages, such as C*, were not adopted as standards, severely compromising code portability
User resistance. Who wants to learn another language?
17. The most popular approach continues to be augmenting existing sequential languages with low-level constructs expressed by function calls or compiler directives
Advantages:
Can exhibit high efficiency
Portable to a wide range of parallel systems
C, C++, F90 with MPI or OpenMP are such examples
Disadvantages:
More difficult to code and debug
18. Concept An algorithm (from the OED) is a set of rules or processes, usually expressed in algebraic notation, now used in computing
A parallel algorithm is one in which the rules or processes are concurrent
There is no simple recipe for designing parallel algorithms. However, it can benefit from a methodological approach. It allows the programmer to focus on machine-independent issues such as concurrency early in the design process and machine-specific aspects later
You will be introduced to such approaches and models and hopefully gain some insight into the design process
Examining these models is a good way to start thinking in parallel
19. Parallel Programming Paradigms Parallel applications can be classified into well defined programming paradigms
Each paradigm is a class of algorithms that have the same control structure
Experience suggests that there are a relatively few paradigms underlying most parallel programs
The choice of paradigm is determined by the available computing resources (which define the level of granularity) and by the type of parallelism inherent in the program (which reflects the structure of either the data or the application)
A paradigm is a pattern: a case or instance regarded as representative or typical
20. Parallel Programming Paradigms The most systematic definition of paradigms comes from a technical report from the University of Basel in 1993 entitled BACS: Basel Algorithm Classification Scheme
A generic tuple of factors which characterize a parallel algorithm:
Process properties (structure, topology, execution)
Interaction properties
Data properties (partitioning, placement)
The following paradigms were described:
Task-Farming (or Master/Slave)
Single Program Multiple Data (SPMD)
Data Pipelining
Divide and Conquer
Speculative Parallelism
21. PPP Task-Farming Task-farming consists of two entities:
Master which decomposes the problem into small tasks and distributes/farms them to the slave processes. It also gathers the partial results and produces the final computational result
Slave which gets a message with a task, executes the task and sends the result back to the master
It can use either static load balancing (distribution of tasks is all performed at the beginning of the computation) or dynamic load-balancing (when the number of tasks exceeds the number of processors or is unknown, or when execution times are not predictable, or when dealing with unbalanced problems). This paradigm responds quite well to the loss of processors and can be scaled by extending the single master to a set of masters
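The master/slave structure above can be sketched directly in MPI. What follows is a hedged illustration, not workshop code: the task and result are plain doubles, do_work() is a placeholder for the real computation, and it assumes at least one worker and at least as many tasks as workers. The master hands one task to each worker, then keeps refilling whichever worker answers first (dynamic load balancing).

```c
/* Hedged sketch of the task-farming (master/slave) paradigm with MPI. */
#include <mpi.h>
#include <stdio.h>

#define NTASKS   100
#define TAG_WORK 1
#define TAG_STOP 2

static double do_work(double task) { return task * task; }  /* placeholder */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                           /* master: farm out tasks */
        double result, sum = 0.0;
        int sent = 0, received = 0;
        MPI_Status st;
        for (int w = 1; w < size && sent < NTASKS; w++, sent++) {
            double task = (double)sent;
            MPI_Send(&task, 1, MPI_DOUBLE, w, TAG_WORK, MPI_COMM_WORLD);
        }
        while (received < sent) {
            MPI_Recv(&result, 1, MPI_DOUBLE, MPI_ANY_SOURCE, TAG_WORK,
                     MPI_COMM_WORLD, &st);
            sum += result; received++;
            if (sent < NTASKS) {               /* dynamic load balancing */
                double task = (double)sent++;
                MPI_Send(&task, 1, MPI_DOUBLE, st.MPI_SOURCE, TAG_WORK,
                         MPI_COMM_WORLD);
            } else {                           /* no tasks left: stop this worker */
                double dummy = 0.0;
                MPI_Send(&dummy, 1, MPI_DOUBLE, st.MPI_SOURCE, TAG_STOP,
                         MPI_COMM_WORLD);
            }
        }
        printf("sum = %f\n", sum);
    } else {                                   /* slave: receive, compute, return */
        double task, result;
        MPI_Status st;
        while (1) {
            MPI_Recv(&task, 1, MPI_DOUBLE, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == TAG_STOP) break;
            result = do_work(task);
            MPI_Send(&result, 1, MPI_DOUBLE, 0, TAG_WORK, MPI_COMM_WORLD);
        }
    }
    MPI_Finalize();
    return 0;
}
```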
22. PPP Single Program Multiple data (SPMD) SPMD is the most commonly used paradigm
Each process executes the same piece of code but on a different part of the data which involves the splitting of the application data among the available processors. This is also referred to as geometric parallelism, domain decomposition, or data parallelism
Applications can be very efficient if the data is well distributed by the processes on a homogeneous system. If different workloads are evident then some sort of load balancing scheme is necessary during run-time execution
Highly sensitive to the loss of a process. Usually results in a deadlock until a global synchronization point is reached
23. PPP Data Pipelining Data pipelining is fine-grained parallelism and is based on a functional decomposition approach
The tasks (capable of concurrent operation) are identified and each processor executes a small part of the total algorithm
One of the simplest and most popular functional decomposition paradigms and can also be referred to as data-flow parallelism.
Communication between stages of the pipeline can be asynchronous; efficiency is directly dependent on the ability to balance the load across the stages
Often used in data reduction and image processing
24. PPP Divide and Conquer The divide and conquer approach is well known in sequential algorithm development in which a problem is divided into two or more subproblems. Each subproblem is solved independently and the results combined
In parallel divide and conquer, the subproblems can be solved at the same time
Three generic computational operations split, compute, and join (sort of like a virtual tree where the tasks are computed at the leaf nodes)
25. PPP Speculative Parallelism Employed when it is difficult to obtain parallelism through any one of the previous paradigms
Deals with complex data dependencies which can be broken down into smaller parts using some speculation or heuristic to facilitate the parallelism
26. PRAM Parallel Random Access Machine Descendant of the RAM (Random Access Machine)
A theoretical model of parallel computation in which an arbitrary but finite number of processors can access any value in an arbitrarily large shared memory in a single time step
Introduced in the 1970s, it still remains popular since it is theoretically tractable and gives algorithm designers a common target. The Prentice Hall book from 1989 entitled The Design and Analysis of Parallel Algorithms gives a good introduction to the design of algorithms using this model
27. PRAM cont The three most important variations on this model are:
EREW (exclusive read, exclusive write) where any memory location may be accessed only once in any one step
CREW (concurrent read exclusive write) where any memory location may be read any number of times during a single step but written to only once after the reads have finished
CRCW (concurrent read concurrent write) where any memory location may be written to or read from any number of times during a single step. Some rule or priority must be given to resolve multiple writes
28. PRAM cont This model has problems
PRAMs cannot be emulated optimally on all architectures
The problem lies in the assumption that every processor can access the memory simultaneously in a single step. Even in hypercubes, messages must take several hops between source and destination, and the hop count grows logarithmically with the machine's size. As a result, any buildable computer will experience a logarithmic slowdown relative to the PRAM model as its size increases
One solution is to take advantage of cases in which there is greater parallelism in the process than in the hardware it is running on, enabling each physical processor to emulate many virtual processors. An example of such is as follows:
29. PRAM cont Example
Process A sends request
Process B runs while request travels to memory
Process C runs while memory services request
Process D runs while reply returns to processor
Process A is re-scheduled
The efficiency with which physically resizable architectures could emulate the PRAM is dictated by the theorem:
If each of P processors sends a single message to a randomly-selected partner, it is highly probable that at least one processor will receive O(log P / log log P) messages, and some others will receive none; but
If each processor sends log P messages to randomly-selected partners, there is a high probability that no processor will receive more than 3 log P messages.
So if the problem size is increased at least logarithmically faster than the machine size, efficiency can be held constant. The problem is that this only holds for hypercubes, in which the number of communication links grows with the number of processors.
Several ways around the above limitation have been suggested, such as:
30. PRAM cont XPRAM, where computations are broken up into steps such that no processor may communicate more than a certain number of times per single time step.
Programs which fit this model can be emulated efficiently
Problem is that it is difficult to design algorithms in which the frequency of communication decreases as the problem size increases
Bird-Meertens formalism where the allowed set of communications would be restricted to those which can be emulated efficiently
The scan-vector model proposed by Blelloch accounts for the relative distance of different portions of memory
Another option was proposed by Ranade in 1991, which uses a butterfly network in which each node is a processor/memory pair. Routing messages in this model is complicated, but the end result is an optimal PRAM emulation
31. Result, Agenda, Specialist Model The RAS model was proposed by Nicholas Carriero and David Gelernter in their book How to Write Parallel Programs in 1990
To write a parallel program:
Choose a pattern that is most natural to the problem
Write a program using the method that is most natural for that pattern, and
If the resulting program is not efficient, then transform it methodically into a more efficient version
32. RAS Sounds simple. We can envision parallelism in terms of:
Result - focuses on the shape of the finished product
Plan a parallel application around a data structure yielded as the final result. We get parallelism by computing all elements of the result simultaneously
Agenda - focuses on the list of tasks to be performed
Plan a parallel application around a particular agenda of tasks, and then assign many processes to execute the tasks
Specialist - focuses on the make-up of the work
Plan an application around an ensemble of specialists connected into a logical network of some kind. Parallelism results from all nodes being active simultaneously, much like pipelining
33. RAS - Result In most cases the easiest way to think of a parallel program is to think of the resulting data structure. It is a good starting point for any problem whose goal is to produce a series of values with predictable organization and interdependencies
Such a program reads as follows:
Build a data structure
Determine the value of all elements of the structure simultaneously
Terminate when all values are known
If all values are independent then all computations start in parallel. However, if some elements cannot be computed until certain other values are known, then those tasks are blocked
As a simple example consider adding two n-element vectors (i.e. add the ith elements of both and store the sum in another vector)
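The vector-addition example reads naturally as result parallelism: every element of the result is an independent task. A minimal OpenMP sketch (illustrative only, not the workshop's code) follows.

```c
/* Result parallelism for the vector-add example: every element of z is
   computed independently across the available threads. */
#include <stdio.h>

#define N 1000000

int main(void)
{
    static double x[N], y[N], z[N];

    for (int i = 0; i < N; i++) { x[i] = i; y[i] = 2.0 * i; }

    /* each element of the result is an independent task */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        z[i] = x[i] + y[i];

    printf("z[N-1] = %f\n", z[N - 1]);
    return 0;
}
```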
34. RAS - Agenda Agenda parallelism adapts well to many different problems
The most flexible is the master-worker paradigm
in which a master process initializes the computation and creates a collection of identical worker processes
Each worker process is capable of performing any step in the computation
Workers seek a task to perform and then repeat
When no tasks remain, the program is finished
An example would be to find the lowest ratio of salary to dependents in a database. The master fills a bag with records, and each worker draws from the bag, computes the ratio, and sends the result back to the master. The master keeps track of the minimum and, when the tasks are complete, reports the answer
35. RAS - Specialist Specialist parallelism involves programs that are conceived in terms of a logical network.
Best understood when each node executes an autonomous computation and inter-node communication follows predictable paths
An example could be a circuit simulation where each element is realized by a separate process
36. RAS - Example Consider a naive n-body simulator where on each iteration of the simulation we calculate the prevailing forces between each body and all the rest, and update each body's position accordingly
With the result parallelism approach it is easy to restate the problem description as follows:
Suppose n bodies, q iterations of the simulation, compute matrix M such that M [i, j] is the position of the ith body after the jth iteration
Define each entry in terms of other entries i.e. write a function to compute position (i, j)
37. RAS - Example With the agenda parallelism model we can repeatedly apply the transformation "compute next position" to all bodies in the set
So the steps involved would be to
Create a master process and have it generate n initial task descriptors ( one for each body )
On the first iteration, each process repeatedly grabs a task descriptor and computes the next position of the corresponding body, until all task descriptors are used
The master can then store information about each body's position at the last iteration in a distributed table structure where each process can refer to it directly
38. RAS - Example Finally, with the specialist parallelism approach we create a series of processes, each one specializing in a single body (i.e. each responsible for computing a single body's current position throughout the entire simulation)
At the start of each iteration, each process sends data to and receives data from each other process
The data included in the incoming group of messages is sufficient to allow each process to compute a new position for its body, then repeat
39. Task Channel model THERE IS NO SIMPLE RECIPE FOR DESIGNING PARALLEL ALGORITHMS
However, with suggestions by Ian Foster and his book Designing and Building Parallel Programs there is a methodology we can use
The task/channel method is the one most often cited as a practical means to organize the design process
It represents a parallel computation as a set of tasks that may interact with each other by sending messages through channels
It can be viewed as a directed graph where vertices represent tasks and directed edges represent channels
A thorough examination of this design process will conclude with a practical example
40. A task is a program, its local memory, and a collection of I/O ports
The local memory contains the program's instructions and its private data
It can send local data values to other tasks via output ports
It also receives data values from other tasks via input ports
A channel is a message queue that connects one task's output port with another task's input port
Data values appear at the input port in the same order as they were placed in the output port of the channel
Tasks cannot receive data until it is sent (i.e. receiving is blocked)
Sending is never blocked
41. The four stages of Foster's design process are:
Partitioning - the process of dividing the computation and data into pieces
Communication - the pattern of sends and receives between tasks
Agglomeration - the process of grouping tasks into larger tasks to simplify programming or improve performance
Mapping - the process of assigning tasks to processors
Commonly referred to as PCAM
42. Partitioning Discover as much parallelism as possible. To this end strive to split the computation and data into smaller pieces
There are two approaches:
Domain decomposition
Functional decomposition
43. PCAM partitioning domain decomposition Domain decomposition is where you first divide the data into pieces and then determine how to associate computations with the data
Typically focus on the largest or most frequently accessed data structure in the program
Consider a 3D matrix. It can be partitioned as:
Collection of 2D slices, resulting in a 1D collection of tasks
Collection of 1D slices, resulting in a 2D collection of tasks
Each matrix element separately resulting in a 3D collection of tasks
At this point in the design process it is usually best to maximize the number of tasks hence 3D partitioning is best
44. PCAM partitioning functional decomposition Functional decomposition is complementary to domain decomposition: the computation is first divided into pieces and then the data items are associated with each computation. This is often known as pipelining, which yields a collection of concurrent tasks
Consider brain surgery
before surgery begins a set of CT images are input to form a 3D model of the brain
The system tracks the position of the instruments converting physical coordinates into image coordinates and displaying them on a monitor. While one task is converting physical coordinates to image coordinates, another is displaying the previous image, and yet another is tracking the instrument for the next image. (Anyone remember the movie The Fantastic Voyage?)
45. PCAM Partitioning - Checklist Regardless of decomposition we must maximize the number of primitive tasks since it is the upper bound on the parallelism we can exploit. Foster has presented us a checklist to evaluate the quality of the partitioning:
There are at least an order of magnitude more tasks than processors on the target parallel machine. If not, there may be little flexibility in later design options
Avoid redundant computation and storage requirements since the design may not work well when the size of the problem increases
Tasks are of comparable size. If not, it may be hard to allocate each processor equal amounts of work
The number of tasks scale with problem size. If not, it may be difficult to solve larger problems when more processors are available
Investigate alternative partitioning to maximize flexibility later
46. PCAM-Communication After the tasks have been identified it is necessary to understand the communication patterns between them
Communications are considered part of the overhead of a parallel algorithm, since the sequential algorithm does not need to do this. Minimizing this overhead is an important goal
Two such patterns local and global are more commonly used than others (structured/unstructured, static/dynamic, synchronous/asynchronous)
Local communication exists when a task needs values from a small number of other tasks (its neighbours) in order to perform a computation
Global communication exists when a large number of tasks must supply data in order to perform a computation (e.g. a parallel reduction operation computing the sum of values over N tasks)
47. PCAM Communication - checklist These are guidelines and not hard and fast rules
Are the communication operations balanced between tasks? Unbalanced communication requirements suggest a non-scalable construct
Each task communicates only with a small number of neighbours
Tasks are able to communicate concurrently. If not the algorithm is likely to be inefficient and non-scalable.
Tasks are able to perform their computations concurrently
48. PCAM - Agglomeration The first two steps of the design process were focused on identifying as much parallelism as possible
At this point the algorithm would probably not execute efficiently on any particular parallel computer. For example, if there are orders of magnitude more tasks than processors, it can lead to significant communication overhead
In the next two stages of the design we consider combining tasks into larger tasks and then mapping them onto physical processors to reduce parallel overhead
49. PCAM - Agglomeration Agglomeration (according to the OED) is the process of collecting in a mass. In this case we try to group tasks into larger tasks to improve performance or to simplify programming.
There are three main goals to agglomeration:
Lower communication overhead
Maintain the scalability of the parallel design, and
Reduce software engineering costs
50. PCAM - Agglomeration How can we lower communication overhead?
By agglomerating tasks that communicate with each other, communication is completely eliminated, since data values controlled by the tasks are in the memory of the consolidated task. This process is known as increasing the locality of the parallel algorithm
Another way is to combine groups of transmitting and receiving tasks thereby reducing the number of messages sent. Sending fewer, longer messages takes less time than sending more, shorter messages since there is an associated startup cost (message latency) inherent with every message sent which is independent of the length of the message.
51. PCAM - Agglomeration How can we maintain the scalability of the parallel design?
Ensure that you do not combine too many tasks since porting to a machine with more processors may be difficult.
For example, suppose part of your parallel program manipulates a 16 x 128 x 256 3D array and the machine has 8 processors. By agglomerating the 2nd and 3rd dimensions, each task would be responsible for a 2 x 128 x 256 submatrix. We can even port this to a machine that has 16 processors. However, porting it to a machine with more processors might require large changes to the parallel code, so agglomerating the 2nd and 3rd dimensions might not be a good idea. What about a machine with 50, 64, or 128 processors?
52. PCAM - Agglomeration How can we reduce software engineering costs?
By parallelizing a sequential program we can reduce time and expense of developing a similar parallel program. Remember Parallel Software Products
53. PCAM Agglomeration - checklist Some of these points in this checklist emphasize quantitative performance analysis which becomes more important as we move from the abstract to the concrete
Has the agglomeration increased the locality of the parallel algorithm?
Do replicated computations take less time than the communications they replace?
Is the amount of replicated data small enough to allow the algorithm to scale?
Do agglomerated tasks have similar computational and communication costs?
Is the number of tasks an increasing function of the problem size?
Is the number of tasks as small as possible, yet still at least as large as the number of processors on your parallel computer?
Is the trade-off between your chosen agglomeration and the cost of modifications to existing sequential code reasonable?
54. PCAM - Mapping In this 4th and final stage we specify where each task is to execute
The goals of mapping are to maximize processor utilization and minimize interprocessor communications. Often these are conflicting goals
Processor utilization is maximized when the computation is balanced evenly. Conversely, it drops when one or more processors are idle
Interprocessor communication increases when two tasks connected by a channel are mapped to different processors. Conversely, it decreases when the two tasks connected by the channel are mapped to the same processor
Mapping every task to the same processor cuts communication to zero, but utilization is reduced to 1/p. The point is to choose a mapping that represents a reasonable balance between conflicting goals. The mapping problem has a name, and it is...
55. PCAM - Mapping The mapping problem is known to be NP-hard, meaning that no computationally tractable (polynomial-time) algorithm exists for evaluating these trade-offs in the general case. Hence we must rely on heuristics that can do a reasonably good job of mapping
Some strategies for decomposition of a problem are:
Perfectly parallel
Domain
Control
Object-oriented
Hybrid/layered (multiple uses of the above)
56. PCAM Mapping decomposition - perfect Perfectly parallel
Applications that require little or no inter-processor communication when running in parallel
Easiest type of problem to decompose
Results in nearly perfect speed-up
The pi example is almost perfectly parallel
The only communication occurs at the beginning of the problem when the number of divisions needs to be broadcast and at the end where the partial sums need to be added together
The calculation of the area of each slice proceeds independently
This would be true even if the area calculation were replaced by something more complex
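The workshop's actual pi code is not reproduced here, but the usual midpoint-rule integration of 4/(1+x^2) over [0,1] shows why the example is almost perfectly parallel: only the number of slices is shared at the start and only the partial sums are combined at the end.

```c
/* Hedged sketch of the pi example: midpoint-rule integration where the
   area of each slice is computed independently and the partial sums are
   combined in a single reduction at the end. */
#include <stdio.h>

int main(void)
{
    const long n = 10000000;          /* number of slices (illustrative) */
    const double h = 1.0 / n;
    double sum = 0.0;

    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < n; i++) {
        double x = (i + 0.5) * h;     /* midpoint of slice i             */
        sum += 4.0 / (1.0 + x * x);   /* each slice proceeds independently */
    }
    printf("pi is approximately %.12f\n", h * sum);
    return 0;
}
```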
57. PCAM mapping decomposition - domain Domain decomposition
In simulation and modelling this is the most common solution
The solution space (which often corresponds to the real space) is divided up among the processors. Each processor solves its own little piece
Finite-difference methods and finite-element methods lend themselves well to this approach
The method of solution often leads naturally to a set of simultaneous equations that can be solved by parallel matrix solvers
Sometimes the solution involves some kind of transformation of variables (i.e. Fourier Transform). Here the domain is some kind of phase space. The solution and the various transformations involved can be parallelized
58. PCAM mapping decomposition - domain Solution of a PDE (Laplace's Equation)
A finite-difference approximation
Domain is divided into discrete finite differences
Solution is approximated throughout
In this case, an iterative approach can be used to obtain a steady-state solution
Only nearest neighbour cells are considered in forming the finite difference
Gravitational N-body, structural mechanics, weather and climate models are other examples
59. PCAM mapping decomposition - control Control decomposition
If you cannot find a good domain to decompose, your problem might lend itself to control decomposition
Good for:
Unpredictable workloads
Problems with no convenient static structures
One type of control decomposition is functional decomposition
The problem is viewed as a set of operations; parallelization is done among the operations
Many examples in industrial engineering ( i.e. modelling an assembly line, a chemical plant, etc.)
Many examples in data processing where a series of operations is performed on a continuous stream of data
60. PCAM mapping decomposition - control Examples
Image processing: given a series of raw images, perform a series of transformations that yield a final enhanced image. Solve this with a functional decomposition (each process represents a different function in the problem) using data pipelining
Game playing: games feature an irregular search space. One possible move may lead to a rich set of possible subsequent moves to search.
61. PCAM mapping decomposition - OO Object-oriented decomposition is really a combination of functional and domain decomposition
Rather than thinking about dividing data or functionality, we look at the objects in the problem
The object can be decomposed as a set of data structures plus the procedures that act on those data structures
The goal of object-oriented parallel programming is distributed objects
Although conceptually clear, in practice it can be difficult to achieve good load balancing among the objects without a great deal of fine tuning
Works best for fine-grained problems and in environments where having functionality ready at the call is more important than worrying about under-worked processors (i.e. battlefield simulation)
Message passing is still explicit (no standard C++ compiler automatically parallelizes over objects).
62. PCAM mapping decomposition - OO Example: the client-server model
The server is an object that has data associated with it (i.e. a database) and a set of procedures that it performs (i.e. searches for requested data within the database)
The client is an object that has data associated with it (i.e. a subset of data that it has requested from the database) and a set of procedures it performs (i.e. some application that massages the data).
The server and client can run concurrently on different processors: an object-oriented decomposition of a parallel application
In the real world, this can be large scale, when many clients (workstations running applications) access a large central database, kind of like a distributed supercomputer
63. PCAM mapping decomposition -summary A good decomposition strategy is
Key to potential application performance
Key to programmability of the solution
There are many different ways of thinking about decomposition
Decomposition models (domain, control, object-oriented, etc.) provide standard templates for thinking about the decomposition of a problem
Decomposition should be natural to the problem rather than natural to the computer architecture
Communication does no useful work; keep it to a minimum
Always wise to see if a library solution already exists for your problem
Don't be afraid to use multiple decompositions in a problem if it seems to fit
64. PCAM mapping - considerations If the communication pattern among tasks is regular, create p agglomerated tasks that minimize communication and map each task to its own processor
If the number of tasks is fixed and communication among them regular, but the time required to perform each task is variable, then some sort of cyclic or interleaved mapping of tasks to processors may result in a more balanced load
Dynamic load-balancing algorithms are needed when tasks are created and destroyed at run-time or computation or communication of tasks vary widely
65. PCAM mapping - checklist It is important to keep an open mind during the design process. These points can help you decide if you have done a good job of considering design alternatives
Is the design based on one task per processor or multiple tasks per processor?
Have both static and dynamic allocation of tasks to processors been considered?
If dynamic allocation of tasks is chosen, is the manager (task allocator) a bottleneck to performance?
If using probabilistic or cyclic methods, do you have a large enough number of tasks to ensure reasonable load balance? (typically ten times as many tasks as processors are required)
66. PCAM example N-body problem In a Newtonian n-body simulation, gravitational forces have infinite range. Sequential algorithms to solve these problems have time complexity of Θ(n²) per iteration, where n is the number of objects
Let us suppose that we are simulating the motion of n particles of varying mass in 2D. During each iteration we need to compute the velocity vector of each particle, given the positions of all other particles.
Using the four stage process we get
67. PCAM example N-body problem Partitioning
Assume we have one task per particle.
This particle must know the location of all other particles
Communication
A gather operation is a global communication that takes a dataset distributed among a group of tasks and collects the items on a single task
An all-gather operation is similar to gather, except at the end of communication every task has a copy of the entire dataset
We need to update the location of every particle so an all-gather is necessary
68. PCAM example N-body problem So put a channel between every pair of tasks
During each communication step each task sends its vector element to one other task. After n - 1 communication steps, each task has the position of all other particles, and it can perform the calculations needed to determine the velocity and new location for its particle
Is there a quicker way? Suppose there were only two particles. If each task had a single particle, they could exchange copies of their values. What if there were four particles? After a single exchange step, tasks 0 and 1 could both have particles v0 and v1, and likewise for tasks 2 and 3. If task 0 exchanges its pair of particles with task 2 and task 1 exchanges with task 3, then all tasks will have all four particles. A logarithmic number of exchange steps is sufficient to allow every processor to acquire the value originally held by every other processor. In the ith exchange step, messages have length 2^(i-1)
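In MPI this exchange is packaged as a collective operation; a good implementation typically uses a logarithmic exchange like the one described above. A hedged sketch follows (2D positions and one particle per task are assumptions for illustration only).

```c
/* Hedged sketch of the all-gather step in the n-body example: each of p
   tasks owns one 2D particle position and, after the call, every task
   holds the full position array. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, p;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    double my_pos[2] = { (double)rank, 2.0 * rank };   /* this task's particle   */
    double *all_pos = malloc(2 * p * sizeof(double));  /* positions of all tasks */

    MPI_Allgather(my_pos, 2, MPI_DOUBLE, all_pos, 2, MPI_DOUBLE, MPI_COMM_WORLD);

    /* ...compute forces on my particle from all_pos, then update my_pos... */
    if (rank == 0)
        printf("task 0 now holds %d positions\n", p);

    free(all_pos);
    MPI_Finalize();
    return 0;
}
```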
69. PCAM example N-body problem Agglomeration and mapping
In general, there are more particles n than processors p. If n is a multiple of p we can associate one task per processor and agglomerate n/p particles into each task.
70. PCAM - summary Task/channel (PCAM) is a theoretical construct that represents a parallel computation as a set of tasks that may interact with each other by sending messages through channels
It encourages parallel algorithm designs that maximize local computations and minimize communications
71. BSP Bulk Synchronous Parallel BSP model was proposed in 1989. It provides an elegant theoretical framework for bridging the gap between parallel hardware and software
BSP allows the programmer to design an algorithm as a sequence of large steps (supersteps in BSP language), each containing many basic computation or communication operations done in parallel and a global synchronization at the end, where all processors wait for each other to finish their work before they proceed with the next superstep.
BSP is currently used around the world, and a very good text (on which this segment is based) is Parallel Scientific Computation by Rob Bisseling, published by Oxford University Press in 2004
72. BSP Bulk Synchronous Parallel Some useful links:
BSP Worldwide organization
http://www.bsp-worldwide.org
The Oxford BSP toolset (public domain GNU license)
http://www.bsp-worldwide.org/implmnts/oxtool
The source files from the book together with test programs form a package called BSPedupack and can be found at
http://www.math.uu.nl/people/bisseling/software.html
The MPI version called MPIedupack is also available from the previously mentioned site
73. BSP Bulk Synchronous Parallel BSP satisfies all requirements of a useful parallel programming model
Simple enough to allow easy development and analysis of algorithms
Realistic enough to allow reasonably accurate modelling of real-life parallel computing
There exists a portability layer in the form of BSPlib
It has been efficiently implemented in the Oxford BSP toolset and Paderborn University BSP library
Currently being used as a framework for algorithm design and implementation on clusters of PCs, networks of workstations, shared-memory multiprocessors and large parallel machines with distributed memory
74. BSP Model BSP comprises a computer architecture, a class of algorithms, and a function for charging costs to algorithms (hmm, no wonder it is a popular model)
The BSP computer
consists of a collection of processors, each with private memory,
and a communication network that allows processors to access each other's memories
75. BSP Model The BSP algorithm is a series of supersteps, each of which contains a number of computation or communication steps followed by a global barrier synchronization (i.e. bulk synchronization)
What is one possible problem you see right away with designing algorithms this way?
76. BSP Model The BSP cost of a superstep is expressed as an h-relation: a superstep in which every processor sends and receives at most h data words (real or integer). Therefore h = max{ h_send, h_receive }
It assumes sends and receives are simultaneous
This charging cost is based on the assumption that the bottleneck is at the entry or exit of a communication network
The cost of an h-relation would be
T_comm(h) = h*g + l, where
g is the communication cost per data word, and l is the global synchronization cost (both in time units of 1 flop), and the cost of a BSP algorithm is an expression of the form
a + b*g + c*l, abbreviated (a, b, c), where a, b, c depend in general on p and on the problem size
77. BSP Bulk Synchronous Parallel This model currently allows you to convert from BSP to MPI-2 using MPIedupack as an example (i.e. MPI can be used for programming in the BSP style)
The main difference between MPI and BSPlib is that MPI provides more opportunities for optimization by the user. However, BSP does impose a discipline that can prove fruitful in developing reusable code
The book contains an excellent section on sparse matrix vector multiplication and if you link to the website you can download some interesting solvers http://www.math.uu.nl/people/bisseling/Mondriaan/mondriaan.html
78. Pattern Language Primarily from the book Patterns for Parallel Programming by Mattson, Sanders, and Massingill, Addison-Wesley, 2004
From the back cover: "It's the first parallel programming guide written specifically to serve working software developers, not just computer scientists. The authors introduce a complete, highly accessible pattern language that will help any experienced developer 'think parallel' and start writing effective code almost immediately"
The cliche "Don't judge a book by its cover" comes to mind
79. Pattern Language We have come full circle. However, we have gained some knowledge along the way
A pattern language is not a programming language. It is an embodiment of design methodologies which provides domain-specific advice to the application designer
Design patterns were introduced into software engineering in 1987
80. Pattern Language Organized into four design spaces (sound familiar - PCAM)
Finding concurrency
Structure problem to expose exploitable concurrency
Algorithm structure
Structure the algorithm to take advantage of the concurrency found above
Supporting structures
Structured approaches that represent the program and shared data structures
Implementation mechanisms
Mapping patterns to processors
81. Concept - Summary What is the common thread of all these models and paradigms?
82. Concept - Conclusion You take a problem, break it up into n tasks and assign them to p processors: that's the science
How you break up the problem and exploit the parallelism, now that's the art
83. This page intentionally left blank
84. Compile Serial/sequential program optimization
Introduction to OpenMP
Introduction to MPI
Profilers
Libraries
Debugging
Performance Analysis Formulas
85. Serial Some of you may be thinking: why would I want to discuss serial code in a talk about parallel computing?
Well, have you ever eaten just one bran flake or one rolled oat at a time? (Homonym: serial / cereal)
86. Serial Most of the serial optimization techniques can be used for any program parallel or serial
Well-written assembler code will beat a high-level programming language any day, but who has the time to write a parallel application in assembler for one of the myriad of processors available? However, small sections of assembler might be more effective.
Reducing the memory requirements of an application is a good tool that frequently results in better processor performance
You can use these tools to write efficient code from scratch or to optimize existing code.
First attempts at optimization may be compiler options or modifying a loop. However, performance tuning is like trying to reach the speed of light: more and more time and energy are expended, but peak performance is never reached. It may be best, before optimizing your program, to consider how much time and energy you have and are willing or allowed to commit. Remember, you may spend a lot of time optimizing for one processor/compiler only to be told to port the code to another system
87. Serial Computers have become faster over the past years (Moore's Law). However, application speed has not kept pace. Why? Perhaps it is because programmers:
Write programs without any knowledge of the hardware on which they will run
Do not know how to use compilers effectively (how many use the gnu compilers?)
Do not know how to modify code to improve performance
88. Serial Storage Problems Avoid cache thrashing and memory bank contention by dimensioning multidimensional arrays so that the dimensions are not powers of two
Eliminate TLB (Translation Lookaside Buffer which translates virtual memory addresses into physical memory addresses) misses and memory bank contention by accessing arrays in unit stride. A TLB miss is when a process accesses memory which does not have its translation in the TLB
Avoid Fortran I/O interfaces such as open(), read(), write(), etc., since they are built on top of buffered I/O mechanisms (fopen(), fread(), fwrite(), etc.). Fortran adds additional functionality to the I/O routines, which leads to more overhead for doing the actual transfers
Do your own buffering for I/O and use system calls to transfer large blocks of data to and from files
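To make the unit-stride advice above concrete, here is a small C sketch (illustrative only; C arrays are row-major, unlike Fortran): the first traversal touches consecutive memory locations, while the second strides through memory by N elements per access and suffers far more cache and TLB misses.

```c
/* Unit stride vs. strided access in C (row-major storage). */
#define N 2048

double a[N][N];

void good_order(void)            /* unit stride: j (rightmost index) innermost */
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = 0.0;
}

void bad_order(void)             /* stride N: same result, far slower */
{
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            a[i][j] = 0.0;
}
```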
89. Serial Compilers and HPC A compiler takes a high-level language as input and produces assembler code which, once linked with other objects, forms an executable that can run on a computer
Initially programmers had no choice but to program in assembler for a specific processor. When processors changed, so did the code
Now programmers write in a high-level language that can be recompiled for other processors (source code compatibility). There is also object and binary compatibility
90. Serial the compiler and you How the compiler generates good assembly code and things you can do to help it
Register allocation is when the compiler assigns quantities to registers. C and C++ have the register keyword. Some optimizations increase the number of registers required
The C/C++ register data type is useful when the programmer knows a variable will be used many times and should not be reloaded from memory
C/C++ asm macro allows assembly code to be inserted directly into the instruction sequence. It makes code non-portable
C/C++ include file math.h generates faster code when used
Uniqueness of memory addresses. Different languages make assumptions on whether memory locations of variables are unique. Aliasing occurs when multiple elements have the same memory locations.
91. Serial The Compiler and You Dead code elimination is the removal of code that is never used
Constant folding is when expressions with multiple constants can be folded together and evaluated at compile time (i.e. A = 3+4 can be replaced by A = 7). Propagation is when variable references are replaced by a constant value at compile time (i.e. A=3+4, B=A+3 can be replaced by A=7 and B=10)
Common subexpression elimination (i.e. A=B+(X+Y), C=D+(X+Y)) puts repeated expressions into a new variable
Strength reduction
92. Serial Strength reductions Replace integer multiplication or division with shift operations
Multiplies and divides are expensive
Replace 32-bit integer division by 64-bit floating-point division
Integer division is much more expensive than floating-point division
Replace floating-point multiplication with floating-point addition
Y=X+X is cheaper than Y=2*X
Replace multiple floating-point divisions by division and multiplication
Division is one of the most expensive operations a=x/z, b=y/z can be replaced by c=1/z, a=x*c, b=y*c
Replace power function by floating-point multiplications
Power calculations are very expensive and take 50 times longer than performing a multiplication so Y=X**3 can be replaced by Y=X*X*X
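Written out in C, the strength reductions above look like the sketch below. Whether a given compiler already performs them automatically depends on the compiler and optimization level, so treat this as an illustration rather than a rule.

```c
/* Illustrative strength reductions (variable names are made up). */
double strength_reduction_examples(double x, double y, double z)
{
    double a, b, c, r;

    r = x + x;            /* instead of r = 2.0 * x          */

    c = 1.0 / z;          /* one divide...                   */
    a = x * c;            /* ...replaces a = x / z           */
    b = y * c;            /* ...and b = y / z                */

    r += x * x * x;       /* instead of pow(x, 3.0)          */

    return r + a + b;
}
```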
93. Serial Single Loop Optimization Induction variable optimization
when values in a loop are a linear function of the induction variable the code can be simplified by replacing the expression with a counter and replacing the multiplication by an addition
Prefetching
what happens when the compiler prefetches off the end of the array (fortunately it is ignored)
Test promotion in loops
Branches in code greatly reduce performance since they interfere with pipelining
Loop peeling
Handle boundary conditions outside the loop (i.e. do not test for them inside the loop)
Fusion
If the loop bounds are the same (i.e. for (i=0; i<n; i++)) for more than one loop, combine the loops together
Fission
Sometime loops need to be split apart to help performance
Copying
Loop fission using dynamically allocated memory
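Two of the transformations above, loop fusion and loop peeling, are easy to show in a short C sketch (the array names and bounds are illustrative only).

```c
/* Illustrative sketches of loop fusion and loop peeling. */
#define N 1000

void fusion(double *a, double *b, double *c)
{
    /* before: two loops over the same range
       for (int i = 0; i < N; i++) a[i] = b[i] + 1.0;
       for (int i = 0; i < N; i++) c[i] = b[i] * 2.0;        */
    for (int i = 0; i < N; i++) {        /* after: one fused loop        */
        a[i] = b[i] + 1.0;
        c[i] = b[i] * 2.0;
    }
}

void peeling(double *a, const double *b)
{
    a[0] = b[0];                         /* boundary iteration peeled    */
    for (int i = 1; i < N; i++)          /* no "if (i == 0)" test inside */
        a[i] = b[i] - b[i - 1];
}
```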
94. Software pipelining
Closely related to unrolling
Loop invariant code motion
An expression in a loop is loop invariant if its value does not change from one loop iteration to the next
Array padding
Consider padding arrays so that their leading dimension is not a power of two
Optimizing reductions
A reduction is when a vector of data is used to create a scalar by applying a function to each element of the vector
95. Serial Nested Loop Optimization Loop interchange
An easy optimization to make. It helps performance since it makes the array references unit stride (e.g. do J=1,n; do I=1,n; x(I,J)=0.0; enddo; enddo keeps the first index innermost, which is unit stride in Fortran)
Outer loop rolling
Reduces the number of load operations
Unroll and jam
Unrolling multiple loops and jamming them back together in ways that reduce the number of memory operations
Blocking
Decreases the number of cache misses in nested loops. The most common blocking takes the inner loop, splits it into two loops and then exchanges the newly created non-innermost loop with one of the original outer loop. Good to use if the length of the inner loop is very long
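A hedged C sketch of a two-dimensional variant of the blocking described above: the matrix transpose is not from the workshop and the 64-element block size is illustrative, but it shows the pattern of splitting loops into tiles and moving the new tile loops outward so that each tile stays in cache.

```c
/* Blocked (tiled) matrix transpose, illustrative only. */
#define N  1024
#define BS 64     /* block size; assumes BS divides N evenly */

void blocked_transpose(double b[N][N], const double a[N][N])
{
    for (int ii = 0; ii < N; ii += BS)
        for (int jj = 0; jj < N; jj += BS)
            for (int i = ii; i < ii + BS; i++)       /* work on one   */
                for (int j = jj; j < jj + BS; j++)   /* BS x BS tile  */
                    b[j][i] = a[i][j];
}
```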
96. MPI and OpenMP At this point you have squeezed out all the performance from a single processor and the only way to progress now is to go to multiple processors
With some understanding of the parallel models previously discussed and with some help from parallel programming models such as MPI and OpenMP, you should be able to write efficient parallel programs
97. MPI and OpenMP Programming models tied to particular parallel systems lead to programs that are not portable between parallel computers
As of 2000 there was a plethora of parallel programming environments. From A (ABCPL) to Z (ZPL) there were over 250
By the late 1990s the parallel programming community had settled primarily on two environments for parallel programming, namely OpenMP for shared memory and MPI for message passing
98. OpenMP and MPI Neither OpenMP nor MPI is an ideal fit for hybrid architectures
OpenMP does not recognize nonuniform memory access times, so data allocation can lead to poor performance on non-SMP machines, while MPI does not include constructs to manage data residing in a shared memory
Solution: use MPI between nodes and OpenMP within the shared-memory node. However, it is a bit more work for the programmer since two different programming models are at work in the same program. Or give up on shared memory and use MPI for everything
Another alternative is MPI 2.0 which contains one-sided communications (a read or write is launched from an initiating process without involvement of other processes) which mimics some of the features of shared memory systems. Or, OpenMP has been extended to run in distributed-memory environments. Consult the annual WOMPAT (Workshop on OpenMP Applications and Tools) workshop for current approaches
99. Another alternative to consider A language with built-in features to support parallel programming is Java. It does not support HPC per se, but does contain features for explicitly specifying concurrent processing with shared memory and distributed memory models
java.util.concurrent.* is one such package
Currently Java programs cannot compete with OpenMP and MPI in terms of performance, but this should improve
The Titanium Project http://http.cs.berkeley.edu/projects/titanium/ is an example of a Java dialect designed for HPC in a ccNUMA environment (ccNUMA: cache-coherent non-uniform memory access, i.e. shared, physically distributed memory)
100. OpenMP or MPI?
101. OpenMP - Introduction History
What is OpenMP
102. OpenMP What is it? OpenMP is:
An Application Program Interface (API) that may be used to explicitly direct multi-threaded, shared memory parallelism
Comprised of 3 primary components
Compiler directives
Runtime library routines
Environmental variables
Portable
Specified for C/C++, F77, F90, F95
Implemented on most Unix platforms and Windows NT
Standardized?
Jointly defined and endorsed by major computer vendors
Expected to be an ANSI standard
Definition
Open specifications for Multi Processing via collaborative work between interested parties from the hardware and software industry, government and academia
103. OpenMP What its not OpenMP is not:
Meant for distributed memory parallel systems (by itself)
Necessarily implemented identically by all vendors
Guaranteed to make the most efficient use of shared memory
104. OpenMP - History Ancient History
Early 1990s: vendors supplied directive-based Fortran programming extensions
Implementations were all functionally similar, but were diverging
First attempt at a standard was ANSI X3H5 in 1994
Recent History
OpenMP standard specification started again in 1997
Official web site: http://www.openmp.org/
Release History
October 1997: Fortran version 1.0
Late 1998: C/C++ version 1.0
June 2000: Fortran version 2.0
April 2002: C/C++ version 2.0
105. OpenMP Programming Model Thread based Parallelism
A shared memory process can consist of multiple threads
Explicit Parallelism
OpenMP is an explicit (not automatic) programming model, offering the programmer full control over parallelization
Fork-join model
OpenMP uses the fork-join model of parallel execution
Compiler Directive Based
OpenMP parallelism is specified through the use of compiler directives which are embedded in the source code
Nested Parallelism support
Supports parallel constructs inside other parallel constructs
Dynamic Threads
Provision for dynamically altering the number of threads which may be used to execute different parallel regions
106. Cont Fork-Join model
All OpenMP programs begin as a single process: the master thread, which executes sequentially until the first parallel region construct is encountered
FORK: the master thread creates a team of parallel threads
JOIN: when the team of threads completes the statements in the parallel region construct, they synchronize and terminate, leaving only the master thread
107. OpenMP Compiler directives or Pragmas General syntax of directives (Fortran) and pragmas (C, C++)
108. Fortran Directives Source may be either fixed form or free form
In fixed form, a line that begins with one of the following prefix keywords (sentinels):
!$omp
C$omp
*$omp
and contains either a space or a zero in the sixth column is treated as an OpenMP directive by the compiler
A line that begins with one of the above sentinels and contains any other character in the sixth column is treated as a continuation directive line by an OpenMP compiler
109. Fortran Directives Cont In free form Fortran source a line that begins with the sentinel
!$omp
is treated as an OpenMP directive. The sentinel may appear in any column so long as it appears as a single word and is preceded by white space
A directive that needs to be continued on the next line is expressed as
!$omp <directive> &
(with the ampersand as the last token on that line)
110. C and C++ Pragmas Pragmas in C and C++ use the following syntax:
#pragma omp
The omp keyword distinguishes the pragma as an OpenMP pragma: it is processed by OpenMP compilers and ignored by others. Application developers can use the same source code base for building both parallel and sequential (serial) versions of an application using just a compile-time flag.
111. A Simple Loop Saxpy (single-precision a*x plus y)
subroutine saxpy(z, a, x, y, n)
integer i, n
real z(n), a, x(n), y
!$omp parallel do
do i = 1, n
z(i) = a * x(i) + y
enddo
return
end
112. Simple program cont Notice that the only change we make to the original program is minimal: the addition of the parallel do directive
The directive must be followed by a do loop construct
An OpenMP compiler will create a set of threads and distribute the iterations of the do loop across those threads for parallel execution
113. OpenMP constructs 5 main categories:
Parallel regions
Worksharing
Data environment
Synchronization
Runtime functions/environment variables
114. Parallel regions You create threads in OpenMP with the omp parallel pragma/directive
Example
double x[1000];
omp_set_num_threads(4);
#pragma omp parallel
{
int ID = omp_get_thread_num();
blah(ID, x);   /* blah() is a placeholder for per-thread work */
}
printf("finished\n");
A single copy of x is shared among all threads
115. Parallelism with Parallel Regions Loop-level parallelism is generally considered fine-grained parallelism; the granularity refers to the size of the unit of work executed in parallel
In a loop the typical unit of work is relatively small compared to the program as a whole
For coarser-grained parallelism, the directive pair
!$omp parallel
!$omp end parallel
will define the region to be parallelized
The parallel/end parallel directive pair is a control structure that forks a team of parallel threads with individual data environments to execute the enclosed code concurrently
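A hedged C sketch of such a coarser-grained region, where each thread carves out its own block of work from its thread ID (process_chunk is a hypothetical routine):
#include <omp.h>

void process_chunk(int begin, int end);   /* hypothetical unit of work */

void coarse_grained(int n)
{
    #pragma omp parallel                  /* one region enclosing all the work */
    {
        int nthreads = omp_get_num_threads();
        int id = omp_get_thread_num();
        int chunk = (n + nthreads - 1) / nthreads;
        int begin = id * chunk;
        int end = (begin + chunk > n) ? n : begin + chunk;
        if (begin < end)
            process_chunk(begin, end);    /* each thread handles its own block */
    }
}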
116. Some details Dynamic mode (default mode)
Number of threads used in a parallel region can vary from one parallel region to another
Setting the number of threads only sets the maximum number of threads; you can get fewer
Static mode
The number of threads is fixed and controlled by the programmer
Nested parallel regions
A compiler can choose to serialize the nested parallel region (i.e. use a team with only one thread)
117. Work Sharing Constructs #pragma omp for
The for construct splits up loop iterations among the threads in a team
#pragma omp parallel
#pragma omp for
for (I = 0; I < N; I++) {
    SOME_STUFF(I);
}
Note that by default there is a barrier at the end of the omp for and you can use the nowait clause to turn off the barrier
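A short sketch (array names are illustrative) of two independent loops in one parallel region, with nowait dropping the barrier after the first:
#include <omp.h>
#define N 1000

void independent_loops(double a[N], double b[N])
{
    #pragma omp parallel
    {
        #pragma omp for nowait             /* no barrier: threads go straight on */
        for (int i = 0; i < N; i++)
            a[i] = 2.0 * a[i];

        #pragma omp for                    /* implicit barrier at the end of this loop */
        for (int i = 0; i < N; i++)
            b[i] = b[i] + 1.0;
    }
}
Because the two loops touch different arrays, removing the first barrier is safe here.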
118. Schedule clause The schedule clause affects how loop iterations are mapped onto threads
Schedule(static [,chunk])
Deal out blocks of iterations of size chunk to each thread
Schedule(dynamic[,chunk])
Each thread grabs chunk iterations off a queue until all iterations have been handled
Schedule(guided[,chunk])
Threads dynamically grab blocks of iterations. The size of the block starts large and shrinks down to size chunk as the calculation proceeds
Schedule(runtime)
Schedule and chunk size taken from the OMP_SCHEDULE environment variable
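A brief sketch of the dynamic schedule on a load-imbalanced loop (expensive_work is a hypothetical routine whose cost varies with i):
#include <omp.h>

void expensive_work(int i, double *out);   /* hypothetical, cost varies with i */

void run(int n, double *result)
{
    #pragma omp parallel for schedule(dynamic, 4)
    for (int i = 0; i < n; i++)
        expensive_work(i, &result[i]);     /* threads grab 4 iterations at a time */
}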
119. Parallel Sections If the serial version of an application performs a sequence of tasks in which none of the later tasks depends on the results of the earlier ones, it may be more beneficial to assign different tasks to different threads
!$omp sections [clause [[,] clause] ...]
#pragma omp sections [clause [clause]]
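A hedged C sketch of the sections construct, assuming task_a and task_b are independent routines:
#include <omp.h>

void task_a(void);        /* hypothetical independent task */
void task_b(void);        /* hypothetical independent task */

void run_tasks(void)
{
    #pragma omp parallel
    {
        #pragma omp sections
        {
            #pragma omp section
            task_a();                 /* done by one thread */

            #pragma omp section
            task_b();                 /* possibly done by another thread */
        }
    }
}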
120. Combined work sharing constructs !$omp parallel do
#pragma omp parallel for
#pragma omp parallel sections
121. Data Environment Shared memory programming model
Most variables shared by default
Global variables are shared among threads
Fortran: common blocks, SAVE variables, MODULE variables
C: File scope variables, static
Not everything is shared
Stack variables in sub-programs called from parallel regions are PRIVATE
Automatic variables within statement blocks are PRIVATE
122. Changing Storage Attributes One can selectively change the storage attributes of variables using the following clauses, which apply to the lexical extent of the OpenMP construct
Shared
Private
Firstprivate
Threadprivate
The value of a private inside a parallel loop can be transmitted to a global value outside the loop with a lastprivate
The default status can be modified with:
DEFAULT (PRIVATE|SHARED|NONE)
123. Cont PRIVATE (var) creates a local copy of var for each thread
The value is uninitialized
Private copy is not storage associated with the original
      I = 0
C$OMP PARALLEL DO PRIVATE(I)
      DO 1000 J = 1, 100
         I = I + 1
 1000 CONTINUE
      PRINT *, I
I was not initialized in the DO
Regardless of initialization, I is undefined in the print statement
124. Firstprivate clause Special case of private
Initializes each private copy with the corresponding value from the master thread
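A small C sketch of firstprivate, reflecting the initialization behaviour described above:
#include <stdio.h>
#include <omp.h>

int main(void)
{
    int offset = 100;                             /* set by the master thread */
    #pragma omp parallel firstprivate(offset)
    {
        /* each thread's private copy starts out as 100 */
        offset += omp_get_thread_num();
        printf("thread %d: offset = %d\n", omp_get_thread_num(), offset);
    }
    printf("after the region, offset is still %d\n", offset);  /* original copy untouched */
    return 0;
}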
125. Threadprivate clause Makes global data private to a thread
COMMON blocks in Fortran
File scope and static variables in C
Threadprivate variables can be initialized using COPYIN or by using DATA statements
126. Reduction clause Reduction (op: list)
The variables in list must be shared in the enclosing parallel region
Inside a parallel or a worksharing construct:
A local copy of each list variable is made and initialized depending on the op (i.e. +, *, -)
The op is applied pairwise to update the local value
Local copies are reduced into a single global copy at the end of the construct
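A brief C sketch of a sum reduction, following the description above:
#include <stdio.h>
#include <omp.h>

int main(void)
{
    const int n = 1000;
    double sum = 0.0;                        /* shared in the enclosing region */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += (double)i;                    /* each thread updates its own local copy */
    printf("sum = %.1f\n", sum);             /* local copies combined into the global sum */
    return 0;
}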
127. Synchronization OpenMP has the following constructs to support synchronization:
Atomic
Barrier
Critical section
Flush
Master
Ordered and
single
128. Critical section !$omp critical
!$omp end critical
Only one critical section is allowed to execute at one time anywhere in the program. It is equivalent to a global lock on the program
It is illegal to branch into or jump out of a critical section
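As an illustrative sketch (not tuned for performance), a critical section protecting two related shared updates that cannot be expressed as a single atomic statement:
#include <omp.h>

void find_max(const double *a, int n, double *best_val, int *best_idx)
{
    *best_val = a[0];
    *best_idx = 0;
    #pragma omp parallel for
    for (int i = 1; i < n; i++) {
        #pragma omp critical
        {
            /* only one thread at a time updates the pair of shared results */
            if (a[i] > *best_val) {
                *best_val = a[i];
                *best_idx = i;
            }
        }
    }
}
In practice each thread would normally keep a private maximum and only enter the critical section once at the end.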
129. Atomic Is a special case of a critical section that can be used for certain simple statements
It applies only to the update of a memory location
!$omp atomic
Can be applied only if the critical section consists of a single assignment statement that updates a scalar variable
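A minimal sketch of atomic on a single scalar update:
#include <omp.h>

int hits = 0;                      /* shared counter */

void count_hits(int n)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        #pragma omp atomic
        hits += 1;                 /* one memory location, one simple update */
    }
}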
130. Barrier Each thread waits until all threads arrive
#pragma omp barrier
Simple directive that can be used to ensure that a piece of work has been completed before moving on to the next phase
131. Ordered Enforces the sequential order for a block
132. Master Denotes a structured block that is only executed by the master thread. The other threads just skip it (no implied barriers or flushes).
Used in parallel regions
133. Single Denotes a block of code that is executed by only one thread
A barrier and a flush are implied at the end of the single block
134. Flush Denotes a sequence point where a thread tries to create a consistent view of memory
All memory operations (both reads and writes) defined prior to the sequence must complete
All memory operations defined after the sequence point must follow the flush
Variables in registers or write buffers must be updated in memory
Arguments to flush specify which variables are flushed. If no arguments are given, all thread-visible variables are flushed
135. Runtime functions and library routines Lock routines
omp_init_lock(), omp_set_lock(), omp_unset_lock(), omp_test_lock()
Runtime environment routines:
Modify/check the number of threads
omp_set_num_threads(), omp_get_num_threads(), omp_get_thread_num(), omp_get_max_threads()
Turn on/off nesting and dynamic mode
omp_set_nested(), omp_set_dynamic(), omp_get_nested(), omp_get_dynamic()
Are we in a parallel region?
omp_in_parallel()
How many processors in the system?
omp_get_num_procs()
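A small sketch exercising a few of these routines:
#include <stdio.h>
#include <omp.h>

int main(void)
{
    printf("processors = %d, max threads = %d, in parallel? %d\n",
           omp_get_num_procs(), omp_get_max_threads(), omp_in_parallel());

    omp_set_num_threads(4);
    #pragma omp parallel
    {
        #pragma omp single
        printf("team size = %d, in parallel? %d\n",
               omp_get_num_threads(), omp_in_parallel());
    }
    return 0;
}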
136. Performance improvements The compiler listing gives many useful clues for improving the performance
Loop optimization tables
Reports about data dependencies
Explanations about applied transformations
The annotated, transformed code
Calling tree
Performance statistics
The type of reports to be included in the listing can be set through compiler options
137. Tuning Automatically Parallelized code Task is similar to explicit parallel programming
Two important differences:
The compiler gives hints in its listing, which may tell you where to focus attention (e.g. which variables have data dependencies)
You do not need to perform all transformations by hand. If you expose the right information to the compiler, it will do the transformation for you (e.g. C$assert independent)
138. Cont Hand improvements can pay off because:
Compiler techniques are limited (e.g. array reductions are parallelized by only a few compilers)
Compilers may have insufficient information (e.g. the loop iteration range may be input data, or variables may be defined in other subroutines)
139. Performance Tuning Use the following methodology:
Use compiler-parallelized code as a starting point
Get loop profile and compiler listing
Inspect the time-consuming loops (biggest potential for improvement)
140. SMP Programming Errors Shared memory parallel programming
Saves the programmer from having to map data onto multiple processors
It opens up a range of new errors coming from unanticipated shared resource conflicts
141. Two Major Errors Race conditions
The outcome of a program depends on the detailed timing of the threads in the team
Deadlock
Threads lock up waiting on a locked resource that will never become free
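A minimal C sketch of a race condition of this kind (the unprotected update is deliberate):
#include <stdio.h>

int main(void)
{
    int counter = 0;
    #pragma omp parallel for
    for (int i = 0; i < 100000; i++)
        counter++;                 /* unprotected read-modify-write on a shared variable */
    printf("counter = %d (expected 100000)\n", counter);
    return 0;
}
The printed value can change from run to run; an atomic directive or a reduction clause on counter removes the race.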
142. OpenMP traps Are you using thread-safe libraries (libraries that are safe to call with more than one thread running, and/or from a thread that is not the main thread)?
I/O inside a parallel region can interleave unpredictably
Make sure you understand what your constructors are doing with private objects
Private variables can mask globals
Understand when shared memory is coherent
When in doubt, use FLUSH
NOWAIT removes implied barriers
143. How to avoid the Traps Analyze your code to make sure every semantically permitted interleaving of the threads yields the correct results
Can be prohibitively difficult due to the explosion of possible interleavings
Write SMP code that is portable and equivalent to the sequential form
Use a safe subset of OpenMP
Follow a set of rules for sequential equivalence
144. Strong Sequential Equivalence Rules Control data scope with the base language
Avoid data scope clauses
Only use private for scratch variables local to a block whose global initializations do not matter
Locate all cases where a shared variable can be written by multiple threads
The access to the variable must be protected
If multiple threads combine results into a single value, enforce sequential order
Do not use the reduction clause
145. Conclusion OpenMP is:
A great way to write fast executing code
It can also expose you to new and painful kinds of errors (e.g. race conditions)
Tools and/or a discipline of writing portable sequentially equivalent programs can help
146. OpenMP - future In the hands of the Architectural Review Board (the ARB)
HP, Intel, Sun, SGI, DOE ASCI
ARB resolves interpretation issues and manages the evolution of new OpenMP APIs
Membership in the ARB is open to any organization with a stake in OpenMP
147. References http://www.openmp.org/
Parallel Programming in OpenMP, Morgan Kaufmann Publishers, ISBN 1-55860-671-8, 2001
Parallel Programming in C with MPI and OpenMP, McGraw Hill Publishers ISBN 0-07282-256-2, 2004
148. OpenMP pi Program C, no pragmas
#include <stdio.h>
static long num_steps = 100000;
double step;
int main ()
{
    int i;
    double x, pi, sum = 0.0;
    step = 1.0/(double) num_steps;
    for (i = 1; i <= num_steps; i++) {
        x = (i - 0.5)*step;
        sum = sum + 4.0/(1.0 + x*x);
    }
    pi = step*sum;
    printf("pi = %f\n", pi);   /* print the result so the computation is not discarded */
    return 0;
}
149. OpenMP pi Program C with pragmas
#include <stdio.h>
#include <omp.h>
#define NUM_THREADS 2
static int num_steps = 100000;
double step;
int main ()
{
    int i;
    double x, pi, sum = 0.0;
    step = 1.0/(double) num_steps;
    omp_set_num_threads(NUM_THREADS);
    #pragma omp parallel for reduction(+:sum) private(x)
    for (i = 0; i < num_steps; i++) {
        x = (i + 0.5)*step;
        sum = sum + 4.0/(1.0 + x*x);
    }
    pi = step * sum;
    printf("pi = %f\n", pi);   /* print the result */
    return 0;
}
150. MPI - Introduction History
MPI (Message-Passing Interface) is a library of functions and macros that can be used in Fortran (77, 90, 95), C and C++ programs that exploit the existence of multiple processors by message-passing
Developed in 1993-1994
151. MPI - Advantages of the MP Model Universality
Fits well on separate processors connected by a communication network. Thus it works well on parallel supercomputers and clusters
Expressivity
A useful and complete model in which to express parallel algorithms
Tolerant of imbalances in process speeds found on shared networks
152. MPI - Advantages Ease of Debugging
While debuggers for parallel programs are easier to write for an SMP model, debugging itself is easier in the MP model because only one process has direct access to any memory location
Performance
MPI provides a way for the programmer to explicitly associate specific data with processes and thus allow the compiler and cache-memory to function fully
153. A Perfect Parallel Program Compute the value of pi by numerical integration: integrate f(x) = 4/(1+x^2) over the interval from 0 to 1
To do this we divide the interval 0 to 1 into some number n of subintervals and add up the areas
Language bindings to C++, C, Fortran, and example pi programs follow
154. MPI 6 Functions MPI_Comm_rank - determines a process's ID number
MPI_Init - Initializes MPI.
This allows the system to do any setup needed to handle further calls to the MPI library and must be called before any other MPI function.
MPI_Comm_size - find the number of processes
Every active process becomes a member of a communicator named MPI_COMM_WORLD, which is an opaque object that provides the environment for message passing among processes and is the default communicator. Processes within a communicator are ordered. Each process has a unique ID (rank) between 0 and p-1. A process may use its rank to determine which portion of a computation and/or dataset it is responsible for. A process calls MPI_Comm_rank to determine its rank within a communicator and calls MPI_Comm_size to determine the total number of processes in a communicator.
MPI_Bcast - send information to all
enables a process to broadcast one or more data items of the same type to all other processes in a communicator. After this function has executed, every process has an up-to-date value
MPI_Reduce - retrieve information from all
perform one or more reduction operations on values submitted by all the processes in a communicator. It is important to remember that while only a single process (in most cases process 0) gets the global result, every process must call MPI_Reduce.
MPI_Finalize - cleanly end MPI
After a process has completed all of its MPI library calls, it calls MPI_Finalize, allowing the system to free up resources such as memory. With few exceptions, no MPI calls may be made by a process after its call to MPI_FINALIZE
155. MPI Bindings C++
void MPI::Init( int& argc, char**& argv )
void MPI::Init( )
int MPI::Comm::Get_rank( ) const
int MPI::Comm::Get_size( ) const
void MPI::Intracomm::Bcast( void* buf, int count, const Datatype& datatype, int root ) const
void MPI::Intracomm::Reduce( const void* sendbuf, void* recvbuf, int count, const Datatype& datatype, const Op& op, int root ) const
void MPI::Finalize( )
156. MPI Bindings C++ C++ bindings added to MPI-2 and therefore might not be everywhere
The approach was to exploit MPI's object-oriented structure to define classes and methods that closely follow the structure of the C and Fortran bindings
Differences are as follows:
Call to MPI_Init has become MPI::Init
Invoke the method Get_size on the object MPI::COMM_WORLD and return the size of the MPI::COMM_WORLD communicator as its value
Instead of returning an error code, the behaviour is to throw an exception
The call to Get_rank is similar to the call to Get_size but returns the rank as the value of the function
Bcast and Reduce are methods on MPI::COMM_WORLD
157. MPI pi C++ program
#include <math.h>
#include <iostream>
#include "mpi.h"
using namespace std;

int main(int argc, char *argv[])
{
    int n, rank, size, i;
    double PI25DT = 3.141592653589793238462643;
    double mypi, pi, h, sum, x;
    // MPI initialization
    MPI::Init(argc, argv);
    size = MPI::COMM_WORLD.Get_size();
    rank = MPI::COMM_WORLD.Get_rank();
    while (true) {
        if (rank == 0) {
            cout << "Enter the number of intervals: (0 quits)" << endl;
            cin >> n;
        }
        // broadcast n to the rest of the processes
        MPI::COMM_WORLD.Bcast(&n, 1, MPI::INT, 0);
        if (n == 0)
            break;
        else {
            h = 1.0 / (double) n;
            sum = 0.0;
            for (i = rank + 1; i <= n; i += size) {
                x = h * ((double)i - 0.5);
                // function to integrate
                sum += (4.0 / (1.0 + x*x));
            }
            mypi = h * sum;
            // collect all partial sums
            MPI::COMM_WORLD.Reduce(&mypi, &pi, 1, MPI::DOUBLE, MPI::SUM, 0);
            // node 0 or master prints the answer
            if (rank == 0)
                cout << "pi is approximately " << pi
                     << ", Error is " << fabs(pi - PI25DT)
                     << endl;
        }
    }
    MPI::Finalize();
    return 0;
}
158. MPI Bindings - C
int MPI_Init( int *argc, char ***argv )
int MPI_Comm_size( MPI_Comm comm, int *size )
int MPI_Comm_rank( MPI_Comm comm, int *rank )
int MPI_Bcast( void *buf, int count, MPI_Datatype datatype, int root, MPI_Comm comm )
int MPI_Reduce( void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm comm )
int MPI_Finalize( void )
159. MPI Bindings - C Primary difference is that error codes are returned as the value of C functions instead of a separate argument
More strongly typed than in Fortran
e.g. MPI_Comm and MPI_Datatype where Fortran has integers
mpi.h instead of mpi module or mpif.h
MPI_Init is different to take advantage of command-line arguments
160. MPI pi C program
#include "mpi.h"
#include <stdio.h>
#include <math.h>

int main( int argc, char *argv[] )
{
    int n, myid, numprocs, i;
    double PI25DT = 3.141592653589793238462643;
    double mypi, pi, h, sum, x;
    /* MPI initialization */
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    while (1) {
        if (myid == 0) {
            printf("Enter the number of intervals: (0 quits) ");
            scanf("%d", &n);
        }
        /* broadcast n to the rest of the processes */
        MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
        if (n == 0)
            break;
        else {
            h = 1.0 / (double) n;
            sum = 0.0;
            for (i = myid + 1; i <= n; i += numprocs) {
                x = h * ((double)i - 0.5);
                /* function to integrate */
                sum += (4.0 / (1.0 + x*x));
            }
            mypi = h * sum;
            /* collect all partial sums */
            MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
            /* node 0 or master prints the answer */
            if (myid == 0)
                printf("pi is approximately %.16f, Error is %.16f\n",
                       pi, fabs(pi - PI25DT));
        }
    }
    MPI_Finalize();
    return 0;
}
161. MPI Bindings - Fortran
MPI_INIT( ierror )
    integer ierror
MPI_COMM_SIZE( comm, size, ierror )
    integer comm, size, ierror
MPI_COMM_RANK( comm, rank, ierror )
    integer comm, rank, ierror
MPI_BCAST( buffer, count, datatype, root, comm, ierror )
    <type> buffer(*)
    integer count, datatype, root, comm, ierror
MPI_REDUCE( sendbuf, recvbuf, count, datatype, op, root, comm, ierror )
    <type> sendbuf(*), recvbuf(*)
    integer count, datatype, op, root, comm, ierror
MPI_FINALIZE( ierror )
    integer ierror
162. MPI pi Fortran program
      program main
      use mpi
C     Use the following include if the mpi module is not available
C     include "mpif.h"
      double precision PI25DT
      parameter (PI25DT = 3.141592653589793238462643d0)
      double precision mypi, pi, h, sum, x, f, a
      integer n, myid, numprocs, i, ierr
C     function to integrate
      f(a) = 4.d0 / (1.d0 + a*a)
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, numprocs, ierr)
 10   if ( myid .eq. 0 ) then
         print *, 'Enter the number of intervals: (0 quits) '
         read(*,*) n
      endif
C     broadcast n to the rest of the processes
      call MPI_BCAST(n, 1, MPI_INTEGER, 0, MPI_COMM_WORLD, ierr)
C     check for quit signal
      if ( n .le. 0 ) goto 30
C     calculate the interval size
      h = 1.0d0/n
      sum = 0.0d0
      do 20 i = myid+1, n, numprocs
         x = h * (dble(i) - 0.5d0)
         sum = sum + f(x)
 20   continue
      mypi = h * sum
C     collect all partial sums
      call MPI_REDUCE(mypi, pi, 1, MPI_DOUBLE_PRECISION, MPI_SUM, 0,
     &                MPI_COMM_WORLD, ierr)
C     node 0 or master prints the answer
      if ( myid .eq. 0 ) then
         print *, 'pi is ', pi, ' Error is ', abs(pi - PI25DT)
      endif
      goto 10
 30   call MPI_FINALIZE(ierr)
      stop
      end
163. MPI - Timing double precision MPI_wtime()
double precision MPI_wtick()
For parallel programs getting the correct answer is not enough; one wishes to decrease the execution time
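A brief C sketch of timing a section of work with these routines (do_work is hypothetical):
#include <stdio.h>
#include "mpi.h"

void do_work(void);                /* hypothetical computation to be timed */

int main(int argc, char *argv[])
{
    double t0, t1;
    MPI_Init(&argc, &argv);
    t0 = MPI_Wtime();              /* wall-clock time in seconds */
    do_work();
    t1 = MPI_Wtime();
    printf("elapsed = %f s (timer resolution %f s)\n", t1 - t0, MPI_Wtick());
    MPI_Finalize();
    return 0;
}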
164. MPI - Common Errors and Misunderstandings Forgetting ierr in Fortran
The most common error in Fortran-written MPI programs is to forget the last argument, where the error code is returned
Fortran 90 will catch this, but other compilers will not, leading to hard-to-find errors
165. MPI - Common Errors Misdeclaring status in Fortran
When an MPI_Recv returns, certain information has been stored in the status argument
Status is an array of integers and not a single integer
Many compilers will not complain about this but running this program will lead to unexpected memory overwrites and mysterious behaviour
166. MPI - Common Errors Misdeclaring string variables in Fortran
Strings are not the same as arrays of characters
A 12-character string a in Fortran should be declared as:
character*12 a
and definitely not
character a(12)
The latter is an array of characters, not a single character string variable
167. MPI - Common Errors Expecting argc and argv to be passed to all processes
Arguments to MPI_Init in C are &argc and &argv which allows the MPI implementation to fill these in on all processes but the MPI Standard does not require it
MPI does not assume it is running under Unix; MPI also runs on Windows and Macs
168. MPI - Common Errors Doing things before MPI_Init and after MPI_Finalize
The MPI Standard says nothing about the situation before MPI_Init and after MPI_Finalize, not even how many processes are running
Doing anything before or after may yield unexpected and implementation-dependent results
169. MPI - Common Errors Matching MPI_Bcast with MPI_Recv
It may be thought that MPI_Bcast is a multiple send and that MPI_Recv should receive it
Not so: MPI_Bcast is a collective operation which must be called on all processes in the group
It functions as a multi-send on the specified root process and as a receive on the others
This allows for optimal performance
MPI_Recv does not have to check whether the message it received is part of a broadcast
170. MPI - Common Errors Assuming the MPI implementation is thread-safe
Thread-safe means that multiple threads can call MPI routines at the same time
The Standard does not require it of implementations
Unpredictable results may occur if the compiler generates multithreaded code and the MPI implementation is not thread-safe
171. MPI - Resources MPI Home page
http://www.mpi-forum.org
Examples of MPI programs
http://www.mcs.anl.gov/mpi/usingmpi2
MPI Implementations
http://www.mcs.anl.gov/mpi/mpich
MPI Standard
http://www.mpi-forum.org
Discussion on MPI
http://www.erc.msstate.edu/mpi/mpi-faq.html
172. Profilers Optimizing a small piece of code can be done easily by inserting calls to a timer, provided you know that optimizing that code will enhance the performance of your program. That is not the case with large applications, which is where a profiler comes in handy.
The job of the profiler is to aid in the discovery of bottlenecks and where you can increase performance.
They do this by inserting calls into applications that generate information and timings about subroutines, functions, and loops.
The performance of the program will suffer because of the extra data collection that is carried out by the processor
When using profilers:
Check for correct answers
Profile to find the most time consuming routines
Optimize these routines with compiler options, pragmas and source code modification
Repeat the first three steps ( but remember the speed of light)
173. Profilers Four types of profilers
Sampling based
Uses a predefined clock; with every tick a new sample is taken of what the code is doing. This type of profiler might miss an event happening between ticks
Event based
Every event is monitored (i.e. entry and exit from a subroutine)
Trace based
The profiler keeps all of the information it collects
Reductionist based
Only statistical information is kept (i.e. average time in a routine)
The most common Unix profilers are prof (reductionist) and gprof (event). They require code to be relinked and executed so that an output file with timing results can be created
174. Libraries If you are going to build a house you do not make the tools, nails, screws, wood, etc., yourself
Likewise for programming (serial or parallel). You do not recreate a cosine function every time you need one. There are many efficient mathematical libraries out there and at this point I would like to introduce some of the commonly used ones to you
175. Libraries - BLAS One of the stalwarts in mathematical libraries first started in 1979 by C. L. Lawson et al and continued to 1990 with Dongarra et al
BLAS stands for Basic Linear Algebra Subprograms and they are a collection of routines for basic vector and matrix operations. They are divided into three groups:
Level 1 BLAS vector operations
Level 2 BLAS matrix-vector operations
Level 3 BLAS matrix-matrix operations
They are strong and well-accepted subroutines used in F77, F90, F95, C, and C++ programs
http://www.netlib.org/blas/blast-forum/
176. Libraries - EISPACK EISPACK (Eigenvalue Software PACKage) is a collection of routines that compute eigenvalues and eigenvectors for several types of matrices. Designed in the late 1970s
177. Libraries - LAPACK LAPACK (Linear Algebra PACKage) is a successful public domain library used extensively.
Memory access patterns are more efficient
LAPACK subroutines use Level 2 and Level 3 BLAS routines wherever possible. So when vendors provide highly tuned BLAS routines LAPACK benefits
Users should use LAPACK rather than EISPACK or LINPACK as it is more efficient and is superior in performance
Comes with extensive documentation and tunable routines for your system.
178. Libraries - LINPACK LINPACK (LINear algebra PACKage) is a collection of Fortran subroutines for solving linear least-squares problems and linear equations
Designed in the late 1970s
Based on legacy BLAS Level 1 routines
Solves linear systems whose matrices are banded, dense, symmetric, symmetric positive definite, and triangular
179. Libraries - PLAPACK PLAPACK (Parallel Linear Algebra PACKage) developed in 1997 by R. A. van de Geijn and P. Alpatov is an infrastructure for coding linear algebra algorithms
Allows for high levels of performance to Cholesky, LU, and QR factorization solvers
More information can be found at http://www.cs.utexas.edu/users/plapack/
180. Libraries - ScaLAPACK ScaLAPACK (Scalable Linear Algebra PACKage) is a parallelized subset of the LAPACK library
Built primarily for distributed parallel computers and uses its own communication package BLACS (Basic Linear Algebra Communication Subprograms) which can be built with MPI or PVM
ScaLAPACK, LAPACK, and BLAS can be obtained from http://netlib.org/
181. Libraries - Commercial Most vendors provide their own HPC libraries for mathematical operations. They tend to perform better than the public domain versions and are fully supported by the vendor
HPs MLIB (Mathematical software LIBrary)
IBMs ESSL (Engineering and Scientific Subroutine Library)
SGIs SCSL (Scientific Computing Software Library)
Suns PL (Performance Library)
There are also numerous independent vendors that provide libraries:
NAG (Numerical Algorithms Group) http://www.nag.co.uk/
IMSL ( International Mathematical and Statistical Library) http://www.vni.com/
The IMSL Fortran Library is the only commercially available library that has 100% thread safety
It is available for Fortran, C, C++, and Java
HSL (Harwell Scientific Library) http://www.cse.clrc.ac.uk/nag/hsl/
Can be obtained without charge for academic purposes
182. Debugging Programming is an error-prone activity
Most programmers (unix) will use dbx, and some sort of print statement (printf, print) to catch and isolate bugs
Debugging parallel programs is much harder since:
More things can go wrong
Multiple processes are performing multiple computations and interact with other processes using some sort of message passing
Programmers do not have access to good tools. However, you can purchase TotalView, Vampir, etc.
183. Debugging General Hints Make sure your single processor version runs correctly and is efficient. You can take advantage of serial debuggers such as dbx or gdb to test values, set breakpoints, etc.
Work with the smallest number of processors that can test the program's functionality, usually two or three
Use a smaller problem size. Typically a 4 x 4 array can functionally test a program that requires a 1024 x 1024 matrix. Using a smaller problem size allows you to put in additional print statements to look at data structures. You will have a lot less output to consider
184. Debugging General Hints Put a flush statement (i.e. fflush(stdout)) after every printf statement to collect all output from every process before the program deadlocks
For point-to-point messages, to be sure of the values, print the data that you are sending and receiving
For a serial program, messages received will be in chronological order. This is not the case for parallel programming. You cannot assume that if a message from process A appears before a message from process B that A's message was printed before B's. So in MPI, prefix each message with the process rank, then sort the output of the program (which will organize the output by process)
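A hedged sketch of the rank-prefixed, flushed debug print described above (value stands in for whatever data is being checked):
#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    int rank, value = 42;                            /* illustrative debug data */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("[rank %d] value = %d\n", rank, value);   /* prefix every message with the rank */
    fflush(stdout);                                  /* push output out before a possible deadlock */
    MPI_Finalize();
    return 0;
}
Sorting the combined output on the rank prefix then groups each process's messages together.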
185. Debugging General Hints Debug the initialization phase of the program first to make sure that you have set up the data structures correctly
Do not try and optimize the program. First get the logic right. Then worry about performance enhancing steps
Reuse code or use library functions wherever possible
186. Performance Analysis Accurately predicting the performance of your parallel algorithm should help you decide whether it is worth coding and debugging
Explanation and a closer look at:
Amdahl's Law, which relies upon the evaluation of the sequential program to predict an upper limit to the speedup using multiple processors
Gustafson-Barsis's Law, which gives an estimate of scaled speedup
Karp-Flatt metric, which examines the speedup achieved by a parallel program solving a problem of fixed size
Isoefficiency metric, which is used to determine the scalability of a parallel system
187. Performance analysis Amdahl's Law Amdahl's Law: S(p) = p/(1+(p-1)f) = 1/(f+(1-f)/p) where:
f - the fraction of the computation that cannot be divided into concurrent tasks, 0 ≤ f ≤ 1, and
p - the number of processors
So if we have 20 processors and a serial portion of 5% we will get a speedup of 20/(1+(20-1)(0.05)) = 10.26
Amdahl's Law ignores parallel overhead
188. Performance Analysis Gustafson-Barsis's Law Given a parallel program solving a problem of size n using p processors, let s denote the fraction of total execution time spent in serial code. The maximum speedup is
ψ = p + (1 - p)s
Example: suppose I told my boss that if I buy a supercomputer with 3000 processors we can get a speedup of at least 2,700 times on the boss's favorite program. How much time can be spent in the serial portion of the code in order to meet this goal?
2700 = 3000 - 2999s
s = 300/2999
  ≈ 0.1 = 10%
189. Performance Analysis Karp-Flatt Metric Since both of the previous two laws ignore the parallel overhead term (time for interprocessor communication), they can overestimate speedup or scaled speedup. Therefore we have the Karp-Flatt metric, which states: given a parallel computation exhibiting a speedup ψ on p processors, where p > 1, the experimentally determined serial fraction e is defined as
e = (1/ψ - 1/p)/(1 - 1/p)
190. Performance Analysis Karp-Flatt Metric Example: suppose we have a 4-processor machine experiencing the following speedup
p    1      2      3      4
ψ    1.00   1.82   2.50   3.08
Why do we only have a speedup of 3.08 on the 4 processors?
If we compute e for the above (e is defined only for p > 1) we will notice that
e    -      .10    .10    .10
which shows that the experimentally determined serial fraction is not increasing with the number of processors; the limited speedup is therefore due to the program being inherently sequential rather than to parallel overhead
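A small C sketch that reproduces the e values above from the measured speedups:
#include <stdio.h>

/* experimentally determined serial fraction: e = (1/psi - 1/p) / (1 - 1/p) */
double karp_flatt(double psi, int p)
{
    return (1.0/psi - 1.0/p) / (1.0 - 1.0/p);
}

int main(void)
{
    double psi[] = {1.82, 2.50, 3.08};            /* measured speedups for p = 2, 3, 4 */
    for (int p = 2; p <= 4; p++)
        printf("p = %d  e = %.2f\n", p, karp_flatt(psi[p - 2], p));
    return 0;
}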
191. Performance Analysis Isoefficiency metric Efficiency of a parallel system can be represented as
e(n, p) where n is the problem size and p the number of processors and let
C = e(n, p) / (1 - e(n, p))
T(n, 1) be the sequential execution time and
T0(n, p) be the total amount of overhead time
Then in order to maintain the same level of efficiency as the number of processors increases, the problem size must increase to satisfy the following inequality:
T(n, 1) ≥ C T0(n, p)
192. Performance Analysis Isoefficiency metric Example
We have a sequential reduction algorithm whose computational complexity is Θ(n), with the reduction step having a time complexity of Θ(log p); every processor is used, so T0(n, p) = Θ(p log p), and hence the isoefficiency relation for this algorithm is n ≥ C p log p
To check:
If we are sum-reducing n values over p processors then each processor adds n/p values and takes part in a reduction that has ⌈log p⌉ steps. Doubling n and p would still keep each processor's share the same, but the number of steps needed to perform the reduction increases slightly to ⌈log 2p⌉ (the efficiency has dropped a bit). So in order to maintain the same efficiency, every time p is doubled, n will have to be more than doubled