1. HPC Parallel Programming: From Concept to Compile. A One-day Introductory Workshop
October 12th, 2004
2. Schedule Concept (a frame of mind)
Compile (application)
3. Introduction Programming parallel computers
Compiler extension
Sequential programming language extension
Parallel programming layer
New parallel languages
4. Concept Parallel Algorithm Design
Programming paradigms
Parallel Random Access Machine (the PRAM model)
Result, Agenda, Specialist (the RAS model)
Task/Channel (the PCAM model)
Bulk Synchronous Parallel (the BSP model)
Pattern Language
5. Compile Serial
Introduction to OpenMP
Introduction to MPI
Profilers
Libraries
Debugging
Performance Analysis Formulas
6. By the end of this workshop you will be exposed to:
Different parallel programming models and paradigms
Serial ("I'm not bad, I was built this way") programming and how you can optimize it
OpenMP and MPI
libraries
debugging
7. Eu est velociter perfectus
Well Done is Quickly Done
Caesar Augustus
8. Introduction What is Parallel Computing?
It is the ability to program in a language that allows you to explicitly indicate how different portions of the computation may be executed concurrently by different processors
9. Why do it?
The need for speed
How much speedup can be obtained is estimated by:
Amdahl's Law: S(p) = p / (1 + (p - 1)f), where:
f - the fraction of the computation that cannot be divided into concurrent tasks, 0 <= f <= 1, and
p - the number of processors
So if we have 20 processors and a serial fraction of 5% we will get a speedup of 20/(1 + (20 - 1)(0.05)) ≈ 10.26 (a short code sketch follows this list)
Also Gustafson-Barsis's Law, which takes scalability into account, and
the Karp-Flatt Metric, which takes the parallel overhead into account, and
the Isoefficiency Relation, which is used to determine the range of processors for which a particular level of efficiency can be maintained. Parallel overhead increases as the number of processors increases, so to maintain efficiency, increase the problem size
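As a quick illustration of the speedup formula above, here is a minimal C sketch (not part of the original workshop material) that evaluates Amdahl's Law for a few processor counts. The 5% serial fraction matches the example; the other processor counts are illustrative only.

```c
/* Minimal sketch: evaluating Amdahl's Law S(p) = p / (1 + (p - 1)f). */
#include <stdio.h>

static double amdahl_speedup(int p, double f)
{
    return (double)p / (1.0 + (p - 1) * f);
}

int main(void)
{
    const double f = 0.05;                  /* serial fraction from the example */
    const int procs[] = { 1, 4, 20, 100 };  /* illustrative processor counts    */

    for (int i = 0; i < 4; i++)
        printf("p = %3d  S(p) = %6.2f\n", procs[i], amdahl_speedup(procs[i], f));
    return 0;   /* p = 20 prints S(p) = 10.26, as in the slide */
}
```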
10. Why Do Parallel Computing? Some other reasons:
Time: Reduce the turnaround time of applications
Performance: Parallel computing is the only way to extend performance toward the TFLOP realm
Cost/Performance: Traditional vector computers become too expensive as one pushes the performance barrier
Memory: Applications often require memory that goes beyond that addressable by a single processor
Whole classes of important algorithms are ideal for parallel execution. Most algorithms can benefit from parallel processing, such as the Laplace equation, Monte Carlo methods, FFT (signal processing), and image processing
Life itself is a set of concurrent processes
Scientists use modeling so why not model systems in a way closer to nature
11. Many complex scientific problems require large computing resources. Problems such as:
Quantum chemistry, statistical mechanics, and relativistic physics
Cosmology and astrophysics
Computational fluid dynamics and turbulence
Biology, genome sequencing, genetic engineering
Medicine
Global weather and environmental modeling
One such place is http://www-fp.mcs.anl.gov/grand-challenges/
12. Programming Parallel Computers In 1988 four distinct paths for application software development on parallel computers were identified by McGraw and Axelrod:
Extend an existing compiler to translate sequential programs into parallel programs
Extend an existing language with new operations that allow users to express parallelism
Add a new language layer on top of an existing sequential language
Define a totally new parallel language
13. Compiler extension Design parallelizing compilers that exploit parallelism in existing programs written in a sequential language
Advantages:
billions of dollars and thousands of years of programmer effort have already gone into legacy programs.
Automatic parallelization can save money and labour.
It has been an active area of research for over twenty years
Companies such as Parallel Software Products http://www.parallelsp.com/ offer compilers that translate F77 code into parallel programs for MPI and OpenMP
Disadvantages:
Pits the programmer and compiler in a game of hide and seek. The programmer hides parallelism in DO loops and control structures, and the compiler might irretrievably lose some of that parallelism
14. Sequential Programming Language Design Extend a sequential language with functions that allow programmers to create, terminate, synchronize and communicate with parallel processes
Advantages:
Easiest, quickest, and least expensive since it only requires the development of a subroutine library
Libraries meeting the MPI standard exist for almost every parallel computer
Gives programmers flexibility with respect to program development
Disadvantages:
Compiler is not involved in generation of parallel code therefore it cannot flag errors
It is very easy to write parallel programs that are difficult to debug
15. Parallel Programming Layers Think of a parallel program as consisting of 2 layers. The bottom layer contains the core of the computation, which manipulates its portion of the data to get its result. The upper layer controls the creation and synchronization of processes. A compiler would then translate these two layers into code for execution on parallel machines
Advantages:
Allows users to depict parallel programs as directed graphs with nodes depicting sequential procedures and arcs representing data dependences among procedures
Disadvantages:
Requires programmer to learn and use a new parallel programming system
16. New Parallel Languages Develop a parallel language from scratch. Let the programmer express parallel operations explicitly. The programming language Occam is one such famous example http://wotug.ukc.ac.uk/parallel/occam/
Advantages:
Explicit parallelism means programmer and compiler are now allies instead of adversaries
Disadvantages:
Requires development of new compilers. It typically takes years for vendors to develop high-quality compilers for their parallel architectures
Some parallel languages, such as C*, were not adopted as standards, severely compromising code portability
User resistance. Who wants to learn another language?
17. The most popular approach continues to be augmenting existing sequential languages with low-level constructs expressed by function calls or compiler directives
Advantages:
Can exhibit high efficiency
Portable to a wide range of parallel systems
C, C++, F90 with MPI or OpenMP are such examples
Disadvantages:
More difficult to code and debug
18. Concept An algorithm (from the OED) is a set of rules or processes, usually expressed in algebraic notation, now used in computing
A parallel algorithm is one in which the rules or processes are concurrent
There is no simple recipe for designing parallel algorithms. However, it can benefit from a methodological approach. It allows the programmer to focus on machine-independent issues such as concurrency early in the design process and machine-specific aspects later
You will be introduced to such approaches and models and hopefully gain some insight into the design process
Examining these models is a good way to start thinking in parallel
19. Parallel Programming Paradigms Parallel applications can be classified into well defined programming paradigms
Each paradigm is a class of algorithms that have the same control structure
Experience suggests that there are a relatively few paradigms underlying most parallel programs
The choice of paradigm is determined by the available computing resources (which define the level of granularity) and by the type of parallelism inherent in the program (which reflects the structure of either the data or the application)
A paradigm is a pattern: a case or instance regarded as representative or typical
20. Parallel Programming Paradigms The most systematic definition of paradigms comes from a technical report from the University of Basel in 1993 entitled BACS: Basel Algorithm Classification Scheme
A generic tuple of factors which characterize a parallel algorithm:
Process properties (structure, topology, execution)
Interaction properties
Data properties (partitioning, placement)
The following paradigms were described:
Task-Farming (or Master/Slave)
Single Program Multiple Data (SPMD)
Data Pipelining
Divide and Conquer
Speculative Parallelism
21. PPP Task-Farming Task-farming consists of two entities:
Master which decomposes the problem into small tasks and distributes/farms them to the slave processes. It also gathers the partial results and produces the final computational result
Slave which gets a message with a task, executes the task and sends the result back to the master
It can use either static load balancing (distribution of tasks is all performed at the beginning of the computation) or dynamic load-balancing (when the number of tasks exceeds the number of processors or is unknown, or when execution times are not predictable, or when dealing with unbalanced problems). This paradigm responds quite well to the loss of processors and can be scaled by extending the single master to a set of masters
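The master/slave structure above can be sketched directly in MPI. What follows is a hedged illustration, not workshop code: the task and result are plain doubles, do_work() is a placeholder for the real computation, and it assumes at least one worker and at least as many tasks as workers. The master hands one task to each worker, then keeps refilling whichever worker answers first (dynamic load balancing).

```c
/* Hedged sketch of the task-farming (master/slave) paradigm with MPI. */
#include <mpi.h>
#include <stdio.h>

#define NTASKS   100
#define TAG_WORK 1
#define TAG_STOP 2

static double do_work(double task) { return task * task; }  /* placeholder */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                           /* master: farm out tasks */
        double result, sum = 0.0;
        int sent = 0, received = 0;
        MPI_Status st;
        for (int w = 1; w < size && sent < NTASKS; w++, sent++) {
            double task = (double)sent;
            MPI_Send(&task, 1, MPI_DOUBLE, w, TAG_WORK, MPI_COMM_WORLD);
        }
        while (received < sent) {
            MPI_Recv(&result, 1, MPI_DOUBLE, MPI_ANY_SOURCE, TAG_WORK,
                     MPI_COMM_WORLD, &st);
            sum += result; received++;
            if (sent < NTASKS) {               /* dynamic load balancing */
                double task = (double)sent++;
                MPI_Send(&task, 1, MPI_DOUBLE, st.MPI_SOURCE, TAG_WORK,
                         MPI_COMM_WORLD);
            } else {                           /* no tasks left: stop this worker */
                double dummy = 0.0;
                MPI_Send(&dummy, 1, MPI_DOUBLE, st.MPI_SOURCE, TAG_STOP,
                         MPI_COMM_WORLD);
            }
        }
        printf("sum = %f\n", sum);
    } else {                                   /* slave: receive, compute, return */
        double task, result;
        MPI_Status st;
        while (1) {
            MPI_Recv(&task, 1, MPI_DOUBLE, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == TAG_STOP) break;
            result = do_work(task);
            MPI_Send(&result, 1, MPI_DOUBLE, 0, TAG_WORK, MPI_COMM_WORLD);
        }
    }
    MPI_Finalize();
    return 0;
}
```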
22. PPP Single Program Multiple data (SPMD) SPMD is the most commonly used paradigm
Each process executes the same piece of code but on a different part of the data which involves the splitting of the application data among the available processors. This is also referred to as geometric parallelism, domain decomposition, or data parallelism
Applications can be very efficient if the data is well distributed by the processes on a homogeneous system. If different workloads are evident then some sort of load balancing scheme is necessary during run-time execution
Highly sensitive to the loss of a process. Usually results in a deadlock until a global synchronization point is reached
23. PPP Data Pipelining Data pipelining is fine-grained parallelism and is based on a functional decomposition approach
The tasks (capable of concurrent operation) are identified and each processor executes a small part of the total algorithm
One of the simplest and most popular functional decomposition paradigms and can also be referred to as data-flow parallelism.
Communication between stages of the pipeline can be asynchronous; efficiency is directly dependent on the ability to balance the load across the stages
Often used in data reduction and image processing
24. PPP Divide and Conquer The divide and conquer approach is well known in sequential algorithm development in which a problem is divided into two or more subproblems. Each subproblem is solved independently and the results combined
In parallel divide and conquer, the subproblems can be solved at the same time
Three generic computational operations split, compute, and join (sort of like a virtual tree where the tasks are computed at the leaf nodes)
25. PPP Speculative Parallelism Employed when it is difficult to obtain parallelism through any one of the previous paradigms
Deals with complex data dependencies which can be broken down into smaller parts using some speculation or heuristic to facilitate the parallelism
26. PRAM Parallel Random Access Machine Descendant of the RAM (Random Access Machine)
A theoretical model of parallel computation in which an arbitrary but finite number of processors can access any value in an arbitrarily large shared memory in a single time step
Introduced in the 1970s, it still remains popular since it is theoretically tractable and gives algorithm designers a common target. The Prentice Hall book from 1989 entitled The Design and Analysis of Parallel Algorithms gives a good introduction to the design of algorithms using this model
27. PRAM cont The three most important variations on this model are:
EREW (exclusive read, exclusive write) where any memory location may be accessed only once in any one step
CREW (concurrent read exclusive write) where any memory location may be read any number of times during a single step but written to only once after the reads have finished
CRCW (concurrent read concurrent write) where any memory location may be written to or read from any number of times during a single step. Some rule or priority must be given to resolve multiple writes
28. PRAM cont This model has problems
PRAMs cannot be emulated optimally on all architectures
The problem lies in the assumption that every processor can access the memory simultaneously in a single step. Even in hypercubes, messages must take several hops between source and destination, and the hop count grows logarithmically with the machine's size. As a result, any buildable computer will experience a logarithmic slowdown relative to the PRAM model as its size increases
One solution is to take advantage of cases in which there is greater parallelism in the process than in the hardware it is running on, enabling each physical processor to emulate many virtual processors. An example of such is as follows:
29. PRAM cont Example
Process A sends request
Process B runs while request travels to memory
Process C runs while memory services request
Process D runs while reply returns to processor
Process A is re-scheduled
The efficiency with which physically resizable architectures could emulate the PRAM is dictated by the theorem:
If each of P processors sends a single message to a randomly-selected partner, it is highly probable that at least one processor will receive O(log P / log log P) messages, and some others will receive none; but
If each processor sends log P messages to randomly-selected partners, there is a high probability that no processor will receive more than 3 log P messages.
So if the problem size is increased at least logarithmically faster than the machine size, efficiency can be held constant. The problem is that this only holds for hypercubes, in which the number of communication links grows with the number of processors.
Several ways around the above limitation have been suggested, such as:
30. PRAM cont XPRAM, where computations are broken up into steps such that no processor may communicate more than a certain number of times per single time step.
Programs which fit this model can be emulated efficiently
Problem is that it is difficult to design algorithms in which the frequency of communication decreases as the problem size increases
Bird-Meertens formalism where the allowed set of communications would be restricted to those which can be emulated efficiently
The scan-vector model proposed by Blelloch accounts for the relative distance of different portions of memory
Another option was proposed by Ranade in 1991, which uses a butterfly network in which each node is a processor/memory pair. Routing messages in this model is complicated, but the end result is an optimal PRAM emulation
31. Result, Agenda, Specialist Model The RAS model was proposed by Nicholas Carriero and David Gelernter in their book How to Write Parallel Programs in 1990
To write a parallel program:
Choose a pattern that is most natural to the problem
Write a program using the method that is most natural for that pattern, and
If the resulting program is not efficient, then transform it methodically into a more efficient version
32. RAS Sounds simple. We can envision parallelism in terms of:
Result - focuses on the shape of the finished product
Plan a parallel application around a data structure yielded as the final result. We get parallelism by computing all elements of the result simultaneously
Agenda - focuses on the list of tasks to be performed
Plan a parallel application around a particular agenda of tasks, and then assign many processes to execute the tasks
Specialist - focuses on the make-up of the work
Plan an application around an ensemble of specialists connected into a logical network of some kind. Parallelism results from all nodes being active simultaneously, much like pipelining
33. RAS - Result In most cases the easiest way to think of a parallel program is to think of the resulting data structure. It is a good starting point for any problem whose goal is to produce a series of values with predictable organization and interdependencies
Such a program reads as follows:
Build a data structure
Determine the value of all elements of the structure simultaneously
Terminate when all values are known
If all values are independent then all computations start in parallel. However, if some elements cannot be computed until certain other values are known, then those tasks are blocked
As a simple example consider adding two n-element vectors (i.e. add the ith elements of both and store the sum in another vector)
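The vector-addition example reads naturally as result parallelism: every element of the result is an independent task. A minimal OpenMP sketch (illustrative only, not the workshop's code) follows.

```c
/* Result parallelism for the vector-add example: every element of z is
   computed independently across the available threads. */
#include <stdio.h>

#define N 1000000

int main(void)
{
    static double x[N], y[N], z[N];

    for (int i = 0; i < N; i++) { x[i] = i; y[i] = 2.0 * i; }

    /* each element of the result is an independent task */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        z[i] = x[i] + y[i];

    printf("z[N-1] = %f\n", z[N - 1]);
    return 0;
}
```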
34. RAS - Agenda Agenda parallelism adapts well to many different problems
The most flexible is the master-worker paradigm
in which a master process initializes the computation and creates a collection of identical worker processes
Each worker process is capable of performing any step in the computation
Workers seek a task to perform and then repeat
When no tasks remain, the program is finished
An example would be to find the lowest ratio of salary to dependents in a database. The master fills a bag with records, and each worker draws from the bag, computes the ratio, and sends the result back to the master. The master keeps track of the minimum and, when the tasks are complete, reports the answer
35. RAS - Specialist Specialist parallelism involves programs that are conceived in terms of a logical network.
Best understood when each node executes an autonomous computation and inter-node communication follows predictable paths
An example could be a circuit simulation where each element is realized by a separate process
36. RAS - Example Consider a naive n-body simulator where on each iteration of the simulation we calculate the prevailing forces between each body and all the rest, and update each body's position accordingly
With the result parallelism approach it is easy to restate the problem description as follows:
Suppose n bodies, q iterations of the simulation, compute matrix M such that M [i, j] is the position of the ith body after the jth iteration
Define each entry in terms of other entries i.e. write a function to compute position (i, j)
37. RAS - Example With the agenda parallelism model we can repeatedly apply the transformation "compute next position" to all bodies in the set
So the steps involved would be to
Create a master process and have it generate n initial task descriptors ( one for each body )
On the first iteration, each process repeatedly grabs a task descriptor and computes the next position of the corresponding body, until all task descriptors are used
The master can then store information about each body's position at the last iteration in a distributed table structure where each process can refer to it directly
38. RAS - Example Finally, with the specialist parallelism approach we create a series of processes, each one specializing in a single body (i.e. each responsible for computing a single body's current position throughout the entire simulation)
At the start of each iteration, each process sends data to and receives data from each other process
The data included in the incoming group of messages is sufficient to allow each process to compute a new position for its body, then repeat
39. Task Channel model THERE IS NO SIMPLE RECIPE FOR DESIGNING PARALLEL ALGORITHMS
However, with suggestions by Ian Foster and his book Designing and Building Parallel Programs there is a methodology we can use
The task/channel method is the one most often cited as a practical means to organize the design process
It represents a parallel computation as a set of tasks that may interact with each other by sending messages through channels
It can be viewed as a directed graph where vertices represent tasks and directed edges represent channels
A thorough examination of this design process will conclude with a practical example
40. A task is a program, its local memory, and a collection of I/O ports
The local memory contains the program's instructions and its private data
It can send local data values to other tasks via output ports
It also receives data values from other tasks via input ports
A channel is a message queue that connects one task's output port with another task's input port
Data values appear at the input port in the same order as they were placed in the output port of the channel
Tasks cannot receive data until it is sent (i.e. receiving is blocked)
Sending is never blocked
41. The four stages of Foster's design process are:
Partitioning - the process of dividing the computation and data into pieces
Communication - the pattern of sends and receives between tasks
Agglomeration - the process of grouping tasks into larger tasks to simplify programming or improve performance
Mapping - the process of assigning tasks to processors
Commonly referred to as PCAM
42. Partitioning Discover as much parallelism as possible. To this end strive to split the computation and data into smaller pieces
There are two approaches:
Domain decomposition
Functional decomposition
43. PCAM partitioning domain decomposition Domain decomposition is where you first divide the data into pieces and then determine how to associate computations with the data
Typically focus on the largest or most frequently accessed data structure in the program
Consider a 3D matrix. It can be partitioned as:
Collection of 2D slices, resulting in a 1D collection of tasks
Collection of 1D slices, resulting in a 2D collection of tasks
Each matrix element separately resulting in a 3D collection of tasks
At this point in the design process it is usually best to maximize the number of tasks hence 3D partitioning is best
44. PCAM partitioning functional decomposition Functional decomposition is complementary to domain decomposition: the computation is first divided into pieces and then the data items are associated with each computation. This is often known as pipelining, which yields a collection of concurrent tasks
Consider brain surgery
before surgery begins a set of CT images are input to form a 3D model of the brain
The system tracks the position of the instruments converting physical coordinates into image coordinates and displaying them on a monitor. While one task is converting physical coordinates to image coordinates, another is displaying the previous image, and yet another is tracking the instrument for the next image. (Anyone remember the movie The Fantastic Voyage?)
45. PCAM Partitioning - Checklist Regardless of decomposition we must maximize the number of primitive tasks since it is the upper bound on the parallelism we can exploit. Foster has presented us a checklist to evaluate the quality of the partitioning:
There are at least an order of magnitude more tasks than processors on the target parallel machine. If not, there may be little flexibility in later design options
Avoid redundant computation and storage requirements since the design may not work well when the size of the problem increases
Tasks are of comparable size. If not, it may be hard to allocate each processor equal amounts of work
The number of tasks scale with problem size. If not, it may be difficult to solve larger problems when more processors are available
Investigate alternative partitioning to maximize flexibility later
46. PCAM-Communication After the tasks have been identified it is necessary to understand the communication patterns between them
Communications are considered part of the overhead of a parallel algorithm, since the sequential algorithm does not need to do this. Minimizing this overhead is an important goal
Two such patterns local and global are more commonly used than others (structured/unstructured, static/dynamic, synchronous/asynchronous)
Local communication exists when a task needs values from a small number of other tasks (its neighbours) in order to perform a computation
Global communication exists when a large number of tasks must supply data in order to perform a computation (e.g. a parallel reduction operation computing the sum of values over N tasks)
47. PCAM Communication - checklist These are guidelines and not hard and fast rules
Are the communication operations balanced between tasks? Unbalanced communication requirements suggest a non-scalable construct
Each task communicates only with a small number of neighbours
Tasks are able to communicate concurrently. If not the algorithm is likely to be inefficient and non-scalable.
Tasks are able to perform their computations concurrently
48. PCAM - Agglomeration The first two steps of the design process were focused on identifying as much parallelism as possible
At this point the algorithm would probably not execute efficiently on any particular parallel computer. For example, if there are orders of magnitude more tasks than processors, it can lead to significant communication overhead
In the next two stages of the design we consider combining tasks into larger tasks and then mapping them onto physical processors to reduce parallel overhead
49. PCAM - Agglomeration Agglomeration (according to the OED) is the process of collecting in a mass. In this case we try to group tasks into larger tasks to improve performance or to simplify programming.
There are three main goals to agglomeration:
Lower communication overhead
Maintain the scalability of the parallel design, and
Reduce software engineering costs
50. PCAM - Agglomeration How can we lower communication overhead?
By agglomerating tasks that communicate with each other, communication is completely eliminated, since data values controlled by the tasks are in the memory of the consolidated task. This process is known as increasing the locality of the parallel algorithm
Another way is to combine groups of transmitting and receiving tasks thereby reducing the number of messages sent. Sending fewer, longer messages takes less time than sending more, shorter messages since there is an associated startup cost (message latency) inherent with every message sent which is independent of the length of the message.
51. PCAM - Agglomeration How can we maintain the scalability of the parallel design?
Ensure that you do not combine too many tasks since porting to a machine with more processors may be difficult.
For example, suppose part of your parallel program manipulates a 16 x 128 x 256 3D array and the machine has 8 processors. By agglomerating the 2nd and 3rd dimensions, each task would be responsible for a 2 x 128 x 256 submatrix. We can even port this to a machine that has 16 processors. However, porting it to a machine with more processors might require large changes to the parallel code, so agglomerating the 2nd and 3rd dimensions might not be a good idea. What about a machine with 50, 64, or 128 processors?
52. PCAM - Agglomeration How can we reduce software engineering costs?
By parallelizing a sequential program we can reduce time and expense of developing a similar parallel program. Remember Parallel Software Products
53. PCAM Agglomeration - checklist Some of these points in this checklist emphasize quantitative performance analysis which becomes more important as we move from the abstract to the concrete
Has the agglomeration increased the locality of the parallel algorithm?
Do replicated computations take less time than the communications they replace?
Is the amount of replicated data small enough to allow the algorithm to scale?
Do agglomerated tasks have similar computational and communication costs?
Is the number of tasks an increasing function of the problem size?
Is the number of tasks as small as possible, yet still at least as large as the number of processors on your parallel computer?
Is the trade-off between your chosen agglomeration and the cost of modifications to existing sequential code reasonable?
54. PCAM - Mapping In this 4th and final stage we specify where each task is to execute
The goals of mapping are to maximize processor utilization and minimize interprocessor communications. Often these are conflicting goals
Processor utilization is maximized when the computation is balanced evenly. Conversely, it drops when one or more processors are idle
Interprocessor communication increases when two tasks connected by a channel are mapped to different processors. Conversely, it decreases when the two tasks connected by the channel are mapped to the same processor
Mapping every task to the same processor cuts communication to zero, but utilization is reduced to 1/p. The point is to choose a mapping that represents a reasonable balance between conflicting goals. The mapping problem has a name, and it is...
55. PCAM - Mapping The mapping problem is known to be NP-hard, meaning that no computationally tractable (polynomial-time) algorithm exists for evaluating these trade-offs in the general case. Hence we must rely on heuristics that can do a reasonably good job of mapping
Some strategies for decomposition of a problem are:
Perfectly parallel
Domain
Control
Object-oriented
Hybrid/layered (multiple uses of the above)
56. PCAM Mapping decomposition - perfect Perfectly parallel
Applications that require little or no inter-processor communication when running in parallel
Easiest type of problem to decompose
Results in nearly perfect speed-up
The pi example is almost perfectly parallel
The only communication occurs at the beginning of the problem when the number of divisions needs to be broadcast and at the end where the partial sums need to be added together
The calculation of the area of each slice proceeds independently
This would be true even if the area calculation were replaced by something more complex
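The workshop's actual pi code is not reproduced here, but the usual midpoint-rule integration of 4/(1+x^2) over [0,1] shows why the example is almost perfectly parallel: only the number of slices is shared at the start and only the partial sums are combined at the end.

```c
/* Hedged sketch of the pi example: midpoint-rule integration where the
   area of each slice is computed independently and the partial sums are
   combined in a single reduction at the end. */
#include <stdio.h>

int main(void)
{
    const long n = 10000000;          /* number of slices (illustrative) */
    const double h = 1.0 / n;
    double sum = 0.0;

    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < n; i++) {
        double x = (i + 0.5) * h;     /* midpoint of slice i             */
        sum += 4.0 / (1.0 + x * x);   /* each slice proceeds independently */
    }
    printf("pi is approximately %.12f\n", h * sum);
    return 0;
}
```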
57. PCAM mapping decomposition - domain Domain decomposition
In simulation and modelling this is the most common solution
The solution space (which often corresponds to the real space) is divided up among the processors. Each processor solves its own little piece
Finite-difference methods and finite-element methods lend themselves well to this approach
The method of solution often leads naturally to a set of simultaneous equations that can be solved by parallel matrix solvers
Sometimes the solution involves some kind of transformation of variables (i.e. Fourier Transform). Here the domain is some kind of phase space. The solution and the various transformations involved can be parallelized
58. PCAM mapping decomposition - domain Solution of a PDE (Laplace's Equation)
A finite-difference approximation
Domain is divided into discrete finite differences
Solution is approximated throughout
In this case, an iterative approach can be used to obtain a steady-state solution
Only nearest neighbour cells are considered in forming the finite difference
Gravitational N-body, structural mechanics, weather and climate models are other examples
59. PCAM mapping decomposition - control Control decomposition
If you cannot find a good domain to decompose, your problem might lend itself to control decomposition
Good for:
Unpredictable workloads
Problems with no convenient static structures
One type of control decomposition is functional decomposition
The problem is viewed as a set of operations; parallelization is done among the operations
Many examples in industrial engineering ( i.e. modelling an assembly line, a chemical plant, etc.)
Many examples in data processing where a series of operations is performed on a continuous stream of data
60. PCAM mapping decomposition - control Examples
Image processing: given a series of raw images, perform a series of transformations that yield a final enhanced image. Solve this with a functional decomposition (each process represents a different function in the problem) using data pipelining
Game playing: games feature an irregular search space. One possible move may lead to a rich set of possible subsequent moves to search.
61. PCAM mapping decomposition - OO Object-oriented decomposition is really a combination of functional and domain decomposition
Rather than thinking about dividing data or functionality, we look at the objects in the problem
The object can be decomposed as a set of data structures plus the procedures that act on those data structures
The goal of object-oriented parallel programming is distributed objects
Although conceptually clear, in practice it can be difficult to achieve good load balancing among the objects without a great deal of fine tuning
Works best for fine-grained problems and in environments where having functionality ready at the call is more important than worrying about under-worked processors (i.e. battlefield simulation)
Message passing is still explicit (no standard C++ compiler automatically parallelizes over objects).
62. PCAM mapping decomposition - OO Example: the client-server model
The server is an object that has data associated with it (i.e. a database) and a set of procedures that it performs (i.e. searches for requested data within the database)
The client is an object that has data associated with it (i.e. a subset of data that it has requested from the database) and a set of procedures it performs (i.e. some application that massages the data).
The server and client can run concurrently on different processors: an object-oriented decomposition of a parallel application
In the real world, this can be large scale, when many clients (workstations running applications) access a large central database, kind of like a distributed supercomputer
63. PCAM mapping decomposition -summary A good decomposition strategy is
Key to potential application performance
Key to programmability of the solution
There are many different ways of thinking about decomposition
Decomposition models (domain, control, object-oriented, etc.) provide standard templates for thinking about the decomposition of a problem
Decomposition should be natural to the problem rather than natural to the computer architecture
Communication does no useful work; keep it to a minimum
Always wise to see if a library solution already exists for your problem
Don't be afraid to use multiple decompositions in a problem if it seems to fit
64. PCAM mapping - considerations If the communication pattern among tasks is regular, create p agglomerated tasks that minimize communication and map each task to its own processor
If the number of tasks is fixed and communication among them regular, but the time required to perform each task is variable, then some sort of cyclic or interleaved mapping of tasks to processors may result in a more balanced load
Dynamic load-balancing algorithms are needed when tasks are created and destroyed at run-time or computation or communication of tasks vary widely
65. PCAM mapping - checklist It is important to keep an open mind during the design process. These points can help you decide if you have done a good job of considering design alternatives
Is the design based on one task per processor or multiple tasks per processor?
Have both static and dynamic allocation of tasks to processors been considered?
If dynamic allocation of tasks is chosen, is the manager (task allocator) a bottleneck to performance?
If using probabilistic or cyclic methods, do you have a large enough number of tasks to ensure reasonable load balance? (typically ten times as many tasks as processors are required)
66. PCAM example N-body problem In a Newtonian n-body simulation, gravitational forces have infinite range. Sequential algorithms to solve these problems have time complexity of Θ(n²) per iteration, where n is the number of objects
Let us suppose that we are simulating the motion of n particles of varying mass in 2D. During each iteration we need to compute the velocity vector of each particle, given the positions of all other particles.
Using the four stage process we get
67. PCAM example N-body problem Partitioning
Assume we have one task per particle.
This particle must know the location of all other particles
Communication
A gather operation is a global communication that takes a dataset distributed among a group of tasks and collects the items on a single task
An all-gather operation is similar to gather, except at the end of communication every task has a copy of the entire dataset
We need to update the location of every particle so an all-gather is necessary
68. PCAM example N-body problem So put a channel between every pair of tasks
During each communication step each task sends its vector element to one other task. After n - 1 communication steps, each task has the position of all other particles, and it can perform the calculations needed to determine the velocity and new location for its particle
Is there a quicker way? Suppose there were only two particles. If each task had a single particle, they could exchange copies of their values. What if there were four particles? After a single exchange step, tasks 0 and 1 could both have particles v0 and v1, and likewise for tasks 2 and 3. If task 0 exchanges its pair of particles with task 2 and task 1 exchanges with task 3, then all tasks will have all four particles. A logarithmic number of exchange steps is sufficient to allow every processor to acquire the value originally held by every other processor. In the ith exchange step, messages have length 2^(i-1)
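In MPI this exchange is packaged as a collective operation; a good implementation typically uses a logarithmic exchange like the one described above. A hedged sketch follows (2D positions and one particle per task are assumptions for illustration only).

```c
/* Hedged sketch of the all-gather step in the n-body example: each of p
   tasks owns one 2D particle position and, after the call, every task
   holds the full position array. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, p;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    double my_pos[2] = { (double)rank, 2.0 * rank };   /* this task's particle   */
    double *all_pos = malloc(2 * p * sizeof(double));  /* positions of all tasks */

    MPI_Allgather(my_pos, 2, MPI_DOUBLE, all_pos, 2, MPI_DOUBLE, MPI_COMM_WORLD);

    /* ...compute forces on my particle from all_pos, then update my_pos... */
    if (rank == 0)
        printf("task 0 now holds %d positions\n", p);

    free(all_pos);
    MPI_Finalize();
    return 0;
}
```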
69. PCAM example N-body problem Agglomeration and mapping
In general, there are more particles n than processors p. If n is a multiple of p we can associate one task per processor and agglomerate n/p particles into each task.
70. PCAM - summary Task/channel (PCAM) is a theoretical construct that represents a parallel computation as a set of tasks that may interact with each other by sending messages through channels
It encourages parallel algorithm designs that maximize local computations and minimize communications
71. BSP Bulk Synchronous Parallel BSP model was proposed in 1989. It provides an elegant theoretical framework for bridging the gap between parallel hardware and software
BSP allows the programmer to design an algorithm as a sequence of large steps (supersteps in BSP language), each containing many basic computation or communication operations done in parallel and a global synchronization at the end, where all processors wait for each other to finish their work before they proceed with the next superstep.
BSP is currently used around the world, and a very good text (on which this segment is based) is Parallel Scientific Computation by Rob Bisseling, published by Oxford University Press in 2004
72. BSP Bulk Synchronous Parallel Some useful links:
BSP Worldwide organization
http://www.bsp-worldwide.org
The Oxford BSP toolset (public domain GNU license)
http://www.bsp-worldwide.org/implmnts/oxtool
The source files from the book together with test programs form a package called BSPedupack and can be found at
http://www.math.uu.nl/people/bisseling/software.html
The MPI version called MPIedupack is also available from the previously mentioned site
73. BSP Bulk Synchronous Parallel BSP satisfies all requirements of a useful parallel programming model
Simple enough to allow easy development and analysis of algorithms
Realistic enough to allow reasonably accurate modelling of real-life parallel computing
There exists a portability layer in the form of BSPlib
It has been efficiently implemented in the Oxford BSP toolset and Paderborn University BSP library
Currently being used as a framework for algorithm design and implementation on clusters of PCs, networks of workstations, shared-memory multiprocessors and large parallel machines with distributed memory
74. BSP Model BSP comprises a computer architecture, a class of algorithms, and a function for charging costs to algorithms (hmm, no wonder it is a popular model)
The BSP computer
consists of a collection of processors, each with private memory,
and a communication network that allows processors to access each other's memories
75. BSP Model The BSP algorithm is a series of supersteps, each of which contains a number of computation or communication steps followed by a global barrier synchronization (i.e. bulk synchronization)
What is one possible problem you see right away with designing algorithms this way?
76. BSP Model The BSP cost of a superstep is expressed as an h-relation: a superstep in which every processor sends and receives at most h data words (real or integer). Therefore h = max{ h_send, h_receive }
It assumes sends and receives are simultaneous
This charging cost is based on the assumption that the bottleneck is at the entry or exit of a communication network
The cost of an h-relation would be
T_comm(h) = h*g + l, where
g is the communication cost per data word, and l is the global synchronization cost (both in time units of 1 flop), and the cost of a BSP algorithm is an expression of the form
a + b*g + c*l, abbreviated (a, b, c), where a, b, c depend in general on p and on the problem size
77. BSP Bulk Synchronous Parallel This model currently allows you to convert from BSP to MPI-2 using MPIedupack as an example (i.e. MPI can be used for programming in the BSP style)
The main difference between MPI and BSPlib is that MPI provides more opportunities for optimization by the user. However, BSP does impose a discipline that can prove fruitful in developing reusable code
The book contains an excellent section on sparse matrix vector multiplication and if you link to the website you can download some interesting solvers http://www.math.uu.nl/people/bisseling/Mondriaan/mondriaan.html
78. Pattern Language Primarily from the book Patterns for Parallel Programming by Mattson, Sanders, and Massingill, Addison-Wesley, 2004
From the back cover: "It's the first parallel programming guide written specifically to serve working software developers, not just computer scientists. The authors introduce a complete, highly accessible pattern language that will help any experienced developer 'think parallel' and start writing effective code almost immediately"
The cliche "Don't judge a book by its cover" comes to mind
79. Pattern Language We have come full circle. However, we have gained some knowledge along the way
A pattern language is not a programming language. It is an embodiment of design methodologies which provides domain-specific advice to the application designer
Design patterns were introduced into software engineering in 1987
80. Pattern Language Organized into four design spaces (sound familiar - PCAM)
Finding concurrency
Structure problem to expose exploitable concurrency
Algorithm structure
Structure the algorithm to take advantage of the concurrency found above
Supporting structures
Structured approaches that represent the program and shared data structures
Implementation mechanisms
Mapping patterns to processors
81. Concept - Summary What is the common thread of all these models and paradigms?
82. Concept - Conclusion You take a problem, break it up into n tasks and assign them to p processors: that's the science
How you break up the problem and exploit the parallelism, now that's the art
83. This page intentionally left blank
84. Compile Serial/sequential program optimization
Introduction to OpenMP
Introduction to MPI
Profilers
Libraries
Debugging
Performance Analysis Formulas
85. Serial Some of you may be thinking: why would I want to discuss serial code in a talk about parallel computing?
Well, have you ever eaten just one bran flake or one rolled oat at a time? (Homonym: serial / cereal)
86. Serial Most of the serial optimization techniques can be used for any program parallel or serial
Well-written assembler code will beat a high-level programming language any day, but who has the time to write a parallel application in assembler for one of the myriad of processors available? However, small sections of assembler might be more effective.
Reducing the memory requirements of an application is a good tool that frequently results in better processor performance
You can use these tools to write efficient code from scratch or to optimize existing code.
First attempts at optimization may be compiler options or modifying a loop. However, performance tuning is like trying to reach the speed of light: more and more time and energy are expended, but peak performance is never reached. It may be best, before optimizing your program, to consider how much time and energy you have and are willing or allowed to commit. Remember, you may spend a lot of time optimizing for one processor/compiler only to be told to port the code to another system
87. Serial Computers have become faster over the past years (Moore's Law). However, application speed has not kept pace. Why? Perhaps it is because programmers:
Write programs without any knowledge of the hardware on which they will run
Do not know how to use compilers effectively (how many use the gnu compilers?)
Do not know how to modify code to improve performance
88. Serial Storage Problems Avoid cache thrashing and memory bank contention by dimensioning multidimensional arrays so that the dimensions are not powers of two
Eliminate TLB (Translation Lookaside Buffer which translates virtual memory addresses into physical memory addresses) misses and memory bank contention by accessing arrays in unit stride. A TLB miss is when a process accesses memory which does not have its translation in the TLB
Avoid Fortran I/O interfaces such as open(), read(), write(), etc., since they are built on top of buffered I/O mechanisms (fopen(), fread(), fwrite(), etc.). Fortran adds additional functionality to the I/O routines, which leads to more overhead for doing the actual transfers
Do your own buffering for I/O and use system calls to transfer large blocks of data to and from files
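To make the unit-stride advice above concrete, here is a small C sketch (illustrative only; C arrays are row-major, unlike Fortran): the first traversal touches consecutive memory locations, while the second strides through memory by N elements per access and suffers far more cache and TLB misses.

```c
/* Unit stride vs. strided access in C (row-major storage). */
#define N 2048

double a[N][N];

void good_order(void)            /* unit stride: j (rightmost index) innermost */
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = 0.0;
}

void bad_order(void)             /* stride N: same result, far slower */
{
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            a[i][j] = 0.0;
}
```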
89. Serial Compilers and HPC A compiler takes a high-level language as input and produces assembler code which, once linked with other objects, forms an executable that can run on a computer
Initially programmers had no choice but to program in assembler for a specific processor. When processors changed, so did the code
Now programmers write in a high-level language that can be recompiled for other processors (source code compatibility). There is also object and binary compatibility
90. Serial the compiler and you How the compiler generates good assembly code and things you can do to help it
Register allocation is when the compiler assigns quantities to registers. C and C++ have the register keyword. Some optimizations increase the number of registers required
The C/C++ register data type is useful when the programmer knows a variable will be used many times and should not be reloaded from memory
C/C++ asm macro allows assembly code to be inserted directly into the instruction sequence. It makes code non-portable
C/C++ include file math.h generates faster code when used
Uniqueness of memory addresses. Different languages make assumptions on whether memory locations of variables are unique. Aliasing occurs when multiple elements have the same memory locations.
91. Serial The Compiler and You Dead code elimination is the removal of code that is never used
Constant folding is when expressions with multiple constants can be folded together and evaluated at compile time (i.e. A = 3+4 can be replaced by A = 7). Propagation is when variable references are replaced by a constant value at compile time (i.e. A=3+4, B=A+3 can be replaced by A=7 and B=10)
Common subexpression elimination (i.e. A=B+(X+Y), C=D+(X+Y)) puts repeated expressions into a new variable
Strength reduction
92. Serial Strength reductions Replace integer multiplication or division with shift operations
Multiplies and divides are expensive
Replace 32-bit integer division by 64-bit floating-point division
Integer division is much more expensive than floating-point division
Replace floating-point multiplication with floating-point addition
Y=X+X is cheaper than Y=2*X
Replace multiple floating-point divisions by division and multiplication
Division is one of the most expensive operations a=x/z, b=y/z can be replaced by c=1/z, a=x*c, b=y*c
Replace power function by floating-point multiplications
Power calculations are very expensive and take 50 times longer than performing a multiplication so Y=X**3 can be replaced by Y=X*X*X
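Written out in C, the strength reductions above look like the sketch below. Whether a given compiler already performs them automatically depends on the compiler and optimization level, so treat this as an illustration rather than a rule.

```c
/* Illustrative strength reductions (variable names are made up). */
double strength_reduction_examples(double x, double y, double z)
{
    double a, b, c, r;

    r = x + x;            /* instead of r = 2.0 * x          */

    c = 1.0 / z;          /* one divide...                   */
    a = x * c;            /* ...replaces a = x / z           */
    b = y * c;            /* ...and b = y / z                */

    r += x * x * x;       /* instead of pow(x, 3.0)          */

    return r + a + b;
}
```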
93. Serial Single Loop Optimization Induction variable optimization
when values in a loop are a linear function of the induction variable the code can be simplified by replacing the expression with a counter and replacing the multiplication by an addition
Prefetching
what happens when the compiler prefetches off the end of the array (fortunately it is ignored)
Test promotion in loops
Branches in code greatly reduce performance since they interfere with pipelining
Loop peeling
Handle boundary conditions outside the loop (i.e. do not test for them inside the loop)
Fusion
If the loop bounds are the same (i.e. for (i=0; i<n; i++)) for more than one loop, combine the loops together
Fission
Sometime loops need to be split apart to help performance
Copying
Loop fission using dynamically allocated memory
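Two of the transformations above, loop fusion and loop peeling, are easy to show in a short C sketch (the array names and bounds are illustrative only).

```c
/* Illustrative sketches of loop fusion and loop peeling. */
#define N 1000

void fusion(double *a, double *b, double *c)
{
    /* before: two loops over the same range
       for (int i = 0; i < N; i++) a[i] = b[i] + 1.0;
       for (int i = 0; i < N; i++) c[i] = b[i] * 2.0;        */
    for (int i = 0; i < N; i++) {        /* after: one fused loop        */
        a[i] = b[i] + 1.0;
        c[i] = b[i] * 2.0;
    }
}

void peeling(double *a, const double *b)
{
    a[0] = b[0];                         /* boundary iteration peeled    */
    for (int i = 1; i < N; i++)          /* no "if (i == 0)" test inside */
        a[i] = b[i] - b[i - 1];
}
```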
94. Software pipelining
Closely related to unrolling
Loop invariant code motion
An expression in a loop is loop invariant if its value does not change from one loop iteration to the next
Array padding
Consider padding arrays so that their leading dimension is not a power of two
Optimizing reductions
A reduction is when a vector of data is used to create a scalar by applying a function to each element of the vector
95. Serial Nested Loop Optimization Loop interchange
An easy optimization to make. It helps performance since it makes the array references unit stride (e.g. do J=1,n; do I=1,n; x(I,J)=0.0; enddo; enddo keeps the first index innermost, which is unit stride in Fortran)
Outer loop rolling
Reduces the number of load operations
Unroll and jam
Unrolling multiple loops and jamming them back together in ways that reduce the number of memory operations
Blocking
Decreases the number of cache misses in nested loops. The most common blocking takes the inner loop, splits it into two loops and then exchanges the newly created non-innermost loop with one of the original outer loop. Good to use if the length of the inner loop is very long
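A hedged C sketch of a two-dimensional variant of the blocking described above: the matrix transpose is not from the workshop and the 64-element block size is illustrative, but it shows the pattern of splitting loops into tiles and moving the new tile loops outward so that each tile stays in cache.

```c
/* Blocked (tiled) matrix transpose, illustrative only. */
#define N  1024
#define BS 64     /* block size; assumes BS divides N evenly */

void blocked_transpose(double b[N][N], const double a[N][N])
{
    for (int ii = 0; ii < N; ii += BS)
        for (int jj = 0; jj < N; jj += BS)
            for (int i = ii; i < ii + BS; i++)       /* work on one   */
                for (int j = jj; j < jj + BS; j++)   /* BS x BS tile  */
                    b[j][i] = a[i][j];
}
```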
96. MPI and OpenMP At this point you have squeezed out all the performance from a single processor and the only way to progress now is to go to multiple processors
With some understanding of the parallel models previously discussed and with some help from parallel programming models such as MPI and OpenMP, you should be able to write efficient parallel programs
97. MPI and OpenMP Programming models tied to particular parallel systems lead to programs that are not portable between parallel computers
As of 2000 there was a plethora of parallel programming environments. From A (ABCPL) to Z (ZPL) there were over 250
By the late 1990s the parallel programming community had settled primarily on two environments for parallel programming, namely OpenMP for shared memory and MPI for message passing
98. OpenMP and MPI Neither OpenMP nor MPI is an ideal fit for hybrid architectures
OpenMP does not recognize nonuniform memory access times, so data allocation can lead to poor performance on non-SMP machines, while MPI does not include constructs to manage data residing in a shared memory
Solution: use MPI between nodes and OpenMP within the shared-memory node. However, it is a bit more work for the programmer since two different programming models are at work in the same program. Or give up on shared memory and use MPI for everything
Another alternative is MPI 2.0 which contains one-sided communications (a read or write is launched from an initiating process without involvement of other processes) which mimics some of the features of shared memory systems. Or, OpenMP has been extended to run in distributed-memory environments. Consult the annual WOMPAT (Workshop on OpenMP Applications and Tools) workshop for current approaches
99. Another alternative to consider A language with built-in features to support parallel programming is Java. It does not support HPC per se, but does contain features for explicitly specifying concurrent processing with shared memory and distributed memory models
java.util.concurrent.* is one such package
Currently Java programs cannot compete with OpenMP and MPI in terms of performance, but this should improve
The Titanium Project http://http.cs.berkeley.edu/projects/titanium/ is an example of a Java dialect designed for HPC in a ccNUMA environment (ccNUMA: cache-coherent non-uniform memory access, i.e. shared, physically distributed memory)
100. OpenMP or MPI?
101. OpenMP - Introduction History
What is OpenMP
102. OpenMP What is it? OpenMP is:
An Application Program Interface (API) that may be used to explicitly direct multi-threaded, shared memory parallelism
Comprised of 3 primary components
Compiler directives
Runtime library routines
Environmental variables
Portable
Specified for C/C++, F77, F90, F95
Implemented on most Unix platforms and Windows NT
Standardized?
Jointly defined and endorsed by major computer vendors
Expected to be an ANSI standard
Definition
Open specifications for Multi Processing via collaborative work between interested parties from the hardware and software industry, government and academia
103. OpenMP What its not OpenMP is not:
Meant for distributed memory parallel systems (by itself)
Necessarily implemented identically by all vendors
Guaranteed to make the most efficient use of shared memory
104. OpenMP - History Ancient History
Early 1990s: vendors supplied directive-based Fortran programming extensions
Implementations were all functionally similar, but were diverging
First attempt at a standard was ANSI X3H5 in 1994
Recent History
OpenMP standard specification started again in 1997
Official web site: http://www.openmp.org/
Release History
October 1997: Fortran version 1.0
Late 1998: C/C++ version 1.0
June 2000: Fortran version 2.0
April 2002: C/C++ version 2.0
105. OpenMP Programming Model Thread based Parallelism
A shared memory process can consist of multiple threads
Explicit Parallelism
OpenMP is an explicit (not automatic) programming model, offering the programmer full control over parallelization
Fork-join model
OpenMP uses the fork-join model of parallel execution
Compiler Directive Based
OpenMP parallelism is specified through the use of compiler directives which are embedded in the source code
Nested Parallelism support
Supports parallel constructs inside other parallel constructs
Dynamic Threads
Provision for dynamically altering the number of threads which may be used to execute different parallel regions
106. Cont Fork-Join model
All OpenMP programs begin as a single process: the master thread, which executes sequentially until the first parallel region construct is encountered
FORK: the master thread creates a team of parallel threads
JOIN: when the team of threads completes the statements in the parallel region construct, they synchronize and terminate, leaving only the master thread
107. OpenMP Compiler directives or Pragmas General syntax of directives (Fortran) and pragmas (C, C++)
108. Fortran Directives Source may be either fixed form or free form
In fixed form, a line that begins with one of the following prefix keywords (sentinels):
!$omp
C$omp
*$omp
and contains either a space or a zero in the sixth column is treated as an OpenMP directive by the compiler
A line that begins with one of the above sentinels and contains any other character in the sixth column is treated as a continuation directive line by an OpenMP compiler
109. Fortran Directives Cont In free form Fortran source a line that begins with the sentinel
!$omp
is treated as an OpenMP directive. The sentinel may appear in any column so long as it appears as a single word and is preceded by white space
A directive that needs to be continued on the next line is expressed as
!$omp <directive> &
(with the ampersand as the last token on that line)
110. C and C++ Pragmas Pragmas in C and C++ use the following syntax:
#pragma omp
The omp keyword distinguishes the pragma as an OpenMP pragma: it is processed by OpenMP compilers and ignored by others. Application developers can use the same source code base for building both parallel and sequential (serial) versions of an application using just a compile-time flag.
111. A Simple Loop Saxpy (single-precision a*x plus y)
subroutine saxpy(z, a, x, y, n)
integer i, n
real z(n), a, x(n), y
!$omp parallel do
do i = 1, n
z(i) = a * x(i) + y
enddo
return
end
112. Simple program cont Notice that the only change we make to the original program is minimal: the addition of the parallel do directive
The directive must be followed by a do loop construct
An OpenMP compiler will create a set of threads and distribute the iterations of the do loop across those threads for parallel execution
113. OpenMP constructs 5 main categories:
Parallel regions
Worksharing
Data environment
Synchronization
Runtime functions/environment variables
114. Parallel regions You create threads in OpenMP with the omp parallel pragma/directive
Example
double x[1000];
omp_set_num_threads(4);
#pragma omp parallel
{
int ID = omp_get_thread_num();
blah(ID, x);   /* blah() is a placeholder for per-thread work */
}
printf("finished\n");
A single copy of x is shared among all threads
115. Parallelism with Parallel Regions Loop-level parallelism is generally considered fine-grained parallelism; the granularity refers to the size of the unit of work executed in parallel
In a loop the typical unit of work is relatively small compared to the program as a whole
For coarser-grained parallelism, the directive pair
!$omp parallel
!$omp end parallel
will define the region to be parallelized
The parallel/end parallel directive pair is a control structure that forks a team of parallel threads with individual data environments to execute the enclosed code concurrently
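A hedged C sketch of such a coarser-grained region, where each thread carves out its own block of work from its thread ID (process_chunk is a hypothetical routine):
#include <omp.h>

void process_chunk(int begin, int end);   /* hypothetical unit of work */

void coarse_grained(int n)
{
    #pragma omp parallel                  /* one region enclosing all the work */
    {
        int nthreads = omp_get_num_threads();
        int id = omp_get_thread_num();
        int chunk = (n + nthreads - 1) / nthreads;
        int begin = id * chunk;
        int end = (begin + chunk > n) ? n : begin + chunk;
        if (begin < end)
            process_chunk(begin, end);    /* each thread handles its own block */
    }
}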
116. Some details Dynamic mode (default mode)
Number of threads used in a parallel region can vary from one parallel region to another
Setting the number of threads only sets the maximum number of threads; you can get fewer
Static mode
The number of threads is fixed and controlled by the programmer
Nested parallel regions
A compiler can choose to serialize the nested parallel region (i.e. use a team with only one thread)
117. Work Sharing Constructs #pragma omp for
The for construct splits up loop iterations among the threads in a team
#pragma omp parallel
#pragma omp for
for (I = 0; I < N; I++) {
    SOME_STUFF(I);
}
Note that by default there is a barrier at the end of the omp for and you can use the nowait clause to turn off the barrier
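A short sketch (array names are illustrative) of two independent loops in one parallel region, with nowait dropping the barrier after the first:
#include <omp.h>
#define N 1000

void independent_loops(double a[N], double b[N])
{
    #pragma omp parallel
    {
        #pragma omp for nowait             /* no barrier: threads go straight on */
        for (int i = 0; i < N; i++)
            a[i] = 2.0 * a[i];

        #pragma omp for                    /* implicit barrier at the end of this loop */
        for (int i = 0; i < N; i++)
            b[i] = b[i] + 1.0;
    }
}
Because the two loops touch different arrays, removing the first barrier is safe here.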
118. Schedule clause The schedule clause affects how loop iterations are mapped onto threads
Schedule(static [,chunk])
Deal out blocks of iterations of size chunk to each thread
Schedule(dynamic[,chunk])
Each thread grabs chunk iterations off a queue until all iterations have been handled
Schedule(guided[,chunk])
Threads dynamically grab blocks of iterations. The size of the block starts large and shrinks down to size chunk as the calculation proceeds
Schedule(runtime)
Schedule and chunk size taken from the OMP_SCHEDULE environment variable
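A brief sketch of the dynamic schedule on a load-imbalanced loop (expensive_work is a hypothetical routine whose cost varies with i):
#include <omp.h>

void expensive_work(int i, double *out);   /* hypothetical, cost varies with i */

void run(int n, double *result)
{
    #pragma omp parallel for schedule(dynamic, 4)
    for (int i = 0; i < n; i++)
        expensive_work(i, &result[i]);     /* threads grab 4 iterations at a time */
}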
119. Parallel Sections If the serial version of an application performs a sequence of tasks in which none of the later tasks depends on the results of the earlier ones, it may be more beneficial to assign different tasks to different threads
!$omp sections [clause [[,] clause] ...]
#pragma omp sections [clause [clause]]
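A hedged C sketch of the sections construct, assuming task_a and task_b are independent routines:
#include <omp.h>

void task_a(void);        /* hypothetical independent task */
void task_b(void);        /* hypothetical independent task */

void run_tasks(void)
{
    #pragma omp parallel
    {
        #pragma omp sections
        {
            #pragma omp section
            task_a();                 /* done by one thread */

            #pragma omp section
            task_b();                 /* possibly done by another thread */
        }
    }
}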
120. Combined work sharing constructs !$omp parallel do
#pragma omp parallel for
#pragma omp parallel sections
121. Data Environment Shared memory programming model
Most variables shared by default
Global variables are shared among threads
Fortran: common blocks, SAVE variables, MODULE variables
C: File scope variables, static
Not everything is shared
Stack variables in sub-programs called from parallel regions are PRIVATE
Automatic variables within statement blocks are PRIVATE
122. Changing Storage Attributes One can selectively change the storage attributes of variables using the following clauses, which apply to the lexical extent of the OpenMP construct
Shared
Private
Firstprivate
Threadprivate
The value of a private inside a parallel loop can be transmitted to a global value outside the loop with a lastprivate
The default status can be modified with:
DEFAULT (PRIVATE|SHARED|NONE)
123. Cont PRIVATE (var) creates a local copy of var for each thread
The value is uninitialized
Private copy is not storage associated with the original
      I = 0
C$OMP PARALLEL DO PRIVATE(I)
      DO 1000 J = 1, 100
         I = I + 1
 1000 CONTINUE
      PRINT *, I
I was not initialized in the DO
Regardless of initialization, I is undefined in the print statement
124. Firstprivate clause Special case of private
Initializes each private copy with the corresponding value from the master thread
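A small C sketch of firstprivate, reflecting the initialization behaviour described above:
#include <stdio.h>
#include <omp.h>

int main(void)
{
    int offset = 100;                             /* set by the master thread */
    #pragma omp parallel firstprivate(offset)
    {
        /* each thread's private copy starts out as 100 */
        offset += omp_get_thread_num();
        printf("thread %d: offset = %d\n", omp_get_thread_num(), offset);
    }
    printf("after the region, offset is still %d\n", offset);  /* original copy untouched */
    return 0;
}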
125. Threadprivate clause Makes global data private to a thread
COMMON blocks in Fortran
File scope and static variables in C
Threadprivate variables can be initialized using COPYIN or by using DATA statements
126. Reduction clause Reduction (op: list)
The variables in list must be shared in the enclosing parallel region
Inside a parallel or a worksharing construct:
A local copy of each list variable is made and initialized depending on the op (i.e. +, *, -)
The op is applied pairwise to update the local value
Local copies are reduced into a single global copy at the end of the construct
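A brief C sketch of a sum reduction, following the description above:
#include <stdio.h>
#include <omp.h>

int main(void)
{
    const int n = 1000;
    double sum = 0.0;                        /* shared in the enclosing region */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += (double)i;                    /* each thread updates its own local copy */
    printf("sum = %.1f\n", sum);             /* local copies combined into the global sum */
    return 0;
}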
127. Synchronization OpenMP has the following constructs to support synchronization:
Atomic
Barrier
Critical section
Flush
Master
Ordered and
single
128. Critical section !$omp critical
!$omp end critical
Only one critical section is allowed to execute at one time anywhere in the program. It is equivalent to a global lock on the program
It is illegal to branch into or jump out of a critical section
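As an illustrative sketch (not tuned for performance), a critical section protecting two related shared updates that cannot be expressed as a single atomic statement:
#include <omp.h>

void find_max(const double *a, int n, double *best_val, int *best_idx)
{
    *best_val = a[0];
    *best_idx = 0;
    #pragma omp parallel for
    for (int i = 1; i < n; i++) {
        #pragma omp critical
        {
            /* only one thread at a time updates the pair of shared results */
            if (a[i] > *best_val) {
                *best_val = a[i];
                *best_idx = i;
            }
        }
    }
}
In practice each thread would normally keep a private maximum and only enter the critical section once at the end.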
129. Atomic Is a special case of a critical section that can be used for certain simple statements
It applies only to the update of a memory location
!$omp atomic
Can be applied only if the critical section consists of a single assignment statement that updates a scalar variable
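A minimal sketch of atomic on a single scalar update:
#include <omp.h>

int hits = 0;                      /* shared counter */

void count_hits(int n)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        #pragma omp atomic
        hits += 1;                 /* one memory location, one simple update */
    }
}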
130. Barrier Each thread waits until all threads arrive
#pragma omp barrier
Simple directive that can be used to ensure that a piece of work has been completed before moving on to the next phase
131. Ordered Enforces the sequential order for a block
132. Master Denotes a structured block that is only executed by the master thread. The other threads just skip it (no implied barriers or flushes).
Used in parallel regions
133. Single Denotes a block of code that is executed by only one thread
A barrier and a flush are implied at the end of the single block
134. Flush Denotes a sequence point where a thread tries to create a consistent view of memory
All memory operations (both reads and writes) defined prior to the sequence must complete
All memory operations defined after the sequence point must follow the flush
Variables in registers or write buffers must be updated in memory
Arguments to flush specify which variables are flushed. If no arguments are given, all thread-visible variables are flushed
135. Runtime functions and library routines Lock routines
omp_init_lock(), omp_set_lock(), omp_unset_lock(), omp_test_lock()
Runtime environment routines:
Modify/check the number of threads
omp_set_num_threads(), omp_get_num_threads(), omp_get_thread_num(), omp_get_max_threads()
Turn on/off nesting and dynamic mode
omp_set_nested(), omp_set_dynamic(), omp_get_nested(), omp_get_dynamic()
Are we in a parallel region?
omp_in_parallel()
How many processors in the system?
omp_get_num_procs()
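A small sketch exercising a few of these routines:
#include <stdio.h>
#include <omp.h>

int main(void)
{
    printf("processors = %d, max threads = %d, in parallel? %d\n",
           omp_get_num_procs(), omp_get_max_threads(), omp_in_parallel());

    omp_set_num_threads(4);
    #pragma omp parallel
    {
        #pragma omp single
        printf("team size = %d, in parallel? %d\n",
               omp_get_num_threads(), omp_in_parallel());
    }
    return 0;
}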
136. Performance improvements The compiler listing gives many useful clues for improving the performance
Loop optimization tables
Reports about data dependencies
Explanations about applied transformations
The annotated, transformed code
Calling tree
Performance statistics
The type of reports to be included in the listing can be set through compiler options
137. Tuning Automatically Parallelized code Task is similar to explicit parallel programming
Two important differences:
The compiler gives hints in its listing, which may tell you where to focus attention (e.g. which variables have data dependencies)
You do not need to perform all transformations by hand. If you expose the right information to the compiler, it will do the transformation for you (e.g. C$assert independent)
138. Cont Hand improvements can pay off because:
Compiler techniques are limited (e.g. array reductions are parallelized by only a few compilers)
Compilers may have insufficient information (e.g. the loop iteration range may be input data, or variables may be defined in other subroutines)
139. Performance Tuning Use the following methodology:
Use compiler-parallelized code as a starting point
Get loop profile and compiler listing
Inspect the time-consuming loops (biggest potential for improvement)
140. SMP Programming Errors Shared memory parallel programming
Saves the programmer from having to map data onto multiple processors
It opens up a range of new errors coming from unanticipated shared resource conflicts
141. Two Major Errors Race conditions
The outcome of a program depends on the detailed timing of the threads in the team
Deadlock
Threads lock up waiting on a locked resource that will never become free
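A minimal C sketch of a race condition of this kind (the unprotected update is deliberate):
#include <stdio.h>

int main(void)
{
    int counter = 0;
    #pragma omp parallel for
    for (int i = 0; i < 100000; i++)
        counter++;                 /* unprotected read-modify-write on a shared variable */
    printf("counter = %d (expected 100000)\n", counter);
    return 0;
}
The printed value can change from run to run; an atomic directive or a reduction clause on counter removes the race.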
142. OpenMP traps Are you using thread-safe libraries (libraries that are safe to call with more than one thread running, and/or from a thread that is not the main thread)?
I/O inside a parallel region can interleave unpredictably
Make sure you understand what your constructors are doing with private objects
Private variables can mask globals
Understand when shared memory is coherent
When in doubt, use FLUSH
NOWAIT removes implied barriers
143. How to avoid the Traps Analyze your code to make sure every semantically permitted interleaving of the threads yields the correct results
Can be prohibitively difficult due to the explosion of possible interleavings
Write SMP code that is portable and equivalent to the sequential form
Use a safe subset of OpenMP
Follow a set of rules for sequential equivalence
144. Strong Sequential Equivalence Rules Control data scope with the base language
Avoid data scope clauses
Only use private for scratch variables local to a block whose global initializations do not matter
Locate all cases where a shared variable can be written by multiple threads
The access to the variable must be protected
If multiple threads combine results into a single value, enforce sequential order
Do not use the reduction clause
145. Conclusion OpenMP is:
A great way to write fast executing code
It can also expose you to new and painful kinds of errors (e.g. race conditions)
Tools and/or a discipline of writing portable sequentially equivalent programs can help
146. OpenMP - future In the hands of the Architectural Review Board (the ARB)
HP, Intel, Sun, SGI, DOE ASCI
ARB resolves interpretation issues and manages the evolution of new OpenMP APIs
Membership in the ARB is open to any organization with a stake in OpenMP
147. References http://www.openmp.org/
Parallel Programming in OpenMP, Morgan Kaufmann Publishers, ISBN 1-55860-671-8, 2001
Parallel Programming in C with MPI and OpenMP, McGraw Hill Publishers ISBN 0-07282-256-2, 2004
148. OpenMP pi Program C, no pragmas
#include <stdio.h>
static long num_steps = 100000;
double step;
int main ()
{
    int i;
    double x, pi, sum = 0.0;
    step = 1.0/(double) num_steps;
    for (i = 1; i <= num_steps; i++) {
        x = (i - 0.5)*step;
        sum = sum + 4.0/(1.0 + x*x);
    }
    pi = step*sum;
    printf("pi = %f\n", pi);   /* print the result so the computation is not discarded */
    return 0;
}
149. OpenMP pi Program C with pragmas
#include <stdio.h>
#include <omp.h>
#define NUM_THREADS 2
static int num_steps = 100000;
double step;
int main ()
{
    int i;
    double x, pi, sum = 0.0;
    step = 1.0/(double) num_steps;
    omp_set_num_threads(NUM_THREADS);
    #pragma omp parallel for reduction(+:sum) private(x)
    for (i = 0; i < num_steps; i++) {
        x = (i + 0.5)*step;
        sum = sum + 4.0/(1.0 + x*x);
    }
    pi = step * sum;
    printf("pi = %f\n", pi);   /* print the result */
    return 0;
}
150. MPI - Introduction History
MPI (Message-Passing Interface) is a library of functions and macros that can be used in Fortran (77, 90, 95), C and C++ programs that exploit the existence of multiple processors by message-passing
Developed in 1993-1994
151. MPI - Advantages of the MP Model Universality
Fits well on separate processors connected by a communication network. Thus it works well on parallel supercomputers and clusters
Expressivity
A useful and complete model in which to express parallel algorithms
Tolerant of imbalances in process speeds found on shared networks
152. MPI - Advantages Ease of Debugging
While debuggers for parallel programs are easier to write for an SMP model, debugging itself is easier in the MP model because only one process has direct access to any memory location
Performance
MPI provides a way for the programmer to explicitly associate specific data with processes and thus allow the compiler and cache-memory to function fully
153. A Perfect Parallel Program Compute the value of pi by numerical integration: integrate f(x) = 4/(1+x^2) over the interval from 0 to 1
To do this we divide the interval 0 to 1 into some number n of subintervals and add up the areas
Language bindings to C++, C, Fortran, and example pi programs follow
154. MPI 6 Functions MPI_Comm_rank - determines a process's ID number
MPI_Init - Initializes MPI.
This allows the system to do any setup needed to handle further calls to the MPI library and must be called before any other MPI function.
MPI_Comm_size - find the number of processes
Every active process becomes a member of a communicator named MPI_COMM_WORLD, which is an opaque object that provides the environment for message passing among processes and is the default communicator. Processes within a communicator are ordered. Each process has a unique ID (rank) between 0 and p-1. A process may use its rank to determine which portion of a computation and/or dataset it is responsible for. A process calls MPI_Comm_rank to determine its rank within a communicator and calls MPI_Comm_size to determine the total number of processes in a communicator.
MPI_Bcast - send information to all
enables a process to broadcast one or more data items of the same type to all other processes in a communicator. After this function has executed, every process has an up-to-date value
MPI_Reduce - retrieve information from all
perform one or more reduction operations on values submitted by all the processes in a communicator. It is important to remember that while only a single process (in most cases process 0) gets the global result, every process must call MPI_Reduce.
MPI_Finalize - cleanly end MPI
After a process has completed all of its MPI library calls, it calls MPI_Finalize, allowing the system to free up resources such as memory. With few exceptions, no MPI calls may be made by a process after its call to MPI_FINALIZE
155. MPI Bindings C++
void MPI::Init( int& argc, char**& argv )
void MPI::Init( )
int MPI::Comm::Get_rank( ) const
int MPI::Comm::Get_size( ) const
void MPI::Intracomm::Bcast( void* buf, int count, const Datatype& datatype, int root ) const
void MPI::Intracomm::Reduce( const void* sendbuf, void* recvbuf, int count, const Datatype& datatype, const Op& op, int root ) const
void MPI::Finalize( )
156. MPI Bindings C++ C++ bindings added to MPI-2 and therefore might not be everywhere
The approach was to exploit MPI's object-oriented structure to define classes and methods that closely follow the structure of the C and Fortran bindings
Differences are as follows:
Call to MPI_Init has become MPI::Init
Invoke the method Get_size on the object MPI::COMM_WORLD and return the size of the MPI::COMM_WORLD communicator as its value
Instead of returning an error code, the behaviour is to throw an exception
The call to Get_rank is similar to the call to Get_size but returns the rank as the value of the function
Bcast and Reduce are methods on MPI::COMM_WORLD
157. MPI pi C++ program
#include <math.h>
#include <iostream>
#include "mpi.h"
using namespace std;

int main(int argc, char *argv[])
{
    int n, rank, size, i;
    double PI25DT = 3.141592653589793238462643;
    double mypi, pi, h, sum, x;
    // MPI initialization
    MPI::Init(argc, argv);
    size = MPI::COMM_WORLD.Get_size();
    rank = MPI::COMM_WORLD.Get_rank();
    while (true) {
        if (rank == 0) {
            cout << "Enter the number of intervals: (0 quits)" << endl;
            cin >> n;
        }
        // broadcast n to the rest of the processes
        MPI::COMM_WORLD.Bcast(&n, 1, MPI::INT, 0);
        if (n == 0)
            break;
        else {
            h = 1.0 / (double) n;
            sum = 0.0;
            for (i = rank + 1; i <= n; i += size) {
                x = h * ((double)i - 0.5);
                // function to integrate
                sum += (4.0 / (1.0 + x*x));
            }
            mypi = h * sum;
            // collect all partial sums
            MPI::COMM_WORLD.Reduce(&mypi, &pi, 1, MPI::DOUBLE, MPI::SUM, 0);
            // node 0 or master prints the answer
            if (rank == 0)
                cout << "pi is approximately " << pi
                     << ", Error is " << fabs(pi - PI25DT)
                     << endl;
        }
    }
    MPI::Finalize();
    return 0;
}
158. MPI Bindings - C
int MPI_Init( int *argc, char ***argv )
int MPI_Comm_size( MPI_Comm comm, int *size )
int MPI_Comm_rank( MPI_Comm comm, int *rank )
int MPI_Bcast( void *buf, int count, MPI_Datatype datatype, int root, MPI_Comm comm )
int MPI_Reduce( void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm comm )
int MPI_Finalize( void )
159. MPI Bindings - C Primary difference is that error codes are returned as the value of C functions instead of a separate argument
More strongly typed than in Fortran
e.g. MPI_Comm and MPI_Datatype where Fortran has integers
mpi.h instead of mpi module or mpif.h
MPI_Init is different to take advantage of command-line arguments
160. MPI pi C program
#include "mpi.h"
#include <stdio.h>
#include <math.h>

int main( int argc, char *argv[] )
{
    int n, myid, numprocs, i;
    double PI25DT = 3.141592653589793238462643;
    double mypi, pi, h, sum, x;
    /* MPI initialization */
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    while (1) {
        if (myid == 0) {
            printf("Enter the number of intervals: (0 quits) ");
            scanf("%d", &n);
        }
        /* broadcast n to the rest of the processes */
        MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
        if (n == 0)
            break;
        else {
            h = 1.0 / (double) n;
            sum = 0.0;
            for (i = myid + 1; i <= n; i += numprocs) {
                x = h * ((double)i - 0.5);
                /* function to integrate */
                sum += (4.0 / (1.0 + x*x));
            }
            mypi = h * sum;
            /* collect all partial sums */
            MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
            /* node 0 or master prints the answer */
            if (myid == 0)
                printf("pi is approximately %.16f, Error is %.16f\n",
                       pi, fabs(pi - PI25DT));
        }
    }
    MPI_Finalize();
    return 0;
}
161. MPI Bindings - Fortran
MPI_INIT( ierror )
    integer ierror
MPI_COMM_SIZE( comm, size, ierror )
    integer comm, size, ierror
MPI_COMM_RANK( comm, rank, ierror )
    integer comm, rank, ierror
MPI_BCAST( buffer, count, datatype, root, comm, ierror )
    <type> buffer(*)
    integer count, datatype, root, comm, ierror
MPI_REDUCE( sendbuf, recvbuf, count, datatype, op, root, comm, ierror )
    <type> sendbuf(*), recvbuf(*)
    integer count, datatype, op, root, comm, ierror
MPI_FINALIZE( ierror )
    integer ierror
162. MPI pi Fortran program
      program main
      use mpi
C     Use the following include if the mpi module is not available
C     include "mpif.h"
      double precision PI25DT
      parameter (PI25DT = 3.141592653589793238462643d0)
      double precision mypi, pi, h, sum, x, f, a
      integer n, myid, numprocs, i, ierr
C     function to integrate
      f(a) = 4.d0 / (1.d0 + a*a)
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, numprocs, ierr)
 10   if ( myid .eq. 0 ) then
         print *, 'Enter the number of intervals: (0 quits) '
         read(*,*) n
      endif
C     broadcast n to the rest of the processes
      call MPI_BCAST(n, 1, MPI_INTEGER, 0, MPI_COMM_WORLD, ierr)
C     check for quit signal
      if ( n .le. 0 ) goto 30
C     calculate the interval size
      h = 1.0d0/n
      sum = 0.0d0
      do 20 i = myid+1, n, numprocs
         x = h * (dble(i) - 0.5d0)
         sum = sum + f(x)
 20   continue
      mypi = h * sum
C     collect all partial sums
      call MPI_REDUCE(mypi, pi, 1, MPI_DOUBLE_PRECISION, MPI_SUM, 0,
     &                MPI_COMM_WORLD, ierr)
C     node 0 or master prints the answer
      if ( myid .eq. 0 ) then
         print *, 'pi is ', pi, ' Error is ', abs(pi - PI25DT)
      endif
      goto 10
 30   call MPI_FINALIZE(ierr)
      stop
      end
163. MPI - Timing double precision MPI_wtime()
double precision MPI_wtick()
For parallel programs getting the correct answer is not enough; one wishes to decrease the execution time
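A brief C sketch of timing a section of work with these routines (do_work is hypothetical):
#include <stdio.h>
#include "mpi.h"

void do_work(void);                /* hypothetical computation to be timed */

int main(int argc, char *argv[])
{
    double t0, t1;
    MPI_Init(&argc, &argv);
    t0 = MPI_Wtime();              /* wall-clock time in seconds */
    do_work();
    t1 = MPI_Wtime();
    printf("elapsed = %f s (timer resolution %f s)\n", t1 - t0, MPI_Wtick());
    MPI_Finalize();
    return 0;
}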
164. MPI - Common Errors and Misunderstandings Forgetting ierr in Fortran
The most common error in Fortran-written MPI programs is to forget the last argument, where the error code is returned
Fortran 90 will catch this, but other compilers will not, leading to hard-to-find errors
165. MPI - Common Errors Misdeclaring status in Fortran
When an MPI_Recv returns, certain information has been stored in the status argument
Status is an array of integers and not a single integer
Many compilers will not complain about this but running this program will lead to unexpected memory overwrites and mysterious behaviour
166. MPI - Common Errors Misdeclaring string variables in Fortran
Strings are not the same as arrays of characters
A 12-character string a in Fortran should be declared as:
character*12 a
and definitely not
character a(12)
The latter is an array of characters, not a single character string variable
167. MPI - Common Errors Expecting argc and argv to be passed to all processes
Arguments to MPI_Init in C are &argc and &argv which allows the MPI implementation to fill these in on all processes but the MPI Standard does not require it
MPI does not assume it is running under Unix; MPI also runs on Windows and Macs
168. MPI - Common Errors Doing things before MPI_Init and after MPI_Finalize
The MPI Standard says nothing about the situation before MPI_Init and after MPI_Finalize, not even how many processes are running
Doing anything before or after may yield unexpected and implementation-dependent results
169. MPI - Common Errors Matching MPI_Bcast with MPI_Recv
It may be thought that MPI_Bcast is a multiple send and that MPI_Recv should receive it
Not so: MPI_Bcast is a collective operation which must be called on all processes in the group
It functions as a multi-send on the specified root process and as a receive on the others
This allows for optimal performance
MPI_Recv does not have to check whether the message it received is part of a broadcast
170. MPI - Common Errors Assuming the MPI implementation is thread-safe
Thread-safe means that multiple threads can call MPI routines at the same time
The Standard does not require it of implementations
Unpredictable results may occur if the compiler generates multithreaded code and the MPI implementation is not thread-safe
171. MPI - Resources MPI Home page
http://www.mpi-forum.org
Examples of MPI programs
http://www.mcs.anl.gov/mpi/usingmpi2
MPI Implementations
http://www.mcs.anl.gov/mpi/mpich
MPI Standard
http://www.mpi-forum.org
Discussion on MPI
http://www.erc.msstate.edu/mpi/mpi-faq.html
172. Profilers Optimizing a small piece of code can be done easily by inserting calls to a timer, provided you know that optimizing that code will enhance the performance of your program. That is not the case with large applications, which is where a profiler comes in handy.
The job of the profiler is to aid in the discovery of bottlenecks and where you can increase performance.
They do this by inserting calls into applications that generate information and timings about subroutines, functions, and loops.
The performance of the program will suffer because of the extra data collection that is carried out by the processor
When using profilers:
Check for correct answers
Profile to find the most time consuming routines
Optimize these routines with compiler options, pragmas and source code modification
Repeat the first three steps ( but remember the speed of light)
173. Profilers Four types of profilers
Sampling based
Uses a predefined clock; with every tick a new sample is taken of what the code is doing. This type of profiler might miss an event happening between ticks
Event based
Every event is monitored (i.e. entry and exit from a subroutine)
Trace based
The profiler keeps all of the information it collects
Reductionist based
Only statistical information is kept (i.e. average time in a routine)
The most common Unix profilers are prof (reductionist) and gprof (event). They require code to be relinked and executed so that an output file with timing results can be created
174. Libraries If you are going to build a house you do not make the tools, nails, screws, wood, etc., yourself
Likewise for programming (serial or parallel). You do not recreate a cosine function every time you need one. There are many efficient mathematical libraries out there and at this point I would like to introduce some of the commonly used ones to you
175. Libraries - BLAS One of the stalwarts in mathematical libraries first started in 1979 by C. L. Lawson et al and continued to 1990 with Dongarra et al
BLAS stands for Basic Linear Algebra Subprograms and they are a collection of routines for basic vector and matrix operations. They are divided into three groups:
Level 1 BLAS vector operations
Level 2 BLAS matrix-vector operations
Level 3 BLAS matrix-matrix operations
They are strong and well-accepted subroutines used in F77, F90, F95, C, and C++ programs
http://www.netlib.org/blas/blast-forum/
176. Libraries - EISPACK EISPACK (Eigenvalue Software PACKage) is a collection of routines that compute eigenvalues and eigenvectors for several types of matrices. Designed in the late 1970s
177. Libraries - LAPACK LAPACK (Linear Algebra PACKage) is a successful public domain library used extensively.
Memory access patterns are more efficient
LAPACK subroutines use Level 2 and Level 3 BLAS routines wherever possible. So when vendors provide highly tuned BLAS routines LAPACK benefits
Users should use LAPACK rather than EISPACK or LINPACK as it is more efficient and is superior in performance
Comes with extensive documentation and tunable routines for your system.
178. Libraries - LINPACK LINPACK (LINear algebra PACKage) is a collection of Fortran subroutines for solving linear least-squares problems and linear equations
Designed in the late 1970s
Based on legacy BLAS Level 1 routines
Solves linear systems whose matrices are banded, dense, symmetric, symmetric positive definite, and triangular
179. Libraries - PLAPACK PLAPACK (Parallel Linear Algebra PACKage) developed in 1997 by R. A. van de Geijn and P. Alpatov is an infrastructure for coding linear algebra algorithms
Allows for high levels of performance to Cholesky, LU, and QR factorization solvers
More information can be found at http://www.cs.utexas.edu/users/plapack/
180. Libraries - ScaLAPACK ScaLAPACK (Scalable Linear Algebra PACKage) is a parallelized subset of the LAPACK library
Built primarily for distributed parallel computers and uses its own communication package BLACS (Basic Linear Algebra Communication Subprograms) which can be built with MPI or PVM
ScaLAPACK, LAPACK, and BLAS can be obtained from http://netlib.org/
181. Libraries - Commercial Most vendors provide their own HPC libraries for mathematical operations. They tend to perform better than the public domain versions and are fully supported by the vendor
HPs MLIB (Mathematical software LIBrary)
IBMs ESSL (Engineering and Scientific Subroutine Library)
SGIs SCSL (Scientific Computing Software Library)
Suns PL (Performance Library)
There are also numerous independent vendors that provide libraries:
NAG (Numerical Algorithms Group) http://www.nag.co.uk/
IMSL ( International Mathematical and Statistical Library) http://www.vni.com/
The IMSL Fortran Library is the only commercially available library that has 100% thread safety
It is available for Fortran, C, C++, and Java
HSL (Harwell Scientific Library) http://www.cse.clrc.ac.uk/nag/hsl/
Can be obtained without charge for academic purposes
182. Debugging Programming is an error-prone activity
Most programmers (unix) will use dbx, and some sort of print statement (printf, print) to catch and isolate bugs
Debugging parallel programs is much harder since:
More things can go wrong
Multiple processes are performing multiple computations and interact with other processes using some sort of message passing
Programmers do not have access to good tools. However, you can purchase TotalView, Vampir, etc.
183. Debugging General Hints Make sure your single processor version runs correctly and is efficient. You can take advantage of serial debuggers such as dbx or gdb to test values, set breakpoints, etc.
Work with the smallest number of processors that can test the program's functionality, usually two or three
Use a smaller problem size. Typically a 4 x 4 array can functionally test a program that requires a 1024 x 1024 matrix. Using a smaller problem size allows you to put in additional print statements to look at data structures. You will have a lot less output to consider
184. Debugging General Hints Put a flush statement (i.e. fflush(stdout)) after every printf statement to collect all output from every process before the program deadlocks
For point-to-point messages, to be sure of the values, print the data that you are sending and receiving
For a serial program, messages received will be in chronological order. This is not the case for parallel programming. You cannot assume that if a message from process A appears before a message from process B that A's message was printed before B's. So in MPI, prefix each message with the process rank, then sort the output of the program (which will organize the output by process)
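A hedged sketch of the rank-prefixed, flushed debug print described above (value stands in for whatever data is being checked):
#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    int rank, value = 42;                            /* illustrative debug data */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("[rank %d] value = %d\n", rank, value);   /* prefix every message with the rank */
    fflush(stdout);                                  /* push output out before a possible deadlock */
    MPI_Finalize();
    return 0;
}
Sorting the combined output on the rank prefix then groups each process's messages together.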
185. Debugging General Hints Debug the initialization phase of the program first to make sure that you have set up the data structures correctly
Do not try and optimize the program. First get the logic right. Then worry about performance enhancing steps
Reuse code or use library functions wherever possible
186. Performance Analysis Accurately predicting the performance of your parallel algorithm should help you decide whether it is worth coding and debugging
Explanation and a closer look at:
Amdahl's Law, which relies upon the evaluation of the sequential program to predict an upper limit to the speedup using multiple processors
Gustafson-Barsis's Law, which gives an estimate of scaled speedup
Karp-Flatt metric, which examines the speedup achieved by a parallel program solving a problem of fixed size
Isoefficiency metric, which is used to determine the scalability of a parallel system
187. Performance analysis Amdahl's Law Amdahl's Law: S(p) = p/(1+(p-1)f) = 1/(f+(1-f)/p) where:
f - the fraction of the computation that cannot be divided into concurrent tasks, 0 ≤ f ≤ 1, and
p - the number of processors
So if we have 20 processors and a serial portion of 5% we will get a speedup of 20/(1+(20-1)(0.05)) = 10.26
Amdahl's Law ignores parallel overhead
188. Performance Analysis Gustafson-Barsis's Law Given a parallel program solving a problem of size n using p processors, let s denote the fraction of total execution time spent in serial code. The maximum speedup is
ψ = p + (1 - p)s
Example: suppose I told my boss that if I buy a supercomputer with 3000 processors we can get a speedup of at least 2,700 times on the boss's favorite program. How much time can be spent in the serial portion of the code in order to meet this goal?
2700 = 3000 - 2999s
s = 300/2999
  ≈ 0.1 = 10%
189. Performance Analysis Karp-Flatt Metric Since both of the previous two laws ignore the parallel overhead term (time for interprocessor communication), they can overestimate speedup or scaled speedup. Therefore we have the Karp-Flatt metric, which states: given a parallel computation exhibiting a speedup ψ on p processors, where p > 1, the experimentally determined serial fraction e is defined as
e = (1/ψ - 1/p)/(1 - 1/p)
190. Performance Analysis Karp-Flatt Metric Example: suppose we have a 4-processor machine experiencing the following speedup
p    1      2      3      4
ψ    1.00   1.82   2.50   3.08
Why do we only have a speedup of 3.08 on the 4 processors?
If we compute e for the above (e is defined only for p > 1) we will notice that
e    -      .10    .10    .10
which shows that the experimentally determined serial fraction is not increasing with the number of processors; the limited speedup is therefore due to the program being inherently sequential rather than to parallel overhead
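A small C sketch that reproduces the e values above from the measured speedups:
#include <stdio.h>

/* experimentally determined serial fraction: e = (1/psi - 1/p) / (1 - 1/p) */
double karp_flatt(double psi, int p)
{
    return (1.0/psi - 1.0/p) / (1.0 - 1.0/p);
}

int main(void)
{
    double psi[] = {1.82, 2.50, 3.08};            /* measured speedups for p = 2, 3, 4 */
    for (int p = 2; p <= 4; p++)
        printf("p = %d  e = %.2f\n", p, karp_flatt(psi[p - 2], p));
    return 0;
}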
191. Performance Analysis Isoefficiency metric Efficiency of a parallel system can be represented as
e(n, p) where n is the problem size and p the number of processors and let
C = e(n, p) / (1 - e(n, p))
T(n, 1) be the sequential execution time and
T0(n, p) be the total amount of overhead time
Then in order to maintain the same level of efficiency as the number of processors increases, the problem size must increase to satisfy the following inequality:
T(n, 1) ≥ C T0(n, p)
192. Performance Analysis Isoefficiency metric Example
We have a sequential reduction algorithm whose computational complexity is Θ(n), with the reduction step having a time complexity of Θ(log p); every processor is used, so T0(n, p) = Θ(p log p), and hence the isoefficiency relation for this algorithm is n ≥ C p log p
To check:
If we are sum-reducing n values over p processors then each processor adds n/p values and takes part in a reduction that has ⌈log p⌉ steps. Doubling n and p would still keep each processor's share the same, but the number of steps needed to perform the reduction increases slightly to ⌈log 2p⌉ (the efficiency has dropped a bit). So in order to maintain the same efficiency, every time p is doubled, n will have to be more than doubled