Parallelism
Why Do We Need Parallelism? • Faster, of course • Finish the work earlier • Same work in less time • Do more work • More work in the same time
How to Parallelize an Application? • Break down the computational part into small pieces • Assign the small jobs to processes running in parallel • May become complicated when the small jobs depend on one another
Easy Case: Parameter Set • You are running experiments to support your claims and/or better understand a problem • An experiment here means an application whose results you are interested in when running it with different input parameters • The pieces of computation are the same program with different parameters • Each piece is independent of the others
Parameter Set using Scripts • Your experiment should be able to run in batch • Read all parameters (and other inputs) from the command line and files • Write all output to a file (whose name you can specify as an input) • Use ssh to start the experiment on many machines • If there is no common file system, use scp to stage the inputs and collect the results • Use nice to run the jobs at low priority so they do not disturb other users of the machines
Parameter Set via TDG Cluster • A simple script that uses ssh to start experiments on many machines will save you a lot of time • However, it is possible to do better by carefully considering resource selection, work distribution, input staging, output collection, and the like • That is, scheduling can really help in this scenario, using PBS
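The script-based approach above can be sketched as follows. This is only an illustration under stated assumptions: the host names, the program name `./experiment`, and the convention that the program takes its output file name as the last argument are all hypothetical. The function only builds the ssh command lines; actually launching them (for example with `subprocess.Popen`) is left out.

```python
import itertools
import shlex


def build_ssh_commands(hosts, program, param_sets, out_dir="results"):
    """Assign parameter sets to hosts round-robin and build one ssh command per run.

    Assumption (hypothetical): `program` accepts its parameters on the command
    line, followed by the name of the output file as the last argument.
    """
    commands = []
    for i, (host, params) in enumerate(zip(itertools.cycle(hosts), param_sets)):
        out_file = f"{out_dir}/run_{i}.out"
        args = " ".join(shlex.quote(str(p)) for p in params)
        # `nice` lowers the priority of the remote job.
        remote = f"nice {program} {args} {out_file}"
        commands.append(["ssh", host, remote])
    return commands


if __name__ == "__main__":
    # Hypothetical hosts and parameter sets, for illustration only.
    runs = build_ssh_commands(["node1", "node2"], "./experiment",
                              [(0.1, 10), (0.2, 10), (0.3, 10)])
    for cmd in runs:
        print(" ".join(cmd))
```

Because each run is independent, the round-robin assignment needs no coordination between machines; a scheduler can improve on it when machines differ in speed or load.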
Hard Case: Dependent Pieces of Computation • If you are running one huge simulation, the pieces of computation are no longer independent • The processes that form the application will have to communicate these dependencies
Hard Case: Dependent Pieces of Computation • Think about how to break the application apart into parallel-running processes • Consider carefully whether parallelizing your application is really worth the effort • Parallelize it only if your application really takes too long to run and is going to be used many times
Programming Alternatives • Shared Memory • Does not scale that well • Message Passing • Sockets • too low-level • Usually parallel applications are not client-server • MPI (Message Passing Interface) is the standard API to do this
Steps for Writing Parallel Program • If you are starting with an existing serial program, debug the serial code completely • Identify which parts of the program can be executed concurrently: • Requires a thorough understanding of the algorithm • Exploit any parallelism which may exist • May require restructuring of the program and/or algorithm. May require an entirely new algorithm. • Decompose the program: • Functional Parallelism • Data Parallelism • Combination of both
Steps for Writing Parallel Program • Code development • Code may be influenced/determined by machine architecture • Choose a programming paradigm • Determine communication • Add code to accomplish process control and communications • Compile, Test, Debug • Optimization • Measure Performance • Locate Problem Areas • Improve them
Program Decomposition • There are three methods for decomposing a problem into smaller processes to be performed in parallel: Functional Decomposition, Domain Decomposition, or a combination of both
Functional Decomposition (Functional Parallelism) • Decomposing the problem into different processes which can be distributed to multiple processors for simultaneous execution • Good to use when there is no static structure or fixed determination of the number of calculations to be performed
Functional Decomposition (Functional Parallelism) • [Figure: the problem divided among Machine 1, Machine 2, Machine 3 and Machine 4]
Domain Decomposition (Data Parallelism) • Partitioning the problem's data domain and distributing portions to multiple processors for simultaneous execution • Good to use for problems where: • data is static (factoring and solving large matrix or finite difference calculations) • a dynamic data structure is tied to a single entity, and the entity can be divided into subsets (large multi-body problems) • domain is fixed but computation within various regions of the domain is dynamic (fluid vortices models)
Domain Decomposition (Data Parallelism) • [Figure: the problem divided among Machine 1, Machine 2, Machine 3 and Machine 4]
Other Decomposition Methods – One Dimensional Data Distribution • Block Distribution • Cyclic Distribution
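A minimal sketch of the two one-dimensional distributions, assuming 0-based element indices and `p` processors numbered from 0. The function names are illustrative, not taken from any library.

```python
def block_owner(i, n, p):
    """Owner of element i when n elements are split into p contiguous blocks."""
    block_size = -(-n // p)  # ceil(n / p) using integer arithmetic
    return i // block_size


def cyclic_owner(i, p):
    """Owner of element i when elements are dealt out round-robin."""
    return i % p


n, p = 8, 4
print([block_owner(i, n, p) for i in range(n)])   # [0, 0, 1, 1, 2, 2, 3, 3]
print([cyclic_owner(i, p) for i in range(n)])     # [0, 1, 2, 3, 0, 1, 2, 3]
```

Block distribution keeps neighbouring elements on the same processor, while cyclic distribution spreads them out, which can help when the work per element varies along the array.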
Other Decomposition Methods – Two Dimensional Data Distribution • Block Block Distribution
Other Decomposition Methods – Two Dimensional Data Distribution • Block Cyclic Distribution
Other Decomposition Methods – Two Dimensional Data Distribution • Cyclic Block Distribution
Programming • Understanding the inter-processor communications of your program is essential • Message Passing communication is programmed explicitly. The programmer must understand and code the communication • Data Parallel compilers and run-time systems do all communications behind the scenes. The programmer need not understand the underlying communications. On the other hand to get good performance from your code you should write your algorithm with the best communication possible
Considerations: Amdahl's Law • It states that potential program speedup is defined by the fraction of code (f) which can be parallelized: speedup = 1 / (1 - f) • If none of the code can be parallelized, f = 0 and the speedup = 1 (no speedup). If all of the code is parallelized, f = 1 and the speedup is infinite (in theory)
Considerations: Amdahl's Law • Introducing the number of processors performing the parallel fraction of work, the relationship can be modeled by the equation speedup = 1 / (P/N + S), where: • P: parallel fraction • N: number of processors • S: serial fraction (S = 1 - P)
Considerations: Amdahl's Law • There are limits to the scalability of parallelism: no matter how many processors are used, the speedup cannot exceed 1/S. For example, at P = .50, .90 and .99 (50%, 90% and 99% of the code is parallelizable), the speedup is bounded by 2, 10 and 100, respectively
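To make the limits concrete, the sketch below evaluates the usual form of Amdahl's law, speedup = 1 / (P/N + S) with S = 1 - P, for the three parallel fractions mentioned above. The processor counts are chosen for illustration.

```python
def amdahl_speedup(parallel_fraction, n_processors):
    """Speedup predicted by Amdahl's law: 1 / (P/N + S), with S = 1 - P."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (parallel_fraction / n_processors + serial_fraction)


if __name__ == "__main__":
    fractions = (0.50, 0.90, 0.99)
    print("N".rjust(6), " ".join(f"P={p:.2f}".rjust(8) for p in fractions))
    for n in (10, 100, 1000, 10000):
        row = " ".join(f"{amdahl_speedup(p, n):8.2f}" for p in fractions)
        print(str(n).rjust(6), row)
```

As N grows, each column approaches 1/S, so adding processors beyond a certain point yields almost no further gain.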
Considerations: Amdahl's Law • Problems which increase the percentage of parallel time with their size are more "scalable" than problems with a fixed percentage of parallel time
Considerations: Load Balancing • Load balancing refers to the ways of distributing processes so as to ensure the most time-efficient parallel execution • If processes are not distributed in a balanced way, some processors are still working while others sit idle • Performance can be increased if work can be more evenly distributed • For example, if there are many processes of varying sizes, it may be more efficient to maintain a process pool and distribute to processors as each finishes • Consider a heterogeneous environment where there are machines of widely varying power and user load versus a homogeneous environment with identical processors running one job per processor
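The benefit of a process pool can be illustrated with a small simulation. This is a sketch under simplifying assumptions: task durations are known numbers, processors are identical, and handing out a task costs nothing. It compares a fixed round-robin assignment with a pool from which each processor takes the next task as soon as it becomes free.

```python
import heapq


def static_makespan(task_times, p):
    """Finish time when tasks are assigned round-robin before the run starts."""
    loads = [0.0] * p
    for i, t in enumerate(task_times):
        loads[i % p] += t
    return max(loads)


def pool_makespan(task_times, p):
    """Finish time when each processor takes the next task once it is free."""
    free_at = [0.0] * p          # time at which each processor becomes idle
    heapq.heapify(free_at)
    for t in task_times:
        earliest = heapq.heappop(free_at)
        heapq.heappush(free_at, earliest + t)
    return max(free_at)


tasks = [8, 1, 1, 1, 8, 1, 1, 1]   # illustrative task durations
print(static_makespan(tasks, 2), pool_makespan(tasks, 2))  # 18.0 11.0
```

With the static assignment both long tasks land on the same processor; the pool lets the other processor absorb the short tasks in the meantime.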
Considerations: Granularity • In order to coordinate between different processors working on the same problem, some form of communication between them is required • The ratio between computation and communication is known as granularity • The most efficient granularity is dependent on the algorithm and the hardware environment in which it runs • In most cases overhead associated with communications and synchronization is high relative to execution speed so it is advantageous to have coarse granularity
Fine-grain Parallelism • All processes execute a small number of instructions between communication cycles • Facilitates load balancing • Low computation to communication ratio • Implies high communication overhead and less opportunity for performance enhancement • If granularity is too fine it is possible that the overhead required for communications and synchronization between processes takes longer than the computation
Fine-grain Parallelism • [Figure: timeline for three processes, alternating short computation phases with frequent communication phases]
Coarse-grain Parallelism • Typified by long computations consisting of large numbers of instructions between communication/synchronization points • High computation to communication ratio • Implies more opportunity for performance increase • Harder to load balance efficiently • Imagine that the computation workload is 10 kg of material: • Sand = fine-grain • Cinder blocks = coarse-grain • Which is easier to distribute?
Coarse-grain Parallelism • [Figure: timeline for three processes, with long computation phases separated by occasional communication phases]
Considerations: Data Dependency • A data dependency exists when the same storage location is used more than once • Types of data dependencies • Flow Dependent: Process 2 uses a variable computed by Process 1. Process 1 must store/send the variable before Process 2 fetches it • Output Dependent: Process 1 and Process 2 both compute the same variable and Process 2's value must be stored/sent after Process 1's • Control Dependent: Process 2's execution depends upon a conditional statement in Process 1. Process 1 must complete before a decision can be made about executing Process 2
Considerations: Data Dependency • How to handle data dependencies? • Distributed memory • Communicate required data at synchronization points • Shared memory • Synchronize read/write operations between processes
Considerations: Communication Patterns and Bandwidth • For some problems, increasing the number of processors will: • Decrease the execution time attributable to computation • But also increase the execution time attributable to communication • Communication patterns also affect the computation to communication ratio • For example, gather-scatter communications between a single processor and N other processors will be impacted more by an increase in latency than N processors communicating only with nearest neighbors • In a gather-scatter pattern, the processors have to wait until all have reached a certain point
Considerations: I/O Operation • I/O operations are generally regarded as inhibitors to parallelism • In an environment where all processors see the same file space, write operations can result in files being overwritten • Read operations will be affected by the fileserver's ability to handle multiple read requests at the same time • I/O which must be conducted over the network (non-local) can cause severe bottlenecks
Considerations: I/O Operation • Some alternatives: • Reduce overall I/O as much as possible • Confine I/O to specific serial portions of the job • For example, process 0 could read an input file and then communicate required data to other processes. Likewise, process 1 could perform the write operation after receiving required data from all other processes • Create unique filenames for each process's input/output file(s) • For distributed memory systems with shared file space, perform I/O in local, non-shared file space • For example, each processor may have /tmp filespace which can be used. This is usually much more efficient than performing I/O over the network to one's home directory
Considerations: Fault Tolerance and Restarting • In parallel programming, it is usually the programmer's responsibility to handle events such as: • machine failures • task failures • checkpointing • restarting
Considerations: Deadlock • Deadlock describes a condition where two or more processes are each waiting for an event or communication from one of the other processes • The simplest example is two processes which are both programmed to read/receive from the other before writing/sending:
Process 1:
    X = 1
    Recv (Process 2, Y)
    Send (Process 2, X)
    Z = X + Y
    …
Process 2:
    Y = 10
    Recv (Process 1, X)
    Send (Process 1, Y)
    Z = X + Y
    …
Considerations: Debugging • Debugging parallel programs is significantly more of a challenge than debugging serial programs • Debug the program as soon as development starts • Use a modular approach to program development • Pay as close attention to communication details as to computation details
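One common way to break the cycle in the two-process example is to reverse the order in one process, so that it sends before it receives. The sketch below illustrates this with Python threads and queues standing in for processes and message passing; that substitution, the timeouts, and all names are assumptions made for illustration, not part of the original example.

```python
import queue
import threading


def run_exchange():
    """Exchange X and Y between two 'processes' without deadlocking."""
    inbox = {1: queue.Queue(), 2: queue.Queue()}   # one mailbox per process
    result = {}

    def process_1():
        x = 1
        inbox[2].put(x)                 # send first ...
        y = inbox[1].get(timeout=5)     # ... then receive
        result[1] = x + y

    def process_2():
        y = 10
        x = inbox[2].get(timeout=5)     # receive first ...
        inbox[1].put(y)                 # ... then send
        result[2] = x + y

    threads = [threading.Thread(target=f) for f in (process_1, process_2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result


if __name__ == "__main__":
    print(run_exchange())
```

If both functions called `get` before `put`, each would block waiting for a message the other never sends, which is exactly the deadlock described above (here it would surface as a `queue.Empty` exception after the timeout).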
Essentials of Loop Parallelism • In many problems a loop construct forms the main computational component of the code. Loops are a main target for parallelizing and vectorizing code. A program often spends much of its time in loops. When it can be done, parallelizing these sections of code can have dramatic benefits. • A step-wise refinement procedure for developing the parallel algorithms will be employed. An initial solution for each problem will be presented and improved by considering performance issues
Essentials of Loop Parallelism • Pseudo-code will be used to describe the solutions. The solutions will address the following issues: • identification of parallelism • program decomposition • load balancing (static vs. dynamic) • task granularity in the case of dynamic load balancing • communication patterns - overlapping communication and computation • Note the difference in approaches between message passing and data parallel programming. Message passing explicitly parallelizes the loops, whereas data parallel programming replaces loops by working on entire arrays in parallel
Example: PI Calculation (Serial) • Problem is: • Computationally intensive • Minimal communication • The value of PI can be calculated in a number of ways, many of which are easily parallelized • Consider the following method of approximating PI • Inscribe a circle in a square • Randomly generate points in the square • Determine the number of points in the square that are also in the circle • Let r be the number of points in the circle divided by the number of points in the square • PI ~ 4 r, because the ratio of the circle's area to the square's area is PI/4 • Note that the more points generated, the better the approximation
Example: PI Calculation (Serial) • Serial pseudo code for this procedure:
    npoints = 10000
    circle_count = 0
    do j = 1,npoints
        generate 2 random numbers between 0 and 1
        xcoordinate = random1
        ycoordinate = random2
        if (xcoordinate, ycoordinate) inside circle
            then circle_count = circle_count + 1
    end do
    PI = 4.0*circle_count/npoints
• Note that most of the time in running this program would be spent executing the loop
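The serial pseudo code translates almost line for line into Python. In this sketch the square is assumed to be the unit square and the inscribed circle has radius 0.5 and centre (0.5, 0.5); the optional `seed` parameter is an addition to make runs repeatable.

```python
import random


def estimate_pi(npoints, seed=None):
    """Monte Carlo estimate of PI from points thrown into the unit square."""
    rng = random.Random(seed)
    circle_count = 0
    for _ in range(npoints):
        xcoordinate = rng.random()
        ycoordinate = rng.random()
        # Inside the circle of radius 0.5 centred at (0.5, 0.5)?
        if (xcoordinate - 0.5) ** 2 + (ycoordinate - 0.5) ** 2 <= 0.25:
            circle_count += 1
    return 4.0 * circle_count / npoints


if __name__ == "__main__":
    print(estimate_pi(10000))
```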
Example: PI Calculation (Parallel) • Parallel strategy: break the loop into portions which can be executed by the processors. • For the task of approximating PI: • each processor executes its portion of the loop a number of times • each processor can do its work without requiring any information from the other processors (there are no data dependencies). This situation is known as Embarrassingly Parallel • Use the SPMD (Single Program, Multiple Data) model: one process acts as master and collects the results
Example: PI Calculation (Parallel) • Message passing pseudo code:
    npoints = 10000
    circle_count = 0
    p = number of processors
    num = npoints/p
    find out if I am master or worker
    do j = 1,num
        generate 2 random numbers between 0 and 1
        xcoordinate = random1; ycoordinate = random2
        if (xcoordinate, ycoordinate) inside circle
            then circle_count = circle_count + 1
    end do
    if I am master
        receive from workers their circle_counts
        compute PI (use master and workers calculations)
    else if I am worker
        send to master circle_count
    endif
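The structure of the message passing solution can be sketched in Python as follows. This is not MPI: a thread pool stands in for the processors and the return values stand in for the messages sent to the master, so it shows the decomposition rather than a real speedup (CPython threads do not run CPU-bound code in parallel). Seeding each worker with its rank is an added assumption, so that workers do not all generate the same points.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def count_in_circle(rank, num):
    """Work done by one worker: count points that fall inside the circle."""
    rng = random.Random(rank)   # a different random stream per worker
    circle_count = 0
    for _ in range(num):
        x, y = rng.random(), rng.random()
        if (x - 0.5) ** 2 + (y - 0.5) ** 2 <= 0.25:
            circle_count += 1
    return circle_count


def parallel_pi(npoints, p):
    """Split npoints across p workers, then combine their counts."""
    num = npoints // p                       # each worker's share of the loop
    with ThreadPoolExecutor(max_workers=p) as pool:
        counts = list(pool.map(count_in_circle, range(p), [num] * p))
    # The "master" step: combine all circle_counts and compute PI.
    return 4.0 * sum(counts) / (num * p)


if __name__ == "__main__":
    print(parallel_pi(10000, 4))
```

The only communication is the final collection of one integer per worker, which is why the problem is considered embarrassingly parallel.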
Example: PI Calculation (Parallel) • Data parallel solution: • The data parallel solution processes entire arrays at the same time. • No looping is used. • Arrays are automatically distributed to processors. All message passing is done behind the scenes. In data parallel, one node, a sort of master, usually holds all scalar values. The SUM function does a reduction and leaves the value in a scalar variable. • A temporary array, COUNTER, with the same size as RANDOM is created for the sum operation
Example: PI Calculation (Parallel) • Data parallel pseudo code:
    fill RANDOM with 2 random numbers between 0 and 1
    where (the values of RANDOM are inside the circle)
        COUNTER = 1
    else where
        COUNTER = 0
    end where
    circle_count = sum (COUNTER)
    PI = 4.0*circle_count/npoints
Example: Array Elements Calculation (Serial) • This example shows calculations on array elements that require very little communication. • Elements of a 2-dimensional array are calculated. • The calculation of each element is independent of the others, which leads to an embarrassingly parallel situation. • The problem should be computation intensive. • Serial code could be of the form:
    do j = 1,n
        do i = 1,n
            a(i,j) = fcn(i,j)
        end do
    end do
• The serial program calculates one element at a time in the specified order
Example: Array Elements Calculation (Parallel) • Message Passing • Arrays are distributed so that each processor owns a portion of an array. • Independent calculation of array elements ensures that no communication amongst processors is needed. • Distribution scheme is chosen by other criteria, e.g. unit stride through arrays. • Because unit stride through arrays is desirable, the choice of a distribution scheme depends on the programming language. • Fortran: block cyclic distribution • C: cyclic block distribution • After the array is distributed, each processor executes the portion of the loop corresponding to the data it owns. • Notice only the loop variables are different from the serial solution
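A minimal sketch of the last two points, assuming a plain block distribution of columns, 0-based indices, and a placeholder `fcn`; the helper names are hypothetical. Each processor runs the same double loop as the serial code, but its outer loop covers only the columns it owns.

```python
def my_columns(rank, p, n):
    """Half-open column range [start, end) owned by `rank` out of p processors."""
    base, extra = divmod(n, p)
    start = rank * base + min(rank, extra)   # the first `extra` ranks get one more
    end = start + base + (1 if rank < extra else 0)
    return start, end


def fcn(i, j):
    """Placeholder for the real, computation-intensive element function."""
    return i * 10 + j


def compute_local(rank, p, n):
    """Same loop nest as the serial code; only the column bounds differ."""
    mystart, myend = my_columns(rank, p, n)
    local = {}
    for j in range(mystart, myend):
        for i in range(n):
            local[(i, j)] = fcn(i, j)
    return local


if __name__ == "__main__":
    print([my_columns(r, 4, 10) for r in range(4)])
```

Since no element depends on another, the pieces computed by the different ranks can simply be collected at the end, with no communication during the loop.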