Parallel Computing 5
Parallel Application Design
Ondřej Jakl
Institute of Geonics, Academy of Sci. of the CR
Outline of the lecture • Task/channel model • Foster’s design methodology • Partitioning • Communication analysis • Agglomeration • Mapping to processors • Examples
Design of parallel algorithms • In general a very creative process • Only methodical frameworks are available • Usually several alternatives have to be considered • The best parallel solution may differ from what the sequential approach suggests
Task/channel model (1) • Introduced in Ian Foster’s Designing and Building Parallel Programs [Foster 1995] • http://www-unix.mcs.anl.gov/dbpp • Represents a parallel computation as a set of tasks • task is a program, its local memory and a collection of I/O ports • task can send local data values to other tasks via output ports • task can receive data values from other tasks via input ports • The tasks may interact with each other by sending messages through channels • channel is a message queue that connects one task’s output port with another task’s input port • nonblocking asynchronous send and blocking receive is assumed • An abstraction close to the message passing model
Task/channel model (2) • Figure (after [Quinn 2004]): a task with its program, local memory, input ports and output ports; channels connect output ports to input ports • Directed graph of tasks (vertices) and channels (edges)
Foster’s methodology [Foster 1995] • Design stages: • partitioning into concurrent tasks • communication analysis to coordinate tasks • agglomeration into larger tasks with respect to the target platform • mapping of tasks to processors • Stages 1 and 2 work on the conceptual level, stages 3 and 4 are implementation dependent • In practice often considered simultaneously
Partitioning (decomposition) • Process of dividing the computation and the data into pieces – primitive tasks • Goal: Expose the opportunities for parallel processing • Maximal (fine-grained) decomposition for greater flexibility • Complementary techniques: • domain decomposition (data-centric approach) • functional decomposition (computation-centric approach) • Combinations possible • usual scenario: primary decomposition functional, secondary decomposition domain
Domain (data) decomposition • Primary object of decomposition: processed data • first, data associated with the problem is divided into pieces • focus on the largest and/or most frequently accessed data • pieces should be of comparable size • next, the computation is partitioned according to the data on which it operates • usually the same code for each task (SPMD – Single Program Multiple Data) • may be non-trivial, may bring up complex mathematical problems • Most often used technique in parallel programming • Figure: 3D grid data – one-, two-, three-dimensional decomposition [Foster 1995]
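In its simplest (1-D block) form, domain decomposition just assigns a contiguous range of data elements to every task. The sketch below (a hypothetical helper in C, written for an SPMD setting with MPI-style ranks; not part of the lecture material) computes the range owned by task rank out of p tasks:

    /* 1-D block domain decomposition: index range [*lo, *hi) owned by
       task `rank` out of `p` tasks for a problem of size n */
    void block_range(int rank, int p, int n, int *lo, int *hi)
    {
        *lo = (int)((long long)rank * n / p);        /* first owned index   */
        *hi = (int)((long long)(rank + 1) * n / p);  /* one past last index */
    }

Every task runs the same code on its own range, which is the typical outcome of domain decomposition.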
Functional (task) decomposition • Figure: climate model [Foster 1995] • Primary object of decomposition: computation • first, computation is decomposed into disjoint tasks • different codes for the tasks (MPMD – Multiple Program Multiple Data) • methodological benefits: implies program structuring • gives rise to simpler modules with interfaces • cf. object-oriented programming, etc. • next, data is partitioned according to the requirements of the tasks • data requirements may be disjoint, or overlap (which implies communication) • Sources of parallelism: • concurrent processing of independent tasks • concurrent processing of a stream of data through pipelining • a stream of data is passed through a succession of tasks, each of which performs some operation on it • MPSD – Multiple Program Single Data • The number of tasks usually does not scale with the problem size – for greater scalability combine with domain decomposition of the subtasks
Good decomposition • More tasks (at least by an order of magnitude) than processors • if not: little flexibility • No redundancy in processing and data • if not: little scalability • Comparable size of tasks • if not: difficult load balancing • Number of tasks proportional to the size of the problem • if not: problems with utilizing additional processors • Alternate partitions available?
Example: PI calculation • Figure: F(x) = 4/(1+x^2) on [0, 1]; the area under the curve equals π • Calculation of π by the standard numerical integration formula • Consider numerical integration based on the rectangle method • the integral is approximated by the area of evenly spaced rectangular strips • the height of each strip is calculated as the value of the integrated function at the midpoint of the strip
PI calculation – sequential algorithm • Sequential pseudocode:
    set n (number of strips)
    for each strip
        calculate the height y of the strip (rectangle) at its midpoint
        add y to the result S
    endfor
    multiply S by the width of the strips
    print the result
PI calculation – parallel algorithm • Parallel pseudocode (for the task/channel model):
    if master then
        set n (number of strips)
        send n to the workers
    else  // worker
        receive n from the master
    endif
    for each strip assigned to this task
        calculate the height y of the strip (rectangle) at its midpoint
        add y to the (partial) result S
    endfor
    if master then
        receive S from all workers
        sum all S and multiply by the width of the strips
        print the result
    else  // worker
        send S to the master
    endif
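One possible C/MPI rendering of this pseudocode is sketched below (an illustration only, not the lecture’s reference code); it replaces the explicit sends and receives of the pseudocode with the equivalent collectives MPI_Bcast and MPI_Reduce and assigns strips to tasks cyclically:

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char *argv[])
    {
        int rank, size, n = 1000000;
        double h, sum = 0.0, pi = 0.0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0 && argc > 1)                      /* master sets n ...   */
            n = atoi(argv[1]);
        MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);   /* ... workers get it  */

        h = 1.0 / n;                            /* width of one strip          */
        for (int i = rank; i < n; i += size) {  /* strips assigned to this task */
            double x = (i + 0.5) * h;           /* midpoint of strip i         */
            sum += 4.0 / (1.0 + x * x);         /* height of strip i           */
        }

        /* master collects and sums the partial results */
        MPI_Reduce(&sum, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("pi ~= %.12f\n", pi * h);

        MPI_Finalize();
        return 0;
    }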
Parallel PI calculation – partitioning • Domain decomposition: • primitive task – calculation of one strip height • Functional decomposition: • manager task: controls the computation • worker task(s): perform the main calculation • manager/worker technique (also called control decomposition) • more or less a technical decomposition • A perfectly/embarrassingly parallel problem: the (worker) processes are (almost) independent
Communication analysis • Determination of the communication pattern among the primitive tasks • Goal: Expose the information flow • The tasks generated by partitioning are as a rule not independent – they cooperate by exchanging data • Communication means overhead – minimize! • not included in the sequential algorithm • Efficient communication may be difficult to organize • especially in domain-decomposed problems
Parallel communication • Categorization (pairs of opposites): • local: between a small number of “neighbours” vs. global: many “distant” tasks participate • structured: regular and repeated communication patterns in place and time vs. unstructured: communication networks are arbitrary graphs • static: communication partners do not change over time vs. dynamic: communication depends on the computation history and changes at runtime • synchronous: communication partners cooperate in data transfer operations vs. asynchronous: producers cannot determine the data requests of consumers • The first item of each pair is to be preferred in parallel programs
Good communication • Preferably no communication involved in the parallel algorithm • if not: overhead decreasing parallel efficiency • Tasks have comparable communication demands • if not: little scalability • Tasks communicate only with a small number of neighbours • if not: loss of parallel efficiency • Communication operations and computation in different tasks can proceed concurrently (communication and computation can overlap) • if not: inefficient and nonscalable algorithm
Example: Jacobi differences • Jacobi finite difference method • Repeated update (in timesteps) of values assigned to points of a multidimensional grid • In 2-D, the grid point i, j gets in timestep t+1 a value given by the weighted-mean formula below [Foster 1995]
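For reference, the 2-D Jacobi update of [Foster 1995] is the five-point weighted mean

    X_{i,j}^{(t+1)} = \frac{4\,X_{i,j}^{(t)} + X_{i-1,j}^{(t)} + X_{i+1,j}^{(t)} + X_{i,j-1}^{(t)} + X_{i,j+1}^{(t)}}{8}

i.e. each grid point is averaged with its four neighbours.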
Jacobi: parallel algorithm • Decomposition (domain): • primitive task – calculation of the weighted mean in one grid point • Parallel code (main loop):
    for each timestep t
        send X_{i,j}(t) to each neighbour
        receive X_{i-1,j}(t), X_{i+1,j}(t), X_{i,j-1}(t), X_{i,j+1}(t) from the neighbours
        calculate X_{i,j}(t+1)
    endfor
• Communication: • communication channels between neighbours • local, structured, static, synchronous
Example: Gauss-Seidel scheme • More efficient in sequential computing • Not easy to parallelize [Foster 1995]
Agglomeration • Process of grouping primitive tasks into larger tasks • Goal: revision of the (abstract, conceptual) partitioning and communication to improve performance • choose granularity appropriate to the target parallel computer • A large number of fine-grained tasks tends to be inefficient because of great • communication cost • task creation cost • spawn operation rather expensive (and to simplify programming demands) • Agglomeration increases granularity • potential conflict with retaining flexibility and scalability [next slides] • Closely related to mapping to processors
Agglomeration & granularity • Granularity: a measure characterizing the size and quantity of tasks • Increasing granularity by combining several tasks into larger ones • reduces communication cost • less communication (a) • fewer, but larger messages (b) • reduces task creation cost • fewer processes • Agglomerate tasks that • frequently communicate with each other • increases locality • cannot execute concurrently • Consider also [next slides] • surface-to-volume effects • replication of computation/data • Figure: (a), (b) [Quinn 2004]
Surface-to-volume effects (1) • The communication/computation ratio decreases with increasing granularity: • computation cost is proportional to the “volume” of the subdomain • communication cost is proportional to the “surface” • Agglomeration in all dimensions is most efficient • reduces the surface for a given volume • in practice more difficult to code • Difficult with unstructured communication • Ex.: Jacobi finite differences [next slide]
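A quick back-of-the-envelope version of this effect (assuming a square 2-D subdomain of side n grid points and a five-point stencil; not spelled out in the original slide): computation per step involves about n^2 points, communication about 4n boundary points, so

    \frac{T_{\mathrm{comm}}}{T_{\mathrm{comp}}} \propto \frac{4n}{n^{2}} = \frac{4}{n}

and the ratio shrinks as the blocks grow; agglomerating in both dimensions (blocks instead of strips) gives the smallest surface for a given volume.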
Surface-to-volume effects (2) • Ex.: Jacobi finite differences – agglomeration • Figure: no agglomeration vs. 4 x 4 agglomeration [Foster 1995]
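After such an agglomeration each task owns a strip or block of the grid and exchanges only halo (ghost) boundaries with its neighbours in every timestep. A C/MPI sketch of one timestep for a row-strip agglomeration might look as follows (illustrative only; names, the storage layout and the use of MPI_Sendrecv instead of separate sends and receives are assumptions, and physical boundary handling is omitted):

    #include <mpi.h>

    /* One Jacobi timestep on a row-strip agglomeration: this task owns
       local_n grid rows of width m, stored in u with halo rows 0 and
       local_n+1; up/down are neighbour ranks (MPI_PROC_NULL at the edges). */
    void jacobi_step(double *u, double *unew, int local_n, int m,
                     int up, int down, MPI_Comm comm)
    {
        /* exchange halo rows with the two neighbours */
        MPI_Sendrecv(&u[1 * m],             m, MPI_DOUBLE, up,   0,
                     &u[(local_n + 1) * m], m, MPI_DOUBLE, down, 0,
                     comm, MPI_STATUS_IGNORE);
        MPI_Sendrecv(&u[local_n * m],       m, MPI_DOUBLE, down, 1,
                     &u[0],                 m, MPI_DOUBLE, up,   1,
                     comm, MPI_STATUS_IGNORE);

        /* five-point weighted mean on the owned rows (boundary columns fixed) */
        for (int i = 1; i <= local_n; i++)
            for (int j = 1; j < m - 1; j++)
                unew[i * m + j] = (4.0 * u[i * m + j]
                                   + u[(i - 1) * m + j] + u[(i + 1) * m + j]
                                   + u[i * m + j - 1]   + u[i * m + j + 1]) / 8.0;
    }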
Agglomeration & flexibility • Ability to make use of diverse computing environments • good parallel programs are resilient to changes in processor count • scalability – ability to employ an increasing number of tasks • Too coarse granularity reduces flexibility • Usual practical design: agglomerate one task per processor • can be controlled by a compile-time or runtime parameter • with some MPS (PVM, MPI-2) on the fly (dynamic spawn) • But consider also creating more tasks than processors: • when tasks often wait for remote data: several tasks mapped to one processor permit overlapping computation and communication • greater scope for mapping strategies that balance computational load over available processors • a rule of thumb: an order of magnitude more tasks • Optimal number of tasks: determined by a combination of analytic modelling and empirical studies
Replicating computation • To reduce communication requirements, the same computation is repeated in several tasks • compute once & distribute vs. compute repeatedly & don’t communicate – a trade-off • Redundant computation pays off when its computational cost is less than the communication cost • moreover it removes dependences • Ex.: summation of n numbers (located on separate processors) with distribution of the result • Without replication: 2(n – 1) steps • (n – 1) additions • necessary minimum • With replication: (n – 1) steps • n (n – 1) additions • (n – 1)^2 redundant
Good agglomeration • Increased locality of communication • Beneficial replication of computation • Replication of data does not compromise scalability • Similar computation and communication costs of the agglomerated tasks • Number of tasks can scale with the problem size • Fewer larger-grained tasks are usually more efficient than many fine-grained tasks
Mapping • Process of assigning (agglomerated) tasks to processors for execution • Goal: Maximize processor utilization, minimize interprocessor communication • load balancing • Concerns multicomputers only • multiprocessors: automatic task scheduling • Guidelines to minimize execution time (conflicting): • place concurrent tasks on different processors (increase concurrency) • place tasks with frequent communication on the same processor (enhance locality) • Optimal mapping is in general an NP-complete problem • strategies and heuristics for special classes of problems are available
Basic mapping strategies [Quinn 2004]
Load balancing • Mapping strategy with the aim to keep all processors busy during the execution of the parallel program • minimization of the idle time • In a heterogeneous computing environment every parallel application may need (dynamic) load balancing • Static load balancing • performed before the program enters the solution phase • Dynamic load balancing • needed when tasks are created/destroyed at run-time and/or communication/computation requirements of tasks vary widely • invoked occasionally during the execution of the parallel program • analyses the current computation and rebalances it • may imply significant overhead! • Figure: bad load balancing – processors idle at a barrier [LLNL 2010]
Load-balancing algorithms • Most appropriate for domain decomposed problems • Representative examples [next slides] • recursive bisection • probabilistic methods • local algorithms
Recursive bisection • Recursive cuts into subdomains of nearly equal computational cost while attempting to minimize communication • allows the partitioning algorithm itself to be executed in parallel • Figure: irregular grid for a superconductivity simulation [Foster 1995] • Coordinate bisection: • for irregular grids with local communication • cuts into halves based on physical coordinates of grid points • simple, but does not take communication into account • unbalanced bisection: does not necessarily divide into halves • to reduce communication • a lot of variants • e.g. recursive graph bisection
Probabilistic methods • Figure: strips of F(x) = 4/(1+x^2) from the PI calculation distributed over processors #1–#3 • Allocate tasks randomly to processors • about the same computational load can be expected for a large number of tasks • typically at least ten times as many tasks as processors required • Communication is usually not considered • appropriate for tasks with little communication and/or little locality in communication • Simple, low cost, scalable • Variant: cyclic mapping – for spatial locality in load levels • each of p processors is allocated every pth task • Variant: block-cyclic distributions • blocks of tasks are allocated to processors
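For concreteness, the cyclic and block-cyclic variants amount to the following owner computations (a sketch with hypothetical helper names, b being the block size):

    /* cyclic: task t goes to processor t mod p */
    int cyclic_owner(int t, int p)              { return t % p; }
    /* block-cyclic: blocks of b consecutive tasks are dealt out cyclically */
    int block_cyclic_owner(int t, int b, int p) { return (t / b) % p; }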
Local algorithms • Compensate for changes in computational load using only local information obtained from a small number of “neighbouring” tasks • do not require expensive global knowledge of the computational state • If an imbalance exists (threshold), some computational load is transferred to the less loaded neighbour • Simple, but less efficient than global algorithms • slow to adjust to major changes in load characteristics • Advantageous for dynamic load balancing • Figure: local algorithm for a grid problem [Foster 1995]
Task-scheduling algorithms • Suitable for a pool of independent tasks • represent stand-alone problems, contain solution code + data • can be conceived as a special kind of data • Often obtained from functional decomposition • many tasks with weak locality • Centralized or distributed variants • Dynamic load balancing by default • Examples: • (hierarchical) manager/worker • decentralized schemes
Manager/worker • Simple task-scheduling scheme • sometimes called “master/slave” • Central manager task is responsible for problem allocation • maintains a pool (queue) of problems • e.g. a search in a particular tree branch • Workers run on separate processors and repeatedly request and solve assigned problems • may also send new problems to the manager • Efficiency: • consider the cost of problem transfer • prefetching, caching applicable • the manager must not become a bottleneck • Hierarchical manager/worker variant • introduces a layer of submanagers, each responsible for a subset of workers [Wilkinson 1999]
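A minimal C/MPI skeleton of the manager/worker scheme is sketched below (an illustration under simplifying assumptions, not the lecture’s code: problems and results are plain integers, the manager hands out problem indices on request, and squaring stands in for the real solution step):

    #include <mpi.h>

    #define TAG_WORK 1
    #define TAG_STOP 2

    /* manager: answer each request with the next problem index, or TAG_STOP */
    static void manager(int nproblems, int nworkers)
    {
        MPI_Status st;
        int result, next = 0, active = nworkers;
        while (active > 0) {
            MPI_Recv(&result, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &st);          /* request (carries last result) */
            if (next < nproblems) {
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK, MPI_COMM_WORLD);
                next++;
            } else {                                /* pool empty: stop this worker  */
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_STOP, MPI_COMM_WORLD);
                active--;
            }
        }
    }

    /* worker: repeatedly request a problem, solve it, return the result */
    static void worker(void)
    {
        MPI_Status st;
        int problem, result = 0;
        for (;;) {
            MPI_Send(&result, 1, MPI_INT, 0, TAG_WORK, MPI_COMM_WORLD);
            MPI_Recv(&problem, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == TAG_STOP) break;
            result = problem * problem;             /* placeholder for real work */
        }
    }

    int main(int argc, char *argv[])                /* run with at least 2 processes */
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        if (rank == 0) manager(100, size - 1);      /* 100: hypothetical pool size */
        else           worker();
        MPI_Finalize();
        return 0;
    }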
Decentralized schemes • Task scheduling without global management • The task pool is a data structure distributed among many processors • The pool is accessed asynchronously by idle workers • various access policies: neighbours, at random, etc. • Termination detection may be difficult
Good mapping • In general: Try to balance conflicting requirements for equitable load distribution and low communication cost • When possible, use static mapping allocating each process to a single processor • Dynamic load balancing / task scheduling can be appropriate when the number or size of tasks is variable or not known until runtime • With centralized load-balancing schemes verify that the manager will not become a bottleneck • Consider implementation cost
Conclusions • Foster’s design methodology is conveniently applicable • in [Quinn 2004] it is used to design many parallel programs in MPI (and OpenMP) • In practice, all phases are often considered in parallel • In bad practice, the conceptual phases are skipped • machine-dependent design from the very beginning • The methodology serves as a kind of “life-belt” (“fixed point”) when the development runs into trouble
Further study • [Foster 1995] Designing and Building Parallel Programs • [Quinn 2004] Parallel Programming in C with MPI and OpenMP • In most textbooks a chapter like “Principles of parallel algorithm design” • often concentrated on the mapping step