Virtues of Good (Parallel) Software • Concurrency • Able to exploit the concurrency available in the algorithm/problem/hardware • Scalability • Resilient to increasing processor count • Locality • More frequent access to local data than to remote data • Modularity • Employs abstraction and modular design
Two Basic Requirements for a Parallel Program • Safety: Produces correct results • The result computed on P processors and on 1 processor must be IDENTICAL. • Liveness: Able to proceed and finish; free of deadlock.
Sources of Overhead • Execution time • The time that elapses from when the first processor starts executing on the problem to when the last processor completes execution • Execution time = computation time + communication time + idle time • Communication / interprocess interaction: usually the main source of overhead • T_comm = t_s + t_w*L, where t_s is the message startup (latency) time, t_w the per-word transfer time, and L the message length in words • Minimize the volume and frequency of communication; overlap computation with communication • Idling: lack of computation or lack of data • Load imbalance • Synchronization • Presence of serial components • Waiting on remote data • Replicated computation • Trade-off: communicate or replicate
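The linear cost model above can be made concrete in a few lines of Python. This is a minimal sketch: the values of t_s and t_w are made-up placeholders, not measurements of any particular machine, and serve only to show why fewer, larger messages reduce overhead.

```python
def comm_time(t_s, t_w, length):
    """Linear communication-cost model: T_comm = t_s + t_w * L."""
    return t_s + t_w * length

# Illustrative (made-up) parameters: 1 microsecond startup, 1 ns per word.
t_s, t_w = 1e-6, 1e-9

# Sending one large message is cheaper than many small ones carrying the
# same total volume, because the startup cost t_s is paid only once.
total_words = 10_000
one_message = comm_time(t_s, t_w, total_words)
ten_messages = 10 * comm_time(t_s, t_w, total_words // 10)
print(f"1 x 10000 words: {one_message:.2e} s")
print(f"10 x 1000 words: {ten_messages:.2e} s")
```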
Speedup & Efficiency • Relative speed-up: the factor by which the execution time is reduced on multiple processors • S(p) = T_1/T_p • T_1 is the execution time on one processor • T_p is the execution time on p processors • Absolute speed-up: same ratio, but with T_1 taken as the uniprocessor time of the best-known (sequential) algorithm • In theory S(p) <= p • Embarrassingly parallel (EP): no communication among CPUs • Superlinear speedup (S(p) > p): does occur in practice, e.g. due to cache effects when the per-processor working set shrinks • Efficiency: the fraction of time that processors spend doing useful work • E = S/p = T_1/(p*T_p) • Parallel cost: p*T_p • Parallel overhead: T_o = p*T_p - T_1
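These definitions translate directly into code. A minimal sketch, with made-up timings used purely for illustration:

```python
def parallel_metrics(t1, tp, p):
    """Compute the metrics defined above from T_1, T_p and p."""
    speedup = t1 / tp                 # S(p) = T_1 / T_p
    efficiency = speedup / p          # E = S / p
    cost = p * tp                     # parallel cost
    overhead = p * tp - t1            # T_o = p*T_p - T_1
    return speedup, efficiency, cost, overhead

# Example: a job that takes 100 s on 1 processor and 15 s on 8 processors.
s, e, c, o = parallel_metrics(100.0, 15.0, 8)
print(f"S = {s:.2f}, E = {e:.2f}, cost = {c:.0f} s, overhead = {o:.0f} s")
```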
Amdahl’s Law • alpha – fraction of operations in the serial code that can be parallelized; p – number of processors • This is for a fixed problem size • T_p = alpha*T_1/p + (1-alpha)*T_1, so S(p) = 1/((1-alpha) + alpha/p) • S → 1/(1-alpha) as p → infinity • alpha = 90% → S <= 10 • alpha = 99% → S <= 100 • alpha = 99.9% → S <= 1000 • “Mental block”: Amdahl’s law was long used as an argument against massively parallel computing
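A small sketch of the bound, using the speedup formula above (the processor counts are arbitrary illustration values):

```python
def amdahl_speedup(alpha, p):
    """Amdahl's law: fixed problem size, alpha = parallelizable fraction."""
    return 1.0 / ((1.0 - alpha) + alpha / p)

for alpha in (0.90, 0.99, 0.999):
    print(f"alpha = {alpha:5.3f}:",
          [round(amdahl_speedup(alpha, p), 1) for p in (10, 100, 1000, 10**6)])
# As p grows, the speedup approaches 1 / (1 - alpha):
# 10 for alpha = 0.90, 100 for alpha = 0.99, 1000 for alpha = 0.999.
```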
Gustafson’s Law • alpha – fraction of time spent on parallel operations in the parallel program • This is for a scaled problem size, i.e. constant run time • T_1 = (1-alpha)*T_p + p*alpha*T_p, so the scaled speedup is S(p) = T_1/T_p = (1-alpha) + p*alpha • As the problem size increases, the fraction of parallel operations increases, so the scaled speedup grows nearly linearly with p
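The same kind of sketch for the scaled speedup; alpha = 0.99 is an arbitrary illustrative value:

```python
def gustafson_speedup(alpha, p):
    """Gustafson's law: scaled problem size, alpha = parallel fraction of T_p."""
    return (1.0 - alpha) + p * alpha

for p in (10, 100, 1000):
    print(f"p = {p:4d}: scaled speedup = {gustafson_speedup(0.99, p):.1f}")
# With alpha = 0.99 the scaled speedup grows almost linearly in p,
# in contrast to the Amdahl bound of 100 for the same fraction.
```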
Iso-Efficiency Function • For a fixed problem size N, as P increases, the increase in speedup S slows down or levels off and the efficiency E decreases • For fixed P, as N increases, S increases and E increases • As P increases, one can increase the problem size N so that the efficiency stays constant • This N(p) for fixed efficiency is called the iso-efficiency function • The rate of increase of N(p), dN/dp, measures the scalability of a parallel program • Smaller rate of increase → more scalable
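To make the idea concrete, the sketch below assumes a simple textbook-style cost model, T_1 = N and T_p = N/p + 2*log2(p) (e.g. summing N numbers with a tree reduction); the model and the target efficiency of 0.8 are assumptions for illustration, not properties of any particular program. It searches numerically for the smallest N that keeps E at the target as p grows.

```python
import math

def efficiency(n, p):
    """Efficiency under an assumed cost model: T_1 = n, T_p = n/p + 2*log2(p)
    (a textbook-style model for summing n numbers with a tree reduction)."""
    t1 = n
    tp = n / p + 2 * math.log2(p)
    return t1 / (p * tp)

def isoefficiency_n(p, target_e=0.8):
    """Smallest n (found by doubling + bisection) keeping E >= target_e."""
    lo, hi = 1, 2
    while efficiency(hi, p) < target_e:
        hi *= 2
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if efficiency(mid, p) >= target_e:
            hi = mid
        else:
            lo = mid
    return hi

for p in (8, 64, 512, 4096):
    print(f"p = {p:5d}: N(p) for E = 0.8 is about {isoefficiency_n(p):,}")
# For this model N(p) grows roughly like p*log(p): the problem size must
# grow somewhat faster than the processor count to hold E constant.
```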
Parallel Program Design • PCAM model (I. Foster): Partitioning, Communication, Agglomeration, Mapping • The partitioning and communication stages address concurrency and scalability • The agglomeration and mapping stages address locality and other performance-related issues
Partitioning • Decompose the computation to be performed, and the data operated on by this computation, into small tasks • Purpose: expose opportunities for parallel execution • Ignore practical issues such as the number of processors in the target machine • Avoid replicating computation and data • Focus: define a large number of small tasks in order to yield a fine-grained decomposition of the problem • A fine-grained decomposition provides the greatest flexibility in terms of potential parallel algorithms → maximize concurrency
Partitioning • Good partition: divides both the computation associated with a problem and the data this computation operates on • Domain/data decomposition: focus first on the data • Partition the data associated with the problem • Associate computations with the partitioned data • Functional decomposition: focus first on the computation • Decompose the computations to be performed • Then deal with the data the decomposed computations work on
Domain Decomposition • Decompose the data first, then the associated computations • “owner computes” • Outcome: tasks comprising some data and a set of operations on that data • Some operations may require data from several tasks → communication • The data can be input data, output data, intermediate data, or all of them • Rule of thumb: focus first on the largest data structure, or the data structure accessed most frequently • Mesh-based problems: • Structured mesh: 1D, 2D, 3D decompositions (see the sketch below) • Unstructured mesh: graph partitioning tools such as METIS • Favor the most aggressive decomposition possible at this stage
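As a concrete (if minimal) example of a structured decomposition, the sketch below computes the contiguous block of a 1D mesh owned by each task under the "owner computes" rule; the function name and interface are illustrative, not taken from any particular library.

```python
def block_range(n, p, rank):
    """Block ('owner computes') decomposition of n mesh points over p tasks:
    task `rank` owns the half-open index range [lo, hi)."""
    base, rem = divmod(n, p)            # first `rem` tasks get one extra point
    lo = rank * base + min(rank, rem)
    hi = lo + base + (1 if rank < rem else 0)
    return lo, hi

# 10 points over 4 tasks -> block sizes 3, 3, 2, 2 (as balanced as possible).
n, p = 10, 4
for rank in range(p):
    print(f"task {rank}: owns points {block_range(n, p, rank)}")
```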
Functional Decomposition • Focus first on the computation to be performed • Divide the computation into disjoint tasks • Then consider the data associated with each sub-task • Data requirements may be disjoint → done • Data may overlap significantly → communication; may just as well try domain decomposition • Provides an alternative way of thinking about the problem; a hybrid decomposition may be best • E.g. multi-physics simulations: functional decomposition overall, domain decomposition within each component
Partitioning: Questions to Ask • Does your partition define more tasks (an order of magnitude more?) than the number of processors of the target machine? • No → reduced flexibility in subsequent stages • Does your partition avoid redundant computation and storage requirements? • No → may not be scalable to large problems • Are the tasks of comparable size? • No → hard to allocate equal amounts of work to the CPUs → load imbalance • Does the number of tasks scale with problem size? • Ideal: increased problem size → increased number of tasks • No → may not be able to solve larger problems with more processors • Have you identified alternative partitions? • Maximize flexibility; try both domain and functional decompositions
Communication • Purpose: determine the interaction among tasks • Distribute communication operations among many tasks • Organize communication operations in a way that permits concurrent execution • Four categories of communication: • Local/global communication: • Local: each task communicates with a small set of other tasks (its neighbors) • Global: communicate with many or all other tasks
Communication • Structured/unstructured communication: • Structured: a task and its neighbors form a regular structure, e.g. a grid or tree • Unstructured: the communication pattern is an arbitrary graph • Static/dynamic communication: • Static: the identity of communication partners does not change over time • Dynamic: the identity of partners is determined by data computed at runtime and can be highly variable • Synchronous/asynchronous communication: • Synchronous: requires coordination between communication partners • Asynchronous: proceeds without such coordination
Task Dependency Graph • Task dependency: one task cannot start until some other task(s) finish • E.g. the output of one task is the input to another task • Represented by the task dependency graph: • Directed acyclic graph (DAG) • Nodes: tasks (task size as the node weight) • Directed edges: dependencies among tasks
Task Dependency Graph • Degree of concurrency: the number of tasks that can run concurrently • Maximum degree of concurrency: the maximum number of tasks that can be executed simultaneously at any given time • Average degree of concurrency: the average number of tasks that can run concurrently over the duration of the program • Critical path: the longest vertex-weighted directed path between any pair of start and finish nodes • Critical path length: the sum of the vertex weights along the critical path • Average degree of concurrency = total amount of work / critical path length
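The sketch below computes these quantities for a small, made-up five-task DAG (the task names and weights are arbitrary illustration values):

```python
from functools import lru_cache

# A small, made-up task dependency DAG: node -> (weight, predecessors).
tasks = {
    "a": (2, []), "b": (3, []),           # two independent start tasks
    "c": (4, ["a", "b"]), "d": (1, ["a"]),
    "e": (2, ["c", "d"]),                 # finish task
}

@lru_cache(maxsize=None)
def finish_time(t):
    """Longest (vertex-weighted) path from any start node to task t, inclusive."""
    w, preds = tasks[t]
    return w + max((finish_time(p) for p in preds), default=0)

total_work = sum(w for w, _ in tasks.values())
critical_path = max(finish_time(t) for t in tasks)
print(f"total work                = {total_work}")
print(f"critical path length      = {critical_path}")
print(f"avg degree of concurrency = {total_work / critical_path:.2f}")
```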
Task Interaction Graph • Even independent tasks may need to interact, e.g. to share data • Interaction graph: captures the interaction patterns among tasks • Nodes: tasks • Edges: communications / interactions • Usually contains the task dependency graph as a sub-graph • (Figure: example interaction graph)
Communication: Questions to Ask • Do all tasks perform about the same number of communication operations? • Unbalanced communication → poor scalability • Distribute communications equitably • Does each task communicate only with a small number of neighbors? • If not, may need to re-formulate global communication in terms of local communication structures • Can communications proceed concurrently? • Can computations associated with different tasks proceed concurrently? • No → may need to re-order computations / communications
Agglomeration • Improve performance: combine tasks to reduce the task interaction strength, increase locality, and increase the computation and communication granularity; also determine whether it is worthwhile to replicate data/computation • Dependent tasks will be combined • Independent tasks may also be agglomerated to increase granularity • Goals: reduce communication cost, retain flexibility w.r.t. scalability and mapping decisions
Increasing Granularity • Coarse grain usually performs better: • Send less data (reduce the volume of communication) • Use fewer messages when sending the same amount of data (reduce the frequency of communication) • Surface-to-volume effects: • Communication cost is usually proportional to the surface area of a subdomain • Computation cost is usually proportional to its volume • As task size increases, the amount of communication per unit of computation decreases • Higher-dimensional decompositions are usually more efficient than lower-dimensional ones, due to the reduced surface area for a given volume (see the sketch below) • Replicated computation: • May trade replicated computation for reduced communication or execution time
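The surface-to-volume effect can be illustrated with a rough sketch comparing a 1D strip decomposition and a 2D block decomposition of an N x N grid under a nearest-neighbour stencil; it ignores tasks on the global boundary (which have fewer neighbours), and the grid size and processor count are arbitrary.

```python
import math

def comm_per_task(n, p, dims):
    """Boundary (halo) points exchanged per interior task for an n x n grid
    with a nearest-neighbour stencil, split into 1D strips or 2D blocks."""
    if dims == 1:
        return 2 * n                      # two strip edges of length n
    q = int(math.isqrt(p))                # assume p is a perfect square
    return 4 * (n // q)                   # four block edges of length n/sqrt(p)

n, p = 1024, 64
work = n * n // p                         # grid points updated per task
for dims in (1, 2):
    c = comm_per_task(n, p, dims)
    print(f"{dims}D decomposition: comm = {c:5d} points, "
          f"comm/computation = {c / work:.3f}")
# The 2D decomposition exchanges far fewer boundary points for the same
# amount of computation: the surface-to-volume effect described above.
```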
Agglomeration: Questions to Ask • Has agglomeration reduced communication costs by increasing locality? • If computation is replicated, have you verified that the benefits of replication outweigh its costs for a range of problem sizes and processor counts? • If data is replicated, have you verified that it does not compromise scalability? • Do the tasks have similar computation and communication costs after agglomeration? • Load balance • Does the number of tasks still scale with problem size?
Mapping • Map tasks to processors or processes • If the number of tasks is larger than the number of processors, more than one task may need to be placed on a single processor • Goal: minimize the total execution time • Place tasks that can execute concurrently on different processors • Place tasks that communicate frequently on the same processor • In the general case there is no computationally tractable algorithm for the mapping problem; it is NP-complete (a simple heuristic is sketched below) • For SPMD-style programs, typically one task per processor
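Since optimal mapping is intractable in general, heuristics are used in practice. The sketch below shows one simple possibility, a greedy "largest task to the least-loaded processor" rule; the task costs are hypothetical and the function is illustrative, not a standard library routine.

```python
import heapq

def greedy_map(task_costs, p):
    """Greedy heuristic: assign the largest remaining task to the processor
    with the smallest current load. Returns the assignment and final loads."""
    loads = [(0.0, proc) for proc in range(p)]     # min-heap of (load, proc)
    heapq.heapify(loads)
    assignment = {}
    for task, cost in sorted(task_costs.items(), key=lambda kv: -kv[1]):
        load, proc = heapq.heappop(loads)
        assignment[task] = proc
        heapq.heappush(loads, (load + cost, proc))
    return assignment, sorted(loads)

# Hypothetical task costs (e.g. estimated during the agglomeration stage).
costs = {"t0": 9, "t1": 7, "t2": 6, "t3": 5, "t4": 4, "t5": 3, "t6": 2}
assignment, loads = greedy_map(costs, 3)
print(assignment)   # which processor each task lands on
print(loads)        # resulting (load, processor) pairs
```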
Parallel Algorithm Models • Data-parallel model: processors perform similar operations on different data • Work/task pool model (replicated workers): • A pool of tasks and a number of processors • A processor can remove a task from the pool and work on it • A processor may generate a new task during computation and add it to the pool • Master-slave / manager-worker model: master processes generate work and allocate it to worker processes • Pipeline / producer-consumer model: a stream of data passes through a succession of processors, each performing some task on it • Hybrid model: a combination of two or more models
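A thread-based sketch of the work-pool model (the toy "computation" and the rule for generating new tasks are made up purely for illustration): replicated workers repeatedly pull tasks from a shared pool and may push newly generated tasks back into it.

```python
import queue
import threading

def worker(pool, results):
    """Replicated worker: repeatedly take a task from the pool, compute,
    and possibly add new tasks back into the pool."""
    while True:
        task = pool.get()
        if task is None:                 # sentinel: no more work
            pool.task_done()
            return
        results.append(task * task)      # the 'computation' for this sketch
        if task % 10 == 0:               # a task may generate new work
            pool.put(task + 1)
        pool.task_done()

pool, results = queue.Queue(), []
for t in range(1, 21):
    pool.put(t)

workers = [threading.Thread(target=worker, args=(pool, results)) for _ in range(4)]
for w in workers:
    w.start()
pool.join()                              # wait until all queued tasks are done
for _ in workers:
    pool.put(None)                       # tell each worker to stop
for w in workers:
    w.join()
print(len(results), "tasks processed")
```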