
Virtues of Good (Parallel) Software


amora


Presentation Transcript


  1. Virtues of Good (Parallel) Software • Concurrency • Able to exploit concurrencies in algorithm/problem/hardware • Scalability • Resilient to increasing processor count • Locality • More frequent access to local data than to remote data • Modularity • Employ abstraction and modular design

  2. Two Basic Requirements for a Parallel Program • Safety: produce correct results • Result computed on P processors and on 1 processor must be IDENTICAL. • Liveness: able to proceed and finish; free of deadlock.

  3. Sources of Overhead • Execution time • The time that elapses from when the first processor starts executing on the problem to when the last processor completes execution • Execution time = computation time + communication time + idle time • Communication / interprocess interaction: usually the main source of overhead • T_comm = t_s + t_w*L (t_s: startup time per message, t_w: transfer time per word, L: message length) • Minimize the volume and frequency of communications; overlap computation and communication • Idling: lack of computation or lack of data • Load imbalance • Synchronization • Presence of serial components • Wait on remote data • Replicated computation • Trade-off: communicate or replicate
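
The cost model T_comm = t_s + t_w*L explains why reducing the frequency of messages matters as much as reducing their volume. The sketch below assumes the usual reading of the symbols (t_s as per-message startup time, t_w as per-word transfer time, L as message length in words); the numeric values of `T_S` and `T_W` are hypothetical and not taken from the slides.

```python
def comm_time(message_length, t_s, t_w):
    """Time to send one message of `message_length` words: t_s + t_w * L."""
    return t_s + t_w * message_length


def total_comm_time(total_words, num_messages, t_s, t_w):
    """Time to send `total_words` split evenly across `num_messages` messages."""
    words_per_message = total_words / num_messages
    return num_messages * comm_time(words_per_message, t_s, t_w)


# Hypothetical machine parameters, chosen only for illustration.
T_S = 1e-5  # startup time per message, in seconds
T_W = 1e-8  # transfer time per word, in seconds

one_big = total_comm_time(10_000, 1, T_S, T_W)
many_small = total_comm_time(10_000, 1_000, T_S, T_W)

print(f"1 message of 10000 words: {one_big:.2e} s")
print(f"1000 messages of 10 words: {many_small:.2e} s")
```

The same volume of data costs far more when sent as many small messages, because the startup term t_s is paid once per message.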

  4. Speedup & Efficiency • Relative speed-up: the factor by which the execution time is reduced on multiple processors • S(p) = T_1/T_p • T_1 is the execution time on one processor • T_p is the execution time on p processors • Absolute speed-up: where T_1 is the uniprocessor time for the best-known (sequential) algorithm • Ideally S(p) <= p • Embarrassingly parallel (EP): no communication among CPUs. • Superlinear speedup (S(p) > p): does occur in practice • Efficiency: the fraction of time that processors spend doing useful work. • E = S/p = T_1/(p*T_p) • Parallel cost: p*T_p • Parallel overhead: T_o = p*T_p - T_1
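
The definitions on this slide translate directly into a few one-line functions. The timings used in the example (100 s on one processor, 16 s on 8 processors) are made-up numbers for illustration only.

```python
def speedup(t1, tp):
    """S(p) = T_1 / T_p."""
    return t1 / tp


def efficiency(t1, tp, p):
    """E = S / p = T_1 / (p * T_p)."""
    return t1 / (p * tp)


def parallel_cost(tp, p):
    """Parallel cost = p * T_p."""
    return p * tp


def parallel_overhead(t1, tp, p):
    """T_o = p * T_p - T_1."""
    return p * tp - t1


# Hypothetical measurements: 100 s serial, 16 s on 8 processors.
t1, tp, p = 100.0, 16.0, 8
print("speedup   :", speedup(t1, tp))               # 6.25
print("efficiency:", efficiency(t1, tp, p))         # 0.78125
print("cost      :", parallel_cost(tp, p))          # 128.0
print("overhead  :", parallel_overhead(t1, tp, p))  # 28.0
```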

  5. Amdahl’s Law • Alpha: fraction of operations in the serial code that can be parallelized • P: number of processors • This is for a fixed problem size • T_p = alpha*T_1/p + (1-alpha)*T_1 • S → 1/(1-alpha) as P → infinity • Alpha = 90%, S → 10 • Alpha = 99%, S → 100 • Alpha = 99.9%, S → 1000 • “Mental block”
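
Dividing T_1 by the expression for T_p above gives S(p) = 1 / ((1-alpha) + alpha/p). A minimal sketch that evaluates this formula and its limit; the function names are illustrative.

```python
def amdahl_speedup(alpha, p):
    """Speedup on p processors when a fraction `alpha` of the work is parallelizable."""
    return 1.0 / ((1.0 - alpha) + alpha / p)


def amdahl_limit(alpha):
    """Upper bound on speedup as p grows without bound: 1 / (1 - alpha)."""
    return 1.0 / (1.0 - alpha)


for alpha in (0.90, 0.99, 0.999):
    row = ", ".join(f"p={p}: {amdahl_speedup(alpha, p):.1f}" for p in (10, 100, 1000))
    print(f"alpha={alpha}: {row}, limit: {amdahl_limit(alpha):.0f}")
```

Even with 1000 processors, a 10% serial fraction keeps the speedup below 10.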

  6. Gustafson’s Law Alpha – fraction of time spent on parallel operations in the parallel program • This is for a scaled problem size; or constant run time. • T_1 = (1-alpha)*T_p + p*alpha*T_p • As problem size increases, fraction of parallel operations increases
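
Dividing the expression for T_1 by T_p gives the scaled speedup S(p) = (1-alpha) + p*alpha. The sketch below evaluates that formula; note that alpha here is measured in the parallel run, unlike the alpha in Amdahl's law.

```python
def gustafson_speedup(alpha, p):
    """Scaled speedup when a fraction `alpha` of the parallel run time is parallel work."""
    return (1.0 - alpha) + p * alpha


for p in (10, 100, 1000):
    print(f"alpha=0.9, p={p}: scaled speedup = {gustafson_speedup(0.9, p):.1f}")
```

Under this model the speedup grows linearly with p instead of saturating, because the problem size grows with the processor count.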

  7. Iso-Efficiency Function • For fixed problem size N, as P increases, the increase in speedup S slows down or levels off, and efficiency E decreases • For fixed P, as N increases, S increases and efficiency E increases • As P increases, the problem size N can be increased such that the efficiency is kept constant • This N(p) for fixed efficiency is called the iso-efficiency function • The rate of increase in N(p), dN/dp, measures the scalability of a parallel program • Smaller rate of increase → more scalable
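
To make the iso-efficiency function concrete, the sketch below uses a textbook-style cost model that is an assumption and not part of the slides: summing N numbers on p processors with T_1 = N and T_p = N/p + 2*log2(p). For that model, E = N / (N + 2*p*log2(p)), so holding E fixed requires N(p) = E/(1-E) * 2*p*log2(p).

```python
import math


def model_efficiency(n, p):
    """Efficiency under the assumed model T_1 = n, T_p = n/p + 2*log2(p)."""
    t1 = n
    tp = n / p + 2 * math.log2(p)
    return t1 / (p * tp)


def iso_problem_size(p, target_efficiency):
    """Problem size N(p) that keeps efficiency at `target_efficiency` (assumed model, p >= 2)."""
    k = target_efficiency / (1.0 - target_efficiency)
    return k * 2 * p * math.log2(p)


for p in (4, 8, 16, 64):
    n = iso_problem_size(p, 0.8)
    print(f"p={p:3d}: N(p) = {n:8.1f}, efficiency = {model_efficiency(n, p):.3f}")
```

In this model N(p) grows like p*log2(p), only slightly faster than linearly, which indicates a fairly scalable algorithm.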

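  8. Parallel Program Design • PCAM model (I. Foster): Partitioning, Communication, Agglomeration, Mapping • Partitioning and communication: concurrency, scalability • Agglomeration and mapping: locality, performance-related issues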

  9. Partitioning • Decompose the computation to be performed and the data operated on by this computation into small tasks • Purpose: expose opportunities for parallel execution • Ignore practical issues such as the number of processors in the target machine, etc. • Avoid replicating computation and data • Focus: define a large number of small tasks in order to yield a fine-grained decomposition of the problem • Fine-grained decomposition provides the greatest flexibility in terms of potential parallel algorithms • Maximize concurrency

  10. Partitioning • Good partition: divides both the computation associated with a problem and the data this computation operates on • Domain/data decomposition: first focus on data • Partition the data associated with the problem • Associate computations with the partitioned data • Functional decomposition: first focus on computation • Decompose the computations to be performed • Then deal with the data that the decomposed computations work on

  11. Domain Decomposition • Decompose the data first, and then the associated computations • “Owner computes” • Outcome: tasks comprising some data and a set of operations on that data • Some operations may require data from several tasks → communication • Data can be input data, output data, intermediate data, or all of them. • Rule of thumb: focus first on the largest data structure or the data structure accessed most frequently • Mesh-based problems: • Structured mesh: 1D, 2D, 3D decompositions • Unstructured mesh: graph partitioning tools such as METIS • Favor the most aggressive decomposition possible at this stage
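
For a structured 1D mesh, the simplest domain decomposition assigns each task a contiguous block of indices. The helper below is one common way to do that and is an illustrative sketch; the name `block_range` and the choice to give the leftover elements to the lowest-numbered tasks are assumptions. Under "owner computes", each task then updates only the indices in its own range.

```python
def block_range(n, num_tasks, rank):
    """Half-open index range [start, stop) owned by task `rank` out of `num_tasks`.

    The n indices are split into contiguous blocks whose sizes differ by at most one;
    the first (n % num_tasks) tasks receive one extra index.
    """
    base, extra = divmod(n, num_tasks)
    start = rank * base + min(rank, extra)
    size = base + (1 if rank < extra else 0)
    return start, start + size


# Example: 10 mesh points split across 3 tasks.
ranges = [block_range(10, 3, rank) for rank in range(3)]
print(ranges)  # [(0, 4), (4, 7), (7, 10)]
```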

  12. Functional Decomposition • Focus first on the computation to be performed • Divide computations into disjoint tasks • Then consider the data associated with each sub-task • Data requirements may be disjoint → done • Data may overlap significantly → communication; may just as well try domain decomposition • Provides an alternative way of thinking about the problem; a hybrid decomposition may be best • E.g. multi-physics simulations: functional decomposition overall, domain decomposition within each component

  13. Partitioning: Questions to Ask • Does your partition define more tasks (an order of magnitude more?) than the number of processors of the target machine? • No → reduced flexibility in subsequent stages • Does your partition avoid redundant computation and storage requirements? • No → may not be scalable to large problems • Are tasks of comparable size? • No → hard to allocate equal amounts of work to CPUs → load imbalance • Does the number of tasks scale with problem size? • Ideal: increased problem size → increase in number of tasks • No → may not be able to solve larger problems with more processors • Have you identified alternative partitions? • Maximize flexibility; try both domain and functional decompositions

  14. Communication • Purpose: Determine the interaction among tasks • Distribute communication operations among many tasks • Organize communication operations in a way that permits concurrent execution • 4 categories of communications: • Local/global communications: • Local: each task communicates with a small set of other tasks (neighbors) • Global: communicate with many or all other tasks

  15. Communication • Structured/un-structured communication • Structured: A task and neighbors form a regular structure, grid or tree • Un-structured: communication represented by arbitrary graphs • Static/dynamic communication: • Static: identity of communication partners does not change over time • Dynamic: identity of partners determined by data computed at runtime and highly variable • Synchronous/asynchronous communication • Synchronous: requires coordination between communication partners • Asynchronous: without cooperation

  16. Task Dependency Graph • Task dependencies: one task cannot start until some other task(s) finishes. • E.g. the output of one task is the input to another task • Represented by the task dependency graph: • Directed acyclic • Nodes: tasks (task size as the weight of node) • Directed edges: dependencies among tasks

  17. Task Dependency Graph • Degree of concurrency: number of tasks that can run concurrently • Maximum degree of concurrency: the maximum number of tasks that can be executed simultaneously at any given time • Average degree of concurrency: the average number of tasks that can run concurrently over the duration of program • Critical path: The longest vertex-weighted directed path between any pair of start and finish nodes • Critical path length: sum of vertex weights along the critical path • Average degree of concurrency = total amount of work / critical path length

  18. Task Interaction Graph • Even independent tasks may need to interact, e.g. sharing data • Interaction graph: captures interaction patterns among tasks • Nodes: tasks • Edges: communications / interactions • Usually contains the task dependency graph as a sub-graph • (Figure: example interaction graph)

  19. Communication: Questions to Ask • Do all tasks perform the same number of communication operations? • Unbalanced communication → poor scalability • Distribute communications equitably • Does each task communicate only with a small number of neighbors? • May need to re-formulate global communication in terms of local communication structures • Can communications proceed concurrently? • Can computations associated with different tasks proceed concurrently? • No → may need to re-order computations / communications

  20. Agglomeration • Improve performance: Combine tasks to reduce the task interaction strength, increase locality, increase the computation and communication granularity. Also determine if it is worthwhile to replicate data/computation • Dependent tasks will be combined • Independent tasks may also be agglomerated to increase granularity • Goals: reduce communication cost, retain flexibility w.r.t. scalability and mapping decisions

  21. Increasing Granularity • Coarse-grain usually performs better: • Send less data (reduce volume of communication) • Use fewer messages when sending same amount of data (reducing frequency of communications) • Surface-to-volume effects: • Communication cost usually proportional to surface area of domain • Computation cost usually proportional to volume of domain • As task size increases, amount of communication per unit computation decreases • High-D decomposition usually more efficient than low-D decompositions, due to reduced surface area for a given volume. • Replicate computation: • May trade off replicated computation for reduced communication or execution time.
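
The surface-to-volume argument can be checked with a small calculation. The sketch below assumes an N x N grid with a nearest-neighbor (halo) exchange one cell deep, counts only interior tasks, and requires a perfect-square processor count for the 2D case; these are all illustrative assumptions.

```python
import math


def strip_ratio(n, p):
    """Communication per unit computation for a 1D (strip) decomposition of an n x n grid."""
    comm = 2 * n       # two boundary rows of n points each
    comp = n * n / p   # points owned by one task
    return comm / comp


def block_ratio(n, p):
    """Communication per unit computation for a 2D (block) decomposition; p must be a perfect square."""
    side = n / math.isqrt(p)
    comm = 4 * side    # four edges of the square block
    comp = side * side
    return comm / comp


n, p = 1024, 64
print("1D strips:", strip_ratio(n, p))  # 0.125
print("2D blocks:", block_ratio(n, p))  # 0.03125
```

For the same number of tasks, the 2D decomposition exchanges a quarter of the data per unit of computation in this example, and the gap widens as p grows (2p/N versus 4*sqrt(p)/N).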

  22. Agglomeration: Questions to Ask • Has agglomeration reduced communication costs by increasing locality? • If computation is replicated, have you verified that the benefits of replication outweigh its costs for a range of problem sizes and processor counts? • If data is replicated, have you verified that it does not compromise scalability? • Do the tasks have similar computation and communication costs after agglomeration? • Load balance • Does the number of tasks still scale with problem size?

  23. Mapping • Map tasks to processors or processes. • If the number of tasks is larger than the number of processors, more than one task may need to be placed on a single processor • Goal: minimize total execution time • Place tasks that execute concurrently on different processors • Place tasks that communicate frequently on the same processor • In the general case the mapping problem is NP-complete; no computationally tractable algorithm is known • If SPMD-style, one task per processor
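
Because the general mapping problem is intractable, heuristics are typically used in practice. The sketch below shows one simple heuristic, a greedy "largest task first" assignment to the currently least-loaded processor. It is not a method taken from the slides, and it balances computation only; it ignores communication between tasks. The task names and costs are hypothetical.

```python
import heapq


def greedy_map(task_costs, num_procs):
    """Assign each task to the least-loaded processor, taking the largest tasks first.

    Returns (assignment, loads): task -> processor index, and the total cost per processor.
    """
    heap = [(0, proc) for proc in range(num_procs)]  # (current load, processor)
    assignment = {}
    for task, cost in sorted(task_costs.items(), key=lambda kv: kv[1], reverse=True):
        load, proc = heapq.heappop(heap)
        assignment[task] = proc
        heapq.heappush(heap, (load + cost, proc))
    loads = [0] * num_procs
    for task, proc in assignment.items():
        loads[proc] += task_costs[task]
    return assignment, loads


# Hypothetical task costs mapped onto 2 processors.
assignment, loads = greedy_map({"a": 7, "b": 5, "c": 4, "d": 3, "e": 1}, 2)
print(assignment)
print("loads:", loads)  # loads: [10, 10]
```

A fuller mapping strategy would also weigh the second placement rule above, keeping tasks that communicate frequently on the same processor.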

  24. Parallel Algorithm Models • Data parallel model: processors perform similar operations on different data • Work/task pool model (replicated workers): • Pool of tasks, a number of processors • A processor can remove a task from the pool and work on it • A processor may generate a new task during computation and add it to the pool • Master-slave/manager-worker model: master processors generate work and allocate it to worker processors • Pipeline/producer-consumer model: a stream of data passes through a succession of processors, each of which performs some task on it. • Hybrid model: combination of two or more models
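
The work/task pool model can be sketched with a shared queue and a fixed set of workers. The example below uses Python threads to show the pattern only (in CPython, threads do not run CPU-bound work in parallel); `run_task_pool` and `sum_range` are hypothetical names. Each task either produces a result or splits itself into new tasks that go back into the pool.

```python
import queue
import threading


def run_task_pool(initial_tasks, process, num_workers=4):
    """Run tasks from a shared pool; `process(task)` returns (result, new_tasks)."""
    pool = queue.Queue()
    results = []
    lock = threading.Lock()
    for task in initial_tasks:
        pool.put(task)

    def worker():
        while True:
            task = pool.get()
            if task is None:          # sentinel: no more work
                pool.task_done()
                break
            try:
                result, new_tasks = process(task)
                with lock:
                    results.append(result)
                for new_task in new_tasks:
                    pool.put(new_task)  # generated work goes back into the pool
            finally:
                pool.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    pool.join()                       # wait until every task, including generated ones, is done
    for _ in threads:
        pool.put(None)
    for t in threads:
        t.join()
    return results


def sum_range(task):
    """Sum a small range directly, or split a large range into two new tasks."""
    lo, hi = task
    if hi - lo <= 4:
        return sum(range(lo, hi)), []
    mid = (lo + hi) // 2
    return 0, [(lo, mid), (mid, hi)]


total = sum(run_task_pool([(0, 100)], sum_range))
print(total)  # 4950
```

New tasks are added to the queue before the parent task is marked done, so `pool.join()` cannot return while generated work is still pending.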
