
Parallel Programming


Presentation Transcript


  1. Parallel Programming Sathish S. Vadhiyar Course Web Page: http://www.serc.iisc.ernet.in/~vss/courses/PPP2009

  2. Motivation for Parallel Programming • Faster execution time, by exploiting non-dependencies between regions of code • Provides a level of modularity • Resource constraints, e.g., large databases • Certain classes of algorithms lend themselves naturally to parallelism • Aggregate bandwidth to memory/disk; increased data throughput • Clock rate improvement in the past decade – 40% • Memory access time improvement in the past decade – 10% • Grand challenge problems (more later)

  3. Challenges / Problems in Parallel Algorithms • Building efficient algorithms. • Avoiding • Communication delay • Idling • Synchronization

  4. Challenges [Figure: execution timelines of processes P0 and P1, showing computation, communication, synchronization, and idle time]

  5. How do we evaluate a parallel program? • Execution time, Tp • Speedup, S • S(p, n) = T(1, n) / T(p, n) • Usually, S(p, n) < p • Sometimes S(p, n) > p (superlinear speedup) • Efficiency, E • E(p, n) = S(p, n)/p • Usually, E(p, n) < 1 • Sometimes greater than 1 • Scalability – how speedup and efficiency behave as n and p grow; the limitations of parallel computing
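
As a quick illustration of these metrics, here is a minimal sketch using the definitions above; the timings are made-up values, not measurements from the course.

```python
# Minimal sketch: computing speedup and efficiency from measured run times.
# The timings below are illustrative values, not data from the course.

def speedup(t_serial, t_parallel):
    """S(p, n) = T(1, n) / T(p, n)."""
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, p):
    """E(p, n) = S(p, n) / p."""
    return speedup(t_serial, t_parallel) / p

if __name__ == "__main__":
    t1 = 100.0                                   # hypothetical serial time (seconds)
    for p, tp in [(2, 55.0), (4, 30.0), (8, 18.0)]:
        print(f"p={p}: S={speedup(t1, tp):.2f}, E={efficiency(t1, tp, p):.2f}")
```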

  6. Speedups and efficiency [Figure: speedup S and efficiency E plotted against the number of processors p, comparing the ideal and practical curves]

  7. Limitations on speedup – Amdahl’s law • Amdahl's law states that the performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time the faster mode can be used. • Overall speedup is expressed in terms of the fractions of computation time with and without the enhancement and the speedup of the enhanced portion. • Places a limit on the speedup due to parallelism. • Speedup = 1 / (fs + fp/P), where fs is the serial fraction and fp = 1 – fs is the parallelizable fraction.

  8. Amdahl’s law Illustration S = 1 / (s + (1-s)/p) Courtesy: http://www.metz.supelec.fr/~dedu/docs/kohPaper/node2.html http://nereida.deioc.ull.es/html/openmp/pdp2002/sld008.htm
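
A small sketch of how this bound behaves, using the formula on the slide; the serial fractions and processor counts are illustrative choices.

```python
# Minimal sketch of Amdahl's law, S = 1 / (s + (1 - s)/p), where s is the
# serial fraction. The fractions and processor counts are illustrative.

def amdahl_speedup(s, p):
    return 1.0 / (s + (1.0 - s) / p)

if __name__ == "__main__":
    for s in (0.01, 0.05, 0.10):
        speedups = ", ".join(f"S({p})={amdahl_speedup(s, p):.1f}"
                             for p in (4, 16, 64, 1024))
        print(f"s={s:.2f}: {speedups}, limit as p->inf = {1.0 / s:.0f}")
```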

  9. Amdahl’s law analysis • For a fixed serial fraction, the achievable speedup falls further and further below the number of processors as P grows. • Thus Amdahl’s law is a bit depressing for parallel programming. • In practice, the parallelizable portion of the work has to be large enough to match a given number of processors.

  10. Gustafson’s Law • Amdahl’s law – keep the parallel work fixed • Gustafson’s law – keep the computation time on the parallel processors fixed, and change the problem size (the fraction of parallel/sequential work) to match that computation time • For a particular number of processors, find the problem size for which the parallel time equals the constant time • For that problem size, find the sequential time and the corresponding speedup • The resulting speedup is called scaled speedup
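
A minimal sketch of scaled speedup. The closed form S(p) = s + (1 - s)·p, with s the serial fraction measured on the parallel run, is the standard statement of Gustafson's law; it is an assumption here since the slide does not spell it out.

```python
# Minimal sketch of Gustafson's scaled speedup. A standard closed form is
# S(p) = s + (1 - s) * p, where s is the serial fraction measured on the
# parallel run; this formula is an assumption, not quoted from the slide.

def scaled_speedup(s, p):
    return s + (1.0 - s) * p

if __name__ == "__main__":
    for s in (0.01, 0.05, 0.10):
        print(f"s={s:.2f}: " + ", ".join(f"S({p})={scaled_speedup(s, p):.1f}"
                                         for p in (4, 16, 64, 1024)))
```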

  11. Metrics (contd.) [Table 5.1: Efficiency as a function of n and p – table not reproduced in the transcript]

  12. Scalability • Efficiency decreases with increasing P; increases with increasing N • How effectively the parallel algorithm can use an increasing number of processors • How the amount of computation performed must scale with P to keep E constant • This function of computation in terms of P is called the isoefficiency function • An algorithm with an isoefficiency function of O(P) is highly scalable, while an algorithm with a quadratic or exponential isoefficiency function is poorly scalable

  13. Scalability Analysis – Finite Difference algorithm with 1D decomposition For constant efficiency, a function of P, when substituted for N, must satisfy the relation on the slide (not reproduced here) for increasing P and constant E. It can be satisfied with N = P, except for small P. Hence the isoefficiency function is O(P^2), since the computation is O(N^2) (see the derivation sketch after slide 14 below).

  14. Scalability Analysis – Finite Difference algorithm with 2D decomposition The relation can be satisfied with N = sqrt(P). Hence the isoefficiency function is O(P), and the 2D algorithm is more scalable than the 1D one.
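
A rough sketch of where these isoefficiency functions come from, assuming the usual per-iteration cost model for an N x N grid (t_c = time per grid point, t_w = per-word communication time; constants and message-latency terms are dropped):

```latex
% Sketch only; cost constants and message-latency terms are omitted.

% 1D decomposition: each of the P processes holds N/P rows of the N x N grid
% and exchanges one row of N points with each of its two neighbours.
\[
T_1 \approx t_c N^2, \qquad
T_P \approx t_c \frac{N^2}{P} + 2\, t_w N, \qquad
E = \frac{T_1}{P\, T_P} = \frac{t_c N^2}{t_c N^2 + 2\, t_w N P}
\]
% Holding E constant requires N \propto P, so the work W = N^2 grows as O(P^2).

% 2D decomposition: each process holds an (N/\sqrt{P}) x (N/\sqrt{P}) block
% and exchanges N/\sqrt{P} points with each of its four neighbours.
\[
T_P \approx t_c \frac{N^2}{P} + 4\, t_w \frac{N}{\sqrt{P}}, \qquad
E = \frac{t_c N^2}{t_c N^2 + 4\, t_w N \sqrt{P}}
\]
% Holding E constant requires N \propto \sqrt{P}, so W = N^2 grows only as O(P).
```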

  15. Parallel Algorithm Design

  16. Steps • Decomposition – splitting the problem into tasks or modules • Mapping – assigning tasks to processors • Mapping’s conflicting objectives • To minimize idle times • To reduce communication

  17. Mapping • Static mapping • Mapping based on data partitioning • Applicable to dense matrix computations • Block distribution • Block-cyclic distribution • Graph-partitioning-based mapping • Applicable to sparse matrix computations • Mapping based on task partitioning [Figure: 9 elements on 3 processes – block distribution: 0 0 0 1 1 1 2 2 2; block-cyclic distribution: 0 1 2 0 1 2 0 1 2] (a small mapping sketch follows below)
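
A minimal sketch of the two data-partitioning schemes; the helper names are made up, and the 9-element / 3-process example mirrors the slide's illustration.

```python
# Minimal sketch: which process owns element i under a block vs. a
# block-cyclic distribution of n elements over p processes.

def block_owner(i, n, p):
    """Contiguous chunks of ceil(n/p) elements per process."""
    chunk = -(-n // p)                 # ceiling division
    return i // chunk

def block_cyclic_owner(i, p, block_size=1):
    """Blocks of block_size elements dealt out round-robin over the processes."""
    return (i // block_size) % p

if __name__ == "__main__":
    n, p = 9, 3
    print([block_owner(i, n, p) for i in range(n)])       # [0, 0, 0, 1, 1, 1, 2, 2, 2]
    print([block_cyclic_owner(i, p) for i in range(n)])   # [0, 1, 2, 0, 1, 2, 0, 1, 2]
```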

  18. Based on Task Partitioning • Based on the task dependency graph • In general the problem is NP-complete [Figure: a binary-tree task dependency graph mapped onto processes 0–7]

  19. Mapping • Dynamic Mapping • A process/global memory can hold a set of tasks • Distribute some tasks to all processes • Once a process completes its tasks, it asks the coordinator process for more tasks • Referred to as self-scheduling, work-stealing
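
A minimal self-scheduling work-pool sketch, as an illustration only; the slides do not prescribe any particular library, and Python's multiprocessing module is used here for convenience.

```python
# Workers repeatedly pull the next available task from a shared queue
# (self-scheduling); a sentinel value tells each worker when to stop.
from multiprocessing import Process, Queue

def worker(task_q, result_q):
    """Pull tasks until a None sentinel arrives."""
    while True:
        task = task_q.get()
        if task is None:                      # sentinel: no more work
            return
        result_q.put((task, task * task))     # stand-in for the real computation

if __name__ == "__main__":
    num_workers, num_tasks = 4, 20
    task_q, result_q = Queue(), Queue()
    for t in range(num_tasks):                # distribute some tasks ...
        task_q.put(t)
    for _ in range(num_workers):              # ... plus one sentinel per worker
        task_q.put(None)
    workers = [Process(target=worker, args=(task_q, result_q))
               for _ in range(num_workers)]
    for w in workers:
        w.start()
    results = sorted(result_q.get() for _ in range(num_tasks))
    for w in workers:
        w.join()
    print(results)
```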

  20. Interaction Overheads • In spite of the best efforts in mapping, there can be interaction overheads • Due to frequent communications, exchanging large volumes of data, interacting with the farthest processors, etc. • Some techniques can be used to minimize interactions

  21. Parallel Algorithm Design - Containing Interaction Overheads • Maximizing data locality • Minimizing the volume of data exchanged • Using higher-dimensional mapping • Not communicating intermediate results • Minimizing the frequency of interactions • Minimizing contention and hot spots • Avoid having all processes use the same communication pattern with the other processes at the same time

  22. Parallel Algorithm Design - Containing Interaction Overheads • Overlapping computations with interactions • Split computations into phases: those that depend on communicated data (type 1) and those that do not (type 2) • Initiate communication for type 1; During communication, perform type 2 • Overlapping interactions with interactions • Replicating data or computations • Balancing the extra computation or storage cost with the gain due to less communication
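
A minimal sketch of the type 1 / type 2 overlap using non-blocking MPI calls. It assumes mpi4py and NumPy are installed; the file name, array sizes, and ring exchange pattern are illustrative choices, not from the slides.

```python
# Overlap: start the communication needed later, compute independent work
# while it is in flight, then wait and finish the dependent work.
# Run with, e.g., `mpirun -np 4 python overlap.py` (assumed setup).
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
right, left = (rank + 1) % size, (rank - 1) % size   # ring of processes

local = np.full(1_000_000, float(rank))   # "type 2" data: independent of communication
halo = np.empty(1000)                     # "type 1" data: arrives from the left neighbour

# Initiate the communication needed for the type 1 work ...
reqs = [comm.Isend(local[:1000], dest=right),
        comm.Irecv(halo, source=left)]

# ... and overlap it with the type 2 computation, which needs no remote data.
independent = float(np.sum(local * local))

# Wait for the halo to arrive, then do the communication-dependent part.
MPI.Request.Waitall(reqs)
dependent = float(np.sum(halo))

print(f"rank {rank}: independent={independent:.0f}, dependent={dependent:.0f}")
```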

  23. Parallel Algorithm Classification – Types - Models

  24. Parallel Algorithm Types • Divide and conquer • Data partitioning / decomposition • Pipelining

  25. Divide-and-Conquer • Recursive in structure • Divide the problem into sub-problems that are similar to the original but smaller in size • Conquer the sub-problems by solving them recursively; if small enough, solve them in a straightforward manner • Combine the solutions to create a solution to the original problem

  26. Divide-and-Conquer Example: Merge Sort • Problem: sort a sequence of n elements • Divide the sequence into two subsequences of n/2 elements each • Conquer: sort the two subsequences recursively using merge sort • Combine: merge the two sorted subsequences to produce the sorted answer
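
A minimal sequential merge sort following the divide/conquer/combine steps above; parallelising it (e.g. sorting the two halves on different processes) is left as an exercise.

```python
# Merge sort: divide the sequence, sort the halves recursively, merge them.

def merge_sort(seq):
    if len(seq) <= 1:                      # small enough: solve directly
        return list(seq)
    mid = len(seq) // 2                    # divide into two subsequences
    left = merge_sort(seq[:mid])           # conquer: sort each half recursively
    right = merge_sort(seq[mid:])
    return merge(left, right)              # combine: merge the sorted halves

def merge(a, b):
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            out.append(a[i]); i += 1
        else:
            out.append(b[j]); j += 1
    out.extend(a[i:]); out.extend(b[j:])
    return out

if __name__ == "__main__":
    print(merge_sort([5, 2, 9, 1, 7, 3]))   # [1, 2, 3, 5, 7, 9]
```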

  27. Partitioning • Breaking up the given problem into p independent subproblems of almost equal sizes • Solving the p subproblems concurrently • Mostly splitting the input or output into non-overlapping pieces • Example: Matrix multiplication • Either the inputs (A or B) or output (C) can be partitioned.
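
A minimal sketch of output partitioning for C = A·B: each worker owns an independent block of rows of C. Plain Python lists keep it dependency-free; the matrices, helper name, and partition count are made up for illustration.

```python
# Partition the rows of C among p workers; each block can be computed
# independently because the row ranges do not overlap.

def matmul_rows(A, B, row_range):
    """Compute the rows of C = A*B whose indices lie in row_range."""
    n, k = len(B[0]), len(B)
    return [[sum(A[i][x] * B[x][j] for x in range(k)) for j in range(n)]
            for i in row_range]

if __name__ == "__main__":
    A = [[1, 2], [3, 4], [5, 6], [7, 8]]
    B = [[1, 0], [0, 1]]                      # identity, so C should equal A
    p = 2
    rows_per = len(A) // p
    # Each partition below could be handed to a different process.
    parts = [matmul_rows(A, B, range(w * rows_per, (w + 1) * rows_per))
             for w in range(p)]
    C = [row for part in parts for row in part]
    print(C)                                  # [[1, 2], [3, 4], [5, 6], [7, 8]]
```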

  28. Pipelining Occurs in image processing applications where a number of images undergo a sequence of transformations.

  29. Parallel Algorithm Models • Data parallel model • Processes perform identical tasks on different data • Task parallel model • Different processes perform different tasks on same or different data – based on task dependency graph • Work pool model • Any task can be performed by any process. Tasks are added to a work pool dynamically • Pipeline model • A stream of data passes through a chain of processes – stream parallelism

  30. Parallel Program Classification - Models - Structure - Paradigms

  31. Parallel Program Models • Single Program Multiple Data (SPMD) • Multiple Program Multiple Data (MPMD) Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/

  32. Parallel Program Structure Types • Master-worker / parameter sweep / task farming • Embarrassingly/pleasingly parallel • Pipeline / systolic / wavefront • Tightly coupled • Workflow [Figures: process arrangements P0–P4 illustrating these structures]

  33. Programming Paradigms • Shared memory model – Threads, OpenMP • Message passing model – MPI • Data parallel model – HPF Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/

  34. Parallel Architectures Classification - Classification - Cache coherence in shared memory platforms - Interconnection networks

  35. Classification of Architectures – Flynn’s classification • Single Instruction Single Data (SISD): Serial Computers • Single Instruction Multiple Data (SIMD) - Vector processors and processor arrays - Examples: CM-2, Cray-90, Cray YMP, Hitachi 3600 Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/

  36. Classification of Architectures – Flynn’s classification • Multiple Instruction Single Data (MISD): Not popular • Multiple Instruction Multiple Data (MIMD) - Most popular - IBM SP and most other supercomputers, clusters, computational Grids etc. Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/

  37. Classification of Architectures – Based on Memory • Shared memory • 2 types – UMA and NUMA • NUMA examples: HP Exemplar, SGI Origin, Sequent NUMA-Q [Figures: UMA and NUMA shared-memory organizations] Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/

  38. Classification of Architectures – Based on Memory • Distributed memory [Figure: distributed-memory organization] • Recently, multi-cores • Yet another classification – MPPs, NOW (Berkeley), COW, Computational Grids Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/

  39. Cache Coherence - for details, read 2.4.6 of book Interconnection networks - for details, read 2.4.2-2.4.5 of book

  40. Cache Coherence in SMPs • All processors read variable ‘x’ residing in cache line ‘a’ • Each processor updates ‘x’ at different points of time • Challenge: to maintain a consistent view of the data • Protocols: • Write update • Write invalidate [Figure: CPU0–CPU3, each holding a copy of cache line ‘a’ in its cache (cache0–cache3), backed by line ‘a’ in main memory]

  41. Cache Coherence Protocols and Implementations • Write update – propagate the cache line to the other processors on every write by a processor • Write invalidate – each processor gets the updated cache line whenever it reads stale data • Which is better??

  42. Caches – False sharing • Different processors update different parts of the same cache line • Leads to ping-pong of cache lines between processors • The situation is better with update protocols than with invalidate protocols. Why? • Modify the algorithm to change the stride [Figure: CPU0 updates A0, A2, A4, … and CPU1 updates A1, A3, A5, …, while main memory holds the elements in cache lines A0–A8 and A9–A15]

  43. Cache Coherence using invalidate protocols • 3 states associated with data items • Shared – a variable shared by 2 caches • Invalid – another processor (say P0) has updated the data item • Dirty – state of the data item in P0 • Implementations • Snoopy • For bus-based architectures • Memory operations are propagated over the bus and snooped • Instead of broadcasting memory operations to all processors, propagate coherence operations only to the relevant processors • Directory-based • A central directory maintains the states of cache blocks and the associated processors • Implemented with presence bits

  44. Interconnection Networks • An interconnection network is defined by switches, links and interfaces • Switches – provide mapping between input and output ports, buffering, routing, etc. • Interfaces – connect nodes to the network • Network topologies • Static – point-to-point communication links among processing nodes • Dynamic – communication links are formed dynamically by switches

  45. Interconnection Networks • Static • Bus – SGI Challenge • Completely connected • Star • Linear array, Ring (1-D torus) • Mesh – Intel ASCI Red (2-D), Cray T3E (3-D), 2-D torus • k-d mesh: d dimensions with k nodes in each dimension • Hypercubes – a k-d mesh with k = 2 and d = log p – e.g. many MIMD machines • Trees – our campus network • Dynamic – communication links are formed dynamically by switches • Crossbar – Cray X series – non-blocking network • Multistage – SP2 – blocking network • For more details, and an evaluation of topologies, refer to the book

  46. Evaluating Interconnection topologies • Diameter – maximum distance between any two processing nodes: fully connected – 1, star – 2, ring – p/2, hypercube – log p • Connectivity – multiplicity of paths between 2 nodes; the minimum number of arcs that must be removed from the network to break it into two disconnected networks: linear array – 1, ring – 2, 2-D mesh – 2, 2-D mesh with wraparound – 4, d-dimensional hypercube – d

  47. Evaluating Interconnection topologies • Bisection width – minimum number of links to be removed from the network to partition it into 2 equal halves: ring – 2, P-node 2-D mesh – sqrt(P), tree – 1, star – 1, completely connected – P^2/4, hypercube – P/2
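
A small sketch tabulating the formulas listed on these two slides as functions of the node count p; the function names and the sample value p = 64 are made up for illustration.

```python
# Topology metrics from the slides expressed as simple formulas of p.
import math

DIAMETER = {
    "fully connected": lambda p: 1,
    "star":            lambda p: 2,
    "ring":            lambda p: p // 2,
    "hypercube":       lambda p: int(math.log2(p)),
}

BISECTION_WIDTH = {
    "ring":            lambda p: 2,
    "2-D mesh":        lambda p: int(math.sqrt(p)),
    "tree":            lambda p: 1,
    "star":            lambda p: 1,
    "fully connected": lambda p: p * p // 4,
    "hypercube":       lambda p: p // 2,
}

if __name__ == "__main__":
    p = 64
    for name, f in DIAMETER.items():
        print(f"diameter({name}, p={p}) = {f(p)}")
    for name, f in BISECTION_WIDTH.items():
        print(f"bisection_width({name}, p={p}) = {f(p)}")
```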

  48. Evaluating Interconnection topologies • channel width – number of bits that can be simultaneously communicated over a link, i.e. number of physical wires between 2 nodes • channel rate – performance of a single physical wire • channel bandwidth – channel rate times channel width • bisection bandwidth – maximum volume of communication between two halves of network, i.e. bisection width times channel bandwidth

  49. END
