This presentation introduces the concept of pattern programming in parallel computing, aiming to simplify and structure parallel programming for educational environments. By utilizing design patterns and hierarchical designs, the program abstracts the computing environment, reducing source code size and automating the conversion into parallel programs. The presentation covers both high-level and low-level parallel design patterns and explains common patterns used in distributed computing, such as master-slave, workpool, pipeline, divide and conquer, stencil, and map-reduce. Additionally, it delves into low-level MPI message-passing patterns like point-to-point, broadcast, scatter, gather, and reduce. The advantages and disadvantages of pattern programming are discussed, along with examples of collective and all-to-all communication patterns. The presentation concludes by exploring specialized high-level patterns like pipeline, divide and conquer, all-to-all, and stencil, providing insights into their applications in parallel computing.
Pattern Parallel Programming
B. Wilkinson
PatternProgIntro.ppt, modification date: Feb 21, 2016
Traditional programming approach
• Explicitly specify message-passing (MPI)
• Low-level threads APIs (Pthreads, Java threads, OpenMP, …)
Both require programmers to use low-level routines. We need to make parallel programming easier, more structured, and more scalable, especially in an educational environment.
Pattern Programming Concept
The programmer begins by constructing the program using established computational or algorithmic “patterns” that provide a structure.
Design patterns have been part of software engineering for many years:
• Reusable solutions to commonly occurring problems *
• Patterns provide a guide to “best practices”, not a final implementation
• Provide a good, scalable design structure
• Avoid common problems with ad-hoc designs
• Make it easier to reason about programs and to debug them
* http://en.wikipedia.org/wiki/Design_pattern_(computer_science)
Parallel Patterns
Advantages
• Abstracts/hides the underlying computing environment
• Generally avoids deadlocks and race conditions
• Reduces source code size (lines of code)
• Leads to automated conversion into parallel programs without the need to write low-level MPI message-passing routines
• Hierarchical designs with patterns embedded into patterns, and pattern operators to combine patterns
Disadvantages
• New approach to learn
• Takes away some freedom from the programmer
• Performance reduced (cf. using high-level languages instead of assembly language)
What parallel design patterns are we talking about?
Higher-level patterns for forming a complete computation:
• master-slave
• workpool
• pipeline
• divide and conquer
• stencil
• map-reduce, ...
Low-level patterns:
• fork-join
• point-to-point
• broadcast
• scatter
• gather, reduce, ...
Low-level MPI message-passing patterns
MPI point-to-point data transfer (send-receive)
[Figure: data sent from a source process to a destination process]
Collective patterns
Broadcast pattern
Sends the same data to each of a group of processes. A common pattern to get the same data to all processes, especially at the beginning of a computation.
[Figure: a source sends the same data to all destinations]
Note: The patterns as drawn do not describe the implementation. Only the final result is the same in any parallel implementation.
Scatter pattern
Distributes a collection of data items to a group of processes. A common pattern to get data to all processes. Usually the data sent are parts of an array.
[Figure: a source sends different data to each destination]
Gather pattern
Essentially the reverse of the scatter pattern. It receives data items from a group of processes. A common pattern, especially at the end of a computation, to collect results.
[Figure: data from the sources is collected at the destination in an array]
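The scatter and gather patterns can be illustrated with a minimal sequential sketch. It is not MPI code: plain Python lists stand in for the processes, and the function names `scatter` and `gather` are hypothetical helpers chosen for this example only.

```python
def scatter(data, num_procs):
    """Split data into num_procs contiguous chunks, one per (simulated) process."""
    base, extra = divmod(len(data), num_procs)
    chunks, start = [], 0
    for rank in range(num_procs):
        # The first `extra` processes receive one additional item.
        size = base + (1 if rank < extra else 0)
        chunks.append(data[start:start + size])
        start += size
    return chunks


def gather(chunks):
    """Collect the chunks back into a single array at the destination."""
    result = []
    for chunk in chunks:
        result.extend(chunk)
    return result


data = list(range(10))
chunks = scatter(data, 4)                               # different data to each process
squared = [[x * x for x in chunk] for chunk in chunks]  # local work on each process
collected = gather(squared)                             # results collected in rank order
```

Scatter at the start and gather at the end is the combination typically used to implement a master-slave computation.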
Reduce pattern
A common pattern to get data back to the master from all processes and then aggregate it by combining the collected data into one answer.
[Figure: data from the sources is collected at the destination and combined with a reduction operation to get one answer]
The reduction operation needs to be associative (e.g. 3 + (4 + 5) = (3 + 4) + 5) so that the implementation can group the operations in any way. Being commutative as well (e.g. 3 + 4 = 4 + 3) allows more flexibility in the parallel implementation.
Note that subtraction is not associative, e.g. 3 – (4 – 5) != (3 – 4) – 5, but one can use addition with negative numbers.
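The associativity requirement can be checked with a small sketch. The helper `tree_reduce` is a hypothetical name; it combines values pairwise, which is one grouping a parallel implementation might choose, and is compared against a sequential left-to-right reduction.

```python
from functools import reduce
import operator


def tree_reduce(op, items):
    """Combine a non-empty sequence pairwise, level by level, as a tree."""
    items = list(items)
    while len(items) > 1:
        paired = [op(items[i], items[i + 1]) for i in range(0, len(items) - 1, 2)]
        if len(items) % 2:
            paired.append(items[-1])  # odd item out is carried to the next level
        items = paired
    return items[0]


values = [3, 4, 5, 6]

# Addition is associative: both groupings agree.
seq_sum = reduce(operator.add, values)
tree_sum = tree_reduce(operator.add, values)

# Subtraction is not associative: the groupings disagree.
seq_diff = reduce(operator.sub, values)        # ((3 - 4) - 5) - 6
tree_diff = tree_reduce(operator.sub, values)  # (3 - 4) - (5 - 6)

# Workaround: add the negated values instead of subtracting.
negated = [values[0]] + [-v for v in values[1:]]
tree_neg_sum = tree_reduce(operator.add, negated)
```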
Collective all-to-all broadcast
A common all-to-all pattern, often within a computation, is to send data from all processes to all processes. Every process sends data to every other process (one-way). The sources and destinations are the same processes.
[Figure: the same group of processes shown as both sources and destinations]
Versions of this can be found in MPI.
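A minimal sketch of the result of an all-to-all broadcast follows. It is a sequential simulation with a hypothetical function name; it shows only the final state (every process holds every process's data), not how a real implementation moves the messages.

```python
def all_to_all_broadcast(local_values):
    """Return one inbox per process; inbox[dst][src] holds the value sent by src."""
    n = len(local_values)
    inboxes = [[None] * n for _ in range(n)]
    for src in range(n):
        for dst in range(n):
            # Each source delivers its own value to every destination,
            # including itself, so all inboxes end up identical.
            inboxes[dst][src] = local_values[src]
    return inboxes


inboxes = all_to_all_broadcast(["a", "b", "c"])
```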
Some Higher-Level Message-Passing Patterns
Master/slave
The computation is divided into parts, which are passed out to slaves to perform and return their results. This is the basis of most parallel computing.
[Figure: a master (source/sink) with two-way connections to slave compute nodes]
Workpool
A very widely applicable pattern. The master holds a task queue. Once a slave (worker) completes a task and returns its result, it is given another task from the task queue if the queue is not empty, which gives the pattern its load-balancing quality. The master aggregates the answers.
We need to differentiate between the master-slave pattern, which does not imply a task queue, and the workpool, which has a task queue.
[Figure: a master with a task queue sends tasks to slaves/workers and receives their results]
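The workpool can be sketched with Python threads standing in for slave processes, which is an assumption made for illustration. The function `run_workpool` and its parameters are hypothetical names; the main thread plays the master, filling the task queue and aggregating results.

```python
import queue
import threading


def run_workpool(tasks, worker_fn, num_workers=4):
    """Master: fill a task queue, start workers, and collect results in task order."""
    tasks = list(tasks)
    task_queue = queue.Queue()
    for index, task in enumerate(tasks):
        task_queue.put((index, task))
    results = [None] * len(tasks)

    def worker():
        # A worker keeps taking tasks until the queue is empty, so faster
        # workers naturally process more tasks (load balancing).
        while True:
            try:
                index, task = task_queue.get_nowait()
            except queue.Empty:
                return
            results[index] = worker_fn(task)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


squares = run_workpool(range(10), lambda x: x * x, num_workers=3)
```

Each worker writes to a distinct slot of `results`, so no lock is needed around the result list in this sketch.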
More Specialized High-Level Patterns
Pipeline
[Figure: a master (source/sink) feeds slaves (workers) arranged as Stage 1, Stage 2, Stage 3, ..., Stage n, with one-way connections between compute nodes]
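The one-way flow of a pipeline can be sketched with chained generators. This is a sequential simulation under the assumption that each stage is a simple function; in a real parallel pipeline the stages would run concurrently on different items. The names `stage` and `run_pipeline` are hypothetical.

```python
def stage(fn, upstream):
    """One pipeline stage: apply fn to each item arriving from the previous stage."""
    for item in upstream:
        yield fn(item)


def run_pipeline(source, stage_fns):
    """Connect the stages in order and drain the final stage at the sink."""
    stream = iter(source)
    for fn in stage_fns:
        stream = stage(fn, stream)
    return list(stream)


# Three stages: add 1, double, subtract 3.
outputs = run_pipeline([1, 2, 3], [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3])
```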
Divide and Conquer
[Figure: compute nodes and a source/sink joined by two-way connections, with a divide phase followed by a merge phase]
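Merge sort is a familiar instance of the divide and merge phases, shown here as a sequential sketch. It is chosen as an illustration and is not taken from the slides; in a parallel version the two recursive calls could be given to different compute nodes.

```python
def merge(left, right):
    """Merge phase: combine two sorted lists into one sorted list."""
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


def merge_sort(items):
    """Divide phase: split in half, solve each half, then merge the results."""
    if len(items) <= 1:
        return list(items)
    mid = len(items) // 2
    return merge(merge_sort(items[:mid]), merge_sort(items[mid:]))


sorted_items = merge_sort([5, 2, 9, 1, 5, 6])
```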
All-to-All
All compute nodes can communicate with all the other nodes.
[Figure: a master (source/sink) and compute nodes, all joined by two-way connections]
Stencil
All compute nodes can communicate with only neighboring nodes. Usually a synchronous computation that performs a number of iterations to converge on a solution, e.g. solving Laplace’s/heat equation. On each iteration, each node communicates with its neighbors to get their stored computed values.
[Figure: compute nodes with two-way connections to neighboring nodes, plus a source/sink]
Iterative synchronous patterns
• A pattern is repeated until some termination condition occurs.
• Synchronization at each iteration establishes the termination condition, which is often a global condition.
• Note that this is two patterns merged together sequentially if we call iteration a pattern.
[Figure: the pattern runs, then the termination condition is checked, leading to either Repeat or Stop]
Iterative synchronous stencil pattern
Stencil: All compute nodes can communicate with only neighboring nodes.
• Applications:
• Solving Laplace’s/heat equation: perform a number of iterations to converge on a solution.
[Figure: the stencil pattern runs, then the termination condition is checked, leading to either Repeat or Stop]
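A minimal sketch of an iterative synchronous stencil is Jacobi iteration for the one-dimensional Laplace equation with fixed end values. The choice of Jacobi iteration, the function name `jacobi_1d`, and the tolerance-based termination test are assumptions made for this example; each list element stands in for a compute node that reads only its two neighbors.

```python
def jacobi_1d(values, tolerance=1e-9, max_iterations=10000):
    """Repeat the stencil update until the largest change falls below tolerance."""
    current = list(values)
    for iteration in range(1, max_iterations + 1):
        updated = list(current)
        # Stencil step: each interior point becomes the average of its two
        # neighbors, using only values from the previous iteration.
        for i in range(1, len(current) - 1):
            updated[i] = 0.5 * (current[i - 1] + current[i + 1])
        # Global termination condition, checked once per iteration.
        change = max(abs(a - b) for a, b in zip(updated, current))
        current = updated
        if change < tolerance:
            return current, iteration
    return current, max_iterations


# End points are held at 0 and 100; interior points start at 0.
solution, iterations = jacobi_1d([0.0, 0.0, 0.0, 0.0, 100.0])
```

The steady state is a straight line between the two fixed end values, which gives a simple way to check the result.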
Iterative synchronous all-to-all pattern
Example: The N-body problem needs an “iterative synchronous all-to-all” pattern, where on each iteration all processes exchange data with each other.
[Figure: the all-to-all pattern runs, then the termination condition is checked, leading to either Repeat or Stop]
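The N-body example can be sketched in one dimension as follows. Everything here is an illustrative assumption: the function names, the softened inverse-square force, the simple time-stepping scheme, and the fixed iteration count used as the termination condition. The point is the structure, in which every body needs the positions of all other bodies on every iteration.

```python
def nbody_step(positions, velocities, masses, dt=0.01, g=1.0, softening=1e-3):
    """One synchronous iteration: all bodies read the same snapshot of positions."""
    snapshot = list(positions)  # the all-to-all exchange: everyone sees everyone
    new_velocities = []
    for i, xi in enumerate(snapshot):
        acceleration = 0.0
        for j, xj in enumerate(snapshot):
            if i != j:
                d = xj - xi
                # Softened inverse-square attraction toward body j.
                acceleration += g * masses[j] * d / (abs(d) ** 3 + softening)
        new_velocities.append(velocities[i] + acceleration * dt)
    new_positions = [x + v * dt for x, v in zip(snapshot, new_velocities)]
    return new_positions, new_velocities


def simulate(positions, velocities, masses, iterations):
    """Repeat the all-to-all step a fixed number of times."""
    for _ in range(iterations):
        positions, velocities = nbody_step(positions, velocities, masses)
    return positions, velocities


final_positions, final_velocities = simulate([-1.0, 1.0], [0.0, 0.0], [1.0, 1.0], 100)
```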
Previous/Existing Work
Patterns have been explored in several projects.
• Industrial efforts
• Intel Threading Building Blocks (TBB), Cilk Plus, Array Building Blocks (ArBB). These focus on very low-level patterns such as fork-join.
• Universities:
• University of Illinois at Urbana-Champaign and University of California, Berkeley
• University of Torino/Università di Pisa, Italy
“Structured Parallel Programming: Patterns for Efficient Computation,” Michael McCool, James Reinders, Arch Robison, Morgan Kaufmann, 2012. Covers Intel tools: TBB, Cilk, ArBB.
Our approach
We have developed several tools at different levels of abstraction that avoid using low-level MPI and enable students to create working patterns very quickly.
• Suzaku framework – provides pre-written pattern-based routines and macros that hide the MPI code. Low-level patterns, workpool, ...
• Paraguin compiler – compiler directive approach that creates MPI code. Patterns implemented include scatter-gather for a master-slave pattern, stencil, …
• Seeds framework – high-level Java-based software. Many patterns implemented, including workpool, pipeline, synchronous iterative all-to-all, and stencil. Self-deploys and executes on any platform, local computers or distributed computers.
Historical note: Seeds was developed first as part of a UNC-C PhD project by Jeremy Villalobos, 2007-2011.
Acknowledgements
The Seeds framework was developed by Jeremy Villalobos in his PhD thesis “Running Parallel Applications on a Heterogeneous Environment with Accessible Development Practices and Automatic Scalability,” UNC-Charlotte, 2011. The extension of this work to a teaching environment was supported by the National Science Foundation under grant "Collaborative Research: Teaching Multicore and Many-Core Programming at a Higher Level of Abstraction" #1141005/1141006 (2012-2015). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.