Multi-Tasking Models and Algorithms General Concepts (Part I)
Outline for Multi-Tasking Models Note: Items in black are in this slide set (Part I). • Preliminaries • Common Decomposition Methods • Characteristics of Tasks and Interactions • Mapping Techniques for Load Balancing • Some Parallel Algorithm Models • The Data-Parallel Model • The Task Graph Model • The Work Pool Model • The Master-Slave Model • The Pipeline or Producer-Consumer Model • Hybrid Models
Outline (cont.) • Algorithm examples for most of the preceding algorithm models. • This part is currently missing and needs to be added next time. • Some could be added as examples under the Task/Channel model • Task-Channel (Computational) Model • Asynchronous Communication and Performance Evaluation • Modeling Asynchronous Communication • Performance Metrics and Asynchronous Communications • The Isoefficiency Metric & Scalability • Future revision plans for preceding material. • BSP (Computational) Model • Slides posted separately on course website
References • Michael Quinn, Parallel Programming in C with MPI and OpenMP, McGraw Hill, 2004. • Particularly, Chapters 3 and 7 plus algorithm examples. • Textbook slides for this book • Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar, Introduction to Parallel Computing, 2nd Edition, Addison Wesley, 2003. • Particularly, Chapter 3 (available online) • Also, Section 2.5 (Asynchronous Communications) • Slides by the authors • Barry Wilkinson and Michael Allen, Parallel Programming: Techniques and Applications Using Networked Workstations and Parallel Computers, Second Edition, Prentice Hall, 2005. • Ian Foster, Designing and Building Parallel Programs: Concepts and Tools for Parallel Software Engineering, Addison Wesley, 1995. • Online at http://www-unix.mcs.anl.gov/dbpp/text/book.html
Change in Chapter Title • This chapter consists of three sets of slides. • This chapter was formerly called Strictly Asynchronous Models • The name has now been changed to Multi-Tasking Models • However, the old name still occurs regularly in the internal slides.
Specifying Asynchronous Algorithms • Identifying parts that can be done concurrently • Tasks • Mapping of the tasks onto multiple processors • Processes vs processors • Distributing the input, output, and intermediate results across different processors • Management of access to shared data • Either input or intermediate • Synchronization of the processors at various stages of the parallel execution
Finding Concurrent Pieces of Work • Decomposition • The process of dividing the computation into smaller pieces of work called tasks • Tasks are programmer-defined and are considered to be indivisible. • Tasks may be of arbitrary sizes • Simultaneous execution of multiple tasks is the key to reducing the time required
Example: Dense Matrix-Vector Multiplication • Tasks can be of different sizes • Granularity of a task
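As a concrete illustration (a minimal sketch, not taken from the textbook slides; matrix size and values are hypothetical), the code below decomposes a dense matrix-vector product y = A·x into one task per output row. Grouping several rows into a single task would give a coarser-grained decomposition.

```c
#include <stdio.h>

#define N 4  /* illustrative matrix dimension */

/* One fine-grained task: compute a single entry of y = A*x.
   A coarser-grained decomposition would give each task a block of rows. */
void row_task(int i, double A[N][N], double x[N], double y[N]) {
    y[i] = 0.0;
    for (int j = 0; j < N; j++)
        y[i] += A[i][j] * x[j];
}

int main(void) {
    double A[N][N] = {{1,2,3,4},{5,6,7,8},{9,10,11,12},{13,14,15,16}};
    double x[N] = {1, 1, 1, 1}, y[N];

    /* The N row-tasks are independent, so they could run concurrently;
       here they are executed sequentially for illustration. */
    for (int i = 0; i < N; i++)
        row_task(i, A, x, y);

    for (int i = 0; i < N; i++)
        printf("y[%d] = %g\n", i, y[i]);
    return 0;
}
```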
Task-Dependency Graph • In most cases, there are dependencies between the different tasks • Certain task(s) can only start once some other task(s) have finished • Example: Producer-consumer relationships • These dependencies are represented using a DAG called a task-dependency graph
Task-Dependency Graph (cont) • A task-dependency graph is a directed acyclic graph in which the nodes represent tasks and the directed edges indicate the dependences between them • The task corresponding to a node can be executed when all tasks connected to this node by incoming edges have been completed. • The number and size of the tasks that the problem is decomposed into determine the granularity of the decomposition. • Called fine-grained for a large number of small tasks • Called coarse-grained for a small number of large tasks
Task-Dependency Graph (cont) • Key Concepts Derived from Task-Dependency Graph • Degree of Concurrency • The number of tasks that can be executed concurrently • We are usually most concerned with the average degree of concurrency • Critical Path • The longest vertex-weighted path in the graph • The weight of a node represents the task size • The critical path length is the sum of the weights of the nodes along the path • The degree of concurrency normally increases, and the critical path length decreases or stays the same, as the granularity becomes finer.
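For reference, the average degree of concurrency can be computed from these two quantities; the numbers below are purely illustrative and not from the slides.

```latex
% Hypothetical numbers: total work W = 2100 (sum of all task weights),
% critical path length l = 700.
\[
  \text{average degree of concurrency} \;=\; \frac{W}{l} \;=\; \frac{2100}{700} \;=\; 3
\]
```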
Task-Interaction Graph • Captures the pattern of interaction between tasks • This graph usually contains the task-dependency graph as a subgraph. • This is because there may be interactions between tasks even when there are no dependencies between them. • These interactions are usually due to accesses of shared data
Task-Dependency and Task-Interaction Graphs • These graphs are important in developing effective mappings of the tasks onto the different processors • The goal is to maximize concurrency and minimize overheads.
Processes vs Processors • Process vs Processor • Considered distinct concepts in this chapter. • Process: A logical computing agent that performs tasks. • Processor: A hardware unit that physically performs computation. • Usually there is a 1:1 correspondence between processors and processes. • However, this distinction provides additional flexibility • In order to obtain any speedup over sequential programming, a parallel program must have several processes active at the same time, working on different tasks.
Mapping Tasks to Processes • Mapping: The way that tasks are assigned to processes for execution. • Illustrated in Figures 3.5 and 3.7 • Good mappings attempt to • Maximize the use of concurrency by mapping independent tasks onto different processes. • Minimize the total completion time by ensuring that tasks on the critical path are executed as soon as they become available. • Map tasks with a high degree of mutual interaction to the same process.
Decomposition Methods • Decomposition: Technique used to split the computation into a set of tasks. • Common Decomposition techniques • Data Decomposition • Recursive Decomposition • Exploratory Decomposition • Speculative Decomposition • Hybrid Decomposition • Data and Recursive decompositions are general-purpose task decomposition methods. • Exploratory and Speculative decompositions are special-purpose task decomposition methods.
Recursive Decomposition • Suitable for problems that can be solved using the divide and conquer paradigm • Each of the subproblems generated by the divide step becomes a new task. • Results in natural concurrency, as different subproblems can be solved concurrently
Another Example: Finding the Minimum • Note that we can obtain divide-and-conquer algorithms for problems that are usually solved by using other methods.
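A small illustrative sketch (array values are hypothetical, not from the slides): the recursive minimum below shows the decomposition directly; the two half-range subproblems are independent and could be executed as concurrent tasks.

```c
#include <stdio.h>

/* Divide-and-conquer minimum: the two half-range subproblems are
   independent, so they form tasks that could be executed concurrently. */
int rec_min(const int *a, int lo, int hi) {
    if (lo == hi)
        return a[lo];
    int mid = lo + (hi - lo) / 2;
    int left  = rec_min(a, lo, mid);       /* subproblem / task 1 */
    int right = rec_min(a, mid + 1, hi);   /* subproblem / task 2 */
    return (left < right) ? left : right;
}

int main(void) {
    int a[] = {4, 9, 1, 7, 8, 11, 2, 12};
    printf("min = %d\n", rec_min(a, 0, 7));  /* prints 1 */
    return 0;
}
```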
Recursive Decomposition • How good are the decompositions produced? • Average Concurrency? • Length of critical path? • How do the quicksort and min-finding decompositions measure up?
Data Decomposition • Used to derive concurrency for problems that operate on large amounts of data • The idea is to derive the tasks by focusing on the multiplicity of data • Data decomposition is often performed in two steps: • Step 1: Partition the data • Step 2: Induce a computational partitioning from the data partitioning. • Which data should we partition? • Input / Output / Intermediate? • All of the above • This leads to different data decomposition methods • How do we induce a computational partitioning? • Use the "owner-computes" rule
Matrix-Matrix Example (cont) • Note that the set of tasks created by the previous decomposition is not unique.
Partitioning Intermediate Data • The partitioning of the matrix multiplication in Figure 3.10 into four tasks can be refined further by partitioning intermediate data. • See next slide • The Di,j matrices that are created are not computed by the sequential algorithm, so this decomposition requires a change to the sequential algorithm. • Additionally, the creation of the Di,j matrices requires additional storage space.
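A minimal sketch of the two-stage idea (hypothetical 2x2 values, with each "block" shrunk to a single number to keep the code short): stage one produces the intermediate products as independent tasks, stage two sums them.

```c
#include <stdio.h>

int main(void) {
    /* Stage 1 computes the intermediate products D[k][i][j] = A[i][k] * B[k][j]
       as independent tasks; stage 2 sums C[i][j] = D[0][i][j] + D[1][i][j]. */
    double A[2][2] = {{1, 2}, {3, 4}};
    double B[2][2] = {{5, 6}, {7, 8}};
    double D[2][2][2], C[2][2];

    /* Stage 1: eight independent multiplication tasks produce
       intermediate data not present in the sequential algorithm. */
    for (int k = 0; k < 2; k++)
        for (int i = 0; i < 2; i++)
            for (int j = 0; j < 2; j++)
                D[k][i][j] = A[i][k] * B[k][j];

    /* Stage 2: four summation tasks combine the intermediate results. */
    for (int i = 0; i < 2; i++)
        for (int j = 0; j < 2; j++)
            C[i][j] = D[0][i][j] + D[1][i][j];

    printf("C[0][0] = %g\n", C[0][0]);  /* 1*5 + 2*7 = 19 */
    return 0;
}
```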
“Owner-Computes” Rule • Used when data decomposition is used to partition the work into tasks. • This general principle requires that each partition performs all computations that involve the data it owns. • This is illustrated in the next two slides.
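In addition to the textbook illustrations, here is a hedged sketch (dimensions and the helper name block_task are assumptions) of owner-computes with an output-data partitioning of C = A*B into four blocks: the task owning a block of C performs every computation that writes into that block.

```c
#include <stdio.h>

#define N 4          /* matrix dimension */
#define BS (N / 2)   /* 2x2 grid of output blocks */

/* Owner-computes: the task owning output block (bi, bj) of C = A*B
   performs every computation that writes into that block. */
void block_task(int bi, int bj,
                double A[N][N], double B[N][N], double C[N][N]) {
    for (int i = bi * BS; i < (bi + 1) * BS; i++)
        for (int j = bj * BS; j < (bj + 1) * BS; j++) {
            C[i][j] = 0.0;
            for (int k = 0; k < N; k++)   /* A and B are shared read-only */
                C[i][j] += A[i][k] * B[k][j];
        }
}

int main(void) {
    double A[N][N], B[N][N], C[N][N];
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) { A[i][j] = 1.0; B[i][j] = 1.0; }

    /* The four block-tasks are independent and could run concurrently. */
    for (int bi = 0; bi < 2; bi++)
        for (int bj = 0; bj < 2; bj++)
            block_task(bi, bj, A, B, C);

    printf("C[0][0] = %g\n", C[0][0]);  /* expect 4 */
    return 0;
}
```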
Exploratory Decomposition • Used to decompose computations that correspond to a search of a space of solutions. • The search space is partitioned into smaller parts, and these are searched concurrently until a desired solution is found. • The next slide shows the initial configuration for the 15-puzzle and a sequence of moves leading to the final configuration. • The subsequent slide shows how a state-space search leads to the solution.
Exploratory Decomposition • Not general purpose • After sufficient branches are generated, each branch can be assigned as a task to be explored further. • As soon as one task finds a solution, the other tasks can be terminated. • It can result in speedup and slowdown anomalies • The work performed by the parallel formulation of an algorithm can be either smaller or greater than that performed by the serial algorithm.
Exploratory Decomposition • Not general purpose • Can result in speedup anomalies • Either engineered slow-down or superlinear speedup.
Speculative Decomposition • Used to extract concurrency in problems in which the next step is one of several actions that can only be determined when the current task finishes. • While the current task is executing, other tasks can perform the computation of the multiple branches in parallel • This decomposition method guarantees some wasteful computation. • An alternate version is to explore only the most promising branch • Or most promising branches
Speculative Decomposition • Difference from exploratory decomposition • In speculative decomposition, the input at a branch leading to multiple tasks is unknown. • In exploratory decomposition, the output of the multiple tasks originating at the branch is unknown. • Speculative decomposition can lead to more, less, or the same amount of work compared to the serial program.
Speculative Execution • If predictions are wrong • Work is wasted • Work may need to be undone • State-restoring overhead • Memory/computations • However, it may be the only way to extract concurrency!
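A toy sketch of the idea (branch bodies and values are hypothetical): both possible continuations of a branch are computed before the condition is known; once the condition arrives, one result is kept and the other is discarded as wasted work.

```c
#include <stdio.h>

/* Hypothetical branch bodies; in a real program these would be the
   expensive computations that follow the branch. */
int then_branch(int x) { return x * x; }
int else_branch(int x) { return x + 100; }

int main(void) {
    int x = 7;

    /* Speculate: compute both possible continuations before the branch
       condition is known; these calls are independent and could run
       concurrently with the computation that produces the condition. */
    int r_then = then_branch(x);
    int r_else = else_branch(x);

    int cond = (x % 2 == 0);              /* condition becomes known "late"  */
    int result = cond ? r_then : r_else;  /* keep one result, discard the
                                             other (the wasted work)         */
    printf("result = %d\n", result);      /* prints 107 for x = 7 */
    return 0;
}
```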
Characteristics of Tasks • Task Generation • Static: All tasks are known before execution of the algorithm starts. • Data decomposition usually results in static tasks • Example: Matrix Multiplication • Task Sizes • The relative amount of time needed to complete a task • Uniform tasks: All require the same time • Non-uniform tasks: Execution time varies significantly. • Size of Data Needed by a Task • The data must be available to the process performing the task • The size and location of this data may determine the best process to perform the task.
Some Task Interaction Characteristics • Static vs Dynamic Interactions • Static interactions occur at predetermined times and involve predetermined tasks. • Ex: Matrix multiplication • Otherwise, the interaction is dynamic • 15-puzzle: Tasks that finish their work can pick up an unexplored state from the queue of another busy task. • Regular vs Irregular Interactions • Regular if the interaction has some structure that can be used to obtain an efficient implementation • Otherwise, irregular. • Ex: In sparse matrix-vector multiplication, a task must scan its row of the matrix to find out which of the vector entries are needed
Some Task Interaction Characteristics (cont) • Read-only vs Read-Write Data Sharing • Read-only: Tasks only need to read data shared with other tasks • Ex: Matrix multiplication in Fig. 3.10 • Read-Write: Multiple tasks need to read and write some shared data. • Ex: Using a heuristic search to solve the 15-puzzle.
Mapping Tasks to Processors • A good mapping strives to achieve the following conflicting goals: • Reducing the amount of time processors spend interacting with each other. • Reducing the amount of time that some processors are active while others are idle. • Good mappings attempt to reduce the parallel processing overheads • If Tp is the parallel runtime using p processors and Ts is the sequential runtime (for the same algorithm), then the total overhead To is p×Tp − Ts. • This is the work done by the parallel system beyond that required by the serial system.
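A quick worked instance of the overhead formula, using purely hypothetical numbers:

```latex
% Hypothetical numbers: T_s = 100, p = 4, T_p = 30.
\[
  T_o \;=\; p\,T_p - T_s \;=\; 4 \times 30 - 100 \;=\; 20
\]
```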
Mapping Tasks to Processors (cont) • Two main sources of overhead • Load imbalance • Results in process inactivity during execution • Inter-process communication • Coordination • Synchronization • Data-sharing • The goal of mapping tasks to processes is to minimize the overheads. • The goals of minimizing these two overheads are often in conflict with each other.
Why Mappings Can Be Complicated • Mappings need to consider the task-dependency graph • Are tasks available a priori? • Static vs dynamic task generation • Computation requirement factors • Are they uniform or non-uniform? • Do we know them a priori? • How much data is associated with each task? • Mappings need to consider the task-interaction graph to determine the interactions between tasks • Are they static or dynamic? • Do we know about them a priori? • Are they data-instance dependent? • Are they regular or irregular? • Are they read-only or read-write? • Depending on the above characteristics, different mapping techniques are required, with differing complexities and costs.
Simple & Complex Task Interactions Example • Consider the task-interaction graph for image dithering • The color of each pixel is determined as a weighted average of its original color and the values of neighboring pixels • If we break the image up into square regions and assign a different task to each, we have simple task interactions • Consider the sparse matrix-vector product. • Assign the i-th row and the i-th vector value to the i-th task. • If the j-th entry in the i-th row is non-zero, then the i-th task must obtain the j-th vector value from the j-th task (unless i = j). • The result is a complex task-interaction graph.
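A small sketch of the sparse case (the 3x3 matrix and its values are hypothetical) using the common compressed sparse row (CSR) layout: the vector entries a row-task needs from other tasks are exactly the column indices of its non-zeros, so the interaction pattern is irregular and data-instance dependent.

```c
#include <stdio.h>

int main(void) {
    /* Hypothetical 3x3 sparse matrix in CSR format:
       [ 10  0  2 ]
       [  0  5  0 ]
       [  1  0  7 ]                                  */
    int    row_ptr[] = {0, 2, 3, 5};
    int    col_idx[] = {0, 2, 1, 0, 2};
    double val[]     = {10, 2, 5, 1, 7};
    double x[] = {1, 2, 3}, y[3];

    /* Task i owns row i and y[i]; it needs x[j] only where row i has a
       non-zero in column j, i.e., from those owning tasks. */
    for (int i = 0; i < 3; i++) {
        y[i] = 0.0;
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            y[i] += val[k] * x[col_idx[k]];  /* x[col_idx[k]] comes from task col_idx[k] */
    }

    for (int i = 0; i < 3; i++)
        printf("y[%d] = %g\n", i, y[i]);     /* 16, 10, 22 */
    return 0;
}
```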
Mapping Techniques for Load Balancing • Problem: Assigning tasks whose total computational requirements are the same to each processor does not automatically ensure a balanced load. • Each processor below is assigned three tasks, but (a) is better than (b).
Load Balancing Techniques • Static Mapping • The tasks are distributed among the processors prior to execution • Applicable for tasks that are • Generated statically • Of known and/or uniform computational requirements • Finding an optimal mapping for non-uniform tasks is NP-hard, so heuristic mappings are required for acceptable solutions • Dynamic Mapping • The tasks are distributed among the processors during the execution of the algorithm • i.e., tasks & data are migrated during execution • Applicable for tasks that are either • Generated dynamically, or • Of unknown computational requirements
Static Mapping – Array Distribution • Suitable for algorithms that • Use data decomposition • Have their underlying data in the form of arrays • i.e., input, output, or intermediate data • Distribution schemes (in 1D, 2D, or 3D): • Block Distribution • Cyclic Distribution • Block-Cyclic Distribution • Randomized Distribution
1D Block Distributions • Partition an n×m two-dimensional array along one dimension among p processes. • Process k can be given the k-th block of n/p consecutive rows. • i.e., rows kn/p, ..., (k+1)n/p − 1 are given to process k. • If n/p is not an integer, • all processes except the last can be given a block of ⌈n/p⌉ rows and the last process the remaining block of rows • Alternately, the first n mod p processes could receive ⌈n/p⌉ rows, and the rest receive ⌊n/p⌋ rows • Similarly, process k can be given the k-th block of m/p consecutive columns.
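A minimal sketch of computing 1D block bounds (the helper names block_low/block_high/block_size are assumptions, not from the slides); with this convention every process owns either ⌊n/p⌋ or ⌈n/p⌉ consecutive rows.

```c
#include <stdio.h>

/* With this convention every process gets either floor(n/p) or
   ceil(n/p) consecutive rows, even when p does not divide n. */
int block_low(int k, int p, int n)  { return (k * n) / p; }
int block_high(int k, int p, int n) { return block_low(k + 1, p, n) - 1; }
int block_size(int k, int p, int n) { return block_high(k, p, n) - block_low(k, p, n) + 1; }

int main(void) {
    int n = 10, p = 4;   /* 10 rows distributed over 4 processes */
    for (int k = 0; k < p; k++)
        printf("process %d: rows %d..%d (%d rows)\n",
               k, block_low(k, p, n), block_high(k, p, n), block_size(k, p, n));
    return 0;
}
```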
2D Block Distributions • We could partition along more than one dimension. • With a d-dimensional array, we can partition along up to d dimensions. • If we have p processes and p = p1 × p2, we could partition an n×n array into p subblocks of size (n/p1) × (n/p2) and assign one to each process. • The preceding 1D and 2D distributions are illustrated in the next slide.
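A short sketch of the 2D case (array size and grid shape are hypothetical, and p1, p2 are assumed to divide n evenly): the process at grid position (r, c) owns the (n/p1) × (n/p2) subblock starting at row r·(n/p1), column c·(n/p2).

```c
#include <stdio.h>

int main(void) {
    /* Hypothetical 8x8 array distributed over a 2x2 process grid
       (p = p1 * p2, assuming p1 and p2 divide n evenly). */
    int n = 8, p1 = 2, p2 = 2;

    for (int r = 0; r < p1; r++)
        for (int c = 0; c < p2; c++)
            printf("process (%d,%d): rows %d..%d, cols %d..%d\n",
                   r, c,
                   r * (n / p1), (r + 1) * (n / p1) - 1,
                   c * (n / p2), (c + 1) * (n / p2) - 1);
    return 0;
}
```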