Consortium for Computational Science and High Performance Computing, 2005 Summer Workshop, July 29-31, 2005. July 29, 2005, Distributed Computing, 2:00 pm - 3:00 pm: Parallelizing Algorithms
Parallelizing Algorithms. Stephen V. Providence, Ph.D., Department of Computer Science, North Carolina A&T State University
Topics • Task/Channel Model • I. Foster’s Methodology • Two Problems: BVP, N-body • Summary • Current Research
Introduction • This session will focus on Foster’s design steps for parallelizing an algorithm • We focus on two research areas representative of problems found throughout science and engineering • BVP, or boundary-value problem, leading to a PDE solver • The N-body O(n^2) problem (not the Greengard-Rokhlin O(n) FMM) • The given design scheme will facilitate implementation of an efficient MPI code • We describe the necessary steps next!
Introduction (continued) • Our method is based on the Task/Channel model • Attendees will learn parallelizing methods from the two research areas applied to their interests • Some areas of interest where this session is applicable: • Grid computing • Signal & image processing • Cryptography and network security • Bioinformatics • Monte Carlo simulation • Matrix computations and linear system solvers
Task/Channel Model • Represents a parallel computation as a set of tasks that may interact w/ each other by sending messages through channels.
Tasks • A task is a program, its local memory and a collection of I/O ports • Local memory contains the program’s instructions and its private data • A task can send local data values to other tasks via output ports • A task can receive data values from other tasks via input ports
Channels • A channel is a message queue that connects one task’s output port w/ another task’s input port • Data values appear at the input port in the same order in which they were placed in the output port at the other end of the channel
Task/Channel Model (a) a task consists of a program, local memory, and a collection of I/O ports; (b) a parallel computation is a digraph in which the vertices are tasks and the edges are channels
Sending and Receiving • Receiving is synchronous • If a task tries to receive a value at an input port and no value is available, the task must wait until the value appears; the receiving task is BLOCKED • Sending is asynchronous • A task sending a message does not wait (the send is NON-BLOCKING), even if previous messages it has sent along the same channel have not been received
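As a rough MPI analogue of these semantics (the ranks, tag, and value below are illustrative, not from the slides): MPI_Recv blocks until a matching message arrives, much like a channel receive, while a standard MPI_Send may return as soon as its buffer can be reused, which for small messages typically behaves like the asynchronous send described above (MPI_Isend makes the non-blocking behavior explicit).

/* Sketch: blocking receive vs. (effectively) asynchronous send in MPI.
   Assumes at least two processes; rank 0 sends, rank 1 receives. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank;
    double value = 3.14;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Sender: MPI_Send may return before the receiver has the data. */
        MPI_Send(&value, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Receiver: MPI_Recv blocks until the value actually arrives. */
        MPI_Recv(&value, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("received %f\n", value);
    }

    MPI_Finalize();
    return 0;
}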
Foster’s Design Methodology There are four design steps: • Partitioning • Communication • Agglomeration • Mapping
Foster’s methodology: Problem → Partitioning → Communication → Agglomeration → Mapping
Partitioning • The process of dividing the computation and the data into pieces • Data-centric approach • Computation-centric approach
Domain Decomposition • The parallel algorithm design approach in which we first divide the data into pieces and then determine how to associate computations with the data. • Focus is on the largest and/or most frequently accessed data structure in the program • Consider a three-dimensional matrix as the largest and most frequently accessed data structure
Three Domain Decompositions of a 3-D matrix: 1-D, 2-D, and 3-D
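Not from the slides, but as a concrete illustration of the bookkeeping behind a domain decomposition: a 1-D block decomposition of n items over p tasks can be computed as below (function names are ours); applying the same formula independently in each dimension gives the 2-D and 3-D decompositions.

/* Sketch: 1-D block decomposition of n items over p tasks.
   Task `id` owns indices [low, high); block sizes differ by at most 1. */
int block_low(int id, int p, int n)  { return (id * n) / p; }
int block_high(int id, int p, int n) { return ((id + 1) * n) / p; }
int block_size(int id, int p, int n) { return block_high(id, p, n) - block_low(id, p, n); }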
Functional Decomposition • We first divide the computation into pieces and then determine how to associate data items with the individual computations • Consider a high-performance system supporting interactive image-guided brain surgery
Functional Decomposition of a system supporting interactive image-guided surgery into tasks: track positions of instruments, acquire patient images, register images, determine image locations, and display image
Quality of Partitioning For either decomposition, each piece is a primitive task. The following checklist evaluates the quality of partitioning: • There are at least an order of magnitude more primitive tasks than processors in the target parallel computer • Redundant computations and redundant data structure storage are minimized • Primitive tasks are roughly the same size • The number of tasks is an increasing function of the problem size
Communication • Local communication • When a task needs values from a small number of other tasks in order to perform a computation, we create channels from the tasks supplying the data to the task consuming the data • Global communication • Exists when a significant number of the primitive tasks must contribute data in order to perform a computation.
Evaluation of Communication Structure Foster’s checklist to evaluate communication structure of our parallel algorithm • The communications operations are balanced among the tasks • Each task communicates with only a small number of neighbors • Tasks can perform their communications concurrently • Tasks can perform their computations concurrently
Agglomeration • The process of grouping tasks into larger tasks in order to improve performance or simplify programming. • Often when developing MPI programs we leave the agglomeration step with one task per processor
Tactics of Agglomeration If we agglomerate primitive tasks that communicate with each other, that communication is eliminated entirely. • Increase locality, i.e. lower communication overhead • Combine groups of sending and receiving tasks, reducing the number of messages being sent (see the sketch below); sending fewer, longer messages takes less time than sending more, shorter messages of the same total length, because of the per-message startup cost • Maintain scalability of the parallel design • Do not combine so many tasks that we will not be able to port our program at some point in the future to a computer with more processors • Reduce S/W engineering costs • In parallelizing a sequential program, one agglomeration may make greater use of existing sequential code, reducing the time and expense of developing the parallel program
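As an illustration of the message-combining tactic (function and variable names are ours, not from the slides), compare k one-element sends with a single k-element send; the combined send pays the per-message startup cost only once.

#include <mpi.h>

/* Sketch: k separate one-element messages, each paying the startup cost. */
void send_separately(double *values, int k, int dest) {
    for (int i = 0; i < k; i++)
        MPI_Send(&values[i], 1, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
}

/* Sketch: one message of length k with the same total payload. */
void send_combined(double *values, int k, int dest) {
    MPI_Send(values, k, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
}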
Evaluate Quality of Agglomeration Foster’s checklist to evaluate the quality of agglomeration: • The agglomeration has increased the locality of the parallel algorithm • Replicated computations take less time than the communications they replace • The amount of replicated data is small enough to allow the algorithm to scale
Evaluate Quality of Agglomeration Continued • Agglomerated tasks have similar computational and communications costs • The number of tasks is an increasing function of the problem size • The number of tasks is as small as possible, yet at least as great as the number of processors in the likely target computers • The trade-off between the chosen agglomeration and the cost of modifications to existing sequential code is reasonable
Mapping This is the process of assigning tasks to processors. • If we are executing our program on a centralized multiprocessor, the OS automatically maps processes to processors • We assume the target system is a distributed-memory parallel computer (the reason why we focus on MPI and not OpenMP) • The goals are to maximize processor utilization and minimize inter-processor communication
Processor Utilization • The average percentage of time the system’s processors are actively executing tasks necessary for the solution of the problem • It is maximized when the computation is balanced evenly, allowing all processors to begin and end execution at the same time
Inter-processor Communication • Increases when two tasks connected by a channel are mapped to different processors • Decreases when two tasks connected by a channel are mapped to the same processor
The Mapping Process (a) a task/channel graph of eight tasks (A-H); (b) mapping of the eight tasks to three processors: some channels now represent intra-processor communication while others represent inter-processor communication
Eight Tasks onto Three Processors • The l.h.s. and r.h.s. processors have two tasks each • The middle processor has four tasks • If all processors have the same speed and every task requires the same amount of time, then the middle processor will spend twice as much time executing tasks as the other two • If every channel communicates the same amount of data, then the middle processor will also be responsible for twice as many inter-processor communications as the other two
P-processor Mapping Suppose there are p processors available. • Mapping every task to the same processor reduces inter-processor communication to zero, but reduces utilization to 1/p • We wish to choose a mapping that strikes a balance between maximizing utilization and minimizing communication • Finding an optimal solution to the mapping problem is NP-hard (no polynomial-time algorithm is known that maps tasks to processors so as to minimize execution time)
Decision Tree to Choose a Mapping Strategy • Static number of tasks, structured communication pattern, roughly constant computation time per task: agglomerate tasks to minimize communication and create one task per processor • Static number of tasks, structured communication pattern, computation time per task varies by region: cyclically map tasks to processors to balance the computational load • Static number of tasks, unstructured communication pattern: use a static load-balancing algorithm • Dynamic number of tasks, frequent communications between tasks: use a dynamic load-balancing algorithm • Dynamic number of tasks, many short-lived tasks with no inter-task communications: use a run-time task-scheduling algorithm
Choosing the Best Design Alternative • Designs based on one task per processor and multiple tasks per processor have been considered • Both static and dynamic allocation of tasks to processors have been evaluated • If a dynamic allocation of tasks to processors has been chosen, the manager (task allocator) is not a bottleneck to performance • If a static allocation of tasks to processors has been chosen, the ratio of tasks to processors is at least 10:1
Boundary-Value Problem • A thin rod made of uniform material is surrounded by a blanket of insulation, so that temperature changes along the length of the rod are a result of heat transfer at the ends of the rod and heat conduction along the length of the rod • The rod has unit length • Both ends of the rod are exposed to an ice bath having a temperature of 0˚C, while the initial temperature at distance x from the end of the rod is 100 sin(πx)
Boundary-value Problem continued • Over time the rod gradually cools • A PDE models the temperature at any point of the rod at any point in time • The finite difference method is one way to solve a PDE on a computer
Depiction of BVP: a thin rod with an ice bath at each end; the quantities of interest are position along the rod, time, and temperature
Depiction continued: data structure used in a finite difference approximation to the rod-cooling problem. Every point u_{i,j} represents a matrix element containing the temperature at position i on the rod at time j; the spatial step is h and the time step is k. At the ends of the rod the temperature is always 0 (u = 0). At time 0, the temperature at point x is 100 sin(πx).
Finite Difference Formula u_{i,j+1} = r u_{i-1,j} + (1 - 2r) u_{i,j} + r u_{i+1,j}, where r = k / h² • The rod is divided into n sections of length h, so each row has n+1 elements • Time is divided into m discrete steps of length k, so the matrix has m+1 rows
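As a serial illustration of applying this formula (array and function names are ours, not from the slides), one time step computes each interior value of the new row from the previous row, with the ends held at 0 by the ice baths.

/* Sketch: one explicit finite-difference time step for the rod,
   with r = k / (h * h); u_old and u_new each hold n + 1 points. */
void time_step(const double *u_old, double *u_new, int n, double r) {
    u_new[0] = 0.0;                   /* ice bath at the left end  */
    u_new[n] = 0.0;                   /* ice bath at the right end */
    for (int i = 1; i < n; i++)       /* interior points only */
        u_new[i] = r * u_old[i-1] + (1.0 - 2.0 * r) * u_old[i] + r * u_old[i+1];
}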
BVP Partitioning • In this case the data being manipulated are easy to identify: • There is one data item per grid point • We associate one primitive task with each grid point • This yields a 2-D domain decomposition
BVP Communication • If a task A requires a value from task B to perform its computation, we must create a channel from task B to task A • Since the task computing u_{i,j+1} requires the values of u_{i-1,j}, u_{i,j}, and u_{i+1,j}, each task will have three incoming channels and three outgoing channels
BVP Agglomeration and Mapping • Even if enough processors were available, it would be impossible to compute every task concurrently, because of the data dependency between tasks computing rod temperatures later in time and the results produced by tasks computing rod temperatures earlier in time.
First Domain Decomposition Depiction (a) This domain decomposition associates one task with each temperature to be computed.
First and Second Agglomeration (b) A single task represents the computation of the temperature at element i for all time steps. (c) A task is responsible for computing, over all time steps, the temperatures for a contiguous group of rod locations.
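A hedged MPI sketch of how the second agglomeration (a contiguous block of rod locations per task) might look in code; the names, ghost-cell layout, and neighbor handling below are our assumptions, not code from the slides. Each process keeps its block in u[1..local_n], with ghost cells u[0] and u[local_n+1] refreshed from the left and right neighbors (MPI_PROC_NULL at the physical ends, where the ghosts simply stay at 0).

#include <mpi.h>

/* Sketch: one iteration of the agglomerated BVP design. */
void exchange_and_update(double *u, double *u_new, int local_n, double r,
                         int left, int right) {  /* neighbor ranks or MPI_PROC_NULL */
    /* send leftmost interior value to the left, receive right ghost from the right */
    MPI_Sendrecv(&u[1], 1, MPI_DOUBLE, left, 0,
                 &u[local_n + 1], 1, MPI_DOUBLE, right, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    /* send rightmost interior value to the right, receive left ghost from the left */
    MPI_Sendrecv(&u[local_n], 1, MPI_DOUBLE, right, 1,
                 &u[0], 1, MPI_DOUBLE, left, 1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* local finite-difference update on this process's interior points */
    for (int i = 1; i <= local_n; i++)
        u_new[i] = r * u[i-1] + (1.0 - 2.0 * r) * u[i] + r * u[i+1];
}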
Analysis Brief • Overall parallel execution time per iteration: χ ⌈(n - 1) / p⌉ + 2λ, where χ denotes the time to compute u_{i,j+1}, n - 1 is the number of interior values, and λ is the time to send (receive) a value to (from) another processor • Estimated parallel execution time for all m iterations: m (χ ⌈(n - 1) / p⌉ + 2λ)
N-body Problem • Some problems in physics can be solved by performing computations on all pairs of objects in a dataset • We are simulating the motion of n particles of varying mass in two dimensions • During each iteration of the algorithm we need to compute the new position and velocity vector of each particle, given the positions of all the other particles
N-body Depiction • Every particle exerts a gravitational pull on every other particle. • In this 2-D example, the clear particle has a particular position and velocity vector (indicated by the arrows). • Its future position is influenced by the gravitational forces exerted by the other two particles.
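A serial sketch (ours, not from the slides) of the O(n^2) pairwise computation that each iteration performs in 2-D; the gravitational constant, data layout, and absence of a softening term are illustrative simplifications.

#include <math.h>

typedef struct { double x, y; } vec2;

/* Sketch: accumulate the gravitational acceleration on every particle
   from every other particle (O(n^2) pair interactions). */
void accelerations(const vec2 *pos, const double *mass,
                   vec2 *acc, int n, double G) {
    for (int i = 0; i < n; i++) {
        acc[i].x = acc[i].y = 0.0;
        for (int j = 0; j < n; j++) {
            if (j == i) continue;
            double dx = pos[j].x - pos[i].x;
            double dy = pos[j].y - pos[i].y;
            double d2 = dx * dx + dy * dy;
            double inv_d = 1.0 / sqrt(d2);
            double a = G * mass[j] / d2;     /* magnitude of acceleration */
            acc[i].x += a * dx * inv_d;      /* unit vector toward particle j */
            acc[i].y += a * dy * inv_d;
        }
    }
}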
N-body Partition • In the Newtonian n-body simulation, gravitational forces have infinite range • We assume one task per particle • In order for this task to compute the new location of the particle, it must know the locations of all the other particles
N-body Communication • The gather operation takes a dataset distributed among a group of tasks and collects the items on a single task • An all-gather operation is similar, except at the end of the communication every task has a copy of the entire dataset • In this case we want to update the location of every particle, so an all-gather communication is required.
Gather and All-gather: Gather builds the concatenation of a set of data items on a single task; All-gather builds the concatenation of a set of data items on all tasks.
N-body Communication Continued • We must put a channel between every pair of tasks • During each communication step each task sends its vector element to one other task • After n-1 communication steps, each task has the positions of all the other particles and it can perform the calculations needed to determine the new location and velocity vector for its particle
N-body Communication Continued (2) • There is a faster way, found by looking bottom-up • Suppose there are two particles: if each task has a single particle, the two tasks can exchange copies of their values; each task sends one value on one channel and receives a value on another channel
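In MPI code this all-gather step is usually expressed with the MPI_Allgather collective, whose implementations typically use a logarithmic exchange like the bottom-up scheme just described; a minimal sketch (buffer names and the equal-count assumption are ours, not from the slides):

#include <mpi.h>

/* Sketch: every process contributes its local (x, y) particle positions and
   receives the positions of all particles. Assumes every process holds the
   same number of particles; otherwise MPI_Allgatherv would be needed. */
void gather_all_positions(double *my_pos, int local_count, double *all_pos) {
    MPI_Allgather(my_pos, 2 * local_count, MPI_DOUBLE,   /* 2 doubles per particle */
                  all_pos, 2 * local_count, MPI_DOUBLE,
                  MPI_COMM_WORLD);
}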