Parallel and Distributed Algorithms Spring 2010 Johnnie W. Baker
Chapter 1 Introduction and General Concepts
References • Selim Akl, Parallel Computation: Models and Methods, Prentice Hall, 1997, Chapter 1; an updated online version is available through the course website. -- primary reference • Selim Akl, "The Design of Efficient Parallel Algorithms," Chapter 2 in Handbook on Parallel and Distributed Processing, edited by J. Blazewicz, K. Ecker, B. Plateau, and D. Trystram, Springer Verlag, 2000. • Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar, Introduction to Parallel Computing, 2nd Edition, Addison Wesley, 2003. • Harry Jordan and Gita Alaghband, Fundamentals of Parallel Processing: Algorithms, Architectures, Languages, Prentice Hall, 2003. • Michael Quinn, Parallel Programming in C with MPI and OpenMP, McGraw Hill, 2004, Chapters 1-2. -- secondary reference • Michael Quinn, Parallel Computing: Theory and Practice, McGraw Hill, 1994. • Barry Wilkinson and Michael Allen, Parallel Programming, 2nd Edition, Prentice Hall, 2005.
Outline • Need for Parallel & Distributed Computing • Flynn’s Taxonomy of Parallel Computers • Two Main Types of MIMD Computers • Examples of Computational Models • Data Parallel & Functional/Control/Job Parallel • Granularity • Analysis of Parallel Algorithms • Elementary Steps: computational and routing steps • Running Time & Time Optimal • Parallel Speedup • Cost and Work • Efficiency • Linear and Superlinear Speedup • Speedup and Slowdown Folklore Theorems • Amdahl’s and Gustafson’s Laws
Reasons to Study Parallel & Distributed Computing • Sequential computers have severe limits on memory size. • Significant slowdowns occur when accessing data stored on external devices. • Sequential computation times for most large problems are unacceptable. • Sequential computers cannot meet the deadlines of many real-time problems. • Some problems are distributed in nature and are naturally suited to distributed computation.
Grand Challenge Problems Computational Science Categories (1989) • Quantum chemistry, statistical mechanics, and relativistic physics • Cosmology and astrophysics • Computational fluid dynamics and turbulence • Materials design and superconductivity • Biology, pharmacology, genome sequencing, genetic engineering, protein folding, enzyme activity, and cell modeling • Medicine, and modeling of human organs and bones • Global weather and environmental modeling
Weather Prediction • The atmosphere is divided into 3D cells. • Data include temperature, pressure, humidity, wind speed and direction, etc. • Recorded at regular time intervals in each cell. • There are about 5×10^8 cells of one cubic mile each. • Performing the calculations needed for a 10-day forecast would take a modern sequential computer over 100 days. • Details in Ian Foster’s 1995 online textbook, Designing and Building Parallel Programs. • See the author’s online copy for further information.
Flynn’s Taxonomy • Best-known classification scheme for parallel computers. • Classifies a computer by the parallelism it exhibits in its • Instruction stream • Data stream • A sequence of instructions (the instruction stream) manipulates a sequence of operands (the data stream). • The instruction stream (I) and the data stream (D) can each be either single (S) or multiple (M). • Four combinations: SISD, SIMD, MISD, MIMD
SISD • Single Instruction, Single Data • Single-CPU systems • i.e., uniprocessors • Note: co-processors don’t count • Concurrent processing allowed • Instruction prefetching • Pipelined execution of instructions • Primary Example: PCs
SIMD • Single instruction, multiple data • One instruction stream is broadcast to all processors. • Each processor (also called a processing element or PE) is very simple and is essentially an ALU. • PEs do not store a copy of the program nor have a program control unit. • Individual processors can be inhibited from participating in an instruction (based on a data test).
SIMD (cont.) • All active processors execute the same instruction synchronously, but on different data. • On a memory access, all active processors must access the same location in their local memory. • The data items form an array, and an instruction can act on the complete array in one cycle.
SIMD (cont.) • Quinn (2004) calls this architecture a processor array • Two examples are Thinking Machines’ • Connection Machine CM2 • Connection Machine CM200 • Also, MasPar’s MP1 and MP2 are examples • Quinn also considers a pipelined vector processor to be a SIMD • Somewhat non-standard. • An example is the Cray-1
MISD • Multiple instruction, single data • Quinn (2004) argues that a systolic array is an example of a MISD structure (pg 55-57) • Some authors include pipelined architecture in this category • This category does not receive much attention from most authors
MIMD • Multiple instruction, multiple data • Processors are asynchronous, since they can independently execute different programs on different data sets. • Communication is handled either by message passing or through shared memory. • Considered by most researchers to contain the most powerful, least restricted computers.
MIMD (cont.) • MIMDs have major communication costs that are usually ignored when comparing them to SIMDs or when computing algorithmic complexity. • A common way to program MIMDs is for all processors to execute the same program. • Called the SPMD method (for single program, multiple data). • The usual method when the number of processors is large.
Multiprocessors (Shared Memory MIMDs) • All processors have access to all memory locations. • The processors access memory through some type of interconnection network. • If every memory location can be reached in the same amount of time, this is called uniform memory access (UMA). • Most multiprocessors have hierarchical or distributed memory systems in order to provide fast access to all memory locations. • This type of memory access is called nonuniform memory access (NUMA).
Multiprocessors (cont.) • Normally, fast cache is used with NUMA systems to reduce the problem of different memory access times for PEs. • This creates the problem of ensuring that all copies of the same data in different memory locations are identical. • Symmetric Multiprocessors (SMPs) are a popular example of shared memory multiprocessors.
Multicomputers (Message-Passing MIMDs) • Processors are connected by an interconnection network. • Each processor has a local memory and can only access its own local memory. • Data is passed between processors using messages, as dictated by the program. • A common approach to programming multicomputers is to use a message-passing library (e.g., MPI, PVM); see the sketch below. • The problem is divided into processes that can be executed concurrently on individual processors. Each processor is normally assigned multiple processes.
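As a concrete illustration of the message-passing style, here is a minimal sketch in C using MPI (one of the libraries named above). It shows only the explicit send/receive calls the slide refers to; the integer payload and tag value are arbitrary choices for this example, not part of the slides.

```c
/* Minimal message-passing sketch: process 0 sends one integer to
   process 1, which receives and prints it.  Compile with mpicc and
   run with at least two processes, e.g.  mpirun -np 2 ./send_recv   */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;                       /* datum held only in P0's local memory */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("process 1 received %d from process 0\n", value);
    }

    MPI_Finalize();
    return 0;
}
```

Note that the datum is copied between the two local memories rather than shared, which is exactly the disadvantage discussed on the next slide.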
Multicomputers (cont. 2/3) • Programming disadvantages of message passing • Programmers must make explicit message-passing calls in the code. • This is low-level programming and is error prone. • Data is not shared but copied, which increases the total data size. • Data integrity: difficulty in maintaining the correctness of multiple copies of a data item.
Multicomputers (cont. 3/3) • Programming advantages of message-passing • No problem with simultaneous access to data. • Allows different PCs to operate on the same data independently. • Allows PCs on a network to be easily upgraded when faster processors become available.
Data Parallel Computation • All tasks (or processors) apply the same set of operations to different data. • Example: for i ← 0 to 99 do a[i] ← b[i] + c[i] endfor • SIMDs always use data parallel computation. • All active processors execute the operations synchronously.
Data Parallel Computation for MIMDs • A common way to support data parallel programming on MIMDs is to use SPMD (single program, multiple data) programming, as follows: • Processors (or tasks) execute the same block of instructions concurrently but asynchronously. • No communication or synchronization occurs within these concurrent instruction blocks. • Each instruction block is normally followed by synchronization and communication steps. • If processors have identical tasks (with different data), this method allows them to be executed in a data parallel fashion. A sketch of this pattern follows.
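The following is a minimal sketch of the SPMD pattern just described, written in C with MPI. The problem size N, the assumption that it divides evenly among the processes, and the use of MPI_Reduce as the communication step are choices made only for illustration.

```c
/* SPMD sketch: every process runs this same program, applies the same
   operations to its own block of the data (data parallel), then meets
   at a communication/synchronization step.                           */
#include <mpi.h>
#include <stdio.h>

#define N 100   /* assumed total problem size; assumed divisible by the process count */

int main(int argc, char *argv[]) {
    int rank, np;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &np);

    double a[N], b[N], c[N], local = 0.0, total = 0.0;
    for (int i = 0; i < N; i++) { b[i] = i; c[i] = 2.0 * i; }

    /* Concurrent, asynchronous block: same instructions, different data. */
    int chunk = N / np;
    for (int i = rank * chunk; i < (rank + 1) * chunk; i++) {
        a[i] = b[i] + c[i];
        local += a[i];
    }

    /* Communication/synchronization step that follows the block. */
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("sum of a[i] over all processes = %g\n", total);

    MPI_Finalize();
    return 0;
}
```

Run with, e.g., mpirun -np 4: each rank computes a different quarter of the array, but all ranks execute the same program, which is exactly the SPMD idea.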
Functional/Control/Job Parallelism • Involves executing different sets of operations on different data sets. • Typical MIMD control parallelism • Problem is divided into different non-identical tasks • Tasks are distributed among the processors • Tasks are usually divided between processors so that their workload is roughly balanced • Each processor follows some scheme for executing its tasks. • Frequently, a task scheduling algorithm is used. • If the task being executed has to wait for data, the task is suspended and another task is executed.
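One common way to express this kind of control parallelism on a shared-memory MIMD is with OpenMP sections. The sketch below is only illustrative: the three task functions are hypothetical placeholders, not tasks taken from the slides, and a real system would add a scheduler that suspends tasks waiting for data.

```c
/* Control/functional parallelism sketch: three non-identical tasks are
   handed to different threads/processors.
   Compile with, e.g.,  gcc -fopenmp tasks.c                           */
#include <stdio.h>
#include <omp.h>

static void read_sensors(void)  { printf("reading on thread %d\n",  omp_get_thread_num()); }
static void update_model(void)  { printf("modeling on thread %d\n", omp_get_thread_num()); }
static void log_results(void)   { printf("logging on thread %d\n",  omp_get_thread_num()); }

int main(void) {
    #pragma omp parallel sections
    {
        #pragma omp section
        read_sensors();     /* task 1 */
        #pragma omp section
        update_model();     /* task 2 */
        #pragma omp section
        log_results();      /* task 3 */
    }
    return 0;
}
```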
Grain Size • Quinn defines grain size to be the average number of computations performed between communication or synchronization steps. • Quinn (2004), pg 411 • Data parallel programming usually results in smaller grain size computation. • SIMD computation is considered to be fine-grain. • MIMD data parallelism is usually considered to be medium-grain. • Generally, increasing the grain size increases the performance of MIMD programs.
Examples of Parallel Models • Synchronous PRAM (Parallel RAM) • Each processor stores a copy of the program. • All execute instructions simultaneously on different data. • Generally, the active processors execute the same instruction. • The full power of the PRAM allows processors to execute different instructions simultaneously. • Here, “simultaneously” means they execute using the same clock. • All processors share the same (common) memory. • Data exchanges occur through the shared memory. • All processors have I/O capabilities. • See Akl, Figure 1.2
Examples of Parallel Models (cont) • Binary Tree of Processors (Akl, Fig. 1.3) • No shared memory. • Memory is distributed equally among the PEs. • PEs communicate through the binary tree network. • Each PE stores a copy of the program. • PEs can execute different instructions. • Normally synchronous, but asynchronous operation is also possible. • Only the leaf and root PEs have I/O capabilities.
Examples of Parallel Models (cont) • Symmetric Multiprocessors (SMPs) • Asynchronous processors of the same size. • Shared memory • Memory locations are equally distant from each PE in smaller machines. • Formerly called “tightly coupled multiprocessors”.
Analysis of Parallel Algorithms • Elementary Steps: • A computational step is a basic arithmetic or logic operation • A routing step is used to move data of constant size between PEs using either shared memory or a link connecting the two PEs.
Communication Time Analysis • Worst Case • Usually the default for synchronous computation. • Difficult to use for asynchronous computation. • Average Case • Usual default for asynchronous computation. • Only occasionally used with synchronous computation. • It is important to distinguish which is being used. • Often “worst case” synchronous timings are compared to “average case” asynchronous timings. • The default in Akl’s text is “worst case”. • The default in the Casanova et al. text is probably “average case”.
Shared Memory Communication Time • A communication through shared memory requires two memory accesses: • P(i) writes a datum to a memory cell. • P(j) reads the datum from that memory cell. • Uniform Analysis • Charge constant time for a memory access. • Unrealistic, as shown for the PRAM in Chapter 2 of Akl’s book. • Non-Uniform Analysis • Charge O(lg M) cost for a memory access, where M is the number of memory cells. • More realistic, based on how memory is actually accessed. • It is traditional to use the uniform analysis. • Considering a second variable M makes the analysis complex. • Most non-constant time algorithms have the same complexity under either the uniform or the non-uniform analysis.
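As a small illustration of the difference between the two charging schemes, consider an algorithm that performs k shared-memory accesses (a hypothetical count used only for comparison):

```latex
\[
  T_{\text{uniform}}(k) = O(k), \qquad
  T_{\text{non-uniform}}(k) = O(k \lg M),
\]
```

so the two analyses differ by at most a factor of lg M, which is why they usually agree up to logarithmic factors.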
Communication through Links • Assumes constant time to communicate along one link. • An earlier paper (mid-1990s?) estimates a communication step along a link to be on the order of 10-100 times slower than a computational step. • When two processors are not connected, routing a datum between them requires O(m) time units, where m is the number of links along the path used.
Running/Execution Time • Running time of an algorithm • The time between when the first processor starts executing and when the last processor terminates. • Either worst case or average case can be used. • Often it is not stated which is used. • Normally worst case for synchronous and average case for asynchronous computation. • An upper bound is given by the number of steps of the fastest algorithm known for the problem. • The lower bound gives the fewest number of steps possible to solve a worst case instance of the problem. • Time optimal algorithm: • An algorithm in which the number of steps is of the same complexity as the lower bound for the problem. • The Casanova et al. book uses the term efficient instead.
Speedup • A measure of the decrease in running time due to parallelism. • Let t1 denote the running time of the fastest (possible/known) sequential algorithm for the problem. • If the parallel algorithm uses p PEs, let tp denote its running time. • The speedup of the parallel algorithm using p processors is defined by S(1,p) = t1 / tp. • The goal is to obtain the largest speedup possible. • For worst (average) case speedup, both t1 and tp should be worst (average) case, respectively.
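A small worked instance of the definition, with timings chosen purely for illustration, together with the efficiency measure listed in the outline (taking efficiency to be speedup divided by the number of processors):

```latex
\[
  t_1 = 100\ \text{s},\quad p = 8,\quad t_p = 16\ \text{s}
  \;\Longrightarrow\;
  S(1,p) = \frac{t_1}{t_p} = \frac{100}{16} = 6.25,
  \qquad
  E(1,p) = \frac{S(1,p)}{p} = \frac{6.25}{8} \approx 0.78 .
\]
```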
Optimal Speedup • Theorem: The maximum possible speedup for parallel computers with n PEs for “traditional” problems is n. • Proof (for “traditional” problems): • Assume a computation is partitioned perfectly into n processes of equal duration. • Assume no overhead is incurred as a result of this partitioning of the computation (e.g., the partitioning process, information passing, coordination of processes, etc.). • Under these ideal conditions, the parallel computation will execute n times faster than the sequential computation, whose running time is denoted ts. • The parallel running time is ts /n. • Then the parallel speedup of this computation is S(1,n) = ts /(ts /n) = n.
Linear Speedup • The preceding slide argues that S(1,n) ≤ n and that optimally S(1,n) = n. • This speedup is called linear since S(1,n) = Θ(n). • The next slide formalizes this statement and provides an alternate proof.
Speedup Folklore Theorem • Theorem: For a traditional problem, the maximum speedup possible for a parallel algorithm using p processors is at most p; that is, S(1,p) ≤ p. • An Alternate Proof (for “traditional” problems): • Let t1 and tp be as above, and assume t1/tp > p. • This parallel algorithm can be simulated by a sequential computer that executes each parallel step with p serial steps. • The resulting sequential algorithm runs in p·tp time, which is faster than the fastest sequential algorithm, since p·tp < t1. • This contradiction establishes the “theorem”. Note: We will see that this result is not valid for some non-traditional problems.
Speedup Usually Suboptimal • Unfortunately, the best speedup possible for most applications is much less than n, as • Assumptions in first proof are usually invalid. • Usually some portions of algorithm are sequential, with only one PE active. • Other portions of algorithm may allow a non-constant number of PEs to be idle for more than a constant number of steps. • E.g., during parts of the execution, many PEs may be waiting to receive or to send data.
Speedup Example • Example: Compute the sum of n numbers using a binary tree with n leaves. • It takes tp = lg n steps to compute the sum in the root node. • It takes t1 = n-1 steps to compute the sum sequentially. • The number of processors is p = 2n-1, so for n > 1, S(1,p) = (n-1)/(lg n) ≈ n/(lg n) < n < 2n-1 = p. • The speedup is not optimal.
Optimal Speedup Example • Example: Computing the sum of n numbers optimally using a binary tree. • Assume the binary tree has n/(lg n) leaves. • Each leaf is given lg n numbers and first computes the sum of these numbers sequentially in Θ(lg n) time. • The overall sum of the values in the n/(lg n) leaves is found next, as in the preceding example, in lg(n/lg n) = Θ(lg n) time. • Then tp = Θ(lg n) and S(1,p) = Θ(n/lg n) = Θ(p), since p = 2(n/lg n) - 1. • The speedup of this parallel algorithm is optimal (to within a constant factor), as the Speedup Folklore Theorem guarantees S(1,p) ≤ p for traditional problems like this one. A shared-memory sketch of this two-phase idea appears below.
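The sketch below mirrors the two-phase idea in C with OpenMP rather than on a literal binary-tree machine: the reduction clause gives each thread a private partial sum over its own block (phase 1, local sequential sums), and the partial sums are then combined by the runtime, possibly in a tree (phase 2). The problem size and data values are arbitrary choices for the example.

```c
/* Two-phase parallel sum sketch.  Compile with, e.g.,
   gcc -fopenmp -O2 sum.c                                             */
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    const int n = 1 << 20;                     /* assumed problem size */
    double *x = malloc(n * sizeof *x);
    for (int i = 0; i < n; i++) x[i] = 1.0;    /* dummy data: sum should be n */

    double total = 0.0;
    /* Each thread accumulates a private partial sum over its block,
       then the partial sums are combined by the reduction. */
    #pragma omp parallel for reduction(+:total)
    for (int i = 0; i < n; i++)
        total += x[i];

    printf("sum = %.0f (expected %d)\n", total, n);
    free(x);
    return 0;
}
```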
Superlinear Speedup • Superlinear speedup occurs when S(1,n) > n. • Most texts other than Akl’s argue that • linear speedup is the maximum speedup obtainable. • The preceding “optimal speedup proof” is used to argue that superlinearity is impossible. • Linear speedup is optimal for traditional problems. • Those stating that superlinearity is impossible claim that any ‘superlinear behavior’ occurs due to the following types of natural reasons: • the extra memory in the parallel system; • a sub-optimal sequential algorithm was used; • random luck, e.g., in the case of an algorithm that has a random aspect in its design (e.g., random selection).
Superlinearity (cont) • Some problems cannot be solved without the use of parallel computation. • It seems reasonable to consider parallel solutions to these to be “superlinear”. • The speedup would be infinite. • Examples include many large software systems: • Data too large to fit in the memory of a sequential computer. • Problems containing deadlines that must be met, where the required computation is too great for a sequential computer to meet the deadlines. • Those not accepting superlinearity discount these, as indicated on the previous slide.
Superlinearity (cont.) • The textbook by Casanova et al. discusses superlinearity primarily on pages 11-12 and in Exercise 4.3 on page 140. • It accepts that superlinearity is possible. • It indicates natural causes (e.g., larger memory) as the primary (perhaps only) reason for superlinearity. • It treats cases of infinite speedup (where a sequential processor cannot do the job) as an anomaly. • E.g., two men can carry a piano upstairs, but one man cannot do this, even if given unlimited time.
Some Non-Traditional Problems • Problems with any of the following characteristics are “non-traditional”: • Input may arrive in real time. • Input may vary over time. • For example, find the convex hull of a set of points, where points can be added and removed over time. • Output may affect future input. • Output may have to meet certain deadlines. • See the discussion on pages 16-17 of Akl’s text. • See Chapter 12 of Akl’s text for additional discussion.
Other Superlinear Examples • Selim Akl has given a large number of examples of different types of superlinear algorithms. • Example 1.17 in Akl’s online textbook (pg 17). • See last Chapter (Ch 12) of his online text. • See his 2007 book-chapter titled “Evolving Computational Systems” • To be posted to website shortly. • Some of his conference & journal publications
A Superlinearity Example • Example: Tracking • 6,000 streams of radar data arrive during each 0.5-second interval, and each corresponds to an aircraft. • Each must be checked against the “projected position box” for each of 4,000 tracked aircraft. • The position of each plane that is matched is updated to correspond with its radar location. • This process must be repeated every 0.5 seconds. • A sequential processor cannot process data this fast.
Slowdown Folklore Theorem: If a computation can be performed with p processors in time tp, and with q processors in time tq, where q < p, then tp ≤ tq ≤ (1 + p/q) tp. Comments: • The new run time tq is O(tp·p/q). • This result puts a tight bound on the amount of extra time required to execute an algorithm with fewer PEs. • It establishes a smooth degradation of running time as the number of processors decreases. • For q = 1, this is essentially the speedup theorem, but with a less sharp constant. • Similar to Proposition 1.1, pg 11, in the Casanova et al. textbook.
Proof: (for traditional problems) • Let Wi denote the number of elementary steps performed collectively during the i-th time period of the algorithm (using p PEs). • Then W = W1 + W2 + ... + Wk (where k = tp) is the total number of elementary steps performed collectively by the processors. • Since all p processors are not necessarily busy all the time, W ≤ p·tp. • With q processors and q < p, we can simulate the Wi elementary steps of the i-th time period by having each of the q processors do ⌈Wi/q⌉ elementary steps. • The time taken for the simulation is tq ≤ ⌈W1/q⌉ + ... + ⌈Wk/q⌉ ≤ (W1/q + 1) + ... + (Wk/q + 1) ≤ tp + p·tp/q.
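A numeric instance of the bound, with values chosen only for illustration:

```latex
\[
  p = 8,\; q = 2,\; t_p = 10
  \;\Longrightarrow\;
  t_q \;\le\; t_p + \frac{p\,t_p}{q} \;=\; 10 + \frac{8 \cdot 10}{2} \;=\; 50
  \;=\; \Bigl(1 + \frac{p}{q}\Bigr) t_p .
\]
```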
A Non-Traditional Example for “Folklore Theorems” • Problem Statement: (from Akl, Example 1.17) • n sensors are each used to repeat the same set {s1, s2, ..., sn} of measurements n times at an environmental site. • It takes each sensor n time units to make each measurement. • Each si is measured in turn by another sensor, producing n distinct cyclic permutations of the sequence S = (s1, s2, ..., sn). • The sequences produced by 3 sensors are (s1, s2, s3), (s3, s1, s2), (s2, s3, s1).
Nontraditional Example (cont.) • The stream of data from each sensor is sent to a different processor, with all corresponding si values in each stream arriving at the same time. • The data from a sensor is consistently sent to the same processor. • Each si must be read and stored in unit time, or the stream is terminated. • Additionally, the minimum of the n new data values that arrive during each n steps must be computed, validated, and stored. • If the minimum is below a safe value, the process is stopped. • The overall minimum of the sensor reports over all n steps is also computed and recorded.
Sequential Processor Solution • A single processor can only monitor the values in one data stream. • After the first value of a selected stream is stored, it is too late to process the other n-1 items that arrived at the same time in the other streams. • A stream remains active only if its preceding value was read. • Note that this is the best a sequential processor can do to solve the problem. • We assume that a processor can read a value, compare it to the smallest previous value received, and update its “smallest value” in one unit of time. • Sequentially, the computation takes n computational steps plus n(n-1) time units of delay between arrivals, or exactly n^2 time units to process all n streams. • Summary: Calculating the first minimum sequentially takes n^2 time units.