Explore the benefits and applications of parallel computing in solving complex problems faster, improving precision, and gaining competitive advantage. Learn about Moore's Law, Flynn’s Taxonomy, and the categories of computational science challenges.
Parallel Real-Time Systems Parallel Computing Overview
References • Website for Parallel & Distributed Computing: www.cs.kent.edu/~jbaker/PDC-F08/ • The slides here are primarily a subset of the “Introduction to Parallel Computing” slide set above. • Michael Quinn, Parallel Programming in C with MPI and OpenMP, McGraw Hill, 2004. • See posted information. • Must access using your cs.kent.edu account • Selim Akl, “Parallel Computation: Models and Methods”, Prentice Hall, 1997. Updated online version available at http://www.cs.kent.edu/~jbaker/Akl-Book.pdf • Password will be given in class • Ian Foster, Designing and Building Parallel Programs. Free online copy posted at this site at Argonne Natl Labs
Outline • Why use parallel computing • Moore’s Law • Modern parallel computers • Flynn’s Taxonomy • Seeking Concurrency • Programming parallel computers
Why Use Parallel Computers • Solve compute-intensive problems faster • Make infeasible problems feasible • Reduce design time • Solve larger problems in same amount of time • Improve answer’s precision • Reduce design time • Increase memory size • More data can be kept in memory • Dramatically reduces the slowdown caused by accessing external storage • Gain competitive advantage
1989 Grand Challenges to Computational Science Categories • Quantum chemistry, statistical mechanics, and relativistic physics • Cosmology and astrophysics • Computational fluid dynamics and turbulence • Materials design and superconductivity • Biology, pharmacology, genome sequencing, genetic engineering, protein folding, enzyme activity, and cell modeling • Medicine, and modeling of human organs and bones • Global weather and environmental modeling
Weather Prediction • Atmosphere is divided into 3D cells • Data includes temperature, pressure, humidity, wind speed and direction, etc. • Recorded at regular time intervals in each cell • There are about 5×10^3 cells of 1 mile cubes. • A modern computer would need more than 10 days to perform the calculations for a 10-day forecast (100 days in 2003) • Details in Ian Foster’s 1995 online textbook • Designing and Building Parallel Programs • Included in Parallel Reference List, which will be posted on website.
Moore’s Law • In 1965, Gordon Moore [87] observed that the density of chips doubled every year. • That is, the chip size is being halved yearly. • This is an exponential rate of increase. • By the late 1980’s, the doubling period had slowed to 18 months. • Reduction of the silicon area causes speed of the processors to increase. • Moore’s law is sometimes stated: “The processor speed doubles every 18 months”
[Figure: Microprocessor Revolution and Moore's Law. Speed (log scale) versus time for supercomputers, mainframes, minis, and micros.]
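The doubling periods quoted for Moore's Law imply exponential growth. A minimal sketch of that arithmetic follows; the function name `growth_factor` is made up for this illustration and is not from the slides.

```python
def growth_factor(years, doubling_period_years):
    """Return how many times larger a quantity becomes after `years`
    if it doubles once every `doubling_period_years`."""
    return 2 ** (years / doubling_period_years)


# Doubling every 18 months (1.5 years) gives a factor of 4 after 3 years.
three_year_gain = growth_factor(3, 1.5)
# Doubling every year gives a factor of 1024 after 10 years.
ten_year_gain = growth_factor(10, 1)
```

The same function covers both the yearly doubling Moore observed in 1965 and the 18-month period of the late 1980's; only the second argument changes.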
Some Definitions • Concurrent – Sequential events or processes which seem to occur or progress at the same time. • Parallel – Events or processes which occur or progress at the same time. • Parallel computing provides simultaneous execution of operations within a single parallel computer. • Distributed computing provides simultaneous execution of operations across a number of systems.
Flynn’s Taxonomy • Best known classification scheme for parallel computers. • Classifies a computer by the parallelism it exhibits in its • Instruction stream • Data stream • A sequence of instructions (the instruction stream) manipulates a sequence of operands (the data stream) • The instruction stream (I) and the data stream (D) can be either single (S) or multiple (M) • Four combinations: SISD, SIMD, MISD, MIMD
SISD • Single Instruction, Single Data • A sequential computer is the primary example • i.e., uniprocessors • Note: co-processors don’t count as more processors • Concurrent processing allowed • Instruction prefetching • Pipelined execution of instructions • Independent concurrent tasks can execute different sequences of operations.
SIMD • Single instruction, multiple data • One instruction stream is broadcast to all processors • Each processor, also called a processing element (or PE), is very simple and is essentially an ALU • PEs do not store a copy of the program nor have a program control unit. • Selected processors can be prevented from executing a block of instructions • Handled using a data test.
SIMD (cont.) • All active processors execute the same instruction synchronously, but on different data • On a memory access, all active processors must access the same location in their local memory. • The data items form an array (or vector) and an instruction can act on the complete array in one cycle.
SIMD (cont.) • Quinn calls this architecture a processor array. • Examples include • The STARAN and MPP (Dr. Batcher, architect) • Connection Machine CM2 (built by Thinking Machines).
How to View a SIMD Machine • Think of soldiers all in a unit. • The commander selects certain soldiers as active. • For example, every even numbered row. • The commander barks out an order to all the active soldiers, who execute the order synchronously.
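The commander-and-soldiers picture can be sketched as a small sequential simulation. This is only an illustration, not how a real processor array is programmed: `simd_step`, `data`, and `mask` are hypothetical names, and the list comprehension stands in for PEs acting in lockstep.

```python
def simd_step(op, local_data, active):
    """Broadcast one instruction `op` to every PE.

    Active PEs apply it to their own local value; inactive PEs
    keep their value unchanged.
    """
    return [op(x) if on else x for x, on in zip(local_data, active)]


# One local value per processing element.
data = [3, 8, 5, 10, 7, 2]
# Data test: a PE stays active only if its own value is odd.
mask = [x % 2 == 1 for x in data]
# The single broadcast instruction is "double your value".
result = simd_step(lambda x: x * 2, data, mask)
```

Only the PEs holding odd values execute the doubling; the others sit out that instruction, which is the role the data test plays on a SIMD machine.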
MISD • Multiple instruction streams, single data stream • Primarily corresponds to multiple redundant computation, say for reliability. • Quinn argues that a systolic array is an example of a MISD structure (pg 55-57) • Some authors include pipelined architecture in this category • This category does not receive much attention from most authors, so we won’t discuss it further.
MIMD • Multiple instruction, multiple data • Processors are asynchronous and can independently execute different programs on different data sets. • Communications are handled either • through shared memory (multiprocessors) • by use of message passing (multicomputers) • MIMDs are considered by most researchers to include the most powerful, least restricted computers.
MIMD (cont. 2/4) • Have major communication costs • When compared to SIMDs • Internal ‘housekeeping activities’ are often overlooked • Maintaining distributed memory & distributed databases • Synchronization or scheduling of tasks • Load balancing between processors • The SPMD method of programming MIMDs • All processors execute the same program. • SPMD stands for single program, multiple data. • An easy programming method to use when the number of processors is large. • While processors have the same code, they can each be executing different parts at any point in time.
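A minimal SPMD sketch, assuming Python threads as stand-ins for processors (they illustrate the structure but are not truly parallel for CPU-bound work in CPython). The names `spmd_program` and `run_spmd` and the cyclic data distribution are choices made for this example only.

```python
import threading


def spmd_program(rank, size, data, partial):
    """The single program every worker runs; `rank` picks its share."""
    chunk = data[rank::size]      # cyclic distribution of the data
    partial[rank] = sum(chunk)    # each worker writes only its own slot


def run_spmd(size, data):
    """Start `size` workers on the same program and collect their results."""
    partial = [0] * size
    workers = [
        threading.Thread(target=spmd_program, args=(r, size, data, partial))
        for r in range(size)
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return partial


partial_sums = run_spmd(4, list(range(100)))
```

Every worker has the same code, but because the workers are not synchronized they may be at different statements at any moment, which is what separates SPMD from SIMD.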
MIMD (cont 3/4) • Currently a more common technique for programming MIMDs is to use multi-tasking • The problem solution is broken up into various tasks. • Tasks are distributed among processors initially. • If new tasks are produced during execution, these may be handled by the parent processor or distributed • Each processor can execute its collection of tasks concurrently. • If some of its tasks must wait for results from other tasks or new data, the processor will focus on the remaining tasks. • Larger programs usually require a load balancing algorithm to rebalance tasks between processors • Dynamic scheduling algorithms may be needed to assign a higher execution priority to time-critical tasks • E.g., on critical path, more important, earlier deadline, etc.
MIMD (cont 4/4) • Recall, there are two principal types of MIMD computers: • Multiprocessors (use shared memory) • Multicomputers (use message passing) • Both are important and will be covered in greater detail next.
Multiprocessors (Shared Memory MIMDs) • There are two types • Centralized Multiprocessors • Also called UMA (Uniform Memory Access) • Also called Symmetric Multiprocessors or SMPs • Distributed Multiprocessors • Also called NUMA (Nonuniform Memory Access)
Centralized Multiprocessors (SMPs) • Consist of identical CPUs connected by a bus to a common block of memory. • Each processor requires the same amount of time to access memory. • Usually limited to a few dozen processors due to memory bandwidth. • Each processor uses a cache to minimize the number of memory accesses • Creates a cache coherency problem. • SMPs and “clusters of SMPs” are currently popular
Distributed Multiprocessors (or NUMA) • Has a distributed memory system • Each memory location has the same address for all processors. • Access time to a given memory location varies considerably for different CPUs. • Normally uses fast cache to reduce the problem of differing memory access times among processors. • Creates the problem of ensuring all copies of the same data in different memory locations are identical.
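The combination of priorities and load balancing can be sketched as a simple greedy list scheduler. This is one possible policy chosen for illustration, not the scheduler of any particular system; the function `schedule` and the task tuples are hypothetical.

```python
import heapq


def schedule(tasks, num_procs):
    """Greedy list scheduling.

    tasks: list of (priority, name, duration); a lower priority value
    means more time-critical. Returns ({name: (processor, start_time)},
    finish_time).
    """
    ready = sorted(tasks)                        # most urgent task first
    procs = [(0, p) for p in range(num_procs)]   # (time when free, processor id)
    heapq.heapify(procs)
    placement = {}
    for _priority, name, duration in ready:
        free_at, p = heapq.heappop(procs)        # processor that frees up first
        placement[name] = (p, free_at)
        heapq.heappush(procs, (free_at + duration, p))
    finish = max(t for t, _ in procs)
    return placement, finish


example_tasks = [(2, "b", 3), (1, "a", 5), (3, "c", 2), (4, "d", 4)]
placement, finish = schedule(example_tasks, 2)
```

Urgent tasks are placed first, and each task goes to whichever processor becomes free earliest, so the workload stays roughly balanced without a separate rebalancing pass.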
Multicomputers (Message-Passing MIMDs) • Processors are connected by a network • Normally an interconnection network • Discussed later in Chapter • Each processor has a local memory and can only access its own local memory. • Data is passed between processors using messages, when specified by the program.
Multicomputers (cont) • Message passing between processors is controlled by a message passing language • e.g., MPI, PVM • The problem is divided into processes or tasks that can be executed concurrently on individual processors. • Each processor is normally assigned multiple tasks.
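The message-passing style can be imitated with Python threads and a queue acting as the root's mailbox. This is a stand-in for illustration only: it does not use MPI or PVM, and `worker` and `root_gather` are names invented for the sketch.

```python
import queue
import threading


def worker(rank, local_data, root_mailbox):
    """A worker touches only its own local data; results leave as a message."""
    root_mailbox.put((rank, sum(local_data)))    # explicit send to the root


def root_gather(chunks):
    """Give each worker one private chunk, then receive one message per worker."""
    mailbox = queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(r, chunk, mailbox))
        for r, chunk in enumerate(chunks)
    ]
    for t in threads:
        t.start()
    # Blocking receive: messages may arrive in any order.
    received = dict(mailbox.get() for _ in chunks)
    for t in threads:
        t.join()
    return sum(received.values()), received


total, per_worker = root_gather([[1, 2], [3, 4], [5]])
```

No worker reads another worker's data directly; everything shared travels through explicit put and get calls, which mirrors the explicit send and receive calls a message-passing program must make.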
Multiprocessors vs Multicomputers • Programming disadvantages of message-passing • Programmers must make explicit message-passing calls in the code • This is low-level programming and is error prone. • Data is not shared but copied into private memories, increasing the total data size. • Data Integrity: difficulty in maintaining correctness of multiple copies of data item.
Multiprocessors vs Multicomputers (cont) • Programming advantages of message-passing • No problem with simultaneous access to data. • Allows different PCs to operate on the same data independently. • Allows PCs on a network to be easily upgraded when faster processors become available. • Mixed “distributed shared memory” systems exist • An example is a cluster of SMPs.
Types of Parallel Execution • Data parallelism • Control/Job/Functional parallelism • Pipelining • Virtual parallelism
Data Parallelism • All tasks (or processors) apply the same set of operations to different data. • Example: for i ← 0 to 99 do a[i] ← b[i] + c[i] endfor • Operations may be executed concurrently • Accomplished on SIMDs by having all active processors execute the operations synchronously. • Can be accomplished on MIMDs by assigning 100/p tasks to each processor and having each processor calculate its share asynchronously.
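The MIMD version of the vector-add example can be sketched by splitting the index range into p contiguous blocks. The sketch assumes a thread pool as a stand-in for p processors; `add_block` and `parallel_add` are illustrative names.

```python
from concurrent.futures import ThreadPoolExecutor


def add_block(b, c, lo, hi):
    """Compute a[i] = b[i] + c[i] for lo <= i < hi."""
    return [b[i] + c[i] for i in range(lo, hi)]


def parallel_add(b, c, p):
    """Split the index range into p nearly equal blocks, one per worker."""
    n = len(b)
    bounds = [(k * n // p, (k + 1) * n // p) for k in range(p)]
    with ThreadPoolExecutor(max_workers=p) as pool:
        # map returns the blocks in submission order, so they concatenate correctly.
        blocks = pool.map(lambda r: add_block(b, c, *r), bounds)
        return [x for block in blocks for x in block]


b = list(range(100))
c = [2 * i for i in range(100)]
a = parallel_add(b, c, 4)
```

With 100 elements and p = 4, each worker handles 25 additions; when p does not divide n evenly, the integer arithmetic in `bounds` gives block sizes that differ by at most one.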
Supporting MIMD Data Parallelism • SPMD (single program, multiple data) programming is not really data parallel execution, as processors typically execute different sections of the program concurrently. • Data parallel programming can be strictly enforced when using SPMD as follows: • Processors execute the same block of instructions concurrently but asynchronously • No communication or synchronization occurs within these concurrent instruction blocks. • Each instruction block is normally followed by a synchronization and communication block of steps
MIMD Data Parallelism (cont.) • Strict data parallel programming is unusual for MIMDs, as the processors usually execute independently.
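The pattern of a communication-free compute block followed by a synchronization block can be sketched with a barrier. The averaging computation, the two shared buffers, and the name `jacobi_spmd` are assumptions made for this illustration; threads stand in for MIMD processors.

```python
import threading


def jacobi_spmd(values, p, steps):
    """Run `steps` neighbor-averaging sweeps over `values` with p workers."""
    n = len(values)
    old, new = list(values), list(values)   # end points stay fixed in both
    barrier = threading.Barrier(p)

    def worker(rank):
        nonlocal old, new
        lo, hi = rank * n // p, (rank + 1) * n // p
        for _ in range(steps):
            # Compute block: reads `old`, writes only this worker's part of `new`.
            for i in range(max(lo, 1), min(hi, n - 1)):
                new[i] = (old[i - 1] + old[i + 1]) / 2
            # Synchronization block: wait for everyone, let one worker swap buffers.
            if barrier.wait() == 0:
                old, new = new, old
            barrier.wait()

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(p)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return old


after_two_sweeps = jacobi_spmd([0, 0, 0, 0, 8], 2, 2)
```

Inside the compute block no worker waits on another, so the workers run asynchronously; all coordination is confined to the two barrier calls that end each step.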
Job Parallelism Features • Also called control parallelism • Problem is divided into different non-identical tasks • Tasks are divided between the processors so that their workload is roughly balanced
Data Dependence Graph • Can be used to identify data parallelism and job parallelism. • Most realistic jobs contain both types of parallelism • Can be viewed as branches in data parallel tasks • If there is no path from vertex u to vertex v in either direction, then job parallelism can be used to execute the tasks u and v concurrently. • If larger tasks can be subdivided into smaller identical tasks, data parallelism can be used to execute these concurrently.
For example, “mow lawn” becomes • Mow N lawn • Mow S lawn • Mow E lawn • Mow W lawn • If 4 people are available to mow, then data parallelism can be used to do these tasks simultaneously. • Similarly, if several people are available to “edge lawn” and “weed garden”, then we can use data parallelism to provide more concurrency.
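The path test on a data dependence graph can be sketched with a depth-first search. The graph below is hypothetical: the "fetch tools" and "turn on sprinklers" vertices and all of the edges are assumptions added to make the example complete; only the three lawn tasks come from the slides.

```python
def reachable(graph, src, dst):
    """Return True if a directed path leads from src to dst."""
    stack, seen = [src], set()
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, ()))
    return False


def independent(graph, u, v):
    """Tasks u and v may run concurrently if neither depends on the other."""
    return not reachable(graph, u, v) and not reachable(graph, v, u)


# Edge u -> v means task v needs the result of task u (hypothetical graph).
graph = {
    "fetch tools": ["mow lawn", "edge lawn", "weed garden"],
    "mow lawn": ["turn on sprinklers"],
    "edge lawn": ["turn on sprinklers"],
    "weed garden": ["turn on sprinklers"],
    "turn on sprinklers": [],
}
can_overlap = independent(graph, "mow lawn", "weed garden")
```

The check looks for a path in both directions, since a dependence either way forces an ordering between the two tasks.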
Pipelining • Divide a process into stages • Produce several items simultaneously
Compute Partial Sums • Consider the for loop: p[0] ← a[0]; for i ← 1 to 3 do p[i] ← p[i-1] + a[i] endfor • This computes the partial sums: p[0] = a[0]; p[1] = a[0] + a[1]; p[2] = a[0] + a[1] + a[2]; p[3] = a[0] + a[1] + a[2] + a[3] • The loop is not data parallel as there are dependencies. • However, we can stage the calculations in order to obtain parallelism.
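One way to stage the partial-sum calculation is a pipeline in which stage k adds a[k] to the running total handed over by stage k-1. The sketch assumes a stream of several input arrays (an assumption; the slide shows a single array) and simulates clock ticks sequentially. The name `pipelined_prefix_sums` is illustrative.

```python
def pipelined_prefix_sums(rows):
    """Simulate a pipeline with one stage per array position.

    rows: list of equal-length lists. Returns (prefix sums per row,
    number of clock ticks used).
    """
    stages = len(rows[0])
    results = [[None] * stages for _ in rows]
    ticks = 0
    for t in range(len(rows) + stages - 1):
        # Within one tick, every stage works on a different row.
        for k in range(stages):
            r = t - k                     # row currently sitting in stage k
            if 0 <= r < len(rows):
                prev = results[r][k - 1] if k > 0 else 0
                results[r][k] = prev + rows[r][k]
        ticks += 1
    return results, ticks


sums, ticks = pipelined_prefix_sums([[1, 2, 3, 4], [10, 20, 30, 40]])
```

A single array still needs one tick per stage, so the gain comes from throughput: m arrays finish in m + stages - 1 ticks instead of m × stages.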
Virtual Parallelism • In data parallel applications, it is often simpler to initially design an algorithm or program assuming one data item per processor. • Particularly useful for SIMD programming • If the program needs more processors than are actually available, each processor is given a block of ⌈n/p⌉ or ⌊n/p⌋ data items • Typically requires a routine adjustment in the program. • Will result in a slowdown in running time of at least n/p. • Called Virtual Parallelism since each processor plays the role of several processors. • A SIMD computer has been built that automatically converts code to handle n/p items per processor. • Wavetracer SIMD computer.
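The block sizes used in virtual parallelism follow from integer division of n items among p processors. A minimal sketch, with `block_sizes` as a hypothetical helper name:

```python
def block_sizes(n, p):
    """Split n data items among p processors as evenly as possible.

    The first n mod p processors receive one extra item, so every block
    holds either the ceiling or the floor of n/p items.
    """
    base, extra = divmod(n, p)
    return [base + 1 if rank < extra else base for rank in range(p)]


sizes = block_sizes(10, 4)
```

For n = 10 and p = 4 the blocks hold 3, 3, 2, and 2 items, so the most heavily loaded processor plays the role of three virtual processors.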
Slides from Parallel Architecture Section See www.cs.kent.edu/~jbaker/PDC-F08/
References • Slides in this section are taken from the Parallel Architecture Slides at site www.cs.kent.edu/~jbaker/PDC-F08/ • Book reference is Chapter 2 of Quinn’s textbook.
Interconnection Networks • Uses of interconnection networks • Connect processors to shared memory • Connect processors to each other • Different interconnection networks define different parallel machines. • The interconnection network’s properties influence the type of algorithm used for various machines as it affects how data is routed.
Terminology for Evaluating Switch Topologies • We need to evaluate 4 characteristics of a network in order to understand its effectiveness • These are • The diameter • The bisection width • The number of edges per node • Whether the edge length is constant • We’ll define these and see how they affect algorithm choice. • Then we will introduce several different interconnection networks.
Terminology for Evaluating Switch Topologies • Diameter – Largest distance between two network nodes. • A low diameter is desirable • It puts a lower bound on the complexity of parallel algorithms that require communication between arbitrary pairs of nodes.
Terminology for Evaluating Switch Topologies • Bisection width – The minimum number of edges between switch nodes that must be removed in order to divide the network into two halves. • Or within 1 node of one-half if the number of processors is odd. • High bisection width is desirable. • In algorithms requiring large amounts of data movement, the size of the data set divided by the bisection width puts a lower bound on the running time of an algorithm.
Terminology for Evaluating Switch Topologies • Number of edges per node • It is best if the maximum number of edges/node is a constant independent of network size, as this allows the processor organization to scale more easily to a larger number of nodes. • Degree is the maximum number of edges per node. • Constant edge length? (yes/no) • Again, for scalability, it is best if the nodes and edges can be laid out in 3D space so that the maximum edge length is a constant independent of network size.
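Diameter and degree can be computed for small networks by breadth-first search. The sketch below treats every node alike (it does not separate switches from processors, which is a simplification), and the helper names and the two sample topologies are chosen for illustration.

```python
from collections import deque


def eccentricity(adj, start):
    """Largest shortest-path distance from `start` to any other node."""
    dist = {start: 0}
    todo = deque([start])
    while todo:
        node = todo.popleft()
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                todo.append(nxt)
    return max(dist.values())


def diameter(adj):
    """Largest distance between any two nodes of a connected network."""
    return max(eccentricity(adj, node) for node in adj)


def degree(adj):
    """Maximum number of edges at any one node."""
    return max(len(neighbors) for neighbors in adj.values())


def linear_array(n):
    """n nodes in a row, each linked to its left and right neighbor."""
    return {i: [j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)}


def hypercube(d):
    """2**d nodes; two nodes are linked if their labels differ in one bit."""
    return {i: [i ^ (1 << b) for b in range(d)] for i in range(2 ** d)}
```

For 8 nodes, the linear array has diameter 7 and degree 2, while the 3-dimensional hypercube has diameter 3 and degree 3: the hypercube trades a degree that grows with network size for a much smaller diameter.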
Three Important Interconnection Networks • We will consider the following three well known interconnection networks: • 2-D mesh • linear network • hypercube • All three of these networks have been used to build commercial parallel computers.
2-D Meshes Note: Circles represent switches and squares represent processors in all these slides.
2-D Mesh Network • Switches arranged into a 2-D lattice or grid • Communication allowed only between neighboring switches • Torus: Variant that includes wraparound connections between switches on edge of mesh
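The neighbor rule for a 2-D mesh, with and without the torus wraparound, can be sketched as follows. The function `mesh_neighbors` and its (row, column) addressing are assumptions for this illustration, and the torus case assumes at least 3 rows and 3 columns so that wraparound neighbors are distinct.

```python
def mesh_neighbors(row, col, rows, cols, torus=False):
    """Return the switches adjacent to (row, col) in a rows x cols mesh.

    Without wraparound, edge and corner switches have fewer neighbors.
    With torus=True, indices wrap so every switch has four neighbors.
    """
    result = []
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):   # up, down, left, right
        r, c = row + dr, col + dc
        if torus:
            result.append((r % rows, c % cols))
        elif 0 <= r < rows and 0 <= c < cols:
            result.append((r, c))
    return result


corner_plain = mesh_neighbors(0, 0, 4, 4)
corner_torus = mesh_neighbors(0, 0, 4, 4, torus=True)
```

In the plain mesh a corner switch has only two neighbors; the wraparound links of the torus give every switch four.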