
Exploring Parallel Architectures: Speed and Scalability in Computing

This introduction surveys parallel architectures, focusing on the goal of achieving speed by dividing a problem among processors that work simultaneously. It discusses the limits on speed imposed by communications diameter and bisection width, and the limit on expansion imposed by network degree. The PRAM (Parallel Random Access Machine) model is explored along with linear array and mesh architectures. The text also covers sorting on mesh architectures, hypercube structures, and coarse-grained parallelism, with the total (semigroup) operation used as a running example.


Presentation Transcript


  1. Introduction to Parallel Architectures. Dr. Laurence Boxer, Niagara University. CIS 270 - December '99

  2. Parallel Computers • Purpose - speed • Divide a problem among processors • Let each processor work on its portion of the problem in parallel (simultaneously) with the other processors • Ideal - if p is the number of processors, get the solution in 1/p of the time used by a single-processor computer • Actual - rarely get that much speedup, due to delays for interprocessor communications
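The ideal-versus-actual contrast can be made concrete with a toy model that charges the parallel run a per-step communication cost. All names and numbers below are illustrative assumptions, not from the slides; a minimal sketch:

    # Ideal vs. actual parallel time for p processors (illustrative model).
    def ideal_time(t_serial, p):
        return t_serial / p                  # perfect 1/p division of work

    def actual_time(t_serial, p, t_comm, steps):
        # interprocessor communication keeps real speedup below the ideal
        return t_serial / p + t_comm * steps

    t1 = 100.0                               # hypothetical 1-processor time
    for p in (2, 4, 8, 16):
        t_actual = actual_time(t1, p, t_comm=0.5, steps=p)
        print(p, t1 / ideal_time(t1, p), round(t1 / t_actual, 2))

With these made-up constants the ideal speedup is exactly p, while the actual speedup flattens as communication steps accumulate.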

  3. Graphs of relevant functions [graph not reproduced in transcript]

  4. Architectural issues Limitations on speed: • Communications diameter - how many communication steps are necessary to send data from the processor that has it to the processor that needs it - large is bad • Bisection width - how many wires must be cut to cut the network in half - a measure of how fast massive amounts of data can be moved through the network - large is good Limitation on expansion: • Degree of network - important to scalability (the ability to expand the number of processors) - large is bad

  5. PRAM - Parallel Random Access Machine • Shared memory yields fast communications - any processor can send data to any other processor in Θ(1) time, as follows: • Source processor writes the data to memory • Destination processor reads the data from memory • Fast communications make this model the theoretical ideal for the fastest possible parallel algorithms for a given # of processors • Impractical - too many wires if there are lots of processors
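The two-bullet communication protocol can be mimicked with shared memory modeled as a plain Python list; a toy sketch (the cell index and values are arbitrary illustrations):

    # PRAM-style communication: any processor reaches any other through
    # shared memory in constant time (one write plus one read).
    shared = [None] * 8                 # the shared memory cells

    def send(value, cell):
        shared[cell] = value            # source processor writes to memory

    def receive(cell):
        return shared[cell]             # destination processor reads it back

    send(42, cell=3)                    # processor A publishes a value
    print(receive(3))                   # processor B obtains it: 42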

  6. [Slide content not captured in transcript: illustration of a PRAM algorithm whose tree structure is noted on the next slide]

  7. Notice the tree structure of the previous algorithm: [diagram not reproduced in transcript]
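A sequential sketch of the kind of tree-structured reduction the slide appears to reference: pairwise additions halve the number of partial sums each round, so n values are totaled in Θ(log n) parallel rounds (the list comprehension marks what a PRAM would do simultaneously):

    # Tree-structured total: each round adds pairs "in parallel",
    # halving the number of partial sums; Theta(log n) rounds for n items.
    def tree_total(values):
        vals = list(values)
        while len(vals) > 1:
            if len(vals) % 2:            # odd count: pad with the identity
                vals.append(0)
            vals = [vals[i] + vals[i + 1] for i in range(0, len(vals), 2)]
        return vals[0]

    print(tree_total(range(1, 9)))       # 36, in 3 rounds for n = 8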

  8. Linear array architecture • Degree of network: 2 - easily expanded • Bisection width: 1 - can’t move large amounts of data efficiently across the network • Communication diameter: n-1 - won’t perform global communication operations efficiently

  9. Total on linear array: • Assume 1 item per processor • The communications diameter implies a running time of Ω(n); passing a running sum down the array achieves Θ(n) • Since Θ(n) is the time required to total n items on a RAM, there is no asymptotic benefit to using a linear array for this problem
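A sequential simulation of the linear-array total: the running sum travels left to right over one link per time step, so n items take Θ(n) steps, matching the diameter bound above (a sketch; the loop stands in for the time steps):

    # Total on a linear array: one item per processor; a partial sum is
    # forwarded rightward one link per step, reaching the end in n-1 steps.
    def linear_array_total(items):
        partial = 0
        for value in items:              # one communication step per processor
            partial += value             # processor adds its item and forwards
        return partial                   # held by the rightmost processor

    print(linear_array_total([3, 1, 4, 1, 5]))   # 14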

  10. Input-based sorting on a linear array • The algorithm illustrated is a version of Selection Sort - each processor selects the smallest value it sees and passes the others to the right. • Time is proportional to the communication diameter: Θ(n)
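A sequential simulation of this selection-style sort: values stream in from the left, each processor keeps the smallest value it has seen and passes the rest rightward, so processor i ends up holding the i-th smallest item. On the actual array these comparisons pipeline, giving Θ(n) time; a sketch:

    # Input-based sort on a linear array: processor i keeps the minimum of
    # everything that has reached it and forwards the larger values right.
    def linear_array_sort(stream):
        kept = []                        # kept[i] = value held by processor i
        for value in stream:
            for i in range(len(kept)):   # value ripples past earlier processors
                if value < kept[i]:
                    kept[i], value = value, kept[i]  # keep smaller, pass larger
            kept.append(value)           # first idle processor takes what's left
        return kept

    print(linear_array_sort([5, 2, 4, 1, 3]))    # [1, 2, 3, 4, 5]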

  11. Mesh architecture • Square grid of processors • Each processor is connected by a communication link to its N, S, E, W neighbors • Degree of network: 4 - makes expansion easy - can introduce adjacent meshes and connect border processors

  12. Application: sorting Could have the initial data all in the “wrong half” of the mesh, as shown. In 1 time unit, the amount of data that can cross into the correct half of the mesh is at most the bisection width, Θ(√n). Since all n items must get to the correct half-mesh, the time required to sort is Ω(n/√n) = Ω(√n).
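Written out, the counting argument behind this bound (√n is the bisection width of a √n × √n mesh):

    \[
      T_{\text{sort}}
        \;\ge\; \frac{\text{items that must cross the bisection}}
                     {\text{items that can cross per time unit}}
        \;=\; \frac{n}{\Theta(\sqrt{n})}
        \;=\; \Omega(\sqrt{n}).
    \]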

  13. [Broadcast illustration not reproduced in transcript.] In a mesh, each of these steps takes Θ(√n) time. Hence, the time for broadcast is Θ(√n).

  14. Semigroup operation (e.g., total) in mesh 1. “Roll up” the columns in parallel, totaling each column into the last row by sending data downward. Time: Θ(√n) 2. Roll up the last row to get the total into a corner. Time: Θ(√n) 3. Broadcast the total from the corner to all processors. Time: Θ(√n). Total time: Θ(√n)
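A sequential simulation of the three phases on a √n × √n grid; each phase needs Θ(√n) communication steps on the real mesh (a sketch, with Python lists standing in for the processor grid):

    # Semigroup (total) on a mesh: roll columns into the last row, roll
    # that row into a corner, then broadcast the corner value back out.
    def mesh_total(grid):
        rows, cols = len(grid), len(grid[0])
        last_row = [0] * cols
        for c in range(cols):            # phase 1: all columns in parallel
            for r in range(rows):        # sqrt(n) downward hops per column
                last_row[c] += grid[r][c]
        corner = 0
        for c in range(cols):            # phase 2: roll up the last row
            corner += last_row[c]
        return [[corner] * cols for _ in range(rows)]   # phase 3: broadcast

    print(mesh_total([[1, 2], [3, 4]])[0][0])   # 10, known to every processor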

  15. Mesh total algorithm - continued The previous algorithm could run in approximately half the time by gathering the total in a center, rather than a corner, processor. However, the running time is still Θ(√n), i.e., still approximately proportional to the communication diameter (with a smaller constant of proportionality).

  16. Hypercube • Number n of processors is a power of 2 • Processors are numbered from 0 to n-1 • Connected processors are those whose binary labels differ in exactly 1 bit.

  17. Illustration of the total operation in a hypercube [diagram not reproduced in transcript]. Reverse the direction of the arrows to broadcast the result. Time: Θ(log n)
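A sequential sketch of a hypercube total using the all-to-all (butterfly) variant: in round b every processor exchanges partial sums with the neighbor whose label differs in bit b, so after log₂ n rounds every processor holds the total. The slide's version instead gathers the total to one processor and reverses the arrows to broadcast; both take Θ(log n) rounds:

    # Total on a hypercube of n = 2^d processors: one exchange per
    # dimension; after d = log2(n) rounds every processor has the sum.
    def hypercube_total(values):
        n = len(values)                          # assumed to be a power of 2
        vals = list(values)
        for b in range(n.bit_length() - 1):      # one round per bit/dimension
            vals = [vals[i] + vals[i ^ (1 << b)] for i in range(n)]
        return vals

    print(hypercube_total([1, 2, 3, 4, 5, 6, 7, 8]))   # [36, 36, ..., 36]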

  18. Coarse-grained parallelism • Most of the previous discussion was of fine-grained parallelism - # of processors comparable to # of data items • Realistically, few budgets accommodate such expensive computers - more likely to use coarse-grained parallelism, with relatively few processors compared with the # of data items • Coarse-grained algorithms are often based on each processor boiling its share of the data down to a single partial result, then using a fine-grained algorithm to combine these partial results

  19. Example: coarse-grained total Suppose n data are distributed evenly (n/p per processor) among p processors. 1. In parallel, each processor totals its share of the data. Time: Θ(n/p) 2. Use a fine-grained algorithm to add the partial sums (the total residing in one processor) and broadcast the result to all processors. In the case of a mesh, time: Θ(√p). Total time for the mesh: Θ(n/p + √p). Since √p = O(n/p) whenever n ≥ p^(3/2), this is Θ(n/p) - optimal.
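A sequential sketch of the two-step pattern, with Python's built-in sum standing in for the Θ(√p) fine-grained mesh combine-and-broadcast (the processor count and data are illustrative):

    # Coarse-grained total: p processors, n/p items each.  Step 1 sums
    # each chunk locally in Theta(n/p); step 2 combines the p partial
    # sums with a fine-grained algorithm and broadcasts the result.
    def coarse_grained_total(data, p):
        chunk = len(data) // p
        partials = [sum(data[i * chunk:(i + 1) * chunk])  # step 1, in parallel
                    for i in range(p)]
        total = sum(partials)            # step 2 (mesh combine + broadcast)
        return [total] * p               # every processor holds the total

    print(coarse_grained_total(list(range(16)), p=4))     # [120, 120, 120, 120]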

  20. More info: Algorithms Sequential and Parallel by Russ Miller and Laurence Boxer. Prentice-Hall, 2000 (available December, 1999)
