This introduction delves into the world of parallel architectures, focusing on the purpose of achieving speed through dividing problems among processors and enabling simultaneous work. It discusses limitations on speed related to communication diameter, bisection width, and network expansion. The concept of PRAM (Parallel Random Access Machine) is explored along with linear array and mesh architectures. The text also covers the application of sorting algorithms in mesh architectures, hypercube structures, and coarse-grained parallelism. Examples of total operations and algorithms are provided to illustrate the concepts discussed.
Introduction to Parallel Architectures Dr. Laurence Boxer Niagara University CIS 270 - December '99
Parallel Computers
• Purpose - speed
• Divide a problem among processors
• Let each processor work on its portion of the problem in parallel (simultaneously) with the other processors
• Ideal - if p is the number of processors, get the solution in 1/p of the time used by a computer with 1 processor
• Actual - rarely get that much speedup, due to delays for interprocessor communication
[Figure: graphs of relevant functions]
Architectural issues
Limitations on speed:
• Communication diameter - the number of communication steps needed, in the worst case, to send data from the processor that has it to the processor that needs it - large is bad
• Bisection width - the minimum number of wires that must be cut to cut the network in half - a measure of how fast massive amounts of data can be moved through the network - large is good
Limitation on expansion:
• Degree of network - the maximum number of communication links attached to any one processor - important to scalability (ability to expand the number of processors) - large is bad
PRAM - Parallel Random Access Machine
• Shared memory yields fast communications
• Any processor can send data to any other processor in Θ(1) time as follows:
1. Source processor writes data to memory
2. Destination processor reads data from memory
• Fast communications make this model the theoretical ideal for the fastest possible parallel algorithms for a given # of processors
• Impractical - too many wires if lots of processors
Notice the tree structure of the previous algorithm: [figure not reproduced]
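The slide showing the algorithm itself did not survive in this text, so the following is only a sketch of the standard tree-style total that the remark appears to refer to, simulated sequentially in Python. The name `pram_total` and the round counter are illustrative assumptions, not taken from the slides.

```python
def pram_total(values):
    """Simulate a tree-style total on a shared-memory PRAM.

    Returns (total, rounds). Each pass of the while loop stands for one
    parallel step in which all active processors add at the same time.
    """
    shared = list(values)      # shared memory, one cell per item
    n = len(shared)
    rounds = 0
    stride = 1
    while stride < n:
        # On a real PRAM these additions would happen simultaneously;
        # here they are simply executed one after another.
        for i in range(0, n - stride, 2 * stride):
            shared[i] += shared[i + stride]
        stride *= 2            # move one level up the tree
        rounds += 1
    return (shared[0] if shared else 0), rounds
```

For 8 items the loop runs 3 times, matching the log n height of the tree.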
Linear array architecture
• Degree of network: 2 - easily expanded
• Bisection width: 1 - can’t move large amounts of data efficiently across the network
• Communication diameter: n-1 - won’t perform global communication operations efficiently
Total on linear array:
• Assume 1 item per processor
• Communication diameter implies the time required is Ω(n)
• Since Θ(n) is also the time required to total n items on a RAM, there is no asymptotic benefit to using a linear array for this problem
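As a rough illustration of why the diameter dominates, here is a minimal sketch that passes a running sum from the leftmost processor to the rightmost. The function name and the step counter are assumptions made for this example.

```python
def linear_array_total(values):
    """Total one item per processor on a linear array.

    A running sum travels from the leftmost processor to the rightmost,
    crossing one link per step. Returns (total, steps).
    """
    running = 0
    steps = 0
    for position, item in enumerate(values):
        if position > 0:
            steps += 1         # one hop to the right-hand neighbor
        running += item        # this processor adds its own item
    return running, steps
```

With n items the sum crosses n-1 links, so the parallel machine does no better asymptotically than a single processor adding n numbers.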
Input-based sorting on a linear array
• The algorithm illustrated is a version of Selection Sort - each processor selects the smallest value it sees and passes the others to the right.
• Time is proportional to communication diameter, Θ(n)
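The illustration is not reproduced here, so the following sketch is one plausible reading of the description: items enter one per step at the leftmost processor, each processor keeps the smallest value it has seen, and the larger value moves one link to the right. The names `linear_array_sort`, `held` and `in_transit` are hypothetical.

```python
def linear_array_sort(items):
    """Simulate input-based sorting on a linear array of len(items) processors.

    Returns (sorted_list, steps), where steps counts parallel time steps.
    """
    n = len(items)
    held = [None] * n          # value each processor currently keeps
    in_transit = [None] * n    # value arriving at each processor this step
    pending = list(items)      # input not yet fed into the array
    steps = 0
    while pending or any(v is not None for v in in_transit):
        if pending:
            in_transit[0] = pending.pop(0)   # next input enters on the left
        outgoing = [None] * n
        for i in range(n):                   # all processors act in parallel
            arriving = in_transit[i]
            if arriving is None:
                continue
            if held[i] is None:
                held[i] = arriving
            else:
                # Keep the smaller value, pass the larger one to the right.
                small, large = sorted((held[i], arriving))
                held[i] = small
                if i + 1 < n:
                    outgoing[i + 1] = large
        in_transit = outgoing
        steps += 1
    return held, steps
```

For the three-item input [3, 1, 2] the simulation finishes after 5 steps with processor i holding the i-th smallest value, which is consistent with time linear in n.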
Mesh architecture
• Square grid of processors
• Each processor connected by communication link to N, S, E, W neighbors
• Degree of network: 4 - makes expansion easy - can introduce adjacent meshes and connect border processors
Application: sorting
Could have initial data all in “wrong half” of mesh, as shown.
In 1 time unit, amount of data that can cross into correct half of mesh: Θ(√n), the bisection width of a mesh of n processors
Since all n items must get to correct half-mesh, time required to sort is Ω(n/√n) = Ω(√n)
In a mesh, each of these steps takes Θ(√n) time. Hence, time for broadcast is Θ(√n).
Semigroup operation (e.g., total) in mesh
1. “Roll up” columns in parallel, totaling each column in last row by sending data downward. Time: Θ(√n)
2. Roll up last row to get total in a corner. Time: Θ(√n)
3. Broadcast total from corner to all processors. Time: Θ(√n)
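The three steps can be sketched as a sequential simulation that counts parallel communication steps. This is a minimal illustration assuming a non-empty square grid with one item per processor; `mesh_total` is a hypothetical name, and the broadcast is counted rather than carried out.

```python
def mesh_total(grid):
    """Total a square mesh holding one item per processor.

    Returns (total, steps), where steps counts parallel communication
    steps for the column roll-up, the row roll-up and the broadcast.
    Assumes grid is a non-empty list of equal-length rows (side x side).
    """
    side = len(grid)
    cells = [row[:] for row in grid]   # work on a copy
    steps = 0
    # 1. Roll up columns: every column sends its partial sum downward.
    for r in range(side - 1):
        for c in range(side):          # all columns act in parallel
            cells[r + 1][c] += cells[r][c]
        steps += 1
    # 2. Roll up the last row toward the right-hand corner.
    for c in range(side - 1):
        cells[side - 1][c + 1] += cells[side - 1][c]
        steps += 1
    total = cells[side - 1][side - 1]
    # 3. Broadcast from the corner: along the last row, then up the columns.
    steps += 2 * (side - 1)
    return total, steps
```

On a 4 x 4 mesh (n = 16) this counts 3 + 3 + 6 = 12 steps, which grows in proportion to √n.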
Mesh total algorithm - continued
Previous algorithm could run in approximately half the time by gathering the total in a center, rather than corner, processor. However, running time is still Θ(√n), i.e., still approximately proportional to communication diameter (with smaller constant of proportionality).
Hypercube
• Number n of processors is a power of 2
• Processors are numbered from 0 to n-1
• Connected processors are those whose binary labels differ in exactly 1 bit.
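The connection rule translates directly into bit operations: flipping each of the log n bits of a label gives that processor's neighbors. A small sketch, with `hypercube_neighbors` as an illustrative name:

```python
def hypercube_neighbors(label, n):
    """Return the labels adjacent to `label` in a hypercube of n processors."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of 2")
    dimensions = n.bit_length() - 1          # log2(n)
    # XOR with a single set bit flips exactly one bit of the label.
    return [label ^ (1 << bit) for bit in range(dimensions)]
```

For example, in a hypercube of 8 processors, processor 5 (binary 101) is connected to 4 (100), 7 (111) and 1 (001).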
Illustration of total operation in hypercube. Reverse direction of arrows to broadcast result.
Time: Θ(log n)
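Since the illustration is not reproduced, here is one common way such a total can proceed, sketched as a simulation: in each step, half of the remaining active processors send their partial sum across one dimension of the cube. The name `hypercube_total` and the choice to collect the result in processor 0 are assumptions.

```python
def hypercube_total(values):
    """Total one item per processor on a hypercube, ending in processor 0.

    Returns (total, steps). len(values) must be a power of 2.
    """
    n = len(values)
    if n < 1 or n & (n - 1):
        raise ValueError("number of processors must be a power of 2")
    partial = list(values)
    steps = 0
    bit = n >> 1
    while bit:
        # Processor (receiver | bit) sends to receiver; their labels differ
        # in exactly one bit, so the two are directly connected.
        for receiver in range(bit):          # all pairs act in parallel
            partial[receiver] += partial[receiver | bit]
        steps += 1
        bit >>= 1
    return partial[0], steps
```

With 8 processors the total arrives after 3 steps, one per dimension of the cube.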
Coarse-grained parallelism
• Most of previous discussion was of fine-grained parallelism - # of processors comparable to # of data items
• Realistically, few budgets accommodate such expensive computers - more likely to use coarse-grained parallelism with relatively few processors compared with # of data items.
• Coarse-grained algorithms are often based on each processor boiling its share of data down to a single partial result, then using a fine-grained algorithm to combine these partial results
Example: coarse-grained total
Suppose n data are distributed evenly among p processors (n/p per processor)
1. In parallel, each processor totals its share of the data. Time: Θ(n/p)
2. Use a fine-grained algorithm to add the partial sums (total residing in one processor) and broadcast result to all processors. In case of mesh, time: Θ(√p)
Total time for mesh: Θ(n/p + √p)
Since √p = O(n/p) when p = O(n^(2/3)), as is typical with relatively few processors, this is Θ(n/p) - optimal.
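The two phases can be sketched as follows. This is a simplified illustration, not the slides' own code: it assumes p is a perfect square that divides n, and it estimates the combining cost with the corner-based mesh roll-up and broadcast (about 4(√p - 1) steps) instead of simulating the mesh. All names are hypothetical.

```python
import math


def coarse_grained_mesh_total(data, p):
    """Total len(data) items spread evenly over a mesh of p processors.

    Returns (total, local_steps, combine_steps).
    """
    side = math.isqrt(p)
    if p < 1 or side * side != p or len(data) % p:
        raise ValueError("sketch assumes p is a perfect square dividing len(data)")
    share = len(data) // p
    # Step 1: each processor totals its own block of n/p items (in parallel).
    partials = [sum(data[k * share:(k + 1) * share]) for k in range(p)]
    local_steps = share                  # Theta(n/p) work per processor
    # Step 2: combine the p partial sums with a fine-grained mesh algorithm:
    # roll up columns, roll up the last row, then broadcast from the corner.
    combine_steps = 4 * (side - 1)       # Theta(sqrt(p))
    return sum(partials), local_steps, combine_steps
```

With 64 items on 4 processors, the local phase costs 16 steps and the combining phase only 4, showing how the Θ(n/p) term dominates when processors are few.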
More info: Algorithms Sequential and Parallel by Russ Miller and Laurence Boxer, Prentice-Hall, 2000 (available December, 1999)