Introduction to Architecture for Parallel Computing. Models of parallel computers (sequential).
Taxonomy
• Control
• Address space
• Interconnection network
• Granularity
Control
• SISD - single instruction stream, single data stream
  Pros:
  1. Single program, less memory
  Cons:
  1. Dedicated hardware (expensive)
  2. Rigid
Taxonomy (continued)
• SIMD - single instruction stream, multiple data stream
• MISD - multiple instruction stream, single data stream
• MIMD - multiple instruction stream, multiple data stream
  Pros:
  1. Off-the-shelf hardware (inexpensive)
  2. Flexible
  Cons:
  1. Program and data on all PEs
SIMD and MIMD architectures
[Figure: two panels, SIMD and MIMD]
Address-space organization
• Two paradigms
  - Message passing
  - Shared memory
• Message passing - each PE is paired with its own memory module, and the pairs communicate with each other by exchanging messages
• Shared address space - each PE has access to any location in memory
Interconnection Networks
• Static (direct) - for message passing
• Dynamic (indirect) - for shared memory
Processor Granularity
• Coarse-grain - for algorithms that require infrequent communication (the ratio of the time required for a basic communication to the time required for a basic computation is large)
• Medium-grain - between the two extremes
• Fine-grain - for algorithms that require frequent communication (the ratio of the time required for a basic communication to the time required for a basic computation is small)
PRAM model (idealized parallel computer)
• P is the number of processors
  - the processors share a common clock, but each works on its own instruction
• M is the global memory
  - uniformly accessible to all PEs
• Models synchronous shared-memory MIMD computers
• Interaction between PEs occurs at no cost
PRAM model (continued)
• Depending on how memory is accessed for READ and WRITE operations, there are four subclasses:
  - EREW (exclusive read, exclusive write): weakest, minimum concurrency
  - ERCW (exclusive read, concurrent write)
  - CREW (concurrent read, exclusive write): read access is concurrent, write access is serialized
  - CRCW (concurrent read, concurrent write): most powerful, can be simulated on an EREW model
PRAM model (continued)
• CR (concurrent read): needs no modification in the program
• CW (concurrent write): access to memory requires arbitration
• Protocols for CW
  - Common: the write is allowed iff all values being written are identical
  - Arbitrary: an arbitrary PE proceeds, the other PEs fail
  - Priority: the PE with the highest priority wins
  - Sum: the sum of all quantities is written
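The four concurrent-write protocols can be illustrated with a small sketch. It is not part of the slides: the function name, the `(pe_id, value)` request format, and the rule that the smallest PE id has the highest priority are all assumptions made for illustration.

```python
def resolve_concurrent_write(writes, protocol):
    """Return the value stored when several PEs write one memory word at once.

    writes   -- list of (pe_id, value) pairs aimed at the same word
    protocol -- "common", "arbitrary", "priority" or "sum"
    """
    if not writes:
        raise ValueError("no write requests")
    values = [value for _, value in writes]
    if protocol == "common":
        # The write succeeds only if every PE writes the same value.
        if len(set(values)) != 1:
            raise ValueError("common protocol: values differ, write rejected")
        return values[0]
    if protocol == "arbitrary":
        # Any single request may succeed; taking the first is one arbitrary pick.
        return values[0]
    if protocol == "priority":
        # Assumption: the smallest PE id has the highest priority.
        return min(writes, key=lambda w: w[0])[1]
    if protocol == "sum":
        return sum(values)
    raise ValueError(f"unknown protocol: {protocol}")


requests = [(3, 5), (1, 7), (2, 5)]
print(resolve_concurrent_write(requests, "priority"))  # value written by PE 1
print(resolve_concurrent_write(requests, "sum"))
```

Under the common protocol the same three requests would be rejected, because the values 5 and 7 differ.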
Dynamic Interconnection Networks
• EREW PRAM
  - p processors
  - m words in global memory
  - switching elements (determine the memory word accessed)
• Each PE can access any of the memory words
• Number of switching elements required for UMA: $\Theta(mp)$
• Very expensive (-)
• Very good performance (+)
Dynamic interconnection networks (continued)
• Solution
  - Reduce the cost by organizing the m words into b memory banks
  - Each PE switches between the b banks
  - Total number of switching elements: $\Theta(pb)$
  - Less expensive
• Note
  - This is only a weak approximation of EREW, because no two PEs can access the same memory bank at the same time.
Crossbar switching networks (continued)
• From the previous slide
  - p processors are connected to b memory banks by a crossbar switch
  - each PE accesses one memory bank at a time
  - the m words are stored in the memory banks
  - if m = b, the crossbar simulates an EREW PRAM
• Total number of switching elements: $\Theta(pb)$
• If p is very large
  - with b at least p, the total number of switching elements grows as $\Omega(p^2)$
  - if instead p > b, some PEs are unable to access any memory bank at a given time
  - not scalable
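To make the growth rates concrete, the following sketch compares the two switch counts quoted on these slides. The function name and the sizes are hypothetical, and constant factors are ignored.

```python
def switching_elements(p, m, b):
    """Switching-element counts (up to constant factors) for p PEs.

    m -- words of global memory (word-level switching needs p * m elements)
    b -- memory banks (a banked crossbar needs p * b elements)
    """
    return {"word_level": p * m, "banked_crossbar": p * b}


# Hypothetical sizes, chosen only for illustration: 2**20 memory words, b = p.
for p in (16, 64, 256):
    counts = switching_elements(p, m=2**20, b=p)
    print(p, counts["word_level"], counts["banked_crossbar"])
```

With b = p, doubling p quadruples the crossbar count, which is the quadratic growth that limits scalability.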
Bus-based networks (continued)
• UMA
  - each PE accesses data from M over the same bus
  - as p becomes very large, each PE spends an increasing amount of time waiting for memory access while the bus is used by other PEs
  - not acceptable
• Solution (NUMA)
  - provide a local cache at each PE to reduce the total number of accesses to global memory
  - this implies replicating data, which raises cache-coherency problems
  - the architecture is shown on the following slide
Multistage interconnection networks
• Multistage networks provide a trade-off between the two required metrics, cost and performance.
Multistage interconnection networks (graphs)
[Figure: two plots against p, one of cost and one of performance, each with curves for crossbar, multistage, and bus networks]
Multistage interconnection networks (continued)
• Cost is lower than that of crossbar networks
• Performance is better than that of bus-based networks
• Omega network
  - p is the number of PEs
  - b is the number of memory banks
  - p = b
  - the number of stages is $\log_2 p$
  - a link exists between input i and output j if:
    $$j = \begin{cases} 2i, & 0 \le i \le p/2 - 1 \\ 2i + 1 - p, & p/2 \le i \le p - 1 \end{cases}$$
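The Omega interconnection pattern is the perfect shuffle, which amounts to a one-bit left rotation of the binary label of i. A minimal sketch, assuming p is a power of two (the function names are illustrative, not from the slides):

```python
def perfect_shuffle(i, p):
    """Output j wired to input i in one Omega stage (p must be a power of two)."""
    if p < 2 or p & (p - 1):
        raise ValueError("p must be a power of two, at least 2")
    if not 0 <= i < p:
        raise ValueError("i must satisfy 0 <= i < p")
    return 2 * i if i < p // 2 else 2 * i + 1 - p


def rotate_left(i, p):
    """Left-rotate the log2(p)-bit binary label of i by one position."""
    bits = p.bit_length() - 1
    return ((i << 1) | (i >> (bits - 1))) & (p - 1)


p = 8
print([perfect_shuffle(i, p) for i in range(p)])
print(all(perfect_shuffle(i, p) == rotate_left(i, p) for i in range(p)))
```

For p = 8 the two definitions agree on every input, which is why the pattern is often described as a bit rotation.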
Omega network with blocking
[Figure] An example of blocking in an Omega network; one of the messages (010 to 111 or 110 to 100) is blocked at link AB.
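The blocking example can be reproduced with a short routing sketch. It assumes the usual destination-tag scheme (at stage k the switch output is selected by the k-th most significant destination bit) and identifies each link by the line number a message occupies after a stage. Both are assumptions, since the figure's own labels A and B are not available here.

```python
def omega_path(src, dst, p):
    """Line numbers occupied by a message after each stage of a p-input Omega network."""
    d = p.bit_length() - 1            # number of stages, log2(p)
    mask = p - 1
    pos, path = src, []
    for stage in range(d):
        dst_bit = (dst >> (d - 1 - stage)) & 1
        # Perfect shuffle (left rotation), then the switch sets the low bit.
        pos = ((pos << 1) & mask) | dst_bit
        path.append(pos)
    return path


def first_conflict(msg_a, msg_b, p):
    """Return (stage, line) where two messages need the same link, or None."""
    for stage, (a, b) in enumerate(zip(omega_path(*msg_a, p), omega_path(*msg_b, p))):
        if a == b:
            return stage, a
    return None


print(omega_path(0b010, 0b111, 8))
print(omega_path(0b110, 0b100, 8))
print(first_conflict((0b010, 0b111), (0b110, 0b100), 8))
```

In this model both messages need the same output line of the first stage, so only one of them can proceed, which matches the blocking described in the caption.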
Static interconnection networks
• Used in message passing
• Completely-connected network
  - similar to a crossbar, but can support multiple channels
  - every PE is connected to every other PE
  - communication takes place in 1 step
• Star-connected network
  - central PE (bottleneck)
  - similar to bus-based networks
  - communication takes place in 2 steps
• Examples are shown on the following slide
Completely-connected and star-connected networks
[Figure: two panels, completely connected (non-blocking) and star connected]
Static interconnection networks (continued)
• Linear array and ring
• Mesh network (2-D, 3-D)
  - a linear array extended to either 2 or 3 dimensions
  - each internal PE is connected to 4 (2-D) or 6 (3-D) other PEs
  - if the periphery PEs are connected, we have a wraparound mesh, or torus
Tree networks
• Only one path exists between any pair of PEs
• Linear arrays and star-connected networks are special cases of tree networks
• Static - each node corresponds to a PE
• Dynamic
  - only leaf nodes are PEs
  - intermediate nodes are switching elements
Examples of tree networks are shown on the following slides
Tree networks (continued)
• Fat tree
  - the number of communication links is increased for nodes closer to the root
  - in this way, bottlenecks at higher levels of the tree are alleviated
Hypercube networks
• A multidimensional mesh with 2 PEs in each dimension
• A d-dimensional hypercube has $p = 2^d$ PEs
• Can be constructed recursively
  - A (d+1)-dimensional hypercube is constructed by connecting the corresponding PEs of two separate d-dimensional hypercubes.
  - The labels of the PEs of one hypercube are prefixed with 0 and the labels of the second hypercube with 1
  - This is shown on the following slide
Figure HC2: Three distinct partitions of a 3D hypercube into two 2D subcubes. Links connecting processors within a partition are indicated by bold lines.
Hypercube properties
• Two PEs are connected by a direct link iff the binary representations of their labels differ at exactly one bit position.
• In a d-dimensional HyC, each PE is directly connected to d other PEs.
• A d-dimensional HyC can be partitioned into two (d-1)-dimensional subcubes (see figure HC2). Since PE labels have d bits, d such partitions exist.
• Fixing any k bits of the d-bit labels in a d-dimensional HyC: the PEs that differ only in the remaining (d-k) bit positions form a (d-k)-dimensional subcube composed of $2^{d-k}$ PEs; there are $2^k$ such subcubes (see figure HC3).
• Example: k = 2, d = 4
  - 4 subcubes by fixing the 2 MSBs
  - 4 subcubes by fixing the 2 LSBs
  - 4 subcubes are always formed by fixing any 2 bits
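These properties are easy to check on integer labels. The sketch below is illustrative only; the function names and the convention that bit 0 is the least significant bit are assumptions.

```python
def neighbors(label, d):
    """Labels directly linked to `label` in a d-dimensional hypercube."""
    return [label ^ (1 << bit) for bit in range(d)]


def subcubes(d, fixed_bits):
    """Group all 2**d labels by the values they hold at the fixed bit positions."""
    groups = {}
    for label in range(2 ** d):
        key = tuple((label >> b) & 1 for b in fixed_bits)
        groups.setdefault(key, []).append(label)
    return groups


print([format(n, "03b") for n in neighbors(0b101, 3)])

# k = 2, d = 4: fixing the two MSBs (bits 3 and 2) gives 2**2 subcubes of 2**2 PEs.
for key, members in subcubes(4, [3, 2]).items():
    print(key, members)
```

Fixing any other pair of bit positions, for example `[0, 1]` or `[0, 3]`, also yields four groups of four labels each.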
Hypercube properties (continued)
• The total number of bit positions at which two labels differ is the Hamming distance between the two PEs
  - s is the source, t is the destination
  - the number of 1 bits in $s \oplus t$ is the Hamming distance
• The number of communication links in the shortest path between two PEs is the Hamming distance between their labels.
• The shortest path between any two PEs in a HyC cannot have more than d links (since $s \oplus t$ cannot contain more than d nonzero bits)
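A shortest path can be built by correcting the differing bits one at a time. The sketch below fixes them from the least significant bit upward; that ordering is a choice made here for illustration (any order of the differing bits gives a path of the same length), and the function names are not from the slides.

```python
def hamming_distance(s, t):
    """Number of bit positions at which labels s and t differ."""
    return bin(s ^ t).count("1")


def shortest_path(s, t):
    """One shortest hypercube path from s to t, flipping differing bits LSB first."""
    path = [s]
    diff = s ^ t
    bit = 0
    while diff:
        if diff & 1:
            # Cross the link along dimension `bit`.
            path.append(path[-1] ^ (1 << bit))
        diff >>= 1
        bit += 1
    return path


s, t = 0b0101, 0b1110
print(hamming_distance(s, t))
print([format(n, "04b") for n in shortest_path(s, t)])
```

The number of links on the returned path, `len(path) - 1`, equals the Hamming distance of the two labels.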
k-ary d-cube networks
• A d-dimensional HyC is a binary d-cube, or 2-ary d-cube (2 PEs along each dimension)
• In general, for k-ary d-cubes
  - d is the dimension of the network
  - k is the radix, i.e., the number of PEs along each dimension
• The number of PEs is $p = k^d$
  - e.g. a 2-D mesh with p PEs: $p = k^2$, or $k = \sqrt{p}$
Evaluating static interconnection networks (in terms of cost and performance)
• Diameter
  - the maximum distance between any two PEs in the network
  - the distance between two PEs is the length of the shortest path between them
  - the distance determines communication time
• Diameter of networks
  - completely connected network: 1
  - star-connected network: 2
  - ring: $\lfloor p/2 \rfloor$
  - 2-D mesh: $2(\sqrt{p} - 1)$
  - wraparound 2-D mesh: $2\lfloor \sqrt{p}/2 \rfloor$
  - hypercube-connected network: $\log_2 p$
  - complete binary tree: $2\log_2((p+1)/2)$
    . p is the number of processors, which equals the number of nodes (15 PEs in the example)
    . the tree height is $h = \log_2((p+1)/2)$ and the diameter is 2h (see figure on the following slide)
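The diameter formulas can be cross-checked on small instances with a breadth-first search. This is a sketch rather than an efficient method; the graph builders and their names are assumptions made for illustration.

```python
from collections import deque


def bfs_diameter(adj):
    """Largest shortest-path length over all node pairs of a connected graph."""
    best = 0
    for start in adj:
        dist = {start: 0}
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        best = max(best, max(dist.values()))
    return best


def ring(p):
    return {i: [(i - 1) % p, (i + 1) % p] for i in range(p)}


def mesh2d(k, wrap=False):
    """k x k mesh; wrap=True adds the wraparound (torus) links."""
    adj = {}
    for r in range(k):
        for c in range(k):
            nbrs = []
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                if wrap:
                    nbrs.append((rr % k, cc % k))
                elif 0 <= rr < k and 0 <= cc < k:
                    nbrs.append((rr, cc))
            adj[(r, c)] = nbrs
    return adj


def hypercube(d):
    return {i: [i ^ (1 << b) for b in range(d)] for i in range(2 ** d)}


def complete_binary_tree(p):
    """Heap-style numbering: the parent of node i is (i - 1) // 2."""
    adj = {i: [] for i in range(p)}
    for i in range(1, p):
        adj[i].append((i - 1) // 2)
        adj[(i - 1) // 2].append(i)
    return adj


print(bfs_diameter(ring(16)), bfs_diameter(mesh2d(4)), bfs_diameter(mesh2d(4, wrap=True)))
print(bfs_diameter(hypercube(4)), bfs_diameter(complete_binary_tree(15)))
```

With p = 16 (and p = 15 for the tree) the measured values can be compared directly against the formulas in the list above.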
Evaluating static interconnection networks (continued)
• Connectivity
  - a measure of the multiplicity of paths between two PEs
• High connectivity means lower contention for communication resources
• Arc connectivity
  - a measure of connectivity
  - the minimum number of arcs that must be removed to break the network into two disconnected networks
• Arc connectivity of networks
  - linear arrays, star, tree: 1
  - rings, 2-D mesh without wraparound: 2
  - 2-D mesh with wraparound: 4
  - d-dimensional hypercube: d
Bisection Width
• The minimum number of communication links that have to be removed to partition the network into two equal halves.
• Bisection width of networks
  - Ring: 2
  - 2-D mesh without wraparound: $\sqrt{p}$
  - 2-D mesh with wraparound: $2\sqrt{p}$
  - Tree, star: 1 (the bisection width of a star is the same as that of a tree, since the star is a special case of a tree)
  - Fully connected network: $p^2/4$
  - Hypercube: $p/2$
    . a d-dimensional HyC (p processors) consists of two (d-1)-dimensional HyCs (p/2 processors each)
    . connecting their corresponding PEs takes p/2 links
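For small networks the definition can be applied literally by trying every split into two equal halves. The brute-force sketch below is only practical for a handful of PEs; the helper names and the edge-list representation are assumptions made for illustration.

```python
from itertools import combinations


def bisection_width(nodes, edges):
    """Minimum number of edges cut by any split of the nodes into two equal halves."""
    nodes = list(nodes)
    half = len(nodes) // 2
    first = nodes[0]                      # pin one node to a side to skip mirror splits
    best = len(edges)
    for rest in combinations(nodes[1:], half - 1):
        side = {first, *rest}
        cut = sum((u in side) != (v in side) for u, v in edges)
        best = min(best, cut)
    return best


def ring_edges(p):
    return [(i, (i + 1) % p) for i in range(p)]


def hypercube_edges(d):
    return [(i, i ^ (1 << b)) for i in range(2 ** d) for b in range(d) if i < (i ^ (1 << b))]


def mesh_edges(k, wrap=False):
    """k x k mesh with nodes numbered r * k + c; wrap=True adds wraparound links."""
    edges = []
    for r in range(k):
        for c in range(k):
            if wrap or c + 1 < k:
                edges.append((r * k + c, r * k + (c + 1) % k))
            if wrap or r + 1 < k:
                edges.append((r * k + c, ((r + 1) % k) * k + c))
    return edges


print(bisection_width(range(8), ring_edges(8)))
print(bisection_width(range(8), hypercube_edges(3)))
print(bisection_width(range(8), list(combinations(range(8), 2))))
print(bisection_width(range(16), mesh_edges(4)), bisection_width(range(16), mesh_edges(4, wrap=True)))
```

The runs use p = 8 for the ring, hypercube, and fully connected network, and p = 16 for the two meshes.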
HyC bisection width figure
[Figure: a d-dimensional HyC with p PEs drawn as two (d-1)-dimensional HyCs, HyC1 and HyC2, with p/2 PEs each, joined by p/2 links]
Bisection bandwidth
• Channel width
  - the number of bits that can be communicated simultaneously over a link connecting two PEs (the number of wires)
• Channel rate
  - the peak rate at which a single wire can deliver bits
• Channel bandwidth (channel rate × channel width)
  - the peak rate at which data can be communicated between the ends of a communication link
• Bisection bandwidth (bisection width × channel bandwidth)
  - the minimum volume of communication allowed between any two halves of a network with an equal number of PEs
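The two products defined above can be written out directly. The numbers in this sketch (a 64-PE hypercube, 16 wires per link, 10^9 bits per second per wire) are hypothetical and serve only to show the arithmetic.

```python
def channel_bandwidth(channel_rate, channel_width):
    """Peak bits per second over one link: rate per wire times number of wires."""
    return channel_rate * channel_width


def bisection_bandwidth(bisection_width, channel_rate, channel_width):
    """Bisection width times channel bandwidth."""
    return bisection_width * channel_bandwidth(channel_rate, channel_width)


# Hypothetical example: hypercube with p = 64 PEs, so the bisection width is p / 2 = 32.
p = 64
print(bisection_bandwidth(p // 2, channel_rate=10**9, channel_width=16))
```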
Cost
• Measured in terms of (a) the number of communication links and (b) the bisection bandwidth
• Number of communication links (number of wires required)
  - linear arrays, trees: p - 1
  - d-dimensional mesh with wraparound: dp
  - HyC: $(p/2)\log_2 p$
• Bisection bandwidth
  - measures cost by providing a lower bound on the area (2-D) or the volume (3-D) of the packaging
  - if the bisection width of a network is w
    . the lower bound on the area in 2-D is $\Theta(w^2)$
    . the lower bound on the volume in 3-D is $\Theta(w^{3/2})$
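The link counts can be tabulated with a small helper and, for the hypercube, checked by enumerating the links directly. The function names and the string keys are assumptions made for illustration.

```python
def link_count(network, p, d=None):
    """Number of communication links in a p-PE network; d is the mesh dimension."""
    if network in ("linear_array", "tree"):
        return p - 1
    if network == "wraparound_mesh":
        # Each PE owns one "+1" link per dimension (assumes at least 3 PEs per dimension).
        return d * p
    if network == "hypercube":
        dim = p.bit_length() - 1          # log2(p) for p a power of two
        return (p // 2) * dim
    raise ValueError(f"unknown network: {network}")


def count_hypercube_links(dim):
    """Count hypercube links by enumerating one-bit label flips, each link once."""
    return sum(1 for i in range(2 ** dim) for b in range(dim) if i < (i ^ (1 << b)))


print(link_count("linear_array", 16))
print(link_count("wraparound_mesh", 16, d=2))
print(link_count("hypercube", 16), count_hypercube_links(4))
```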
Embedding other networks into a hypercube
• Given two graphs G(V, E) and G'(V', E'), embedding graph G into graph G' maps each vertex in the set V onto a vertex (or a set of vertices) in the set V' and each edge in the set E onto an edge (or a set of edges) in E'
  - nodes correspond to PEs and edges correspond to communication links
• Why do we need embedding?
  - It may be necessary to adapt one network to another (for example, when an application is written for a specific network that is not available)
• Parameters
  - congestion: the maximum number of edges in E mapped onto a single edge in E'
  - dilation: the maximum number of edges in E' that a single edge in E is mapped onto
  - expansion: the ratio of the number of vertices in V' to the number of vertices in V
  - contraction: the inverse of expansion
Embedding a linear array into a HyC
• A linear array of $2^d$ processors (labeled 0 … $2^d - 1$) can be embedded into a d-dimensional HyC by mapping processor i of the linear array to the processor with label G(i, d) of the HyC
  - G(0, 1) = 0
  - G(1, 1) = 1
  - $$G(i, x+1) = \begin{cases} G(i, x), & i < 2^x \\ 2^x + G(2^{x+1} - 1 - i, x), & i \ge 2^x \end{cases}$$
  - The function G is the binary reflected Gray code (RGC)
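The recursive definition of G translates directly into code. The sketch below (the function name `gray` is illustrative) also checks the property that makes the embedding work: consecutive array positions map to hypercube labels that differ in exactly one bit, so they are directly linked.

```python
def gray(i, d):
    """Binary reflected Gray code G(i, d) for 0 <= i < 2**d."""
    if not 0 <= i < 2 ** d:
        raise ValueError("i must satisfy 0 <= i < 2**d")
    if d == 1:
        return i                          # G(0, 1) = 0, G(1, 1) = 1
    half = 2 ** (d - 1)
    if i < half:
        return gray(i, d - 1)
    # Reflect the index and set the new most significant bit.
    return half + gray(2 * half - 1 - i, d - 1)


d = 3
labels = [gray(i, d) for i in range(2 ** d)]
print([format(g, "03b") for g in labels])

# Neighbours in the linear array land on directly linked hypercube PEs.
print(all(bin(a ^ b).count("1") == 1 for a, b in zip(labels, labels[1:])))
```

The same labels can be obtained with the closed form `i ^ (i >> 1)`, a well-known equivalent of the reflected Gray code.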