Introduction to Architecture for Parallel Computing
Taxonomy
• Control
• Address space
• Interconnection network
• Granularity

Control
• SISD – single instruction stream, single data stream

Taxonomy (continued)
• SIMD – single instruction stream, multiple data streams
  Pros: 1. A single program requires less memory
  Cons: 1. Dedicated hardware (expensive) 2. Rigid
• MISD – multiple instruction streams, single data stream
• MIMD – multiple instruction streams, multiple data streams
  Pros: 1. Off-the-shelf hardware (inexpensive) 2. Flexible
  Cons: 1. Program and data must be replicated on all PEs
SIMD and MIMD architectures
[Figure: SIMD and MIMD architectures]
Address-Space Organization
• Two paradigms
  - Message passing
  - Shared memory
• Message passing – each node pairs a PE with its own memory module; nodes communicate with each other by exchanging messages
• Shared address space – every PE has access to any location in memory
Interconnection Networks
• Static (direct) – used for message passing
• Dynamic (indirect) – used for shared memory
Processor Granularity
• Coarse-grain – for algorithms that require infrequent communication (a large ratio of computation time to communication time)
• Medium-grain – for algorithms in which the ratio of the time required for a basic communication to the time required for a basic computation is small
• Fine-grain – for algorithms that require frequent communication
PRAM model (idealized parallel computer)
• p is the number of processors
  - all share a common clock, but each may work on its own instruction
• M is the global memory
  - uniformly accessible to all PEs
• A synchronous, shared-memory MIMD computer
• Interaction between PEs occurs at no cost
PRAM model (continued)
• Depending on how memory is accessed for READ and WRITE operations, there are four subclasses:
  - EREW (exclusive read, exclusive write) – weakest, minimum concurrency
  - ERCW (exclusive read, concurrent write)
  - CREW (concurrent read, exclusive write) – read access is concurrent, write access is serialized
  - CRCW (concurrent read, concurrent write) – most powerful; can be simulated on an EREW model
PRAM model (continued)
• CR (concurrent read) – requires no modification of the program
• CW (concurrent write) – access to memory requires arbitration
• Protocols for CW (illustrated in the sketch below)
  - Common – the write succeeds iff all values being written are identical
  - Arbitrary – an arbitrary PE proceeds; the other PEs fail
  - Priority – the PE with the highest priority wins
  - Sum – the sum of all written quantities is stored
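To make the four protocols concrete, here is a minimal Python sketch (the function name and tuple layout are illustrative, not from the slides) that resolves a set of simultaneous writes to a single memory cell under each rule.

```python
# Sketch: resolving concurrent writes to a single PRAM memory cell.
# Each pending write is a (pe_id, priority, value) tuple; layout is illustrative.

def resolve_cw(writes, protocol):
    """Return the value stored after concurrent writes, or raise on failure."""
    values = [v for (_, _, v) in writes]
    if protocol == "common":
        # Succeeds iff all PEs attempt to write the identical value.
        if len(set(values)) != 1:
            raise ValueError("common-CW requires identical values")
        return values[0]
    if protocol == "arbitrary":
        # Any one PE's write succeeds; the rest fail silently.
        return values[0]
    if protocol == "priority":
        # The PE with the highest priority wins.
        return max(writes, key=lambda w: w[1])[2]
    if protocol == "sum":
        # The sum of all written quantities is stored.
        return sum(values)
    raise ValueError(protocol)

writes = [(0, 5, 10), (1, 9, 20), (2, 1, 30)]
print(resolve_cw(writes, "priority"))  # -> 20 (PE 1 has priority 9)
print(resolve_cw(writes, "sum"))       # -> 60
```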
Dynamic Interconnection Networks
• EREW PRAM
  - p processors
  - m words in global memory
  - switching elements (determine the memory word accessed)
• Each PE can access any of the memory words
• Number of switching elements required for UMA: Θ(mp)
• Very expensive (−)
• Very good performance (+)
Dynamic interconnection networks (continued)
• Solution
  - Reduce m by using memory banks (m words organized into b banks)
  - Each PE switches among the b banks
  - Total number of switching elements: Θ(bp)
  - Less expensive
• Note
  - This is a weak approximation of EREW, because no two PEs can access the same memory bank at the same time.
Crossbar switching networks (continued)
• From the previous slide
  - p processors are connected to b memory banks by a crossbar switch
  - each PE accesses one memory bank
  - m words are stored in the memory banks
  - if m = b, the crossbar simulates an EREW PRAM
• Total number of switching elements: Θ(pb)
• If p is very large and p > b
  - the total number of switching elements grows as Θ(p²), as the sketch below shows
  - and more PEs are unable to access any memory bank
  - not scalable
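A quick sketch of that growth, assuming the number of banks b is kept equal to p so that no PE is starved; the sizes are illustrative only.

```python
# Sketch: crossbar cost p*b grows quadratically when the number of banks b
# must scale with the number of processors p (values are illustrative).

for p in (16, 64, 256, 1024):
    b = p  # keep banks proportional to processors
    print(f"p={p:5d}: switching elements = {p * b:>9,}")
```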
Bus-based networks (continued)
• UMA
  - each PE accesses data from M over the same bus
  - as p becomes very large, each PE spends an increasing amount of time waiting for memory access while the bus is used by other PEs
  - not acceptable
• Solution (NUMA)
  - provide a local cache at each PE to reduce the total number of accesses to global memory
  - this implies replicating data – cache coherency problems
  - the architecture is shown on the following slide
Multistage interconnection networks
• Multistage networks provide a trade-off between the two extremes: the high performance (and high cost) of the crossbar and the low cost (and low performance) of the bus.
Multistage interconnection networks (graphs)
[Figure: cost and performance plotted against p; on both curves the crossbar is highest and the bus lowest, with multistage networks in between]
Multistage interconnection networks (continued)
• Cost is less than that of crossbar networks
• Performance is better than that of bus-based networks
• Omega network
  - p is the number of PEs
  - b is the number of memory banks
  - p = b
  - the number of stages is log₂ p
  - each stage is wired as a perfect shuffle: a link exists between input i and output j iff
    j = 2i, for 0 ≤ i ≤ p/2 − 1
    j = 2i + 1 − p, for p/2 ≤ i ≤ p − 1
    (equivalently, j is a left circular shift of the log₂ p-bit binary representation of i; see the sketch below)
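The mapping is easy to check in code. Below is a minimal Python sketch (function names are illustrative) that implements the piecewise formula and verifies it against the bit-rotation view for p = 8.

```python
# Sketch: the perfect-shuffle wiring used between stages of an omega network,
# shown both by the piecewise formula and by the equivalent left bit-rotation.

def shuffle(i, p):
    """Output j wired to input i (p must be a power of 2)."""
    return 2 * i if i < p // 2 else 2 * i + 1 - p

def shuffle_by_rotation(i, p):
    """Same mapping via a left circular shift of i's log2(p)-bit label."""
    bits = p.bit_length() - 1           # log2(p)
    msb = (i >> (bits - 1)) & 1         # the bit that wraps around
    return ((i << 1) & (p - 1)) | msb

p = 8
for i in range(p):
    assert shuffle(i, p) == shuffle_by_rotation(i, p)
    print(f"{i:03b} -> {shuffle(i, p):03b}")
```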
Omega network with blocking
[Figure: an example of blocking in an omega network; one of the messages (010 to 111 or 110 to 100) is blocked at link AB]
Static interconnection networks
• Used in message passing
• Completely-connected network
  - similar to a crossbar, but can support multiple simultaneous channels
  - every PE is connected to every other PE
  - communication takes place in 1 step
• Star-connected network
  - central PE (a bottleneck)
  - similar to bus-based networks
  - communication takes place in 2 steps
• Examples are shown on the following slide
Completely-connected and Star-connected networks
[Figure: a completely-connected network (non-blocking) and a star-connected network]
Static interconnection networks (continued)
• Linear array and ring
• Mesh network (2-D, 3-D)
  - a linear array spread over either two or three dimensions
  - each internal PE is connected to 4 (2-D) or 6 (3-D) other PEs
  - if the periphery PEs are connected, we have a wraparound mesh, or torus
Tree networks
• Only one path exists between any pair of PEs
• Linear arrays and star-connected networks are special cases of tree networks
• Static
  - each node corresponds to a PE
• Dynamic
  - only the leaf nodes are PEs
  - intermediate nodes are switching elements
• Examples of tree networks are shown on the following slides
Tree networks (continued)
• Fat tree
  - the number of communication links is increased for links closer to the root
  - in this way, bottlenecks at higher levels in the tree are alleviated
Hypercube networks
• A multidimensional mesh with 2 PEs in each dimension
• A d-dimensional hypercube has p = 2^d PEs
• Can be constructed recursively (see the sketch below)
  - A (d+1)-dimensional hypercube is constructed by connecting the corresponding PEs of two separate d-dimensional hypercubes
  - The labels of the PEs of one hypercube are prefixed with 0 and the labels of the second hypercube with 1
  - This is shown on the following slide
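The recursive construction translates directly into code. This minimal Python sketch (function name assumed for illustration) builds the edge list of a d-cube by duplicating a (d−1)-cube and joining corresponding PEs.

```python
# Sketch: recursive construction of a d-dimensional hypercube as an edge list.
# A (d+1)-cube is two d-cubes (labels prefixed 0... and 1...) plus links
# between corresponding PEs.

def hypercube_edges(d):
    if d == 0:
        return []
    sub = hypercube_edges(d - 1)
    shift = 1 << (d - 1)  # value of the new (prefix) bit
    edges = sub + [(u + shift, v + shift) for (u, v) in sub]
    edges += [(u, u + shift) for u in range(shift)]  # corresponding-PE links
    return edges

edges = hypercube_edges(3)
print(len(edges))  # d * 2**(d-1) = 12 links for d = 3
# Sanity check: endpoints of every link differ in exactly one bit position.
assert all(bin(u ^ v).count("1") == 1 for u, v in edges)
```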
[Figure HC2: three distinct partitions of a 3-D hypercube into two 2-D subcubes; links connecting processors within a partition are indicated by bold lines]
Hypercube properties
• Two PEs are connected by a direct link iff the binary representations of their labels differ in exactly one bit position
• In a d-dimensional hypercube, each PE is directly connected to d other PEs
• A d-dimensional hypercube can be partitioned into two (d−1)-dimensional subcubes (see figure HC2). Since PE labels have d bits, d such partitions exist
• Fixing any k bits of the d-bit labels: the PEs that differ in the remaining (d−k) bit positions form a (d−k)-dimensional subcube of 2^(d−k) PEs, and there are 2^k such subcubes (see fig HC3)
• Example: k = 2, d = 4 (enumerated in the sketch below)
  - 4 subcubes by fixing the 2 MSBs
  - 4 subcubes by fixing the 2 LSBs
  - fixing any 2 bits always forms 4 subcubes
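The sketch below (illustrative code, not from the slides) enumerates the example's partition: fixing the 2 MSBs of each 4-bit label yields 4 subcubes of 4 PEs each.

```python
# Sketch: fixing k bits of a d-bit label partitions the hypercube into 2^k
# subcubes of 2^(d-k) PEs each; here k = 2 MSBs of a d = 4 cube.

d, k = 4, 2
subcubes = {}
for pe in range(1 << d):
    fixed = pe >> (d - k)  # the k most significant bits of the label
    subcubes.setdefault(fixed, []).append(pe)

for fixed, pes in sorted(subcubes.items()):
    print(f"MSBs={fixed:02b}: {[f'{p:04b}' for p in pes]}")
# 2^k = 4 subcubes, each with 2^(d-k) = 4 PEs
```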
Hypercube properties (continued)
• The total number of bit positions at which two labels differ is the Hamming distance between the two PEs
  - s is the source, t is the destination
  - the number of 1s in s ⊕ t is the Hamming distance
• The number of communication links in the shortest path between two PEs is the Hamming distance between their labels (see the routing sketch below)
• The shortest path between any two PEs in a hypercube cannot have more than d links (since s ⊕ t cannot contain more than d bits)
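A minimal routing sketch: flip, one at a time, each bit position where s and t differ; the number of hops equals the Hamming distance, the popcount of s ⊕ t. The function name is illustrative.

```python
# Sketch: the shortest path between two hypercube PEs follows the bit
# positions where the labels differ; its length is the Hamming distance.

def shortest_path(s, t, d):
    """Route from PE s to PE t in a d-dimensional hypercube, one bit at a time."""
    path = [s]
    for bit in range(d):
        if (s ^ t) & (1 << bit):  # labels differ at this position
            s ^= 1 << bit         # traverse the link in dimension `bit`
            path.append(s)
    return path

s, t = 0b010, 0b111
path = shortest_path(s, t, 3)
print([f"{p:03b}" for p in path])  # ['010', '011', '111']
print(bin(s ^ t).count("1"))       # Hamming distance = 2 = number of links
```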
k-ary d-cube networks
• A d-dimensional hypercube is a binary d-cube, or 2-ary d-cube (2 PEs along each dimension)
• In general, for k-ary d-cubes
  - d is the dimension of the network
  - k is the radix, i.e., the number of PEs in each dimension
• The number of PEs is p = k^d
  - e.g., a 2-D mesh with p PEs: p = k², or k = p^(1/2)
Evaluating static interconnection networks (in terms of cost and performance)
• Diameter
  - the maximum distance between any two PEs in the network
  - the distance between two PEs is the length of the shortest path between them
  - the distance determines communication time
• Diameter of networks (evaluated in the sketch below)
  - completely-connected network: 1
  - star-connected network: 2
  - ring: ⌊p/2⌋
  - 2-D mesh: 2(√p − 1)
  - wraparound 2-D mesh: 2⌊√p/2⌋
  - hypercube: log₂ p
  - complete binary tree: 2 log((p+1)/2)
    i.e., p is the number of processors, which is equal to the number of nodes (e.g., 15 PEs);
    h = log((p+1)/2) and the diameter is 2h (see figure on following slide)
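The formulas are easy to evaluate for a concrete size; a sketch, assuming p is a perfect square for the meshes, a power of 2 for the hypercube, and p = 2^(h+1) − 1 for the complete binary tree:

```python
# Sketch: evaluating the diameter formulas above for concrete machine sizes.
import math

p = 16
print("ring:                ", p // 2)                   # floor(p/2) = 8
print("2-D mesh:            ", 2 * (math.isqrt(p) - 1))  # 2(sqrt(p) - 1) = 6
print("wraparound 2-D mesh: ", 2 * (math.isqrt(p) // 2)) # 2*floor(sqrt(p)/2) = 4
print("hypercube:           ", int(math.log2(p)))        # log2(p) = 4

p_tree = 15                            # 15 PEs -> height h = 3
h = int(math.log2((p_tree + 1) // 2))  # h = log2((p+1)/2)
print("complete binary tree:", 2 * h)  # leaf-to-leaf path = 2h = 6
```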
Evaluating static interconnection networks (continued)
• Connectivity
  - a measure of the multiplicity of paths between two PEs
• High connectivity – lower contention for communication resources
• Arc connectivity
  - a measure of connectivity: the minimum number of arcs that must be removed to break the network into two disconnected networks
• Arc connectivity of networks
  - linear arrays, stars, trees: 1
  - rings, 2-D meshes without wraparound: 2
  - 2-D meshes with wraparound: 4
  - d-dimensional hypercubes: d
Bisection Width
• The minimum number of communication links that have to be removed to partition the network into two equal halves
• Bisection width of networks (the hypercube case is verified in the sketch after the figure below)
  - ring: 2
  - 2-D mesh without wraparound: √p
  - 2-D mesh with wraparound: 2√p
  - tree, star: 1 (the bisection width of the star is the same as that of the tree, since the star is a special case of a tree)
  - fully-connected network: p²/4
  - hypercube: p/2
    i.e., a d-dimensional hypercube (p processors) consists of two (d−1)-dimensional hypercubes (p/2 processors each),
    with p/2 links connecting corresponding PEs
HyC bisection width figure
[Figure: a d-dimensional hypercube with p PEs split into two (d−1)-dimensional hypercubes, HyC1 and HyC2, of p/2 PEs each, joined by p/2 links]
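The p/2 figure can be verified by brute force; this sketch (illustrative code) counts the links whose endpoints fall on opposite sides of the MSB split.

```python
# Sketch: brute-force check that a d-cube's bisection width is p/2.
# Splitting labels on the most significant bit gives the two (d-1)-subcubes;
# we count the links whose endpoints land on opposite sides of that cut.

def crossing_links(d):
    p = 1 << d
    half = p >> 1
    count = 0
    for u in range(p):
        for bit in range(d):
            v = u ^ (1 << bit)                      # neighbor across dimension `bit`
            if u < v and (u < half) != (v < half):  # link crosses the cut
                count += 1
    return count

for d in range(1, 5):
    p = 1 << d
    print(f"d={d}, p={p:2d}: crossing links = {crossing_links(d)} (p/2 = {p // 2})")
```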
Bisection bandwidth
• Channel width
  - the number of bits that can be communicated simultaneously over a link connecting two PEs (the number of wires)
• Channel rate
  - the peak rate at which a single wire can deliver bits
• Channel bandwidth (channel rate × channel width)
  - the peak rate at which data can be communicated between the ends of a communication link
• Bisection bandwidth (bisection width × channel bandwidth)
  - the minimum volume of communication allowed between any two halves of a network with an equal number of PEs
  - a worked example follows below
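A worked calculation under assumed hardware numbers; the channel width and rate below are invented for illustration, not taken from the slides.

```python
# Sketch: bisection bandwidth = bisection width * channel bandwidth,
# for a 64-PE hypercube with assumed link parameters.

channel_width = 32     # bits per link (assumed)
channel_rate = 100e6   # peak bit rate per wire, bits/s (assumed)
channel_bandwidth = channel_width * channel_rate  # bits/s per link

p = 64                          # hypercube: bisection width = p/2
bisection_width = p // 2
bisection_bandwidth = bisection_width * channel_bandwidth
print(f"{bisection_bandwidth / 8 / 1e9:.1f} GB/s between the two halves")  # 12.8
```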
Cost
• Measured in terms of (a) the number of communication links and (b) the bisection bandwidth
• Number of communication links (number of wires required)
  - linear arrays, trees: p − 1 links
  - d-dimensional mesh with wraparound: dp
  - hypercube: (p/2) log₂ p
• Bisection bandwidth
  - measures cost by providing a lower bound on the area (2-D) or the volume (3-D) of the packaging
  - if the bisection width of a network is w,
    the lower bound on the area in 2-D is Θ(w²),
    and the lower bound on the volume in 3-D is Θ(w^(3/2))
Embedding other networks into a hypercube
• Given two graphs G(V, E) and G'(V', E'), embedding graph G into graph G' maps each vertex in the set V onto a vertex (or a set of vertices) in set V' and each edge in the set E onto an edge (or a set of edges) in E'
  - nodes correspond to PEs and edges correspond to communication links
• Why do we need embedding?
  - Answer: it may be necessary to adapt one network to another (e.g., when an application is written for a specific network that is not available at present)
• Parameters
  - congestion (# of edges in E mapped onto one edge in E')
  - dilation (the reverse of congestion: # of edges in E' onto which one edge in E is mapped)
  - expansion (ratio of # of vertices in V' corresponding to one vertex in V)
  - contraction (the reverse of expansion)
Embedding a linear array into a HyC
• A linear array of 2^d processors (labeled 0 … 2^d − 1) can be embedded into a d-dimensional hypercube by mapping processor i of the linear array to the processor with label G(i, d) of the hypercube
  - G(0, 1) = 0
  - G(1, 1) = 1
  - G(i, x+1) = G(i, x), for i < 2^x
  - G(i, x+1) = 2^x + G(2^(x+1) − 1 − i, x), for i ≥ 2^x
• The function G is the binary reflected Gray code (RGC); see the sketch below
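The recursion translates directly into code. This sketch (function name assumed) generates the RGC labels and verifies the key embedding property: neighboring array PEs map to neighboring hypercube PEs.

```python
# Sketch: the binary reflected Gray code G(i, d) used to embed a linear array
# in a hypercube, following the recursion given above.

def gray(i, d):
    """Label of the hypercube PE hosting linear-array PE i (0 <= i < 2**d)."""
    if d == 1:
        return i                       # G(0,1) = 0, G(1,1) = 1
    half = 1 << (d - 1)                # 2^(d-1)
    if i < half:
        return gray(i, d - 1)          # first half: recurse unchanged
    return half + gray(2 * half - 1 - i, d - 1)  # second half: reflect, prefix 1

labels = [gray(i, 3) for i in range(8)]
print([f"{g:03b}" for g in labels])    # 000 001 011 010 110 111 101 100
# Adjacent array PEs map to hypercube PEs whose labels differ in exactly one bit.
assert all(bin(a ^ b).count("1") == 1 for a, b in zip(labels, labels[1:]))
```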