Course Outline • Introduction to algorithms and applications • Parallel machines and architectures: overview of parallel machines, trends in top-500; cluster computers, Blue Gene • Programming methods, languages, and environments: message passing (SR, MPI, Java); higher-level language: HPF • Applications: N-body problems, search algorithms, bioinformatics • Grid computing: multimedia content analysis on Grids (guest lecture by Frank Seinstra)
Parallel Machines • Parallel Computing – Techniques and Applications Using Networked Workstations and Parallel Computers (2/e), Section 1.3 + (part of) 1.4 • Barry Wilkinson and Michael Allen, Pearson, 2005
Overview • Processor organizations • Types of parallel machines • Processor arrays • Shared-memory multiprocessors • Distributed-memory multicomputers • Cluster computers • Blue Gene
Processor Organization • Network topology is a graph • A node is a processor • An edge is a communication path • Evaluation criteria • Diameter (maximum distance between any two nodes) • Bisection width (minimum number of edges that must be removed to split the graph into two almost equal halves) • Number of edges per node
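The three evaluation criteria can be computed directly for a small topology. The following is a minimal sketch, not part of the original slides: the function names and the 6-node ring example are assumptions chosen for illustration, and the bisection search is brute force, so it is only practical for a handful of nodes.

```python
from collections import deque
from itertools import combinations


def diameter(adj):
    """Longest shortest path (in hops) between any two nodes of a connected graph."""
    def eccentricity(src):
        dist = {src: 0}
        queue = deque([src])
        while queue:  # breadth-first search from src
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        return max(dist.values())
    return max(eccentricity(n) for n in adj)


def bisection_width(adj):
    """Try every split into halves of size floor(n/2) and ceil(n/2); keep the smallest cut."""
    nodes = sorted(adj)
    best = None
    for left in combinations(nodes, len(nodes) // 2):
        left = set(left)
        cut = sum(1 for u in left for v in adj[u] if v not in left)
        best = cut if best is None else min(best, cut)
    return best


def max_degree(adj):
    """Largest number of edges attached to a single node."""
    return max(len(neighbors) for neighbors in adj.values())


# Hypothetical example topology: a ring of 6 processors.
ring = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
print(diameter(ring), bisection_width(ring), max_degree(ring))  # 3 2 2
```

A low diameter bounds the worst-case number of hops, a high bisection width indicates how much traffic can cross between the two halves, and a low edge count per node keeps the hardware cost down.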
Mesh • q-dimensional lattice; q = 2 gives a 2-D grid • Number of nodes k^2 • Diameter 2(k - 1) • Bisection width k • Edges per node 4
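The 2-D mesh figures can be checked numerically. This sketch assumes a k x k mesh without wraparound links and an even k; the helper names are hypothetical. It counts the links crossing a cut down the middle, which exhibits a bisection of width k but does not by itself prove that no smaller cut exists.

```python
from itertools import product


def mesh_neighbors(x, y, k):
    """Neighbors of node (x, y) in a k x k mesh without wraparound links."""
    candidates = [(x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1)]
    return [(a, b) for a, b in candidates if 0 <= a < k and 0 <= b < k]


def mesh_metrics(k):
    """Return (nodes, diameter, middle cut size, max edges per node) for even k."""
    nodes = list(product(range(k), repeat=2))
    # In a mesh the shortest path length equals the Manhattan distance.
    diameter = max(abs(ax - bx) + abs(ay - by)
                   for (ax, ay) in nodes for (bx, by) in nodes)
    # Cut between columns k//2 - 1 and k//2 and count the links that cross it.
    cut = sum(1 for (x, y) in nodes
              if x == k // 2 - 1 and (x + 1, y) in mesh_neighbors(x, y, k))
    degree = max(len(mesh_neighbors(x, y, k)) for (x, y) in nodes)
    return len(nodes), diameter, cut, degree


print(mesh_metrics(4))  # (16, 6, 4, 4)
```

For k = 4 the result matches the slide: 16 nodes, diameter 2(k - 1) = 6, bisection width k = 4, and at most 4 edges per node.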
Binary Tree • Number of nodes 2^k - 1 • Diameter 2(k - 1) • Bisection width 1 • Edges per node 3
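A short check of the binary tree figures, assuming k counts the levels of a complete binary tree and that nodes use heap numbering (root = 1, parent of i is i // 2). Both the numbering and the function names are illustrative assumptions, not taken from the slides.

```python
def tree_distance(a, b):
    """Hops between nodes a and b: repeatedly move the deeper node to its parent."""
    hops = 0
    while a != b:
        if a > b:
            a //= 2
        else:
            b //= 2
        hops += 1
    return hops


def subtree_size(root, n):
    """Number of nodes in the subtree rooted at `root` of an n-node heap-numbered tree."""
    if root > n:
        return 0
    return 1 + subtree_size(2 * root, n) + subtree_size(2 * root + 1, n)


def binary_tree_metrics(k):
    """Return (nodes, diameter, sizes of the two parts after cutting one root link)."""
    n = 2 ** k - 1
    nodes = range(1, n + 1)
    diameter = max(tree_distance(a, b) for a in nodes for b in nodes)
    # Removing the single link between the root and its left child splits the tree.
    left = subtree_size(2, n)
    return n, diameter, (left, n - left)


print(binary_tree_metrics(4))  # (15, 6, (7, 8))
```

Cutting one link at the root yields parts of 7 and 8 nodes, which is why the bisection width is 1 and why the definition only asks for almost equal halves. The root link is the bottleneck for all traffic between the two subtrees.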
Hypertree • Tree with multiple roots (see Figure 3-3), gives better bisection width • 4-ary tree: • Number of nodes 2^k (2^(k+1) - 1) • Diameter 2k • Bisection width 2^(k+1) • Edges per node 6
Engineering solution: fat tree • Tree with more bandwidth at links near the root • Example: CM-5
Hypercube • k-dimensional cube; each node has a binary label, and nodes whose labels differ in 1 bit are connected • Number of nodes 2^k • Diameter k • Bisection width 2^(k-1) • Edges per node k
Hypercube • Label nodes with a binary value, connect nodes that differ in 1 coordinate • Number of nodes 2^k • Diameter k • Bisection width 2^(k-1) • Edges per node k
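The binary labeling makes the hypercube easy to express in code: a neighbor is obtained by flipping one bit, and the hop distance between two nodes is the number of bits in which their labels differ. A minimal sketch with hypothetical function names:

```python
def hypercube_neighbors(node, k):
    """Neighbors of `node` in a k-dimensional hypercube: flip each of the k bits once."""
    return [node ^ (1 << bit) for bit in range(k)]


def hop_distance(a, b):
    """Shortest path length: the number of differing label bits (Hamming distance)."""
    return bin(a ^ b).count("1")


def hypercube_metrics(k):
    """Return (nodes, diameter, links crossing the highest-bit cut, edges per node)."""
    n = 1 << k
    # The hypercube is symmetric, so measuring from node 0 is enough.
    diameter = max(hop_distance(0, b) for b in range(n))
    # Split on the highest bit; each node in the lower half has one link across.
    top = 1 << (k - 1)
    cut = sum(1 for a in range(top) if (a ^ top) in hypercube_neighbors(a, k))
    return n, diameter, cut, len(hypercube_neighbors(0, k))


print(hypercube_neighbors(0b101, 3))  # [4, 7, 1]
print(hypercube_metrics(4))           # (16, 4, 8, 4)
```

For k = 4 this gives 2^k = 16 nodes, diameter k = 4, a cut of 2^(k-1) = 8 links, and k = 4 edges per node. The cost of the good diameter and bisection width is that the number of edges per node grows with the machine size.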
Types of parallel machines • Processor arrays • Shared-memory multiprocessors • Distributed-memory multicomputers
Processor Arrays • Instructions operate on scalars or vectors • Processor array = front-end + synchronized processing elements • Front-end • Sequential machine that executes program • Vector operations are broadcast to PEs • Processing element • Performs operation on its part of the vector • Communicates with other PEs through a network
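The division of labor between front-end and processing elements can be sketched as a sequential simulation. The class names and the chunking scheme are assumptions for illustration; in a real processor array the PEs execute each broadcast operation in lockstep rather than one after another, and the input vector is assumed to be non-empty.

```python
class ProcessingElement:
    """Holds one part of the vector and applies broadcast operations to it."""

    def __init__(self, chunk):
        self.chunk = list(chunk)

    def execute(self, op):
        self.chunk = [op(x) for x in self.chunk]


class FrontEnd:
    """Sequential machine that runs the program and broadcasts vector operations."""

    def __init__(self, vector, num_pes):
        size = -(-len(vector) // num_pes)  # ceiling division: elements per PE
        self.pes = [ProcessingElement(vector[i:i + size])
                    for i in range(0, len(vector), size)]

    def broadcast(self, op):
        for pe in self.pes:  # simulated one by one; hardware does this in lockstep
            pe.execute(op)

    def gather(self):
        return [x for pe in self.pes for x in pe.chunk]


front_end = FrontEnd(list(range(8)), num_pes=4)
front_end.broadcast(lambda x: x * 2)   # every PE doubles its elements
front_end.broadcast(lambda x: x + 1)   # every PE then adds one
print(front_end.gather())              # [1, 3, 5, 7, 9, 11, 13, 15]
```

There is a single instruction stream (the front-end program) and many data streams (one per PE), which is what makes this organization SIMD.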
Examples of Processor Arrays • CM-200, MasPar MP-1, MP-2, ICL DAP (~1970s) • Japanese Earth Simulator (2002, former #1 of top-500)
Shared-Memory Multiprocessors • Bus easily gets saturated => add caches to CPUs • Central problem: cache coherency • Snooping cache: monitor bus, invalidate copy on write • Write-through or copy-back • Bus-based multiprocessors do not scale
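The snooping idea can be illustrated with a small simulation. This is a simplified sketch assuming a write-through policy with invalidation on write; the class names are hypothetical, and real protocols track more state per cache line than shown here.

```python
class Bus:
    """Shared bus with main memory; every cache sees every write on the bus."""

    def __init__(self):
        self.memory = {}
        self.caches = []
        self.transactions = 0

    def read(self, addr):
        self.transactions += 1
        return self.memory.get(addr, 0)

    def write(self, writer, addr, value):
        self.transactions += 1
        self.memory[addr] = value  # write-through: memory is updated immediately
        for cache in self.caches:
            if cache is not writer:
                cache.snoop_write(addr)


class SnoopingCache:
    """Per-CPU cache that monitors the bus and invalidates stale copies."""

    def __init__(self, bus):
        self.bus = bus
        self.lines = {}
        bus.caches.append(self)

    def read(self, addr):
        if addr not in self.lines:  # miss: fetch the value over the bus
            self.lines[addr] = self.bus.read(addr)
        return self.lines[addr]     # hit: no bus traffic

    def write(self, addr, value):
        self.lines[addr] = value
        self.bus.write(self, addr, value)

    def snoop_write(self, addr):
        self.lines.pop(addr, None)  # another CPU wrote: drop our copy


bus = Bus()
cpu0, cpu1 = SnoopingCache(bus), SnoopingCache(bus)
bus.memory[0x10] = 1
print(cpu0.read(0x10), cpu1.read(0x10))  # 1 1 (two misses, two bus reads)
cpu0.write(0x10, 2)                      # invalidates the copy held by cpu1
print(cpu1.read(0x10))                   # 2 (re-fetched after invalidation)
print(bus.transactions)                  # 4
```

Repeated reads hit in the cache and generate no bus traffic, but every write still goes over the shared bus, which is one reason bus-based designs stop scaling as CPUs are added.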
Other Multiprocessor Designs (1/2) • Switch-based multiprocessors (e.g., crossbar) • Expensive (requires many very fast components)
Other Multiprocessor Designs (2/2) • Non-Uniform Memory Access (NUMA) multiprocessors • Memory is distributed • Some memory is faster to access than other memory • Example: • Teras at SARA, Dutch national supercomputer (1024-node SGI)
Distributed-Memory Multicomputers • Each processor only has a local memory • Processors communicate by sending messages over a network • Routing of messages: • Packet-switched message routing: split message into packets, buffered at intermediate nodes • Store-and-forward • Cut-through routing, wormhole routing • Circuit-switched message routing: establish path between source and destination
Store-and-forward Routing • Messages are forwarded one node at a time • Forwarding is done in software • Every processor on the path from source to destination is involved • Latency proportional to distance × message length • Examples: Parsytec GCel (T800 transputers), Intel iPSC
Circuit-switched Message Routing • Each node has a routing module • Circuit set up between source and destination • Latency linear in distance + message length • Example: Intel iPSC/2
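The difference between the two latency behaviors can be made concrete with a simple cost model. This is a rough sketch under stated assumptions: `byte_time` (time to move one byte over one link) and `setup_time_per_hop` (time to reserve one link of the circuit) are hypothetical parameters, and software overhead, contention, and acknowledgements are ignored.

```python
def store_and_forward_latency(hops, message_bytes, byte_time):
    """The whole message is received and re-sent at every hop: distance x length."""
    return hops * message_bytes * byte_time


def circuit_switched_latency(hops, message_bytes, byte_time, setup_time_per_hop):
    """The circuit is set up once along the path, then the message streams through:
    distance + length."""
    return hops * setup_time_per_hop + message_bytes * byte_time


# Hypothetical numbers: 8 hops, 1000-byte message, 1 time unit per byte per link,
# 10 time units to set up each link of the circuit.
print(store_and_forward_latency(8, 1000, 1))     # 8000
print(circuit_switched_latency(8, 1000, 1, 10))  # 1080
```

Under this model, doubling the distance doubles the store-and-forward latency but adds only the extra setup time in the circuit-switched case, so the gap widens with larger machines and longer messages.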
Modern routing techniques • Circuit switching: needs to reserve all links in the path (cf. old telephone system) • Packet switching: high latency, needs buffering space (cf. postal mail) • Cut-through routing: packet switching, but packets are forwarded immediately (without buffering) if the outgoing link is available • Wormhole routing: transmit the head (a few bits) of the message; the rest follows like a worm
Distributed Shared Memory • Shared memory is relatively easy to program, but doesn’t scale • Distributed memory is hard to program, but does scale • Distributed Shared Memory (DSM): provide a shared-memory programming model on top of distributed-memory hardware • Shared Virtual Memory (SVM): use memory management hardware (paging), copy pages over the network • Object-based: provide replicated shared objects (Orca language) • Was a hot research topic in the 1990s, but performance remained the bottleneck
Flynn's Taxonomy • Instruction stream: sequence of instructions • Data stream: sequence of data manipulated by instructions • SISD (Single Instruction, Single Data): traditional uniprocessors • SIMD (Single Instruction, Multiple Data): processor arrays • MISD (Multiple Instruction, Single Data): nonexistent? • MIMD (Multiple Instruction, Multiple Data): multiprocessors and multicomputers