Course Outline • Introduction to algorithms and applications • Parallel machines and architectures: overview of parallel machines, trends in the top-500, clusters, many-cores • Programming methods, languages, and environments: message passing (SR, MPI, Java); higher-level language: HPF • Applications: N-body problems, search algorithms • Many-core (GPU) programming (Rob van Nieuwpoort)
Parallel Machines Parallel Computing – Techniques and Applications Using Networked Workstations and Parallel Computers (2/e), Section 1.3 + (part of) 1.4, Barry Wilkinson and Michael Allen, Pearson, 2005
Overview • Processor organizations • Types of parallel machines • Processor arrays • Shared-memory multiprocessors • Distributed-memory multicomputers • Cluster computers • Blue Gene
Processor Organization • Network topology is a graph • A node is a processor • An edge is a communication path • Evaluation criteria • Diameter (maximum distance between any two nodes) • Bisection width (minimum number of edges that must be removed to split the graph into two almost equal halves) • Number of edges per node
Key issues in network design • Bandwidth: • Number of bits transferred per second • Latency: • Network latency: time for a message to travel through the network • Communication latency: total time to send the message, including software overhead and interface delays • Message latency (startup time): time to send a zero-length message • Diameter influences latency • Bisection width influences bisection bandwidth (the collective bandwidth over the "removed" edges)
Mesh • q-dimensional lattice; q = 2 gives a 2-D grid • Number of nodes: k² (k = √p) • Diameter: 2(k - 1) • Bisection width: k • Edges per node: 4
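As a quick sanity check on these formulas (an added sketch, not part of the original slides), the small C program below builds a k x k mesh and measures its diameter with breadth-first search; the grid size K is an arbitrary illustrative choice.

/* Sketch: verify the 2-D mesh diameter formula 2(k-1) by running BFS
 * from every node of a K x K grid. */
#include <stdio.h>
#include <string.h>

#define K 4                 /* illustrative grid size, p = K*K nodes */
#define P (K * K)

static int bfs_ecc(int src)  /* eccentricity of src = longest shortest path from it */
{
    int dist[P], queue[P], head = 0, tail = 0, ecc = 0;
    memset(dist, -1, sizeof dist);
    dist[src] = 0;
    queue[tail++] = src;
    while (head < tail) {
        int v = queue[head++], r = v / K, c = v % K;
        int nb[4] = { r > 0 ? v - K : -1, r < K - 1 ? v + K : -1,
                      c > 0 ? v - 1 : -1, c < K - 1 ? v + 1 : -1 };
        for (int i = 0; i < 4; i++)
            if (nb[i] >= 0 && dist[nb[i]] < 0) {
                dist[nb[i]] = dist[v] + 1;
                if (dist[nb[i]] > ecc) ecc = dist[nb[i]];
                queue[tail++] = nb[i];
            }
    }
    return ecc;
}

int main(void)
{
    int diameter = 0;
    for (int v = 0; v < P; v++) {        /* diameter = maximum eccentricity */
        int e = bfs_ecc(v);
        if (e > diameter) diameter = e;
    }
    printf("k=%d: measured diameter %d, formula 2(k-1) = %d\n",
           K, diameter, 2 * (K - 1));
    return 0;
}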
Binary Tree • Number of nodes: 2^k - 1 (k ≈ log₂ p, the number of levels) • Diameter: 2(k - 1) • Bisection width: 1 • Edges per node: 3
Hypertree • Tree with multiple roots, gives a better bisection width • 4-ary hypertree (depth k): • Number of nodes: 2^k (2^(k+1) - 1) • Diameter: 2k • Bisection width: 2^(k+1) • Edges per node: 6
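To make the "better bisection width" claim concrete (an added illustration, not from the slides), this C sketch tabulates the formulas above for a few values of k, side by side for the binary tree and the 4-ary hypertree.

/* Sketch: compare the slide formulas for the binary tree and the
 * 4-ary hypertree; note how the hypertree's bisection width grows. */
#include <stdio.h>

int main(void)
{
    printf("%3s %12s %6s %10s   %12s %6s %10s\n",
           "k", "tree nodes", "diam", "bisection",
           "htree nodes", "diam", "bisection");
    for (int k = 2; k <= 6; k++) {
        long tree_nodes = (1L << k) - 1;                      /* 2^k - 1          */
        long tree_diam  = 2 * (k - 1);
        long tree_bisec = 1;
        long ht_nodes   = (1L << k) * ((1L << (k + 1)) - 1);  /* 2^k (2^(k+1) - 1) */
        long ht_diam    = 2 * k;
        long ht_bisec   = 1L << (k + 1);                      /* 2^(k+1)          */
        printf("%3d %12ld %6ld %10ld   %12ld %6ld %10ld\n",
               k, tree_nodes, tree_diam, tree_bisec,
               ht_nodes, ht_diam, ht_bisec);
    }
    return 0;
}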
Engineering solution: fat tree • Tree with more bandwidth at the links near the root • Example: CM-5
Hypercube • k-dimensional cube: label each node with a k-bit binary value and connect nodes whose labels differ in exactly 1 bit • Number of nodes: 2^k • Diameter: k • Bisection width: 2^(k-1) • Edges per node: k
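A small C sketch (added as an illustration; the node labels and the example message are made up) shows why the binary labelling is convenient: neighbours are found by flipping one bit, and a message can be routed by fixing the differing bits one dimension at a time, so it never needs more than k hops, which is exactly the diameter.

/* Sketch: hypercube neighbours and dimension-order routing via XOR. */
#include <stdio.h>

#define K 4   /* illustrative dimension: 2^4 = 16 nodes */

static void route(unsigned src, unsigned dst)
{
    unsigned cur = src;
    printf("route %u -> %u:", src, dst);
    for (int bit = 0; bit < K; bit++)       /* resolve one dimension per hop */
        if ((cur ^ dst) & (1u << bit)) {
            cur ^= 1u << bit;               /* hop to the neighbour that fixes this bit */
            printf(" %u", cur);
        }
    printf("\n");
}

int main(void)
{
    unsigned node = 5;                      /* binary 0101 */
    printf("neighbours of %u:", node);
    for (int bit = 0; bit < K; bit++)       /* k edges per node */
        printf(" %u", node ^ (1u << bit));
    printf("\n");
    route(5, 10);                           /* 0101 -> 1010 differs in all 4 bits */
    return 0;
}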
Types of parallel machines • Processor arrays • Shared-memory multiprocessors • Distributed-memory multicomputers
Processor Arrays • Instructions operate on scalars or vectors • Processor array = front-end + synchronized processing elements
Processor Arrays • Front-end • Sequential machine that executes program • Vector operations are broadcast to PEs • Processing element • Performs operation on its part of the vector • Communicates with other PEs through a network
Examples of Processor Arrays • CM-200, MasPar MP-1/MP-2, ICL DAP (1970s-1990s) • Earth Simulator (Japan, 2002, former #1 of the top-500) • The ideas are now applied in GPUs and CPU extensions like MMX
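As a hint of how the SIMD idea lives on in CPU extensions, the sketch below uses SSE intrinsics (the floating-point successor of MMX); it assumes an x86 CPU and a compiler with <xmmintrin.h> support, e.g. gcc with -msse. One instruction operates on four floats at a time, in the same single-instruction, multiple-data style as a processor array.

/* Sketch: elementwise vector addition, four lanes per instruction. */
#include <stdio.h>
#include <xmmintrin.h>

int main(void)
{
    float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    float b[8] = {10, 20, 30, 40, 50, 60, 70, 80};
    float c[8];

    for (int i = 0; i < 8; i += 4) {              /* 4 lanes per SSE register */
        __m128 va = _mm_loadu_ps(&a[i]);
        __m128 vb = _mm_loadu_ps(&b[i]);
        _mm_storeu_ps(&c[i], _mm_add_ps(va, vb)); /* one add instruction, 4 results */
    }
    for (int i = 0; i < 8; i++)
        printf("%.0f ", c[i]);
    printf("\n");
    return 0;
}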
Shared-Memory Multiprocessors • Bus easily gets saturated => add caches to CPUs • Central problem: cache coherency • Snooping cache: monitor bus, invalidate copy on write • Write-through or copy-back • Bus-based multiprocessors do not scale
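The cost of keeping caches coherent can be made visible with a small, machine-dependent experiment (an added illustration, Linux/POSIX and a 64-byte cache line assumed): when two threads write counters that share a cache line, every write makes the snooping protocol invalidate the other core's copy; padding the counters onto separate lines removes that traffic.

/* Sketch of false sharing; compile with e.g. gcc -O2 -pthread. */
#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define ITERS 50000000L
#define LINE 64   /* assumed cache-line size in bytes */

struct { volatile long a; volatile long b; } same_line;                 /* one cache line  */
struct { volatile long a; char pad[LINE]; volatile long b; } padded;    /* separate lines  */

static void *inc_same_a(void *x) { (void)x; for (long i = 0; i < ITERS; i++) same_line.a++; return NULL; }
static void *inc_same_b(void *x) { (void)x; for (long i = 0; i < ITERS; i++) same_line.b++; return NULL; }
static void *inc_pad_a(void *x)  { (void)x; for (long i = 0; i < ITERS; i++) padded.a++;    return NULL; }
static void *inc_pad_b(void *x)  { (void)x; for (long i = 0; i < ITERS; i++) padded.b++;    return NULL; }

static double seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

static void run(void *(*f)(void *), void *(*g)(void *), const char *label)
{
    pthread_t t1, t2;
    double t0 = seconds();
    pthread_create(&t1, NULL, f, NULL);
    pthread_create(&t2, NULL, g, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("%-32s %.2f s\n", label, seconds() - t0);
}

int main(void)
{
    run(inc_same_a, inc_same_b, "counters in one cache line:");
    run(inc_pad_a,  inc_pad_b,  "counters padded apart:");
    return 0;
}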
Other Multiprocessor Designs (1/2) • Switch-based multiprocessors (e.g., crossbar) • Expensive (requires many very fast components)
Other Multiprocessor Designs (2/2) • Non-Uniform Memory Access (NUMA) multiprocessors • Memory is distributed • Some memory is faster to access than other memory • Example: • Teras at SARA, Dutch National Supercomputer (1024-node SGI) • Ideas now applied in multi-cores
Distributed-Memory Multicomputers • Each processor only has a local memory • Processors communicate by sending messages over a network • Routing of messages: • Packet-switched message routing: split message into packets, buffered at intermediate nodes • Store-and-forward • Circuit-switched message routing: establish path between source and destination
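A minimal message-passing sketch in C with MPI (the message-passing library used later in the course): process 0 sends one integer that process 1 receives. The file name and the mpicc/mpirun invocation are only illustrative.

/* Sketch: point-to-point message passing with MPI.
 * Build and run with e.g.: mpicc send.c -o send && mpirun -np 2 ./send */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        int payload = 42;
        MPI_Send(&payload, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* dest = 1, tag = 0 */
    } else if (rank == 1) {
        int payload;
        MPI_Recv(&payload, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("process 1 received %d from process 0\n", payload);
    }

    MPI_Finalize();
    return 0;
}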
Packet-switched Message Routing • Messages are forwarded one node at a time • Forwarding is done in software • Every processor on the path from source to destination is involved • Latency proportional to distance × message length • Old examples: Parsytec GCel (T800 transputers), Intel iPSC
Circuit-switched Message Routing • Each node has a routing module • A circuit is set up between source and destination • Latency proportional to distance + message length • Example: Intel iPSC/2
Modern routing techniques • Circuit switching: needs to reserve all links in the path (cf. old telephone system) • Packet switching: high latency, buffering space (cf. postal mail) • Cut-through routing: packet switching, but immediately forward (without buffering) packets if outgoing link is available • Wormhole routing: transmit head (few bits) of message, rest follows like a worm
Performance • [Figure] Network latency versus distance (number of hops) for packet switching, wormhole routing, and circuit switching
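A back-of-the-envelope model of those curves (all parameter values are illustrative assumptions, not from the slides): store-and-forward packet switching pays the full message transmission time at every hop, while wormhole and circuit switching pay it only once plus a small per-hop cost.

/* Sketch: simple latency models for the two routing styles. */
#include <stdio.h>

int main(void)
{
    double msg_time = 10.0;   /* time to push the whole message onto one link (assumed units) */
    double hop_time = 1.0;    /* per-hop routing/setup delay (assumed units) */

    printf("%5s %20s %20s\n", "hops", "store-and-forward", "wormhole/circuit");
    for (int hops = 1; hops <= 8; hops++) {
        double sf = hops * (msg_time + hop_time);   /* ~ distance * message length */
        double wh = hops * hop_time + msg_time;     /* ~ distance + message length */
        printf("%5d %20.1f %20.1f\n", hops, sf, wh);
    }
    return 0;
}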
Distributed Shared Memory • Shared memory is easier to program, but doesn't scale • Distributed memory is harder to program, but does scale • Distributed Shared Memory (DSM): provide a shared-memory programming model on top of distributed-memory hardware • Shared Virtual Memory (SVM): use memory-management hardware (paging), copy pages over the network • Object-based: provide replicated shared objects (Orca language) • Was a hot research topic in the 1990s, but performance remained the bottleneck
Flynn's Taxonomy • Instruction stream: sequence of instructions • Data stream: sequence of data manipulated by instructions
Flynn's Taxonomy • SISD: Single Instruction, Single Data (traditional uniprocessors) • SIMD: Single Instruction, Multiple Data (processor arrays) • MISD: Multiple Instruction, Single Data (nonexistent?) • MIMD: Multiple Instruction, Multiple Data (multiprocessors and multicomputers)