Lecture 7: Part 2: Message Passing Multicomputers (Distributed Memory Machines)
Message Passing Multicomputer • Consists of multiple computing units, called nodes • Each node is an autonomous computer, consisting of • Processor(s) (may be an SMP) • Local memory • Disks or I/O peripherals (optional) • A full-scale OS (or, in some systems, a microkernel) • Nodes communicate by message passing • No-remote-memory-access (NORMA) machines • Also called distributed memory machines
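The programming model implied by this organization is explicit message passing between private address spaces. A minimal sketch, assuming MPI as the message-passing library (the slides do not name one), copies a buffer from node 0's local memory into node 1's:

```c
/* Minimal message-passing sketch (MPI assumed as the programming model).
 * Run with at least 2 processes. Each node owns its memory: data moves
 * only via explicit send/receive (no remote-memory access, NORMA). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    double buf[4] = {1.0, 2.0, 3.0, 4.0};

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Node 0 copies its local buffer into a message. */
        MPI_Send(buf, 4, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* The message is delivered into node 1's local memory. */
        MPI_Recv(buf, 4, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("node 1 received %.1f ... %.1f\n", buf[0], buf[3]);
    }

    MPI_Finalize();
    return 0;
}
```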
SP2 • IBM SP2: Scalable POWERparallel System • Based on the RISC System/6000 architecture (POWER2 processor) • Interconnect: High-Performance Switch (HPS)
SP2 - Nodes • 66.7 MHz POWER2 processor with L2 cache. • The POWER2 can issue six instructions per cycle (2 load/store, index increment, conditional branch, and 2 floating-point). • 2 floating-point units (FPU) + 2 fixed-point units (FXU) • Performs up to four floating-point operations per cycle (2 multiply-add ops). • Peak performance: 266 Mflop/s (66.7 MHz x 4).
IBM SP2 uses two types of nodes: • Thin node: 4 micro-channel (I/O) slots, 96 KB L2 cache, 64-512 MB memory, 1-4 GB disk • Wide node: 8 micro-channel slots, 288 KB L2 cache, 64-2048 MB memory, 1-8 GB disk
IBM SP2: Interconnect • Switch: • High Performance Switch (HPS), operating at 40 MHz; peak link bandwidth 40 MB/s (40 MHz x 8-bit links). • Omega-switch-based multistage network • Network interface: • Enhanced Communication Adapter. • The adapter incorporates an Intel i860 XR 64-bit microprocessor (40 MHz) that performs communication co-processing and data checking.
SP2 Switch Board • Each board has 8 switch elements operating at 40 MHz; 16 elements are installed for reliability (redundancy). • 4 routes between each pair of nodes (set at boot time) • Hardware latency: 500 nsec per board • Bisection bandwidth scales linearly with the number of nodes
SP2 HPS (a 16 x 16 switch board, built from Vulcan switch chips) • Maximum point-to-point bandwidth: 40 MB/s • 1 packet consists of 256 bytes • Flit size = 1 byte (wormhole routing)
SP2 Communication Adapter • one adapter per node • one switch board unit per rack • send FIFO has 128 entries (256 bytes each) • receive FIFO has 64 entries (256 bytes each) • 2 DMA engines
SP2 Communication Adapter (figure: POWER2 host node connected to the network adapter)
Intel Paragon Node Architecture • Up to three 50 MHz Intel i860 XP processors (75 Mflop/s each) per node (typically two in most configurations). • One is used as the message processor (communication co-processor), handling all communication events. • The other two are application processors (computation only). • Each node is a shared-memory multiprocessor (64-bit bus, 400 MB/s, with cache-coherence support). • Peak memory-to-processor bandwidth: 400 MB/s • Peak cache-to-processor bandwidth: 1.2 GB/s.
Intel Paragon Node Architecture • Message processor: • handles message protocol processing for the application program, • freeing the application processor to continue numeric computation while messages are transmitted and received (see the overlap sketch below). • Also used to implement efficient global operations such as synchronization, broadcast, and global reductions (e.g., global sum).
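The point of a dedicated message processor is that the application processor need not babysit the transfer. A minimal sketch of the resulting overlap, written with generic non-blocking MPI calls rather than the Paragon's native interface (an assumption for illustration), could look like this:

```c
/* Sketch of computation/communication overlap. On a machine like the
 * Paragon, the message processor progresses the transfer while the
 * application processor keeps computing. */
#include <mpi.h>

void halo_exchange(double *send_buf, double *recv_buf, int n,
                   int left, int right, double *interior, int m)
{
    MPI_Request reqs[2];

    /* Post the communication; it can proceed in the background. */
    MPI_Irecv(recv_buf, n, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(send_buf, n, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);

    /* Numeric work on interior points overlaps the message in flight. */
    for (int i = 0; i < m; i++)
        interior[i] *= 0.5;

    /* Block only when the boundary data is actually needed. */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
}
```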
Paragon Interconnect • 2-D mesh • I/O devices attached along a single side • 16-bit links, 175 MB/s • Mesh Routing Components (MRCs), • one per node. • Switch delay: 40 nsec per hop, or 70 nsec if the route changes dimension (from the x- to the y-dimension). • In a 512-PE machine (16 x 32 mesh), a 10-hop path takes roughly 400-700 nsec (see the worked estimate below).
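One way to read the quoted 400-700 nsec range, assuming the per-hop figures above, is to bound a 10-hop path by the same-dimension and dimension-change delays:

```latex
% Routing-delay estimate for a 10-hop path in the 16 x 32 mesh,
% using the per-hop switch delays quoted above.
\[
  t_{\min} = 10 \times 40\,\text{ns} = 400\,\text{ns},
  \qquad
  t_{\max} \approx 10 \times 70\,\text{ns} = 700\,\text{ns}
\]
```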
Cray T3D Node Architecture • Each processing node contains two PEs, a network interface, and a block transfer engine (shared by the two PEs). • PE: 150 MHz DEC 21064 Alpha AXP, 34-bit address, 64 MB memory, 150 Mflop/s • A 1024-processor system sustains a maximum of 152 Gflop/s.
Cray T3D Interconnect • Interconnect: 3D Torus, 16-bit data/link, 150 MHz • Communication channel peak rate: 300 MB/s.
T3D • The cost of routing data between processors through interconnect nodes is two clock cycles (6.67 nsec per cycle) per node traversed, plus one extra clock cycle to turn a corner. • The overhead of using the block transfer engine is high (startup cost > 480 cycles x 6.67 nsec ≈ 3.2 usec).
T3D: Local and Remote Memory • Local memory: • 16 or 64 MB DRAM per PE • Latency: 13 to 38 clock cycles (87 to 253 nsec) • Bandwidth: up to 320 MB/s • Remote memory: • Directly addressable by the processor, • Latency of 1 to 2 microseconds • Bandwidth: over 100 MB/s (measured in software).
T3D: Local and Remote Memory • Distributed shared-memory machine • All memory is directly accessible; no action is required by remote processors to formulate responses to remote requests (see the one-sided access sketch below). • NCC-NUMA: non-cache-coherent NUMA
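Because remote memory is directly addressable, NCC-NUMA machines of this kind are usually programmed with one-sided get/put operations (Cray's SHMEM library on the T3D). A minimal sketch, written against the OpenSHMEM API as a stand-in for the original Cray interface, is:

```c
/* One-sided remote access in the SHMEM style (OpenSHMEM API used here as
 * a stand-in for Cray SHMEM). The remote PE takes no action: the network
 * interface services the request directly. Run with at least 2 PEs. */
#include <shmem.h>
#include <stdio.h>

long counter = 0;   /* symmetric variable: exists on every PE */

int main(void)
{
    shmem_init();
    int me   = shmem_my_pe();
    int npes = shmem_n_pes();

    long value = 42;
    if (me == 1)
        /* PE 1 writes directly into PE 0's memory; PE 0 posts no receive. */
        shmem_long_put(&counter, &value, 1, 0);

    shmem_barrier_all();
    if (me == 0)
        printf("PE 0 of %d sees counter = %ld\n", npes, counter);

    shmem_finalize();
    return 0;
}
```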
T3D: Bisection Bandwidth • The network moves data in packets with payload sizes of either one or four 64-bit words. • The bisection bandwidth of a 1024-PE T3D is 76 GB/s • (512 nodes arranged as 8 x 8 x 8, 64 nodes per frame: 4 x 64 x 300 MB/s; see the worked estimate below)
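A worked reading of the quoted figure, assuming the factor of 4 accounts for the torus wrap-around links and bidirectional 300 MB/s channels (an interpretation, not stated on the slide):

```latex
% 8 x 8 x 8 torus of 512 nodes: 8 x 8 = 64 node positions per bisection
% plane; factor 4 = wrap-around links x bidirectional channels (assumed).
\[
  B_{\text{bisection}} \approx 4 \times 64 \times 300\,\text{MB/s}
                        = 76.8\,\text{GB/s} \approx 76\,\text{GB/s}
\]
```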
T3E Node • Alpha 21164, 4-issue (2 integer + 2 floating-point) • 600 Mflop/s (300 MHz) • E-registers
Cluster: Network of Workstations (NOW), Cluster of Workstations (COW), Pile of PCs (POPC)
Clusters of Workstations • Several workstations connected by a network • Connected with Fast/Gigabit Ethernet, ATM, FDDI, etc. • Some software to tightly integrate all resources • Each workstation is an independent machine
Cluster • Advantages • Cheaper • Easy to scale • Coarse-grain parallelism (traditionally) • Disadvantages • Longer communication latency compared with other parallel systems (traditionally)
ATM Cluster (Fore SBA-200) • Cluster nodes: Intel Pentium II, Pentium SMP, SGI, Sun SPARC, ... • NI location: I/O bus • Communication processor: Intel i960, 33 MHz, 128 KB RAM • Peak bandwidth: 19.4 MB/s or 77.6 MB/s per port • HKU: PearlCluster (16-node), SRG DP-ATM Cluster ($-node, 16.2 MB/s)
Myrinet Cluster • Cluster node: Intel Pentium II, Pentium SMP, SGI, Sun SPARC, .. • NI location: I/O bus • Communication processor: LANai, 25 MHz, 128 KB SRAM • Peak bandwidth: 80 MB/s --> 160 MB/s
Conclusion • Many current network interfaces employ a dedicated processor to offload communication tasks from the main processor. • Overlapping computation with communication improves performance.
Paragon • Main processor: 50 MHz i860 XP, 75 Mflop/s • NI location: memory bus (64-bit, 400 MB/s) • Communication processor: 50 MHz i860 XP (a general-purpose processor) • Peak bandwidth: 175 MB/s (16-bit link, 1 DMA engine)
SP2 • Main processor: 66.7 MHz POWER2, 266 Mflop/s • NI location: I/O bus (32-bit micro-channel) • Communication processor: 40 MHz i860 XR (a general-purpose processor) • Peak bandwidth: 40 MB/s (8-bit link, 40 MHz)
T3D • Main processor: 150 MHz DEC 21064 Alpha AXP, 150 Mflop/s • NI location: memory bus (320 MB/s local, 100 MB/s remote) • Communication processor: block transfer engine (BLT) controller, i.e. hardware circuitry • Peak bandwidth: 300 MB/s (16-bit data/link at 150 MHz)