Lecture 7: Part 2: Message Passing Multicomputers (Distributed Memory Machines)
Message Passing Multicomputer • Consists of multiple computing units, called nodes • Each node is an autonomous computer, consisting of • Processor(s) (may be an SMP) • Local memory • Disks or I/O peripherals (optional) • A full-scale OS (or, in some systems, a microkernel) • Nodes communicate by message passing • No-remote-memory-access (NORMA) machines • Also called distributed memory machines
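The programming model implied by this organization is explicit message passing between private address spaces. A minimal sketch, assuming MPI as the message-passing library (the slides do not name one), copies a buffer from node 0's local memory into node 1's:

```c
/* Minimal message-passing sketch (MPI assumed as the programming model).
 * Run with at least 2 processes. Each node owns its memory: data moves
 * only via explicit send/receive (no remote-memory access, NORMA). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    double buf[4] = {1.0, 2.0, 3.0, 4.0};

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Node 0 copies its local buffer into a message. */
        MPI_Send(buf, 4, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* The message is delivered into node 1's local memory. */
        MPI_Recv(buf, 4, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("node 1 received %.1f ... %.1f\n", buf[0], buf[3]);
    }

    MPI_Finalize();
    return 0;
}
```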
SP2 • IBM SP2: Scalable POWERparallel System • Based on the RISC System/6000 architecture (POWER2 processor) • Interconnect: High-Performance Switch (HPS)
SP2 - Nodes • 66.7 MHz POWER2 processor with L2 cache. • The POWER2 can issue six instructions per cycle (2 load/store, index increment, conditional branch, and 2 floating-point). • 2 floating-point units (FPU) + 2 fixed-point units (FXU) • Performs up to four floating-point operations per cycle (2 multiply-add ops). • Peak performance: 266 Mflop/s (66.7 MHz x 4).
IBM SP2 uses two types of nodes: • Thin node: 4 micro-channel (I/O) slots, 96 KB L2 cache, 64-512 MB memory, 1-4 GB disk • Wide node: 8 micro-channel slots, 288 KB L2 cache, 64-2048 MB memory, 1-8 GB disk
IBM SP2: Interconnect • Switch: • High Performance Switch (HPS), operating at 40 MHz; peak link bandwidth 40 MB/s (40 MHz x 8-bit links). • Omega-switch-based multistage network • Network interface: • Enhanced Communication Adapter. • The adapter incorporates an Intel i860 XR 64-bit microprocessor (40 MHz) that performs communication co-processing and data checking.
SP2 Switch Board • Each board has 8 switch elements operating at 40 MHz; 16 elements are installed for reliability (redundancy). • 4 routes between each pair of nodes (set at boot time) • Hardware latency: 500 nsec per board • Bisection bandwidth scales linearly with the number of nodes
SP2 HPS (a 16 x 16 switch board, built from Vulcan switch chips) • Maximum point-to-point bandwidth: 40 MB/s • 1 packet consists of 256 bytes • Flit size = 1 byte (wormhole routing)
SP2 Communication Adapter • one adapter per node • one switch board unit per rack • send FIFO has 128 entries (256 bytes each) • receive FIFO has 64 entries (256 bytes each) • 2 DMA engines
SP2 Communication Adapter (figure: POWER2 host node connected to the network adapter)
Intel Paragon Node Architecture • Up to three 50 MHz Intel i860 XP processors (75 Mflop/s each) per node (typically two in most configurations). • One is used as the message processor (communication co-processor), handling all communication events. • The other two are application processors (computation only). • Each node is a shared-memory multiprocessor (64-bit bus, 400 MB/s, with cache-coherence support). • Peak memory-to-processor bandwidth: 400 MB/s • Peak cache-to-processor bandwidth: 1.2 GB/s.
Intel Paragon Node Architecture • Message processor: • handles message protocol processing for the application program, • freeing the application processor to continue numeric computation while messages are transmitted and received (see the overlap sketch below). • Also used to implement efficient global operations such as synchronization, broadcast, and global reductions (e.g., global sum).
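The point of a dedicated message processor is that the application processor need not babysit the transfer. A minimal sketch of the resulting overlap, written with generic non-blocking MPI calls rather than the Paragon's native interface (an assumption for illustration), could look like this:

```c
/* Sketch of computation/communication overlap. On a machine like the
 * Paragon, the message processor progresses the transfer while the
 * application processor keeps computing. */
#include <mpi.h>

void halo_exchange(double *send_buf, double *recv_buf, int n,
                   int left, int right, double *interior, int m)
{
    MPI_Request reqs[2];

    /* Post the communication; it can proceed in the background. */
    MPI_Irecv(recv_buf, n, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(send_buf, n, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);

    /* Numeric work on interior points overlaps the message in flight. */
    for (int i = 0; i < m; i++)
        interior[i] *= 0.5;

    /* Block only when the boundary data is actually needed. */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
}
```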
Paragon Interconnect • 2-D mesh • I/O devices attached along a single side • 16-bit links, 175 MB/s • Mesh Routing Components (MRCs), • one per node. • Switch delay: 40 nsec per hop, or 70 nsec if the route changes dimension (from the x- to the y-dimension). • In a 512-PE machine (16 x 32 mesh), a 10-hop path takes roughly 400-700 nsec (see the worked estimate below).
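One way to read the quoted 400-700 nsec range, assuming the per-hop figures above, is to bound a 10-hop path by the same-dimension and dimension-change delays:

```latex
% Routing-delay estimate for a 10-hop path in the 16 x 32 mesh,
% using the per-hop switch delays quoted above.
\[
  t_{\min} = 10 \times 40\,\text{ns} = 400\,\text{ns},
  \qquad
  t_{\max} \approx 10 \times 70\,\text{ns} = 700\,\text{ns}
\]
```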
Cray T3D Node Architecture • Each processing node contains two PEs, a network interface, and a block transfer engine (shared by the two PEs). • PE: 150 MHz DEC 21064 Alpha AXP, 34-bit address, 64 MB memory, 150 Mflop/s • A 1024-processor system sustains a maximum of 152 Gflop/s.
Cray T3D Interconnect • Interconnect: 3D Torus, 16-bit data/link, 150 MHz • Communication channel peak rate: 300 MB/s.
T3D • The cost of routing data between processors through interconnect nodes is two clock cycles (6.67 nsec per cycle) per node traversed, plus one extra clock cycle to turn a corner. • The overhead of using the block transfer engine is high (startup cost > 480 cycles x 6.67 nsec ≈ 3.2 usec).
T3D: Local and Remote Memory • Local memory: • 16 or 64 MB DRAM per PE • Latency: 13 to 38 clock cycles (87 to 253 nsec) • Bandwidth: up to 320 MB/s • Remote memory: • Directly addressable by the processor, • Latency of 1 to 2 microseconds • Bandwidth: over 100 MB/s (measured in software).
T3D: Local and Remote Memory • Distributed shared-memory machine • All memory is directly accessible; no action is required by remote processors to formulate responses to remote requests (see the one-sided access sketch below). • NCC-NUMA: non-cache-coherent NUMA
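Because remote memory is directly addressable, NCC-NUMA machines of this kind are usually programmed with one-sided get/put operations (Cray's SHMEM library on the T3D). A minimal sketch, written against the OpenSHMEM API as a stand-in for the original Cray interface, is:

```c
/* One-sided remote access in the SHMEM style (OpenSHMEM API used here as
 * a stand-in for Cray SHMEM). The remote PE takes no action: the network
 * interface services the request directly. Run with at least 2 PEs. */
#include <shmem.h>
#include <stdio.h>

long counter = 0;   /* symmetric variable: exists on every PE */

int main(void)
{
    shmem_init();
    int me   = shmem_my_pe();
    int npes = shmem_n_pes();

    long value = 42;
    if (me == 1)
        /* PE 1 writes directly into PE 0's memory; PE 0 posts no receive. */
        shmem_long_put(&counter, &value, 1, 0);

    shmem_barrier_all();
    if (me == 0)
        printf("PE 0 of %d sees counter = %ld\n", npes, counter);

    shmem_finalize();
    return 0;
}
```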
T3D: Bisection Bandwidth • The network moves data in packets with payload sizes of either one or four 64-bit words. • The bisection bandwidth of a 1024-PE T3D is 76 GB/s • (512 nodes arranged as 8 x 8 x 8, 64 nodes per frame: 4 x 64 x 300 MB/s; see the worked estimate below)
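A worked reading of the quoted figure, assuming the factor of 4 accounts for the torus wrap-around links and bidirectional 300 MB/s channels (an interpretation, not stated on the slide):

```latex
% 8 x 8 x 8 torus of 512 nodes: 8 x 8 = 64 node positions per bisection
% plane; factor 4 = wrap-around links x bidirectional channels (assumed).
\[
  B_{\text{bisection}} \approx 4 \times 64 \times 300\,\text{MB/s}
                        = 76.8\,\text{GB/s} \approx 76\,\text{GB/s}
\]
```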
T3E Node • Alpha 21164, 4-issue (2 integer + 2 floating-point) • 600 Mflop/s (300 MHz) • E-registers
Cluster: Network of Workstations (NOW), Cluster of Workstations (COW), Pile of PCs (POPC)
Clusters of Workstations • Several workstations connected by a network • Connected with Fast/Gigabit Ethernet, ATM, FDDI, etc. • Some software to tightly integrate all resources • Each workstation is an independent machine
Cluster • Advantages • Cheaper • Easy to scale • Coarse-grain parallelism (traditionally) • Disadvantages • Longer communication latency compared with other parallel systems (traditionally)
ATM Cluster (Fore SBA-200) • Cluster nodes: Intel Pentium II, Pentium SMP, SGI, Sun SPARC, ... • NI location: I/O bus • Communication processor: Intel i960, 33 MHz, 128 KB RAM • Peak bandwidth: 19.4 MB/s or 77.6 MB/s per port • HKU: PearlCluster (16-node), SRG DP-ATM Cluster ($-node, 16.2 MB/s)
Myrinet Cluster • Cluster node: Intel Pentium II, Pentium SMP, SGI, Sun SPARC, .. • NI location: I/O bus • Communication processor: LANai, 25 MHz, 128 KB SRAM • Peak bandwidth: 80 MB/s --> 160 MB/s
Conclusion • Many current network interfaces employ a dedicated processor to offload communication tasks from the main processor. • Overlapping computation with communication improves performance.
Paragon • Main processor: 50 MHz i860 XP, 75 Mflop/s • NI location: memory bus (64-bit, 400 MB/s) • Communication processor: 50 MHz i860 XP (a general-purpose processor) • Peak bandwidth: 175 MB/s (16-bit link, 1 DMA engine)
SP2 • Main processor: 66.7 MHz POWER2, 266 Mflop/s • NI location: I/O bus (32-bit micro-channel) • Communication processor: 40 MHz i860 XR (a general-purpose processor) • Peak bandwidth: 40 MB/s (8-bit link, 40 MHz)
T3D • Main processor: 150 MHz DEC 21064 Alpha AXP, 150 Mflop/s • NI location: memory bus (320 MB/s local, 100 MB/s remote) • Communication processor: block transfer engine (BLT) controller, i.e. hardware circuitry • Peak bandwidth: 300 MB/s (16-bit data/link at 150 MHz)