
Lecture 7: Part 2: Message Passing Multicomputers


Presentation Transcript


  1. Lecture 7: Part 2: Message Passing Multicomputers (Distributed Memory Machines)

  2. Message Passing Multicomputer • Consists of multiple computing units, called nodes • Each node is an autonomous computer, consisting of • Processor(s) (possibly an SMP) • Local memory • Disks or I/O peripherals (optional) • Full-scale OS (or a microkernel on some systems) • Nodes communicate by message passing • No-remote-memory-access (NORMA) machines • Also called distributed memory machines
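Message passing between nodes is typically programmed through a library such as MPI. Below is a minimal sketch in C, with MPI used as a representative API (the lecture does not prescribe a particular library); node 0 sends one integer to node 1, and the tag and variable names are illustrative.

    /* Minimal message-passing sketch: node 0 sends an integer to node 1. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int rank, value = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);    /* to node 1 */
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);                           /* from node 0 */
            printf("node 1 received %d\n", value);
        }

        MPI_Finalize();
        return 0;
    }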

  3. IBM SP2

  4. SP2 • IBM SP2 => Scalable POWERparallel System • Developed based on RISC System/6000 architecture (POWER2 processor) • Interconnect: High-Performance Switch (HPS)

  5. SP2 - Nodes • 66.7 MHz POWER2 processor with L2 cache. • POWER2 can issue six instructions (2 load/store, index increment, conditional branch, and two floating-point) per cycle. • 2 floating-point units (FPU) + 2 fixed-point units (FXU) • Performs up to four floating-point operations (2 multiply-add ops) per cycle. • A peak performance of 266 Mflops (66.7 MHz x 4 flops/cycle) can be achieved.

  6. The IBM SP2 uses two types of nodes: • Thin node: • 4 micro-channel (I/O) slots, 96KB L2 cache, 64-512MB memory, 1-4 GB disk • Wide node: • 8 micro-channel slots, 288KB L2 cache, 64-2048MB memory, 1-8 GB disk

  7. SP2 Wide Node

  8. IBM SP2: Interconnect • Switch: • High Performance Switch (HPS), operates at 40 MHz, peak link bandwidth 40 MB/s (40 MHz x 8-bit). • Omega-switch-based multistage network • Network interface: • Enhanced Communication Adapter. • The adapter incorporates an Intel i860 XR 64-bit microprocessor (40 MHz) that performs communication coprocessing and data checking

  9. SP2 Switch Board • Each board has 8 switch elements operating at 40 MHz; 16 elements are installed for reliability • 4 routes between each pair of nodes (set at boot time) • Hardware latency is 500 nsec per board • Capable of scaling bisection bandwidth linearly with the number of nodes

  10. SP2 HPS (a 16 x 16 switch board, built from Vulcan switch chips) • Maximum point-to-point bandwidth: 40 MB/s • 1 packet consists of 256 bytes • Flit size = 1 byte (wormhole routing)

  11. SP2 Communication Adapter • one adapter per node • one switch board unit per rack • send FIFO has 128 entries (256 bytes each) • receive FIFO has 64 entries (256 bytes each) • 2 DMA engines

  12. SP2 Communication Adapter (diagram: host node POWER2 processor connected to the network adapter)

  13. 128-node SP2 (16 nodes per frame)

  14. INTEL PARAGON

  15. Intel Paragon (2-D mesh)

  16. Intel Paragon Node Architecture • Up to three 50 MHz Intel i860 processors (75 Mflop/s each) per node (usually two in most installations). • One of them is used as the message processor (communication co-processor), handling all communication events. • The remaining processors are application processors (computation only) • Each node is a shared-memory multiprocessor (64-bit bus, bus speed: 400 MB/s, with cache coherence support) • Peak memory-to-processor bandwidth: 400 MB/s • Peak cache-to-processor bandwidth: 1.2 GB/s.

  17. Intel Paragon Node Architecture • message processor: • handles message protocol processing for the application program, • freeing the application processor to continue with numeric computation while messages are transmitted and received. • also used to implement efficient global operations such as synchronization, broadcasting, and global reduction calculations (e.g., global sum).
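The global operations mentioned above map directly onto standard MPI collective calls. The C sketch below is illustrative; local_partial stands for whatever partial result each node computes.

    /* Global operations expressed as MPI collectives: barrier synchronization,
       broadcast from node 0, and a global sum over all nodes. */
    #include <mpi.h>

    void global_ops_example(double local_partial)
    {
        double param = 0.0, global_sum = 0.0;

        MPI_Barrier(MPI_COMM_WORLD);                         /* synchronization   */
        MPI_Bcast(&param, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD); /* broadcast, node 0 */
        MPI_Allreduce(&local_partial, &global_sum, 1,
                      MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);  /* global sum        */
        /* global_sum now holds the sum of local_partial over all nodes */
    }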

  18. Paragon Node Architecture

  19. Paragon Interconnect • 2-D Mesh • I/O devices are attached along one side of the mesh • 16-bit links, 175 MB/s • Mesh Routing Components (MRCs), • one for each node. • 40 nsec per hop (switch delay), 70 nsec if the message changes dimension (from x-dim to y-dim). • In a 512-PE system (16 x 32 mesh), a 10-hop path costs 400-700 nsec (see the sketch below)
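A rough C sketch of those switch-delay bounds, assuming the quoted 400-700 nsec range simply brackets every hop between the 40 nsec straight-through cost and the 70 nsec dimension-change cost (the function and model are illustrative, not from the lecture):

    /* Rough switch-delay bounds for a path of `hops` hops in the Paragon
       2-D mesh: 40 ns per straight hop, 70 ns for a hop that changes
       dimension.  Illustrative model only. */
    void paragon_delay_bounds_ns(int hops, double *min_ns, double *max_ns)
    {
        *min_ns = 40.0 * hops;   /* every hop stays in the same dimension */
        *max_ns = 70.0 * hops;   /* pessimistic: every hop charged 70 ns  */
    }
    /* For the 10-hop example above this reproduces the 400-700 nsec range. */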

  20. CRAY T3D

  21. Cray T3D Node Architecture • Each processing node contains two PEs, a network interface, and a block transfer engine (shared by the two PEs). • PE: 150 MHz DEC 21064 Alpha AXP, 34-bit address space, 64 MB memory, 150 MFLOPS • A 1024-processor system has a maximum sustained speed of 152 Gflop/s

  22. T3D Node and Network Interface

  23. Cray T3D Interconnect • Interconnect: 3D Torus, 16-bit data/link, 150 MHz • Communication channel peak rate: 300 MB/s.

  24. T3D • The cost of routing data between processors through interconnect nodes is two clock cycles (6.67 nsec per cycle) per node traversed, plus one extra clock cycle to turn a corner • The overhead of using the block transfer engine is high (startup cost: > 480 cycles x 6.67 nsec = 3.2 usec)
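The same figures, spelled out as a small C model (function names are illustrative):

    #define T3D_CLK_NS 6.67   /* one 150 MHz clock cycle in nanoseconds */

    /* Routing cost: 2 clocks per node traversed + 1 clock per corner turned. */
    double t3d_route_ns(int nodes_traversed, int corners_turned)
    {
        return T3D_CLK_NS * (2 * nodes_traversed + corners_turned);
    }

    /* Block-transfer-engine startup: > 480 clocks, i.e. roughly 3.2 usec,
       which is why small transfers are better done without the BLT. */
    double t3d_blt_startup_ns(void)
    {
        return 480 * T3D_CLK_NS;
    }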

  25. T3D: Local and Remote Memory • Local memory: • 16 or 64 MB DRAM per PE • Latency: 13 to 38 clock cycles (87 to 253 nsec) • Bandwidth: up to 320 MB/s • Remote memory: • Directly addressable by the processor, • Latency of 1 to 2 microseconds • Bandwidth: over 100 MB/s (measured in software).

  26. T3D: Local and Remote Memory • Distributed Shared Memory Machine • All memory is directly accessible; no action is required by remote processors to formulate responses to remote requests. • NCC-NUMA: non-cache-coherent NUMA
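This style of direct remote access is what one-sided put/get libraries expose. The sketch below is written in OpenSHMEM notation in C as a stand-in for Cray's original shmem library; the variable names and the neighbour pattern are illustrative:

    #include <shmem.h>
    #include <stdio.h>

    static long counter = 0;   /* symmetric: same address on every PE */

    int main(void)
    {
        shmem_init();
        int me    = shmem_my_pe();
        int right = (me + 1) % shmem_n_pes();

        /* One-sided put: write directly into the neighbour's memory;
           the remote processor takes no action. */
        shmem_long_p(&counter, (long)me, right);
        shmem_barrier_all();
        printf("PE %d: value written by left neighbour = %ld\n", me, counter);

        shmem_finalize();
        return 0;
    }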

  27. T3D: Bisection Bandwidth • The network moves data in packets with payload sizes of either one or four 64-bit words • The bisection bandwidth of a 1024-PE T3D is 76 GB/s • 512 nodes = 8 x 8 x 8 torus, 64 nodes per frame; 4 x 64 x 300 MB/s = 76.8 GB/s (worked out below)
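The arithmetic behind that figure, as a tiny C function. The interpretation of the factor of 4 (both link directions plus the torus wrap-around links) is an assumption, not something stated in the slide:

    /* Bisection bandwidth of a 1024-PE (512-node) T3D.
       Assumption: 64 nodes in the 8 x 8 cut plane, 4 links crossing the
       bisection per node position, each link at 300 MB/s. */
    double t3d_bisection_mb_s(void)
    {
        int    nodes_in_plane = 8 * 8;   /* 64 */
        int    links_per_node = 4;
        double link_mb_s      = 300.0;
        return links_per_node * nodes_in_plane * link_mb_s;   /* 76,800 MB/s */
    }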

  28. T3E Node • Alpha 21164, 4-issue (2 integer + 2 floating point) • 600 Mflop/s (300 MHz) • E-registers

  29. Cluster: Network of Workstations (NOW), Cluster of Workstations (COW), Pile-of-PCs (POPC)

  30. Clusters of Workstations • Several workstations connected by a network • Connected with Fast/Gigabit Ethernet, ATM, FDDI, etc. • Some software to tightly integrate all resources • Each workstation is an independent machine

  31. Cluster • Advantages • Cheaper • Easy to scale • Coarse-grain parallelism (traditionally) • Disadvantages of Clusters • Longer communication latency compared with other parallel systems (traditionally)

  32. ATM Cluster (Fore SBA-200) • Cluster node: Intel Pentium II, Pentium SMP, SGI, Sun SPARC, .. • NI location: I/O bus • Communication processor: Intel i960, 33 MHz, 128 KB RAM • Peak bandwidth: 19.4 MB/s or 77.6 MB/s per port • HKU: PearlCluster (16-node), SRG DP-ATM Cluster ($-node, 16.2 MB/s)

  33. Myrinet Cluster • Cluster node: Intel Pentium II, Pentium SMP, SGI, Sun SPARC, .. • NI location: I/O bus • Communication processor: LANai, 25 MHz, 128 KB SRAM • Peak bandwidth: 80 MB/s --> 160 MB/s

  34. Conclusion • Many current network interfaces employ a dedicated processor to offload communication tasks from the main processor. • Overlapping computation with communication improves performance (see the sketch below).
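A C sketch of that overlap using MPI non-blocking calls (buffer and partner names are illustrative); the computation placed between the initiation calls and MPI_Waitall can proceed while a communication coprocessor moves the data:

    #include <mpi.h>

    void exchange_and_compute(double *sendbuf, double *recvbuf, int n,
                              int partner)
    {
        MPI_Request reqs[2];

        /* Start the exchange but do not wait for it yet. */
        MPI_Irecv(recvbuf, n, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Isend(sendbuf, n, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, &reqs[1]);

        /* ... useful computation on data not involved in the transfer ... */

        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);   /* complete both transfers */
    }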

  35. Paragon • Main processor : 50 MHz i860 XP, 75 Mflop/s. • NI location : Memory bus (64-bit, 400 MB/s) • Communication processor : 50 MHz i860 XP -- a processor • Peak bandwidth: 175 MB/s (16-bit link, 1 DMA engine)

  36. SP2 • Main processor : 66.7 MHz POWER2, 266 MFLOPs • NI location : I/O bus (32-bit micro-channel) • Communication processor : 40 MHz i860 XR -- a processor • Peak bandwidth: 40 MB/s (8-bit link, 40 MHz)

  37. T3D • Main processor: 150 MHz DEC 21064 Alpha AXP, 150 MFLOPS • NI location: Memory bus (320 MB/s local; or 100 MB/s remote) • Communication processor : Controller (BLT) -- hardware circuitry • Peak bandwidth: 300 MB/s (16-bit data/link at 150 MHz)
