
The IBM Blue Gene/L System Architecture



Presentation Transcript


  1. The IBM Blue Gene/L System Architecture Presented by Sabri KANTAR

  2. What is Blue Gene/L? • Blue Gene is an IBM Research project dedicated to exploring the frontiers in supercomputing. • In November 2004, the IBM Blue Gene computer became the fastest supercomputer in the world. • This project is designed to scale to 65,536 dual-processor nodes, with a peak performance of 360 TeraFLOPS. • Example usage: • hydrodynamics • quantum chemistry • molecular dynamics • climate modeling • financial modeling

  3. A High-Level View of the BG/L Architecture • Within a node: • Low-latency, high-bandwidth memory system. • Strong floating-point performance: 4 floating-point operations/cycle. • Across nodes: • Low-latency, high-bandwidth networks. • Many nodes: • Low power/node. • Low cost/node. • RAS (reliability, availability, and serviceability). • Familiar SW API: • C, C++, Fortran, MPI, POSIX subset, …

  4. Main Design Principles for Blue Gene/L • Some science & engineering applications scale up to and beyond 10,000 parallel processes. • Improve computing capability while holding total system cost constant. • Reduce cost/FLOP. • Reduce complexity and size. • ~25 KW/rack is the maximum for air-cooling in a standard room. • Need to improve the performance/power ratio. • The 700 MHz PowerPC 440 for ASIC has an excellent FLOP/Watt ratio. • Maximize integration: • On chip: ASIC with everything except main memory. • Off chip: maximize the number of nodes in a rack. • Large systems require excellent reliability, availability, and serviceability (RAS).

  5. Main Design Principles (cont’d) • Make cost/performance trade-offs considering the end-use: • Applications <> Architecture <> Packaging • Examples: • 1 or 2 differential signals per torus link. • I.e. 1.4 or 2.8Gb/s. • Maximum of 3 or 4 neighbors on collective network. • I.e. Depth of network and thus global latency. • Maximize the overall system efficiency: • Small team designed all of Blue Gene/L. • Example: Chose ASIC die and chip pin-out to ease circuit card routing.

  6. Reducing Cost and Complexity • Cables are bigger, costlier and less reliable than traces. • So want to minimize the number of cables. • So 3-dimensional torus is chosen as main BG/L network, with each node connected to 6 neighbors. • Maximize number of nodes connected via circuit card(s) only. • BG/L midplane has 8*8*8=512 nodes. • (Number of cable connections) / (all connections) = (6 faces * 8 * 8 nodes) / (6 neighbors * 8 * 8 * 8 nodes) = 1 / 8
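The 1/8 figure above can be checked with a quick back-of-the-envelope calculation (a sketch; the midplane dimensions and neighbor count are taken from the slide):

```python
# Fraction of torus links that must cross a midplane boundary via cables,
# assuming an 8x8x8 midplane where only face-crossing links need cables.
edge = 8                       # nodes per midplane dimension
nodes = edge ** 3              # 8*8*8 = 512 nodes per midplane
cable_links = 6 * edge * edge  # one crossing link per node on each of 6 faces
total_links = 6 * nodes        # each node has 6 torus neighbors
fraction = cable_links / total_links
print(fraction)  # 0.125, i.e. 1/8
```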

  7. Blue Gene/L Architecture • Up to 32*32*64 = 65,536 nodes (3D torus). • Max 360 teraFLOPS of computational power. • Each processor can perform 4 floating-point operations per cycle (in the form of two 64-bit floating-point multiply-adds per cycle). • 5 networks connect the nodes to one another and to the outside world.
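The peak-performance figure follows directly from the node count, core count, and per-cycle FLOP rate given above (a sketch; the exact product comes out slightly above the rounded 360 TFLOPS quoted on the slide):

```python
# Peak FLOPS = nodes x cores/node x flops/cycle x clock rate
nodes = 65536
cores_per_node = 2
flops_per_cycle = 4        # two 64-bit fused multiply-adds per cycle
clock_hz = 700e6           # 700 MHz PowerPC 440
peak_flops = nodes * cores_per_node * flops_per_cycle * clock_hz
print(peak_flops / 1e12)   # ~367 TFLOPS, commonly rounded to ~360
```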

  8. Node Architecture • Built from IBM PowerPC embedded CMOS processors, embedded DRAM, and system-on-a-chip techniques. • 11.1-mm square die size, allowing for a very high density of processing. • The ASIC uses IBM CMOS CU-11 0.13 micron technology. • 700 MHz processor speed, close to memory speed. • Two processors per node. • The second processor is intended primarily for handling message-passing operations.

  9. The BG/L node ASIC includes: • Two standard PowerPC 440 processing cores • each paired with a PowerPC 440 FP2 core, an enhanced "double" 64-bit floating-point unit. • The two cores are not L1 cache coherent. • Each core has a small 2 KB L2 cache. • A 4 MB L3 cache made from embedded DRAM. • An integrated external DDR memory controller. • A gigabit Ethernet adapter. • A JTAG interface.

  10. BlueGene/L node diagram.

  11. Link ASIC • In addition to the compute ASIC, there is a "link" ASIC. • When crossing a midplane boundary, BG/L's torus, global combining tree, and global interrupt signals pass through the BG/L link ASIC. • It redrives signals over the cables between BG/L midplanes. • The link ASIC can also redirect signals between its different ports. • This enables BG/L to be partitioned into multiple, logically separate systems with no traffic interference between them.

  12. The PowerPC 440 FP2 core • It consists of a primary side and a secondary side. • Each side has • its own register file of 32 64-bit elements • a double-precision computational datapath, and • a double-precision storage access datapath. • The primary side is capable of executing standard PowerPC floating-point instructions. • An enhanced set of instructions includes those that are executed solely on the secondary side, and those that are simultaneously executed on both sides. • The enhanced set includes SIMD operations.

  13. The FP2 core (cont’d) • This enhanced set goes beyond the capabilities of traditional SIMD architectures. • A single instruction can initiate a different but related operation on different data: Single Instruction Multiple Operation Multiple Data (SIMOMD). • Either side can access data from the other side’s register file. • This saves a lot of operand swapping when working purely on complex arithmetic operations.
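The complex-arithmetic benefit can be illustrated with a small sketch (function names are hypothetical; this models the idea of two paired fused-multiply-add datapaths, not the actual FP2 instruction set). A complex product (a+bi)(c+di) needs only two paired FMA steps when each side can read the other side's operands:

```python
def fma(x, y, z):
    # Fused multiply-add: x*y + z (a single rounding step on real hardware).
    return x * y + z

def complex_mul(ar, ai, br, bi):
    """(ar + ai*i) * (br + bi*i) via two paired FMA steps."""
    # Step 1: both sides multiply by the real part of b.
    p_re = fma(ar, br, 0.0)   # primary side:   ar*br
    p_im = fma(ai, br, 0.0)   # secondary side: ai*br
    # Step 2: cross multiply-add, each side reading the other side's
    # operand (modeling FP2's cross-register-file access).
    re = fma(-ai, bi, p_re)   # ar*br - ai*bi
    im = fma(ar, bi, p_im)    # ai*br + ar*bi
    return re, im

print(complex_mul(1.0, 2.0, 3.0, 4.0))  # (-5.0, 10.0), i.e. (1+2i)(3+4i)
```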

  14. Memory System • It is designed for high-bandwidth, low-latency memory and cache accesses. • An L2 hit returns in 6 to 10 processor cycles. • An L3 hit in about 25 cycles. • An L3 miss in about 75 cycles. • The system has a 16-byte interface to nine 256 Mb SDRAM-DDR devices, operating at one half or one third of the processor speed.
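The cycle counts above translate into wall-clock latencies at the 700 MHz clock rate (a sketch; the cycle figures are the slide's approximate values):

```python
clock_hz = 700e6
cycle_ns = 1e9 / clock_hz  # ~1.43 ns per processor cycle at 700 MHz

for name, cycles in [("L2 hit (best)", 6), ("L2 hit (worst)", 10),
                     ("L3 hit", 25), ("L3 miss", 75)]:
    print(f"{name}: ~{cycles * cycle_ns:.1f} ns")
# L3 hit ~35.7 ns, L3 miss ~107.1 ns
```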

  15. 3D Torus Network • It is used for general-purpose, point-to-point message passing and multicast operations to a selected “class” of nodes. • The topology is a three-dimensional torus constructed with point-to-point, serial links between routers embedded within the BlueGene/L ASICs. • Each ASIC has six nearest-neighbor connections • Virtual cut-through routing with multipacket buffering on collision • Minimal, Adaptive, Deadlock Free

  16. Torus Network (cont’d) • Class routing capability (deadlock-free hardware multicast). • Packets can be deposited along the route to the specified destination. • Allows for efficient one-to-many communication in some instances. • Active messages allow for the fast transposes required in FFTs. • Independent on-chip network interfaces enable concurrent access.

  17. Other Networks • A global combining/broadcast tree for collective operations • A Gigabit Ethernet network for connection to other systems, such as hosts and file systems. • A global barrier and interrupt network • And another Gigabit Ethernet to JTAG network for machine control

  18. Collective Network • It has a tree structure. • One-to-all broadcast functionality. • Reduction operations functionality. • 2.8 Gb/s of bandwidth per link; latency of a tree traversal is 2.5 µs. • ~23 TB/s total binary-tree bandwidth (64k machine). • Interconnects all compute and I/O nodes (1024).

  19. Gb Ethernet Disk/Host I/O Network • I/O nodes are leaves on the collective network. • Compute and I/O nodes use the same ASIC, but: • An I/O node has Ethernet, not torus: provides I/O separation from the application. • A compute node has torus, not Ethernet: no need for 65,536 Ethernet cables. • Configurable ratio of I/O to compute nodes = 1:8, 16, 32, 64, or 128. • Applications run on compute nodes, not I/O nodes.
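The configurable ratio determines how many I/O nodes a full machine carries (a sketch using the 65,536-compute-node configuration from earlier slides):

```python
# I/O node count for each supported I/O-to-compute ratio.
compute_nodes = 65536
io_nodes = {ratio: compute_nodes // ratio for ratio in (8, 16, 32, 64, 128)}
for ratio, n in io_nodes.items():
    print(f"1:{ratio} -> {n} I/O nodes")
# e.g. at 1:64, a full machine needs 1024 I/O nodes
```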

  20. Fast Barrier/Interrupt Network • Four independent barrier or interrupt channels. • Independently configurable as "OR" or "AND". • Asynchronous propagation. • Halts operation quickly (current estimate is 1.3 µs worst-case round trip). • 3/4 of this delay is time-of-flight. • Sticky-bit operation • allows global barriers with a single channel. • User-space accessible. • System selectable. • It is partitioned along the same boundaries as the tree and torus networks. • Each user partition contains its own set of barrier/interrupt signals.

  21. Control Network • JTAG interface to 100 Mb Ethernet • direct access to all nodes. • boot and system debug availability. • runtime non-invasive RAS support. • non-invasive access to performance counters. • direct access to shared SRAM in every node. • Control, configuration, and monitoring: • All active devices are accessible through JTAG, I2C, or another "simple" bus. (Only clock buffers & DRAM are not accessible.)

  22. Packaging • 2 nodes per compute card. • 16 compute cards per node board. • 16 node boards per 512-node midplane. • Two midplanes in a 1024-node rack. • For compiling, diagnostics, and analysis, a host computer is required. • An I/O node handles communication between a compute node and other systems, including the host and file servers.
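The packaging hierarchy above multiplies out to the system sizes quoted on earlier slides (a quick consistency check):

```python
# Packaging hierarchy: card -> board -> midplane -> rack -> full system.
nodes_per_card = 2
cards_per_board = 16
boards_per_midplane = 16
midplanes_per_rack = 2

nodes_per_midplane = nodes_per_card * cards_per_board * boards_per_midplane
nodes_per_rack = nodes_per_midplane * midplanes_per_rack
racks_for_full_system = 65536 // nodes_per_rack

print(nodes_per_midplane, nodes_per_rack, racks_for_full_system)
# 512 nodes/midplane, 1024 nodes/rack, 64 racks for the full 65,536-node system
```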

  23. BlueGene/L packaging.

  24. Science Application • Study of protein folding and dynamics. • Aim is to obtain a microscopic view of the thermodynamics and kinetics of the folding process • Simulating longer and longer time-scales is the key challenge • Focus is on improving the speed of execution for a fixed size system by utilizing additional CPUs. • Understanding the logical limits to concurrency within the application is very important.

  25. Conclusion • The Blue Gene/L supercomputer is designed to improve cost/performance for a relatively broad class of applications with good scaling behavior. • This is achieved through massive parallelism and • system-on-a-chip technology: • the functionality of a node is contained within a single ASIC. • BG/L has significantly lower cost in terms of power, space, and service, while doing no worse than its competitors.

  26. The End Questions ???
