The IBM Blue Gene/L System Architecture Presented by Sabri KANTAR
What is Blue Gene/L? • Blue Gene is an IBM Research project dedicated to exploring the frontiers of supercomputing. • In November 2004, the IBM Blue Gene computer became the fastest supercomputer in the world. • The system is designed to scale to 65,536 dual-processor nodes, with a peak performance of 360 TeraFLOPS. • Example applications: • hydrodynamics • quantum chemistry • molecular dynamics • climate modeling • financial modeling
A High-Level View of the BG/L Architecture • Within a node: • Low-latency, high-bandwidth memory system. • Strong floating-point performance: 4 floating-point operations/cycle. • Across nodes: • Low-latency, high-bandwidth networks. • Many nodes: • Low power/node. • Low cost/node. • RAS (reliability, availability and serviceability). • Familiar SW API: • C, C++, Fortran, MPI, POSIX subset, …
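As a minimal illustration of this familiar programming model, the sketch below is an ordinary C + MPI program of the kind the node software supports. It is illustrative only; BG/L applications are cross-compiled and launched through the system's own job-launch tooling, which is not shown here.

```c
/* Minimal C + MPI sketch illustrating the standard programming model
 * listed above.  Launch details are system-specific and omitted. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this process's rank */
    MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total MPI processes */
    printf("Hello from rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
```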
Main Design Principles for Blue Gene/L • Some science & engineering applications scale up to and beyond 10,000 parallel processes. • Improve computing capability while holding total system cost constant. • Reduce cost/FLOP. • Reduce complexity and size. • ~25 KW/rack is the maximum for air cooling in a standard machine room. • Need to improve the performance/power ratio. • The 700 MHz PowerPC 440 ASIC core has an excellent FLOP/Watt ratio. • Maximize integration: • On chip: ASIC with everything except main memory. • Off chip: Maximize the number of nodes in a rack. • Large systems require excellent reliability, availability, and serviceability (RAS).
Main Design Principles (cont’d) • Make cost/performance trade-offs with the end use in mind: • Applications <> Architecture <> Packaging • Examples: • 1 or 2 differential signals per torus link, i.e. 1.4 or 2.8 Gb/s per link. • Maximum of 3 or 4 neighbors on the collective network, i.e. the depth of the network and thus the global latency. • Maximize overall system efficiency: • A small team designed all of Blue Gene/L. • Example: The ASIC die and chip pin-out were chosen to ease circuit-card routing.
Reducing Cost and Complexity • Cables are bigger, costlier and less reliable than traces. • So the number of cables should be minimized. • A 3-dimensional torus is therefore chosen as the main BG/L network, with each node connected to 6 neighbors. • Maximize the number of nodes connected via circuit card(s) only. • A BG/L midplane has 8*8*8=512 nodes. • (Number of cable connections) / (all connections) = (6 faces * 8 * 8 nodes) / (6 neighbors * 8 * 8 * 8 nodes) = 1 / 8
Blue Gene/L Architecture • Up to 32*32*64=65,536 nodes (3D torus). • Max 360 teraFLOPS of computational power. • Each processor can perform 4 floating-point operations per cycle (in the form of two 64-bit floating-point multiply-adds per cycle). • Five networks connect the nodes to each other and to the outside world.
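A quick back-of-the-envelope check of the peak figure, using only the numbers quoted on this slide (a sketch; the official 360 TFLOPS figure is a slightly rounded version of the same calculation):

```latex
% 65,536 nodes x 2 cores/node x 4 flops/cycle x 700 MHz
\[
  65{,}536 \ \text{nodes} \times 2 \ \text{cores} \times 4 \ \frac{\text{flops}}{\text{cycle}} \times 700 \ \text{MHz}
  \approx 3.67 \times 10^{14} \ \text{FLOP/s} \approx 367 \ \text{TFLOPS}.
\]
```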
Node Architecture • The node combines IBM PowerPC embedded CMOS processors, embedded DRAM, and system-on-a-chip techniques. • 11.1-mm square die size, allowing for a very high density of processing. • The ASIC uses IBM CMOS CU-11 0.13-micron technology. • 700 MHz processor speed, close to the memory speed. • Two processors per node. • The second processor is intended primarily for handling message-passing operations.
The BG/L node ASIC includes: • Two standard PowerPC 440 processing cores • each paired with a PowerPC 440 FP2 core • an enhanced “double” 64-bit floating-point unit • The two cores are not L1 cache coherent. • Each core has a small 2 KB L2 cache • 4 MB L3 cache made from embedded DRAM • An integrated external DDR memory controller • A Gigabit Ethernet adapter • A JTAG interface
Link ASIC • In addition to the compute ASIC, there is a “link” ASIC. • When crossing a midplane boundary, BG/L’s torus, global combining tree, and global interrupt signals pass through the BG/L link ASIC. • It redrives signals over the cables between BG/L midplanes. • The link ASIC can also redirect signals between its different ports. • This enables BG/L to be partitioned into multiple, logically separate systems with no traffic interference between them.
The PowerPC 440 FP2 core • It consists of a primary side and a secondary side. • Each side has • its own 32-element, 64-bit register file • a double-precision computational datapath and • a double-precision storage access datapath • The primary side is capable of executing standard PowerPC floating-point instructions. • An enhanced set of instructions includes those that are executed solely on the secondary side, and those that are simultaneously executed on both sides. • The enhanced set includes SIMD operations.
The FP2 core (cont’d) • This enhanced set goes beyond the capabilities of traditional SIMD architectures. • A single instruction can initiate a different but related operation on different data. • Single Instruction Multiple Operation Multiple Data (SIMOMD). • Either of the sides can access data from the other side’s register file. • This saves a lot of swapping when working purely on complex arithmetic operations.
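A conceptual sketch of the kind of paired work the FP2 targets is the complex multiply-accumulate below: each iteration performs related-but-different multiply-adds on the real and imaginary parts, which is exactly the SIMOMD pattern described above. Plain C is shown for clarity; on BG/L the compiler (or intrinsics, not shown here) would map such pairs onto single FP2 instructions using the primary and secondary sides.

```c
/* Sketch: complex multiply-accumulate over arrays stored as separate
 * real/imaginary parts.  The two updates per iteration are the kind of
 * related-but-different operations a single FP2 instruction can issue
 * to the primary and secondary sides. */
void cmac(double *re, double *im,                 /* accumulator (real, imag) */
          const double *are, const double *aim,   /* operand a */
          const double *bre, const double *bim,   /* operand b */
          int n)
{
    for (int i = 0; i < n; i++) {
        /* primary side: real part; secondary side: imaginary part */
        re[i] += are[i] * bre[i] - aim[i] * bim[i];
        im[i] += are[i] * bim[i] + aim[i] * bre[i];
    }
}
```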
Memory System • It is designed for high-bandwidth, low-latency memory and cache accesses. • An L2 hit returns in 6 to 10 processor cycles • An L3 hit in about 25 cycles • An L3 miss in about 75 cycles • Each node has a 16-byte interface to nine 256 Mb SDRAM-DDR devices, operating at one half or one third of the processor speed.
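As a rough estimate of the resulting main-memory bandwidth, derived only from the interface width and clock ratios above (a sketch; delivered bandwidth depends on the actual configuration):

```latex
% 16-byte interface clocked at 1/2 or 1/3 of the 700 MHz processor:
\[
  16 \ \text{bytes} \times \frac{700 \ \text{MHz}}{2} = 5.6 \ \text{GB/s per node},
  \qquad
  16 \ \text{bytes} \times \frac{700 \ \text{MHz}}{3} \approx 3.7 \ \text{GB/s}.
\]
```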
3D Torus Network • It is used for general-purpose, point-to-point message passing and multicast operations to a selected “class” of nodes. • The topology is a three-dimensional torus constructed with point-to-point, serial links between routers embedded within the BlueGene/L ASICs. • Each ASIC has six nearest-neighbor connections • Virtual cut-through routing with multipacket buffering on collision • Minimal, Adaptive, Deadlock Free
Torus Network (cont’d) • Class routing capability (deadlock-free hardware multicast) • Packets can be deposited along the route to a specified destination. • Allows for efficient one-to-many communication in some instances. • Active messages allow for fast transposes, as required in FFTs. • Independent on-chip network interfaces enable concurrent access.
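The sketch below shows how an application can express the nearest-neighbour communication pattern that maps naturally onto this torus, using standard MPI Cartesian topologies. It is illustrative only: the process grid is derived generically with MPI_Dims_create, and BG/L-specific task-mapping facilities are not shown.

```c
/* Sketch: a 3D periodic (torus) halo exchange expressed with standard
 * MPI Cartesian topology routines. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int size;
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int dims[3] = {0, 0, 0};
    MPI_Dims_create(size, 3, dims);   /* factor the job into a 3D grid */
    int periods[3] = {1, 1, 1};       /* periodic in all dimensions => torus */

    MPI_Comm torus;
    MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, 1, &torus);

    /* Ranks of the -x and +x neighbours for a simple exchange. */
    int left, right;
    MPI_Cart_shift(torus, 0, 1, &left, &right);

    double send = 1.0, recv = 0.0;
    MPI_Sendrecv(&send, 1, MPI_DOUBLE, right, 0,
                 &recv, 1, MPI_DOUBLE, left,  0,
                 torus, MPI_STATUS_IGNORE);

    MPI_Comm_free(&torus);
    MPI_Finalize();
    return 0;
}
```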
Other Networks • A global combining/broadcast tree for collective operations • A Gigabit Ethernet network for connection to other systems, such as hosts and file systems. • A global barrier and interrupt network • And another Gigabit Ethernet to JTAG network for machine control
Collective Network • It has a tree structure • One-to-all broadcast functionality • Reduction-operation functionality • 2.8 Gb/s of bandwidth per link; latency of a tree traversal is 2.5 µs • ~23 TB/s total binary-tree bandwidth (64k machine) • Interconnects all compute and I/O nodes (1024)
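A sketch of the MPI collectives that this network accelerates, written in standard MPI; on BG/L, reductions and broadcasts like these can be carried out in the combining-tree hardware rather than over the torus.

```c
/* Sketch: reduction and broadcast operations of the kind the collective
 * network accelerates, expressed with standard MPI. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Global reduction: sum every rank's contribution. */
    double local = (double)rank, global = 0.0;
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    /* One-to-all broadcast from rank 0. */
    double param = (rank == 0) ? 3.14 : 0.0;
    MPI_Bcast(&param, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum = %g, param = %g\n", global, param);

    MPI_Finalize();
    return 0;
}
```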
Gb Ethernet Disk/Host I/O Network • I/O nodes are leaves on the collective network. • Compute and I/O nodes use the same ASIC, but: • The I/O node has Ethernet, not torus: provides I/O separation from the application. • The compute node has torus, not Ethernet: no need for 65,536 cables. • Configurable ratio of I/O to compute nodes = 1:8, 16, 32, 64, 128. • Applications run on compute nodes, not I/O nodes.
Fast Barrier/Interrupt Network • Four independent barrier or interrupt channels • Independently configurable as "OR" or "AND" • Asynchronous propagation • Halts operation quickly (current estimate is 1.3 µs worst-case round trip) • 3/4 of this delay is time-of-flight. • Sticky-bit operation • Allows global barriers with a single channel. • User-space accessible • System selectable • It is partitioned along the same boundaries as the tree and torus networks. • Each user partition contains its own set of barrier/interrupt signals.
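From the application's point of view, this network is exercised through ordinary global barriers. The sketch below times one such barrier with MPI_Wtime purely to illustrate that round-trip barrier latency is what the dedicated hardware minimizes; it is not a BG/L-specific benchmark.

```c
/* Sketch: a global synchronization point, which on BG/L can be served
 * by the dedicated barrier/interrupt network.  Timing is illustrative. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double t0 = MPI_Wtime();
    MPI_Barrier(MPI_COMM_WORLD);   /* global barrier across all ranks */
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("barrier took %g seconds\n", t1 - t0);

    MPI_Finalize();
    return 0;
}
```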
Control Network • JTAG interface to 100 Mb Ethernet • Direct access to all nodes. • Boot and system-debug availability. • Runtime non-invasive RAS support. • Non-invasive access to performance counters. • Direct access to shared SRAM in every node. • Control, configuration and monitoring: • All active devices are accessible through JTAG, I2C, or another “simple” bus (only clock buffers and DRAM are not accessible).
Packaging • 2 nodes per compute card. • 16 compute cards per node board. • 16 node boards per 512-node midplane. • Two midplanes in a 1024-node rack. • For compiling, diagnostics, and analysis, a host computer is required. • An I/O node handles communication between a compute node and other systems, including the host and file servers.
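Putting the packaging hierarchy together (a sketch using the counts above and the 65,536-node full system mentioned earlier, which corresponds to 64 racks):

```latex
% Nodes per midplane, per rack, and in the full system:
\[
  2 \times 16 \times 16 = 512 \ \text{nodes per midplane}, \qquad
  512 \times 2 = 1024 \ \text{nodes per rack},
\]
\[
  1024 \times 64 \ \text{racks} = 65{,}536 \ \text{nodes}.
\]
```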
Science Application • Study of protein folding and dynamics. • Aim is to obtain a microscopic view of the thermodynamics and kinetics of the folding process • Simulating longer and longer time-scales is the key challenge • Focus is on improving the speed of execution for a fixed size system by utilizing additional CPUs. • Understanding the logical limits to concurrency within the application is very important.
Conclusion • The Blue Gene/L supercomputer is designed to improve cost/performance for a relatively broad class of applications with good scaling behavior. • This is achieved through massive parallelism and system-on-a-chip technology. • The functionality of a node is contained within a single ASIC chip. • BG/L has significantly lower cost in terms of power, space, and service, while performing no worse than its competitors.
The End. Questions?