The Blue Gene Experience
Manish Gupta
IBM T. J. Watson Research Center, Yorktown Heights, NY
Blue Gene/L (2005): 136.8 Teraflop/s on LINPACK (64K processors)
Blue Gene/L packaging hierarchy
• Chip: 2 processors, 2.8/5.6 GF/s, 4 MB
• Compute Card: 2 chips (1x2x1), 5.6/11.2 GF/s, 1.0 GB
• Node Card: 16 compute cards, 0-2 I/O cards (32 chips, 4x4x2), 90/180 GF/s, 16 GB
• Rack: 32 Node Cards, 2.8/5.6 TF/s, 512 GB
• System: 64 Racks (64x32x32), 180/360 TF/s, 32 TB
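The per-chip numbers compound through the packaging levels. Below is a minimal sketch in plain C, using only the figures from the list above (nothing BG/L-specific is assumed), that recomputes the system peak from the chip peak; the exact products, 183.5 and 367.0 TF/s, appear rounded to 180/360 TF/s on the slide.

#include <stdio.h>

int main(void) {
    /* Figures taken from the packaging-hierarchy list above. */
    const double chip_peak_gfs[2] = {2.8, 5.6};   /* per-chip peak, GF/s (both quoted modes) */
    const int chips_per_node_card = 32;           /* 16 compute cards x 2 chips */
    const int node_cards_per_rack = 32;
    const int racks_per_system    = 64;

    long chips = (long)chips_per_node_card * node_cards_per_rack * racks_per_system;
    printf("compute chips: %ld\n", chips);        /* 65536 */

    for (int i = 0; i < 2; i++) {
        /* GF/s -> TF/s; prints 183.5 and 367.0 */
        printf("system peak: %.1f TF/s\n", chips * chip_peak_gfs[i] / 1000.0);
    }
    return 0;
}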
Blue Gene/L Compute ASIC
• Low-power processors
• Chip-level integration
• Powerful networks
Blue Gene/L Networks

3-Dimensional Torus
• Interconnects all compute nodes (65,536)
• Virtual cut-through hardware routing
• 1.4 Gb/s on all 12 node links (2.1 GB/s per node)
• 1 µs latency between nearest neighbors, 5 µs to the farthest
• Communications backbone for computations
• 0.7/1.4 TB/s bisection bandwidth, 68 TB/s total bandwidth

Global Collective
• One-to-all broadcast functionality
• Reduction operations functionality
• 2.8 Gb/s of bandwidth per link
• Latency of one-way traversal: 2.5 µs
• Interconnects all compute and I/O nodes (1024)

Low-Latency Global Barrier and Interrupt
• Latency of round trip: 1.3 µs

Ethernet
• Incorporated into every node ASIC
• Active in the I/O nodes (1:8-64)
• All external communication (file I/O, control, user interaction, etc.)

Control Network
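From the application side these networks are normally reached through MPI rather than programmed directly. The following is a minimal sketch using only standard MPI calls (no BG/L-specific API is assumed): ranks are arranged on a periodic 3-D grid to mirror the torus, a nearest-neighbor exchange exercises the kind of traffic the torus is optimized for, and a reduction is the kind of operation the global collective network accelerates.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int nprocs, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Let MPI factor the job into a 3-D grid; periodic in all
       dimensions to match the torus wrap-around links. */
    int dims[3] = {0, 0, 0}, periods[3] = {1, 1, 1};
    MPI_Dims_create(nprocs, 3, dims);

    MPI_Comm torus;
    MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, 1 /* reorder */, &torus);

    /* Nearest-neighbor exchange along the X dimension. */
    int left, right;
    MPI_Cart_shift(torus, 0, 1, &left, &right);

    double send = (double)rank, recv = -1.0;
    MPI_Sendrecv(&send, 1, MPI_DOUBLE, right, 0,
                 &recv, 1, MPI_DOUBLE, left,  0,
                 torus, MPI_STATUS_IGNORE);

    /* A global reduction; collectives like this can be carried by the
       collective network rather than the torus. */
    double sum;
    MPI_Allreduce(&send, &sum, 1, MPI_DOUBLE, MPI_SUM, torus);

    MPI_Comm_free(&torus);
    MPI_Finalize();
    return 0;
}

With reorder set to 1 in MPI_Cart_create, the MPI library is free to place ranks so that neighbors on the logical grid are also neighbors on the physical torus.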
RAS (Reliability, Availability, Serviceability)
• System designed for reliability from top to bottom
• System issues
  • Redundant bulk supplies, power converters, fans, DRAM bits, cable bits
  • Extensive data logging (voltage, temperature, recoverable errors, …) for failure forecasting
  • Nearly no single points of failure
• Chip design
  • ECC on all SRAMs
  • All dataflow outside the processors is protected by error-detection mechanisms
  • Access to all state via a noninvasive back door
  • Low-power, simple design leads to higher reliability
  • All interconnects have multiple error-detection and correction coverage
  • Virtually zero escape probability for link errors
Blue Gene/L System Architecture (diagram): compute nodes (C-Node 0 … C-Node 63) running CNK plus an I/O node running Linux (ciod, fs client) form a pset (Pset 0 … Pset 1023), connected internally by the tree and torus networks. The I/O nodes reach the file servers, front-end nodes, and system console over the functional Gigabit Ethernet. The service node (CMCS, DB2, LoadLeveler) manages the machine over the control Gigabit Ethernet, reaching node hardware through the IDo chip via JTAG and I2C.