BGL Photo (system). Source: BlueGene/L, IBM Journal of Research and Development, Vol. 49, No. 2-3. <http://www.research.ibm.com/journal/rd49-23.html>
Main Design Principles
• Some science & engineering applications scale up to and beyond 10,000 parallel processes.
• Improve computing capability while holding total system cost.
• Make cost/performance trade-offs with the end use in mind: Applications <> Architecture <> Packaging.
• Reduce complexity and size: ~25 kW/rack is the maximum for air cooling in a standard machine room.
• Need to improve the performance/power ratio: the 700 MHz PowerPC 440 core used in the ASIC has an excellent FLOP/Watt ratio (see the sketch after this slide).
• Maximize integration:
  • On chip: an ASIC with everything except main memory.
  • Off chip: maximize the number of nodes in a rack.
• Large systems require excellent reliability, availability, and serviceability (RAS).
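To make the performance/power argument concrete, here is a back-of-the-envelope sketch in C. The 700 MHz clock, two processors per node, 1024 nodes per rack, and ~25 kW/rack figures come from these slides; the assumption of 4 floating-point operations per cycle per core (a double FPU issuing fused multiply-adds) is added only for illustration.

#include <stdio.h>

/* Back-of-the-envelope peak-performance and performance-per-watt estimate
 * for one BG/L rack. Clock, core count, node count, and power budget are
 * taken from the slides; flops_per_cycle is an assumed figure used only
 * to illustrate the FLOP/Watt argument. */
int main(void)
{
    const double clock_hz        = 700e6;   /* PowerPC 440 core clock      */
    const double flops_per_cycle = 4.0;     /* assumed: dual-pipe FPU, FMA */
    const int    cores_per_node  = 2;       /* two PowerPC processors/ASIC */
    const int    nodes_per_rack  = 1024;    /* one rack = 1024 nodes       */
    const double watts_per_rack  = 25e3;    /* ~25 kW air-cooling limit    */

    double peak_flops = clock_hz * flops_per_cycle
                      * cores_per_node * nodes_per_rack;

    printf("Peak per rack : %.2f TFLOPS\n", peak_flops / 1e12);
    printf("Efficiency    : %.0f MFLOPS/W\n", peak_flops / watts_per_rack / 1e6);
    return 0;
}

Under these assumptions one rack delivers roughly 5.7 TFLOPS peak at about 229 MFLOPS/W, which is the kind of ratio the low-power embedded-core design is aiming for.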
The Compute Chip
• System-on-a-chip (SoC): a single ASIC containing:
  • 2 PowerPC processors
  • L1 and L2 caches
  • 4 MB embedded DRAM
  • DDR DRAM interface and DMA controller
  • Network connectivity hardware
  • Control / monitoring equipment (JTAG)
Node Architecture
• IBM PowerPC embedded CMOS processors, embedded DRAM, and system-on-a-chip techniques are used.
• 11.1-mm square die size, allowing for a very high density of processing.
• The ASIC uses IBM CMOS CU-11 0.13-micron technology.
• 700 MHz processor speed, close to memory speed.
• Two processors per node.
• The second processor is intended primarily for handling message-passing operations.
Midplane and Rack
• 1 rack holds 1024 nodes.
• Nodes optimized for low power: ASIC based on SoC technology.
• Outperforms commodity clusters while saving on power.
• Aggressive packaging of processor, memory, and interconnect: power efficient & space efficient.
• Allows for latencies and bandwidths that are significantly better than those of nodes typically used in ASC-scale supercomputers.
The Torus Network
• 64 x 32 x 32 three-dimensional torus.
• Each compute node is connected to its six neighbors: x+, x-, y+, y-, z+, z- (see the sketch after this slide).
• Compute card is 1x2x1.
• Node card is 4x4x2: 16 compute cards in a 4x2x2 arrangement.
• Midplane is 8x8x8: 16 node cards in a 2x2x4 arrangement.
• Each unidirectional link is 1.4 Gb/s, or 175 MB/s.
• Each node can send and receive at 1.05 GB/s.
• Supports cut-through routing, along with both deterministic and adaptive routing.
• Variable-sized packets of 32, 64, 96, ... 256 bytes.
• Guarantees reliable delivery.
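The wrap-around links are what make every node equivalent: a node on the "edge" of the 64 x 32 x 32 grid is still directly connected to a partner on the opposite face. The small C sketch below computes the six neighbor coordinates of a node; it is illustrative only and does not reflect any BG/L system-software API.

#include <stdio.h>

/* Illustrative sketch (not BG/L system software): compute the six torus
 * neighbors of the node at coordinates (x, y, z) on the 64 x 32 x 32
 * torus. Wrap-around links mean coordinate 0 and coordinate DIM-1 in a
 * given dimension are direct neighbors. */
enum { XDIM = 64, YDIM = 32, ZDIM = 32 };

static int wrap(int c, int dim) { return (c + dim) % dim; }

static void torus_neighbors(int x, int y, int z, int nbr[6][3])
{
    const int dims[3]   = { XDIM, YDIM, ZDIM };
    const int coords[3] = { x, y, z };

    for (int d = 0; d < 3; d++) {            /* dimension: x, y, z        */
        for (int s = 0; s < 2; s++) {        /* direction: + then -       */
            int *n = nbr[2 * d + s];
            n[0] = x; n[1] = y; n[2] = z;
            n[d] = wrap(coords[d] + (s ? -1 : +1), dims[d]);
        }
    }
}

int main(void)
{
    int nbr[6][3];
    torus_neighbors(0, 0, 0, nbr);           /* corner node wraps around  */
    for (int i = 0; i < 6; i++)
        printf("(%2d,%2d,%2d)\n", nbr[i][0], nbr[i][1], nbr[i][2]);
    return 0;
}

For node (0,0,0) this prints neighbors such as (63,0,0) and (0,31,0), showing the wrap-around that keeps the network symmetric.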
BG/L System Software
• System software supports efficient execution of parallel applications.
• Compiler support for MPI-based C, C++, and Fortran (a minimal MPI example follows this slide).
• Front-end nodes are commodity PCs running Linux.
• I/O nodes run a customized Linux kernel.
• Compute nodes run an extremely lightweight custom kernel:
  • Space sharing, single thread per processor (dual-threaded per node).
  • Flat address space, no paging.
  • Physical resources are memory-mapped.
• The service node is a single multiprocessor machine running a custom OS.
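Since the slides name MPI-based C as the supported programming model, the sketch below shows the kind of program the compute-node kernel is built to run: one single-threaded process per processor, communicating only by message passing. It uses standard MPI calls only, nothing BG/L-specific, and assumes at least two processes.

#include <mpi.h>
#include <stdio.h>

/* Minimal MPI ring example: rank r receives a token from rank r-1 and
 * forwards it to rank r+1, wrapping around at the end. Standard MPI
 * calls only; nothing here is specific to BG/L. */
int main(int argc, char **argv)
{
    int rank, size, token = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size < 2) {                /* ring needs at least two processes */
        MPI_Finalize();
        return 0;
    }

    if (rank == 0) {
        token = 42;
        MPI_Send(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        MPI_Recv(&token, 1, MPI_INT, size - 1, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("token returned to rank 0 after %d hops\n", size);
    } else {
        MPI_Recv(&token, 1, MPI_INT, rank - 1, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        MPI_Send(&token, 1, MPI_INT, (rank + 1) % size, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}

Because the compute-node kernel offers a flat address space with no paging and no general multitasking, this single-process-per-processor, message-passing style is the natural fit.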
Space Sharing
• The BG/L system can be partitioned into electronically isolated sets of nodes (power-of-2 sizes); a sizing sketch follows this slide.
• Single-user, reservation-based scheduling for each partition.
• Faulty hardware is electrically isolated, allowing other nodes to continue to run in the presence of component failures.
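To illustrate the power-of-2 rule, the sketch below rounds a job's node request up to the smallest power-of-2 partition that can hold it. This is only an illustration of the sizing rule stated above, not the BG/L control system's actual allocator, which also has to respect the physical midplane and node-card boundaries.

#include <stdio.h>

/* Round a requested node count up to the next power of two, the smallest
 * partition size (per the slide's power-of-2 rule) that can hold the job.
 * Illustrative sketch only. */
static unsigned next_pow2(unsigned n)
{
    unsigned p = 1;
    while (p < n)
        p <<= 1;
    return p;
}

int main(void)
{
    unsigned requests[] = { 33, 512, 600, 1025 };
    for (size_t i = 0; i < sizeof requests / sizeof requests[0]; i++)
        printf("request %4u nodes -> partition of %4u nodes\n",
               requests[i], next_pow2(requests[i]));
    return 0;
}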