Architecture of Parallel Computers CSC / ECE 506 BlueGene Architecture 4/26/2007 Dr. Steve Hunter
BlueGene/L Program • December 1999: IBM Research announced a 5-year, $100M (US) effort to build a petaflop/s-scale supercomputer to attack science problems such as protein folding. Goals: • Advance the state of the art of scientific simulation. • Advance the state of the art in computer design and software for capability and capacity markets. • November 2001: Announced research partnership with Lawrence Livermore National Laboratory (LLNL). • November 2002: Announced planned acquisition of a BG/L machine by LLNL as part of the ASCI Purple contract. • May 11, 2004: Four racks of DD1 hardware (4,096 nodes at 500 MHz) ran Linpack at 11.68 TFlop/s, ranked #4 on the 23rd Top500 list. • June 2, 2004: Two racks of DD2 hardware (1,024 nodes at 700 MHz) ran Linpack at 8.655 TFlop/s, ranked #8 on the 23rd Top500 list. • September 16, 2004: 8 racks ran Linpack at 36.01 TFlop/s. • November 8, 2004: 16 racks ran Linpack at 70.72 TFlop/s, ranked #1 on the 24th Top500 list. • December 21, 2004: First 16 racks of BG/L accepted by LLNL. CSC / ECE 506
BlueGene/L Program • Massive collection of low-power CPUs instead of a moderate-sized collection of high-power CPUs. • Developed jointly by IBM and DOE’s National Nuclear Security Administration (NNSA), and installed at DOE’s Lawrence Livermore National Laboratory • BlueGene/L has occupied the No. 1 position on the last three TOP500 lists (http://www.top500.org/) • It has reached a Linpack benchmark performance of 280.6 TFlop/s (“teraflops”, or trillions of calculations per second) and remains the only system ever to exceed the 100 TFlop/s level. • BlueGene/L systems hold the #1 and #3 positions in the top 10. • “Objective was to retain exceptional cost/performance levels achieved by application-specific machines, while generalizing the massively parallel architecture enough to enable a relatively broad class of applications” - Overview of BG/L system architecture, IBM JRD • Design approach was to use a very high level of integration that made simplicity in packaging, design, and bring-up possible • JRD issue available at: http://www.research.ibm.com/journal/rd49-23.html CSC / ECE 506
BlueGene/L Program • BlueGene is a family of supercomputers. • BlueGene/L is the first step, aimed at being a multipurpose, massively parallel, and cost-effective supercomputer (12/04) • BlueGene/P is the petaflop generation (12/06) • BlueGene/Q is the third generation (~2010). • Requirements for future generations • Processors will be more powerful. • Networks will be higher bandwidth. • Applications developed on BlueGene/L will run well on BlueGene/P. CSC / ECE 506
BlueGene/L Fundamentals • Low-complexity nodes give more flops per transistor and per watt • A 3D interconnect suits many scientific simulations, since the physical world being modeled is three-dimensional CSC / ECE 506
BlueGene/L Fundamentals • Cellular architecture • Large numbers of low-power, more efficient processors interconnected • Rmax of 280.6 Teraflops • Maximum achieved LINPACK performance • Rpeak of 360 Teraflops • Theoretical peak performance • 65,536 dual-processor compute nodes • 700 MHz IBM PowerPC 440 processors • 512 MB memory per compute node, 32 TB in the entire system • 800 TB of disk space • 2,500 square feet CSC / ECE 506
Comparing Systems (Peak) CSC / ECE 506
Comparing Systems (Byte/Flop) • Red Storm: 2.0 (2003) • Earth Simulator: 2.0 (2002) • Intel Paragon: 1.8 (1992) • nCUBE/2: 1.0 (1990) • ASCI Red: 1.0 (0.6) (1997) • T3E: 0.8 (1996) • BG/L: 1.5 = 0.75 (torus) + 0.75 (tree) (2004) • Cplant: 0.1 (1997) • ASCI White: 0.1 (2000) • ASCI Q: 0.05, Quadrics (2003) • ASCI Purple: 0.1 (2004) • Intel Cluster: 0.1, InfiniBand (2004) • Intel Cluster: 0.008, GbE (2003) • Virginia Tech: 0.16, InfiniBand (2003) • Chinese Acad. of Sci.: 0.04, QsNet (2003) • NCSA - Dell: 0.04, Myrinet (2003) CSC / ECE 506
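As a rough check on the BG/L torus figure above (a back-of-the-envelope estimate, assuming the ratio is taken against a single processor's 2.8 GFlops peak at 700 MHz, i.e. coprocessor mode): 12 torus links x 175 MB/s = 2.1 GB/s per node, and 2.1 GB/s / 2.8 GFlops = 0.75 byte/flop; the 0.75 byte/flop tree contribution is quoted on the same per-processor basis.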
Comparing Systems (GFlops/Watt) • Power efficiencies of recent supercomputers • Blue: IBM Machines • Black: Other US Machines • Red: Japanese Machines IBM Journal of Research and Development CSC / ECE 506
Comparing Systems * 10 megawatts is approximately the electricity usage of 11,000 households CSC / ECE 506
BG/L Summary of Performance Results • DGEMM (Double-precision GEneral Matrix Multiply): • 92.3% of dual-core peak on 1 node • Observed performance at 500 MHz: 3.7 GFlops • Projected performance at 700 MHz: 5.2 GFlops (tested in lab up to 650 MHz) • LINPACK: • 77% of peak on 1 node • 70% of peak on 512 nodes (1435 GFlops at 500 MHz) • sPPM (simplified Piecewise Parabolic Method), UMT2000: • Single-processor performance roughly on par with POWER3 at 375 MHz • Tested on up to 128 nodes (also NAS Parallel Benchmarks) • FFT (Fast Fourier Transform): • Up to 508 MFlops on a single processor at 444 MHz (TU Vienna) • Pseudo-op performance (5N log N) of 1300 MFlops at 700 MHz (65% of peak) • STREAM – impressive results even at 444 MHz: • Tuned: Copy: 2.4 GB/s, Scale: 2.1 GB/s, Add: 1.8 GB/s, Triad: 1.9 GB/s • Standard: Copy: 1.2 GB/s, Scale: 1.1 GB/s, Add: 1.2 GB/s, Triad: 1.2 GB/s • At 700 MHz: would beat the STREAM numbers of most high-end microprocessors • MPI: • Latency – < 4000 cycles (5.5 µs at 700 MHz) • Bandwidth – full link bandwidth demonstrated on up to 6 links CSC / ECE 506
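For reference on the FFT entry, the "pseudo-op" convention counts an N-point FFT as 5 N log2 N floating-point operations, so pseudo-MFlops = 5 N log2 N / (10^6 x t), where t is the measured time in seconds; the transform size N behind the TU Vienna measurement is not given here.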
BlueGene/L Architecture • To achieve this level of integration, the machine was developed around a processor with moderate frequency, available in system-on-a-chip (SoC) technology • This approach was chosen because of the performance/power advantage • In terms of performance/watt, the low-frequency, low-power, embedded IBM PowerPC core consistently outperforms high-frequency, high-power microprocessors by a factor of 2 to 10 • Industry focus is on performance / rack • Performance / rack = Performance / watt * Watt / rack • Watt / rack = 20 kW for power delivery and thermal cooling reasons • Power and cooling • Using conventional techniques, a 360 TFlops machine would require 10-20 megawatts. • BlueGene/L uses only 1.76 megawatts CSC / ECE 506
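A back-of-the-envelope check using figures quoted in this deck (360 TFlops peak, 1.76 MW, 65,536 nodes, and assuming the standard 64-rack configuration of 1,024 nodes per rack): performance/watt = 360 TFlops / 1.76 MW, roughly 0.2 GFlops/W, and performance/rack = 360 TFlops / 64 racks, roughly 5.6 TFlops; a conventional design with the same peak drawing 10-20 MW would land around 0.02-0.04 GFlops/W.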
Microprocessor Power Density Growth CSC / ECE 506
System Power Comparison CSC / ECE 506
BlueGene/L Architecture • Networks were chosen with extreme scaling in mind • Scale efficiently in terms of both performance and packaging • Support very small messages • As small as 32 bytes • Include hardware support for collective operations • Broadcast, reduction, scan, etc. • Reliability, Availability and Serviceability (RAS) is another critical issue for scaling • BG/L needs to be reliable and usable even at extreme scaling limits • 20 fails per 1,000,000,000 node-hours, which across 65,536 nodes works out to roughly 1 node failure every 4.5 weeks • System software and monitoring are also important to scaling • BG/L is designed to efficiently utilize a distributed-memory, message-passing programming model • MPI is the dominant message-passing model, with hardware features added and parameters tuned for it CSC / ECE 506
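To make the collective-offload point concrete, here is a minimal, generic MPI sketch (not BG/L-specific code; the values being summed are hypothetical). A global sum like this is exactly the kind of reduction the tree network can combine in hardware as packets flow toward the root.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Each rank contributes one partial result; on BG/L the combine can be
       carried out by the tree network hardware rather than in software. */
    double local = (double)rank;
    double global = 0.0;
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum over %d ranks = %.0f\n", nprocs, global);

    MPI_Finalize();
    return 0;
}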
RAS (Reliability, Availability, Serviceability) • System designed for RAS from top to bottom • System issues • Redundant bulk supplies, power converters, fans, DRAM bits, cable bits • Extensive data logging (voltage, temp, recoverable errors … ) for failure forecasting • Nearly no single points of failure • Chip design • ECC on all SRAMs • All dataflow outside processors is protected by error-detection mechanisms • Access to all state via noninvasive back door • Low power, simple design leads to higher reliability • All interconnects have multiple error detections and correction coverage • Virtually zero escape probability for link errors CSC / ECE 506
BlueGene/L System 136.8 Teraflop/s on LINPACK (64K processors) 1 TF = 1,000,000,000,000 Flops Rochester Lab 2005 CSC / ECE 506
BlueGene/L System CSC / ECE 506
Physical Layout of BG/L CSC / ECE 506
Midplanes and Racks CSC / ECE 506
The Compute Chip • System-on-a-chip (SoC) • 1 ASIC • 2 PowerPC processors • L1 and L2 Caches • 4MB embedded DRAM • DDR DRAM interface and DMA controller • Network connectivity hardware • Control / monitoring equip. (JTAG) CSC / ECE 506
Compute Card CSC / ECE 506
Node Card CSC / ECE 506
BlueGene/L Compute ASIC • IBM CU-11, 0.13 µm • 11 x 11 mm die size • 25 x 32 mm CBGA • 474 pins, 328 signal • 1.5/2.5 Volt CSC / ECE 506
BlueGene/L Interconnect Networks 3 Dimensional Torus • Main network, for point-to-point communication • High-speed, high-bandwidth • Interconnects all compute nodes (65,536) • Virtual cut-through hardware routing • 1.4 Gb/s on all 12 node links (2.1 GB/s per node) • 1 µs latency between nearest neighbors, 5 µs to the farthest • 4 µs latency for one hop with MPI, 10 µs to the farthest • Communications backbone for computations • 0.7/1.4 TB/s bisection bandwidth, 68 TB/s total bandwidth Global Tree • One-to-all broadcast functionality • Reduction operations functionality • MPI collective ops in hardware • Fixed-size 256-byte packets • 2.8 Gb/s of bandwidth per link • Latency of one-way tree traversal 2.5 µs • ~23 TB/s total binary-tree bandwidth (64K machine) • Interconnects all compute and I/O nodes (1,024 I/O nodes) • Also guarantees reliable delivery Ethernet • Incorporated into every node ASIC • Active in the I/O nodes (1:64 ratio of I/O to compute nodes) • All external comm. (file I/O, control, user interaction, etc.) Low-Latency Global Barrier and Interrupt • Latency of round trip 1.3 µs Control Network CSC / ECE 506
The Torus Network • 3-dimensional: 64 x 32 x 32 • Each compute node is connected to its six neighbors: x+, x-, y+, y-, z+, z- • Compute card is 1 x 2 x 1 • Node card is 4 x 4 x 2 • 16 compute cards in a 4 x 2 x 2 arrangement • Midplane is 8 x 8 x 8 • 16 node cards in a 2 x 2 x 4 arrangement • Communication path • Each unidirectional link is 1.4 Gb/s, or 175 MB/s • Each node can send and receive at 1.05 GB/s • Supports cut-through routing, along with both deterministic and adaptive routing • Variable-sized packets of 32, 64, 96, …, 256 bytes • Guarantees reliable delivery CSC / ECE 506
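As an illustration of how an application can align itself with this 3D torus, the generic MPI sketch below builds a Cartesian communicator with the 64 x 32 x 32 shape from this slide and wraparound (periodic) edges, then looks up the six nearest neighbors. It is plain MPI, not BG/L-specific code, and it assumes the job is launched with 64 x 32 x 32 = 65,536 ranks (scale the dims down for smaller runs).

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Torus shape from the slide; periods = 1 gives the wraparound links. */
    int dims[3]    = {64, 32, 32};
    int periods[3] = {1, 1, 1};
    MPI_Comm torus;
    MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, /* reorder = */ 1, &torus);

    if (torus != MPI_COMM_NULL) {
        int rank, coords[3];
        MPI_Comm_rank(torus, &rank);
        MPI_Cart_coords(torus, rank, 3, coords);

        /* Six nearest neighbors: (x-, x+), (y-, y+), (z-, z+). */
        int xm, xp, ym, yp, zm, zp;
        MPI_Cart_shift(torus, 0, 1, &xm, &xp);
        MPI_Cart_shift(torus, 1, 1, &ym, &yp);
        MPI_Cart_shift(torus, 2, 1, &zm, &zp);

        printf("rank %d at (%d,%d,%d): x(%d,%d) y(%d,%d) z(%d,%d)\n",
               rank, coords[0], coords[1], coords[2], xm, xp, ym, yp, zm, zp);
        MPI_Comm_free(&torus);
    }

    MPI_Finalize();
    return 0;
}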
Complete BlueGene/L System at LLNL (block diagram): 65,536 BG/L compute nodes and 1,024 BG/L I/O nodes attach to a federated Gigabit Ethernet switch (2,048 ports), which also connects the WAN, visualization and archive systems, the cluster-wide file system (CWFS), 8 front-end nodes, and the service node; a separate control network links the service node to the machine. CSC / ECE 506
System Software Overview • Operating system - Linux • Compilers - IBM XL C, C++, Fortran95 • Communication - MPI, TCP/IP • Parallel File System - GPFS, NFS support • System Management - extensions to CSM • Job scheduling - based on LoadLeveler • Math libraries - ESSL CSC / ECE 506
BG/L Software Hierarchical Organization • Compute nodes are dedicated to running the user application and almost nothing else - a simple compute node kernel (CNK) • I/O nodes run Linux and provide a more complete range of OS services – files, sockets, process launch, signaling, debugging, and termination • Service node performs system management services (e.g., heartbeating, monitoring errors) - transparent to application software CSC / ECE 506
BG/L System Software • Simplicity • Space-sharing • Single-threaded • No demand paging • Familiarity • MPI (MPICH2) • IBM XL Compilers for PowerPC CSC / ECE 506
Operating Systems • Front-end nodes are commodity systems running Linux • I/O nodes run a customized Linux kernel • Compute nodes use an extremely lightweight custom kernel • Service node is a single multiprocessor machine running a custom OS CSC / ECE 506
Compute Node Kernel (CNK) • Single user, dual-threaded • Flat address space, no paging • Physical resources are memory-mapped • Provides standard POSIX functionality (mostly) • Two execution modes: • Virtual node mode - each of the two cores runs its own application process, splitting the node's memory • Coprocessor mode - one core runs the application while the other assists with communication CSC / ECE 506
Service Node OS • Core Management and Control System (CMCS) • BG/L’s “global” operating system. • MMCS - Midplane Monitoring and Control System • CIOMAN - Control and I/O Manager • DB2 relational database CSC / ECE 506
Running a User Job • Code is compiled on and submitted from a front-end node. • An external scheduler allocates the job. • The service node sets up the partition and transfers the user’s code to the compute nodes. • All file I/O is done using standard Unix calls (via the I/O nodes). • Post-facto debugging is done on the front-end nodes. CSC / ECE 506
Performance Issues • User code is easily ported to BG/L. • However, an efficient MPI implementation requires effort and skill: • Torus topology instead of a crossbar • Special hardware, such as the collective (tree) network. CSC / ECE 506
BG/L MPI Software Architecture • GI = Global Interrupt • CIO = Control and I/O Protocol • CH3 = primary communication device distributed with MPICH2 • MPD = Multipurpose Daemon CSC / ECE 506
MPI_Bcast CSC / ECE 506
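As a reminder of MPI_Bcast semantics, here is a minimal generic sketch (buffer contents, size, and root are hypothetical); on BG/L a broadcast over a full partition can map onto the tree network's one-to-all hardware path.

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, i;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Root (rank 0) fills the buffer; after the call every rank holds a copy. */
    double params[256];
    if (rank == 0)
        for (i = 0; i < 256; i++) params[i] = (double)i;

    MPI_Bcast(params, 256, MPI_DOUBLE, /* root = */ 0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}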
MPI_Alltoall CSC / ECE 506
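Similarly, a minimal generic MPI_Alltoall sketch (the 256-byte block per destination is a hypothetical choice): every rank sends a distinct block to every other rank, a communication pattern that stresses the torus bisection bandwidth rather than the tree.

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nprocs, i;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* One 32-double (256-byte) block per destination rank. */
    const int blk = 32;
    double *sendbuf = malloc((size_t)nprocs * blk * sizeof(double));
    double *recvbuf = malloc((size_t)nprocs * blk * sizeof(double));
    for (i = 0; i < nprocs * blk; i++)
        sendbuf[i] = rank + 0.001 * i;

    /* Block i of sendbuf goes to rank i; block j of recvbuf arrived from rank j. */
    MPI_Alltoall(sendbuf, blk, MPI_DOUBLE, recvbuf, blk, MPI_DOUBLE, MPI_COMM_WORLD);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}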
References • IBM Journal of Research and Development, Vol. 49, No. 2-3. • http://www.research.ibm.com/journal/rd49-23.html • “Overview of the Blue Gene/L system architecture” • “Packaging the Blue Gene/L supercomputer” • “Blue Gene/L compute chip: Memory and Ethernet subsystems” • “Blue Gene/L torus interconnection network” • “Blue Gene/L programming and operating environment” • “Design and implementation of message-passing services for the Blue Gene/L supercomputer” CSC / ECE 506
References (cont.) • BG/L homepage @ LLNL: <http://www.llnl.gov/ASC/platforms/bluegenel/> • BlueGene homepage @ IBM: <http://www.research.ibm.com/bluegene/> CSC / ECE 506
The End CSC / ECE 506