STATE OF THE ART
Section 2: Overview
• We are going to briefly describe some state-of-the-art supercomputers
• The goal is to evaluate the degree of integration of the three main components: processing nodes, interconnection network and system software
• Analysis limited to 6 supercomputers (ASCI Q, ASCI Thunder, System X, BlueGene/L, Cray XD1 and ASCI Red Storm), due to space and time limitations
ASCI Q
• Total: 20.48 TF/s, #3 in the Top500 (a quick check of the peak figure follows below)
• Systems: 2,048 AlphaServer ES45s
• 8,192 EV-68 1.25-GHz CPUs with 16-MB cache
• Memory: 22 Terabytes
• System interconnect
  • Dual-rail Quadrics interconnect
  • 4,096 QSW PCI adapters
  • Four 1,024-way QSW federated switches
• Operational in 2002
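The headline figure can be reproduced from the configuration above, assuming the usual peak of two floating-point operations per cycle for an EV68 (one add plus one multiply pipeline); this is a back-of-the-envelope check, not a measured number:

```c
#include <stdio.h>

int main(void) {
    /* ASCI Q peak-performance check from the slide's configuration.
       Assumes 2 flops/cycle per EV68 (one FP add + one FP multiply). */
    const double cpus            = 8192.0;
    const double clock_hz        = 1.25e9;   /* 1.25 GHz */
    const double flops_per_cycle = 2.0;

    printf("ASCI Q peak: %.2f TF/s\n",
           cpus * clock_hz * flops_per_cycle / 1e12);   /* 20.48 TF/s */
    return 0;
}
```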
Node: HP (Compaq) AlphaServer ES45 (21264 system architecture) [block diagram]
• Four EV68 CPUs at 1.25 GHz, each with 16 MB of cache, each on a 64-bit, 500-MHz bus (4.0 GB/s)
• Up to 32 GB of memory on four MMBs, each on a 256-bit, 125-MHz bus (4.0 GB/s)
• Quad C-chip controller feeding PCI chip buses: ten PCI slots (64-bit at 33 MHz / 266 MB/s and 66 MHz / 528 MB/s, several hot-swappable), plus PCI-USB, legacy I/O, serial/parallel ports, keyboard/mouse and floppy; 3.3 V and 5.0 V I/O
QsNET: Quaternary Fat Tree
• Hardware support for collective communication
• MPI latency 4 µs, bandwidth 300 MB/s (a measurement sketch follows below)
• Barrier latency less than 10 µs
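Latency and bandwidth numbers like these are usually obtained with an MPI ping-pong test: two ranks bounce a message back and forth, and half the round-trip time gives the one-way latency for small messages or the bandwidth for large ones. Below is a minimal sketch of such a test, meant only as an illustration of how the figures are measured, not the actual benchmark behind them:

```c
/* Minimal MPI ping-pong sketch (run with at least 2 ranks).
   Small messages give latency; large messages give bandwidth. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    int rank;
    const int iters = 1000;
    const int bytes = 8;              /* use e.g. 1 MB for bandwidth */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    char *buf = calloc(bytes, 1);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double one_way = (MPI_Wtime() - t0) / (2.0 * iters);
    if (rank == 0)
        printf("one-way latency %.2f us, bandwidth %.1f MB/s\n",
               one_way * 1e6, bytes / one_way / 1e6);

    free(buf);
    MPI_Finalize();
    return 0;
}
```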
Interconnection Network [switch-level diagram]
• One 1,024-node rail (two rails = 2,048 nodes) built as a federated QsNet switch with switch, mid and super-top levels
• Sixteen 64U64D bottom-level switches, the 1st serving nodes 0-63 and the 16th serving nodes 960-1023
System Software
• Operating system is Tru64
• Nodes organized in clusters of 32 for resource allocation and administration purposes (TruCluster)
• Resource management executed through Ethernet (RMS)
ASCI Q: Overview
• Node Integration: Low (multiple boards per node, network interface on the I/O bus)
• Network Integration: High (HW support for atomic collective primitives)
• System Software Integration: Medium/Low (TruCluster)
ASCI Thunder, Lawrence Livermore National Laboratory
• 1,024 nodes, 4,096 processors, 23 TF/s, #2 in the Top500
ASCI Thunder: Configuration
• 1,024 nodes, each with four 1.4-GHz Itanium2 CPUs and 8 GB of DDR266 SDRAM (8 Terabytes total)
• MPI latency 2.5 µs and bandwidth 912 MB/s over Quadrics Elan4
• Barrier synchronization 6 µs, allreduce 15 µs
• 75 TB of local disk (one 73-GB UltraSCSI320 drive per node)
• Lustre file system with 6.4 GB/s of delivered parallel I/O performance (see the MPI-IO sketch below)
• Linux RH 3.0, SLURM, CHAOS
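Delivered parallel-I/O figures such as the 6.4 GB/s above are typically measured by having many nodes write to the parallel file system at the same time and dividing the total volume by the elapsed time. The sketch below illustrates the idea with MPI-IO collective writes; the file path and block size are placeholders chosen for the example, and this is not the benchmark actually used on Thunder:

```c
/* Sketch of an aggregate-write-bandwidth test over a parallel file
   system such as Lustre, using MPI-IO.  Path and block size are
   illustrative placeholders. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv) {
    int rank, nprocs;
    const int block = 64 * 1024 * 1024;          /* 64 MB per rank */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    char *buf = malloc(block);
    memset(buf, 'x', block);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "/lustre/scratch/iotest",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    /* Each rank writes its own contiguous region; the collective call
       lets the MPI-IO layer aggregate requests before they reach the
       file system. */
    MPI_File_write_at_all(fh, (MPI_Offset)rank * block, buf, block,
                          MPI_CHAR, MPI_STATUS_IGNORE);
    MPI_Barrier(MPI_COMM_WORLD);
    double elapsed = MPI_Wtime() - t0;

    if (rank == 0)
        printf("aggregate write bandwidth: %.2f GB/s\n",
               (double)nprocs * block / elapsed / 1e9);

    MPI_File_close(&fh);
    free(buf);
    MPI_Finalize();
    return 0;
}
```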
CHAOS: Clustered High Availability Operating System
• Derived from Red Hat, but differs in the following areas:
  • Modified kernel (Lustre and HW-specific changes)
  • New packages for cluster monitoring, system installation, power/console management
  • SLURM, an open-source resource manager
ASCI Thunder: Overview
• Node Integration: Medium/Low (network interface on the I/O bus)
• Network Integration: Very High (HW support for atomic collective primitives)
• System Software Integration: Medium (CHAOS)
System X, 10.28 TF/s
• 1,100 nodes, each with two Apple G5 2-GHz CPUs
• 8 billion operations per second per processor (8 GFlops) peak double-precision floating-point performance
• Each node has 4 GB of main memory and 160 GB of Serial ATA storage
• 176 TB of total secondary storage
• InfiniBand: 8 µs latency and 870 MB/s bandwidth, partial support for collective communication
• System-level fault tolerance (Déjà vu)
System X: Overview
• Node Integration: Medium/Low (network interface on the I/O bus)
• Network Integration: Medium (limited support for atomic collective primitives)
• System Software Integration: Medium (system-level fault tolerance)
BlueGene/L System [packaging hierarchy diagram]
• Chip (2 processors): 2.8/5.6 GF/s, 4 MB embedded DRAM
• Compute card (2 chips, 2x1x1): 5.6/11.2 GF/s, 0.5 GB DDR
• Node card (16 compute cards, 32 chips, 4x4x2): 90/180 GF/s, 8 GB DDR
• Cabinet (32 node boards, 8x8x16): 2.9/5.7 TF/s, 256 GB DDR
• System (64 cabinets, 64x32x32): 180/360 TF/s, 16 TB DDR (the arithmetic behind these paired peak figures is sketched below)
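The paired numbers at each level can be reconstructed from the packaging counts above, assuming the two figures correspond to using one or both cores of each chip for computation and that a 700 MHz PPC440 core with the double FPU peaks at 4 flops per cycle (2.8 GF/s per core); the computed values round to the slide's figures. A short sketch of the composition:

```c
#include <stdio.h>

int main(void) {
    /* BlueGene/L peak-rate composition.  Assumes 2.8 GF/s per core
       (700 MHz x 4 flops/cycle) and that the paired slide figures are
       "one compute core per chip" vs. "two compute cores per chip". */
    const double core_gf      = 0.7 * 4.0;   /* 2.8 GF/s */
    const int chips_per_card  = 2;
    const int cards_per_board = 16;          /* node card = 32 chips */
    const int boards_per_rack = 32;          /* cabinet */
    const int racks           = 64;          /* full system */

    for (int cores = 1; cores <= 2; cores++) {
        double chip   = cores * core_gf;            /*  2.8 /  5.6 GF/s  */
        double card   = chip  * chips_per_card;     /*  5.6 / 11.2 GF/s  */
        double board  = card  * cards_per_board;    /*  ~90 / ~180 GF/s  */
        double rack   = board * boards_per_rack;    /* ~2.9 / ~5.7 TF/s  */
        double system = rack  * racks;              /* ~180 / ~360 TF/s  */
        printf("%d core(s)/chip: card %.1f GF/s, board %.0f GF/s, "
               "rack %.1f TF/s, system %.0f TF/s\n",
               cores, card, board, rack / 1e3, system / 1e3);
    }
    return 0;
}
```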
BlueGene/L Compute ASIC [block diagram]
• Two PowerPC 440 CPU cores on a PLB (4:1), each with 32 KB/32 KB L1 caches and a "double FPU"; the second core also serves as the I/O processor
• Shared L2 prefetch buffers, L3 directory for the embedded DRAM, 4 MB of embedded DRAM (L3 cache or memory), multiported shared SRAM buffer, snoop logic, ECC throughout
• DDR memory controller with ECC, 144 bits wide, driving 256/512 MB of external DDR
• Integrated network interfaces: torus (6 out and 6 in, each a 1.4 Gbit/s link), tree (3 out and 3 in, each a 2.8 Gbit/s link), 4 global barriers/interrupts, Gbit Ethernet, JTAG access for control
• IBM CU-11, 0.13 µm process; 11 x 11 mm die; 25 x 32 mm CBGA package, 474 pins (328 signal); 1.5/2.5 V
BlueGene/L node board [diagram]: 16 compute cards, 2 I/O cards, DC-DC converters (40 V input to 1.5 V and 2.5 V)
BlueGene/L Interconnection Networks
• 3-dimensional torus
  • Interconnects all compute nodes (65,536)
  • Virtual cut-through hardware routing
  • 1.4 Gb/s on all 12 node links (2.1 GB/s per node)
  • 350/700 GB/s bisection bandwidth (see the arithmetic below)
  • Communications backbone for computations
• Global tree
  • One-to-all broadcast functionality
  • Reduction operations functionality
  • 2.8 Gb/s of bandwidth per link
  • Latency of a tree traversal on the order of 5 µs
  • Interconnects all compute and I/O nodes (1,024 I/O nodes)
• Ethernet
  • Incorporated into every node ASIC
  • Active in the I/O nodes (one I/O node per 64 compute nodes)
  • All external communication (file I/O, control, user interaction, etc.)
• Low-latency global barrier
  • 8 single wires crossing the whole system, touching all nodes
• Control network (JTAG)
  • For booting, checkpointing, error logging
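The 350/700 GB/s bisection figure follows from the topology quoted above: cutting the 64x32x32 torus across its long dimension exposes a 32x32 face of 1,024 nodes, the wrap-around links double the number of links crossing the cut, and each link carries 1.4 Gb/s (175 MB/s) per direction. A short check of that arithmetic, with the counting convention stated as an assumption:

```c
#include <stdio.h>

int main(void) {
    /* BlueGene/L torus bisection bandwidth: cut the 64x32x32 torus
       across the long (x) dimension.  Assumes the cut is crossed by
       one plane of links plus the wrap-around plane. */
    const double link_bytes  = 1.4e9 / 8.0;    /* 1.4 Gb/s = 175 MB/s per direction */
    const int crossing_links = 32 * 32 * 2;    /* 32x32 face, times wrap-around */

    double one_way = crossing_links * link_bytes;
    printf("bisection: %.0f GB/s one way, %.0f GB/s bidirectional\n",
           one_way / 1e9, 2.0 * one_way / 1e9);   /* ~350 / ~700 GB/s */
    return 0;
}
```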
BlueGene/L System Software Organization
• Compute nodes are dedicated to running the user application, and almost nothing else: a simple compute node kernel (CNK)
• I/O nodes run Linux and provide O/S services: file access, process launch/termination, debugging
• Service nodes perform system management services (e.g., system boot, heartbeat, error monitoring), largely transparent to application/system software
Operating Systems
• Compute nodes: CNK
  • Specialized, simple O/S: about 5,000 lines of code, 40 KB in core
  • No thread support, no virtual memory
  • Protection: protect the kernel from the application; some network devices in user space
  • File I/O offloaded ("function shipped") to the I/O nodes through kernel system calls (a sketch of the idea follows below)
  • "Boot, start the app and then stay out of the way"
• I/O nodes: Linux
  • 2.4.19 kernel (2.6 underway) with ramdisk
  • NFS/GPFS client
  • CIO daemon to start/stop jobs and execute file I/O
• Global O/S (CMCS, on the service node)
  • Invisible to user programs
  • Global and collective decisions; interfaces with external policy modules (e.g., the job scheduler)
  • Commercial database technology (DB2) stores static and dynamic state
  • Handles partition selection, partition boot, running of jobs, system error logs, and the checkpoint/restart mechanism
  • Designed for scalability, robustness, security
  • Execution mechanisms in the core; policy decisions in the service node
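To make "function shipping" concrete: when the application on a compute node performs file I/O, CNK does not execute the operation locally; it packs the system call's arguments into a request, sends it to the CIO daemon on the corresponding I/O node, and that daemon runs the real Linux system call and returns the result. The sketch below only illustrates the pattern; the structures and function names are invented for the example (and the "transport" is a plain function call), so it should not be read as BlueGene/L's actual CNK/CIOD interface:

```c
/* Illustrative sketch of I/O "function shipping": the compute-node
   side packs a write() into a request, the I/O-node side executes the
   real system call.  Names, message format and transport are all
   hypothetical; this is not BlueGene/L's actual CNK/CIOD code. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

enum { OP_WRITE = 1, MAX_PAYLOAD = 4096 };

typedef struct {
    uint32_t opcode;                 /* which call to run on the I/O node */
    int32_t  fd;                     /* file descriptor on the I/O node   */
    uint64_t length;                 /* payload size                      */
    char     payload[MAX_PAYLOAD];   /* data to write                     */
} io_request_t;

/* I/O-node side (CIO-daemon-like): execute the real system call. */
static int64_t ciod_handle(const io_request_t *req)
{
    if (req->opcode == OP_WRITE)
        return write(req->fd, req->payload, req->length);
    return -1;
}

/* Compute-node side (CNK-like): pack the call and "ship" it.  In the
   real system the request would travel over the tree network to the
   I/O node; here it is just a direct function call. */
static int64_t shipped_write(int fd, const void *buf, uint64_t len)
{
    io_request_t req = { .opcode = OP_WRITE, .fd = fd, .length = len };
    memcpy(req.payload, buf, len);           /* assumes len <= MAX_PAYLOAD */
    return ciod_handle(&req);
}

int main(void)
{
    const char msg[] = "hello from the compute node\n";
    int64_t n = shipped_write(STDOUT_FILENO, msg, sizeof msg - 1);
    printf("shipped write returned %lld bytes\n", (long long)n);
    return 0;
}
```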
BlueGene/L: Overview
• Node Integration: High (the processing node integrates processors and network interfaces, with the network interfaces directly connected to the processors)
• Network Integration: High (separate tree network)
• System Software Integration: Medium/High (compute kernels are not globally coordinated)
• #2 and #4 in the Top500
Cray XD1 System Architecture
• Compute
  • 12 AMD Opteron 32/64-bit x86 processors
  • High-performance Linux
• RapidArray interconnect
  • 12 communications processors
  • 1 Tb/s switch fabric
• Active management
  • Dedicated processor
• Application acceleration
  • 6 co-processors
  • Processors directly connected to the interconnect
Cray XD1 Processing Node [chassis diagram]
• Chassis front: six 2-way SMP blades, six SATA hard drives, 4 fans
• Chassis rear: 500 Gb/s crossbar switch, 12-port inter-chassis connector, four independent PCI-X slots, connector to a 2nd 500 Gb/s crossbar switch and 12-port inter-chassis connector
Cray XD1 Compute Blade [diagram]: two AMD Opteron 2XX processors, each with 4 DIMM sockets for DDR 400 registered ECC memory, a RapidArray communications processor, and a connector to the main board
Fast Access to the Interconnect [comparison diagram]
• Xeon server: 5.3 GB/s memory bandwidth (DDR 333), 1 GB/s I/O (PCI-X), 0.25 GB/s interconnect (GigE)
• Cray XD1: 6.4 GB/s memory bandwidth (DDR 400), 8 GB/s directly to the interconnect
Communications Optimizations
• RapidArray communications processor:
  • HT/RA tunnelling with bonding
  • Routing with route redundancy
  • Reliable transport
  • Short message latency optimization
  • DMA operations
  • System-wide clock synchronization
• [Diagram: the AMD Opteron 2XX processor connects to the RapidArray communications processor over a 3.2 GB/s HyperTransport link, with two 2 GB/s RapidArray links into the fabric]
Active Manager System
• Usability: single system command and control
• Resiliency:
  • Dedicated management processors, real-time OS and communications fabric
  • Proactive background diagnostics with self-healing
  • Synchronized Linux kernels
• Provided by the Active Management software
Cray XD1: Overview
• Node Integration: High (direct access from HyperTransport to RapidArray)
• Network Integration: Medium/High (HW support for collective communication)
• System Software Integration: High (compute kernels are globally coordinated)
• Still at an early stage
Red Storm Architecture
• Distributed-memory MIMD parallel supercomputer
• Fully connected 3D mesh interconnect; each compute node processor has a bi-directional connection to the primary communication network
• 108 compute node cabinets and 10,368 compute node processors (AMD Sledgehammer @ 2.0 GHz)
• ~10 TB of DDR memory @ 333 MHz
• Red/Black switching: ~1/4, ~1/2, ~1/4
• 8 service and I/O cabinets on each end (256 processors for each color)
• 240 TB of disk storage (120 TB per color)
Red Storm Architecture
• Functional hardware partitioning: service and I/O nodes, compute nodes, and RAS nodes
• Partitioned operating system (OS): Linux on service and I/O nodes, LWK (Catamount) on compute nodes, stripped-down Linux on RAS nodes
• Separate RAS and system management network (Ethernet)
• Router table-based routing in the interconnect (an illustrative sketch follows below)
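A hedged illustration of what router table-based routing means: instead of computing a route algorithmically at every hop, each router holds a table, written by the management software at configuration time, that maps a destination node to one of its output ports, so forwarding a packet is a single lookup. The structures below and the dimension-ordered fill rule are invented for the example and are not Red Storm's actual tables:

```c
/* Sketch of table-based routing on a 3D mesh: a per-router table maps
   destination node -> output port; filled here with a simple
   dimension-ordered rule (X, then Y, then Z).  Illustrative only. */
#include <stdio.h>
#include <stdlib.h>

enum port { XPLUS, XMINUS, YPLUS, YMINUS, ZPLUS, ZMINUS, LOCAL };

#define NX 27
#define NY 16
#define NZ 24
#define NODES (NX * NY * NZ)

static int node_id(int x, int y, int z) { return (z * NY + y) * NX + x; }

/* Fill one router's table; the management software could install any
   deadlock-free set of routes this way. */
static void fill_table(enum port *table, int rx, int ry, int rz)
{
    for (int z = 0; z < NZ; z++)
        for (int y = 0; y < NY; y++)
            for (int x = 0; x < NX; x++) {
                enum port p;
                if      (x > rx) p = XPLUS;
                else if (x < rx) p = XMINUS;
                else if (y > ry) p = YPLUS;
                else if (y < ry) p = YMINUS;
                else if (z > rz) p = ZPLUS;
                else if (z < rz) p = ZMINUS;
                else             p = LOCAL;
                table[node_id(x, y, z)] = p;
            }
}

int main(void)
{
    enum port *table = malloc(NODES * sizeof *table);
    fill_table(table, 3, 2, 1);           /* router at mesh position (3,2,1) */

    int dest = node_id(10, 2, 5);         /* per-packet routing = one lookup */
    printf("packet for node %d leaves on port %d\n", dest, table[dest]);

    free(table);
    return 0;
}
```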
Red Storm architecture [partition diagram: users, service, compute, file I/O (/home) and net I/O partitions]
System layout (27 x 16 x 24 mesh) [diagram]: normally unclassified section, switchable nodes, normally classified section, disconnect cabinets
Red Storm System Software
• Run-time system
  • Logarithmic loader
  • Fast, efficient node allocator
  • Batch system: PBS
  • Libraries: MPI, I/O, math
• File systems being considered include
  • PVFS: interim file system
  • Lustre: Pathforward support
  • Panasas ...
• Operating systems
  • Linux on service and I/O nodes
  • Sandia's LWK (Catamount) on compute nodes
  • Linux on RAS nodes
ASCI Red Storm: Overview
• Node Integration: High (direct access from HyperTransport to the network through a custom network interface chip)
• Network Integration: Medium (no support for collective communication)
• System Software Integration: Medium/High (scalable resource manager, no global coordination between nodes)
• Expected to become the most powerful machine in the world (competition permitting)