Achieving Usability and Efficiency in Large-Scale Parallel Computing Systems Kei Davis and Fabrizio Petrini {kei,fabrizio}@lanl.gov Performance and Architectures Lab (PAL), CCS-3
Overview • In this part of the tutorial we will discuss the characteristics of some of the most powerful supercomputers • We classify these machines along three dimensions • Node Integration – how processors and network interfaces are integrated in a computing node • Network Integration – what primitive mechanisms the network provides to coordinate the processing nodes • System Software Integration – how the operating system instances are globally coordinated
Overview • We argue that the level of integration along each of these three dimensions, more than other parameters (such as distributed vs. shared memory, or vector vs. scalar processors), is the discriminating factor between large-scale supercomputers • In this part of the tutorial we briefly characterize some existing and upcoming parallel computers
ASCI Q • Total — 20.48 TF/s, #3 in the top 500 • Systems— 2048 AlphaServer ES45s • 8,192 EV-68 1.25-GHz CPUs with 16-MB cache • Memory— 22 Terabytes • System Interconnect • Dual Rail Quadrics Interconnect • 4096 QSW PCI adapters • Four 1024-way QSW federated switches • Operational in 2002
Node: HP (Compaq) AlphaServer ES45 21264 system architecture (block diagram) • Four EV68 1.25 GHz CPUs, 16 MB cache per CPU, each connected at 64b/500 MHz (4.0 GB/s) • Memory up to 32 GB on four MMBs, each memory bus 256b/125 MHz (4.0 GB/s), driven by quad C-chip controllers and D-chip data slices • PCI chip buses at 64b/66 MHz (528 MB/s) and 64b/33 MHz (266 MB/s) feeding ten PCI slots (several hot-swap), plus USB and legacy I/O (serial, parallel, keyboard/mouse, floppy; 3.3 V and 5.0 V I/O)
QsNET: Quaternary Fat Tree • Hardware support for collective communication • MPI latency 4 µs, bandwidth 300 MB/s • Barrier latency less than 10 µs
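Latency and bandwidth figures like these are typically obtained with simple microbenchmarks. The sketch below is a generic MPI ping-pong in C, not the Quadrics benchmark itself; the message size and repetition count are illustrative choices only.

```c
/* Minimal MPI ping-pong sketch: ranks 0 and 1 bounce a message back and
 * forth; half the round-trip time approximates one-way latency, and
 * message size / one-way time approximates bandwidth.
 * Run with at least two ranks; buffer size and iteration count are
 * illustrative only. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int reps = 1000;
    const int bytes = 1 << 20;              /* 1 MB payload */
    char *buf = malloc(bytes);
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double t0 = MPI_Wtime();
    for (int i = 0; i < reps; i++) {
        if (rank == 0) {
            MPI_Send(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t = (MPI_Wtime() - t0) / (2.0 * reps);   /* one-way time */

    if (rank == 0)
        printf("one-way time %.2f us, bandwidth %.1f MB/s\n",
               t * 1e6, bytes / t / 1e6);

    free(buf);
    MPI_Finalize();
    return 0;
}
```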
Interconnection Network (switch-level diagram) • One rail is a quaternary fat tree built from 16 federated 64U64D switches at the node level (the 1st connects nodes 0-63, the 16th connects nodes 960-1023), with mid-level and super-top-level switch stages above them • 1024 nodes per rail (x2 rails = 2048 nodes)
System Software • Operating System is Tru64 • Nodes organized in Clusters of 32 for resource allocation and administration purposes (TruCluster) • Resource management executed through Ethernet (RMS)
ASCI Q: Overview • Node Integration • Low (multiple boards per node, network interface on I/O bus) • Network Integration • High (HW support for atomic collective primitives) • System Software Integration • Medium/Low (TruCluster)
ASCI Thunder, Lawrence Livermore National Laboratory 1,024 Nodes, 4096 Processors, 23 TF/s, #2 in the top 500
ASCI Thunder: Configuration • 1,024 nodes, quad 1.4 GHz Itanium2, 8 GB DDR266 SDRAM per node (8 Terabytes total) • 2.5 µs MPI latency and 912 MB/s bandwidth over Quadrics Elan4 • Barrier synchronization 6 µs, allreduce 15 µs • 75 TB of local disk (73 GB UltraSCSI320 per node) • Lustre file system with 6.4 GB/s delivered parallel I/O performance • Linux RH 3.0, SLURM, CHAOS
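Collective figures such as the barrier and allreduce times above are usually obtained by timing many back-to-back collective calls and reporting the maximum per-call time across ranks. The sketch below is a generic illustration of that pattern, not the benchmark used on Thunder; the iteration count is illustrative.

```c
/* Sketch of a collective timing loop: time many back-to-back allreduce
 * calls and report the per-call maximum across ranks.
 * Iteration count is illustrative only. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    const int reps = 10000;
    int rank;
    double in = 1.0, out;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);            /* warm up and synchronize */
    double t0 = MPI_Wtime();
    for (int i = 0; i < reps; i++)
        MPI_Allreduce(&in, &out, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    double local = (MPI_Wtime() - t0) / reps;

    double worst;
    MPI_Reduce(&local, &worst, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("allreduce: %.2f us per call (max over ranks)\n", worst * 1e6);

    MPI_Finalize();
    return 0;
}
```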
CHAOS: Clustered High Availability Operating System • Derived from Red Hat, but differs in the following areas • Modified kernel (Lustre and hardware-specific changes) • New packages for cluster monitoring, system installation, power/console management • SLURM, an open-source resource manager
ASCI Thunder: Overview • Node Integration • Medium/Low (network interface on I/O bus) • Network Integration • Very High (HW support for atomic collective primitives) • System Software Integration • Medium (Chaos)
System X, 10.28 TF/s • 1,100 dual-processor Apple G5 nodes with 2 GHz CPUs • 8 billion operations per second per processor (8 GFlops) peak double-precision floating-point performance • Each node has 4 GB of main memory and 160 GB of Serial ATA storage • 176 TB total secondary storage • InfiniBand interconnect: 8 µs latency and 870 MB/s bandwidth, partial support for collective communication • System-level fault tolerance (Déjà vu)
System X: Overview • Node Integration • Medium/Low (network interface on I/O bus) • Network Integration • Medium (limited support for atomic collective primitives) • System Software Integration • Medium (system-level fault-tolerance)
BlueGene/L system packaging hierarchy • Chip (2 processors): 2.8/5.6 GF/s, 4 MB embedded DRAM • Compute card (2 chips, 2x1x1): 5.6/11.2 GF/s, 0.5 GB DDR • Node card (32 chips, 4x4x2; 16 compute cards): 90/180 GF/s, 8 GB DDR • Cabinet (32 node boards, 8x8x16): 2.9/5.7 TF/s, 256 GB DDR • System (64 cabinets, 64x32x32): 180/360 TF/s, 16 TB DDR
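For reference, the per-chip figures are consistent with the double FPU design, assuming the nominal 700 MHz clock (the clock rate is not stated on this slide):

```latex
% Peak performance of one BlueGene/L chip, assuming a 700 MHz clock and
% 2 fused multiply-add units (4 flops/cycle) per core; consistent with
% the 2.8/5.6 GF/s per chip and ~360 TF/s system peak quoted above.
\[
  700\,\mathrm{MHz} \times 4\,\tfrac{\text{flops}}{\text{cycle}}
    = 2.8\,\mathrm{GF/s\ per\ core},
  \qquad
  2 \times 2.8 = 5.6\,\mathrm{GF/s\ per\ chip},
  \qquad
  65{,}536 \times 5.6 \approx 360\,\mathrm{TF/s\ system\ peak.}
\]
```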
BlueGene/L Compute ASIC (block diagram) • Two PPC 440 CPU cores, each with a "double FPU" and 32k/32k L1 caches; one core also serves as I/O processor • Shared L2 prefetch logic with snooping, multiported shared SRAM buffer, L3 directory and 4 MB shared EDRAM L3 cache • 144-bit-wide DDR memory controller with ECC (256/512 MB external DDR) • Integrated network interfaces: torus (6 out and 6 in, each at 1.4 Gbit/s), tree (3 out and 3 in, each at 2.8 Gbit/s), 4 global barriers or interrupts, Gbit Ethernet, and JTAG control access • IBM CU-11, 0.13 µm process, 11 x 11 mm die, 25 x 32 mm CBGA, 474 pins (328 signal), 1.5/2.5 Volt
BlueGene/L node board: 16 compute cards, 2 I/O cards, DC-DC converters (40 V input; 1.5 V and 2.5 V outputs)
BlueGene/L Interconnection Networks • 3-Dimensional Torus • Interconnects all compute nodes (65,536) • Virtual cut-through hardware routing • 1.4 Gb/s on all 12 node links (2.1 GBytes/s per node) • 350/700 GBytes/s bisection bandwidth • Communications backbone for computations • Global Tree • One-to-all broadcast functionality • Reduction operations functionality • 2.8 Gb/s of bandwidth per link • Latency of tree traversal on the order of 5 µs • Interconnects all compute and I/O nodes (1024) • Ethernet • Incorporated into every node ASIC • Active in the I/O nodes (1:64) • All external comm. (file I/O, control, user interaction, etc.) • Low Latency Global Barrier • 8 single wires crossing whole system, touching all nodes • Control Network (JTAG) • For booting, checkpointing, error logging
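The one-to-all broadcast and reduction functions of the tree network are the operations that MPI collectives such as MPI_Bcast and MPI_Allreduce map onto. The sketch below is a generic illustration of those collectives in C, not BlueGene/L-specific code; the array size is arbitrary.

```c
/* Minimal illustration of the collectives a hardware tree network can
 * accelerate: a one-to-all broadcast followed by a global reduction. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    double coeffs[16] = {0};
    double local, global;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    if (rank == 0)                                     /* root fills the data */
        for (int i = 0; i < 16; i++) coeffs[i] = i;
    MPI_Bcast(coeffs, 16, MPI_DOUBLE, 0, MPI_COMM_WORLD);   /* one-to-all */

    local = coeffs[rank % 16];
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM,  /* reduction */
                  MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum over %d ranks: %f\n", nprocs, global);

    MPI_Finalize();
    return 0;
}
```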
BlueGene/L System Software Organization • Compute nodes dedicated to running user application, and almost nothing else - simple compute node kernel (CNK) • I/O nodes run Linux and provide O/S services • file access • process launch/termination • debugging • Service nodes perform system management services (e.g., system boot, heart beat, error monitoring) - largely transparent to application/system software
Operating Systems • Compute nodes: CNK • Specialized, simple O/S • 5,000 lines of code, 40 KBytes in core • No thread support, no virtual memory • Protection • Protect kernel from application • Some net devices in userspace • File I/O offloaded ("function shipped") to I/O nodes • Through kernel system calls • "Boot, start app and then stay out of the way" • I/O nodes: Linux • 2.4.19 kernel (2.6 underway) with ramdisk • NFS/GPFS client • CIO daemon to • Start/stop jobs • Execute file I/O • Global O/S (CMCS, service node) • Invisible to user programs • Global and collective decisions • Interfaces with external policy modules (e.g., job scheduler) • Commercial database technology (DB2) stores static and dynamic state • Partition selection • Partition boot • Running of jobs • System error logs • Checkpoint/restart mechanism • Scalability, robustness, security • Execution mechanisms in the core • Policy decisions in the service node
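From the application's point of view the function shipping is transparent: file I/O is written as ordinary POSIX calls, and the compute node kernel forwards each system call to the CIO daemon on the owning I/O node. The sketch below is a minimal, hypothetical helper illustrating this; the function name and path are ours, not part of the BlueGene/L software.

```c
/* Ordinary POSIX file I/O as an application on a compute node would issue
 * it; under a function-shipping kernel, each of these system calls is
 * forwarded to the I/O-node daemon. The helper and path are hypothetical. */
#include <fcntl.h>
#include <unistd.h>

int write_checkpoint(const char *path, const void *data, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    ssize_t n = write(fd, data, len);   /* shipped to the I/O node */
    close(fd);
    return (n == (ssize_t)len) ? 0 : -1;
}
```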
BlueGene/L: Overview • Node Integration • High (processing node integrates processors and network interfaces, network interfaces directly connected to the processors) • Network Integration • High (separate tree network) • System Software Integration • Medium/High (compute kernels are not globally coordinated) • #2 and #4 in the top500
Cray XD1 System Architecture Compute • 12 AMD Opteron 32/64 bit, x86 processors • High Performance Linux RapidArray Interconnect • 12 communications processors • 1 Tb/s switch fabric Active Management • Dedicated processor Application Acceleration • 6 co-processors • Processors directly connected to the interconnect
Cray XD1 Processing Node (chassis layout) • Chassis front: six 2-way SMP blades, four fans, 500 Gb/s crossbar switch, six SATA hard drives • Chassis rear: 12-port inter-chassis connector, four independent PCI-X slots, connector to a 2nd 500 Gb/s crossbar switch and 12-port inter-chassis connector
Cray XD1 Compute Blade • Two AMD Opteron 2XX processors, each with 4 DIMM sockets for DDR 400 registered ECC memory • RapidArray communications processor • Connector to main board
Fast Access to the Interconnect (memory / I/O / interconnect bandwidth per processor) • Xeon server: 5.3 GB/s memory (DDR 333), 1 GB/s I/O (PCI-X), 0.25 GB/s interconnect (GigE) • Cray XD1: 6.4 GB/s memory (DDR 400), 8 GB/s interconnect
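Sustained memory bandwidth is commonly estimated with a STREAM-style kernel; the triad sketch below is a generic illustration in C, not the benchmark behind the figures above. Array size and repetition count are illustrative, and the arrays should be sized well beyond the caches.

```c
/* STREAM-style triad sketch for estimating sustained memory bandwidth:
 * a[i] = b[i] + s * c[i] moves 3 doubles (24 bytes) per iteration. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 24)         /* 16M doubles per array, ~128 MB each */

int main(void)
{
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    double *c = malloc(N * sizeof *c);
    const double s = 3.0;
    const int reps = 10;

    for (long i = 0; i < N; i++) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int r = 0; r < reps; r++)
        for (long i = 0; i < N; i++)
            a[i] = b[i] + s * c[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs  = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    double bytes = (double)reps * N * 3 * sizeof(double);
    printf("triad bandwidth: %.2f GB/s (a[0]=%f)\n", bytes / secs / 1e9, a[0]);

    free(a); free(b); free(c);
    return 0;
}
```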
Communications Optimizations (RapidArray communications processor) • HT/RA tunnelling with bonding • Routing with route redundancy • Reliable transport • Short message latency optimization • DMA operations • System-wide clock synchronization • Each Opteron connects to its RapidArray communications processor over a 3.2 GB/s HyperTransport link, with two 2 GB/s RapidArray links into the fabric
Active Manager System (Active Management Software) • Usability: single system command and control • Resiliency: dedicated management processors, real-time OS and communications fabric; proactive background diagnostics with self-healing • Synchronized Linux kernels
Cray XD1: Overview • Node Integration • High (direct access from HyperTransport to RapidArray) • Network Integration • Medium/High (HW support for collective communication) • System Software Integration • High (Compute kernels are globally coordinated) • Early stage
Red Storm Architecture • Distributed memory MIMD parallel supercomputer • Fully connected 3D mesh interconnect; each compute node processor has a bi-directional connection to the primary communication network • 108 compute node cabinets and 10,368 compute node processors (AMD Sledgehammer @ 2.0 GHz) • ~10 TB of DDR memory @ 333 MHz • Red/Black switching: ~1/4, ~1/2, ~1/4 • 8 service and I/O cabinets on each end (256 processors for each color) • 240 TB of disk storage (120 TB per color)
Red Storm Architecture • Functional hardware partitioning: service and I/O nodes, compute nodes, and RAS nodes • Partitioned Operating System (OS): LINUX on service and I/O nodes, LWK (Catamount) on compute nodes, stripped down LINUX on RAS nodes • Separate RAS and system management network (Ethernet) • Router table-based routing in the interconnect
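Applications typically address a 3D mesh like Red Storm's through MPI's Cartesian topology interface rather than the router tables directly. The sketch below is a generic illustration, not Red Storm-specific code: it factors the job size into a 3D grid and looks up mesh coordinates and neighbours.

```c
/* Sketch of how an application addresses a 3D mesh interconnect through
 * MPI's Cartesian topology interface: MPI_Dims_create factors the job
 * size into a 3D grid and MPI_Cart_shift yields mesh neighbours
 * (MPI_PROC_NULL at the mesh edges, since the grid is non-periodic). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, nprocs, mrank;
    int dims[3] = {0, 0, 0}, periods[3] = {0, 0, 0};
    int coords[3], left, right;
    MPI_Comm mesh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    MPI_Dims_create(nprocs, 3, dims);            /* factor ranks into X*Y*Z */
    MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, 1, &mesh);
    MPI_Comm_rank(mesh, &mrank);
    MPI_Cart_coords(mesh, mrank, 3, coords);
    MPI_Cart_shift(mesh, 0, 1, &left, &right);   /* neighbours along X */

    if (rank == 0)
        printf("mesh %d x %d x %d\n", dims[0], dims[1], dims[2]);
    printf("rank %d at (%d,%d,%d), X neighbours %d/%d\n",
           mrank, coords[0], coords[1], coords[2], left, right);

    MPI_Comm_free(&mesh);
    MPI_Finalize();
    return 0;
}
```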
Red Storm architecture (partition diagram): compute nodes in the middle, flanked by service nodes (user logins), file I/O nodes (/home), and network I/O nodes
System Layout (27 x 16 x 24 mesh): normally unclassified nodes at one end, normally classified nodes at the other, with switchable nodes and disconnect cabinets in between
Red Storm System Software • Run-Time System • Logarithmic loader • Fast, efficient node allocator • Batch system – PBS • Libraries – MPI, I/O, Math • File systems being considered include • PVFS – interim file system • Lustre – PathForward support • Panasas… • Operating Systems • LINUX on service and I/O nodes • Sandia's LWK (Catamount) on compute nodes • LINUX on RAS nodes
ASCI Red Storm: Overview • Node Integration • High (direct access from HyperTransport to network through custom network interface chip) • Network Integration • Medium (No support for collective communication) • System Software Integration • Medium/High (scalable resource manager, no global coordination between nodes) • Expected to become the most powerful machine in the world (competition permitting)
A Case Study: ASCI Q • We try to provide some insight into what we perceive to be the important problems in a large-scale supercomputer • Our hands-on experience with ASCI Q shows that system software and global coordination are fundamental in a large-scale parallel machine
ASCI Q • 2,048 ES45 AlphaServers, with 4 processors per node • 16 GB of memory per node • 8,192 processors in total • 2 independent network rails, Quadrics Elan3 • > 8,192 cables • 20 Tflops peak, #2 in the top 500 lists • A complex human artifact
Dealing with the complexity of a real system • In this section of the tutorial we provide insight into the methodology we used to substantially improve the performance of ASCI Q • This methodology is based on an arsenal of • analytical models • custom microbenchmarks • full applications • discrete event simulators • It deals with the complexity of the machine and the complexity of a real parallel application, SAGE, with > 150,000 lines of Fortran & MPI code
Overview • Our performance expectations for ASCI Q and the reality • Identification of performance factors • Application performance and breakdown into components • Detailed examination of system effects • A methodology to identify operating system effects (see the sketch below) • Effect of scaling – up to 2000 nodes / 8000 processors • Quantification of the impact • Towards the elimination of overheads • demonstrated over 2x performance improvement • Generalization of our results: application resonance • Bottom line: the importance of integrating the system software across nodes
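One simple way to identify operating system effects, in the spirit of the custom microbenchmarks mentioned above, is a fixed-work loop: each processor repeatedly times the same short computation, and iterations that take much longer than the minimum expose interference from system activity. The sketch below is a generic illustration in C with MPI, not the benchmark used on ASCI Q; work size and iteration count are illustrative.

```c
/* Sketch of a fixed-work "noise" microbenchmark: each rank times the same
 * short computation many times; iterations that take much longer than the
 * minimum reveal interference from system activity on that node. */
#include <mpi.h>
#include <stdio.h>

#define REPS 100000
#define WORK 5000

int main(int argc, char **argv)
{
    int rank;
    volatile double x = 1.0;                 /* volatile: keep the work */
    double tmin = 1e9, tmax = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int i = 0; i < REPS; i++) {
        double t0 = MPI_Wtime();
        for (int j = 0; j < WORK; j++)       /* fixed amount of work */
            x = x * 1.0000001 + 0.0000001;
        double dt = MPI_Wtime() - t0;
        if (dt < tmin) tmin = dt;
        if (dt > tmax) tmax = dt;
    }

    printf("rank %4d: min %.2f us, max %.2f us (slowdown %.1fx)\n",
           rank, tmin * 1e6, tmax * 1e6, tmax / tmin);

    MPI_Finalize();
    return 0;
}
```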
Performance of SAGE on 1,024 nodes • Performance consistent across QA and QB (the two segments of ASCI Q, with 1024 nodes/4096 processors each) • Measured time 2x greater than predicted by the analytical model (4096 PEs): there is a difference, but why? (A simplified form of the model is sketched below.) • Lower is better!
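The model referred to above is an analytical performance model of SAGE. In highly simplified form (our notation, not the published model), the predicted time per iteration decomposes into computation on the local sub-grid plus boundary-exchange and allreduce communication:

```latex
% Highly simplified decomposition of the modelled per-iteration time of
% SAGE (notation ours): local compute time plus boundary-exchange and
% collective (allreduce) communication costs, where E is the total number
% of cells and P the number of processors.
\[
  T_{\mathrm{cycle}}(P) \;\approx\;
      T_{\mathrm{comp}}\!\left(\tfrac{E}{P}\right)
    + T_{\mathrm{exchange}}(P)
    + T_{\mathrm{allreduce}}(P)
\]
```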
Using fewer PEs per Node • Test performance using 1, 2, 3 and 4 PEs per node • Lower is better!
Using fewer PEs per node (2) • Measurements match the model almost exactly for 1, 2 and 3 PEs per node! • The performance issue only occurs when using 4 PEs per node