Explore how Sandia National Laboratories led the DOE/DP revolution in MPP supercomputing, emphasizing scalability, reliability, and high performance computing architectures.
“Big” and “Not so Big” Iron at SNL - SOS8, Charleston, SC
SNL CS R&D Accomplishment: Pathfinder for MPP Supercomputing [slide shows photos of the machines: nCUBE, Intel Paragon, ASCI Red, Cplant, ICC, Red Storm] • Sandia successfully led the DOE/DP revolution into MPP supercomputing through CS R&D: • nCUBE-10 • nCUBE-2 • iPSC/860 • Intel Paragon • ASCI Red • Cplant • … and gave DOE a strong, scalable parallel platforms effort • Computing at SNL is an applications success (i.e., uniquely high scalability and reliability among FFRDCs) because CS R&D paved the way • Note: There was considerable skepticism in the community that MPP computing would be a success
Our Approach • Large systems with a few processors per node • Message passing paradigm • Balanced architecture • Efficient systems software • Critical advances in parallel algorithms • Real engineering applications • Vertically integrated technology base • Emphasis on scalability & reliability in all aspects
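To make the "message passing paradigm" item above concrete, here is a minimal, generic MPI sketch (illustrative only, not Sandia code or any specific system's software): each process owns its own memory, and every data exchange is an explicit send/receive pair.

/* Minimal sketch of the message-passing paradigm: data is private to each
 * process until it is explicitly sent and received. Compile with an MPI
 * compiler wrapper (e.g., mpicc) and run with at least 2 ranks. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;                    /* data exists only on rank 0 ...           */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {            /* ... until it is explicitly received here */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}

Explicit ownership of data is what lets this model scale to thousands of nodes without shared-memory contention, which is why it pairs naturally with the balanced-architecture and scalability goals above.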
A Scalable Computing Architecture [partition diagram: Compute, Service, File I/O, Net I/O, and Users (/home) partitions]
ASCI Red • 4,576 compute nodes • 9,472 Pentium II processors • 800 MB/sec bi-directional interconnect • 3.21 Peak TFlops • 2.34 TFlops on Linpack • 74% of peak • 9632 Processors • TOS on Service Nodes • Cougar LWK on Compute Nodes • 1.0 GB/sec Parallel File System
Computational Plant • Antarctica - 2,376 Nodes • Antarctica has 4 “heads” with a switchable center section • Unclassified Restricted Network • Unclassified Open Network • Classified Network • Compaq (HP) DS10L “Slates” • 466 MHz EV6, 1 GB RAM • 600 MHz EV67, 1 GB RAM • Re-deployed Siberia XP1000 Nodes • 500 MHz EV6, 256 MB RAM • Myrinet • 3D Mesh Topology • 33 MHz, 64-bit • A mix of 1,280 and 2,000 Mbit/sec technology • LANai 7.x and 9.x • Runtime Software • Yod - Application loader • Pct - Compute node process control • Bebopd - Allocation • OpenPBS - Batch scheduling • Portals Message Passing API • Red Hat Linux 7.2 w/2.4.x Kernel • Compaq (HP) Fortran, C, C++ • MPICH over Portals
Institutional Computing Clusters • Two 256-node clusters (one classified, one unclassified) in NM • 236 compute nodes: dual 3.06 GHz Xeon processors, 2 GB memory, Myricom Myrinet PCI NIC (XP, Rev D, 2 MB) • 2 admin nodes • 4 login nodes • 2 MetaData Server (MDS) nodes • 12 Object Store Target (OST) nodes • 256-port Myrinet switch • A 128-node (unclassified) and a 64-node (classified) cluster in CA • Login nodes: RedHat Linux 7.3; Kerberos; Intel compilers (C, C++, Fortran); open-source compilers (gcc, Java); TotalView; VampirTrace; Myrinet GM • Administrative nodes: Red Hat Linux 7.3; OpenPBS; Myrinet GM w/Mapper; SystemImager; Ganglia; Mon; CAP; Tripwire • Compute nodes: RedHat Linux 7.3; application directory; MKL math library; TotalView client; VampirTrace client; MPICH-GM; OpenPBS client; PVFS client; Myrinet GM
Red Squall Development Cluster • Hewlett Packard collaboration - integration, testing, and system SW support; Lustre and Quadrics expertise • RackSaver BladeRack nodes - high-density compute server architecture; 66 nodes (132 processors) per rack • 2.0 GHz AMD Opteron - same as Red Storm, but with commercial Tyan motherboards • 2 GB of main memory per node (same as Red Storm) • Quadrics QsNetII (Elan4) interconnect - best-in-class performance among commercial cluster interconnects • I/O subsystem uses DDN S2A8500 couplets with Fibre Channel disk drives (same as Red Storm) - best-in-class performance • Located in the new JCEL facility
Red Storm Goals • Balanced System Performance - CPU, memory, interconnect, and I/O. • Usability - Functionality of hardware and software meets needs of users for massively parallel computing. • Scalability - System hardware and software scale from a single-cabinet system to a ~20,000 processor system. • Reliability - Machine stays up long enough between interrupts to make real progress on completing an application run (at least 50 hours MTBI); requires full system RAS capability. • Upgradability - System can be upgraded with a processor swap and additional cabinets to 100T or greater. • Red/Black Switching - Capability to switch major portions of the machine between classified and unclassified computing environments. • Space, Power, Cooling - High density, low power system. • Price/Performance - Excellent performance per dollar; use high-volume commodity parts where feasible.
Red Storm Architecture • True MPP, designed to be a single system. • Distributed memory MIMD parallel supercomputer. • Fully connected 3-D mesh interconnect; each compute node processor and each service and I/O node processor has a high-bandwidth, bi-directional connection to the primary communication network. • 108 compute node cabinets and 10,368 compute node processors (AMD Opteron @ 2.0 GHz). • ~10 TB of DDR memory @ 333 MHz. • Red/Black switching - ~1/4, ~1/2, ~1/4. • 8 service and I/O cabinets on each end (256 processors for each color). • 240 TB of disk storage (120 TB per color). • Functional hardware partitioning - service and I/O nodes, compute nodes, and RAS nodes. • Partitioned Operating System (OS) - LINUX on service and I/O nodes, LWK (Catamount) on compute nodes, stripped-down LINUX on RAS nodes. • Separate RAS and system management network (Ethernet). • Router-table-based routing in the interconnect (a dimension-order routing sketch follows below for illustration). • Less than 2 MW total power and cooling. • Less than 3,000 square feet of floor space.
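As an illustration of how traffic moves through a 3-D mesh interconnect like the one listed above, the sketch below hard-codes deterministic dimension-order routing (correct X, then Y, then Z), one common policy for populating mesh routing tables; it is a simplification for exposition, not Red Storm's actual router logic, and the coordinates used are arbitrary.

/* Illustrative only: dimension-order (X, then Y, then Z) routing on a 3-D
 * mesh. A real router consults downloaded routing tables; this rule is one
 * common way such tables are generated. */
#include <stdio.h>

typedef struct { int x, y, z; } Coord;
typedef enum { XPLUS, XMINUS, YPLUS, YMINUS, ZPLUS, ZMINUS, LOCAL } Port;

/* Choose the output port for a packet currently at 'here', destined for 'dest'. */
static Port next_port(Coord here, Coord dest)
{
    if (dest.x != here.x) return (dest.x > here.x) ? XPLUS : XMINUS;
    if (dest.y != here.y) return (dest.y > here.y) ? YPLUS : YMINUS;
    if (dest.z != here.z) return (dest.z > here.z) ? ZPLUS : ZMINUS;
    return LOCAL;  /* packet has arrived at its destination node */
}

int main(void)
{
    Coord here = {0, 0, 0}, dest = {3, 1, 2};   /* arbitrary example endpoints */
    static const char *names[] = {"X+", "X-", "Y+", "Y-", "Z+", "Z-", "local"};

    /* Walk the packet hop by hop and print the route it would take. */
    while (here.x != dest.x || here.y != dest.y || here.z != dest.z) {
        Port p = next_port(here, dest);
        printf("at (%d,%d,%d) -> %s\n", here.x, here.y, here.z, names[p]);
        switch (p) {
        case XPLUS:  here.x++; break;   case XMINUS: here.x--; break;
        case YPLUS:  here.y++; break;   case YMINUS: here.y--; break;
        case ZPLUS:  here.z++; break;   case ZMINUS: here.z--; break;
        default: break;
        }
    }
    printf("arrived at (%d,%d,%d)\n", dest.x, dest.y, dest.z);
    return 0;
}

Deterministic routing of this kind keeps packets between a given pair of nodes in order, which simplifies the message-passing layers built on top of the network.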
Red Storm Layout • Less than 2 MW total power and cooling. • Less than 3,000 square feet of floor space. • Separate RAS and system management network (Ethernet). • 3D Mesh: 27 x 16 x 24 (x, y, z); 27 x 16 x 24 = 10,368, matching the compute node processor count. • Red/Black split 2688 : 4992 : 2688, which also sums to 10,368. • Service & I/O: 2 x 8 x 16.
Red Storm Cabinet Layout • Compute Node Cabinet • 3 Card Cages per Cabinet • 8 Boards per Card Cage • 4 Processors per Board • 4 NIC/Router Chips per Board • N+1 Power Supplies • Passive Backplane • Service and I/O Node Cabinet • 2 Card Cages per Cabinet • 8 Boards per Card Cage • 2 Processors per Board • 2 NIC/Router Chips per Board • PCI-X for each Processor • N+1 Power Supplies • Passive Backplane
Red Storm Software • Operating Systems - LINUX on service and I/O nodes; LWK (Catamount) on compute nodes; LINUX on RAS nodes • File Systems - parallel file system: Lustre (PVFS); Unix file system: Lustre (NFS) • Run-Time System - logarithmic loader (see the sketch below); node allocator; batch system - PBS; libraries - MPI, I/O, math • Programming Model - message passing; support for heterogeneous applications • Tools - ANSI standard compilers (Fortran, C, C++); debugger - TotalView; performance monitor • System Management and Administration - accounting; RAS GUI interface; single system view
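The "logarithmic loader" entry above refers to launching a job with a tree-structured fan-out: every node that already holds the executable forwards it onward, so N compute nodes are reached in roughly log2(N) steps instead of N serial transfers from the launch node. The toy program below (a hypothetical model, not the actual yod/pct protocol) just counts those fan-out rounds to show the scaling.

/* Toy model of a logarithmic (tree fan-out) application loader: coverage
 * doubles each round because every node that has the binary forwards it to
 * one new node. This models the scaling argument only, not the real loader. */
#include <stdio.h>

int main(void)
{
    long n_nodes = 10368;     /* e.g., the size of Red Storm's compute partition */
    long covered = 1;         /* the launching node starts with the executable   */
    int rounds = 0;

    while (covered < n_nodes) {
        covered *= 2;         /* each holder forwards the binary to one new node */
        rounds++;
    }
    printf("%ld nodes reached in %d fan-out rounds (vs. %ld serial sends)\n",
           n_nodes, rounds, n_nodes - 1);
    return 0;
}

For 10,368 nodes this is 14 rounds versus 10,367 serial transfers, which is why tree-based loading matters at MPP scale.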
Red Storm Performance • Based on application code testing on production AMD Opteron processors, we now expect Red Storm to deliver around a 10X performance improvement over ASCI Red on Sandia’s suite of application codes. • Expected MP-Linpack performance - ~30 TF. • Processors: 2.0 GHz AMD Opteron (Sledgehammer); integrated dual DDR memory controllers @ 333 MHz; page-miss latency to local processor memory is ~80 nanoseconds; peak bandwidth of ~5.3 GB/s for each processor; three integrated HyperTransport interfaces @ 3.2 GB/s each direction. • Interconnect performance: latency <2 µs (neighbor), <5 µs (full machine); peak link bandwidth ~3.84 GB/s each direction; bisection bandwidth ~2.95 TB/s Y-Z, ~4.98 TB/s X-Z, ~6.64 TB/s X-Y. • I/O system performance: sustained file system bandwidth of 50 GB/s for each color; sustained external network bandwidth of 25 GB/s for each color.
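Point-to-point latency and bandwidth numbers like those above are conventionally measured with a ping-pong microbenchmark. The sketch below is a generic example of that technique (the iteration count and message size are arbitrary, and it is not the actual Red Storm acceptance test): two ranks bounce a message back and forth, and rank 0 reports half the average round trip as the one-way latency.

/* Generic ping-pong microbenchmark sketch for point-to-point latency and
 * bandwidth. Build with an MPI wrapper compiler and run with exactly 2 ranks,
 * e.g.: mpirun -np 2 ./pingpong 8 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define NITER 1000

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int nbytes = (argc > 1) ? atoi(argv[1]) : 8;   /* message size in bytes */
    char *buf = calloc(nbytes, 1);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < NITER; i++) {
        if (rank == 0) {
            MPI_Send(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double elapsed = MPI_Wtime() - t0;

    if (rank == 0) {
        double one_way_us = elapsed / NITER / 2.0 * 1e6;        /* half a round trip */
        double bw_mb_s    = (2.0 * NITER * nbytes) / elapsed / 1e6;
        printf("%d bytes: %.2f us one-way latency, %.1f MB/s\n",
               nbytes, one_way_us, bw_mb_s);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}

Small messages expose latency (the <2 µs neighbor figure above), while large messages approach the link bandwidth.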
HPC R&D Efforts at SNL • Advanced Architectures - next-generation processor and interconnect technologies; simulation and modeling of algorithm performance. • Message Passing - Portals; application characterization of message-passing patterns (a profiling sketch follows below). • Light Weight Kernels - project to design a next-generation lightweight kernel (LWK) for compute nodes of a distributed-memory massively parallel system; assess the performance, scalability, and reliability of a lightweight kernel versus a traditional monolithic kernel; investigate efficient methods of supporting dynamic operating system services. • Light Weight File System - only critical I/O functionality (storage, metadata management, security); special functionality implemented in I/O libraries (above LWFS). • Light Weight OS - Linux configuration to eliminate the need for a remote /root; “trimming” the kernel to eliminate unwanted and unnecessary daemons. • Cluster Management Tools - diskless cluster strategies and techniques; operating system distribution and initialization; log analysis to improve robustness, reliability, and maintainability.
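One standard way to do the "application characterization of message-passing patterns" mentioned above is the MPI profiling (PMPI) interface: user-supplied wrappers intercept MPI calls, record counts and sizes, and then call the underlying PMPI entry points. The sketch below covers only MPI_Send; the counter names and the MAX_RANKS limit are hypothetical, and a real tool would also wrap nonblocking sends and collectives.

/* Sketch of message-pattern characterization via the standard MPI profiling
 * (PMPI) layer: intercept MPI_Send, tally message counts and bytes per
 * destination rank, and report at MPI_Finalize. Link this ahead of the MPI
 * library; the application itself needs no changes. */
#include <mpi.h>
#include <stdio.h>

#define MAX_RANKS 16384                    /* hypothetical upper bound on peers */
static long long sends_to[MAX_RANKS];      /* messages sent to each peer rank   */
static long long bytes_to[MAX_RANKS];      /* bytes sent to each peer rank      */

int MPI_Send(const void *buf, int count, MPI_Datatype type,
             int dest, int tag, MPI_Comm comm)
{
    int size;
    PMPI_Type_size(type, &size);
    if (dest >= 0 && dest < MAX_RANKS) {
        sends_to[dest] += 1;
        bytes_to[dest] += (long long)count * size;
    }
    return PMPI_Send(buf, count, type, dest, tag, comm);   /* do the real send */
}

int MPI_Finalize(void)
{
    int rank;
    PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
    for (int d = 0; d < MAX_RANKS; d++)
        if (sends_to[d] > 0)
            printf("rank %d -> %d: %lld msgs, %lld bytes\n",
                   rank, d, sends_to[d], bytes_to[d]);
    return PMPI_Finalize();
}

Histograms gathered this way (who talks to whom, and with what message sizes) are the kind of input used for the architecture simulation and interconnect work listed above.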
More Information: Computation, Computers, Information and Mathematics Center - http://www.cs.sandia.gov