
Achieving Success in MPP Supercomputing at SNL Through Innovative R&D

Explore how Sandia National Laboratories led the DOE/DP revolution in MPP supercomputing, emphasizing scalability, reliability, and high performance computing architectures.


Presentation Transcript


  1. “Big” and “Not so Big” Iron at SNL
     SOS8, Charleston, SC

  2. SNL CS R&D Accomplishment: Pathfinder for MPP Supercomputing
     • Sandia successfully led the DOE/DP revolution into MPP supercomputing through CS R&D:
       nCUBE-10 • nCUBE-2 • iPSC/860 • Intel Paragon • ASCI Red • Cplant
     • … and gave DOE a strong, scalable parallel platforms effort
     • Computing at SNL is an applications success (i.e., uniquely high scalability & reliability among FFRDCs) because CS R&D paved the way
     • Note: There was considerable skepticism in the community that MPP computing would be a success
     [Slide images: nCUBE, Intel Paragon, ASCI Red, Cplant, ICC, Red Storm]

  3. Our Approach
     • Large systems with a few processors per node
     • Message passing paradigm (a minimal sketch follows this slide)
     • Balanced architecture
     • Efficient systems software
     • Critical advances in parallel algorithms
     • Real engineering applications
     • Vertically integrated technology base
     • Emphasis on scalability & reliability in all aspects
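
As a concrete, minimal sketch of the message-passing paradigm this approach is built on - written here as a generic MPI program in C, not code from any of the systems in these slides - each process owns its own memory and all data exchange is explicit:

    /* Sketch of the message-passing paradigm: each rank computes a
     * partial result and the results are combined with MPI_Allreduce.
     * Illustrative only; not code from the presentation. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        double local, global = 0.0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Each node owns its slice of the problem (here, just its rank). */
        local = (double)rank;

        /* No shared memory is assumed; all communication is explicit. */
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        if (rank == 0)
            printf("sum over %d ranks = %g\n", size, global);

        MPI_Finalize();
        return 0;
    }

Because nothing is shared implicitly, the same program runs unchanged from a few processors to thousands of nodes, which is the scalability property the bullets above emphasize.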

  4. A Scalable Computing Architecture
     [Architecture diagram; partition labels: Compute, File I/O, Net I/O, Service, Users, /home]

  5. ASCI Red
     • 4,576 compute nodes
     • 9,472 Pentium II processors (9,632 processors in total)
     • 800 MB/sec bi-directional interconnect
     • 3.21 TFlops peak
     • 2.34 TFlops on Linpack (74% of peak)
     • TOS on service nodes
     • Cougar LWK on compute nodes
     • 1.0 GB/sec parallel file system

  6. Computational Plant (Cplant)
     • Antarctica - 2,376 nodes
       • Antarctica has 4 “heads” with a switchable center section
       • Unclassified Restricted Network • Unclassified Open Network • Classified Network
     • Compaq (HP) DS10L “Slates”
       • 466 MHz EV6, 1 GB RAM • 600 MHz EV67, 1 GB RAM
     • Re-deployed Siberia XP1000 nodes
       • 500 MHz EV6, 256 MB RAM
     • Myrinet
       • 3D mesh topology • 33 MHz, 64-bit • A mix of 1,280 and 2,000 Mbit/sec technology • LANai 7.x and 9.x
     • Runtime software
       • Yod - application loader • Pct - compute node process control • Bebopd - allocation • OpenPBS - batch scheduling
     • Portals message passing API
     • Red Hat Linux 7.2 w/ 2.4.x kernel
     • Compaq (HP) Fortran, C, C++
     • MPICH over Portals

  7. Institutional Computing Clusters
     • Two (classified/unclassified) 256-node clusters in NM (node roles tallied below):
       • 236 compute nodes: dual 3.06 GHz Xeon processors, 2 GB memory, Myricom Myrinet PCI NIC (XP, Rev D, 2 MB)
       • 2 admin nodes • 4 login nodes • 2 MetaData Server (MDS) nodes • 12 Object Store Target (OST) nodes
       • 256-port Myrinet switch
     • 128-node (unclassified) and 64-node (classified) clusters in CA
     • Login nodes: RedHat Linux 7.3 • Kerberos • Intel compilers (C, C++, Fortran) • Open-source compilers (gcc, Java) • TotalView • VampirTrace • Myrinet GM
     • Administrative nodes: Red Hat Linux 7.3 • OpenPBS • Myrinet GM w/Mapper • SystemImager • Ganglia • Mon • CAP • Tripwire
     • Compute nodes: RedHat Linux 7.3 • Application directory • MKL math library • TotalView client • VampirTrace client • MPICH-GM • OpenPBS client • PVFS client • Myrinet GM
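
A quick tally of the node roles listed above, using the slide's own numbers, accounts for each 256-node NM cluster exactly: 236 compute + 2 admin + 4 login + 2 MDS + 12 OST = 256 nodes, which also matches the 256-port Myrinet switch.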

  8. Usage

  9. Red Squall Development Cluster
     • Hewlett Packard collaboration: integration, testing, system SW support; Lustre and Quadrics expertise
     • RackSaver BladeRack nodes - high-density compute server architecture, 66 nodes (132 processors) per rack
     • 2.0 GHz AMD Opteron - same as Red Storm, but with commercial Tyan motherboards
     • 2 GBytes of main memory per node (same as Red Storm)
     • Quadrics QsNetII (Elan4) interconnect - best-in-class performance among commercial cluster interconnects
     • I/O subsystem uses DDN S2A8500 couplets with Fibre Channel disk drives (same as Red Storm) - best-in-class performance
     • Located in the new JCEL facility

  10. SOS8 Charleston, SC

  11. Red Storm Goals
      • Balanced system performance - CPU, memory, interconnect, and I/O.
      • Usability - Functionality of hardware and software meets the needs of users for massively parallel computing.
      • Scalability - System hardware and software scale from a single-cabinet system to a ~20,000 processor system.
      • Reliability - Machine stays up long enough between interrupts to make real progress on completing an application run (at least 50 hours MTBI); requires full system RAS capability.
      • Upgradability - System can be upgraded with a processor swap and additional cabinets to 100 TFlops or greater.
      • Red/Black switching - Capability to switch major portions of the machine between classified and unclassified computing environments.
      • Space, power, cooling - High-density, low-power system.
      • Price/Performance - Excellent performance per dollar; use high-volume commodity parts where feasible.

  12. Red Storm Architecture
      • True MPP, designed to be a single system
      • Distributed memory MIMD parallel supercomputer
      • Fully connected 3-D mesh interconnect; each compute node and service and I/O node processor has a high-bandwidth, bi-directional connection to the primary communication network
      • 108 compute node cabinets and 10,368 compute node processors (AMD Opteron @ 2.0 GHz)
      • ~10 TB of DDR memory @ 333 MHz
      • Red/Black switching - ~1/4, ~1/2, ~1/4
      • 8 service and I/O cabinets on each end (256 processors for each color)
      • 240 TB of disk storage (120 TB per color)
      • Functional hardware partitioning - service and I/O nodes, compute nodes, and RAS nodes
      • Partitioned Operating System (OS) - LINUX on service and I/O nodes, LWK (Catamount) on compute nodes, stripped-down LINUX on RAS nodes
      • Separate RAS and system management network (Ethernet)
      • Router-table-based routing in the interconnect
      • Less than 2 MW total power and cooling
      • Less than 3,000 square feet of floor space

  13. Red Storm Layout
      • Less than 2 MW total power and cooling
      • Less than 3,000 square feet of floor space
      • Separate RAS and system management network (Ethernet)
      • 3D mesh: 27 x 16 x 24 (x, y, z) - see the consistency check below
      • Red/Black split 2,688 : 4,992 : 2,688
      • Service & I/O: 2 x 8 x 16
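
A quick consistency check using only the figures quoted on these slides: the 27 x 16 x 24 mesh contains 10,368 positions, matching the 10,368 compute node processors on the previous slide; the Red/Black split 2,688 + 4,992 + 2,688 likewise totals 10,368; and the 2 x 8 x 16 service & I/O section gives 256 processors, matching the 256 service and I/O processors per color.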

  14. Red Storm Cabinet Layout
      • Compute node cabinet (counts checked against the processor totals below):
        • 3 card cages per cabinet • 8 boards per card cage • 4 processors per board • 4 NIC/router chips per board
        • N+1 power supplies • Passive backplane
      • Service and I/O node cabinet:
        • 2 card cages per cabinet • 8 boards per card cage • 2 processors per board • 2 NIC/router chips per board
        • PCI-X for each processor • N+1 power supplies • Passive backplane
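
These cabinet counts are consistent with the processor totals on the architecture slide: 108 compute cabinets x 3 card cages x 8 boards x 4 processors = 10,368 compute processors, and 8 service and I/O cabinets per end x 2 card cages x 8 boards x 2 processors = 256 processors per color.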

  15. Red Storm Software
      • Operating systems: LINUX on service and I/O nodes • LWK (Catamount) on compute nodes • LINUX on RAS nodes
      • File systems: Parallel file system - Lustre (PVFS) • Unix file system - Lustre (NFS)
      • Run-time system: Logarithmic loader • Node allocator • Batch system - PBS • Libraries - MPI, I/O, Math
      • Programming model: Message passing • Support for heterogeneous applications
      • Tools: ANSI standard compilers - Fortran, C, C++ • Debugger - TotalView • Performance monitor
      • System management and administration: Accounting • RAS GUI interface • Single system view

  16. Red Storm Performance
      • Based on application code testing on production AMD Opteron processors, we now expect Red Storm to deliver around a 10x performance improvement over ASCI Red on Sandia’s suite of application codes
      • Expected MP-Linpack performance - ~30 TF
      • Processors
        • 2.0 GHz AMD Opteron (Sledgehammer)
        • Integrated dual DDR memory controllers @ 333 MHz
        • Page miss latency to local processor memory is ~80 nanoseconds
        • Peak bandwidth of ~5.3 GB/s for each processor (see the arithmetic note below)
        • Three integrated HyperTransport interfaces @ 3.2 GB/s each direction
      • Interconnect performance
        • Latency <2 µs (neighbor), <5 µs (full machine)
        • Peak link bandwidth ~3.84 GB/s each direction
        • Bi-section bandwidth ~2.95 TB/s Y-Z, ~4.98 TB/s X-Z, ~6.64 TB/s X-Y
      • I/O system performance
        • Sustained file system bandwidth of 50 GB/s for each color
        • Sustained external network bandwidth of 25 GB/s for each color
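
The quoted peak figures can be reproduced from the component numbers above. As a consistency check - assuming standard 64-bit (8-byte) DDR channels and counting both directions of each mesh link, assumptions not stated on the slide - two DDR channels at 333 MT/s give roughly 2 x 333 x 8 bytes ≈ 5.3 GB/s of memory bandwidth per processor, and the Y-Z bi-section of the 27 x 16 x 24 mesh crosses 16 x 24 = 384 links, so 384 x 3.84 GB/s x 2 directions ≈ 2.95 TB/s.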

  17. HPC R&D Efforts at SNL
      • Advanced architectures
        • Next-generation processor & interconnect technologies
        • Simulation and modeling of algorithm performance
      • Message passing
        • Portals
        • Application characterization of message passing patterns (see the sketch after this list)
      • Lightweight kernels
        • Project to design a next-generation lightweight kernel (LWK) for compute nodes of a distributed memory massively parallel system
        • Assess the performance, scalability, and reliability of a lightweight kernel versus a traditional monolithic kernel
        • Investigate efficient methods of supporting dynamic operating system services
      • Lightweight file system
        • Only critical I/O functionality (storage, metadata management, security)
        • Special functionality implemented in I/O libraries (above LWFS)
      • Lightweight OS
        • Linux configuration to eliminate the need for a remote /root
        • “Trimming” the kernel to eliminate unwanted and unnecessary daemons
      • Cluster management tools
        • Diskless cluster strategies and techniques
        • Operating system distribution and initialization
        • Log analysis to improve robustness, reliability, and maintainability
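
As an illustration of the kind of measurement behind “application characterization of message passing patterns” (and of the latency numbers quoted on the performance slide), here is a minimal MPI ping-pong micro-benchmark in C. It is a generic sketch, not the actual Sandia characterization tooling:

    /* Minimal MPI ping-pong: estimates one-way latency for small messages
     * between ranks 0 and 1. Illustrative sketch only; run with >= 2 ranks. */
    #include <mpi.h>
    #include <stdio.h>

    #define NITER    1000
    #define MSG_SIZE 8          /* bytes: small message, latency-dominated */

    int main(int argc, char **argv)
    {
        int rank, size, i;
        char buf[MSG_SIZE] = {0};
        double t0, t1;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (size >= 2) {
            MPI_Barrier(MPI_COMM_WORLD);
            t0 = MPI_Wtime();
            for (i = 0; i < NITER; i++) {
                if (rank == 0) {            /* ping */
                    MPI_Send(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                    MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                             MPI_STATUS_IGNORE);
                } else if (rank == 1) {     /* pong */
                    MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                             MPI_STATUS_IGNORE);
                    MPI_Send(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
                }
            }
            t1 = MPI_Wtime();
            if (rank == 0)
                printf("approx. one-way latency: %.2f us\n",
                       (t1 - t0) / (2.0 * NITER) * 1e6);
        }

        MPI_Finalize();
        return 0;
    }

Varying MSG_SIZE from a few bytes to megabytes turns the same loop into a bandwidth measurement, which is how latency and link-bandwidth figures like those on the performance slide are typically characterized.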

  18. More Information
      Computation, Computers, Information and Mathematics Center
      http://www.cs.sandia.gov
