
Reliability, Availability, and Serviceability (RAS) for High-Performance Computing



1. Reliability, Availability, and Serviceability (RAS) for High-Performance Computing
Stephen L. Scott and Christian Engelmann
Computer Science and Mathematics Division, Oak Ridge National Laboratory, Oak Ridge, TN, USA

2. Research and development goals
• Provide high-level RAS capabilities for current terascale and next-generation petascale high-performance computing (HPC) systems
• Eliminate many of the numerous single points of failure and control in today’s HPC systems
• Develop techniques to enable HPC systems to run computational jobs 24x7
• Develop proof-of-concept prototypes and production-type RAS solutions

3. MOLAR: Adaptive runtime support for high-end computing operating and runtime systems
• Addresses the challenges for operating and runtime systems to run large applications efficiently on future ultra-scale high-end computers
• Part of the Forum to Address Scalable Technology for Runtime and Operating Systems (FAST-OS)
• MOLAR is a collaborative research effort (www.fastos.org/molar)

4. Active/standby with shared storage
• Single active head node
• Backup to shared storage
• Simple checkpoint/restart
• Fail-over to standby node
• Possible corruption of backup state when failing during backup
• Introduction of a new single point of failure
• No guarantee of correctness and availability
• Examples: Simple Linux Utility for Resource Management (SLURM), metadata servers of the Parallel Virtual File System (PVFS) and Lustre
[Diagram: Active/Standby Head Nodes with Shared Storage]
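
As a concrete illustration of this scheme, the sketch below shows an active head node checkpointing its state to shared storage and a standby that restores from the last checkpoint once the heartbeat goes stale. It is a minimal sketch only: the paths, the pickled state object, the callbacks, and the timing parameters are assumptions, not part of SLURM, PVFS, or Lustre.

```python
# Minimal sketch of active/standby with shared storage (illustrative only).
# Paths, callbacks, and the pickled "state" object are hypothetical.
import os, pickle, time

SHARED_DIR = "/shared/ha"                      # assumed shared file system mount
CHECKPOINT = os.path.join(SHARED_DIR, "head_node.ckpt")
HEARTBEAT  = os.path.join(SHARED_DIR, "heartbeat")

def active_head_loop(get_state, interval=10):
    """Active head node: periodically checkpoint state and touch a heartbeat."""
    while True:
        tmp = CHECKPOINT + ".tmp"
        with open(tmp, "wb") as f:             # write-then-rename reduces the risk
            pickle.dump(get_state(), f)        # of a corrupt checkpoint if the node
        os.replace(tmp, CHECKPOINT)            # dies in the middle of a backup
        with open(HEARTBEAT, "w") as f:
            f.write(str(time.time()))
        time.sleep(interval)

def standby_head_loop(restore_state, timeout=30, interval=10):
    """Standby: fail over when the active node's heartbeat goes stale."""
    while True:
        try:
            with open(HEARTBEAT) as f:
                stale = time.time() - float(f.read()) > timeout
        except (OSError, ValueError):
            stale = True
        if stale:
            with open(CHECKPOINT, "rb") as f:
                restore_state(pickle.load(f))  # simple restart from the backup
            break                              # this node now becomes active
        time.sleep(interval)
```

Note that the shared storage itself remains a single point of failure, which is exactly the drawback listed on the slide.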

5. Active/standby redundancy
• Single active head node
• Backup to standby node
• Simple checkpoint/restart
• Fail-over to standby node
• Idle standby head node
• Rollback to backup
• Service interruption for fail-over and restore-over
• Examples: HA-OSCAR, Torque on Cray XT
[Diagram: Active/Standby Head Nodes]
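
The difference from the shared-storage variant is that checkpoints go directly to a dedicated standby, which takes over from its last complete backup when the active node stops pushing. A minimal sketch, assuming a hypothetical standby host name, port, and pickled state object; rolling back to the last backup means any work since that checkpoint is lost, which is the service interruption noted above.

```python
# Minimal sketch of active/standby with a dedicated standby node (illustrative).
# Host name, port, callbacks, and the pickled "state" object are assumptions.
import pickle, socket, time

STANDBY_ADDR = ("standby-head", 5000)          # hypothetical standby host/port

def active_send_checkpoints(get_state, interval=10):
    """Active head node: push each checkpoint directly to the standby."""
    while True:
        blob = pickle.dumps(get_state())
        with socket.create_connection(STANDBY_ADDR) as s:
            s.sendall(blob)                    # closing the connection marks
        time.sleep(interval)                   # the end of this checkpoint

def standby_receive_checkpoints(timeout=30):
    """Standby: keep the latest checkpoint; fail over when pushes stop."""
    srv = socket.socket()
    srv.bind(("", STANDBY_ADDR[1]))
    srv.listen()
    srv.settimeout(timeout)                    # no checkpoint within timeout
    last_state = None                          # => assume the active node died
    while True:
        try:
            conn, _ = srv.accept()
            with conn:
                chunks = []
                while (chunk := conn.recv(65536)):
                    chunks.append(chunk)
                last_state = pickle.loads(b"".join(chunks))
        except socket.timeout:
            return last_state                  # fail over: roll back to backup
```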

6. Asymmetric active/active redundancy
• Many active head nodes
• Work load distribution
• Optional fail-over to standby head node(s) (n+1 or n+m)
• No coordination between active head nodes
• Service interruption for fail-over and restore-over
• Loss of state without standby
• Limited use cases, such as high-throughput computing
• Prototype based on HA-OSCAR
[Diagram: Asymmetric Active/Active Head Nodes]
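
A minimal sketch of the dispatch side of this scheme: jobs are spread round-robin over independent head nodes, and a standby is promoted when an active head is found dead. The node names and the submit_to()/is_alive() callbacks are hypothetical placeholders, not HA-OSCAR interfaces.

```python
# Minimal sketch of asymmetric active/active dispatching (illustrative only).
import itertools

ACTIVE_HEADS = ["head1", "head2", "head3"]     # independent active head nodes
STANDBY_HEADS = ["spare1"]                     # optional n+1 / n+m standbys

_rr = itertools.cycle(range(len(ACTIVE_HEADS)))

def submit(job, submit_to, is_alive):
    """Round-robin jobs over active heads; promote a spare for a dead head.

    Because the active heads do not coordinate, state held only on a failed
    head (e.g. its queued jobs) is lost unless a standby takes its place.
    """
    idx = next(_rr)
    if not is_alive(ACTIVE_HEADS[idx]) and STANDBY_HEADS:
        ACTIVE_HEADS[idx] = STANDBY_HEADS.pop(0)   # promote a standby head
    return submit_to(ACTIVE_HEADS[idx], job)
```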

7. Symmetric active/active redundancy
• Many active head nodes
• Work load distribution
• Symmetric replication between head nodes
• Continuous service
• Always up to date
• No fail-over necessary
• No restore-over necessary
• Virtual synchrony model
• Complex algorithms
• JOSHUA prototype for Torque
[Diagram: Active/Active Head Nodes]

8. Symmetric active/active replication
[Diagram: input replication → virtually synchronous processing → output unification]
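
A minimal sketch of the replication pipeline shown on the previous two slides, assuming the total order that a group communication layer (atomic multicast) would provide is already given as an ordered request list. Each replica runs the same deterministic state machine, so all stay up to date, and output unification emits a single reply per request.

```python
# Minimal sketch of symmetric active/active replication (illustrative only).
# A real system builds the total order with atomic multicast; here the
# ordered request list stands in for it.

def run_replicas(requests, num_replicas=3):
    """Apply the same totally ordered input to every replica's deterministic
    state machine, then unify the (identical) outputs into one reply each."""
    states = [dict() for _ in range(num_replicas)]   # one state machine per head node
    unified_outputs = []
    for seq, (key, value) in enumerate(requests):    # total order from the group layer
        outputs = []
        for state in states:                         # virtually synchronous processing
            state[key] = value                       # deterministic update
            outputs.append((seq, key, value))
        assert len(set(outputs)) == 1                # all replicas agree
        unified_outputs.append(outputs[0])           # output unification: emit once
    assert all(s == states[0] for s in states)       # continuous, up-to-date service
    return unified_outputs

# Example: three replicated head nodes processing the same ordered requests
print(run_replicas([("jobA", "queued"), ("jobA", "running"), ("jobB", "queued")]))
```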

9. Symmetric active/active high availability for head and service nodes
• A_component = MTTF / (MTTF + MTTR)
• A_system = 1 - (1 - A_component)^n
• T_down = 8760 hours * (1 - A_system)
• Single node MTTF: 5000 hours
• Single node MTTR: 72 hours
• Single-site redundancy for 7 nines does not mask catastrophic events
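
The following worked example evaluates these formulas for one to four redundant head nodes using the MTTF and MTTR given above; the loop and the Python wrapper are just for illustration.

```python
# Worked example of the availability formulas above, using the slide's
# single-node MTTF of 5000 hours and MTTR of 72 hours.
MTTF, MTTR, HOURS_PER_YEAR = 5000.0, 72.0, 8760.0

a_component = MTTF / (MTTF + MTTR)               # single-node availability

for n in range(1, 5):                            # n redundant head nodes
    a_system = 1 - (1 - a_component) ** n
    t_down = HOURS_PER_YEAR * (1 - a_system)     # expected downtime per year
    print(f"n={n}: availability={a_system:.8f}, downtime={t_down:.4f} h/year")

# Approximate output:
# n=1: availability=0.98580442, downtime=124.3533 h/year
# n=2: availability=0.99979849, downtime=1.7653 h/year
# n=3: availability=0.99999714, downtime=0.0251 h/year
# n=4: availability=0.99999996, downtime=0.0004 h/year   (about seven nines)
```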

10. High-availability framework for HPC
• Pluggable component framework
• Communication drivers
• Group communication
• Virtual synchrony
• Applications
• Interchangeable components
• Adaptation to application needs, such as level of consistency
• Adaptation to system properties, such as network and system scale
[Diagram: layered framework. Applications (scheduler, MPI runtime, file system, SSI); Virtual Synchrony (replicated memory, replicated file, replicated state-machine, replicated database, replicated RPC/RMI, distributed control); Group Communication (membership management, failure detection, reliable multicast, atomic multicast); Communication Driver (singlecast, multicast, failure detection); Network (Ethernet, Myrinet, Elan+, InfiniBand, …)]
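
A minimal sketch of how the layers could compose, assuming hypothetical class and method names (the actual framework API is not shown on the slide): each layer is an interchangeable component built on the one below it, so a service can pick implementations that match its consistency needs and network.

```python
# Minimal sketch of a pluggable HA framework (illustrative; class and
# method names are assumptions, not the real framework interfaces).
from abc import ABC, abstractmethod

class CommunicationDriver(ABC):                 # bottom layer over the network
    @abstractmethod
    def send(self, dest, msg): ...              # singlecast
    @abstractmethod
    def multicast(self, msg): ...               # multicast, failure detection below

class GroupCommunication(ABC):                  # membership, reliable/atomic multicast
    def __init__(self, driver: CommunicationDriver):
        self.driver = driver
    @abstractmethod
    def atomic_multicast(self, msg): ...        # same total order on every node

class VirtualSynchrony(ABC):                    # replicated state machine, memory, RPC, ...
    def __init__(self, group: GroupCommunication):
        self.group = group
    @abstractmethod
    def replicate(self, operation): ...

# An application (scheduler, MPI runtime, file system, SSI) composes the layers,
# swapping components to adapt the level of consistency and the system scale.
```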

11. Scalable, fault-tolerant membership for MPI tasks on HPC systems
• Scalable approach to reconfiguring the communication infrastructure
• Decentralized (peer-to-peer) protocol that maintains a consistent view of active nodes in the presence of faults
• Resilience against multiple node failures, even during reconfiguration
• Response time: hundreds of microseconds over MPI on a 1024-node Blue Gene/L; single-digit milliseconds over TCP on a 64-node Gigabit Ethernet Linux cluster (XTORC)
• Integration with Berkeley Lab Checkpoint/Restart (BLCR) to handle node failures without restarting an entire MPI job
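
A minimal sketch of the local bookkeeping such a protocol needs, assuming hypothetical class and method names: every node tracks peer heartbeats and derives the set of active nodes. The real decentralized protocol, its stabilization after faults, and the BLCR integration are considerably more involved.

```python
# Minimal sketch of decentralized membership tracking (illustrative only).
import time

class MembershipView:
    """Each node tracks peer heartbeats and locally derives the set of
    active nodes; peers exchanging these views converge on membership."""
    def __init__(self, my_rank, all_ranks, timeout=1.0):
        self.my_rank = my_rank
        self.timeout = timeout
        self.last_seen = {r: time.time() for r in all_ranks}

    def on_heartbeat(self, rank):
        self.last_seen[rank] = time.time()       # peer-to-peer heartbeat received

    def active_nodes(self):
        now = time.time()
        return sorted(r for r, t in self.last_seen.items()
                      if r == self.my_rank or now - t <= self.timeout)

# After a view change, the MPI communication infrastructure would be
# reconfigured over the surviving ranks, and checkpoint/restart (BLCR)
# would restore the failed tasks elsewhere instead of restarting the job.
```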

12. Stabilization time over MPI on BG/L
[Plot: time for stabilization (microseconds) vs. number of nodes (4 to 1024, log scale); experimental results compared against the distance model and the base model]

13. Stabilization time over TCP on XTORC
[Plot: time for stabilization (microseconds) vs. number of nodes (3 to 47); experimental results compared against the distance model and the base model]

14. ORNL contacts
Stephen L. Scott
Network and Cluster Computing, Computer Science and Mathematics
(865) 574-3144
Scottsl@ornl.gov

Christian Engelmann
Network and Cluster Computing, Computer Science and Mathematics
(865) 574-3132
Engelmannc@ornl.gov
