Reliability, Availability, and Serviceability (RAS) for High-Performance Computing
Stephen L. Scott and Christian Engelmann
Computer Science and Mathematics Division, Oak Ridge National Laboratory, Oak Ridge, TN, USA
Research and development goals
• Provide high-level RAS capabilities for current terascale and next-generation petascale high-performance computing (HPC) systems
• Eliminate many of the numerous single points of failure and control in today’s HPC systems
• Develop techniques to enable HPC systems to run computational jobs 24x7
• Develop proof-of-concept prototypes and production-type RAS solutions
MOLAR: Adaptive runtime support for high-end computing operating and runtime systems
• Addresses the challenges that operating and runtime systems face in running large applications efficiently on future ultra-scale high-end computers
• Part of the Forum to Address Scalable Technology for Runtime and Operating Systems (FAST-OS)
• MOLAR is a collaborative research effort (www.fastos.org/molar)
Active/standby with shared storage
• Single active head node
• Backup to shared storage
• Simple checkpoint/restart (see the sketch below)
• Fail-over to standby node
• Possible corruption of backup state when failing during backup
• Introduces a new single point of failure (the shared storage)
• No guarantee of correctness and availability
• Examples: Simple Linux Utility for Resource Management (SLURM), metadata servers of the Parallel Virtual File System (PVFS) and Lustre
[Figure: Active/Standby Head Nodes with Shared Storage]
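The checkpoint/restart idea behind this configuration can be sketched in a few lines. The Python below is only an illustration under assumed names (the shared path, the JSON state layout, and both functions are hypothetical, not taken from SLURM, PVFS, or Lustre):

```python
# Minimal sketch, assuming a hypothetical shared mount point and state layout.
# The active head node periodically dumps its service state to shared storage;
# after fail-over, the standby restarts from the last backup.
import json, os

SHARED_CHECKPOINT = "/shared/head-node/state.json"   # hypothetical path

def checkpoint(state: dict) -> None:
    # Naive in-place write: if the head node fails mid-write, the backup on
    # shared storage can be left corrupted (the risk noted in the slide).
    with open(SHARED_CHECKPOINT, "w") as f:
        json.dump(state, f)

def restart() -> dict:
    # The standby takes over by reloading the last backup, losing any work
    # performed after that checkpoint (simple rollback).
    if os.path.exists(SHARED_CHECKPOINT):
        with open(SHARED_CHECKPOINT) as f:
            return json.load(f)
    return {}   # no backup available: cold start
```

Writing to a temporary file and renaming would shrink the corruption window, but the shared storage itself remains a single point of failure, which is exactly the limitation listed above.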
Active/standby redundancy
• Single active head node
• Backup to standby node
• Simple checkpoint/restart
• Fail-over to standby node (see the heartbeat sketch below)
• Idle standby head node
• Rollback to backup
• Service interruption for fail-over and restore-over
• Examples: HA-OSCAR, Torque on Cray XT
[Figure: Active/Standby Head Nodes]
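A minimal sketch of the fail-over pattern, assuming hypothetical helper functions (receive_heartbeat, restore_from_backup, promote_to_active); this is not the HA-OSCAR implementation, only the general active/standby idea:

```python
# Standby head node: watch the active node's heartbeat and take over on loss.
import time

HEARTBEAT_TIMEOUT = 10.0   # seconds without a heartbeat before fail-over

def standby_loop(receive_heartbeat, restore_from_backup, promote_to_active):
    last_seen = time.monotonic()
    while True:
        if receive_heartbeat(timeout=1.0):            # active node still alive
            last_seen = time.monotonic()
        elif time.monotonic() - last_seen > HEARTBEAT_TIMEOUT:
            # Active head node presumed dead: roll back to the last backup
            # and promote the standby. Service is interrupted until this
            # restore-over completes.
            state = restore_from_backup()
            promote_to_active(state)
            return
```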
Asymmetric active/active redundancy
• Many active head nodes
• Work-load distribution (see the dispatcher sketch below)
• Optional fail-over to standby head node(s) (n+1 or n+m)
• No coordination between active head nodes
• Service interruption for fail-over and restore-over
• Loss of state without a standby
• Limited use cases, such as high-throughput computing
• Prototype based on HA-OSCAR
[Figure: Asymmetric Active/Active Head Nodes]
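An illustrative sketch of the asymmetric scheme under assumed names (HeadNode and the round-robin dispatcher are hypothetical): each job is owned by exactly one active head node, and because the heads do not coordinate, a head's in-flight state is lost unless a standby takes over for it.

```python
# Round-robin work-load distribution across independent active head nodes.
from itertools import cycle

class HeadNode:
    def __init__(self, name: str):
        self.name = name
        self.queue = []                  # jobs owned only by this head node

    def submit(self, job: str) -> None:
        self.queue.append(job)

heads = [HeadNode(f"head{i}") for i in range(3)]
dispatcher = cycle(heads)                # no coordination between heads

for job in ["job-a", "job-b", "job-c", "job-d"]:
    next(dispatcher).submit(job)

# If head1 crashes here, its queue disappears with it unless an n+1 or n+m
# standby head node has been provisioned to take over its state.
```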
Symmetric active/active redundancy
• Many active head nodes
• Work-load distribution
• Symmetric replication between head nodes
• Continuous service
• Always up to date
• No fail-over necessary
• No restore-over necessary
• Virtual synchrony model
• Complex algorithms
• JOSHUA prototype for Torque
[Figure: Active/Active Head Nodes]
Symmetric active/active replication
[Figure: Input Replication → Virtually Synchronous Processing → Output Unification]
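The pipeline in the figure can be illustrated with a simplified state-machine replication sketch (this is not the JOSHUA implementation; the class and commands are hypothetical). Every head node applies the same totally ordered input to the same deterministic state machine, so all replicas stay identical and any one of them can answer without fail-over.

```python
# Input replication -> virtually synchronous processing -> output unification.
class HeadNodeReplica:
    def __init__(self):
        self.jobs = []                            # replicated service state

    def apply(self, command: tuple) -> str:
        op, arg = command
        if op == "submit":
            self.jobs.append(arg)
            return f"submitted {arg}"
        if op == "list":
            return ", ".join(self.jobs)
        return "unknown command"

replicas = [HeadNodeReplica() for _ in range(3)]

# Input replication: a total-order (atomic) multicast delivers the same
# command sequence to every replica.
ordered_input = [("submit", "job-a"), ("submit", "job-b"), ("list", None)]

for command in ordered_input:
    outputs = [r.apply(command) for r in replicas]   # synchronous processing
    # Output unification: deterministic replicas produce identical answers,
    # so a single response is returned to the client.
    assert len(set(outputs)) == 1
    print(outputs[0])
```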
Symmetric active/active high availability for head and service nodes
• A_component = MTTF / (MTTF + MTTR)
• A_system = 1 - (1 - A_component)^n
• T_down = 8760 hours × (1 - A)
• Single-node MTTF: 5,000 hours
• Single-node MTTR: 72 hours
Single-site redundancy for 7 nines does not mask catastrophic events
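A quick way to sanity-check these numbers is to evaluate the formulas directly. The Python sketch below is not part of the original slides; it simply plugs the quoted per-node MTTF and MTTR into the availability model for one to four redundant head nodes.

```python
# Availability model: A_component = MTTF / (MTTF + MTTR),
# A_system = 1 - (1 - A_component)^n, T_down = 8760 h * (1 - A_system).
MTTF = 5000.0            # mean time to failure of a single node, in hours
MTTR = 72.0              # mean time to repair, in hours
HOURS_PER_YEAR = 8760.0

a_component = MTTF / (MTTF + MTTR)        # availability of one node

for n in range(1, 5):
    # n redundant nodes: the service is down only if all n are down at once.
    a_system = 1.0 - (1.0 - a_component) ** n
    t_down = HOURS_PER_YEAR * (1.0 - a_system)    # expected annual downtime
    print(f"n={n}: availability={a_system:.9f}, downtime/year={t_down * 60:.2f} min")
```

By this model a single head node is unavailable for roughly five days per year, while four symmetric replicas push availability past seven nines, which is why the closing caveat about catastrophic, site-wide events still matters.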
High-availability framework for HPC
• Pluggable component framework (see the interface sketch below)
• Communication drivers
• Group communication
• Virtual synchrony
• Applications
• Interchangeable components
• Adaptation to application needs, such as level of consistency
• Adaptation to system properties, such as network and system scale
[Figure: layered framework stack]
• Applications: Scheduler, MPI Runtime, File System, SSI
• Virtual Synchrony: Replicated Memory, Replicated File, Replicated State-Machine, Replicated Database, Replicated RPC/RMI, Distributed Control
• Group Communication: Membership Management, Failure Detection, Reliable Multicast, Atomic Multicast
• Communication Driver: Singlecast, Failure Detection, Multicast
• Network: Ethernet, Myrinet, Elan+, InfiniBand, …
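One way to picture the pluggable, interchangeable-component idea is an abstract group-communication interface with swappable drivers. The class and method names below are hypothetical illustrations, not the framework's actual API:

```python
# Hypothetical sketch of a pluggable component interface.
from abc import ABC, abstractmethod

class GroupCommunication(ABC):
    """Group-communication layer: membership plus ordered, reliable multicast."""

    @abstractmethod
    def join(self, group: str) -> None: ...
    @abstractmethod
    def atomic_multicast(self, group: str, message: bytes) -> None: ...
    @abstractmethod
    def on_membership_change(self, callback) -> None: ...

class TcpEthernetDriver(GroupCommunication):
    """One interchangeable driver; an InfiniBand or Myrinet driver would
    implement the same interface, so the upper layers need not change."""
    def join(self, group: str) -> None:
        print(f"joining {group} over TCP/Ethernet")
    def atomic_multicast(self, group: str, message: bytes) -> None:
        print(f"total-order multicast of {len(message)} bytes to {group}")
    def on_membership_change(self, callback) -> None:
        self._callback = callback

# Upper layers (virtual synchrony, replicated state machines, applications)
# are written against GroupCommunication and can swap drivers to match the
# network and the required level of consistency.
gc: GroupCommunication = TcpEthernetDriver()
gc.join("head-nodes")
gc.atomic_multicast("head-nodes", b"qsub job.sh")
```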
Scalable, fault-tolerant membership for MPI tasks on HPC systems
• Scalable approach to reconfiguring the communication infrastructure
• Decentralized (peer-to-peer) protocol that maintains a consistent view of active nodes in the presence of faults (see the sketch below)
• Resilience against multiple node failures, even during reconfiguration
• Response time:
  • Hundreds of microseconds over MPI on a 1024-node Blue Gene/L
  • Single-digit milliseconds over TCP on a 64-node Gigabit Ethernet Linux cluster (XTORC)
• Integration with Berkeley Lab Checkpoint/Restart (BLCR) to handle node failures without restarting the entire MPI job
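The published protocol is considerably more involved; the sketch below only illustrates the basic goal under assumed names: surviving nodes exchange their failure suspicions and converge on the same view of active ranks, from which the communication infrastructure can be rebuilt.

```python
# Greatly simplified membership-view illustration (not the actual protocol).
from dataclasses import dataclass, field

@dataclass
class Node:
    rank: int
    suspected: set = field(default_factory=set)   # ranks this node believes failed

    def hear(self, suspicions: set) -> None:
        """Merge suspicions received from a peer (one gossip step)."""
        self.suspected |= suspicions

    def view(self, world_size: int) -> list:
        """Membership view once all survivors share the same suspicion set."""
        return [r for r in range(world_size) if r not in self.suspected]

# Four-rank example: rank 2 fails; ranks 0 and 3 detect it directly.
world = 4
nodes = {r: Node(r) for r in (0, 1, 3)}           # rank 2 is gone
nodes[0].suspected.add(2)
nodes[3].suspected.add(2)

# One round of all-to-all exchange among the survivors.
for a in nodes.values():
    for b in nodes.values():
        if a is not b:
            a.hear(b.suspected)

assert all(n.view(world) == [0, 1, 3] for n in nodes.values())
print("new membership view:", nodes[0].view(world))
```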
[Chart: Stabilization time over MPI on BG/L — time for stabilization (microseconds, 0–350) vs. number of nodes (log scale, 4–1024); experimental results compared with the distance model and the base model]
[Chart: Stabilization time over TCP on XTORC — time for stabilization (microseconds, 0–2000) vs. number of nodes (3–47); experimental results compared with the distance model and the base model]
ORNL contacts
Stephen L. Scott
Network and Cluster Computing, Computer Science and Mathematics
(865) 574-3144
Scottsl@ornl.gov

Christian Engelmann
Network and Cluster Computing, Computer Science and Mathematics
(865) 574-3132
Engelmannc@ornl.gov