MOLAR: MOdular Linux and Adaptive Runtime support

Project Team
• Oak Ridge National Laboratory: David Bernholdt, Christian Engelmann, Stephen L. Scott, Jeffrey Vetter
• University of New Mexico: Arthur B. Maccabe, Patrick G. Bridges
• North Carolina State University: Frank Mueller
• Ohio State University: Ponnuswamy Sadayappan
• Louisiana Tech University: Chokchai Leangsuksun

Briefing at: Scalable Systems Software meeting, Argonne National Laboratory, August 26, 2004

Research Plan
• Create a modular and configurable Linux system that allows customized changes based on the requirements of the applications, runtime systems, and cluster management software.
• Build runtime systems that leverage the OS modularity and configurability to improve efficiency, reliability, scalability, and ease of use, and to support both legacy and promising programming models.
• Advance computer RAS management systems to work cooperatively with the OS/R to identify and preemptively resolve system issues.
• Explore the use of advanced monitoring and adaptation to improve application performance and predictability of system interruptions.

MOLAR map

[Figure: MOLAR project map. Labels in the figure: MOLAR: Modular Linux and Adaptive Runtime support; HEC Linux OS: modular, custom, light-weight; Kernel design; RAS: reliability, availability, serviceability; Monitoring; Extend/adapt runtime/OS; Root cause analysis; High availability; Message logging; Process state saving; Programming models; Testbeds; Evaluation; Provided. Institutions named in the figure: UNM, ORNL, LLNL, LaTech, NCSU, OSU, Cray.]

RAS for Scientific and Engineering Applications
• High mean time between interrupts (MTBI) for hardware, system software, and storage devices.
• High mean time between errors/failures that affect users.
• Recovery is automatic, without human intervention.
• Minimal work loss due to the recovery process.

Computation – Storage – Network

Case for RAS in HEC
• Today’s systems need to reboot to recover.
• The entire system is often down for any maintenance or repair.
• Compute nodes sit idle if their head (service) node is down.
• Availability and MTBI typically decrease as the system grows.
• The “hidden” costs of failures:
  • researchers’ lost work-in-progress
  • researchers on hold
  • additional system staff
  • checkpoint & restart time
• Why do we accept such significant system outages due to failures, maintenance, or repair?
• With the expected investment in HEC, we simply cannot afford low availability!
• We need to drastically increase the availability of HEC computing resources now!

High-availability in Industry
• Industry has shown for years that 99.999% (five nines) high availability is feasible for computing services.
• Used in corporate web servers, distributed databases, business accounting, and stock exchange services.
• OS-level high availability has not been a priority in the past.
• Implementation involves complex algorithms.
• Development and distribution licensing issues exist.
• Most solutions are proprietary and do not perform well.
• HA-OSCAR is the first freely available open source HA cluster implementation.
• If we don’t step up, build an open source proof-of-concept implementation, and set the standard, no one will.

Availability by the Nines*
• Service is measured by “9s of availability”.
• 90% has one 9, 99% has two 9s, etc.
• Good HA package + substandard hardware = up to 3 nines
• Enterprise-class hardware + stable Linux kernel = 5+ nines

*Highly-Affordable High Availability by Alan Robertson, Linux Magazine, November 2003, http://www.linux-mag.com/2003-11/availability_01.html
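To make the nines concrete, the following minimal sketch converts a number of nines into an availability fraction and the downtime it permits per year. The function names and the 365-day year are assumptions made for illustration; they are not taken from the cited article.

```python
# Illustrative only: relate "nines" of availability to allowed downtime.
# Assumes a 365-day year (leap years ignored).
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def availability(nines: int) -> float:
    """Availability fraction for a given number of nines, e.g. 2 gives 0.99."""
    return 1.0 - 10.0 ** (-nines)


def downtime_minutes_per_year(nines: int) -> float:
    """Maximum downtime per year (in minutes) allowed at this many nines."""
    return MINUTES_PER_YEAR * 10.0 ** (-nines)


if __name__ == "__main__":
    for n in range(1, 6):
        print(f"{n} nine(s): availability {availability(n):.5%}, "
              f"downtime {downtime_minutes_per_year(n):.2f} min/year")
```

Under these assumptions, three nines allow about 525.6 minutes (roughly 8.76 hours) of downtime per year, while five nines allow about 5.26 minutes.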

Federated System Management

High-availability Methods

Active/Hot-Standby:
• Single head node.
• Idle standby head node(s).
• Backup to shared storage.
• Service interruption for the time of the fail-over.
• Rollback to backup.
• Simple checkpoint/restart.
• Service interruption for the time of restore-over.

Active/Active:
• Many active head nodes.
• Work load distribution.
• Symmetric replication between head nodes.
• Continuous service.
• Always up-to-date.
• Complex distributed control algorithms.
• No restore-over necessary.
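The difference in work loss between the two methods can be illustrated with a toy model. This is a sketch under stated assumptions, not the HA-OSCAR or HARNESS implementation: the class names `HotStandbyPair` and `ActiveActiveGroup`, the `backup_every` parameter, and the in-memory lists standing in for head-node state are all hypothetical. It also ignores the ordering and agreement problems that make real active/active control algorithms complex.

```python
class HotStandbyPair:
    """Toy active/hot-standby pair (hypothetical model).

    The active head node backs up its state to shared storage every
    `backup_every` updates; the standby can only see that backup.
    """

    def __init__(self, backup_every: int):
        self.backup_every = backup_every
        self.active_state = []    # updates held by the active head node
        self.shared_storage = []  # most recent backup

    def apply(self, update):
        self.active_state.append(update)
        if len(self.active_state) % self.backup_every == 0:
            self.shared_storage = list(self.active_state)

    def fail_over(self) -> int:
        """Active node fails; standby rolls back to the last backup.

        Returns the number of updates lost by the rollback.
        """
        lost = len(self.active_state) - len(self.shared_storage)
        self.active_state = list(self.shared_storage)
        return lost


class ActiveActiveGroup:
    """Toy active/active group with symmetric replication (hypothetical)."""

    def __init__(self, n_heads: int):
        self.replicas = [[] for _ in range(n_heads)]

    def apply(self, update):
        # Every update is replicated to every active head node.
        for replica in self.replicas:
            replica.append(update)

    def fail(self, index: int) -> int:
        """One head node fails; survivors are already up to date."""
        del self.replicas[index]
        return 0  # no rollback, so no updates are lost


if __name__ == "__main__":
    pair = HotStandbyPair(backup_every=4)
    group = ActiveActiveGroup(n_heads=3)
    for i in range(10):
        pair.apply(i)
        group.apply(i)
    print("hot-standby updates lost:", pair.fail_over())
    print("active/active updates lost:", group.fail(0))
```

In this model the hot-standby pair loses whatever was applied since the last backup, while the active/active group loses nothing because every surviving head node already holds the full state.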

High-availability Technology

Active/Hot-Standby:
• HA-OSCAR with active/hot-standby head node.
• Cluster system software.
• No support for multiple active/active head nodes.
• No middleware support.
• No support for compute nodes.

Active/Active:
• HARNESS with symmetric distributed virtual machine.
• Heterogeneous adaptable distributed middleware.
• No system-level support.

• System-level data replication and distributed control service needed for an active/active head node solution.
• Reconfigurable framework similar to HARNESS needed to adapt to system properties and application needs.

Modular RAS Framework for Terascale Computing
• Highly Available Service Nodes: multiple service nodes, connected to the compute nodes
• Reliable Services: Job Sched., User Mgmt., etc.
• Virtual Synchrony: Distributed Control Service
• Symmetric Replication: Data Replication Service
• Reliable Server Groups: Group Communication Service
• Communication Methods: TCP/IP, Shared Memory, etc.
