Providing Fault-tolerance for Parallel Programs on Grid (FT-MPICH)
Heon Y. Yeom
Distributed Computing Systems Lab., Seoul National University
Contents
1. Motivation
2. Introduction
3. Architecture
4. Conclusion
Motivation
• Hardware performance limitations are steadily overcome as Moore's Law advances.
• These cutting-edge technologies make "tera-scale" clusters feasible!
• However... what about system reliability?
• Distributed systems are still fragile due to unexpected failures.
Motivation: High-performance Network Trend
• MPICH-G2 (Ethernet): good speed (1 Gbps); the common MPICH standard; demands fault-resilience!
• MPICH-GM (Myrinet): high speed (10 Gbps); popular, MPICH-compatible; demands fault-resilience!
• MVAPICH (InfiniBand): high speed (up to 30 Gbps); will be popular, MPICH-compatible; demands fault-resilience!
[Diagram: these three MPICH variants converge on a single multiple fault-tolerant framework.]
Introduction
• Unreliability of distributed systems
  • Even a single local failure can be fatal to a parallel program, since it can render useless all computation performed up to the point of failure.
• Our goal
  • To construct a practical multiple-fault-tolerant framework for the various MPICH variants running on high-performance clusters/Grids.
Introduction
• Why the Message Passing Interface (MPI)?
  • Designing a generic FT framework is extremely hard due to the diversity of hardware and software systems.
  • We chose the MPICH series because:
    • MPI is the most popular programming model in cluster computing.
    • Providing fault-tolerance at the MPI layer is more cost-effective than providing it in the OS or hardware.
Architecture: Concept
[Diagram: the multiple fault-tolerant framework combines failure detection (monitoring), a checkpoint/restart (C/R) protocol, and a consensus & election protocol.]
Architecture: Overall System
[Diagram: the management system talks to each node's communication layer over (Gigabit) Ethernet, while the MPI processes on each node exchange data over a high-speed network such as Myrinet or InfiniBand.]
Architecture: Development History
• 2003: MPICH-GF — fault-tolerant MPICH-G2 (Ethernet)
• 2004: FT-MPICH-GM — fault-tolerant MPICH-GM (Myrinet)
• 2005 (current): FT-MVAPICH — fault-tolerant MVAPICH (InfiniBand)
Management System
[Diagram: the management system makes MPI more reliable by providing failure detection, initialization coordination, output management, checkpoint coordination, checkpoint transfer, and recovery.]
Job Management System 1/2
• Job Management System
  • Manages and monitors multiple MPI processes and their execution environments
  • Should be lightweight
  • Helps the system take consistent checkpoints and recover from failures
  • Has a fault-detection mechanism
• Two main components
  • Central Manager & local Job Managers
Job Management System 2/2
• Central Manager
  • Manages all system functions and states
  • Detects node failures via periodic heartbeats (sketched below), and also detects Job Manager failures
• Job Manager
  • Relays messages between the Central Manager and the MPI processes
  • Detects unexpected MPI process failures
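A minimal sketch of the heartbeat-based detection described above (the timeout value, the `node_t` structure, and the function names are illustrative assumptions, not the framework's actual code):

```c
#include <stdio.h>
#include <time.h>

#define NUM_NODES 4
#define HEARTBEAT_TIMEOUT 5   /* seconds of silence before a node is suspected */

typedef struct {
    int id;
    time_t last_heartbeat;    /* updated whenever a heartbeat arrives */
} node_t;

/* Called by the Central Manager whenever a heartbeat message arrives. */
void on_heartbeat(node_t *node) {
    node->last_heartbeat = time(NULL);
}

/* Periodic sweep: any node silent longer than the timeout is declared failed. */
void detect_failures(node_t nodes[], int n) {
    time_t now = time(NULL);
    for (int i = 0; i < n; i++) {
        if (now - nodes[i].last_heartbeat > HEARTBEAT_TIMEOUT)
            printf("node %d suspected failed (silent for %ld s)\n",
                   nodes[i].id, (long)(now - nodes[i].last_heartbeat));
    }
}

int main(void) {
    node_t nodes[NUM_NODES];
    for (int i = 0; i < NUM_NODES; i++) {
        nodes[i].id = i;
        nodes[i].last_heartbeat = time(NULL);
    }
    on_heartbeat(&nodes[1]);       /* node 1 just reported in */
    nodes[3].last_heartbeat -= 10; /* simulate node 3 going silent */
    detect_failures(nodes, NUM_NODES);
    return 0;
}
```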
Fault-Tolerant MPI 1/3
• To provide MPI fault-tolerance, we adopt:
  • A coordinated checkpointing scheme (vs. an independent scheme)
    • The Central Manager is the coordinator!
  • Application-level checkpointing (vs. kernel-level checkpointing)
    • This method requires no effort on the part of cluster administrators.
  • A user-transparent checkpointing scheme (vs. user-aware)
    • This method requires no modification of MPI source code (see the unmodified example below).
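For illustration, here is the kind of ordinary, unmodified MPI program the user-transparent scheme is meant to protect; the checkpoint/restart machinery is interposed beneath the MPI library, so no FT-specific calls appear in user code:

```c
#include <stdio.h>
#include <mpi.h>

/* An unmodified MPI program: under a user-transparent C/R scheme,
 * checkpointing happens beneath the MPI calls, with no source changes. */
int main(int argc, char **argv) {
    int rank, size, sum = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each rank contributes its rank number; rank 0 collects the total. */
    MPI_Reduce(&rank, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("sum of ranks 0..%d = %d\n", size - 1, sum);

    MPI_Finalize();
    return 0;
}
```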
Fault-Tolerant MPI 2/3
• Coordinated checkpointing
[Diagram: the Central Manager issues a checkpoint command to ranks 0-3; each rank writes checkpoint version 1 (ver 1), and later version 2 (ver 2), to stable storage.]
Fault-Tolerant MPI 3/3
• Recovery from failures
[Diagram: after failure detection, the Central Manager issues a checkpoint/restart command and ranks 0-3 recover from the last consistent checkpoint (ver 1) in stable storage.]
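The two diagrams above can be read as the following simplified application-level analogue (a sketch only: a barrier stands in for the full coordination protocol, which in the real framework also drains in-transit messages, and each rank simply writes its own state as a versioned file):

```c
#include <stdio.h>
#include <mpi.h>

/* Simplified coordinated checkpoint: all ranks synchronize, then each
 * writes a versioned checkpoint file that recovery would read back. */
static void take_checkpoint(int rank, int version, int state) {
    char path[64];
    MPI_Barrier(MPI_COMM_WORLD);               /* coordination point */
    snprintf(path, sizeof(path), "ckpt.rank%d.v%d", rank, version);
    FILE *f = fopen(path, "w");
    if (f) { fprintf(f, "%d\n", state); fclose(f); }
}

int main(int argc, char **argv) {
    int rank, state;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    state = rank * 100;                        /* stand-in for process state */
    take_checkpoint(rank, 1, state);           /* "ver 1" in the diagram */
    state += 1;                                /* ... more computation ... */
    take_checkpoint(rank, 2, state);           /* "ver 2" */
    MPI_Finalize();
    return 0;
}
```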
Management System
• MPICH-GF
  • Based on Globus Toolkit 2
  • Hierarchical management system
    • Suitable for multiple clusters
  • Supports recovery from process/manager/node failures
• Limitations
  • Does not support recovery from multiple simultaneous failures
  • Has a single point of failure (the Central Manager)
Management System
• FT-MPICH-GM (new version)
  • Does not rely on the Globus Toolkit.
  • Removes the hierarchical structure: Myrinet/InfiniBand clusters no longer require it.
  • Supports recovery from multiple failures.
• FT-MVAPICH (more robust)
  • Removes the single point of failure via leader election for the job manager (a minimal sketch follows).
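The leader election mentioned for FT-MVAPICH can be sketched with the simplest possible rule, promoting the lowest-numbered surviving manager (the `alive` table and the lowest-ID rule are illustrative assumptions, not the project's actual election protocol):

```c
#include <stdio.h>

#define NUM_MANAGERS 4

/* Elect the lowest-numbered manager still alive; returns -1 if none remain. */
int elect_leader(const int alive[], int n) {
    for (int i = 0; i < n; i++)
        if (alive[i])
            return i;
    return -1;
}

int main(void) {
    int alive[NUM_MANAGERS] = {0, 1, 1, 1}; /* manager 0 (the old leader) failed */
    printf("new leader: manager %d\n", elect_leader(alive, NUM_MANAGERS));
    return 0;
}
```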
Fault-tolerant MPICH variants
[Diagram: MPICH-GF, FT-MPICH-GM, and FT-MVAPICH share a common stack — collective operations and P2P operations over the ADI (Abstract Device Interface), which binds to Globus2 (Ethernet), GM (Myrinet), or MVAPICH (InfiniBand). An FT module cuts across all three, providing a recovery module, a checkpoint toolkit, connection re-establishment, and atomic message transfer.]
Future Works
• We are working to incorporate our FT protocol into the GT-4 framework.
  • MPICH-GF is GT-2 compliant.
  • Incorporating the fault-tolerant management protocol into GT-4.
• Making MPICH work with different clusters:
  • Gig-E
  • Myrinet (Open-MPI, VMI, etc.)
  • InfiniBand
• Supporting non-Intel CPUs
  • AMD (Opteron)
GRID Issues
• Who should be responsible for:
  • Monitoring whether nodes are up or down?
  • Resubmitting failed processes?
  • Allocating new nodes?
• GRID job management
  • Resource management
  • Scheduler
  • Health monitoring