Towards Highly Available OSG (Open Science Grid)

Towards Highly Available OSG (Open Science Grid) VISHAL RAMPURE Louisiana Tech University

OVERVIEW • INTRODUCTION • HA-OSCAR • OSG services • Proposed work • Fault Tolerance • Data Replication • Recovery • Conclusion

INTRODUCTION • High Availability: refers to a system or component that is continuously operational for a long period of time. • With respect to a grid, high availability refers to the continuous operation of a grid resource. • In other words, the uptime of a service provided by a grid resource is maintained at a high level.
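As a rough illustration of what a "high level" of uptime means, steady-state availability is commonly estimated as MTTF / (MTTF + MTTR). The sketch below is not from the slides; the function names and the sample figures are assumptions chosen only to show how shortening repair time raises availability.

```python
HOURS_PER_YEAR = 24 * 365  # non-leap year, good enough for a rough estimate


def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: fraction of time the service is up."""
    if mttf_hours < 0 or mttr_hours < 0 or mttf_hours + mttr_hours == 0:
        raise ValueError("MTTF and MTTR must be non-negative and not both zero")
    return mttf_hours / (mttf_hours + mttr_hours)


def downtime_hours_per_year(avail: float) -> float:
    """Expected hours of downtime in one year at the given availability."""
    return (1.0 - avail) * HOURS_PER_YEAR


# Hypothetical figures: one failure every 1000 hours, repaired by hand in
# 4 hours, versus an automatic failover restoring service in 0.05 hours.
manual = availability(1000, 4)
automatic = availability(1000, 0.05)
```

With the same failure rate, the only difference between the two cases is the time to restore service, which is the quantity an automatic failover is meant to reduce.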

Continued… • If a system providing services crashes, those services are unavailable until the system is repaired. • HA is mostly provided by redundancy. • A system failure is detected by continuously monitoring the system's health, and the service is restarted on another system automatically, without human intervention.

HA-OSCAR (High Availability Open Source Cluster Application Resources) • HA-OSCAR is an open source project that aims to combine high availability and performance computing in one solution. • The motivation is to enhance the Beowulf cluster system for critical-grade applications. • Component redundancy is adopted to provide high availability on an HA-OSCAR cluster.

Continued… • HA-OSCAR incorporates the following features to eliminate single points of failure in an HA-OSCAR cluster: • Self-healing mechanism • Failure detection • Recovery • Automatic failover • Automatic fail-back

HA-OSCAR FAILOVER STRATEGY

HA-OSCAR VS BEOWULF CLUSTER HA-OSCAR Beowulf

OSG SERVICES • The Open Science Grid (OSG) is a distributed computing infrastructure for large-scale scientific research. • OSG provides many services, some optional and some critical. • The Virtual Data Toolkit (VDT) provides the basic grid infrastructure, with critical services such as GRAM and GridFTP and optional services such as UberFTP and MonALISA.

Some of the services provided by OSG • Virtual Data Toolkit • GT4 services • GRAM • GridFTP • Condor • VOMS • Monitoring and Information Services • Core MIS • GridCat • MonALISA

Continued… • Resource Selection Service • Condor Match Making Service • Storage Services • Disk Resource Manager

Proposed Work • Our aim is to provide an HA-enabled infrastructure that monitors the critical services on a grid resource. • We intend to provide an HA-enabled grid infrastructure with three functionalities: • Fault Tolerance • Data Replication • Recovery

How do we intend to make OSG Fault Tolerant? • FAULT TOLERANCE is the ability of a system to respond gracefully to an unexpected hardware or software failure. • There are many levels of fault tolerance, the lowest being the ability to continue operation in the event of a power failure. • Many fault-tolerant systems mirror all operations; that is, every operation is performed on two or more duplicate systems, so if one fails another can take over.

Continued… • We intend to provide a fault-tolerant OSG system using the HA-OSCAR infrastructure. • There would be two identical systems, one active (primary) and the other passive (standby). The standby system continuously monitors the health of the primary system. • The standby takes over the functionality of the primary as soon as it detects a failure.
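The active/standby arrangement above can be sketched as a heartbeat timeout on the standby side. This is a minimal illustration, not HA-OSCAR's actual mechanism: the class name `StandbyMonitor`, the timeout value, and the use of heartbeats as the health signal are all assumptions, and the clock is injectable so the logic can be exercised without waiting.

```python
import time
from typing import Callable


class StandbyMonitor:
    """Standby-side view of the primary's health, based on heartbeats."""

    def __init__(self, timeout_s: float,
                 clock: Callable[[], float] = time.monotonic):
        self.timeout_s = timeout_s      # assumed silence threshold
        self.clock = clock
        self.last_heartbeat = clock()
        self.active = False             # True once the standby has taken over

    def heartbeat(self) -> None:
        """Record a heartbeat received from the primary."""
        self.last_heartbeat = self.clock()

    def check(self) -> bool:
        """Take over if the primary has been silent longer than the timeout.

        Returns True if the standby is (now) acting as the primary.
        """
        silent_for = self.clock() - self.last_heartbeat
        if not self.active and silent_for > self.timeout_s:
            self.active = True
        return self.active
```

In this sketch the standby stays active once it has taken over, even if heartbeats resume; handing control back to the repaired primary (fail-back) would be a separate, deliberate step.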

How do we intend to provide Data Replication? • Data replication is the copying of data to and from sites to improve local service response times and availability; it is frequently employed as part of a backup and recovery strategy. • When a disaster occurs, the ability to recover and the speed of recovery are critical. • Every time HA-OSCAR is completely re-installed or the kernel is updated, ghost images of the before and after states are saved.

Continued… • HA-OSCAR takes a snapshot of the old and new kernels, gzips it, and sends the image to the secondary head node as well as to a predefined disaster recovery site. • Important OSG data, as well as application and configuration files, can also be included in the ghost image. • The running jobs are checkpointed and replicated on the standby node so that they can be recovered quickly after a failure.
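The snapshot, gzip, and send steps might look roughly like the following. The slides do not say how images are transferred or verified, so everything here is an assumption for illustration: the function names are hypothetical, the SHA-256 checksum is an added safeguard, and plain in-memory dicts stand in for the secondary head node and the disaster recovery site.

```python
import gzip
import hashlib
from typing import Dict, Tuple


def make_image(data: bytes) -> Tuple[bytes, str]:
    """Gzip a snapshot; return (compressed image, SHA-256 of the original)."""
    return gzip.compress(data), hashlib.sha256(data).hexdigest()


def replicate(image: bytes, checksum: str, sites: Dict[str, dict]) -> None:
    """Store the image at every destination (here: in-memory stand-ins)."""
    for store in sites.values():
        store["image"] = image
        store["sha256"] = checksum


def restore_image(image: bytes, expected_sha256: str) -> bytes:
    """Decompress an image and verify it against the recorded checksum."""
    data = gzip.decompress(image)
    if hashlib.sha256(data).hexdigest() != expected_sha256:
        raise ValueError("snapshot checksum mismatch")
    return data
```

Recording a checksum next to each replica lets a recovery site confirm that the image it restores matches the one that was taken.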

How do we intend to provide a Recovery mechanism? • We intend to provide an automatic fault detection and recovery mechanism for an HA-enabled grid (OSG). • Critical grid services such as GridFTP and globus-gatekeeper can be continuously monitored. • The failure of one of these services generates an alert, upon which the standby can take over. • The checkpointed jobs replicated on the standby node would then be restarted.
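The monitoring, alert, and restart sequence above can be sketched as follows. This is an illustration under stated assumptions, not the HA-OSCAR implementation: `ServiceMonitor` and `restart_jobs` are hypothetical names, each probe is any callable that returns True when its service is healthy, and requiring several consecutive failed checks before failing over is an assumed policy to avoid reacting to a single transient error.

```python
from typing import Callable, Dict, List


class ServiceMonitor:
    """Polls probes for critical services; fails over after repeated failures."""

    def __init__(self, probes: Dict[str, Callable[[], bool]],
                 max_failures: int = 3):
        self.probes = probes
        self.max_failures = max_failures          # assumed threshold
        self.failures = {name: 0 for name in probes}
        self.alerts: List[str] = []
        self.failed_over = False

    def poll(self) -> None:
        """Run every probe once; trigger failover if a service keeps failing."""
        for name, probe in self.probes.items():
            if probe():
                self.failures[name] = 0           # healthy: reset the counter
                continue
            self.failures[name] += 1
            if self.failures[name] >= self.max_failures and not self.failed_over:
                self.alerts.append(f"{name} failed {self.failures[name]} checks")
                self.failed_over = True           # standby takes over here


def restart_jobs(checkpoints: Dict[str, int]) -> List[str]:
    """Describe resuming each checkpointed job from its last saved step."""
    return [f"{job}: resume from step {step}"
            for job, step in sorted(checkpoints.items())]
```

In a real deployment the probes would check the GridFTP and globus-gatekeeper services themselves, and `restart_jobs` would hand the checkpoints to the job manager on the standby rather than return strings.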

Critical Service Monitoring & Failover-Failback

CONCLUSION • Providing an HA-enabled grid infrastructure would make an OSG grid resource more reliable. • The HA-OSCAR solution for an OSG site manager provides better availability, self-healing, and fault tolerance. • HA-OSCAR ensures that interruptions to critical grid and cluster services are minimized.
