Institute of Computer Science AGH
A Proposal of Application Failure Detection and Recovery in the Grid
Marian Bubak (1,2), Tomasz Szepieniec (2), Marcin Radecki (2)
(1) Institute of Computer Science, AGH
(2) Academic Computer Centre -- CYFRONET
Outline
• Motivation & introduction
• Services useful in fault recovery approach
• Overview of our proposal
• Problems & workflow approach
• Summary
Motivation
• Reliability of a single component does not rise considerably
• Environment and application size increase steadily
• The risk that an application crashes is higher
• A crash is more expensive for a large application
Fault tolerance becomes an important problem in the Grid
Demands vs. Reality
Demands:
• Minimal overhead
• Automatic, quick recovery
• Scalability
• Transparent
• Porting to any kind of application
Reality:
• Checkpointing is costly
• Often restarting whole application
• Many global operations
• Additional developer's effort is required
• Application-specific methods
Two classes of FT approaches
• Application built-in FT
  • Algorithm/structure profile can be exploited
  • FT activity can be done more efficiently, e.g. checkpointing
  • Naturally fault-tolerant problem class, e.g. genetic algorithms
  • Fault Tolerant MPI
  • but... everything must be done by the developer
• FT realized by external services
  • automatic middleware services
  • no developer effort required
  • but... limited functionality
• It would be beneficial to combine these two
Services useful in FT approach
• Monitoring services
  • For fault detection in hardware and software
  • e.g. check if a process is still running
• Checkpointing, logging, redundancy services
  • For preparing recovery
  • e.g. store the current state of the application
• Recovery services
  • In case of failure
  • e.g. roll back to the last checkpoint
• Scheduler and resource broker
  • For knowledge about the started application
  • For re-scheduling or re-brokering a job or its part
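The checkpointing and recovery services listed above can be illustrated with a minimal sketch. The class name `CheckpointStore` and its methods are hypothetical and not taken from any Grid middleware; a real service would write to stable storage rather than keep checkpoints in memory.

```python
import copy


class CheckpointStore:
    """Hypothetical in-memory stand-in for a checkpointing service."""

    def __init__(self):
        self._checkpoints = []

    def save(self, state):
        # Deep-copy so later changes to the live state do not alter the checkpoint.
        self._checkpoints.append(copy.deepcopy(state))
        return len(self._checkpoints) - 1  # checkpoint id

    def rollback(self):
        # Recovery service role: hand back a copy of the last stored state.
        if not self._checkpoints:
            raise RuntimeError("no checkpoint available")
        return copy.deepcopy(self._checkpoints[-1])


store = CheckpointStore()
state = {"iteration": 10, "partial_sum": 55}
checkpoint_id = store.save(state)

state["iteration"] = 17      # progress made after the checkpoint
state = store.rollback()     # after a failure: return to the last checkpoint
```

Work done between the checkpoint and the failure is lost, which is why checkpoint frequency trades overhead against recovery cost.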
How to make it work together?
• A component that manages these services is needed
  • part of middleware
  • job companion
  • co-ordinates actions of FT services
• The recovery action taken is more appropriate, because:
  • the whole job state is considered
  • the most suitable of the available services can be used
[Diagram labels: Fault Tolerant Manager; Infrastructure Mon. Services; Application Mon. Services; Checkpointing Services; Scheduler Services; Recovery Services]
FT Manager – Architecture
[Diagram labels: Fault Tolerant Manager (Job Supervisor, Decision Maker, Recovery Scenario Executor); Infrastructure Mon. Services; Application Mon. Services; Checkpointing Services; Scheduler Services; Recovery Services; Infrastructure; Application; Application Monitoring; Checkpointing; Recovery]
Job Supervisor (1)
• Main functionality:
  • Monitors job execution
  • Manages (or stores information about) checkpointing
  • When something is wrong, generates a Fault Alarm
• A Fault Alarm contains not only the information about what is wrong, but also the status of the job (e.g. last checkpoint)
• The Job Supervisor can be asked by the Decision Maker to perform more checking
[Diagram labels: Fault Tolerant Manager; Job Supervisor; Fault Alarm; Decision Maker; Recovery Scenario Executor]
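As a rough sketch of the Job Supervisor's role, the fragment below pairs the fault description with the job status in one alarm object. The field names, the `report`/`check` methods and the fixed "process crash" label are assumptions made for illustration; the proposal does not define an interface.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FaultAlarm:
    fault: str                       # what is wrong, e.g. "process crash"
    component: str                   # which part of the job is affected
    last_checkpoint: Optional[int]   # job status: id of the last checkpoint, if any


class JobSupervisor:
    """Hypothetical supervisor: tracks liveness reports and checkpoint ids."""

    def __init__(self):
        self.last_checkpoint = None
        self.alive = {}

    def record_checkpoint(self, checkpoint_id):
        self.last_checkpoint = checkpoint_id

    def report(self, component, is_alive):
        # Stand-in for data delivered by a monitoring service.
        self.alive[component] = is_alive

    def check(self):
        # One Fault Alarm per component reported as not running.
        return [FaultAlarm("process crash", name, self.last_checkpoint)
                for name, ok in sorted(self.alive.items()) if not ok]


supervisor = JobSupervisor()
supervisor.record_checkpoint(3)
supervisor.report("master", True)
supervisor.report("worker-1", False)
alarms = supervisor.check()
```

Carrying the last checkpoint inside the alarm lets the Decision Maker estimate how much work a rollback would discard without querying the supervisor again.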
Job Supervisor (2) – Faults
• Typical examples of faults:
  • process crash
  • node is not responding
  • lost connection (link is down)
• Extended fault characteristics:
  • Occurrence and duration characteristics
  • Severity for the application, e.g. a master fault is more dangerous than a slave fault
• A fault is not only a lost connection, but also a dramatic decrease in performance
  • Sophisticated performance monitoring is required
[Diagram labels: Fault Tolerant Manager; Job Supervisor; Fault Alarm; Decision Maker; Recovery Scenario Executor]
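The extended fault characteristics above could be expressed along these lines. This is a sketch only: the function name, the two severity levels and the 25% slowdown threshold are illustrative assumptions, not values from the proposal.

```python
def classify(role, expected_rate, observed_rate, connected=True,
             slowdown_threshold=0.25):
    """Return (fault, severity), or None when no fault is detected.

    slowdown_threshold is an assumed cut-off: a component running below
    this fraction of its expected rate counts as a fault.
    """
    if not connected:
        fault = "lost connection"
    elif observed_rate < slowdown_threshold * expected_rate:
        fault = "performance degradation"
    else:
        return None
    # A master fault is more dangerous for the application than a slave fault.
    severity = "high" if role == "master" else "low"
    return fault, severity


slow_master = classify("master", expected_rate=100.0, observed_rate=10.0)
healthy_slave = classify("slave", expected_rate=100.0, observed_rate=90.0)
cut_off_slave = classify("slave", expected_rate=100.0, observed_rate=90.0,
                         connected=False)
```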
Decision Maker
• Main functionality:
  • Analyzes the situation when it receives a Fault Alarm
  • Prepares recovery scenarios and sends the best of them for execution
• Issues to be considered:
  • What is possible
  • The cost of each recovery scenario
  • A do-nothing or wait scenario is always possible and sometimes beneficial
    • e.g. in case of a problem with a network link, when the only recovery is to restart the whole application
  • Historical data and probabilistic methods should be used
[Diagram labels: Fault Tolerant Manager; Job Supervisor; Fault Alarm; Decision Maker; Recovery Scenario; Recovery Scenario Executor]
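One way to read "sends the best of them" is as a minimum-expected-cost choice. The sketch below assumes cost is measured in seconds of lost time and that the cost of waiting is estimated as the mean of past outage durations; both the numbers and the estimator are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Scenario:
    name: str
    expected_cost: float   # estimated seconds of lost time (assumed metric)


def choose_scenario(scenarios):
    # Pick the scenario with the lowest expected cost.
    return min(scenarios, key=lambda s: s.expected_cost)


# Hypothetical history of network link outages, in seconds.
past_outages = [30.0, 45.0, 60.0]

scenarios = [
    Scenario("wait", mean(past_outages)),
    Scenario("restart whole application", 3600.0),
]
best = choose_scenario(scenarios)
```

With these figures waiting is far cheaper than restarting, which matches the slide's point that the do-nothing or wait scenario is sometimes beneficial.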
Recovery Scenario Executor
• Main functionality:
  • Executes actions from the scenario
  • Supervises the recovery process
• A Recovery Scenario contains several actions that can be performed by different recovery services
• In case of failure during scenario execution, the Decision Maker is alarmed
[Diagram labels: Fault Tolerant Manager; Job Supervisor; Decision Maker; Recovery Scenario; Recovery Scenario Executor]
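A minimal sketch of the executor follows, assuming a scenario is an ordered list of named actions and that alarming the Decision Maker can be modelled as a callback. All names here are hypothetical.

```python
class RecoveryScenarioExecutor:
    """Hypothetical executor: runs scenario actions in order."""

    def __init__(self, on_failure):
        # Callback standing in for alarming the Decision Maker.
        self.on_failure = on_failure

    def execute(self, scenario):
        # Stop at the first failing action and raise the alarm.
        for name, action in scenario:
            try:
                action()
            except Exception as exc:
                self.on_failure(name, exc)
                return False
        return True


done = []
alarms = []


def rollback():
    done.append("rollback")


def reschedule():
    raise RuntimeError("no free node")


executor = RecoveryScenarioExecutor(
    on_failure=lambda name, exc: alarms.append((name, str(exc))))
ok = executor.execute([("rollback", rollback), ("reschedule", reschedule)])
```

Here the rollback succeeds and the rescheduling fails, so the executor reports the failed step back instead of continuing with a half-applied scenario.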
Problems
• Many classes of services to cooperate with
• Many interfaces
• How to obtain information about the application?
• Which services are available?
• A semantic specification for monitoring and recovery services is needed
Feasibility – Workflows
• A Grid-Services-based approach could help to solve our problems
• Knowledge about the application architecture is accessible
  • Workflow description details are welcome
  • Exchanging a single component is better than restarting the whole application
• Directives for the FT Manager could be included in the job description
• Interfaces are unified
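To make the idea of FT directives in a job description concrete, here is a sketch that uses a plain dictionary as the workflow description. The keys (`ft`, `on_failure`, `checkpoint_interval_s`) and the directive names are invented for illustration and do not correspond to any existing job description language.

```python
# Hypothetical workflow description with per-component FT directives.
job = {
    "components": {
        "preprocess": {
            "depends_on": [],
            "ft": {"on_failure": "restart_component"},
        },
        "solver": {
            "depends_on": ["preprocess"],
            "ft": {"on_failure": "rollback", "checkpoint_interval_s": 600},
        },
        "visualise": {
            "depends_on": ["solver"],  # no FT directive given
        },
    }
}


def recovery_directive(job, component):
    # Without a directive, fall back to restarting the whole application.
    ft = job["components"][component].get("ft", {})
    return ft.get("on_failure", "restart_application")


solver_directive = recovery_directive(job, "solver")
default_directive = recovery_directive(job, "visualise")
```

Because the workflow names each component and its dependencies, the FT Manager can act on a single component instead of the whole application.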
Summary
• Fault tolerance issues become more and more important in the Grid
• A service for fault tolerance management has been proposed
  • ...which enables more sophisticated fault tolerance for the Grid
• A workflow-based framework facilitates the task
• But this is a proposal only... Comments and remarks are welcome!