Runtime Fault-Handling for Job-Flow Management In Grid Environments

Runtime Fault-Handling for Job-Flow Management In Grid Environments Team:Gargi Dasgupta3, Onyeka Ezenwoye2,Liana Fong1, Selim Kalayci2, S. Masoud Sadjadi2, Balaji Viswanathan3, 1: IBM T. J. Watson Research Center, Yorktown Heights, NY 10598, 2: Florida International University, Miami, FL 33199, 3: IBM India Research Labs, New Delhi 1. Overview 2. Two-Layered Architecture • Motivation • Provide flexible job-flow orchestration and submission for scientists • Integrate service-based flow orchestration with job scheduling • Provide fault-tolerance handling that is transparent to the job-flow • Approach • Isolate job-flow orchestration from fault handling using layered design • Top Layer: Job-flow (Service) orchestration; using BPEL for job-flows • Second layer: Job trapping & fault handling; using JSDL for jobs • Introduce fault-handling proxy to implement job recovery policies • Use the Generic Proxy from TRAP/BPEL framework • Fault handling policies can be job specific depending on • Type of job • Type of failure • Level of fault-tolerance specified by user 3. Runtime Fault-Handling 4. Prototypical Architecture • Generic Proxy sits between Job-Flow Manager and Meta-Scheduler, and transparently intercepts the calls between these two layers. It employs recovery policies upon a failed invocation. • Recovery Policies contain rules to detect failures and a sequence of recovery actions to follow on failure detection. These policies are defined in the form of patterns that comprise common reusable recovery actions. • Two sample fault-tolerant patterns: Re-stage Data Pattern Re-submit Job Pattern

Runtime Fault-Handling for Job-Flow Management In Grid Environments

Runtime Fault-Handling for Job-Flow Management In Grid Environments

Presentation Transcript

Page Fault Handling

Runtime Environments Chapter 7 in textbook

Compiler Design 27. Runtime Environments: Activation Records, Heap Management

Runtime Environments

Runtime Environments

Chapter 7: Runtime Environments

Fault Handling

Fault Tolerant Runtime Research @ ANL

Grid Computing Environments

Autonomous FPGA Fault Handling through Competitive Runtime Reconfiguration

Dynamic Creation and Management of Runtime Environments in the Grid

Scalable, Fault-tolerant Management of Grid Services

Grid Programming Environments

BPEL4Job: a Fault-handling Design for Job Flow Management

Storage Accounting for Grid Environments

Runtime Environments

Crash Fault Detection in Celerating Environments

CSc 453 Runtime Environments

Fault-tolerant Scheduling of Fine-grained Tasks in Grid Environments

Runtime Environments

Chapter 7: Runtime Environments

Grid Compute Resources and Job Management