Fault-tolerant Scheduling of Fine-grained Tasks in Grid Environments


Assumptions • Fail-stop model: if a processor fails, it no longer transmits valid messages. • Reliable communication: processor crashes are eventually detected by the communication layer.

2 is a descendant of 1 (Master)

Map of Thief to Set of Tasks • Each victim has a table of stolen tasks. Map<Computer, Set<Task>> thiefTaskSet = new … • When a task is stolen, a copy is put in the Set<Task> associated with that thief (Computer).
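The thief-to-tasks table above can be sketched as follows. This is a minimal illustration, not the authors' code: the `Computer` and `Task` classes are hypothetical stand-ins, the choice of `HashMap`/`HashSet` is an assumption (the slide elides the constructor), and the method names `recordSteal` and `reclaimFrom` are invented for the example.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-ins for the slide's Computer and Task types.
// They rely on identity equality, which is enough for this sketch.
final class Computer {
    final String name;
    Computer(String name) { this.name = name; }
}

final class Task {
    final int id;
    Task(int id) { this.id = id; }
}

// Table kept by a victim: which thief holds copies of which stolen tasks.
class StolenTaskTable {
    // HashMap is an assumption; the slide does not say which Map is used.
    private final Map<Computer, Set<Task>> thiefTaskSet = new HashMap<>();

    // Called when `thief` steals `task`: keep a copy under that thief.
    void recordSteal(Computer thief, Task task) {
        thiefTaskSet.computeIfAbsent(thief, k -> new HashSet<>()).add(task);
    }

    // Called when `thief` is detected as crashed: remove and return its
    // tasks so the caller can put them back in the task queue.
    Set<Task> reclaimFrom(Computer thief) {
        Set<Task> tasks = thiefTaskSet.remove(thief);
        return tasks == null ? new HashSet<>() : tasks;
    }
}
```

Keeping the copies on the victim means a crashed thief's work can be re-enqueued locally, without asking any other server what was lost.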

Global Result Table (GRT) • Each compute server has a GRT replica: Map<taskParameters, Result>. Entries are broadcast to all compute servers. • The Map key & value are potentially large. • It should instead be Map<TaskId, Computer> (more explanation later …), where Computer is the server on which the Result is stored.
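The suggested alternative, a table that maps a task identifier to the server holding the result, might look like the sketch below. It is an assumption-laden illustration: the `long` task id is hypothetical (the slides note that the authors have no processor-independent TaskId), and `GrtBroadcaster` is an in-process stand-in for a real network broadcast.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical stand-in for the slide's Computer type.
final class Computer {
    final String name;
    Computer(String name) { this.name = name; }
}

// One GRT replica, as held by a single compute server.
// It stores only where a result lives, not the result itself.
class GrtReplica {
    private final Map<Long, Computer> resultLocation = new HashMap<>();

    void put(long taskId, Computer holder) {
        resultLocation.put(taskId, holder);
    }

    Optional<Computer> locate(long taskId) {
        return Optional.ofNullable(resultLocation.get(taskId));
    }
}

// In-process stand-in for broadcasting an entry to every replica.
class GrtBroadcaster {
    private final List<GrtReplica> replicas = new ArrayList<>();

    void register(GrtReplica replica) {
        replicas.add(replica);
    }

    void broadcast(long taskId, Computer holder) {
        for (GrtReplica replica : replicas) {
            replica.put(taskId, holder);
        }
    }
}
```

With this layout each broadcast entry is a small id and a server reference, so its size no longer depends on the task parameters or the result.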

Crash recovery method • If ( master crashed ) Elect a new master; • For all ( tasks stolen by a crashed processor ) Put the task in the task queue; • For all ( descendants of tasks stolen from a crashed processor ) If ( descendant is finished ) Then store its result in the Global Result Table; Else abort the task; • If ( old master crashed && I am the master ) Restart the application;
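The recovery steps above can be sketched from the point of view of one surviving server. Everything here is hypothetical scaffolding: the class and field names are invented, servers are identified by plain strings, master election is assumed to have happened elsewhere (its outcome is passed in as a flag), and the GRT is simplified to a map from task id to a result string.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical task record: `result` is null unless `finished` is true.
final class RecoverableTask {
    final int id;
    final boolean finished;
    final String result;
    RecoverableTask(int id, boolean finished, String result) {
        this.id = id;
        this.finished = finished;
        this.result = result;
    }
}

class RecoveringServer {
    final Deque<RecoverableTask> taskQueue = new ArrayDeque<>();
    // thief name -> tasks that thief stole from this server
    final Map<String, Set<RecoverableTask>> stolenBy = new HashMap<>();
    // victim name -> local descendants of tasks this server stole from that victim
    final Map<String, List<RecoverableTask>> descendantsOfStolenFrom = new HashMap<>();
    final Map<Integer, String> globalResultTable = new HashMap<>();
    final List<Integer> abortedTaskIds = new ArrayList<>();
    boolean applicationRestarted = false;

    void recover(String crashed, boolean crashedWasMaster, boolean iAmNewMaster) {
        // Tasks stolen by the crashed processor go back in the task queue.
        Set<RecoverableTask> lost = stolenBy.remove(crashed);
        if (lost != null) {
            taskQueue.addAll(lost);
        }
        // Descendants of tasks stolen from the crashed processor:
        // keep finished results in the GRT, abort the rest.
        List<RecoverableTask> orphans = descendantsOfStolenFrom.remove(crashed);
        if (orphans != null) {
            for (RecoverableTask t : orphans) {
                if (t.finished) {
                    globalResultTable.put(t.id, t.result);
                } else {
                    abortedTaskIds.add(t.id);
                }
            }
        }
        // Only the newly elected master restarts the application.
        if (crashedWasMaster && iAmNewMaster) {
            applicationRestarted = true;
        }
    }
}
```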

Notes • A task is an orphan if its parent task is on a crashed server. • The authors state their contribution as: some descendants of orphaned tasks are not recomputed. • Descendants of orphaned tasks are aborted if they are incomplete at the time they become orphans. • They do not use explicit continuation passing, so there are no composition tasks. Descendant decompositions that were already complete must therefore be recomputed!

Complete decomposition • Tasks 4, 8, & 14 are lost. • In-progress task 21 is lost. • Decompositions 2, 5, 10, & 16 are lost.

Notes • Their GRT key is the task parameters. • The hash code is the sum of the hashes of the parameters. If a parameter is an array, they sum the hash of each element! • The key should be a TaskId, but they do not have a processor-independent TaskId. This is claimed as future work. • They claim that only 1 in 1,000-10,000 tasks is stolen, which is key to the efficiency of their scheme. Their tests crash whole clusters, rather than individual compute servers within a cluster. Why?
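A small sketch of the summed hash described above shows two of its costs. The code is an illustration written for these notes, not the authors' implementation, and the `ParameterHash` class and `sumHash` method are hypothetical names. It handles only `int[]` arrays, for brevity.

```java
import java.util.Objects;

class ParameterHash {
    // Sum the hash of every parameter; an int[] parameter contributes
    // the sum of the hashes of its elements.
    static int sumHash(Object... params) {
        int h = 0;
        for (Object p : params) {
            if (p instanceof int[]) {
                for (int x : (int[]) p) {
                    h += Integer.hashCode(x); // touches every element
                }
            } else {
                h += Objects.hashCode(p);
            }
        }
        return h;
    }
}
```

Because addition is commutative, `sumHash(new int[]{1, 2, 3})` and `sumHash(new int[]{3, 2, 1})` produce the same value, so tasks whose array parameters are permutations of each other collide. Hashing also takes time proportional to the array length on every lookup, which a small TaskId key would avoid.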
