100 likes | 177 Views
Fault-tolerant Scheduling of Fine-grained Tasks in Grid Environments. Assumptions. Fail-stop model If a processor fails, it no longer transmits valid messages. Reliable communication Processor crashes are detected eventually by the communication layer. 2 is a descendant of 1. ( Master ).
E N D
Fault-tolerant Scheduling of Fine-grained Tasks in Grid Environments
Assumptions • Fail-stopmodel If a processor fails, it no longer transmits valid messages. • Reliablecommunication Processor crashes are detected eventually by the communication layer.
2 is a descendant of 1 (Master)
Map of Thief to Set of Tasks • Each victim has a table of stolen tasks. Map<Computer, Set<Task>> thiefTaskSet = new … • When a task is stolen, a copy is put in the Set<Task> associated with that thief (Computer).
Global Result Table (GRT) • Each compute server has a GRT replica: Map<taskParameters, Result> Entries are broadcast to all compute servers. • The Map key & value are potentially large. • It should be (more explanation later …) Map<TaskId, Computer> Where Computer is where the Result is stored.
Crash recovery method • If ( master crashed ) Elect a new master; • For all ( tasks stolen by a crashed processor ) Put task in task queue; • For all ( descendants of tasks stolen from a crashed processor ) If (descendant is finished) Then store it’s result in Global Result Table; Else abort the task; • If ( old master crashed && I am the master ) Restart the application;
Notes • A task is an orphan if its parent task is on a crashed server. • The authors: Our contribution: Some descendants of orphaned tasks are not recomputed. • Descendants of orphaned tasks are aborted, if they are incomplete at the time they become orphans. • They do not use explicit continuation passing: No composition tasks. Descendant decompositions that were complete must be recomputed!
Complete decomposition • tasks 4, 8, & 14 are lost. • In-progress task 21 is lost. • Decompositions • 2, 5, 10, 16 are lost
Notes • Their GRT key is task parameters. • The hash code is sum the hash of the parameters If the parameter is an array, they sum the hash of each element! • It should be TaskId, but they do not have a processor-independent TaskId. This is claimed as future work. • They claim: only 1 in 1000-10,000 tasks is stolen, which is key to the efficiency of their scheme. Their tests crash whole clusters, rather than individual compute servers within a cluster. Why?