Fault Tolerance CSCI 4780/6780
Failures in Distributed Systems • Partial failures (some components fail while others keep working) are characteristic of distributed systems • Goals: • Construct systems that can automatically recover from partial failures • The system should continue to operate in an acceptable way even while failures are present
Basics of Dependable Systems • Availability – Property that the system is operating correctly at any given moment • Reliability – Property that a system can run continuously without failure • Safety – Failures should not lead to catastrophes • Maintainability – How easily a failed system can be repaired
Failures, Errors and Faults • Failure – A system does not meet its promises • Error – Part of a system’s state that may lead to a failure • E.g., damaged packets • Fault – Cause of an error • E.g., bad transmission medium, bad disk • Types of faults • Transient – Occur once and then disappear • Intermittent – Appear, vanish and reappear • Permanent – Persist until the faulty component is repaired
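The difference between a transient and a permanent fault shows up when an operation is repeated: a transient fault goes away, a permanent one does not. The following is a minimal sketch of that idea, not part of the original slides. The names `TransientFault`, `call_with_retry` and `make_flaky` are hypothetical and exist only for illustration.

```python
class TransientFault(Exception):
    """Hypothetical exception standing in for a fault that may vanish on retry."""


def call_with_retry(operation, max_attempts=3):
    """Call operation(), retrying on TransientFault up to max_attempts times.

    Returns (result, attempts_used). If every attempt fails, the last fault
    is re-raised, which is how a permanent fault would look to the caller.
    """
    last_fault = None
    for attempt in range(1, max_attempts + 1):
        try:
            return operation(), attempt
        except TransientFault as fault:
            last_fault = fault
    raise last_fault


def make_flaky(failures_before_success):
    """Build a simulated operation that fails a fixed number of times, then succeeds."""
    state = {"remaining": failures_before_success}

    def operation():
        if state["remaining"] > 0:
            state["remaining"] -= 1
            raise TransientFault("simulated glitch")
        return "ok"

    return operation
```

With `make_flaky(2)` the call succeeds on the third attempt. With `make_flaky(5)` and the default limit of three attempts the fault is re-raised, so the caller has to treat it as permanent for the time being.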
Failure Models • Different types of failures.
Arbitrary Failures • A crash failure is a benign way for a service to halt • Fail-stop failures – Halting can be detected by other processes • The halting server may announce its status • Fail-silent systems – Halting is not announced • Other processes need to detect the failure themselves • Fail-safe – Server produces random output • Other processes can recognize the output as junk and so detect the failure
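When a server halts without announcing it, other processes commonly fall back on timeouts. The sketch below is one possible heartbeat-based detector, not something taken from the slides. The class name `HeartbeatDetector`, its methods, and the choice to pass the current time in explicitly (so the example is deterministic) are all assumptions made for illustration.

```python
class HeartbeatDetector:
    """Suspects a process once no heartbeat has arrived within `timeout` time units."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}  # process id -> time of the last heartbeat received

    def heartbeat(self, pid, now):
        """Record a heartbeat from process `pid` received at time `now`."""
        self.last_seen[pid] = now

    def suspected(self, now):
        """Return the sorted ids of processes that have been silent for too long."""
        return sorted(
            pid
            for pid, seen in self.last_seen.items()
            if now - seen > self.timeout
        )


detector = HeartbeatDetector(timeout=5)
detector.heartbeat("a", now=0)
detector.heartbeat("b", now=0)
detector.heartbeat("a", now=4)      # "a" keeps reporting, "b" goes quiet
print(detector.suspected(now=6))    # "b" has been silent for 6 > 5 time units
```

A detector like this can only suspect a failure: a process that is merely slow, or whose messages are delayed, looks the same as one that has halted.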
Failure Masking by Redundancy • Hiding failures from other processes • Three types of redundancy • Information redundancy – Extra bits are added so that damaged data can be recovered • E.g., Hamming codes • Timing redundancy – An action is performed again if needed to hide a failure • E.g., redoing a transaction • Physical redundancy – Extra equipment or processes are added to hide failures • E.g., extra disks, process pools
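As an illustration of information redundancy, here is a small sketch of the standard Hamming(7,4) code, which adds three parity bits to four data bits so that any single flipped bit can be located and corrected. The function names `hamming74_encode` and `hamming74_decode` are chosen for this example and are not from the slides.

```python
def hamming74_encode(data):
    """Encode 4 data bits as a 7-bit codeword laid out as [p1, p2, d1, p4, d2, d3, d4]."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # parity over codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over codeword positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # parity over codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(code):
    """Correct at most one flipped bit and return (data_bits, error_position).

    error_position is 0 when no error was detected, otherwise the 1-based
    position of the bit that was flipped back.
    """
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4
    if syndrome:
        c[syndrome - 1] ^= 1  # the syndrome is the position of the bad bit
    return [c[2], c[4], c[5], c[6]], syndrome


codeword = hamming74_encode([1, 0, 1, 1])
codeword[4] ^= 1                     # simulate a single bit damaged in transit
print(hamming74_decode(codeword))    # the data bits are recovered; position 5 is reported
```

This code masks only single-bit errors. If two bits are flipped, the decoder "corrects" the wrong position and returns incorrect data, so the amount of redundancy has to match the failures one expects.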