CSE 6510 (461) Fall 2010 Selected Notes on Fault-Tolerance (12)

CSE 6510 (461) Fall 2010 Selected Notes on Fault-Tolerance (12) Alexander A. Shvartsman, Computer Science and Engineering, University of Connecticut

Fault-Tolerance -- An Overview • A fundamental property of distributed systems: • potential for fault tolerance • The main tool in achieving fault tolerance is • redundancy • Distributed systems consist of multiple components: • When more than one resource is capable of performing a certain function, some fault tolerance is achievable • Goal • Take advantage of the multiplicity of resources in constructing systems that tolerate failures

Fault Tolerance and Dependability • A system specification may call for fault-tolerance • By stating that the system must perform correctly • Even if certain internal or external components fail to perform according to their specifications • Additionally, the degradation in performance due to failures must be “graceful” • Dependability is a closely related notion • Trustworthiness of a computer system, i.e., • Reliance can justifiably be placed on the system’s service • Dependability is achieved in part through fault-tolerance

Faults, Errors and Failures • We distinguish among faults, errors and failures: • Fault (or defect): a component or a subsystem fails to perform according to its specification • Error: a computation enters an incorrect state as the result of a fault • Failure: a system fails to meet its specification as the result of an error • Faults may or may not lead to an error • Errors may or may not lead to a failure
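The fault / error / failure chain can be illustrated with a minimal sketch. Everything in it is a hypothetical example rather than part of the notes: a sensor component whose specification is to return a temperature, a faulty instance that returns NaN instead, and a specification for the overall system that says the reported average must be a number.

```python
import math

def read_sensor(faulty):
    """Hypothetical component. Its spec: return a temperature.
    A faulty sensor violates the spec by returning NaN (the fault)."""
    return math.nan if faulty else 20.0

def naive_average(readings):
    """No checking: one bad reading contaminates the sum (the error)."""
    return sum(readings) / len(readings)

def checked_average(readings):
    """Detects and discards bad readings, so the error is contained."""
    valid = [r for r in readings if not math.isnan(r)]
    if not valid:
        raise RuntimeError("no valid readings")
    return sum(valid) / len(valid)

def meets_spec(result):
    """Assumed system spec: the reported average must be a number."""
    return not math.isnan(result)

# One of three sensors is faulty.
readings = [read_sensor(False), read_sensor(True), read_sensor(False)]

# Fault -> error -> failure: the system output violates its spec.
print(meets_spec(naive_average(readings)))

# Same fault, but the error is masked, so there is no failure.
print(meets_spec(checked_average(readings)))
```

The same fault is present in both runs; whether it becomes a failure depends on whether the resulting error is detected before it reaches the system's output.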

Fault-Tolerance -- Basic Approaches • Fault prevention: • eliminating faults • before the system is put into use or • during periodic preventive maintenance • Fault tolerance: • a system detects errors caused by faults, • corrects its state and • does not fail for as long as the faults and errors are within its design parameters • Fault masking: • a fault-tolerant system is capable of dealing with faults and errors • in a way that is transparent to the users of the system’s services

Fault Classification (in order of increasing severity: crash, omission, timing, Byzantine) • Crash fault • Fail-stop processor (detectable crash) • Failure after a send/receive • Omission fault • Communication, send or receive omission • Operation • Timing fault • Processor delays • Link time-out • Byzantine fault • Arbitrary fault • Malicious behavior
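The severity ordering of the four fault classes can be captured as an ordered enumeration. This is a sketch only. The numeric values and the `tolerates` helper are illustrative assumptions, and the helper relies on the common reading of the ordering in which each class is a special case of the next (a crash is an omission of all further steps, an omission is an unbounded delay, and a timing fault is one kind of arbitrary behavior). It says nothing about how many faults of a class a system can tolerate.

```python
from enum import IntEnum

class FaultClass(IntEnum):
    """Fault classes from the slide, numbered by increasing severity.
    The numeric values are arbitrary; only their order matters."""
    CRASH = 1
    OMISSION = 2
    TIMING = 3
    BYZANTINE = 4

def tolerates(designed_for, observed):
    """Assumption: a design that handles a given class also handles
    every less severe class, since those are special cases of it."""
    return observed <= designed_for

# List the classes from least to most severe.
for fault in sorted(FaultClass):
    print(fault.name)
```

For example, `tolerates(FaultClass.BYZANTINE, FaultClass.CRASH)` holds under this assumption, while `tolerates(FaultClass.CRASH, FaultClass.TIMING)` does not.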

Models of Processor Failures and Restarts • Fail-stop processors • Restart models (diagram labels): undetectable restarts, detectable restarts, synchronous restarts, no restarts, initial faults • Model assumptions, e.g., • Shared memory • Robust interconnect • Resilient memory • Timing guarantees

Fault Tolerance, Redundancy and Efficiency • Fault tolerance is achieved through redundancy • Redundancy in components/resources -- space redundancy: • additional components (hardware or software) are provided or made available to deal with errors • distributed systems have inherent redundancy • Redundancy in computation or time redundancy: • additional computation is performed to detect errors or to test components • here the cost is performance
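The two kinds of redundancy can be contrasted with a small sketch. Majority voting over replicas stands in for space redundancy and repeating a computation stands in for time redundancy. These are common illustrations chosen here as assumptions, not techniques named in the notes, and all function names are hypothetical.

```python
from collections import Counter

def majority_vote(replicas, x):
    """Space redundancy: run every replica on x and return the value
    produced by a strict majority of them."""
    results = [replica(x) for replica in replicas]
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise RuntimeError("no majority among replicas")
    return value

def run_twice(compute, x):
    """Time redundancy: repeat the computation and compare the two
    results to detect an error. The cost is doing the work twice."""
    first, second = compute(x), compute(x)
    if first != second:
        raise RuntimeError("results disagree: error detected")
    return first

def square(x):
    return x * x

def faulty_square(x):
    """Hypothetical permanently faulty replica: always off by one."""
    return x * x + 1

def make_transient(compute):
    """Hypothetical transient fault: only the first call is wrong."""
    calls = {"n": 0}
    def wrapped(x):
        calls["n"] += 1
        return compute(x) + 1 if calls["n"] == 1 else compute(x)
    return wrapped

# One faulty replica out of three is outvoted by the two correct ones.
print(majority_vote([square, faulty_square, square], 4))
```

Repetition in time only detects faults whose effect differs between runs, such as the transient fault produced by `make_transient`. A permanently faulty component like `faulty_square` gives the same wrong answer twice and passes the comparison, which is one reason space redundancy is needed as well.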

Combining Fault-Tolerance and Efficiency • A fundamental conflict exists between efficiency and fault tolerance: • Efficiency implies low redundancy • Fault tolerance implies high redundancy • Robustness • Property of a system that combines • Efficiency and • Fault-tolerance, e.g., correctness under failures • Achieving robustness is very challenging in many cases • Efficiency often must be traded off for fault tolerance

Strategies for Fault Tolerance • Layered architecture: • a structuring technique in achieving fault tolerance • A failure of a lower level component may/will manifest itself as a fault to a higher layer • Error at a lower layer may be contained or masked • When this is not possible, the layer attempts • to reduce the severity of the error and • to manifest itself through a more benign failure
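A minimal sketch of how one layer might treat failures of the layer below, assuming a hypothetical lower-layer operation that can fail by raising an exception. The layer first tries to mask the failure by retrying. If that does not work, it fails in a more benign, clearly signalled way instead of passing on a corrupted result. The exception names, the retry count, and the retry strategy itself are illustrative assumptions.

```python
class LowerLayerFailure(Exception):
    """Failure of layer N-1, seen as a fault by layer N."""

class ServiceUnavailable(Exception):
    """Benign, explicit failure that layer N reports to layer N+1."""

def layer_n_call(lower_op, attempts=3):
    """Invoke a lower-layer operation, masking its failures if possible."""
    for _ in range(attempts):
        try:
            return lower_op()
        except LowerLayerFailure:
            continue  # contain the error and try again
    # Masking was not possible: fail cleanly rather than arbitrarily.
    raise ServiceUnavailable(f"lower layer failed {attempts} times")

def make_flaky(failures, value):
    """Hypothetical lower-layer operation: fails `failures` times,
    then succeeds with `value`."""
    remaining = {"n": failures}
    def op():
        if remaining["n"] > 0:
            remaining["n"] -= 1
            raise LowerLayerFailure("message dropped")
        return value
    return op

# Two lower-layer failures are masked; the caller just sees the result.
print(layer_n_call(make_flaky(2, "ok")))
```

With more lower-layer failures than attempts, for example `layer_n_call(make_flaky(5, "ok"))`, the call raises `ServiceUnavailable`, which the next layer up can in turn treat as a fault of its own.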

Layered Architecture for Fault-Tolerance • [Figure: layers N-1, N and N+1; within each layer a fault leads to an error and an error to a failure; a failure of one layer appears as a fault to the layer above]

Phases in Fault Tolerance • Fault prevention and fault tolerance are complementary: • both are needed for dependability • Fault tolerance and its “phases” • Error detection • Tests, checks and diagnostics • Damage confinement • Dynamic assessment of damage boundaries • Static firewalls • Progress evaluation and error recovery • Backward recovery, checkpointing, roll back • Forward recovery and self-stabilization • Processor scheduling and load balancing • Fault treatment and continued system service • Fault location • System repair • Dynamic reconfiguration • Standby spare components
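Backward recovery with checkpointing and rollback can be sketched as follows. This is a minimal in-memory illustration. The class and function names, the validity check, and the account example are all assumptions introduced here. A real system would keep its checkpoints in storage that survives the faults it is meant to tolerate.

```python
import copy

class CheckpointedState:
    """Holds a mutable state together with one saved checkpoint of it."""
    def __init__(self, state):
        self.state = state
        self._saved = copy.deepcopy(state)

    def checkpoint(self):
        """Save a copy of the current state."""
        self._saved = copy.deepcopy(self.state)

    def rollback(self):
        """Discard the current state and restore the saved copy."""
        self.state = copy.deepcopy(self._saved)

def apply_with_recovery(cs, step, is_valid):
    """Checkpoint, run one step, then detect errors and roll back.
    Returns True if the step's effects were kept."""
    cs.checkpoint()
    try:
        step(cs.state)
        ok = is_valid(cs.state)   # error detection
    except Exception:
        ok = False                # an exception also counts as an error
    if not ok:
        cs.rollback()             # backward recovery
    return ok

# Hypothetical example: an account whose balance must not go negative.
account = CheckpointedState({"balance": 100})

def withdraw_150(state):
    state["balance"] -= 150

def non_negative(state):
    return state["balance"] >= 0

kept = apply_with_recovery(account, withdraw_150, non_negative)
print(kept, account.state)
```

Here error detection is the `is_valid` check, damage confinement comes from restricting the step to the checkpointed state, and recovery is the rollback. Forward recovery would instead repair the state and continue without returning to an earlier checkpoint.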

Faults: Causes and Temporal Effects • Faulty system -- a system with defects • Faulty requirements • Design faults • Hardware faults • Software . . . bugs (I don’t know who put it there) • Operational faults • Faults -- temporal taxonomy • Transient fault -- limited duration • Intermittent fault -- occurs repeatedly • Permanent fault -- manifests itself until fixed • Faults and fault masking • Is fault masking “good”? • If a system is capable of tolerating k faults, is masking 1 fault good? Masking k-1 faults? • Are faults “bad”? • Is a system containing faults necessarily defective?

Models of Failure: Overall Considerations • Models need to capture/abstract/approximate reality • Type of failures -- • severity: fail-stop, malicious failures, memory contamination • Kind of failure-causing adversary -- • omniscient or oblivious; on-line adaptive or off-line. • Duration: • no-restart <-> restartable • Frequency of failures -- • rate of processor attrition (one time, arbitrary, probabilistic) • Fine/coarse granularity of failures -- • components: processors / gates, processor / thread failures • Magnitude of failures -- • total number of failures (and recoveries) during computation

Designing for F/T: Evaluation Criteria • What is the cost of failure? Is it bearable? • How much is one willing to pay for fault tolerance? • Is slower response preferable to a failure? • Is higher HW cost acceptable? • Is lower HW cost acceptable as long as failures are masked? • What is the goal of building in some fault tolerance? • Elimination of (some) failures? • Reduction in the severity of failures? • Error detection? • When the failures are corrected, • Is a slower response time acceptable as long as the computation is correct? • Is a slight error acceptable as long as the computation completes within the required time?
