COP 5611 Operating Systems Spring 2010

Dan C. Marinescu
Office: HEC 439 B
Office hours: M-Wd 1:00-2:00 PM

Lecture 13
• Reading Assignment: Chapter 8 from the online textbook
• Homework 3 due on March 3
• Midterm: Wednesday March 17, the first week after Spring Break
• Last time: End-to-end-layer Resource Management - Congestion
• Today:
  • Faults, Failures and Fault-Tolerant Design
  • Measures of Reliability and Failure Tolerance
  • Tolerating active Faults
• Next time

Reliable Systems from Unreliable Components
• Problem first investigated in the mid-1940s by John von Neumann.
• Steps to build reliable systems:
  • Error detection
    • Network protocols (link and end-to-end)
  • Error containment – limit the effect of errors
    • Enforced modularity: client-server architectures, virtual memory, etc.
  • Error masking – ensure correct operation in the presence of errors
    • Network protocols: error correction, repetition, interpolation for data with real-time constraints

Faults and errors
• Fault – a flaw with the potential to cause problems
  • Software
  • Hardware
  • Design
  • Implementation
  • Operation
  • Environment
• Types of faults
  • Latent
  • Active
• Error → the consequence of an active fault.

Error containment in a layered system
• Several design strategies are possible. The layer where an error occurs:
  • Masks the error → corrects it internally so that the higher layer is not aware of it.
  • Detects the error and reports it to the higher layer → fail-fast.
  • Stops → fail-stop.
  • Does nothing.
• Types of faults
  • Transient (caused by a passing external condition) / Persistent
  • Soft / Hard → can be masked or not by a retry.
  • Intermittent → occurs only occasionally and is not reproducible
• Latency of a fault – time until a fault causes an error
  • A long latency may allow errors to accumulate and defeat periodic error correction
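The four strategies listed for the layer where an error occurs can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `Policy`, `LowerLayer`, `DetectedError` and `LayerStopped` are hypothetical, the error detector is a toy one-byte checksum, masking is modeled by substituting a known-good backup copy, and fail-stop is modeled by an exception that signals the layer no longer responds.

```python
from enum import Enum


class Policy(Enum):
    MASK = "mask"            # correct internally, higher layer unaware
    FAIL_FAST = "fail-fast"  # detect and report to the higher layer
    FAIL_STOP = "fail-stop"  # detect and stop
    IGNORE = "ignore"        # do nothing


class DetectedError(Exception):
    """Error reported to the higher layer by a fail-fast layer."""


class LayerStopped(Exception):
    """Models a fail-stop layer that no longer responds (an assumption)."""


class LowerLayer:
    def __init__(self, policy, backup):
        self.policy = policy
        self.backup = backup   # known-good copy, used only when masking
        self.stopped = False

    def deliver(self, value, checksum):
        """Pass `value` (bytes) upward, applying the configured policy."""
        if self.stopped:
            raise LayerStopped("layer stopped after an earlier error")
        # Toy detector: sum of the bytes modulo 256 must match the checksum.
        if self.policy is Policy.IGNORE or sum(value) % 256 == checksum:
            return value
        if self.policy is Policy.MASK:
            return self.backup             # higher layer never sees the error
        if self.policy is Policy.FAIL_STOP:
            self.stopped = True            # refuse all further requests
            raise LayerStopped("checksum mismatch, layer stopped")
        raise DetectedError("checksum mismatch")
```

With `Policy.IGNORE` the corrupted value reaches the higher layer unchanged, which is the "does nothing" case; the other three policies differ only in what the higher layer observes after the checksum fails.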

The fault-tolerance design process is iterative
• Begin the design of a fault-tolerant model
  • Identify potential faults
  • Estimate the risk of each one
  • Design methods to detect the errors for the highest-risk faults
• Design methods to deal with the errors for the highest-risk faults
  • Contain the damage from high-risk errors through modularity
  • Design procedures to mask the detected errors through:
    • Temporal redundancy (retry the operation)
    • Spatial redundancy (deploy multiple components)
• Update the model to account for the error masking procedures
• Iterate until the probability of un-tolerated faults is small
• Observe the system in the real world
  • Study the error logs
  • Identify the cause of each error
• Use the information collected to improve the model and iterate again
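The two kinds of redundancy named in this process can be illustrated with a minimal Python sketch. The helper names `retry`, `majority_vote` and `make_flaky` are hypothetical and chosen only for this example; the sketch assumes that a failed operation raises an exception and that replica results are hashable values that can be compared for equality.

```python
from collections import Counter


def retry(operation, attempts=3):
    """Temporal redundancy: repeat an operation that may fail transiently."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return operation()
        except Exception as exc:      # the fault may be soft, so try again
            last_error = exc
    raise last_error                  # persistent fault: report it upward


def majority_vote(replicas, argument):
    """Spatial redundancy: ask every replica, return the majority answer."""
    results = [replica(argument) for replica in replicas]
    value, count = Counter(results).most_common(1)[0]
    if count <= len(results) // 2:
        raise RuntimeError("no majority among replicas")
    return value


def make_flaky(failures, result):
    """Build a test operation that fails `failures` times, then succeeds."""
    state = {"left": failures}

    def operation():
        if state["left"] > 0:
            state["left"] -= 1
            raise IOError("transient fault")
        return result

    return operation
```

For instance, `retry(make_flaky(2, "ok"))` succeeds on the third attempt, and `majority_vote` with three replicas tolerates one replica that returns a wrong value.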

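Measures of reliability
• TTF – time to failure
• MTTF – mean time to failure: $\mathrm{MTTF} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{TTF}_i$
• TTR – time to repair
• MTTR – mean time to repair: $\mathrm{MTTR} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{TTR}_i$
• MTBF – mean time between failures: $\mathrm{MTBF} = \mathrm{MTTF} + \mathrm{MTTR}$
• Availability $= \mathrm{MTTF} / \mathrm{MTBF}$
• Down time $= 1 - \text{Availability} = \mathrm{MTTR} / \mathrm{MTBF}$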
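As a worked example of these measures, the sketch below computes them from observed failure and repair intervals. The function name `reliability_measures` and the sample numbers are made up for illustration; all times are assumed to be in the same unit (hours here).

```python
def reliability_measures(ttf, ttr):
    """Compute MTTF, MTTR, MTBF, availability and down time.

    ttf: observed times to failure; ttr: observed times to repair.
    """
    if not ttf or not ttr:
        raise ValueError("need at least one TTF and one TTR sample")
    mttf = sum(ttf) / len(ttf)        # mean time to failure
    mttr = sum(ttr) / len(ttr)        # mean time to repair
    mtbf = mttf + mttr                # mean time between failures
    return {
        "MTTF": mttf,
        "MTTR": mttr,
        "MTBF": mtbf,
        "availability": mttf / mtbf,
        "down_time": mttr / mtbf,     # equals 1 - availability
    }


# Hypothetical log: three failures, each followed by a repair.
measures = reliability_measures(ttf=[900, 1100, 1000], ttr=[8, 12, 10])
```

With these sample values MTTF is 1000 hours, MTTR is 10 hours and MTBF is 1010 hours, so the availability is 1000/1010, roughly 0.990, and the down time fraction is 10/1010, roughly 0.010.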

The conditional failure rate

Reliability functions
• Unconditional failure rate: f(t) = Pr(module fails between t and t + dt)
• Reliability: R(t) = Pr(module functions at time t, given that it was functioning at time 0). This function is memoryless.
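The memoryless remark can be checked numerically under one common assumption that the slide does not state explicitly: the conditional failure rate is constant and equal to 1/MTTF, which gives R(t) = exp(-t/MTTF). The function names `reliability` and `survival_given_age` are hypothetical, and the sketch says nothing about modules whose failure rate changes with age.

```python
import math


def reliability(t, mttf):
    """R(t) = exp(-t / MTTF), assuming a constant failure rate of 1/MTTF."""
    return math.exp(-t / mttf)


def survival_given_age(age, t, mttf):
    """Pr(module still works at age + t, given that it works at age)."""
    return reliability(age + t, mttf) / reliability(age, mttf)


# Memoryless: a module that has already run 500 hours is as likely to
# survive the next 100 hours as a brand-new module.
old = survival_given_age(age=500, t=100, mttf=1000)
new = reliability(100, mttf=1000)
```

Under this assumption `old` and `new` agree up to floating-point rounding, because exp(-(s + t)/MTTF) / exp(-s/MTTF) = exp(-t/MTTF) for any age s.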
