This paper delves into the core concepts of reliability, availability, and probability in fault-tolerant computing systems. It covers basic probability theory, exponential distribution modeling, reliability metrics, failure rates, and the relationship between MTTF and MTTR. The discussion also includes the Bathtub Curve phenomenon and various measures of system availability.
Fault-Tolerant Computing Systems #4: Reliability and Availability • Pattara Leelaprute • Computer Engineering Department, Kasetsart University • pattara.l@ku.ac.th
Reliability and Availability • Reliability • The probability that a system survives until time t (it has not failed by time t) • Availability • The probability that a system works properly at time t
Preliminaries of Probability • Discrete sample space: • Tossing a coin • Sample space {heads, tails} • Continuous sample space: • How long a PC stays up after a reboot • Sample space {t | t > 0} • Random variable • A function mapping each element of the sample space to a real number • Ex. heads = 1, tails = 0
Preliminaries • CDF (cumulative distribution function) • F_X(t) = Pr[X ≤ t] • When X is the time to failure, this is the probability that the system has gone down by time t • pdf (probability density function) • f(t) = dF(t) / dt • Expected value (mean) • E[X] = ∫_0^∞ t f(t) dt (X ≥ 0) • The average outcome of the random experiment, also called the mean of the random variable
Exponential Distribution • The most commonly used distribution function in reliability modeling • CDF • F(t) = 1 - e^(-λt) • pdf • f(t) = λe^(-λt) • Mean • 1/λ • Memoryless property • Let Y = X - t be the remaining life of a component that has survived until time t • G_t(y) = Pr[Y ≤ y | X > t] = 1 - e^(-λy) • The distribution of the remaining life of a component does not depend on how long it has been working • The component does not AGE! • [Figure: plots of f(t) = 2e^(-2t) and F(t) = 1 - e^(-2t), i.e. λ = 2]
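The memoryless property can be checked numerically straight from the definition of conditional probability. The following is a minimal sketch, not part of the original slides; the function names `exp_cdf` and `remaining_life_cdf` are illustrative, and the rate 2 is taken from the plotted example.

```python
import math


def exp_cdf(t: float, lam: float) -> float:
    """CDF of the exponential distribution: F(t) = 1 - e^(-lam * t) for t >= 0."""
    return 1.0 - math.exp(-lam * t) if t >= 0 else 0.0


def remaining_life_cdf(y: float, t: float, lam: float) -> float:
    """G_t(y) = Pr[X - t <= y | X > t], computed as Pr[t < X <= t + y] / Pr[X > t]."""
    survived = 1.0 - exp_cdf(t, lam)                           # Pr[X > t]
    failed_in_window = exp_cdf(t + y, lam) - exp_cdf(t, lam)   # Pr[t < X <= t + y]
    return failed_in_window / survived


lam = 2.0  # assumed rate, matching the plotted example
for age in (0.0, 0.5, 1.0):
    # The conditional CDF should not depend on the age already reached.
    print(age, remaining_life_cdf(0.3, age, lam), exp_cdf(0.3, lam))
```

For every age the conditional value agrees with the unconditional F(0.3) up to rounding, which is what "the component does not age" means.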
Reliability • Reliability • The probability that a system survives until time t • R(t) = Pr[X > t] = 1 - F(t) • X: the random variable representing the time to failure of the system (the life of the system) • R(t): the probability that the system survives until time t • [Figure: exponential example with F(t) = 1 - e^(-2t) and R(t) = e^(-2t); timeline from time 0 to time t with the lifetime X]
Reliability • Reliability • R(t) = Pr[X > t] = 1 - F(t) • R(0) = 1: the system is initially working • R(∞) = 0: no system has an infinite lifetime • [Figure: exponential example with F(t) = 1 - e^(-2t) and reliability R(t) = e^(-2t)]
Failure Rate • f(t)Δt • Probability that a fault occurs in the interval [t, t+Δt] • f(t)Δt / R(t) • Probability that a fault occurs in [t, t+Δt], given that the system is working properly at time t • Failure rate = f(t) / R(t) • [Figure: exponential example with f(t) = 2e^(-2t), R(t) = e^(-2t), and F(t) = 1 - e^(-2t), with the interval [t, t+Δt] marked; here f(t) / R(t) = 2 for all t]
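As a quick check of the failure-rate definition, the ratio f(t) / R(t) can be evaluated at several times for the exponential case. This is a small sketch with illustrative function names, assuming the rate 2 used in the slide's example.

```python
import math


def pdf(t: float, lam: float) -> float:
    """Exponential density f(t) = lam * e^(-lam * t)."""
    return lam * math.exp(-lam * t)


def reliability(t: float, lam: float) -> float:
    """Exponential reliability R(t) = 1 - F(t) = e^(-lam * t)."""
    return math.exp(-lam * t)


def failure_rate(t: float, lam: float) -> float:
    """Failure rate f(t) / R(t): failure density given survival until t."""
    return pdf(t, lam) / reliability(t, lam)


lam = 2.0  # assumed rate, matching the slide's example
for t in (0.0, 0.5, 1.0, 3.0):
    print(t, failure_rate(t, lam))
```

The ratio stays at the rate parameter for every t, so an exponentially distributed lifetime has a constant failure rate.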
Bathtub Curve • Failure rate • f(t) / R(t) • Bathtub curve • The general failure-rate pattern observed in empirical data collected from mechanical and electronic components • When the lifetime distribution F(t) of a system is exponential, the failure rate is constant (see previous slide) • 1. Initial stage: • Inherent defects • Faulty design • 2. Constant failure rate • 3. Last stage: • Faults caused by age
MTTF (Mean Time To Failure) • MTTF • E[X] = ∫_0^∞ t f(t) dt = ∫_0^∞ R(t) dt • X: the random variable representing the time until a fault occurs in the system; MTTF is its expected value • When R(t) = e^(-λt) (X is exponentially distributed) • Failure rate = λ • MTTF = 1/λ • [Figure: timeline starting at time 0 with the expected value of X marked]
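The identity MTTF = ∫ R(t) dt can be checked numerically for the exponential case. The sketch below uses a plain trapezoidal rule; the helper name `mttf_numeric`, the cut-off `upper=20.0`, and the rate 2 are assumptions chosen for illustration, with the cut-off large enough that the neglected tail is negligible.

```python
import math
from typing import Callable


def mttf_numeric(reliability: Callable[[float], float],
                 upper: float, steps: int = 100_000) -> float:
    """Approximate the integral of R(t) over [0, upper] with the trapezoidal rule."""
    h = upper / steps
    total = 0.5 * (reliability(0.0) + reliability(upper))
    for k in range(1, steps):
        total += reliability(k * h)
    return total * h


lam = 2.0  # assumed failure rate
estimate = mttf_numeric(lambda t: math.exp(-lam * t), upper=20.0)
print(estimate, 1.0 / lam)  # numerical integral vs. the closed form 1 / lam
```

The same helper accepts any reliability function, so it can also be used when no closed form for the mean is available.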
Availability • The probability that a system works properly at time t • Availability is a measure frequently used to describe the behavior of a system • If the system has no repair or replacement, availability equals reliability R(t) • R(t): the probability that no failure has occurred during the whole period (0, t) • [Figure: timeline alternating between operational periods X_i, X_{i+1}, X_{i+2} and under-repair periods U_i, U_{i+1}; each failure starts a repair period and each repair starts a new operational period]
Availability • Instantaneous availability • A(t) = Pr[the component is functioning correctly at time t] • Steady-state availability (the usual meaning of "availability") • A = lim_{t→∞} A(t)
Availability • When X_i and U_i are exponentially distributed • F_{X_i}(t) = 1 - e^(-λt), F_{U_i}(t) = 1 - e^(-μt) • Instantaneous availability: A(t) = (μ + λe^(-(λ+μ)t)) / (μ + λ) • Steady-state availability: A = lim_{t→∞} A(t) = μ / (μ + λ)
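To see how the instantaneous availability settles to its steady-state value, the two formulas on this slide can be evaluated side by side. This is a sketch only; `lam` and `mu` stand for the failure-rate and repair-rate parameters of the two exponential distributions, and the numeric values are made up for illustration.

```python
import math


def instantaneous_availability(t: float, lam: float, mu: float) -> float:
    """A(t) = (mu + lam * e^(-(lam + mu) * t)) / (mu + lam)."""
    return (mu + lam * math.exp(-(lam + mu) * t)) / (mu + lam)


def steady_state_availability(lam: float, mu: float) -> float:
    """A = lim A(t) as t grows = mu / (mu + lam)."""
    return mu / (mu + lam)


lam, mu = 0.01, 0.5  # hypothetical rates: rare failures, fast repairs
for t in (0.0, 1.0, 5.0, 50.0):
    print(t, instantaneous_availability(t, lam, mu))
print("steady state:", steady_state_availability(lam, mu))
```

A(0) is 1 because the system starts in the working state, and A(t) decreases toward the steady-state value as the exponential term dies out.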
MTTR (Mean Time To Repair) • MTTR (mean time to repair) • MTTR = E[U_i] • U_i: the random variable representing the downtime for the i-th repair or replacement • MTTF (mean time to failure) • MTTF = E[X_i] • X_i: the random variable representing the duration of the i-th functioning period • Steady-state availability • A = MTTF / (MTTF + MTTR) = μ / (μ + λ), where X_i and U_i are exponentially distributed with parameters λ and μ, so MTTF = 1/λ and MTTR = 1/μ
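The relation A = MTTF / (MTTF + MTTR) can be illustrated by simulating the alternating operational and repair periods and measuring the fraction of time spent operational. The sketch below is an assumption-laden illustration, not the slides' method: the function name, the MTTF and MTTR values, the cycle count, and the seed are all hypothetical.

```python
import random


def simulate_availability(mttf: float, mttr: float,
                          cycles: int, seed: int = 0) -> float:
    """Fraction of total time spent operational over `cycles` fail/repair cycles.

    Up times X_i and down times U_i are drawn from exponential
    distributions with means `mttf` and `mttr` respectively.
    """
    rng = random.Random(seed)
    up_time = 0.0
    down_time = 0.0
    for _ in range(cycles):
        up_time += rng.expovariate(1.0 / mttf)    # X_i, rate = 1 / MTTF
        down_time += rng.expovariate(1.0 / mttr)  # U_i, rate = 1 / MTTR
    return up_time / (up_time + down_time)


mttf, mttr = 1000.0, 10.0  # hypothetical values, e.g. in hours
simulated = simulate_availability(mttf, mttr, cycles=100_000)
analytic = mttf / (mttf + mttr)
print(simulated, analytic)
```

With enough cycles the simulated fraction lands close to the analytic value, and shrinking MTTR raises availability just as effectively as growing MTTF by the same factor.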