CprE 545: FAULT-TOLERANT SYSTEMS

CprE 545: FAULT-TOLERANT SYSTEMS Lecture 1: What is fault tolerance all about?

Course information • Webpage: http://www.ee.iastate.edu/~gmani/cpre545 • Book • Fault Tolerance in Distributed Systems, Pankaj Jalote, Prentice Hall • Grading • Homework: 25% • Mid-term: 25% • Term-project: 25% • Final exam: 25% • Project • Study/Implementation design • Written report of about 25 pages • Need creative component • No cheating allowed.

Course outline • Dependability concepts • Dependable system, techniques for achieving dependability, dependability measures, fault, error, failure, and classification of faults and failures. • Fault-tolerant strategies • Fault detection, masking, containment, location, reconfiguration, and recovery. • Fault tolerant design techniques • Hardware redundancy, software redundancy, time redundancy, and information redundancy. • Fault tolerance in real-time systems • Time-space tradeoff, imprecise computation, (m,k)-firm deadline model, fault tolerant scheduling algorithms.

Course outline (contd..) • Dependable communication • Dependable channels, survivable networks, fault-tolerant routing. • Fault tolerance in distributed systems • Byzantine General problem, consensus protocols, checkpointing and recovery, stable storage and RAID architectures, and data replication and resiliency. • Fault-tolerant interconnection networks • Hypercube, star graphs, and fault tolerant ATM switches. • Dependability evaluation techniques and tools • Fault trees, Markov chains; HIMAP tool. • Reading of some of the state-of-the-art research material.

Motivation • Systems are implemented using COTS parts • Components may fail due to various reasons • Hostile Environment • Operating conditions out of specification range • Aging • Poor design • Being able to tolerate an individual failure may save the day

Why fault-tolerance? • 10,000 units of a component are used in a system • Failure rate of components: 0.5%/1000 hours • Total Failure rate: (0.5*10000)/(100*1000) = 0.05/hour • Approximate Unreliability = lt • Desired reliability = 0.99 • Time duration, t = (1-0.99)/l = 0.01/0.05 = 1/5 hours • System goes below desired level after 12 minutes

Fault-tolerant computing concepts • Fault-tolerant computing • Correct execution of specified algorithm in the presence of defects • Fault tolerance is achieved using redundancy • Physical (space) or temporal (time) • Two important criteria • Availability • Reliability • Three dimensional reliability framework • Physical  circuit/RTL/Logic switch/Processor/Memory/Links • Time  specification/design/prototyping/manufacturing/ installation/ operation • Cost  Acquisition/Repair

System and fault tolerance • Function: What the system in intended for • Behavior: what it does • Structure: What makes it do what it does • System may be layered • In layered system, each layer behaves as a component at the next layer • Function and service may be plural System User Component

Ownership cost and fault tolerance • Two Systems • To be used for 4 years • System A • Acquiring: $2000 • Maintenance: @250/year • Total cost: $3000 • System B • Acquiring: $1000 • Maintenance: @500/year • Total cost: $3000 • But down time frustration can be avoided Total Cost Cost Cost of Acquiring Cost of Maintenance Reliability

Some definitions • Fault • Physical Change  Physical World • Error • Result of a Fault  Information World • Failure • Deviation from intended function  External Effect • Three World Model Physical Informational External

Why fault tolerance is more important? • System speed is higher • More reason for Timing Faults • Harsher environments • Systems are employed in all kind of applications • Novice users • Inadvertent user abuse • Higher cost for repair • Manpower and down times are expensive • Larger systems • Use more components, more chances of failure

Issues • At what level to introduce redundancy? • Duplicate or triplicate? • How to manage redundancy? • Automatic or user assisted fault tolerance? • Fault tolerance or reliability and relationship? • Information redundancy? • How to evaluate?

CprE 545: FAULT-TOLERANT SYSTEMS