Learning Objectives: why computers fail; how to design computers that are fault tolerant; how to evaluate reliability; error-correcting codes; diagnosis.
1. Fault Tolerant Computer Design (COMS30125)
3. 3 Why FT? What is Fault-Tolerance?
A “fault-tolerant system” is one that continues to perform at a desired level of service in spite of failures in some of the components that constitute the system.
Be sure everyone has a conduct sheet.
Re GROUND RULES:
1. In terms of allowed collaboration vs. individual work, ask if you are not sure.
2. Deactivate all cell phones or pagers during class unless you are on-call during your job.
3. No tape-recording permitted. Take notes.
4. 4 Why FT? Key attributes
Fault - Error - Failure
Performance - Availability - Reliability
More recently, the concept of “survivability”
Including these constraints at the design stage is likely to be more cost-effective.
5. 5 Why FT? Who is concerned about fault-tolerance?
System Users – irrespective of the application but some are a lot more concerned than others
Who is concerned at design stages?
Universities
R, d, and a (Research, development, applications)
Industry
r, D, and A (research, Development, Applications)
Issues
Design, Analysis/Validation, Implementation, Testing/Validation, Evaluation
6. 6 Why FT? Examples
General Purpose Systems
PCs: RAMs with parity checks and possibly ECC
(consideration of re-execution on failure detection is being investigated)
Workstations/Servers: error detection (HW), occasional corrective action (SW), even ECC (HW), keeping logs (SW)
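The parity check mentioned for PC memory can be illustrated with a minimal sketch. It assumes a single even-parity bit per stored word; the function names `parity_bit`, `store`, and `check` are hypothetical and do not come from the slides. Parity only detects an odd number of flipped bits, which is why ECC is listed as the stronger option.

```python
def parity_bit(word):
    """Even-parity bit: 1 if the word has an odd number of 1 bits, else 0."""
    return bin(word).count("1") % 2


def store(word):
    """Pretend to write a word to memory together with its parity bit."""
    return word, parity_bit(word)


def check(word, stored_parity):
    """Return True if the word read back is consistent with its parity bit."""
    return parity_bit(word) == stored_parity


data, p = store(0b10110010)

single_flip = data ^ 0b00001000   # one bit flipped: detected
double_flip = data ^ 0b00011000   # two bits flipped: NOT detected by parity

print(check(data, p))          # True
print(check(single_flip, p))   # False
print(check(double_flip, p))   # True (the error goes unnoticed)
```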
7. 7 Why FT? Examples
Reliable Systems
Telephone systems
Banking systems e.g. ATM
Stock market
CAE - exams/projects
Football games - display/ticketing
8. 8 Why FT? Examples
Critical and Life Critical Systems
Manned and unmanned space borne systems
Aircraft control systems
Nuclear reactor control systems
Life support systems
9. 9 Why FT?
Examples
Reliable -> Critical Systems
911 telephone switching system
Traffic light control system
Automobile control system (ABS, Fuel injection system)
10. 10 Introduction Historical perspective and major push
New initiatives
Goals of fault-tolerance
Applications of fault-tolerance
Do not discuss much about topics here.
Under computer system
overall implies what is a computer system - its architecture and
components
Then focus on hardware and software components
11. 11 Introduction (contd.) Historical Perspective
not a new concept
first use by J. von Neumann, 1956
"Probabilistic Logics and the Synthesis of Reliable Organisms from Unreliable Components", Annals of Mathematics Studies, Princeton University Press
Major push
Space program
HW Fault tolerance - then
SW Fault tolerance later
Merge the two
12. 12 Introduction (contd.) New initiatives
Density of devices -> more failures likely
Deep submicron technology and time-to-market pressure -> designs not fully verified
Implementation of numerous functionalities on chip/board/system -> possibility of system hang-up
Speculative execution -> results may need to be re-checked
Low cost of HW and SW -> affordable/economical
Hot issues: Soft errors, Life-time failures
13. 13 Introduction (contd.) Goals - different goals for different applications
The key word is “reliability” – has different meaning for different users and applications
Intuitive explanations
Dependability
Service
Specification
14. 14 Introduction (contd.) Intuitive concepts
Reliability – continues to work
Availability – works when I need it
Safety – does not put me in jeopardy
Performability
Maintainability
Testability
Survivability – will the system survive catastrophic events?
15. 15 Introduction (contd.) Applications
Space borne system
long life system
Airplane control system
critical system
Transaction processing system
high availability system
Switching system
high availability above a certain level of performance
16. 16 Terminology and definitions Reliability and concept of probability
R(t): conditional probability that a system provides continuous proper service in the interval [0,t] given that it provided desired service at time 0.
Availability
Performability
An Example
Dependability
17. 17
18. 18 Fault-Error-Failure concept Intuitive definitions
Origins of faults
Methods to break FEF chain
Attributes of faults
19. 19 Fault-Error-Failure concept (contd.) Intuitive definitions
Fault -
An anomalous physical condition caused by a manufacturing problem, fatigue, external disturbance (intentional or unintentional), design flaw, …
Causes
Error - Effect of activation of a fault
Failure - over-all system effect of an error
Fault -> Error -> Failure
20. 20 Fault-Error-Failure concept (contd.) Origins of faults
Physical device level (HW)
Logic level (HW)
Chip level (HW)
System level (HW/SW)
interfacing, specifications, …
Why systems fail
21. 21 Fault-Error-Failure concept (contd.) Methods to break FEF chain
Flow FEF
Barriers
Fault avoidance
Fault masking
Fault removal
Fault forecasting
22. 22 Fault-Error-Failure concept (contd.) Attributes of faults
Cause
Nature
Duration
Extent
Value
24. 24 The development of Systematic Approach to Reliability and Fault Tolerance
(1950s)
Theoretical work on redundancy and coding
Moore, Shannon, Hamming and von Neumann
(1960s)
Fault Tolerance systematically built into systems
Bell ESS
IBM 360
Space system (SATURN IV)
(1970s)
Reliability of primary concern in commercial designs
Tandem NonStop
25. 25 Revival of interest
Vacuum tubes -> Transistors -> VLSI
Computers have become increasingly complex
Billions of transistors switching billions of times per second
Wear out
5-year repair interval for TV sets can be reduced to a few months for computers
26. 26
System reliability can be enhanced in two ways:
Fault Avoidance: Implementing the system from ultra reliable components that are extremely unlikely to fail.
Fault Tolerance (FT): designing the system such that it continues to operate correctly, with error-free execution of programs, even in the presence of certain specified faults
Fault tolerance is achieved by using protective redundancy
Hardware redundancy
Software redundancy
Time redundancy
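As one illustration of hardware redundancy, the sketch below uses triple modular redundancy with a majority voter, a common textbook technique that the slides do not name at this point. The replicas are ordinary Python callables standing in for hardware modules, and `majority_vote`, `run_redundant`, `good`, and `faulty` are hypothetical names.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value produced by a strict majority of replicas.

    Raises RuntimeError when no strict majority exists, i.e. the
    faults cannot be masked by voting.
    """
    value, count = Counter(outputs).most_common(1)[0]
    if 2 * count <= len(outputs):
        raise RuntimeError("no majority: fault cannot be masked")
    return value


def run_redundant(replicas, x):
    """Run every replica on the same input and vote on the results."""
    return majority_vote([replica(x) for replica in replicas])


def good(x):
    return x * x


def faulty(x):
    # Hypothetical faulty replica that always returns a wrong result.
    return x * x + 1


print(run_redundant([good, good, faulty], 7))   # 49: the single fault is masked
```

With three replicas the voter masks any single faulty module; two simultaneous faults that agree with each other would outvote the correct replica.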
27. 27 Classes of Fault Tolerant Systems Ultra reliable systems
Employed in critical real time control applications
System Reliability:
The probability that the system will operate correctly over the desired mission time.
Example:
Avionic computers for unstable aircraft (NASA)
Failure probability constrained to be less than 10^-9 for a 10-hour mission
Fault Tolerance:
Maximum number of failures that may occur anywhere in the system without causing system failure.
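To get a feel for the avionics figure above, the sketch below converts a mission failure probability into an equivalent constant failure rate. It assumes an exponential reliability model, R(t) = exp(-rate * t), which the slides do not state; `implied_failure_rate` is a hypothetical helper name.

```python
import math


def implied_failure_rate(p_fail, mission_hours):
    """Constant failure rate (per hour) giving failure probability p_fail
    over a mission of mission_hours.

    Assumption (not from the slides): R(t) = exp(-rate * t), so
    1 - exp(-rate * t) = p_fail  =>  rate = -ln(1 - p_fail) / t.
    """
    return -math.log1p(-p_fail) / mission_hours


rate = implied_failure_rate(1e-9, 10.0)
print(f"failure rate: {rate:.3e} per hour")
print(f"implied MTTF: {1.0 / rate:.3e} hours")
```

Under that assumption the requirement corresponds to roughly 10^-10 failures per hour, far beyond what a single component can offer, which is why such systems rely on redundancy.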
28. 28 Long Life Systems Application where maintenance and/or repair is impossible
Unmanned Spacecraft
Mean time to failure (MTTF): The expected (average) time to system failure.
Example: 20 years MTTF for a communication satellite
Maximum mission time: The maximum time of operation for some specified minimum reliability.
Example: Reliability of 0.90 for a 10-year mission on an outer planet exploratory vehicle
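The relationship between MTTF, reliability, and maximum mission time can be sketched numerically. The code assumes a constant failure rate, so that R(t) = exp(-t / MTTF); that model and the function names are assumptions for illustration, not something the slides specify.

```python
import math


def reliability(t, mttf):
    """R(t) under the assumed exponential model R(t) = exp(-t / MTTF)."""
    return math.exp(-t / mttf)


def max_mission_time(min_reliability, mttf):
    """Longest mission for which R(t) stays at or above min_reliability."""
    return -mttf * math.log(min_reliability)


def required_mttf(min_reliability, mission_time):
    """MTTF needed to reach min_reliability at the end of the mission."""
    return -mission_time / math.log(min_reliability)


# Figures from the slides; time unit is years.
print(f"R(10) with a 20-year MTTF: {reliability(10, 20):.3f}")
print(f"Max mission time for R >= 0.90 with a 20-year MTTF: {max_mission_time(0.90, 20):.2f} years")
print(f"MTTF needed for R = 0.90 over 10 years: {required_mttf(0.90, 10):.1f} years")
```

Under this model a 20-year MTTF is not enough for a 10-year mission at 0.90 reliability; the required MTTF comes out near 95 years.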
29. 29 Safety Critical High Reliability Systems
30. 30 Highly Available systems Application where downtime is expensive
Telephone switching computer
Expensive high performance systems
Mean time to repair (MTTR): The average time before the system is repaired following a failure.
Mean time between failures (MTBF) = MTTF + MTTR
Maintainability: The probability that a failed system will be restored to correct operation within a specified period of time.
Availability: The probability that a system will be operating correctly at any given time during its operation schedule.
31. 31
$$\text{Availability} = \frac{\text{MTTF}}{\text{MTTF} + \text{MTTR}} = \frac{\text{MTBF} - \text{MTTR}}{\text{MTBF}}$$
Examples:
Cray-1 (1975)
MTTF = 4 hours
MTTR = 0.1 hours
$$\text{Availability} = \frac{4}{4.1} \approx 0.98$$
BELL ESS
Goal: 20 minutes of downtime in 40 years
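The two examples can be checked with a few lines of code. This is a minimal sketch using the availability formula from the slide; the helper names and the 365.25-day year used to convert the Bell ESS goal are assumptions.

```python
def availability(mttf, mttr):
    """Steady-state availability: MTTF / (MTTF + MTTR)."""
    return mttf / (mttf + mttr)


def availability_from_downtime(downtime, period):
    """Availability as the fraction of a period the system is up.

    downtime and period must use the same time unit.
    """
    return (period - downtime) / period


# Cray-1 (1975): MTTF = 4 hours, MTTR = 0.1 hours.
cray = availability(4.0, 0.1)
print(f"Cray-1 availability: {cray:.4f}")   # 0.9756

# Bell ESS goal: 20 minutes of downtime in 40 years.
# Assumption: 365.25 days per year for the conversion to minutes.
minutes_in_40_years = 40 * 365.25 * 24 * 60
ess = availability_from_downtime(20.0, minutes_in_40_years)
print(f"Bell ESS target availability: {ess:.8f}")
```

The Cray-1 figure rounds to the 0.98 shown on the slide, while the Bell ESS goal corresponds to better than six nines, which shows how far apart a fast supercomputer and a telephone switch sit on the availability scale.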
32. 32 Availability:
33. Cost of ownership as a function of reliability and maintainability features
34. 34 Fail-Fast is Good, Repair is Needed
Improving either MTTR or MTTF gives benefit
Simple redundancy does not help much.
35. 35
36. 36 The Last 5 Years: Availability Dark Ages Ready for a Renaissance? Things got better, then things got a lot worse!
37. 37 A Schematic of HotMail ~7,000 servers
100 backend stores with 120TB
3 data centers
Links to
Passport
Ad-rotator
Internet Mail gateways
…
~ 1B messages per day
150M mailboxes, 100M active
~400,000 new per day.
38. 38 Availability