Taxonomy and Trends Dan Siewiorek Carnegie Mellon University June 2012
Outline • Taxonomy and Trends • General Purpose Examples • High Availability Examples • A Methodology • Conclusion
Application Taxonomy • General purpose • Wide range of applications; frequently high performance • High availability • Occasional loss of single user but not system; rapid restart • Long life • No human maintenance; automatically detect and reconfigure; high coverage • Critical computations • Usually real-time control systems; low recovery time; high coverage
Error Detection Techniques in Typical General-Purpose System • Memory • Double-error-detection code on memory data • Parity on address and control information • Cache • Parity on data, address, control information • I/O Unit • Parity on data and control • CPU • Parity on data paths • Parity on control store • Duplication and comparison of control logic
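A minimal sketch of the parity checking named above, in C: an even-parity bit is computed over a data word and compared against the parity bit carried alongside it on readback. The function names are illustrative only, not taken from any particular machine.

```c
#include <stdint.h>
#include <stdbool.h>

/* Compute even parity over a 32-bit data word: the stored parity bit
 * makes the total number of 1-bits (data plus parity) even.  A mismatch
 * on readback signals a single-bit (or odd-multiplicity) error. */
static bool even_parity(uint32_t word)
{
    word ^= word >> 16;
    word ^= word >> 8;
    word ^= word >> 4;
    word ^= word >> 2;
    word ^= word >> 1;
    return word & 1u;          /* 1 => odd number of 1-bits in the data */
}

/* Detection check: recompute parity from the data path and compare it
 * against the parity bit that travelled with the data. */
static bool parity_error(uint32_t data, bool stored_parity)
{
    return even_parity(data) != stored_parity;
}
```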
Error Recovery Techniques in Typical General-Purpose System • Memory • Single-error-correction code on data • Retry on address or control information parity error • Cache • Retry on data, address, control information parity error • I/O Unit • Retry on data or control parity errors • CPU • Retry on control store parity error • Invert sense of control store • Macroinstruction retry
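The retry techniques listed above amount to re-executing the failed operation a bounded number of times before escalating to a higher recovery level. A hedged sketch, with `bus_read` standing in for a hardware read that reports parity errors (the stub here always succeeds so the example is self-contained):

```c
#include <stdint.h>
#include <stdbool.h>

/* Stand-in for a hardware bus read that reports parity errors;
 * always succeeds here so the sketch compiles on its own. */
static bool bus_read(uint32_t addr, uint32_t *out)
{
    (void)addr;
    *out = 0;
    return true;
}

/* Bounded retry: a transient fault usually disappears on re-execution,
 * so retry a few times before declaring the error persistent and
 * escalating (e.g. to macroinstruction retry or fault isolation). */
#define MAX_RETRIES 3

typedef enum { READ_OK, READ_FAILED } read_status;

static read_status read_with_retry(uint32_t addr, uint32_t *out)
{
    for (int attempt = 0; attempt <= MAX_RETRIES; ++attempt) {
        if (bus_read(addr, out))
            return READ_OK;          /* success, possibly after a retry   */
    }
    return READ_FAILED;              /* persistent: report for isolation  */
}
```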
IBM 3090 Series Fault-Tolerance Features • Reliability • Low intrinsic failure rate technology • Extensive component burn-in during manufacture • Dual processor controller that incorporates switchover • Dual 3370 Direct Access Storage units support switchover • Multiple consoles for monitoring processor activity and for backup • LSI packaging vastly reduces number of circuit connections • Internal machine power and temperature monitoring • Chip sparing in memory replaces defective chips automatically
IBM 3090 Series Fault-Tolerance Features • Availability • Two or four central processors • Automatic error detection and correction in central and expanded storage • Single bit error correction and double bit error detection in central storage • Double bit error correction and triple bit error detection in expanded storage • Storage deallocation in 4K-byte increments under system program control • Ability to vary channels off line in one channel increments • Instruction retry • Channel command retry • Error detection and fault isolation circuits provide improved recovery and serviceability • Multipath I/O controllers and units
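The single-bit-correction / double-bit-detection behavior in central storage can be illustrated with a small Hamming code plus an overall parity bit. This is a minimal sketch over one 4-bit nibble, not the actual 3090 ECC, which protects much wider words with the same kind of logic:

```c
#include <stdint.h>
#include <stdio.h>

/* Hamming(7,4) + overall parity = SEC-DED over one 4-bit nibble.
 * code[1..7] are the Hamming positions, code[0] is the overall parity. */
static void secded_encode(uint8_t nibble, uint8_t code[8])
{
    uint8_t d0 = (nibble >> 0) & 1, d1 = (nibble >> 1) & 1,
            d2 = (nibble >> 2) & 1, d3 = (nibble >> 3) & 1;

    code[3] = d0; code[5] = d1; code[6] = d2; code[7] = d3;   /* data bits      */
    code[1] = code[3] ^ code[5] ^ code[7];                    /* covers 1,3,5,7 */
    code[2] = code[3] ^ code[6] ^ code[7];                    /* covers 2,3,6,7 */
    code[4] = code[5] ^ code[6] ^ code[7];                    /* covers 4,5,6,7 */

    code[0] = 0;                                              /* overall parity */
    for (int i = 1; i <= 7; ++i) code[0] ^= code[i];
}

/* Returns 0 = clean, 1 = single error corrected, 2 = double error detected. */
static int secded_decode(uint8_t code[8], uint8_t *nibble)
{
    int s = (code[1] ^ code[3] ^ code[5] ^ code[7])            /* syndrome bit 1 */
          | (code[2] ^ code[3] ^ code[6] ^ code[7]) << 1       /* syndrome bit 2 */
          | (code[4] ^ code[5] ^ code[6] ^ code[7]) << 2;      /* syndrome bit 4 */

    int overall = 0;
    for (int i = 0; i <= 7; ++i) overall ^= code[i];           /* should be 0   */

    int status = 0;
    if (s != 0 && overall != 0)      { code[s] ^= 1; status = 1; } /* single: fix */
    else if (s != 0 && overall == 0) { status = 2; }               /* double: flag */
    else if (s == 0 && overall != 0) { code[0] ^= 1; status = 1; } /* parity bit  */

    *nibble = (code[3] << 0) | (code[5] << 1) | (code[6] << 2) | (code[7] << 3);
    return status;
}

int main(void)
{
    uint8_t code[8], out;
    secded_encode(0xA, code);
    code[6] ^= 1;                                   /* inject a single-bit fault */
    int status = secded_decode(code, &out);
    printf("status=%d value=0x%X\n", status, out);  /* prints status=1 value=0xA */
    return 0;
}
```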
IBM 3090 Series Fault-Tolerance Features • Data integrity • Key controlled storage protection (store and fetch) • Critical address storage protection • Storage error checking and correction • Processor cache error handling • Parity and other internal error checking • Segment protection (S/370 mode) • Page protection (S/370 mode) • Clear reset of registers and main storage • Automatic Remote Support authorization • Block multiplexer channel command retry • Extensive I/O recovery by hardware and control programs
IBM 3090 Series Fault-Tolerance Features • Serviceability • Automatic fault isolation (analysis routines) concurrent with operation • Automatic remote support capability – auto call to IBM if authorized by the customer • Automatic customer engineer and parts dispatching • Trace facilities • Error logout recording • Microcode update distribution via remote support facilities • Remote service console capability • Automatic validation tests after repair • Customer problem analysis facilities
ED/FI (Error Detection/Fault Isolation) in IBM 308X / 3090 • Hundreds of thousands of isolation domains • Parity checks account for 70-80% of checkers – data, address, and shift/increment parity predictors • Decoder/encoder checkers • 25% of IBM 3090 circuits devoted to RAS • Can instantaneously detect 90% of all errors • 25% of faults assumed solid for the technology • If less than two weeks between events, the cause is assumed to be the same intermittent • Call service if 24 errors in 2 hours
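A sketch of how such thresholds might be applied to an error log. The two-week and 24-errors-in-2-hours figures come from the slide; the function names and log layout are purely illustrative:

```c
#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

#define TWO_WEEKS  (14 * 24 * 3600)   /* seconds */
#define TWO_HOURS  (2 * 3600)
#define CALL_LIMIT 24

/* Same intermittent fault assumed if the gap to the previous event is short. */
static bool same_intermittent(uint32_t prev_event, uint32_t new_event)
{
    return (new_event - prev_event) < TWO_WEEKS;
}

/* Place a service call if CALL_LIMIT errors fall inside any two-hour window.
 * 'log' holds event timestamps in ascending order. */
static bool call_service(const uint32_t *log, size_t n)
{
    for (size_t i = 0; i + CALL_LIMIT - 1 < n; ++i)
        if (log[i + CALL_LIMIT - 1] - log[i] <= TWO_HOURS)
            return true;
    return false;
}
```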
Tandem Design Objectives • “Nonstop” operation: failures are detected, failed components are configured out of service, and repaired components are configured back in without stopping other system components • No single hardware failure can compromise the data integrity of the system • Modular system expansion by adding more processing power, memory, and peripherals without impacting application software
Fault Containment • Software – processes do not share state; communication only by message passing • Hardware – no shared memory, dual-ported I/O, multiple power supplies
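The containment idea (private state, communication only by messages) can be sketched with two POSIX processes and a pipe. This is only an illustration of the principle, not Tandem's interprocessor mechanism, which used separate processors and an interprocessor bus:

```c
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Two processes hold private copies of their state and exchange it only
 * through a pipe, so a wild write in one address space cannot corrupt
 * the other process. */
int main(void)
{
    int fd[2];
    if (pipe(fd) != 0) return 1;

    pid_t pid = fork();
    if (pid == 0) {                       /* backup: receives checkpoints   */
        close(fd[1]);
        int state = 0;
        read(fd[0], &state, sizeof state);
        printf("backup received checkpoint: %d\n", state);
        return 0;
    }

    close(fd[0]);                         /* primary: owns its own state    */
    int state = 42;
    write(fd[1], &state, sizeof state);   /* checkpoint via message passing */
    close(fd[1]);
    wait(NULL);
    return 0;
}
```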
Fast-Fail Modules (detection) • Software – consistency checks, defensive programming • Hardware – software-generated status probes, hardware self-tests
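A fast-fail module checks its own consistency on every entry and halts as soon as an invariant is violated, letting the backup of the pair take over rather than running on with corrupt state. A small illustrative sketch; the queue structure and names are invented:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct {
    int    items[16];
    size_t head, tail, count;
} queue_t;

/* Consistency check: halt the module as soon as its state looks suspect. */
static void queue_check(const queue_t *q)
{
    if (q == NULL || q->count > 16 || q->head >= 16 || q->tail >= 16)
        abort();                          /* fail fast; the pair takes over */
}

static void queue_push(queue_t *q, int v)
{
    queue_check(q);                       /* defensive check on every call  */
    assert(q->count < 16);
    q->items[q->tail] = v;
    q->tail = (q->tail + 1) % 16;
    q->count++;
}
```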
Software Bugs • Backup process does not encounter the same state and environment, so its code takes a different path
Software • Process pairs • Transaction processing – two-phase commit protocol • Log write-ahead protocol – record before- and after-images of the database in an audit trail • Network systems management – programmed operators help reduce administrative errors • Tandem maintenance and diagnostic system – analyzes event logs to successfully call out the failed FRU 90% of the time
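The write-ahead rule above can be sketched as: force the before- and after-image to the audit trail before updating the record in place, so recovery can later undo or redo the change. A minimal illustration, assuming a flat file stands in for the audit trail (not Tandem's actual format):

```c
#include <stdio.h>
#include <stdint.h>

/* Illustrative audit-trail entry: old and new value of one record. */
typedef struct {
    uint32_t record_id;
    int32_t  before_image;
    int32_t  after_image;
} audit_entry;

static int log_then_update(FILE *audit, int32_t *record,
                           uint32_t id, int32_t new_value)
{
    audit_entry e = { id, *record, new_value };

    /* 1. Write and flush the audit record first (write-ahead rule). */
    if (fwrite(&e, sizeof e, 1, audit) != 1) return -1;
    if (fflush(audit) != 0) return -1;

    /* 2. Only now apply the change in place; after a crash the audit trail
     *    lets recovery redo (after-image) or undo (before-image) it. */
    *record = new_value;
    return 0;
}
```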
Error Handling • Error detection logic records the error • Operating system runs diagnostics • Incident-of-failure algorithm • If transient, return board to service • If permanent, call Customer Assistance Center (CAC) • CAC determines the problem • Selects a board of the same revision level • Prints installation instructions • Ships via overnight courier • 22 field engineers support 400 systems • Service costs 6% / year of LCC vs. 9% for others
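A sketch of the transient-vs-permanent decision: rerun diagnostics on the suspect board and treat a non-reproducible error as transient. `run_diagnostics` is a hypothetical stand-in, stubbed so the example compiles:

```c
#include <stdbool.h>

typedef enum { RETURN_TO_SERVICE, CALL_SERVICE_CENTER } disposition;

/* Stand-in for the board's diagnostic routine; always passes here. */
static bool run_diagnostics(int board_id)
{
    (void)board_id;
    return true;
}

static disposition classify_error(int board_id)
{
    for (int pass = 0; pass < 3; ++pass)
        if (!run_diagnostics(board_id))
            return CALL_SERVICE_CENTER;   /* reproducible: permanent fault    */
    return RETURN_TO_SERVICE;             /* not reproducible: treat as transient */
}
```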
A Methodology • Define objectives • Limit the scope • Define confinement regions • Design error handling mechanisms • Design error reporting mechanisms • Testing of error handling/reporting mechanisms • Evaluate design
Exercising Latent Faults • Special commands to support exercising dormant areas are provided in BIUs and MCUs
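One common way to exercise dormant areas is a background scrubber that periodically reads and rewrites otherwise-idle memory through the checking hardware, so latent faults surface while the system is healthy rather than during recovery. A rough sketch under that assumption (not the BIU/MCU commands themselves):

```c
#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

/* Read each word (exercising the detection logic) and write it back
 * (which would correct soft errors in an ECC-protected memory).
 * In a real scrubber the return value would come from the checker. */
static bool check_and_scrub_word(volatile uint32_t *word)
{
    uint32_t v = *word;
    *word = v;
    return true;
}

/* Walk a region of otherwise-dormant memory and count detected errors. */
static size_t scrub_region(volatile uint32_t *base, size_t nwords)
{
    size_t errors = 0;
    for (size_t i = 0; i < nwords; ++i)
        if (!check_and_scrub_word(&base[i]))
            ++errors;
    return errors;
}
```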
Conclusion • Designing from first principles to produce an architecture to tolerate failures achieves better reliability, availability, and cost-effectiveness than an ad-hoc, add-on approach • It is possible to build systems in which the activities of fault detection, diagnosis, and recovery are completely automated and transparent to the user