1 / 23

Principles of Engineering System Design

Learn about the principles of engineering system design and how fault tolerance can be implemented in the development of physical architectures, using the case study of the Iowa United 232 aircraft crash in 1989. Explore error detection, damage confinement, error recovery, fault isolation and reporting, and the use of redundancy for achieving fault tolerance.

hgoetsch
Download Presentation

Principles of Engineering System Design

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Principles of Engineering System Design Dr T Asokan asok@iitm.ac.in

  2. Implementing Fault Tolerance in Physical Architecture Development

  3. Case Study: Aircraft crash- Iowa

  4. United 232: 3-engine aircraft crashed on 19/7/1989 while making an emergency landing after losing one of the three engines. 110 people died, 185 survived. • Three redundant hydraulic systems, each powered by a unique engine, were available for aircraft stabilisation. • The three hydraulic system converged at the location near the tail where the fan disk ripped out, the single point of failure for all the hydraulic systems.

  5. Failure:Deviation in behavior between the system and its requirements Error :A subset of the system state, which may lead to system failure.Fault:a defect in the system that can cause an error. Error detection Functions Fault tolerance is the ability of a system to tolerate faults and continue performing. • Fault tolerance can be achieved only for those errors that are observed.Functions associated with fault tolerance are:Error detectionDamage confinementError recoveryFault isolation and reporting

  6. Error detection is defining possible errors, deviations in the subset of the system’s state from the desired state, in the design phase before they occur, and establishing a set of functions for checking for the occurrence of each error. • Type checks, range checks, timing checks • Damage confinement is protecting the system from the possible spread of failure to other parts of the system. • Firewalls

  7. Error recovery attempts to correct the error after the error has been detected and the errors extent defined. • Backward recovery, forward recovery • Fault isolation and reportingattempts to determine where in the system the fault occurred that generated the error.

  8. Redundancy to Achieve Fault Tolerance A primary source of high availability and fault tolerance is redundancy: Hardware, software, information, and time. Hardware redundancy uses extra hardware to enable the detection of errors as well as to provide additional operational hardware components after errors have occurred. Hardware redundancy can be implemented in Passive, Active, and Hybrid forms

  9. Passive hardware redundancy masks or hides the occurrence of errors rather than detecting them. Recovery is achieved by having extra hardware available when needed. The most common implementation is Triple Modular Redundancy (TMR). Relies on majority voting scheme to mask error in one of the three hardware units.

  10. Triplicated TMR

  11. Software implementation of voting for TTMR

  12. Active hardware redundancy • Active hardware redundancy attempts to do all the four functions i.e. detect errors, confine damage, recover from errors, and isolate and report fault. • Hardware duplication with comparison • Hot standby sparing • Cold standby sparing • Pair-and-a-spare

  13. Hardware duplication with comparison Hardware duplication with comparison is the basic building block for active redundancy

  14. Hot standby sparing and Cold standby sparing Most common approaches to hardware redundancy

  15. 1

  16. Pair and a spare active redundancy

  17. Hybrid Hardware Redundancy • Combination of N-modular redundancy with spares or TMR with duplication with comparison. • Critical computation systems normally use Active or Hybrid redundancy. • Active redundancy reduces the life of the system • Hybrid redundancy is the costliest

  18. Software redundancy • N-versions, • capability checks: Periodic hardware tasks with known answers • consistency checks: compares output of a component with known characteristics • Information redundancy: achieved by extra bits of information to enable error detections • Helps to catch system induced errors • Parity checks • Time redundancy • Standby systems error detection

  19. Design Flexibility • The mark of a long-lived system is one that has been upgraded successfully many times • System should have an adaptable platform for such upgrades ( eg: Windows NT operating system) • Engineering systems to be designed to be “changeable” in the future • Four aspects of changeability are: • Flexibility • Agility • Robustness • Adaptability

  20. Flexibility represents the property of the system to be changed easily. Changes from external to be incorporated to cope with changing environments • Computers with various interface ports can interface with many external systems • Flexibility is important for future upgrades • Agility characterizes a systems ability to be changed rapidly • Race cars are designed to be agile to enable easy modifications to suit the tracks • Robustness represents a systems ability to be insensitive towards changing environments. • An all terrain vehicle such as a Jeep is robust enough to run on different terrains. • Adaptability characterizes a systems ability to adapt itself towards changing environments. No changes form external have to be incorporated to cope with changing environments. • Some of the intelligent software/OS are designed to learn and adapt to different users

  21. Summary • Development of Physical architecture from Functional architecture • Generic and Instantiated architecture • Morphological box • Fault tolerance in physical architecture • Redundancy for fault tolerance • Hardware, software, information, time • Passive, Active redundancy • Hot standby, cold standby, pair and spare • Design Flexibility

More Related