Fault-Tolerant Design for Long-Life Deep Space Missions


  1. Fault-Tolerant Design for Long-Life Deep Space Missions Yiğit Kültür 2006702835

  2. Contents • Introduction • Fault-Tolerant System Considerations and Techniques • Historical Perspective • Future Approach • Conclusion

  3. Introduction • Mars has recently been a focal point of attention because it will play a key role in humanity’s expansion into deep space • Future Mars transportation will require reliable operation over a lifespan of years, unlike: • The Space Shuttle, which requires operation over months • The Space Station, which is close enough to Earth for maintenance logistics

  4. Introduction • The long operating period of deep space missions demands: • Innovative fault-tolerant technology development • Application of advanced redundancy techniques • To enable Mars exploration, safety, reliability and autonomy must be improved • A new technology plan is needed to guide the development of next-generation fault-tolerant computing technology

  5. Fault Tolerant System Considerations • Traditionally, avionic systems have achieved fault tolerance through redundancy management • A redundancy management technique: • Detects and isolates a failure • Performs hardware reconfiguration • A combination of self-monitoring and cross-comparison strategies leads to comprehensive fault coverage at reduced risk and cost

  6. Fault Tolerant System Considerations • Primary Flight Control System (PFCS) Baseline Requirements • Mission reliability: 0.95 success probability at 10 years with no repair • Throughput: 100 million instructions per second (MIPS) • Expandable I/O: 100 Mbits/sec • Expandable Memory: 1 GByte • Mass Storage Capacity: 1 Terabyte • Cycle Rate: 100 Hz • Hardware N-fail operation • Low life-cycle cost • Low power and mass • Radiation tolerance • Building block approach (look for existing solutions to parts of the problem and combine them)
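To get a feel for the "0.95 at 10 years with no repair" requirement, here is a rough sketch. It assumes a constant failure rate (an exponential reliability model) and a perfect voter. Neither assumption comes from the slides, and the function names are illustrative only.

```python
import math

HOURS_PER_YEAR = 8766  # 365.25 days; an approximation


def unit_reliability(failure_rate_per_hour, years):
    """R(t) = exp(-lambda * t), under the assumed constant failure rate."""
    return math.exp(-failure_rate_per_hour * years * HOURS_PER_YEAR)


def tmr_reliability(unit_r):
    """2-of-3 majority group with a perfect voter, no spares, no repair."""
    return 3 * unit_r**2 - 2 * unit_r**3


def max_failure_rate(target, years):
    """Largest constant failure rate at which one unit alone meets the target."""
    return -math.log(target) / (years * HOURS_PER_YEAR)


if __name__ == "__main__":
    lam = max_failure_rate(0.95, 10)
    print(f"single unit needs lambda <= {lam:.3e} failures/hour")
    for r in (0.95, 0.5, 0.4):
        print(f"unit R = {r:.2f} -> TMR R = {tmr_reliability(r):.5f}")
```

Under these assumptions a triplicated group without spares helps only while the reliability of each unit stays above 0.5. Below that point the group is less reliable than a single unit, which is one reason long missions combine voting with spares and reconfiguration.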

  7. Fault Tolerant Techniques for Mars Applications • Ultra-reliable systems for long-life applications such as human Mars exploration are required to tolerate: • Permanent faults • Transient (temporary) faults • Intermittent (not continuous) faults • Timing faults • Latent (hidden) faults • Worst-case fault scenarios with a lower probability of occurrence

  8. Fault Tolerant Techniques for Mars Applications • Distributed architectures are better suited to long-life space applications because they offer: • Function integration • Parallel computation • Graceful performance growth • Selective technology upgrade • Appropriate levels of function reliability • Graceful degradation of system capabilities in the presence of faults • Efficient use of hardware resources

  9. Historical Perspective • Long-Life Unmanned Redundant Systems: Viking, Voyager, Galileo

  10. Historical Perspective • Safety Critical High Reliability Systems: Columbia, Challenger, Discovery, Atlantis, Endeavour

  11. Long-Life Unmanned Redundant Systems: Viking • Viking is an instance of the pre-1970 Thermoelectric Outer Planets Spacecraft (TOPS) concept • This spacecraft was the first to use a computer as a fault manager that attempts to reconfigure the spacecraft and restore it to an operational configuration • The fundamental strategy was to switch power on and off to various alternative subsystems until either the built-in fault monitoring indicated that operation was restored or, for faults in the communication chain, commands from Earth were detected • There was no real-time masking of faults, so a fault that occurred during a maneuver would have resulted in an incorrect maneuver (Figure: Viking Fault-Tolerant Architecture. CCS: Computer Command Subsystem; FDS: Flight Data Subsystem)
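The power-switching strategy described on this slide can be sketched as a search over alternative units. This illustrates the idea only and is not the Viking flight software. The subsystem names, unit names and the `is_healthy` callback (standing in for the built-in fault monitoring) are all hypothetical.

```python
from itertools import product


def recover(subsystems, is_healthy):
    """Try combinations of alternative units until the monitor reports success.

    subsystems: dict mapping a subsystem name to its list of alternative units.
    is_healthy: callback that stands in for the built-in fault monitoring.
    Returns the first working configuration, or None if none is found.
    """
    names = list(subsystems)
    for choice in product(*(subsystems[name] for name in names)):
        config = dict(zip(names, choice))  # "power on" this set of units
        if is_healthy(config):
            return config
    return None


if __name__ == "__main__":
    units = {"receiver": ["rx_a", "rx_b"], "transmitter": ["tx_a", "tx_b"]}
    broken = {"rx_a"}  # hypothetical permanent fault
    print(recover(units, lambda cfg: not (set(cfg.values()) & broken)))
```

The search takes time, and nothing masks the fault while it runs. That matches the slide's point that a fault during a maneuver would produce an incorrect maneuver.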

  12. Long-Life Unmanned Redundant Systems: Voyager • Like Viking, Voyager is an instance of the pre-1970 Thermoelectric Outer Planets Spacecraft (TOPS) concept • It improves on Viking only in limited ways, such as the addition of a pair of separate computers for attitude and articulation control • Both spacecraft used standby redundancy. The standby spares were cross-strapped so that either unit could be switched in to communicate with the other units • Cross-strapping and switching allowed reconfiguration around failed components, either automatically or by ground command (Figure: Voyager Fault-Tolerant Architecture. CCS: Computer Command Subsystem; FDS: Flight Data Subsystem; AACS: Attitude and Articulation Control Subsystem)

  13. Long-Life Unmanned Redundant Systems: Galileo • The Galileo mission is a follow-on to the Voyager Jupiter fly-by mission • The Galileo design borrows heavily from the Voyager experience • Block redundancy (whole functional blocks are duplicated so that a redundant block can take over from a failed one) is used throughout the subsystems • All subsystems except the CDS operate as active/standby pairs • The CDS uses active redundancy: each block can issue independent commands, or the blocks can operate in parallel on the same critical activity (Figure: Galileo Fault-Tolerant Architecture. CDS: Command and Data Subsystem; AACS: Attitude and Articulation Control Subsystem)

  14. Long-Life Unmanned Redundant Systems: Galileo • The major departure from the Voyager architecture is the extensive use of microprocessors and the consequent use of a bus-oriented architecture to facilitate communication among them • Galileo's on-board fault detection software is designed to alleviate the effects and symptoms of faults rather than to pinpoint the exact faults • Fault identification and isolation are performed through ground intervention (Figure: Galileo Fault-Tolerant Architecture. CDS: Command and Data Subsystem; AACS: Attitude and Articulation Control Subsystem)

  15. Safety Critical High Reliability Systems: Shuttles • Operational differences from planetary probes: • The design must be absolutely certain that no fault propagates to the effectors during a relatively short operation cycle • Rather than relying on fault monitors to interrupt processing and trigger a reconfiguration, several redundant strings are powered on and operated in parallel

  16. Safety Critical High Reliability Systems: Shuttles • Voting occurs both in the General Purpose Computers (GPCs) and at the final effectors • Voting is a much more brute-force approach than fault monitoring: it requires more hardware but provides greater fault coverage • It is much better suited to real-time, safety-critical maneuver control than the reconfiguration-oriented strategy of Viking, Voyager and Galileo (Figure: Conceptual Shuttle Orbiter Fault-Tolerant Architecture. GPC: General Purpose Computer)
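A minimal sketch of majority voting across redundant strings follows. It assumes the strings are synchronized and produce exactly comparable outputs, which the slides do not detail. The function name and return shape are illustrative, not the Shuttle's actual mechanism.

```python
from collections import Counter


def majority_vote(outputs):
    """Vote over the outputs of redundant strings.

    Returns (value, dissenters): the strict-majority value and the indices
    of the strings that disagreed with it. If no strict majority exists,
    value is None and every string is reported as suspect.
    """
    if not outputs:
        raise ValueError("need at least one output to vote on")
    value, votes = Counter(outputs).most_common(1)[0]
    if votes * 2 <= len(outputs):
        return None, list(range(len(outputs)))
    return value, [i for i, out in enumerate(outputs) if out != value]


if __name__ == "__main__":
    print(majority_vote([7, 7, 9]))     # one string disagrees and is masked
    print(majority_vote([1, 1, 2, 2]))  # tie: no strict majority
```

Because the voted value is available in the same cycle, a single faulty string is masked without any pause for reconfiguration. The cost is that all the strings must be powered and running.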

  17. Mars Advanced Fault Tolerant Computing Approach: Future Manned Mars Missions • Parallel-Hybrid Redundancy will be the basis for future long-life deep space missions: • It combines the attractive features of parallel processing and redundant computation • Computational elements can be arranged to provide high throughput, ultra-high reliability, or a combination of the two, depending on the mission phase

  18. Mars Advanced Fault Tolerant Computing Approach: Future Manned Mars Missions • Parallel-Hybrid Redundancy was first used in 1979, when the Fault Tolerant Multi-Processor (FTMP) was designed and built: • FTMP used a conventional shared-memory multiprocessor architecture • Each virtual processor consisted of three real processors working as a triad to provide real-time fault masking • When a fault is detected in a processor, the faulty unit is replaced from a pool of spares
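The triad-plus-spares idea can be sketched as follows. This is a simplified illustration, not FTMP's actual mechanism. The class, the processor names and the policy of retiring a processor after a single disagreement are assumptions made for the example.

```python
from collections import Counter


class Triad:
    """A virtual processor made of three real processors plus a spare pool."""

    def __init__(self, members, spares):
        if len(members) != 3:
            raise ValueError("a triad needs exactly three members")
        self.members = list(members)
        self.spares = list(spares)  # a shared pool in a real system

    def step(self, outputs):
        """Vote on one cycle's outputs (processor id -> result).

        The voted value masks a single faulty member immediately; any
        dissenting member is then swapped for a spare, if one is left.
        """
        results = [outputs[m] for m in self.members]
        value, votes = Counter(results).most_common(1)[0]
        if votes < 2:
            raise RuntimeError("no majority among triad members")
        for i, member in enumerate(self.members):
            if outputs[member] != value and self.spares:
                self.members[i] = self.spares.pop(0)
        return value


if __name__ == "__main__":
    triad = Triad(["p0", "p1", "p2"], spares=["p3"])
    print(triad.step({"p0": 1, "p1": 1, "p2": 0}), triad.members)
```

The vote provides the real-time masking and the spare pool restores the triad to full strength afterwards. That combination is what the slide calls parallel-hybrid redundancy.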

  19. Mars Advanced Fault Tolerant Computing Approach: Future Manned Mars Missions • FTMP had certain drawbacks: • It was not explicitly designed to meet the rigorous requirements of Byzantine resilience (the correctly functioning components of a Byzantine-resilient system reach the same group decisions regardless of how the faulty components behave), which is necessary to provide: • Coverage of random hardware faults • Ultra-high reliability • Ease of validation • It was hard to expand because of the redundant bus connections between processors and main memory • It did not support mixed redundancy because processors were arranged in triads regardless of the criticality of the application
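The slides do not spell out the "rigorous requirements" of Byzantine resilience. The sketch below encodes the classical bounds from the Byzantine agreement literature for tolerating f arbitrary faults, as background rather than as a statement about how FTMP or FTPP is built. The function names are illustrative.

```python
def byzantine_requirements(f):
    """Classical minimums for agreement despite f arbitrarily faulty participants."""
    if f < 0:
        raise ValueError("f must be non-negative")
    return {
        "min_fault_containment_regions": 3 * f + 1,
        "min_disjoint_paths_between_regions": 2 * f + 1,
        "min_exchange_rounds": f + 1,
    }


def max_byzantine_faults(num_regions):
    """Largest f that num_regions participating regions can tolerate."""
    if num_regions < 1:
        raise ValueError("need at least one region")
    return (num_regions - 1) // 3


if __name__ == "__main__":
    print(byzantine_requirements(1))
    print(max_byzantine_faults(5))
```

By these bounds, three participants alone cannot reach agreement in the presence of one arbitrarily faulty participant, while four or five can. The five fault containment regions of the FTPP cluster described on a later slide are consistent with tolerating one such fault.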

  20. Mars Advanced Fault Tolerant Computing Approach: Future Manned Mars Missions • To remedy the deficiencies of FTMP, a new architecture called the Fault Tolerant Parallel Processor (FTPP) was conceived • It meets all the requirements for covering random hardware faults • FTPP will be the basis of fault tolerance for future manned Mars missions (Figure: FTPP Architecture)

  21. Mars Advanced Fault Tolerant Computing Approach: Features of FTPP – Parallel Processing • Parallel processing is provided by: • 40 Processing Elements (PEs) in 5 Fault Containment Regions (FCRs) • 2 Input/Output Controllers (IOCs) per FCR (Figure: FTPP Architecture)

  22. Mars Advanced Fault Tolerant Computing Approach: Features of FTPP – Scalable Performance • Increasing the number of PEs in a single cluster creates a communication bottleneck in the Network Elements (NEs) • FTPP relies on a hierarchical approach to scaling performance, assembling clusters via the IOCs (Figure: FTPP Architecture)

  23. Mars Advanced Fault Tolerant Computing Approach: Features of FTPP – Mixed Redundancy • Most fault-tolerant computers are designed to operate in a redundant mode only, which wastes resources on non-critical tasks • FTPP allows the processing elements to be configured as: • Simplex: non-critical tasks • Triplex: tasks that require real-time fault masking • Quadruplex or higher: when two or more sequential faults must be tolerated in a small time window without the benefit of reconfiguration • In the figure: • 4 quads • 3 triplexes • 15 simplexes (Figure: FTPP Architecture)
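The configuration in the figure (4 quads, 3 triplexes, 15 simplexes) uses exactly the 40 PEs of a cluster. The sketch below checks that arithmetic and places each member of a redundant group in a different FCR. The distinct-FCR rule is inferred from the slides' references to channels and FCRs, and the greedy placement is an illustration only, not FTPP's actual allocation scheme.

```python
GROUP_SIZE = {"simplex": 1, "triplex": 3, "quadruplex": 4}


def allocate(groups, num_fcrs=5, pes_per_fcr=8):
    """Place virtual groups on PEs, one member per FCR within each group.

    groups: list of (level, count) pairs, most redundant level first.
    Returns (placement, free), where placement lists (level, FCR indices)
    and free is the number of unused PEs left in each FCR.
    """
    free = [pes_per_fcr] * num_fcrs
    placement = []
    for level, count in groups:
        size = GROUP_SIZE[level]
        for _ in range(count):
            # Pick the FCRs with the most free PEs to keep the load balanced.
            chosen = sorted(range(num_fcrs), key=lambda i: -free[i])[:size]
            if len(chosen) < size or any(free[i] == 0 for i in chosen):
                raise ValueError(f"not enough free PEs for another {level}")
            for i in chosen:
                free[i] -= 1
            placement.append((level, sorted(chosen)))
    return placement, free


if __name__ == "__main__":
    figure = [("quadruplex", 4), ("triplex", 3), ("simplex", 15)]
    placement, free = allocate(figure)
    print(len(placement), "virtual groups; free PEs per FCR:", free)
```

The same 40 PEs could instead run as 40 simplexes for maximum throughput or as 10 quads for maximum redundancy. This trade-off is what mixed redundancy makes available.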

  24. Mars Advanced Fault Tolerant Computing Approach: Features of FTPP – Dynamic Reconfiguration • A mission consists of several phases, such as launch, ascent, cruise from Earth orbit to Mars, Mars orbit injection and Mars landing • The throughput, latency, iteration rate and criticality requirements vary over a wide range from phase to phase, so the architecture must be flexible • Reconfiguration from high throughput to high reliability: • 3 PEs operating as independent simplex elements can be synchronized to run the same task (S2, S3, S13) • Replacing failed members: • A simplex in the same FCR as the failed member is synchronized with the non-failed members of the virtual group (if channel A of Q1 fails, S2, S7 or S12 can replace it) (Figure: FTPP Architecture)
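The replacement rule on this slide (take a simplex from the same FCR as the failed member) can be sketched as below. The PE labels follow the slide's example. The FCR letters, the "Q1-A" naming and the placement of S3 in FCR B are assumptions made for illustration, and resynchronizing the new member is not modeled.

```python
def replace_failed_member(group, failed_pe, simplexes, fcr_of):
    """Swap a failed group member for a simplex located in the same FCR.

    group: list of PE ids forming the virtual group.
    simplexes: list of PE ids currently running as simplexes; the chosen
        one is removed from this list.
    fcr_of: dict mapping each PE id to its fault containment region.
    Returns the new group, or None if that FCR has no simplex to give up.
    """
    for candidate in simplexes:
        if fcr_of[candidate] == fcr_of[failed_pe]:
            simplexes.remove(candidate)
            return [candidate if pe == failed_pe else pe for pe in group]
    return None


if __name__ == "__main__":
    fcr_of = {"Q1-A": "A", "Q1-B": "B", "Q1-C": "C", "Q1-D": "D",
              "S2": "A", "S7": "A", "S12": "A", "S3": "B"}
    q1 = ["Q1-A", "Q1-B", "Q1-C", "Q1-D"]
    pool = ["S3", "S2", "S7", "S12"]
    print(replace_failed_member(q1, "Q1-A", pool, fcr_of), pool)
```

Drawing the replacement from the failed member's own FCR keeps every member of the group in a different fault containment region.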

  25. Mars Advanced Fault Tolerant Computing Approach: Features of FTPP – Low Fault Tolerance Overhead • Frequent fault-tolerance functions such as fault/error detection, error masking (voting) and synchronization are implemented in the Network Element • Less frequent functions such as identification of faulty modules, reconfiguration and reintegration are implemented in software that executes on the PEs • Each NE services 8 PEs (Figure: FTPP Architecture)

  26. Mars Advanced Fault Tolerant Computing Approach: Features of FTPP – Open Architecture • FTPP provides an open architecture for both hardware and software, including: • Processors • I/O modules • Fiber optic links • Operating systems (Figure: FTPP Architecture)

  27. Mars Advanced Fault Tolerant Computing Approach: Features of FTPP – Small Physical Size • Packaging technology is the key to meeting the weight, volume and power requirements • Multi-Chip Modules (MCMs) will be used: • An NE fits on a single MCM of less than 4 cm² (Figure: FTPP Architecture)

  28. Conclusion • Future manned deep space missions will require reliable operation over years and real-time masking of critical faults • Current approaches are not sufficient, and a new fault-tolerant approach is needed • FTPP is a strong candidate for the spacecraft that will take humans to Mars

  29. References • Benjamin, A.L.; Lala, J.H.: Advanced fault tolerant computing for future manned space missions. 16th Digital Avionics Systems Conference (DASC), AIAA/IEEE, Volume 2, 26-30 Oct. 1997, pp. 8.5-26 to 8.5-32 • NASA Website, Computers in Spaceflight: The NASA Experience, http://www.hq.nasa.gov/office/pao/History/computers/Ch6-2.html • NASA Jet Propulsion Laboratory Website, Voyager: The Interstellar Mission, http://voyager.jpl.nasa.gov/spacecraft/index.html