Fault-Tolerant Design for Long-Life Deep Space Missions

Yiğit Kültür 2006702835

Contents • Introduction • Fault-Tolerant System Considerations and Techniques • Historical Perspective • Future Approach • Conclusion

Introduction • Recently, Mars has been the focus of astronomical attention because it will play a key role in humanity’s expansion into deep space • Future Mars transportation will require reliable operation over a lifespan of years, unlike: • The Space Shuttle, which requires operation over months • The Space Station, which is close enough to the Earth for maintenance logistics

Introduction • The long operation periods associated with deep space missions demand: • Innovative fault-tolerant technology development • Application of advanced redundancy techniques • To enable Mars exploration, safety, reliability and autonomy must be improved • A new technology plan is needed to guide the development of next-generation fault-tolerant computing technology

Fault Tolerant System Considerations • Traditionally, avionic systems have achieved fault tolerance through redundancy management • A redundancy management technique: • Detects and isolates a failure • Performs hardware reconfiguration • A combination of self-monitoring and cross-comparison strategies leads to comprehensive fault coverage at reduced risk and cost

Fault Tolerant System Considerations • Primary Flight Control System (PFCS) Baseline Requirements • Mission reliability: 0.95 success probability at 10 years with no repair • Throughput: 100 million instructions per second (MIPS) • Expandable I/O: 100 Mbits/sec • Expandable Memory: 1 GByte • Mass Storage Capacity: 1 Terabyte • Cycle Rate: 100 Hz • Hardware N-fail operation • Low life-cycle cost • Low power and mass • Radiation tolerance • Building block approach (look for existing solutions to parts of the problem and combine them)
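The mission reliability figure can be turned into a rough failure-rate budget. The sketch below assumes a constant failure rate, i.e. the exponential model R(t) = exp(-λt); that model, the helper names and the hours-per-year constant are illustrative assumptions, not part of the PFCS requirements.

```python
import math

HOURS_PER_YEAR = 8766.0  # 365.25 days; assumption used only for unit conversion


def max_failure_rate(reliability: float, years: float) -> float:
    """Largest constant failure rate (per hour) that still meets the target.

    Assumes R(t) = exp(-lambda * t), an illustrative model that the
    requirement list itself does not state.
    """
    hours = years * HOURS_PER_YEAR
    return -math.log(reliability) / hours


def reliability_at(failure_rate_per_hour: float, years: float) -> float:
    """Reliability of a single non-repairable unit after `years` of operation."""
    return math.exp(-failure_rate_per_hour * years * HOURS_PER_YEAR)


# Budget implied by "0.95 success probability at 10 years with no repair".
budget = max_failure_rate(0.95, 10.0)
print(f"failure-rate budget: {budget:.2e} per hour")
print(f"implied MTBF: {1.0 / budget / HOURS_PER_YEAR:.0f} years")
```

Under this assumption the whole system may fail at most about 5.9e-7 times per hour, which corresponds to a mean time between failures of roughly 195 years. A budget of that size is one way to see why redundancy, rather than a single highly reliable unit, is called for.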

Fault Tolerant Techniques for Mars Applications • Ultra-reliable systems for long-life applications like human Mars exploration are required to tolerate: • Permanent faults • Transient (temporary) faults • Intermittent (not continuous) faults • Timing faults • Latent (hidden) faults • Worst-case fault scenarios with a lower probability of occurrence

Fault Tolerant Techniques for Mars Applications • Distributed architectures are better suited to long-life space applications because they offer: • Function integration • Parallel computation • Graceful performance growth • Selective technology upgrade • Appropriate levels of function reliability • Graceful degradation of system capabilities in the presence of faults • Efficient use of hardware resources

Historical Perspective • Long-Life Unmanned Redundant Systems: • Viking • Voyager • Galileo

Historical Perspective • Safety Critical High Reliability Systems: • Columbia • Challenger • Discovery • Atlantis • Endeavour

Long-Life Unmanned Redundant Systems: Viking • Viking is an instance of the pre-1970 Thermoelectric Outer Planets Spacecraft (TOPS) concept • This spacecraft first introduced the use of a computer as a fault manager that attempts to reconfigure the spacecraft and restore it to an operational configuration • The fundamental strategy was to switch power on and off to various alternative subsystems until either the built-in fault monitoring indicated that operation was restored or, in the case of faults in the communication chain, commands from the Earth were detected • There was no real-time masking of faults, so a fault that occurred during a maneuver would have resulted in an incorrect maneuver • Figure: Viking Fault-Tolerant Architecture (CCS: Command Computer Subsystem; FDS: Flight Data Subsystem)
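The power-switching strategy described for Viking can be sketched as a simple search loop. This is a minimal illustration, not the flight logic: the function name, the callback interface and the unit names `receiver_A` and `receiver_B` are all hypothetical.

```python
from typing import Callable, Optional, Sequence, Set


def reconfigure(
    alternatives: Sequence[str],
    power_on: Callable[[str], None],
    power_off: Callable[[str], None],
    is_healthy: Callable[[], bool],
) -> Optional[str]:
    """Power alternative units one at a time until the monitor reports health.

    Returns the unit that restored operation, or None if none did.
    """
    for unit in alternatives:
        power_on(unit)
        if is_healthy():      # built-in fault monitoring says "restored"
            return unit
        power_off(unit)       # this alternative did not help; try the next
    return None


# Simulated spacecraft state (hypothetical): only the B unit still works.
powered: Set[str] = set()
working_units = {"receiver_B"}

selected = reconfigure(
    ["receiver_A", "receiver_B"],
    power_on=powered.add,
    power_off=powered.discard,
    is_healthy=lambda: bool(powered & working_units),
)
print(selected, sorted(powered))
```

The function is unavailable for as long as the loop is searching, which is the "no real-time masking" limitation noted above: the fault is repaired after the fact rather than hidden while it happens.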

Long-Life Unmanned Redundant Systems: Voyager • Like Viking, Voyager is an instance of the pre-1970 Thermoelectric Outer Planets Spacecraft (TOPS) concept • It improves on Viking only in limited ways, such as the addition of a pair of separate computers for attitude and articulation control • Both spacecraft used standby redundancy. The standby spares were cross-strapped so that either unit could be switched in to communicate with the other units • Cross-strapping and switching allowed reconfiguration around failed components, either automatically or by ground command • Figure: Voyager Fault-Tolerant Architecture (CCS: Command Computer Subsystem; FDS: Flight Data Subsystem; AACS: Attitude and Articulation Control Subsystem)

Long-Life Unmanned Redundant Systems: Galileo • The Galileo mission is a follow-on to the Voyager Jupiter fly-by mission • The Galileo design borrows heavily from the Voyager experience • Block redundancy (replication of whole subsystem blocks, so that a spare block can take over from a failed one) is used throughout the subsystems • All subsystems except the CDS operate as active/standby pairs • The CDS uses active redundancy: each block can issue independent commands, or the blocks can operate in parallel on the same critical activity • Figure: Galileo Fault-Tolerant Architecture (CDS: Command and Data Subsystem; AACS: Attitude and Articulation Control Subsystem)

Long-Life Unmanned Redundant Systems: Galileo • The major departure from the Voyager architecture is the extensive use of microprocessors and the consequent use of a bus-oriented architecture to facilitate communication among them • Galileo's on-board fault detection software is designed to alleviate the effects and symptoms of faults rather than to pinpoint the exact faults • Fault identification and isolation are performed by ground intervention • Figure: Galileo Fault-Tolerant Architecture (CDS: Command and Data Subsystem; AACS: Attitude and Articulation Control Subsystem)

Safety Critical High Reliability Systems: Shuttles • Operational differences from planetary probes: • It must be absolutely certain that no fault propagates to the effectors during a relatively short operation cycle • Rather than relying on fault monitors to interrupt processing and trigger a reconfiguration, several redundant strings are powered on and operated in parallel

Safety Critical High Reliability Systems: Shuttles • Voting occurs both in the General Purpose Computers (GPCs) and at the final effectors • Voting is much more brute force than fault monitoring, requiring more hardware but also providing greater fault coverage • Voting is much better suited to real-time safety-critical maneuver control than a reconfiguration-oriented strategy as in Viking, Voyager and Galileo • Figure: Conceptual Shuttle Orbiter Fault-Tolerant Architecture (GPC: General Purpose Computer)
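Fault masking by voting can be illustrated with a small majority voter. This is a sketch under stated assumptions: it uses exact-match comparison of channel outputs and a hypothetical four-channel example, and it does not claim to reproduce the Shuttle's actual voting logic.

```python
from collections import Counter
from typing import Hashable, List, Optional, Sequence, Tuple


def majority_vote(
    outputs: Sequence[Hashable],
) -> Tuple[Optional[Hashable], List[int]]:
    """Return (voted value, indices of channels that disagree with it).

    If no value has a strict majority, the voted value is None and every
    channel is reported as suspect.
    """
    if not outputs:
        raise ValueError("at least one channel output is required")
    value, votes = Counter(outputs).most_common(1)[0]
    if votes * 2 <= len(outputs):          # no strict majority
        return None, list(range(len(outputs)))
    dissenters = [i for i, out in enumerate(outputs) if out != value]
    return value, dissenters


# Hypothetical example: channel 2 computes a wrong actuator command.
command, faulty = majority_vote([7, 7, 9, 7])
print(command, faulty)
```

The wrong output never reaches the effector because the voted value is used instead, so the fault is masked in the same cycle in which it occurs. The cost is that every channel must be powered and computing all the time.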

Mars Advanced Fault Tolerant Computing Approach: Future Manned Mars Missions • Parallel-Hybrid Redundancy will be the basis for future long-life deep space missions: • It combines the attractive features of parallel processing and redundant computation • Computational elements can be arranged to provide high throughput, ultra-high reliability or a combination of the two, depending on the mission phase

Mars Advanced Fault Tolerant Computing Approach: Future Manned Mars Missions • Parallel-Hybrid Redundancy was first used in 1979, when the Fault Tolerant Multi-Processor (FTMP) was designed and built: • FTMP used a conventional shared-memory multiprocessor architecture • Each virtual processor consisted of three real processors working as a triad to provide real-time fault masking • Upon detection of a fault in a processor, the faulty unit is replaced from a pool of spares
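The triad-plus-spares idea can be sketched as follows: three processors compute the same result, a vote masks a single wrong answer, and the disagreeing processor is then swapped for a spare. The class, the processor names `P1` to `P5` and the `compute` callback are hypothetical and only illustrate the principle; they are not FTMP's design.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Triad:
    """One virtual processor built from three real processors plus spares."""
    members: List[str]
    spares: List[str] = field(default_factory=list)

    def step(self, compute: Callable[[str], int]) -> int:
        results = {m: compute(m) for m in self.members}
        voted, votes = Counter(results.values()).most_common(1)[0]
        if votes < 2:
            raise RuntimeError("no majority among the triad members")
        # Mask first (the voted value is returned), then repair the triad.
        for i, member in enumerate(self.members):
            if results[member] != voted and self.spares:
                self.members[i] = self.spares.pop(0)
        return voted


# Hypothetical fault: processor P2 returns a wrong value.
def compute(processor: str) -> int:
    return 0 if processor == "P2" else 42


triad = Triad(members=["P1", "P2", "P3"], spares=["P4", "P5"])
result = triad.step(compute)
print(result, triad.members, triad.spares)
```

Voting gives the real-time masking, and the spare pool restores the triad so that it can survive a later, second fault. That combination is what makes the redundancy "hybrid".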

Mars Advanced Fault Tolerant Computing Approach: Future Manned Mars Missions • Parallel-Hybrid Redundancy, as implemented in FTMP, had certain drawbacks: • It was not explicitly designed to meet the rigorous requirements of Byzantine resilience (the correctly functioning components of a Byzantine fault-tolerant system reach the same group decisions regardless of what the Byzantine-faulty components do), which is necessary to provide: • Coverage of random hardware faults • Ultra-high reliability • Ease of validation • It lacked ease of expandability due to redundant bus connections between processors and main memory • It did not support mixed redundancy because processors are arranged to work in triads regardless of the criticality of the application
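For orientation, the usual minimum resource counts for Byzantine resilience can be written down in a few lines. These bounds come from the general Byzantine agreement literature for unauthenticated messages, not from the slides, and the function names are illustrative.

```python
from typing import Dict


def byzantine_minimums(f: int) -> Dict[str, int]:
    """Classical minimums for tolerating f simultaneous Byzantine faults.

    Literature bounds (an assumption relative to the slides): 3f + 1 fault
    containment regions, 2f + 1 disjoint paths between them, and f + 1
    rounds of message exchange.
    """
    if f < 0:
        raise ValueError("f must be non-negative")
    return {
        "fault_containment_regions": 3 * f + 1,
        "disjoint_paths": 2 * f + 1,
        "exchange_rounds": f + 1,
    }


def max_tolerable_faults(fcrs: int) -> int:
    """Largest f such that 3f + 1 <= fcrs."""
    return max((fcrs - 1) // 3, 0)


print(byzantine_minimums(1))
print(max_tolerable_faults(3), max_tolerable_faults(4))
```

By these bounds a triad alone (three regions) cannot guarantee agreement in the presence of even one Byzantine fault, while four or more regions can. This is one way to read the first drawback listed above.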

Mars Advanced Fault Tolerant Computing Approach: Future Manned Mars Missions • To solve the deficiencies of FTMP, a new architecture called the Fault Tolerant Parallel Processor (FTPP) was conceived • It meets the Byzantine resilience requirements for covering random hardware faults • FTPP will be the basis of fault tolerance for future manned Mars missions • Figure: FTPP Architecture

Mars Advanced Fault Tolerant Computing Approach: Features of FTPP – Parallel Processing • Parallel processing is provided by: • 40 Processing Elements (PEs) in 5 Fault Containment Regions (FCRs) • 2 Input/Output Controllers (IOCs) per FCR • Figure: FTPP Architecture

Mars Advanced Fault Tolerant Computing Approach: Features of FTPP – Scalable Performance • Increasing the number of PEs in a single cluster creates a communication bottleneck in the Network Elements (NEs) • FTPP relies on a hierarchical approach to scaling performance, assembling clusters via IOCs • Figure: FTPP Architecture

Mars Advanced Fault Tolerant Computing Approach: Features of FTPP – Mixed Redundancy • Most fault-tolerant computers are designed to operate in a redundant mode only, which wastes resources on non-critical tasks • FTPP allows the processing elements to be configured as: • Simplex: non-critical tasks • Triplex: tasks that require real-time fault masking • Quadruplex or higher: when two or more sequential faults must be tolerated in a small time window without the benefit of reconfiguration • In the figure: • 4 quads • 3 triplexes • 15 simplexes • Figure: FTPP Architecture
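The mix shown in the figure can be checked against the 40 PEs quoted for the cluster. The bookkeeping below is only an illustrative sketch; the dictionary layout and function name are assumptions.

```python
from typing import Dict

# PEs consumed by one virtual group at each redundancy level on the slide.
PES_PER_GROUP: Dict[str, int] = {"simplex": 1, "triplex": 3, "quadruplex": 4}

TOTAL_PES = 40  # from the FTPP slides: 40 PEs in 5 FCRs


def pes_required(config: Dict[str, int]) -> int:
    """Number of PEs needed for a mix of virtual groups."""
    return sum(PES_PER_GROUP[level] * count for level, count in config.items())


# The configuration listed for the figure: 4 quads, 3 triplexes, 15 simplexes.
figure_config = {"quadruplex": 4, "triplex": 3, "simplex": 15}
used = pes_required(figure_config)
groups = sum(figure_config.values())
print(f"{used} of {TOTAL_PES} PEs used by {groups} virtual groups")
```

The configuration uses exactly 16 + 9 + 15 = 40 PEs and yields 22 independently schedulable virtual groups. An all-triplex arrangement of the same hardware would yield at most 13, which illustrates the resource argument for mixed redundancy.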

Mars Advanced Fault Tolerant Computing Approach: Features of FTPP – Dynamic Reconfiguration • A mission consists of several phases, such as launch, ascent, cruise from Earth orbit to Mars, Mars orbit injection and Mars landing • From phase to phase, the throughput, latency, iteration rate and criticality requirements change over a wide range, so the architecture must be flexible • Reconfiguration from high throughput to high reliability: • 3 PEs operating as independent simplex elements can be synchronized to run the same task (e.g., S2, S3, S13) • Replacing failed members: • A simplex in the same FCR as the failed member is synchronized with the non-failed members of the virtual group (e.g., if channel A of Q1 fails, S2, S7 or S12 can replace it) • Figure: FTPP Architecture
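The replacement rule can be sketched as a lookup for a free simplex in the failed member's FCR. The data model is hypothetical: the member names `Q1-A` to `Q1-D` and the `PE` class are invented for illustration, and only the fact that S2, S7 and S12 share an FCR with channel A of Q1 comes from the slide.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PE:
    name: str
    fcr: str
    group: Optional[str] = None   # None means the PE runs as a simplex
    failed: bool = False


def replace_failed(pes: List[PE], failed_name: str) -> Optional[str]:
    """Retire a failed group member and recruit a simplex from the same FCR.

    Returns the name of the recruited PE, or None if no simplex is available.
    """
    failed = next(p for p in pes if p.name == failed_name)
    failed.failed = True
    group, failed.group = failed.group, None
    for candidate in pes:
        if candidate.fcr == failed.fcr and candidate.group is None and not candidate.failed:
            candidate.group = group   # synchronize with the surviving members
            return candidate.name
    return None


pes = [
    PE("Q1-A", "A", "Q1"), PE("Q1-B", "B", "Q1"),
    PE("Q1-C", "C", "Q1"), PE("Q1-D", "D", "Q1"),
    PE("S2", "A"), PE("S7", "A"), PE("S12", "A"),
]
recruited = replace_failed(pes, "Q1-A")
print(recruited, sorted(p.name for p in pes if p.group == "Q1"))
```

Choosing the replacement from the same FCR keeps each member of the virtual group in a different fault containment region, which is presumably why the slide restricts the candidates to S2, S7 and S12.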

Mars Advanced Fault Tolerant Computing Approach: Features of FTPP – Low Fault Tolerance Overhead • Frequent fault-tolerance functions such as fault/error detection, error masking (voting) and synchronization are implemented in the Network Element • Less frequent functions such as identification of faulty modules, reconfiguration and reintegration are implemented in software that executes on the PEs • Each NE services 8 PEs • Figure: FTPP Architecture

Mars Advanced Fault Tolerant Computing Approach: Features of FTPP – Open Architecture • FTPP provides an open architecture for both hardware and software, including: • Processors • I/O modules • Fiber optic links • Operating systems • Figure: FTPP Architecture

Mars Advanced Fault Tolerant Computing Approach: Features of FTPP – Small Physical Size • Packaging technology is the key to meeting the weight, volume and power requirements • Multi-Chip Modules (MCMs) will be used: • An NE fits on a single MCM of less than 4 cm² • Figure: FTPP Architecture

Conclusion • Future manned deep space missions will require reliable operation over years and real-time masking of critical faults • Current approaches are not sufficient, and a new fault-tolerant approach is needed • FTPP is a strong candidate for the spacecraft that will carry humans to Mars

References • Benjamin, A.L.; Lala, J.H. Advanced fault tolerant computing for future manned space missions. 16th AIAA/IEEE Digital Avionics Systems Conference (DASC), 26-30 Oct. 1997, vol. 2, pp. 8.5-26 to 8.5-32 • NASA, Computers in Spaceflight: The NASA Experience, http://www.hq.nasa.gov/office/pao/History/computers/Ch6-2.html • NASA Jet Propulsion Laboratory, Voyager: The Interstellar Mission, http://voyager.jpl.nasa.gov/spacecraft/index.html
