Dynamic Run-time Fault Tolerance in Multi-Processor System on Chips

Prem Kumar Ramesh, Department of Electrical and Computer Engineering

Deep Sub Micron Era • Shrinking Transistors • Feature Size < 90 nm • Billion Device Processors • High Performance ICs • Multi-Processor System on Chip

Multi-Processor System on Chip • Tens of processors on a single chip • Much harder than a single-processor system • Processor Configuration • Communication and Synchronization • Poses a challenge to reliability!

Background – Fault Model • Duration • Transient • Permanent • Location • Processing Element • Network on Chip • Time to Failure • Before-Shelf • After-Shelf • Graceful Degradation
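The three axes of the fault model above can be written down as a small type. This is only an illustrative sketch: the names `Duration`, `Location`, `Onset`, `Fault`, and `describe` are hypothetical and do not come from the slides.

```cpp
#include <string>

// Hypothetical encoding of the fault taxonomy from the slide.
enum class Duration { Transient, Permanent };
enum class Location { ProcessingElement, NetworkOnChip };
enum class Onset { BeforeShelf, AfterShelf };  // "time to failure" axis

struct Fault {
    Duration duration;
    Location location;
    Onset onset;
};

// Build a short human-readable label for a fault, one word per axis.
std::string describe(const Fault& f) {
    std::string s = (f.duration == Duration::Transient) ? "transient" : "permanent";
    s += (f.location == Location::ProcessingElement) ? " PE fault" : " NoC fault";
    s += (f.onset == Onset::BeforeShelf) ? " (before-shelf)" : " (after-shelf)";
    return s;
}
```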

Previous Works • Static Redundancy Approach • N-copies of same program on different PEs • Majority Voting • Not very efficient! • Run-time Recovery Approach • Checker Processor is assigned to each Processor • Checker ‘commits’ only when the result matches with PE • If not, the task gets re-assigned to some other PE
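As a rough illustration of the static redundancy approach, the voter below takes the results produced by N copies of the same program and accepts a value only if a strict majority agrees. The function name `majority_vote` and the use of `int` results are assumptions made for this sketch; the slides do not specify a voter implementation.

```cpp
#include <cstddef>
#include <vector>

// Majority voter over the results of N redundant copies (hypothetical helper).
// Writes the agreed value to `winner` and returns true only when more than
// half of the copies produced it; returns false when no strict majority exists.
bool majority_vote(const std::vector<int>& results, int& winner) {
    // Pass 1 (Boyer-Moore): find the only value that could be a majority.
    int candidate = 0;
    std::size_t count = 0;
    for (int r : results) {
        if (count == 0) {
            candidate = r;
            count = 1;
        } else if (r == candidate) {
            ++count;
        } else {
            --count;
        }
    }
    // Pass 2: confirm that the candidate really holds a strict majority.
    std::size_t votes = 0;
    for (int r : results) {
        if (r == candidate) ++votes;
    }
    if (votes * 2 > results.size()) {
        winner = candidate;
        return true;
    }
    return false;
}
```

Every task costs N executions even when no fault occurs, which is the inefficiency the slide points to.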

Proposed Work • Extends the run-time recovery approach • Dynamic • Resource Utilization • Graceful Degradation • Combines two models • Hardware model • Software model

Hardware Model • Dynamically allocates checkers to PEs • Commits only when both PEs agree • Detects and corrects transient faults • If one PE of a pair fails, the other can be re-allocated to some other PE, allowing graceful degradation
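One possible reading of the hardware model, sketched in plain C++ rather than SystemC: a worker and a checker are picked from the healthy PEs at run time, the result is committed only when both agree, and a mismatch triggers re-execution. The `PE` struct, `allocate_pair`, `execute_checked`, and the retry limit are all assumptions for illustration, not the author's design.

```cpp
#include <functional>
#include <vector>

// Hypothetical PE model: `run` computes a task result and may return a
// corrupted value to emulate a fault; `healthy` is cleared on permanent failure.
struct PE {
    int id;
    bool healthy;
    std::function<int(int)> run;
};

// Pick the first two healthy PEs as (worker, checker). Returns false when
// fewer than two healthy PEs remain, so checked execution is not possible.
bool allocate_pair(std::vector<PE>& pes, PE*& worker, PE*& checker) {
    worker = nullptr;
    checker = nullptr;
    for (PE& pe : pes) {
        if (!pe.healthy) continue;
        if (worker == nullptr) {
            worker = &pe;
        } else {
            checker = &pe;
            return true;
        }
    }
    return false;
}

// Run the task on a dynamically chosen pair and commit only on agreement.
// A mismatch is treated as a transient fault and the task is re-executed,
// up to max_attempts times in total.
bool execute_checked(std::vector<PE>& pes, int input, int max_attempts, int& committed) {
    for (int attempt = 0; attempt < max_attempts; ++attempt) {
        PE* worker = nullptr;
        PE* checker = nullptr;
        if (!allocate_pair(pes, worker, checker)) return false;
        const int a = worker->run(input);
        const int b = checker->run(input);
        if (a == b) {
            committed = a;  // both PEs agree: commit
            return true;
        }
    }
    return false;
}
```

Because the pair is chosen on every attempt, marking a PE as unhealthy causes its former partner to be paired with another healthy PE on the next call, which is where graceful degradation comes from in this sketch.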

Software Model • Addresses Permanent Faults • SPMD (Single Program, Multiple Data) suits the situation • MPI-based approach • Splitter-Parallel Tasks-Joiner • In case of a permanent fault, only the data associated with that task needs to be migrated, as all PEs work on the same program
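The splitter, parallel tasks, and joiner structure can be emulated without MPI to show why only data has to move. The sketch below runs sequentially and uses a sum as a stand-in kernel; `split`, `kernel`, `run_spmd`, and the "next healthy PE" migration policy are hypothetical choices, not taken from the slides.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Splitter: divide data into num_pes nearly equal contiguous chunks.
std::vector<std::vector<int>> split(const std::vector<int>& data, std::size_t num_pes) {
    assert(num_pes > 0);
    std::vector<std::vector<int>> chunks(num_pes);
    for (std::size_t i = 0; i < data.size(); ++i) {
        chunks[i * num_pes / data.size()].push_back(data[i]);
    }
    return chunks;
}

// The single program that every PE runs on its own chunk (SPMD).
long long kernel(const std::vector<int>& chunk) {
    return std::accumulate(chunk.begin(), chunk.end(), 0LL);
}

// Parallel stage plus joiner, emulated sequentially. pe_ok[p] == false marks a
// permanently faulty PE; its chunk is migrated to the next healthy PE.
// executed_on[p] records which PE ended up processing chunk p.
// Returns false when no healthy PE is left.
bool run_spmd(const std::vector<int>& data, const std::vector<bool>& pe_ok,
              long long& total, std::vector<std::size_t>& executed_on) {
    const std::size_t n = pe_ok.size();
    if (n == 0) return false;
    const std::vector<std::vector<int>> chunks = split(data, n);
    total = 0;
    executed_on.assign(n, 0);
    for (std::size_t p = 0; p < n; ++p) {
        std::size_t target = p;
        std::size_t hops = 0;
        while (!pe_ok[target] && hops < n) {
            target = (target + 1) % n;  // migrate the data, not the program
            ++hops;
        }
        if (!pe_ok[target]) return false;
        executed_on[p] = target;
        total += kernel(chunks[p]);  // joiner: combine partial results
    }
    return true;
}
```

Since every PE already holds the same program, the migration step only reassigns a chunk index; in an MPI setting this would correspond to sending that chunk to another rank.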

Things to Explore Further • MPSoC with Heterogeneous Processors • Simultaneous Multiple Application Processing • Recovering from Control Faults

Simulation Framework • SystemC to model the framework • C/C++ for the application to be mapped

Expected Result • Achieve Run-time Dynamic Fault-Recovery with negligible performance (speed-up) cost • Better Resource Utilization • Achieve graceful degradation

Time Line • First and Second Week • Literature Survey • Third Week • Design of the models • Fourth and Fifth Week • Implementation, Coding and Debugging

