Dynamic Run-time Fault Tolerance in Multi-Processor System on Chips Prem Kumar Ramesh Department of Electrical and Computer Engineering
Deep Sub Micron Era • Shrinking Transistors • Feature Size < 90 nm • Billion Device Processors • High Performance ICs • Multi-Processor System on Chip
Multi-Processor System on Chip • Tens of Processors on a single chip • Much harder to design than a single-processor system • Processor Configuration • Communication and Synchronization • Poses a challenge to reliability!
Background – Fault Model • Duration • Transient • Permanent • Location • Processing Element • Network on Chip • Time to Failure • Before-Shelf • After-Shelf • Graceful Degradation
Previous Works • Static Redundancy Approach • N copies of the same program run on different PEs • Majority Voting • Not very efficient! • Run-time Recovery Approach • A Checker Processor is assigned to each Processor • Checker ‘commits’ only when its result matches that of the PE • If not, the task gets re-assigned to some other PE
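The majority-voting step of the static redundancy approach can be sketched as follows. This is a minimal illustration, not the method of any cited work: the function name `majority_vote` and the use of `int` results are assumptions, and the vote is implemented with the Boyer-Moore majority algorithm followed by a verification pass.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical helper: each element of `results` is the output of one
// redundant copy of the program running on its own PE.
// Returns the value held by a strict majority, or std::nullopt if none exists.
std::optional<int> majority_vote(const std::vector<int>& results) {
    if (results.empty()) return std::nullopt;

    // Pass 1 (Boyer-Moore): find the only possible majority candidate.
    int candidate = results[0];
    std::size_t count = 0;
    for (int r : results) {
        if (count == 0) {
            candidate = r;
            count = 1;
        } else if (r == candidate) {
            ++count;
        } else {
            --count;
        }
    }

    // Pass 2: confirm the candidate really has more than half of the votes.
    std::size_t occurrences = 0;
    for (int r : results) {
        if (r == candidate) ++occurrences;
    }
    assert(occurrences <= results.size());
    if (occurrences * 2 > results.size()) return candidate;
    return std::nullopt;
}
```

With three copies, one faulty result is outvoted, but every task costs three executions, which is the inefficiency the slide points out.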
Proposed Work • Extends the run-time recovery approach • Dynamic • Resource Utilization • Graceful Degradation • Combines two models • Hardware model • Software model
Hardware Model • Dynamically allocate checkers to PEs • Commits only when both PEs agree • Detects and Corrects Transient Faults • If one PE of a pair fails, the other can be re-allocated to some other PE, allowing graceful degradation
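The commit rule of the hardware model can be sketched in software as below. This is only an illustration under stated assumptions: `Executor`, `CommitResult`, and `run_with_checker` are hypothetical names, and re-executing the task after a mismatch (up to `max_attempts` times) is an assumed recovery policy for transient faults, not something the slides specify.

```cpp
#include <cassert>
#include <functional>
#include <optional>

// Outcome of running one task on a PE/checker pair.
struct CommitResult {
    std::optional<int> value;  // committed value; empty if the pair never agreed
    int attempts = 0;          // number of executions performed
};

// Hypothetical model of a processor: computes a result for `input`.
// `attempt` lets a test inject a fault on a specific execution.
using Executor = std::function<int(int input, int attempt)>;

// Run the task on both units and commit only when their results agree.
// A mismatch is treated as a transient fault and the task is re-executed.
CommitResult run_with_checker(const Executor& pe, const Executor& checker,
                              int input, int max_attempts) {
    assert(max_attempts > 0);
    CommitResult out;
    for (int attempt = 0; attempt < max_attempts; ++attempt) {
        ++out.attempts;
        const int pe_result = pe(input, attempt);
        const int checker_result = checker(input, attempt);
        if (pe_result == checker_result) {
            out.value = pe_result;  // both agree: commit
            return out;
        }
    }
    return out;  // never agreed: likely a permanent fault, left to the caller
}
```

A pair that never agrees within the retry budget is a natural trigger for re-allocating the surviving unit to another PE.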
Software Model • Addresses Permanent Faults • SPMD (Single Program, Multiple Data) suits the situation • MPI-based approach • Splitter-Parallel Tasks-Joiner • In case of a permanent fault, only the data associated with that task needs to be migrated, as all PEs run the same program
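The splitter, parallel tasks, and joiner structure, together with data-only migration, can be sketched as a sequential simulation. This is not MPI code and not the author's implementation: `split`, `kernel`, `run_spmd`, and `SpmdResult` are hypothetical names, the kernel (a sum of squares) is an arbitrary stand-in for the application, and the round-robin choice of a replacement PE is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Splitter: distribute the data round-robin, one chunk per PE.
std::vector<std::vector<int>> split(const std::vector<int>& data,
                                    std::size_t num_pes) {
    assert(num_pes > 0);
    std::vector<std::vector<int>> chunks(num_pes);
    for (std::size_t i = 0; i < data.size(); ++i) {
        chunks[i % num_pes].push_back(data[i]);
    }
    return chunks;
}

// The single program that every PE runs (placeholder: sum of squares).
long kernel(const std::vector<int>& chunk) {
    long sum = 0;
    for (int v : chunk) sum += static_cast<long>(v) * v;
    return sum;
}

struct SpmdResult {
    long total = 0;                          // joined result
    std::vector<std::size_t> host_of_task;   // PE that actually ran each task
};

// pe_alive[i] == false models a permanent fault on PE i.
SpmdResult run_spmd(const std::vector<int>& data,
                    const std::vector<bool>& pe_alive) {
    const std::size_t num_pes = pe_alive.size();
    std::vector<std::size_t> healthy;
    for (std::size_t i = 0; i < num_pes; ++i) {
        if (pe_alive[i]) healthy.push_back(i);
    }
    assert(!healthy.empty());

    const auto chunks = split(data, num_pes);
    SpmdResult result;
    std::size_t next = 0;  // round-robin cursor over healthy PEs
    for (std::size_t task = 0; task < num_pes; ++task) {
        std::size_t host = task;
        if (!pe_alive[task]) {
            // Only the chunk moves; the host already holds the same program.
            host = healthy[next % healthy.size()];
            ++next;
        }
        result.host_of_task.push_back(host);
        result.total += kernel(chunks[task]);  // Joiner: accumulate partials
    }
    return result;
}
```

Because the program is identical on every PE, the joined result is the same whether or not a PE has failed; only the task-to-PE mapping changes.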
Things to Explore Further • MPSoC with Heterogeneous Processors • Simultaneous Multiple Application Processing • Recovering from Control Faults
Simulation Framework • SystemC to model the Framework • C/C++ for the Application to be mapped
Expected Result • Achieve Run-time Dynamic Fault-Recovery with negligible loss of performance (speed-up) • Better Resource Utilization • Achieve graceful degradation
Time Line • First and Second Week • Literature Survey • Third Week • Design of the models • Fourth and Fifth Week • Implementation, Coding and Debugging
References [1] Xinping Zhu and Wei Qin, “Prototyping a Fault-Tolerant Multiprocessor SoC with Run-time Fault Recovery,” DAC 2006, July 24-28, 2006, San Francisco, California, USA. [2] Grant Martin, “Overview of the MPSoC Design Challenge,” DAC 2006, July 24-28, 2006, San Francisco, California, USA. [3] Peter Flake, Simon Davidmann, and Frank Schirrmeister, “System-Level Exploration Tools for MPSoC Designs,” DAC 2006, July 24-28, 2006, San Francisco, California, USA.