A SAFE Approach to Early DSE of Fault-tolerant Multimedia MPSoCs

Peter van Stralen, Andy Pimentel
University of Amsterdam

Contents
• Introduction
• Sesame environment
• Sesame Automated Fault-tolerant Explorer (SAFE)
• Experiments
• Conclusion

Introduction

Fault-Tolerant Embedded Systems
• Platform Based Design
  • Deals with complexity by reusing components
  • MPSoC: Multi-Processor System-on-Chip
• Reliability
  • Soft errors: temporary malfunction
  • Single Event Upset: cosmic rays
  • NBTI: product wear-out
Shivakumar et al., “Modeling the effect of technology trends on the soft error rate of combinational logic”, DSN’02

MPSoC and Reliability
A fault-tolerance pattern is required to detect and handle errors.

Fault-Tolerance Patterns
• Detection
  • Detect system errors
  • Many flavors: active redundancy (DMR / TMR), assertions, …
• Recovery
  • Deal with system errors
  • Different policies: skip / restart, explicit checkpoint rate
• Many different design options
  • Different effects on reliability
  • Affect other objectives (such as performance, power and cost)
The selection of a fault-tolerance pattern is an early DSE parameter!
Example patterns (figure labels): TMR, DMR, TMR (restart), DMR (skip), ASSERT, TMR (restart) [1 check / frame]
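
To make the difference between the two active-redundancy flavors concrete, here is a minimal sketch of how a DMR comparison and a TMR majority vote could treat replica outputs. The function names and the use of byte strings as replica outputs are assumptions for illustration only; the slides do not describe how SAFE implements its patterns.

```python
from collections import Counter
from typing import Optional, Sequence


def dmr_check(outputs: Sequence[bytes]) -> Optional[bytes]:
    """DMR: two replicas. A mismatch detects an error but cannot mask it."""
    first, second = outputs
    return first if first == second else None


def tmr_vote(outputs: Sequence[bytes]) -> Optional[bytes]:
    """TMR: majority vote over three replicas masks a single faulty replica."""
    value, count = Counter(outputs).most_common(1)[0]
    return value if count >= 2 else None


# One corrupted replica: TMR still yields the correct output, DMR only detects.
print(tmr_vote([b"frame", b"frame", b"junk"]))  # b'frame'
print(dmr_check([b"frame", b"junk"]))           # None
```

A return value of None stands for "error detected, not masked", which is the point where a recovery policy such as skip or restart would take over.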

Sesame Environment

Sesame Overview

Sesame Framework
• MPSoC System Level Design
  • High level modeling
  • Optimized for speed
• Separation of concerns
  • Application model
  • Mapping layer
  • Architecture model
• Early Design Space Exploration
  • Relative accuracy instead of absolute accuracy
  • Identify promising set of mappings

Sesame Automated Fault-tolerant Explorer (SAFE)

Fault-Tolerant Sesame
• Modified System Model
  • Application model
  • Pattern layer
  • Mapping layer
  • Architecture layer
• Targets required for fault-tolerant systems
  • Frame based processing
  • Deadline for each frame

Fault-Tolerant Sesame overview
• Normal processes: processes without interaction with the outside world. In contrast to OWPs, they can be restarted without the restart being noticed.
• OWPs: Outside World Processes explicitly describe the interaction with the outside world.

Fault-Tolerant Mapping
• Patternization
  • Assign processes to fault-tolerance patterns
• Computational Binding
  • Assign architectural resources to patterns
• I/O Binding
  • Bind OWPs to architectural resources
• Dispatcher
  • Map communication
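
The four mapping steps can be pictured as one data structure plus a consistency check. The sketch below is only an illustration: the class names, the process and resource names ("src", "filter", "sink", "p0", "bus0"), and the rule that every replica of a pattern needs its own processor are assumptions, not details taken from SAFE.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Replicas per active-redundancy pattern (DMR = 2 copies, TMR = 3 copies).
REPLICAS = {"DMR": 2, "TMR": 3}


@dataclass
class Pattern:
    kind: str                     # "DMR" or "TMR"
    processes: List[str]          # patternization: processes guarded by this pattern
    processors: List[str] = field(default_factory=list)  # computational binding


@dataclass
class FaultTolerantMapping:
    patterns: List[Pattern]
    io_binding: Dict[str, str]              # I/O binding: OWP -> architectural resource
    dispatcher: Dict[Tuple[str, str], str]  # dispatcher: (producer, consumer) -> bus


def check_mapping(m: FaultTolerantMapping, normal: List[str], owps: List[str]) -> List[str]:
    """Return a list of problems; an empty list means the mapping is consistent."""
    problems = []
    guarded = [p for pat in m.patterns for p in pat.processes]
    for proc in normal:
        if guarded.count(proc) != 1:
            problems.append(f"{proc}: must belong to exactly one pattern")
    for pat in m.patterns:
        # Assumption: each replica of a pattern runs on a distinct processor.
        if len(set(pat.processors)) < REPLICAS[pat.kind]:
            problems.append(f"{pat.kind} over {pat.processes}: too few processors")
    for owp in owps:
        if owp not in m.io_binding:
            problems.append(f"{owp}: OWP has no I/O binding")
    return problems


# Hypothetical pipeline: "src" and "sink" are OWPs, "filter" is a normal process.
example = FaultTolerantMapping(
    patterns=[Pattern("TMR", ["filter"], ["p0", "p1", "p2"])],
    io_binding={"src": "p3", "sink": "p3"},
    dispatcher={("src", "filter"): "bus0", ("filter", "sink"): "bus1"},
)
print(check_mapping(example, normal=["filter"], owps=["src", "sink"]))  # []
```

Keeping the four steps as separate fields mirrors the separation on the slide: a design space explorer could vary the patternization while leaving the I/O binding fixed, or the other way around.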

Fault-Tolerant Mapping: Patternization

Fault-Tolerant Mapping: Computational Binding

Fault-Tolerant Mapping: I/O Binding

Fault-Tolerant Mapping: Dispatcher
• Internal communication
• External communication

Simulation Extensions I
• Fault Injection
  • Assumption: fault-tolerant network
  • Software Implemented Fault Injection (SWIFI) method
  • Exponential random distribution
  • Currently only transient failures
  • A failure affects only the current, active frame
  • Invalidates the rest of the frame
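
The fault-injection idea above can be sketched as follows: fault arrival times are drawn with exponentially distributed gaps, and every frame in which a fault lands is marked as invalidated. This is a simplified illustration, not the simulator's actual code. The function name, the single global fault timeline, and the fixed frame duration are all assumptions.

```python
import random
from typing import Set


def inject_faults(num_frames: int, frame_time: float,
                  failure_rate: float, seed: int = 0) -> Set[int]:
    """Return the indices of frames hit by at least one transient fault.

    Gaps between faults are exponentially distributed with the given rate
    (faults per time unit). A fault only affects the frame that is active
    when it occurs; later frames start clean again.
    """
    rng = random.Random(seed)          # seeded for reproducible experiments
    horizon = num_frames * frame_time
    hit: Set[int] = set()
    t = rng.expovariate(failure_rate)  # time of the first fault
    while t < horizon:
        hit.add(min(int(t // frame_time), num_frames - 1))
        t += rng.expovariate(failure_rate)
    return hit


# Hypothetical run: 100 frames of 1 time unit each, 0.05 faults per time unit.
faulty = inject_faults(num_frames=100, frame_time=1.0, failure_rate=0.05)
print(sorted(faulty))
```

Because only transient failures are modeled, the sketch keeps no state between frames: a frame is either hit or not, and what happens to a hit frame is left to the detection and recovery pattern.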

Simulation Extensions II
• Fault Detection
• Fault masking
  • Fault-tolerance pattern can deduce correct output
• Fault correction
  • Skip frame
  • Restart (possibly from checkpoint)
  • Overhead of fault correction modeled (time, energy, memory)

Simulation Extensions III
• Checkpoint modeling
  • Full simulation of checkpoints (generation and usage)
  • Overhead taken into account (time, energy, buffers)
• Explicit checkpoint
  • Periodically obtained checkpoints
• Implicit checkpoint
  • Stateless point between frames
  • Any state between frames must be modeled explicitly
Figure label: Restart from CA.4.1, CB.3.1
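
How skip, restart, and explicit checkpoints interact with a frame deadline can be illustrated with a toy model. The sketch below counts time in abstract integer units and is an assumption-laden illustration rather than SAFE's checkpoint model: the function name, its parameters, and the rule that a fault is noticed at the end of the next work unit are all hypothetical.

```python
from typing import List, Optional, Tuple


def run_frame(work: int, deadline: int, fault_at: List[int], restarts: int,
              chk_every: Optional[int] = None, chk_cost: int = 0) -> Tuple[str, int]:
    """Process one frame and return (status, elapsed time).

    work      -- work units needed for the frame (one unit takes one time unit)
    fault_at  -- sorted elapsed times at which a transient fault strikes
    restarts  -- restart budget; 0 models the skip policy
    chk_every -- take an explicit checkpoint every chk_every work units
    chk_cost  -- time overhead of taking one checkpoint
    """
    elapsed = 0    # time spent on this frame so far
    saved = 0      # work units protected by the last checkpoint
    progress = 0   # work units completed in the current attempt
    faults = list(fault_at)
    while progress < work:
        elapsed += 1
        progress += 1
        if faults and faults[0] <= elapsed:
            faults.pop(0)
            if restarts == 0:
                return "dropped", elapsed   # skip policy or budget exhausted
            restarts -= 1
            progress = saved                # roll back to the last checkpoint
            continue
        if chk_every and progress % chk_every == 0 and progress < work:
            elapsed += chk_cost             # checkpoint overhead
            saved = progress
    return ("ok" if elapsed <= deadline else "deadline_miss"), elapsed


print(run_frame(10, 15, [6], restarts=0))                           # ('dropped', 6)
print(run_frame(10, 15, [6], restarts=1))                           # ('deadline_miss', 16)
print(run_frame(10, 15, [6], restarts=1, chk_every=5, chk_cost=1))  # ('ok', 12)
```

In this toy model a restart from the start of the frame misses the deadline, while a restart from a checkpoint meets it. The same checkpoint costs one extra time unit when no fault occurs, so with a tight deadline the overhead alone can cause a miss.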

Experiments

Experimental set-up
• Applications
  • Motion-JPEG
  • Sobel Edge Detector
  • MP3 Decoder
• Architecture
  • Shared memory
  • Active redundancy with different policies
  • 2 buses
  • 4 processors with 10^-6 FIT
• Exhaustive exploration of limited design space
  • 86 patterns
Example patterns (figure labels): TMR (skip), DMR (skip), TMR (1 restart), DMR (3 restart), TMR (2 restart) [1 Chk], DMR (1 restart) [2 Chk]

Fault-tolerant DSE: MJPEG
• DMR vs TMR
  • DMR requires less power
  • TMR not beneficial for frame drop ratio
• Restarting
  • Requires more power
  • Beneficial for frame drop ratio (31% to 0%)
• Sudden increase in drop ratio
  • Cannot keep pace (more deadline misses)
  • Side effect: less power

Fault-tolerant DSE: Sobel
• DMR vs TMR
  • DMR requires less power
  • TMR beneficial for frame drop ratio (5% to 1.4%)
• Restarting
  • Requires more power
  • Beneficial for frame drop ratio (16% to 1.4%)

Fault-tolerant DSE: MP3
• DMR vs TMR
  • DMR requires less power
  • TMR not beneficial for frame drop ratio (with restart)
• Restarting
  • Requires more power
  • Not beneficial for frame drop ratio with TMR
• Checkpoint overhead
  • More deadline misses

Fault-tolerant DSE: Conclusion
• TMR takes significantly more power, but the drop ratio does not always improve significantly
• Restart effort (+power) always reduces the drop ratio, but the difference may be small
• Application-dependent behavior
• DSE is required for fault-tolerant design
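
One common way to compare design points with conflicting objectives such as power and frame drop ratio is a Pareto filter, which keeps only the points that no other point beats on both objectives. The slides do not state which selection method SAFE uses, so this is a generic sketch. The numbers below are made-up placeholders, not results from the experiments.

```python
from typing import Dict, List, Tuple


def pareto_front(points: Dict[str, Tuple[float, float]]) -> List[str]:
    """Keep design points that are not dominated in (power, drop ratio).

    Lower is better for both objectives. A point is dominated when another
    point is at least as good on both and strictly better on one.
    """
    front = []
    for name, (power, drop) in points.items():
        dominated = any(
            (q <= power and e <= drop) and (q < power or e < drop)
            for other, (q, e) in points.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return front


# Placeholder values (relative power, frame drop ratio); purely illustrative.
candidates = {
    "DMR (skip)": (1.0, 0.30),
    "DMR (restart)": (1.2, 0.05),
    "TMR (skip)": (1.5, 0.30),
    "TMR (restart)": (1.8, 0.04),
}
print(pareto_front(candidates))
# ['DMR (skip)', 'DMR (restart)', 'TMR (restart)']
```

With these placeholder values the TMR (skip) point drops out because DMR (skip) has the same drop ratio at lower power. This is the kind of situation the first bullet describes, where a more expensive pattern does not improve the drop ratio.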

Conclusion

Conclusion
• Fault tolerance should be a first-class citizen of early DSE
• SAFE is able to explore the fault-tolerant design space
• Custom patternization balances fault-handling capabilities against performance overhead
• More expensive fault-tolerance patterns are not always beneficial
• Future work
  • Large scale DSE
  • Mixed fault-tolerance patterns
  • Multiple applications

Questions?

Additional Slides

MJPEG Patternization
• Incremental Patternization
• Patternization effects
  • Equalized subnetworks
  • Minimized communication
  • Guard CPU-intensive tasks

Checkpoint overhead vs. benefit

Drop ratio versus #subnetworks
Trade-off between increased fault tolerance and the overhead of subnetworks

Restart budget
• Corrupt frames decrease due to the fault-masking effort
• Deadline misses only increase up to a certain limit (real restart usage <= budget)
