A SAFE Approach to Early DSE of Fault-tolerant Multimedia MPSoCs
Peter van Stralen, Andy Pimentel
University of Amsterdam
Contents
• Introduction
• Sesame environment
• Sesame Automated Fault-tolerant Explorer (SAFE)
• Experiments
• Conclusion
Introduction
Fault-Tolerant Embedded Systems
• Platform-based design
  • Deals with complexity by reusing components
  • MPSoC: Multi-Processor System-on-Chip
• Reliability
  • Soft errors: temporary malfunction
  • Single Event Upset: cosmic rays
  • NBTI: product wear-out
Shivakumar et al., “Modeling the effect of technology trends on the soft error rate of combinational logic”, DSN’02
MPSoC and Reliability
• A fault-tolerance pattern is required to detect and handle errors
Fault-Tolerance Patterns
• Detection
  • Detect system errors
  • Many flavors: active redundancy (DMR / TMR), assertions, …
• Recovery
  • Deal with system errors
  • Different policies: skip / restart, explicit checkpoint rate
• Many different design options
  • Different effects on reliability
  • Affect other objectives (such as performance, power and cost)
• Selection of the fault-tolerance pattern is an early DSE parameter!
[Figure: example patterns: TMR, DMR, TMR (restart), DMR (skip), ASSERT, TMR (restart) [1 check / frame]]
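The detection side of the active-redundancy patterns can be illustrated with a small sketch. It assumes that replica outputs can be compared for equality; the function names `dmr_detect` and `tmr_vote` are hypothetical and do not come from SAFE.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def dmr_detect(a: Hashable, b: Hashable) -> bool:
    """DMR: two replicas. A mismatch detects an error but cannot correct it."""
    return a != b


def tmr_vote(results: Sequence[Hashable]) -> Optional[Hashable]:
    """TMR: three replicas. Return the majority value, or None if all differ."""
    if len(results) != 3:
        raise ValueError("TMR expects exactly three replica results")
    value, count = Counter(results).most_common(1)[0]
    return value if count >= 2 else None
```

With DMR a detected mismatch still needs a recovery policy (skip or restart), whereas TMR can mask a single faulty replica by voting.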
Sesame Environment
Sesame Overview
Sesame Framework
• MPSoC system-level design
  • High-level modeling
  • Optimized for speed
• Separation of concerns
  • Application model
  • Mapping layer
  • Architecture model
• Early Design Space Exploration
  • Relative accuracy instead of absolute accuracy
  • Identify a promising set of mappings
Sesame Automated Fault-tolerant Explorer (SAFE)
Fault-Tolerant Sesame
• Modified system model
  • Application model
  • Pattern layer
  • Mapping layer
  • Architecture layer
• Targets required for fault-tolerant systems
  • Frame-based processing
  • Deadline for each frame
Fault-Tolerant Sesame overview
• Normal processes: processes that do not interact with the outside world. In contrast to OWPs, they can be restarted without the restart being noticed
• OWP processes: Outside World Processes explicitly describe interaction with the outside world
Fault-Tolerant Mapping
• Patternization
  • Assign processes to fault-tolerance patterns
• Computational binding
  • Assign architectural resources to patterns
• I/O binding
  • Bind OWP processes to architectural resources
• Dispatcher
  • Map communication
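One way to picture the first three mapping steps is as three lookup tables. The sketch below is an illustrative data structure only: the class name, field names and the process, pattern and processor names in the usage example are all hypothetical, and the dispatcher step (mapping communication) is left out.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FaultTolerantMapping:
    # Patternization: normal process name -> pattern instance name
    patternization: Dict[str, str] = field(default_factory=dict)
    # Computational binding: pattern instance name -> processor names
    computational_binding: Dict[str, List[str]] = field(default_factory=dict)
    # I/O binding: outside-world process (OWP) name -> architectural resource
    io_binding: Dict[str, str] = field(default_factory=dict)

    def unbound_patterns(self) -> List[str]:
        """Patterns that have processes assigned but no processors bound."""
        used = set(self.patternization.values())
        return sorted(p for p in used if not self.computational_binding.get(p))


# Hypothetical example: two processes, two patterns, one pattern still unbound.
mapping = FaultTolerantMapping(
    patternization={"dct": "tmr0", "vle": "dmr0"},
    computational_binding={"tmr0": ["p0", "p1", "p2"]},
    io_binding={"video_in": "p3"},
)
```

Here `mapping.unbound_patterns()` reports `dmr0`, which still needs a computational binding before the mapping is complete.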
Fault-Tolerant Mapping: Patternization
Fault-Tolerant Mapping: Computational Binding
Fault-Tolerant Mapping: I/O Binding
Fault-Tolerant Mapping: Dispatcher
• Internal communication
• External communication
Simulation Extensions I
• Fault injection
  • Assumption: fault-tolerant network
  • Software Initiated Fault Injection (SWIFI) method
  • Exponential random distribution
  • Currently only transient failures
    • A failure affects only the current, active frame
    • It invalidates the rest of the frame
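The exponential fault model can be sketched as follows. This is a minimal illustration, not SAFE's injector: it assumes a single processor, back-to-back frames of fixed length `frame_period`, and a constant fault rate `rate`; all names are hypothetical.

```python
import random
from typing import List


def sample_fault_times(rate: float, horizon: float, rng: random.Random) -> List[float]:
    """Draw transient-fault times in [0, horizon) with exponential inter-arrival gaps."""
    times: List[float] = []
    t = rng.expovariate(rate)
    while t < horizon:
        times.append(t)
        t += rng.expovariate(rate)
    return times


def invalidated_frames(fault_times: List[float], frame_period: float) -> List[int]:
    """Indices of frames hit by at least one fault.

    A transient fault invalidates only the rest of the frame that is active
    when it strikes, so later frames are unaffected.
    """
    return sorted({int(t // frame_period) for t in fault_times})
```

Passing a seeded `random.Random` instance keeps a simulation run reproducible, which helps when comparing design points under the same fault trace.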
Simulation Extensions II
• Fault detection
  • Fault masking
    • The fault-tolerance pattern can deduce the correct output
  • Fault correction
    • Skip frame
    • Restart (possibly from a checkpoint)
    • Overhead of fault correction is modeled (time, energy, memory)
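The difference between the skip and restart policies can be sketched for a single frame. The sketch assumes that a fault is detected at the end of an attempt and that every restart re-executes the whole frame (no checkpoints); `process_frame` and its parameters are hypothetical names, not part of SAFE.

```python
from typing import Tuple


def process_frame(exec_time: float, deadline: float,
                  faulty_attempts: int, restart_budget: int) -> Tuple[str, float]:
    """Return (outcome, elapsed time) for one frame.

    faulty_attempts: number of consecutive attempts corrupted by a fault.
    restart_budget:  maximum number of restarts (0 means the skip policy).
    """
    elapsed = 0.0
    restarts = 0
    remaining_faults = faulty_attempts
    while True:
        elapsed += exec_time           # run (or re-run) the frame
        if remaining_faults == 0:
            break                      # this attempt was fault-free
        remaining_faults -= 1
        if restarts == restart_budget:
            return "skipped", elapsed  # budget exhausted: drop the frame
        restarts += 1
    return ("ok" if elapsed <= deadline else "deadline_miss"), elapsed
```

For example, with `exec_time=10` and `deadline=25`, one faulty attempt is dropped under the skip policy, recovered with a budget of one restart, and two faulty attempts with a larger budget finish correctly but miss the deadline.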
Simulation Extensions III
• Checkpoint modeling
  • Full simulation of checkpoints (generation and usage)
  • Overhead taken into account (time, energy, buffers)
• Explicit checkpoint
  • Periodically obtained checkpoints
• Implicit checkpoint
  • Stateless point between frames
  • Any state between frames must be modeled explicitly
[Figure label: Restart from CA.4.1, CB.3.1]
Experiments
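The time trade-off of explicit checkpoints can be sketched with a simple cost model. It assumes one frame of `work` time units, a checkpoint after every `interval` units of work (none at the end of the frame), and at most one transient fault; the function and parameter names are hypothetical.

```python
import math
from typing import Optional


def frame_time(work: float, interval: float, ckpt_cost: float,
               fault_at: Optional[float] = None, restore_cost: float = 0.0) -> float:
    """Total time to finish a frame with periodic checkpoints.

    fault_at: work completed when a single fault is detected (None = no fault).
    On a fault, execution restarts from the most recent checkpoint.
    """
    if work <= 0 or interval <= 0:
        raise ValueError("work and interval must be positive")
    n_ckpt = math.ceil(work / interval) - 1       # checkpoints taken mid-frame
    total = work + n_ckpt * ckpt_cost             # fault-free time incl. overhead
    if fault_at is not None:
        last_ckpt = math.floor(fault_at / interval) * interval
        total += (fault_at - last_ckpt) + restore_cost  # redo the lost work
    return total
```

Setting `interval` equal to `work` models the implicit checkpoint at the frame boundary: no overhead without faults, but a fault costs all work done so far in the frame.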
Experimental set-up
• Applications
  • Motion-JPEG
  • Sobel edge detector
  • MP3 decoder
• Architecture
  • Shared memory
  • Active redundancy with different policies
  • 2 buses
  • 4 processors with 10^-6 FIT
• Exhaustive exploration of a limited design space: 86 patterns
[Figure: example patterns: TMR (skip), DMR (skip), TMR (1 restart), DMR (3 restart), TMR (2 restart) [1 Chk], DMR (1 restart) [2 Chk]]
Fault-tolerant DSE: MJPEG
• DMR vs TMR
  • DMR uses less power
  • TMR is not beneficial for the frame drop ratio
• Restarting
  • Requires more power
  • Beneficial for the frame drop ratio (31% to 0%)
• Sudden increase in drop ratio
  • Cannot keep pace (more deadline misses)
  • Side effect: less power
Fault-tolerant DSE: Sobel
• DMR vs TMR
  • DMR uses less power
  • TMR is beneficial for the frame drop ratio (5% to 1.4%)
• Restarting
  • Requires more power
  • Beneficial for the frame drop ratio (16% to 1.4%)
Fault-tolerant DSE: MP3
• DMR vs TMR
  • DMR uses less power
  • TMR is not beneficial for the frame drop ratio (with restart)
• Restarting
  • Requires more power
  • Not beneficial for the frame drop ratio with TMR
• Checkpoint overhead
  • More deadline misses
Fault-tolerant DSE: Conclusion
• TMR takes significantly more power, but the drop ratio does not always improve significantly
• Restart effort (+power) always reduces the drop ratio, but the difference may be small
• Application-dependent behavior
• DSE is required for fault-tolerant design
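A common way to read such power versus drop-ratio trade-offs is to keep only the non-dominated design points. The slides do not state that SAFE filters results this way, so the sketch below is an assumption, and the design-point values in the usage example are invented purely for illustration; they are not measurements from the experiments.

```python
from typing import List, Tuple

# (label, power, drop_ratio); both objectives are to be minimized
Point = Tuple[str, float, float]


def pareto_front(points: List[Point]) -> List[Point]:
    """Keep the points that no other point beats or ties on both objectives
    while being strictly better on at least one."""
    front: List[Point] = []
    for name, power, drop in points:
        dominated = any(
            (p2 <= power and d2 <= drop) and (p2 < power or d2 < drop)
            for _, p2, d2 in points
        )
        if not dominated:
            front.append((name, power, drop))
    return front


# Hypothetical design points (made-up numbers).
candidates = [
    ("A", 1.0, 0.30),
    ("B", 2.0, 0.05),
    ("C", 2.5, 0.05),  # same drop ratio as B at higher power: dominated
    ("D", 3.0, 0.00),
]
```

In this invented example, `pareto_front(candidates)` keeps A, B and D, mirroring the observation that a more expensive pattern is only worth keeping if it actually lowers the drop ratio.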
Conclusion
• Fault tolerance should be a first-class citizen of early DSE
• SAFE is able to explore the fault-tolerant design space
  • Custom patternization leverages fault handling capabilities and performance overhead
  • More expensive fault-tolerance patterns are not always beneficial
• Future work
  • Large-scale DSE
  • Mixed fault-tolerance patterns
  • Multiple applications
Questions?
Additional Slides
MJPEG Patternization
• Incremental patternization
• Patternization effects
  • Equalized subnetworks
  • Minimized communication
  • Guard CPU-intensive tasks
Checkpoint overhead vs. benefit
Drop ratio versus #subnetworks
• Trade-off between increased fault tolerance and the overhead of subnetworks
Restart budget
• Corrupt decreases due to the fault-masking effort
• Deadline misses only increase up to a certain limit (real restart usage <= budget)