
A SAFE Approach to Early DSE of Fault-tolerant Multimedia MPSoCs

Peter van Stralen, Andy Pimentel. University of Amsterdam.

Presentation Transcript


  1. A SAFE Approach to Early DSE of Fault-tolerant Multimedia MPSoCs. Peter van Stralen, Andy Pimentel. University of Amsterdam.

  2. Contents
     • Introduction
     • Sesame environment
     • Sesame Automated Fault-tolerant Explorer (SAFE)
     • Experiments
     • Conclusion

  3. Introduction

  4. Fault-Tolerant Embedded Systems
     • Platform Based Design
       • Deals with complexity by reusing components
       • MPSoC: Multi-Processor System-on-Chip
     • Reliability
       • Soft errors: temporary malfunction
         • Single Event Upset: cosmic rays
       • NBTI: product wear-out
     Source: Shivakumar et al., “Modeling the effect of technology trends on the soft error rate of combinational logic”, DSN’02

  5. MPSoC and Reliability
     • A fault-tolerance pattern is required to detect and handle errors

  6. Fault-Tolerance Patterns
     • Detection
       • Detect system errors
       • Many flavors: active redundancy (DMR / TMR), assertions, …
     • Recovery
       • Deal with system errors
       • Different policies: skip / restart, explicit checkpoint rate
     • Many different design options
       • Different effects on reliability
       • Affects other objectives (like performance, power and costs)
     • Selection of the fault-tolerance pattern is an early DSE parameter!
     Example patterns (figure labels): TMR, DMR, TMR (restart), DMR (skip), ASSERT, TMR (restart) [1 check / frame]
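The difference between the two active-redundancy flavors named on this slide can be illustrated with a minimal Python sketch. It is not taken from SAFE; the function names `dmr_check` and `tmr_vote` and the use of `None` as an "error detected" signal are assumptions made for illustration. DMR can only detect a mismatch between its two replicas, while TMR can also mask a single faulty replica by majority vote.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def dmr_check(replica_outputs: Sequence[Hashable]) -> Optional[Hashable]:
    """DMR: two replicas detect a mismatch but cannot tell which one is wrong."""
    first, second = replica_outputs
    # None is this sketch's (hypothetical) way of signalling "recovery needed".
    return first if first == second else None


def tmr_vote(replica_outputs: Sequence[Hashable]) -> Optional[Hashable]:
    """TMR: a majority vote masks one faulty replica out of three."""
    value, count = Counter(replica_outputs).most_common(1)[0]
    return value if count >= 2 else None
```

With DMR, any detected error has to be handled by a recovery policy (skip or restart); with TMR, recovery is only needed when no two replicas agree.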

  7. Sesame Environment

  8. Sesame Overview

  9. Sesame Framework
     • MPSoC System Level Design
       • High level modeling
       • Optimized for speed
     • Separation of concerns
       • Application model
       • Mapping layer
       • Architecture model
     • Early Design Space Exploration
       • Relative accuracy instead of absolute accuracy
       • Identify promising set of mappings

  10. Sesame Automated Fault-tolerant Explorer (SAFE)

  11. Fault-Tolerant Sesame
      • Modified System Model
        • Application model
        • Pattern layer
        • Mapping layer
        • Architecture layer
      • Targets required for fault-tolerant systems
        • Frame based processing
        • Deadline for each frame

  12. Fault-Tolerant Sesame overview
      • Normal processes: processes that do not interact with the outside world. In contrast to OWP processes, they can be restarted without the restart being noticed.
      • OWP processes: Outside World Processes explicitly describe the interaction with the outside world.

  13. Fault-Tolerant Mapping
      • Patternization
        • Assign processes to fault tolerance patterns
      • Computational Binding
        • Assign architectural resources to patterns
      • I/O Binding
        • Bind OWP processes to architectural resources
      • Dispatcher
        • Map communication
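The first three mapping steps can be pictured as a small data structure. The Python sketch below is illustrative only and does not reflect SAFE's actual data model: the class names, the `problems` check, the rule that each normal process belongs to exactly one pattern, and the rule that replicas need distinct resources are all assumptions. The dispatcher step (mapping communication) is left out.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Replica counts implied by the pattern names: DMR = 2 copies, TMR = 3 copies.
REPLICAS = {"DMR": 2, "TMR": 3}


@dataclass
class PatternInstance:
    kind: str                                            # "DMR" or "TMR" in this sketch
    processes: List[str]                                 # result of patternization
    resources: List[str] = field(default_factory=list)   # result of computational binding


@dataclass
class FaultTolerantMapping:
    patterns: List[PatternInstance]
    io_binding: Dict[str, str]                           # OWP process -> architectural resource

    def problems(self, normal_processes: List[str], owp_processes: List[str]) -> List[str]:
        """Return human-readable issues; an empty list means the mapping is complete."""
        issues = []
        covered = [p for pat in self.patterns for p in pat.processes]
        for proc in normal_processes:
            # Assumption: every normal process sits in exactly one pattern.
            if covered.count(proc) != 1:
                issues.append(f"{proc}: must be in exactly one pattern")
        for pat in self.patterns:
            # Assumption: each replica runs on its own resource.
            needed = REPLICAS[pat.kind]
            if len(set(pat.resources)) < needed:
                issues.append(f"{pat.kind} pattern needs {needed} distinct resources")
        for proc in owp_processes:
            if proc not in self.io_binding:
                issues.append(f"{proc}: OWP process has no I/O binding")
        return issues
```

A design point in the exploration then corresponds to one fully populated `FaultTolerantMapping`; process and resource names are supplied by the caller.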

  14. Patternization

  15. Computational Binding

  16. I/O Binding

  17. Dispatcher: internal communication and external communication

  18. Simulation Extensions I
      • Fault Injection
        • Assumption: fault tolerant network
        • Software Initiated Fault Injection (SWIFI) method
        • Exponential random distribution
        • Currently only transient failures
          • Failure affects only current and active frame
          • Invalidates rest of frame
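A minimal Python sketch of transient fault injection with exponentially distributed inter-arrival times follows. It is not SAFE's injector: the function names, the `rate` and `frame_period` parameters, and the assumption that frames run back to back with a fixed period are all introduced here for illustration.

```python
import random
from typing import Iterable, List


def sample_fault_times(rate: float, horizon: float, rng: random.Random) -> List[float]:
    """Draw fault instants in [0, horizon) with exponential inter-arrival times."""
    times = []
    t = rng.expovariate(rate)          # time until the first fault
    while t < horizon:
        times.append(t)
        t += rng.expovariate(rate)     # time until the next fault
    return times


def invalidated_frames(fault_times: Iterable[float], frame_period: float) -> List[int]:
    """A transient fault only invalidates the frame that is active when it strikes."""
    return sorted({int(t // frame_period) for t in fault_times})
```

Passing a seeded `random.Random` instance keeps a simulation run reproducible, which helps when comparing design points under the same fault sequence.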

  19. Simulation Extensions II
      • Fault Detection
      • Fault masking
        • Fault tolerance pattern can deduce correct output
      • Fault correction
        • Skip frame
        • Restart (possibly from checkpoint)
        • Overhead of fault correction modeled (time, energy, memory)
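The skip and restart policies can be sketched as a single function with a restart budget, where a budget of 0 corresponds to skipping the frame. This Python sketch rests on simplifying assumptions that are not from the slides: every attempt costs the full `exec_time`, a fault is detected at the end of an attempt, and restarts always begin at the start of the frame. The outcome labels are hypothetical.

```python
from typing import Iterable


def handle_frame(attempt_ok: Iterable[bool], exec_time: float,
                 deadline: float, restart_budget: int) -> str:
    """Classify one frame as 'delivered', 'corrupt' or 'deadline_miss'.

    attempt_ok yields, for each successive execution attempt, whether that
    attempt completed without a detected fault.
    """
    elapsed = 0.0
    for restarts_used, ok in enumerate(attempt_ok):
        elapsed += exec_time               # assumption: each attempt costs exec_time
        if elapsed > deadline:
            return "deadline_miss"         # restarting pushed the frame past its deadline
        if ok:
            return "delivered"
        if restarts_used == restart_budget:
            return "corrupt"               # budget exhausted: the frame is dropped
    return "corrupt"                       # no attempts left to try
```

The third outcome shows why restarting is not free: a restart that succeeds functionally can still miss the frame deadline.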

  20. Simulation Extensions III
      • Checkpoint modeling
        • Full simulation of checkpoints (generation and usage)
        • Overhead taken into account (time, energy, buffers)
      • Explicit checkpoint
        • Periodically obtained checkpoints
      • Implicit checkpoint
        • Stateless point between frames
        • Any state between frames must be modeled explicitly
      Figure example: restart from CA.4.1, CB.3.1
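The time trade-off between explicit checkpoints and restarting from the implicit checkpoint at the frame start can be sketched in Python as below. This is a simplified model, not SAFE's: it assumes at most one fault per frame, checkpoints taken after every `interval` units of useful work, a fixed `ckpt_cost` per checkpoint, and no extra cost for the restart itself. All names are hypothetical.

```python
import math
from typing import Optional


def frame_time(work: float, interval: Optional[float], ckpt_cost: float,
               fault_at: Optional[float] = None) -> float:
    """Time to finish a frame of `work` units, with at most one fault.

    interval=None means no explicit checkpoints (restart from the frame start).
    fault_at is the amount of useful work completed when the fault strikes.
    """
    def checkpoints_in(lo: float, hi: float) -> int:
        # Number of checkpoint positions k*interval with lo < k*interval < hi.
        if interval is None:
            return 0
        first = math.floor(lo / interval) + 1
        last = math.ceil(hi / interval) - 1
        return max(0, last - first + 1)

    if fault_at is None:
        return work + checkpoints_in(0.0, work) * ckpt_cost

    if interval is None:
        taken_before = 0
        restart_from = 0.0                      # implicit checkpoint between frames
    else:
        taken_before = math.floor(fault_at / interval)
        restart_from = taken_before * interval  # latest explicit checkpoint

    redo = work - restart_from
    return (fault_at + taken_before * ckpt_cost
            + redo + checkpoints_in(restart_from, work) * ckpt_cost)
```

For a 10-unit frame with a fault after 6 units, this model gives 16 time units without explicit checkpoints and 13 with a checkpoint every 4 units at a cost of 0.5 each, while the fault-free run grows from 10 to 11 units. That illustrates how checkpoint overhead can cause deadline misses even when no fault occurs.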

  21. Experiments

  22. Experimental set-up
      • Applications
        • Motion-JPEG
        • Sobel Edge Detector
        • MP3 Decoder
      • Architecture
        • Shared memory
        • Active redundancy with different policies
        • 2 buses
        • 4 processors with 10^-6 FIT
      • Exhaustive exploration of limited design space: 86 patterns
      Example patterns (figure labels): TMR (skip), DMR (skip), TMR (1 restart), DMR (3 restart), TMR (2 restart) [1 Chk], DMR (1 restart) [2 Chk]

  23. Fault-tolerant DSE: MJPEG
      • DMR vs TMR
        • DMR requires less power
        • TMR not beneficial for frame drop ratio
      • Restarting
        • Requires more power
        • Beneficial for frame drop ratio (31% to 0%)
      • Sudden increase in drop ratio
        • Cannot keep pace (more deadline misses)
        • Side effect: less power

  24. Fault-tolerant DSE: Sobel
      • DMR vs TMR
        • DMR requires less power
        • TMR beneficial for frame drop ratio (5% to 1.4%)
      • Restarting
        • Requires more power
        • Beneficial for frame drop ratio (16% to 1.4%)

  25. Fault-tolerant DSE: MP3
      • DMR vs TMR
        • DMR requires less power
        • TMR not beneficial for frame drop ratio (with restart)
      • Restarting
        • Requires more power
        • Not beneficial for frame drop ratio with TMR
      • Checkpoint overhead
        • More deadline misses

  26. Fault-tolerant DSE: Conclusion
      • TMR takes significantly more power, but drop ratio does not always improve significantly
      • Restart effort (+power) always reduces drop ratio, but difference may be small
      • Application dependent behavior
      • DSE is required for fault-tolerant design
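Because power and frame drop ratio pull in opposite directions, a design space exploration typically keeps only the non-dominated design points. The Python sketch below shows a plain Pareto filter over (power, drop ratio) pairs. It is a generic technique, not SAFE's search method, and the design-point names and numbers in the usage note are made-up placeholders, not results from the experiments.

```python
from typing import Dict, List, Tuple

Objectives = Tuple[float, float]  # (power, frame drop ratio); lower is better for both


def dominates(p: Objectives, q: Objectives) -> bool:
    """p dominates q if it is no worse in every objective and better in at least one."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def pareto_front(points: Dict[str, Objectives]) -> List[str]:
    """Return the names of all design points that no other point dominates."""
    return sorted(
        name for name, p in points.items()
        if not any(dominates(q, p) for q in points.values())
    )
```

For placeholder points `{"A": (1.0, 0.30), "B": (2.0, 0.05), "C": (2.5, 0.05), "D": (3.0, 0.0)}`, point C is dominated by B (same drop ratio, more power), so the front is A, B and D.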

  27. Conclusion

  28. Conclusion
      • Fault tolerance should be a first-class citizen of early DSE
      • SAFE is able to explore the fault-tolerant design space
      • Custom patternization trades off fault-handling capability against performance overhead
      • More expensive fault-tolerance patterns are not always beneficial
      • Future work
        • Large scale DSE
        • Mixed fault-tolerance patterns
        • Multiple applications

  29. Questions?

  30. MJPEG Patternization (additional slides)
      • Incremental Patternization
      • Patternization effects
        • Equalized subnetworks
        • Minimized communication
        • Guard CPU-intensive tasks

  31. Checkpoint overhead vs. benefit

  32. Drop ratio versus number of subnetworks
      • Trade-off between increased fault tolerance and the overhead of additional subnetworks

  33. Restart budget
      • Corrupt frames decrease due to the fault-masking effort
      • Deadline misses only increase up to a certain limit (real restart usage <= budget)