Cooperative cross-layer protection for resource constrained embedded systems. Prof. Nikil Dutt Prof. Nalini Venkatasubramanian Prof. Lichun Bao. Kyoungwoo Lee (topic exam). June 17, 2008. Contents. Motivation Cooperative, Cross-layer Methods PPC (Partially Protected Caches)
Cooperative cross-layer protection for resource constrained embedded systems Prof. Nikil Dutt Prof. Nalini Venkatasubramanian Prof. Lichun Bao Kyoungwoo Lee (topic exam) June 17, 2008
Contents • Motivation • Cooperative, Cross-layer Methods • PPC (Partially Protected Caches) • EAVE (Error-Aware Video Encoding) • Thesis Outline and Plan
Motivation • Mobile computing is popular in business; communication, entertainment, and education; the battlefield; wellness; and science • Mobile devices are resource-limited • The fundamental problem is to achieve low power with high performance
Motivation (cont’) • Reliability is an emerging and critical concern • Mobile applications are running close to humans • Wearable computing and wellness mobile devices • New enhanced technology makes devices vulnerable to errors due to high complexity and high integration • Exponential increase of soft error rate as technology scales [Hazucha, 00] • Redundancy techniques incur high overheads of power and performance • TMR (Triple Modular Redundancy) may exceed 200% overheads without optimization [Nieuwland, 06] • Challenging to optimize multiple properties (e.g., performance, power, and reliability) in mobile embedded systems
Reliability Across Layers in Mobile Devices Application Middleware/ OS Hardware Network • Errors and error control schemes at system abstraction layers
Errors and Error Control Schemes at Hardware • Hardware failures are increasing as technology scales • (e.g.) SER increases by up to 1000 times [Mastipuram, 04] • Redundancy techniques are expensive • (e.g.) ECC-based protection in caches incurs 95% performance penalty [Li, 05] • FIT: Failures in Time (failures per 10^9 hours) • MTTF: Mean Time To Failure • MTBF: Mean Time Between Failures • TMR: Triple Modular Redundancy • EDC: Error Detection Codes • ECC: Error Correction Codes • RAID: Redundant Array of Inexpensive Drives
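The FIT and MTTF metrics listed on this slide are reciprocals of each other when the failure rate is constant. A minimal sketch of the conversion, assuming a constant failure rate; the function names are illustrative and not from the talk:

```python
HOURS_PER_BILLION = 1e9  # FIT counts failures per 10^9 device-hours


def fit_to_mttf_hours(fit: float) -> float:
    """Return MTTF in hours for a constant failure rate given in FIT."""
    if fit <= 0:
        raise ValueError("FIT must be positive")
    return HOURS_PER_BILLION / fit


def fit_to_mttf_years(fit: float) -> float:
    """Same conversion expressed in years (365-day years, an assumption)."""
    return fit_to_mttf_hours(fit) / (24 * 365)


if __name__ == "__main__":
    # 1,000 FIT is the per-Mbit SRAM figure quoted later in the deck for 0.18 um.
    print(fit_to_mttf_hours(1000))
    print(fit_to_mttf_years(1000))
```

Because FIT rates of independent components add, the MTTF of a whole system shrinks as more memory is integrated, which is why per-Mbit figures matter for large caches.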
Errors and Error Control Schemes at Software • Software errors become dominant as system complexity increases • (e.g.) Several bugs per thousand lines of code • Hard to debug, and redundancy techniques are expensive • (e.g.) Backward recovery with checkpoints is inappropriate for real-time applications • QoS: Quality of Service
Errors and Error Control Schemes in Networks Hardware Application Network MW/ OS • Network is unreliable (especially, wireless networks) • Combined approaches across OSI layers have been investigated for optimal solutions [Vuran, 06][Schaar, 07] • SNR: Signal to Noise Ratio • MTTR: Mean Time To Recovery • CRC: Cyclic Redundancy Check • MIMO: Multiple-In Multiple-Out
Thesis Problem Statement • Study conflicts among system properties • Examine errors and error control schemes across system abstraction layers • Maximize reliability while minimizing costs of power and performance for mobile embedded systems
Why Cross-Layer Approach? • Cross-layer interactions and conflicts arise between system properties • DVS increases SER exponentially • Over protection or under protection • All ECC for multimedia data is an overkill • Cross-layer approaches can maximize the reliability with minimal power and performance overheads • Benefits of Cross-layer approaches • Global system view • Coordination for intelligent selection • Adaptation • Cross-layer approaches have been promising to save the resources at the cost of QoS [Mohapatra, 05][Yuan, 04] • DVS: Dynamic Voltage Scaling • SER: Soft Error Rate • ECC: Error Correction Codes • QoS: Quality of Service
Thesis Proposed Contribution: CC-PROTECT • Cooperative Cross-layer Protection (CC-PROTECT) by exploiting error-awareness and error control schemes across system abstraction layers • Contribution • Present cost-efficient reliability methods (cooperative cross-layer protection) • Open expanded tradeoff spaces and operating points • Rediscover applicability of existing approaches for other purposes
Contents Application Hardware Middleware/ OS Network • Motivation • Cooperative, Cross-layer Methods • PPC (Partially Protected Caches) • EAVE (Error-Aware Video Encoding) • Thesis Outline and Plan
Soft Errors (Transient Faults) • SER increases exponentially as technology scales • Contributing factors: integration, voltage scaling, altitude, latitude • Caches are hit hardest because: • They occupy the largest portion of the processor (more than 50%) • They have no masking effects (e.g., logical masking) • [Figure: Intel Itanium II processor [Baumann, 05]; a bit flip (0 to 1) in a transistor; MTTF labels of 5 hours and 1 month] • MTTF: Mean Time To Failure
Conventional Protection for Caches • Conventional Protected Caches (Safe) • Unaware of fault tolerance at applications • Implement a redundancy technique such as ECC to protect all data on every access • Overkill for multimedia applications • ECC (e.g., a Hamming code) incurs a performance penalty of up to 95% and a power overhead of up to 22% • [Figure: cache with ECC, labeled "Unaware of Application" and "High Cost"]
Related Work • Process Technology Solutions • Hardening [Baze, IEEE Trans. on Nuclear Science 00] • SOI [O. Musseau, IEEE Trans. on Nuclear Science 96] • Process complexity, yield loss, and substrate cost • Microarchitectural Solutions for Caches • Cache Scrubbing [Mukherjee, PRDC04] • Low Power Cache [Li, ISLPED04] • Area Efficient Protection [Kim, DATE06] • Multiple Bit Correction [Neuberger, TODAES 03] • Cache Size Selection [Cai, ASP-DAC06] • In-Cache Replication [Zhang, DSN03] • Replication Cache [Zhang, IEEE Computers 05] • High overheads in terms of power, performance, and area • Our Solution • Protects caches from failures due to soft errors exploiting error-tolerance of applications • Protection can be in conjunction with any techniques
Unequal Data Protection • Not all pages are equally failure critical • Multimedia data is failure non-critical • Program variables are failure critical • Failures: system crash, infinite loop, segmentation fault, etc. • QoS degradation is not a failure • [Figure: only 9 pages out of 83 are failure critical]
PPC (Partially Protected Caches) • Propose PPC architectures to provide an unequal protection for mobile multimedia systems [Lee, TVLSI08] • Unprotected cache and Protected cache at the same level of memory hierarchy • Protected cache is typically smaller to keep power and delay the same as or less than those of Unprotected cache PPC Unprotected Cache Protected Cache How to Partition Data? Memory
PPC for Multimedia Applications • Propose a selective data protection based on HPC [Lee, CASES06] • Unequal protection at hardware layer exploiting error-tolerance at application layer • Simple data partitioning for multimedia applications • Multimedia data is failure non-critical • All other data is failure critical • HPC: Horizontally Partitioned Caches • [Figure: PPC (Unprotected Cache and Protected Cache) above Memory, labeled "Power/Delay Reduction" and "Fault Tolerance"]
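The simple partitioning rule on this slide (multimedia data to the unprotected cache, everything else to the protected cache) can be sketched as a page-to-cache mapping. This is an illustration only: the `Page` type and its `is_multimedia` tag are hypothetical, and the slides do not say how pages are tagged in the actual system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Page:
    number: int
    is_multimedia: bool  # hypothetical tag, e.g. supplied by the application


def partition(pages):
    """Map each page number to the PPC sub-cache that should hold it."""
    mapping = {}
    for page in pages:
        # Multimedia data is failure non-critical, so it skips protection.
        mapping[page.number] = "unprotected" if page.is_multimedia else "protected"
    return mapping


if __name__ == "__main__":
    pages = [Page(0, False), Page(1, True), Page(2, True)]
    print(partition(pages))
```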
PPC for General Applications • DPExplore [Lee, PPCDIPES08] • Explore the partitioning space by exploiting awareness of the vulnerability of each data page • Vulnerable time • Data in the cache is vulnerable during any interval that ends with a read by the CPU or a write-back to memory • Pages with high vulnerable time are failure critical • Vulnerable time closely estimates failure rate • [Figure: timeline t0 to t3 of cached data with Incoming, Read, Write, and Eviction events, marking vulnerable and invulnerable intervals]
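The vulnerable-time definition can be made concrete with a small trace walker. The sketch below is an assumption-laden simplification, not DPExplore itself: it tracks a single cached item, treats an interval as vulnerable if it ends in a CPU read or in an eviction of dirty data (a write-back), and treats an interval that ends in an overwrite or a clean eviction as invulnerable. The event names are hypothetical.

```python
def vulnerable_time(events):
    """Sum the vulnerable intervals of one cached item.

    events: list of (time, kind) in time order, kind in
    {"incoming", "read", "write", "evict"}; the trace must start with "incoming".
    """
    total = 0
    last = None      # time of the previous event
    dirty = False    # True once the item has been written in the cache
    for time, kind in events:
        if last is None and kind != "incoming":
            raise ValueError("trace must start with an 'incoming' event")
        if kind == "incoming":
            dirty = False
        elif kind == "read":
            total += time - last          # an error here would reach the CPU
        elif kind == "write":
            dirty = True                  # old value is overwritten: not vulnerable
        elif kind == "evict":
            if dirty:
                total += time - last      # an error here would reach memory
        else:
            raise ValueError(f"unknown event kind: {kind}")
        last = time
    return total


if __name__ == "__main__":
    trace = [(0, "incoming"), (10, "read"), (25, "write"), (40, "evict")]
    print(vulnerable_time(trace))
```

Summing this quantity per page gives a ranking of pages by vulnerability, which is the kind of signal a partitioning search could use to decide which pages go to the protected cache.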
Experimental Results – Failure Rate Failure rate of PPC is close to that of Safe (Safe is a protected cache configuration with an ECC protection, i.e., protecting all data, and Unsafe is an unprotected cache)
Experimental Results – Performance Runtime of PPC is close to that of Unsafe
Experimental Results – Power Energy consumption of PPC is close to that of Unsafe
Summary – PPC (Partially Protected Caches) • All data are not equally failure critical • Propose a PPC architecture to provide unequal data protection • Support an unequal protection at hardware layer by exploiting error-tolerance and vulnerability at application • Present cost-efficient reliability • Related Publications • [Lee, CASES06] • [Lee, PPCDIPES08] • [Lee, TVLSI08]
Contents Application Middleware/ OS Hardware Network • Motivation • Cooperative, Cross-layer Methods • PPC (Partially Protected Caches) • EAVE (Error-Aware Video Encoding) • Thesis Outline and Plan
Parameters Resilience PLR Error-Resilient Video Encoding Network • Error-resilient video encodings have been developed to combat errors in networks • PBPAIR – energy-efficient and error-resilient video encoding [Kim,06] • Passive Error Exploitation • It compresses video data according to PLR Mobile Video Application Embed Error-Resilience against packet losses Maintain the QoS Packet Loss • PBPAIR: Probability-Based Power Aware Intra Refresh Error-prone Networks
Related Work • Energy/QoS-aware video encoding • Video encoding parameters [Mohapatra, IPDPS05] • Motion estimation algorithm [Tourapis, VCIP00] • Integrated power management [Mohapatra, ACM MM03] • Global cross-layer adaptation [Yuan, MMCN04] • Transmission power and QoS [Eisenberg, IEEE Trans. on CSVT 02] • These do not consider error-resilience • Error-resilient video encoding • Error-resilient GOP [Yang, JVCIP07] • AIR (Adaptive Intra Refreshing) [Worral, ICASSP01] • PGOP (Progressive GOP) [Cheng, PCS04] • PBPAIR (Probability-Based Power Aware Intra Refresh) [Kim, MCCR06] • Passive error exploitation • Our Solution • Error-aware video encoding: exploits errors actively to minimize energy consumption
Active Error Exploitation – Intentional Frame Drop • Intentional Frame Drop (one way to actively exploit errors) can result in energy reduction for each operation • FDT-1 affects the following components with respect to power, performance, and QoS in mobile video applications Mobile Video Application Enc Tx Rx Dec CPU WNI WNI CPU FDT-1 FDT-2 FDT-3 Packet Loss • FDT: Frame Drop Type • Enc: Encoding, Dec: Decoding • WNI: Wireless Network Interface Error-prone Networks
Error-Aware Video Encoding • Propose EE-PBPAIR [Lee, DIPES08] • Intentionally drop frames at video encoding • Reduce the energy consumption for video encoding • Maintain the video quality by exploiting error-resilience of PBPAIR Intentional frame drop Packet Loss Error-Aware Video Encoder (EAVE) Error- Resilient Video Error- Aware Video Original Video Error-Controller (e.g., frame dropping) Error-Resilient Encoder (e.g., PBPAIR) EIR • EIR: Error Injection Rate Error-prone Networks
Error-Aware Video Encoding (EAVE) Network • Cross-layer architecture • Intentional exploitation of errors at application incorporating error-resilience in network Resilience FLR EIR feedback PLR • EIR: Error Injection Rate • FLR: Frame Loss Rate • PLR: Packet Loss Rate
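The EIR knob and its feedback loop can be illustrated with a toy controller. The sketch below is not EE-PBPAIR: the even spreading of drops, the PSNR-threshold policy, the step size, and the 50% cap are all assumptions made for illustration. The EIR is held as an integer percentage to avoid floating-point drift in the drop schedule.

```python
def select_frames_to_drop(num_frames, eir_percent):
    """Return indices of frames to drop so that about eir_percent% are dropped,
    spread evenly over the sequence."""
    if not 0 <= eir_percent <= 100:
        raise ValueError("eir_percent must be in [0, 100]")
    dropped = []
    credit = 0
    for index in range(num_frames):
        credit += eir_percent
        if credit >= 100:
            dropped.append(index)   # this frame is intentionally not encoded
            credit -= 100
    return dropped


def adjust_eir(eir_percent, psnr_db, target_psnr_db, step=1, max_eir=50):
    """Hypothetical feedback policy: back off when quality is below target,
    otherwise inject more errors to save energy."""
    if psnr_db < target_psnr_db:
        return max(0, eir_percent - step)
    return min(max_eir, eir_percent + step)


if __name__ == "__main__":
    print(select_frames_to_drop(30, 10))
    print(adjust_eir(10, psnr_db=28.0, target_psnr_db=30.0))
```

In this sketch the quality feedback is a PSNR reading; the slides only state that EIR, FLR, and PLR feedback flow between the layers, so the choice of signal here is a placeholder.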
Experimental Results – Energy Reduction Energy saving occurs at every component in a path from encoding to decoding in mobile video applications EC = Energy Consumption Enc EC = EC for Encoding Tx EC = EC for Transmission Dec EC = EC for Decoding Rx EC = EC for Receiving • PLR = 10% and EIR = 10% • PSNR: Peak Signal to Noise Ratio
Summary – EAVE (Error-Aware Video Encoding) • Intentional Frame Drop is one way to exploit errors actively • Propose an error-aware video encoding (EE-PBPAIR) • Intentional frame dropping and the inherent energy efficiency of PBPAIR reduce the energy consumption of video encoding • Present a knob (EIR) to adjust the amount of errors considering the QoS feedback • Maintain the video quality using error-resilience of PBPAIR • Related Publication • [Lee, DIPES08]
Contents • Motivation • Cooperative, Cross-layer Methods • PPC (Partially Protected Caches) • EAVE (Error-Aware Video Encoding) • Thesis Outline and Plan
Thesis Outline • Thesis Problem • Exploit errors and error control schemes across layers to maximize reliability with minimal costs for mobile embedded systems • Topic 1 – Approach at hardware and application layers • PPC (unequal data protection at hardware exploiting error tolerance at application) [Lee, CASES06][Lee, PPCDIPES08][Lee, TVLSI08] • Topic 2 – Approach at application, middleware, and network layers • EAVE (intentional exploitation of errors at application, incorporating error resilience in networks) [Lee, DIPES08] • Topic 3 – Approach across application/middleware-OS/HW • CC-PROTECT (middleware-driven cooperative exploitation of errors and error control schemes across layers) [under submission to ACM MM 08 and on-going work]
Outline of CC-PROTECT Original Video Error-Controller (e.g., frame drop) Error-Resilient Encoder (e.g., PBPAIR) Error- Aware Video Error-Aware Video Encoder (EAVE) Mobile Video Application Error Injection Rate & Frame Loss Rate QoS Loss BER (Backward Error Recovery) DFR (Drop & Forward Recovery) Monitor & Translate SER Trigger Selective DFR Support EAVE & PPC Packet Loss Frame Drop MW/OS Mobile Video Application Feedback SER Data Mapping frame K frame K+1 Parameter Unprotected Cache Protected Cache EDC Error detection PPC Error-prone Networks Error-prone Networks
Time Plan • Fall, 2003 ~ Spring, 2008 • PPC, EAVE, etc. • Summer, 2008 • CC-PROTECT • Extended versions of previous work • End of Summer, 2008 • Final Defense • Dissertation
Publications
[Lee, TVLSI08] K. Lee, A. Shrivastava, I. Issenin, N. Dutt, and N. Venkatasubramanian, “Partially protected caches to reduce failures due to soft errors in multimedia applications”, In IEEE Transactions on Very Large Scale Integration Systems (TVLSI), 2008, to appear.
[Lee, DIPES08] K. Lee, M. Kim, N. Dutt, and N. Venkatasubramanian, “Error exploiting video encoder to extend energy/QoS tradeoffs for mobile embedded systems”, In 6th IFIP Working Conference on Distributed and Parallel Embedded Systems (DIPES), Sep. 2008, to appear.
[Lee, PPCDIPES08] K. Lee, A. Shrivastava, N. Dutt, and N. Venkatasubramanian, “Data partitioning techniques for partially protected caches to reduce soft error induced failures”, In 6th IFIP Working Conference on Distributed and Parallel Embedded Systems (DIPES), Sep. 2008, to appear.
[Lee, CASES06] K. Lee, A. Shrivastava, I. Issenin, N. Dutt, and N. Venkatasubramanian, “Mitigating soft error failures for multimedia applications by selective data protection”, In Int. Conference on Compilers, Architecture, & Synthesis for Embedded Systems (CASES), Oct. 2006.
[Lee, ICME05] K. Lee, N. Dutt, and N. Venkatasubramanian, “Experimental Study on Energy Consumption of Video Encryption for Mobile Handheld Devices”, In IEEE International Conference on Multimedia and Expo (ICME 05), Poster Session, July 2005.
[Mohapatra, IPDPS05] S. Mohapatra, R. Cornea, H. Oh, K. Lee, M. Kim, N. Dutt, R. Gupta, A. Nicolau, S. Shukla, and N. Venkatasubramanian, “A cross-layer approach for power-performance optimization in distributed mobile systems”, In Next Generation Software Program in conjunction with IEEE International Parallel and Distributed Processing Symposium (IPDPS), April 2005.
Performance vs. Capacity • Total energy available from a battery is a design issue and is fixed at a design time, along with its weight and size • Stark contrast between linear growth rate of battery capacity and exponential technology improvement rate of system components [Udani] Sanjay Udani and Jonathan Smith, “Power management in mobile computing”
Generalized Fault Tolerance Techniques • Modular Redundancy • N-Version Programming • Error-Control Coding • Checkpoints and Rollbacks • Recovery Blocks [Chetan, SPC04] S. Chetan, A. Ranganathan, and R. Campbell, “Towards Fault Tolerant Pervasive Computing”, in SPC ’04 [Somani, IEEECom97] A. K. Somani and N. H. Vaidya, “Understanding Fault Tolerance and Reliability”, in IEEE Computer ’97 vol. 30 issue 4
1) Modular Redundancy • Modular Redundancy • Multiple identical replicas of hardware modules • Voter mechanism • Compare outputs and select the correct output Tolerate most hardware faults Effective but expensive fault Data Producer A Consumer voter Producer B
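The voter at the heart of modular redundancy can be sketched as a strict majority function over replica outputs. This is a minimal software illustration, assuming replica outputs are hashable values; a hardware TMR voter would operate bitwise on signals.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value produced by a strict majority of replicas.

    Raises RuntimeError when no value has more than half of the votes.
    """
    if not outputs:
        raise ValueError("at least one replica output is required")
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise RuntimeError("no majority among replica outputs")
    return value


if __name__ == "__main__":
    # Three replicas (TMR); the third one suffered a fault.
    print(majority_vote([7, 7, 9]))
```

With three replicas the voter masks any single faulty output, at the cost of running every computation three times, which is the overhead the deck's motivation slide refers to.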
2) N-version Programming • N-version Programming • Different versions by different teams • Different versions may not contain the same bugs • Voter mechanism Tolerate some software bugs Data Producer A Consumer voter Program i fault Program j Programmer K Programmer L
3) Error-Control Coding • Error-Control Coding • Replication is effective but expensive • Error-Detection Coding and Error-Correction Coding • (example) Parity Bit, Hamming Code, CRC Much less redundancy than replication fault Data Producer A Consumer Error Control Data
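As one concrete instance of the codes named on this slide, the sketch below implements the standard Hamming(7,4) code, which corrects any single-bit error in a 7-bit codeword. It is a textbook illustration, not the cache ECC evaluated in the talk; bits are plain Python integers (0 or 1).

```python
def hamming74_encode(data):
    """Encode 4 data bits as a 7-bit codeword laid out as p1 p2 d1 p4 d2 d3 d4."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4   # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4   # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(codeword):
    """Correct up to one flipped bit.

    Returns (data_bits, error_position), where error_position is the 1-based
    position that was corrected, or 0 if the codeword was clean.
    """
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s4   # the syndrome names the faulty position
    if position:
        c[position - 1] ^= 1
    return [c[2], c[4], c[5], c[6]], position


if __name__ == "__main__":
    word = hamming74_encode([1, 0, 1, 1])
    word[4] ^= 1                      # inject a single-bit soft error
    print(hamming74_decode(word))
```

The code adds 3 check bits to 4 data bits, whereas triplicating the same 4 bits would add 8, which illustrates the slide's point that coding needs much less redundancy than replication.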
4) Checkpoints & Rollbacks • Checkpoints and Rollbacks • Checkpoint • A copy of an application’s state • Save it in storage immune to the failures • Rollback • Restart the execution from a previously saved checkpoint Recover from transient and permanent hardware and software failures Data Producer A Consumer Application State K Rollback state (K-1) state K fault Checkpoint
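Checkpoint and rollback can be sketched with a deep copy of the application state. In this illustration the saved copy lives in ordinary memory, which only stands in for the failure-immune storage the slide calls for; the class name and its methods are hypothetical.

```python
import copy


class CheckpointedState:
    """Holds application state plus one saved checkpoint of it."""

    def __init__(self, state):
        self.state = state
        self._checkpoint = copy.deepcopy(state)

    def checkpoint(self):
        # Save a private copy so later mutations cannot corrupt it.
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self):
        # Restart from the last saved state, discarding everything since.
        self.state = copy.deepcopy(self._checkpoint)


if __name__ == "__main__":
    app = CheckpointedState({"frame": 0})
    app.state["frame"] = 5
    app.checkpoint()              # state K-1 is saved
    app.state["frame"] = 99       # a fault corrupts state K
    app.rollback()
    print(app.state)
```

The work done between the checkpoint and the fault is lost and must be repeated, which is why the deck calls backward recovery inappropriate for real-time applications.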
5) Recovery Blocks • Recovery Blocks • Multiple alternates to perform the same functionality • One Primary module and Secondary modules • Different approaches • Select a module with output satisfying acceptance test • Recovery Blocks and Rollbacks • Restart the execution from a previously saved checkpoint with secondary module Tolerate software failures Data Producer A Consumer Application Block X Block X2 Block Y Block Z Rollback state (K-1) state K fault Checkpoint
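The recovery-block pattern combines the two previous ideas: try the primary module, check its output with an acceptance test, and on failure roll back and try a secondary. A minimal sketch, with hypothetical names and with the checkpoint again kept in ordinary memory:

```python
import copy


def run_recovery_block(state, alternates, acceptance_test):
    """Run alternates in order (primary first) until one passes the test.

    Each alternate receives a fresh copy of the checkpointed state, which
    models rolling back before the next attempt.
    """
    checkpoint = copy.deepcopy(state)
    for alternate in alternates:
        candidate_state = copy.deepcopy(checkpoint)   # rollback
        try:
            result = alternate(candidate_state)
        except Exception:
            continue                                  # a crash counts as failure
        if acceptance_test(result):
            return result
    raise RuntimeError("all alternates failed the acceptance test")


if __name__ == "__main__":
    def primary(state):
        return -4.0                     # buggy primary module

    def secondary(state):
        return state["x"] ** 0.5        # simpler fallback implementation

    def is_square_root_of_16(result):
        return result >= 0 and abs(result * result - 16) < 1e-9

    print(run_recovery_block({"x": 16}, [primary, secondary], is_square_root_of_16))
```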
Soft Errors on the Increase • SER increases exponentially due to technology scaling • 0.18 µm: 1,000 FIT per Mbit of SRAM • 0.13 µm: 10,000 to 100,000 FIT per Mbit of SRAM • Voltage Scaling • Voltage scaling increases SER significantly • $\mathrm{SER} \propto N_{\mathrm{flux}} \times CS \times \exp\left(-\frac{Q_{\mathrm{critical}}}{Q_s}\right)$, where $Q_{\mathrm{critical}} = C \times V$
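The effect of voltage scaling follows from the model on this slide, in which SER is proportional to N_flux × CS × exp(−Q_critical / Q_s) with Q_critical = C × V. Holding flux and cross-section fixed, the ratio of SER at two supply voltages depends only on the change in critical charge. The sketch below computes that ratio; the numeric values are arbitrary placeholders in consistent units, not measurements.

```python
import math


def relative_ser(v_old, v_new, capacitance, q_s):
    """Return SER(v_new) / SER(v_old) with N_flux and CS held constant.

    Uses SER proportional to exp(-Q_critical / Q_s) and Q_critical = C * V,
    so the constant factors cancel in the ratio.
    """
    if q_s <= 0:
        raise ValueError("q_s must be positive")
    q_critical_old = capacitance * v_old
    q_critical_new = capacitance * v_new
    return math.exp((q_critical_old - q_critical_new) / q_s)


if __name__ == "__main__":
    # Placeholder numbers: lowering V from 1.2 to 1.0 with C = 1.0 and Q_s = 0.1.
    print(relative_ser(1.2, 1.0, capacitance=1.0, q_s=0.1))
```

Because the voltage appears inside the exponent, a linear reduction in supply voltage produces an exponential increase in SER, which is the conflict between DVS and reliability raised earlier in the deck.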