IVF: Characterizing the Vulnerability of Microprocessor Structures to Intermittent Faults
Songjun Pan 1,2, Yu Hu 1, and Xiaowei Li 1
1 Key Laboratory of Computer System and Architecture, Institute of Computing Technology, Chinese Academy of Sciences
2 Graduate University of Chinese Academy of Sciences

Outline
• Background and Related Work
• IVF Computing Methodology
• Experimental Results
• Conclusions

Background
[Figure: failure rate versus lifetime, with the infant mortality, useful life, and wear-out stages; the deep submicron era curve is annotated with defect escape, faster aging, soft errors, and intermittent faults]
• Intermittent faults are emerging as a major source of failures in microprocessors [DSN’02]

Intermittent Faults
• Description
  • Occur frequently and irregularly for a period of time
  • Caused by loose connections, manufacturing residuals, process variation, or in-progress wear-out, combined with voltage and temperature fluctuations
• Characteristics
  • Occur in bursts at the same location
  • Removed by replacing the offending circuit
  • Activated or deactivated by PVT (process, voltage, and temperature) variations

Protecting the Microprocessor
• Information redundancy techniques
  • Parity and error-correcting codes
  • High area overhead
  • High power consumption
• Hardware redundancy techniques
  • Dual modular redundancy / triple modular redundancy
  • 100%~200% area overhead
• Software redundancy techniques
  • Redundant multi-threading
  • 10%~30% performance overhead
• Conventional protection methods ensure high reliability but also cause high overhead

Trading Off Reliability and Overhead
• Key observation: not all faults lead to external program failures
  • A fault in the branch predictor: does not matter at all
  • A fault in the program counter: almost always matters
• Which bits matter? ACE bits versus un-ACE bits
  • Architecturally Correct Execution (ACE) bit [MICRO’03]
  • ACE bit: a bit that, if changed, will lead to an external error
• Reliability evaluation
  • Protect the most vulnerable structures

Related Metrics
• Mean Time To Failure (MTTF) / Mean Time Between Repair (MTBR)
• Masking effect
• Structure utilization
• Soft error vulnerability analysis
  • Architectural Vulnerability Factor (AVF) [MICRO’03]
  • Program Vulnerability Factor (PVF) [HPCA’09]
• Hard fault vulnerability analysis
  • Hard-Faults AVF (H-AVF) [SIGMETRICS’06]
• The vulnerability to intermittent faults is rarely considered because of their varied causes and behaviors

Our Contributions
• Propose a metric, the Intermittent Vulnerability Factor (IVF), to characterize the vulnerability to intermittent faults
  • IVF definition: a structure’s IVF is the probability that an intermittent fault in that structure causes an externally visible error
• Present IVF computing algorithms for the reorder buffer and the register file
• Compute IVF with different fault configurations

Intermittent Fault Models
[Figure: diagram relating causes and mechanisms to affected locations and to fault models at the logic level]
• Causes and mechanisms: manufacturing residues, timing violations, oxide breakdown, inductive noise, solder joint, electromigration, crosstalk, soft breakdown, intermittent contacts, variation of metal R&C, fluctuation of leakage current
• Locations: cell, memory, buses, interconnection lines, power supply
• Fault models at the logic level: intermittent indetermination, intermittent stuck-at, intermittent short, intermittent open, intermittent pulse, intermittent delay

Intermittent Stuck-at Faults
• Intermittent stuck-at faults
  • Intermittently change the correct value to logic one or logic zero
  • Vulnerable structures: storage structures such as memory and the register file
• Key parameters
  • Burst length, active time, and inactive time
  • The fault has an adverse effect during the active time
[Figure: timeline of an intermittent fault showing alternating active and inactive times, with the burst length marked for each burst]

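The three key parameters can be made concrete with a small model. The following is a minimal sketch, not the authors' implementation: it assumes a single burst that starts at a given cycle, with the active and inactive times alternating inside it. The class name `IntermittentStuckAt` and all field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IntermittentStuckAt:
    """Hypothetical model of one burst of an intermittent stuck-at fault."""
    start: int          # cycle at which the burst begins
    burst_length: int   # total number of cycles spanned by the burst
    active_time: int    # cycles per period during which the bit is forced
    inactive_time: int  # cycles per period during which the bit is unaffected
    stuck_value: int    # 0 or 1

    def is_active(self, cycle: int) -> bool:
        """Return True if the fault forces the bit at the given cycle."""
        offset = cycle - self.start
        if offset < 0 or offset >= self.burst_length:
            return False  # outside the burst
        period = self.active_time + self.inactive_time
        return offset % period < self.active_time

    def observe(self, cycle: int, correct_bit: int) -> int:
        """Value seen at the faulty location at the given cycle."""
        return self.stuck_value if self.is_active(cycle) else correct_bit


fault = IntermittentStuckAt(start=10, burst_length=20,
                            active_time=3, inactive_time=2, stuck_value=1)
print([c for c in range(8, 32) if fault.is_active(c)])
```

With these example numbers the fault is active for three cycles out of every five between cycles 10 and 29, and the stored bit is read correctly at all other times.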
IVF Computing
• Determine whether an intermittent fault affects program execution
  • Analyze ACE bits / critical time
• Set the three key parameters: burst length, active time, and inactive time
  • Burst length: randomly generated from [10T, 30T]
  • Duty cycle: 50%
  • Start time: randomly generated
• Compute IVFs for the reorder buffer and the register file
[Figure: timeline of an intermittent fault showing alternating active and inactive times, with the burst length marked for each burst]

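The fault configuration described on this slide could be drawn as in the sketch below. It rests on assumptions the slide does not state: T is treated as the length, in cycles, of one active-plus-inactive period, and the duty cycle is the active fraction of that period. The function names `sample_fault_config` and `active_windows` are hypothetical.

```python
import random


def sample_fault_config(total_cycles, period, duty_cycle=0.5, rng=None):
    """Draw one fault configuration; `period` plays the role of T (assumed)."""
    if period < 2 or not 0.0 < duty_cycle < 1.0:
        raise ValueError("need period >= 2 and 0 < duty_cycle < 1")
    rng = rng or random.Random()
    # Keep at least one active and one inactive cycle per period.
    active_time = min(period - 1, max(1, round(period * duty_cycle)))
    return {
        "start": rng.randrange(total_cycles),                    # random start time
        "burst_length": rng.randint(10 * period, 30 * period),  # [10T, 30T]
        "active_time": active_time,
        "inactive_time": period - active_time,
    }


def active_windows(start, burst_length, active_time, inactive_time):
    """List the half-open [begin, end) cycle ranges in which the fault is active."""
    period = active_time + inactive_time
    end_of_burst = start + burst_length
    windows = []
    t = start
    while t < end_of_burst:
        windows.append((t, min(t + active_time, end_of_burst)))
        t += period
    return windows


cfg = sample_fault_config(total_cycles=1_000_000, period=100,
                          rng=random.Random(0))
print(cfg, len(active_windows(**cfg)))
```

Passing a seeded `random.Random` keeps a set of sampled configurations reproducible across runs.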
IVF Computing – Reorder Buffer
• ACE bit analysis
[Figure: an example of an intermittent fault on a time axis, with active and inactive times marked]
[Figure: planar representation with cycle and bit-entry axes showing the ACE region; labels B1, B2, B3, X, Y, Z]

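One way to read the ACE bit analysis is as an overlap test between the cycles in which a fault is active and the cycles in which the affected entry holds ACE bits. The sketch below follows that reading. It is an illustration under stated assumptions, not the authors' algorithm: intervals are half-open cycle ranges, value masking (the stuck value equalling the stored value) is ignored, and the names `fault_matters` and `estimate_ivf` are hypothetical.

```python
def overlaps(a, b):
    """True if half-open intervals a = [a0, a1) and b = [b0, b1) intersect."""
    return a[0] < b[1] and b[0] < a[1]


def fault_matters(ace_intervals, fault_windows):
    """A fault is counted if any active window overlaps any ACE interval."""
    return any(overlaps(a, w) for a in ace_intervals for w in fault_windows)


def estimate_ivf(ace_by_entry, faults):
    """Fraction of injected faults that touch ACE state.

    ace_by_entry: dict mapping entry index -> list of ACE intervals.
    faults: list of (entry index, list of active windows) pairs.
    """
    if not faults:
        raise ValueError("at least one fault is required")
    hits = sum(fault_matters(ace_by_entry.get(entry, []), windows)
               for entry, windows in faults)
    return hits / len(faults)


# Example data, made up purely for illustration.
ace_by_entry = {0: [(10, 20)], 1: [(50, 60)]}
faults = [
    (0, [(0, 5), (8, 12)]),  # second window overlaps the ACE interval
    (0, [(20, 25)]),         # starts exactly when the ACE interval ends
    (1, [(55, 56)]),         # inside the ACE interval
    (2, [(0, 100)]),         # entry never holds ACE bits
]
print(estimate_ivf(ace_by_entry, faults))  # 0.5
```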
IVF Computing – Register File
• Critical time analysis
[Figure: timeline of register version n, between versions n-1 and n+1: allocation, write (W), reads R1, R2, …, Rlast, and deallocation, with critical and non-critical intervals marked; labels F1, F2, F3, …]

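The critical time of one register version can be sketched as follows. The sketch assumes that the critical interval runs from the write (W) to the last read (Rlast), and that the time between allocation and the write, and between the last read and deallocation, is non-critical. The slide's figure suggests this but the extracted text does not confirm it. The class `RegisterVersion` and its field names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class RegisterVersion:
    """Lifetime events (in cycles) of one version of a physical register."""
    allocated: int
    written: int
    deallocated: int
    reads: List[int] = field(default_factory=list)


def critical_interval(v: RegisterVersion) -> Optional[Tuple[int, int]]:
    """Return (write, last read), or None if the value is never read."""
    if not v.reads:
        return None
    last_read = max(v.reads)
    if last_read <= v.written:
        return None
    return (v.written, last_read)


def critical_fraction(v: RegisterVersion) -> float:
    """Share of the version's lifetime that is critical."""
    lifetime = v.deallocated - v.allocated
    interval = critical_interval(v)
    if interval is None or lifetime <= 0:
        return 0.0
    return (interval[1] - interval[0]) / lifetime


v = RegisterVersion(allocated=100, written=104, deallocated=150,
                    reads=[106, 110, 120])
print(critical_interval(v), critical_fraction(v))
```

Under these assumptions, a fault whose active time falls entirely outside the returned interval would not corrupt a value that is later read.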
Experimental Setup
• Simulated processor configurations
  • Execution-driven simulator Sim-Alpha
  • Reorder buffer / register file: 80/80 entries
  • 4 integer ALUs, 2 integer multipliers, 2 floating-point ALUs
  • Hybrid branch predictor: 4K global + 2-level 1K local + 4K choice
  • 64KB 2-way L1 data cache, 2MB direct-mapped L2 cache
• Workload
  • SPEC2000 integer benchmark suite
  • Simulate 100M instructions with SimPoint

IVF vs AVF
[Figure: reorder buffer IVF and AVF across benchmarks]
• IVF varies significantly across benchmarks
• The longer the burst length, the higher the IVF
• IVF is much higher than AVF

Different Fault Configurations
[Figure: reorder buffer IVF under different fault configurations]
• IVF varies little across burst-length configurations
• IVF varies significantly across different active times

IVF at Entry Level
[Figure: register file IVF per entry, with architectural registers and renaming registers marked]
• IVF varies across different entries
• Architectural registers are more vulnerable

Implications
• Quantitatively guide reliability design at an early design stage and evaluate system reliability
• Harden selected structures or entries for high reliability while minimizing overhead
  • Razor [MICRO’03]
  • Parshield [DSN’07]
• Easily extended to analyze other structures (issue queue, load/store queue, and cache)

Conclusions
• Propose a methodology to characterize the vulnerability of microprocessor structures to intermittent faults
• Compute IVF for the reorder buffer and the register file
• IVF varies significantly both across structures and within a structure, which motivates protecting the most vulnerable structures to improve system reliability

Thank You for Your Attention
• Questions?