330 likes | 611 Views
Architectural Vulnerability Factor (does a soft error matter?). Arijit Biswas Shubu Mukherjee Intel Corporation July, 2009. Outline . Background Terminology & Metrics AVF Computation Techniques AVF Results & Summary. What is a soft error? Strike Changes State of a Single Bit. 0. 1.
E N D
Architectural Vulnerability Factor(does a soft error matter?) Arijit BiswasShubu Mukherjee Intel Corporation July, 2009
Outline • Background • Terminology & Metrics • AVF Computation Techniques • AVF Results & Summary
What is a soft error?Strike Changes State of a Single Bit 0 1
source drain Impact of Neutron Strike on a Si Device neutron strike Strikes release electron & hole pairs that can be absorbed by source & drain to alter the state of the device + + - + + - - - Transistor Device • Secondary source of upsets: alpha particles from packaging
p p n n p n n p n p n Earth’s Surface Cosmic Rays Come From Deep Space • Neutron flux is higher at higher altitudes
Impact of Elevation Figure 8, Ziegler, et al., “IBM experiments in soft fails in computer electronics (1978 - 1994),” IBM J. of R. & D., Vol. 40, No. 1, Jan. 1996. • 3x-5x increase in Denver at 5,000 feet • 100x increase in airplanes at 30,000+ feet • No practical shielding exists (eg. >10 ft. of concrete)
# Vulnerable Bits Growing with Moore’s Law • Aggressive designs have significantly higher number of vulnerable latches • Additional soft errors on RAM cells, static logic, & dynamic logic • Higher soft error rate in multiprocessor systems • If left unprotected, a data center with hundreds to thousands of such systems may encounter data corruptions on a weekly or daily basis • Especially troubling for HPC systems • Even detected errors can be a concern for HPC systems • High rate of detection without recovery could cripple HPC systems • System halt or app-kill • HPC systems must consider recovery and failover mechanisms
Outline • Background • Terminology & Metrics • AVF Computation Techniques • AVF Results & Summary
Bit Read? Bit has error protection benign fault no error benign fault no error benign fault no error Does bit matter? Does bit matter? False Detected Unrecoverable Error True Detected Unrecoverable Error Silent Data Corruption Strike on a bit (e.g., in register file) Particle Strike Causes Bit Flip! no yes Detection & Correction no Detection only yes no no yes SDC = Silent Data Corruption, DUE = Detected Unrecoverable Error
Metrics • Interval-based • MTTF = Mean Time to Failure • MTTR = Mean Time to Repair • MTBF = Mean Time Between Failures = MTTF + MTTR • Availability = MTTF / MTBF • Rate-based • FIT = Failure in Time = 1 failure in a billion hours • SER FIT is broken into SDC FIT and DUE FIT • Total Soft Error FIT = (for each vulnerable device i) (circuit-level soft error ratei * AVFi) • AVF = fraction of faults that become user-visible errors
Architectural Vulnerability FactorDoes a bit matter? • Branch Predictor • Doesn’t matter at all (AVF = 0%) • Program Counter • Almost always matters (AVF ~ 100%) • Computing AVF for complex structures • Statistical Fault Injection • ACE Analysis
Outline • Background • Terminology & Metrics • AVF Computation Techniques • AVF Results & Summary
Statistical Fault Injection (SFI) into RTL Simulate Strike on Latch Logic 0 1 output 0 Does Fault Propagate to Architectural State AVF (crudely) = number of errors in arch. state / number of injected faults SFI into RTL good for latches, but very difficult for arch. structures SFI is extremely compute intensive as many fault simulations are needed in order to provide statistically significant results
Architecturally Correct Execution (ACE) Program Input • ACE path requires only a subset of values to flow correctly through the program’s data flow graph (and the machine) • Anything else (un-ACE path) can be derated away Dynamically Dead Instruction Program Outputs
NOP Prefetch ACE Inst Wrong- Path Inst Invalid Mapping ACE & un-ACE Instructions to the Instruction Queue ACEInst Architectural un-ACE Micro-architectural un-ACE SER & AVF are properties of a bit in hardware so instruction ACE-ness must be mapped to the hardware level
ACE Lifetime Analysis (1)(e.g., write-through data cache) • Idle is unACE • Assuming all time intervals are equal • For 3/5 of the lifetime the bit is valid • Gives a measure of the structure’s utilization • Number of useful bits • Amount of time useful bits are resident in structure • Valid for a particular trace Fill Read Read Evict Idle Valid Valid Valid Idle Write-through Data Cache
Fill Read Read Evict Idle Idle Write-through Data Cache ACE Lifetime Analysis (2)(e.g., write-through data cache) • Valid is not necessarily ACE • ACE % = AVF = 2/5 = 40% • Example Lifetime Components • ACE: fill-to-read, read-to-read • unACE: idle, read-to-evict, write-to-evict
ACE Lifetime Analysis (3)(e.g., write-through data cache) • Data ACEness is a function of instruction ACEness • Second Read is by an unACE instruction • AVF = 1/5 = 20% Fill Read Read Evict Idle Idle Write-through Data Cache
Computing AVF of address-based structures (tags) • A fault associated with a tag that is nominally associated with a particular instruction can impact the correct execution of a different independent instruction • False Negatives only error if writeback is necessary • Uses standard lifetime analysis • False Positives always result in error • Need bit-level analysis
False Positive • Expected Tag Miss, but got Hit – Error • How do you compute the AVF? Incoming Address Tag Address 1 0 0 1 1 0 0 0 • Expect: MISS Tag Address Incoming Address 1 0 0 1 1 0 0 1 • Acquire: HIT
Hamming-Distance-1 Analysis • Assuming a single-bit error model • Now we can use lifetime analysis on the identified bit(s) Tag Array 101010 Hamming-Distance-1 Match Incoming Address 001010 111010 000001 111000 Hamming-Distance-1 Match 010101 111111
Outline • Background • Terminology & Metrics • AVF Computation Techniques • AVF Results & Summary • Summary
Instruction Queue IA64-like SPEC 2K ACE percentage = AVF = 29%
Summary • Soft Errors: real problem today • Culprits: neutrons from deep space & alpha from packaging • Major problem in next few technology generations • Problem scales with Moore’s Law, die size, & system size • HPC systems are particularly affected due to size • Industry looking for cost-effective solutions • AVF analysis is the first step towards accurately estimating soft error rates at the architectural level • ACE Lifetime Analysis • Hamming-distance-one analysis
Data AVFs (Average) IA64-like SPEC 2K • STB AVF lower due to large idle component and bytemasks • DTB AVF higher due to high average utilization • Dcache(WB) AVF higher than Dcache(WT) since dirty bytes still ACE after last read
Tag AVFs (Average) IA64-like SPEC 2K • Tag AVFs are low for DTB and DCache (WT) • Only Hamming-Distance-1 matches contribute to ACE time • Tag AVFs higher than data for STB and DCache (WB) • Dynamically dead tags are still ACE for dirty bytes