Architectural Vulnerability Factor (AVF) Computation for Address-Based Structures
Arijit Biswas, Paul Racunas, Shubu Mukherjee (FACT Group, DEG, Intel)
Joel Emer (VSSAD, Intel)
Razvan Cheveresan (Sun Microsystems; Intern, FACT Group)
Ram Rangan (Princeton University; Intern, FACT Group)

Moore's Law Graph
[Chart: failure rate from latches (log scale, 1 to 10000) by year, 2003 to 2012. Series: 100% Vulnerable, 20% Vulnerable, and the 1000 year MTBF Goal. Annotation: 12x GAP.]
• Soft errors are a serious problem
• Assuming a certain error rate, the failure rate of the whole chip increases
• Chart based on 200,000 latches as used in the Fujitsu SPARC Processor (2003)
FACT Group, Intel

All bits are not created equal!
[Figure: a stored bit (values 1 and 0). A particle strike causes a bit flip.]

All bits are not created equal!
Particle strike causes bit flip!
• Bit read?
  • no: benign fault, no error
  • yes: Bit has error protection?
    • Detection & Correction: benign fault, no error
    • Detection only: Does bit matter?
      • yes: True Detected Unrecoverable Error
      • no: False Detected Unrecoverable Error
    • no protection: Does bit matter?
      • yes: Silent Data Corruption
      • no: benign fault, no error

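The decision tree on this slide can be read as a small classification function. The following is a minimal sketch, not the authors' tooling: the names `Protection` and `classify_fault` are hypothetical, and the branch order follows the reading of the flattened diagram given above.

```python
from enum import Enum


class Protection(Enum):
    """Protection options named on the slide (enum name is hypothetical)."""
    NONE = "none"
    DETECTION_ONLY = "detection only"
    DETECTION_AND_CORRECTION = "detection & correction"


def classify_fault(bit_read: bool, protection: Protection, bit_matters: bool) -> str:
    """Classify the outcome of a single bit flip, following the slide's tree."""
    if not bit_read:
        # A flipped bit that is never read cannot affect anything.
        return "benign fault, no error"
    if protection is Protection.DETECTION_AND_CORRECTION:
        # The fault is detected and corrected before it is used.
        return "benign fault, no error"
    if protection is Protection.DETECTION_ONLY:
        # Detection raises an error whether or not the bit mattered.
        if bit_matters:
            return "true detected unrecoverable error"
        return "false detected unrecoverable error"
    # No protection: the flip is silent.
    if bit_matters:
        return "silent data corruption"
    return "benign fault, no error"


if __name__ == "__main__":
    for prot in Protection:
        for matters in (True, False):
            print(prot.value, matters, "->", classify_fault(True, prot, matters))
```

Under this reading, "Does bit matter?" is the only question that AVF analysis has to answer; the other two branches are properties of the access pattern and of the hardware.
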
Does bit matter?
• Architectural Vulnerability Factor (AVF)
  • Probability that a bit flip will cause a user-visible error
• Soft Error Rate (SER) of a structure = AVF_bit × (# bits) × (intrinsic error rate)_bit
• Reducing AVF reduces SER
  • High AVF indicates need for protection
  • Low AVF can help remove protection hardware
• SER protection can be expensive
  • Impacts area, power, performance, design time

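The SER relation above is a plain product, so a short sketch is enough to show how AVF scales it. The function name `structure_ser` is hypothetical, and the bit count and intrinsic error rate in the example are made-up illustration values, not figures from the talk. Only the 29% AVF is taken from the slides.

```python
def structure_ser(avf_bit: float, num_bits: int, intrinsic_rate_per_bit: float) -> float:
    """SER of a structure = AVF_bit x (# bits) x (intrinsic error rate per bit)."""
    if not 0.0 <= avf_bit <= 1.0:
        raise ValueError("AVF is a probability and must lie in [0, 1]")
    return avf_bit * num_bits * intrinsic_rate_per_bit


if __name__ == "__main__":
    # Hypothetical structure: 1000 bits at 0.001 failures per bit per unit time.
    bits, rate = 1000, 0.001
    worst_case = structure_ser(1.0, bits, rate)   # every bit assumed vulnerable
    measured = structure_ser(0.29, bits, rate)    # AVF = 29%, as quoted for the instruction queue
    print("SER assuming 100% AVF:", worst_case)
    print("SER with AVF = 29%:   ", measured)
```

Because the relation is linear in AVF, halving the AVF of a structure halves its contribution to the chip's soft error rate.
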
Simple Examples
• Committed Program Counter AVF ~ 100%
• Branch Predictor AVF = 0%

Complex Examples
• Instruction Queue AVF = 29%
• Execution Units AVF = 9%
• Computed using a new concept: Architecturally Correct Execution (ACE)

Architecturally Correct Execution (ACE)
[Figure: data flow from Program Input to Program Outputs.]
• The ACE path requires only a subset of values to flow correctly through the program's data flow graph (and the machine)
• Anything else (the un-ACE path) can be derated away

Example of un-ACE instruction: Dynamically Dead Instruction
[Figure: a dynamically dead instruction.]
• Most bits of an un-ACE instruction do not affect program output

ACE Breakdown of Instruction Queue
• Average across all SPEC2K slices for an IA64-like processor
• ACE % = AVF = 29%

A New AVF Analysis: Address-Based Structures
• Caches, data translation buffers, store buffers
  • Make up large portions of a modern chip
• Simple ACE analysis is no longer enough
• Data and tag structures need new concepts
  • Extended Lifetime Analysis
  • Hamming-Distance-1 Analysis
  • Cooldown
• AVF Reduction: Flushing

Lifetime Analysis
[Timeline: Idle | Fill | Valid | Read | Valid | Read | Valid | Evict | Idle]
• Idle is unACE
• Assuming all time intervals are equal, the bit is valid for 3/5 of the lifetime
• Gives a measure of the structure's utilization
  • Number of useful bits
  • Amount of time useful bits are resident in the structure
• Valid for a particular trace

Lifetime Analysis of Write-through Data Cache
[Timeline, write-through data cache: Idle | Fill | Read | Read | Evict | Idle]
• Valid is not necessarily ACE
• ACE % = AVF = 2/5 = 40%
• Example lifetime components
  • ACE: fill-to-read, read-to-read
  • unACE: idle, read-to-evict, write-to-evict

Lifetime Analysis of Write-through Data Cache
[Timeline, write-through DCache: Idle | Fill | Read | Read | Evict | Idle]
• Data ACEness is a function of instruction ACEness
• The second read is by an unACE instruction
• AVF = 1/5 = 20%

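The three lifetime slides above (3/5 valid, 40% AVF, 20% AVF) can be reproduced with a short sketch. This is an illustration under stated assumptions, not the authors' simulator: the function name `write_through_lifetime` and the event tuple layout are hypothetical, only fill, read, and evict events are modeled, and the time from a fill to the last read by an ACE instruction is counted as ACE, with everything after it treated as read-to-evict (unACE).

```python
def write_through_lifetime(events, total_time):
    """Return (valid_fraction, avf) for one bit of a write-through data cache.

    events: time-ordered (time, kind, by_ace_instruction) tuples, where kind is
    "fill", "read", or "evict". The flag is only meaningful for reads.
    A lifetime that is still open when the events end is not counted.
    """
    ace_time = 0.0
    valid_time = 0.0
    fill_time = None       # None while the entry is idle
    last_ace_read = None   # time of the latest read by an ACE instruction
    for time, kind, by_ace in events:
        if kind == "fill":
            fill_time = time
            last_ace_read = None
        elif fill_time is None:
            raise ValueError(f"{kind} at t={time} without a preceding fill")
        elif kind == "read":
            if by_ace:
                last_ace_read = time
        elif kind == "evict":
            valid_time += time - fill_time
            if last_ace_read is not None:
                # Data must stay correct from the fill up to the last ACE read.
                ace_time += last_ace_read - fill_time
            fill_time = None
        else:
            raise ValueError(f"unknown event kind: {kind}")
    return valid_time / total_time, ace_time / total_time


if __name__ == "__main__":
    # Five equal intervals: idle, fill-to-read, read-to-read, read-to-evict, idle.
    both_ace = [(1, "fill", False), (2, "read", True), (3, "read", True), (4, "evict", False)]
    second_unace = [(1, "fill", False), (2, "read", True), (3, "read", False), (4, "evict", False)]
    print(write_through_lifetime(both_ace, 5))
    print(write_through_lifetime(second_unace, 5))
```

Tracking the last ACE read, instead of classifying each interval by the read that ends it, keeps an interval ACE when an unACE read is followed later by an ACE read of the same data.
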
Tags are Hard
• A fault in a tag that is nominally associated with one instruction can affect the correct execution of a different, independent instruction
• False negatives cause an error only if a writeback is necessary
  • Uses standard lifetime analysis
• False positives always result in an error
  • Need bit-level analysis

False Positive
• Expected a tag miss, but got a hit: error
• How do you compute the AVF? Fault injection?
[Figure: incoming address 1 0 0 1 compared with tag address 1 0 0 0. Expect: MISS]
[Figure: tag address 1 0 0 1 compared with incoming address 1 0 0 1. Got: HIT]

Hamming-Distance-1 Analysis
• Assuming a single-bit error model
• Now we can use lifetime analysis on the identified bit(s)
[Figure: a tag array compared against an incoming address; two entries are marked "Hamming-Distance-1 Match". Bit patterns shown: 101010, 001010, 111010, 000001, 111000, 010101, 111111.]

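Under a single-bit error model, only a tag that differs from the incoming address in exactly one bit can be turned into a false hit, and only that one bit is vulnerable. The sketch below finds such tags and the bit position involved. The function names are hypothetical. The flattened slide does not make clear which pattern is the incoming address, so the example assumes it is 111010, one reading that is consistent with the two entries marked as matches.

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


def hd1_matches(tags, incoming):
    """Return (index, tag, bit_position) for every tag exactly one bit away.

    bit_position counts from 0 at the least significant bit; it identifies the
    single tag bit whose flip would turn a miss into a false hit.
    """
    matches = []
    for index, tag in enumerate(tags):
        diff = tag ^ incoming
        # diff has exactly one bit set when it is non-zero and a power of two.
        if diff != 0 and diff & (diff - 1) == 0:
            matches.append((index, tag, diff.bit_length() - 1))
    return matches


if __name__ == "__main__":
    # Assumed reading of the slide: this tag array, incoming address 111010.
    tags = [0b101010, 0b001010, 0b000001, 0b111000, 0b010101, 0b111111]
    for index, tag, bit in hd1_matches(tags, 0b111010):
        print(f"entry {index}: tag {tag:06b} differs only in bit {bit}")

    # False-positive slide: tag 1000 versus incoming address 1001.
    print(hd1_matches([0b1000], 0b1001))
```

Lifetime analysis then only needs to be applied to the bits returned here, for the intervals that end in an access by a Hamming-distance-1 address.
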
Edge Effects
[Timeline: Idle | Fill | Read | Read | Unknown | Sim End | Not Simulated: Evict | Idle]
• Simulation introduces an unknown component
  • Simulation is not run to completion
  • Only a small segment of code is executed
• Worst Case AVF = Known AVF + Unknown AVF
• How do we reduce or eliminate the unknown component?

Cooldown
• Run the simulation beyond the end of the measured interval
• Any bits that were already valid (the unknown bits) are resolved
• Trend: unknown AVF primarily resolves to unACE
• Best Estimate AVF = Known AVF after Cooldown
[Chart: AVF with No Cooldown versus with Cooldown; 10 million instructions of simulation followed by 10 million instructions of cooldown.]

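The relation between known, unknown, worst-case, and best-estimate AVF can be sketched for a single bit. This is an assumption-laden illustration, not the authors' method: the function names and event labels are hypothetical, the structure is taken to be a write-through cache, and the unknown time inside the measured interval is resolved by the first decisive event seen during cooldown (a read by an ACE instruction makes it ACE, an eviction makes it unACE).

```python
def avf_bounds(known_ace_time, unknown_time, total_time):
    """Return (known_avf, worst_case_avf) = (known, known + unknown)."""
    known = known_ace_time / total_time
    worst_case = (known_ace_time + unknown_time) / total_time
    return known, worst_case


def apply_cooldown(known_ace_time, unknown_time, cooldown_events):
    """Resolve unknown time using events observed after the measured interval.

    cooldown_events: ordered labels "ace_read", "unace_read", or "evict".
    Returns the updated (known_ace_time, unknown_time).
    """
    for event in cooldown_events:
        if event == "ace_read":
            # The value was needed after all: the unknown time was ACE.
            return known_ace_time + unknown_time, 0
        if event == "evict":
            # Evicted without an ACE read: the unknown time was unACE.
            return known_ace_time, 0
        # An unACE read decides nothing; keep looking.
    return known_ace_time, unknown_time


if __name__ == "__main__":
    # Fill at t=1, ACE read at t=2, simulation ends at t=5 with the entry still valid.
    known_ace, unknown, total = 1, 3, 5
    print("no cooldown:", avf_bounds(known_ace, unknown, total))
    resolved = apply_cooldown(known_ace, unknown, ["unace_read", "evict"])
    print("after cooldown:", avf_bounds(*resolved, total))
```

Time spent in the cooldown window itself is not added to the totals here; like warmup, it only supplies information about the measured interval.
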
Data AVFs (Average)
• Store buffer (STB) AVF is lower due to a large idle component and bytemasks
• Data translation buffer (DTB) AVF is higher due to high average utilization
• DCache (WB) AVF is higher than DCache (WT) since dirty bytes are still ACE after the last read

Data AVF of DTB
• Large variability in AVF
  • Ranges from ~0% to 80%
  • Based on structure utilization by benchmark

Tag AVFs (Average)
• Tag AVFs are lower than expected for DTB and DCache (WT)
  • Only Hamming-Distance-1 matches contribute ACE time
• Tag AVFs are higher than data AVFs for STB and DCache (WB)
  • Dynamically dead tags are still ACE for dirty bytes

Tag AVF of DTB
• AVFs are surprisingly small, with little variation
• Protection was added to the DTB CAMs prior to the AVF calculation (large number of bits)
• The AVF calculation shows that NO protection was needed in this case

AVF Observations
• DTB and Write-through Data Cache
  • Typically Tag AVF < Data AVF
  • Only Hamming-distance-1 hits contribute to Tag AVF
  • Dynamically dead data are unACE
• STB and Write-back Data Cache
  • Typically Tag AVF ≥ Data AVF
  • Tag is ACE until eviction if the line is dirty
  • Dynamically dead data can be ACE
  • Bytemasks and writes may make certain bytes of data unACE while all bits of the tag are always ACE

AVF Reduction: Flushing
[Timelines: without flushing, Idle | Fill | ACE | Read | ACE | Read | Evict | Idle; with flushing, a Flush follows the Fill.]
• Flushing (emulates a context switch)
• Also eliminates unknowns by flushing all live entries at the end of simulation
• Main concept: transform part of the ACE time into unACE time at the expense of some performance

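The trade described above, less ACE time in exchange for extra misses, can be illustrated for one entry. This is a simplified sketch under several assumptions that are not in the slides: the function name is hypothetical, flushes happen at every multiple of `flush_period`, a flush takes zero time, and a read that finds its entry flushed misses and refills it instantly with correct data.

```python
def ace_time_with_flush(fill_time, ace_read_times, flush_period=None):
    """Return (ace_time, extra_misses) for one entry of a read-only structure.

    ace_read_times: times of reads by ACE instructions, all after fill_time.
    flush_period: flush at every multiple of this period, or None for no flushing.
    """
    ace_time = 0
    extra_misses = 0
    start = fill_time  # start of the interval that the next read would make ACE
    for t in sorted(ace_read_times):
        if flush_period is not None:
            last_flush = (t // flush_period) * flush_period
            if last_flush > start:
                # The entry was flushed between `start` and this read, so the
                # time up to the flush became unACE and the read is a miss.
                extra_misses += 1
                start = t  # assumed: refilled with correct data at the read
                continue
        ace_time += t - start
        start = t
    return ace_time, extra_misses


if __name__ == "__main__":
    reads = [40, 250, 260]
    print("no flushing:     ", ace_time_with_flush(0, reads))
    print("flush every 100: ", ace_time_with_flush(0, reads, flush_period=100))
```

In this toy trace, flushing removes the long fill-to-read interval from the ACE total at the cost of one extra miss, which is the shape of the trade the slide describes.
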
AVF Reduction: Flushing
[Charts: Data and Tags AVFs for No Flushing, 5M cycle Flush, 1M cycle Flush, and 100K cycle Flush.]
• >50% AVF reduction for 100K cycle Flush (Flush takes 0 time)
• Max IPC reduction: 1.77% DTB, 1.25% WT/WB DCache
• Avg IPC reduction: 0.56% DTB, 0.19% WT/WB DCache

Summary
• SER is an ever-increasing problem
• Need a standard, quantitative way to evaluate the design cost of adding protection/recovery to structures
• AVF gives us a quantitative way to measure the cost of adding protection
• Presented a methodology to compute the AVF of address-based structures
  • Lifetime Analysis
  • False Negatives and False Positives
  • Hamming-Distance-1 Analysis for False Positives
  • Edge Effects and Cooldown (analogous to Warmup)
• AVF Reduction: Flushing