
Architectural Vulnerability Factor (AVF) Computation for Address-Based Structures

Arijit Biswas, Paul Racunas, Shubu Mukherjee (FACT Group, DEG, Intel); Joel Emer (VSSAD, Intel); Razvan Cheveresan (Sun Microsystems; intern, FACT Group); Ram Rangan (Princeton University; intern, FACT Group).


Presentation Transcript


  1. Architectural Vulnerability Factor (AVF) Computation for Address-Based Structures • Arijit Biswas, Paul Racunas, Shubu Mukherjee (FACT Group, DEG, Intel) • Joel Emer (VSSAD, Intel) • Razvan Cheveresan (Sun Microsystems; intern, FACT Group) • Ram Rangan (Princeton University; intern, FACT Group)

  2. [Chart, labeled "Moore’s Law Graph": failure rate from latches by year, 2003 to 2012, on a log scale from 1 to 10000, with curves for 100% vulnerable and 20% vulnerable latches, a 1000-year MTBF goal, and a 12x gap marked] • Soft errors are a serious problem • Assuming a certain error rate, the failure rate of the whole chip increases • Chart based on 200,000 latches as used in the Fujitsu SPARC Processor (2003) (FACT Group, Intel)

  3. All bits are not created equal! • Particle strike causes bit flip! [Figure: a stored bit, values 1 and 0]

  4. All bits are not created equal! • Particle strike causes bit flip! [Flowchart of outcomes] • Bit read? No: benign fault, no error • Bit read, error protection with detection & correction: benign fault, no error • Bit read, error protection with detection only: does bit matter? Yes: True Detected Unrecoverable Error; no: False Detected Unrecoverable Error • Bit read, no error protection: does bit matter? Yes: Silent Data Corruption; no: benign fault, no error
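The decision flow on slide 4 can be sketched as a small function. This is only an illustration of the flowchart as reconstructed from the transcript; the names `Protection` and `classify_bit_flip` are hypothetical and do not come from the presentation.

```python
from enum import Enum


class Protection(Enum):
    """Kinds of error protection named on the slide (enum name is hypothetical)."""
    NONE = "none"
    DETECTION_ONLY = "detection only"
    DETECTION_AND_CORRECTION = "detection & correction"


def classify_bit_flip(bit_read: bool, protection: Protection, bit_matters: bool) -> str:
    """Walk the slide's flowchart for one flipped bit and return the outcome."""
    if not bit_read:
        # A flipped bit that is never read cannot affect anything.
        return "benign fault, no error"
    if protection is Protection.DETECTION_AND_CORRECTION:
        # The fault is detected and corrected before use.
        return "benign fault, no error"
    if protection is Protection.DETECTION_ONLY:
        # The fault is always reported; it is a "false" report if the bit did not matter.
        if bit_matters:
            return "true detected unrecoverable error"
        return "false detected unrecoverable error"
    # No protection: the fault is silent, and harmful only if the bit matters.
    if bit_matters:
        return "silent data corruption"
    return "benign fault, no error"


print(classify_bit_flip(True, Protection.NONE, True))
print(classify_bit_flip(True, Protection.DETECTION_ONLY, False))
```

The "does bit matter?" input is exactly the question that AVF quantifies on the next slide.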

  5. Does bit matter? • Architectural Vulnerability Factor (AVF) • Probability that a bit flip will cause a user-visible error • $\mathrm{SER}_{\text{structure}} = \mathrm{AVF}_{\text{bit}} \times (\#\,\text{bits}) \times (\text{intrinsic error rate})_{\text{bit}}$, where SER is the Soft Error Rate of a structure • Reducing AVF reduces SER • High AVF indicates need for protection • Low AVF can help remove protection hardware • SER protection can be expensive • Impacts area, power, performance, design time
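As a worked instance of the SER product on slide 5, the sketch below multiplies the three factors. The function name and the bit count and per-bit rate in the example are made-up illustrative values, not figures from the presentation; only the 29% AVF is taken from the instruction queue example on slide 7.

```python
def structure_ser(avf_bit: float, num_bits: int, intrinsic_rate_per_bit: float) -> float:
    """Soft error rate of a structure = AVF per bit x number of bits x intrinsic rate per bit."""
    if not 0.0 <= avf_bit <= 1.0:
        raise ValueError("AVF is a probability and must lie in [0, 1]")
    if num_bits < 0 or intrinsic_rate_per_bit < 0:
        raise ValueError("bit count and intrinsic rate must be non-negative")
    return avf_bit * num_bits * intrinsic_rate_per_bit


# Hypothetical structure: 1000 bits, 0.001 errors per bit per unit time.
baseline = structure_ser(avf_bit=0.29, num_bits=1000, intrinsic_rate_per_bit=0.001)
halved = structure_ser(avf_bit=0.145, num_bits=1000, intrinsic_rate_per_bit=0.001)
print(baseline, halved)
```

Because the relation is a plain product, halving the AVF halves the structure's SER, which is the sense in which "reducing AVF reduces SER".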

  6. Simple Examples • Committed Program Counter AVF ~ 100% • Branch Predictor AVF = 0%

  7. Complex Examples • Instruction Queue AVF = 29% • Execution Units AVF = 9% • Used a new concept • Architecturally Correct Execution (ACE)

  8. Architecturally Correct Execution (ACE) [Figure: data flow from program input to program outputs] • ACE path requires only a subset of values to flow correctly through the program’s data flow graph (and the machine) • Anything else (un-ACE path) can be derated away

  9. Example of un-ACE instruction: Dynamically Dead Instruction • Most bits of an un-ACE instruction do not affect program output

  10. ACE Breakdown of Instruction Queue • Average across all of the Spec2K slices for an IA64-like processor • ACE % = AVF = 29%

  11. A New AVF Analysis: Address-Based Structures • Caches, data translation buffers, store buffers • Make up large portions of a modern chip • Simple ACE analysis is no longer enough • Data & tag structures need new concepts • Extended Lifetime Analysis • Hamming-Distance-1 Analysis • Cooldown • AVF Reduction: Flushing

  12. Lifetime Analysis [Timeline: Idle, Fill, Valid, Read, Valid, Read, Valid, Evict, Idle] • Idle is unACE • Assuming all time intervals are equal • For 3/5 of the lifetime the bit is valid • Gives a measure of the structure’s utilization • Number of useful bits • Amount of time useful bits are resident in the structure • Valid for a particular trace

  13. Lifetime Analysis of Write-through Data Cache [Timeline: Idle, Fill, Read, Read, Evict, Idle] • Valid is not necessarily ACE • ACE % = AVF = 2/5 = 40% • Example lifetime components • ACE: fill-to-read, read-to-read • unACE: idle, read-to-evict, write-to-evict

  14. Lifetime Analysis of Write-through Data Cache [Timeline: Idle, Fill, Read, Read, Evict, Idle] • Data ACEness is a function of instruction ACEness • Second read is by an unACE instruction • AVF = 1/5 = 20%
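The arithmetic on slides 12 to 14 can be reproduced with a short sketch. It assumes, as the slides do, equal-length intervals, and it assumes a write-through data entry with no intervening writes, so that the time from fill up to the last read by an ACE instruction is ACE and everything else is unACE. The function names are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[str, float, bool]  # (label, duration, is_ace)


def write_through_intervals(read_is_ace: List[bool], unit: float = 1.0) -> List[Interval]:
    """Build idle / fill-to-read / read-to-read / read-to-evict / idle intervals.

    read_is_ace[i] says whether the i-th read is issued by an ACE instruction.
    An interval ending in a read is ACE if that read or any later read is ACE,
    because the data must stay correct until the last ACE read.
    """
    intervals: List[Interval] = [("idle", unit, False)]
    for i in range(len(read_is_ace)):
        label = "fill-to-read" if i == 0 else "read-to-read"
        intervals.append((label, unit, any(read_is_ace[i:])))
    last_label = "read-to-evict" if read_is_ace else "fill-to-evict"
    intervals.append((last_label, unit, False))
    intervals.append(("idle", unit, False))
    return intervals


def avf(intervals: List[Interval]) -> float:
    """AVF = ACE time / total time over the traced lifetime."""
    total = sum(duration for _, duration, _ in intervals)
    if total <= 0:
        raise ValueError("total lifetime must be positive")
    ace = sum(duration for _, duration, is_ace in intervals if is_ace)
    return ace / total


print(avf(write_through_intervals([True, True])))   # slide 13 case: 2 of 5 intervals ACE
print(avf(write_through_intervals([True, False])))  # slide 14 case: 1 of 5 intervals ACE
```

With both reads ACE the result is 2/5, and with the second read unACE it drops to 1/5, matching the two slides.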

  15. Tags are Hard • A fault in a tag that is nominally associated with one instruction can affect the correct execution of a different, independent instruction • False negatives cause an error only if a writeback is necessary • Use standard lifetime analysis • False positives always result in an error • Need bit-level analysis

  16. False Positive • Expected tag miss, but got hit: error • How do you compute the AVF? Fault injection? [Figure: incoming address 1001 vs. tag address 1000, expect MISS; tag address 1001 vs. incoming address 1001, get HIT]

  17. Hamming-Distance-1 Analysis • Assuming a single-bit error model • Now we can use lifetime analysis on the identified bit(s) [Figure: incoming address 001010 compared against a tag array containing 101010, 111010, 000001, 111000, 010101, 111111, with Hamming-distance-1 matches marked]
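A minimal sketch of the Hamming-distance-1 check follows. Under a single-bit error model, only tags that differ from the incoming address in exactly one bit can turn an expected miss into a false hit, so those are the tags (and bit positions) worth tracking with lifetime analysis. The function name is hypothetical; the bit strings are the ones that appear in the slide's figure, whose exact layout is not recoverable from the transcript.

```python
from typing import List, Tuple


def hd1_matches(incoming: str, tags: List[str]) -> List[Tuple[str, int]]:
    """Return (tag, bit_index) for every tag exactly one bit away from `incoming`.

    bit_index counts from the left (0 = leftmost bit) and identifies the single
    tag bit whose flip would make the tag equal to the incoming address.
    """
    matches: List[Tuple[str, int]] = []
    for tag in tags:
        if len(tag) != len(incoming):
            continue  # tags of a different width cannot match
        differing = [i for i, (t, a) in enumerate(zip(tag, incoming)) if t != a]
        if len(differing) == 1:
            matches.append((tag, differing[0]))
    return matches


tag_array = ["101010", "111010", "000001", "111000", "010101", "111111"]
print(hd1_matches("001010", tag_array))  # [('101010', 0)]
```

For the incoming address 001010 only the tag 101010 is one bit away. The slide marks a second Hamming-distance-1 match, but the transcript does not show which address it is paired with, so it is not reproduced here.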

  18. Edge Effects [Timeline labels: Idle, Fill, Read, Read, Sim End, Unknown, Evict (Not Simulated), Idle] • Simulation introduces an unknown component • Simulation is not run to completion • Only a small segment of code is executed • Worst Case AVF = Known AVF + Unknown AVF • How do we reduce or eliminate the unknown component?

  19. Cooldown [Figure: 10 million instructions of simulation followed by 10 million instructions of cooldown; results shown with and without cooldown] • Run the simulation beyond the end of the interval • Any bits that were already valid at the end (the unknown bits) are resolved • Trend: unknown AVF primarily resolves to unACE • Best Estimate AVF = Known AVF after Cooldown
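The edge-effect bounds and the cooldown step from slides 18 and 19 can be sketched as follows. The event names (`"ace_read"`, `"unace_read"`, `"write"`, `"evict"`) and both function names are hypothetical, and the resolution rule assumes a write-through data entry: the unknown interval is ACE if an ACE read arrives before the entry is overwritten or evicted.

```python
from typing import List, Tuple


def avf_bounds(ace_time: float, unknown_time: float, total_time: float) -> Tuple[float, float]:
    """Return (known AVF, worst-case AVF), where worst case = known + unknown."""
    if total_time <= 0:
        raise ValueError("total_time must be positive")
    known = ace_time / total_time
    worst = (ace_time + unknown_time) / total_time
    return known, worst


def resolve_unknown(cooldown_events: List[str]) -> str:
    """Classify the interval left open at simulation end using cooldown events."""
    for event in cooldown_events:
        if event == "ace_read":
            return "ACE"      # the data was needed after all
        if event in ("write", "evict"):
            return "unACE"    # overwritten or dropped before any ACE read
        # an unACE read does not settle the question; keep scanning
    return "unknown"          # cooldown was too short to decide


# One entry: 2 units ACE, 1 unit unknown at simulation end, 5 units total.
known, worst = avf_bounds(ace_time=2, unknown_time=1, total_time=5)
outcome = resolve_unknown(["unace_read", "evict"])
best_ace = 2 + (1 if outcome == "ACE" else 0)
print(known, worst, outcome, best_ace / 5)
```

In this example the unknown interval resolves to unACE, so the best-estimate AVF equals the known AVF, which is the trend the slide reports.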

  20. Data AVFs (Average) • STB AVF lower due to large idle component and bytemasks • DTB AVF higher due to high average utilization • Dcache (WB) AVF higher than Dcache (WT) since dirty bytes are still ACE after the last read

  21. Data AVF of DTB • Large variability in AVF • Ranges from ~0% to 80% • Based on structure utilization by benchmark

  22. Tag AVFs (Average) • Tag AVFs lower than expected for DTB and DCache (WT) • Only Hamming-distance-1 matches contribute ACE time • Tag AVFs higher than data AVFs for STB and DCache (WB) • Dynamically dead tags are still ACE for dirty bytes

  23. Tag AVF of DTB • AVFs surprisingly small, with little variation • Protection was added to the DTB CAMs prior to AVF calculation (large # bits) • AVF calculation shows NO protection was needed in this case

  24. AVF Observations • DTB and write-through data cache: typically Tag AVF < Data AVF; only Hamming-distance-1 hits contribute to Tag AVF; dynamically dead data are unACE • STB and write-back data cache: typically Tag AVF ≥ Data AVF; the tag is ACE until eviction if the line is dirty; dynamically dead data can be ACE • Bytemasks and writes may make certain bytes of data unACE while all bits of the tag are always ACE

  25. AVF Reduction: Flushing [Timeline labels: Fill, Read, Read, Evict, plus an added Flush and Fill; intervals Idle, ACE, ACE, Idle] • Flushing (emulates a context switch) • Also eliminates unknowns by flushing all live entries at the end of simulation • Main concept: transform part of the ACE time into unACE time at the expense of some performance
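The trade-off on slide 25 can be illustrated with a toy model. It is a sketch under several assumptions that are not in the presentation: a single write-through data entry, all reads ACE, a flush every `flush_period` time units that takes zero time, and a refill on the first read after each flush. The function name is hypothetical.

```python
from collections import defaultdict
from typing import List, Optional, Tuple


def ace_time_with_flush(fill: int, reads: List[int],
                        flush_period: Optional[int] = None) -> Tuple[int, int]:
    """Return (ACE time, extra refills) for one entry under periodic flushing.

    Without flushing, the entry is ACE from the fill to the last read.
    With flushing, each flush invalidates the entry, so ACE time only
    accumulates between the first and last access inside each flush epoch,
    and every later epoch that contains a read costs one extra refill.
    """
    events = [fill] + sorted(reads)
    if flush_period is None:
        return events[-1] - events[0], 0
    epochs = defaultdict(list)
    for t in events:
        epochs[t // flush_period].append(t)
    ace = sum(max(times) - min(times) for times in epochs.values())
    extra_refills = len(epochs) - 1  # every epoch except the fill's own
    return ace, extra_refills


lifetime = 500  # hypothetical window: fill at 0, evict at 500
print(ace_time_with_flush(0, [100, 250, 400]))                    # (400, 0)
print(ace_time_with_flush(0, [100, 250, 400], flush_period=200))  # (100, 2)
```

In this made-up trace the ACE fraction of the 500-unit window falls from 400/500 to 100/500, while two reads that would have hit now miss. That is the ACE-for-performance exchange the slide describes; the toy model says nothing about the actual magnitudes.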

  26. AVF Reduction: Flushing [Charts: Data and Tags, comparing no flushing with 5M-cycle, 1M-cycle, and 100K-cycle flush] • >50% AVF reduction for 100K-cycle flush (flush takes 0 time) • Max IPC reduction: 1.77% DTB, 1.25% WT/WB DCache • Avg IPC reduction: 0.56% DTB, 0.19% WT/WB DCache

  27. Summary • SER is an ever-increasing problem • Need a standard, quantitative way to evaluate the design cost of adding protection/recovery to structures • AVF gives us a quantitative way to measure the cost of adding protection • Presented a methodology to compute the AVF of address-based structures • Lifetime analysis • False negatives and false positives • Hamming-distance-1 analysis for false positives • Edge effects and cooldown (analogous to warmup) • AVF reduction: flushing
