Increasing Reliability of Performance-critical Pipeline Structures. Niranjan Soundararajan. Advisors: Vijaykrishnan Narayanan, Anand Sivasubramaniam. Computer Systems Lab (CSL), Microsystems Design Lab (MDL), Computer Science and Engineering, The Pennsylvania State University.
Reliability – Increasing Importance. Factors affecting HARDWARE RELIABILITY: • Decreasing transistor size • More transistors • Power/temperature hotspots • Increasing market segments
Performance-critical pipeline structures • Out-of-order entry activity • Back-to-back wakeup • Multi-width pipeline • Clock frequency increase [Figure: pipeline diagram split into FRONT END and BACK END; labelled blocks: Icache, Fetch, BHT, BTB, Decode, Alloc, RAT, Inst Issue Queue, ALU, Load/Store Queue, Dcache, Reorder Buffer, ARF, Inst Retires]
Transistor Failure • Solutions to address impact of process variations on issue queue • Solutions to reduce non-uniform aging due to NBTI, HCE on microprocessor structures • Soft error impact of DVFS on vulnerability of GALS architectures • Bounding vulnerability of processor structures to provide reliability guarantees [Figure: failure rate vs. time, with regions labelled Manufacturing Defects, Random Errors, and Wearout]
Outline • Motivation • Contributions • Vulnerability bounding mechanisms • Other solutions • Impact of DVFS on architectural vulnerability of GALS architectures • Address process variations in issue queue • Mitigate NBTI, HCE degradation in structures • Conclusion and Future work
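Introduction to Soft Errors: A particle strike creates electron-hole pairs that can be absorbed by the source/diffusion areas of the transistor, changing the state of the device. [Figure: strike on a transistor cross-section (p substrate, n+ regions) generating charge pairs; labels: Error, 1, 0] Source: M. Tahoori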
Impact of Soft Errors • Severity: In 2003, Fujitsu released SPARC64 with 80% of 200,000 latches covered by transient fault protection • Single Event Upset (SEU) model • Metrics • MTBF: Mean Time Between Failures • FIT: Failures in Time; 1 FIT = 1 failure in a billion hours • FITeff = FITraw * AVF [Figure: Severity of Soft Error Rates. Source: Shekar Borkar, Intel 2004]
Architectural Vulnerability Factor (AVF) • AVF: fraction of bits in a structure vulnerable to soft errors • AVF = ACE bits / (ACE bits + unACE bits) • AVF = Fn (Size, Time) • ACE: Architecturally Correct Execution [Figure: example instruction stream (LD A, BR, ST B, ADD, ST B) marking ACE instructions, unACE instructions (wrong path, dead store), and user-visible output]
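The AVF ratio on this slide can be sketched as a small calculation. This is a minimal illustration, assuming a hypothetical per-cycle trace of ACE bit counts for one structure; the function name `avf` and the trace values are not from the slides. Averaging over cycles reflects the "Fn (Size, Time)" dependence: the denominator is structure size times the number of cycles observed.

```python
from typing import Sequence


def avf(ace_bits_per_cycle: Sequence[int], structure_bits: int) -> float:
    """AVF over a window = ACE bit-cycles / total bit-cycles.

    Each cycle, every bit is either ACE or unACE, so
    ACE + unACE = structure_bits for that cycle.
    """
    if structure_bits <= 0 or not ace_bits_per_cycle:
        raise ValueError("need a positive structure size and at least one cycle")
    total_bit_cycles = structure_bits * len(ace_bits_per_cycle)
    return sum(ace_bits_per_cycle) / total_bit_cycles


# Hypothetical 4-cycle trace for a 64-bit structure (illustrative values only).
trace = [16, 32, 32, 48]
print(avf(trace, 64))  # 128 ACE bit-cycles / 256 total bit-cycles = 0.5
```

A structure that is half full of ACE state on average therefore has an AVF of 0.5, regardless of its raw error rate.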
AVF: Why is it important to Micro-architects? System Specification Architectural Design Logic Synthesis Circuit Design AVF per structure AVF FITraw System Reliability = ∑ (FITraw * AVF) Fabrication and Packaging Physical Design
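As a rough worked example of System Reliability = ∑ (FITraw * AVF), the sketch below sums per-structure effective FIT rates and converts the total to MTBF using the definition 1 FIT = 1 failure per billion hours. The structure names, raw FIT rates, and AVF values are hypothetical placeholders, not measurements from this work.

```python
HOURS_PER_FIT_UNIT = 1e9  # 1 FIT = 1 failure in a billion device-hours


def effective_fit(fit_raw: float, avf: float) -> float:
    """FITeff = FITraw * AVF for a single structure."""
    return fit_raw * avf


def system_fit(structures: dict) -> float:
    """Sum of FITraw * AVF over all structures."""
    return sum(effective_fit(raw, avf) for raw, avf in structures.values())


def mtbf_hours(fit: float) -> float:
    """Mean time between failures implied by a FIT rate."""
    return HOURS_PER_FIT_UNIT / fit


# Hypothetical (FITraw, AVF) pairs; illustrative values only.
structures = {
    "reorder_buffer": (1000.0, 0.25),
    "issue_queue": (500.0, 0.5),
    "register_file": (2000.0, 0.125),
}

total = system_fit(structures)
print(total)              # 250 + 250 + 250 = 750.0
print(mtbf_hours(total))  # 1e9 / 750 hours
```

Because FITraw is fixed by the circuit and technology, lowering a structure's AVF is the only term in this sum an architect can tune.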
State-of-Art • Microprocessor design: Multi-dimensional problem involving Performance, Power and Reliability • Transient Fault Tolerance • Simultaneous Redundant Threading (SRT) • Lockstepping • Optimization techniques • Parashar et al., ISCA’04 • Gomaa et al., ISCA’05 • Parashar et al., ASPLOS’06 • Reddy et al., ASPLOS’06 Performance Overhead Single point in Performance-Reliability space
Micro-architectural Reliability Knob • FITeff = FITraw * AVF, FITraw and AVF being constants • FITraw inflexible • Tune AVF to meet specifications (FITrequired) [Figure: reliability vs. performance, ranging from "More Reliable, Less Performance" to "Less Reliable, More Performance", with the ideal solution marked] “Challenge for computer architects is not to provide absolute guarantees in reliability, but rather how to provide the adequate amount of reliability at the lowest cost for the target market segment” Architecture Design for Soft Errors – Shubu Mukherjee, Intel
Contributions • First work that provides micro-architectural knobs to satisfy processor reliability budgets for transient faults • Proactive and Reactive mechanisms to monitor and bound vulnerabilities of processor structures at cycle-level granularity
AVF Monitoring: Reorder Buffer/Physical Register File 1. Large pipeline structure holding a number of instructions 2. Each instruction spends a significant percentage of its lifetime in the ROB [Figure: pipeline with in-order front end (Fetch, Decode, RAT), out-of-order section (Issue Queue, ALU), and in-order Commit; Reorder Buffer (ROB/PRF) and ARF shown]
AVF Monitoring Mechanism: Reorder Buffer (ROB) • Reorder buffer with N entries, each entry B bits • Result field of R bits, filled at WB; other fields filled at Dispatch • Events tracked: Dispatch Event, Writeback Event, Commit Event, Mis-speculation
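One way to read the monitoring slide is as a running counter of potentially ACE bits driven by the dispatch, writeback, commit, and mis-speculation events. The sketch below is an assumption-laden illustration, not the hardware described in the talk: the class name `RobAvfMonitor`, the choice to count B - R bits at dispatch and R bits at writeback, and the example sizes are all hypothetical.

```python
class RobAvfMonitor:
    """Counts ROB bits that currently hold potentially ACE state (hypothetical model)."""

    def __init__(self, n_entries: int, entry_bits: int, result_bits: int):
        if not 0 <= result_bits <= entry_bits:
            raise ValueError("result field must fit inside an entry")
        self.total_bits = n_entries * entry_bits  # N entries * B bits
        self.entry_bits = entry_bits              # B
        self.result_bits = result_bits            # R
        self.ace_bits = 0

    def dispatch(self) -> None:
        # Assumption: every field except the result is filled at dispatch.
        self.ace_bits += self.entry_bits - self.result_bits

    def writeback(self) -> None:
        # The R-bit result field becomes live at writeback.
        self.ace_bits += self.result_bits

    def commit(self) -> None:
        # A committed (written-back) entry leaves the ROB entirely.
        self.ace_bits -= self.entry_bits

    def squash(self, written_back: bool) -> None:
        # Mis-speculated entry: remove whatever it had contributed so far.
        filled = self.entry_bits if written_back else self.entry_bits - self.result_bits
        self.ace_bits -= filled

    def occupancy_avf(self) -> float:
        """Instantaneous fraction of ROB bits counted as ACE."""
        return self.ace_bits / self.total_bits


monitor = RobAvfMonitor(n_entries=4, entry_bits=64, result_bits=32)
monitor.dispatch()
monitor.dispatch()
monitor.writeback()
print(monitor.occupancy_avf())  # (32 + 32 + 32) / 256 = 0.375
```

This counter is conservative: mis-speculated instructions are counted until they are squashed, and the sketch does not check that events arrive in a legal order.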
Vulnerability Control via Throttling (VCT) • Stall dispatch and writeback • Writeback cannot be stalled • Entire entry ACE at dispatch • Size = Fn (AVF Bound) [Figure: N-entry reorder buffer with dispatch and writeback paths]
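The throttling idea can be sketched as an occupancy cap derived from the AVF bound. Since writeback cannot be stalled, the sketch charges the entire entry as ACE at dispatch and stalls dispatch once the cap is reached. The class name `ThrottledRob`, the linear mapping from bound to allowed entries, and the example numbers are assumptions for illustration; the slides only state that the size is a function of the AVF bound.

```python
class ThrottledRob:
    """ROB whose usable size is limited by an AVF bound (hypothetical model)."""

    def __init__(self, n_entries: int, avf_bound: float):
        if not 0.0 <= avf_bound <= 1.0:
            raise ValueError("AVF bound must be a fraction between 0 and 1")
        # Assumption: each occupied entry counts as fully ACE from dispatch,
        # so the allowed occupancy is simply bound * N, rounded down.
        self.limit = int(n_entries * avf_bound)
        self.occupied = 0

    def can_dispatch(self) -> bool:
        return self.occupied < self.limit

    def dispatch(self) -> bool:
        """Return True if the instruction was admitted, False if dispatch stalls."""
        if not self.can_dispatch():
            return False
        self.occupied += 1
        return True

    def commit(self) -> None:
        """Retire one instruction, freeing its entry."""
        if self.occupied > 0:
            self.occupied -= 1


rob = ThrottledRob(n_entries=8, avf_bound=0.5)
admitted = [rob.dispatch() for _ in range(5)]
print(admitted)        # [True, True, True, True, False]: the fifth dispatch stalls
rob.commit()
print(rob.dispatch())  # True: a commit frees room under the bound
```

Tightening the bound shrinks the effective ROB, which is where the performance cost of throttling comes from.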
VCT Performance VCT High Integrity Low Integrity
Advantages of a Reactive Bounding Mechanism Reorder Buffer AVF Bound Exceeded Verify Results Early Accounting of Writebacks Mis-speculated Instructions
Simultaneous Redundant Threading (SRT): Importance of Selective Redundancy ARF RAT ISQ ALU Fetch Decode Reorder Buffer (PRF) ARF RAT Redundant Thread After Primary Thread Result Verification Reduces AVF Redundant Execution protects entire pipeline AVF goes down
Vulnerability Control via Selective Redundancy (VCSR) Infrastructure ARF RAT ISQ ALU Fetch Decode Reorder Buffer (ROB) RAT ARF AVF Bound Exceeded Greedy Heuristic Result Buffer
VCSR Performance VCSR SRT VCT High Integrity Low Integrity
Optimizations: Primary Thread Out-of-Order Commit • Non-compacting Reorder Buffer • Reduces AVF • Performance boost since fewer instructions are re-executed • Writeback – Commit: ROB AVF affected • Sec. thread maintains architected state [Figure: pipeline with Fetch, Decode, RAT, ISQ, ALU, Reorder Buffer (PRF), ARF, and Result Buffer]
VCH with OOO Commit Performance VCH(OOO) SRT VCSR VCT High Integrity Low Integrity
Impact of vulnerability bounding • Per-cycle vulnerability bounds, guaranteeing FIT rates are met • Future Work • Looking at developing a system-level AVF monitoring and bounding infrastructure
Outline • Motivation • Contributions • Vulnerability bounding mechanisms • Summary of other works • Impact of DVFS on architectural vulnerability of GALS architectures • Address process variations in issue queue • Mitigate NBTI, HCE degradation in structures • Conclusion and Future work
Need for vulnerability analysis in GALS Architectures • Multiple domains, each driven by individual clocks • Need for global clock network avoided • GALS enables fine-grained VF scaling tuned to individual domains • DVFS provides high performance per watt • DVFS algorithms for GALS architectures are studied w.r.t. IPC per watt; reliability impact ignored • Voltage scaling affects FITraw, frequency scaling affects AVF • Impact on AVF due to applying different DVFS algorithms • Help designers choose DVFS algorithms meeting reliability requirements
AVF impact across algorithms • Significant AVF variations when applying different algorithms • Most DVFS algorithms lead to worse AVF than Non-DVFS [Figure: AVF per DVFS algorithm; 38% variation; lower is better]
Outline • Motivation • Contributions • Vulnerability bounding mechanisms • Other solutions • Impact of DVFS on architectural vulnerability of GALS architectures • Address process variations in issue queue • Mitigate NBTI, HCE degradation in structures • Conclusion and Future work
Process Variation (PV) - Introduction • Process Variation: variation in characteristics between two identically designed circuits • Performance and power impact significant • Lack of predictability in timing characteristics leads to loss of yield • Definite need to address PV at circuit and microarchitectural level [Figure: process variation taxonomy. Dynamic: aging, thermal effects. Static: systematic and random; listed sources: sub-wavelength lithography, overlay, dose, RDF] [J. Tschanz et al., DAC 2005]
Contributions • Study the impact of PV on the Issue Queue of a microprocessor • PV-unaware design has about 21% performance degradation w.r.t. Non-PV design • PV is a non-deterministic phenomenon, so design-time static partitioning is not possible. Our solution enables the fast and slow entries to co-exist • Instruction steering and sub-component switching schemes to reduce the impact of PV • Performance loss is about 1.3% w.r.t. Non-PV design
Issue Queue Entry Issue Read Tag1 Tag N Forwarding Comparison Forwarding Write Opcode R Tag Operand R Tag Operand Dest Tag V Select Logic Dispatch Write INSTRUCTION ISSUE SELECT INST. READY Valid Bit Reset t t+1 t+2 t+3 Alloc stalls Dispatch ALLOC LOGIC Time DISPATCH WRITE Valid Bit Set ISQ Full FORWARDING Operand Ready Bit Set Instruction wait for Ready Operands
Results Stalls reduced w.r.t specific activity Operand and port-switching further reduce stalls to a minimum 1.3% 7.3% 12% Non-PV Shutdown MCD PV-Aware
Outline • Motivation • Contributions • Vulnerability bounding mechanisms • Other solutions • Impact of DVFS on architectural vulnerability of GALS architectures • Address process variations in issue queue • Mitigate NBTI, HCE degradation in structures • Conclusion and Future work
Increasing impact of transistor wearout • Transistor lifetime decreasing with newer technologies • Conservative guardbands impact performance • System longevity affects revenue • In more than 50% of organizations, machine age > 10 years [Figure axis: Decreasing Technology] Source: Intel Poll by Gartner Research, Source: J. Blome, Micro 2007
Contributions • NBTI, HCE impact increasing in upcoming technologies • Conventional collapsing issue queues have unwanted instruction movement across entries • Collapsing required for age-based selection • Round-Robin scheme to provide restricted collapsing • Restricted collapsing balances switching activity without losing much of age-based selection
Implementation Capture Rd / Wr / Sw / Data probabilities per cell HSpice (32nm, 380K) 10-year degradation SPEC2K Benchmark 100M instructions Simplescalar Architectural simulator [ISQ] Transistor-level Degradation model Typically, solutions look at worst-case probabilities that might rarely occur Read Delay Degradation
Results 1% reduction 32% reduction
Conclusion • Growing Reliability concern “Pop culture of reliability has arrived” - Dr. Phil Emma, IBM [Architecture Design for Soft Errors] • Work looks at increasing the fault-tolerance in back-end • Soft errors • Process variation • Wearout
Current Work • Multi-core designs have come to prominence • While caches have ECC, the multiple pipelines involve structures holding data, where ECC is hard • Total vulnerability to soft errors increases • Study the impact on AVF of different structures in a multi-core environment
Future Work • Multi-core • Cores increase, market segments increase • ILP vs TLP vs Clock frequency increase • Application/Hardware sense best configuration • Reconfigurable Hardware • Defect Tolerance • Verification time increasing • “Firmware update” to control functionality
Thank you
DVFS Algorithms • Threshold • VF scaling uses fixed thresholds. Preset thresholds affect algorithm efficiency • Attack-Decay (AD) • Based on utilization in adjacent intervals. Attack whenever there is a big utilization change; otherwise decay. Greedy nature affects efficiency • Modified Attack-Decay (ModAD) • Attack phase modified to correspond to the utilization change. Large VF swings can affect performance per watt • PI • $\mu_k = \mu_{k-1} + K_I (q'_k - q_{ref}) + K_p (q'_k - q'_{k-1})$, $f_k = \mu_k / \mathrm{IPC}$ • Greedy • Sample and Hold phase. VF scaling based on ED² of past 2 intervals
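The PI update rule on this slide can be sketched directly as two small functions. The slide does not define its symbols, so this illustration assumes q' is a measured per-interval quantity (for example a queue occupancy), q_ref is its target, and K_I and K_p are the integral and proportional gains; the function names and all numeric values are hypothetical.

```python
def pi_step(mu_prev: float, q_now: float, q_prev: float,
            q_ref: float, k_i: float, k_p: float) -> float:
    """mu_k = mu_{k-1} + K_I * (q'_k - q_ref) + K_p * (q'_k - q'_{k-1})."""
    integral_term = k_i * (q_now - q_ref)        # reacts to the error against the target
    proportional_term = k_p * (q_now - q_prev)   # reacts to the change since last interval
    return mu_prev + integral_term + proportional_term


def next_frequency(mu: float, ipc: float) -> float:
    """f_k = mu_k / IPC."""
    if ipc <= 0:
        raise ValueError("IPC must be positive")
    return mu / ipc


# Hypothetical interval: measurement rose from 10 to 12 against a target of 8.
mu_k = pi_step(mu_prev=2.0, q_now=12.0, q_prev=10.0, q_ref=8.0, k_i=0.25, k_p=0.5)
print(mu_k)                           # 2.0 + 0.25 * 4 + 0.5 * 2 = 4.0
print(next_frequency(mu_k, ipc=2.0))  # 4.0 / 2.0 = 2.0
```

When the measurement sits above its target and is still rising, both terms push the control value up, which raises the domain frequency for the next interval.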
Vulnerability Efficiency • Non-DVFS has the best vulnerability efficiency • Among the DVFS algorithms, AD and PI provide the best vulnerability efficiency on average [Figure: vulnerability efficiency per algorithm; 40% variation; lower is better]
Round Robin scheme [Figure: issue queue entries with Head, PseudoHead (PH), Tail, and New Inst markers; per-entry clock control bits (Clk Ctrl Bit 0 to Clk Ctrl Bit N); labels: PH, Later Entries, Collapse Control Vector (1 1 1 0 0)]
Reliability Issues of Importance • Solutions that are robust but overhead-aware as well
Contributions • Bounding vulnerability of processor structures to provide reliability guarantees • Study impact of DVFS on vulnerability of GALS architectures • Solutions to reduce non-uniform aging due to NBTI, HCE on microprocessor structures • Solutions to address impact of process variations on issue queue [Figure: Hardware Failure taxonomy with labels Permanent, Temporary, Wearout, Transient, Intermittent, Process variation, Radiation, Non-Radiation, Soft Errors, Power supply. Source: ISCA 2005 tutorial]
Results SR with T(OOO) SRT SR Throttling (T) High Integrity Low Integrity
PV-aware steering - OptiSteer Non-Collapsing Op STag1 STag2 DTag Issue Queue Dest Tag Dest Tag RAT ISQ Entry id STALL - - - Slow Entry Bit Alloc Decoder Demux - - - Assigns ISQ Entry Stall Optimization Table Source Tags (STag1, STag2)
Intra-Entry Variation schemes Operand- and Port-Switching Issue Read Op STag1 Operand1 STag2 DTag V Opcode R Tag Operand R Tag Operand Dest Tag Op STag2 STag1 Operand1 DTag Operand Switch Dispatch Write Dispatch Port Switch Op STag1 Operand1 STag2 DTag
Timeline of ISQ activities SELECT INST. READY Port Switch Slow issue read SELECT INST. READY Fewer instructions selected INSTRUCTION ISSUE Valid Bit Reset t t+1 t+2 t+3 ALLOC LOGIC Alloc stalls Dispatch Time DISPATCH WRITE Valid Bit Set Operand Switch Port Switch FORWARDING Operand Ready Bit Set ISQ Full SOT Fill Slow Dispatch Write Instruction wait for Ready Operands SOT Value Required Forwarding Stall