StageWeb: Interweaving Pipeline Stages into a Wearout and Variation Tolerant CMP Fabric. Shantanu Gupta, Amin Ansari, Shuguang Feng, Scott Mahlke. University of Michigan - Ann Arbor. June 29, 2010.
Reliability Threats
• Transient faults due to cosmic rays and alpha particles (increase exponentially with the number of devices on chip)
• Silicon defects (manufacturing defects and device wear-out): electromigration, negative bias temperature instability (NBTI), oxide breakdown
• Process variation (random and systematic variations): speed binning on a die, intra-die ILD thickness
Fault Tolerance Aspects
• Detect and diagnose: Has anything gone wrong? Figure out the cause.
• Reconfigure: Isolate the broken components.
• Recover: Resume execution from a safe point.
Reconfiguring a Multi-core
• At the coarsest level, cores can be disabled.
• Rumors that industry already uses this: IBM Cell w/ 7 SPEs, AMD Tri-Core
• Can’t scale to higher failure rates!
[Figure: the multi-core grid at Year 1, Year 3, Year 5, and Year 7.]
Reconfiguration Granularity (for 100% area overhead, i.e. redundancy)
• CORE level: 100% MTTF ↑
  -- Poor MTTF gains
  + Easy to implement
  Examples: ElastIC, DT’06; Reunion, MICRO’06; Configurable Isolation, ISCA’07
• STAGE level (FETCH, DEC, EXEC, MEM, WB): 170% MTTF ↑
  + Good MTTF gains
  + Circuit / architectural boundary
  + Full coverage
• MODULE level: 200% MTTF ↑
  + Best MTTF gains
  -- Complex implementation
  Examples: Online Diagnosis of Hard Faults, MICRO’05; Ultra Low-Cost Defect Protection, ASPLOS’06
• Finer granularity gives better resource utilization; coarser granularity gives lower complexity.
CMP Fabric
[Figure: four cores (Core 0 to Core 3), each a pipeline of Stage1, Stage2, Stage3, ..., StageN separated by latches.]
The StageNet (SN) Fabric
• StageNet Slice (SNS): Stage1, Stage2, Stage3, ..., StageN
• Crossbar switches (inputs, outputs) connect the stages of the slices
• Wearout sensors: delay, temperature, current
• Configuration manager
A 4-Slice SN chip
[Figure: four slices, each with Fetch, Decode, Issue, and Ex/Mem stages, plus a configuration manager.]
Performance Comparison: Pipeline vs. SN Slice
• 5-stage pipeline: Fetch, Decode, Issue, Ex/Mem, WB, with Gen PC, branch predictor, and register file; stages separated by latches, with bypass, branch resolution, and register writeback paths
• SN slice: Fetch, Decode, Issue, Ex/Mem, with Gen PC, branch predictor, and register file; stages connected through double buffers
• > 5X slowdown for the SN slice, from three sources:
  1. Control stall
  2. Data forwarding
  3. Transmission delays
[Figure: commit time of a 10-instruction sequence containing a branch (BR) and a register dependency, on the 5-stage pipeline and on the SN slice.]
SN Slice Microarchitecture [MICRO’08]
1. Control handling: Stream ID (SID)
  • Control flow handling
  • Eliminates flush signals
2. Data forwarding: Bypass $
  • Stores previous results
  • Fully associative structure
  • Emulates data forwarding
3. Transmission delays: Macro-ops (macro-op generator)
  • Send instruction bundles
  • Amortizes transfer delay
  • Increases system utilization
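The bypass $ idea can be illustrated with a minimal sketch: a small fully associative structure that remembers recent results so that a later instruction can pick up a fresh value instead of a stale one. The class name, the capacity of 4, and the oldest-first replacement policy are assumptions made for illustration; the slides do not specify them.

```python
from collections import OrderedDict


class BypassCache:
    """Hypothetical model of a bypass $: recent results keyed by register name."""

    def __init__(self, capacity=4):  # capacity is an assumed parameter
        self.capacity = capacity
        self.entries = OrderedDict()  # register -> value, oldest entry first

    def record(self, reg, value):
        """Store the result just produced for `reg`, evicting the oldest entry if full."""
        self.entries.pop(reg, None)   # a newer result replaces the older one
        self.entries[reg] = value
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)

    def lookup(self, reg, stale_value):
        """Return the recent result for `reg` if present, else the value read earlier."""
        return self.entries.get(reg, stale_value)


if __name__ == "__main__":
    cache = BypassCache(capacity=2)
    cache.record("r1", 10)
    # An instruction that read r1 before the write still sees the new value.
    print(cache.lookup("r1", stale_value=0))
```

Any register can occupy any entry, which is what makes the structure fully associative; a real design would bound the lookup cost in hardware rather than rely on a dictionary.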
SN Slice Performance [MICRO’08]
[Figure: normalized runtime (y-axis, 0 to 6) for SNS + StreamID, SNS + StreamID + Bypass$, and SNS + StreamID + Bypass$ + MOPs on wc, mcf, idct, eqn, grep, 3des, rijndael, rawcaudio, rawdaudio, g721encode, g721decode, and the mean. Annotation: 10% slowdown.]
SN System - scaling to 100+ cores?
1. Crossbars don’t scale well due to wiring / layout complexity
  - Area
  - Delay
  - Power
2. Interconnection prone to failures
  - Single point of failure
  - Links have no redundancy
StageWeb: Scaling to 100+ cores
• In a large many-core system, small groups of cores can form SN islands
• What’s the right size for an SN island?
[Figure: a traditional many-core next to a StageWeb many-core built from SN islands.]
StageWeb: Scaling to 100+ cores
• What’s the right size for an SN island?
• Unfortunately, a single crossbar can’t scale to 8-10 pipelines!
[Figure: plot with regions labeled "Good scaling" and "Poor scaling".]
Interconnection Alternatives
1. Connectivity
  • Single
  • Single + Front-Back
  • Overlap
  • Overlap + Front-Back
[Figure: Islands 1 to 4, each with two Fetch, Decode, Issue, Ex/Mem slices; front-end and back-end connections marked.]
Interconnection Alternatives
2. Reliability
  a) Crossbar
  b) Crossbar with spares
  c) Fault-tolerant crossbar
[Figure: the three crossbar designs, each with inputs and outputs, next to Islands 1 to 4.]
Interconnection Configuration
• Faults in stages, crossbar ports, or links force a reconfiguration.
• Single crossbar configuration: local to every island
• Overlap crossbar configuration: sweep islands, forming pipelines opportunistically
[Figure: islands of Fetch, Decode, Issue, Ex/Mem slices under each configuration.]
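To make the difference between the two configurations concrete, here is a rough sketch under a simplified model that is an assumption, not the authors' algorithm: each island is reduced to a count of healthy stages per stage type, crossbars and links are treated as fault-free, and an overlap crossbar lets leftover stages of one island be used by the next island only. The function names and the greedy left-to-right sweep are hypothetical.

```python
STAGES = ("fetch", "decode", "issue", "exmem")


def pipelines_single(islands):
    """Single configuration: each island forms pipelines from its own stages only."""
    return sum(min(island[s] for s in STAGES) for island in islands)


def pipelines_overlap(islands):
    """Overlap configuration (assumed greedy sweep): leftovers reach the next island."""
    carried = {s: 0 for s in STAGES}  # unused stages from the previous island
    total = 0
    for island in islands:
        # A pipeline needs one healthy stage of every type.
        formed = min(carried[s] + island[s] for s in STAGES)
        total += formed
        leftover = {}
        for s in STAGES:
            from_carried = min(carried[s], formed)  # use older stages first
            leftover[s] = island[s] - (formed - from_carried)
        carried = leftover  # stages two islands back are out of reach
    return total


if __name__ == "__main__":
    islands = [
        {"fetch": 2, "decode": 1, "issue": 2, "exmem": 2},  # one decode stage failed
        {"fetch": 1, "decode": 2, "issue": 2, "exmem": 2},  # one fetch stage failed
    ]
    print(pipelines_single(islands), pipelines_overlap(islands))
```

In this example each island alone can only build one pipeline, while the sweep pairs the spare fetch stage of the first island with the spare decode stage of the second to recover a third.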
StageWeb Benefits
• Scalability: scaling SN to benefit 100+ core systems
• Interconnection reliability: handling faults in crossbars and links
• Process variation: slower components can be isolated in a multi-core chip
Mitigating Process Variation
• Severe process variation and lifetime wearout can result in a disparity of health for various resources
• StageNet can effectively isolate strong/weak resources
[Figure: Fetch, Decode, Issue, and Ex/Mem stages of differing frequency regrouped into fast, medium, and slow pipelines.]
Evaluation
• OpenRISC 1200 cores (4-stage in-order)
• 12 configurations compared (combinations of interconnections and crossbar types), 64 cores each
• Experiments
  • Lifetime evaluations: throughput and total work
  • Process variation: speed binning on a die
Lifetime Reliability Evaluations
• Monte Carlo simulation with 300+ lifetime experiments
• Each lifetime experiment involves:
  • Assigning a time-to-failure to all stages
  • Killing components at their failure times
  • Reconfiguring the system to isolate broken components
  • Repeating this until no logical pipeline can be formed
• Cumulative work and throughput are recorded
• Number of cores: 64
• Technology node: 90 nm
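The experiment loop described above can be sketched in a few lines. This is a toy model under stated assumptions, not the authors' simulator: times-to-failure are drawn from a Weibull distribution with made-up scale and shape parameters, all slices sit behind one ideal fault-free crossbar, and throughput is simply the number of logical pipelines that can be formed. All function and parameter names are hypothetical.

```python
import random

N_STAGE_TYPES = 4  # Fetch, Decode, Issue, Ex/Mem


def lifetime_work(n_slices, share_stages, rng, scale=10.0, shape=2.0):
    """Run one lifetime experiment and return the cumulative work (throughput x time)."""
    # 1. Assign a time-to-failure to every stage (Weibull is an assumption).
    events = sorted(
        (rng.weibullvariate(scale, shape), sl, st)
        for sl in range(n_slices)
        for st in range(N_STAGE_TYPES)
    )
    alive = [[True] * N_STAGE_TYPES for _ in range(n_slices)]

    def throughput():
        if share_stages:  # stage-level reconfiguration: any stage can join any pipeline
            return min(sum(row[st] for row in alive) for st in range(N_STAGE_TYPES))
        return sum(all(row) for row in alive)  # core-level: one fault disables the core

    work, now = 0.0, 0.0
    for t, sl, st in events:
        work += throughput() * (t - now)  # work done since the previous failure
        now = t
        alive[sl][st] = False             # 2. kill the component; 3. reconfigure
        if throughput() == 0:             # 4. stop when no logical pipeline remains
            break
    return work


def mean_work(n_experiments, n_slices, share_stages, seed=0):
    """Average cumulative work over many lifetime experiments."""
    rng = random.Random(seed)
    runs = (lifetime_work(n_slices, share_stages, rng) for _ in range(n_experiments))
    return sum(runs) / n_experiments


if __name__ == "__main__":
    print("core-level disabling:", mean_work(300, 8, share_stages=False))
    print("stage-level sharing: ", mean_work(300, 8, share_stages=True))
```

With the same seed both runs see identical failure times, so the gap between the two printed values isolates the effect of the reconfiguration granularity in this toy model.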
Cumulative Work ~70% more work!
Cumulative Work (area neutral)
• Best StageWeb configuration:
  • Overlapping interconnection network
  • 52 cores
  • 6 adjacent slices connected by each crossbar
  • Fault-tolerant crossbars
Mitigating Process Variation
• For a given frequency target, StageWeb can operate:
  • More cores, OR
  • Same # of cores at lower voltage
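The "more cores for a given frequency target" claim can be illustrated with a small sketch. It assumes that a pipeline runs at the speed of its slowest stage and that any stage can be paired with any other stage of the group through an ideal crossbar; the function name and the frequency values are made up for illustration.

```python
def cores_meeting_target(stage_freqs, target, flexible):
    """Count pipelines able to run at `target`.

    stage_freqs[c][s] is the maximum frequency of stage s in core c.
    flexible=False models fixed cores; flexible=True models free regrouping of stages.
    """
    n_stages = len(stage_freqs[0])
    if flexible:
        # Each stage type contributes its count of fast-enough stages;
        # the scarcest stage type limits the number of pipelines.
        return min(
            sum(core[s] >= target for core in stage_freqs) for s in range(n_stages)
        )
    # A fixed core qualifies only if every one of its own stages is fast enough.
    return sum(all(f >= target for f in core) for core in stage_freqs)


if __name__ == "__main__":
    # Hypothetical per-stage frequencies (GHz); each core has one slow stage.
    freqs = [
        [1.0, 0.8, 1.0, 1.0],
        [0.8, 1.0, 1.0, 1.0],
        [1.0, 1.0, 0.8, 1.0],
        [1.0, 1.0, 1.0, 0.8],
    ]
    print(cores_meeting_target(freqs, 0.9, flexible=False))
    print(cores_meeting_target(freqs, 0.9, flexible=True))
```

In this contrived case no fixed core meets the 0.9 target, while regrouping sets aside the four slow stages and builds three fast pipelines from the rest.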
Conclusions
• Architectural innovations will be crucial in tackling technological uncertainties
• StageWeb is a potential solution
  • Allows fine-grained isolation of failures
  • Most reliability gains from grouping 8-10 pipelines
  • Scalable to 100+ cores
• StageWeb can also mitigate process variation by grouping together faster and slower parts
Thank You http://cccp.eecs.umich.edu