StageNet: A Reconfigurable CMP Fabric for Resilient Systems. Shantanu Gupta, Shuguang Feng, Jason Blome, Scott Mahlke. 2nd Workshop on Reconfigurable and Adaptable Architecture, Dec 1, 2007.
StageNet: A Reconfigurable CMP Fabric for Resilient Systems
Shantanu Gupta, Shuguang Feng, Jason Blome, Scott Mahlke
2nd Workshop on Reconfigurable and Adaptable Architecture, Dec 1, 2007
Reliability Challenge
• Increasing defect rates are a major challenge [ITRS’03]
  • ↑ power density, ↓ feature sizes, ↑ failures in time (FIT)
• Permanent faults
  • Manufacturing defects
  • Time-dependent dielectric breakdown (TDDB)
  • Negative bias temperature instability (NBTI)
  • Electromigration (EM)
  • …
• [Srinivasan, DSN’04] For the 32nm technology node, an 8-core CMP would face ~30 faults in 4 years
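To put the "~30 faults in 4 years" figure in FIT terms, here is a minimal sketch of the unit conversion. It assumes a constant failure rate over the 4 years and an even split across the 8 cores; neither assumption comes from the slides, and the helper name `implied_fit` is hypothetical. FIT is the standard unit of failures per 10^9 device-hours.

```python
HOURS_PER_YEAR = 24 * 365  # leap days ignored

def implied_fit(faults: float, years: float, units: int = 1) -> float:
    """Average FIT per unit implied by `faults` observed over `years`.

    Assumes a constant failure rate spread evenly over `units` devices
    (an illustrative simplification, not a claim from the slides).
    """
    device_hours = years * HOURS_PER_YEAR * units
    return faults / device_hours * 1e9  # FIT = failures per 1e9 device-hours

chip_fit = implied_fit(30, 4)            # whole 8-core chip as one device
core_fit = implied_fit(30, 4, units=8)   # per core, if faults are spread evenly
print(f"chip: {chip_fit:.0f} FIT, per core: {core_fit:.0f} FIT")
```

Under these assumptions the chip-level rate works out to roughly 8.6 × 10^5 FIT.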
Tolerating Permanent Faults
• Traditional solutions
  • TMR
  • Tandem / HP Non-stop
• Impractical for mainstream
  • Cost
  • Power
  • Low gain
• Current approaches
  • Detection/Prediction
    • Using sensors
    • Analytical models
    • Redundant execution
    • BIST
  • Repair
    • Replacement
    • Reconfiguration
[Figure labels: K-pos DP-31/32; Teramac (1995)]
Reconfiguration Granularity
• Range of choices for the reconfiguration granularity
  • CORE level
  • STAGE level (FETCH, DEC, EXEC, MEM, WB)
  • MODULE level
• Related work cited on the slide
  • ElastIC, DT’06
  • Reunion, MICRO’06
  • Configurable Isolation, ISCA’07
  • Online Diagnosis of Hard Faults, MICRO’05
  • Ultra Low-Cost Defect Protection, ASPLOS’06
• Trade-offs across the range: better resource utilization; lower design complexity; lower overheads
Mean Time to Failure Comparison
• CORE level
  + Easiest to do in practice
  - Poorest MTTF gains
• STAGE level
  + Circuit/logical boundary
  + Improved MTTF gains
  - Architectural complexity
• MODULE level
  + Best MTTF gains
  - Hardest to repair
[Plot: MTTF increase (%) and area increase (%) for MODULE, STAGE, and CORE level]
Throughput Comparison
• Monte-Carlo study
  • Randomly injected failures
  • Assumes that stages are shared resources
[Plot: STAGE level vs. CORE level]
STAGE-level reconfiguration allows significantly more graceful throughput degradation
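The kind of Monte-Carlo comparison described on this slide can be sketched as follows. This is an illustrative model, not the authors' simulator: the fabric size, the fault count, the uniform fault placement, and the definition of throughput as "fraction of working pipelines" (ignoring any interconnect slowdown) are all assumptions.

```python
import random

def throughput(alive, mode):
    """Count usable pipelines; alive[c][s] is True if stage s of core c works."""
    if mode == "core":
        # Core-level repair: a core is lost as soon as any of its stages fails.
        return sum(all(core) for core in alive)
    # Stage-level repair: stages are pooled, so the scarcest stage type
    # limits how many logical pipelines can be formed.
    n_stages = len(alive[0])
    return min(sum(core[s] for core in alive) for s in range(n_stages))

def simulate(n_cores=8, n_stages=5, n_faults=4, trials=2000, seed=0):
    """Average normalized throughput after injecting n_faults random faults."""
    rng = random.Random(seed)
    slots = [(c, s) for c in range(n_cores) for s in range(n_stages)]
    totals = {"core": 0, "stage": 0}
    for _ in range(trials):
        alive = [[True] * n_stages for _ in range(n_cores)]
        for c, s in rng.sample(slots, n_faults):  # distinct faulty stages
            alive[c][s] = False
        for mode in totals:
            totals[mode] += throughput(alive, mode)
    return {mode: total / trials / n_cores for mode, total in totals.items()}

if __name__ == "__main__":
    for faults in (0, 4, 8, 16):
        print(faults, simulate(n_faults=faults))
```

In this model the stage-level figure can never fall below the core-level one, because every fully working core contributes one healthy stage of each type to the shared pool.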
Goal of this Research
• Design a computing substrate
  • Fault tolerant
  • Graceful performance degradation with defects
  • Highly reconfigurable
  • Adaptable to the workload
Design that can meet the challenge of facing ~100s of faults while maintaining 70-80% throughput
CMP Fabric
[Figure: four cores (Core 0 to Core 3), each with its own pipeline of Stage1, Stage2, Stage3, …, StageN]
StageNet CMP Fabric
[Figure: four rows of Stage1, Stage2, Stage3, …, StageN; labels: Logical pipeline, Allocator, Configuration Manager]
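One way to picture the idea of a logical pipeline drawn from a pool of stages is the sketch below. It is a hypothetical allocation policy (pair up the i-th healthy unit of each stage type), not the configuration manager's actual algorithm, which the slides do not specify; the function name and the stage names used as dictionary keys are assumptions for illustration.

```python
def allocate_pipelines(healthy):
    """Form logical pipelines from a pool of healthy stages.

    healthy maps a stage type to the list of working unit IDs of that type.
    Returns one dict per logical pipeline, mapping stage type -> chosen unit.
    """
    # The scarcest stage type bounds the number of pipelines.
    n = min(len(units) for units in healthy.values())
    return [{stage: units[i] for stage, units in healthy.items()}
            for i in range(n)]

# Hypothetical fabric of 4 units per stage type, with some units failed.
pool = {
    "FETCH": [0, 1, 2, 3],
    "DEC":   [0, 2, 3],
    "EXEC":  [1, 3],
    "MEM":   [0, 1, 2],
    "WB":    [0, 1, 2, 3],
}
for pipeline in allocate_pipelines(pool):
    print(pipeline)
```

With this pool, two logical pipelines can be formed even though only one of the original four cores (unit 3 of each stage would need MEM 3, which has failed) is fully intact.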
StageNet CMP Fabric - Benefits
[Figure: four rows of Stage1, Stage2, Stage3, …, StageN; label: Configuration Manager]
StageNet CMP Fabric - Issues
• Performance / Efficiency
  • Scaling with number of stages
  • Impact of router delay
    • Transmission delay (tdelay)
    • Congestion delay
• Design overheads
  • Area
  • Power
• Micro-architectural concerns
  • Data forwarding logic
  • Control flow handling
[Figure labels: Allocator; 64, 256 bits]
Experimental Setup
• MiBench suite: benchmarks
• SimpleScalar 4.0: simulates an in-order core with default parameters
• Stores statistics for the benchmarks
  • No. of instructions
  • No. of cycles
  • Branch mis-predicts
  • I/D cache misses
  • …
• StageNet Model: parameterizable performance model for StageNet
• CPI Results
Effect of varying pipeline depth
[Plot; label: tdelay = 1]
Effect of varying transmission delay
[Plot; label: stages = 10]
Performance enhancement
• Router delay is the leading cause of the slowdown
• Need some way to improve system utilization
• Let us send macro-ops (MOP)
  • MOP is an instruction bundle
  • Upper bound on length
  • Upper bound on live-ins / live-outs
  • No branches in between
• Advantages
  • Amortizes delay / contention
  • Increases resource utilization
[Figure: example dataflow graph with operations >>, LD, LD, +, /, &, +, >>, <<, ST, ST; max length 4, max live-ins 2]
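The MOP constraints listed above can be illustrated with a greedy bundling pass. This is a sketch under stated assumptions, not the authors' formation algorithm: the `Instr` record, the greedy in-order policy, and the choice to place each branch in its own bundle are hypothetical. The live-out bound is not checked, since the slide gives no value for it and checking it would need liveness information from outside the bundle.

```python
from typing import List, NamedTuple, Optional, Tuple

class Instr(NamedTuple):
    op: str
    dest: Optional[str]          # register written, or None (e.g. a store)
    srcs: Tuple[str, ...]        # registers read
    is_branch: bool = False

def live_ins(bundle: List[Instr]) -> set:
    """Registers read by the bundle before any instruction in it writes them."""
    defined, ins = set(), set()
    for instr in bundle:
        ins.update(r for r in instr.srcs if r not in defined)
        if instr.dest is not None:
            defined.add(instr.dest)
    return ins

def form_mops(instrs, max_len=4, max_live_ins=2):
    """Greedily pack consecutive instructions into macro-ops (MOPs)."""
    mops, cur = [], []
    for instr in instrs:
        if instr.is_branch:
            # Assumption: a branch closes the current MOP and stands alone.
            if cur:
                mops.append(cur)
            mops.append([instr])
            cur = []
            continue
        cand = cur + [instr]
        if len(cand) > max_len or len(live_ins(cand)) > max_live_ins:
            # Adding this instruction would break a bound: start a new MOP.
            # A lone instruction over the live-in bound still gets its own MOP.
            if cur:
                mops.append(cur)
            cur = [instr]
        else:
            cur = cand
    if cur:
        mops.append(cur)
    return mops

program = [
    Instr("LD", "r1", ("r10",)),
    Instr("LD", "r2", ("r11",)),
    Instr("ADD", "r3", ("r1", "r2")),
    Instr("SHR", "r4", ("r3",)),
    Instr("ST", None, ("r4", "r12")),
    Instr("BR", None, ("r4",), is_branch=True),
]
for mop in form_mops(program):
    print([i.op for i in mop])
```

The defaults mirror the slide's example parameters (max length 4, max live-ins 2); the first four instructions fit in one MOP because only `r10` and `r11` come from outside the bundle.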
Effect of varying MOP size
[Plot; labels: tdelay = 4, stages = 10]
Conclusions
• Reliability-aware architectures with finer-grained reconfiguration are desirable for:
  • Better MTTF gains
  • Graceful throughput degradation
• StageNet, a potential solution, allows stage-level reconfiguration and is:
  • Easy to reconfigure
  • Inherently redundant
  • Potentially scalable in issue width
• Using StageNet, significant reconfiguration flexibility can be gained for a small loss in performance
Future Work
• Micro-architectural issues
  • Data bypass handling
  • Control flow handling
  • Sharing state between pipeline stages
• Network design
  • Design of routers
  • Design of interconnection
• Simulation setup
  • Validation of results using a cycle-accurate simulator
Repair
[Figure labels: DECODER, IF/ID, ID/EX, BIST, CHECKER, Test Vectors, (majority)]
• Ultra Low-Cost Defect Protection for Microprocessor Pipelines, ASPLOS’06
• ElastIC, DT’06
• F. Bower, Tolerating Hard Faults in Microprocessor Array Structures, DSN’04
• H. Qin, UC Berkeley