Dynamically Collapsing Dependencies for IPC and Frequency Gain
Peter G. Sassone, D. Scott Wills
Georgia Tech Electrical and Computer Engineering
{sassone, scott.wills}@ece.gatech.edu

Motivation
• Outside of the pipeline, global communication dominates
  • The memory wall is well studied
• Inside the pipeline, computation (logic) has traditionally dominated
[Figure: pipeline stages fetch, decode, rename, issue, exec, and commit, with the I cache, D cache, L2 cache, and memory]
Sassone & Wills / Georgia Tech / Dynamic Strands

Motivation
• Now dominated by local communication paths:
  • issue window
  • reorder buffer
  • register file
  • bypass network
• These paths bottleneck both IPC and frequency
[Figure: issue logic, issue queue, reg file, and three ALUs]

Motivation
• RISC instruction sets create superfluous traffic
• All instructions and operands are treated as equal
• Little focus on exposing sequentiality
[Figure: issue logic, issue queue, reg file, and three ALUs]

Contributions
• Dynamic Strands:
  • collapse dependence chains without fan-out
  • exploit properties for simple value precomputation
  • increase efficiency of critical resources
  • preserve binary compatibility
• IPC improvements:
  • 17-20% speedup on Spec2000int and MediaBench
• Frequency improvements:
  • 37% fewer in-flight instructions
  • reduced dependence on dependencies

Outline
• Motivation
• Transient Operands and Strands
• Instruction Replacement Hardware
• Results
• Conclusion

Dyadic Dilemma
Performing any operation on more than two sources requires temporary values.

    int sum( int a, int b, int c, int d ) {
        return a + b + c + d;
    }

    . . .
    add R1 ← R1, R2
    add R1 ← R1, R3
    add R9 ← R1, R4
    . . .

[Figure: dependence chain R1 + R2 → R1', R1' + R3 → R1'', R1'' + R4 → R9]

Transient Operands
• We term these temporary values transient operands:
  • values produced by an ALU inst
  • values consumed only once, and only by an ALU inst
• Common in modern integer workloads: on average, about 40% of all dynamic operands are transient

Strands
• Strands:
  • linear chains of instructions joined by transient operands
  • non-consecutive
  • span basic blocks
  • three instructions
  • only the final output needs to be committed
• Strands are common
  • dyadic temporaries
  • compiler strategies
  • language semantics
[Figure: chain of three additions combining a, b, c, and d]

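The a + b + c + d chain can be sketched as a small data structure. This is only an illustration: the `Strand` and `StrandOp` types, the `strand_eval` function, the four opcodes, and the reading of "three instructions" as an upper bound are assumptions made for the example, not the authors' hardware encoding. It shows the property the slide relies on: the intermediate transients stay inside the evaluation, and only the final value comes out.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_STRAND_LEN 3   /* assumed upper bound on strand length */

typedef enum { OP_ADD, OP_SUB, OP_AND, OP_OR } AluOp;

/* One link of the chain: combines the running transient value
   with one live input. */
typedef struct {
    AluOp    op;
    uint32_t live_input;
} StrandOp;

typedef struct {
    uint32_t first_input;            /* live input that starts the chain */
    int      len;                    /* number of ALU operations, 1..3   */
    StrandOp ops[MAX_STRAND_LEN];
} Strand;

static uint32_t alu(AluOp op, uint32_t a, uint32_t b)
{
    switch (op) {
    case OP_ADD: return a + b;
    case OP_SUB: return a - b;
    case OP_AND: return a & b;
    case OP_OR:  return a | b;
    }
    return 0;
}

/* Evaluate the whole strand. The variable `transient` plays the role
   of R1' and R1'' on the earlier slide: it is never written back. */
uint32_t strand_eval(const Strand *s)
{
    assert(s->len >= 1 && s->len <= MAX_STRAND_LEN);
    uint32_t transient = s->first_input;
    for (int i = 0; i < s->len; i++)
        transient = alu(s->ops[i].op, transient, s->ops[i].live_input);
    return transient;   /* the only value that would be committed */
}
```

With `first_input = 1` and three `OP_ADD` links carrying 2, 3, and 4, `strand_eval` returns 10, which matches `sum(1, 2, 3, 4)` from the Dyadic Dilemma slide.
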
Outline
• Motivation
• Transient Operands and Strands
• Instruction Replacement Hardware
• Results
• Conclusion

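Hardware Overview
• New hardware: strand cache fill unit, strand cache, dispatch engine, and closed-loop ALUs
• The strand cache fill unit observes instructions off the critical path
[Figure: pipeline of fetch, decode, rename, issue queue, reg file, ALUs, and commit; the fill unit feeds the strand cache, the dispatch engine inserts strands into the instruction stream, and the closed-loop ALUs handle the transients]
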
Algorithm Example
[Figure: the hardware overview diagram annotated with numbered steps 1 to 3]

Strand Cache Fill Unit
• Based around the operand table
• Detects conditions of transients
• When found…
  • append to existing strand
  • begin new strand
[Figure: operand table with columns arch reg, last producer instruction, last consumer instruction, and consumer count, next to an instruction stream at PCs 1404, 1408, 1412, and 1416; R5 is written at 1404, read at 1412, and overwritten at 1416, and its entry shows a consumer count of 1]

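The detection step can be sketched in software as follows. This is a minimal model under stated assumptions, not the authors' design: the `Inst`, `OperandEntry`, `OperandTable`, and `TransientLink` types and both functions are hypothetical names, and the table is assumed to hold one entry per architectural register. The idea is that a value's consumer count becomes final at the moment its register is overwritten, so that is when the transient conditions can be checked.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define NUM_ARCH_REGS 32   /* assumed register count */
#define NO_REG (-1)

typedef struct {
    uint32_t pc;
    bool     is_alu;
    int      dest;     /* destination register, or NO_REG */
    int      src[2];   /* source registers, or NO_REG     */
} Inst;

/* One operand table entry, mirroring the columns on the slide. */
typedef struct {
    bool     valid;            /* a producer has been recorded */
    uint32_t producer_pc;
    bool     producer_is_alu;
    uint32_t consumer_pc;
    bool     consumer_is_alu;
    int      consumer_count;
} OperandEntry;

typedef struct { OperandEntry reg[NUM_ARCH_REGS]; } OperandTable;

/* Producer/consumer pair joined by a transient operand. */
typedef struct {
    bool     found;
    uint32_t producer_pc;
    uint32_t consumer_pc;
} TransientLink;

void operand_table_init(OperandTable *t) { memset(t, 0, sizeof *t); }

/* Feed one instruction to the table. Returns a link when this
   instruction overwrites a register whose old value was transient. */
TransientLink operand_table_observe(OperandTable *t, const Inst *in)
{
    TransientLink link = { false, 0, 0 };

    /* Sources first, so "R1 <- R1 + R2" counts as a read of the old R1. */
    for (int i = 0; i < 2; i++) {
        int r = in->src[i];
        if (r == NO_REG) continue;
        assert(r >= 0 && r < NUM_ARCH_REGS);
        OperandEntry *e = &t->reg[r];
        if (!e->valid) continue;   /* producer never seen by the table */
        e->consumer_count++;
        e->consumer_pc = in->pc;
        e->consumer_is_alu = in->is_alu;
    }

    if (in->dest != NO_REG) {
        assert(in->dest >= 0 && in->dest < NUM_ARCH_REGS);
        OperandEntry *e = &t->reg[in->dest];
        /* The old value is dead now: ALU producer, exactly one
           consumer, and that consumer was an ALU instruction. */
        if (e->valid && e->producer_is_alu &&
            e->consumer_count == 1 && e->consumer_is_alu) {
            link.found = true;
            link.producer_pc = e->producer_pc;
            link.consumer_pc = e->consumer_pc;
        }
        memset(e, 0, sizeof *e);
        e->valid = true;
        e->producer_pc = in->pc;
        e->producer_is_alu = in->is_alu;
    }
    return link;
}
```

Feeding it a stream shaped like the slide's example (a write to R5 at PC 1404, a read of R5 at PC 1412, and a second write to R5 at PC 1416) yields a link from 1404 to 1412 on the third instruction. A fill unit would then use that link either to extend a strand that already ends at the producer or to begin a new one; that step is left out of the sketch.
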
Strand Cache
• About 175 bytes per line, though very few lines are needed for effect
[Figure: strand cache line layout with status bits, instructions (i1, i2, i3), previous reader info (seen, pc, and inst for this instruction, source 1, and source 2), pc, and ready value; three example strands are shown]

Dispatch Engine
• Watches for strand cache matches
• Inserts ready strands into the stream eagerly
• Removes component instructions when seen
• Correctness checking with dirty table
[Figure: decode passes pre-renamed instructions to the dispatch engine, which consults the strand cache and the dirty table and sends strands, recovery strands, and kill signals to rename]

Closed-Loop ALUs
• Full bypass is half of the execute stage delay
• Regular ALUs with double-speed closed-loop mode
  • two dependent ALU operations in a single cycle
  • intermediate values (the transients) are discarded!
  • final result still takes ½ cycle for full bypass
[Figure: ALU with a mode switch between a "free" local bypass and the full bypass network; the ALU and the full bypass network each take ½ cycle]

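The timing argument can be checked with a little arithmetic. The model below is a simplification built only from the numbers on this slide: one ALU operation takes half a cycle, the full bypass takes the other half, and the local bypass is treated as free. The function names are hypothetical, and issue constraints and cycle-boundary alignment are ignored.

```c
#include <assert.h>

/* All latencies are in half-cycles. */
enum { ALU_HALF_CYCLES = 1, BYPASS_HALF_CYCLES = 1 };

/* n dependent ALU instructions issued separately: every result
   crosses the full bypass network before the next one can start. */
int chain_latency_normal(int n)
{
    assert(n >= 1);
    return n * (ALU_HALF_CYCLES + BYPASS_HALF_CYCLES);
}

/* The same n operations as one strand in closed-loop mode: the
   transients use the local bypass (modelled as free) and only the
   final result pays for the full bypass. */
int chain_latency_closed_loop(int n)
{
    assert(n >= 1);
    return n * ALU_HALF_CYCLES + BYPASS_HALF_CYCLES;
}
```

Under these assumptions a three-instruction chain takes 6 half-cycles (3 cycles) when issued normally and 4 half-cycles (2 cycles) as a strand, while a single instruction costs the same either way.
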
Oops… Dirty Read
• A later instruction, load 16 [ R1 ], reads R1, but R1 is dirty!
• Insert a recovery sub-strand to recover R1
[Figure: the full strand R1 + R2 → R1', R1' + R3 → R1'', R1'' + R4 → R9 next to the recovery sub-strand R1 + R2 → R1', R1' + R3 → R1'']

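One plausible way to picture the dirty table is a bit per architectural register. The slides do not describe its organization, so everything here is an assumption: the `DirtyTable` type, the three functions, and the policy of marking a register when a dispatched strand skips its transient write and clearing it when the register is written again.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_ARCH_REGS 32   /* assumed register count */

/* One dirty bit per architectural register (hypothetical layout). */
typedef struct { uint32_t bits; } DirtyTable;

/* Called when a strand is dispatched and the value this register
   would have held is discarded as a transient. */
void dirty_mark(DirtyTable *d, int reg)
{
    assert(reg >= 0 && reg < NUM_ARCH_REGS);
    d->bits |= UINT32_C(1) << reg;
}

/* Called when the register is written again, either by a normal
   instruction or by a recovery sub-strand. */
void dirty_clear(DirtyTable *d, int reg)
{
    assert(reg >= 0 && reg < NUM_ARCH_REGS);
    d->bits &= ~(UINT32_C(1) << reg);
}

/* A reader of a dirty register needs a recovery sub-strand first. */
bool dirty_is_set(const DirtyTable *d, int reg)
{
    assert(reg >= 0 && reg < NUM_ARCH_REGS);
    return ((d->bits >> reg) & 1u) != 0;
}
```

In the slide's example, dispatching the strand would mark R1; the load that reads R1 then sees the dirty bit, a recovery sub-strand recomputes R1, and the bit is cleared.
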
Oops… Anti-Dependence Violation
• A later instruction, load 32 [ R9 ], needs the previous value of R9, but R9 has already been replaced
• Renaming is not sufficient: the strand executes outside the reorder buffer safety net
• Insert a load immediate of the previous value
[Figure: strand R1 + R2 → R1', R1' + R3 → R1'', R1'' + R4 → R9, with the previous value of R9 marked]

Outline
• Motivation
• Transient Operands and Strands
• Instruction Replacement Hardware
• Results
• Conclusion

Instruction Coverage
• Average ALU inst coverage: 12% with a 16-entry strand cache, 27% with 1024 entries
• Some benchmarks reach high coverage rates, but only with a big strand cache
• Others stay below a 15% replacement rate, regardless of cache size
[Figure: coverage with various strand cache sizes]

IPC Improvements
• Average IPC speedup: 17% for 4-wide, 20% for 8-wide
• Some benchmarks almost double in IPC
• Some see almost no speedup at all
[Figure: 4-wide IPC speedup with 16-entry strand cache]

Resource Occupancy
• CISCification of instructions reduces traffic
  • reorder buffer occupancy is reduced up to 37%
  • issue queue occupancy is reduced up to 34%
  • traffic reduction coverage
• Reduced dependence on dependencies
  • opportunity for pipelined bypass
  • opportunity for pipelined issue
[Figure: chains of additions grouped into strands]

Resource Occupancy
• Caveat emptor
  • more worst case issue CAMs
  • more worst case register ports
• Prior work applicable
  • only 1.2 live inputs / strand
[Figure: chains of additions grouped into strands]

Outline
• Motivation
• Transient Operands and Strands
• Instruction Replacement Hardware
• Results
• Conclusion

Conclusion
• Key points:
  • eagerly executing macro-instructions (value precomputation)
  • limiting focus to transient operands
  • all new hardware off critical path
• Results:
  • IPC speedup of 18-20% with 3KB strand cache
  • potential for frequency gains
  • full binary compatibility
• Lots of current and future research:
  • relaxed constraint of ALU instructions
  • quantified frequency improvements
  • static detection of strands
Questions?

Backup Slides

Sensitivity to Dispatch Delay
• On average, speedup drops only 1% with three cycles of delay
• Most benchmarks lose a small amount of speedup
• Some actually get faster due to fewer errant strands
[Figure: 4-wide IPC speedup with 16-entry strand cache]