Dynamically Collapsing Dependencies for IPC and Frequency Gain

Explore how Dynamic Strands collapse dependence chains in the processor pipeline, improving IPC and reducing pressure on critical resources for increased efficiency.


Presentation Transcript


  1. Dynamically Collapsing Dependencies for IPC and Frequency Gain
     Peter G. Sassone, D. Scott Wills
     Georgia Tech Electrical and Computer Engineering
     { sassone, scott.wills } @ ece.gatech.edu

  2. Motivation
     • Outside the pipeline, global communication dominates; the memory wall is well studied.
     • Inside the pipeline, delay has traditionally been dominated by computation and logic.
     [Diagram: pipeline stages (fetch, decode, rename, issue, exec, commit) alongside the I-cache, D-cache, L2 cache, and memory]

  3. Motivation
     • Now dominated by local communication paths:
       • issue window
       • reorder buffer
       • register file
       • bypass network
     • Bottlenecks both IPC and frequency.
     [Diagram: issue logic, issue queue, register file, and ALUs connected by the bypass network]

  4. Motivation
     • RISC instruction sets create superfluous traffic.
     • All instructions and operands are treated as equal.
     • Little focus on exposing sequentiality.
     [Diagram: issue logic, issue queue, register file, and ALUs connected by the bypass network]

  5. Contributions
     • Dynamic Strands:
       • collapse dependence chains without fan-out
       • exploit their properties for simple value precomputation
       • increase the efficiency of critical resources
       • preserve binary compatibility
     • IPC improvements:
       • 17-20% speedup on Spec2000int and MediaBench
     • Frequency improvements:
       • 37% fewer in-flight instructions
       • reduced dependence on dependencies

  6. Outline
     • Motivation
     • Transient Operands and Strands
     • Instruction Replacement Hardware
     • Results
     • Conclusion

  7. Dyadic Dilemma
     • Performing any operation on more than two sources requires temporary values.

       int sum( int a, int b, int c, int d ) {
           return a + b + c + d;
       }

       ...
       add R1 ← R1, R2
       add R1 ← R1, R3
       add R9 ← R1, R4
       ...

     [Diagram: dataflow graph of the additions, producing transients R1' and R1'' before the final result in R9]

  8. Transient Operands
     • We term these temporary values transient operands (a detection sketch follows this slide):
       • values produced by an ALU instruction
       • values consumed only once, and only by an ALU instruction
     • Common in modern integer workloads: on average, about 40% of all dynamic operands are transient.
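     The transient-operand test can be phrased as a simple scan over a dynamic instruction trace.
     Below is a minimal C sketch of that scan, our own illustration rather than the authors'
     measurement tooling: the Inst trace format, its field names, and count_transients are all
     assumptions. A value is classified when its register is overwritten; values still live at the
     end of the trace are ignored in this sketch.

        #include <stdbool.h>

        #define NUM_REGS 32

        /* One dynamic instruction in a simplified trace (assumed format). */
        typedef struct {
            bool is_alu;   /* simple ALU op (add, sub, logical, ...) */
            int  dest;     /* destination register, or -1            */
            int  src[2];   /* source registers, or -1                */
        } Inst;

        /* Bookkeeping for the value currently held in each register. */
        typedef struct {
            bool produced_by_alu;
            bool consumed_by_non_alu;
            int  consumer_count;
        } RegState;

        /* Counts register values produced by an ALU instruction and consumed
         * exactly once, by another ALU instruction, i.e. transient operands. */
        int count_transients(const Inst *trace, int n)
        {
            RegState rs[NUM_REGS] = {{0}};
            int transients = 0;

            for (int i = 0; i < n; i++) {
                const Inst *in = &trace[i];

                /* Record each consumption of a source register's current value. */
                for (int s = 0; s < 2; s++) {
                    int r = in->src[s];
                    if (r >= 0) {
                        rs[r].consumer_count++;
                        if (!in->is_alu)
                            rs[r].consumed_by_non_alu = true;
                    }
                }

                /* A write kills the previous value: classify it now. */
                if (in->dest >= 0) {
                    RegState *old = &rs[in->dest];
                    if (old->produced_by_alu &&
                        old->consumer_count == 1 &&
                        !old->consumed_by_non_alu)
                        transients++;
                    old->produced_by_alu     = in->is_alu;
                    old->consumed_by_non_alu = false;
                    old->consumer_count      = 0;
                }
            }
            return transients;
        }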

  9. Strands
     • Strands are linear chains of instructions joined by transient operands:
       • possibly non-consecutive
       • may span basic blocks
       • up to three instructions
       • only the final output needs to be committed (see the evaluation sketch after this slide)
     • Strands are common, due to:
       • dyadic temporaries
       • compiler strategies
       • language semantics
     [Diagram: dataflow tree collapsing a, b, c, and d through chained additions into one result]
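     To make the collapsing idea concrete, here is a minimal C sketch of a strand as a data
     structure plus its one-shot evaluation. This is an illustration under assumed names
     (Strand, StrandLink, execute_strand), not the paper's hardware; the intermediate values
     never escape the function, mirroring the point that only the final output is committed.

        #define MAX_STRAND_LEN 3

        /* One link of a strand: the ALU operation applied to the running
         * value and one additional live input read from outside the strand. */
        typedef struct {
            char op;          /* '+', '-', '&', ...                  */
            long live_input;  /* register/immediate value, already read */
        } StrandLink;

        typedef struct {
            int        length;               /* 1 .. MAX_STRAND_LEN           */
            long       first_input;          /* live input feeding the head   */
            StrandLink link[MAX_STRAND_LEN];
            int        dest_reg;             /* only this result is committed */
        } Strand;

        /* Evaluates the whole chain at once; the transients stay local. */
        long execute_strand(const Strand *s)
        {
            long v = s->first_input;
            for (int i = 0; i < s->length; i++) {
                switch (s->link[i].op) {
                case '+': v += s->link[i].live_input; break;
                case '-': v -= s->link[i].live_input; break;
                case '&': v &= s->link[i].live_input; break;
                default:  break;   /* other simple ALU ops elided */
                }
            }
            return v;   /* written to dest_reg at commit */
        }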

  10. Outline
     • Motivation
     • Transient Operands and Strands
     • Instruction Replacement Hardware
     • Results
     • Conclusion

  11. Hardware Overview
     [Diagram: the baseline pipeline (fetch, decode, rename, issue queue, reg file, ALUs, commit) augmented, off the critical path, with a strand cache fill unit, a strand cache, and a dispatch engine; strands flow to closed-loop ALUs while the transients stay local]

  12. Algorithm Example
     [Animated diagram: numbered steps walk an instruction sequence through the strand cache fill unit, strand cache, dispatch engine, and closed-loop ALUs from the previous slide]

  13. Strand Cache Fill Unit
     • Based around the operand table.
     • Detects the conditions for transients.
     • When one is found (a software analogue follows this slide):
       • append to an existing strand, or
       • begin a new strand.
     [Diagram: operand table indexed by architectural register, holding the last producer instruction, last consumer instruction, and consumer count, updated by an example instruction sequence at PCs 1404-1416]
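     A software analogue of the fill unit's bookkeeping might look like the sketch below. The
     slide only specifies the operand table's last-producer, last-consumer, and consumer-count
     columns; the ALU flags, field types, and the append_or_start_strand hook are assumptions.

        #include <stdbool.h>

        #define NUM_REGS 32

        /* One operand-table entry per architectural register. */
        typedef struct {
            unsigned last_producer_pc;
            unsigned last_consumer_pc;
            int      consumer_count;
            bool     producer_is_alu;
            bool     consumer_is_alu;
        } OperandEntry;

        static OperandEntry op_table[NUM_REGS];

        /* Hypothetical hook: append the producer/consumer pair to an
         * existing strand or begin a new one. */
        static void append_or_start_strand(unsigned producer_pc, unsigned consumer_pc)
        {
            (void)producer_pc;
            (void)consumer_pc;
        }

        /* Called for each observed instruction, off the critical path. */
        void fill_unit_observe(unsigned pc, bool is_alu, int dest, const int src[2])
        {
            for (int s = 0; s < 2; s++) {
                if (src[s] >= 0) {
                    OperandEntry *e = &op_table[src[s]];
                    e->last_consumer_pc = pc;
                    e->consumer_is_alu  = is_alu;
                    e->consumer_count++;
                }
            }
            if (dest >= 0) {
                OperandEntry *e = &op_table[dest];
                /* The old value dies on overwrite; if it had an ALU producer
                 * and exactly one ALU consumer it was transient: link them. */
                if (e->producer_is_alu && e->consumer_count == 1 && e->consumer_is_alu)
                    append_or_start_strand(e->last_producer_pc, e->last_consumer_pc);
                e->last_producer_pc = pc;
                e->producer_is_alu  = is_alu;
                e->consumer_count   = 0;
            }
        }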

  14. Strand Cache
     • About 175 bytes per line, though very few lines are needed for effect (a rough struct model follows this slide).
     [Diagram: strand-cache line layout: status bits; instruction slots (seen, pc, inst) for this instruction and its sources; previous-reader info (pc, ready, value); strands 1-3 shown with instructions i1-i3]
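     As a rough software model of the layout, one strand-cache line might be described by the
     struct below. Field widths, types, and the two-live-input limit are assumptions; the slide
     itself only gives the labeled fields and the roughly 175-byte estimate.

        #include <stdbool.h>

        #define STRAND_INSTS 3   /* i1, i2, i3 on the slide */

        /* One component instruction held in the line. */
        typedef struct {
            bool     seen;    /* component instruction seen at dispatch? */
            unsigned pc;
            unsigned inst;    /* original instruction encoding           */
        } StrandInstSlot;

        /* Previous-reader bookkeeping for a live input. */
        typedef struct {
            unsigned pc;      /* instruction that last read this input */
            bool     ready;
            long     value;   /* captured value, valid when ready      */
        } PrevReaderInfo;

        /* One strand-cache line (assumed arrangement). */
        typedef struct {
            unsigned       status_bits;
            StrandInstSlot inst[STRAND_INSTS];
            PrevReaderInfo live_input[2];   /* slide 23: ~1.2 live inputs per strand */
        } StrandCacheLine;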

  15. Dispatch Engine
     • Watches for strand cache matches.
     • Inserts ready strands into the instruction stream eagerly.
     • Removes the component instructions when they are seen.
     • Correctness checking with a dirty table (control-flow sketch after this slide).
     [Diagram: the dispatch engine sits between decode and rename, consulting the strand cache and dirty table on pre-renamed instructions and emitting strands, recovery strands, and kill signals]
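     The control flow described on this slide might be sketched as follows. Every helper
     (strand_cache_lookup, dirty_table_check, and so on) is a hypothetical interface standing
     in for hardware, not an API from the paper; only the decision structure is taken from the
     bullets above.

        #include <stdbool.h>

        typedef struct Inst Inst;       /* pre-renamed instruction (opaque) */
        typedef struct Strand Strand;   /* strand-cache entry (opaque)      */

        extern Strand  *strand_cache_lookup(unsigned pc);          /* assumed */
        extern bool     strand_is_ready(const Strand *s);          /* assumed */
        extern bool     dirty_table_check(const Strand *s);        /* assumed */
        extern void     insert_into_stream(Strand *s);             /* assumed */
        extern bool     belongs_to_issued_strand(const Inst *in);  /* assumed */
        extern void     drop_instruction(Inst *in);                /* assumed */
        extern void     forward_to_rename(Inst *in);               /* assumed */
        extern unsigned inst_pc(const Inst *in);                   /* assumed */

        /* Per-instruction step between decode and rename. */
        void dispatch_engine_step(Inst *in)
        {
            Strand *s = strand_cache_lookup(inst_pc(in));

            /* On a match, eagerly insert the strand if it is ready and the
             * dirty table raises no correctness concern. */
            if (s && strand_is_ready(s) && dirty_table_check(s))
                insert_into_stream(s);

            /* Component instructions already covered by an issued strand are
             * removed from the stream; everything else proceeds normally. */
            if (belongs_to_issued_strand(in))
                drop_instruction(in);
            else
                forward_to_rename(in);
        }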

  16. Closed-Loop ALUs
     • Full bypass is half of the execute stage delay.
     • Regular ALUs with a double-speed closed-loop mode:
       • two dependent ALU operations in a single cycle
       • intermediate values (the transients) are discarded!
       • the final result still takes ½ cycle for full bypass (arithmetic sketched below)
     [Diagram: an ALU with a "free" local bypass loop (½ cycle) and a mode switch to the full bypass network (½ cycle)]
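     As a back-of-the-envelope illustration of the slide's ½-cycle figures (our arithmetic, not a
     result from the paper): if the ALU operation and the full bypass each cost about half a cycle,
     two dependent operations cost two cycles in the baseline but only about 1.5 cycles when the
     transient takes the free local bypass.

        #include <stdio.h>

        int main(void)
        {
            const double alu_op      = 0.5;  /* double-speed ALU operation         */
            const double full_bypass = 0.5;  /* full bypass is half the exec stage */

            /* Baseline: each dependent op pays the full bypass to reach the next. */
            double baseline = 2.0 * (alu_op + full_bypass);

            /* Closed loop: the transient takes the free local bypass; only the
             * final result pays the full bypass. */
            double closed_loop = 2.0 * alu_op + full_bypass;

            printf("two dependent ALU ops: baseline %.1f cycles, closed-loop %.1f cycles\n",
                   baseline, closed_loop);
            return 0;
        }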

  17. Oops… Dirty Read
     • A later instruction (here, load ← 16[R1]) reads an intermediate value the strand discarded: R1 is dirty!
     • Fix: insert a recovery sub-strand to recover R1.
     [Diagram: the collapsed dependence chain (R1, R2, R3, R4 feeding R9) beside the recovery sub-strand that recomputes R1]

  18. Oops… Anti-Dependence Violation
     • The eagerly executed strand has already replaced R9, but an outstanding load (load 32[R9]) still needs the previous value.
     • Renaming is not sufficient: the strand executes outside the reorder buffer safety net.
     • Fix: insert a load immediate of the previous R9 value.
     [Diagram: the dependence chain writing R9, with the previous R9 value preserved for the outstanding load]

  19. Outline
     • Motivation
     • Transient Operands and Strands
     • Instruction Replacement Hardware
     • Results
     • Conclusion

  20. Instruction Coverage
     [Chart: ALU instruction coverage with various strand cache sizes]
     • Average ALU instruction coverage: 12% with a 16-entry strand cache, 27% with 1024 entries.
     • High coverage rates are reached only with a big strand cache.
     • Less than a 15% replacement rate, regardless of cache size.

  21. IPC Improvements
     [Chart: 4-wide IPC speedup with a 16-entry strand cache]
     • Average IPC speedup: 17% on a 4-wide machine, 20% on an 8-wide machine.
     • Some benchmarks almost double in IPC; some see almost no speedup at all.

  22. Resource Occupancy
     • CISCification of instructions reduces traffic:
       • reorder buffer occupancy is reduced up to 37%
       • issue queue occupancy is reduced up to 34%
       • traffic reduction scales with coverage
     • Reduced dependence on dependencies:
       • opportunity for pipelined bypass
       • opportunity for pipelined issue
     [Diagram: a window of instructions before and after strand formation, with chains of additions collapsed into strands]

  23. Resource Occupancy
     • Caveat emptor:
       • more worst-case issue CAMs
       • more worst-case register ports
     • Prior work is applicable: strands average only 1.2 live inputs.
     [Diagram: a strand's collapsed additions alongside its few live inputs]

  24. Outline
     • Motivation
     • Transient Operands and Strands
     • Instruction Replacement Hardware
     • Results
     • Conclusion

  25. Conclusion
     • Key points:
       • eagerly executing macro-instructions yields value precomputation
       • limiting the focus to transient operands
       • all new hardware is off the critical path
     • Results:
       • IPC speedup of 18-20% with a 3KB strand cache
       • potential for frequency gains
       • full binary compatibility
     • Lots of current and future research:
       • relaxing the ALU-instruction constraint
       • quantifying the frequency improvements
       • static detection of strands
     Questions?

  26. Backup Slides

  27. Sensitivity to Dispatch Delay
     [Chart: 4-wide IPC speedup with a 16-entry strand cache under added dispatch delay]
     • On average, speedup drops only 1% with three cycles of delay.
     • Most benchmarks lose a small amount of speedup; some actually get faster due to fewer errant strands.
