Reducing Datapath Energy Through the Isolation of Short-Lived Operands
Dmitry Ponomarev, Gurhan Kucuk, Oguz Ergin, Kanad Ghose
Department of Computer Science, State University of New York, Binghamton, NY 13902-6000
http://www.cs.binghamton.edu/~lowpower
Outline
• Introduction
• Motivations
• Contributions
  • Basic idea: isolate short-lived operands in a small dedicated register file and avoid their writes to the ROB and the ARF
  • Resources impacted: ROB, ARF
  • Power savings: 21% with a 32-entry additional RF
• Results
• Conclusions
• Future work
A P6-like Superscalar Datapath
[Figure: block diagram of the datapath. Fetch (F1, F2) and Decode/Dispatch (D1, D2) stages, instruction dispatch into the IQ and LSQ, instruction issue to function units FU1, FU2, ..., FUm (EX), D-cache, ROB, Architectural Register File (ARF), and result/status forwarding buses.]
Out-of-Order Execution and In-Order Retirement
[Figure: pipeline sketch. In-order front end (F, R, D), out-of-order core (Inst. Queue, Ex), in-order retirement (ROB, ARF).]
Energy-dissipating Events
[Figure: the same pipeline sketch (in-order front end, out-of-order core, in-order retirement), annotated with the energy-dissipating accesses: Write and Read on the ROB, Write on the ARF.]
The Idea: Isolating Short-Lived Values
• Write short-lived values into a small dedicated RF (SRF)
[Figure: the pipeline sketch with an SRF added next to the ROB. Annotated accesses: Write on the SRF, Read on the ROB, Write on the ARF.]
Register Renaming
• Used to avoid false data dependencies
• A new physical register is allocated for EVERY new result
• P6 style: ROB slots serve as physical registers
Original code:
    LOAD R1, R2, 100
    SUB R5, R1, R3
    ADD R1, R5, R4
Renamed code:
    LOAD P31, P2, 100
    SUB P32, P31, P3
    ADD P33, P32, P4
Register Renaming: the Implementation
• The Register Alias Table (RAT), also called the Rename Table (RT), maintains the mappings between logical and physical registers
• Instructions are renamed one at a time, in program order
Original code:
    LOAD R1, R2, 100
    SUB R5, R1, R3
    ADD R1, R5, R4
Renamed code:
    LOAD P31, R2, 100
    SUB P32, P31, R3
    ADD P33, P32, R4
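The renaming walk-through above can be reproduced with a few lines of Python. This is only a minimal sketch under stated assumptions: the function name `rename`, the tuple encoding of instructions, and the starting slot number 31 are illustrative choices, not part of the original design. Each destination is mapped to the next ROB slot, and sources with no mapping keep their architectural name.

```python
def rename(program, first_slot=31):
    """Rename (op, dest, src, ...) tuples P6-style: every result gets the next ROB slot."""
    rat = {}          # hypothetical RAT: logical register -> physical register name
    renamed_code = []
    slot = first_slot
    for op, dest, *srcs in program:
        # Sources are looked up first; unmapped operands (and immediates) stay unchanged.
        new_srcs = [rat.get(s, s) for s in srcs]
        phys = f"P{slot}"
        rat[dest] = phys   # later readers of `dest` now see the new physical register
        slot += 1
        renamed_code.append((op, phys, *new_srcs))
    return renamed_code


program = [
    ("LOAD", "R1", "R2", 100),
    ("SUB", "R5", "R1", "R3"),
    ("ADD", "R1", "R5", "R4"),
]
for instr in rename(program):
    print(instr)
```

The sketch ignores commit and slot reuse; it only illustrates how the mapping table removes the false dependence between the two writes to R1.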
Short-Lived Values
• Our definition: a value is short-lived if the destination register is renamed by the time of the result generation
• Identified one cycle before the result writeback
Original code:
    LOAD R1, R2, 100
    SUB R5, R1, R3
    ADD R1, R5, R4    (RENAMER of R1)
Renamed code:
    LOAD P31, R2, 100
    SUB P32, P31, R3
    ADD P33, P32, R4
The Good News: 80%+ of the Values are Short-Lived
• Configuration: 96-entry ROB, 4-way processor
• As rename-to-writeback latency increases in future datapaths, the percentage of short-lived values will also go up
The Idea: Isolating Short-Lived Values
• Write short-lived values into a small dedicated RF (SRF)
Example code:
    LOAD R1, R2, 100
    SUB R5, R1, R3
    ADD R1, R5, R4
[Figure: the pipeline sketch with an SRF added next to the ROB. Annotated accesses: Write on the SRF, Read on the ROB, Write on the ARF.]
Why do we need the SRF?
Need to hang on to the short-lived values to:
• Recover from branch mispredictions
• Reconstruct precise state
Example code:
    LOAD R1, R2, 100
    BEQ R5, R1, #100
    ADD R1, R5, R4
Identifying Short-Lived Values
• Maintain the bit-vector Renamed
• Set by the Renamer at the time of renaming
Original code:
    LOAD R1, R2, 100
    SUB R5, R1, R3
    ADD R1, R5, R4
Renamed code:
    LOAD P31, R2, 100
    SUB P32, P31, R3
    ADD P33, P32, R4
[Figure: Renamed[31] = 1]
Identifying Short-Lived Values
• Renamed bit is checked one cycle before writeback
• Value produced by LOAD is short-lived because Renamed[31] = 1
Original code:
    LOAD R1, R2, 100
    SUB R5, R1, R3
    ADD R1, R5, R4
Renamed code:
    LOAD P31, R2, 100
    SUB P32, P31, R3
    ADD P33, P32, R4
[Figure: Renamed[31] = 1]
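A small Python sketch of the Renamed bit-vector may help here. It assumes one bit per ROB slot, as the slides describe; the class name `RenameStage`, the method names, and the default ROB size of 96 are illustrative assumptions, and commit and squash handling are deliberately left out.

```python
class RenameStage:
    """Hypothetical model of the rename stage with a per-ROB-slot Renamed bit."""

    def __init__(self, rob_size=96):
        self.rat = {}                    # logical register -> ROB slot of latest producer
        self.renamed = [0] * rob_size    # Renamed bit-vector, one bit per ROB slot

    def rename_dest(self, dest_reg, rob_slot):
        # If an older in-flight producer of dest_reg exists, its value has now
        # been superseded: set its Renamed bit.
        older = self.rat.get(dest_reg)
        if older is not None:
            self.renamed[older] = 1
        self.rat[dest_reg] = rob_slot
        # A freshly allocated slot starts with its Renamed bit cleared.
        self.renamed[rob_slot] = 0

    def is_short_lived(self, rob_slot):
        # Checked one cycle before the producer in rob_slot writes back.
        return self.renamed[rob_slot] == 1


stage = RenameStage()
stage.rename_dest("R1", 31)   # LOAD R1, R2, 100
stage.rename_dest("R5", 32)   # SUB  R5, R1, R3
stage.rename_dest("R1", 33)   # ADD  R1, R5, R4 (the renamer of R1)
print(stage.is_short_lived(31))
```

With the three-instruction example, only slot 31 (the LOAD) is flagged, because ADD renames R1 before the LOAD result is written back.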
Managing the SRF: the Issues
• When do we write short-lived values into the SRF?
• When and how are the short-lived values removed from the SRF?
• What happens on a branch misprediction?
• How do we reconstruct a precise state?
Format of an SRF entry
Fields: Branch Tag 1 | Branch Tag 2 | Valid | ROB idx | Dest. Arch. Reg. | Data
• Branch Tag 1: branch identifier for the renamer; used to remove this entry if the renamer gets squashed
• Branch Tag 2: branch identifier for this instruction; used to remove this entry if this instruction gets squashed
• Branch identifier of an instruction = id/tag of the immediately preceding conditional branch
Writing to the SRF: the Conditions
• An instruction writes a short-lived result value into the SRF if:
  • A free entry exists in the SRF
  • No SRF entry keyed with the same ROB slot is already established
• Bit-vector Allocated_in_SRF is maintained
  • One bit for each ROB entry
  • Set at the time of writeback if the value is written into the SRF
  • Reset at the time of removing the value from the SRF
SRF format: Branch Tag 1 | Branch Tag 2 | Valid | ROB idx | Dest. reg | Data
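The entry format and the two write conditions can be sketched as follows. This is an illustrative model, not the authors' hardware: the names `SRFEntry`, `ShortLivedRF`, `try_write`, and `remove` are hypothetical, and the fallback when a write is refused (the value taking the ordinary ROB path) is an assumption based on the "SRF was full" example later in the deck.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SRFEntry:
    branch_tag_1: int   # tag of the branch preceding the renamer
    branch_tag_2: int   # tag of the branch preceding the producing instruction
    rob_idx: int        # ROB slot of the producing instruction
    dest_reg: str       # destination architectural register
    data: int
    valid: bool = True


class ShortLivedRF:
    def __init__(self, num_entries: int = 32, rob_size: int = 96) -> None:
        self.entries: List[Optional[SRFEntry]] = [None] * num_entries
        self.allocated_in_srf = [0] * rob_size   # one bit per ROB entry

    def try_write(self, entry: SRFEntry) -> bool:
        """Return True if the value was placed in the SRF."""
        # Condition 2: no entry keyed with the same ROB slot may already exist.
        if self.allocated_in_srf[entry.rob_idx]:
            return False
        # Condition 1: a free SRF entry must exist.
        for k, slot in enumerate(self.entries):
            if slot is None:
                self.entries[k] = entry
                self.allocated_in_srf[entry.rob_idx] = 1   # set at writeback
                return True
        return False   # SRF full; assumed fallback is the normal ROB write

    def remove(self, rob_idx: int) -> Optional[SRFEntry]:
        for k, slot in enumerate(self.entries):
            if slot is not None and slot.rob_idx == rob_idx:
                self.entries[k] = None
                self.allocated_in_srf[rob_idx] = 0         # reset on removal
                return slot
        return None
```

A linear scan stands in for whatever allocation logic the hardware would use; the point is only the ordering of the two checks and the maintenance of Allocated_in_SRF.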
Scenarios for Removing the Values from the SRF
• Scenario 1: Normal commitment of the renamer
• Scenario 2: Renamer gets squashed
• Scenario 3: The instruction generating the short-lived value itself gets squashed
Removing the Values from the SRF: Scenario 1
• Values are removed by the Renamer
• 2-step process:
  • Mark the instruction whose value is to be removed from the SRF (done at the time of renaming)
  • Remove the marked value from the SRF IF NEED BE (done at the time of commitment)
• When ADD commits, it removes the value written by LOAD
Original code:
    LOAD R1, R2, 100
    SUB R5, R1, R3
    ADD R1, R5, R4    (Renamer)
Renamed code:
    LOAD P31, R2, 100
    SUB P32, P31, R3
    ADD P33, P32, R4
Marking the Values for Removal
Original code:
    LOAD R1, R2, 100
    SUB R5, R1, R3
    ADD R1, R5, R4
Renamed code:
    LOAD P31, R2, 100
    SUB P32, P31, R3
    ADD P33, P32, R4
[Figure: ROB slots 31, 32, 33 hold LOAD, SUB, ADD. The FS (Flush SRF) field of the ROB entry for ADD is set to 31, the slot of LOAD.]
Removing the Values (B is the renamer for A)
• FS field of B must match the ROB index field of an SRF entry
• This SRF entry must belong to A
Code:
    LOAD R1, R2, 100    (A)
    SUB R5, R1, R3
    ADD R1, R5, R4      (B)
[Figure: ROB slots 31, 32, 33 hold LOAD (A), SUB, ADD (B); FS field of B = 31. SRF entry: Valid = 1, ROB idx = 31, Dest = 1, Data = result of the load.]
SRF format: Branch Tag 1 | Branch Tag 2 | Valid | ROB idx | Dest | Data
Another Example (LOAD could not write to SRF)
Original code:
    LOAD R1, R2, 100
    SUB R5, R1, R3
    ADD R1, R5, R4
    …
    MUL R2, R3, R4
    DIV R2, R2, R5
Renamed code:
    LOAD P31, R2, 100    (Committed)
    SUB P32, P31, R3     (Committed)
    ADD P33, P32, R4
    …
    MUL P31, R3, R4
    DIV P32, P31, R5
Sequence of events:
• ADD renames R1, so Renamed[31] = 1, but LOAD cannot write its value into the SRF: the SRF was full!
• LOAD and SUB commit; ROB slot 31 is freed and Renamed[31] = 0
• MUL is allocated to ROB slot 31 (P31)
• DIV renames R2, the destination of MUL, so Renamed[31] = 1 again
Another Example (A's ROB slot is assigned for C)
Renamed code:
    LOAD P31, R2, 100    (A)
    SUB P32, P31, R3
    ADD P33, P32, R4     (B)
    …
    MUL P31, R3, R4      (C)
    DIV P32, P31, R5     (D)
[Figure, before: ROB slots 31, 32, 33 hold LOAD (A), SUB, ADD (B); FS field of B = 31; the bit shown for slot 31 is 0.]
[Figure, after: ROB slots 31, 32, 33 hold MUL (C), DIV (D), ADD (B); FS field of B = 31 and FS field of D = 31. SRF entry: Valid = 1, ROB idx = 31, Dest = 2, Data = result of the mul.]
SRF format: Branch Tag 1 | Branch Tag 2 | Valid | ROB idx | Dest | Data
Ensuring that the right values are removed
• Bit-vector Uncommitted_Write is maintained
  • One bit for each ROB entry
  • Set at the time of establishing SRF entry
  • Reset at the time of commitment
• Instruction B removes the value written by A (allocated to ROB slot i) if:
  • Allocated_in_SRF[i] = 1, and
  • Uncommitted_Write[i] = 0
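The removal check performed when a renamer commits can be sketched in Python. The function name `commit_renamer`, the dictionary-based SRF entries, and the list-based bit-vectors are assumptions made for illustration; only the two-bit condition itself comes from the slide.

```python
def commit_renamer(fs, srf, allocated_in_srf, uncommitted_write):
    """Called when renamer B commits; fs is B's FS (Flush SRF) field.

    Returns the removed SRF entry, or None if nothing may be removed.
    """
    if fs is None:
        return None
    # The entry keyed by slot fs belongs to A only if A's write has already
    # committed. If the slot was reused by a younger, uncommitted instruction C,
    # Uncommitted_Write[fs] is still 1 and the entry must be left alone.
    if not (allocated_in_srf[fs] == 1 and uncommitted_write[fs] == 0):
        return None
    for k, entry in enumerate(srf):
        if entry is not None and entry["rob_idx"] == fs:
            srf[k] = None
            allocated_in_srf[fs] = 0
            return entry
    return None


# Case 1: A (slot 31) has committed, so B may remove A's value.
srf = [{"rob_idx": 31, "dest": "R1", "data": 7}]
alloc, uncommitted = [0] * 96, [0] * 96
alloc[31] = 1
print(commit_renamer(31, srf, alloc, uncommitted))

# Case 2: slot 31 was reused by C, which has not committed yet.
srf = [{"rob_idx": 31, "dest": "R2", "data": 9}]
alloc, uncommitted = [0] * 96, [0] * 96
alloc[31], uncommitted[31] = 1, 1
print(commit_renamer(31, srf, alloc, uncommitted))
```

The second case mirrors the MUL/DIV example: the ROB index matches B's FS field, but the Uncommitted_Write bit shows the entry belongs to a different instruction.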
Avoiding Unnecessary Commitments
• When an instruction allocated to ROB slot i commits and Allocated_in_SRF[i] = 1, the data is not copied to the ARF
[Figure: the pipeline sketch with the SRF. Annotated accesses: Write on the SRF, Read on the ROB, Write on the ARF.]
Handling Branch Mispredictions: Scenario 2
• Problem:
  • Renamer can get squashed -> stale entries remain in the SRF if nothing is done
• Example:
[Figure, before the misprediction: ROB slots 31, 32, 33, 34 hold LOAD, BR, SUB, ADD; FS field of ADD = 31. SRF entry: Valid = 1, ROB idx = 31, Dest = 1, Data = result of the load.]
[Figure, after the misprediction: ROB slots 31 and 32 hold LOAD and BR; SUB and ADD are squashed. The SRF entry for ROB idx 31 is still present.]
Handling Branch Mispredictions
• Solution:
  • Tag each entry in the SRF with the id of the branch preceding the renamer (BT1)
  • When the renamer is squashed, the value is removed from the SRF and is written to either the ROB or the ARF (based on the value of the Uncommitted_Write bit)
  • Multiplex the ports to reduce complexity
SRF format: Branch Tag 1 | Branch Tag 2 | Valid | ROB idx | Dest | Data
Obtaining Branch Tag BT1
• Maintain the array Branch_Tags
• One entry for each ROB slot
Original code:
    LOAD R1, R2, 100
    BEQ R6, R7, 200
    SUB R5, R1, R3
    ADD R1, R5, R4
Renamed code:
    LOAD P31, P2, 100
    BEQ P6, P7, 200
    SUB P33, P31, P3
    ADD P34, P33, P4
[Figure: Branch_Tags[31] = 7]
Handling Branch Mispredictions: Scenario 3
• Problem:
  • The instruction whose value was inserted into the SRF can itself be squashed
• Example:
[Figure, before the misprediction: ROB slots 30, 31, 32, 33 hold BR, LOAD, SUB, ADD; FS field of ADD = 31. SRF entry: Valid = 1, ROB idx = 31, Dest = 1, Data = result of the load.]
[Figure, after the misprediction: only BR remains in ROB slot 30; LOAD, SUB and ADD are squashed. The SRF entry for ROB idx 31 is still present.]
Handling Branch Mispredictions
• Solution:
  • Tag each entry in the SRF with the id of the branch preceding the instruction itself (BT2)
  • Simply remove the value from the SRF if such a branch is mispredicted
SRF format: Branch Tag 1 | Branch Tag 2 | Valid | ROB idx | Dest | Data
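The two tag checks (BT1 for a squashed renamer, BT2 for a squashed producer) can be combined into one misprediction handler. The Python below is a sketch under several assumptions that the slides do not spell out: `squashed_tags` is taken to be the set of tags of the mispredicted branch and any younger branches, entries are plain dictionaries, and a value evicted because of BT1 is assumed to go to the ROB when Uncommitted_Write is 1 and to the ARF when it is 0.

```python
def on_mispredict(squashed_tags, srf, allocated_in_srf, uncommitted_write, rob, arf):
    """Clean the SRF after a branch misprediction (illustrative model only)."""
    for k, entry in enumerate(srf):
        if entry is None:
            continue
        slot = entry["rob_idx"]
        if entry["bt2"] in squashed_tags:
            # Scenario 3: the producer itself is squashed, so the value is dropped.
            pass
        elif entry["bt1"] in squashed_tags:
            # Scenario 2: only the renamer is squashed, so the value is still
            # needed and is moved out of the SRF (destination is an assumption).
            if uncommitted_write[slot]:
                rob[slot] = entry["data"]
            else:
                arf[entry["dest"]] = entry["data"]
        else:
            continue   # entry unaffected by this misprediction
        srf[k] = None
        allocated_in_srf[slot] = 0


srf = [
    {"bt1": 7, "bt2": 3, "rob_idx": 31, "dest": "R1", "data": 10},  # renamer squashed
    {"bt1": 9, "bt2": 7, "rob_idx": 35, "dest": "R6", "data": 20},  # producer squashed
    {"bt1": 2, "bt2": 1, "rob_idx": 40, "dest": "R8", "data": 30},  # unaffected
]
alloc, uncommitted = [0] * 96, [0] * 96
for s in (31, 35, 40):
    alloc[s] = 1
uncommitted[31] = 1
rob, arf = {}, {}
on_mispredict({7}, srf, alloc, uncommitted, rob, arf)
print(srf, rob, arf)
```

Checking BT2 before BT1 matters: if the producer is gone, its value must not be copied anywhere, even if the renamer was squashed as well.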
Supporting Precise Interrupts
• Allow all instructions preceding the faulting instruction to commit
• Squash all instructions following the faulting instruction
• Copy the values of ALL valid SRF entries to the ARF
SRF format: Branch Tag 1 | Branch Tag 2 | Valid | ROB idx | Dest | Data
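The last step of the interrupt sequence is simple enough to sketch directly. The function name `flush_srf_to_arf` and the dictionary layout are illustrative assumptions; the sketch also assumes that entries belonging to squashed instructions have already been invalidated, so every remaining valid entry holds committed state.

```python
def flush_srf_to_arf(srf, arf):
    """Copy every valid SRF entry into the ARF to rebuild the precise state."""
    for entry in srf:
        if entry is not None and entry["valid"]:
            arf[entry["dest"]] = entry["data"]
    return arf


arf = {"R1": 0, "R2": 5}
srf = [
    {"valid": True, "dest": "R1", "data": 42},
    None,
    {"valid": False, "dest": "R2", "data": 99},   # invalid entry is ignored
]
print(flush_srf_to_arf(srf, arf))
```

This copy is needed because values held in the SRF were never written to the ARF at commit time.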
Experimental Setup
[Figure: simulation flow. Inputs: compiled SPEC benchmarks and datapath specs. A microarchitectural simulator produces performance stats and passes transition counts and context information through inter-thread buffers to a data analyzer (intra-stream analysis); the two run as separate threads. An energy/power estimator combines this with SPICE measures of energy per transition, obtained from SPICE decks built from VLSI layout data, and reports power/energy stats.]
Results: Percentage of Values Written into the SRF
[Figure: bar chart, y-axis in %. Values shown: 40.5%, 60.1%, 77.5%, 82.3%, 86.7%.]
Results: Average Time Spent by a Value in the SRF
[Figure: bar chart, y-axis in cycles.]
• Average: 12-15 cycles
Results: Percentage of Values not copied into the ARF
[Figure: bar chart, y-axis in %. Values shown: 42.2%, 61.9%, 79.3%, 84.1%, 86.7%.]
Results: Net Energy Reduction
[Figure: stacked bar chart, y-axis in pJ. Components: SRF, ARF, ROB + additional logic. Net reductions shown: 9%, 16%, 21%, 23%.]
Related Work
• Register Traffic Analysis (Franklin and Sohi, MICRO'92)
  • Studied the useful lifetime of register instances
  • Delaying the writes until 30 more instructions are dispatched can eliminate 80% of the writes (if perfect knowledge of the last use is available)
  • Buffering the 30 most recently generated results avoids 80% of writebacks
• Lozano and Gao (MICRO'95)
  • 90% of all result values are short-lived (consumed while in the ROB)
  • A mechanism is proposed to avoid commitment of these values and also to avoid register allocation for them
  • ROB slots are exposed to the compiler in the form of symbolic registers
• Lazy Retirement (Savransky, Ronen, Gonzalez, WCED'02)
  • Hardware-based scheme to avoid unnecessary commitments
  • Copying from the ROB to the ARF is delayed until the ROB slot is reused. In many cases, the register is invalidated by the newer instruction
  • An additional rename table is needed. About 75% of commits are avoided
Conclusions
• Significant power savings and negligible impact on performance
• Sources of power savings:
  • The majority of generated results are written into a small, lightly-ported SRF
  • Unnecessary commitments are avoided
• Additional logic/storage needed to do this is simple
• For a 32-entry SRF, more than 77% of writebacks and more than 79% of commitments can be avoided
• This results in energy savings of 21% on the ROB and the ARF
THANK YOU!
LOW POWER RESEARCH GROUP
Department of Computer Science, State University of New York, Binghamton, NY 13902-6000
http://www.cs.binghamton.edu/~lowpower
Parallel Architectures and Compilation Techniques (PACT'03), October 1st 2003
This work was supported in part by DARPA through the PAC-C program and NSF