Low-Complexity Reorder Buffer Architecture*. Gurhan Kucuk, Dmitry Ponomarev, Kanad Ghose Department of Computer Science State University of New York Binghamton, NY 13902-6000 http://www.cs.binghamton.edu/~lowpower.
Low-Complexity Reorder Buffer Architecture*
Gurhan Kucuk, Dmitry Ponomarev, Kanad Ghose
Department of Computer Science, State University of New York, Binghamton, NY 13902-6000
http://www.cs.binghamton.edu/~lowpower
16th Annual ACM International Conference on Supercomputing (ICS’02), June 24th 2002
*Supported in part by DARPA through the PAC-C program and NSF
Outline
• ROB complexities
• Motivation for the low-complexity ROB
• Low-complexity ROB design
• Results
• Concluding remarks
What This Work is All About
• Complex, richly-ported ROBs are common in modern superscalar datapaths
• The number of ports grows further when results are held within ROB slots (example: Pentium III)
• Reducing ROB complexity is important for reducing power and improving performance
• The ROB dissipates a non-trivial fraction of the total chip power
• ROB accesses stretch over several cycles
• Goal of this work: reduce the complexity and power dissipation of the ROB without sacrificing performance
Pentium III-like Superscalar Datapath
[Figure: datapath with Fetch (F1, F2), Decode/Dispatch (D1, D2), instruction dispatch into the IQ and LSQ, instruction issue to function units FU1 … FUm (EX), D-cache, ROB, architectural register file (ARF), and result/status forwarding buses]
ROB Port Requirements for a W-way CPU
• Decode/Dispatch: W write ports to set up entries
• Writeback: W write ports to write results
• Dispatch/Issue: 2W read ports to read the source operands
• Commit: W read ports for instruction commitment

ROB Port Requirements for a W-way CPU (with wide setup and commit ports)
• Decode/Dispatch: 1 W-wide write port to set up entries
• Writeback: W write ports to write results
• Dispatch/Issue: 2W read ports to read the source operands
• Commit: 1 W-wide read port for instruction commitment
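The two port breakdowns above can be tallied with a few lines of arithmetic. The sketch below is only an illustration: the function name `rob_port_counts` and the dictionary keys are made up for this example, and counting a W-wide port as a single port ignores the fact that it is physically wider.

```python
def rob_port_counts(w, wide_setup_and_commit=False):
    """Tally ROB ports for a W-way machine, following the slide's breakdown.

    Hypothetical helper: a W-wide port is counted as one port here.
    """
    setup = 1 if wide_setup_and_commit else w    # written at decode/dispatch
    writeback = w                                # one write port per result
    source_read = 2 * w                          # two source operands per instruction
    commit = 1 if wide_setup_and_commit else w   # read at commit
    return {
        "setup_write": setup,
        "writeback_write": writeback,
        "source_read": source_read,
        "commit_read": commit,
        "total": setup + writeback + source_read + commit,
    }


# Example: a 4-way machine, with and without the wide setup/commit ports.
for wide in (False, True):
    print(wide, rob_port_counts(4, wide))
```

In both variants the 2W source-operand read ports are the largest single group, which is why the rest of the talk targets them.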
Where are the Source Values Coming From?
[Figure: the Pentium III-like datapath with three source-operand paths into the IQ marked 1, 2 and 3]
Where are the Source Values Coming From?
[Chart: breakdown of source operand origins; labels 62%, 32%, 6%]
Configuration: 96-entry ROB, 4-way processor, SPEC2K benchmarks
How Efficiently are the Ports Used?
• Decode/Dispatch: W write ports to set up entries
• Writeback: W write ports to write results
• Dispatch/Issue: 2W read ports to read the source operands
• Commit: W read ports for instruction commitment
[Slide annotation: 6%]
Approaches to Reducing ROB Complexity
• Reduce the number of read ports for reading out the source operand values
• More radical (and better): completely eliminate the read ports for reading source operand values!
Reducing the Number of Read Ports
[Chart: performance drop (%) per benchmark: bzip2, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, vpr, Int Avg., applu, apsi, art, equake, mesa, mgrid, swim, wupwise, FP Avg.; average IPC drop labels: 3.5%, 1.0%]
Problems with Retaining Fewer Source Read Ports on the ROB
• Arbitration is needed for the small number of ports
• Additional logic is needed to block the instructions that could not get a port
• A switching network is needed to route the operands to the correct destinations
• Multi-cycle access still remains on the critical path of the Dispatch/Issue logic
Our Solution: Elimination of Read Ports
[Figure: animation build over the datapath with source-operand paths 1, 2 and 3; in the final step the path marked 2 is removed and only paths 1 and 3 remain]
Comparison of ROB Bitcells (0.18µ, TSMC)
[Figure: layout of a 32-ported SRAM bitcell next to the layout of a 16-ported SRAM bitcell]
• Area reduction: 71%
• Shorter bitlines and wordlines
Our Solution: Elimination of Read Ports
[Figure: the datapath without the ROB source-operand read path]
• Area reduction: 45%
Eliminating/Reducing the Number of Read Ports: Effects on Power Dissipation
• Power is reduced because of:
  • shorter bitlines and wordlines
  • lower capacitive loading
  • fewer decoders
  • fewer drivers and sense amps
Completely Eliminating the Source Read Ports on the ROB
• The problem: an instruction that requires a value stored in the ROB can no longer read it, so its issue stalls
• The solution: forward the value to the waiting instruction at the time the value is committed: LATE FORWARDING
Late Forwarding: Use the Normal Forwarding Buses!
[Figure: animation build over the datapath showing committed values placed on the existing result/status forwarding buses]
Optimizing Late Forwarding
• PROBLEM: If Late Forwarding is done for every committed result, additional forwarding buses are needed to avoid degrading performance
• SOLUTION: Selective Late Forwarding (SLF)
  • SLF requires an additional bit per ROB entry
  • That bit is set by dispatched instructions that require Late Forwarding
  • No additional forwarding buses are needed, since SLF traffic is very small
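The SLF bit described above can be sketched in a few lines. This is a behavioural illustration only, not the authors' hardware: `RobEntry`, `mark_for_late_forwarding`, `commit` and the `forward` callback are hypothetical names, and the ROB is modelled as a plain dictionary keyed by slot number.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class RobEntry:
    dest_arch: int                 # architectural destination register
    value: Optional[int] = None    # result, once written back
    slf: bool = False              # Selective Late Forwarding bit


def mark_for_late_forwarding(rob: Dict[int, RobEntry], slot: int) -> None:
    """Called at dispatch by a consumer that could not obtain the value."""
    rob[slot].slf = True


def commit(rob: Dict[int, RobEntry], slot: int, arf: Dict[int, int],
           forward: Callable[[int, int], None]) -> None:
    """Retire one entry; drive the forwarding bus only if the SLF bit is set."""
    entry = rob.pop(slot)
    arf[entry.dest_arch] = entry.value
    if entry.slf:
        forward(slot, entry.value)
```

Because `forward` is called only for entries whose bit was set at dispatch, most commits generate no bus traffic, which matches the slide's claim that no extra forwarding buses are needed.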
Late Forwarding: Use the Normal Forwarding Buses!
[Figure: the datapath with the result/status forwarding buses highlighted]
• Only 3.5% of the traffic is from SELECTIVE LATE FORWARDING
Performance Drop of Simplified ROB
[Chart: performance drop (%) per benchmark: bzip2, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, vpr, Int Avg., applu, apsi, art, equake, mesa, mgrid, swim, wupwise, FP Avg.; average IPC drop labels: 9.6%, 3.5%, 1.0%; additional bar labels: 17%, 37%]
IPC Penalty: Source Value Not Accessible within the ROB
[Figure: lifetime of a result value along a time axis, with labels Result Generation, Forwarding, Value within ROB, Late Forwarding/Commitment, Value within ARF]
Improving IPC with No Read Ports
• Cache recently generated values in a set of RETENTION LATCHES (RL)
• Retention Latches are SMALL and FAST
• Only 8 to 16 latches are needed in the set
• The entire set has 1 or 2 read ports
Datapath with the Retention Latches
[Figure: animation build adding a RETENTION LATCHES block to the datapath alongside the ROB, ARF and result/status forwarding buses]
The Structure of the Retention Latch Set
[Figure labels:]
• 8 or 16 latches, each with a Result Value field and a Status field
• L-ported CAM field (key = ROB_slot_id)
• Input: L ROB slot addresses (L = 1 or 2)
• Output: L recently-written results (L = 1 or 2 works great)
• W write ports for writing up to W results in parallel
Retention Latch Management Strategies
• FIFO
  • 8-entry RL: 42% hit rate
  • 16-entry RL: 55% hit rate
• LRU
  • 8-entry RL: 56% hit rate
  • 16-entry RL: 62% hit rate
• Random Replacement
  • Worse performance than FIFO
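The difference between the FIFO and LRU strategies above can be shown with a small software model. This is a sketch under assumptions, not the latch circuit from the talk: the class name `RetentionLatches` and its methods are invented for illustration, and an `OrderedDict` stands in for the CAM keyed by ROB slot id.

```python
from collections import OrderedDict


class RetentionLatches:
    """Toy model of a small value cache keyed by ROB slot id."""

    def __init__(self, size=8, policy="fifo"):
        if policy not in ("fifo", "lru"):
            raise ValueError("policy must be 'fifo' or 'lru'")
        self.size = size
        self.policy = policy
        self._entries = OrderedDict()  # oldest (or least recently used) first

    def write(self, rob_slot, value):
        """Record a newly generated result, evicting one entry if full."""
        if rob_slot in self._entries:
            del self._entries[rob_slot]
        elif len(self._entries) >= self.size:
            self._entries.popitem(last=False)
        self._entries[rob_slot] = value

    def lookup(self, rob_slot):
        """Return the cached value, or None on a miss."""
        if rob_slot not in self._entries:
            return None
        if self.policy == "lru":
            # A hit refreshes the entry only under LRU; FIFO ignores reads.
            self._entries.move_to_end(rob_slot)
        return self._entries[rob_slot]
```

The only behavioural difference is in `lookup`: under LRU a recently read value survives longer, which is consistent with the higher hit rates reported for LRU on the slide.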
Hit Ratios to Retention Latches
[Chart: hit ratios per benchmark: bzip2, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, vpr, Int Avg., applu, apsi, art, equake, mesa, mgrid, swim, wupwise, FP Avg.; average hit ratio labels: 42%, 55%, 56%, 62%]
Accessing Retention Latch Entries
• The ROB index is used as a unique key in the Retention Latches to search for result values
• Unique keys must be maintained even in the presence of:
  • Reuse of a ROB slot:
    • Not a problem for FIFO
    • For LRU, simply flush the RL entry at commit time
  • Branch mispredictions
Handling Branch Mispredictions
• Selective RL Flushing: retention latch entries that are in the mispredicted path are flushed
  • Uses branch tags
  • Complicated implementation
• Complete RL Flushing: all retention latch entries are flushed
  • Very simple implementation
  • Performance drop is only 1.5% compared to selective flushing
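The two flushing policies above can be contrasted in a short sketch. The data layout is an assumption made for illustration: each latch entry is a dictionary carrying a hypothetical `branch_tags` set that names the unresolved branches it was produced under; the slides say only that selective flushing "uses branch tags" and do not give the encoding.

```python
def flush_selective(latches, mispredicted_branch):
    """Keep only entries that do not depend on the mispredicted branch."""
    return {
        slot: entry
        for slot, entry in latches.items()
        if mispredicted_branch not in entry["branch_tags"]
    }


def flush_complete(latches):
    """Drop every entry regardless of which path produced it."""
    return {}


# Example: slot 14 was produced after (hypothetical) branch tag 3.
latches = {
    12: {"value": 7, "branch_tags": set()},
    14: {"value": 9, "branch_tags": {3}},
}
after_selective = flush_selective(latches, mispredicted_branch=3)
after_complete = flush_complete(latches)
```

Complete flushing needs no per-entry tag comparison at all, which is the simplicity the slide trades against the reported 1.5% performance drop.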
Misprediction Handling: Performance
[Chart: IPC for the misprediction handling schemes; average IPC drop label: 1.5%]
Scenario 1: Traditional Design
Instruction: ADD R1, R2, R3. The slides build up a simplified IDB entry #1 step by step:
• Initial IDB entry: ROB index = 5, Instruction = ADD, Src1 arch. = 2, Src2 arch. = 3; valid bits and values not yet known
• Rename table lookup (location bit: ROB = 0, ARF = 1): arch. register 2 maps to ROB slot 12 (bit 0); arch. register 3 maps to register 3 in the ARF (bit 1)
• Src1, case A: ROB slot 12 has valid = 1 and value = 7, so Src1 valid = 1 and Src1 value = 7
• Src1, case B: ROB slot 12 has valid = 0 and no value yet, so Src1 valid = 0 and Src1 value stays unknown
• Src2: ARF entry 3 holds 43, so Src2 valid = 1 and Src2 value = 43
Scenario 2: Simplified ROB with RLs
Instruction: ADD R1, R2, R3. The slides build up a simplified IDB entry #1 step by step:
• Initial IDB entry: ROB index = 5, Instruction = ADD, Src1 arch. = 2, Src2 arch. = 3; valid bits and values not yet known
• Rename table lookup (location bit: ROB = 0, ARF = 1): arch. register 2 maps to ROB slot 12 (bit 0); arch. register 3 maps to register 3 in the ARF (bit 1)
• Src1, RL hit: the Retention Latches hold an entry with key 12 and value 7, so Src1 valid = 1 and Src1 value = 7
• Src1, RL miss: no entry with key 12, so Src1 valid = 0; the SLF bit of ROB slot 12 changes from 0 to 1 (its valid and value fields are marked X: don't care)
• Src2: the rename table points to the ARF, where entry 3 holds 43
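The Scenario 2 walkthrough can be replayed in software with the numbers from the slides (R2 renamed to ROB slot 12, R3 in the ARF with value 43, latch value 7). The sketch below is an illustration under assumptions: `read_source` and the dictionary-based rename table, ARF, latch set and SLF bits are hypothetical stand-ins for the hardware structures.

```python
def read_source(arch_reg, rename, arf, latches, rob_slf):
    """Return (valid, value) for one source operand at dispatch.

    rename maps an architectural register to (index, in_arf);
    latches maps a ROB slot id to a cached result;
    rob_slf maps a ROB slot id to its Selective Late Forwarding bit.
    """
    index, in_arf = rename[arch_reg]
    if in_arf:
        return True, arf[index]        # committed value: read the ARF
    if index in latches:
        return True, latches[index]    # retention latch hit
    rob_slf[index] = True              # miss: request late forwarding
    return False, None


rename = {2: (12, False), 3: (3, True)}   # R2 -> ROB slot 12, R3 -> ARF entry 3
arf = {3: 43}

# Hit case: the latches still hold the result of ROB slot 12.
slf_bits = {12: False}
src1_hit = read_source(2, rename, arf, {12: 7}, slf_bits)
src2 = read_source(3, rename, arf, {12: 7}, slf_bits)

# Miss case: the latches no longer hold slot 12, so its SLF bit gets set.
slf_bits_miss = {12: False}
src1_miss = read_source(2, rename, arf, {}, slf_bits_miss)
```

Note that no path in `read_source` reads a value out of the ROB itself, which is the point of the simplified design: the ROB only has its SLF bit written.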