850 likes | 1.01k Views
Distributed Reorder Buffer Schemes for Low Power *. Gurhan Kucuk, Oguz Ergin, Dmitry Ponomarev, Kanad Ghose Department of Computer Science State University of New York Binghamton, NY 13902-6000 http://www.cs.binghamton.edu/~lowpower.
E N D
ICCD’03 Distributed Reorder Buffer Schemes for Low Power * Gurhan Kucuk, Oguz Ergin, Dmitry Ponomarev, Kanad Ghose Department of Computer Science State University of New York Binghamton, NY 13902-6000 http://www.cs.binghamton.edu/~lowpower 21st International Conference on Computer Design (ICCD’03), October 14th 2003 *supported in part by DARPA through the PAC-C program and NSF
ICCD’03 Outline • Reorder Buffer (ROB) complexities • Motivation for the low-complexity ROB • Low-complexity ROB designs • Fully Distributed ROB • Retention Latches (RLs) revisited (ICS’02) • Combined Scheme • Results • Concluding remarks
ICCD’03 P6-style Superscalar Datapath Function Units Architectural Register File Instruction Issue IQ FU1 F1 F2 D1 D2 FU2 ROB ARF FUm Fetch Decode/Dispatch EX Instruction dispatch Result/status forwarding buses
ICCD’03 PPC 620-style Superscalar Datapath Function Units Architectural Register File Instruction Issue RB IQ FU1 F1 F2 D1 D2 FU2 ROB ARF FUm Fetch Decode/Dispatch EX Instruction dispatch Result/status forwarding buses
ICCD’03 ROB Port Requirements for a W-way CPU Decode/Dispatch W write ports to setup entries Writeback W write ports to write results ROB Dispatch/Issue 2W read ports to read the source operands Commit W read ports for instruction commitment
ICCD’03 What This Work is All About • ROB complexity reduction is important for reducing power and improving performance • ROB dissipates a non-trivial fraction of the total chip power • ROB accesses stretch over several cycles • Goal of this work: Reduce the complexity and power dissipation of the ROB without sacrificing performance
ICCD’03 Comparison of ROB Bitcells (0.18µ, TSMC) Layout of a 32-ported SRAM bitcell Layout of a 16-ported SRAM bitcell Area Reduction – 71% Shorter bit and wordlines
ICCD’03 P6-style Superscalar Datapath Instruction dispatch Function Units Architectural Register File Instruction Issue IQ FU1 F1 F2 D1 D2 FU2 ROB ARF FUm Fetch Decode/Dispatch EX Result/status forwarding buses
ICCD’03 Reorder Buffer Distribution Instruction dispatch ROB Components (ROBCs) Function Units Architectural Register File Instruction Issue IQ ROBC 1 FU1 F1 F2 D1 D2 ROBC 2 FU2 ARF FUm Fetch Decode/Dispatch ROBC m EX ROB Result/status forwarding buses Holds pointers to entries within ROBCs
ICCD’03 Impact of Distributing the ROB • Each ROBC is effectively is a small Rename Buffer • Smaller read/write access energy • Faster access time • Distributing physical storage in this manner allows FUs to use shorter buses to write their respective ROBCs • Lower energy dissipation on the wires (We have NOT accounted for energy savings from using shorter wires) • Fits in naturally with a multi-clustered datapath design
ICCD’03 Problems with the earlier Multi-banked RF Schemes • Port conflicts result in performance penalty • Interconnection network is more complex
ICCD’03 Problems with the earlier Multi-banked RF Schemes and some good news! • Port conflicts result in performance penalty • Totally avoid write port conflicts • Minimize read port conflicts at commitment • Interconnection network is more complex
ICCD’03 Problems with the earlier Multi-banked RF Schemes and some good news! • Port conflicts result in performance penalty • Totally avoid write port conflicts • Minimize read port conflicts at commitment • Interconnection network is more complex • Completely remove source read ports
ICCD’03 Problems with the earlier Multi-banked RF Schemes and some good news! • Port conflicts result in performance penalty • Totally avoid write port conflicts • Minimize read port conflicts at commitment • Totally avoid source read port conflicts • Interconnection network is more complex • Completely remove source read ports
ICCD’03 ROBCs Assigned to Each Function Unit FU_id offset 1 1 1 1 ROBC #1 FU #1 2 2 m 1 3 3 2 1 1 4 FU #2 ROBC #2 2 3 4 FU #m ROBC #m n 1 Centralized ROB Distributed ROBCs
ICCD’03 Good News:Write port conflicts are avoided 1 write port FU_id offset 1 1 1 1 ROBC #1 FU #1 2 2 m 1 3 3 2 1 1 4 FU #2 ROBC #2 2 3 4 FU #m ROBC #m n 1 Centralized ROB Distributed ROBCs
ICCD’03 Round Robin Scheduling at Dispatch Time FU_id offset instruction Int ADDROBC #1 1 1 2 2 3 Int ADDROBC #2 1 4 2 5 Int ADDROBC #3 1 2 Int ADDROBC #4 1 n 2 Centralized ROB Distributed ROBCs
ICCD’03 Round Robin Scheduling at Dispatch Time ADD FU_id offset instruction Int ADDROBC #1 1 1 2 2 3 Int ADDROBC #2 1 4 2 5 Int ADDROBC #3 1 2 Int ADDROBC #4 1 n 2 Centralized ROB Distributed ROBCs
ICCD’03 Round Robin Scheduling at Dispatch Time ADD FU_id offset instruction Int ADDROBC #1 1 1 reserved 2 2 3 Int ADDROBC #2 1 4 2 5 Int ADDROBC #3 1 2 Int ADDROBC #4 1 n 2 Centralized ROB Distributed ROBCs
ICCD’03 Round Robin Scheduling at Dispatch Time ADD FU_id offset instruction Int ADDROBC #1 1 ADD 1 reserved 1 1 2 2 3 Int ADDROBC #2 1 4 2 5 Int ADDROBC #3 1 2 Int ADDROBC #4 1 n 2 Centralized ROB Distributed ROBCs
ICCD’03 Round Robin Scheduling at Dispatch Time FU_id offset instruction Int ADDROBC #1 1 ADD 1 reserved 1 1 SUB 2 2 3 Int ADDROBC #2 1 4 2 5 Int ADDROBC #3 1 2 Int ADDROBC #4 1 n 2 Centralized ROB Distributed ROBCs
ICCD’03 Round Robin Scheduling at Dispatch Time FU_id offset instruction Int ADDROBC #1 1 ADD 1 reserved 1 1 SUB 2 2 3 Int ADDROBC #2 1 reserved 4 2 5 Int ADDROBC #3 1 2 Int ADDROBC #4 1 n 2 Centralized ROB Distributed ROBCs
ICCD’03 Round Robin Scheduling at Dispatch Time FU_id offset instruction Int ADDROBC #1 1 ADD 1 reserved 1 1 SUB 2 2 SUB 2 1 3 Int ADDROBC #2 1 reserved 4 2 5 Int ADDROBC #3 1 2 Int ADDROBC #4 1 n 2 Centralized ROB Distributed ROBCs
ICCD’03 Round Robin Scheduling at Dispatch Time FU_id offset instruction Int ADDROBC #1 1 ADD 1 reserved 1 1 2 2 SUB 2 1 AND 3 Int ADDROBC #2 1 reserved 4 2 5 Int ADDROBC #3 1 2 Int ADDROBC #4 1 n 2 Centralized ROB Distributed ROBCs
ICCD’03 Round Robin Scheduling at Dispatch Time FU_id offset instruction Int ADDROBC #1 1 ADD 1 reserved 1 1 2 2 SUB 2 1 AND 3 Int ADDROBC #2 1 reserved 4 2 5 Int ADDROBC #3 1 reserved 2 Int ADDROBC #4 1 n 2 Centralized ROB Distributed ROBCs
ICCD’03 Round Robin Scheduling at Dispatch Time FU_id offset instruction Int ADDROBC #1 1 ADD 1 reserved 1 1 2 2 SUB 2 1 AND 3 AND 3 1 Int ADDROBC #2 1 reserved 4 2 5 Int ADDROBC #3 1 reserved 2 Int ADDROBC #4 1 n 2 Centralized ROB Distributed ROBCs
ICCD’03 Good News:Avoiding Read Port Conflicts 1 read port FU_id offset instruction 1 ADD 1 reserved 1 1 2 2 SUB 2 1 3 AND 3 1 1 reserved 4 2 To commitment 5 1 reserved 2 1 n 2 Centralized ROB Distributed ROBCs
ICCD’03 Round Robin Scheduling at Dispatch Time FU_id offset instruction IntMUL/DIVROBC #5 1 ADD 1 1 2 SUB 2 1 3 AND 3 1 MUL 1 4 2 5 n Centralized ROB Distributed ROBCs
ICCD’03 Round Robin Scheduling at Dispatch Time FU_id offset instruction IntMUL/DIVROBC #5 1 ADD 1 1 2 SUB 2 1 3 AND 3 1 MUL 1 reserved 4 2 5 n Centralized ROB Distributed ROBCs
ICCD’03 Round Robin Scheduling at Dispatch Time FU_id offset instruction IntMUL/DIVROBC #5 1 ADD 1 1 2 SUB 2 1 3 AND 3 1 MUL 1 reserved 4 MUL 5 1 2 5 n Centralized ROB Distributed ROBCs
ICCD’03 Round Robin Scheduling at Dispatch Time FU_id offset instruction IntMUL/DIVROBC #5 1 ADD 1 1 2 SUB 2 1 3 AND 3 1 1 reserved 4 MUL 5 1 2 DIV 5 n Centralized ROB Distributed ROBCs
ICCD’03 Round Robin Scheduling at Dispatch Time FU_id offset instruction IntMUL/DIVROBC #5 1 ADD 1 1 2 SUB 2 1 3 AND 3 1 1 reserved 4 MUL reserved 5 1 2 DIV 5 n Centralized ROB Distributed ROBCs
ICCD’03 Round Robin Scheduling at Dispatch Time FU_id offset instruction IntMUL/DIVROBC #5 1 ADD 1 1 2 SUB 2 1 3 AND 3 1 1 reserved 4 MUL reserved 5 1 2 DIV DIV 5 5 2 n Centralized ROB Distributed ROBCs
ICCD’03 Read Port Conflicts at Commitment FU_id offset instruction IntMUL/DIVROBC #5 1 ADD 1 1 2 SUB 2 1 1 read port 3 AND 3 1 1 reserved To commitment 4 MUL reserved 5 1 2 DIV DIV 5 5 2 CONFLICT: If MUL and DIV wants to commit in the same cycle n Centralized ROB Distributed ROBCs
ICCD’03 Distributed ROB Design 1 Writeback 1 write port to write results ROBC
ICCD’03 Distributed ROB Design 1 Writeback 1 write port to write results ROBC Commit 1 read port for instruction commitment
ICCD’03 Distributed ROB Design 1: with source read ports Writeback 1 write port to write results ROBC Dispatch/Issue1 read port to read the source operands Commit 1 read port for instruction commitment
ICCD’03 Experimental Setup: the AccuPower (DATE’02) Compiled SPEC benchmarks Performance stats Microarchitectural Simulator (Rooted in SimpleScalar) Datapath specs Transition counts, Context information Energy/Power Estimator VLSI layout data Power/energy stats SPICE SPICE deck SPICE measures of energy per transition
ICCD’03 Configuration of the Simulated System Machine width 4-way Issue Queue 32 entries Reorder Buffer 96 entries 32 entries Load/Store Queue Simulated the execution of SPEC2000 benchmarks
ICCD’03 Peak/Average demands on the number of ROBC entries peak avg. peak avg. peak avg. peak avg. peak avg.
ICCD’03 Peak/Average demands on the number of ROBC entries peak avg. peak avg. peak avg. peak avg. peak avg. Number of entriesassigned to eachROBC 8 8 8 8 4 4 4 4 4 4 16
ICCD’03 Peak/Average demands on the number of ROBC entries peak avg. peak avg. peak avg. peak avg. peak avg. Number of entriesassigned to eachROBC 72entry 8 + 8 + 8 + 8 + 4 + 4 + 4 + 4 + 4 + 4 + 16 = 8_4_4_4_16 configuration
ICCD’03 Percentage of cycles when dispatch blocks for 8_4_4_4_16 Average IPC drop% with 8_4_4_4_16 configuration = 4.8%
ICCD’03 Percentage of cycles when dispatch blocks for 8_4_4_4_16 Number of entriesassigned to eachROBC 72entry 8 + 8 + 8 + 8 + 4 + 4 + 4 + 4 + 4 + 4 + 16 =
ICCD’03 Reducing performance penalty: 12_6_4_6_20 Configuration Number of entriesassigned to eachROBC 96entry 12 + 12 + 12 + 12 + 6 + 4 + 4 + 4 + 4 + 6 + 20 = 12_6_4_6_20 configuration
ICCD’03 Performance Results for 12_6_4_6_20 Configuration gap gcc gzip parser perl twolf vortex vpr Int Avg. IPC applu art mesa mgrid swim wupwise FP Avg. Average IPC drop% with 12_6_4_6_20 configuration = 2.4%
ICCD’03 Distributed ROB Design 1: with source read ports Writeback 1 write port to write results ROBC Dispatch/Issue1 read port to read the source operands Commit 1 read port for instruction commitment
ICCD’03 Eliminating All Source Read Ports Writeback 1 write port to write results ROBC Dispatch/Issue1 read port to read the source operands Commit 1 read port for instruction commitment
ICCD’03 Eliminating All Source Read Ports Writeback 1 write port to write results ROBC Commit 1 read port for instruction commitment
ICCD’03 Where are the Source Values Coming From? Function Units Architectural Register File Instruction Issue 1 2 IQ FU1 F1 F2 D1 D2 FU2 ROB ARF FUm Fetch Decode/Dispatch EX 3 Instruction dispatch Result/status forwarding buses