
Distributed Reorder Buffer Schemes for Low Power *

Gurhan Kucuk, Oguz Ergin, Dmitry Ponomarev, Kanad Ghose. Department of Computer Science, State University of New York, Binghamton, NY 13902-6000. http://www.cs.binghamton.edu/~lowpower


Presentation Transcript


  1. ICCD’03 Distributed Reorder Buffer Schemes for Low Power * Gurhan Kucuk, Oguz Ergin, Dmitry Ponomarev, Kanad Ghose Department of Computer Science State University of New York Binghamton, NY 13902-6000 http://www.cs.binghamton.edu/~lowpower 21st International Conference on Computer Design (ICCD’03), October 14th 2003 *supported in part by DARPA through the PAC-C program and NSF

  2. Outline • Reorder Buffer (ROB) complexities • Motivation for the low-complexity ROB • Low-complexity ROB designs • Fully Distributed ROB • Retention Latches (RLs) revisited (ICS’02) • Combined Scheme • Results • Concluding remarks

  3. P6-style Superscalar Datapath [Figure: datapath with Fetch (F1, F2), Decode/Dispatch (D1, D2), issue queue (IQ), function units FU1 to FUm (EX), ROB, and Architectural Register File (ARF), linked by the instruction dispatch path and the result/status forwarding buses]

  4. PPC 620-style Superscalar Datapath [Figure: datapath with Fetch (F1, F2), Decode/Dispatch (D1, D2), issue queue (IQ), function units FU1 to FUm (EX), a block labeled RB, ROB, and Architectural Register File (ARF), linked by the instruction dispatch path and the result/status forwarding buses]

  5. ROB Port Requirements for a W-way CPU • Decode/Dispatch: W write ports to set up entries • Writeback: W write ports to write results • Dispatch/Issue: 2W read ports to read the source operands • Commit: W read ports for instruction commitment
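The four port groups on this slide add up to 5W ports on a centralized ROB. A minimal sketch of that arithmetic follows; the helper name `centralized_rob_ports` and the dictionary keys are illustrative and do not come from the presentation.

```python
def centralized_rob_ports(width):
    """Port counts for a centralized ROB in a W-way machine, as listed on the slide."""
    ports = {
        "dispatch_write": width,    # set up entries at decode/dispatch
        "writeback_write": width,   # write results at writeback
        "source_read": 2 * width,   # read source operands at dispatch/issue
        "commit_read": width,       # read results at commitment
    }
    ports["total"] = sum(ports.values())  # W + W + 2W + W = 5W
    return ports


print(centralized_rob_ports(4)["total"])  # 20
```

For the 4-way machine simulated later in the talk, this gives 20 ports on a single structure.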

  6. What This Work is All About • ROB complexity reduction is important for reducing power and improving performance • ROB dissipates a non-trivial fraction of the total chip power • ROB accesses stretch over several cycles • Goal of this work: Reduce the complexity and power dissipation of the ROB without sacrificing performance

  7. Comparison of ROB Bitcells (0.18µ, TSMC) [Figure: layout of a 32-ported SRAM bitcell next to the layout of a 16-ported SRAM bitcell] • Area reduction: 71% • Shorter bitlines and wordlines

  8. P6-style Superscalar Datapath [Figure: datapath with Fetch (F1, F2), Decode/Dispatch (D1, D2), issue queue (IQ), function units FU1 to FUm (EX), ROB, and Architectural Register File (ARF), linked by the instruction dispatch path and the result/status forwarding buses]

  9. Reorder Buffer Distribution [Figure: the same datapath with the ROB storage split into ROB Components (ROBCs), ROBC 1 to ROBC m, one next to each function unit FU1 to FUm] • ROB: holds pointers to entries within ROBCs

  10. Impact of Distributing the ROB • Each ROBC is effectively a small Rename Buffer • Smaller read/write access energy • Faster access time • Distributing physical storage in this manner allows FUs to use shorter buses to write their respective ROBCs • Lower energy dissipation on the wires (we have NOT accounted for energy savings from using shorter wires) • Fits in naturally with a multi-clustered datapath design

  11. Problems with the earlier Multi-banked RF Schemes • Port conflicts result in performance penalty • Interconnection network is more complex

  12. Problems with the earlier Multi-banked RF Schemes and some good news! • Port conflicts result in performance penalty (good news: write port conflicts are totally avoided; read port conflicts at commitment are minimized) • Interconnection network is more complex

  13. Problems with the earlier Multi-banked RF Schemes and some good news! • Port conflicts result in performance penalty (good news: write port conflicts are totally avoided; read port conflicts at commitment are minimized) • Interconnection network is more complex (good news: source read ports are completely removed)

  14. Problems with the earlier Multi-banked RF Schemes and some good news! • Port conflicts result in performance penalty (good news: write port conflicts are totally avoided; read port conflicts at commitment are minimized; source read port conflicts are totally avoided) • Interconnection network is more complex (good news: source read ports are completely removed)

  15. ROBCs Assigned to Each Function Unit [Figure: a centralized ROB whose entries 1 to n each hold an FU_id and an offset, next to distributed ROBCs #1 to #m, one per function unit FU #1 to FU #m]

  16. Good News: Write port conflicts are avoided [Figure: the centralized ROB (FU_id and offset per entry) next to distributed ROBCs #1 to #m; each function unit FU #1 to FU #m writes its own ROBC through 1 write port]

  17. Round Robin Scheduling at Dispatch Time [Figure: a centralized ROB with entries 1 to n, each holding FU_id, offset, and instruction fields, next to four distributed ROBCs, Int ADD ROBC #1 to #4; all entries empty]

  18. Round Robin Scheduling at Dispatch Time [Animation step: an ADD instruction arrives at dispatch]

  19. Round Robin Scheduling at Dispatch Time [Animation step: entry 1 of Int ADD ROBC #1 is marked reserved]

  20. Round Robin Scheduling at Dispatch Time [Animation step: ROB entry 1 holds ADD with FU_id = 1, offset = 1]

  21. Round Robin Scheduling at Dispatch Time [Animation step: a SUB instruction arrives at dispatch]

  22. Round Robin Scheduling at Dispatch Time [Animation step: entry 1 of Int ADD ROBC #2 is marked reserved]

  23. Round Robin Scheduling at Dispatch Time [Animation step: ROB entry 2 holds SUB with FU_id = 2, offset = 1]

  24. Round Robin Scheduling at Dispatch Time [Animation step: an AND instruction arrives at dispatch]

  25. Round Robin Scheduling at Dispatch Time [Animation step: entry 1 of Int ADD ROBC #3 is marked reserved]

  26. Round Robin Scheduling at Dispatch Time [Animation step: ROB entry 3 holds AND with FU_id = 3, offset = 1]
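The dispatch-time walk-through in slides 17 to 26 can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the class names `ROBC` and `Dispatcher`, the capacity of 2 entries per ROBC, and the policy of skipping a full ROBC are all chosen here for illustration.

```python
from dataclasses import dataclass


@dataclass
class ROBC:
    fu_id: int       # function unit this component belongs to
    capacity: int    # number of entries (assumed value in the demo below)
    tail: int = 0    # next slot to reserve, 0-based internally
    used: int = 0    # entries currently reserved

    def reserve(self):
        """Reserve the next slot; return its 1-based offset, or None if full."""
        if self.used == self.capacity:
            return None
        offset = self.tail + 1
        self.tail = (self.tail + 1) % self.capacity
        self.used += 1
        return offset


class Dispatcher:
    def __init__(self, robcs):
        self.robcs = robcs     # ROBCs of one FU type, e.g. integer ADD
        self.next_robc = 0     # round-robin pointer
        self.rob = []          # centralized ROB: (instruction, FU_id, offset)

    def dispatch(self, instr):
        """Try each ROBC once, starting at the round-robin pointer."""
        n = len(self.robcs)
        for step in range(n):
            idx = (self.next_robc + step) % n
            offset = self.robcs[idx].reserve()
            if offset is not None:
                self.next_robc = (idx + 1) % n
                self.rob.append((instr, self.robcs[idx].fu_id, offset))
                return True
        return False           # every ROBC is full: dispatch blocks


int_add = Dispatcher([ROBC(fu_id=i, capacity=2) for i in range(1, 5)])
for instr in ("ADD", "SUB", "AND"):
    int_add.dispatch(instr)
print(int_add.rob)  # [('ADD', 1, 1), ('SUB', 2, 1), ('AND', 3, 1)]
```

The printed ROB contents match the FU_id and offset pairs shown in the animation: consecutive integer instructions land in different ROBCs.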

  27. Good News: Avoiding Read Port Conflicts [Figure: ROB entries 1 to 3 hold ADD (FU_id = 1, offset = 1), SUB (FU_id = 2, offset = 1), and AND (FU_id = 3, offset = 1); each result sits in a different ROBC and reaches commitment through that ROBC's 1 read port]

  28. Round Robin Scheduling at Dispatch Time [Figure: the centralized ROB already holds ADD (1, 1), SUB (2, 1), and AND (3, 1); a MUL instruction arrives at dispatch; the only matching component is Int MUL/DIV ROBC #5]

  29. Round Robin Scheduling at Dispatch Time [Animation step: entry 1 of Int MUL/DIV ROBC #5 is marked reserved]

  30. Round Robin Scheduling at Dispatch Time [Animation step: ROB entry 4 holds MUL with FU_id = 5, offset = 1]

  31. Round Robin Scheduling at Dispatch Time [Animation step: a DIV instruction arrives at dispatch]

  32. Round Robin Scheduling at Dispatch Time [Animation step: entry 2 of Int MUL/DIV ROBC #5 is marked reserved]

  33. Round Robin Scheduling at Dispatch Time [Animation step: ROB entry 5 holds DIV with FU_id = 5, offset = 2]

  34. Read Port Conflicts at Commitment [Figure: MUL (FU_id = 5, offset = 1) and DIV (FU_id = 5, offset = 2) both sit in Int MUL/DIV ROBC #5, which has 1 read port to commitment] • CONFLICT: if MUL and DIV want to commit in the same cycle
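The commit-time conflict on this slide can be illustrated with a short check. This is a sketch under assumptions made here, not taken from the presentation: every instruction at the ROB head is assumed to have finished executing, commitment is assumed to be in program order so it stops at the first conflict, and the function name `committable_this_cycle` is hypothetical.

```python
def committable_this_cycle(rob_head, width, read_ports_per_robc=1):
    """Return the instructions that can commit together, in program order.

    rob_head is a list of (instruction, FU_id, offset) tuples, oldest first.
    """
    reads = {}        # FU_id -> ROBC read ports already used this cycle
    committed = []
    for instr, fu_id, _offset in rob_head[:width]:
        if reads.get(fu_id, 0) == read_ports_per_robc:
            break     # the ROBC's read port is taken: younger entries wait
        reads[fu_id] = reads.get(fu_id, 0) + 1
        committed.append(instr)
    return committed


# ADD, SUB, and AND live in different ROBCs, so all three commit together.
print(committable_this_cycle([("ADD", 1, 1), ("SUB", 2, 1), ("AND", 3, 1)], 4))
# ['ADD', 'SUB', 'AND']

# MUL and DIV share ROBC #5 and its single read port, so DIV waits a cycle.
print(committable_this_cycle([("MUL", 5, 1), ("DIV", 5, 2)], 4))
# ['MUL']
```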

  35. Distributed ROB Design 1 • ROBC ports: • Writeback: 1 write port to write results

  36. Distributed ROB Design 1 • ROBC ports: • Writeback: 1 write port to write results • Commit: 1 read port for instruction commitment

  37. Distributed ROB Design 1: with source read ports • ROBC ports: • Writeback: 1 write port to write results • Dispatch/Issue: 1 read port to read the source operands • Commit: 1 read port for instruction commitment

  38. Experimental Setup: the AccuPower (DATE’02) [Figure: tool flow. Inputs: compiled SPEC benchmarks, datapath specs, VLSI layout data. Components: microarchitectural simulator (rooted in SimpleScalar), SPICE (fed by a SPICE deck), energy/power estimator. Intermediate data: transition counts and context information, SPICE measures of energy per transition. Outputs: performance stats, power/energy stats]
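The slide pairs transition counts from the simulator with SPICE-measured energy per transition. One plausible reading is that the estimator multiplies the two and sums the products; the sketch below shows only that arithmetic. The function name, the event names, and every number are placeholders invented for illustration, not AccuPower's interface or measured data, and the "context information" mentioned on the slide is not modeled.

```python
def total_energy_pj(transition_counts, energy_per_transition_pj):
    """Sum count * energy over every event type seen by the simulator."""
    return sum(
        count * energy_per_transition_pj[event]
        for event, count in transition_counts.items()
    )


counts = {"robc_read": 100, "robc_write": 50}       # placeholder counts
energy_pj = {"robc_read": 2.0, "robc_write": 3.0}   # placeholder energies (pJ)
print(total_energy_pj(counts, energy_pj))  # 350.0
```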

  39. Configuration of the Simulated System • Machine width: 4-way • Issue Queue: 32 entries • Reorder Buffer: 96 entries • Load/Store Queue: 32 entries • Simulated the execution of SPEC2000 benchmarks

  40. Peak/Average demands on the number of ROBC entries [Chart: five pairs of peak and avg. columns; data values not recoverable]

  41. Peak/Average demands on the number of ROBC entries [Chart: five pairs of peak and avg. columns; data values not recoverable] • Number of entries assigned to each ROBC: 8, 8, 8, 8, 4, 4, 4, 4, 4, 4, 16

  42. Peak/Average demands on the number of ROBC entries [Chart: five pairs of peak and avg. columns; data values not recoverable] • Number of entries assigned to each ROBC: 8 + 8 + 8 + 8 + 4 + 4 + 4 + 4 + 4 + 4 + 16 = 72 entries (8_4_4_4_16 configuration)

  43. Percentage of cycles when dispatch blocks for 8_4_4_4_16 [Chart: data values not recoverable] • Average IPC drop with the 8_4_4_4_16 configuration: 4.8%

  44. Percentage of cycles when dispatch blocks for 8_4_4_4_16 [Chart: data values not recoverable] • Number of entries assigned to each ROBC: 8 + 8 + 8 + 8 + 4 + 4 + 4 + 4 + 4 + 4 + 16 = 72 entries

  45. Reducing performance penalty: 12_6_4_6_20 Configuration • Number of entries assigned to each ROBC: 12 + 12 + 12 + 12 + 6 + 4 + 4 + 4 + 4 + 6 + 20 = 96 entries (12_6_4_6_20 configuration)
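The two configurations can be checked by summing the per-ROBC entry counts listed on the slides. The snippet below is a small sanity check; the name `CONFIGS` is illustrative, and the lists are copied in slide order without assigning them to specific function unit types.

```python
# Per-ROBC entry counts, in the order they appear on the slides.
CONFIGS = {
    "8_4_4_4_16": [8, 8, 8, 8, 4, 4, 4, 4, 4, 4, 16],
    "12_6_4_6_20": [12, 12, 12, 12, 6, 4, 4, 4, 4, 6, 20],
}

totals = {name: sum(sizes) for name, sizes in CONFIGS.items()}
print(totals)  # {'8_4_4_4_16': 72, '12_6_4_6_20': 96}
```

The larger configuration totals 96 entries, the same as the Reorder Buffer size in the simulated system configuration on slide 39.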

  46. Performance Results for 12_6_4_6_20 Configuration [Chart: IPC per benchmark. Integer: gap, gcc, gzip, parser, perl, twolf, vortex, vpr, Int Avg. Floating point: applu, art, mesa, mgrid, swim, wupwise, FP Avg. Data values not recoverable] • Average IPC drop with the 12_6_4_6_20 configuration: 2.4%

  47. Distributed ROB Design 1: with source read ports • ROBC ports: • Writeback: 1 write port to write results • Dispatch/Issue: 1 read port to read the source operands • Commit: 1 read port for instruction commitment

  48. Eliminating All Source Read Ports • ROBC ports: • Writeback: 1 write port to write results • Dispatch/Issue: 1 read port to read the source operands • Commit: 1 read port for instruction commitment

  49. Eliminating All Source Read Ports • ROBC ports: • Writeback: 1 write port to write results • Commit: 1 read port for instruction commitment

  50. Where are the Source Values Coming From? [Figure: datapath with Fetch (F1, F2), Decode/Dispatch (D1, D2), issue queue (IQ), function units FU1 to FUm (EX), ROB, and Architectural Register File (ARF), linked by the instruction dispatch path and the result/status forwarding buses; three points on the datapath are marked 1, 2, and 3]
