
*supported in part by DARPA through the PAC-C program and NSF

Reducing Power Requirements of Instruction Scheduling Through Dynamic Allocation of Multiple Datapath Resources*. Dmitry Ponomarev, Gurhan Kucuk, Kanad Ghose Department of Computer Science State University of New York Binghamton, NY 13902-6000 http://www.cs.binghamton.edu/~lowpower.


Presentation Transcript


  1. MICRO’01 Reducing Power Requirements of Instruction Scheduling Through Dynamic Allocation of Multiple Datapath Resources* Dmitry Ponomarev, Gurhan Kucuk, Kanad Ghose Department of Computer Science State University of New York Binghamton, NY 13902-6000 http://www.cs.binghamton.edu/~lowpower 34th International Symposium on Microarchitecture (MICRO-34), December 3rd, 2001 *supported in part by DARPA through the PAC-C program and NSF

  2. MICRO’01 Presentation Outline • Motivation • Resource usage in superscalar datapaths • Resource allocation strategy • Performance results • Concluding remarks

  3. MICRO’01 Motivation • High-end superscalar CPUs employ a substantial amount of datapath resources • Consequences: • High overall power dissipation • Areal energy/power density is at a dangerous level • Thus: • Energy dissipation should preferably be controlled through technology-independent techniques

  4. MICRO’01 What This Work is All About • Power-hungry resources are allocated on a “one-size-fits-all” basis • Unnecessary dissipation from overcommitted resources • Examples of resources: Issue Queue (IQ), Reorder Buffer (ROB), Load/Store Queue (LSQ), caches, function units • Resources considered in this work: IQ, ROB, LSQ • Main idea: • Control resource allocation/deallocation dynamically to track the demands of the application • Goals: • Must limit any impact on performance • Must allow for easy retrofit into existing datapaths • Must have a stable and low-overhead control strategy

  5. MICRO’01 Dynamic Resizing of IQ, ROB and LSQ [Figure: superscalar datapath block diagram. Pipeline stages: Fetch (F1, F2), Decode/Dispatch (Dec/RN1, RN2/Dis), Instruction Issue, EX. Blocks: IQ, Function Units (FU1, FU2, ..., FUm), LSQ, ROB, Architectural Register File (ARF). Connections: instruction dispatch, result/status forwarding buses. Legend: resized resource.]

  6. MICRO’01 Main Issues • How do we measure/estimate resource needs? • Continuous measurement vs. periodic sampling • What is the control strategy? • Centralized vs. distributed • How is the performance impact limited? • Periodic upsizing vs. asynchronous upsizing • What are the relevant circuit techniques? • Overall redesign vs. simple changes

  7. MICRO’01 Resource Usage in Superscalar Datapath: Example (fpppp)

  8. MICRO’01 Resource Usage in Superscalar Datapath: Example (apsi)

  9. MICRO’01 Incremental Resource Allocation/Deallocation • The ROB, IQ and LSQ are each implemented as a set of independent partitions • Each partition is a register file, complete with its own sensing and precharge/write logic, multiple ports and through busses • All partitions have associative addressing logic

  10. MICRO’01 Partitioned Organization [Figure: three stacked partitions (Partition 1, Partition 2, Partition 3). Each partition is drawn with a precharger array, an associative part, a non-associative part, input/output drivers, and a bypass switch array. Labels: bitlines or forwarding lines within a partition, bypass switch, through line, bitlines, forwarding lines.]

  11. MICRO’01 Incremental Resource Allocation/Deallocation • Allocations are increased by adding a free partition • Deallocations are performed by powering down a partition after its contents have been used up • Easy to do for the IQ • A little more challenging for the ROB and the LSQ because of the FIFO nature.

  12. MICRO’01 Sampling and Downsizing Strategies • Downsizing decisions are taken at the end of each update period • Update periods have a fixed duration of UP cycles • Within an update period, multiple samples of the occupancies are taken at regular intervals of SP cycles [Figure: timeline showing SP-cycle sample intervals within one update period (UP).]
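The end-of-period downsizing decision can be sketched as follows. The slides state only that occupancy samples are taken every SP cycles and that the decision is made at the end of the update period; the specific rule used here (release one partition when the gap between the allocation and the average sampled occupancy is at least one partition) and the names `downsize_decision`, `samples`, `allocated_entries`, and `partition_size` are assumptions made for illustration.

```python
from statistics import mean


def downsize_decision(samples, allocated_entries, partition_size):
    """Return the allocation (in entries) to use for the next update period.

    Assumed rule, not taken from the slides: release a single partition
    when the slack between the current allocation and the average sampled
    occupancy is at least one partition, and never go below one partition.
    """
    average_occupancy = mean(samples)
    slack = allocated_entries - average_occupancy
    if slack >= partition_size and allocated_entries > partition_size:
        return allocated_entries - partition_size
    return allocated_entries


# Four samples per update period, as in the SP=4, UP=16 example.
samples = [8, 10, 6, 8]
print(downsize_decision(samples, allocated_entries=32, partition_size=8))
```

Releasing at most one partition per update period keeps downsizing conservative, which matches the slides' remark that upsizing is the more aggressive of the two directions.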

  13-17. MICRO’01 A Resizing Example (SP=4, UP=16) [Figure, animated over five slides: two stacked plots, actual occupancy and allocated entries (y-axis 0 to 32 entries), against cycles 0 to 34, with two update periods marked as SP SP SP SP / UP. The last frame (slide 17) labels samples 1 to 4 and their average (Avg.).]

  18. MICRO’01 Upsizing Strategy • Count the number of cycles when dispatch blocks because the resource is full • If the counter exceeds OT (Overflow Threshold), add one partition • Upsizing is more aggressive than downsizing, which reduces the hit on performance • Reset the overflow counter to 0 at the beginning of a new UP (Update Period)
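A minimal sketch of the overflow counter described on this slide, stepped one cycle at a time. The class `UpsizingMonitor` and its fields are hypothetical names. Two details are assumptions rather than statements from the slides: the comparison is strict because the slide says "exceeds", and the counter is also cleared right after a partition is added.

```python
class UpsizingMonitor:
    """Counts cycles in which dispatch blocks because the resource is full."""

    def __init__(self, overflow_threshold, update_period,
                 partition_size, max_entries, allocated_entries):
        self.overflow_threshold = overflow_threshold  # OT
        self.update_period = update_period            # UP, in cycles
        self.partition_size = partition_size
        self.max_entries = max_entries
        self.allocated_entries = allocated_entries
        self.overflow_count = 0
        self.cycle_in_period = 0

    def tick(self, dispatch_blocked_on_full):
        """Advance one cycle; return True if a partition was added."""
        upsized = False
        if dispatch_blocked_on_full:
            self.overflow_count += 1
            if (self.overflow_count > self.overflow_threshold
                    and self.allocated_entries < self.max_entries):
                # Upsizing is asynchronous: it does not wait for the
                # end of the update period.
                self.allocated_entries += self.partition_size
                self.overflow_count = 0  # assumption: clear after upsizing
                upsized = True
        self.cycle_in_period += 1
        if self.cycle_in_period == self.update_period:
            # A new update period begins: reset the overflow counter.
            self.cycle_in_period = 0
            self.overflow_count = 0
        return upsized


monitor = UpsizingMonitor(overflow_threshold=4, update_period=16,
                          partition_size=8, max_entries=32,
                          allocated_entries=16)
events = [monitor.tick(True) for _ in range(5)]
print(events, monitor.allocated_entries)
```

Because the counter is reset every update period, a larger OT or a shorter UP makes upsizing rarer, which is the power/performance knob the later slides vary.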

  19. MICRO’01 A Resizing Example (SP=4, UP=16) [Figure: actual occupancy and allocated entries (y-axis 0 to 32 entries) against cycles 0 to 34, with samples 1 to 4 and their average (Avg.) labeled; same frame as slide 17.]

  20-30. MICRO’01 A Resizing Example (SP=4, UP=16, OT=4) [Figure, animated over eleven slides: the same occupancy and allocation plots (y-axis 0 to 32 entries) against cycles 0 to 34. Slides 22 to 26 add the counts 1, 2, 3, 4 one at a time; slides 26 and 27 are labeled OT = 4. From slide 27 on, the time-axis labels change from SP SP SP SP / UP SP SP SP SP / UP to SP SP SP SP / UP SP SP SP.]

  31. MICRO’01 Summary of the Control Strategy • Only three parameters used for control: • OT (Overflow Threshold) • UP (Update Period) • SP (Sample Period) • Less than 1% power overhead for control logic • Advantages: • Can easily achieve a desired power/performance tradeoff by adjusting OT and UP • Monitoring on a cycle-by-cycle basis is avoided – done once every SP cycles
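As a quick check of how much sampling the control strategy requires, assume one occupancy sample is taken every SP cycles and that UP is a multiple of SP. The symbol $N_{\text{samples}}$ is introduced here for illustration and does not appear in the slides; the parameter values are those of the resizing example and of slide 39.

```latex
\[
N_{\text{samples}} = \frac{UP}{SP}, \qquad
\frac{16}{4} = 4 \ \text{(SP=4, UP=16)}, \qquad
\frac{2048}{32} = 64 \ \text{(SP=32, UP=2048)}
\]
```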

  32. MICRO’01 General Considerations for Deallocations • All information within the partition to be deallocated must be consumed • For the IQ, instructions from the partition must be issued • For the ROB, entries within the partition must be committed • For the LSQ, entries within the partition must start the D-cache access • No new instruction should be dispatched to this partition • This can cause dispatch to block for a longer duration in the case of the ROB because of its circular nature
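The two deallocation conditions above (stop dispatching into the partition, then wait until its contents are consumed) can be sketched as a small state machine. The names `Partition`, `State`, `begin_deallocation`, and `consume_entry` are hypothetical, and the sketch models a single partition only; it does not model the circular head/tail behaviour that makes the ROB and LSQ cases harder.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    ACTIVE = "active"
    DRAINING = "draining"  # marked for deallocation, still holds entries
    OFF = "off"            # powered down


@dataclass
class Partition:
    capacity: int
    occupied: int = 0
    state: State = State.ACTIVE

    def can_accept_dispatch(self) -> bool:
        # No new instruction may be dispatched to a draining partition.
        return self.state is State.ACTIVE and self.occupied < self.capacity

    def dispatch(self) -> None:
        if not self.can_accept_dispatch():
            raise RuntimeError("partition cannot accept new entries")
        self.occupied += 1

    def begin_deallocation(self) -> None:
        if self.state is State.ACTIVE:
            self.state = State.DRAINING
            self._power_down_if_empty()

    def consume_entry(self) -> None:
        # Issue (IQ), commit (ROB), or start of D-cache access (LSQ).
        if self.occupied == 0:
            raise RuntimeError("no entry to consume")
        self.occupied -= 1
        self._power_down_if_empty()

    def _power_down_if_empty(self) -> None:
        if self.state is State.DRAINING and self.occupied == 0:
            self.state = State.OFF


p = Partition(capacity=8)
p.dispatch()
p.dispatch()
p.begin_deallocation()
p.consume_entry()
p.consume_entry()
print(p.state)
```

While a partition is draining it still dissipates power but accepts no new work, so the time it spends in that state is pure overhead.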

  33. MICRO’01 Experimental Setup: the Accupower Toolkit [Figure: block diagram of the toolkit. Blocks: Microarchitectural Simulator, Energy/Power Estimator, SPICE. Labeled inputs, outputs, and connections: compiled SPEC benchmarks, datapath specs, performance stats, transition counts and context information, VLSI layout data, SPICE deck, SPICE measures of energy per transition, power/energy stats.]

  34. MICRO’01 Configuration of the Simulated System • Machine width: 4-way • Issue Queue: 32 entries with 4 partitions • Reorder Buffer: 96 entries with 6 partitions • Load/Store Queue: 32 entries with 4 partitions • Simulated the execution of SPEC2000 benchmarks

  35. MICRO’01 Experimental Results: Effect on Performance [Chart, y-axis: IPC] • OT=128: 0.9% IPC drop • OT=512: 4.9% IPC drop • OT=2048: 19.3% IPC drop

  36. MICRO’01 Experimental Results: Average Active Size (IQ) [Chart] • OT=128: 14% savings • OT=512: 27% savings • OT=2048: 51% savings

  37. MICRO’01 Experimental Results: Average Active Size (ROB) [Chart] • OT=128: 19% savings • OT=512: 34% savings • OT=2048: 58% savings

  38. MICRO’01 Experimental Results: Average Active Size (LSQ) [Chart] • OT=128: 7% savings • OT=512: 20% savings • OT=2048: 47% savings

  39. MICRO’01 Experimental Results (OT=512, UP=2048, SP=32)

  40. MICRO’01 Experimental Results: Power Reduction [Chart, y-axis: mW] • OT=128: 40% power savings, 0.9% IPC drop • OT=512: 48% power savings, 4.9% IPC drop • OT=2048: 65% power savings, 19.3% IPC drop

  41. MICRO’01 Other Matters • Dispatch rate modulation on top of resizing does not cause substantial additional power savings and results in higher IPC drops (WCED’01) • Note that this work also addresses leakage dissipations! • We are in the process of extending this work to add caches, FUs, TLBs, …, and dynamic threshold variation • Work in progress on the use of resizing hooks that are exposed to the compiler

  42. MICRO’01 Related Work • Adaptive Issue Queue (Buyuktosunoglu et al., PACS’00): • Multi-partitioned issue queue • Number of partitions dynamically allocated based on the number of ready flags set in entries within the active partition • IPC drop triggers growth • Resizable Issue Queue (Folegnani and Gonzalez, ISCA’01): • FIFO issue queue, multi-partitioned • Number of instructions committed from the “youngest” partition is used for downsizing • Pipeline Balancing (Bahar and Manne, ISCA’01): • For multi-clustered datapath organizations • Dynamic resizing of Issue Queue & Dynamic Cluster Activation • IPC monitored to allow clusters/issue queue partitions to be turned off with minimal impact on performance • Others (IPC monitoring & resource control by OS, dynamic profiling)

  43. MICRO’01 Concluding Remarks • Significant power savings with minimal impact on performance are achieved by dynamically resizing multiple datapath resources. 48% power savings with only a 4.9% IPC drop • Simple control strategy is used that avoids resource monitoring on a cycle-by-cycle basis • Basic techniques are orthogonal to other power reduction strategies like selective bit-slice activation, frequency and voltage scaling and additional circuit techniques
