MICRO’01
Reducing Power Requirements of Instruction Scheduling Through Dynamic Allocation of Multiple Datapath Resources*
Dmitry Ponomarev, Gurhan Kucuk, Kanad Ghose
Department of Computer Science, State University of New York, Binghamton, NY 13902-6000
http://www.cs.binghamton.edu/~lowpower
34th International Symposium on Microarchitecture (MICRO-34), December 3rd, 2001
*Supported in part by DARPA through the PAC-C program and NSF
Presentation Outline
• Motivation
• Resource usage in superscalar datapaths
• Resource allocation strategy
• Performance results
• Concluding remarks
Motivation
• High-end superscalar CPUs employ a substantial amount of datapath resources
• Consequences:
  - High overall power dissipation
  - Areal energy/power density is at a dangerous level
• Thus:
  - Energy dissipation should preferably be controlled through technology-independent techniques
What This Work is All About
• Power-hungry resources are allocated on a “one-size-fits-all” basis
  - Unnecessary dissipation from overcommitted resources
  - Examples of resources: Issue Queue (IQ), Reorder Buffer (ROB), Load/Store Queue (LSQ), caches, function units
  - Resources considered in this work: IQ, ROB, LSQ
• Main idea:
  - Control resource allocation/deallocation dynamically to track the demands of the application
• Goals:
  - Must limit any impact on performance
  - Must allow for easy retrofit into existing datapaths
  - Must have a stable and low-overhead control strategy
Dynamic Resizing of IQ, ROB and LSQ
[Figure: superscalar datapath. Pipeline stages: Fetch (F1, F2), Decode/Dispatch (Dec/RN1, RN2/Dis), Instruction Issue, EX. Components: IQ, ROB, LSQ, function units FU1 to FUm, and the Architectural Register File (ARF), connected by instruction dispatch and result/status forwarding buses. A legend marks the resized resources.]
Main Issues
• How do we measure/estimate resource needs?
  - Continuous measurement vs. periodic sampling
• What is the control strategy?
  - Centralized vs. distributed
• How is the performance impact limited?
  - Periodic upsizing vs. asynchronous upsizing
• What are the relevant circuit techniques?
  - Overall redesign vs. simple changes
Resource Usage in Superscalar Datapath: Example (fpppp)
Resource Usage in Superscalar Datapath: Example (apsi)
Incremental Resource Allocation/Deallocation
• The ROB, IQ and LSQ are each implemented as a set of independent partitions
• Each partition is a register file, complete with its own sensing and precharge/write logic, multiple ports and through buses
• All partitions have associative addressing logic
Partitioned Organization
[Figure: three stacked partitions (Partition 1 to 3). Each partition has a precharger array, an associative part, a non-associative part, input/output drivers and a bypass switch array. Bitlines or forwarding lines run within a partition; through lines and bypass switches connect the partitions.]
Incremental Resource Allocation/Deallocation (continued)
• Allocations are increased by adding a free partition
• Deallocations are performed by powering down a partition after its contents have been used up
  - Easy to do for the IQ
  - A little more challenging for the ROB and the LSQ because of their FIFO nature
Sampling and Downsizing Strategies
• Downsizing decisions are taken at the end of each update period
• Update periods have a fixed duration of UP cycles
• Within an update period, multiple samples of the occupancies are taken at regular intervals of SP cycles
[Figure: one update period (UP) divided into sample periods of SP cycles each]
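The sampling scheme on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' hardware: the class and method names are hypothetical, and the downsizing rule (release every partition that fits entirely in the gap between the allocated size and the average sampled occupancy) is an assumption, since the slide only says that the decision is taken at the end of the update period.

```python
class DownsizeController:
    """Periodic occupancy sampling for one resizable resource (IQ, ROB or LSQ)."""

    def __init__(self, partition_size, num_partitions, sample_period, update_period):
        self.partition_size = partition_size  # entries per partition
        self.num_partitions = num_partitions  # partitions physically present
        self.sample_period = sample_period    # SP, in cycles
        self.update_period = update_period    # UP, in cycles
        self.active = num_partitions          # start with everything powered on
        self.samples = []

    def tick(self, cycle, occupancy):
        """Call once per cycle; `cycle` counts from 1."""
        if cycle % self.sample_period == 0:
            self.samples.append(occupancy)
        if cycle % self.update_period == 0:
            self._end_of_update_period()

    def _end_of_update_period(self):
        if not self.samples:
            return
        average = sum(self.samples) / len(self.samples)
        self.samples.clear()
        slack = self.active * self.partition_size - average
        # ASSUMPTION: release each partition that fits entirely in the slack,
        # but never power down the last remaining partition.
        release = int(slack // self.partition_size)
        self.active = max(1, self.active - release)


iq = DownsizeController(partition_size=8, num_partitions=4,
                        sample_period=4, update_period=16)
for cycle in range(1, 17):
    iq.tick(cycle, occupancy=10)
print(iq.active * iq.partition_size, "entries allocated after one update period")
```

Only one occupancy reading per SP cycles is stored, so the per-cycle cost of the controller is a counter comparison.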
A Resizing Example (SP=4, UP=16)
[Figure, animated over several slides: a timeline of cycles 0 to 34 split into update periods of four sample periods each (SP SP SP SP / UP). Two plots, each on a 0 to 32 entry axis, show actual occupancy and allocated entries. The last frame marks samples 1 to 4 and their average (Avg.).]
Upsizing Strategy
• Count the number of cycles in which dispatch blocks because the resource is full
• If the counter exceeds OT (Overflow Threshold), add one partition
  - Upsizing is more aggressive than downsizing, which reduces the hit on performance
• Reset the overflow counter to 0 at the beginning of a new UP (Update Period)
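A minimal Python sketch of the overflow counter described on this slide follows. The class and its names are hypothetical. Two details are assumptions rather than statements from the slides: the comparison is strict (the counter must exceed OT, following the bullet's wording), and the counter is also cleared immediately after a partition is added.

```python
class UpsizeController:
    """Overflow-driven upsizing for one resizable resource."""

    def __init__(self, num_partitions, overflow_threshold, update_period, active=1):
        self.num_partitions = num_partitions          # partitions physically present
        self.overflow_threshold = overflow_threshold  # OT
        self.update_period = update_period            # UP, in cycles
        self.active = active                          # partitions currently powered on
        self.overflow = 0                             # blocked-dispatch cycles seen so far

    def tick(self, cycle, dispatch_blocked_full):
        """Call once per cycle (`cycle` counts from 1). Returns True if a partition was added."""
        if dispatch_blocked_full:
            self.overflow += 1
        grew = False
        if self.overflow > self.overflow_threshold and self.active < self.num_partitions:
            self.active += 1   # upsizing happens right away, not at the end of the period
            self.overflow = 0  # ASSUMPTION: restart counting after an upsize
            grew = True
        if cycle % self.update_period == 0:
            self.overflow = 0  # per the slide: reset at the start of a new update period
        return grew


rob = UpsizeController(num_partitions=6, overflow_threshold=4, update_period=16, active=2)
for cycle in range(1, 6):
    if rob.tick(cycle, dispatch_blocked_full=True):
        print("partition added at cycle", cycle)
```

Because the check runs every cycle, a burst of blocked dispatches can trigger growth without waiting for the update period to end, which is what makes upsizing more aggressive than downsizing.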
A Resizing Example (SP=4, UP=16, OT=4)
[Figure, animated over several slides: the same timeline of cycles 0 to 34 with plots of actual occupancy and allocated entries (0 to 32 entries). An overflow counter steps through 1, 2, 3, 4 up to the frame labeled OT = 4; in the remaining frames the second update period on the timeline shows only three sample periods (SP SP SP).]
Summary of the Control Strategy
• Only three parameters are used for control:
  - OT (Overflow Threshold)
  - UP (Update Period)
  - SP (Sample Period)
• Less than 1% power overhead for control logic
• Advantages:
  - A desired power/performance tradeoff can easily be achieved by adjusting OT and UP
  - Cycle-by-cycle monitoring is avoided: sampling is done once every SP cycles
General Considerations for Deallocations
• All information within the partition to be deallocated must be consumed
  - For the IQ, instructions from the partition must be issued
  - For the ROB, entries within the partition must be committed
  - For the LSQ, entries within the partition must start the D-cache access
• No new instruction should be dispatched to this partition
  - This can cause dispatch to block for a longer duration in the case of the ROB because of its circular nature
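The two rules above (drain the partition first, and dispatch nothing new into it) can be sketched as a small state machine. This is a toy model of a non-FIFO issue queue only. The class, the state names and the placement policy are hypothetical, and the circular-buffer constraints of the ROB and LSQ are not modeled.

```python
class PartitionedQueue:
    """Toy partitioned issue queue; each partition is 'on', 'draining' or 'off'."""

    def __init__(self, num_partitions, partition_size):
        self.partition_size = partition_size
        self.entries = [set() for _ in range(num_partitions)]  # live tags per partition
        self.state = ["on"] * num_partitions

    def dispatch(self, tag):
        """Place `tag` in the first 'on' partition with a free slot."""
        for p, state in enumerate(self.state):
            if state == "on" and len(self.entries[p]) < self.partition_size:
                self.entries[p].add(tag)
                return True
        return False  # no room in any active partition: dispatch blocks

    def consume(self, tag):
        """Entry leaves the queue (for the IQ: the instruction issues)."""
        for p, live in enumerate(self.entries):
            if tag in live:
                live.remove(tag)
                self._power_down_if_drained(p)
                return

    def request_deallocation(self, p):
        """Stop dispatching into partition p; it powers down once empty."""
        if self.state[p] == "on":
            self.state[p] = "draining"
            self._power_down_if_drained(p)

    def _power_down_if_drained(self, p):
        if self.state[p] == "draining" and not self.entries[p]:
            self.state[p] = "off"


iq = PartitionedQueue(num_partitions=2, partition_size=2)
for tag in ("i0", "i1", "i2"):
    iq.dispatch(tag)
iq.request_deallocation(1)  # partition 1 still holds i2
print(iq.state)
iq.consume("i2")            # i2 issues, so partition 1 can power down
print(iq.state)
```

While a partition is draining, a full set of remaining partitions makes `dispatch` return False, which corresponds to the dispatch blocking mentioned on the slide.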
Experimental Setup: the Accupower Toolkit
[Figure: tool flow. Inputs: compiled SPEC benchmarks, datapath specs, VLSI layout data. Blocks: Microarchitectural Simulator, SPICE, Energy/Power Estimator. Intermediate data: transition counts and context information, SPICE deck, SPICE measures of energy per transition. Outputs: performance stats, power/energy stats.]
Configuration of the Simulated System
Parameter          Configuration
Machine width      4-way
Issue Queue        32 entries with 4 partitions
Reorder Buffer     96 entries with 6 partitions
Load/Store Queue   32 entries with 4 partitions
Simulated the execution of SPEC2000 benchmarks.
Experimental Results: Effect on Performance
[Figure: IPC for each value of OT]
OT           128    512    2048
IPC drop     0.9%   4.9%   19.3%
Experimental Results: Average Active Size (IQ)
[Figure: average active IQ size for each value of OT]
OT           128    512    2048
Savings      14%    27%    51%
Experimental Results: Average Active Size (ROB)
[Figure: average active ROB size for each value of OT]
OT           128    512    2048
Savings      19%    34%    58%
Experimental Results: Average Active Size (LSQ)
[Figure: average active LSQ size for each value of OT]
OT           128    512    2048
Savings      7%     20%    47%
Experimental Results (OT=512, UP=2048, SP=32)
Experimental Results: Power Reduction
[Figure: power in mW for each value of OT]
OT              128    512    2048
Power savings   40%    48%    65%
IPC drop        0.9%   4.9%   19.3%
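One way to read the two rows together is energy per instruction. At a fixed clock rate, energy per instruction is proportional to power divided by IPC, so its value relative to the baseline is (1 - power savings) / (1 - IPC drop). The short Python calculation below applies this to the numbers on the slide. It is an illustration only: it assumes an unchanged clock frequency and that both percentages are measured against the same baseline, and the function name is hypothetical.

```python
def relative_energy_per_instruction(power_savings, ipc_drop):
    """Energy per instruction relative to the baseline, at a fixed clock rate.

    Both arguments are fractions, e.g. 0.48 for 48%.
    """
    return (1.0 - power_savings) / (1.0 - ipc_drop)


# (OT, power savings, IPC drop) as reported on the slide
results = [(128, 0.40, 0.009), (512, 0.48, 0.049), (2048, 0.65, 0.193)]

for ot, savings, drop in results:
    ratio = relative_energy_per_instruction(savings, drop)
    print(f"OT={ot}: relative energy per instruction = {ratio:.3f}")
```

Under these assumptions the ratio stays below 1 for all three OT settings, so the power savings outweigh the longer run time in energy terms for the power figure being measured.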
Other Matters
• Dispatch rate modulation on top of resizing yields little additional power savings and results in higher IPC drops (WCED’01)
• Note that this work also addresses leakage dissipation
• We are in the process of extending this work to add caches, FUs, TLBs, …, and dynamic threshold variation
• Work is in progress on the use of resizing hooks that are exposed to the compiler
Related Work
• Adaptive Issue Queue (Buyuktosunoglu et al., PACS’00):
  - Multi-partitioned issue queue
  - Number of partitions dynamically allocated based on the number of ready flags set in entries within the active partition
  - IPC drop triggers growth
• Resizable Issue Queue (Folegnani and Gonzalez, ISCA’01):
  - FIFO issue queue, multi-partitioned
  - Number of instructions committed from the “youngest” partition is used for downsizing
• Pipeline Balancing (Bahar and Manne, ISCA’01):
  - For multi-clustered datapath organizations
  - Dynamic resizing of the issue queue and dynamic cluster activation
  - IPC monitored to allow clusters/issue queue partitions to be turned off with minimal impact on performance
• Others (IPC monitoring and resource control by the OS, dynamic profiling)
Concluding Remarks
• Significant power savings with minimal impact on performance are achieved by dynamically resizing multiple datapath resources
  - 48% power savings with only a 4.9% IPC drop
• A simple control strategy is used that avoids resource monitoring on a cycle-by-cycle basis
• The basic techniques are orthogonal to other power reduction strategies such as selective bit-slice activation, frequency and voltage scaling, and additional circuit techniques