Integrating Adaptive On-Chip Storage Structures for Reduced Dynamic Power. Steve Dropsho, Alper Buyuktosunoglu, Rajeev Balasubramonian, David H. Albonesi, Sandhya Dwarkadas, Greg Semeraro, Grigorios Magklis, and Michael Scott. ECE and CS Departments, University of Rochester.
Why Adaptive Structures?
• General-purpose microprocessors are “one size fits all”
• But needs vary across (and within) applications
• Can save considerable energy by matching resources to the application
Objective: Less energy for the same performance by adapting storage structures to the application
Related Work
• Adaptable cache
  • Balasubramonian et al., MICRO 2000
  • Dhodapkar and Smith, ISCA 2002
• Adaptable issue logic
  • Buyuktosunoglu et al., GLS VLSI 2001
  • Folegnani and Gonzalez, ISCA 2000
Common Themes
• A single adaptive structure
• Use of global information for feedback
• Exploration-based (caches)
Related Work (cont)
• Adaptable IQ, LSQ, and ROB
  • Ponomarev et al., MICRO 2001
• Three (3) adaptable structures
• Reconfigurations based on local state
Integrating Multiple Adaptive Structures
[Figure: processor block diagram with L1 Icache, Branch predict, FetchQ, Rename map, ROB, IIQ, FPQ, IPREG, FPREG, Int FUs, FP FUs, LSQ, L1 Dcache, L2 Unified Cache, and Memory.]
Challenges
• Multiple (9) adaptive structures create a state explosion problem
• Use of global information makes assigning cause and effect difficult
• Potential for additive performance effects among the structures
Approach: Local Management
• Local information for configuration decisions
• Tight control over performance variance
Part I: The Caches
[Figure: processor block diagram with L1 Icache, Branch predict, FetchQ, Rename map, ROB, IIQ, FPQ, IPREG, FPREG, Int FUs, FP FUs, LSQ, L1 Dcache, L2 Unified Cache, and Memory.]
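The Accounting Cache
• Sequential accesses, A then B
• Save energy on A access hit
• Swap blocks on A access miss
[Figure: a 4-way cache (ways 0-3) divided into an A access (primary) partition and a B access (secondary) partition, shown in the configurations A1/B3, A2/B2, A3/B1, and A4/B0, with blocks swapped between the partitions.]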
Most-Recently-Used Statistics
[Figure: MRU state of a 4-way set holding lines A, B, C, and D, the MRU state transitions caused by a sequence of accesses, and the resulting counters: MRU[0] = 3, MRU[1] = 2, MRU[2] = 1, MRU[3] = 0, Misses = 0.]
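The MRU counters can be illustrated with a small sketch. It assumes a conventional recency stack per set: a hit increments the counter of the MRU position at which the line was found, a miss increments the miss counter, and the accessed line then becomes the most recently used. The function name `record_access` and the data layout are hypothetical, chosen only for illustration; they are not taken from the slides.

```python
def record_access(mru_order, counters, tag):
    """Update MRU statistics for one access to a single cache set.

    mru_order: list of tags, most recently used first (illustrative layout).
    counters:  {"mru": [hits at MRU position 0..n-1], "misses": int}.
    """
    if tag in mru_order:
        # Hit: credit the MRU position where the line was found.
        pos = mru_order.index(tag)
        counters["mru"][pos] += 1
        mru_order.pop(pos)
    else:
        # Miss: count it and evict the least recently used line if the set is full.
        counters["misses"] += 1
        if len(mru_order) == len(counters["mru"]):
            mru_order.pop()
    # In both cases the accessed line becomes the most recently used.
    mru_order.insert(0, tag)


# Example: a 4-way set holding lines A-D, with A the most recently used.
order = ["A", "B", "C", "D"]
stats = {"mru": [0, 0, 0, 0], "misses": 0}
for tag in ["A", "B", "A", "E"]:
    record_access(order, stats, tag)
print(stats, order)
```

Because the counters record hits by recency position, a single pass over them is enough to tell how many accesses would have hit in an A partition of any size.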
Configuration Evaluation
MRU counters (mru to lru): MRU[0] = 3, MRU[1] = 2, MRU[2] = 1, MRU[3] = 0, Misses = 0
• A1/B3: Delay = 6 D_A + 3 D_B, Energy = 6 E_1 + 3 E_3
• A2/B2: Delay = 6 D_A + 1 D_B, Energy = 7 E_2
• A3/B1: Delay = 6 D_A, Energy = 6 E_3
• A4/B0 (BASE): Delay = 6 D_A, Energy = 6 E_4
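The arithmetic on this slide can be reproduced with a short sketch. It assumes that every access pays the A delay and the energy of an A partition with k ways, and that every access not satisfied by A additionally pays the B delay and the energy of the remaining ways. The function `evaluate_configs`, its parameters, and the unit costs in the example are illustrative assumptions; any miss penalty beyond the cache is left out because it is the same for every configuration.

```python
def evaluate_configs(mru, misses, d_a, d_b, way_energy):
    """Estimate (delay, energy) for every A/B split of an n-way cache.

    mru:        hit counts per MRU position, most recent first.
    way_energy: way_energy[k] is the energy of one access to a k-way
                partition (placeholder values, not measured data).
    Returns {k: (delay, energy)} where k is the number of ways in A.
    """
    n = len(mru)
    total = sum(mru) + misses
    results = {}
    for k in range(1, n + 1):
        a_hits = sum(mru[:k])          # accesses that hit in the A partition
        b_ways = n - k
        # Everything that misses in A goes on to B, if B exists.
        b_accesses = total - a_hits if b_ways > 0 else 0
        delay = total * d_a + b_accesses * d_b
        energy = total * way_energy[k]
        if b_ways > 0:
            energy += b_accesses * way_energy[b_ways]
        results[k] = (delay, energy)
    return results


# Counters from the slide, with unit delays and E_k = k as placeholder costs.
table = evaluate_configs([3, 2, 1, 0], 0, d_a=1, d_b=1,
                         way_energy={1: 1, 2: 2, 3: 3, 4: 4})
for k, (delay, energy) in table.items():
    print(f"A{k}/B{4 - k}: delay={delay} energy={energy}")
```

With these placeholder costs, A1/B3 gives 6 D_A + 3 D_B and 6 E_1 + 3 E_3, and A2/B2 gives 6 D_A + 1 D_B and 7 E_2, matching the expressions on the slide.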
Tolerance and the Bank Account
• Tolerance allows more delay than BASE
  • D_TOL = D_BASE × (1 + TOL)
  • TOL = {0.015, 0.062, 0.25} (1/64, 1/16, 1/4)
• Bank account allows accumulation of unused tolerance
  • Use account credits in later intervals
  • Allows aggressive resizing
  • Amortizes mistakes over many intervals
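One way to picture the bank account is the sketch below. It assumes that each interval picks the lowest-energy configuration whose estimated delay fits within D_TOL plus the accumulated credit, and that the unused part of D_TOL is then deposited into the account. The selection rule, the names `choose_config` and `run_intervals`, and the candidate numbers are assumptions made for illustration; the slides do not give the exact policy.

```python
def choose_config(candidates, d_base, tol, account):
    """Pick a configuration for the next interval.

    candidates: list of (name, estimated_delay, estimated_energy).
    Returns (chosen_name, updated_account).
    """
    d_tol = d_base * (1 + tol)              # D_TOL = D_BASE * (1 + TOL)
    budget = d_tol + account                # tolerance plus saved credit
    feasible = [c for c in candidates if c[1] <= budget]
    if feasible:
        chosen = min(feasible, key=lambda c: c[2])    # lowest energy that fits
    else:
        chosen = min(candidates, key=lambda c: c[1])  # fall back to fastest
    # Deposit unused tolerance (or withdraw credit if the budget was exceeded).
    return chosen[0], account + d_tol - chosen[1]


# Illustrative candidates: (name, delay, energy) in arbitrary units.
CANDIDATES = [("A1/B3", 9, 15), ("A2/B2", 7, 14),
              ("A3/B1", 6, 18), ("A4/B0", 6, 24)]


def run_intervals(n, tol, d_base=6):
    """Run n intervals with the same candidates and report the choices."""
    account, picks = 0.0, []
    for _ in range(n):
        name, account = choose_config(CANDIDATES, d_base, tol, account)
        picks.append(name)
    return picks, account


print(run_intervals(3, 1 / 16))
```

In this example the credit saved during the first two intervals is what lets the third interval afford the slower but lower-energy A2/B2 configuration, which is the sense in which the account allows aggressive resizing.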
Memory Hierarchy: One Possible Configuration
• L2 Unified Cache (A/B)
• L1 D-Cache (A, no B)
• L1 I-Cache (A/B)
[Figure: ways 0-3 of each cache with its A/B partitioning.]
Environment
• SimpleScalar simulator
• Microarchitecture is similar to the Alpha 21264
• Benchmarks are a mix of SPEC95, SPEC2K, and Olden
• Energy models for buffers and caches from Buyuktosunoglu et al., GLS VLSI 2001 and Balasubramonian et al., MICRO 2000
Part II: Queues, Regs, and ROB
[Figure: processor block diagram with L1 Icache, Branch predict, FetchQ, Rename map, ROB, IIQ, FPQ, IPREG, FPREG, Int FUs, FP FUs, LSQ, L1 Dcache, L2 Unified Cache, and Memory.]
Resizable Queues/Reg File
[Figure: a buffer divided into N partitions (P1 through PN) of m elements each.]
Buffer Sizing With Limited Histogramming
• 8K cycle period
• Tolerances:
  • 1.5% (1/64)
  • 6.2% (1/16)
  • 25.0% (1/4)
[Figure: Distribution of Buffer Size. Occupancy distributions on an axis from 0 to Full, with the average (ave) marked, for three cases: "Grow buffer", "Proper size", and "Precise shrink".]
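A rough sketch of how a per-interval occupancy histogram could drive resizing follows. It assumes the histogram is kept at partition granularity, that the buffer grows by one partition when the cycles spent completely full exceed the tolerance budget, and that it otherwise shrinks to the smallest size whose would-be overflow cycles fit in the budget. Treating overflow cycles as the delay estimate is a simplification, and `choose_partitions` and its parameters are hypothetical names rather than the mechanism used in the talk.

```python
def choose_partitions(hist, full_cycles, period, tol, max_partitions):
    """Choose the number of active partitions for the next interval.

    hist[i]:     cycles in which occupancy needed exactly i + 1 partitions
                 (only sizes up to the current one can be observed).
    full_cycles: cycles in which the buffer was completely full.
    period:      interval length in cycles (8K on the slide).
    tol:         tolerated slowdown fraction, e.g. 1/64, 1/16, or 1/4.
    """
    current = len(hist)
    budget = tol * period
    # Grow: the buffer was full too often to stay within tolerance.
    if full_cycles > budget and current < max_partitions:
        return current + 1
    # Shrink: drop top partitions while the cycles that would have
    # overflowed the smaller buffer still fit in the budget.
    size, overflow = current, 0
    for p in range(current, 1, -1):
        overflow += hist[p - 1]
        if overflow > budget:
            break
        size = p - 1
    return size


# Example interval of 8192 cycles with 4 active partitions (made-up counts).
hist = [6000, 1500, 600, 92]
for tol in (1 / 64, 1 / 16, 1 / 4):
    print(tol, choose_partitions(hist, 10, 8192, tol, max_partitions=8))
```

A larger tolerance permits a deeper shrink from the same histogram, which is the trade-off the three tolerance settings explore.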
Resizing the Register File
• Issue: Do not know when registers expire
• Solution: To make the reg file smaller, move values out of the partition (P) to be turned off
  • First, inhibit new assignments to P
  • Next, use a software interrupt routine to move values via normal rename logic (e.g., mov r1 r1)
  • Register mappings automatically updated
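The two-step procedure above can be sketched with a toy rename map. The model assumes a simple free list, physical registers grouped into equal partitions, and immediate freeing of the old physical register on a write (real hardware would free it at commit). The class `RenameMap` and its methods are hypothetical and exist only to show why a `mov rX rX` through the rename logic vacates a partition.

```python
class RenameMap:
    """Toy logical-to-physical register rename map with partitions."""

    def __init__(self, num_physical, partition_size):
        self.partition_size = partition_size
        self.free = list(range(num_physical))  # free physical registers
        self.map = {}                          # logical name -> physical index
        self.values = {}                       # physical index -> value
        self.inhibited = set()                 # partitions closed to allocation

    def partition_of(self, phys):
        return phys // self.partition_size

    def allocate(self):
        # Skip free registers that live in an inhibited partition.
        for i, phys in enumerate(self.free):
            if self.partition_of(phys) not in self.inhibited:
                return self.free.pop(i)
        raise RuntimeError("no free register outside inhibited partitions")

    def write(self, logical, value):
        new = self.allocate()
        old = self.map.get(logical)
        self.map[logical] = new
        self.values[new] = value
        if old is not None:
            del self.values[old]
            self.free.append(old)  # simplification: freed immediately

    def read(self, logical):
        return self.values[self.map[logical]]

    def drain_partition(self, p):
        self.inhibited.add(p)                  # step 1: inhibit assignments to P
        for logical, phys in list(self.map.items()):
            if self.partition_of(phys) == p:
                # Step 2: the equivalent of "mov rX rX"; renaming the
                # destination moves the value out of P and updates the map.
                self.write(logical, self.read(logical))


rm = RenameMap(num_physical=8, partition_size=4)
for name, value in [("r1", 10), ("r2", 20), ("r3", 30)]:
    rm.write(name, value)
rm.drain_partition(0)
print(rm.map, [rm.read(r) for r in ("r1", "r2", "r3")])
```

After the drain no logical register maps into partition 0, so in this model the partition could be turned off without losing any live value.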
Conclusion
• Simultaneous adaptation of all major regular structures
  • Accounting cache
  • Limited histogramming for buffers
  • Adaptable register file
• Local control yet tolerable performance loss
• Future work
  • Augment local control with global control for bounded performance loss