Dynamically Trading Frequency for Complexity in a GALS Microprocessor. Steven Dropsho, Greg Semeraro, David H. Albonesi, Grigorios Magklis, Michael L. Scott. University of Rochester.
The gist of the paper… • Radical idea: trade off frequency and hardware complexity dynamically at runtime rather than statically at design time • The new twist: a Globally-Asynchronous, Locally-Synchronous (GALS) microarchitecture is key to making this worthwhile
Application phase behavior • Varying behavior over time • Can exploit to save power [Figure: gcc traces over time: IPC, branch mispredictions, L1D misses, L1I misses, L2 misses, E per interval] [Figure: adaptive issue queue] [Sherwood, Sair, Calder, ISCA 2003] [Buyuktosunoglu et al., GLSVLSI 2001]
What about performance?
RAM delay
entries         32    24    16    8
relative delay  1.0   0.77  0.52  0.31
CAM delay
entries         32    24    16    8
relative delay  1.0   0.77  0.55  0.34
• Lower power and faster access time! [Buyuktosunoglu, GLSVLSI 2001]
What about performance? How do we exploit the faster speed? • Variable latency • Increase frequency when downsizing • Decrease frequency when upsizing
What about performance? [Figure: processor block diagram with a clock input: L1 I-Cache, Br Pred, Fetch Unit; Dispatch, Rename, ROB; integer and FP Issue Queues, each with ALUs & RF; Ld/St Unit, L1 D-Cache; L2 Cache; Main Memory] [Albonesi, ISCA 1998]
What about performance? [Albonesi, ISCA 1998]
Enter GALS… [Figure: processor partitioned into clock domains. Front-end Domain: L1 I-Cache, Br Pred, Fetch Unit, Dispatch/Rename/ROB. Integer Domain: Issue Queue, ALUs & RF. FP Domain: Issue Queue, ALUs & RF. Memory Domain: Ld/St Unit, L1 D-Cache, L2 Cache. External Domain: Main Memory] [Semeraro et al., HPCA 2002] [Iyer and Marculescu, ISCA 2002]
Outline • Motivation and background • Adaptive GALS microarchitecture • Control mechanisms • Evaluation methodology • Results • Conclusions and future work
Adaptive GALS microarchitecture [Figure: the GALS domain diagram (Front-end, Integer, FP, Memory, and External domains) showing the adaptive structures: L1 I-Cache, Br Pred, Issue Queues, L1 D-Cache, L2 Cache]
Adaptive GALS operation [Figure: the GALS domain diagram (Front-end, Integer, FP, Memory, and External domains) showing the adaptive structures: L1 I-Cache, Br Pred, Issue Queues, L1 D-Cache, L2 Cache]
Resizable cache organization • Access the A part first, then the B part on a miss • Swap the A and B blocks on an A miss that hits in B • Select the A/B split according to application phase behavior
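The A/B access order can be illustrated with a minimal sketch of a single cache set. The class name `ABSet`, the `access` method, and the move-to-MRU update are assumptions made for illustration; the slide only states that A is probed first, B second, and that blocks are swapped on a B hit. Moving the hit block to the MRU slot has the same membership effect as a swap: the hit block enters A and the least recently used A block drops into B.

```python
from typing import List, Optional


class ABSet:
    """One cache set, ordered from MRU (index 0) to LRU (last index).

    The first `a_ways` positions form the fast A partition; the rest form
    the slower B partition. Hypothetical model, not the authors' circuit.
    Tags must not be None, because None marks an empty way.
    """

    def __init__(self, ways: int, a_ways: int) -> None:
        self.a_ways = a_ways
        self.blocks: List[Optional[str]] = [None] * ways

    def access(self, tag: str) -> str:
        """Return "A", "B", or "miss" and update the MRU order."""
        if tag in self.blocks:
            pos = self.blocks.index(tag)
            where = "A" if pos < self.a_ways else "B"
        else:
            pos = len(self.blocks) - 1  # replace the LRU block on a miss
            where = "miss"
        # Move the block to the MRU slot. On a B hit this pushes the
        # LRU block of A into B, which acts as the A/B swap.
        self.blocks.pop(pos)
        self.blocks.insert(0, tag)
        return where


if __name__ == "__main__":
    s = ABSet(ways=4, a_ways=1)
    for t in ["x", "x", "y", "x"]:
        print(t, s.access(t))
```

With `a_ways=1`, the second access to a block that has since been displaced by another block is served from B and then promoted back into A.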
Resizable cache control
• MRU state: one hit counter per MRU position, from 0 (MRU) to 3 (LRU)
• Example accesses (block order from MRU to LRU, then the counter incremented by the next hit):
  A B C D: MRU[1]++
  B A C D: MRU[2]++
  C B A D: MRU[0]++
  C B A D: MRU[3]++
• Config A1 B3: hitsA = MRU[0]; hitsB = MRU[1] + MRU[2] + MRU[3]
• Config A2 B2: hitsA = MRU[0] + MRU[1]; hitsB = MRU[2] + MRU[3]
• Config A3 B1: hitsA = MRU[0] + MRU[1] + MRU[2]; hitsB = MRU[3]
• Config A4 B0: hitsA = MRU[0] + MRU[1] + MRU[2] + MRU[3]; hitsB = 0
• Calculate the cost for each possible configuration:
  A access costs = (hitsA + hitsB + misses) * CostA
  B access costs = (hitsB + misses) * CostB
  Miss access costs = misses * CostMiss
  Total access cost = A + B + Miss (normalized to frequency)
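The cost calculation above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' hardware: the `CacheConfig` fields, the cost values, and the reading of "normalized to frequency" as a division by a relative frequency `rel_freq` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class CacheConfig:
    a_ways: int       # number of ways in the A partition
    cost_a: float     # cycles per A access (hypothetical value)
    cost_b: float     # cycles per B access (hypothetical value)
    cost_miss: float  # cycles per miss (hypothetical value)
    rel_freq: float   # domain frequency relative to the fastest config (assumption)


def total_cost(mru: Sequence[int], misses: int, cfg: CacheConfig) -> float:
    """Total access cost for one configuration, following the slide's formulas."""
    hits_a = sum(mru[:cfg.a_ways])   # hits in MRU positions that fall in A
    hits_b = sum(mru[cfg.a_ways:])   # hits in MRU positions that fall in B
    a_cost = (hits_a + hits_b + misses) * cfg.cost_a  # every access probes A
    b_cost = (hits_b + misses) * cfg.cost_b           # A misses go on to B
    miss_cost = misses * cfg.cost_miss
    # Assumed normalization: cycles divided by relative frequency gives time.
    return (a_cost + b_cost + miss_cost) / cfg.rel_freq


def best_config(mru: Sequence[int], misses: int,
                configs: Sequence[CacheConfig]) -> CacheConfig:
    """Pick the configuration with the lowest normalized cost."""
    return min(configs, key=lambda c: total_cost(mru, misses, c))


if __name__ == "__main__":
    counters = [70, 15, 10, 5]  # MRU[0..3] collected over one interval
    small = CacheConfig(a_ways=1, cost_a=2, cost_b=6, cost_miss=50, rel_freq=1.0)
    large = CacheConfig(a_ways=4, cost_a=2, cost_b=6, cost_miss=50, rel_freq=0.8)
    print(best_config(counters, 10, [small, large]).a_ways)
```

Because the four MRU counters and the miss count are enough to evaluate every A/B split, all configurations can be scored from a single interval of measurements without actually running each one.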
Resizable issue queue control • Measures the exploitable ILP for each queue size • Timestamp counter is reset at the start of an interval and incremented each cycle • During rename, a destination register is given a timestamp based on the timestamp + execution latency of its slowest source operand • The maximum timestamp, MAX_N, is maintained for each of the four possible queue sizes over N fetched instructions (N = 16, 32, 48, 64) • ILP is estimated as N/MAX_N • Queue size with the highest ILP (normalized to frequency) is selected • Read the paper
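A rough software sketch of the timestamp scheme follows. It rests on several assumptions: instructions are modeled as hypothetical `(dest, sources, latency)` tuples, the per-cycle timestamp counter is omitted so that unwritten source registers count as ready at time 0, every latency is at least 1, and "normalized to frequency" is read as multiplying the ILP estimate by a relative frequency per queue size.

```python
from typing import Dict, Sequence, Tuple

# (destination register, source registers, execution latency in cycles)
Instr = Tuple[str, Sequence[str], int]


def estimate_ilp(instrs: Sequence[Instr],
                 sizes: Sequence[int] = (16, 32, 48, 64)) -> Dict[int, float]:
    """Estimate ILP as N / MAX_N for each candidate queue size N."""
    ready: Dict[str, int] = {}  # timestamp at which each register is produced
    max_ts = 0                  # running maximum timestamp (MAX_N at checkpoints)
    ilp: Dict[int, float] = {}
    for n, (dest, srcs, latency) in enumerate(instrs, start=1):
        # The slowest source operand determines when this result is ready.
        ts = max((ready.get(s, 0) for s in srcs), default=0) + latency
        ready[dest] = ts
        max_ts = max(max_ts, ts)
        if n in sizes:
            ilp[n] = n / max_ts
    return ilp


def pick_queue_size(ilp: Dict[int, float], rel_freq: Dict[int, float]) -> int:
    """Select the size with the highest frequency-scaled ILP estimate."""
    return max(ilp, key=lambda n: ilp[n] * rel_freq[n])


if __name__ == "__main__":
    # A fully serial dependence chain: a larger queue exposes no extra ILP.
    chain = [(f"r{i + 1}", [f"r{i}"], 1) for i in range(64)]
    freq = {16: 1.0, 32: 0.9, 48: 0.8, 64: 0.7}  # hypothetical relative frequencies
    print(pick_queue_size(estimate_ilp(chain), freq))
```

For a serial chain every size yields an ILP estimate of 1, so the smallest and fastest queue wins; for fully independent instructions the estimate grows with N and the largest queue can win despite its lower frequency.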
Resizable hardware – some details
• Front end domain
  • Icache “A”: 16KB 1-way, 32KB 2-way, 48KB 3-way, 64KB 4-way
  • Branch predictor sized with Icache
    • gshare PHT: 16KB-64KB
    • Local BHT: 2KB-8KB
    • Local PHT: 1024 entries
    • Meta: 16KB-64KB
• Load/store domain
  • Dcache “A”: 32KB 1-way, 64KB 2-way, 128KB 4-way, 256KB 8-way
  • L2 cache “A” sized with Dcache: 256KB 1-way, 512KB 2-way, 1MB 4-way, 2MB 8-way
• Integer and floating point domains
  • Issue queue: 16, 32, 48, or 64 entries
Evaluation methodology
• SimpleScalar and Cacti
• 40 benchmarks from SPEC, Mediabench, and Olden
• Baseline: best overall performing fully synchronous 21264-like design found out of 1,024 simulated options
• Costs imposed on the adaptive MCD (multiple clock domain) design:
  • Additional branch penalty of 2 integer domain cycles and 1 front end domain cycle (overpipelined)
  • Frequency penalty of as much as 31%
  • Mean PLL locking time of 15 µsec
• Program-Adaptive: profile the application and pick the best adaptive configuration for the whole program
• Phase-Adaptive: use the online cache and issue queue control mechanisms
Performance improvement [Figure: performance improvement for the Mediabench, Olden, and SPEC benchmarks]
Phase behavior – art [Figure: issue queue entries over a 100 million instruction window]
Phase behavior – apsi [Figure: Dcache “A” size (32KB, 64KB, 128KB, 256KB) over a 100 million instruction window]
Performance summary
• Program Adaptive: 17% performance improvement
• Phase Adaptive: 20% performance improvement
• Automatic
• Never degrades performance for any of the 40 applications
• Few phases in chosen application windows – could perhaps do better
• Distribution of chosen configurations for Program Adaptive:
Integer IQ     16: 85%           32: 5%            48: 5%           64: 5%
FP IQ          16: 73%           32: 15%           48: 8%           64: 5%
D/L2 Cache     32KB/256KB: 50%   64KB/512KB: 18%   128KB/1MB: 23%   256KB/2MB: 10%
Icache         16KB: 55%         32KB: 18%         48KB: 8%         64KB: 20%
Conclusions • Application phase behavior can be exploited to improve performance in addition to power savings • GALS approach is key to localizing the impact of slowing the clock • Cache and queue control mechanisms can evaluate all possible configurations within a single interval • Phase adaptive approach improves performance by as much as 48% and by an average of 20%
Future work • Explore multiple adaptive structures in each domain • Better take into account the branch predictor • Resize the instruction cache by sets rather than ways • Explore better issue queue design alternatives • Build circuits • Dynamically customized heterogeneous multi-core architectures using phase-adaptive GALS cores