A Low-Power I-Cache Design with Tag-Comparison Reuse K. Inoue, H. Tanaka, V. Moshnyaga, and K. Murakami
Introduction
• On-chip caches
  • Indispensable to high-performance, low-energy SoCs
  • Confine memory accesses on-chip
  • Reduce not only off-chip memory-access latency but also the energy for driving external I/O pins
• However…
  • Larger and more highly associative organizations consume more energy
  • E.g. 25% (DEC 21164) and 43% (StrongARM) of total chip power *
  • Particularly I-caches, due to their high access frequency
* Kamble et al., “Analytical Energy Dissipation Models for Low Power Caches”, ISLPED’97
Conventional vs. Phased vs. HBTC
• Conventional: fast access with high energy (parallel search produces unnecessary way activation!)
• Phased: slow access with low energy
• HBTC (ICCD’02): fast access with low energy
[Figure: access flow of the three caches; labels: Cache Hit, Cache Miss, Reuse, No Reuse]
Contribution
• Detailed evaluation based on a 0.18 μm SRAM design
• Comparison with other low-power caches and hybrid models
• Exploration of ways to reduce the energy overhead caused by the HBTC approach
Outline
• Introduction
• History-Based Tag-Comparison (HBTC) Cache
• Evaluation
  • Designing an SRAM array
  • Evaluating performance/energy efficiency
  • Reducing energy overhead
• Conclusions
History-Based Tag-Comparison Cache (HBTC)
Attempts to reuse tag-check results produced during a cache-miss interval! A result can be reused:
• If the target instruction has been referenced before, and
• No cache miss has occurred since the previous reference.
[Figure: timeline with three references to instruction A inside one cache-miss interval (between two misses); labels: Tag Check!, Reuse!]
Operation
1. Execute an instruction A at time T
  • Perform the tag check (in the example, way2 of the indexed set is the hit way)
  • Save the tag-check result into an extended BTB
2. If a cache miss occurs, invalidate all the stored tag-check results
3. Execute the instruction A again at time T+X
  • Reuse the tag-check result to activate only the hit way’s data sub-array (way2)
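The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not the hardware organization from the slides: the class name `TagCheckHistory`, the per-address dictionary, and the caller-supplied `tag_check` function are all assumptions made for the example (the actual design keeps way pointers in an extended BTB).

```python
class TagCheckHistory:
    """Hypothetical stand-in for the tag-check results kept in the extended BTB."""

    def __init__(self):
        # instruction address -> way that hit on the previous reference
        self.hit_way = {}

    def fetch(self, addr, tag_check):
        """Return (way, tag_check_performed) for one instruction fetch.

        tag_check(addr) is a caller-supplied function (an assumption of this
        sketch) that returns the hit way, or None on a cache miss.
        """
        if addr in self.hit_way:
            # Step 3: reuse the stored result, so only the hit way is activated.
            return self.hit_way[addr], False
        way = tag_check(addr)
        if way is None:
            # Step 2: a cache miss invalidates every stored tag-check result.
            self.hit_way.clear()
            return None, True
        # Step 1: save the tag-check result for later reuse.
        self.hit_way[addr] = way
        return way, True


history = TagCheckHistory()
A = 0x400
first = history.fetch(A, lambda addr: 2)    # time T: tag check, way 2 is saved
second = history.fetch(A, lambda addr: 2)   # time T+X: result reused
history.fetch(0x800, lambda addr: None)     # a miss elsewhere clears the history
third = history.fetch(A, lambda addr: 2)    # tag check is needed again
print(first, second, third)
```

The second fetch of A skips the tag check, and the miss in between forces the third fetch to check tags again, which mirrors the cache-miss-interval condition.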
Organization
[Figure: block diagram of the HBTC I-cache. Labeled components: PC; I-Cache; BTB (Branch Target Buffer) holding branch-inst. addr., target addr., and the branch prediction result; WP Table with Taken and Not Taken parts, where an entry holds a valid flag and n way pointers (WPs); PBAreg (branch-inst. addr. and pred. result, used as the address for writing the WP table); WP Record Reg. (tag-check results); WP Reg.; Mode Controller (WP valid flag, Miss?, Mode)]
HBTC I-$ Operation Example
[Figure, repeated on each animation step: PC, 4-way I-Cache (ways 0 to 3), Branch Target Buffer holding (Inst. Addr. A, Target Addr.) and (Inst. Addr. B, Target Addr.) with a T/N prediction, WP Table, PBAreg, WPRreg, WPreg, and the Mode Controller with its NM/TM/OM mode-transition diagram]
1. The PC hits BTB entry A and the branch is predicted taken.
2. PC and branch prediction result are saved (PBAreg = A, T). NO valid WPs are detected!
3. Conventional accesses: each tag-comparison result is stored into the WPRreg (1, 3, then 0 in the figure).
4. BTB hit on entry B: the WPRreg is stored into the WP-table entry pointed to by the PBAreg.
5. The PC hits BTB entry A again and the branch is predicted taken. Valid WPs are detected!
6. Tag-comparison reuse: the way pointers (1, 3, then 0) select the I-cache way for each access.
Evaluation
• SimpleScalar simulation tool set
  • In-order execution (fetch width = 1)
  • 16 KB 4-way I-cache with a 32 B line size
  • 4-way BTB with 128 sets
• Benchmarks
  • Five SPEC95 integer programs
  • Four Mediabench programs (enc and dec for each)
• Energy model
  • E_TOTAL = E_CACHE + E_BTBEXT + E_LG (E_LG = 0)
  • E_CACHE = E_DEC + E_TAG + E_LINE
  • E_BTBEXT = E_WPrd + E_WPwt + E_WPinv
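The energy model above is a plain sum of components, which a short Python sketch can make concrete. The function and parameter names are assumptions chosen to mirror the slide's symbols, and the numbers in the usage lines are made-up placeholders, not measured values.

```python
def cache_energy(e_dec, e_tag, e_line):
    """E_CACHE = E_DEC + E_TAG + E_LINE."""
    return e_dec + e_tag + e_line


def btb_ext_energy(e_wp_rd, e_wp_wt, e_wp_inv):
    """E_BTBEXT = E_WPrd + E_WPwt + E_WPinv (way-pointer read, write, invalidate)."""
    return e_wp_rd + e_wp_wt + e_wp_inv


def total_energy(e_cache, e_btbext, e_lg=0):
    """E_TOTAL = E_CACHE + E_BTBEXT + E_LG, with E_LG taken as 0 as on the slide."""
    return e_cache + e_btbext + e_lg


# Placeholder values in arbitrary units, for illustration only.
e_cache = cache_energy(e_dec=5, e_tag=20, e_line=75)
e_btbext = btb_ext_energy(e_wp_rd=6, e_wp_wt=3, e_wp_inv=1)
print(total_energy(e_cache, e_btbext))
```

Keeping E_BTBEXT as a separate term makes it easy to see how much of any saving in E_TAG and E_LINE is given back as way-pointer overhead.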
Cache Model
• BASE (conventional)
• HBTC
• ILT
• PREDppc: the MRU table is accessed by using the previous PC (ppc)
Design of a 4 KB SRAM array
• 4 KB SRAM design
  • 0.18 μm CMOS technology
  • One way of the 16 KB cache
• HSPICE simulation
  • With extracted load capacitances
  • Measures the energy consumed for 1-bit accesses
• Estimated energy per access
[Figure: SRAM cell layout with reset control]
[Table: average energy and delay per access]
Energy Efficiency
• For all but one benchmark, HBTC+ILT produces the best performance!
• The best approach is application dependent!
[Figure: normalized energy consumption, broken down into E_BTBEXT, E_TAG, E_LINE, and E_DEC, for BASE, ILT, PREDppc, HBTC, HBTC+ILT, and HBTC+PREDppc on 126.gcc, 129.compress, 130.li, adpcm(d), epic(e), mpeg2(d), and all]
Energy-Overhead Reduction
E_BTBEXT = (#bits for tag-comp. reuse) × (avg. energy per bit access) × (#BTB accesses)
• Techniques: Only Taken/Not-Taken, Bit-Line Partitioning (BLP), Pre-Decoding
• Pre-decoding is an efficient way to reduce the overhead
• Only Taken/Not-Taken can’t improve energy efficiency
[Figure: normalized energy consumption, broken down into E_BTBEXT, E_TAG, E_LINE, and E_DEC, for HBTC, +Only Taken, +Only Not-Taken, +BLP, and +Pre-Decoding on 126.gcc, 129.compress, 130.li, adpcm(d), epic(e), and mpeg2(d)]
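The overhead estimate on this slide is a product of three factors, so shrinking any one of them scales the overhead proportionally. The sketch below is only an illustration of that arithmetic: the function name and the input numbers are assumptions, and the only technique-to-factor mapping taken from the slides is that pre-decoding reduces the frequency of BTB look-ups.

```python
def btb_overhead(bits_for_reuse, energy_per_bit_access, btb_accesses):
    """E_BTBEXT estimate: #bits x average energy per bit access x #BTB accesses."""
    return bits_for_reuse * energy_per_bit_access * btb_accesses


# Placeholder inputs in arbitrary units, for illustration only.
baseline = btb_overhead(bits_for_reuse=16, energy_per_bit_access=2, btb_accesses=1000)

# Hypothetical effect of pre-decoding: half as many BTB look-ups.
with_predecoding = btb_overhead(bits_for_reuse=16, energy_per_bit_access=2, btb_accesses=500)

print(baseline, with_predecoding)
```

Because the look-up count multiplies the whole expression, halving it halves the estimated overhead, whatever the other two factors are.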
Performance
• A one-cycle stall occurs when
  • A set of tag-comparison results is stored to the extended BTB
  • An invalidation of tag-comparison results takes place
• Performance degradation is trivial (less than 1% for many benchmarks)
[Figure: normalized execution time]
Conclusions
• Detailed evaluation of the HBTC approach for high-performance, low-energy caches
  • The HBTC cache can achieve a 60% energy reduction compared with a conventional cache
  • Combination with another low-energy technique produces a significant energy reduction (70% in the best case)
  • Pre-decoding, which reduces the frequency of BTB look-ups, alleviates the negative effect of the HBTC approach
• Future work
  • Complete design of the HBTC cache
Backup Slides (History-Based Tag-Comparison Cache)
HBTC I-$ Operation
• Normal Mode (NM): with tag checks
• Omitting Mode (OM): without tag checks (reuse)
• Tracing Mode (TM): with tag checks (tag-check results are preserved in the WPRreg and stored into the WP table on the next BTB hit)
Mode Transition
• BTB hit with valid WPs: go to OM
• BTB hit with invalid WPs: go to TM (PC and pred. result are saved in the PBAreg)
• GOtoNM (I-cache miss, BTB replacement, RAS access, or branch misprediction): go to NM
[Figure: mode-transition diagram among NM, OM, and TM; annotation: All WPs are invalidated!]
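The mode transitions can be written as a small state machine. This is a minimal sketch under stated assumptions: the event names are invented for the example, a BTB hit is treated the same way from every mode, and events not named on the slide leave the mode unchanged. The slide does not spell out those last two points.

```python
from enum import Enum


class Mode(Enum):
    NM = "normal"     # tag checks performed
    OM = "omitting"   # tag checks omitted, stored results reused
    TM = "tracing"    # tag checks performed and recorded in the WPRreg


# Events that trigger GOtoNM on the slide (event names are hypothetical).
GO_TO_NM = {"icache_miss", "btb_replacement", "ras_access", "branch_mispredict"}


def next_mode(mode, event, wp_valid=False):
    """Return the mode after one event.

    wp_valid only matters for a "btb_hit" event: it says whether the WP-table
    entry selected by the hit holds valid way pointers.
    """
    if event in GO_TO_NM:
        return Mode.NM
    if event == "btb_hit":
        # Valid WPs allow reuse (OM); otherwise start recording (TM).
        return Mode.OM if wp_valid else Mode.TM
    # Assumption: any other event (e.g. a sequential fetch) keeps the mode.
    return mode


mode = Mode.NM
mode = next_mode(mode, "btb_hit", wp_valid=False)   # first visit: record results
mode = next_mode(mode, "btb_hit", wp_valid=True)    # later visit: reuse results
mode = next_mode(mode, "icache_miss")               # fall back to normal mode
print(mode)
```

Folding the four GOtoNM causes into one set keeps the transition function small, which matches how the slide draws them as a single edge back to NM.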
Evaluation: Effect of the Number of WPs
[Figure: normalized energy for 126.gcc, split into energy for cache access and energy overhead of BTB, versus # of way pointers (1, 2, 4, 8, 16, 32), without and with pre-decoding]
• Increasing the number of WPs makes it possible to reuse many tag-check results
• But it adds BTB-access energy overhead
(ICCD’02)
Evaluation: Effect of Cache Associativity
[Figure: energy (Joule) for mpeg2decode, broken down into Eothers, Etag, Edata,bl, and Edata,prectl, versus associativity (1, 2, 4, 8, 16, 32, 64), for the conventional and HBTC caches]
• Conv.: Ecache grows as associativity increases
• HBTC: Ecache decreases as associativity increases (n <= 4), then starts to increase (n > 4)
(ICCD’02)
Evaluation: Effect of WP Invalidation Penalty
[Figures: normalized execution time (cycles) versus WP invalidation penalty (cycles), and breakdown of WP invalidations into BTB replacement and cache miss; benchmarks shown: 099.go, 124.m88ksim, 126.gcc, 129.comp., 130.li, 132.ijpeg, 102.swim, adpcm(e), adpcm(d), mpeg2(e), mpeg2(d)]
• If the penalty is 4 clock cycles or fewer, the performance overhead is trivial.
• The performance overhead grows once the penalty exceeds 4 clock cycles.
(ICCD’02)
[Figure: way prediction using MRU info. over Way0 to Way3; labels: every access, Prediction Hit, Prediction Miss]
[Figure: accesses to Way0 to Way3 in two cases. Intra-line access: tag-comp. reuse. Inter-line access: no reuse.]