410 likes | 556 Views
A Low Energy Set-Associative I-Cache with Extended BTB. K. Inoue, V. Moshnyaga, and K. Murakami. Introduction. Increase in cache size. Power consumed in on-chip caches. DEC 21164 CPU*. StrongARM SA-110 CPU*. Bipolar ECL CPU**. 50%. 25%. 43%.
E N D
A Low Energy Set-Associative I-Cache with Extended BTB K. Inoue, V. Moshnyaga, and K. Murakami
Introduction Increase in cache size Power consumed in on-chip caches DEC 21164 CPU* StrongARM SA-110 CPU* Bipolar ECL CPU** 50% 25% 43% * Kamble et. al., “Analytical energy Dissipation Models for Low Power Caches”, ISLPED’97 ** Joouppi et. al., “A 300-MHz 115-W 32-b Bipolar ECL Microprocessor” ,IEEE Journal of Solid-State Circuits’93
Conventional Cache Phased Cache Tag way0 way1 way2 way3 Line Cycle 1 Cycle 1 Parallel search strategy produces unnecessary way activation! Cycle 2 First access but High energy consumption Low energy consumption but Slow access Problem of Conventional Caches
Our Proposal History-Based Tag-Comparison I-Cache • Attempts to reduce cache-access energy without performance degradation • Reuses tag-check results to eliminate unnecessary way activation • Can achieve 62% of energy reduction with only 0.2% of performance degradation
Cache is updated only when replacement occurs! A loaded data stays at the same location at least until the next cache-miss takes place Programs include a lot of loops! A number of instructions are executed repetitively Inst. A is referenced N times Ref. A Ref. A Ref. A Ref. A Ref. A Miss! Miss! Tag Check! Tag Check! Tag Check! Tag Check! Tag Check! time Cache-miss interval Cache is stable! Conventional Tag-Check Scheme Completely the same tag-check result!
History-Based Tag-Comparison (HBTC) Scheme Attempts to reuse tag-check results produced before during a cache-miss interval! • The target instruction has been referenced before, and • No cache miss has occurred since the previous reference. Miss! Miss! Ref. A Ref. A time Tag Check! Reuse! Cache-miss interval
1. Execute an instruction A at time T way3 way2 way1 way0 • Perform tag check • Save the tag-check result into extended BTB Index [way2] is the Hit-way! 3. Execute the instruction A at time T+X way3 way2 way1 way0 • Reuse the tag-check result to activate only the hit-way’s data sub-array Index [way2] is the Hit-way! Concept of the HBTC Cache 2. If a cache miss occurs, then we invalidate all the stored tag-check results
Conventional VS. Phased VS. HBTC Conventional Phased HBTC Reuse Cache Hit No Reuse Cache Miss
WP Recode Reg. Pred. Result Branch Inst. Addr. Tag Check Result Address for writing WP Table Branch-Inst. Addr. Target Addr. I-Cache PC BTB (Branch Target Buffer) Not Taken Taken Branch-Inst. Addr. Target Addr. Branch Prediction Result Entry of the WP Table valid n of way pointers WP valid flag WP Reg. Mode Controller Mode Miss? HBTC SA I-$ Architecture PBAreg
Mode Transition GOtoNM I-Cache miss or BTB replacement or RAS access or Branch misprediction Valid OM BTB Hit GOtoNM Invalid PC and Pred.-result PBAreg NM TM GOtoNM All WPs are invalidated! HBTC I-$ Operation Normal Mode (NM):w/ Tag checks Omitting Mode (OM):w/o Tag checks (Reuse) Tracing Mode (TM):w/ Tag checks (tag-check results are preserved into the WPRreg, and are stored into the WP-table on the next BTB hit )
HBTC I-$ Operation Example Mode Transition Valid OM BTB Hit WPRreg PBAreg GOtoNM From I-Cache Invalid WP Table NM TM GOtoNM Inst. Addr. A Target Addr. T N Branch Target Buffer PC Inst. Addr. B Target Addr. 4-way I-Cache Pred. (T or N) 2 1 0 3 WPreg Mode Controller
HBTC I-$ Operation Example Mode Transition Valid OM BTB Hit WPRreg PBAreg GOtoNM From I-Cache Invalid WP Table NM TM GOtoNM Inst. Addr. A Target Addr. T N A Branch Target Buffer PC Inst. Addr. B Target Addr. 4-way I-Cache Taken 2 1 0 3 WPreg Mode Controller
HBTC I-$ Operation Example Mode Transition Valid OM BTB Hit WPRreg PBAreg GOtoNM From I-Cache Invalid WP Table NM TM GOtoNM Inst. Addr. A Target Addr. T N A Branch Target Buffer PC Inst. Addr. B Target Addr. 4-way I-Cache Taken 2 1 0 3 NO valid WPs are detected! WPreg Mode Controller
HBTC I-$ Operation Example Mode Transition PC and Branch prediction result are saved! Valid OM BTB Hit WPRreg PBAreg GOtoNM From I-Cache Invalid T A WP Table NM TM GOtoNM Inst. Addr. A Target Addr. T N A Branch Target Buffer PC Inst. Addr. B Target Addr. 4-way I-Cache Pred. (T or N) 2 1 0 3 NO valid WPs are detected! WPreg Mode Controller
HBTC I-$ Operation Example Mode Transition Tag-Comparison result is stored into the WPRreg! Valid OM BTB Hit WPRreg PBAreg 1 GOtoNM Invalid T A WP Table NM TM GOtoNM Inst. Addr. A Target Addr. T N Branch Target Buffer Conventional Accesses! PC Inst. Addr. B Target Addr. 4-way I-Cache Pred. (T or N) 2 1 0 3 WPreg Mode Controller
HBTC I-$ Operation Example Mode Transition Tag-Comparison result is stored into the WPRreg! Valid OM BTB Hit WPRreg PBAreg 3 GOtoNM Invalid T A WP Table NM TM GOtoNM Inst. Addr. A Target Addr. T N Branch Target Buffer Conventional Accesses! PC Inst. Addr. B Target Addr. 4-way I-Cache Pred. (T or N) 2 1 0 3 WPreg Mode Controller
HBTC I-$ Operation Example Mode Transition Tag-Comparison result is stored into the WPRreg! Valid OM BTB Hit WPRreg PBAreg 0 GOtoNM Invalid T A WP Table NM TM GOtoNM Inst. Addr. A Target Addr. T N Branch Target Buffer Conventional Accesses! PC Inst. Addr. B Target Addr. 4-way I-Cache Pred. (T or N) 2 1 0 3 WPreg Mode Controller
HBTC I-$ Operation Example Mode Transition The WPRreg is stored into the WP-Table entry pointed by the PBAreg! Valid OM BTB Hit WPRreg PBAreg GOtoNM From I-Cache Invalid T A WP Table NM TM GOtoNM Inst. Addr. A Target Addr. BTB Hit! T N B Branch Target Buffer PC Inst. Addr. B Target Addr. 4-way I-Cache Pred. (T or N) 2 1 0 3 WPreg Mode Controller
HBTC I-$ Operation Example Mode Transition Valid OM BTB Hit WPRreg PBAreg GOtoNM From I-Cache Invalid WP Table NM TM GOtoNM Inst. Addr. A Target Addr. T N A Branch Target Buffer PC Inst. Addr. B Target Addr. 4-way I-Cache Taken 2 1 0 3 WPreg Mode Controller
HBTC I-$ Operation Example Mode Transition Valid OM BTB Hit WPRreg PBAreg GOtoNM From I-Cache Invalid WP Table NM TM GOtoNM Inst. Addr. A Target Addr. T N A Branch Target Buffer PC Inst. Addr. B Target Addr. 4-way I-Cache Taken 2 1 0 3 Valid WPs are detected! WPreg Mode Controller
HBTC I-$ Operation Example Mode Transition Valid OM BTB Hit WPRreg PBAreg GOtoNM From I-Cache Invalid WP Table NM TM GOtoNM Inst. Addr. A Target Addr. T N Branch Target Buffer Tag-Comparison Reuse PC Inst. Addr. B Target Addr. 4-way I-Cache 1 Pred. (T or N) 2 1 0 3 WPreg Mode Controller
HBTC I-$ Operation Example Mode Transition Valid OM BTB Hit WPRreg PBAreg GOtoNM From I-Cache Invalid WP Table NM TM GOtoNM Inst. Addr. A Target Addr. T N Branch Target Buffer Tag-Comparison Reuse PC Inst. Addr. B Target Addr. 4-way I-Cache 3 Pred. (T or N) 2 1 0 3 WPreg Mode Controller
HBTC I-$ Operation Example Mode Transition Valid OM BTB Hit WPRreg PBAreg GOtoNM From I-Cache Invalid WP Table NM TM GOtoNM Inst. Addr. A Target Addr. T N Branch Target Buffer Tag-Comparison Reuse PC Inst. Addr. B Target Addr. 4-way I-Cache 0 Pred. (T or N) 2 1 0 3 WPreg Mode Controller
HBTC I-$ Operation Example Mode Transition Valid OM BTB Hit WPRreg PBAreg GOtoNM From I-Cache Invalid WP Table NM TM GOtoNM Inst. Addr. A Target Addr. T N Branch Target Buffer No valid WPs in the WPreg! PC Inst. Addr. B Target Addr. 4-way I-Cache ? Pred. (T or N) 2 1 0 3 WPreg Mode Controller
HBTC I-$ Operation Example Mode Transition Valid OM BTB Hit WPRreg PBAreg GOtoNM From I-Cache Invalid WP Table NM TM GOtoNM Inst. Addr. A Target Addr. T N Branch Target Buffer Conventional Accesses! PC Inst. Addr. B Target Addr. 4-way I-Cache Pred. (T or N) 2 1 0 3 WPreg Mode Controller
way0 way1 way2 way3 way0 way1 way2 way3 Advantages and Disadvantages Omitting Mode (OM) Normal Mode (NM) / Tracing Mode (TM) • Eliminate unnecessary energy consumption w/o performance degradation (during OM)! • BTB energy overhead due to WP-table read-accesses • BTB access conflict for invalidating all WPs (causes 1 stall cycle) • BTB access conflict to record WPs (causes 1 stall cycle)
Benchmark Programs SPECint95 099.go, 124.m88ksim, 126.gcc, 129.compress,130.li, 132.ijpeg SPECfp95 102.swim Mediabench mpeg2encode, mpeg2decode, adpcm_enc, adpcm_dec Evaluation– Environment – • OOO simulation by SimpleScalar • 16 KB 4-way I-cache (32 B line size) • For others, default parameters were used • Cache Energy Model based on [Kamble97] • (including the WP-table read-energy overhead) • Assume that the BTB is accessed only when branch or jump instructions are executed (instructions are pre-decoded) [Kamble97] M.B.Kamble and K.Ghose, ”Analytical Energy Dissipation Models For Low Power Caches,” ISLPED97
# of WPs = 4 1.2 1.2 1.0 1.0 0.8 0.8 0.6 Normalized Energy (Joule) 0.6 Normalized Exe. Time (cycle) 0.4 0.4 0.2 0.2 0.0 0.0 099.go 126.gcc 130.li 102.swim adpcm(d) mpeg2(d) 124.m88ksim 129.comp. 132.ijpeg adpcm(e) mpeg2(e) Evaluation– Energy and Performance – 62% 0.2% • 62% of Ecache reduction with 0.2% of Exe. Time increase • Even if in the worst case, about 20% of Ecache reduction
Cache Miss Penalty Evaluation– Effect of WP invalidation penalty – 126.gcc 099.go Norm. Exe. Time (cycle) mpeg2(d) 132.ijpeg WP Invalidation Penalty (cycle) • If the penalty is equal to or smaller than 4 clock cycles, the performance overhead is trivial. • The performance overhead grows after the penalty is more than 4 clock cycles.
Evaluation– Effect of The Number of WPs – w/o Pre-Decoding w/ Pre-Decoding 1.2 126.gcc Energy for Cache Access 1.0 Energy Overhead of BTB 0.8 0.6 Normalized Energy (Joule) 0.4 0.2 0.0 1 2 4 8 16 32 1 2 4 8 16 32 # of Way Pointer • Increasing the number of WPs makes it possible to reuse many tag-check results • But, it produces BTB access energy overhead
Evaluation– Effect of Cache Associativity – mpeg2decode Conventional HBTC Eothers Etag Edata,bl Edata,prectl Energy (Joule) 1 2 4 8 16 32 64 1 2 4 8 16 32 64 Associativity • Conv.: Ecache grows with the increase in assiciativity • HBTC: Ecache is reduced with the increase in associativity (n<=4), after that, It starts to increase (n>4)
Conclusions History-Based Tag-Comparison Instruction Cache • Recodes tag-check results generated by the I-cache into the extended BTB • Attempts to reuse them in order to eliminate unnecessary way activation • Achieves 62% of I-cache energy reduction with only 0.2% of performance degradation! Future work • Analyze energy consumption based on real chip design.
Buck Up Slides (History-based Tag-Comparison Cache)
Evaluation– Comparison with IS Approach – Interline Sequential approach History-Based Look-up Cache Combination of IS and HBL 0.8 0.7 0.6 0.5 Normalized Tag-Compare Count 0.4 0.3 0.2 0.1 0.0 099.go 126.gcc 130.li 102.swim adpcm(d) mpeg2(d) 124.m88ksim 129.comp. 132.ijpeg adpcm(e) mpeg2(e)
Evaluation– Effects of Cache Associativity – Eothers Etag Edata,bl Edata,prectl 099.go Conventional HBL Cache Energy (Joule) 1 2 4 8 16 32 64 1 2 4 8 16 32 64 Associativity 0.8um CMOS* ** *) M.B.Kamble and K.ghose, “Energy-Efficiency of VLSI Caches: A Comparative Study,” 10th Int. Conf. On VLSI Design **) S.J.E.Wilton and N.P.Jouppi, “An Enhanced Access and Cycle Time Model for On-Chip Caches,” WRL Research Report 93/5
Evaluation– Effects of Cache Associativity – Eothers Etag Edata,bl Edata,prectl 126.gcc Conventional HBL Cache Energy (Joule) 1 2 4 8 16 32 64 1 2 4 8 16 32 64 Associativity 0.8um CMOS* ** *) M.B.Kamble and K.ghose, “Energy-Efficiency of VLSI Caches: A Comparative Study,” 10th Int. Conf. On VLSI Design **) S.J.E.Wilton and N.P.Jouppi, “An Enhanced Access and Cycle Time Model for On-Chip Caches,” WRL Research Report 93/5
Evaluation– Effects of Cache Associativity – Eothers Etag Edata,bl Edata,prectl 132.ijpeg Conventional HBL Cache Energy (Joule) 1 2 4 8 16 32 64 1 2 4 8 16 32 64 Associativity 0.8um CMOS* ** *) M.B.Kamble and K.ghose, “Energy-Efficiency of VLSI Caches: A Comparative Study,” 10th Int. Conf. On VLSI Design **) S.J.E.Wilton and N.P.Jouppi, “An Enhanced Access and Cycle Time Model for On-Chip Caches,” WRL Research Report 93/5
Evaluation– Effects of Cache Associativity – Eothers Etag Edata,bl Edata,prectl mpeg2decode Conventional HBL Cache Energy (Joule) 1 2 4 8 16 32 64 1 2 4 8 16 32 64 Associativity 0.8um CMOS* ** *) M.B.Kamble and K.ghose, “Energy-Efficiency of VLSI Caches: A Comparative Study,” 10th Int. Conf. On VLSI Design **) S.J.E.Wilton and N.P.Jouppi, “An Enhanced Access and Cycle Time Model for On-Chip Caches,” WRL Research Report 93/5
Evaluation– Effects of # of WPs – w/ Pre-Decoding (BTB access occurs only at branch, or jump, executions) 1.0 126.gcc 132.ijpeg 0.8 0.6 Normalized Energy (Joule) 0.4 0.2 0.0 1 2 4 8 16 32 1 2 4 8 16 32 # of Way Pointer Energy for Cache Access Energy Overhead at BTB
Evaluation– Effects of # of WPs – w/o Pre-Decoding (BTB access occurs for all instructions) 1.0 126.gcc 132.ijpeg 0.8 0.6 Normalized Energy (Joule) 0.4 0.2 0.0 1 2 4 8 16 32 1 2 4 8 16 32 # of Way Pointer Energy for Cache Access Energy Overhead at BTB
Cache Miss Penalty Evaluation– Effect of WP invalidation penalty – BTB Replacement Cache Miss 126.gcc Normalized Exe. Time (cycle) Breakdown of WP invalidations 099.go mpeg2(d) 132.ijpeg 099.go 126.gcc 130.li 102.swim adpcm(d) mpeg2(d) 124.m88ksim 129.comp.132.ijpeg adpcm(e) mpeg2(e) WP Invalidation Penalty (cycle) • If the penalty is equal to or smaller than 4 clock cycles, the performance overhead is trivial. • The performance overhead grows after the penalty is more than 4 clock cycles.