180 likes | 277 Views
Highly-Associative Caches for Low-Power Processors. Michael Zhang Krste Asanovic {rzhang|krste}@lcs.mit.edu. Motivation. Cache uses 30-60% processor energy in embedded systems. Example: 43% for StrongArm-1 Many academic studies on cache
E N D
Highly-Associative Caches for Low-Power Processors Michael Zhang Krste Asanovic {rzhang|krste}@lcs.mit.edu
Motivation • Cache uses 30-60% processor energy in embedded systems. • Example: 43% for StrongArm-1 • Many academic studies on cache • [Albera, Bahar, ’98] – Power and performance trade-offs • [Amrutur, Horowitz, ‘98,’00] – Speed and power scaling • [Bellas, Hajj, Polychronopoulos, ’99] – Dynamic cache management • [Ghose, Kamble,’99] – Power reduction through sub-banking, etc. • [Inoue, Ishihara, Murakami,’99] – Way predicting set-associative cache • [Kin,Gupta, Mangione-Smith, ’97] – Filter cache • [Ko, Balsara, Nanda, ’98] – Multilevel caches for RISC and CISC • [Wilton, Jouppi, ’94] – CACTI cache model • Many Industrial Low-Power Processors use CAM (content-addressable-memory) • ARM3 – 64-way set-associative – [Furber et. al. ’89] • StrongArm – 32-way set-associative – [Santhanam et. al. ’98] • Intel XScale – 32-way set-associative – ’01 • CAM: Fast and Energy-Efficient
Talk Outline • Structural Comparison • Area and Delay Comparison • Energy Comparison • Related work • Conclusion
Tag Status Data Tag Status Data =? =? Tag Index Offset Set-Associative RAM-tag Cache • Not energy-efficient • All ways are read out • Two-phase approach • More energy-efficient • 2X latency
Cache BUS 128 I/O 32 gwl BUS lwl lwl Tag SRAM Cells Data SRAM Cells Address Decoder Data SRAM Cells Offset Dec. Offset Dec. Tag Comp Sense Amps Sense Amps addr offset offset Set-Associative RAM-tag Sub-bank • Not energy-efficient • All ways are read out • Two-phase approach • More energy-efficient • 2X latency • Sub-banking • 1 sub-bank = 1 way • Low-swing Bitlines • Only for reads, writes performed full-swing • Wordline Gating
Tag Status Data HIT? HIT? CAM-tag Cache Tag Status Data • Only one sub-bank activated • Associativity within sub-bank • Easy to implement high associativity HIT? Word Tag Bank Offset
128 32 gwl lwl lwl I/O BUS SRAM Cells SRAM Cells CAM-tag Array Offset Dec. Offset Dec. Sense Amps Sense Amps tag offset offset CAM-tag Cache Sub-bank • Only one sub-bank activated • Associativity within sub-bank • Easy to implement high associativity
SBit_b Bit Bit_b SBit WL SRAM Mismatch Match SBit_b SBit_b Bit Bit Bit_b Bit_b SBit SBit match WL WL XOR match match 1 1 1 0 1 1 0 0 1 0 1 0 CAM Functionality and Energy Usage • CAM Energy Dissipation • Search Lines • Match Lines • Drivers 10-T CAM Cell With Separate Write/Search Lines And Low-Swing Match Line
32x64 RAM Array 2x12x32 CAM Array CAM-tag Cache Sub-bank Layout 1-KB Cache Sub-bank implemented in 0.25 m CMOS technology • 10% area overhead over RAM-tag cache
Global Wordline Decoding Local Wordline Decoding gwl lwl Decoded offset Tag Comp. Tag bits Tag readout Data out Data readout Tag bits Local Wordline Decoding Tag bits broadcasting gwl lwl Tag Comp. Decoded offset Data out Data readout Delay Comparison RAM tag Cache Critical Path: Index Bits CAM tag Cache Critical Path: Tag bits Within 3% of each other
Hit Energy Comparison Hit Energy per Access for 8KB Cache in pJ Associativity and Implementation
LZW pegwit ijpeg perl gcc m88ksim Miss Rate Results
Total Access Energy (pegwit) Pegwit – High miss rate for high associativity Total Energy per Access for 8KB Cache in pJ Miss Energy Expressed in Multiples of 32-bit Read Access Energy
Total Access Energy (perl) Perl – Very low miss rate for high associativity Total Energy per Access for 8KB Cache in pJ Miss Energy Expressed in Multiples of 32-bit Read Access Energy
Other Advantages of CAM-tag • Hit signal generated earlier • Simplifies pipelines • Simplified store operation • Wordline only enabled during a hit • Stores can happen in a single cycle • No write buffer necessary
Related Work • CACTI and CACTI2 • [Wilton and Jouppi ’94],[Reinman and Jouppi, ’99] • Accurate delay and energy estimate • Results within 10% • Energy estimate not suited for low-power designs • Typical Low-power features not included in CACTI • Sub-banking • Low-swing bitlines • Wordline gating • Separate CAM search line • Low-swing match lines • Energy Estimation 10X greater than our model for one CAM-tag cache sub-bank • Our results closely agree with [Amruthur and Horowitz, 98]
Conclusion • CAM tags – high performance and low-power • Energy consumption of 32-way CAM < 2-way RAM • Easy to implement highly-associative tags • Low area overhead (10%) • Comparable access delay • Better CPI by reducing miss rate
Thank You! http://www.cag.lcs.mit.edu/scale/