Decoupled Compressed Cache: Exploiting Spatial Locality for Energy-Optimized Compressed Caching Somayeh Sardashti and David A. Wood University of Wisconsin-Madison
Where does energy go? Communication vs. computation [Keckler, Micro 2011]: moving data costs roughly 200X the energy of computing on it. Improving cache utilization is critical for energy efficiency!
Compressed Cache: Compress and Compact Blocks
• Higher effective cache size
• Small area overhead
• Higher system performance
• Lower system energy
Previous work limits compression effectiveness:
• Limited number of tags
• High internal fragmentation
• Energy-expensive re-compaction
Decoupled Compressed Cache (DCC): saving system energy by improving LLC utilization through cache compression. DCC addresses these limits with decoupled super-blocks and non-contiguous sub-blocks.
Decoupled Compressed Cache (DCC): outperforms a 2X LLC with only 1.08X LLC area, giving 14% higher performance and 12% lower energy.
Outline • Motivation • Compressed caching • Our Proposals: Decoupled compressed cache • Experimental Results • Conclusions
Uncompressed Caching: a fixed one-to-one tag/data mapping; each tag (A, B, C) maps to exactly one uncompressed data block (A, B, C).
Compressed Caching: compress cache blocks, compact the compressed blocks to make room, and add more tags to increase effective capacity.
(1) Compression: how to compress blocks?
• There are different compression algorithms; the algorithm itself is not the focus of this work.
• But which algorithm is used matters! For example, a compressor may shrink a 64-byte block to 20 bytes.
Compression Potentials: Compression Ratio = Original Size / Compressed Size. A high compression ratio potentially means a large normalized effective cache capacity. [Chart: compression ratio (3.9, 2.8, 1.5 for different algorithms) versus cycles to decompress.] We use C-PACK+Z for the rest of the talk!
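As a quick worked example, using the 64-byte block compressed to 20 bytes from the previous slide (those numbers are just that slide's illustration):

```latex
\text{Compression Ratio} = \frac{\text{Original Size}}{\text{Compressed Size}} = \frac{64\,\text{B}}{20\,\text{B}} = 3.2
```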
(2) Compaction: how to store and find compressed blocks?
• Critical to achieving the compression potential.
• This work focuses on compaction.
Fixed-Size Compressed Cache (FixedC) [Kim, WMPI 2002; Yang, Micro 2002]: every compressed block is stored in a fixed-size slot, so small blocks waste space (internal fragmentation!).
(2) Compaction: Variable-Size Compressed Cache (VSC) [Alameldeen, ISCA 2004]: each compressed block occupies a variable number of contiguous sub-blocks (e.g., block D takes two sub-blocks), which reduces internal fragmentation.
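To make the fragmentation difference concrete, here is a minimal sketch (not from the talk; the 32B FixedC slot and 16B VSC sub-block sizes are illustrative assumptions) comparing how many data bytes each scheme actually allocates for a compressed block:

```cpp
#include <cstdio>

// Illustrative allocation granularities (assumptions, not the talk's exact parameters).
constexpr int kBlockBytes  = 64;  // uncompressed cache block
constexpr int kFixedCSlot  = 32;  // FixedC: one fixed half-block slot per compressed block
constexpr int kVscSubBlock = 16;  // VSC: variable number of 16B sub-blocks

// FixedC: a block either fits in its fixed slot or is stored uncompressed.
int fixedc_bytes(int compressed) {
    return compressed <= kFixedCSlot ? kFixedCSlot : kBlockBytes;
}

// VSC: round the compressed size up to a whole number of sub-blocks.
int vsc_bytes(int compressed) {
    int subs = (compressed + kVscSubBlock - 1) / kVscSubBlock;
    return subs * kVscSubBlock;
}

int main() {
    // e.g. a 10B compressed block occupies 32B under FixedC but only 16B under VSC.
    for (int size : {10, 20, 33, 50}) {
        std::printf("compressed %2dB -> FixedC %2dB, VSC %2dB\n",
                    size, fixedc_bytes(size), vsc_bytes(size));
    }
    return 0;
}
```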
Previous Compressed Caches:
(Limit 1) Limited tags/metadata: adding 4X or more tags causes high area overhead.
(Limit 2) Internal fragmentation: low cache capacity utilization, e.g., a 10B compressed block still occupies a full 16B sub-block.
[Chart: normalized effective capacity of prior designs versus the 3.9 potential.]
Normalized Effective Capacity = Number of Valid Blocks in the LLC / Max Number of (Uncompressed) Blocks
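For intuition, a hypothetical worked example of this metric (the counts below are made up, not results from the talk): if the LLC can hold at most 512 uncompressed blocks but compression lets it keep 1280 valid blocks, then

```latex
\text{Normalized Effective Capacity} = \frac{\text{valid blocks in LLC}}{\text{max uncompressed blocks}} = \frac{1280}{512} = 2.5
```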
(Limit 3) Energy-expensive re-compaction: VSC requires energy-expensive re-compaction. When an update grows a block (e.g., B now needs 2 sub-blocks), the blocks stored after it must be shifted to make room, costing 3X higher LLC dynamic energy!
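A minimal sketch of why this is costly, assuming a VSC-like set where compressed blocks occupy contiguous sub-blocks in tag order (the layout and counts are illustrative, not the exact design):

```cpp
#include <cstdio>
#include <vector>

// One VSC-style set: blocks are stored as contiguous runs of sub-blocks, in order.
// sizes[i] is the number of sub-blocks block i currently occupies.
struct VscSet {
    std::vector<int> sizes;
    int sub_block_moves = 0;  // crude proxy for data-array dynamic energy

    // Growing block i by 'delta' sub-blocks forces every later block's sub-blocks
    // to shift to make room: that data movement is the re-compaction cost.
    void grow_block(int i, int delta) {
        for (std::size_t j = i + 1; j < sizes.size(); ++j)
            sub_block_moves += sizes[j];  // each later sub-block is re-read and re-written
        sizes[i] += delta;
    }
};

int main() {
    VscSet set{{1, 1, 2, 1}};      // blocks A, B, C, D
    set.grow_block(1, 1);          // an update makes B grow from 1 to 2 sub-blocks
    std::printf("sub-blocks moved: %d\n", set.sub_block_moves);  // C and D: 3 moves
    return 0;
}
```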
Outline • Motivation • Compressed caching • Our Proposals: Decoupled compressed cache • Experimental Results • Conclusions
Decoupled Compressed Cache
(1) Exploiting spatial locality
• Low area overhead
(2) Decoupling the tag/data mapping
• Eliminates energy-expensive re-compaction
• Reduces internal fragmentation
(3) Co-DCC: dynamically co-compacting super-blocks
• Further reduces internal fragmentation
(1) Exploiting Spatial Locality: neighboring blocks co-reside in the LLC (about 89% on average).
(1) Exploiting Spatial Locality: DCC tracks LLC blocks at super-block granularity. One super-block tag covers four neighboring blocks and keeps per-block state, e.g., quad Q holds blocks A, B, C, D, while a lone block such as E is a singleton (S). Super tags thus track up to 4X as many blocks with low area overhead!
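A rough sketch of what a decoupled super-block tag entry might hold, based on the slide; the field widths and the compressed-size encoding are assumptions, not the paper's exact layout:

```cpp
#include <cstdint>

// One super-block tag covers four aligned, neighboring cache blocks (a "quad"),
// so tracking up to 4x as many blocks costs one shared address tag plus small
// per-block metadata instead of four full tags.
struct SuperBlockTag {
    uint64_t super_tag;     // address tag of the aligned four-block region
    uint8_t  state[4];      // per-block coherence/validity state (blocks 0..3)
    uint8_t  comp_size[4];  // per-block compressed size, in sub-blocks (assumed encoding)
};
// A "singleton" (like block E on the slide) is simply a super-block tag with only
// one of its four blocks valid; a "quad" (A, B, C, D) has all four valid.
```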
(2) Decoupling the tag/data mapping: DCC decouples the tag/data mapping to eliminate re-compaction. A block's sub-blocks can be allocated flexibly anywhere in the set, so when an update grows block B, its sub-blocks (B1, B2) simply take any free sub-blocks instead of shifting the neighboring blocks.
(2) Decoupling the tag/data mapping: back pointers identify the owner block of each sub-block. Every data sub-block carries a back pointer holding the owning super-block's tag ID and the block ID within that super-block.
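A hedged sketch of a per-sub-block back pointer; the bit widths assume 64B blocks split into 16B sub-blocks and a bounded number of super-block tags per set, which are illustrative assumptions:

```cpp
#include <cstdint>

// Each data sub-block carries a small back pointer instead of the tag dictating
// a fixed data location. The owner is named by (super-block tag, block within it).
struct BackPointer {
    uint8_t valid  : 1;  // this sub-block currently belongs to some block
    uint8_t tag_id : 4;  // which super-block tag in the set owns it (assumes <=16 tags/set)
    uint8_t blk_id : 2;  // which of the four blocks within that super-block
};
// Because ownership is recorded per sub-block, a block's sub-blocks may sit anywhere
// in the set (non-contiguously). Growing a block just claims free sub-blocks and
// rewrites their back pointers; neighboring blocks never need to be re-compacted.
```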
(3) Co-compacting super-blocks: Co-DCC dynamically co-compacts super-blocks, packing the compressed blocks of a quad (A, B, C, D) into shared sub-blocks to further reduce internal fragmentation.
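A small sketch of the co-compaction idea under hypothetical compressed sizes: instead of rounding each block of the quad up to its own sub-blocks, the quad's compressed blocks are packed back-to-back so only the super-block as a whole is rounded up:

```cpp
#include <cstdio>

constexpr int kSubBlockBytes = 16;  // assumed sub-block size

int subs_needed(int bytes) { return (bytes + kSubBlockBytes - 1) / kSubBlockBytes; }

int main() {
    int quad[4] = {10, 20, 9, 14};  // hypothetical compressed sizes of A, B, C, D (bytes)

    int per_block_subs = 0, total_bytes = 0;
    for (int b : quad) { per_block_subs += subs_needed(b); total_bytes += b; }
    int co_compacted_subs = subs_needed(total_bytes);

    // Per-block rounding: 1 + 2 + 1 + 1 = 5 sub-blocks.
    // Co-compacted quad: 53 bytes -> 4 sub-blocks.
    std::printf("per-block: %d sub-blocks, co-compacted: %d sub-blocks\n",
                per_block_subs, co_compacted_subs);
    return 0;
}
```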
Outline • Motivation • Compressed caching • Our Proposals: Decoupled compressed cache • Experimental Results • Conclusions
Experimental Methodology
• Integrated DCC with the AMD Bulldozer cache.
• We model the timing and allocation constraints of sequential regions at the LLC in detail; no alignment network is needed.
• Verilog implementation and synthesis of the tag-match and sub-block-selection logic.
• One additional cycle of latency due to sub-block selection.
Experimental Methodology
• Full-system simulation with a simulator based on GEMS.
• Wide range of applications with different levels of cache sensitivity:
• Commercial workloads: apache, jbb, oltp, zeus
• Spec-OMP: ammp, applu, equake, mgrid, wupwise
• Parsec: blackscholes, canneal, freqmine
• Spec 2006 mixes (m1-m8): bzip2, libquantum-bzip2, libquantum, gcc, astar-bwaves, cactus-mcf-milc-bwaves, gcc-omnetpp-mcf-bwaves-lbm-milc-cactus-bzip, omnetpp-lbm
Effective LLC Capacity: [Chart: normalized effective LLC capacity versus normalized LLC area for the Baseline, 2X Baseline, FixedC, VSC, DCC, and Co-DCC.]
(Co-)DCC Performance: [Chart: runtime normalized to the baseline across the compressed-cache designs.] DCC and Co-DCC boost system performance significantly.
(Co-)DCC Energy Consumption: [Chart: system energy normalized to the baseline across the compressed-cache designs.] DCC and Co-DCC reduce system energy by reducing the number of accesses to main memory.
Summary
• Analyzed the limits of compressed caching: limited number of tags, internal fragmentation, and energy-expensive re-compaction.
• Decoupled Compressed Cache improves the performance and energy of compressed caching through decoupled super-blocks and non-contiguous sub-blocks.
• Co-DCC further reduces internal fragmentation.
• Practical designs [details in the paper].
Backup • (De-)Compression overhead • DCC data array organization with AMD Bulldozer • DCC Timing • DCC Lookup • Applications • Co-DCC design • LLC effective capacity • LLC miss rate • Memory dynamic energy • LLC dynamic energy
DCC Lookup
• Access the super tags and back pointers in parallel.
• Find the back pointers that match the requested block (e.g., a read of C matches the pointers for its sub-blocks C0 and C1).
• Read the corresponding sub-blocks and decompress.
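A simplified software sketch of this lookup flow; the set layout, field widths, and the decompressor stand-in are assumptions for illustration (the hardware does the tag and back-pointer searches in parallel rather than sequentially):

```cpp
#include <cstdint>
#include <optional>
#include <vector>

struct SuperBlockTag { uint64_t super_tag; uint8_t state[4]; };
struct BackPointer   { bool valid; uint8_t tag_id; uint8_t blk_id; };

struct Set {
    std::vector<SuperBlockTag>        tags;        // super tags of this set
    std::vector<BackPointer>          bptrs;       // one per data sub-block
    std::vector<std::vector<uint8_t>> sub_blocks;  // the 16B data sub-blocks
};

// Stand-in for the real decompressor (C-PACK+Z in the talk).
std::vector<uint8_t> decompress(const std::vector<uint8_t>& compressed) { return compressed; }

std::optional<std::vector<uint8_t>> dcc_lookup(const Set& set,
                                               uint64_t super_tag, uint8_t blk_id) {
    // 1) Search the super tags for a matching super-block with this block valid.
    int tag_id = -1;
    for (std::size_t t = 0; t < set.tags.size(); ++t)
        if (set.tags[t].super_tag == super_tag && set.tags[t].state[blk_id] != 0)
            tag_id = static_cast<int>(t);
    if (tag_id < 0) return std::nullopt;  // miss

    // 2) Find the back pointers naming this (tag, block): they locate the block's
    //    sub-blocks, wherever they sit in the set (here gathered in sub-block order).
    std::vector<uint8_t> compressed;
    for (std::size_t s = 0; s < set.bptrs.size(); ++s) {
        const BackPointer& bp = set.bptrs[s];
        if (bp.valid && bp.tag_id == tag_id && bp.blk_id == blk_id)
            compressed.insert(compressed.end(),
                              set.sub_blocks[s].begin(), set.sub_blocks[s].end());
    }

    // 3) Read the matching sub-blocks and decompress.
    return decompress(compressed);
}
```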
Applications are categorized by behavior: sensitive to both cache capacity and latency, sensitive to cache latency only, sensitive to cache capacity only, or cache insensitive.