Decoupled Compressed Cache: Exploiting Spatial Locality for Energy-Optimized Compressed Caching Somayeh Sardashti and David A. Wood University of Wisconsin-Madison
Where does energy go? Communication vs. computation [Keckler, Micro 2011]: moving data costs roughly 200X the energy of computing on it. Improving cache utilization is critical for energy efficiency!
Compressed Cache: Compress and Compact Blocks
• Higher effective cache size
• Small area overhead
• Higher system performance
• Lower system energy
Previous work limits compression effectiveness:
• Limited number of tags
• High internal fragmentation
• Energy-expensive re-compaction
Decoupled Compressed Cache (DCC): saving system energy by improving LLC utilization through cache compression. DCC addresses these limits with decoupled super-blocks and non-contiguous sub-blocks.
Decoupled Compressed Cache (DCC): outperforms a 2X LLC with only 1.08X LLC area, giving 14% higher performance and 12% lower energy.
Outline • Motivation • Compressed caching • Our Proposals: Decoupled compressed cache • Experimental Results • Conclusions
Uncompressed Caching: a fixed one-to-one tag/data mapping; each tag (A, B, C) maps to exactly one uncompressed data block (A, B, C).
Compressed Caching: compress cache blocks, compact the compressed blocks to make room, and add more tags to increase effective capacity.
(1) Compression: how to compress blocks?
• There are different compression algorithms; the algorithm itself is not the focus of this work.
• But which algorithm is used matters! For example, a compressor may shrink a 64-byte block to 20 bytes.
Compression Potentials: Compression Ratio = Original Size / Compressed Size. A high compression ratio potentially means a large normalized effective cache capacity. [Chart: compression ratio (3.9, 2.8, 1.5 for different algorithms) versus cycles to decompress.] We use C-PACK+Z for the rest of the talk!
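As a quick worked example, using the 64-byte block compressed to 20 bytes from the previous slide (those numbers are just that slide's illustration):

```latex
\text{Compression Ratio} = \frac{\text{Original Size}}{\text{Compressed Size}} = \frac{64\,\text{B}}{20\,\text{B}} = 3.2
```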
(2) Compaction: how to store and find compressed blocks?
• Critical to achieving the compression potential.
• This work focuses on compaction.
Fixed-Size Compressed Cache (FixedC) [Kim, WMPI 2002; Yang, Micro 2002]: every compressed block is stored in a fixed-size slot, so small blocks waste space (internal fragmentation!).
(2) Compaction: Variable-Size Compressed Cache (VSC) [Alameldeen, ISCA 2004]: each compressed block occupies a variable number of contiguous sub-blocks (e.g., block D takes two sub-blocks), which reduces internal fragmentation.
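To make the fragmentation difference concrete, here is a minimal sketch (not from the talk; the 32B FixedC slot and 16B VSC sub-block sizes are illustrative assumptions) comparing how many data bytes each scheme actually allocates for a compressed block:

```cpp
#include <cstdio>

// Illustrative allocation granularities (assumptions, not the talk's exact parameters).
constexpr int kBlockBytes  = 64;  // uncompressed cache block
constexpr int kFixedCSlot  = 32;  // FixedC: one fixed half-block slot per compressed block
constexpr int kVscSubBlock = 16;  // VSC: variable number of 16B sub-blocks

// FixedC: a block either fits in its fixed slot or is stored uncompressed.
int fixedc_bytes(int compressed) {
    return compressed <= kFixedCSlot ? kFixedCSlot : kBlockBytes;
}

// VSC: round the compressed size up to a whole number of sub-blocks.
int vsc_bytes(int compressed) {
    int subs = (compressed + kVscSubBlock - 1) / kVscSubBlock;
    return subs * kVscSubBlock;
}

int main() {
    // e.g. a 10B compressed block occupies 32B under FixedC but only 16B under VSC.
    for (int size : {10, 20, 33, 50}) {
        std::printf("compressed %2dB -> FixedC %2dB, VSC %2dB\n",
                    size, fixedc_bytes(size), vsc_bytes(size));
    }
    return 0;
}
```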
Previous Compressed Caches:
(Limit 1) Limited tags/metadata: adding 4X or more tags causes high area overhead.
(Limit 2) Internal fragmentation: low cache capacity utilization, e.g., a 10B compressed block still occupies a full 16B sub-block.
[Chart: normalized effective capacity of prior designs versus the 3.9 potential.]
Normalized Effective Capacity = Number of Valid Blocks in the LLC / Max Number of (Uncompressed) Blocks
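For intuition, a hypothetical worked example of this metric (the counts below are made up, not results from the talk): if the LLC can hold at most 512 uncompressed blocks but compression lets it keep 1280 valid blocks, then

```latex
\text{Normalized Effective Capacity} = \frac{\text{valid blocks in LLC}}{\text{max uncompressed blocks}} = \frac{1280}{512} = 2.5
```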
(Limit 3) Energy-expensive re-compaction: VSC requires energy-expensive re-compaction. When an update grows a block (e.g., B now needs 2 sub-blocks), the blocks stored after it must be shifted to make room, costing 3X higher LLC dynamic energy!
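A minimal sketch of why this is costly, assuming a VSC-like set where compressed blocks occupy contiguous sub-blocks in tag order (the layout and counts are illustrative, not the exact design):

```cpp
#include <cstdio>
#include <vector>

// One VSC-style set: blocks are stored as contiguous runs of sub-blocks, in order.
// sizes[i] is the number of sub-blocks block i currently occupies.
struct VscSet {
    std::vector<int> sizes;
    int sub_block_moves = 0;  // crude proxy for data-array dynamic energy

    // Growing block i by 'delta' sub-blocks forces every later block's sub-blocks
    // to shift to make room: that data movement is the re-compaction cost.
    void grow_block(int i, int delta) {
        for (std::size_t j = i + 1; j < sizes.size(); ++j)
            sub_block_moves += sizes[j];  // each later sub-block is re-read and re-written
        sizes[i] += delta;
    }
};

int main() {
    VscSet set{{1, 1, 2, 1}};      // blocks A, B, C, D
    set.grow_block(1, 1);          // an update makes B grow from 1 to 2 sub-blocks
    std::printf("sub-blocks moved: %d\n", set.sub_block_moves);  // C and D: 3 moves
    return 0;
}
```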
Outline • Motivation • Compressed caching • Our Proposals: Decoupled compressed cache • Experimental Results • Conclusions
Decoupled Compressed Cache
(1) Exploiting spatial locality
• Low area overhead
(2) Decoupling the tag/data mapping
• Eliminates energy-expensive re-compaction
• Reduces internal fragmentation
(3) Co-DCC: dynamically co-compacting super-blocks
• Further reduces internal fragmentation
(1) Exploiting Spatial Locality: neighboring blocks co-reside in the LLC (about 89% on average).
(1) Exploiting Spatial Locality: DCC tracks LLC blocks at super-block granularity. One super-block tag covers four neighboring blocks and keeps per-block state, e.g., quad Q holds blocks A, B, C, D, while a lone block such as E is a singleton (S). Super tags thus track up to 4X as many blocks with low area overhead!
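A rough sketch of what a decoupled super-block tag entry might hold, based on the slide; the field widths and the compressed-size encoding are assumptions, not the paper's exact layout:

```cpp
#include <cstdint>

// One super-block tag covers four aligned, neighboring cache blocks (a "quad"),
// so tracking up to 4x as many blocks costs one shared address tag plus small
// per-block metadata instead of four full tags.
struct SuperBlockTag {
    uint64_t super_tag;     // address tag of the aligned four-block region
    uint8_t  state[4];      // per-block coherence/validity state (blocks 0..3)
    uint8_t  comp_size[4];  // per-block compressed size, in sub-blocks (assumed encoding)
};
// A "singleton" (like block E on the slide) is simply a super-block tag with only
// one of its four blocks valid; a "quad" (A, B, C, D) has all four valid.
```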
(2) Decoupling the tag/data mapping: DCC decouples the tag/data mapping to eliminate re-compaction. A block's sub-blocks can be allocated flexibly anywhere in the set, so when an update grows block B, its sub-blocks (B1, B2) simply take any free sub-blocks instead of shifting the neighboring blocks.
(2) Decoupling the tag/data mapping: back pointers identify the owner block of each sub-block. Every data sub-block carries a back pointer holding the owning super-block's tag ID and the block ID within that super-block.
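A hedged sketch of a per-sub-block back pointer; the bit widths assume 64B blocks split into 16B sub-blocks and a bounded number of super-block tags per set, which are illustrative assumptions:

```cpp
#include <cstdint>

// Each data sub-block carries a small back pointer instead of the tag dictating
// a fixed data location. The owner is named by (super-block tag, block within it).
struct BackPointer {
    uint8_t valid  : 1;  // this sub-block currently belongs to some block
    uint8_t tag_id : 4;  // which super-block tag in the set owns it (assumes <=16 tags/set)
    uint8_t blk_id : 2;  // which of the four blocks within that super-block
};
// Because ownership is recorded per sub-block, a block's sub-blocks may sit anywhere
// in the set (non-contiguously). Growing a block just claims free sub-blocks and
// rewrites their back pointers; neighboring blocks never need to be re-compacted.
```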
(3) Co-compacting super-blocks: Co-DCC dynamically co-compacts super-blocks, packing the compressed blocks of a quad (A, B, C, D) into shared sub-blocks to further reduce internal fragmentation.
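A small sketch of the co-compaction idea under hypothetical compressed sizes: instead of rounding each block of the quad up to its own sub-blocks, the quad's compressed blocks are packed back-to-back so only the super-block as a whole is rounded up:

```cpp
#include <cstdio>

constexpr int kSubBlockBytes = 16;  // assumed sub-block size

int subs_needed(int bytes) { return (bytes + kSubBlockBytes - 1) / kSubBlockBytes; }

int main() {
    int quad[4] = {10, 20, 9, 14};  // hypothetical compressed sizes of A, B, C, D (bytes)

    int per_block_subs = 0, total_bytes = 0;
    for (int b : quad) { per_block_subs += subs_needed(b); total_bytes += b; }
    int co_compacted_subs = subs_needed(total_bytes);

    // Per-block rounding: 1 + 2 + 1 + 1 = 5 sub-blocks.
    // Co-compacted quad: 53 bytes -> 4 sub-blocks.
    std::printf("per-block: %d sub-blocks, co-compacted: %d sub-blocks\n",
                per_block_subs, co_compacted_subs);
    return 0;
}
```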
Outline • Motivation • Compressed caching • Our Proposals: Decoupled compressed cache • Experimental Results • Conclusions
Experimental Methodology
• Integrated DCC with the AMD Bulldozer cache.
• We model the timing and allocation constraints of sequential regions at the LLC in detail; no alignment network is needed.
• Verilog implementation and synthesis of the tag-match and sub-block-selection logic.
• One additional cycle of latency due to sub-block selection.
Experimental Methodology
• Full-system simulation with a simulator based on GEMS.
• Wide range of applications with different levels of cache sensitivity:
• Commercial workloads: apache, jbb, oltp, zeus
• Spec-OMP: ammp, applu, equake, mgrid, wupwise
• Parsec: blackscholes, canneal, freqmine
• Spec 2006 mixes (m1-m8): bzip2, libquantum-bzip2, libquantum, gcc, astar-bwaves, cactus-mcf-milc-bwaves, gcc-omnetpp-mcf-bwaves-lbm-milc-cactus-bzip, omnetpp-lbm
Effective LLC Capacity: [Chart: normalized effective LLC capacity versus normalized LLC area for the Baseline, 2X Baseline, FixedC, VSC, DCC, and Co-DCC.]
(Co-)DCC Performance: [Chart: runtime normalized to the baseline across the compressed-cache designs.] DCC and Co-DCC boost system performance significantly.
(Co-)DCC Energy Consumption: [Chart: system energy normalized to the baseline across the compressed-cache designs.] DCC and Co-DCC reduce system energy by reducing the number of accesses to main memory.
Summary
• Analyzed the limits of compressed caching: limited number of tags, internal fragmentation, and energy-expensive re-compaction.
• Decoupled Compressed Cache improves the performance and energy of compressed caching through decoupled super-blocks and non-contiguous sub-blocks.
• Co-DCC further reduces internal fragmentation.
• Practical designs [details in the paper].
Backup • (De-)Compression overhead • DCC data array organization with AMD Bulldozer • DCC Timing • DCC Lookup • Applications • Co-DCC design • LLC effective capacity • LLC miss rate • Memory dynamic energy • LLC dynamic energy
DCC Lookup
• Access the super tags and back pointers in parallel.
• Find the back pointers that match the requested block (e.g., a read of C matches the pointers for its sub-blocks C0 and C1).
• Read the corresponding sub-blocks and decompress.
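A simplified software sketch of this lookup flow; the set layout, field widths, and the decompressor stand-in are assumptions for illustration (the hardware does the tag and back-pointer searches in parallel rather than sequentially):

```cpp
#include <cstdint>
#include <optional>
#include <vector>

struct SuperBlockTag { uint64_t super_tag; uint8_t state[4]; };
struct BackPointer   { bool valid; uint8_t tag_id; uint8_t blk_id; };

struct Set {
    std::vector<SuperBlockTag>        tags;        // super tags of this set
    std::vector<BackPointer>          bptrs;       // one per data sub-block
    std::vector<std::vector<uint8_t>> sub_blocks;  // the 16B data sub-blocks
};

// Stand-in for the real decompressor (C-PACK+Z in the talk).
std::vector<uint8_t> decompress(const std::vector<uint8_t>& compressed) { return compressed; }

std::optional<std::vector<uint8_t>> dcc_lookup(const Set& set,
                                               uint64_t super_tag, uint8_t blk_id) {
    // 1) Search the super tags for a matching super-block with this block valid.
    int tag_id = -1;
    for (std::size_t t = 0; t < set.tags.size(); ++t)
        if (set.tags[t].super_tag == super_tag && set.tags[t].state[blk_id] != 0)
            tag_id = static_cast<int>(t);
    if (tag_id < 0) return std::nullopt;  // miss

    // 2) Find the back pointers naming this (tag, block): they locate the block's
    //    sub-blocks, wherever they sit in the set (here gathered in sub-block order).
    std::vector<uint8_t> compressed;
    for (std::size_t s = 0; s < set.bptrs.size(); ++s) {
        const BackPointer& bp = set.bptrs[s];
        if (bp.valid && bp.tag_id == tag_id && bp.blk_id == blk_id)
            compressed.insert(compressed.end(),
                              set.sub_blocks[s].begin(), set.sub_blocks[s].end());
    }

    // 3) Read the matching sub-blocks and decompress.
    return decompress(compressed);
}
```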
Applications are categorized by behavior: sensitive to both cache capacity and latency, sensitive to cache latency only, sensitive to cache capacity only, or cache insensitive.