Line Distillation: Increasing Cache Capacity by Filtering Unused Words in Cache Lines

Moinuddin K. Qureshi, M. Aater Suleman, Yale N. Patt. HPCA 2007

Introduction • Caches are organized at linesize granularity (helps when spatial locality is high; leaves unused words when spatial locality is low) • Unused words occupy space without contributing to cache hits • Filtering unused words allows the cache to store more cache lines

Problem: Not all words are useful • Cache line (64B) divided into 8 words of 8B each (1 MB 8-way L2 cache) • [Figure: average words used per line] • On average, fewer than 60% of the words in a line are used (4.7 of 8)

Goal: Improving cache performance • Smaller linesize can result in fewer unused words • However, smaller linesize degrades cache performance • Linesize of 32B increases MPKI for 14 of 16 benchmarks • Average MPKI increases by 25% • Goal: Improve cache performance by filtering unused words • Insight: Word usage stabilizes as a line moves from MRU to LRU

Insight • Footprint = 8 bits per line that track word usage • [Figure: recency stack from MRU (Pos 1) to LRU showing the maximum recency position before a footprint update; percentages shown: 78%, 5%, 6%, 11%] • Most footprint updates occur early in the recency stack • Line Distillation (LDIS): Evict unused words when a line crosses a certain recency position
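The footprint idea on this slide can be illustrated with a short sketch. It assumes the 8-word line from the earlier slide and models the footprint as an 8-bit integer; the function names (`touch`, `used_words`, `distill`) are hypothetical and not taken from the talk.

```python
WORDS_PER_LINE = 8  # 64B line / 8B words, as stated on the earlier slide


def touch(footprint: int, word: int) -> int:
    """Return the footprint with the bit for an accessed word (0..7) set."""
    if not 0 <= word < WORDS_PER_LINE:
        raise ValueError("word index out of range")
    return footprint | (1 << word)


def used_words(footprint: int) -> list[int]:
    """List the word indices whose footprint bit is set."""
    return [w for w in range(WORDS_PER_LINE) if footprint & (1 << w)]


def distill(footprint: int) -> tuple[list[int], list[int]]:
    """Split a line's words into (kept, filtered) according to its footprint."""
    kept = used_words(footprint)
    filtered = [w for w in range(WORDS_PER_LINE) if w not in kept]
    return kept, filtered


# A line whose words 0 and 7 were accessed (word 0 twice).
fp = 0
for w in (0, 7, 0):
    fp = touch(fp, w)
kept, filtered = distill(fp)
```

Here `kept` holds the used words and `filtered` the words that distillation would drop once the line crosses the chosen recency position.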

Outline • Background • Line Distillation • Experimental Evaluation • Interaction with Compression • Related Work and Summary

Framework for LDIS • [Figure: PROCESSOR with ICACHE and DCACHE (sectored) above the L2 cache, which is a Distill Cache made of a Line Organized Cache (LOC) and a Word Organized Cache (WOC); labels: line from memory, valid bits, footprint]

Distill Cache (Operation) • [Figure: traditional cache (4-way) next to the LOC and WOC, with lines A, B, C, D ordered from MRU to LRU; only words A0 and A7 of line A are used] • Distillation of line A: Evict A[1:6], install A0, A7 in WOC • Four cases: • Cache Miss (access to line D): Install line D in LOC and update LRU state • LOC Hit (access to line B): Same as traditional cache • WOC Hit (access to line A, word A0): Send A0 and A7 to L1 along with valid bits • Hole Miss (access to line A, word A1): Invalidate all words of A in WOC, fetch A from memory, and install it in LOC
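The four access cases can be sketched as a small lookup routine. This is a minimal model under stated assumptions: the LOC is a set of line tags, the WOC is a dict from line tag to its valid word indices, and capacity, replacement, and LRU updates are ignored. The names `Outcome`, `classify`, and `access` are hypothetical.

```python
from enum import Enum


class Outcome(Enum):
    LOC_HIT = "LOC hit"
    WOC_HIT = "WOC hit"
    HOLE_MISS = "hole miss"
    CACHE_MISS = "cache miss"


def classify(line, word, loc, woc):
    """Classify an access. loc: set of line tags; woc: dict tag -> set of valid words."""
    if line in loc:
        return Outcome.LOC_HIT
    if line in woc:
        # The line is partially present: hit only if the requested word survived.
        return Outcome.WOC_HIT if word in woc[line] else Outcome.HOLE_MISS
    return Outcome.CACHE_MISS


def access(line, word, loc, woc):
    """Apply the state change for each case (capacity and LRU are not modeled)."""
    outcome = classify(line, word, loc, woc)
    if outcome is Outcome.HOLE_MISS:
        del woc[line]   # invalidate all words of the line in the WOC
        loc.add(line)   # refetch the full line and install it in the LOC
    elif outcome is Outcome.CACHE_MISS:
        loc.add(line)   # install the fetched line in the LOC
    return outcome


# State loosely following the slide: B and C in the LOC, A distilled to words 0 and 7.
loc = {"B", "C"}
woc = {"A": {0, 7}}
results = [access("D", 0, loc, woc), access("B", 3, loc, woc),
           access("A", 0, loc, woc), access("A", 1, loc, woc)]
```

After the final access, line A has moved from the WOC back to the LOC, which mirrors the hole-miss handling described on the slide.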

Median Threshold Filtering • A line with many used words can evict several lines from WOC • [Figure: WOC holding A0, E0, F0, G0, B0, H0, C0, D0 is overwritten by X0 to X7] Line X has all 8 words used, so installing it evicts 8 lines from WOC • Increase lines in WOC by not installing lines for which used words > threshold “K” • K = median number of words used per LOC line (computed at runtime)
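The filtering rule can be sketched as follows. The slide says K is computed at runtime but does not say how; this sketch assumes, purely for illustration, that a sample of used-word counts is available and takes its median with Python's `statistics.median`. The function names are hypothetical.

```python
from statistics import median


def median_threshold(used_counts):
    """K = median of the used-word counts observed for LOC lines (assumed sample)."""
    return median(used_counts)


def should_install_in_woc(footprint: int, k) -> bool:
    """Install a distilled line only if its used-word count does not exceed K."""
    used = bin(footprint).count("1")
    return used <= k


# Hypothetical sample of used-word counts for seven LOC lines.
k = median_threshold([1, 2, 2, 3, 8, 8, 1])
install_sparse = should_install_in_woc(0b10000001, k)  # 2 used words
install_dense = should_install_in_woc(0b11111111, k)   # all 8 words used
```

With this sample, the sparse line is admitted and the fully used line is filtered out, so it cannot displace several smaller lines from the WOC.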

Methodology • Configuration: • L2 cache: 1MB 8-way 64B linesize (Distill cache gives 6 ways to LOC and 2 ways to WOC) • Out-of-order processor with 16KB 2-way L1s • 400-cycle memory • Benchmarks: • 15 SPEC2K benchmarks + health from the Olden suite (a 250M-instruction slice selected using SimPoint for SPEC2K)

Results • [Figure: (%) reduction in L2 MPKI for LDIS (No MT) and LDIS (with MT)] • LDIS (MT) reduces MPKI by 25%

Reverter Circuit (RC) • Tournament selection: Distill cache vs. traditional cache • Dynamic set sampling with 32 sets [Qureshi+ ISCA’06] (storage overhead of ATD: 1KB) • [Figure: distill cache with sets A to H; ATD-LRU holds the sampled sets B, E, G; a saturating counter SCTR is incremented (+) or decremented (-) by comparing the two] • For sets A, C, D, F, H: if (SCTR > 75%) Enable LDIS; if (SCTR < 25%) Disable LDIS
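The enable/disable decision can be sketched with a saturating counter. Several details here are assumptions, not statements from the slide: the counter width (7 bits), the starting value (midpoint), the initial state (LDIS enabled), and the update direction (increment when only the ATD-LRU misses, decrement when only the distill cache misses). Between the 25% and 75% thresholds the previous decision is kept.

```python
class Reverter:
    """Saturating-counter tournament between the distill cache and ATD-LRU."""

    def __init__(self, bits: int = 7):       # counter width is an assumption
        self.max = (1 << bits) - 1
        self.sctr = self.max // 2             # start at the midpoint (assumption)
        self.ldis_enabled = True              # initial state is an assumption

    def sampled_set_outcome(self, distill_miss: bool, atd_lru_miss: bool) -> None:
        """Update SCTR from one access to a sampled set."""
        if atd_lru_miss and not distill_miss:
            self.sctr = min(self.max, self.sctr + 1)   # LDIS did better
        elif distill_miss and not atd_lru_miss:
            self.sctr = max(0, self.sctr - 1)          # traditional LRU did better
        self._update_decision()

    def _update_decision(self) -> None:
        """Apply the thresholds from the slide; keep the old decision in between."""
        if self.sctr > 0.75 * self.max:
            self.ldis_enabled = True
        elif self.sctr < 0.25 * self.max:
            self.ldis_enabled = False


rc = Reverter()
for _ in range(32):                 # distill cache keeps losing in the sampled sets
    rc.sampled_set_outcome(distill_miss=True, atd_lru_miss=False)
disabled_after_losses = not rc.ldis_enabled
```

The follower sets (A, C, D, F, H on the slide) would consult `ldis_enabled` to decide whether to distill lines.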

Results with RC • [Figure: (%) reduction in L2 MPKI for LDIS (MT, No RC) and LDIS (MT, RC)] • RC disables LDIS when it increases MPKI • LDIS (MT, RC) reduces MPKI by 30%

Overheads • Storage • Tags for WOC + footprint bits: 12.2% overhead • Latency • Tag-access (LOC+WOC) increases by one cycle • WOC hits incur two cycles to rearrange words • Power • Additional power of WOC tag-store

IPC Results • [Figure: (%) IPC improvement] • LDIS improves average IPC by 12%

Compression vs. LDIS • Several proposals increase capacity via compression • Compression and LDIS are fundamentally different • Compression exploits redundancy in stored data • LDIS leverages unused words for spare capacity • Footprint Aware Compression (FAC) combines both • FAC compresses used words before installing them in WOC

Results for FAC • [Figure: (%) reduction in L2 MPKI, axis from 0 to 50, for Compression, LDIS, and FAC] • Compression and LDIS interact positively • FAC reduces MPKI by 50%

Related work • Spatial-Temporal Cache: Gonzales+ [ICS’95] • Spatial Locality Prediction: Johnson+ [ISCA’97] • Variable Linesize Cache: Veidenbaum+ [ICS’99] • Spatial Footprint Prediction: Kumar+ [ISCA’98], Pujara+ [HPCA’06] • Spatial Pattern Prediction: Chen+ [HPCA’05] • LDIS is particularly suited for large caches and outperforms predictor-based techniques without requiring a separate structure for tracking spatial footprint

Contributions • Line Distillation: Filter unused words without a separate footprint predictor • Distill cache: Utilize extra capacity created by LDIS • Median Threshold Filtering and Reverter Circuit: Improve performance and robustness of LDIS (Result: LDIS (MT+RC) reduces MPKI by 30%) • Footprint Aware Compression: LDIS + compression (Result: FAC reduces MPKI by 50%)

Questions

Result comparing capacity

Line Size vs. MPKI

Distribution of Hit-Miss

Average words usage (detailed)

Result for 3 types of LDIS

Replacement • LRU in LOC • WOC needs variable sized replacement • Only power-of-two sizes allowed in WOC • Placement constrained to alignment boundary • Random selection in case of multiple candidates
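The WOC placement constraints listed above can be sketched as follows. This assumes a WOC set of 16 words (2 ways of 8 words, following the methodology slide) and assumes that a line's used-word count is rounded up to the next power of two; both are illustrative assumptions, and the function names are hypothetical.

```python
import random

WOC_WORDS_PER_SET = 16  # assumption: 2 WOC ways x 8 words per way


def alloc_size(used: int) -> int:
    """Round a used-word count (1..8) up to the next power of two."""
    if not 1 <= used <= 8:
        raise ValueError("used-word count must be between 1 and 8")
    size = 1
    while size < used:
        size *= 2
    return size


def candidate_slots(size: int, num_words: int = WOC_WORDS_PER_SET) -> list[int]:
    """Starting word positions that respect the alignment boundary for this size."""
    return list(range(0, num_words, size))


def choose_slot(size: int, rng=random, num_words: int = WOC_WORDS_PER_SET) -> int:
    """Pick one aligned candidate at random, as the slide describes."""
    return rng.choice(candidate_slots(size, num_words))


size = alloc_size(3)                  # 3 used words need a 4-word slot
slots = candidate_slots(size)         # aligned starting positions in the set
victim_start = choose_slot(size, random.Random(0))
```

Restricting sizes to powers of two and placement to aligned boundaries keeps the number of candidate positions small, so a random choice among them stays cheap.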

Background (pictorial)

Result LDIS vs. FAC (detailed)

Comparison with SFP

Appendix A: Other SPEC Benchmarks

Appendix B: Cache Size vs. Density

Summary • Many words in cache lines remain unused • Unused words are unlikely to be accessed in the less recent part of the LRU stack → Line Distillation (LDIS) • Distill cache utilizes extra capacity created by LDIS • LDIS reduces MPKI by 30% and improves IPC by 12% • “Footprint Aware Compression” combines LDIS and compression to reduce MPKI by 50%
