Adaptive Cache Compression for High-Performance Processors
Alaa Alameldeen and David Wood
University of Wisconsin-Madison
Wisconsin Multifacet Project
http://www.cs.wisc.edu/multifacet
Overview
• Design of high-performance processors
  • Processor speed improves faster than memory
  • Memory latency dominates performance
  • Need more effective cache designs
• On-chip cache compression
  • Increases effective cache size
  • Increases cache hit latency
• Does cache compression help or hurt?
Alaa Alameldeen – Adaptive Cache Compression
Does Cache Compression Help or Hurt?
• Adaptive Compression determines when compression is beneficial
Outline
• Motivation
• Cache Compression Framework
  • Compressed Cache Hierarchy
  • Decoupled Variable-Segment Cache
• Adaptive Compression
• Evaluation
• Conclusions
Compressed Cache Hierarchy
[Diagram: the instruction fetcher and load-store queue access an uncompressed L1 I-cache and L1 D-cache; the compressed L2 cache is filled through a compression pipeline and read through a decompression pipeline, with an uncompressed-line bypass and an L1 victim cache; the L2 connects to and from memory.]
Decoupled Variable-Segment Cache
• Objective: pack more lines into the same space
[Diagram: tag area with entries for Address A and Address B; data area]
• 2-way set-associative with 64-byte lines
• Tag contains address tag, permissions, and LRU (replacement) bits
Decoupled Variable-Segment Cache
• Objective: pack more lines into the same space
[Diagram build-up, extending the set above:]
• Add two more tags (Address C and Address D)
• Add compression size, compression status, and more LRU bits to each tag
• Divide the data area into 8-byte segments
• Data lines are composed of 1-8 segments
Decoupled Variable-Segment Cache
• Objective: pack more lines into the same space
[Example set, showing each tag's compression status and compressed size:
  Addr A: uncompressed, size 3
  Addr B: compressed, size 2
  Addr C: compressed, size 6
  Addr D: compressed, size 4 (tag is present but line isn't)]
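The example set above can be sketched in code. This is a hypothetical model, not the authors' implementation: it assumes the talk's parameters (four tags sharing sixteen 8-byte segments, i.e., room for two uncompressed 64-byte lines), and names like `TagEntry` and `resident_lines` are mine.

```python
# Hypothetical sketch of one set in a decoupled variable-segment cache.
# Each tag keeps CStatus (stored compressed?) and CSize (compressed size
# in segments); an uncompressed line occupies all 8 segments regardless
# of what its CSize would be.

SEGMENTS_PER_SET = 16   # two uncompressed 64-byte lines of 8-byte segments
LINE_SEGMENTS = 8       # segments occupied by one uncompressed line

class TagEntry:
    def __init__(self, addr, compressed, csize):
        self.addr = addr              # address tag
        self.compressed = compressed  # CStatus
        self.csize = csize            # CSize: size in segments if compressed

    def segments_occupied(self):
        return self.csize if self.compressed else LINE_SEGMENTS

def resident_lines(tags):
    """Pack lines MRU-first; a tag whose data no longer fits stays tag-only."""
    used, resident = 0, []
    for t in tags:
        need = t.segments_occupied()
        if used + need <= SEGMENTS_PER_SET:
            used += need
            resident.append(t.addr)
    return used, resident

# The slide's set: A is uncompressed (8 segments), B and C are compressed
# (2 and 6 segments); D's data (4 segments) does not fit, so D's tag is
# present but its line isn't.
tags = [TagEntry('A', False, 3), TagEntry('B', True, 2),
        TagEntry('C', True, 6), TagEntry('D', True, 4)]
used, resident = resident_lines(tags)
print(used, resident)   # 16 segments used by A, B, C; D is tag-only
```

Note that D's tag stores CSize = 4 even though its data is absent; as the later slides show, those sizes are what lets the cache recognize an avoidable miss (Sum(CSize) = 3+2+6+4 = 15 ≤ 16).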
Outline
• Motivation
• Cache Compression Framework
• Adaptive Compression
  • Key insight
  • Classification of L2 accesses
  • Global compression predictor
• Evaluation
• Conclusions
Adaptive Compression
• Use the past to predict the future
• Key insight: the LRU stack [Mattson, et al., 1970] indicates for each reference whether compression helps or hurts
• If Benefit(Compression) > Cost(Compression), compress future lines; otherwise, do not compress future lines
Cost/Benefit Classification
[LRU stack, MRU to LRU: Addr A uncompressed (size 3), Addr B compressed (2), Addr C compressed (6), Addr D compressed (4)]
• Classify each cache reference
• Four-way set-associative cache with space for two 64-byte lines
• Total of 16 available segments
An Unpenalized Hit
• Read/Write Address A
• LRU stack order = 1 ≤ 2 → hit regardless of compression
• Uncompressed line → no decompression penalty
• Neither cost nor benefit
A Penalized Hit
• Read/Write Address B
• LRU stack order = 2 ≤ 2 → hit regardless of compression
• Compressed line → decompression penalty incurred
• Compression cost
An Avoided Miss
• Read/Write Address C
• LRU stack order = 3 > 2 → hit only because of compression
• Compression benefit: eliminated off-chip miss
An Avoidable Miss
• Read/Write Address D
• Line is not in the cache, but its tag exists at LRU stack order = 4
• Sum(CSize) = 15 ≤ 16 → missed only because some lines are not compressed
• Potential compression benefit
An Unavoidable Miss
• Read/Write Address E
• LRU stack order > 4 → compression wouldn't have helped
• Line is not in the cache, and its tag does not exist
• Neither cost nor benefit
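The five cases above can be condensed into a single classification function. This is a hypothetical sketch following the talk's 4-way example (two uncompressed lines = 16 segments), not the authors' simulator code; the parameter names are mine.

```python
# Classify one L2 access using the set's LRU tag stack.
# stack_depth: position of the matching tag in the LRU stack (1 = MRU),
#              or None if no tag matches.
# data_present: whether the line's data is actually resident.
# line_compressed: whether the resident line is stored compressed.
# sum_csize: sum of the CSize fields of all tags in the set.

SEGMENTS_PER_SET = 16
UNCOMPRESSED_WAYS = 2   # top half of the four tags

def classify(stack_depth, data_present, line_compressed, sum_csize):
    if stack_depth is None:
        return 'unavoidable miss'      # tag not in the stack at all
    if stack_depth <= UNCOMPRESSED_WAYS:
        # Would have hit even with no compression.
        return 'penalized hit' if line_compressed else 'unpenalized hit'
    if data_present:
        return 'avoided miss'          # hit only because of compression
    if sum_csize <= SEGMENTS_PER_SET:
        return 'avoidable miss'        # full compression would have made it fit
    return 'unavoidable miss'

# The slide's five references (Sum(CSize) = 15 in the example set):
print(classify(1, True, False, 15))      # Address A -> unpenalized hit
print(classify(2, True, True, 15))       # Address B -> penalized hit
print(classify(3, True, True, 15))       # Address C -> avoided miss
print(classify(4, False, True, 15))      # Address D -> avoidable miss
print(classify(None, False, False, 15))  # Address E -> unavoidable miss
```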
Compression Predictor
• Estimate: Benefit(Compression) − Cost(Compression)
• Single counter: Global Compression Predictor (GCP)
  • Saturating up/down 19-bit counter
• GCP updated on each cache access
  • Benefit: increment by the memory latency
  • Cost: decrement by the decompression latency
  • Optimization: normalize to decompression latency = 1
• Cache allocation
  • Allocate a compressed line if GCP ≥ 0
  • Allocate an uncompressed line if GCP < 0
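A minimal sketch of the GCP mechanism described above. The latency values are illustrative assumptions, not the paper's configuration: a benefit of 80 assumes a 400-cycle memory latency normalized by a 5-cycle decompression latency. The class name and the choice to also credit avoidable misses follow the talk's classification.

```python
# Hypothetical sketch of the Global Compression Predictor: one saturating
# 19-bit up/down counter, normalized so a decompression costs 1.

GCP_MIN, GCP_MAX = -(2 ** 18), 2 ** 18 - 1   # 19-bit signed saturation

class GCP:
    def __init__(self, mem_benefit=80):
        # mem_benefit = memory latency / decompression latency (assumed 80)
        self.counter = 0
        self.mem_benefit = mem_benefit

    def update(self, access_class):
        if access_class in ('avoided miss', 'avoidable miss'):
            self.counter += self.mem_benefit   # compression saves (or would
                                               # have saved) an off-chip miss
        elif access_class == 'penalized hit':
            self.counter -= 1                  # compression cost: one decompression
        # unpenalized hits and unavoidable misses leave the counter unchanged
        self.counter = max(GCP_MIN, min(GCP_MAX, self.counter))

    def allocate_compressed(self):
        return self.counter >= 0   # compress future lines iff benefit has
                                   # outweighed cost so far

gcp = GCP()
for cls in ['penalized hit'] * 100 + ['avoided miss']:
    gcp.update(cls)
print(gcp.counter, gcp.allocate_compressed())   # -100 + 80 = -20 -> False
```

With these assumed latencies, one avoided miss pays for 80 penalized hits, which is why the predictor tolerates many decompressions before turning compression off.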
Outline
• Motivation
• Cache Compression Framework
• Adaptive Compression
• Evaluation
  • Simulation setup
  • Performance
• Conclusions
Simulation Setup
• Simics full-system simulator augmented with:
  • Detailed OoO processor simulator [TFSim, Mauer, et al., 2002]
  • Detailed memory timing simulator [Martin, et al., 2002]
• Workloads:
  • Commercial workloads:
    • Database servers: OLTP and SPECjbb
    • Static web serving: Apache and Zeus
  • SPEC2000 benchmarks:
    • SPECint: bzip, gcc, mcf, twolf
    • SPECfp: ammp, applu, equake, swim
System Configuration
• A dynamically scheduled SPARC V9 uniprocessor
• Configuration parameters: [table not captured in this transcript]
Simulated Cache Configurations
• Always: all compressible lines are stored in compressed format
  • Decompression penalty for all compressed lines
• Never: all cache lines are stored in uncompressed format
  • Cache is 8-way set-associative with half the number of sets
  • Does not incur a decompression penalty
• Adaptive: our adaptive compression scheme
Performance
[Speedup charts for SPECint, SPECfp, and commercial workloads, comparing Always, Never, and Adaptive]
• Compression yields up to a 35% speedup on some benchmarks and up to an 18% slowdown on others
• One outlying data point is due to a bug in the GCP update
• Adaptive performs similar to the best of Always and Never
Effective Cache Capacity
Cache Miss Rates
• Misses per 1000 instructions: 0.09, 2.52, 12.28, 14.38
• Penalized hits per avoided miss: 6709, 489, 12.3, 4.7
Adapting to L2 Sizes
• Misses per 1000 instructions: 104.8, 36.9, 0.09, 0.05
• Penalized hits per avoided miss: 0.93, 5.7, 6503, 326000
Conclusions
• Cache compression increases effective cache capacity but increases cache hit latency
  • Helps some benchmarks (e.g., apache, mcf)
  • Hurts other benchmarks (e.g., gcc, ammp)
• Our proposal: adaptive compression
  • Uses the (LRU) replacement stack to determine whether compression helps or hurts
  • Updates a single global saturating counter on cache accesses
• Adaptive compression performs similar to the better of Always Compress and Never Compress
Backup Slides
• Frequent Pattern Compression (FPC)
• Decoupled Variable-Segment Cache
• Classification of L2 Accesses
• (LRU) Stack Replacement
• Cache Miss Rates
• Adapting to L2 Sizes – mcf
• Adapting to L1 Size
• Adapting to Decompression Latency – mcf
• Adapting to Decompression Latency – ammp
• Phase Behavior – gcc
• Phase Behavior – mcf
• Can We Do Better Than Adaptive?
Decoupled Variable-Segment Cache
• Each set contains four tags and space for two uncompressed lines
• Data area divided into 8-byte segments
• Each tag is composed of:
  • Address tag (same as uncompressed cache)
  • Permissions (same as uncompressed cache)
  • CStatus: 1 if the line is compressed, 0 otherwise
  • CSize: size of the compressed line in segments
  • LRU/replacement bits (same as uncompressed cache)
Frequent Pattern Compression (FPC)
• A significance-based compression algorithm combined with zero run-length encoding
• Related work:
  • X-Match and X-RL algorithms [Kjelso, et al., 1996]
  • Address and data significance-based compression [Farrens and Park, 1991; Citron and Rudolph, 1995; Canal, et al., 2000]
• Compresses each 32-bit word separately
• Suitable for short (32-256 byte) cache lines
• Compressible patterns: zero runs; sign-extended 4-, 8-, and 16-bit values; zero-padded half-word; two sign-extended half-words; repeated byte
• A 64-byte line is decompressed in a five-stage pipeline (five cycles)
• More details in the technical report: "Frequent Pattern Compression: A Significance-Based Compression Algorithm for L2 Caches," Alaa R. Alameldeen and David A. Wood, Dept. of Computer Sciences Technical Report CS-TR-2004-1500, April 2004 (available online).
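The per-word pattern matching can be sketched as follows. This is a hypothetical illustration of the pattern classes listed above, not the report's exact encoding: the match order, the zero-padded half-word orientation, and the payload bit counts are assumptions (FPC additionally prepends a small prefix code to each payload, and folds zero words into run-length-encoded zero runs).

```python
# Hypothetical sketch of FPC-style pattern matching for one 32-bit word.

def fits_signed(value, width, bits):
    """True if `value` (an unsigned `width`-bit word) is the sign
    extension of a `bits`-bit value."""
    v = value if value < 2 ** (width - 1) else value - 2 ** width
    return -(2 ** (bits - 1)) <= v < 2 ** (bits - 1)

def match_pattern(word):
    """Return (pattern name, payload bits stored) for a 32-bit word."""
    assert 0 <= word < 2 ** 32
    if word == 0:
        return ('zero', 0)                    # folded into a zero run
    if fits_signed(word, 32, 4):
        return ('sign-extended 4-bit', 4)
    if fits_signed(word, 32, 8):
        return ('sign-extended 8-bit', 8)
    if fits_signed(word, 32, 16):
        return ('sign-extended 16-bit', 16)
    if word & 0xFFFF == 0:
        # Assumed orientation: data in the upper half, lower half zero.
        return ('zero-padded half-word', 16)
    if fits_signed(word >> 16, 16, 8) and fits_signed(word & 0xFFFF, 16, 8):
        return ('two sign-extended half-words', 16)
    if word == (word & 0xFF) * 0x01010101:
        return ('repeated byte', 8)
    return ('uncompressed', 32)

print(match_pattern(0x00000005))   # ('sign-extended 4-bit', 4)
print(match_pattern(0xFFFFFF80))   # -128 -> ('sign-extended 8-bit', 8)
print(match_pattern(0x2B2B2B2B))   # ('repeated byte', 8)
print(match_pattern(0x12345678))   # ('uncompressed', 32)
```

The intuition behind significance-based compression is visible in the examples: small integers (positive or negative) waste most of their 32 bits on sign extension, so storing only the significant low-order bits compresses them substantially.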
Classification of L2 Accesses
• Cache hits:
  • Unpenalized hit: hit to an uncompressed line that would have hit without compression
  • Penalized hit: hit to a compressed line that would have hit without compression
  • Avoided miss: hit to a line that would NOT have hit without compression
• Cache misses:
  • Avoidable miss: miss to a line that would have hit with compression
  • Unavoidable miss: miss to a line that would have missed even with compression
(LRU) Stack Replacement
• How do we differentiate penalized hits from avoided misses?
  • Only hits to the top half of the tags in the LRU stack are penalized hits
• How do we differentiate avoidable from unavoidable misses?
• The classification does not depend on LRU replacement:
  • Any replacement algorithm can be used for the top half of the tags
  • Any stack algorithm can be used for the remaining tags
Cache Miss Rates
Adapting to L2 Sizes – mcf
• Misses per 1000 instructions: 98.9, 88.1, 12.4, 0.02
• Penalized hits per avoided miss: 11.6, 4.4, 12.6, 2×10^6
Adapting to L1 Size
Adapting to Decompression Latency – mcf
Adapting to Decompression Latency – ammp
Phase Behavior – gcc
[Charts: predictor value (K) and cache size (MB) over time]
Phase Behavior – mcf
[Charts: predictor value (K) and cache size (MB) over time]
Can We Do Better Than Adaptive?
• Optimal is an unrealistic configuration: Always with no decompression penalty