Adaptive Cache Compression for High-Performance Processors
Alaa Alameldeen and David Wood
University of Wisconsin-Madison
Wisconsin Multifacet Project
http://www.cs.wisc.edu/multifacet
Overview
• Design of high-performance processors
  • Processor speed improves faster than memory
  • Memory latency dominates performance
  • Need more effective cache designs
• On-chip cache compression
  • Increases effective cache size
  • Increases cache hit latency
• Does cache compression help or hurt?
Alaa Alameldeen – Adaptive Cache Compression
Does Cache Compression Help or Hurt?
• Adaptive Compression determines when compression is beneficial
Outline
• Motivation
• Cache Compression Framework
  • Compressed Cache Hierarchy
  • Decoupled Variable-Segment Cache
• Adaptive Compression
• Evaluation
• Conclusions
Compressed Cache Hierarchy
[Diagram: the instruction fetcher and load-store queue access an uncompressed L1 I-cache and L1 D-cache; the compressed L2 cache is filled through a compression pipeline and read through a decompression pipeline, with an uncompressed-line bypass and an L1 victim cache; the L2 connects to and from memory.]
Decoupled Variable-Segment Cache
• Objective: pack more lines into the same space
[Diagram: tag area with entries for Address A and Address B; data area]
• 2-way set-associative with 64-byte lines
• Tag contains address tag, permissions, and LRU (replacement) bits
Decoupled Variable-Segment Cache
• Objective: pack more lines into the same space
[Diagram build-up, extending the set above:]
• Add two more tags (Address C and Address D)
• Add compression size, compression status, and more LRU bits to each tag
• Divide the data area into 8-byte segments
• Data lines are composed of 1-8 segments
Decoupled Variable-Segment Cache
• Objective: pack more lines into the same space
[Example set, showing each tag's compression status and compressed size:
  Addr A: uncompressed, size 3
  Addr B: compressed, size 2
  Addr C: compressed, size 6
  Addr D: compressed, size 4 (tag is present but line isn't)]
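The example set above can be sketched in code. This is a hypothetical model, not the authors' implementation: it assumes the talk's parameters (four tags sharing sixteen 8-byte segments, i.e., room for two uncompressed 64-byte lines), and names like `TagEntry` and `resident_lines` are mine.

```python
# Hypothetical sketch of one set in a decoupled variable-segment cache.
# Each tag keeps CStatus (stored compressed?) and CSize (compressed size
# in segments); an uncompressed line occupies all 8 segments regardless
# of what its CSize would be.

SEGMENTS_PER_SET = 16   # two uncompressed 64-byte lines of 8-byte segments
LINE_SEGMENTS = 8       # segments occupied by one uncompressed line

class TagEntry:
    def __init__(self, addr, compressed, csize):
        self.addr = addr              # address tag
        self.compressed = compressed  # CStatus
        self.csize = csize            # CSize: size in segments if compressed

    def segments_occupied(self):
        return self.csize if self.compressed else LINE_SEGMENTS

def resident_lines(tags):
    """Pack lines MRU-first; a tag whose data no longer fits stays tag-only."""
    used, resident = 0, []
    for t in tags:
        need = t.segments_occupied()
        if used + need <= SEGMENTS_PER_SET:
            used += need
            resident.append(t.addr)
    return used, resident

# The slide's set: A is uncompressed (8 segments), B and C are compressed
# (2 and 6 segments); D's data (4 segments) does not fit, so D's tag is
# present but its line isn't.
tags = [TagEntry('A', False, 3), TagEntry('B', True, 2),
        TagEntry('C', True, 6), TagEntry('D', True, 4)]
used, resident = resident_lines(tags)
print(used, resident)   # 16 segments used by A, B, C; D is tag-only
```

Note that D's tag stores CSize = 4 even though its data is absent; as the later slides show, those sizes are what lets the cache recognize an avoidable miss (Sum(CSize) = 3+2+6+4 = 15 ≤ 16).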
Outline
• Motivation
• Cache Compression Framework
• Adaptive Compression
  • Key insight
  • Classification of L2 accesses
  • Global compression predictor
• Evaluation
• Conclusions
Adaptive Compression
• Use the past to predict the future
• Key insight: the LRU stack [Mattson, et al., 1970] indicates for each reference whether compression helps or hurts
• If Benefit(Compression) > Cost(Compression), compress future lines; otherwise, do not compress future lines
Cost/Benefit Classification
[LRU stack, MRU to LRU: Addr A uncompressed (size 3), Addr B compressed (2), Addr C compressed (6), Addr D compressed (4)]
• Classify each cache reference
• Four-way set-associative cache with space for two 64-byte lines
• Total of 16 available segments
An Unpenalized Hit
• Read/Write Address A
• LRU stack order = 1 ≤ 2 → hit regardless of compression
• Uncompressed line → no decompression penalty
• Neither cost nor benefit
A Penalized Hit
• Read/Write Address B
• LRU stack order = 2 ≤ 2 → hit regardless of compression
• Compressed line → decompression penalty incurred
• Compression cost
An Avoided Miss
• Read/Write Address C
• LRU stack order = 3 > 2 → hit only because of compression
• Compression benefit: eliminated off-chip miss
An Avoidable Miss
• Read/Write Address D
• Line is not in the cache, but its tag exists at LRU stack order = 4
• Sum(CSize) = 15 ≤ 16 → missed only because some lines are not compressed
• Potential compression benefit
An Unavoidable Miss
• Read/Write Address E
• LRU stack order > 4 → compression wouldn't have helped
• Line is not in the cache, and its tag does not exist
• Neither cost nor benefit
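The five cases above can be condensed into a single classification function. This is a hypothetical sketch following the talk's 4-way example (two uncompressed lines = 16 segments), not the authors' simulator code; the parameter names are mine.

```python
# Classify one L2 access using the set's LRU tag stack.
# stack_depth: position of the matching tag in the LRU stack (1 = MRU),
#              or None if no tag matches.
# data_present: whether the line's data is actually resident.
# line_compressed: whether the resident line is stored compressed.
# sum_csize: sum of the CSize fields of all tags in the set.

SEGMENTS_PER_SET = 16
UNCOMPRESSED_WAYS = 2   # top half of the four tags

def classify(stack_depth, data_present, line_compressed, sum_csize):
    if stack_depth is None:
        return 'unavoidable miss'      # tag not in the stack at all
    if stack_depth <= UNCOMPRESSED_WAYS:
        # Would have hit even with no compression.
        return 'penalized hit' if line_compressed else 'unpenalized hit'
    if data_present:
        return 'avoided miss'          # hit only because of compression
    if sum_csize <= SEGMENTS_PER_SET:
        return 'avoidable miss'        # full compression would have made it fit
    return 'unavoidable miss'

# The slide's five references (Sum(CSize) = 15 in the example set):
print(classify(1, True, False, 15))      # Address A -> unpenalized hit
print(classify(2, True, True, 15))       # Address B -> penalized hit
print(classify(3, True, True, 15))       # Address C -> avoided miss
print(classify(4, False, True, 15))      # Address D -> avoidable miss
print(classify(None, False, False, 15))  # Address E -> unavoidable miss
```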
Compression Predictor
• Estimate: Benefit(Compression) − Cost(Compression)
• Single counter: Global Compression Predictor (GCP)
  • Saturating up/down 19-bit counter
• GCP updated on each cache access
  • Benefit: increment by the memory latency
  • Cost: decrement by the decompression latency
  • Optimization: normalize to decompression latency = 1
• Cache allocation
  • Allocate a compressed line if GCP ≥ 0
  • Allocate an uncompressed line if GCP < 0
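A minimal sketch of the GCP mechanism described above. The latency values are illustrative assumptions, not the paper's configuration: a benefit of 80 assumes a 400-cycle memory latency normalized by a 5-cycle decompression latency. The class name and the choice to also credit avoidable misses follow the talk's classification.

```python
# Hypothetical sketch of the Global Compression Predictor: one saturating
# 19-bit up/down counter, normalized so a decompression costs 1.

GCP_MIN, GCP_MAX = -(2 ** 18), 2 ** 18 - 1   # 19-bit signed saturation

class GCP:
    def __init__(self, mem_benefit=80):
        # mem_benefit = memory latency / decompression latency (assumed 80)
        self.counter = 0
        self.mem_benefit = mem_benefit

    def update(self, access_class):
        if access_class in ('avoided miss', 'avoidable miss'):
            self.counter += self.mem_benefit   # compression saves (or would
                                               # have saved) an off-chip miss
        elif access_class == 'penalized hit':
            self.counter -= 1                  # compression cost: one decompression
        # unpenalized hits and unavoidable misses leave the counter unchanged
        self.counter = max(GCP_MIN, min(GCP_MAX, self.counter))

    def allocate_compressed(self):
        return self.counter >= 0   # compress future lines iff benefit has
                                   # outweighed cost so far

gcp = GCP()
for cls in ['penalized hit'] * 100 + ['avoided miss']:
    gcp.update(cls)
print(gcp.counter, gcp.allocate_compressed())   # -100 + 80 = -20 -> False
```

With these assumed latencies, one avoided miss pays for 80 penalized hits, which is why the predictor tolerates many decompressions before turning compression off.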
Outline
• Motivation
• Cache Compression Framework
• Adaptive Compression
• Evaluation
  • Simulation setup
  • Performance
• Conclusions
Simulation Setup
• Simics full-system simulator augmented with:
  • Detailed OoO processor simulator [TFSim, Mauer, et al., 2002]
  • Detailed memory timing simulator [Martin, et al., 2002]
• Workloads:
  • Commercial workloads:
    • Database servers: OLTP and SPECjbb
    • Static web serving: Apache and Zeus
  • SPEC2000 benchmarks:
    • SPECint: bzip, gcc, mcf, twolf
    • SPECfp: ammp, applu, equake, swim
System Configuration
• A dynamically scheduled SPARC V9 uniprocessor
• Configuration parameters: [table not captured in this transcript]
Simulated Cache Configurations
• Always: all compressible lines are stored in compressed format
  • Decompression penalty for all compressed lines
• Never: all cache lines are stored in uncompressed format
  • Cache is 8-way set-associative with half the number of sets
  • Does not incur a decompression penalty
• Adaptive: our adaptive compression scheme
Performance
[Speedup charts for SPECint, SPECfp, and commercial workloads, comparing Always, Never, and Adaptive]
• Compression yields up to a 35% speedup on some benchmarks and up to an 18% slowdown on others
• One outlying data point is due to a bug in the GCP update
• Adaptive performs similar to the best of Always and Never
Effective Cache Capacity
Cache Miss Rates
• Misses per 1000 instructions: 0.09, 2.52, 12.28, 14.38
• Penalized hits per avoided miss: 6709, 489, 12.3, 4.7
Adapting to L2 Sizes
• Misses per 1000 instructions: 104.8, 36.9, 0.09, 0.05
• Penalized hits per avoided miss: 0.93, 5.7, 6503, 326000
Conclusions
• Cache compression increases effective cache capacity but increases cache hit latency
  • Helps some benchmarks (e.g., apache, mcf)
  • Hurts other benchmarks (e.g., gcc, ammp)
• Our proposal: adaptive compression
  • Uses the (LRU) replacement stack to determine whether compression helps or hurts
  • Updates a single global saturating counter on cache accesses
• Adaptive compression performs similar to the better of Always Compress and Never Compress
Backup Slides
• Frequent Pattern Compression (FPC)
• Decoupled Variable-Segment Cache
• Classification of L2 Accesses
• (LRU) Stack Replacement
• Cache Miss Rates
• Adapting to L2 Sizes – mcf
• Adapting to L1 Size
• Adapting to Decompression Latency – mcf
• Adapting to Decompression Latency – ammp
• Phase Behavior – gcc
• Phase Behavior – mcf
• Can We Do Better Than Adaptive?
Decoupled Variable-Segment Cache
• Each set contains four tags and space for two uncompressed lines
• Data area divided into 8-byte segments
• Each tag is composed of:
  • Address tag (same as uncompressed cache)
  • Permissions (same as uncompressed cache)
  • CStatus: 1 if the line is compressed, 0 otherwise
  • CSize: size of the compressed line in segments
  • LRU/replacement bits (same as uncompressed cache)
Frequent Pattern Compression (FPC)
• A significance-based compression algorithm combined with zero run-length encoding
• Related work:
  • X-Match and X-RL algorithms [Kjelso, et al., 1996]
  • Address and data significance-based compression [Farrens and Park, 1991; Citron and Rudolph, 1995; Canal, et al., 2000]
• Compresses each 32-bit word separately
• Suitable for short (32-256 byte) cache lines
• Compressible patterns: zero runs; sign-extended 4-, 8-, and 16-bit values; zero-padded half-word; two sign-extended half-words; repeated byte
• A 64-byte line is decompressed in a five-stage pipeline (five cycles)
• More details in the technical report: "Frequent Pattern Compression: A Significance-Based Compression Algorithm for L2 Caches," Alaa R. Alameldeen and David A. Wood, Dept. of Computer Sciences Technical Report CS-TR-2004-1500, April 2004 (available online).
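The per-word pattern matching can be sketched as follows. This is a hypothetical illustration of the pattern classes listed above, not the report's exact encoding: the match order, the zero-padded half-word orientation, and the payload bit counts are assumptions (FPC additionally prepends a small prefix code to each payload, and folds zero words into run-length-encoded zero runs).

```python
# Hypothetical sketch of FPC-style pattern matching for one 32-bit word.

def fits_signed(value, width, bits):
    """True if `value` (an unsigned `width`-bit word) is the sign
    extension of a `bits`-bit value."""
    v = value if value < 2 ** (width - 1) else value - 2 ** width
    return -(2 ** (bits - 1)) <= v < 2 ** (bits - 1)

def match_pattern(word):
    """Return (pattern name, payload bits stored) for a 32-bit word."""
    assert 0 <= word < 2 ** 32
    if word == 0:
        return ('zero', 0)                    # folded into a zero run
    if fits_signed(word, 32, 4):
        return ('sign-extended 4-bit', 4)
    if fits_signed(word, 32, 8):
        return ('sign-extended 8-bit', 8)
    if fits_signed(word, 32, 16):
        return ('sign-extended 16-bit', 16)
    if word & 0xFFFF == 0:
        # Assumed orientation: data in the upper half, lower half zero.
        return ('zero-padded half-word', 16)
    if fits_signed(word >> 16, 16, 8) and fits_signed(word & 0xFFFF, 16, 8):
        return ('two sign-extended half-words', 16)
    if word == (word & 0xFF) * 0x01010101:
        return ('repeated byte', 8)
    return ('uncompressed', 32)

print(match_pattern(0x00000005))   # ('sign-extended 4-bit', 4)
print(match_pattern(0xFFFFFF80))   # -128 -> ('sign-extended 8-bit', 8)
print(match_pattern(0x2B2B2B2B))   # ('repeated byte', 8)
print(match_pattern(0x12345678))   # ('uncompressed', 32)
```

The intuition behind significance-based compression is visible in the examples: small integers (positive or negative) waste most of their 32 bits on sign extension, so storing only the significant low-order bits compresses them substantially.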
Classification of L2 Accesses
• Cache hits:
  • Unpenalized hit: hit to an uncompressed line that would have hit without compression
  • Penalized hit: hit to a compressed line that would have hit without compression
  • Avoided miss: hit to a line that would NOT have hit without compression
• Cache misses:
  • Avoidable miss: miss to a line that would have hit with compression
  • Unavoidable miss: miss to a line that would have missed even with compression
(LRU) Stack Replacement
• How do we differentiate penalized hits from avoided misses?
  • Only hits to the top half of the tags in the LRU stack are penalized hits
• How do we differentiate avoidable from unavoidable misses?
• The classification does not depend on LRU replacement:
  • Any replacement algorithm can be used for the top half of the tags
  • Any stack algorithm can be used for the remaining tags
Cache Miss Rates
Adapting to L2 Sizes – mcf
• Misses per 1000 instructions: 98.9, 88.1, 12.4, 0.02
• Penalized hits per avoided miss: 11.6, 4.4, 12.6, 2×10^6
Adapting to L1 Size
Adapting to Decompression Latency – mcf
Adapting to Decompression Latency – ammp
Phase Behavior – gcc
[Charts: predictor value (K) and cache size (MB) over time]
Phase Behavior – mcf
[Charts: predictor value (K) and cache size (MB) over time]
Can We Do Better Than Adaptive?
• Optimal is an unrealistic configuration: Always with no decompression penalty