Adaptive Cache Compression for High-Performance Processors
Alaa R. Alameldeen and David A. Wood
Computer Sciences Department, University of Wisconsin-Madison
Outline • Introduction • Motivation • Adaptive Cache Compression • Evaluation Methodology • Reported Performance • Review Conclusion • Critique/Suggestions
Introduction • The increasing performance gap between processors and memory calls for faster memory access • Cache memories – reduce average memory latency • Cache compression – improves the effectiveness of cache memories • Adaptive cache compression – the theme of this discussion
Motivation • Cache compression can improve the effectiveness of cache memories (increase effective cache capacity) • A larger effective cache capacity reduces the miss rate • A lower miss rate improves performance!
Adaptive Cache Compression: An Overview • Dynamically optimize cache performance • Use the past to predict the future: how likely is compression to help, hurt, or make no difference to the next reference? • Feedback from previous compressions helps decide whether to compress the next line written to the cache
Adaptive Cache Compression: Implementation • Two-level cache hierarchy • L1 caches (data, instruction) are uncompressed • L2 cache is unified and optionally compressed • Compression/decompression is used or skipped as necessary • Pros: L1 cache performance is not affected • Cons: compression/decompression introduces latency
Adaptive Cache Compression: L2 Cache Detail • 8-way set-associative • A compression information tag is stored with each address tag • 32 data segments (8 bytes each) in each set • An uncompressed line occupies 8 segments (at most 4 uncompressed lines per set) • Compressed lines occupy 1 to 7 segments • Maximum number of lines per set = 8 • Least-recently-used (LRU) lines are evicted • Compaction may be needed to make room for a new line (a layout sketch follows below)
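To make the layout concrete, here is a minimal C sketch of one L2 set using the paper's parameters (8 tags, 32 eight-byte segments per set); the struct and field names are illustrative, not taken from the paper:

    /* Hypothetical sketch of one compressed L2 set. Only the parameters
     * (8 tags, 32 x 8-byte segments, 1-7 segments per compressed line)
     * come from the paper; the field names are assumptions. */
    #include <stdint.h>

    #define TAGS_PER_SET     8   /* up to 8 (compressed) lines per set   */
    #define SEGMENTS_PER_SET 32  /* 32 data segments per set             */
    #define SEGMENT_BYTES    8

    typedef struct {
        uint64_t addr_tag;       /* address tag                          */
        uint8_t  valid;
        uint8_t  compressed;     /* compression info kept with the tag   */
        uint8_t  size_segs;      /* 1-7 if compressed, 8 if uncompressed */
        uint8_t  lru_rank;       /* for LRU replacement                  */
    } l2_tag_t;

    typedef struct {
        l2_tag_t tags[TAGS_PER_SET];
        uint8_t  data[SEGMENTS_PER_SET][SEGMENT_BYTES];
    } l2_set_t;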
Adaptive Cache Compression: To Compress or Not to Compress? • While compression eliminates some L2 misses, it adds latency to L2 hits (which are more frequent). • However, the penalty for an L2 miss is usually large and the extra decompression latency is usually small. • Compression helps if: (avoided L2 misses) × (L2 miss penalty) > (penalized L2 hits) × (decompression penalty) • Example: for a 5-cycle decompression penalty and a 400-cycle L2 miss penalty, compression wins if it eliminates at least one L2 miss for every 400/5 = 80 penalized L2 hits
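A minimal C sketch of this cost/benefit test (the function and parameter names are hypothetical; the counts would be gathered over some recent window):

    #include <stdbool.h>
    #include <stdint.h>

    /* Compression is a net win when the cycles saved by avoided misses
     * exceed the cycles added to penalized hits. */
    static bool compression_helps(uint64_t avoided_l2_misses,
                                  uint64_t penalized_l2_hits,
                                  uint32_t l2_miss_penalty,
                                  uint32_t decompression_penalty)
    {
        return avoided_l2_misses * l2_miss_penalty >
               penalized_l2_hits * decompression_penalty;
    }

    /* With a 400-cycle miss penalty and a 5-cycle decompression penalty,
     * one avoided miss pays for up to 400 / 5 = 80 penalized hits. */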
Adaptive Cache Compression: Classification of Cache References • Classification of hits • Unpenalized hit – hit to an uncompressed line (no decompression latency) • Penalized hit – hit to a compressed line (pays the decompression penalty) • Avoided miss – hit to a line that would not have been resident without compression • Classification of misses • Avoidable miss – miss that would have been a hit had all lines been compressed • Unavoidable miss – miss under any compression policy • (The paper illustrates each class with an LRU-stack example; see the sketch after this list)
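The five classes can be summarized in code. This sketch is hypothetical: it assumes an oracle that knows how a reference would have fared under alternative layouts, which the real hardware approximates using LRU stack depth and compressed-size information:

    #include <stdbool.h>

    /* The paper's five reference classes; classify() is illustrative. */
    typedef enum {
        UNPENALIZED_HIT,   /* hit to an uncompressed line                  */
        PENALIZED_HIT,     /* hit to a compressed line: pays decompression */
        AVOIDED_MISS,      /* hit only because compression made room       */
        AVOIDABLE_MISS,    /* miss that compressing all lines would avoid  */
        UNAVOIDABLE_MISS   /* miss under any compression policy            */
    } ref_class_t;

    static ref_class_t classify(bool hit,
                                bool line_is_compressed,
                                bool present_only_due_to_compression,
                                bool would_hit_if_all_compressed)
    {
        if (hit) {
            if (present_only_due_to_compression) return AVOIDED_MISS;
            return line_is_compressed ? PENALIZED_HIT : UNPENALIZED_HIT;
        }
        return would_hit_if_all_compressed ? AVOIDABLE_MISS
                                           : UNAVOIDABLE_MISS;
    }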
Adaptive Cache Compression: Hardware Used in Decision-Making • Global Compression Predictor (GCP) • Estimates the recent cost or benefit of compression • On a penalized hit, the controller biases against compression by decrementing the counter (subtracted value = decompression penalty) • On an avoided or avoidable miss, the controller increments the counter by the L2 miss penalty • The controller consults the GCP when allocating a line in the L2 cache • Positive value → compression has helped, so compress • Negative value → compression has been penalizing, so don't compress • The size of the GCP determines its sensitivity to changes • This paper uses a 19-bit counter (saturates at 262143 / −262144); a counter sketch follows below
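A minimal C sketch of the GCP update and allocation decision. The saturation bounds come from the paper's 19-bit counter; the helper names and the treatment of a zero counter value are assumptions:

    #include <stdbool.h>
    #include <stdint.h>

    #define GCP_MAX   262143    /*  2^18 - 1 */
    #define GCP_MIN (-262144)   /* -2^18     */

    static int32_t gcp = 0;     /* the global compression predictor */

    static void gcp_add(int32_t delta)
    {
        int64_t v = (int64_t)gcp + delta;   /* widen to avoid overflow */
        if (v > GCP_MAX) v = GCP_MAX;       /* saturate high           */
        if (v < GCP_MIN) v = GCP_MIN;       /* saturate low            */
        gcp = (int32_t)v;
    }

    /* Penalized hit: compression cost a decompression, bias against it. */
    static void on_penalized_hit(int32_t decompression_penalty)
    {
        gcp_add(-decompression_penalty);
    }

    /* Avoided or avoidable miss: compression saved (or would have saved)
     * an L2 miss, bias toward it. */
    static void on_avoided_or_avoidable_miss(int32_t l2_miss_penalty)
    {
        gcp_add(l2_miss_penalty);
    }

    /* Consulted when allocating a line in L2: positive means compress
     * (handling of exactly zero is an assumption). */
    static bool should_compress(void)
    {
        return gcp > 0;
    }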
Adaptive Cache Compression: Sensitivity • Effectiveness depends on the workload's working-set size, the cache sizes, and the latencies • Sensitive to L2 cache size (most effective for small L2 caches) • Sensitive to L1 cache size (trade-offs observed) • Adapting to benchmark phases: changes in phase behaviour may hurt the adaptive policy, since it takes time to adapt
Evaluation Methodology • Target system: a dynamically-scheduled (out-of-order) superscalar SPARC V9 uniprocessor • Simulation parameters: (configuration table omitted)
Evaluation Methodology (continued) • Simulator: Simics full-system simulator, extended with a detailed processor timing simulator (TFSim) and a detailed memory system timing simulator • Workloads: • Multi-threaded commercial workloads from the Wisconsin Commercial Workload Suite • Eight SPEC CPU2000 benchmarks • Integer benchmarks (bzip, gcc, mcf, twolf) • Floating-point benchmarks (ammp, applu, equake, swim) • Workloads were selected to cover a wide range of compressibility properties, miss rates, and working-set sizes
Evaluation Methodology (continued) • To understand the utility of adaptive compression, the adaptive policy is compared against two extreme policies: Never (never compress) and Always (always compress) • 'Never' strives to minimize hit latency • 'Always' strives to minimize miss rate • 'Adaptive' strives to balance the two
Reported Performance (Average Cache Capacity) Figure: Average cache capacity during benchmark runs (4 MB uncompressed)
Reported Performance (cache miss rate) Figure: L2 cache miss rate (normalized to “Never” miss rate)
Reported Performance (Runtime) Figure: Runtime for the three compression alternatives (normalized to “Never”)
Reported Performance (Sensitivity of Adaptive Compression to Benchmark Phase Changes) Figure – Top: temporal changes in Global Compression Predictor values; Bottom: effective cache size
Review Conclusion • Compressing all compressible cache lines only helps memory-intensive applications; applications with low miss rates or low compressibility suffer • Optimizations achieved by the adaptive scheme: • Up to 26% speedup (over the uncompressed scheme) for memory-intensive, highly-compressible benchmarks • Performance degradation for other benchmarks < 0.4%
Critique/Suggestions • Data inconsistency: a 17% performance improvement for memory-intensive commercial workloads is claimed on page 2, but 26% is claimed on page 11 • Miscalculation on page 4: the paper argues the miss cannot be avoided because the sum of compressed sizes exceeds the total number of segments (i.e. 35 > 32), yet the compressed sizes at stack depths 1 through 7 actually total 29 • Overall, the proposed technique does not appear to improve performance significantly compared with 'Always'