ICCD 2010 Amsterdam, the Netherlands

Improving Cache Performance by Combining Cost-Sensitivity and Locality Principles in Cache Replacement Algorithms. Rami Sheikh (North Carolina State University), Mazen Kharbutli (Jordan Univ. of Science and Technology)

Outline

Motivation • The processor-memory performance gap. • L2 cache performance is crucial. • Traditionally, L2 cache replacement algorithms focus on improving the hit rate. • But cache misses have different costs. • Better to take the cost of a miss into consideration. • The processor’s ability to (partially) hide the L2 cache miss latency differs between misses. • Depends on: dependency chain, miss bursts, etc.

Motivation • Issued Instructions per Miss Histogram.

Contributions • A novel, effective, but simple cost estimation method. • Based on the number of instructions a processor manages to issue during the miss latency. • A reflection of the processor’s ability to hide the miss latency. • Small number of issued instructions during the miss: high-cost miss/block. • Large number of issued instructions during the miss: low-cost miss/block.

Contributions • LACS: Locality-Aware Cost-Sensitive Cache Replacement Algorithm. • Integrates our novel cost estimation method with a locality algorithm (e.g. LRU). • Attempts to reserve high cost blocks in the cache while their locality is still high. • On a cache miss, a low-cost block is chosen for eviction. • Excellent performance improvement at feasible cost. • Performance improvement: 15% average and up to 85%. • Effective in uniprocessors and CMPs. • Effective for different cache configurations.


Related Work • Cache replacement algorithms traditionally attempt to reduce the cache miss rate. • Belady’s OPT algorithm [Belady 1966]. • Dead block predictors [Kharbutli 2008, etc.]. • OPT emulators [Rajan 2007]. • Cache misses are not uniform and have different costs [Srinivasan 1998, Puzak 2008]. • A new class of replacement algorithms. • Miss cost can be latency, power consumption, penalty, etc.

Related Work • Jeong and Dubois [1999, 2003, 2006]: • In the context of CC-NUMA multiprocessors. • A miss that maps to remote memory costs more than one that maps to local memory. • LACS estimates cost from the processor’s ability to tolerate the miss latency, not from the miss latency value itself. • Jeong et al. [2008]: • In the context of uniprocessors. • Cost depends on the predicted next access: load (high cost) or store (low cost). • All load misses treated equally. • LACS does not treat load misses equally (different costs). • A store miss may have a high cost.

Related Work • Srinivasan et al. [2001]: • Critical blocks preserved in special critical cache. • Criticality estimated from load’s dependence chain. • No significant improvement under realistic configurations. • LACS does not track the dependence chain. Uses a simpler cost heuristic. • LACS achieves considerable performance improvement under realistic configurations.

Related Work • Qureshi et al. [2006]: • Based on Memory-level Parallelism (MLP). • Cache misses occur in isolation (high cost) or concurrently (low cost). • Suffers from pathological cases. Integrated with a tournament predictor to choose between it and LRU (SBAR). • LACS does not slow down any of the 20 benchmarks in our study. • LACS outperforms MLP-SBAR in our study.


LACS Storage Organization • Components shown: processor (P) with IIC (32 bits); L1$; L2$; MSHR with IIRs (32 bits each); prediction table. • Each prediction table entry: 6-bit hashed tag, 5-bit cost, 1-bit confidence. • Table size: 8K sets x 4 ways x 1.5 bytes/entry = 48 KB. • Total storage overhead ≈ 48 KB: 9.4% of a 512 KB cache, 4.7% of a 1 MB cache.
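The storage figures on this slide can be checked with a few lines of arithmetic. The sketch below only reproduces the numbers quoted above; the constant and function names are illustrative, and the 32-bit IIC and IIRs are left out because the slide's 48 KB figure comes from the prediction table alone.

```python
# Prediction-table storage arithmetic, using only figures quoted on the slide.
TAG_BITS, COST_BITS, CONF_BITS = 6, 5, 1   # per-entry fields
SETS, WAYS = 8 * 1024, 4                   # 8K sets x 4 ways

entry_bits = TAG_BITS + COST_BITS + CONF_BITS    # 12 bits per entry
entry_bytes = entry_bits / 8                     # 1.5 bytes per entry
table_kb = SETS * WAYS * entry_bytes / 1024      # table size in KB


def overhead_percent(cache_kb):
    """Table size as a percentage of an L2 cache of cache_kb kilobytes."""
    return 100 * table_kb / cache_kb


print(f"table: {table_kb:.0f} KB")
print(f"512 KB cache: {overhead_percent(512):.1f}%")
print(f"1 MB cache:   {overhead_percent(1024):.1f}%")
```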


LACS Implementation • On an L2 cache miss on block B in set S: MSHR[B].IIR = IIC

LACS Implementation • On an L2 cache miss on block B in set S: • Identify all low-cost blocks in set S. • If there is at least one, choose a victim randomly from among them. • Otherwise, the LRU block is the victim. • Block X is a low-cost block if: • X.cost > threshold (X.cost holds the number of instructions issued during X’s last miss, so a larger value means a cheaper miss), and • X.conf == 1
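The victim-selection rule on this slide can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the `Block` class, the MRU-to-LRU list ordering, and the `THRESHOLD` value are all hypothetical (the slides do not give the threshold).

```python
import random
from dataclasses import dataclass

THRESHOLD = 16  # hypothetical value; the slides do not state the threshold


@dataclass
class Block:
    tag: int
    cost: int  # issued-instruction count recorded for the block's last miss
    conf: int  # 1 if the recorded cost is considered reliable, else 0


def is_low_cost(block, threshold=THRESHOLD):
    """A block is low cost if many instructions issued during its miss
    (cost > threshold) and the prediction is confident (conf == 1)."""
    return block.cost > threshold and block.conf == 1


def choose_victim(cache_set, rng=random):
    """Pick a victim from cache_set, assumed ordered from MRU (index 0)
    to LRU (last index)."""
    low_cost = [b for b in cache_set if is_low_cost(b)]
    if low_cost:
        return rng.choice(low_cost)  # random choice among low-cost blocks
    return cache_set[-1]             # fall back to the LRU block
```

With no confident low-cost block in the set, the function degenerates to plain LRU, which matches the fallback described on the slide.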

LACS Implementation • On an L2 cache miss on block B in set S: • When miss returns, calculate B’s new cost: • newCost = IIC – MSHR[B].IIR • Update B’s table info: • if(newCost ≈ B.cost) B.conf=1, else B.conf=0 • B.cost = newCost
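The bookkeeping across a miss (snapshot the IIC when the miss occurs, subtract when it returns, then update cost and confidence) can be sketched as follows. Several details are assumptions made for illustration: the `CostTracker` class and its method names are hypothetical, `TOLERANCE` stands in for the unspecified meaning of "≈", the subtraction is taken modulo 2^32 to match the 32-bit counters, and the table is keyed by full block address rather than by the 6-bit hashed tag and 5-bit cost field of the real prediction table.

```python
COUNTER_MASK = 0xFFFFFFFF  # IIC and IIRs are 32 bits wide
TOLERANCE = 2              # hypothetical meaning of "newCost ≈ B.cost"


class CostTracker:
    """Illustrative model of the IIC, the per-miss IIRs, and the cost table."""

    def __init__(self):
        self.iic = 0     # issued-instruction counter
        self.iir = {}    # block -> IIC snapshot taken at miss time
        self.table = {}  # block -> (cost, conf)

    def issue(self, n=1):
        """Advance the counter by n issued instructions (wraps at 32 bits)."""
        self.iic = (self.iic + n) & COUNTER_MASK

    def on_miss(self, block):
        """L2 miss on block: MSHR[B].IIR = IIC."""
        self.iir[block] = self.iic

    def on_fill(self, block):
        """Miss returns: newCost = IIC - MSHR[B].IIR, then update the table."""
        new_cost = (self.iic - self.iir.pop(block)) & COUNTER_MASK
        old = self.table.get(block)
        # Confident only if the new cost is close to the previous one.
        close = old is not None and abs(new_cost - old[0]) <= TOLERANCE
        self.table[block] = (new_cost, 1 if close else 0)
        return self.table[block]
```

A block seen for the first time gets `conf = 0` in this sketch, since there is no earlier cost to compare against; the slides do not say how the first observation is handled.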


Evaluation Environment • Evaluation using SESC: a detailed, cycle-accurate, execution-driven simulator. • 20 of the 26 SPEC2000 benchmarks are used. • Reference input sets. • 2 billion instructions simulated after skipping the first 2 billion instructions. • Benchmarks divided into two groups (GrpA, GrpB). • GrpA: L2 cache performance-constrained - ammp, applu, art, equake, gcc, mcf, mgrid, swim, twolf, and vpr. • L2 cache: 512 KB, 8-way, WB, LRU.


Evaluation • Performance Improvement: • L2 Cache Miss Rates:

Evaluation • Fraction of LRU blocks reserved by LACS that get re-used: • Low-cost blocks in the cache: <20%. • OPT-evicted blocks that were low-cost: 40% to 98%. → Strong correlation between blocks evicted by OPT and their cost. • L2 Cache Miss Rates:

Evaluation • Performance improvement in a CMP architecture:

Evaluation • Sensitivity to cache parameters:


Conclusion • LACS’s Exquisite Features: • Novelty • New metric for measuring cost-sensitivity. • Combines Two Principles • Locality and cost-sensitivity. • Performance Improvements at Feasible Cost • 15% average speedup in L2 cache performance-constrained benchmarks. • Effective in uniprocessor and CMP architectures. • Effective for different cache configurations.

Thank You! Questions?
