
ICCD 2010 Amsterdam, the Netherlands

Improving Cache Performance by Combining Cost-Sensitivity and Locality Principles in Cache Replacement Algorithms. Rami Sheikh (North Carolina State University) and Mazen Kharbutli (Jordan Univ. of Science and Technology).

Presentation Transcript


  1. ICCD 2010, Amsterdam, the Netherlands • Improving Cache Performance by Combining Cost-Sensitivity and Locality Principles in Cache Replacement Algorithms • Rami Sheikh, North Carolina State University • Mazen Kharbutli, Jordan Univ. of Science and Technology

  2. Outline

  3. Motivation • The processor-memory performance gap. • L2 cache performance is crucial. • Traditionally, L2 cache replacement algorithms focus on improving the hit rate. • But cache misses have different costs. • It is better to take the cost of a miss into consideration. • The processor’s ability to (partially) hide the L2 cache miss latency differs between misses. • It depends on the dependency chain, miss bursts, etc.

  4. Motivation • Issued Instructions per Miss Histogram.

  5. Contributions • A novel, effective, but simple cost estimation method. • Based on the number of instructions a processor manages to issue during the miss latency. • A reflection of the processor’s ability to hide the miss latency. • Small number of issued instructions during the miss: high-cost miss/block. • Large number of issued instructions during the miss: low-cost miss/block.

  6. Contributions • LACS: Locality-Aware Cost-Sensitive Cache Replacement Algorithm. • Integrates our novel cost estimation method with a locality algorithm (e.g. LRU). • Attempts to reserve high cost blocks in the cache while their locality is still high. • On a cache miss, a low-cost block is chosen for eviction. • Excellent performance improvement at feasible cost. • Performance improvement: 15% average and up to 85%. • Effective in uniprocessors and CMPs. • Effective for different cache configurations.

  7. Outline

  8. Related Work • Cache replacement algorithms traditionally attempt to reduce the cache miss rate. • Belady’s OPT algorithm [Belady 1966]. • Dead block predictors [Kharbutli 2008, etc.]. • OPT emulators [Rajan 2007]. • Cache misses are not uniform and have different costs [Srinivasan 1998, Puzak 2008]. • A new class of replacement algorithms. • Miss cost can be latency, power consumption, penalty, etc.

  9. Related Work • Jeong and Dubois [1999, 2003, 2006]: • In the context of CC-NUMA multiprocessors. • A miss that maps to remote memory costs more than one that maps to local memory. • LACS estimates cost based on the processor’s ability to tolerate the miss latency, not on the miss latency value itself. • Jeong et al. [2008]: • In the context of uniprocessors. • Next access predicted: Load (high cost); Store (low cost). • All load misses treated equally. • LACS does not treat load misses equally (different costs). • A store miss may have a high cost.

  10. Related Work • Srinivasan et al. [2001]: • Critical blocks preserved in special critical cache. • Criticality estimated from load’s dependence chain. • No significant improvement under realistic configurations. • LACS does not track the dependence chain. Uses a simpler cost heuristic. • LACS achieves considerable performance improvement under realistic configurations.

  11. Related Work • Qureshi et al. [2006]: • Based on Memory-level Parallelism (MLP). • Cache misses occur in isolation (high cost) or concurrently (low cost). • Suffers from pathological cases. Integrated with a tournament predictor to choose between it and LRU (SBAR). • LACS does not slow down any of the 20 benchmarks in our study. • LACS outperforms MLP-SBAR in our study.

  12. Outline

  13. LACS Storage Organization • Processor (P): IIC (32 bits). • L1$. • L2$ MSHR: IIRs (32 bits each). • L2$ Prediction Table: each entry holds a 6-bit hashed tag, a 5-bit cost, and a 1-bit confidence. • Prediction table size: 8K sets x 4 ways x 1.5 bytes/entry = 48 KB. • Total storage overhead ≈ 48 KB: 9.4% of a 512 KB cache, 4.7% of a 1 MB cache.
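The storage figures on this slide can be checked with a few lines of arithmetic. The sketch below only reproduces the numbers quoted above; the function and constant names are illustrative, and the 32-bit IIC and IIR registers are left out, which is presumably why the slide gives the total as approximate.

```python
# Figures come from the slide; names are illustrative, not from the paper.
ENTRY_BITS = 6 + 5 + 1   # hashed tag + cost + confidence = 12 bits = 1.5 bytes
SETS = 8 * 1024          # 8K sets
WAYS = 4


def table_bytes(sets=SETS, ways=WAYS, entry_bits=ENTRY_BITS):
    """Size of the prediction table in bytes."""
    return sets * ways * entry_bits // 8


def overhead_percent(cache_kb, table_kb):
    """Table size as a percentage of the cache capacity."""
    return 100.0 * table_kb / cache_kb


table_kb = table_bytes() // 1024
print(table_kb)                                      # 48
print(round(overhead_percent(512, table_kb), 1))     # 9.4
print(round(overhead_percent(1024, table_kb), 1))    # 4.7
```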

  14. Outline

  15. LACS Implementation • On an L2 cache miss on block B in set S: MSHR[B].IIR = IIC

  16. LACS Implementation • On an L2 cache miss on block B in set S: • Identify all low-cost blocks in set S. • If there is at least one, choose a victim randomly from among them. • Otherwise, the LRU block is the victim. • Block X is a low-cost block if: • X.cost > threshold, and • X.conf == 1 • (The stored cost is the number of instructions issued during the miss, so a larger value means a cheaper miss.)
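The victim-selection rule on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Block` type, the `choose_victim` name, and the convention that the set is passed as a list ordered from LRU to MRU are all assumptions made for the example.

```python
import random
from dataclasses import dataclass


@dataclass
class Block:
    tag: int
    cost: int  # instructions issued during the block's last miss
    conf: int  # 1 if the cost estimate is considered reliable


def choose_victim(blocks_lru_to_mru, threshold, rng=random):
    """Pick a victim from one set, ordered LRU first (assumed layout)."""
    # A block is low-cost when many instructions were issued during its
    # miss (cost > threshold) and the estimate is confident.
    low_cost = [b for b in blocks_lru_to_mru
                if b.cost > threshold and b.conf == 1]
    if low_cost:
        return rng.choice(low_cost)   # random among low-cost blocks
    return blocks_lru_to_mru[0]       # otherwise fall back to LRU


cache_set = [Block(0xA, 3, 1), Block(0xB, 20, 0),
             Block(0xC, 25, 1), Block(0xD, 5, 1)]
print(hex(choose_victim(cache_set, threshold=16).tag))  # 0xc
```

In the example set only block 0xC is both above the threshold and confident, so it is evicted even though 0xA is the LRU block.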

  17. LACS Implementation • On an L2 cache miss on block B in set S: • When miss returns, calculate B’s new cost: • newCost = IIC – MSHR[B].IIR • Update B’s table info: • if(newCost ≈ B.cost) B.conf=1, else B.conf=0 • B.cost = newCost
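The bookkeeping on slides 15 and 17 (snapshot the issued-instruction counter at the miss, subtract it when the miss returns, then update cost and confidence) can be sketched like this. The class and method names are hypothetical, the `tolerance` used for "≈" is an assumption because the slides do not define it, and 32-bit counter wraparound and 5-bit cost quantization are ignored.

```python
class CostTracker:
    """Toy model of the IIC / IIR / prediction-table updates."""

    def __init__(self, tolerance=1):
        self.iic = 0             # issued-instruction counter (IIC)
        self.mshr_iir = {}       # block -> IIC snapshot taken at miss time
        self.table = {}          # block -> (cost, conf)
        self.tolerance = tolerance  # assumed meaning of "newCost ≈ B.cost"

    def issue(self, n=1):
        """The processor issues n instructions."""
        self.iic += n

    def on_miss(self, block):
        """L2 miss on block: MSHR[B].IIR = IIC."""
        self.mshr_iir[block] = self.iic

    def on_fill(self, block):
        """Miss returns: newCost = IIC - MSHR[B].IIR, then update the table."""
        new_cost = self.iic - self.mshr_iir.pop(block)
        old = self.table.get(block)
        # Assumption: a block with no previous entry starts unconfident.
        close = old is not None and abs(new_cost - old[0]) <= self.tolerance
        self.table[block] = (new_cost, 1 if close else 0)
        return new_cost


t = CostTracker()
for issued in (10, 10, 50):
    t.on_miss("B")
    t.issue(issued)
    t.on_fill("B")
    print(t.table["B"])
# (10, 0)
# (10, 1)
# (50, 0)
```

The second miss repeats the first cost, so confidence is set; the third differs, so confidence is cleared while the new cost is stored.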

  18. Outline

  19. Evaluation Environment • Evaluation using SESC: a detailed, cycle-accurate, execution-driven simulator. • 20 of the 26 SPEC2000 benchmarks are used. • Reference input sets. • 2 billion instructions simulated after skipping the first 2 billion instructions. • Benchmarks divided into two groups (GrpA, GrpB). • GrpA (L2 cache performance-constrained): ammp, applu, art, equake, gcc, mcf, mgrid, swim, twolf, and vpr. • L2 cache: 512 KB, 8-way, WB, LRU.

  20. Outline

  21. Evaluation • Performance Improvement: • L2 Cache Miss Rates:

  22. Evaluation • Fraction of LRU blocks reserved by LACS that get re-used: • Low-cost blocks in the cache: <20%. • OPT-evicted blocks that were low-cost: 40% to 98%. • Strong correlation between blocks evicted by OPT and their cost. • L2 Cache Miss Rates:

  23. Evaluation • Performance improvement in a CMP architecture:

  24. Evaluation • Sensitivity to cache parameters:

  25. Outline

  26. Conclusion • LACS’s Exquisite Features: • Novelty • New metric for measuring cost-sensitivity. • Combines Two Principles • Locality and cost-sensitivity. • Performance Improvements at Feasible Cost • 15% average speedup in L2 cache performance-constrained benchmarks. • Effective in uniprocessor and CMP architectures. • Effective for different cache configurations.

  27. Thank You! Questions?
