
Utility-Based Partitioning of Shared Caches

Utility-Based Cache Partitioning (UCP) improves shared-cache performance by allocating cache resources according to how much each application actually benefits from them, rather than how much it demands. The method combines low-overhead runtime utility monitoring, a partitioning algorithm, and replacement support to enforce partitions. Results show average improvements of 11% in weighted speedup, 17% in throughput, and 11% in hmean-fairness. Scalability to many cores is addressed with the Lookahead algorithm, which is compared against the Greedy algorithm on convex and non-convex utility curves.


Presentation Transcript


  1. Utility-Based Partitioning of Shared Caches Moinuddin K. Qureshi, Yale N. Patt International Symposium on Microarchitecture (MICRO) 2006

  2. Introduction • CMP and shared caches are common • Applications compete for the shared cache • Partitioning policies critical for high performance • Traditional policies: • Equal (half-and-half) → performance isolation, but no adaptation • LRU → demand based, but demand ≠ benefit (e.g. streaming)

  3. Background Utility of giving an application b ways instead of a ways: Uab = Misses with a ways – Misses with b ways [Figure: misses per 1000 instructions vs. num ways from a 16-way 1MB L2, contrasting low-utility, high-utility, and saturating-utility applications]
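
As a concrete illustration, utility is just a difference between two points on an application's miss curve. A minimal Python sketch, where the miss-curve values are made-up numbers for illustration, not data from the paper:

```python
# Hypothetical miss curve: misses_per_1k_inst[k] = MPKI with k ways (k = 0..16).
misses_per_1k_inst = [50, 30, 18, 12, 10, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9]

def utility(miss_curve, a, b):
    # U_a^b = misses with a ways - misses with b ways
    return miss_curve[a] - miss_curve[b]

print(utility(misses_per_1k_inst, 1, 2))   # high utility early: 12
print(utility(misses_per_1k_inst, 8, 16))  # saturated: 0
```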

  4. Motivation Improve performance by giving more cache to the application that benefits more from cache [Figure: misses per 1000 instructions (MPKI) vs. num ways from a 16-way 1MB L2 for equake and vpr, comparing the LRU partition against the utility-based (UTIL) partition]

  5. Outline • Introduction and Motivation • Utility-Based Cache Partitioning • Evaluation • Scalable Partitioning Algorithm • Related Work and Summary

  6. Framework for UCP [Diagram: Core1 and Core2, each with private I$ and D$, share an L2 cache backed by main memory; a UMON per core feeds the partitioning algorithm (PA)] Three components: • Utility Monitors (UMON) per core • Partitioning Algorithm (PA) • Replacement support to enforce partitions

  7. Utility Monitors (UMON) [Diagram: main tag directory (MTD) alongside an auxiliary tag directory (ATD) whose hit counters H0 (MRU) through H15 (LRU) track hits per recency position across sets A–H] • For each core, simulate the LRU policy using the ATD • Hit counters in the ATD count hits per recency position • LRU is a stack algorithm: hit counts → utility, e.g. hits(2 ways) = H0 + H1
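
Because LRU is a stack algorithm, the per-position hit counters convert to a hits-vs-ways curve with a simple prefix sum. A minimal sketch of that conversion (function and variable names are illustrative, not from the paper):

```python
def hits_curve(hit_counters):
    """hit_counters[i] = UMON hits at recency position i (H0 = MRU).

    Returns curve where curve[k-1] = hits the core would get with k ways,
    since an access hitting at position i hits in any cache with > i ways."""
    curve, total = [], 0
    for h in hit_counters:
        total += h
        curve.append(total)
    return curve

# e.g. hits(2 ways) = H0 + H1:
assert hits_curve([5, 3, 1])[1] == 8
```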

  8. Dynamic Set Sampling (DSS) [Diagram: MTD covering all sets next to a UMON ATD that tracks only a sampled subset of the sets] • Extra tags incur hardware and power overhead • DSS reduces overhead [Qureshi+ ISCA’06] • 32 sets sufficient (analytical bounds) • Storage < 2kB/UMON
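
A rough sketch of simple strided set sampling with a back-of-the-envelope storage estimate; the stride policy, assumed 64B line size, and ~4 bytes per ATD tag entry are illustrative assumptions, not details from the paper:

```python
NUM_SETS = 1024       # 1MB / 64B lines / 16 ways = 1024 sets (assumed line size)
SAMPLED_SETS = 32     # the slide's sufficient sample size
WAYS = 16

stride = NUM_SETS // SAMPLED_SETS
sampled = set(range(0, NUM_SETS, stride))

def umon_should_track(set_index):
    # Only sampled sets are mirrored in the UMON's ATD.
    return set_index in sampled

# Storage estimate, assuming ~4 bytes per ATD tag entry (assumption):
print(SAMPLED_SETS * WAYS * 4)  # 2048 bytes, consistent with "< 2kB/UMON"
```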

  9. Partitioning algorithm • Evaluate all possible partitions and select the best • With a ways to core1 and (16 - a) ways to core2: Hits_core1 = (H0 + H1 + … + Ha-1) ---- from UMON1 Hits_core2 = (H0 + H1 + … + H16-a-1) ---- from UMON2 • Select the a that maximizes (Hits_core1 + Hits_core2) • Partitioning done once every 5 million cycles
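
A minimal sketch of this exhaustive two-core search over the UMON-derived hit curves; requiring each core to receive at least one way is an assumption here, not something the slide states:

```python
def best_two_core_partition(curve1, curve2, total_ways=16):
    """curveN[k-1] = hits for core N with k ways (prefix sums of Hi counters)."""
    best_a, best_hits = None, -1
    for a in range(1, total_ways):            # a ways to core1, 16-a to core2
        hits = curve1[a - 1] + curve2[total_ways - a - 1]
        if hits > best_hits:
            best_a, best_hits = a, hits
    return best_a
```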

  10. Way Partitioning Way partitioning support: [Suh+ HPCA’02, Iyer ICS’04] • Each line has core-id bits • On a miss, count ways_occupied in the set by the miss-causing app • If ways_occupied < ways_given, the victim is the LRU line of the other app; otherwise the victim is the LRU line of the miss-causing app
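
The victim-selection rule can be sketched directly from the flowchart; the set representation below (a core id plus a recency value per line) is an illustrative model, not hardware detail from the paper:

```python
def pick_victim(set_lines, app, ways_given):
    """set_lines: list of (core_id, recency); larger recency = closer to LRU.

    Assumes ways_given never exceeds the set's associativity."""
    ways_occupied = sum(1 for core, _ in set_lines if core == app)
    if ways_occupied < ways_given:
        # Under quota: evict the LRU line belonging to another app.
        candidates = [line for line in set_lines if line[0] != app]
    else:
        # At or over quota: evict the app's own LRU line.
        candidates = [line for line in set_lines if line[0] == app]
    return max(candidates, key=lambda line: line[1])
```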

  11. Outline • Introduction and Motivation • Utility-Based Cache Partitioning • Evaluation • Scalable Partitioning Algorithm • Related Work and Summary

  12. Methodology Configuration: • Two cores: 8-wide, 128-entry window, private L1s • L2: shared, unified, 1MB, 16-way, LRU-based • Memory: 400 cycles, 32 banks Benchmarks: two-threaded workloads divided into 5 categories by baseline weighted speedup; 20 workloads used (four from each type) [Figure: weighted speedup for the baseline, spanning roughly 1.0 to 2.0 across the categories]

  13. Metrics • Three metrics for performance: • Weighted Speedup (default metric) → perf = IPC1/SingleIPC1 + IPC2/SingleIPC2 → correlates with reduction in execution time • Throughput → perf = IPC1 + IPC2 → can be unfair to a low-IPC application • Hmean-fairness → perf = hmean(IPC1/SingleIPC1, IPC2/SingleIPC2) → balances fairness and performance
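
For concreteness, a sketch of the three metrics in Python, generalized to any number of cores (the generalization is an assumption beyond the two-core formulas on the slide):

```python
def weighted_speedup(ipc, single_ipc):
    # Sum of per-core speedups relative to running alone with the whole cache.
    return sum(i / s for i, s in zip(ipc, single_ipc))

def throughput(ipc):
    return sum(ipc)

def hmean_fairness(ipc, single_ipc):
    # Harmonic mean of per-core speedups: balances fairness and performance.
    speedups = [i / s for i, s in zip(ipc, single_ipc)]
    return len(speedups) / sum(1 / s for s in speedups)
```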

  14. Results for weighted speedup UCP improves average weighted speedup by 11%

  15. Results for throughput UCP improves average throughput by 17%

  16. Results for hmean-fairness UCP improves average hmean-fairness by 11%

  17. Effect of Number of Sampled Sets [Figure: performance with 8, 16, and 32 sampled sets vs. all sets] Dynamic Set Sampling (DSS) reduces overhead, not benefits

  18. Outline • Introduction and Motivation • Utility-Based Cache Partitioning • Evaluation • Scalable Partitioning Algorithm • Related Work and Summary

  19. Scalability issues • Time complexity of partitioning is low for two cores (number of possible partitions ≈ number of ways) • Possible partitions increase exponentially with cores • For a 32-way cache, possible partitions: • 4 cores → 6545 • 8 cores → 15.4 million • The problem is NP-hard → need a scalable partitioning algorithm
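
The partition counts follow from a stars-and-bars calculation: splitting W ways among N cores gives C(W+N-1, N-1) possibilities (assuming zero-way allocations are allowed, which reproduces the slide's numbers):

```python
from math import comb

def num_partitions(ways, cores):
    # Compositions of `ways` into `cores` non-negative parts.
    return comb(ways + cores - 1, cores - 1)

print(num_partitions(32, 4))  # 6545
print(num_partitions(32, 8))  # 15380937, i.e. ~15.4 million
```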

  20. Greedy Algorithm [Stone+ ToC ’92] • GA allocates 1 block to the app that has the max utility for one block; repeat till all blocks are allocated • Optimal partitioning when utility curves are convex • Pathological behavior for non-convex curves [Figure: misses per 100 instructions vs. num ways from a 32-way 2MB L2]
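
A minimal sketch of the greedy allocation over per-app miss curves; the curve representation is illustrative, and curves[i][k] must be defined for k up to total_blocks:

```python
def greedy_partition(curves, total_blocks):
    """curves[i][k] = misses for app i when given k blocks."""
    alloc = [0] * len(curves)
    for _ in range(total_blocks):
        # Utility of one more block for each app.
        gains = [c[a] - c[a + 1] for c, a in zip(curves, alloc)]
        winner = max(range(len(curves)), key=lambda i: gains[i])
        alloc[winner] += 1
    return alloc
```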

  21. Problem with Greedy Algorithm In each iteration, the utility for 1 block is: U(A) = 10 misses, U(B) = 0 misses [Figure: misses vs. blocks assigned for apps A and B] All blocks get assigned to A, even if B achieves the same miss reduction with fewer blocks Problem: GA considers benefit only from the immediate next block, so it fails to exploit huge gains that lie further ahead

  22. Lookahead Algorithm • Marginal Utility (MU) = utility per unit of cache resource: MUab = Uab/(b-a) • GA considers MU for 1 block; LA considers MU for all possible allocations • Select the app that has the max MU; allocate it as many blocks as required to reach that max MU • Repeat till all blocks are assigned
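
A minimal sketch of the Lookahead allocation; as above, curves[i][k] is app i's misses with k blocks, and this representation is illustrative rather than the paper's hardware formulation:

```python
def lookahead_partition(curves, total_blocks):
    """curves[i][k] = misses for app i when given k blocks (k = 0..total_blocks)."""
    alloc = [0] * len(curves)
    remaining = total_blocks
    while remaining > 0:
        best = None  # (marginal utility, app index, extra blocks)
        for i, c in enumerate(curves):
            a = alloc[i]
            for b in range(a + 1, a + remaining + 1):
                mu = (c[a] - c[b]) / (b - a)   # MU_a^b = U_a^b / (b - a)
                if best is None or mu > best[0]:
                    best = (mu, i, b - a)
        _, app, extra = best
        alloc[app] += extra
        remaining -= extra
    return alloc
```

On the slide's example (A saving 10 misses per block, B saving 80 misses only after three blocks), this allocates B its 3 blocks first and A the rest, matching the optimal result on the next slide.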

  23. Lookahead Algorithm (example) [Figure: misses vs. blocks assigned for apps A and B] Iteration 1: MU(A) = 10/1 block, MU(B) = 80/3 blocks → B gets 3 blocks Next five iterations: MU(A) = 10/1 block, MU(B) = 0 → A gets 1 block each time Result: A gets 5 blocks and B gets 3 blocks (optimal) Time complexity ≈ ways²/2 (512 ops for 32 ways)

  24. Results for partitioning algorithms Four cores sharing a 2MB 32-way L2 [Figure: performance of LRU, UCP(Greedy), UCP(Lookahead), and UCP(EvalAll) on Mix1 (gap-applu-apsi-gzp), Mix2 (swm-glg-mesa-prl), Mix3 (mcf-applu-art-vrtx), and Mix4 (mcf-art-eqk-wupw)] LA performs similar to EvalAll, with low time complexity

  25. Outline • Introduction and Motivation • Utility-Based Cache Partitioning • Evaluation • Scalable Partitioning Algorithm • Related Work and Summary

  26. Related work [Chart: performance (low/high) vs. overhead (low/high)] • Suh+ [HPCA’02]: Perf += 4%, Storage += 32B/core (low overhead, lower performance) • Zhou+ [ASPLOS’04]: Perf += 11%, Storage += 64kB/core (high performance, high overhead) • UCP: Perf += 11%, Storage += 2kB/core (high performance, low overhead) UCP is both high-performance and low-overhead

  27. Summary • CMP and shared caches are common • Partition shared caches based on utility, not demand • UMON estimates utility at runtime with low overhead • UCP improves performance: • Weighted speedup by 11% • Throughput by 17% • Hmean-fairness by 11% • Lookahead algorithm is scalable to many cores sharing a highly associative cache

  28. Questions

  29. DSS Bounds with Analytical Model Us = sampled mean (num ways allocated by DSS) Ug = global mean (num ways allocated by global monitoring) P = P(Us within 1 way of Ug) By Chebyshev's inequality: P ≥ 1 – variance/n, where n = number of sampled sets In general, variance ≤ 3
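
Plugging the slide's numbers into the bound gives a quick check of why 32 sampled sets suffice (a sketch, taking variance at its stated upper bound of 3):

```python
def dss_probability_bound(n, variance=3):
    # P(Us within 1 way of Ug) >= 1 - variance/n, from Chebyshev's inequality.
    return 1 - variance / n

for n in (8, 16, 32):
    print(n, dss_probability_bound(n))  # 32 sets -> P >= ~0.906
```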

  30. Phase-Based Adaptation of UCP

  31. Galgel – concave utility [Figure: utility curves for galgel, twolf, and parser]

  32. LRU as a stack algorithm
