Aamer Jaleel, Eric Borch, Malini Bhandaru, Simon Steely Jr., Joel Emer. In International Symposium on Microarchitecture (MICRO), December 2010. Presented by: Yingying Tian. Achieving Non-Inclusive Cache Performance with Inclusive Caches: Temporal Locality Aware (TLA) Cache Management Policies.
High Performing Cache Hierarchy in CMPs • Cache Hierarchy • Multiple interacting caches on chip • Tradeoff between cache latency and hit rate • Chip Multi-Processors (CMPs) widen the gap between processor and memory speeds • Goal: an efficient and high-performing cache hierarchy
Key issue: Inclusion or Not? Size of the cache hierarchy vs. simplicity of cache coherence. *Some materials are taken from the original presentation slides
Inclusive Caches • Simplify cache coherence • Waste cache capacity (effective capacity of the hierarchy = size of the LLC) • The inclusion property causes invalidation of blocks that still have high temporal locality in the core caches (the back-invalidate problem), which costs a memory access penalty of hundreds of cycles
Back-Invalidate Problem • Inclusion property: the contents of all the higher-level caches must be a subset of the last-level cache (LLC). • Back-invalidation: when a block is evicted from the LLC, inclusion is enforced by invalidating that block from all the caches in the hierarchy. Such a block is an inclusion victim. • Small caches filter temporal locality from the LLC, so inclusion victims may still have temporal locality in the core caches: hot inclusion victims.
Back-Invalidate Problem (Cont.) • Consider the following access pattern in a 2-level inclusive cache hierarchy: … a, b, a, c, a, d, a, e, a, f … [Figure: contents of L1 and L2, ordered from MRU to LRU, after each reference] • Reference 'e' misses and evicts 'a' from the hierarchy. • The next reference to 'a' misses, even though 'a' has high temporal locality in L1.
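The access pattern above can be replayed with a small simulation. This is only a sketch under stated assumptions: both levels are modeled as fully associative LRU caches, the sizes (2 blocks in L1, 4 blocks in L2) are chosen for illustration and are not given on the slide, and the names `lru_insert` and `simulate_inclusive` are hypothetical.

```python
from collections import OrderedDict


def lru_insert(cache, addr, capacity):
    """Insert addr at the MRU end; return the evicted LRU block, or None."""
    victim = None
    if len(cache) >= capacity:
        victim, _ = cache.popitem(last=False)  # the first key is the LRU block
    cache[addr] = True
    return victim


def simulate_inclusive(trace, l1_size=2, l2_size=4):
    """Replay a trace on a 2-level inclusive hierarchy (sizes are assumptions)."""
    l1, l2 = OrderedDict(), OrderedDict()
    log = []
    for addr in trace:
        if addr in l1:
            # L1 hit: the L2 never observes this reference,
            # so its LRU state for addr goes stale.
            l1.move_to_end(addr)
            log.append((addr, "L1 hit"))
            continue
        if addr in l2:
            l2.move_to_end(addr)
            log.append((addr, "L2 hit"))
        else:
            victim = lru_insert(l2, addr, l2_size)
            if victim is not None:
                l1.pop(victim, None)  # back-invalidate to enforce inclusion
            log.append((addr, "miss"))
        lru_insert(l1, addr, l1_size)
    return log


log = simulate_inclusive("abacadaeaf")
for addr, outcome in log:
    print(addr, outcome)
```

With these assumed sizes, the reference to 'e' evicts 'a' from L2 and back-invalidates it from L1, so the following reference to 'a' is reported as a miss even though 'a' had just hit in L1.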
Back-Invalidate Problem (Cont.) • Intel Core i7: 1:8 cache ratio, inclusive LLCs. • AMD Phenom II: 1:4 cache ratio, non-inclusive LLCs.
Goal: to implement an efficient and high-performing cache hierarchy • by eliminating hot inclusion victims to improve inclusive cache performance: Temporal Locality Aware Cache Management Policies
Outline • Background and motivation • Problem description • Temporal Locality Aware (TLA) Cache Management Policy Suite • Evaluation • Conclusion
3 Temporal Locality Aware (TLA) Cache Management Policies: • Temporal Locality Hints (TLH) • Early Core Invalidation (ECI) • Query Based Selection (QBS)
Temporal Locality Hints (TLH) • Conveys the temporal locality of hot blocks in the core caches by sending a hint to the LLC on each core cache hit, which updates the replacement state of that block in the LLC. • Significantly reduces the number of inclusion victims • The number of requests to the LLC is extremely large and does not scale well with an increasing number of cores (even with filter optimizations) • Used as a limit study
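The hint mechanism can be sketched on the same toy model used for the a, b, a, c, … pattern. The assumptions are the same and are not from the slides: one core, fully associative LRU caches, 2 blocks in L1 and 4 in L2. The name `simulate_tlh` is hypothetical, and a hint is modeled simply as an LRU promotion in L2.

```python
from collections import OrderedDict


def lru_insert(cache, addr, capacity):
    """Insert addr at the MRU end; return the evicted LRU block, or None."""
    victim = None
    if len(cache) >= capacity:
        victim, _ = cache.popitem(last=False)
    cache[addr] = True
    return victim


def simulate_tlh(trace, l1_size=2, l2_size=4):
    """Return (memory misses, hints sent) for an inclusive hierarchy with TLH."""
    l1, l2 = OrderedDict(), OrderedDict()
    misses, hints = [], 0
    for addr in trace:
        if addr in l1:
            l1.move_to_end(addr)
            # Temporal locality hint: every L1 hit also promotes the block
            # in L2. Inclusion guarantees the block is present in L2.
            l2.move_to_end(addr)
            hints += 1
            continue
        if addr in l2:
            l2.move_to_end(addr)
        else:
            victim = lru_insert(l2, addr, l2_size)
            if victim is not None:
                l1.pop(victim, None)  # back-invalidate
            misses.append(addr)
        lru_insert(l1, addr, l1_size)
    return misses, hints


misses, hints = simulate_tlh("abacadaeaf")
print(misses, hints)
```

In this sketch 'a' misses only once (the cold miss), at the cost of one hint per L1 hit. The hint count grows with the core cache hit rate, which is the scalability concern listed above.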
Early Core Invalidation (ECI) • Derives the temporal locality of a block before it becomes LRU in the LLC. • The LLC chooses the block at the [LRU-1] position and invalidates it in the core caches while keeping it in the LLC. • By observing the core's subsequent requests, the LLC derives the block's temporal locality. • Occurs on each LLC miss
Early Core Invalidation (ECI) cont. • Early-invalidated block: ECI block • If the ECI block is hot in a core cache, that core re-requests it: the request misses in L1 but hits in the LLC, and the block moves back to MRU in the LLC, which preserves its temporal locality. • If the ECI block is not hot (not re-requested, or re-requested only after a long time), it is evicted from the LLC on the next LLC miss in the corresponding set. • Lower-traffic solution (the number of LLC misses is much smaller than the number of core cache hits) • Low prediction accuracy (it predicts whether the ECI block is hot in the core caches): what if the ECI block is hot, but not that hot?
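The two ECI outcomes described above can be illustrated with a minimal sketch. It assumes one core, fully associative LRU caches with 2 blocks in L1 and 4 in L2 (sizes picked for illustration), and applies ECI only when the L2 is full. The function name `simulate` and the extended trace are hypothetical.

```python
from collections import OrderedDict


def simulate(trace, eci, l1_size=2, l2_size=4):
    """Return the list of references that had to go to memory."""
    l1, l2 = OrderedDict(), OrderedDict()
    mem_misses = []
    for addr in trace:
        if addr in l1:
            l1.move_to_end(addr)
            continue
        if addr in l2:
            # A re-requested ECI block hits here and returns to MRU.
            l2.move_to_end(addr)
        else:
            mem_misses.append(addr)
            if len(l2) >= l2_size:
                victim, _ = l2.popitem(last=False)
                l1.pop(victim, None)  # regular back-invalidate
                if eci:
                    # The block that was at LRU-1 is now the next victim:
                    # invalidate it in the core cache but keep it in L2.
                    next_victim = next(iter(l2))
                    l1.pop(next_victim, None)
            l2[addr] = True
        if len(l1) >= l1_size:
            l1.popitem(last=False)
        l1[addr] = True
    return mem_misses


trace = "abacadaeafagahaia"
baseline = simulate(trace, eci=False)
with_eci = simulate(trace, eci=True)
print(baseline.count("a"), with_eci.count("a"))
```

On this trace the baseline sends 'a' to memory three times, while the ECI variant does so twice: the early invalidation forces one extra L1 miss for 'a', which hits in L2 and moves 'a' back to MRU before it can be evicted. ECI does not save the earlier eviction, where 'a' was already at the LRU position when the set first filled up.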
Query Based Selection (QBS) • Infers the temporal locality of a block in the LLC by querying the core caches on each LLC miss. • The LLC selects a replacement candidate and queries all core caches to check whether the block is present in any of them. • Only a block that is not present in any core cache is replaced. • If the candidate is present in a core cache, the LLC updates its replacement state to MRU, then selects and queries another replacement candidate.
Query Based Selection (QBS) Cont. • The QBS victim selection process is hidden by memory latency. • The cache controller can limit the number of queries issued on an LLC miss. • Based on the experiments, sending 2 queries is sufficient to achieve the performance benefits. • Performs similarly to a non-inclusive cache hierarchy. • The on-chip communication overhead is extremely large. [not mentioned in the paper]
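The selection loop with a query limit can be sketched as follows. The model is a simplification and not the paper's implementation: one core, fully associative LRU caches with 2 blocks in L1 and 4 in L2, and, as an assumption, once the query limit is reached the current LRU block is evicted without a further query. The name `simulate_qbs` is hypothetical.

```python
from collections import OrderedDict


def simulate_qbs(trace, max_queries=2, l1_size=2, l2_size=4):
    """Return (memory misses, queries sent to the core cache)."""
    l1, l2 = OrderedDict(), OrderedDict()
    mem_misses, queries = [], 0
    for addr in trace:
        if addr in l1:
            l1.move_to_end(addr)
            continue
        if addr in l2:
            l2.move_to_end(addr)
        else:
            mem_misses.append(addr)
            if len(l2) >= l2_size:
                for _ in range(max_queries):
                    candidate = next(iter(l2))  # current LRU block
                    queries += 1
                    if candidate not in l1:
                        break  # not resident in the core cache: safe victim
                    l2.move_to_end(candidate)  # resident: promote to MRU
                # Evict the current LRU block. If the query limit was hit,
                # this block is evicted unqueried (assumption of this sketch).
                victim, _ = l2.popitem(last=False)
                l1.pop(victim, None)
            l2[addr] = True
        if len(l1) >= l1_size:
            l1.popitem(last=False)
        l1[addr] = True
    return mem_misses, queries


mem_misses, queries = simulate_qbs("abacadaeaf")
print(mem_misses, queries)
```

On the a, b, a, c, … pattern, the miss on 'e' first queries 'a', finds it in L1, promotes it, and evicts 'b' instead, so 'a' never becomes an inclusion victim. Queries are issued only on LLC misses to a full set, which is far less often than the per-hit hints of TLH.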
An example (. . . a, b, a, c, a, d, a, e, a, f, a, . . . . )
Experimental Methodology • CMP$im: x86 simulator • Baseline: 2-core CMP, 3-level inclusive cache hierarchy • L1 I/D: 4-way, 32KB, 64B block size, 1-cycle access latency • L2: 8-way, 256KB, 64B block size, 10-cycle access latency, non-inclusive • L3 (LLC): shared, 16-way, 2MB, 24-cycle access latency, enforces inclusion • Main memory: 150-cycle access latency • Benchmarks: 15 benchmarks selected from the SPEC CPU 2006 benchmark suite based on program behavior (core cache fitting, LLC fitting, LLC thrashing; 5 benchmarks of each) • Total workloads: 105 2-core workloads (15 choose 2)
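The workload count follows from pairing the 15 benchmarks without repetition. A quick check, using placeholder names because the slide does not list the selected benchmarks:

```python
from itertools import combinations
from math import comb

# Hypothetical placeholder names; the actual SPEC CPU 2006 selections
# are not listed on the slide.
benchmarks = [f"bench{i}" for i in range(15)]

# Every unordered pair of distinct benchmarks forms one 2-core workload.
workloads = list(combinations(benchmarks, 2))

print(comb(15, 2), len(workloads))
```

Both values are 105, matching the total on the slide.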
Performance [Figure: performance results chart; values shown on the slide, in extraction order: 5.2%, 6.1%, 3.4%, 6.1%, 6.6%, 6.1%]
Performance (Cont.) QBS performs similarly to non-inclusive caches for all cache ratios
Performance (Cont.) Scalability of QBS in 2-core, 4-core and 8-core CMPs (1:4 cache size ratio)
Conclusion • Temporal Locality Aware Cache Management • Retains the benefits of inclusion while minimizing the back-invalidate problem • TLA-managed inclusive cache = performance of non-inclusive cache Thanks! Questions?