ACM: An Efficient Approach for Managing Shared Caches in Chip Multiprocessors Mohammad Hammoud, Sangyeun Cho, and Rami Melhem Presenter: Socrates Demetriades Dept. of Computer Science University of Pittsburgh
Tiled CMP Architectures • Tiled CMP architectures have recently been advocated as a scalable design. • They replicate identical building blocks (tiles) connected over a switched network-on-chip (NoC). • A tile typically incorporates a private L1 cache and an L2 cache bank. • A traditional practice of CMP caches is one that logically shares the physically distributed L2 banks (the shared scheme).
Shared Scheme • The home tile of a cache block B is designated by the HS bits of B’s physical address. • Tile T1 requests B: L2 miss. • B is fetched from main memory and mapped at its home tile (together with its directory information). • Pros: • High capacity utilization. • Simple coherence enforcement (only for L1).
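The home-tile mapping can be pictured with a minimal Python sketch. The position and width of the HS field below are assumptions chosen to match a 16-tile CMP with 64-byte blocks, not details taken from the paper:

```python
def home_tile(phys_addr: int, block_offset_bits: int = 6, hs_bits: int = 4) -> int:
    """Return the home tile of the block holding phys_addr.

    Assumes (hypothetically) that the HS field sits directly above the
    64-byte block offset; a real design may place it elsewhere.
    """
    return (phys_addr >> block_offset_bits) & ((1 << hs_bits) - 1)

# Example: a block whose HS bits are 1111 maps to home tile T15.
assert home_tile(0b1111 << 6) == 15
```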
Shared Scheme: Latency Problem (Cons) • Access latencies to L2 banks differ depending on the distances between requester cores and target banks. • This design is referred to as a Non-Uniform Cache Architecture (NUCA).
NUCA Solution: Block Migration • Idea: move accessed blocks closer to the requesting cores (block migration). • HS of B = 1111, so B’s home tile is T15. • T0 requests block B: total hops = 14. • B is migrated from T15 to T0. • T0 requests B again: local hit, total hops = 0.
NUCA Solution: Block Migration • HS of B = 0110, so B’s home tile is T6. • T3 requests B (hops = 6), T0 requests B (hops = 8), T8 requests B (hops = 8): total hops = 22. • Assume B is migrated to T3. • T3 requests B (hops = 0), T0 requests B (hops = 11), T8 requests B (hops = 13): total hops = 24. • Though T3 saved 6 hops, in total there is a loss of 2 hops.
Our work • Collect information about tiles (sharers) that have accessed a block B. • Depend on the past to predict the future: a core that accessed a block in the past is likely to access it again in the future. • Migrate B to a tile (host) that minimizes the overall number of NoC hops needed.
Talk roadmap • Predicting optimal host location • Locating Migratory Blocks • Cache-the-cache-tag policy. • Replacement policy upon migration • Swap-with-the-lru policy. • Quantitative Evaluation • Conclusion and future work
Predicting Optimal Host Location • Keeping a cache block B at its home tile might not be optimal. • The best host location of B is not known until runtime. • Adaptive Controlled Migration (ACM): • Keep a pattern of the tiles (sharers) that access B. • At runtime (after a specific migration frequency level is reached for B), compute the best host to migrate B to by finding the tile that minimizes the total latency cost across the sharers of B.
ACM: A Working Example • Tiles 0 and 6 are the sharers of B: • Case 1: Tile 3 is the host → total latency cost = 14. • Case 2: Tile 15 is the host → total latency cost = 22. • Case 3: Tile 2 is the host → total latency cost = 10. • Case 4: Tile 0 is the host → total latency cost = 8. • ACM selects T0.
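A minimal sketch of this host-selection step is shown below (Python, illustrative only). It assumes a 4x4 mesh with row-major tile numbering and a hop model of Manhattan distance plus one hop per message leg, with local hits costing zero; under these assumptions it reproduces the four costs above.

```python
def dist(a: int, b: int, width: int = 4) -> int:
    """Manhattan distance between tiles a and b on a width x width mesh."""
    return abs(a % width - b % width) + abs(a // width - b // width)

def latency_cost(host: int, sharers, width: int = 4) -> int:
    """Total round-trip hop cost if block B is hosted at `host`.

    Assumed hop model: each request/reply leg costs distance + 1 hops;
    a sharer co-located with the host hits locally at cost 0.
    """
    return sum(0 if s == host else 2 * (dist(s, host, width) + 1) for s in sharers)

def best_host(sharers, num_tiles: int = 16) -> int:
    """ACM-style prediction: the tile minimizing the total latency cost
    across B's sharers (ties broken toward the lowest tile id)."""
    return min(range(num_tiles), key=lambda t: latency_cost(t, sharers))

sharers = [0, 6]                      # tiles that have accessed B
print(latency_cost(3, sharers))       # 14 (Case 1)
print(latency_cost(15, sharers))      # 22 (Case 2)
print(latency_cost(2, sharers))       # 10 (Case 3)
print(latency_cost(0, sharers))       # 8  (Case 4)
print(best_host(sharers))             # 0 -> migrate B to T0
```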
Locating Migratory Blocks • After a cache block B is migrated, the HS bits of B’s physical address can’t be used anymore to locate B on a subsequent access. • Assume B has been migrated from its home tile T4 (HS of B = 0100) to a new host tile T7. • T3 requests B: the lookup at T4 misses although B is still on chip at T7 (a false L2 miss). • A tag can be kept at T4 to point to T7. • Scenario: 3-way cache-to-cache transfer (T3, T4, and T7). • Deficiencies: • Useless migration. • Fails to exploit distance locality.
Locating Migratory Blocks: cache-the-cache-tag Policy • Idea: cache the tag of block B at the requester’s tile (within a data structure referred to as the MT table). • First access: T3 requests B and looks up its MT table before reaching B’s home tile (T4, HS of B = 0100). • MT miss: 3-way communication. • T3 caches B’s tag in its MT table. • Subsequent accesses: T3 requests B and looks up its MT table again. • MT hit: direct fetch from the host tile.
Locating Migratory Blocks: cache-the-cache-tag Policy • The MT table of a tile T can now hold two types of tags: • A tag for each block B whose home tile is T and that has been migrated to another tile (local entry). • Tags that track the locations of migratory blocks recently accessed by T but whose home tile is not T (remote entries). • The MT table replacement policy prefers, in order: • An invalid tag. • The LRU remote entry. • The remote and local tags of B are kept consistent by extending the local entry of B at B’s home tile with a bit mask indicating which tiles have cached corresponding remote entries.
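A simplified sketch of the policy is given below (Python, illustrative only). The entry layout and helper names are assumptions, and the consistency bit mask is omitted:

```python
from collections import OrderedDict

class MTTable:
    """Per-tile table caching <block tag -> current host tile> mappings."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.entries = OrderedDict()          # tag -> (host_tile, is_local)

    def lookup(self, tag):
        entry = self.entries.get(tag)
        if entry is not None:
            self.entries.move_to_end(tag)     # keep LRU order on hits
        return entry

    def insert(self, tag, host_tile, is_local=False):
        if tag not in self.entries and len(self.entries) >= self.capacity:
            # Replacement policy from the slides: evict the LRU remote entry;
            # local entries stay so the home tile can keep redirecting requests
            # for blocks that migrated away from it. (If only local entries
            # remain, this sketch simply lets the table grow.)
            for victim, (_, local) in list(self.entries.items()):
                if not local:
                    del self.entries[victim]
                    break
        self.entries[tag] = (host_tile, is_local)
        self.entries.move_to_end(tag)

def locate(requester, tag, home_tile, mt_tables):
    """cache-the-cache-tag lookup: MT hit -> direct fetch from the host;
    MT miss -> 3-way transfer via the home tile, then cache the tag."""
    hit = mt_tables[requester].lookup(tag)
    if hit is not None:
        return hit[0]                                     # MT hit
    host_entry = mt_tables[home_tile].lookup(tag)         # ask the home tile
    host = host_entry[0] if host_entry else home_tile
    mt_tables[requester].insert(tag, host)                # remember for next time
    return host
```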
Replacement Policy Upon Migration: swap-with-the-lru Policy • After the ACM algorithm predicts the optimal host H for a block B, a decision must be made about which block to replace at H when B migrates there. • Idea: swap B with the LRU block at H (the swap-with-the-lru policy). • The LRU block at H could be: • A migratory block. • A non-migratory block. • The swap-with-the-lru policy is very effective, especially for workloads whose working sets are large relative to the L2 banks (it bears similarity to victim replication but is more robust).
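The swap itself can be sketched as follows (Python, illustrative only; the banks are modeled as plain MRU-to-LRU lists rather than real cache sets):

```python
def migrate_with_swap(block_b, source_bank, host_bank):
    """swap-with-the-lru: move B to the predicted host bank and send the
    host's LRU block back to B's old slot, so no block is evicted from L2.

    Both banks are lists ordered from MRU (front) to LRU (back); the victim
    may itself be migratory or not.
    """
    lru_victim = host_bank.pop()          # LRU block at the host bank
    source_bank.remove(block_b)           # B leaves its current bank
    host_bank.insert(0, block_b)          # B becomes MRU at the host
    source_bank.append(lru_victim)        # victim takes B's old place
    return lru_victim

src = ["A", "B", "C"]                     # bank where B currently resides
dst = ["X", "Y", "Z"]                     # predicted host bank (Z is LRU)
migrate_with_swap("B", src, dst)
print(src, dst)                           # ['A', 'C', 'Z'] ['B', 'X', 'Y']
```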
Quantitative Evaluation: Methodology and Benchmarks • We simulate a 16-way tiled CMP. • Simulator: Simics 3.0.29 (Solaris OS). • Cache line size: 64 bytes. • L1 I/D size/ways/latency: 16KB / 2 ways / 1 cycle. • L2 size/ways/latency: 512KB per bank / 16 ways / 6 cycles. • Latency per hop: 5 cycles. • Memory latency: 300 cycles. • Migration frequency level: 10. • Benchmarks: single-threaded, multiprogramming, and multithreaded workloads.
Quantitative Evaluation: Single-threaded and Multiprogramming Results • VR successfully offsets its miss-rate increase with fast replica hits for all the single-threaded benchmarks. • VR fails to offset the L2 miss increase of MIX1 and MIX2 (poor capacity utilization), while ACM maintains efficient capacity utilization. • For single-threaded workloads: ACM generates on average 20.5% and 3.7% better AAL (average L2 access latency) than S and VR, respectively. • For multiprogramming workloads: ACM generates on average 2.8% and 31.3% better AAL than S and VR, respectively.
Quantitative Evaluation: Multithreaded Results • An increase in the degree of sharing suggests that the capacity occupied by replicas could increase significantly, leading to a decrease in the effective L2 cache size. • ACM exhibits AALs that are on average 27% and 37.1% better than S and VR, respectively.
Quantitative Evaluation: Avg. Memory Access Cycles Per 1K Instr. • ACM performs on average 18.6% and 2.6% better than S for the single-threaded and multiprogramming workloads, respectively. • ACM performs on average 20.7% better than S for multithreaded workloads. • VR performs on average 15.1% better than S and 38.4% worse than S for the single-threaded and multiprogramming workloads, respectively. • VR performs on average 19.6% worse than S for multithreaded workloads.
Quantitative Evaluation: ACM Scalability • As the number of tiles on a CMP platform increases, the NUCA problem is exacerbated. • ACM is independent of the underlying platform and always selects hosts that minimize AAL. • More exposure to the NUCA problem translates effectively into a larger benefit from ACM. • For the simulated benchmarks: with a 16-way CMP, ACM improves AAL by 11.6% over S. • With a 32-way CMP, ACM improves AAL by 56.6% on average over S.
Quantitative Evaluation: Sensitivity to MT Table Sizes • With MT table sizes of half (50%) and a quarter (25%) of the regular L2 cache bank size, ACM increases AAL by 5.9% and 11.3%, respectively, over the base configuration (100%, i.e., identical to the L2 cache bank size).
Quantitative Evaluation: Sensitivity to L2 Cache Sizes • ACM maintains an AAL improvement of 39.7% over S across L2 cache sizes. • VR fails to demonstrate stability.
Conclusion • This work proposes ACM, a strategy to manage CMP NUCA caches. • ACM offers: • Better average L2 access latency than traditional NUCA (20.4% on average). • The same L2 miss rate as NUCA. • ACM proposes a robust location strategy (cache-the-cache-tag) that can work for any NUCA migration scheme. • ACM reveals the usefulness of the migration technique in the CMP context.
Future work • Improve the ACM prediction mechanism. • Currently: cores are treated equally (we consider only the case with 0-1 weights, assigning 1 to a core that accessed block B and 0 to one that didn’t). • Improvement: reflect the non-uniformity in cores’ access weights (trade-off between access weights and storage overhead). • Propose an adaptive mechanism for selecting migration frequency levels.
Thank you! ACM: An Efficient Approach for Managing Shared Caches in Chip Multiprocessors M. Hammoud, S. Cho, and R. Melhem Special thanks to Socrates Demetriades Dept. of Computer Science University of Pittsburgh