CMP L2 Cache Management

CMP L2 Cache Management • Presented by: Yang Liu, CPS221, Spring 2008 • Based on: "Optimizing Replication, Communication, and Capacity Allocation in CMPs," Z. Chishti, M. Powell, and T. Vijaykumar; "ASR: Adaptive Selective Replication for CMP Caches," B. Beckmann, M. Marty, and D. Wood

Outline • Motivation • Related Work (1) – Non-uniform Caches • CMP-NuRAPID • Related Work (2) – Replication Schemes • ASR

Motivation • Two options for L2 caches in CMPs • Shared: high latency because of wire delay • Private: more misses because of replication • Need hybrid L2 caches • Keep in mind • On-chip communication is fast • On-chip capacity is limited

NUCA • Non-Uniform Cache Architecture • Place frequently-accessed data closest to the core to allow fast access • Couple tag and data placement • Can only place one or two ways in each set close to the processor

NuRAPID • Non-uniform access with Replacement And Placement usIng Distance associativity • Decouple the set-associative way number from data placement • Divide the cache data array into d-groups • Use forward and reverse pointers • Forward: from tag to data • Reverse: from data to tag • One to one?
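The forward and reverse pointers can be pictured with a small sketch. This is only an illustration, not the paper's implementation: the names `TagEntry`, `Frame`, `DecoupledCache`, `place`, and `move` are hypothetical, and the sketch assumes a one-to-one mapping between a tag and a data frame. It shows that a block can change d-groups while its tag entry stays put.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class TagEntry:
    address: int
    # Forward pointer: (d-group, frame index) where the data lives.
    fwd: Optional[Tuple[int, int]] = None


@dataclass
class Frame:
    data: Optional[bytes] = None
    # Reverse pointer: the tag entry that owns this frame.
    rev: Optional[TagEntry] = None


class DecoupledCache:
    """Hypothetical model: the tag's position is independent of the data's d-group."""

    def __init__(self, n_dgroups: int, frames_per_group: int):
        self.dgroups = [[Frame() for _ in range(frames_per_group)]
                        for _ in range(n_dgroups)]

    def place(self, tag: TagEntry, d_group: int, index: int, data: bytes) -> None:
        frame = self.dgroups[d_group][index]
        frame.data, frame.rev = data, tag
        tag.fwd = (d_group, index)

    def move(self, tag: TagEntry, new_group: int, new_index: int) -> None:
        """Promote or demote a block; only the two pointers change."""
        if tag.fwd == (new_group, new_index):
            return
        old_group, old_index = tag.fwd
        old = self.dgroups[old_group][old_index]
        self.place(tag, new_group, new_index, old.data)
        old.data, old.rev = None, None


cache = DecoupledCache(n_dgroups=4, frames_per_group=2)
tag = TagEntry(address=0x1000)
cache.place(tag, 3, 0, b"block")   # start in the farthest d-group
cache.move(tag, 0, 1)              # promote to the closest d-group
```

The reverse pointer is what lets a frame's owner be found when that frame is chosen for demotion or replacement.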

CMP-NuRAPID - Overview • Hybrid organization: private tags, shared data • Controlled Replication – CR • In-Situ Communication – ISC • Capacity Stealing – CS

CMP-NuRAPID – Structure • Need carefully chosen d-group preference

CMP-NuRAPID – Data and Tag Array • Tag arrays snoop on bus to maintain coherence • The data array is accessed through a crossbar

CMP-NuRAPID – Controlled Replication • For read-only sharing • First use: make no copy, to save capacity • Second use: make a copy, to reduce future access latency • Overall, avoid off-chip misses
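The first-use / second-use policy can be sketched as a small model. This is a simplified illustration under stated assumptions, not the paper's protocol: the class `ControlledReplication`, its dictionaries, and the returned strings are all hypothetical, and coherence, capacity limits, and writes are ignored.

```python
class ControlledReplication:
    """Hypothetical model of copy-on-second-use for read-only shared blocks."""

    def __init__(self):
        self.copies = {}    # address -> set of d-groups holding a data copy
        self.tag_ptr = {}   # (core, address) -> d-group this core's tag points to

    def read(self, core: int, address: int, closest: int) -> str:
        key = (core, address)
        holders = self.copies.setdefault(address, set())
        if not holders:
            # No on-chip copy at all: fetch from memory into the closest d-group.
            holders.add(closest)
            self.tag_ptr[key] = closest
            return "miss"
        if key not in self.tag_ptr:
            # First use: point the tag at an existing copy, make no new copy.
            self.tag_ptr[key] = min(holders)
            return "pointer"
        if self.tag_ptr[key] != closest:
            # Second use: replicate into the closest d-group.
            holders.add(closest)
            self.tag_ptr[key] = closest
            return "replicate"
        return "local_hit"


cr = ControlledReplication()
first = cr.read(core=0, address=0x40, closest=0)    # core 0 brings the block on chip
second = cr.read(core=1, address=0x40, closest=1)   # core 1, first use
third = cr.read(core=1, address=0x40, closest=1)    # core 1, second use
```

In this model a block touched only once by a second core never consumes a second frame, which is the capacity saving the slide refers to.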

CMP-NuRAPID – Timing Issues • Case 1: a read starts before an invalidation and ends after it • Mark the tag busy while the block is being read from a farther d-group • Case 2: a read starts after an invalidation begins and ends before the invalidation completes • Before sending a read request to a farther d-group, put an entry in the queue that holds the order of bus transactions

CMP-NuRAPID – In-situ Communication • For read-write sharing • Adds a communication (C) state • L1 caches write through all blocks in the C state

CMP-NuRAPID – Capacity Stealing • Demote less-frequently-used data to unused frames in the d-groups closer to cores with lower capacity demand • Placement and Promotion • Place all private blocks in the d-group closest to the initiating core • Promote the block directly to the closest d-group for the core

CMP-NuRAPID – Capacity Stealing • Demotion and Replacement • Demote the block to the next-fastest d-group • Replace in the order of invalid, private, and shared • Doesn’t this kind of demotion pollute another core’s fastest d-group?
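The replacement order and the demotion step can be sketched as below. This is an illustration only: `BlockState`, `pick_victim`, and `demotion_target` are hypothetical names, the per-core d-group preference list is an assumed input, and ties are broken by position rather than by any recency information.

```python
from enum import IntEnum
from typing import List, Optional


class BlockState(IntEnum):
    # Lower value = replaced first, matching "invalid, private, shared".
    INVALID = 0
    PRIVATE = 1
    SHARED = 2


def pick_victim(states: List[BlockState]) -> int:
    """Return the index of the frame to replace in a set."""
    return min(range(len(states)), key=lambda i: states[i])


def demotion_target(preference: List[int], current: int) -> Optional[int]:
    """Next-fastest d-group for this core, or None if already in the slowest.

    `preference` lists d-group ids from fastest to slowest for one core.
    """
    pos = preference.index(current)
    return preference[pos + 1] if pos + 1 < len(preference) else None


victim = pick_victim([BlockState.SHARED, BlockState.PRIVATE,
                      BlockState.INVALID, BlockState.PRIVATE])
target = demotion_target(preference=[2, 0, 3, 1], current=0)
```

Because another core's fastest d-group can appear as the next entry in `preference`, this sketch also makes the slide's pollution question concrete.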

CMP-NuRAPID - Methodology • Simics • 4-core CMP • 8 MB, 8-way CMP-NuRAPID with 4 single-ported d-groups • Both multithreaded and multiprogrammed workloads

CMP-NuRAPID – Multithreaded

CMP-NuRAPID – Multiprogrammed

Replication Schemes • Cooperative Caching • Private L2 caches • Restrict replication under certain criteria • Victim Replication • Shared L2 cache • Allow replication under certain criteria • Both have static replication policies • How about dynamic?

ASR - Overview • Adaptive Selective Replication • Dynamic cache block replication • Replicate blocks when the benefits exceed the costs • Benefits: lower L2 hit latency • Costs: More L2 misses

ASR – Sharing Types • Single Requestor • Blocks are accessed by a single processor • Shared Read-Only • Blocks are read, but not written, by multiple processors • Shared Read-Write • Blocks are accessed by multiple processors, with at least one write • Focus on replicating shared read-only blocks • High locality • Little capacity • Large portion of requests

ASR - SPR • Selective Probabilistic Replication • Assumes private L2 caches and selectively limits replication on L1 evictions • Uses probabilistic filtering to make local replication decisions
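Probabilistic filtering on an L1 eviction can be sketched as a single random draw. This is a minimal sketch, not the paper's hardware: the function `should_replicate` is hypothetical, and the probabilities in `LEVEL_PROB` are illustrative values chosen for this example rather than numbers taken from the slides.

```python
import random

# Assumed mapping from replication level to replication probability
# (illustrative values only).
LEVEL_PROB = [0.0, 1 / 64, 1 / 16, 1 / 4, 1 / 2, 1.0]


def should_replicate(level: int, rng: random.Random) -> bool:
    """Decide whether an evicted L1 block is replicated in the local L2."""
    return rng.random() < LEVEL_PROB[level]


rng = random.Random(0)  # seeded so the experiment is repeatable
trials = 10_000
kept = sum(should_replicate(3, rng) for _ in range(trials))
fraction = kept / trials  # should land near LEVEL_PROB[3]
```

Each cache makes this decision locally, so raising or lowering the level changes how much of the local L2 is spent on replicas without any global coordination.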

ASR – Balancing Replication

ASR – Replication Control • Replication levels • C: Current • H: Higher • L: Lower • Cycles • H: Hit cycles-per-instruction • M: Miss cycles-per-instruction
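One way to read these symbols is as inputs to a cost/benefit comparison. The slides define only the symbols, so the rule below is an assumption derived from "replicate blocks when the benefits exceed the costs": `replication_change` is a hypothetical function that takes estimated hit (H) and miss (M) cycles-per-instruction at the lower (L), current (C), and higher (H) levels.

```python
def replication_change(h_l: float, h_c: float, h_h: float,
                       m_l: float, m_c: float, m_h: float) -> int:
    """Return +1 (raise level), -1 (lower level), or 0 (stay).

    h_* are hit cycles-per-instruction, m_* are miss cycles-per-instruction,
    estimated at the lower, current, and higher replication levels.
    """
    # Raising the level: hit cycles saved minus miss cycles added.
    gain_up = (h_c - h_h) - (m_h - m_c)
    # Lowering the level: miss cycles saved minus hit cycles added.
    gain_down = (m_c - m_l) - (h_l - h_c)
    if gain_up > 0 and gain_up >= gain_down:
        return +1
    if gain_down > 0:
        return -1
    return 0


# More replication saves 0.2 hit CPI and costs about 0.05 miss CPI: raise.
up = replication_change(h_l=0.5, h_c=0.4, h_h=0.2, m_l=1.0, m_c=1.05, m_h=1.1)
# Less replication saves 0.2 miss CPI and costs 0.05 hit CPI: lower.
down = replication_change(h_l=0.45, h_c=0.4, h_h=0.38, m_l=0.8, m_c=1.0, m_h=1.3)
```

When neither direction shows a net gain, the function returns 0 and the current level is kept.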

ASR – Replication Control

ASR – Replication Control • Wait until there are enough events to ensure a fair cost/benefit comparison • Wait until four consecutive evaluation intervals predict the same change before changing the replication level
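The two waiting rules amount to hysteresis, which can be sketched as a small state machine. This is an illustration under assumptions: `ReplicationController` and its parameters are hypothetical, the `min_events` threshold and six-level range are made-up values, and an interval with too few events is simply skipped here.

```python
class ReplicationController:
    """Hypothetical controller: act only after `needed` agreeing intervals."""

    def __init__(self, level: int = 0, max_level: int = 5,
                 needed: int = 4, min_events: int = 1000):
        self.level = level
        self.max_level = max_level
        self.needed = needed
        self.min_events = min_events   # assumed threshold for a fair comparison
        self.last = 0                  # direction predicted by the previous interval
        self.streak = 0                # consecutive intervals agreeing on `last`

    def end_interval(self, direction: int, events: int) -> int:
        """direction is +1, -1, or 0; returns the (possibly updated) level."""
        if events < self.min_events:
            return self.level          # not enough events: ignore this interval
        if direction != 0 and direction == self.last:
            self.streak += 1
        else:
            self.streak = 1 if direction != 0 else 0
        self.last = direction
        if self.streak >= self.needed:
            self.level = min(max(self.level + direction, 0), self.max_level)
            self.streak, self.last = 0, 0
        return self.level


ctl = ReplicationController(level=2)
history = [ctl.end_interval(+1, events=5000) for _ in range(4)]
```

In this run the level stays at 2 for three intervals and moves to 3 only on the fourth agreeing prediction; a single disagreeing interval restarts the count.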

ASR – Designs Supported by SPR • SPR-VR • Add 1 bit per L2 cache block to identify replicas • Disallow replication when the local cache set is filled with owner blocks that have identified sharers • SPR-NR • Store a 1-bit counter per remote processor for each L2 block • Remove the shared bus overhead (How?) • SPR-CC • Model the centralized tag structure using an idealized distributed tag structure

ASR - Methodology • Two CMP configurations – Current and Future • 8 processors • Writeback, write-allocate cache • Both commercial and scientific workloads • Use throughput as the metric

ASR – Memory Cycles

ASR - Speedup

Conclusion • Hybrid is better • Dynamic is better • Need tradeoff • How does it scale…
