Adaptive Mechanisms and Policies for Managing Cache Hierarchies in Chip Multiprocessors
Evan Speight, Hazim Shafi, Lixin Zhang, and Ram Rajamony
ISCA 2005
Current trend of CMP
• Billions of transistors per chip
• Massive multicore chips in the future (32 cores)
• Private L1 and L2 caches for each core, forming a tile
• Severe limitations on performance due to power budgets
• Caches occupy a large area on chip, hence careful management of the cache hierarchy is highly desirable
CMP cache solutions proposed
This paper proposes the following solutions:
• Use the L3 cache as a victim cache for lines evicted from L2
• Avoid unnecessary writebacks of clean lines
• Use peer L2 caches for victimised lines instead of writing them back
• Maintain reuse history for replaced lines for selective snarfing
These techniques provide an average 13% performance improvement on commercial workloads, as shown later.
Issues with blind L3 writeback
• Dirty lines must be written back
• Writing back clean lines reduces the latency of subsequent accesses to the line
• The writeback is unnecessary, however, if the line already resides in another L2 or the L3 cache
• Such excessive writebacks put pressure on on-chip and off-chip bandwidth
• Hence writebacks need to be regulated
Selective Writeback
• Use a history table (the Writeback History Table, WBHT) to hint at the presence of a line in L3
• One table is associated with each L2 cache
• The table is updated/accessed on each writeback of a clean line
• The table size is much smaller than the cache size
• LRU replacement decides which lines' history is maintained
Selective Writeback Mechanism
• If the line being written back has no WBHT entry (assumed not present in L3), it is written back to L3 and the WBHT is updated
• On a subsequent replacement of the line, the WBHT is checked
• If the line has an entry in the WBHT, the writeback is squashed
• Note that the accuracy of the WBHT only affects performance, not correctness
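To make the mechanism concrete, here is a minimal C++ sketch of a per-L2 WBHT with LRU replacement and the squash decision described above. The class and function names (WBHT, probablyInL3, onCleanLineEviction, writeBackToL3) are illustrative assumptions, not taken from the paper or the Mambo simulator.

```cpp
// Hypothetical sketch of selective writeback with a small Writeback History
// Table (WBHT) per L2 cache. A hit in the table means the clean line was
// written back recently and is presumed to still reside in L3.
#include <cstddef>
#include <cstdint>
#include <list>
#include <unordered_map>

class WBHT {
public:
    explicit WBHT(std::size_t capacity) : capacity_(capacity) {}

    // Returns true if the line has an entry, i.e. it is presumed to be in L3.
    bool probablyInL3(uint64_t lineAddr) {
        auto it = map_.find(lineAddr);
        if (it == map_.end()) return false;
        lru_.splice(lru_.begin(), lru_, it->second);  // refresh LRU position
        return true;
    }

    // Records that a clean line was written back to L3, evicting the LRU
    // entry when the table is full (the table is much smaller than the cache).
    void allocate(uint64_t lineAddr) {
        if (map_.count(lineAddr)) return;
        if (lru_.size() == capacity_) {
            map_.erase(lru_.back());
            lru_.pop_back();
        }
        lru_.push_front(lineAddr);
        map_[lineAddr] = lru_.begin();
    }

private:
    std::size_t capacity_;
    std::list<uint64_t> lru_;  // MRU at the front
    std::unordered_map<uint64_t, std::list<uint64_t>::iterator> map_;
};

// On replacement of a clean L2 line: squash the writeback if the WBHT hints
// the line is already in L3; otherwise write it back and remember that fact.
void onCleanLineEviction(WBHT& wbht, uint64_t lineAddr) {
    if (wbht.probablyInL3(lineAddr)) {
        return;                  // writeback squashed; a wrong hint only costs
                                 // an extra memory fetch later, not correctness
    }
    // writeBackToL3(lineAddr);  // placeholder for the actual writeback
    wbht.allocate(lineAddr);
}
```

Because a stale hint only forces a later fetch from memory, the table can be kept small and inaccurate without affecting correctness, which is why the slides stress that accuracy is purely a performance concern.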
Potential Issues
• L3 will replace lines due to capacity misses
• If such a replaced line still has a WBHT entry, the L2 will not write it back to L3
• On a subsequent access to the line, it must then be fetched from memory
• Due to the WBHT's limited size, an entry may be evicted even though the line is still present in L3
• A writeback queue entry remains occupied while the WBHT is accessed
What if peer caches are not used for writebacks?
• Higher writeback penalty, since the L3 is off chip and has a longer access time
• On subsequent accesses, the L3 latency comes into the picture
• Off-chip accesses consume more power, placing more constraints on the overall design
Thus the use of peer L2 caches for writebacks is desirable.
Factors to account for when using peer caches
• Minimise negative interference at the recipient peer L2 cache
• Ensure that only useful lines are retained on chip in peer L2 caches
• Modifications are needed in the cache coherence protocol
• Keep the cache controller hardware simple
Mechanism for using peer caches
• Identify lines to be evicted in peer caches. Invalid lines are preferred!
• If none, choose shared lines for replacement
• Use a table to indicate and select which lines are likely to be reused
• If a peer cache already has the line in a clean state, squash the writeback via a snoop response
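The victim-selection preference at the recipient cache can be sketched as follows. This is an illustrative C++ fragment assuming a simple set-associative peer L2; the types and the refusal behaviour when no invalid or shared line exists are assumptions for the example, not the controller logic from the paper.

```cpp
// Sketch of how a peer L2 might pick a victim when asked to snarf a line
// written back by another L2: prefer invalid lines, then shared lines, and
// refuse otherwise to limit negative interference at the recipient cache.
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

enum class LineState { Invalid, Shared, Exclusive, Modified };

struct CacheLine {
    uint64_t  tag   = 0;
    LineState state = LineState::Invalid;
};

using CacheSet = std::vector<CacheLine>;  // one way per element

// Returns the way to evict for an incoming snarfed line, or nothing if the
// set holds only exclusive/modified lines (snarf refused).
std::optional<std::size_t> pickSnarfVictim(const CacheSet& set) {
    for (std::size_t way = 0; way < set.size(); ++way)   // 1st choice: invalid
        if (set[way].state == LineState::Invalid) return way;
    for (std::size_t way = 0; way < set.size(); ++way)   // 2nd choice: shared
        if (set[way].state == LineState::Shared) return way;
    return std::nullopt;
}
```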
Mechanism to guess line reuse
• On a writeback, allocate an entry in the reuse table
• On a subsequent miss to the line, set the "use" bit if the line has an entry in the table
• On a subsequent writeback of the line, consult the reuse table
• If the use bit is set, initiate a snarf by the peer L2 caches
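A rough C++ sketch of this reuse tracking is given below, assuming a per-L2 table similar in spirit to the WBHT. The entry layout, method names, and the omission of capacity/LRU management are assumptions made for brevity.

```cpp
// Hypothetical reuse table that drives the snarfing decision: a line is only
// snarfed into a peer L2 if it missed again after a previous writeback.
#include <cstdint>
#include <unordered_map>

struct ReuseEntry {
    bool used = false;  // set when the line misses again after being written back
};

class ReuseTable {
public:
    // Step 1: on a writeback, allocate (or keep) an entry for the line.
    void onWriteback(uint64_t lineAddr) { table_.emplace(lineAddr, ReuseEntry{}); }

    // Step 2: on a subsequent miss to the same line, mark it as reused.
    void onMiss(uint64_t lineAddr) {
        auto it = table_.find(lineAddr);
        if (it != table_.end()) it->second.used = true;
    }

    // Step 3: on the next writeback, snarf into a peer L2 only if the line
    // showed reuse; otherwise fall back to the normal writeback path.
    bool shouldSnarf(uint64_t lineAddr) const {
        auto it = table_.find(lineAddr);
        return it != table_.end() && it->second.used;
    }

private:
    std::unordered_map<uint64_t, ReuseEntry> table_;
};
```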
Simulation Environment
• IBM's Mambo simulator used to model the cache hierarchy
• Coherence protocol similar to that of the IBM POWER4
• Varying numbers of simultaneously outstanding load/write misses are simulated
• Applications simulated:
  • Transaction Processing (TP)
  • Commercial Processing Workload (CPW2)
  • NotesBench
  • Trade2
Runtime Improvements
Figure: Runtime improvement over baseline of the Writeback History Table
Figure: Runtime improvement of updating all WBHTs using the L3 snoop response
Effect of Varying WBHT Size
Figure 4: Runtime of varying L2 WBHT sizes, normalized to the 512-entry WBHT system
Effect of L2 Snarfing
Figure: Runtime improvement over baseline of allowing L2 snarfing
Improvements by L2 snarfing and combined mechanisms
Figure: Runtime of varying L2 snarf table sizes, normalized to the 512-entry snarf table system
Figure: Runtime improvement over baseline of the combined tables
Conclusion
• These simple adaptive mechanisms have a positive effect on performance
• The effect of combining both techniques is not additive
• Even small history tables can remove more than half of the unnecessary writebacks
• L2 snarfing results in fewer off-chip accesses