Adaptive Mechanisms and Policies for Managing Cache Hierarchies in Chip Multiprocessors
Evan Speight, Hazim Shafi, Lixin Zhang, and Ram Rajamony
ISCA 2005
Current trend of CMP
• Billions of transistors per chip
• Massive multicore chips in the future (32 cores)
• Private L1 and L2 caches for each core, forming a tile
• Severe limitations on performance due to power budgets
• Caches occupy a large area on chip, hence careful management of the cache hierarchy is highly desirable
CMP cache solutions proposed
This paper proposes the following solutions:
• Use the L3 cache as a victim cache for lines evicted from L2
• Avoid unnecessary writebacks of clean lines
• Use peer L2 caches for victimised lines instead of writing them back
• Maintain reuse history for replaced lines for selective snarfing
These techniques provide an average 13% performance improvement on commercial workloads, as shown later.
Issues with blind L3 writeback
• Dirty lines must be written back
• Writing back clean lines reduces the latency of subsequent accesses to the line
• The writeback is unnecessary, however, if the line already resides in another L2 or the L3 cache
• Such excessive writebacks put pressure on on-chip and off-chip bandwidth
• Hence writebacks need to be regulated
Selective Writeback
• Use a history table (the Writeback History Table, WBHT) to hint at the presence of a line in L3
• One table is associated with each L2 cache
• The table is updated/accessed on each writeback of a clean line
• The table size is much smaller than the cache size
• LRU replacement decides which lines' history is maintained
Selective Writeback Mechanism
• If the line being written back has no WBHT entry (assumed not present in L3), it is written back to L3 and the WBHT is updated
• On a subsequent replacement of the line, the WBHT is checked
• If the line has an entry in the WBHT, the writeback is squashed
• Note that the accuracy of the WBHT only affects performance, not correctness
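To make the mechanism concrete, here is a minimal C++ sketch of a per-L2 WBHT with LRU replacement and the squash decision described above. The class and function names (WBHT, probablyInL3, onCleanLineEviction, writeBackToL3) are illustrative assumptions, not taken from the paper or the Mambo simulator.

```cpp
// Hypothetical sketch of selective writeback with a small Writeback History
// Table (WBHT) per L2 cache. A hit in the table means the clean line was
// written back recently and is presumed to still reside in L3.
#include <cstddef>
#include <cstdint>
#include <list>
#include <unordered_map>

class WBHT {
public:
    explicit WBHT(std::size_t capacity) : capacity_(capacity) {}

    // Returns true if the line has an entry, i.e. it is presumed to be in L3.
    bool probablyInL3(uint64_t lineAddr) {
        auto it = map_.find(lineAddr);
        if (it == map_.end()) return false;
        lru_.splice(lru_.begin(), lru_, it->second);  // refresh LRU position
        return true;
    }

    // Records that a clean line was written back to L3, evicting the LRU
    // entry when the table is full (the table is much smaller than the cache).
    void allocate(uint64_t lineAddr) {
        if (map_.count(lineAddr)) return;
        if (lru_.size() == capacity_) {
            map_.erase(lru_.back());
            lru_.pop_back();
        }
        lru_.push_front(lineAddr);
        map_[lineAddr] = lru_.begin();
    }

private:
    std::size_t capacity_;
    std::list<uint64_t> lru_;  // MRU at the front
    std::unordered_map<uint64_t, std::list<uint64_t>::iterator> map_;
};

// On replacement of a clean L2 line: squash the writeback if the WBHT hints
// the line is already in L3; otherwise write it back and remember that fact.
void onCleanLineEviction(WBHT& wbht, uint64_t lineAddr) {
    if (wbht.probablyInL3(lineAddr)) {
        return;                  // writeback squashed; a wrong hint only costs
                                 // an extra memory fetch later, not correctness
    }
    // writeBackToL3(lineAddr);  // placeholder for the actual writeback
    wbht.allocate(lineAddr);
}
```

Because a stale hint only forces a later fetch from memory, the table can be kept small and inaccurate without affecting correctness, which is why the slides stress that accuracy is purely a performance concern.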
Potential Issues
• L3 will replace lines due to capacity misses
• If such a replaced line still has a WBHT entry, the L2 will not write it back to L3
• On a subsequent access to the line, it must then be fetched from memory
• Due to the WBHT's limited size, an entry may be evicted even though the line is still present in L3
• A writeback queue entry remains occupied while the WBHT is accessed
What if peer caches are not used for writebacks?
• Higher writeback penalty, since the L3 is off chip and has a longer access time
• On subsequent accesses, the L3 latency comes into the picture
• Off-chip accesses consume more power, placing more constraints on the overall design
Thus the use of peer L2 caches for writebacks is desirable.
Factors to account for when using peer caches
• Minimise negative interference at the recipient peer L2 cache
• Ensure that only useful lines are retained on chip in peer L2 caches
• Modifications are needed in the cache coherence protocol
• Keep the cache controller hardware simple
Mechanism for using peer caches
• Identify lines to be evicted in peer caches. Invalid lines are preferred!
• If none, choose shared lines for replacement
• Use a table to indicate and select which lines are likely to be reused
• If a peer cache already has the line in a clean state, squash the writeback via a snoop response
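The victim-selection preference at the recipient cache can be sketched as follows. This is an illustrative C++ fragment assuming a simple set-associative peer L2; the types and the refusal behaviour when no invalid or shared line exists are assumptions for the example, not the controller logic from the paper.

```cpp
// Sketch of how a peer L2 might pick a victim when asked to snarf a line
// written back by another L2: prefer invalid lines, then shared lines, and
// refuse otherwise to limit negative interference at the recipient cache.
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

enum class LineState { Invalid, Shared, Exclusive, Modified };

struct CacheLine {
    uint64_t  tag   = 0;
    LineState state = LineState::Invalid;
};

using CacheSet = std::vector<CacheLine>;  // one way per element

// Returns the way to evict for an incoming snarfed line, or nothing if the
// set holds only exclusive/modified lines (snarf refused).
std::optional<std::size_t> pickSnarfVictim(const CacheSet& set) {
    for (std::size_t way = 0; way < set.size(); ++way)   // 1st choice: invalid
        if (set[way].state == LineState::Invalid) return way;
    for (std::size_t way = 0; way < set.size(); ++way)   // 2nd choice: shared
        if (set[way].state == LineState::Shared) return way;
    return std::nullopt;
}
```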
Mechanism to guess line reuse
• On a writeback, allocate an entry in the reuse table
• On a subsequent miss to the line, set the "use" bit if the line has an entry in the table
• On a subsequent writeback of the line, consult the reuse table
• If the use bit is set, initiate a snarf by the peer L2 caches
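A rough C++ sketch of this reuse tracking is given below, assuming a per-L2 table similar in spirit to the WBHT. The entry layout, method names, and the omission of capacity/LRU management are assumptions made for brevity.

```cpp
// Hypothetical reuse table that drives the snarfing decision: a line is only
// snarfed into a peer L2 if it missed again after a previous writeback.
#include <cstdint>
#include <unordered_map>

struct ReuseEntry {
    bool used = false;  // set when the line misses again after being written back
};

class ReuseTable {
public:
    // Step 1: on a writeback, allocate (or keep) an entry for the line.
    void onWriteback(uint64_t lineAddr) { table_.emplace(lineAddr, ReuseEntry{}); }

    // Step 2: on a subsequent miss to the same line, mark it as reused.
    void onMiss(uint64_t lineAddr) {
        auto it = table_.find(lineAddr);
        if (it != table_.end()) it->second.used = true;
    }

    // Step 3: on the next writeback, snarf into a peer L2 only if the line
    // showed reuse; otherwise fall back to the normal writeback path.
    bool shouldSnarf(uint64_t lineAddr) const {
        auto it = table_.find(lineAddr);
        return it != table_.end() && it->second.used;
    }

private:
    std::unordered_map<uint64_t, ReuseEntry> table_;
};
```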
Simulation Environment
• IBM's Mambo simulator used to model the cache hierarchy
• Coherence protocol similar to that of the IBM POWER4
• Varying numbers of simultaneously outstanding load/write misses are simulated
• Applications simulated:
  • Transaction Processing (TP)
  • Commercial Processing Workload (CPW2)
  • NotesBench
  • Trade2
Runtime Improvements
Figure: Runtime improvement over baseline of the Writeback History Table
Figure: Runtime improvement of updating all WBHTs using the L3 snoop response
Effect of Varying WBHT Size
Figure 4: Runtime of varying L2 WBHT sizes, normalized to the 512-entry WBHT system
Effect of L2 Snarfing
Figure: Runtime improvement over baseline of allowing L2 snarfing
Improvements by L2 snarfing and combined mechanisms
Figure: Runtime of varying L2 snarf table sizes, normalized to the 512-entry snarf table system
Figure: Runtime improvement over baseline of the combined tables
Conclusion
• These simple adaptive mechanisms have a positive effect on performance
• The effect of combining both techniques is not additive
• Even small history tables can remove more than half of the unnecessary writebacks
• L2 snarfing results in fewer off-chip accesses