Explore a dynamic set balancing scheme to enhance large cache utilization by rebalancing hot and cooler regions, reducing conflict misses, and improving hit/miss paths. Detailed design observations, migration strategies, and simulation results are presented. Developed at IIT Kanpur.
Cooling the Hot Sets: Improved Space Utilization in Large Caches via Dynamic Set Balancing
Mainak Chaudhuri, IIT Kanpur
mainakc@iitk.ac.in
Talk in one slide
• Traditional cache designs use closed-address hashing with a fixed collision chain length (known as associativity)
• Clustering of physical addresses into a few hot sets is a well-known phenomenon
• Non-uniform set utilization leads to a high volume of conflict misses
• First proposal of a fully dynamic scheme to re-balance sets by migrating blocks from “hot regions” to “cooler regions”
Balanced $ (IIT, Kanpur)
Sketch
• Observations
• Design detail
• Destination of migration
• Locating migrated blocks
• Hit/Miss critical path
• Selective migration
• Throttling migration
• Retaining migrated blocks
• Scaling to CMPs
• Simulation results
• Summary
Observations #1, #2, #3 [figures not preserved]
Design detail: overview
• The basic idea is to migrate evicted blocks to sets with a smaller fill count
• This involves the following sub-problems
  • Identify a good receiver set quickly
  • Locate migrated blocks efficiently
  • Offer dynamic control of the hit/miss critical path
• Optimizations worth exploring
  • Selective migration (not all blocks are important)
  • Bound migrations from a particular set
  • Retain migrated blocks (the difficult part)
Destination of migration
• Associate a saturating counter C(s) with each set s, and maintain a global counter G
• Increment C(s) on a refill into s
• When C(s) reaches a value equal to the associativity, increment G
• When G reaches a value equal to the number of sets, reset G and C(s) for all s
• Size C(s) so that it can count up to k times the associativity (we set k to 4)
Destination of migration
• Divide the sets into clusters and associate a saturating counter D(u) with each cluster u
• Increment D(u) whenever C(s) is incremented for some s in u
• Reset D(u) when all C(s) are reset
• A comparator tree computes the minimum among all D(u) whenever an increment takes place (scalable?)
• A second comparator tree computes the minimum among all C(s) within the minimum cluster u found by the first tree; the set t holding this minimum is the target of migration, provided C(s) > C(t) for the source set s
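
The counter bookkeeping and the two-level minimum search can be sketched in software as follows. This is only an illustrative model under stated assumptions: the class and method names are hypothetical, the cluster size is assumed to divide the number of sets evenly, ties are broken toward the lowest index, and the two comparator trees are replaced by plain `min` scans.

```python
class ReceiverSelector:
    """Illustrative model of the C(s), D(u) and G counters (names are hypothetical)."""

    def __init__(self, num_sets, assoc, cluster_size, k=4):
        self.assoc = assoc
        self.c_max = k * assoc            # C(s) saturates at k times the associativity
        self.cluster_size = cluster_size  # assumed to divide num_sets evenly
        self.C = [0] * num_sets           # per-set refill counters
        self.D = [0] * (num_sets // cluster_size)  # per-cluster counters
        self.G = 0                        # global counter

    def on_refill(self, s):
        """Account for one refill into set s."""
        if self.C[s] == self.c_max:
            return                        # saturated: no counter is incremented
        self.C[s] += 1
        self.D[s // self.cluster_size] += 1   # D(u) tracks increments of C(s) in u
        if self.C[s] == self.assoc:
            self.G += 1
            if self.G == len(self.C):     # G reached the number of sets: reset all
                self.C = [0] * len(self.C)
                self.D = [0] * len(self.D)
                self.G = 0

    def pick_target(self, source):
        """Return the receiver set for a block evicted from `source`, or None."""
        # First "tree": cluster with the minimum D(u).
        u = min(range(len(self.D)), key=self.D.__getitem__)
        # Second "tree": set with the minimum C(s) inside that cluster.
        base = u * self.cluster_size
        t = min(range(base, base + self.cluster_size), key=self.C.__getitem__)
        # Migrate only if the source has seen strictly more refills than the target.
        return t if self.C[source] > self.C[t] else None
```

In this sketch D(u) is simply the sum of the C(s) values in its cluster, so it never needs a separate saturation limit; a hardware counter would be sized accordingly.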
Locating migrated blocks
• The migrated tags are duplicated in a migration tag cache (MTC)
• The MTC is organized as a direct-mapped table
• Each entry has a tag, a target set index, a forward pointer to an MTC entry, a backward pointer to an MTC entry, a head bit, and a tail bit
• Starting at an index of the MTC, one can follow the forward pointers through a linked list until the tail bit is encountered
• Each tag list in the MTC corresponds to the tags migrated from one particular parent set in the main cache
Locating migrated blocks: tag lookup protocol
• With each set s in the main cache, a head pointer H(s) into the MTC is maintained; H(s) holds the MTC index where the list of migrated tags belonging to set s begins
• The main cache is looked up first, as usual
• On a miss, H(s) is read out and an MTC walk is initiated at index H(s)
• On reset, the MTC is organized as a free list; a new migration from set s allocates an MTC entry, links it at the head of the list starting at H(s), and updates H(s)
Locating migrated blocks: tag lookup protocol
• On an MTC hit, the block is swapped with the LRU block in the parent set to improve future hit latency (behaves like a folded victim cache)
• It is necessary to avoid false hits: the same set may now contain the same tag multiple times
• Each tag is extended by log2(A) bits, where A is the associativity; the target way of a migrated tag is stored along with the tag
Locating migrated blocks: replacement of migrated blocks
• A migrated block may get replaced due to primary or secondary replacements
• A primary replacement of a migrated block migrates it again to a different target set; this case is easy to handle because it requires only modifying the MTC entry
• To reach the MTC entry, however, a direct MTC entry pointer MEP(t) must be maintained with each migrated tag t in the main cache
Locating migrated blocks: replacement of migrated blocks
• A secondary replacement of a migrated block evicts the block from the cache
• This requires delinking the tag from its list
• Efficient delinking is possible only in a doubly-linked list, which is why each MTC entry needs a backward pointer
• Delinking may also require updating the H(s) field in the parent set s
• To reach the parent set, each MTC entry needs to store the parent set index
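
A minimal software sketch of the MTC list operations (allocate and link at the head, walk on a main-cache miss, delink on a secondary replacement) is given below. It is an assumption-laden model rather than the hardware design: the names are hypothetical, `None` pointers stand in for the head and tail bits, and H(s) is kept in a dictionary instead of a per-set register with a valid bit.

```python
class MigrationTagCache:
    """Illustrative model of the MTC as doubly-linked per-parent-set lists."""

    def __init__(self, size):
        self.tag = [None] * size      # migrated tag
        self.target = [None] * size   # (target set index, target way)
        self.parent = [None] * size   # parent set index
        self.fptr = [None] * size     # forward pointer (None plays the tail bit)
        self.bptr = [None] * size     # backward pointer (None plays the head bit)
        self.free = list(range(size)) # on reset, every entry is on the free list
        self.head = {}                # H(s): parent set -> MTC index of list head

    def migrate(self, parent, tag, target_set, way):
        """Allocate an entry and link it at the head of the parent's list."""
        if not self.free:
            return None               # no MTC space: the migration is rejected
        m = self.free.pop()
        old_head = self.head.get(parent)
        self.tag[m], self.target[m], self.parent[m] = tag, (target_set, way), parent
        self.fptr[m], self.bptr[m] = old_head, None
        if old_head is not None:
            self.bptr[old_head] = m
        self.head[parent] = m         # update H(s)
        return m

    def lookup(self, parent, tag):
        """Walk the parent's list from H(s); return the matching index or None."""
        m = self.head.get(parent)
        while m is not None:
            if self.tag[m] == tag:
                return m
            m = self.fptr[m]
        return None

    def delink(self, m):
        """Remove entry m from its list, as on a secondary replacement."""
        p, f, b = self.parent[m], self.fptr[m], self.bptr[m]
        if b is None:                 # m was the head, so H(p) must change
            if f is None:
                del self.head[p]
            else:
                self.head[p] = f
        else:
            self.fptr[b] = f
        if f is not None:
            self.bptr[f] = b
        self.tag[m] = self.target[m] = self.parent[m] = None
        self.fptr[m] = self.bptr[m] = None
        self.free.append(m)
```

The `delink` method shows why both the backward pointer and the stored parent set index are needed: the former lets an entry be unlinked without walking the list, and the latter locates the H(s) to fix up when the head is removed.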
Locating migrated blocks: summary of structures added so far
• Per set s: one saturating counter C(s), one head pointer H(s) and VALID(H(s))
• Per tag t: MTC entry pointer MEP(t) and VALID(MEP(t)), extra way bits W(t)
• Per MTC entry m: migrated tag MT(m) including the extra way bits, target set index TS(m), parent set index PS(m), forward pointer FPTR(m), backward pointer BPTR(m), head/tail bits HT(m)
• Per set cluster u: saturating counter D(u)
• A global saturating counter
• Two comparator trees
Hit/Miss critical path: reducing the MTC walk latency
• Proposal #1: Make the MTC dual-ported so that a list can be walked from both ends (a win-win situation); this halves the hit as well as the miss path
  • Add a tail pointer T(s) to each set (along with H(s)) so that the tail of a list can be accessed directly
• Proposal #2: Maintain a summary of the tags migrated from a set s in a small filter F(s) attached to s
  • Query F(s) before walking the MTC; a negative response from F(s) means the tag is definitely not in the MTC; this optimizes the miss path only
Hit/Miss critical path: reducing the MTC walk latency
• We experimented with a simple design of a 60-bit F(s) with great success
• Divide the 60 bits into nine segments: each of the lower eight segments is seven bits wide and the upper segment is four bits wide
• When a tag t is queried, the lower three bits of t identify one of the lower eight segments of F(s)
• Let the contents of the identified segment be f[6:0] and the contents of the upper segment be g[3:0]
Hit/Miss critical path: reducing the MTC walk latency
• The filter says “yes” if and only if (f[6:0] AND t[9:3]) == t[9:3] and (g[3:0] AND t[13:10]) == t[13:10]
• A newly migrated tag t is hashed into F(s) by ORing t[9:3] into the identified segment and ORing t[13:10] into the upper segment
• F(s) is not updated when a migrated tag is removed (ORed bits cannot be selectively cleared)
• On a false positive from F(s), all the migrated tags for the set s have to be visited anyway; at this time F(s) is cleared and rebuilt
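
The 60-bit filter described above can be modelled in a few lines. The sketch below follows the stated bit fields (t[2:0] selects a segment, t[9:3] goes into that segment, t[13:10] goes into the upper segment); the class and method names are hypothetical, and `rebuild` stands for the clear-and-rebuild step performed during a full MTC walk.

```python
class SetFilter:
    """Illustrative model of the 60-bit F(s): eight 7-bit segments plus one 4-bit segment."""

    def __init__(self):
        self.seg = [0] * 8   # lower eight segments, 7 bits each
        self.upper = 0       # upper segment, 4 bits

    @staticmethod
    def _fields(tag):
        # t[2:0] selects the segment, t[9:3] and t[13:10] are the hashed fields.
        return tag & 0x7, (tag >> 3) & 0x7F, (tag >> 10) & 0xF

    def insert(self, tag):
        """Hash a newly migrated tag into the filter by ORing its fields in."""
        idx, low, high = self._fields(tag)
        self.seg[idx] |= low
        self.upper |= high

    def may_contain(self, tag):
        """False means the tag is definitely absent; True may be a false positive."""
        idx, low, high = self._fields(tag)
        return (self.seg[idx] & low) == low and (self.upper & high) == high

    def rebuild(self, tags):
        """Clear the filter and re-insert the tags seen during a full MTC walk."""
        self.seg = [0] * 8
        self.upper = 0
        for t in tags:
            self.insert(t)
```

One property visible in this model: a tag whose t[9:3] and t[13:10] fields are both zero always gets a “yes”, even from an empty filter, so such tags always pay for the walk.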
Selective migration
• Not all blocks are important
• Unnecessary migrations waste energy and may hurt performance by using up MTC space
• Ideally, we want to migrate the most frequently missing blocks
• Usually, these blocks are associated with the hot sets
• The idea, therefore, is to identify the hot sets and migrate only the blocks evicted from them
Selective migration: identifying hot sets
• Associate a saturating counter R(s) with each set s to count the number of external refills to the set
• Whenever some R(s) reaches its maximum value, all R(s) are reset (leader-decides rule)
• Maintain the total refill count across all sets in a register TRC and the maximum refill count across all sets in another register MaxRC; let the average refill count be ARC = TRC >> log2(|S|)
• Definition: A set s is hot if and only if R(s) > ARC + ((MaxRC - ARC) >> delta)
• delta is incremented dynamically
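
The hot-set test can be written out as below. This is a sketch that reads the threshold as ARC plus the shifted gap between MaxRC and ARC (the shift is parenthesized explicitly, since `+` binds tighter than `>>` in Python and C). The function name is hypothetical, the number of sets is assumed to be a power of two, and TRC and MaxRC are recomputed from the list here whereas hardware would keep them in registers.

```python
def is_hot(refill_counts, s, delta):
    """Return True if set s is hot, given the per-set refill counters R(s)."""
    num_sets = len(refill_counts)            # assumed to be a power of two
    log2_sets = num_sets.bit_length() - 1
    trc = sum(refill_counts)                 # total refill count (TRC)
    max_rc = max(refill_counts)              # maximum refill count (MaxRC)
    arc = trc >> log2_sets                   # average refill count (ARC)
    threshold = arc + ((max_rc - arc) >> delta)
    return refill_counts[s] > threshold
```

With this reading, delta = 0 puts the threshold at MaxRC so no set qualifies, and each increment of delta moves the threshold closer to ARC, admitting more sets as hot.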
Throttling migration
• If a set becomes very hot, it may start migrating a large number of blocks
• While this may appear desirable, the monotonically increasing expected MTC walk cost soon outweighs the benefits
• We impose a limit on the length of the migrated tag list belonging to a particular set
• A static limit may not work, however, so the limit is increased dynamically by monitoring the volume of migrations rejected because the length limit was too short
• Each set s now maintains a list length register LLR(s)
Retaining migrated blocks
• The number of misses between two misses to the same block is often very high
• This points to the danger of losing the migrated blocks before they get reused
• We need a replacement policy that gives lower replacement priority to the migrated blocks, because these are the blocks we really want to retain
• Classify the sets into high-hit and low-hit sets
• For high-hit sets, continue with the baseline policy (LRU in our case)
• For low-hit sets, consider the non-migrated blocks for replacement before the migrated ones
Retaining migrated blocks
• Associate a hit counter HC(s) with each set s
• Reset HC(s) when the refill counter is reset
• Count a hit on a migrated block as a hit in the parent set
• Classify a set as low-hit if and only if HC(s) ≤ h·R(s) and R(s) > r, for constants h > 1 and r < associativity
• We fix h to 4 and r to 1/8th of the associativity
• More research is needed on better retention schemes
• This is going to play a big role
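
The low-hit classification and the resulting victim choice can be sketched as follows. The function names and the data representation (a list of ways ordered from LRU to MRU, plus a set of ways holding migrated blocks) are assumptions made for illustration; the fallback to plain LRU when every way holds a migrated block is also an assumption, since the slides do not cover that case.

```python
def is_low_hit(hit_count, refill_count, assoc, h=4):
    """Low-hit test: HC(s) <= h * R(s) and R(s) > r, with r = associativity / 8."""
    r = assoc // 8
    return hit_count <= h * refill_count and refill_count > r


def choose_victim(lru_order, migrated_ways, low_hit):
    """Pick the way to replace.

    lru_order: ways listed from least to most recently used.
    migrated_ways: set of ways currently holding migrated blocks.
    """
    if low_hit:
        # In a low-hit set, prefer the least recently used non-migrated block.
        for way in lru_order:
            if way not in migrated_ways:
                return way
    # High-hit set, or every way holds a migrated block: plain LRU (assumed fallback).
    return lru_order[0]
```

For a 16-way cache this gives r = 2, so a set with 3 refills and 10 hits is low-hit (10 ≤ 12), and its migrated blocks are skipped when a victim is chosen.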
Scaling to CMPs
• Assume that the CMP caches will be banked
• All the policies can be applied independently to each bank or to a subset of close-by banks
  • No cross-bank (or cross-switch) migration
  • Use cross-bank migration only for proximity enhancement (more detail in second talk)
• The entire design scales seamlessly to larger caches
• In our simulations, we assume that a pair of banks share a switch on a ring and cross-bank migration is allowed only within a pair
Simulation results
• Single-threaded and multi-threaded applications
• Single-threaded runs are done on 2 MB 16-way L2 caches
• Multi-threaded runs are done on 8 cores sharing a 4 MB 16-way L2 cache
• Each core has private L1 caches
• The MTC is sized to hold half as many tags as the main cache
• Space overhead of about 56 KB per 1 MB bank
Summary
• Huge potential for improving performance and saving energy with slightly over 5% extra storage
• Logic simplifications need to be explored further
Cooling the Hot Sets: Improving Space Utilization in Large Caches via Dynamic Set Balancing
THANK YOU!
Mainak Chaudhuri, IIT Kanpur
mainakc@iitk.ac.in