Benefits of Early Cache Miss Determination Memik, G., Reinman, G., Mangione-Smith, W.H. Proceedings of High Performance Computer Architecture, pp. 307–316, Feb. 2003 On seminar book: 254
Abstract • As the performance gap between the processor and the memory subsystem increases, designers are forced to develop new latency-hiding techniques. Arguably, the most common technique is to utilize multi-level caches. Each new generation of processors is equipped with higher levels of memory hierarchy, with increasing sizes at each level. In this paper, we propose 5 different techniques that will reduce the data access times and power consumption in processors with multi-level caches. Using the information about the blocks placed into and replaced from the caches, the techniques quickly determine whether an access at any cache level will be a miss. The accesses that are identified to miss are aborted. The structures used to recognize misses are much smaller than the cache structures. Consequently, the data access times and power consumption are reduced. Using the SimpleScalar simulator, we study the performance of these techniques for a processor with 5 cache levels. The best technique is able to abort 53.1% of the misses on average in SPEC2000 applications. Using these techniques, the execution time of the applications is reduced by up to 12.4% (5.4% on average), and the power consumption of the caches is reduced by as much as 11.6% (3.8% on average).
What’s the Problem • The fraction of data access time and cache power consumption caused by cache misses increases as the number of levels in a multi-level cache system grows • A great deal of the time and cache power is spent accessing caches that miss • On average, in a processor with 5 levels of cache • The misses cause 25.5% of the data access time • The misses cause 18% of the cache power consumption • Motivating the exploration of techniques to minimize the effects of cache misses
Introduction • Motivating example • If the data will be supplied by the nth-level cache • All the cache levels before n will be accessed, causing unnecessary delay and power consumption • The proposed technique of this paper • Identify misses and bypass the access to the cache that will miss • Store partial information about the blocks in a cache to identify whether a cache access may hit or will definitely miss • If these misses are known in advance and not performed • The delay of data access will be reduced, and the cache power consumption caused by the misses can be prevented
Mostly No Machine (MNM) Overview • When the address is given to the MNM • Miss signals for each cache level (except L1) are generated • The miss signals are propagated with the access through the cache levels • The ith miss bit dictates whether the access at level i should be performed or bypassed • Two possible locations where the MNM can be realized • (a) Parallel MNM: the L1 cache and the MNM are accessed in parallel • Advantage: no MNM delay • Disadvantage: the MNM consumes more power • (b) Serial MNM: the MNM is accessed only after the L1 cache misses • Advantage: the MNM consumes less power • Disadvantage: higher data access time (increased by the delay of the MNM)
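The lookup flow above can be sketched as follows. This is a minimal illustration, not the paper's hardware: `mnm_miss` stands in for whichever MNM variant generates the per-level miss bits, and the caches are modeled as plain sets of block addresses.

```python
# Sketch of how MNM miss signals steer a multi-level cache lookup.
# `mnm_miss(address, level)` is a hypothetical predicate standing in for
# the miss bit an MNM variant (RMNM, SMNM, TMNM, CMNM) would produce.

def access(address, caches, mnm_miss):
    """Walk cache levels L1..Ln, bypassing levels the MNM flags as misses."""
    # Parallel MNM: miss bits for all levels are computed alongside the L1 probe.
    miss_bits = [mnm_miss(address, level) for level in range(len(caches))]
    for level, cache in enumerate(caches):
        if level > 0 and miss_bits[level]:
            continue                 # bypass: MNM guarantees a miss at this level
        if address in cache:
            return level             # hit at this level
    return len(caches)               # fell through to main memory
```

With a perfect MNM (one that flags every actual miss), an access to a block resident only in L3 probes L1, skips L2, and hits L3 directly.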
Modification of Cache to Incorporate the MNM • Modification of cache structure • Extend each cache structure with logic • To detect the miss signal and bypass the access if necessary • Each cache has to send information to the MNM about • The blocks that are replaced from the cache • This is needed for the bookkeeping required at the MNM • In serial MNM, to synchronize the access and the miss signal • The request generated by the L1 is sent to the MNM, which forwards the request to the L2
Benefits of the MNM Technique • Average data access time without MNM, when the data resides at cache level n • access_time = Σi=1..n−1 Cache_miss_timei + Cache_hit_timen • Average data access time with MNM • The Cache_miss_time terms for the levels whose misses the MNM identifies drop out of the sum, since those accesses are bypassed • Cache_hit_time: time to access data at a cache • Cache_miss_time: time to detect a miss in a cache • Abort the access to a cache when the MNM identifies a miss • Prevents the time spent accessing a cache that will miss => improves data access time
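A worked example of the access-time equations above. The cycle counts are illustrative, not figures from the paper:

```python
# Illustrative per-level latencies (cycles), not figures from the paper.
hit_time  = {1: 2, 2: 10, 3: 40}   # time to access data at each level
miss_time = {1: 2, 2: 10}          # time to detect a miss at each level

# Data resides in L3. Without MNM the access walks L1 and L2 first:
t_without = miss_time[1] + miss_time[2] + hit_time[3]   # 2 + 10 + 40 = 52

# A parallel MNM that identifies the L2 miss lets the access skip L2:
t_with = miss_time[1] + hit_time[3]                     # 2 + 40 = 42
```

Here identifying the single L2 miss saves 10 of 52 cycles on this access; the savings grow with deeper hierarchies and more identified misses.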
Assumptions of the MNM Techniques • Portion of the address used by the MNM • Store block addresses instead of the exact bytes that are stored in a cache • The MNM doesn’t assume the inclusion property of caches • EX: If cache level i contains a block b, block b is not necessarily contained in cache level i+1 • The MNM checks for misses at cache level i+1 even if it can’t identify a miss at cache level i • EX: If the MNM identifies a miss at L3 but couldn’t identify one at L2, the L2 cache will be accessed first
1. Replacements MNM (RMNM) • Replacements MNM • Stores addresses that are replaced from the cache • Therefore, an access to such an address will miss • Information about the replaced blocks is stored in an RMNM cache • The RMNM cache has a block size of (n − 1) bits • n: # of separate caches • Each bit in the block corresponds to a level of cache, except the L1 cache • When the ith bit is set, the block has been replaced from the Li cache
1. Replacements MNM (RMNM) • Scenario for a 2-level cache • pl.: place block into cache • repl.: replace block from cache • Since there are only two levels of cache • Each RMNM block contains a single bit indicating Hit/Miss for the L2 cache • Block 0x2FC0 is replaced from the L2 cache and placed into the RMNM cache • Finding block 0x2FC0 in the RMNM cache identifies an L2 cache miss
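The 2-level scenario above can be sketched as follows. This is a simplified model, assuming an unbounded RMNM structure (the real RMNM cache has finite capacity, so entries can themselves be evicted); function names are illustrative.

```python
# Sketch of RMNM bookkeeping for a 2-level hierarchy: one entry per
# replaced block, holding a single "replaced from L2" bit.
# Assumes unbounded capacity; the real RMNM cache is finite.

rmnm = {}   # block address -> replaced-from-L2 bit

def on_replace_from_l2(block):
    rmnm[block] = True          # block left L2: a future L2 access will miss

def on_place_into_l2(block):
    rmnm.pop(block, None)       # block is back in L2: clear the miss bit

def l2_certain_miss(block):
    return rmnm.get(block, False)

on_replace_from_l2(0x2FC0)      # the slide's scenario
assert l2_certain_miss(0x2FC0)  # RMNM identifies the L2 miss
on_place_into_l2(0x2FC0)
assert not l2_certain_miss(0x2FC0)
```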
2. Sum MNM (SMNM) • Sum MNM • Stores hash values for the block addresses in the cache • When a block is placed into the cache, the block address is hashed and the resulting hash value is stored • The specific hash function gathers information about the bit values of the address that are high • If the hash value of the access matches any of the hash values of the existing cache blocks • Then: the access is performed • Else: a miss is captured; bypass the cache access
2. Sum MNM (SMNM) • The SMNM configuration is denoted by sum_width x replication • Sum_width: the sum width at each checker • Replication: # of parallel checkers implemented • If there are multiple checkers • The first one examines the least significant bits • The second one examines the bits starting from the 7th rightmost bit • The third one examines the bits starting from the 13th rightmost bit • SMNM example: SMNM_10x2 • 2 parallel checkers, each checking a different 10-bit portion of the block address
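A minimal sketch of an SMNM_10x2-style checker, assuming the hash is the sum of the high bits in a 10-bit slice of the address (consistent with the slides' description, though the paper's exact hash and storage organization may differ). The class and function names are illustrative.

```python
# Sketch of Sum MNM checkers: each checker sums a 10-bit slice of the
# block address and remembers the sums of resident blocks. An access
# whose sum matches no stored signature in some checker is a certain miss.

def slice_sum(addr, start, width=10):
    """Sum of `width` address bits starting at bit `start` (LSB = bit 0)."""
    return sum((addr >> (start + i)) & 1 for i in range(width))

class SumChecker:
    def __init__(self, start):
        self.start = start
        self.sums = []                 # one signature per resident block

    def place(self, addr):
        self.sums.append(slice_sum(addr, self.start))

    def may_hit(self, addr):
        return slice_sum(addr, self.start) in self.sums

# SMNM_10x2: two parallel checkers over different address slices
checkers = [SumChecker(0), SumChecker(6)]   # LSBs, and from the 7th rightmost bit
for c in checkers:
    c.place(0x2FC0)                         # a resident block

def certain_miss(addr):
    # Any checker that rules the address out proves a miss.
    return not all(c.may_hit(addr) for c in checkers)
```

Note the one-sided guarantee: a resident block always matches its own signatures, so `certain_miss` can never be wrong when it fires; distinct addresses with the same sums are simply passed through as "may hit".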
3. Table MNM (TMNM) • Table MNM • Stores the least significant N bits of each block address in the cache • The values are stored in the TMNM table, an array of size 2^N • Locations corresponding to the addresses stored in the cache are set to ‘0’; the remaining locations are set to ‘1’ • The least significant N bits of the access are used to address the TMNM table • The value stored at the corresponding location is used as the miss signal • Example TMNM for N = 6 • The cache in the example only has 2 blocks • When a request comes to the MNM, the corresponding bit position is read • If the location is high, the access will miss and can be bypassed
3. Table MNM (TMNM) • Several block addresses can map to the same bit position in the TMNM table • Therefore, the values in the TMNM table are counters instead of single bits • When a block is placed into the cache • The corresponding counter is incremented, unless it is saturated • When a block is replaced from the cache • The corresponding counter is decremented, unless it is saturated • The TMNM configuration is denoted by TMNM_N x replication • N: # of bits checked by each table (the least significant N bits of the block address) • Replication: # of tables examining different portions of the address
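The counter scheme above can be sketched as a single N = 6 table. The saturation value is an assumption for illustration; a zero counter proves no resident block maps there, so the access is a certain miss.

```python
# Sketch of a single Table MNM (TMNM) with N = 6: a 2^6-entry array of
# small saturating counters indexed by the low 6 bits of the block address.

N = 6
SAT = 3                             # saturation value (illustrative width)
table = [0] * (1 << N)

def idx(addr):
    return addr & ((1 << N) - 1)    # least significant N bits

def on_place(addr):
    if table[idx(addr)] < SAT:      # increment, unless saturated
        table[idx(addr)] += 1

def on_replace(addr):
    # Decrement, unless saturated: once a counter saturates, the exact
    # count is lost, so it must stay conservative ("may hit") forever.
    if table[idx(addr)] not in (0, SAT):
        table[idx(addr)] -= 1

def certain_miss(addr):
    return table[idx(addr)] == 0

on_place(0x2FC0)
assert not certain_miss(0x2FC0)     # resident block: may hit
assert not certain_miss(0x3FC0)     # aliases to the same entry: also "may hit"
assert certain_miss(0x2FC1)         # untouched entry: certain miss
```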
4. Common Address MNM (CMNM) • Common address MNM • Captures the common values in block addresses by examining the most significant bits of the address • The virtual tag finder has K registers • Each stores the most significant portion of a cache block address • During an access • The most significant (32 − m) bits of the address are compared to the values in the virtual tag finder • If it matches any of the existing values • The index of the matching register is attached to the remaining m bits of the examined address • And used to address the CMNM table
4. Common Address MNM (CMNM) • When an address is checked, there are two ways to identify a miss • First, the (32 − m) most significant bits of the address are entered into the virtual tag finder • If they don’t match any of the register values in the virtual tag finder • The access is marked as a miss • Second, if a register matches the address, the index of that register is attached to the remaining m bits of the address to access the CMNM table • If the corresponding position has value ‘1’ • Again a miss is indicated • The CMNM configuration is denoted by CMNM_k x m • k: # of registers in the virtual tag finder • m: the least significant m bits of the examined address
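The two-step lookup above can be sketched as follows. This is a simplified model: k and m are illustrative, the table is modeled as a set of '0' entries, and it assumes every resident tag fits in the k registers (the paper must handle register pressure, which is omitted here).

```python
# Sketch of the CMNM lookup (CMNM_k x m style). Assumes all resident
# high-bit tags fit in the K virtual-tag-finder registers.

K, M = 4, 8
tag_regs = []        # virtual tag finder: common high-order address values
cmnm_table = set()   # '0' entries: (register index, low m bits) of cached blocks

def low_bits(addr):
    return addr & ((1 << M) - 1)

def on_place(addr):
    tag = addr >> M                      # most significant (32 - m) bits
    if tag not in tag_regs:
        assert len(tag_regs) < K, "sketch assumes resident tags fit in K registers"
        tag_regs.append(tag)
    cmnm_table.add((tag_regs.index(tag), low_bits(addr)))

def certain_miss(addr):
    tag = addr >> M
    if tag not in tag_regs:
        return True                      # way 1: no register matches the high bits
    # way 2: register index concatenated with the low m bits addresses the table
    return (tag_regs.index(tag), low_bits(addr)) not in cmnm_table

on_place(0x2FC0)
assert not certain_miss(0x2FC0)
assert certain_miss(0x9900)              # high bits match no register
assert certain_miss(0x2FC1)              # table position reads '1'
```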
Discussion of the MNM Techniques • The MNM techniques • Never incorrectly indicate that bypassing should be used • But don’t detect all opportunities for bypassing • The miss signal must be reliable • Because the cost of indicating that an access will miss when the data is actually in the cache is high • A redundant access to a higher level of the memory hierarchy must be performed • The cost of a hit misindication is relatively low • Only a redundant tag comparison at the cache • If the MNM indicates a miss • Then the block certainly doesn’t exist in the cache • If the MNM output is “may hit” • Then the access might still miss in the cache
Improvement in Execution Time • To eliminate the delay of the MNM, we perform simulations with the parallel MNM • The HMNM4 technique reduces the execution time by as much as 12.4% and by 5.4% on average • HMNM means hybrid MNM, which combines all the techniques to increase the number of misses identified • The perfect MNM reduces the execution time by as much as 25.0% and by 10.0% on average • The perfect MNM identifies all the misses, and hence bypasses all the cache misses
Reduction in Cache Power Consumption • To achieve the maximum power reduction, we perform simulations with the serial MNM • The HMNM4 reduces the cache power consumption by as much as 11.6% and by 3.8% on average • The perfect MNM reduces the cache power consumption by as much as 37.6% and by 10.2% on average
Conclusions • Proposed techniques to identify misses at different cache levels • When an access is identified to miss, the access is directly bypassed to the next cache level • Thereby reducing the delay and power consumption associated with the misses • Presented 5 different techniques to recognize some of the cache misses • For the Hybrid MNM technique • The execution time is reduced by 5.4% on average (ranging from 0.6% to 12.4%) • The cache power consumption is reduced by 3.8% on average (ranging from 0.4% to 11.6%)