Using Dead Blocks as a Virtual Victim Cache Samira Khan, Daniel A. Jiménez, Doug Burger, Babak Falsafi
The Cache Utilization Wall • Performance gap • Processors getting faster • Memory only getting larger • Caches are not efficient • Designed for fast lookup • Contain too many useless blocks! We want the cache to be as efficient as possible
Cache Problem: Dead Blocks • Live block: will be referenced again before eviction • Dead block: dead from its last reference until eviction Cache blocks are dead on average 59% of the time
Reducing Dead Blocks: Virtual Victim Cache Put victim blocks in the dead blocks Dead blocks all over the cache act as a victim cache
Contribution: Virtual Victim Cache Contribution: • Skewed dead block predictor • Victim placement and lookup Result: • Improves predictor accuracy by 4.7% • Reduces miss rate by 26% • Improves performance by 12.1%
Introduction • Virtual Victim Cache • Methodology • Results • Conclusion
Virtual Victim Cache Goal: use dead blocks to hold victim blocks Mechanisms required: • Identify which blocks are dead • Look up the victims
Different Dead Block Predictors • Counting based [ICCD05] • Predicts dead after a certain number of accesses • Time based [ISCA02] • Predicts dead after a certain number of cycles • Trace based [ISCA01] • Predicts the last touch based on PCs • Cache burst based [MICRO08] • Predicts dead when the block moves out of the MRU position
Trace-Based Dead Block Predictor [ISCA 01] • Predicts the last touch based on the sequence of instructions that accessed the block • Encoding: truncated addition of instruction PCs, called the signature • Predictor table is indexed by the signature • Each entry holds a 2-bit saturating counter
Trace-Based Dead Block Predictor [ISCA 01] Example: a block is filled by PC1 (ld a), then hit by PC3 (ld a), PC4 (st a), and PC5 (ld a); PC8 (st a) is the last touch before eviction. The block's signature is the truncated sum of <PC1, PC3, PC4, PC5, PC8>; on eviction, the predictor table entry for this signature is trained to predict dead.
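The trace-based predictor above can be sketched in a few lines of Python. This is an illustrative model rather than the paper's hardware: the signature width, table size, and training rule are assumptions.

```python
# Illustrative model of a trace-based dead block predictor. The signature
# is the truncated sum of the PCs that touched the block; the table holds
# 2-bit saturating counters. Widths and thresholds are assumptions.

SIG_BITS = 16  # assumed signature width

def update_signature(signature, pc):
    # Fold one more instruction PC into the block's running signature
    # by truncated addition.
    return (signature + pc) & ((1 << SIG_BITS) - 1)

class DeadBlockPredictor:
    def __init__(self):
        self.table = [0] * (1 << SIG_BITS)  # 2-bit saturating counters

    def train(self, signature, was_last_touch):
        # On eviction, reinforce the signature that ended the block's live
        # range; weaken a signature that turned out not to be the last touch.
        if was_last_touch:
            self.table[signature] = min(self.table[signature] + 1, 3)
        else:
            self.table[signature] = max(self.table[signature] - 1, 0)

    def predict_dead(self, signature, threshold=2):
        return self.table[signature] >= threshold
```

In the slide's example, the signature fed to the table would be the truncated sum of PC1, PC3, PC4, PC5, and PC8.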
Skewed Trace Predictor Reference trace predictor table: a single table indexed by index = hash(signature); a block is predicted dead if confidence >= threshold. Skewed trace predictor table: two tables indexed by index1 = hash1(signature) and index2 = hash2(signature); a block is predicted dead if conf1 + conf2 >= threshold.
Skewed Trace Predictor • Uses two different hash functions • Reduces conflict • Improves accuracy Example: sigX and sigY may conflict in the first table (hash1(sigX) = hash1(sigY)), but hash2 maps them to different entries, so a conflict in both tables is much less likely.
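The skewing idea can be sketched as follows. The two hash functions here are illustrative stand-ins (the design only requires that they differ), and the table size is an assumption.

```python
# Sketch of a skewed trace predictor: two tables indexed by two different
# hash functions; a block is predicted dead when the summed confidence
# crosses the threshold. Hash functions and sizes are assumptions.

TABLE_BITS = 14  # assumed table size: 2^14 entries each

def hash1(sig):
    return (sig ^ (sig >> 3)) & ((1 << TABLE_BITS) - 1)

def hash2(sig):
    return (sig ^ (sig << 5) ^ (sig >> 7)) & ((1 << TABLE_BITS) - 1)

class SkewedPredictor:
    def __init__(self):
        self.t1 = [0] * (1 << TABLE_BITS)  # 2-bit saturating counters
        self.t2 = [0] * (1 << TABLE_BITS)

    def train(self, sig, was_last_touch):
        # Train both tables at their respective skewed indices.
        for t, h in ((self.t1, hash1), (self.t2, hash2)):
            i = h(sig)
            t[i] = min(t[i] + 1, 3) if was_last_touch else max(t[i] - 1, 0)

    def predict_dead(self, sig, threshold=4):
        # Two signatures that collide in one table rarely collide in both,
        # so the summed vote filters out single-table aliasing.
        return self.t1[hash1(sig)] + self.t2[hash2(sig)] >= threshold
```

Because hash1 and hash2 disagree, two signatures that alias in one table almost never alias in both, which is where the accuracy improvement comes from.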
Victim Placement and Lookup in VVC • Place victims in the dead blocks of adjacent sets • Any victim can be placed in any set • Have to look up each set for a hit • Trade-off between • number of sets • lookup latency We use only one adjacent set to minimize lookup latency
How to determine the adjacent set? • A set whose index differs by only 1 bit • Far enough away not to be a hot set
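Since the adjacent set differs from the original in exactly one index bit, the mapping is a single XOR. A minimal sketch, assuming 1024 sets and flipping the 4th index bit (the bit a later slide names for this design):

```python
# Adjacent-set selection for the VVC: flip one bit of the set index.
# SET_BITS and the choice of bit are assumptions for illustration.

SET_BITS = 10   # assumed: 1024 cache sets
ADJ_BIT = 4     # index bit flipped to form the adjacent set

def adjacent_set(set_index):
    # XOR with a one-hot mask flips exactly one index bit.
    return set_index ^ (1 << ADJ_BIT)
```

The mapping is symmetric: the adjacent set of the adjacent set is the original set, so a displaced victim can always be traced back to its home set.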
Victim Lookup • On a miss, search the adjacent set • If the block is found there, bring it back to its original set
Virtual Victim Cache: Why It Works • Reduces conflict misses • Provides extra associativity to the hot set • Reduces capacity misses • Puts the LRU block in a dead block • A fully associative cache would have replaced the LRU block • Increasing live blocks effectively increases capacity • Robust to false positive predictions • If a live block is evicted by mistake, VVC finds it in the adjacent set and avoids the miss
Introduction • Virtual Victim Cache • Methodology • Results • Conclusion
Experimental Methodology Simulator: modified version of SimpleScalar Benchmarks: SPEC CPU2000 and SPEC CPU2006
Single Thread Speedup (per-benchmark speedup chart; values range from 0.9 to 2.6) A fully associative cache and a 64KB victim cache are both unrealistic designs
Single Thread Speedup (per-benchmark speedup chart; values range from 1.2 to 2.6) The accuracy of the predictor is more important in dead block replacement
Speedup for Multiple Threads (per-workload speedup chart; some workloads fall between 0.84 and 0.89) Blocks become less predictable in the presence of multiple threads
Tag Array Reads due to VVC Tag array reads in the baseline cache are 3.9% of the total number of instructions executed, versus 4.9% for the VVC
Conclusion • Skewed predictor improves accuracy by 4.7% • Virtual Victim Cache achieves • 12.1% speedup for single-threaded workloads • 4% speedup for multiple-threaded workloads • Future Work in Dead Block Prediction • Improve accuracy • Reduce overhead
Dead Blocks as a Virtual Victim Cache • Placing victim blocks into the adjacent set • Evicted blocks are placed in an invalid or predicted-dead block of the adjacent set • If no such block is present, the victim is placed in the LRU block • The receiver block is then moved to the MRU position • Adaptive insertion is also used • Cache lookup for a previously evicted block • Original set lookup: miss • Adjacent set lookup: hit • The block is refilled from the adjacent set to its original set • The receiver block in the adjacent set is marked invalid • One bit per block keeps track of receiver blocks • Tag matching in original-set accesses ignores receiver blocks
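The placement and lookup rules above can be sketched over a toy set-associative cache. Class and field names are illustrative, and the LRU fallback is simplified to index 0 (the real design also applies adaptive insertion when choosing the receiver's recency position).

```python
# Toy sketch of VVC victim placement and lookup. A "receiver" bit marks
# blocks that hold another set's victim, so that ordinary tag matches in
# the adjacent set can ignore them. Names are illustrative assumptions.

class Block:
    def __init__(self):
        self.valid = False
        self.tag = None
        self.dead = False      # dead-block prediction for this block
        self.receiver = False  # holds a victim displaced from another set

def place_victim(adjacent_set_blocks, victim_tag):
    # Prefer an invalid or predicted-dead block in the adjacent set;
    # fall back to the set's LRU block (index 0 in this simplified model).
    target = next((b for b in adjacent_set_blocks if not b.valid or b.dead),
                  adjacent_set_blocks[0])
    target.valid, target.tag, target.receiver = True, victim_tag, True
    return target

def lookup_victim(adjacent_set_blocks, tag):
    # On a miss in the original set, search the adjacent set; a hit there
    # invalidates the receiver copy so the block can be refilled home.
    for b in adjacent_set_blocks:
        if b.valid and b.receiver and b.tag == tag:
            b.valid = False
            return True
    return False
```

A second lookup for the same tag misses, matching the slide's rule that the receiver copy is invalidated once the block moves back to its original set.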
Predictor Coverage and False Positive Rate (per-benchmark chart across SPEC CPU2000/2006 workloads, with arithmetic mean)
Trace-Based Dead Block Predictor Example: memory instructions access cache set s. A fill by pc m sets the block's signature to m; subsequent hits by pc n and pc o update it to m+n and then m+n+o. When the block is evicted, the predictor entry for <signature m+n+o> is trained to 1 (dead), while intermediate entries such as <signature m> and <signature m+n> remain 0.
Speedup (per-benchmark chart; values range from 2.5 to 2.6)
Motivation (cache diagram)
False Positive Prediction Shared cache contention results in more false positive predictions
Predictor Table Hardware Budget With an 8KB predictor table, VVC achieves 5.4% speedup with the original predictor, versus 12.1% speedup with the skewed predictor
Cache Efficiency VVC improves cache efficiency by 62% for multiple-threaded workloads and by 26% for single-threaded workloads
Introduction • Background • Virtual Victim Cache • Methodology • Results • Conclusion
Experimental Methodology Dead block predictor parameters Predictor overhead is 3.4% of the total 2MB L2 cache space
Reducing Dead Blocks: Virtual Victim Cache Dead blocks all over the cache act as a victim cache
Virtual Victim Cache • Place evicted blocks in the dead blocks of an adjacent set • On a miss, search the adjacent set for a match • If the block is found in the adjacent set, bring it back to its original set Dead blocks all over the cache act as a victim cache
Virtual Victim Cache: How It Works • How to determine the adjacent set? • A set whose index differs by only 1 bit, in our case the 4th bit • Far enough away not to be a hot set • How to find the receiver block in the adjacent set? • Add 1 bit to mark receiver blocks • Where to place the receiver block? • Use a dynamic insertion policy • Choose either the LRU or MRU position