MadCache: A PC-aware Cache Insertion Policy Andrew Nere, Mitch Hayenga, and Mikko Lipasti PHARM Research Group University of Wisconsin – Madison June 20, 2010
Executive Summary • Problem: Changing hardware and workloads encourage investigation of cache replacement/insertion policy designs • Proposal: MadCache uses PC history to choose cache insertion policy • Last-level cache granularity • Individual PC granularity • Performance improvements over LRU • 2.5% IPC improvement (single-threaded) • 4.5% weighted speedup and 6% throughput improvement (multithreaded)
Motivation • Importance of investigating cache insertion policies • Direct effect on performance • LRU has dominated hardware designs for many years • Changing workloads, levels of caches • Shared last-level cache • Cache behavior now depends on multiple running applications • One streaming thread can ruin the cache for everyone
Previous Work • Dynamic insertion policies • DIP – Qureshi et al. – ISCA '07 • Dueling sets select best of multiple policies • Bimodal Insertion Policy (BIP) offers thrash protection • TADIP – Jaleel et al. – PACT '08 • Awareness of other threads' workloads • Utilizing Program Counter information • PCs exhibit a useful amount of predictable behavior • Dead-block prediction and prefetching – ISCA '01 • PC-based load miss prediction – MICRO '95
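To make the thrash protection idea concrete, here is a minimal sketch of BIP-style insertion. This is an illustration of the published technique, not code from the MadCache authors; the epsilon value and the list-based set model are assumptions for clarity.

```python
import random

def bip_insert(cache_set, new_line, epsilon=1/32):
    """Bimodal Insertion Policy (BIP) sketch: insert at the MRU position
    only with low probability epsilon; otherwise insert at the LRU
    position, so a streaming workload cannot flush the whole set.
    cache_set is ordered MRU (index 0) to LRU (last index)."""
    victim = cache_set.pop()           # evict the current LRU line
    if random.random() < epsilon:
        cache_set.insert(0, new_line)  # rare MRU insertion lets hot lines take root
    else:
        cache_set.append(new_line)     # LRU insertion: replaced next unless reused
    return victim
```

Under a pure streaming pattern, almost all new lines land at the LRU position and are replaced immediately, leaving the rest of the set intact.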
MadCache Proposal • Problem: With changing hardware and workloads, caches are subject to suboptimal insertion policies • Solution: Use PC information to create a better policy • Adaptive default cache insertion policy • Track PCs to determine the policy on a finer grain than DIP • Filter out streaming PCs Introducing MadCache!
MadCache Design • Tracker Sets • Sample behavior of the cache • Enter the PCs into PC-Predictor Table • Determines default policy of cache • Uses set dueling – Qureshi et al. – ISCA '07 • LRU and Bypassing Bimodal Insertion Policy (BBIP) • Follower Sets • Majority of the last-level cache • Typically follow the default policy • Can override default cache policy (PC-Predictor Table)
Tracker and Follower Sets [Diagram: the last-level cache partitioned into BBIP tracker sets, LRU tracker sets, and follower sets; each tracker line carries a reuse bit and an index into the PC-Predictor table] • Tracker Sets overhead • 1 bit to indicate if line was accessed again • 10/11 bits to index PC-Predictor table
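The choice between the two default policies can be sketched with the standard set-dueling mechanism the slides reference. This is an illustrative sketch, not the authors' implementation; the counter width and the convention that a high PSEL favors BBIP are assumptions.

```python
class SetDuel:
    """Set dueling sketch (after Qureshi et al., ISCA '07): a saturating
    counter PSEL tallies misses in the dedicated tracker sets, and the
    follower sets adopt whichever policy's trackers miss less often."""

    def __init__(self, bits=10):
        self.max = (1 << bits) - 1
        self.psel = self.max // 2            # start undecided

    def miss_in_lru_tracker(self):
        """A miss in an LRU tracker set is a vote against LRU."""
        self.psel = min(self.psel + 1, self.max)

    def miss_in_bbip_tracker(self):
        """A miss in a BBIP tracker set is a vote against BBIP."""
        self.psel = max(self.psel - 1, 0)

    def default_policy(self):
        """Followers read only the counter's high half: above the midpoint
        LRU is missing more, so the default becomes BBIP."""
        return "BBIP" if self.psel > self.max // 2 else "LRU"
```

MadCache layers its per-PC override on top of this global decision, rather than replacing it.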
MadCache Design • PC-Predictor Table • Store PCs that have accessed Tracker Sets • Track behavior history using counter • Decrement if an address is used many times in the LLC • Increment if line is evicted and was never reused • Per-PC default policy override • LRU (default) plus BBIP override • BBIP (default) plus LRU override
PC-Predictor Table [Diagram: the missing PC concatenated with the current default policy (1 + 64 bits) indexes the PC-Predictor table; fields shown per entry: counter (6 bits), # entries (9 bits); the counter's MSB selects between default (0) and override (1)] • Parallel to cache miss, PC + current policy index PC-Predictor • If hit in table, follow the PC's override policy • If miss in table, follow global default policy
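The lookup and counter-update rules on this slide and the previous one can be sketched as follows. This is a behavioral illustration only: the dictionary stands in for the real indexed table, and the initial counter value of zero (a PC must prove itself dead before being overridden) is an assumption not stated on the slides.

```python
class PCPredictor:
    """Sketch of the per-PC predictor. Entries are keyed by
    (current default policy, PC) and hold a 6-bit saturating counter,
    as on the slide; the counter's MSB decides whether that PC follows
    the default policy or its override."""
    CTR_BITS = 6
    CTR_MAX = (1 << CTR_BITS) - 1
    MSB = 1 << (CTR_BITS - 1)

    def __init__(self):
        self.table = {}  # (policy, pc) -> saturating counter, starts at 0

    def on_reuse(self, policy, pc):
        """A line inserted by this PC was hit again in the LLC:
        decrement toward 'follow the default'."""
        k = (policy, pc)
        self.table[k] = max(self.table.get(k, 0) - 1, 0)

    def on_dead_evict(self, policy, pc):
        """A line was evicted without ever being reused:
        increment toward 'override the default'."""
        k = (policy, pc)
        self.table[k] = min(self.table.get(k, 0) + 1, self.CTR_MAX)

    def choose(self, default_policy, pc):
        """On a miss, the PC + current policy index the table in parallel.
        A table miss follows the global default; a table hit with the
        counter's MSB set follows the opposite (override) policy."""
        ctr = self.table.get((default_policy, pc))
        if ctr is None:
            return default_policy
        override = "BBIP" if default_policy == "LRU" else "LRU"
        return override if ctr & self.MSB else default_policy
```

So a PC that keeps inserting dead lines under the LRU default is steered to BBIP (effectively bypassed), while reuse pulls it back to the default.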
Multi-Threaded MadCache • Thread aware MadCache • Similar structures as single-threaded MadCache • Track based on current policy of other threads • Multithreaded MadCache extensions • Separate tracker sets for each thread • Each thread still tracks LRU and BBIP • PC-Predictor table • Extended number of entries • Indexed by thread-ID, policy, and PC • Set dueling PER THREAD
Multi-threaded MadCache [Diagram: per-thread default policies for TID-0 through TID-3 (10 bits); the PC-Predictor table is indexed by TID + <P0,P1,P2,P3> + PC (2 + 4 + 64 bits); fields shown per entry: counter (6 bits), # entries (9 bits), with the counter's MSB selecting default (0) vs. override (1); the last-level cache holds TID-0 BBIP tracker sets, TID-0 LRU tracker sets, the other threads' tracker sets, and follower sets]
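The multithreaded index on the slide combines three fields. The sketch below only illustrates forming that key; the one-bit-per-thread policy encoding and the XOR-fold hash down to a table index are assumptions for illustration, since the slides state the fields (2 + 4 + 64 bits) but not the hash.

```python
def mt_predictor_index(tid, policy_bits, pc, index_bits=10):
    """Form a multithreaded PC-Predictor index from the slide's fields:
    a 2-bit thread ID, a 4-bit vector <P0,P1,P2,P3> of each thread's
    current default policy (assumed one bit per thread, e.g. 1 = BBIP),
    and the 64-bit missing PC."""
    key = (tid << 68) | (policy_bits << 64) | pc   # 2 + 4 + 64 bits, as shown
    idx = 0
    mask = (1 << index_bits) - 1
    while key:                                     # XOR-fold the wide key down
        idx ^= key & mask                          # to index_bits (assumed hash)
        key >>= index_bits
    return idx
```

Including the other threads' policy bits in the key means a PC's prediction can differ depending on what the co-running threads are doing, which matches the per-thread set dueling described above.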
MadCache – Example Application • Deep Packet Inspection [1] • Large match tables (1MB+) commonly used for DFA/XFA regular expression matching • Incoming byte stream from packets causes different table traversals • Table exhibits reuse between packets • Packets mostly streaming (backtracking implementation dependent) [1] Evaluating GPUs for Network Packet Signature Matching – ISPASS '09
MadCache – Example Application [Diagram: several processing elements, each consuming its own packet stream while traversing a shared Match Table] • Packets mostly streaming • Frequently accessed Match Table contents held in L1/L2 • Less frequently accessed elements in LLC/memory
MadCache – Example Application • DIP • Would favor BIP policy due to packet data streaming • LLC mixture of Match Table and useless packet data • MadCache • Would identify PCs associated with Match Table as useful • LLC populated almost entirely by Match Table [Diagram: DIP's LLC mixes packet data and table data; MadCache's LLC holds nearly all table data]
Experimentation • 15 benchmarks from SPEC CPU2006 • 15 workload mixes for multithreaded experiments • 200 million cycle simulations
Results – Single-threaded IPC normalized to LRU • 2.5% improvement across benchmarks tested • Slight improvement over DIP
Results – Multithreaded Throughput normalized to LRU • 6% improvement across mixes tested • DIP performs similarly to LRU
Results Weighted speedup normalized to LRU • 4.5% improvement across benchmarks tested • DIP performs similarly to LRU
Future Work • MadderCache? • Optimize size of structures • PC-Predictor Table size • Replace CAM with hashed PC & tag • Detailed analysis of benchmarks with MadCache • Extend PC predictions • Currently do not take sharers into account
Conclusions • Cache behavior still evolving • Changing cache levels, sharing, workloads • MadCache insertion policy uses PC information • PCs exhibit a useful amount of predictable behavior • MadCache performance • 2.5% IPC improvement for single-threaded • 4.5% weighted speedup, 6% throughput improvement for 4 threads • Sized to competition bit budget • Preliminary investigations show little impact from reducing structure sizes