Adaptive Insertion Policies for Managing Shared Caches. Aamer Jaleel, William Hasenplaugh, Moinuddin Qureshi, Julien Sebot, Simon Steely Jr., Joel Emer. Intel Corporation, VSSAD. Aamer.Jaleel@intel.com. International Conference on Parallel Architectures and Compilation Techniques (PACT).
[Figure: cache hierarchies — single core (SMT) with FLC and LLC; dual core (ST/SMT) with per-core FLCs and a shared LLC; quad-core (ST/SMT) with per-core FLCs and MLCs and a shared LLC]
Paper Motivation
• Shared caches are common, and more so with an increasing # of cores
• More concurrent applications → more contention for the shared cache
• High performance requires managing the shared cache efficiently
[Figure: cache occupancy under LRU replacement (2MB shared cache) — soplex vs. h264ref occupancy over time (0–100%), with misses per 1000 instructions under LRU for each]
Problems with LRU-Managed Shared Caches
• Conventional LRU policy allocates resources based on rate of demand
• Applications that do not benefit from the cache cause destructive cache interference
[Figure: same cache-occupancy figure as the previous slide (2MB shared cache, soplex vs. h264ref)]
Addressing Shared Cache Performance
• Conventional LRU policy allocates resources based on rate of demand
• Applications that do not benefit from the cache cause destructive cache interference
• Cache Partitioning: reserves cache resources based on application benefit rather than rate of demand
  • Requires HW to detect cache benefit
  • Requires changes to the existing cache structure
  • Not scalable to a large # of applications
Goal: Eliminate the Drawbacks of Cache Partitioning
Paper Contributions
• Problem: For shared caches, the conventional LRU policy allocates cache resources based on rate of demand rather than benefit
• Goals: Design a dynamic hardware mechanism that:
  1. Provides High Performance by Allocating Cache on a Benefit Basis
  2. Is Robust Across Different Concurrently Executing Applications
  3. Scales to a Large Number of Competing Applications
  4. Requires Low Design Overhead
• Solution: Thread-Aware Dynamic Insertion Policy (TADIP) that improves average throughput by 12-18% for 2, 4, 8, and 16-core systems with two bytes of storage per HW-thread
TADIP, Unlike Cache Partitioning, DOES NOT Attempt to Reserve Cache Space
Review Insertion Policies • “Adaptive Insertion Policies for High-Performance Caching” • Moinuddin Qureshi, Aamer Jaleel, Yale Patt, Simon Steely Jr., Joel Emer • Appeared in ISCA’07
Cache Replacement 101 – ISCA’07
Two components of cache replacement:
• Victim Selection: which line to replace for the incoming line? (e.g., LRU, Random)
• Insertion Policy: with what priority is the new line placed in the replacement list? (e.g., insert the new line at the MRU position)
Simple changes to the insertion policy can minimize cache thrashing and improve cache performance for memory-intensive workloads
[Figure: recency stack a b c d e f g h (MRU→LRU). On a reference to ‘i’: conventional LRU inserts at MRU (i a b c d e f g); LIP inserts at LRU (a b c d e f g i); BIP inserts at the MRU position if rand() < b, else at the LRU position]
Static Insertion Policies – ISCA’07
• Conventional (MRU Insertion) Policy: choose victim, promote to MRU
• LRU Insertion Policy (LIP): choose victim, DO NOT promote to MRU; unless reused, lines stay at the LRU position
• Bimodal Insertion Policy (BIP): LIP does not age older lines, so infrequently insert some misses at MRU; bimodal throttle b (we used b ≈ 3%)
Applications Prefer Either Conventional LRU or BIP…
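The three static policies above can be sketched on a software model of a recency stack. This is an illustrative sketch, not the paper's hardware implementation; the function names and the list-based stack are assumptions for clarity.

```python
import random

# Model a cache set as a recency stack: index 0 = MRU, last index = LRU.
BIP_EPSILON = 1.0 / 32  # bimodal throttle b, roughly the ~3% used in ISCA'07

def insert(stack, line, policy):
    """Evict the LRU line, then insert `line` per the chosen policy."""
    stack.pop()                          # victim selection: always the LRU line
    if policy == "LRU":                  # conventional: insert at MRU
        stack.insert(0, line)
    elif policy == "LIP":                # LRU Insertion Policy: insert at LRU
        stack.append(line)
    elif policy == "BIP":                # bimodal: mostly LIP, rarely MRU
        if random.random() < BIP_EPSILON:
            stack.insert(0, line)
        else:
            stack.append(line)
    return stack

stack = list("abcdefgh")
insert(stack, "i", "LIP")                # LIP leaves 'i' in the LRU position
```

With LIP, `i` is evicted next unless it is reused first, which is exactly what keeps a thrashing application from flushing the whole set.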
[Figure: cache divided into SDM-LRU sets, SDM-BIP sets, and follower sets; misses to SDM-LRU increment PSEL, misses to SDM-BIP decrement it; followers use LRU if the PSEL MSB = 0, BIP if it = 1]
Dynamic Insertion Policy (DIP) via “Set-Dueling” – ISCA’07
• Set Dueling Monitors (SDMs): dedicated sets that estimate the performance of a pre-defined policy
• Divide the cache in three:
  • SDM-LRU: dedicated LRU sets
  • SDM-BIP: dedicated BIP sets
  • Follower sets
• PSEL: n-bit saturating counter
  • misses to SDM-LRU: PSEL++
  • misses to SDM-BIP: PSEL--
• Follower sets insertion policy:
  • Use LRU if PSEL MSB = 0
  • Use BIP if PSEL MSB = 1
• Based on analytical and empirical studies: 32 sets per SDM, 10-bit PSEL counter
HW Required: 10 bits + Combinational Logic
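The set-dueling mechanism above can be sketched in a few lines. The contiguous set-to-SDM mapping and the helper names are illustrative assumptions (real designs pick SDM sets by hashing the set index); the counter behavior follows the slide.

```python
# Sketch of DIP's set-dueling: two groups of dedicated sets duel via one
# saturating counter, and the follower sets copy the current winner.
NUM_SETS  = 1024
SDM_SETS  = 32                    # dedicated sets per policy, as on the slide
PSEL_BITS = 10
PSEL_MAX  = (1 << PSEL_BITS) - 1

psel = PSEL_MAX // 2              # saturating counter, start at midpoint

def set_type(set_index):
    """Map a set to SDM-LRU, SDM-BIP, or follower (simplified static split)."""
    if set_index < SDM_SETS:
        return "SDM-LRU"
    if set_index < 2 * SDM_SETS:
        return "SDM-BIP"
    return "follower"

def on_miss(set_index):
    """Update PSEL only on misses to dedicated sets."""
    global psel
    kind = set_type(set_index)
    if kind == "SDM-LRU":
        psel = min(psel + 1, PSEL_MAX)   # LRU sets missing -> lean toward BIP
    elif kind == "SDM-BIP":
        psel = max(psel - 1, 0)          # BIP sets missing -> lean toward LRU

def follower_policy():
    """Followers adopt the policy whose SDM currently misses less."""
    return "BIP" if psel >> (PSEL_BITS - 1) else "LRU"
```

The key property is that the monitors are always-on: if the workload's behavior changes, the counter drifts and the follower sets switch policy without any explicit retraining.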
Extending DIP to Shared Caches
• DIP uses a single policy (LRU or BIP) for all applications competing for the cache
• DIP cannot distinguish between apps that benefit from the cache and those that do not
• Example: soplex + h264ref w/ 2MB cache
  • DIP learns LRU for both apps
  • soplex causes destructive interference
  • Desirable that only h264ref follow LRU and soplex follow BIP
Need a Thread-Aware Dynamic Insertion Policy (TADIP)
Thread Aware Dynamic Insertion Policy (TADIP)
• Assume an N-core CMP running N apps; what is the best insertion policy for each app? (LRU=0, BIP=1)
• The insertion-policy decision can be thought of as an N-bit binary string: < P0, P1, P2 … PN-1 >
  • If Px = 1, application x uses BIP, else it uses LRU
  • e.g., 0000: always use conventional LRU; 1111: always use BIP
• With an N-bit string there are 2^N possible combinations. How to find the best one?
  • Offline Profiling: input-set/system dependent & impractical with large N
  • Brute-Force Search using SDMs: infeasible with large N
Need a PRACTICAL and SCALABLE Implementation of TADIP
Using Set-Dueling As a Practical Approach to TADIP
• Unnecessary to exhaustively search all 2^N combinations
• Some bits of the best binary insertion string can be learned independently
  • Example: always use BIP for applications that create interference
• Exponential Search Space → Linear Search Space
• Learn the best policy (BIP or LRU) for each app in the presence of all other apps
Use Per-Application SDMs To Decide: in the presence of other apps, does an app cause destructive interference? If so, use BIP for this app, else use the LRU policy
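The exponential-to-linear reduction above is worth making concrete. The arithmetic below is illustrative only; the function names are assumptions, not terms from the paper.

```python
# Brute-force set dueling would need one SDM group per insertion string,
# i.e. exponential in the thread count. TADIP instead duels each thread's
# two choices (LRU vs. BIP) independently, which is linear.

def brute_force_sdm_groups(n_threads):
    return 2 ** n_threads        # one group per binary string <P0 ... PN-1>

def tadip_sdm_groups(n_threads):
    return 2 * n_threads         # one LRU SDM and one BIP SDM per thread

# At 16 threads: 2**16 = 65536 groups brute-force vs. 32 groups for TADIP.
```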
[Figure: set-level and high-level views of a cache shared by 4 applications (APP0–APP3). Each app owns one LRU SDM and one BIP SDM; all remaining sets are followers. A miss in APPc’s LRU SDM increments PSELc; a miss in its BIP SDM decrements PSELc]
TADIP Using Set-Dueling Monitors (SDMs)
• Assume a cache shared by 4 applications: APP0, APP1, APP2, APP3
• In the presence of the other apps, does APP0 doing LRU or BIP improve cache performance?
  • APP0’s LRU SDM fills with < 0, P1, P2, P3 >; its BIP SDM with < 1, P1, P2, P3 >
  • Likewise < P0, 0, P2, P3 > / < P0, 1, P2, P3 >, < P0, P1, 0, P3 > / < P0, P1, 1, P3 >, and < P0, P1, P2, 0 > / < P0, P1, P2, 1 >
• Follower sets fill with < P0, P1, P2, P3 >, where Pc = MSB( PSELc )
TADIP Using Set-Dueling Monitors (SDMs)
• Assume a cache shared by 4 applications: APP0, APP1, APP2, APP3
• LRU SDMs for each APP
• BIP SDMs for each APP
• Follower sets
• Per-APP PSEL saturating counters
  • misses to an app’s LRU SDM: PSEL++
  • misses to an app’s BIP SDM: PSEL--
• Follower sets insertion policy:
  • SDMs of one thread are follower sets of another thread
  • Let Px = MSB[ PSELx ]
  • Fill Decision: < P0, P1, P2, P3 >
• 32 sets per SDM, 10-bit PSEL, Pc = MSB( PSELc )
HW Required: (10*T) bits + Combinational Logic
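Extending the earlier DIP sketch per thread gives a minimal model of the TADIP decision. The interface below (passing the SDM owner and its fixed policy explicitly) is an assumption made to keep the sketch short; it is not the paper's hardware layout.

```python
# Sketch of TADIP's per-thread set dueling for a 4-app shared cache.
NUM_APPS  = 4
PSEL_BITS = 10
PSEL_MAX  = (1 << PSEL_BITS) - 1

psel = [PSEL_MAX // 2] * NUM_APPS    # one saturating counter per app

def on_miss(app, sdm_owner, sdm_policy):
    """A miss in app `sdm_owner`'s SDM updates only that app's counter;
    misses to follower sets (sdm_owner=None) update nothing."""
    if sdm_owner is None:
        return
    if sdm_policy == "LRU":
        psel[sdm_owner] = min(psel[sdm_owner] + 1, PSEL_MAX)
    else:                            # BIP SDM
        psel[sdm_owner] = max(psel[sdm_owner] - 1, 0)

def fill_policy(app):
    """On a fill into a follower set, `app` uses its learned bit Px."""
    return "BIP" if psel[app] >> (PSEL_BITS - 1) else "LRU"
```

Because each counter only compares that app's two choices in the presence of everyone else's current choices, an app that thrashes drives its own PSEL high and gets demoted to BIP without any explicit partitioning of space.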
Summarizing Insertion Policies
TADIP is SCALABLE with Large N
Experimental Setup • Simulator and Benchmarks: • CMP$im – A Pin-based Multi-Core Performance Simulator • 17 representative SPEC CPU2006 benchmarks • Baseline Study: • 4-core CMP with in-order cores (assuming L1-hit IPC of 1) • Three-level Cache Hierarchy: 32KB L1, 256KB L2, 4MB L3 • 15 workload mixes of four different SPEC CPU2006 benchmarks • Scalability Study: • 2-core, 4-core, 8-core, 16-core systems • 50 workload mixes of 2, 4, 8, & 16 different SPEC CPU2006 benchmarks
[Figure: soplex + h264ref sharing a 2MB cache — APKI, MPKI, % MRU insertions, and cache usage for each app under the baseline LRU policy / DIP vs. under TADIP]
APKI: accesses per 1000 instructions; MPKI: misses per 1000 instructions
TADIP Improves Throughput by 27% over LRU and DIP
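The APKI/MPKI metrics used throughout the results are simple per-kilo-instruction rates. The numbers below are illustrative, not measurements from the paper.

```python
# Events per 1000 instructions, as the slide defines APKI and MPKI.
def per_kilo_instr(events, instructions):
    return events * 1000.0 / instructions

# Hypothetical example: 10M instructions, 200K cache accesses, 50K misses.
apki = per_kilo_instr(200_000, 10_000_000)   # 20 accesses per 1000 instr
mpki = per_kilo_instr(50_000, 10_000_000)    # 5 misses per 1000 instr
```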
TADIP Results – Throughput
[Figure: throughput of DIP and TADIP normalized to LRU across workload mixes; one mix is annotated “No Gains from DIP”]
• DIP and TADIP are ROBUST and do not degrade performance over LRU
• Making thread-aware decisions is 2x better than DIP
TADIP Compared to Offline Best Static Policy
• “Best Static” is almost always better because it is the insertion string with the best IPC, while TADIP optimizes for fewer misses; TADIP could be used to optimize other metrics instead (e.g., IPC)
• Where TADIP wins, it is due to phase adaptation
TADIP is within 85% of the Best Offline-Determined Insertion Policy Decision
TADIP Vs. UCP (MICRO’06) — Utility-Based Cache Partitioning
Cost per thread (bytes): UCP 1920, TADIP 2
• TADIP out-performs UCP without requiring any cache-partitioning hardware
• Unlike cache-partitioning schemes, TADIP does NOT reserve cache space
• TADIP does efficient CACHE MANAGEMENT by changing the insertion policy
TADIP Results – Sensitivity to Cache Size TADIP Provides Performance Equivalent to Doubling Cache Size
TADIP Results – Scalability
[Figure: throughput normalized to the baseline system for 2, 4, 8, and 16 cores]
TADIP Scales to a Large Number of Concurrently Executing Applications
Summary
• The Problem: For shared caches, the conventional LRU policy allocates cache resources based on rate of demand rather than benefit
• Solution: Thread-Aware Dynamic Insertion Policy (TADIP)
  1. Provides High Performance by Allocating Cache on a Benefit Basis
     - Up to 94%, 64%, 26%, and 16% performance improvement on 2, 4, 8, and 16-core CMPs
  2. Is Robust Across Different Workload Mixes
     - Does not significantly hurt performance when LRU works well
  3. Scales to a Large Number of Competing Applications
     - Evaluated up to 16 cores in our study
  4. Requires Low Design Overhead
     - < 2 bytes per HW-thread and NO CHANGES to the existing cache structure
Journal of Instruction-Level Parallelism — 1st Data Prefetching Championship (DPC-1)
Sponsored by: Intel, JILP, IEEE TC-uARCH. In conjunction with: HPCA-15
Paper & Abstract Due: December 12th, 2008
Notification: January 16th, 2009
Final Version: January 30th, 2009
More Information and Prefetch Download Kit At: http://www.jilp.org/dpc/
TADIP Results – Weighted Speedup
• TADIP provides more than twice the performance improvement of DIP
• TADIP improves performance over LRU by 18%
TADIP Results – Fairness Metric
TADIP Improves Fairness
TADIP In Presence of Prefetching on 4-core CMP TADIP Improves Performance Even In Presence of HW Prefetching
Insertion Policy to Control Cache Occupancy (16 Cores)
[Figure: sixteen-core mix with a 16MB LLC — APKI, MPKI, % MRU insertions, and cache usage over time]
• Changing the insertion policy directly controls the amount of cache resources provided to an application
• The figure shows only the TADIP-selected insertion policy for xalancbmk & sphinx3
• TADIP improves performance by 28%
Insertion Policy Directly Controls Cache Occupancy
[Figure: set-level and high-level views of a cache shared by 2 applications. APP0’s LRU SDM fills with < 0, P1 > and its BIP SDM with < 1, P1 >, updating PSEL0; APP1’s SDMs fill with < P0, 0 > and < P0, 1 >, updating PSEL1; follower sets fill with < P0, P1 >]
TADIP Using Set-Dueling Monitors (SDMs)
• Assume a cache shared by 2 applications: APP0 and APP1
• In the presence of the other app, should APP0 do LRU or BIP? (PSEL0)
• In the presence of the other app, should APP1 do LRU or BIP? (PSEL1)
• 32 sets per SDM, 9-bit PSEL, Pc = MSB( PSELc )
TADIP Using Set-Dueling Monitors (SDMs)
• Assume a cache shared by 2 applications: APP0 and APP1
• LRU SDMs for each APP, BIP SDMs for each APP, and follower sets
• PSEL0, PSEL1: per-APP PSEL counters
  • misses to an app’s LRU SDM: PSEL++
  • misses to an app’s BIP SDM: PSEL--
• Follower sets insertion policy:
  • SDMs of one thread are follower sets of another thread
  • Let Px = MSB[ PSELx ]
  • Fill Decision: < P0, P1 >
• 32 sets per SDM, 9-bit PSEL counter, Pc = MSB( PSELc )
HW Required: (9*T) bits + Combinational Logic