Adaptive Insertion Policies for Managing Shared Caches. Aamer Jaleel, William Hasenplaugh, Moinuddin Qureshi, Julien Sebot, Simon Steely Jr., Joel Emer. Intel Corporation, VSSAD. Aamer.Jaleel@intel.com. International Conference on Parallel Architectures and Compilation Techniques (PACT).
[Figure: cache hierarchies — single core (SMT) with FLC and LLC; dual core (ST/SMT) with per-core FLCs and a shared LLC; quad-core (ST/SMT) with per-core FLCs and MLCs and a shared LLC]
Paper Motivation
• Shared caches are common, and more so with an increasing # of cores
• More concurrent applications → more contention for the shared cache
• High performance requires managing the shared cache efficiently
[Figure: cache occupancy under LRU replacement (2MB shared cache) — soplex vs. h264ref occupancy over time (0–100%), with misses per 1000 instructions under LRU for each]
Problems with LRU-Managed Shared Caches
• Conventional LRU policy allocates resources based on rate of demand
• Applications that do not benefit from the cache cause destructive cache interference
[Figure: same cache-occupancy figure as the previous slide (2MB shared cache, soplex vs. h264ref)]
Addressing Shared Cache Performance
• Conventional LRU policy allocates resources based on rate of demand
• Applications that do not benefit from the cache cause destructive cache interference
• Cache Partitioning: reserves cache resources based on application benefit rather than rate of demand
  • Requires HW to detect cache benefit
  • Requires changes to the existing cache structure
  • Not scalable to a large # of applications
Goal: Eliminate the Drawbacks of Cache Partitioning
Paper Contributions
• Problem: For shared caches, the conventional LRU policy allocates cache resources based on rate of demand rather than benefit
• Goals: Design a dynamic hardware mechanism that:
  1. Provides High Performance by Allocating Cache on a Benefit Basis
  2. Is Robust Across Different Concurrently Executing Applications
  3. Scales to a Large Number of Competing Applications
  4. Requires Low Design Overhead
• Solution: Thread-Aware Dynamic Insertion Policy (TADIP) that improves average throughput by 12-18% for 2, 4, 8, and 16-core systems with two bytes of storage per HW-thread
TADIP, Unlike Cache Partitioning, DOES NOT Attempt to Reserve Cache Space
Review Insertion Policies • “Adaptive Insertion Policies for High-Performance Caching” • Moinuddin Qureshi, Aamer Jaleel, Yale Patt, Simon Steely Jr., Joel Emer • Appeared in ISCA’07
Cache Replacement 101 – ISCA’07
Two components of cache replacement:
• Victim Selection: which line to replace for the incoming line? (e.g., LRU, Random)
• Insertion Policy: with what priority is the new line placed in the replacement list? (e.g., insert the new line at the MRU position)
Simple changes to the insertion policy can minimize cache thrashing and improve cache performance for memory-intensive workloads
[Figure: recency stack a b c d e f g h (MRU→LRU). On a reference to ‘i’: conventional LRU inserts at MRU (i a b c d e f g); LIP inserts at LRU (a b c d e f g i); BIP inserts at the MRU position if rand() < b, else at the LRU position]
Static Insertion Policies – ISCA’07
• Conventional (MRU Insertion) Policy: choose victim, promote to MRU
• LRU Insertion Policy (LIP): choose victim, DO NOT promote to MRU; unless reused, lines stay at the LRU position
• Bimodal Insertion Policy (BIP): LIP does not age older lines, so infrequently insert some misses at MRU; bimodal throttle b (we used b ≈ 3%)
Applications Prefer Either Conventional LRU or BIP…
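The three static policies above can be sketched on a software model of a recency stack. This is an illustrative sketch, not the paper's hardware implementation; the function names and the list-based stack are assumptions for clarity.

```python
import random

# Model a cache set as a recency stack: index 0 = MRU, last index = LRU.
BIP_EPSILON = 1.0 / 32  # bimodal throttle b, roughly the ~3% used in ISCA'07

def insert(stack, line, policy):
    """Evict the LRU line, then insert `line` per the chosen policy."""
    stack.pop()                          # victim selection: always the LRU line
    if policy == "LRU":                  # conventional: insert at MRU
        stack.insert(0, line)
    elif policy == "LIP":                # LRU Insertion Policy: insert at LRU
        stack.append(line)
    elif policy == "BIP":                # bimodal: mostly LIP, rarely MRU
        if random.random() < BIP_EPSILON:
            stack.insert(0, line)
        else:
            stack.append(line)
    return stack

stack = list("abcdefgh")
insert(stack, "i", "LIP")                # LIP leaves 'i' in the LRU position
```

With LIP, `i` is evicted next unless it is reused first, which is exactly what keeps a thrashing application from flushing the whole set.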
[Figure: cache divided into SDM-LRU sets, SDM-BIP sets, and follower sets; misses to SDM-LRU increment PSEL, misses to SDM-BIP decrement it; followers use LRU if the PSEL MSB = 0, BIP if it = 1]
Dynamic Insertion Policy (DIP) via “Set-Dueling” – ISCA’07
• Set Dueling Monitors (SDMs): dedicated sets that estimate the performance of a pre-defined policy
• Divide the cache in three:
  • SDM-LRU: dedicated LRU sets
  • SDM-BIP: dedicated BIP sets
  • Follower sets
• PSEL: n-bit saturating counter
  • misses to SDM-LRU: PSEL++
  • misses to SDM-BIP: PSEL--
• Follower sets insertion policy:
  • Use LRU if PSEL MSB = 0
  • Use BIP if PSEL MSB = 1
• Based on analytical and empirical studies: 32 sets per SDM, 10-bit PSEL counter
HW Required: 10 bits + Combinational Logic
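The set-dueling mechanism above can be sketched in a few lines. The contiguous set-to-SDM mapping and the helper names are illustrative assumptions (real designs pick SDM sets by hashing the set index); the counter behavior follows the slide.

```python
# Sketch of DIP's set-dueling: two groups of dedicated sets duel via one
# saturating counter, and the follower sets copy the current winner.
NUM_SETS  = 1024
SDM_SETS  = 32                    # dedicated sets per policy, as on the slide
PSEL_BITS = 10
PSEL_MAX  = (1 << PSEL_BITS) - 1

psel = PSEL_MAX // 2              # saturating counter, start at midpoint

def set_type(set_index):
    """Map a set to SDM-LRU, SDM-BIP, or follower (simplified static split)."""
    if set_index < SDM_SETS:
        return "SDM-LRU"
    if set_index < 2 * SDM_SETS:
        return "SDM-BIP"
    return "follower"

def on_miss(set_index):
    """Update PSEL only on misses to dedicated sets."""
    global psel
    kind = set_type(set_index)
    if kind == "SDM-LRU":
        psel = min(psel + 1, PSEL_MAX)   # LRU sets missing -> lean toward BIP
    elif kind == "SDM-BIP":
        psel = max(psel - 1, 0)          # BIP sets missing -> lean toward LRU

def follower_policy():
    """Followers adopt the policy whose SDM currently misses less."""
    return "BIP" if psel >> (PSEL_BITS - 1) else "LRU"
```

The key property is that the monitors are always-on: if the workload's behavior changes, the counter drifts and the follower sets switch policy without any explicit retraining.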
Extending DIP to Shared Caches
• DIP uses a single policy (LRU or BIP) for all applications competing for the cache
• DIP cannot distinguish between apps that benefit from the cache and those that do not
• Example: soplex + h264ref w/ 2MB cache
  • DIP learns LRU for both apps
  • soplex causes destructive interference
  • Desirable that only h264ref follow LRU and soplex follow BIP
Need a Thread-Aware Dynamic Insertion Policy (TADIP)
Thread Aware Dynamic Insertion Policy (TADIP)
• Assume an N-core CMP running N apps; what is the best insertion policy for each app? (LRU=0, BIP=1)
• The insertion-policy decision can be thought of as an N-bit binary string: < P0, P1, P2 … PN-1 >
  • If Px = 1, application x uses BIP, else it uses LRU
  • e.g., 0000: always use conventional LRU; 1111: always use BIP
• With an N-bit string there are 2^N possible combinations. How to find the best one?
  • Offline Profiling: input-set/system dependent & impractical with large N
  • Brute-Force Search using SDMs: infeasible with large N
Need a PRACTICAL and SCALABLE Implementation of TADIP
Using Set-Dueling As a Practical Approach to TADIP
• Unnecessary to exhaustively search all 2^N combinations
• Some bits of the best binary insertion string can be learned independently
  • Example: always use BIP for applications that create interference
• Exponential Search Space → Linear Search Space
• Learn the best policy (BIP or LRU) for each app in the presence of all other apps
Use Per-Application SDMs To Decide: in the presence of other apps, does an app cause destructive interference? If so, use BIP for this app, else use the LRU policy
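The exponential-to-linear reduction above is worth making concrete. The arithmetic below is illustrative only; the function names are assumptions, not terms from the paper.

```python
# Brute-force set dueling would need one SDM group per insertion string,
# i.e. exponential in the thread count. TADIP instead duels each thread's
# two choices (LRU vs. BIP) independently, which is linear.

def brute_force_sdm_groups(n_threads):
    return 2 ** n_threads        # one group per binary string <P0 ... PN-1>

def tadip_sdm_groups(n_threads):
    return 2 * n_threads         # one LRU SDM and one BIP SDM per thread

# At 16 threads: 2**16 = 65536 groups brute-force vs. 32 groups for TADIP.
```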
[Figure: set-level and high-level views of a cache shared by 4 applications (APP0–APP3). Each app owns one LRU SDM and one BIP SDM; all remaining sets are followers. A miss in APPc’s LRU SDM increments PSELc; a miss in its BIP SDM decrements PSELc]
TADIP Using Set-Dueling Monitors (SDMs)
• Assume a cache shared by 4 applications: APP0, APP1, APP2, APP3
• In the presence of the other apps, does APP0 doing LRU or BIP improve cache performance?
  • APP0’s LRU SDM fills with < 0, P1, P2, P3 >; its BIP SDM with < 1, P1, P2, P3 >
  • Likewise < P0, 0, P2, P3 > / < P0, 1, P2, P3 >, < P0, P1, 0, P3 > / < P0, P1, 1, P3 >, and < P0, P1, P2, 0 > / < P0, P1, P2, 1 >
• Follower sets fill with < P0, P1, P2, P3 >, where Pc = MSB( PSELc )
TADIP Using Set-Dueling Monitors (SDMs)
• Assume a cache shared by 4 applications: APP0, APP1, APP2, APP3
• LRU SDMs for each APP
• BIP SDMs for each APP
• Follower sets
• Per-APP PSEL saturating counters
  • misses to an app’s LRU SDM: PSEL++
  • misses to an app’s BIP SDM: PSEL--
• Follower sets insertion policy:
  • SDMs of one thread are follower sets of another thread
  • Let Px = MSB[ PSELx ]
  • Fill Decision: < P0, P1, P2, P3 >
• 32 sets per SDM, 10-bit PSEL, Pc = MSB( PSELc )
HW Required: (10*T) bits + Combinational Logic
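Extending the earlier DIP sketch per thread gives a minimal model of the TADIP decision. The interface below (passing the SDM owner and its fixed policy explicitly) is an assumption made to keep the sketch short; it is not the paper's hardware layout.

```python
# Sketch of TADIP's per-thread set dueling for a 4-app shared cache.
NUM_APPS  = 4
PSEL_BITS = 10
PSEL_MAX  = (1 << PSEL_BITS) - 1

psel = [PSEL_MAX // 2] * NUM_APPS    # one saturating counter per app

def on_miss(app, sdm_owner, sdm_policy):
    """A miss in app `sdm_owner`'s SDM updates only that app's counter;
    misses to follower sets (sdm_owner=None) update nothing."""
    if sdm_owner is None:
        return
    if sdm_policy == "LRU":
        psel[sdm_owner] = min(psel[sdm_owner] + 1, PSEL_MAX)
    else:                            # BIP SDM
        psel[sdm_owner] = max(psel[sdm_owner] - 1, 0)

def fill_policy(app):
    """On a fill into a follower set, `app` uses its learned bit Px."""
    return "BIP" if psel[app] >> (PSEL_BITS - 1) else "LRU"
```

Because each counter only compares that app's two choices in the presence of everyone else's current choices, an app that thrashes drives its own PSEL high and gets demoted to BIP without any explicit partitioning of space.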
Summarizing Insertion Policies
TADIP is SCALABLE with Large N
Experimental Setup • Simulator and Benchmarks: • CMP$im – A Pin-based Multi-Core Performance Simulator • 17 representative SPEC CPU2006 benchmarks • Baseline Study: • 4-core CMP with in-order cores (assuming L1-hit IPC of 1) • Three-level Cache Hierarchy: 32KB L1, 256KB L2, 4MB L3 • 15 workload mixes of four different SPEC CPU2006 benchmarks • Scalability Study: • 2-core, 4-core, 8-core, 16-core systems • 50 workload mixes of 2, 4, 8, & 16 different SPEC CPU2006 benchmarks
[Figure: soplex + h264ref sharing a 2MB cache — APKI, MPKI, % MRU insertions, and cache usage for each app under the baseline LRU policy / DIP vs. under TADIP]
APKI: accesses per 1000 instructions; MPKI: misses per 1000 instructions
TADIP Improves Throughput by 27% over LRU and DIP
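The APKI/MPKI metrics used throughout the results are simple per-kilo-instruction rates. The numbers below are illustrative, not measurements from the paper.

```python
# Events per 1000 instructions, as the slide defines APKI and MPKI.
def per_kilo_instr(events, instructions):
    return events * 1000.0 / instructions

# Hypothetical example: 10M instructions, 200K cache accesses, 50K misses.
apki = per_kilo_instr(200_000, 10_000_000)   # 20 accesses per 1000 instr
mpki = per_kilo_instr(50_000, 10_000_000)    # 5 misses per 1000 instr
```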
TADIP Results – Throughput
[Figure: throughput of DIP and TADIP normalized to LRU across workload mixes; one mix is annotated “No Gains from DIP”]
• DIP and TADIP are ROBUST and do not degrade performance over LRU
• Making thread-aware decisions is 2x better than DIP
TADIP Compared to Offline Best Static Policy
• “Best Static” is almost always better because it is the insertion string with the best IPC, while TADIP optimizes for fewer misses; TADIP could be used to optimize other metrics instead (e.g., IPC)
• Where TADIP wins, it is due to phase adaptation
TADIP is within 85% of the Best Offline-Determined Insertion Policy Decision
TADIP Vs. UCP (MICRO’06) — Utility-Based Cache Partitioning
Cost per thread (bytes): UCP 1920, TADIP 2
• TADIP out-performs UCP without requiring any cache-partitioning hardware
• Unlike cache-partitioning schemes, TADIP does NOT reserve cache space
• TADIP does efficient CACHE MANAGEMENT by changing the insertion policy
TADIP Results – Sensitivity to Cache Size TADIP Provides Performance Equivalent to Doubling Cache Size
TADIP Results – Scalability
[Figure: throughput normalized to the baseline system for 2, 4, 8, and 16 cores]
TADIP Scales to a Large Number of Concurrently Executing Applications
Summary
• The Problem: For shared caches, the conventional LRU policy allocates cache resources based on rate of demand rather than benefit
• Solution: Thread-Aware Dynamic Insertion Policy (TADIP)
  1. Provides High Performance by Allocating Cache on a Benefit Basis
     - Up to 94%, 64%, 26%, and 16% performance improvement on 2, 4, 8, and 16-core CMPs
  2. Is Robust Across Different Workload Mixes
     - Does not significantly hurt performance when LRU works well
  3. Scales to a Large Number of Competing Applications
     - Evaluated up to 16 cores in our study
  4. Requires Low Design Overhead
     - < 2 bytes per HW-thread and NO CHANGES to the existing cache structure
Journal of Instruction-Level Parallelism — 1st Data Prefetching Championship (DPC-1)
Sponsored by: Intel, JILP, IEEE TC-uARCH. In conjunction with: HPCA-15
Paper & Abstract Due: December 12th, 2008
Notification: January 16th, 2009
Final Version: January 30th, 2009
More Information and Prefetch Download Kit At: http://www.jilp.org/dpc/
TADIP Results – Weighted Speedup
• TADIP provides more than twice the performance improvement of DIP
• TADIP improves performance over LRU by 18%
TADIP Results – Fairness Metric
TADIP Improves Fairness
TADIP In Presence of Prefetching on 4-core CMP TADIP Improves Performance Even In Presence of HW Prefetching
Insertion Policy to Control Cache Occupancy (16 Cores)
[Figure: sixteen-core mix with a 16MB LLC — APKI, MPKI, % MRU insertions, and cache usage over time]
• Changing the insertion policy directly controls the amount of cache resources provided to an application
• The figure shows only the TADIP-selected insertion policy for xalancbmk & sphinx3
• TADIP improves performance by 28%
Insertion Policy Directly Controls Cache Occupancy
[Figure: set-level and high-level views of a cache shared by 2 applications. APP0’s LRU SDM fills with < 0, P1 > and its BIP SDM with < 1, P1 >, updating PSEL0; APP1’s SDMs fill with < P0, 0 > and < P0, 1 >, updating PSEL1; follower sets fill with < P0, P1 >]
TADIP Using Set-Dueling Monitors (SDMs)
• Assume a cache shared by 2 applications: APP0 and APP1
• In the presence of the other app, should APP0 do LRU or BIP? (PSEL0)
• In the presence of the other app, should APP1 do LRU or BIP? (PSEL1)
• 32 sets per SDM, 9-bit PSEL, Pc = MSB( PSELc )
TADIP Using Set-Dueling Monitors (SDMs)
• Assume a cache shared by 2 applications: APP0 and APP1
• LRU SDMs for each APP, BIP SDMs for each APP, and follower sets
• PSEL0, PSEL1: per-APP PSEL counters
  • misses to an app’s LRU SDM: PSEL++
  • misses to an app’s BIP SDM: PSEL--
• Follower sets insertion policy:
  • SDMs of one thread are follower sets of another thread
  • Let Px = MSB[ PSELx ]
  • Fill Decision: < P0, P1 >
• 32 sets per SDM, 9-bit PSEL counter, Pc = MSB( PSELc )
HW Required: (9*T) bits + Combinational Logic