ECE8833 Polymorphous and Many-Core Computer Architecture. Lecture 6: Fair Caching Mechanisms for CMP. Prof. Hsien-Hsin S. Lee, School of Electrical and Computer Engineering.
Cache Sharing in CMP [Kim, Chandra, Solihin, PACT'04]
(Figure: Processor Core 1 and Processor Core 2, each with a private L1 $, sharing a common L2 $.)
Slide courtesy: Seongbeom Kim, D. Chandra and Y. Solihin @NCSU
Cache Sharing in CMP
(Figure sequence: thread t1 running alone fills the shared L2; thread t2 running alone does the same; when t1 and t2 run together, t1's blocks occupy most of the L2.)
t2's throughput is significantly reduced due to unfair cache sharing.
Slide courtesy: Seongbeom Kim, D. Chandra and Y. Solihin @NCSU
Shared L2 Cache Space Contention
(Figure: measured L2 space contention between co-scheduled applications.)
Slide courtesy: Seongbeom Kim, D. Chandra and Y. Solihin @NCSU
Impact of Unfair Cache Sharing
(Figure: uniprocessor scheduling gives P1 the time-slice sequence t1 t2 t3 t1 t4; 2-core CMP scheduling runs t1 on P1 in every slice while P2 rotates through t3 t3 t2 t2 t4.)
• gzip will get more time slices than the others if it is set to run at a higher priority, yet it can still run slower than lower-priority co-runners (priority inversion)
• It can further slow down the other processes (starvation)
• Thus the overall throughput is reduced; a fair scheme should instead slow all threads down by the same factor (uniform slowdown)
Stack Distance Profiling Algorithm [Qureshi+, MICRO-39]
One hit counter per LRU stack position of the cache tags, from MRU to LRU:

  Counter:  CTR Pos 0 (MRU)  CTR Pos 1  CTR Pos 2  CTR Pos 3 (LRU)
  Hits:     30               20         15         10

  Misses = 25
Stack Distance Profiling
• A counter for each cache way; C_{>A} is the counter for misses (accesses that fall beyond the deepest LRU position)
• Shows the reuse frequency for each way in a cache
• Can be used to predict the misses for any associativity smaller than A
• Misses for a 2-way cache for gzip = C_{>A} + Σ C_i, where i = 3 to 8 (for an 8-way cache)
• art does not need all the space, likely due to poor temporal locality
• If the space given to art is halved and handed to gzip, what happens? (A sketch of this prediction follows below.)
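As a hedged illustration of how stack-distance counters predict misses at a smaller associativity, a minimal C sketch (the counter values are hypothetical, patterned on the table above; this is not the paper's hardware):

  #include <stdio.h>

  /* Stack-distance profile for an 8-way cache: c[i] counts hits at LRU
   * stack position i (c[0] = MRU hits); c_beyond counts misses (C_{>A}). */
  static unsigned misses_with_ways(const unsigned c[8], unsigned c_beyond,
                                   unsigned ways) {
      unsigned misses = c_beyond;
      for (unsigned i = ways; i < 8; i++)   /* hits deeper than 'ways' become misses */
          misses += c[i];
      return misses;
  }

  int main(void) {
      unsigned c[8] = {30, 20, 15, 10, 8, 6, 4, 2};  /* hypothetical hit counters */
      unsigned c_beyond = 25;
      for (unsigned w = 1; w <= 8; w++)
          printf("ways=%u -> predicted misses=%u\n", w, misses_with_ways(c, c_beyond, w));
      return 0;
  }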
Fairness Metrics [Kim et al. PACT'04]
• Uniform slowdown: let T_i^alone be the execution time of thread t_i when it runs alone, and T_i^shared its execution time when it shares the cache with others. Ideally, every thread is slowed down by the same factor:
    T_i^shared / T_i^alone = T_j^shared / T_j^alone   for all threads i, j
• Since execution times are hard to obtain online, miss rates serve as a proxy. With X_i = MissRate_i^shared / MissRate_i^alone, we want to minimize
    M3 = Σ_i Σ_j |X_i − X_j|
• Ideally X_i = X_j: try to equalize the ratio of miss increase of each thread
Slide courtesy: Seongbeom Kim, D. Chandra and Y. Solihin @NCSU
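To make the metric concrete, a minimal C sketch that evaluates M3 over thread pairs from measured miss rates (the sample values are the ones used in the walkthrough on the next slides; link with -lm for fabs):

  #include <math.h>
  #include <stdio.h>

  /* M3 fairness metric from Kim et al. (PACT'04): sum over thread pairs of
   * |X_i - X_j|, where X_i = shared miss rate / alone miss rate. */
  static double m3(const double shared[], const double alone[], int n) {
      double sum = 0.0;
      for (int i = 0; i < n; i++)
          for (int j = i + 1; j < n; j++) {
              double xi = shared[i] / alone[i];
              double xj = shared[j] / alone[j];
              sum += fabs(xi - xj);
          }
      return sum;
  }

  int main(void) {
      /* values from the walkthrough: P1 20%/20%, P2 15%/5% */
      double shared[] = {0.20, 0.15}, alone[] = {0.20, 0.05};
      printf("M3 = %.2f\n", m3(shared, alone, 2));  /* |1.0 - 3.0| = 2.00 */
      return 0;
  }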
Partitionable Cache Hardware
• Modified LRU cache replacement policy [G. E. Suh et al., HPCA 2002]
• Hardware keeps per-thread counters for the current partition; the algorithm sets a target partition
• Example: current partition P1: 448B / P2: 576B, target partition P1: 384B / P2: 640B. On a P2 miss, the victim is chosen from P1's lines, so P1 shrinks and P2 grows until the current partition matches the target (P1: 384B / P2: 640B)
• Partition granularity could be as coarse as one entire cache way
Slide courtesy: Seongbeom Kim, D. Chandra and Y. Solihin @NCSU
Dynamic Fair Caching Algorithm
• Per-thread bookkeeping: a MissRate_alone counter (miss rate of the process running alone, obtained from stack distance profiling), a MissRate_shared counter (dynamic miss rate while running with the shared cache), and a target partition size. The example optimizes the M3 metric; a repartitioning interval of 10K accesses was found to be the best.
• Initial state: MissRate_alone P1: 20%, P2: 5%; target partition P1: 256KB, P2: 256KB.
• 1st interval: measured MissRate_shared P1: 20%, P2: 15%.
• Repartition! Evaluate M3: X1 = 20% / 20% = 1.0, X2 = 15% / 5% = 3.0. P2 suffers more, so the target becomes P1: 192KB, P2: 320KB (partition granularity: 64KB).
• 2nd interval: measured MissRate_shared P1: 20%, P2: 10%.
• Repartition! Evaluate M3: X1 = 1.0, X2 = 10% / 5% = 2.0. Still unfair to P2, so the target becomes P1: 128KB, P2: 384KB.
• 3rd interval: measured MissRate_shared P1: 25%, P2: 9%.
• Rollback: if the gain from the last repartition is too small (Δ = MR_old − MR_new < T_rollback), undo it: the target returns to P1: 192KB, P2: 320KB. A T_rollback threshold of 20% was found to be the best.
Slide courtesy: Seongbeom Kim, D. Chandra and Y. Solihin @NCSU
Generic Repartitioning Algorithm
• Pick the threads with the largest and smallest miss-increase ratios as a pair for repartitioning
• Repeat for all candidate processes (see the sketch below)
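A hedged C sketch of this pairing step (the thread count, quota unit, and data layout are assumptions for illustration, not the paper's implementation):

  #include <stdio.h>

  #define NTHREADS 4
  #define GRANULE  (64 * 1024)   /* 64KB partition granularity, as above */

  /* Pair the thread with the largest miss-increase ratio X_i (suffering most)
   * with the thread with the smallest (suffering least), shift one granule of
   * cache from the latter to the former, and repeat for remaining threads. */
  static void repartition(const double x[NTHREADS], size_t part[NTHREADS]) {
      int done[NTHREADS] = {0};
      for (int pair = 0; pair < NTHREADS / 2; pair++) {
          int hi = -1, lo = -1;
          for (int i = 0; i < NTHREADS; i++) {
              if (done[i]) continue;
              if (hi < 0 || x[i] > x[hi]) hi = i;
              if (lo < 0 || x[i] < x[lo]) lo = i;
          }
          if (hi < 0 || hi == lo) break;
          if (part[lo] > GRANULE) {   /* donor keeps at least one granule */
              part[lo] -= GRANULE;
              part[hi] += GRANULE;
          }
          done[hi] = done[lo] = 1;
      }
  }

  int main(void) {
      double x[NTHREADS] = {1.0, 3.0, 1.5, 2.0};   /* miss-increase ratios */
      size_t part[NTHREADS] = {256 * 1024, 256 * 1024, 256 * 1024, 256 * 1024};
      repartition(x, part);
      for (int i = 0; i < NTHREADS; i++)
          printf("P%d: %zuKB\n", i + 1, part[i] / 1024);
      return 0;
  }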
Running Processes on Dual-Core [Qureshi & Patt, MICRO-39]
(Figure: misses for equake and vpr as a function of the number of ways given, 1 to 16.)
• LRU: in real runs, on average 7 ways were allocated to equake and 9 to vpr
• UTIL: how much you use (in a set) is how much you will get
• Ideally: 3 ways to equake and 13 to vpr
Defining Utility
Utility U_a^b = Misses with a ways − Misses with b ways
(Figure: misses per 1000 instructions vs. number of ways from a 16-way 1MB L2, illustrating high-utility, low-utility, and saturating-utility applications.)
Slide courtesy: Moin Qureshi, MICRO-39
Framework for UCP
(Figure: Core1 and Core2, each with I$ and D$ plus a UMON, share an L2 cache backed by main memory; a partitioning algorithm (PA) reads both UMONs.)
Three components:
• Utility Monitors (UMON) per core
• Partitioning Algorithm (PA)
• Replacement support to enforce partitions
Slide courtesy: Moin Qureshi, MICRO-39
Utility Monitors (UMON)
• For each core, simulate the LRU policy using an Auxiliary Tag Directory (ATD)
• UMON-global: one set of way-counters shared by all sets
• Hit counters in the ATD count hits per recency position: (MRU) H0 H1 H2 H3 … H15 (LRU)
• LRU is a stack algorithm, so hit counts give utility directly, e.g., hits(2 ways) = H0 + H1
(Figure: ATD covering sets A through H feeding the shared hit counters.)
Utility Monitors (UMON): Reducing Overhead
• Extra tags incur hardware and power overhead
• Dynamic Set Sampling (DSS) reduces the overhead [Qureshi et al. ISCA'06]
• 32 sampled sets are sufficient, based on Chebyshev's inequality
• The paper uses simple static sampling: every 32nd set
• Storage < 2KB per UMON (or 0.17% of the L2)
(Figure: ATD sampling sets B, E, and F instead of all sets A through H.)
Partitioning Algorithm (PA)
• Evaluate all possible partitions and select the best
• With a ways to core1 and (16 − a) ways to core2:
    Hits_core1 = H0 + H1 + … + H_{a−1}        (from UMON1)
    Hits_core2 = H0 + H1 + … + H_{16−a−1}     (from UMON2)
• Select the a that maximizes (Hits_core1 + Hits_core2); a sketch follows below
• Partitioning is done once every 5 million cycles
• After each partitioning interval, the hit counters in all UMONs are halved to retain some past information
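A minimal C sketch of this exhaustive search (the counter values in main are hypothetical):

  #include <stdio.h>

  #define WAYS 16

  /* UCP partitioning (sketch): try every split of the 16 ways between two
   * cores and keep the one that maximizes total hits, using each core's
   * per-recency-position hit counters from its UMON. */
  static int best_split(const unsigned h1[WAYS], const unsigned h2[WAYS]) {
      int best_a = 1;
      unsigned best_hits = 0;
      for (int a = 1; a < WAYS; a++) {          /* each core gets >= 1 way */
          unsigned hits = 0;
          for (int i = 0; i < a; i++)        hits += h1[i];   /* hits(a ways)      */
          for (int i = 0; i < WAYS - a; i++) hits += h2[i];   /* hits(16 - a ways) */
          if (hits > best_hits) { best_hits = hits; best_a = a; }
      }
      return best_a;   /* ways for core1; core2 receives WAYS - best_a */
  }

  int main(void) {
      unsigned h1[WAYS] = {90, 40, 10, 5, 2, 1};   /* saturating utility */
      unsigned h2[WAYS] = {50, 45, 40, 35, 30, 25, 20, 15,
                           10,  8,  6,  4,  3,  2,  1,  1};
      printf("core1 gets %d ways\n", best_split(h1, h2));
      return 0;
  }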
Replacement Policy to Reach Desired Partition
• Use way partitioning [Suh+ HPCA'02, Iyer ICS'04]
• Each line contains core-id bits
• On a miss, count ways_occupied in the set by the miss-causing app
• If ways_occupied < ways_given, the app is under its quota: the victim is the LRU line from the other app; otherwise the victim is the LRU line from the miss-causing app itself
• Binary decision for dual-core (in this paper)
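A hedged sketch of that victim choice in C (the Line struct and LRU-order array are assumptions for illustration):

  /* Way-partitioned victim selection (sketch). recency[] lists way indices
   * from LRU (index 0) to MRU; each line records which core allocated it. */
  typedef struct { int valid; int core_id; /* tag, data, ... */ } Line;

  static int pick_victim(const Line set[], const int recency[], int assoc,
                         int miss_core, int ways_given) {
      int occupied = 0;
      for (int i = 0; i < assoc; i++)
          if (set[i].valid && set[i].core_id == miss_core)
              occupied++;
      /* under quota: take space from the other core; at/over quota: recycle own */
      int evict_other = (occupied < ways_given);
      for (int i = 0; i < assoc; i++) {            /* scan from LRU upward */
          int way = recency[i];
          int is_other = (set[way].core_id != miss_core);
          if (set[way].valid && is_other == evict_other)
              return way;
      }
      return recency[0];   /* fallback: global LRU */
  }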
UCP Performance (Weighted Speedup) UCP improves average weighted speedup by 11% (Dual Core)
UCP Performance (Throughput) UCP improves average throughput by 17%
Conventional LRU
(Figure: the incoming block is inserted at the MRU position of the recency stack and only drifts toward LRU over time.)
A never-reused block occupies one cache block for a long time with no benefit!
Slide Source: Yuejian Xie
LIP: LRU Insertion Policy [Qureshi et al. ISCA'07]
(Figure: the incoming block is inserted at the LRU position instead of MRU, so a never-reused block is evicted quickly.)
LIP is not entirely new: Intel tried this in 1998 when designing "Timna" (integrating the CPU and a graphics accelerator that share the L2).
Slide Source: Yuejian Xie (adapted)
BIP: Bimodal Insertion Policy [Qureshi et al. ISCA'07]
LIP may not age older lines, so BIP infrequently inserts lines at the MRU position.
Let ε = bimodal throttle parameter:
if (rand() < ε)
    insert at MRU position;   // as in LRU replacement
else
    insert at LRU position;   // as in LIP
Promote to MRU if reused.
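A minimal runnable C sketch of that insertion rule, assuming an MRU-first recency array (ε = 1/32 is an illustrative value, not prescribed here):

  #include <stdlib.h>

  #define EPSILON (1.0 / 32.0)   /* bimodal throttle parameter (illustrative) */

  /* BIP insertion (sketch): recency[] holds way indices, MRU first; the old
   * LRU entry (the victim) sits at recency[assoc-1] and is dropped. With
   * probability EPSILON the new block enters at MRU, otherwise at LRU. */
  static void bip_insert(int recency[], int assoc, int new_way) {
      int pos = ((double)rand() / RAND_MAX < EPSILON) ? 0 : assoc - 1;
      for (int i = assoc - 1; i > pos; i--)   /* shift to open the chosen slot */
          recency[i] = recency[i - 1];
      recency[pos] = new_way;
  }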
DIP: Dynamic Insertion Policy [Qureshi et al. ISCA'07]
(Figure: DIP selects between LRU and BIP; BIP itself behaves as LRU with probability ε and as LIP with probability 1 − ε.)
• Two types of workloads: LRU-friendly or BIP-friendly
• DIP can be implemented by: monitoring both policies (LRU and BIP), choosing the best-performing one, and applying it to the cache
• Needs a cost-effective implementation: "Set Dueling"
Set Dueling for DIP [Qureshi et al. ISCA'07]
Divide the cache in three:
• Dedicated LRU sets
• Dedicated BIP sets
• Follower sets (use the winner of LRU vs. BIP)
A single n-bit saturating counter monitors both policies: a miss to an LRU set increments it, a miss to a BIP set decrements it. The counter's MSB decides the policy for the follower sets:
• MSB = 0: use LRU
• MSB = 1: use BIP
(monitor → choose → apply, using a single counter)
Slide Source: Moin Qureshi
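A small C sketch of the selector, with a 10-bit counter and a hypothetical mapping of set indices to the dedicated groups:

  #include <stdbool.h>

  #define CTR_BITS 10
  #define CTR_MAX  ((1 << CTR_BITS) - 1)

  static unsigned psel = CTR_MAX / 2;   /* n-bit saturating counter */

  /* Bump the counter only on misses to the dedicated sets. The set-to-group
   * mapping here is a simple illustrative sampling pattern. */
  static void on_miss(int set_index) {
      int group = set_index % 64;
      if (group == 0 && psel < CTR_MAX) psel++;   /* dedicated LRU set */
      else if (group == 1 && psel > 0)  psel--;   /* dedicated BIP set */
  }

  static bool follower_uses_bip(void) {
      return (psel >> (CTR_BITS - 1)) & 1;   /* MSB = 1 -> use BIP */
  }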
PIPP [Xie & Loh ISCA'09]
• What's PIPP? Promotion/Insertion Pseudo Partitioning
• Achieves both capacity management (as in UCP) and dead-time management (as in DIP)
• Eviction: the LRU block is the victim
• Insertion: the new block is placed the core's quota worth of positions away from LRU (e.g., insert position = 3 when the target allocation is 3)
• Promotion: on a hit, move toward MRU by only one position (sketched below)
Slide Source: Yuejian Xie
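A hedged C sketch of the insertion and single-step promotion on an MRU-first recency array (the exact off-by-one convention for the insert position is an assumption):

  /* PIPP stack updates (sketch): recency[] holds way indices, MRU first.
   * On a miss, the LRU entry is the victim and the new block is inserted
   * 'quota' positions above LRU; on a hit, the block moves one step up. */
  static void pipp_insert(int recency[], int assoc, int new_way, int quota) {
      int pos = assoc - 1 - quota;            /* quota positions above LRU */
      if (pos < 0) pos = 0;
      for (int i = assoc - 1; i > pos; i--)   /* LRU victim falls off the end */
          recency[i] = recency[i - 1];
      recency[pos] = new_way;
  }

  static void pipp_promote(int recency[], int hit_pos) {
      if (hit_pos > 0) {                      /* single-step promotion */
          int tmp = recency[hit_pos - 1];
          recency[hit_pos - 1] = recency[hit_pos];
          recency[hit_pos] = tmp;
      }
  }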
PIPP Example (Core0 quota: 5 blocks, Core1 quota: 3 blocks; digits are Core0's blocks, letters are Core1's)
• Stack (MRU → LRU): 1 A 2 3 4 B 5 C. Core1 requests D (miss): evict C at LRU, insert D at Core1's quota position → 1 A 2 3 4 D B 5
• Core0 requests 6 (miss): evict 5, insert 6 at Core0's quota position → 1 A 2 6 3 4 D B
• Core0 requests 7 (miss): evict B, insert 7 at Core0's quota position → 1 A 2 7 6 3 4 D
• Core1 requests D again (hit): D is promoted one position toward MRU
Slide Source: Yuejian Xie
How PIPP Does Both Kinds of Management
(Figure: recency stack from MRU to LRU.)
Inserting closer to the LRU position limits how long a dead block lingers in the cache (dead-time management), while quota-based insert positions approximate capacity partitioning.
Slide Source: Yuejian Xie