Locality-Aware Data Replication in the Last-Level Cache
George Kurian1, Srinivas Devadas1, Omer Khan2
1 Massachusetts Institute of Technology, 2 University of Connecticut, Storrs
The Problem
• Future multicore processors will have 100s of cores
• LLC management is key to optimizing performance and energy
• Last-level cache (LLC) data locality and off-chip miss rates often show opposing trends
• On an N-core mesh, an access to a remote LLC slice travels on average: # Network Hops = ⅔ × √N
• Goal: Intelligent replication at the LLC
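For example, at N = 64 cores, a remote LLC slice is ⅔ × √64 ≈ 5.3 network hops away on average per access; replication aims to service high-reuse accesses at a nearby slice instead.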
LLC Replication Strategy
• Black block shows benefit with replication
  • E.g., frequently-read shared data
  • Core-1 and Core-2 are allowed to create replicas
• Red block shows NO benefit with replication
  • E.g., frequently-written shared data
[Diagram: tiled multicore; each tile contains a compute pipeline, private L1-I/L1-D caches, an LLC slice with integrated directory, and a router. Replicas of the black block are created at the LLC slices of Core-1 and Core-2; the red block is serviced only at its home LLC slice.]
Outline • Motivation • Comparison to Previous Schemes • Design & Implementation • Evaluation • Conclusion
Motivation: Reuse at the LLC
• Reuse: # accesses to a cache line by a core before eviction or a conflicting access by another core
• Note: private L1 cache hits are filtered out
[Diagram: Core-3 makes 5 accesses to a line at the home LLC slice before a conflicting write by Core-4, so Reuse = 5.]
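As a hedged illustration (not the paper's hardware), the reuse statistic plotted in the following figures could be computed from an L1-filtered LLC access trace along these lines; the names and trace format are assumptions:

```cpp
#include <cstdint>
#include <map>
#include <unordered_map>
#include <vector>

// One LLC access (private L1 hits assumed already filtered out).
struct Access { uint64_t line; int core; bool isWrite; };

// Builds a histogram: reuse value -> number of measurement intervals.
// An interval for (line, core) ends when another core writes the line;
// intervals still open at the end of the trace are flushed last.
std::map<uint32_t, uint64_t> reuseHistogram(const std::vector<Access>& trace) {
    std::unordered_map<uint64_t, std::map<int, uint32_t>> counts;
    std::map<uint32_t, uint64_t> hist;

    for (const Access& a : trace) {
        auto& perCore = counts[a.line];
        perCore[a.core]++;                   // this core's reuse grows
        if (a.isWrite)                       // conflicting write ends
            for (auto& [core, c] : perCore)  // everyone else's interval
                if (core != a.core && c) { hist[c]++; c = 0; }
    }
    for (auto& [line, perCore] : counts)     // flush remaining intervals
        for (auto& [core, c] : perCore)
            if (c) hist[c]++;
    return hist;
}
```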
Motivation: Reuse Determines Replication Benefit
• Reuse: # accesses to a cache line by a core before eviction or a conflicting access by another core
• The higher the reuse, the higher the efficacy of replication
[Fig: LLC Access Count vs Reuse]
Motivation (cont’d): Reuse Determines Replication Benefit
• Reuse: # accesses to a cache line by a core before eviction or a conflicting access by another core
• The higher the reuse, the higher the efficacy of replication
[Fig: LLC Access Count vs Reuse, annotated "Don't Replicate" for low-reuse lines and "Replicate" for high-reuse lines]
Motivation (cont’d): Reuse Independent of Cache Line Type
• Private data exhibits varying degrees of reuse
[Fig: LLC Access Count vs Reuse (bins: 1-2, 3-9, ≥10); panel: Private]
Motivation (cont’d): Reuse Independent of Cache Line Type
• Instructions mostly exhibit high reuse
[Fig: LLC Access Count vs Reuse (bins: 1-2, 3-9, ≥10); panels: Private, Instruction]
Motivation (cont’d): Reuse Independent of Cache Line Type
• Shared read-only data exhibits varying degrees of reuse
[Fig: LLC Access Count vs Reuse (bins: 1-2, 3-9, ≥10); panels: Private, Instruction, Shared Read-Only]
Motivation (cont’d): Reuse Independent of Cache Line Type
• Shared read-write data exhibits varying degrees of reuse
[Fig: LLC Access Count vs Reuse (bins: 1-2, 3-9, ≥10); panels: Private, Instruction, Shared Read-Only, Shared Read-Write]
Motivation (cont’d): Reuse Independent of Cache Line Type
• Replication must be based on reuse, not on cache-line classification
• Replicate based on reuse:
  • Instructions
  • Shared read-only data
  • Shared read-write data
  • (even) Private data
[Fig: LLC Access Count vs Reuse (bins: 1-2, 3-9, ≥10); panels: Private, Instruction, Shared Read-Only, Shared Read-Write]
Locality-Aware Replication: Salient Features
• Locality-based: decisions based on reuse, not on memory classification information
  • Replicate data with high reuse
  • Bypass replication mechanisms for low-reuse data
• Cache-line level: reuse measured and replication decision made at cache-line granularity
• Dynamic: reuse profiled at runtime using highly accurate, low-overhead hardware counters
• Minimal coherence protocol changes: replication is done at the local LLC slice
• Fully hardware: the LLC replication techniques require no modification to the operating system
Outline • Motivation • Comparison to Previous Schemes • Design & Implementation • Evaluation • Conclusion
Baseline System
• Compute pipeline with private L1-I and L1-D caches
• Logically shared, physically distributed L2 cache (LLC) with integrated directory
• LLC managed using Reactive-NUCA [Hardavellas, ISCA 2009]
  • Local placement of private pages; shared pages are striped across all slices
• ACKwise limited-directory protocol [Kurian, PACT 2010]
[Diagram: a tile containing the compute pipeline, L1 D-Cache, L1 I-Cache, L2 Cache (LLC slice) with directory, and router.]
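As context for the replication scheme, here is a minimal sketch of Reactive-NUCA-style home-slice selection as summarized on this slide; the function and constant names are our assumptions, and real R-NUCA also handles instruction clustering and OS-assisted page classification that this sketch omits:

```cpp
#include <cstdint>

enum class PageClass { Private, Shared };

constexpr int kNumSlices = 64;  // one LLC slice per core (assumed)
constexpr int kLineBits  = 6;   // 64-byte cache lines (assumed)

// Private pages map to the requester's local slice; shared pages are
// address-interleaved (striped) across all slices.
int homeSlice(uint64_t addr, PageClass cls, int requesterSlice) {
    if (cls == PageClass::Private)
        return requesterSlice;                                  // local
    return static_cast<int>((addr >> kLineBits) % kNumSlices);  // striped
}
```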
Locality Tracking Intelligence: Replica Reuse Counter
• Replica Reuse: tracks cache-line usage by a core at the LLC replica
• Replica reuse counter is communicated back to the directory on eviction or invalidation for classification
  • NO additional network messages
• Storage overhead: 1 KB (0.4%)
[Tag entry: Tag | State | LRU | ACKwise Pointers (1…p) | Replica Reuse | Complete Locality List (1…n) = {Mode_i, Home Reuse_i}]
Locality Tracking Intelligence: Mode & Home Reuse Counters
• Mode_i: can the cache line be replicated at Core_i?
• Home Reuse_i: tracks cache-line usage by Core_i at the home LLC slice
• Complete locality classifier: tracks locality information for all cores and for all LLC cache lines
• Storage overhead: 96 KB (30%)
  • We'll fix this later
[Tag entry: Tag | State | LRU | ACKwise Pointers (1…p) | Replica Reuse | Complete Locality List (1…n) = {Mode_i, Home Reuse_i}]
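Put together, the tag-entry fields on these two slides suggest per-line directory state roughly like the following C++ sketch; the field widths, the number of ACKwise pointers, and all names are assumptions, since the slides fix only the field list:

```cpp
#include <array>
#include <cstdint>

constexpr int kNumCores    = 64;  // 'n' in the slides
constexpr int kAckwisePtrs = 4;   // 'p' limited sharer pointers (assumed)

enum class Mode : uint8_t { NoReplica, Replica };

// Per-core locality info: may Core_i replicate, and its home-reuse count.
struct LocalityEntry {
    Mode    mode      = Mode::NoReplica;
    uint8_t homeReuse = 0;
};

// One LLC tag entry: Tag | State | LRU | ACKwise Pointers (1..p) |
// Replica Reuse | Complete Locality List (1..n).
struct HomeTagEntry {
    uint64_t tag;
    uint8_t  state;  // coherence state
    uint8_t  lru;    // replacement metadata
    std::array<uint16_t, kAckwisePtrs> ackwisePtr;  // limited sharer list
    uint8_t  replicaReuse;  // reuse reported back from a replica slice
    std::array<LocalityEntry, kNumCores> locality;  // ~30% overhead
};
```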
Mode Transitions: Replication Intelligence
• Replication decision is made based on previous cache-line reuse behavior
• Initially, no replica is created
• All requests are serviced at the LLC home
[State diagram: Initial → No-Replica]
Mode Transitions
• Home Reuse counter: tracks the # accesses by a core at the LLC home location
[State diagram: Initial → No-Replica]
Mode Transitions
• A replica is created if enough reuse is detected at the LLC home
• If (Home Reuse >= Replication Threshold), promote to "Replica" mode and create a replica
• A higher Replication Threshold yields fewer replicas; a lower one yields more
[State diagram: No-Replica → (Home Reuse >= RT) → Replica; RT = Replication Threshold]
Mode Transitions
• Replica Reuse counter: tracks the # accesses to the LLC at the replica location
[State diagram: Initial → No-Replica → (Home Reuse >= RT) → Replica; RT = Replication Threshold]
Mode Transitions
• Eviction from LLC replica location
  • Triggered by capacity limitations
• If (Replica Reuse >= Replication Threshold), stay in "Replica" mode; else demote to "No-Replica" mode
[State diagram: Replica → (Replica Reuse >= RT) → Replica (self-loop); Replica → (Replica Reuse < RT) → No-Replica; RT = Replication Threshold]
Mode Transitions
• Invalidation at LLC replica location
  • Triggered by a conflicting write
• If ((Replica + Home) Reuse >= Replication Threshold), stay in "Replica" mode; else demote to "No-Replica" mode
[State diagram: Replica → ((Replica + Home) Reuse >= RT) → Replica (self-loop); Replica → ((Replica + Home) Reuse < RT) → No-Replica; RT = Replication Threshold]
Mode Transitions
• Conflicting write from another core: reset the Home Reuse counter to 0
[State diagram: No-Replica → (Home Reuse >= RT) → Replica; No-Replica → (Home Reuse < RT) → No-Replica (self-loop); Replica → (XReuse >= RT) → Replica (self-loop); Replica → (XReuse < RT) → No-Replica; RT = Replication Threshold]
Mode Transitions: Summary
• Replication decision is made based on previous cache-line reuse behavior
[State diagram: Initial → No-Replica; No-Replica → (Home Reuse >= RT) → Replica; No-Replica → (Home Reuse < RT) → No-Replica (self-loop); Replica → (XReuse >= RT) → Replica (self-loop); Replica → (XReuse < RT) → No-Replica. RT = Replication Threshold; XReuse = Replica Reuse on eviction, (Replica + Home) Reuse on invalidation.]
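The transitions above amount to a small per-(cache line, core) state machine. Below is a hedged C++ sketch of that machine; the event names, the counter resets at interval boundaries, and RT = 3 are our assumptions, since the slides fix only the transition conditions:

```cpp
#include <cstdint>

constexpr uint32_t kRT = 3;  // Replication Threshold (RT-3, chosen later)

enum class Mode : uint8_t { NoReplica, Replica };

struct CoreLineState {
    Mode     mode         = Mode::NoReplica;
    uint32_t homeReuse    = 0;  // accesses serviced at the home LLC slice
    uint32_t replicaReuse = 0;  // accesses serviced at the local replica
};

// Access serviced at the home slice; enough home reuse earns a replica.
void onHomeAccess(CoreLineState& s) {
    s.homeReuse++;
    if (s.mode == Mode::NoReplica && s.homeReuse >= kRT)
        s.mode = Mode::Replica;  // next fill creates a local replica
}

// Access serviced at the local replica.
void onReplicaAccess(CoreLineState& s) { s.replicaReuse++; }

// Replica evicted for capacity: keep Replica mode only if it earned it.
void onReplicaEviction(CoreLineState& s) {
    if (s.replicaReuse < kRT) s.mode = Mode::NoReplica;
    s.homeReuse = s.replicaReuse = 0;  // new measurement interval (assumed)
}

// Replica invalidated by a conflicting write: judge on combined reuse.
void onReplicaInvalidation(CoreLineState& s) {
    if (s.replicaReuse + s.homeReuse < kRT) s.mode = Mode::NoReplica;
    s.homeReuse = s.replicaReuse = 0;
}

// Conflicting write seen at home while in No-Replica mode: reset progress.
void onConflictingWrite(CoreLineState& s) { s.homeReuse = 0; }
```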
Locality Tracking Intelligence: Limited-k Locality Classifier
• Complete locality classifier: prohibitive storage overhead (30%)
• Limited locality classifier (k): Mode and Home Reuse information tracked for only k cores
• Modes of other cores obtained by majority voting
• Smaller k → lower overhead
• Inactive cores are replaced in the locality list based on access pattern, to accommodate new sharers
[Tag entry: Tag | State | LRU | ACKwise Pointers (1…p) | Replica Reuse | Limited Locality List (1…k) = {Core ID_i, Mode_i, Home Reuse_i}]
Limited-3 Locality Classifier
• Mode and Home Reuse tracked for 3 sharers
• The Limited-3 classifier approximates the performance & energy of the Complete classifier
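A minimal sketch of the majority vote just described, with names of our own choosing: a core present in the limited locality list gets its tracked mode, and any untracked core inherits the majority mode of the tracked cores.

```cpp
#include <array>
#include <cstdint>

enum class Mode : uint8_t { NoReplica, Replica };

struct TrackedCore {
    int      coreId    = -1;  // -1 marks an unused slot
    Mode     mode      = Mode::NoReplica;
    uint32_t homeReuse = 0;
};

template <int K>
Mode classify(const std::array<TrackedCore, K>& list, int coreId) {
    int votes = 0, replicaVotes = 0;
    for (const TrackedCore& t : list) {
        if (t.coreId == coreId) return t.mode;  // tracked: exact answer
        if (t.coreId != -1) {
            ++votes;
            if (t.mode == Mode::Replica) ++replicaVotes;
        }
    }
    // Untracked core: majority vote over the tracked cores.
    return (2 * replicaVotes > votes) ? Mode::Replica : Mode::NoReplica;
}
```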
Outline • Motivation • Comparison to Previous Schemes • Design & Implementation • Evaluation • Conclusion
Evaluation Methodology
• Evaluations done using:
  • Graphite simulator for 64 cores
  • McPAT/CACTI cache energy models and DSENT network energy models at 11 nm
• Evaluated 21 benchmarks from the SPLASH-2 (11), PARSEC (8), Parallel MI-Bench (1), and UHPC (1) suites
• LLC management schemes compared:
  • Static-NUCA (S-NUCA)
  • Reactive-NUCA (R-NUCA)
  • Victim Replication (VR)
  • Adaptive Selective Replication (ASR) [modified]
  • Locality-Aware Replication (RT-1, RT-3, RT-8)
Replicate Shared Read-Write Data: LLC Accesses (BARNES)
• Most LLC accesses are reads to widely-shared, high-reuse shared read-write data
• Important to replicate shared read-write data
[Fig: LLC Access Count vs Reuse (bins: 1-2, 3-9, ≥10); panels: Private, Instruction, Shared Read-Only, Shared Read-Write]
Replicate Shared Read-Write Data: Energy Results (BARNES)
• Locality-aware protocol reduces network router & link energy by replicating shared read-write data locally
• Victim Replication (VR) obtains limited energy benefits
  • (Almost) blind replica-creation scheme
  • Simplistic LLC replacement policy, removing and re-inserting replicas on L1 misses & evictions
• Adaptive Selective Replication (ASR) and Reactive-NUCA do not replicate shared read-write data
Replicate Shared Read-Write Data: Completion Time Results (BARNES)
• Locality-aware protocol reduces communication time with the LLC home (L1-to-LLC-Home)
Replicate Private Cache Lines: Page vs Cache-Line Classification (BLACKSCHOLES)
• Page-level classification incurs false positives
  • Multiple cores work privately on different cache lines in the same page
  • Page is classified shared read-only instead of private
• Page-level data placement is not optimal: Reactive-NUCA cannot localize most LLC accesses
• Replicate private data to localize all LLC accesses
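A toy C++ illustration of this false positive, with assumed page and line sizes (nothing here is from the paper): two cores privately touch different lines that happen to share a page, so a page-granularity classifier sees two sharers while line-granularity classification correctly sees private data.

```cpp
#include <cstdint>
#include <map>
#include <set>

constexpr uint64_t kPageBits = 12;  // 4 KB pages (assumed)
constexpr uint64_t kLineBits = 6;   // 64 B cache lines (assumed)

std::map<uint64_t, std::set<int>> pageSharers, lineSharers;

// Record which cores touched each page and each cache line.
void access(uint64_t addr, int core) {
    pageSharers[addr >> kPageBits].insert(core);
    lineSharers[addr >> kLineBits].insert(core);
}

int main() {
    access(0x1000, /*core=*/0);  // line 0x40, page 0x1
    access(0x1040, /*core=*/1);  // line 0x41, same page 0x1
    // pageSharers[0x1].size() == 2  -> page looks shared (false positive)
    // each lineSharers entry has size 1 -> the lines are actually private
    return 0;
}
```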
Replicate Private Cache Lines: Energy Results (BLACKSCHOLES)
• Locality-aware protocol reduces network energy through replication of private cache lines
• ASR replicates only shared read-only cache lines
• VR obtains limited improvements in energy; still restricted by its replication mechanisms
Replicate All Classes of Cache Lines: LLC Accesses (BODYTRACK)
• Most LLC accesses are reads to widely-shared, high-reuse instructions, shared read-only and shared read-write data
• The best replication policy should optimize handling of all three classes of cache lines
[Fig: LLC Access Count vs Reuse (bins: 1-2, 3-9, ≥10); panels: Private, Instruction, Shared Read-Only, Shared Read-Write]
Replicate All Classes of Cache Lines: Energy Results (BODYTRACK)
• R-NUCA replicates instructions, hence obtains small network-energy improvements
• ASR replicates instructions and shared read-only data, obtaining larger energy improvements
• The locality-aware protocol replicates shared read-write data as well
Use Optimal Replication Threshold: Energy Results (STREAMCLUSTER)
• Perform intelligent replication:
  • RT-1 performs badly due to LLC pollution
  • RT-8 identifies fewer replicas, and is slow to identify useful ones
  • RT-3 identifies more replicas, and identifies them faster, without creating LLC pollution
• Use the optimal replication threshold of 3
Results Summary
• We choose a static replication threshold (RT) of 3
• Energy improved by 13-21%
• Completion time improved by 4-13%
[Figs: Energy; Completion Time]
Conclusion
• Locality-aware instruction and data replication in the last-level cache (LLC)
• Spatio-temporal locality profiled dynamically at the cache-line level using low-overhead yet highly accurate hardware counters
  • Enables replication only for lines with high reuse
• Requires minimal changes to the baseline cache coherence protocol, since replicas are placed locally
• Significant energy and performance improvements over state-of-the-art replication mechanisms
See the Paper For…
• Exhaustive benchmark case studies
  • Apps with migratory shared data
  • Apps with NO benefit from replication
• Limited locality classifier study
  • Sensitivity to the number of tracked cores (k)
• Cluster-level locality-aware LLC replication study
  • Sensitivity to cluster size (C)
Locality-Aware Data Replication in the Last-Level Cache
George Kurian1, Srinivas Devadas1, Omer Khan2
1 Massachusetts Institute of Technology, 2 University of Connecticut, Storrs