Explore strategies to optimize wire delay in CMP caches for future technology, maximizing cache capacity and minimizing access latency. This dissertation examines CMP workload characterization, adaptive selective replication, transmission line caches, and the benefits of combining techniques.
Managing Wire Delay in CMP Caches Brad Beckmann Dissertation Defense Multifacet Project http://www.cs.wisc.edu/multifacet/ University of Wisconsin-Madison 8/15/06
Current CMP: AMD Athlon 64 X2 • 2 CPUs • 2 L2 cache banks [Figure: CPU 0 and CPU 1 with two L2 banks]
CMP Cache Trends • Today's technology (~90 nm) • Future technology (< 45 nm) [Figure: CMP floorplans with CPUs 0-7, per-CPU L1 I$ and L1 D$, and eight L2 banks]
Baseline: CMP-Shared • Maximize cache capacity • Slow access latency: 40+ cycles [Figure: CPUs 0-7, each with L1 I$ and L1 D$, around eight shared L2 banks; block A stored in one bank]
Baseline: CMP-Private • Fast access latency • Lower effective capacity • Thesis: both fast access & high capacity [Figure: CPUs 0-7, each with L1 I$, L1 D$, and a private L2; block A replicated in several private L2s]
Thesis Contributions • Characterizing CMP workloads: sharing types • Single requestor • Shared read-only • Shared read-write • Techniques to manage wire delay • Migration ← Previously discussed • Selective Replication ← Talk's focus • Transmission Lines ← Previously discussed • Combination outperforms isolated techniques
Outline • Introduction • Characterization of CMP working sets • L2 requests • L2 cache capacity • Sharing behavior • L2 request locality • ASR: Adaptive Selective Replication • Cache block migration • TLC: Transmission Line Caches • Combination of techniques
Characterizing CMP Working Sets • 8 processor CMP • 16 MB shared L2 cache • 64-byte block size • 64 KB L1 I&D caches • Profile L2 blocks during their on-chip lifetime • Three L2 block sharing types • Single requestor • All requests by a single processor • Shared read-only • Read-only requests by multiple processors • Shared read-write • Read and write requests by multiple processors • Workloads • Commercial: apache, jbb, oltp, zeus • Scientific: (SpecOMP) apsi & art, (Splash) barnes & ocean
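The three sharing types above can be sketched as a small classifier. This is only an illustration of the definitions on the slide: the function name and the `(cpu_id, is_write)` request encoding are assumptions, not part of the profiling infrastructure used in the dissertation.

```python
from typing import Iterable, Tuple


def classify_block(requests: Iterable[Tuple[int, bool]]) -> str:
    """Classify one L2 block from the requests seen during its on-chip lifetime.

    Each request is a hypothetical (cpu_id, is_write) pair.
    """
    cpus = set()
    written = False
    for cpu_id, is_write in requests:
        cpus.add(cpu_id)
        written = written or is_write
    if not cpus:
        raise ValueError("a profiled block needs at least one request")
    if len(cpus) == 1:
        # All requests came from a single processor, reads or writes.
        return "single requestor"
    # Multiple processors: split on whether any of them wrote the block.
    return "shared read-write" if written else "shared read-only"
```

For example, `classify_block([(0, False), (3, False)])` yields `"shared read-only"`, while adding any write from either processor moves the block to `"shared read-write"`.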
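Percent of L2 Cache Requests • Majority of commercial workload requests are for shared blocks [Figure: percent of L2 cache requests by request type]
Percent of L2 Cache Capacity • Majority of capacity is for single requestor blocks [Figure: percent of L2 cache capacity by sharing type]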
Costs of Replication • Decrease effective cache capacity • Storing replicas instead of unique blocks • Analyze average number of sharers • During on-chip lifetime • Increase store latency • Invalidate remote read-only copies • Run length [Eggers & Katz ISCA 88] • Average intervening remote reads between writes from the same processor + intervening reads between writes from different processors • For L2 requests
Sharing Behavior • Widely shared: all workloads • Few intervening requests: commercial workloads [Figure: requests breakdown]
Locality Graphs [Figure: example curves for high, intermediate, low, and no locality]
Request to Block Distribution: Single Requestor Blocks • Lower locality
Request to Block Distribution: Shared Read-Only Blocks • High locality [Figure: L2 cache MRU hit ratio]
Request to Block Distribution: Shared Read-Write Blocks • Intermediate locality
Workload Characterization: Summary • Commercial workloads • Significant shared read-only activity • Most requests: 42-71% • Little capacity without replication: 9-21% • Highly shared: 3.0-4.5 processors on average • High request locality: 3% of blocks account for 70% of requests • Shared read-only data is a great candidate for selective replication
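A locality figure such as "3% of blocks account for 70% of requests" can be computed from per-block request counts. The sketch below is a hypothetical helper, not the tool used for the dissertation, and the sample counts are made up for illustration.

```python
def request_share_of_hottest(requests_per_block, block_fraction):
    """Fraction of all requests that go to the hottest `block_fraction` of blocks."""
    counts = sorted(requests_per_block, reverse=True)
    total = sum(counts)
    if total == 0:
        return 0.0
    # Always consider at least one block, and never more than exist.
    k = min(len(counts), max(1, round(block_fraction * len(counts))))
    return sum(counts[:k]) / total


# Illustrative counts only: four blocks with 40, 30, 20, and 10 requests.
share = request_share_of_hottest([10, 40, 20, 30], 0.5)
```

Here the hottest half of the blocks (two of four) receives 70 of the 100 requests.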
Outline • Introduction • Characterization of CMP working sets • ASR: Adaptive Selective Replication • Replication effect on memory performance • SPR: Selective Probabilistic Replication • Monitoring and adapting to workload behavior • Evaluation • Cache block migration • TLC: Transmission Line Caches • Combination of techniques
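Replication and Memory Cycles • $\frac{\text{Memory cycles}}{\text{Instruction}} = \frac{(P_{\text{localL2}} \times L_{\text{localL2}}) + (P_{\text{remoteL2}} \times L_{\text{remoteL2}}) + (P_{\text{miss}} \times L_{\text{miss}})}{\text{Instructions}}$ • The sum of the three terms is the average cycles for L1 cache misses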
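The three-term sum on this slide (local L2 hits, remote L2 hits, and L2 misses, each weighted by its probability) can be evaluated directly. The sketch below is a worked example only: the function name is an assumption, and the probabilities and latencies are invented for illustration rather than taken from the dissertation.

```python
def average_l1_miss_cycles(p_local, l_local, p_remote, l_remote, p_miss, l_miss):
    """Average cycles per L1 miss: each outcome's probability times its latency."""
    if abs(p_local + p_remote + p_miss - 1.0) > 1e-9:
        raise ValueError("outcome probabilities must sum to 1")
    return p_local * l_local + p_remote * l_remote + p_miss * l_miss


# Hypothetical numbers: 60% local hits at 20 cycles, 30% remote hits at
# 50 cycles, 10% off-chip misses at 400 cycles.
cycles = average_l1_miss_cycles(0.6, 20, 0.3, 50, 0.1, 400)
```

With these made-up inputs the result is 12 + 15 + 40 = 67 cycles. Replication shifts probability from the remote term to the local term, but the capacity it consumes can raise the miss probability, which is the trade-off the following slides plot.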
Replication Benefit: L2 Hit Cycles [Figure: L2 hit cycles vs. replication capacity]
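Replication and Memory Cycles • $\frac{\text{Memory cycles}}{\text{Instruction}} = \frac{(P_{\text{localL2}} \times L_{\text{localL2}}) + (P_{\text{remoteL2}} \times L_{\text{remoteL2}}) + (P_{\text{miss}} \times L_{\text{miss}})}{\text{Instructions}}$ • The sum of the three terms is the average cycles for L1 cache misses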
Replication Cost: L2 Miss Cycles [Figure: L2 miss cycles vs. replication capacity]
Replication Effectiveness: Total Cycles [Figure: total cycle curve, total cycles vs. replication capacity, with the optimal point marked]
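Reading the optimal point off the total-cycle curve amounts to summing the hit-cycle and miss-cycle curves and taking the minimum. A minimal sketch, assuming each curve is sampled at a small number of discrete replication levels; the sample values are invented to give the curves the shapes described (hit cycles falling, miss cycles rising) and are not measured data.

```python
def best_replication_level(hit_cycles, miss_cycles):
    """Return (index of the lowest total, list of totals) for sampled curves."""
    if len(hit_cycles) != len(miss_cycles):
        raise ValueError("curves must be sampled at the same levels")
    totals = [h + m for h, m in zip(hit_cycles, miss_cycles)]
    best = min(range(len(totals)), key=totals.__getitem__)
    return best, totals


# Hypothetical curves over six replication levels.
hit = [50, 44, 40, 37, 35, 34]    # benefit: L2 hit cycles fall with replication
miss = [20, 21, 23, 27, 33, 41]   # cost: L2 miss cycles rise with replication
level, totals = best_replication_level(hit, miss)
```

For these illustrative curves the totals are 70, 65, 63, 64, 68, 75, so the minimum falls at index 2, neither the least nor the most replication.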
Outline • Wires and CMP caches • Characterization of CMP working sets • ASR: Adaptive Selective Replication • Replication effect on memory performance • SPR: Selective Probabilistic Replication • Monitoring and adapting to workload behavior • Evaluation • Cache block migration • TLC: Transmission Line Caches • Combination of techniques
Identifying and Replicating Shared Read-only • Minimal coherence impact • Per cache block identification • Heuristic - not perfect • Dirty bit • Indicates written data • Leverage current bandwidth reduction optimization • Shared bit • Indicates multiple sharers • Set for blocks with multiple requestors
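The dirty-bit and shared-bit heuristic can be sketched as per-block state. The slide does not say how multiple requestors are detected, so remembering the first requestor is an assumption made here for illustration, as are all the names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BlockState:
    dirty: bool = False                    # indicates written data
    shared: bool = False                   # indicates multiple sharers
    first_requestor: Optional[int] = None  # assumption: how sharing is detected


def on_request(state: BlockState, cpu_id: int, is_write: bool) -> None:
    """Update the two heuristic bits for one request to the block."""
    if state.first_requestor is None:
        state.first_requestor = cpu_id
    elif cpu_id != state.first_requestor:
        state.shared = True
    if is_write:
        state.dirty = True


def is_replication_candidate(state: BlockState) -> bool:
    """Shared read-only by the heuristic: shared bit set, dirty bit clear."""
    return state.shared and not state.dirty
```

As the slide notes, this is a heuristic and not perfect: in this sketch a block that was written once stays marked dirty even if it is only read afterwards.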
SPR: Selective Probabilistic Replication • Mechanism for Selective Replication • Control duplication between L2 caches in CMP-Private • Relax L2 inclusion property • L2 evictions do not force L1 evictions • Non-exclusive cache hierarchy • Ring Writebacks • L1 Writebacks passed clockwise between private L2 caches • Merge with other existing L2 copies • Probabilistically choose between • Local writeback: allows replication • Ring writeback: disallows replication
SPR: Selective Probabilistic Replication [Figure: CPUs 0-7, each with L1 I$, L1 D$, and a private L2]
SPR: Selective Probabilistic Replication • Replication levels 0-5 [Figure: replication capacity vs. replication level, with the current level marked; real workloads]
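The probabilistic choice between the two writeback paths can be sketched as a single biased coin flip. The function name and the idea of passing the probability in directly are assumptions; the slides do not give the probabilities attached to each replication level.

```python
import random


def choose_writeback(replication_probability: float, rng: random.Random) -> str:
    """Pick the writeback path for an L1 writeback of a shared read-only block.

    "local": write back into the local private L2 (allows a replica).
    "ring":  pass the block clockwise to another private L2 (no new replica).
    """
    if not 0.0 <= replication_probability <= 1.0:
        raise ValueError("probability must be in [0, 1]")
    return "local" if rng.random() < replication_probability else "ring"


# Hypothetical probability of 0.25; a seeded generator keeps the run repeatable.
rng = random.Random(0)
choices = [choose_writeback(0.25, rng) for _ in range(10_000)]
local_fraction = choices.count("local") / len(choices)
```

A probability of 0 never replicates and a probability of 1 always does, so a single parameter spans the range between no replication and full replication.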
Outline • Introduction • Characterization of CMP working sets • ASR: Adaptive Selective Replication • Replication effect on memory performance • SPR: Selective Probabilistic Replication • Implementing ASR • Evaluation • Cache block migration • TLC: Transmission Line Caches • Combination of techniques
Implementing ASR • Four mechanisms estimate deltas • Decrease-in-replication Benefit • Increase-in-replication Benefit • Decrease-in-replication Cost • Increase-in-replication Cost • Triggering a cost-benefit analysis
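One way the four estimated deltas could feed a cost-benefit decision is sketched below. The slides name the four estimates but do not spell out the comparison, so the rule used here (move in whichever direction has the larger positive net gain, otherwise stay) is an assumption, and all names are hypothetical.

```python
def replication_direction(decrease_benefit_loss: float,
                          increase_benefit_gain: float,
                          decrease_cost_saving: float,
                          increase_cost_added: float) -> int:
    """Return +1 to raise the replication level, -1 to lower it, 0 to hold.

    All four arguments are cycle estimates for the same sampling interval.
    """
    # Net gain of going up: extra local-hit benefit minus extra miss cost.
    net_up = increase_benefit_gain - increase_cost_added
    # Net gain of going down: miss cost saved minus local-hit benefit lost.
    net_down = decrease_cost_saving - decrease_benefit_loss
    if max(net_up, net_down) <= 0:
        return 0
    return 1 if net_up >= net_down else -1
```

For instance, if raising the level would save 500 hit cycles but add 200 miss cycles, while lowering it would save 100 miss cycles but lose 300 hit cycles, the sketch returns +1.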
ASR: Decrease-in-replication Benefit [Figure: L2 hit cycles vs. replication capacity, marking the current level and the next lower level]
ASR: Decrease-in-replication Benefit • Goal • Determine replication benefit decrease of the next lower level • Mechanism • Current Replica Bit • Per L2 cache block • Set for replications of the current level • Not set for replications of lower level • Current replica hits would be remote hits with next lower level • Overhead • 1-bit x 256 K L2 blocks = 32 KB
ASR: Increase-in-replication Benefit [Figure: L2 hit cycles vs. replication capacity, marking the current level and the next higher level]
ASR: Increase-in-replication Benefit • Goal • Determine replication benefit increase of the next higher level • Mechanism • Next Level Hit Buffers (NLHBs) • 8-bit partial tag buffer • Store replicas of the next higher level • NLHB hits would be local L2 hits with next higher level • Overhead • 8-bits x 16 K entries x 8 processors = 128 KB
ASR: Decrease-in-replication Cost [Figure: L2 miss cycles vs. replication capacity, marking the current level and the next lower level]
ASR: Decrease-in-replication Cost • Goal • Determine replication cost decrease of the next lower level • Mechanism • Victim Tag Buffers (VTBs) • 16-bit partial tags • Store recently evicted blocks of current replication level • VTB hits would be on-chip hits with next lower level • Overhead • 16-bits x 1 K entry x 8 processors = 16 KB
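A victim tag buffer of this kind can be sketched as a bounded buffer of partial tags. The slide gives the entry count and tag width; the replacement order (oldest entry first) and the choice of the low address bits as the partial tag are assumptions made for this illustration.

```python
from collections import OrderedDict


class VictimTagBuffer:
    """Remembers partial tags of recently evicted blocks (sketch only)."""

    def __init__(self, entries: int = 1024, tag_bits: int = 16):
        self.entries = entries
        self.mask = (1 << tag_bits) - 1   # assumption: tag = low address bits
        self.tags = OrderedDict()         # insertion order = eviction order
        self.hits = 0

    def _partial_tag(self, block_address: int) -> int:
        return block_address & self.mask

    def record_eviction(self, block_address: int) -> None:
        tag = self._partial_tag(block_address)
        self.tags.pop(tag, None)          # refresh position if already present
        self.tags[tag] = True
        if len(self.tags) > self.entries:
            self.tags.popitem(last=False)  # drop the oldest entry

    def lookup_on_miss(self, block_address: int) -> bool:
        """On an L2 miss, count a hit if the block was recently evicted."""
        hit = self._partial_tag(block_address) in self.tags
        self.hits += hit
        return hit
```

Because only partial tags are stored, two different addresses can alias to the same entry, so the hit count is an estimate rather than an exact count.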
ASR: Increase-in-replication Cost [Figure: L2 miss cycles vs. replication capacity, marking the current level and the next higher level]
ASR: Increase-in-replication Cost • Goal • Determine replication cost increase of the next higher level • Mechanism • Way and Set counters [Suh et al. HPCA 2002] • Identify soon-to-be-evicted blocks • 16-way pseudo LRU • 256 set groups • On-chip hits that would be off-chip with next higher level • Overhead • 255-bit pseudo LRU tree x 8 processors = 255 B • Overall storage overhead: 212 KB or 1.2% of total storage
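The per-mechanism overhead figures on the ASR slides can be checked with a few lines of arithmetic. The variable names are hypothetical; the sizes are the ones stated on the slides (16 MB L2, 64-byte blocks, 8 processors).

```python
BITS_PER_KB = 8 * 1024
PROCESSORS = 8

# 16 MB of L2 with 64-byte blocks gives the 256 K blocks used below.
l2_blocks = (16 * 1024 * 1024) // 64

current_replica_kb = (1 * l2_blocks) / BITS_PER_KB             # 1 bit per block
nlhb_kb = (8 * 16 * 1024 * PROCESSORS) / BITS_PER_KB           # 8-bit tags, 16 K entries
vtb_kb = (16 * 1 * 1024 * PROCESSORS) / BITS_PER_KB            # 16-bit tags, 1 K entries
plru_bytes = (255 * PROCESSORS) / 8                            # 255-bit tree per processor

itemized_kb = current_replica_kb + nlhb_kb + vtb_kb + plru_bytes / 1024
```

This reproduces 32 KB, 128 KB, 16 KB, and 255 B. Those four items add up to roughly 176 KB, which is less than the 212 KB overall figure, so the overall figure presumably includes storage that the four overhead bullets do not itemize; that reading is an inference, not something the slides state.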
ASR: Triggering a Cost-Benefit Analysis • Goal • Dynamically adapt to workload behavior • Avoid unnecessary replication level changes • Mechanism • Evaluation trigger • Local replications or NLHB allocations exceed 1K • Replication change • Four consecutive evaluations in the same direction
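The trigger and the four-in-a-row rule can be sketched as a small controller. Several details are assumptions: the two event counts are kept separately and either one exceeding 1K triggers an evaluation, an evaluation that recommends no change resets the streak, and the level range of 0 to 5 is taken from the replication-level figure. The direction passed to `evaluate` is assumed to come from the cost-benefit analysis.

```python
class ReplicationController:
    """Hysteresis around replication-level changes (illustrative sketch)."""

    def __init__(self, level=2, max_level=5, event_threshold=1024, required_streak=4):
        self.level = level
        self.max_level = max_level
        self.event_threshold = event_threshold
        self.required_streak = required_streak
        self.local_replications = 0
        self.nlhb_allocations = 0
        self.last_direction = 0
        self.streak = 0

    def evaluation_due(self) -> bool:
        """True once either event count exceeds the threshold."""
        return (self.local_replications > self.event_threshold
                or self.nlhb_allocations > self.event_threshold)

    def evaluate(self, direction: int) -> int:
        """Feed one cost-benefit result (+1, -1, or 0); return the level."""
        self.local_replications = 0
        self.nlhb_allocations = 0
        if direction == 0:
            self.last_direction, self.streak = 0, 0
        elif direction == self.last_direction:
            self.streak += 1
        else:
            self.last_direction, self.streak = direction, 1
        if self.streak >= self.required_streak:
            # Four consecutive evaluations agree: move one level and start over.
            self.level = min(self.max_level, max(0, self.level + direction))
            self.last_direction, self.streak = 0, 0
        return self.level
```

Requiring consecutive agreement means a single noisy evaluation cannot flip the level, which matches the stated goal of avoiding unnecessary replication level changes.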
Outline • Introduction • Characterization of CMP working sets • ASR: Adaptive Selective Replication • Replication effect on memory performance • SPR: Selective Probabilistic Replication • Implementing ASR • Evaluation • Cache block migration • TLC: Transmission Line Caches • Combination of techniques
Methodology • Full system simulation • Simics • Wisconsin's GEMS Timing Simulator • Out-of-order processor • Memory system • Workloads • Commercial • apache, jbb, oltp, zeus • Scientific • Not shown here, in dissertation
System Parameters [ 8 core CMP, 45 nm technology ]
Replication Benefit, Cost, & Effectiveness Curves [Figure: benefit and cost curves]
Replication Benefit, Cost, & Effectiveness Curves [Figure: effectiveness curve; 4 MB, 150 memory latency]
ASR: Adapting to Workload Behavior Oltp: All CPUs
ASR: Adapting to Workload Behavior Apache: All CPUs
ASR: Adapting to Workload Behavior Apache: CPU 0
ASR: Adapting to Workload Behavior Apache: CPUs 1-7
Comparison of Replication Policies • SPR: multiple possible policies • Evaluated 4 shared read-only replication policies • VR: Victim Replication • Previously proposed [Zhang ISCA 05] • Disallow replicas to evict shared owner blocks • NR: CMP-NuRapid • Previously proposed [Chishti ISCA 05] • Replicate upon the second request • CC: Cooperative Caching • Previously proposed [Chang ISCA 06] • Replace replicas first • Spill singlets to remote caches • Tunable parameter: 100%, 70%, 30%, 0% • [Callout on the previous proposals: lack dynamic adaptation] • ASR: Adaptive Selective Replication • My proposal • Monitor and adjust to workload demand