Cache Coherence Techniques for Multicore Processors
Dissertation Defense
Mike Marty
12/19/2007
Key Contributions
• Trend: Multicore ring interconnects emerging
  Challenge: Order of ring != order of bus
  Contribution: New protocol exploits ring order
• Trend: Multicore now the basic building block
  Challenge: Hierarchical coherence for Multiple-CMP is complex
  Contribution: DirectoryCMP and TokenCMP
• Trend: Workload consolidation w/ space sharing
  Challenge: Physical hierarchies often do not match workloads
  Contribution: Virtual Hierarchies
Outline
• Introduction and Motivation: Multicore Trends
• Virtual Hierarchies (focus of presentation)
• Multiple-CMP Coherence
• Ring-based Coherence
• Conclusion
Is SMP + On-Chip Integration == Multicore?
[Figure: a bus-based multicore with four cores (P0-P3), private caches, a shared bus, and a memory controller]
Multicore Trends
Trend: On-chip Interconnect
• Competes for the same resources as cores and caches
• Ring is an emerging multicore interconnect
[Figure: multicore with cores P0-P3, private caches, a bus, and a memory controller]
Multicore Trends
Trend: Latency/bandwidth tradeoffs
• Increasing on-chip wire delay, memory latency
• Coherence protocol interacts with the shared-cache hierarchy
[Figure: multicore with private caches, shared cache banks, a bus, and a memory controller]
Multicore Trends
Trend: Multicore is the basic building block
• Multiple-CMP systems instead of SMPs
• Hierarchical systems required
[Figure: four multicore chips, each with cores P0-P3, private caches, a bus, and a memory controller, composed into a Multiple-CMP system]
Multicore Trends
Trend: Workload Consolidation w/ Space Sharing
• More cores, more workload consolidation
• Space sharing instead of time sharing
• Opportunities to optimize caching, coherence
[Figure: a multicore chip space-shared among three virtual machines (VM 1, VM 2, VM 3)]
Outline
• Introduction and Motivation
• Virtual Hierarchies (focus of presentation) [ISCA 2007, IEEE Micro Top Pick 2008]
• Multiple-CMP Coherence
• Ring-based Coherence
• Conclusion
Virtual Hierarchy Motivations
• Space-sharing
• Server (workload) consolidation
• Tiled architectures
[Figure: a tiled CMP space-shared among applications APP 1-APP 4]
Motivation: Server Consolidation
[Figure: a 64-core tiled CMP (each tile: core, L1, L2 cache bank) space-shared among consolidated workloads: a www server, two database servers, and two middleware servers]
Motivation: Server Consolidation (Optimize Performance)
[Figure: the same 64-core CMP, with each consolidated workload's data kept in the L2 banks of its own tiles]
Motivation: Server Consolidation (Isolate Performance)
[Figure: the same 64-core CMP, with each consolidated workload confined to its own set of tiles]
Motivation: Server Consolidation (Dynamic Partitioning)
[Figure: the same 64-core CMP, with the tile allocation of each consolidated workload changing over time]
Motivation: Server Consolidation (Inter-VM Sharing)
• VMware's content-based page sharing: up to 60% reduced memory
[Figure: the same 64-core CMP, with data shared between the www server and database server VMs]
Outline
• Introduction and Motivation
• Virtual Hierarchies
  - Expanded Motivation
  - Non-hierarchical approaches
  - Proposed Virtual Hierarchies
  - Evaluation
  - Related Work
• Ring-based and Multiple-CMP Coherence
• Conclusion
Tiled Architecture
• Global broadcast too expensive
[Figure: tiled CMP (each tile: core, L1, L2 cache bank) connected through memory controllers to the memory system]
TAG-DIRECTORY
[Figure: a getM for block A (1) goes to the central duplicate tag directory, which forwards the request (2) to the tile holding A; data is returned to the requestor (3)]
STATIC-BANK-DIRECTORY
[Figure: a getM for block A (1) goes to A's statically assigned home bank, which forwards the request (2) to the tile holding A; data is returned to the requestor (3)]
STATIC-BANK-DIRECTORY with hypervisor-managed cache
[Figure: the same getM example, but the hypervisor-managed mapping places block A's home bank within the requesting VM's own tiles: (1) getM A, (2) fwd, (3) data]
Goals (per approach: {STATIC-BANK, TAG}-DIRECTORY / STATIC-BANK-DIRECTORY w/ hypervisor-managed cache)
• Optimize Performance: No / Yes
• Isolate Performance: No / Yes
• Allow Dynamic Partitioning: Yes / ?
• Support Inter-VM Sharing: Yes / Yes
• Hypervisor/OS Simplicity: Yes / No
Outline
• Introduction and Motivation
• Virtual Hierarchies
  - Expanded Motivation
  - Non-hierarchical approaches
  - Proposed Virtual Hierarchies
  - Evaluation
  - Related Work
• Ring-based and Multiple-CMP Coherence
• Conclusion
Virtual Hierarchies
Key Idea: Overlay a 2-level Cache & Coherence Hierarchy
- First level harmonizes with the VM/workload
- Second level allows inter-VM sharing, migration, reconfig
VH: First-Level Protocol
Goals:
• Exploit locality from space affinity
• Isolate resources
Strategy: Directory protocol
• Interleave directories across first-level tiles
• Store the L2 block at the first-level directory tile
Questions:
• How to name directories?
• How to name sharers?
VH: Naming the First-Level Directory
Select a Dynamic Home Tile with a VM Config Table
• Hardware VM Config Table at each tile
• Set by the hypervisor during scheduling
Example: per-tile VM Config Table for a 3-tile VM {p12, p13, p14}; the address index bits (e.g., ...000101 just above the block offset) select entry 5, giving Home Tile p14
[Figure: a tile (core, L1, L2 cache bank) and a 64-entry VM Config Table mapping indices 0..63 round-robin to p12, p13, p14]
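A minimal sketch of the dynamic home tile lookup described above, with assumed table size and index-bit positions (the real mechanism is a hardware table written by the hypervisor): address index bits select a VM Config Table entry, and that entry names the home tile.

```cpp
// Minimal sketch (not the dissertation's hardware): selecting the dynamic home
// tile for a block via a per-tile VM Config Table set by the hypervisor.
// Table size (64 entries) and index-bit positions are illustrative assumptions.
#include <array>
#include <cstdint>
#include <iostream>

constexpr int kTableEntries = 64;   // assumed: one entry per tile of a 64-tile CMP
constexpr int kBlockOffsetBits = 6; // assumed: 64-byte cache blocks

struct VmConfigTable {
    std::array<uint8_t, kTableEntries> home_tile{};  // written by the hypervisor at schedule time

    // Address index bits (just above the block offset) select the entry.
    uint8_t dynamic_home(uint64_t paddr) const {
        uint64_t index = (paddr >> kBlockOffsetBits) % kTableEntries;
        return home_tile[index];
    }
};

int main() {
    // Hypervisor populates the table for a 3-tile VM {p12, p13, p14}, round-robin.
    VmConfigTable table;
    const uint8_t tiles[] = {12, 13, 14};
    for (int i = 0; i < kTableEntries; ++i) table.home_tile[i] = tiles[i % 3];

    // An address whose index bits are ...000101 (entry 5) maps to tile p14, as on the slide.
    uint64_t addr = 5ull << kBlockOffsetBits;
    std::cout << "home tile = p" << int(table.dynamic_home(addr)) << "\n";  // prints p14
}
```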
VH: Dynamic Home Tile Actions
The dynamic home tile either:
• Returns data cached at its L2 bank
• Generates forwards/invalidates
• Issues a second-level request
Stable first-level states (a subset):
• Typical: M, E, S, I
• Atypical:
  - ILX: L2 Invalid, points to the exclusive tile
  - SLS: L2 Shared, other tiles share
  - SLSX: L2 Shared, other tiles share, exclusive to the first level
VH: Naming First-Level Sharers
Any tile can share the block
Solution: full bit-vector
• 64 bits for a 64-tile system
• Names multiple sharers or a single exclusive tile
Alternatives:
• First-level broadcast
• (Dynamic) coarse granularity
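A minimal sketch of a first-level directory entry combining the slide's state names with a full 64-bit sharer vector; the encoding and helper methods are assumptions, not the dissertation's implementation.

```cpp
// Sketch of a first-level directory entry at the dynamic home tile (assumed
// encoding): full 64-bit sharer vector plus the slide's first-level states.
#include <cstdint>
#include <bitset>

enum class L1DirState : uint8_t {
    M, E, S, I,
    ILX,   // L2 invalid, entry points to the single exclusive tile
    SLS,   // L2 holds a shared copy, other tiles also share
    SLSX   // L2 shared, other tiles share, block exclusive to this first level
};

struct L1DirEntry {
    L1DirState state = L1DirState::I;
    std::bitset<64> sharers;        // one bit per tile: multiple sharers...
    uint8_t exclusive_tile = 0;     // ...or a single exclusive tile (ILX)

    void add_sharer(unsigned tile) { sharers.set(tile); }

    // On a getM at the home tile: invalidate every first-level sharer except
    // the requestor (the message send itself is stubbed out here).
    template <typename SendInv>
    void invalidate_sharers(unsigned requestor, SendInv send_inv) {
        for (unsigned t = 0; t < 64; ++t)
            if (sharers.test(t) && t != requestor) send_inv(t);
        sharers.reset();
        sharers.set(requestor);
    }
};
```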
Virtual Hierarchies
Two solutions for global coherence: VHA and VHB
[Figure: tiled CMP with the memory controller(s) anchoring the second level]
Protocol VHA
Directory as the second-level protocol
• Any tile can act as a first-level directory
• How to track and name first-level directories?
Full bit-vector of sharers to name any tile
• State stored in DRAM
• Possibly cached on-chip
+ Maximum scalability, message efficiency
- DRAM state (~12.5% overhead)
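The ~12.5% DRAM overhead follows from keeping a full sharer vector per memory block; a quick check of that arithmetic, assuming 64-byte blocks and a 64-bit vector (one bit per tile of the 64-tile target):

\[
\frac{64~\text{bits of sharer state}}{64~\text{B} \times 8~\text{bits/B}} = \frac{64}{512} = 12.5\%
\]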
VHA Example
[Figure: a getM for block A goes to the first-level dynamic home tile (1), which issues a second-level getM to the directory/memory controller (2); the request is forwarded through the owning VM's first-level directory to the owner (3, 4), and data is returned to the requestor (5, 6)]
VHA: Handling Races
Blocking Directories
• Handles races within the same protocol level
• Requires a blocking buffer + wakeup/replay logic
Inter-Intra Races
• Naïve blocking leads to deadlock!
[Figure: two first-level directories both blocked on block A while getM A and FWD A messages from the other level wait, forming a cyclic dependence]
VHA: Handling Races (cont.)
Possible solution:
• Always handle second-level messages at the first level
• But this causes an explosion of the state space
The second level may interrupt first-level actions:
• First-level indirections, invalidations, writebacks
[Figure: the same blocked-directory scenario, with second-level FWD A and getM A messages arriving while first-level requests for A are still in flight]
VHA: Handling Races (cont.)
Reduce the state-space explosion with Safe States:
• A subset of the transient states
• Immediately handle second-level messages
• Limit concurrency between the two protocols
Algorithm:
• Level-one requests either complete, or enter a safe state before issuing a level-two request
• Level-one directories handle level-two forwards once a safe state is reached (they may stall until then)
• Level-two requests are eventually handled by the level-two directory
• Completion messages unblock directories
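A minimal sketch of the safe-state gating described above, using assumed state names and a plain stall queue (the actual protocol is a full state-machine specification with many more transient states): a first-level directory handles a second-level forward immediately only when it is in a stable or safe state, otherwise the forward stalls and is replayed once a safe state is reached.

```cpp
// Sketch of safe-state gating at a first-level directory (assumed state names).
#include <deque>

enum class DirState {
    // Stable states: always safe to handle a second-level message.
    M, E, S, I,
    // Safe transient state: a level-two request has been issued and the
    // level-one side has quiesced, so a level-two forward can be handled now.
    SafeAwaitingL2Data,
    // Unsafe transient state: level-one forwards/invalidations still in flight.
    BusyCollectingL1Acks
};

struct L2Forward { /* forwarded second-level request for this block */ };

struct FirstLevelDirectory {
    DirState state = DirState::I;
    std::deque<L2Forward> stalled;   // level-two forwards waiting for a safe state

    static bool is_safe(DirState s) {
        return s == DirState::M || s == DirState::E || s == DirState::S ||
               s == DirState::I || s == DirState::SafeAwaitingL2Data;
    }

    // Called when a second-level forward arrives for this block.
    void on_l2_forward(const L2Forward& fwd) {
        if (is_safe(state)) {
            handle(fwd);             // handled immediately: bounded state space
        } else {
            stalled.push_back(fwd);  // stall; replay once a safe state is reached
        }
    }

    // Called on every level-one transition; replays stalled forwards when safe.
    void on_state_change(DirState next) {
        state = next;
        while (is_safe(state) && !stalled.empty()) {
            handle(stalled.front());
            stalled.pop_front();
        }
    }

    void handle(const L2Forward&) { /* perform the forward's actions */ }
};
```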
Virtual Hierarchies
Two solutions for global coherence: VHA and VHB
[Figure: tiled CMP with the memory controller(s) anchoring the second level]
Protocol VHB
Broadcast as the second-level protocol
• Locates first-level directory tiles
• Memory controller tracks the outstanding second-level requestor
Attach a token count to each block
• T tokens per block: one token to read, all T to write
• Allows 1 bit at memory per block
• Eliminates system-wide ACK responses
Protocol VHB: Token Coalescing
Memory logically holds all or none of a block's tokens:
• Enables a 1-bit token count at memory
A replacing tile sends its tokens to the memory controller:
• The message usually contains all tokens
Process (when it does not):
• Tokens are held in a Token Holding Buffer (THB)
• A FIND broadcast is initiated to locate the other first-level directories with tokens
• First-level directories respond to the THB, and their tokens are sent
• Repeat on a race
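A minimal sketch of the token bookkeeping at the memory controller, with assumed names and structure (the THB, the FIND broadcast, and the all-or-none rule come from the slides; everything else is illustrative): memory keeps a single has-all-tokens bit per block, and a writeback arriving with fewer than T tokens parks its tokens in the THB while a FIND broadcast gathers the rest.

```cpp
// Sketch of VHB-style token accounting at the memory controller (assumed
// structure): shows the all-or-none memory bit and THB coalescing.
#include <cstdint>
#include <unordered_map>

constexpr uint32_t kTokensPerBlock = 64;   // T: assumed one token per tile

struct MemoryController {
    // Per-block: does memory hold ALL tokens? (1 bit instead of a full count)
    std::unordered_map<uint64_t, bool> memory_has_all_tokens;
    // Token Holding Buffer: partial token counts being coalesced.
    std::unordered_map<uint64_t, uint32_t> thb;

    // A tile replaces its copy and ships its tokens to memory.
    void on_writeback(uint64_t block, uint32_t tokens) {
        if (tokens == kTokensPerBlock) {
            memory_has_all_tokens[block] = true;      // common case: all tokens arrive
            return;
        }
        uint32_t& held = thb[block];
        held += tokens;
        if (held == kTokensPerBlock) {                // coalescing finished
            memory_has_all_tokens[block] = true;
            thb.erase(block);
        } else {
            broadcast_find(block);  // ask other first-level directories for their tokens
        }
    }

    // Responses to the FIND broadcast also arrive via on_writeback().
    void broadcast_find(uint64_t /*block*/) { /* second-level FIND broadcast */ }
};
```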
VHB Example
[Figure: a getM for block A reaches the first-level dynamic home tile (1); a second-level request informs the memory controller and a global getM A is broadcast (2, 3); the first-level directory holding A forwards the request (4), and data plus tokens are returned to the requestor (5)]
Goals (per approach: {DRAM, STATIC-BANK, TAG}-DIRECTORY / STATIC-BANK-DIRECTORY w/ hypervisor-managed cache / Virtual Hierarchies: VHA and VHB)
• Optimize Performance: No / Yes / Yes
• Isolate Performance: No / Yes / Yes
• Allow Dynamic Partitioning: Yes / ? / Yes
• Support Inter-VM Sharing: Yes / Yes / Yes
• Hypervisor/OS Simplicity: Yes / No / Yes
VHNULL
Are two levels really necessary? VHNULL: first level only
Implications:
• Many OS modifications for a single-OS environment
• Dynamic partitioning requires cache flushes
• Inter-VM sharing is difficult
• Hypervisor complexity increases
• Requires atomic updates of the VM Config Tables
• Limits optimized placement policies
VH: Capacity/Latency Trade-off
Maximize capacity:
• Store only one L2 copy, at the dynamic home tile
• But L2 access time is penalized, especially for large VMs
Minimize L2 access latency/bandwidth:
• Replicate data in the local L2 slice
• Selective/adaptive replication is well studied: ASR [Beckmann et al.], CC [Chang et al.]
• But the dynamic home tile is still needed for the first level
Can we exploit the virtual hierarchy for placement?
VH: Data Placement Optimization Policy
Data from memory is placed in the requesting tile's local L2 bank
• No tag is allocated at the dynamic home tile
Use second-level coherence on the first sharing miss
• Then allocate a tag at the dynamic home tile for future sharing misses
Benefits:
• Private data allocates in the tile's local L2 bank
• Overhead of replicating data is reduced
• Fast first-level sharing for widely shared data
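A minimal sketch of the placement decision described above, with assumed names and a simplified miss path (the real policy lives in the L2/directory controllers): memory fills stay in the requester's local L2 bank with no home-tile tag, and the first sharing miss falls back to second-level coherence before the home tile starts tracking the block.

```cpp
// Sketch of the private-data placement policy (assumed names/structure).
#include <cstdint>
#include <unordered_set>

struct HomeTileDirectory {
    std::unordered_set<uint64_t> tags;   // blocks tracked at this dynamic home tile

    bool has_tag(uint64_t block) const { return tags.count(block) != 0; }

    // A request from another tile in the same VM arrives at the home tile.
    void on_first_level_request(uint64_t block) {
        if (has_tag(block)) {
            // Widely shared data: resolved quickly within the first level.
            resolve_within_first_level(block);
        } else {
            // First sharing miss for data that was placed privately:
            // fall back to second-level coherence, then start tracking it here.
            issue_second_level_request(block);
            tags.insert(block);
        }
    }

    void resolve_within_first_level(uint64_t) { /* forward/supply data locally */ }
    void issue_second_level_request(uint64_t) { /* VHA/VHB second-level request */ }
};

struct RequestingTile {
    // Memory-fill path: keep the line in the local L2 bank only (no home-tile
    // tag), so private data avoids home-tile indirection and replication cost.
    void on_memory_fill(uint64_t block) { allocate_in_local_l2(block); }
    void allocate_in_local_l2(uint64_t) { /* local L2 allocation */ }
};
```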
Outline
• Introduction and Motivation
• Virtual Hierarchies
  - Expanded Motivation
  - Non-hierarchical approaches
  - Proposed Virtual Hierarchies
  - Evaluation
  - Related Work
• Ring-based and Multiple-CMP Coherence
• Conclusion
VH Evaluation Methods
Wisconsin GEMS
Target system: 64-core tiled CMP
• In-order SPARC cores
• 1 MB, 16-way L2 cache per tile, 10-cycle access
• 2D mesh interconnect, 16-byte links, 5-cycle link latency
• Eight on-chip memory controllers, 275-cycle DRAM latency
VH Evaluation: Simulating Consolidation
Challenge: bring-up of consolidated workloads
Solution: approximate virtualization
• A script combines existing Simics checkpoints
[Figure: an 8p checkpoint (P0-P7, Memory0, PCI0, DISK0) replicated into a 64p checkpoint (P0-P63) with per-VM components, e.g., VM0_Memory0, VM0_PCI0, VM0_DISK0, VM1_Memory0, VM1_PCI0, VM1_DISK0, ...]
VH Evaluation: Simulating Consolidation
At simulation time, Ruby handles the mapping:
• Converts <Processor ID, 32-bit address> to a <36-bit address>
• Schedules VMs to adjacent cores by sending Simics requests to the appropriate L1 controllers
• Memory controllers evenly interleaved
Bottom line:
• Static scheduling
• No hypervisor execution simulated
• No content-based page sharing
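A minimal sketch of one way such a mapping can be formed, assuming 8 cores per VM and 4 GB of guest physical memory per VM (the exact scheme used by the simulation infrastructure is not given on the slide): the VM ID derived from the processor ID supplies the high address bits.

```cpp
// Sketch of the per-VM physical-address mapping at simulation time (assumed
// scheme: 8 cores per VM, 4 GB of guest physical memory per VM, so 4 bits of
// VM ID above a 32-bit guest address yield a 36-bit host address).
#include <cstdint>
#include <cassert>

constexpr unsigned kCoresPerVm = 8;

// <Processor ID, 32-bit guest physical address> -> 36-bit host physical address
uint64_t to_host_address(unsigned processor_id, uint32_t guest_paddr) {
    unsigned vm_id = processor_id / kCoresPerVm;   // VMs scheduled on adjacent cores
    return (static_cast<uint64_t>(vm_id) << 32) | guest_paddr;
}

int main() {
    // Core 13 belongs to VM 1; guest address 0x1000 lands in VM 1's 4 GB region.
    uint64_t host = to_host_address(13, 0x1000);
    assert(host == ((1ull << 32) | 0x1000));
    return 0;
}
```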
VH Evaluation: Workloads
OLTP, SpecJBB, Apache, Zeus
• Separate instance of Solaris for each VM
Homogeneous consolidation
• Simulate the same-size workload N times
• Unit of work identical across all workloads
• (each workload staggered by 1,000,000+ instructions)
Heterogeneous consolidation
• Simulate different-size, different workloads
• Report cycles-per-transaction for each workload
VH Evaluation: Baseline Protocols
DRAM-DIRECTORY:
• 1 MB directory cache per memory controller
• Each tile nominally private, but replication limited
TAG-DIRECTORY:
• 3-cycle central tag directory (1024 ways), non-pipelined
• Replication limited
STATIC-BANK-DIRECTORY:
• Home tiles interleave by frame address
• The home tile stores the only L2 copy
VH Evaluation: VHA and VHB Protocols
VHA:
• Based on the DirectoryCMP implementation
• The dynamic home tile stores the only L2 copy
VHB with optimizations:
• Private-data placement optimization policy (shared data is stored at the home tile, private data is not)
• Can violate inclusiveness (evict an L2 tag while sharers exist)
• Memory data returned directly to the requestor