Cache Coherence Techniques for Multicore Processors
Dissertation Defense
Mike Marty
12/19/2007
Key Contributions
• Trend: Multicore ring interconnects emerging
  Challenge: Order of ring != order of bus
  Contribution: New protocol exploits ring order
• Trend: Multicore now the basic building block
  Challenge: Hierarchical coherence for Multiple-CMP is complex
  Contribution: DirectoryCMP and TokenCMP
• Trend: Workload consolidation w/ space sharing
  Challenge: Physical hierarchies often do not match workloads
  Contribution: Virtual Hierarchies
Outline
• Introduction and Motivation: Multicore Trends
• Virtual Hierarchies (focus of presentation)
• Multiple-CMP Coherence
• Ring-based Coherence
• Conclusion
Is SMP + On-Chip Integration == Multicore?
[Figure: a bus-based multicore with four cores (P0-P3), private caches, a shared bus, and a memory controller]
Multicore Trends
Trend: On-chip Interconnect
• Competes for the same resources as cores and caches
• Ring is an emerging multicore interconnect
[Figure: multicore with cores P0-P3, private caches, a bus, and a memory controller]
Multicore Trends
Trend: Latency/bandwidth tradeoffs
• Increasing on-chip wire delay, memory latency
• Coherence protocol interacts with the shared-cache hierarchy
[Figure: multicore with private caches, shared cache banks, a bus, and a memory controller]
Multicore Trends
Trend: Multicore is the basic building block
• Multiple-CMP systems instead of SMPs
• Hierarchical systems required
[Figure: four multicore chips, each with cores P0-P3, private caches, a bus, and a memory controller, composed into a Multiple-CMP system]
Multicore Trends
Trend: Workload Consolidation w/ Space Sharing
• More cores, more workload consolidation
• Space sharing instead of time sharing
• Opportunities to optimize caching, coherence
[Figure: a multicore chip space-shared among three virtual machines (VM 1, VM 2, VM 3)]
Outline
• Introduction and Motivation
• Virtual Hierarchies (focus of presentation) [ISCA 2007, IEEE Micro Top Pick 2008]
• Multiple-CMP Coherence
• Ring-based Coherence
• Conclusion
Virtual Hierarchy Motivations
• Space-sharing
• Server (workload) consolidation
• Tiled architectures
[Figure: a tiled CMP space-shared among applications APP 1-APP 4]
Motivation: Server Consolidation
[Figure: a 64-core tiled CMP (each tile: core, L1, L2 cache bank) space-shared among consolidated workloads: a www server, two database servers, and two middleware servers]
Motivation: Server Consolidation (Optimize Performance)
[Figure: the same 64-core CMP, with each consolidated workload's data kept in the L2 banks of its own tiles]
Motivation: Server Consolidation (Isolate Performance)
[Figure: the same 64-core CMP, with each consolidated workload confined to its own set of tiles]
Motivation: Server Consolidation (Dynamic Partitioning)
[Figure: the same 64-core CMP, with the tile allocation of each consolidated workload changing over time]
Motivation: Server Consolidation (Inter-VM Sharing)
• VMware's content-based page sharing: up to 60% reduced memory
[Figure: the same 64-core CMP, with data shared between the www server and database server VMs]
Outline
• Introduction and Motivation
• Virtual Hierarchies
  - Expanded Motivation
  - Non-hierarchical approaches
  - Proposed Virtual Hierarchies
  - Evaluation
  - Related Work
• Ring-based and Multiple-CMP Coherence
• Conclusion
Tiled Architecture
• Global broadcast too expensive
[Figure: tiled CMP (each tile: core, L1, L2 cache bank) connected through memory controllers to the memory system]
TAG-DIRECTORY
[Figure: a getM for block A (1) goes to the central duplicate tag directory, which forwards the request (2) to the tile holding A; data is returned to the requestor (3)]
STATIC-BANK-DIRECTORY
[Figure: a getM for block A (1) goes to A's statically assigned home bank, which forwards the request (2) to the tile holding A; data is returned to the requestor (3)]
STATIC-BANK-DIRECTORY with hypervisor-managed cache
[Figure: the same getM example, but the hypervisor-managed mapping places block A's home bank within the requesting VM's own tiles: (1) getM A, (2) fwd, (3) data]
Goals (per approach: {STATIC-BANK, TAG}-DIRECTORY / STATIC-BANK-DIRECTORY w/ hypervisor-managed cache)
• Optimize Performance: No / Yes
• Isolate Performance: No / Yes
• Allow Dynamic Partitioning: Yes / ?
• Support Inter-VM Sharing: Yes / Yes
• Hypervisor/OS Simplicity: Yes / No
Outline
• Introduction and Motivation
• Virtual Hierarchies
  - Expanded Motivation
  - Non-hierarchical approaches
  - Proposed Virtual Hierarchies
  - Evaluation
  - Related Work
• Ring-based and Multiple-CMP Coherence
• Conclusion
Virtual Hierarchies
Key Idea: Overlay a 2-level Cache & Coherence Hierarchy
- First level harmonizes with the VM/workload
- Second level allows inter-VM sharing, migration, reconfig
VH: First-Level Protocol
Goals:
• Exploit locality from space affinity
• Isolate resources
Strategy: Directory protocol
• Interleave directories across first-level tiles
• Store the L2 block at the first-level directory tile
Questions:
• How to name directories?
• How to name sharers?
VH: Naming the First-Level Directory
Select a Dynamic Home Tile with a VM Config Table
• Hardware VM Config Table at each tile
• Set by the hypervisor during scheduling
Example: per-tile VM Config Table for a 3-tile VM {p12, p13, p14}; the address index bits (e.g., ...000101 just above the block offset) select entry 5, giving Home Tile p14
[Figure: a tile (core, L1, L2 cache bank) and a 64-entry VM Config Table mapping indices 0..63 round-robin to p12, p13, p14]
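A minimal sketch of the dynamic home tile lookup described above, with assumed table size and index-bit positions (the real mechanism is a hardware table written by the hypervisor): address index bits select a VM Config Table entry, and that entry names the home tile.

```cpp
// Minimal sketch (not the dissertation's hardware): selecting the dynamic home
// tile for a block via a per-tile VM Config Table set by the hypervisor.
// Table size (64 entries) and index-bit positions are illustrative assumptions.
#include <array>
#include <cstdint>
#include <iostream>

constexpr int kTableEntries = 64;   // assumed: one entry per tile of a 64-tile CMP
constexpr int kBlockOffsetBits = 6; // assumed: 64-byte cache blocks

struct VmConfigTable {
    std::array<uint8_t, kTableEntries> home_tile{};  // written by the hypervisor at schedule time

    // Address index bits (just above the block offset) select the entry.
    uint8_t dynamic_home(uint64_t paddr) const {
        uint64_t index = (paddr >> kBlockOffsetBits) % kTableEntries;
        return home_tile[index];
    }
};

int main() {
    // Hypervisor populates the table for a 3-tile VM {p12, p13, p14}, round-robin.
    VmConfigTable table;
    const uint8_t tiles[] = {12, 13, 14};
    for (int i = 0; i < kTableEntries; ++i) table.home_tile[i] = tiles[i % 3];

    // An address whose index bits are ...000101 (entry 5) maps to tile p14, as on the slide.
    uint64_t addr = 5ull << kBlockOffsetBits;
    std::cout << "home tile = p" << int(table.dynamic_home(addr)) << "\n";  // prints p14
}
```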
VH: Dynamic Home Tile Actions
The dynamic home tile either:
• Returns data cached at its L2 bank
• Generates forwards/invalidates
• Issues a second-level request
Stable first-level states (a subset):
• Typical: M, E, S, I
• Atypical:
  - ILX: L2 Invalid, points to the exclusive tile
  - SLS: L2 Shared, other tiles share
  - SLSX: L2 Shared, other tiles share, exclusive to the first level
VH: Naming First-Level Sharers
Any tile can share the block
Solution: full bit-vector
• 64 bits for a 64-tile system
• Names multiple sharers or a single exclusive tile
Alternatives:
• First-level broadcast
• (Dynamic) coarse granularity
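A minimal sketch of a first-level directory entry combining the slide's state names with a full 64-bit sharer vector; the encoding and helper methods are assumptions, not the dissertation's implementation.

```cpp
// Sketch of a first-level directory entry at the dynamic home tile (assumed
// encoding): full 64-bit sharer vector plus the slide's first-level states.
#include <cstdint>
#include <bitset>

enum class L1DirState : uint8_t {
    M, E, S, I,
    ILX,   // L2 invalid, entry points to the single exclusive tile
    SLS,   // L2 holds a shared copy, other tiles also share
    SLSX   // L2 shared, other tiles share, block exclusive to this first level
};

struct L1DirEntry {
    L1DirState state = L1DirState::I;
    std::bitset<64> sharers;        // one bit per tile: multiple sharers...
    uint8_t exclusive_tile = 0;     // ...or a single exclusive tile (ILX)

    void add_sharer(unsigned tile) { sharers.set(tile); }

    // On a getM at the home tile: invalidate every first-level sharer except
    // the requestor (the message send itself is stubbed out here).
    template <typename SendInv>
    void invalidate_sharers(unsigned requestor, SendInv send_inv) {
        for (unsigned t = 0; t < 64; ++t)
            if (sharers.test(t) && t != requestor) send_inv(t);
        sharers.reset();
        sharers.set(requestor);
    }
};
```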
Virtual Hierarchies
Two solutions for global coherence: VHA and VHB
[Figure: tiled CMP with the memory controller(s) anchoring the second level]
Protocol VHA
Directory as the second-level protocol
• Any tile can act as a first-level directory
• How to track and name first-level directories?
Full bit-vector of sharers to name any tile
• State stored in DRAM
• Possibly cached on-chip
+ Maximum scalability, message efficiency
- DRAM state (~12.5% overhead)
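The ~12.5% DRAM overhead follows from keeping a full sharer vector per memory block; a quick check of that arithmetic, assuming 64-byte blocks and a 64-bit vector (one bit per tile of the 64-tile target):

\[
\frac{64~\text{bits of sharer state}}{64~\text{B} \times 8~\text{bits/B}} = \frac{64}{512} = 12.5\%
\]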
VHA Example
[Figure: a getM for block A goes to the first-level dynamic home tile (1), which issues a second-level getM to the directory/memory controller (2); the request is forwarded through the owning VM's first-level directory to the owner (3, 4), and data is returned to the requestor (5, 6)]
VHA: Handling Races
Blocking Directories
• Handles races within the same protocol level
• Requires a blocking buffer + wakeup/replay logic
Inter-Intra Races
• Naïve blocking leads to deadlock!
[Figure: two first-level directories both blocked on block A while getM A and FWD A messages from the other level wait, forming a cyclic dependence]
VHA: Handling Races (cont.)
Possible solution:
• Always handle second-level messages at the first level
• But this causes an explosion of the state space
The second level may interrupt first-level actions:
• First-level indirections, invalidations, writebacks
[Figure: the same blocked-directory scenario, with second-level FWD A and getM A messages arriving while first-level requests for A are still in flight]
VHA: Handling Races (cont.)
Reduce the state-space explosion with Safe States:
• A subset of the transient states
• Immediately handle second-level messages
• Limit concurrency between the two protocols
Algorithm:
• Level-one requests either complete, or enter a safe state before issuing a level-two request
• Level-one directories handle level-two forwards once a safe state is reached (they may stall until then)
• Level-two requests are eventually handled by the level-two directory
• Completion messages unblock directories
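A minimal sketch of the safe-state gating described above, using assumed state names and a plain stall queue (the actual protocol is a full state-machine specification with many more transient states): a first-level directory handles a second-level forward immediately only when it is in a stable or safe state, otherwise the forward stalls and is replayed once a safe state is reached.

```cpp
// Sketch of safe-state gating at a first-level directory (assumed state names).
#include <deque>

enum class DirState {
    // Stable states: always safe to handle a second-level message.
    M, E, S, I,
    // Safe transient state: a level-two request has been issued and the
    // level-one side has quiesced, so a level-two forward can be handled now.
    SafeAwaitingL2Data,
    // Unsafe transient state: level-one forwards/invalidations still in flight.
    BusyCollectingL1Acks
};

struct L2Forward { /* forwarded second-level request for this block */ };

struct FirstLevelDirectory {
    DirState state = DirState::I;
    std::deque<L2Forward> stalled;   // level-two forwards waiting for a safe state

    static bool is_safe(DirState s) {
        return s == DirState::M || s == DirState::E || s == DirState::S ||
               s == DirState::I || s == DirState::SafeAwaitingL2Data;
    }

    // Called when a second-level forward arrives for this block.
    void on_l2_forward(const L2Forward& fwd) {
        if (is_safe(state)) {
            handle(fwd);             // handled immediately: bounded state space
        } else {
            stalled.push_back(fwd);  // stall; replay once a safe state is reached
        }
    }

    // Called on every level-one transition; replays stalled forwards when safe.
    void on_state_change(DirState next) {
        state = next;
        while (is_safe(state) && !stalled.empty()) {
            handle(stalled.front());
            stalled.pop_front();
        }
    }

    void handle(const L2Forward&) { /* perform the forward's actions */ }
};
```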
Virtual Hierarchies
Two solutions for global coherence: VHA and VHB
[Figure: tiled CMP with the memory controller(s) anchoring the second level]
Protocol VHB
Broadcast as the second-level protocol
• Locates first-level directory tiles
• Memory controller tracks the outstanding second-level requestor
Attach a token count to each block
• T tokens per block: one token to read, all T to write
• Allows 1 bit at memory per block
• Eliminates system-wide ACK responses
Protocol VHB: Token Coalescing
Memory logically holds all or none of a block's tokens:
• Enables a 1-bit token count at memory
A replacing tile sends its tokens to the memory controller:
• The message usually contains all tokens
Process (when it does not):
• Tokens are held in a Token Holding Buffer (THB)
• A FIND broadcast is initiated to locate the other first-level directories with tokens
• First-level directories respond to the THB, and their tokens are sent
• Repeat on a race
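A minimal sketch of the token bookkeeping at the memory controller, with assumed names and structure (the THB, the FIND broadcast, and the all-or-none rule come from the slides; everything else is illustrative): memory keeps a single has-all-tokens bit per block, and a writeback arriving with fewer than T tokens parks its tokens in the THB while a FIND broadcast gathers the rest.

```cpp
// Sketch of VHB-style token accounting at the memory controller (assumed
// structure): shows the all-or-none memory bit and THB coalescing.
#include <cstdint>
#include <unordered_map>

constexpr uint32_t kTokensPerBlock = 64;   // T: assumed one token per tile

struct MemoryController {
    // Per-block: does memory hold ALL tokens? (1 bit instead of a full count)
    std::unordered_map<uint64_t, bool> memory_has_all_tokens;
    // Token Holding Buffer: partial token counts being coalesced.
    std::unordered_map<uint64_t, uint32_t> thb;

    // A tile replaces its copy and ships its tokens to memory.
    void on_writeback(uint64_t block, uint32_t tokens) {
        if (tokens == kTokensPerBlock) {
            memory_has_all_tokens[block] = true;      // common case: all tokens arrive
            return;
        }
        uint32_t& held = thb[block];
        held += tokens;
        if (held == kTokensPerBlock) {                // coalescing finished
            memory_has_all_tokens[block] = true;
            thb.erase(block);
        } else {
            broadcast_find(block);  // ask other first-level directories for their tokens
        }
    }

    // Responses to the FIND broadcast also arrive via on_writeback().
    void broadcast_find(uint64_t /*block*/) { /* second-level FIND broadcast */ }
};
```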
VHB Example
[Figure: a getM for block A reaches the first-level dynamic home tile (1); a second-level request informs the memory controller and a global getM A is broadcast (2, 3); the first-level directory holding A forwards the request (4), and data plus tokens are returned to the requestor (5)]
Goals (per approach: {DRAM, STATIC-BANK, TAG}-DIRECTORY / STATIC-BANK-DIRECTORY w/ hypervisor-managed cache / Virtual Hierarchies: VHA and VHB)
• Optimize Performance: No / Yes / Yes
• Isolate Performance: No / Yes / Yes
• Allow Dynamic Partitioning: Yes / ? / Yes
• Support Inter-VM Sharing: Yes / Yes / Yes
• Hypervisor/OS Simplicity: Yes / No / Yes
VHNULL
Are two levels really necessary? VHNULL: first level only
Implications:
• Many OS modifications for a single-OS environment
• Dynamic partitioning requires cache flushes
• Inter-VM sharing is difficult
• Hypervisor complexity increases
• Requires atomic updates of the VM Config Tables
• Limits optimized placement policies
VH: Capacity/Latency Trade-off
Maximize capacity:
• Store only one L2 copy, at the dynamic home tile
• But L2 access time is penalized, especially for large VMs
Minimize L2 access latency/bandwidth:
• Replicate data in the local L2 slice
• Selective/adaptive replication is well studied: ASR [Beckmann et al.], CC [Chang et al.]
• But the dynamic home tile is still needed for the first level
Can we exploit the virtual hierarchy for placement?
VH: Data Placement Optimization Policy
Data from memory is placed in the requesting tile's local L2 bank
• No tag is allocated at the dynamic home tile
Use second-level coherence on the first sharing miss
• Then allocate a tag at the dynamic home tile for future sharing misses
Benefits:
• Private data allocates in the tile's local L2 bank
• Overhead of replicating data is reduced
• Fast first-level sharing for widely shared data
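A minimal sketch of the placement decision described above, with assumed names and a simplified miss path (the real policy lives in the L2/directory controllers): memory fills stay in the requester's local L2 bank with no home-tile tag, and the first sharing miss falls back to second-level coherence before the home tile starts tracking the block.

```cpp
// Sketch of the private-data placement policy (assumed names/structure).
#include <cstdint>
#include <unordered_set>

struct HomeTileDirectory {
    std::unordered_set<uint64_t> tags;   // blocks tracked at this dynamic home tile

    bool has_tag(uint64_t block) const { return tags.count(block) != 0; }

    // A request from another tile in the same VM arrives at the home tile.
    void on_first_level_request(uint64_t block) {
        if (has_tag(block)) {
            // Widely shared data: resolved quickly within the first level.
            resolve_within_first_level(block);
        } else {
            // First sharing miss for data that was placed privately:
            // fall back to second-level coherence, then start tracking it here.
            issue_second_level_request(block);
            tags.insert(block);
        }
    }

    void resolve_within_first_level(uint64_t) { /* forward/supply data locally */ }
    void issue_second_level_request(uint64_t) { /* VHA/VHB second-level request */ }
};

struct RequestingTile {
    // Memory-fill path: keep the line in the local L2 bank only (no home-tile
    // tag), so private data avoids home-tile indirection and replication cost.
    void on_memory_fill(uint64_t block) { allocate_in_local_l2(block); }
    void allocate_in_local_l2(uint64_t) { /* local L2 allocation */ }
};
```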
Outline
• Introduction and Motivation
• Virtual Hierarchies
  - Expanded Motivation
  - Non-hierarchical approaches
  - Proposed Virtual Hierarchies
  - Evaluation
  - Related Work
• Ring-based and Multiple-CMP Coherence
• Conclusion
VH Evaluation Methods
Wisconsin GEMS
Target system: 64-core tiled CMP
• In-order SPARC cores
• 1 MB, 16-way L2 cache per tile, 10-cycle access
• 2D mesh interconnect, 16-byte links, 5-cycle link latency
• Eight on-chip memory controllers, 275-cycle DRAM latency
VH Evaluation: Simulating Consolidation
Challenge: bring-up of consolidated workloads
Solution: approximate virtualization
• A script combines existing Simics checkpoints
[Figure: an 8p checkpoint (P0-P7, Memory0, PCI0, DISK0) replicated into a 64p checkpoint (P0-P63) with per-VM components, e.g., VM0_Memory0, VM0_PCI0, VM0_DISK0, VM1_Memory0, VM1_PCI0, VM1_DISK0, ...]
VH Evaluation: Simulating Consolidation
At simulation time, Ruby handles the mapping:
• Converts <Processor ID, 32-bit address> to a <36-bit address>
• Schedules VMs to adjacent cores by sending Simics requests to the appropriate L1 controllers
• Memory controllers evenly interleaved
Bottom line:
• Static scheduling
• No hypervisor execution simulated
• No content-based page sharing
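A minimal sketch of one way such a mapping can be formed, assuming 8 cores per VM and 4 GB of guest physical memory per VM (the exact scheme used by the simulation infrastructure is not given on the slide): the VM ID derived from the processor ID supplies the high address bits.

```cpp
// Sketch of the per-VM physical-address mapping at simulation time (assumed
// scheme: 8 cores per VM, 4 GB of guest physical memory per VM, so 4 bits of
// VM ID above a 32-bit guest address yield a 36-bit host address).
#include <cstdint>
#include <cassert>

constexpr unsigned kCoresPerVm = 8;

// <Processor ID, 32-bit guest physical address> -> 36-bit host physical address
uint64_t to_host_address(unsigned processor_id, uint32_t guest_paddr) {
    unsigned vm_id = processor_id / kCoresPerVm;   // VMs scheduled on adjacent cores
    return (static_cast<uint64_t>(vm_id) << 32) | guest_paddr;
}

int main() {
    // Core 13 belongs to VM 1; guest address 0x1000 lands in VM 1's 4 GB region.
    uint64_t host = to_host_address(13, 0x1000);
    assert(host == ((1ull << 32) | 0x1000));
    return 0;
}
```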
VH Evaluation: Workloads
OLTP, SpecJBB, Apache, Zeus
• Separate instance of Solaris for each VM
Homogeneous consolidation
• Simulate the same-size workload N times
• Unit of work identical across all workloads
• (each workload staggered by 1,000,000+ instructions)
Heterogeneous consolidation
• Simulate different-size, different workloads
• Report cycles-per-transaction for each workload
VH Evaluation: Baseline Protocols
DRAM-DIRECTORY:
• 1 MB directory cache per memory controller
• Each tile nominally private, but replication limited
TAG-DIRECTORY:
• 3-cycle central tag directory (1024 ways), non-pipelined
• Replication limited
STATIC-BANK-DIRECTORY:
• Home tiles interleave by frame address
• The home tile stores the only L2 copy
VH Evaluation: VHA and VHB Protocols
VHA:
• Based on the DirectoryCMP implementation
• The dynamic home tile stores the only L2 copy
VHB with optimizations:
• Private-data placement optimization policy (shared data is stored at the home tile, private data is not)
• Can violate inclusiveness (evict an L2 tag while sharers exist)
• Memory data returned directly to the requestor