
Cache Coherence Techniques for Multicore Processors


Presentation Transcript


  1. Cache Coherence Techniques for Multicore Processors
      Dissertation Defense, Mike Marty, 12/19/2007

  2. Key Contributions
      Trend: Multicore ring interconnects emerging. Challenge: order of ring != order of bus. Contribution: new protocol exploits ring order.
      Trend: Multicore now the basic building block. Challenge: hierarchical coherence for Multiple-CMP is complex. Contribution: DirectoryCMP and TokenCMP.
      Trend: Workload consolidation w/ space sharing. Challenge: physical hierarchies often do not match workloads. Contribution: Virtual Hierarchies.

  3. Outline
      Introduction and Motivation
        Multicore Trends
      Virtual Hierarchies (focus of presentation)
      Multiple-CMP Coherence
      Ring-based Coherence
      Conclusion

  4. Is SMP + On-chip Integration == Multicore?
      [Diagram: a multicore chip with cores P0-P3, private caches, a bus, and a memory controller]

  5. Multicore Trends
      Trend: On-chip interconnect
      • Competes for the same resources as cores and caches
      • The ring is an emerging multicore interconnect
      [Diagram: cores P0-P3 with private caches connected by an on-chip bus to a memory controller]

  6. Multicore Trends
      Trend: Latency/bandwidth tradeoffs
      • Increasing on-chip wire delay and memory latency
      • The coherence protocol interacts with the shared-cache hierarchy
      [Diagram: cores P0-P3 with private caches plus shared cache banks, a bus, and a memory controller]

  7. Multicore Trends
      Trend: Multicore is the basic building block
      • Multiple-CMP systems instead of SMPs
      • Hierarchical systems required
      [Diagram: four multicore chips, each with cores P0-P3, private caches, a bus, and a memory controller]

  8. Multicore Trends
      Trend: Workload consolidation w/ space sharing
      • More cores, more workload consolidation
      • Space sharing instead of time sharing
      • Opportunities to optimize caching and coherence
      [Diagram: a multicore chip space-shared among VM 1, VM 2, and VM 3]

  9. Outline
      Introduction and Motivation
      Virtual Hierarchies (focus of presentation) [ISCA 2007, IEEE Micro Top Pick 2008]
      Multiple-CMP Coherence
      Ring-based Coherence
      Conclusion

  10. Virtual Hierarchy Motivations
      • Space-sharing
      • Server (workload) consolidation
      • Tiled architectures
      [Diagram: a tiled chip space-shared among several applications, APP 1 through APP 4]

  11. Motivation: Server Consolidation
      [Diagram: a 64-core tiled CMP (each tile: core, L1, L2 cache bank) hosting a www server, two database servers, and two middleware servers]

  12. Motivation: Server Consolidation
      [Diagram: the consolidated servers (www, database, middleware) each assigned to a region of the 64-core CMP]

  13. Motivation: Server Consolidation
      Optimize Performance
      [Diagram: each consolidated server's data cached within its own region of the 64-core CMP]

  14. Motivation: Server Consolidation
      Isolate Performance
      [Diagram: consolidated servers confined to separate regions of the 64-core CMP]

  15. Motivation: Server Consolidation
      Dynamic Partitioning
      [Diagram: VM-to-tile assignments repartitioned on the 64-core CMP]

  16. Motivation: Server Consolidation
      Inter-VM Sharing
      • VMware's content-based page sharing: up to 60% reduced memory
      [Diagram: data shared between consolidated servers on the 64-core CMP]

  17. Outline
      Introduction and Motivation
      Virtual Hierarchies
        Expanded Motivation
        Non-hierarchical approaches
        Proposed Virtual Hierarchies
        Evaluation
        Related Work
      Ring-based and Multiple-CMP Coherence
      Conclusion

  18. Tiled Architecture
      • Global broadcast is too expensive
      [Diagram: a tiled CMP; each tile contains a core, L1, and an L2 cache bank; memory controllers connect the tiles to the memory system]

  19. TAG-DIRECTORY
      [Protocol example: (1) a getM A request goes to the duplicate tag directory, (2) the request is forwarded to the tile caching A, (3) data is returned to the requester]

  20. STATIC-BANK-DIRECTORY
      [Protocol example: (1) getM A goes to A's statically assigned home bank, (2) the request is forwarded to the tile caching A, (3) data is returned to the requester]

  21. STATIC-BANK-DIRECTORY with hypervisor-managed cache
      [Protocol example: (1) getM A goes to a home bank selected by the hypervisor-managed cache mapping, (2) the request is forwarded, (3) data is returned]

  22. Goals
      Goal                         | {STATIC-BANK, TAG}-DIRECTORY | STATIC-BANK-DIRECTORY w/ hypervisor-managed cache
      Optimize Performance         | No                           | Yes
      Isolate Performance          | No                           | Yes
      Allow Dynamic Partitioning   | Yes                          | ?
      Support Inter-VM Sharing     | Yes                          | Yes
      Hypervisor/OS Simplicity     | Yes                          | No

  23. Outline
      Introduction and Motivation
      Virtual Hierarchies
        Expanded Motivation
        Non-hierarchical approaches
        Proposed Virtual Hierarchies
        Evaluation
        Related Work
      Ring-based and Multiple-CMP Coherence
      Conclusion

  24. Virtual Hierarchies
      Key idea: overlay a 2-level cache & coherence hierarchy
      - First level harmonizes with the VM/workload
      - Second level allows inter-VM sharing, migration, and reconfiguration

  25. VH: First-Level Protocol
      Goals:
      • Exploit locality from space affinity
      • Isolate resources
      Strategy: directory protocol
      • Interleave directories across first-level tiles
      • Store the L2 block at the first-level directory tile
      Questions:
      • How to name directories?
      • How to name sharers?
      [Diagram: getM and INV messages within a first-level region]

  26. VH: Naming the First-level Directory
      Select the dynamic home tile with a VM Config Table
      • Hardware VM Config Table at each tile
      • Set by the hypervisor during scheduling
      [Example: a 64-entry per-tile VM Config Table maps address index bits to tiles p12, p13, p14 (entry 0 -> p12, 1 -> p13, 2 -> p14, ...); for an address with index bits ...000101, the dynamic home tile is p14]
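      A rough illustration of the lookup described on slide 26. This is a C++ sketch with invented names, not the dissertation's hardware; it assumes 64-byte blocks and a 64-entry table indexed by the address bits just above the block offset.

```cpp
#include <array>
#include <cstdint>

// Sketch: per-tile VM Config Table selecting the dynamic home tile.
// The hypervisor fills the table when it schedules a VM, using only
// tiles that belong to that VM's partition.
struct VMConfigTable {
    std::array<uint8_t, 64> homeTile{};   // entry i -> tile ID acting as first-level directory

    // Assumes 64-byte blocks: skip the 6 offset bits, take the next 6 bits as the index.
    uint8_t dynamicHomeTile(uint64_t paddr) const {
        const uint64_t index = (paddr >> 6) & 0x3F;
        return homeTile[index];
    }
};
```

      A table populated only with tiles p12, p13, and p14, as in the slide's example, keeps every first-level request inside that VM's partition.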

  27. VH: Dynamic Home Tile Actions
      The dynamic home tile either:
      • Returns data cached at its L2 bank
      • Generates forwards/invalidates
      • Issues a second-level request
      Stable first-level states (a subset):
      • Typical: M, E, S, I
      • Atypical:
        ILX: L2 Invalid, points to the exclusive tile
        SLS: L2 Shared, other tiles share
        SLSX: L2 Shared, other tiles share, exclusive to the first level
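      A minimal sketch of how the first-level states listed above might be encoded; the comments restate the slide's definitions and the enum name is illustrative.

```cpp
// Stable first-level directory states (subset named on slide 27).
enum class FirstLevelState {
    M,     // home tile's L2 holds the block, modified
    E,     // exclusive, clean
    S,     // shared, L2 copy present
    I,     // invalid at this first-level directory
    ILX,   // L2 Invalid, directory points to the tile holding the exclusive copy
    SLS,   // L2 Shared, other tiles in the region also share
    SLSX   // L2 Shared, other tiles share, exclusive with respect to the second level
};
```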

  28. VH: Naming First-level Sharers
      Any tile can share the block
      Solution: full bit-vector
      • 64 bits for a 64-tile system
      • Names multiple sharers or a single exclusive tile
      Alternatives:
      • First-level broadcast
      • (Dynamic) coarse granularity
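      A small sketch of the full bit-vector sharer list for a 64-tile system, assuming one bit per tile; the helper names are invented.

```cpp
#include <cstdint>

// Full bit-vector sharer list: bit i set means tile i holds a copy.
// A single set bit doubles as the name of the one exclusive tile.
struct SharerVector {
    uint64_t bits = 0;

    void addSharer(unsigned tile)        { bits |= (1ULL << tile); }
    bool isSharer(unsigned tile) const   { return (bits >> tile) & 1ULL; }
    bool singleExclusive() const         { return bits != 0 && (bits & (bits - 1)) == 0; }

    // Invalidation fan-out: invoke sendInv(tile) for every sharer.
    template <typename SendInv>
    void invalidateAll(SendInv sendInv) const {
        for (unsigned t = 0; t < 64; ++t)
            if (isSharer(t)) sendInv(t);
    }
};
```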

  29. Virtual Hierarchies
      Two solutions for global (second-level) coherence: VHA and VHB
      [Diagram: first-level regions backed by the chip's memory controller(s)]

  30. Protocol VHA
      Directory as the second-level protocol
      • Any tile can act as a first-level directory
      • How to track and name first-level directories?
      Full bit-vector of sharers to name any tile
      • State stored in DRAM
      • Possibly cached on-chip
      + Maximum scalability, message efficiency
      - DRAM state (~12.5% overhead, e.g. 64 bits of sharer state per 64-byte block)

  31. VHA Example
      [Protocol example: a getM A misses at the requester's dynamic home tile and is escalated to the second-level directory at the memory controller, which forwards it to the first-level directory of the region holding A; that region forwards to the owning tile and data is returned to the requester (steps 1-6)]

  32. VHA: Handling Races
      Blocking directories
      • Handle races within the same protocol level
      • Require a blocking buffer plus wakeup/replay logic
      Inter-/intra-level races
      • Naive blocking leads to deadlock!
      [Diagram: two first-level directories blocked on address A while getM A and FWD A messages wait on each other]

  33. VHA: Handling Races (cont.)
      Possible solution:
      • Always handle the second-level message at the first level
      • But this causes an explosion of the state space
      The second level may interrupt first-level actions:
      • First-level indirections, invalidations, writebacks
      [Diagram: blocked first-level directories with pending getM A and FWD A messages]

  34. VHA: Handling Races (cont.)
      Reduce the state-space explosion w/ safe states:
      • A subset of the transient states
      • Immediately handle second-level messages
      • Limit concurrency between the protocols
      Algorithm:
      • Level-one requests either complete, or enter a safe state before issuing a level-two request
      • Level-one directories handle level-two forwards once a safe state is reached (they may stall)
      • Level-two requests are eventually handled by the level-two directory
      • Completion messages unblock the directories
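      The sketch below illustrates the safe-state rule under stated assumptions (the state names, message struct, and stall queue are invented): a first-level directory consumes a second-level forward only when the entry is in a stable or safe transient state; otherwise the forward stalls and is replayed once a safe state is reached.

```cpp
#include <cstdint>
#include <queue>

enum class L1DirState {
    I, S, M, ILX, SLS, SLSX,   // stable states
    BusyLocal,                 // unsafe transient: mid first-level action
    SafeWaitL2                 // safe transient: level-two request already issued
};

struct L2Forward { uint64_t addr; unsigned requester; };

struct FirstLevelDirEntry {
    L1DirState state = L1DirState::I;
    std::queue<L2Forward> stalledForwards;   // replayed when a safe state is reached

    bool safeForSecondLevel() const { return state != L1DirState::BusyLocal; }
};

void deliverL2Forward(FirstLevelDirEntry& entry, const L2Forward& fwd) {
    if (entry.safeForSecondLevel()) {
        // Process immediately: invalidate local sharers, send data/ack,
        // and eventually send a completion message to unblock directories.
    } else {
        entry.stalledForwards.push(fwd);     // naive blocking would deadlock; stalling
                                             // only in unsafe states bounds the wait
    }
}
```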

  35. Virtual Hierarchies
      Two solutions for global (second-level) coherence: VHA and VHB
      [Diagram: first-level regions backed by the chip's memory controller(s)]

  36. Protocol VHB
      Broadcast as the second-level protocol
      • Locates first-level directory tiles
      • The memory controller tracks the outstanding second-level requestor
      Attach a token count to each block
      • T tokens per block: one token to read, all tokens to write
      • Allows 1 bit at memory per block
      • Eliminates system-wide ACK responses
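      A sketch of the token rule stated on the slide, with an illustrative token count of one per tile: at least one token is needed to read, all T tokens to write, and memory keeps only a single bit of token state per block.

```cpp
#include <cstdint>

constexpr unsigned kTokensPerBlock = 64;   // illustrative T, e.g. one token per tile

// Token state held by a tile or first-level directory for one block.
struct TokenCount {
    unsigned held = 0;

    bool mayRead()  const { return held >= 1; }                // one token to read
    bool mayWrite() const { return held == kTokensPerBlock; }  // all tokens to write
};

// Memory-side state: a single bit per block, since memory logically
// holds either all of a block's tokens or none of them.
struct MemoryTokenBit {
    bool holdsAllTokens = true;
};
```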

  37. Protocol VHB: Token Coalescing
      Memory logically holds all or none of a block's tokens:
      • Enables a 1-bit token count
      A replacing tile sends its tokens to the memory controller:
      • The message usually contains all tokens
      Process:
      • Tokens are held in a Token Holding Buffer (THB)
      • A FIND broadcast locates other first-level directories holding tokens
      • First-level directories respond to the THB; tokens are sent
      • Repeat on a race
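      A rough sketch of the coalescing loop described above; the buffer layout and callback names are assumptions rather than the dissertation's implementation.

```cpp
#include <cstdint>
#include <functional>

constexpr unsigned kTokensPerBlock = 64;   // illustrative T

// One Token Holding Buffer (THB) entry at the memory controller.
struct THBEntry {
    uint64_t addr = 0;
    unsigned tokensGathered = 0;
};

// Invoked when a writeback or a FIND response delivers tokens for the block.
void onTokensArrive(THBEntry& entry, unsigned incoming,
                    const std::function<void(uint64_t)>& broadcastFind,
                    const std::function<void(uint64_t)>& setMemoryTokenBit) {
    entry.tokensGathered += incoming;
    if (entry.tokensGathered == kTokensPerBlock) {
        setMemoryTokenBit(entry.addr);     // memory again holds all tokens: 1 bit suffices
    } else {
        broadcastFind(entry.addr);         // ask first-level directories for the remaining
                                           // tokens; repeated if a racing request intervenes
    }
}
```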

  38. VHB Example
      [Protocol example: a getM A misses at the dynamic home tile, triggering a global getM A broadcast via the memory controller, which tracks the outstanding requestor; the first-level directory holding A forwards the request, and data plus tokens are returned to the requester (steps 1-5)]

  39. Goals
      Goal                         | {DRAM, STATIC-BANK, TAG}-DIRECTORY | STATIC-BANK-DIRECTORY w/ hypervisor-managed cache | Virtual Hierarchies: VHA and VHB
      Optimize Performance         | No                                 | Yes                                               | Yes
      Isolate Performance          | No                                 | Yes                                               | Yes
      Allow Dynamic Partitioning   | Yes                                | ?                                                 | Yes
      Support Inter-VM Sharing     | Yes                                | Yes                                               | Yes
      Hypervisor/OS Simplicity     | Yes                                | No                                                | Yes

  40. VHNULL
      Are two levels really necessary? VHNULL: first level only
      Implications:
      • Many OS modifications for a single-OS environment
      • Dynamic partitioning requires cache flushes
      • Inter-VM sharing is difficult
      • Hypervisor complexity increases
      • Requires atomic updates of the VM Config Tables
      • Limits optimized placement policies

  41. VH: Capacity/Latency Trade-off
      Maximize capacity:
      • Store only the L2 copy at the dynamic home tile
      • But L2 access time is penalized, especially for large VMs
      Minimize L2 access latency/bandwidth:
      • Replicate data in the local L2 slice
      • Selective/adaptive replication is well studied: ASR [Beckmann et al.], CC [Chang et al.]
      • But the dynamic home tile is still needed for the first level
      Can we exploit the virtual hierarchy for placement?

  42. VH: Data Placement Optimization Policy
      Data from memory is placed in the requesting tile's local L2 bank
      • No tag is allocated at the dynamic home tile
      Use second-level coherence on the first sharing miss
      • Then allocate a tag at the dynamic home tile for future sharing misses
      Benefits:
      • Private data allocates in the tile's local L2 bank
      • The overhead of replicating data is reduced
      • Fast first-level sharing for widely shared data
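      The policy can be summarized with the sketch below (the flags and hooks are invented): a fill from memory stays in the requester's local L2 slice with no home-tile tag, and the tag is allocated at the dynamic home tile only after the first sharing miss is resolved through second-level coherence.

```cpp
// Per-line placement metadata for the optimization policy on slide 42.
struct LinePlacement {
    bool homeTagAllocated = false;   // tag present at the dynamic home tile?
};

// Fill from memory: install only in the requesting tile's local L2 bank.
void onFillFromMemory(LinePlacement& line) {
    line.homeTagAllocated = false;   // private data never pays the home-tile overhead
}

// First sharing miss, detected via second-level coherence.
void onSharingMissViaSecondLevel(LinePlacement& line) {
    if (!line.homeTagAllocated) {
        line.homeTagAllocated = true;   // future sharing misses are satisfied at the
                                        // first level by the dynamic home tile
    }
}
```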

  43. Outline
      Introduction and Motivation
      Virtual Hierarchies
        Expanded Motivation
        Non-hierarchical approaches
        Proposed Virtual Hierarchies
        Evaluation
        Related Work
      Ring-based and Multiple-CMP Coherence
      Conclusion

  44. VH Evaluation Methods
      Wisconsin GEMS
      Target system: 64-core tiled CMP
      • In-order SPARC cores
      • 1 MB, 16-way L2 cache per tile, 10-cycle access
      • 2D mesh interconnect, 16-byte links, 5-cycle link latency
      • Eight on-chip memory controllers, 275-cycle DRAM latency

  45. VH Evaluation: Simulating Consolidation
      Challenge: bring-up of consolidated workloads
      Solution: approximate virtualization
      • Combine existing Simics checkpoints
      [Diagram: a script merges 8p checkpoints (P0-P7, Memory0, PCI0, DISK0) into a 64p checkpoint (P0-P63, VM0_Memory0, VM0_PCI0, VM0_DISK0, VM1_Memory0, VM1_PCI0, VM1_DISK0, ...)]

  46. VH Evaluation: Simulating Consolidation
      At simulation time, Ruby handles the mapping:
      • Converts <Processor ID, 32-bit address> to a <36-bit address>
      • Schedules VMs to adjacent cores by sending Simics requests to the appropriate L1 controllers
      • Memory controllers are evenly interleaved
      Bottom line: static scheduling, no hypervisor execution simulated, no content-based page sharing
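      The slides do not give the exact conversion, but one plausible form of the <Processor ID, 32-bit address> to 36-bit mapping is sketched below; the per-VM core count and the 4 GB-per-VM window layout are assumptions.

```cpp
#include <cstdint>

constexpr unsigned kCoresPerVM = 8;   // assumption: eight 8-core VMs on the 64-core CMP

// Place each VM's 32-bit physical address space in its own 4 GB window
// of a 36-bit global space, selected by the VM that owns the processor.
uint64_t toGlobalAddress(unsigned processorId, uint32_t vmLocalAddr) {
    const uint64_t vmId = processorId / kCoresPerVM;
    return (vmId << 32) | vmLocalAddr;   // up to 16 VMs fit in 36 bits
}
```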

  47. VH Evaluation: Workloads
      OLTP, SPECjbb, Apache, Zeus
      • Separate instance of Solaris for each VM
      Homogeneous consolidation
      • Simulate the same-size workload N times
      • Unit of work identical across all workloads
      • (each workload staggered by 1,000,000+ instructions)
      Heterogeneous consolidation
      • Simulate different-size, different workloads
      • Cycles-per-transaction reported for each workload

  48. VH Evaluation: Baseline Protocols
      DRAM-DIRECTORY:
      • 1 MB directory cache per controller
      • Each tile is nominally private, but replication is limited
      TAG-DIRECTORY:
      • 3-cycle central tag directory (1024 ways), non-pipelined
      • Replication limited
      STATIC-BANK-DIRECTORY:
      • Home tiles interleave by frame address
      • The home tile stores the only L2 copy

  49. VH Evaluation: VHA and VHB Protocols
      VHA
      • Based on the DirectoryCMP implementation
      • The dynamic home tile stores the only L2 copy
      VHB with optimizations
      • Private-data placement optimization policy (shared data stored at the home tile, private data is not)
      • Can violate inclusiveness (evict an L2 tag with sharers)
      • Memory data returned directly to the requestor

  50. Micro-benchmark: Sharing Latency
