Interconnect-Aware Coherence Protocols for Chip Multiprocessors Liqun Cheng Naveen Muralimanohar Karthik Ramani Rajeev Balasubramonian John Carter University of Utah
Motivation: Coherence Traffic • CMPs are ubiquitous • CMPs require coherence among multiple cores • Coherence operations entail frequent communication • Messages have different latency and bandwidth needs • Heterogeneous wires: 11% better performance, 22.5% lower wire power
[Figure: cores C1–C3 with private L1 caches and a shared L2 — messages related to a read miss (Read Req, Fwd to owner, Data) and to a write miss (Ex Req, Inval, Ack)]
Exclusive Request for a Shared Copy
1. Rd-Ex request from processor 1 to the L2 & directory (critical)
2. Directory sends a clean copy to processor 1 (non-critical)
3. Directory sends an invalidate message to processor 2 (critical)
4. Cache 2 sends an acknowledgement back to processor 1 (critical)
[Figure: Processor 1 / Cache 1, Processor 2 / Cache 2, and the L2 & directory, with steps 1–4 shown as hops annotated critical/non-critical]
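The exchange above can be summarized as a tiny classifier: only the directory's data reply is off the critical path, because processor 1 cannot use the data until the acknowledgement arrives anyway. A minimal sketch (message names are illustrative, not from the talk):

```python
# Hedged sketch of the exclusive-request-to-a-shared-copy exchange.
# Step numbers follow the slide; the critical/non-critical labels
# reflect that P1 must wait for Cache 2's ack before proceeding.
TRANSACTION = [
    (1, "Rd-Ex request: P1 -> L2/directory", "critical"),
    (2, "Clean data copy: directory -> P1", "non-critical"),
    (3, "Invalidate: directory -> Cache 2", "critical"),
    (4, "Ack: Cache 2 -> P1", "critical"),
]

def off_critical_path(transaction):
    """Messages that could ride slower, power-optimized wires."""
    return [msg for _, msg, crit in transaction if crit == "non-critical"]

print(off_critical_path(TRANSACTION))
```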
Wire Characteristics • Wire Resistance and capacitance per unit length
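A first-order RC model shows why the width/spacing knob trades delay for bandwidth. The constants below are illustrative placeholders, not values from the talk; the point is only that resistance falls with width while coupling capacitance falls with spacing, so a wire with 4x width and spacing is roughly twice as fast but occupies about 4x the metal area per track:

```python
def wire_rc_delay(width, spacing, r_unit=1.0, c_ground=0.5, c_couple=0.5):
    """Relative per-unit-length RC delay (illustrative first-order model).

    Resistance scales inversely with width; capacitance is a ground
    term (grows with width) plus a coupling term (shrinks with
    spacing to the neighboring wire).
    """
    resistance = r_unit / width
    capacitance = c_ground * width + c_couple / spacing
    return resistance * capacitance

base = wire_rc_delay(width=1.0, spacing=1.0)    # B-wire-like
l_like = wire_rc_delay(width=4.0, spacing=4.0)  # L-wire-like
print(l_like / base)  # roughly 0.53: ~2x faster, at ~4x area per track
```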
Design Space Exploration • Tuning wire width and spacing
[Figure: delay vs. bandwidth trade-off — base-case B wires favor bandwidth; increasing width & spacing yields fast but low-bandwidth L wires]
Design Space Exploration • Tuning repeater size and spacing
[Figure: power vs. delay trade-off — traditional wires use large repeaters at optimal spacing; power-optimal wires use smaller repeaters with increased spacing]
Design Space Exploration • B wires (base case), 8x plane: latency 1x, power 1x, area 1x • W wires (base case), 4x plane: latency 1.6x, power 0.9x, area 0.5x • PW wires (power optimized), 4x plane: latency 3.2x, power 0.3x, area 0.5x • L wires (fast, low bandwidth), 8x plane: latency 0.5x, power 0.5x, area 4x
Outline • Overview • Wire Design Space Exploration • Protocol-dependent Techniques • Protocol-independent Techniques • Results • Conclusions
Directory-Based Protocol (Write-Invalidate) • Map critical/small messages on L wires and non-critical messages on PW wires • Cases that exploit hop imbalance: • Read-exclusive request for a block in shared state • Read request for a block in exclusive state • Negative Ack (NACK) messages
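The mapping rule above can be pictured as a dispatch function over message class. This is a hedged sketch: the three-way split follows the bullets, but the real protocol's enumeration of cases is richer than two boolean flags:

```python
def pick_wire(critical, small_payload):
    """Map a coherence message to a wire class (sketch).

    L wires:  critical and narrow (fits the low-bandwidth links)
    PW wires: off the critical path, so latency is less important
    B wires:  everything else (critical but wide)
    """
    if not critical:
        return "PW"
    if small_payload:
        return "L"
    return "B"

# Illustrative examples of each class:
print(pick_wire(critical=True, small_payload=True))    # e.g. a NACK
print(pick_wire(critical=False, small_payload=False))  # e.g. a non-critical data reply
print(pick_wire(critical=True, small_payload=False))   # critical data
```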
Read to an Exclusive Block
[Figure: Proc 1 sends a Read Req to the L2 & directory, which forwards the request to the owner; Proc 2 forwards the dirty copy to Proc 1 (critical), while the speculative reply from the directory and the WB data back to the L2 are non-critical; an ACK completes the transaction]
NACK Messages • NACK: negative acknowledgement generated when the directory state is busy • Can employ the MSHR id of the request instead of the full address • When directory load is low: requests can be served on the next try, so sending NACKs on L wires can improve performance • When directory load is high: frequent back-off and retry cycles, so sending NACKs on PW wires can reduce power consumption
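The load-dependent NACK mapping can be sketched as a simple threshold policy. The 50% cutoff is an assumption for illustration; the slide only distinguishes "low" from "high" directory load:

```python
LOAD_THRESHOLD = 0.5  # assumed cutoff, not a value from the talk

def nack_wire(directory_load):
    """Choose wires for a NACK based on current directory load.

    Low load: the retry will likely succeed soon, so a fast NACK on
    L wires shortens the miss. High load: retries back off anyway,
    so a slow NACK on PW wires saves power without hurting latency.
    """
    return "L" if directory_load < LOAD_THRESHOLD else "PW"

print(nack_wire(0.1))  # lightly loaded directory
print(nack_wire(0.9))  # heavily loaded directory
```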
Snoop Bus Based Protocol • Similar to bus-based SMP system • Signal wires and voting wires • Signal wires • To find the state of the block • Voting wires • To vote for owner of the shared data
Protocol-Independent Techniques • Narrow bit-width operands for synchronization variables • Locks and barriers use small integers • Write-back data on PW wires • Write-back messages are rarely on the critical path • Narrow messages on L wires • Contain only src, dst, operand, and MSHR_id • For example: the reply to an upgrade message
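Whether a message qualifies as "narrow" can be checked from its field widths. The 24-bit L-channel width comes from the router slide later in the deck; the individual field sizes here are assumptions for illustration:

```python
L_CHANNEL_BITS = 24        # L-channel width (from the router slide)
SRC_BITS = DST_BITS = 4    # assumed: 16-core CMP -> 4 bits per endpoint
MSHR_ID_BITS = 4           # assumed MSHR id width

def fits_l_wires(operand):
    """True if src + dst + MSHR id + operand fit one L-channel flit."""
    used = SRC_BITS + DST_BITS + MSHR_ID_BITS + operand.bit_length()
    return used <= L_CHANNEL_BITS

print(fits_l_wires(7))        # small lock/barrier value: fits
print(fits_l_wires(1 << 20))  # wide operand: needs B wires
```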
Implementation Complexity • Heterogeneous interconnect incurs additional complexity • Cache coherence protocols • Robust enough to handle message re-ordering • Decision process • Interconnect implementation
Complexity in the Decision Process • In the directory-based system: • Optimizations that exploit hop imbalance must check the directory state • Dynamic mapping of NACK messages must track directory load • Narrow messages require computing the width of an operand
Overhead in Interconnect Implementation • Additional multiplexing/de-multiplexing at the sender and receiver side • Additional latches required for power-optimized wires • Power savings in PW wires go down by 5% • Wire area overhead: zero — equal metal area for the base and heterogeneous cases
Router Complexity
[Figure: base router model — each physical channel feeds virtual channels VC 1 and VC 2 into a crossbar with outputs Out 1 and Out 2]
Router Complexity • Each physical channel is split into 3 channels (L, PW & B)
[Figure: heterogeneous router — 24-bit L, PW, and B sub-channels share the crossbar and outputs; 64-byte and 32-byte channel widths are also shown]
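Splitting each physical channel into L, PW, and B sub-channels adds a small demultiplexing step at the sender, as the previous slide notes. A hedged toy model, assuming the wire class rides in the flit header (the header layout is an assumption):

```python
class Router:
    """Toy model of the heterogeneous router's channel split."""

    def __init__(self):
        # One queue per sub-channel of the physical channel.
        self.subchannels = {"L": [], "PW": [], "B": []}

    def send(self, flit):
        # The sender demultiplexes on the wire class in the header;
        # this mux/demux is the extra complexity vs. the base router.
        wire = flit["wire"]
        if wire not in self.subchannels:
            raise ValueError(f"unknown wire class: {wire}")
        self.subchannels[wire].append(flit["payload"])

r = Router()
r.send({"wire": "L", "payload": "NACK"})
r.send({"wire": "PW", "payload": "writeback data"})
print(r.subchannels["L"], r.subchannels["PW"])
```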
Outline • Overview • Wire Design Space Exploration • Protocol-dependent Techniques • Protocol-independent Techniques • Results • Conclusions
Evaluation Platform & Simulation Methodology • Virtutech Simics simulator • Sixteen-core CMP • Ruby timing model (GEMS): NUCA cache architecture, MOESI directory protocol • Opal timing model (GEMS): out-of-order processor, multiple outstanding requests • Benchmarks: SPLASH-2
Wire Model
[Figure: distributed RC wire model showing side-wall capacitance, coupling capacitance to adjacent wires (C_adj), and capacitance to upper/lower layers]
Ref: Banerjee et al. — 65nm process, 10 metal layers: 4 in the 1X plane and 2 in each of the 2X, 4X, and 8X planes
Heterogeneous Interconnects • B wires • Requests carrying addresses • Responses on the critical path • L wires (latency optimized) • Narrow messages • Unblock & write-control messages • NACKs • PW wires (power optimized) • Writeback data • Responses to read requests for an exclusive block
Performance Improvements
[Figure: per-benchmark results — average improvement 11%]
Percentage of Critical/Non-critical Messages • PW wire traffic: 13% • L wire traffic: 40% • Performance improvement: 11% • Power saving in wires: 22.5%
L-Message Distribution
[Figure: breakdown of L-wire traffic into narrow messages, unblock & control messages, and hop-imbalance messages]
Sensitivity Analysis • Impact of an out-of-order core • Average speedup 9.3% • Partial simulation (only 100M instructions) • The OOO core is more tolerant of long-latency operations • Link bandwidth & routing algorithm • Benchmarks with high link utilization are very sensitive to bandwidth changes • Deterministic routing incurs a 3% performance loss compared to adaptive routing
Conclusions • Coherence messages have diverse needs • Intelligent mapping of messages to heterogeneous wires can improve performance and power • Low-bandwidth, high-speed links improve performance by 11% for SPLASH benchmarks • Non-critical traffic on the power-optimized network decreases wire power by 22.5%