370 likes | 395 Views
Improving Multiple-CMP Systems with Token Coherence. Mike Marty 1 , Jesse Bingham 2 , Mark Hill 1 , Alan Hu 2 , Milo Martin 3 , and David Wood 1 1 University of Wisconsin-Madison 2 University of British Columbia 3 University of Pennsylvania Thanks to Intel, NSERC, NSF, and Sun. Summary.
E N D
Improving Multiple-CMP Systems with Token Coherence Mike Marty1,Jesse Bingham2, Mark Hill1, Alan Hu2, Milo Martin3, and David Wood1 1University of Wisconsin-Madison 2University of British Columbia 3University of Pennsylvania Thanks to Intel, NSERC, NSF, and Sun
Summary • Microprocessor Chip Multiprocessor (CMP) • Symmetric Multiprocessor (SMP) Multiple CMPs • Problem: Coherence with Multiple CMPs • Old Solution: Hierarchical Protocol Complex & Slow • New Solution: Apply Token Coherence • Developed for glueless multiprocessor [ISCA 2003] • Keep: Flat for Correctness • Exploit: Hierarchical for performance • Less Complex & Faster than Hierarchical Directory
Outline • Motivation and Background • Coherence in Multiple-CMP Systems • Example: DirectoryCMP • Token Coherence: Flat for Correctness • Token Coherence: Hierarchical for Performance • Evaluation
Coherence in Multiple-CMP Systems P P P P I I D I D D I D interconnect L2 L2 L2 L2 • Chip Multiprocessors (CMPs) emerging • Larger systems will be built with Multiple CMPs CMP 2 CMP 1 interconnect CMP 3 CMP 4
Problem: Hierarchical Coherence • Intra-CMP protocol for coherence within CMP • Inter-CMP protocol for coherence between CMPs • Interactions between protocols increase complexity • explodes state space CMP 2 CMP 1 interconnect Inter-CMP Coherence Intra-CMP Coherence CMP 3 CMP 4
Improving Multiple CMP Systems with Token Coherence • Token Coherence allows Multiple-CMP systems to be... • Flat for correctness, but • Hierarchical for performance Low Complexity Fast Correctness Substrate CMP 2 CMP 1 interconnect Performance Protocol CMP 3 CMP 4
Example: DirectoryCMP 2-level MOESI Directory RACE CONDITIONS! CMP 0 CMP 1 Store B Store B P0 P1 P2 P3 P4 P5 P6 P7 L1 I&D L1 I&D L1 I&D L1 I&D L1 I&D L1 I&D L1 I&D L1 I&D S S O S data/ ack data/ ack getx WB getx inv ack inv ack inv fwd ack data/ ack Shared L2 / directory Shared L2 / directory S getx WB fwd B: [M I] B: [S O] getx Memory/Directory Memory/Directory
Outline • Motivation and Background • Token Coherence: Flat for Correctness • Safety • Starvation Avoidance • Token Coherence: Hierarchical for Performance • Evaluation
Example: Token Coherence [ISCA 2003] Load B Load B Store B Store B • Each memory block initialized with T tokens • Tokens stored in memory, caches, & messages • At least one token to read a block • All tokens to write a block P0 P1 P2 P3 L1 I&D L1 I&D L1 I&D L1 I&D L2 L2 L2 L2 mem 0 interconnect mem 3
Extending to Multiple-CMP System CMP 0 CMP 1 P0 P1 P2 P3 L1 I&D L1 I&D L1 I&D L1 I&D L2 L2 L2 L2 interconnect interconnect Shared L2 Shared L2 mem 0 interconnect mem 1
Extending to Multiple-CMP System CMP 0 CMP 1 • Token counting remains flat • Tokens to caches • Handles shared caches and other complex hierarchies Store B Store B P0 P1 P2 P3 L1 I&D L1 I&D L1 I&D L1 I&D interconnect interconnect Shared L2 Shared L2 mem 0 mem 1 interconnect
Starvation Avoidance GETX GETX GETX CMP 0 CMP 1 • Tokens move freely in the system • Transient requests can miss in-flight tokens • Incorrect speculation, filters, prediction, etc Store B Store B Store B P0 P1 P2 P3 L1 I&D L1 I&D L1 I&D L1 I&D interconnect interconnect Shared L2 Shared L2 mem 0 mem 1 interconnect
Starvation Avoidance CMP 0 CMP 1 • Solution: issue Persistent Request • Heavyweight request guaranteed to succeed • Methods: Centralized [2003] and Distributed (New) Store B Store B Store B P0 P1 P2 P3 L1 I&D L1 I&D L1 I&D L1 I&D interconnect interconnect Shared L2 Shared L2 mem 0 mem 1 interconnect
Old Scheme: Central Arbiter [2003] CMP 0 CMP 1 • Processors issue persistent requests Store B Store B Store B timeout timeout timeout P0 P1 P2 P3 L1 I&D L1 I&D L1 I&D L1 I&D interconnect interconnect Shared L2 Shared L2 mem 0 mem 1 arbiter 0 interconnect B: P0 arbiter 0 B: P2 B: P1
Old Scheme: Central Arbiter [2003] CMP 0 CMP 1 • Processors issue persistent requests • Arbiter orders and broadcasts activate Store B Store B Store B Store B P0 P1 P2 P3 L1 I&D L1 I&D L1 I&D L1 I&D B: P0 B: P0 B: P0 B: P0 interconnect interconnect B: P0 Shared L2 Shared L2 B: P0 mem 0 mem 1 arbiter 0 interconnect B: P0 arbiter 0 B: P2 B: P1
Old Scheme: Central Arbiter [2003] CMP 0 CMP 1 • Processor sends deactivate to arbiter • Arbiter broadcasts deactivate (and next activate) • Bottom Line: handoff is 3 message latencies Store B Store B Store B P0 P1 P2 P3 L1 I&D L1 I&D L1 I&D L1 I&D B: P0 B: P2 B: P0 B: P2 B: P0 B: P2 B: P2 B: P0 3 interconnect interconnect B: P0 B: P2 Shared L2 Shared L2 B: P2 B: P0 1 2 mem 0 mem 1 arbiter 0 interconnect B: P0 arbiter 0 B: P2 B: P2 B: P1
Improved Scheme: Distributed Arbitration [NEW] CMP 0 CMP 1 • Processors broadcast persistent requests Store B Store B Store B P0 P1 P2 P3 P0: B P0: B P0: B P0: B P1: B P1: B P1: B P1: B L1 I&D L1 I&D L1 I&D L1 I&D P2: B P2: B P2: B P2: B interconnect interconnect P0: B Shared L2 Shared L2 P0: B P1: B P1: B P2: B P2: B mem 0 mem 1 interconnect P0: B P1: B P2: B
Improved Scheme: Distributed Arbitration [NEW] CMP 0 CMP 1 • Processors broadcast persistent requests • Fixed priority (processor number) Store B Store B Store B P0 P1 P2 P3 P0: B P0: B P0: B P0: B P0: B P0: B P0: B P0: B P1: B P1: B P1: B P1: B L1 I&D L1 I&D L1 I&D L1 I&D P2: B P2: B P2: B P2: B interconnect interconnect P0: B P0: B Shared L2 Shared L2 P0: B P0: B P1: B P1: B P2: B P2: B mem 0 mem 1 interconnect P0: B P0: B P1: B P2: B
Improved Scheme: Distributed Arbitration [NEW] CMP 0 CMP 1 • Processors broadcast persistent requests • Fixed priority (processor number) • Processors broadcast deactivate Store B Store B P0 P1 P2 P3 P0: B P0: B P0: B P0: B P1: B P1: B P1: B P1: B P1: B P1: B P1: B P1: B 1 L1 I&D L1 I&D L1 I&D L1 I&D P2: B P2: B P2: B P2: B interconnect interconnect P0: B Shared L2 Shared L2 P0: B P1: B P1: B P1: B P1: B P2: B P2: B mem 0 mem 1 interconnect P0: B P1: B P1: B P2: B
Improved Scheme: Distributed Arbitration [NEW] CMP 0 CMP 1 • Bottom line: Handoff is a single message latency • Subtle point: P0 and P1 must wait until next “wave” P0 P1 P2 P3 P1: B P1: B P1: B P1: B P1: B P1: B P1: B P1: B 1 L1 I&D L1 I&D L1 I&D L1 I&D P2: B P2: B P2: B P2: B interconnect interconnect Shared L2 Shared L2 P1: B P1: B P1: B P1: B P2: B P2: B mem 0 mem 1 interconnect P1: B P1: B P2: B
Outline • Motivation and Background • Token Coherence: Flat for Correctness • Token Coherence: Hierarchical for Performance • Evaluation
Hierarchical for Performance: TokenCMP • Target System: • 2-8 CMPs • Private L1s, shared L2 per CMP • Any interconnect, but high-bandwidth • Performance Policy Goals: • Aggressively acquire tokens • Exploit on-chip locality and bandwidth • Respect cache hierarchy • Detecting and handling missed tokens
Hierarchical for Performance: TokenCMP • Approach: • On L1 miss, broadcast within own CMP • Local cache responds if possible • On L2 miss, broadcast to other CMPs • Appropriate L2 bank responds or broadcasts within its CMP • Optionally filter • Responses between CMPs carry extra tokensfor future locality • Handling missed tokens: • Timeout after average memory latency • Invoke persistent request (no retries) • Larger systems can use filters, multicast, soft-state directories
Outline • Motivation and Background • Token Coherence: Flat for Correctness • Token Coherence: Hierarchical for Performance • Evaluation • Model checking • Performance w/ commercial workloads • Robustness
TokenCMP Evaluation • Simple? • Model checking • Fast? • Full-system simulation w/ commercial workloads • Robust? • Micro-benchmarks to simulate high contention
Complexity Evaluation with Model Checking • Methods: • TLA+ and TLC • DirectoryCMP omits all intra-CMP details • TokenCMP’s correctness substrate modeled • Result: • Complexity similar between TokenCMP and non-hierarchical DirectoryCMP • Correctness Substrate verified to be correct and deadlock-free • Small configuration, varied parameters • All possible performance protocols correct
Performance Evaluation • Target System: • 4 CMPs, 4 procs/cmp • 2GHz OoO SPARC, 8MB shared L2 per chip • Directly connected interconnect • Methods: Multifacet GEMS simulator • Simics augmented with timing models • Released soon: http://www.cs.wisc.edu/gems • ISCA 2005 Tutorial! • Benchmarks: • Performance: Apache, Spec, OLTP • Robustness: Locking uBenchmark
Full-system Simulation: Runtime • TokenCMP performs 9-50% faster than DirectoryCMP
Full-system Simulation: Runtime • TokenCMP performs 9-50% faster than DirectoryCMP DRAM Directory Perfect L2
Full-system Simulation: Traffic • TokenCMP traffic is reasonable (or better) • DirectoryCMP control overhead greater than broadcast for small system
Performance Robustness Locking micro-benchmark (correctness substrate only) less contention more contention
Performance Robustness Locking micro-benchmark (correctness substrate only) less contention more contention
Performance Robustness Locking micro-benchmark less contention more contention
Summary • Microprocessor Chip Multiprocessor (CMP) • Symmetric Multiprocessor (SMP) Multiple CMPs • Problem: Coherence with Multiple CMPs • Old Solution: Hierarchical Protocol Complex & Slow • New Solution: Apply Token Coherence • Developed for glueless multiprocessor [2003] • Keep: Flat for Correctness • Exploit: Hierarchical for performance • Less Complex & Faster than Hierarchical Directory