Token Coherence Framework for Multi-CMP Systems

Token Coherence: A Framework for Implementing Multiple-CMP Systems • Mike Marty (1), Jesse Bingham (2), Mark Hill (1), Alan Hu (2), Milo Martin (3), and David Wood (1) • (1) University of Wisconsin-Madison, (2) University of British Columbia, (3) University of Pennsylvania • February 17th, 2005

Summary • Microprocessor → Chip Multiprocessor (CMP) • Symmetric Multiprocessor (SMP) → Multiple CMPs • Problem: Coherence with Multiple CMPs • Old Solution: Hierarchical Directory (Complex & Slow) • New Solution: Apply Token Coherence • Developed for glueless multiprocessor [2003] • Keep: Flat for Correctness • Exploit: Hierarchical for performance • Less Complex & Faster than Hierarchical Directory

Outline • Motivation and Background • Coherence in Multiple-CMP Systems • Example: DirectoryCMP • Token Coherence: Flat for Correctness • Token Coherence: Hierarchical for Performance • Evaluation

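Coherence in Multiple-CMP Systems • Chip Multiprocessors (CMPs) are emerging • Larger systems will be built with multiple CMPs [Figure: one CMP with four processors (P), each with split I and D caches, sharing four L2 banks over an on-chip interconnect; four such chips, CMP 1 to CMP 4, joined by an interconnect]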

Problem: Hierarchical Coherence • Intra-CMP protocol for coherence within a CMP • Inter-CMP protocol for coherence between CMPs • Interactions between the protocols increase complexity • Explodes the state space [Figure: CMP 1 to CMP 4 joined by an interconnect; labels: Intra-CMP Coherence, Inter-CMP Coherence]

Improving Multiple CMP Systems with Token Coherence • Token Coherence allows Multiple-CMP systems to be... • Flat for correctness, but • Hierarchical for performance [Figure: CMP 1 to CMP 4 joined by an interconnect; labels: Correctness Substrate, Performance Protocol, Low Complexity, Fast]

Example: DirectoryCMP • 2-level MOESI Directory • RACE CONDITIONS! [Figure: CMP 0 (P0 to P3) and CMP 1 (P4 to P7), each processor with an L1 I&D cache, each chip with a shared L2 / directory, backed by Memory/Directory; processors on both chips issue Store B, producing getx, fwd, inv, ack, data/ack, and WB messages; directory entries for block B shown as [M I] and [S O]]

Token Coherence Summary • Token Coherence separates performance from correctness • Correctness Substrate: Enforces coherence invariant and prevents starvation • Safety with Token Counting • Starvation Avoidance with Persistent Requests • Performance Policy: Makes the common case fast • Transient requests to seek tokens • Unordered, untracked, unacknowledged • Possible prediction, multicast, filters, etc

Outline • Motivation and Background • Token Coherence: Flat for Correctness • Safety • Starvation Avoidance • Token Coherence: Hierarchical for Performance • Evaluation

Example: Token Coherence [ISCA 2003] • Each memory block initialized with T tokens • Tokens stored in memory, caches, & messages • At least one token to read a block • All tokens to write a block [Figure: processors P0 to P3, each with L1 I&D and L2 caches, connected by an interconnect to mem 0 through mem 3; Load B and Store B requests shown]

Extending to Multiple-CMP System [Figure: CMP 0 (P0, P1) and CMP 1 (P2, P3); each processor has an L1 I&D cache, each chip has an on-chip interconnect and a Shared L2; mem 0 and mem 1 are joined by an interconnect]

Extending to Multiple-CMP System • Token counting remains flat • Tokens to caches • Handles shared caches and other complex hierarchies [Figure: CMP 0 (P0, P1) and CMP 1 (P2, P3) with L1 I&D caches, on-chip interconnects, Shared L2s, mem 0 and mem 1; Store B requests shown]

Safety Recap • Safety: Maintain coherence invariant • Only one writer, or multiple readers • Tokens for Safety • T tokens associated with each memory block • Token count encoded in 1 + log2(T) bits • Processor acquires all tokens to write, a single token to read • Tokens passed to nodes in glueless multiprocessor scheme • But CMPs have private and shared caches • Tokens passed to caches in Multiple-CMP system • Arbitrary cache hierarchy easily handled • Flat for correctness
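The token-counting rules in this recap can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the value T = 8, the function names, and the dictionary of token holders are all assumptions made for the example, and T is assumed to be a power of two.

```python
T = 8  # assumed tokens per memory block (hypothetical value, power of two)

def bits_for_token_count(t: int) -> int:
    """Bits needed to store a count in 0..t; equals 1 + log2(t) for a power of two."""
    return t.bit_length()

def can_read(tokens_held: int) -> bool:
    """A cache may read a block when it holds at least one token."""
    return tokens_held >= 1

def can_write(tokens_held: int, total: int = T) -> bool:
    """A cache may write a block only when it holds every token."""
    return tokens_held == total

def tokens_conserved(holders: dict, total: int = T) -> bool:
    """Tokens in memory, caches, and in-flight messages must always sum to T."""
    return sum(holders.values()) == total
```

Because the check is a flat sum over all holders, it does not matter whether a holder is a private L1, a shared L2, memory, or a message in flight.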

Some Token Counting Implications • Memory must store tokens • Separate RAM • Use extra ECC bits • Token cache • T sized to # caches to allow read-only copies in all caches • Replacements cannot be silent • Tokens must not be lost or dropped • Targeted for invalidate-based protocols • Not a solution for write-through or update protocols • Tokens must be identified by block address • Address must be in all token-carrying messages

Starvation Avoidance • Request messages can miss tokens • In-flight tokens • Transient Requests are not tracked throughout system • Incorrect filtering, multicast, destination-set prediction, etc • Possible Solution: Retries • Retry w/ optional randomized backoff is effective for races • Guaranteed Solution: Persistent Requests • Heavyweight request guaranteed to succeed • Should be rare (uses more bandwidth) • Locates all tokens in the system • Orders competing requests
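The escalation from transient requests to a persistent request might look like the sketch below. The callables `send_transient` and `send_persistent`, the retry count, and the backoff window are hypothetical stand-ins chosen for illustration; the slides do not specify them.

```python
import random

def acquire_tokens(send_transient, send_persistent,
                   max_retries=2, base_delay=10, rng=None):
    """Try transient requests with randomized backoff, then fall back.

    send_transient() returns True if enough tokens arrived (assumed interface).
    send_persistent() issues the heavyweight request that is guaranteed to succeed.
    Returns (how the request was satisfied, transient attempts that failed, cycles waited).
    """
    rng = rng or random.Random(0)
    waited = 0
    for attempt in range(max_retries + 1):
        if send_transient():
            return ("transient", attempt, waited)
        # Randomized, exponentially growing backoff to break up racing requesters.
        waited += rng.randint(0, base_delay * (2 ** attempt))
    send_persistent()
    return ("persistent", max_retries + 1, waited)
```

Setting `max_retries=0` models a policy that goes straight to a persistent request after the first transient request fails.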

Starvation Avoidance • Tokens move freely in the system • Transient requests can miss in-flight tokens • Incorrect speculation, filters, prediction, etc [Figure: CMP 0 (P0, P1) and CMP 1 (P2, P3) with L1 I&D caches, on-chip interconnects, Shared L2s, mem 0 and mem 1; three processors issue Store B and send GETX requests]

Starvation Avoidance • Solution: issue Persistent Request • Heavyweight request guaranteed to succeed • Methods: Centralized [2003] and Distributed (New) [Figure: same two-CMP system; three processors with pending Store B]

Old Scheme: Central Arbiter [2003] • Processors issue persistent requests [Figure: three processors time out on Store B and send persistent requests to arbiter 0 (at mem 0), which queues them as B: P0, B: P2, B: P1]

Old Scheme: Central Arbiter [2003] • Processors issue persistent requests • Arbiter orders and broadcasts activate [Figure: arbiter 0 broadcasts the activation B: P0; every L1, Shared L2, and memory records B: P0 while B: P2 and B: P1 wait at the arbiter]

Old Scheme: Central Arbiter [2003] • Processor sends deactivate to arbiter • Arbiter broadcasts deactivate (and next activate) • Bottom Line: handoff is 3 message latencies [Figure: the three message hops of the handoff, numbered 1 to 3; entries at every L1, Shared L2, and memory change from B: P0 to B: P2 while B: P1 waits at arbiter 0]
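A toy model of the central arbiter helps make the ordering explicit. The sketch below assumes the arbiter serves requests for a block in arrival (FIFO) order; the slides say only that the arbiter "orders" requests, so the queue discipline, the class name, and the message tuples are illustrative assumptions.

```python
from collections import deque

class CentralArbiter:
    """Toy arbiter: one FIFO of persistent requesters per block address (assumed)."""

    def __init__(self):
        self.queues = {}  # block address -> deque of waiting processors

    def request(self, proc, addr):
        """Queue a persistent request; return the activate to broadcast, if any."""
        queue = self.queues.setdefault(addr, deque())
        queue.append(proc)
        return ("activate", queue[0], addr) if len(queue) == 1 else None

    def deactivate(self, proc, addr):
        """Active requester is done; return the messages the arbiter broadcasts."""
        queue = self.queues[addr]
        assert queue[0] == proc, "only the active requester may deactivate"
        queue.popleft()
        messages = [("deactivate", proc, addr)]
        if queue:
            messages.append(("activate", queue[0], addr))
        return messages
```

In this model the next requester learns it is active only after the deactivate has travelled to the arbiter and the arbiter's broadcast has travelled back out, which is where the extra handoff latency comes from.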

Improved Scheme: Distributed Arbitration [NEW] • Processors broadcast persistent requests [Figure: three processors with pending Store B; every L1, Shared L2, and memory holds a table with entries P0: B, P1: B, P2: B]

Improved Scheme: Distributed Arbitration [NEW] • Processors broadcast persistent requests • Fixed priority (processor number) [Figure: same tables with entries P0: B, P1: B, P2: B; the P0: B entry is highlighted as the active request in every table]

Improved Scheme: Distributed Arbitration [NEW] • Processors broadcast persistent requests • Fixed priority (processor number) • Processors broadcast deactivate [Figure: P0 broadcasts its deactivate (step 1); the P1: B entry becomes the active request in every table, with P2: B still waiting]

Improved Scheme: Distributed Arbitration [NEW] • Bottom line: Handoff is a single message latency • Subtle point: P0 and P1 must wait until next “wave” [Figure: tables now hold P1: B (active) and P2: B at every L1, Shared L2, and memory]

Implementing Distributed Persistent Requests • Table at each cache • Sized to N entries for each processor (we use N=1) • Indexed by processor ID • Content-addressable by Address • Each incoming message must access table • Not on the critical path, so a slow CAM is acceptable • Activate/deactivate reordering cannot be allowed • Persistent request virtual channel must be point-to-point ordered • Or, other solution such as sequence numbers or acks
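The per-cache table described here could be modelled roughly as follows. This is a sketch under stated assumptions: N = 1 entry per processor as on the slide, and the lowest processor number wins, which is what the P0-then-P1 example sequence suggests but is not stated outright. The class and method names are hypothetical, and the "wave" rule that stops a processor from immediately re-requesting is not modelled.

```python
class PersistentTable:
    """Per-cache table of active persistent requests, one entry per processor (N=1)."""

    def __init__(self, num_procs: int):
        # Indexed by processor ID; each slot holds a block address or None.
        self.entries = [None] * num_procs

    def activate(self, proc: int, addr) -> None:
        """Record a broadcast persistent request from proc for addr."""
        self.entries[proc] = addr

    def deactivate(self, proc: int, addr) -> None:
        """Clear proc's entry when its deactivate broadcast arrives."""
        if self.entries[proc] == addr:
            self.entries[proc] = None

    def active_requester(self, addr):
        """Content-addressable lookup: highest-priority requester for addr, or None.

        Assumes fixed priority by processor number, lowest number first.
        """
        for proc, entry in enumerate(self.entries):
            if entry == addr:
                return proc
        return None
```

Every cache evaluates the same fixed priority on its own copy of the table, so all caches agree on where tokens for a block must be forwarded without consulting an arbiter. That agreement depends on activates and deactivates from one processor arriving in order, which is the reason for the point-to-point ordering requirement.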

Implementing Distributed Persistent Requests • Should reads be distinguished from writes? • Not necessary, but • Persistent Read request is helpful • Implications of flat distributed arbitration • Simple → flat for correctness • Global broadcast when used • Fortunately they are rare in typical workloads (0.3%) • Bad workload (very high contention) would burn bandwidth • Maximum # processors must be architected • What about a hierarchical persistent request scheme? • Possible, but correctness is no longer flat • Make the common case fast

Reducing Unnecessary Traffic • Problem: Which token-holding cache responds with data? • Solution: Distinguish one token as the owner token • The owner includes data with token response • Clean vs. dirty owner distinction also useful for writebacks
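One possible response policy built on the owner token is sketched below. The specific choices (non-owners stay silent on read requests, and the owner gives away a non-owner token when it can) are assumptions made for illustration; the slide states only that the owner includes data with its token response.

```python
from dataclasses import dataclass

@dataclass
class CacheLine:
    tokens: int = 0
    owner: bool = False  # True if one of the held tokens is the owner token

def respond(line: CacheLine, exclusive: bool):
    """Return the response message for a request, or None to stay silent."""
    if line.tokens == 0:
        return None
    if exclusive:
        # Write request: give up every token; only the owner attaches data.
        msg = {"tokens": line.tokens, "owner": line.owner, "data": line.owner}
        line.tokens, line.owner = 0, False
        return msg
    if not line.owner:
        return None  # assumed policy: non-owners ignore read requests
    # Read request: owner sends data plus one token, keeping ownership if it can.
    give_owner = line.tokens == 1
    line.tokens -= 1
    if give_owner:
        line.owner = False
    return {"tokens": 1, "owner": give_owner, "data": True}
```

With this policy exactly one cache sends data per request, so the requester is not flooded with redundant copies of the block.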

Outline • Motivation and Background • Token Coherence: Flat for Correctness • Token Coherence: Hierarchical for Performance • TokenCMP • Another look at performance policies • Evaluation

Hierarchical for Performance: TokenCMP • Target System: • 2-8 CMPs • Private L1s, shared L2 per CMP • Any interconnect, but high-bandwidth • Performance Policy Goals: • Aggressively acquire tokens • Exploit on-chip locality and bandwidth • Respect cache hierarchy • Detecting and handling missed tokens

Hierarchical for Performance: TokenCMP • Approach: • On L1 miss, broadcast within own CMP • Local cache responds if possible • On L2 miss, broadcast to other CMPs • Appropriate L2 bank responds or broadcasts within its CMP • Optionally filter • Responses between CMPs carry extra tokens for future locality • Handling missed tokens: • Timeout after average memory latency • Invoke persistent request (no retries) • Larger systems can use filters, multicast, soft-state directories
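The local-first search order can be illustrated with a toy model of a read miss. Everything here is a simplification chosen for the example: the system is a dictionary of token counts per cache, memory would simply be another entry in it, a local cache responds only if it has a token to spare, the `extra` parameter stands in for "extra tokens for future locality", and returning "persistent" stands in for the timeout path.

```python
def read_miss(system, cmp_id, requester, extra=1):
    """Resolve a read miss in a toy system {cmp_id: {cache_name: token_count}}.

    Returns where the tokens were found: "intra-CMP", "inter-CMP", or
    "persistent" if the transient search found nothing.
    """
    local = system[cmp_id]
    local.setdefault(requester, 0)
    # Step 1: broadcast within the requester's own CMP.
    for name, tokens in local.items():
        if name != requester and tokens > 1:  # responder keeps one token for itself
            local[name] -= 1
            local[requester] += 1
            return "intra-CMP"
    # Step 2: broadcast to the other CMPs; the response carries extra tokens.
    for other_id, caches in system.items():
        if other_id == cmp_id:
            continue
        for name, tokens in caches.items():
            if tokens > 0:
                give = min(tokens, 1 + extra)
                caches[name] -= give
                local[requester] += give
                return "inter-CMP"
    # Step 3: nothing found; a real requester would time out and go persistent.
    return "persistent"
```

The extra tokens brought on chip in step 2 are what let a later miss by a neighbouring L1 be satisfied in step 1 instead of crossing chips again.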

Other Optimizations in TokenCMP • Implementing E-state • Memory responds with all tokens on read request • Use clean/dirty owner distinction to eliminate writing back unwritten data • Implementing Migratory Sharing • What is it? • A processor’s read request results in exclusive permission if responder has exclusive permission and wrote the block • In TokenCMP, simply return all tokens • Non-speculative delay • Hold block for some # cycles so permission isn’t stolen prematurely
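The migratory-sharing rule reduces to a small decision about how many tokens a responder returns for a read request. The sketch below is an assumed formulation: the function name and the `wrote_block` flag are hypothetical, and the non-migratory case simply returns one token.

```python
def read_response_tokens(responder_tokens: int, total: int, wrote_block: bool) -> int:
    """Number of tokens to return for a read request (illustrative policy)."""
    if responder_tokens == 0:
        return 0
    if responder_tokens == total and wrote_block:
        # Migratory sharing: the responder had exclusive permission and wrote
        # the block, so hand over all tokens and the reader can write at once.
        return total
    return 1  # ordinary read: one token is enough
```

Returning all tokens saves the reader a second request when it goes on to write the block, which is the access pattern migratory sharing targets.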

Another Look at Performance Policies • How to find tokens? • Broadcast • Broadcast w/ filters • Multicast (destination-set prediction) • Directories (soft or hard) • Who responds with data? • Owner token • TokenCMP uses Owner token for Inter-CMP responses • Other heuristics • For TokenCMP intra-CMP responses, cache responds if it has extra tokens

Transient Requests May Reduce Complexity • Processor holds the only required state about request • L2 controller in TokenCMP very simple: • Re-broadcasts L1 request message on a miss • Re-broadcasts or filters external request messages • Possible states: • no tokens (I) • all tokens (M) • some tokens (S) • Bounce unexpected tokens to memory • DirectoryCMP’s L2 controller is complex • Allocates MSHR on miss and forward • Issues invalidates and receives acks • Orders all intra-CMP requests and writebacks • 57 states in our L2 implementation!
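The three stable states listed for the TokenCMP L2 controller follow directly from the token count. A minimal sketch, with the function name and the range check as assumptions:

```python
def l2_state(tokens: int, total: int) -> str:
    """Map a token count to the stable state named on the slide."""
    if not 0 <= tokens <= total:
        raise ValueError("token count must be between 0 and the block total")
    if tokens == 0:
        return "I"  # no tokens
    if tokens == total:
        return "M"  # all tokens
    return "S"      # some tokens
```

Because the state is a function of the count alone, the controller needs no per-request transient states to stay correct.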

Writebacks • DirectoryCMP uses “3-phase writebacks” • L1 issues writeback request • L2 enters transient state or blocks request • L2 responds with writeback ack • L1 sends data • TokenCMP uses “fire-and-forget” writebacks • Immediately send tokens and data • Heuristic: Only send data if # tokens > 1
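A fire-and-forget writeback amounts to building a single message on eviction. The sketch below applies the slide's heuristic literally (send data only if more than one token is held) and adds a `dirty_owner` flag as an assumption, on the reasoning that a dirty owner token should not travel without its data; the message layout and names are hypothetical.

```python
def evict(tokens: int, addr, dirty_owner: bool = False):
    """Build the one-shot writeback message for an evicted line, or None."""
    if tokens == 0:
        return None  # no tokens held, so nothing has to be sent
    return {
        "addr": addr,        # tokens are identified by block address
        "tokens": tokens,    # tokens must never be dropped on replacement
        # Heuristic from the slide, plus the assumed dirty-owner safeguard.
        "data": tokens > 1 or dirty_owner,
    }
```

No request, acknowledgement, or transient state is involved: the cache sends the message and immediately forgets the line.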

Outline • Motivation and Background • Token Coherence: Flat for Correctness • Token Coherence: Hierarchical for Performance • Evaluation • Model checking • Performance w/ commercial workloads • Robustness

TokenCMP Evaluation • Simple? • Some anecdotal examples and comparisons • Model checking • Fast? • Full-system simulation w/ commercial workloads • Robust? • Micro-benchmarks to simulate high contention

Complexity Evaluation with Model Checking • This work was performed by Jesse Bingham and Alan Hu of the University of British Columbia • Methods: • TLA+ and TLC • DirectoryCMP omits all intra-CMP details • TokenCMP’s correctness substrate modeled • Result: • Complexity similar between TokenCMP and non-hierarchical DirectoryCMP • Correctness Substrate verified to be correct and deadlock-free • All possible performance protocols correct

Performance Evaluation • Target System: • 4 CMPs, 4 procs/CMP • 2GHz OoO SPARC, 8MB shared L2 per chip • Directly connected interconnect • Methods: Multifacet GEMS simulator • Simics augmented with timing models • Released soon: http://www.cs.wisc.edu/gems • Benchmarks: • Performance: Apache, Spec, OLTP • Robustness: Locking micro-benchmark


Full-system Simulation: Runtime • TokenCMP performs 9-50% faster than DirectoryCMP [Figure: runtime chart; labels: DRAM Directory, Perfect L2]

Full-system Simulation: Inter-CMP Traffic • TokenCMP traffic is reasonable (or better) • DirectoryCMP control overhead greater than broadcast for small system

Full-system Simulation: Intra-CMP Traffic

Performance Robustness • Locking micro-benchmark (correctness substrate only) [Figure: chart; axis labels: less contention, more contention]

Performance Robustness • Locking micro-benchmark [Figure: chart; axis labels: less contention, more contention]

Summary • Microprocessor → Chip Multiprocessor (CMP) • Symmetric Multiprocessor (SMP) → Multiple CMPs • Problem: Coherence with Multiple CMPs • Old Solution: Hierarchical Directory (Complex & Slow) • New Solution: Apply Token Coherence • Developed for glueless multiprocessor [2003] • Keep: Flat for Correctness • Exploit: Hierarchical for performance • Less Complex & Faster than Hierarchical Directory
