This presentation introduces Token Coherence as a framework for implementing Multiple-CMP (Chip Multiprocessor) systems efficiently by separating correctness from performance. Correctness is enforced flatly through token counting for safety and persistent requests for starvation avoidance, while a hierarchical performance policy makes the common case fast. Compared with a hierarchical directory protocol, Token Coherence is less complex, faster, and more robust for emerging larger Multiple-CMP configurations, and it handles shared caches and complex cache hierarchies while avoiding issues like race conditions.
Token Coherence: A Framework for Implementing Multiple-CMP Systems Mike Marty1, Jesse Bingham2, Mark Hill1, Alan Hu2, Milo Martin3, and David Wood1 1University of Wisconsin-Madison 2University of British Columbia 3University of Pennsylvania February 17th, 2005
Summary • Microprocessor → Chip Multiprocessor (CMP) • Symmetric Multiprocessor (SMP) → Multiple CMPs • Problem: Coherence with Multiple CMPs • Old Solution: Hierarchical Directory → Complex & Slow • New Solution: Apply Token Coherence • Developed for glueless multiprocessors [2003] • Keep: Flat for Correctness • Exploit: Hierarchical for Performance • Less Complex & Faster than a Hierarchical Directory
Outline • Motivation and Background • Coherence in Multiple-CMP Systems • Example: DirectoryCMP • Token Coherence: Flat for Correctness • Token Coherence: Hierarchical for Performance • Evaluation
Coherence in Multiple-CMP Systems • Chip Multiprocessors (CMPs) emerging • Larger systems will be built with Multiple CMPs (Diagram: a CMP with four processors, per-processor L1 I&D caches, and L2 banks on an on-chip interconnect; four such CMPs joined by a system interconnect.)
Problem: Hierarchical Coherence • Intra-CMP protocol for coherence within a CMP • Inter-CMP protocol for coherence between CMPs • Interactions between the two protocols increase complexity and explode the state space (Diagram: four CMPs on an interconnect, with inter-CMP coherence between chips and intra-CMP coherence within each chip.)
Improving Multiple-CMP Systems with Token Coherence • Token Coherence allows Multiple-CMP systems to be... • Flat for correctness, but • Hierarchical for performance (Diagram: a flat correctness substrate spanning all four CMPs, with a hierarchical performance protocol layered on top → low complexity, fast.)
Example: DirectoryCMP • 2-level MOESI directory protocol • Interactions between the levels create race conditions (Diagram: two CMPs, each with eight processors, private L1 I&D caches, a shared L2/directory, and a memory/directory; two racing Store B requests trigger overlapping getx, invalidate, forward, writeback, and data/ack messages while block B transitions [S→O] in one directory and [M→I] in the other. RACE CONDITIONS!)
Token Coherence Summary • Token Coherence separates performance from correctness • Correctness Substrate: Enforces the coherence invariant and prevents starvation • Safety with Token Counting • Starvation Avoidance with Persistent Requests • Performance Policy: Makes the common case fast • Transient requests to seek tokens • Unordered, untracked, unacknowledged • Optional prediction, multicast, filters, etc.
Outline • Motivation and Background • Token Coherence: Flat for Correctness • Safety • Starvation Avoidance • Token Coherence: Hierarchical for Performance • Evaluation
Example: Token Coherence [ISCA 2003] • Each memory block initialized with T tokens • Tokens stored in memory, caches, & messages • At least one token to read a block • All tokens to write a block (Diagram: four processors with L1/L2 caches and memories on an interconnect; tokens for block B spread to readers on Load B and all collect at one processor on Store B.)
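To make the counting rule concrete, here is a minimal sketch in Python of the two safety checks a token-holding cache would apply; the class and method names are illustrative assumptions, not the authors' implementation:

```python
# Minimal sketch of token-counting safety rules (illustrative).
# T tokens exist per block, system-wide, and are conserved.

class TokenCache:
    def __init__(self, total_tokens):
        self.T = total_tokens          # fixed total tokens per block
        self.tokens = {}               # block address -> tokens held here

    def can_read(self, addr):
        # Reading requires holding at least one of the block's T tokens.
        return self.tokens.get(addr, 0) >= 1

    def can_write(self, addr):
        # Writing requires holding ALL T tokens, so no other cache
        # can simultaneously hold a readable copy.
        return self.tokens.get(addr, 0) == self.T

    def receive(self, addr, count):
        # Tokens arrive in response messages; never created or destroyed.
        self.tokens[addr] = self.tokens.get(addr, 0) + count

    def send(self, addr, count):
        held = self.tokens.get(addr, 0)
        assert count <= held, "cannot send tokens we do not hold"
        self.tokens[addr] = held - count
        return count                   # tokens to attach to the outgoing message
```

Because both checks are purely local counts, the invariant of one writer or multiple readers holds regardless of how requests are routed.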
Extending to Multiple-CMP Systems • Token counting remains flat • Tokens flow to caches, not nodes • Handles shared caches and other complex hierarchies (Diagram: two CMPs, each with private L1 I&D caches over a shared L2; on racing Store B requests, tokens move among L1s, shared L2s, and memories across the on-chip and system interconnects.)
Safety Recap • Safety: Maintain the coherence invariant • Only one writer, or multiple readers • Tokens for Safety • T tokens associated with each memory block • Token count encoded in 1 + ⌈log2 T⌉ bits (e.g., T = 64 fits in 7 bits) • Processor acquires all tokens to write, a single token to read • Tokens passed to nodes in the glueless multiprocessor scheme [2003] • But CMPs have private and shared caches • Tokens passed to caches in a Multiple-CMP system • Arbitrary cache hierarchies easily handled • Flat for correctness
Some Token Counting Implications • Memory must store tokens • Separate RAM • Use extra ECC bits • Token cache • T sized to # caches to allow read-only copies in all caches • Replacements cannot be silent • Tokens must not be lost or dropped • Targeted for invalidate-based protocols • Not a solution for write-through or update protocols • Tokens must be identified by block address • Address must be in all token-carrying messages
Starvation Avoidance • Request messages can miss tokens • Tokens may be in flight • Transient requests are not tracked throughout the system • Incorrect filtering, multicast, destination-set prediction, etc. • Possible Solution: Retries • Retry w/ optional randomized backoff is effective for races • Guaranteed Solution: Persistent Requests (see the sketch below) • Heavyweight request guaranteed to succeed • Should be rare (uses more bandwidth) • Locates all tokens in the system • Orders competing requests
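A minimal sketch of the escalation logic, assuming a hypothetical protocol interface; the timeout policy and method names are assumptions, not the authors' code:

```python
import random

# Illustrative sketch: transient requests first, optional randomized
# retry, then escalation to a guaranteed persistent request.

AVG_MISS_LATENCY = 100  # cycles; a plausible timeout near the average miss latency

class Requester:
    def __init__(self, protocol):
        self.protocol = protocol       # hypothetical protocol interface

    def acquire_write_permission(self, addr):
        # Fast path: an unordered, untracked, unacknowledged transient request.
        self.protocol.send_transient_getx(addr)
        if self.protocol.wait_for_all_tokens(addr, timeout=AVG_MISS_LATENCY):
            return

        # Optional middle path: one retry with randomized backoff,
        # which resolves most races without heavyweight machinery.
        self.protocol.send_transient_getx(addr)
        backoff = AVG_MISS_LATENCY + random.randrange(AVG_MISS_LATENCY)
        if self.protocol.wait_for_all_tokens(addr, timeout=backoff):
            return

        # Guaranteed path: a persistent request that locates every token
        # and is ordered against competing persistent requests.
        self.protocol.issue_persistent_request(addr)
        self.protocol.wait_for_all_tokens(addr, timeout=None)
```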
Starvation Avoidance • Tokens move freely in the system • Transient requests can miss in-flight tokens • Incorrect speculation, filters, prediction, etc. (Diagram: three processors issue racing GETX transient requests for Store B across two CMPs; none collects all the tokens.)
Starvation Avoidance • Solution: issue a Persistent Request • Heavyweight request guaranteed to succeed • Methods: Centralized [2003] and Distributed (new) (Diagram: the same three racing Store B requests, now resolved by persistent requests.)
Old Scheme: Central Arbiter [2003] • Processors issue persistent requests after timing out (Diagram: P0, P1, and P2 all time out on Store B; per-address arbiter 0 queues their persistent requests for block B.)
Old Scheme: Central Arbiter [2003] • Processors issue persistent requests • Arbiter orders them and broadcasts an activate (Diagram: arbiter 0 activates P0's request; every cache and memory records B: P0.)
Old Scheme: Central Arbiter [2003] • Finished processor sends a deactivate to the arbiter • Arbiter broadcasts the deactivate (and the next activate) • Bottom line: handoff takes 3 message latencies (Diagram: (1) P0's deactivate reaches arbiter 0, (2)-(3) the arbiter broadcasts the deactivate and the next activate, updating every table from B: P0 to B: P2.)
Improved Scheme: Distributed Arbitration [NEW] • Processors broadcast persistent requests (Diagram: every cache and memory builds a table of the outstanding requests P0: B, P1: B, and P2: B.)
Improved Scheme: Distributed Arbitration [NEW] • Processors broadcast persistent requests • Fixed priority (processor number) picks the winner (Diagram: all tables independently mark P0: B as the active request.)
Improved Scheme: Distributed Arbitration [NEW] • Processors broadcast persistent requests • Fixed priority (processor number) • Finished processors broadcast a deactivate (Diagram: P0's single deactivate broadcast clears P0: B from every table, making P1: B active.)
Improved Scheme: Distributed Arbitration [NEW] • Bottom line: handoff is a single message latency • Subtle point: P0 and P1 must wait until the next “wave” (Diagram: tables now show P1: B active with P2: B still queued.)
Implementing Distributed Persistent Requests • Table at each cache (see the sketch below) • Sized to N entries per processor (we use N=1) • Indexed by processor ID • Content-addressable by address • Each incoming message must access the table • Not on the critical path, so a slow CAM suffices • Activate/deactivate reordering cannot be allowed • Persistent-request virtual channel must be point-to-point ordered • Or use another solution such as sequence numbers or acks
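A sketch of that per-cache table with fixed processor-number priority; the class structure and method names are illustrative assumptions:

```python
# Illustrative sketch of the per-cache persistent-request table.
# Indexed by processor ID, content-addressable by block address.

class PersistentRequestTable:
    def __init__(self, num_procs, entries_per_proc=1):
        # One slot per processor (N = 1 in the talk's configuration).
        self.table = [[None] * entries_per_proc for _ in range(num_procs)]

    def activate(self, proc_id, addr):
        # Record proc_id's broadcast persistent request.
        self.table[proc_id][0] = addr

    def deactivate(self, proc_id, addr):
        # Clear the entry when proc_id broadcasts its deactivate.
        if self.table[proc_id][0] == addr:
            self.table[proc_id][0] = None

    def winner(self, addr):
        # CAM lookup by address; the lowest processor ID wins
        # (fixed priority), so all tables agree without an arbiter.
        for proc_id, entries in enumerate(self.table):
            if addr in entries:
                return proc_id
        return None

    def must_forward_tokens(self, my_proc_id, addr):
        # A cache holding tokens for addr forwards them to the
        # winning requester, unless it is itself the winner.
        w = self.winner(addr)
        return w is not None and w != my_proc_id
```

Since every table applies the same fixed priority to the same set of broadcasts, no central ordering point is needed, which is what cuts the handoff to one message latency.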
Implementing Distributed Persistent Requests • Should reads be distinguished from writes? • Not necessary, but a Persistent Read request is helpful • Implications of flat distributed arbitration • Correctness remains simple and flat • Requires a global broadcast when used • Fortunately, persistent requests are rare in typical workloads (0.3%) • A pathological workload (very high contention) would burn bandwidth • Maximum # of processors must be architected in advance • What about a hierarchical persistent request scheme? • Possible, but correctness would no longer be flat • Make the common case fast
Reducing Unnecessary Traffic • Problem: Which token-holding cache responds with data? • Solution: Distinguish one token as the owner token • The owner includes data with token response • Clean vs. dirty owner distinction also useful for writebacks
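A sketch of the owner-token response rule; the field and function names are assumptions, not the authors' code:

```python
# Illustrative sketch: only the cache holding the distinguished owner
# token responds with data, so a broadcast draws one data response
# rather than one from every token holder.

class BlockState:
    def __init__(self, tokens=0, has_owner_token=False, dirty=False):
        self.tokens = tokens
        self.has_owner_token = has_owner_token
        self.dirty = dirty            # owner has written the block

def respond(state):
    if state.tokens == 0:
        return None                   # nothing to contribute
    if state.has_owner_token:
        # Owner sends data with its tokens; the clean/dirty bit lets a
        # clean owner's writeback skip sending data back to memory.
        return {"tokens": state.tokens, "data": True, "dirty": state.dirty}
    # Non-owner token holders send tokens only, without data.
    return {"tokens": state.tokens, "data": False}
```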
Outline • Motivation and Background • Token Coherence: Flat for Correctness • Token Coherence: Hierarchical for Performance • TokenCMP • Another look at performance policies • Evaluation
Hierarchical for Performance: TokenCMP • Target System: • 2-8 CMPs • Private L1s, shared L2 per CMP • Any interconnect, but high-bandwidth • Performance Policy Goals: • Aggressively acquire tokens • Exploit on-chip locality and bandwidth • Respect cache hierarchy • Detecting and handling missed tokens
Hierarchical for Performance: TokenCMP • Approach: • On an L1 miss, broadcast within own CMP • A local cache responds if possible • On an L2 miss, broadcast to other CMPs • The appropriate L2 bank responds or broadcasts within its CMP • Optionally filter • Responses between CMPs carry extra tokens for future locality • Handling missed tokens (see the sketch below): • Timeout after the average memory latency • Invoke a persistent request (no retries) • Larger systems can use filters, multicast, or soft-state directories
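A sketch of that two-level broadcast flow; the function names and the cmp interface are hypothetical:

```python
# Illustrative sketch of TokenCMP's two-level broadcast policy.

def handle_l1_miss(cmp, addr, want_write):
    # First try on chip, exploiting abundant intra-CMP bandwidth
    # and locality; a local cache responds if it can.
    cmp.broadcast_on_chip(addr, want_write)
    if cmp.tokens_satisfied(addr, want_write):
        return
    # On an L2 miss, broadcast transient requests to the other CMPs;
    # each receiving L2 bank responds or re-broadcasts on its own chip.
    cmp.broadcast_to_other_cmps(addr, want_write)

def on_timeout(cmp, addr, want_write):
    # No retries in TokenCMP: after roughly the average memory latency
    # without the needed tokens, escalate directly to the guaranteed
    # persistent-request mechanism.
    cmp.issue_persistent_request(addr, want_write)
```

Nothing in this flow affects correctness; a wrong guess about where tokens live only costs time and bandwidth, which is what lets the performance policy be hierarchical while the substrate stays flat.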
Other Optimizations in TokenCMP • Implementing the E-state • Memory responds with all tokens on a read request • The clean/dirty owner distinction eliminates writing back unwritten data • Implementing Migratory Sharing • What is it? A processor's read request receives exclusive permission if the responder has exclusive permission and wrote the block • In TokenCMP, simply return all tokens • Non-speculative delay: hold the block for some # of cycles so permission isn't stolen prematurely
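A sketch of the migratory-sharing response rule; names are assumptions:

```python
# Illustrative sketch: on a read request (GETS), a responder holding
# ALL tokens that has written the block hands over everything, so the
# likely next writer already has exclusive permission.

def respond_to_gets(state, total_tokens):
    if state.tokens == 0:
        return None                   # not a token holder for this block
    if state.tokens == total_tokens and state.dirty:
        # Migratory pattern: grant all tokens (exclusive permission)
        # instead of the single token a read strictly needs.
        granted = total_tokens
    else:
        granted = 1                   # ordinary read: one token suffices
    state.tokens -= granted
    return granted
```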
Another Look at Performance Policies • How to find tokens? • Broadcast • Broadcast w/ filters • Multicast (destination-set prediction) • Directories (soft or hard) • Who responds with data? • Owner token • TokenCMP uses Owner token for Inter-CMP responses • Other heuristics • For TokenCMP intra-CMP responses, cache responds if it has extra tokens
Transient Requests May Reduce Complexity • The processor holds the only required state about its request • TokenCMP's L2 controller is very simple (see the sketch below): • Re-broadcasts an L1 request message on a miss • Re-broadcasts or filters external request messages • Possible states: no tokens (I), some tokens (S), all tokens (M) • Bounces unexpected tokens to memory • DirectoryCMP's L2 controller is complex: • Allocates an MSHR on miss and forward • Issues invalidates and receives acks • Orders all intra-CMP requests and writebacks • 57 states in our L2 implementation!
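The contrast can be stated in a few lines: a TokenCMP L2 bank's stable state is just a function of its token count, versus the 57 states of the DirectoryCMP L2 described above. A minimal sketch:

```python
# Illustrative sketch: TokenCMP L2 stable state from the token count.

def l2_state(tokens, total_tokens):
    if tokens == 0:
        return "I"        # no tokens: invalid
    if tokens == total_tokens:
        return "M"        # all tokens: exclusive, writable
    return "S"            # some tokens: readable, shared
```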
Writebacks • DirectoryCMP uses “3-phase writebacks”: • L1 issues a writeback request • L2 enters a transient state or blocks the request • L2 responds with a writeback ack • L1 sends the data • TokenCMP uses “fire-and-forget” writebacks (see the sketch below): • Immediately send tokens and data • Heuristic: only send data if # tokens > 1
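A side-by-side sketch of the two schemes; the method names are hypothetical:

```python
# Illustrative sketch of fire-and-forget vs. 3-phase writebacks.

def tokencmp_writeback(l1, addr):
    state = l1.evict(addr)
    # No handshake: push tokens (and data, per the talk's heuristic)
    # toward the L2 immediately. Tokens are never silently dropped,
    # so replacements cannot be silent.
    include_data = state.tokens > 1        # heuristic from the slide
    l1.send_to_l2(addr, tokens=state.tokens,
                  data=state.data if include_data else None)

def directorycmp_writeback(l1, l2, addr):
    # 3-phase handshake, with the L2 in a transient state in between.
    l1.send_writeback_request(addr)        # phase 1: request
    l2.enter_transient_state(addr)
    l2.send_writeback_ack(addr)            # phase 2: ack
    l1.send_writeback_data(addr)           # phase 3: data
```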
Outline • Motivation and Background • Token Coherence: Flat for Correctness • Token Coherence: Hierarchical for Performance • Evaluation • Model checking • Performance w/ commercial workloads • Robustness
TokenCMP Evaluation • Simple? • Some anecdotal examples and comparisons • Model checking • Fast? • Full-system simulation w/ commercial workloads • Robust? • Micro-benchmarks to simulate high contention
Complexity Evaluation with Model Checking • This work was performed by Jesse Bingham and Alan Hu of the University of British Columbia • Methods: • TLA+ and TLC • The DirectoryCMP model omits all intra-CMP details • TokenCMP's correctness substrate fully modeled • Results: • Model-checking complexity similar between TokenCMP and the non-hierarchical DirectoryCMP • Correctness substrate verified to be coherent and deadlock-free • All possible performance protocols remain correct
Performance Evaluation • Target System: • 4 CMPs, 4 procs/CMP • 2GHz OoO SPARC, 8MB shared L2 per chip • Directly connected interconnect • Methods: Multifacet GEMS simulator • Simics augmented with timing models • Released soon: http://www.cs.wisc.edu/gems • Benchmarks: • Performance: Apache, Spec, OLTP • Robustness: locking micro-benchmark
Full-system Simulation: Runtime • TokenCMP performs 9-50% faster than DirectoryCMP (Chart: runtime per benchmark; labels include DRAM Directory and Perfect L2.)
Full-system Simulation: Inter-CMP Traffic • TokenCMP traffic is reasonable (or better) • DirectoryCMP's control overhead exceeds broadcast overhead for this small system
Performance Robustness • Locking micro-benchmark, swept from less to more contention (Charts: one for the correctness substrate only, one for the full system.)
Summary • Microprocessor → Chip Multiprocessor (CMP) • Symmetric Multiprocessor (SMP) → Multiple CMPs • Problem: Coherence with Multiple CMPs • Old Solution: Hierarchical Directory → Complex & Slow • New Solution: Apply Token Coherence • Developed for glueless multiprocessors [2003] • Keep: Flat for Correctness • Exploit: Hierarchical for Performance • Less Complex & Faster than a Hierarchical Directory