Token Coherence for CMPs Mike Marty CS 838 12/09/2003
Outline • Motivation for Token Coherence • Token Coherence 101 • Issues with CMPs • Maintaining correctness • When to retry? • Persistent requests
Motivation #1: Performance Tradeoffs • Snooping • Low latency (w/o network contention) • Bandwidth doesn’t scale • Requires ordered interconnect • Directory • No broadcast == scalability • Enables unordered interconnects • Indirections increase latency • Can we get best of both worlds?
Motivation #2: Complexity • Even basic coherence protocols are hard to get right • Future sources of additional complexity • Prediction • e.g. Destination-Set Prediction • Hierarchy • CMPs • DNUCA
Complexity Example • [Diagram: six processors (P), two L2 caches, and a Dir/Mem node exchanging numbered messages (GETX, Fwd, Inv, WB, DATA) for a block held in S and O states]
Token Coherence • Global Invariant • For each block, allow either one writer or multiple readers • Enforce Locally • Each (logical) block has T tokens (initially at memory) • Tokens never created nor destroyed • Components exchange tokens & data • All tokens to write <==> one writer invariant • At least one token to read <==> multiple readers • Invariants enforced explicitly, rather than a grab-bag of invalidations, acks, & nacks
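The token-counting rules above can be sketched in a few lines of Python. This is a minimal illustration only, not the protocol's implementation: the `Holder` class, the `transfer` and `check_invariants` helpers, and the choice of T = 16 are all assumptions made for the sketch.

```python
from dataclasses import dataclass

T = 16  # tokens per block; an assumed value for this sketch


@dataclass
class Holder:
    """Hypothetical component (cache or memory) holding tokens for one block."""
    name: str
    tokens: int = 0

    def can_read(self) -> bool:
        # At least one token is required to read.
        return self.tokens >= 1

    def can_write(self) -> bool:
        # All T tokens are required to write.
        return self.tokens == T


def transfer(src: Holder, dst: Holder, count: int) -> None:
    """Move tokens between holders; tokens are never created or destroyed."""
    if count < 0 or count > src.tokens:
        raise ValueError("cannot send tokens that are not held")
    src.tokens -= count
    dst.tokens += count


def check_invariants(holders) -> None:
    """Check token conservation and the one-writer-or-many-readers rule."""
    assert sum(h.tokens for h in holders) == T
    writers = [h for h in holders if h.can_write()]
    readers = [h for h in holders if h.can_read()]
    assert len(writers) <= 1
    if writers:
        # A writer holds every token, so nobody else can read.
        assert readers == writers


memory = Holder("memory", T)  # all tokens start at memory
p0, p1 = Holder("P0"), Holder("P1")
system = [memory, p0, p1]

transfer(memory, p0, 1)  # P0 becomes a reader
transfer(memory, p1, 1)  # P1 becomes a reader
check_invariants(system)
```

Because each holder decides locally from its own token count, no holder needs to know who else is sharing the block.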
Starvation Avoidance • Purpose: handle pathological cases • Detect possible starvation • E.g. > 4 retries issued • Invoke Persistent Request • Starving processor issues • Fair arbitration scheme activates request • Request persists at each processor until satisfied • Deactivate upon completion • Persistent Requests should be rare
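The escalation policy above can be sketched as a retry loop. This is a rough illustration under assumptions: the `issue_request` function and its two callbacks are hypothetical, and the threshold is read here as "escalate once a fifth retry would be needed", which is only one possible reading of "> 4 retries".

```python
MAX_TRANSIENT_RETRIES = 4  # taken from the "> 4 retries" example


def issue_request(try_transient, issue_persistent,
                  max_retries=MAX_TRANSIENT_RETRIES):
    """Issue transient requests, then escalate to a persistent request.

    try_transient:    hypothetical callable, returns True if the request
                      collected the tokens it needed.
    issue_persistent: hypothetical callable that activates a persistent
                      request and returns once it has been satisfied.
    Returns (kind, retries_issued).
    """
    retries_issued = 0
    while True:
        if try_transient():
            return ("transient", retries_issued)
        if retries_issued >= max_retries:
            # Possible starvation detected: stop retrying and escalate.
            issue_persistent()
            return ("persistent", retries_issued)
        retries_issued += 1  # reissue the transient request
```

In the common case the first transient request succeeds and the persistent path is never taken, which matches the expectation that persistent requests are rare.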
Token Coherence Generalizes • Correctness Substrate • Safety with token counting • Starvation avoidance with persistent requests • Performance Protocol • Make the common case fast • Alternatives: • TokenB – broadcast • TokenM – multicast with prediction • TokenD – soft-state directory for a large system
Token Coherence for CMPs • Ensuring safety with tokens • Tokens to individual caches • Exploiting CMP locality • Extra tokens to chip on initial requests • Scaling Persistent Requests
2-Level Directory and TokenCMP • [Diagram: 2-level directory with processors (P) under four Local Dirs and one Global Dir] • [Diagram: TokenCMP logically, showing only processors (P)]
TokenCMP Reissues • Transient requests can fail • Coherence race • Network contention • When to reissue? • Reissue transient or persistent request?
TokenCMP Persistent Requests • Recall: • Anything that holds/sends/receives tokens must remember all outstanding persistent requests • Scale persistent requests via hierarchy • One outstanding persistent request per chip • Achieve performance via locality • Satisfy all local persistent requests first
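One way to picture the per-chip hierarchy is a small arbiter that funnels local persistent requests into a single global one. This is a sketch under assumptions, for a single block: the `ChipArbiter` class, its method names, and the FIFO service order are hypothetical and not taken from the slides.

```python
from collections import deque


class ChipArbiter:
    """Hypothetical per-chip arbiter for persistent requests on one block."""

    def __init__(self, chip_id):
        self.chip_id = chip_id
        self.pending = deque()        # local processors waiting (FIFO, assumed)
        self.active_globally = False  # at most one global request per chip

    def request(self, proc_id):
        """Record a local persistent request.

        Returns True if the chip must now activate its single global
        persistent request, False if one is already outstanding.
        """
        self.pending.append(proc_id)
        if not self.active_globally:
            self.active_globally = True
            return True
        return False

    def tokens_arrived(self):
        """Serve every local waiter before giving up the tokens.

        Returns the order in which local processors were served and
        deactivates the chip's global persistent request.
        """
        served = list(self.pending)
        self.pending.clear()
        self.active_globally = False
        return served


arbiter = ChipArbiter(chip_id=0)
first = arbiter.request("P0")    # activates the chip's global request
second = arbiter.request("P1")   # piggybacks on the outstanding one
order = arbiter.tokens_arrived()
```

Other chips then only need to remember one persistent request per chip rather than one per processor, which is the scaling benefit the hierarchy is after.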
Conclusion • CMPs move the complexity from the processors to the memory system • Token Coherence reduces complexity by separating performance from correctness • CMP performance largely depends on coherence • Token Coherence gives designers more performance tradeoffs
Token Coherence Example • Initial state: P0 T=0, P1 T=0, P2 T=16 (R/W, max tokens) • P0 issues a request to write (delayed to P2) • P1 issues a request to read
• P2 responds with data to P1 (sends T=1) • State: P0 T=0, P1 T=1 (R), P2 T=15 (R)
• P0’s delayed request arrives at P2
• P2 responds to P0 (sends T=15) • State: P0 T=15 (R), P1 T=1 (R), P2 T=0
Now what? (P0 wants all tokens)
Timeout! • P0 reissues request • P1 responds with a token (sends T=1)
• P0’s request completed • State: P0 T=16 (R/W), P1 T=0, P2 T=0
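The token movements in this example can be replayed as a short script. It is a minimal sketch: the `send` and `state` helpers and the `history` list are hypothetical names, and only token counts are modelled (no data, timing, or network).

```python
T = 16  # max tokens for the block, as in the example

tokens = {"P0": 0, "P1": 0, "P2": 16}
history = [dict(tokens)]  # snapshots of token counts after each message


def send(src, dst, count):
    """Move tokens from src to dst and check that none are lost."""
    assert 0 <= count <= tokens[src]
    tokens[src] -= count
    tokens[dst] += count
    assert sum(tokens.values()) == T
    history.append(dict(tokens))


def state(proc):
    """Permission implied by the token count: R/W, R, or none."""
    if tokens[proc] == T:
        return "R/W"
    return "R" if tokens[proc] > 0 else "-"


# P1's read request reaches P2 first: P2 sends data and one token.
send("P2", "P1", 1)
# P0's delayed write request then reaches P2: P2 sends its remaining tokens.
send("P2", "P0", 15)
p0_after_first_try = state("P0")  # P0 can read but still cannot write
# P0 times out and reissues; P1 gives up its token.
send("P1", "P0", 1)
```

After the reissue, P0 holds all 16 tokens and its write can proceed, even though the original request raced with P1's read.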