A Systematic Methodology to Develop Resilient Cache Coherence Protocols • Konstantinos Aisopos (Princeton, MIT) • Li-Shiuan Peh (MIT)
Motivation • CMP era is here… • Enabled by aggressive transistor scaling • shrinking transistor dimensions => unreliable silicon (10K-100K FITs; error frequency: months) [1,2] • [Figure: tiled CMP; tile labels C, P, P$, S$, NIC, R] • [1] R. Baumann (TI), IEEE Design & Test of Computers, vol. 22, no. 3, 2005 • [2] J. Graham (MoSys), EE Times, 2002
Motivation • CMP era is here… • Enabled by aggressive transistor scaling • shrinking transistor dimensions => unreliable silicon (10K-100K FITs; error frequency: months) • Goal: resilient cache coherence protocol • loss of a single coherence message: deadlock • [Figure: tiled CMP with a request and a data message exchanged between nodes R and S]
Outline • Motivation • Methodology • Walkthrough: a resilient transaction • Defining resilience properties • Enforcing resilience properties • Evaluation • Overhead • Performance • Conclusions
Walkthrough Example: transaction • [Figure: message sequence among initiator R, directory (dir), and sharers S1, S2, showing request (M), ack, and unblock; state labels S{ }, BM, M{ }, S, SM, M, I] • 1. initiator sends request to the directory • 2. directory forwards request to the sharers • 3. sharers invalidate their copies and acknowledge • 4. request completes and initiator sends unblock to the directory • 5. directory updates the sharing vector and may now process subsequent requests
Walkthrough Example: resilient transaction • [Figure: message sequence among R, dir, S1, S2; the first request (M) is lost and sent again] • 1. initiator sends request to the directory • 2. request is lost • 3. initiator resends request after a timeout • 4. directory forwards request to the sharers • (…transaction continues identically as before)
Walkthrough Example: resilient transaction • [Figure: message sequence among R, dir, S1, S2, showing request (M) and ack messages; directory labels S{ }, S{R,S1,S2}, BM] • 1. initiator resends its request
Walkthrough Example: resilient transaction • 1. initiator resends its request • tolerate a duplicate request: (1) transition to the same state (2) generate the same messages • [Figure: directory FSM with states S, BM, BS and transitions on request (S), request (M), and unblock; a duplicate request (M) arrives while the directory is in BM]
Walkthrough Example: resilient transaction • [Figure: message sequence among R, dir, S1, S2, showing request (M) and ack messages; directory labels S{ }, S{R,S1,S2}, BM] • 1. initiator resends its request • 2. directory forwards the request to the sharers (again)
Walkthrough Example: resilient transaction • tolerate a duplicate request: (1) transition to the same state (2) generate the same messages • [Figure: sharer FSM for S1, S2 with states S and I; request (M) generates an ack, and a duplicate request (M) generates the ack again]
Walkthrough Example: resilient transaction • [Figure: message sequence among R, dir, S1, S2; request (M) and acks are sent twice, and R moves from SM to M] • 1. initiator resends its request • 2. directory forwards the request to the sharers (again) • 3. sharers acknowledge (again) • (…transaction completes identically as before)
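The resend-and-tolerate behavior walked through above can be sketched in a few lines of Python. This is a simplified illustration, not the protocol itself: the `Sharer` class, `run_transaction`, and the `lost_rounds` fault model are hypothetical names, the directory is reduced to a loop that forwards the request to every sharer, and acks are collected afresh after each resend.

```python
class Sharer:
    """Hypothetical sharer cache with two states: S (shared) and I (invalid)."""

    def __init__(self):
        self.state = "S"

    def on_request_m(self):
        # Original request: S -> I. Duplicate request: I -> I.
        # Both cases end in the same state and generate the same message.
        self.state = "I"
        return "ack"


def run_transaction(sharers, lost_rounds):
    """Initiator side: resend request (M) after a timeout until all acks arrive.

    lost_rounds is an assumed fault model: the set of round numbers in which
    the ack from sharer 0 is dropped by the network.
    Returns the initiator's final state and the number of requests sent.
    """
    round_no = 0
    while True:
        acks = []
        for i, sharer in enumerate(sharers):
            reply = sharer.on_request_m()  # directory forwards the request
            if i == 0 and round_no in lost_rounds:
                continue                   # this ack is lost in the network
            acks.append(reply)
        if len(acks) == len(sharers):
            return "M", round_no + 1       # all acks received: SM -> M
        round_no += 1                      # timeout expired: resend request


sharers = [Sharer(), Sharer()]
final_state, requests_sent = run_transaction(sharers, lost_rounds={0})
```

Because the sharers answer a duplicate request exactly as they answered the original, the initiator needs no knowledge of which message was lost.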
Outline • Motivation • Methodology • Walkthrough: a resilient transaction • Defining resilience properties • Enforcing resilience properties • Evaluation • Overhead • Performance • Conclusions
Defining the Resilience Properties • message loss => transaction suspended • the requestor regenerates its request after timeout • each node that receives the regenerated request repeats: - same state transition - same outgoing messages • [Figure: requestor R sends a request; responses return to R]
Defining the Resilience Properties • Property 1: initiator remains transient throughout the transaction • Property 2: replicate msgs roll back to the same earlier state • Property 3: retain information to regenerate msgs • [Figure: initiator FSM stable => (request) => transient => … => transient => (last message) => stable; FSM fragments with states X, A, Y and messages msgA, msgB]
Outline • Motivation • Methodology • Walkthrough: a resilient transaction • Defining resilience properties • Enforcing resilience properties • Evaluation • Overhead • Performance • Conclusions
Enforcing Property 1 • Property 1: the initiator remains transient throughout a transaction, to be able to resend lost messages • [Figure: initiator FSM stable => (request) => transient => … => transient => (last message) => stable]
Enforcing Property 1 • Property 1: the initiator remains transient throughout a transaction, to be able to resend lost messages • counter-example: stable => (request) => transient => (response from dir; initiator sends unblock) => stable: initiator cannot resend unblock • Enforcement: - detect every outgoing message that transitions the initiator to a stable state - replace the stable state with a transient state, and wait for done • [Figure: corrected FSM, where the initiator stays transient after sending unblock and becomes stable on done]
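The enforcement step for Property 1 can be read as a rewrite over the initiator's transition table. The Python sketch below assumes a hypothetical table format of `(state, event, sent_message, next_state)` tuples and invents the `_wait` suffix for the new transient state; the `timeout` self-loop that resends the message is also an assumption, added to show why staying transient helps.

```python
STABLE = {"I", "S", "M"}  # assumed set of stable states for this example


def enforce_property1(transitions, stable=STABLE):
    """Rewrite transitions so no outgoing message lands in a stable state.

    transitions: list of (state, event, sent_message, next_state) tuples,
    where sent_message is None if the transition sends nothing.
    """
    rewritten = []
    for state, event, sent, nxt in transitions:
        if sent is not None and nxt in stable:
            wait = f"{nxt}_wait"  # hypothetical name for the new transient state
            # Send the message but stay transient until 'done' arrives.
            rewritten.append((state, event, sent, wait))
            rewritten.append((wait, "done", None, nxt))
            # While transient, the message can be resent after a timeout.
            rewritten.append((wait, "timeout", sent, wait))
        else:
            rewritten.append((state, event, sent, nxt))
    return rewritten


# The counter-example: the initiator sends unblock and goes straight to stable.
original = [
    ("I", "store", "request", "IM"),
    ("IM", "response", "unblock", "M"),
]
resilient = enforce_property1(original)
```

In the rewritten table the `unblock` transition ends in `M_wait`, so a lost `unblock` can still be regenerated before the initiator reaches `M`.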
Enforcing Property 2 • Property 2: replicate messages roll back to the earlier state the original message transitioned to • [Figure: FSM fragment in which msgA leads to state A, and a replicate msgA received later returns to A]
Enforcing Property 2 • Property 2: replicate messages roll back to the earlier state the original message transitioned to • problem: branches T1 and T2 from S merge into TM; for a replicate msgA received in TM, roll back to T1 or T2? • disassociate branches after merging point • [Figure: FSM before (S, T1, T2, TM) and after (S, T1, T2, TM1, TM2) the branches are separated]
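One way to picture "disassociate branches after merging point" is as a state-splitting pass over the FSM. The following Python sketch is an illustration under assumptions, not the authors' tool: the FSM is a hypothetical dict mapping `(state, event)` to the next state, the merged state is copied once per predecessor, and the event names in the example are invented.

```python
def split_merge_state(transitions, merged):
    """Give each branch entering `merged` its own copy of that state.

    transitions: dict mapping (state, event) -> next_state.
    Returns the rewritten dict and a rollback map from each copy to the
    branch state it came from.
    """
    preds = sorted({s for (s, _ev), nxt in transitions.items()
                    if nxt == merged and s != merged})
    copy_of = {p: f"{merged}{i}" for i, p in enumerate(preds, start=1)}

    rewritten = {}
    for (state, event), nxt in transitions.items():
        if state == merged:
            # Outgoing edges of the merged state are replicated for each copy.
            for copy in copy_of.values():
                rewritten[(copy, event)] = copy if nxt == merged else nxt
        elif nxt == merged:
            rewritten[(state, event)] = copy_of[state]
        else:
            rewritten[(state, event)] = nxt

    rollback = {copy: pred for pred, copy in copy_of.items()}
    return rewritten, rollback


# Hypothetical FSM: two branches from S merge into TM before reaching M.
fsm = {
    ("S", "ev1"): "T1",
    ("S", "ev2"): "T2",
    ("T1", "msgB"): "TM",
    ("T2", "msgC"): "TM",
    ("TM", "last"): "M",
}
split_fsm, rollback = split_merge_state(fsm, "TM")
```

After the split, the current state alone (`TM1` or `TM2`) identifies the branch, so the roll-back target of a replicate message is no longer ambiguous.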
Enforcing Property 3 • Property 3: retain info to regenerate every outgoing message, in case a replicate request is received • [Figure: a sharer in M holding unique data receives request (M) for requestor R via dir, sends the unique data, and moves to I]
Enforcing Property 3 • Property 3: retain info to regenerate every outgoing message, in case a replicate request is received • [Figure: on request (M) the sharer sends the unique data and moves from M to transient state TI, where it retains the data; R is in TM; an invalidate permission / invalidate ack exchange follows before the sharer reaches I]
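A small Python sketch of the idea behind Property 3, under assumptions: the `Owner` class and its method names are hypothetical, and the sketch assumes the cache in TI receives the invalidate permission and answers with the invalidate ack. The point it illustrates is that the data is dropped only once it can no longer be requested again.

```python
class Owner:
    """Hypothetical cache holding the only valid copy of a line (state M)."""

    def __init__(self, data):
        self.state = "M"
        self.data = data

    def on_request_m(self):
        if self.state == "M":
            self.state = "TI"  # transient: keep the data instead of going to I
        if self.state == "TI":
            # Original and replicate requests produce the same data message.
            return ("data", self.data)
        return None            # already invalid: nothing left to send

    def on_invalidate_permission(self):
        # Assumed direction: the permission arrives here, the ack goes back.
        self.state = "I"
        self.data = None
        return "invalidate ack"


owner = Owner(data=42)
first = owner.on_request_m()       # original request (M)
replicate = owner.on_request_m()   # replicate request after a lost data message
state_before_permission = owner.state
ack = owner.on_invalidate_permission()
```

Had the owner moved directly from M to I on the first request, the replicate request would find no data to send.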
Outline • Motivation • Methodology • Walkthrough: a resilient transaction • Defining resilience properties • Enforcing resilience properties • Evaluation • Overhead • Performance • Conclusions
Evaluation: Overhead • broadcast-based protocol (AMD Hammer, MOESI) • directory-based protocol (static directory node, MESI) • 9 to 17 states (4 to 5 bits) • 12 to 22 cache states (4 to 5 bits) • No state was introduced into the critical path of serving a request • [Figure: stable and transient state counts for each protocol]
Evaluation: Overhead • Miss Status Holding Register (MSHR) • fields: 13 bits (0 to 2^13), 6 bits, 1 bit, 64 bits • 11 bytes • 4-32 entries • total storage overhead: < 0.5 KB / core (worst-case: 2KB / core) (*) • (*) assuming a 64-node CMP with in-order cores
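The storage figures on this slide can be checked with a short calculation. The sketch below assumes the four listed bit widths make up one MSHR entry and that the 11 bytes is the per-entry size rounded up to whole bytes; the variable names are illustrative, and the 2KB worst case is not derived here.

```python
import math

field_bits = [13, 6, 1, 64]  # field widths listed on the slide
entry_bits = sum(field_bits)                 # 84 bits
entry_bytes = math.ceil(entry_bits / 8)      # rounded up to 11 bytes

# Total for the smallest and largest MSHR sizes listed (4 and 32 entries).
total_bytes = {entries: entries * entry_bytes for entries in (4, 32)}

for entries, size in total_bytes.items():
    print(f"{entries} entries: {size} bytes")
```

With 32 entries the total is 352 bytes, which agrees with the stated bound of under 0.5 KB per core.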
Evaluation: Performance Simulator: Wisconsin Multifacet GEMS
Evaluation: Performance • metric: runtime overhead vs. non-resilient baseline (lower is better) • directory protocol • [Figure: bar chart over SPLASH (fft, fmm, lu, radix, water-nsq, water-sp) and PARSEC (blackscholes, canneal, fluidanimate, swaptions, x264) benchmarks plus AVERAGE; labeled values: 11%, 7.4%, 3.5%, 1.8%, 1.4%, 1.1%]
Evaluation: Performance • metric: runtime overhead vs. non-resilient baseline • broadcast protocol • [Figure: bar chart over SPLASH (fft, fmm, lu, radix, water-nsq, water-sp) and PARSEC (blackscholes, canneal, fluidanimate, swaptions, x264) benchmarks plus AVERAGE; labeled values: 56%, 51%, 20.4%, 5.1%, 2.4%, 0.5%]
Outline • Motivation • Methodology • Walkthrough: a resilient transaction • Defining resilience properties • Enforcing resilience properties • Evaluation • Overhead • Performance • Conclusions
Conclusions We have presented a generic methodology: • coherence protocol -> resilient coherence protocol …by enforcing 3 properties • minimal hardware overhead (<2KB / node) • small performance overhead • directory-based protocol: 1.4% (1 fault / msec) • broadcast-based protocol: 2.4% (1 fault / msec)
Thank You! Questions?
BACKUP SLIDES
Why performance overhead? • transactions last longer => a request may have to wait longer for outstanding conflicting requests to complete • data remain in caches for longer (3-way handshake) => longer cache replacement duration • more messages are injected into the NoC => more network traffic => higher average NoC latency
Transaction Duration • B: baseline protocol, no faults • R: resilient protocol, 1 fault / 10 μsec • L1: transaction served by sharer's L1 • L2: transaction served by directory (L2) • [Figure: transaction duration, B vs. R; labeled increases: +18%, +12%]
Transaction Duration • B: baseline protocol, no faults • R: resilient protocol, 1 fault / 10 μsec • L1: transaction served by sharer's L1 • L2: transaction served by directory (L2) • large working sets, shared data => high number of requests (high traffic) • (!) retransmissions saturate the network • [Figure: transaction duration, B vs. R; labeled values: 11%, 24%]
Network Traffic • most congested link • average over all links
Enforcing the Resilience Properties • P2: a single message type transitions to a unique state in every FSM branch • Case 2: identical messages in same branch • [Figure: after request (M), R counts acks: SM + acks=0, SM + acks=1, SM + acks=2, then M; in general, msgA from X leads to T count=1 and msgA from Y leads to T count=2]
Enforcing the Resilience Properties • P2: a single message type transitions to a unique state in every FSM branch • Case 2: identical messages in same branch • [Figure: counter vs. per-sender bit vector; msgA from X: T count=1 vs. T [XYZ=100]; msgA from Y: T count=2 vs. T [XYZ=110]]
Enforcing the Resilience Properties • P2: a single message type transitions to a unique state in every FSM branch • Case 2: identical messages in same branch • [Figure: counter vs. per-sender bit vector; msgA from X: T count=1 vs. T [XYZ=100]; duplicate msgA from X: T count=2 vs. T [XYZ=100]]
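The difference between the counter and the per-sender bit vector shown in these slides can be reproduced with a short Python sketch. The function names are hypothetical; the senders X, Y, Z and the `XYZ` encoding follow the figure labels.

```python
def count_state(senders):
    """Counter-based tracking: every received msgA increments the count."""
    return len(senders)


def vector_state(senders, nodes=("X", "Y", "Z")):
    """Bit-vector tracking: one bit per sender, set when its msgA arrives."""
    bits = {node: 0 for node in nodes}
    for sender in senders:
        bits[sender] = 1  # setting an already-set bit changes nothing
    return "".join(str(bits[node]) for node in nodes)


two_distinct = ["X", "Y"]   # msgA from X, then msgA from Y
one_duplicate = ["X", "X"]  # msgA from X, then a duplicate msgA from X

print(count_state(two_distinct), vector_state(two_distinct))
print(count_state(one_duplicate), vector_state(one_duplicate))
```

The counter reaches 2 in both cases, so it cannot tell a duplicate from a second sender, while the bit vector stays at `100` when X's message is replicated.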