Coherence Ordering for Ring-based Chip Multiprocessors

Mike Marty and Mark D. Hill, University of Wisconsin-Madison

Overview • Rings are a viable interconnect for future CMPs • Problem: Ring != Bus for ordering • Bus-based snooping coherence is not sufficient • Solutions: • ORDERING-POINT: establish an ordering point • GREEDY-ORDER: greedily order requests • RING-ORDER: complete requests in ring order • RING-ORDER offers STABLE and FAST performance

Outline • Introduction and Motivation • Ring-based Coherence Protocols • Application to a CMP • Results • Conclusion

Future CMPs Bus? Crossbar? Packet-Switched? Ring?

The “Cell” Processor

Ring Interconnect • Why? • Short, fast point-to-point links • Fewer (data) ports • Less complex than packet-switched • Simple, distributed arbitration • Exploitable ordering for coherence

Cache Coherence for a Ring

Cache Coherence for a Ring • Ring is broadcast and offers ordering • Apply existing bus-based snooping protocols? • NO! • Order properties of ring are different

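Ring Order != Bus Order [Figure: ring with nodes P12, P3, P6, and P9 carrying two messages, A and B. One node observes the order {A, B}; another observes {B, A}.]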
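The ordering mismatch on this slide can be reproduced with a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' model: the ring is unidirectional in increasing node order, every hop takes the same time, and the senders of A and B (P12 and P6 here) are chosen for illustration because the slide does not say which nodes inject the messages. The function name `arrival_order` is hypothetical.

```python
def arrival_order(observer, senders, n=12):
    """Order in which `observer` sees messages on a unidirectional ring.

    `senders` maps a message name to the node that injects it. All
    messages are injected at the same instant (an assumption), so
    arrival order is just hop distance from sender to observer.
    """
    def hops(src, dst):
        # Hops from src to dst when traffic flows toward increasing ids.
        return (dst - src) % n

    return [name for name, src in
            sorted(senders.items(), key=lambda kv: hops(kv[1], observer))]


senders = {"A": 12, "B": 6}            # assumed injection points
order_at_p3 = arrival_order(3, senders)
order_at_p9 = arrival_order(9, senders)
print(order_at_p3, order_at_p9)
```

Under these assumptions P3 sees A before B while P9 sees B before A, so no single total order is observed by all nodes, which is what a bus would provide.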

Snooping Protocols for Rings • Assumptions: • Unidirectional ring • Multiple rings per-address OK • Write-back, write-invalidate caches • Eager request forwarding • e.g., forward message then snoop • [Strauss et al. ISCA 2006] • Can total bus order be recreated? YES

ORDERING-POINT Example [Figure: six-frame animation of a ring of processors with a designated ordering point. Labels, in frame order: (1) P9 issues a getM for a Store; the request is marked inactive. (2) "own request ordered"; O → I. (3) S → I. (4) ACK for P9; Data to P9. (5) P6 issues a getM for a Store. (6) Data to P6; P9's Store completes.]

Bottom line: ORDERING-POINT (STABLE) • Requests totally ordered + Stable, predictable performance • Slow – Requests not active immediately • Extra control overhead – N + N/2 hops for request message – N/2 hops for Ack message • Can requests be active immediately? YES (e.g., IBM Power4/5)
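The hop counts on this slide can be checked with a short calculation. The sketch below assumes one reading of the slide: a request first travels, inactive, from the requester to the ordering point (N/2 hops on average on a unidirectional ring) and then makes one full active lap of N hops, while the Ack travels from the ordering point back to the requester. The function names are hypothetical.

```python
from fractions import Fraction


def avg_hops_to(target, n):
    """Average hops on a unidirectional ring from every other node to `target`."""
    others = [p for p in range(n) if p != target]
    return Fraction(sum((target - p) % n for p in others), len(others))


def ordering_point_control_hops(n, ordering_point=0):
    """Return (request_hops, ack_hops), averaged over all requesters."""
    to_point = avg_hops_to(ordering_point, n)  # inactive leg, averages N/2
    request = n + to_point                     # plus one full active lap (assumed)
    ack = to_point                             # return leg averages N/2 by symmetry
    return request, ack


request_hops, ack_hops = ordering_point_control_hops(12)
print(request_hops, ack_hops)
```

For a 12-node ring this gives 18 request hops and 6 Ack hops, matching N + N/2 and N/2.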

GREEDY-ORDER Example [Figure: five-frame animation of a 12-processor ring (P1 to P12). Labels, in frame order: (1) P9 issues a getM for a Store, with a response field; S → I. (2) O → I; response: ACK; "will send data". (3) P6 issues a getM for its own Store. (4) "acked"; Data to P9. (5) M; P6's getM response is RETRY.]

Bottom line: GREEDY-ORDER (FAST) • Average case is fast + Request active immediately • Requires combined snoop response • Synchronous timing of snoops for efficiency • Resorts to unbounded # of retries in conflict • Will conditions eventually allow request completion? • Probabilistic system (e.g., Ethernet)

Recap • Existing Solutions: • ORDERING-POINT • Establishes total order • Extra latency and control message overhead • GREEDY-ORDER • Fast in common case • Unbounded retries • Ideal Solution • Fast for average case • Stable for worst case (no retries)

New Approach: RING-ORDER (STABLE and FAST) • + Requests complete in order of ring position • Fully exploits ring ordering • + Initial request always succeeds • No retries, no ordering point • Fast, stable, predictable performance • Key: Use token counting • All tokens to write, one token to read
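The token-counting rule on this slide ("all tokens to write, one token to read") can be written down directly. This is a minimal sketch, assuming a fixed token count per block of 12 (one per processor, chosen only for illustration); the function names are hypothetical and this is not the authors' implementation.

```python
TOTAL_TOKENS = 12  # assumed: one token per processor on a 12-node ring


def can_read(tokens_held):
    """A node may read a block if it holds at least one token."""
    return tokens_held >= 1


def can_write(tokens_held, total=TOTAL_TOKENS):
    """A node may write a block only if it holds every token."""
    return tokens_held == total


def writer_excludes_readers(holdings, total=TOTAL_TOKENS):
    """Check that when some node can write, no other node can read."""
    assert sum(holdings) == total, "tokens are conserved"
    writers = [i for i, t in enumerate(holdings) if can_write(t, total)]
    readers = [i for i, t in enumerate(holdings) if can_read(t)]
    return not writers or readers == writers


shared = [1, 1, 10] + [0] * 9       # three readers, no writer
exclusive = [0] * 11 + [12]         # one writer, no other readers
print(writer_excludes_readers(shared), writer_excludes_readers(exclusive))
```

Because tokens are conserved, a node holding all of them implies every other node holds none, so a writer can never coexist with a reader.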

RING-ORDER Example [Figure: four-frame animation of a 12-processor ring (P1 to P12); the legend distinguishes ordinary tokens from the priority token. Labels, in frame order: (1) P9 issues a getM for a Store. (2) FurthestDest = P9. (3) P6 issues a getM for a Store; FurthestDest = P9. (4) P9's Store completes and P6's Store completes.]

RING-ORDER Recap • Key: Exploit order of ring with token counting • Requests never race with tokens • Furthest Destination field • Carried in responses, tracked in MSHRs • Determines if tokens need to keep moving • Priority token ensures liveness • Data satisfies all requestors during traversal
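The idea that requests complete in ring order, and that a furthest-destination field tells tokens when to stop moving, can be illustrated with a toy model. This is a deliberately simplified sketch and not the protocol from the talk: it assumes all tokens start at one node that is not itself a requester, that the ring is unidirectional in increasing node order, and that the furthest destination is simply the pending requester furthest downstream of that node. The priority token, MSHRs, and data movement are not modeled, and the function name is hypothetical.

```python
def ring_order_completion(holder, requesters, n=12):
    """Toy model: tokens leave `holder` and visit nodes in ring order.

    Returns (completion_order, final_holder). `requesters` must be
    non-empty and must not contain `holder` (assumptions of this sketch).
    """
    def dist(src, dst):
        return (dst - src) % n

    pending = set(requesters)
    # Hypothetical FurthestDest: the requester furthest downstream of the holder.
    furthest_dest = max(pending, key=lambda r: dist(holder, r))

    completed = []
    node = holder
    while True:
        node = node % n + 1              # tokens move one hop (ids are 1..n)
        if node in pending:
            completed.append(node)       # all tokens present: the store completes
            pending.discard(node)
            if node == furthest_dest:
                return completed, node   # nobody downstream needs the tokens


from_p3 = ring_order_completion(3, {9, 6})
from_p7 = ring_order_completion(7, {9, 6})
print(from_p3, from_p7)
```

In this model the same two requesters complete in a different order depending on where the tokens start, and neither ever retries: completion order is determined purely by ring position.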

RING-ORDER vs. Token Coherence

Applying to Baseline CMP

Interfacing with Memory Controllers • Problem: When should memory respond? • Solution: 1-bit per block of memory • Owner bit for ORDERING-POINT and GREEDY-ORDER • Token-count bit for RING-ORDER • All or none tokens • Cache the bits in a Memory Interface Cache • Eliminates costly DRAM accesses • Enable GREEDY-ORDER to meet snoop timing
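The one-bit-per-block scheme for RING-ORDER can be sketched as a small lookup structure. This is an illustrative sketch under assumptions not stated on the slide: memory starts out holding all tokens for every block, hands over all of them on the first request, and gets them back on a writeback. The class and method names are hypothetical, and the Memory Interface Cache is modeled as a plain dictionary.

```python
class MemoryTokenBits:
    """One bit per block: True means memory holds all tokens, False means none."""

    def __init__(self):
        # Block address -> bit. Absent blocks default to True (assumed:
        # memory initially holds all tokens for every block).
        self._has_all = {}

    def should_respond(self, block):
        """Memory responds only when it holds the block's tokens."""
        return self._has_all.get(block, True)

    def on_request(self, block):
        """Return True if memory supplies data and tokens, then clear the bit."""
        if self.should_respond(block):
            self._has_all[block] = False
            return True
        return False

    def on_writeback(self, block):
        """All tokens return to memory."""
        self._has_all[block] = True


mem = MemoryTokenBits()
first = mem.on_request(0x40)    # memory has all tokens, so it responds
second = mem.on_request(0x40)   # tokens are on chip, so memory stays silent
mem.on_writeback(0x40)
print(first, second, mem.should_respond(0x40))
```

Keeping the bit in an on-chip structure is what lets the decision be made without a DRAM access, which is the point the slide makes about the Memory Interface Cache.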

Outline • Introduction and Motivation • Ring-based Coherence Protocols • Application to a CMP • Results • Methodology • Runtime • Traffic • Performance Stability • Conclusion

Methodology • Full-system Simulation • Virtutech Simics • Wisconsin GEMS • GPL software • http://www.cs.wisc.edu/gems • Workloads: • Commercial: OLTP, Apache, SpecJBB, Zeus • Scientific: OMPart, OMPfma3d, OMPmgrid • Protocols: • ORDERING-POINT • GREEDY-ORDER (called –IDEAL in paper) • RING-ORDER

Simulation Parameters 1/2 • SPARC, 4GHz • 64KB I&D, 4-way, 2-cycle access • 1MB, 4-way, 15-cycle data access • 8MB, 16-way, 25-cycle bank access

Simulation Parameters 2/2 • 275-cycle DRAM access • Memory Interface Cache: 128KB, 16-way, 256-bits per tag • Ring Link: 8-cycles total delay, 80-bytes per cycle

Normalized Runtime • RING-ORDER is up to 52% faster than ORDERING-POINT

Ring Bandwidth • RING-ORDER uses up to 34% less bandwidth

GREEDY-ORDER Starvation • Columns: Processor 3, Processor 4, Processor 6, Processor 7, time • Events, each followed by its time: issue getM (631597033); RETRY #2 (......045); RETRY #10 (......059); Complete (......081); RETRY #1 (......083); ack p7, send data (......087); issue getM (......111); RETRY #11 (......116); Complete (......127); RETRY #2 (......140); ack p3, send data (......148); RETRY #1 (......161); issue getM (......180); RETRY #3 (......197); Complete (......198); ack p7, send data (......205); RETRY #2 (......218); issue getM (......237); RETRY #4 (......254); Complete (......255); ack p3, send data (......262); issue getM; RETRY #1402 (+70,000 cycles)

Retries • RING-ORDER offers stable, bounded performance

Conclusion • Rings are a viable interconnect for CMPs • Ring != Bus for ordering • RING-ORDER protocol offers the best of: • ORDERING-POINT (stable), and • GREEDY-ORDER (fast) • P.S. RING-ORDER requires NO system-wide snoop response • Useful for hierarchy of rings

BACKUP SLIDES

Flexible Snooping [Strauss et al. ISCA 2006] • Eager vs. Lazy forwarding • Key Differences: • Targets coherence between bus-based CMPs • Logical ring on message-passing interconnect • Protocol similar to GREEDY-ORDER • Uses a separate combined snoop response message • RING-ORDER also works with logical ring • Possible to extend protocol to send data off the ring • Lazy vs. Eager Forwarding applies to RING-ORDER • Synergistic fit to reduce snoop power
