Coherence Ordering for Ring-based Chip Multiprocessors Mike Marty and Mark D. Hill University of Wisconsin-Madison
Overview • Rings are a viable interconnect for future CMPs • Problem: Ring != Bus for ordering • Bus-based snooping coherence is not sufficient • Solutions: • ORDERING-POINT: establish an ordering point • GREEDY-ORDER: greedily order requests • RING-ORDER: complete requests in ring order • RING-ORDER offers STABLE and FAST performance
Outline • Introduction and Motivation • Ring-based Coherence Protocols • Application to a CMP • Results • Conclusion
Future CMPs Bus? Crossbar? Packet-Switched? Ring?
Ring Interconnect • Why? • Short, fast point-to-point links • Fewer (data) ports • Less complex than packet-switched • Simple, distributed arbitration • Exploitable ordering for coherence
Cache Coherence for a Ring • Ring is broadcast and offers ordering • Apply existing bus-based snooping protocols? • NO! • Order properties of ring are different
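Ring Order != Bus Order [Figure: a ring with nodes P12, P3, P6, and P9. Requests A and B are shown next to P12 and P6. Some nodes observe the requests in the order {A, B}, others in the order {B, A}.]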
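The effect on this slide can be reproduced with a small sketch. It assumes a unidirectional ring with a uniform one-hop delay, four nodes standing in for P12, P3, P6, and P9, and that A is issued at P12 and B at P6 at the same time; the function name `arrival_order` and the timing model are illustrative, not taken from the talk.

```python
def arrival_order(n_nodes, requests):
    """Return, for each node, the order in which it observes the requests.

    requests maps a request name to (source node index, issue time).
    Assumption: unidirectional ring, one time unit per hop.
    """
    orders = {}
    for node in range(n_nodes):
        seen = []
        for name, (src, issue_time) in requests.items():
            hops = (node - src) % n_nodes  # distance in the ring direction
            seen.append((issue_time + hops, name))
        orders[node] = [name for _, name in sorted(seen)]
    return orders


# Indices 0..3 stand for P12, P3, P6, P9 (hypothetical mapping).
orders = arrival_order(4, {"A": (0, 0), "B": (2, 0)})
for node, label in enumerate(["P12", "P3", "P6", "P9"]):
    print(label, orders[node])
```

Under these assumptions P12 and P3 observe A before B, while P6 and P9 observe B before A, so no single total order emerges the way it would on a bus.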
Snooping Protocols for Rings • Assumptions: • Unidirectional ring • Multiple rings per-address OK • Write-back, write-invalidate caches • Eager request forwarding • e.g., forward message then snoop • [Strauss et al. ISCA 2006] • Can total bus order be recreated? YES
ORDERING-POINT Example [Figure, six animation frames on a 12-node ring; the node at the top of the ring is the ordering point and initially holds the block in S, and another node is the owner (O). Frame 1: P9 issues a getM for a Store; the request is inactive until it reaches the ordering point. Frames 2-3: P9's request is ordered ("own request ordered"); the owner goes O → I and the ordering point goes S → I. Frame 4: an ACK for P9 circulates and data is sent to P9. Frame 5: P6 issues a getM for its own Store while the ACK and data for P9 are still in flight. Frame 6: P9's Store completes and data is sent to P6.]
Bottom line: ORDERING-POINT STABLE • Requests totally ordered + Stable, predictable performance • Slow – Requests not active immediately • Extra control overhead – N + N/2 hops for request message – N/2 hops for Ack message • Can requests be active immediately? YES (e.g., IBM Power4/5)
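The control overhead quoted above can be tallied with a few lines of arithmetic. The split in the comments (N/2 hops on average to reach the ordering point, then N hops around the ring) is one reading of the slide's "N + N/2" figure and should be treated as an assumption; the function name is hypothetical.

```python
def ordering_point_control_hops(n_nodes):
    """Average control-message hops per request under ORDERING-POINT.

    Uses the figures from the slide: N + N/2 for the request, N/2 for the Ack.
    """
    # Assumed reading: N/2 hops (average) to reach the ordering point while
    # inactive, then N hops around the ring once the request is active.
    request_hops = n_nodes + n_nodes / 2
    ack_hops = n_nodes / 2
    return request_hops, ack_hops


request_hops, ack_hops = ordering_point_control_hops(12)
print(request_hops, ack_hops, request_hops + ack_hops)
```

For the 12-node ring drawn in the examples this gives 18 request hops and 6 Ack hops, 24 control hops in total.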
GREEDY-ORDER Example [Figure, five animation frames on a 12-node ring. Frame 1: P9 issues a getM for a Store; the request carries a response field. Frame 2: the owner acknowledges the request (response: ACK), goes O → I, and will send data. Frame 3: P6 issues its own getM for a Store while P9's request is still circulating. Frame 4: P9 is acked and data is sent to P9. Frame 5: P9 reaches M; P6's getM returns with response RETRY.]
Bottom line: GREEDY-ORDER FAST • Average case is fast + Request active immediately • Requires combined snoop response • Synchronous timing of snoops for efficiency • Resorts to unbounded # of retries in conflict • Will conditions eventually allow request completion? • Probabilistic system (e.g. Ethernet)
Recap • Existing Solutions: • ORDERING-POINT (STABLE) • Establishes total order • Extra latency and control message overhead • GREEDY-ORDER (FAST) • Fast in common case • Unbounded retries • Ideal Solution • Fast for average case • Stable for worst case (no retries)
New Approach: RING-ORDER (STABLE and FAST) • + Requests complete in order of ring position • Fully exploits ring ordering • + Initial request always succeeds • No retries, no ordering point • Fast, stable, predictable performance • Key: Use token counting • All tokens to write, one token to read
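The token-counting rule ("all tokens to write, one token to read") can be sketched as a permission check. This is a minimal illustration, not the protocol itself: the class name `CachedBlock`, the constant `TOTAL_TOKENS`, and the choice of 12 tokens per block are assumptions, and the priority token is left out.

```python
from dataclasses import dataclass

TOTAL_TOKENS = 12  # assumption: the slide does not state the token count per block


@dataclass
class CachedBlock:
    tokens: int = 0  # tokens this cache currently holds for the block

    def can_read(self) -> bool:
        # One token is enough to read.
        return self.tokens >= 1

    def can_write(self) -> bool:
        # Writing requires every token in the system.
        return self.tokens == TOTAL_TOKENS

    def receive(self, count: int) -> None:
        # Tokens arrive in a response travelling around the ring.
        if count < 0 or self.tokens + count > TOTAL_TOKENS:
            raise ValueError("token count out of range")
        self.tokens += count


block = CachedBlock()
block.receive(1)
read_only = (block.can_read(), block.can_write())
block.receive(TOTAL_TOKENS - 1)
writable = (block.can_read(), block.can_write())
print(read_only, writable)
```

With a single token the block is readable but not writable; only after the remaining tokens arrive does the write permission check pass.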
RING-ORDER Example [Figure, four animation frames on a 12-node ring; legend: token, priority token. Frame 1: P9 issues a getM for a Store. Frame 2: tokens move around the ring with FurthestDest = P9. Frame 3: P6 issues a getM for its own Store while FurthestDest is still P9. Frame 4: the Stores at P9 and P6 both complete.]
RING-ORDER Recap • Key: Exploit order of ring with token counting • Requests never race with tokens • Furthest Destination field • Carried in responses, tracked in MSHRs • Determines if tokens need to keep moving • Priority token ensures liveness • Data satisfies all requestors during traversal
Interfacing with Memory Controllers • Problem: When should memory respond? • Solution: 1 bit per block of memory • Owner bit for ORDERING-POINT and GREEDY-ORDER • Token-count bit for RING-ORDER • All or none tokens • Cache the bits in a Memory Interface Cache • Eliminates costly DRAM accesses • Enables GREEDY-ORDER to meet snoop timing
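A rough sketch of the one-bit-per-block idea follows. It assumes the bit means "memory should respond" (memory is the owner, or holds all tokens), that memory starts out holding every block, and that the Memory Interface Cache is a small LRU cache of bits; the class `MemoryInterface`, its methods, and the capacity are hypothetical names and values chosen for illustration.

```python
from collections import OrderedDict


class MemoryInterface:
    def __init__(self, capacity=2):
        self.capacity = capacity        # entries in the bit cache (assumed)
        self.dram_bits = {}             # per-block bit stored alongside DRAM
        self.bit_cache = OrderedDict()  # stands in for the Memory Interface Cache
        self.dram_bit_reads = 0         # counts costly DRAM lookups of the bit

    def _install(self, addr, bit):
        self.bit_cache[addr] = bit
        self.bit_cache.move_to_end(addr)
        if len(self.bit_cache) > self.capacity:
            # Evict the least recently used entry and write its bit back.
            victim, victim_bit = self.bit_cache.popitem(last=False)
            self.dram_bits[victim] = victim_bit

    def should_respond(self, addr):
        if addr in self.bit_cache:
            self.bit_cache.move_to_end(addr)
            return self.bit_cache[addr]
        self.dram_bit_reads += 1
        bit = self.dram_bits.get(addr, True)  # assumption: memory starts as holder
        self._install(addr, bit)
        return bit

    def set_bit(self, addr, bit):
        # Called when ownership (or all tokens) leaves or returns to memory.
        self._install(addr, bit)


mem = MemoryInterface()
first = mem.should_respond(0x40)   # miss in the bit cache, one DRAM lookup
mem.set_bit(0x40, False)           # a cache now owns the block / holds all tokens
second = mem.should_respond(0x40)  # answered from the bit cache
print(first, second, mem.dram_bit_reads)
```

The second lookup is answered without touching DRAM, which is the saving the slide attributes to caching the bits.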
Outline • Introduction and Motivation • Ring-based Coherence Protocols • Application to a CMP • Results • Methodology • Runtime • Traffic • Performance Stability • Conclusion
Methodology • Full-system Simulation • Virtutech Simics • Wisconsin GEMS • GPL software • http://www.cs.wisc.edu/gems • Workloads: • Commercial: OLTP, Apache, SpecJBB, Zeus • Scientific: OMPart, OMPfma3d, OMPmgrid • Protocols: • ORDERING-POINT • GREEDY-ORDER (called –IDEAL in paper) • RING-ORDER
Simulation Parameters 1/2 [Table flattened in extraction; row labels lost] • SPARC, 4 GHz • 64 KB I&D, 4-way, 2-cycle access • 1 MB, 4-way, 15-cycle data access • 8 MB, 16-way, 25-cycle bank access
Simulation Parameters 2/2 [Table flattened in extraction] • 275-cycle DRAM access • Memory Interface Cache: 128 KB, 16-way, 256 bits per tag • Ring Link: 8 cycles total delay, 80 bytes per cycle
Normalized Runtime RING-ORDER is up to 52% faster than ORDERING-POINT
Ring Bandwidth RING-ORDER uses up to 34% less bandwidth
GREEDY-ORDER Starvation [Trace of getM requests from Processor 3, Processor 4, Processor 6, and Processor 7; the per-processor column of each event was lost in extraction. Events in time order:]
631597033 issue getM
......045 RETRY #2
......059 RETRY #10
......081 Complete
......083 RETRY #1
......087 ack p7, send data
......111 issue getM
......116 RETRY #11
......127 Complete
......140 RETRY #2
......148 ack p3, send data
......161 RETRY #1
......180 issue getM
......197 RETRY #3
......198 Complete
......205 ack p7, send data
......218 RETRY #2
......237 issue getM
......254 RETRY #4
......255 Complete
......262 ack p3, send data
(no time shown) issue getM
+70,000 cycles RETRY #1402
Retries RING-ORDER offers stable, bounded performance
Conclusion • Rings are a viable interconnect for CMPs • Ring != Bus for ordering • RING-ORDER protocol offers the best of both: • ORDERING-POINT (stable) • GREEDY-ORDER (fast) • P.S. RING-ORDER requires NO system-wide snoop response • Useful for a hierarchy of rings
Flexible Snooping [Strauss et al. ISCA 2006] • Eager vs. Lazy forwarding • Key Differences: • Targets coherence between bus-based CMPs • Logical ring on message-passing interconnect • Protocol similar to GREEDY-ORDER • Uses a separate combined snoop response message • RING-ORDER also works with logical ring • Possible to extend protocol to send data off the ring • Lazy vs. Eager Forwarding applies to RING-ORDER • Synergistic fit to reduce snoop power