This paper discusses the challenges of ordering in ring-based chip multiprocessors and proposes the concepts of ordering point, greedy-order, and ring-order as potential solutions. It explores the benefits of ring interconnects and presents the performance of different coherence protocols. The paper concludes by introducing the concept of token counting for achieving stable and fast ordering in ring-based CMPs.
Coherence Ordering for Ring-based Chip Multiprocessors Mike Marty and Mark D. Hill University of Wisconsin-Madison
Overview • Rings are a viable interconnect for future CMPs • Problem: Ring != Bus for ordering • Bus-based snooping coherence not sufficient • Solutions: • ORDERING-POINT: establish an ordering point • GREEDY-ORDER: greedily order requests • RING-ORDER: complete requests in ring order • RING-ORDER offers STABLE and FAST performance
Outline • Introduction and Motivation • Ring-based Coherence Protocols • Application to a CMP • Results • Conclusion
Future CMPs Bus? Crossbar? Packet-Switched? Ring?
Ring Interconnect • Why? • Short, fast point-to-point links • Fewer (data) ports • Less complex than packet-switched • Simple, distributed arbitration • Exploitable ordering for coherence
Cache Coherence for a Ring • Ring is broadcast and offers ordering • Apply existing bus-based snooping protocols? • NO! • Order properties of ring are different
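Ring Order != Bus Order • (Ring diagram: nodes P12, P3, P6, P9 with requests A and B in flight) • One node observes the requests as {A, B}, another observes them as {B, A}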
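The ordering mismatch can be illustrated with a minimal sketch. It assumes a 12-node unidirectional ring, one hop per cycle, and that P12 issues A while P6 issues B in the same cycle. The function name `observation_order` and these timings are illustrative assumptions, not details from the talk.

```python
def observation_order(n_nodes, requests, observer):
    """Return request names in the order `observer` sees them.

    requests maps a name to (source_node, issue_cycle).
    Assumption: messages advance one hop per cycle in a single direction,
    so a message from `source` reaches `observer` after
    (observer - source) mod n_nodes hops.
    """
    arrivals = []
    for name, (source, issue_cycle) in requests.items():
        hops = (observer - source) % n_nodes
        arrivals.append((issue_cycle + hops, name))
    return [name for _, name in sorted(arrivals)]


# Hypothetical scenario: P12 issues A and P6 issues B in the same cycle.
requests = {"A": (12, 0), "B": (6, 0)}
print(observation_order(12, requests, observer=3))  # ['A', 'B']
print(observation_order(12, requests, observer=9))  # ['B', 'A']
```

On a bus every node would see the same order; here P3 and P9 disagree, which is why bus-based snooping cannot be applied unchanged.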
Snooping Protocols for Rings • Assumptions: • Unidirectional ring • Multiple rings per-address OK • Write-back, write-invalidate caches • Eager request forwarding • e.g., forward message then snoop • [Strauss et al. ISCA 2006] • Can total bus order be recreated? YES
ORDERING-POINT Example (ring diagram with a designated ordering point; one node holds the block in S, another in O) • P9 issues a getM for a Store; the request is inactive until it reaches the ordering point • The request is ordered ("own request ordered") and the O and S copies are invalidated (O → I, S → I) • An ACK and Data are sent to P9 • Meanwhile, P6 issues a getM for its own Store • P9's Store completes; Data is sent to P6
Bottom line: ORDERING-POINT STABLE • Requests totally ordered + Stable, predictable performance • Slow – Requests not active immediately • Extra control overhead – N + N/2 hops for request message – N/2 hops for Ack message • Can requests be active immediately? YES (e.g., IBM Power4/5)
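The control overhead quoted above can be made concrete with a small calculation. This sketch only evaluates the slide's hop counts for a given ring size; the function name `ordering_point_hops` and the 12-node example are illustrative assumptions.

```python
def ordering_point_hops(n_nodes):
    """Hop counts for ORDERING-POINT as stated on the slide.

    Request message: N + N/2 hops. Ack message: N/2 hops.
    """
    request_hops = n_nodes + n_nodes / 2
    ack_hops = n_nodes / 2
    return request_hops, ack_hops


request_hops, ack_hops = ordering_point_hops(12)
print(request_hops, ack_hops)  # 18.0 6.0
```

For a 12-node ring, the request message accounts for 18 hops and the Ack for 6.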
GREEDY-ORDER Example (ring diagram, P1–P12; one node holds the block in S, another in O) • P9 issues a getM for a Store; the request carries a response field • The S copy is invalidated (S → I) • The owner sets the response to ACK, invalidates its copy (O → I), and will send data • P6 issues a getM for its own Store while P9's request is still in flight • P9 is acked and receives Data (M) • P6's getM returns with response RETRY
Bottom line: GREEDY-ORDER FAST • Average case is fast + Request active immediately • Requires combined snoop response • Synchronous timing of snoops for efficiency • Resorts to unbounded # of retries in conflict • Will conditions eventually allow request completion? • Probabilistic system (e.g. Ethernet)
Recap STABLE FAST • Existing Solutions: • ORDERING-POINT • Establishes total order • Extra latency and control message overhead • GREEDY-ORDER • Fast in common case • Unbounded retries • Ideal Solution • Fast for average case • Stable for worst case (no retries)
New Approach: RING-ORDER STABLE FAST • + Requests complete in order of ring position • Fully exploits ring ordering • + Initial request always succeeds • No retries, no ordering point • Fast, stable, predictable performance • Key: Use token counting • All tokens to write, one token to read
RING-ORDER Example (ring diagram, P1–P12; legend: token, priority token) • P9 issues a getM for a Store • The response carries FurthestDest = P9 • P6 issues a getM for its own Store • Both Stores complete (P9 and P6)
RING-ORDER Recap • Key: Exploit order of ring with token counting • Requests never race with tokens • Furthest Destination field • Carried in responses, tracked in MSHRs • Determines if tokens need to keep moving • Priority token ensures liveness • Data satisfies all requestors during traversal
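The token-counting rule ("all tokens to write, one token to read") can be sketched as below. This is a simplified illustration, not the protocol from the paper: the token count `TOTAL_TOKENS`, the `BlockCopy` class, and `move_tokens` are hypothetical names, and the sketch ignores data validity, the priority token, and the Furthest Destination field.

```python
from dataclasses import dataclass

TOTAL_TOKENS = 4  # hypothetical: fixed number of tokens per block


@dataclass
class BlockCopy:
    """Token holdings of one cache (or memory) for a single block."""
    tokens: int = 0

    def can_read(self) -> bool:
        # One token is enough to read.
        return self.tokens >= 1

    def can_write(self) -> bool:
        # Writing requires holding every token for the block.
        return self.tokens == TOTAL_TOKENS


def move_tokens(src: BlockCopy, dst: BlockCopy, count: int) -> None:
    """Transfer tokens between holders; tokens are never created or lost."""
    if not 0 <= count <= src.tokens:
        raise ValueError("cannot move more tokens than the source holds")
    src.tokens -= count
    dst.tokens += count


memory = BlockCopy(tokens=TOTAL_TOKENS)
p6, p9 = BlockCopy(), BlockCopy()

move_tokens(memory, p9, TOTAL_TOKENS)  # P9 collects every token
print(p9.can_write(), p6.can_read())   # True False

move_tokens(p9, p6, 1)                 # P9 hands one token to P6
print(p9.can_write(), p9.can_read(), p6.can_read())  # False True True
```

Because the total number of tokens is conserved, at most one holder can satisfy `can_write` at a time, and no reader can exist while a writer holds all tokens.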
Interfacing with Memory Controllers • Problem: When should memory respond? • Solution: 1 bit per block of memory • Owner bit for ORDERING-POINT and GREEDY-ORDER • Token-count bit for RING-ORDER • All or none tokens • Cache the bits in a Memory Interface Cache • Eliminates costly DRAM accesses • Enables GREEDY-ORDER to meet snoop timing
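One way to picture the token-count bit for RING-ORDER is the sketch below. It assumes memory responds to a request only when its bit says it holds all tokens, and that memory starts out holding all tokens for every block. Both are assumptions for illustration, as are the class and method names. The Memory Interface Cache is not modeled.

```python
class MemoryTokenBits:
    """One bit per block: memory holds either all tokens or none."""

    def __init__(self):
        # Assumption: blocks not listed here still have all tokens in memory.
        self._tokens_elsewhere = set()

    def holds_all_tokens(self, block: int) -> bool:
        return block not in self._tokens_elsewhere

    def should_respond(self, block: int) -> bool:
        # Assumed policy: memory answers only when it holds all tokens.
        return self.holds_all_tokens(block)

    def release_tokens(self, block: int) -> None:
        """Memory hands all tokens for `block` to a cache."""
        self._tokens_elsewhere.add(block)

    def accept_tokens(self, block: int) -> None:
        """All tokens for `block` return to memory (e.g. on a write-back)."""
        self._tokens_elsewhere.discard(block)


mem = MemoryTokenBits()
print(mem.should_respond(0x40))  # True
mem.release_tokens(0x40)
print(mem.should_respond(0x40))  # False
mem.accept_tokens(0x40)
print(mem.should_respond(0x40))  # True
```

The owner bit for ORDERING-POINT and GREEDY-ORDER could be modeled the same way, with "owner" in place of "holds all tokens".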
Outline • Introduction and Motivation • Ring-based Coherence Protocols • Application to a CMP • Results • Methodology • Runtime • Traffic • Performance Stability • Conclusion
Methodology • Full-system Simulation • Virtutech Simics • Wisconsin GEMS • GPL software • http://www.cs.wisc.edu/gems • Workloads: • Commercial: OLTP, Apache, SpecJBB, Zeus • Scientific: OMPart, OMPfma3d, OMPmgrid • Protocols: • ORDERING-POINT • GREEDY-ORDER (called –IDEAL in paper) • RING-ORDER
Simulation Parameters 1/2 • SPARC, 4GHz • 64KB I&D, 4-way, 2-cycle access • 1MB, 4-way, 15-cycle data access • 8MB, 16-way, 25-cycle bank access
Simulation Parameters 2/2 • 275-cycle DRAM access • Memory Interface Cache: 128KB, 16-way, 256-bits per tag • Ring Link: 8-cycles total delay, 80-bytes per cycle
Normalized Runtime RING-ORDER is up to 52% faster than ORDERING-POINT
Ring Bandwidth RING-ORDER uses up to 34% less bandwidth
GREEDY-ORDER Starvation (event trace across Processor 3, Processor 4, Processor 6, and Processor 7; left column is time)
631597033 issue getM
......045 RETRY #2
......059 RETRY #10
......081 Complete
......083 RETRY #1
......087 ack p7, send data
......111 issue getM
......116 RETRY #11
......127 Complete
......140 RETRY #2
......148 ack p3, send data
......161 RETRY #1
......180 issue getM
......197 RETRY #3
......198 Complete
......205 ack p7, send data
......218 RETRY #2
......237 issue getM
......254 RETRY #4
......255 Complete
......262 ack p3, send data
+70,000 cycles: issue getM, RETRY #1402
Retries RING-ORDER offers stable, bounded performance
Conclusion STABLE FAST • Rings a viable interconnect for CMPs • Ring != Bus for ordering • RING-ORDER protocol offers best of: • ORDERING-POINT (stable) and, • GREEDY-ORDER (fast) • P.S. RING-ORDER requires NO system-wide snoop response • Useful for hierarchy of rings
Flexible Snooping [Strauss et al. ISCA 2006] • Eager vs. Lazy forwarding • Key Differences: • Targets coherence between bus-based CMPs • Logical ring on message-passing interconnect • Protocol similar to GREEDY-ORDER • Uses a separate combined snoop response message • RING-ORDER also works with logical ring • Possible to extend protocol to send data off the ring • Lazy vs. Eager Forwarding applies to RING-ORDER • Synergistic fit to reduce snoop power