
Coherence Ordering for Ring-based Chip Multiprocessors

Mike Marty and Mark D. Hill, University of Wisconsin-Madison


Presentation Transcript


  1. Coherence Ordering for Ring-based Chip Multiprocessors Mike Marty and Mark D. Hill University of Wisconsin-Madison

  2. Overview • Rings a viable interconnect for future CMPs • Problem: Ring != Bus for ordering • Bus-based snooping coherence not sufficient • Solutions: • ORDERING-POINT: establish an ordering point • GREEDY-ORDER: greedily order requests • RING-ORDER: complete requests in ring order • RING-ORDER offers STABLE and FAST performance

  3. Outline • Introduction and Motivation • Ring-based Coherence Protocols • Application to a CMP • Results • Conclusion

  4. Future CMPs Bus? Crossbar? Packet-Switched? Ring?

  5. The “Cell” Processor

  6. Ring Interconnect • Why? • Short, fast point-to-point links • Fewer (data) ports • Less complex than packet-switched • Simple, distributed arbitration • Exploitable ordering for coherence

  7. Cache Coherence for a Ring

  8. Cache Coherence for a Ring • Ring is broadcast and offers ordering • Apply existing bus-based snooping protocols? • NO! • Order properties of ring are different

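  9. Ring Order != Bus Order [Ring diagram with nodes P12, P3, P6, and P9 and two messages, A and B: one node observes the order {A, B} while another observes {B, A}]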
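The ordering difference on this slide can be reproduced with a small sketch. It assumes a 12-node unidirectional ring with one time unit per hop, and it places the senders of A and B at P9 and P3; that placement is a guess, since the transcript does not preserve where the messages sit in the diagram. The function name `arrival_order` is hypothetical.

```python
def arrival_order(n_nodes, sends, observer):
    """Return message names in the order `observer` sees them.

    sends: list of (name, source_node, send_time). Nodes are numbered
    0..n_nodes-1 and messages travel one hop per time unit toward
    increasing node numbers (assumed, not taken from the slides).
    """
    arrivals = []
    for name, src, send_time in sends:
        hops = (observer - src) % n_nodes  # hops along the ring direction
        arrivals.append((send_time + hops, name))
    return [name for _, name in sorted(arrivals)]


N = 12                               # P12 is modeled as node 0
sends = [("A", 9, 0), ("B", 3, 0)]   # assumed senders: P9 and P3, same time

print(arrival_order(N, sends, 0))    # P12 sees ['A', 'B']
print(arrival_order(N, sends, 6))    # P6 sees ['B', 'A']
```

On a bus every node would see the same order; here two observers disagree even though both messages were sent at the same instant.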

  10. Outline • Introduction and Motivation • Ring-based Coherence Protocols • Application to a CMP • Results • Conclusion

  11. Snooping Protocols for Rings • Assumptions: • Unidirectional ring • Multiple rings per-address OK • Write-back, write-invalidate caches • Eager request forwarding • e.g., forward message then snoop • [Strauss et al. ISCA 2006] • Can total bus order be recreated? YES
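As a rough illustration of why eager forwarding is assumed, the sketch below compares one full ring traversal under a toy latency model. The model, the function name `traversal_cycles`, and the cycle counts are all assumptions made for illustration; they are not figures from the slides or from Strauss et al.

```python
def traversal_cycles(n_nodes, link_cycles, snoop_cycles, eager):
    """Cycles for a request to circle the ring once (toy model).

    eager: each node forwards the message first and snoops in parallel,
           so only the final snoop adds to the traversal time.
    lazy:  each node snoops before forwarding, so every snoop is on
           the critical path.
    """
    if eager:
        return n_nodes * link_cycles + snoop_cycles
    return n_nodes * (link_cycles + snoop_cycles)


# Hypothetical numbers: 12 nodes, 1-cycle links, 5-cycle snoops.
print(traversal_cycles(12, 1, 5, eager=True))   # 17
print(traversal_cycles(12, 1, 5, eager=False))  # 72
```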

  12. ORDERING-POINT Example [Ring diagram: nodes P1 to P11 plus an ordering point] • P9 Store • P9 getM in flight (inactive) • one cache in S, one cache in O

  13. ORDERING-POINT Example [Ring diagram] • P9 getM in flight • own request ordered • one cache in S • O → I

  14. ORDERING-POINT Example [Ring diagram] • P9 getM in flight • own request ordered • S → I • O → I

  15. ORDERING-POINT Example [Ring diagram] • P9 ACK in flight • Data to P9 in flight • own request ordered • S → I • O → I

  16. ORDERING-POINT Example [Ring diagram] • P6 Store • P6 getM in flight • P9 ACK in flight • Data to P9 in flight • own request ordered • S → I • O → I

  17. ORDERING-POINT Example [Ring diagram] • P9 Store Complete • P6 getM in flight • Data to P6 in flight • P6 Store

  18. Bottom line: ORDERING-POINT STABLE • Requests totally ordered + Stable, predictable performance • Slow – Requests not active immediately • Extra control overhead – N + N/2 hops for request message – N/2 hops for Ack message • Can requests be active immediately? YES (e.g., IBM Power4/5)
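The hop counts on this slide can be tallied with a short sketch. It reads the N/2 terms as the average distance between a requester and the ordering point on an N-node ring, which is an interpretation rather than something the slide states; `ordering_point_hops` is a hypothetical helper.

```python
def ordering_point_hops(n_nodes):
    """Hop counts for one ORDERING-POINT request on an n_nodes ring.

    Assumes N/2 is the average requester-to-ordering-point distance.
    """
    request = n_nodes + n_nodes / 2   # N + N/2 hops for the request message
    ack = n_nodes / 2                 # N/2 hops for the Ack message
    return {"request": request, "ack": ack, "control_total": request + ack}


hops = ordering_point_hops(12)
print(hops)  # {'request': 18.0, 'ack': 6.0, 'control_total': 24.0}
```

Under this reading, a 12-node ring spends 24 control-message hops per request, twice the 12 hops of a single ring traversal.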

  19. GREEDY-ORDER Example [Ring diagram: nodes P1 to P12] • P9 Store • P9 getM in flight, response: (empty) • S → I • one cache in O • P6 Store

  20. GREEDY-ORDER Example [Ring diagram] • P9 Store • P9 getM in flight, response: ACK • O → I, will send data

  21. GREEDY-ORDER Example [Ring diagram] • P9 Store • P9 getM in flight, response: ACK • O → I, will send data • P6 Store • P6 getM in flight, response: (empty)

  22. GREEDY-ORDER Example [Ring diagram] • P9 Store, acked • O → I, will send data • Data to P9 in flight • P6 Store • P6 getM in flight, response: (empty)

  23. GREEDY-ORDER Example [Ring diagram] • P9 Store, acked, M • Data to P9 • P6 Store • P6 getM in flight, response: (empty) • P6 RETRY

  24. Bottom line: GREEDY-ORDER FAST • Average case is fast + Request active immediately • Requires combined snoop response • Synchronous timing of snoops for efficiency • Resorts to unbounded # of retries in conflict • Will conditions eventually allow request completion? • Probabilistic system (e.g. Ethernet)
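The "probabilistic system" point can be made concrete with a toy model. It assumes each attempt independently conflicts with a fixed probability, which is a simplification introduced here and not a claim from the slides; under that assumption the retry count is geometric, with a finite mean but no upper bound. Both function names are hypothetical.

```python
import random


def expected_retries(p_conflict):
    """Mean retries before success when each attempt fails with p_conflict."""
    return p_conflict / (1.0 - p_conflict)


def simulate_retries(p_conflict, rng, max_retries=10**6):
    """Count retries for one request under the independent-conflict model."""
    retries = 0
    while retries < max_retries and rng.random() < p_conflict:
        retries += 1
    return retries


rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [simulate_retries(0.5, rng) for _ in range(20000)]
print(expected_retries(0.5))        # 1.0
print(max(samples) > 1)             # worst case exceeds the mean
```

Real conflicts need not be independent, so a request can fare much worse than this model suggests.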

  25. Recap STABLE FAST • Existing Solutions: • ORDERING-POINT • Establishes total order • Extra latency and control message overhead • GREEDY-ORDER • Fast in common case • Unbounded retries • Ideal Solution • Fast for average case • Stable for worst case (no retries)

  26. New Approach: RING-ORDER STABLE FAST • + Requests complete in order of ring position • Fully exploits ring ordering • + Initial requests always succeed • No retries, no ordering point • Fast, stable, predictable performance • Key: Use token counting • All tokens to write, one token to read
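The token-counting rule ("all tokens to write, one token to read") can be sketched as a permission check. The token count of 4 and the names `BlockCopy` and `move_tokens` are hypothetical; the slides do not say how many tokens a block has.

```python
from dataclasses import dataclass

TOTAL_TOKENS = 4  # hypothetical number of tokens per block


@dataclass
class BlockCopy:
    """One cache's copy of a block, tracked only by its token count."""
    tokens: int = 0

    def can_read(self) -> bool:
        return self.tokens >= 1              # one token to read

    def can_write(self) -> bool:
        return self.tokens == TOTAL_TOKENS   # all tokens to write


def move_tokens(src: BlockCopy, dst: BlockCopy, count: int) -> None:
    """Transfer tokens; the total across all copies never changes."""
    if count < 0 or count > src.tokens:
        raise ValueError("cannot move more tokens than the source holds")
    src.tokens -= count
    dst.tokens += count


writer = BlockCopy(tokens=TOTAL_TOKENS)
reader = BlockCopy()
move_tokens(writer, reader, 1)
print(writer.can_write(), reader.can_read())  # False True
```

Because tokens are conserved, a writer holding all of them knows no other cache can read the block.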

  27. RING-ORDER Example [Ring diagram: nodes P1 to P12; legend: token, priority token] • P9 Store • P9 getM in flight

  28. RING-ORDER Example [Ring diagram; legend: token, priority token] • FurthestDest = P9 • P9 Store • P9 getM in flight

  29. RING-ORDER Example [Ring diagram] • FurthestDest = P9 • P9 Store • P6 Store • P6 getM in flight

  30. RING-ORDER Example [Ring diagram] • FurthestDest = P9 • P9 Store Complete • P6 Store Complete

  31. RING-ORDER Recap • Key: Exploit order of ring with token counting • Requests never race with tokens • Furthest Destination field • Carried in responses, tracked in MSHRs • Determines if tokens need to keep moving • Priority token ensures liveness • Data satisfies all requestors during traversal
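A heavily simplified sketch of "requests complete in order of ring position": if all tokens start at one node and travel around a unidirectional ring, each waiting writer takes them as they pass, completes, and sends them on. This ignores the Furthest Destination field and the priority token, and the function name and node numbers are hypothetical.

```python
def completion_order(n_nodes, token_holder, requesters):
    """Order in which write requesters complete (simplified model).

    Tokens leave `token_holder` and move toward increasing node numbers,
    so requesters finish in order of ring distance from the holder.
    """
    return sorted(requesters, key=lambda node: (node - token_holder) % n_nodes)


# Hypothetical placements on a 12-node ring with requesters at P9 and P6.
print(completion_order(12, token_holder=3, requesters=[9, 6]))  # [6, 9]
print(completion_order(12, token_holder=7, requesters=[9, 6]))  # [9, 6]
```

The order depends only on ring position relative to the tokens, not on which request was issued first, and no requester ever retries.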

  32. RING-ORDER vs. Token Coherence

  33. Outline • Introduction and Motivation • Ring-based Coherence Protocols • Application to a CMP • Results • Conclusion

  34. Applying to Baseline CMP

  35. Interfacing with Memory Controllers • Problem: When should memory respond? • Solution: 1-bit per block of memory • Owner bit for ORDERING-POINT and GREEDY-ORDER • Token-count bit for RING-ORDER • All or none tokens • Cache the bits in a Memory Interface Cache • Eliminates costly DRAM accesses • Enable GREEDY-ORDER to meet snoop timing
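The one-bit-per-block idea can be sketched as a lookup that tells the memory controller whether it should respond. The class and method names are hypothetical, and the default of "memory responds" for untouched blocks is an assumption about the initial state.

```python
class MemoryInterface:
    """Tracks one bit per block: does memory own it (or hold all tokens)?"""

    def __init__(self):
        # Sparse map; blocks never seen are assumed to be held by memory.
        self._memory_holds = {}

    def should_respond(self, block_addr):
        """Memory supplies data only when its bit for the block is set."""
        return self._memory_holds.get(block_addr, True)

    def cache_took_block(self, block_addr):
        """A cache became owner (or took all tokens): clear the bit."""
        self._memory_holds[block_addr] = False

    def block_written_back(self, block_addr):
        """Ownership (or all tokens) returned to memory: set the bit."""
        self._memory_holds[block_addr] = True


mem = MemoryInterface()
print(mem.should_respond(0x40))   # True
mem.cache_took_block(0x40)
print(mem.should_respond(0x40))   # False
```

Keeping these bits in an on-chip cache, as the slide proposes, lets the controller answer this question without a DRAM access.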

  36. Outline • Introduction and Motivation • Ring-based Coherence Protocols • Application to a CMP • Results • Methodology • Runtime • Traffic • Performance Stability • Conclusion

  37. Methodology • Full-system Simulation • Virtutech Simics • Wisconsin GEMS • GPL software • http://www.cs.wisc.edu/gems • Workloads: • Commercial: OLTP, Apache, SpecJBB, Zeus • Scientific: OMPart, OMPfma3d, OMPmgrid • Protocols: • ORDERING-POINT • GREEDY-ORDER (called –IDEAL in paper) • RING-ORDER

  38. Simulation Parameters 1/2 • SPARC, 4GHz • 64KB I&D, 4-way, 2-cycle access • 1MB, 4-way, 15-cycle data access • 8MB, 16-way, 25-cycle bank access

  39. Simulation Parameters 2/2 • 275-cycle DRAM access • Memory Interface Cache: 128KB, 16-way, 256-bits per tag • Ring Link: 8-cycles total delay, 80-bytes per cycle

  40. Normalized Runtime RING-ORDER is up to 52% faster than ORDERING-POINT

  41. Ring Bandwidth RING-ORDER uses up to 34% less bandwidth

  42. GREEDY-ORDER Starvation (event trace for Processors 3, 4, 6, and 7; the per-processor columns are lost in this transcript) • 631597033: issue getM • ......045: RETRY #2 • ......059: RETRY #10 • ......081: Complete • ......083: RETRY #1 • ......087: ack p7, send data • ......111: issue getM • ......116: RETRY #11 • ......127: Complete • ......140: RETRY #2 • ......148: ack p3, send data • ......161: RETRY #1 • ......180: issue getM • ......197: RETRY #3 • ......198: Complete • ......205: ack p7, send data • ......218: RETRY #2 • ......237: issue getM • ......254: RETRY #4 • ......255: Complete • ......262: ack p3, send data • +70,000 cycles: issue getM, RETRY #1402

  43. Retries RING-ORDER offers stable, bounded performance

  44. Conclusion STABLE FAST • Rings a viable interconnect for CMPs • Ring != Bus for ordering • RING-ORDER protocol offers best of: • ORDERING-POINT (stable) and, • GREEDY-ORDER (fast) • P.S. RING-ORDER requires NO system-wide snoop response • Useful for hierarchy of rings

  45. BACKUP SLIDES

  46. Flexible Snooping [Strauss et al. ISCA 2006] • Eager vs. Lazy forwarding • Key Differences: • Targets coherence between bus-based CMPs • Logical ring on message-passing interconnect • Protocol similar to GREEDY-ORDER • Uses a separate combined snoop response message • RING-ORDER also works with logical ring • Possible to extend protocol to send data off the ring • Lazy vs. Eager Forwarding applies to RING-ORDER • Synergistic fit to reduce snoop power
