
Efficient Interconnects for Clustered Microarchitectures

Efficient Interconnects for Clustered Microarchitectures. Joan-Manuel Parcerisa Antonio González Universitat Politècnica de Catalunya – Barcelona, Spain {jmanel,antonio}@ac.upc.es. Julio Sahuquillo José Duato


Presentation Transcript


  1. Efficient Interconnects for Clustered Microarchitectures • Joan-Manuel Parcerisa, Antonio González: Universitat Politècnica de Catalunya – Barcelona, Spain, {jmanel,antonio}@ac.upc.es • Julio Sahuquillo, José Duato: Universitat Politècnica de València – València, Spain, {jsahuqui,jduato}@disca.upv.es

  2. Why Clustered Microarchitectures • Larger issue width, window length, predictor sizes • More complexity → more latency and power • Even worse: wire delays do not scale across technologies • Deeper pipelines, fewer logic levels per stage • Tight loops difficult to fit in a single cycle • E.g. issue logic, bypass • Partitioning critical structures attacks both problems • E.g. clustered microarchitectures

  3. A Typical Clustered uArch [Figure: Fetch/Decode/Rename and Steering Logic feeding clusters C0 to C3, each with a Local I-Queue, a Local Register File and FUs, joined by the Interconnect-network (ICN)] • Partitioned processor core • Instructions dynamically steered • Each cluster: RF, IQ, FUs • Faster issue, read, bypass • Inter-cluster communications • Go through slow interconnects • Take 1 cycle or more • Steering must maximize communication locality

  4. Motivation • ICN is a critical part of the architecture • Performance very sensitive to communication latency! • ICN assumed by previous works • Cross-bar → does not scale • Ring → simple, but long delays • Idealized • Our proposals • Several point-to-point ICNs for 4 and 8 clusters • Implementable, simple and efficient • A topology-aware steering scheme

  5. Outline • Clustered architecture • Topology-aware steering • Proposed Interconnects • Experimental results • Summary and conclusions

  6. Our Assumed Clustered uArch • Distributed RF • Results only written to local RF • Values are communicated with copy instructions • Automatically inserted • Each copy creates a new instance • Rename Table tracks locations of multiple instances
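A minimal sketch of how a rename table could track several instances of one logical register, as slide 6 describes. The class and method names (`RenameTable`, `define`, `add_copy`, `needs_copy`) are hypothetical and the slides do not specify the real structure; this only illustrates the bookkeeping.

```python
class RenameTable:
    """Per logical register, remembers which clusters hold an instance of its value."""

    def __init__(self):
        # logical register -> {cluster id -> physical register in that cluster}
        self.instances = {}

    def define(self, logical, cluster, phys):
        # A result is written only to the producer's local register file,
        # so older instances of this logical register become stale.
        self.instances[logical] = {cluster: phys}

    def add_copy(self, logical, cluster, phys):
        # A copy instruction creates one more instance, in another cluster.
        self.instances[logical][cluster] = phys

    def needs_copy(self, logical, cluster):
        # A consumer steered to `cluster` needs a copy if no instance lives there.
        return cluster not in self.instances.get(logical, {})


table = RenameTable()
table.define("R1", cluster=1, phys=17)      # R1 produced in C1
first = table.needs_copy("R1", cluster=2)   # consumer in C2: copy required
table.add_copy("R1", cluster=2, phys=40)    # copy R1 C1->C2 inserted automatically
second = table.needs_copy("R1", cluster=2)  # later consumers in C2 reuse the instance
```

Keeping every instance in the table is what lets a later consumer in the same cluster avoid a second copy.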

  7. Communication Timing [Figure: pipeline timing of three instructions: (to C1) R1 := R2 + R2; (to C1) copy R1, C1->C2; (to C2) R3 := R1 + R3. Labels: stages F D I Ex WB, wakeup signals, ICN delay, wait for R1]

  8. Baseline Steering Scheme (dependence-based) • Rule 1: Minimize communication penalty • If all source operands available • Select clusters that minimize # communications • If any source operand not available • Select producer cluster • Rule 2: Maximize workload balance • Choose the least loaded of clusters selected by rule 1 • One exception: if workload imbalance > threshold, ignore rule 1
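The two rules and the exception can be sketched as a small selection function. This is an illustration under assumptions, not the authors' implementation: the function name `steer_baseline`, the operand encoding, and the imbalance metric (largest minus smallest load counter) are all hypothetical, since the slides do not define them.

```python
def steer_baseline(sources, load, threshold):
    """Choose a cluster for one instruction.

    sources:   one (ready, clusters) pair per source operand, where `clusters`
               is the set of clusters holding (or about to produce) the value.
    load:      per-cluster workload counters.
    threshold: imbalance above which rule 1 is ignored (assumed: max - min).
    """
    clusters = range(len(load))
    if max(load) - min(load) > threshold:
        candidates = list(clusters)                 # exception: ignore rule 1
    else:
        pending = [locs for ready, locs in sources if not ready]
        if pending:
            # Rule 1, some operand not available: follow the producer cluster(s).
            candidates = sorted(set().union(*pending))
        else:
            # Rule 1, all operands available: fewest copies needed.
            def copies(c):
                return sum(1 for _, locs in sources if c not in locs)
            fewest = min(copies(c) for c in clusters)
            candidates = [c for c in clusters if copies(c) == fewest]
    # Rule 2: least loaded among the clusters selected by rule 1.
    return min(candidates, key=lambda c: load[c])


choice = steer_baseline([(True, {0}), (True, {0, 1})], load=[5, 1, 0, 0], threshold=8)
```

In the usage line, only cluster 0 holds both operands, so it is selected even though it is the most loaded; raising its load past the threshold would switch the choice to the least loaded cluster.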

  9. Topology-Aware Steering Scheme • Also minimize distance • Change part of rule 1: If all source operands are available: • Baseline: “Select clusters that minimize # communications” • Topology-aware: “Select clusters that minimize the longest communication distance”
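A sketch of the modified rule, read literally: among all clusters, keep those whose farthest source operand is nearest. The names `ring_distance` and `topology_aware_candidates` are hypothetical, a bidirectional ring is assumed as the default topology, and taking each operand from its nearest instance is also an assumption.

```python
def ring_distance(a, b, n):
    """Hops between clusters a and b on a bidirectional ring of n clusters."""
    d = abs(a - b) % n
    return min(d, n - d)


def topology_aware_candidates(sources, n, distance=ring_distance):
    """sources: one set of clusters per available source operand, holding the
    clusters where an instance of that operand lives."""
    def longest(c):
        # Each operand is fetched from its nearest instance; the operand that
        # travels farthest bounds the communication delay.
        return max((min(distance(p, c, n) for p in locs) for locs in sources),
                   default=0)

    best = min(longest(c) for c in range(n))
    return [c for c in range(n) if longest(c) == best]


# 8-cluster ring, operands living in C0 and C4 (opposite sides of the ring).
picked = topology_aware_candidates([{0}, {4}], n=8)
```

With operands in C0 and C4, counting communications favors C0 or C4 (one communication of 4 hops), whereas this reading selects C2 and C6 (two communications of 2 hops each). Rule 2 would then pick the least loaded of those candidates.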

  10. Design Issues: Bandwidth [Figure: clusters attached to routers] • For each additional input bypass path • 1 tag across the IQ • 1 RF write port • 1 entry to FU input MUXes • It increases the wakeup and bypass delays • Bandwidth requirements are rather low • 1 input bypass path per cluster (1 RF write port) • 2 links per connected cluster pair

  11. Design Issues: Latency • Performance very sensitive to communication latency • Simple routing structures and algorithms • Source routing • No intermediate buffering • In-transit messages have priority over newly injected ones
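The routing choices on slide 11 can be illustrated with two small helpers. This is a sketch under assumptions: the names `next_hop` and `arbitrate_output` are hypothetical, a route is modeled as a list of link names chosen by the sender, and stalling the local injection when it loses arbitration is inferred from the absence of intermediate buffers.

```python
def next_hop(route):
    """Source routing: the sender precomputed the whole route, so a router only
    reads the head of the list and passes the rest along."""
    return route[0], route[1:]


def arbitrate_output(in_transit, injected):
    """Decide what one output link carries this cycle.

    in_transit: message arriving from an upstream link, or None.
    injected:   message the local cluster wants to send, or None.
    Returns (message_sent, injection_stalled).
    """
    if in_transit is not None:
        # With no intermediate buffering the in-transit message must keep
        # moving, so it wins and the local injection waits in its cluster.
        return in_transit, injected is not None
    return injected, False


link, rest = next_hop(["right", "right"])
sent, stalled = arbitrate_output(in_transit="m1", injected="m2")
```

Giving priority to in-transit traffic keeps the router free of storage; the cost is that a cluster may have to retry its own injection in a later cycle.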

  12. Design Issues: Connectivity • Assumed 1-cycle communication delay between adjacent clusters • Number of “adjacents” dictated by technology and layout • Study topologies with different connectivity degrees

  13. Design Issues: Point-to-point vs Buses • Point-to-point advantages • Access to links is arbitrated locally • Wires are shorter and less loaded • Shared buses are studied for comparison

  14. Interconnects for 4 clusters (I) [Figure: clusters C0 to C3] • Bus2 • 1 Bus per cluster, each connected to 1 write port • Latency = 4 cycles (2 for arbitration + 2 for transmission) • Arbitration overlaps with transmission

  15. Interconnects for 4 clusters (II) [Figure: even cycles and odd cycles, no conflict; inject 1-hop message (or forward in-transit); inject 2-hops message] • Synchronous Ring • Injection rules prevent 2 messages from arriving at once • Even cycles: 1-hop messages go counter-clockwise, 2-hop messages go clockwise • Odd cycles: directions are reversed
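The injection rules can be checked by brute force. The sketch below assumes 1 cycle per hop and an arbitrary labeling in which clockwise means moving from cluster i to cluster i+1; the helper names are hypothetical. It enumerates every source, injection cycle and hop count, and records which incoming link delivers each message on its final hop.

```python
from itertools import product

N = 4  # clusters on the ring


def direction(cycle, hops):
    """+1 (clockwise) or -1 (counter-clockwise) for a message injected at `cycle`."""
    one_hop_dir = -1 if cycle % 2 == 0 else +1   # even cycles: 1-hop goes counter-clockwise
    return one_hop_dir if hops == 1 else -one_hop_dir


def final_hop(src, cycle, hops):
    """Return (cycle of the last hop, (sending cluster, destination cluster))."""
    d = direction(cycle, hops)
    last_from = (src + d * (hops - 1)) % N
    dest = (src + d * hops) % N
    return cycle + hops - 1, (last_from, dest)


# For every (cycle, destination), collect the neighbors that deliver a message there.
deliveries = {}
for src, cycle, hops in product(range(N), range(8), (1, 2)):
    when, (frm, dest) = final_hop(src, cycle, hops)
    deliveries.setdefault((when, dest), set()).add(frm)

# If each destination is fed by a single neighbor in any given cycle, and a link
# carries one message per cycle, two messages can never arrive at once.
conflict_free = all(len(senders) == 1 for senders in deliveries.values())
```

Under these assumptions every final hop travels counter-clockwise in even cycles and clockwise in odd cycles, so `conflict_free` holds. A 1-hop injection and the second hop of a 2-hop message share the same link, which matches the figure label "Inject 1-hop message (or forward in-transit)".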

  16. Interconnects for 4 clusters (III) [Figure: ring of clusters c0 to c3 with input queues] • Partially Asynchronous Ring • Messages may issue in any cycle • 2 messages may arrive at once • Small input queues

  17. Interconnects for 4 clusters (IV) • Ideal Ring • Contention-free • unlimited number of links • unlimited number of RF write ports • For comparison purposes (upper-bound performance)

  18. Interconnects for 8 Clusters (I) • Buses • Analogous to those for 4 clusters • Bus2: same latency (optimistic): 2+2 cycles • Bus4: twice the latency (realistic): 4+4 cycles • Rings • Analogous to those for 4 clusters • Synchronous and Asynchronous • Max. Distance = 4 hops (average 2.29 hops)
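The ring figures on slide 18 follow from counting shortest paths on a bidirectional ring. A small sketch (the function name `ring_hop_stats` is hypothetical) that reproduces them:

```python
from fractions import Fraction


def ring_hop_stats(n):
    """Max and mean shortest-path hops from one cluster to the other n - 1
    clusters of a bidirectional ring."""
    hops = [min(k, n - k) for k in range(1, n)]
    return max(hops), Fraction(sum(hops), len(hops))


# For 8 clusters the hop counts are 1, 2, 3, 4, 3, 2, 1.
max8, mean8 = ring_hop_stats(8)
```

For 8 clusters this gives a maximum of 4 hops and a mean of 16/7, which rounds to the 2.29 hops quoted on the slide.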

  19. Interconnects for 8 Clusters (II) [Figure: cluster datapath with Top, Left and Right links; one path labeled "Only for last hop of messages"] • Mesh • Max. distance = 4 hops (average = 2 hops) • 2 in-transit messages may compete for the same output link • Constrained connectivity

  20. Interconnects for 8 Clusters (III) [Figure: one path labeled "Only for last hop of messages"] • Torus • Max. distance = 3 hops • Same connectivity constraints as the mesh
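The mesh and torus distances can be reproduced with a breadth-first search. The sketch assumes the 8 clusters are laid out as a 2 x 4 grid; the slides do not state the layout, but this one yields at most 3 neighbors per cluster and matches the quoted hop counts. The function names are hypothetical.

```python
from collections import deque
from fractions import Fraction


def grid_links(rows, cols, wrap):
    """Adjacency of a rows x cols grid; wrap=True adds the torus wrap-around links."""
    adj = {(r, c): set() for r in range(rows) for c in range(cols)}
    for r, c in adj:
        for dr, dc in ((0, 1), (1, 0)):
            r2, c2 = r + dr, c + dc
            if wrap:
                r2, c2 = r2 % rows, c2 % cols
            if (r2, c2) in adj and (r2, c2) != (r, c):
                adj[(r, c)].add((r2, c2))
                adj[(r2, c2)].add((r, c))
    return adj


def hop_stats(adj):
    """Max and mean shortest-path hops over all ordered pairs of distinct nodes."""
    all_hops = []
    for start in adj:
        dist = {start: 0}
        queue = deque([start])
        while queue:                      # plain breadth-first search
            node = queue.popleft()
            for nxt in adj[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        all_hops += [d for node, d in dist.items() if node != start]
    return max(all_hops), Fraction(sum(all_hops), len(all_hops))


mesh_max, mesh_mean = hop_stats(grid_links(2, 4, wrap=False))
torus_max, torus_mean = hop_stats(grid_links(2, 4, wrap=True))
```

Under the 2 x 4 assumption the mesh gives a maximum of 4 hops and a mean of exactly 2, and the torus a maximum of 3, in line with the slides. The torus mean (12/7, about 1.71 hops) is the sketch's own result and does not appear in the presentation.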

  21. Interconnects for 8 Clusters (IV) • Ideal Torus • Contention-free • unlimited number of links • unlimited number of RF write ports • For comparison purposes (upper-bound performance)

  22. Router Structures [Figure: router with Top Link, LeftLink, RightLink, input queue Qin and Cluster Datapath] • Features common to all ICNs • No intermediate buffering • Partially asynchronous ICNs • Contention for a write port • Add small input queues • Topologies with 3 adjacent nodes • Contention for the same output link • Constrained connectivity

  23. Experimental Setup • Simulation • Extended version of sim-outorder (SimpleScalar v3.0) • 14 Mediabench programs • Compiled with -O4 for an Alpha AXP • Architecture • L1 D-cache: 64KB, 2-way, 3-cycle hit • 128 ROB, 64 LSQ • Each cluster: 2-way issue, 16-entry IQ, 56 physical regs.

  24. Performance: 4 Clusters • Poor performance of Bus2 • Asynchronous Ring • Better than Synchronous Ring • Close to Ideal (within 1%)

  25. Synchronous / Asynchronous • Contention delays • Lower for Async. Ring • A message issues as soon as the link is available • Higher for 1-hop messages • a single path • Sync. Ring: can issue only 1 cycle out of every 2

  26. Length of Input Queues [Figure: sample statistics (djpeg)] • Max. observed occupancy < 9 entries • Handle overflows by flushing the pipeline • Rather than including complex flow control
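A sketch of a small input queue in front of a cluster's single write port, with overflow handled by requesting a flush rather than by back-pressure. The class `InputQueue`, its methods and the capacity of 8 are hypothetical; the slides only report that observed occupancy stayed below 9 entries and do not detail the recovery, which the counter below merely stands in for.

```python
from collections import deque


class InputQueue:
    """Bounded queue feeding one register-file write port."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = deque()
        self.flushes = 0

    def arrive(self, messages):
        """Accept this cycle's arrivals; on overflow, request a pipeline flush."""
        if len(self.entries) + len(messages) > self.capacity:
            self.flushes += 1        # placeholder for the real flush-and-refetch
            self.entries.clear()
            return False
        self.entries.extend(messages)
        return True

    def drain(self):
        """Write at most one queued value per cycle through the write port."""
        return self.entries.popleft() if self.entries else None


queue = InputQueue(capacity=8)
accepted = queue.arrive(["v0", "v1"])   # two messages arrive in the same cycle
written = queue.drain()                 # one of them is written this cycle
```

Since overflows should be rare at this occupancy, an occasional flush is a cheap substitute for a flow-control protocol between routers.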

  27. Performance: 8 Clusters • Poor performance of buses • Connectivity degree has a significant impact • Asynchronous Torus close to Ideal (within 1.5%)

  28. Topology-Aware Steering • 16.5% IPC improvement with 8 clusters (2.5% with 4 clusters)

  29. Summary • An efficient topology-aware steering scheme • Cluster point-to-point interconnects • For 4 clusters and 8 clusters • Designed to minimize complexity and latency • Compared to • Bus-based models • Idealized models with unlimited bandwidth

  30. Conclusions • The choice of ICN is crucial for performance • Point-to-point better than buses • Asynchronous rings better than synchronous • Asynchronous interconnects perform close to ideal • with minimal complexity • Higher connectivity significantly improves performance • Topology-aware steering essential to reduce latency • Especially with many clusters
