Efficient Interconnects for Clustered Microarchitectures

  1. Efficient Interconnects for Clustered Microarchitectures • Joan-Manuel Parcerisa, Antonio González, Universitat Politècnica de Catalunya – Barcelona, Spain, {jmanel,antonio}@ac.upc.es • Julio Sahuquillo, José Duato, Universitat Politècnica de València – València, Spain, {jsahuqui,jduato}@disca.upv.es

  2. Why Clustered Microarchitectures • Larger issue width, window length, predictor sizes • More complexity → more latency and power • Even worse: wire delays do not scale across technologies • Deeper pipelines, fewer logic levels per stage • Tight loops difficult to fit in a single cycle • E.g. issue logic, bypass • Partitioning critical structures attacks both problems • E.g. clustered microarchitectures

  3. A Typical Clustered uArch • [Figure: Fetch/Decode/Rename and steering logic feeding clusters C0 to C3, each with a local I-Queue, a local register file and FUs, joined by an interconnection network (ICN)] • Partitioned processor core • Instructions dynamically steered • Each cluster: RF, IQ, FUs • Faster issue, read, bypass • Inter-cluster communications • Go through slow interconnects • Take 1 cycle or more • Steering must maximize communication locality

  4. Motivation • ICN is a critical part of the architecture • Performance is very sensitive to communication latency • ICN assumed by previous works • Cross-bar → does not scale • Ring → simple, but long delays • Idealized • Our proposals • Several point-to-point ICNs for 4 and 8 clusters • Implementable, simple and efficient • A topology-aware steering

  5. Outline • Clustered architecture • Topology-aware steering • Proposed Interconnects • Experimental results • Summary and conclusions

  6. Our Assumed Clustered uArch • Distributed RF • Results only written to local RF • Values are communicated with copy instructions • Automatically inserted • Each copy creates a new instance • Rename Table tracks locations of multiple instances
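A minimal sketch of how a rename table could track multiple instances of a register and trigger copy insertion, as slide 6 describes. The class name `RenameTable`, the tag format, and the choice of which existing instance feeds the copy are all assumptions made for illustration; the slides do not specify them.

```python
class RenameTable:
    """Hypothetical model: logical register -> {cluster: physical tag}."""

    def __init__(self):
        self.instances = {}
        self._next_tag = 0

    def _alloc(self):
        # Allocate a fresh physical register tag (naming is illustrative).
        self._next_tag += 1
        return f"p{self._next_tag}"

    def define(self, reg, cluster):
        # A result is written only to the producer's local RF,
        # so a new definition leaves exactly one instance.
        self.instances[reg] = {cluster: self._alloc()}

    def read(self, reg, cluster):
        # Return the local tag plus any copy instructions that had to be
        # inserted; each copy creates a new instance in the reader's cluster.
        locations = self.instances[reg]
        copies = []
        if cluster not in locations:
            source = next(iter(locations))  # assumption: take any existing instance
            locations[cluster] = self._alloc()
            copies.append((reg, source, cluster))
        return locations[cluster], copies


table = RenameTable()
table.define("R1", 1)                    # (to C1) R1 := R2 + R2
tag, copies = table.read("R1", 2)        # (to C2) R3 := R1 + R3
print(copies)                            # one copy of R1 from C1 to C2
```

A second read of R1 from cluster 2 finds the instance created by the copy and inserts nothing further.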

  7. Communication Timing • [Figure: pipeline timing diagram (stages F, D, I, Ex, WB) for three instructions; labels: wakeup signals, ICN delay, wait for R1] • (to C1) R1 := R2 + R2 • (to C1) copy R1 C1->C2 • (to C2) R3 := R1 + R3

  8. Baseline Steering Scheme (dependence-based) • Rule 1: Minimize communication penalty • If all source operands are available: select the clusters that minimize the number of communications • If any source operand is not available: select the producer cluster • Rule 2: Maximize workload balance • Choose the least loaded of the clusters selected by rule 1 • One exception: if workload imbalance > threshold, ignore rule 1
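The two rules and the exception of the baseline scheme can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the `Operand` record, the per-cluster `load` list, and the imbalance metric (most loaded minus least loaded) are hypothetical choices.

```python
from dataclasses import dataclass


@dataclass
class Operand:
    locations: frozenset   # clusters that already hold an instance of the value
    available: bool        # False while the producer has not yet executed
    producer: int          # cluster of the producing instruction


def rule1_candidates(operands, n_clusters):
    # If any operand is still pending, follow its producer.
    pending = sorted({op.producer for op in operands if not op.available})
    if pending:
        return pending
    # Otherwise minimize the number of operands that must be copied in.
    comms = {c: sum(c not in op.locations for op in operands)
             for c in range(n_clusters)}
    best = min(comms.values())
    return [c for c in comms if comms[c] == best]


def baseline_steer(operands, load, threshold):
    # Exception: under heavy imbalance (assumed metric), ignore rule 1.
    if max(load) - min(load) > threshold:
        candidates = list(range(len(load)))
    else:
        candidates = rule1_candidates(operands, len(load))
    # Rule 2: least loaded among the candidates.
    return min(candidates, key=lambda c: load[c])


ops = [Operand(frozenset({0}), True, 0), Operand(frozenset({0, 1}), True, 1)]
print(baseline_steer(ops, load=[5, 1, 0, 0], threshold=8))  # cluster 0: no copies needed
print(baseline_steer(ops, load=[5, 1, 0, 0], threshold=3))  # imbalance wins: cluster 2
```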

  9. Topology-Aware Steering Scheme • Also minimize distance • Change part of rule 1: If all source operands are available: • Baseline: “Select clusters that minimize # communications” • Topology-aware: “Select clusters that minimize the longest communication distance”
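To illustrate the changed part of rule 1, the sketch below picks the clusters that minimize the longest communication distance. It assumes an 8-cluster ring as the distance metric and models each operand only by the set of clusters holding it; the function names are hypothetical.

```python
def ring_distance(a, b, n):
    # Hops between clusters a and b on a bidirectional ring of n clusters.
    d = abs(a - b) % n
    return min(d, n - d)


def topology_aware_candidates(operand_locations, n_clusters, distance=ring_distance):
    # For each cluster, each operand comes from its nearest instance;
    # the cost of the cluster is the longest of those distances.
    def longest(cluster):
        return max(
            (min(distance(src, cluster, n_clusters) for src in locs)
             for locs in operand_locations),
            default=0,
        )

    worst = {c: longest(c) for c in range(n_clusters)}
    best = min(worst.values())
    return [c for c in worst if worst[c] == best]


# One operand lives in cluster 0, the other in cluster 4 (opposite sides of the ring).
print(topology_aware_candidates([{0}, {4}], 8))  # [2, 6]
```

In this example the baseline rule would prefer cluster 0 or 4 (one communication each, but over 4 hops), whereas the topology-aware rule prefers cluster 2 or 6, which needs two communications of only 2 hops each.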

  10. Design Issues: Bandwidth • [Figure: two clusters connected through their routers] • For each additional input bypass path • 1 tag across the IQ • 1 RF write port • 1 entry to FU input MUXes • It increases the wakeup and bypass delays • Bandwidth requirements are rather low • 1 input bypass path per cluster (1 RF write port) • 2 links per connected cluster pair

  11. Design Issues: Latency • Performance very sensitive to communication latency • Simple routing structures and algorithms • Source routing • No intermediate buffering • In-transit messages have priority over newly injected ones
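A small sketch of what source routing and in-transit priority could look like at one router output link. The message and port representations are assumptions for illustration; the slides only state the policies.

```python
def next_hop(route):
    # Source routing: the sender precomputes the list of output ports,
    # and each router simply consumes the head of that list.
    return route[0], route[1:]


def arbitrate(in_transit, waiting):
    # Decide which message uses the output link this cycle.
    # in_transit: message arriving from an input link, or None. With no
    #   intermediate buffering it cannot be stalled, so it always wins.
    # waiting: locally generated messages awaiting injection, oldest first.
    if in_transit is not None:
        return in_transit, waiting
    if waiting:
        return waiting[0], waiting[1:]
    return None, waiting


print(next_hop(["right", "right"]))          # ('right', ['right'])
print(arbitrate("m_transit", ["m_new"]))     # in-transit message goes first
print(arbitrate(None, ["m_new", "m_next"]))  # otherwise inject the oldest
```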

  12. Design Issues: Connectivity • Assumed 1-cycle communication delay between adjacent clusters • Number of “adjacents” dictated by technology and layout • Study topologies with different connectivity degrees

  13. Design Issues: Point-to-point vs Buses • Point-to-point advantages • Access to links is arbitrated locally • Wires are shorter and less loaded • Shared buses are studied for comparison

  14. Interconnects for 4 clusters (I) • [Figure: clusters C0 to C3 connected by buses] • Bus2 • 1 bus per cluster, each connected to 1 write port • Latency = 4 cycles (2 for arbitration + 2 for transmission) • Arbitration overlaps with transmission

  15. Interconnects for 4 clusters (II) • Synchronous Ring • Injection rules prevent 2 messages from arriving at the same cluster at once: • Even cycles: 1-hop messages go counter-clockwise, 2-hop messages go clockwise • Odd cycles: reverse directions • [Figure: ring on even and odd cycles; labels: inject 1-hop message (or forward in-transit), inject 2-hop message, no conflict]
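The injection rules of the synchronous ring can be checked with a short sketch. It assumes 1 cycle per hop (the deck's assumption for adjacent clusters) and at most one message per link per cycle; the names are hypothetical. The point it illustrates: in any given cycle, every message reaching its destination travels in the same direction, so each cluster receives deliveries over a single incoming link and one write port suffices.

```python
CW, CCW = +1, -1  # direction of travel around the ring


def injection_direction(cycle, hops):
    # Direction in which a message needing `hops` hops may be injected.
    even = cycle % 2 == 0
    if hops == 1:
        return CCW if even else CW
    if hops == 2:
        return CW if even else CCW
    raise ValueError("a 4-cluster ring only needs 1-hop and 2-hop messages")


def delivery_directions(cycle):
    # Directions of travel of all messages that can reach their
    # destination at `cycle`, assuming 1 cycle per hop.
    return {injection_direction(cycle - hops, hops) for hops in (1, 2)}


for cycle in range(2, 10):
    # 1-hop and 2-hop deliveries always share one direction per cycle.
    assert len(delivery_directions(cycle)) == 1
print(delivery_directions(2), delivery_directions(3))
```

A 2-hop message in its second hop shares the link with 1-hop injections in the same direction, which is where the priority of in-transit messages over new injections comes into play.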

  16. Interconnects for 4 clusters (III) • [Figure: ring of clusters c0 to c3 with input queues] • Partially Asynchronous Ring • Messages may issue in any cycle • 2 messages may arrive at once • Small input queues

  17. Interconnects for 4 clusters (IV) • Ideal Ring • Contention-free • unlimited number of links • unlimited number of RF write ports • For comparison purposes (upper-bound performance)

  18. Interconnects for 8 Clusters (I) • Buses • Analogous to those for 4 clusters • Bus2: same latency (optimistic): 2+2 cycles • Bus4: twice the latency (realistic): 4+4 cycles • Rings • Analogous to those for 4 clusters • Synchronous and Asynchronous • Max. Distance = 4 hops (average 2.29 hops)

  19. Interconnects for 8 Clusters (II) • [Figure: router with Top, Left and Right links and the cluster datapath; one path marked "only for last hop of messages"] • Mesh • Max. distance = 4 hops (average = 2 hops) • 2 in-transit messages may compete for the same output link • Constrained connectivity

  20. Interconnects for 8 Clusters (III) • [Figure: router with one path marked "only for last hop of messages"] • Torus • Max. distance = 3 hops • Same connectivity constraints as the mesh
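The hop counts quoted for the 8-cluster ring, mesh and torus can be reproduced with a breadth-first search over each topology. The sketch assumes a 2x4 layout for the mesh and the torus, with wrap-around only along the rows of 4 (so every torus node has 3 neighbours); the layout and the helper names are assumptions, not taken from the slides.

```python
from collections import deque


def ring(n):
    return {i: {(i - 1) % n, (i + 1) % n} for i in range(n)}


def grid(rows, cols, wrap):
    # 2-D grid; wrap=True closes each row into a ring (torus), False gives a mesh.
    adj = {}
    for r in range(rows):
        for c in range(cols):
            nbrs = set()
            if r > 0:
                nbrs.add((r - 1, c))
            if r < rows - 1:
                nbrs.add((r + 1, c))
            if c > 0 or wrap:
                nbrs.add((r, (c - 1) % cols))
            if c < cols - 1 or wrap:
                nbrs.add((r, (c + 1) % cols))
            nbrs.discard((r, c))
            adj[(r, c)] = nbrs
    return adj


def hop_stats(adj):
    # Return (maximum distance, average distance over all ordered node pairs).
    total, pairs, longest = 0, 0, 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
        longest = max(longest, max(dist.values()))
    return longest, total / pairs


for name, adj in [("ring", ring(8)),
                  ("mesh", grid(2, 4, wrap=False)),
                  ("torus", grid(2, 4, wrap=True))]:
    longest, average = hop_stats(adj)
    print(f"{name}: max {longest} hops, average {average:.2f} hops")
```

Under these assumptions the ring gives a maximum of 4 hops and an average of 2.29, the mesh a maximum of 4 and an average of 2, and the torus a maximum of 3, matching the slides. The torus average printed by the sketch is a computed value that the slides do not state.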

  21. Interconnects for 8 Clusters (IV) • Ideal Torus • Contention-free • unlimited number of links • unlimited number of RF write ports • For comparison purposes (upper-bound performance)

  22. Router Structures • [Figure: router with Top Link, Left Link, Right Link, input queue (Qin) and cluster datapath] • Features common to all ICNs • No intermediate buffering • Partially asynchronous ICNs • Contention for a write port • Add small input queues • Topologies with 3 adjacent nodes • Contention for the same output link • Constrained connectivity

  23. Experimental Setup • Simulation • Extended version of sim-outorder (SimpleScalar v3.0) • 14 Mediabench programs • Compiled with -O4 for an Alpha AXP • Architecture • L1 D-cache: 64KB, 2-way, 3-cycle hit • 128-entry ROB, 64-entry LSQ • Each cluster: 2-way issue, 16-entry IQ, 56 physical regs.

  24. Performance: 4 Clusters • Poor performance of Bus2 • Asynchronous Ring • Better than Synchronous Ring • Close to Ideal (within 1%)

  25. Synchronous / Asynchronous • Contention delays • Lower for Async. Ring • A message issues as soon as the link is available • Higher for 1-hop messages • Only a single path • Sync. Ring: can issue in only 1 cycle out of every 2

  26. Length of Input Queues • [Figure: sample statistics (djpeg)] • Max. observed occupancy < 9 entries • Handle overflows by flushing the pipeline • Rather than including complex control flow

  27. Performance: 8 Clusters • Poor performance of buses • Connectivity degree has a significant impact • Asynchronous Torus close to Ideal (within 1.5%)

  28. Topology-Aware Steering • 16.5% IPC improvement with 8 clusters (2.5% with 4 clusters)

  29. Summary • An efficient topology-aware steering scheme • Cluster point-to-point interconnects • For 4 clusters and 8 clusters • Designed to minimize complexity and latency • Compared to • Bus-based models • Idealized models with unlimited bandwidth

  30. Conclusions • The choice of ICN is crucial for performance • Point-to-point better than buses • Asynchronous rings better than synchronous • Asynchronous interconnects perform close to ideal • with minimal complexity • Higher connectivity significantly improves performance • Topology-aware steering essential to reduce latency • Especially with many clusters

