Efficient Interconnects for Clustered Microarchitectures

Joan-Manuel Parcerisa, Antonio González: Universitat Politècnica de Catalunya – Barcelona, Spain, {jmanel,antonio}@ac.upc.es
Julio Sahuquillo, José Duato: Universitat Politècnica de València – València, Spain, {jsahuqui,jduato}@disca.upv.es

Why Clustered Microarchitectures • Larger issue width, window length, predictor sizes • More complexity → more latency and power • Even worse: wire delays do not scale across technologies • Deeper pipelines, fewer logic levels per stage • Tight loops difficult to fit in a single cycle • E.g. issue logic, bypass • Partitioning critical structures attacks both problems • E.g. clustered microarchitectures

A Typical Clustered uArch • Partitioned processor core • Instructions dynamically steered • Each cluster: RF, IQ, FUs • Faster issue, read, bypass • Inter-cluster communications • Go through slow interconnects • Take 1 cycle or more • Steering must maximize communication locality [Figure: Fetch/Decode/Rename and steering logic feeding clusters C0-C3, each with a local I-Queue, a local register file and FUs, joined by an interconnect network (ICN)]

Motivation • ICN is a critical part of the architecture • Performance is very sensitive to communication latency! • ICN assumed by previous works • Cross-bar → does not scale • Ring → simple, but long delays • Idealized • Our proposals • Several point-to-point ICNs for 4 and 8 clusters • Implementable, simple and efficient • A topology-aware steering scheme

Outline • Clustered architecture • Topology-aware steering • Proposed Interconnects • Experimental results • Summary and conclusions

Our Assumed Clustered uArch • Distributed RF • Results only written to local RF • Values are communicated with copy instructions • Automatically inserted • Each copy creates a new instance • Rename Table tracks locations of multiple instances
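The bookkeeping on this slide can be illustrated with a minimal sketch, assuming a rename table that maps each logical register to the clusters holding an instance of it. The class name, the physical register naming and the choice of source cluster for a copy are all hypothetical; the slides only state that copies are inserted automatically and that each copy creates a new instance.

```python
from dataclasses import dataclass, field


@dataclass
class RenameTable:
    # logical register -> {cluster id: physical register holding an instance}
    instances: dict = field(default_factory=dict)
    # per-cluster allocation counter (hypothetical naming scheme)
    next_phys: dict = field(default_factory=dict)

    def _alloc(self, cluster):
        n = self.next_phys.get(cluster, 0)
        self.next_phys[cluster] = n + 1
        return f"C{cluster}.p{n}"

    def define(self, reg, cluster):
        """A new result is written only to the local RF of its cluster."""
        self.instances[reg] = {cluster: self._alloc(cluster)}
        return self.instances[reg][cluster]

    def read(self, reg, cluster):
        """Return (physical register, copy instructions to insert)."""
        locs = self.instances[reg]
        copies = []
        if cluster not in locs:
            src = next(iter(locs))  # assumption: any cluster with an instance
            locs[cluster] = self._alloc(cluster)  # the copy creates a new instance
            copies.append(("copy", reg, src, cluster))
        return locs[cluster], copies


rt = RenameTable()
rt.define("R1", 1)             # R1 produced in cluster 1
print(rt.read("R1", 2))        # first remote read inserts a copy C1 -> C2
print(rt.read("R1", 2))        # later reads reuse the new instance
```

Once the copy has been inserted, the table records two instances of R1, so further consumers in cluster 2 need no communication.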

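Communication Timing [Figure: pipeline timing diagram for three instructions: "R1 := R2 + R2" (steered to C1), "copy R1 C1->C2" (steered to C1) and "R3 := R1 + R3" (steered to C2). Stage labels: F, D, I, Ex, WB. Annotations: wakeup signals, ICN delay, wait for R1]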

Baseline Steering Scheme (dependence-based) 1. Minimize communication penalty • If all source operands available • Select clusters that minimize # communications • If any source operand not available • Select producer cluster 2. Maximize workload balance • Choose the least loaded of clusters selected by rule 1 One exception: • If workload imbalance > threshold, ignore rule 1
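The two rules and the exception above can be sketched as a single function. This is an illustrative sketch only: the function name, the operand representation and the imbalance metric (most loaded minus least loaded cluster) are assumptions, since the slide does not define how imbalance is measured.

```python
def baseline_steer(sources, load, threshold):
    """Pick a cluster for one instruction.

    sources   -- one (available, clusters) pair per source operand; `clusters`
                 is the set of clusters holding the value, or the producer
                 cluster if the value is not available yet
    load      -- per-cluster workload, indexed by cluster id
    threshold -- imbalance limit (assumed metric: max load - min load)
    """
    clusters = range(len(load))
    if max(load) - min(load) > threshold:
        candidates = list(clusters)            # exception: ignore rule 1
    else:
        pending = [locs for ready, locs in sources if not ready]
        if pending:
            # Rule 1, some operand not available: go to a producer cluster.
            candidates = sorted(set().union(*pending))
        else:
            # Rule 1, all operands available: minimize the number of copies.
            def comms(c):
                return sum(1 for _, locs in sources if c not in locs)
            fewest = min(comms(c) for c in clusters)
            candidates = [c for c in clusters if comms(c) == fewest]
    # Rule 2: the least loaded of the candidates.
    return min(candidates, key=lambda c: load[c])


load = [5, 1, 3, 2]
print(baseline_steer([(True, {0}), (True, {2})], load, threshold=8))   # 2
print(baseline_steer([(True, {0}), (False, {3})], load, threshold=8))  # 3
```

In the first call, clusters 0 and 2 both need one communication, so rule 2 picks the less loaded cluster 2. In the second call the pending operand pins the instruction to its producer, cluster 3.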

Topology-Aware Steering Scheme • Also minimize distance • Change part of rule 1: If all source operands are available: • Baseline: “Select clusters that minimize # communications” • Topology-aware: “Select clusters that minimize the longest communication distance”
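A small sketch of the modified rule follows, assuming a bidirectional ring so that distance is the shorter way around. The helper names are hypothetical, and measuring each operand from its nearest instance is an assumption; the slide only says to minimize the longest communication distance.

```python
def ring_distance(a, b, n):
    """Hops between clusters a and b on a bidirectional ring of n clusters."""
    d = abs(a - b) % n
    return min(d, n - d)


def topology_aware_candidates(operand_locations, n, distance=ring_distance):
    """Clusters that minimize the longest operand communication distance.

    operand_locations -- one set of clusters per source operand, listing
                         where an instance of that operand currently lives
    """
    def longest(c):
        # Each operand travels from its nearest instance (assumption).
        return max(
            (min(distance(src, c, n) for src in locs) for locs in operand_locations),
            default=0,
        )

    best = min(longest(c) for c in range(n))
    return [c for c in range(n) if longest(c) == best]


# Operands live in clusters 0 and 4 of an 8-cluster ring.
print(topology_aware_candidates([{0}, {4}], 8))   # [2, 6]
```

In this example the baseline rule would choose cluster 0 or 4 (one communication, but over 4 hops), whereas the topology-aware rule prefers clusters 2 and 6, where both operands arrive after 2 hops. The surviving candidates would then go through the workload balance rule as before.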

Design Issues: Bandwidth • For each additional input bypass path • 1 tag across the IQ • 1 RF write port • 1 entry to FU input MUXes • It increases the wakeup and bypass delays • Bandwidth requirements are rather low • 1 input bypass path per cluster (1 RF write port) • 2 links per connected cluster pair [Figure: two clusters, each attached to a router]

Design Issues: Latency • Performance very sensitive to communication latency • Simple routing structures and algorithms • Source routing • No intermediate buffering • In-transit messages have priority over newly injected ones
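The routing choices above can be illustrated with a minimal sketch, assuming a message is a dictionary whose "route" field lists the output links chosen by the sender. The message layout and function names are hypothetical.

```python
from collections import deque


def next_hop(message):
    """Source routing: the sender precomputed the path, so a router only
    consumes the head of the route and makes no routing decision itself."""
    return message["route"].pop(0)


def arbitrate_link(in_transit, inject_queue):
    """Choose the message that uses an output link this cycle.

    With no intermediate buffering, an in-transit message cannot wait, so it
    always wins; a locally injected message is sent only if the link is free.
    """
    if in_transit is not None:
        return in_transit
    if inject_queue:
        return inject_queue.popleft()
    return None


waiting = deque(["local-1"])
print(arbitrate_link("transit-7", waiting))  # transit-7 (local-1 keeps waiting)
print(arbitrate_link(None, waiting))         # local-1
```

Keeping the delayed message at its source cluster is what lets the routers do without buffers for in-transit traffic.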

Design Issues: Connectivity • Assumed 1-cycle communication delay between adjacent clusters • Number of “adjacents” dictated by technology and layout • Study topologies with different connectivity degrees

Design Issues: Point-to-point vs Buses • Point-to-point advantages • Access to links is arbitrated locally • Wires are shorter and less loaded • Shared buses are studied for comparison

Interconnects for 4 clusters (I) • Bus2 • 1 Bus per cluster, each connected to 1 write port • Latency = 4 cycles (2 for arbitration + 2 for transmission) • Arbitration overlaps with transmission [Figure: clusters C0-C3]

Interconnects for 4 clusters (II) • Synchronous Ring • Injection rules prevent 2 messages from arriving at once: • Even cycles: 1-hop: counter-clockwise / 2-hops: clockwise • Odd cycles: reverse directions [Figure: ring link usage in even and odd cycles; one direction injects 1-hop messages (or forwards in-transit ones), the other injects 2-hop messages; no conflict]
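A short sketch can check why these rules leave at most one arrival per cluster per cycle. It assumes each hop takes exactly one cycle with no delay at the intermediate cluster, so a 2-hop message injected in cycle t makes its last hop in cycle t+1; the names are hypothetical.

```python
CW, CCW = +1, -1   # clockwise / counter-clockwise
N = 4              # clusters in the ring


def inject_direction(cycle, hops):
    """Direction a message must take when injected in `cycle`."""
    even = cycle % 2 == 0
    if hops == 1:
        return CCW if even else CW
    if hops == 2:
        return CW if even else CCW
    raise ValueError("a 4-cluster bidirectional ring needs at most 2 hops")


def destination(src, cycle, hops):
    """Cluster reached by a message injected at `src` in `cycle`."""
    return (src + inject_direction(cycle, hops) * hops) % N


def last_hop_directions(final_cycle):
    """Directions in which messages can make their final hop in a cycle."""
    dirs = set()
    for hops in (1, 2):
        inject_cycle = final_cycle - (hops - 1)
        dirs.add(inject_direction(inject_cycle, hops))
    return dirs


# In every cycle all final hops travel in the same direction, so a cluster
# can only receive from one neighbour and one RF write port is enough.
for t in range(1, 9):
    assert len(last_hop_directions(t)) == 1
print(last_hop_directions(4) == {CCW}, last_hop_directions(5) == {CW})  # True True
```

The second hop of a 2-hop message and a new 1-hop message still want the same link in the same cycle; that case is covered by giving the in-transit message priority.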

Interconnects for 4 clusters (III) • Partially Asynchronous Ring • Messages may issue in any cycle • 2 messages may arrive at once • Small input queues [Figure: ring of clusters c0-c3 with input queues]
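The role of the input queue can be sketched as follows, assuming one RF write port that retires one message per cycle and a queue of hypothetical capacity. The class name, the same-cycle write of a message arriving at an empty queue, and the overflow flag are assumptions made for illustration.

```python
from collections import deque


class InputQueue:
    """Buffers messages that arrive together at a cluster with one write port."""

    def __init__(self, capacity):
        self.capacity = capacity   # hypothetical size, not taken from the slides
        self.q = deque()

    def cycle(self, arrivals):
        """Accept this cycle's arrivals, then write at most one value to the RF.

        Returns (value written or None, overflow flag).
        """
        overflow = False
        for msg in arrivals:
            if len(self.q) >= self.capacity:
                overflow = True    # the caller must recover, e.g. by a flush
            else:
                self.q.append(msg)
        written = self.q.popleft() if self.q else None
        return written, overflow


iq = InputQueue(capacity=2)
print(iq.cycle(["a", "b"]))  # ('a', False): 'b' waits in the queue
print(iq.cycle([]))          # ('b', False)
```

A second message arriving in the same cycle costs its consumer one extra cycle, but the sender never has to wait for a specific injection slot.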

Interconnects for 4 clusters (IV) • Ideal Ring • Contention-free • unlimited number of links • unlimited number of RF write ports • For comparison purposes (upper-bound performance)

Interconnects for 8 Clusters (I) • Buses • Analogous to those for 4 clusters • Bus2: same latency (optimistic): 2+2 cycles • Bus4: twice the latency (realistic): 4+4 cycles • Rings • Analogous to those for 4 clusters • Synchronous and Asynchronous • Max. Distance = 4 hops (average 2.29 hops)

Interconnects for 8 Clusters (II) • Mesh • Max. distance = 4 hops (average = 2 hops) • 2 in-transit messages may compete for the same output link • Constrained connectivity [Figure: router with Top, Left and Right links and the cluster datapath; one path is labeled "only for last hop of messages"]

Interconnects for 8 Clusters (III) • Torus • Max. distance = 3 hops • Same connectivity constraints as the mesh [Figure: router in which one path is labeled "only for last hop of messages"]
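The hop counts quoted for the 8-cluster ring, mesh and torus can be reproduced with a breadth-first search. The sketch below assumes the mesh and torus are laid out as 2 rows of 4 clusters, which the slides do not state explicitly but which matches routers with Top, Left and Right links; function names are hypothetical.

```python
from collections import deque


def ring(n):
    """Bidirectional ring: each cluster links to its two neighbours."""
    return {i: {(i - 1) % n, (i + 1) % n} for i in range(n)}


def grid(rows, cols, wrap):
    """rows x cols mesh (wrap=False) or torus (wrap=True)."""
    adj = {}
    for r in range(rows):
        for c in range(cols):
            nbrs = set()
            for dr, dc in ((0, 1), (0, -1), (1, 0), (-1, 0)):
                rr, cc = r + dr, c + dc
                if wrap:
                    rr, cc = rr % rows, cc % cols
                if 0 <= rr < rows and 0 <= cc < cols and (rr, cc) != (r, c):
                    nbrs.add((rr, cc))
            adj[(r, c)] = nbrs
    return adj


def hop_stats(adj):
    """Return (maximum distance, average distance) over all cluster pairs."""
    total = pairs = longest = 0
    for src in adj:
        dist = {src: 0}
        todo = deque([src])
        while todo:                      # breadth-first search from src
            u = todo.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    todo.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
        longest = max(longest, max(dist.values()))
    return longest, total / pairs


for name, adj in (("ring", ring(8)),
                  ("mesh", grid(2, 4, wrap=False)),
                  ("torus", grid(2, 4, wrap=True))):
    longest, average = hop_stats(adj)
    print(f"{name}: max {longest} hops, average {average:.2f} hops")
```

Under the assumed 2x4 layout this gives 4 hops maximum and 2.29 on average for the ring, 4 and 2.00 for the mesh, and 3 maximum for the torus, in line with the slides. The torus average of about 1.71 hops is a result of this sketch, not a figure from the presentation.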

Interconnects for 8 Clusters (IV) • Ideal Torus • Contention-free • unlimited number of links • unlimited number of RF write ports • For comparison purposes (upper-bound performance)

Router Structures • Features common to all ICNs • No intermediate buffering • Partially asynchronous ICNs • Competition for a write port • Add small input queues • Topologies with 3 adjacent nodes • Competition for the same output link • Constrained connectivity [Figure: router with Top, Left and Right links, an input queue (Qin) and the cluster datapath]

Experimental Setup • Simulation • Extended version of sim-outorder (SimpleScalar v3.0) • 14 Mediabench programs • Compiled with –O4 for an Alpha AXP • Architecture • L1 D-cache: 64KB, 2-way, 3-cycle hit • 128 ROB, 64 LSQ • Each cluster: 2-way issue, 16-entry IQ, 56 physical regs.

Performance: 4 Clusters • Poor performance of Bus2 • Asynchronous Ring • Better than Synchronous Ring • Close to Ideal (within 1%)

Synchronous / Asynchronous • Contention delays • Lower for Async. Ring • A message issues as soon as the link is available • Higher for 1-hop messages • A single path • Sync. Ring: can issue in only 1 cycle out of every 2

Length of Input Queues • Max. observed occupancy < 9 entries • Handle overflows by flushing the pipeline • Rather than including complex control flow [Figure: sample statistics (djpeg)]

Performance: 8 Clusters • Poor performance of buses • Connectivity degree has a significant impact • Asynchronous Torus close to Ideal (within 1.5%)

Topology-Aware Steering • 16.5% IPC improvement with 8 clusters (2.5% with 4 clusters)

Summary • An efficient topology-aware steering scheme • Cluster point-to-point interconnects • For 4 clusters and 8 clusters • Designed to minimize complexity and latency • Compared to • Bus-based models • Idealized models with unlimited bandwidth

Conclusions • The choice of ICN is crucial for performance • Point-to-point better than buses • Asynchronous rings better than synchronous • Asynchronous interconnects perform close to ideal • with minimal complexity • Higher connectivity significantly improves performance • Topology-aware steering essential to reduce latency • Especially with many clusters
