Microarchitectural Wire Management for Performance and Power in Partitioned Architectures

Rajeev Balasubramonian, Naveen Muralimanohar, Karthik Ramani, Venkatanand Venkatachalapathy • University of Utah

Overview/Motivation • Wire delays are costly for performance and power • Latencies of 30 cycles to reach ends of a chip • 50% of dynamic power is in interconnect switching (Magen et al. SLIP 04) • Abundant number of metal layers

Wire Characteristics • Wire resistance and capacitance per unit length • (Width & Spacing) ↑ → Delay ↓ (as delay ∝ RC), Bandwidth ↓

Design Space Exploration • Tuning wire width and spacing • [Figure: B wires (dimension d) versus L wires (dimension 2d), compared on resistance, capacitance, and bandwidth]

Transmission Lines • Allow extremely low delay • High implementation complexity and overhead! • Large width • Large spacing between wires • Design of sensing circuit • Shielding power and ground lines adjacent to each line • Implemented in test CMOS chips • Not employed in this study

Design Space Exploration • Tuning repeater size and spacing • Traditional wires: large repeaters, optimum spacing • Power-optimal wires: smaller repeaters, increased spacing • [Figure: delay and power of the two designs]

Design Space Exploration • Base case: B wires • Bandwidth optimized: W wires • Power optimized: P wires • Power and B/W optimized: PW wires • Fast, low bandwidth: L wires

Outline • Overview • Wire Design Space Exploration • Employing L wires for Performance • PW wires: The Power Optimizers • Results • Conclusions

Evaluation Platform • Centralized front-end • I-Cache & D-Cache • LSQ • Branch Predictor • Clustered back-end • [Figure: clusters connected to the L1 D-cache]

Cache Pipeline • Baseline: effective address transfer to LSQ and L1 D-cache (10c), memory dependence resolution (5c), cache access (5c), data return at 20c • With L wires: 8-bit transfer (5c), partial memory dependence resolution (3c), full effective address transfer (10c), cache access (5c), data return at 14c • [Figure: functional unit, LSQ, and L1 D-cache pipeline timing]

L wires: Accelerating cache access • Transmit the least significant bits (LSBs) of the effective address through L wires • Faster memory disambiguation • Partial comparison of loads and stores in LSQ • Introduces false dependences (< 9%) • Indexing data and tag RAM arrays • LSBs can prefetch data out of L1$ • Reduce access latency of loads
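The partial comparison in the LSQ can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names are hypothetical, and the 8-bit width is an assumption borrowed from the "8-bit transfer" step in the cache pipeline slide. A load may proceed early only if no earlier store matches its low-order bits; a match on the low bits with a different full address is a false dependence.

```python
LSB_BITS = 8  # assumed width of the partial address sent on L wires


def partial_match(load_addr, store_addr, bits=LSB_BITS):
    """Compare only the low-order `bits` bits of two addresses."""
    mask = (1 << bits) - 1
    return (load_addr & mask) == (store_addr & mask)


def classify(load_addr, earlier_store_addrs, bits=LSB_BITS):
    """Return (may_issue_early, false_dependences) for one load.

    may_issue_early is True when no earlier store matches the load's
    low-order bits. false_dependences lists stores that match on the
    low-order bits but differ in the full address.
    """
    conflicts = [s for s in earlier_store_addrs
                 if partial_match(load_addr, s, bits)]
    false_deps = [s for s in conflicts if s != load_addr]
    return (len(conflicts) == 0, false_deps)


if __name__ == "__main__":
    # 0x9A34 shares the low byte 0x34 with the load, so it is a false dependence.
    print(classify(0x1234, [0x5678, 0x9A34]))
    # No store shares the low byte, so the load may issue early.
    print(classify(0x1234, [0x5678]))
```

Widening `bits` reduces false dependences at the cost of more L wires, which is the trade-off the slide's "< 9%" figure refers to.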

L wires: Narrow Bit Width Operands • PowerPC: Data bit-width determines FU latency • Transfer of 10 bit integers on L wires • Can introduce scheduling difficulties • A predictor table of saturating counters • Accuracy of 98% • Reduction in branch mispredict penalty
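A predictor table of saturating counters, as mentioned above, might look like the following Python sketch. The table size, the 2-bit counters, the threshold, the PC-modulo indexing, and the unsigned 10-bit test are all assumptions made for illustration; the slides only state that such a table is used and that 10-bit integers travel on L wires.

```python
NARROW_BITS = 10  # the slide mentions 10-bit integers on L wires


def is_narrow(value, bits=NARROW_BITS):
    """True if a non-negative value fits in `bits` bits (assumed unsigned)."""
    return 0 <= value < (1 << bits)


class NarrowWidthPredictor:
    """Table of saturating counters indexed by instruction address (hypothetical design)."""

    def __init__(self, entries=1024, max_count=3, threshold=2):
        self.counters = [0] * entries
        self.max_count = max_count
        self.threshold = threshold

    def _index(self, pc):
        return pc % len(self.counters)

    def predict_narrow(self, pc):
        """Predict that the instruction at `pc` produces a narrow result."""
        return self.counters[self._index(pc)] >= self.threshold

    def update(self, pc, result):
        """Train the counter with the actual result once it is known."""
        i = self._index(pc)
        if is_narrow(result):
            self.counters[i] = min(self.counters[i] + 1, self.max_count)
        else:
            self.counters[i] = max(self.counters[i] - 1, 0)


if __name__ == "__main__":
    p = NarrowWidthPredictor()
    p.update(0x40, 5)
    p.update(0x40, 7)
    print(p.predict_narrow(0x40))  # two narrow results reach the threshold
```

A prediction lets the scheduler plan for the L-wire transfer in advance, which is how a predictor addresses the scheduling difficulty noted on the slide.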

Power Efficient Wires • Idea: steer non-critical data through energy efficient PW interconnect • [Figure: base case B wires versus power and B/W optimized PW wires]

PW wires: Power/Bandwidth Efficient • Ready register operands • Transfer of data at instruction dispatch • Transfer of input operands to remote register file • Covered by long dispatch to issue latency • Store data • Could stall commit process • Delay dependent loads • [Figure: Rename & Dispatch feeding four clusters, each with IQ, FU, and Regfile; the operand is ready at cycle 90 and the consumer instruction is dispatched at cycle 100]
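The steering idea above can be sketched in a few lines of Python. This is an illustrative assumption of how the two cases on the slide (operands already ready at dispatch, and store data) could be mapped to PW wires, with everything else left on B wires; the function and enum names are hypothetical.

```python
from enum import Enum


class Wire(Enum):
    B = "B"    # base-case wires
    PW = "PW"  # power- and bandwidth-optimized wires


def steer(is_store_data, ready_at_dispatch):
    """Pick an interconnect for one transfer (hypothetical policy).

    Store data and register operands that are already ready when the
    consumer is dispatched are treated as non-critical and sent on PW
    wires; all other transfers stay on B wires.
    """
    if is_store_data or ready_at_dispatch:
        return Wire.PW
    return Wire.B


if __name__ == "__main__":
    # Operand ready at cycle 90, consumer dispatched at cycle 100:
    # the operand is ready at dispatch, so the slower PW wires suffice.
    print(steer(is_store_data=False, ready_at_dispatch=90 <= 100))
```

The slide's caveats still apply to such a policy: slow store data could stall commit and delay dependent loads.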


Evaluation Methodology • SimpleScalar-3.0 augmented to simulate a dynamically scheduled 4-cluster model • Crossbar interconnects (L, B and PW wires) • Crossbar latencies: B wires (2 cycles), L wires (1 cycle), PW wires (3 cycles) • [Figure: clusters connected to the L1 D-cache through the crossbar]

Heterogeneous Interconnects • Intercluster global Interconnect • 72 B wires (64 data bits and 8 control bits) • Repeaters sized and spaced for optimum delay • 18 L wires • Wide wires and large spacing • Occupies more area • Low latencies • 144 PW wires • Poor delay • High bandwidth • Low power

Analytical Model • RC model of the wire • $C = C_a + W_s C_b + C_c / S$ • Term 1 ($C_a$): fringing capacitance • Term 2 ($W_s C_b$): capacitance between adjacent metal layers • Term 3 ($C_c / S$): capacitance between adjacent wires • Total Power = Short-Circuit Power + Switching Power + Leakage Power
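As a rough illustration of the analytical model, the Python sketch below evaluates the three-term capacitance expression and the power sum. The parameter names mirror the slide's symbols, the numeric values are arbitrary placeholders rather than technology data, and treating `w_s` as a single width-dependent factor is an assumption.

```python
def wire_capacitance(c_a, c_b, c_c, w_s, s):
    """C = C_a + W_s * C_b + C_c / S, following the slide's three terms."""
    fringing = c_a               # term 1: fringing capacitance
    inter_layer = w_s * c_b      # term 2: adjacent metal layers
    inter_wire = c_c / s         # term 3: adjacent wires
    return fringing + inter_layer + inter_wire


def total_power(short_circuit, switching, leakage):
    """Total power as the sum of the three components named on the slide."""
    return short_circuit + switching + leakage


if __name__ == "__main__":
    # Placeholder coefficients in arbitrary units.
    base = wire_capacitance(c_a=1.0, c_b=2.0, c_c=4.0, w_s=3.0, s=2.0)
    wider_spacing = wire_capacitance(c_a=1.0, c_b=2.0, c_c=4.0, w_s=3.0, s=4.0)
    # Doubling the spacing halves only the inter-wire term.
    print(base, wider_spacing)
```

Only the third term shrinks with spacing, which is why wider spacing lowers capacitance (and hence RC delay) at the cost of fewer wires per unit area.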

Evaluation Methodology • SimpleScalar-3.0 augmented to simulate a dynamically scheduled 16-cluster model • Ring latencies • B wires (4 cycles) • PW wires (6 cycles) • L wires (2 cycles) • [Figure: clusters, I-cache, D-cache, and LSQ connected by crossbars and a ring interconnect]

IPC Improvements: L wires • L wires improve performance by 4.2% on a four-cluster system and 7.1% on a sixteen-cluster system

Four Cluster System: ED2 Improvements

Link          | Relative metal area | IPC  | Relative processor energy (10%) | Relative ED2 (10%) | Relative ED2 (20%)
144 B         | 1.0                 | 0.95 | 100                             | 100                | 100
288 PW        | 1.0                 | 0.92 | 97                              | 103.4              | 100.2
144 PW, 36 L  | 1.5                 | 0.96 | 97                              | 95.0               | 92.1
288 B         | 2.0                 | 0.98 | 103                             | 96.6               | 99.2
288 PW, 36 L  | 2.0                 | 0.97 | 99                              | 94.4               | 93.2
144 B, 36 L   | 2.0                 | 0.99 | 101                             | 93.3               | 94.5
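The relative ED2 values in the four-cluster results can be approximately reproduced from the IPC and energy columns. The Python sketch below assumes the standard definition ED2 = energy × delay², with delay taken as proportional to 1/IPC for a fixed instruction count, normalized to the 144 B baseline. The function name is hypothetical, and small deviations from the slide's numbers are expected because the inputs are rounded.

```python
def relative_ed2(energy, ipc, base_energy=100.0, base_ipc=0.95):
    """Relative ED2 (baseline = 100), with delay modeled as 1/IPC."""
    return 100.0 * (energy / base_energy) * (base_ipc / ipc) ** 2


if __name__ == "__main__":
    # (link, IPC, relative processor energy at 10%) from the four-cluster results.
    rows = [
        ("144 B", 0.95, 100),
        ("288 PW", 0.92, 97),
        ("144 PW, 36 L", 0.96, 97),
    ]
    for link, ipc, energy in rows:
        print(f"{link}: {relative_ed2(energy, ipc):.1f}")
```

Because delay enters squared, a small IPC loss (as with 288 PW) can outweigh an energy saving, while a small IPC gain can offset an energy increase.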

Sixteen Cluster System: ED2 Gains

Link          | IPC  | Relative processor energy (20%) | Relative ED2 (20%)
144 B         | 1.11 | 100                             | 100
144 PW, 36 L  | 1.05 | 94                              | 105.3
288 B         | 1.18 | 105                             | 93.1
144 B, 36 L   | 1.19 | 102                             | 88.7
288 B, 36 L   | 1.22 | 107                             | 88.7

Conclusions • Exposing the wire design space to the architecture • A case for micro-architectural wire management! • A low latency low bandwidth network alone helps improve performance by up to 7% • ED2 improvements of about 11% compared to a baseline processor with homogeneous interconnect • Entails hardware complexity

Future work • 3-D wire model for the interconnects • Design of heterogeneous clusters • Interconnects for cache coherence and L2$

Questions and Comments? Thank you!
