Microarchitectural Wire Management for Performance and Power in Partitioned Architectures Rajeev Balasubramonian Naveen Muralimanohar Karthik Ramani Venkatanand Venkatachalapathy University of Utah
Overview/Motivation • Wire delays are costly for performance and power • Latencies of 30 cycles to reach ends of a chip • 50% of dynamic power is in interconnect switching (Magen et al. SLIP 04) • Abundant number of metal layers
Wire Characteristics • Wire width and spacing set the resistance and capacitance per unit length • Width and spacing therefore determine delay (delay ∝ RC) and bandwidth
Design Space Exploration • Tuning wire width and spacing [Figure: B wires at dimension d compared with L wires at 2d, annotated with resistance, capacitance, and bandwidth]
Transmission Lines • Allow extremely low delay • High implementation complexity and overhead! • Large width • Large spacing between wires • Design of sensing circuit • Shielding power and ground lines adjacent to each line • Implemented in test CMOS chips • Not employed in this study
Design Space Exploration • Tuning repeater size and spacing • Traditional wires: large repeaters, optimum spacing • Power-optimal wires: smaller repeaters, increased spacing [Figure: delay and power of the two designs]
Design Space Exploration • B wires: base case • W wires: bandwidth optimized • P wires: power optimized • PW wires: power and bandwidth optimized • L wires: fast, low bandwidth
Outline • Overview • Wire Design Space Exploration • Employing L wires for Performance • PW wires: The Power Optimizers • Results • Conclusions
Evaluation Platform • Centralized front-end • I-Cache & D-Cache • LSQ • Branch Predictor • Clustered back-end [Figure: L1 D cache connected to the clusters]
Cache Pipeline [Figure, baseline: effective address transfer from functional unit to LSQ 10c, memory dependence resolution 5c, cache access 5c, data return at 20c] [Figure, with L wires: 8-bit transfer 5c, partial memory dependence resolution 3c, effective address transfer 10c, cache access 5c, data return at 14c]
L wires: Accelerating cache access • Transmit the least significant bits (LSBs) of the effective address on L wires • Faster memory disambiguation • Partial comparison of loads and stores in the LSQ • Introduces false dependences (< 9%) • Indexing data and tag RAM arrays • LSBs can prefetch data out of the L1 cache • Reduces access latency of loads
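The partial comparison in the LSQ can be illustrated with a minimal sketch. It assumes 8 low-order address bits arrive early (the "8-bit Transfer" label in the cache pipeline figure) and ignores access sizes and alignment; the function names are hypothetical and not taken from the authors' simulator.

```python
def partial_match(load_addr: int, store_addr: int, lsb_bits: int = 8) -> bool:
    """Compare only the low-order bits that arrive early on the L wires."""
    mask = (1 << lsb_bits) - 1
    return (load_addr & mask) == (store_addr & mask)


def classify_load(load_addr: int, earlier_stores: list, lsb_bits: int = 8) -> str:
    """Classify a load against earlier stores using the partial comparison.

    'independent': no store matches in the low bits, so none can match in full
                   and the load may proceed early.
    'true':        a store matches the full address (a real dependence).
    'false':       only the low bits match (a false dependence that is
                   cleared once the full address arrives).
    """
    candidates = [s for s in earlier_stores if partial_match(load_addr, s, lsb_bits)]
    if not candidates:
        return "independent"
    if any(s == load_addr for s in candidates):
        return "true"
    return "false"


if __name__ == "__main__":
    print(classify_load(0x1234, [0x5678]))  # low bytes 0x34 vs 0x78 differ
    print(classify_load(0x1234, [0xAB34]))  # low bytes match, full addresses differ
    print(classify_load(0x1234, [0x1234]))  # full match
```

The partial check is conservative: it never misses a true dependence, and its only cost is the occasional false one.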
L wires: Narrow Bit Width Operands • PowerPC: Data bit-width determines FU latency • Transfer of 10-bit integers on L wires • Can introduce scheduling difficulties • A predictor table of saturating counters • Accuracy of 98% • Reduction in branch mispredict penalty
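A predictor table of saturating counters for narrow operands might look like the sketch below. The table size, 2-bit counters, PC-based indexing, and the treatment of "narrow" as a non-negative value that fits in 10 bits are all assumptions made for illustration; the slides do not specify them.

```python
class NarrowWidthPredictor:
    """Hypothetical PC-indexed table of saturating counters."""

    def __init__(self, entries: int = 1024, counter_bits: int = 2, narrow_bits: int = 10):
        self.entries = entries
        self.max_count = (1 << counter_bits) - 1       # 3 for 2-bit counters
        self.threshold = (self.max_count + 1) // 2     # predict narrow at 2 or above
        self.narrow_bits = narrow_bits
        self.table = [0] * entries                     # start at "not narrow"

    def _index(self, pc: int) -> int:
        # Assumed indexing: drop the two low PC bits, wrap to the table size.
        return (pc >> 2) % self.entries

    def is_narrow(self, value: int) -> bool:
        return 0 <= value < (1 << self.narrow_bits)

    def predict(self, pc: int) -> bool:
        """True if the result of the instruction at pc is predicted narrow."""
        return self.table[self._index(pc)] >= self.threshold

    def update(self, pc: int, value: int) -> None:
        """Train the counter with the actual result value."""
        i = self._index(pc)
        if self.is_narrow(value):
            self.table[i] = min(self.max_count, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)
```

A scheduler could consult `predict` when waking up consumers, so that a result expected to be narrow is assumed to arrive over the faster L wires.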
Power Efficient Wires • Idea: steer non-critical data through the energy-efficient PW interconnect [Figure: base case B wires alongside power and B/W optimized PW wires]
PW wires: Power/Bandwidth Efficient • Ready register operands • Transfer of data at instruction dispatch • Transfer of input operands to remote register file • Covered by long dispatch-to-issue latency • Store data • Could stall commit process • Delay dependent loads [Figure: Rename & Dispatch feeding clusters, each with IQ, FU, and Regfile; the operand is ready at cycle 90 and the consumer instruction is dispatched at cycle 100]
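The steering idea can be sketched as a simple rule: an operand that is already ready when its consumer is dispatched is off the critical path and can travel on PW wires, as can store data. This is a simplified reading of the slide, not the authors' exact policy, and `choose_wire` is a hypothetical name.

```python
def choose_wire(operand_ready_cycle: int,
                consumer_dispatch_cycle: int,
                is_store_data: bool = False) -> str:
    """Pick an interconnect class for one operand transfer.

    Assumption: readiness at dispatch time is the only criticality signal.
    """
    if is_store_data:
        # Store data is steered to PW wires; the slide notes this can
        # stall commit or delay dependent loads.
        return "PW"
    if operand_ready_cycle <= consumer_dispatch_cycle:
        # Ready register operand: sent at dispatch, with the extra PW
        # latency covered by the dispatch-to-issue delay.
        return "PW"
    # Operand produced after dispatch: likely latency-critical.
    return "B"


if __name__ == "__main__":
    # Example from the slide's figure: ready at cycle 90, consumer dispatched at 100.
    print(choose_wire(90, 100))
    print(choose_wire(105, 100))
```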
Evaluation Methodology • SimpleScalar-3.0 augmented to simulate a dynamically scheduled 4-cluster model • Crossbar interconnects (L, B and PW wires) • Crossbar latencies: B wires (2 cycles), L wires (1 cycle), PW wires (3 cycles) [Figure: L1 D cache and clusters connected by the crossbar]
Heterogeneous Interconnects • Intercluster global interconnect • 72 B wires (64 data bits and 8 control bits): repeaters sized and spaced for optimum delay • 18 L wires: wide wires and large spacing, occupy more area, low latencies • 144 PW wires: poor delay, high bandwidth, low power
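The three wire counts appear to be chosen to occupy equal metal area. The sketch below checks this under per-wire area ratios (B = 1, L = 4, PW = 0.5) that are inferred from the relative metal area column of the four-cluster results later in the deck; the slides never state these ratios directly, so treat them as an assumption.

```python
# Per-wire metal area relative to one B wire (inferred, not stated in the slides).
RELATIVE_AREA = {"B": 1.0, "L": 4.0, "PW": 0.5}


def metal_area(link: dict) -> float:
    """Total metal area of a link, in units of one B wire."""
    return sum(RELATIVE_AREA[kind] * count for kind, count in link.items())


if __name__ == "__main__":
    for link in ({"B": 72}, {"L": 18}, {"PW": 144}):
        print(link, metal_area(link))
    # A 144 B-wire link serves as the 1.0 reference in the results table.
    print(metal_area({"PW": 144, "L": 36}) / metal_area({"B": 144}))
```

Under these ratios, 72 B wires, 18 L wires, and 144 PW wires each occupy 72 units of area.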
Analytical Model • RC model of the wire • $C = C_a + W_s C_b + C_c / S$, where term 1 ($C_a$) is the fringing capacitance, term 2 ($W_s C_b$) is the capacitance between adjacent metal layers, and term 3 ($C_c / S$) is the capacitance between adjacent wires • Total Power = Short-Circuit Power + Switching Power + Leakage Power
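A minimal numerical sketch of the capacitance model follows. It assumes the width symbol in the second term is the wire width, that S is the spacing between adjacent wires, and that resistance per unit length scales as the inverse of width. The coefficient values are placeholders chosen only to show the trend, not process data.

```python
def wire_capacitance(width: float, spacing: float,
                     c_a: float, c_b: float, c_c: float) -> float:
    """Capacitance per unit length: fringing + layer-to-layer + wire-to-wire."""
    return c_a + width * c_b + c_c / spacing


def relative_delay(width: float, spacing: float,
                   c_a: float = 1.0, c_b: float = 1.0, c_c: float = 1.0) -> float:
    """Relative RC product per unit length (delay taken as proportional to RC).

    Assumption: resistance per unit length is proportional to 1 / width.
    """
    resistance = 1.0 / width
    return resistance * wire_capacitance(width, spacing, c_a, c_b, c_c)


if __name__ == "__main__":
    base = relative_delay(width=1.0, spacing=1.0)
    wide = relative_delay(width=2.0, spacing=2.0)
    print(base, wide)  # doubling width and spacing lowers the RC product
```

With these placeholder coefficients, doubling both width and spacing lowers the RC product while using more area per wire, which is the trade-off behind L wires.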
Evaluation Methodology • SimpleScalar-3.0 augmented to simulate a dynamically scheduled 16-cluster model • Ring latencies • B wires (4 cycles) • PW wires (6 cycles) • L wires (2 cycles) [Figure: I-cache, D-cache, and LSQ connected to the clusters through crossbars and a ring interconnect]
IPC improvements: L wires • L wires improve performance by 4.2% on a four-cluster system and 7.1% on a sixteen-cluster system
Four Cluster System: ED2 Improvements

| Link | Relative metal area | IPC | Relative processor energy (10%) | Relative ED2 (10%) | Relative ED2 (20%) |
|---|---|---|---|---|---|
| 144 B | 1.0 | 0.95 | 100 | 100 | 100 |
| 288 PW | 1.0 | 0.92 | 97 | 103.4 | 100.2 |
| 144 PW, 36 L | 1.5 | 0.96 | 97 | 95.0 | 92.1 |
| 288 B | 2.0 | 0.98 | 103 | 96.6 | 99.2 |
| 288 PW, 36 L | 2.0 | 0.97 | 99 | 94.4 | 93.2 |
| 144 B, 36 L | 2.0 | 0.99 | 101 | 93.3 | 94.5 |
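The ED2 column can be approximately recomputed from the IPC and energy columns. The sketch assumes ED2 means energy times delay squared, that delay is inversely proportional to IPC at a fixed clock, and that the 144 B row is the baseline. Because the table's inputs are rounded, some rows reproduce only to within about a point.

```python
def relative_ed2(energy: float, ipc: float,
                 base_energy: float = 100.0, base_ipc: float = 0.95) -> float:
    """Relative energy * delay^2, normalized so the baseline scores 100.

    Assumption: delay is proportional to 1 / IPC (same clock, same work).
    """
    delay_ratio = base_ipc / ipc
    return 100.0 * (energy / base_energy) * delay_ratio ** 2


if __name__ == "__main__":
    # (link, IPC, relative processor energy at 10%) from the four-cluster table.
    rows = [
        ("144 B", 0.95, 100),
        ("288 PW", 0.92, 97),
        ("144 PW, 36 L", 0.96, 97),
    ]
    for link, ipc, energy in rows:
        print(f"{link}: {relative_ed2(energy, ipc):.1f}")
```

Delay enters squared, so a small IPC gain can offset a comparable rise in energy.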
Sixteen Cluster System: ED2 Gains

| Link | IPC | Relative processor energy (20%) | Relative ED2 (20%) |
|---|---|---|---|
| 144 B | 1.11 | 100 | 100 |
| 144 PW, 36 L | 1.05 | 94 | 105.3 |
| 288 B | 1.18 | 105 | 93.1 |
| 144 B, 36 L | 1.19 | 102 | 88.7 |
| 288 B, 36 L | 1.22 | 107 | 88.7 |
Conclusions • Exposing the wire design space to the architecture • A case for microarchitectural wire management! • A low-latency, low-bandwidth network alone helps improve performance by up to 7% • ED2 improvements of about 11% compared to a baseline processor with homogeneous interconnect • Entails hardware complexity
Future work • 3-D wire model for the interconnects • Design of heterogeneous clusters • Interconnects for cache coherence and L2$
Questions and Comments? Thank you!