Processor Architecture
Microarchitectural Wire Management for Performance and Power in Partitioned Architectures
Rajeev Balasubramonian, Naveen Muralimanohar, Karthik Ramani, Venkatanand Venkatachalapathy
University of Utah
Overview/Motivation
• Wire delays hamper performance
• Power is incurred in moving data across the chip
  • 50% of dynamic power goes to interconnect switching (Magen et al., SLIP '04)
  • The MIT Raw processor's on-chip network consumes 36% of total chip power (Wang et al., 2003)
• Abundant metal layers are available to exploit
Wire characteristics
• Wire resistance and capacitance per unit length determine its behavior
  • Increasing width lowers resistance but raises capacitance
  • Increasing spacing lowers coupling capacitance
• Delay ∝ RC, so wide, well-spaced wires are fast
• But fewer such wires fit in a fixed routing area: bandwidth drops
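The width/spacing trade-offs above can be sketched with a simple first-order RC model. The constants below are illustrative placeholders, not values from any particular process technology:

```python
# First-order RC wire model: delay grows with resistance x capacitance.
# rho, c_area, c_fringe are illustrative constants, not process data.

def wire_rc(width, spacing, rho=0.1, c_area=0.2, c_fringe=0.05):
    """Per-unit-length resistance and capacitance of a wire.

    Wider wires lower resistance; wider spacing lowers coupling capacitance.
    """
    r = rho / width                           # R falls as width grows
    c = c_area * width + c_fringe / spacing   # C rises with width, falls with spacing
    return r, c

def wire_delay(width, spacing):
    """Delay is proportional to the RC product."""
    r, c = wire_rc(width, spacing)
    return r * c

# Doubling width and spacing (L-wire style) cuts delay, but only half as
# many wires fit in the same routing area -- the bandwidth cost.
assert wire_delay(2.0, 2.0) < wire_delay(1.0, 1.0)
```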
Design space exploration
• Tuning wire width and spacing
[figure: wire cross-sections at spacing d vs. 2d, annotated with the resistance, capacitance, and bandwidth trends relative to B wires]
Transmission Lines
• Like L wires, extremely low delay
• But constraining implementation requirements:
  • Large width
  • Large spacing between wires
  • Design of sensing circuits
• Implemented in test CMOS chips
Design space exploration
• Tuning repeater size and spacing
  • Traditional wires: large repeaters at optimum spacing → minimum delay
  • Power-optimal wires: smaller repeaters with increased spacing → lower power at higher delay
Design space exploration
• B wires: delay-optimized
• W wires: bandwidth-optimized
• P wires: power-optimized
• PW wires: power- and bandwidth-optimized
• L wires: fast, low bandwidth
Heterogeneous Interconnects
• Inter-cluster global interconnect:
  • 72 B wires
    • Repeaters sized and spaced for optimum delay
  • 18 L wires
    • Wide wires with large spacing
    • Occupy more area
    • Low latency
  • 144 PW wires
    • Poor delay
    • High bandwidth
    • Low power
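A minimal data model of this heterogeneous link, using the wire counts from this slide and the crossbar latencies quoted in the evaluation; the relative-energy numbers are illustrative assumptions, not measured values:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class WireClass:
    name: str
    width_bits: int    # wires available per link (from the slides)
    latency: int       # cycles to cross the link (4-cluster crossbar numbers)
    rel_energy: float  # energy per bit relative to B wires (assumed, illustrative)

B  = WireClass("B",  72,  2, 1.0)
L  = WireClass("L",  18,  1, 1.6)   # assumption: fat, widely spaced wires cost more energy
PW = WireClass("PW", 144, 3, 0.5)   # assumption: roughly half the energy per bit

def transfer_cycles(wire: WireClass, message_bits: int) -> int:
    """Cycles to push a message across the link, pipelining flits."""
    flits = -(-message_bits // wire.width_bits)  # ceiling division
    return wire.latency + flits - 1

# A 144-bit message: one flit on PW wires, two on B wires, eight on L wires.
assert transfer_cycles(PW, 144) == 3
assert transfer_cycles(B, 144) == 3
assert transfer_cycles(L, 144) == 8
```

Note how the wide PW link matches B-wire transfer time for bulk messages despite its higher per-hop latency, while the narrow L wires only win for short, critical transfers.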
Outline • Overview • Design Space Exploration • Heterogeneous Interconnects • Employing L wires for performance • PW wires: The power optimizers • Evaluation • Results • Conclusion
L1 Cache pipeline (baseline)
• Effective-address transfer from the LSQ to the L1 D-cache: 10 cycles
• Cache access: 5 cycles
• Memory-dependence resolution: 5 cycles
• Data returns at cycle 20
Exploiting L-Wires
• 8-bit LSB transfer from the LSQ on L wires: 5 cycles
• Partial memory-dependence resolution: 3 cycles
• Full effective-address transfer: 10 cycles; cache access: 5 cycles
• Data returns at cycle 14
L wires: Accelerating cache access
• Transmit the LSBs of the effective address on L wires
  • Partial comparison of loads and stores in the LSQ
  • Faster memory disambiguation
  • Introduces some false dependences (< 9%)
• Early indexing of the data and tag RAM arrays
  • The LSBs can prefetch data out of the L1$
  • Reduces load access latency
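The partial comparison above can be sketched as follows: only the low-order address bits, which arrive early on L wires, are compared. Unequal LSBs prove the accesses are independent; equal LSBs only flag a possible conflict, which is where the < 9% false dependences come from:

```python
LSB_BITS = 8                  # low-order address bits sent early on L wires
MASK = (1 << LSB_BITS) - 1

def may_alias(load_addr: int, store_addr: int) -> bool:
    """Partial disambiguation on low-order bits only.

    False  -> provably independent: the load may bypass the store.
    True   -> possible conflict: the load must wait (may be a false dependence).
    """
    return (load_addr & MASK) == (store_addr & MASK)

# Different low bits: the load safely proceeds before the store resolves.
assert not may_alias(0x1234, 0x1278)
# Same low bits but different full addresses: a false dependence --
# the load stalls even though the accesses never actually conflict.
assert may_alias(0x1234, 0x5634)
```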
L wires: Narrow bit-width operands
• Transfer 10-bit integers on L wires
  • Early scheduling of wakeup operations
  • Reduces the branch-mispredict penalty
• A predictor table of 8K two-bit counters
  • Identifies 95% of all narrow bit-width results
  • With 98% accuracy
• Similar narrow-operand handling is implemented in the PowerPC!
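Detecting a narrow result amounts to checking that the value is representable in its low-order bits; a sketch using the 10-bit threshold above (in the design, a two-bit-counter predictor learns this property per instruction, and a check like this one verifies the prediction):

```python
NARROW_BITS = 10  # results this narrow fit on the 18 L wires with room to spare

def is_narrow(value: int, bits: int = NARROW_BITS) -> bool:
    """True if value fits in a signed `bits`-bit integer,
    i.e. it sign-extends correctly from its low `bits` bits."""
    lo = -(1 << (bits - 1))        # -512 for 10 bits
    hi = (1 << (bits - 1)) - 1     #  511 for 10 bits
    return lo <= value <= hi

assert is_narrow(511) and is_narrow(-512)   # extremes of the 10-bit range
assert not is_narrow(512)                   # needs 11 bits: use B wires
```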
PW wires: Power/bandwidth efficient
• Idea: steer non-critical data through the energy-efficient PW interconnect
  • Operand transfers at instruction dispatch
  • Input operands sent to remote register files
    • Latency hidden by the long dispatch-to-issue delay
  • Store data
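The steering idea reduces to a small decision rule: latency-critical narrow traffic takes the L wires, non-critical traffic takes the PW wires, and everything else defaults to B wires. A sketch, where the criticality input and the exact thresholds are assumptions for illustration:

```python
L_WIDTH = 18  # bits available on the fast L wires (from the slides)

def steer(bits: int, critical: bool) -> str:
    """Pick a wire class for a transfer (simplified decision rule).

    - narrow and critical -> fast L wires
    - non-critical        -> low-power PW wires (e.g. dispatch-time operand
                             transfers hidden by the dispatch-to-issue delay,
                             or store data)
    - otherwise           -> baseline B wires
    """
    if critical and bits <= L_WIDTH:
        return "L"
    if not critical:
        return "PW"
    return "B"

assert steer(8, critical=True) == "L"      # LSBs of a load address
assert steer(64, critical=False) == "PW"   # store data, not latency-critical
assert steer(64, critical=True) == "B"     # wide, critical operand
```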
Evaluation methodology
• A dynamically scheduled clustered processor with 4 clusters, modeled in Simplescalar-3.0
• Crossbar interconnect
• Centralized front end, I-cache, D-cache, LSQ, and branch predictor
• Crossbar latencies:
  • B wires: 2 cycles
  • L wires: 1 cycle
  • PW wires: 3 cycles
Evaluation methodology
• A dynamically scheduled 16-cluster processor modeled in Simplescalar-3.0
• Crossbars connected by a ring interconnect
• Ring latencies:
  • B wires: 4 cycles
  • PW wires: 6 cycles
  • L wires: 2 cycles
IPC improvements: L wires
• L wires improve performance by 4% on the four-cluster system and 7.1% on the sixteen-cluster system
Conclusions
• Exposing the wire design space to the architecture
• A case for microarchitectural wire management!
• A low-latency, low-bandwidth network alone improves performance by up to 7%
• ED² improves by about 11% over a baseline processor with a homogeneous interconnect
• But entails hardware complexity
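The ED² metric weights delay quadratically, so an energy saving only pays off if the accompanying slowdown is small. A quick sanity check with illustrative numbers (not the paper's measurements):

```python
def ed2(energy: float, delay: float) -> float:
    """Energy-delay-squared product: lower is better."""
    return energy * delay ** 2

baseline = ed2(energy=1.00, delay=1.00)
# Hypothetical heterogeneous interconnect: 20% energy saving at a 5% slowdown
# (illustrative values chosen for this example, not results from the paper).
hetero = ed2(energy=0.80, delay=1.05)

improvement = 1 - hetero / baseline
assert improvement > 0.10   # the energy saving outweighs the squared slowdown
```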
Future work
• A preliminary evaluation looks promising
• A heterogeneous interconnect entails complexity
• Design of heterogeneous clusters
• Energy-efficient interconnects
Questions and Comments? Thank you!
L wires: Accelerating cache access
• TLB access for page lookup
  • Transmit a few bits of the virtual page number on L wires
  • Prefetch data out of the L1$ and TLB
• 18 L wires: 6 tag bits, 8 L1 index bits, and 4 TLB index bits
Model parameters
• Simplescalar-3.0 with separate integer and floating-point queues
• 32 KB 2-way instruction cache
• 32 KB 4-way data cache
• 128-entry 8-way I- and D-TLBs
Overview/Motivation:
• Three wire implementations are employed in this study
  • B wires: traditional
    • Optimal delay
    • High power consumption
  • L wires:
    • Faster than B wires
    • Lower bandwidth
  • PW wires:
    • Reduced power consumption
    • Higher bandwidth than B wires
    • Increased delay