
Microarchitectural Wire Management for Performance and Power in Partitioned Architectures


Presentation Transcript


  1. Processor Architecture Microarchitectural Wire Management for Performance and Power in Partitioned Architectures Rajeev Balasubramonian, Naveen Muralimanohar, Karthik Ramani, Venkatanand Venkatachalapathy University of Utah

  2. Overview/Motivation • Wire delays hamper performance • Power is incurred in moving data • 50% of dynamic power is spent in interconnect switching (Magen et al., SLIP '04) • The MIT Raw processor's on-chip network consumes 36% of total chip power (Wang et al., 2003) • An abundant number of metal layers is available

  3. Wire characteristics • Wire resistance and capacitance per unit length • Width ↑ → R ↓, C ↑ • Spacing ↑ → C ↓ • Delay ↓ (since delay ∝ RC), Bandwidth ↓
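The width/spacing trade-offs on this slide can be illustrated with a first-order RC model. This is a hedged sketch, not the authors' model: the function name and coefficients are assumptions chosen only to show the qualitative directions (wider wire → lower R but higher area capacitance; larger spacing → lower coupling capacitance).

```python
# Illustrative first-order wire model (NOT from the slides): per-unit-length
# resistance falls with width, while capacitance has an area term that grows
# with width and a coupling term that falls with spacing.
def wire_rc_delay(width, spacing, r0=1.0, c_area=1.0, c_couple=1.0):
    """Relative RC delay of a wire segment; all quantities are unitless."""
    r = r0 / width                           # wider wire -> lower resistance
    c = c_area * width + c_couple / spacing  # wider -> more area cap; farther -> less coupling cap
    return r * c

base = wire_rc_delay(width=1.0, spacing=1.0)     # minimum-pitch wire
fat = wire_rc_delay(width=2.0, spacing=2.0)      # L-wire-like: wide, well spaced
assert fat < base  # wide, well-spaced wires are faster, at the cost of bandwidth
```

Doubling both width and spacing halves the delay in this toy model, but only half as many wires fit in the same metal area, which is exactly the delay-versus-bandwidth trade the slide describes.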

  4. Design space exploration • Tuning wire width and spacing [Figure: B wires at width d versus wires at width/spacing 2d, trading bandwidth for lower resistance and capacitance]

  5. Transmission Lines • Similar to L wires - extremely low delay • Constraining implementation requirements! • Large width • Large spacing between wires • Design of sensing circuits • Implemented in test CMOS chips

  6. Design space exploration • Tuning repeater size and spacing • Traditional wires: large repeaters at optimum spacing, minimizing delay • Power-optimal wires: smaller repeaters with increased spacing, trading delay for power
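The repeater trade-off above can be sketched numerically. This is an illustrative model under assumed coefficients, not the authors' circuit analysis: delay has a wire term that shrinks with repeater size and a repeater term that grows with it, while switching power scales with repeater size.

```python
# Illustrative repeater-sizing model (assumed coefficients, NOT the authors'):
# bigger repeaters drive the wire faster but burn more switching power.
def repeatered_wire(repeater_size):
    """Return (relative delay, relative power) for one repeated wire segment."""
    delay = 1.0 / repeater_size + 0.2 * repeater_size  # wire term shrinks, repeater term grows
    power = repeater_size                              # switching power scales with repeater size
    return delay, power

d_fast, p_fast = repeatered_wire(2.0)  # large repeaters: traditional, delay-optimized
d_low, p_low = repeatered_wire(0.5)    # small repeaters: power-optimized
assert d_low > d_fast and p_low < p_fast  # power-optimal wires are slower but cheaper
```

This is the knob the talk exploits: the same metal can be provisioned as delay-optimized B wires or as slower, power-optimized PW wires.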

  7. Design space exploration • Delay-optimized B wires • Bandwidth-optimized W wires • Power-optimized P wires • Power- and bandwidth-optimized PW wires • Fast, low-bandwidth L wires

  8. Heterogeneous Interconnects • Inter-cluster global interconnect: • 72 B wires – repeaters sized and spaced for optimum delay • 18 L wires – wide wires with large spacing; occupy more area; low latencies • 144 PW wires – poor delay, high bandwidth, low power

  9. Outline • Overview • Design Space Exploration • Heterogeneous Interconnects • Employing L wires for performance • PW wires: The power optimizers • Evaluation • Results • Conclusion

  10. L1 cache pipeline [Diagram: baseline LSQ-to-L1 D-cache path — effective address transfer 10c, cache access 5c, memory dependence resolution 5c; data returns at cycle 20]

  11. Exploiting L-wires [Diagram: an 8-bit partial address reaches the L1 D-cache in 5c on L wires, enabling partial memory dependence resolution in 3c; full effective address transfer still takes 10c and cache access 5c; data returns at cycle 14]

  12. L wires: Accelerating cache access • Transmit the LSBs of the effective address on L wires • Enables partial comparison of loads and stores in the LSQ • Faster memory disambiguation • Introduces false dependences (< 9%) • Early indexing of the data and tag RAM arrays • The LSBs can prefetch data out of the L1$ • Reduces load access latency
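The partial LSQ comparison above can be sketched as follows. This is a minimal model, not the hardware: the 8-bit field width matches the transfer on slide 11, and the function name is an assumption. Comparing only the low bits that arrived early can prove two accesses independent, but a match is only conservative, which is where the slide's false dependences come from.

```python
LSB_BITS = 8                 # low-order address bits sent early on L wires (slide 11)
MASK = (1 << LSB_BITS) - 1

def may_alias(load_addr, store_addr):
    """Partial disambiguation on the early-arriving low bits.

    A mismatch proves the accesses are independent; a match must be treated
    as a dependence, even though it may be false (upper bits could differ).
    """
    return (load_addr & MASK) == (store_addr & MASK)

assert not may_alias(0x1234, 0x1235)  # low bytes differ -> provably independent
assert may_alias(0x1234, 0x5634)      # low bytes match -> conservatively dependent (possibly falsely)
```

The second assertion shows a false dependence: the full addresses differ, but the 8-bit comparison cannot tell, so the load must wait.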

  13. L wires: Narrow bit-width operands • Transfer 10-bit integers on L wires • Schedule wake-up operations early • Reduces branch mispredict penalty • A predictor table of 8K two-bit counters • Identifies 95% of all narrow bit-width results • Accuracy of 98% • A similar narrow-operand technique was implemented in the PowerPC
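Deciding whether a result qualifies for the 10-bit L-wire transfer reduces to a range check. This sketch assumes two's-complement signed operands; the function name and the choice of signed (rather than unsigned) encoding are assumptions, since the slide states only the 10-bit width.

```python
NARROW_BITS = 10  # operand width sent over L wires (from the slide)

def is_narrow(value, bits=NARROW_BITS):
    """True if a signed integer fits in `bits` bits of two's complement."""
    lo = -(1 << (bits - 1))        # -512 for 10 bits
    hi = (1 << (bits - 1)) - 1     # +511 for 10 bits
    return lo <= value <= hi

assert is_narrow(511) and is_narrow(-512)   # extremes of the 10-bit signed range
assert not is_narrow(512)                   # one past the range -> full-width transfer
```

In hardware this check is just a test that the upper bits are all equal to the sign bit; the predictor table on the slide guesses the outcome before the result is even computed.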

  14. PW wires: Power/Bandwidth efficient • Idea: steer non-critical data through the energy-efficient PW interconnect • Transfer of data at instruction dispatch • Transfer of input operands to remote register files • Covered by the long dispatch-to-issue latency • Store data
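The steering idea above amounts to a small classification policy. This is a hypothetical sketch: the function and the string labels are illustrative names, not the authors' mechanism; the set of non-critical transfer kinds is taken directly from the slide's bullets.

```python
# Hypothetical criticality-based steering policy (names are illustrative):
# latency-tolerant traffic rides the power-efficient PW wires, everything
# else takes the delay-optimized B wires.
NON_CRITICAL = {
    "dispatch_operand",       # data moved at instruction dispatch
    "remote_input_operand",   # operands to a remote register file (covered by dispatch-to-issue delay)
    "store_data",             # store data, not on the load critical path
}

def pick_wire(transfer_kind):
    """Return which wire class a transfer should be steered onto."""
    return "PW" if transfer_kind in NON_CRITICAL else "B"

assert pick_wire("store_data") == "PW"   # non-critical -> power-efficient wires
assert pick_wire("load_result") == "B"   # critical -> delay-optimized wires
```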

  15. Evaluation methodology • A dynamically scheduled clustered processor with 4 clusters, modeled in Simplescalar-3.0 • Crossbar interconnect • Centralized front-end: I-cache & D-cache, LSQ, branch predictor • Link latencies to the L1 D-cache and clusters: B wires 2 cycles, L wires 1 cycle, PW wires 3 cycles

  16. Evaluation methodology • A dynamically scheduled 16-cluster processor modeled in Simplescalar-3.0 • Clusters connected by crossbars and a ring interconnect to the I-cache, D-cache, and LSQ • Ring latencies: B wires 4 cycles, PW wires 6 cycles, L wires 2 cycles

  17. IPC improvements: L wires L wires improve performance by 4% on the four-cluster system and 7.1% on the sixteen-cluster system

  18. Four cluster system: ED2 gains

  19. Sixteen Cluster system: ED2 gains

  20. Conclusions • Exposing the wire design space to the architecture • A case for microarchitectural wire management • A low-latency, low-bandwidth network alone improves performance by up to 7% • ED2 improvements of about 11% relative to a baseline processor with a homogeneous interconnect • Entails additional hardware complexity

  21. Future work • A preliminary evaluation looks promising • Heterogeneous interconnect entails complexity • Design of heterogeneous clusters • Energy efficient interconnect

  22. Questions and Comments? Thank you!

  23. Backup

  24. L wires: Accelerating cache access • TLB access for page lookup • Transmit a few bits of the virtual page number on L wires • Prefetch data out of the L1$ and TLB • 18 L wires (6 tag bits, 8 L1 index bits, and 4 TLB index bits)
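The 18-bit early transfer above can be sketched as a pack/unpack of three fields. The field widths (6 tag, 8 L1 index, 4 TLB index) come from the slide, but the ordering of the fields within the 18-bit word is an assumption made purely for illustration.

```python
# Hypothetical packing of the 18 early bits on L wires. Field widths are from
# the slide; the layout [tlb_index:4][tag:6][l1_index:8] is ASSUMED.
L1_INDEX_BITS, TAG_BITS, TLB_INDEX_BITS = 8, 6, 4

def pack_early_bits(tag, l1_index, tlb_index):
    """Pack the three early-arriving fields into one 18-bit word."""
    assert tag < (1 << TAG_BITS)
    assert l1_index < (1 << L1_INDEX_BITS)
    assert tlb_index < (1 << TLB_INDEX_BITS)
    return (tlb_index << (TAG_BITS + L1_INDEX_BITS)) | (tag << L1_INDEX_BITS) | l1_index

def unpack_early_bits(word):
    """Recover (tag, l1_index, tlb_index) at the receiving cache/TLB."""
    l1_index = word & ((1 << L1_INDEX_BITS) - 1)
    tag = (word >> L1_INDEX_BITS) & ((1 << TAG_BITS) - 1)
    tlb_index = word >> (TAG_BITS + L1_INDEX_BITS)
    return tag, l1_index, tlb_index

# Round trip: what the sender packs, the receiver recovers exactly.
assert unpack_early_bits(pack_early_bits(0x2A, 0xC3, 0x9)) == (0x2A, 0xC3, 0x9)
```

On the receiving side, the 8 index bits start the L1 data/tag array access and the 4 TLB index bits start the page lookup before the full address arrives on the slower B wires.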

  25. Model parameters • Simplescalar-3.0 with separate integer and floating-point queues • 32 KB 2-way instruction cache • 32 KB 4-way data cache • 128-entry 8-way I- and D-TLBs

  26. Overview/Motivation: • Three wire implementations are employed in this study • B wires: traditional • Optimized for delay • High power consumption • L wires: • Faster than B wires • Lower bandwidth • PW wires: • Reduced power consumption • Higher bandwidth than B wires • Increased delay through the wires
